Empirical Methods for AI & CS
Paul Cohen, Ian P. Gent, Toby Walsh
Overview
Introduction
  What are empirical methods?
  Why use them?
Case Study
  Eight Basic Lessons
Experiment design
Data analysis
How not to do it
Supplementary material
Resources
Web
  www.cs.york.ac.uk/~tw/empirical.html
  www.cs.amherst.edu/~dsj/methday.html
Books
  “Empirical Methods for AI”, Paul Cohen, MIT Press, 1995
Journals
  Journal of Experimental Algorithmics, www.jea.acm.org
Conferences
  Workshop on Empirical Methods in AI (last Saturday, ECAI-02?)
  Workshop on Algorithm Engineering and Experiments, ALENEX 01 (alongside SODA)
Empirical Methods for CS
Part I : Introduction
What does “empirical” mean?
Relying on observations, data, experiments
Empirical work should complement theoretical work
  Theories often have holes (e.g., How big is the constant term? Is the current problem a “bad” one?)
  Theories are suggested by observations
  Theories are tested by observations
  Conversely, theories direct our empirical attention
In addition (in this tutorial at least) empirical means “wanting to understand behavior of complex systems”
Why We Need Empirical Methods
Cohen, 1990 Survey of 150 AAAI Papers
Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.
Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.
Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.
Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support.
The essential synergy between theory and empirical work was missing.
Theory, not Theorems
Theory based science need not be all theorems
  otherwise science would be mathematics
Consider theory of QED
  based on a model of behaviour of particles
  predictions accurate to many decimal places (9?)
  most accurate theory in the whole of science?
  success derived from accuracy of predictions
  not the depth or difficulty or beauty of theorems
QED is an empirical theory!
Empirical CS/AI
Computer programs are formal objects
  so let’s reason about them entirely formally?
Two reasons why we can’t or won’t:
  theorems are hard
  some questions are empirical in nature
    e.g. are Horn clauses adequate to represent the sort of knowledge met in practice?
    e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?
Empirical CS/AI
Treat computer programs as natural objects
  like fundamental particles, chemicals, living organisms
Build (approximate) theories about them
  construct hypotheses
    e.g. greedy hill-climbing is important to GSAT
  test with empirical experiments
    e.g. compare GSAT with other types of hill-climbing
  refine hypotheses and modelling assumptions
    e.g. greediness not important, but hill-climbing is!
Empirical CS/AI
Many advantages over other sciences
Cost
  no need for expensive super-colliders
Control
  unlike the real world, we often have complete command of the experiment
Reproducibility
  in theory, computers are entirely deterministic
Ethics
  no ethics panels needed before you run experiments
Types of hypothesis
My search program is better than yours
  not very helpful beauty competition?
Search cost grows exponentially with number of variables for this kind of problem
  better as we can extrapolate to data not yet seen?
Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems
  even better as we can extrapolate to new situations?
A typical conference conversation
What are you up to these days?
I’m running an experiment to compare the Davis-Putnam algorithm with GSAT.
Why?
I want to know which is faster
Why?
Lots of people use each of these algorithms
How will these people use your result?...
Keep in mind the BIG picture
What are you up to these days?
I’m running an experiment to compare the Davis-Putnam algorithm with GSAT.
Why?
I have this hypothesis that neither will dominate
What use is this?
A portfolio containing both algorithms will be more robust than either algorithm on its own
Keep in mind the BIG picture
...
Why are you doing this?
Because many real problems are intractable in theory but need to be solved in practice.
How does your experiment help?
It helps us understand the difference between average and worst case results
So why is this interesting?
Intractability is one of the BIG open questions in CS!
Why is empirical CS/AI in vogue?
Inadequacies of theoretical analysis
  problems often aren’t as hard in practice as theory predicts in the worst-case
  average-case analysis is very hard (and often based on questionable assumptions)
Some “spectacular” successes
  phase transition behaviour
  local search methods
  theory lagging behind algorithm design
Why is empirical CS/AI in vogue?
Compute power ever increasing
  even “intractable” problems coming into range
  easy to perform large (and sometimes meaningful) experiments
Empirical CS/AI perceived to be “easier” than theoretical CS/AI
  often a false perception as experiments easier to mess up than proofs
Empirical Methods for CS
Part II: A Case Study
Eight Basic Lessons
Rosenberg study
“An Empirical Study of Dynamic Scheduling on Rings of Processors”
Gregory, Gao, Rosenberg & Cohen
Proc. of 8th IEEE Symp. on Parallel & Distributed Processing, 1996
Linked to from
www.cs.york.ac.uk/~tw/empirical.html
Problem domain
Scheduling processors on ring network
  jobs spawned as binary trees
KOSO
  keep one, send one to my left or right arbitrarily
KOSO*
  keep one, send one to my least heavily loaded neighbour
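The two policies differ only in how a processor picks the neighbour that receives the spare job. A minimal sketch of that choice, assuming processors are numbered 0 to p-1 around the ring and that each processor can read its neighbours' queue lengths; the function names, the `queue_lengths` list, and the tie-break towards the left neighbour are illustrative assumptions, not details from the study:

```python
import random

def koso_target(i, p, rng):
    """KOSO: send the spare job to the left or right neighbour, chosen arbitrarily."""
    return (i + rng.choice((-1, 1))) % p

def koso_star_target(i, queue_lengths):
    """KOSO*: send the spare job to the less heavily loaded neighbour.

    Ties go to the left neighbour (an arbitrary choice made for this sketch).
    """
    p = len(queue_lengths)
    left, right = (i - 1) % p, (i + 1) % p
    return left if queue_lengths[left] <= queue_lengths[right] else right

rng = random.Random(0)
loads = [5, 9, 1, 2]                 # hypothetical queue lengths on a 4-processor ring
print(koso_target(0, len(loads), rng))   # either neighbour of processor 0
print(koso_star_target(0, loads))        # the lighter of processor 0's two neighbours
```

With these loads, processor 0's neighbours are processors 3 and 1, so KOSO* picks processor 3, while KOSO may pick either.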
Theory
On complete binary trees, KOSO is asymptotically optimal
So KOSO* can’t be any better?
But assumptions unrealistic
  tree not complete
  asymptotically not necessarily the same as in practice!
Thm: Using KOSO on a ring of p processors, a binary tree of height n is executed within (2^n-1)/p + low order terms
Benefits of an empirical study
More realistic trees
  probabilistic generator that makes shallow trees, which are “bushy” near root but quickly get “scrawny”
  similar to trees generated when performing Trapezoid or Simpson’s Rule calculations
    binary trees correspond to interval bisection
Startup costs
  network must be loaded
Lesson 1: Evaluation begins with claims
Lesson 2: Demonstration is good, understanding better
Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better
  The “because phrase” indicates a hypothesis about why it works. This is a better hypothesis than the beauty contest demonstration that KOSO* beats KOSO
Experiment design
  Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn)
  Dependent variable: time to complete jobs
Criticism 1: This experiment design includes no direct measure of the hypothesized effect
Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better
But experiment design includes no direct measure of load balancing:
  Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn)
  Dependent variable: time to complete jobs
Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations
T-test on time to complete jobs: t = (2825-2935)/587 = -.19
KOSO* apparently no faster than KOSO (as theory predicted)
Why? Look more closely at the data:
Outliers create excessive variance, so test isn’t significant
[Figure: two histograms of time to complete jobs, one for KOSO and one for KOSO*; horizontal axis marked at 10000 and 20000]
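To see how a few outliers can swamp a real difference, the sketch below first reproduces the slide's arithmetic and then computes a two-sample t statistic on small made-up samples, with and without one extreme run in each. The sample values and the helper `two_sample_t` are hypothetical, chosen only to illustrate the effect; they are not the study's data.

```python
import math
import statistics

def two_sample_t(a, b):
    """Difference in means divided by its estimated standard error (unpooled variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# The slide's own figures: difference in mean run time over the denominator 587.
t_slide = (2825 - 2935) / 587

# Hypothetical run times: the same shift of 2 units, then one extreme run added to each sample.
clean_a, clean_b = [10, 11, 12, 13, 14], [12, 13, 14, 15, 16]
t_clean = two_sample_t(clean_a, clean_b)
t_outliers = two_sample_t(clean_a + [100], clean_b + [100])

print(round(t_slide, 2), round(t_clean, 2), round(t_outliers, 2))
```

The shift between the two samples is unchanged, but the added variance shrinks the statistic towards zero.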
Lesson 4: The task of empirical work is to explain variability
[Diagram: run-time, with contributing factors Algorithm (KOSO/KOSO*), Number of processors, Number of jobs, and “random noise” (e.g., outliers)]
Number of processors and number of jobs explain 74% of the variance in run time. Algorithm explains almost none.
Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.
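One simple way to assign variability to a factor is the ratio of between-group to total sum of squares (often called eta squared). The sketch below is an illustration under that assumption; the runs are invented, and the slides do not say which statistic produced the 74% figure.

```python
from collections import defaultdict
from statistics import mean

def variance_explained(rows, factor):
    """Fraction of the total sum of squares in run time accounted for by one factor."""
    times = [r["time"] for r in rows]
    grand = mean(times)
    ss_total = sum((t - grand) ** 2 for t in times)
    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r["time"])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

# Hypothetical runs, invented for illustration only.
runs = [
    {"algorithm": "KOSO", "processors": 3, "time": 100},
    {"algorithm": "KOSO*", "processors": 3, "time": 98},
    {"algorithm": "KOSO", "processors": 10, "time": 40},
    {"algorithm": "KOSO*", "processors": 10, "time": 42},
    {"algorithm": "KOSO", "processors": 20, "time": 20},
    {"algorithm": "KOSO*", "processors": 20, "time": 18},
]

print(variance_explained(runs, "processors"))  # close to 1 for this toy data
print(variance_explained(runs, "algorithm"))   # close to 0 for this toy data
```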
Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations
Why does the KOSO/KOSO* choice account for so little of the variance in run time?
Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)
[Figure: two panels, KOSO and KOSO*, each labelled “Queue length at processor i”]
Lesson 5: Of sample variance, effect size, and sample size – control the first before touching the last
$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}}$$
$\bar{x} - \mu$: magnitude of effect
$s$: background variance
$N$: sample size
This intimate relationship holds for all statistics
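A small numerical sketch of why variance should be controlled before sample size is increased, assuming the one-sample form t = effect / (s / sqrt(N)). The effect size, s, and N values below are arbitrary illustrative numbers.

```python
import math

def t_statistic(effect, s, n):
    """t = effect / (s / sqrt(n)), where effect is the difference in means."""
    return effect / (s / math.sqrt(n))

base = t_statistic(effect=0.2, s=1.0, n=25)
half_noise = t_statistic(effect=0.2, s=0.5, n=25)     # halve the background variation
quad_sample = t_statistic(effect=0.2, s=1.0, n=100)   # quadruple the sample size

print(base, half_noise, quad_sample)
```

Halving s doubles t, whereas four times as many samples are needed to get the same gain from N alone.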
Lesson 5 illustrated: A variance reduction method
Let N = num-jobs, P = num-processors, T = run time
Then T = k (N / P), or k multiples of the theoretical best time
And k = 1 / (N / (P T)) = P T / N
[Figure: histograms of k(KOSO) and k(KOSO*); horizontal axis ticks at 2, 3, 4, 5]
$$t = \frac{1.61 - 1.4}{.08} = 2.42, \quad p < .02$$
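The recoding can be sketched as follows. Dividing each run time by N / P removes the variation that is due purely to problem size and ring size, which is what makes the remaining difference between algorithms easier to detect. The (N, P, T) triples are hypothetical, not taken from the study.

```python
from statistics import mean, stdev

def multiples_of_optimal(run_time, num_jobs, num_processors):
    """k = T / (N / P): run time as a multiple of the theoretical best time N / P."""
    return run_time / (num_jobs / num_processors)

def coefficient_of_variation(xs):
    """Standard deviation relative to the mean, a scale-free measure of spread."""
    return stdev(xs) / mean(xs)

# Hypothetical (N, P, T) triples, invented for illustration.
runs = [(1000, 5, 300), (1000, 10, 160), (200, 5, 56), (200, 10, 30)]
raw_times = [t for _, _, t in runs]
ks = [multiples_of_optimal(t, n, p) for n, p, t in runs]

print(coefficient_of_variation(raw_times))  # large: dominated by N and P
print(coefficient_of_variation(ks))         # small: N and P have been factored out
```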
Where are we?
KOSO* is significantly better than KOSO when the dependent variable is recoded as percentage of optimal run time
The difference between KOSO* and KOSO explains very little of the variance in either dependent variable
Exploratory data analysis tells us that processors aren’t starving so we shouldn’t be surprised
Prediction: The effect of algorithm on run time (or k) increases as the number of jobs decreases or the number of processors increases
This prediction is about interactions between factors
Lesson 6: Most interesting science is about interaction effects, not simple main effects
Data confirm prediction
  KOSO* is superior on larger rings where starvation is an issue
Interaction of independent variables
  choice of algorithm
  number of processors
Interaction effects are essential to explaining how things work
[Figure: multiples of optimal run-time (vertical axis, 1 to 3) against number of processors (3, 6, 10, 20), one curve each for KOSO and KOSO*]
Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful?
KOSO* is significantly better than KOSO, but can you use the result?
Suppose you wanted to use the knowledge that the ring is controlled by KOSO or KOSO* for some prediction.
  Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
  Pr(trial i under KOSO has k > 1.11) = 0.57
  Pr(trial i under KOSO* has k > 1.11) = 0.43
Predict for trial i whether its k is above or below the median:
  If it’s a KOSO* trial you’ll say no with (.43 * 150) = 64.5 errors
  If it’s a KOSO trial you’ll say yes with ((1 - .57) * 160) = 68.8 errors
  If you don’t know you’ll make (.5 * 310) = 155 errors
155 - (64.5 + 68.8) ≈ 22
Knowing the algorithm reduces error rate from .5 to .43. Is this enough???
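The arithmetic on this slide can be checked directly. The sketch uses only the counts and probabilities quoted above; the variable names are illustrative.

```python
n_koso_star, n_koso = 150, 160                 # trial counts used on the slide
p_above_koso_star, p_above_koso = 0.43, 0.57   # Pr(k > grand median) per algorithm

errors_koso_star = p_above_koso_star * n_koso_star   # always predict "below" for KOSO*
errors_koso = (1 - p_above_koso) * n_koso            # always predict "above" for KOSO
errors_blind = 0.5 * (n_koso_star + n_koso)          # no knowledge of the algorithm

errors_informed = errors_koso_star + errors_koso
error_rate_informed = errors_informed / (n_koso_star + n_koso)
errors_saved = errors_blind - errors_informed

print(round(errors_informed, 1), round(error_rate_informed, 2), round(errors_saved))
```

Knowing the algorithm avoids about 22 of 155 expected errors, which is the modest gain the slide is questioning.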
Lesson 8: Keep the big picture in mind
Why are you studying this?
Load balancing is important to get good performance out of parallel computers
Why is this important?
Parallel computing promises to tackle many of our computational bottlenecks
How do we know this?
It’s in the first paragraph of the paper!
Case study: conclusions
Evaluation begins with claims
Demonstrations of simple main effects are good, understanding the effects is better
Exploratory data analysis means using your eyes to find explanatory patterns in data
The task of empirical work is to explain variability
Control variability before increasing sample size
Interaction effects are essential to explanations
Significant ≠ meaningful
Keep the big picture in mind
Empirical Methods for CS
Part III : Experiment design
Experimental Life Cycle
Exploration
Hypothesis construction
Experiment
Data analysis
Drawing of conclusions
Checklist for experiment design*
Consider the experimental procedure
  making it explicit helps to identify spurious effects and sampling biases
Consider a sample data table
  identifies what results need to be collected
  clarifies dependent and independent variables
  shows whether data pertain to hypothesis
Consider an example of the data analysis
  helps you to avoid collecting too little or too much data
  especially important when looking for interactions
*From Chapter 3, “Empirical Methods for Artificial Intelligence”, Paul Cohen, MIT Press
Guidelines for experiment design
Consider possible results and their interpretation
  may show that experiment cannot support/refute hypotheses under test
  unforeseen outcomes may suggest new hypotheses
What was the question again?
  easy to get carried away designing an experiment and lose the BIG picture
Run a pilot experiment to calibrate parameters (e.g., number of processors in Rosenberg experiment)
Types of experiment
Manipulation experiment
Observation experiment
Factorial experiment
Manipulation experiment
Independent variable, x
  x=identity of parser, size of dictionary, …
Dependent variable, y
  y=accuracy, speed, …
Hypothesis
  x influences y
Manipulation experiment
  change x, record y
Observation experiment
Predictor, x
  x=volatility of stock prices, …
Response variable, y
  y=fund performance, …
Hypothesis
  x influences y
Observation experiment
  classify according to x, compute y
Factorial experiment
Several independent variables, xi
  there may be no simple causal links
  data may come that way
    e.g. individuals will have different sexes, ages, ...
Factorial experiment
  every possible combination of xi considered
  expensive as its name suggests!
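Enumerating a full factorial design is a Cartesian product of the factor levels, which makes the cost easy to see. In this sketch the factor names echo the case study; the processor counts are the ones on the Lesson 6 plot, while the job counts are hypothetical levels chosen for illustration.

```python
from itertools import product

factors = {
    "algorithm": ["KOSO", "KOSO*"],
    "processors": [3, 6, 10, 20],
    "jobs": [100, 1000, 10000],   # hypothetical levels
}

# One dictionary per experimental condition, covering every combination of levels.
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(conditions))   # 2 * 4 * 3 = 24 conditions
print(conditions[0])
```

Adding one more factor with four levels would multiply the number of conditions by four, before any repetition within a condition.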
Designing factorial experiments
In general, stick to 2 to 3 independent variables
Solve same set of problems in each case
  reduces variance due to differences between problem sets
If this is not possible, use same sample sizes
  simplifies statistical analysis
As usual, default hypothesis is that no influence exists
  much easier to fail to demonstrate influence than to demonstrate an influence
Some problem issues
Control
Ceiling and Floor effects
Sampling Biases
Control
A control is an experiment in which the hypothesised variation does not occur
  so the hypothesised effect should not occur either
BUT remember
  placebos cure a large percentage of patients!
Control: a cautionary tale
Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV)
  macaques gained immunity from SIV
Later, macaques given uninfected human T-cells
  and macaques still gained immunity!
Control experiment not originally done
  and not always obvious (you can’t control for all variables)
Control: MYCIN case study
MYCIN was a medical expert system
  recommended therapy for blood/meningitis infections
How to evaluate its recommendations?
Shortliffe used
  10 sample problems, 8 therapy recommenders
    5 faculty, 1 resident, 1 postdoc, 1 student
  8 impartial judges gave 1 point per problem
  max score was 80
  Mycin 65, faculty 40-60, postdoc 60, resident 45, student 30
Control: MYCIN case study
What were controls?
Control for judge’s bias for/against computers
  judges did not know who recommended each therapy
Control for easy problems
  medical student did badly, so problems not easy
Control for our standard being low
  e.g. random choice should do worse
Control for factor of interest
  e.g. hypothesis in MYCIN that “knowledge is power”
  have groups with different levels of knowledge
Ceiling and Floor Effects
Well designed experiments (with good controls) can still go wrong
What if all our algorithms do particularly well?
  Or they all do badly?
We’ve got little evidence to choose between them
Ceiling and Floor Effects
Ceiling effects arise when test problems are insufficiently challenging
  floor effects the opposite, when problems too challenging
A problem in AI because we often repeatedly use the same benchmark sets
  most benchmarks will lose their challenge eventually?
  but how do we detect this effect?
Machine learning example
14 datasets from UCI corpus of benchmarks
  used as mainstay of ML community
Problem is learning classification rules
  each item is vector of features and a classification
  measure classification accuracy of method (max 100%)
Compare C4 with 1R*, two competing algorithms
Rob Holte, Machine Learning, vol. 11, pp. 63-91, 1993
www.site.uottawa.edu/~holte/Publications/simple_rules.ps
Floor effects: machine learning example
| DataSet | BC | CH | GL | G2 | HD | HE | … | Mean |
|---|---|---|---|---|---|---|---|---|
| C4 | 72 | 99.2 | 63.2 | 74.3 | 73.6 | 81.2 | … | 85.9 |
| 1R* | 72.5 | 69.2 | 56.4 | 77 | 78 | 85.1 | … | 83.8 |

Is 1R* above the floor of performance?
How would we tell?
Floor effects: machine learning example
| DataSet | BC | CH | GL | G2 | HD | HE | … | Mean |
|---|---|---|---|---|---|---|---|---|
| C4 | 72 | 99.2 | 63.2 | 74.3 | 73.6 | 81.2 | … | 85.9 |
| 1R* | 72.5 | 69.2 | 56.4 | 77 | 78 | 85.1 | … | 83.8 |
| Baseline | 70.3 | 52.2 | 35.5 | 53.4 | 54.5 | 79.4 | … | 59.9 |

“Baseline rule” puts all items in more popular category.
1R* is above baseline on most datasets
A bit like the prime number joke?
  1 is prime. 3 is prime. 5 is prime. So, baseline rule is that all odd numbers are prime.
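The baseline rule is simple enough to state in a few lines, which is part of its value as a floor. A minimal sketch, assuming class labels are available as a plain list; the function name and the example labels are hypothetical.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy (in %) of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * most_common_count / len(labels)

# Hypothetical dataset with a 70/30 class split.
labels = ["benign"] * 7 + ["malignant"] * 3
print(baseline_accuracy(labels))
```

Any learned classifier that scores no better than this number on a dataset has not demonstrably learned anything from the features.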
Ceiling Effects: machine learning
| DataSet | BC | GL | HY | LY | MU | … | Mean |
|---|---|---|---|---|---|---|---|
| C4 | 72 | 63.2 | 99.1 | 77.5 | 100.0 | … | 85.9 |
| 1R* | 72.5 | 56.4 | 97.2 | 70.7 | 98.4 | … | 83.8 |

How do we know that C4 and 1R* are not near the ceiling of performance?
Do the datasets have enough attributes to make perfect classification?
  Obviously for MU, but what about the rest?
Ceiling Effects: machine learning
| DataSet | BC | GL | HY | LY | MU | … | Mean |
|---|---|---|---|---|---|---|---|
| C4 | 72 | 63.2 | 99.1 | 77.5 | 100.0 | … | 85.9 |
| 1R* | 72.5 | 56.4 | 97.2 | 70.7 | 98.4 | … | 83.8 |
| max(C4,1R*) | 72.5 | 63.2 | 99.1 | 77.5 | 100.0 | … | 87.4 |
| max([Buntine]) | 72.8 | 60.4 | 99.1 | 66.0 | 98.6 | … | 82.0 |

C4 achieves only about 2% better than 1R*
  Best of the C4/1R* achieves 87.4% accuracy
We have only weak evidence that C4 is better
Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
Ceiling Effects: machine learning
In fact 1R* only uses one feature (the best one)
C4 uses on average 6.6 features
5.6 features buy only about 2% improvement
Conclusion?
  Either real world learning problems are easy (use 1R*)
  Or we need more challenging datasets
  We need to be aware of ceiling effects in results
Sampling bias
Data collection is biased against certain data
  e.g. teacher who says “Girls don’t answer maths questions”
  observation might suggest:
    girls don’t answer many questions
    but that may be because the teacher doesn’t ask them many questions
Experienced AI researchers don’t do that, right?
Sampling bias: Phoenix case study
AI system to fight (simulated) forest fires
Experiments suggest that wind speed uncorrelated with time to put out fire
  obviously incorrect as high winds spread forest fires
Sampling bias: Phoenix case study
Wind Speed vs containment time (max 150 hours):
3: 120 55 79 10 140 26 15 110 12 54 10 103
6: 78 61 58 81 71 57 21 32 70
9: 62 48 21 55 101
What’s the problem?
Sampling bias: Phoenix case study
The cut-off of 150 hours introduces sampling bias
  many high-wind fires get cut off, not many low wind
On remaining data, there is no correlation between wind speed and time (r = -0.53)
In fact, data shows that:
  a lot of high wind fires take > 150 hours to contain
  those that don’t are similar to low wind fires
You wouldn’t do this, right?
  you might if you had automated data analysis.
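The correlation on the surviving trials can be recomputed from the containment times listed on the previous slide. This sketch uses only those transcribed values and a hand-written Pearson coefficient. On these values r comes out close to zero (roughly -0.04), in line with the "no correlation" reading; it need not match the figure quoted on the slide if that was computed from the full dataset.

```python
import math

# Containment times (hours) by wind speed, as listed on the previous slide.
containment = {
    3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
    6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
    9: [62, 48, 21, 55, 101],
}
xs = [wind for wind, times in containment.items() for _ in times]
ys = [t for times in containment.values() for t in times]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(xs, ys)
print(len(xs), round(r, 3))
# Fewer trials survive the cut-off as wind speed rises: 12, 9, then 5.
print({wind: len(times) for wind, times in containment.items()})
```

The shrinking group sizes are the visible trace of the bias: the fires missing from the high-wind rows are the ones that ran past the cut-off.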
Sampling biases can be subtle...
Girls do better at math than boys in random samples at all levels of education.
Is this because of their genes or because they have more siblings?
Assume gender (G) is an independent variable and number of siblings (S) is a noise variable.
If S is truly a noise variable then under random sampling, no dependency should exist between G and S in samples.
Parents have children until they get at least one boy. They don't feel the same way about girls.
In a sample of 1000 girls the number with S = 0 is smaller than in a sample of 1000 boys.
The frequency distribution of S is different for different genders. S and G are not independent.
What else might be systematically associated with G that we don't know about?
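The dependency between G and S can be reproduced with a short simulation. This sketch assumes the most extreme form of the stopping rule (every family stops at its first boy, with an arbitrary cap on family size) and equal odds for each birth; both are simplifying assumptions made here, not claims from the slides.

```python
import random

def simulate_children(num_families, rng, max_children=10):
    """Return one (sex, number_of_siblings) pair per child; families stop at the first boy."""
    children = []
    for _ in range(num_families):
        sexes = []
        while len(sexes) < max_children:
            sex = "boy" if rng.random() < 0.5 else "girl"
            sexes.append(sex)
            if sex == "boy":
                break
        siblings = len(sexes) - 1
        children.extend((child_sex, siblings) for child_sex in sexes)
    return children

rng = random.Random(1)   # fixed seed (arbitrary) so the run is repeatable
children = simulate_children(1000, rng)
girls_no_sibs = sum(1 for sex, s in children if sex == "girl" and s == 0)
boys_no_sibs = sum(1 for sex, s in children if sex == "boy" and s == 0)
print(girls_no_sibs, boys_no_sibs)
```

Under this rule no girl is an only child, while roughly half the boys are, so a random sample of children already carries a G-S dependency before anything is measured.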
Empirical Methods for CS
Part IV: Data analysis
Kinds of data analysis
Exploratory (EDA) – looking for patterns in data
Statistical inferences from sample data
  Testing hypotheses
  Estimating parameters
Building mathematical models of datasets
Machine learning, data mining…
We will introduce hypothesis testing and computer-intensive methods
The logic of hypothesis testing
Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long-run behavior?) and what is your residual uncertainty?
You say, “If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn’t fair.”
Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (≥ 8 heads) has low probability p, then reject the assertion, with residual uncertainty related to p.
Estimate p with a sampling distribution.
Probability of a sample result under a null hypothesis
If the coin were fair (the null hypothesis) what is the probability distribution of r, the number of heads, obtained in N tosses of a fair coin?
Get it analytically or estimate it by simulation (on a computer):
Loop K times
  r := 0 ;; r is num.heads in N tosses
  Loop N times ;; simulate the tosses
    Generate a random 0 ≤ x ≤ 1.0
    If x < p increment r ;; p is the probability of a head
  Push r onto sampling_distribution
Print sampling_distribution
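The loop above translates almost line for line into runnable code. In this sketch the function name, the fixed random seed, and the use of `Counter` for the tally are choices made here for illustration; K, N, and p follow the slide.

```python
import random
from collections import Counter

def sampling_distribution(k, n, p, rng):
    """Simulate k experiments of n tosses each; return the number of heads in each."""
    distribution = []
    for _ in range(k):
        # Count a head whenever a uniform draw falls below p.
        r = sum(1 for _ in range(n) if rng.random() < p)
        distribution.append(r)
    return distribution

rng = random.Random(0)   # fixed seed (arbitrary) so the run is repeatable
dist = sampling_distribution(k=1000, n=10, p=0.5, rng=rng)

counts = Counter(dist)
p_hat = sum(1 for r in dist if r >= 8) / len(dist)   # estimated Pr(r >= 8 | fair coin)
print(sorted(counts.items()))
print(p_hat)
```

The estimate varies from run to run and seed to seed; a larger K narrows that variation.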
Sampling distributions
This is the estimated sampling distribution of r under the null hypothesis that the coin is fair. The estimation is constructed by Monte Carlo sampling.
[Figure: histogram of frequency (K = 1000) against number of heads in 10 tosses (0 to 10)]
Probability of r = 8 or more heads in N = 10 tosses of a fair coin is 54 / 1000 = .054
The logic of hypothesis testing
Establish a null hypothesis: H0: the coin is fair
Establish a statistic: r, the number of heads in N tosses
Figure out the sampling distribution of r given H0
The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8
If this probability is very low, reject H0, the null hypothesis
Residual uncertainty is p
[Figure: sampling distribution of r, horizontal axis 0 to 10 heads]
67
The only tricky part is getting the sampling distribution
Sampling distributions can be derived...
Exactly, e.g., binomial probabilities for coins are given by the formula $\frac{N!}{r!\,(N-r)!}\,p^N$ (for a fair coin, where $p = 1 - p = 0.5$)
Analytically, e.g., the central limit theorem tells us that the sampling distribution of the mean approaches a Normal distribution as sample size grows to infinity
Estimated by Monte Carlo simulation of the null hypothesis process
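The exact route can be checked against the simulated value of .054 from the coin example. The sketch below uses the general binomial term, which for a fair coin reduces to the formula on this slide; the function names are illustrative assumptions.

```python
from math import comb

def binomial_pmf(r, n, p=0.5):
    """Pr(exactly r heads in n tosses) when each toss is a head with probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binomial_tail(r_min, n, p=0.5):
    """Pr(r_min or more heads in n tosses)."""
    return sum(binomial_pmf(r, n, p) for r in range(r_min, n + 1))

p_exact = binomial_tail(8, 10)
print(f"Exact Pr(r >= 8 | N = 10, fair coin) = {p_exact:.4f}")
```

For N = 10 this is (45 + 10 + 1) / 1024, about .0547, which agrees with the Monte Carlo estimate to within sampling error.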
68
A common statistical test: The Z test for different means
A sample of N = 25 computer science students has mean IQ m = 135. Are they “smarter than average”?
Population mean is 100 with standard deviation 15
The null hypothesis, H0, is that the CS students are “average”, i.e., the mean IQ of the population of CS students is 100.
What is the probability p of drawing the sample if H0 were true? If p is small, then H0 is probably false.
Find the sampling distribution of the mean of a sample of size 25, from a population with mean 100
69
Central Limit Theorem:
The sampling distribution of the mean is given by the Central Limit Theorem
The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity.
If the samples are drawn from a population with mean $\mu$ and standard deviation $\sigma$, then the mean of the sampling distribution is $\mu$ and its standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{N}$ as N increases.
These statements hold irrespective of the shape of the original distribution.
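One way to see the theorem at work is to draw many samples from a clearly non-normal population and look at the spread of their means. The sketch below assumes an exponential population with mean 1 and standard deviation 1; that population, the sample size N = 25, the 2000 repetitions, and the seed are all illustrative choices rather than anything from the tutorial.

```python
import random
import statistics
from math import sqrt

def sample_means(draw, n, k_trials, rng):
    """Return k_trials means, each computed from n values produced by draw(rng)."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(k_trials)]

rng = random.Random(0)
# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed.
means = sample_means(lambda r: r.expovariate(1.0), n=25, k_trials=2000, rng=rng)

observed_mean = statistics.fmean(means)
observed_std = statistics.pstdev(means)
predicted_std = 1.0 / sqrt(25)  # sigma / sqrt(N)

print(f"mean of sample means: {observed_mean:.3f} (population mean is 1)")
print(f"std of sample means:  {observed_std:.3f} (sigma / sqrt(N) is {predicted_std:.3f})")
```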
70
The sampling distribution for the CS student example
If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis) then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation $15 / \sqrt{25} = 3$
[Figure: sampling distribution of the mean, with 100 and 135 marked on the horizontal axis]
The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution
Only ~5% of a normal distribution falls more than two standard deviations away from the mean
If the students were average, this would have a roughly zero chance of happening.
71
The Z test
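[Figure: two panels. Left: sampling distribution with mean 100 and std = 3, sample statistic at 135. Right: the recoded distribution with mean 0 and std = 1.0, test statistic at 11.67]
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} = \frac{135 - 100}{15 / \sqrt{25}} = \frac{35}{3} = 11.67$$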
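The arithmetic behind the test statistic is short enough to script. A minimal sketch, with a hypothetical helper name, using the numbers from the CS student example (sample mean 135, population mean 100, population standard deviation 15, N = 25):

```python
from math import sqrt

def z_statistic(sample_mean, pop_mean, pop_std, n):
    """Distance of the sample mean from the population mean, in standard errors."""
    standard_error = pop_std / sqrt(n)  # std of the sampling distribution of the mean
    return (sample_mean - pop_mean) / standard_error

z = z_statistic(sample_mean=135, pop_mean=100, pop_std=15, n=25)
print(f"Z = {z:.2f}")
```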
72
Reject the null hypothesis?
Commonly we reject H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.
A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of the sample statistic given H0.
We find the probabilities by looking them up in tables or by using a statistics package.
For example, Pr(Z ≥ 1.645) ≈ .05; Pr(Z ≥ 2.33) ≈ .01.
Pr(Z ≥ 11) is approximately zero, reject H0.
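In place of a printed table, the one-tailed probability of a standard normal test statistic can be computed from the complementary error function in Python's standard library. This is a sketch with an assumed helper name; it evaluates to roughly .05 at Z = 1.645, .025 at Z = 1.96, and .01 at Z = 2.33.

```python
from math import erfc, sqrt

def upper_tail(z):
    """Pr(Z >= z) for a standard normal variable Z."""
    # erfc keeps precision far out in the tail, where 1 - cdf(z) would round to 0.
    return 0.5 * erfc(z / sqrt(2))

for z in (1.645, 1.96, 2.33, 11.67):
    print(f"Pr(Z >= {z}) = {upper_tail(z):.3g}")
```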
73
The t test
Same logic as the Z test, but appropriate when population standard deviation is unknown, samples are small, etc.
Sampling distribution is t, not normal, but approaches normal as sample size increases
Test statistic has very similar form but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal
74
The t test
Suppose N = 5 students have mean IQ = 135, std = 27
Estimate the standard deviation of sampling distribution using the sample standard deviation
[Figure: two panels. Left: sampling distribution with mean 100 and std = 12.1, sample statistic at 135. Right: the recoded distribution with mean 0 and std = 1.0, test statistic at 2.89]
$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}} = \frac{135 - 100}{27 / \sqrt{5}} = \frac{35}{12.1} = 2.89$$
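The same calculation can be scripted, including the one-tailed probability that would otherwise come from a t table. The sketch below stays within the standard library by integrating the t density numerically with Simpson's rule; the helper names and the step count are assumptions, and a statistics package would normally supply this probability directly.

```python
from math import gamma, pi, sqrt

def t_statistic(sample_mean, pop_mean, sample_std, n):
    """Distance of the sample mean from pop_mean, in estimated standard errors."""
    return (sample_mean - pop_mean) / (sample_std / sqrt(n))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=10_000):
    """Pr(T >= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3  # half the mass lies above 0; subtract the part in [0, t]

t = t_statistic(sample_mean=135, pop_mean=100, sample_std=27, n=5)
p = t_upper_tail(t, df=5 - 1)
print(f"t = {t:.2f}, one-tailed p = {p:.3f}")
```

With N = 5 there are 4 degrees of freedom, and the one-tailed probability comes out near .022.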
75
Summary of hypothesis testing
H0 negates what you want to demonstrate; find probability p of sample statistic under H0 by comparing test statistic to sampling distribution; if probability is low, reject H0 with residual uncertainty proportional to p.
Example: Want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p ≤ .022
Have we proved CS students are smarter? NO!
We have only shown that mean = 135 is unlikely if they aren’t. We never prove what we want to demonstrate, we only reject H0, with residual uncertainty.
And failing to reject H0 does not prove H0, either!
76
Common tests
Tests that means are equal
Tests that samples are uncorrelated or independent
Tests that slopes of lines are equal
Tests that predictors in rules have predictive power
Tests that frequency distributions (how often events happen) are equal
Tests that classification variables such as smoking history and heart disease history are unrelated
...
All follow the same basic logic
77
Computer-intensive Methods
Basic idea: Construct sampling distributions by simulating on a computer the process of drawing samples.
Three main methods: Monte Carlo simulation, when one knows the population parameters; the bootstrap, when one doesn’t; and randomization, which also assumes nothing about the population.
Enormous advantage: Works for any statistic and makes no strong parametric assumptions (e.g., normality)
78
Another Monte Carlo example, relevant to machine learning...
Suppose you want to buy stocks in a mutual fund; for simplicity assume there are just N = 50 funds to choose from and you’ll base your decision on the proportion of J = 30 stocks in each fund that increased in value
Suppose Pr(a stock increasing in price) = .75
You are tempted by the best of the funds, F, which reports price increases in 28 of its 30 stocks. What is the probability of this performance?
79
Simulate...
Loop K = 1000 times
  Hs := empty list ;; one value of H per fund
  Loop N = 50 times ;; N is number of funds
    H := 0 ;; stocks that increase in this fund
    Loop J = 30 times ;; J is number of stocks in this fund
      Toss a coin with bias p = .75 to decide whether this stock increases in value and if so increment H
    Push H on Hs ;; We get N values of H
  B := maximum(Hs) ;; The number of increasing stocks in the best of N funds
  Push B on a list ;; We get K values of B
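A runnable version of this simulation might look like the following. It is a sketch under stated assumptions: stocks rise independently with probability .75, the function name and seed are illustrative, and "28 of 30" is read as 28 or more.

```python
import random

def best_fund_distribution(k_trials=1000, n_funds=50, n_stocks=30, p_up=0.75, seed=0):
    """Return k_trials values of B, the number of rising stocks in the best fund."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    best_counts = []
    for _ in range(k_trials):
        # For each fund, count the stocks that rise; keep only the best fund's count.
        best = max(
            sum(1 for _ in range(n_stocks) if rng.random() < p_up)
            for _ in range(n_funds)
        )
        best_counts.append(best)
    return best_counts

best_counts = best_fund_distribution()
p_best_28 = sum(1 for b in best_counts if b >= 28) / len(best_counts)
print(f"Estimated Pr(best of 50 funds has >= 28 of 30 rising) = {p_best_28:.2f}")
```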
80
Surprise!
The probability that the best of 50 funds reports that 28 or more of its 30 stocks increased in price is roughly 0.4
Why? The probability that an arbitrary fund would report this increase is Pr(28 successes | Pr(success) = .75) ≈ .01, but the probability that the best of 50 funds would report this is much higher.
Machine learning algorithms use critical values based on arbitrary elements, when they are actually testing the best element; they think elements are more unusual than they really are. This is why ML algorithms overfit.
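The gap between an arbitrary fund and the best of 50 can also be worked out directly rather than simulated. The sketch below assumes the 50 funds are independent and reads "28 of 30" as 28 or more; the best fund reaches that mark unless every one of the 50 funds falls short.

```python
from math import comb

def binomial_tail(r_min, n, p):
    """Pr(r_min or more successes in n independent trials with success probability p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(r_min, n + 1))

p_one_fund = binomial_tail(28, 30, 0.75)       # an arbitrary fund
p_best_of_50 = 1 - (1 - p_one_fund) ** 50      # at least one of 50 funds

print(f"arbitrary fund: {p_one_fund:.4f}")
print(f"best of 50:     {p_best_of_50:.3f}")
```

The first value is about .01 and the second about .41, which is the "surprise": a critical value that is appropriate for one arbitrary element is far too lenient for the best of many.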
81
The Bootstrap
Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.
What if you don’t? Use the sample as an estimate of the population.
Draw samples from the sample! With or without replacement?
Example: Sampling distribution of the mean; check the results against the central limit theorem.
82
Bootstrapping the sampling distribution of the mean*
S is a sample of size N:
Loop K = 1000 times
  Draw a pseudosample S* of size N from S by sampling with replacement
  Calculate the mean of S* and push it on a list L
L is the bootstrapped sampling distribution of the mean**
This procedure works for any statistic, not just the mean.
* Recall we can get the sampling distribution of the mean via the central limit theorem – this example is just for illustration.
** This distribution is not a null hypothesis distribution and so is not directly used for hypothesis testing, but can easily be transformed into a null hypothesis distribution (see Cohen, 1995).
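The bootstrap loop translates almost line for line into Python. In the sketch below the ten-element sample is made up purely for illustration, and the function name and seed are assumptions; the final comparison checks the bootstrapped spread against the central limit theorem prediction, treating the sample as the population.

```python
import random
import statistics
from math import sqrt

def bootstrap_means(sample, k_trials=1000, seed=0):
    """Return k_trials means of pseudosamples drawn from sample with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(k_trials)]

sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]  # hypothetical data
boot = bootstrap_means(sample)

boot_mean = statistics.fmean(boot)
boot_std = statistics.pstdev(boot)
clt_std = statistics.pstdev(sample) / sqrt(len(sample))  # sigma / sqrt(N)

print(f"bootstrap mean of means: {boot_mean:.2f} (sample mean {statistics.fmean(sample):.2f})")
print(f"bootstrap std of means:  {boot_std:.2f} (CLT prediction {clt_std:.2f})")
```

Swapping `statistics.fmean` for another function, such as `statistics.median`, bootstraps that statistic instead.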
83
Randomization
Used to test hypotheses that involve association between elements of two or more groups; very general.
Example: Paul tosses H H H H, Carole tosses T T T T. Is outcome independent of tosser?
Example: 4 women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?
Basic procedure: Calculate a statistic f for your sample; randomize one factor relative to the other and calculate your pseudostatistic f*. Compare f to the sampling distribution for f*.
84
Example of randomization
Four women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score independent of gender?
f = difference of means of men’s and women’s scores: 29.25
Under the null hypothesis of no association between gender and score, the score 54 might equally well have been achieved by a male or a female.
Toss all scores in a hopper, draw out four at random and without replacement, call them female*, call the rest male*, and calculate f*, the difference of means of female* and male*. Repeat to get a distribution of f*. This is an estimate of the sampling distribution of f under H0: no difference between male and female scores.
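The hopper procedure can be sketched as follows, using the scores from the example. The function names, the number of repetitions, and the seed are assumptions; the returned probability is the fraction of reshuffles whose f* is at least as large as the observed f.

```python
import random

women = [54, 66, 64, 61]
men = [23, 28, 27, 31, 51, 32]

def mean_difference(group_a, group_b):
    """Mean of group_a minus mean of group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

def randomization_test(group_a, group_b, k_trials=10_000, seed=0):
    """Return (observed f, estimated Pr(f* >= f) under random relabelling)."""
    rng = random.Random(seed)
    observed = mean_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)  # toss all scores in the hopper
    n_a = len(group_a)
    extreme = 0
    for _ in range(k_trials):
        rng.shuffle(pooled)  # first n_a scores become female*, the rest male*
        if mean_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            extreme += 1
    return observed, extreme / k_trials

f_observed, p_value = randomization_test(women, men)
print(f"f = {f_observed}, estimated Pr(f* >= f) = {p_value:.4f}")
```

Because the four women hold the four highest scores, only one of the 210 possible ways to pick four scores reproduces the observed f, so the estimate should sit near 1/210, about .005.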
Empirical Methods for CS
Part V: How Not To Do It
86
Tales from the coal face
Those ignorant of history are doomed to repeat it
  we have committed many howlers
We hope to help others avoid similar ones …
… and illustrate how easy it is to screw up!
“How Not to Do It”
  I Gent, S A Grant, E. MacIntyre, P Prosser, P Shaw, B M Smith, and T Walsh
  University of Leeds Research Report, May 1997
Every howler we report was committed by at least one of the above authors!
87
How Not to Do It
Do measure with many instruments
  in exploring hard problems, we used our best algorithms
  missed very poor performance of less good algorithms
  better algorithms will be bitten by same effect on larger instances than we considered
Do measure CPU time
  in exploratory code, CPU time often misleading
  but can also be very informative
    e.g. heuristic needed more search but was faster
88
How Not to Do It
Do vary all relevant factors
Don’t change two things at once
  ascribed effects of heuristic to the algorithm
    changed heuristic and algorithm at the same time
    didn’t perform factorial experiment
  but it’s not always easy/possible to do the “right” experiments if there are many factors
89
How Not to Do It
Do Collect All Data Possible …. (within reason)
  one year Santa Claus had to repeat all our experiments
    ECAI/AAAI/IJCAI deadlines just after new year!
  we had collected number of branches in search tree
    performance scaled with backtracks, not branches
    all experiments had to be rerun
Don’t Kill Your Machines
  we have got into trouble with sysadmins
    … over experimental data we never used
  often the vital experiment is small and quick
90
How Not to Do It
Do It All Again … (or at least be able to)
  e.g. storing random seeds used in experiments
  we didn’t do that and might have lost important result
Do Be Paranoid
  “identical” implementations in C, Scheme gave different results
Do Use The Same Problems
  reproducibility is a key to science (cf. cold fusion)
  can reduce variance
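One lightweight way to be able to "do it all again" is to log the random seed next to every result. The sketch below is hypothetical throughout: `run_experiment` stands in for a real experiment by generating a random instance and reporting a trivial statistic, and the JSON record format is an assumption.

```python
import json
import random

def run_experiment(seed, n_items=20):
    """Hypothetical experiment: build a random instance from seed and measure it."""
    rng = random.Random(seed)  # private generator, so nothing else disturbs the stream
    instance = [rng.randint(1, 100) for _ in range(n_items)]
    return {"seed": seed, "instance": instance, "result": sum(instance)}

record = run_experiment(seed=12345)
log_line = json.dumps(record)  # store this line alongside the experimental results

# Later: recover the seed from the log and regenerate exactly the same instance.
replayed = run_experiment(json.loads(log_line)["seed"])
print("reproduced:", replayed == record)
```

Reusing the logged seeds also means competing algorithms can be run on the same problems, which is the variance-reduction point made above.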
91
Choosing your test data
We’ve seen the possible problem of over-fitting
  remember machine learning benchmarks?
Two common approaches
  benchmark libraries
  random problems
Both have potential pitfalls
92
Benchmark libraries
+ve
  can be based on real problems
  lots of structure
-ve
  library of fixed size
    possible to over-fit algorithms to library
  problems have fixed size
    so can’t measure scaling
93
Random problems
+ve
  problems can have any size
    so can measure scaling
  can generate any number of problems
    hard to over-fit?
-ve
  may not be representative of real problems
    lack structure
  easy to generate “flawed” problems
    CSP, QSAT, …
94
Flawed random problems
Constraint satisfaction example
  40+ papers over 5 years by many authors used Models A, B, C, and D
  all four models are “flawed” [Achlioptas et al. 1997]
    asymptotically almost all problems are trivial
    brings into doubt many experimental results
      • some experiments at typical sizes affected
      • fortunately not many
How should we generate problems in future?
95
Flawed random problems
[Gent et al. 1998] fix flaw ….
  introduce “flawless” problem generation
  defined in two equivalent ways
  though no proof that problems are truly flawless
Undergraduate student at Strathclyde found new bug
  two definitions of flawless not equivalent
Eventually settled on final definition of flawless
  gave proof of asymptotic non-triviality
  so we think that we just about understand the problem generator now
96
Prototyping your algorithm
Often need to implement an algorithm
  usually novel algorithm, or variant of existing one
    e.g. new heuristic in existing search algorithm
  novelty of algorithm should imply extra care
  more often, encourages lax implementation
    it’s only a preliminary version
97
How Not to Do It
Don’t Trust Yourself
  bug in innermost loop found by chance
  all experiments re-run with urgent deadline
  curiously, sometimes bugged version was better!
Do Preserve Your Code
  Or end up fixing the same error twice
  Do use version control!
98
How Not to Do It
Do Make it Fast Enough
  emphasis on enough
    it’s often not necessary to have optimal code
    in lifecycle of experiment, extra coding time not won back
  e.g. we have published many papers with inefficient code
    compared to state of the art
      • first GSAT version $O(N^2)$, but this really was too slow!
Do Report Important Implementation Details
  Intermediate versions produced good results
99
How Not to Do It
Do Look at the Raw Data
  Summaries obscure important aspects of behaviour
  Many statistical measures explicitly designed to minimise effect of outliers
  Sometimes outliers are vital
    “exceptionally hard problems” dominate mean
    we missed them until they hit us on the head
      when experiments “crashed” overnight
      old data on smaller problems showed clear behaviour
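A toy calculation shows how a single exceptionally hard problem can dominate the mean while leaving a robust summary untouched. The search costs below are invented purely for illustration: 99 easy instances and one very hard one.

```python
import statistics

# Hypothetical search costs (e.g. branches explored): 99 easy instances, 1 hard one.
costs = [100] * 99 + [1_000_000]

mean_cost = statistics.mean(costs)
median_cost = statistics.median(costs)
worst_cost = max(costs)

print(f"mean = {mean_cost}, median = {median_cost}, max = {worst_cost}")
```

The median reports 100 and hides the outlier completely, while the mean is about a hundred times larger and is set almost entirely by one instance. Only the raw data shows which story is true.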
100
How Not to Do It
Do face up to the consequences of your results
  e.g. preprocessing on 450 problems
    should “obviously” reduce search
    reduced search 448 times
    increased search 2 times
  Forget algorithm, it’s useless?
  Or study in detail the two exceptional cases
    and achieve new understanding of an important algorithm
Empirical Methods for CS
Part VI: Coda
102
Our objectives
Outline some of the basic issues
  exploration, experimental design, data analysis, ...
Encourage you to consider some of the pitfalls
  we have fallen into all of them!
Raise standards
  encouraging debate
  identifying “best practice”
Learn from your experiences
  experimenters get better as they get older!
103
Summary
Empirical CS and AI are exacting sciences
  There are many ways to do experiments wrong
We are experts in doing experiments badly
As you perform experiments, you’ll make many mistakes
Learn from those mistakes, and ours!
Empirical Methods for CS
Part VII : Supplement
105
Some expert advice
Bernard Moret, U. New Mexico
  “Towards a Discipline of Experimental Algorithmics”
David Johnson, AT&T Labs
  “A Theoretician’s Guide to the Experimental Analysis of Algorithms”
Both linked to from www.cs.york.ac.uk/~tw/empirical.html
106
Bernard Moret’s guidelines
Useful types of empirical results:
  accuracy/correctness of theoretical results
  real-world performance
  heuristic quality
  impact of data structures
  ...
107
Bernard Moret’s guidelines
Hallmarks of a good experimental paper
  clearly defined goals
  large scale tests
    both in number and size of instances
  mixture of problems
    real-world, random, standard benchmarks, ...
  statistical analysis of results
  reproducibility
    publicly available instances, code, data files, ...
108
Bernard Moret’s guidelines
Pitfalls for experimental papers
  simpler experiment would have given same result
  result predictable by (back of the envelope) calculation
  bad experimental setup
    e.g. insufficient sample size, no consideration of scaling, …
  poor presentation of data
    e.g. lack of statistics, discarding of outliers, ...
109
Bernard Moret’s guidelines
Ideal experimental procedure
  define clear set of objectives
    which questions are you asking?
  design experiments to meet these objectives
  collect data
    do not change experiments until all data is collected to prevent drift/bias
  analyse data
    consider new experiments in light of these results
110
David Johnson’s guidelines
3 types of paper describe the implementation of an algorithm
  application paper
    “Here’s a good algorithm for this problem”
  sales-pitch paper
    “Here’s an interesting new algorithm”
  experimental paper
    “Here’s how this algorithm behaves in practice”
These lessons apply to all 3
111
David Johnson’s guidelines
Perform “newsworthy” experiments
  standards higher than for theoretical papers!
  run experiments on real problems
    theoreticians can get away with idealized distributions but experimentalists have no excuse!
  don’t use algorithms that theory can already dismiss
  look for generality and relevance
    don’t just report algorithm A dominates algorithm B, identify why it does!
112
David Johnson’s guidelines
Place work in context
  compare against previous work in literature
  ideally, obtain their code and test sets
    verify their results, and compare with your new algorithm
  less ideally, re-implement their code
    report any differences in performance
  least ideally, simply report their old results
    try to make some ball-park comparisons of machine speeds
113
David Johnson’s guidelines
Use efficient implementations
  “somewhat” controversial
  efficient implementation supports claims of practicality
    tells us what is achievable in practice
  can run more experiments on larger instances
    can do our research quicker!
  don’t have to go over-board on this
  exceptions can also be made
    e.g. not studying CPU time, comparing against a previously newsworthy algorithm, programming time more valuable than processing time, ...
114
David Johnson’s guidelines
Use testbeds that support general conclusions
  ideally one (or more) random class, & real world instances
    predict performance on real world problems based on random class, evaluate quality of predictions
  structured random generators
    parameters to control structure as well as size
  don’t just study real world instances
    hard to justify generality unless you have a very broad class of real world problems!
115
David Johnson’s guidelines
Provide explanations and back them up with experiment
  adds to credibility of experimental results
  improves our understanding of algorithms
    leading to better theory and algorithms
  can “weed” out bugs in your implementation!
116
David Johnson’s guidelines
Ensure reproducibility
  easily achieved via the Web
  adds support to a paper if others can (and do) reproduce the results
  requires you to use large samples and wide range of problems
    otherwise results will not be reproducible!
117
David Johnson’s guidelines
Ensure comparability (and give the full picture)
  make it easy for those who come after to reproduce your results
  provide meaningful summaries
    give sample sizes, report standard deviations, plot graphs but report data in tables in the appendix
  do not hide anomalous results
  report running times even if this is not the main focus
    readers may want to know before studying your results in detail
118
David Johnson’s pitfalls
Failing to report key implementation details
Extrapolating from tiny samples
Using irreproducible benchmarks
Using running time as a stopping criterion
Ignoring hidden costs (e.g. preprocessing)
Misusing statistical tools
Failing to use graphs
119
David Johnson’s pitfalls
Obscuring raw data by using hard-to-read charts
Comparing apples and oranges
Drawing conclusions not supported by the data
Leaving obvious anomalies unnoted/unexplained
Failing to back up explanations with further experiments
Ignoring the literature
  the self-referential study!