Evaluation and Methodology for Experimental Computer Science
Evaluation and Methodology for Experimental Computer Science
Steve Blackburn, Research School of Computer Science, Australian National University
Steve Blackburn | Evaluation and Methodology | PhD Workshop May 2012
Research: Solving problems without known answers
Quantitative Experimentation
• Experiment
– Measure A and B in the context of C
• Claim
– “A is better than B”
Does the experiment support the claim?
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Scope of Claim & Experiment
• A claim with broad scope is hard to satisfy
– “We improve Java programs by 10%”
– Implicitly all Java programs in all circumstances
– The scope of the experiment is limited by resources
• A claim with narrow scope is uninteresting
– “We improve Java on lusearch on an i7 on … by 10%”
Scope of claim is the key tension
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components of an Experiment
• Measurement context
– Software and hardware components varied or held constant
• Workloads
– Benchmarks and their inputs used in the experiment
• Metrics
– The properties to measure and how to measure them
• Data analysis and interpretation
– How to analyze the data and how to interpret the results
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Control / Independent Variables; Dependent Variables
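As a sketch, the four components above could be captured in a single record per run, so that nothing in the measurement context goes unrecorded. All names and values here are illustrative, not from the talk.

```python
# Hypothetical sketch: one record describing an experiment's measurement
# context, workload, and metric. Field names and values are invented.
from dataclasses import dataclass


@dataclass
class ExperimentSpec:
    # Measurement context: components varied or held constant
    hardware: str            # e.g. "i7-2600, 16 GB RAM"
    software: dict           # e.g. {"jvm": "OpenJDK 7", "heap": "2x min"}
    # Workload: the benchmark and its input
    benchmark: str
    benchmark_input: str
    # Metric: what to measure and how
    metric: str              # e.g. "wall-clock time (ms)"
    iterations: int = 30     # repeated runs for later statistical analysis


spec = ExperimentSpec(
    hardware="i7-2600, 16 GB RAM",
    software={"jvm": "OpenJDK 7", "heap": "2x min"},
    benchmark="lusearch",
    benchmark_input="default",
    metric="wall-clock time (ms)",
)
```

Writing the context down explicitly like this makes it obvious which fields are control/independent variables (everything in the spec) and which are dependent variables (the measured metric values).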
Experimental Pitfalls (the four “I”s)
• Inappropriate
– Experiments that are inappropriate (or surplus) to the claim
• Ignored
– Elements relevant to the claim, but omitted
• Inconsistent
– Elements are treated inconsistently
• Irreproducible
– Others cannot reproduce the experiment
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components × Pitfalls
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
A measurement context is inappropriate when it is flawed or does not reflect the measurement context that is implicit in the claim. This may become manifest as an error or as a distraction (a “red herring”).

An aspect of the measurement context is ignored when an experiment design does not consider it even when it is necessary to support the claim.

A measurement context is inconsistent when an experiment compares two systems and uses different measurement contexts for each system. The different contexts may produce incomparable results for the two systems. Unfortunately, the more disparate the objects of comparison, the more difficult it is to ensure consistent measurement contexts. Even a benign-looking difference in contexts can introduce bias and make measurement results incomparable. For this reason, it may be important to randomize experimental parameters (e.g., memory layout in experiments that measure performance).

If the measurement context is irreproducible then the experiment is also irreproducible. Measurement contexts may be irreproducible because either they are not public or they are not documented.
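The randomization point can be sketched in a few lines: vary the controllable parameters (run order, a layout-like offset) across trials so that neither system is systematically favoured by one fixed but arbitrary context. The systems and parameters below are illustrative, not from the talk.

```python
# Sketch: randomize run order and a layout-like parameter across trials
# so a fixed, arbitrary context does not bias the A-vs-B comparison.
import random

systems = ["A", "B"]
trials = []
for trial in range(10):
    order = random.sample(systems, len(systems))   # random run order
    layout_offset = random.randrange(0, 4096, 64)  # e.g. a stack/heap offset
    for system in order:
        trials.append({"trial": trial,
                       "system": system,
                       "layout_offset": layout_offset})

# Over many trials both systems see the same distribution of contexts,
# so any one layout or ordering affects both equally in expectation.
```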
Advice
Ingrained, Systematic Skepticism
Too good to be true? Probably.
• Is the result repeatable?
– If it is not, it’s nothing more than noise
• Is the result plausible?
– You need to possess a clear idea of what is plausible
• Can you explain the result?
– Plausible support of the hypothesis is essential
Street-Fighting Mathematics, MIT OpenCourseWare 18.098 / 6.099
Clean Environment
Just as essential as a clean lab is for a biologist
• Clean OS & distro
– All machines run the same image of the same distro
• Clean hardware
– Buy machines in pairs (redundancy & sanity checks)
• Know what is running
– No NFS mounts, no non-essential daemons
• Machine reservation system
– Ensure only you are using the machine
Repeatability and Accountability
Disk is cheap. Don’t throw anything away.
• All experiments should be scripted
• Log every experiment
– Capture the environment and output in the log
– Keep logs (forever)
• Publish your raw data
– Downloadable from your web site
– If you’re not comfortable with this, you probably should not be publishing
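A minimal sketch of a scripted, logged run, assuming one JSON log file per experiment; the command, log directory, and recorded fields are placeholders, not the speaker's actual infrastructure.

```python
# Sketch: run a benchmark command, capture the environment and its output,
# and write both to a timestamped log that is never deleted.
import datetime
import json
import os
import platform
import subprocess
import sys


def run_logged(cmd, log_dir="logs"):
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    result = subprocess.run(cmd, capture_output=True, text=True)
    record = {
        "timestamp": stamp,
        "command": cmd,
        # Environment snapshot: extend with kernel, CPU, compiler, etc.
        "environment": {
            "platform": platform.platform(),
            "python": sys.version,
        },
        "stdout": result.stdout,
        "returncode": result.returncode,
    }
    with open(os.path.join(log_dir, f"{stamp}.json"), "w") as f:
        json.dump(record, f, indent=2)
    return record


# Placeholder "benchmark": any command line would go here.
record = run_logged([sys.executable, "-c", "print('42 ms')"])
```

Because the script, not a human, decides what gets recorded, every run is captured the same way, and the logs can later back up a published claim.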
Statistics
Lies, damn lies, and statistics
• Understand basic statistics
• Are your results statistically significant?
• Report confidence intervals
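For example, a 95% confidence interval for the mean of repeated timing measurements can be computed with the t-distribution, which is appropriate for small samples. The timings below are invented for illustration.

```python
# Sketch: mean and 95% confidence interval for repeated measurements.
import math
import statistics

times_ms = [102.1, 99.8, 101.5, 100.2, 98.9, 101.0, 100.6, 99.4]
n = len(times_ms)
mean = statistics.mean(times_ms)
# Standard error of the mean, using the sample standard deviation.
sem = statistics.stdev(times_ms) / math.sqrt(n)
# Two-sided 95% critical value of Student's t with n-1 = 7 df.
t_crit = 2.365
ci = (mean - t_crit * sem, mean + t_crit * sem)
print(f"{mean:.1f} ms, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

If the confidence intervals of systems A and B overlap heavily, the data may not support the claim that one is better than the other.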
Good Tools
Good evaluation infrastructure gives you an edge
• Good data management system
– Easy manipulation and recovery of data
• Good data analysis tools
– See results that others can’t, and share with your collaborators
• Good workloads
– Realistic workloads are key to credibility
• Good teamwork
– Resist the temptation to write your own. Work as a team.
Questions?
Mechanics