Evaluation and Methodology for Experimental Computer Science
Evaluation and Methodology for Experimental Computer Science
Steve Blackburn, Research School of Computer Science, Australian National University
Steve Blackburn | Evaluation and Methodology | PhD Workshop May 2012
Research: Solving problems without known answers
Quantitative Experimentation
• Experiment
– Measure A and B in the context of C
• Claim
– “A is better than B”
Does the experiment support the claim?
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Scope of Claim & Experiment
• A claim with broad scope is hard to satisfy
– “We improve Java programs by 10%”
– Implicitly all Java programs in all circumstances
– The scope of the experiment is limited by resources
• A claim with narrow scope is uninteresting
– “We improve Java on lusearch on an i7 on … by 10%”
Scope of claim is the key tension
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components of an Experiment
• Measurement context
– Software and hardware components varied or held constant
• Workloads
– Benchmarks and their inputs used in the experiment
• Metrics
– The properties to measure and how to measure them
• Data analysis and interpretation
– How to analyze the data and how to interpret the results
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Control / Independent Variables; Dependent Variables
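As a sketch, the four components above could be captured in a single record per run, so that nothing in the measurement context goes unrecorded. All names and values here are illustrative, not from the talk.

```python
# Hypothetical sketch: one record describing an experiment's measurement
# context, workload, and metric. Field names and values are invented.
from dataclasses import dataclass


@dataclass
class ExperimentSpec:
    # Measurement context: components varied or held constant
    hardware: str            # e.g. "i7-2600, 16 GB RAM"
    software: dict           # e.g. {"jvm": "OpenJDK 7", "heap": "2x min"}
    # Workload: the benchmark and its input
    benchmark: str
    benchmark_input: str
    # Metric: what to measure and how
    metric: str              # e.g. "wall-clock time (ms)"
    iterations: int = 30     # repeated runs for later statistical analysis


spec = ExperimentSpec(
    hardware="i7-2600, 16 GB RAM",
    software={"jvm": "OpenJDK 7", "heap": "2x min"},
    benchmark="lusearch",
    benchmark_input="default",
    metric="wall-clock time (ms)",
)
```

Writing the context down explicitly like this makes it obvious which fields are control/independent variables (everything in the spec) and which are dependent variables (the measured metric values).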
Experimental Pitfalls (the four “I”s)
• Inappropriate
– Experiments that are inappropriate (or surplus) to the claim
• Ignored
– Elements relevant to the claim, but omitted
• Inconsistent
– Elements are treated inconsistently
• Irreproducible
– Others cannot reproduce the experiment
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
Components × Pitfalls
[Blackburn, Diwan, Hauswirth, Sweeney et al 2012]
A measurement context is inappropriate when it is flawed or does not reflect the measurement context that is implicit in the claim. This may become manifest as an error or as a distraction (a “red herring”).

An aspect of the measurement context is ignored when an experiment design does not consider it even when it is necessary to support the claim.

A measurement context is inconsistent when an experiment compares two systems and uses different measurement contexts for each system. The different contexts may produce incomparable results for the two systems. Unfortunately, the more disparate the objects of comparison, the more difficult it is to ensure consistent measurement contexts. Even a benign-looking difference in contexts can introduce bias and make measurement results incomparable. For this reason, it may be important to randomize experimental parameters (e.g., memory layout in experiments that measure performance).

If the measurement context is irreproducible then the experiment is also irreproducible. Measurement contexts may be irreproducible because either they are not public or they are not documented.
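The randomization point can be sketched in a few lines: vary the controllable parameters (run order, a layout-like offset) across trials so that neither system is systematically favoured by one fixed but arbitrary context. The systems and parameters below are illustrative, not from the talk.

```python
# Sketch: randomize run order and a layout-like parameter across trials
# so a fixed, arbitrary context does not bias the A-vs-B comparison.
import random

systems = ["A", "B"]
trials = []
for trial in range(10):
    order = random.sample(systems, len(systems))   # random run order
    layout_offset = random.randrange(0, 4096, 64)  # e.g. a stack/heap offset
    for system in order:
        trials.append({"trial": trial,
                       "system": system,
                       "layout_offset": layout_offset})

# Over many trials both systems see the same distribution of contexts,
# so any one layout or ordering affects both equally in expectation.
```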
Advice
Ingrained, Systematic Skepticism
Too good to be true? Probably.
• Is the result repeatable?
– If it is not, it’s nothing more than noise
• Is the result plausible?
– You need to possess a clear idea of what is plausible
• Can you explain the result?
– Plausible support of the hypothesis is essential
Street-Fighting Mathematics, MIT OpenCourseWare 18.098 / 6.099
Clean Environment
Just as essential as a clean lab is for a biologist
• Clean OS & distro
– All machines run the same image of the same distro
• Clean hardware
– Buy machines in pairs (redundancy & sanity checks)
• Know what is running
– No NFS mounts, no non-essential daemons
• Machine reservation system
– Ensure only you are using the machine
Repeatability and Accountability
Disk is cheap. Don’t throw anything away.
• All experiments should be scripted
• Log every experiment
– Capture the environment and output in the log
– Keep logs (forever)
• Publish your raw data
– Downloadable from your web site
– If you’re not comfortable with this, you probably should not be publishing
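A minimal sketch of a scripted, logged run, assuming one JSON log file per experiment; the command, log directory, and recorded fields are placeholders, not the speaker's actual infrastructure.

```python
# Sketch: run a benchmark command, capture the environment and its output,
# and write both to a timestamped log that is never deleted.
import datetime
import json
import os
import platform
import subprocess
import sys


def run_logged(cmd, log_dir="logs"):
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    result = subprocess.run(cmd, capture_output=True, text=True)
    record = {
        "timestamp": stamp,
        "command": cmd,
        # Environment snapshot: extend with kernel, CPU, compiler, etc.
        "environment": {
            "platform": platform.platform(),
            "python": sys.version,
        },
        "stdout": result.stdout,
        "returncode": result.returncode,
    }
    with open(os.path.join(log_dir, f"{stamp}.json"), "w") as f:
        json.dump(record, f, indent=2)
    return record


# Placeholder "benchmark": any command line would go here.
record = run_logged([sys.executable, "-c", "print('42 ms')"])
```

Because the script, not a human, decides what gets recorded, every run is captured the same way, and the logs can later back up a published claim.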
Statistics
Lies, damn lies, and statistics
• Understand basic statistics
• Are your results statistically significant?
• Report confidence intervals
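For example, a 95% confidence interval for the mean of repeated timing measurements can be computed with the t-distribution, which is appropriate for small samples. The timings below are invented for illustration.

```python
# Sketch: mean and 95% confidence interval for repeated measurements.
import math
import statistics

times_ms = [102.1, 99.8, 101.5, 100.2, 98.9, 101.0, 100.6, 99.4]
n = len(times_ms)
mean = statistics.mean(times_ms)
# Standard error of the mean, using the sample standard deviation.
sem = statistics.stdev(times_ms) / math.sqrt(n)
# Two-sided 95% critical value of Student's t with n-1 = 7 df.
t_crit = 2.365
ci = (mean - t_crit * sem, mean + t_crit * sem)
print(f"{mean:.1f} ms, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

If the confidence intervals of systems A and B overlap heavily, the data may not support the claim that one is better than the other.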
Good Tools
Good evaluation infrastructure gives you an edge
• Good data management system
– Easy manipulation and recovery of data
• Good data analysis tools
– See results that others can’t, and share with your collaborators
• Good workloads
– Realistic workloads are key to credibility
• Good teamwork
– Resist the temptation to write your own. Work as a team.
Questions?
Mechanics