How did this get published? Pitfalls in experimental evaluation of computing systems
[Slide 1]

How did this get published?
Pitfalls in experimental evaluation of computing systems

José Nelson Amaral, University of Alberta
Edmonton, AB, Canada
[Slide 2]

Thing #1: Aggregation
Thing #2: Learning
Thing #3: Reproducibility

SPEC Research · Evaluate
[Slide 3]

http://archive.constantcontact.com/fs042/1101916237075/archive/1102594461324.html
http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots

So, a computing scientist entered a store…
[Slide 4]

So, a computing scientist entered a store…

[Price tags: server $3,000.00, iPod $200.00]

"They want $2,700 for the server and $100 for the iPod.
I will get both and pay only $2,240 altogether!"
[Slide 5]

So, a computing scientist entered a store…

http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1

[Price tags: server $3,000.00, iPod $200.00]

"Ma'am, you are $560 short."

"But the average of 10% and 50% is 30%, and 70% of $3,200 is $2,240."
[Slide 6]

So, a computing scientist entered a store…

[Price tags: server $3,000.00, iPod $200.00]

"Ma'am, you cannot take the arithmetic average of percentages!"

"But… I just came from a top CS conference in San Jose where they do it!"
[Slide 7]

The Problem with Averages
[Slide 8]

A Hypothetical Experiment

[Bar chart: execution time in minutes, Baseline vs. Transformed, per benchmark and for the Arith Avg]

| Benchmark | Baseline | Transformed |
|-----------|----------|-------------|
| benchA    | 1        | 5           |
| benchB    | 2        | 10          |
| benchC    | 10       | 2           |
| benchD    | 20       | 4           |

*With thanks to Iain Ireland
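The averages behind this chart are easy to check directly; a minimal Python sketch using the four benchmark times from the slide:

```python
# Benchmark execution times in minutes, taken from the slide.
baseline    = {"benchA": 1, "benchB": 2, "benchC": 10, "benchD": 20}
transformed = {"benchA": 5, "benchB": 10, "benchC": 2, "benchD": 4}

# Arithmetic average of the raw times.
avg_baseline    = sum(baseline.values()) / len(baseline)        # 8.25 minutes
avg_transformed = sum(transformed.values()) / len(transformed)  # 5.25 minutes
```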
[Slide 9]

Speedup
[Slide 10]

Performance Comparison

[Bar chart: speedup per benchmark, with Arith Avg and Geo Mean categories on the x-axis]

| Benchmark | Baseline (min) | Transformed (min) | Speedup |
|-----------|----------------|-------------------|---------|
| benchA    | 1              | 5                 | 0.2     |
| benchB    | 2              | 10                | 0.2     |
| benchC    | 10             | 2                 | 5       |
| benchD    | 20             | 4                 | 5       |
| Arith Avg |                |                   | 2.6     |

The transformed system is, on average, 2.6 times faster than the baseline!
[Slide 11]

Normalized Time
[Slide 12]

Normalized Time

[Bar chart: normalized time per benchmark, with Arith Avg and Geo Mean categories on the x-axis]

| Benchmark | Normalized Time |
|-----------|-----------------|
| benchA    | 5               |
| benchB    | 5               |
| benchC    | 0.2             |
| benchD    | 0.2             |
| Arith Avg | 2.6             |

The transformed system is, on average, 2.6 times slower than the baseline!
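The contradiction between the last two slides is easy to reproduce; a short Python sketch over the slide's times:

```python
baseline    = [1, 2, 10, 20]   # minutes, from the slide
transformed = [5, 10, 2, 4]

# Speedup (baseline/transformed) and normalized time (transformed/baseline)
# are reciprocals of each other, yet both arithmetic averages come out 2.6.
speedups   = [b / t for b, t in zip(baseline, transformed)]  # [0.2, 0.2, 5.0, 5.0]
normalized = [t / b for b, t in zip(baseline, transformed)]  # [5.0, 5.0, 0.2, 0.2]

avg_speedup    = sum(speedups) / len(speedups)       # 2.6 -> "2.6x faster"?
avg_normalized = sum(normalized) / len(normalized)   # 2.6 -> "2.6x slower"?
```

The same data, summarized with the same statistic, supports two opposite conclusions depending only on which ratio is averaged.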
[Slide 13]

Latency × Throughput

• What matters is latency: [timeline diagram, measured from the start time]
• What matters is throughput: [timeline diagram, measured from the start time]
[Slide 14]

Aggregation for Latency: Geometric Mean
[Slide 15]

Speedup

[Bar chart: speedup per benchmark, with Arith Avg and Geo Mean bars]

| Benchmark | Speedup |
|-----------|---------|
| benchA    | 0.2     |
| benchB    | 0.2     |
| benchC    | 5       |
| benchD    | 5       |
| Arith Avg | 2.6     |
| Geo Mean  | 1       |

The performance of the transformed system is, on average, the same as the baseline!
[Slide 16]

Normalized Time

[Bar chart: normalized time per benchmark, with Arith Avg and Geo Mean bars]

| Benchmark | Normalized Time |
|-----------|-----------------|
| benchA    | 5               |
| benchB    | 5               |
| benchC    | 0.2             |
| benchD    | 0.2             |
| Arith Avg | 2.6             |
| Geo Mean  | 1.0             |

The performance of the transformed system is, on average, the same as the baseline!
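The geometric mean gives the self-consistent answer shown on the last two slides; a minimal Python sketch using the slide's ratios:

```python
import math

speedups   = [0.2, 0.2, 5.0, 5.0]      # baseline/transformed, from the slide
normalized = [1 / s for s in speedups]  # transformed/baseline

def geomean(xs):
    # nth root of the product; for ratios, this commutes with taking
    # reciprocals, so both directions must agree.
    return math.prod(xs) ** (1 / len(xs))

g_speedup    = geomean(speedups)     # ~1.0: "same performance"
g_normalized = geomean(normalized)   # ~1.0: same conclusion either way
```

An equivalent formulation averages the logarithms of the ratios, which avoids overflow and underflow for long benchmark lists.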
[Slide 17]

Aggregation for Throughput

[Bar chart: execution time in minutes, Baseline vs. Transformed, per benchmark and for the Arith Avg]

| Benchmark | Baseline | Transformed |
|-----------|----------|-------------|
| benchA    | 1        | 5           |
| benchB    | 2        | 10          |
| benchC    | 10       | 2           |
| benchD    | 20       | 4           |
| Arith Avg | 8.25     | 5.25        |

The throughput of the transformed system is, on average, 1.6 times faster than the baseline.
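When throughput (total work per total time) is what matters, the arithmetic average of the raw times is the appropriate aggregate; a small Python check of the slide's numbers:

```python
baseline    = [1, 2, 10, 20]   # minutes, from the slide
transformed = [5, 10, 2, 4]

avg_base  = sum(baseline) / len(baseline)        # 8.25 minutes
avg_trans = sum(transformed) / len(transformed)  # 5.25 minutes

# Ratio of average (equivalently, total) times: ~1.57,
# which the slide rounds to 1.6.
throughput_ratio = avg_base / avg_trans
```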
[Slide 18]

The Evidence

• A careful reader will find the use of the arithmetic average to aggregate normalized numbers in many top CS conferences.
• Papers that have done that have appeared in:
  – LCTES 2011
  – PLDI 2012 (at least two papers)
  – CGO 2012
• A paper where the use of the wrong average changed a negative conclusion into a positive one:
  – 2007 SPEC Workshop
• A methodology paper by myself and a student that won the best-paper award.
[Slide 19]

This is not a new observation…

Communications of the ACM, March 1986, pp. 218–221.
[Slide 20]

No need to dig up dusty papers…
[Slide 21]

So, the computing scientist returns to the store…

[Price tags: server $3,000.00, iPod $200.00]

"Hello. I am just back from Beijing. Now I know that we should take the geometric average of percentages."

"???"
[Slide 22]

So, a computing scientist entered a store…

[Price tags: server $3,000.00, iPod $200.00]

"Thus I should get √(10% × 50%) = 22.36% discount and pay 0.7764 × $3,200 = $2,484.48."

"Sorry, Ma'am, we don't average percentages…"
[Slide 23]

So, a computing scientist entered a store…

[Price tags: server $3,000.00, iPod $200.00]

"The original price is $3,200. You pay $2,700 + $100 = $2,800.
If you want an aggregate summary, your discount is 400/3,200 = 12.5%."
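The clerk's arithmetic can be checked in a few lines; a Python sketch of the three computations from the comic, using the prices on the slides:

```python
list_prices = {"server": 3000.00, "ipod": 200.00}
sale_prices = {"server": 2700.00, "ipod": 100.00}   # 10% and 50% off

# Wrong: arithmetic average of the percentage discounts.
arith = (0.10 + 0.50) / 2                     # 0.30 -> would pay $2,240
# Still wrong: geometric average of the percentage discounts.
geo = (0.10 * 0.50) ** 0.5                    # ~0.2236 -> would pay ~$2,484.48
# Right: aggregate over the totals.
total_list = sum(list_prices.values())        # 3200.0
total_sale = sum(sale_prices.values())        # 2800.0
discount = 1 - total_sale / total_list        # 0.125, i.e. 12.5%
```

Neither average of the percentages recovers the true aggregate discount, because the discounts apply to items of very different price; only the ratio of totals does.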
[Slide 24]

Thing #2: Learning

Disregard for methodology when using automated learning
[Slide 25]

Example: Evaluation of Feedback-Directed Optimization (FDO)
[Slide 26]

http://www.orchardoo.com

We have: a compiler, application code, and application inputs.

We want to measure the effectiveness of an FDO-based code transformation.
[Slide 27]

[Diagram: the application code and a training set of application inputs go to the compiler (-I), which produces instrumented code; running it yields a profile.]
[Slide 28]

[Diagram: the application code and the profile go to the compiler (-FDO), which produces optimized code; its performance is measured on an evaluation set drawn from the application inputs, leaving the remaining inputs un-evaluated.]

The FDO transformation produces code that is XX faster for this application.
[Slide 29]

The Evidence

• Many papers that use a single input for training and a single input for testing appeared in conferences (notably CGO).
• For instance, a paper that uses a single input for training and a single input for testing appears in:
  – ASPLOS 2004
[Slide 30]

[Diagram: the same FDO pipeline, but performance is now measured on each input in the evaluation set, yielding one performance result per input.]
[Slide 31]

[Diagram: profiles from each input in the training set are combined before the compiler (-FDO) produces the optimized code, which is then measured on the evaluation set.]

Cross-Validated Evaluation (Berube, SPEC07)
Combined Profiling (Berube, ISPASS12)
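The cross-validated scheme above can be sketched as leave-one-out over the available inputs. `train_profile` and `measure_speedup` are hypothetical placeholders standing in for the profiling and measurement steps, not the actual tools from Berube's papers:

```python
def cross_validate(inputs, train_profile, measure_speedup):
    """Leave-one-out: each input is evaluated with a profile trained
    only on the other inputs, so test data never leaks into training."""
    results = []
    for held_out in inputs:
        training = [i for i in inputs if i is not held_out]
        profile = train_profile(training)   # profile from training inputs only
        results.append(measure_speedup(profile, held_out))
    return results
```

Reporting the distribution of `results`, rather than a single number, also speaks to the reproducibility concerns raised later in the talk.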
[Slide 32]

[Diagram: the same input is used both to generate the profile and to measure the performance of the optimized code.]

Wrong Evaluation!
[Slide 33]

The Evidence

• For instance, a paper that incorrectly uses the same input for training and testing appeared in:
  – PLDI 2006
[Slide 34]

Thing #3: Reproducibility

Expectation: When reproduced, an experimental evaluation should produce similar results.
[Slide 35]

Thing #3: Reproducibility

Issues

• Availability of code, data, and a precise description of the experimental setup.
• Lack of incentives for reproducibility studies.
• Have the measurements been repeated a sufficient number of times to capture measurement variation?
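The last bullet can be made concrete: report a mean with a confidence interval over repeated runs rather than a single measurement. A minimal sketch with illustrative numbers (the run times are made up; 2.776 is the 95% Student's t critical value for n = 5 runs):

```python
import statistics as st

runs = [10.2, 10.5, 9.9, 10.4, 10.1]   # seconds; illustrative, not measured

mean = st.mean(runs)
# 95% confidence half-width: t_(0.975, n-1) * s / sqrt(n), t = 2.776 for n = 5.
half_width = 2.776 * st.stdev(runs) / len(runs) ** 0.5

print(f"{mean:.2f} ± {half_width:.2f} s")
```

If the intervals of two configurations overlap heavily, the data does not support a claim that one is faster than the other.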
[Slide 36]

Thing #3: Reproducibility

Progress

• Program committees and reviewers are starting to ask questions about reproducibility.
• Steps toward infrastructure to facilitate reproducibility.
[Slide 37]

SPEC Research Group

• 14 industrial organizations
• 20 universities or research institutes

http://research.spec.org/
[Slide 38]

SPEC Research Group

• Performance Evaluation
• Benchmarks for New Areas
• Performance Evaluation Tools
• Evaluation Methodology
• Repository for Reproducibility

http://research.spec.org/
http://icpe2013.ipd.kit.edu/
[Slide 39]

Evaluate

• Anti-Patterns
• Evaluation in CS education
• Open Letter to PC Chairs

Evaluate Collaboratory: http://evaluate.inf.usi.ch/
[Slide 40]

Parting Thoughts…

Creating a culture that enables full reproducibility seems daunting…

Initially we could aim for: a reasonable expectation, by a reasonable reader, that, if reproduced, the experimental evaluation would produce similar results.