Benchmarking: You're Doing It Wrong (StrangeLoop 2014)
DESCRIPTION
Knowledge of how to set up good benchmarks is invaluable in understanding the performance of a system. Writing correct and useful benchmarks is hard, and verifying the results is difficult and error-prone. Done right, benchmarks guide teams to improve the performance of their systems. Done wrong, hours of effort may result in a worse-performing application, upset customers, or worse! In this talk, we will discuss what you need to know to write better benchmarks. We will look at examples of bad benchmarks and learn what biases can invalidate the measurements, in the hope of correctly applying our new-found skills and avoiding such pitfalls in the future.

TRANSCRIPT
Benchmarking: You’re Doing It Wrong
Aysylu Greenberg @aysylu22
In Memes…
To Write Good Benchmarks…
Need to be Full Stack
What’s a Benchmark
How fast?
Your process vs. Goal
Your process vs. Best Practices
Today
• How Not to Write Benchmarks
• Benchmark Setup & Results:
  - Wrong about the machine
  - Wrong about stats
  - Wrong about what matters
• Becoming Less Wrong
HOW NOT TO WRITE BENCHMARKS
Website Serving Images
• Access 1 image 1000 times
• Latency measured for each access
• Start measuring immediately
• 3 runs
• Find mean
• Dev machine
[Diagram: Web Request → Server → S3 Cache]
WHAT’S WRONG WITH THIS BENCHMARK?
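As a concrete starting point, the flawed setup above can be sketched in Python (the talk itself is language-agnostic, and `fetch_image` here is a hypothetical stand-in for the real HTTP fetch):

```python
import statistics
import time

def fetch_image(url):
    # Hypothetical stand-in for an HTTP GET against the image server.
    return b"\x89PNG" + b"\x00" * 64

def naive_benchmark(runs=3, accesses=1000):
    # Pitfalls reproduced on purpose: no warmup, the same image every
    # time (so every layer of cache is hot), and a single mean that
    # hides the shape of the distribution.
    means = []
    for _ in range(runs):
        latencies = []
        for _ in range(accesses):
            start = time.perf_counter()
            fetch_image("http://example.com/cat.png")
            latencies.append(time.perf_counter() - start)
        means.append(statistics.mean(latencies))
    return statistics.mean(means)

mean_latency = naive_benchmark()
```

Each of the pitfalls baked into this sketch is taken apart in the sections that follow.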
BENCHMARK SETUP & RESULTS: COMMON PITFALLS
You’re wrong about the machine
Wrong About the Machine
• Cache, cache, cache, cache!
It’s Caches All The Way Down
Caches in Benchmarks (Prof. Saman Amarasinghe, MIT 2009)
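One way the cache pitfall bites: if any layer memoizes, then repeatedly accessing the same image measures the cache rather than the system. A minimal sketch, using Python's `functools.lru_cache` to stand in for the S3 cache in the example:

```python
import functools

call_count = 0

@functools.lru_cache(maxsize=None)
def load_image(name):
    # Simulates the expensive backend fetch; in the benchmarked system
    # this is the path users actually pay for on a cache miss.
    global call_count
    call_count += 1
    return b"data-for-" + name.encode()

# Accessing one image 1000 times exercises the backend exactly once;
# the other 999 accesses time the cache, not the system under test.
for _ in range(1000):
    load_image("cat.png")
```

Varying the request mix (many images, realistic hit rates) is what forces the benchmark to see past the caches.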
Wrong About the Machine
• Cache, cache, cache, cache!
• Warmup & Timing
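The warmup point can be sketched as a small harness that runs the workload until it reaches steady state and discards those timings (a sketch of the idea, not the speaker's code):

```python
import time

def timed_samples(fn, warmup=100, measured=1000):
    # Run fn enough times for caches, JITs, and connection pools to
    # reach steady state, and discard those warmup timings entirely.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(measured):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

samples = timed_samples(lambda: sum(range(100)))
```

The "start measuring immediately" bullet in the example benchmark is exactly what the warmup phase here avoids.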
Wrong About the Machine
• Cache, cache, cache, cache!
• Warmup & Timing
• Periodic interference
Wrong About the Machine
• Cache, cache, cache, cache!
• Warmup & Timing
• Periodic interference
• Different specs in test vs. prod machines
Wrong About the Machine
• Cache, cache, cache, cache!
• Warmup & Timing
• Periodic interference
• Different specs in test vs. prod machines
• Power mode changes
BENCHMARK SETUP & RESULTS: COMMON PITFALLS
You’re wrong about the stats
Wrong About Stats
• Too few samples
Wrong About Stats
[Chart: Convergence of Median on Samples; latency over time for stable samples, stable median, decaying samples, and decaying median]
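The chart's idea can be sketched in Python: track the median over growing prefixes of the samples and check whether it has stabilized (synthetic Gaussian data here, not the talk's measurements):

```python
import random
import statistics

random.seed(42)

def running_median(samples):
    # Median of the first k samples, for each prefix; plotting this
    # against k shows whether the estimate has stabilized.
    return [statistics.median(samples[: k + 1]) for k in range(len(samples))]

stable = [random.gauss(50, 5) for _ in range(60)]
medians = running_median(stable)

# With stable samples, the last few prefix medians agree closely;
# a still-decaying workload would keep drifting instead.
spread = max(medians[-10:]) - min(medians[-10:])
```

If the running median has not converged by the end of the run, the benchmark has too few samples.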
Wrong About Stats
• Too few samples
• Non-Gaussian
Wrong About Stats
• Too few samples
• Non-Gaussian
• Multimodal distribution
Multimodal Distribution
[Chart: number of occurrences vs. latency; two modes near 5 ms and 10 ms, annotated 50% and 99%]
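With a multimodal distribution, the mean reports a latency no request actually experienced; percentiles describe the modes. A sketch with synthetic bimodal data matching the 5 ms / 10 ms modes above:

```python
import statistics

# Synthetic bimodal latencies: most requests are served from cache
# near 5 ms, and a second mode near 10 ms goes to the backend.
latencies_ms = [5.0] * 900 + [10.0] * 100

mean = statistics.mean(latencies_ms)  # 5.5 ms: a latency no request had
p50 = statistics.quantiles(latencies_ms, n=100)[49]   # 50th percentile
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile
```

Reporting p50 and p99 (5 ms and 10 ms here) describes both modes; the 5.5 ms mean describes neither.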
Wrong About Stats
• Too few samples
• Non-Gaussian
• Multimodal distribution
• Outliers
BENCHMARK SETUP & RESULTS: COMMON PITFALLS
You’re wrong about what matters
Wrong About What Matters
• Premature optimization
“Programmers waste enormous amounts of time thinking about … the speed of noncritical parts of their programs … Forget about small efficiencies … 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
-- Donald Knuth
Wrong About What Matters
• Premature optimization
• Unrepresentative workloads
Wrong About What Matters
• Premature optimization
• Unrepresentative workloads
• Memory pressure
BECOMING LESS WRONG
The How
Becoming Less Wrong
User Actions Matter
X > Y for workload Z with trade-offs A, B, and C
-- http://www.toomuchcode.org/
Becoming Less Wrong
Profiling
Code Instrumentation
Aggregate Over Logs
Traces
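Code instrumentation can be as small as a decorator that records per-call latency for later aggregation; a hypothetical Python sketch (names and structure are illustrative, not from the talk):

```python
import collections
import functools
import time

# Aggregated per-function latency logs, keyed by function name.
LATENCIES = collections.defaultdict(list)

def instrumented(fn):
    # Records the wall-clock latency of every call; aggregating these
    # records shows where real traffic actually spends its time.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            LATENCIES[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@instrumented
def handle_request(n):
    return sum(range(n))

for _ in range(10):
    handle_request(1000)
```

Unlike a synthetic benchmark, data gathered this way reflects the real workload, which is the point of the "user actions matter" slide above.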
Microbenchmarking: Blessing & Curse
+ Quick & cheap
+ Answers narrow questions well
- Often misleading results
- Not representative of the program
Microbenchmarking: Blessing & Curse
• Choose your N wisely
Choose Your N Wisely (Prof. Saman Amarasinghe, MIT 2009)
Microbenchmarking: Blessing & Curse
• Choose your N wisely
• Measure side effects
Microbenchmarking: Blessing & Curse
• Choose your N wisely
• Measure side effects
• Beware of clock resolution
Microbenchmarking: Blessing & Curse
• Choose your N wisely
• Measure side effects
• Beware of clock resolution
• Dead Code Elimination
Microbenchmarking: Blessing & Curse
• Choose your N wisely
• Measure side effects
• Beware of clock resolution
• Dead Code Elimination
• Constant work per iteration
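Two of these pitfalls can be sketched in Python: checking the timer's resolution, and returning an accumulated result so the measured work cannot be optimized away as dead code (a sketch of the idea, not a full harness like JMH):

```python
import time

# Clock resolution bounds what a single timing can resolve; the work
# per measured batch should be much larger than this value.
resolution = time.get_clock_info("perf_counter").resolution

def bench(fn, n):
    # Accumulate and return the results so an optimizer (or a careless
    # refactor) cannot treat the measured work as dead code.
    sink = 0
    start = time.perf_counter()
    for i in range(n):
        sink += fn(i)
    elapsed = time.perf_counter() - start
    return elapsed / n, sink

per_call, sink = bench(lambda i: i * i, 100_000)
```

Batching 100,000 calls per timing (choosing N wisely) keeps the measured interval far above the clock's resolution.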
Non-Constant Work Per Iteration
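An illustrative example of non-constant work per iteration (a sketch, not from the slides): inserting into an ever-growing sorted list, where later iterations are strictly more expensive, so dividing total time by N describes no real iteration:

```python
import bisect
import random
import time

random.seed(7)
values = [random.random() for _ in range(20_000)]

def insert_range(dest, chunk):
    # Time inserting a chunk of values into the (shared) sorted list.
    start = time.perf_counter()
    for v in chunk:
        bisect.insort(dest, v)  # O(len(dest)) element shifts per insert
    return time.perf_counter() - start

acc = []
first_half = insert_range(acc, values[:10_000])
second_half = insert_range(acc, values[10_000:])

# The second half operates on a much larger list, so it is slower;
# "total time / 20,000" reports a cost no single insert actually has.
```

The fix is either to reset state so every iteration does the same work, or to report cost as a function of size rather than a single average.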
Follow-up Material
• How NOT to Measure Latency by Gil Tene
  - http://www.infoq.com/presentations/latency-pitfalls
• Taming the Long Latency Tail on highscalability.com
  - http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
• Performance Analysis Methodology by Brendan Gregg
  - http://www.brendangregg.com/methodology.html
• Robust Java Benchmarking by Brent Boyer
  - http://www.ibm.com/developerworks/library/j-benchmark1/
  - http://www.ibm.com/developerworks/library/j-benchmark2/
• Benchmarking articles by Aleksey Shipilëv
  - http://shipilev.net/#benchmarking
Takeaway #1: Cache
Takeaway #2: Outliers
Takeaway #3: Workload
Benchmarking: You’re Doing It Wrong
Aysylu Greenberg @aysylu22