Taming Performance Variability
Aleksander Maricq* Dmitry Duplyakin* Ivo Jimenez†
Carlos Maltzahn† Ryan Stutsman* Robert Ricci*
* University of Utah
† University of California Santa Cruz
Outline
Work published at OSDI’18
Current Efforts
Future Directions
Cyber-Physical Systems/Internet of Things
● Original context: Performance metrics on bare-metal compute HW
● Analysis techniques are not specific to this context
● Applicable to environments with more and less control over factors
Taming Performance Variability - OSDI’18
Motivation: Performance Variability
How confident should I be that my results are correct?
How many times do I need to run my experiments?
As a testbed builder, how can I help users figure this out?
Examine performance variability of testbed hardware
● 11 months, ~892,000 data points, 835 servers
● Memory, Disk, Network
● Within servers, Across servers
● 1,500 servers at three sites
○ Several distinct ‘types’ of identical servers
● Exclusive, raw access to hardware
○ No interference on servers from simultaneous users
○ Doesn’t add virtualization overhead / variability
● Our experiments were run on servers allocated only to us
● Configuration: Combination of hardware type, workload, parameters
○ c220g1, single-threaded mem copy, dvfs off
○ m510, net bw, rack-local
https://www.cloudlab.us/
How confident can we be in the correctness of our results?
How much trouble are we in?
[Figure: results grouped as Network Latency; Network Bandwidth; Mixed Disk, Mem; Noisier Disk, Mem]
Confidence Intervals
● Range for your mean (different from stdev)
● Represents some % confidence (e.g., 95%) that the true mean lies within the range
● More runs -> narrower CI
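As a rough illustration of the last bullet, the sketch below computes a textbook normal-approximation CI for the mean and shows it narrowing as runs are added. The function name `mean_ci` and the sample values are made up for illustration; this is not code from the talk, and it assumes roughly normal data, which does not always hold for performance measurements.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(samples, confidence=0.95):
    """Normal-approximation CI for the mean: mean +/- z * s / sqrt(n)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(samples) / sqrt(n)
    center = mean(samples)
    return center - half_width, center + half_width

# Hypothetical measurements: same spread, 4x more runs in the second set.
few_runs = [9.0, 10.0, 11.0] * 4    # 12 runs
many_runs = [9.0, 10.0, 11.0] * 16  # 48 runs

lo_few, hi_few = mean_ci(few_runs)
lo_many, hi_many = mean_ci(many_runs)
print(f"12 runs: CI width {hi_few - lo_few:.3f}")
print(f"48 runs: CI width {hi_many - lo_many:.3f}")
```

Because the half-width scales with 1/sqrt(n), quadrupling the number of runs roughly halves the width of the interval.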
Testing Normality
● Many statistical models assume normal (Gaussian) bell-curve
● Is our data normal? Shapiro-Wilk test (95% confidence)
Use Non-Parametric Statistics to Avoid Assumptions of Normality
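One common non-parametric alternative is a confidence interval for the median built from order statistics, which makes no assumption about the shape of the distribution. The sketch below is one standard way to do this using the binomial distribution; the function name `median_ci` and the sample data are illustrative assumptions, not taken from the talk.

```python
from math import comb

def median_ci(samples, confidence=0.95):
    """Distribution-free CI for the median from order statistics.

    The number of samples below the true median follows Binomial(n, 0.5),
    so [x_(j), x_(n-j+1)] covers the median with probability
    1 - 2 * P(Binomial(n, 0.5) <= j - 1). We pick the largest such j
    that keeps coverage at or above the requested confidence.
    """
    xs = sorted(samples)
    n = len(xs)
    tail = (1 - confidence) / 2
    cumulative, j = 0.0, 0
    for k in range(n + 1):
        cumulative += comb(n, k) / 2 ** n  # P(Binomial(n, 0.5) <= k)
        if cumulative > tail:
            break
        j = k + 1
    if j == 0:
        raise ValueError("too few samples for this confidence level")
    return xs[j - 1], xs[n - j]

# Hypothetical skewed measurements (one slow outlier).
runs = [101.0, 99.0, 100.0, 98.0, 102.0, 100.5, 99.5, 250.0, 100.2, 99.8]
low, high = median_ci(runs)
print(f"95% CI for the median: [{low}, {high}]")
```

At 95% confidence this approach needs at least six samples; with fewer, no pair of order statistics gives enough coverage.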
How confident can we be in the correctness of our results?
● Some variation is unavoidable
● Results are often non-normal
● More runs → more confidence
How many times should we run our experiments?
CONFIRM - CONFIdence-based Repetition Meter
● Uses all our collected data to build estimates of how many runs are needed
○ For configurations on a single server or group of servers
● Uses random sub-samples of historical data
○ Takes many sub-samples, computes mean and CI
● Calculating empirical CIs from your own observed runs is still necessary
● Integrated into CloudLab, but doesn’t have to be specific to it
CONFIRM
From past data, uses random subsets to model median and CI behavior for increasing numbers of runs
● Median and CI converge with more runs
● 33 runs until CI is within 1% of median
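The subsampling idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the CONFIRM implementation: the names `median_ci` and `recommend_runs`, the default of 200 random subsets per size, and the rule of averaging CI bounds across subsets are all choices made for this example. It also assumes positive-valued measurements.

```python
import random
from math import comb
from statistics import mean, median

def median_ci(sorted_xs, confidence=0.95):
    """Order-statistic CI for the median; None if n is too small."""
    n = len(sorted_xs)
    tail, cumulative, j = (1 - confidence) / 2, 0.0, 0
    for k in range(n + 1):
        cumulative += comb(n, k) / 2 ** n  # P(Binomial(n, 0.5) <= k)
        if cumulative > tail:
            break
        j = k + 1
    return (sorted_xs[j - 1], sorted_xs[n - j]) if j else None

def recommend_runs(history, rel_width=0.01, confidence=0.95,
                   trials=200, seed=0):
    """Smallest subset size whose average CI lies within rel_width of the median.

    For each candidate size n, draw `trials` random subsets of the
    historical data, compute the median and its CI for each, and average
    them. Returns None if no size up to len(history) is sufficient.
    """
    rng = random.Random(seed)
    for n in range(2, len(history) + 1):
        lows, highs, medians = [], [], []
        for _ in range(trials):
            subset = sorted(rng.sample(history, n))
            ci = median_ci(subset, confidence)
            if ci is None:  # n too small for this confidence level
                break
            lows.append(ci[0])
            highs.append(ci[1])
            medians.append(median(subset))
        if not lows:
            continue
        center = mean(medians)
        if (mean(lows) >= center * (1 - rel_width)
                and mean(highs) <= center * (1 + rel_width)):
            return n
    return None

# Hypothetical history: 300 past runs with a little Gaussian noise.
noise = random.Random(1)
history = [100.0 + noise.gauss(0.0, 1.0) for _ in range(300)]
print("recommended runs:", recommend_runs(history))
```

Noisier histories need larger subsets before the averaged CI tightens to the target width, and a history that never converges yields `None`, which signals that more data is needed than has been collected.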
CONFIRM Recommendations
| Configuration | CoV | Recommended Runs |
|---|---|---|
| Mem Config A (c8220, ST copy, no dvfs, socket 1) | 0.262 | 10 |
| Disk Config B (c8220, /dev/sda4, seqwrite, iodepth 4096) | 1.708 | 37 |
| Mem Config C (c220g1, ST copy, dvfs, socket 1) | 6.139 | 74 |
| Net Config D (m400, not rack-local, iperf3 (bw), forward) | 6.309 | 10 |
| Net Config E (m510, not rack-local, latency, forward) | 8.086 | 230 |
| Disk Config F (c8220, /dev/sda4, randread, iodepth 4096) | 8.122 | 610 |
Trend: Higher CoV → More Runs
Recommended runs rise fast with higher CoV
CoV and recommended runs are not perfectly correlated
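For reference, the coefficient of variation (CoV) is the standard deviation divided by the mean. The sketch below expresses it as a percentage; the function name `cov_percent` and the sample values are illustrative, and the slides do not state whether the table's CoV values are percentages or plain ratios.

```python
from statistics import mean, stdev

def cov_percent(samples):
    """Coefficient of variation: sample stdev / |mean|, as a percentage."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CoV is undefined when the mean is zero")
    return 100.0 * stdev(samples) / abs(m)

# Hypothetical bandwidth measurements from two configurations.
steady = [940.0, 941.0, 939.5, 940.5, 940.2]
noisy = [940.0, 610.0, 1020.0, 780.0, 955.0]
print(f"steady: {cov_percent(steady):.3f}%")
print(f"noisy:  {cov_percent(noisy):.3f}%")
```

Because CoV summarizes spread in a single number, two configurations with similar CoV can still need very different numbers of runs when their distributions have different shapes.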
How many times should we run our experiments?
● Enough for target confidence
● Trend: high CoV → more runs
● Use past data to estimate
Can the facility help?
Can The Facility Help?
● Provide indistinguishable resources
Indistinguishable: Performance results gathered on any server should be representative of the population as a whole.
What is unrepresentative behavior?
Server Z (73 points)
Server X (6 points)
Server Y (75 points)
1326 data points from one HW type
Detecting Unrepresentative Resources
● Kernel two-sample test based on Maximum Mean Discrepancy (MMD)
○ Provides a measure of similarity between two non-parametric distributions
● We compare:
○ Each server to all others of its type
○ … using many dimensions: disk, memory, and network
● Remove servers that are statistically dissimilar from the rest
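To make the MMD idea concrete, here is a minimal sketch that computes the (biased) squared MMD statistic with a Gaussian kernel and scores each server against the pooled data of all others. The function names, the kernel bandwidth, and the two-dimensional feature vectors are assumptions for illustration; a full two-sample test would also need a significance threshold (for example from a permutation test), which is omitted here.

```python
import math

def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(ps, qs):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

def score_servers(servers, bandwidth=1.0):
    """Squared MMD of each server's points versus all other servers' points."""
    scores = {}
    for name, points in servers.items():
        rest = [p for other, pts in servers.items() if other != name
                for p in pts]
        scores[name] = mmd2(points, rest, bandwidth)
    return scores

# Hypothetical normalized (disk, memory) measurements per server.
servers = {
    "X": [(1.0, 1.0), (1.1, 0.9)],
    "Y": [(0.9, 1.0), (1.0, 1.1)],
    "Z": [(5.0, 5.0), (5.1, 4.9)],  # behaves unlike the others
}
scores = score_servers(servers)
outlier = max(scores, key=scores.get)
print("most dissimilar server:", outlier)
```

In practice the measurements would need to be normalized so that no single dimension dominates the kernel distance, and the bandwidth would be chosen from the data rather than fixed.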
Removing Unrepresentative Servers
Can The Facility Help?
● Identify and/or fix anomalous components
Related Work
● Profiling
○ Cloud-scale (distributed) (Kanev et al., 2015, [1]) (Kozyrakis et al., 2010, [2])
○ Single-node (VM) applications (Yadwadkar et al., 2014, [3])
● Quantifying Variability
○ Virtualized clouds (Iosup et al., 2011, [4])
○ Warehouse-scale computers (Dean and Barroso, 2013, [5])
● Other experimentation platforms
○ Baselining performance for Grid’5000 (Nussbaum, 2017, [6])
Summary of the Original Work
● How confident can we be in the correctness of our results?
○ Measure confidence with (non-parametric) CIs to account for unavoidable variability
● How many times should we run our experiments?
○ CONFIRM - Pick a target CI width, estimate number of runs using past performance data
● Can the facility help?
○ Provide statistically indistinguishable resources
● More results, experiences with pitfalls in the paper
Current Efforts
Continuously Collecting Performance Data
As of Apr 9, 2019
877 K 452 K 2.7 M 47 K 24 K
4 M, 1.3GB
CoV = 152%!
Highly Variable CPU Performance
(Clemson, c6320, NPB Multi-Grid solver, Socket 0, DVFS on)
Exploring Correlations
Zooming into Performance Tails
Stationarity
Future Directions
Future Directions
● Randomization of benchmark order
● Change-point detection in gathered measurements
● Additional hardware and architectures
● Expand to other clouds and facilities
● Platform for Open Wireless Data-driven Experimental Research
○ Flux Research Group - University of Utah
○ RENEW - Rice University
● Multiple Deployment areas
○ Encompasses campus, downtown area, and a residential neighborhood
● Fixed and Mobile endpoints
https://www.powderwireless.net/
Summary: IoT and CPS
● Compute/Storage/Networking: Evaluate fine-grained performance variability
● Sensory data: Explore and find patterns in environment variability
● Modeling and Prediction: Establish and enforce QoS for learning variability
Summary: IoT and CPS
● Shapiro-Wilk Test: Check for normality
● Non-Parametric Statistics: Analyze non-Gaussian data
● CONFIRM: Change in CIs and median over repeated measurements
● Kernel Two-Sample Test: How “representative” is a subset?
● Augmented Dickey-Fuller Test: Check for stationarity
References
[1] Kanev et al. Profiling a warehouse-scale computer. ACM SIGARCH Computer Architecture News, 2015.
[2] Kozyrakis et al. Server engineering insights for large-scale online services. IEEE Micro, 2010.
[3] Yadwadkar et al. Predictable and faster jobs using fewer resources. SoCC’14.
[4] Iosup et al. On the performance variability of production cloud services. CCGrid’11.
[5] Dean and Barroso. The tail at scale. Communications of the ACM, 2013.
[6] Nussbaum. Towards trustworthy testbeds thanks to throughout testing. IPDPSW’17.
https://confirm.fyi