TRANSCRIPT
Joel Scheuner · [email protected] · @joe4dev
Estimating Cloud Application Performance Based on Micro-Benchmark Profiling
Joel Scheuner, Philipp Leitner
Supported by
Context: Public Infrastructure-as-a-Service Clouds
[Figure: IaaS vs. PaaS vs. SaaS stacks. Each stack consists of Applications, Data, Runtime, Middleware, OS, Virtualization, Servers, Storage, and Networking, split into user-managed and provider-managed layers; IaaS leaves the most layers user-managed, SaaS the fewest.]
2018-07-02 IEEE CLOUD'18 2
2018-07-02 IEEE CLOUD'18 3
Motivation: Capacity Planning in IaaS Clouds
What cloud provider should I choose?
https://www.cloudorado.com
[Chart: number of instance types offered per year, 2006-2017, growing from a handful in 2006 to more than 100 in 2017.]
2018-07-02 IEEE CLOUD'18 4
Motivation: Capacity Planning in IaaS Clouds
What cloud service (i.e., instance type) should I choose?
t2.nano: 0.05-1 vCPU, 0.5 GB RAM, $0.006/h
x1e.32xlarge: 128 vCPUs, 3904 GB RAM, $26.688/h
→ Impractical to test all instance types
2018-07-02 IEEE CLOUD'18 5
Topic: Performance Benchmarking in the Cloud
“The instance type itself is a very major tunable parameter”
@brendangregg, re:Invent’17, https://youtu.be/89fYOo1V2pA?t=5m4s
2018-07-02 IEEE CLOUD'18 6
Background
[Figure: benchmark spectrum. Micro benchmarks are generic, artificial, and resource-specific, each targeting a single resource such as CPU, memory, I/O, or network. Application benchmarks are specific, real-world, and resource-heterogeneous, measuring overall performance (e.g., response time). The two differ in domain, workload, and resource usage.]
Problem: Isolation, Reproducibility of Execution
[Figure: the same micro vs. application benchmark spectrum as on the previous slide.]
Question
[Figure: micro vs. application benchmark spectrum with the question: how relevant are micro benchmark results (CPU, memory, I/O, network) for overall application performance (e.g., response time)?]
Research Questions
PRE – Performance Variability: Does the performance of equally configured cloud instances vary relevantly?
RQ1 – Estimation Accuracy: How accurately can a set of micro benchmarks estimate application performance?
RQ2 – Micro Benchmark Selection: Which subset of micro benchmarks estimates application performance most accurately?
2018-07-02 IEEE CLOUD'18 10
Idea
Evaluate a Prediction Model
[Figure: micro benchmarks (CPU, memory, I/O, network) profile each VM/instance type; the profiles feed a model that estimates application benchmark performance (e.g., response time); the model is evaluated across N VMs with respect to performance and cost.]
2018-07-02 IEEE CLOUD'18 11
Methodology: Benchmark Design
2018-07-02 IEEE CLOUD'18 12
Micro Benchmarks: broad resource coverage and specific resource testing (CPU, memory, I/O, network)

CPU
• sysbench/cpu-single-thread
• sysbench/cpu-multi-thread
• stressng/cpu-callfunc
• stressng/cpu-double
• stressng/cpu-euler
• stressng/cpu-fft
• stressng/cpu-fibonacci
• stressng/cpu-int64
• stressng/cpu-loop
• stressng/cpu-matrixprod

Memory
• sysbench/memory-4k-block-size
• sysbench/memory-1m-block-size

I/O
• [file I/O] sysbench/fileio-1m-seq-write
• [file I/O] sysbench/fileio-4k-rand-read
• [disk I/O] fio/4k-seq-write
• [disk I/O] fio/8k-rand-read

Network
• iperf/single-thread-bandwidth
• iperf/multi-thread-bandwidth
• stressng/network-epoll
• stressng/network-icmp
• stressng/network-sockfd
• stressng/network-udp

Software (OS)
• sysbench/mutex
• sysbench/thread-lock-1
• sysbench/thread-lock-128
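To make the suite concrete: a minimal, hypothetical runner sketch in Python. The commands, thread counts, block sizes, and timeouts below are illustrative assumptions, not the exact parameters used in the study (those are defined in the Cloud WorkBench configurations).

```python
import subprocess
import time

# Hypothetical command lines for a few of the benchmarks listed above.
# Flags follow the sysbench 1.0 and stress-ng CLIs; exact parameters are assumptions.
MICRO_BENCHMARKS = {
    "sysbench/cpu-single-thread": ["sysbench", "cpu", "--threads=1", "run"],
    "sysbench/cpu-multi-thread": ["sysbench", "cpu", "--threads=4", "run"],
    "sysbench/memory-4k-block-size": ["sysbench", "memory", "--memory-block-size=4K", "run"],
    "stressng/cpu-matrixprod": ["stress-ng", "--cpu", "1", "--cpu-method", "matrixprod",
                                "--timeout", "60s", "--metrics-brief"],
}

def run_benchmark(name, cmd):
    """Run one micro benchmark, record its wall-clock duration and raw output."""
    start = time.time()
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return {"name": name, "duration_s": time.time() - start, "stdout": proc.stdout}

if __name__ == "__main__":
    for name, cmd in MICRO_BENCHMARKS.items():
        result = run_benchmark(name, cmd)
        print(f"{result['name']}: {result['duration_s']:.1f}s")
```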
2018-07-02 IEEE CLOUD'18 13
Application Benchmarks: overall performance (e.g., response time)

Molecular Dynamics Simulation (MDSim)

WordPress Benchmark (WPBench)
Multiple short blogging-session scenarios (read, search, comment)
[Chart: WPBench load profile, number of concurrent threads (0-100) over an elapsed time of about 8 minutes.]
2018-07-02 IEEE CLOUD'18 14
Methodology: Benchmark Design → Benchmark Execution
A Cloud Benchmark Suite Combining Micro and Applications Benchmarks. QUDOS@ICPE’18, Scheuner and Leitner
2018-07-02 IEEE CLOUD'18 15
Execution Methodology
[Slide shows an excerpt from Abedi and Brecht [1] on execution methodologies for highly variable cloud environments: Single Trial, Multiple Consecutive Trials, Multiple Interleaved Trials, and Randomized Multiple Interleaved Trials (RMIT). RMIT randomly reorders the alternatives in every round so that periodic changes in the environment do not systematically favor one alternative, which makes experiments repeatable despite high performance variability.]
30 benchmark scenarios · 3 trials · ~2-3h runtime
[1] A. Abedi and T. Brecht. Conducting Repeatable Experiments in Highly Variable Cloud Computing Environments. ICPE’17
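As an illustration of the RMIT ordering from [1], here is a minimal sketch (not the actual Cloud WorkBench scheduler; the benchmark names and number of rounds are placeholders):

```python
import random

def rmit_schedule(alternatives, rounds):
    """Randomized Multiple Interleaved Trials: each round runs every
    alternative exactly once, in a fresh random order, so that periodic
    interference does not systematically bias any single alternative."""
    schedule = []
    for _ in range(rounds):
        order = list(alternatives)   # copy, leave the input untouched
        random.shuffle(order)        # new random ordering per round
        schedule.append(order)
    return schedule

# Example: 3 trials (rounds) over a few benchmarks from this study.
print(rmit_schedule(["sysbench/cpu-multi-thread",
                     "fio/8k-rand-read",
                     "iperf/single-thread-bandwidth"], rounds=3))
```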
2018-07-02 IEEE CLOUD'18 16
Benchmark Manager: Cloud WorkBench (CWB)
Tool for scheduling cloud experiments
GitHub: sealuzh/cloud-workbench
Cloud WorkBench – Infrastructure-as-Code Based Cloud Benchmarking. CloudCom’14, Scheuner, Leitner, Cito, and Gall
Cloud WorkBench: Benchmarking IaaS Providers based on Infrastructure-as-Code. Demo@WWW’15, Scheuner, Cito, Leitner, and Gall
2018-07-02 IEEE CLOUD'18 17
Methodology: Benchmark Design → Benchmark Execution → Data Pre-Processing → Data Analysis
[Boxplot: Relative Standard Deviation (RSD) [%] of benchmark results per configuration (instance type and region): m1.small (eu), m1.small (us), m3.medium (eu), m3.medium (us), m3.large (eu); the labeled mean RSD values are 4.41, 4.3, 3.16, 3.32, and 6.83%.]
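For reference, the dispersion metric plotted above, assuming the conventional definition of the relative standard deviation (coefficient of variation) over the n measurements of one configuration:

```latex
\mathrm{RSD} = \frac{\sigma}{\bar{x}} \times 100\%,
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

where x_i denotes the result of a benchmark on the i-th equally configured instance.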
A Cloud Benchmark Suite Combining Micro and Applications Benchmarks. QUDOS@ICPE’18, Scheuner and Leitner
Estimating Cloud Application Performance Based on Micro Benchmark Profiling. CLOUD’18, Scheuner and Leitner
2018-07-02 IEEE CLOUD'18 18
Performance Data Set

Instance Type   vCPU   ECU*   RAM [GiB]   Virtualization   Network Performance
m1.small        1      1      1.7         PV               Low
m1.medium       1      2      3.75        PV               Moderate
m3.medium       1      3      3.75        PV/HVM           Moderate
m1.large        2      4      7.5         PV               Moderate
m3.large        2      6.5    7.5         HVM              Moderate
m4.large        2      6.5    8.0         HVM              Moderate
c3.large        2      7      3.75        HVM              Moderate
c4.large        2      8      3.75        HVM              Moderate
c3.xlarge       4      14     7.5         HVM              Moderate
c4.xlarge       4      16     7.5         HVM              High
c1.xlarge       8      20     7           PV               High

* ECU := Elastic Compute Unit (i.e., Amazon’s metric for CPU performance)

PRE (performance variability) covers m1.small and m3.medium in eu + us and m3.large in eu; RQ1+2 cover all instance types in eu.
>240 Virtual Machines (VMs) → 3 Iterations → ~750 VM hours
>60’000 Measurements (258 per instance)
2018-07-02 IEEE CLOUD'18 19
PRE – Performance Variability: Does the performance of equally configured cloud instances vary relevantly?
Results
[Boxplot: Relative Standard Deviation (RSD) [%] per configuration (m1.small (eu), m1.small (us), m3.medium (eu), m3.medium (us), m3.large (eu)), broken down by benchmark (Threads, Latency, Fileio Random, Network, Fileio Seq.) with the mean marked. Mean RSD values: 4.41, 4.3, 3.16, 3.32, and 4.14%; two outliers at 54% and 56%.]
2018-07-02 IEEE CLOUD'18 20
RQ1 – Estimation Accuracy: How accurately can a set of micro benchmarks estimate application performance?
Approach
[Figure: for each of the 12 instance types (Instance Type 1 = m1.small, ..., Instance Type 12 = c1.xlarge), micro benchmark results (micro1, micro2, ..., microN) and application benchmark results (app1, app2) are collected; a linear regression model estimates an application metric (e.g., app1) from a micro benchmark metric (e.g., micro1).]
Forward feature selection to optimize relative error
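A minimal sketch of this approach (illustrative only: the scikit-learn usage, feature-count limit, and error metric below are assumptions, not the paper’s exact setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def relative_error(y_true, y_pred):
    # Mean absolute relative error in percent (assumed error metric).
    return float(np.mean(np.abs(y_pred - y_true) / y_true)) * 100.0

def forward_selection(X, y, max_features=3):
    """Greedily add the micro benchmark (column of X) that most reduces the
    relative error of a linear regression estimating the application metric y."""
    selected, remaining = [], list(range(X.shape[1]))
    best_err = float("inf")
    while remaining and len(selected) < max_features:
        candidates = []
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            candidates.append((relative_error(y, model.predict(X[:, cols])), j))
        err, j = min(candidates)
        if err >= best_err:
            break                     # adding more features no longer helps
        best_err, selected = err, selected + [j]
        remaining.remove(j)
    return selected, best_err

# X: one row per instance type, one column per micro benchmark result;
# y: measured application performance (e.g., WPBench read response time).
# A real evaluation would estimate the error on held-out instance types.
```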
2018-07-02 IEEE CLOUD'18 21
RQ1 – Estimation Accuracy: How accurately can a set of micro benchmarks estimate application performance?
Results
[Scatter plot: WPBench Read response time [ms] (0-2000) versus Sysbench CPU Multi Thread duration [s] (25-100), one point per instance type (m1.small, m3.medium (pv), m3.medium (hvm), m1.medium, m3.large, m1.large, c3.large, m4.large, c4.large, c3.xlarge, c4.xlarge, c1.xlarge), with training and test groups marked.]
Relative Error (RE) = 12.5%, R² = 99.2%
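Assuming the standard definitions of the two reported quality measures (the paper’s exact formulation may differ in detail):

```latex
\mathrm{RE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\lvert \hat{y}_i - y_i \rvert}{y_i}\times 100\%,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

where y_i is the measured application performance on instance type i and ŷ_i the value estimated from its micro benchmark profile.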
RQ2 – Micro Benchmark Selection: Which subset of micro benchmarks estimates application performance most accurately?
8/25/18 Chalmers 22
Results
Micro Benchmark                   Relative Error [%]
Sysbench – CPU Multi Thread       12
Sysbench – CPU Single Thread      454

Baseline                          Relative Error [%]
vCPUs                             616
ECU*                              359
Cost                              663

* ECU: Amazon’s metric for CPU performance
2018-07-02 IEEE CLOUD'18 23
RQ – Implications
Selected micro benchmarks are suitable for estimating application performance
Benchmarks cannot be used interchangeably → configuration is important
Baseline metrics vCPU and ECU are insufficient
2018-07-02 IEEE CLOUD'18 24
Threats to Validity
Construct Validity: almost 100% of benchmarking reports are wrong because benchmarking is “very very error-prone” 1 [senior performance architect @ Netflix]
→ Guidelines, rationalization, open source
1 https://www.youtube.com/watch?v=vm1GJMp0QN4&feature=youtu.be&t=18m29s
Internal Validity: the extent to which cloud environmental factors, such as multi-tenancy, evolving infrastructure, or dynamic resource limits, affect the performance level of a VM instance
→ Variability pre-study (PRE), stop interfering processes
External Validity (Generalizability): Other cloud providers? Larger instance types? Other application domains?
→ Future work
Reproducibility: the extent to which the methodology and analysis are repeatable at any time by anyone and thereby lead to the same conclusions, despite the dynamic cloud environment
→ Fully automated execution, open source
2018-07-02 IEEE CLOUD'18 25
Related Work

Application Performance Profiling
• System-level resource monitoring [1,2]
• Compiler-level program similarity [3]

Application Performance Prediction
• Trace and replay with Cloud-Prophet [4,5]
• Bayesian cloud configuration refinement for big data analytics [6]

[1] Athanasia Evangelinou, Michele Ciavotta, Danilo Ardagna, Aliki Kopaneli, George Kousiouris, and Theodora Varvarigou. Enterprise applications cloud rightsizing through a joint benchmarking and optimization approach. Future Generation Computer Systems, 2016.
[2] Mauro Canuto, Raimon Bosch, Mario Macias, and Jordi Guitart. A methodology for full-system power modeling in heterogeneous data centers. In Proceedings of the 9th International Conference on Utility and Cloud Computing (UCC ’16), 2016.
[3] Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K. John, and Koen De Bosschere. Performance prediction based on inherent program similarity. In PACT ’06, 2006.
[4] Ang Li, Xuanran Zong, Ming Zhang, Srikanth Kandula, and Xiaowei Yang. Cloud-Prophet: predicting web application performance in the cloud. ACM SIGCOMM Poster, 2011.
[5] Ang Li, Xuanran Zong, Srikanth Kandula, Xiaowei Yang, and Ming Zhang. Cloud-Prophet: Towards application performance prediction in cloud. In Proceedings of the ACM SIGCOMM 2011 Conference (SIGCOMM ’11), 2011.
[6] Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.
2018-07-02 IEEE CLOUD'18 26
Conclusion
Methodology: Benchmark Design → Benchmark Execution → Data Pre-Processing → Data Analysis
[Boxplot: Relative Standard Deviation (RSD) [%] per configuration (instance type and region), as shown earlier; mean RSD values 4.41, 4.3, 3.16, 3.32, and 6.83%.]
A Cloud Benchmark Suite Combining Micro and Applications Benchmarks. QUDOS@ICPE’18, Scheuner and Leitner
Estimating Cloud Application Performance Based on Micro Benchmark Profiling. CLOUD’18, Scheuner and Leitner
RQ – Implications
Selected micro benchmarks are suitable for estimating application performance
Benchmarks cannot be used interchangeably → configuration is important
Baseline metrics vCPU and ECU are insufficient
! [email protected]"# joe4dev
RQ1 – Estimation Accuracy: How accurately can a set of micro benchmarks estimate application performance?
Results
[Scatter plot: WPBench Read response time [ms] versus Sysbench CPU Multi Thread duration [s], one point per instance type, training and test groups marked, as shown earlier.]
Relative Error (RE) = 12.5%, R² = 99.2%
RQ2 – Micro Benchmark Selection: Which subset of micro benchmarks estimates application performance most accurately?
Results

Micro Benchmark                   Relative Error [%]
Sysbench – CPU Multi Thread       12
Sysbench – CPU Single Thread      454

Baseline                          Relative Error [%]
vCPUs                             616
ECU*                              359
Cost                              663

* ECU: Amazon’s metric for CPU performance
[Chart: number of instance types offered per year, 2006-2017, as shown earlier.]
Motivation: Capacity Planning in IaaS Clouds
What cloud service (i.e., instance type) should I choose?
t2.nano: 0.05-1 vCPU, 0.5 GB RAM, $0.006/h
x1e.32xlarge: 128 vCPUs, 3904 GB RAM, $26.688/h
→ Impractical to test all instance types