
D3.5 EUBra-BIGSEA QoS infrastructure services final version

Author(s)

Jussara Almeida (UFMG), Danilo Ardagna (POLIMI), Igor Ataíde

(UFCG), Enrico Barbierato (POLIMI), Ignacio Blanquer (UPV),

Sergio López (UPV), Túlio Braga (UFMG), Andrey Brito (UFCG),

Ana Paula Couto (UFMG), Athanasia Evangelinou (POLIMI), Iury

Ferreira (UFCG), Armstrong Goes (UFCG), Marco Gribaudo

(POLIMI), Raffaela Mirandola (POLIMI), Fábio Morais (UFCG)

Status Final

Version v1.1

Date 04/10/2017

Dissemination Level

X PU: Public

PP: Restricted to other programme participants (including the Commission)

RE: Restricted to a group specified by the consortium (including the Commission)

CO: Confidential, only for members of the consortium (including the Commission)

Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT.

This document describes the final version of the infrastructure services that enable the EUBra-BIGSEA ecosystem. These services include: (i) tools for predicting big data application performance, (ii) mechanisms for horizontal and vertical elasticity, and (iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of running big data applications, as well as load balancing of the infrastructure.

Document identifier: EUBRA BIGSEA -WP3-D3.5

Deliverable lead POLIMI

Related work package WP3

Author(s) Danilo Ardagna (POLIMI), Ignacio Blanquer (UPV), Andrey Brito (UFCG)

Contributor(s) Jussara Almeida (UFMG), Igor Ataíde (UFCG), Enrico Barbierato (POLIMI), Ignacio

Blanquer (UPV), Túlio Braga (UFMG), Ana Paula Couto (UFMG), Athanasia

Evangelinou (POLIMI), Iury Ferreira (UFCG), Armstrong Goes (UFCG), Marco

Gribaudo (POLIMI), Raffaela Mirandola (POLIMI), Fábio Morais (UFCG)

Due date 30/09/2017

Actual submission date

Reviewed by Sandro Fiore (CMCC) and Dorgival Guedes (UFMG)

Approved by PMB

Start date of Project 01/01/2016

Duration 24 months

Keywords Quality of Service, Cloud services, Monitoring

EUBra-BIGSEA is funded by the European Commission under the

Cooperation Programme, Horizon 2020 grant agreement No 690116.

This project is the result of the 3rd BR-EU Coordinated Call in Information Technologies.


Versioning and contribution history

Version | Date | Authors | Notes
0.1 | 12/07/2017 | Andrey Brito (UFCG), Danilo Ardagna (POLIMI), Ignacio Blanquer (UPV) | TOC, initial version
0.2 | 29/08/2017 | Athanasia Evangelinou (POLIMI) | Initial version of Sections 2 and 4.
0.3 | 29/08/2017 | Athanasia Evangelinou (POLIMI), Enrico Barbierato (POLIMI) | Initial version of Section 11; added Appendixes A and B.
0.4 | 11/09/2017 | Danilo Ardagna (POLIMI), Athanasia Evangelinou (POLIMI), Raffaela Mirandola (POLIMI), Marco Gribaudo (POLIMI), Túlio Braga (UFMG), Jussara Almeida (UFMG) | Added Sections 3 and 10; Section 11 completed.
0.5 | 13/09/2017 | Danilo Ardagna (POLIMI), Enrico Barbierato (POLIMI) | Initial version of Section 12.
0.6 | 14/09/2017 | Danilo Ardagna (POLIMI), Athanasia Evangelinou (POLIMI), Túlio Braga (UFMG), Ignacio Blanquer (UPV) | Added Section 8; Appendix B completed.
0.7 | 16/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Started consolidation; Section 12 completed.
0.8 | 17/09/2017 | Athanasia Evangelinou (POLIMI), Enrico Barbierato (POLIMI), Igor Ataíde (UFCG), Fábio Morais (UFCG), Armstrong Goes (UFCG), Iury Ferreira (UFCG) | Section 10 completed; added Sections 5, 6, 7, 9 and Appendixes C, D, E, F, G.
0.9 | 26/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Added Sections 1 and 13; finalising consolidation.
1.0 | 27/09/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Deliverable sent to reviewers.
1.1 | 04/10/2017 | Danilo Ardagna (POLIMI), Andrey Brito (UFCG), Ignacio Blanquer (UPV) | Review comments addressed and final version.

Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit

https://creativecommons.org/licenses/by/4.0.

Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent

the views expressed by the European Commission or its services.


While the information contained in the document is believed to be accurate, the author(s) or any other participant in the

EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied

warranties of merchantability and fitness for a particular purpose.

Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable

in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.

Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their

officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from

any information advice or inaccuracy or omission herein.


Table of Contents

1. Introduction
1.1. Scope of the Document
1.2. Target Audience
1.3. Structure
2. WP3 Architecture summary
3. OPT_JR implementation
3.1. Problem definition
3.2. Algorithms
4. Hybrid Machine Learning Module Implementation
4.1. Configuration and Deployment
4.2. Configuration and dataset preparation
4.3. Description of scripts and functions
4.3.1. Extrapolation on few and many cores and Interpolation capabilities
4.4. Package Information and Code Structure
5. Pro-active policies implementation
5.1. Vertical Scaling Pro-active Policies
5.1.1. Contextualization
5.2. Pro-active Policies for OpenNebula
5.2.1. Spark-Mesos Plugin
5.2.2. OpenNebula Actuator
5.3. OpenStack Opportunistic Plugin
5.3.1. Opportunistic Service
5.3.2. Opportunism and Load Balancer
6. Load balancing implementation
6.1. Architecture
6.2. Heuristics
6.2.1. Balance Instances
6.2.2. CPU-Cap Aware
6.2.3. Sysbench Performance+CPU Cap
6.3. Integration with Opportunistic Services
7. Pro-active policies validation
7.1. Vertical Scaling Pro-active Policies
7.1.1. Infrastructure
7.1.2. Experiment design
Controller configurations
Application deadline
Metrics collection method
7.1.3. Analysis
7.2. Vertical Scaling Pro-active Policies
7.2.1. Infrastructure and experiment design
Actuation configurations
7.2.2. Analysis
8. Frameworks experiments validation
8.1. Use Cases
8.2. System Architecture
8.2.1. Job submit and execution
8.2.2. Monitoring the application
8.2.3. Decision making
8.2.4. Elasticity
8.3. Results of the experiments
8.3.1. Chronos Application
8.3.2. Marathon Application
9. Load balancing validation
9.1. Infrastructure
9.2. Experiment Design
9.3. Analysis
10. Performance Prediction service validation
10.1. Experimental Settings
10.2. dagSim Validation
10.3. Lundstrom Validation
11. HML validation
11.1. Experiments setting
12. Optimization service validation
12.1. OPT_IC validation on BULMA application
12.2. OPT_JR validation
13. Conclusions
GLOSSARY
Annex
Appendix A - Hybrid Machine learning module usage manual
A.1 Deployment and Installation instructions
A.2 User Manual
Appendix B - Performance Prediction and Optimization service enhancements implemented in the new release
Appendix B.1 Performance prediction and Optimization (OPT_IC) service new release
WS_2P Enhancement: Prediction of the residual execution time of an application
Appendix B.2 WS_OPT_IC Enhancement: Revise the optimal number of cores and VMs while an application is running
Appendix B.3 Performance and optimization service tutorial
Technical tutorial
Database Creation
Creation Database Container (Bash Script)
Creation Docker Container Services
Useful Commands
Operational guide
Configuration file
Data Log files
Spark log parser
dagSim
OPT_IC
OPT_JR
Appendix C - Controller Service Guide
Installation and configuration
Creating a new controller plugin
Creating a new actuator plugin
Appendix D - Load Balancer Guide
Installation
Configuration
Heuristics Configuration
BalanceInstancesOS heuristic configuration
CPUCapAware heuristic configuration
SysbenchPerfCPUCap heuristic configuration
Optimizer Configuration
Execution
Creating a new Heuristic
Creating a new Optimizer Plugin
Appendix E - Monitoring-Hosts Daemon
Installation
Configuration
Execution
Appendix F - EUBra-BIGSEA Applications Submission Guide
Appendix G - Load Balancer Results
Scenario 2 - Initial Cap 25%
Scenario 3 - Initial Cap 50%


Executive Summary

This document describes the final release of the infrastructure services of the EUBra-BIGSEA platform.

This deliverable puts together results from previous deliverables, i.e., D3.1 QoS Monitoring System

Architecture, D3.2 Big Data Application Performance models and run-time Optimization Policies, D3.3

EUBra-BIGSEA QoS infrastructure services initial version. Moreover, this document is an update of the

deliverable D3.4 EUBra-BIGSEA QoS infrastructure services intermediate version. For the sake of the

reader’s convenience, the context of the tools developed within the workpackage and their

interrelationship are summarized and updated but we did our best to minimize the overlap between the

two documents.

In its final version, the EUBra-BIGSEA platform enables users to provide data processing applications that

will be profiled and modelled in a pre-production phase and, then, be deployed for either asynchronous

execution (e.g., driven by external triggers such as the availability of new data) or for periodic execution.

In both cases, the platform will be able to estimate the initial amount of resources needed to run the

application within the specified deadlines and will be able to trigger adaptation of the running

infrastructure to match the resources needed in order to satisfy the deadlines.

This document targets developers both internal and external to the project. Internally, the document

will help partners from connected workpackages to fully use the infrastructure services and leverage

them to provide higher-level abstractions. Externally, the open-source tools available in the EUBra-

BIGSEA Github repositories enable developers outside of the consortium to configure their own clusters

and execute applications. In addition, this document guides developers in customization activities such

as the usage of the service platform, including pro-active policies and optimization services, with cloud

infrastructures different to the ones considered by the EUBra-BIGSEA partners.

Finally, the main advances in this released version of the platform can be summarized as follows: (i) new

release of the hybrid machine learning module, performance prediction service and optimization

services, (ii) pro-active policies module enhancement, (iii) final implementation of the load balancing

module, (iv) extended validation of all workpackage components, including vertical elasticity, (v) initial

validation of the contextualization service, monitoring, pro-active policies and optimization service on

the BULMA case study application developed within WP7, (vi) integration of the core architectural

components.


1. Introduction

1.1. Scope of the Document

This document describes the final version of the infrastructure services that enable the EUBra-BIGSEA ecosystem. These services include: (i) tools for predicting Big Data application performance, (ii) mechanisms for horizontal and vertical elasticity, and (iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of running big data applications, as well as load balancing of the infrastructure. The results of the initial validation performed by considering the execution of the case study applications developed within WP7 are also reported.

1.2. Target Audience

This document has two main targets. One target is internal: the document serves as documentation of the WP3 activities on performance prediction, pro-active and optimization-based policies, service contextualization, and on mechanisms for scalability. With this target in mind, this document describes the software architecture and the mutual interrelationship of WP3 services. Therefore, it aids the EUBra-BIGSEA team of technical experts involved in the development of the application layers within WP4 and WP5. The second target is the general cloud community: a full description of all the open-source components is provided, as well as a reference for the installation of the cloud services that enable the different features developed in the project.

1.3. Structure

Section 2 summarises and revises the software architecture of EUBra-BIGSEA. Section 3 describes the final implementation of the optimization module minimising the tardiness of soft-deadline applications under heavy load. Section 4 provides an update, with respect to deliverable D3.2, of the hybrid machine learning tools developed for predicting the execution time of big data applications. Sections 5 and 6 describe the implementation of the pro-active policies module and of load balancing, respectively. Sections 7 to 12 are devoted to the initial validation of all the platform services, from the pro-active policies tested on OpenNebula to the optimization-based policy module tested also on the BULMA application developed in the context of WP7. Conclusions are drawn in Section 13.

Additionally, the deliverable includes a glossary and several annexes describing tool tutorials and installation guidelines and summarising the main changes implemented in the platform components with respect to the release reported in deliverable D3.4.


2. WP3 Architecture summary

EUBra-BIGSEA has developed mechanisms and policies to deal with resource provision, reduction of Big

Data application execution costs and Quality of Service (QoS) guarantees. EUBra-BIGSEA proposes

innovative and intelligent modules, which address challenges in exploiting the best use of infrastructure

resources and minimizing costs by monitoring and allocating resources dynamically to meet deadlines.

Figure 1 shows the overall architecture and the execution workflow, which enables a flexible interaction

among the different components.

Figure 1 - EUBra-BIGSEA platform architecture

The reference architecture consists of a monitoring system, based on Monasca, which gathers

information related to metrics (such as batch queue length, network load, CPU utilization, application

performance, application progress, container resources) and two software layers implementing dynamic

adaptation policies. The first adaptation policies module is responsible for specifying and enacting high-

level pro-active rules to trigger infrastructure adaptation and load balancing while the other

implements advanced adaptation policies based on performance modelling and optimization methods

to automatically adapt the system configuration, provide QoS guarantees and reduce resource usage

costs. The main components of the EUBra-BIGSEA architecture include: (i) the Broker API to submit job


requests and to specify additional information concerning application configuration and deadlines, (ii)

the Performance Prediction Service, which - given the current configuration - estimates application

execution time, (iii) the Optimization Service, which - given a deadline - estimates, at deployment time,

the initial configuration that minimizes cost (OPT_IC component), or minimizes tardiness of soft-

deadline applications under heavy loads (OPT_JR component), (iv) the Pro-active Policies module, which

provides the bridge between the users' job submission and the execution platforms, and adds or

removes resources based on triggers associated with thresholds in the monitoring infrastructure, and

finally (v) the contextualization and configuration service, which deploys applications starting from

TOSCA blueprints and actuates the Pro-active Policies module decisions. The overall architecture ensures

at any point in time the availability of the appropriate number of resources to deal with the existing

workload (additional details are available in deliverable D3.4).

Initially, users submit an application with a specific deadline through the broker API, which triggers the

execution of the Optimization Service. In turn, the Machine Learning sub-component is used by the

optimization policies module to estimate a Big Data application execution time, and identify an initial

solution in terms of the number of VMs. The latter is refined by a local search heuristic algorithm, which

is based on the performance prediction service in order to identify the final solution to be deployed. The

initial deployment is obtained by relying on the configuration and contextualization service.

During the application execution, the QoS monitoring system is responsible for gathering information

periodically and verifying if there is enough capacity for its execution. In particular, the pro-active policies

module interacts with the performance prediction service, requesting, given the current resource

allocation, an estimate for the application's finishing time. In case of violation or over-provisioning, the

pro-active policies module enacts high-level rules and triggers the infrastructure adaptation by

horizontally or vertically scaling the resource containers currently deployed and balancing the load

among the physical servers.
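To make this adaptation workflow concrete, the following Python sketch illustrates how a pro-active policy could periodically compare the predicted finishing time of a running application with its deadline and scale the allocation accordingly. The monitor, predictor and actuator objects and their method names (is_finished, current_allocation, predict_finish_time, scale) are illustrative placeholders, not the actual EUBra-BIGSEA interfaces.

import time

CHECK_PERIOD = 30    # seconds between monitoring checks (illustrative value)
TOLERANCE = 0.05     # relative slack around the deadline before acting

def adaptation_loop(app_id, deadline, monitor, predictor, actuator):
    # Illustrative pro-active loop: while the application runs, compare its
    # predicted finish time with the deadline and scale resources accordingly.
    while not monitor.is_finished(app_id):
        allocation = monitor.current_allocation(app_id)      # e.g. number of cores
        predicted_end = predictor.predict_finish_time(app_id, allocation)
        if predicted_end > deadline * (1 + TOLERANCE):
            actuator.scale(app_id, allocation + 1)            # predicted violation: add resources
        elif predicted_end < deadline * (1 - TOLERANCE):
            actuator.scale(app_id, max(1, allocation - 1))    # over-provisioning: release resources
        time.sleep(CHECK_PERIOD)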

There are four execution plugins for the Broker, each one adequate for a specific application and

infrastructure context. Two of these have already been presented: OpenStack Generic, which

starts virtual machines on OpenStack, executes all the required tasks for the job and has a log-based

monitoring; and Spark-Sahara, which uses Sahara (Data Processing as a Service for OpenStack) to deploy

clusters with predefined configurations and executes Spark jobs. There are two new implementations

made to synchronize the efforts among WP3 teams: Spark-Mesos and Marathon-Chronos. These new

plugins will be detailed in Section 5.


3. OPT_JR implementation

As discussed in EUBra-BIGSEA deliverable D3.4, as part of task T3.5, EUBra-BIGSEA implements the

OPT_IC optimization component, capable of calculating a configuration, in terms of allocated VMs,

which is minimal from a cost perspective and fulfils a given deadline.

When a new application is submitted, two cases are possible: either the system capacity can

accommodate the request and the execution starts immediately according to the number of VMs

identified by OPT_IC, or the new application needs additional resources not available currently in the

cluster. In the latter case, the goal is to update the cluster resources allocation to minimize the

(weighted) sum of the tardiness of soft-deadline applications (as a result, the submitted application can

be executed if and only if the soft-deadline applications can lend resources).

This section is structured as follows: in Section 3.1, the definition of the minimum tardiness optimization

problem is recalled from deliverable D3.2. In Section 3.2, the algorithms used to compute the new VMs

assignment for the soft-deadline applications are discussed. The validation of the OPT_JR component is

reported in Section 12.2.

3.1. Problem definition

The OPT_JR optimization module considers the set of entities denoted in Table 1.

The goal is to minimize the weighted tardiness of soft-deadline applications. If we denote by E[Ri] the

average execution time of application i, the tardiness is defined as E[Ri] - Di if E[Ri] > Di, and 0 otherwise. The

overall objective is given by:

min Σ_{i ∈ Ad} wi · max(E[Ri] - Di, 0)    (1)

Note that OPT_JR is used when the deadline Di cannot be met when the system is under heavy load; as a

result, for every application the tardiness is strictly positive.


Table 1 - Description of the terms used

Entity Description

Ad The set of indexes of soft deadline applications

Machine learning model application i profile

Di The application deadline

Mi Total RAM memory available at the YARN Node Manager for application i

mi Container size (in terms of RAM) for application i

Vi Total number of vCPUs available at the YARN Node Manager for application i

vi Container size (in terms of vCPUs) for application i

The initial number of cores estimated for application i

wi Application’s weight

N Available spare capacity for soft-deadline applications (total number of cores)

As discussed in Section 7 of EUBra-BIGSEA deliverable D3.2, relying on ML models, the average execution time of big data applications can be approximated by:

E[Ri] ≈ χc,i / ni + χ0,i    (2)


where ni denotes the number of cores allocated to application i. If N denotes the total number of cores

available, OPT_JR minimizes Equation (1) under the constraint Σ_{i ∈ Ad} ni ≤ N.

By exploiting the Karush-Kuhn-Tucker conditions, at the global optimum solution, for every application the number of cores which minimizes the tardiness is given by:

(3)
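Assuming the hyperbolic execution-time model of Equation (2) and an active capacity constraint (the heavy-load regime in which OPT_JR operates), the Karush-Kuhn-Tucker conditions lead to the square-root allocation sketched in the following LaTeX fragment; this is an illustrative reconstruction of the derivation, not a verbatim copy of Equation (3).

\begin{align*}
\min_{n_i}\ & \sum_{i \in A_d} w_i \left( \frac{\chi_{c,i}}{n_i} + \chi_{0,i} - D_i \right)
  \quad \text{s.t.} \quad \sum_{i \in A_d} n_i \le N
  && \text{(objective and capacity constraint)} \\
& -\frac{w_i\,\chi_{c,i}}{n_i^{2}} + \lambda = 0
  \;\Rightarrow\; n_i = \sqrt{\frac{w_i\,\chi_{c,i}}{\lambda}}
  && \text{(stationarity, multiplier } \lambda \text{)} \\
& n_i = N \, \frac{\sqrt{w_i\,\chi_{c,i}}}{\sum_{j \in A_d} \sqrt{w_j\,\chi_{c,j}}}
  && \text{(eliminating } \lambda \text{ via the active constraint)}
\end{align*}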

3.2. Algorithms

Since (2) and (3) are rough approximations, we only exploit them to quickly obtain an initial estimate for

the optimal solution. Afterwards, in the second step of our optimization heuristic, we take this

configuration as the starting point of a design space exploration via a search technique that evaluates

application performance through more accurate performance methods and relies on dagSim or the

Lundstrom tool to identify the final solution. The solution implemented is described in Figures 2 and 3.

During the first step, for each application i) an initial solution is computed by referring to Equation (3),

then ii) a bound is calculated (step 7).

The bound is the minimum number of cores such that E[Ri] < Di, so that the application tardiness

becomes 0. The procedure to compute the bound value is a simple line search, which considers as

starting point the number of cores identified by the OPT_IC tool at deployment time. The OPT_JR

optimization module retrieves this value from a DB table called OPTMIZER_CONFIGURATION_TABLE (see

deliverable D3.4). The process of calculating the bounds considers two cases: either the execution time

returned by the performance predictor adopted (i.e., Lundstrom or dagSim) for the current number of


cores exceeds the requested deadline or falls below it. As a result, the number of cores within the given

configuration is increased or decreased by a certain step (currently equal to the value of the Vi

parameter) and the predictor is invoked again. The process is iterated as many times as required,

reducing the distance from the deadline until the search stops. The predictor component must also consider the

so-called rescaling issue, i.e., given the current stage in execution, the tool rescales the residual

application execution time, provided by the performance prediction tool, considering the ratio between

the elapsed time of the application and the current stage end time (estimated again by the performance

prediction tool).

The rescaled time remaining for the application is computed according to the following formula:

RescaledTimeRemaining = EstimatedTimeRemaining * ElapsedTime / EstimatedStageEndTime

where the EstimatedTimeRemaining and EstimatedStageEndTime are obtained by the performance

prediction tool and ElapsedTime is obtained considering the current time and the application start time

stored in the RUNNING_APPLICATION_TABLE (see Appendix B.1 and EUBra-BIGSEA Deliverable D3.4 for

further details).
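The rescaling step and the bound line search described above can be summarised with the following Python sketch; the function names and the predictor interface are assumptions made for illustration and do not reproduce the actual OPT_JR code.

def rescaled_time_remaining(estimated_time_remaining, estimated_stage_end_time, elapsed_time):
    # Rescale the residual execution time returned by the performance prediction tool by the
    # ratio between the application's elapsed time and the predicted end time of the current stage.
    return estimated_time_remaining * elapsed_time / estimated_stage_end_time

def compute_bound(predict, app, deadline, start_cores, step):
    # Simple line search for the minimum number of cores such that the predicted execution time
    # falls below the deadline; predict(app, cores) stands for a call to dagSim or Lundstrom and
    # start_cores is the value chosen by OPT_IC at deployment time.
    cores = start_cores
    if predict(app, cores) > deadline:
        while predict(app, cores) > deadline:              # under-provisioned: add cores
            cores += step
    else:
        while cores > step and predict(app, cores - step) <= deadline:
            cores -= step                                  # over-provisioned: release spare cores
    return cores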


Figure 2 - OPT_JR Optimization Algorithm


The algorithm continues by considering the initial solutions calculated through Equations (2) and (3).

They are adjusted (steps 11-12) with the aim of rounding the number of cores adopted to a multiple of the ones available within the YARN container/VM.

The local search process takes place within the second stage of the algorithm (line 16 onward) by

looping according to a general index. At each iteration, a different algorithm called Approx_Algorithm

(line 18, the algorithm is described later in this document) is invoked to generate a list L2 of candidate

applications for which the total Objective Function (OF) can be improved by moving the same number of

cores from one application to another. The local search neighborhood is defined by this simple move.

The list L2 computed by the approximated algorithm at step 19 includes the set of moves that improve

the OF evaluated by using the ML models and formula (2). The list L1 (introduced at step 17) will store,

at each iteration, the set of moves leading to an improvement of the OF evaluated by relying on either

dagSim or Lundstrom predictors.

The earlier use of an approximation function to estimate application execution time mitigates the

performance issues deriving from the actual evaluation of the cores shift by using a more accurate

predictor (Lundstrom or dagSim). By exploiting the approximation-based algorithm (based on ML

reported in Figure 3) that explores exhaustively the neighborhood of size O(N^2), where N is the number of soft-deadline applications, the overall algorithm execution time decreases to a few milliseconds. As a result,

the effect of the cores shift is precisely calculated only on those promising candidates identified earlier.

Note that, to also improve the estimate provided by the ML models, at every call of the performance predictor the parameters χc,i and χ0,i are updated by fitting the hyperbola of Equation (2) with the last two values (ni, E[Ri]) returned by the performance predictor.1

In every iteration, the best move that minimizes the total objective function is committed (step 42) and the search continues; otherwise the search stops in a (local) optimum. Each member of the lists L1 and

L2 includes the number of cores of two different applications after the core shift and the expected value

of the delta of the Objective Function.

By cycling on the approximated list L2, the delta of the objective function is now evaluated for every

candidate solution by invoking the predictor component2 with the new core assignment. If the new

1 Performance predictor execution values are also cached in the DB tables (see also EUBra-BIGSEA deliverable

D3.4) to accelerate the overall procedure execution.

2 (dagSim or Lundstrom, lines 26-27).


cores configuration implies an improvement of the objective function (line 31), then the configuration is

stored in the sorted list L1 (line 33). Once the entries in the list L2 have been analysed, the process

retrieves from the sorted list L1 the configuration for which the objective is minimum; it performs the

move and the current solution is updated accordingly (line 42). The optimization process ends either

because the maximum number of iterations MAX has been reached or the last iteration has not

generated any improvement (line 38).

The approximation-based algorithm generating list L2 is reported in Figure 3. Its purpose is to improve the overall performance. This is made possible by the fact that the evaluation of the

objective function of the different attempts to add (or remove) cores is not actually calculated by a

predictor component but only approximated (lines 8-9) and evaluated through the machine learning

model.

All the candidate configurations (for which the objective function is effectively improving, as per line 13)

are stored in a sorted list (line 14) and returned to the main algorithm for actual evaluation (line 17).

Figure 3 - The Approximation-based Algorithm
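The two-stage local search of Figures 2 and 3 can be rendered in Python roughly as follows; the data structures (applications exposing deadline and weight attributes, a dictionary of core assignments) and the function names are assumptions of this sketch rather than the actual OPT_JR implementation.

def tardiness(exec_time, deadline, weight):
    # Weighted tardiness of a single application.
    return weight * max(exec_time - deadline, 0.0)

def approx_candidate_moves(apps, cores, step, ml_predict):
    # First phase (Figure 3, sketch): exhaustively score the O(N^2) neighbourhood of moves
    # shifting `step` cores from application j to application i, using only the fast ML
    # approximation of Equation (2).
    base = {a: tardiness(ml_predict(a, cores[a]), a.deadline, a.weight) for a in apps}
    moves = []
    for i in apps:
        for j in apps:
            if i is j or cores[j] <= step:
                continue
            delta = (tardiness(ml_predict(i, cores[i] + step), i.deadline, i.weight)
                     + tardiness(ml_predict(j, cores[j] - step), j.deadline, j.weight)
                     - base[i] - base[j])
            if delta < 0:                      # the move improves the approximate objective
                moves.append((delta, i, j))
    return sorted(moves, key=lambda m: m[0])   # most promising moves first

def local_search(apps, cores, step, ml_predict, sim_predict, max_iterations):
    # Second phase (Figure 2, sketch): re-evaluate the promising moves with the accurate
    # predictor (dagSim or Lundstrom) and commit the best improving move at every iteration.
    for _ in range(max_iterations):
        improving = []
        for _, i, j in approx_candidate_moves(apps, cores, step, ml_predict):
            before = (tardiness(sim_predict(i, cores[i]), i.deadline, i.weight)
                      + tardiness(sim_predict(j, cores[j]), j.deadline, j.weight))
            after = (tardiness(sim_predict(i, cores[i] + step), i.deadline, i.weight)
                     + tardiness(sim_predict(j, cores[j] - step), j.deadline, j.weight))
            if after < before:
                improving.append((after - before, i, j))
        if not improving:
            break                              # local optimum reached
        _, i, j = min(improving, key=lambda m: m[0])
        cores[i] += step
        cores[j] -= step
    return cores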


4. Hybrid Machine Learning Module Implementation

This section focuses on the description of the Hybrid Machine Learning (HML) prototype, which is used

for the performance prediction of big data cloud applications. The machine learning model is used by

the optimization components (see D3.4, Section 6) in order to define the minimum cost infrastructure

configuration which fulfils the deadline constraints. In particular, this component is based on the hybrid

machine learning approach initially published in Ataie et al.3 and described in EUBra-BIGSEA Deliverable

D3.2, Section 5.5, which combines analytic and machine learning models and is able to identify accurate

models reducing the number of experiments for big data application profiling.

The HML module replaces the fluid simulation version initially foreseen for the dagSim simulator, since

providing a closed formula for the evaluation of big data application execution time allowed us to

develop analytical optimization models, which are used to identify an initial promising solution in the

implementation of the OPT_IC and OPT_JR components.

The HML prototype is implemented to support the design-time phase. Indeed, the accuracy of the final

machine learning model needs to be assessed off-line and to be supervised by a system administrator

and cannot be fully automated and integrated into the run-time platform.

The main target of this section is the WP3 team implementing the prototype for the performance prediction and optimization services. In particular, the outcomes of the Hybrid Machine Learning tool will be used as an input to the latter. Moreover, this section describes extensively the software architecture as well as all the open-source components.

This section consists of four parts: the supported platforms, as well as the configuration and the dataset preparation, are depicted in Sections 4.1 and 4.2, respectively. The overall architecture, the scripts, the functions and the extrapolation capabilities of the approach are described in Section 4.3. The package information is illustrated in Section 4.4, while the deployment and the user installation guidelines are included in Appendix A.

4.1. Configuration and Deployment

The Hybrid Machine Learning prototype can be deployed on a machine with GNU Octave installed. The following sections describe extensively the requirements and the instructions necessary to guide the user.

The Hybrid Machine Learning prototype for QoS prediction is available as open source under the Apache

2.0 license on GitHub (https://github.com/eubr-bigsea/HybridMachineLearning). It was implemented in

GNU Octave and consists of a set of scripts (m files) and functions, responsible for predicting with a good

accuracy the execution time of Spark applications4. The core of our prototype is the use of the Support

3 E. Ataie, E. Gianniti, D. Ardagna, and A. Movaghar, “A combined analytical modeling machine learning approach

for performance prediction of mapreduce jobs in Hadoop clusters,” in Proc. of SYNASC , 2016.

4 The HML module has also been tested with MapReduce and Tez applications.


Vector Regression technique to estimate the execution time of each application. The implementation

depends on the LIBSVM5 open source library under the BSD license. This library is responsible for

training a data set to obtain a model and then using the model to predict values on a testing data

set.
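As an illustration of this regression step, the snippet below trains and queries a Support Vector Regression model in Python with scikit-learn; the actual prototype is written in GNU Octave on top of LIBSVM, and the feature values and execution times below are synthetic examples, not project data.

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# X: one row per run (number of cores, dataset size); y: measured execution time in seconds.
X = np.array([[20, 250], [40, 250], [60, 250], [80, 250], [100, 250]], dtype=float)
y = np.array([910.0, 520.0, 390.0, 330.0, 300.0])     # synthetic values, not project data

scaler = StandardScaler().fit(X)
model = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(scaler.transform(X), y)

# Predict the execution time for an unseen configuration (e.g. 120 cores).
print(model.predict(scaler.transform([[120, 250]])))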

4.2. Configuration and dataset preparation

The first step is to collect and process the Spark logs and generate the performance values from an

analytical model (e.g., the approximated formula or dagSim), which are needed as input for the training of the Hybrid algorithm, the core of the implemented tool.

In order to process the logs of the executed applications and get the suitable summary csv files, needed

as an input for the next steps, we should run the set of scripts available on GitHub

(https://github.com/deib-polimi/Spark-Log-Parser, see also D3.4 Section 6). An example from the

processed logs generated by the Spark Log Parser is shown in Figure 4. In particular, the generated

summary csv contains a set of features which can characterize Spark jobs obtained during the execution

of a query. Those features are the completion time of jobs, the number of stages and tasks per stage,

the max/min and average time of tasks, Shuffle time and bytes sent. Moreover, the number of users, the

application data set size and the number of containers are included.

Figure 4 - Example of processed data logs in csv format for Spark jobs

5 https://www.csie.ntu.edu.tw/~cjlin/libsvm


After the successful download of the set of scripts, the execution process can be started by navigating to

the bootstrap/octave directory.

Initially, the calculation of the estimated response times of a Spark application for a single-user scenario and the generation of the initial set of synthetic data samples (the Knowledge Base, KB) is needed. In order to obtain this information, the dagSim or JMT simulators, or the approximated formula, should be executed. The formula used for generating the dataset is described in Equation (4).

(4)

where ni is the number of tasks and ti is the average execution time of each task associated with stage i.

Moreover, nc is the number of cores available during the job's execution and k is the total number of stages. Such an approximation has proven to be accurate in scheduling theory for systems in which the number of tasks is greater than the available resources. The rationale of this formula is the following: the term ni/nc equals the number of waves required to execute the tasks of stage i sequentially. During the first ni/nc-1 waves, tasks statistically keep the nc cores busy, while during the very last wave the final tasks (fewer than nc) complete stage i execution. The approximation formula has

been implemented as a function and is considered one of the components of the prototype.
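As a concrete illustration of the wave-based reasoning behind Equation (4), the following Python snippet estimates a job's execution time from per-stage task counts and average task times; the exact expression of Equation (4) is not reproduced here, and rounding ni/nc up to a whole number of waves is an assumption of this sketch.

import math

def approx_response_time(stages, n_cores):
    # Wave-based approximation of a job's execution time (illustrative).
    # `stages` is a list of (n_tasks, avg_task_time) pairs, one per stage; each stage
    # contributes roughly ceil(n_tasks / n_cores) sequential "waves" of tasks, each
    # lasting the stage's average task time.
    return sum(math.ceil(n_tasks / n_cores) * avg_task_time
               for n_tasks, avg_task_time in stages)

# Example: three stages of a synthetic Spark job on 48 cores.
stages = [(200, 1.2), (96, 0.8), (48, 2.0)]
print(approx_response_time(stages, 48))   # 5*1.2 + 2*0.8 + 1*2.0 = 9.6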

4.3. Description of scripts and functions

The ML model produced by our prototype (see Equation (4)) is characterized by the two parameters chi_0 and chi_c and is then used by the optimization service component to predict big data application execution time.

In the following, a detailed step-by-step description of the execution of our code is presented. The first

step is to obtain the analytical data considering the approximated formula by executing the

“retrieve_response_times_with_formula” script which is responsible for implementing Equation 4.

Alternatively, in the case of using dagSim or another AM simulation process, the data is stored as a

vector in a variable named “query_analytical_data” and is ready to be used from the corresponding

script which is responsible for the next step of the process. Regarding the

“retrieve_response_times_with_formula” script, it calls three functions named “read_data”,

"clear_outliers" and “compute_waves_prediction” to read the data from the summary csv files, clean

them and finally calculate the approximated formula which was introduced in Equation 4.


Then, the identification of the optimal thresholds for the proposed Hybrid approach follows. In

particular, an optimal combination for iteration and stop thresholds is needed (see Figure 5). At this

point, we should clarify that the available operational data samples in the KB are partitioned into four disjoint sets, called training set, cross validation set CV1, cross validation set CV2 and test set.

The pseudo-code of our proposed hybrid algorithm, which is an improved and refined version of our

previous work (D3.2), is shown in Figure 5. A synthetic data set, used to form an initial KB, is generated

at line 2 based on the AM. The KB is then used to select and train an initial ML model at line 3. Since the

real data samples might be noisy, an iterative procedure is adopted for merging real data from the

operational system into the KB, which is implemented at lines 4–15. Adding a new configuration in the

KB means that many operational data points are included which subsequently are split into the training,

test and cross-validation sets respectively. The operational data for all available configurations are

gathered and then merged into KB at lines 5–8. Then the updated KB is shuffled and partitioned at lines

10 and 11 as stated before. Using these sets, line 12 is dedicated to the selection of an ML model among

the alternatives and to retraining it. Then some error metrics are evaluated at line 13. At lines 14 and 15,

two conditions are checked. Both conditions consider the Mean Absolute Percentage Errors (MAPEs) on

the training set and on the cross validation set CV2 (shown as tr-Error and CV2-Error in Figure 5) to

check whether they are lower than the specific thresholds (itrThr and stopThr, respectively) or not.

The error on the training set determines if the model fits well on its training set itself. On the other

hand, the error on the CV2 set determines if the model has generalization capability. If the values of

errors for the first condition are not small enough, the algorithm jumps to line 9 to reshuffle, to split the

KB into train, CV1, and CV2 sets, and to choose a different model.


Figure 5 - Hybrid Machine Learning Algorithm


Figure 6 - Components of Hybrid Machine Learning Tool

Moreover, since the goal is to identify models capable of achieving generalization, the operational data

obtained with the largest configurations are included only in the CV2. While the training set is used to

train different alternative models, the CV1 set is exploited for SVR model selection, while CV2 is used as

a stopping criterion and to determine the best values for the thresholds of the iterative algorithm. The

error on the training set determines if the model fits well on its training set itself. So, if this error is small

enough, the model will avoid underfitting or high bias. On the other hand, the error on the CV2 set

determines if the model has generalization capability. So, if this error is sufficiently small, the model will

avoid overfitting or high variance.
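The iterative procedure of Figure 5 can be condensed into the following Python sketch; the helper names (shuffle_and_split, train_and_select, mape) and the 60/20/20 split are illustrative placeholders for the corresponding Octave functions and do not reproduce the prototype exactly.

import random

def shuffle_and_split(kb, seed=0):
    # Shuffle the knowledge base and partition it into training, CV1 and CV2 sets
    # (a simple 60/20/20 split; the real prototype also keeps a separate test set).
    data = list(kb)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[:int(0.6 * n)], data[int(0.6 * n):int(0.8 * n)], data[int(0.8 * n):]

def hybrid_training(synthetic_kb, operational_batches, train_and_select, mape,
                    itr_thr, stop_thr, max_inner_iterations):
    # Sketch of the loop of Figure 5. `synthetic_kb` holds the samples generated with
    # the analytical model; `operational_batches` yields the real samples of one
    # configuration at a time; `train_and_select` performs SVR model selection on
    # (training set, CV1); `mape` computes the Mean Absolute Percentage Error.
    kb = list(synthetic_kb)
    train, cv1, cv2 = shuffle_and_split(kb)
    model = train_and_select(train, cv1)                # initial ML model
    for batch in operational_batches:                   # merge real data into the KB
        kb.extend(batch)
        for attempt in range(max_inner_iterations):
            train, cv1, cv2 = shuffle_and_split(kb, seed=attempt)
            model = train_and_select(train, cv1)        # re-select and re-train
            if mape(model, train) < itr_thr and mape(model, cv2) < stop_thr:
                break                                   # fits well and generalises
    return model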

The test set has been used for evaluating the accuracy of the selected model and restricted to include

some specific configurations (see Section 11). The main script for obtaining the desirable thresholds

(itrThr and stopThr) is “compare_prediction_errors” which generates the state space of threshold

combinations for the Hybrid approach; its results are stored in a folder named HyOptimumFinding,

whose path can be set in the code. Before the execution of the script, the user should set up a

number of parameters which are shown in Table 2. The first one is called "configuration.runs" and is

related to the number of cores which are examined. In our experiments (see Section 11), data samples for configurations ranging from 20 to 120 cores were gathered.

Another important parameter is the configuration.missing_runs, which includes the number of cores

which are not considered for the KB during the machine learning process. Regarding the threshold

parameters, the user should set a range for which the algorithm starts the exploration process in order

to find the optimum pair, while the max number of iterations and the number of seeds should also be

provided. For every threshold combination, the algorithm runs for 50 different seed values to generate

different results. Notice that, for the extrapolation and interpolation processes, the seed values are different from the ones used to identify the optimal threshold combination, in order to demonstrate the

effectiveness of the algorithm’s accuracy (see Table 2).

Table 2 - Parameters of compare_prediction_errors

Name of parameter: Description
configuration.runs: Number of examined cores
configuration.missing_runs: Number of missing core configurations used for validation
outer_thresholds: Minimize the MAPE on the training set
inner_thresholds: Minimize the MAPE on the cross validation set
max_inner_iterations: Maximum number of iterations of the algorithm
seeds: Seed for the random number generator

There are two main functions which are included in the previous scripts named “clear_outliers” and

“model_selection_with_thresholds”. The first is responsible for excluding rows of the arrays of data

where the value on a column is more than three standard deviations away from the mean, while the

latter performs model selection on the training set according to the performance on the cross validation

set. The general model is defined via search spans C and epsilon and in the end their optimal values are

returned. Additionally, in the latter function proper scaling is considered in order to obtain accurate

computations.
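An illustrative Python equivalent of the outlier-removal step is shown below; the real clear_outliers function is an Octave script, and the sample values used here are synthetic.

import numpy as np

def clear_outliers(data):
    # Drop the rows of a numeric array in which any column value lies more than
    # three standard deviations away from that column's mean (illustrative sketch).
    data = np.asarray(data, dtype=float)
    z = np.abs(data - data.mean(axis=0)) / data.std(axis=0)
    return data[(z <= 3).all(axis=1)]

# Example: with many nominal samples around 400 s, the 5000 s entry is dropped.
times = [[n, 400.0 + n] for n in range(30)] + [[99, 5000.0]]
print(clear_outliers(times).shape)   # (30, 2): the outlier row was removed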

The average error for each combination of thresholds for the Hybrid approach is calculated by the

“computeAvgsOfHybridForOptimumFinding” function, which finally returns the one that minimizes the

error.

4.3.1. Extrapolation on few and many cores and Interpolation capabilities

After finding the optimal combination, the process for investigating the extrapolation and interpolation

capabilities of the algorithm in the upper region of the configuration set follows by calculating the

relative error. As before, the same script (compare_prediction_errors) is used. However, in this case the

values of the optimal thresholds are inserted as an input. Regarding the extrapolation on many cores,

the execution of the process starts while progressively the configurations with the largest capacity are

removed from the training set and cross validation sets CV1 and CV2 and moved to the test set. In

other words, considering that the different configurations for cores are configuration.runs= [20 30 40 48

60 72 80 90 100 108 120], in the first scenario the training and CV1 set included data coming from the


real system only for configurations from 20 to 100 cores, while the CV2 and test sets included

experimental data for 108 and 120 cores, respectively. In the second case, real data for the training and

CV1 set from 20 to 90 cores are considered, the CV2 set included operational data for 100 cores while

the test set included experimental results in the range [108, 120] cores and so on.

Concerning the extrapolation on a few cores in the lower region of the configuration set, the process for the calculation of the error on response time prediction starts by considering that only the 20-core point is included in the test set; then, gradually moving from the left side of the configuration axis towards the right, other points are added to the test set. However, the quality of the results obtained from this process can be characterized as poor; thus, we recommend applying the procedure only for right extrapolation.

For generating the output plots of right and left extrapolation both the

"computeExtrapolationFromRight" and "computeExtrapolationFromLeft" scripts are executed and the

graphs are stored in the folder indicated by the user.

According to our experimental process (the related results are listed in Section 11) to assess the

interpolation capability of the algorithm, three different scenarios are considered by using the

“compare_prediction_errors” script.

For generating the output plots of interpolation, the "computeInterpolation" script is executed, while the

results are stored in the folder indicated by the user.
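The following Python sketch shows how the core configurations could be progressively partitioned for the right-extrapolation scenarios described above; the function name and the exact splitting policy are assumptions of this illustration.

RUNS = [20, 30, 40, 48, 60, 72, 80, 90, 100, 108, 120]      # configuration.runs

def right_extrapolation_scenarios(runs):
    # At every step the largest configuration still used for training is moved to CV2
    # and everything above it becomes test data (extrapolation on many cores).
    scenarios = []
    for cut in range(len(runs) - 2, 1, -1):
        train_cv1 = runs[:cut]        # real data used for the training and CV1 sets
        cv2 = [runs[cut]]             # next configuration, used as CV2
        test = runs[cut + 1:]         # largest configurations, used as the test set
        scenarios.append((train_cv1, cv2, test))
    return scenarios

for train_cv1, cv2, test in right_extrapolation_scenarios(RUNS):
    print(train_cv1, cv2, test)
# First scenario: train/CV1 = [20 .. 100], CV2 = [108], test = [120], as in the text.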

4.4. Package Information and Code Structure

The file structure of the delivery code is shown in Figure 7. The Hybrid Machine Learning tool is

distributed as a set of octave scripts, containing the source code and dependencies. There are five core

components, two of which have a more complex structure since a set of functions are invoked when

they are executed. The most important outcomes of the prototype are the Machine Learning Model,

which is used as an input to the optimization service, and both the extrapolation and interpolation

components, which prove the effectiveness of the approach.


Figure 7 - Structure of Hybrid Machine Learning code


5. Pro-active policies implementation

The pro-active policies system comprises three coupled modules that deal with the three main

interactions of the infrastructure and the tasks: the broker, the monitor, and the controller. The main details about the architecture and the initial implementation were already described in D3.4. In this section, we discuss the recent updates, including the new implementations and changes made to the pro-

active policies.

5.1. Vertical Scaling Pro-active Policies

The controller service comprises three plugins: the Controller plugin, the Actuator plugin, and the Metric

Source plugin. The Controller and the Actuator plugins implement the Controller and Actuator

components, respectively. The Metric Source plugin is responsible for getting application metrics from a

metric source, like Monasca, and returning them to the Controller.

The Controller plugin receives one or more error values associated with one or more metrics, from the

metric source. The error values are the difference between the expected and the measured values of

the metrics (e.g., the difference between the expected and the observed progress). Based on these

metrics, the controller decides when and how many resources should be allocated/deallocated from the

infrastructure to run the application properly.

5.1.1. Contextualization

Currently, there are three controller plugins available. The reference controller plugin is based on fixed-

step actuations when there is a progress error as is detailed in the previous deliverable (see EUBra-

BIGSEA D3.4). The two new controllers are:

1. Proportional Controller: changes the number of allocated resources based on the error values it

receives from the metric source and a proportional factor (also known as the proportional gain)

received as a parameter. The actuation size is proportional to the error value received and to

the proportional factor. The higher the error absolute value, the higher the number of resources

added or removed;

2. Derivative-proportional Controller: changes the number of allocated resources based on the

current error value, the difference between the current value and the last received and two

parameters: the proportional and derivative factors. The actuation size is proportional to the

error value and the factors received. The number of resources to add or remove is calculated as

shown in the expressions below:

cp = -fp * error

cd = -fd * (error - lasterror)

size = cp + cd


Here, cp is the proportional component, cd is the derivative component and size is the amount of resources; fp is the proportional factor parameter and fd is the derivative factor parameter.

The proportional component is purely reactive, while the derivative one contains basic

predictive logic.
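For illustration, the sketch below shows how such a controller could turn the error values into an actuation size; the function and variable names are ours and the rounding policy is an assumption, not the plugin's actual code.

# Illustrative decision step: compute how many resources to add (positive)
# or remove (negative) from the current and previous error values.
def compute_actuation_size(error, last_error, fp, fd):
    cp = -fp * error                    # proportional component (purely reactive)
    cd = -fd * (error - last_error)     # derivative component (basic prediction)
    return int(round(cp + cd))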

The Actuation plugins act on the infrastructure based on action requests from the Controller plugins.

Currently, the actuation plugins available act on KVM virtual machines, getting and changing the amount

of allocated resources using libvirt. For that, they use SSH to access the infrastructure’s compute nodes.

The actuation plugins available are:

1. KVM CPU cap plugin: the plugin changes the number of CPU resources allocated to the virtual

machines by controlling the CPU cap at the hypervisor layer;

2. KVM I/O plugin: the plugin changes the number of both CPU and disk I/O resources allocated to

the virtual machines. It requires two additional parameters, used as reference to the actuation:

iops_reference and bs_reference. The iops_reference parameter is the maximum disk

throughput, in operations per second, that can be allocated to a virtual machine. The

bs_reference parameter is the maximum disk throughput, in bytes per second, that can be

allocated to a virtual machine. These values must be added to the controller configuration file

(controller.cfg), in the actuation section.
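As a purely illustrative example following the configuration style of Figure 8, the actuation section of controller.cfg could look like the sketch below; the actuator name and the numeric values are placeholders, only the iops_reference and bs_reference keys come from the description above.

[actuation]
actuator = kvm-io
iops_reference = 10000
bs_reference = 104857600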

The actuation plugins implement the following methods:

1. adjust_resources(self, vm_data): modifies the environment properties, like allocated resources,

using a dictionary that maps virtual machine IDs to cap values as parameter. Typically used to

adjust the CPU cap of VMs allocated to the application;

2. get_allocated_resources(self, vm_id): returns the cap of the given virtual machine.

3. get_allocated_resources_to_cluster(self, vms_ids): returns the caps of the given cluster's virtual machines.
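A skeletal actuator plugin implementing this interface might look like the following sketch; the class name, the in-memory state and the omitted libvirt-over-SSH calls are simplifications, not the shipped plugins.

class SketchActuatorPlugin:
    """Illustrative actuator plugin skeleton following the interface above."""

    def __init__(self, ssh_conn):
        self.conn = ssh_conn        # SSH connection to the compute nodes
        self.caps = {}              # vm_id -> last cap applied (illustrative state)

    def adjust_resources(self, vm_data):
        # vm_data maps virtual machine IDs to cap values
        for vm_id, cap in vm_data.items():
            self.caps[vm_id] = cap
            # a real plugin would call virsh/libvirt here to apply the cap

    def get_allocated_resources(self, vm_id):
        # return the cap currently applied to a single virtual machine
        return self.caps.get(vm_id)

    def get_allocated_resources_to_cluster(self, vms_ids):
        # return the caps of all virtual machines of the cluster
        return {vm_id: self.get_allocated_resources(vm_id) for vm_id in vms_ids}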

5.2. Pro-active Policies for OpenNebula

Another improvement in the pro-active component was the implementation of plugins for running Spark

over Mesos in OpenNebula clouds. The following subsections detail the Spark-Mesos plugin, responsible

for interpreting a job description file and submitting it to a Spark cluster that uses Mesos as resource

manager, and the actuation plugin, which interacts with the cloud management platform when

actuation is needed.

5.2.1. Spark-Mesos Plugin

The Spark-Mesos plugin uses a cluster preconfigured with Mesosphere to execute and manage Spark

jobs. This plugin has basically the same functionality as the Spark-Sahara plugin, but with a different


cluster provider. A great advantage here is the possibility of using the current approaches for vertical and horizontal scaling (EC3, IM, and CLUES) combined.

This implementation works for both OpenNebula and OpenStack. There is a parameter in the request body called vm_provider that must specify which virtual machine provider is being used. It is needed so the broker is able to find and actuate on the hosts. The plugin is also capable of running Spark binaries in JAR or Python format. When the execution class is specified, it means the binary is in the JAR format.

The plugin flow for an execution in an OpenNebula Infrastructure consists of the following steps:

1. The Mesos cluster master is accessed via SSH and all the needed steps to run a Spark application

are executed;

2. The binary and the input files provided in the request body are fetched;

3. The Spark application is submitted, pointing to the Mesos cluster (using flag --master

mesos://ip:port);

4. The list of all the virtual machines running this Spark job (master and slaves) is retrieved;

5. Depending on the virtual machine provider:

1. OpenStack: the OpenStack client is used to discover the virtual machine ids on Nova; these ids are the same as the KVM ids;

2. OpenNebula: the ovm-client is used to map the IPs of the virtual machines and to discover the KVM id for each one, since the OpenNebula ids are not the same as the KVM ids;

6. The controller service is started (the same for Spark-Sahara); if the broker does not run in the

same network as the virtual machine provider, adjustments need to be made to tunnel the

access; this is the case, for example, when the worker machines are in a remote cluster,

accessible via VPN.

7. The monitoring service is started (the same for Spark-Sahara);

8. When the execution is finished, the files are removed and the services are stopped.

To execute a Spark job using this template, it is necessary to provide some specific information about the application, the job binary location, the output, and the infrastructure, besides the information regarding the controller and the monitor. It is important to highlight that in this case the application must organize its parameters in this exact order: [input files] [other arguments]. The configuration file below is an example with all this information.


[manager]
ip = 10.30.1.69
port = 1514
plugin = spark-mesos
bigsea_username = username
bigsea_password = password

[plugin]
binary_url = http://owncloud.lsd.ufcg.edu.br/as6sgb21xb/SparkPi.jar
# in this case spark pi has no input files, but you can include input files in the same order as the parameters in the execution
input_files = http://storage.example/b21xb/input1.csv http://storage.example/as6sgxb/input2.csv
execution_parameters = 1000
execution_class = org.apache.spark.examples.SparkPi
vm_provider = opennebula

[scaler]
starting_cap = 50
scaler_plugin = progress-error
actuator = kvm-upv
metric_source = monasca
application_type = os_generic
check_interval = 10
trigger_down = 10
trigger_up = 10
min_cap = 20
max_cap = 100
actuation_size = 15
metric_rounding = 2

Figure 8 - Configuration file to execute the spark-mesos plugin

5.2.2. OpenNebula Actuator

There are only a few differences between the way we actuate on VMs provided by OpenNebula and by OpenStack using the KVM CLI; they are basically related to how we discover where (on which compute node) the VM is hosted.


First of all, unlike OpenStack, the VM id that OpenNebula uses to identify each VM is not the same as the one KVM uses. As the controller receives only the VM id to execute the virsh command and set the CPU cap, the broker is responsible for mapping the slave IPs from Mesos to the VM ids in OpenNebula, so that the actuator can then identify the respective KVM id. The code below shows how it works:


# List the vms information on OpenNebula

list_vms_one = 'ovm list --user %s --password %s --endpoint %s' % \

(api.one_username, api.one_password, api.one_url)

stdin, stdout, stderr = conn.exec_command(list_vms_one)

# Get the ips of each spark executor from Mesos API. This method blocks the flow until the application is available on the Mesos API

executors = self._get_executors_ip()

vms_ips = executors[0]

# Extract the slaves ids from the output of the command to list opennebula vms

vms_ids = self._extract_vms_ids(stdout.read())

# This loop maps the vms ips with the opennebula ids

executors_vms_ids = []

for ip in vms_ips:

for id in vms_ids:

vm_info_one = 'ovm show %s --user %s --password %s --endpoint %s' % \

(id, api.one_username, api.one_password, api.one_url)

stdin, stdout, stderr = conn.exec_command(vm_info_one)

if ip in stdout.read():

executors_vms_ids.append(id)

break

Figure 9 - How the broker discovers the slave ids from Mesos and OpenNebula and calls the actuator service


On the other side, the actuator needs to find the host where the VM is allocated and execute the commands to set the resource allocation. The following code snippet shows how it works:


# This method receives as argument a map {vm-id:CPU cap}

def adjust_resources(self, vm_data):

instances_locations = {}

# Discover vm_id - compute nodes map

for instance in vm_data.keys():

# Access compute nodes to discover vms location

instances_locations[instance] = self._find_host(instance)

for instance in vm_data.keys():

# Access a compute node and change cap

self._change_vcpu_quota(instances_locations[instance],

instance, int(vm_data[instance]))

# This method executes the adjustment actions

def _change_vcpu_quota(self, host, vm_id, cap):

# ssh for the actual host

self.conn.exec_command("ssh %s") % host

# List all the vms to get the OpenNebula id and map with the KVM id

stdin, stdout, stderr = self.conn.exec_command("virsh list")

vm_list = stdout.read().split("\n")

# Discover the KVM id using the OpenNebula id

virsh_id = self._extract_id(vm_list, vm_id)

# Set the CPU cap

self.conn.exec_command("virsh schedinfo %s " +

"--set vcpu_quota=%s " +

Page 42: D3.5 EUBra-BIGSEA QoS infrastructure services final version...Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA)

www.eubra-bigsea.eu| [email protected] |@bigsea_eubr 41

"> /dev/null") % (virsh_id, cap)

Figure 10 - How the actuator maps the KVM id to execute the virsh commands

5.3. OpenStack Opportunistic Plugin

In the context of cloud computing, it is a fact that lots of resources are idle most of the time. This

happens because users allocate more resources than they need, and this is typically the case even if

elasticity mechanisms are used (e.g., thresholds for scaling are typically set to 60% or 70%). As a

consequence, cloud providers started providing these resources with a lower QoS to improve the

infrastructure utilization.

Thus, we developed the concept of opportunistic resources in the EUBra-BIGSEA framework as a

different approach to pro-activity. In this context, the broker workflow was modified to add a new

decision layer. After the application submission, even when the user specifies a cluster size, the broker service requests from the optimizer service an initial cluster configuration size to execute the application. After that, considering the opportunistic scenario, the broker may redefine the cluster size with opportunistic instances, whose number is defined by calculating the availability of resources in the infrastructure.

Opportunism can be used with Spark applications simply by adding the opportunistic flag (opportunistic=True) to the submit request made to the broker, as illustrated below. This is supported only by the Sahara plugin, because it is the only one in which we can control instance groups. This limitation exists because OpenStack has not properly implemented opportunistic instances; to overcome it, we use a separate node group template (the subset of services that will run on the instances belonging to the group) to specify opportunistic instances in Sahara, and in this way we can preempt them when needed. Please refer to the Sahara docs6 in order to understand more about node group templates.
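A hypothetical submission enabling opportunism could look like the sketch below; only the opportunistic flag and the use of the Sahara-based plugin come from the text, while the broker address, route and the remaining field names are illustrative assumptions.

import requests

BROKER_URL = "http://broker.example:1500"    # placeholder address

# Hypothetical submission body enabling opportunistic instances
payload = {
    "plugin": "spark-sahara",
    "opportunistic": True,
    "bigsea_username": "username",
    "bigsea_password": "password",
}
response = requests.post(BROKER_URL + "/submit", json=payload)
print(response.status_code)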

5.3.1. Opportunistic Service

The HSOptimizer7 is the service responsible for gathering information from the hosts of the cloud and

calculating the number of available resources it can use to spawn opportunistic instances to help speed

up the application processing time.

6 https://sahara.readthedocs.io/en/latest/devref/quickstart.html#create-node-group-templates

7 https://github.com/bigsea-ufcg/bigsea-hsoptimizer


It has two entry points on its API:

@hso_api.route('/optimizer/get_cluster_size', methods=['GET'])

def get_cluster_size()

@hso_api.route('/optimizer/get_preemtible_instances', methods=['GET'])

def get_preemtible_instances()

Figure 11 - Routes to use the opportunist services

The get_cluster_size() method is responsible for calculating the cluster size with respect to the resource

availability, and the get_preemtible_instances() method gives a per-cluster priority list of instances to be

deleted.
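A client could query these routes as in the hedged sketch below; the host, port and response format are assumptions, only the route paths come from Figure 11.

import requests

HSO_URL = "http://hsoptimizer.example:5000"   # placeholder endpoint

# Ask for a cluster size compatible with the predicted resource availability
cluster_size = requests.get(HSO_URL + "/optimizer/get_cluster_size").json()

# Ask for the per-cluster priority list of instances to delete first
preemptible = requests.get(HSO_URL + "/optimizer/get_preemtible_instances").json()

print(cluster_size, preemptible)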

To compute the cluster size, the optimizer will take the following steps:

1. Request from Monasca the resource utilization in the last hour of the hosts on the partition of

the cloud that uses opportunism;

2. Use a load predictor to compute the availability of resources for the next 30 minutes;

3. Return the infrastructure availability.

The predictor8 here was implemented in R and uses time series analysis and the R forecast package to

calculate four different estimates, returning the most conservative one.
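Conceptually, choosing the most conservative estimate amounts to assuming the highest predicted utilization, as in the sketch below; this is not the actual R predictor, just an illustration of the idea in Python.

def conservative_availability(utilization_forecasts, total_capacity):
    # utilization_forecasts: one predicted utilization per forecasting model for the next 30 minutes;
    # the most conservative availability estimate assumes the highest predicted utilization
    return total_capacity - max(utilization_forecasts)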

5.3.2. Opportunism and Load Balancer

One important part of the opportunism feature is the pre-emptability of instances. This means that,

since they have lower QoS, they are the best candidates to be migrated or deleted when needed. The LB

service, detailed below, is responsible for monitoring the cloud state and making the decision to migrate

or kill instances. To perform these potentially disruptive actions, the load balancer uses the optimizer

APIs to query the cluster priority list, selecting opportunistic instances as the best instances to act upon.

8 https://github.com/bigsea-ufcg/bigsea-hsoptimizer/blob/master/horizontal_scaling_optimizer/service/horizontal_scaling/resources/predictor.R


6. Load balancing implementation

The Load Balancer is a service that runs in the cloud infrastructure. Each Load Balancer is responsible for

managing a specific subset of hosts where VMs are running EUBra-BIGSEA applications, and for periodically reallocating VMs that are on overloaded hosts based on a given heuristic. In addition to this functionality, it is also possible to configure the Load Balancer to use the HSOptimizer (see Section 5.3) or the OPT_JR (see Section 3) service, which helps to select VMs that can be destroyed while minimizing SLA or QoS

issues of the applications. The current version works on OpenStack infrastructures and requires that the

infrastructure supports live migrations (i.e., the hosts have shared storage for instance disks9).

6.1. Architecture

The load balancer requires access to the Monasca and Nova (OpenStack Compute) services with admin

privileges. The higher privilege is necessary so it can discover information about all VMs and perform live

migrations. The load balancer also requires SSH access to hosts so it can retrieve CPU capacity limitation

(i.e., the CPU CAP) information through KVM Hypervisor. Accessing the CPU capacity information is

necessary to understand the actual allocation of machines. Cloud management platforms typically allow

CPU capacity limitation10, but do not consider it during scheduling11.

The configuration with the HSOptimizer or the OPT_JR services is optional. The architecture of the load

balancer is shown in Figure 12. Information about how to install, configure and use the load balancer is

available in the documentation12 and in Appendix E.

9 https://docs.openstack.org/admin-guide/compute-configuring-migrations.html

10 https://wiki.openstack.org/wiki/InstanceResourceQuota

11 https://docs.openstack.org/ocata/config-reference/compute/schedulers.html

12 https://github.com/bigsea-ufcg/bigsea-loadbalancer#bigsea-wp3---loadbalancer


Figure 12 - Load Balancer architecture

6.2. Heuristics

The heuristics are responsible for periodically verifying which hosts are overloaded, taking actions to

reallocate VMs of these hosts to others, and trying to make them less overloaded than before when

possible. By doing these rebalancing operations pro-actively, the performance of the virtual machines and

of the applications running on them is preserved.

Different heuristics may consider different aspects. Nevertheless, two parameters are common: cpu_ratio and wait_rounds. The cpu_ratio parameter in the configuration file represents the ratio of the number of CPUs that each host can use; the heuristics take this into consideration when trying to determine whether a host is overloaded. The wait_rounds parameter represents the number of executions of the load balancer that each migrated instance must wait before any heuristic can choose it to be migrated again. Each heuristic is implemented as a plugin, enabling developers to easily add their own heuristics.

6.2.1. Balance Instances


This heuristic aims to balance the hosts that are overloaded, by migrating the virtual machines that have

a higher number of vCPUs. A host is considered overloaded when its total number of allocated vCPUs divided by

its total number of CPUs is greater than the provided cpu_ratio. The heuristic pseudo-algorithm is

detailed below. More information can be found in the component documentation13.

1. Select all hosts that are overloaded

2. If there are no overloaded hosts or if all hosts are overloaded, nothing can be done

3. For each overloaded host

3.1 While selected_host is overloaded and still has instances that could be migrated:

3.1.1 Select instance with a high number of vcpus that was not migrated before.

3.1.2 Look for a host that is less loaded and can receive this instance without exceeding the

cpu_ratio.

4. Migrate all instances selected.

5. Update the number of wait rounds for all migrated instances.

Figure 13 - Balance Instances algorithm
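A minimal sketch of the overload test and victim selection used by this heuristic is shown below; the data structures and names are ours, and the real plugin additionally honours the wait_rounds parameter.

# hosts map: {host_name: {"total_cpus": int, "vms": {vm_id: vcpus}}}  (illustrative structure)
def is_overloaded(host, cpu_ratio):
    allocated = sum(host["vms"].values())
    return float(allocated) / host["total_cpus"] > cpu_ratio

def pick_vm_to_migrate(host, recently_migrated):
    # prefer the VM with the highest number of vCPUs that was not migrated recently
    candidates = {vm: v for vm, v in host["vms"].items() if vm not in recently_migrated}
    return max(candidates, key=candidates.get) if candidates else None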

6.2.2. CPU-Cap Aware

This heuristic aims to balance the hosts that are overloaded by migrating the virtual machines that are

responsible for the increased consumption of CPU in the hosts. A host is considered overloaded when

the host total consumption or the total used capacity divided by the number of CPUs is greater than the

provided cpu_ratio. More information is available in the documentation14. The total consumption and

total used capacity for each host are given by the equations below. These equations consider, for each

VM i, the number of allocated vCPUs (vcpus), the CPU CAP for the set of vCPUs in the VM and the

average utilization of these vCPUs (%CPU).

1. Total Consumption: the sum of the consumption of each virtual machine on the host.

2. Total Used Capacity: the sum of the used capacity of each virtual machine on the host.
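One plausible way to write these quantities, assuming cap_i is expressed as a fraction of full vCPU speed and %CPU_i is the average utilization of VM i, is:

Total Consumption = Σ_i (vcpus_i · cap_i · %CPU_i)
Total Used Capacity = Σ_i (vcpus_i · cap_i)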

13 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/balance.md

14 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/cpu_capacity.md


1. Select all hosts that are overloaded

2. If no overloaded hosts or if all are equally overloaded, nothing can be done.

3. For each overloaded host

3.1 While selected_host is overloaded and still has virtual machines that can be migrated:

3.1.1 Select the virtual machines with high utilization that were not migrated recently.

3.1.2 Look for a host that is less loaded and can receive this virtual machine without exceeding

the cpu_ratio.

4. Migrate all the selected virtual machines

5. Update the number of wait rounds for all migrated virtual machines.

Figure 14 - CPU Cap Aware algorithm

6.2.3. Sysbench Performance+CPU Cap

This heuristic aims to balance the hosts that are overloaded, considering not only allocation and usage,

but the actual performance of the physical machine. Such a heuristic is especially useful in

heterogeneous infrastructures. The decision of which instances to migrate takes into consideration

whether the available hosts have better or worse performance than the overloaded hosts. When the

performance of available hosts is better, the heuristics rebalance by selecting for migration the virtual

machines with higher consumption of CPU in the hosts. When the performance of available hosts is

worse than the overloaded one, the selected virtual machines are the ones with lower consumption of

CPU in the hosts.

A host is considered overloaded when the host total consumption or total used capacity divided by the

number of CPUs is greater than the provided cpu_ratio. More information is available in the

documentation15. To measure the performance of hosts we have created a daemon16 that runs on each

host of the infrastructure and publishes the results into Monasca so the heuristic can use the results to

make the decision. The daemon executes the Sysbench17 benchmark; information about how to use the

tool is provided in Appendix F.

15 https://github.com/bigsea-ufcg/bigsea-loadbalancer/blob/master/loadbalancer/service/heuristic/doc/performance.md

16 https://github.com/bigsea-ufcg/monitoring-hosts

17 http://manpages.ubuntu.com/manpages/xenial/man1/sysbench.1.html


The total consumption and total used capacity for each host are given by the equations below. These

equations consider, for each VM i, the number of allocated vCPUs (vcpus), the CPU CAP for the set of

vCPUs in the VM and the average utilization of these vCPUs (%CPU).

1. Total Consumption: the sum of the consumption of each virtual machine on the host.

2. Total Used Capacity: the sum of the used capacity of each virtual machine on the host.

1. Select all hosts that are overloaded

2. If no host is overloaded or if all overloaded hosts are equally loaded, nothing can be done

3. For each overloaded host

3.1. While selected_host is overloaded:

3.1.1. While list of hosts with better performance is not empty:

3.1.1.1. Search for the instance with higher utilization that was not migrated before

3.1.1.2. If selected_host is not overloaded: break

3.1.2. If selected_host is not overloaded: break

3.1.3. While list of hosts with worse performance is not empty:

3.1.3.1. Search for the instance with lower utilization that was not migrated before

3.1.3.2. If selected_host is not overloaded: break

3.1.4. If selected_host is still overloaded: break, issue warning that nothing can be done

4. Migrate all the selected instances

5. Update the number of wait rounds for all migrated instances

Figure 15 - Sysbench Performance CPU Cap algorithm

6.3. Integration with Opportunistic Services

The implementation of the integration between the Optimizer services and the Load Balancer is based on plugins, which enables developers to write new plugins to integrate with other opportunistic services. In this scenario, if possible, extra VM resources are allocated to run the application from the beginning in a "best effort" fashion. If, during the application execution, the extra resources are required by other applications, the Load Balancing service acts on the infrastructure to release resources, for example by destroying preemptable VMs provided by the Opportunistic service (HSOptimizer). For that, the opportunistic service provides a list of opportunistic instances that can be deleted without affecting the application (e.g., the master node in a Spark cluster cannot be killed). The Load Balancer contacts the HSOptimizer service when, at the end of its decision, there are still overloaded hosts; it then destroys a small number of preemptable VMs and makes a new decision.


7. Pro-active policies validation

In this section we detail the experiments made to validate the new approaches for the pro-active

policies described in Section 5.

7.1. Vertical Scaling Pro-active Policies

The following subsections summarize the environment and the experiments performed for the

validation of the Controller component. First, we introduce the infrastructure where the experiments

were performed, followed by the description of the experiment, and the analysis of the results.

7.1.1. Infrastructure

The experiments were performed in the production cloud of the Distributed Systems Laboratory (LSD) at

the Federal University of Campina Grande (UFCG), which uses the Newton version of OpenStack. Within

this environment, we selected a set of three dedicated servers, to which we had full access: two Dell PowerEdge R410 and one Dell PowerEdge R420. Together these hosts had a total of 52 CPUs and 82.36 GB of RAM. The servers were connected through a Gigabit network and used shared storage provided via Ceph18. The configuration of each server can be seen in Table 3.

Table 3 - Configuration of each server

Servers | Model | Processor | CPUs | Memory (GB)
c4-compute11 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz | 24 | 31.4
c4-compute12 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz | 24 | 31.4
c4-compute22 | Dell PowerEdge R420 | Intel Xeon E5-2407 2.20GHz | 4 | 19.56

18 http://ceph.com/


The experiment made use of a Spark cluster, whose configuration is described in Table 4.


Table 4 - Spark Cluster Configuration

Node | Amount | vCPUs | Memory (GB) | Disk Size (GB)
Master | 1 | 2 | 4 | 80
Worker | 3 | 2 | 4 | 80

7.1.2. Experiment design

The experiment’s objective is to compare the performance of the Controller component plugins with an

execution of EMaaS. Since the Controller’s main objective is to adjust the number of resources available

to the application in order to make sure the deadline is not violated, we chose “Application execution

time” and “CPU usage” as metrics to measure the controller’s performance. We have one scenario,

which consists in the execution of the following workflow with nine controller configurations, with ten-

time repetition:

1. Select the Controller configuration to use;
2. Use the Broker API to start EMaaS execution and controller execution;
3. Wait until EMaaS execution ends;
   3.1. Collect CPU usage;
4. Collect application execution time.

Figure 16 - Driver Comparison Algorithm for the experiment

Controller configurations

We used the progress error, proportional and derivative-proportional plugins. For each plugin, we

planned three different configurations. The used configurations are described in Table 5.

Table 5 - Controller configurations used in the comparison experiment

Plugin | Aggressive | Regular | Conservative
Progress Error | Trigger down = 0; Trigger up = 0; Actuation size = 100 | Trigger down = 10; Trigger up = -10; Actuation size = 100 | Trigger down = 20; Trigger up = -20; Actuation size = 100
Proportional | Aggressive factor = 2 | Aggressive factor = 1 | Aggressive factor = 0.5
Derivative-proportional | proportional factor = 2; derivative factor = 2 | proportional factor = 1; derivative factor = 1 | proportional factor = 0.5; derivative factor = 0.5

This configuration for the progress-error controller forces it to use the full speed of a vCPU when performing a vertical scaling actuation. We used the KVM CPU cap plugin as the actuator plugin and 50% as the starting cap.

Application deadline

In order to obtain a reasonable deadline value, we executed EMaaS with no scaling performed and a CPU cap of 50%, and collected the execution times of eight such executions. The chosen deadline (380 seconds) is the mean of the collected values.

Metrics collection method

CPU usage was collected using SSH to access the servers and virt-top19 to get CPU usage information

from the hypervisor. The application execution time was obtained from the Broker, which queries the

Spark API to get the value.

7.1.3. Analysis

Figure 17 depicts the values of the application execution time and CPU usage for all the used controller

configurations. The metric “CPU Usage” was calculated by adding the collected values of CPU usage of

the workers used in the execution.

19 https://linux.die.net/man/1/virt-top


Figure 17 - CPU Usage and application time for each Controller configuration

By grouping the results by Controller plugin (progress-error, proportional and proportional-derivative), it

is clear that the conservative configurations have the worst results on both evaluation metrics. This

result can be explained by the fact that EMaaS normally requires a resource boost at the end of the

execution. The conservative configurations are likely to fail to add resources quickly enough.


The results also show that, for EMaaS, the aggressive configurations perform better. However, regarding

execution time, the difference between the aggressive and the regular configurations is not as big as the

difference between both and the conservative configurations. Moving from a regular configuration to a purely aggressive one does not yield a performance improvement as big as the one resulting from moving from a purely conservative configuration to a regular one. This suggests there is an optimum aggressiveness level.

The Progress error plugin has the best results regarding “Application time” and “CPU Usage”. This result

can also be explained by the EMaaS resource requirements. Since Actuation Size = 100 in all

configurations, this plugin can give the biggest possible resource boost to EMaaS and remove almost all

the resources when necessary. It results in the best execution times and resource usage. However, this

plugin’s behaviour seems too extreme and may generate an excessive disturbance in the environment.

The disadvantage of the progress-error plugin is the oscillation in resource usage, which can be seen in

Figures 18 and 19. Note that the spikes in the CPU consumed by the Progress-Error configuration are higher;

this could bring instability to the performance of the cloud and also make the load balancer inefficient.

In Figure 19, we can see that the CPU CAP oscillates more abruptly in the two other cases than with the

Proportional-Derivative controller.


Figure 18 - Examples of compute node CPU Usage for three controller configurations


Figure 19 - Examples of CPU cap over time for three controller configurations

7.2. Vertical Scaling Actuation Plugins

Another improvement in the pro-active policies module regards the actuation for vertical scaling. The

current version supports the usage of multiple resource dimensions, namely controlling CPU CAP and

I/O limits (operations per second and bandwidth). The following subsections illustrate the impact of the

usage of these metrics.

7.2.1. Infrastructure and experiment design

The infrastructure is the same as described in Section 7.1.1. The experiment’s objective is to evaluate

the effectiveness of the I/O actuator in controlling the amount of allocated resources and the application time. We have two scenarios, which consist in the execution of the following workflow, repeated ten times. We used an I/O-bound application in the executions.


1. Select the Actuator configuration to use;
2. Use the Broker API to start application execution and controller execution, using 50% as cap value;
3. Wait until application execution ends;
4. Collect application execution time;
5. Use the Broker API to start application execution and controller execution, using 100% as cap value;
6. Wait until application execution ends;
7. Collect application execution time.

Actuation configurations

We used two Actuator plugins: KVM CPU cap and KVM I/O.

Scenario | Actuator
1 | KVM CPU cap
2 | KVM I/O

7.2.2. Analysis

Figure 20 depicts the values of the application execution time for all scenarios.


Figure 20 - Values of execution time for both scenarios

The results are also summarized below:

Table 6 - Benchmark execution time for the different scenarios.

Plugin | Cap | Median (s)
KVM I/O | 50 | 78.43286
KVM I/O | 100 | 42.39330
KVM CPU cap | 50 | 86.37899
KVM CPU cap | 100 | 80.60844

Typically, virtual machines are configured with resource limitations to minimize interference with other machines. For example, in OpenStack this is done through definitions in the template for the machine configuration, the flavor20. Thus, in our case, all virtual machines start with an initial limit for both CPU and I/O. As seen in the figure, using the KVM CPU cap as the actuation plugin causes an impact on the execution. Increasing the CPU cap from 50 to 100 yields a small performance boost (86 to 81 seconds). However, the performance boost resulting from the usage of the KVM I/O actuation, which in fact actuates on both CPU and I/O, is much larger (78 seconds to 42 seconds). Therefore, in production clouds, where the resources in the cloud compute nodes are designed to be correlated (i.e., the I/O bandwidth of a VM is related to its CPU capacity), the default plugin for the actuation should be KVM I/O.

20 https://wiki.openstack.org/wiki/InstanceResourceQuota


8. Frameworks experiments validation

This section summarizes the vertical elasticity for Mesos frameworks implemented in the frame of

EUBra-BIGSEA. First, we introduce the use cases supported, then the architecture is presented and we

end with the results of the experiments performed.

8.1. Use Cases

Details from the system architecture are provided in deliverable D3.4. Here we present a comprehensive

and updated summary.

The architecture addresses two scenarios:

1. The first scenario deals with a job that involves several iterations, each iteration with a known execution cost. A Quality of Service is defined and all the iterations must be performed (sequentially) within a given deadline. The objective is to adapt the CPU share for each iteration before it starts, in order to meet the deadline. This use case is implemented by means of Chronos. The user submits a Chronos application that iterates multiple times through independent executions,

which could not take place concurrently due to flow dependencies, external data dependencies, or

the like. The user wants to guarantee that the individual executions are completed in a given

deadline. Each iteration is started with a specific resource allocation, and the system learns from

each past case to readjust the allocation of the resources for the new case. Moreover, each

execution may take longer or shorter than the previous one, and the system varies the resource

allocation accordingly.

2. The second scenario deals with a job whose progress cannot be evaluated but which is expected to consume

a known amount of CPU time. The objective will be to guarantee that this share of the CPU time

has been allocated to that application in a given timeframe. This use case is implemented through

Marathon. The user submits a long-living Marathon job, with the guarantee that a minimum share

of the CPU time has been allocated to that job. The system can dynamically adjust the number of

resources to guarantee that the CPU time ratio has been allocated. This scenario assumes that

other tasks have been scheduled in the same resources with different requirements and allocation

periods, so temporarily speeding-down a task may lead to a lower overall QoS violation ratio.

8.2. System Architecture

The system architecture is composed of three main components: Launcher, Supervisor, and Executor.

1. Launcher receives the job specification in JavaScript Object Notation (JSON) format. Then, it

generates and assigns to the application a universally unique identifier (UUID). According to the

scenario’s specification, Launcher submits the job with a modified job specification. Finally,

Launcher sends to Supervisor, via a REST request, the relevant information for monitoring and scaling the

application.


2. Supervisor is a REST server, which receives the information sent by Launcher and by the containers. According to the scenario's specification, it monitors the application in order to, if necessary, scale the application to accomplish the agreed quality of service. Furthermore, the Supervisor sends metrics to Monasca for visualization.

3. All the services are executed using Docker containers. In the second scenario it is necessary to manage a Docker container "manually", since the system must perform checkpointing. For this reason, a new component is required: Executor. The purposes of this component are to perform the tasks required for management and monitoring by the Monasca Agent (which is installed on all nodes of the infrastructure), to manage the Docker container and its checkpoints, and to deliver information to Supervisor.

Figure 21 shows how the first scenario is implemented in this architecture. Launcher and Supervisor transform the user's job, updating the information about the remaining (non-scheduled) iterations, the

deadline and other necessary information for following the job (job name, job uuid, framework id).

Then, the supervisor publishes and consumes information about the progress in the Monasca server, using it to trigger the reallocation of resources for the remaining iterations.

Figure 21 - Execution of a Chronos job.

Figure 22 shows how the second scenario is implemented in this architecture. The main differences are: 1) there are no iterations, and progress is defined according to the consumption of CPU within a given deadline; 2) the application will be restarted when the allocation of resources changes, so automatic checkpointing is needed; 3) the application should be instrumented to automatically restore a checkpoint image if it exists.


Figure 22 - Execution of a Marathon job.

8.2.1. Job submit and execution

The user creates the application by submitting the job specification in JSON format to the component named Launcher, which generates and assigns a UUID to the job. Then, it uses the job specification to create

a new job specification with the modifications necessary for execution and monitoring. These

modifications depend on each of the scenarios previously described.

In the case of the second scenario the Docker container must be explicitly managed by the Launcher

while the container is executed by Executor. To achieve this, Launcher creates a new job specification in

JSON format that executes Executor (with the job specification parameters) on a Mesos native

container, i.e., the job is running on a Docker container, which is managed by Executor running on a

Mesos native container managed by the Marathon Executor. No monitoring modifications are needed because Executor and the Monasca Agent handle the monitoring.

Launcher sends, by means of an initTask REST request, the application information to Supervisor. The information is composed of the job name, the UUID, the framework name, the duration of one iteration, the application deadline and the maximum over-progress percentage (the rate of the time interval available, from the start of the application until the completion time, that the user considers acceptable). In the case of the first scenario, the number of desired iterations is sent too.
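As a purely illustrative example of such a request, the sketch below follows the list of values above; the endpoint, route and exact field names are assumptions, not the implemented API.

import requests

SUPERVISOR_URL = "http://supervisor.example:8080"   # placeholder address

# Hypothetical initTask body for a first-scenario (Chronos) job
init_task = {
    "job_name": "chronos-iterative-job",
    "uuid": "1f2e3d4c-0000-0000-0000-000000000000",
    "framework": "chronos",
    "iteration_duration": 120,     # seconds per iteration
    "deadline": 1800,              # seconds available to complete the job
    "max_over_progress": 0.1,      # maximum over-progress percentage
    "iterations": 10,              # only sent in the first scenario
}
requests.post(SUPERVISOR_URL + "/initTask", json=init_task)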

Once the new job specification is created, Launcher sends it to the appropriate framework, which will execute it.

8.2.2. Monitoring the application


Once the application is launched, it is necessary to monitor the application to collect its performance in

order to modify its assigned resources.

Monitoring of the first scenario is performed by the Supervisor and is received in a REST request (which

is embedded in its tasks by Launcher) from the containers. The request body is a JSON object containing the start and finish dates of the execution and the UUID and name of the application. When a

request arrives, Supervisor processes it, sends the information of statistics to Monasca and starts the

decision making in order to accomplish the QoS agreed.

Monitoring of the second scenario is more complex than the first one because, in this situation,

monitoring is required during the execution. The system uses the Monasca Agent to collect the performance metrics of each container, but a modification of the Docker plugin for the Monasca Agent is needed to allow monitoring of containers restored from checkpoints. The metrics collected by Monasca

Agent are sent to Monasca, and are consulted periodically by Supervisor for decision making in order to

accomplish the QoS agreed. Furthermore, Executor sends the same information about the application as

in the request of each iteration of the first scenario. It should be noted that this information is only used

to get statistics which are sent to Monasca.

8.2.3. Decision making

Once the data collection is completed, the next step consists in determining the performance state of

the application, i.e., the task progress.

There are three states for the application: under-progress, on-time, and over-progress.

In case of the first scenario, the decision is taken after each iteration of the application is completed.

The task progress state is computed using a prediction of the termination time percentage given the

iteration achieved, the application deadline and the given maximum over-progress percentage.

The prediction of the termination time percentage is calculated using the expression p_prediction = (t_prediction - t_start) / t_deadline, where t_prediction is the estimated job end time in timestamp format (obtained using the formula presented next), t_start is the start time of the job in timestamp format and t_deadline is the interval for executing the application with the agreed QoS.

The predicted termination time in timestamp format is calculated with the equation t_prediction = t_current + rem_iter · (d_job + d_deployment), using the available interval of time from the start time of the application until the deadline. In this expression, t_current is the current time in timestamp format, rem_iter is the number of remaining iterations, d_job is the given duration of the job and d_deployment is the mean duration of the deployment in the infrastructure for Chronos.

A progress threshold is defined. If the progress is in the interval between 100% minus the threshold and 100% plus the threshold of the estimated value, the job will be in the on-time state. If the progress is below the lower bound, the state is under-progress. Finally, if the progress is higher than the upper bound, the job enters the over-progress state.
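A compact sketch of this classification is shown below; it assumes the threshold is expressed as a fraction of the expected progress, and the names are ours rather than the Supervisor's actual code.

def progress_state(progress, expected_progress, threshold):
    # progress and expected_progress share the same unit (e.g., fraction of iterations completed);
    # threshold is the over-progress percentage expressed as a fraction (e.g., 0.1 for 10%)
    if progress < expected_progress * (1.0 - threshold):
        return "under-progress"   # behind schedule: resources must be increased
    if progress > expected_progress * (1.0 + threshold):
        return "over-progress"    # ahead of schedule: resources can be released
    return "on-time"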


In case of the second scenario, the decision making is done periodically. The task progress state in a

given time t is computed using the task progress ratio (as described below) and a given over-progress

percentage threshold. The state (under-progress, on-time, over-progress) of the task is defined in the

same way as in the first scenario.

The task progress ratio of the execution at a given time t is computed using the equation progress_ratio(t) = cputime_current(t) / cputime_desired(t). For this, Supervisor uses the consumed CPU time (cputime_current(t)) and the committed CPU time at that time (cputime_desired(t)), both expressed in seconds.

The CPU time consumed at a given time, cputime_current(t), is the result of the transformation of two queries from Monasca, named container.cpu.user_time and container.cpu.system_time. These values correspond to the total container clock ticks consumed in the node where the container is running. Before these values can be used, we must transform them into seconds by dividing them by the number of clock ticks per second of the node operating system.

The desired CPU time consumed for a given execution time, cputime_desired(t), is calculated using the expression cputime_desired(t) = d_job · (t_current - t_start) / t_deadline, where t_current is the current time in timestamp format, t_start is the start time of the execution in timestamp format, t_deadline is the interval for executing the application with the agreed QoS and d_job is the estimated duration of the job.

For this, Supervisor uses the current time and the start time, duration and deadline of the application.
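A hedged sketch of this computation is given below; the clock-tick conversion, the linear desired-CPU-time formula and the function names follow the description above but are our own formulation, not the Supervisor's code.

def cputime_current(user_ticks, system_ticks, ticks_per_second):
    # container.cpu.user_time and container.cpu.system_time arrive as clock ticks;
    # dividing by the node's clock ticks per second converts them to seconds
    return (user_ticks + system_ticks) / float(ticks_per_second)

def cputime_desired(t_current, t_start, t_deadline, d_job):
    # CPU time the task should have consumed by t_current if it progressed
    # linearly towards consuming d_job seconds of CPU by the deadline
    return d_job * (t_current - t_start) / float(t_deadline)

def progress_ratio(consumed_seconds, desired_seconds):
    return consumed_seconds / desired_seconds if desired_seconds > 0 else 1.0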

8.2.4. Elasticity

If the application state is over-progress or under-progress, a new allocation of resources will be

computed. In case of the first scenario, the Supervisor will send to the Chronos scheduler the new

resource allocation before the next application execution, leading the unplanned executions to run with

the new resource allocation. The elasticity in the second scenario is more complex, as resources change

while the application is running. When the Supervisor sends the new resource allocation to the

Marathon scheduler, the Marathon executor sends to the Mesos container a termination signal before

removing it. This signal is captured by Executor, which creates a checkpoint of the running container and

stores it in an location accessible by all nodes of the infrastructure. As soon as the new container with

the new resource allocation is running, Executor restores the checkpoint created previously to resume

the execution at the same point the signal was captured.

8.3. Results of the experiments

8.3.1. Chronos Application


In this scenario, we analyse the behaviour of the system considering that the execution time grows

linearly (all the iterations require the same computing time). However, the real execution time will vary

depending on the real allocation of physical resources. Even if the Docker containers are limited

according to the Mesos Framework allocation, they will consume the maximum of free resources. The

CPU limitation of a Mesos Framework only applies if a job runs on a full node. If the CPU is free, the job will take all the cycles it is able to consume. Obviously, this is a desired effect: resources are not wasted, and it gives us the capability of speeding jobs up and down.

The results are shown in Figure 23. This figure shows the difference between the predicted time using

the equations of the previous section and the actual termination time of the iterations. This value can be

used to identify the three regions that relate to these three possible states of a job.

1. If the value is in the red area, the performance is below the QoS so the resources allocated to

the application must be increased.

2. If the value is in the yellow area, the performance is higher than the committed QoS, so

resources allocated to the application can be reduced.

3. If the value is in the green region, the performance is slightly higher than expected, so no

variation in the resource allocation will be performed. The white line shows the estimated

evolution.

Figure 24 shows the allocated CPU and the duration of each iteration for one of the jobs. The tests were

performed on an infrastructure with four nodes with 2 CPUs each. The experiment aimed at executing

the six jobs with 10 iterations each. Although the total expected CPU time for a single job was initially shorter than the deadline, there were not enough resources to accommodate the total execution time, in order to force the system to reallocate resources. It can be clearly seen that four of the six jobs submitted have ended

earlier than the deadline, one was very close and the one missing the deadline ended with a delay of

10%. It is important to state that each iteration execution time was different.


Figure 23 - Difference between the estimated time and the actual progress per iteration.

Figure 24 - Dynamic assignment of CPU and cost per iteration. Higher CPU assignments lead to shorter execution time, with

some background noise due to other load of the node.

8.3.2. Marathon Application


This study shows the behaviour of the framework elasticity for a Marathon long-running job. In this case, a single job was executed in a node, starting with an allocation of 1 CPU and letting the elasticity system alter the allocation of resources. The system was loaded with other tasks to force the resource limitation of the framework to act.

As it can be seen, the CPU allocation is modified twice. First, (Figure 25, timestamp 2:33:30) the system

detects that the application is progressing much quicker than expected (), so it reduces the allocation of

resources. Then, the progress is slowed down, entering in the under-progress state close to timestamp

2:36). CPU is increased, and progress is speeding up. Next check detects that the application is in the

safe zone (close to 100% of the expected performance), so no changes are applied. Finally, it finishes

earlier than expected.

Figure 25 - Progress with respect to the estimated deadline (above) and CPU allocation (below) along the time.

Figure 26 - The same behaviour, shown with respect to the expected timeline.


Figure 27 - Evolution of the application CPU consumption and the linear behaviour used to define the QoS.


9. Load balancing validation

This section summarizes the experiments performed concerning the validation of the Load Balancer

tool, developed within EUBra-BIGSEA. First, we introduce the infrastructure where the experiments

were performed, followed by the description of the experiment and the analysis of the results obtained.

9.1. Infrastructure

The experiments were performed in the Production Cloud of the Distributed Systems Laboratory (LSD) at the Federal University of Campina Grande (UFCG), which uses the Newton version of OpenStack. The environment provided was a set of two dedicated servers to which we had full access, one Dell PowerEdge R410 and one Dell PowerEdge R420; together they had a total of 28 CPUs and 50.96 GB of RAM. The servers are connected through a Gigabit network and use shared storage provided via Ceph (http://ceph.com/). The configuration of each server can be seen in Table 7 below.

Table 7 - Configuration of each server

Server | Model | Processor | CPUs | Memory (GB)
c4-compute12 | Dell PowerEdge R410 | Intel Xeon X5675 3.07GHz | 24 | 31.4
c4-compute22 | Dell PowerEdge R420 | Intel Xeon E5-2407 2.20GHz | 4 | 19.56

9.2. Experiment Design

The objective of the experiment is to investigate whether the use of the heuristics can improve the performance of overloaded servers. We have three scenarios, each consisting of the execution of the three heuristics with five repetitions. The following workflow applies to each execution (a driver sketch is given after Figure 28):

1. Select the initial cap for the virtual machines (25%, 50% or 75%).
2. First execution of the Load Balancer.
3. Interfere with the cloud to break the balance:
   a. Create a new machine with 2 vCPUs in c4-compute12 and set its CAP to the initial value.
   b. Remove two virtual machines with 1 vCPU each in c4-compute22.
4. Second execution of the Load Balancer.
5. Change the CAP of all virtual machines to 100%.
6. Third execution of the Load Balancer.

Figure 28: Workflow for heuristic executions
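For reference, the workflow of Figure 28 can be read as a simple driver script. The sketch below is a hypothetical outline: the client functions passed as parameters (run_load_balancer, create_vm, delete_vm, set_cap, list_vms) are placeholders for the actual OpenStack and Load Balancer tooling used in the experiments.

```python
# Hypothetical driver mirroring the experiment workflow of Figure 28.
# The functions received as arguments stand in for the real OpenStack / Load Balancer clients.

def run_experiment(heuristic, initial_cap,
                   run_load_balancer, create_vm, delete_vm, set_cap, list_vms):
    # 1. Select the initial cap for all virtual machines (25%, 50% or 75%)
    for vm in list_vms():
        set_cap(vm, initial_cap)

    # 2. First execution of the Load Balancer
    run_load_balancer(heuristic)

    # 3. Interfere with the cloud to break the balance
    new_vm = create_vm(host="c4-compute12", vcpus=2)    # 3.a new 2-vCPU VM
    set_cap(new_vm, initial_cap)
    for vm in list_vms(host="c4-compute22", vcpus=1)[:2]:
        delete_vm(vm)                                    # 3.b remove two 1-vCPU VMs

    # 4. Second execution of the Load Balancer
    run_load_balancer(heuristic)

    # 5. Change the cap for all virtual machines to 100%
    for vm in list_vms():
        set_cap(vm, 100)

    # 6. Third execution of the Load Balancer
    run_load_balancer(heuristic)
```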

Table 8 describes the initial configuration of each server before the beginning of each execution, namely the number of virtual machines, the number of CPUs allocated and the total number of CPUs. Table 9 describes the design we used, with the variables and the corresponding levels.

Table 8 - Initial configuration for each server before each execution for the scenarios.

Server | Number of VMs | Number of allocated CPUs | Total number of CPUs
c4-compute12 | 3 | 6 | 24
c4-compute22 | 7 | 8 | 4

Table 9 - Variables and corresponding levels used.

Variable | Levels
Heuristics | 1. BalanceInstancesOS; 2. CPUCapAware; 3. SysbenchPerfCPUCap
Initial CAP | 1. 25%; 2. 50%; 3. 75%
cpu_ratio | 1. 0.5
wait_rounds | 1. 1

9.3. Analysis

Below we analyze the results of one scenario, with the initial CAP set to 75%; the results of the experiments with the initial CAP set to 25% and 50% can be found in Appendix G. In this scenario all virtual machines start with an initial CAP of 75%. Table 10 depicts the times at which the Load Balancer executed migrations for each heuristic in all five executions.


Table 10 - Time when migrations started.

Heuristic | Execution 1 | Execution 2 | Execution 3 | Execution 4 | Execution 5
BalanceInstancesOS | 12:16, 12:21 | 12:38, 12:44 | 13:04, 13:10 | 13:42, 13:47 | 14:03, 14:10
CPUCapAware | 11:57, 12:58 | 12:19, 12:30 | 12:42, 12:52 | 13:01, 13:13 | 13:26, 13:36
SysbenchPerfCPUCap | 12:16, 12:28 | 12:42, 12:53 | 13:14, 13:26 | 13:54, 14:06 | 14:16, 14:28

In the following, we look at 5 experiments for each heuristic. We monitor the following metrics: total_consumption, which considers the allocated vCPUs, the CAP of these CPUs and the actual percentage usage; total_cap, which considers only the allocation (i.e., vCPUs allocated and their CAP, not the actual usage); the CPU usage of the physical machine hosting the virtual machines; and the Sysbench execution time.
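To make the difference between the two allocation metrics explicit, the following sketch computes them for one host from per-VM data. The field names and the normalisation by the host's total CPUs are assumptions introduced only for illustration.

```python
# Illustrative computation of the two allocation metrics used below.
# Each VM is described by its vCPUs, its CAP (fraction of a vCPU it may use, 0..1)
# and its measured usage (fraction actually consumed, 0..1).

def total_cap(vms, host_cpus):
    """Allocation only: vCPUs weighted by their CAP, as a fraction of the host size."""
    return sum(vm["vcpus"] * vm["cap"] for vm in vms) / host_cpus

def total_consumption(vms, host_cpus):
    """Allocation weighted by the actual usage of each VM."""
    return sum(vm["vcpus"] * vm["cap"] * vm["usage"] for vm in vms) / host_cpus

vms_on_host = [
    {"vcpus": 1, "cap": 0.75, "usage": 0.9},
    {"vcpus": 2, "cap": 0.75, "usage": 0.4},
]
print(total_cap(vms_on_host, host_cpus=4))          # 0.5625
print(total_consumption(vms_on_host, host_cpus=4))  # ~0.319
```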

For the heuristic BalanceInstancesOS the migrations occurred in the first and second executions of the Load Balancer, while for the other heuristics the migrations occurred in the first and third executions. The BalanceInstancesOS heuristic actuates in the second round because, even if the new machines have not used many resources, this heuristic monitors the allocated resources. Similarly, because the two other heuristics consider used resources, they actuate only in the third round, when the CAP is raised.

Analyzing the total_consumption (Figures 29 to 31) and total_cap (Figures 32 to 34) graphs we can see

that after each execution the Load Balancer enforced the cpu_ratio value previously configured.


Figure 29 - Heuristic BalanceInstancesOS - Total Consumption per host.

In Figure 29 we present the graph of total_consumption over time for the BalanceInstancesOS heuristic. In each execution graph we can see that, in the beginning, the total_consumption is above the cpu_ratio limit (dashed line). This heuristic uses the ratio of allocated vCPUs over the total number of CPUs to determine whether a host is overloaded. This value is always higher than the total_consumption, as it assumes that allocated CPUs are 100% used; therefore, if total_consumption is higher than the limit, the allocated resources will also be. Then, if the machine is overloaded, the Load Balancer takes action to trigger the migration of virtual machines. This already takes place in the first execution round (see the times in Table 10), migrating from c4-compute22 to c4-compute12. At the end, the total_consumption stays equal to or lower than the cpu_ratio.
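A minimal sketch of this allocation-based overload test, using the cpu_ratio of 0.5 configured in Table 9 and the initial allocations of Table 8 (the function name is illustrative, not the Load Balancer code):

```python
# Illustrative overload test for the BalanceInstancesOS heuristic: a host is
# considered overloaded when allocated vCPUs / total CPUs exceeds cpu_ratio,
# regardless of how much of the allocation is actually used.

def is_overloaded(allocated_vcpus, total_cpus, cpu_ratio=0.5):
    return allocated_vcpus / total_cpus > cpu_ratio

# Initial state of the experiment (Table 8): only c4-compute22 exceeds the limit.
print(is_overloaded(allocated_vcpus=6, total_cpus=24))  # c4-compute12 -> False
print(is_overloaded(allocated_vcpus=8, total_cpus=4))   # c4-compute22 -> True
```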


Figure 30 - Heuristic CPUCapAware - Total Consumption per host.

Figure 30 depicts total_consumption over time for the CPUCapAware heuristic. In each execution graph we can see that, in the beginning, the total_consumption is above the cpu_ratio limit (dashed line); the Load Balancer takes action and triggers the migration of virtual machines in the first execution (see the times in Table 10) from c4-compute22 to c4-compute12, and after this the total_consumption stays equal to or lower than the cpu_ratio.


Figure 31 - Heuristic SysbenchPerfCPUCap - Total Consumption per host.

Similarly, Figure 31 depicts total_consumption over time for the SysbenchPerfCPUCap heuristic. In each execution we can see that, in the beginning, the total_consumption is above the cpu_ratio limit (dashed line), except for execution #2, in which the virtual machines' utilization was not high enough. Nevertheless, even in execution #2 the total CAP exceeds the target values (see Figure 34). The Load Balancer takes action and triggers the migration of virtual machines in the first execution (see the times in Table 10) from c4-compute22 to c4-compute12, and after this the total_consumption stays equal to or lower than the cpu_ratio. For this figure, it is also important to notice that the low utilization of host c4-compute22 is due to the fact that the migrated instance is the one consuming the lowest amount of resources, as the target host has lower performance.


Figure 32 - Heuristic BalanceInstancesOS - Total Cap per host.


Figure 33 - Heuristic CPUCapAware - Total Cap per host.


Figure 34 - Heuristic SysbenchPerfCPUCap - Total Cap per host.

Figures 32 through 34 present the graphs of total_cap over time for all heuristics. The graph for the BalanceInstancesOS heuristic (Figure 32) does not increase after the virtual machines' CAP is changed to 100% because the migration occurred during the second execution, which was when another virtual machine was created in c4-compute12 and the Load Balancer migrated it to c4-compute22. For the heuristics CPUCapAware (Figure 33) and SysbenchPerfCPUCap (Figure 34), we see that, approximately in the middle of each execution, the total_cap for the host c4-compute22 increases above the cpu_ratio limit. This is caused by the creation of a virtual machine in this host and by the change of the CAP of all virtual machines in all hosts to 100%; the Load Balancer triggers the migration in the third execution to decrease the total_cap, by migrating virtual machines from c4-compute12 to c4-compute22.

The CPU utilization of the hosts was also collected. This information is depicted in Figures 35 through 37. The common pattern we can see in the graphs is a decrease in usage in the overloaded hosts when the Load Balancer starts its execution, and an increase in utilization in the host that receives the migrated virtual machines. In addition, the peak utilization right after the first execution is caused by the removal and creation of virtual machines in the hosts. Also, note that there are CPU peaks due to the execution of Sysbench to collect the performance metric. For c4-compute22 this impact is much more noticeable because it has only 4 cores, in contrast to the 24 cores of c4-compute12.


Figure 35 - Heuristic BalanceInstancesOS - Host %CPU during executions.


Figure 36 - Heuristic CPUCapAware - Host %CPU during executions.


Figure 37 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.


10. Performance Prediction service validation

In this section we present the results of a set of experiments we performed to explore and validate the

dagSim simulator and the Lundstrom predictor on the TPC-DS industry standard benchmark22 for data

warehouse systems (which considers an SQL-like workload) as well as on some reference Machine

Learning (ML) benchmarks, namely Support Vector Machine (SVM), Logistic Regression and K-Means.

Such experiments were performed on the Microsoft Azure Cloud. Overall our tests include sequential

workloads (obtained from the SQL queries execution plan) and iterative workloads, which characterize

ML algorithms and are becoming more and more popular in the Spark community23. Moreover, our

performance prediction tools were validated considering the K-Means benchmark over COMPSs running

on the new Mare Nostrum 4 at Barcelona Supercomputing Center24.

10.1. Experimental Settings

As far as the TPC-DS benchmark experimental campaign is concerned, the Microsoft Azure HDInsight PaaS offering, based on the Spark 1.6.2 release and Ubuntu 14.04, was considered. Multiple experiments were conducted on two different deployments. In particular, two different types of VMs, namely A3 and D12v2, were tested. A3 VMs include 4 cores with 7 GB of RAM and a 250 GB local disk. Regarding the second deployment, D12v2 nodes include 4 cores, 28 GB of RAM and a 200 GB local SSD. In the A3 case, the workers' configuration consisted of 6 up to 48 cores, while in the case of D12v2 the number of cores was varied between 12 and 52. Each Spark executor had 2 cores with 2GB of RAM. Two dedicated master nodes over D12V2 were used and the Spark driver had 4GB allocated.
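As an indicative illustration of this deployment (2-core/2 GB executors and a 4 GB driver), a PySpark context for the TPC-DS runs could be configured roughly as follows. This is a sketch under assumptions (the application name and number of executor instances are placeholders), not the exact submission script used in the campaign.

```python
# Indicative Spark 1.6-style configuration mirroring the TPC-DS campaign above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tpcds-q26")                    # placeholder application name
        .set("spark.executor.cores", "2")           # 2 cores per executor
        .set("spark.executor.memory", "2g")         # 2 GB of RAM per executor
        .set("spark.driver.memory", "4g")           # 4 GB allocated to the driver
        .set("spark.executor.instances", "24"))     # e.g. 48 worker cores / 2 per executor

sc = SparkContext(conf=conf)
```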

We considered two TPC-DS queries, named Q26 and Q52, running on a 500 GB data set; their Directed Acyclic Graphs (DAGs) are presented in Figure 38. Queries were run at least 10 times for every configuration. As aforementioned, we also ran experiments with the SVM, Logistic Regression, and K-Means benchmarks. For each benchmark, three dataset sizes were considered, namely 8GB, 48GB, and 96GB. A Spark 2.1.0 deployment was used, configured to have two dedicated master nodes over D12V2 VMs. For this set of experiments, we considered two different configurations, one with three worker nodes and the other with six worker nodes, both over D4V2 VMs. The D4V2 instances have 8 cores, 28GB RAM and a 400 GB local SSD. Each executor had 4 cores and 10GB of memory, while the driver had 8 GB. The number of executors varied between 6 and 12. For each given configuration, the experiments were repeated 10 times.

Regarding the COMPSs experiments, we ran the K-Means algorithm over different infrastructure configurations, from 1 to 4 nodes, each one composed of 48 cores. The experiments comprise two different scenarios: one in which the number of fragments is the same as the number of cores (thus, as the number of cores increases, so does the total number of fragments), and the other with the number of fragments fixed, equal to the maximum number of cores (i.e., 192). Since the number of fragments is reflected in the number of parallel tasks, this latter scenario is the one with the characteristics most similar to Spark's behaviour, where the number of tasks is usually constant across configurations (unless a different setup is adopted by the system administrator).

22 http://www.tpc.org/tpcds
23 https://databricks.com/blog/2016/09/27/spark-survey-2016-released.html
24 https://www.bsc.es/marenostrum/marenostrum/technical-information

The accuracy of the performance prediction tools is assessed through the percentage error, evaluated as the relative difference between the average execution time measured in the real experiments for a given configuration and the execution time predicted by our tools, normalised by the average time of the real experiments:

\[ \varepsilon = \frac{\overline{T}_{real} - T_{predicted}}{\overline{T}_{real}} \times 100\% \qquad (5) \]


Figure 38 - Directed Acyclic Graph for queries Q26(a) and Q52(b)

10.2. dagSim Validation

The aim of this section is to evaluate the accuracy achieved by our dagSim simulator. The results

obtained from the experiments on Azure for TPC-DS and ML workloads are presented in the following

tables.

For the TPC-DS experiments executed on Azure A3 and D12v2 VM types, the prediction errors for Q26 and Q52 vary from 0.02% to 8.53% for the A3 deployment, and from 0.16% to 19.01% for D12v2. The maximum error values for the aforementioned deployments correspond to the 48-core configuration with the 500 GB dataset (A3 Microsoft Azure) and to the 28-core configuration with the 500 GB dataset (D12v2 Microsoft Azure), respectively. Detailed results are reported in EUBra-BIGSEA Deliverable D3.4.

Regarding K-means (see Table 11), the prediction error for dagSim ranges from -29.80% to 6.49%.

Moreover, we observe that the maximum absolute error value is obtained for the largest configuration

which consists of 96 GB data set and 48 cores.

Table 11 - K-Means experimental results on Azure and dagSim prediction

#Cores | Data set size (GB) | Real [ms] | dagSim prediction [ms] | % Error
24 | 8 | 82,644.6 | 75,553 | 8.58%
24 | 48 | 326,593 | 364,556 | -11.62%
24 | 96 | 846,885.8 | 788,371 | 6.91%
48 | 8 | 75,151.6 | 70,276 | 6.49%
48 | 48 | 179,737.6 | 219,195 | -21.95%
48 | 96 | 574,868.4 | 746,151 | -29.80%

In Table 12 the prediction error for Logistic Regression is presented. According to the experimental results, prediction errors range from 0.34% to 6.00%, and the maximum prediction error value is reported for the 8 GB dataset and 48 cores.


Table 12 - Logistic Regression experimental results on Azure and dagSim prediction

#Cores | Data set size (GB) | Real [ms] | dagSim [ms] | % Error
24 | 8 | 164,574.6 | 156,070 | 5.17%
24 | 48 | 669,401 | 671,693.49 | -0.34%
24 | 96 | 1,418,837.4 | 1,404,886 | 0.98%
48 | 8 | 166,517.8 | 156,526.7 | 6.00%
48 | 48 | 368,236.6 | 362,927 | 1.44%
48 | 96 | 1,200,678.2 | 1,193,930 | 0.56%

In relation to the SVM benchmark, the prediction error (Table 13) is also very small and similar to the one obtained for the Logistic Regression prediction. In this case, it ranges from 0.08% to 6.16%, while the maximum prediction error value is reported for the 8 GB dataset and 48 cores.

Table 13 - SVM experimental results on Azure and dagSim prediction

#Cores | Data set size (GB) | Real [ms] | dagSim [ms] | % Error
24 | 8 | 176,201.83 | 167,686 | 4.83%
24 | 48 | 623,714.2 | 621,391 | 0.37%
24 | 96 | 1,353,471 | 323,928 | 2.18%
48 | 8 | 174,811 | 164,043 | 6.16%
48 | 48 | 357,922 | 352,166 | 1.61%
48 | 96 | 635,922 | 635,421 | 0.08%

Finally, Table 14 includes the experimental results obtained from the execution of the K-Means application on COMPSs for different numbers of cores, considering the two different scenarios. The prediction errors are usually very low, and the maximum error value (17%) is obtained in the case where 3 nodes and 192 fragments (equal to the maximum number of cores) are used. Note that we obtained definitely better results than those reported in EUBra-BIGSEA Deliverable D3.4 (characterized by up to -66% error). We argue that the data of the previous runs were affected by large variability due to the unusually high resource contention experienced by the BSC infrastructure at the end of February 2017. The high workload was experienced because Mare Nostrum was about to be switched off for 5 months to upgrade the system to the new version 4.

Overall, the results for both the TPC-DS and ML benchmarks and for COMPSs demonstrate that the dagSim predictor has good accuracy and is a suitable tool for Big Data application performance prediction.

Table 14 - COMPSs application and dagSim prediction

#Nodes | #Cores | #Fragments | Real [ms] | dagSim [ms] | % Error
1 | 48 | 48 | 880,533 | 874,458 | 1%
2 | 96 | 96 | 591,215 | 582,889 | 1%
3 | 144 | 144 | 372,232 | 363,861 | 2%
4 | 192 | 192 | 268,058 | 259,922 | 3%
1 | 48 | 192 | 1,011,834 | 1,057,884 | -4%
2 | 96 | 192 | 646,748 | 699,206 | -8%
3 | 144 | 192 | 610,146 | 523,223 | 17%

10.3. Lundstrom Validation

We now turn to the evaluation of the accuracy of the Lundstrom analytical model. In the following we

present and discuss the results obtained in the various setup configurations considered.

The TPC-DS results reported in EUBra-BIGSEA Deliverable D3.4 did not include the Lundstrom

evaluation. In this deliverable, we also present the results for the Lundstrom model over the

experiments executed on Azure A3 and D12v2 VM types, with the number of cores per executor set to

2. Accuracy error ranges from 4.43% to 23.70%. Maximum error value occurs in the scenario with 48

cores, processing 500 GB (D12v2 Microsoft Azure) dataset for both Q26 and Q52. Table 15 depicts the

results.

Table 15 - TPC-DS Q26 and Q52 on Spark over D12v2 Azure VMs

Nodes | #Cores | Query | Real [ms] | Lundstrom [ms] | % Error
3 | 12 | Q26 | 722,191.54 | 690,216.98 | 4.43%
3 | 12 | Q52 | 719,922.93 | 660,772.88 | 8.22%
4 | 16 | Q26 | 582,916.50 | 543,854.76 | 6.70%
4 | 16 | Q52 | 562,701.25 | 517,314.43 | 8.07%
5 | 20 | Q26 | 515,921.32 | 469,010.19 | 9.09%
5 | 20 | Q52 | 471,807.51 | 412,718.74 | 12.52%
6 | 24 | Q26 | 447,640.05 | 398,265.11 | 11.03%
6 | 24 | Q52 | 417,689.57 | 358,323.66 | 14.21%
7 | 28 | Q26 | 415,717.33 | 367,184.59 | 11.67%
7 | 28 | Q52 | 364,080.10 | 304,676.52 | 16.32%
8 | 32 | Q26 | 366,135.08 | 316,546.96 | 13.54%
8 | 32 | Q52 | 324,692.89 | 264,963.70 | 18.40%
9 | 36 | Q26 | 306,080.43 | 256,057.86 | 16.34%
9 | 36 | Q52 | 306,760.14 | 246,952.58 | 19.50%
10 | 40 | Q26 | 287,453.29 | 236,815.25 | 17.62%
10 | 40 | Q52 | 275,196.48 | 215,247.32 | 21.78%
11 | 44 | Q26 | 259,665.43 | 209,579.07 | 19.29%
11 | 44 | Q52 | 258,779.14 | 200,155.59 | 22.65%
12 | 48 | Q26 | 248,604.10 | 197,181.57 | 20.69%
12 | 48 | Q52 | 249,954.80 | 190,722.71 | 23.70%
13 | 52 | Q26 | 220,247.86 | 181,354.17 | 17.66%
13 | 52 | Q52 | 226,047.19 | 179,279.19 | 20.69%


Prediction errors for both queries vary from 0.85% to 15.92% for the A3 deployment (Table 16). The maximum error value occurs for the 48-core configuration, processing the 500 GB dataset, for both Q26 and Q52.

Table 16 - TPC-DS Q26 and Q52 on Spark over A3 Azure VMs

Nodes | #Cores | Query | Real [ms] | Lundstrom [ms] | % Error
4 | 10 | Q26 | 1,778,802.07 | 1,763,649.39 | 0.85%
4 | 10 | Q52 | 1,327,441.57 | 1,276,419.90 | 3.84%
4 | 12 | Q26 | 1,690,553.40 | 1,674,504.77 | 0.95%
4 | 12 | Q52 | 1,124,490.00 | 1,072,515.57 | 4.62%
5 | 14 | Q26 | 1,439,273.57 | 1,413,964.87 | 1.76%
5 | 14 | Q52 | 976,408.72 | 924,048.86 | 5.36%
5 | 16 | Q26 | 1,271,578.5 | 1,243,254.23 | 2.23%
5 | 16 | Q52 | 884,947.43 | 832,631.01 | 5.91%
6 | 18 | Q26 | 1,127,209.05 | 1,099,812.68 | 2.43%
6 | 18 | Q52 | 816,461.67 | 764,813.25 | 6.33%
6 | 20 | Q26 | 1,093,593.50 | 1,064,210.94 | 2.69%
6 | 20 | Q52 | 738,760.00 | 687,252.22 | 6.97%
7 | 22 | Q26 | 809,747.91 | 764,252.94 | 5.62%
7 | 22 | Q52 | 667,874.21 | 613,641.05 | 8.12%
7 | 24 | Q26 | 911,222.20 | 874,843.44 | 3.99%
7 | 24 | Q52 | 620,125.14 | 566,192.81 | 8.70%
8 | 26 | Q26 | 720,810.43 | 682,170.15 | 5.36%
8 | 26 | Q52 | 572,076.00 | 512,687.86 | 10.38%
8 | 28 | Q26 | - | - | -
8 | 28 | Q52 | 538,226.91 | 478,838.17 | 11.03%
9 | 30 | Q26 | 581,430.71 | 531,497.96 | 8.59%
9 | 30 | Q52 | 620,208.00 | 572,866.98 | 7.63%
9 | 32 | Q26 | 629,799.50 | 588,982.65 | 6.48%
9 | 32 | Q52 | 492,334.53 | 432,698.46 | 12.11%
10 | 34 | Q26 | 625,767.95 | 584,327.56 | 6.62%
10 | 34 | Q52 | 462,200.71 | 402,496.46 | 12.92%
10 | 36 | Q26 | 577,367.50 | 532,404.92 | 7.79%
10 | 36 | Q52 | 442,087.13 | 382,594.58 | 13.46%
11 | 38 | Q26 | 546,599.32 | 503,100.31 | 7.96%
11 | 38 | Q52 | 438,667.54 | 380,176.33 | 13.33%
11 | 40 | Q26 | 530,579.83 | 487,333.68 | 8.15%
11 | 40 | Q52 | 418,379.05 | 359,130.77 | 14.16%
12 | 42 | Q26 | 488,396.22 | 447,348.19 | 8.40%
12 | 42 | Q52 | 392,136.82 | 334,238.91 | 14.76%
12 | 44 | Q26 | 470,695.00 | 427,621.31 | 9.15%
12 | 44 | Q52 | 383,732.22 | 325,851.76 | 15.08%
13 | 46 | Q26 | 446,374.92 | 405,558.27 | 9.14%
13 | 46 | Q52 | 378,361.94 | 318,829.68 | 15.73%
13 | 48 | Q26 | 430,512.00 | 387,504.98 | 9.99%
13 | 48 | Q52 | 362,790.14 | 305,020.66 | 15.92%

Regarding the ML benchmarks, Table 17 shows the prediction errors for the Lundstrom model for the

SVM benchmarks. Errors vary from 1.27% to 10.28% while the max prediction error value is reported

for the setup with an 8 GB dataset and 48 cores.

Table 17 - SVM experiments on Spark on Azure

Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 190,905.83 | 171,651 | 10.09%
3 | 24 | 48 | 356,721.20 | 339,246 | 4.90%
3 | 24 | 96 | 1,366,994.80 | 1,349,569 | 1.27%
6 | 48 | 8 | 189,684.00 | 170,184 | 10.28%
6 | 48 | 48 | 372,529.20 | 353,248 | 5.18%
6 | 48 | 96 | 650,509.40 | 631,144 | 2.98%

Similarly to what was observed for SVM, the prediction errors for Logistic Regression range from 1.24%

to 11.55%, and the maximum error value was also obtained for 8GB data set and 48 cores.

Table 18 - Logistic Regression experiments on Spark on Azure

Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 178,899.80 | 159,459 | 10.87%
3 | 24 | 48 | 684,430.71 | 664,363 | 2.93%
3 | 24 | 96 | 1,431,853.00 | 1,414,072 | 1.24%
6 | 48 | 8 | 181,977.80 | 160,961 | 11.55%
6 | 48 | 48 | 382,951.80 | 362,505 | 5.34%
6 | 48 | 96 | 1,219,151.00 | 1,192,617 | 2.18%

Regarding K-means, the prediction error for Lundstrom ranges from 1.87% to 18.10%. Once again, the

maximum error was obtained for the smallest data set which consists of 8GB in the largest

configuration, 48 cores.

We further looked into the response times measured for individual runs of each benchmark on each

configuration and observed that the setup with the largest errors for all three benchmarks (8GB on 48

cores) coincides with the scenario with the highest variance across multiple runs. The large number of

cores used on a relatively small dataset, which might occasionally cause resource underutilization, may

explain the slightly worse performance of the model in this setup.


Table 19 - K-Means experiments on Spark on Azure

Nodes | #Cores | Dataset size (GB) | Real [ms] | Lundstrom [ms] | % Error
3 | 24 | 8 | 99,023.75 | 81,877 | 17.32%
3 | 24 | 48 | 342,195.47 | 325,131 | 4.99%
3 | 24 | 96 | 862,058.80 | 845,920 | 1.87%
6 | 48 | 8 | 90,299.38 | 73,958 | 18.10%
6 | 48 | 48 | 195,026.89 | 178,804 | 8.32%
6 | 48 | 96 | 594,309.60 | 572,919 | 3.60%

Regarding the experiments on COMPSs, the prediction errors for Lundstrom are quite small, ranging

from 0.30% to 1.15%. We observed that the maximum error was obtained for the scenario with the

largest numbers of nodes, cores, and fragments: 4, 192 and 192 respectively. Moreover, we found that

prediction error tends to increase as the number of nodes/cores increases.


Table 20 - K-Means experiments on COMPSs on Mare Nostrum

Nodes | #Cores | Fragments | Real [ms] | Lundstrom [ms] | % Error
1 | 48 | 48 | 880,533 | 877,493 | 0.35%
2 | 96 | 96 | 591,215 | 588,138 | 0.52%
3 | 144 | 144 | 372,232 | 369,151 | 0.83%
4 | 192 | 192 | 268,058 | 264,979 | 1.15%
1 | 48 | 192 | 1,011,834 | 1,008,828 | 0.30%
2 | 96 | 192 | 646,748 | 643,657 | 0.48%
3 | 144 | 192 | 610,146 | 604,478 | 0.93%

2 96 192 646,748 643,657 0.48%

3 144 192 610,146 604,478 0.93%

Overall, across the considered set of experiments, which cover different platforms and configurations,

both dagSim and Lundstrom have shown to provide very good prediction accuracy: while dagSim

achieved 6.17% average percentage error across all scenarios, the Lundstrom tool obtained 9.03%

average error. We note that, for analytical queuing models (such as Lundstrom), 30% errors in response

time predictions can be usually expected25. Thus, both tools are suitable for predicting the performance

of Big Data applications at run-time. Moreover, both have also achieved K3.5 KPI (median error below

40%).

25 Edward D. Lazowska, John Zahorjan, G. Scott Graham, and Kenneth C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1984.


11. HML validation

This section contains the follow-up of the initial experiments included in D3.2, which are extended and applied to the Spark technology in order to validate the effectiveness of our hybrid algorithm. A detailed description of the experimental process is given in Section 11.1. In particular, the description of the examined queries, the dataset, the tested technologies and the configuration of the experiments are presented. Moreover, a comparison of HML with two different approaches is described, and the extrapolation and interpolation capabilities of the algorithm are examined.

11.1. Experiments setting

The datasets used for running the experiments have been generated using the TPC-DS benchmark. In D3.2, two sets of experiments had been conducted for testing MapReduce and Tez jobs; in particular, for the MapReduce validation, the experiments were performed on Hive ad-hoc MapReduce and Tez queries. In order to extend our work and examine the effectiveness of the hybrid algorithm, we executed Q40 (see Figure 39) from the official TPC-DS benchmark on Spark and analysed the outcomes. The experimental results are used to feed the synthetic data set that forms the initial KB.

Figure 39 - TPC-DS Q40


The data set size has been set to 250GB. The experiments were executed at CINECA, the Italian supercomputing center. PICO, the Big Data cluster available at CINECA, is composed of 74 nodes, each of them boasting two Intel Xeon 10-core 2670 v2 @2.5 GHz processors, with 128 GB of RAM. Out of the 74 nodes, 66 are available for computation. The storage is constituted by 4 PB of high-throughput disks based on the GSS technology. For the Hadoop experiments we deployed the Hadoop 2.5.1 release and Apache Tez 0.6.2, while for the Spark applications version 1.6.0 was considered. The cluster is shared among different users; resources are managed by the Portable Batch System (PBS), which allows submitting jobs and checking their progress, configuring the computational requirements at a fine-grained level. For all submissions it is possible to request a number of nodes and to define how many CPUs and what amount of memory are needed. The YARN Capacity Scheduler is set up to provide one container per core. Since the cluster is shared among different users, the performance of single jobs depends on the overall system load, even though PBS tries to split the resources. Due to this, it is possible to have large variations in performance according to the total usage of the cluster and to network contention.

In particular, storage is not handled directly by PBS, thus leading to an even greater impact on performance. Overall, query execution required about 20,000 CPU hours on the PICO cluster. For each configuration, execution runs farther than three standard deviations from the mean have been considered outliers and have been discarded. In SVR training, weighting is used as a means to suggest that the ML model should give more relevance and trust to real samples than to synthetic ones. Therefore, the weight of real data is assumed to be five times the weight of analytical data in all experiments regarding the hybrid approach.
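A minimal sketch of this weighting scheme with scikit-learn's SVR is shown below; the toy feature layout and values are illustrative and do not correspond to the project's actual training pipeline.

```python
# Illustrative SVR training in which real samples weigh five times more than
# synthetic (analytical-model) samples, as assumed in the hybrid approach.
import numpy as np
from sklearn.svm import SVR

# Toy knowledge base: [number of cores, dataset size in GB] -> response time [s]
X_synthetic = np.array([[20, 250], [60, 250], [100, 250], [120, 250]], dtype=float)
y_synthetic = np.array([900.0, 420.0, 300.0, 270.0])
X_real = np.array([[40, 250], [80, 250]], dtype=float)
y_real = np.array([560.0, 330.0])

X = np.vstack([X_synthetic, X_real])
y = np.concatenate([y_synthetic, y_real])
weights = np.concatenate([np.ones(len(y_synthetic)),      # synthetic samples: weight 1
                          5.0 * np.ones(len(y_real))])    # real samples: weight 5

model = SVR(kernel="rbf", C=100.0, epsilon=0.1)
model.fit(X, y, sample_weight=weights)
print(model.predict(np.array([[72, 250]])))  # predict an unseen configuration
```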

For validating the effectiveness of the proposed hybrid approach in a comparative manner, two

solutions that lack AM are considered:

1. Basic Machine Learning (BML): relies on SVR for the computation of the regression function. In

this case, the algorithm is fed with the same operational data applied by our hybrid ML at the

final iteration.

2. Iterative Machine Learning (IML): in this approach, operational data are added iteratively and the initial KB is empty, lacking the AM information. In other terms, we considered the general structure of the HML Algorithm (Figure 5) except lines 2 and 3, which correspond to the AM involvement.

To compare quantitatively our hybrid approach with the BML and IML baseline methods, three performance measures are defined:

1. The MAPE of response time: this measure focuses on the prediction accuracy. It is defined as the percentage of relative error of the response time predicted by the learned model with respect to the expected value of the response times acquired from the operational system. MAPE is used to evaluate both extrapolation and interpolation capabilities.

2. The number of iterations: the number of iterations of the external loop of the HML Algorithm (see Figure 5), which equals the number of real data samples fed into the ML model.

3. The cost: this measure focuses on the cloud expenses for model construction. It is defined by the formula sketched below, where AC is the set of configurations used for model selection and training, NC_i and RTime_i are the number of cores and the execution time associated with each configuration i, Ne_i is the number of operational data points of configuration i used for training the model, and P is the price of using one core per time unit.
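The cost expression is rendered as an image in the original document; a plausible reconstruction from the definitions above (an assumption, not the verbatim formula), together with the usual MAPE definition used for the first measure, is:

\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{j=1}^{n} \frac{\left|\widehat{RT}_j - RT_j\right|}{RT_j} \]

\[ \mathrm{Cost} = P \sum_{i \in AC} NC_i \cdot RTime_i \cdot Ne_i \]

where n is the number of test configurations and \(\widehat{RT}_j\) and \(RT_j\) denote the predicted and measured response times, respectively.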

The DAG structure of Q40 is shown in Figure 40.

Figure 40 - Q40 DAG


The profiling phase has been conducted by extracting the number of tasks and the average task durations from around ten runs with the same configuration.

The four key steps of the proposed hybrid procedure, the comparison with the IML approach and the outcomes are presented below. In order to generate the data for the analytical model based on the approximation formula, the script named retrieve_response_times_with_formula is used, while the compare_prediction_errors script is executed to find the optimal thresholds which minimize the mean relative error. Finally, the computeExtrapolationFromRight, computeExtrapolationFromLeft and computeInterpolation scripts are applied in order to investigate and draw conclusions on the extrapolation and interpolation capabilities of the hybrid algorithm.

1) Data from the analytical model: In Figure 41 (a) we compare the average execution times for every

configuration obtained across the experiments to the ones obtained from Equation (4). The average

relative error of the values obtained from the approximation formula is around 34%.

2) Finding the optimal thresholds: The optimal combination of (itrThr, stopThr) was determined only for the IML approach by applying the approximated formula on the Q40 data, and this combination was equal to (25, 15). For every threshold combination, the algorithm was run for 50 different seed values to generate different results. Conversely, in the case of the hybrid approach, we used the same thresholds (34, 23) obtained for the R1 query (see EUBra-BIGSEA Deliverable D3.2).


Figure 41 - (a) Comparison of formula-based approximation and the mean values of real data, (b) right extrapolation and (c)

cost for Q40

3) Extrapolation capabilities on many cores: The right-extrapolation capability analysis results are reported in Figure 41 (b) and (c). Figure 41 (b) shows that the proposed approach outperforms IML, always providing a lower MAPE on the test set. What is remarkable is that, as we move to the left side, where more points are missing, the MAPE of IML increases dramatically, demonstrating the dominance of our proposed approach. In particular, when only the configuration with 120 cores is missing, the error of the hybrid approach is approximately 7%, compared to 15% for IML. However, when 5 points are missing from the configuration set, the error of IML shoots up to 30%, while in the case of the hybrid algorithm only a small increase is observed (11%).

4) Interpolation capability: Concerning interpolation, we considered three different scenarios when

applying hybrid algorithm to Q40: (i) three points 20, 72 and 120 cores are missing (ii) five points 20, 48,

72, 100, 120 cores are missing, and finally (iii) seven points 20, 40, 48, 72, 80, 100 and 120 cores are

missing. Concerning the relative error, we observe from Figure 42 that the IML approach gives better

accuracy compared to our approach.

Figure 42 - Interpolation for Q40

Although the hybrid error for three missing points is high (17%), in the case of seven missing points the hybrid algorithm gets closer to the IML performance. Moreover, the number of iterations of the hybrid approach is smaller than that of IML, and better costs can be achieved, as reported in Figure 43.


Figure 43 - Cost of the interpolation analysis results


12. Optimization service validation

This chapter describes the validation of OPT_IC and OPT_JR optimization tools considering both a case

study application developed in WP7 named BULMA (Section 12.1) and a set of seven test cases, which

include a number of applications from the TPC-DS benchmark (Section 12.2).

12.1. OPT_IC validation on BULMA application

The WS_OPT_IC and the underlying OPT_IC optimization tool have been validated considering the BULMA WP7 case study application developed within task T5.7. The problem faced by BULMA is the following. The task of identifying bus trajectories from sequences of noisy geospatial-temporal data sources is known as a map-matching problem. It consists of performing the linkage between the bus GPS trajectories and their corresponding road segments (i.e., predefined trajectories or shapes) on a digital map. In this sense, BULMA is a novel unsupervised technique capable of matching a bus trajectory with the "correct" shape, considering the cases in which there are multiple shapes for the same route (a common case in many Brazilian cities, e.g., Curitiba and São Paulo). Furthermore, BULMA is able to detect

bus trajectory deviations and mark them in its output. The goal of BULMA is to provide high-quality

integrated geospatial-temporal training data to support predictive machine learning algorithms of

Intelligent Transport Systems (ITS) applications and services.

There are two types of files used by BULMA: shape and GPS files. The first consists of data for the same route and corresponds to a rich, detailed set of georeferenced points describing the trajectory that a bus should follow, i.e., the predefined bus trajectory. It is extracted from the General Transit Feed Specification (GTFS) data, which is an input file (produced by an operator or authority) containing details of transit supply. Regarding GPS files, they contain all the GPS (Global Positioning System) trajectories of a city's bus fleet during a certain period of time. In our case, each GPS file corresponds to all the GPS trajectories of the city bus fleet during a day. Thus, BULMA matches a bus trajectory with the "correct" shape, considering the cases in which there are multiple shapes for the same route.

As an example, Figure 44 includes two shapes describing the trajectories (with the same direction) that a bus could follow related to route 022. Most state-of-the-art techniques are able to detect whether or not a bus is performing route 022. However, to the best of our knowledge, none of them is able to indicate whether a bus (defined to perform route 022) will start its trajectory from point A or B. Note that if a classification error occurs during the generation of the bus trajectory history (e.g., a bus becomes associated with the shape depicted on the right of the figure instead of the correct left one), a predictive algorithm (of WP7) will erroneously estimate an arrival time at point C, since the bus is programmed to finish its trip at point A (according to the shape depicted on the left). For all the aforementioned reasons, BULMA solves this problem with good accuracy.


Figure 44 - Two different shapes describing the trajectories for route 022 that a bus could follow

BULMA has been implemented in Spark and experiments were performed by running the application in

the Microsoft Azure HDInsight platform, considering Spark 2.1.0. Workers were deployed on D4 v2 VMs

(including 8 cores and 28GB of memory) and every executor was configured to run with 8 cores and

16GB of RAM. To check the performance of our tools, a set of different configurations was considered

where a different number of nodes in the cluster was tested. In particular, we performed runs from 1 to 6 nodes, and 5 days of GPS training data were generated.

To validate OPT_IC, we aimed at assessing the quality of the optimal solution it computes. Assuming that the BULMA initial deployment is optimized against a given deadline, we focus on the response time measured in a real cluster provisioned according to the number of VMs determined by the optimization procedure, quantifying the relative gap as a metric of the optimizer accuracy. Formally:

\[ \mathrm{gap} = \frac{D - T_{Real}}{D} \times 100\% \qquad (5) \]


where D is the deadline and TReal the execution time measured on the real cluster, so that possible

misses would yield a negative result.

We considered 56 cases, varying deadline between 5,500 sec and 11,000 sec with step 100 sec, and ran

OPT_IC to determine the optimal resource assignment to meet the QoS constraints on D4 v2 instances.

Figure 45 plots the data we obtained.
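For clarity, the metric of Equation (5) and the deadline sweep can be summarised in a few lines of Python; the measured times used here are placeholders, not the actual experimental values.

```python
# Illustrative evaluation of the OPT_IC validation metric over the deadline sweep.
# t_real_for() stands for the execution time measured on the cluster provisioned
# as suggested by OPT_IC for each deadline (placeholder values below).

def relative_gap(deadline_s, t_real_s):
    """Relative gap of Equation (5): negative values correspond to deadline misses."""
    return (deadline_s - t_real_s) / deadline_s * 100.0

deadlines = range(5500, 11001, 100)              # the 56 deadlines considered
t_real_for = {d: 0.85 * d for d in deadlines}    # placeholder measurements

gaps = [relative_gap(d, t_real_for[d]) for d in deadlines]
misses = sum(1 for g in gaps if g < 0)
print(len(gaps), misses, max(abs(g) for g in gaps))
```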

Figure 45 - OPT_IC percentage error as a function of the BULMA deadline

We experienced a deadline miss in only 5 out of 56 cases (8.9%). Moreover, the relative error is always below 35%, with a worst-case result of 35.12% and an average absolute error of 18.13%. From the plot, we observe some abnormal behaviour (rapid jumps), which is related to changes in the deployment configuration (number of required VMs) suggested by OPT_IC. In particular, the gap in terms of required VMs is at most 1, and the accuracy increases when the deadline is tight, i.e., for larger clusters. Since the initial deployment can also be updated by the pro-active policies module, overall we can conclude that the OPT_IC tool is effective in identifying the minimum cost initial deployment, while also guaranteeing that deadlines are met.

12.2. OPT_JR validation


In the following, the experimentation of the OPT_JR tool on a set of seven test cases is reported. All the

experiments have considered Spark TPC-DS benchmark execution on the IBM Power8 (P8) cluster

available at Politecnico di Milano with 1TB data set size.

Each test case includes a set of applications characterized by: i) an application identifier (i.e., a TPC-DS query), ii) a weight, iii)-iv) two ML parameters describing the application profile, v) the total RAM memory available at the YARN Node Manager, vi) the container size (in terms of RAM) for the application (e.g., the memory devoted to Spark executors), vii) the total number of vCPUs available at the YARN Node Manager, viii) the container size (in terms of vCPUs), ix) the deadline, and finally x) a label describing the current stage. The test details are reported in Tables 21 to 27. The residual capacity N considered for each test is reported in Table 28.

Test 1 is the base test. The following cases introduce variations in terms of weights, deadlines and number of applications. For example, in Test 2 the deadlines of the applications have been increased while the same weights are considered and the capacity N is increased. In Test 3 we consider a pool of 10 applications. The aim of Tests 4 to 7 is to investigate the impact of the weights on the final results: they consider the same set of applications with the same parameters as Test 1, but the weight of Q26 is progressively increased from 2 to 5.

Table 21 - Test1 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 1 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1

Table 22 - Test2 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 1 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 400,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 200,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 200,000 | J1S1


Table 23 - Test3 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 1 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 100,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1
Q26 | 1 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 400,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 200,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 500,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 200,000 | J1S1
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 600,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 500,000 | J1S1

Table 24 - Test4 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 2 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1


Table 25 - Test5 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 3 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1

Table 26 - Test6 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 4 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1

Table 27 - Test7 parameters

app_id | w | chi_0 | chi_c | M | m | V | v | Deadline [ms] | stage_id
Q26 | 5 | 18,906.97517 | 12,945,621.49 | 28 | 8 | 4 | 2 | 200,000 | J1S1
Q52 | 1 | 9,474.291259 | 10,019,343.23 | 28 | 8 | 4 | 2 | 100,000 | J3S3
Q40 | 1 | 38,056.87096 | 20,224,206.15 | 56 | 18 | 4 | 2 | 1,000,000 | J1S1
Q55 | 1 | 10,137.70356 | 9,932,642.77 | 56 | 18 | 4 | 2 | 100,000 | J1S1


Table 28 - Residual capacity for the soft-deadline applications considered in the test cases

Test | N
Test1 | 150
Test2 | 200
Test3 | 220
Test4 | 150
Test5 | 150
Test6 | 150
Test7 | 150

For each test case, the following metrics have been considered:

1. The number of iterations (the maximum value MAX_IT has been fixed to 15) to identify the

final (local) optimal solution;

2. The size of the candidate lists and how this varies with the number of iterations;

3. The value of the total objective function and how this varies with the number of iterations;

4. The gap between the initial number of cores νi identified by the KKT conditions and the final value identified at the local optimum. We also report the number of cores identified by OPT_IC to determine the initial minimum cost deployment;

5. The number of iterations necessary to determine the bound;

6. The total execution time of the OPT_JR tool required to identify the final solution.

Table 29 - Test1 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 32 | 2 | 76 | 48
Q55 | 30 | 1 | 115 | 40
Q40 | 42 | 1 | 22 | 24
Q52 | 28 | 1 | 113 | 36


Table 30 - Test2 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 44 | 1 | 36 | 36
Q55 | 41 | 1 | 53 | 52
Q40 | 58 | 1 | 24 | 3
Q52 | 52 | 1 | 54 | 52

Table 31 - Test3 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 17 | 2 | 76 | 32
Q55 | 15 | 1 | 113 | 28
Q40 | 22 | 1 | 22 | 20
Q55 | 15 | 1 | 115 | 20
Q26 | 17 | 1 | 36 | 20
Q52 | 15 | 1 | 54 | 20
Q40 | 22 | 1 | 46 | 24
Q55 | 15 | 1 | 53 | 16
Q40 | 22 | 1 | 38 | 24
Q55 | 15 | 2 | 21 | 16


Table 32 - Test4 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 41 | 1 | 76 | 52
Q55 | 27 | 1 | 115 | 36
Q40 | 39 | 1 | 22 | 24
Q52 | 25 | 2 | 113 | 36

Table 33 - Test5 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 47 | 2 | 76 | 64
Q55 | 25 | 1 | 115 | 32
Q40 | 36 | 1 | 22 | 20
Q52 | 24 | 1 | 113 | 32

Table 34 - Test6 results

App_id | Bound | #iterations | OPT_IC #cores | Final solution #cores
Q26 | 52 | 2 | 76 | 64
Q55 | 24 | 1 | 115 | 32
Q40 | 34 | 1 | 22 | 20
Q52 | 22 | 1 | 113 | 32


Table 35 - Test7 results

App_id | Bound | #iterations | OPT_IC #cores (DB) | Final solution #cores
Q26 | 55 | 2 | 76 | 68
Q55 | 23 | 1 | 115 | 28
Q40 | 33 | 1 | 22 | 20
Q52 | 21 | 1 | 113 | 32

Table 36 - Total elapsed time for tests 1-7

Test | Use of cached predictor results | Total elapsed time [s]
Test1 | No | 6862.381712
Test1 | Yes | 0.233007
Test2 | No | 3850.414621
Test2 | Yes | 0.139280
Test3 | No | 46466.197959
Test3 | Yes | 2.412670
Test4 | No | 9856.271436
Test4 | Yes | 0.337701
Test5 | No | 10553.975149
Test5 | Yes | 0.366470
Test6 | No | 8543.538833
Test6 | Yes | 0.287102
Test7 | No | 9448.558896
Test7 | Yes | 0.341946

Figure 46 - Total OF vs iterations number (Test3)


Figure 47 - Total OF vs Iterations number (All tests with exception of Test3)


Figure 48 - Candidates List size vs Iterations number

Tables 21 to 27 summarize the detailed parameters considered in the seven test cases. Figure 48 shows the size of the L1 and L2 candidate lists versus the number of iterations over the considered scenarios, while Figures 46 and 47 report the behaviour of the total objective function across iterations (because of the different magnitudes, test case 3 is reported separately in Figure 46, while Figure 47 reports the remaining test cases).

OPT_JR was run on a VirtualBox VM based on Ubuntu 14.04.1 LTS server, hosted on an Intel Xeon Nehalem dual-socket quad-core system with 32 GB of RAM. OPT_JR exploited a single core.

Two approaches were used when running the tests. In the first, all the calculations needed to determine the bounds and the solutions of the local search algorithm were performed on demand, resulting in longer execution times (not acceptable in cases such as Test3, where the number of applications is higher than in the other scenarios). To mitigate the issue, in the second approach OPT_JR caches the output of earlier dagSim or Lundstrom calls in a DB table (called PREDICTOR_CACHE_TABLE, Table 37), resulting in a faster process execution.


Table 37 - PREDICTOR_CACHE_TABLE

Column name     Description                                       Type          Is Primary
application_id  The application identifier                        varchar(100)  Y
dataset_size    The dataset size                                  double        Y
num_cores       The cores number                                  int(11)       Y
val             The output from the predictor                     double        N
predictor       The predictor for which the value was memorized   char          Y
stage           Current DAG's stage                               varchar(10)   Y
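The caching pattern can be illustrated with the following minimal sketch (an illustration only, not the OPT_JR source code; it assumes a MySQL connection via mysql.connector and a hypothetical run_predictor helper that wraps the dagSim/Lundstrom invocation, while the table and column names are those of PREDICTOR_CACHE_TABLE above):

# Minimal sketch of the cache-or-compute pattern used around PREDICTOR_CACHE_TABLE.
# Assumptions (not part of the actual OPT_JR sources): a mysql.connector connection
# and a run_predictor() helper that invokes dagSim or Lundstrom and returns a float.
import mysql.connector

def predict_with_cache(conn, app_id, dataset_size, num_cores, predictor, stage, run_predictor):
    cur = conn.cursor()
    cur.execute(
        "SELECT val FROM PREDICTOR_CACHE_TABLE "
        "WHERE application_id=%s AND dataset_size=%s AND num_cores=%s "
        "AND predictor=%s AND stage=%s",
        (app_id, dataset_size, num_cores, predictor, stage))
    row = cur.fetchone()
    if row is not None:            # cache hit: reuse the memorized predictor output
        return row[0]
    val = run_predictor(app_id, dataset_size, num_cores, stage)  # cache miss: run dagSim/Lundstrom
    cur.execute(
        "INSERT INTO PREDICTOR_CACHE_TABLE "
        "(application_id, dataset_size, num_cores, val, predictor, stage) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (app_id, dataset_size, num_cores, val, predictor, stage))
    conn.commit()
    return val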

However, other options are available to further improve the general performance of OPT_JR. Since clusters need to be managed hierarchically, it is also possible to run dagSim/Lundstrom in a multicore fashion, thus improving performance both in the bound evaluation step and in the neighborhood exploration.

Another approach to reducing the execution time consists in limiting the number of iterations, at the cost of penalizing the precision of the provided solution. Future OPT_JR releases will take into account the above architectural enhancements.

By comparing the parameters and the results of Test1 and Test2, it is possible to see that, by increasing the deadline of an application, the execution time of the overall process tends to decrease (specifically, doubling the value of the deadline reduces the execution time to approximately half of the earlier value). This could be expected, since tighter deadlines (in Test1) make the problem more difficult to solve.


The curves reporting the total objective function (see Figures 46 and 47) decrease progressively at each iteration, confirming that the local search is able to improve the initial heuristic assignment by selecting the best core configuration switch at each iteration. As a whole, the total OF value improved by 24.08% on average, ranging between 20.57% and 31.23% for test cases 1 and 7, respectively. Moreover, the number of iterations needed to reach the best configurations is 12, lower than the maximum of 15 fixed in the experiments, demonstrating that the optimization algorithm identifies the final local optimum solution in a few iterations.

On the other hand, Figure 48 shows the size of the candidate lists L1 and L2 with respect to the number of iterations. In this case, the behaviour is non-monotonic.

Finally, as per Tables 32 to 35, the values of νi and the total number of cores assigned to application 1 (Q26) in the final solution tend to grow as its weight is increased by one unit at a time (Tests 4, 5, 6 and 7), which can also be expected (higher-priority applications get more resources).


13. Conclusions

In this deliverable, we reported the advances in the implementation of the final version of the EUBra-

BIGSEA infrastructure services achieved between M19 and M21. These services include: (i) tools for

predicting big data application performance, (ii) mechanisms for horizontal and vertical elasticity, and

(iii) approaches based on pro-activity and on optimization to perform horizontal and vertical scaling of

running big data applications, as well as load balancing of the infrastructure. The services have been

validated considering industry benchmark applications and also the current version of a subset of the

applications developed within the case study in WP7.


GLOSSARY

Acronym or Name Definition

AM Analytical Model

BSD Berkeley Software Distribution

CPU CAP CPU capacity limiting. Provides a fine-grained control of the CPU resources used by the

virtual CPU, limiting the percentage of physical CPU used by the virtual CPUs

CV1/CV2 Cross Validation Sets

dagSim A discrete event simulator estimating (by simulation) a DAG-based

model execution time

EMaaS Entity Matching-as-a-Service targets the problem of identifying records that refer to the

same entity in the real world

HML Hybrid Machine Learning

HML prototype A set of Octave scripts calculating, via Machine Learning algorithms, the optimal values

used by OPT_IC tool

HSOptimizer Horizontal Scaling Optimizer is an opportunistic service that can provide more VMs

than required if there are idle resources

IOPS Input-Output operations per second. Along with CPU speed, it is the most important

metric for the performance of a computation resource

JMT Java Modelling Tools

KB Knowledge Base

Log Repository A collection of nested folders, positioned and named according to a set of rules, including

the input files for the Performance Predictors

Lundstrom A Lundstrom predictor solving a DAG-based model


MAPEs Mean Absolute Percentage Errors

ML Machine Learning

OPT_IC The tool that provides the initial configuration (optimal number of Virtual Machines). It

calls and executes dagSim

OPT_JR The tool that provides the optimal reconfiguration

Spark Log Parser A tool used to parse the original log files and prepare them to be properly processed by

the Performance Predictors

WS_OPT_IC Web Service invoking OPT_IC tool

WS_OPT_JR Web Service invoking OPT_JR tool


Annex

Appendix A - Hybrid Machine learning module usage manual

This appendix is organized as follows. The first sub-appendix presents the operational guidelines to install and launch the Hybrid Machine Learning tools and is followed by an in-depth user manual. Each step is commented thoroughly with the support of a table showing the parameters and the dependencies among the scripts.

A.1 Deployment and Installation instructions

Three basic steps are required to launch the tool: a) download the collection of scripts and functions, b) install the LIBSVM library, and c) prepare the input log files from Spark applications and the performance values obtained from an analytical model (e.g., an approximate formula or dagSim).

For the deployment of the system the user needs to perform the following steps:

1. Install GNU Octave

2. Install LIBSVM

3. Download the collection of Octave scripts and functions

4. Prepare the input files for MR and Tez jobs and Spark jobs

Prerequisites:

1. The Hybrid Machine Learning Tool consists of a set of scripts implemented in GNU Octave and can be executed on all operating systems that support Octave.

2. Installation of the LIBSVM library.

3. Input files: scripts for processing logs obtained from the execution of queries via MR, DAG-based Tez and Spark jobs on Hadoop and Spark clusters. In the case of Spark logs, the user should download the Spark-log-Parser (see D3.4 Appendix E1.1), which is needed to obtain the appropriate csv files after processing the data logs used as input to the Hybrid Machine Learning tool.

A.2 User Manual

1. Preparation of summary csv log files by executing the Spark-log-Parser

2. Open Octave GNU

3. Navigate to the directory where the retrieve_response_times_with_formula script is located. Open the m file and fill in the required information regarding the values of the parameters described in table 38. When using dagSim or an alternative AM, the relevant data should be provided as a csv file that will be used as input to the compare_prediction_errors script (the user should fill in the path of this csv file in the script).

4. Navigate to the main script, compare_prediction_errors, used to identify the optimal thresholds which minimize the error. Open the m file and fill in the requested information; more details are shown in table 39. Subsequently, execute the script using the command window of Octave.

5. When the execution of compare_prediction_errors is completed, the average error for each combination of thresholds for the Hybrid approach is calculated by the function "computeAvgsOfHybridForOptimumFinding", which finally returns the combination that minimizes the error (a minimal sketch of this search is shown after this list).

6. To investigate the right and left extrapolation, the optimal thresholds are provided as input to the same script as before and the execution starts. The process is repeated so that, at each iteration, one of the available core configurations is moved from the training and cross-validation sets CV1 and CV2 to the test set.

7. To generate the output plots of the right and left extrapolations, scripts

"computeExtrapolationFromRight" and "computeExtrapolationFromLeft" are executed and the

graphs are stored in the folder indicated by the user.

8. In the same way, the interpolation is calculated by using the "compare_prediction_errors" script, while the plots are created after executing "computeInterpolation". Some representative plots of both the extrapolation and interpolation processes are presented in Figure 49.
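Step 5 essentially performs an exhaustive search over threshold pairs. The following Python sketch only illustrates that search (the actual implementation is the Octave function computeAvgsOfHybridForOptimumFinding; avg_mape is a hypothetical helper returning the average MAPE of the hybrid approach for a given pair):

# Illustrative sketch (not the Octave implementation) of the exhaustive search over
# (outer_threshold, inner_threshold) pairs described in step 5: each pair is scored by
# a hypothetical avg_mape() helper and the pair minimizing the average error is returned.
def find_optimal_thresholds(outer_thresholds, inner_thresholds, avg_mape):
    best_pair, best_error = None, float("inf")
    for outer in outer_thresholds:
        for inner in inner_thresholds:
            error = avg_mape(outer, inner)   # average MAPE of the hybrid approach for this pair
            if error < best_error:
                best_pair, best_error = (outer, inner), error
    return best_pair, best_error

# Example ranges, mirroring Table 39 (outer_thresholds = 25:40, inner_thresholds = 15:30):
# best, err = find_optimal_thresholds(range(25, 41), range(15, 31), avg_mape)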


Table 38 - Parameters of the retrieval_response_times_with_formula script

configurations: Number of examined core configurations
    Example: configurations = [20 30 40 48 60 72 80 90 100 108 120]

base_path: The path where the processed logs are located
    Example: base_path = "/path/to/processed/data"

task_idx: Task IDs of the operational data used by the formula
    Example (retrieving task_idx from the csv files):
    head -n 2 10.csv | tail -n 1 | tr , '\n' | grep -n nTask | cut -d : -f 1 | xargs echo
    task_idx = [4 11 18 25 32 39 46 53 60 67];

filename: Filename which includes the results after script execution
    Example: filename = [base_path, "/estimated_response_times.csv"];


Table 39 - Parameters of compare_prediction_errors

base_directory: The directory where the csv files of data are located
    Example: base_directory = /path/to/csv/with/processed/logs

hybrid_csv: The csv file where the results with errors are stored
    Example: hybrid_csv = "/path/to/csv/file/with/the/errors/hybrid.csv"

configuration.runs: Number of examined core configurations
    Example: configuration.runs = [20 30 40 48 60 72 80 90 100 108 120]

configuration.missing_runs: Number of missing core configurations
    Example: configuration.missing_runs = [20 30]

outer_thresholds: Thresholds used to minimize the MAPE on the training set
    Example: outer_thresholds = 25:40 (a range in order to find the optimum threshold)
    or outer_thresholds = 25 (after finding the optimum threshold)

inner_thresholds: Thresholds used to minimize the MAPE on the cross validation set
    Example: inner_thresholds = 15:30 (a range in order to find the optimum threshold)
    or inner_thresholds = 25 (after finding the optimum threshold)

max_inner_iterations: Maximum number of iterations of the algorithm
    Example: max_inner_iterations = 10;

seeds: Seed for the random number generator
    Example: seeds = 101:150;


Figure 49 - (a) Left extrapolation, (b) right extrapolation and (c) interpolation for the R1 query


Appendix B - Performance Prediction and Optimization service enhancements implemented in the new

release

This section is organized in three sub-appendices. The first and the second describe the enhancements implemented in the OPT_IC and WS_Perf modules that allow revising the optimal number of cores and VMs for an application and estimating its residual time while the application is running. Finally, the third is articulated in a technical and an operational guide to support the deployment and the efficient usage of the tools.

Appendix B.1 Performance prediction and Optimization (OPT_IC) service new release

This appendix summarizes the main changes implemented in the WS_2P and WS_OPT_IC services between M17 and M21 (for additional details see deliverable D3.4).

WS_2P Enhancement: Prediction of the residual execution time of an application

With its new version, the performance prediction service WS_2P can also be used to predict the remaining time of a running job. In this way, the web service can be queried every time a stage finishes to obtain insights about the remaining time. If dagSim is adopted as the underlying performance prediction tool, the WS REST call requires the following parameters:

REST API REQUEST

Type: GET

Format: Nodes/Cores/RamG/Dataset/Application_Session_id/Stage_id

For example:

curl http://localhost:8080/bigsea/rest/ws/dagsim/6/2/8G/500/application_1483347394756_1/J2S2


Note that this service has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA Deliverable

D3.4) to retrieve the submission time of the job, hence differently from the other services, it is required

to send the application_session_id rather than the application_id.

An example of output is:

5043 6045

The first value is the predicted average time remaining computed by dagSim. This value is computed as the difference between the expected duration of the entire job and the expected end time of the stage. The second one is a rescaled estimate of the remaining time that also considers the time already spent by the application. The rescaled value is computed according to the following formula, as implied by the rationale below:

RescaledTimeRemaining = EstimatedTimeRemaining × ElapsedTime / EstimatedStageEndTime

The rationale behind this formula is the following: if an application spent 10s (ElapsedTime) to perform

its work that should have been done in 8s (EstimatedStageEndTime), then the application will be

conservatively delayed by the factor of 10/8. The assumption is that such a delaying factor will remain

constant also for the remaining part of the application, i.e., the application will experience the same

resource contention, which will slow it down by the same amount with respect to the conditions

experienced for the application profiling. Note that the second value depends on the time we call the

web service, hence it may differ from the example above.
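For illustration, the rescaling step can be expressed as the following minimal sketch (a simplification, not the WS_2P code; all times are assumed to be expressed in the same unit):

def rescaled_remaining_time(estimated_time_remaining, estimated_stage_end_time, elapsed_time):
    # If the application took elapsed_time to reach a point that was expected at
    # estimated_stage_end_time, the same delay factor is conservatively assumed
    # to hold for the rest of the run.
    delay_factor = elapsed_time / estimated_stage_end_time
    return estimated_time_remaining * delay_factor

# Example: a stage expected to end at 8 s actually ends at 10 s, so a predicted
# remaining time of 4000 ms is rescaled to 4000 * 10 / 8 = 5000 ms.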

To speed up the service, we introduced a lookup table named DAGSIM_STAGE_LOOKUP_TABLE that

stores previously computed values for EstimatedStageEndTime and EstimatedTimeRemaining for each

stage.


A row in this table is shown next:

application_id  num_cores  stage  dataset_size  ramGB  stage_end_time  remaining_time
Q26             12         J0S0   500           8G     1767            624019

The primary key of this table includes the columns (application_id, num_cores, stage, dataset_size). Note that the total number of cores used is computed as the product of the number of nodes and the cores per node.

This lookup table is checked by WS_2P before starting dagSim: if a corresponding entry already exists, then the rescaled time remaining is computed using these values; otherwise, dagSim is started and the computed values for all the stages are stored in this lookup table (see note 26).

If there is no perfect correspondence with a data folder (as in the previous release), dagSim will retrieve the folder representing the closest match in terms of number of nodes and cores. The following table shows some examples:

26 The EstimatedStageEndTime is calculated as follows:
- At first we grab the number of cores used by the running job in the RUNNING_APPLICATION_TABLE
- Then we check if a previously computed value is present in DAGSIM_STAGE_LOOKUP_TABLE
- If it is not the case, then dagSim will be called and its result stored in DAGSIM_STAGE_LOOKUP_TABLE


Content of log repository: 6_2_8G_500, 10_2_8G_500, 16_2_8G_500, 22_2_8G_500

Input log folder     Best match
7 / 2 / 8G / 500     6_2_8G_500
14 / 2 / 8G / 500    16_2_8G_500
25 / 2 / 8G / 500    22_2_8G_500
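A minimal sketch of this closest-match selection is shown below (an illustration only, not the service code; it assumes the nNodes_nCores_Memory_dataset folder naming described in the operational guide, and the tie-breaking rule is an assumption):

import os

def closest_log_folder(log_repository, nodes, cores, ram, dataset):
    # Pick the existing log folder whose node count is closest to the request.
    # Folder names are assumed to follow nNodes_nCores_Memory_dataset (e.g. 16_2_8G_500);
    # only folders with matching cores, RAM and dataset are considered.
    candidates = []
    for name in os.listdir(log_repository):
        parts = name.split("_")
        if (len(parts) == 4 and parts[0].isdigit()
                and parts[1] == str(cores) and parts[2] == ram and parts[3] == str(dataset)):
            candidates.append((abs(int(parts[0]) - nodes), name))
    if not candidates:
        return None
    return min(candidates)[1]

# Example: closest_log_folder("/home/work/TPCDS500-D_processed_logs", 14, 2, "8G", 500)
# would return "16_2_8G_500" given the repository content shown above.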

If the Lundstrom tool is used, the WS REST call requires the following parameters:

REST API REQUEST

Type: GET

Format: Nodes/Cores/RamG/Dataset/Application_Session_id/Stage_id


For example:

curl http://localhost:8080/bigsea/rest/ws/lundstrom/6/2/8G/500/application_1483347394756_1/J2S2

Note that this service has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA Deliverable

D3.4) to retrieve the submission time of the job, hence differently from the other services, it is required

to send the application_session_id rather than the application_id.

An example of output is:

5043 6045

Lundstrom's output follows the same pattern as dagSim's, providing a unified API. The first value is the predicted average time remaining computed by Lundstrom, using the same principles as dagSim, while the second is a rescaled estimate of the remaining time that also considers the time already spent by the application, as discussed for dagSim.

Appendix B.2 WS_OPT_IC Enhancement: Revise the optimal number of cores and VMs while an

application is running

With its new version, the optimization service WS_OPT_IC can also consider the elapsed time of an

application and compute the increase (or decrease) of computational resources to match the deadline

while the application is running. This is done by calling the underlying resource optimization tool with a

rescaled deadline.

The WS rest call requires the following parameters:

REST API REQUEST


Type: GET

Format: Application_Session_Id/Dataset/Deadline/Stage_id

For example:

curl http://localhost:8080/bigsea/rest/ws/resopt/application_1483347394756_1/500/500000/J2S2

Note that this service also has access to the RUNNING_APPLICATION_TABLE (see EUBra-BIGSEA Deliverable D3.4) to retrieve the submission time of the application and the current number of cores on which it is running; hence, as with the WS_2P service, it is required to send the application_session_id rather than the application_id.

An example of output is:

56 11 153000

The first two values are the number of cores and the number of VMs that should be used to satisfy the

deadline.

The third value is the rescaled deadline calculated by the web service. Note that the first and second values should be (almost) the same values that would be returned by a call to OPT_IC with the rescaled deadline.

The rescaled deadline is computed according to the following formula, as implied by the rationale below:

RescaledDeadline = Deadline × EstimatedStageEndTime / ElapsedTime


The rationale behind this formula is the following. OPT_IC computes the optimal configuration assuming that the application has not yet started. WS_OPT_IC computes the new deadline considering the current time of the application and the time at which the current stage is going to finish. So, if an application spent 10s (ElapsedTime) to perform work that should have been done in 8s (EstimatedStageEndTime), then the new configuration will be computed by decreasing the initial deadline by a factor of 8/10. In this way, OPT_IC will determine an initial configuration with a capacity possibly larger than the initial one, speeding up the application execution to recover the accumulated delay. The assumption is that such a delaying factor will remain constant also for the remaining part of the application, i.e., the application will experience the same resource contention, which will slow it down by the same amount with respect to the conditions experienced during application profiling.

The EstimatedStageEndTime is calculated as follows:

- At first, we grab the number of cores used by the running job in the

RUNNING_APPLICATION_TABLE

- Then we check if a previously computed value is present in DAGSIM_STAGE_LOOKUP_TABLE

- If it is not the case, then dagSim will be called and its result stored in

DAGSIM_STAGE_LOOKUP_TABLE

Since the call to OPT_IC may take a long time, we implemented two methods to deal with this problem:

1. Caching the previously computed values for num_cores and num_vm in the lookup table OPTIMIZER_CONFIGURATION_TABLE (see deliverable D3.4).

2. Interpolation of previously computed values when no exact match is found in the lookup table. The interpolation picks the two closest points to the desired deadline and linearly interpolates them to find a first-order approximation of num_cores_opt and num_vm_opt (a minimal sketch is given below).
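The following sketch only illustrates this interpolation step (not the WS_OPT_IC source; the dictionary of known configurations is assumed to come from OPTIMIZER_CONFIGURATION_TABLE and to contain at least two deadlines):

# Minimal sketch of the linear interpolation step described above.
# 'known' maps previously computed deadlines to (num_cores_opt, num_vm_opt) pairs.
def interpolate_configuration(known, deadline):
    d1, d2 = sorted(known, key=lambda d: abs(d - deadline))[:2]  # two closest cached deadlines
    c1, v1 = known[d1]
    c2, v2 = known[d2]
    t = (deadline - d1) / (d2 - d1)          # first-order (linear) interpolation weight
    cores = round(c1 + t * (c2 - c1))
    vms = round(v1 + t * (v2 - v1))
    return cores, vms

# Example, using the OPTIMIZER_CONFIGURATION_TABLE rows shown later for Q26:
# interpolate_configuration({100000: (88, 22), 1000000: (8, 2)}, 500000)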

The full interaction of the WS_OPT_IC and OPT_IC is described in the following diagram:


Figure 50 - Sequence diagram of the WS_OPT_IC service

Note that when the WS_OPT_IC service is called with a deadline which cannot be fulfilled, then the

output will be “Deadline too strict”. There are two cases when the deadline is considered too strict:

1. The elapsed time is greater than or equal to the deadline.

2. The call to OPT_IC with the rescaled deadline fails returning null values for the number of cores

and number of VMs. Since the actual call to OPT_IC is asynchronous, this condition is checked by

considering previously computed values in OPTIMIZER_CONFIGURATION_TABLE to determine

which is the smallest feasible deadline to complete the job.


If there is no perfect correspondence with a data folder (as in the previous release), OPT_IC will retrieve the folder representing the closest match in terms of number of nodes and cores. The following table shows some examples:

Content of log repository: 6_2_8G_500, 10_2_8G_500, 16_2_8G_500, 22_2_8G_500

Input log folder     Best match
7 / 2 / 8G / 500     6_2_8G_500
14 / 2 / 8G / 500    16_2_8G_500
25 / 2 / 8G / 500    22_2_8G_500

Appendix B.3 Performance and optimization service tutorial

This appendix is a tutorial on the performance prediction and optimization services. It consists of a technical and an operational section. The technical section first presents the scripts to create the working environment via a Docker file, whose usage is described step by step.


In the operational guide, the general configuration file (wsi_config.xml) is described. The structure of the

log data files is presented next. The remaining subsections are dedicated to the usage description of

software tools, specifically the discrete event simulator (dagSim) and the optimization tools (OPT_IC and

OPT_JR).

Technical tutorial

The content of this tutorial is hosted on Github.

Open your terminal and clone the repository with the command:

git clone https://github.com/eubr-bigsea/wsi.git

Database Creation

The repository contains a MySQL script to automatically create the table schema.

How to install a MySQL server is out of the scope of this guide; however, the repository also provides a bash script to generate a suitable Docker container.

Creation Database Container (Bash Script)

The repository contains a bash script which generates a MySQL Docker container from the library image `mysql` provided by Docker Hub.

The script will automate the following steps:

1. Create a MySQL docker container (default named: mysql_bigsea). The server instance will listen

on the default port 3306.

2. Create the database.

3. Create the database schema (all needed tables).

4. Create a user (in order to avoid the usage of root).

5. Grant all privileges to that user on the database.

6. Generation of sample data into the tables.

The script can be retrieved from https://github.com/eubr-bigsea/wsi.git

From the root directory of the repository, move into the database scripts directory:

cd Database/


Make the bash script executable:

chmod u+x startNewDockerContainer.sh

Before launching the script, it is highly recommended to configure the passwords. To do so, just open the script with your favourite editor, for example:

vim startNewDockerContainer.sh

The main lines are the following:

MYSQL_ROOT_PASSWORD=4dm1n

MYSQL_USER_PASSWORD=b1g534

If the password is changed, remember that you will have to properly modify the configuration file of the

services later.

Once the script has been modified, you can execute it:

sudo ./startNewDockerContainer.sh

The output should be something like:

6cd3e2d06d414a2504560ac411c4fd05a46508e3ff630e323d99c0e73ca59a76

Wait for connection... (this operation will take about 15 seconds)

mysql: [Warning] Using a password on the command line interface can be insecure.

mysql: [Warning] Using a password on the command line interface can be insecure.


mysql: [Warning] Using a password on the command line interface can be insecure.

mysql: [Warning] Using a password on the command line interface can be insecure.

Done.

You can see the docker container is running with the command:

sudo docker ps

And the output should show the row:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

6cd3e2d06d41 mysql:latest "docker-entrypoint..." About a minute ago Up About a minute

0.0.0.0:3306->3306/tcp mysql_bigsea

The container will expose the default port 3306; the database will therefore be visible from the outside. However, if you need to connect two Docker containers, the procedure is slightly more complicated.

Docker will automatically assign a local IP address to the container, and you will need it. To obtain the IP address you can execute the command:

sudo docker exec -it mysql_bigsea ip addr

The command will show the interfaces inside the Database container. The output should be something

like:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default

qlen 1

link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


inet 127.0.0.1/8 scope host lo

valid_lft forever preferred_lft forever

374: eth0@if375: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default

link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff

inet 172.17.0.3/16 scope global eth0

valid_lft forever preferred_lft forever

The `eth` interface will display the proper internal IP address, in this example: inet 172.17.0.3.

Creation Docker Container Services

Once the database has been built, you are ready to build the final Docker image, which packs all EUBra-BIGSEA services into a single container.

Go to https://github.com/eubr-bigsea/wsi .

From the root repository move into the docker directory

cd WSI/docker/

Before building the Docker image, you need to properly configure the infrastructure. This process is quite easy: just edit the configuration file (wsi_config.xml) with your favourite editor, for example:

vim wsi_config.xml

The important properties to check are the following:

1. The listen port of the database:


<entry key="AppsPropDB_port">3306</entry>

<entry key="OptDB_port">3306</entry>

2. The username for the database:

<entry key="OptDB_user">bigsea</entry>

<entry key="AppsPropDB_user">bigsea</entry>

3. The database IP:

<entry key="AppsPropDB_IP">172.17.0.1</entry>

<entry key="OptDB_IP">172.17.0.1</entry>

4. The database name:

<entry key="AppsPropDB_dbName">bigsea</entry>

<entry key="OptDB_dbName">bigsea</entry>

5. The user password:

<entry key="OptDB_pass">b1g534</entry>

<entry key="AppsPropDB_pass">b1g534</entry>

As you can see, those lines are about the database configuration. You need to set them in accordance

with your database configuration.

You can skip all other configuration properties.

If your database is inside a Docker container, you need to properly configure the container network so that the two containers can communicate with each other. To obtain the internal IP address of a Docker container, refer to the end of the Creation Database Container (Bash Script) subsection.


After you have configured all database properties, you can build the Docker image. From the same directory launch the command:

sudo docker build --no-cache --force-rm -t wsi .

This will build a docker image tagged as wsi. Be patient, the building process may take several minutes.

Once the images have been built, the Web Service can be launched in a docker container.

Just launch the command:

sudo docker run --name wsi_service -d -p 8080:8080 wsi

Useful Commands

1. Display the configuration inside the container:

sudo docker exec -it wsi_service cat wsi_config.xml

2. Restart the container:

sudo docker restart wsi_service

3. Modify the configuration inside the container (with Nano) and restart it:

sudo docker exec -it wsi_service nano wsi_config.xml

sudo docker restart wsi_service

Operational guide

Configuration file

wsi_config.xml is a configuration file located in the home directory. Each line includes the name of a

variable and its value. For example:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

<comment/>

<entry key="RESOPT_HOME">/home/work/OPT_IC</entry>

<entry key="RESULTS_HOME">/home/work/TPCDS500-D_processed_logs</entry>

<entry key="AppsPropDB_port">3306</entry>

<entry key="OptDB_port">3306</entry>

<entry key="DAGSIM_HOME">/home/work/Dagsim</entry>

<entry key="OptDB_user">root</entry>

<entry key="AppsPropDB_user">root</entry>

<entry key="AppsPropDB_IP">localhost</entry>

<entry key="AppsPropDB_dbName">150test</entry>

<entry key="OptDB_tablename">OPT_SESSIONS_RESULTS</entry>

<entry key="OptDB_IP">localhost</entry>

<entry key="UPLOAD_HOME">/home/work/Uploaded</entry>

<entry key="OptDB_pass">test</entry>

<entry key="AppsPropDB_pass">test</entry>

<entry key="OptDB_dbName">150test</entry>

</properties>


This file contains information about the location of tools such as dagSim and OPT_IC, the data log files

and the DB credentials.

Data Log files

It is expected that the folder containing the data log files is structured according to the following format:

root_folder_name (this can be any name you like)
    nNodes_nCores_Memory_dataset
        application_id
            logs

For example:

/home/work/TPCDS500-D_processed_logs

├── 10_2_8G_500

│ ├── Q26

│ │ ├── logs

etc.

Currently, an already processed set of data located under /home/work/TPCDS500-D_processed_logs is installed at the Docker level. If the user wishes to use a different set of log files, the new data structure must comply with the above format. The wsi_config.xml file must also be set up accordingly (RESULTS_HOME entry).

Spark log parser

The Spark log parser is a script called process_logs.sh, located under /home/work/SparkLogParser-master. This tool requires (at least in one of the ways it can be used) the presence of the discrete event simulator dagSim. The Spark log parser can be configured through a file called config.txt, such as:

#APP_REGEX='(app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'

APP_REGEX='(eventLogs-app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'

## DAGSIM parameters

## These apply only to process_logs.sh

# DagSim executable

DAGSIM=/home/work/Dagsim/dagsim.sh

# Number of users accessing the system

DAGSIM_USERS=1

# Distribution of the think time for the users.

# This element is a distribution with the same

# format as the task running times

DAGSIM_UTHINKTIMEDISTR_TYPE="exp"

DAGSIM_UTHINKTIMEDISTR_PARAMS="{rate = 0.001}"


# Total number of jobs to simulate

DAGSIM_MAXJOBS=1000

# Coefficient for the Confidence Intervals

# 99% 2.576

# 98% 2.326

# 95% 1.96

# 90% 1.645

DAGSIM_CONFINTCOEFF=1.96

q26

While most of the above parameters refer to dagSim, the APP_REGEX variable:

APP_REGEX='(eventLogs-app(lication)?[-_][0-9]+[-_][0-9]+)|(\w+-\w+-\w+-\w+-\w+-[0-9]+)'

allows processing raw log files with different prefixes (in this case, everything starting with "eventLogs-app").

To parse the raw data log files, you can use the following:

./process_logs.sh -p <data_log_foldername>


For example:

./process_logs.sh -p /home/work/TPCDS500-D_processed_logs

This step creates several subfolders in every logs folder.

The second step simulates the content of the log files defined so far by using dagSim. This is achieved with the command:

./process_logs.sh -s <data_log_foldername>

For example:

./process_logs.sh -s /home/work/TPCDS500-D_processed_logs

The content of the initial folder would be now:

├── 10_2_8G_500

│ ├── Q26

│ │ ├── logs


│ │ │ ├── application_1483347394756_0167

│ │ │ ├── application_1483347394756_0167_csv

│ │ │ │ ├── app_1.csv

│ │ │ │ ├── application_1483347394756_0167.lua

│ │ │ │ ├── application_1483347394756_0167.lua.template

│ │ │ │ ├── ConfigApp_1.txt

│ │ │ │ ├── ConfigApp_1.txt.bak

│ │ │ │ ├── J0S0.txt

│ │ │ │ ├── J1S1.txt

│ │ │ │ ├── J2S2.txt

│ │ │ │ ├── J3S3.txt

│ │ │ │ ├── J3S4.txt

│ │ │ │ ├── J3S5.txt

│ │ │ │ ├── J3S6.txt

│ │ │ │ ├── jobs_1.csv

│ │ │ │ ├── stages_1.csv

│ │ │ │ └── tasks_1.csv

etc.

dagSim

The discrete event simulator dagSim can be invoked by a Web Service. For example:


curl localhost:8080/bigsea/rest/ws/dagsim/5/2/8G/500/Q26

If a sub-folder called 5_2_8G_500 does not exist, the service will consider the best match among the existing sub-folders.

The logs sub-folder contains several sub-folders; in any case, dagSim always considers the first one (with respect to the corresponding lua configuration file).

OPT_IC

The OPT_IC optimizer can be invoked by a Web Service, producing as output the optimal number of cores and VMs. Both values are then stored in OPTIMIZER_CONFIGURATION_TABLE.

The Web Service can be used in two scenarios: either for a known application or for a new one. To verify whether the application already exists in the table, use the SQL statement:

select * from APPLICATION_PROFILE_TABLE;

Information regarding the output produced by the OPT_IC Web Service is stored in optimizer_configuration_table, whose content can be inspected with the SQL statement:

select * from optimizer_configuration_table;

This query may return, for example, the following result:


mysql> SELECT * FROM OPTIMIZER_CONFIGURATION_TABLE;

+----------------+--------------+----------+---------------+------------+

| application_id | dataset_size | deadline | num_cores_opt | num_vm_opt |

+----------------+--------------+----------+---------------+------------+

| Q26 | 500 | 100000 | 88 | 22 |

| Q26 | 500 | 1000000 | 8 | 2 |

| Q52 | 500 | 200000 | 42 | 11 |

| Q52 | 500 | 800000 | 11 | 3 |

If the APPLICATION_PROFILE_TABLE includes a row for the considered application, the user can invoke

OPT_IC by calling the Web Service as follows:

curl localhost:8080/bigsea/rest/ws/resopt/application_id/dataset/deadline

For example:

curl localhost:8080/bigsea/rest/ws/resopt/Q52/500/800000

If an identical configuration (application_id, dataset, deadline) exists in the DB, the Web Service

retrieves the output values without performing further calculations; otherwise, a linear extrapolation

based on the two closest configurations is performed. Finally, the optimizer is asynchronously invoked

storing the actual output values on the DB for future reference.


Note that if the application is used for the first time, a preliminary procedure performed by the shell script initialize.sh must be followed. This script automatically implements the following steps:

1. Verify that APPLICATION_PROFILE_TABLE includes the profile for the application session; if not, the script exits with an error;

2. Add two rows to OPTIMIZER_CONFIGURATION_TABLE (with mock values for dataset_size, deadline, num_cores_opt and num_vm_opt);

3. Execute twice the Web Service invoking OPT_IC as shown earlier (the application_id must be the one considered; the dataset is the one used in the log data files, while the choice of the deadline is left to the user): these calls store in OPTIMIZER_CONFIGURATION_TABLE two rows containing the number of cores and VMs for a specific configuration;

4. At this point, OPTIMIZER_CONFIGURATION_TABLE should contain 4 rows for the new application (2 mock rows and 2 real rows);

5. Before ending, the script removes the 2 mock rows created at step 2.

The script is invoked as follows:

./initialize.sh <application_id>

For example:

./initialize.sh query52

OPT_IC can be invoked by using a shell script as follows:

./script.sh <application_id> <dataset_size> <deadline>


For example:

./script.sh query52 1000 200000

OPT_JR

OPT_JR is a tool that determines the best application configuration in terms of number of cores, minimising the tardiness of soft-deadline applications under heavy load conditions. OPT_JR applies a local search algorithm to explore the applications' neighborhood and find the best core configuration. The process is iterative and is performed until the applications' configuration cannot be refined anymore. The output of the algorithm is stored in OPT_SESSION_RESULTS_TABLE.

OPT_JR uses a performance predictor that can be either dagSim or Lundstrom. Performance is greatly improved by initially estimating a list of applications regarded as candidates for evaluation and then applying the actual predictor only to them. Results from the predictor are cached in PREDICTOR_CACHE_TABLE and reused when necessary.
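The overall flow of the local search can be summarized by the following sketch (an illustration of the procedure described above, not the OPT_JR sources; evaluate_of and candidate_switches are hypothetical helpers):

# Illustrative sketch of the local-search loop described above (not the OPT_JR code).
# 'assignment' maps application ids to their current number of cores; evaluate_of()
# scores a configuration (total objective function, using the cached predictor) and
# candidate_switches() enumerates core switches between pairs of applications.
def local_search(assignment, evaluate_of, candidate_switches, max_iterations):
    best_of = evaluate_of(assignment)
    for _ in range(max_iterations):
        improved = False
        for candidate in candidate_switches(assignment):   # neighborhood exploration
            of_value = evaluate_of(candidate)
            if of_value < best_of:                          # keep the best switch found so far
                assignment, best_of, improved = candidate, of_value, True
        if not improved:                                    # local optimum reached
            break
    return assignment, best_of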

OPT_JR can be invoked as follows:

./OPT_JR -f=<filename> -n=<N> -k=<number> -d=<y|n> -i=<ITERATIONS> -s=<PREDICTOR> -g=<Y|N>

Where:


1. <filename> is a file including one or more application profiles in csv format, stored in a folder whose path is specified in the global configuration file wsi_config.xml ("UPLOAD_HOME");

2. <N> is the maximum number of cores available;

3. -d specifies the debug option. When this option is chosen, the tool works in verbose mode, printing a series of messages to be used in a debug session;

4. <ITERATIONS> specifies the maximum number of iterations of the local search algorithm;

5. <PREDICTOR> is either "dagSim" or "lundstrom";

6. -g specifies whether the global objective function is calculated at each iteration, which slows down the execution. This option can be used for debugging purposes.

For example:

./OPT_JR -f="Test1.csv" -n=220 -k=0 -d=y -i=10 -s=dagSim -g=Y

OPT_JR requires the global configuration file wsi_config.xml and a performance predictor (dagSim or Lundstrom). The web service which provides a REST API to OPT_JR is called WS_OPT_JR.

The following script shows an example of the way the WS_OPT_JR web service can be invoked:

#!/bin/bash

WSI_IP="localhost"
WSI_PORT="8080"

APP_IDS=("application_1483347394756_0" \
         "application_1483347394756_1")

NUM_CORES=$(( (RANDOM % 1001) + 1))

# Open new session
echo "Open a new session..."
SID=$(curl "http://${WSI_IP}:${WSI_PORT}/WSI/session/new")
echo "Session ID ${SID}"

# Set the number of calls
NUMCALLS=${#APP_IDS[@]}
echo "Setting ${NUMCALLS} calls and ${NUM_CORES} cores"
curl -X POST "http://${WSI_IP}:${WSI_PORT}/WSI/session/setcalls?SID=${SID}&ncalls=${NUMCALLS}&ncores=${NUM_CORES}"
echo ""

# Set app param for each app
for app in ${APP_IDS[@]}; do
    echo "Setting application: ${app}";
    curl -X POST -H "Content-Type: text/plain" -d "${app} 3.14 3.14" "http://${WSI_IP}:${WSI_PORT}/WSI/session/setparams?SID=${SID}"
    echo ""
done;


Appendix C - Controller Service Guide

Installation and configuration

In the virtual machine where you want to install the controller, follow the steps below:

1. Install git and clone the EUBra-BIGSEA Controller repository

$ sudo apt-get install git

$ git clone https://github.com/bigsea-ufcg/bigsea-controller

2. Access the EUBra-BIGSEA Controller folder and run the setup script

$ cd bigsea-controller/

$ ./setup.sh # You must run this command as superuser to install some requirements

3. Write the configuration file (controller.cfg) with the required information contained into

controller.cfg.template.

4. Start the service.

$ ./run.sh

Creating a new controller plugin

1. Write a new python class

It must implement the methods __init__, start_application_scaling, stop_application_scaling and status.

__init__(self, application_id, parameters)

Creates a new controller which scales the given application using the given parameters.

application_id: the id of the application to scale.

parameters: a dictionary containing the scaling parameters.


Example:

{
    "instances": [instance_id_0, instance_id_1, ..., instance_id_n],
    "check_interval": 5,
    "trigger_down": 10,
    "trigger_up": 10,
    "min_cap": 10,
    "max_cap": 100,
    "actuation_size": 25,
    "metric_rounding": 2,
    "actuator": "basic",
    "metric_source": "monasca",
    "application_type": "os_generic"
}

start_application_scaling(self)

Starts scaling for an application. This method is used as the run method of a thread.

stop_application_scaling(self)

Stops scaling of an application.

status(self)

Returns information on the status of the scaling of applications, normally as a string.


Example:

class MyController:

def __init__(self, application_id, parameters):

# set things up

pass

def start_application_scaling(self):

# scaling logic

pass

def stop_application_scaling(self):

# stop logic

pass

def status(self):

# status logic

return status_string

2. Add the class to the controller project, in the plugins directory.

3. Edit Controller_Builder

Add a new condition to get_controller and instantiate the plugin using the new condition.

Example:


...

elif name == "mycontroller":

# prepare parameters

return MyController(application_id, parameters)

...

The string used in the condition must be passed in the request to start scaling as the value of the "plugin" parameter.

The examples of controller plugins contained in the repository are composed of two classes: the Controller and the Alarm. The Alarm contains the logic used to adjust the number of resources allocated to applications. The Controller dictates the pace of the scaling process: it controls when the Alarm is called to check the application state and when it is necessary to wait.
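This Controller/Alarm interaction can be sketched as follows (a minimal illustration of the pattern described above, not one of the repository plugins; the Alarm's check_application_state method name is hypothetical):

import time

# Minimal illustration of the Controller/Alarm pattern: the Controller paces the loop,
# the Alarm holds the scaling logic. check_application_state() is a hypothetical name.
class MyController:
    def __init__(self, application_id, parameters, alarm):
        self.application_id = application_id
        self.check_interval = parameters["check_interval"]
        self.alarm = alarm
        self._running = False

    def start_application_scaling(self):
        self._running = True
        while self._running:
            self.alarm.check_application_state(self.application_id)  # Alarm adjusts resources
            time.sleep(self.check_interval)                          # Controller decides when to wait

    def stop_application_scaling(self):
        self._running = False

    def status(self):
        return "scaling %s every %ss" % (self.application_id, self.check_interval)

# start_application_scaling is typically used as the run target of a thread, e.g.:
# threading.Thread(target=controller.start_application_scaling).start()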

Creating a new actuator plugin

1. Write a new python class

It must extend Actuator and implement the methods prepare_environment, adjust_resources,

get_allocated_resources and get_allocated_resources_to_cluster. See the example below:

class MyActuator(Actuator):

def __init__(self, params):

# set things up

pass

def prepare_environment(self, vm_data):

# prepare environment logic

pass

def adjust_resources(self, vm_data):


# adjust resources logic

pass

def get_allocated_resources(self, vm_id):

# get allocated resources logic

pass

def get_allocated_resources_to_cluster(self, vms_ids):

# get allocated resources to cluster logic

pass

2. Add the class to the controller project, in the plugins directory.

3. Edit Actuator_Builder

Add a new condition to get_actuator and instantiate the plugin using the new condition. See the example below:

...

elif name == "myactuator":

# prepare parameters

return MyActuator(params)

...

The string used in the condition must be passed in the request to start scaling, using the Controller's REST API, as the value of the "actuator" parameter.


Appendix D - Load Balancer Guide

This appendix will help you deploy the EUBra-BIGSEA WP3 Load Balancer service on a machine with a fresh Linux installation, and also provides information about how to configure the different heuristics. Please make sure that the machine on which you are installing the service has access to the KVM hypervisors on the infrastructure hosts and to the following OpenStack services: Keystone, Nova, and Monasca. The current version only supports OpenStack infrastructures and assumes shared storage for the live migration process.

Installation

You can install the Load Balancer on any physical or virtual machine; the minimal configuration is described below.

Minimal configuration:
1. OS: Ubuntu 14.04
2. CPU: 1 core
3. Memory: 2 GB of RAM
4. Disk: there are no disk requirements.

1. Update and upgrade your machine

$ sudo apt-get update && sudo apt-get upgrade

2. Install the necessary packages (python dependencies, git, and pip)

$ sudo apt-get install python-setuptools python-dev build-essential

$ sudo easy_install pip

$ sudo apt-get install git

$ sudo apt-get install libssl-dev

3. Clone the Load Balancer repository

$ git clone https://github.com/bigsea-ufcg/bigsea-loadbalancer.git


4. Access the Load Balancer folder and run the install script.

$ cd bigsea-loadbalancer/

# Some requirements require sudo

$ sudo ./install.sh

Configuration

The configuration file for the load balancer is currently composed of the following sections: monitoring, heuristic, infrastructure, openstack, and optimizer. Below we describe the template for the configuration file with an explanation of each section and its attributes.

# Section to configure information used to access the Monasca.

[monitoring]

# Username that will be used to authenticate

username=<@username>

# Password that will be used to authenticate

password=<@password>

# The project name that will be used to authenticate

project_name=<@project_name>

# The authentication url that the monasca use.

auth_url=<@auth_url>

# Monasca api version

monasca_api_version=v2_0

# Section to configure all heuristic information.

[heuristic]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension


module=<module_name>

# The class name inside the given module; this class should implement BasicHeuristic

class=<class_name>

#Number of seconds before executing the heuristic again

period=<value>

# A float value that represents the ratio of the number of CPUs in the hosts (overcommit factor)

cpu_ratio=0.5

# An integer that represents the number of rounds an instance needs to wait before being migrated again

# Each round represents an execution of the loadbalancer

wait_rounds= 1

# All information about the infrastructure that the Load Balancer will have access to

[infrastructure]

# The user that has access to each host

user=<username>

#List of full hostnames of servers that the loadbalancer will manage (separated by comma).

#e.g compute1.mylab.edu.br

hosts=<host>,<host2>

#The key used to access the hosts

key=<key_path>

#The type of IaaS provider of your infrastructure, e.g. OpenStack, OpenNebula

provider=OpenStack

# Section to configure OpenStack credentials used for Keystone and Nova

[openstack]


username=<@username>

password=<@password>

user_domain_name=<@user_domain_name>

project_name=<@project_name>

project_domain_name=<@project_domain_name>

auth_url=<@auth_url>

# Section to configure Optimizer services

[optimizer]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension

module=<optimizer_module_name>

# The class name inside the given module; this class should implement BaseOptimizer

class=<optimizer_class_name>

# The url to make the request to the optimizer service

request_url=<http://url/...>

# The type of the request: GET, POST, etc...

request_type=<type>

# Parameters to be used with the request url

request_params=<params>
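To make the cpu_ratio parameter concrete with a purely illustrative example: assuming a host with 16 physical cores and cpu_ratio=0.5, the heuristics would treat that host as having 16 × 0.5 = 8 cores' worth of capacity, triggering migrations when the CPU cap allocated on the host exceeds that budget (the exact formula depends on the heuristic chosen).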

Heuristics Configuration

Below we give the configuration of the heuristic section for each heuristic available in our repository; examples27 can also be found in the repository.

27 https://github.com/bigsea-ufcg/bigsea-loadbalancer/tree/master/examples


BalanceInstancesOS heuristic configuration

[heuristic]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension

module=instances

# The class name inside the given module; this class should implement BasicHeuristic

class=BalanceInstancesOS

#Number of seconds before executing the heuristic again

period=600

# A float value that represents the ratio of the number of CPUs in the hosts (overcommit factor)

cpu_ratio=1

# An integer that represents the number of rounds an instance needs to wait before being migrated again

# Each round represents an execution of the loadbalancer

wait_rounds=1

CPUCapAware heuristic configuration

[heuristic]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension

module=cpu_capacity

# The class name inside the given module; this class should implement BasicHeuristic

class=CPUCapAware

#Number of seconds before executing the heuristic again

period=600

# A float value that represents the ratio of the number of CPUs in the hosts.

cpu_ratio=1


# An integer that represents the number of rounds an instance needs to wait before being migrated again

# Each round represents an execution of the loadbalancer

wait_rounds=1

SysbenchPerfCPUCap heuristic configuration

[heuristic]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension

module=benchmark_performance

# The class name inside the given module; this class should implement BasicHeuristic

class=SysbenchPerfCPUCap

#Number of seconds before executing the heuristic again

period=600

# A float value that represents the ratio of the number of CPUs in the hosts.

cpu_ratio=1

# An integer that represents the number of rounds an instance needs to wait before being migrated again

# Each round represents an execution of the loadbalancer

wait_rounds=1

Optimizer Configuration

The only available plugin is the HSOptimizer, which uses the HSOptimizer28 service. The configuration of the optimizer section in the configuration file is shown below. The request_params parameter should receive a dictionary with username, project_id, password, and auth_ip to work properly.

28 https://github.com/bigsea-ufcg/bigsea-hsoptimizer


[optimizer]

# The filename for the module that is located in /loadbalancer/service/heuristic/

# without .py extension

module=hsoptimizer

# The class name inside the given module; this class should implement BaseOptimizer

class=HSOptimizer

# The url to make the request to the optimizer service

request_url=http://IP:1517/optimizer/get_preemtible_instances

# The type of the request: GET, POST, etc...

request_type=GET

# Parameters to be used with the request url

request_params={'username': 'myuser', 'project_id': '98sab49kdo1280zlapq', 'domain': 'mydomain', 'password': 'mypassword', 'auth_ip': 'http://mycloud.domain.com'}

Execution

To execute the Load Balancer you need to access the directory where you have cloned the repository

and set the PYTHONPATH environment variable.

$ cd bigsea-loadbalancer/

$ export PYTHONPATH=":"`pwd`

Now you can run it using the default configuration file name, or by passing a configuration file name explicitly.

$ python loadbalancer/cli/main.py

$ python loadbalancer/cli/main.py -conf load_balancer.cfg


Creating a new Heuristic

To create a new heuristic you just need to follow the steps below:

1. Create a python module file in loadbalancer/service/heuristic directory

2. In the module file, create a class that inherits BaseHeuristic class from

loadbalancer/service/heuristic/base.py

from loadbalancer.service.heuristic.base import BaseHeuristic

...

class MyNewHeuristic(BaseHeuristic):

    def __init__(self, **kwargs):
        # read the heuristic-specific parameters from kwargs
        ...

    def collect_information(self):
        # gather the monitoring data the heuristic needs
        return ...

    def decision(self):
        # decide which instances should be migrated and trigger the migrations
        ...

3. You must override the collect_information and decision methods in your class.
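Assuming, purely as an example, that the module file created in step 1 is named my_new_heuristic.py (a hypothetical name), the heuristic section of the Load Balancer configuration file would point to the new class like this:

[heuristic]
module=my_new_heuristic
class=MyNewHeuristic
period=600
cpu_ratio=1
wait_rounds=1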

Creating a new Optimizer Plugin

To create a new optimizer plugin you just need to follow the steps below:


1. Create a python module file in loadbalancer/service/heuristic directory

2. In the module file, create a class that inherits BaseOptimizer class from

loadbalancer/service/optimizer/base.py

from loadbalancer.service.optimizer.base import BaseOptimizer

...

class MyOptimizer(BaseOptimizer):

    def __init__(self, **kwargs):
        # read the optimizer-specific parameters from kwargs
        ...

    def request_instances(self):
        # query the optimizer service and return the instances it reports
        return ...

    def decision(self, **kwargs):
        # decide what to do with the instances returned by the service
        return ...

3. You must override the request_instances and decision methods in your class.
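Assuming, again purely as an example, that the module file is named my_optimizer.py (a hypothetical name), the optimizer section of the configuration file would reference the new plugin as follows, with the request_* values filled in for the service being called:

[optimizer]
module=my_optimizer
class=MyOptimizer
request_url=<http://url/...>
request_type=<type>
request_params=<params>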


Appendix E - Monitoring-Hosts Daemon

This appendix will help you to deploy the monitoring-hosts daemon on the hosts of your infrastructure. This daemon is responsible for executing the Sysbench benchmark and publishing the results into Monasca, so that the Load Balancer can use the metrics when the selected heuristic is SysbenchPerfCPUCap.

Installation

You can install the monitoring-hosts daemon on any physical or virtual machine; the minimal configuration is described below.

Minimal configuration:

1. OS: Ubuntu 14.04
2. CPU: 1 core
3. Memory: 2 GB of RAM
4. Disk: there are no disk requirements

1. Update and upgrade your machine

$ sudo apt-get update && sudo apt-get upgrade

2. Install the necessary packages (Python dependencies, git, pip, and sysbench)

$ sudo apt-get install python-setuptools python-dev build-essential

$ sudo easy_install pip

$ sudo apt-get install git

$ sudo apt-get install sysbench

3. Clone the Load Balancer repository

$ git clone https://github.com/bigsea-ufcg/monitoring-hosts.git

4. Access the monitoring-hosts folder and install the requirements


$ cd monitoring-hosts/

# Some requirements require sudo

$ sudo pip install -r requirements.txt --no-cache-dir

Configuration

The configuration file for the monitoring-hosts daemon defines which benchmarks it will run; below we show the default configuration. The file only needs a DEFAULT section containing the information required for the daemon to run. The backend parameter defines whether the benchmark results will be written to files in output_dir or published into Monasca.

[DEFAULT]

type=CPU

name=sysbench

# Full path to access the sysbench.json file that is in the sample directory

parameters=/path/sysbench.json

output_dir=/tmp

backend=

To use Monasca as the backend, set the backend parameter and add a new section called monasca with all the necessary authentication parameters.

[DEFAULT]

type=CPU

name=sysbench

# Full path to access the sysbench.json file that is in the sample directory

parameters=/path/sysbench.json

output_dir=/tmp

backend=OS_MONASCA


[monasca]

username=<@username>

password=<@password>

project_name=<@project_name>

auth_url=<@auth_url>

monasca_api_version=2_0

It is possible to configure two parameters for the execution of sysbench with the CPU test, number_of_threads and max_prime. Those parameters are in the sysbench.json file located in the sample directory of the repository.

{

"number_of_threads": ["8", "4", "2", "1"],

"max_prime": "25000"

}
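For reference, with the values above the benchmark runs are roughly equivalent to invoking sysbench by hand as sketched below, once per configured thread count (sysbench 0.4.x syntax, as shipped with Ubuntu 14.04; how the daemon actually builds the command line may differ):

$ sysbench --test=cpu --num-threads=8 --cpu-max-prime=25000 run
$ sysbench --test=cpu --num-threads=4 --cpu-max-prime=25000 run
$ sysbench --test=cpu --num-threads=2 --cpu-max-prime=25000 run
$ sysbench --test=cpu --num-threads=1 --cpu-max-prime=25000 run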

Execution

The daemon offers a CLI that you can you use to start, stop or restart the execution, below we show the

usage of each option.

usage: python monitoring.py [-h] {start,restart,stop} ...

Monitoring Host Daemon

positional arguments:
  {start,restart,stop}  Operation with the monitoring host daemon.
                        Accepts any of these values: start, stop, restart
    start               Starts python monitoring.py daemon
    restart             Restarts python monitoring.py daemon
    stop                Stops python monitoring.py daemon

optional arguments:
  -h, --help            show this help message and exit


The start option can receive two arguments, time_interval and configuration. The only required argument is configuration, since time_interval has zero as its default value.

ubuntu@sysbench:~/monitoring-hosts$ python monitoring/run.py start -h

usage: python monitoring.py start [-h] [-time TIME_INTERVAL] -conf

CONFIGURATION

optional arguments:

-h, --help show this help message and exit

-time TIME_INTERVAL, --time_interval TIME_INTERVAL

Number of seconds to wait before running the Monitoring

Daemon again.(Integer)

-conf CONFIGURATION, --configuration CONFIGURATION

Filename with all benchmark information

$ python monitoring/run.py start -conf my.cfg

$ python monitoring/run.py start -conf my.cfg -time 120

The stop option requires the configuration argument. Use this option to stop the execution of the

daemon.

usage: python monitoring.py stop [-h] -conf CONFIGURATION

Stops the daemon if it is currently running.

optional arguments:

-h, --help show this help message and exit

-conf CONFIGURATION, --configuration CONFIGURATION

Filename with all benchmark information

$ python monitoring/run.py stop -conf my.cfg


The restart option can receive two arguments, time_interval and configuration. The only required argument is configuration, since time_interval has zero as its default value. You can use it to restart the daemon with a new configuration file and a new time interval.

usage: python monitoring.py restart [-h] [-time TIME_INTERVAL] -conf

CONFIGURATION

optional arguments:

-h, --help show this help message and exit

-time TIME_INTERVAL, --time_interval TIME_INTERVAL

Number of seconds to wait before running the Monitoring

Daemon again. (Integer)

-conf CONFIGURATION, --configuration CONFIGURATION

Filename with all benchmark information

$ python monitoring/run.py restart -conf my.cfg

$ python monitoring/run.py restart -conf my.cfg -time 120

Appendix F - EUBra-BIGSEA Applications Submission Guide

This appendix will guide you on how to execute some EUBra-BIGSEA applications (BTR and BULMA) in a WP3 infrastructure configured with the following components: Broker API, Monitor, Controller, Authorization, and Optimizer.

Submissions of WP4 applications must follow the configuration file format described below:


[manager]
ip = # Broker IP
port = # Broker port
plugin = # For WP4 submissions, the plugin is sahara
cluster_size = # Cluster size
flavor_id = # Flavor ID
image_id = # Image ID
bigsea_username = # AAAaS username
bigsea_password = # AAAaS password

[plugin]
opportunistic = # Details in section “Integration with Opportunistic Services”
args = # Details in sections 4.1.1 and 4.1.2 of this appendix
job_binary_url = # Job binary url
dependencies = # Dependencies to run the app
main_class = # Main class of a jar application such as EMaaS
job_template_name = # Template name of the job (via sahara submission)
job_binary_name = # Binary name (via sahara submission)
plugin_app = # Plugin of the application (via sahara submission)
openstack_plugin = # Openstack plugin (via sahara submission)
plugin_version = # Plugin version (via sahara submission)
job_type = # Job type (via sahara submission)
expected_time = # Expected time (QoS metric)
collect_period = # Collect period (QoS metric)
master_ng = # Master node group
slave_ng = # Slave node group
opportunistic_slave_ng = # Opportunistic slave node group
net_id = # Network ID

[scaler]
starting_cap = # Starting CAP
scaler_plugin = # Scaler plugin
actuator = # Actuator type
metric_source = # Metric source
application_type = # Application type
check_interval = # Check interval
trigger_down = # Trigger down
trigger_up = # Trigger up
min_cap = # Minimum CAP
max_cap = # Maximum CAP
actuation_size = # Actuation size
metric_rounding = # Metric rounding

BTR


The BTR (Best Trip Recommender) is a multi-modal journey recommendation system which predicts future trip characteristics, such as duration and crowdedness, and uses the predictions to rank the trips for the user.

The BTR execution is divided into two parts: pre-processing and training. The two parts must be executed sequentially for a successful run. Each part requires changes in specific fields of the configuration file.

The first change is to edit the job_binary_url parameter in the plugin section of the configuration file. We only consider the usage of HDFS paths to submit these applications. The BTR pre-processing29 and training30 application files can be found in the BTR repository31.

After this, you need to edit the args field of the configuration file in the format described below:

1. Preprocessing: <btr-input-folder> <btr-pre-processing-output-folder> <btr-route-stop-output-folder>

2. Training: <btr-pre-processing-output-folder> <train-info-output-file> <duration-model-output-folder> <pipeline-output-folder>

Finally, fill the dependencies field with com.databricks:spark-csv_2.10:1.5.0.

Example of BTR pre-processing specific fields:

[plugin]

...

29 https://github.com/analytics-ufcg/best-trip-recommender-spark/blob/master/utils/preprocessing_subsequent_stops.py

30 https://github.com/analytics-ufcg/best-trip-recommender-spark/blob/master/src/jobs/train_btr_2.0.py

31 https://github.com/analytics-ufcg/best-trip-recommender-spark


args = hdfs://url/BTR/input/ hdfs://url/BTR/output/preprocessing/ hdfs://url/BTR/output/routestops/

dependencies = com.databricks:spark-csv_2.10:1.5.0

...

job_binary_url = hdfs://url/BTR/preprocessing_subsequent_stops.py

...

Example of BTR training specific fields:

[plugin]

...

args = hdfs://url/BTR/output/preprocessing/ hdfs://url/BTR/output/ hdfs://url/BTR/output/duration/ hdfs://url/BTR/output/pipeline/

dependencies = com.databricks:spark-csv_2.10:1.5.0

...

job_binary_url = hdfs://url/BTR/train_btr_2.0.py

...

BULMA

BULMA, as discussed in Section 12.1, is an unsupervised technique capable of matching a bus trajectory

with the "correct" shape, considering the cases in which there are multiple shapes for the same route.

First, you need a generated JAR of the BULMA project on a remote HDFS or in a Swift container. You can find the files needed to generate a BULMA JAR in the EMaaS repository32. There is no limitation here on whether the path is Swift or HDFS; both work. After you have a BULMA JAR file in a Swift or HDFS path, the next step is to edit the job_binary_url field with the path to the application file.

32 https://github.com/eubr-bigsea/EMaaS


Afterwards, you need to edit the args field of the configuration file in the format described below:

1. <shape-file> <gps-file-or-folder> <output-folder> <partitioning-number>

Now, fill the main_class field with BULMAversion20.RunBULMAv2. If you are using Swift paths, you must also set the fields job_template_name, job_binary_name, openstack_plugin, plugin_version, and job_type.

Example of BULMA specific fields:

[plugin]

...

args = hdfs://url/BULMA/input/shapes.csv hdfs://url/BULMA/input/gps/ hdfs://url/BULMA/output/ 100

main_class = BULMAversion20.RunBULMAv2

job_template_name = EMaaS

job_binary_name = EMaaS

job_binary_url = hdfs://url/BULMA/EMaaSv2.jar

openstack_plugin = spark

plugin_version = 2.1.0

job_type = Spark

...

Submission

With the configuration file of the application you want to run filled in correctly, you just need to run the client33.

Example of execution:

$ python client.py <config-file-of-application-you-want-to-run>

33 https://github.com/bigsea-ufcg/bigsea-manager/blob/master/client/sahara/client.py

Note that you can request the status of the execution through the Broker API.
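As an illustration only, a status query could look like the sketch below; the endpoint path is hypothetical, so check the Broker API documentation for the actual route:

$ curl http://<broker-ip>:<broker-port>/submissions/<submission-id>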


Appendix G - Load Balancer Results

In this appendix we present the analyses of the other two scenarios (initial cap of 25% and 50%). The first scenario, with the initial CPU cap set to 75%, was detailed in Section 9.

Scenario 2 - Initial Cap 25%

In this scenario all virtual machines start with an initial cap of 25%. Table 40 depicts the times at which the Load Balancer executed migrations for each heuristic in all five executions. For BalanceInstancesOS, the migrations occurred in the first and second execution of the Load Balancer; for the other heuristics, the migrations occurred only in the third execution. Analyzing the total_consumption (Figures 51 through 53) and total_cap (Figures 54 through 56) graphs, we can see that after each execution the Load Balancer was able to respect the cpu_ratio value used.

The CPU utilization of the hosts was also collected; the data is shown in Figures 57 through 59. The common pattern in the graphs is a decrease in usage on the overloaded hosts when the Load Balancer starts executing and an increase in utilization on the hosts that receive the migrated virtual machines. The peak in utilization right after the first execution is caused by the removal and creation of virtual machines on the hosts.

Table 40 - Time when migrations started.

Heuristic            | Execution 1  | Execution 2  | Execution 3  | Execution 4  | Execution 5
BalanceInstancesOS   | 14:25, 14:31 | 14:46, 14:52 | 16:54, 17:01 | 17:59, 18:04 | 18:19, 18:24
CPUCapAware          | 19:04        | 19:34        | 19:57        | 20:18        | 20:39
SysbenchPerfCPUCap   | 21:59        | 22:24        | 22:45        | 23:06        | 23:28


Figure 51 - Heuristic BalanceInstancesOS - Total Consumption per host.


Figure 52 - Heuristic CPUCapAware - Total Consumption per host.


Figure 53 - Heuristic SysbenchPerfCPUCap - Total Consumption per host.


Figure 54 - Heuristic BalanceInstancesOS - Total Cap per host.


Figure 55 - Heuristic CPUCapAware - Total Cap per host.


Figure 56 - Heuristic SysbenchPerfCPUCap - Total Cap per host.


Figure 57 - Heuristic BalanceInstancesOS - Host %CPU during executions.


Figure 58 - Heuristic CPUCapAware - Host %CPU during executions.


Figure 59 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.

Scenario 3 - Initial Cap 50%

In this scenario all virtual machines start with an initial cap of 50%. Table 41 shows the times at which the Load Balancer executed migrations for each heuristic in all five executions. For BalanceInstancesOS, the migrations occurred in the first and second execution of the Load Balancer; for the other heuristics, the migrations occurred in the first and third execution. Analyzing the total_consumption (Figures 60 to 62) and total_cap (Figures 63 to 65) graphs, we can see that after each execution the Load Balancer was able to respect the cpu_ratio value used.

The CPU utilization of the hosts was also collected; the data is shown in Figures 66 through 68. The common pattern in the graphs is a decrease in usage on the overloaded hosts when the Load Balancer starts executing and an increase in utilization on the hosts that receive the migrated virtual machines. The peak in utilization right after the first execution is caused by the removal and creation of virtual machines on the hosts.

Table 41 - Time when migrations started.

Heuristic            | Execution 1  | Execution 2  | Execution 3  | Execution 4  | Execution 5
BalanceInstancesOS   | 19:27, 19:33 | 19:52, 19:58 | 20:13, 20:19 | 20:36, 20:44 | 21:11, 21:17
CPUCapAware          | 20:57        | 21:19        | 21:40        | 22:16        | 22:37
SysbenchPerfCPUCap   | 21:13        | 21:33        | 21:57        | 22:21        | 22:43


Figure 60 - Heuristic BalanceInstancesOS - Total Consumption per host.


Figure 61 - Heuristic CPUCapAware - Total Consumption per host.


Figure 62 - Heuristic SysbenchPerfCPUCap - Total Consumption per host.


Figure 63 - Heuristic BalanceInstancesOS - Total Cap per host.


Figure 64 - Heuristic CPUCapAware - Total Cap per host.


Figure 65 - Heuristic SysbenchPerfCPUCap - Total Cap per host.


Figure 66 - Heuristic BalanceInstancesOS - Host %CPU during executions.


Figure 67 - Heuristic CPUCapAware - Host %CPU during executions.


Figure 68 - Heuristic SysbenchPerfCPUCap - Host %CPU during executions.