Transfer Learning for Optimal Configuration of Big Data Software

Pooyan Jamshidi, Giuliano Casale, Salvatore Dipietro
Imperial College London, [email protected]
Department of Computing Research Associate Symposium, 14th June 2016



Motivation

1. Many different parameters => large state space and parameter interactions
2. Defaults are typically used => poor performance

Motivation

[Figure: histograms of the number of observations vs. average read latency (µs) for (a) cass-20 and (b) cass-10, highlighting the best and worst configurations.]

Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Metric: average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)

Goal!

information from the previous versions, the acquired data on the current version, and apply a variety of kernel estimators [27] to locate regions where optimal configurations may lie. The key benefit of MTGPs over GPs is that the similarity between the response data helps the model converge to accurate predictions much earlier. We experimentally show that TL4CO outperforms state-of-the-art configuration tuning methods. Our real configuration datasets were collected for three stream processing systems (SPS) implemented with Apache Storm and for a NoSQL benchmark system with Apache Cassandra, together with a dataset on 6 cloud-based clusters obtained in [21], worth 3 months of experimental time.

The rest of this paper is organized as follows. Section 2 overviews the problem and motivates the work via an example. TL4CO is introduced in Section 3 and then validated in Section 4. Section 5 provides a behind-the-scenes discussion, and Section 6 concludes the paper.

2. PROBLEM AND MOTIVATION

2.1 Problem statement

In this paper, we focus on the problem of optimal system configuration, defined as follows. Let Xi denote the i-th configuration parameter, which ranges over a finite domain Dom(Xi). In general, Xi may be (i) an integer variable, such as the level of parallelism, or (ii) a categorical variable, such as the messaging framework, or a Boolean variable, such as enabling a timeout. We use the terms parameter and factor interchangeably; with the term option we refer to the possible values that can be assigned to a parameter.
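As a toy illustration of such a configuration space, the Cartesian product of per-parameter domains can be enumerated directly (the parameter names and domains below are hypothetical, not taken from the paper):

```python
# Hypothetical parameters and domains, for illustration only.
from itertools import product

domains = {
    "spout_parallelism": [1, 2, 3],   # integer parameter
    "serializer": ["kryo", "java"],   # categorical parameter
    "enable_timeout": [False, True],  # Boolean parameter
}

# Configuration space X = Dom(X1) x Dom(X2) x Dom(X3)
configurations = [dict(zip(domains, vals)) for vals in product(*domains.values())]
print(len(configurations))  # 3 * 2 * 2 = 12
```

Even this tiny example shows how quickly the space grows: the 6-parameter Cassandra study above already spans 1024 configurations.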

We assume that each configuration x ∈ X in the configuration space X = Dom(X1) × · · · × Dom(Xd) is valid, i.e., the system accepts the configuration and the corresponding test results in stable performance behavior. The response under configuration x is denoted by f(x). Throughout, we assume that f(·) is latency; however, other response metrics may be used. We consider the problem of finding an optimal configuration x* that globally minimizes f(·) over X:

x* = arg min_{x ∈ X} f(x)    (1)

In fact, the response function f(·) is usually unknown or only partially known, i.e., yi = f(xi), xi ⊂ X. In practice, such measurements may contain noise, i.e., yi = f(xi) + ε. The determination of the optimal configuration is thus a black-box optimization problem subject to noise [27, 33], which is considerably harder than deterministic optimization. A popular solution is based on sampling: it starts with a number of sampled configurations. The performance of the experiments associated with these initial samples can deliver a better understanding of f(·) and guide the generation of the next round of samples. If properly guided, this sample generation-evaluation-feedback-regeneration process will eventually converge and the optimal configuration will be located. However, a sampling-based approach of this kind can be prohibitively expensive in terms of time or cost (e.g., rental of cloud resources), considering that each function evaluation is costly and the optimization process may require several hundred samples to converge.
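The sampling loop described above can be sketched, in its simplest unguided form, as a random-search baseline (the response function below is a made-up stand-in for a real, costly latency measurement):

```python
import random

def random_search(f, configurations, budget, seed=0):
    """Evaluate `budget` randomly chosen configurations; return the best seen."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for x in rng.sample(configurations, budget):
        y = f(x)  # in practice: deploy, run the benchmark, measure latency
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Toy stand-in response over a 1-D integer parameter (not a real system).
configs = list(range(1024))
latency = lambda x: (x - 700) ** 2 + 1000
best_x, best_y = random_search(latency, configs, budget=50)
```

Guided methods differ from this baseline only in how the next batch of samples is chosen; that choice is exactly what the approaches in Section 2.2 compete on.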

2.2 Related work

Several approaches have attempted to address the above problem. Recursive Random Sampling (RRS) [43] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [42] integrates importance sampling with Latin Hypercube Design (LHD); SHC estimates a local regression at each potential region, then searches in the direction of steepest descent. An approach based on direct search [45] forms a simplex in the parameter space from a number of samples and iteratively updates it through well-defined operations, including reflection, expansion, and contraction, to guide sample generation. Quick Optimization via Guessing (QOG) [31] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations. The statistical approach in [33] approximates the joint distribution of parameters with a Gaussian in order to guide sample generation towards the distribution peak. A model-based approach [35] iteratively constructs a regression model representing performance influences. Some approaches, such as [8], enable dynamic detection of optimal configurations in dynamic environments. Finally, our earlier work BO4CO [21] uses Bayesian Optimization based on GPs to accelerate the search process; more details are given in Section 3.
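GP-based Bayesian Optimization of the kind BO4CO uses is, in outline, a fit-predict-acquire loop. A self-contained numpy sketch over a 1-D grid is given below; the RBF kernel, the lower-confidence-bound acquisition rule, and all constants are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def rbf(a, b, length_scale=25.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X, y, X_star, noise=1e-3):
    """Posterior mean and variance of a zero-mean GP at grid points X_star."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v ** 2, axis=0)  # rbf(x, x) == 1 on the diagonal
    return mu, np.maximum(var, 0.0)

def bo_minimize(f, domain, n_init=3, budget=15, kappa=2.0, seed=0):
    """Fit-predict-acquire loop: next sample minimizes mu - kappa * sigma."""
    rng = np.random.default_rng(seed)
    X = rng.choice(domain, size=n_init, replace=False)
    y = np.array([f(x) for x in X])
    for _ in range(budget - n_init):
        mu, var = gp_posterior(X, y, domain)
        x_next = domain[np.argmin(mu - kappa * np.sqrt(var))]
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    i = int(np.argmin(y))
    return X[i], y[i]

# Toy stand-in for a costly latency measurement (not a real system).
domain = np.arange(200, dtype=float)
f = lambda x: (x - 120.0) ** 2 / 100.0
x_best, y_best = bo_minimize(f, domain)
```

The acquisition term kappa * sigma is what makes the loop spend its small budget on regions that are either predicted to be good or still uncertain, rather than sampling uniformly.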

2.3 Solution

All the previous efforts attempt to improve the sampling process by exploiting the information gained in the current task. We define a task as an individual tuning cycle that optimizes a given version of the system under test. As a result, the learning is limited to the current observations, and it still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to improve search efficiency in configuration tuning. Rather than starting the search from scratch, the approach transfers knowledge learned from similar versions of the software to accelerate the sampling process for the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters); and (iii) different versions of a system often share similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network during the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a previously learned model but also the valuable raw data; therefore, we are not limited by the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.
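The simplest form of this raw-data reuse can be sketched as warm-starting the new tuning cycle with the best configurations observed on the previous version. The toy source data below is fabricated for illustration, and the paper's actual method uses MTGPs rather than this top-k heuristic:

```python
def warm_start_seeds(source_obs, k=3):
    """Pick the k configurations with the lowest observed latency on the
    previous version to seed the search on the current version."""
    return [cfg for cfg, _ in sorted(source_obs, key=lambda p: p[1])[:k]]

# Fabricated observations from an older version: (config, measured latency).
old_version_obs = [(c, (c - 650) ** 2 + 900) for c in range(0, 1024, 64)]
seeds = warm_start_seeds(old_version_obs, k=3)
print(seeds)  # [640, 704, 576]
```

An MTGP goes further than this heuristic: instead of only copying promising points, it models the correlation between the old and new response surfaces, so every source observation informs predictions on the target version.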

2.4 Motivation

A motivating example. We now illustrate the previous points with an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in an incoming stream. A Processing Element (PE) of type Spout reads input messages from a data source and pushes them into the system. A PE of type Bolt named Splitter is responsible for splitting sentences into words, which are then counted by the Counter.

[Figure: a file is streamed to a Kafka topic; a Kafka Spout emits sentences to the Splitter Bolt, which emits words to the Counter Bolt, producing counts such as [paintings, 3], [poems, 60], [letter, 75].]

Figure 1: WordCount architecture.
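The Splitter/Counter data flow of Figure 1 can be mimicked in a few lines of plain Python (a simulation of the topology's logic only, not Storm code):

```python
from collections import Counter

def splitter(sentences):
    """Split each incoming sentence into words (Splitter Bolt logic)."""
    for sentence in sentences:
        yield from sentence.split()

def counter(words):
    """Count word occurrences (Counter Bolt logic)."""
    return Counter(words)

stream = ["to be or not to be", "not a word"]
counts = counter(splitter(stream))
print(counts["to"], counts["not"])  # 2 2
```

In the real topology the interesting tuning question is not this logic but its deployment: how many Splitter and Counter instances to run, which is exactly the configuration surface plotted in Section 2.4.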





[Figure: cubic interpolation over a finer grid of the latency (ms) response surface as a function of the number of counters and the number of splitters.]

- The response function is only partially known
- Measurements contain noise
- Large configuration space

Finding the optimum configuration is difficult!

[Figure: latency (ms) vs. number of counters (0-20) for splitters=2 and splitters=3, alongside the interpolated latency surface over the number of counters and number of splitters.]

information from the previous versions, the acquired data onthe current version, and apply a variety of kernel estimators[27] to locate regions where optimal configurations may lie.The key benefit of MTGPs over GPs is that the similaritybetween the response data helps the model to converge tomore accurate predictions much earlier. We experimentallyshow that TL4CO outperforms state-of-the-art configurationtuning methods. Our real configuration datasets are collectedfor three stream processing systems (SPS), implemented withApache Storm, a NoSQL benchmark system with ApacheCassandra, and using a dataset on 6 cloud-based clustersobtained in [21] worth 3 months of experimental time.The rest of this paper is organized as follows. Section 2

overviews the problem and motivates the work via an exam-ple. TL4CO is introduced in Section 3 and then validated inSection 4. Section 5 provides behind the scene and Section 6concludes the paper.

2. PROBLEM AND MOTIVATION

2.1 Problem statementIn this paper, we focus on the problem of optimal system

configuration defined as follows. Let Xi indicate the i-thconfiguration parameter, which ranges in a finite domainDom(Xi). In general, Xi may either indicate (i) integer vari-able such as level of parallelism or (ii) categorical variablesuch as messaging frameworks or Boolean variable such asenabling timeout. We use the terms parameter and factor in-terchangeably; also, with the term option we refer to possiblevalues that can be assigned to a parameter.

We assume that each configuration x 2 X in the configura-tion space X = Dom(X1)⇥ · · ·⇥Dom(Xd) is valid, i.e., thesystem accepts this configuration and the corresponding testresults in a stable performance behavior. The response withconfiguration x is denoted by f(x). Throughout, we assumethat f(·) is latency, however, other metrics for response maybe used. We here consider the problem of finding an optimalconfiguration x

⇤ that globally minimizes f(·) over X:

x

⇤ = argminx2X

f(x) (1)

In fact, the response function f(·) is usually unknown orpartially known, i.e., yi = f(xi),xi ⇢ X. In practice, suchmeasurements may contain noise, i.e., yi = f(xi) + ✏. Thedetermination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], whichis considerably harder than deterministic optimization. Apopular solution is based on sampling that starts with anumber of sampled configurations. The performance of theexperiments associated to this initial samples can delivera better understanding of f(·) and guide the generation ofthe next round of samples. If properly guided, the processof sample generation-evaluation-feedback-regeneration willeventually converge and the optimal configuration will belocated. However, a sampling-based approach of this kind canbe prohibitively expensive in terms of time or cost (e.g., rentalof cloud resources) considering that a function evaluation inthis case would be costly and the optimization process mayrequire several hundreds of samples to converge.

2.2 Related work
Several approaches have attempted to address the above problem. Recursive Random Sampling (RRS) [43] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [42] integrates importance sampling with Latin Hypercube Design (lhd). SHC estimates a local regression at each potential region, then searches along the steepest descent direction. An approach based on direct search [45] forms a simplex in the parameter space from a number of samples and iteratively updates the simplex through well-defined operations, including reflection, expansion, and contraction, to guide sample generation. Quick Optimization via Guessing (QOG) [31] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations. The statistical approach in [33] approximates the joint distribution of parameters with a Gaussian in order to guide sample generation towards the distribution peak. A model-based approach [35] iteratively constructs a regression model representing performance influences. Some approaches, such as [8], enable dynamic detection of the optimal configuration in dynamic environments. Finally, our earlier work BO4CO [21] uses Bayesian Optimization based on GPs to accelerate the search process; more details are given in Section 3.

2.3 Solution
All the previous efforts attempt to improve the sampling process by exploiting the information gained in the current task. We define a task as an individual tuning cycle that optimizes a given version of the system under test. As a result, the learning is limited to the current observations and still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to improve the search efficiency of configuration tuning. Rather than starting the search from scratch, the approach transfers knowledge learned from similar versions of the software to accelerate the sampling process for the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously, (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters), and (iii) different versions of a system often share a similar business logic.

To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network during the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a previously learned model but also the valuable raw data. Therefore, we are not limited by the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.

2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input messages from a data source and pushes them to the system. A PE of type Bolt named Splitter is responsible for splitting sentences into words, which are then counted by the Counter.

[Figure 1: WordCount architecture. A file of sentences is streamed to a Kafka topic; the Kafka Spout reads sentences and emits them to the Splitter Bolt, which emits words to the Counter Bolt (producing counts such as [paintings, 3], [poems, 60], [letter, 75]).]



The response surface is:
- non-linear
- non-convex
- multi-modal
- subject to measurement noise

Bayesian Optimization for Configuration Optimization (BO4CO)
Code: https://github.com/dice-project/DICE-Configuration-BO4CO
Data: https://zenodo.org/record/56238
Paper: P. Jamshidi, G. Casale, "An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems", MASCOTS 2016. https://arxiv.org/abs/1606.06543

GP for modeling a black-box response function

[Figure: a GP fitted to a 1-dimensional response, showing the true function, GP mean, GP variance, observations, the selected point, and the true minimum.]

Table 1: Pearson (Spearman) correlation coefficients; the upper triangle shows the coefficients, the lower triangle the p-values.

        v1                    v2              v3                    v4
v1      1                     0.41 (0.49)     -0.46 (-0.51)         -0.50 (-0.51)
v2      7.36E-06 (5.5E-08)    1               -0.20 (-0.2793)       -0.18 (-0.24)
v3      6.92E-07 (1.3E-08)    0.04 (0.003)    1                     0.94 (0.88)
v4      2.54E-08 (1.4E-08)    0.07 (0.01)     1.16E-52 (8.3E-36)    1

Table 2: Signal-to-noise ratios for WordCount.

Top.       µ          σ        CI(µ)                CI(σ)             µ/σ
wc(v1)     516.59     7.96     [515.27, 517.90]     [7.13, 9.01]      64.88
wc(v2)     584.94     2.58     [584.51, 585.36]     [2.32, 2.92]      226.32
wc(v3)     654.89     13.56    [652.65, 657.13]     [12.15, 15.34]    48.30
wc(v4)     1125.81    16.92    [1123, 1128.6]       [15.16, 19.14]    66.56

Figure 2 (a,b,c,d) shows the response surfaces for 4 different versions of WordCount when splitters and counters are varied in [1, 6] and [1, 18], respectively. WordCount v1, v2 (and likewise v3, v4) are identical in terms of source code, but the environments in which they are deployed differ (we deployed several other systems that compete for capacity in the same cluster). WordCount v1, v3 (and likewise v2, v4) are deployed in similar environments, but they have undergone multiple software changes (we artificially injected delays into the source code of their components). A number of interesting observations can be made from the experimental results in Figure 2 and Tables 1 and 2, which we describe in the following subsections.

Correlation across different versions. We have measured the correlation coefficients between the four versions of WordCount in Table 1 (the upper triangle shows the coefficients, while the lower triangle shows the p-values). The correlations between the response functions are significant (p-values are less than 0.05). However, the correlation differs from version to version. More interestingly, different versions of the system have different optimal configurations: x*_v1 = (5, 1), x*_v2 = (6, 2), x*_v3 = (2, 13), x*_v4 = (2, 16). In DevOps, different versions of a system are delivered continuously, on a daily basis [3]. Despite such significant correlations, current DevOps practices do not systematically use the knowledge from previous versions for performance tuning of the current version under test [3]. There are two reasons for this: (i) the techniques used for performance tuning cannot exploit historical data belonging to a different version, and (ii) they assume different versions have the same optimum configuration, which, based on our experimental observations above, is not true. As a result, existing practice treats the experimental data as one-time-use.
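For illustration, the Pearson and Spearman coefficients reported in Table 1 can be computed as below. The two response vectors are synthetic stand-ins for measured surfaces, and the helper functions are our own sketch rather than the paper's tooling:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: normalized covariance of the raw values."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(a)), rank(np.asarray(b)))

# Synthetic stand-ins for the response surfaces of two versions,
# constructed to be negatively correlated (as v1 and v3 are in Table 1).
rng = np.random.default_rng(1)
f_v1 = rng.uniform(500, 900, size=108)
f_v3 = 1000 - 0.5 * f_v1 + rng.normal(0, 40, size=108)

r, rho = pearson(f_v1, f_v3), spearman(f_v1, f_v3)
```

A negative coefficient, as between v1 and v3 here, is still useful for transfer: it indicates the surfaces are related, even though their optima differ.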

Nonlinear interactions. The response functions f(·) in Figure 2 are strongly non-linear, non-convex, and multi-modal. The performance difference between the best and worst settings is substantial, e.g., 65% in v4, providing a case for optimal tuning. Moreover, the non-linear relations among the parameters imply that the optimal number of counters depends on the number of splitters, and vice versa. In other words, trying to minimize latency by acting on just one of these parameters may not lead to a global optimum [21].

Measurement uncertainty. We have taken samples of the latency for the same configuration (splitters = counters = 1) of the 4 versions of WordCount. The experiments were conducted on Amazon EC2 (m3.large: 2 CPU, 7.5 GB). After filtering the initial burn-in, we computed the average and variance of the measurements. The results in Table 2 illustrate that the variability of measurements across different versions can be of different scales. In traditional techniques, such as design of experiments, variability is typically disregarded by repeating experiments and taking the mean. Here, we instead pursue an alternative approach that relies on MTGP models, which explicitly take variability into account.
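The µ/σ statistic of Table 2 can be computed as follows; the latency trace here is synthetic, and the burn-in length is an assumption made for illustration:

```python
import numpy as np

def snr(samples, burn_in=10):
    """Mean, sample std, and signal-to-noise ratio (mu/sigma) of a latency
    trace after dropping the initial burn-in samples, as in Table 2."""
    y = np.asarray(samples, float)[burn_in:]
    return y.mean(), y.std(ddof=1), y.mean() / y.std(ddof=1)

# Synthetic latency trace: a warm-up transient followed by steady state,
# roughly on the scale of wc(v1) in Table 2.
rng = np.random.default_rng(0)
trace = np.concatenate([
    np.linspace(900, 520, 10),           # burn-in transient
    rng.normal(516.6, 8.0, size=200),    # steady-state measurements
])
mu, sigma, ratio = snr(trace)
```

Without dropping the burn-in, the transient would inflate σ and understate the signal-to-noise ratio, which is exactly the kind of scale difference Table 2 highlights across versions.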

3. TL4CO: TRANSFER LEARNING FOR CONFIGURATION OPTIMIZATION

3.1 Single-task GP Bayesian optimization
Bayesian optimization [34] is a sequential design strategy that allows us to perform global optimization of black-box functions. Figure 3 illustrates the GP-based Bayesian Optimization approach using a 1-dimensional response. The curve in blue is the unknown true response, the mean is shown in yellow, and the 95% confidence interval at each point is the shaded red area. The stars indicate experimental measurements (interchangeably, observations). Some points x ∈ X have a large confidence interval due to the lack of observations in their neighborhood, while others have a narrow one. The main motivation behind the choice of Bayesian Optimization here is that it offers a framework in which reasoning is based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) to predict the posterior distribution of response functions. A GP model is composed of its prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

ploits single-task GPs (no transfer learning) for prediction ofposterior distribution of response functions. A GP model iscomposed by its prior mean (µ(·) : X ! R) and a covariancefunction (k(·, ·) : X⇥ X ! R) [41]:

    y = f(x) ∼ GP(µ(x), k(x, x′)),    (2)

where the covariance function k(x, x′) defines the similarity between x and x′. Let S1:t = {(x1:t, y1:t) | yi := f(xi)} be the collection of t experimental observations. In this framework, we treat f(x) as a random variable conditioned on the observations S1:t, which is normally distributed with the following posterior mean and variance functions [41]:

    µt(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (3)

    σ²t(x) = k(x, x) + σ² − k(x)ᵀ(K + σ²I)⁻¹k(x)    (4)

where y := y1:t, k(x)ᵀ = [k(x, x1) k(x, x2) . . . k(x, xt)], µ := µ(x1:t), K := [k(xi, xj)]i,j=1..t, and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
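Equations (3) and (4) reduce to a few lines of linear algebra. The sketch below is our own illustration, assuming a squared-exponential kernel and a constant prior mean; neither choice is prescribed by the text above:

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential covariance k(x, x'), one common kernel choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xstar, noise=1e-2, mu0=0.0):
    """Posterior mean (Eq. 3) and variance (Eq. 4) of a GP with constant
    prior mean mu0, conditioned on observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))        # K + sigma^2 I
    Ks = rbf(Xstar, X)                            # rows are k(x)^T
    mean = mu0 + Ks @ np.linalg.solve(K, y - mu0)            # Eq. (3)
    v = np.linalg.solve(K, Ks.T)
    var = rbf(Xstar, Xstar).diagonal() + noise - (Ks * v.T).sum(1)  # Eq. (4)
    return mean, var

# Tiny 1-D example: condition on three observations of sin(x)
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
Xs = np.array([[1.0], [3.0]])
m, v = gp_posterior(X, y, Xs)
```

Near an observed point the posterior mean tracks the data and the variance is small; away from the data the variance grows back toward the prior, which is exactly the uncertainty that guides the next sample.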

3.2 TL4CO: an extension to multi-task GPs
TL4CO¹ uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. At a high level, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to the existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).

In the multi-task framework, we use historical data to fit a better GP that provides more accurate predictions. Before that, we measure a few sample points based on a Latin Hypercube Design (lhd) D = {x1, . . . , xn} (cf. step 1 in Algorithm 1). We have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [26, 17] (called brute-force) does not guarantee this [29]; (ii) lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. We define a new notation for

¹Code and data will be released (due in July 2016), as this work is funded under the EU project DICE: https://github.com/dice-project/DICE-Configuration-BO4CO
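The lhd initialization in step 1 can be sketched as follows. This basic stratified sampler is our own illustration, not the paper's exact procedure; it shows why each dimension is covered evenly, unlike plain random sampling:

```python
import random

def lhd(n, bounds, seed=0):
    """Latin Hypercube Design: n samples over `bounds`, with exactly one
    point per stratum in each dimension (a basic sketch)."""
    random.seed(seed)
    dims = []
    for lo, hi in bounds:
        # split [lo, hi] into n equal strata and draw one point per stratum
        pts = [lo + (hi - lo) * (i + random.random()) / n for i in range(n)]
        random.shuffle(pts)  # decorrelate the strata across dimensions
        dims.append(pts)
    return list(zip(*dims))

# e.g. 6 initial samples over splitters in [1, 6] and counters in [1, 18]
D = lhd(6, [(1, 6), (1, 18)])
```

For integer parameters such as splitters, the continuous samples would then be rounded to the nearest valid option. The per-stratum guarantee is what makes the design representative of X even for small n.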


Table 1: Pearson (Spearman) correlation coe�cients.v1 v2 v3 v4

v1 1 0.41 (0.49) -0.46 (-0.51) -0.50 (-0.51)v2 7.36E-06 (5.5E-08) 1 -0.20 (-0.2793) -0.18 (-0.24)v3 6.92E-07 (1.3E-08) 0.04 (0.003) 1 0.94 (0.88)v4 2.54E-08 (1.4E-08) 0.07 (0.01) 1.16E-52 (8.3E-36) 1

Table 2: Signal to noise ratios for WordCount.

Top. µ � µci �ciµ�

wc(v1) 516.59 7.96 [515.27, 517.90] [7.13, 9.01] 64.88wc(v2) 584.94 2.58 [584.51, 585.36] [2.32, 2.92] 226.32wc(v3) 654.89 13.56 [652.65, 657.13] [12.15, 15.34] 48.30wc(v4) 1125.81 16.92 [1123, 1128.6] [15.16, 19.14] 66.56

Figure 2 (a,b,c,d) shows the response surfaces for 4 di↵er-ent versions of the WordCount when splitters and countersare varied in [1, 6] and [1, 18]. WordCount v1, v2 (also v3, v4)are identical in terms of source code, but the environmentin which they are deployed on is di↵erent (we have deployedseveral other systems that compete for capacity in the samecluster). WordCount v1, v3 (also v2, v4) are deployed on a sim-ilar environment, but they have undergone multiple softwarechanges (we artificially injected delays in the source codeof its components). A number of interesting observationscan be made from the experimental results in Figure 2 andTables 1, 2 that we describe in the following subsections.

Correlation across different versions. We have measured the correlation coefficients between the four versions of WordCount in Table 1 (the upper triangle shows the coefficients, while the lower triangle shows the p-values). The correlations between the response functions are significant (p-values are less than 0.05). However, the correlation differs from version to version. More interestingly, different versions of the system have different optimal configurations: x*_v1 = (5, 1), x*_v2 = (6, 2), x*_v3 = (2, 13), x*_v4 = (2, 16). In DevOps, different versions of a system are delivered continuously, on a daily basis [3]. Despite such significant correlations, current DevOps practices do not systematically use the knowledge from previous versions for performance tuning of the current version under test [3]. There are two reasons behind this: (i) the techniques used for performance tuning cannot exploit historical data belonging to a different version; (ii) they assume different versions have the same optimum configuration, which, based on our experimental observations above, is not true. As a result, existing practice treats the experimental data as one-time-use.
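Such coefficients can be reproduced from raw latency measurements with plain numpy (the data below are synthetic stand-ins for two response surfaces, not the paper's measurements; Spearman's rho is computed as Pearson's r on the ranks):

```python
import numpy as np

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    # Spearman's rho is Pearson's r computed on the ranks.
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

# Synthetic stand-ins for two measured response surfaces (latency over the
# same 6 x 18 grid of configurations), built to be positively correlated.
rng = np.random.default_rng(0)
latency_v3 = rng.uniform(2.8, 3.6, size=108)
latency_v4 = 0.5 * latency_v3 + rng.normal(0.0, 0.05, size=108)

r_pearson = pearson(latency_v3, latency_v4)    # strongly positive, as for v3 vs v4
r_spearman = spearman(latency_v3, latency_v4)
```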

Nonlinear interactions. The response functions f(·) in Figure 2 are strongly non-linear, non-convex, and multi-modal. The performance difference between the best and worst settings is substantial, e.g., 65% in v4, providing a case for optimal tuning. Moreover, the non-linear relations among the parameters imply that the optimal number of counters depends on the number of splitters, and vice versa. In other words, minimizing latency by acting on just one of these parameters may not lead to a global optimum [21].

Measurement uncertainty. We have taken samples of the latency for the same configuration (splitters = counters = 1) of the four versions of WordCount. The experiments were conducted on Amazon EC2 (m3.large: 2 vCPUs, 7.5 GB RAM). After filtering out the initial burn-in, we computed the average and variance of the measurements. The results in Table 2 illustrate that the variability of measurements across different versions can be of different scales. In traditional techniques, such as design of experiments, this variability is typically disregarded by repeating experiments and taking the mean. Here, however, we pursue an alternative approach that relies on MTGP models, which explicitly take variability into account.

3. TL4CO: TRANSFER LEARNING FOR CONFIGURATION OPTIMIZATION

3.1 Single-task GP Bayesian optimization

Bayesian optimization [34] is a sequential design strategy that allows us to perform global optimization of black-box functions. Figure 3 illustrates the GP-based Bayesian optimization approach using a 1-dimensional response. The curve in blue is the unknown true response; the mean is shown in yellow, and the 95% confidence interval at each point in the shaded red area. The stars indicate experimental measurements (we use "measurement" and "observation" interchangeably). Some points x ∈ X have a large confidence interval due to a lack of observations in their neighborhood, while others have a narrow one. The main motivation behind the choice of Bayesian optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) to predict the posterior distribution of response functions. A GP model is composed of a prior mean (µ(·) : X → R) and a covariance function (k(·,·) : X × X → R) [41]:

    y = f(x) ∼ GP(µ(x), k(x, x')),    (2)

where the covariance k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable, conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

    µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (3)

    σ²_t(x) = k(x, x) + σ² − k(x)ᵀ(K + σ²I)⁻¹k(x)    (4)

where y := y_{1:t}, k(x)ᵀ = [k(x, x_1) k(x, x_2) . . . k(x, x_t)], µ := µ(x_{1:t}), K := k(x_i, x_j), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
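Since all quantities above are tractable linear algebra, the posterior of Eqs. (3)-(4) is a few lines of code. The sketch below assumes a squared-exponential kernel and a constant prior mean (illustrative choices; the paper's actual kernels are discussed later):

```python
import numpy as np

def sq_exp(A, B, ell=1.0):
    """Squared-exponential kernel matrix between row-stacked points."""
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=2) / ell ** 2)

def gp_posterior(X, y, Xs, sigma2=0.01, mu=0.0):
    """Posterior mean and variance of Eqs. (3)-(4), constant prior mean mu."""
    Kinv = np.linalg.inv(sq_exp(X, X) + sigma2 * np.eye(len(X)))
    ks = sq_exp(X, Xs)                            # columns are k(x) per test point
    mean = mu + ks.T @ Kinv @ (y - mu)
    var = 1.0 + sigma2 - np.sum(ks * (Kinv @ ks), axis=0)   # k(x, x) = 1 here
    return mean, var

X = np.array([[0.0], [1.0], [2.0]])               # three observed configurations
y = np.array([0.0, 1.0, 0.5])
mean, var = gp_posterior(X, y, np.array([[1.0]])) # predict at x = 1.0
```

Near an observed configuration the posterior mean tracks the measurement and the variance collapses toward the noise level, which is exactly what drives the exploration/exploitation trade-off later.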

3.2 TL4CO: an extension to multiple tasks

TL4CO¹ uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses learning from other system versions. At a high level, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to the existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).


Fig. 5: An example of a 1D GP model: GPs provide mean estimates as well as the uncertainty in estimations, i.e., variance. (Legend: true function; GP surrogate mean estimate; observations x1-x4.)

Fig. 6: BO4CO architecture: (i) the configuration optimisation tool and (ii) the experimental suite (deployment service, monitoring, data preparation, workload generator, and the system under test -- Storm, Cassandra, Spark -- on a testbed) are integrated via (iii) a Kafka-based data broker. The integrated solution is available at https://github.com/dice-project/DICE-Configuration-BO4CO.

B. BO4CO algorithm

BO4CO's high-level architecture is shown in Figure 6, and the procedure that drives the optimization is described in Algorithm 1. We start by bootstrapping the optimization following a Latin Hypercube Design (lhd) to produce an initial design D = {x_1, . . . , x_n} (cf. step 1 in Algorithm 1). Although other design approaches (e.g., random) could be used, we have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [17], [9] (called brute-force) does not guarantee this [20]; (ii) lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. After obtaining the measurements for the initial design, BO4CO fits a GP model to the design points D to form our belief about the underlying response function (cf. step 3 in Algorithm 1). The while loop in Algorithm 1 iteratively updates this belief until the budget runs out: as we accumulate the data S_{1:t} = {(x_i, y_i)}_{i=1}^t, where y_i = f(x_i) + ε_i with ε ∼ N(0, σ²), a prior distribution Pr(f) and the likelihood function Pr(S_{1:t}|f) form the posterior distribution: Pr(f|S_{1:t}) ∝ Pr(S_{1:t}|f) Pr(f).

A GP is a distribution over functions [31], specified by its mean (see Section III-E2) and covariance (see Section III-E1):

    y = f(x) ∼ GP(µ(x), k(x, x')),    (3)

Algorithm 1: BO4CO
Input: Configuration space X, maximum budget N_max, response function f, kernel function K, hyper-parameters θ, design sample size n, learning cycle N_l
Output: Optimal configuration x* and learned model M
1: choose an initial sparse design (lhd) to find initial design samples D = {x_1, . . . , x_n}
2: obtain performance measurements of the initial design, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S_{1:n} ← {(x_i, y_i)}_{i=1}^n; t ← n + 1
4: M(x|S_{1:n}, θ) ← fit a GP model to the design    ▷ Eq. (3)
5: while t ≤ N_max do
6:   if (t mod N_l = 0) then θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration x_t by optimizing the selection criterion over the estimated response surface given the data, x_t ← argmax_x u(x|M, S_{1:t−1})    ▷ Eq. (9)
8:   obtain performance for the new configuration x_t, y_t ← f(x_t) + ε_t
9:   augment the data, S_{1:t} = {S_{1:t−1}, (x_t, y_t)}
10:  M(x|S_{1:t}, θ) ← re-fit a new GP model    ▷ Eq. (7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S_{1:N_max}
14: M(x)
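For concreteness, the loop of Algorithm 1 can be sketched on a synthetic 1-D response (a numpy-only sketch; the response function, kernel length-scale, and fixed κ are illustrative assumptions -- BO4CO uses an lhd initial design and re-learns hyper-parameters every N_l iterations):

```python
import numpy as np

def kernel(a, b, ell=0.3):
    """Squared-exponential kernel between 1-D configuration vectors."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp(X, y, Xs, s2=1e-4):
    """Posterior mean/std at Xs given data (X, y); cf. Eqs. (7)-(8)."""
    Kinv = np.linalg.inv(kernel(X, X) + s2 * np.eye(len(X)))
    ks = kernel(X, Xs)
    mean = ks.T @ Kinv @ y
    var = 1.0 + s2 - np.sum(ks * (Kinv @ ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

f = lambda x: np.sin(3 * x) * x          # hypothetical response to minimize
grid = np.linspace(0, 2, 201)            # discretized configuration space X
rng = np.random.default_rng(1)
X = rng.choice(grid, size=4, replace=False)  # initial design (random stand-in for lhd)
y = f(X)

for t in range(20):                      # budget N_max = 20 measurements
    mean, std = gp(X, y, grid)
    lcb = mean - 2.0 * std               # LCB selection criterion, fixed kappa = 2
    x_next = grid[int(np.argmin(lcb))]   # step 7: optimize the criterion
    X, y = np.append(X, x_next), np.append(y, f(x_next))  # steps 8-9

x_star = X[np.argmin(y)]                 # step 13: best configuration found
```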

where k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t observations. The function values are drawn from a multivariate Gaussian distribution N(µ, K), where µ := µ(x_{1:t}) and

    K := [ k(x_1, x_1)  . . .  k(x_1, x_t)
                ⋮          ⋱        ⋮
           k(x_t, x_1)  . . .  k(x_t, x_t) ]    (4)

In the while loop in BO4CO, given the observations we have accumulated so far, we intend to fit a new GP model:

    [ f_{1:t}  ]          ( [ K + σ²I    k                    ] )
    [ f_{t+1} ] ∼ N( µ,   [ kᵀ          k(x_{t+1}, x_{t+1})  ] ),    (5)

where k(x)ᵀ = [k(x, x_1) k(x, x_2) . . . k(x, x_t)] and I is the identity matrix. Given Eq. (5), the new GP model can be drawn from this new Gaussian distribution:

    Pr(f_{t+1} | S_{1:t}, x_{t+1}) = N(µ_t(x_{t+1}), σ²_t(x_{t+1})),    (6)

where

    µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (7)

    σ²_t(x) = k(x, x) + σ² − k(x)ᵀ(K + σ²I)⁻¹k(x)    (8)

These posterior functions are used to select the next point x_{t+1}, as detailed in Section III-C.

C. Configuration selection criteria

The selection criterion is defined as a function u : X → R that selects the x_{t+1} ∈ X at which f(·) should be evaluated next (step 7):

    x_{t+1} = argmax_{x∈X} u(x|M, S_{1:t})    (9)


Fig. 7: Change of κ over time (shown for ϵ = 1, 0.1, 0.01): it begins with a small value to exploit the mean estimates and increases over the iterations in order to explore.

BO4CO uses the Lower Confidence Bound (LCB) [25]. LCB minimizes the distance to the optimum by balancing exploitation of the mean and exploration via the variance:

    u_LCB(x|M, S_{1:n}) = argmin_{x∈X} µ_t(x) − κσ_t(x),    (10)

where κ can be set according to the objectives. For instance, if we need to find a near-optimal configuration quickly, we set a low value of κ to take the most out of the initial design knowledge. However, if we are looking for a globally optimal configuration, we can set a high value of κ in order to escape local minima. Furthermore, κ can be adapted over time to benefit from both [13]. For instance, κ can start with a reasonably small value to exploit the initial design and increase over time to do more exploration (cf. Figure 7).
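A schedule of the shape shown in Figure 7 can be sketched with a GP-UCB-style rule (the exact formula is an assumption, not taken from the paper; ϵ controls how conservative the bound is and d is the number of configuration parameters):

```python
import math

def kappa(t, d=2, eps=0.01):
    """Exploration weight that starts small and grows slowly with the
    iteration t, so early iterations exploit the initial design and
    later ones explore more. (Assumed GP-UCB-style schedule.)"""
    return math.sqrt(2 * math.log(t ** (d / 2 + 2) * math.pi ** 2 / (3 * eps)))

schedule = [kappa(t) for t in (1, 100, 10000)]   # monotonically increasing
```

With ϵ = 0.01 and d = 2 this grows from roughly 3.4 at t = 1 to roughly 7.9 at t = 10000, matching the range plotted in Figure 7.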

D. Illustration

The steps in Algorithm 1 are illustrated in Figure 8. First, an initial design based on lhd is produced (Figure 8(a)). Second, a GP model is fit to the initial design (Figure 8(b)). Then, the model is used to calculate the selection criterion (Figure 8(c)). Finally, the configuration that maximizes the selection criterion is used to run the next experiment and provides data for refitting a more accurate model (Figure 8(d)).

E. Model fitting in BO4CO

In this section, we provide some practical considerations to make GPs applicable for configuration optimization.

1) Kernel function: In BO4CO, as shown in Algorithm 1, the covariance function k : X × X → R dictates the structure of the response function we fit to the observed data. For integer variables (cf. Section II-A), we implemented the Matern kernel [31]. The main reason behind this choice is that, along each dimension of the configuration response functions, different levels of smoothness can be observed (cf. Figure 2). Matern kernels incorporate a smoothness parameter ν > 0 that permits greater flexibility in modeling such functions [31]. The following is a variation of the Matern kernel for ν = 1/2:

    k_{ν=1/2}(x_i, x_j) = θ₀² exp(−r),    (11)

where r²(x_i, x_j) = (x_i − x_j)ᵀ Λ (x_i − x_j) for some positive semidefinite matrix Λ. For categorical variables, we implemented the following [12]:

    k_θ(x_i, x_j) = exp( Σ_{ℓ=1}^d −θ_ℓ δ(x_{i,ℓ} ≠ x_{j,ℓ}) ),    (12)

where d is the number of dimensions (i.e., the number of configuration parameters), θ_ℓ adjusts the scale along each dimension of the function, and δ is a function that gives the distance between two categorical variables using the Kronecker delta [12], [25]. TL4CO uses different scales {θ_ℓ, ℓ = 1 . . . d} on different dimensions, as suggested in [31], [25]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 6), if the ℓ-th dimension turns out to be irrelevant, then θ_ℓ will be a small value and the dimension will, in effect, be discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.
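Both kernels are one-liners in practice. The sketch below implements Eq. (11) for integer parameters and Eq. (12) for categorical ones (the θ values and example configurations are made up for illustration):

```python
import numpy as np

def matern12(xi, xj, theta0=1.0, lam=None):
    """Eq. (11): theta0^2 * exp(-r), with r^2 = (xi - xj)^T Lambda (xi - xj)."""
    d = np.asarray(xi, float) - np.asarray(xj, float)
    lam = np.eye(len(d)) if lam is None else lam
    return theta0 ** 2 * np.exp(-np.sqrt(d @ lam @ d))

def categorical_ard(xi, xj, theta):
    """Eq. (12): per-dimension ARD scales theta_l on a Kronecker-delta
    distance; identical configurations give covariance 1."""
    mismatch = np.asarray(xi) != np.asarray(xj)
    return float(np.exp(-np.sum(theta * mismatch)))

k_int = matern12([5, 1], [2, 13])                         # integer parameters
k_cat = categorical_ard(["lz4", "on"], ["snappy", "on"],  # made-up categories
                        theta=np.array([0.5, 2.0]))
```

Note how a small θ_ℓ makes mismatches along dimension ℓ cheap, which is exactly how ARD discards irrelevant parameters.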

2) Prior mean function: While the kernel controls the structure of the estimated function, the prior mean µ(x) : X → R provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations [25]. However, the prior mean function is a way of incorporating expert knowledge, when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table I), we observed that typically, for Big Data systems, there is a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function, µ(x) := ax + b, allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn the slope for each dimension and an offset (denoted by µ_ℓ = (a, b)).

3) Learning parameters: marginal likelihood: This section describes step 7 in Algorithm 1. Due to the heavy computation of the learning, this process is performed only every N_l iterations. To learn the hyper-parameters of the kernel and the prior mean functions (cf. Sections III-E1 and III-E2), we maximize the marginal likelihood [25] of the observations S_{1:t}. To do so, we train the GP model (7) with S_{1:t}. We optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [23]. For this purpose, we use the off-the-shelf GPML library presented in [23]. Using the kernel defined in (12), we learn θ := (θ_{0:d}, µ_{0:d}, σ²), which comprises the hyper-parameters of the kernel and mean functions. The learning is performed iteratively, resulting in a sequence of θ_i for i = 1 . . . ⌊N_max/N_ℓ⌋.

4) Observation noise: The primary way of determining the noise variance σ in BO4CO is to use historical data: in Section II-B4, we showed that such noise can be measured with high confidence, and the signal-to-noise ratios show that the noise is stationary. The alternative is to learn the noise variance sequentially as we collect new data, treating it just like any other hyper-parameter; see Section III-E3.
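The marginal-likelihood objective itself is cheap to write down; what is expensive is optimizing it repeatedly, which is why the learning is deferred to every N_l iterations. A numpy-only sketch with a grid search over a single length-scale (an illustrative stand-in for the full θ and for GPML's quasi-Newton optimizers):

```python
import numpy as np

def neg_log_marginal_likelihood(ell, X, y, s2=1e-3):
    """-log p(y | X, ell) for a zero-mean GP with a squared-exponential
    kernel: 0.5 * (y^T K^-1 y + log|K| + n log 2*pi)."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
    K += s2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet + len(X) * np.log(2 * np.pi))

X = np.linspace(0, 3, 25)
y = np.sin(2 * X)                                  # synthetic observations
grid = np.linspace(0.05, 2.0, 40)                  # candidate length-scales
nll = [neg_log_marginal_likelihood(l, X, y) for l in grid]
best_ell = grid[int(np.argmin(nll))]               # maximum-likelihood length-scale
```

The log-determinant term penalizes needlessly complex (short length-scale) models while the data-fit term penalizes over-smooth ones, so the minimum sits at an interior length-scale.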

IV. EXPERIMENTAL RESULTS

A. Implementation

From an implementation perspective, BO4CO consists of three major components: (i) an optimization component (cf. left part of Figure 6) and (ii) an experimental suite (cf. right part of Figure 6), integrated via (iii) a data broker. The optimization component implements the model (re-)fitting (7) and criteria optimization (9) steps in Algorithm 1 and is developed in Matlab 2015b. The experimental suite implements the facilities for automated deployment of topologies, performance measurements, and data preparation, and is developed in Java.


[Slide figure: the acquisition and kernel functions drive the optimization steps over the configuration domain -- the true response function and initial GP fit; the criteria evaluation and newly selected point; and the new GP fit after the measurement.]

Correlations across different versions

[Slide figure: (a)-(d) the latency response surfaces of WordCount v1-v4 over the number of splitters and counters; (e) Pearson correlation coefficients; (f) Spearman correlation coefficients; (g) measurement noise across the WordCount versions, separating the effects of hardware and software changes.]

Table 1: Pearson correlation coefficients (upper triangle) and p-values (lower triangle).

      v1          v2       v3          v4
v1    1           0.41     -0.46       -0.50
v2    7.36E-06    1        -0.20       -0.18
v3    6.92E-07    0.04     1           0.94
v4    2.54E-08    0.07     1.16E-52    1

Table 2: Spearman correlation coefficients (upper triangle) and p-values (lower triangle).

      v1          v2       v3          v4
v1    1           0.49     -0.51       -0.51
v2    5.50E-08    1        -0.2793     -0.24
v3    1.30E-08    0.003    1           0.88
v4    1.40E-08    0.01     8.30E-36    1

Table 3: Measurement noise across WordCount versions.

ver.    µ          σ        µ/σ
v1      516.59     7.96     64.88
v2      584.94     2.58     226.32
v3      654.89     13.56    48.30
v4      1125.81    16.92    66.56


- Different correlations
- Different optimum configurations
- Different noise levels

DevOps

- Different versions are continuously delivered (on a daily basis).
- Big Data systems are developed using similar frameworks (Apache Storm, Spark, Hadoop, Kafka, etc.).
- Different versions share similar business logic.

Solution: Transfer Learning for Configuration Optimization (TL4CO)
Code: https://github.com/dice-project/DICE-Configuration-TL4CO

[Slide figure: the TL4CO workflow -- the configuration optimization loop of the current version reuses measurements and GP model hyper-parameters stored by the loops of earlier versions via a performance repository.]

(a) WordCount v1 (b) WordCount v2 (c) WordCount v3 (d) WordCount v4

Figure 2: The response functions corresponding to the four versions of WordCount (latency in ms over the number of splitters and counters).

Figure 3: An example of a 1-dimensional GP model. (Legend: true function; GP mean; GP variance; observations; selected point; true minimum.)

Figure 4: Transfer learning for configuration optimization. The optimization loop of the current version (initial design, model fit, next experiment, model update, until the budget is finished) draws on performance measurements stored in a performance repository by the loops of earlier versions; the stored data are filtered and selected for training, and GP model hyper-parameters are transferred.

observations: S¹_{1:t} = {(x_i^j, y_i^j) | j = 1, . . . , T, i = 1, . . . , t_j} (cf. S_{1:t}), where x_i^j are the configurations of the different versions for which we observed the response y_i^j. The version of the system (task) is indexed by j and contains t_j observations. Without loss of generality, we always assume j = 1 is the current version under test. In order to associate the observations (x_i^j, y_i^j) with task j, a label l_j = j is used. To transfer the learning from previous tasks to the current one, the MTGP framework allows us to use the product of two independent kernels [5]:

    k_TL4CO(l, l', x, x') = k_t(l, l') × k_xx(x, x'),    (5)

where the kernels k_t and k_xx represent the correlation between different versions and the covariance function for a particular version, respectively. Note that k_t depends only on the labels l pointing to the versions for which we have some observations, and k_xx depends only on the configuration x.
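The multiplicative structure of Eq. (5) can be sketched as follows (the 2×2 task-correlation matrix and the configurations are illustrative assumptions; in TL4CO, k_t is learned from the data, not fixed by hand):

```python
import numpy as np

# Assumed correlation between two versions; learned in practice.
K_task = np.array([[1.0, 0.8],
                   [0.8, 1.0]])

def k_xx(x, xp, ell=1.0):
    """Configuration kernel (Matern nu=1/2 form) for a single version."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return float(np.exp(-np.linalg.norm(d) / ell))

def k_tl4co(l, lp, x, xp):
    """Eq. (5): k_t(l, l') x k_xx(x, x') for version labels l, l' in {1, 2}."""
    return K_task[l - 1, lp - 1] * k_xx(x, xp)

same_version = k_tl4co(1, 1, [5, 1], [5, 1])     # identical version and config
cross_version = k_tl4co(1, 2, [5, 1], [5, 1])    # same config, damped by k_t
```

Observations from another version thus contribute to the posterior, but down-weighted by exactly the learned inter-version correlation.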

Algorithm 1: TL4CO
Input: Configuration space X, number of historical configuration optimization datasets T, maximum budget N_max, response function f, kernel function K, hyper-parameters θ_j, the current dataset S^{j=1}, datasets belonging to other versions S^{j=2:T}, diagonal noise matrix Σ, design sample size n, learning cycle N_l
Output: Optimal configuration x* and learned model M
1: D = {x_1, . . . , x_n} ← create an initial sparse design (lhd)
2: obtain the performance, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S^{j=2:T} ← select m points from the other versions (j = 2:T) that reduce entropy    ▷ Eq. (8)
4: S¹_{1:n} ← S^{j=2:T} ∪ {(x_i, y_i)}_{i=1}^n; t ← n + 1
5: M(x|S¹_{1:n}, θ_j) ← fit a multi-task GP to D    ▷ Eq. (6)
6: while t ≤ N_max do
7:   if (t mod N_l = 0) then θ ← learn the kernel hyper-parameters by maximizing the likelihood
8:   determine the next configuration x_t by optimizing LCB over M, x_t ← argmax_x u(x|M, S¹_{1:t−1})    ▷ Eq. (11)
9:   obtain the performance of x_t, y_t ← f(x_t) + ε_t
10:  augment the data, S¹_{1:t} = S¹_{1:t−1} ∪ {(x_t, y_t)}
11:  M(x|S¹_{1:t}, θ) ← re-fit a new GP model    ▷ Eq. (6)
12:  t ← t + 1
13: end while
14: (x*, y*) = min[S¹_{1:N_max} − S^{j=2:T}] ← locate the optimum configuration by discarding the data of the other system versions
15: M(x) ← store the learned model

method has several benefits over ordinary GP models: (i) we can plug in observations from different versions; (ii) the correlation between the system versions is learned automatically; (iii) the hyper-parameters are learned quickly by exploiting the similarity between measurements. Similarly to (3), (4), the mean and variance of the MTGP are [34]:

$$\mu_t(\mathbf{x}) = \mu(\mathbf{x}) + \mathbf{k}^\top (K(X,X) + \Sigma)^{-1} (\mathbf{y} - \boldsymbol{\mu}) \qquad (6)$$

$$\sigma^2_t(\mathbf{x}) = k(\mathbf{x},\mathbf{x}) + \sigma^2 - \mathbf{k}^\top (K(X,X) + \Sigma)^{-1} \mathbf{k}, \qquad (7)$$

where $\mathbf{y} := [\mathbf{f}_1, \dots, \mathbf{f}_T]^\top$ is the measured performance for all tasks, $X = [X_1, \dots, X_T]$ are the corresponding configurations, $\Sigma = \mathrm{diag}[\sigma^2_1, \dots, \sigma^2_T]$ is a diagonal noise matrix in which each noise term is repeated as many times as the number of historical observations $t_1, \dots, t_T$, and $\mathbf{k} := k(X, \mathbf{x})$. This is a conditional estimate at a desired configuration given the historical datasets for previous system versions and their respective GP hyper-parameters.
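Eqs. (6)-(7) are a standard noisy-GP posterior; a minimal numpy sketch follows. All numbers and argument names are illustrative assumptions, not the paper's code (which relies on the gpml library); a production version would use a Cholesky factorization instead of repeated solves.

```python
import numpy as np

def mtgp_posterior(K, Sigma, k_star, k_xx, y, mu_obs, mu_x, sigma2):
    """Eqs. (6)-(7): conditional mean/variance at a query configuration x.
    K: covariance over all observed (version, configuration) pairs;
    Sigma: diagonal noise matrix; k_star: k(X, x); k_xx: k(x, x);
    mu_obs: prior mean at the observations; mu_x: prior mean at x;
    sigma2: noise variance at the query point."""
    A = K + Sigma
    alpha = np.linalg.solve(A, y - mu_obs)            # (K + Sigma)^{-1} (y - mu)
    mean = mu_x + k_star @ alpha                      # Eq. (6)
    var = k_xx + sigma2 - k_star @ np.linalg.solve(A, k_star)  # Eq. (7)
    return float(mean), float(var)

# Toy check: one observation y = 2 with unit prior covariance, noise 0.1.
K = np.array([[1.0]]); Sigma = np.array([[0.1]])
m, v = mtgp_posterior(K, Sigma, np.array([1.0]), 1.0,
                      np.array([2.0]), np.array([0.0]), 0.0, 0.1)
# m = 2/1.1, v = 1.1 - 1/1.1
```

The noise term shrinks the posterior mean toward the prior and keeps the variance strictly positive even at an observed point, which matches the stationary-noise assumption made later in Section 3.4.4.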

An illustration of a multi-task GP versus a single-task GP and its effect on providing a more accurate model is given in Figure 5.

[Figure 2 image: four surface plots of latency (ms) over number of splitters × number of counters]

(a) WordCount v1 (b) WordCount v2 (c) WordCount v3 (d) WordCount v4

Figure 2: The response functions corresponding to the four versions of WordCount.

[Figure 3 image: a 1-dimensional GP over $\mathbb{X}$ with $f(\mathbf{x})$; legend: true function, GP mean, GP variance, observation, selected point, true minimum]

Figure 3: An example of 1-dimensional GP model.

[Figure 4 image: two configuration-optimization loops (version j=M with performance measurements, and version j=N), each Initial Design → Model Fit → Next Experiment → Model Update → Budget Finished; a performance repository connects them; annotations: select data for training; GP model hyper-parameters; store, filter]

Figure 4: Transfer learning for configuration optimization.


correlation between different versions

covariance functions

Multi-task GP vs single-task GP

[Figure 5 image: (a) 3 sample response functions (1), (2), (3) over the configuration domain, with observations marked; (b) GP fit for (1) ignoring observations for (2), (3) — not informative (LCB shown); (c) multi-task GP fit for (1) by transfer learning from (2), (3) — highly informative (GP prediction mean, GP prediction variance, probability distribution of the minimizers)]

Exploitation vs exploration

[Plot: κ over 10,000 iterations for κ_TL4CO and κ_BO4CO with ε = 1, 0.1, and 0.01]

Figure 5. Three different 1-dimensional response functions are to be minimized. Abundant samples are available over two of them, whereas function (1), which is under test, has only a few sparse observations. Merely using these few samples would result in poor (uninformative) predictions of the function, especially in areas where there are no observations. Using the correlation with the other two functions enables the MTGP model to provide more accurate predictions and, as a result, to locate the optimum more quickly.

3.3 Filtering out irrelevant data

This section describes how the points from previous system versions are selected. The number of historical observations taken into consideration matters because it affects the computational requirements: the matrix inversion in (6) incurs a cubic cost (shown in Section 4.6). TL4CO selects (step 3 in Algorithm 1) the points with the lowest entropy value. Entropy reduction is computed as the log of the ratio of the posterior uncertainty given an observation $\mathbf{x}_{i,j}$ belonging to version $j$ to the uncertainty without it:

$$I_{ij} = \log\Big(\frac{v(\mathbf{x}|X, \mathbf{x}_{i,j})}{v(\mathbf{x}|X)}\Big), \qquad (8)$$

where $v(\mathbf{x}|X)$ is the posterior uncertainty of the MTGP prediction at query point $\mathbf{x}$, obtained from (7).
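The filtering step can be sketched as follows. The per-point posterior variances are assumed to be precomputed from the MTGP (hypothetical inputs here); adding an observation can only shrink the variance, so the ratio in Eq. (8) is at most 1 and the most negative scores mark the most informative points.

```python
import numpy as np

def entropy_reduction(v_with, v_without):
    """Eq. (8): log-ratio of posterior variance with vs without one
    historical observation; more negative = more informative."""
    return np.log(np.asarray(v_with) / np.asarray(v_without))

def filter_points(candidates, m, v_with, v_without):
    """Step 3 of Algorithm 1 (sketch): keep the m historical points whose
    inclusion reduces the posterior uncertainty the most (lowest I_ij)."""
    scores = entropy_reduction(v_with, v_without)
    order = np.argsort(scores)            # ascending: most informative first
    return [candidates[i] for i in order[:m]]

pts = ["x1", "x2", "x3"]
kept = filter_points(pts, 2, v_with=[0.2, 0.9, 0.5], v_without=[1.0, 1.0, 1.0])
# keeps "x1" and "x3": their inclusion shrinks the posterior variance the most
```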

3.4 Model fitting in TL4CO

In this section, we provide some practical considerations and the extensions we made to the MTGP framework to make it applicable to configuration optimization.

3.4.1 Kernel function

We implemented the following kernel function to support integer and categorical variables (cf. Section 2.1):

$$k_{xx}(\mathbf{x}_i, \mathbf{x}_j) = \exp\Big(\sum_{\ell=1}^{d} \big(-\theta_\ell\, \delta(x_{i\ell} \neq x_{j\ell})\big)\Big), \qquad (9)$$

where $d$ is the number of dimensions (i.e., the number of configuration parameters), $\theta_\ell$ adjusts the scale along each dimension, and $\delta$ is a function that gives the distance between two categorical values using the Kronecker delta [20, 34]. TL4CO uses a different scale $\{\theta_\ell,\ \ell = 1 \dots d\}$ on each dimension, as suggested in [41, 34]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 7 in Algorithm 1), if the $\ell$-th dimension (parameter) turns out to be irrelevant, $\theta_\ell$ will take a small value, and the dimension will therefore be automatically discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.
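A minimal sketch of the kernel in Eq. (9), with hypothetical `theta` values chosen to show the ARD effect: a tiny scale on a dimension makes mismatches there essentially free, so that dimension is effectively discarded.

```python
import numpy as np

def kernel_ard(xi, xj, theta):
    """Eq. (9): kernel over integer/categorical parameters. The Kronecker-style
    indicator delta(x_il != x_jl) is 1 on a per-dimension mismatch; theta[l]
    is the per-dimension ARD scale (small theta[l] => dimension l ignored)."""
    mismatch = np.asarray(xi) != np.asarray(xj)
    return float(np.exp(-np.sum(np.asarray(theta) * mismatch)))

theta = [2.0, 0.01]                               # dim 2 is nearly irrelevant
k_same = kernel_ard([1, 3], [1, 3], theta)        # exp(0) = 1.0
k_irrelevant = kernel_ard([1, 3], [1, 5], theta)  # exp(-0.01), still close to 1
k_relevant = kernel_ard([2, 3], [1, 3], theta)    # exp(-2.0), strongly decorrelated
```

Identical configurations get covariance 1; a mismatch on the near-irrelevant dimension barely reduces it, whereas a mismatch on the relevant dimension decorrelates the points sharply.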

3.4.2 Prior mean function

While the kernel controls the structure of the estimated function, the prior mean $\mu(\mathbf{x}): \mathbb{X} \to \mathbb{R}$ provides a possible offset for our estimation. By default, this function is set to a constant $\mu(\mathbf{x}) := \mu$, which is inferred from the observations [34]. The prior mean function is, however, a way of incorporating expert knowledge when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table 3) we observed that, for Big Data systems, there is typically a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function $\mu(\mathbf{x}) := a\mathbf{x} + b$ allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn the slope for each dimension and an offset (denoted by $\mu_\ell = (a, b)$; see below).

3.4.3 Learning hyper-parameters

This section describes step 7 in Algorithm 1. Because the learning is computationally heavy, this process runs only every $N_l$ iterations. To learn the hyper-parameters of the kernel and of the prior mean function (cf. Sections 3.4.1 and 3.4.2), we maximize the marginal likelihood [34, 5] of the observations $\mathbb{S}^1_{1:t}$. To do so, we train our GP model (6) with $\mathbb{S}^1_{1:t}$. We optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [32]. For this purpose, we use the off-the-shelf gpml library presented in [32]. Using the multi-task kernel defined in (5), we learn $\theta := (\theta_t, \theta_{xx}, \mu_\ell)$, which comprises the hyper-parameters of $k_t$, $k_{xx}$ (cf. (5)) and the mean function $\mu(\cdot)$ (cf. Section 3.4.2). The learning is performed iteratively, resulting in a sequence of priors with parameters $\theta_i$ for $i = 1 \dots \lfloor N_{max}/N_\ell \rfloor$.

3.4.4 Observation noise

In Section 2.4, we showed that the noise level can be measured with high confidence and that the signal-to-noise ratios show this noise to be stationary. In (7), $\sigma$ represents the noise of the selected point $\mathbf{x}$. Typically this noise value is not known. In TL4CO, we estimate the noise level by an approximation: the query points can be assumed to be as noisy as the observed data. In other words, we treat $\sigma$ as a random variable and calculate its expected value as:

$$\sigma = \frac{\sum_{i=1}^{T} N_i \sigma^2_i}{\sum_{i=1}^{T} N_i}, \qquad (10)$$

where $N_i$ is the number of observations in the $i$-th dataset and $\sigma^2_i$ is the noise variance of the individual datasets.
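Eq. (10) is an observation-count-weighted average of the per-dataset noise variances; a one-function sketch (with hypothetical counts and variances):

```python
def estimated_noise(N, sigma2):
    """Eq. (10): expected noise at a query point as the observation-count
    weighted average of the per-dataset noise variances sigma_i^2."""
    assert len(N) == len(sigma2)
    return sum(n_i * s_i for n_i, s_i in zip(N, sigma2)) / sum(N)

# Two historical datasets: 100 noisy-ish and 300 noisier observations.
sigma = estimated_noise(N=[100, 300], sigma2=[0.4, 0.8])
# (100*0.4 + 300*0.8) / 400 = 0.7
```

Datasets with more observations dominate the estimate, which is the intended behavior when the query points are assumed to be as noisy as the bulk of the observed data.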

3.5 Configuration selection criteria

TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the configuration with the minimum response. This is done using a utility function $u: \mathbb{X} \to \mathbb{R}$ that determines which $\mathbf{x}_{t+1} \in \mathbb{X}$ should be evaluated next by $f(\cdot)$:

$$\mathbf{x}_{t+1} = \operatorname*{argmax}_{\mathbf{x} \in \mathbb{X}} u(\mathbf{x}|\mathcal{M}, \mathbb{S}^1_{1:t}) \qquad (11)$$

The selection criterion depends on the MTGP model $\mathcal{M}$ solely through its predictive mean $\mu_t(\mathbf{x}_t)$ and variance $\sigma^2_t(\mathbf{x}_t)$ conditioned on the observations $\mathbb{S}^1_{1:t}$. TL4CO uses the Lower Confidence Bound (LCB) [24]:

$$u_{LCB}(\mathbf{x}|\mathcal{M}, \mathbb{S}^1_{1:t}) = \operatorname*{argmin}_{\mathbf{x} \in \mathbb{X}} \mu_t(\mathbf{x}) - \kappa\,\sigma_t(\mathbf{x}), \qquad (12)$$

where $\kappa$ is an exploitation-exploration parameter. For instance, if we need to find a near-optimal configuration, we set a low value of $\kappa$ to make the most of the predictive mean. However, if we are looking for a globally optimal configuration, we can set a high value of $\kappa$ in order to skip local minima. Furthermore, $\kappa$ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, $\kappa$ can start with a relatively higher value at the early iterations compared to BO4CO, since the former provides a better estimate of the mean and therefore contains more information at the early stages.
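The LCB rule of Eq. (12) over a finite candidate set can be sketched directly; the candidate names, means, and variances below are hypothetical, and in TL4CO they would come from the MTGP posterior of Eqs. (6)-(7).

```python
import numpy as np

def lcb(mu, sigma, kappa):
    """Eq. (12): lower confidence bound mu_t(x) - kappa * sigma_t(x)."""
    return np.asarray(mu) - kappa * np.asarray(sigma)

def next_configuration(candidates, mu, sigma, kappa):
    """Step 8 of Algorithm 1 (sketch): pick the candidate minimizing the LCB.
    Low kappa exploits the predictive mean; high kappa explores
    high-variance regions."""
    return candidates[int(np.argmin(lcb(mu, sigma, kappa)))]

cands = ["c1", "c2"]
# c1: low predicted latency, well explored; c2: higher mean, high uncertainty.
exploit = next_configuration(cands, mu=[1.0, 1.5], sigma=[0.1, 1.0], kappa=0.1)
explore = next_configuration(cands, mu=[1.0, 1.5], sigma=[0.1, 1.0], kappa=3.0)
# kappa=0.1 picks c1 (best mean); kappa=3.0 picks c2 (most uncertain)
```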

TL4CO output. Once the $N_{max}$ different configurations of the system under test have been measured, the TL4CO algorithm terminates. Finally, TL4CO produces its outputs, including the optimal configuration (step 14 in Algorithm 1) as well as the learned MTGP model (step 15 in Algorithm 1). All this information is versioned and stored in a performance repository (cf. Figure 7) to be used for future versions of the system.


TL4CO architecture

[Figure 7 diagram: Configuration Optimisation Tool (TL4CO); performance repository; Monitoring; Deployment Service; Data Preparation; Experimental Suite; Testbed; Data Broker (Kafka); Tester; Workload Generator; Technology Interface (Storm, Cassandra, Spark); System Under Test (V1, V2, …, Vn); Doc; flows: configuration parameters, values, historical data, GP model, experiment time, polling interval, previous versions]

Experimental results

[Plots: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, Drift; (a) WC(5D), (b) WC(3D)]

- 30 runs, report average performance
- Yes, we did full factorial measurements and we know where the global minimum is

Experimental results (cont.)

[Plot: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, Drift; (b) WC(6D)]

Experimental results (cont.)

[Plots: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, Drift; (a) SOL(6D), (b) RS(6D)]

Comparison with default and expert prescription

[Plots: throughput (ops/sec) vs. average read latency (µs) and average write latency (µs) for TL4CO and BO4CO; annotations: BO4CO after 20 iterations, TL4CO after 20 iterations, TL4CO after 100 iterations, BO4CO after 100 iterations, default configuration, configuration recommended by expert]

Model accuracy

[Box plots: absolute percentage error [%] of TL4CO vs. polyfit1-polyfit5, and of TL4CO vs. BO4CO, M5Tree, R-Tree, M5Rules, MARS, PRIM]

Prediction accuracy over time

[Plots: prediction error (RMSE) vs. iteration; (a) TL4CO with T=2, m=100/200/300 and T=3, m=100; (b) TL4CO vs. polyfit1, polyfit2, polyfit4, polyfit5, M5Tree, M5Rules, PRIM]

Entropy of the density function of the minimizers

[Plots: entropy vs. iteration for T=1 (BO4CO), T=2 with m=100/200/300/400, and T=3 with m=100; bar chart: entropy of BO4CO vs. TL4CO on Branin, Hartmann, Dixon, WC(3D), WC(5D), WC(6D), SOL(6D), RS(6D), cass-20]

[Plot: absolute error (µs) vs. iteration (10 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, Drift]

Figure 15: cass-20 configuration optimization.

[Plots: throughput (ops/sec) vs. average read latency (µs) and average write latency (µs) for TL4CO and BO4CO; annotations: BO4CO after 20 iterations, TL4CO after 20 iterations, TL4CO after 100 iterations, BO4CO after 100 iterations, default configuration, configuration recommended by expert]

Figure 16: TL4CO, BO4CO and expert prescription.

4.5 Entropy analysis

The knowledge about the location of the optimum configuration is summarized by the approximation of the conditional probability density function of the response function minimizers, i.e., $X^* = \Pr(x^*|f(x))$, where $f(\cdot)$ is drawn from the MTGP model (cf. solid red line in Figure 5(b,c)). The entropies of the density functions in Figure 5(b,c) are 6.39 and 3.96 respectively, so we have more information about the latter.

The results in Figure 19 confirm that, for all the datasets (synthetic and real), the entropy of the minimizers under the models provided by TL4CO conveys significantly more information. The results demonstrate that the main reason for the quick convergence compared with the baselines is that TL4CO employs a more effective model. The results in Figure 19(b) show the change of entropy of $X^*$ over time for the WC(5D) dataset. First, in TL4CO the entropy decreases sharply, whereas the overall decrease of entropy for BO4CO is slow. Second, as we increase the number of observations from similar tasks, the gain in terms of entropy change is not significant. To show this more clearly, we disabled the filtering step for $T=2, m=200$, and the results show that the entropy of $X^*$ then not only fails to decrease, but slightly increases over time.

4.6 Computational and memory requirements

The inference complexity in TL4CO is $O(T^2 t^2)$ because of the Cholesky kernel inversion in (6). Since in TL4CO we learn the hyper-parameters every $N_\ell$ iterations, the Cholesky decomposition must be re-computed; the complexity is therefore in principle $O(T^2 t^2 \times t/N_\ell)$. Figure 20(a) reports the runtime overhead of TL4CO. The time is measured on a MacBook Pro with a 2.5 GHz Intel Core i7 CPU and 16 GB of memory. The computation time on the larger datasets (RS(6D), SOL(6D), WC(6D)) is higher than on those with less data and lower dimensions (WC(3D), WC(5D)). Moreover, the computation time increases over the iterations, since the matrix for the Cholesky inversion grows larger. Figure 20(b) shows the mean and variance of the response time over 100 iterations for different numbers of tasks and historical observations per task.


Figure 17: Absolute percentage error of predictions made by TL4CO's MTGP vs. regression models for cass-20(6D).


Figure 18: (a) comparing prediction accuracy of TL4CO with other regression models; (b) comparing prediction accuracy of TL4CO with different numbers of tasks and observations.

TL4CO requires storing three vectors of size $|\mathbb{X}| = N$ for the mean, variance, and LCB estimates, a matrix of size $Tn \times Tn$ for storing $K$, and $Tn$ entries for keeping track of historical observations, making the memory requirement $O(3N + (Tn)^2 + Tn)$.

5. DISCUSSIONS

5.1 Behind the scenes

TL4CO in practice: usability and extensibility. In our experiments, we evaluated whether TL4CO is feasible in practice. Although partly parallelized, we invested more than three months (24/7) in performance measurements of the systems in Table 3 to obtain a large data basis. However, our approach would require far fewer measurements, as most of these were meant to evaluate the approach itself.

TL4CO can also be used in a different scenario for configuration tuning of Big Data software. Let us assume that the current version of the system is expensive to test (e.g., it has a very large database and each round of experiments takes several hours [10]). We therefore use a smaller-scale instance of the current version (e.g., with only a few records of representative data). The idea is to spend the larger portion of the budget collecting data from the cheaper version; this data can then be exploited to accelerate the tuning of the expensive version.

TL4CO is easy to use: end users only need to determine the parameters of interest as well as the experimental parameters, and the tool then automatically sets the optimized parameters for the system in its configuration file (all in YAML format). Currently, TL4CO supports Apache Storm and Cassandra. However, it is designed to be extensible; see the technology interface in Figure 7.

How much did ARD help? TL4CO uses the ARD technique (see Section 3.4.1) in order to get a better MTGP fit in state spaces where a subset of parameters interact [21]. To compare the effects of this technique, we disable ARD for optimization


Knowledge about the location of the minimizer

Effect of ARD

[Figure: absolute error (ms), log scale, vs. iteration (10 minutes each); panels (a) and (b) compare TL4CO with ARD on against TL4CO with ARD off.]

Figure 5: Three different 1D (one-dimensional) response functions are to be minimized. Samples are available all over two of them, whereas function (1), which is under test, has only a few sparse observations. Merely using these few samples would result in poor (uninformative) predictions of the function, especially in the areas where there are no observations. Using the correlation with the other two functions enables the MTGP model to provide more accurate predictions and, as a result, to locate the optimum more quickly.

3.3 Filtering out irrelevant data

Here we describe how the points from the previous system versions are selected. The number of historical observations taken into consideration matters because it affects the computational requirements: the matrix inversion in (6) incurs a cubic cost (shown in Section 4.6). TL4CO selects (step 3 in Algorithm 1) the points with the lowest entropy value. The entropy reduction is computed as the log of the ratio of the posterior uncertainty given an observation x_{i,j} belonging to version j, to its uncertainty without it:

I_{i,j} = log( v(x | X, x_{i,j}) / v(x | X) ),   (8)

where v(x | X) is the posterior uncertainty of the prediction from the MTGP model at query point x, obtained in (7).
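The filter in (8) can be illustrated with a plain single-task GP posterior. The following is a minimal sketch, not TL4CO's implementation: it assumes a squared-exponential kernel, and the names `posterior_variance`, `entropy_reduction`, and `select_points` are our own.

```python
import numpy as np

def rbf_kernel(A, B, theta=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / theta ** 2)

def posterior_variance(x, X, sigma2=1e-2, theta=1.0):
    """GP posterior variance v(x | X) at query point x, in the spirit of (7)."""
    K = rbf_kernel(X, X, theta) + sigma2 * np.eye(len(X))
    k = rbf_kernel(x[None, :], X, theta)                      # 1 x n
    kxx = rbf_kernel(x[None, :], x[None, :], theta)[0, 0]
    return kxx - (k @ np.linalg.solve(K, k.T))[0, 0]

def entropy_reduction(x, X, xi):
    """I = log( v(x | X ∪ {xi}) / v(x | X) ), cf. (8); negative when xi is informative."""
    X_aug = np.vstack([X, xi[None, :]])
    return np.log(posterior_variance(x, X_aug) / posterior_variance(x, X))

def select_points(x, X, candidates, m):
    """Keep the m historical points with the lowest entropy value, i.e.
    those that shrink the posterior uncertainty at x the most."""
    scores = [entropy_reduction(x, X, xi) for xi in candidates]
    order = np.argsort(scores)        # most negative (largest reduction) first
    return candidates[order[:m]]
```

A point close to the query collapses the posterior variance there and gets a strongly negative score; a remote point leaves the variance nearly unchanged and scores near zero, so it is filtered out.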

3.4 Model fitting in TL4CO

In this section, we provide some practical considerations and the extensions we made to the MTGP framework to make it applicable to configuration optimization.

3.4.1 Kernel function

We implemented the following kernel function to support integer and categorical variables (cf. Section 2.1):

k_xx(x_i, x_j) = exp( Σ_{ℓ=1}^{d} (−θ_ℓ δ(x_{i,ℓ} ≠ x_{j,ℓ})) ),   (9)

where d is the number of dimensions (i.e., the number of configuration parameters), θ_ℓ adjusts the scale along the ℓ-th dimension, and δ is a function that gives the distance between two categorical variables using the Kronecker delta [20, 34]. TL4CO uses different scales {θ_ℓ, ℓ = 1 . . . d} on different dimensions, as suggested in [41, 34]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 7 in Algorithm 1), if the ℓ-th dimension (parameter) turns out to be irrelevant, θ_ℓ will take a small value, and the dimension will therefore be automatically discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.
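Kernel (9) is straightforward to write down directly. A minimal sketch (the function name `cat_ard_kernel` is our own):

```python
import numpy as np

def cat_ard_kernel(xi, xj, theta):
    """Kernel (9): k(xi, xj) = exp( sum_l -theta_l * delta(xi_l != xj_l) ).
    delta is the Kronecker-delta-based distance for categorical parameters;
    theta_l is the per-dimension ARD scale."""
    delta = np.asarray(xi) != np.asarray(xj)   # 1 where the categories differ
    return np.exp(-(np.asarray(theta) * delta).sum())
```

With theta = [2.0, 0.01], two configurations differing only in the second dimension still have similarity exp(-0.01) ≈ 0.99, so that (irrelevant) dimension is effectively discarded, while a difference in the first dimension drops the similarity to exp(-2) ≈ 0.14.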

3.4.2 Prior mean function

While the kernel controls the structure of the estimated function, the prior mean µ(x) : X → R provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations [34]. However, the prior mean function is also a way of incorporating expert knowledge, when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table 3) we observed that, for Big Data systems, there is typically a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function µ(x) := ax + b allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn the slope for each dimension and an offset (denoted by µ_ℓ = (a, b), see next).

3.4.3 Learning hyper-parameters

This section describes step 7 in Algorithm 1. Due to the heavy computation involved, the learning is performed only every N_ℓ iterations. To learn the hyper-parameters of the kernel and of the prior mean function (cf. Sections 3.4.1 and 3.4.2), we maximize the marginal likelihood [34, 5] of the observations S_{1:t}. To do that, we train our GP model (6) with S_{1:t}. We optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [32]. For this purpose, we use the off-the-shelf gpml library presented in [32]. Using the multi-task kernel defined in (5), we learn θ := (θ_t, θ_xx, µ_ℓ), which comprises the hyper-parameters of k_t, k_xx (cf. (5)) and of the mean function µ(·) (cf. Section 3.4.2). The learning is performed iteratively, resulting in a sequence of priors with parameters θ_i for i = 1 . . . ⌊N_max / N_ℓ⌋.
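The paper uses the MATLAB gpml library; the following is a hypothetical Python analogue of the same step, a sketch assuming a single-task GP with an ARD squared-exponential kernel and a linear prior mean. The parameterization and the names `neg_log_marginal_likelihood` and `learn_hyperparameters` are our own, not TL4CO's.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y):
    """-log p(y | X, theta) for a GP with an ARD squared-exponential kernel
    and a linear prior mean mu(x) = a.x + b (hypothetical parameterization:
    d log length-scales, d slopes a, offset b, log noise variance)."""
    d = X.shape[1]
    scales = np.exp(params[:d])
    a, b = params[d:2 * d], params[2 * d]
    noise = np.exp(params[-1])
    diff = (X[:, None, :] - X[None, :, :]) / scales
    K = np.exp(-0.5 * (diff ** 2).sum(-1)) + noise * np.eye(len(X))
    r = y - (X @ a + b)                       # residual w.r.t. the linear mean
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return (0.5 * r @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2 * np.pi))

def learn_hyperparameters(X, y, restarts=5, seed=0):
    """Multi-start quasi-Newton (L-BFGS-B) maximization of the marginal
    likelihood, mirroring the multi-started hill-climbers of the text."""
    rng = np.random.default_rng(seed)
    n_params = 2 * X.shape[1] + 2
    best = None
    for _ in range(restarts):
        x0 = rng.normal(size=n_params)
        res = minimize(neg_log_marginal_likelihood, x0, args=(X, y),
                       method="L-BFGS-B", bounds=[(-5.0, 5.0)] * n_params)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```

Each restart begins from a random point, so the best of several local optima approximates the global maximizer of the marginal likelihood.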

3.4.4 Observation noise

In Section 2.4, we have shown that the noise level can be measured with high confidence and that the signal-to-noise ratios show the noise is stationary. In (7), σ represents the noise of the selected point x. Typically this noise value is not known. In TL4CO, we estimate the noise level by an approximation: the query points can be assumed to be as noisy as the observed data. In other words, we treat σ as a random variable and calculate its expected value as:

σ = ( Σ_{i=1}^{T} N_i σ_i^2 ) / ( Σ_{i=1}^{T} N_i ),   (10)

where N_i is the number of observations from the i-th dataset and σ_i^2 is the noise variance of the individual dataset.
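Estimate (10) is simply an observation-count-weighted average over the T datasets. A one-function sketch (the name `pooled_noise_variance` is ours):

```python
def pooled_noise_variance(counts, variances):
    """Expected noise level per (10): the observation-count-weighted
    average of the per-dataset noise variances sigma_i^2."""
    total = sum(counts)
    return sum(n * v for n, v in zip(counts, variances)) / total
```

For example, two datasets with 100 and 50 observations and noise variances 0.04 and 0.01 pool to (100·0.04 + 50·0.01) / 150 = 0.03.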

3.5 Configuration selection criteria

TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the minimum response. This is done using a utility function u : X → R that determines the point x_{t+1} ∈ X at which f(·) should be evaluated next:

x_{t+1} = argmax_{x ∈ X} u(x | M, S_{1:t}).   (11)

The selection criterion depends on the MTGP model M solely through its predictive mean µ_t(x) and variance σ_t^2(x), conditioned on the observations S_{1:t}. TL4CO uses the Lower Confidence Bound (LCB) [24]:

u_LCB(x | M, S_{1:t}) = argmin_{x ∈ X} µ_t(x) − κ σ_t(x),   (12)

where κ is an exploitation-exploration parameter. For instance, if we need to find a near-optimal configuration, we set κ to a low value to make the most of the predictive mean. However, if we are looking for a globally optimal one, we can set κ to a high value in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, κ can start with a relatively higher value at the early iterations compared to BO4CO, since the former provides a better estimate of the mean and therefore contains more information at the early stages.
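Over a finite configuration space, the LCB selection of (12) reduces to an argmin over the posterior mean and standard deviation at each candidate. A minimal sketch (function names are ours; the decay schedule for κ is a hypothetical example, not the adaptation rule of [22]):

```python
import numpy as np

def select_next_configuration(mean, std, kappa):
    """Step 8, cf. (12): pick the index minimizing mu_t(x) - kappa * sigma_t(x).
    Small kappa exploits the predictive mean; large kappa favors
    uncertain configurations (exploration)."""
    lcb = mean - kappa * std
    return int(np.argmin(lcb))

def kappa_schedule(t, kappa0=3.0, decay=0.05):
    """Hypothetical schedule: start with a high kappa (explore) and
    decay it over iterations (exploit)."""
    return kappa0 * np.exp(-decay * t)
```

With mean [1.0, 0.9, 1.2] and std [0.0, 0.05, 0.5], κ = 0 exploits (picks index 1, the lowest mean), while κ = 1 explores (picks index 2, whose large uncertainty makes its lower bound smallest).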

TL4CO output. Once the N_max different configurations of the system under test have been measured, the TL4CO algorithm terminates. Finally, TL4CO produces its outputs, including the optimal configuration (step 14 in Algorithm 1) as well as the learned MTGP model (step 15 in Algorithm 1). All this information is versioned and stored in a performance repository (cf. Figure 7) to be used for future versions of the system.


Effect of correlation between tasks

[Figure: absolute error, log scale, vs. iteration, comparing TL4CO with correlated tasks f(x)+randn×f(x), TL4CO with f(x)+2randn×f(x), and BO4CO.]

Runtime overhead

[Figure: (a) TL4CO per-iteration elapsed time (s) over 100 iterations for WordCount (3D), WordCount (5D), WordCount (6D), SOL (6D), and RollingSort (6D); (b) TL4CO elapsed time (s) for different sizes of observations: T=2 with m=100, 200, 400 and T=3 with m=100, 200.]

Key takeaways

- A principled way to leverage prior knowledge gained from searches over previous versions of the system.

- Multi-task GPs can be used to capture the correlation between related tuning tasks.

- MTGPs are more accurate than STGPs and other regression models.

- Application to SPSs, Batch, and NoSQL systems.
- Leads to better performance in practice.

Acknowledgement: BO4CO and TL4CO are now integrated with other DevOps tools in the delivery pipeline for Big Data in the H2020 DICE project.

[Diagram: DICE architecture. The DICE IDE, with its Profile and Plugins (Sim, Ver, Opt), operates over the DPIM, DTSM, and DDSM models and the TOSCA methodology. Deploy, Config, Test, Monitoring, Anomaly detection, Tracing, Iterative Enhancement, Continuous Integration, and Fault Injection tools support Data-Intensive Applications (DIA) running on Big Data technologies in private/public clouds, organized across work packages WP1-WP6 (WP6: Demonstrators).]

Tool Demo: http://www.slideshare.net/pooyanjamshidi/configuration-optimization-tool