Transfer Learning for Optimal Configuration of Big Data Software
TRANSCRIPT
Pooyan Jamshidi, Giuliano Casale, Salvatore Dipietro
Imperial College London, [email protected]
Department of Computing, Research Associate Symposium
14th June 2016
Motivation
1. Many different parameters => large state space, interactions
2. Defaults are typically used => poor performance
Motivation
[Figure: histograms of observations vs. average read latency (µs) for the best and worst configurations, on (a) cass-20 and (b) cass-10.]
Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
Goal!
information from the previous versions, the acquired data on the current version, and apply a variety of kernel estimators [27] to locate regions where optimal configurations may lie. The key benefit of MTGPs over GPs is that the similarity between the response data helps the model converge to accurate predictions much earlier. We experimentally show that TL4CO outperforms state-of-the-art configuration tuning methods. Our real configuration datasets are collected for three stream processing systems (SPS) implemented with Apache Storm, a NoSQL benchmark system with Apache Cassandra, and a dataset on 6 cloud-based clusters obtained in [21], worth 3 months of experimental time.

The rest of this paper is organized as follows. Section 2 overviews the problem and motivates the work via an example. TL4CO is introduced in Section 3 and then validated in Section 4. Section 5 provides behind-the-scenes details and Section 6 concludes the paper.
2. PROBLEM AND MOTIVATION
2.1 Problem statement

In this paper, we focus on the problem of optimal system configuration, defined as follows. Let Xi indicate the i-th configuration parameter, which ranges in a finite domain Dom(Xi). In general, Xi may indicate either (i) an integer variable, such as the level of parallelism, or (ii) a categorical variable, such as the messaging framework, or a Boolean variable, such as enabling timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to the possible values that can be assigned to a parameter.
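The configuration space defined above is simply the Cartesian product of the parameter domains, which can be sketched in a few lines of Python (the parameter names and domains below are hypothetical illustrations, not taken from this paper):

```python
from itertools import product

# Hypothetical parameter domains: one integer, one categorical, one Boolean
dom = {
    "parallelism": [1, 2, 4, 8],        # integer parameter (level of parallelism)
    "messaging": ["zeromq", "netty"],   # categorical parameter (messaging framework)
    "timeout": [False, True],           # Boolean parameter (enabling timeout)
}

# X = Dom(X1) x ... x Dom(Xd): every combination of options is one configuration
configurations = [dict(zip(dom, values)) for values in product(*dom.values())]

print(len(configurations))  # 4 * 2 * 2 = 16
```

Even this tiny example shows how quickly the space grows: each added parameter multiplies the number of configurations by the size of its domain.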
We assume that each configuration x ∈ X in the configuration space X = Dom(X1) × · · · × Dom(Xd) is valid, i.e., the system accepts this configuration and the corresponding test results in a stable performance behavior. The response with configuration x is denoted by f(x). Throughout, we assume that f(·) is latency; however, other response metrics may be used. We here consider the problem of finding an optimal configuration x* that globally minimizes f(·) over X:

    x* = arg min_{x ∈ X} f(x)    (1)
In fact, the response function f(·) is usually unknown or partially known, i.e., yi = f(xi), xi ∈ X. In practice, such measurements may contain noise, i.e., yi = f(xi) + ε. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A popular solution is based on sampling: it starts with a number of sampled configurations. The performance of the experiments associated with these initial samples can deliver a better understanding of f(·) and guide the generation of the next round of samples. If properly guided, the process of sample generation, evaluation, feedback, and regeneration will eventually converge and the optimal configuration will be located. However, a sampling-based approach of this kind can be prohibitively expensive in terms of time or cost (e.g., rental of cloud resources), considering that a function evaluation in this case would be costly and the optimization process may require several hundreds of samples to converge.
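The generation–evaluation–feedback–regeneration loop can be sketched as follows; the quadratic response function and Gaussian noise below are illustrative stand-ins for a real, costly benchmark run, and the "nearest unmeasured neighbor of the incumbent" rule is only one naive way to guide regeneration:

```python
import random

def measure(x):
    # Stand-in for one costly benchmark run: y = f(x) + noise
    return (x - 7) ** 2 + random.gauss(0, 0.5)

def tune(candidates, budget):
    # Sample generation -> evaluation -> feedback -> regeneration:
    # start from a small random design, then repeatedly evaluate the
    # unmeasured configuration closest to the incumbent best.
    observed = {}
    for x in random.sample(candidates, 3):      # initial samples
        observed[x] = measure(x)
    for _ in range(budget - 3):                 # guided regeneration
        best = min(observed, key=observed.get)
        pool = [c for c in candidates if c not in observed]
        if not pool:
            break
        x = min(pool, key=lambda c: abs(c - best))
        observed[x] = measure(x)
    return min(observed, key=observed.get)

random.seed(1)
print(tune(list(range(20)), budget=15))  # converges near the optimum x = 7
```

The budget here is the crux: every entry in `observed` costs one full benchmark run, which is exactly why the paper argues for reusing data from previous versions instead of spending hundreds of runs per version.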
2.2 Related work

Several approaches have attempted to address the above problem. Recursive Random Sampling (RRS) [43] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [42] integrates importance sampling with Latin Hypercube Design (LHD). SHC estimates a local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [45] forms a simplex in the parameter space from a number of samples, and iteratively updates the simplex through a number of well-defined operations, including reflection, expansion, and contraction, to guide the sample generation. Quick Optimization via Guessing (QOG) [31] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations. The statistical approach in [33] approximates the joint distribution of parameters with a Gaussian in order to guide sample generation towards the distribution peak. A model-based approach [35] iteratively constructs a regression model representing performance influences. Some approaches, like [8], enable dynamic detection of optimal configurations in dynamic environments. Finally, our earlier work BO4CO [21] uses Bayesian Optimization based on GPs to accelerate the search process; more details are given later in Section 3.
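BO4CO's core loop, Bayesian optimization with a GP surrogate, can be sketched in pure Python; this is an illustrative reconstruction under our own assumptions (the squared-exponential kernel, the fixed hyper-parameters, and the lower-confidence-bound acquisition rule are choices made for this sketch, not necessarily BO4CO's actual ones):

```python
import math

def sqexp(a, b, ell=1.0):
    # Squared-exponential kernel: nearby configurations have correlated responses
    return math.exp(-((a - b) ** 2) / (2.0 * ell ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (the matrices here are tiny)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_star, noise=0.1):
    # GP posterior mean and variance at x_star, given noisy observations (xs, ys)
    K = [[sqexp(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [sqexp(a, x_star) for a in xs]
    mu = sum(w * k for w, k in zip(solve(K, ys), k_star))
    var = sqexp(x_star, x_star) - sum(v * k for v, k in zip(solve(K, k_star), k_star))
    return mu, max(var, 1e-12)

def bo_minimize(f, candidates, budget, kappa=2.0):
    # Fit a GP to all samples so far, then evaluate the candidate that
    # minimizes the lower confidence bound mu - kappa * sigma.
    xs = [candidates[0], candidates[len(candidates) // 2], candidates[-1]]
    ys = [f(x) for x in xs]
    for _ in range(budget - len(xs)):
        pool = [c for c in candidates if c not in xs]
        if not pool:
            break
        def lcb(c):
            mu, var = gp_posterior(xs, ys, c)
            return mu - kappa * math.sqrt(var)
        x = min(pool, key=lcb)
        xs.append(x)
        ys.append(f(x))
    return xs[ys.index(min(ys))]
```

On a toy response such as f(x) = (x − 3)² over a grid of candidates, this loop should home in near the optimum within a small budget: the GP mean steers the search toward low-latency regions while the variance term keeps it exploring unmeasured ones.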
2.3 Solution

All the previous efforts attempt to improve the sampling process by exploiting the information that has been gained in the current task. We define a task as an individual tuning cycle that optimizes a given version of the system under test. As a result, the learning is limited to the current observations and still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to improve the search efficiency of configuration tuning. Rather than starting the search from scratch, the approach transfers the knowledge learned from similar versions of the software to accelerate the sampling process in the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously, (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters), and (iii) different versions of a system often share a similar business logic.
To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a previously learned model but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.
2.4 Motivation

A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input messages from a data source and pushes them to the system. A PE of type Bolt named Splitter is responsible for splitting sentences, which are then counted by the Counter.
[Figure: a file streams sentences to a Kafka Topic; a Kafka Spout emits sentences to the Splitter Bolt, which emits words to the Counter Bolt (e.g., [paintings, 3], [poems, 60], [letter, 75]).]

Figure 1: WordCount architecture.
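The three-layer data flow of Figure 1 can be mimicked in a few lines; this is a toy, single-process simulation of the Spout → Splitter → Counter pipeline, not actual Apache Storm or Kafka code:

```python
from collections import Counter

def splitter_bolt(sentence):
    # Splitter Bolt: split each incoming sentence into words
    return sentence.split()

def counter_bolt(words, counts):
    # Counter Bolt: maintain a running count per word
    counts.update(words)

def word_count_topology(sentences):
    # Spout: emit sentences read from the (simulated) Kafka source,
    # then push each one through the Splitter and the Counter
    counts = Counter()
    for sentence in sentences:
        counter_bolt(splitter_bolt(sentence), counts)
    return counts

print(word_count_topology(["poems and letters", "poems"]))  # Counter({'poems': 2, ...})
```

In the real topology, each of these stages is a separately parallelized PE, which is precisely where configuration parameters such as the number of splitters and counters come in.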
[Figure: response surface of latency (ms) over the number of splitters and the number of counters (cubic interpolation over a finer grid).]
- Partially known
- Measurements contain noise
- Configuration space
Finding the optimum configuration is difficult!
[Figure: latency (ms) vs. number of counters for splitters=2 and splitters=3, alongside the latency response surface over the number of splitters and the number of counters (cubic interpolation over a finer grid).]
information from the previous versions, the acquired data onthe current version, and apply a variety of kernel estimators[27] to locate regions where optimal configurations may lie.The key benefit of MTGPs over GPs is that the similaritybetween the response data helps the model to converge tomore accurate predictions much earlier. We experimentallyshow that TL4CO outperforms state-of-the-art configurationtuning methods. Our real configuration datasets are collectedfor three stream processing systems (SPS), implemented withApache Storm, a NoSQL benchmark system with ApacheCassandra, and using a dataset on 6 cloud-based clustersobtained in [21] worth 3 months of experimental time.The rest of this paper is organized as follows. Section 2
overviews the problem and motivates the work via an exam-ple. TL4CO is introduced in Section 3 and then validated inSection 4. Section 5 provides behind the scene and Section 6concludes the paper.
2. PROBLEM AND MOTIVATION
2.1 Problem statementIn this paper, we focus on the problem of optimal system
configuration defined as follows. Let Xi indicate the i-thconfiguration parameter, which ranges in a finite domainDom(Xi). In general, Xi may either indicate (i) integer vari-able such as level of parallelism or (ii) categorical variablesuch as messaging frameworks or Boolean variable such asenabling timeout. We use the terms parameter and factor in-terchangeably; also, with the term option we refer to possiblevalues that can be assigned to a parameter.
We assume that each configuration x 2 X in the configura-tion space X = Dom(X1)⇥ · · ·⇥Dom(Xd) is valid, i.e., thesystem accepts this configuration and the corresponding testresults in a stable performance behavior. The response withconfiguration x is denoted by f(x). Throughout, we assumethat f(·) is latency, however, other metrics for response maybe used. We here consider the problem of finding an optimalconfiguration x
⇤ that globally minimizes f(·) over X:
x
⇤ = argminx2X
f(x) (1)
In fact, the response function f(·) is usually unknown orpartially known, i.e., yi = f(xi),xi ⇢ X. In practice, suchmeasurements may contain noise, i.e., yi = f(xi) + ✏. Thedetermination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], whichis considerably harder than deterministic optimization. Apopular solution is based on sampling that starts with anumber of sampled configurations. The performance of theexperiments associated to this initial samples can delivera better understanding of f(·) and guide the generation ofthe next round of samples. If properly guided, the processof sample generation-evaluation-feedback-regeneration willeventually converge and the optimal configuration will belocated. However, a sampling-based approach of this kind canbe prohibitively expensive in terms of time or cost (e.g., rentalof cloud resources) considering that a function evaluation inthis case would be costly and the optimization process mayrequire several hundreds of samples to converge.
2.2 Related workSeveral approaches have attempted to address the above
problem. Recursive Random Sampling (RRS) [43] integratesa restarting mechanism into the random sampling to achievehigh search e�ciency. Smart Hill Climbing (SHC) [42] inte-grates the importance sampling with Latin Hypercube Design
(lhd). SHC estimates the local regression at each potentialregion, then it searches toward the steepest descent direction.An approach based on direct search [45] forms a simplexin the parameter space by a number of samples, and itera-tively updates a simplex through a number of well-definedoperations including reflection, expansion, and contractionto guide the sample generation. Quick Optimization viaGuessing (QOG) in [31] speeds up the optimization processexploiting some heuristics to filter out sub-optimal configu-rations. The statistical approach in [33] approximates thejoint distribution of parameters with a Gaussian in order toguide sample generation towards the distribution peak. Amodel-based approach [35] iteratively constructs a regressionmodel representing performance influences. Some approacheslike [8] enable dynamic detection of optimal configurationin dynamic environments. Finally, our earlier work BO4CO[21] uses Bayesian Optimization based on GPs to acceleratethe search process, more details are given later in Section 3.
2.3 Solution

All the previous efforts attempt to improve the sampling process by exploiting the information gained in the current task. We define a task as an individual tuning cycle that optimizes a given version of the system under test. As a result, the learning is limited to the current observations, and it still requires hundreds of sample evaluations. In this paper, we propose to adopt a transfer learning method to improve the search efficiency in configuration tuning. Rather than starting the search from scratch, the approach transfers the knowledge learned from similar versions of the software to accelerate the sampling process in the current version. This idea is inspired by several observations arising in real software engineering practice [2, 3]. For instance, (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and run on similar platforms (e.g., cloud clusters); and (iii) different versions of a system often share a similar business logic.
To the best of our knowledge, only one study [9] explores the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a previously learned model but also the valuable raw data. Therefore, we are not limited by the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.
2.4 Motivation

A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream. A Processing Element (PE) of type Spout reads the input messages from a data source and pushes them to the system. A PE of type Bolt named Splitter is responsible for splitting sentences into words, which are then counted by the Counter.
Figure 1: WordCount architecture. A file stream is pushed to a Kafka topic; the Kafka Spout reads sentences from the topic, the Splitter Bolt splits them into words, and the Counter Bolt counts word occurrences (e.g., [paintings, 3], [poems, 60], [letter, 75]).
Response surface is:
- Non-linear
- Non-convex
- Multi-modal
- Measurements subject to noise
Bayesian Optimization for Configuration Optimization (BO4CO)
Code: https://github.com/dice-project/DICE-Configuration-BO4CO
Data: https://zenodo.org/record/56238
Paper: P. Jamshidi, G. Casale, "An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems", MASCOTS 2016. https://arxiv.org/abs/1606.06543
GP for modeling black-box response function

(Figure: a 1-dimensional GP fit showing the true function, GP mean, GP variance, the observations, the selected point, and the true minimum.)
Table 1: Pearson (Spearman) correlation coefficients; the upper triangle shows the coefficients, the lower triangle the p-values.

     v1                    v2              v3                  v4
v1   1                     0.41 (0.49)     -0.46 (-0.51)       -0.50 (-0.51)
v2   7.36E-06 (5.5E-08)    1               -0.20 (-0.2793)     -0.18 (-0.24)
v3   6.92E-07 (1.3E-08)    0.04 (0.003)    1                   0.94 (0.88)
v4   2.54E-08 (1.4E-08)    0.07 (0.01)     1.16E-52 (8.3E-36)  1
Table 2: Signal-to-noise ratios for WordCount.

Top.     µ        σ      µ_ci               σ_ci            µ/σ
wc(v1)   516.59   7.96   [515.27, 517.90]   [7.13, 9.01]    64.88
wc(v2)   584.94   2.58   [584.51, 585.36]   [2.32, 2.92]    226.32
wc(v3)   654.89   13.56  [652.65, 657.13]   [12.15, 15.34]  48.30
wc(v4)   1125.81  16.92  [1123, 1128.6]     [15.16, 19.14]  66.56
Figure 2 (a,b,c,d) shows the response surfaces for 4 different versions of WordCount when splitters and counters are varied in [1, 6] and [1, 18]. WordCount v1, v2 (also v3, v4) are identical in terms of source code, but the environments in which they are deployed differ (we deployed several other systems that compete for capacity in the same cluster). WordCount v1, v3 (also v2, v4) are deployed in a similar environment, but they have undergone multiple software changes (we artificially injected delays into the source code of their components). A number of interesting observations can be made from the experimental results in Figure 2 and Tables 1, 2, which we describe in the following subsections.
Correlation across different versions. We have measured the correlation coefficients between the four versions of WordCount in Table 1 (the upper triangle shows the coefficients, while the lower triangle shows the p-values). The correlations between the response functions are significant (p-values are less than 0.05). However, the correlation differs from version to version. Also, more interestingly, different versions of the system have different optimal configurations: x*_v1 = (5, 1), x*_v2 = (6, 2), x*_v3 = (2, 13), x*_v4 = (2, 16). In DevOps, different versions of a system are delivered continuously, on a daily basis [3]. Current DevOps practices do not systematically use the knowledge from previous versions for performance tuning of the current version under test, despite such significant correlations [3]. There are two reasons behind this: (i) the techniques used for performance tuning cannot exploit historical data belonging to a different version, and (ii) they assume that different versions have the same optimum configuration; based on our experimental observations above, however, this is not true. As a result, the existing practice treats the experimental data as one-time-use.
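Correlation coefficients like those in Table 1 are straightforward to compute. The following self-contained sketch implements Pearson's r and Spearman's ρ (Pearson on average ranks) in pure Python; the two latency vectors are synthetic illustrations, not the WordCount measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Two synthetic latency vectors standing in for versions v3 and v4.
v3 = [100, 120, 90, 150, 130, 110]
v4 = [210, 240, 200, 300, 255, 225]
print(pearson(v3, v4), spearman(v3, v4))
```

A strong rank correlation between two versions, as between v3 and v4 in Table 1, is exactly the signal that makes transferring observations across versions attractive.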
Nonlinear interactions. The response functions f(·) in Figure 2 are strongly non-linear, non-convex, and multi-modal. The performance difference between the best and worst settings is substantial, e.g., 65% in v4, providing a case for optimal tuning. Moreover, the non-linear relations among the parameters imply that the optimal number of counters depends on the number of splitters, and vice versa. In other words, if one tries to minimize latency by acting on just one of these parameters, this may not lead to a global optimum [21].

Measurement uncertainty. We have taken samples of the latency for the same configuration (splitters = counters = 1) of the 4 versions of WordCount. The experiments were conducted on Amazon EC2 (m3.large: 2 CPUs, 7.5 GB). After filtering out the initial burn-in, we computed the average and variance of the measurements. The results in Table 2 illustrate that the variability of measurements across different versions can be of different scales. In traditional techniques, such as design of experiments, the variability is typically disregarded by repeating experiments and taking the mean. However, we here pursue an alternative approach that relies on MTGP models, which are able to explicitly take this variability into account.
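The burn-in filtering and the µ, σ, µ/σ statistics reported in Table 2 can be sketched as follows. The trace here is synthetic (its steady-state level only loosely mimics wc(v1)); the burn-in length is an illustrative choice, not the paper's.

```python
import random
import statistics

random.seed(1)

# Synthetic latency trace: a decaying transient (burn-in) followed by
# steady-state samples around 516.6 µs with stddev 8.0.
trace = [800 - 30 * i for i in range(10)] + \
        [random.gauss(516.6, 8.0) for _ in range(200)]

burn_in = 10                    # discard the initial transient
steady = trace[burn_in:]
mu = statistics.mean(steady)
sigma = statistics.stdev(steady)
snr = mu / sigma                # the µ/σ column of Table 2
print(f"mu={mu:.2f} sigma={sigma:.2f} mu/sigma={snr:.2f}")
```

Averaging after burn-in removal gives the µ column; keeping σ alongside it, rather than discarding it, is what the MTGP-based approach exploits.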
3. TL4CO: TRANSFER LEARNING FOR CONFIGURATION OPTIMIZATION
3.1 Single-task GP Bayesian optimization

Bayesian optimization [34] is a sequential design strategy that allows us to perform global optimization of black-box functions. Figure 3 illustrates the GP-based Bayesian optimization approach using a 1-dimensional response. The curve in blue is the unknown true response, the mean is shown in yellow, and the 95% confidence interval at each point is the shaded red area. The stars indicate experimental measurements (also called observations). Some points x ∈ X have a large confidence interval due to a lack of observations in their neighborhood, while others have a narrow one. The main motivation behind the choice of Bayesian optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all the computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for prediction of the posterior distribution of response functions. A GP model is composed of its prior mean (µ(·) : X → ℝ) and a covariance function (k(·,·) : X × X → ℝ) [41]:

y = f(x) ∼ GP(µ(x), k(x, x′)),    (2)

where the covariance k(x, x′) defines the distance between x and x′. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable, conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

µ_t(x) = µ(x) + k(x)ᵀ(K + σ²I)⁻¹(y − µ)    (3)

σ²_t(x) = k(x, x) + σ²I − k(x)ᵀ(K + σ²I)⁻¹k(x)    (4)

where y := y_{1:t}, k(x)ᵀ = [k(x, x₁) k(x, x₂) . . . k(x, x_t)], µ := µ(x_{1:t}), K := k(x_i, x_j), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
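The posterior equations (3)–(4) reduce to a few lines of linear algebra. The sketch below assumes a squared-exponential kernel with unit amplitude and a constant prior mean; both choices are illustrative (the kernel is not fixed at this point in the text), as are all variable names.

```python
import numpy as np

def sq_exp(a, b, ell=1.0):
    """Squared-exponential covariance k(x, x') (illustrative kernel choice)."""
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / ell ** 2)

def gp_posterior(X, y, Xstar, sigma2=0.01, mu0=0.0):
    """Posterior mean/variance per Eqs. (3)-(4), constant prior mean mu0."""
    K = sq_exp(X, X)
    Ks = sq_exp(Xstar, X)                    # each row is k(x)^T
    A = np.linalg.solve(K + sigma2 * np.eye(len(X)), np.eye(len(X)))
    mean = mu0 + Ks @ A @ (y - mu0)          # Eq. (3)
    # Eq. (4); k(x, x) = 1 for the unit-amplitude kernel above.
    var = 1.0 + sigma2 - np.sum((Ks @ A) * Ks, axis=1)
    return mean, var

# Three observations of a 1-D response, two prediction points.
X = np.array([[1.0], [2.0], [4.0]])
y = np.array([3.0, 2.0, 5.0])
Xs = np.array([[1.0], [3.0]])
m, v = gp_posterior(X, y, Xs)
print(m, v)
```

At the already-measured point x = 1 the posterior mean tracks the observation and the variance collapses toward the noise level, while at x = 3, away from the data, the variance stays large; this mean/variance trade-off is exactly what the acquisition step exploits.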
3.2 TL4CO: an extension to multi-tasks

TL4CO¹ uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).
In the multi-task framework, we use historical data to fit a better GP that provides more accurate predictions. Before that, we measure a few sample points based on a Latin Hypercube Design (lhd) D = {x₁, . . . , xₙ} (cf. step 1 in Algorithm 1). We have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [26, 17] (called brute-force) does not guarantee this [29]; (ii) the lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. We define a new notation for

¹ Code+Data will be released (due in July 2016), as this is funded under the EU project DICE: https://github.com/dice-project/DICE-Configuration-BO4CO
Table 1: Pearson (Spearman) correlation coe�cients.v1 v2 v3 v4
v1 1 0.41 (0.49) -0.46 (-0.51) -0.50 (-0.51)v2 7.36E-06 (5.5E-08) 1 -0.20 (-0.2793) -0.18 (-0.24)v3 6.92E-07 (1.3E-08) 0.04 (0.003) 1 0.94 (0.88)v4 2.54E-08 (1.4E-08) 0.07 (0.01) 1.16E-52 (8.3E-36) 1
Table 2: Signal to noise ratios for WordCount.
Top. µ � µci �ciµ�
wc(v1) 516.59 7.96 [515.27, 517.90] [7.13, 9.01] 64.88wc(v2) 584.94 2.58 [584.51, 585.36] [2.32, 2.92] 226.32wc(v3) 654.89 13.56 [652.65, 657.13] [12.15, 15.34] 48.30wc(v4) 1125.81 16.92 [1123, 1128.6] [15.16, 19.14] 66.56
Figure 2 (a,b,c,d) shows the response surfaces for 4 di↵er-ent versions of the WordCount when splitters and countersare varied in [1, 6] and [1, 18]. WordCount v1, v2 (also v3, v4)are identical in terms of source code, but the environmentin which they are deployed on is di↵erent (we have deployedseveral other systems that compete for capacity in the samecluster). WordCount v1, v3 (also v2, v4) are deployed on a sim-ilar environment, but they have undergone multiple softwarechanges (we artificially injected delays in the source codeof its components). A number of interesting observationscan be made from the experimental results in Figure 2 andTables 1, 2 that we describe in the following subsections.
Correlation across di↵erent versions. We have measuredthe correlation coe�cients between the four versions of Word-Count in Table 1 (upper triangle shows the coe�cients whilelower triangle shows the p-values). The correlations be-tween the response functions are significant (p-values areless than 0.05). However, the correlation di↵ers betweenversions to versions. Also, more interestingly, di↵erent ver-sions of the system have di↵erent optimal configurations:x⇤v1 = (5, 1), x⇤
v2 = (6, 2), x⇤v3 = (2, 13), x⇤
v4 = (2, 16). InDevOps, di↵erent versions of a system will be delivered con-tinuously on daily basis [3]. Current DevOps practices donot systematically use the knowledge from previous versionsfor performance tuning of the current version under test de-spite such significant correlations [3]. There are two reasonsbehind this: (i) the techniques that are used for performancetuning cannot exploit the historical data belong to a di↵erentversion. (ii) they assume di↵erent versions have the sameoptimum configuration. However, based on our experimentalobservations above, this is not true. As a result, the existingpractice treat the experimental data as one-time-use.
Nonlinear interactions. The response functions f(·) in Figure 2 are strongly non-linear, non-convex, and multi-modal. The performance difference between the best and worst settings is substantial, e.g., 65% in v4, providing a case for optimal tuning. Moreover, the non-linear relations among the parameters imply that the optimal number of counters depends on the number of splitters, and vice versa. In other words, minimizing latency by acting on just one of these parameters may not lead to a global optimum [21].

Measurement uncertainty. We have taken samples of the latency for the same configuration (splitters = counters = 1) of the 4 versions of WordCount. The experiments were conducted on Amazon EC2 (m3.large: 2 CPUs, 7.5 GB RAM). After filtering the initial burn-in, we computed the average and variance of the measurements. The results in Table 2 illustrate that the variability of measurements across different versions can be of different scales. In traditional techniques, such as design of experiments, the variability is typically disregarded by repeating experiments and taking the mean. Here, however, we pursue an alternative approach that relies on MTGP models, which explicitly take variability into account.
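The variability across versions can be summarized by the signal-to-noise ratio µ/σ. A quick check with the means and standard deviations reported in Table 2 reproduces the ordering of its last column (the table's µ/σ values were presumably computed from unrounded measurements, so the ratios match only approximately):

```python
# Mean latency and standard deviation per WordCount version, from Table 2;
# snr approximates the table's mu/sigma column.
mu = {"v1": 516.59, "v2": 584.94, "v3": 654.89, "v4": 1125.81}
sigma = {"v1": 7.96, "v2": 2.58, "v3": 13.56, "v4": 16.92}
snr = {v: mu[v] / sigma[v] for v in mu}
```

Note that v2, the version with the highest µ/σ, is an order of magnitude less noisy than v3, which is exactly the cross-version scale difference the MTGP models must absorb.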
3. TL4CO: TRANSFER LEARNING FOR CONFIGURATION OPTIMIZATION
3.1 Single-task GP Bayesian optimization

Bayesian optimization [34] is a sequential design strategy that allows us to perform global optimization of black-box functions. Figure 3 illustrates the GP-based Bayesian optimization approach using a 1-dimensional response. The curve in blue is the unknown true response, the mean estimate is shown in yellow, and the 95% confidence interval at each point in the shaded red area. The stars indicate experimental measurements (also called observations). Some points x ∈ X have a large confidence interval due to a lack of observations in their neighborhood, while others have a narrow one. The main motivation behind the choice of Bayesian optimization here is that it offers a framework in which reasoning can be based not only on mean estimates but also on the variance, providing more informative decision making. The other reason is that all computations in this framework are based on tractable linear algebra.

In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) to predict the posterior distribution of response functions. A GP model is composed of a prior mean (µ(·) : X → R) and a covariance function (k(·, ·) : X × X → R) [41]:

y = f(x) ~ GP(µ(x), k(x, x')),    (2)

where the covariance k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t experimental data points (observations). In this framework, we treat f(x) as a random variable conditioned on the observations S_{1:t}, which is normally distributed with the following posterior mean and variance functions [41]:

µ_t(x) = µ(x) + k(x)^T (K + σ²I)^{-1} (y − µ)    (3)
σ_t²(x) = k(x, x) + σ² − k(x)^T (K + σ²I)^{-1} k(x)    (4)

where y := y_{1:t}, k(x)^T = [k(x, x_1) k(x, x_2) . . . k(x, x_t)], µ := µ(x_{1:t}), K := k(x_i, x_j), and I is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.
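Eqs. (3)–(4) translate into a few lines of NumPy. The following is a minimal sketch (not BO4CO's actual Matlab implementation) assuming a unit-variance RBF kernel and a constant prior mean:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Unit-variance RBF kernel matrix; a stand-in for any covariance k."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=0.1, mu0=0.0):
    """Posterior mean (Eq. 3) and variance (Eq. 4) at the query points."""
    K = rbf(x_train, x_train) + noise**2 * np.eye(len(x_train))  # K + sigma^2 I
    kq = rbf(x_train, x_query)                                   # k(x) per query
    mean = mu0 + kq.T @ np.linalg.solve(K, y_train - mu0)        # Eq. (3)
    var = 1.0 + noise**2 - np.sum(kq * np.linalg.solve(K, kq), axis=0)  # Eq. (4); k(x,x)=1
    return mean, var
```

Near observed configurations the variance collapses toward the noise floor, while far from the data the mean reverts to the prior µ0 and the variance to the prior variance, which is the behavior Figure 3 depicts.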
3.2 TL4CO: an extension to multi-tasks

TL4CO (code and data are due to be released in July 2016, as this work is funded under the EU project DICE: https://github.com/dice-project/DICE-Configuration-BO4CO) uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that reuses what was learned from other system versions. At a high level, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to the existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration based on the model (details in Section 3.5).

In the multi-task framework, we use historical data to fit a better GP that provides more accurate predictions. Before that, we measure a few sample points based on a Latin Hypercube Design (lhd), D = {x1, . . . , xn} (cf. step 1 in Algorithm 1). We have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [26, 17] (called brute-force) does not guarantee this [29]; (ii) lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. We define a new notation for
Motivations:
1- mean estimates + variance
2- all computations are linear algebra
[1D GP plot: true function, GP surrogate mean estimate, and observations at x1 . . . x4]
Fig. 5: An example of a 1D GP model: GPs provide mean estimates as well as the uncertainty in the estimates, i.e., the variance.
[BO4CO architecture diagram: a Configuration Optimisation Tool (data preparation, GP model, performance repository) exchanges configuration parameter values with an Experimental Suite (testbed, deployment service, monitoring, workload generator, tester, experiment time and polling interval) through a Kafka-based data broker; the technology interface targets Storm, Cassandra, and Spark as the system under test]
Fig. 6: BO4CO architecture: (i) optimization and (ii) experimental suite are integrated via (iii) a data broker. The integrated solution is available: https://github.com/dice-project/DICE-Configuration-BO4CO.
B. BO4CO algorithm

BO4CO's high-level architecture is shown in Figure 6 and the procedure that drives the optimization is described in Algorithm 1. We start by bootstrapping the optimization following a Latin Hypercube Design (lhd) to produce an initial design D = {x1, . . . , xn} (cf. step 1 in Algorithm 1). Although other design approaches (e.g., random) could be used, we have chosen lhd because: (i) it ensures that the configuration samples in D are representative of the configuration space X, whereas traditional random sampling [17], [9] (called brute-force) does not guarantee this [20]; (ii) lhd samples can be taken one at a time, making it efficient in high-dimensional spaces. After obtaining the measurements for the initial design, BO4CO fits a GP model to the design points D to form our belief about the underlying response function (cf. step 3 in Algorithm 1). The while loop in Algorithm 1 iteratively updates the belief until the budget runs out: as we accumulate the data S_{1:t} = {(x_i, y_i)}_{i=1}^t, where y_i = f(x_i) + ε_i with ε ~ N(0, σ²), a prior distribution Pr(f) and the likelihood function Pr(S_{1:t}|f) form the posterior distribution: Pr(f|S_{1:t}) ∝ Pr(S_{1:t}|f) Pr(f).
A GP is a distribution over functions [31], specified by its mean (see Section III-E2) and covariance (see Section III-E1):

y = f(x) ~ GP(µ(x), k(x, x')),    (3)
Algorithm 1: BO4CO
Input: Configuration space X, maximum budget N_max, response function f, kernel function K_θ, hyper-parameters θ, design sample size n, learning cycle N_l
Output: Optimal configuration x* and learned model M
1: choose an initial sparse design (lhd) to find initial design samples D = {x1, . . . , xn}
2: obtain performance measurements of the initial design, y_i ← f(x_i) + ε_i, ∀x_i ∈ D
3: S_{1:n} ← {(x_i, y_i)}_{i=1}^n; t ← n + 1
4: M(x|S_{1:n}, θ) ← fit a GP model to the design    ▷ Eq. (3)
5: while t ≤ N_max do
6:   if (t mod N_l = 0): θ ← learn the kernel hyper-parameters by maximizing the likelihood
7:   find the next configuration x_t by optimizing the selection criterion over the estimated response surface given the data: x_t ← argmax_x u(x|M, S_{1:t−1})    ▷ Eq. (9)
8:   obtain performance for the new configuration x_t: y_t ← f(x_t) + ε_t
9:   augment the data: S_{1:t} = {S_{1:t−1}, (x_t, y_t)}
10:  M(x|S_{1:t}, θ) ← re-fit a new GP model    ▷ Eq. (7)
11:  t ← t + 1
12: end while
13: (x*, y*) = min S_{1:N_max}
14: return M(x)
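Putting the pieces together, the loop of Algorithm 1 can be sketched on a 1-D toy response. The response function, kernel length-scale, budget, and the fixed κ = 2 LCB weight below are illustrative assumptions, not values from BO4CO:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy noisy response function standing in for a measured latency."""
    return (x - 0.3) ** 2 + 0.05 * rng.normal()

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def posterior(xs, ys, grid, noise=0.05):
    """GP posterior mean and standard deviation on a grid (Eqs. 7-8)."""
    K = rbf(xs, xs) + noise**2 * np.eye(len(xs))
    kq = rbf(xs, grid)
    mean = kq.T @ np.linalg.solve(K, ys)
    var = 1.0 + noise**2 - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

grid = np.linspace(0.0, 1.0, 101)           # finite configuration space X
xs = list(np.linspace(0.05, 0.95, 4))       # step 1: initial design
ys = [f(x) for x in xs]                     # step 2: measure it
for t in range(20):                         # steps 5-12: budget N_max
    mean, sd = posterior(np.array(xs), np.array(ys), grid)
    x_next = grid[np.argmin(mean - 2.0 * sd)]  # step 7: LCB with kappa = 2
    xs.append(x_next)                          # steps 8-9: run and record
    ys.append(f(x_next))
x_star = xs[int(np.argmin(ys))]             # step 13: best configuration found
```

The loop spends most of its budget near the true minimum at x = 0.3, which is the exploitation/exploration trade-off the LCB criterion of Eq. (10) is meant to realize.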
where k(x, x') defines the distance between x and x'. Let us assume S_{1:t} = {(x_{1:t}, y_{1:t}) | y_i := f(x_i)} is the collection of t observations. The function values are drawn from a multivariate Gaussian distribution N(µ, K), where µ := µ(x_{1:t}) and

K := [ k(x_1, x_1) . . . k(x_1, x_t)
       . . .
       k(x_t, x_1) . . . k(x_t, x_t) ]    (4)
In the while loop in BO4CO, given the observations we have accumulated so far, we intend to fit a new GP model:

[f_{1:t}, f_{t+1}]^T ~ N(µ, [ K + σ²I   k
                              k^T       k(x_{t+1}, x_{t+1}) ]),    (5)

where k(x)^T = [k(x, x_1) k(x, x_2) . . . k(x, x_t)] and I is the identity matrix. Given Eq. (5), the new GP model can be drawn from this new Gaussian distribution:

Pr(f_{t+1} | S_{1:t}, x_{t+1}) = N(µ_t(x_{t+1}), σ_t²(x_{t+1})),    (6)

where

µ_t(x) = µ(x) + k(x)^T (K + σ²I)^{-1} (y − µ)    (7)
σ_t²(x) = k(x, x) + σ² − k(x)^T (K + σ²I)^{-1} k(x)    (8)

These posterior functions are used to select the next point x_{t+1}, as detailed in Section III-C.
C. Configuration selection criteria

The selection criterion is defined as a function u : X → R that selects the x_{t+1} ∈ X at which f(·) should be evaluated next (step 7):

x_{t+1} = argmax_{x∈X} u(x|M, S_{1:t})    (9)
[Figure 7 plot: κ vs. iteration (0–10000) for ε = 1, 0.1, 0.01]

Fig. 7: Change of the κ value over time: it begins with a small value to exploit the mean estimates and increases over time in order to explore.
BO4CO uses the Lower Confidence Bound (LCB) [25]. LCB minimizes the distance to the optimum by balancing exploitation of the mean and exploration via the variance:

u_{LCB}(x|M, S_{1:n}) = argmin_{x∈X} µ_t(x) − κσ_t(x),    (10)

where κ can be set according to the objectives. For instance, if we need to find a near-optimal configuration quickly, we set κ to a low value to make the most of the initial design knowledge. However, if we are looking for a globally optimal configuration, we can set κ to a high value in order to skip local minima. Furthermore, κ can be adapted over time to benefit from both [13]. For instance, κ can start with a reasonably small value to exploit the initial design and increase over time to explore more (cf. Figure 7).
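Such a schedule can be sketched as follows. The logarithmic, GP-UCB-style form below is one plausible choice consistent with Figure 7 (κ grows with the iteration t and shrinks with ε), not necessarily the exact schedule of [13]:

```python
import math

def kappa(t, epsilon=0.1, dim=1):
    """Exploration weight for LCB: small early (exploit the mean estimate),
    growing logarithmically with t (explore via the variance). The exact
    form here is an assumed GP-UCB-style schedule."""
    return math.sqrt(2.0 * math.log(dim * t**2 * math.pi**2 / (6.0 * epsilon)))

schedule = [round(kappa(t), 2) for t in (1, 100, 10000)]
```

Smaller ε yields a uniformly larger κ, matching the ordering of the three curves in Figure 7.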
D. Illustration

The steps in Algorithm 1 are illustrated in Figure 8. First, an initial design based on lhd is produced (Figure 8(a)). Second, a GP model is fit to the initial design (Figure 8(b)). Then, the model is used to calculate the selection criterion (Figure 8(c)). Finally, the configuration that maximizes the selection criterion is used to run the next experiment and provide data for refitting a more accurate model (Figure 8(d)).
E. Model fitting in BO4CO

In this section, we provide some practical considerations that make GPs applicable to configuration optimization.

1) Kernel function: In BO4CO, as shown in Algorithm 1, the covariance function k : X × X → R dictates the structure of the response function we fit to the observed data. For integer variables (cf. Section II-A), we implemented the Matérn kernel [31]. The main reason behind this choice is that different levels of smoothness can be observed along each dimension of the configuration response functions (cf. Figure 2). Matérn kernels incorporate a smoothness parameter ν > 0 that permits greater flexibility in modeling such functions [31]. The following is a variation of the Matérn kernel for ν = 1/2:

k_{ν=1/2}(x_i, x_j) = θ_0² exp(−r),    (11)

where r²(x_i, x_j) = (x_i − x_j)^T Λ (x_i − x_j) for some positive semidefinite matrix Λ. For categorical variables, we implemented the following [12]:

k_θ(x_i, x_j) = exp(Σ_{ℓ=1}^d (−θ_ℓ δ(x_i ≠ x_j))),    (12)

where d is the number of dimensions (i.e., the number of configuration parameters), θ_ℓ adjusts the scale along the ℓ-th dimension, and δ is a function that gives the distance between two categorical variables using the Kronecker delta [12], [25]. TL4CO uses different scales {θ_ℓ, ℓ = 1 . . . d} on different dimensions, as suggested in [31], [25]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 6), if the ℓ-th dimension turns out to be irrelevant, then θ_ℓ will be small and the dimension will be discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.
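Eqs. (11)–(12) translate directly into code. The sketch below fixes θ_0 = 1 and uses a diagonal Λ holding the ARD scales, both simplifying assumptions:

```python
import numpy as np

def matern12(xi, xj, scales):
    """Matern nu = 1/2 kernel (Eq. 11) with theta0 = 1 and a diagonal
    Lambda holding one ARD scale per dimension."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    r = np.sqrt(np.sum(np.asarray(scales) * d * d))   # r^2 = d^T Lambda d
    return np.exp(-r)

def categorical(xi, xj, thetas):
    """Categorical kernel (Eq. 12): each mismatching dimension (Kronecker
    delta) is penalized by its own ARD weight theta_l."""
    mismatch = np.asarray(xi) != np.asarray(xj)
    return float(np.exp(-np.sum(np.asarray(thetas) * mismatch)))
```

A dimension whose learned θ_ℓ is near zero contributes nothing to the exponent, so mismatches there do not reduce the covariance: the dimension is effectively discarded, which is the ARD pruning described above.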
2) Prior mean function: While the kernel controls the structure of the estimated function, the prior mean µ(x) : X → R provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations [25]. However, the prior mean function is also a way of incorporating expert knowledge when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table I) we observed that, for Big Data systems, there is typically a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function µ(x) := ax + b allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn a slope for each dimension and an offset (denoted by µ_ℓ = (a, b)).

3) Learning parameters: marginal likelihood: This section describes step 6 in Algorithm 1. Due to the heavy computation involved, the learning is performed only every N_l iterations. To learn the hyper-parameters of the kernel and the prior mean function (cf. Sections III-E1 and III-E2), we maximize the marginal likelihood [25] of the observations S_{1:t}. To do so, we train the GP model (7) with S_{1:t}. We optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [23], using the off-the-shelf gpml library presented in [23]. Using the kernel defined in (12), we learn θ := (θ_{0:d}, µ_{0:d}, σ²), which comprises the hyper-parameters of the kernel and mean functions. The learning is performed iteratively, resulting in a sequence of θ_i for i = 1 . . . ⌊N_max / N_ℓ⌋.

4) Observation noise: The primary way of determining the noise variance σ in BO4CO is to use historical data: in Section II-B4, we have shown that such noise can be measured with high confidence, and the signal-to-noise ratios show that the noise is stationary. The alternative is to learn the noise variance sequentially as we collect new data, treating it just like any other hyper-parameter; see Section III-E3.
IV. EXPERIMENTAL RESULTS
A. Implementation
From an implementation perspective, BO4CO consists of three major components: (i) an optimization component (cf. left part of Figure 6) and (ii) an experimental suite (cf. right part of Figure 6), integrated via (iii) a data broker. The optimization component implements the model (re-)fitting (7) and criteria optimization (9) steps in Algorithm 1 and is developed in Matlab 2015b. The experimental suite implements the facilities for automated deployment of topologies, performance measurement, and data preparation, and is developed in Java.
Acquisition function:

Kernel function:

[Figure 8 panels over the configuration domain and response value: (a) true response function with the initial design points; (b) GP fit to the initial design; (c) criteria evaluation and the newly selected point; (d) new GP fit after the new observation]
Correlations across different versions
[Slide figure: (a)–(d) response surfaces of WordCount v1–v4, latency (ms) vs. number of splitters and number of counters; (e) Pearson correlation coefficients; (f) Spearman correlation coefficients; (g) measurement noise across WordCount versions, latency (ms) under a hardware change and a software change]
Table 1: Pearson correlation coefficients (upper triangle) and p-values (lower triangle) between WordCount versions.

      v1        v2     v3        v4
v1    1         0.41   -0.46     -0.50
v2    7.36E-06  1      -0.20     -0.18
v3    6.92E-07  0.04   1         0.94
v4    2.54E-08  0.07   1.16E-52  1

Table 2: Spearman correlation coefficients (upper triangle) and p-values (lower triangle) between WordCount versions.

      v1        v2     v3        v4
v1    1         0.49   -0.51     -0.51
v2    5.50E-08  1      -0.2793   -0.24
v3    1.30E-08  0.003  1         0.88
v4    1.40E-08  0.01   8.30E-36  1

Table 3: Measurement noise across WordCount versions.

ver.  µ        σ      µ/σ
v1    516.59   7.96   64.88
v2    584.94   2.58   226.32
v3    654.89   13.56  48.30
v4    1125.81  16.92  66.56
- Different correlations
- Different optimum configurations
- Different noise levels

DevOps
- Different versions are continuously delivered (daily basis).
- Big Data systems are developed using similar frameworks (Apache Storm, Spark, Hadoop, Kafka, etc.).
- Different versions share similar business logic.

Solution: Transfer Learning for Configuration Optimization

[Slide diagram: two configuration-optimization loops (versions j = M and j = N), each running Initial Design → Model Fit → Next Experiment → Model Update until the budget is finished; performance measurements are stored in and filtered from a shared performance repository, which also supplies the data selected for training the GP model hyper-parameters]

Transfer Learning for Configuration Optimization (TL4CO). Code: https://github.com/dice-project/DICE-Configuration-TL4CO
[Figure 2 surface plots: latency (ms) vs. number of splitters (1–6) and number of counters (1–18) for panels (a) WordCount v1, (b) v2, (c) v3, (d) v4]
Figure 2: The response functions corresponding to the four versions of WordCount.
[Plot legend: true function, GP mean, GP variance, observation, selected point, true minimum; axes: x, f(x)]
Figure 3: An example of 1-dimensional GP model.
[Workflow diagram: configuration optimization for versions j = M and j = N (Initial Design → Model Fit → Next Experiment → Model Update → Budget Finished), connected via a performance repository that stores, filters, and selects data for training the GP model hyper-parameters]
Figure 4: Transfer learning for configuration optimization.
observations: S¹_{1:t} = {(x_i^j, y_i^j) | j = 1, . . . , T, i = 1, . . . , t_j} (cf. S_{1:t}), where the x_i^j are the configurations of the different versions for which we observed the responses y_i^j. The version of the system (task) is denoted by j and contains t_j observations. Without loss of generality, we always assume j = 1 is the current version under test. To associate the observations (x_i^j, y_i^j) with task j, a label l_j = j is used. To transfer the learning from previous tasks to the current one, the MTGP framework allows us to use the product of two independent kernels [5]:

k_{TL4CO}(l, l', x, x') = k_t(l, l') × k_{xx}(x, x'),    (5)

where the kernels k_t and k_{xx} represent the correlation between different versions and the covariance function for a particular version, respectively. Note that k_t depends only on the labels l pointing to the versions for which we have some observations, and k_{xx} depends only on the configuration x. This
Algorithm 1: TL4CO
Input: Configuration space X, number of historical configuration optimization datasets T, maximum budget N_max, response function f, kernel function K, hyper-parameters θ_j, the current dataset S_{j=1}, datasets belonging to other versions S_{j=2:T}, diagonal noise matrix Σ, design sample size n, learning cycle N_l
Output: Optimal configuration x* and learned model M
1: D = {x_1, . . . , x_n} ← create an initial sparse design (lhd)
2: obtain the performance ∀x_i ∈ D: y_i ← f(x_i) + ε_i
3: S_{j=2:T} ← select m points from other versions (j = 2 : T) that reduce entropy    ▷ Eq. (8)
4: S¹_{1:n} ← S_{j=2:T} ∪ {(x_i, y_i)}_{i=1}^n; t ← n + 1
5: M(x|S¹_{1:n}, θ_j) ← fit a multi-task GP to D    ▷ Eq. (6)
6: while t ≤ N_max do
7:   if (t mod N_l = 0) then θ ← learn the kernel hyper-parameters by maximizing the likelihood
8:   determine the next configuration x_t by optimizing LCB over M: x_t ← argmax_x u(x|M, S¹_{1:t−1})    ▷ Eq. (11)
9:   obtain the performance of x_t: y_t ← f(x_t) + ε_t
10:  augment the data: S¹_{1:t} = S¹_{1:t−1} ∪ {(x_t, y_t)}
11:  M(x|S¹_{1:t}, θ) ← re-fit a new GP model    ▷ Eq. (6)
12:  t ← t + 1
13: end while
14: (x*, y*) = min[S¹_{1:N_max} − S_{j=2:T}] ← locate the optimum configuration, discarding the data of other system versions
15: M(x) ← store the learned model
method has several benefits over ordinary GP models: (i) we can plug in observations from different versions; (ii) the correlation between system versions is learned automatically; (iii) the hyper-parameters are learned quickly by exploiting the similarity between measurements. Similarly to (3), (4), the mean and variance of the MTGP are [34]:

µ_t(x) = µ(x) + k^T (K(X, X) + Σ)^{-1} (y − µ)    (6)
σ_t²(x) = k(x, x) + σ² − k^T (K(X, X) + Σ)^{-1} k,    (7)

where y := [f_1, . . . , f_T]^T is the measured performance for all tasks, X = [X_1, . . . , X_T] are the corresponding configurations, Σ = diag[σ_1², . . . , σ_T²] is a diagonal noise matrix in which each noise term is repeated as many times as the number of historical observations t_1, . . . , t_T, and k := k(X, x). This is a conditional estimate at a desired configuration given the historical datasets for previous system versions and their respective GP hyper-parameters.

An illustration of a multi-task GP versus a single-task GP and its effect on providing a more accurate model is given in
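The payoff of Eqs. (5)–(7) can be seen in a small numerical sketch: the product kernel of Eq. (5) builds the joint covariance over observations from two versions, and the posterior variance of Eq. (7) at a target-version query shrinks once correlated source-version data are added. The task-similarity matrix B, the RBF form of k_xx, and the sample points below are all illustrative assumptions:

```python
import numpy as np

def cov(points, B, ell=1.0):
    """Joint covariance K(X, X) over stacked (task, x) observations using
    the product kernel of Eq. (5): k_t(l, l') x k_xx(x, x')."""
    n = len(points)
    K = np.empty((n, n))
    for a, (la, xa) in enumerate(points):
        for b, (lb, xb) in enumerate(points):
            K[a, b] = B[la, lb] * np.exp(-0.5 * (xa - xb) ** 2 / ell**2)
    return K

def mtgp_var(points, noise_vars, query, B):
    """Posterior variance at a (task, x) query per Eq. (7); the measurement
    noise term of Eq. (7) is omitted to expose the latent uncertainty."""
    K = cov(points, B) + np.diag(noise_vars)     # K(X, X) + Sigma
    k = cov(points + [query], B)[:-1, -1]        # k := k(X, x)
    return 1.0 - k @ np.linalg.solve(K, k)       # k(x, x) = 1 here

B = np.array([[1.0, 0.9], [0.9, 1.0]])           # assumed version similarity k_t
target = [(0, 0.0), (0, 2.0)]                    # sparse data on the current version
source = [(1, 0.5), (1, 1.0), (1, 1.5)]          # dense data on a previous version
v_single = mtgp_var(target, [0.01] * 2, (0, 1.0), B)
v_multi = mtgp_var(target + source, [0.01] * 5, (0, 1.0), B)
```

Because the source observations are highly correlated with the target task (k_t = 0.9) and lie near the query, conditioning on them strictly tightens the posterior, which is why MTGPs converge to accurate predictions with fewer measurements on the current version.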
4
Figure 2: The response functions corresponding to the four versions of WordCount. [Four surface plots of latency (ms) against the number of splitters and the number of counters: (a) WordCount v1, (b) WordCount v2, (c) WordCount v3, (d) WordCount v4.]
Figure 3: An example of a 1-dimensional GP model. [Plot of f(x) over X showing the true function, GP mean, GP variance, observations, the selected point, and the true minimum.]
Figure 4: Transfer learning for configuration optimization. [Two configuration-optimization loops, for versions j = M and j = N, each cycling through initial design, model fit, next experiment, and model update until the budget is finished; performance measurements and GP model hyper-parameters are stored in, filtered, and selected from a shared performance repository.]
observations, $S^1_{1:t} = \{(\mathbf{x}^j_i, y^j_i) \mid j = 1,\dots,T,\ i = 1,\dots,t_j\}$ (cf. $S_{1:t}$), where $\mathbf{x}^j_i$ are the configurations of the different versions for which we observed the response $y^j_i$. The version of the system (task) is indexed by $j$ and contains $t_j$ observations. Without loss of generality, we always assume $j = 1$ is the current version under test. To associate the observations $(\mathbf{x}^j_i, y^j_i)$ with task $j$, a label $l_j = j$ is used. To transfer the learning from previous tasks to the current one, the MTGP framework allows us to use the product of two independent kernels [5]:

$$k_{TL4CO}(l, l', \mathbf{x}, \mathbf{x}') = k_t(l, l') \times k_{xx}(\mathbf{x}, \mathbf{x}'), \quad (5)$$

where the kernels $k_t$ and $k_{xx}$ represent, respectively, the correlation between different versions and the covariance function for a particular version. Note that $k_t$ depends only on the labels $l$ pointing to the versions for which we have some observations, and $k_{xx}$ depends only on the configuration $\mathbf{x}$.
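A sketch of the product kernel (5). Both component kernels here are illustrative stand-ins (a constant inter-version correlation ρ for k_t, a squared-exponential for k_xx), not the ones TL4CO actually learns:

```python
import numpy as np

def task_kernel(l1, l2, rho=0.8):
    """Illustrative k_t: correlation 1 within a version, rho across versions."""
    return 1.0 if l1 == l2 else rho

def config_kernel(x1, x2, length=1.0):
    """Illustrative k_xx: squared-exponential over configurations."""
    d2 = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    return float(np.exp(-0.5 * d2 / length ** 2))

def k_tl4co(l1, l2, x1, x2):
    """Eq. (5): k((l,x), (l',x')) = k_t(l,l') * k_xx(x,x')."""
    return task_kernel(l1, l2) * config_kernel(x1, x2)
```

The product structure means two observations are highly correlated only when both their versions are similar (k_t large) and their configurations are close (k_xx large).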
Algorithm 1: TL4CO
Input: configuration space X; number of historical configuration optimization datasets T; maximum budget N_max; response function f; kernel function K; hyper-parameters θ_j; the current dataset S^{j=1}; datasets belonging to other versions S^{j=2:T}; diagonal noise matrix Σ; design sample size n; learning cycle N_l
Output: optimal configuration x* and learned model M
1: D = {x_1, ..., x_n} — create an initial sparse design (lhd)
2: obtain the performance ∀x_i ∈ D: y_i ← f(x_i) + ε_i
3: S^{j=2:T} ← select m points from other versions (j = 2:T) that reduce entropy  ▷ Eq. (8)
4: S^1_{1:n} ← S^{j=2:T} ∪ {(x_i, y_i)}_{i=1}^{n}; t ← n + 1
5: M(x | S^1_{1:n}, θ_j) ← fit a multi-task GP to D  ▷ Eq. (6)
6: while t ≤ N_max do
7:   if (t mod N_l = 0) then learn the kernel hyper-parameters by maximizing the likelihood
8:   determine the next configuration x_t by optimizing LCB over M: x_t ← argmax_x u(x | M, S^1_{1:t−1})  ▷ Eq. (11)
9:   obtain the performance of x_t: y_t ← f(x_t) + ε_t
10:  augment the configuration data: S^1_{1:t} = S^1_{1:t−1} ∪ {(x_t, y_t)}
11:  M(x | S^1_{1:t}, θ) ← re-fit a new GP model  ▷ Eq. (6)
12:  t ← t + 1
13: end while
14: (x*, y*) = min[S^1_{1:N_max} − S^{j=2:T}] — locate the optimum configuration, discarding the data for other system versions
15: M(x) ← store the learned model
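The control flow of Algorithm 1 can be sketched as follows. This simplified version uses a single-task GP stand-in on a fixed 1-D candidate grid and omits the historical-data seeding (steps 3–4) and the periodic hyper-parameter learning (step 7), so it shows only the initial design, LCB selection, measurement, and refit loop:

```python
import numpy as np

def rbf(a, b):
    """1-D squared-exponential kernel (unit length-scale) between vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def fit_predict(X, y, grid, noise=1e-6):
    """GP posterior mean and standard deviation on a grid (stand-in for the MTGP fit)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    k = rbf(X, grid)
    mean = k.T @ np.linalg.solve(K, y)
    var = 1.0 + noise - np.sum(k * np.linalg.solve(K, k), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def tl4co_loop(f, grid, n_init=3, n_max=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.choice(grid, size=n_init, replace=False)   # initial sparse design (steps 1-2)
    y = np.array([f(x) for x in X])
    for _ in range(n_max - n_init):                    # steps 6-13
        mean, sd = fit_predict(X, y, grid)
        x_next = grid[np.argmin(mean - kappa * sd)]    # LCB selection (step 8 / Eq. (12))
        X = np.append(X, x_next)                       # step 10: augment the data
        y = np.append(y, f(x_next))                    # step 9: measure the configuration
    best = np.argmin(y)                                # step 14: report the best measured point
    return X[best], y[best]
```

Running it on a toy 1-D "response function" such as f(x) = (x − 2)² quickly concentrates samples around the minimizer.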
Multi-task GP vs single-task GP
[Figure: (a) three sample response functions (1), (2), (3) over the configuration domain, with observations; (b) GP fit for (1) ignoring the observations for (2) and (3) — not informative; (c) multi-task GP fit for (1) by transfer learning from (2) and (3) — highly informative. Shown: GP prediction mean, GP prediction variance, LCB, and the probability distribution of the minimizers.]
Exploitation vs exploration
[Figure 6: κ as a function of iteration (0–10,000) for κTL4CO and κBO4CO with ε = 1, 0.1, and 0.01.]
Figure 5: Three different 1D (one-dimensional) response functions are to be minimized. Plentiful samples are available for two of them, whereas function (1), which is under test, has only a few sparse observations. Using these few samples alone would result in poor (uninformative) predictions of the function, especially in areas with no observations. Exploiting the correlation with the other two functions enables the MTGP model to provide more accurate predictions and, as a result, to locate the optimum more quickly.
3.3 Filtering out irrelevant data
Here we describe how the points from previous system versions are selected. The number of historical observations taken into consideration matters because it affects the computational requirements: the matrix inversion in (6) incurs a cubic cost (shown in Section 4.6). TL4CO therefore selects (step 3 in Algorithm 1) the points with the lowest entropy value. Entropy reduction is computed as the log of the ratio of the posterior uncertainty given an observation $\mathbf{x}_{i,j}$ belonging to version $j$ to the uncertainty without it:

$$I_{ij} = \log\Big(\frac{v(\mathbf{x} \mid X, \mathbf{x}_{i,j})}{v(\mathbf{x} \mid X)}\Big), \quad (8)$$

where $v(\mathbf{x} \mid X)$ is the posterior uncertainty of the prediction of the query point $\mathbf{x}$ from the MTGP model, obtained in (7).
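Step 3 can be sketched as scoring each historical point by Eq. (8) at a query point and keeping the m points with the lowest I, i.e., the largest reduction in posterior uncertainty. The single-task 1-D RBF GP below is only a stand-in for the MTGP variance v(·) of (7):

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def post_var(X, x_query, noise=1e-4):
    """GP posterior variance at x_query given design points X (targets are not needed)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    k = rbf(X, np.array([x_query]))
    return (1.0 + noise - k.T @ np.linalg.solve(K, k)).item()

def entropy_scores(X_current, candidates, x_query):
    """Eq. (8): I_i = log( v(x | X u {x_i}) / v(x | X) ) for each historical candidate."""
    v0 = post_var(X_current, x_query)
    return [np.log(post_var(np.append(X_current, c), x_query) / v0) for c in candidates]

def select_informative(X_current, candidates, x_query, m):
    """Step 3 (sketch): keep the m candidates with the lowest I,
    i.e. the largest reduction of posterior uncertainty at the query point."""
    order = np.argsort(entropy_scores(X_current, candidates, x_query))
    return [float(candidates[i]) for i in order[:m]]
```

Because including an observation can only reduce the posterior variance, the ratio in (8) is at most 1 and I is at most 0; the most informative points are the most negative.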
3.4 Model fitting in TL4CO
In this section, we provide some practical considerations and the extensions we made to the MTGP framework to make it applicable to configuration optimization.
3.4.1 Kernel function
We implemented the following kernel function to support integer and categorical variables (cf. Section 2.1):

$$k_{xx}(\mathbf{x}_i, \mathbf{x}_j) = \exp\Big(\sum_{\ell=1}^{d} -\theta_\ell\, \delta(x_i \neq x_j)\Big), \quad (9)$$

where $d$ is the number of dimensions (i.e., the number of configuration parameters), $\theta_\ell$ adjusts the scale along the $\ell$-th dimension, and $\delta$ is a function that gives the distance between two categorical variables using the Kronecker delta [20, 34]. TL4CO uses different scales $\{\theta_\ell, \ell = 1 \dots d\}$ on different dimensions, as suggested in [41, 34]; this technique is called Automatic Relevance Determination (ARD). After learning the hyper-parameters (step 7 in Algorithm 1), if the $\ell$-th dimension (parameter) turns out to be irrelevant, $\theta_\ell$ will take a small value and that dimension will be automatically discarded. This is particularly helpful in high-dimensional spaces, where it is difficult to find the optimal configuration.
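Equation (9) is straightforward to implement: δ contributes θ_ℓ to the exponent whenever the two configurations differ on dimension ℓ, so a small learned θ_ℓ makes that dimension effectively irrelevant. A sketch (the θ values below are hypothetical, not learned ones):

```python
import numpy as np

def k_cat(xi, xj, theta):
    """Eq. (9): exp( sum_l -theta_l * [xi_l != xj_l] ) with per-dimension ARD scales."""
    diff = np.asarray(xi) != np.asarray(xj)   # Kronecker-delta distance per dimension
    return float(np.exp(-np.sum(np.asarray(theta) * diff)))
```

With theta = [2.0, 0.01], two configurations differing only on the second dimension remain almost perfectly correlated (exp(−0.01) ≈ 0.99), i.e., that dimension is treated as irrelevant, whereas differing on the first dimension drops the correlation to exp(−2) ≈ 0.14.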
3.4.2 Prior mean function
While the kernel controls the structure of the estimated function, the prior mean $\mu(\mathbf{x}): \mathbb{X} \to \mathbb{R}$ provides a possible offset for our estimation. By default, this function is set to a constant $\mu(\mathbf{x}) := \mu$, which is inferred from the observations [34]. However, the prior mean function is also a way of incorporating expert knowledge, when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table 3) we observed that, for Big Data systems, there is typically a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function $\mu(\mathbf{x}) := \mathbf{a}\mathbf{x} + b$ allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn a slope for each dimension and an offset (denoted by $\mu_\ell = (\mathbf{a}, b)$; see next).
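One simple way to realize the linear prior mean µ(x) = a·x + b is a least-squares fit to the observations, with the GP then modeling the residuals y − µ(X). This is only a sketch: the paper learns the slope and offset jointly with the kernel hyper-parameters via the marginal likelihood (Section 3.4.3), not by least squares.

```python
import numpy as np

def linear_mean(X, y):
    """Fit mu(x) = a.x + b by least squares; returns a prediction function."""
    A = np.column_stack([X, np.ones(len(X))])       # append a column of ones for the offset b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # coef = [a_1, ..., a_d, b]
    return lambda Xq: np.column_stack([Xq, np.ones(len(Xq))]) @ coef
```

The GP is then fit to y − mu(X), so its zero-mean assumption only has to cover the deviation from the linear trend rather than the full dynamic range of the response.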
3.4.3 Learning hyper-parameters
This section describes step 7 in Algorithm 1. Because this learning is computationally heavy, it is performed only every $N_l$ iterations. To learn the hyper-parameters of the kernel and of the prior mean function (cf. Sections 3.4.1 and 3.4.2), we maximize the marginal likelihood [34, 5] of the observations $S^1_{1:t}$. To do so, we train our GP model (6) with $S^1_{1:t}$ and optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [32]. For this purpose, we use the off-the-shelf gpml library presented in [32]. Using the multi-task kernel defined in (5), we learn $\theta := (\theta_t, \theta_{xx}, \mu_\ell)$, which comprises the hyper-parameters of $k_t$, $k_{xx}$ (cf. (5)) and the mean function $\mu(\cdot)$ (cf. Section 3.4.2). The learning is performed iteratively, resulting in a sequence of priors with parameters $\theta_i$ for $i = 1 \dots \lfloor N_{max}/N_\ell \rfloor$.
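Step 7 can be sketched as: compute the negative log marginal likelihood of the data under the kernel and pick the hyper-parameters that minimize it. The paper uses multi-started quasi-Newton hill-climbing via the gpml library; the grid search below is an illustrative stand-in, with a 1-D RBF kernel in place of the multi-task kernel:

```python
import numpy as np

def neg_log_marginal_likelihood(theta, X, y, noise=1e-4):
    """NLML of a 1-D GP with length-scale theta:
       0.5 * ( y^T K^{-1} y + log|K| + n log(2*pi) )."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / theta ** 2) + noise * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet + len(X) * np.log(2 * np.pi))

def learn_lengthscale(X, y, grid):
    """Pick the length-scale minimizing the NLML over a small candidate grid."""
    return min(grid, key=lambda th: neg_log_marginal_likelihood(th, X, y))
```

The NLML trades the data-fit term $y^\top K^{-1} y$ against the complexity term $\log|K|$, which is what lets the same objective both fit the data and prune irrelevant dimensions under ARD.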
3.4.4 Observation noise
In Section 2.4, we have shown that the noise level can be measured with high confidence, and the signal-to-noise ratios show that the noise is stationary. In (7), $\sigma$ represents the noise of the selected point $\mathbf{x}$. Typically this noise value is not known. In TL4CO, we estimate the noise level by an approximation: the query points can be assumed to be as noisy as the observed data. In other words, we treat $\sigma$ as a random variable and calculate its expected value as:

$$\sigma = \frac{\sum_{i=1}^{T} N_i \sigma_i^2}{\sum_{i=1}^{T} N_i}, \quad (10)$$

where $N_i$ is the number of observations from the $i$-th dataset and $\sigma_i^2$ is the noise variance of the individual dataset.
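Equation (10) is simply an observation-count-weighted average of the per-dataset noise variances; a direct sketch:

```python
import numpy as np

def pooled_noise(counts, variances):
    """Eq. (10): observation-count-weighted average of per-dataset noise variances."""
    counts = np.asarray(counts, dtype=float)
    return float(np.sum(counts * np.asarray(variances, dtype=float)) / np.sum(counts))
```

Datasets with more observations thus contribute proportionally more to the noise estimate used at query points.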
3.5 Configuration selection criteria
TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the minimum response. This is done using a utility function $u: \mathbb{X} \to \mathbb{R}$ that determines which $\mathbf{x}_{t+1} \in \mathbb{X}$ should be evaluated next:

$$\mathbf{x}_{t+1} = \operatorname*{argmax}_{\mathbf{x}\in\mathbb{X}}\ u(\mathbf{x} \mid \mathcal{M}, S^1_{1:t}). \quad (11)$$

The selection criterion depends on the MTGP model $\mathcal{M}$ solely through its predictive mean $\mu_t(\mathbf{x}_t)$ and variance $\sigma_t^2(\mathbf{x}_t)$ conditioned on the observations $S^1_{1:t}$. TL4CO uses the Lower Confidence Bound (LCB) [24]:

$$u_{LCB}(\mathbf{x} \mid \mathcal{M}, S^1_{1:t}) = \operatorname*{argmin}_{\mathbf{x}\in\mathbb{X}}\ \mu_t(\mathbf{x}) - \kappa\sigma_t(\mathbf{x}), \quad (12)$$

where $\kappa$ is an exploitation-exploration parameter. For instance, if we need to find a near-optimal configuration, we set $\kappa$ to a low value to make the most of the predictive mean. However, if we are looking for a globally optimal one, we can set a high value for $\kappa$ in order to skip local minima. Furthermore, $\kappa$ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, $\kappa$ can start with a relatively higher value at the early iterations compared to BO4CO, since the former provides a better estimate of the mean and therefore contains more information at the early stages.
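Over a finite candidate set, the LCB rule (11)–(12) reduces to a one-liner; the toy means and standard deviations below are hypothetical values chosen only to illustrate how κ shifts the choice from exploitation to exploration:

```python
import numpy as np

def next_config(candidates, mean, sd, kappa):
    """Eq. (12): pick the argmin over candidates of mu(x) - kappa * sigma(x)."""
    return candidates[int(np.argmin(mean - kappa * sd))]

# Toy example (hypothetical values): candidate 0 has the lowest predicted mean,
# candidate 1 is the most uncertain.
cands = np.array([0, 1])
mean = np.array([1.0, 1.5])
sd = np.array([0.1, 1.0])
```

With a small κ the rule exploits the low predicted mean; with a large κ the uncertainty term dominates and the rule explores the poorly modeled candidate.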
TL4CO output. Once the $N_{max}$ different configurations of the system under test have been measured, the TL4CO algorithm terminates. Finally, TL4CO produces its outputs: the optimal configuration (step 14 in Algorithm 1) as well as the learned MTGP model (step 15 in Algorithm 1). All this information is versioned and stored in a performance repository (cf. Figure 7) to be used for future versions of the system.
TL4CO architecture
[Figure 7: TL4CO architecture. The Configuration Optimisation Tool (TL4CO) reads configuration parameters, historical data, and data from previous versions out of a performance repository (which also stores the GP models), drives a Deployment Service and an Experimental Suite (workload generator, tester, data broker such as Kafka) against the System Under Test (V1 ... Vn) on a testbed, and collects results via Monitoring and Data Preparation components; a technology interface supports Storm, Cassandra, and Spark.]
Experimental results
[Plots: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on (a) WC(5D) and (b) WC(3D).]
- 30 runs, report average performance
- Yes, we did full factorial measurements and we know where the global min is… :)
Experimental results (cont.)
[Plot: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on WC(6D).]
Experimental results (cont.)
[Plots: absolute error vs. iteration (8 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on (a) SOL(6D) and (b) RS(6D).]
Comparison with default and expert prescription
[Scatter plots: average read latency (µs) and average write latency (µs) vs. throughput (ops/sec) for TL4CO and BO4CO, marking the default configuration, the configuration recommended by an expert, and the configurations found by BO4CO and TL4CO after 20 and 100 iterations.]
Model accuracy
[Box plots: absolute percentage error [%] of TL4CO vs. polyfit1–polyfit5, and of TL4CO vs. BO4CO, M5Tree, R-Tree, M5Rules, MARS, and PRIM.]
Prediction accuracy over time
[Plots: prediction error (RMSE) vs. iteration for (a) TL4CO vs. polyfit1, polyfit2, polyfit4, polyfit5, M5Tree, M5Rules, and PRIM, and (b) TL4CO with T=2, m=100/200/300 and T=3, m=100.]
Entropy of the density function of the minimizers
[Plots: entropy vs. iteration for T=1 (BO4CO) and T=2 with m=100–400 and T=3 with m=100; and entropy of BO4CO vs. TL4CO across Branin, Hartmann, Dixon, WC(3D), WC(5D), WC(6D), SOL(6D), RS(6D), and cass-20.]
[Plot: absolute error (µs) vs. iteration (10 minutes each) for TL4CO, BO4CO, SA, GA, HILL, PS, and Drift on cass-20.]
Figure 15: cass-20 configuration optimization.
[Scatter plots: average read latency (µs) and average write latency (µs) vs. throughput (ops/sec) for TL4CO and BO4CO, marking the default configuration, the expert-recommended configuration, and BO4CO/TL4CO after 20 and 100 iterations.]
Figure 16: TL4CO, BO4CO and expert prescription.
4.5 Entropy analysis
The knowledge about the location of the optimum configuration is summarized by the approximation of the conditional probability density function of the response-function minimizers, i.e., $X^* = \Pr(x^* \mid f(x))$, where $f(\cdot)$ is drawn from the MTGP model (cf. solid red line in Figure 5(b,c)). The entropies of the density functions in Figure 5(b,c) are 6.39 and 3.96, respectively, so we have more information about the latter.
The results in Figure 19 confirm that, for all the datasets (synthetic and real), the minimizer distributions under the models provided by TL4CO contain significantly more information. This demonstrates that the main reason for the quick convergence compared with the baselines is that TL4CO employs a more effective model. The results in Figure 19(b) show the change of entropy of $X^*$ over time for the WC(5D) dataset. First, in TL4CO the entropy decreases sharply, whereas the overall decrease of entropy for BO4CO is slow. Second, as we increase the number of observations from similar tasks, the additional gain in terms of entropy change is not significant. To show this more clearly, we disabled the filtering step for T=2, m=200; the results show that the entropy of $X^*$ then not only fails to decrease, but slightly increases over time.
4.6 Computational and memory requirements
The inference complexity in TL4CO is $O(T^2 t^2)$ because of the Cholesky kernel inversion in (6). Since in TL4CO we learn the hyper-parameters every $N_\ell$ iterations, the Cholesky decomposition must be re-computed periodically, so the complexity is in principle $O(T^2 t^2 \times t/N_\ell)$. Figure 20(a) reports the runtime overhead of TL4CO. The time is measured on a MacBook Pro with a 2.5 GHz Intel Core i7 CPU and 16 GB of memory. The computation time on the larger datasets (RS(6D), SOL(6D), WC(6D)) is higher than on those with less data and lower dimensionality (WC(3D), WC(5D)). Moreover, the computation time increases over the iterations, since the matrix for the Cholesky inversion grows. Figure 20(b) shows the mean and variance of the response time over 100 iterations for different numbers of tasks and historical observations per task.
[Box plots: absolute percentage error [%] of TL4CO vs. polyfit1–polyfit5, and of TL4CO vs. BO4CO, M5Tree, R-Tree, M5Rules, MARS, and PRIM.]
Figure 17: Absolute percentage error of predictions made byTL4CO’s MTGP vs regression models for cass-20(6D).
[Plots: prediction error (RMSE) vs. iteration for (a) TL4CO vs. other regression models and (b) TL4CO with different numbers of tasks and observations.]
Figure 18: (a) Comparing the prediction accuracy of TL4CO with other regression models; (b) comparing the prediction accuracy of TL4CO with different numbers of tasks and observations.
TL4CO needs to store three vectors of size $|\mathbb{X}| = N$ for the mean, variance, and LCB estimates, a matrix of size $Tn \times Tn$ for storing $K$, and $Tn$ entries for keeping track of the historical observations, making the memory requirement $O(3N + (Tn)^2 + Tn)$.
5. DISCUSSION

5.1 Behind the scenes
TL4CO in practice: usability and extensibility. In our experiments, we have evaluated whether TL4CO is feasible in practice. Although the measurements were partly parallelized, we invested more than three months (24/7) of performance measurements on the systems in Table 3 to obtain a large data basis. However, our approach itself requires far fewer measurements, as most of ours were performed only to evaluate the approach.
TL4CO can also be used in a different scenario for configuration tuning of Big Data software. Assume the current version of the system is expensive to test (e.g., it has a very large database and each round of experiments takes several hours [10]). We can then use a smaller-scale deployment of the current version (e.g., with only a few records of representative data). The idea is to spend the larger portion of the budget collecting data from this cheaper version; that data can then be exploited to accelerate the tuning of the expensive version.
TL4CO is easy to use: end users only need to specify the parameters of interest as well as the experimental parameters, and the tool then automatically sets the optimized parameters for the system in its configuration file (all in YAML format). Currently, TL4CO supports Apache Storm and Cassandra. However, it is designed to be extensible; see the technology interface in Figure 7.
How much did ARD help? TL4CO uses the ARD technique (see Section 3.4.1) to get a better MTGP fit in state spaces where a subset of the parameters interact [21]. To compare the effects of this technique, we disable ARD for optimization
Knowledge about the location of the minimizer
Effect of ARD
[Plots (a) and (b): absolute error (ms) vs. iteration (10 minutes each) for TL4CO with ARD on vs. ARD off.]
Figure 5. Three di↵erent 1D (1-dimensional) response func-tions are to be minimized. Further samples are available allover two of them, whereas function (1) that is under test hasonly few sparse observations. Merely using these few sampleswould result in poor (uninformative) predictions of the func-tion especially in the areas where there is no observations.Using the correlation with the other two functions enablesthe MTGP model to provide more accurate predictions andas a result locating the optimum more quickly.
3.3 Filtering out irrelevant dataHere is described how the selection of the points in the
previous system is made. The number of historical observa-tions taken in consideration is important because a↵ect thecomputational requirements, due to the matrix inversion in(6) incurs a cubic cost (shown in Section 4.6). TL4CO selects(step 3 in Algorithm 1) points with the lowest entropy value.Entropy reduction is computed as the log of the ratio of theposterior uncertainty given an observation xi,j belonging toversion j, to its uncertainty without it:
Iij = log⇣v(x|X,xi,j)
v(x|X)
⌘, (8)
where v(x|X) is the posterior uncertainty of the predictionfrom MTGP model of query point x obtained in (7).
3.4 Model fitting in TL4COIn this section, we provide some practical considerations
and the extensions we made to the MTGP framework tomake it applicable for configuration optimization.
3.4.1 Kernel functionWe implemented the following kernel function to support
integer and categorical variables (cf. Section 2.1):
kxx(xi,xj) = exp(⌃d`=1(�✓`�(xi 6= xj))), (9)
where d is the number of dimensions (i.e., the number ofconfiguration parameters), ✓` adjust the scales along thefunction dimensions and � is a function gives the distancebetween two categorical variables using Kronecker delta [20,34]. TL4CO uses di↵erent scales {✓`, ` = 1 . . . d} on di↵erentdimensions as suggested in [41, 34], this technique is calledAutomatic Relevance Determination (ARD). After learningthe hyper-parameters (step 7 in Algorithm 1), if the `-thdimension (parameter) turns out be irrelevant, so the ✓`will be a small value, and therefore, will be automaticallydiscarded. This is particularly helpful in high dimensionalspaces, where it is di�cult to find the optimal configuration.
3.4.2 Prior mean function
While the kernel controls the structure of the estimated function, the prior mean µ(x) : X → R provides a possible offset for our estimation. By default, this function is set to a constant µ(x) := µ, which is inferred from the observations [34]. The prior mean function is, however, also a way of incorporating expert knowledge when it is available. Fortunately, we have collected extensive experimental measurements, and based on our datasets (cf. Table 3) we observed that, for Big Data systems, there is typically a significant distance between the minimum and the maximum of each function (cf. Figure 2). Therefore, a linear mean function µ(x) := ax + b allows for more flexible structures and provides a better fit to the data than a constant mean. We only need to learn a slope for each dimension and an offset (denoted by µ_ℓ = (a, b), see next).
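To illustrate why a linear mean can help, the slope and offset can be sketched with an ordinary least-squares fit; note that in TL4CO the pair (a, b) is actually learned jointly with the kernel hyper-parameters by maximizing the marginal likelihood (Section 3.4.3), so this standalone fit is only an approximation:

```python
import numpy as np

def fit_linear_mean(X, y):
    """Least-squares estimate of a linear prior mean mu(x) = a.x + b.
    X: (n, d) configurations, y: (n,) measured responses."""
    A = np.hstack([X, np.ones((len(X), 1))])   # append a bias column for b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                 # slope vector a, offset b
```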
3.4.3 Learning hyper-parameters
This section describes step 7 in Algorithm 1. Because the learning is computationally heavy, this process is performed only every N_ℓ iterations. To learn the hyper-parameters of the kernel and of the prior mean function (cf. Sections 3.4.1 and 3.4.2), we maximize the marginal likelihood [34, 5] of the observations S^1_{1:t}. To do so, we train our GP model (6) with S^1_{1:t}. We optimize the marginal likelihood using multi-started quasi-Newton hill-climbers [32]; for this purpose, we use the off-the-shelf gpml library presented in [32]. Using the multi-task kernel defined in (5), we learn θ := (θ_t, θ_xx, µ_ℓ), which comprises the hyper-parameters of k_t, k_xx (cf. (5)) and the mean function µ(·) (cf. Section 3.4.2). The learning is performed iteratively, resulting in a sequence of priors with parameters θ_i for i = 1 ... ⌊N_max / N_ℓ⌋.
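The idea behind step 7 can be sketched for a plain single-task GP: minimize the negative log marginal likelihood with a multi-started quasi-Newton optimizer (here SciPy's L-BFGS-B standing in for the gpml routines). The RBF kernel, the two hyper-parameters, and all names are illustrative assumptions; TL4CO optimizes the full multi-task kernel of (5) plus the linear mean.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """-log p(y | X, theta) for a zero-mean GP with an RBF kernel;
    log_params = (log lengthscale, log noise std)."""
    ell, noise = np.exp(log_params)
    d = X[:, None] - X[None, :]
    K = np.exp(-0.5 * (d / ell) ** 2) + (noise**2 + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(X) * np.log(2 * np.pi))

def learn_hyperparams(X, y, n_starts=5, seed=0):
    """Multi-started quasi-Newton maximization of the marginal likelihood,
    mirroring the restart strategy TL4CO applies every N_l iterations."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        res = minimize(neg_log_marginal_likelihood,
                       rng.normal(0.0, 1.0, size=2),
                       args=(X, y), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(best.x)      # (lengthscale, noise std)
```

On clean, smooth data the restarts let the optimizer escape the poor local optimum that explains everything as noise, which is why the text prescribes multi-started hill-climbing rather than a single run.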
3.4.4 Observation noise
In Section 2.4, we showed that the noise level can be measured with high confidence and that the signal-to-noise ratios indicate the noise is stationary. In (7), σ represents the noise of the selected point x. Typically this noise value is not known. In TL4CO, we estimate the noise level by an approximation: the query points can be assumed to be as noisy as the observed data. In other words, we treat σ as a random variable and calculate its expected value as:

σ = ( Σ_{i=1}^{T} N_i σ_i² ) / ( Σ_{i=1}^{T} N_i ),  (10)

where N_i is the number of observations from the i-th dataset and σ_i² is the noise variance of that dataset.
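Eq. (10) is simply an observation-count-weighted average of the per-dataset noise variances; a direct transcription (the function name is ours):

```python
def pooled_noise(counts, variances):
    """Eq. (10): weight each dataset's noise variance sigma_i^2 by its
    number of observations N_i and normalize by the total count."""
    return sum(n * v for n, v in zip(counts, variances)) / sum(counts)
```

For example, two datasets with 100 and 300 observations and noise variances 0.04 and 0.08 yield a pooled value of 0.07, pulled toward the larger dataset.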
3.5 Configuration selection criteria
TL4CO requires a selection criterion (step 8 in Algorithm 1) to decide the next configuration to measure. Intuitively, we want to select the configuration with the minimum response. This is done using a utility function u : X → R that determines the point x_{t+1} ∈ X at which f(·) should be evaluated next:

x_{t+1} = argmax_{x ∈ X} u(x | M, S^1_{1:t})  (11)

The selection criterion depends on the MTGP model M solely through its predictive mean µ_t(x_t) and variance σ_t²(x_t), conditioned on the observations S^1_{1:t}. TL4CO uses the Lower Confidence Bound (LCB) [24]:

u_LCB(x | M, S^1_{1:t}) = argmin_{x ∈ X} µ_t(x) − κσ_t(x),  (12)

where κ is an exploitation-exploration parameter. For instance, if we need to find a near-optimal configuration quickly, we set κ to a low value to make the most of the predictive mean. If, however, we are looking for a globally optimal configuration, we can set a high value for κ in order to skip local minima. Furthermore, κ can be adapted over time [22] to perform more exploration. Figure 6 shows that in TL4CO, κ can start with a relatively higher value in the early iterations compared with BO4CO, since the former provides a better estimate of the mean and therefore carries more information at the early stages.
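Over a finite candidate set, the LCB rule of (11)–(12) reduces to picking the candidate that minimizes µ_t(x) − κσ_t(x). A sketch (the means and variances would come from the MTGP posterior; here they are passed in as arrays, and the names are illustrative):

```python
import numpy as np

def next_configuration(candidates, mu, sigma, kappa):
    """LCB selection: small kappa trusts the predicted mean (exploitation);
    large kappa favors configurations with high uncertainty (exploration)."""
    lcb = np.asarray(mu) - kappa * np.asarray(sigma)
    return candidates[int(np.argmin(lcb))]
```

With κ = 0 the rule greedily picks the best predicted mean; increasing κ shifts the choice toward uncertain candidates whose lower bound might hide a better optimum.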
TL4CO output. Once the N_max different configurations of the system under test have been measured, the TL4CO algorithm terminates. Finally, TL4CO produces its outputs, including the optimal configuration (step 14 in Algorithm 1) as well as the learned MTGP model (step 15 in Algorithm 1). All this information is versioned and stored in a performance repository (cf. Figure 7) to be used in future versions of the system.
Effect of correlation between tasks
[Plot: absolute error (log scale, 10^-6 to 10^1) vs. iteration (0–100) for TL4CO [f(x)+randn×f(x)], TL4CO [f(x)+2randn×f(x)], and BO4CO]
Runtime overhead
[Plot (a): elapsed time (s) vs. iteration (0–100) for WordCount (3D), WordCount (6D), SOL (6D), RollingSort (6D), WordCount (5D). Plot (b): elapsed time (s) for T=2,m=100; T=2,m=200; T=2,m=400; T=3,m=100; T=3,m=200]
(a) TL4CO runtime for different datasets (b) TL4CO runtime for different sizes of observations
Key takeaways
- A principled way to leverage prior knowledge gained from searches over previous versions of the system.
- Multi-task GPs can be used to capture correlation between related tuning tasks.
- MTGPs are more accurate than STGPs and other regression models.
- Application to SPSs, batch, and NoSQL systems leads to better performance in practice.
Acknowledgement: BO4CO and TL4CO are now integrated with other DevOps tools in the delivery pipeline for Big Data in the H2020 DICE project.
[Diagram: DICE toolchain over Big Data technologies on private/public cloud — DICE IDE with profile and plugins (Sim, Ver, Opt); DPIM/DTSM/DDSM TOSCA methodology; Deploy, Config, Test tools; monitoring, anomaly detection, trace checking, and iterative enhancement of the Data-Intensive Application (DIA); continuous integration and fault injection; organized across work packages WP1–WP6 (demonstrators)]
Tool Demo: http://www.slideshare.net/pooyanjamshidi/configuration-optimization-tool