

AutoML from Service Provider’s Perspective: Multi-device,Multi-tenant Model Selection with GP-EI

Chen Yu (Department of Computer Science, University of Rochester)
Bojan Karlaš (Department of Computer Science, ETH Zurich)
Jie Zhong (Department of Mathematics, California State University, Los Angeles)
Ce Zhang (Department of Computer Science, ETH Zurich)
Ji Liu (Department of Computer Science, University of Rochester)

Abstract

AutoML has become a popular service provided by most leading cloud service providers today. In this paper, we focus on the AutoML problem from the service provider's perspective, motivated by the following practical consideration: when an AutoML service needs to serve multiple users with multiple devices at the same time, how can we allocate these devices to users in an efficient way? We focus on GP-EI, one of the most popular algorithms for automatic model selection and hyperparameter tuning, used by systems such as Google Vizier. The technical contribution of this paper is the first multi-device, multi-tenant algorithm for GP-EI that is aware of multiple computation devices and multiple users sharing the same set of computation devices. Theoretically, given N users and M devices, we obtain a regret bound of O((MIU(T, K) + M) N²/M), where MIU(T, K) refers to the maximal incremental uncertainty up to time T for the covariance matrix K. Empirically, we evaluate our algorithm on two applications of automatic model selection, and show that our algorithm significantly outperforms the strategy of serving users independently. Moreover, when multiple computation devices are available, we achieve near-linear speedup when the number of users is much larger than the number of devices.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

[Figure 1: per-user GP-EI model selection instances (over models such as AlexNet, ResNet-18, and InceptionV4) feed a shared scheduler that decides which user to serve for the next round of training.]

Figure 1: Multi-device, Multi-tenant Model Selection

1 INTRODUCTION

One of the next frontiers of machine learning research is its accessibility — how can we make machine learning systems easy to use so that users do not need to worry about decisions such as model selection and hyperparameter tuning as much as they do today? The industry's answer to this question seems to be making AutoML services available on the cloud; prominent examples include Google Cloud AutoML and Microsoft Cognitive Services. In these services, users are provided with a single interface for uploading the data; hyperparameter tuning and/or model selection are handled automatically, and a model is returned directly without any user intervention, as shown in Figure 1. Bayesian optimization is one of the core techniques that make AutoML services possible, by strategically planning a series of models and hyperparameter configurations to tune and try.

From the service provider's point of view, resource allocation is the problem that comes with the emerging popularity of such services — when an online AutoML service needs to serve multiple users with a limited number of devices, what is the most cost-efficient way of allocating these devices to different users? During our conversations with multiple cloud service providers, many of them expressed the belief that effective and cost-efficient resource sharing could be of great practical interest, and it is a natural technical question to ask.

In this paper, we focus on GP-EI, one of the most popular algorithms for AutoML, used in systems such as Google Vizier [Golovin et al., 2017] and Spearmint [Snoek et al., 2012]. Specifically, we are interested in the scenario where each user runs her own GP-EI instance on a different machine learning task, and there are multiple devices, each of which can only serve one user at a time. How should we allocate resources? What are the theoretical properties of such an algorithm?

The result of this paper is the first multi-device, multi-tenant, cost-sensitive GP-EI algorithm that aims at optimizing the "global happiness" of all users given limited resources. In order to analyze its performance, we introduce a new notion of Maximum Incremental Uncertainty (MIU) to measure the dependence among all models. Given N users and M devices, the upper bound of cumulative regret is O((MIU(T, K) + M) N²/M), where MIU(T, K) will be specified in Section 5.2. This bound converges to the optimum and exhibits nearly linear speedup when more devices are employed, as will also be discussed in Section 5.2.

We evaluate our algorithm on two real-world datasets: (1) model selection for image classification tasks, with 22 datasets (users) and 8 different neural network architectures; and (2) model selection over the models available on Microsoft Azure Machine Learning Studio, with 17 datasets and 8 different machine learning models. We find that multi-tenant GP-EI outperforms standard GP-EI serving users randomly or in a round-robin fashion, sometimes by up to 5× in terms of the time it needs to reach the same "global happiness" of all users. When multiple devices are available, they provide near-linear speedups to the performance of our algorithm.

2 RELATED WORK

Multi-armed bandit problem There is a large body of research on the multi-armed bandit problem. Early work such as Lai and Robbins [1985] provided the lower bound for the stochastic multi-armed bandit setting and designed an algorithm that attains this lower bound. Much subsequent research focused on improving the constants of the bound and on designing algorithms with distribution-free bounds, such as the upper confidence bound (UCB) algorithm [Auer et al., 2002] and the Minimax Optimal Strategy in the Stochastic case (MOSS) [Audibert and Bubeck, 2009]. UCB has become a central algorithm in the bandit literature, and many algorithms build on it, such as LUCB [Kalyanakrishnan et al., 2012] and GP-UCB [Srinivas et al., 2012]. The UCB algorithm constructs an upper confidence bound for each arm in every iteration, and chooses the arm with the largest bound as the next arm to observe. UCB is very efficient at balancing exploration and exploitation, admitting the regret upper bound O(√(T |L| log T)) [Bubeck et al., 2012], where T is the running time and L is the set of arms. There also exist variations beyond the stochastic bandit problem, such as the contextual bandit problem [Deshmukh et al., 2017, Langford and Zhang, 2008] and the bandit optimization problem [Arora et al., 2012, Hazan et al., 2016]; we refer readers to the book by Bubeck et al. [2012]. Regret is a common metric for the algorithms above, while another important metric is sample complexity [Gabillon et al., 2012, Kalyanakrishnan et al., 2012]. Recently, there has also been research combining bandit algorithms with Monte Carlo tree methods to find an optimal path in a tree with random leaf nodes, which corresponds to finding the optimal strategy in a game. Representative algorithms include UCT [Kocsis et al., 2006], UGapE-MCTS [Kaufmann and Koolen, 2017], and LUCB-micro [Huang et al., 2017].
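To make the UCB selection rule described above concrete, here is a minimal stochastic-bandit sketch (not from the paper; the arm means and Gaussian noise model are purely illustrative):

```python
import math
import random

def ucb_select(counts, means, t):
    """Pick the arm with the largest upper confidence bound.

    counts[a]: number of pulls of arm a; means[a]: empirical mean reward.
    Arms that have never been pulled are tried first.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))

def run_ucb(arm_means, horizon, rng):
    """Run UCB for `horizon` rounds on arms with the given true means."""
    counts = [0] * len(arm_means)
    means = [0.0] * len(arm_means)
    for t in range(1, horizon + 1):
        a = ucb_select(counts, means, t)
        reward = arm_means[a] + rng.gauss(0, 0.1)    # noisy observation
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # running average
    return counts
```

With true means [0.1, 0.9], the pull counts concentrate on the better arm, illustrating the exploration/exploitation balance the paragraph describes.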

Expected improvement methods The expected improvement method dates back to the 1970s, when J. Mockus and Zilinskas [1978] proposed to use the expected improvement function to decide which arm to choose in each iteration; that is, the arm with the maximal expected improvement is chosen. The advantage of this method is that the expected improvement can be computed analytically [J. Mockus and Zilinskas, 1978, Jones et al., 1998]. Recently, Snoek et al. [2012] extended the idea of expected improvement to the time-sensitive case by evaluating the expected improvement per second when making the selection. We also adopt this concept in this paper, under the name EIrate. There is also some work analyzing the asymptotic convergence of the EI method: Ryzhov [2016] analyzed the hit number of each non-optimal arm, and Bull [2011] provided lower and upper bounds on the instantaneous regret when the values of all arms lie in a Reproducing-Kernel Hilbert Space. Expected improvement methods have many application scenarios; see [Malkomes et al., 2016].

GP-UCB GP-UCB is a class of approaches that considers the correlation among all arms, whereas standard UCB [Auer et al., 2002] does not. GP-UCB chooses the arm with the largest UCB value in each iteration, where the UCB value uses the correlation information. The proven regret achieves the rate O(√(T log T log|L| · γ_T)), where T is the running time, L is the set of arms, and γ_T is the maximum information gain at time T. Some variants of GP-UCB include the Branch and Bound algorithm [de Freitas et al., 2012], the EGP algorithm [Rana et al., 2017], distributed batch GP-UCB [Daxberger and Low, 2017], MF-GP-UCB [Kandasamy et al., 2016], BOCA [Kandasamy et al., 2017], and GP-(UCB/EST)-DPP-(MAX/SAMPLE) [Kathuria et al., 2016], to name a few.

Parallelization of bandit algorithms To improve the efficiency of bandit algorithms, multiple agents can be employed to investigate simultaneously. Zhong et al. [2017] designed an asynchronous parallel bandit algorithm that allows multiple agents to work in parallel without waiting for each other; both theoretical analysis and empirical studies validate that nearly linear speedup can be achieved. Kandasamy et al. [2018] designed an asynchronous version of the Thompson sampling algorithm to solve the parallel Bayesian bandit optimization problem. While they consider the single-user scenario, our work extends the setup to the multi-user case, which leads to our new notion reflecting global happiness and requires new techniques in the theoretical analysis.

AutoML Closely related to this work is the emerging trend of AutoML systems and services. Model selection and hyperparameter tuning are the key techniques behind such services. Prominent systems include Spark TuPAQ [Sparks et al.], Auto-WEKA [Kotthoff et al., Thornton et al.], Google Vizier [Golovin et al.], Spearmint [Snoek et al., 2012], GPyOpt [GPyOpt, 2016], and Auto scikit-learn [Feurer et al.]; prominent online services include Google Cloud AutoML and Microsoft Cognitive Services. Most of these systems focus on a single-tenant setting. Recently, Li et al. [2018] described a system for what the authors call "multi-tenant model selection". Our paper is motivated by this multi-tenant setting; however, we focus on a much more realistic choice of algorithm, GP-EI, which is actually used by real-world AutoML services.

3 MATHEMATICAL PROBLEM STATEMENT

In this section, we give a mathematical formulation of the problem. We first introduce the multi-device, multi-tenant (MDMT) AutoML problem, which generalizes the single-device, single-tenant (SDST) AutoML problem. Then, in Section 3.1, we propose a new notion, the time sensitive hierarchical bandit (TSHB) problem, to abstract the MDMT AutoML problem. Finally, in Section 3.2, we define a new metric to quantify the goal.

Single-device, single-tenant (SDST) AutoML AutoML often refers to the single-device, single-tenant scenario, in which a single user aims at finding the best model (or hyperparameter) for his or her individual dataset as soon as possible. Here the device is an abstract concept; it can refer to a server, a CPU, or a GPU. The problem is usually formulated as a Bayesian optimization problem or a bandit problem, using a Gaussian process to characterize the connection among different models (or hyperparameters).

Multi-device, multi-tenant (MDMT) AutoML MDMT AutoML considers the scenario where multiple devices are available and multiple tenants seek the best model for their individual datasets (one tenant corresponds to one dataset). While the objective of SDST AutoML is purely the customer's, the objective of MDMT AutoML is the service provider's, because it is generally impossible to optimize performance for every single user given limited computing resources. There are two fundamental challenges: 1) when multiple devices are available, how do we coordinate all computing resources to maximize efficiency? Simply extending SDST AutoML algorithms often lets multiple devices run the same model on the same dataset, which clearly wastes computing resources; 2) to serve multiple customers, we need a global metric that guides us in choosing the most appropriate customer to serve, in addition to choosing the most promising model for his or her dataset each time. The overall problem is how to utilize all devices to achieve a certain global happiness. Each device is considered atomic; that is, a device can only run one algorithm (model) on one dataset at a time.

3.1 Time sensitive hierarchical bandit (TSHB) problem

To find a systematic solution to the MDMT AutoML problem, we develop a new notion, the time sensitive hierarchical bandit (TSHB) problem, to formulate the MDMT AutoML problem.

Definition of the TSHB problem We now formally propose the time sensitive hierarchical bandit problem to abstract the multi-device, multi-tenant AutoML framework. Suppose that there are N users (or datasets in the AutoML framework) and M devices. Each user has a candidate set of models (or algorithms in the AutoML framework) he or she is interested in; specifically, L_i is the candidate model set for user i ∈ [N]. We consider a general situation: we do not assume that L_i and L_j are disjoint for i, j ∈ [N]. Denote the set of all models by L = L_1 ∪ L_2 ∪ ··· ∪ L_N. Running a model x ∈ L on a device takes c(x) units of time. One model can only be run on one device at a time, and one device can only run one model at a time. Once a model has been assigned to an idle device, c(x) units of time later, the performance of model x will be observed, denoted by z(x). For example, the performance could be the accuracy of the model. Without loss of generality, we assume that the larger the value of z(x), the better. Roughly speaking, the overall goal is to utilize the M devices to help the N users find their individual optimal models from the corresponding candidate sets as soon as possible. More technically, the goal (from the perspective of the service provider) is to maximize the cumulative global happiness over all users. We will define a regret to reflect this metric in the next subsection.

Remark 1. Here we simply assume c(x) to be known beforehand. Although it is usually unknown in advance, it is easy to estimate an accurate approximation given the dataset size, the computational hardware parameters, historical data, and other information. Therefore, for simplicity of analysis, we use the estimated value and assume the runtime of each model to be exactly known beforehand. In our empirical study, this approximation does not degrade the performance of our algorithm.

3.2 Regret definition for cumulative global happiness

To quantify the goal – cumulative global happiness – we define the corresponding regret. Let us first introduce some more notation and definitions:

• L(t): the set of models whose performances have been observed up to time t;

• x*_i: the best model for user i, that is, x*_i := argmax_{x ∈ L_i} z(x);

• x*_i(t): the best model for user i observed up to time t, that is,

x*_i(t) := argmax_{x ∈ L(t) ∩ L_i} z(x). (1)

We define the individual regret (or negative individual happiness) of user i at time t as the gap between the currently best and the optimal, i.e., z(x*_i) − z(x*_i(t)).

In most AutoML systems, the user experience goes beyond the regret at a single time point for a single user. Instead, the regret is defined by integrating over time and summing over all users' regrets. It is worth noting that the regret for each user is not the same as in the SDST scenario, since even when a user is not currently being served, he or she still incurs a penalty (measured by the gap between the optimal model's performance and the currently best performance). More formally, the regret at time T is defined by

Regret_T = Σ_{i=1}^{N} ∫_0^T ( z(x*_i) − z(x*_i(t)) ) dt. (2)

Our goal is to utilize all devices to minimize this regret.
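As a sanity check of definition (2), the regret of a finished run can be computed exactly, since z(x*_i(t)) is piecewise constant in t; a small sketch (the data layout is our own, not the paper's):

```python
def cumulative_regret(z_opt, improvements, T):
    """Evaluate Regret_T = sum_i integral_0^T (z(x*_i) - z(x*_i(t))) dt.

    z_opt[i]: performance of user i's optimal model, z(x*_i).
    improvements[i]: (time, best-so-far value) pairs for user i, sorted by
        time, starting at time 0 with the initial incumbent value.
    Since z(x*_i(t)) is piecewise constant, the integral is a sum of
    rectangles, one per interval between improvements.
    """
    total = 0.0
    for i, steps in enumerate(improvements):
        for (t0, best), (t1, _) in zip(steps, steps[1:] + [(T, None)]):
            total += (z_opt[i] - best) * (t1 - t0)
    return total
```

For one user with optimum 1.0, incumbent 0.0 on [0, 2) and 0.5 on [2, 4), the regret is 1.0 · 2 + 0.5 · 2 = 3.0.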

Discussion: Why is Multi-Device Important? Having multiple devices in the pool of computation resources does not necessarily mean that we need a multi-device GP-EI algorithm to do the scheduling. One naive solution, adopted by ease.ml [Li et al., 2018], is to treat all devices as a single device and run distributed training for each training task. In fact, if the training process could scale linearly across all devices in the computation pool, such a baseline strategy might not be too bad. However, the availability of resources on a modern cloud far exceeds the current limits of distributed training — whenever the scalability becomes sublinear, some resources could be used to serve other users instead. Given the growth rate of online machine learning services, we believe the multi-device setting will only become more important.

4 ALGORITHM

The proposed Multi-device Multi-tenant GP-EI (MM-GP-EI) algorithm follows a simple philosophy: as long as there is a device available, select a model to run on this device. To minimize the regret (or, equivalently, maximize the cumulative global happiness), the key is to select a promising model to run whenever a device becomes available. We use the expected improvement rate (EIrate) to measure the quality of each model among those that have not yet been selected (selected models include the ones whose performance has been observed or that are currently under test). The EIrate value for model x depends on two factors: the running time of model x and its expected improvement (EI) value (defined as the Expected Improvement Function in Section 4.1):

EIrate(x) := EI(x)/c(x).

This measures the expected improvement per unit of time. This concept also appeared in Snoek et al. [2012].

4.1 Expected Improvement Function

Every Bayesian optimization algorithm has a function called the acquisition function [Brochu et al., 2010] that guides the search for the next model to test. In our algorithm, we call it the expected improvement function (EI function).

Suppose at time t a device is free. We first compute posterior distributions for all models given all current observations, and then use these posterior distributions to construct the EI function for every model.

First, for each model x and any user who has this model (notice that different users can share the same model), we use EI_{i,t}(x) to denote the expected improvement of user i's best performance if model x is observed. Formally,

we have

EI_{i,t}(x) = E[ max( z(x) − z(x*_i(t)), 0 ) ]. (3)

Here E means taking the expectation under the posterior distribution of z(x) at time t.

Then, we sum this value over all users who have model x to represent the total expected improvement EI_t(x) if model x is observed. Formally, we have

EI_t(x) = Σ_{i=1}^{N} 1(x ∈ L_i) EI_{i,t}(x), (4)

where 1(A) = 1 if A happens and 1(A) = 0 otherwise. Finally, we define the EIrate value of x at time t as follows:

EIrate_t(x) = EI_t(x) / c(x). (5)

Now, we can choose the model with the maximal EIrate value as the next one to run at time t:

x_{next to run at time t} = argmax_{x ∈ L \ L(t)} EIrate_t(x). (6)
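Equations (3)–(6) translate directly into code once the posterior is Gaussian, using the closed form σ_t τ((µ_t − a)/σ_t) of Lemma 1 for the expectation in (3). A sketch with illustrative names (per-model posterior means, deviations, costs, and user memberships are passed in as dictionaries):

```python
import math

def tau(x):
    """tau(x) = x * Phi(x) + phi(x) for the standard normal."""
    Phi = 0.5 * (1 + math.erf(x / math.sqrt(2)))
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return x * Phi + phi

def expected_improvement(mu, sigma, best):
    """E[max(z - best, 0)] for z ~ N(mu, sigma^2), as in (3)."""
    if sigma == 0:
        return max(mu - best, 0.0)
    return sigma * tau((mu - best) / sigma)

def select_next(unobserved, mu_t, sigma_t, cost, users_of, best_of):
    """Equations (4)-(6): sum per-user EI over the users sharing each
    model, divide by runtime, and return the model with the largest
    EIrate."""
    def ei_rate(x):
        ei_total = sum(expected_improvement(mu_t[x], sigma_t[x], best_of[i])
                       for i in users_of[x])   # (4)
        return ei_total / cost[x]              # (5)
    return max(unobserved, key=ei_rate)        # (6)
```

With equal costs, the model whose posterior promises the larger improvement over the user's incumbent wins the argmax.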

Algorithm 1 MM-GP-EI Algorithm

Input: µ(x), k(x, x′), c(x), L, {L_i}_{i=1}^N, and the total time budget T.

1: x^(i)_initial = argmax_{x ∈ L_i} µ(x), ∀i ∈ [N].
2: L_ob = { x^(i)_initial }_{i=1}^N
3: while there is a device available and the elapsed time t is less than T do
4:   Refresh L_ob to include all models observed so far
5:   Update the posterior mean µ_t(·) and posterior covariance k_t(·, ·′) of z(x) given {z(x)}_{x ∈ L_ob}
6:   Update x*_i(t) = argmax_{x ∈ L_ob ∩ L_i} z(x), ∀i ∈ [N]
7:   EI(x) = Σ_{i=1}^{N} 1(x ∈ L_i \ L_ob) σ_t(x) τ( (µ_t(x) − z(x*_i(t))) / σ_t(x) ), ∀x ∈ L
8:   Run x_next = argmax_{x ∈ L \ L_ob} EI(x)/c(x) on this free device
9: end while

Output: x*_1(T), x*_2(T), ..., x*_N(T).
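A simplified single-device simulation of the loop in Algorithm 1 (the `posterior` and `run_model` callbacks are our own abstractions; a real implementation would use GP conditioning and actual training runs):

```python
import math

def _ei(mu, sigma, best):
    """Closed-form E[max(z - best, 0)] for z ~ N(mu, sigma^2) (Lemma 1)."""
    if sigma <= 0:
        return max(mu - best, 0.0)
    u = (mu - best) / sigma
    Phi = 0.5 * (1 + math.erf(u / math.sqrt(2)))
    phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sigma * (u * Phi + phi)

def mm_gp_ei(models, user_models, cost, posterior, run_model, budget):
    """Single-device MM-GP-EI loop.

    models: list of all models L; user_models: dict user -> candidate set L_i;
    posterior(observed) -> (mu, sigma) dicts over all models;
    run_model(x) -> observed performance z(x).
    """
    # Lines 1-2: start from each user's prior-best model.
    mu0, _ = posterior({})
    observed, elapsed = {}, 0.0
    for ms in user_models.values():
        x0 = max(ms, key=lambda m: mu0[m])
        if x0 not in observed:
            observed[x0] = run_model(x0)
            elapsed += cost[x0]
    # Lines 3-9: repeatedly run the model maximizing EI(x)/c(x).
    while elapsed < budget and len(observed) < len(models):
        mu, sigma = posterior(observed)
        best = {i: max(observed[x] for x in ms if x in observed)
                for i, ms in user_models.items()}
        def score(x):
            return sum(_ei(mu[x], sigma[x], best[i])
                       for i, ms in user_models.items() if x in ms) / cost[x]
        x = max((m for m in models if m not in observed), key=score)
        observed[x] = run_model(x)
        elapsed += cost[x]
    return observed
```

In the multi-device setting, the same selection step would simply fire whenever any of the M devices becomes idle.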

4.2 Choosing the Prior: Gaussian Process

Next, we must choose a suitable prior on z(x) to estimate the EI function in (3). Here, we choose a Gaussian Process (GP) prior, as do many other Bayesian optimization algorithms [Bull, 2011, Srinivas et al., 2012], mainly because of the convenience of computing the posterior distribution and the EI function.

A Gaussian Process GP(µ(x), k(x, x′)) is determined by its mean function µ(x) and covariance function k(x, x′). If z(x) has a GP prior GP(µ(x), k(x, x′)), then after observing the models in L(t) at time t, the posterior distribution of z(x) given {z(x)}_{x ∈ L(t)} is also a Gaussian Process GP(µ_t(x), k_t(x, x′)). Here, the posterior mean µ_t(x) and covariance k_t(x, x′) can be computed analytically; we give the formulas in the Supplemental Materials (Section A).
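For completeness (the paper defers these formulas to its supplement), the standard noise-free GP conditioning identities, which we assume are the intended ones, read as follows. Writing A = L(t), z_A for the vector of observed values, µ_A for the prior means on A, K_AA = [k(x, x′)]_{x, x′ ∈ A}, and k_{xA} for the row vector [k(x, x′)]_{x′ ∈ A}:

µ_t(x) = µ(x) + k_{xA} K_AA^{-1} (z_A − µ_A),

k_t(x, x′) = k(x, x′) − k_{xA} K_AA^{-1} (k_{x′A})^T.

These are the usual Gaussian conditioning formulas with zero observation noise, consistent with the noiseless setting discussed in Remark 2.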

The EI function can also be computed analytically if z(x) follows a Gaussian Process (whether prior or posterior). The following lemma gives the expression.

Lemma 1. Let Φ(x) denote the cumulative distribution function (CDF) of the standard normal distribution and φ(x) denote its probability density function (PDF). Also, let τ(x) = xΦ(x) + φ(x). Then, if X ∼ N(µ, σ²) and a ∈ R is a constant, we have

E[ max(X − a, 0) ] = σ τ( (µ − a)/σ ).
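Lemma 1 is easy to check numerically; the sketch below (illustrative, not part of the paper) compares the closed form στ((µ − a)/σ) against a deterministic midpoint-rule integration of E[max(X − a, 0)]:

```python
import math

def closed_form(mu, sigma, a):
    """sigma * tau((mu - a)/sigma), with tau(x) = x*Phi(x) + phi(x)."""
    u = (mu - a) / sigma
    Phi = 0.5 * (1 + math.erf(u / math.sqrt(2)))
    phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sigma * (u * Phi + phi)

def by_integration(mu, sigma, a, n=100000, width=10.0):
    """Midpoint-rule integral of max(x - a, 0) * N(x; mu, sigma^2) dx
    over [mu - width*sigma, mu + width*sigma]."""
    lo, hi = mu - width * sigma, mu + width * sigma
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        dens = (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
                / (sigma * math.sqrt(2 * math.pi)))
        total += max(x - a, 0.0) * dens * dx
    return total
```

The two agree to high precision, confirming the lemma's closed form.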

We end this section with the detailed description of the proposed MM-GP-EI algorithm in Algorithm 1.

Discussion: How to Choose the Prior Mean µ(x) and Prior Covariance k(x, x′) The prior mean µ(x) and prior covariance k(x, x′) are chosen according to the specific problem. They often characterize properties of the models in the problem, such as the expected values of models and the correlations among different models. In our multi-device, multi-tenant example, the parameters of the Gaussian process can be obtained from historical experience, and the correlation depends on two factors: the similarity of the algorithms and the similarity of the users' datasets. We follow standard AutoML practice (used in Google Vizier or ease.ml) to construct the kernel matrix from historical runs.

5 MAIN RESULT

Before we introduce the main theoretical result, let us define a new notion, Maximum Incremental Uncertainty (MIU), which plays a key role in our theory.

5.1 Maximum Incremental Uncertainty

Suppose that K is the kernel matrix, that is, K := [k(x, x′)]_{x, x′ ∈ L}, where k(x, x′) is the kernel function and L is the set of all models as defined in Section 3.1. Thus, K is an |L| × |L| positive semi-definite (covariance) matrix. Suppose S is a subset of [|L|] := {1, 2, ..., |L|}. Let K_S be the submatrix of K with columns and rows indexed by S.

We define the s-MIU score of matrix K (1 ≤ s ≤ |L|) by

MIU_s(K) := max over S′ ⊂ S ⊆ [|L|] with |S| = s, |S′| = s − 1 of:
  √( det(K_S) / det(K_{S′}) ), if det(K_{S′}) ≠ 0;
  0, otherwise,

where we define det(K_∅) = 1.
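For small kernels, MIU_s(K) can be computed directly from this definition by brute force; a dependency-free sketch (cofactor determinants, so only suitable for toy sizes):

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    n = len(m)
    if n == 0:
        return 1.0          # det(K_emptyset) := 1, as in the definition
    if n == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(n))

def miu(K, s):
    """s-MIU score: max over S' subset of S, |S| = s, |S'| = s - 1,
    of sqrt(det(K_S) / det(K_S'))."""
    n = len(K)
    best = 0.0
    for S in combinations(range(n), s):
        sub = lambda idx: [[K[i][j] for j in idx] for i in idx]
        dS = det(sub(S))
        for Sp in combinations(S, s - 1):
            dSp = det(sub(Sp))
            if dSp != 0:
                best = max(best, (dS / dSp) ** 0.5)
    return best
```

For a diagonal K (independent models), the determinant ratio is just the variance of the added variable, so MIU_s(K) is the square root of the largest diagonal entry, matching the intuition below.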

Let us unpack the meaning of MIU. Given |L| Gaussian random variables with covariance matrix K ∈ R^{|L|×|L|}, det(K_{S′}) measures the total quantity of uncertainty of the random variables in S′ ⊂ L, and det(K_S)/det(K_{S′}) measures the incremental quantity of uncertainty from adding one more random variable to S′ to form S. If the added random variable can be linearly represented by the random variables in S′, the incremental uncertainty is zero. If the added random variable is independent of all variables in S′, the incremental uncertainty is the variance of the added variable. Therefore, MIU_s(K) measures the largest incremental quantity of uncertainty in going from s − 1 random variables to s random variables in L.

Remark 2. Why not use Information Gain? Readers familiar with the concept of information gain (IG) may ask: "IG is a commonly used metric for how much uncertainty is reduced after observing a sample. Why not use it here?" Although IG and MIU follow the same spirit, the IG metric is not suitable in our setup. Based on the mathematical definition of IG (see Lemma 5.3 in Srinivas et al. [2012]), it requires the observation noise of the sample to be nonzero to be valid (otherwise it is infinite), which makes it inappropriate for the main target scenario of this paper. In our motivating example, a cloud computing platform, the observation noise is usually considered to be zero, since nobody runs the same experiment twice. That motivates us to define a slightly different metric (Maximum Incremental Uncertainty) to fix the zero-observation-noise issue.

5.2 Main Theorem

To simplify the analysis and result, we make the following assumption, commonly used for analyzing EI [Bull, 2011] and GP-UCB [Srinivas et al., 2012].

Assumption 1. Assume that

• there exists a constant R such that |z(x) − µ_t(x)| ≤ σ_t(x) R for any model x ∈ L and any t ≥ 0;

• σ(x) ≤ 1.

Now we are ready to provide the upper bound for theregret defined in (2).

Theorem 2. Let MIU(T, K) := Σ_{s=2}^{|L(T)|} MIU_s(K). Under Assumption 1, the regret of the output of Algorithm 1 up to time T admits the following upper bound:

Regret_T ≲ (MIU(T, K) + M) (N²/M) c̄,

where c̄ := (1/N) Σ_{i=1}^{N} c(x*_i) is the average time cost of all optimal models, and ≲ means "less than or equal to" up to a constant multiplier.

The proof of Theorem 2 can be found in the Supplemental Materials. To the best of our knowledge, this is the first bound for time-sensitive regret. We offer the following general observations:

• (convergence to optimum) If the growth of MIU(T, K) with respect to T is o(T), then the average regret converges to zero, that is,

(1/T) Regret_T → 0.

In other words, the service provider will find the optimal model for each individual user.

• (nearly linear speedup) When more and more devices are employed, that is, as M increases, the regret decreases roughly by a factor of M, as long as M is dominated by MIU(T, K).

Convergence Rate of the Average Regret. We consider the scenario where M ≪ MIU(T, K) and |L(t)| increases linearly with respect to T. Then the growth of MIU(T, K) dominates the convergence rate of the average regret. Note that MIU(T, K) is bounded by

MIU(T, K) ≤ Σ_{i ∈ top |L(t)| elements in diag(K)} √(K(i, i)).

Discussion: Special Cases. Consider the following special cases:

• (O(1/T) rate) The convergence rate for (1/T) Regret_T achieves O(1/T) if MIU(T, K) is bounded, for example, when all random variables (models) are linear combinations of a finite number of hidden Gaussian random variables.

• (no convergence) If all models are independent, then K is a diagonal matrix and MIU_s(K) is a constant, which means MIU(T, K) increases linearly in T. In such a case, the regret is of order T, which implies no convergence of the average regret. This is plausible in that, due to independence, the algorithm gains no information from previous observations to decide what to run next.

• (O(1/T^{1−α}) rate with α ∈ (0, 1)) This rate is achieved if MIU(T, K) grows at rate O(T^α).


Figure 2: Performance of Different Model Selection Algorithms with a Single Computation Device. [Figure omitted: panels (a)–(b) plot cumulative regret and panels (c)–(d) instantaneous regret (log scale) against normalized time, on Azure and DeepLearning, comparing GP-EI-Random, GP-EI-Round-Robin, and GP-EI-MDMT; panel (c) marks a 5.0× advantage for GP-EI-MDMT.]

Figure 3: The Impact of Multiple Devices on Our Approach. [Figure omitted: panels (a)–(b) plot cumulative regret and panels (c)–(d) instantaneous regret (log scale) against normalized time, on Azure and DeepLearning, comparing 1, 2, 4, and 8 devices.]

Figure 4: Performance of Different Model Selection Algorithms with Four Computation Devices. [Figure omitted: panels (e)–(f) plot cumulative regret and panels (g)–(h) instantaneous regret (log scale) against normalized time, on Azure and DeepLearning, comparing GP-EI-Random, GP-EI-Round-Robin, and GP-EI-MDMT with 4 devices.]

Figure 5: Speedup of using Multiple Devices for Our Approach on Synthetic Data. [Figure omitted: panel (a) plots instantaneous regret (log scale) against normalized time for 1, 2, 4, and 8 devices, with the cutoff marked; panel (b) plots speedup against the number of devices.]

6 EXPERIMENTS

We validate the effectiveness of the multi-device, multi-tenant GP-EI algorithm.

6.1 Data Sets and Protocol

We use two datasets for our experiments, namely (1) DeepLearning and (2) Azure. Both datasets are from the ease.ml paper Li et al. [2018], in which the authors evaluate their single-device, multi-tenant GP-UCB algorithm. The DeepLearning dataset is collected from 22 users, each running an image classification task. The system needs to select from 8 deep learning models, including NIN, GoogLeNet, ResNet-50, AlexNet, BNAlexNet, ResNet-18, VGG-16, and SqueezeNet. The Azure dataset is collected from 17 users, each running a Kaggle competition. The system needs to select from 8 binary classifiers, including Averaged Perceptron, Bayes Point Machine, Boosted Decision Tree, Decision Forests, Decision Jungle, Logistic Regression, Neural Network, and SVM.

Protocol We run all experiments with the following protocol. In each run, we randomly select 8 users, which we isolate and use to estimate the mean and the covariance matrix of the prior for the Gaussian process. We test the different model selection algorithms on the remaining users.
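The prior estimation step above can be sketched as follows (a minimal illustration with our own function name and data layout; the paper does not specify its exact estimation code):

```python
import numpy as np

def estimate_prior(accuracy_table):
    """Estimate a Gaussian-process prior from held-out users.

    accuracy_table: array of shape (num_heldout_users, num_models),
    where entry (u, m) is the accuracy of model m for user u.
    Returns the empirical mean vector and covariance matrix over
    models, treating each held-out user as one observation."""
    mu = accuracy_table.mean(axis=0)
    K = np.cov(accuracy_table, rowvar=False)  # models are the variables
    return mu, K

# 8 held-out users, 5 candidate models (synthetic numbers)
rng = np.random.default_rng(0)
table = rng.uniform(0.5, 0.9, size=(8, 5))
mu, K = estimate_prior(table)
print(mu.shape, K.shape)  # (5,) (5, 5)
```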

For all model selection strategies, we warm-start by training the two fastest models for each user, and then switch to each specific automatic model selection algorithm:

• GP-EI-Random: Each user runs their own GP-EI model selection algorithm; the system chooses the next user to serve uniformly at random and trains one model for the selected user;

• GP-EI-Round-Robin: Each user runs their own GP-EI model selection algorithm; the system picks the next user to serve in a round-robin manner;

• GP-EI-MDMT: Our proposed approach, in which each user runs their own GP-EI model selection algorithm; the system picks the next user to serve using the MM-GP-EI algorithm we proposed.
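The three strategies differ only in how the next user is chosen. A minimal sketch (the function names and the per-user EI-per-second signal are our own simplification of the paper's time-sensitive criterion, not its actual implementation):

```python
import random
import itertools

def pick_random(users, rng=random):
    # GP-EI-Random: choose the next user uniformly at random.
    return rng.choice(users)

def make_round_robin(users):
    # GP-EI-Round-Robin: cycle through users in fixed order.
    cycle = itertools.cycle(users)
    return lambda: next(cycle)

def pick_mdmt(ei_per_second):
    # GP-EI-MDMT (sketch): serve the user whose best candidate model
    # offers the largest expected improvement per unit training time.
    return max(ei_per_second, key=ei_per_second.get)

users = ["alice", "bob", "carol"]
next_rr = make_round_robin(users)
print(next_rr(), next_rr())  # alice bob
print(pick_mdmt({"alice": 0.01, "bob": 0.05, "carol": 0.02}))  # bob
```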

Metrics We measure the performance of a model selection system in two ways: (1) Cumulative Regret Regret_T; and (2) Instantaneous Regret: at time T, we average, over all users, the difference between the best possible accuracy for each user and the current best accuracy that user has obtained. Intuitively, this measures the global "unhappiness" among all users at time T.
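The instantaneous regret metric can be computed directly from its definition (the dict-based interface below is our own illustration):

```python
def instantaneous_regret(best_possible, current_best):
    """Average, over users, of (best achievable accuracy - best
    accuracy found so far). Zero means every user already has
    their optimal model."""
    return sum(best_possible[u] - current_best[u]
               for u in best_possible) / len(best_possible)

best_possible = {"alice": 0.95, "bob": 0.90}
current_best = {"alice": 0.90, "bob": 0.88}
print(instantaneous_regret(best_possible, current_best))  # ~0.035
```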

6.2 Single-Device Experiments

We validate the hypothesis that, given a single device, multi-tenant GP-EI outperforms both the round-robin and random strategies for picking the next user to serve.

Figure 2 shows the results on both datasets. The colored region around each curve shows the 1σ confidence interval. On Azure, our approach outperforms both round robin and random significantly: we reach the same instantaneous regret up to 5× faster than round robin. This is because, by prioritizing users according to their expected improvement, the global happiness of all users can increase faster than when all users are treated equally. On DeepLearning, by contrast, we do not observe a significant speedup for our approach. This is because the first two trials of models already give reasonable quality. If we measure the standard deviation of the accuracy of models for each user, the average for Azure is 0.12, while for DeepLearning it is 0.04. This means that, after the system trains the two initial models for each user, Azure still leaves substantially more potential performance gain undiscovered among the models that have not yet been sampled.

6.3 Multi-Device Experiments

We validate the hypothesis that using multiple devices speeds up our multi-device, multi-tenant model selection algorithm. Figure 3 shows the results of using multiple devices with our algorithm. We see that the more devices we use, the faster the instantaneous regret drops. In terms of speedup, since DeepLearning has more users than Azure (14 vs. 9), the speedup is larger on DeepLearning. The significant speedup in reaching an instantaneous regret of 0.03 for Azure is most probably due to the small number of users compared to the number of devices (9 vs. 8).

We now compare our approach against GP-EI-Round-Robin and GP-EI-Random when multiple devices are available. Figure 4 shows the results. We see that, with up to 4 devices (and 9 users in total), our approach outperforms round robin significantly on Azure. When we use 8 devices for Azure, because there are only 9 users, our approach and round robin achieve almost the same performance.

We also conduct an experiment using a synthetic dataset with 50 users and 50 models (Figure 5). We model the performance as a Gaussian process and generate random samples independently for each user. The Gaussian process has zero mean and a covariance matrix derived from the Matérn kernel with ν = 5/2. Each generated sample is shifted upwards so that it is non-negative. We run our approach on the same dataset while varying the number of devices, repeating the experiment 5 times for each device count. To quantify speed gains, we measure the average time it takes the instantaneous regret to hit a cutoff of 0.01. We observe that adding more devices makes the convergence time drop at a near-linear rate.
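The synthetic data generation can be sketched as follows (a minimal illustration; the 1-D grid placement of models, the length scale, and the seed are our own assumptions, not the paper's exact configuration):

```python
import numpy as np

def matern52(x1, x2, length_scale=1.0):
    """Matérn kernel with nu = 5/2 (standard closed form)."""
    d = np.abs(x1[:, None] - x2[None, :]) / length_scale
    s = np.sqrt(5.0) * d
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 5.0, 50)              # 50 "models" on a 1-D grid
K = matern52(grid, grid) + 1e-8 * np.eye(50)  # jitter for stability
# One independent zero-mean GP draw per user (50 users)
samples = rng.multivariate_normal(np.zeros(50), K, size=50)
samples -= samples.min(axis=1, keepdims=True)  # shift each draw upward
assert (samples >= 0).all()
```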

7 CONCLUSION

In this paper, we introduced a novel multi-device, multi-tenant algorithm based on GP-EI that maximizes the "global happiness" of all users sharing the same set of computing resources. We formulated the "global happiness" as a cumulative regret and provided the first theoretical upper bound for time-sensitive regret in the GP-EI framework. We evaluated our algorithm on two real-world datasets, where it significantly outperforms standard GP-EI serving users randomly or in a round-robin fashion. Both our theoretical results and our experiments show that our algorithm can provide near-linear speedups when multiple devices are available.


Acknowledgements

Chen Yu and Ji Liu are in part supported by NSF CCF-1718513, an IBM faculty award, and an NEC fellowship. Ce Zhang and the DS3Lab gratefully acknowledge the support of Mercedes-Benz Research & Development North America, MeteoSwiss, Oracle Labs, the Swiss Data Science Center, Swisscom, Zurich Insurance, the Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.

References

R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pages 1747–1754, 2012.

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 217–226, 2009.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.

E. A. Daxberger and B. K. H. Low. Distributed batch Gaussian process optimization. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 951–960. PMLR, 2017.

N. de Freitas, A. Smola, and M. Zoghi. Regret bounds for deterministic Gaussian process bandits. arXiv preprint arXiv:1203.2177, 2012.

A. A. Deshmukh, U. Dogan, and C. Scott. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems, pages 4851–4859, 2017.

M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970.

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.

D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.

GPyOpt. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.

E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

R. Huang, M. M. Ajallooeian, C. Szepesvári, and M. Müller. Structured best arm identification with fixed confidence. In International Conference on Algorithmic Learning Theory, pages 593–616, 2017.

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In Toward Global Optimization, volume 2, pages 117–128. Elsevier, 1978.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, volume 12, pages 655–662, 2012.

K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Poczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 992–1000. Curran Associates, Inc., 2016.

K. Kandasamy, G. Dasarathy, J. Schneider, and B. Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1799–1808. PMLR, 2017.


K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, 2018.

T. Kathuria, A. Deshpande, and P. Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4206–4214. Curran Associates, Inc., 2016.

E. Kaufmann and W. M. Koolen. Monte-Carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pages 4904–4913, 2017.

L. Kocsis, C. Szepesvári, and J. Willemson. Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep, 1, 2006.

L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, (25):1–5.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang. Ease.ml: Towards multi-tenant resource sharing for machine learning workloads. Proceedings of the VLDB Endowment, 2018.

G. Malkomes, C. Schaff, and R. Garnett. Bayesian optimization for automated model selection. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2900–2908. Curran Associates, Inc., 2016.

S. Rana, C. Li, S. Gupta, V. Nguyen, and S. Venkatesh. High dimensional Bayesian optimization with elastic Gaussian process. In International Conference on Machine Learning, pages 2883–2891, 2017.

I. O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC '15), pages 368–380. ACM Press, 2015.

N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), page 847. ACM Press, 2013.

J. Zhong, Y. Huang, and J. Liu. Asynchronous parallel empirical variance guided algorithms for the thresholding bandit problem. arXiv preprint arXiv:1704.04567, 2017.