adeepreinforcementlearningapproachtotheoptimizationof...

Research ArticleADeep Reinforcement Learning Approach to the Optimization ofData Center Task Scheduling

Haiying Che1 Zixing Bai1 Rong Zuo1 and Honglei Li 2

1Beijing Institute of Technology Zhongguancun South Street No 5 Beijing 100081 China2Liaoning Normal University Huanghelu 850 Dalian 116029 China

Correspondence should be addressed to Honglei Li lhllnnueducn

Received 25 May 2020 Revised 19 July 2020 Accepted 13 August 2020 Published 31 August 2020

Academic Editor Shuping He

Copyright copy 2020 Haiying Che et al (is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

With more businesses are running online the scale of data centers is increasing dramatically (e task-scheduling operation withtraditional heuristic algorithms is facing the challenges of uncertainty and complexity of the data center environment It is urgentto use new technology to optimize the task scheduling to ensure the efficient task execution (is study aimed at building a newscheduling model with deep reinforcement learning algorithm which integrated the task scheduling with resource-utilizationoptimization(e proposed scheduling model was trained tested and compared with classical scheduling algorithms on real datacenter datasets in experiments to show the effectiveness and efficiency(e experiment report showed that the proposed algorithmworked better than the compared classical algorithms in the key performancemetrics average delay time of tasks task distributionin different delay time levels and task congestion degree

1 Introduction

For the data centers with a huge number of servers even alittle bit improvement of the operation can save millions ofdollars Good task scheduling has been proved to be apractical way to bring benefits without extra hardware in-vestment Todayrsquos data center scheduling systems mostly useheuristic algorithms such as Fair Scheduling (Fair) First-Come-First-Service (FCFS) scheduling and Shortest-Job-First (SJF) scheduling (ese algorithms are easy to un-derstand and implement but only take effects in some certainsituations due to the limitation of the complicated pro-duction environment (e challenges of data centerscheduling in the real world are as follows

(1) (e data center environment is complex and dy-namically changing To achieve efficient and effectivescheduling traditional heuristic algorithms mostly relyon precise environment modeling In other words ifthe environment cannot be accurately modeled thereasonable and effective scheduling algorithm will notbe successfully applied (erefore most data center

scheduling algorithms still use basic and simple heu-ristic algorithms such as Fair FCFS and SJF Practi-cally it is too hard to model the environment preciselydue to the uncertainty of the coming tasks and thedynamic environment For example the executiontime of a task is affected by network bandwidth theprocessing performance of different machines the diskspeed and the location of the required resources tosupport the task execution

(2) Data center scheduling is usually performed withoutsufficient information support (ere are no patternsto predict the task arriving way (at is the numberand size of tasks coming next are unknown So thealgorithm has to schedule the tasks at once withoutany prior experience and prepared information

(3) Resource requirements change dynamically For datacenter services or tasks the demand for resourcesvaries according to different time period environ-mental conditions and so on (e scheduling al-gorithm needs automatically optimize resourceutilization based on the changing demand

HindawiComplexityVolume 2020 Article ID 3046769 12 pageshttpsdoiorg10115520203046769

In order to solve the above problems a lot of relatedstudies are carried out Most of them focus on specificscheduling scenarios or rely on the acquired details of thecoming tasks in advance In addition most of the previousstudies are single objective optimization-oriented

In recent years the deep reinforcement learning [1 2]which achieves outstanding performance in complex controlfields strongly shows its superiority of decision making incomplex and unknown environments Mao et al tried totranslate the problem of packing tasks with multiple re-source demands into a learning problem (eir work showsthat deep reinforcement learning performs comparably tostate-of-the-art heuristics adapts to different conditionsconverges quickly and learns strategies that are sensible inhindsight [3] Inspired by these research results we believethat deep reinforcement learning is suitable for taskscheduling of data centers in complex production envi-ronments (is paper proposed a method based on deepreinforcement learning to improve task scheduling perfor-mance and resource utilization of data centers With theneural network model trained through deep reinforcementlearning tasks scheduling and resource utilization im-provement were achieved

In this study we used deep reinforcement learning andaimed at two objectives minimizing the average taskcompletion time and improving resource utilization effi-ciency without any prior knowledge of the coming tasks(e experiments and verification were performed usingreal production data from the Alibaba Cluster TraceProgram sponsored by Alibaba Group [4] (e resultsshowed that compared with the traditional heuristic al-gorithms the proposed method achieved betterperformance

(is paper is organized as follows In Section 2 relatedstudies were discussed Reinforcement learning-basedscheduling as the prevailing technology used in taskscheduling of the data center was introduced In Section 3the technical architecture and related definitions of thescheduling system proposed in this paper were illustrated InSection 4 the reinforcement learning algorithm for sched-uling optimization was introduced in detail In Section 5experiments were performed on Alibaba real productiondataset to show the advantage of the proposed schedulingmodel In the last section the conclusion and future workwere discussed

2 Related Studies

For data center task scheduling many studies have beenlaunched in the past decade Heuristic algorithms and re-inforcement learning are popular in this domain

21 Heuristic Algorithm-Based Studies (e traditionalmethods are mainly based on heuristic algorithms Typ-ically Delimitrou and Kozyrakis proposed the ARQ al-gorithm a multiclass admission control protocol thatconstrains application waiting time and limits applicationlatency to achieve QoS [5] (ey evaluated the algorithm

with a wide range of workload scenarios on both small-and large-scale systems and found that it enforces per-formance guarantees for 91 percent of applications whilethe utilization is improved Perry et al proposed a Fastpassalgorithm to improve data center transmission efficiency[6] Fastpass incorporates two fast algorithms the firstdetermines the time at which each packet should betransmitted while the second determines the path used forthat packet (ey deployed and evaluated Fastpass in aportion of Facebookrsquos data center network which showsthat Fastpass achieves better efficiency in transmissionTseng et al argued that previous studies have typicallyfocused on modifying the original TCP or increasingadditional switch hardware costs and rarely focused onthe existing data center network (DCN) environments Sothey proposed a cross-layer flow schedule with a dynamicgrouping (CLFS-DG) algorithm to reduce the effect of TCPincast in DCNs [7] Yuan et al solved the cost optimizationproblem under CDCs (cloud data centers) from two as-pects Firstly a revenue-based workload admission controlmethod is proposed to selectively accept requests (en acost-aware workload scheduling method is proposed toallocate requests among multiple Internet service pro-viders connected to the distributed CDCs Finally intel-ligent scheduling requests are realized which can achievelower cost and higher throughput for CDC providers [8]Yuan et al proposed the profit maximization algorithm(PMA) for the profit maximization challenge in the hybridcloud scenario (e algorithm uses the hybrid heuristicoptimization algorithm to simulate annealing particleswarm optimization (SAPSO) to improve the throughputand profit of private cloud [9] Bi et al proposed a newdynamic hybrid meta heuristic algorithm based on sim-ulated annealing and particle swarm optimization (PSO)for minimizing energy cost and maximizing revenue ofvarious applications running in virtualized cloud datacenters [10] Yuan et al proposed a heuristic timescheduling algorithm (TTSA) to minimize the cost ofprivate cloud data centers in hybrid cloud which caneffectively improve the throughput of private cloud datacenters [11] Yuan et al proposed a biological target dif-ferential evolution algorithm (SBDE) based on simulatedannealing aiming at the challenges of maximizing profitand minimizing the probability of average task loss indistributed green data centers scenario Compared withseveral existing scheduling algorithms SBDE achievesgreater benefits [12]

With the development of data centers reasonableprediction is very important to improve the efficiency oftask scheduling Although prediction is not involved in theexperiment it is necessary to predict the execution timewhen the model is transferred to the actual scene Zhanget al (2018) proposed an integrated forecasting methodequipped with noise filtering and data frequency repre-sentation named SavitzkyndashGolay and wavelet-supportedstochastic configuration networks (SGW-SCNs) [13] Biet al also proposed an integrated forecasting method thatcombines SavitzkyndashGolay filtering and wavelet decompo-sition with stochastic configuration networks to get the

2 Complexity

workload forecast in the next period [14] Bi et al proposedan integrated prediction method that combines theSavitzkyndashGolay filter and wavelet decomposition withstochastic configuration networks to predict workload atthe next time slot [15]

Although the previous research results are quiteabundant Luo et al argued that not only because thescheduling problem is theoretically NP-hard but alsobecause it is tough to perform practical flow scheduling inlarge-scale DCNs It is quite challenging to minimize thetask completion time in todayrsquos DCNs [16] (at meansbecause of the scalability complexity and variability ofdata center production environment heuristic algorithmscannot achieve the expected performance even with thedeliberated design and exhausting tuning work in realproduction environment

22 Reinforcement Learning-Based Studies Reinforcementlearning [1 17] as the prevailing machine learningtechnology dramatically becomes a new way to the taskscheduling of data centers in recent years Unlike su-pervised learning which requires amount of manpowerand time to prepare the labeled data reinforcementlearning can work with unlabeled data (is so-calledmodel-free mode allows users to start the modelingwithout the preparation of accurate server environmentaldata from the scratch (erefore at present more re-searchers turn to try reinforcement learning to solve taskscheduling in complex data center environments Forexample for the purpose of energy saving Yuan et al usedthe Q-learning algorithm to reduce the data center energyconsumption (ey tested the algorithm in the CloudSima cloud computing simulation framework issued by cloudcomputing and distributed system laboratory of theUniversity of Melbourne (e result shows that it canreduce about 40 of the energy consumption of the non-power-aware data center and reduce 17 energy con-sumption of the greedy scheduling algorithm in datacenter scheduling area [18] Lin et al used TD-error re-inforcement learning to reduce the energy consumptionof data centers which does not rely on any given sta-tionary assumptions of the job arrival and job serviceprocesses (e effectiveness of the proposed reinforce-ment learning-based data center power managementframework was verified with real Google cluster datatraces [19] Li et al also proposed an end-to-end coolingcontrol algorithm (CCA) that is based on the deep de-terministic policy gradient algorithm (DDPG) to optimizethe control of cooling system in the data center (e resultshows that CCA can achieve about 11 cooling costsaving on the simulation platform compared with amanually configured baseline control algorithm [20]Shaw et al proposed an advanced reinforcement learningconsolidation agent (ARLCA) based on the Sarsa algo-rithm to reduce cloud energy consumption [21] (eirwork proved that the ARLCA makes a significant im-provement in energy saving while the number of serviceviolations is reduced

Scholars also apply reinforcement learning to solve theproblems on other purposes of the data center taskscheduling Basu et al applied reinforcement learning tobuild cost-models on standard online transaction processingdatasets [22] (ey modeled the execution of queries andupdates as a Markov decision process whose states aredatabase configurations actions are configuration changesand rewards are functions of the cost of configurationchange and query and update evaluation (e approach wasempirically and comparatively evaluated on a standardOLTP dataset (e result shows that the approach iscompetitive with state-of-the-art adaptive index tuningwhich is dependent on a cost model Peng et al triedGaussian process regression with reinforcement learningbased on the Q-learning algorithm to solve the problem ofstate-action space incomplete exploration of reinforcementin cloud data centers [23] (e computational resultsdemonstrated that the schema can balance the explorationand exploitation in the learning process and accelerate theconvergence to a certain extent Ruffy et al presented a newemulator Iroko to support different network topologiescongestion control algorithms and deployment scenariosIroko interfaces with the OpenAI Gym toolkit which al-lows for fast and fair evaluation of different reinforcementlearning and traditional congestion control algorithmsunder the same conditions [24] By the way some scholarstry to combine neural network with heuristic algorithm andhave achieved valuable research results For example Heet al proposed a new strategy iteration method for online Hinfin optimal control law design based on neural network fornonlinear systems Numerical simulation is carried out toverify the feasibility and applicability of the algorithm [25]In the next year a PI algorithm based on neural networkwas proposed to solve the problem of online adaptiveoptimal control of nonlinear systems Two examples weregiven to illustrate the effectiveness and applicability of themethod [26]

(e above studies show that reinforcement learning canhelp us to deal with the scheduling problems of data centersin many domains However most of them aim at singleobjective or are based on the ideal hypothesis that theinformation about the coming tasks and environment areaccurate and adequate in advance which limits the ap-plication of the related studies in real productionenvironment

Furthermore for most of todayrsquos data centers resourcesin fixed scale are usually allocated in advance for mostbusiness which is apparently a low-efficiency mode In factthe amount of resources required by business may changewith time If we cannot customize the resource amountaccording to the changing requirement when more re-sources are needed it will cause serious task delay andaffect the user experience badly Or if fewer resources areneeded more unused resources will be wasted In thisarticle we proposed reinforcement learning to optimizescheduling efficiency and to improve resource utilizationsimultaneously (is is a two-objective optimization workwhich addresses the primary demand of data centeroperations

Complexity 3

3 The Reinforcement Learning-BasedScheduling Model

In this session all key parts of the scheduling model as wellas the related definitions and equations were illustrated (eimportant notations are listed in Table 1 for betterunderstanding

31 e Model of Scheduling System (e reinforcementlearning-based scheduling system consisted of two partsenvironment and scheduling agents As shown in Figure 1the environment contained task queue virtual machinecluster and scheduler (e task queue was the pool to collectthe unimplemented tasks in the data center (e virtualmachine cluster was the container of virtual machinehandlers (e scheduler was the dispatcher to execute theactions from scheduling agents In this scheduling system itwas assumed that the number of virtual machines was fixed(e configuration and performance of all virtual machineswere the same Tasks in task queue can be scheduled to anyidle virtual machine(e virtual machine cluster was definedas VMs vm1 vm2 vm3 vmm where vmi was thehandler of virtual machine i

(e agent part included two scheduling agents Agent1and Agent2 Each agent was responsible for its optimizationobjective Agent1 was for task scheduling and Agent2 wasfor resource utilization

In the scheduling model time is an important factor intask scheduling and optimization of resource utilization(e time in the scheduling model was divided into twotypes time t and time 1113954t shown in Figure 2 t was the starttime of task scheduling and 1113954t was the start time of op-timization of resource utilization Time t and 1113954t weredefined in the relative time period for ease of calculation(e initial value of t and 1113954t was 0 In the scheduling modeleach task scheduling last Tseconds and each optimization

of resource utilization last 1113954T seconds 1113954T KlowastT(Kgt 0)where K was a parameter in the scheduling model con-figuration which hinted the optimization of resourceutilization spends more time than task scheduling In thispaper we defined the interval between t and t + 1 as timeslot t So T was the duration of time slot t (e samedefinitions were for time slot 1113954t

At time t Agent1 made the decision on whether toexecute each task in the task queue (e scheduling actiondecision was stored in a1t After 1113954T seconds that is at time 1113954tthe system started to optimize resource utilization (evirtual machine cluster would turn on or off a certainnumber of virtual machines according to the optimizationdecision action a21113954t output by Agent2 (e procedure wasillustrated in detail as follows

Whenever time t came the system would receive the tasksarriving from time tand store the tasks in the task queue(enthe following actions were performed in sequence

(i) (e system selected tasks according to the prioritycalculated by p(i) the task priority function definedin equation (1) and input the current state of theenvironment s1t to Agent1

(ii) Agent1 output action a1t and returned it to theenvironment

(iii) (e environment executed a1t and returned thereward r1t to Agent1 when a1twas finished

When all tasks in the queue were arranged to executeaccording to a1t the system entered the next task schedulingtime slot t + 1

When the system reached time 1113954t the system would startthe optimization of resource utilization At time 1113954t thecorresponding actions were performed in the followingsteps

(i) (e system input the current environment status s21113954t

to Agent2

Table 1 Notations in the scheduling system model

Notation Memo TypeT (e duration of period of task scheduling Model parameter1113954T (e duration of period of resource optimization Model parameterp(i) (e priority function to estimate the priority of task i Function

t(e start time of task scheduling which also represents the ID of period of task scheduling (the period is also

called time slot) Variable

1113954t (e start time of resource optimization which also represents the ID of period of resource optimization Variable(s1 a1 r1) (e state action and reward vector for task scheduling agent Variable(s2 a2 r2) (e state action and reward vector for resource optimization agent Variable

μ η Calibration parameters to adjust the influence of average task priority and active virtual machineproportion Model parameter

1113954μ 1113954η Calibration parameters to tune the proportion of the active virtual machine and the proportion of idlevirtual machines Model parameter

e t1113954te t1113954tprime (e sum of the execution time of tasks arriving in period 1113954t and the sum of the execution time of tasks not

executed in period 1113954tVariable

n t1113954tn t1113954tprime (e number of tasks arriving in period 1113954t and the number of tasks not executed in period 1113954t Variable

M (e number of virtual machines in cloud server Model parameterK (e ratio of 1113954T to T Model parameterαβHc (e hyperparameter of A2C algorithm Hyperparameter

4 Complexity

(ii) Agent2 output action decision a21113954t and returned it tothe environment

(iii) (e environment executed action decision a21113954t toshut down or start up a certain number of virtualmachines and returned the reward r2t to Agent2then the system entered the next time slot 1113954t + 1

32 Related Definitions

321 Task Priority Tasks were the jobs running on virtualmachine cluster Tasks arriving in time slot twere added to thetask queue waiting for virtual machine allocation As men-tioned above the environment had little information aboutthe exact number and size of tasks in advance so the taskpriority could hardly be simply calculated by waiting time ortask execution time We proposed function p(i) to estimatethe priority of task i In this function ei was the execution timeof the task i and wi was the waiting time of the task i p(i) isdefined in the following equation

p(i) ei + wi( 1113857

ei

(1)

After the priorities were calculated all tasks werescheduled according to their priorities

322 Action1 Action1 was the action space of all actions intask schedule(e element of Action1 a1i indicated whethera certain virtual machine was allocated to the task i

a1i isin [0 1] (e values of a1i were determined in the fol-lowing equation

a1i 1 if task i gets a virtual machine

0 if task i does not get a irtualmachine1113896 (2)

323 State1 State1 was the status space of environment fortask scheduling s1t the instance of State1 was defined as thevector (e1t pt m1t n1t) where e1t was the execution time ofthe task that was allocated a virtual machine pt was thepriority of the task that was allocated a virtual machine m1t

was the average priority of all the tasks in the task queue (seeequation (3)) n1t was the proportion of the active virtualmachines in virtual machine cluster (see equation (4)) andNt was the number of tasks in time slot t

m1t 1

Nt1113944

Nt

i1p(i) (3)

n1t nactive vm

M (4)

324 Reward1 (e reward value represented the feedbackvalue after the action was performed (e reward value oftime slot t is defined in the following equation

0

0 1 2

2k

n ndash 1 n

(n ndash 1)k nk ndash 1 nk + 1nkk ndash 1 k k + 1

T

T T T

T T T T

t

t

hellip hellip

hellip

hellip hellip hellip

hellip

Figure 2 Model of scheduling time

Taskn

Scheduler

Task1

Task2

Actor

Critic

Actor

CriticTurn off

Turn on

Do nothing

Execute

Not executeS1t

Task queue

vm1 vmm

Cluster

r1t r2t

Environment Agent

S2t

a1t a2t

a1t

a2tˆ ˆ

ˆ

Figure 1 Scheduling system architecture

Complexity 5

r1t μlowastm1t + ηlowast n1t (5)

where μ and η were calibration parameters which were usedto adjust the influence of average task priority m1t and activevirtual machine proportion n1t (e values of μ and η werebetween -1 and 1

325 Action2 Action2 represented the number of virtualmachines turned on or off in time slot 1113954t (e instance ofAction2 was defined as a21113954t isin [minusM M] When a21113954t was greaterthan 0 it meant there were a21113954t virtual machines turned onWhen a21113954t was equal to 0 it meant that no change occurredWhen a21113954t was less than 0 it meant that there were a21113954t virtualmachines shut down

326 State2 State2 was the state space for Agent2 s21113954t theinstance of State2 was defined as (e21113954t l1113954t m21113954t n21113954t) where e21113954t

was defined as the logarithm of the sum of e t1113954t and e t1113954tminus1prime where e t1113954t was the sum of the task execution time of the tasksarrived in time slot 1113954t and e t1113954tminus1prime was the sum of the taskexecution time of the tasks not executed at previous time slot1113954t minus 1 l 1113954t was the logarithm of the sum of n t1113954t and n t1113954tminus1prime wheren t1113954t was the number of tasks arrived in time slot 1113954t and n t1113954tminus1primewas the number of tasks not executed in the previous time slot1113954t minus 1 m21113954t was the average value of m1t in time slot 1113954t n21113954t wasthe average proportion of idle virtual machines in time slot1113954t `

e21113954t log e t1113954t + e t1113954tminus1prime( 1113857 1113954t 0 1 2 3

l1113954t log n t1113954t + n t1113954tminus1prime( 1113857 1113954t 0 1 2 3

m21113954t 1K

1113944

Klowast1113954t

iKlowast(1113954tminus1)

m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t

iklowast(1113954tminus1)

1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)

327 Reward2 Reward2 was the value of reward functionfor Agent2 It was determined in equation (7) where 1113954μ and 1113954ηwere calibration parameters We adjusted the value of 1113954μ and1113954η according to the actual situation to tune the proportion ofthe active virtual machine and the proportion of idle virtualmachines in time slot 1113954t

r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)

4 Reinforcement Learning Algorithm forScheduling Optimization

In this section the actor-critic deep reinforcement learningalgorithm [21 27 28] was applied to create the model forscheduling optimization of data centers (e actor-criticalgorithm is a hybrid algorithm based on Q-learning andpolicy gradient which are two classic algorithms of rein-forcement learning (e actor-critic algorithm shows

outstanding performance in complicated machine learningmissions

A2C was selected in this study (e structure based onA2C is shown in Figure 3 In A2C the actor network is usedfor action selection and the critic network is used to evaluatethe action

As mentioned in Section 3 Agent1 acted as the opti-mization model for task scheduling and Agent2 as theoptimization model for resource utilization (State1 Ac-tion1 Reward1) and (State2 Action2 Reward2) were usedto describe the state space action space and reward functionof Agent1 and Agent2 respectively Hence (s1t a1t r1t) and(s21113954t a21113954t r21113954t) separately represented one instance in the statespaces action space and the reward functions of Agent1 andAgent2 at time slot t and time slot 1113954t (e data entry(st at rt st+1) was recorded as a sample for the training withthe A2C algorithm

(e parameter of the actor network is updated by ad-vantage function A(st at) (see equation (8)) θa (seeequation (9)) and θc (see equation (10)) are the parameters ofthe actor network and critic network respectively

A st at( 1113857 rt + cVπθ st+1 θc( 1113857 minus V

πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t

nabla log πθast at( 1113857A st at( 1113857 + βnablaθa

H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)

θc⟵θc minus αprime1113944t

nablaθcA st at( 1113857( 1113857

2

(10)

where α is the learning rate of the actor network β is ahyperparameter and H is the entropy of the policy

In this study the full-connection layers were used tobuild the network in which Agent1 used the six-layer full-connection network and Agent2 used the four-layer full-connection network (e size of the hidden layer in bothagents was 1024 In the training phase in order to solve thecold start problem of reinforcement learning and acceleratethe convergence of the model the First-Come-First-Servicetactic was applied in the early stage for the allocation ofvirtual machines in the virtual machine cluster and expe-riences were collected from the results to achieve a betterinitial status In the experiences running_steps agent1_batch_size and agent2_batch_size were the control pa-rameters of the training algorithm (e flowchart of thetraining algorithm is shown in Figure 4

In this study floating point operations per second(FLOPS) was used to evaluate the complexity of the pro-posed scheduling algorithm According to the structure ofthe full-connection network and the input data shown inFigure 3 the complexity of the proposed algorithm wasevaluated by

Time sim O Llowast I2 lowastKlowastN1113872 1113873 (11)

where K was the ratio of duration of resource utilization 1113954T tothe duration of task scheduling T defined in the model ofscheduling time in Section 31 N was the maximum number

6 Complexity

of tasks in time slot t I was the number of nodes in thehidden layer and L was the number of hidden layers of thenetwork of Agent1 and Agent2 It hinted that given the A2C-based model above the performance of the proposed al-gorithm was highly influenced by K and N In this study thecomplexity of the proposed algorithm with the modeltrained above was about O(6lowast 220 lowastKlowastN)

5 Experiment Study

51 Dataset Most previous studies used self-generateddatasets in their training work which was not suitable tovalidate the applicability of the algorithm in the real pro-duction environment In order to verify the effectiveness ofthe proposed algorithm in the real production environmentcluster-trace-v2017 the real production dataset published bythe Alibaba Cluster Trace Program was used in the exper-iment study (e data are about the cluster trace from realproduction environment which helps the researchers to getbetter understanding of the characteristics of modern In-ternet data centers (IDCs) and the workloads [4] (e trace

dataset includes the collocation of online services and batchworkloads about 1300 machines in a period of 12 hours (etrace dataset contains six kinds of collections machine_-metacsv machine_usagecsv container_metacsv contain-er_usagecsv batch_instancecsv and batch_taskcsv (etask information used in the experiment was frombatch_instancecsv collection In this collection task id starttime end time and other attributes are provided

52 Experiment Settings Here it was assumed that onevirtual machine could not performmultiple tasks at the sametime (e virtual machine was equipped with i7-8700k CPUgtx2080 graphics card and 32G RAM (e parameters ofscheduling system were set as shown in Table 2

Tand 1113954T were set to the empirical value in scheduling Mwas the number of virtual machines in different cloudingconfigurations for comparison ααprimec and cprime were set to theempirical value in the A2C algorithm μ and η were used tocontrol the average priority of tasks and the proportion ofactive virtual machines (e default value of them was ndash1 If

Begin

Env Agent1 Agent2

Ys1 s2 done = envreset() If i lt episode

i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N

Env return r2 and_s2Batch_append(s2a2r2_s2)

S2 = _s2

Iflen(batch_) gt agent2_

batch_size

Y

Agent2learn(batch_)

done = Envget_done()j = 0

N

a2 = agent2choose_action()Envdo_action2(a2)

If m lt task_number

Y

YIf step lt running

steps N

a1 = 1 a1 = Agent1choose_action(s1)

_s1r1done = envdo_action1(a1)

batchappend(s1a1r1_s1)s1 = _s1

Iflen(batch) gt agent1_

batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End

Figure 4 (e flowchart of model training process

(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip

Figure 3 Advantage actor-critic structure

Complexity 7

the goal was to reduce the priority the ratio of μη wasincreased appropriately If the goal was to control the costand reduce the proportion of idle virtual machines the ratioof μη was decreased (e setting of 1113954μ and 1113954η was the same

53 Model Training With the trace data from the realproduction environment we trained the model with thealgorithm introduced in Section 4 and recorded the lossvalue at each training step (e reward values were repre-sented by the average reward of each episode

Figure 5 shows the trend of the loss and the reward intraining process Figure 5(a) shows loss trend graph inwhich x-axis is the number of training steps and y-axis is thevalue of loss It can be seen from the graph that with theincrease of training steps loss gradually decreases untilconvergence Figure 5(b) is the reward trend graph in whichthe y-axis is reward value and the x-axis is the episode Itshows that with the increase of episode reward graduallyincreases and eventually converges at a higher value Ithinted that the performance of the model trained by thealgorithm was satisfactory

54 Comparison with Traditional Scheduling MethodsWe compared the proposed A2C scheduling algorithm withclassical First-Come-First-Service (FCFS) Shortest-Job-First (SJF) and Fair algorithms in the following twoexperiments

541 Experiment1 (e fixed number of virtual machineswas set to 300 350 400 450 and 500 in the cluster and ranthe different algorithms on the dataset For the proposedalgorithm of this paper only the task scheduling agent(Agent1) worked in experiment1 (e result of experiment 1shown in Figure 6 shows that the average task delay time andthe average task priority of the proposed A2C algorithm areless than those of other algorithms with different size ofclusters (e results implied that the proposed algorithmworked better in task scheduling than others with differentfixed numbers of resources

542 Experiment2 In this experiment Agent 2 workedwith Agent1 to schedule the task with dynamic resourceallocation (e performance of the proposed algorithm wascompared with other algorithms in three dimensions av-erage delay time of tasks tasks distribution in different delaytime levels and task congestion degree

(e initial size of virtual machine cluster (M) was set thesame for FCFS SJF and Fair algorithms and a dynamic size(Mprime) was set for the proposed algorithm In order to ensurethe fair resources supporting for all algorithms the maxi-mum value of Mprime was set up to 11 times M

(1) Comparison of Average Delay Time of Tasks (e ex-periment results on different size of virtual machine clustersare shown in Table 3

It shows that the proposed algorithm automaticallyexpands the cluster size when the cluster scale is smaller than400 When the size is set to 300 the cluster size increases by4 with the task delay decreases by at least 22 comparedwith those of other algorithms When the size is set to 350the cluster size increases by 28 with the task delay de-creases by at least 28 compared with others When thecluster size is at a larger level over 400 the proposed al-gorithm can automatically reduce the cluster size sig-nificantly while the task delay is also considerably smallerthan that of Fair FCFS and SJF algorithms In order toshow the performance of the proposed algorithm clearlywe defined the relative delay time τ in equation (12) Ifτ 1 it means the performance of the compared algo-rithm is as good as the proposed algorithm If τ gt 1 itmeans the performance of the compared algorithm isworse than the proposed algorithm otherwise it meansthe performance of the algorithm is better than theproposed algorithm

τim

tim

tmprimelowast

m

mprime (12)

where i isin (Fair FCFS SJF)

M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)

tim was the average delay time of algorithm i with the clustersize m tm

prime was the average delay time of the proposed al-gorithm with the cluster size Mprime

According to equation (12) the relative delay time wasconverted from the data in Table 2 and is shown in Table 4

(e results are all greater than 1 which indicates that thetask scheduling performance of other compared algorithmswith resource utilization is not as good as the proposedalgorithm

(2) Comparison of Task Distribution in Different DelayTime Levels In this section the tasks in differentdelay time intervals were considered (e unit of thedelay time interval was T the duration of taskscheduling Table 5 shows the statistical result

(e data in Table 5 are about the percentage of taskswhose delay time is less than the delay time interval (estatistical result clearly shows that the proposed algorithmhas higher percentage than other algorithms in all delay timeintervals It means compared with other algorithms the lessthe delay time interval the higher the percentage ofundelayed tasks in this delay time interval Especially in the

Table 2 System parameter setting of scheduling experiment

Parameter Value Parameters ValueT 3 sec M 300sim5001113954T 300 sec α 1endash4μ ndash10 αprime 1endash4η ndash09 c 091113954μ ndash10 cprime 091113954η 10

8 Complexity

1T level the percentage of tasks out of all is almost twice asmuch as that of others

(3) Comparison of Task Congestion Degree Task con-gestion degree reflected the number of tasks waiting forexecution in the task queue at the end of each time slot Itwas measured by the percentage of time slots in which the

number of congested tasks was less than a certain bench-mark number(erefore with a certain benchmark numberthe less the congestion degree was the better the schedulingalgorithm worked (e statistical result is illustrated inTable 6 and Figure 7

It can be seen from the data in first row of Table 6 thatthe proposed algorithm makes all tasks executed in over82 time slots and for other algorithms all tasks are ex-ecuted only in about 30 of all time slots (e change oftask congestion degree of all algorithms with morebenchmark numbers is visualized by the plot chart inFigure 7 which shows that the congestion degree with theproposed algorithm remains relatively stable and betterthan other algorithms with all benchmark numbers ofcongested tasks

Table 3 Average delay time of different algorithms on different size of virtual machine clusters (in seconds)

Cluster size Fair FCFS SJF Proposed algorithm (dynamic cluster size)300 359267 368301 141581 105032 (312)350 85735 87036 63110 286432 (360)400 49019 49312 48311 2284914 (387)450 46944 46953 46936 20014 (438)500 46931 46931 46931 19975 (464)

Table 4 (e relative delay time of the compared algorithms

Cluster size Fair FCFS SJF300 3371 3371 1296350 2954 2954 2142400 2231 2231 2185450 2410 2410 2409500 2531 2531 2532

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)

Figure 6 Results of experiment 1 (a) Comparison of average task delay (b) Comparison of average task priority

800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)

Figure 5 Trend of the loss and the trend of the reward (a) Loss trend (b) Reward trend

Complexity 9

6 Conclusions and Further Study

With the expansion of online business in data center taskscheduling and resource utilization optimization becomemore and more pivotal Large data center operation isfacing the proliferation of uncertain factors which leadsto the geometric increase of environmental complexity(e traditional heuristic algorithms are difficult to copewith todayrsquos complex and constantly changing data centerenvironment However most of the previous studies fo-cused on one aspect of data center optimization and mostof them did not verify their algorithms on the real datacenter dataset

Based on the previous studies this paper designed areasonable deep reinforcement learning model optimized

the task scheduling and resource utilization Experimentswere also performed with the real production dataset tovalidate the performance of the proposed algorithm Av-erage delay time of tasks task distribution in different delaytime levels and task congestion degree as performanceindicators were used to measure the scheduling performanceof all algorithms applied in the experiments (e experimentresults showed that the proposed algorithm worked sig-nificantly better than the compared algorithms in all indexeslisted in experiments ensured the efficient task schedulingand dynamically optimized the resource utilization ofclusters

It should be noted that reinforcement learning is akind of time-consuming work In this study it took12 hours to train the model with the sample data from thereal production environment containing 1300 virtualmachines We also used other production data (cluster-trace-v2018) from the Alibaba Cluster Trace Program totrain the scheduling model (e dataset is about 4000virtual machines in an 8-day period which is much biggerthan the former dataset As the datasetrsquos scale increasedthe training time became considerably longer andunderfit occurred if the number of layers of the machinelearning model was not increased As shown in equation(11) the performance of the proposed algorithm is de-termined by the complexity of the full-connection net-work time slot ratio and the number of tasks in a timeslot Considering the trade-off of the training time costand the performance it is strongly recommended toprepare the sample dataset to a reasonable scale withsampling technology to reduce the complexity of thescheduling model

In addition we did not consider other task attributes andconstraints of data center environment Fox example in thisstudy it was assumed that the number of virtual machineswas not dynamically changed and the ability of all virtual

Table 5 (e task distribution in different delay time intervals

Task delay time interval (T) SJF () FCFS () Fair () (e proposed algorithm ()1 36675 13045 25688 517812 66734 46833 51021 811153 77775 65546 63935 846914 83226 71036 70797 871735 86451 73118 74603 8878310 92610 76438 80914 9295415 94807 78642 83803 9490520 95948 80761 86011 96017

Table 6 Task congestion degree

Number of congested tasks SJF () FCFS () Fair () (e proposed algorithm ()0 33739 31778 31802 822775 65444 62005 61990 8332010 75056 70965 71107 8433515 79332 74847 74999 8532520 81584 76622 76867 8624825 83204 77809 77952 8704430 84366 78685 78643 87835

0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed

Figure 7 Task congestion degree chart

10 Complexity

machines were the same It was also assumed that one virtualmachine cannot perform multiple tasks at the same time(erefore we do not imply that the proposed model isavailable for all kinds of clusters Notwithstanding its lim-itation in the future study the proposed model should beimproved to optimize the task scheduling in the heteroge-neous environments of the data center clusters throughtaking more constraints and attributes of real productionenvironment into account

Data Availability

(e dataset used to support the study is available at theAlibaba Cluster Trace Program (httpsgithubcomalibabaclusterdata)

Conflicts of Interest

(e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

(e research for this article was sponsored under the projectldquoIntelligent Management Technology and Platform of Data-Driven Cloud Data Centerrdquo National Key RampD Program ofChina no 2018YFB1003700

References

[1] K Arulkumaran M P Deisenroth M Brundage andA A Bharath ldquoDeep reinforcement learning a brief surveyrdquoIEEE Signal Processing Magazine vol 34 no 6 pp 26ndash38 2017

[2] Z Wang Z Schaul M Hessel H Hasselt M Lanctot andN Freitas ldquoDueling network architectures for deep rein-forcement learningrdquo in Proceedings of the 33rd InternationalConference on Machine Learning pp 1995ndash2003 New YorkNY USA June 2016

[3] H Mao M Alizadeh I Menache and S Kandula ldquoResourcemanagement with deep reinforcement learningrdquo in Pro-ceedings of the 15th ACM Workshop on Hot Topics in Net-works pp 50ndash56 Atlanta GA USA November 2016

[4] Alibaba Group ldquoAlibaba cluster trace programrdquo 2019 httpsgithubcomalibabaclusterdata

[5] C Delimitrou and C Kozyrakis ldquoQoS-Aware scheduling inheterogeneous datacenters with paragonrdquo ACM Transactionson Computer Systems vol 31 no 4 pp 1ndash34 2013

[6] J Perry A Ousterhout H Balakrishnan and D ShahldquoFastpass a centralized zero-queue datacenter networkrdquoACM SIGCOMM Computer Communication Review vol 44no 4 pp 307ndash318 2014

[7] H W Tseng W C Chang I H Peng and P S Chen ldquoAcross-layer flow schedule with dynamical grouping foravoiding TCP Incast problem in data center networksrdquo inProceedings of the International Conference on Research inAdaptive and Convergent Systems pp 91ndash96 Odense Den-mark October 2016

[8] H Yuan J Bi W Tan and B H Li ldquoCAWSAC cost-awareworkload scheduling and admission control for distributedcloud data centersrdquo IEEE Transactions on Automation Scienceand Engineering vol 13 no 2 pp 976ndash985 2016

[9] H Yuan J Bi W Tan and B H Li ldquoTemporal taskscheduling with constrained service delay for profit max-imization in hybrid cloudsrdquo IEEE Transactions on Auto-mation Science and Engineering vol 14 no 1 pp 337ndash3482017

[10] J Bi H YuanW Tan et al ldquoApplication-aware dynamic fine-grained resource provisioning in a virtualized cloud datacenterrdquo IEEE Transactions on Automation Science and En-gineering vol 14 no 2 pp 1172ndash1183 2017

[11] H Yuan J Bi W Tan M Zhou B H Li and J Li ldquoTTSA aneffective scheduling approach for delay bounded tasks inhybrid cloudsrdquo IEEE Transactions on Cybernetics vol 47no 11 pp 3658ndash3668 2017

[12] H Yuan H Liu J Bi and M Zhou ldquoRevenue and energycost-optimized biobjective task scheduling for green clouddata centersrdquo IEEE Transactions on Automation Science andEngineering vol 17 pp 1ndash14 2020

[13] L Zhang J Bi and H Yuan ldquoWorkload forecasting withhybrid stochastic configuration networks in cloudsrdquo inProceedings of the 2018 5th IEEE International Conference onCloud Computing and Intelligence Systems (CCIS) pp 112ndash116 Nanjing China November 2018

[14] J Bi H Yuan L Zhang and J Zhang ldquoSGW-SCN an in-tegrated machine learning approach for workload forecastingin geo-distributed cloud data centersrdquo Information Sciencesvol 481 pp 57ndash68 2019

[15] J Bi H Yuan and M Zhou ldquoTemporal prediction of mul-tiapplication consolidated workloads in distributed cloudsrdquoIEEE Transactions on Automation Science and Engineeringvol 16 no 4 pp 1763ndash1773 2019

[16] S Luo H Yu Y Zhao S Wang S Yu and L Li ldquoTowardspractical and near-optimal coflow scheduling for data centernetworksrdquo IEEE Transactions on Parallel and DistributedSystems vol 27 no 11 pp 3366ndash3380 2016

[17] R S Sutton and A G Barto Reinforcement Learning AnIntroduction MIT Press Cambridge MA USA 1998

[18] J Yuan X Jiang L Zhong and H Yu ldquoEnergy aware re-source scheduling algorithm for data center using rein-forcement learningrdquo in Proceedings of 2012 Fifth InternationalConference on Intelligent Computation Technology and Au-tomation pp 435ndash438 Hunan China January 2012

[19] X Lin Y Wang and M Pedram ldquoA reinforcement learning-based power management framework for green computingdata centersrdquo in Proceedings of 2016 IEEE InternationalConference on Cloud Engineering (IC2E) pp 135ndash138 BerlinGermany April 2016

[20] Y Li Y Wen D Tao and K Guan ldquoTransforming coolingoptimization for green data center via deep reinforcementlearningrdquo IEEE Transactions on Cybernetics vol 50 no 5pp 2002ndash2013 2020

[21] R Shaw E Howley and E Barrett ldquoAn advanced rein-forcement learning approach for energy-aware virtual ma-chine consolidation in cloud data centersrdquo in Proceedings of2017 12th International Conference for Internet Technologyand Secured Transactions (ICITST) pp 61ndash66 CambridgeUK December 2017

[22] D Basu Q Lin W Chen et al ldquoRegularized cost-modeloblivious database tuning with reinforcement learningrdquoLecture Notes in Computer Science vol 9940 pp 96ndash1322016

[23] Z Peng D Cui J Xiong B Xu YMa andW Lin ldquoCloud jobaccess control scheme based on Gaussian process regressionand reinforcement learningrdquo in Proceedings of 2016 IEEE 4th

Complexity 11

International Conference on Future Internet of ings andCloud (FiCloud) pp 276ndash284 Vienna Austria August 2016

[24] F Ruffy M Przystupa I Beschastnikh and ldquo Iroko ldquoAframework to prototype reinforcement learning for datacenter traffic controlrdquo 2018 httparxivorgabs181209975

[25] S He H Fang M Zhang F Liu X Luan and Z DingldquoOnline policy iterative-based Hinfin optimization algorithmfor a class of nonlinear systemsrdquo Information Sciencesvol 495 pp 1ndash13 2019

[26] S He H FangM Zhang F Liu and Z Ding ldquoAdaptive optimalcontrol for a class of nonlinear systems the online policy iterationapproachrdquo IEEE Transactions on Neural Networks and LearningSystems vol 31 no 2 pp 549ndash558 2020

[27] R S Sutton D McAllester S Singh and Y Mansour ldquoPolicygradient methods for reinforcement learning with functionapproximationrdquo Advances in Neural Information ProcessingSystems vol 12 pp 1057ndash1063 MIT Press Cambridge MAUSA 2000

[28] Y Wu E Mansimov S Liao A Radford and J SchulmanldquoOpenAI baselines ACKTR amp A2Crdquo 2017 httpsopenaicomblogbaselines-acktr-a2c

12 Complexity

In order to solve the above problems a lot of relatedstudies are carried out Most of them focus on specificscheduling scenarios or rely on the acquired details of thecoming tasks in advance In addition most of the previousstudies are single objective optimization-oriented

In recent years the deep reinforcement learning [1 2]which achieves outstanding performance in complex controlfields strongly shows its superiority of decision making incomplex and unknown environments Mao et al tried totranslate the problem of packing tasks with multiple re-source demands into a learning problem (eir work showsthat deep reinforcement learning performs comparably tostate-of-the-art heuristics adapts to different conditionsconverges quickly and learns strategies that are sensible inhindsight [3] Inspired by these research results we believethat deep reinforcement learning is suitable for taskscheduling of data centers in complex production envi-ronments (is paper proposed a method based on deepreinforcement learning to improve task scheduling perfor-mance and resource utilization of data centers With theneural network model trained through deep reinforcementlearning tasks scheduling and resource utilization im-provement were achieved

In this study we used deep reinforcement learning andaimed at two objectives minimizing the average taskcompletion time and improving resource utilization effi-ciency without any prior knowledge of the coming tasks(e experiments and verification were performed usingreal production data from the Alibaba Cluster TraceProgram sponsored by Alibaba Group [4] (e resultsshowed that compared with the traditional heuristic al-gorithms the proposed method achieved betterperformance

(is paper is organized as follows In Section 2 relatedstudies were discussed Reinforcement learning-basedscheduling as the prevailing technology used in taskscheduling of the data center was introduced In Section 3the technical architecture and related definitions of thescheduling system proposed in this paper were illustrated InSection 4 the reinforcement learning algorithm for sched-uling optimization was introduced in detail In Section 5experiments were performed on Alibaba real productiondataset to show the advantage of the proposed schedulingmodel In the last section the conclusion and future workwere discussed

2 Related Studies

For data center task scheduling many studies have beenlaunched in the past decade Heuristic algorithms and re-inforcement learning are popular in this domain

21 Heuristic Algorithm-Based Studies (e traditionalmethods are mainly based on heuristic algorithms Typ-ically Delimitrou and Kozyrakis proposed the ARQ al-gorithm a multiclass admission control protocol thatconstrains application waiting time and limits applicationlatency to achieve QoS [5] (ey evaluated the algorithm

with a wide range of workload scenarios on both small-and large-scale systems and found that it enforces per-formance guarantees for 91 percent of applications whilethe utilization is improved Perry et al proposed a Fastpassalgorithm to improve data center transmission efficiency[6] Fastpass incorporates two fast algorithms the firstdetermines the time at which each packet should betransmitted while the second determines the path used forthat packet (ey deployed and evaluated Fastpass in aportion of Facebookrsquos data center network which showsthat Fastpass achieves better efficiency in transmissionTseng et al argued that previous studies have typicallyfocused on modifying the original TCP or increasingadditional switch hardware costs and rarely focused onthe existing data center network (DCN) environments Sothey proposed a cross-layer flow schedule with a dynamicgrouping (CLFS-DG) algorithm to reduce the effect of TCPincast in DCNs [7] Yuan et al solved the cost optimizationproblem under CDCs (cloud data centers) from two as-pects Firstly a revenue-based workload admission controlmethod is proposed to selectively accept requests (en acost-aware workload scheduling method is proposed toallocate requests among multiple Internet service pro-viders connected to the distributed CDCs Finally intel-ligent scheduling requests are realized which can achievelower cost and higher throughput for CDC providers [8]Yuan et al proposed the profit maximization algorithm(PMA) for the profit maximization challenge in the hybridcloud scenario (e algorithm uses the hybrid heuristicoptimization algorithm to simulate annealing particleswarm optimization (SAPSO) to improve the throughputand profit of private cloud [9] Bi et al proposed a newdynamic hybrid meta heuristic algorithm based on sim-ulated annealing and particle swarm optimization (PSO)for minimizing energy cost and maximizing revenue ofvarious applications running in virtualized cloud datacenters [10] Yuan et al proposed a heuristic timescheduling algorithm (TTSA) to minimize the cost ofprivate cloud data centers in hybrid cloud which caneffectively improve the throughput of private cloud datacenters [11] Yuan et al proposed a biological target dif-ferential evolution algorithm (SBDE) based on simulatedannealing aiming at the challenges of maximizing profitand minimizing the probability of average task loss indistributed green data centers scenario Compared withseveral existing scheduling algorithms SBDE achievesgreater benefits [12]

With the development of data centers reasonableprediction is very important to improve the efficiency oftask scheduling Although prediction is not involved in theexperiment it is necessary to predict the execution timewhen the model is transferred to the actual scene Zhanget al (2018) proposed an integrated forecasting methodequipped with noise filtering and data frequency repre-sentation named SavitzkyndashGolay and wavelet-supportedstochastic configuration networks (SGW-SCNs) [13] Biet al also proposed an integrated forecasting method thatcombines SavitzkyndashGolay filtering and wavelet decompo-sition with stochastic configuration networks to get the

2 Complexity







Complexity 3















to Agent2












4 Complexity





p(i) ei + wi( 1113857

ei

(1)








m1t 1

Nt1113944

Nt

i1p(i) (3)

n1t nactive vm

M (4)


0

0 1 2

2k

n ndash 1 n


T

T T T

T T T T

t

t

hellip hellip

hellip


hellip


Taskn

Scheduler

Task1

Task2

Actor

Critic

Actor

CriticTurn off

Turn on

Do nothing

Execute

Not executeS1t

Task queue

vm1 vmm

Cluster

r1t r2t

Environment Agent

S2t

a1t a2t

a1t

a2tˆ ˆ

ˆ


Complexity 5








m21113954t 1K

1113944

Klowast1113954t


m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t


1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)


r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)








πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t


H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)


nablaθcA st at( 1113857( 1113857

2

(10)






6 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity







Complexity 3















to Agent2












4 Complexity





p(i) ei + wi( 1113857

ei

(1)








m1t 1

Nt1113944

Nt

i1p(i) (3)

n1t nactive vm

M (4)


0

0 1 2

2k

n ndash 1 n


T

T T T

T T T T

t

t

hellip hellip

hellip


hellip


Taskn

Scheduler

Task1

Task2

Actor

Critic

Actor

CriticTurn off

Turn on

Do nothing

Execute

Not executeS1t

Task queue

vm1 vmm

Cluster

r1t r2t

Environment Agent

S2t

a1t a2t

a1t

a2tˆ ˆ

ˆ


Complexity 5








m21113954t 1K

1113944

Klowast1113954t


m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t


1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)


r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)








πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t


H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)


nablaθcA st at( 1113857( 1113857

2

(10)






6 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity















to Agent2












4 Complexity





p(i) ei + wi( 1113857

ei

(1)








m1t 1

Nt1113944

Nt

i1p(i) (3)

n1t nactive vm

M (4)


0

0 1 2

2k

n ndash 1 n


T

T T T

T T T T

t

t

hellip hellip

hellip


hellip


Taskn

Scheduler

Task1

Task2

Actor

Critic

Actor

CriticTurn off

Turn on

Do nothing

Execute

Not executeS1t

Task queue

vm1 vmm

Cluster

r1t r2t

Environment Agent

S2t

a1t a2t

a1t

a2tˆ ˆ

ˆ


Complexity 5








m21113954t 1K

1113944

Klowast1113954t


m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t


1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)


r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)








πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t


H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)


nablaθcA st at( 1113857( 1113857

2

(10)






6 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity





p(i) ei + wi( 1113857

ei

(1)








m1t 1

Nt1113944

Nt

i1p(i) (3)

n1t nactive vm

M (4)


0

0 1 2

2k

n ndash 1 n


T

T T T

T T T T

t

t

hellip hellip

hellip


hellip


Taskn

Scheduler

Task1

Task2

Actor

Critic

Actor

CriticTurn off

Turn on

Do nothing

Execute

Not executeS1t

Task queue

vm1 vmm

Cluster

r1t r2t

Environment Agent

S2t

a1t a2t

a1t

a2tˆ ˆ

ˆ


Complexity 5








m21113954t 1K

1113944

Klowast1113954t


m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t


1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)


r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)








πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t


H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)


nablaθcA st at( 1113857( 1113857

2

(10)






6 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity








m21113954t 1K

1113944

Klowast1113954t


m1i 1113954t 0 1 2 3 K 1113954T

T

n21113954t 1K

1113944

klowast1113954t


1 minus n1i 1113954t 0 1 2 3 K 1113954T

T

(6)


r21113954t 1113954μn21113954t minus 1113954ηm21113954t (7)








πθ st+1 θc( 1113857 (8)

θa⟵θa + α1113944t


H πθ middot st

11138681113868111386811138681113872 11138731113872 1113873

(9)


nablaθcA st at( 1113857( 1113857

2

(10)






6 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity


5 Experiment Study





Begin

Env Agent1 Agent2


i = i + 1 Y If done

N

If j lt k

j = j + 1m = 0

N


S2 = _s2


batch_size

Y

Agent2learn(batch_)


N


If m lt task_number

Y

YIf step lt running

steps N





batch_size

Y

Agent1learn(batch)N

m = m + 1

N

End


(S1a1r1S2)

(S1a1r1S2)

(S1a1r1S2)

Batch

Actornetwork

Criticnetwork

a

v

hellip


Complexity 7










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity










τim

tim

tmprimelowast

m

mprime (12)


M isin (300 350 400 450 500)

Mprime isin (312 360 387 438 464)(13)









8 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity









0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(a)

0

10

20

30

40

300 350 400 450 500

Del

ay ti

me

Vm number

FairFCFS

SJFThe proposed

(b)


800

700

600

500

400

300

200

100

0

Loss

0 10000 20000 30000 40000 50000 60000Training steps

(a)

0 20 40 60 80Episode

ndash090

ndash095

ndash100

ndash105

ndash110

ndash115

Rew

ard

(b)


Complexity 9











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity











0

01

02

03

04

05

06

07

08

09

1

1 31 61 91 121 151 181 211 241 271 301 331

Perc

enta

ge

Task number

SJFFCFS

FairThe proposed


10 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity


Data Availability




Acknowledgments


References
























Complexity 11







12 Complexity







12 Complexity

adeepreinforcementlearningapproachtotheoptimizationof...

Documents