
A Reinforcement Learning Formulation of the Lyapunov Optimization: Application to Edge Computing Systems with Queue Stability

Sohee Bae, Student Member, IEEE, Seungyul Han, Student Member, IEEE, and Youngchul Sung*, Senior Member, IEEE

Abstract—In this paper, a deep reinforcement learning (DRL)-based approach to the Lyapunov optimization is considered to minimize the time-average penalty while maintaining queue stability. A proper construction of state and action spaces is provided to form a proper Markov decision process (MDP) for the Lyapunov optimization. A condition on the reward function of reinforcement learning (RL) for queue stability is derived. Based on the analysis and practical RL with reward discounting, a class of reward functions is proposed for the DRL-based approach to the Lyapunov optimization. The proposed DRL-based approach to the Lyapunov optimization does not require complicated optimization at each time step and operates with general non-convex and discontinuous penalty functions. Hence, it provides an alternative to the conventional drift-plus-penalty (DPP) algorithm for the Lyapunov optimization. The proposed DRL-based approach is applied to resource allocation in edge computing systems with queue stability, and numerical results demonstrate its successful operation.

I. INTRODUCTION

The Lyapunov optimization in queueing networks is a well-known method to minimize a certain operating cost function while stabilizing queues in a network [1], [2]. In order to stabilize the queues while minimizing the time average of the cost, the famous DPP algorithm minimizes the weighted sum of the drift and the penalty at each time step under the Lyapunov optimization framework. The DPP algorithm is widely used to jointly control the network stability and the penalty such as power consumption in the traditional network and communication fields [1]–[9]. The Lyapunov optimization theorem guarantees that the DPP algorithm results in optimality within a certain bound under some conditions. The Lyapunov optimization framework has been applied to many problems. For example, the backpressure routing algorithm can be used for routing in multi-hop queueing networks [10], [11] and the DPP algorithm can be used for joint flow control and network routing [1], [12]. In addition to these classical applications to conventional communication networks, the Lyapunov framework and the DPP algorithm for optimizing performance under queue stability can be applied to many optimization problems

This work was supported in part by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2019-0-00544, Building a simulation environment for machine learning-based edge computing operation optimization), and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2017R1E1A1A03070788). The authors are with the School of Electrical Engineering, KAIST, Daejeon, South Korea, 34141. Email: {sh.bae, sy.han, ycsung}@kaist.ac.kr. *Corresponding author.

in emerging systems with energy harvesting and renewable energy, such as smart grids and electric vehicles, in which virtual queue techniques can be used to represent the energy level as a queue [13]–[19].

Despite its versatility for the Lyapunov optimization, the DPP algorithm is an instantaneous greedy algorithm and requires solving a non-trivial optimization problem at every time step. Solving this optimization for the DPP algorithm is not easy in the case of complicated penalty functions. With the recent advances in DRL [20], RL has gained renewed interest in applications to many control problems for which classical RL not based on deep learning was not so effective [21]–[24]. In this paper, we consider a DRL-based approach to the Lyapunov optimization in order to provide an alternative to the DPP algorithm for time-average penalty minimization under queue stability. Basic RL is an MDP composed of a state space, an action space, a state transition probability and a policy. The goal of RL is to learn a policy that maximizes the accumulated expected return [25]. The problem of time-average penalty minimization under queue stability can be formulated into an RL problem based on the queue dynamics and an additive cost function. The advantage of an RL-based approach is that RL exploits the trajectory of system evolution and does not require any optimization in the execution phase once the control policy is trained. Furthermore, an RL-based approach can be applied to the case of complicated penalty functions with which numerical optimization at each time step for the DPP algorithm may be difficult. Even with such advantages of an RL-based approach, an RL formulation for the Lyapunov optimization is not straightforward because of the condition of queue stability. The main challenge in an RL formulation of the Lyapunov optimization is how to incorporate the queue stability constraint into the RL formulation. Since the goal of RL is to maximize the expected accumulated reward, the desired control behavior is induced through the reward, and the success and effectiveness of the devised RL-based approach crucially depends on a well-designed reward function as well as a good formulation of the state and action spaces.

A. Contributions and Organization

The contributions of this paper are as follows:

• We propose a proper MDP structure for the Lyapunov optimization by constructing the state and action spaces so that the formulation yields an MDP with a deterministic reward function, which facilitates learning.



• We propose a class of reward functions yielding queue stability as well as penalty minimization. The derived reward function is based on the relationship between the queue stability condition and the expected accumulated reward, which is the maximization goal of RL. The proposed reward function is in a one-step difference form suited to practical RL, which actually maximizes the discounted sum of rewards.

• Using the Soft Actor-Critic (SAC) algorithm [26], we demonstrate that the DRL-based approach based on the constructed state and action spaces and the proposed reward function properly learns a policy that minimizes the penalty cost while maintaining queue stability.

• Considering the importance of edge computing systems in the trend of network-centric computing [27]–[33], we applied the proposed DRL-based approach to the problem of resource allocation in edge computing systems under queue stability, whereas many previous works investigated resource allocation in edge computing systems from different perspectives not involving queue stability at the edge server. The proposed approach provides a policy for optimal task offloading and local computation at the edge node under task queue stability.

This paper is organized as follows. In Section II, the system model is provided. In Section III, the problem is formulated and the conventional approach is explained. In Section IV, the proposed DRL-based approach to the Lyapunov optimization is explained. Implementation and experiments are provided in Sections V and VI, respectively, followed by the conclusion in Section VII.

II. SYSTEM MODEL

In this paper, as an example of queuing network control, we consider an edge computing system composed of an edge computing node, a cloud computing node and multiple mobile user nodes, and consider the resource allocation problem at the edge computing node equipped with multiple queues. We will simply refer to the edge computing node and the cloud computing node as the edge node and the cloud node, respectively. We assume that there exist $N$ application types in the system, and multiple mobile nodes generate applications belonging to the $N$ application types and offload a certain amount of tasks to the edge node. The edge node has $N$ task data queues, one for each of the $N$ application types, and stores the incoming tasks offloaded from the multiple mobile nodes according to their application types. Then, the edge node performs the tasks offloaded from the mobile nodes by itself or further offloads a certain amount of tasks to the cloud node through a communication link established between the edge node and the cloud node. We assume that the maximum CPU processing clock rate of the edge node is $f_E$ cycles per second and the communication bandwidth between the edge node and the cloud node is $B$ bits per second. The considered overall system model is described in Fig. 1.

A. Queue Dynamics

We assume that the arrival of the $i$-th application-type tasks at the $i$-th queue in the edge node follows a Poisson random process with arrival rate $\lambda_i$ [arrivals per second], $i = 1, \cdots, N$, and assume that the processing at the edge node is time-slotted with discrete-time slot index $t = 0, 1, 2, \cdots$. Let the task data bits offloaded from the mobile nodes to the $i$-th application-type data queue (or simply the $i$-th queue) at time¹ $t$ be denoted by $a_i(t)$, i.e., $a_i(t)$ is the sum of the arrived task data bits during the time interval $[t, t+1)$. We assume that different application types have different workloads, i.e., processing one bit of task data requires a different number of CPU cycles for each application type, and assume that one task bit of the $i$-th application type requires $w_i$ CPU clock cycles for processing at the edge node.

Fig. 1. The edge-cloud system model with application queues

For each application-type task data in the corresponding queue, at time $t$ the edge node determines the amount of allocated CPU processing resource for its own processing and the amount of offloading to the cloud node while satisfying the imposed constraints. Let $\alpha_i(t)$ be the fraction of the edge-node CPU resource allocated to the $i$-th queue at the edge node at time $t$ and let $\beta_i(t)$ be the fraction of the edge-cloud communication bandwidth for offloading to the cloud node for the $i$-th queue at time $t$. Then, at the edge node, $\alpha_i(t) f_E$ clock cycles per second are assigned to the $i$-th application-type task and $\beta_i(t) B$ bits per second for the $i$-th application-type task are aimed to be offloaded to the cloud node at time $t$. Thus, the control action variables at the edge node are given by

$$\boldsymbol{\alpha}(t) = [\alpha_1(t), \cdots, \alpha_N(t)], \qquad (1)$$
$$\boldsymbol{\beta}(t) = [\beta_1(t), \cdots, \beta_N(t)], \qquad (2)$$

where $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ satisfy the following constraints:

$$\sum_{i=1}^{N} \alpha_i(t) \le 1, \quad \sum_{i=1}^{N} \beta_i(t) \le 1 \quad \text{for all } t. \qquad (3)$$

Our goal is to optimize the control variables $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ under a certain criterion (which will be explained later) while satisfying the constraint (3).

¹Time is normalized so that one slot interval is one second for simplicity.


The queue dynamics at the edge node is then given by

$$q_i(t+1) = \Bigg[ q_i(t) + a_i(t) - \underbrace{\bigg( \frac{\alpha_i(t) f_E}{w_i} + \beta_i(t) B \bigg)}_{=: b_i(t)} \Bigg]^+ \qquad (4)$$

where $q_i(t)$ represents the length of the $i$-th queue at the edge node at time $t$ for $i = 1, \cdots, N$ and $[x]^+ = \max(0, x)$. The second, third, and fourth terms in the right-hand side (RHS) of (4) represent the new arrival, the reduction by edge-node processing, and the offloading to the cloud node for the $i$-th queue at time $t$, respectively. Here, the departure $b_i(t)$ at time $t$ is defined as $b_i(t) := \frac{\alpha_i(t) f_E}{w_i} + \beta_i(t) B$. Note that by defining $b_i(t)$, (4) reduces to a typical multi-queue dynamics model in queuing theory [2]. However, in the considered edge-cloud system the departure occurs by two separate operations, computing and offloading, associated with $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$, and this makes the situation more complicated. The amount of actually offloaded task data bits from the edge node to the cloud node for the $i$-th queue at time $t$ is given by

$$o_i(t) = \min\bigg( \beta_i(t) B,\; q_i(t) + a_i(t) - \frac{\alpha_i(t) f_E}{w_i} \bigg) \qquad (5)$$

because the remaining amount of task data bits at the $i$-th queue at the edge node can be less than the offloading target bits $\beta_i(t) B$.
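To make the queue model concrete, the following minimal Python sketch implements the per-slot update (4) and the actually-offloaded bits (5). The function and variable names are illustrative and not from the authors' code, and the extra clipping at zero in the offloading computation is an assumption added for numerical safety.

```python
import numpy as np

def queue_step(q, a, alpha, beta, w, f_E, B):
    """One-slot update of the edge queues following (4) and (5).

    q, a, alpha, beta, w are length-N arrays: backlogs q_i(t), new arrivals
    a_i(t), CPU fractions alpha_i(t), bandwidth fractions beta_i(t), and
    per-bit workloads w_i. f_E is the total edge CPU rate [cycles/s] and
    B the edge-cloud bandwidth [bits/s].
    """
    served = alpha * f_E / w                       # bits removable by edge processing
    # Actually offloaded bits o_i(t) as in (5), additionally clipped at zero.
    offloaded = np.minimum(beta * B, np.maximum(q + a - served, 0.0))
    # Queue update (4) with departure b_i(t) = alpha_i f_E / w_i + beta_i B
    # and the [.]^+ operation.
    q_next = np.maximum(q + a - (served + beta * B), 0.0)
    return q_next, offloaded
```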

For proper system operation, we require the considered edge computing system to be stable. Among several definitions of queuing network stability [2], we adopt the following definition of stability:

Definition 1 (Strong stability [2]). A queue is strongly stable if

$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[q(\tau)] < \infty, \qquad (6)$$

where $q(t)$ is the length of the queue at time $t$.

We consider that the edge computing system is stable if all the queues $q_i(t)$, $i = 1, \cdots, N$, in the edge node are stable according to Definition 1.

B. Power Consumption Model and Cost Function

In order to model realistic computing environments, we assume multi-core computing at the edge node, and assume that the edge node has $N_E$ CPU cores with equal computing capability. We also assume that the Dynamic Voltage Frequency Scaling (DVFS) method is adopted at the edge node. DVFS is a widely-used technique for power consumption reduction, e.g., SpeedStep of Intel and PowerNow of AMD. DVFS adjusts the CPU clock frequency and supply voltage based on the required CPU cycles per second to perform a given task in order to reduce power consumption [34]. That is, for a computationally easy task, the CPU clock frequency is lowered. On the other hand, for a computationally demanding task, the CPU clock frequency is raised. Under our assumption that the edge CPU has a maximum clock frequency of $f_E$ cycles per second and the edge CPU has $N_E$ CPU cores, each edge CPU core has a maximum clock rate of $f_E / N_E$. Note from (1) and (3) that the total assigned computing requirement for the edge node at time $t$ is $f_E \cdot \sum_{i=1}^{N} \alpha_i(t)$. This total computing load is distributed to the $N_E$ CPU cores according to a multi-core workload distribution method. Hence, the assigned workload for the $j$-th core of the edge node is given by

$$f_{E,j} = g\left(f_E, N_E, \alpha_1(t), \cdots, \alpha_N(t)\right), \quad j = 1, \cdots, N_E, \qquad (7)$$

where $g(\cdot)$ is the multi-core workload distribution function of the edge CPU and depends on the individual design.

The power consumption at a CPU core consists mainly of two parts: the dynamic part and the static part [35]. We focus on the dynamic power consumption, which is dominant as compared to the static part [36]. It is known that the dynamic power consumption is modeled as a cubic function of the clock frequency, whereas the static part is modeled as a linear function of the clock frequency [37], [38]. With the focus on the dynamic part, the power consumption at a CPU core can be modeled as [39], [40]

$$P_D = \kappa f^3, \qquad (8)$$

where $f$ is the CPU core clock rate and $\kappa$ is a constant depending on the CPU implementation. Then, the overall power consumption $C_E(t)$ at the edge node can be modeled as

$$C_E(t) = \sum_{j=1}^{N_E} C_{E,j}(t) \qquad (9)$$

where $C_{E,j}$ denotes the power consumption at the $j$-th CPU core at the edge node and is given by (8) with the core operating clock rate $f$ substituted by (7). Note that for given $f_E$ and $N_E$, the power consumption $C_E(t)$ at time $t$ is a function of the control vector $\boldsymbol{\alpha}(t)$, explicitly shown as

$$C_E(t) = C_E(\alpha_1(t), \cdots, \alpha_N(t)) \qquad (10)$$

based on (7), (8) and (9). While $C_E(t)$ is the cost function measured in terms of the required power consumption for the edge node caused by its own processing, we assume that the cloud node charges a cost $C_C(t)$ to the edge node based on the amount of workload required to process the task bits offloaded to the cloud node, $\sum_{i=1}^{N} w_i o_i(t)$, where $o_i(t)$ is given by (5). Since $o_i(t)$ depends on $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ as seen in (5), $C_C(t)$ as a function of the control variables is expressed as

$$C_C(t) = C_C(\alpha_1(t), \cdots, \alpha_N(t), \beta_1(t), \cdots, \beta_N(t)). \qquad (11)$$

We assume that $C_C(t)$ is given in the unit of Watt under the assumption that power and monetary cost are interchangeable. We will use $C_E(t)$ and $C_C(t)$ as the penalty cost in later sections. Table I summarizes the introduced notations.
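As an illustration of how the penalty cost can be evaluated from the control variables, the sketch below assumes an even workload split for $g(\cdot)$ in (7) and applies the cubic power law (8) at both the edge and the cloud; the cloud charging rule mirrors the continuous cost used later in Section VI and is only one possible choice. All names (`edge_cost`, `cloud_cost`, `KAPPA`) are illustrative assumptions.

```python
import numpy as np

KAPPA = 1.0 / (400e9 ** 3)   # illustrative kappa, roughly 1/(400 GHz)^3

def edge_cost(alpha, f_E, N_E, kappa=KAPPA):
    """C_E(t) from (8)-(10), assuming an even workload split for g(.) in (7):
    f_{E,j} = f_E * sum_i alpha_i(t) / N_E for every core j."""
    f_core = f_E * np.sum(alpha) / N_E
    return N_E * kappa * f_core ** 3

def cloud_cost(offloaded_bits, w, N_C, kappa=KAPPA):
    """An illustrative C_C(t): the offloaded workload sum_i w_i o_i(t) is spread
    evenly over N_C cloud cores and charged with the same cubic law (8)."""
    load = np.sum(w * offloaded_bits)   # required cloud cycles per second
    return N_C * kappa * (load / N_C) ** 3
```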

III. PROBLEM STATEMENT AND CONVENTIONAL APPROACH

In this section, based on the derivation in Section II, we formulate the problem of optimal resource allocation at the edge node under queue stability. Since the arrival process is


TABLE I
SUMMARY OF NOTATIONS

Name            Stands for                                                                                                      Unit
$N$             Number of application types                                                                                     -
$f_E$           Maximum CPU clock rate of the edge node                                                                         cycles/s
$N_E$           Number of CPU cores at the edge node                                                                            -
$B$             Communication bandwidth from the edge node to the cloud node                                                    bits/s
$\lambda_i$     Poisson arrival rate of the $i$-th app. type at the edge node                                                   arrivals/s
$f_{E,j}$       Assigned workload for the $j$-th CPU core at the edge node                                                      cycles/s
$a_i(t)$        Arrival task bits of the $i$-th application type at time $t$                                                    bits
$b_i(t)$        Departure task bits of the $i$-th application type at time $t$                                                  bits
$o_i(t)$        Offloaded task bits of the $i$-th application type at time $t$                                                  bits
$w_i$           Workload for the $i$-th application type                                                                        cycles/bit
$q_i(t)$        Queue length at time $t$ for the $i$-th queue at the edge node                                                  bits
$\alpha_i(t)$   Edge CPU resource allocation factor for the $i$-th queue at time $t$                                            -
$\beta_i(t)$    Communication bandwidth allocation factor for the $i$-th queue from the edge node to the cloud node at time $t$ -
$C_E(t)$        Cost for computing at the edge node at time $t$                                                                 Watt
$C_{E,j}(t)$    Cost for computing at the $j$-th CPU core at the edge node at time $t$                                          Watt
$C_C(t)$        Cost for offloading to the cloud node at time $t$                                                               Watt

random, the optimization cost is random. Hence, we consider the minimization of the time-averaged expected cost while maintaining queue stability for stable system operation. The considered optimization problem is formulated as follows:

Problem 1.

$$\min_{\boldsymbol{\alpha}(t), \boldsymbol{\beta}(t)} \ \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[C_E(t) + C_C(t)] \qquad (12)$$

$$\text{s.t.} \quad \limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[q_i(\tau)] < \infty \ \text{ for all } i \qquad (13)$$

$$\sum_{i=1}^{N} \alpha_i(t) \le 1 \ \text{ and } \ \sum_{i=1}^{N} \beta_i(t) \le 1 \ \text{ for all } t, \qquad (14)$$

where $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ are defined in (1) and (2), respectively, $C_E(t)$ and $C_C(t)$ are the cost functions defined in (10) and (11), respectively, and $q_i(t)$ is the length of the $i$-th queue at time $t$.

Note that (13) implies that we require all the queues in the system to be strongly stable as a constraint for optimization. It is not easy to solve Problem 1 directly. A conventional approach to Problem 1 is based on the Lyapunov optimization [2]. The Lyapunov optimization defines the quadratic Lyapunov function and the Lyapunov drift as follows [2]:

$$L(t) = \frac{1}{2} \sum_{i=1}^{N} q_i(t)^2 \qquad (15)$$

$$\Delta L(t) = L(t+1) - L(t). \qquad (16)$$

It is known that Problem 1 is feasible when the average service rate (i.e., average departure rate) is strictly larger than the average arrival rate [2], i.e., $\lambda_i \mu_i - \frac{\alpha_i f_E}{w_i} - \beta_i B < -\epsilon$ for some $\epsilon > 0$ and some constants $\alpha_i$ and $\beta_i$, $i = 1, \cdots, N$. Here, $\mu_i$ is the average task packet size at each arrival at the $i$-th application queue. When stable control is feasible, we want to determine the instantaneous service rates $\alpha_i(t)$ and $\beta_i(t)$ at each time for cost-efficient stable control of the system. A widely-considered conventional method to determine the instantaneous service rates for Problem 1 is the DPP algorithm.

Algorithm 1 Basic Drift-Plus-Penalty Algorithm [2]
1: Initialization: Set $t = 1$.
2: Repeat:
   1) Observe $\mathbf{a}(t) = [a_1(t), \cdots, a_N(t)]$ and $\mathbf{q}(t) = [q_1(t), \cdots, q_N(t)]$.
   2) Choose actions $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ to minimize (18).
   3) Update $\mathbf{q}(t)$ according to (4) and $t \leftarrow t + 1$.

The DPP algorithm minimizes the drift-plus-penalty instead of the penalty (i.e., cost) alone, given by [2]

$$\Delta L(t) + V [C_E(t) + C_C(t)] \qquad (17)$$

for a positive weighting factor $V$ which determines the trade-off between the drift and the penalty. In (17), the original queue stability constraint is absorbed as the drift term in an implicit manner. The DPP in our case is expressed as

$$\Delta L(t) + V [C_E(t) + C_C(t)] \le \frac{1}{2} \sum_{i=1}^{N} \bigg( a_i(t) - \frac{\alpha_i(t) f_E}{w_i} - \beta_i(t) B \bigg)^2 + \sum_{i=1}^{N} q_i(t) \bigg( a_i(t) - \frac{\alpha_i(t) f_E}{w_i} - \beta_i(t) B \bigg) + V \left[ C_E(\boldsymbol{\alpha}(t)) + C_C(\boldsymbol{\alpha}(t), \boldsymbol{\beta}(t)) \right], \qquad (18)$$

where the inequality is due to ignoring the operation $[x]^+$ in (4). The basic DPP algorithm minimizes the DPP expression (18) in a greedy manner, which is summarized in Algorithm 1 [2]. It is known that under some conditions this simple greedy DPP algorithm yields a solution that satisfies strong stability for all queues and its resultant time-averaged penalty is within some constant bound from the optimal value of Problem 1 [2]. Note that there exist two terms generated from $\Delta L(t)$ in the RHS of (18): one is the square of the difference between the arrival and departure rates and the other is the product of the queue length and the difference between the arrival and departure rates. In many cases, the quadratic term in the RHS of (18) is replaced by a constant upper bound based on certain assumptions on the arrival and departure rates $a_i(t)$ and $b_i(t)$, and only the second term $q_i(t)(a_i(t) - b_i(t))$ is considered [2].
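For concreteness, a minimal sketch of the per-step minimization in Algorithm 1 is given below, assuming the common simplification in which the quadratic term of (18) is dropped. The penalty functions are passed in as callables, and the use of SciPy's SLSQP solver is an illustrative choice rather than the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def dpp_step(q, a, w, f_E, B, V, edge_cost, cloud_cost):
    """One greedy DPP step (Algorithm 1, step 2): choose alpha(t), beta(t)
    minimizing sum_i q_i (a_i - alpha_i f_E / w_i - beta_i B) + V [C_E + C_C],
    i.e., (18) with the quadratic term dropped. edge_cost(alpha) and
    cloud_cost(alpha, beta) are user-supplied penalty functions."""
    N = len(q)

    def objective(x):
        alpha, beta = x[:N], x[N:]
        drift = np.sum(q * (a - alpha * f_E / w - beta * B))
        return drift + V * (edge_cost(alpha) + cloud_cost(alpha, beta))

    constraints = [{"type": "ineq", "fun": lambda x: 1.0 - np.sum(x[:N])},
                   {"type": "ineq", "fun": lambda x: 1.0 - np.sum(x[N:])}]
    x0 = np.full(2 * N, 1.0 / (2 * N))             # feasible starting point
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * (2 * N), constraints=constraints)
    return res.x[:N], res.x[N:]
```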

IV. THE PROPOSED REINFORCEMENT LEARNING-BASED APPROACH

Although Problem 1 can be approached by the conventional DPP algorithm, the DPP algorithm has several disadvantages: it is an instantaneous greedy optimization and requires solving an optimization problem at each time step, and numerical optimization with a complicated penalty function can be difficult. As an alternative, in this section, we consider a DRL-based approach to Problem 1, which exploits the trajectory of system evolution and does not require any optimization in the execution phase once the control policy is trained.

A basic RL problem is an MDP composed of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition probability $P$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a policy $\pi: \mathcal{S} \to \mathcal{A}$ [25]. At time $t$, the agent which has policy $\pi$ observes a state $s_t \in \mathcal{S}$ of the environment and performs an action $a_t \in \mathcal{A}$ according to the policy, i.e., $a_t \sim \pi(a_t | s_t)$. Then, a reward $r_t$, depending on the current state $s_t$ and the agent's action $a_t$, is given to the agent according to the reward function $r_t = r(s_t, a_t)$, and the state of the environment changes to a next state $s_{t+1}$ according to the state transition probability, i.e., $s_{t+1} \sim P(s_{t+1} | s_t, a_t)$. The goal is to learn a policy to maximize the accumulated expected return.

Problem 1 can be formulated into an RL problem based on the queue dynamics (4) and the additive cost function (12), in which the agent is the resource allocator of the edge node and tries to learn an optimal policy for resource allocation at the edge node while stabilizing the queues. Since we do not assume knowledge of the state transition probability $P$, our approach belongs to model-free RL [25]. The main challenge in the RL formulation of Problem 1 is how to enforce the queue stability constraint (13) in the RL formulation, and the success and effectiveness of the devised RL-based approach depends critically on a well-designed reward function as well as a good formulation of the state and action spaces.

A. State and Action Spaces

In order to define the state and action spaces, we clarify the operation in the time domain. Fig. 2 describes our timing diagram for RL operation. For causality under our definition of the state and the action, we assume a one-time-step delay for the overall operation. Recall that time is normalized in this paper. Hence, the discrete time index $t$ used in the queue dynamics in Section II-A can also mean the continuous time instant $t$. As seen in Fig. 2, considering the actual timing relationship, we define the quantities as follows. $q_i(t)$ is the length of the $i$-th queue at the continuous time instant $t$, $a_i(t)$ is the sum of the arrived task bits over the continuous time interval $[t, t+1)$, and $b_i(t)$ is the serviced task bits during the continuous time interval $[t+1, t+2)$ based on the observation of $q_i(t) + a_i(t)$. The one-step-delayed service $b_i(t)$ is incorporated in computing the queue length $q_i(t+1)$ at the time instant $t+1$ due to the assumption of one-step-delayed operation for causality. Then, the state variables that we determine at the RL discrete time index $t$ for the considered edge system are as follows.

Fig. 2. Timing diagram

1) The main state variable is the queue length including the arrival $a_i(t)$: $q_i(t) + a_i(t)$, $i = 1, \cdots, N$.
2) Additionally, we include the queue length at time $t$ before the arrival $a_i(t)$, i.e., $q_i(t)$, or the arrival $a_i(t)$ itself in the set of state variables.
3) The workload for each application type: $w_i$, $i = 1, \cdots, N$.
4) The actual CPU use factor² for the $i$-th queue at time $t-1$, $i = 1, \cdots, N$.
5) The required CPU cycles at the cloud node for the offloaded tasks at time $t$: $\sum_{i=1}^{N} w_i o_i(t)$.
6) The time average of $a_i(t)$ over the most recent 100 time slots: $\frac{1}{100} \sum_{\tau=t-99}^{t} a_i(\tau)$, $i = 1, \cdots, N$.

The action variables of the policy are $\alpha_i(t)$ and $\beta_i(t)$, $i = 1, 2, \cdots, N$, and the constraints on the actions are $\sum_{i=1}^{N} \alpha_i(t) \le 1$ and $\sum_{i=1}^{N} \beta_i(t) \le 1$.

²The nominal use of the edge-node CPU for the $i$-th queue at time $t$ is $\alpha_i(t) f_E$. However, when the amount of task data bits in the $i$-th queue is less than the value $\alpha_i(t) f_E / w_i$, the actual CPU use for the $i$-th queue realized by the action at time $t$ is less than $\alpha_i(t) f_E$. In the queue dynamics (4), this effect is handled by the operation $[x]^+ = \max(0, x)$.

Remark 1. We can view the last moment of the continuous time interval $[t, t+1)$ as the reference time for the RL discrete time index $t$. Note that $q_i(t) + a_i(t)$ is set as the main state variable at RL discrete time index $t$, and the service (i.e., action) $b_i(t)$ is based on $q_i(t) + a_i(t)$. In this case, from the RL update viewpoint, the transition is in the form of

Current state: $s_t = \{q_i(t) + a_i(t), \cdots\}$
Current action: $a_t = (\boldsymbol{\alpha}(t), \boldsymbol{\beta}(t))$ $\Rightarrow$ determines $b_i(t)$
Current reward: $r_t(s_t, a_t)$ as a function of $(s_t, a_t)$
Queue update: $q_i(t+1) = [q_i(t) + a_i(t) - b_i(t)]^+$
Next state: $s_{t+1} = \{q_i(t+1) + a_i(t+1), \cdots\}. \qquad (19)$

This definition of the state and timing is crucial for proper RL training and operation. The reason why we define the reference time and the state in this way will be explained in Section IV-B.
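A minimal gym-style sketch of this transition structure is given below. It uses a reduced state (the full state of Section IV-A also carries the per-queue CPU use factor, the cloud-side cycles, and a 100-slot arrival average), a crude Poisson-bit arrival model, and a placeholder reward; all class, method and parameter names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class EdgeQueueEnv:
    """Minimal sketch of the MDP timing in (19): the state carries
    q_i(t) + a_i(t), the action (alpha, beta) fixes the departure b_i(t),
    the reward depends only on (state, action), and the next random arrival
    enters only through the state transition."""

    def __init__(self, lam_bits, w, f_E, B, rho=1e-9, V=10.0, seed=0):
        self.lam_bits = np.asarray(lam_bits, dtype=float)
        self.w = np.asarray(w, dtype=float)
        self.f_E, self.B, self.rho, self.V = f_E, B, rho, V
        self.rng = np.random.default_rng(seed)
        self.N = len(self.w)

    def reset(self):
        self.q = np.zeros(self.N)                    # zero initial backlog
        self.a = self.rng.poisson(self.lam_bits)     # crude stand-in for arrivals
        return self._state()

    def _state(self):
        # reduced state: backlog-plus-arrival, arrival, and workloads
        return np.concatenate([self.q + self.a, self.a, self.w])

    def step(self, alpha, beta, cost_fn):
        b = alpha * self.f_E / self.w + beta * self.B        # departure b_i(t)
        q_next = np.maximum(self.q + self.a - b, 0.0)        # queue update (4)
        # placeholder reward: negative weighted backlog plus weighted penalty;
        # the actual reward design is the topic of Section IV-B
        reward = -self.rho * np.sum(q_next) - self.V * cost_fn(alpha, beta)
        self.q, self.a = q_next, self.rng.poisson(self.lam_bits)
        return self._state(), reward
```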

B. Reward Design for Queue Stability and Penalty Minimization

Suppose that we simply use the negative of the DPP expression as the reward function for RL, i.e.,

$$r_t^{\text{guess}} = -\left[ \Delta L(t) + V (C_E(t) + C_C(t)) \right], \qquad (20)$$

where the drift term $\Delta L(t)$ is given by

$$\Delta L(t) = \frac{1}{2} \sum_{i=1}^{N} \left[ q_i(t+1)^2 - q_i(t)^2 \right] \qquad (21)$$

and try to maximize the accumulated sum $\sum_t r_t^{\text{guess}}$ by RL. Then, does RL with this reward function lead to the behavior that we want? In the following, we derive a proper reward function for RL to solve Problem 1 and answer the above question in a progressive manner, which is the main contribution of this paper.

The main point for an RL-based approach to Problem 1 to yield queue stability is to exploit the fact that the goal of RL is to maximize the accumulated reward (not to perform a greedy optimization at each time step), and hence achieving the intended behavior is through a well-designed reward function. In order to design such a reward function for RL to yield queue stability, we start with the following result:

Theorem 1. Suppose that $q_i(0) = 0, \forall i$ (we will assume initial zero backlog for all queues in the rest of this paper) and the reward $r_t$ at time $t$ for RL satisfies the following condition:

$$r_t \le U - \eta \sum_{i=1}^{N} q_i(t+1) \qquad (22)$$

for some finite constant $U$ and finite positive $\eta$. Then, RL trained with such $r_t$ tries to strongly stabilize the queues. Furthermore, if the following condition is satisfied in addition to (22),

$$r_{\min} \le r_t, \quad \forall t \qquad (23)$$

for some finite $r_{\min}$, then the resulting queues by RL with such $r_t$ are strongly stable.

Proof. Taking expectation on both sides of (22), we have

$$\mathbb{E}[r_\tau] \le U - \eta \sum_{i=1}^{N} \mathbb{E}[q_i(\tau+1)]. \qquad (24)$$

Summing (24) over $\tau = 0, 1, \cdots, t-1$, we have

$$\sum_{\tau=0}^{t-1} \mathbb{E}[r_\tau] \le Ut - \eta \sum_{\tau=0}^{t-1} \sum_{i=1}^{N} \mathbb{E}[q_i(\tau+1)]. \qquad (25)$$

Rearranging (25) and dividing by $\eta t$, we have

$$\frac{t+1}{t} \cdot \frac{1}{t+1} \sum_{\tau=0}^{t} \sum_{i=1}^{N} \mathbb{E}[q_i(\tau)] \le \frac{U}{\eta} - \frac{1}{\eta} \cdot \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[r_\tau], \qquad (26)$$

where $\frac{t+1}{t} \to 1$ as $t$ increases, and we used $q_i(0) = 0$.

Note that the left-hand side (LHS) of (26) is the time average of the expected sum queue length. Since the goal of RL is to maximize the accumulated expected reward $\sum_{\tau=0}^{t-1} \mathbb{E}[r_\tau]$, RL with $r_t$ satisfying (22) tries to stabilize the queues by making the average queue length small. That is, for the same $t$, when $\sum_{\tau=0}^{t-1} \mathbb{E}[r_\tau]$ is larger, the average queue length becomes smaller.

Furthermore, if the condition (23) is satisfied in addition, the second term in the RHS of (26) is upper bounded as $-\frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[r_\tau] \le -r_{\min}$. Hence, from (26), we have

$$\frac{t+1}{t} \cdot \frac{1}{t+1} \sum_{\tau=0}^{t} \sum_{i=1}^{N} \mathbb{E}[q_i(\tau)] \le \frac{U}{\eta} - \frac{r_{\min}}{\eta} = \frac{1}{\eta}(U - r_{\min}).$$

If the sum of the queue lengths is bounded, the length of each queue is bounded. Therefore, in this case, the queues are strongly stable by Definition 1. (Note that $r_{\min} < U$ from (22).)

Note that from the definition of strong stability in Definition 1 and the fact that RL tries to maximize the accumulated expected reward, the condition (22) and the resultant (26) can be considered as a natural starting point for reward function design.

With the guidance of Theorem 1, we design a reward function for RL to learn a policy that simultaneously decreases the average queue length and the penalty, i.e., the resource cost. For this, we set the RL reward function as the sum of two terms: $r_t = r_t^Q + r_t^P$, where $r_t^Q$ is the queue-stabilizing part and $r_t^P$ is the penalty part given by $r_t^P = -V[C_E(t) + C_C(t)]$ with a weighting factor $V$. Based on Theorem 1, we consider the following class of functions as a candidate for the queue-stabilizing part $r_t^Q$:

$$r_t^Q = -\rho \sum_{i=1}^{N} [q_i(t+1)]^\nu, \quad \nu \ge 1 \qquad (27)$$

with some positive constant $\rho$. Then, the total reward at time $t$ for RL is given by

$$r_t = r_t^Q + r_t^P = -\rho \sum_{i=1}^{N} [q_i(t+1)]^\nu - V [C_E(t) + C_C(t)]. \qquad (28)$$

The property of the reward function (28) is provided in the following theorem:

Theorem 2. The queue-stabilizing part $r_t^Q$ of the reward function (28) makes RL with the reward (28) try to strongly stabilize the queues.

Proof. Note from Section II-B that $C_E(t) = \sum_{j=1}^{N_E} C_{E,j}(t)$ with $C_{E,j} = \kappa f_{E,j}^3$ and $C_C(t) = C_C\big(\sum_{i=1}^{N} w_i o_i(t)\big)$, where the $j$-th edge CPU core clock frequency $f_{E,j}$ and the offloading $o_i(t)$ from the $i$-th queue to the cloud node are given by (7) and (5), respectively. We have $C_E(t) \ge 0$ and $C_C(t) \ge 0$ by design. Furthermore, we have

$$f_{E,j} \le \frac{f_E}{N_E}, \ \forall j \quad \text{and} \quad o_i(t) \le B, \ \forall i \qquad (29)$$

by considering the full computational and communication resources. Hence, we have

$$C_E(t) \le \frac{\kappa f_E^3}{N_E^2} \quad \text{and} \quad C_C(t) \le C_C\bigg( B \sum_{i=1}^{N} w_i \bigg), \qquad (30)$$

with slight abuse of the notation $C_C$ as a function of the offloaded task bits in the RHS of the second inequality. Therefore, we have

$$-\frac{\kappa f_E^3}{N_E^2} - C_C\bigg( B \sum_{i=1}^{N} w_i \bigg) \le -[C_E(t) + C_C(t)] \le 0. \qquad (31)$$

Now we can upper bound $r_t$ as follows:

$$r_t = -\rho \sum_{i=1}^{N} [q_i(t+1)]^\nu - V [C_E(t) + C_C(t)] \qquad (32)$$
$$\overset{(a)}{\le} -\rho \sum_{i=1}^{N} [q_i(t+1)]^\nu \qquad (33)$$
$$\overset{(b)}{\le} -\rho \sum_{i=1}^{N} q_i(t+1) + \rho N, \qquad (34)$$

where Step (a) is valid due to $-[C_E(t) + C_C(t)] \le 0$ and Step (b) is valid due to the inequality $x^\nu \ge x - 1$ for $\nu \ge 1$. Then, by setting $U = \rho N$ and $\eta = \rho$, we can apply the result of Theorem 1.


Remark 2. Note that in the proof of Theorem 2, the upper bound (33) becomes tight when $V$ is zero. Thus, the queue-stabilizing part of the reward dominantly operates when $V$ is small so that $-V[C_E(t) + C_C(t)] \approx 0$. In the case of large $V$, a small reduction in $C_E(t) + C_C(t)$ yields a large positive gain in $-V[C_E(t) + C_C(t)]$ and thus $r_t^P$ becomes dominant. Even in this case, the queue-stabilizing part $r_t^Q$ itself still operates towards the direction of queue length reduction due to the negative sign in front of the queue length term. This is what Theorem 2 means. However, in the case of large $V$, more reward can be obtained by saving $C_E(t) + C_C(t)$ while increasing $q_i(t+1)$, and the reward (28) does not guarantee strong queue stability since it is not lower bounded as in (23) due to the structure of $-\rho \sum_i [q_i(t+1)]^\nu$. Hence, a balanced $V$ is required for simultaneous queue stability and penalty reduction.

Now let us investigate the reward function (28) further. First, note that the penalty part $r_t^P = -V[C_E(t) + C_C(t)]$ is a deterministic function of the action $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$. Second, consider the term $q_i(t+1)$ in $r_t^Q$ in detail. $q_i(t+1)$ is decomposed as

$$q_i(t+1) = \underbrace{q_i(0) + \sum_{\tau=0}^{t-1} [a_i(\tau) - b_i(\tau)]}_{= q_i(t)} + a_i(t) - b_i(t) = \underbrace{q_i(t) + a_i(t)}_{\text{state at time } t} - \underbrace{b_i(t)}_{\text{action at time } t} \qquad (35)$$

under the assumption of $q_i(t) + a_i(t) \ge b_i(t)$ for simplicity. Note that $b_i(\tau)$ is a deterministic function of the action $\boldsymbol{\alpha}(\tau)$ and $\boldsymbol{\beta}(\tau)$ for $\tau = 0, 1, \cdots, t$, but the arrivals $a_i(\tau)$, $\tau = 0, 1, \cdots, t$, are random quantities uncontrollable by the policy $\pi$. Recall that the reward function in RL is a function of state and action in general. In the field of RL, it is known that an environment with a probabilistic reward is more difficult to learn than an environment with a deterministic reward [41]. That is, for a given state, the agent performs an action and receives a reward depending on the state and the action. When the received reward is probabilistic, especially with large variance, it is difficult for the agent to know whether the action is good or bad for the given state. Now, it is clear why we defined $q_i(t) + a_i(t)$ as a state variable at time $t$, as mentioned in Remark 1, and defined the timing structure as in Section IV-A. By defining $q_i(t) + a_i(t)$ as a state variable, the reward-determining quantity $q_i(t+1)$ becomes a deterministic function of the state and the action as seen in (35), and the random arrivals $a_i(\tau)$, $\tau = 0, 1, \cdots, t$, are absorbed in the state. In this case, the randomness caused by $a_i(t+1)$ is in the state transition:

$$s_t = \{q_i(t) + a_i(t), \cdots\}$$
$$s_{t+1} = \{\underbrace{q_i(t) + a_i(t)}_{\in s_t} - \underbrace{b_i(t)}_{\text{action}} + \underbrace{a_i(t+1)}_{\text{random term}}, \cdots\}. \qquad (36)$$

That is, the next state follows $s_{t+1} \sim P(s_{t+1} | s_t, a_t)$ and the distribution of the random arrival $a_i(t+1)$ affects the state transition probability $P$. Note that in this case the transition is Markovian since the arrival $a_i(t+1)$ is independent of the arrivals at other time slots. Thus, the whole setup falls into an MDP with a deterministic reward function. However, if we had defined the state at time $t$ as $q_i(t)$ instead of $q_i(t) + a_i(t)$ (this setup does not require a one-time-step delay for causality), then $q_i(t+1)$ would have been decomposed as

$$q_i(t+1) = \underbrace{q_i(t)}_{\text{state at time } t} + \underbrace{a_i(t)}_{\text{random term}} - \underbrace{b_i(t)}_{\text{action at time } t} \qquad (37)$$

to yield a random probabilistic reward, and this would have made learning difficult.

Although RL with the reward function $r_t^Q = -\rho \sum_{i=1}^{N} [q_i(t+1)]^\nu$ with $\nu \ge 1$, added to the penalty part $r_t^P$, tries to strongly stabilize the queues by Theorem 2 through the relationship (26), we want to reshape the queue-stabilizing part $r_t^Q$ of the reward into a discounted form suited to practical RL, while maintaining the reward-sum equivalence needed for queue length control by RL through the relationship (26). Our reward reshaping is based on the fact that training in RL is typically based on episodes, which is assumed here too. Let $T$ be the length of each episode. Then, under the assumption of $q_i(0) = 0$, we can express the accumulated reward over one episode as

$$\frac{1}{T} \sum_{t=0}^{T-1} r_t^Q = -\rho \frac{1}{T} \sum_{t=0}^{T-1} \sum_{i=1}^{N} [q_i(t+1)]^\nu \qquad (38)$$
$$\overset{(a)}{=} -\rho \sum_{t=0}^{T-1} \sum_{i=1}^{N} \frac{T-t}{T} \left[ q_i(t+1)^\nu - q_i(t)^\nu \right], \qquad (39)$$

where the equality (a) is valid because the coefficient in front of the term $[q_i(\ell)]^\nu$ for each $\ell$ in (39) is given by

$$\frac{T - (\ell - 1)}{T} - \frac{T - \ell}{T} = \frac{1}{T}. \qquad (40)$$

Thus, by defining the reshaped reward

$$\tilde{r}_t^Q = -\rho \sum_{i=1}^{N} \frac{T-t}{T} \left[ q_i(t+1)^\nu - q_i(t)^\nu \right], \quad \nu \ge 1, \qquad (41)$$

we have the sum equivalence between the original reward $r_t^Q$ in (27) and the reshaped reward $\tilde{r}_t^Q$ except for the factor $1/T$, as seen in (39). Since $r_\tau^Q$ satisfies $r_\tau^Q \le U - \eta \sum_{i=1}^{N} q_i(\tau+1)$ as seen in the proof of Theorem 2 and $\tilde{r}_\tau^Q$ satisfies $\sum_{\tau=0}^{t-1} \tilde{r}_\tau^Q = \frac{1}{t} \sum_{\tau=0}^{t-1} r_\tau^Q$ due to (39), by summing the first condition over time $0, 1, \cdots, t-1$ and using the second condition, we have

$$\sum_{\tau=0}^{t-1} \tilde{r}_\tau^Q = \frac{1}{t} \sum_{\tau=0}^{t-1} r_\tau^Q \le U - \eta \frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{i=1}^{N} q_i(\tau+1). \qquad (42)$$

Rearranging the terms in (42) and taking expectation, we have

$$\frac{t+1}{t} \cdot \frac{1}{t+1} \sum_{\tau=0}^{t} \sum_{i=1}^{N} \mathbb{E}[q_i(\tau)] \le \frac{U}{\eta} - \frac{1}{\eta} \sum_{\tau=0}^{t-1} \mathbb{E}[\tilde{r}_\tau^Q]. \qquad (43)$$

Hence, we can still control the queue lengths by RL with the reshaped reward $\tilde{r}_t^Q$. As compared to (26), the factor $1/t$ in front of the sum reward term in (26) disappears in (43). The key aspect of the reshaped reward is that the reward at time $t$ is discounted by the factor $\frac{T-t}{T}$, which is a monotone decreasing function of $t$ and decreases from one to zero as time elapses. This fact makes the reshaped reward suitable for practical RL, and this will be explained shortly.
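As a quick sanity check of the coefficient argument in (40), the short script below numerically compares the plain time average in (38) with the discount-weighted one-step-difference sum in (39) for an arbitrary backlog trajectory with $q(0) = 0$ (a single queue and $\rho = 1$ for simplicity); the two agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
T, nu = 50, 2.0
# arbitrary single-queue backlog trajectory with q(0) = 0, as assumed above
q = np.concatenate([[0.0], np.maximum(np.cumsum(rng.uniform(-1.0, 2.0, T)), 0.0)])

# (38): plain time average of q(t+1)^nu (rho = 1 for simplicity)
lhs = -np.mean(q[1:] ** nu)
# (39): one-step differences weighted by the internal discount (T - t)/T
t = np.arange(T)
rhs = -np.sum((T - t) / T * (q[1:] ** nu - q[:-1] ** nu))
print(lhs, rhs)   # the two values coincide up to floating-point error
```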


Remark 3. Note that the reshaped reward in (41) can be rewritten as

$$\tilde{r}_t^Q = -\rho \sum_{i=1}^{N} \frac{T-t}{T} \Big[ \big( \underbrace{q_i(t) + a_i(t)}_{\text{state at } t} - \underbrace{b_i(t)}_{\text{action at } t} \big)^\nu - q_i(t)^\nu \Big]. \qquad (44)$$

Again, we want to express $\tilde{r}_t^Q$ as a deterministic function of the state and the action. The term $q_i(t) + a_i(t)$ is already included in the set of state variables and $b_i(t)$ is deterministically dependent on the action. For our purpose, the last term $q_i(t)$ in the RHS of (44) should be deterministically determined by the state. Hence, we included either $q_i(t)$ or $a_i(t)$ in addition to $q_i(t) + a_i(t)$ in the set of state variables, as seen in Section IV-A. In the case of $a_i(t)$ as a state variable, $q_i(t)$ in (44) is determined with no uncertainty from the state variables $q_i(t) + a_i(t)$ and $a_i(t)$.

Finally, let us consider practical RL. Practical RL maximizes the sum of exponentially discounted rewards $\sum_t \gamma^t r_t$ with a discount factor $\gamma < 1$, not $\sum_t r_t$, in order to guarantee the convergence of the Bellman equation [25], [42], whereas our derivation up to now assumed the undiscounted sum of rewards. In RL theory, the Bellman operator is typically used to estimate the state-action value function, and it becomes a contraction mapping when the rewards are exponentially discounted by a discount factor $\gamma < 1$. Then, the state-action value function converges to a fixed point and proper learning is achieved [42]. Hence, this discounting should be incorporated in our reward design. Note that $\gamma^t$ monotonically decreases from one to zero as time goes and that our reshaped reward $\tilde{r}_t^Q$ has the internal discount factor $\frac{T-t}{T}$, which also monotonically decreases from one to zero as time goes. Even though the two discount factors are not exactly the same, their monotone decreasing behavior matches and plays the same role of discounting. With the existence of the external RL discount factor $\gamma \ (< 1)$, we redefine our reward for RL aiming at queue stability and penalty minimization as

$$r_t = -\rho \sum_{i=1}^{N} \left[ q_i(t+1)^\nu - q_i(t)^\nu \right] - V [C_E(t) + C_C(t)], \qquad (45)$$

with $\nu \ge 1$. Then, with the external RL discount factor, the actual queue-stability-related part of the reward in practical RL becomes $-\rho \sum_{i=1}^{N} \gamma^t [q_i(t+1)^\nu - q_i(t)^\nu]$. Our reward (41) tries to approximate this actual reward by a first-order approximation with the one-step time difference form $[q_i(t+1)^\nu - q_i(t)^\nu]$. Thus, the queue lengths can be controlled through the relationship (43) by directly maximizing the sum of discounted rewards by RL. Note that the penalty part is also discounted when we use the reward (45) in practical RL with reward discounting. However, this is not directly related to queue length control, and such discounting is typical in practical RL.

Remark 4. Note that the RHS of (38) is the time average of $q_i(t+1)^\nu$ over time $0$ to $T-1$ with equal weight $1/T$, whereas the RHS of (39) is the time average of the one-step difference $[q_i(t+1)^\nu - q_i(t)^\nu]$ over time $0$ to $T-1$ with the unequal discounted weight $\frac{T-t}{T}$. Note that the one-step difference form makes the impact of each $q_i(t+1)$ equal in the discounted average, as seen in (40). Suppose that we directly use $r_t^Q = -\rho \sum_{i=1}^{N} q_i(t+1)^\nu$ without one-step-difference reshaping for practical RL. Then, the queue-stability-related part in the sum of discounted rewards $\sum_{t=0}^{T-1} \gamma^t r_t$ in practical RL becomes

$$\sum_{t=0}^{T-1} \gamma^t r_t^Q = -\rho \sum_{t=0}^{T-1} \sum_i \gamma^t q_i(t+1)^\nu, \quad 0 < \gamma < 1, \qquad (46)$$

where $\sum_{t=0,1,\cdots} \gamma^t (\cdot)$ can be viewed as a weighted time average with some scaling. Thus, the queue length in the initial phase of each episode is overly weighted. Reshaping into the one-step difference form mitigates this effect by trying to make the impact of each $q_i(t+1)$ equal in the discounted average within a first-order linear approximation.

Remark 5. Now, suppose that we maximize the sum of undiscounted rewards and use the one-step difference reward of (45), i.e., $\sum_{t=0}^{T-1} r_t^Q$. Then, we have

$$\frac{1}{\rho} \sum_{t=0}^{T-1} r_t^Q = -\sum_{t=0}^{T-1} \sum_i \left[ q_i(t+1)^\nu - q_i(t)^\nu \right] = -\sum_i q_i(T)^\nu + \sum_i q_i(T-1)^\nu - \sum_i q_i(T-1)^\nu + \cdots + \sum_i q_i(0)^\nu = -\sum_i q_i(T)^\nu + \underbrace{\sum_i q_i(0)^\nu}_{=0}.$$

Hence, the time average of the queue length required to implement the strong stability in Definition 1 does not appear in the reward sum, and maximizing the sum reward tries to minimize the queue length only at the final time step. Thus, the one-step difference reward form (45) is valid for RL maximizing the sum of discounted rewards.

When $\nu = 2$, the queue-stabilizing part $r_t^Q$ of our reward (45) reduces to the negative of the drift term $-\Delta L(t)$ in the Lyapunov framework, given in (21). So, the negative of the DPP can be used as the reward for practical RL maximizing the sum of discounted rewards, not for RL maximizing the sum of undiscounted rewards. When $\nu = 1$, $r_t^Q$ simply reduces to $r_t^Q = -\rho \sum_{i=1}^{N} [a_i(t) - b_i(t)]$.

Considering that RL tries to maximize the expected accumulated reward and $\mathbb{E}[a_i(t)] = \lambda_i$, we can further stabilize the reward by replacing the random arrival $a_i(t)$ with its mean $\lambda_i$. For example, when $\nu = 1$, we use

$$\hat{r}_t^Q = -\rho \sum_{i=1}^{N} [\lambda_i - b_i(t)], \qquad (47)$$

and when $\nu = 2$, we use

$$\hat{r}_t^Q = -\rho \sum_{i=1}^{N} \left\{ 2 q_i(t) [\lambda_i - b_i(t)] + [\lambda_i - b_i(t)]^2 \right\}, \qquad (48)$$

where the arrival rate $\lambda_i$ can easily be estimated. Note that with $\nu = 1$, the second upper-bounding step (b) in (34) is not required and hence we have a tighter upper bound, whereas the drift case $\nu = 2$ has the advantage of length balancing across the queues due to the property of a quadratic function.
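A compact sketch of these reward choices is given below: the one-step difference reward (45) and its arrival-averaged variants (47) and (48). The function names and the convention of passing the already-computed penalty cost as an argument are illustrative assumptions.

```python
import numpy as np

def reward(q_t, q_next, cost, rho, V, nu=1):
    """One-step-difference reward (45): queue-stabilizing part plus penalty."""
    return -rho * np.sum(q_next ** nu - q_t ** nu) - V * cost

def reward_hat_nu1(lam, b, cost, rho, V):
    """Arrival-averaged variant (47) for nu = 1: a_i(t) replaced by its mean."""
    return -rho * np.sum(lam - b) - V * cost

def reward_hat_nu2(q_t, lam, b, cost, rho, V):
    """Arrival-averaged variant (48) for nu = 2 (negative expected drift)."""
    diff = lam - b
    return -rho * np.sum(2.0 * q_t * diff + diff ** 2) - V * cost
```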


Fig. 3. Process diagram of the environment and SAC networks

V. IMPLEMENTATION

Among several popular recent DRL algorithms, we choose the Soft Actor-Critic (SAC) algorithm, which is a state-of-the-art algorithm optimizing the policy in an off-policy manner [26], [43]. SAC maximizes the discounted sum of the expected return and the policy entropy to enhance exploration. Thus, the SAC policy objective function is given by

$$J(\pi) = \mathbb{E}_{\xi \sim \pi} \left[ \sum_{t=0}^{T-1} \gamma^t \big( r_t + \zeta \mathcal{H}(\pi(\cdot | s_t)) \big) \right], \qquad (49)$$

where $\pi$ is the policy, $\xi = (s_0, a_0, s_1, a_1, \cdots)$ is the state-action trajectory, $\gamma$ is the reward discount factor, $\zeta$ is the entropy weighting factor, $\mathcal{H}(\pi)$ is the policy entropy, and $r_t$ is the reward. Fig. 3 describes the overall diagram of our edge computing environment and the SAC agent. For the implementation of SAC, we followed the standard algorithm in [26] with the state variables and the action variables defined in Section IV-A. In order to implement the condition (3), we implemented the policy deep neural network with output dimension $2N + 2$ by adding two dummy variables for $\alpha_{N+1}(t) \ (\ge 0)$ and $\beta_{N+1}(t) \ (\ge 0)$ and applied the softmax function at the output of the policy network so that

$$\sum_{i=1}^{N+1} \alpha_i(t) = 1, \quad \sum_{i=1}^{N+1} \beta_i(t) = 1, \quad \forall t. \qquad (50)$$

Then, we took only $\alpha_i(t)$ and $\beta_i(t)$, $i = 1, 2, \cdots, N$, from the neural network output layer. The used hyperparameters are the same as those in [26] except the discount factor $\gamma$, and the values are provided in Table II. We assumed that the DRL policy at the edge node updated its status and performed its action every second. The episode length for learning was $T = 5000$ time steps. Our implementation source code is available at GitHub [44].

TABLE II
SAC HYPERPARAMETERS

Parameter                           Value
Optimizer                           Adam
Learning rate                       $3 \cdot 10^{-4}$
Discount factor $\gamma$            0.999
Replay buffer size                  $10^6$
Number of hidden layers             2
Number of hidden units per layer    256
Number of samples per minibatch     256
Nonlinearity                        ReLU
Target smoothing coefficient        0.005
Target update interval              1
Gradient step                       1
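The output-layer trick for enforcing (3) can be sketched as follows, using a plain NumPy softmax over each group of $N+1$ outputs (the last entry of each group being the dummy slack variable). This only illustrates the transformation, not the full SAC policy network.

```python
import numpy as np

def squash_action(raw, N):
    """Map a raw policy output of dimension 2N + 2 to feasible allocations.

    A softmax over the first N + 1 entries (the last one being a dummy slack
    variable) and another over the remaining N + 1 entries enforce (50), so
    that sum_i alpha_i(t) <= 1 and sum_i beta_i(t) <= 1 hold by construction.
    Only the first N components of each group are returned.
    """
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / np.sum(e)

    alpha_full = softmax(raw[:N + 1])
    beta_full = softmax(raw[N + 1:])
    return alpha_full[:N], beta_full[:N]
```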

VI. EXPERIMENTS

A. Environment Setup

In order to test the proposed DRL-based approach, we considered the following system. With the heavy computational load on smartphones caused by artificial intelligence (AI) applications, we considered AI applications to be offloaded from smartphones to the edge node. The three considered AI application types were speech recognition, natural language processing and face recognition. The number of required CPU cycles $w_i$ for each application type was roughly estimated by running open or commercial software on smartphones and personal computers. We assumed that the arrival process of the $i$-th application-type tasks was a Poisson process with mean arrival rate $\lambda_i$ [arrivals/second], as mentioned in Section II-A. We further assumed that the data size $d_i$ [bits] of each task arrival of the $i$-th application type followed a truncated normal distribution $\mathcal{N}_T(\mu_i, \sigma_i, d_{i,\min}, d_{i,\max})$. We first set the minimum and maximum data sizes $d_{i,\min}$ and $d_{i,\max}$ of one task for the $i$-th application type and then set the mean and standard deviation as $\mu_i = (d_{i,\max} + d_{i,\min})/2$ and $\sigma_i = (d_{i,\max} - d_{i,\min})/4$. We set the average number of task arrivals of the $i$-th application type $\lambda_i$ and the minimum and maximum data sizes $d_{i,\min}$ and $d_{i,\max}$ of one task for the $i$-th application type as shown in Table III.

TABLE III
PARAMETER SETUP FOR EACH APPLICATION TYPE

Application type               $w_i$     Distribution of $d_i$                      $\lambda_i$
Speech recognition             10435     $\mathcal{N}_T(170, 130, 40, 300)$ (kB)    5
Natural language processing    25346     $\mathcal{N}_T(52, 48, 4, 100)$ (kB)       8
Face recognition               45043     $\mathcal{N}_T(55, 45, 10, 100)$ (kB)      4

We assumed a scenario in which the cloud node had a larger processing capability than the edge node and a good portion of processing was done at the cloud node. We assumed that the edge node had 10 CPU cores and each CPU core had a processing capability of 4 Giga cycles per second [Gcycles/s or simply GHz]. Hence, the total processing power of the edge node was 40 Gcycles/s. Among the valid range of $\nu \ge 1$, we used $\nu = 1$ or $\nu = 2$, i.e., (47) or (48), for our reward function. Thus, the overall reward function was given by

$$\hat{r}_t^Q - V [C_E(t) + C_C(t)], \qquad (51)$$

where $C_E(t)$ was the cost of edge processing given in (9) as

$$C_E(t) = \sum_{j=1}^{N_E} \kappa f_{E,j}^3 \qquad (52)$$

and $C_C(t)$ was the cost of offloading from the edge node to the cloud node. We considered two cases for $C_C(t)$. The first one was a simple continuous function given by

$$C_C(t) = \kappa N_C \bigg( \frac{\sum_{i=1}^{N} w_i o_i(t)}{N_C} \bigg)^3, \qquad (53)$$

where $o_i(t)$ was the number of the offloaded task bits of the $i$-th application type to the cloud node, given by (5), and $N_C$ was the number of CPU cores at the cloud node. Note that the cost function (53) follows the same principle as (52) under the assumption that the overall workload $\sum_i w_i o_i(t)$ offloaded to the cloud node is evenly distributed over the $N_C$ CPU cores at the cloud node. We set the number of CPU cores at the cloud node as $N_C = 54$, with each core having 4 Gcycles/s processing capability. Thus, the maximum cloud processing capability was set as 216 Gcycles/s. The second cost function for $C_C(t)$ was a discontinuous function, which will be explained in Section VI-C. We set $\kappa = \frac{1}{(400\,\text{GHz})^3}$ based on a rough estimation³ and $\rho = 10^{-9}$ in our simulations. In fact, the values of $\kappa$ and $\rho$ were not critical since we swept the weighting factor $V$ between the delay-related term and the penalty cost in order to see the overall trade-off. This value setting was for the numerical dynamic range of the used SAC code.

Feasibility Check: Note that the average arrival rates in terms of CPU clock cycles per second for the three application types are given by

$$\lambda_1 \cdot \mu_1 \cdot w_1 = 5 \cdot 170 \cdot 8 \cdot 1024 \cdot 10435 \approx 72.7 \ \text{Gcycles/s}$$
$$\lambda_2 \cdot \mu_2 \cdot w_2 = 8 \cdot 52 \cdot 8 \cdot 1024 \cdot 25346 \approx 86.4 \ \text{Gcycles/s}$$
$$\lambda_3 \cdot \mu_3 \cdot w_3 = 4 \cdot 55 \cdot 8 \cdot 1024 \cdot 45043 \approx 81.2 \ \text{Gcycles/s}.$$

The sum of the above three rates is roughly 240 Gcycles/s, among which at most 40 Gcycles/s can be processed at the edge node. Table IV shows the average values of $C_E(t)$ and $C_C(t)$ for different amounts of offloading to the cloud node based on (53), under the assumption that the assigned workload is evenly distributed over the CPU cores both at the edge and cloud nodes (the unit of the first two columns in Table IV is Gcycles/s and the unit of the remaining three columns is $\text{G}^3 \kappa = 10^{27} \kappa$). As seen in Table IV, the edge node should process as much as it can for a smaller overall cost. If we offload all tasks to the cloud node, the required communication bandwidth is given by $\sum_i \lambda_i \mu_i \approx 12.2$ Mbps. So, we set the communication bandwidth $B = 20$ Mbps so that the communication is not a bottleneck for system operation. Since the overall service rate provided by the edge and cloud nodes is $256 \ (= 40 + 216)$ Gcycles/s, the average arrival rate is 240 Gcycles/s, and the communication bandwidth is not a bottleneck, the overall system is feasible to control. The DRL resource allocator should learn a policy that distributes the arriving tasks to the edge and cloud nodes optimally.

TABLE IV
ENVIRONMENT SETUP CHECK

At Edge    At Cloud    $C_E(t)$    $C_C(t)$    $C_E(t) + C_C(t)$
40         200         640         2743        3383
30         210         270         3175        3445
20         220         80          3651        3731

³Suppose that a CPU core of 4 GHz clock rate consumes 35 W and that 10 kWh = 36,000 kW·s costs one dollar. Then, from $1 : 36{,}000\,\text{kW·s} = \kappa f_{E,j}^3 : 35\,\text{W·s}$, we obtain $\kappa = \frac{35}{36{,}000 \cdot 10^3 \cdot (4\,\text{GHz})^3} \approx 1/(400\,\text{GHz})^3$.
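The arithmetic of this feasibility check can be reproduced with the short script below (illustrative only; it simply converts the Table III parameters from kB to bits and multiplies by the per-bit workloads).

```python
# Reproduces the feasibility arithmetic of Section VI-A (1 kB = 8 * 1024 bits).
lam = [5, 8, 4]            # arrivals/s per application type
mu_kB = [170, 52, 55]      # mean task data size in kB
w = [10435, 25346, 45043]  # required CPU cycles per bit

loads = [l * m * 8 * 1024 * wi / 1e9 for l, m, wi in zip(lam, mu_kB, w)]
print([round(x, 1) for x in loads])   # ~[72.7, 86.4, 81.2] Gcycles/s
print(round(sum(loads), 1))           # ~240 Gcycles/s vs. 40 + 216 = 256 available
bits_per_s = sum(l * m * 8 * 1024 for l, m in zip(lam, mu_kB))
print(round(bits_per_s / 1e6, 1))     # ~12.2 Mbps if everything is offloaded
```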

B. Convergence and Comparison with the DPP Algorithm

We tested our DRL-based approach with the proposed reward function for the system described in Section VI-A with $C_C(t)$ given by the simple continuous cost function (53), and compared its performance to that of the DPP algorithm. For comparison, we used the basic DPP algorithm in Algorithm 1 with the cost at each time step given by

$$\sum_{i=1}^{N} q_i(t) \bigg( a_i(t) - \frac{\alpha_i(t) f_E}{w_i} - \beta_i(t) B \bigg) + V' \left[ C_E(\boldsymbol{\alpha}(t)) + C_C(\boldsymbol{\alpha}(t), \boldsymbol{\beta}(t)) \right], \qquad (54)$$

where the quadratic term in the RHS of (18) was replaced by a constant upper bound and omitted. Note that the weighting factor $V'$ is different from the weighting factor $V$ in (51) in order to take into account the difference in the stability-related terms in the two cost functions (51) and (54). In the DPP algorithm, for each time step, the optimal $\boldsymbol{\alpha}(t)$ and $\boldsymbol{\beta}(t)$ were found by minimizing (54) for given $q_i(t)$ and $a_i(t)$. For this numerical optimization, we used sequential quadratic programming (SQP), which is an iterative method solving the original constrained nonlinear optimization with successive quadratic approximations [45] and is widely used in several available software packages including MATLAB, LabVIEW and SciPy. We used the SciPy package in Python to implement the DPP algorithm.

For SAC, we did the following. In the beginning, all weights in the neural networks for the value function and the policy were randomly initialized. We generated four episodes, collected 20,000 samples, and stored them in the sample buffer, where one episode for training was composed of T = 5000 time steps and, at the beginning of each episode, all queues were emptied.⁴ Then, with the samples in the sample buffer, we trained the neural networks. With this trained policy, we generated one episode and evaluated the performance on this evaluation-purpose episode. Then, we again generated and stored four episodes of 20,000 samples in the sample buffer, trained the policy with the samples in the sample buffer, and evaluated the newly trained policy with one evaluation episode. We repeated this process.

⁴ In Atari games, one episode typically corresponds to one game starting from the beginning.

Fig. 4. The learning curve of the proposed DRL algorithm: (a) ν = 1, i.e., (51) with (47), for V = 10, 20, 40, 50, 60, 100, and (b) ν = 2, i.e., (51) with (48), for V = 10, 50, 80, 100, 120, 600 (x-axis: time steps (1e6); y-axis: episode reward sum).
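A minimal sketch of this alternating train-and-evaluate schedule is given below; the environment interface and the SAC agent object (with `act`, `store`, and `update` methods) are hypothetical placeholders rather than our actual implementation.

# Minimal sketch of the alternating train/evaluate schedule described above.
# `env` and `sac` are hypothetical placeholders for the MEC environment and the SAC agent.

T = 5000                 # time steps per episode
EPISODES_PER_ROUND = 4   # 4 episodes -> 20,000 new samples per round

def run_round(env, sac):
    # Collect four training episodes and store the transitions in the sample buffer.
    for _ in range(EPISODES_PER_ROUND):
        state = env.reset()          # all queues emptied at the start of an episode
        for _ in range(T):
            action = sac.act(state)
            next_state, reward, done = env.step(action)
            sac.store(state, action, reward, next_state)
            state = next_state
    sac.update()                     # train value and policy networks on the buffer

def evaluate_policy(env, sac):
    # One evaluation-purpose episode; report the undiscounted episode reward sum.
    state, total = env.reset(), 0.0
    for _ in range(T):
        action = sac.act(state, deterministic=True)
        state, reward, done = env.step(action)
        total += reward
    return total

# for round_idx in range(num_rounds):
#     run_round(env, sac)
#     print(round_idx, evaluate_policy(env, sac))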

First, we checked the convergence of SAC with the proposed reward function. Fig. 4 shows its learning curve for different values of the weighting factor V in (51). The x-axis in Fig. 4 is the training episode time step (not including the evaluation episode time steps) and the y-axis is the episode reward sum $\sum_{\tau=0}^{T-1} r_\tau$ based on (51) without discounting for the evaluation episode corresponding to the x-axis value. Note that although SAC itself tries to maximize the sum of discounted rewards, we plotted the undiscounted episode reward sum by storing (51) at each time step. It is observed that the proposed DRL algorithm converges as training proceeds.

With the verification of our DRL algorithm's convergence, we compared the DRL algorithm with the DPP algorithm conventionally used for the Lyapunov framework. Fig. 5 shows the trade-off performance between the penalty cost and the average queue length of the two methods. For the DRL method, we assumed that the policy had converged after 6M and 20M time steps for ν = 1 and ν = 2, respectively, based on the result in Fig. 4, and picked this converged policy as our execution policy. With the execution policy, we ran several episodes for each V and computed the average episode penalty cost $\frac{1}{T}\sum_{t=0}^{T-1}[C_E(t) + C_C(t)]$ and the average episode queue length $\frac{1}{T}\sum_{t=0}^{T-1}\sum_{i=1}^{N} q_i(t)$. We plotted the points in the 2-D plane of the average episode queue length and the average episode penalty cost by sweeping V. The result is shown in Fig. 5. Each line in Fig. 5 connects the mean values over multiple episodes for each V for each algorithm. For the DPP algorithm, multiple episodes with the same length T = 5000 were tested for each V' and the trade-off curve was drawn. The weighting factors V and V' in (51) and (54) were swept separately. It is seen that the DRL approach with ν = 2 shows a trade-off performance similar to that of the DPP algorithm, and the DRL approach with ν = 1 shows a better trade-off performance than the DPP algorithm. This is because the stability-related part of the reward directly becomes the queue length when ν = 1. Then, we checked the actual queue length evolution with initial zero queue length for the execution policy, and the result is shown in Figs. 6 and 7 for ν = 1 and ν = 2, respectively. It is seen that the queues are stabilized up to the average length $\sim 10^8$ for ν = 1 and up to the average length $\sim 10^9$ for ν = 2. However, beyond a certain value of V, the penalty cost becomes dominant and the agent learns a policy that focuses on penalty cost reduction while sacrificing queue stability. Hence, the upper-left region in Fig. 5 is the desired operating region with queue stability.

Fig. 5. Average episode penalty versus average episode queue length (x-axis: average backlog of all queues in the edge node, in units of $10^9$; y-axis: average computing cost, in units of $10^{30}$) for the DRL approach with ν = 1 and ν = 2 and for the DPP algorithm.

Fig. 6. Queue length $\sum_{i=1}^{N} q_i(t)$ over time for ν = 1: (a) V = 10, avg. bits $2.21\cdot 10^5$; (b) V = 20, avg. bits $2.38\cdot 10^6$; (c) V = 40, avg. bits $1.40\cdot 10^7$; (d) V = 50, avg. bits $4.96\cdot 10^8$.

Fig. 7. Queue length $\sum_{i=1}^{N} q_i(t)$ over time for ν = 2: (a) V = 10, avg. bits $9.75\cdot 10^6$; (b) V = 50, avg. bits $2.06\cdot 10^7$; (c) V = 100, avg. bits $1.11\cdot 10^9$; (d) V = 120, avg. bits $1.33\cdot 10^9$.
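For concreteness, the following Python sketch shows how the trade-off points in Fig. 5 can be computed from logged evaluation episodes; the per-step trace format (arrays of $C_E(t)$, $C_C(t)$, and $q_i(t)$) and the `run_episode` helper are hypothetical placeholders.

import numpy as np

def tradeoff_point(episodes):
    """Average episode penalty cost and average episode queue length over several
    logged episodes; each episode is a dict of per-step arrays (hypothetical format)."""
    costs, backlogs = [], []
    for ep in episodes:
        # ep['C_E'], ep['C_C']: arrays of length T; ep['q']: array of shape (T, N)
        costs.append(np.mean(ep['C_E'] + ep['C_C']))          # (1/T) sum_t [C_E(t)+C_C(t)]
        backlogs.append(np.mean(np.sum(ep['q'], axis=1)))     # (1/T) sum_t sum_i q_i(t)
    return np.mean(backlogs), np.mean(costs)

# Sweep V: one converged execution policy per V, several evaluation episodes each.
# `run_episode(policy, T)` is a hypothetical helper returning the logged trace.
# curve = [tradeoff_point([run_episode(policies[V], T=5000) for _ in range(5)])
#          for V in (10, 20, 40, 50, 60, 100)]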

C. General Reward Function: A Discontinuous Function Case

In the previous experimental example, it is observed that in the queue-stabilizing operating region the performance of DRL-SAC and that of the DPP algorithm are more or less the same, as seen in Fig. 5. This is because the DPP algorithm yields a solution with performance within a constant gap from the optimal value due to the Lyapunov optimization theorem. The strength of the proposed DRL approach is its versatility for general reward functions, in addition to the fact that optimization is not required once the policy is trained. Note that the DPP algorithm requires solving a constrained nonconvex optimization for each time step in the general reward function case. Such a constrained nonconvex optimization can be approached by several methods such as successive convex approximation (SCA) [46] or the SQP used in Section VI-B; however, such methods require certain properties of the reward function such as continuity and differentiability, and it may be difficult to apply them to general reward functions such as a reward given by a table. In order to see the generality of the DRL approach to the Lyapunov optimization, we considered a more complicated penalty function. We considered the same reward $C_E(t)$ given by (52), but instead of (53), $C_C(t)$ was given by a scaled version of the number of CPU cores at the cloud node required to process the offloaded tasks under the assumption that each CPU core was fully loaded at its maximum clock rate of 4 GHz before the next core was assigned. This $C_C(t)$ is a discontinuous function of the amount of the offloaded task bits. All other setup was the same as that in Section VI-B. With this penalty function, it is observed that the DPP algorithm based on SQP failed, whereas DRL-SAC with the proposed state and reward function successfully learned a policy. Fig. 8 shows the learning curves of DRL-SAC in this case for different values of V with ν = 1. (The plot was obtained in the same way as that used for Fig. 4.) Fig. 9 shows the corresponding DRL-SAC trade-off performance between the average episode penalty and the average episode queue length, and Fig. 10 shows the queue length over time for one episode for the trained policy in the execution phase. Thus, the DRL approach operates properly with a more complicated penalty function for which the DPP algorithm may fail.

Fig. 8. Learning curves of DRL-SAC for the discontinuous $C_C(t)$ function (V = 30, 40, 70, 80, 90, 100, 500; x-axis: time steps (1e6), y-axis: average episode reward sum).

Fig. 9. Average episode penalty versus average episode queue length for the discontinuous $C_C(t)$ (V = 30, 40, 70, 80, 90, 100, 500).

Fig. 10. Queue length $\sum_{i=1}^{N} q_i(t)$ over time for the discontinuous $C_C(t)$: (a) V = 30, (b) V = 80, (c) V = 90, and (d) V = 100.
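As an illustration of this type of penalty, the Python sketch below implements a core-count-based discontinuous cloud cost of the kind described above; the scaling constant and the mapping from offloaded bits to CPU cycles are illustrative assumptions.

import math

F_CORE = 4e9   # maximum clock rate of one cloud CPU core (cycles/s)

def cloud_cost_discontinuous(offloaded_bits_per_s, w, scale=1.0):
    """Scaled number of fully loaded 4 GHz cloud cores needed for the offloaded
    workload; w is cycles per bit. The ceiling makes the cost a discontinuous
    (step) function of the amount of offloaded task bits."""
    offloaded_cycles_per_s = offloaded_bits_per_s * w
    n_cores = math.ceil(offloaded_cycles_per_s / F_CORE)
    return scale * n_cores

# Example: a small change in offloaded bits can jump the cost by one full core.
print(cloud_cost_discontinuous(1.90e6, w=10435))   # -> 5 cores (scale = 1)
print(cloud_cost_discontinuous(1.92e6, w=10435))   # -> 6 cores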

D. Operation in Higher Action Dimensions

In the previous experiments, we considered the case of N = 3, i.e., three queues, and the action dimension was 2N + 2 = 8. In order to check the operability of the DRL approach in a higher-dimensional case, we considered the case of N = 8. In this case, the action dimension⁵ was 2N + 2 = 18. The parameters of the eight application types that we considered are shown in Table V. Other parameters and setup were the same as those in the case of N = 3, and $C_C(t)$ was the discontinuous function used in Section VI-C. From Table V, the average total arrival rates in terms of CPU cycles and task bits per second were 193 GHz and 5.14 Mbps, respectively. Hence, the setup was feasible to control. Fig. 11(a) shows the corresponding learning curve and Fig. 11(b) shows the queue length $\sum_{i=1}^{N} q_i(t)$ over time for one episode for the trained policy in the execution phase. It is seen that even in this case the DRL-based approach works properly.

⁵ In the MuJoCo robot simulator for RL algorithm tests, the Humanoid task is known to have a high action dimension of 17 [47].

TABLE V
APPLICATION PARAMETERS

Application name     $w_i$   Distribution of $d_i$             $\lambda_i$
Speech recognition   10435   NT(170, 130, 40, 300) (KB)        0.5
NLP                  25346   NT(52, 48, 4, 100) (KB)           0.8
Face recognition     45043   NT(55, 45, 10, 100) (KB)          0.4
Searching            8405    NT(51, 24.5, 2, 100) (byte)       10
Translation          34252   NT(2501, 1249.5, 2, 5000) (byte)  13
3D game              54633   NT(1.55, 0.725, 0.1, 3) (MB)      0.1
VR                   40305   NT(1.55, 0.725, 0.1, 3) (MB)      0.1
AR                   34532   NT(1.55, 0.725, 0.1, 3) (MB)      0.1

Fig. 11. Higher-dimension case (ν = 1): (a) learning curve and (b) queue length over time in the execution phase (V = 10).
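The NT(·) entries in Table V specify the task-size distributions of $d_i$. As a minimal sketch, assuming NT(m, s, l, u) denotes a normal distribution with mean m and standard deviation s truncated to [l, u] (this reading of the notation is an assumption), the task sizes can be drawn with scipy.stats.truncnorm as follows.

import numpy as np
from scipy.stats import truncnorm

def sample_task_size(mean, std, low, high, size=1, rng=None):
    """Draw task sizes from NT(mean, std, low, high), read here as a normal
    distribution with the given mean and standard deviation truncated to
    [low, high] (assumed interpretation of the NT(.) notation in Table V)."""
    a, b = (low - mean) / std, (high - mean) / std   # truncation bounds in std units
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=rng)

rng = np.random.default_rng(0)
speech_kb = sample_task_size(170, 130, 40, 300, size=5, rng=rng)   # Speech recognition, KB
nlp_kb = sample_task_size(52, 48, 4, 100, size=5, rng=rng)         # NLP, KB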

VII. CONCLUSION

In this paper, we have considered a DRL-based approach to the Lyapunov optimization that minimizes the time-average penalty cost while maintaining queue stability. We have proposed a proper construction of state and action spaces and a class of reward functions. We have derived a condition for the reward function of RL for queue stability and have provided a discounted form of the reward for practical RL. With the proposed state and action spaces and the reward function, the DRL approach successfully learns a policy minimizing the penalty cost while maintaining queue stability. The proposed DRL-based approach to the Lyapunov optimization does not require complicated optimization at each time step and can operate with general non-convex and discontinuous penalty functions. Thus, it provides an alternative to the conventional DPP algorithm for the Lyapunov optimization.

REFERENCES

[1] M. Neely, E. Modiano, and C.-P. Li, "Fairness and optimal stochastic control for heterogeneous networks," IEEE/ACM Transactions on Networking, vol. 16, pp. 396–409, Apr. 2008.

[2] M. J. Neely, "Stochastic network optimization with application to communication and queueing systems," Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.

[3] L. Georgiadis, M. Neely, and L. Tassiulas, Resource Allocation and Cross-layer Control in Wireless Networks. Hanover, MA: Now Publishers, 2006.

[4] W. Bao, H. Chen, Y. Li, and B. Vucetic, "Joint rate control and power allocation for non-orthogonal multiple access systems," IEEE Journal on Selected Areas in Communications, vol. 35, no. 12, pp. 2798–2811, 2017.


[5] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, "Dynamic resource allocation and power management in virtualized data centers," in IEEE Network Operations and Management Symposium, 2010.

[6] H. Zhang, B. Wang, C. Jiang, K. Long, A. Nallanathan, V. C. M. Leung, and H. V. Poor, "Energy efficient dynamic resource optimization in NOMA system," IEEE Transactions on Wireless Communications, vol. 17, no. 9, pp. 5671–5683, 2018.

[7] Y. Li, M. Sheng, Y. Zhang, X. Wang, and J. Wen, "Energy-efficient antenna selection and power allocation in downlink distributed antenna systems: A stochastic optimization approach," in 2014 IEEE International Conference on Communications (ICC), 2014.

[8] M. Karaca, K. Khalil, E. Ekici, and O. Ercetin, "Optimal scheduling and power allocation in cooperate-to-join cognitive radio networks," IEEE/ACM Transactions on Networking, vol. 21, no. 6, pp. 1708–1721, 2013.

[9] H. Ju, B. Liang, J. Li, and X. Yang, "Dynamic power allocation for throughput utility maximization in interference-limited networks," IEEE Wireless Communications Letters, vol. 2, no. 1, pp. 22–25, 2013.

[10] L. Tassiulas and A. Ephremides, "Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks," in 29th IEEE Conference on Decision and Control, 1990.

[11] ——, "Dynamic server allocation to parallel queues with randomly varying connectivity," IEEE Transactions on Information Theory, vol. 39, no. 2, pp. 466–478, 1993.

[12] M. J. Neely, "Energy optimal control for time-varying wireless networks," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 2915–2934, 2006.

[13] Y. Mao, J. Zhang, and K. B. Letaief, "A Lyapunov optimization approach for green cellular networks with hybrid energy supplies," IEEE Journal on Selected Areas in Communications, vol. 33, no. 12, pp. 2463–2477, 2015.

[14] C. Jin, X. Sheng, and P. Ghosh, "Optimized electric vehicle charging with intermittent renewable energy sources," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 6, pp. 1063–1072, 2014.

[15] C. Qiu, Y. Hu, Y. Chen, and B. Zeng, "Lyapunov optimization for energy harvesting wireless sensor communications," IEEE Internet of Things Journal, vol. 5, no. 3, pp. 1947–1956, 2018.

[16] G. Zhang, W. Zhang, Y. Cao, D. Li, and L. Wang, "Energy-delay tradeoff for dynamic offloading in mobile-edge computing system with energy harvesting devices," IEEE Transactions on Industrial Informatics, vol. 14, no. 10, pp. 4642–4655, 2018.

[17] S. Lakshminarayana, T. Q. S. Quek, and H. V. Poor, "Cooperation and storage tradeoffs in power grids with renewable energy resources," IEEE Journal on Selected Areas in Communications, vol. 32, no. 7, pp. 1386–1397, 2014.

[18] W. Shi, N. Li, C. Chu, and R. Gadh, "Real-time energy management in microgrids," IEEE Transactions on Smart Grid, vol. 8, no. 1, pp. 228–238, 2017.

[19] X. Wang, Y. Zhang, T. Chen, and G. B. Giannakis, "Dynamic energy management for smart-grid-powered coordinated multipoint systems," IEEE Journal on Selected Areas in Communications, vol. 34, no. 5, pp. 1348–1359, 2016.

[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

[21] R. Li, Z. Zhao, Q. Sun, I. Chih-Lin, C. Yang, X. Chen, M. Zhao, and H. Zhang, "Deep reinforcement learning for resource management in network slicing," IEEE Access, vol. 6, pp. 74429–74441, 2018.

[22] Y. He, N. Zhao, and H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 67, no. 1, pp. 44–55, 2017.

[23] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, "Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach," IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2377–2392, 2019.

[24] U. Challita, L. Dong, and W. Saad, "Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective," IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4674–4689, 2018.

[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[26] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul. 2018, pp. 1861–1870.

[27] Y. Sun, S. Zhou, and J. Xu, "EMM: Energy-aware mobility management for mobile edge computing in ultra dense networks," IEEE Journal on Selected Areas in Communications, vol. 35, no. 11, pp. 2637–2646, 2017.

[28] L. Chen, S. Zhou, and J. Xu, "Computation peer offloading for energy-constrained mobile edge computing in small-cell networks," IEEE/ACM Transactions on Networking, vol. 26, no. 4, pp. 1619–1632, 2018.

[29] S. Bi and Y. J. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Transactions on Wireless Communications, vol. 17, no. 6, pp. 4177–4190, 2018.

[30] J. L. D. Neto, S. Yu, D. F. Macedo, J. M. S. Nogueira, R. Langar, and S. Secci, "ULOOF: A user level online offloading framework for mobile edge computing," IEEE Transactions on Mobile Computing, vol. 17, no. 11, pp. 2660–2674, 2018.

[31] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning," IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4005–4018, 2019.

[32] L. Huang, S. Bi, and Y. J. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Transactions on Mobile Computing, pp. 1–1, 2019.

[33] T. Q. Dinh, Q. D. La, T. Q. S. Quek, and H. Shin, "Learning for computation offloading in mobile edge computing," IEEE Transactions on Communications, vol. 66, no. 12, pp. 6353–6367, 2018.

[34] S. Mittal, "Power management techniques for data centers: A survey," arXiv preprint arXiv:1404.6681, 2014.

[35] A. Varghese, J. Milthorpe, and A. P. Rendell, "Performance and energy analysis of scientific workloads executing on LPSoCs," in Parallel Processing and Applied Mathematics, R. Wyrzykowski, J. Dongarra, E. Deelman, and K. Karczewski, Eds. Cham: Springer International Publishing, 2018, pp. 113–122.

[36] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer Science & Business Media, 2008.

[37] T. Ishihara and H. Yasuura, "Voltage scheduling problem for dynamically variable voltage processors," in Proceedings of the 1998 International Symposium on Low Power Electronics and Design, 1998.

[38] N. B. Rizvandi, A. Y. Zomaya, Y. C. Lee, A. J. Boloori, and J. Taheri, Multiple Frequency Selection in DVFS-Enabled Processors to Minimize Energy Consumption. John Wiley & Sons, Ltd, ch. 17, pp. 443–463.

[39] C. Liu, M. Bennis, M. Debbah, and H. V. Poor, "Dynamic task offloading and resource allocation for ultra-reliable low-latency edge computing," IEEE Transactions on Communications, vol. 67, no. 6, pp. 4132–4150, 2019.

[40] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, "Power-delay tradeoff in multi-user mobile-edge computing systems," in 2016 IEEE Global Communications Conference (GLOBECOM), 2016.

[41] J. Wang, Y. Liu, and B. Li, "Reinforcement learning with perturbed rewards," arXiv preprint arXiv:1810.01032, 2020.

[42] S. G. Khan, G. Herrmann, F. L. Lewis, T. Pipe, and C. Melhuish, "Reinforcement learning and optimal adaptive control: An overview and implementation examples," Annual Reviews in Control, vol. 36, no. 1, pp. 42–59, 2012.

[43] S. Han and Y. Sung, "Diversity actor-critic: Sample-aware entropy regularization for sample-efficient exploration," arXiv preprint arXiv:2006.01419, 2020.

[44] S. Bae, "Mobile edge computing environment with SAC algorithm," https://github.com/sosam002/KAIST MEC simulator/tree/master/MCES sac TON, 2020.

[45] J. Nocedal and S. J. Wright, "Sequential quadratic programming," Numerical Optimization, pp. 529–562, 2006.

[46] G. Scutari and Y. Sun, Parallel and Distributed Successive Convex Approximation Methods for Big-Data Optimization. C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018.

[47] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.