
Deep Echo State Q-Network (DEQN) and Its Application in Dynamic Spectrum Sharing for 5G and Beyond

Hao-Hsuan Chang, Lingjia Liu, and Yang Yi

Abstract—Deep reinforcement learning (DRL) has been shown to be successful in many application domains. Combining recurrent neural networks (RNNs) and DRL further enables DRL to be applicable in non-Markovian environments by capturing temporal information. However, training of both DRL and RNNs is known to be challenging, requiring a large amount of training data to achieve convergence. In many targeted applications, such as those in fifth generation (5G) cellular communication, the environment is highly dynamic while the available training data is very limited. Therefore, it is extremely important to develop DRL strategies that can capture the temporal correlation of the dynamic environment while requiring limited training overhead. In this paper, we introduce the deep echo state Q-network (DEQN), which can adapt to a highly dynamic environment in a short period of time with limited training data. We evaluate the performance of the introduced DEQN method under the dynamic spectrum sharing (DSS) scenario, a promising technology in 5G and future 6G networks to increase spectrum utilization. Compared to the conventional spectrum management policy that grants a fixed spectrum band to a single system for exclusive access, DSS allows the secondary system to share the spectrum with the primary system. Our work sheds light on the application of an efficient DRL framework in highly dynamic environments with limited available training data.

Index Terms—Echo state networks, deep reinforcement learning, convergence rate, dynamic spectrum sharing, 5G, and 6G

I. INTRODUCTION

In the last few years, deep reinforcement learning (DRL) has been widely adopted in different fields, ranging from playing video games [1] and chess [2] to robotics [3]. DRL provides a flexible solution for many types of problems because it does not need to model complex systems or to label data for training. Utilizing recurrent neural networks (RNNs) in DRL, the deep recurrent Q-network (DRQN) was introduced to process the temporal correlation of input sequences in a non-Markovian environment [4]. Even though DRQN is a powerful machine learning tool, it faces serious training issues for two reasons: 1) DRL requires a relatively large amount of training data and computational resources to make the learning agent converge to an appropriate policy, which is a major bottleneck for applying DRL to many real-world applications [5]. 2) The kernel of DRQN, the RNN, has issues related to vanishing and exploding gradients that make the underlying training difficult [6]. Therefore, the difficulties of training DRL agents and RNNs make the training of DRQNs an extremely challenging problem and prevent it from being widely adopted for analyzing time-dynamic applications.

The authors are with the Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA. The work is supported by the US National Science Foundation under grants ECCS-1811497 and CCF-1937487. The corresponding author is L. Liu ([email protected]).

In light of these training challenges, in this work we exploit a special type of RNN, the echo state network (ESN), to reduce the training time and the required training data [7]. ESNs simplify the underlying RNN training by only training the output weights while leaving the input weights and recurrent weights untrained. Existing research shows that ESNs can achieve performance comparable to RNNs, especially in tasks requiring fast learning [8]. Accordingly, in this work, we adopt ESNs as the Q-networks in the DRL framework, which is referred to as the deep echo state Q-network (DEQN). We will show that DEQN has the benefit of learning a good policy with short training time and limited training data.

Fueled by the popularity of smartphones as well as the upcoming deployment of fifth generation (5G) mobile broadband networks, mobile data traffic will grow at a compound annual growth rate (CAGR) of 46 percent between 2017 and 2022, reaching 77.5 exabytes (EB) per month by 2022 [9]. A significant portion of this data traffic will be real-time or delay-sensitive. For example, live video will grow 9-fold from 2017 to 2022, while virtual reality and augmented reality traffic will increase 12-fold at a CAGR of 63 percent. This suggests that future wireless networks will likely face a pressing demand to conduct real-time processing of large data volumes in an efficient way. In 5G networks, massive connectivity is regarded as a primary use case with dynamic spectrum sharing (DSS) as an enabling technology. In fact, DSS has been announced as a key technology for 5G by many companies and operators around the world, including Qualcomm, Ericsson, AT&T, and Verizon [10], [11]. Unlike the current static spectrum management policy that gives a single system the exclusive right to access the spectrum, DSS adopts a more flexible policy with a hierarchical access structure of primary users (PUs) and secondary users (SUs) [12]. SUs are allowed to access the licensed spectrum as long as PUs receive only tolerable interference.

Obtaining control information from the environment is costly in 5G mobile wireless networks. First, a SU cannot detect the activities of all PUs simultaneously because performing spectrum sensing is energy-consuming. Second, exchanging control information between wireless devices imposes a control overhead on wireless network operations. Therefore, the major challenge of DSS is how to optimize the system performance under limited information exchange between the secondary system and the primary system. DRL is a suitable framework for developing DSS strategies because of its ability to adapt to an unknown environment without modeling the complex 5G network. However, DRL usually requires a huge amount of training data and long training time, while wireless networks are dynamic due to factors such as path loss, shadow fading, and multi-path fading [13], which largely decreases the number of effective training samples that reflect the latest environment. Furthermore, the performance of spectrum sharing depends on the access strategies of multiple users. If one user changes its access strategy, then other users have to change their access strategies accordingly. Under these circumstances, the number of effective training samples reflecting the latest wireless environment will be extremely limited. As a result, designing an efficient DRL framework that only requires a small amount of training data is critical for 5G and future 6G DSS networks. In this work, we introduce DEQN to learn a spectrum access strategy for each SU in a distributed fashion with limited training data and short training time in highly dynamic 5G networks.

The main contributions of our work are as follows:

• We design an efficient DRL framework, DEQN, to adapt to a highly dynamic environment with limited training data, and provide training strategies for the introduced DEQN.

• We apply the DEQN method to the critical problem of DSS for 5G networks, where the system is highly dynamic and interactive. Compared to existing DRL-based strategies, our method can quickly adapt to the real mobile wireless environment to achieve improved network performance with limited training data.

• This work is the first to formulate a DRL strategy that jointly considers spectrum sensing and spectrum sharing in the underlying DSS network for 5G.

II. PROBLEM DEFINITION FOR DSS

In this section, we introduce the DSS problem and discuss its challenges. We consider a DSS system where the primary network consists of M PUs and the secondary network consists of N SUs. It is assumed that one wireless channel is allocated to each PU individually and cross-channel interference is negligible. We consider a discrete time model, where the dynamics of the DSS system, such as the behaviors of users and changes of the wireless environment, are constrained to happen at discrete time slots t (t is a natural number). Our goal is to develop a distributed DSS strategy for each SU to increase the spectrum utilization without harming the primary network's performance.

Fig. 1: The desired links, the interference links, and the sensing links when PU1, SU1, and SU2 are operating on the same channel. PUT/SUT represent the transmitters of PU/SU and PUR/SUR represent the receivers of PU/SU.

The data of a user are transmitted over the wireless link between its transmitter and receiver. The signal-to-interference-plus-noise ratio (SINR) is a quality measure of the wireless connection that compares the power of the desired signal to the sum of the interference power and the background noise power. The higher the SINR, the better the quality of the wireless connection. The SINR of user k's wireless connection on channel m at time slot t is written as

SINR^k_m[t] = \frac{P^k \left|H^k[t]\right|^2}{\sum_{z \in \Phi^k_m} P^z \left|H^{zk}[t]\right|^2 + N_m}    (1)

where P^k and P^z are the transmit powers of user k and user z, respectively, Φ^k_m is the set containing all the users that are transmitting on channel m except user k, H^k[t] is the channel gain of the desired link of user k, H^{zk}[t] is the channel gain of the interference link between user z's transmitter and user k's receiver, and N_m is the background noise power on channel m. Note that all channel gains change over time, so the SINR is also time-variant. The desired link is the link between the transmitter and the receiver of the same user. An interference link is the link between the transmitter and the receiver of two different users transmitting on the same channel simultaneously. Figure 1 shows the complicated association of desired links and interference links when PU1, SU1, and SU2 are operating on the same channel. Since cross-channel interference is negligible, the interference link between two users operating on different channels is not considered.
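As a concrete illustration of Equation (1), the following sketch computes the SINR of one user on one channel from hypothetical transmit powers and channel gains; all function names and numbers here are illustrative assumptions rather than values from the paper.

import numpy as np

def sinr(p_k, h_k, interferers, noise_power):
    """SINR of user k on one channel, following Equation (1).

    p_k         : transmit power of user k (W)
    h_k         : complex channel gain of user k's desired link
    interferers : list of (p_z, h_zk) pairs for users sharing the channel
    noise_power : background noise power N_m (W)
    """
    signal = p_k * abs(h_k) ** 2
    interference = sum(p_z * abs(h_zk) ** 2 for p_z, h_zk in interferers)
    return signal / (interference + noise_power)

# Hypothetical example: one desired link and two interferers on the channel.
print(10 * np.log10(sinr(0.5, 0.01 + 0.02j,
                         [(0.5, 0.001 + 0.001j), (0.5, 0.002j)],
                         1e-12)))  # SINR in dB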

The radio signal attenuates as it propagates through space between the transmitter and the receiver, which is referred to as path loss. In addition to path loss, the channel gain is affected by many factors such as shadow fading and multi-path fading. Shadow fading is caused by a large obstacle, such as a hill or a building, obscuring the main signal path between the transmitter and the receiver. Multi-path fading occurs in any environment where multiple propagation paths exist between the transmitter and the receiver, which may be caused by reflection, diffraction, or scattering. In the telecommunication community, channel models are carefully designed to be consistent with wireless field measurements. We generate channel gains based on the WINNER II channel model [14], which is widely used in industry to make fair comparisons of telecommunication algorithms.

To enable the protection of the primary network, we assume that a PU will broadcast a warning signal if its data transmission experiences a low SINR. There are two possible causes of low SINR. First, the wireless connection of the desired link of the PU is in a deep fade, meaning the channel gain of the desired link is low. This leads to a small numerator in Equation (1), so the SINR is low. Second, the signals from one or more SUs cause strong interference to the PU when they are transmitting over the same wireless channel at the same time. This leads to a large denominator in Equation (1), so the SINR is again low. We say that the SUs "collide" with the PU in this case. The warning signal contains information about which PU may be interfered with, so that the SUs transmitting on the same channel are aware of the issue. In fact, this kind of warning signal is similar to the control signals (e.g., synchronization, downlink/uplink control) used in current 4G and 5G networks. It is common to assume that control signals are received perfectly at receivers; otherwise, the underlying network would not work at all. In practice, the control signal can be transmitted through a dedicated control channel. According to this mechanism, a PU broadcasts a warning signal once its received SINR is low, and this is the only control information from the primary system to the secondary system to enable the protection of PUs under DSS. Note that a PU may send a warning signal even when no collision happens because of a deep fade.

The activity of a PU consists of two states: (1) Active and (2) Inactive. If a PU is transmitting data, it is in the Active state; otherwise it is in the Inactive state. A spectrum opportunity on a channel occurs when the licensed PU of that channel is in the Inactive state or when a SU can transmit on that channel with little interference to the Active licensed PU. Unfortunately, it is difficult for a SU to obtain the activity states of PUs or the interference that it will cause in highly dynamic 5G networks. A SU has to perform spectrum sensing to detect the activity of a PU, but the accuracy of detection depends on the wireless link between the transmitters of the PU and the SU, the background noise, and the transmit power of the PU. On the other hand, the interference level caused by a SU is determined by the interference link from the SU to the PU, the desired link of the PU, the transmit powers of the PU and the SU, and the background noise. Furthermore, all these factors for determining spectrum opportunities are time-variant, so control information becomes outdated quickly. Since obtaining control information is costly in 5G mobile wireless networks, it is impractical to design a DSS strategy by assuming that all the control information is known.

SUs should protect PUs from harmful interference since the primary system is the spectrum licensee. A commonly used method is that the transmitter of a SU performs spectrum sensing to detect the activity of a PU before accessing a channel. Due to power and complexity constraints, a SU is unable to perform spectrum sensing across all channels simultaneously. Therefore, we assume that a SU can only sense one channel at a particular time. We adopt the energy detector as the underlying spectrum sensing method, which is the most common one due to its low complexity and cost. The energy detector of SU n first computes the energy of received signals on channel m as follows:

E^n_m[t] = \sum_{t'=t}^{t+T_s-1} \left|y^n_m[t']\right|^2    (2)

where t is the starting time slot of the spectrum sensing, y^n_m[t'] is the received signal at time slot t', and T_s is the number of time slots of the spectrum sensing. We consider a half-duplex SU system where a SU cannot transmit data and perform spectrum sensing at the same time. We assume a periodic time structure of spectrum sensing and data transmission, as shown in Figure 2. To be specific, the kth sensing and transmission period contains T time slots from kT + 1 to (k + 1)T; the spectrum sensing occupies the first T_s time slots of the period, from kT + 1 to kT + T_s, and the data transmission occupies the subsequent T − T_s time slots, from kT + T_s + 1 to (k + 1)T.

Fig. 2: The time structure of spectrum sensing and data transmission.

The received signal y^n_m[t'] depends on the activity state of PU m, the power of PU m, the background noise, and the sensing link between the transmitters of PU m and SU n. When PU m is in the Inactive state, the received signal is represented as

y^n_m[t'] = \omega_m[t']    (3)

When PU m is in the Active state, the received signal is represented as

y^n_m[t'] = \sqrt{P_m} \cdot H^{mn}[t'] + \omega_m[t']    (4)

where \omega_m[t'] \sim \mathcal{CN}(0, N_m) is circularly-symmetric Gaussian noise with zero mean and variance N_m, P_m is the transmit power of PU m, and H^{mn}[t'] is the channel gain of the sensing link between the transmitters of PU m and SU n.
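The sketch below simulates the energy detector of Equations (2)-(4) for one sensing window; the constant sensing-link gain over the window, the transmit power, and the noise variance are hypothetical placeholders used only for illustration.

import numpy as np

def sense_energy(pu_active, p_m, h_mn, noise_var, T_s, rng):
    """Energy E^n_m over T_s sensing slots, following Equations (2)-(4)."""
    # Circularly-symmetric complex Gaussian noise with variance noise_var.
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(T_s)
                                      + 1j * rng.standard_normal(T_s))
    # Received signal: noise only (Inactive PU) or attenuated PU signal plus noise (Active PU).
    y = noise if not pu_active else np.sqrt(p_m) * h_mn + noise
    return float(np.sum(np.abs(y) ** 2))

rng = np.random.default_rng(0)
energy = sense_energy(pu_active=True, p_m=0.5, h_mn=1e-4 + 2e-4j,
                      noise_var=1e-12, T_s=2, rng=rng)
print(energy)  # compared against a threshold to declare Active/Inactive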

If the energy computed in Equation (2) is higher than a threshold, the PU is considered to be in the Active state; otherwise the PU is considered to be in the Inactive state. The challenge of designing an energy detector is how to set the threshold properly. The value of the threshold is a trade-off between the detection probability and the false alarm probability. However, setting the threshold to achieve a good trade-off depends on many factors, including the channel gain of the sensing link, the transmit power of the PU, the noise variance, the number of received signals, etc. This information is difficult to obtain before deployment in the real environment and is time-variant. Furthermore, setting a threshold is difficult in some cases because of the relative positions of transmitters and receivers. As shown in Figure 1, the sensing link is between the transmitters of the PU and the SU, but the interference link is between the transmitter of the SU and the receiver of the PU. The discrepancy between the sensing link and the interference link may cause the hidden node problem, where the sensing link is weak but the interference link is strong. For example, the transmitters of a SU and a PU may be far away from each other while the SU transmitter is close to the receiver of the PU. In this case, the transmitters of the SU and the PU are hidden nodes with respect to each other. The warning signals from PUs are designed to provide additional protection to the primary system for the case where the SU cannot detect the activity of the PU, thereby mitigating the issues caused by hidden nodes. Meanwhile, instead of making the spectrum access decision solely based on the outcomes of the energy detector, we develop a DRL framework to construct a novel spectrum access policy: the DRL agent uses the sensed energy as the input to learn a spectrum access strategy that maximizes the cumulative reward. The reward is designed to maximize the spectral-efficiencies of SUs while enabling the protection of PUs with the help of the warning signals from PUs.

III. DRL FRAMEWORK FOR DSS AND DEQN

A. Background on DRL

RL is a type of machine learning that provides a flexible architecture for solving many types of practical problems because it does not need to model complex systems or to label data for training. In RL, an agent learns how to select actions to maximize the cumulative reward in a stochastic environment. The dynamics of the environment are usually modeled as a Markov decision process (MDP), which is characterized by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the state transition kernel providing Pr(s_{t+1} | s_t, a_t), R is the reward function providing r_t = R(s_t, a_t), and γ is a discount factor for calculating the cumulative reward. Specifically, at time t, the state is s_t ∈ S, the RL agent selects an action a_t ∈ A by following a policy π(s_t) and receives the reward r_t, and then the system shifts to the next state s_{t+1} according to the state transition probability. Note that the action a_t affects both the immediate reward r_t and the next state s_{t+1}. Consequently, all subsequent rewards are affected by the current action. The goal of the RL agent is to find a policy π that maximizes the cumulative reward \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right].

In RL, a model-free algorithm does not require the state transition probability for learning, which is useful when the underlying system is complicated and difficult to model. Q-learning [15] is the most widely used model-free RL algorithm; it aims to find the Q-function of each state-action pair for a given policy, which is defined as

Q^\pi(s_t, a_t) = \mathbb{E}\left[\sum_{t'=1}^{\infty} \gamma^{t'-1} r_{t'} \,\middle|\, s_1 = s_t, a_1 = a_t\right].    (5)

The Q-function represents the cumulative reward obtained when taking action a_t in state s_t and then following policy π. Q-learning constructs a Q-table to estimate the Q-function of each state-action pair by iteratively updating each element of the Q-table through dynamic programming. The update rule of the Q-table is given as follows:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right],    (6)

where α ∈ (0, 1) is the learning rate. The policy π used to select actions is the ε-greedy policy:

a_t = \begin{cases} \arg\max_a Q(s_t, a), & \text{with probability } 1 - \varepsilon, \\ \text{random action}, & \text{with probability } \varepsilon, \end{cases}    (7)

where ε ∈ [0, 1] is the exploration probability. However, Q-learning performs poorly when the dimension of the state is high because updating a large Q-table makes training difficult or even impossible.
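For reference, a minimal sketch of the tabular Q-learning update in Equation (6) combined with the ε-greedy policy of Equation (7); the state and action space sizes are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 8, 4           # arbitrary sizes for illustration
Q = np.zeros((n_states, n_actions))  # the Q-table
alpha, gamma, eps = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration probability

def epsilon_greedy(s):
    """Equation (7): explore with probability eps, otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """Equation (6): one temporal-difference update of the Q-table."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])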

The Deep Q-Network (DQN) [1] was introduced to solve high-dimensional state problems by leveraging a neural network as the function approximator of the Q-table, which is referred to as the Q-network. Specifically, the Q-network takes the state s_t as input and outputs the estimated Q-function of all possible actions. One key approach of DQN to improve the training stability is to create two Q-networks: the evaluation network Q(s, a; θ) and the target network Q(s, a; θ^-). The target network is used to generate the targets for training the evaluation network, while the evaluation network is used to determine the actions. The loss function for training the evaluation network is written as

\left(r_t + \gamma \max_a Q(s_{t+1}, a; \theta^-) - Q(s_t, a_t; \theta)\right)^2,    (8)

where r_t + \gamma \max_a Q(s_{t+1}, a; \theta^-) is the target Q-value. The weights θ^- of the target network are periodically synchronized with the weights θ of the evaluation network. The purpose is to temporarily fix the targets during training to improve the training stability of the evaluation network.

An improvement of DQN that prevents overestimation of Q-values is double Q-learning [16], where the evaluation network is used to select the action when computing the target Q-value, but the target Q-value is still generated by the target network. Specifically, the target Q-value for the evaluation network is calculated by

r_t + \gamma Q(s_{t+1}, a'; \theta^-),    (9)

where a' = \arg\max_a Q(s_{t+1}, a; \theta). Double Q-learning can improve the accuracy of the estimated Q-function and thereby improves the learned policy.

B. Existing DRL-based Strategies for DSS

DRL-based methods have recently been applied to dynamic spectrum access (DSA) networks [17]–[19], where the focus is exclusively on the "access" part of the problem with an over-simplified network setup. To be specific, [17] considers a single SU that selects one channel to access in a multichannel environment, and the goal is to maximize the number of times a good channel is selected for access. [18] assumes that the available spectrum channels are known a priori and develops a centralized spectrum access algorithm for multi-user access. Both [17] and [18] assume that one channel can only be used by one user at any particular time. Although [19] allows multiple SUs to access a channel at the same time, a SU cannot access a channel that a PU is using. [19] also assumes that each SU can sense all channels simultaneously and that collisions between a PU and a SU can be perfectly detected.


In this work, in order to provide a comprehensive study of the impact of DEQN on relevant DSS networks for 5G, we consider practical DSS situations where 1) mobile users cannot conduct spectrum sensing perfectly; 2) mobile users cannot sense multiple channels at a particular time; 3) there are multiple PUs and multiple SUs in a DSS network; and 4) a channel can be shared by multiple users if the interference between them is weak. Furthermore, unlike previous work that utilizes binary ACK/NACK feedback as the reward function, we calculate a practical reward based on the spectral-efficiency of each mobile link. To be closely in line with the real wireless environment, the spectral-efficiency of a mobile link is calculated using the transmission procedure defined in the telecommunication standard. In this way, we can train and evaluate the underlying DEQN-based DRL strategies in realistic 5G application scenarios. It is important to note that we treat the unprocessed soft spectrum sensing information as the input states of the DRL agent. Soft spectrum sensing information can be directly obtained from spectrum sensing sensors. Through the soft spectrum sensing input, the DRL agent learns an appropriate detection criterion for each SU that adapts to different mobile wireless environments, geometries of mobile users, and activities of mobile users. This is indeed the first work to study DSS that combines soft spectrum sensing information and spectrum access strategies through the DRL framework.

C. DRL Problem Formulation for DSS

We now formulate the DSS problem using the DRL framework, where all SUs in the secondary system learn their spectrum access strategies in a distributed fashion through interactions with the mobile wireless environment. To be specific, we assume that each SU has a DRL agent that takes its observed state as the input and learns how to perform spectrum sensing and access actions in order to maximize its cumulative reward. The reward for each SU is designed to maximize its spectral efficiency and to prevent harmful interference to PUs.

The state of SU n in the kth sensing and transmission period is denoted by

s^n[k] = \left(E^n[k], Q^n[k]\right),    (10)

where k is a non-negative integer, E^n[k] is the energy of the received signals, and Q^n[k] is a one-hot M-dimensional vector indicating the channel sensed from time slots kT + 1 to kT + T_s. If the index of the sensed channel is m, then the mth element of Q^n[k] is equal to one while the other elements of Q^n[k] are zeros. On the other hand, E^n[k] is equal to E^n_m[kT] as calculated by Equation (2).

The action of SU n in the kth sensing and transmission period is denoted by

a^n[k] = \left(q^n[k], z^n[k]\right),    (11)

where q^n[k] ∈ {0, 1} indicates whether SU n will access the currently sensed channel (q^n[k] = 1) or stay idle (q^n[k] = 0) during the data transmission part of the kth period (from time slots kT + T_s + 1 to (k + 1)T), and z^n[k] ∈ {1, ..., M} indicates that SU n will sense channel z^n[k] during the sensing part of the (k + 1)th period (from time slots (k + 1)T + 1 to (k + 1)T + T_s). In other words, SU n makes two decisions: q^n[k] decides whether to conduct data transmission in the currently sensed channel of the kth period, and z^n[k] decides which channel to sense in the (k + 1)th period. Therefore, the dimension of each SU's action space is 2M. Note that the sensed channel in the kth period may be different from that in the (k + 1)th period.

TABLE I: The SINR and CQI mapping to modulation and coding rate.

CQI index   SINR (dB, ≥)   Modulation   Code rate (×1024)   Efficiency (bits per symbol)
0           out of range   -            -                   -
1           -6.9360        QPSK         78                  0.1523
2           -5.1470        QPSK         120                 0.2344
3           -3.1800        QPSK         193                 0.3770
4           -1.2530        QPSK         308                 0.6016
5           0.7610         QPSK         449                 0.8770
6           2.6990         QPSK         602                 1.1758
7           4.6940         16QAM        378                 1.4766
8           6.5250         16QAM        490                 1.9141
9           8.5730         16QAM        616                 2.4063
10          10.3660        64QAM        466                 2.7305
11          12.2890        64QAM        567                 3.3223
12          14.1730        64QAM        666                 3.9023
13          15.8880        64QAM        772                 4.5234
14          17.8140        64QAM        873                 5.1152
15          19.8290        64QAM        948                 5.5547

In our work, we use a discrete reward function similar to those of existing DRL-based DSS methods. Compared to the simple binary rewards (0 and +1, or −1 and +1) in [17] and [18], we consider a more relevant and comprehensive reward design based on the achieved modulation and coding scheme (MCS) adopted in the 3GPP LTE/LTE-Advanced standard [20]. To be specific, a receiver measures the SINR to evaluate the quality of the wireless connection and feeds back the corresponding Channel Quality Indicator (CQI) to the transmitter [21]. In this work, we follow the method presented in [22] to map the received SINR to the CQI. After receiving the CQI, the transmitter determines the MCS for data transmission based on the CQI table specified in the 3GPP standard [20]. The SINR and CQI mapping to MCS is given in Table I for reference. Accordingly, the achieved spectral-efficiency, representing the average number of information bits per symbol, can be calculated as (bits per modulation symbol) × (code rate). This critical metric is utilized as the reward function of our design.
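A sketch of the SINR-to-spectral-efficiency lookup described above, using the thresholds and efficiencies of Table I; the exact CQI selection rule of [22] may differ, so this is only an approximation of that mapping.

import bisect

# SINR thresholds (dB) and efficiencies (bits/symbol) for CQI 1..15, taken from Table I.
SINR_THRESHOLDS = [-6.936, -5.147, -3.180, -1.253, 0.761, 2.699, 4.694,
                   6.525, 8.573, 10.366, 12.289, 14.173, 15.888, 17.814, 19.829]
EFFICIENCY = [0.1523, 0.2344, 0.3770, 0.6016, 0.8770, 1.1758, 1.4766,
              1.9141, 2.4063, 2.7305, 3.3223, 3.9023, 4.5234, 5.1152, 5.5547]

def spectral_efficiency(sinr_db):
    """Return bits/symbol for the highest CQI whose SINR threshold is met (0.0 if none)."""
    cqi = bisect.bisect_right(SINR_THRESHOLDS, sinr_db)  # number of thresholds satisfied
    return EFFICIENCY[cqi - 1] if cqi > 0 else 0.0

print(spectral_efficiency(9.0))  # -> 2.4063 (CQI 9, 16QAM)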

To jointly consider the performance of the primary and the secondary systems, the reward function corresponding to SU n accessing channel m depends on the spectral-efficiencies of both SU n and PU m. During time slots kT + T_s + 1 to (k + 1)T, the average spectral-efficiency of SU n, e_n[k], and the average spectral-efficiency of PU m, e_m[k], are calculated by

e_n[k] = \frac{1}{T - T_s} \sum_{t' = kT + T_s}^{(k+1)T - 1} e^n_m[t'], \qquad e_m[k] = \frac{1}{T - T_s} \sum_{t' = kT + T_s}^{(k+1)T - 1} e^m_m[t']    (12)


where e^n_m[t'] and e^m_m[t'] represent the spectral-efficiencies of SU n and PU m on channel m at time slot t', respectively.

The reward of SU n in the kth transmission period is defined as

r_n[k] = \begin{cases} -2, & \text{if } e_m[k] < 1.5 \\ -1, & \text{if SU } n \text{ is idle in the } k\text{th period} \\ 0, & \text{if } e_m[k] \geq 1.5 \text{ and } e_n[k] < 1 \\ 1, & \text{if } e_m[k] \geq 1.5 \text{ and } 1 \leq e_n[k] < 2 \\ 2, & \text{if } e_m[k] \geq 1.5 \text{ and } 2 \leq e_n[k] < 3 \\ 3, & \text{if } e_m[k] \geq 1.5 \text{ and } e_n[k] \geq 3 \end{cases}    (13)

To enable the protection of the primary system, PU m broadcasts a warning signal if its average spectral-efficiency falls below 1.5, and then the reward received by any SU n that accesses channel m is set to −2. To motivate SUs to explore spectrum opportunities, the reward r_n[k] is set to −1 if SU n decides to be idle in the kth transmission period. When PU m does not suffer from strong interference (its average spectral-efficiency is at least 1.5), the reward r_n[k] increases from 0 to 3 as the average spectral-efficiency of SU n increases (see Equation (13)). Note that a low spectral-efficiency of a PU or a SU does not necessarily indicate a collision because the underlying wireless channels change dynamically over time. If the channel gain of the wireless link is small, the spectral-efficiency of the user will be low even if there is no collision. Therefore, the reward function and the warning signal are introduced because it is impossible to detect collisions perfectly in practical wireless environments.
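A direct transcription of the reward in Equation (13), sketched under the assumption that the warning-signal case is detected through the PU's average spectral-efficiency e_m falling below 1.5:

def reward(idle, e_pu, e_su):
    """Reward of SU n in one period, following Equation (13).

    idle : True if the SU did not access any channel in this period
    e_pu : average spectral-efficiency of the PU on the accessed channel
    e_su : average spectral-efficiency of the SU
    """
    if idle:
        return -1
    if e_pu < 1.5:   # the PU broadcasts a warning signal
        return -2
    if e_su < 1:
        return 0
    if e_su < 2:
        return 1
    if e_su < 3:
        return 2
    return 3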

D. Efficient Training for DEQN

To capture the activity patterns of PUs, which are usually time-dependent, applying DRQNs is a natural choice. Although DQNs are able to learn temporal correlation by stacking a history of states in the input, the sufficient number of stacked states is unknown because it depends on the PUs' behavior patterns. RNNs are a family of neural networks for processing sequential data without specifying the length of the temporal correlation.

However, the training of RNNs is known to be difficult, suffering from the vanishing and exploding gradient problems. Furthermore, the amount of training data required to achieve convergence is large in the DRL setting, since there are no explicit labels to guide the training and the agents have to learn by interacting with their environment. In the wireless environment, the channel gain of a wireless link changes rapidly, as shown in Figure 3. Note that the environment observed by a SU is affected by other SUs' access strategies because of possible collisions between SUs, and all SUs are dynamically adjusting their DSS strategies during their training processes. As a result, in the DSS problem, the duration over which a learning environment remains stable is short and the available training data is very limited.

Fig. 3: Time-variant channel gain of a wireless link.

The standard training technique for RNNs is to unfold the network in time into a computational graph with a repetitive structure, which is called backpropagation through time (BPTT). BPTT suffers from a slow convergence rate and needs many training examples. DRQN also requires a large amount of training data because a learning agent finds a good policy by exploring the environment with different potential policies. Unfortunately, in the DSS problem, there are only limited training data for a stable environment due to dynamic channel gains, partial sensing, and the existence of multiple SUs. To address this issue, we use ESNs as the Q-networks in the DRQN framework to rapidly adapt to the environment. ESNs simplify the training of RNNs significantly by keeping the input weights and recurrent weights fixed and only training the output weights.

We denote the sequence of states for SU n by {s^n[1], s^n[2], ...}. Accordingly, the sequence of hidden states, {h^n[1], h^n[2], ...}, is updated by

h^n[k] = (1 - \beta) \cdot h^n[k-1] + \beta \cdot \tanh\left(W^n_{\mathrm{in}} s^n[k] + W^n_{\mathrm{rec}} h^n[k-1]\right),    (14)

where W^n_{\mathrm{in}} is the input weight matrix, W^n_{\mathrm{rec}} is the recurrent weight matrix, β ∈ [0, 1] is the leaky parameter, and we let h^n[0] = 0. The output sequence, {o^n[1], o^n[2], ...}, is computed by

o^n[k] = W^n_{\mathrm{out}} u^n[k]    (15)

where u^n[k] is the concatenation of s^n[k] and h^n[k], and W^n_{\mathrm{out}} is the output weight matrix. Note that the output o^n[k] is a 2M-dimensional vector, where each element of o^n[k] corresponds to the estimated Q-value of one of the possible actions given the states s^n[1], ..., s^n[k].
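A minimal sketch of the leaky reservoir update in Equation (14) and the linear readout in Equation (15); the uniform weight initialization and spectral-radius normalization below are common ESN conventions assumed for illustration, not the paper's exact initialization.

import numpy as np

class EchoStateQNetwork:
    """Single-reservoir DEQN sketch: only W_out is trainable (Equations (14)-(15))."""

    def __init__(self, state_dim, n_neurons, n_actions, beta=0.7, rho=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.beta = beta
        self.W_in = rng.uniform(-0.5, 0.5, (n_neurons, state_dim))   # fixed input weights
        W = rng.uniform(-0.5, 0.5, (n_neurons, n_neurons))
        self.W_rec = rho * W / np.max(np.abs(np.linalg.eigvals(W)))  # fixed recurrent weights, scaled for the echo state property
        self.W_out = np.zeros((n_actions, state_dim + n_neurons))    # the only trained weights
        self.h = np.zeros(n_neurons)

    def step(self, s):
        """Equation (14): leaky update of the hidden state."""
        self.h = (1 - self.beta) * self.h + self.beta * np.tanh(
            self.W_in @ s + self.W_rec @ self.h)
        return self.h

    def q_values(self, s, h):
        """Equation (15): linear readout over the concatenated vector u = [s; h]."""
        return self.W_out @ np.concatenate([s, h])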

The double Q-learning algorithm [16] is adopted to train the underlying DEQN agent of each SU. As discussed in Section III-A, each DEQN agent has two Q-networks: the evaluation network and the target network. Let the output sequences from the evaluation network and the target network be {o^n_θ[1], o^n_θ[2], ...} and {o^n_{θ^-}[1], o^n_{θ^-}[2], ...}, respectively. The loss function for training the evaluation network of SU n is written as

\left(r^n[k] + \gamma\, o^n_{y,\theta^-}[k+1] - o^n_{y,\theta}[k]\right)^2,    (16)

where o^n_{y,θ^-}[k+1] and o^n_{y,θ}[k] are the yth elements of o^n_{θ^-}[k+1] and o^n_θ[k], respectively, y is the index of the maximum element of o^n_θ[k+1], and r^n[k] + γ o^n_{y,θ^-}[k+1] is the target Q-value. To stabilize the training targets, the target network is only periodically synchronized with the evaluation network.


Algorithm 1 The training algorithm for DEQN.

Initialize the wireless environment with M PUs and N SUs.
Set the sensing and transmission period to T time slots and the sensing duration to T_s time slots.
Set the buffer size to Z, the number of training iterations to I, and the exploration probability to ε.
Randomly initialize an evaluation network DEQN^n_θ and a target network DEQN^n_{θ^-} with the same weights for each SU n.
Each SU n randomly selects one channel (= z^n[0]) to sense for T_s time slots and then computes the state s^n[1].
for q = 1, ... do
    Initialize an empty buffer B^n_q for each SU.
    for z = 1, ..., Z do
        Let k = (q − 1)Z + z.
        Each SU n inputs s^n[k] to DEQN^n_θ, calculates the hidden state h^n_θ[k], and outputs o^n_θ[k].
        Each SU n decides action a^n[k] = (q^n[k], z^n[k]) based on the ε-greedy policy, where a^n[k] is the index of the maximum element of o^n_θ[k] with probability 1 − ε and a^n[k] is chosen randomly with probability ε.
        Each SU n accesses channel z^n[k − 1] if q^n[k] = 1, or does not access if q^n[k] = 0, for T − T_s time slots.
        Each SU n obtains the reward r^n[k] according to Equation (13).
        Each SU n senses channel z^n[k] for T_s time slots and then computes the state s^n[k + 1].
        Each SU n inputs s^n[k + 1] to DEQN^n_{θ^-}, calculates the hidden state h^n_{θ^-}[k], and outputs o^n_{θ^-}[k].
        Each SU n stores (s^n[k], h^n_θ[k], a^n[k], r^n[k], s^n[k + 1], h^n_{θ^-}[k]) in B^n_q.
    end for
    for iteration = 1, ..., I do
        Each SU n samples a random training batch (s^n[k], h^n_θ[k], a^n[k], r^n[k], s^n[k + 1], h^n_{θ^-}[k]) from B^n_q.
        Each SU n inputs s^n[k] and h^n_θ[k] to DEQN^n_θ to calculate o^n_θ[k].
        Each SU n inputs s^n[k + 1] and h^n_{θ^-}[k] to DEQN^n_{θ^-} to calculate o^n_{θ^-}[k + 1].
        Each SU n updates DEQN^n_θ by performing a gradient descent step on (r^n[k] + γ o^n_{y,θ^-}[k + 1] − o^n_{y,θ}[k])^2, where y is the index of the maximum element of o^n_θ[k + 1].
    end for
    Each SU n synchronizes DEQN^n_{θ^-} with DEQN^n_θ.
end for

The input weights and the recurrent weights of ESNs are randomly initialized according to the constraints specified by the Echo State Property [23], and then they remain untrained. Only the output weights of ESNs are trained, so the training is extremely fast. The main idea of ESNs is to generate a large reservoir that contains the necessary summary of past input sequences for predicting targets. From Equation (14), we can observe that the hidden state h^n[k] at any given time slot k is unchanged during the training process if the input weights and recurrent weights are fixed. In contrast to conventional RNNs, which usually initialize the hidden states to zeros and waste some training examples to set them to appropriate values in each training iteration, the benefit of ESNs is that the hidden states do not need to be reinitialized in every training iteration. Therefore, the training process becomes extremely efficient, which is especially suitable for learning in a highly dynamic environment. Compared to storing (s[k], a[k], r[k], s[k+1]) as in the conventional DRQN framework, we also store the hidden states (h[k], h[k+1]) because the hidden states are unchanged. In this way, we do not have to waste training time and data recalculating hidden states in every training iteration. This largely boosts the training efficiency in the highly dynamic environment since we avoid using BPTT and only update the output weights of the networks. Furthermore, we can randomly sample from the replay memory to create a training batch, while conventional DRQN methods have to sample continuous sequences to create a training batch. Thus the training data can be used more efficiently in our DEQN method. The training data stored in the buffer are refreshed periodically in order to adapt to the latest environment. Therefore, our training method is an online training algorithm that keeps updating the learning agent. The training algorithm for DEQNs in the DSS problem is detailed in Algorithm 1.
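The sketch below mirrors the update phase of Algorithm 1 for a single SU, reusing the EchoStateQNetwork sketch above: transitions with pre-stored hidden states are sampled from the buffer and only W_out of the evaluation network is adjusted by gradient descent on the loss of Equation (16). This is a simplified stand-in, not the authors' exact implementation.

import numpy as np

def train_output_weights(eval_net, target_net, buffer, gamma=0.9, lr=1e-3, iters=100, seed=0):
    """Update only W_out of the evaluation network on stored transitions.

    buffer: list of (s, h, a, r, s_next, h_next) tuples, with the hidden states
            pre-stored as in Algorithm 1 so the reservoir never has to be re-run.
    """
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        s, h, a, r, s_next, h_next = buffer[rng.integers(len(buffer))]
        o_eval = eval_net.q_values(s, h)
        o_next_eval = eval_net.q_values(s_next, h_next)
        o_next_target = target_net.q_values(s_next, h_next)
        y = int(np.argmax(o_next_eval))        # action index chosen by the evaluation network
        target = r + gamma * o_next_target[y]  # target Q-value of Equation (16)
        td_error = target - o_eval[a]
        # Gradient step on the squared loss w.r.t. the accessed action's output row
        # (the constant factor 2 is absorbed into the learning rate).
        eval_net.W_out[a] += lr * td_error * np.concatenate([s, h])
    # Periodically copy the evaluation weights into the target network.
    target_net.W_out = eval_net.W_out.copy()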

IV. PERFORMANCE EVALUATION

A. Experimental Setup

We set the number of PUs and SUs to 4 and 6, respectively, and the locations of PUs and SUs are randomly placed in a 2000m × 2000m area. The distance between the transmitter and the receiver of each desired link is randomly chosen from 400m to 450m. Figure 4 shows the geometry of the DSS network, where PUT/SUT represent the transmitters of PU/SU and PUR/SUR represent the receivers of PU/SU. The channel gains of desired links, interference links, and sensing links are generated by the WINNER II channel model widely used in 3GPP LTE-Advanced and 5G networks [14]. In this case, there are 4 desired links for PUs, 6 desired links for SUs, 30 interference links between different SUs, 24 interference links between SUTs and PURs, 24 interference links between PUTs and SURs, and 24 sensing links between PUTs and SUTs. In total, 112 wireless links are generated in our simulation, which establishes a more complicated scenario than existing DRL-based DSS strategies [17]–[19]. Specifically, [17] considers each channel to have only two possible states (good or bad) without modeling the true wireless environment; [18] assumes that collisions between users can be perfectly detected without considering the dynamics of interference links; and [19] forbids SUs from accessing a channel when a PU is using it, without considering the actual interference links between PUs and SUs.

Fig. 4: The DSS network geometry. PUT/SUT represent the transmitter of PU/SU. PUR/SUR represent the receiver of PU/SU.

For each channel, the bandwidth is set to 5 MHz and the variance of the Gaussian noise is set to -157.3 dBm. The transmit powers of PUs and SUs are both set to 500 mW. We set the sensing and transmission period T to 10 time slots and the sensing duration T_s to 2 time slots, where one time slot represents an interval of 1 ms. We list all the parameters used to generate the wireless environment in Table II.

TABLE II: The values of the parameters for generating the wireless environment.

Parameter                           Value
number of PUs M                     4
number of SUs N                     6
simulation area                     2000m × 2000m
distance between user pair          400m-450m
transmit power of PU                500mW
transmit power of SU                500mW
variance of Gaussian noise          -157.3dBm
bandwidth of a channel              5MHz
interval of one time slot           1ms
sensing and transmission period T   10 time slots
sensing duration T_s                2 time slots

For the activity pattern of PUs, we let two PUs (PU1 and PU3) be in the Active state every 3T and the other two PUs (PU2 and PU4) be in the Active state every 4T. Each SU trains its DEQN agent and updates its policy accordingly after collecting 300 samples in the buffer. The buffer is refreshed after training, so we only use training data from the latest 3 sec. The total number of training samples is 60000, which requires 600 sec to collect. The initial exploration probability ε is set to 0.3 and then gradually decreases to 0. We first train the Q-network with a learning rate of 0.01, and then the learning rate decreases to 0.001 when ε is less than 0.2.

Fig. 5: The network architecture of DEQN.

B. Network Architecture

As shown in Figure 5, our DEQN network consists of L reservoirs for extracting the necessary temporal correlation to predict targets. The number of neurons in each reservoir is set to 32 and the leaky parameter β is set to 0.7 in Equation (14). During the training process, the input weights {W^(1)_in, ..., W^(L)_in} and the recurrent weights {W^(1)_rec, ..., W^(L)_rec} remain untrained. To find a good policy, only the output weight W_out is trained to read essential temporal information from the input states and the hidden states stored in the experience replay buffer. Existing research shows that stacking RNNs automatically creates different time scales at different levels, and this stacked architecture has a better ability to model long-term dependencies than a single-layer RNN [24]–[26]. We also find that stacking ESNs can indeed improve the performance in our experiment.
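One way to realize the stacked architecture of Figure 5 is to feed each reservoir's hidden state as the input of the next reservoir and read out from the concatenation of all hidden states; this particular layering is our assumption of a typical deep-ESN construction rather than a detail confirmed by the paper.

import numpy as np

def stacked_hidden_state(reservoirs, s):
    """Run L stacked reservoirs and return the concatenated hidden states.

    reservoirs: list of EchoStateQNetwork-like objects whose step() accepts
                the previous layer's output (the first layer receives the state s).
    """
    x = s
    states = []
    for res in reservoirs:
        x = res.step(x)      # this layer's hidden state feeds the next layer
        states.append(x)
    return np.concatenate(states)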

C. Results and Discussion

We evaluate our introduced DEQN method with three performance metrics: 1) the system throughput of PUs; 2) the system throughput of SUs; and 3) the required training time. The throughput represents the number of transmitted bits per second, calculated as (spectral-efficiency) × (bandwidth), and the system throughput is the sum of the users' throughput in the primary or secondary system. A good DSS strategy should increase the throughput of SUs as much as possible while the transmissions of SUs do not harm the throughput of PUs. Therefore, each SU has to access an available channel by predicting the activities of other mobile users. We compare with the conventional DRQN method that uses Long Short-Term Memory (LSTM) [27] as the Q-network. For a fair comparison, we also set the number of neurons in each LSTM layer to 32. The training algorithm for the DRQNs is BPTT with double Q-learning, using the same learning rates as the DEQNs. Since each SU updates its policy every 300 samples, all curves in the figures are computed as the moving average over 300 consecutive samples for clarity.

DEQN1 and DEQN2 are our DEQN method with one and two layers, respectively, and DRQN1 and DRQN2 are the conventional DRQN method with one and two layers, respectively.


Fig. 6: The system throughput of PUs.

Fig. 7: The system throughput of SUs.

The system throughput of PUs is shown in Figure 6 and the system throughput of SUs is shown in Figure 7. We observe that DEQNs have more stable performance than DRQNs, which empirically demonstrates that the DEQN method can learn efficiently with limited training data. Note that one experience replay buffer only contains the 300 latest training samples. After updating the learning agent of each SU using the 300 samples in the buffer, the DSS strategy of each SU changes, so the environment observed by each SU also changes. Therefore, we have to erase the outdated samples from the buffer and let SUs collect new training data from the environment. Figure 8 shows the average reward of SUs versus time. We observe extremely unstable reward curves for both DRQN1 and DRQN2, which shows that DRQNs cannot adapt well to this dynamic 5G scenario with so few training data.

Fig. 8: The average reward versus time.

We observe that DEQN2 has better performance than DEQN1 in both the system throughput of PUs and that of SUs, which shows that the deep structure (stacking ESNs) indeed improves the capability of the DRL agent to learn long-term temporal correlation. As for DRQNs, we observe that their performance does not improve as we increase the number of layers in the underlying RNN. The main reason is that more training data are needed to train a larger network, while even the DRQN with one layer cannot be trained well.

The top priority in designing a DSS network is to prevent harmful interference to the primary system. To analyze the performance degradation of the primary system after allowing the secondary system to access the spectrum, we also show in Figure 6 the system throughput of PUs when no SUs exist. We observe that DEQN2 achieves almost the same system throughput of PUs. A PU broadcasts a warning signal if its spectral-efficiency is below a threshold. For each PU, we record the ratio (number of times the PU sends a warning signal that is received by some SU) / (number of the PU's accesses), which we call the warning frequency. Figure 9 shows the average warning frequency of each PU versus time. We observe that every PU decreases its warning frequency over time, meaning that each SU learns not to access a channel where it would cause harmful interference to PUs.

TABLE III: The comparison of training time of different network architectures.

Network   Training time (sec)
DEQN1     161
DEQN2     178
DRQN1     3776
DRQN2     7618

Fig. 9: The average warning frequency of each PU versus time.

We compare the training time of the different approaches in Table III when implemented and executed on the same machine with a 2.71 GHz Intel i5 CPU and 12 GB of RAM. The required training time for DRQN1 is 23.4 times the training time for DEQN1, and the required training time for DRQN2 is 42.8 times the training time for DEQN2. This huge difference shows the training speed advantage of our introduced DEQN method over the conventional DRQN method. DRQN suffers from long training time because BPTT unfolds the network in time to compute the gradients, while DEQN can be trained very efficiently because the hidden states can be pre-stored for many training iterations.

V. CONCLUSION

In this paper, we introduced the concept of DEQN, a new RNN-based DRL strategy that efficiently captures the temporal correlation of the underlying time-dynamic environment while requiring a very limited amount of training data. The DEQN-based DRL strategies largely increase the rate of convergence compared to conventional DRQN-based strategies. DEQN-based spectrum access strategies are examined in DSS, a key technology in 5G and future 6G networks, showing significant performance improvements over state-of-the-art DRQN-based strategies. This provides strong evidence for adopting DEQN in real-time and time-dynamic applications. Our future work will focus on developing methodologies for the design of neural network architectures tailored to different applications.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv, 2013.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[4] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in AAAI Fall Symposium Series, 2015.
[5] F. S. He, Y. Liu, A. G. Schwing, and J. Peng, “Learning to play in a day: Faster deep reinforcement learning by optimality tightening,” in ICLR, 2017.
[6] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013, pp. 1310–1318.
[7] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networks - with an erratum note,” German National Research Center for Information Technology, Tech. Rep. 34, Jan. 2001.
[8] G. Tanaka, T. Yamane, J. B. Heroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, “Recent advances in physical reservoir computing: a review,” Neural Networks, 2019.
[9] Cisco, “Cisco visual networking index: Forecast and trends, 2017–2022,” VNI Global Fixed and Mobile Internet Traffic Forecasts, Feb. 2019.
[10] S. Marek, “Marek’s Take: Dynamic spectrum sharing may change the 5G deployment game,” Fierce Wireless, April 2019.
[11] S. Kinney, “Dynamic spectrum sharing is key to Verizon’s 5G strategy,” RCR Wireless News, August 2019.
[12] W. S. H. M. W. Ahmad, N. A. M. Radzi, F. Samidi, A. Ismail, F. Abdullah, M. Z. Jamaludin, and M. Zakaria, “5G technology: Towards dynamic spectrum sharing using cognitive radio networks,” IEEE Access, vol. 8, pp. 14460–14488, 2020.
[13] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[14] CEPT, “WINNER II Channel Models,” European Conference of Postal and Telecommunications Administrations (CEPT), Technical Report D1.1.2, Feb. 2008, version 1.2.
[15] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[16] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, 2016.
[17] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. on Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, 2018.
[18] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, 2018.
[19] H. Chang, H. Song, Y. Yi, J. Zhang, H. He, and L. Liu, “Distributive dynamic spectrum access through deep reinforcement learning: A reservoir computing-based approach,” IEEE Internet Things J., vol. 6, no. 2, pp. 1938–1948, Apr. 2019.
[20] 3GPP, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Layer Procedures,” 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 36.213, Jun. 2019, version 15.6.0.
[21] L. Liu, R. Chen, S. Geirhofer, K. Sayana, Z. Shi, and Y. Zhou, “Downlink MIMO in LTE-advanced: SU-MIMO vs. MU-MIMO,” IEEE Commun. Mag., vol. 50, no. 2, pp. 140–147, 2012.
[22] A. Chiumento, M. Bennis, C. Desset, L. Van der Perre, and S. Pollin, “Adaptive CSI and feedback estimation in LTE and beyond: a Gaussian process regression approach,” EURASIP JWCN, vol. 2015, no. 1, p. 168, Jun. 2015.
[23] M. Lukosevicius, “A practical guide to applying echo state networks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 659–686.
[24] M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in NIPS, 2013, pp. 190–198.
[25] C. Gallicchio, A. Micheli, and L. Pedrelli, “Deep reservoir computing: a critical experimental analysis,” Neurocomputing, vol. 268, pp. 87–99, 2017.
[26] Z. Zhou, L. Liu, J. Zhang, and Y. Yi, “Deep reservoir computing meets 5G MIMO-OFDM systems in symbol detection,” in AAAI, 2020.
[27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.


Hao-Hsuan Chang received the B.Sc. degree in electrical engineering and the M.S. degree in communication engineering from National Taiwan University, Taipei, Taiwan. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Virginia Polytechnic Institute and State University, Blacksburg, VA, USA. His research interests include dynamic spectrum access, echo state networks, and deep reinforcement learning.

Lingjia Liu is an Associate Professor with the Bradley Department of Electrical and Computer Engineering, Virginia Tech. He is also the Associate Director of Wireless@VT. Prior to joining VT, he was an Associate Professor with the EECS Department, University of Kansas (KU). He spent more than four years working in the Mitsubishi Electric Research Laboratory (MERL) and the Standards and Mobility Innovation Laboratory, Samsung Research America (SRA), where he received the Global Samsung Best Paper Award in 2008 and 2010. He was leading Samsung's efforts on multiuser MIMO, CoMP, and HetNets in LTE/LTE-Advanced standards. His general research interests mainly lie in emerging technologies for beyond-5G cellular networks, including machine learning for wireless networks, massive MIMO, massive MTC communications, and mmWave communications.

He received the Air Force Summer Faculty Fellowship from 2013 to 2017, the Miller Scholar award at KU in 2014, the Miller Professional Development Award for Distinguished Research at KU in 2015, the 2016 IEEE GLOBECOM Best Paper Award, the 2018 IEEE ISQED Best Paper Award, the 2018 IEEE TAOS Best Paper Award, the 2018 IEEE TCGCC Best Conference Paper Award, and the 2020 WOCC Charles Kao Best Paper Award.

Yang Yi (SM'17) is an Associate Professor in the Bradley Department of ECE at Virginia Tech (VT). She received the B.S. and M.S. degrees in electronic engineering at Shanghai Jiao Tong University, and the Ph.D. degree in electrical and computer engineering at Texas A&M University. Her research interests include very large scale integrated (VLSI) circuits and systems, computer aided design (CAD), and neuromorphic computing. Dr. Yi is currently serving as an associate editor for the Cyber Journal of Selected Areas in Microelectronics and has been serving on the editorial board of the International Journal of Computational & Neural Engineering. Dr. Yi is the recipient of the 2018 National Science Foundation CAREER award, the 2016 Miller Professional Development Award for Distinguished Research, the 2016 United States Air Force (USAF) Summer Faculty Fellowship, the 2015 NSF EPSCoR First Award, and the 2015 Miller Scholar award.