
Deep Reinforcement Learning with Smooth Policy

Qianli Shen∗, Yan Li∗, Haoming Jiang, Zhaoran Wang, Tuo Zhao †

Abstract

Deep neural networks have been widely adopted in modern reinforcement learning (RL) algorithms with great empirical successes in various domains. However, the large search space of training a neural network requires a significant amount of data, which makes current RL algorithms sample inefficient. Motivated by the fact that many environments with continuous state space have smooth transitions, we propose to learn a smooth policy that behaves smoothly with respect to states. In contrast to policies parameterized by linear/reproducing kernel functions, where simple regularization techniques suffice to control smoothness, for neural network based reinforcement learning algorithms there is no readily available solution for learning a smooth policy. In this paper, we develop a new training framework, Smooth Regularized Reinforcement Learning (SR2L), where the policy is trained with a smoothness-inducing regularizer. Such regularization effectively constrains the search space of the learning algorithms and enforces smoothness in the learned policy. We apply the proposed framework to both on-policy (TRPO) and off-policy (DDPG) algorithms. Through extensive experiments, we demonstrate that our method achieves improved sample efficiency.

1 Introduction

Deep reinforcement learning has enjoyed great empirical successes in various domains, including robotics, personalized recommendation, bidding, advertising and games (Levine et al., 2018; Zheng et al., 2018; Zhao et al., 2018; Silver et al., 2017; Jin et al., 2018). At the backbone of its success is the superior approximation power of deep neural networks, which parameterize complex policy, value or state-action value functions. However, the high complexity of deep neural networks makes the search space of the learning algorithm prohibitively large; training thus often requires a significant amount of data and suffers from numerous difficulties such as overfitting and training instability (Thrun & Schwartz, 1993; Boyan & Moore, 1995; Zhang et al., 2016).

Reducing the size of the search space while maintaining the network's performance requires special treatment.

†Yan Li, Haoming Jiang and Tuo Zhao are affiliated with the School of Industrial and Systems Engineering at Georgia Institute of Technology; Qianli Shen is affiliated with Yuanpei College at Peking University; Zhaoran Wang is affiliated with the Department of Industrial Engineering and Management Sciences at Northwestern University; Tuo Zhao is the corresponding author; Email: [email protected].


While one can simply switch to a network of smaller size, numerous empirical studies have shown that a small network often leads to performance degradation and training difficulties. It is commonly believed that training a sufficiently large network (also known as over-parameterization) with suitable regularization (e.g., dropout (Srivastava et al., 2014) or orthogonality parameter constraints (Huang et al., 2018)) is the most effective way to adaptively constrain the search space while maintaining the performance benefits of a large network.

For reinforcement learning problems, entropy regularization is one commonly adopted form of regularization, believed to facilitate exploration during learning. Yet in the presence of high uncertainty in the environment and large noise, such regularization may yield poor performance. More recently, Pinto et al. (2017) propose robust adversarial reinforcement learning (RARL), which aims to perform well under uncertainty by training the agent to be robust against an adversarially perturbed environment. However, in addition to the marginal performance gain, the requirement of learning an additional adversarial policy makes the update of RARL computationally expensive and less sample efficient than traditional learning algorithms. Cheng et al. (2019), on the other hand, propose a control regularization that enforces the behavior of the deep policy to be similar to a policy prior, yet designing a good prior often requires a significant amount of domain knowledge.

Different from previous works, we propose a new training framework, Smooth Regularized Reinforcement Learning (SR2L), for training reinforcement learning algorithms. By promoting smoothness, we effectively reduce the size of the search space when learning the policy network and achieve state-of-the-art sample efficiency. Our goal of promoting smoothness in the policy is motivated by the fact that natural environments with continuous state space often have smooth transitions from state to state, which favors a smooth policy: similar states lead to similar actions. As a concrete example, the MuJoCo environment (Todorov et al., 2012) is a system governed by physical laws, and its optimal policy can be described by a set of differential equations with certain smoothness properties.

Promoting smoothness is particularly important for deep RL, since deep neural networks can be extremely non-smooth due to their high complexity. It has been observed that small changes in a neural network's input can result in significant changes in its output. Such non-smoothness has drawn significant attention in other domains (e.g., image recognition, information security) that use neural networks as part of the decision process (Goodfellow et al., 2014; Kurakin et al., 2016). Training a smooth neural network usually requires additional techniques in the training process. In the supervised learning setting with i.i.d. data, these techniques include, but are not limited to, batch normalization (Ioffe & Szegedy, 2015), layer normalization (Ba et al., 2016), and orthogonal regularization (Huang et al., 2018). However, most existing techniques do not work well in the RL setting, where the training data has complex dependencies. As one significant consequence, current reinforcement learning algorithms often lead to undesirably non-smooth policies.

Our proposed SR2L training framework uses a smoothness-inducing regularizer to encourage the output of the policy (the decision) to not change much when a small perturbation is injected into the input of the policy (the observed state). The framework is motivated by local shift sensitivity in the robust statistics literature (Hampel, 1974), and the regularizer can also be viewed as a measure of the local Lipschitz constant of the policy. We highlight that SR2L is highly flexible and can be readily adopted into various reinforcement learning algorithms. As concrete examples, we apply SR2L to the TRPO algorithm (Schulman et al., 2015), which is an on-policy method, where the regularizer directly penalizes the non-smoothness of the policy. In addition, we apply SR2L to the DDPG algorithm (Lillicrap et al., 2015), which is an off-policy method, where the regularizer penalizes the non-smoothness of either the policy or the state-action value function (also known as the Q-function), from which we can further induce a smooth policy.

Our proposed smoothness-inducing regularizer is related to several existing works (Miyato et al., 2018; Zhang et al., 2019; Hendrycks et al., 2019; Xie et al., 2019; Jiang et al., 2019). These works consider similar regularization techniques, but target other applications with different motivations, e.g., semi-supervised learning, unsupervised domain adaptation, and harnessing adversarial examples in image classification.

The rest of the paper is organized as follows: Section 2 introduces the related background; Section 3 introduces our proposed Smooth Regularized Reinforcement Learning (SR2L) in detail; Section 4 presents numerical experiments on various MuJoCo environments to demonstrate the superior performance of SR2L.

Notations: We let Bd(s,ε) = {s̃ : d(s, s̃) ≤ ε} denote the ε-radius ball centered at point s, measured in the metric d(·,·). We use IA ∈ R^{A×A} to denote the identity matrix in the A-dimensional Euclidean space.

2 Background

We consider a Markov decision process (S, A, P, r, p0, γ), in which an agent interacts with an environment in discrete time steps. We let S ⊆ R^S denote the continuous state space, A ⊆ R^A denote the action space, P : S × A → P(S) denote the transition kernel, r : S × A → R denote the reward function, p0 denote the initial state distribution, and γ denote the discount factor. An agent's behavior is defined by a policy, either stochastic or deterministic. A stochastic policy π maps a state to a probability distribution over the action space, π : S → P(A). A deterministic policy µ maps a state directly to an action, µ : S → A. At each time step, the agent observes its state st ∈ S, takes action at ∼ π(st), and receives reward rt = r(st, at). The agent then transits into the next state st+1 with probability given by the transition kernel, st+1 ∼ P(·|st, at). The goal of the agent is to find a policy that maximizes the expected discounted reward with discount factor γ:

$$\max_\pi V(\pi) = \mathbb{E}_{s_0,a_0,\ldots}\Big[\sum_{t\ge 0}\gamma^t r(s_t,a_t)\Big], \quad \text{with } s_0\sim p_0,\ a_t\sim\pi(s_t),\ s_{t+1}\sim P(\cdot\,|\,s_t,a_t).$$

One way to solve the above problem is via classical policy gradient algorithms, which estimate the gradient of the expected reward through trajectory samples and update the parameters of the policy by following the estimated gradient. Policy gradient algorithms are known to suffer from high variance of the estimated gradient, which often leads to aggressive updates and unstable training. To address this issue, numerous variants have been proposed. Below we briefly review two popular ones used in practice.
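Before turning to these variants, the sketch below illustrates the vanilla Monte Carlo (REINFORCE-style) policy gradient estimator they improve upon. It is a minimal sketch, not the paper's implementation; the policy is assumed to map a state tensor to a torch.distributions.Distribution.

```python
# Minimal REINFORCE-style policy gradient sketch (illustrative; not the paper's code).
import torch


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{t' >= t} gamma^{t'-t} r_{t'} for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))


def policy_gradient_loss(policy, trajectory, gamma=0.99):
    """trajectory: list of (state, action, reward); policy(state) is assumed to
    return a torch.distributions.Distribution over actions."""
    states, actions, rewards = zip(*trajectory)
    returns = torch.tensor(discounted_returns(list(rewards), gamma))
    log_probs = torch.stack([policy(s).log_prob(a).sum() for s, a in zip(states, actions)])
    # Maximizing E[sum_t G_t * log pi(a_t|s_t)] <=> minimizing its negation.
    return -(log_probs * returns).mean()
```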


2.1 Trust Region Policy Optimization (TRPO)

TRPO iteratively improves a parameterized policy πθ by solving a trust region type optimization problem. Before we describe the algorithm in detail, we need several definitions in place. The value function Vπ(s) and the state-action value function Qπ(s,a) are defined by:

$$V^\pi(s) = \mathbb{E}_{s_0=s,\,a_0,\ldots}\Big[\sum_{t\ge 0}\gamma^t r(s_t,a_t)\Big], \qquad Q^\pi(s,a) = \mathbb{E}_{s_1,a_1,\ldots}\Big[\sum_{t\ge 0}\gamma^t r(s_t,a_t)\,\Big|\, s_0=s,\ a_0=a\Big],$$

with at ∼ π(st) and st+1 ∼ P(·|st, at). The advantage function Aπ(s,a) and the (unnormalized) discounted state visitation distribution ρπ(s) are defined by:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s), \qquad \rho^\pi(s) = \sum_{i\ge 0}\gamma^i\, \mathbb{P}(s_i = s).$$

At the k-th iteration of TRPO, the policy is updated by solving:

$$\theta_{k+1} = \arg\max_\theta\; \mathbb{E}_{s\sim\rho^{\pi_{\theta_k}},\,a\sim\pi_{\theta_k}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)\Big],\quad \text{subject to}\quad \mathbb{E}_{s\sim\rho^{\pi_{\theta_k}}}\big[D_{\mathrm{KL}}\big(\pi_{\theta_k}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\big]\le\delta, \qquad (1)$$

where δ is a tuning parameter controlling the size of the trust region, and $D_{\mathrm{KL}}(P\,\|\,Q)=\int_{\mathcal{X}}\log\big(\tfrac{dP}{dQ}\big)\,dP$ denotes the Kullback-Leibler divergence between two distributions P, Q over support X. For each update, the algorithm: (i) samples trajectories using the current policy πθk; (ii) approximates Aπθk for each state-action pair by taking the discounted sum of future rewards along the trajectory; (iii) replaces the expectations in (1) and Aπθk by sample approximations, then solves (1) with the conjugate gradient algorithm.
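A minimal sketch of step (ii) is given below, assuming a Monte Carlo estimate from the discounted sum of future rewards; subtracting a learned value baseline is an optional assumption on our part, not a step stated above.

```python
# Sketch of step (ii): Monte Carlo advantage estimates along a sampled trajectory.
import torch


def estimate_advantages(rewards, states, gamma, value_fn=None):
    """rewards: list of floats; states: list of state tensors;
    value_fn: optional callable s -> scalar baseline V(s) (an assumption)."""
    T = len(rewards)
    returns = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running  # discounted sum of future rewards
        returns[t] = running
    baselines = [float(value_fn(s)) if value_fn is not None else 0.0 for s in states]
    return torch.tensor([g - b for g, b in zip(returns, baselines)])
```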

2.2 Deep Deterministic Policy Gradient (DDPG)

DDPG uses the actor-critic architecture, where the agent learns a parameterized state-action value function Qφ (also known as the critic) to update the parameterized deterministic policy µθ (also known as the actor).

DDPG uses a replay buffer, which is also used in Deep Q-Networks (Mnih et al., 2013). The replay buffer is a finite-sized cache. Transitions are sampled from the environment according to the policy, and the tuple (st, at, rt, st+1) is stored in the replay buffer. When the replay buffer is full, the oldest samples are discarded. At each time step, µθ and Qφ are updated by sampling a mini-batch uniformly from the buffer.
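As a concrete illustration of the mechanism just described, a minimal fixed-capacity replay buffer might look as follows (a sketch; the class and method names are ours, not the paper's).

```python
# Minimal replay buffer sketch: a finite-sized cache with uniform mini-batch sampling.
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transitions once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling, as used by the DDPG updates described above.
        return random.sample(self.buffer, batch_size)
```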

Update of the state-action value function. The update of the state-action value function network Qφ depends on the deterministic Bellman equation:

$$Q_\phi(s,a) = \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[r(s,a) + \gamma Q_\phi\big(s', \mu_\theta(s')\big)\big]. \qquad (2)$$

The expectation depends only on the environment. This means that, unlike TRPO, DDPG is an off-policy method, which can use transitions generated from a different stochastic behavior policy, denoted by β (see Lillicrap et al. (2015) for details). At the t-th iteration, we update Qφ by minimizing the mean squared Bellman error over transitions {(s_t^i, a_t^i, r_t^i, s_{t+1}^i)}_{i∈B} sampled from the replay buffer. Specifically, let Qφ′t and µθ′t be a pair of target networks; we set y_t^i = r_t^i + γQφ′t(s_{t+1}^i, µθ′t(s_{t+1}^i)), and then update the critic network:

$$\phi_{t+1} = \arg\min_\phi \sum_{i\in B}\big(y_t^i - Q_\phi(s_t^i, a_t^i)\big)^2.$$

After both the critic and actor networks are updated, we update the target networks by slowly tracking the critic and actor networks:

$$\phi'_{t+1} = \tau\phi_{t+1} + (1-\tau)\phi'_t, \qquad \theta'_{t+1} = \tau\theta_{t+1} + (1-\tau)\theta'_t,$$

with τ ≪ 1.

Update of the policy. The policy network µθ is updated by maximizing the value function using the policy gradient:

$$\max_\theta\; \mathbb{E}_{s\sim\rho^{\beta}}\Big[Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\Big]. \qquad (3)$$

Similar to updating the critic, we use a mini-batch sampled from the replay buffer to compute an approximate gradient with respect to θ and perform the update:

$$\theta_{t+1} = \theta_t + \frac{\eta_t}{|B|}\sum_{i\in B}\nabla_a Q_{\phi_{t+1}}(s,a)\Big|_{s=s^i,\,a=\mu_{\theta_t}(s^i)}\nabla_\theta\mu_{\theta_t}(s^i).$$
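The sketch below puts the critic update, the actor update, and the soft target-network update together into one DDPG step, following (2)-(3). It is an illustrative sketch under assumed network and optimizer objects, not a reference implementation.

```python
# Sketch of one DDPG update on a sampled mini-batch (illustrative assumptions throughout).
import torch
import torch.nn.functional as F


def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    # rewards is assumed to have shape [B, 1] so it broadcasts with the critic output.
    states, actions, rewards, next_states = batch

    # Critic: regress Q_phi(s, a) onto the bootstrapped target y from the target networks.
    with torch.no_grad():
        y = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend Q_phi(s, mu_theta(s)) via the deterministic policy gradient.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates with tau << 1.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1 - tau).add_(tau * p)
```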

3 Method

In this section, we present the smoothness-inducing regularizer in its general form and describe its intuition in detail. We also apply the proposed regularizer to popular reinforcement learning algorithms to demonstrate its great adaptability.

3.1 Learning Policy with SR2L

We first focus on directly learning a smooth policy πθ with the proposed regularizer. We assume that the state space is continuous, i.e., S ⊆ R^S.

For a fixed state s ∈ S and a policy πθ, SR2L encourages the outputs πθ(s) and πθ(s̃) to be similar, where the state s̃ is obtained by injecting a small perturbation into state s. We assume the perturbation set is an ε-radius ball measured in a metric d(·,·), which is often chosen to be the ℓp distance: d(s, s̃) = ‖s − s̃‖p. To measure the discrepancy between the outputs of a policy, we adopt a suitable metric function denoted by D. The non-smoothness of policy πθ at state s is defined in an adversarial manner:

$$\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\mathcal{D}\big(\pi_\theta(s), \pi_\theta(\tilde{s})\big).$$


To obtain a smooth policy πθ, we encourage smoothness at each state along the entire trajectory. We achieve this by taking the expectation with respect to the state visitation distribution ρπ induced by the policy, and our smoothness-inducing regularizer is defined by:

$$\mathcal{R}_s^\pi(\theta) = \mathbb{E}_{s\sim\rho^{\pi_\theta}}\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\mathcal{D}\big(\pi_\theta(s),\pi_\theta(\tilde{s})\big). \qquad (4)$$

For a stochastic policy πθ, we set the metric D(·,·) to be the Jeffrey's divergence, and the regularizer takes the form

$$\mathcal{R}_s^\pi(\theta) = \mathbb{E}_{s\sim\rho^{\pi}}\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)} D_J\big(\pi_\theta(\cdot|s)\,\|\,\pi_\theta(\cdot|\tilde{s})\big), \qquad (5)$$

where the Jeffrey's divergence between two distributions P, Q is defined by:

$$D_J(P\,\|\,Q) = \tfrac{1}{2}D_{\mathrm{KL}}(P\,\|\,Q) + \tfrac{1}{2}D_{\mathrm{KL}}(Q\,\|\,P). \qquad (6)$$

For a deterministic policy µθ, we set the metric D(·,·) to be the squared ℓ2 norm of the difference:

$$\mathcal{R}_s^\mu(\theta) = \mathbb{E}_{s\sim\rho^{\mu}}\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big\|\mu_\theta(s)-\mu_\theta(\tilde{s})\big\|_2^2. \qquad (7)$$

The smoothness-inducing adversarial regularizer essentially measures the local Lipschitz continuity of the policy πθ under the metric D. More precisely, we encourage the output (decision) of πθ to not change much if we inject a small perturbation, bounded in the metric d(·,·), into the state s (see Figure 1). Therefore, by adding the regularizer (4) to the policy update, we encourage πθ to be smooth within the neighborhoods of all states on all possible trajectories of the sampling policy. Such a smoothness-inducing property is particularly helpful for preventing overfitting and improving sample efficiency and overall training stability.
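For a deterministic policy, a sketch of the regularizer (7) on a mini-batch of states is given below. The inner maximization is approximated by projected gradient ascent over an ℓ∞ ball (matching the experiments in Section 4); the use of the gradient sign and the default step settings are our assumptions, not prescriptions from the paper.

```python
# Sketch of the smoothness-inducing regularizer (7) with an approximate inner maximization.
import torch


def smoothness_regularizer(policy, states, eps, n_steps=10, step_size=None):
    """states: tensor of shape [B, state_dim]; policy: module mapping states to actions."""
    step_size = 0.2 * eps if step_size is None else step_size
    delta = torch.empty_like(states).uniform_(-eps, eps)   # random initialization
    ref_actions = policy(states).detach()                  # theta is held fixed in the inner max
    for _ in range(n_steps):
        delta.requires_grad_(True)
        gap = ((policy(states + delta) - ref_actions) ** 2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(gap, delta)
        with torch.no_grad():
            # Ascent step followed by projection onto the l_inf ball B_d(0, eps).
            delta = (delta + step_size * grad.sign()).clamp(-eps, eps)
    # Regularizer value; gradients w.r.t. the policy parameters flow through both terms.
    return ((policy(states + delta) - policy(states)) ** 2).sum(dim=-1).mean()
```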

TRPO with SR2L (TRPO-SR). We now apply the proposed smoothness-inducing regularizer to the TRPO algorithm, which is an on-policy algorithm. Since TRPO uses a stochastic policy, we use the Jeffrey's divergence to penalize the discrepancy between the decisions for the regular state and the adversarially perturbed state, as suggested in (5).

Figure 1: Smoothness of policy πθ at state s. If the policy πθ is smooth at state s, then the perturbed state s̃ leads to an action a′ similar to the original action a. If the policy πθ is non-smooth at state s, then the perturbed state s̃ leads to a drastically different action a′′.


Specifically, TRPO with the smoothness-inducing regularizer updates the policy by solving the following subproblem at the k-th iteration:

$$\theta_{k+1} = \arg\min_\theta\; -\,\mathbb{E}_{s\sim\rho^{\pi_{\theta_k}},\,a\sim\pi_{\theta_k}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)\Big] + \lambda_s\,\mathbb{E}_{s\sim\rho^{\pi_{\theta_k}}}\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)} D_J\big(\pi_\theta(\cdot|s)\,\|\,\pi_\theta(\cdot|\tilde{s})\big),$$
$$\text{s.t.}\quad \mathbb{E}_{s\sim\rho^{\pi_{\theta_k}}}\big[D_{\mathrm{KL}}\big(\pi_{\theta_k}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\big] \le \delta. \qquad (8)$$

3.2 Learning Q-function with Smoothness-inducing Regularization

The proposed smoothness-inducing regularizer can also be used to learn a smooth Q-function, which can in turn be used to generate a smooth policy.

We measure the non-smoothness of a Q-function at a state-action pair (s,a) by the squared difference of the state-action values at the normal state and the adversarially perturbed state:

$$\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big(Q_\phi(s,a)-Q_\phi(\tilde{s},a)\big)^2.$$

To enforce smoothness at every state-action pair, we take the expectation with respect to the entire trajectory, and the smoothness-inducing regularizer takes the form

$$\mathcal{R}_s^Q(\phi) = \mathbb{E}_{s\sim\rho^\beta,\,a\sim\beta}\,\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big(Q_\phi(s,a)-Q_\phi(\tilde{s},a)\big)^2,$$

where β denotes the behavior policy used for sampling in the off-policy training setting.

DDPG with SR2L. We now apply the proposed smoothness-inducing regularizer to the DDPG algorithm, which is itself an off-policy algorithm. Since DDPG uses two networks, the actor network and the critic network, we propose two variants of DDPG, where the regularizer is applied to either the actor or the critic network.

• Regularizing the Actor Network (DDPG-SR-A). We can directly penalize the non-smoothness of the actor network to promote a smooth policy in DDPG. Since DDPG uses a deterministic policy µθ, when updating the actor network we penalize the squared difference as suggested in (7) and minimize the following objective:

$$\mathbb{E}_{s\sim\rho^\beta}\Big[-Q_\phi(s,a)\big|_{a=\mu_\theta(s)} + \lambda_s\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big\|\mu_\theta(s)-\mu_\theta(\tilde{s})\big\|_2^2\Big].$$

The policy gradient can be written as:

$$\mathbb{E}_{s\sim\rho^\beta}\Big[-\nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\nabla_\theta\mu_\theta(s) + \lambda_s\nabla_\theta\big\|\mu_\theta(s)-\mu_\theta(\tilde{s})\big\|_2^2\Big],$$

with $\tilde{s} = \arg\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big\|\mu_\theta(s)-\mu_\theta(\tilde{s})\big\|_2^2$ for s ∼ ρβ.
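A minimal sketch of the DDPG-SR-A actor objective follows, combining the usual deterministic policy-gradient term with the smoothness penalty above; it reuses the hypothetical smoothness_regularizer sketch from Section 3.1, and all names are illustrative.

```python
# Sketch of the DDPG-SR-A actor loss: -Q_phi(s, mu_theta(s)) + lambda_s * R_s^mu(theta).
def ddpg_sr_a_actor_loss(actor, critic, states, eps, lambda_s):
    q_term = -critic(states, actor(states)).mean()             # deterministic policy-gradient term
    smooth_term = smoothness_regularizer(actor, states, eps)   # inner max over perturbed states (sketch above)
    return q_term + lambda_s * smooth_term
```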

• Regularizing the Critic Network (DDPG-SR-C). Since DDPG simultaneously learns a Q-function (critic network) to update the policy (actor network), inducing smoothness in the critic network can also help generate a smooth policy. By incorporating the proposed regularizer


for penalizing the Q-function, we obtain the following update for inducing a smooth Q-function in DDPG:

$$\phi_{t+1} = \arg\min_\phi \sum_{i\in B}\big(y_t^i - Q_\phi(s_t^i, a_t^i)\big)^2 + \lambda_s\sum_{i\in B}\max_{\tilde{s}_t^i\in\mathcal{B}_d(s_t^i,\epsilon)}\big(Q_\phi(s_t^i,a_t^i) - Q_\phi(\tilde{s}_t^i,a_t^i)\big)^2,$$
$$\text{with } y_t^i = r_t^i + \gamma Q_{\phi'_t}\big(s_{t+1}^i, \mu_{\theta'_t}(s_{t+1}^i)\big),\ \forall i\in B,$$

where B is the mini-batch sampled from the replay buffer.
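A sketch of the corresponding DDPG-SR-C critic loss is given below: the mean squared Bellman error plus the smoothness penalty on the Q-function, with the inner maximization approximated by projected gradient ascent over an ℓ∞ ball. The step settings and the sign-gradient update are assumptions, not the paper's prescription.

```python
# Sketch of the DDPG-SR-C critic loss (illustrative; not the paper's implementation).
import torch
import torch.nn.functional as F


def ddpg_sr_c_critic_loss(critic, target_actor, target_critic, batch,
                          gamma, eps, lambda_s, n_steps=10):
    states, actions, rewards, next_states = batch  # rewards assumed shaped [B, 1]
    with torch.no_grad():
        y = rewards + gamma * target_critic(next_states, target_actor(next_states))
        q_ref = critic(states, actions)            # Q_phi(s, a), held fixed during the inner max

    # Inner maximization over the perturbed state: max (Q(s, a) - Q(s_tilde, a))^2.
    delta = torch.empty_like(states).uniform_(-eps, eps)
    step = 0.2 * eps
    for _ in range(n_steps):
        delta.requires_grad_(True)
        gap = ((critic(states + delta, actions) - q_ref) ** 2).mean()
        grad, = torch.autograd.grad(gap, delta)
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-eps, eps)  # project onto B_d(0, eps)

    bellman_term = F.mse_loss(critic(states, actions), y)
    smooth_term = ((critic(states + delta, actions) - critic(states, actions)) ** 2).mean()
    return bellman_term + lambda_s * smooth_term
```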

3.3 Solving the Min-max Problem

Adding the smoothness-inducing regularizer to the policy/Q-function update often involves solving a min-max problem. Though the inner maximization problem is not concave, a simple stochastic gradient algorithm has been shown to solve it efficiently in practice. Below we describe how to perform the TRPO-SR update, including solving the corresponding min-max problem. The details are summarized in Algorithm 1. We leave the detailed descriptions of DDPG-SR-A and DDPG-SR-C to the appendix.

Algorithm 1 Trust Region Policy Optimization with Smoothness-inducing Regularization (TRPO-SR).

Input: step sizes ηδ and ηθ, number of iterations D for the inner optimization, number of iterations K for policy updates, perturbation strength ε, regularization coefficient λs.
Initialize: randomly initialize the policy network πθ0.
for k = 1, ..., K − 1 do
    Sample a trajectory Sk = {(s_k^t, a_k^t)}_{t=1}^T from the current policy πθk.
    Estimate the advantage function A^{πθk}(s,a) using sample approximation.
    for s_k^t ∈ Sk do
        Randomly initialize δ_0^t.
        for ℓ = 0, ..., D − 1 do
            δ_{ℓ+1}^t = δ_ℓ^t + ηδ ∇δ DJ(πθk(·|s_k^t) ‖ πθk(·|s_k^t + δ_ℓ^t)).
            δ_{ℓ+1}^t = Π_{Bd(0,ε)}(δ_{ℓ+1}^t).
        end for
        Set s̃_k^t = s_k^t + δ_D^t.
    end for
    θ_{k+1} = θ_k + ηθ Σ_{(s_k^t, a_k^t)∈Sk} [A^{πθk}(s_k^t, a_k^t) / πθk(a_k^t|s_k^t)] ∇θ πθk(a_k^t|s_k^t) − ηθ λs Σ_{s_k^t∈Sk} γ^t ∇θ DJ(πθk(·|s_k^t) ‖ πθk(·|s̃_k^t)).
end for
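The sketch below mirrors Algorithm 1's inner loop for a Gaussian policy: projected gradient ascent on the Jeffrey's divergence between πθk(·|s) and πθk(·|s + δ), followed by projection onto Bd(0, ε) for the ℓ∞ metric. The policy is assumed to return a torch.distributions.Normal; names and defaults are illustrative.

```python
# Sketch of Algorithm 1's inner maximization for a Gaussian policy.
import torch
from torch.distributions import Normal, kl_divergence


def jeffreys(p, q):
    """Jeffrey's divergence (6): the symmetrized KL divergence."""
    return 0.5 * kl_divergence(p, q) + 0.5 * kl_divergence(q, p)


def find_perturbed_state(policy, state, eps, n_steps, step_size):
    """state: tensor of shape [state_dim]; returns the perturbed state s_tilde."""
    ref = policy(state)                                   # pi_theta_k(.|s)
    ref = Normal(ref.loc.detach(), ref.scale.detach())    # treat theta_k as fixed in the inner loop
    delta = torch.empty_like(state).uniform_(-eps, eps)   # random initialization, as in Algorithm 1
    for _ in range(n_steps):
        delta.requires_grad_(True)
        obj = jeffreys(ref, policy(state + delta)).sum()
        grad, = torch.autograd.grad(obj, delta)
        with torch.no_grad():
            # Gradient ascent step, then projection onto B_d(0, eps) for the l_inf metric.
            delta = (delta + step_size * grad).clamp(-eps, eps)
    return state + delta.detach()
```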


4 Experiment

We apply the proposed SR2L training framework to two popular reinforcement learning algorithms: TRPO and DDPG. Both algorithms have become standard routines for solving large-scale control tasks and are the building blocks of many state-of-the-art reinforcement learning algorithms. For TRPO, we directly learn a smooth policy; for DDPG, we promote smoothness in either the actor (policy) or the critic (Q-function).

4.1 Implementation

Figure 2: OpenAI Gym MuJoCo benchmarks: Swimmer, HalfCheetah, Hopper, Walker2d, and Ant.

Our implementation of the SR2L training framework is based on the open source toolkit garage (garage contributors, 2019). We test our algorithms on OpenAI Gym (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator. For all tasks, we use a network with 2 hidden layers, each containing 64 neurons, to parameterize the policy and the Q-function. For a fair comparison, except for the hyper-parameters related to the smoothness regularizer, we keep all hyper-parameters the same as in the original garage implementation. We use grid search to select the hyper-parameters (perturbation strength ε, regularization coefficient λs) of the smoothness-inducing regularizer, with search ranges ε ∈ [10−5, 10−1] and λs ∈ [10−2, 102]. To solve the inner maximization problem in the update, we run 10 steps of projected gradient ascent with step size 0.2ε. For each algorithm and each environment, we train 10 policies with different initializations for 500 iterations (1K environment steps per iteration).

Below we briefly describe the environments used to evaluate our algorithms (see also Figure 2).

Swimmer. The swimmer is a planar robot consisting of a single torso with 3 links and 2 actuated joints, moving in a viscous container. The 8-dimensional state space includes the positions and velocities of the sliders and the angles and angular velocities of the hinges. The 2-dimensional action space consists of the torques of the actuated joints.

HalfCheetah. The half-cheetah is a planar biped robot with 7 rigid links, including two legs and a torso, along with 6 actuated joints. The 17-dimensional state space includes the positions and velocities of the sliders and the angles and angular velocities of the hinges. The 6-dimensional action space consists of the torques of the actuated joints.

Walker2D. The walker is a planar biped robot consisting of 7 links, corresponding to two legs and a torso, along with 6 actuated joints. The 17-dimensional state space includes the positions and velocities of the sliders and the angles and angular velocities of the hinges. The 6-dimensional action space consists of the torques of the actuated joints.

Hopper. The hopper is a planar monopod robot with 4 rigid links, corresponding to the torso, upper leg, lower leg, and foot, along with 3 actuated joints. The 11-dimensional state space includes the positions and velocities of the sliders and the angles and angular velocities of the hinges. The 3-dimensional action space consists of the torques of the actuated joints.

Ant. The ant is a quadruped robot consisting of 9 links, corresponding to 4 legs and a torso, along with 8 actuated joints. The 111-dimensional state space includes the positions and velocities of the sliders and the angles and angular velocities of the hinges. The 8-dimensional action space consists of the torques of the actuated joints.

4.2 Evaluating the Learned Policies

TRPO with SR2L (TRPO-SR). We use a Gaussian policy in our implementation. Specifically, for a given state s, the action follows a Gaussian distribution a ∼ N(πθ(s), σ²IA), where σ is also a learnable parameter. Then the smoothness-inducing regularizer (5) takes the form:

$$\mathcal{R}_s^\pi(\theta) = \mathbb{E}_{s\sim\rho^\pi}\max_{\tilde{s}\in\mathcal{B}_d(s,\epsilon)}\big\|\pi_\theta(s)-\pi_\theta(\tilde{s})\big\|_2^2/\sigma^2.$$
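A short sketch of this Gaussian-policy penalty, following the expression above (the 1/σ² scaling is taken as stated); the mean network and its inputs are assumptions.

```python
# Sketch of the Gaussian-policy smoothness penalty ||pi_theta(s) - pi_theta(s_tilde)||^2 / sigma^2.
import torch


def gaussian_smoothness_penalty(mean_net, sigma, states, perturbed_states):
    """mean_net: network producing the Gaussian mean pi_theta(s); sigma: scalar std."""
    diff = mean_net(states) - mean_net(perturbed_states)
    return (diff.pow(2).sum(dim=-1) / sigma ** 2).mean()
```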

Figure 3: Learning curves (mean ± standard deviation of reward versus training iterations) on Swimmer-v2, HalfCheetah-v2, Walker2d-v2, and Ant-v2 for TRPO-SR (orange) trained policies versus the TRPO (blue) baseline when tested in the clean environment. For all tasks, TRPO-SR achieves a better mean reward than TRPO, with a reduction of variance in the initial stage.

Figure 4: Learning curves (mean ± standard deviation of reward versus training iterations) on HalfCheetah-v2, Hopper-v2, Walker2d-v2, and Ant-v2 for DDPG-SR-A (orange) and DDPG-SR-C (green) trained policies versus the DDPG (blue) baseline when tested in the clean environment. For all tasks, the DDPG-SR-A and DDPG-SR-C trained policies achieve better mean reward than DDPG.

Figure 3 shows the mean and variance of the cumulative reward (over 10 policies) for policies trained by TRPO-SR and TRPO on Swimmer, HalfCheetah, Walker2d and Ant.


Figure 5: Percentile plots (cumulative reward versus percentile) on Swimmer-v2, HalfCheetah-v2, Walker2d-v2, and Ant-v2 for TRPO-SR (orange) trained policies versus the TRPO (blue) baseline when tested in the clean environment. Algorithms are run with ten different initializations and sorted to show the percentiles of cumulative reward. For all tasks, TRPO-SR achieves better or competitive best performance, and better worst-case performance, compared with the TRPO baseline.

Figure 6: Percentile plots (cumulative reward versus percentile) on HalfCheetah-v2, Hopper-v2, Walker2d-v2, and Ant-v2 for DDPG-SR-A (orange) and DDPG-SR-C (green) trained policies versus the DDPG (blue) baseline when tested in the clean environment. Algorithms are run with 10 different initializations and sorted to show the percentiles of cumulative reward. For all tasks, DDPG-SR-A and DDPG-SR-C achieve better or competitive best performance, and significantly better worst-case performance, compared with the DDPG baseline.

For all four tasks, TRPO-SR learns a better policy in terms of the mean cumulative reward. In addition, TRPO-SR enjoys a smaller variance of the cumulative reward with respect to different initializations. These two observations confirm that our smoothness-inducing regularization improves sample efficiency as well as training stability.

We further show that the advantage of our proposed SR2L training framework goes beyond improving the mean cumulative reward. To show this, we run the algorithm with 10 different initializations, sort the cumulative rewards of the learned policies, and plot the percentiles in Figure 5.

• For all four tasks, TRPO-SR uniformly outperforms the baseline TRPO.

• For the Swimmer and HalfCheetah tasks, TRPO-SR significantly improves the worst-case performance compared to TRPO, and has similar best-case performance.

• For the Walker and Ant tasks, TRPO-SR significantly improves the best-case performance compared to TRPO.

Our empirical results provide strong evidence that the proposed SR2L not only improves the average reward, but also makes the training process significantly more robust to failure cases compared to the baseline method.


DDPG with SR2L. We repeat the same evaluations for the SR2L framework applied to DDPG (DDPG-SR-A and DDPG-SR-C). Figure 4 shows the mean and variance of the cumulative reward for policies trained by DDPG-SR-A and DDPG-SR-C in the HalfCheetah, Hopper, Walker2d and Ant environments. For all four tasks, DDPG-SR learns a better policy in terms of mean reward. For the Ant task, DDPG-SR-A shows superior training stability; it is the only algorithm without a drastic decay in the initial training stage. In addition, DDPG-SR-C shows competitive performance compared to DDPG-SR-A, and significantly outperforms both DDPG-SR-A and DDPG on the HalfCheetah task. This shows that, instead of directly learning a smooth policy, we can learn a smooth Q-function and obtain similar performance benefits.

Figure 6 plots the percentiles of the cumulative reward of policies learned using DDPG and DDPG-SR. Similar to TRPO-SR, both DDPG-SR-A and DDPG-SR-C uniformly outperform the baseline DDPG across all reward percentiles. DDPG-SR significantly improves the worst-case performance, while maintaining competitive best-case performance compared to DDPG.

4.3 Robustness with Disturbance

We demonstrate that even though the SR2L training framework does not explicitly target robustness, the trained policy is still able to achieve robustness against both stochastic and adversarial measurement errors, a classical setting considered in partially observable Markov decision processes (POMDPs) (Monahan, 1982). To show this, we evaluate the robustness of the proposed SR2L training framework in the Swimmer and HalfCheetah environments. We evaluate the trained policy with two types of disturbances in the test environment: for a given state s, we add either (i) a random disturbance sampled uniformly from Bd(0,ε), or (ii) an adversarial disturbance generated by solving

$$\delta = \arg\max_{\delta\in\mathcal{B}_d(0,\epsilon)}\mathcal{D}\big(\pi_\theta(s),\pi_\theta(s+\delta)\big)$$

using 10 steps of projected gradient ascent. For all evaluations, we use the disturbance set Bd(0,ε) = {δ : ‖δ‖∞ ≤ ε}. For each policy and disturbed environment, we perform 10 stochastic rollouts to evaluate the policy and plot the cumulative reward.
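A sketch of the adversarial disturbance used at evaluation time is given below for a deterministic policy with the squared ℓ2 discrepancy; for a stochastic policy the Jeffrey's divergence from Section 3.1 would replace it. The 10 PGD steps follow the text; the step size and the sign-gradient update are assumptions.

```python
# Sketch of the adversarial state disturbance used for evaluation (illustrative assumptions).
import torch


def adversarial_disturbance(policy, state, eps, n_steps=10):
    clean_action = policy(state).detach()
    # Random initialization avoids the zero-gradient point at delta = 0.
    delta = torch.empty_like(state).uniform_(-eps, eps)
    step = 0.2 * eps
    for _ in range(n_steps):
        delta.requires_grad_(True)
        obj = ((policy(state + delta) - clean_action) ** 2).sum()
        grad, = torch.autograd.grad(obj, delta)
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-eps, eps)  # stay inside the l_inf ball
    return delta.detach()
```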

To evaluate the robustness of TRPO with SR2L, we run both the baseline TRPO and TRPO-SR in the Swimmer environment. Figure 7 plots the cumulative reward against the disturbance strength ε. For both random and adversarial disturbances, increasing the strength of the disturbance decreases the cumulative reward of the learned policies. On the other hand, TRPO-SR clearly achieves improved robustness against perturbations, as its reward declines much more slowly than that of the baseline TRPO.

To evaluate the robustness of DDPG with SR2L, we run the baseline DDPG, DDPG-SR-A and DDPG-SR-C in the HalfCheetah environment. Figure 8 plots the cumulative reward against the disturbance strength ε. We see that incorporating the proposed smoothness-inducing regularizer into either the actor or the critic network improves the robustness of the learned policy against state disturbances.


Figure 7: Cumulative reward (mean ± standard deviation over multiple rollouts, plotted against log ε) of TRPO-SR (orange) trained policies versus the TRPO (blue) baseline when tested in the disturbed Swimmer environment. (a) Randomly disturbed rollouts, with the disturbance uniformly sampled from Bd(0,ε) = {δ : ‖δ‖∞ ≤ ε}; (b) adversarially disturbed rollouts, with the disturbance taken from the same set. TRPO-SR trained policies show a slower decline in performance than TRPO as the disturbance strength increases, and a significant reduction of variance under large disturbance strength.

Figure 8: Cumulative reward (mean ± standard deviation over multiple rollouts, plotted against log ε) of DDPG-SR-A (orange) and DDPG-SR-C (green) trained policies versus the DDPG (blue) baseline when tested in the disturbed HalfCheetah environment. (a) Adversarially disturbed rollouts and (b) randomly disturbed rollouts, with disturbances taken from Bd(0,ε) = {δ : ‖δ‖∞ ≤ ε}. As the disturbance strength ε increases, DDPG-SR-A and DDPG-SR-C trained policies show a slower decline in performance than DDPG.

4.4 Sensitivity to Hyperparameters

The proposed smoothness-inducing regularizer involves two hyper-parameters: the regularization coefficient λs and the perturbation strength ε. We vary the choices of (ε, λs) and plot a heatmap of the cumulative reward for each configuration in Figure 9. In principle, λs and ε control the strength of the regularization in a similar way: larger λs and ε both increase the strength of the regularization, advocating more smoothness, while smaller λs and ε both decrease the strength of the regularization, favoring cumulative reward over smoothness. We observe this behavior in Figure 9: a relatively small λs with a large ε yields performance similar to a relatively large λs with a small ε.

5 Conclusion

We develop a novel regularization-based training framework, SR2L, to learn a smooth policy in reinforcement learning. The proposed regularizer encourages the learned policy to produce similar decisions for similar states. It can be applied either to induce smoothness in the policy directly, or to induce smoothness in the Q-function, and thus enjoys broad applicability. We demonstrate the effectiveness of SR2L by applying it to two popular reinforcement learning algorithms, TRPO and DDPG. Our empirical results show that SR2L improves the sample efficiency and training stability of current algorithms. In addition, the induced smoothness in the learned policy also improves robustness against both random and adversarial perturbations of the state.


Figure 9: Sensitivity to hyperparameters on Swimmer: heatmap of the cumulative reward of TRPO-SR trained policies for different values of ε and λs (axes: log ε and log λs). Each policy is trained for 2M environment steps.

References

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pp. 369–376, 1995.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Cheng, R., Verma, A., Orosz, G., Chaudhuri, S., Yue, Y., and Burdick, J. W. Control regularization for reduced variance reinforcement learning. arXiv preprint arXiv:1905.05380, 2019.

The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Hampel, F. R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons, 2011.

Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. arXiv preprint arXiv:1906.12340, 2019.

Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., and Li, B. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Huber, P. J. Robust Statistics, volume 523. John Wiley & Sons, 2004.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Zhao, T. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437, 2019.

Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201, 2018.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Monahan, G. E. State of the art - a survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2817–2826. JMLR.org, 2017.

Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems, pp. 13824–13833, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.

Zhao, J., Qiu, G., Guan, Z., Zhao, W., and He, X. Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1021–1030, 2018.

Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176, 2018.


A Appendix

We present two variants of DDPG with the proposed smoothness-inducing regularizer. The first variant, DDPG-SR-A, directly learns a smooth policy with a regularizer that measures the non-smoothness of the actor network (policy). The second variant, DDPG-SR-C, learns a smooth Q-function with a regularizer that measures the non-smoothness of the critic network (Q-function). We present the details of DDPG-SR-A and DDPG-SR-C in Algorithm 2 and Algorithm 3, respectively.

Algorithm 2 DDPG with smoothness-inducing regularization on the actor network (DDPG-SR-A).

Input: step size for target networks α ∈ (0,1), coefficient of the regularizer λs, perturbation strength ε, number of iterations D for the inner optimization problem, number of training steps T, number of training episodes M, step size ηδ for the inner maximization, step size η for updating the actor/critic networks.
Initialize: randomly initialize the critic network Qφ(s,a) and the actor network µθ(s); initialize the target networks Qφ′(s,a) and µθ′(s) with φ′ = φ and θ′ = θ; initialize the replay buffer R.
for episode = 1, ..., M do
    Initialize a random process for action exploration (generating the exploration noise εt).
    Observe the initial state s1.
    for t = 1, ..., T do
        Select action at = µθ(st) + εt, where εt is the exploration noise.
        Take action at, receive reward rt, and observe the new state st+1.
        Store the transition (st, at, rt, st+1) in the replay buffer R.
        Sample a mini-batch B of transitions {(s_t^i, a_t^i, r_t^i, s_{t+1}^i)}_{i∈B} from the replay buffer R.
        Set y_t^i = r_t^i + γQφ′(s_{t+1}^i, µθ′(s_{t+1}^i)) for i ∈ B.
        Update the critic network: φ ← argmin_φ Σ_{i∈B} (y_t^i − Qφ(s_t^i, a_t^i))².
        for s_t^i ∈ B do
            Randomly initialize δ_i.
            for ℓ = 1, ..., D do
                δ_i ← Π_{Bd(0,ε)}(δ_i + ηδ ∇δ ‖µθ(s_t^i) − µθ(s_t^i + δ_i)‖₂²).
            end for
            Set s̃_t^i = s_t^i + δ_i.
        end for
        Update the actor network:
            θ ← θ + (η/|B|) Σ_{i∈B} ( ∇a Qφ(s,a)|_{s=s_t^i, a=µθ(s_t^i)} ∇θ µθ(s)|_{s=s_t^i} − λs ∇θ ‖µθ(s_t^i) − µθ(s̃_t^i)‖₂² ).
        Update the target networks: θ′ ← αθ + (1−α)θ′, φ′ ← αφ + (1−α)φ′.
    end for
end for


Algorithm 3 DDPG with smoothness-inducing regularization on the critic network (DDPG-SR-C).

Input: step size for target networks α ∈ (0,1), coefficient of the regularizer λs, perturbation strength ε, number of iterations D for the inner optimization problem, number of training steps T, number of training episodes M, step size ηδ for the inner maximization, step size η for updating the actor/critic networks.
Initialize: randomly initialize the critic network Qφ(s,a) and the actor network µθ(s); initialize the target networks Qφ′(s,a) and µθ′(s) with φ′ = φ and θ′ = θ; initialize the replay buffer R.
for episode = 1, ..., M do
    Initialize a random process for action exploration (generating the exploration noise εt).
    Observe the initial state s1.
    for t = 1, ..., T do
        Select action at = µθ(st) + εt, where εt is the exploration noise.
        Take action at, receive reward rt, and observe the new state st+1.
        Store the transition (st, at, rt, st+1) in the replay buffer R.
        Sample a mini-batch B of transitions {(s_t^i, a_t^i, r_t^i, s_{t+1}^i)}_{i∈B} from the replay buffer R.
        Set y_t^i = r_t^i + γQφ′(s_{t+1}^i, µθ′(s_{t+1}^i)) for i ∈ B.
        for s_t^i ∈ B do
            Randomly initialize δ_i.
            for ℓ = 1, ..., D do
                δ_i ← Π_{Bd(0,ε)}(δ_i + ηδ ∇δ (Qφ(s_t^i, a_t^i) − Qφ(s_t^i + δ_i, a_t^i))²).
            end for
            Set s̃_t^i = s_t^i + δ_i.
        end for
        Update the critic network:
            φ ← argmin_φ Σ_{i∈B} (y_t^i − Qφ(s_t^i, a_t^i))² + λs Σ_{i∈B} (Qφ(s_t^i, a_t^i) − Qφ(s̃_t^i, a_t^i))².
        Update the actor network:
            θ ← θ + (η/|B|) Σ_{i∈B} ∇a Qφ(s,a)|_{s=s_t^i, a=µθ(s_t^i)} ∇θ µθ(s)|_{s=s_t^i}.
        Update the target networks: θ′ ← αθ + (1−α)θ′, φ′ ← αφ + (1−α)φ′.
    end for
end for
