
Minimalistic Attacks: How Little it Takes to Fool Deep Reinforcement Learning Policies

Xinghua Qu, Student Member, IEEE, Zhu Sun, Yew Soon Ong, Fellow, IEEE, Pengfei Wei, Abhishek Gupta

Abstract—Recent studies have revealed that neural network based policies can be easily fooled by adversarial examples. However, while most prior works analyze the effects of perturbing every pixel of every frame assuming white-box policy access, in this paper we take a more restrictive view towards adversary generation, with the goal of unveiling the limits of a model's vulnerability. In particular, we explore minimalistic attacks by defining three key settings: (1) black-box policy access: where the attacker only has access to the input (state) and output (action probability) of an RL policy; (2) fractional-state adversary: where only several pixels are perturbed, with the extreme case being a single-pixel adversary; and (3) tactically-chanced attack: where only significant frames are tactically chosen to be attacked. We formulate the adversarial attack by accommodating the three key settings, and explore their potency on six Atari games by examining four fully trained state-of-the-art policies. In Breakout, for example, we surprisingly find that: (i) all policies showcase significant performance degradation by merely modifying 0.01% of the input state, and (ii) the policy trained by DQN is totally deceived by perturbation to only 1% of frames.

Index Terms—Reinforcement Learning, Adversarial Attack.

I. INTRODUCTION

Deep learning [1] has been widely regarded as a promising technique in reinforcement learning (RL), where the goal of an RL agent is to maximize its expected accumulated reward by interacting with a given environment. Although deep neural network (DNN) policies have achieved superhuman performance on various challenging tasks (e.g., video games, robotics and classical control [2]), recent studies have shown that these policies are easily deceived under adversarial attacks [3]–[5]. These works are however found to make some common assumptions, viz., (1) white-box policy access: where the adversarial examples are analytically computed by back-propagating through known neural network weights, (2) full-state adversary: where the adversary changes almost all pixels in the state, and (3) fully-chanced attack: where the attacker strikes the policy at every frame.

Given that most prior works analyze the effects of perturbing every pixel of every frame assuming white-box policy access, we propose to take a more restricted view towards adversary generation, with the goal of exploring the limits of a DNN model's vulnerability in RL.

X. Qu is with the Computational Intelligence Lab, School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (email: [email protected]).
Z. Sun is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (email: [email protected]).
Y.-S. Ong is with the Data Science and Artificial Intelligence Research Centre, School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (email: [email protected]).
P. Wei is with the Department of Computer Science, National University of Singapore, Singapore 119077 (email: [email protected]).
A. Gupta is with the Singapore Institute of Manufacturing Technology (SIMTech), A*STAR, Singapore 138634 (email: [email protected]).

Fig. 1: The single-pixel attack on Atari Breakout: the original state and a perturbed state with one pixel modified are fed to the same policy network, and the resulting action probabilities over RIGHT, LEFT, FIRE and NOOP shift so that the prescribed action changes.

In this paper, we thus focus on minimalistic attacks by only considering adversarial examples that perturb a limited number of pixels in selected frames, and under the restricted black-box policy access. In other words, we intend to unveil how little it really takes to successfully fool state-of-the-art RL policies. Our study is based on three restrictive settings, namely, black-box policy access, fractional-state adversary, and tactically-chanced attack. These concepts are detailed next.

Black-box Policy Access (BPA). Most previous studies focus on a white-box setting [5], which allows full access to a policy network for back-propagation. However, most systems do not release their internal configurations (i.e., network structure and weights), only allowing the model to be queried; this makes the white-box assumption too optimistic from an attacker's perspective [6]. In contrast, we use a BPA setting, where the attacker only has access to the input and output of a policy.

Fractional-State Adversary (FSA). In the FSA setting, the adversary only perturbs a small fraction of the input state. This, in the extreme situation, corresponds to the single-pixel attack shown in Fig. 1, where perturbing a single pixel of the input state is found to change the action prescription from 'RIGHT' to 'LEFT'. In contrast, most previous efforts [5] are mainly based on a full-state adversary (i.e., the number of modified pixels is fairly large, usually spanning the entire frame).

Tactically-Chanced Attack (TCA). In previously studied RL adversarial attacks [3], [7], [8], the adversary strikes the policy on every frame of an episode; this is a setting termed the fully-chanced attack. Contrarily, we investigate a relatively restrictive case where the attacker only strikes at a few selected frames, a setting we term the tactically-chanced attack.


In this setting, a minimal number of frames is explored to strategically deceive the policy network.

The proposed restrictive settings are deemed to be significant for safety-critical RL applications, such as in the medical treatment of sepsis patients in intensive care units [9], [10] and treatment strategies for HIV [11]. In such domains, there may exist a temporal gap between the acquisition of the input and the action execution, thereby providing a time window in which a tactical attack could take place. In the future, as we move towards integrating end-to-end vision-based systems for healthcare, the sensitivity of prescribed medical actions (such as drug prescription or other medical interventions) to perturbations in medical images poses a severe threat to the utility of RL in such domains. Moreover, the restrictive nature of the attacks makes them practically imperceptible, greatly reducing the chance of identifying and rectifying them. With this in mind, the present paper explores the vulnerability of deep RL models to restrictive adversarial attacks, with particular emphasis on image-based tasks (i.e., Atari games).

The major challenges, therefore, lie in how to delicately accommodate the three restricted settings for adversarial attacks in deep RL. To this end, we design a mathematical program with a novel objective function to generate the FSA under the black-box setting, and propose an entropy-based uncertainty measurement to achieve the TCA. Note that the optimization variables are defined as the discrete 2-D coordinate location(s) and perturbation value(s) of the selectively attacked pixel(s), and the designed mathematical program guarantees a successful deception of the policy as long as a positive objective value can be found. The optimization procedure, for a given input frame, is then carried out by a simple genetic algorithm (GA) [12], [13]. In this regard, the Shannon entropy of the action distribution is specified as the attack uncertainty, whereby only a few salient frames are tactically attacked. We demonstrate the sufficiency of the three restrictive settings on four pre-trained state-of-the-art deep RL policies on six Atari games.

To sum up, the contributions of this paper are listed as follows:

• We unveil how little it takes to deceive an RL policy by considering three restrictive settings, namely, black-box policy access, fractional-state adversary, and tactically-chanced attack;
• We formulate the RL adversarial attack as a black-box optimization problem comprising a novel objective function and discrete FSA optimization variables. We also propose a Shannon entropy-based uncertainty measurement to sparingly select the most vulnerable frames to be attacked;
• We explore the three restrictive settings on the policies trained by four state-of-the-art RL algorithms (i.e., DQN [1], PPO [14], A2C [15], ACKTR [16]) on six representative Atari games (i.e., Pong, Breakout, SpaceInvaders, Seaquest, Qbert, BeamRider);
• Surprisingly, we find that on Breakout: (i) with only a single pixel (≈ 0.01% of the state) attacked, the trained policies are completely deceived; and (ii) by merely attacking around 1% of frames, the policy trained by DQN is totally fooled.

The remainder of this paper is organized as follows. In Section II, we provide an overview of related works on RL adversarial attacks, highlighting the novelty of our paper. Section III presents preliminaries on reinforcement learning and adversarial attacks. In Section IV, we describe our proposed optimization problem formulation and the associated procedure to generate adversarial attacks for deep RL policies. This is followed by Section V, where we report numerical results demonstrating the effectiveness of our attacks. Finally, Section VI concludes this paper.

II. RELATED WORK

Since the work of Szegedy et al. [17], a number of adversarial attack methods have been investigated to fool deep neural networks (DNNs). Most existing approaches generate adversarial examples by pursuing human-invisible perturbations on images [18]–[20], with the goal of deceiving trained classifiers. In other words, they mainly focus on adversarial attacks for supervised learning tasks. Adversarial attacks in RL have been relatively less explored to date.

In RL, Huang et al. [3] were the first to demonstrate that neural network policies are vulnerable to adversarial attacks crafted by adding small modifications to the input state of Atari games. A full-state adversary (i.e., adversarial examples that change almost every pixel in the input state) has previously been generated by a white-box policy access based approach [21], where the adversarial examples are computed via back-propagation. Lin et al. [4] proposed the strategically-timed attack and the so-called enchanting attack, but the adversary generation is still based on a white-box policy access assumption and full-state perturbation. Besides, Kos et al. [22] compared the influence of full-state perturbations with random noise, and utilized the value function to guide the adversary injection.

In summary, existing works are largely based on white-box policy access, together with assumptions of a full-state adversary and a fully-chanced attack. There is little research studying the potency of input perturbations that may be far less extensive. Therefore, our goal in this paper is to investigate an extremely restricted view towards the vulnerability of deep RL models, viz., based on black-box policy access, a fractional-state adversary, and a tactically-chanced attack. It is contended that studying such restrictive scenarios might give new insights into the geometrical characteristics and overall behavior of deep neural networks in high dimensional spaces [23].

III. PRELIMINARIES

This section first provides the background of RL and several representative approaches to learn RL policies. It then illustrates the basic concepts and definition of adversarial attacks.

A. Reinforcement Learning

Reinforcement learning (RL) [24]–[33] can be formulated as a Markov Decision Process (MDP) [34], where the agent interacts with the environment based on the reward and state transition. This decision making process is shown in Fig. 2, where $s_t$ and $r_t$ are the state and reward received from the environment at step $t$.


Fig. 2: Markov Decision Process in Reinforcement Learning: according to the state $s_t$, the agent selects an action $a_t$, and then receives the reward $r_{t+1}$ from the environment.

Here, $a_t$ is the action selected by the agent. Based on $a_t$, the agent interacts with the environment, transitioning to state $s_{t+1}$ and receiving a new reward $r_{t+1}$. This procedure continues until the end of the MDP.

The aim of RL [35] is to find an optimal policy $\pi(\theta^*)$ that maximizes the expected reward:
$$\theta^* = \arg\max_{\theta}\; \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^t r(t) \,\Big|\, \pi_\theta\Big], \qquad (1)$$
where $\theta$ represents the training parameters (e.g., the weights of a neural network); $\sum_{t=0}^{T} \gamma^t r(t) \,|\, \pi_\theta$ is the episode rollout; and $\gamma \in (0, 1)$ denotes the discount factor that balances the long- and short-term rewards.
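For concreteness, the discounted episode return inside the expectation of Eq. (1) can be computed from a recorded reward sequence as in the following minimal sketch (the discount factor 0.99 is an illustrative assumption, not a value taken from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one episode rollout, as in Eq. (1)."""
    g = 0.0
    for r in reversed(rewards):   # backward accumulation avoids explicitly computing gamma^t
        g = r + gamma * g
    return g
```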

To find the optimal policy parameters $\theta^*$, many different RL algorithms have been proposed. We select several representative ones for demonstration, including DQN [1], PPO [14], A2C [15], and ACKTR [16].
• Deep Q-Networks (DQN) [1]: Instead of predicting the probability of each action, DQN computes the Q value for each available action. Such a Q value $Q^*(s, a)$ represents the expected accumulated discounted reward approximated from the current frame. Based on this approximation, DQN is trained by minimizing the temporal difference loss (i.e., the squared Bellman error). In this paper, the corresponding policy for DQN is obtained by selecting the action with the maximum Q value. Besides, to keep the DQN-trained policy consistent with the other three algorithms, a softmax layer is added at the end of DQN (a minimal sketch of this normalization is given after this list).

• Proximal Policy Optimization (PPO) [14]: PPO is an on-policy policy-gradient method that strikes a balance among ease of implementation, sample complexity, and ease of tuning. PPO builds on trust region policy optimization (TRPO), which solves a constrained optimization problem so as to alleviate performance instability. However, PPO handles this issue in a different manner compared to TRPO: it involves a penalty term that reflects the Kullback-Leibler divergence between the old policy's action prediction and the new one. This operation computes an update at each step, minimizing the cost function while restricting the update step to be relatively small.

• Advantage Actor-Critic (A2C) [15]: A2C is an actor-critic method that learns both a policy and a state value function (i.e., the critic). In A2C, the advantage $A$ represents the extra reward obtained if action $a_t$ is taken, formulated as $A = Q(s_t, a_t) - V(s_t)$, where $Q(s_t, a_t)$ is the state-action value and $V(s_t)$ is the state value. This design reduces the variance of the policy gradient and increases stability by including the baseline term $V(s_t)$ in the gradient estimation.

• Actor Critic using Kronecker-Factored Trust Region (ACKTR) [16]: ACKTR is an extension of the natural policy gradient, which optimizes both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with a trust region. ACKTR is the first scalable trust-region natural gradient method for actor-critic algorithms, and its investigation suggests that Kronecker-factored natural gradient approximation is a promising framework in RL.
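As a hedged illustration of the softmax normalization mentioned for DQN above (not the authors' code), converting a vector of Q values into an action distribution comparable with the other policies can be sketched as:

```python
import numpy as np

def q_to_probs(q_values):
    """Convert a vector of Q values into a probability distribution via softmax."""
    q = np.asarray(q_values, dtype=float)
    e = np.exp(q - q.max())        # subtract the max for numerical stability
    return e / e.sum()
```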

Although remarkable performance has been achieved by these algorithms on many challenging tasks (e.g., video games [1] and board games [37], [38]), recent studies have revealed that the policies trained by these algorithms are easily fooled by adversarial perturbations [3]–[5], as introduced next.

B. Adversarial Attack

Recent studies find that deep learning is vulnerable to well-designed input perturbations (i.e., adversaries) [6]. These adversaries can easily fool even seemingly high performing deep learning models with human-imperceptible perturbations. Such vulnerabilities of deep learning models have been well studied in supervised learning, and also to some extent in RL [3], [4], [20].

In RL, the aim of an adversarial attack is to find the optimal adversary $\delta_t$ that minimizes the accumulated reward. Let $T$ represent the length of an episode, and $r(\pi, s_t)$ indicate a function that returns the reward given the state $s_t$ and DNN-based policy $\pi$. Accordingly, the general formulation of an adversarial attack in RL is:
$$\min_{\delta_t} \sum_{t=1}^{T} r(\pi, s_t + \delta_t) \quad \text{s.t.} \quad \forall t\;\; \|\delta_t\| \leq L \qquad (2)$$
where the adversary generated at time-step $t$ is represented by $\delta_t$, and its norm $\|\delta_t\|$ is bounded by $L$. The basic assumption of Eq. (2) is that the misguided action selection of policy $\pi$ will result in a reward degradation. In other words, an action prediction $a_t^p$ obtained from the perturbed state $s_t + \delta_t$ may differ from the originally unperturbed action $a_t^o$, thus threatening the reward value $r(\cdot)$. This corresponds to the definition of untargeted attacks, stated as $a_t^o \neq a_t^p$.

To solve the optimization problem formulated in Eq. (2), many white-box approaches (e.g., FGSM [18]) have been applied in previous efforts [3], [4]. These white-box methods are essentially gradient-based, as they generate adversarial examples by back-propagating through the policy network to calculate the gradient of a cost function with respect to the input state, i.e., $\nabla_{s_t} J(\pi, \theta, \Delta a_t, s_t)$. Here, $\theta$ represents the weights of the neural network; $\Delta a_t$ is the change in the action space; and $J$ indicates the loss function (e.g., the cross-entropy loss).

However, an essential precondition of white-box approaches is complete knowledge of the policy, including the structure and the model parameters $\theta$. Contrarily, a more restrictive setting is black-box policy access [6], which only allows an attacker to present the input state to the policy and observe the output. The policy thus serves as an oracle that only returns the output action prediction.

In addition to the above, most prior studies on attacking RL policies only analyze the effects of perturbing every pixel of every frame, which is deemed too intensive to be of much practical relevance. Therefore, this paper aims to provide a far more restricted view towards the impact of adversarial attacks, with the goal of unveiling the limits of a model's vulnerability.

IV. THE PROPOSED METHODOLOGY

This section provides details of the three key ingredients of the proposed restrictive attack setting, namely black-box policy access (BPA), fractional-state adversary (FSA), and tactically-chanced attack (TCA).

A. Black-box Policy Access

We adopt the commonly used black-box definition [6], [40] from supervised learning, where the attacker can only query the policy. In other words, the attacker is unable to analytically compute the gradient $\nabla_{s_t} \pi(\cdot|s_t, \delta_t)$, but only has the privilege to query a targeted policy so as to obtain useful information for crafting adversarial examples. However, such a setting has rarely been investigated in RL adversarial attacks.

We realize such a BPA setting by formulating the adversarial attack as a black-box optimization problem. To this end, we propose and utilize a measure $D(\cdot)$ to quantify the discrepancy between the original action distribution $\pi(\cdot|s_t)$ (produced by an RL policy without input perturbation) and the corresponding perturbed distribution $\pi(\cdot|s_t + \delta_t)$. Assuming a finite set of $m$ available actions $a_t^1, a_t^2, \cdots, a_t^m$, the probability distribution over actions is represented as $\pi(\cdot|s_t) = [p(a_t^1), p(a_t^2), \cdots, p(a_t^m)]$, where $\sum_{j=1}^{m} p(a_t^j) = 1$. Typically, a deterministic policy selects the action $o = \arg\max_j p(a_t^j)$.

With this, the black-box attack considers the discrepancy measure as the optimization objective function, where the state $s_t$ is perturbed by adversary $\delta_t$ such that the measure $D(\cdot)$ is maximized. The overall problem formulation can thus be stated as:
$$\max_{\delta_t}\; D\big(\pi(\cdot|s_t), \pi(\cdot|s_t + \delta_t)\big) \quad \text{s.t.} \quad \forall t\;\; \|\delta_t\| \leq L. \qquad (3)$$

The above mathematical program changes the action distribution from the original (optimal) distribution $\pi(\cdot|s_t)$ to $\pi(\cdot|s_t + \delta_t)$. If the change in action distribution leads to a change in action selection, then, based on Bellman's principle of optimality [36], it can be concluded that the perturbed action will lead to a sub-optimal reward as a consequence of the attack. Under this hypothesis, we replace the reward minimization problem of Eq. (2) with the discrepancy maximization formulation proposed in Eq. (3).

The exact choice of discrepancy measure $D(\cdot)$ is expected to have a significant impact on the attack performance, as different measures capture different patterns of similarity. Previous works [3], [4], [20] apply the Euclidean norm (e.g., $L_1$, $L_2$, $L_\infty$) between $\pi(\cdot|s_t + \delta_t)$ and $\pi(\cdot|s_t)$. However, such an $L_p$ norm cannot guarantee a successful untargeted attack, since maximizing the $L_p$ norm does not ensure that the action selection for the perturbed state has been altered. Thus, a successful untargeted attack under deterministic action selection must ensure that $\arg\max_j [\pi(\cdot|s_t + \delta_t)]_j \neq \arg\max_j [\pi(\cdot|s_t)]_j$, where $[\pi(\cdot|s_t)]_j = p(a_t^j)$.

To enable a more consistent discrepancy measure for untargeted attacks, we design the following function $D$ based on the query feedback of the policy:
$$D = \max_{j \neq o}\, [\pi(\cdot|s_t + \delta_t)]_j - [\pi(\cdot|s_t + \delta_t)]_o, \quad \text{where } o = \arg\max_j \pi(\cdot|s_t). \qquad (4)$$

This formulation is different from the Euclidean norm, guaranteeing a successful untargeted attack whenever $D$ is positive. To support this claim, we refer to the theorem and proof below.

Theorem 1. Suppose the discrepancy measure $D$ in Eq. (4) is positive and the policy $\pi$ is deterministic, i.e., the action $a_t^o$ is chosen such that $o = \arg\max_j \pi(\cdot|s_t)$. Then the adversarial example $\delta_t$ for policy $\pi$ at state $s_t$ is guaranteed to make a successful untargeted attack, i.e.,
$$\arg\max_j\, [\pi(\cdot|s_t + \delta_t)]_j \neq \arg\max_j\, [\pi(\cdot|s_t)]_j.$$

Proof. We use the symbols $o$ and $p$ to represent the action indices prescribed at state $s_t$ and perturbed state $s_t + \delta_t$, respectively. As the policy $\pi$ is deterministic, $o$ and $p$ are computed by:
$$o = \arg\max_j\, [\pi(\cdot|s_t)]_j, \qquad p = \arg\max_j\, [\pi(\cdot|s_t + \delta_t)]_j.$$
Since the discrepancy measure $D$ is positive, we have
$$D > 0 \;\Rightarrow\; \max_{j \neq o} [\pi(\cdot|s_t + \delta_t)]_j - [\pi(\cdot|s_t + \delta_t)]_o > 0 \;\Rightarrow\; \max_{j \neq o} [\pi(\cdot|s_t + \delta_t)]_j > [\pi(\cdot|s_t + \delta_t)]_o \;\Rightarrow\; \max_j [\pi(\cdot|s_t + \delta_t)]_j = \max_{j \neq o} [\pi(\cdot|s_t + \delta_t)]_j.$$
Therefore, the perturbed action index $p$ is given by:
$$p = \arg\max_j\, [\pi(\cdot|s_t + \delta_t)]_j = \arg\max_{j \neq o}\, [\pi(\cdot|s_t + \delta_t)]_j \neq o \;\Rightarrow\; \arg\max_j\, [\pi(\cdot|s_t + \delta_t)]_j \neq \arg\max_j\, [\pi(\cdot|s_t)]_j. \qquad \blacksquare$$

With the discrepancy measure $D$, the mathematical program for generating adversarial attacks is given by:
$$\max_{\delta_t}\; \max_{j \neq o}\, [\pi(\cdot|s_t + \delta_t)]_j - [\pi(\cdot|s_t + \delta_t)]_o, \quad \text{where } \forall t\;\; \|\delta_t\| \leq L, \;\; o = \arg\max_j\, [\pi(\cdot|s_t)]_j. \qquad (5)$$
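To make the black-box objective concrete, the following is a minimal sketch (an assumption-level illustration, not the authors' released code) of evaluating the discrepancy $D$ of Eq. (5) for a candidate perturbed state, given only query access to a hypothetical `policy_query` function that returns the action-probability vector for a state:

```python
import numpy as np

def discrepancy(policy_query, state, perturbed):
    """Black-box objective of Eq. (5); a positive value certifies an untargeted attack."""
    p_clean = np.asarray(policy_query(state), dtype=float)      # pi(.|s_t)
    o = int(np.argmax(p_clean))                                  # originally prescribed action
    p_adv = np.asarray(policy_query(perturbed), dtype=float)     # pi(.|s_t + delta_t)
    return float(np.max(np.delete(p_adv, o)) - p_adv[o])         # max over j != o, minus prob of o
```

By Theorem 1, any candidate for which this value is positive is guaranteed to flip the prescribed action.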

When solving the optimization problem in Eq. (5), $\delta_t$ has to be determined through its parameterization. In this paper, $\delta_t$ is limited to perturbing only a small fraction of the input state (i.e., a fractional-state adversary), as described next.


Fig. 3: (a) Full-state adversary [18] versus (b) the fractional-state adversary (FSA) [39].

B. Fractional-State Adversary

To explore adversarial attacks limited to a few pixels in RL, we investigate the fractional-state adversary (FSA) setting. In comparison with the full-state adversary depicted in Fig. 3(a), the FSA only perturbs a fraction of the state, as shown in Fig. 3(b). The extreme scenario for the FSA is a single-pixel attack, which is deemed to be more physically realizable for the attacker than a full-state one. For instance, pasting a sticker or a simple fading of color on a "STOP" sign could easily lead to an FSA like that in Fig. 3(b).

To achieve the FSA setting, we propose a different parameterization for the adversary $\delta_t$ with discrete representations of the perturbed pixels. In doing so, we parameterize the adversary $\delta_t$ by its corresponding pixel coordinates ($\mathbf{x}_t, \mathbf{y}_t$) and perturbation values $\mathbf{p}_t$:
$$\delta_t = \mathcal{P}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{p}_t) = \mathcal{P}(x_t^1, y_t^1, p_t^1, \cdots, x_t^n, y_t^n, p_t^n) \qquad (6)$$
where $n$ is the number of pixels that are attacked, $x_t^i, y_t^i$ are the coordinate values of the $i$th perturbed pixel, and $p_t^i$ is its adversarial perturbation value. With Eq. (6), the final black-box optimization problem is reformulated as:
$$\begin{aligned}
\max_{\mathbf{x}_t, \mathbf{y}_t, \mathbf{p}_t}\;\; & \max_{j \neq o}\, \big[\pi(\cdot|s_t + \mathcal{P}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{p}_t))\big]_j - \big[\pi(\cdot|s_t + \mathcal{P}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{p}_t))\big]_o \\
\text{where:}\;\; & \forall t\;\; o = \arg\max_j\, [\pi(\cdot|s_t)]_j; \\
& \forall t\;\; 0 \leq x_t^i \leq I_x,\; x_t^i \in \mathbb{N},\; i = 1, 2, \cdots, n; \\
& \forall t\;\; 0 \leq y_t^i \leq I_y,\; y_t^i \in \mathbb{N},\; i = 1, 2, \cdots, n; \\
& \forall t\;\; 0 \leq p_t^i \leq I_p,\; p_t^i \in \mathbb{N},\; i = 1, 2, \cdots, n
\end{aligned} \qquad (7)$$

where $I_x$, $I_y$ are the integral upper bounds of $\mathbf{x}_t$ and $\mathbf{y}_t$ respectively, $I_p$ is the perturbation value bound of $\mathbf{p}_t$, and $n$ is the FSA size that controls the number of pixels to be attacked in any input state $s_t$. The extreme case is $n = 1$, implying that only one pixel is attacked in a frame. To the best of our knowledge, such a one-pixel attack has never been explored in the existing literature on RL adversarial attacks.
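As a small illustrative sketch (the exact write semantics, e.g. whether the perturbation is applied to all four stacked frames or only to the latest one, are assumptions not fixed by the paper), the FSA of Eq. (6) can be applied to a state as follows:

```python
import numpy as np

def apply_fsa(state, xs, ys, ps):
    """Overwrite the n selected pixels of an 84x84x4 state with perturbation values ps."""
    perturbed = state.copy()
    for x, y, p in zip(xs, ys, ps):
        perturbed[y, x] = p      # writes all stacked channels at (x, y); n = 1 gives the one-pixel attack
    return perturbed
```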

To obtain the optimal $\mathbf{x}_t, \mathbf{y}_t, \mathbf{p}_t$, we adopt the genetic algorithm (GA) [12], [13], [41] (i.e., an evolutionary computation [42]–[49] method for black-box optimization). The pseudocode of the GA is illustrated in Algorithm 1 (lines 7-17). A population containing $\lambda$ individuals is evolved, where each individual $\delta_t^i$, $i \in [1, \lambda]$, is a particular adversarial example that is evaluated by the objective function in Eq. (7). The top-$\lambda\beta$ individuals with respect to the objective value are selected as elites to survive in the new generation (line 10). Based on standard crossover (line 12) and mutation (line 13) operations, another population pop' is reproduced to replace the old pop.

Algorithm 1: Attack by Genetic Algorithm
Input: $s_t$ – state at time-step $t$; $n$ – number of pixels for attack; $\lambda$ – population size; $\beta$ – rate of elitism; $\gamma$ – rate of mutation; $f_m$ – maximum number of function evaluations; $\zeta^*$ – TCA boundary value
1: Load the policy $\pi$ well-trained by a particular RL algorithm;
2: Obtain the RL input state $s_t$ in a particular Atari game;
3: Calculate the attack uncertainty $\zeta_t$ by Eq. (8);
4: if $\zeta_t > \zeta^*$ then
5:     $\delta_t$ = none;  // No adversarial attack
6: else  // Explore the $n$-pixel attack
7:     Initialize the population pop by randomly generating $\lambda$ individuals;
8:     repeat
9:         Evaluate $f_i$, $i = 1, \cdots, \lambda$, using Eq. (7);
10:        Save the top-$\lambda\beta$ individuals in pop';
11:        for $j = 1 : \lambda(1-\beta)/2$ do
12:            Crossover: generate $\delta_c^j$ with one-point crossover [12];
13:            Mutation: $\delta_m^j = \delta^j + \gamma \cdot \epsilon^j$, $\epsilon^j \sim \mathcal{N}(0, 1)$;
14:        Append $\delta_c$ and $\delta_m$ to pop';
15:        pop = pop';
16:    until the maximum number of objective evaluations $f_m$ is reached;
17:    Return the optimal FSA $\delta^* = [x^*, y^*, p^*]$;

This process repeats until the maximum number of objective evaluations $f_m$ is reached.
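For intuition, a simplified sketch of Algorithm 1's GA loop is given below. It is a hedged illustration under assumed details (parent selection, mutation scale, and integer rounding are not specified in the pseudocode), not the authors' implementation; the `objective` callable stands for Eq. (7), e.g. composed from the `apply_fsa` and `discrepancy` sketches above.

```python
import numpy as np

def ga_attack(objective, n_pixels=1, pop_size=10, elite_rate=0.2, mut_scale=10.0,
              max_evals=400, img_size=84, pix_max=255, seed=None):
    """Evolve [x1, y1, p1, ..., xn, yn, pn] individuals; stop early once the objective is positive."""
    rng = np.random.default_rng(seed)
    dim = 3 * n_pixels
    highs = np.tile([img_size - 1, img_size - 1, pix_max], n_pixels)
    pop = rng.integers(0, highs + 1, size=(pop_size, dim))
    n_elite = max(1, int(pop_size * elite_rate))
    best, best_fit, evals = pop[0].copy(), -np.inf, 0

    while evals < max_evals:
        fits = np.array([objective(ind) for ind in pop])
        evals += pop_size
        if fits.max() > best_fit:
            best, best_fit = pop[int(np.argmax(fits))].copy(), float(fits.max())
        if best_fit > 0:                                   # positive D guarantees a flipped action
            break
        elite = pop[np.argsort(fits)[::-1][:n_elite]]      # elitism (line 10)
        children = []
        while len(children) < pop_size - n_elite:
            a, b = pop[rng.integers(pop_size, size=2)]     # two parents chosen at random
            cut = int(rng.integers(1, dim))                # one-point crossover (line 12)
            child = np.concatenate([a[:cut], b[cut:]])
            child += np.rint(mut_scale * rng.standard_normal(dim)).astype(int)  # mutation (line 13)
            children.append(np.clip(child, 0, highs))
        pop = np.vstack([elite, np.array(children)])
    return best
```

With the settings reported in Section V (population size 10 and a budget of 400 evaluations), such a loop runs at most 40 generations per attacked frame.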

C. Tactically-Chanced Attack

To explore adversarial attacks on a restricted number of frames, we design the tactically-chanced attack (TCA), where the attacker strategically strikes only salient frames that are likely to contribute most to the accumulated reward. In contrast, most existing approaches [3] apply adversarial examples on every frame, which is referred to as the fully-chanced attack. Our proposed TCA is clearly more restrictive, as can further be highlighted from three different perspectives: (1) due to communication budget restrictions, the attacker may not be able to strike the policy in a fully-chanced fashion; (2) a tactically-chanced attack is less likely to be detected by the defender; and (3) only striking the salient frames improves the attack efficiency, as many frames contribute trivially to the accumulated reward.

To this end, we define the normalized Shannon entropy of the action distribution $\pi(\cdot|s_t) = [p(a_t^1), p(a_t^2), \cdots, p(a_t^m)]$ as a measure of the attack uncertainty. The attack uncertainty $\zeta_t$ of each frame is given by
$$\zeta_t = -\frac{\sum_{i=1}^{m} p(a_t^i) \cdot \log p(a_t^i)}{\log m} \qquad (8)$$

where $p(a_t^i)$ is the probability of action $a_t^i$, and $m$ is the dimension of the action space. As each probability $p(a_t^i) \in [0, 1]$, $\zeta_t$ is constrained to $(0, 1]$.
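The normalized entropy of Eq. (8) is straightforward to compute from the queried action probabilities; the sketch below is illustrative only (the small clamp that avoids $\log 0$ is an assumption, not part of the paper's definition):

```python
import numpy as np

def attack_uncertainty(action_probs, eps=1e-12):
    """Normalized Shannon entropy (Eq. 8); values near 0 flag confident, salient frames."""
    p = np.clip(np.asarray(action_probs, dtype=float), eps, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))
```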


Fig. 4: The six representative Atari games [50] (i.e., Pong, Breakout, SpaceInvaders, Seaquest, Qbert, BeamRider) applied in the experiments.

Fig. 5: The attack uncertainty $\zeta_t$ on Pong over the frames of an episode; the horizontal red line marks the TCA threshold $\zeta^*$.

In this formulation, a frame with relatively low $\zeta_t$ indicates that the policy has high confidence in its prescribed action. Hence, we assume that attacking such frames can be expected to effectively disrupt the policy.

We demonstrate the attack uncertainty on Pong in Fig. 5; similar trends can be observed on the other games as well. In the figure, the frames with smaller $\zeta_t$, marked by brown rectangles, depict situations where the ball is close to the paddle. On the other hand, the blue-marked rectangles with larger $\zeta_t$ (around 1) indicate that the ball is distant from the paddle. Attacking the brown-marked frames is intuitively more efficient for fooling the policy, as they are likely to lead to a more considerable reward loss. In contrast, an attack may be relatively inconsequential for the blue-marked frames: as the ball is distant from the paddle, attacking such frames will have little impact on the reward.

Therefore, we formulate the TCA by defining a TCA threshold $\zeta^*$, shown by the red line in Fig. 5; this controls the proportion of attacked frames, where only those frames with an uncertainty value below $\zeta^*$ are perturbed. In the experimental section, we analyze the effect of the threshold $\zeta^*$ by varying its value; this also helps us explore how little it actually takes to deceive a policy (from the perspective of the number of frames attacked). The adversary $\delta_t$ under the TCA setting is thus given by:

$$\delta_t = \begin{cases} \mathcal{P}(x_t^*, y_t^*, p_t^*), & \text{if } \zeta_t < \zeta^*, \\ \text{none}, & \text{if } \zeta_t \geq \zeta^*. \end{cases} \qquad (9)$$

Eq. (9) suggests that if the attack uncertainty $\zeta_t$ is smaller than $\zeta^*$, an adversary is generated by solving the optimization problem in Eq. (7). Otherwise, the attacker tactically hides without wasting resources on trivial frames.
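Putting the pieces together, a per-frame attack step under the TCA rule of Eq. (9) might look like the following sketch, which simply reuses the hypothetical helpers introduced above (a schematic composition, not the paper's released pipeline):

```python
def tca_step(policy_query, state, zeta_star, n_pixels=1):
    """Attack a single frame only if its attack uncertainty falls below the TCA threshold."""
    if attack_uncertainty(policy_query(state)) >= zeta_star:
        return state                                    # tactically hide: leave this frame untouched
    best = ga_attack(lambda ind: discrepancy(
        policy_query, state, apply_fsa(state, ind[0::3], ind[1::3], ind[2::3])),
        n_pixels=n_pixels)
    return apply_fsa(state, best[0::3], best[1::3], best[2::3])
```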

Fig. 6: The neural network structure for playing Atari games: an 84×84×4 input state passes through 8×8×32, 4×4×64, and 3×3×64 convolutions (each followed by batch normalization), yielding 20×20×32, 9×9×64, and 7×7×64 feature maps, followed by a 512-unit feature layer and the action-probability output.

Summary. By considering the three settings (i.e., BPA, FSA and TCA), we carry out an exploration of restrictive scenarios for RL adversarial attacks, only focusing on a minimal number of pixels and frames under black-box policy access.

V. EXPERIMENTS AND ANALYSIS

We evaluate the three restrictive settings on six Atari games in OpenAI Gym [50] with various difficulty levels, including Pong, Breakout, SpaceInvaders, Seaquest, Qbert, and BeamRider1. These games are shown in Fig. 4. For each game, the policies are trained by four state-of-the-art RL algorithms, namely DQN [1], PPO2 [14], A2C [15], and ACKTR [16]. The network structure is adopted from [51] and is kept the same for all four RL algorithms.

A. Experiment Setup

We utilize the fully-trained policies from the RL Baselines Zoo [52]. Each policy follows the same pre-processing steps and neural network architecture (shown in Fig. 6) as in [1]. The input state $s_t$ of the neural network is the concatenation of the last four screen images, where each image is resized to 84 × 84. The pixel values of the grey-scale images are integers in the range [0, 255]. The output of the policy is a distribution over candidate actions for PPO2, A2C and ACKTR, and an estimate of the Q values for DQN.
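For reference, a network matching the layer sizes in Fig. 6 is sketched below. This is a hedged PyTorch illustration: the strides are the standard DQN choices that reproduce the listed feature-map sizes, and the softmax head mirrors the normalization described for DQN; it is not the checkpointed architecture itself.

```python
import torch
import torch.nn as nn

class AtariPolicyNet(nn.Module):
    """84x84x4 state -> action probabilities, following the layer sizes of Fig. 6."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.BatchNorm2d(32), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.BatchNorm2d(64), nn.ReLU(),  # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.BatchNorm2d(64), nn.ReLU(),  # -> 64 x 7 x 7
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),                                      # 512 features
        )
        self.head = nn.Linear(512, n_actions)

    def forward(self, x):                      # x: (batch, 4, 84, 84), scaled to [0, 1]
        return torch.softmax(self.head(self.features(x)), dim=-1)
```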

To calculate the attack uncertainty for DQN, the softmax operation is applied to normalize the output. Moreover, given the image size and pixel range, the constraints in Eq. (7) are set as $I_x = 84$, $I_y = 84$, $I_p = 255$. In the GA, the population size $\lambda$ is 10, and the maximum number of objective evaluations $f_m$ is set to 400. However, as our objective function comes with a guarantee of a successful untargeted adversarial attack, the optimization process terminates as soon as a positive function value is found.

1Our code is available at https://github.com/RLMA2019/RLMA


Fig. 7: The results of adversarial attacks with different FSA sizes $n$ (i.e., number of attacked pixels) on (a) Pong, (b) Breakout, (c) SpaceInvaders, (d) Seaquest, (e) Qbert and (f) BeamRider, for policies trained by DQN, PPO2, A2C and ACKTR, where the line and shaded area illustrate the mean and standard deviation of 30 independent runs respectively.

Fig. 8: The results of adversarial attacks with different TCA boundaries $\zeta^*$ (x-axis: proportion of attacked frames) on the six games, for policies trained by DQN, PPO2, A2C and ACKTR, where the line and shaded area illustrate the mean and standard deviation of 30 independent runs respectively.

To investigate the impact of the FSA size $n$, we apply a grid search over [1, 10] in steps of 1. We also apply a grid search for the tactically-chanced attack (TCA) threshold $\zeta^*$ in the range [0, 1]; this corresponds to different proportions of attacked frames, as depicted on the x-axis of Fig. 8.

B. Results for Fractional-State Adversary

We investigate the effectiveness of the fractional-state adversary (FSA) by varying the FSA size $n$ in the range [1, 10] in steps of 1, where $n = 1$ corresponds to the extreme case in which only one pixel is perturbed. For a fair comparison across algorithms and games, we set the TCA threshold $\zeta^*$ as the mean of the $\zeta$ values of all frames obtained from unattacked policy testing. Fig. 7 illustrates the performance (i.e., accumulated reward) drop of the policies trained by the four RL algorithms on the six Atari games. Several interesting observations can be noted.

(1) Overall, the FSA size $n$ is positively related to the performance drop. That is to say, an FSA with larger $n$ is able to deceive the policy more easily. (2) On most of the games, the policies are almost completely fooled with $n \leq 4$. This indicates that a small FSA size is sufficient to achieve a successful adversarial attack, confirming the efficiency of the FSA. (3) There is a relatively large performance drop for the DQN-based policy in comparison with policies trained by the other RL algorithms, especially on Breakout and Qbert. This suggests that the DQN-based policy is more vulnerable, whereas the ACKTR-based policy is more robust under attack in the FSA setting.


Fig. 9: The number of attacked frames under different proportions of attacked frames on the six games, for policies trained by DQN, PPO2, A2C and ACKTR, where the line and shaded area illustrate the mean and standard deviation of 30 independent runs respectively.

Fig. 10: The relationship between the number of attacked frames and the number of total frames on the six games, for policies trained by DQN, PPO2, A2C and ACKTR, where the line and shaded area illustrate the mean and standard deviation of 30 independent runs respectively.

(4) The results also imply that the high performance of an unattacked policy cannot guarantee strong robustness against attack. For instance, in SpaceInvaders, the original performance of A2C is higher than that of DQN, but the performance of A2C drops faster than that of DQN, as shown in Fig. 7(c). (5) Most importantly, we surprisingly find that on Breakout, with a single pixel attacked, the policies trained by all four RL algorithms are completely deceived, with the accumulated reward close to 0. On Qbert and BeamRider, the policy trained by DQN is also completely deceived with only a single pixel attacked. Such a single-pixel attack corresponds to an approximate perturbation proportion of 0.01% (i.e., 1/(84 × 84)) of the total number of pixels in a frame.

C. Results for Tactically-Chanced Attack

We study the tactically-chanced attack (TCA) by altering the TCA threshold $\zeta^*$, where different $\zeta^*$ values correspond to different proportions of attacked frames. In our exploration, we set $\zeta^*$ in the range (0, 1], where $\zeta^* = 0$ and $\zeta^* = 1$ correspond to 0% and 100% of frames being attacked, respectively. Recall that, according to the FSA size analysis shown in Fig. 7, most of the policies are sufficiently fooled with no more than 4 pixels attacked. Hence, for a fair comparison, we set the FSA size $n = 4$ in the subsequent experiments. The results are displayed in Fig. 8, where we observe that, in general, the performance (i.e., accumulated reward) decreases when a higher proportion of frames is attacked.


Fig. 11: The impact of the averaged number of attacked frames on the accumulated reward on Breakout (single-pixel attack). DQN: 0, 1.2, 4.5, 6.0, 12.2, 14.6, 20.9 attacked frames yield rewards of 224, 144.3, 51.6, 13.7, 12.8, 11.6, 4; PPO2: 0, 27.3, 40.0, 50.2, 56.3 attacked frames yield 266.4, 14.4, 3.6, 1.2, 0.7; A2C: 0, 88.8, 136.3, 156.7, 166.0 attacked frames yield 390.7, 12.6, 6.1, 3.1, 3; ACKTR: 0, 73.2, 83.5, 123.4, 173.1 attacked frames yield 464.6, 13.1, 3.7, 4.2, 3.6.

This demonstrates that the policies are more easily fooled when more frames are attacked. In addition, we make several other fascinating findings.

(1) In Fig. 8(a) on Pong, with only 10% of frames attacked, the models trained by DQN and PPO2 are totally fooled, with the accumulated reward around −20. This also suggests that the models trained by A2C and ACKTR are more robust than those trained by DQN and PPO2. (2) In Fig. 8(b) on Breakout, the model trained by DQN is totally deceived by perturbing less than 5% of frames on average. (3) In Fig. 8(f) on Beamrider, even when the proportion of attacked frames is about 20%, all policies obtain a reward of less than 2000. In particular, for the policies trained by DQN and PPO2, the accumulated reward drops from around 14000 to 2000.

We further analyze the exact number of attacked frames under different TCA threshold values, which represent different proportions of attacked frames. As shown in Fig. 9, the number of attacked frames increases as the proportion value grows, yet the exact number of attacked frames remains relatively low considering the total number of frames. In Fig. 9(b), on Breakout, when the proportion value is smaller than 20%, less than 50 frames are attacked. The same finding is observed on Beamrider, where less than 100 frames are attacked when the proportion value is smaller than 20%. On Pong, to achieve the same reward degradation, the policy trained by A2C requires more frames to be attacked than the other three RL policies.

For a better demonstration, we also analyze the relation between attacked frames and total frames in Fig. 10. We find similar trends on the six games, where the number of total frames decreases as more frames are attacked. In particular, this relation is clearly observed on Breakout, where attacking less than 50 frames considerably reduces the number of total frames from around 2000 to a small value. This probably results from the fact that the attacked frames may cause (i) the termination of the game, or (ii) a loss of life. Both outcomes greatly shorten the episode, resulting in a smaller number of total frames. For instance, if the paddle in Breakout is deceived into missing the ball, the agent loses one of its five lives.

To explore an even more restrictive setting, we additionally examine the single-pixel ($n = 1$) attack on limited frames. We illustrate the results on Breakout in Fig. 11; similar trends can be observed on the other games. From this figure, we surprisingly find that by attacking only six frames on average, the policy trained by DQN is totally fooled, with the reward decreasing from 224 to 13.7. Here, six frames correspond to an attack ratio of 6/535 ≈ 1%, where 535 is the averaged number of total frames. Similar surprising findings are also observed on the policies trained by PPO2, A2C and ACKTR. For instance, the policy trained by PPO2 shows a significant reward decline with only 27 frames attacked under the same TCA setting as the other policies.

D. Initialization Analysis

The GA population initialization plays an important role in efficiently generating successful adversarial examples, and thus highlights the practicality of the approach. To this end, we study the effects of random initialization versus warm starting based on prior data. In RL, the simplest form of prior originates from the correlation between successive states in the Markov Decision Process (MDP); this also distinguishes RL from other supervised learning tasks (e.g., classification), where the predictions on different data points are independent. Thus, with the existence of correlations between adjacent frames in mind, we set up two types of population initialization in the GA: (1) GA-RI: random initialization (RI) of pop0; (2) GA-WSI: warm-start initialization (WSI) of pop0, adopting the optimal adversarial example $x_p^*$ found in the previously attacked frame.
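As an assumption-level sketch of the two initialization schemes (the paper does not specify whether GA-WSI seeds the whole population around the previous optimum or only injects it, nor any jitter scale; the version below does the former while always keeping the previous optimum itself):

```python
import numpy as np

def init_population(rng, pop_size, highs, prev_best=None, jitter=5.0):
    """GA-RI when prev_best is None; otherwise GA-WSI seeded from the previous frame's adversary."""
    pop = rng.integers(0, highs + 1, size=(pop_size, len(highs)))       # random initialization
    if prev_best is not None:
        noise = np.rint(jitter * rng.standard_normal((pop_size, len(highs)))).astype(int)
        pop = np.clip(prev_best + noise, 0, highs)                      # warm start around x*_p
        pop[0] = prev_best                                              # inject the optimum unchanged
    return pop
```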

To ensure a fair comparison and for ease of presentation, we randomly attack four frames for each of the six games. On each attacked frame, the GA optimization is executed 10 times independently under both types of initialization. The comparison results are shown in Fig. 12, where the dashed line and shaded area respectively represent the mean and standard deviation of the loss values over the 10 runs. In general, we can clearly see that GA-WSI significantly speeds up the optimization convergence in finding successful adversarial examples (i.e., those corresponding to positive objective function values). In particular, on many frames (e.g., Pong: ξ1, ξ2, ξ4; Breakout: ξ2; Space Invaders: ξ3; Qbert: ξ3, ξ4), GA-RI takes relatively more time (or is even unable) to obtain a positive objective value, whereas GA-WSI achieves it easily.

We interestingly find that, in some cases, GA-WSI provides a successful adversary without the need for any optimization. This observation underlines the real possibility of launching successful adversarial attacks on deep RL policies in real time at little or no computational cost, a threat that raises severe concerns about their deployment, especially in safety-critical applications.

In summary, our experimental results and examinations suggest that by moderately setting the FSA size $n$ and TCA threshold $\zeta^*$, an attacker can successfully disrupt a seemingly high performing policy. Moreover, our investigations on different initialization schemes (i.e., GA-RI and GA-WSI) indicate that even restrictive black-box settings for adversarial attacks can pose a big challenge to the utility of state-of-the-art RL methods, thus highlighting the need to focus on robustness in the future.


Fig. 12: The comparison between random initialization (GA-RI) and warm-start (case-injected) initialization (GA-WSI) on four attacked frames (ξ1–ξ4) of each game, plotting the loss value against the generation number, where the line and shaded area illustrate the mean and standard deviation of 10 independent runs respectively.


VI. CONCLUSION

This paper explores minimalistic scenarios for adversarial attacks in deep reinforcement learning (RL), comprising (1) black-box policy access, where the attacker only has access to the input (state) and output (action probability) of an RL policy; (2) a fractional-state adversary, where only a small number of pixels are perturbed, with the extreme case being the one-pixel attack; and (3) a tactically-chanced attack, where only some significant frames are chosen to be attacked. We verify these settings on policies trained by four state-of-the-art RL algorithms on six Atari games. We surprisingly find that: (i) with only a single pixel (≈ 0.01% of the state) attacked, the trained policies can be completely deceived; and (ii) by merely attacking around 1% of frames, the policy trained by DQN is totally fooled on certain games.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[2] Z. Li, J. Liu, Z. Huang, Y. Peng, H. Pu, and L. Ding, “Adaptive impedance control of human–robot cooperation using reinforcement learning,” IEEE Transactions on Industrial Electronics, vol. 64, no. 10, pp. 8013–8022, 2017.
[3] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial attacks on neural network policies,” arXiv preprint arXiv:1702.02284, 2017.
[4] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun, “Tactics of adversarial attack on deep reinforcement learning agents,” arXiv preprint arXiv:1703.06748, 2017.
[5] X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
[6] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.
[7] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary, “Robust deep reinforcement learning with adversarial attacks,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 2040–2042.
[8] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2817–2826.
[9] A. Raghu, “Reinforcement learning for sepsis treatment: Baselines and analysis,” Reinforcement Learning for Real Life, 2019.
[10] A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi, “Continuous state-space models for optimal sepsis treatment - a deep reinforcement learning approach,” arXiv preprint arXiv:1705.08422, 2017.
[11] S. Parbhoo, J. Bogojeska, M. Zazzi, V. Roth, and F. Doshi-Velez, “Combining kernel and model based learning for HIV therapy selection,” AMIA Summits on Translational Science Proceedings, vol. 2017, p. 239, 2017.
[12] J. H. Holland, “Genetic algorithms,” Scientific American, vol. 267, no. 1, pp. 66–73, 1992.
[13] K. Deep, K. P. Singh, M. L. Kansal, and C. Mohan, “A real coded genetic algorithm for solving integer and mixed integer optimization problems,” Applied Mathematics and Computation, vol. 212, no. 2, pp. 505–518, 2009.
[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
[16] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation,” in Advances in Neural Information Processing Systems, 2017, pp. 5279–5288.
[17] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
[18] I. Goodfellow, N. Papernot, S. Huang, Y. Duan, P. Abbeel, and J. Clark, “Attacking machine learning with adversarial examples,” OpenAI, https://blog.openai.com/adversarial-example-research, 2017.
[19] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
[20] C. Xiao, X. Pan, W. He, J. Peng, M. Sun, J. Yi, B. Li, and D. Song, “Characterizing attacks on deep reinforcement learning,” arXiv preprint arXiv:1907.09470, 2019.
[21] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[22] J. Kos and D. Song, “Delving into adversarial attacks on deep policies,” arXiv preprint arXiv:1705.06452, 2017.
[23] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard, “The robustness of deep networks: A geometrical perspective,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 50–62, 2017.
[24] X. Qu, Y.-S. Ong, Y. Hou, and X. Shen, “Memetic evolution strategy for reinforcement learning,” in 2019 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2019, pp. 1922–1928.
[25] Z. Xie and Y. Jin, “An extended reinforcement learning framework to model cognitive development with enactive pattern representation,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 3, pp. 738–750, 2018.
[26] D. Liu, X. Yang, D. Wang, and Q. Wei, “Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints,” IEEE Transactions on Cybernetics, vol. 45, no. 7, pp. 1372–1385, 2015.
[27] B. Xu, C. Yang, and Z. Shi, “Reinforcement learning output feedback NN control using deterministic learning technique,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 3, pp. 635–641, 2013.
[28] F. de La Bourdonnaye, C. Teuliere, J. Triesch, and T. Chateau, “Learning to touch objects through stage-wise deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1–9.
[29] S. Badreddine and M. Spranger, “Injecting prior knowledge for transfer learning into reinforcement learning algorithms using logic tensor networks,” arXiv preprint arXiv:1906.06576, 2019.
[30] C. Colas, P.-Y. Oudeyer, O. Sigaud, P. Fournier, and M. Chetouani, “Curious: Intrinsically motivated modular multi-goal reinforcement learning,” in International Conference on Machine Learning, 2019, pp. 1331–1340.
[31] C.-T. Lin and C. G. Lee, “Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems,” IEEE Transactions on Fuzzy Systems, vol. 2, no. 1, pp. 46–63, 1994.
[32] Y. Hou, Y.-S. Ong, L. Feng, and J. M. Zurada, “An evolutionary transfer reinforcement learning framework for multiagent systems,” IEEE Transactions on Evolutionary Computation, vol. 21, no. 4, pp. 601–615, 2017.
[33] Z. Zhang, Y.-S. Ong, D. Wang, and B. Xue, “A collaborative multiagent reinforcement learning method based on policy gradient potential,” IEEE Transactions on Cybernetics, 2019.
[34] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.
[35] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
[36] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[37] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[38] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
[39] T. B. Brown, D. Mane, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,” arXiv preprint arXiv:1712.09665, 2017.

[40] J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” IEEE Transactions on Evolutionary Computation, 2019.
[41] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
[42] Y. Jin, J. Branke et al., “Evolutionary optimization in uncertain environments - a survey,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 3, pp. 303–317, 2005.
[43] Y. Jin, “A comprehensive survey of fitness approximation in evolutionary computation,” Soft Computing, vol. 9, no. 1, pp. 3–12, 2005.
[44] Y. Jin, M. Olhofer, and B. Sendhoff, “A framework for evolutionary optimization with approximate fitness functions,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 5, pp. 481–494, 2002.
[45] Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, “Data-driven evolutionary optimization: An overview and case studies,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 3, pp. 442–458, 2018.
[46] B. Liu, L. Wang, Y.-H. Jin, F. Tang, and D.-X. Huang, “Improved particle swarm optimization combined with chaos,” Chaos, Solitons & Fractals, vol. 25, no. 5, pp. 1261–1271, 2005.
[47] B. Liu, L. Wang, and Y.-H. Jin, “An effective PSO-based memetic algorithm for flow shop scheduling,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 37, no. 1, pp. 18–27, 2007.
[48] X. Ma, Q. Zhang, G. Tian, J. Yang, and Z. Zhu, “On Tchebycheff decomposition approaches for multiobjective evolutionary optimization,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 226–244, 2017.
[49] L. Feng, Y.-S. Ong, S. Jiang, and A. Gupta, “Autoencoding evolutionary search with learning across heterogeneous problems,” IEEE Transactions on Evolutionary Computation, vol. 21, no. 5, pp. 760–772, 2017.
[50] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
[51] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[52] A. Raffin, “Reinforcement learning baseline zoo,” https://github.com/araffin/rl-baselines-zoo, 2018.

Xinghua Qu received the B.S. degree in Aircraft Design and Engineering from Northwestern Polytechnic University, China, in 2014, and the M.S. degree in Aerospace Engineering from the School of Astronautics, Beihang University, China, in 2017. He is currently a Ph.D. student in the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include reinforcement learning, adversarial machine learning, and evolutionary computation.

Zhu Sun received her Ph.D. degree from the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, in 2018. During her Ph.D. study, she focused on designing efficient recommendation algorithms by considering side information. Her research has been published in leading conferences and journals in related domains (e.g., IJCAI, AAAI, CIKM, ACM RecSys, and ACM UMAP). Currently, she is a research fellow at the School of Electrical and Electronic Engineering, NTU, Singapore.

Yew-Soon Ong received a Ph.D. degree in Artificial Intelligence in complex design from the Computational Engineering and Design Center, University of Southampton, UK, in 2003. He is a Professor and the Chair of the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, where he is also the Director of the Data Science and Artificial Intelligence Research Center and Principal Investigator of the Data Analytics and Complex Systems Programme at the Rolls-Royce@NTU Corporate Lab. His research interest in computational intelligence spans memetic computing, complex design optimization, and big data analytics. He is the founding Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, and an Associate Editor of the IEEE Transactions on Evolutionary Computation, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, and others.

Pengfei Wei received his Ph.D. degree from the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, in 2018. During his Ph.D. study, he focused on domain adaptation and Gaussian processes. His research has been published in leading conferences and journals in related domains (e.g., ICML, IJCAI, ICDM, IEEE TNNLS, and IEEE TKDE). Currently, he is a research fellow at the School of Computing, National University of Singapore.

Abhishek Gupta received his Ph.D. in Engineering Science from the University of Auckland, New Zealand, in 2014. He graduated with a Bachelor of Technology degree in 2010 from the National Institute of Technology Rourkela, India. He currently serves as a Scientist at the Singapore Institute of Manufacturing Technology (SIMTech), Agency for Science, Technology and Research (A*STAR), Singapore. He has diverse research experience in the field of computational science, ranging from numerical methods in engineering physics to topics in computational intelligence. His recent research interests are in the development of memetic computation for automated knowledge extraction and transfer across problems in evolutionary design.