
Model Regularization for Stable Sample Rollouts

Erik Talvitie
Department of Mathematics and Computer Science
Franklin & Marshall College
Lancaster, PA 17604

Abstract

When an imperfect model is used to generate sample rollouts, its errors tend to compound – a flawed sample is given as input to the model, which causes more errors, and so on. This presents a barrier to applying rollout-based planning algorithms to learned models. To address this issue, a training methodology called "hallucinated replay" is introduced, which adds samples from the model into the training data, thereby training the model to produce sensible predictions when its own samples are given as input. Capabilities and limitations of this approach are studied empirically. In several examples hallucinated replay allows effective planning with imperfect models while models trained using only real experience fail dramatically.

1 INTRODUCTION

Online Monte Carlo-based planning algorithms such as Sparse Sampling (Kearns et al. 2002), UCT (Kocsis & Szepesvári 2006), and POMCP (Silver & Veness 2010) are attractive in large problems because their computational complexity is independent of the size of the state space. This type of planning has been successful in many domains where a perfect model is available to the agent, but there is a fundamental barrier in the way of applying these methods to learned models. The core operation in Monte Carlo planning is sampling possible futures by composing the model's one-step predictions (in a process called rollout). When an imperfect model is used, errors in one sampled observation cause more errors in the next sample, and so on until the rollout bears little or no resemblance to any plausible future, making it worthless for planning purposes.

To illustrate, consider the following simple planning problem, which will serve as a running example throughout the paper. In "Mini-Golf" (pictured in Figure 1a), the agent's observations are 20 × 3 binary images. The image shows a ball, a wall, and a pit. A cover slides back and forth over the pit. The agent selects between two actions: no-op and hit; once the hit action is selected, only the no-op action is available thereafter. When the ball is hit, it travels to the right, knocks the wall down (i.e., the wall disappears), bounces back to the left, hits the left edge, and bounces back to the right. If the cover is in the correct position, the ball reaches the right side, the episode ends, and the agent receives 1 reward. Otherwise, the ball falls into the pit, which ends the episode with no reward.

Imagine that the agent learns a factored model over pixels. There is a separate component model responsible for predicting each pixel; the prediction for the full image is the product of the component models' predictions. The component model at position ⟨x, y⟩ is given the pixel values in an n × m neighborhood centered at ⟨x, y⟩ for the previous two time-steps, as well as the most recent two actions.
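
To make the factorization concrete, here is a minimal Python sketch (not the paper's implementation) of how such a per-pixel context might be assembled and how the full-image prediction factors into a product of per-pixel probabilities; `component_prob` is a hypothetical stand-in for whatever per-pixel learner is used (CTW or a neural network in the experiments below).

```python
import numpy as np

def pixel_context(frames, actions, x, y, n=7, m=3, oob=2):
    """Context for the pixel at (x, y): the n-by-m neighborhood centered there
    in each of the two most recent frames, plus the two most recent actions.
    Pixels outside the image get a special 'out of bounds' value."""
    ctx = []
    for frame in frames[-2:]:                       # previous two time-steps
        height, width = frame.shape
        for dy in range(-(m // 2), m // 2 + 1):
            for dx in range(-(n // 2), n // 2 + 1):
                xx, yy = x + dx, y + dy
                if 0 <= xx < width and 0 <= yy < height:
                    ctx.append(int(frame[yy, xx]))
                else:
                    ctx.append(oob)                 # out-of-bounds color
    ctx.extend(actions[-2:])                        # most recent two actions
    return tuple(ctx)

def image_log_prob(image, frames, actions, component_prob):
    """Factored prediction: the probability of the full image is the product of
    the component models' probabilities, so the log-probability is a sum."""
    height, width = image.shape
    return sum(np.log(component_prob(pixel_context(frames, actions, x, y),
                                     image[y, x]))
               for y in range(height) for x in range(width))
```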

If n = 9 and m = 3, there is enough information to make perfectly accurate predictions, but imagine that n = 7. This simulates the common situation that the minimum set of features necessary for perfect accuracy is either unknown or computationally impractical. The consequences of the limitation are illustrated in Figure 1b. The main problem is that a pixel immediately adjacent to the ball cannot tell if the ball is next to a wall, and therefore whether the ball will bounce back. As a result, the model for this pixel is uncertain whether the pixel will be black or white.

Figure 2 shows a rollout from a perfect model (n = 9) on the left and a rollout from an imperfect model (n = 7) in the center. When the model is very accurate, rollouts are plausible futures. When the model is imperfect, the model's limitation eventually causes an error – a pixel next to the ball stochastically turns black, as if the ball were bouncing back. This creates a context for the nearby pixel models that is unlike any they have seen before. This causes new artifacts to appear in the next sample, and so on. The meaningful structure in the observations is quickly obliterated, making it impossible to tell whether the ball will eventually reach the other side.


Figure 1: a) The Mini-Golf problem. b) The pixel marked with the ? cannot be predicted perfectly with a 7 × 3 neighborhood (shaded) because it cannot tell if the ball will bounce off of a wall.

This issue can be framed as a mismatch between the model's training data and test data. Specifically, the model's training data is drawn from real world experience, but when it is actually used for planning (i.e., "tested"), the inputs to the model are samples generated from the model itself, which may be very different from real observations. In a sense, the model in this example has overfit to the real world! The approach taken in this paper to address this train/test mismatch is to mix samples from the model into the training data. The resulting method is called "hallucinated replay," in analogy to "experience replay" (Lin 1992). On the right of Figure 2 is a rollout of a model trained using hallucinated replay. Note that it makes the same initial error (a spurious black pixel still appears), but it has learned that lone black pixels should turn white. In effect, it has learned to correct for its own sampling errors.

This paper has two main goals. The first is to highlight the dramatic impact of this mismatch between real inputs and sampled inputs on model-based reinforcement learning. In several examples seemingly innocuous model limitations cause catastrophic planning failures. The second is to empirically investigate hallucinated replay as an approach to addressing this issue. Most notably, hallucinated replay will be used to learn imperfect models that can nevertheless be used for effective planning. These results indicate that this method may be an important tool for model-based reinforcement learning in problems where one cannot expect to learn a perfect model.

2 ENVIRONMENTS AND MODELS

The development of hallucinated replay will focus on the setting of multi-dimensional, discrete, deterministic dynamical systems. Time proceeds in discrete steps. At each step t, the agent selects an action a_t from a finite set of actions A. Given the action, the environment deterministically produces an observation vector o_t. Let O = O_1 × O_2 × ... × O_n be the observation space, where O_i is the finite set of possible values for component i of the observation. Note that not all o ∈ O will necessarily be reachable in the world. Let O_E ⊆ O be the subset of observations that the environment can ever produce. For example, in Mini-Golf, O would be the set of all 20 × 3 binary images, and O_E would be the set of binary images with a ball, a pit, and either a wall or no wall.

Let H be the set of all sequences a_1 o_1 ... a_k o_k for all lengths k, and let o_t = T(a_t, h_{t−1}), where h_{t−1} = a_1 o_1 ... a_{t−1} o_{t−1} is the history at time t − 1 and T is the transition function T : A × H → O_E. Note that typically not all sequences in H are reachable. Let H_E ⊆ H be the set of histories that could be produced by taking some action sequence in the environment.

At each step the environment also produces a reward value r_t and, in the episodic case, a Boolean termination signal ω_t. For simplicity's sake, these values are assumed to be contained in the observation vector o_t, so correctly predicting the next observation also involves correctly predicting the reward and whether the episode will terminate. Assume that if ω_t = True then ω_t′ = True and r_t′ = 0 for all t′ > t. The goal of an agent in this environment is to maximize the expected discounted return E[Σ_{t=1}^{∞} γ^{t−1} r_t], where γ ∈ [0, 1] is the discount factor and the expectation is over the agent's behavior (the environment is deterministic).
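
As a small worked example of this objective (a sketch, not from the paper), the discounted return of a single reward sequence can be computed directly:

```python
def discounted_return(rewards, gamma=1.0):
    """Discounted return of one trajectory: sum over t of gamma^(t-1) * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# In Mini-Golf an episode ends with reward 1 (success) or 0 (failure), e.g.:
assert discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9) == 0.9 ** 3
```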

Note that even though the environment is assumed to be deterministic, it may be too complex to be perfectly captured by a model. Thus models will, in general, be stochastic. Also note that because O_E and H_E are unknown a priori, models will be defined over the full spaces O and H instead. So, an imperfect model not only has an incorrect probability distribution over possible observations, but it may assign positive probability to impossible observations, as seen in the Mini-Golf example. Furthermore, during rollouts the model will have to make predictions given histories in H \ H_E, though no amount of experience in the environment will provide guidance about what those predictions should be.

In an abstract sense, a model M provides a conditional probability T_M(o | ha) for every observation vector o ∈ O, action a ∈ A, and history sequence h ∈ H. As assumed above, predicting the next observation also involves predicting the reward value and whether the episode will terminate. Note that the history h is an arbitrarily long and ever-expanding sequence of actions and observations. Most practical models will rely upon some finite-dimensional summary of history c(ha), which is hereafter called the model's context. For the sake of simplifying notation, let c_t = c(h_{t−1} a_t). For instance, a Markov model's context would include the most recent action and observation, and a POMDP model's context would include the belief state associated with the history.
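
For instance, a 2nd-order Markov context like the one used in the experiments might be computed as in this small sketch (an illustrative assumption, not the paper's code), where a history is represented as a list of (action, observation) pairs:

```python
def markov2_context(history, action):
    """c(ha) for a 2nd-order Markov model: the two most recent observations and
    the two most recent actions (the current action and the previous one)."""
    (_, o_older), (a_prev, o_prev) = history[-2:]
    return (o_older, o_prev, a_prev, action)
```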


Figure 2: Rollouts using a perfect model (left), a model trained with only real experience (center) and a model trained with hallucinated replay (right). Hallucinated replay makes the model robust to its own errors.

3 HALLUCINATED REPLAY

Consider learning a model using a set of training trajectories {h^1, ..., h^k}, where trajectory h^i is of length |h^i|. The training data can be seen as a set of training pairs for a supervised learning algorithm: {⟨c^i_t, o^i_t⟩ | i = 1...k, t = 1...|h^i|}, where the context c^i_t is the input and the observation o^i_t is the target output. Let M be the space of possible models, or the model class. Most model learning methods aim to find the model M* ∈ M which maximizes the log-likelihood:

M* = argmax_{M ∈ M} Σ_{i=1}^{k} Σ_{t=1}^{|h^i|} log(T_M(o^i_t | c(h^i_{t−1} a^i_t)))
   = argmax_{M ∈ M} Σ_{i=1}^{k} Σ_{t=1}^{|h^i|} log(T_M(o^i_t | c^i_t)).

If a perfect model exists in M, one for which T_M(o^i_t | c^i_t) = 1 for all i and t, then M* is a perfect model. So in this case M* is a sensible learning target for the purposes of model-based reinforcement learning.
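
The objective above can be read directly as supervised maximum-likelihood training over (context, observation) pairs. A minimal sketch follows, assuming a hypothetical `model.prob(context, observation)` interface and the `markov2_context`-style helper sketched in Section 2:

```python
import math

def log_likelihood(model, trajectories, context_fn):
    """Sum of log one-step prediction probabilities over all training pairs
    (c^i_t, o^i_t) extracted from the training trajectories."""
    total = 0.0
    for traj in trajectories:            # traj is a list of (action, observation)
        for t in range(2, len(traj)):    # start where the context is well-defined
            context = context_fn(traj[:t], traj[t][0])
            total += math.log(model.prob(context, traj[t][1]))
    return total

def best_model(model_class, trajectories, context_fn):
    """M* = argmax of the log-likelihood over a (here, finite) model class."""
    return max(model_class,
               key=lambda M: log_likelihood(M, trajectories, context_fn))
```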

If there is no perfect model in M, M* may not be a very good model for planning. Note that the log-likelihood only measures the accuracy of the model's one-step predictions given contexts produced by the environment. However, as seen above, in order to prevent errors from compounding during planning, the model must make sensible predictions in contexts the environment will never produce. As will be demonstrated empirically, a model that accomplishes this may yield better planning performance than M*, even if it incurs more prediction error (lower log-likelihood).

The high-level idea behind hallucinated replay is to train the model on contexts it will see during sample rollouts in addition to those from the world. Hallucinated replay augments the training data by randomly selecting a context c^i_t from the training set, replacing it with a "hallucinated" context ĉ^i_t that contains sampled observations from the model, and finally adding the training pair ⟨ĉ^i_t, o^i_t⟩ to the training set. The model is trained with inputs that are generated from its own predictions, but outputs from real experience.

The experiments below consider three concrete instantiations of this abstract idea, which are described below and illustrated in Figure 3. All three begin by randomly selecting a trajectory i and timestep t to select the "output" component of the new training pair.

Shallow Sample: The simplest form of hallucinated replay is to replace the observation at step t − 1 with a sample ô^i_{t−1} ∼ T_M(· | c^i_{t−1}). Then the hallucinated context is ĉ^i_t = c(h^i_{t−2} a^i_{t−1} ô^i_{t−1} a^i_t). This approach essentially imagines that there might be errors in the first sample of a rollout (the sampled observation placed at the end of history) and trains the model to behave reasonably despite those errors (i.e., to predict the true next observation).
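
A sketch of Shallow Sample pair generation, under the same assumed trajectory representation and a hypothetical `model.sample(context)` hook (illustrative, not the paper's code):

```python
import random

def shallow_sample_pair(trajectories, model, context_fn):
    """Resample only the most recent observation of the context and pair the
    resulting hallucinated context with the real target observation."""
    traj = random.choice(trajectories)              # pick trajectory i ...
    t = random.randrange(3, len(traj))              # ... and a timestep t (far
                                                    # enough in for the context)
    a_t, o_t = traj[t]                              # real target observation
    a_prev, _ = traj[t - 1]
    c_prev = context_fn(traj[:t - 1], a_prev)       # context at step t-1
    o_hal = model.sample(c_prev)                    # hallucinated observation
    hallucinated = traj[:t - 1] + [(a_prev, o_hal)]
    return context_fn(hallucinated, a_t), o_t       # (hallucinated context, o^i_t)
```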

Deep Sample: A flaw in the above approach is that, even if a rollout recovers from an error, the erroneous observation will still persist in history, possibly affecting predictions deeper in the rollout. A simple fix is to randomly select an observation in recent history (not just the most recent one) to replace with a sample. Specifically, randomly select an offset j > 0 and replace the observation at step t − j with a sample ô^i_{t−j} ∼ T_M(· | c^i_{t−j}) to generate the hallucinated context ĉ^i_t = c(h^i_{t−j−1} a^i_{t−j} ô^i_{t−j} a^i_{t−j+1} o^i_{t−j+1} ... a^i_{t−1} o^i_{t−1} a^i_t). If the model is Markovian (that is, if its context only considers the most recent observation), then these two strategies are equivalent. In the experiments, the model is 2nd-order Markov, so j is either 1 or 2.
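
Deep Sample differs only in where the resampled observation sits; a sketch under the same assumptions as above:

```python
import random

def deep_sample_pair(trajectories, model, context_fn, order=2):
    """Resample one observation at a random offset j in 1..order, keep the later
    real observations, and pair the hallucinated context with the real target."""
    traj = random.choice(trajectories)
    t = random.randrange(order + 2, len(traj))      # keep every needed context valid
    a_t, o_t = traj[t]                              # real target observation
    j = random.randint(1, order)                    # offset of resampled observation
    a_j, _ = traj[t - j]
    o_hal = model.sample(context_fn(traj[:t - j], a_j))   # hallucinated obs at t-j
    hallucinated = traj[:t - j] + [(a_j, o_hal)] + traj[t - j + 1:t]
    return context_fn(hallucinated, a_t), o_t       # (hallucinated context, o^i_t)
```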

Figure 3: Methods for producing hallucinated contexts, illustrated with a 2nd-order Markov context. Samples are shaded; dashed arrows indicate the contexts which generated the samples. Actions are not shown.

Deep Context: The above approaches have the drawback that they only introduce a single sampled observation into the context, whereas the contexts encountered by the model during a rollout will largely consist entirely of sampled observations. Even if the context only makes use of one observation, the above approaches sample that observation as if it is the first in the rollout, whereas the contexts encountered by the model will largely be sampled after a long rollout. To address this, randomly select a rollout length l and sample observations ô^i_j for j = t − l, ..., t − 1. Then the hallucinated context is ĉ^i_t = c(h^i_{t−l−1} a^i_{t−l} ô^i_{t−l} ... a^i_{t−1} ô^i_{t−1} a^i_t). In the experiments below, l > 0 is drawn with probability Pr(l) = 1/2^l. This biases the distribution toward short rollouts, since long rollouts are expected to produce very noisy observations.
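
Deep Context instead regenerates the last l observations by rolling the model forward from the real history; a sketch with the geometric choice Pr(l) = 1/2^l, under the same assumptions as the sketches above:

```python
import random

def deep_context_pair(trajectories, model, context_fn):
    """Regenerate the last l context observations with the model itself, where
    l is geometrically distributed to favor short hallucinated rollouts."""
    traj = random.choice(trajectories)
    t = random.randrange(3, len(traj))
    a_t, o_t = traj[t]                              # real target observation
    l = 1
    while random.random() < 0.5 and l < t - 1:      # Pr(l) = (1/2)^l, truncated
        l += 1
    hallucinated = list(traj[:t - l])               # real prefix of the history
    for k in range(t - l, t):                       # roll the model forward l steps
        a_k = traj[k][0]
        hallucinated.append((a_k, model.sample(context_fn(hallucinated, a_k))))
    return context_fn(hallucinated, a_t), o_t       # (hallucinated context, o^i_t)
```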

3.1 IMPORTANCE OF DETERMINISM

Note that hallucinated replay as described cannot sensibly be applied to stochastic environments. In a deterministic environment, the only reason ĉ^i_t can differ from c^i_t is that the model is imperfect. In a stochastic environment, even a perfect model could generate ĉ^i_t ≠ c^i_t by sampling a different stochastic outcome. In that case, training on the pair ⟨ĉ^i_t, o^i_t⟩ could harm the model by contradicting temporal dependencies the model may have correctly identified. One exception to this limitation is when the source of stochasticity is "measurement error": noise that is uncorrelated over time and does not affect the evolution of the underlying state. In that case, hallucinated replay would re-sample the noise to no ill effect. This work focuses on the case of stochastic models of deterministic environments (assuming the environment may be too complex to model perfectly). This is, in itself, a large and interesting class of problems. The possibility of extending the idea to stochastic environments is briefly discussed in Section 5.2.

4 EXPERIMENTS

This section contains the main results of the paper, an empirical exploration of the properties of hallucinated replay under different modeling conditions. The experiments will be performed on variations of the Mini-Golf problem described in the introduction. One of the key features of this domain is that, though it is a simple planning problem (the agent needs only to select hit at the right moment), a long rollout is necessary to sample the eventual reward value and determine whether it is better to hit or no-op. The experiments employ the simple 1-ply Monte Carlo algorithm: at each step, for each available action a, perform 50 rollouts of length at most 80 that begin with a and uniformly randomly select actions thereafter. After the rollouts, execute the action that received the highest average return. In these experiments model learning and planning are both performed online – at each step, the planner uses the current model, which is then updated with the resulting training pair. All of the replay strategies train on 5 additional pairs from past experience per step.
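
A minimal sketch of this 1-ply Monte Carlo procedure (assuming hypothetical `model.sample`, `model.reward`, and `model.terminal` hooks for sampling the next observation and reading the reward and termination components it is assumed to contain):

```python
import random

def one_ply_monte_carlo(model, context_fn, history, actions,
                        n_rollouts=50, max_len=80, gamma=1.0):
    """For each available first action, average the return of n_rollouts sample
    rollouts (uniformly random actions after the first); act greedily."""
    def rollout(first_action):
        h, a = list(history), first_action
        ret, discount = 0.0, 1.0
        for _ in range(max_len):
            obs = model.sample(context_fn(h, a))    # sampled next observation
            h.append((a, obs))
            ret += discount * model.reward(obs)     # reward is part of the observation
            discount *= gamma
            if model.terminal(obs):                 # so is the termination signal
                break
            a = random.choice(actions)
        return ret

    avg = {a: sum(rollout(a) for _ in range(n_rollouts)) / n_rollouts
           for a in actions}
    return max(avg, key=avg.get)                    # execute the best-scoring action
```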

The model of the pixels is factored, as described in the introduction. If a neighborhood contains pixels outside the boundaries of the image, they are encoded with a special "out of bounds" color. In all the experiments data is shared across all positions, providing some degree of positional invariance in the predictions. To predict the reward and termination signals r_t and ω_t (which do not have positions on the screen) the context is the pixel values at all positions in o_t and o_{t−1}, as well as the actions a_t and a_{t−1}. During sampling the pixel values are sampled before reward and termination so the sampled image can be supplied as input.

In most of the experiments below the Context Tree Weighting (CTW) algorithm (Willems et al. 1995) is used to learn the component models, similar to the FAC-CTW algorithm (Veness et al. 2011), which was also applied in a model-based reinforcement learning setting. Very briefly, CTW maintains a Bayesian posterior over all prunings of a fixed decision tree. Amongst its attractive properties are computationally efficient updates, a lack of free parameters, and guaranteed zero asymptotic regret with respect to the maximum likelihood tree in the model class. In order to apply CTW, an ordering must be selected for the variables in the context tree. In all cases the actions and observations were ordered by recency. For the pixel models, neighboring pixels within a timestep were ordered by proximity to the pixel being predicted. For the reward and termination models, pixel values were in reverse column-major order (bottom-to-top, right-to-left).
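
For concreteness, one plausible encoding of these variable orderings is sketched below; the exact interleaving of actions and pixels and the proximity metric are assumptions for illustration, not details from the paper.

```python
def pixel_context_order(n=9, m=3):
    """Context-tree variable order for a pixel model: most recent time-step
    first; within a time-step, neighborhood offsets sorted by proximity to the
    predicted pixel, preceded by that step's action."""
    offsets = sorted(((dx, dy) for dx in range(-(n // 2), n // 2 + 1)
                               for dy in range(-(m // 2), m // 2 + 1)),
                     key=lambda d: abs(d[0]) + abs(d[1]))   # assumed metric
    order = []
    for step_back in (1, 2):                        # recency: t-1 before t-2
        order.append(('action', step_back))
        order.extend(('pixel', step_back, dx, dy) for dx, dy in offsets)
    return order

def reward_context_order(width=20, height=3):
    """Reward/termination models: pixels of the two most recent frames in
    reverse column-major order (bottom-to-top, right-to-left)."""
    order = []
    for step_back in (1, 2):
        order.append(('action', step_back))
        order.extend(('pixel', step_back, x, y)
                     for x in reversed(range(width))
                     for y in reversed(range(height)))
    return order
```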


Figure 4: CTW results in Mini-Golf with small neighborhoods (Mini-Golf: Small Neighborhoods, CTW), averaged over 30 independent trials. a) Planning: average reward (30 trials) vs. episode, with a "Random" baseline; legend: Deep Sample, Shallow Sample, Deep Context, Experience Replay, No Replay. b) Prediction: average log-likelihood (30 trials) vs. episode, log scale; legend: Experience Replay, Deep Sample, Shallow Sample, No Replay, Deep Context. Curves are smoothed using a sliding average of width 50; unsmoothed data is shown in faint colors. The legends list the replay methods in decreasing order of the value of the curve at episode 200 (in both cases, higher is better). These conventions are maintained in subsequent figures.

4.1 CORRECTING SAMPLE ERRORS

The first experiment considers the setting described in the introduction, in which models are learned with 7 × 3 neighborhoods. Figure 4 shows the results of training with the three hallucinated replay variations proposed above and training with pure "experience replay" (re-training on randomly selected training pairs from past experience without modifying the context). The results of training with only the agent's experience ("No Replay") are also provided as a baseline to show the effects of replay alone.

The most dramatic result can be seen in the planning performance (Figure 4a): the models trained only with experience from the environment (regardless of whether replay was used) result in behavior that is no better than an agent that chooses actions uniformly randomly (indicated by the solid black line). As can be seen in Figure 4b, the models trained using hallucinated replay achieve no better (and in one case much worse) prediction accuracy, as measured by average per-step log-likelihood over the training data (not counting replay). Nevertheless, they all result in near perfect planning performance. This matches the intuition described above, and indeed the rollouts in Figure 2 were generated using these learned models.

It is unsurprising that the Deep Sample replay strategy would outperform the Shallow Sample strategy – it was indeed important that sampled observations appear in both positions in the context. It is somewhat surprising that Deep Sample outperformed Deep Context as well. The hypothesis that it would be important for the model to be trained on contexts sampled far into a rollout is not supported by these results. It may be that the noisy samples obtained from long rollouts, especially early in training, were misleading enough to harm performance. It is also possible that this effect is an artifact of this example. Nevertheless, these qualitative findings were consistent across the variations explored below, so, to reduce clutter, only Deep Sample results are reported in the subsequent experiments.

4.2 FULL NEIGHBORHOODS

The previous experiment is an example of hallucinated replay making meaningful planning possible when a perfect model is not contained within the model class. This is the most common case in problems of genuine interest. Nevertheless, it is relevant to ask what effect it has when a perfect model is in the class.

Figures 5a and 5b show the results in the Mini-Golf problem when the models used 9 × 3 neighborhoods (sufficient to make accurate predictions). Both the Experience Replay and No Replay models are able to support planning, and experience replay clearly improves sample efficiency. The hallucinated replay model has consistently lower prediction accuracy and lags behind the Experience Replay model in planning performance, though it eventually catches up. Noting that the hallucinated contexts in the very beginning of training are likely to be essentially meaningless, Figure 5 also shows the results of performing experience replay for the first 50 episodes, and hallucinated replay thereafter. This model's planning performance is nearly identical to that of the Experience Replay model. Once the model has had some time to learn, the sampled observations become more and more like real observations, so this similarity is not surprising.

Note that though a perfect model is in the class, the intermediate models during learning are imperfect and their sampling errors do impact planning performance. It is interesting that hallucinated replay does not aid planning performance by making the model more robust to these errors. One hypothesis might be that if the rollouts were much longer, errors would be more likely to affect planning, and hallucinated replay might have more of an impact.

Figure 5: CTW results in Mini-Golf with full neighborhoods (Mini-Golf: Full Neighborhoods, CTW), with 1 wall and with 6 walls. a) Planning (1 Wall): Deep Sample After 50, Experience Replay, Deep Sample, No Replay. b) Prediction (1 Wall): Experience Replay, Deep Sample After 50, Deep Sample, No Replay. c) Planning (6 Walls): Deep Sample After 50, Experience Replay, Deep Sample, No Replay. d) Prediction (6 Walls): Experience Replay, Deep Sample After 50, Deep Sample, No Replay.

Figures 5c and 5d show the results of a variation on Mini-Golf that is the same in every respect, except that there are 6 walls to knock down rather than 1. Thus rollouts must be much longer for successful planning. First note that the hypothesis that longer rollouts will cause more sampling errors is borne out. For a comparable level of prediction accuracy, the models achieve substantially better planning performance in the 1-wall version than the 6-wall version. Still, hallucinated replay was not able to mitigate the impact of the model errors.

The errors produced by these "models-in-training" tend to be "salt-and-pepper" noise caused by pixel models that all assign small but positive probability to the incorrect outcome. This is a relatively unstructured type of noise in comparison to that seen in the first experiment. It may simply be that more data is required to learn to correct for the noise than to simply improve accuracy. Or, since objects in this problem consist of only a few pixels, this type of noise may be too destructive to be compensated for.

4.3 A DIFFERENT MODEL CLASS

In order to validate that the benefits of hallucinated replay are robust to the choice of model class, Figure 6 shows the results of training feed-forward neural networks in the Mini-Golf domain. The same model architecture was used except here the pixel, reward, and termination models are neural networks with 10 logistic hidden units and a logistic output layer, trained using standard backpropagation (Rumelhart et al. 1986). Several values of the learning rate α were tested and the results using the best value of α for each method are presented. Figures 6a and 6b show the planning and prediction performance of the neural network models when given 7 × 3 neighborhoods. The results are very similar to those using CTW models. Figures 6c and 6d show the results using 9 × 3 neighborhoods. Here, unlike with CTW, the hallucinated replay training does not seem to cause a significant loss in performance. The neural network is likely less sensitive to the initially uninformative hallucinated samples because it is easily able to ignore irrelevant features.

Notably, waiting to start hallucinated replay hurts performance rather than helps. The reason for this can be seen in the prediction performance graphs: a large drop in prediction performance is observed when hallucinated replay begins – this is because the hidden units must adjust to a sudden change in the distribution of input vectors. This indicates that the compatibility of a model learning method with hallucinated replay may depend on its ability to gracefully handle the non-stationarity it creates in the training data. No-regret algorithms like CTW may be particularly well-suited for that reason.

Neural network models also provide an opportunity to examine another source of model error. Figure 7 shows the results (planning only) of using neural network models with too few hidden nodes (but with full neighborhoods). In Figure 7a, the networks have 5 hidden nodes. The models trained with experience replay are able to learn quickly, but level off at sub-optimal behavior. Training with hallucinated replay is slower, but planning performance ultimately surpasses that of the experience replay model. In Figure 7b, the average score of the last 100 episodes (out of 5000) is shown for various numbers of hidden nodes. The best learning rate for each method and each number of hidden nodes is used. With a small number of hidden nodes, the model is unable to make good predictions or correct for errors. With a large number, an accurate model can be learned and hallucinated replay is unnecessary. For intermediate values hallucinated replay is able to compensate somewhat for the model's representational deficiency.


Figure 6: Neural network results in Mini-Golf with 10 hidden nodes (Mini-Golf: Feed Forward Neural Networks, 10 Hidden Nodes). All methods shown use learning rate α = 2.5e−2. a) Planning (Small Neighborhoods): Deep Sample, Deep Sample After 100, Experience Replay, No Replay. b) Prediction (Small Neighborhoods): Experience Replay, Deep Sample, Deep Sample After 100, No Replay. c) Planning (Full Neighborhoods): Experience Replay, Deep Sample, Deep Sample After 100, No Replay. d) Prediction (Full Neighborhoods): Experience Replay, Deep Sample, Deep Sample After 100, No Replay. (Legend order reflects the value of each curve at episode 200.)

4.4 ADDITIONAL EXAMPLES

Finally, as additional examples of model impairments that hallucinated replay can help to overcome, consider two variations on the Mini-Golf domain.

4.4.1 DECEPTIVELY INFORMATIVE FEATURES

In the first variation, a score display is added. The observations are now 20 × 6 images. The bottom three rows of the image are blank for the most part, but at the end of an episode the 3 × 3 area in the bottom right displays a digit ("0" or a "1") to show the reward received.

As can be seen in Figure 8a, when the model is trained using only real experience this seemingly innocuous addition causes severe planning problems, even though the pixel models are given sufficient contexts to perfectly model the dynamics of the game. The problem arises because the neighborhood size is not sufficient to model the score display. When trained only on real experience, the reward and termination models rightly learn that the score display is a reliable predictor. However, during rollouts, those pixels no longer behave as they do in the world, and the reward and termination models make erratic predictions.

In contrast, the models trained with hallucinated replay learn that the score display pixels cannot be relied upon and focus instead on other relevant contextual information (such as the ball's position). This allows the models to provide meaningful predictions that result in good planning performance. Note again that hallucinated replay has not fixed the errors. The hallucinated replay models are no better at predicting what the score display will do. They are able to make good predictions about the reward and termination signals despite sample errors in the score display.

4.4.2 COMPLICATED, IRRELEVANT FEATURES

In the second variation, an irrelevant but hard-to-predict component is added to the image. Specifically, the images are 21 × 3, where the last column displays a set of "blinking lights." Only one pixel in the last column is white on any given step, but the pattern of which pixel is white is 3rd-order Markov. Since the model is 2nd-order Markov, it cannot reliably predict the light pattern. That said, the lights have no impact on the rest of the dynamics and the models have sufficient context to make perfect predictions for every other pixel. Nevertheless, as can be seen in Figure 8b, this addition once again causes the model trained only on real experience to fail when used for planning.

In this case the problem is that rollout samples of the blinking lights will not resemble those seen in the world. In the world, only one light can be on at a time, but in samples lights stochastically turn on and off, causing novel configurations. The pixel models neighboring these unfamiliar configurations have higher uncertainty as a result, and the now-familiar error compounding process ensues, corrupting the important parts of the sampled image as well as the unimportant parts. The models trained using hallucinated replay, in contrast, learn to make predictions given the distribution of light configurations that will actually be encountered during a rollout.


Figure 7: Neural network results in Mini-Golf with full neighborhoods but too few hidden nodes (Mini-Golf: Feed Forward Neural Networks, Full Neighborhoods). a) Planning (5 Hidden Nodes): average reward (30 trials) vs. episode; legend: Experience Replay (α = 5e−2), Deep Sample After 100 (α = 1e−2), Deep Sample (α = 1e−2), No Replay (α = 5e−2). b) Planning (average reward of the last 100 episodes) vs. number of hidden nodes (2–10); legend: Deep Sample After 100, Deep Sample, Experience Replay, No Replay.

5 DISCUSSION

The results have shown that even mildly limited factored models that can accurately predict nearly every aspect of the environment's dynamics can dramatically fail when used in combination with Monte-Carlo planning. Since factored models and Monte-Carlo planning are both important, successful tools in problems with large state/observation spaces, resolving this issue may be a key step toward scaling up model-based reinforcement learning. Overall, hallucinated replay was effective in its purpose: to train models to be more robust to their own sampling errors during rollouts.

When a perfect model was contained within the class (a rare occurrence in problems of genuine interest), hallucinated replay was comparable to (but slightly less effective than) pure experience replay. Though in this case the models are imperfect during training, hallucinated replay did not seem to be effective at compensating for the type of widespread, unstructured noise that resulted from transient parameter settings (as opposed to the more systematic errors resulting from class limitations in other examples).

It was also observed that when the model class is extremely limited (e.g. the neural network models with only 2 hidden nodes), hallucinated replay may be ineffective at compensating for errors because the model simply cannot learn to make meaningful predictions at all. Of course, in this case the model is ineffective no matter how it is trained. It seems that hallucinated replay is most effective when the model can capture some but not all of the environment's dynamics. When a model is performing poorly, hallucinated replay may be able to magnify the impact of improving the model's expressive power (as was seen when the neural networks were given more hidden nodes).

5.1 RELATED WORK

Other authors have noted that the most accurate model in a limited class may not be the best model for planning. Sorg, Singh & Lewis (2010) argue that when the model or planner is limited, reward functions other than the true reward may lead to better planning performance. Sorg, Lewis & Singh (2010) use policy gradient to learn a reward function for use with online planning algorithms. Joseph et al. (2013) make a similar observation regarding the dynamics model and use policy gradient to learn model parameters that lead to good planning, rather than low prediction error (demonstrating that the two do not always coincide). The results presented here add to the evidence that simply training to maximize one-step prediction accuracy may not yield the best planning results.

Siddiqi et al. (2007) addressed the problem of learning stable dynamics models of linear dynamical systems. The key idea was to project the learned dynamics matrix into the constrained space of matrices with spectral radius of at most 1, which therefore remain bounded even when multiplied with themselves arbitrarily many times. Though certainly conceptually related (their work has the same goal of learning a model whose predictions are sensible when given its own output as input), their specific approach is unlikely to apply to the setting considered here, as the concept of "stability" is not so easily quantified.

There is a long history in the supervised learning setting of adding noise to training data (or otherwise intentionally corrupting it) to obtain regularization and generalization benefits (see e.g. Matsuoka 1992, Burges & Scholkopf 1997, Maaten et al. 2013). There the issue is not typically prediction composition but more broadly the existence of inputs that the training set may not adequately cover. Without access to the true distribution over inputs, noise is typically generated via some computationally or analytically convenient distribution (e.g. Gaussian) or via a distribution that incorporates prior knowledge about the problem. In contrast, we have access to the model which generates its own inputs, and can thus sample corrupting noise from a learned, but more directly relevant distribution.

Figure 8: CTW results in two variations of Mini-Golf with full neighborhoods (described in text). a) Score Display: Planning. b) Blinking Lights: Planning. Legend (both panels): Deep Sample After 50, Deep Sample, Experience Replay.

Ross & Bagnell (2012) observed a similar train/test mismatch in the model-based reinforcement learning setting when the model is trained on a fixed batch of data. In that case, the plan generated using the model may visit states that were underrepresented in training, resulting in poor performance. Their algorithm DAgger mixes experience obtained using model-generated plans into the training data as the model learns, thereby closing the gap between training and test distributions. Roughly speaking, they prove that their approach achieves good planning performance when a model in the class has low prediction error. In the examples considered here no model in the class has low prediction error, but near perfect planning performance is nevertheless possible.

5.2 FUTURE DIRECTIONS

The results indicate that, in addition to one-step accuracy, a model's robustness to its own errors should be a key concern for model-based reinforcement learning. They also indicate that training a model on its own outputs is a promising approach to addressing this concern. These observations open the door to a number of interesting questions and issues for further study.

As discussed in Section 3, hallucinated replay is not immediately applicable in stochastic domains. While a lot of perceived stochasticity can be attributed to unmodeled (but deterministic) complexity, it would nevertheless be valuable to explore whether this approach can be applied to inherently stochastic dynamics. The main challenge in this case is forcing the model's hallucinated data to choose the same stochastic outcome that was observed in the real data. It may be that this could be accomplished by learning reverse correlations from real-world outcomes to hallucinated contexts and applying rejection sampling when generating hallucinated data.

The hallucinated replay algorithm presented in this paper is simple and offers no theoretical guarantees. It may, however, be possible to incorporate the basic idea of hallucinated replay into more principled algorithms that admit more formal analysis. For instance, it may be possible to extend the DAgger algorithm given by Ross & Bagnell (2012) using a form of hallucinated replay. This would allow the analysis to take into account the learned model's robustness (or lack thereof) to its own errors. It would also be valuable, if possible, to develop algorithms that have the benefits of hallucinated replay but are able to promise that the hallucinated data will not harm asymptotic performance if an accurate model is in the class.

Perhaps most important would be to develop a more formal understanding of the relationships between accuracy, robustness, and planning performance. In sufficiently interesting environments, model errors will be inevitable. Successfully planning despite those errors is key to applying model-based reinforcement learning to larger, more complex problems. A characterization of the properties beyond simple accuracy that make a model good for planning, and a more nuanced understanding of which types of model error are acceptable and which are catastrophic, could yield improved forms of hallucinated replay, or other model learning approaches that have good planning performance as an explicit goal rather than an implicit effect of accuracy.

6 CONCLUSIONS

Hallucinated replay trains a model using inputs drawn from its own predictions, making it more robust when its predictions are composed. This training methodology was empirically shown to substantially improve planning performance compared to using only real experience. The results suggest that hallucinated replay is most effective when combined with model-learning methods that gracefully handle non-stationarity and when the model class's limitations allow it to capture some but not all of the environment's dynamics.

Acknowledgements

Thank you to Michael Bowling, Marc Bellemare, and Joel Veness for discussions that helped shape this work. Thanks also to Joel Veness for his freely available FAC-CTW implementation (http://jveness.info/software/mc-aixi-src-1.0.zip).


References

Burges, C. J. & Scholkopf, B. (1997), Improving the accuracy and speed of support vector machines, in 'Advances in Neural Information Processing Systems (NIPS)', pp. 375–381.

Joseph, J., Geramifard, A., Roberts, J. W., How, J. P. & Roy, N. (2013), Reinforcement learning with misspecified model classes, in '2013 IEEE International Conference on Robotics and Automation (ICRA)', pp. 939–946.

Kearns, M., Mansour, Y. & Ng, A. Y. (2002), 'A sparse sampling algorithm for near-optimal planning in large Markov decision processes', Machine Learning 49(2-3), 193–208.

Kocsis, L. & Szepesvári, C. (2006), Bandit based Monte-Carlo planning, in 'Proceedings of the 17th European Conference on Machine Learning (ECML)', pp. 282–293.

Lin, L.-J. (1992), 'Self-improving reactive agents based on reinforcement learning, planning and teaching', Machine Learning 8(3-4), 293–321.

Maaten, L., Chen, M., Tyree, S. & Weinberger, K. Q. (2013), Learning with marginalized corrupted features, in 'Proceedings of the 30th International Conference on Machine Learning (ICML-13)', pp. 410–418.

Matsuoka, K. (1992), 'Noise injection into inputs in back-propagation learning', IEEE Transactions on Systems, Man and Cybernetics 22(3), 436–440.

Ross, S. & Bagnell, D. (2012), Agnostic system identification for model-based reinforcement learning, in 'Proceedings of the 29th International Conference on Machine Learning (ICML-12)', pp. 1703–1710.

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), 'Learning representations by back-propagating errors', Nature 323(9), 533–536.

Siddiqi, S. M., Boots, B. & Gordon, G. J. (2007), A constraint generation approach to learning stable linear dynamical systems, in 'Advances in Neural Information Processing Systems (NIPS)', pp. 1329–1336.

Silver, D. & Veness, J. (2010), Monte-Carlo planning in large POMDPs, in 'Advances in Neural Information Processing Systems (NIPS)', pp. 2164–2172.

Sorg, J., Lewis, R. L. & Singh, S. (2010), Reward design via online gradient ascent, in 'Advances in Neural Information Processing Systems (NIPS)', pp. 2190–2198.

Sorg, J., Singh, S. & Lewis, R. L. (2010), Internal rewards mitigate agent boundedness, in 'Proceedings of the 27th International Conference on Machine Learning (ICML-10)', pp. 1007–1014.

Veness, J., Ng, K. S., Hutter, M., Uther, W. T. B. & Silver, D. (2011), 'A Monte-Carlo AIXI approximation', Journal of Artificial Intelligence Research 40, 95–142.

Willems, F. M., Shtarkov, Y. M. & Tjalkens, T. J. (1995), 'The context-tree weighting method: Basic properties', IEEE Transactions on Information Theory 41, 653–664.