marginalizedoﬀ-policyevaluationforreinforcement learning...1...

Marginalized Off-Policy Evaluation for Reinforcement1

Learning2

Tengyang Xie∗UMass Amehrst

[email protected]

Yu-Xiang Wang∗UC Santa Barbara

[email protected]

Yifei MaAmazon AI

[email protected]

3

Abstract4

Off-policy evaluation is concerned with evaluating the performance of a policy using the his-5

torical data obtained by different behavior policies. In the real-world application of reinforcement6

learning, acting a policy can be costly and dangerous, and off-policy evaluation usually plays as7

a crucial step. Currently, the existing methods for off-policy evaluation are mainly based on the8

Markov decision process (MDP) model of discrete tree MDP, and they suffer from high variance9

due to the cumulative product of importance weights. In this paper, we propose a new off-policy10

evaluation approach directly based on the discrete digraph MDP. Our approach can be applied11

to most of the estimators of off-policy evaluation without modification and could reduce the12

variance dramatically. We also provide a theoretical analysis of our approach and evaluate it by13

empirical results.14

1 Introduction15

The problem of off-policy evaluation (OPE), which predicts the performance of policy with data only16

sampled by a behavior policy [Sutton and Barto, 1998], is crucial for using reinforcement learning17

(RL) algorithms responsibly in many real-world applications. For the setting that RL algorithm18

has been already deployed, e.g., applying RL for improving the targeting of advertisements and19

marketing [Bottou et al., 2013; Tang et al., 2013; Chapelle et al., 2015; Theocharous et al., 2015;20

Thomas et al., 2017] or for the medical treatment [Murphy et al., 2001; Raghu et al., 2017], online21

estimation for the policy value is usually expensive, risky, or even illegal. Also, using a bad policy in22

these applications is dangerous and could lead to severe consequences. Thus, the problem of OPE is23

essential for the practical application of RL.24

To tackle the problem of OPE, the idea of importance sampling (IS) corrects the mismatch in25

the distributions under the behavior policy and estimated policy. It also provides typically unbiased26

or strongly consistent estimators [Precup et al., 2000]. IS-based off-policy evaluation methods have27

also seen a lot of interest recently especially for short-horizon problems, including contextual bandits28

[Murphy et al., 2001; Hirano et al., 2003; Dudík et al., 2011; Wang et al., 2017]. However, the29

variance of IS based approaches [Precup et al., 2000; Thomas et al., 2015; Jiang and Li, 2016; Thomas30

and Brunskill, 2016; Guo et al., 2017; Farajtabar et al., 2018] tends to be too high to be useful31

[Mandel et al., 2014] for long-horizon problems, because the variance of the cumulative product of32

importance weights is exploding exponentially as the horizon goes long.33

Given this high-variance issue, it is necessary to find a IS-based approach without relying heavily34

on the cumulative product of importance weights. In this paper, we propose a marginalized method35

∗Most of this work performed at Amazon AI.

1

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

r1

r2 r3

ρ1

ρ2 ρ3

(a) Tree MDP v(π) = E[∏t

t′=0 ρ′t(π)rt]

t

s1

s2

s3

s1

s2

s3

Rtwt

ρt

(b) Marginal MDP v(π) =∑

t E[wt(π)ρt(π)Rt]

Figure 1: Tree MDP vs Marginal MDP. The Marginal MDP avoids multiplicative factors.

for off-policy evaluation based on the discrete graph MDP, while previous methods are all based on36

the discrete tree MDP [Jiang and Li, 2016]. Given the key difference of the graph MDP from the tree37

MDP, where one state can be possibly visited more than once, we use the model of marginal MDP38

for the graph MDP to avoid the cumulative product of importance weights. Figure 1 illustrates the39

difference between tree MDP and marginal MDP.40

Our approach can be used in any OPE estimators that leverage IS based estimators to become41

marginalized OPE estimators, such as doubly robust [Jiang and Li, 2016], MAGIC [Thomas and42

Brunskill, 2016], MRDR [Farajtabar et al., 2018]. The marginalized OPE estimators work in the43

space of possible states and actions, instead of the space of trajectories, resulting in a significant44

potential for variance reduction. We also evaluate a number of marginalized OPE estimators against45

to their original version empirically. The experimental results demonstrate the effectiveness of the46

marginalized OPE estimators across a number of problems.47

Here is a roadmap for the rest of the paper. Section 2 provides the preliminaries of the problem48

of off-policy evaluation. In Section 3, we summary the previous importance-sampling based off-policy49

estimators, and we offer a new framework for the marginalized estimator. We provide the empirical50

results in Section 3, and more detailed preliminary theoretical analysis in Section 5. At last, Section51

7 provides the conclusion and future work.52

2 Preliminaries53

We consider the problem of off-policy evaluation for MDP, which is a tuple defined by M = (S, A,54

T , R, γ), where S is the state space, A is the action space, T : S ×A× S → [0, 1] is the transition55

function with T (s′ | s, a) defined by probability of achieving state s′ after taking action a in state56

s, R : S ×A → R is the expected reward function with R(s, a) defined by the mean of immediate57

received reward after taking action a in state s, and γ ∈ [0, 1] is the discount factor.58

Let |S|, |A| be the cardinality of S and A and h be the time horizon. We denote Tt ∈ R|S|×|S|×|A|59

the transition matrix at time t, Tt[s, s′, a] denotes the probability of transitioning from s into s′60

when action a is taken at time t. When the MDP is time-invariant, let T = Tt,∀ 0 ≤ t ≤ H − 1.61

We use P[E] to denote the probability of an event E and p(x) denotes p.m.f. (or pdf) of the62

random variable X taking value x. Let µ, π : S → PA be policies which output a distribution of63

actions given an observed state. For notation convenience we denote µ(at|st) and π(at|st) the p.m.f64

of actions given state at time t. Moreover, we denote dµt (st) and dπt (st) the induced state distribution65

at time t. They are functions of not just the policies themselves but also the unknown underlying66

transition dynamics.67

We call µ the logging policy (a.k.a. behavioral policy) and π the target policy. µ is used to68

collect data in forms of (sit, ait, rit) ∈ S × A× R for time index t = 0, . . . ,H − 1 and episode index69

2

i = 1, ..., n. π is the target policy that we are interested to evaluate. Also, let D to denote the70

historical data, which contains n trajectories in total.71

The problem of off-policy evaluation is about finding an estimator v : (S ×A×R)H×n → R that72

makes use of the data collected by running µ to estimate73

v(π) = Eπ

[H−1∑t=0

γtrt(st, at)

].

Note that we assume that µ(a|s) and π(a|s) for all (s, a) ∈ S ×A are known to us and can be74

used in the estimator but we do not observe the state-transition distributions therefore do not have75

µ(st), π(st). The corresponding minimax risk for square error is76

R(π, µ, T, rmax, σ2) = inf

vsup{

r(s,a)∈S×A→PR s.t. ∀s∈S,a∈AE[r(s,a)]≤rmax,Var[r(s,a)]≤σ2

}E[(v(π)− v(π))2]

Different from previous studies, we focus on the case where S is sufficiently small but S2A is too77

large for a reasonable sample size. In other word, this is a setting where we do not have enough data78

points to estimate the state-transition dynamics, but we do observe the states and can estimate the79

distribution of the states.80

3 Marginalized Estimators for OPE81

In this section, we show some example of popular IS based estimators for the problem of OPE and82

provide the reason for its difficulties. After that, we discuss the way to get around that difficulties83

and present our new design for the OPE estimators.84

3.1 Generic IS-Based Estimators Setup85

The IS based estimators usually provide an biased or consistent estimate of the value of target policy86

π [Thomas, 2015]. We first provide a generic framework of IS-based estimators, and analysis the87

similarity and difference between different IS-based estimators. This framework could give us insight88

into the design of IS-based estimators, and is useful to understand the limitation of them.89

Let ρit :=π(ait|sit)µ(ait|sit)

be the importance ratio at time step t of i-th trajectory, and ρi0:t :=∏tt′=0

π(a′it |s′it )µ(a′it |s′it )

90

be the cumulative importance ratio for the i-th trajectory. The generic framework of IS-based91

estimators can be expressed as follows92

v(π) =1

n

n∑i=1

g(si0) +n∑i=1

H−1∑t=0

ρi0:tφt(ρ1:n0:t )

γt(rit + ft(sit, a

it, s

it+1)), (3.1)

where φt : Rn+ → R+ are the “normalization” functions for ρi0:t, g : S → R and ft : S ×A× S → R93

are the “value-related” functions. Note Eft = 0. For the unbiased IS-based estimators, it usually has94

φt(ρ1:n0:t ) = n, and we first observe that the importance sampling (IS) estimator [Precup et al., 2000]95

falls in this framework using:96

(IS) : g(si0) = 0; φt(ρ1:n0:t ) = n; ft(s

it, a

it, s

it+1) = 0.

For the doubly tobust (DR) estimator [Jiang and Li, 2016], the normalization function and value-97

related functions are:98

(DR) : g(si0) = V π(s0); φt(ρ1:n0:t ) = n; ft(s

it, a

it, s

it+1) = −Qπ(sit, ait) + γV π(sit+1).

3

Self-normalized estimators such as weight importance sampling (WIS) and weighted doubly99

robust (WDR) estimators [Thomas and Brunskill, 2016] are popular consistent estimators to achieve100

better bias-variance trade-off. The critical difference of consistent self-normalized estimators is to101

use∑n

j=1 ρj0:t as normalization function φt rather than n. Thus, the WIS estimator is using the102

following normalization and value-related functions:103

(WIS) : g(si0) = 0; φt(ρ1:n0:t ) =

n∑j=1

ρj0:t; ft(sit, a

it, s

it+1) = 0,

and the WDR estimator:104

(WDR) : g(si0) = V π(s0); φt(ρ1:n0:t ) =

n∑j=1

ρj0:t; ft(sit, a

it, s

it+1) = −Qπ(sit, ait) + γV π(sit+1).

Note that, the DR estimator reduced the variance from the stochasticity of action by using the105

technique of control variants in value-related function, and the WDR estimators reducing variance106

by the bias-variance trade-off using self-normalization. However, it could still suffer a large variance,107

because the cumulative importance ratio ρi0:t always appear directly in this framework, and it makes108

the variance increase exponentially with the horizon goes long. This kind of high-variance issue is109

inherited from the hardness of OPE problem for discrete tree MDP which is defined as follows:110

Definition 1 (Discrete Tree MDP). If an MDP satisfies:111

• The state is represented by history, i.e., st = ht := o1a1 . . . ot−1at−1ot, where oi is the observa-112

tion at step i(1 ≤ i ≤ t).113

• The observations and actions are discrete.114

• The initial state takes the form of s0 = o0. After taking action a at state s = h, the next can115

be only expressed in the form of s′ = hao, with probability P(o|h, a).116

Then, this MDP is a discrete tree MDP.117

It can be proved that the variance of any unbiased OPE estimator for discrete tree MDP can be118

lower bounded by∑H−1

t=0 E[ρ20:t−1Vt[V (st)]

][Jiang and Li, 2016]. However, the condition of discrete119

tree MDP can be related to discrete directed acyclic graph (DAG) MDP, which could reduce the120

lower bound of variance dramatically.121

Definition 2 (Discrete DAG MDP). If an MDP satisfies:122

• The state space and action space are finite.123

• Each state can only occur at a particular time step.124

Then, this MDP is a discrete DAG MDP.125

The lower variance bound of any unbiased OPE estimator under discrete DAG MDP is126 ∑H−1t=0 E

[(dπt (st−1)π(at−1|st−1))2

(dµt (st−1)µ(at−1|st−1))2Vt[V (st)]

][Jiang and Li, 2016]. In this paper, we will mainly de-127

sign OPE estimators based on this relaxed case.128

4

3.2 Design of Marginalized Estimators129

Based on the analysis in the last section, the critical part of the variance of estimators fall in the130

framework (3.1) is the cumulative importance ratio ρi0:t. Note that, this ratio is used for re-weighting131

the probability of the trajectory τ0:t, i.e.,132

v(π) =

H−1∑t=0

Eτ∼π[rt] =H−1∑t=0

Eτ∼µ[ρ0:trt].

Let dπt (s) := P[st = s|π], dµt (s) := P[st = s|µ] be the marginal distribution of state s at time step t,133

then given the conditional independence following from the Markov property; define134

wt(st) :=dπt (st)

dµt (st)⇒ v(π) =

H−1∑t=0

Eτ∼π[rt] =H−1∑t=0

Est∼π[rt] =H−1∑t=0

Est∼µ[wt(st)rt]

For the moment let us assume that we have an unbiased or consistent estimator of wt(s) =135

dπt (s)/dµt (s) called wnt (s). Designing such an estimator is the main technical contribution of the136

paper but we will start by assuming we have such an estimator as a black box and use it to construct137

off-policy evaluation methods. Thus, we obtain a generic framework of marginalized IS-based138

estimators as:139

vM (π) =1

n

n∑i=1

g(si0) +1

n

n∑i=1

H−1∑t=0

wnt (sit)ρ

itγt(rit + ft(s

it, a

it, s

it+1)). (3.2)

Note that the “normalization” function φ has not appeared in the framework above is because it can140

be a part in wt(s).141

We first show the equivalence between framework (3.1) with φt(ρ1:n0:t ) = n and (3.2) in expectation142

if wnt (s) is exactly wt(s).143

Lemma 1. If φt(ρ1:n0:t ) = n in framework (3.1) and wnt (s) = wt(s) in framework (3.2), then these144

two frameworks are equal in expectation, i.e.,145

vIS(π) = E[wt(s

it)ρ

it(r

it + ft(s

it, a

it, s

it+1))

]= E

[ρi0:t(r

it + ft(s

it, a

it, s

it+1))

]holds for all i and t.146

Full proof of Lemma 1 can be found in the appendix.147

We now show that if we have that unbiased or consistent wnt , all the OPE estimator that148

leverage IS-based estimators can use wnt directly by replacing∏t−1t′=0

π(at′ |st′ )µ(at′ |st′ )

with wnt (st), and keep149

unbiasedness or consistency.150

Theorem 1. Let φt(ρ1:n0:t ) = n in framework (3.1), then framework (3.2) could keep the unbiasedness151

and consistency same as in framework (3.1) if wnt (s) is a unbiased or consistent estimator for152

marginalized ratio wt(s) for all t:153

1. If an unbiased estimator falls in framework (3.1), then its marginalized estimator in framework154

(3.2) is also a unbiased estimator of v(π) given unbiased estimator wnt (s) for all t.155

2. If a consistent estimator falls in framework (3.1), then its marginalized estimator in framework156

(3.2) is also a consistent estimator of v(π) given consistent estimator wnt (s) for all t.157

5

Full proof of Theorem 1 can be found in the appendix.158

Next, we propose a new marginalized IS estimator to further improve the data efficiency and159

reduce variance. Since DR only reduces the variance from the stochasticity of action [Jiang and160

Li, 2016] and our marginalized estimator (3.2) reduce the variance from the cumulative importance161

weights, it is also possible to reduce the variance the stochasticity of reward function.162

Based on the definition of MDP, we know that rt is the random variable that only determined163

by st, at. Thus, if R(s, a) is a unbiased and consistent estimator for R(s, a), rit can be in framework164

(3.2) can be replaced by that R(sit, ait), and keep unbiasedness or consistency same as using rit.165

Note that we can use a unbiased and consistent Monte-Carlo based estimator166

R(st, at) =

∑ni=1 r

(i)t 1(s

(i)t = st, a

(i)t = at)∑n

i=1 1(s(i)t = st, a

(i)t = at)

,

and then we obtain a better marginalized framework167

vBM (π) =1

n

n∑i=1

g(si0) +1

n

n∑i=1

H−1∑t=0

wnt (sit)ρ

itγt(R(sit, a

it) + ft(s

it, a

it, s

it+1)). (3.3)

Remark 1. Note that, the only difference between (3.2) and (3.3) is rit and R(sit, ait). Thus, the168

unbiasedness or consistency of (3.3) can be obtained similarly by following Theorem 1 and its proof.169

One interesting observation is that when each (st, at)-pair is observed only once in n iterations,170

then framework (3.3) reduces to (3.2). Note that when this happens, we could still potentially171

estimate wnt (st) well if A is large but S is relative small, in which case we can still afford to observe172

each potential values of st many times.173

3.3 Estimating wt(s)174

We now present our estimators wt(s). Note that wt(s) is defined by the marginalized ratio dπt (s)/dµt (s)175

at time step t. Considering the cumulative importance ratio is actually calculating the ratio for the176

probability of the whole episode, i.e.,177

t∏t′=0

π(ait′ |sit′)µ(ait′ |sit′)

=P(si0, ai0, . . . , sit, ait|π)P(si0, ai0, . . . , sit, ait|µ)

,

for all i and t, thus, we can obtain the unbiased and consistent Monte-Carlo based estimator for178

wt(s) as179

wnt (s) =

∑ni=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )1(sit = s)∑n

i=1 1(sit = s)

. (3.4)

We then show the unbiasedness of wnt in the next theorem.180

Theorem 2. The estimator of wnt (s) in (3.4) is unbiased estimator.181


Self-normalization is a popular strategy in the problem of off-policy evaluation, since the183

cumulative product of important weight∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

is usually distributed with heavy upper tails184

6

[Thomas et al., 2015]. By following this insight of self-normalization, we can obtain the following185

consistent estimator of wt(s):186

wnt (s) =

∑ni=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )1(sit = s)

1n

∑ni=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

∑nj=1 1(s

jt = s)

=n∑n

i=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

wnt (s). (3.5)

It is clear intuitively that this estimator is consistent since E[∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )] = 1, so using Slutsky’s187

theorem [Slutsky, 1925] we know (3.5) convergences to the same result as (3.4).188

Given the unbiasedness or consistency of the estimator (3.4) and (3.5), we can obtain unbiased189

or consistent marginalized estimators for OPE based on the framework (3.2) and Theorem 1.190

4 Experiments191

Throughout this section, we present the empirical results to illustrate the comparison among different192

estimators. We demonstrate the effectiveness of the proposed marginalized approaches by comparing193

the classical estimator with their marginalized version on several different domains and their variants.194

The estimators we compared are as follows:195

1. IS: importance sampling196

2. WIS: weighted importance sampling197

3. DR: doubly robust198

4. WDR: weighted doubly robust199

5. MIS: marginalized importance sampling (importance sampling with wnt in (3.4))200

6. W-MIS: marginalized importance sampling (importance sampling with wnt in (3.5))201

7. MDR: marginalized doubly robust (doubly robust with wnt in (3.4))202

8. W-MDR: marginalized doubly robust (doubly robust with wnt in (3.5))203

S1S2 S3!1!2

!1!2

!1

!2

p=0.4 p=0.6

p=0.6 p=0.4

r=1 r=-1

r=1 r=-1

(a) ModelWin MDP

S1S2 S3!1!2

!1!2

!1

!2

p=1 p=0

p=0 p=1

r=1 r=-1

(b) ModelFail MDP

Figure 2: Domains from Thomas and Brunskill [2016]

We first provide results on the variants of standard domains, ModelWin and ModelFail, which204

are first introduced by Thomas and Brunskill [2016]. In the ModelWin domain (as showed in Figure205

7

2(a)), the agent always begins in s1, where it must select between two actions. The first action,206

a1, causes the agent to transition to s2 with probability p and s3 with probability 1 − p. The207

second action, a2, does the opposite: the agent transitions to s2 with probability 1− p and s3 with208

probability p. We set p = 0.4. If the agent transitions to s2, then it receives a reward of 1, and if209

it transitions to s3 it receives a reward of −1. In states s2 and s3, the agent also has two possible210

actions a1 and a2, but both always produce a reward of zero and a deterministic transition back to211

s1. In our ModelFail Domain (as showed in Figure 2(b)), we set the state transition is same as the212

ModelWin domain, with p = 1, but we delay the reward receiving – the agent receives a reward of213

1 when it arrive s1 from s2, and receives a reward of −1 when it arrive s1 from s3. The discount214

factor γ = 1. The evaluated policy π is selecting action a1 and a2 with probabilities 0.2 and 0.8215

respectively everywhere for both of these domains, and behavior policy µ is uniform policy. To show216

the effectiveness of our approach in the long-horizon domain, we set the horizon H of these two217

domains to vary among {8, 16, 32, 64, 128, 256, 512, 1024}. Note that the sacle of v(π) is in direct218

proportion to H, so we provide results for both RMSE and relative RMSE where relative RMSE219

is defined by the ratio between RMSE and v(π). The results on the Modelwin domain is in Table220

1 (RMSE) and Table 2 (Relative RMSE), and the results on the Modelwin domain is in Table 3221

(RMSE) and Table 4 (Relative RMSE). All results are from 1024 runs.222

Horizon IS WIS DR WDR MIS W-MIS MDR W-MDR8 0.152 0.149 0.425 0.425 0.079 0.075 0.395 0.39416 0.510 0.415 1.093 0.985 0.142 0.104 0.804 0.80232 3.394 1.187 5.099 2.160 1.551 0.205 2.269 1.64664 14.937 2.714 9.280 4.442 5.221 0.205 5.832 3.312128 19.651 5.418 34.416 8.229 6.576 0.291 5.978 6.569256 48.638 11.029 22.593 15.546 30.472 0.412 18.642 13.228512 46.670 22.443 38.884 29.287 33.821 0.580 20.575 26.4971024 66.632 46.402 49.578 58.071 58.340 0.806 15.552 52.931

Table 1: RMSE on the ModelWin Domain


Table 2: Relative RMSE on the ModelWin Domain.

To demonstrate the scalability of the proposed marginalized approaches, we also test all estimators223

in the Mountain Car domain [Singh and Sutton, 1996], with the horizon of H = 100, the initial state224

distributed uniformly randomly, and same state aggregations as Jiang and Li [2016]. To construct225

8


Table 3: RMSE on the ModelFail Domain


Table 4: Relative RMSE on the ModelFail Domain

the stochastic behavior policy µ and stochastic evaluated policy π, we first compute the optimal226

Q-function using Q-learning, and use its softmax policy of the optimal Q-function as evaluated227

policy π (with temperature of 1). For the behavior policy µ, we also use the softmax policy of the228

optimal Q-function, but we set the temperature as 0.75. The results on the Mountain Car domain is229

in Table 5, and all results are from 1024 runs.230

5 Theoretical Analysis231

We provide the preliminary theoretical analysis in this section. We use the high-confidence bound232

to measure properties of wnt (s) and wnt (s). We first show the high-confidence bound of unbiased233

estimator wnt (s) in (3.4).234

Theorem 3. If there exists an η > 0 such that µ(a|s) ≥ ηπ(a|s) for every s and a, then with235

probability at least 1− δ, we have236

|wnt (s)− wt(s)| ≤1

ηt

√√√√ log(4/δ)√2n

2(ndµt (s)

√2n−

√log(2/δ)

) ,for wnt in (3.4).237


9

Data Size IS WIS DR WDR MIS W-MIS MDR W-MDR16 534.51 24.55 724.48 22.81 533.67 5.39 695.25 27.8532 398.98 24.63 504.99 23.81 398.84 4.12 497.50 24.4864 335.24 24.63 404.75 23.42 335.21 3.06 400.46 20.2128 307.73 24.63 357.06 21.93 307.63 2.40 344.29 15.75256 275.34 24.63 260.40 18.93 275.19 2.04 252.05 11.00512 256.75 24.63 210.47 13.91 256.63 1.80 205.58 9.341024 249.97 24.63 166.67 6.71 249.94 1.70 163.30 11.622048 246.08 24.63 108.34 6.13 246.07 1.64 106.86 17.46

Table 5: RMSE on the Mountain Car Domain

The following theorem provides the high-confidence bound of estimator wnt (s) in (3.5), which239

illustrate the consistency of estimator wnt (s).240

Theorem 4. If there exists an η > 0 such that µ(a|s) ≥ ηπ(a|s) for every s and a, then with241

probability at least 1− δ, we have242

|wnt (s)− wt(s)|

≤

n√2n

ηtn√2n−

√log(6/δ)

√√√√ log(6/δ)√2n

2(ndµt (s)

√2n−

√log(3/δ)

) +wt(s)

√log(6/δ)

ηtn√2n−

√log(6/δ)

∧(wt(s) ∨

( √2n

dµt (s)√2n−

√log(2/δ)− wt(s ∈ St)

− wt(s)

)),

for wnt in (3.5).243


Note that the results in this section is preliminary. Although we are able to obtain the high-245

confidence bounds of wnt (s) and wnt (s) from Theorem 3 and Theorem 4, those high-confidence bound246

do not perfectly matching the experimental results. The bound in Theorem 3 can be tighter than the247

bound in Theorem 4. However, the experimental results in Section 4 shows that estimators with wnt (s)248

always outperform those with wnt (s). Intuitively, the reason is the correlation in the self-normalization,249

i.e.,∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

in both numerator and denominator of wnt (s), but the high-confidence bound is250

not able to fully express that correlation.251

6 Theoretical Analysis for Recursive Estimator252

For the moment, assume µ(st) is known.253

Let254

π(st) =π(st) + ε(st)

π(st) =π(st) + ε(st)

1 +∑ε(st)

,

10

where ε(st) satisfies ε(st) ≥ −π(st) and∑ε(st) ≥ −1. Then,255

π(st+1) =1

n

n∑i=1

π(sit)

µ(sit)

π(ait|sit)µ(ait|sit)

1(sit+1 = st+1)

=1

1 +∑ε(st)

[1

n

n∑i=1

π(sit)

µ(sit)


1(sit+1 = st+1) +1

n

n∑i=1

ε(sit)

µ(sit)


1(sit+1 = st+1)

]

=1

1 +∑ε(st)

[π(st+1) +O

(√Eµ(ρ2t )√n

)+∑st

1

n

n∑i=1

π(ait|st)µ(ait|st)

1(sit+1 = st+1, sit = st)︸︷︷︸

=Pπ(st+1|st)µst+O(√

Eµ[ρ2t,a]√n

)ε(st)

µ(st)

]

=1

1 +∑ε(st)

π(st+1) +∑st

Pπ(st+1|st)ε(st) +O

√Eµ[ρ2t ]√n

+

√maxst Eµ[ρ2t,a]√n

ε(st)

.

Note that, the difference between normalized and unnormalized version is whether there is a 11+∑ε(st)

256

term.257

We now discuss the ε(st) term for the unnormalized case.258

ε(st+1) =∑st


√Eµ[ρ2t ]√n

+

√maxst Eµ[ρ2t,a|st]

√n

∑st

ε(st)

|ε(st+1)| ≤

∑st

Pπ(st+1|st)|ε(st)|+O

√Eµ[ρ2t ]√n

+

√maxst Eµ[ρ2t,a|st]

√n

∑st

|ε(st)|

.

Take summation over st+1 on both sides, we have the following results for the unnormalized version,259

∑st+1

|ε(st+1)| ≤∑st

|ε(st)|+O

|S|√Eµ[ρ2t ]√n

+|S|√maxst Eµ[ρ2t,a|st]√n

∑st

|ε(st)|

⇒∑st+1

|ε(st+1)| ≤

1 +O

|S|√maxst Eµ[ρ2t,a|st]√n

∑st

|ε(st)|+√Eµ[ρ2t ]√n

.

Thus, if∑

st|ε(st)| ≤ 1, the error grows linearly; and if

∑st|ε(st)| > 1, the error grows exponentially260

but with a small factor. Since that, we can handle horizon up to H =√n/√E[ρ2t ].261

11

For the case with normalization, ε(st+1) becomes different262

ε(st+1) =− π(st+1) +1

n

n∑i=1

π(sit)

µ(sit)


1(sit+1 = st+1)

=− π(st+1) +1

1 +∑ε(st)

(π(st+1) +O

(√E[ρ2t ]√n

))

+1

1 +∑ε(st)

∑st


√Eµ[ρ2t,a]√n

|ε(st)|

=

(−1 + 1

1 +∑ε(st)

)π(st+1) +

∑stPπ(st+1|st)ε(st)1 +

∑ε(st)

+1

1 +∑ε(st)

O

√Eµ[ρ2t ]√n

+

√Eµ[ρ2t,a]√n

|ε(st)|

.

Let’s first calculate∑

st+1ε(st+1)263

∑st+1

ε(st+1) = −∑ε(st)

1 +∑ε(st)

+

∑ε(st)

1 +∑ε(st)

+O

|St+1|(√

Eµ[ρ2t ] +√Eµ[ρ2t,a]|ε(st)|

)√n(1 +

∑ε(st))

.

Thus, when |ε(st)| ≤ 1, it takes264

H �√n

|S|(√


)before

∑st+1

ε(st+1) could reach −0.5.265

Now266

|ε(st+1)| ≤|ε(st)|

1 +∑ε(st)

+|St+1||

∑stε(st)|

1 +∑ε(st)

+O

|St+1|(√


)√n(1 +

∑ε(st))

.

Collecting the bounds we get:267

|ε(st+1)| ≤

1 +|St+1|

√Eµ[ρ2t,a]√n

1 +∑ε(st))

|ε(st)|+ |St+1|2√Eµ[ρ2t ]√

n(1 +∑ε(st))2

+|St+1|2

√Eµ[ρ2t,a]

√n(1 +

∑ε(st)))2

|ε(st)|+|St+1|

√Eµ[ρ2t ]√

n(1 +∑ε(st))

This is quantitatively the same as the version without normalization. But it strictly reduces the268

error if (1 +∑ε(st)) > 1.269

Thus our idea is: choose unnormalized version if (1 +∑ε(st)) < 1, otherwise choose the270

normalized version. Note that (1 +∑ε(st)) < 1 ⇐⇒

∑π(st) < 1.271

12

7 Conclusions and Future Work272

In this paper, we proposed the marginalized approach to the problem of off-policy evaluation in273

reinforcement learning. This is the first approach based on the model of DAG MDP. It achieves274

substantially better performance than existing approaches. The theoretical analysis of the marginal-275

ized approach is still an open problem, and we will extend our theoretical results to fully explain all276

these marginalized approaches.277

References278

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E.,279

Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The280

example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–281

3260.282

Chapelle, O., Manavoglu, E., and Rosales, R. (2015). Simple and scalable response prediction for283

display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61.284

Dudík, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In285

Proceedings of the 28th International Conference on International Conference on Machine Learning,286

pages 1097–1104. Omnipress.287

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. (2018). More robust doubly robust off-policy288

evaluation. In Proceedings of the 35th International Conference on Machine Learning, volume 80289

of Proceedings of Machine Learning Research, pages 1447–1456, Stockholmsmässan, Stockholm290

Sweden. PMLR.291

Guo, Z., Thomas, P. S., and Brunskill, E. (2017). Using options and covariance testing for long292

horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages293

2492–2501.294

Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects295

using the estimated propensity score. Econometrica, 71(4):1161–1189.296

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the297

American statistical association, 58(301):13–30.298

Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning.299

In Proceedings of the 33rd International Conference on International Conference on Machine300

Learning-Volume 48, pages 652–661. JMLR. org.301

Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., and Popovic, Z. (2014). Offline policy evaluation302

across representations with applications to educational games. In Proceedings of the 2014 interna-303

tional conference on Autonomous agents and multi-agent systems, pages 1077–1084. International304

Foundation for Autonomous Agents and Multiagent Systems.305

Murphy, S. A., van der Laan, M. J., Robins, J. M., and Group, C. P. P. R. (2001). Marginal mean306

models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423.307

Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation.308

In Proceedings of the Seventeenth International Conference on Machine Learning, pages 759–766.309

Morgan Kaufmann Publishers Inc.310

13

Raghu, A., Komorowski, M., Celi, L. A., Szolovits, P., and Ghassemi, M. (2017). Continuous311

state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In312

Machine Learning for Healthcare Conference, pages 147–163.313

Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces.314

Machine learning, 22(1-3):123–158.315

Slutsky, E. (1925). Uber stochastische asymptoten und grenzwerte. Metron, 5(3):3–89.316

Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT317

press Cambridge.318

Tang, L., Rosales, R., Singh, A., and Agarwal, D. (2013). Automatic ad format selection via319

contextual bandits. In Proceedings of the 22nd ACM international conference on Conference on320

information & knowledge management, pages 1587–1594. ACM.321

Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. (2015). Personalized ad recommendation322

systems for life-time value optimization with guarantees. In IJCAI, pages 1806–1812.323

Thomas, P. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement324

learning. In International Conference on Machine Learning, pages 2139–2148.325

Thomas, P. S. (2015). Safe reinforcement learning. PhD thesis, University of Massachusetts Amherst.326

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. (2015). High-confidence off-policy evaluation.327

In AAAI, pages 3000–3006.328

Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. (2017). Predictive329

off-policy policy evaluation for nonstationary decision problems, with applications to digital330

marketing. In AAAI, pages 4740–4745.331

Wang, Y.-X., Agarwal, A., and Dudık, M. (2017). Optimal and adaptive off-policy evaluation in332

contextual bandits. In International Conference on Machine Learning, pages 3589–3597.333

14

Appendix334

A Proof Details335

In this section, we provide the full proofs for our lemmas and theorems. We first provide the proof336

of lemma 1.337

Proof of Lemma 1. Given the conditional independence in the Markov property, we have338

E[ρi0:t(r

it + ft(s

it, a

it, s

it+1))

]=E

[E[ρi0:t(r

it + ft(s

it, a

it, s

it+1))|sit

]]=E

[E[ρi0:t−1|sit

]E[ρit(r

it + ft(s

it, a

it, s

it+1))|sit

]]=E

[wt(s

it)E[ρit(r

it + ft(s

it, a

it, s

it+1))|sit

]]=E

[wt(s

it)ρ

it(r

it + ft(s

it, a

it, s

it+1))

],

where the first equation follows from the law of total expectation, the second equation follows from339

the conditional independence from the Markov property. This completes the proof.340

We then provide the proof of theorem 1.341

Proof of Theorem 1. We first provide the proof of the first part of unbiasedness. Given E[wnt (s)|s] =342

wt(s) for all t, then343

E[wnt (s

it)ρ

itγt(rit + ft(s

it, a

it, s

it+1))

]=E

[E[wnt (s

it)ρ

itγt(rit + ft(s

it, a

it, s

it+1))|sit

]]=E

[E[wnt (s

it)|sit

]E[ρitγ

t(rit + ft(sit, a

it, s

it+1))|sit

]]=E

[wt(s

it)E[ρitγ

t(rit + ft(sit, a

it, s

it+1))|sit

]]=E

[wt(s

it)ρ

it(r

it + ft(s

it, a

it, s

it+1))

]=E

[ρi0:t(r

it + ft(s

it, a

it, s

it+1))

], (A.1)

where the the first equation follows from the law of total expectation, the second equation follows344

from the conditional independence of the Markov property, the last equation follows from Lemma345

1. Since the original estimator falls in framework (3.1) is unbiased, summing (A.1) over i and t346

completes the proof of the first part.347

We now prove the second part of consistency. Since we have348

plimn→∞

1

n

n∑i=1

H−1∑t=0

wnt (sit)ρ

itγt(rit + ft(s

it, a

it, s

it+1)) =

H−1∑t=0

γtplimn→∞

1

n

n∑i=1

wnt (sit)ρ

it(r

it + ft(s

it, a

it, s

it+1)),

then, to prove the consistency, it is sufficient to show349

plimn→∞

1

n

n∑i=1

wnt (sit)ρ

it(r

it + ft(s

it, a

it, s

it+1)) = plim

n→∞

1

n

n∑i=1

ρi0:t(rit + ft(s

it, a

it, s

it+1)), (A.2)

15

given plimn→∞wnt (s) = wt(s) for all s ∈ S. Note that dµt (s) is the state distribution under behavior350

policy µ at time step t, then for the left hand side of (A.2), we have351

plimn→∞

1

n

n∑i=1

wnt (sit)ρ

it(r

it + ft(s

it, a

it, s

it+1))

=∑s∈S

dµt (s)plimn→∞

[1

n

n∑i=1

wnt (s)π(ait|s)µ(ait|s)

1(sit = s)(rit + ft(s, ait, s

it+1))

]

=∑s∈S

dµt (s)plimn→∞

[wnt (s)

1

n

n∑i=1

π(ait|s)µ(ait|s)


it+1))

]

=∑s∈S

dµt (s)

[plimn→∞

(wnt (s)) plimn→∞

(1

n

n∑i=1

π(ait|s)µ(ait|s)


it+1))

)]

=∑s∈S

dµt (s)wt(s)plimn→∞

[1

n

n∑i=1

π(ait|s)µ(ait|s)


it+1))

]

=∑s∈S

dµt (s)wt(s)E[π(at|s)µ(at|s)

(rt + ft(s, at, st+1))∣∣∣st = s

], (A.3)

where the first equation follows from the weak law of large number. Similarly, for the right hand352

side of (A.2), we have353

plimn→∞

1

n

n∑i=1

ρi0:t(rit + ft(s

it, a

it, s

it+1))

=∑s∈S

dµt (s)plimn→∞

[1

n

n∑i=1

t−1∏t′=0


1(sit = s)π(ait|s)µ(ait|s)

(rit + ft(s, ait, s

it+1))

]

=∑s∈S

dµt (s)E

[t−1∏t′=0

π(at′ |st′)µ(at′ |st′)

π(at|s)µ(at|s)

(rt + ft(s, at, st+1))∣∣∣st = s

]

=∑s∈S

dµt (s)E

[t−1∏t′=0

π(at′ |st′)µ(at′ |st′)

∣∣∣st = s

]E[π(at|s)µ(at|s)

(rt + ft(s, at, st+1))∣∣∣st = s

]=∑s∈S

dµt (s)wt(s)E[π(at|s)µ(at|s)

(rt + ft(s, at, st+1))∣∣∣st = s

], (A.4)

where the first equation follows from the weak law of large number and the third equation follows354

from the conditional independence of the Markov property. Thus, we have (A.3) equal to (A.4).355

This completes the proof of the second half.356

We now provide the proof of theorem 2.357

Proof of Theorem 2. The unbiasedness of wnt can be obtained directed by taking expectation as358

16

follows.359

E [wnt (s)] =E

∑n

i=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s


i=1 1(sit = s)

=E

E∑n

i=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s


i=1 1(sit = s)

∣∣∣∣s1:nt

=E

[E

[n∑i=1

t−1∏t′=0


1(sit = s)

∣∣∣∣s1:nt]

1∑ni=1 1(s

it = s)

]

=E

[E

[n∑i=1

dµt (sit)

dπt (sit)1(sit = s)

∣∣∣∣s1:nt]

1∑ni=1 1(s

it = s)

]

=E

[1∑n

i=1 1(sit = s)

n∑i=1

dµt (sit)

dπt (sit)1(sit = s)

]

=E

[1∑n

i=1 1(sit = s)

n∑i=1

dµt (s)

dπt (s)1(sit = s)

]

=E

[dµt (s)

dπt (s)

1∑ni=1 1(s

it = s)

n∑i=1

1(sit = s)

]

=dµt (s)

dπt (s),

where the first equation follows from the law of total expectation, the other equations are all follows360

from the conditional independence. Thus, we complete the proof of unbiasedness.361

Proof of Theorem 3. For the case of wnt is (3.4), we have the following inequality according to the362

Hoeffding’s inequality [Hoeffding, 1963]363

P (|wnt (s ∈ St)− wt(s ∈ St)| ≥ ε1) ≤ 2 exp(−2nt(s)ε21η2t

). (A.5)

Since E[nt(s)] = ndµt (s), we also have the following inequality according to the Hoeffding’s inequality364

P (ndµt (s)− nt(s) ≤ ε2) ≤ exp(−2nε22

). (A.6)

By setting 2 exp(−2nt(s)ε21η2t

)= exp

(−2nε22

)= δ/2, we can obtain365

2 exp(−2nt(s)ε21η2t

)=δ

22nt(s)ε

21η

2t = log(4/δ)

ε1 =1

ηt

√log(4/δ)

2nt(s), (A.7)

and366

exp(−2nε22

)=δ

22nε22 = log(2/δ)

ε2 =

√log(2/δ)

2n. (A.8)

17

By combining (A.6) and (A.8), we know that with probability at least 1− δ/2, we have367

nt(s) ≥ ndµt (s)−√

log(2/δ)

2n. (A.9)

Thus, substitute (A.9) into (A.7), we can obtain the following inequality holds with probability368

at least 1− δ/2369

ε1 ≤1

ηt

√√√√√ log(4/δ)

2

(ndµt (s)−

√log(2/δ)

2n

)

=1

ηt

√√√√ log(4/δ)√2n

2(ndµt (s)

√2n−

√log(2/δ)

) .Then, based on the inequality (A.5), we completes the proof.370

At last, we provide the proof of theorem 4.371

Proof of Theorem 4. We now provide the proof of the high-confidence bound of wnt in (3.5). We372

also have the following inequality according to the Hoeffding’s inequality [Hoeffding, 1963] for wnt (s)373

and nt(s).374

P (|wnt (s ∈ St)− wt(s ∈ St)| ≥ ε1) ≤2 exp(−2nt(s)ε21η2t

),

P (ndµt (s)− nt(s) ≤ ε2) ≤ exp(−2nε22

). (A.10)

In addition, for the part n/∑n

i=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

in (3.5), we have375

P

(∣∣∣∣∣n∑i=1

t−1∏t′=0


− n

∣∣∣∣∣ ≥ ε3)≤2 exp

(−2nε23η2t

). (A.11)

By setting 2 exp(−2nt(s)ε21η2t

)= exp

(−2nε22

)= 2 exp

(−2nε23η2t

)= δ/3, we can obtain376

2 exp(−2nt(s)ε21η2t

)=δ

32nt(s)ε

21η

2t = log(6/δ)

ε1 =1

ηt

√log(6/δ)

2nt(s), (A.12)

377

exp(−2nε22

)=δ

32nε22 = log(3/δ)

ε2 =

√log(3/δ)

2n, (A.13)

18

and378

2 exp(−2nε23η2t

)=δ

32nε23η

2t = log(6/δ)

ε3 =1

ηt

√log(6/δ)

2n. (A.14)

By combining (A.10) and (A.13), we know that with probability at least 1− δ/3, we have379

nt(s) ≥ ndµt (s)−√

log(3/δ)

2n. (A.15)

Then, substitute (A.15) into (A.12), we can obtain the following inequality holds with probability380

at least 1− δ/3381

ε1 ≤1

ηt

√√√√√ log(6/δ)

2

(ndµt (s)−

√log(3/δ)

2n

) (A.16)

=1

ηt

√√√√ log(6/δ)√2n

2(ndµt (s)

√2n−

√log(3/δ)

) .Based on the definition of wnt (s) in (3.5) and (A.11), we have the following inequality holds with382

probability at least 1− 2δ/3383

|wnt (s ∈ St)− wt(s ∈ St)| ≤wt(s) + ε11− ε3/n

− wt(s) =nε1 + wt(s)ε3

n− ε3. (A.17)

Combining (A.17) with the definition of ε3 in (A.14) and the high probability bound of ε1 in384

(A.16), we can obtain our final high probability bound for wnt (s):385

|wnt (s)− wt(s)|

≤ n√2n

ηtn√2n−

√log(6/δ)

√√√√ log(6/δ)√2n

2(ndµt (s)

√2n−

√log(3/δ)

) +wt(s)

√log(6/δ)

ηtn√2n−

√log(6/δ)

holds with probability at least 1− δ.386

Also, since the consistent estimator wnt (s) in (3.5) contains self-normalization, we can also obtain387

its upper bound as388

wnt (s) =

∑ni=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )1(sit = s)

1n

∑ni=1

∏t−1t′=0

π(ait′ |s

it′ )

µ(ait′ |s

it′ )

∑nj=1 1(s

jt = s)

≤ n∑nj=1 1(s

jt = s)

.

Since we also have389

P

∣∣∣∣∣∣n∑j=1

1(sjt = s)− ndµt (s)

∣∣∣∣∣∣ >√

log(2/δ)

2n

≤ δfrom the Hoeffding’s inequality, then wnt (s) can be upper bounded by

√2n/(dµt (s)

√2n−

√log(2/δ))390

with probability at least 1− δ. This completes the proof.391

19

marginalizedoﬀ-policyevaluationforreinforcement learning...1...

Documents