off-policy learning
Post on 13-Mar-2022
Outline
• Setup
  1. What is off-policy reinforcement learning?
  2. Off-policy policy evaluation.
• Methods for off-policy policy evaluation
  1. Importance sampling methods.
  2. Doubly Robust (DR) and Weighted Doubly Robust (WDR) estimators.
  3. Model And Guided Importance sampling Combined (MAGIC) estimator.
β’ Note on Off-policy RL vs. Offline RL
What is off-policy Learning?
Off-policy Learning.
In off-policy learning, algorithms evaluate and improve a target policy that is different from the behavioral (observational) policy used to generate the data.
[Figure: in the on-policy setting, data is generated according to the policy π being evaluated; in the off-policy setting, data is generated from a different policy.]
Off-policy example: Q-learning, which updates using the greedy policy.
Contrast it with the on-policy method SARSA, which updates using the same policy π that generates the data.
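The contrast between the two update rules can be sketched in code (a minimal tabular sketch; the dict-of-dicts Q table and the alpha/gamma values are illustrative, not from the slides):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap with the greedy action in s_next,
    regardless of what the behavior policy actually does there."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap with the action a_next actually taken
    by the current policy in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Note that SARSA needs the next action a_next as input, while Q-learning does not; this is exactly the off-policy property.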
What is off-policy Learning?
Today, we will focus on off-policy policy evaluation.
Recall:
Policy Evaluation: estimating V^π(s), the expected future reward available from each state s when following policy π.
Estimating the value function V^π is a central question in RL and is a part of many policy-based algorithms, such as policy iteration.
In the model-free setting, we have seen that we can estimate the value function using methods like Monte Carlo.
✗ This requires interacting with the environment using the policy we want to evaluate.
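As a concrete reminder, on-policy Monte Carlo evaluation can be sketched as follows (a minimal first-visit variant; the episode format, a list of (state, reward) pairs, is an illustrative assumption):

```python
def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of V(s) from a list of
    episodes, each a list of (state, reward) pairs."""
    returns = {}  # state -> list of observed returns
    for ep in episodes:
        # compute the return-to-go at every step, walking backwards
        G, rtg = 0.0, [0.0] * len(ep)
        for t in range(len(ep) - 1, -1, -1):
            G = ep[t][1] + gamma * G
            rtg[t] = G
        # record the return from each state's *first* visit only
        seen = set()
        for t, (s, _) in enumerate(ep):
            if s not in seen:
                seen.add(s)
                returns.setdefault(s, []).append(rtg[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

The key limitation, as noted above, is that the episodes must come from the very policy being evaluated.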
What is off-policy Learning?
What if we cannot use the target policy?
In some situations, it is costly and/or risky to try a different, unevaluated policy:
1. Healthcare. Which medical treatment to suggest for a patient.
2. Marketing decisions. Which advertisement to show a user visiting a website.
3. Education policies. Which curriculum to recommend for a student.
The main question:
Can we estimate the value function of a specific target policy from data collected by another policy?
Detailed Setup
We consider an episodic MDP.
• The agent interacts with its environment in a sequence of episodes, indexed by m.
• Each episode has a finite number of time steps, t ∈ {0, …, T}.
• Finite numbers of states and actions.
• ℙ(s′ | s, a) denotes the unknown state-transition probabilities.
• r(s, a) denotes the one-step reward at state s and action a.
Under this MDP, we are interested in two (arbitrary) stationary Markov policies:
• π_b: a behavioral policy used to generate the data (with π_b(a | s) > 0 for all s, a).
• π_e: the target policy whose value function we seek to estimate.
Goal: using episodes generated via π_b, evaluate V^{π_e}(·), the value function of the policy π_e.
What is importance sampling?
One way to view the task of off-policy learning is as a mismatch of distributions:
✗ We want trajectories sampled from the distribution of the target policy π_e.
✓ However, we have data drawn from the distribution of the behavior policy π_b.
Importance sampling is a classical technique for handling just this kind of mismatch.
Example:
Estimate E[x] for a random variable x distributed according to h₁, using samples drawn from another distribution h₂.
What is importance sampling?
Example (continued): estimate E[x] for x distributed according to h₁, using samples drawn from h₂.
Since
E_{h₁}[x] = Σ_x x h₁(x) = Σ_x x (h₁(x)/h₂(x)) h₂(x) = E_{h₂}[x h₁(x)/h₂(x)],
we can use the unbiased, consistent estimator
x̂ = (1/n) Σ_{i=1}^{n} x_i h₁(x_i)/h₂(x_i), where x_i ~ h₂.
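For the discrete case, the estimator above is a one-liner (a sketch assuming the densities h₁ and h₂ are known and passed in as probability functions; the names are illustrative):

```python
def is_mean(samples, p_target, p_behavior):
    """Importance sampling estimate of E_{h1}[x] from samples
    x_i ~ h2: reweight each sample by h1(x_i)/h2(x_i)."""
    n = len(samples)
    return sum(x * p_target(x) / p_behavior(x) for x in samples) / n
```

For example, with h₂ uniform over {1, 2} and h₁ putting probability 0.2 on 1 and 0.8 on 2, a sample containing one of each value is reweighted to recover E_{h₁}[x] = 1·0.2 + 2·0.8 = 1.8 exactly.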
Importance Sampling Estimator for off-policy evaluation
Now let's consider using the importance sampling estimator for our case.
• Using policy π_b, we collect M episodes' worth of data from the environment.
• In each episode, we collect the data
{(s_0, a_0, r_0), (s_1, a_1, r_1), …, (s_{T−1}, a_{T−1}, r_{T−1}), s_T}.
• We want to estimate the value function V^{π_e}(s) for an arbitrary state s.
• Let t_m be the first time when s_t = s in the m-th episode.
Importance Sampling Estimator for off-policy evaluation
Then we define the importance sampling estimator of V^{π_e}(s) as:
V̂_IS(s) = (1/M) Σ_{m=1}^{M} ρ_m G_m,
where
ρ_m = Π_{t=t_m}^{T−1} π_e(a_t | s_t) / π_b(a_t | s_t)  and  G_m is the return collected from time t_m onwards in episode m.
Note: in the on-policy case, when π_e = π_b, we get our Monte Carlo estimator back.
✓ This estimator is consistent and unbiased.
Instability of the Importance Sampling estimator
✗ If an unlikely event occurs, its weight will be very large and cause large variation.
✓ One way to mitigate this is to use a weighted average of the samples.

Weighted Importance Sampling Estimator (WIS)
V̂_WIS(s) = Σ_m ρ_m G_m / Σ_m ρ_m.
This estimator is ✓ consistent ✗ but biased; ✓ however, it is more stable than the IS estimator.
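Both estimators can be computed in one pass over the data (a sketch; the trajectory format, a list of (s, a, r) triples starting at the state of interest, and the policy functions mapping (s, a) to a probability are illustrative assumptions):

```python
def is_and_wis(trajs, pi_e, pi_b, gamma=1.0):
    """Ordinary (IS) and weighted (WIS) importance sampling estimates
    of V^{pi_e} from trajectories generated under pi_b."""
    weights, returns = [], []
    for traj in trajs:
        rho, G = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)  # trajectory weight
            G += gamma ** t * r             # discounted return
        weights.append(rho)
        returns.append(G)
    v_is = sum(w * g for w, g in zip(weights, returns)) / len(trajs)
    v_wis = sum(w * g for w, g in zip(weights, returns)) / sum(weights)
    return v_is, v_wis
```

Note that the only difference is the normalizer: M for IS versus the sum of the weights for WIS, which is where the (vanishing) bias and reduced variance come from.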
Per-decision IS estimator
The IS estimators defined above all consider complete trajectories.
Can we further reduce the variance using the Markov property?
Intuitively, the reward at time t should not depend on the future transitions. This motivates the per-decision IS estimator, which weights each reward only by the importance ratios up to that time:
V̂_PD(s) = (1/M) Σ_m Σ_{t=t_m}^{T−1} γ^{t−t_m} r_t Π_{τ=t_m}^{t} π_e(a_τ | s_τ) / π_b(a_τ | s_τ).
✗ It is not easily implemented on an incremental, step-by-step basis.
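The per-decision variant is a small change to the loop above (same illustrative trajectory and policy-function conventions as before):

```python
def pdis(trajs, pi_e, pi_b, gamma=1.0):
    """Per-decision IS: each reward r_t is weighted only by the
    importance ratios up to time t, not by the full-trajectory
    weight, so later actions cannot inflate earlier rewards."""
    total = 0.0
    for traj in trajs:
        rho = 1.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)  # ratio up to time t only
            total += gamma ** t * rho * r
    return total / len(trajs)
```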
Per-decision IS estimator is unbiased.
Theorem. The per-decision importance sampling estimator V̂_PD is a consistent, unbiased estimator of V^{π_e}.
Proof. To prove this, we will show that E[V̂_IS] = E[V̂_PD]. This is sufficient, as we know that V̂_IS is unbiased.
Proof (continued).
That is, E[V̂_IS] = E[V̂_PD], which is sufficient as we know that V̂_IS is unbiased. ∎
Weighted Per-decision IS estimator.
Similar to how the weighted IS estimator was introduced, a weighted version of the per-decision estimator can be considered.
Summary
The estimators trade off lower variance against incremental, practical implementation:
• V̂_IS: Importance Sampling Estimator
• V̂_PD: Per-decision IS Estimator
• V̂_WIS: Weighted Importance Sampling Estimator
• V̂_WPD: Weighted Per-decision IS Estimator
✗ All suffer from high variance, especially with a long horizon.
Issue with the Importance Sampling estimators
• When a trajectory is unlikely under our target policy, its weight is very small.
• When a trajectory is likely under our target policy, its weight is very big.
✗ The final estimate is potentially determined by only a few trajectories → high variance.
✗ The problem is exacerbated with a longer horizon.
Possible Resolution: Model-based estimator
How does the model-based (AKA regression-based) estimator work?
1. Interact with the environment to learn an estimated MDP M̂, by learning:
   i. A model of the transition probabilities ℙ̂(s′ | s, a).
   ii. A model of the reward r̂(s, a).
2. Calculate the value function V^{π_e}(·) recursively via the Bellman equation.
How well does it work?
✓ Low variance.
✗ Hard to learn an MDP well in a complex environment (e.g., a high-dimensional environment):
   1. It is hard to specify a valid class of functions.
   2. Using universal approximators such as DNNs does not generalize well (introduces bias).
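Step 2 above, evaluating the policy on the learned model, can be sketched as iterative application of the Bellman equation (a minimal sketch; the nested-list model format and parameter names are illustrative assumptions):

```python
def model_based_v(P, R, pi, gamma=0.9, iters=500):
    """Iterative policy evaluation on a *learned* model.
    P[s][a]: list of next-state probabilities (the learned transitions),
    R[s][a]: scalar learned reward, pi[s][a]: target policy probability."""
    n_s = len(P)
    V = [0.0] * n_s
    for _ in range(iters):
        # one Bellman backup: V(s) <- sum_a pi(a|s) [R(s,a) + gamma * E[V(s')]]
        V = [
            sum(
                pi[s][a] * (R[s][a]
                            + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a])))
                for a in range(len(P[s]))
            )
            for s in range(n_s)
        ]
    return V
```

The estimate inherits whatever bias the learned P and R carry, which is exactly the weakness listed above.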
Can we get the best of both worlds?
Can we take the best of the regression-based and importance-sampling-based approaches?
Let's consider a simpler setup: contextual bandits.
• The horizon is one, and a trajectory takes the form (s, a, r).
• Finite number of states and actions.
Again, we are interested in two (arbitrary) stationary Markov policies:
• π_b: a behavioral policy used to generate the data.
• π_e: the target policy whose value function we seek to estimate.
Doubly Robust estimator
✓ Step 1. Estimate the reward function r̂(s, a) using regression on a separate dataset.
✓ Step 2. Use the estimator
V̂_DR(s) = V̂(s) + ρ (r − r̂(s, a)),
where ρ = π_e(a | s) / π_b(a | s) and V̂(s) = Σ_a π_e(a | s) r̂(s, a).
✓ The variance of (r − r̂(s, a)) is lower than that of r, provided that r̂(s, a) is a good estimate.
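The two-step recipe for the bandit case can be sketched as follows (the data format, with each record carrying its available action set, and the policy/reward-model functions are illustrative assumptions):

```python
def dr_bandit(data, pi_e, pi_b, r_hat):
    """Doubly robust estimate for contextual bandits, averaged over
    the logged data: V_hat(s) + rho * (r - r_hat(s, a)).
    r_hat should be fit by regression on a *separate* dataset."""
    def v_hat(s, actions):
        # model-based baseline: expected reward under the target policy
        return sum(pi_e(s, a) * r_hat(s, a) for a in actions)
    total = 0.0
    for s, a, r, actions in data:
        rho = pi_e(s, a) / pi_b(s, a)  # importance ratio for the logged action
        total += v_hat(s, actions) + rho * (r - r_hat(s, a))
    return total / len(data)
```

When r̂ is accurate, the correction term (r − r̂) is near zero and the (high-variance) importance ratio hardly matters; when r̂ is wrong, the unbiased IS correction still fixes it up, hence "doubly robust".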
Doubly Robust estimator
High-level idea (similar to contextual bandits): use the approximate model to
✓ reduce the variance of the estimates by subtracting the model-based estimate r̂(s, a), and
✓ guide, but not completely replace, the importance sampling estimates.
Hence this is called a "guided importance sampling method".
Doubly Robust estimator: sequential setting
✓ Following the idea of the per-decision (step-wise) IS estimator, and using the fact that the value function satisfies V^{π_e}(s) = Σ_a π_e(a | s) Q^{π_e}(s, a), we can write the corresponding recursion for DR as:
V̂_DR(s_t) = V̂(s_t) + ρ_t (r_t + γ V̂_DR(s_{t+1}) − Q̂(s_t, a_t)),  with ρ_t = π_e(a_t | s_t) / π_b(a_t | s_t).
✓ Q̂ is estimated via regression on a separate dataset, and V̂(s) = Σ_a π_e(a | s) Q̂(s, a).
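The recursion unrolls naturally backwards from the end of a trajectory (a sketch for a single trajectory; the (s, a, r) triple format and the q_hat/v_hat model functions are illustrative assumptions):

```python
def dr_sequential(traj, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Sequential doubly robust estimate for one trajectory:
    V_DR = v_hat(s_t) + rho_t * (r_t + gamma * V_DR(tail) - q_hat(s_t, a_t)),
    evaluated backwards from the last step (base case V_DR = 0)."""
    v_dr = 0.0
    for s, a, r in reversed(traj):
        rho = pi_e(s, a) / pi_b(s, a)
        v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
    return v_dr
```

In practice one averages this quantity over the M logged trajectories, just as for the IS estimators above.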
Doubly Robust estimator: sequential setting
✓ The variance of the DR estimator depends on Q̂ through the error term Δ = Q̂ − Q.
✓ IS corresponds to the special case with Q̂ = 0.
✓ Therefore, DR generally performs better than IS if Q̂ is well chosen.
Doubly Robust estimator: Cramer-Rao bound
✓ What is the Cramer-Rao bound?
• The Cramer-Rao bound is a lower bound on the variance of an unbiased estimator of a parameter.
• It is used to find the minimum variance (or mean squared error) that an unbiased estimator must maintain.
• Note that biased estimators can end up having lower mean squared error or variance.
✓ DR achieves the Cramer-Rao bound. In other words, DR is the best we can do if we restrict ourselves to unbiased estimators.
• Hard case: discrete-tree MDP. The variance of any unbiased OPE estimator is lower bounded, and DR achieves this bound if we unfold the recursion for the variance of the DR estimator.
• Relaxed case: MDP with a Directed Acyclic Graph (DAG) structure. The variance of any unbiased OPE estimator is again lower bounded, and DR achieves this bound as well.
Doubly Robust estimator: Experiments
✓ DR attains lower MSE than IS and WIS for off-policy policy evaluation in the Mountain Car environment.
Doubly Robust estimator: Safe Policy Improvement
✓ Rationale of the experiment:
  o OPE is used when the target policy cannot be safely deployed before we have a good value estimate.
✓ Setting:
  • Given data D, use D_train to generate policies and D_test to evaluate (using OPE) and select new policies.
  • Exp 1 (good policies): select the policy with the highest lower confidence bound (LCB): V − c·σ.
  • Exp 2 (bad policies): select the policy with the lowest LCB.
✓ Results:
  • Good policies: DR's value improvement largely outperforms IS.
  • Bad policies: DR policies for a given confidence interval are at least as safe as the corresponding IS policies.
Can we do better with biased estimators? Extension of DR to WDR
✓ Key idea: directly optimize the Mean Squared Error (MSE) instead of minimizing variance under a zero-bias constraint.
✓ We want low MSE because the goal is an approximate value for a given policy, to use as an intermediate result for policy search algorithms.
✓ To minimize MSE, we need to find a good bias-variance tradeoff, but we do not necessarily need zero bias.
✓ Results:
  • WDR performs better than the other IS estimators.
  • IS-based methods (including DR and WDR) perform better than the purely model-based approximate-model (AM) estimator in some scenarios, but AM performs best when the model represents the true MDP.
We have looked so far at:
• Importance-sampling-based estimators (use all importance weights up to horizon H − 1): IS, PDIS, WIS, CWPDIS, DR, WDR.
• Purely model-based estimators: AM.
Another hybrid family is introduced: partially importance-sampling-based estimators.
• Use importance weights only up to some step j < H − 1.
• j = −1 corresponds to the purely model-based estimator.
• j → ∞ corresponds to an importance sampling estimator.
Can we blend two different OPE estimators to get better performance?
MAGIC Estimator: Key Idea
✓ Create partially importance-sampling-based estimators with WDR as the IS part and AM as the model-based part.
✓ Combine them in such a way that the MSE between the estimate and our target value function is minimized.
✓ In other words, take a linear combination of these estimators such that it minimizes the error.
✓ This reduces to finding weights x such that the estimated MSE is minimized, where the covariance and bias terms in that objective are estimated from the partial estimators.
MAGIC Estimator: Details
• Create a vector g of all the partially importance-sampling-based estimators.
• The new estimator is xᵀg, where the weights x are chosen such that the estimated MSE is minimized.
• Choosing a sub-problem over J, where J is a finite subset of {−1, 0, …, +∞}, the problem reduces to finding x (on the simplex) such that the estimated MSE of xᵀg is minimized.
• Estimating the covariance of the partial estimators can be done directly from the sample.
• Estimating the bias is trickier, because using the model-based estimate gives high bias and using IS gives high variance; so it is done using the distance from confidence intervals.
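For two estimators the MSE-minimizing convex combination even has a closed form, which illustrates the blending idea (a sketch; MAGIC itself solves the general case as a quadratic program over the simplex, and the m_ij inputs, estimates of E[(g_i − v)(g_j − v)], are assumed given):

```python
def blend_two(g1, g2, m11, m22, m12):
    """Optimal convex combination x*g1 + (1-x)*g2 minimizing the
    estimated MSE x^2*m11 + 2x(1-x)*m12 + (1-x)^2*m22.
    Assumes m11 + m22 - 2*m12 > 0 (distinct estimators)."""
    x = (m22 - m12) / (m11 + m22 - 2 * m12)
    x = min(1.0, max(0.0, x))  # project onto the simplex
    return x * g1 + (1 - x) * g2, x
```

For instance, with uncorrelated errors (m12 = 0) and error magnitudes m11 = 1, m22 = 4, the weight on the first estimator is x = 0.8, pulling the blend toward the lower-error estimator without discarding the other.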
Experimental Results
✓ MAGIC performs similarly to DR and WDR in some settings.
✓ Its results are comparable to AM when the environment dynamics of the true MDP are perfectly modelled.
✓ It outperforms both DR/WDR and AM in hybrid settings.
Takeaways
Many methods can effectively evaluate a target policy given data from a different observational policy:
§ Importance sampling methods, which suffer from high variance.
§ Model-based methods, which may not be effective in complex or partially observable environments.
§ Guided importance sampling methods (DR, WDR), which use an approximate model to guide the importance sampling.
§ The hybrid approach (MAGIC), which combines the model-based and importance sampling approaches to get the best of both worlds.
Offline RL vs. off-policy RL
§ In offline RL, we have to learn a policy using a previously collected batch of data.
§ It is considered more realistic for some of the most important real-world use cases of RL.
Offline RL: Challenges
§ Limited exploration. Learning relies on a static dataset D → no exploration.
  ✗ If D does not contain transitions that illustrate high-reward regions, we may be out of luck.
§ Distributional shift. While we trained under one distribution, our policy will be evaluated on a different distribution during deployment.
  ✗ The learned policy can deviate from the known state-action space into unknown state-action space, where we don't know the behavior.
Offline RL: Mitigating the distribution shift problem
§ Imposing policy constraints. Ensure that the learned policy is close to the behavior policy. Ex: Batch-Constrained Q-learning (BCQ).
§ Uncertainty-based methods. Estimate the uncertainty of Q-values (in value-based methods) or of the learned transition model (in model-based methods), and then utilize this uncertainty to detect distributional shift. Ex: MOReL.
References
§ Off-policy policy evaluation:
1. Precup, Sutton, and Singh. Eligibility traces for off-policy policy evaluation (2000). https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
2. Jiang and Li. Doubly robust off-policy value evaluation for reinforcement learning (2015). https://arxiv.org/pdf/1511.03722
3. Thomas and Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning (2016). http://www.jmlr.org/proceedings/papers/v48/thomasa16.pdf
§ Offline RL:
1. Levine et al. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. https://arxiv.org/abs/2005.01643v3