off-policy learning
Post on 13-Mar-2022
Outline
• Setup
  1. What is off-policy reinforcement learning?
  2. Off-policy policy evaluation.
• Methods for off-policy policy evaluation
  1. Importance sampling methods.
  2. Doubly Robust (DR) and Weighted Doubly Robust (WDR) estimators.
  3. Model And Guided Importance sampling Combined (MAGIC) estimator.
β’ Note on Off-policy RL vs. Offline RL
What is off-policy Learning?
Off-policy Learning.
In off-policy learning, algorithms evaluate and improve a target policy that is different from the behavioral (observational) policy used to generate the data.
[Figure: in the on-policy setting, data is generated according to the policy π being evaluated; in the off-policy setting, data is generated from a different policy.]
Off-policy example: Q-learning, which updates using the greedy policy.
Contrast it with the on-policy method SARSA, which updates using the same policy π that generates the data.
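The contrast between the two update rules can be sketched in code (a minimal tabular sketch; the dict-of-dicts Q table and the alpha/gamma values are illustrative, not from the slides):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap with the greedy action in s_next,
    regardless of what the behavior policy actually does there."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap with the action a_next actually taken
    by the current policy in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Note that SARSA needs the next action a_next as input, while Q-learning does not; this is exactly the off-policy property.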
What is off-policy Learning?
Today, we will focus on off-policy policy evaluation.
Recall:
Policy Evaluation: estimating V^π(s), the expected future reward available from each state s when following policy π.
Estimating the value function V^π is a central question in RL and is a part of many policy-based algorithms, such as policy iteration.
In the model-free setting, we have seen that we can estimate the value function using methods like Monte Carlo.
✗ This requires interacting with the environment using the policy we want to evaluate.
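As a concrete reminder, on-policy Monte Carlo evaluation can be sketched as follows (a minimal first-visit variant; the episode format, a list of (state, reward) pairs, is an illustrative assumption):

```python
def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of V(s) from a list of
    episodes, each a list of (state, reward) pairs."""
    returns = {}  # state -> list of observed returns
    for ep in episodes:
        # compute the return-to-go at every step, walking backwards
        G, rtg = 0.0, [0.0] * len(ep)
        for t in range(len(ep) - 1, -1, -1):
            G = ep[t][1] + gamma * G
            rtg[t] = G
        # record the return from each state's *first* visit only
        seen = set()
        for t, (s, _) in enumerate(ep):
            if s not in seen:
                seen.add(s)
                returns.setdefault(s, []).append(rtg[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

The key limitation, as noted above, is that the episodes must come from the very policy being evaluated.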
What is off-policy Learning?
What if we cannot use the target policy?
In some situations, it is costly and/or risky to try a different, unevaluated policy:
1. Healthcare. Which medical treatment to suggest for a patient.
2. Marketing decisions. Which advertisement to show a user visiting a website.
3. Education policies. Which curriculum to recommend for a student.
The main question:
Can we estimate the value function of a specific target policy from data collected by another policy?
Detailed Setup
We consider an episodic MDP.
• The agent interacts with its environment in a sequence of episodes, indexed by m.
• Each episode has a finite number of time steps, t ∈ {0, …, T}.
• Finite numbers of states and actions.
• ℙ(s′ | s, a) denotes the unknown state-transition probabilities.
• r(s, a) denotes the one-step reward at state s and action a.
Under this MDP, we are interested in two (arbitrary) stationary Markov policies:
• π_b: a behavioral policy used to generate the data (with π_b(a | s) > 0 for all s, a).
• π_e: the target policy whose value function we seek to estimate.
Goal: using episodes generated via π_b, evaluate V^{π_e}(·), the value function of the policy π_e.
What is importance sampling?
One way to view the task of off-policy learning is as a mismatch of distributions:
✗ We want trajectories sampled from the distribution of the target policy π_e.
✓ However, we have data drawn from the distribution of the behavior policy π_b.
Importance sampling is a classical technique for handling just this kind of mismatch.
Example:
Estimate E[x] for a random variable x distributed according to h₁, using samples drawn from another distribution h₂.
What is importance sampling?
Example (continued): estimate E[x] for x distributed according to h₁, using samples drawn from h₂.
Since
E_{h₁}[x] = Σ_x x h₁(x) = Σ_x x (h₁(x)/h₂(x)) h₂(x) = E_{h₂}[x h₁(x)/h₂(x)],
we can use the unbiased, consistent estimator
x̂ = (1/n) Σ_{i=1}^{n} x_i h₁(x_i)/h₂(x_i), where x_i ~ h₂.
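For the discrete case, the estimator above is a one-liner (a sketch assuming the densities h₁ and h₂ are known and passed in as probability functions; the names are illustrative):

```python
def is_mean(samples, p_target, p_behavior):
    """Importance sampling estimate of E_{h1}[x] from samples
    x_i ~ h2: reweight each sample by h1(x_i)/h2(x_i)."""
    n = len(samples)
    return sum(x * p_target(x) / p_behavior(x) for x in samples) / n
```

For example, with h₂ uniform over {1, 2} and h₁ putting probability 0.2 on 1 and 0.8 on 2, a sample containing one of each value is reweighted to recover E_{h₁}[x] = 1·0.2 + 2·0.8 = 1.8 exactly.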
Importance Sampling Estimator for off-policy evaluation
Now let's consider using the importance sampling estimator for our case.
• Using policy π_b, we collect M episodes' worth of data from the environment.
• In each episode, we collect the data
{(s_0, a_0, r_0), (s_1, a_1, r_1), …, (s_{T−1}, a_{T−1}, r_{T−1}), s_T}.
• We want to estimate the value function V^{π_e}(s) for an arbitrary state s.
• Let t_m be the first time when s_t = s in the m-th episode.
Importance Sampling Estimator for off-policy evaluation
Then we define the importance sampling estimator of V^{π_e}(s) as:
V̂_IS(s) = (1/M) Σ_{m=1}^{M} ρ_m G_m,
where
ρ_m = Π_{t=t_m}^{T−1} π_e(a_t | s_t) / π_b(a_t | s_t)  and  G_m is the return collected from time t_m onwards in episode m.
Note: in the on-policy case, when π_e = π_b, we get our Monte Carlo estimator back.
✓ This estimator is consistent and unbiased.
Instability of the Importance Sampling estimator
✗ If an unlikely event occurs, its weight will be very large and cause large variation.
✓ One way to mitigate this is to use a weighted average of the samples.

Weighted Importance Sampling Estimator (WIS)
V̂_WIS(s) = Σ_m ρ_m G_m / Σ_m ρ_m.
This estimator is ✓ consistent ✗ but biased; ✓ however, it is more stable than the IS estimator.
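Both estimators can be computed in one pass over the data (a sketch; the trajectory format, a list of (s, a, r) triples starting at the state of interest, and the policy functions mapping (s, a) to a probability are illustrative assumptions):

```python
def is_and_wis(trajs, pi_e, pi_b, gamma=1.0):
    """Ordinary (IS) and weighted (WIS) importance sampling estimates
    of V^{pi_e} from trajectories generated under pi_b."""
    weights, returns = [], []
    for traj in trajs:
        rho, G = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)  # trajectory weight
            G += gamma ** t * r             # discounted return
        weights.append(rho)
        returns.append(G)
    v_is = sum(w * g for w, g in zip(weights, returns)) / len(trajs)
    v_wis = sum(w * g for w, g in zip(weights, returns)) / sum(weights)
    return v_is, v_wis
```

Note that the only difference is the normalizer: M for IS versus the sum of the weights for WIS, which is where the (vanishing) bias and reduced variance come from.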
Per-decision IS estimator
The IS estimators defined above all consider complete trajectories.
Can we further reduce the variance using the Markov property?
Intuitively, the reward at time t should not depend on the future transitions. This motivates the per-decision IS estimator, which weights each reward only by the importance ratios up to that time:
V̂_PD(s) = (1/M) Σ_m Σ_{t=t_m}^{T−1} γ^{t−t_m} r_t Π_{τ=t_m}^{t} π_e(a_τ | s_τ) / π_b(a_τ | s_τ).
✗ It is not easily implemented on an incremental, step-by-step basis.
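The per-decision variant is a small change to the loop above (same illustrative trajectory and policy-function conventions as before):

```python
def pdis(trajs, pi_e, pi_b, gamma=1.0):
    """Per-decision IS: each reward r_t is weighted only by the
    importance ratios up to time t, not by the full-trajectory
    weight, so later actions cannot inflate earlier rewards."""
    total = 0.0
    for traj in trajs:
        rho = 1.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)  # ratio up to time t only
            total += gamma ** t * rho * r
    return total / len(trajs)
```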
Per-decision IS estimator is unbiased.
Theorem. The per-decision importance sampling estimator V̂_PD is a consistent, unbiased estimator of V^{π_e}.
Proof. To prove this, we will show that E[V̂_IS] = E[V̂_PD]. This is sufficient, as we know that V̂_IS is unbiased.
Proof (continued).
That is, E[V̂_IS] = E[V̂_PD], which is sufficient as we know that V̂_IS is unbiased. ∎
Weighted Per-decision IS estimator.
Similar to how the weighted IS estimator was introduced, a weighted version of the per-decision estimator can be considered.
Summary
The estimators trade off lower variance against incremental, practical implementation:
• V̂_IS: Importance Sampling Estimator
• V̂_PD: Per-decision IS Estimator
• V̂_WIS: Weighted Importance Sampling Estimator
• V̂_WPD: Weighted Per-decision IS Estimator
✗ All suffer from high variance, especially with a long horizon.
Issue with the Importance Sampling estimators
• When a trajectory is unlikely under our target policy, its weight is very small.
• When a trajectory is likely under our target policy, its weight is very big.
✗ The final estimate is potentially determined by only a few trajectories → high variance.
✗ The problem is exacerbated with a longer horizon.
Possible Resolution: Model-based estimator
How does the model-based (AKA regression-based) estimator work?
1. Interact with the environment to learn an estimated MDP M̂, by learning:
   i. A model of the transition probabilities ℙ̂(s′ | s, a).
   ii. A model of the reward r̂(s, a).
2. Calculate the value function V^{π_e}(·) recursively via the Bellman equation.
How well does it work?
✓ Low variance.
✗ Hard to learn an MDP well in a complex environment (e.g., a high-dimensional environment):
   1. It is hard to specify a valid class of functions.
   2. Using universal approximators such as DNNs does not generalize well (introduces bias).
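Step 2 above, evaluating the policy on the learned model, can be sketched as iterative application of the Bellman equation (a minimal sketch; the nested-list model format and parameter names are illustrative assumptions):

```python
def model_based_v(P, R, pi, gamma=0.9, iters=500):
    """Iterative policy evaluation on a *learned* model.
    P[s][a]: list of next-state probabilities (the learned transitions),
    R[s][a]: scalar learned reward, pi[s][a]: target policy probability."""
    n_s = len(P)
    V = [0.0] * n_s
    for _ in range(iters):
        # one Bellman backup: V(s) <- sum_a pi(a|s) [R(s,a) + gamma * E[V(s')]]
        V = [
            sum(
                pi[s][a] * (R[s][a]
                            + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a])))
                for a in range(len(P[s]))
            )
            for s in range(n_s)
        ]
    return V
```

The estimate inherits whatever bias the learned P and R carry, which is exactly the weakness listed above.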
Can we get the best of both worlds?
Can we take the best of the regression-based and importance-sampling-based approaches?
Let's consider a simpler setup: contextual bandits.
• The horizon is one, and a trajectory takes the form (s, a, r).
• Finite number of states and actions.
Again, we are interested in two (arbitrary) stationary Markov policies:
• π_b: a behavioral policy used to generate the data.
• π_e: the target policy whose value function we seek to estimate.
Doubly Robust estimator
✓ Step 1. Estimate the reward function r̂(s, a) using regression on a separate dataset.
✓ Step 2. Use the estimator
V̂_DR(s) = V̂(s) + ρ (r − r̂(s, a)),
where ρ = π_e(a | s) / π_b(a | s) and V̂(s) = Σ_a π_e(a | s) r̂(s, a).
✓ The variance of (r − r̂(s, a)) is lower than that of r, provided that r̂(s, a) is a good estimate.
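The two-step recipe for the bandit case can be sketched as follows (the data format, with each record carrying its available action set, and the policy/reward-model functions are illustrative assumptions):

```python
def dr_bandit(data, pi_e, pi_b, r_hat):
    """Doubly robust estimate for contextual bandits, averaged over
    the logged data: V_hat(s) + rho * (r - r_hat(s, a)).
    r_hat should be fit by regression on a *separate* dataset."""
    def v_hat(s, actions):
        # model-based baseline: expected reward under the target policy
        return sum(pi_e(s, a) * r_hat(s, a) for a in actions)
    total = 0.0
    for s, a, r, actions in data:
        rho = pi_e(s, a) / pi_b(s, a)  # importance ratio for the logged action
        total += v_hat(s, actions) + rho * (r - r_hat(s, a))
    return total / len(data)
```

When r̂ is accurate, the correction term (r − r̂) is near zero and the (high-variance) importance ratio hardly matters; when r̂ is wrong, the unbiased IS correction still fixes it up, hence "doubly robust".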
Doubly Robust estimator
High-level idea (similar to contextual bandits): use the approximate model to
✓ reduce the variance of the estimates by subtracting the model-based estimate r̂(s, a), and
✓ guide, but not completely replace, the importance sampling estimates.
Hence this is called a "guided importance sampling method".
Doubly Robust estimator: sequential setting
✓ Following the idea of the per-decision (step-wise) IS estimator, and using the fact that the value function satisfies V^{π_e}(s) = Σ_a π_e(a | s) Q^{π_e}(s, a), we can write the corresponding recursion for DR as:
V̂_DR(s_t) = V̂(s_t) + ρ_t (r_t + γ V̂_DR(s_{t+1}) − Q̂(s_t, a_t)),  with ρ_t = π_e(a_t | s_t) / π_b(a_t | s_t).
✓ Q̂ is estimated via regression on a separate dataset, and V̂(s) = Σ_a π_e(a | s) Q̂(s, a).
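The recursion unrolls naturally backwards from the end of a trajectory (a sketch for a single trajectory; the (s, a, r) triple format and the q_hat/v_hat model functions are illustrative assumptions):

```python
def dr_sequential(traj, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Sequential doubly robust estimate for one trajectory:
    V_DR = v_hat(s_t) + rho_t * (r_t + gamma * V_DR(tail) - q_hat(s_t, a_t)),
    evaluated backwards from the last step (base case V_DR = 0)."""
    v_dr = 0.0
    for s, a, r in reversed(traj):
        rho = pi_e(s, a) / pi_b(s, a)
        v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
    return v_dr
```

In practice one averages this quantity over the M logged trajectories, just as for the IS estimators above.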
Doubly Robust estimator: sequential setting
✓ The variance of the DR estimator depends on Q̂ through the error term Δ = Q̂ − Q.
✓ IS corresponds to the special case with Q̂ = 0.
✓ Therefore, DR generally performs better than IS if Q̂ is well chosen.
Doubly Robust estimator: Cramer-Rao bound
✓ What is the Cramer-Rao bound?
• The Cramer-Rao bound is a lower bound on the variance of an unbiased estimator of a parameter.
• It is used to find the minimum variance (or mean squared error) that an unbiased estimator must maintain.
• Note that biased estimators can end up having lower mean squared error or variance.
✓ DR achieves the Cramer-Rao bound. In other words, DR is the best we can do if we restrict ourselves to unbiased estimators.
• Hard case: discrete-tree MDP. The variance of any unbiased OPE estimator is lower bounded, and DR achieves this bound if we unfold the recursion for the variance of the DR estimator.
• Relaxed case: MDP with a Directed Acyclic Graph (DAG) structure. The variance of any unbiased OPE estimator is again lower bounded, and DR achieves this bound as well.
Doubly Robust estimator: Experiments
✓ DR attains lower MSE than IS and WIS for off-policy policy evaluation in the Mountain Car environment.
Doubly Robust estimator: Safe Policy Improvement
✓ Rationale of the experiment:
  o OPE is used when the target policy cannot be safely deployed before we have a good value estimate.
✓ Setting:
  • Given data D, use D_train to generate policies and D_test to evaluate (using OPE) and select new policies.
  • Exp 1 (good policies): select the policy with the highest lower confidence bound (LCB): V − c·σ.
  • Exp 2 (bad policies): select the policy with the lowest LCB.
✓ Results:
  • Good policies: DR's value improvement largely outperforms IS.
  • Bad policies: DR policies for a given confidence interval are at least as safe as the corresponding IS policies.
Can we do better with biased estimators? Extension of DR to WDR
✓ Key idea: directly optimize the Mean Squared Error (MSE) instead of minimizing variance under a zero-bias constraint.
✓ We want low MSE because the goal is an approximate value for a given policy, to use as an intermediate result for policy search algorithms.
✓ To minimize MSE, we need to find a good bias-variance tradeoff, but we do not necessarily need zero bias.
✓ Results:
  • WDR performs better than the other IS estimators.
  • IS-based methods (including DR and WDR) perform better than the purely model-based approximate-model (AM) estimator in some scenarios, but AM performs best when the model represents the true MDP.
We have looked so far at:
• Importance-sampling-based estimators (use all importance weights up to horizon H − 1): IS, PDIS, WIS, CWPDIS, DR, WDR.
• Purely model-based estimators: AM.
Another hybrid family is introduced: partially importance-sampling-based estimators.
• Use importance weights only up to some step j < H − 1.
• j = −1 corresponds to the purely model-based estimator.
• j → ∞ corresponds to an importance sampling estimator.
Can we blend two different OPE estimators to get better performance?
MAGIC Estimator: Key Idea
✓ Create partially importance-sampling-based estimators with WDR as the IS part and AM as the model-based part.
✓ Combine them in such a way that the MSE between the estimate and our target value function is minimized.
✓ In other words, take a linear combination of these estimators such that it minimizes the error.
✓ This reduces to finding weights x such that the estimated MSE is minimized, where the covariance and bias terms in that objective are estimated from the partial estimators.
MAGIC Estimator: Details
• Create a vector g of all the partially importance-sampling-based estimators.
• The new estimator is xᵀg, where the weights x are chosen such that the estimated MSE is minimized.
• Choosing a sub-problem over J, where J is a finite subset of {−1, 0, …, +∞}, the problem reduces to finding x (on the simplex) such that the estimated MSE of xᵀg is minimized.
• Estimating the covariance of the partial estimators can be done directly from the sample.
• Estimating the bias is trickier, because using the model-based estimate gives high bias and using IS gives high variance; so it is done using the distance from confidence intervals.
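For two estimators the MSE-minimizing convex combination even has a closed form, which illustrates the blending idea (a sketch; MAGIC itself solves the general case as a quadratic program over the simplex, and the m_ij inputs, estimates of E[(g_i − v)(g_j − v)], are assumed given):

```python
def blend_two(g1, g2, m11, m22, m12):
    """Optimal convex combination x*g1 + (1-x)*g2 minimizing the
    estimated MSE x^2*m11 + 2x(1-x)*m12 + (1-x)^2*m22.
    Assumes m11 + m22 - 2*m12 > 0 (distinct estimators)."""
    x = (m22 - m12) / (m11 + m22 - 2 * m12)
    x = min(1.0, max(0.0, x))  # project onto the simplex
    return x * g1 + (1 - x) * g2, x
```

For instance, with uncorrelated errors (m12 = 0) and error magnitudes m11 = 1, m22 = 4, the weight on the first estimator is x = 0.8, pulling the blend toward the lower-error estimator without discarding the other.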
Experimental Results
✓ MAGIC performs similarly to DR and WDR in some settings.
✓ Its results are comparable to AM when the environment dynamics of the true MDP are perfectly modelled.
✓ It outperforms both DR/WDR and AM in hybrid settings.
Takeaways
Many methods can effectively evaluate a target policy given data from a different observational policy:
§ Importance sampling methods, which suffer from high variance.
§ Model-based methods, which may not be effective in complex or partially observable environments.
§ Guided importance sampling methods (DR, WDR), which use an approximate model to guide the importance sampling.
§ The hybrid approach (MAGIC), which combines the model-based and importance sampling approaches to get the best of both worlds.
Offline RL vs. off-policy RL
§ In offline RL, we have to learn a policy using a previously collected batch of data.
§ It is considered more realistic for some of the most important real-world use cases of RL.
Offline RL: Challenges
§ Limited exploration. Learning relies on a static dataset D → no exploration.
  ✗ If D does not contain transitions that illustrate high-reward regions, we may be out of luck.
§ Distributional shift. While we trained under one distribution, our policy will be evaluated on a different distribution during deployment.
  ✗ The learned policy can deviate from the known state-action space into unknown state-action space, where we don't know the behavior.
Offline RL: Mitigating the distribution shift problem
§ Imposing policy constraints. Ensure that the learned policy is close to the behavior policy. Ex: Batch-Constrained Q-learning (BCQ).
§ Uncertainty-based methods. Estimate the uncertainty of Q-values (in value-based methods) or of the learned transition model (in model-based methods), and then utilize this uncertainty to detect distributional shift. Ex: MOReL.
References
§ Off-policy policy evaluation:
1. Precup, Sutton, and Singh. Eligibility traces for off-policy policy evaluation (2000). https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
2. Jiang and Li. Doubly robust off-policy value evaluation for reinforcement learning (2015). https://arxiv.org/pdf/1511.03722
3. Thomas and Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning (2016). http://www.jmlr.org/proceedings/papers/v48/thomasa16.pdf
§ Offline RL:
1. Levine et al. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. https://arxiv.org/abs/2005.01643v3