Online Learning with A Lot of Batch Data



Page 1: Online Learning with A Lot of Batch Data

Outline: Bandits · Off-Policy Evaluation · Example: Probabilistic Maintenance · Off-Policy Evaluation in Partially Observable Environments · Relation to Causal Inference · OPE Results


Online Learning with A Lot of Batch Data

Shie Mannor

[email protected]

With G. Tennenholtz, U. Shalit and Y. Efroni

Technion - Israel Institute of Technology & NVIDIA Research


Page 2: Online Learning with A Lot of Batch Data


Motivation

• Large amounts of offline data are readily available:
  • Healthcare
  • Autonomous driving / smart cities
  • Education
  • Robotics

• The problem: offline data is often partially observable.

• This may result in biased estimates that are confounded by spurious correlation.


Page 3: Online Learning with A Lot of Batch Data


Motivation

Use offline data for reinforcement learning (RL):
• Off-policy evaluation.

• Batch-mode reinforcement learning (offline RL).

• Let’s start with bandits


Page 4: Online Learning with A Lot of Batch Data


Part I: Linear Bandits + Confounded Data

• Mixed setting: online + offline
• Linear contextual bandit (online):
  • T trials, |A| discrete actions, x_t ∈ X i.i.d. contexts
  • Context dimension: d
  • Reward given by r_t = ⟨x_t, w*_{a_t}⟩ + η_t
  • w*_a ∈ R^d, a ∈ A, are unknown parameter vectors
  • η_t is some conditionally σ-subgaussian random noise
• Minimize regret:

  Regret(T) = Σ_{t=1}^{T} ⟨x_t, w*_{π*(x_t)}⟩ − Σ_{t=1}^{T} ⟨x_t, w*_{a_t}⟩.
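To make the setup concrete, here is a minimal simulation sketch of the reward model and regret above, assuming a toy environment with Gaussian contexts and noise; all names and dimensions are illustrative, and the placeholder policy is where a bandit algorithm would go.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 8, 4, 1000                    # context dimension, |A|, horizon (illustrative)
W = rng.normal(size=(K, d))             # unknown parameter vectors w*_a, one row per action

def reward(x, a, sigma=0.1):
    """r_t = <x_t, w*_{a_t}> + eta_t, with Gaussian (hence sigma-subgaussian) noise."""
    return W[a] @ x + sigma * rng.normal()

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)              # i.i.d. context x_t
    a = int(rng.integers(K))            # placeholder policy; a bandit algorithm goes here
    r = reward(x, a)
    a_star = int(np.argmax(W @ x))      # optimal action pi*(x_t)
    regret += W[a_star] @ x - W[a] @ x  # expected regret increment
print(regret)
```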


Page 5: Online Learning with A Lot of Batch Data


Setup: Linear Bandits + Confouned Data

Additional access to partially observable offline data

• Data was generated by an unknown, fixed behavior policy πb.
• Only L features of the context are visible in the data.

• Let x^o, x^h denote the observed and hidden (unobserved) features of the context x, respectively.



Page 7: Online Learning with A Lot of Batch Data


Partially Observable Data = Linear Constraints

• Suppose we ignore that the data is partially observable.

• We find a least-squares solution to

  min_{b ∈ R^L} Σ_{i=1}^{N_a} (⟨x^o_i, b⟩ − r_i)²   ∀a ∈ A.

• Denote its solution by b^LS_a.

• Can b^LS_a provide useful information for the bandit problem?
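A small sketch of this per-action least-squares fit on the observed features only, assuming the offline log is given as numpy arrays of observed contexts, actions, and rewards (the function name and data layout are hypothetical):

```python
import numpy as np

def confounded_ls(X_obs, actions, rewards, num_actions):
    """Per-action least squares b^LS_a fitted on the observed features x^o only."""
    solutions = {}
    for a in range(num_actions):
        mask = actions == a
        X_a, r_a = X_obs[mask], rewards[mask]
        # b^LS_a = argmin_b  sum_i (<x^o_i, b> - r_i)^2
        b_a, *_ = np.linalg.lstsq(X_a, r_a, rcond=None)
        solutions[a] = b_a
    return solutions
```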


Page 8: Online Learning with A Lot of Batch Data


Partially Observable Data = Linear Constraints

Proposition

Let R_11(a) = E_{πb}[x^o (x^o)^T | a] and R_12(a) = E_{πb}[x^o (x^h)^T | a]. The following holds almost surely for all a ∈ A:

  lim_{N→∞} b^LS_a = (I_{L×L}, R_11^{-1}(a) R_12(a)) w*_a.

• b^LS_a provides us with L independent linear relations.

• We only need to learn a lower-dimensional subspace.
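A numerical sanity check of this limit for a single fixed action, assuming zero-mean jointly Gaussian contexts and noiseless rewards so that R_11 and R_12 are just blocks of the context covariance (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
L, H, N = 3, 2, 200_000                 # observed dim, hidden dim, #samples (illustrative)
d = L + H
A = rng.normal(size=(d, d))
Sigma = A @ A.T                         # context covariance with observed/hidden correlation
w_star = rng.normal(size=d)             # w*_a for a single fixed action

X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
X_obs, X_hid = X[:, :L], X[:, L:]
r = X @ w_star                          # noiseless rewards, for clarity

b_ls, *_ = np.linalg.lstsq(X_obs, r, rcond=None)

R11 = Sigma[:L, :L]                     # E[x^o (x^o)^T]
R12 = Sigma[:L, L:]                     # E[x^o (x^h)^T]
M = np.hstack([np.eye(L), np.linalg.inv(R11) @ R12])
print(np.linalg.norm(b_ls - M @ w_star))   # small, and shrinking as N grows
```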


Page 9: Online Learning with A Lot of Batch Data


Linear Bandits with Linear Constraints

• Given side information for the bandit problem:

  M_a w*_a = b_a,   a ∈ A.

• M_a ∈ R^{L×d}, b_a ∈ R^L are known.

• Let P_a denote the orthogonal projection onto the kernel of M_a.

• Effective dimension of the problem: d − L.

• We can thus achieve regret O((d − L)√(KT)).
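A short sketch of the projection onto ker(M_a), under the assumption that M_a is known; P_a = I − M_a† M_a, and M_a† b_a gives one particular solution of the constraint:

```python
import numpy as np

def kernel_projection(M):
    """Orthogonal projector P_a onto ker(M_a):  P = I - M^+ M."""
    d = M.shape[1]
    return np.eye(d) - np.linalg.pinv(M) @ M

# Any w with M w = b decomposes as w = M^+ b + P v for some v in R^d,
# so only the (d - L)-dimensional projected component remains to be learned online.
```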


Page 10: Online Learning with A Lot of Batch Data


Linear Bandits with Linear Constraints

Algorithm 1: OFUL with Linear Side Information

1: input: α > 0, M_a ∈ R^{L×d}, b_a ∈ R^L, δ > 0
2: init: V_a = λI_d, Y_a = 0, ∀a ∈ A
3: for t = 1, 2, ... do
4:   Receive context x_t
5:   w^{P_a}_{t,a} = (P_a V_a P_a)† (Y_a − (V_a − λI_d) M_a† b_a)
6:   y_{t,a} = ⟨x_t, M_a† b_a⟩ + ⟨x_t, w^{P_a}_{t,a}⟩
7:   UCB_{t,a} = √(β_t(δ)) ‖x_t‖_{(P_a V_a P_a)†}
8:   a_t ∈ argmax_{a ∈ A} { y_{t,a} + α · UCB_{t,a} }
9:   Play action a_t and receive reward r_t
10:  V_{a_t} ← V_{a_t} + x_t x_t^T,  Y_{a_t} ← Y_{a_t} + x_t r_t
11: end for
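A compact Python sketch of Algorithm 1, assuming M_a and b_a are given and using a fixed exploration bonus in place of β_t(δ); this is an illustrative re-implementation, not the authors' reference code.

```python
import numpy as np

class ConstrainedOFUL:
    """Illustrative OFUL-style bandit exploiting side information M_a w*_a = b_a."""

    def __init__(self, M, b, d, lam=1.0, alpha=1.0, beta=1.0):
        self.K, self.d = len(M), d
        self.lam, self.alpha, self.beta = lam, alpha, beta
        self.w0 = [np.linalg.pinv(M[a]) @ b[a] for a in range(self.K)]              # M_a† b_a
        self.P = [np.eye(d) - np.linalg.pinv(M[a]) @ M[a] for a in range(self.K)]   # proj. onto ker(M_a)
        self.V = [lam * np.eye(d) for _ in range(self.K)]
        self.Y = [np.zeros(d) for _ in range(self.K)]

    def act(self, x):
        scores = []
        for a in range(self.K):
            P, V = self.P[a], self.V[a]
            PVP_pinv = np.linalg.pinv(P @ V @ P)
            # line 5: projected estimate of the unknown component of w*_a
            w_proj = PVP_pinv @ (self.Y[a] - (V - self.lam * np.eye(self.d)) @ self.w0[a])
            mean = x @ self.w0[a] + x @ w_proj               # line 6
            ucb = np.sqrt(self.beta * (x @ PVP_pinv @ x))    # line 7, with a fixed beta
            scores.append(mean + self.alpha * ucb)
        return int(np.argmax(scores))                        # line 8

    def update(self, a, x, r):                               # line 10
        self.V[a] += np.outer(x, x)
        self.Y[a] += x * r
```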

Page 11: Online Learning with A Lot of Batch Data


Deconfounding Partially Observable Data

• In our case, for partially observable offline data, we get

  M_a = (I_L, R_11^{-1}(a) R_12(a)),

  and b_a is the solution to min_{b ∈ R^L} Σ_{i=1}^{N_a} (⟨x^o_i, b⟩ − r_i)².

• Problem: R_12(a) is unknown (not identifiable).


Page 12: Online Learning with A Lot of Batch Data


Deconfounding Partially Observable Data

• Solution: Assume that at each round t > 0, we can query πb.

• This lets us estimate R_12 as

  R̂_12(a, t) = (1/t) Σ_{i=1}^{t} [ 1{a_i = a} / P_{πb}(a) ] x^o_i (x^h_i)^T.

• Our final estimate of M_a is then

  M̂_{t,a} = (I_L, R_11^{-1}(a) R̂_12(a, t)).
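A sketch of this estimator, assuming that during online rounds the full context (x^o, x^h) is visible, the behavior policy can be queried for its action, and its marginal action probabilities P_{πb}(a) are available (all names and the data layout are hypothetical):

```python
import numpy as np

def estimate_R12(X_obs, X_hid, pib_actions, pib_marginal, a):
    """R̂_12(a, t) = (1/t) Σ_i 1{a_i = a} / P_πb(a) · x^o_i (x^h_i)^T."""
    t = len(pib_actions)
    R12 = np.zeros((X_obs.shape[1], X_hid.shape[1]))
    for xo, xh, ai in zip(X_obs, X_hid, pib_actions):
        if ai == a:
            R12 += np.outer(xo, xh) / pib_marginal[a]
    return R12 / t

def estimate_M(R11, R12_hat):
    """M̂_{t,a} = (I_L, R_11^{-1}(a) R̂_12(a, t))."""
    L = R11.shape[0]
    return np.hstack([np.eye(L), np.linalg.inv(R11) @ R12_hat])
```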


Page 13: Online Learning with A Lot of Batch Data


Deconfounding Partially Observable Data

Algorithm 2: OFUL with Partially Observable Offline Data

1: input: α > 0, δ > 0, T, b_a ∈ R^L (from the dataset)
2: for n = 0, ..., log T − 1 do
3:   Use the 2^n previous samples from πb to update the estimate M̂_{2^n,a}, ∀a ∈ A
4:   Calculate M̂†_{2^n,a}, P̂_{2^n,a}, ∀a ∈ A
5:   Run Algorithm 1 for 2^n time steps with bonus √(β_{n,t}(δ)) and M̂_{2^n,a}, b_a
6: end for
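A sketch of the doubling-epoch wrapper, reusing the hypothetical ConstrainedOFUL, estimate_R12, and estimate_M helpers from the sketches above; the environment interface and the πb query are stubbed-out assumptions, and the bonus schedule β_{n,t}(δ) is folded into the bandit's fixed beta for brevity.

```python
import numpy as np

def run_algorithm2(T, env, query_pib, b, R11, pib_marginal, d, num_actions):
    """Doubling epochs: re-estimate M_a from 2^n samples of pi_b, then run Algorithm 1."""
    for n in range(int(np.log2(T))):
        # Query pi_b on 2^n fresh fully observed samples to update R̂_12 and M̂_a.
        samples = [query_pib() for _ in range(2 ** n)]   # each sample: (x_obs, x_hid, a_b)
        X_obs = np.array([s[0] for s in samples])
        X_hid = np.array([s[1] for s in samples])
        a_b = np.array([s[2] for s in samples])
        M_hat = [estimate_M(R11[a], estimate_R12(X_obs, X_hid, a_b, pib_marginal, a))
                 for a in range(num_actions)]
        bandit = ConstrainedOFUL(M_hat, b, d)
        for _ in range(2 ** n):                          # run Algorithm 1 for this epoch
            x = env.context()
            a = bandit.act(x)
            bandit.update(a, x, env.reward(x, a))
```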


Page 14: Online Learning with A Lot of Batch Data


Deconfounding Partially Observable Data

Theorem (Main Result)

For any T > 0, with probability at least 1 − δ, the regret of Algorithm 2 is bounded by

  Regret(T) ≤ O( (1 + f_{B1}) (d − L) √(KT) ).

• f_{B1} is a factor indicating how hard it is to estimate the linear constraints.

• Worst-case dependence: f_{B1} ≤ O( max_a (L(d − L))^{1/4} / P_{πb}(a) ).


Page 15: Online Learning with A Lot of Batch Data


Part II: Off-Policy Evaluation in Reinforcement Learning

Given: data generated by a behavior policy πb



Page 20: Online Learning with A Lot of Batch Data


Off-Policy Evaluation in Reinforcement Learning

Evaluate the value of a different policy πe


Page 21: Online Learning with A Lot of Batch Data


Off-Policy Evaluation (OPE)

• Given: Data generated by a behavior policy πb in a Markov Decision Process (MDP).

• Objective: Evaluate the value of an evaluation policy πe.

• Methods (a small importance-sampling sketch follows below):
  • Direct methods (model-based and model-free)
  • Inverse propensity scoring (e.g., importance sampling)
  • Doubly robust methods

• How do we define OPE under partial observability?
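A minimal sketch of the inverse-propensity-scoring estimator for episodic OPE, assuming each logged transition also records the behavior policy's action probability; this is the standard importance-sampling estimator, not the partially observable method developed next.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, gamma=1.0):
    """Per-trajectory IS: weight each return by prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)."""
    estimates = []
    for traj in trajectories:                 # traj: list of (state, action, reward, pib_prob)
        weight, ret = 1.0, 0.0
        for t, (s, a, r, pib_prob) in enumerate(traj):
            weight *= pi_e(a, s) / pib_prob   # cumulative importance weight
            ret += (gamma ** t) * r           # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```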


Page 22: Online Learning with A Lot of Batch Data


Example: Probabilistic Maintenance

• We have an expensive machine that needs monthly maintenance.

• Every month an expert comes and checks the machine.

• The expert knows whether the machine is working properly.

• The expert can choose to fix the machine or leave it as is.


Page 23: Online Learning with A Lot of Batch Data


Example: Probabilistic Maintenance

• State Space: Working / Broken

• Action Space: Fix / Not Fix


Page 24: Online Learning with A Lot of Batch Data


Example: Probabilistic Maintenance (Numeric)



Page 32: Online Learning with A Lot of Batch Data


OPE in Partially Observable Environments


Page 33: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 34: Online Learning with A Lot of Batch Data


Partially Observable Markov Decision Process (POMDP)


Page 35: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 36: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 37: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 38: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 39: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 40: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 41: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 23 / 36

Page 42: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 43: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 44: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 45: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 46: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 47: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 48: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 49: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 50: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 51: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 52: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 53: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 24 / 36

Page 54: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 25 / 36

Page 55: Online Learning with A Lot of Batch Data

BanditsOff-Policy EvaluationExample: Probabilistic MaintenanceOff-Policy Evaluation In Partially Observable EnvironmentsRelation to Causal InferenceOPE Results

(RL)2

Partially Observable Markov Decision Process (POMDP)

S. Mannor November 2020 OPE in POE 26 / 36

Page 56: Online Learning with A Lot of Batch Data


Off-Policy Evaluation in POMDPs

The goal of off-policy evaluation in POMDPs is to evaluate v(πe) using the measure P_{πb}(·) over observable trajectories T^o_L and the given policy πe.


Page 57: Online Learning with A Lot of Batch Data


Reinforcement Learning and Causal Inference: Better Together


Page 58: Online Learning with A Lot of Batch Data


A Meeting Point of RL and CI

• When performing off-policy evaluation (or learning) on data where we do not have access to the same information as the agent that generated it.

• Example: physicians treating patients in an intensive care unit (ICU).

• Mistakes were made: applying RL to observational ICU data without considering hidden confounders or overlap (common support, positivity).

• In RL, hidden confounding can be described using partial observability.


Page 59: Online Learning with A Lot of Batch Data


CI-RL Dictionary

Causal Term                      RL Term                        Example
confounder (possibly hidden)     state (possibly unobserved)    information available to the doctor
action, treatment                action                         medications, procedures
outcome                          reward                         mortality
treatment assignment process     behavior policy                the way doctors treat patients
proxy variable                   observations                   electronic health record


Page 60: Online Learning with A Lot of Batch Data


The Three Layer Causal Hierarchy

Level (Symbol)                      Typical Activity           Typical Questions
1. Association  P(y | x)            Seeing                     What is? How would seeing x change my belief in y?
2. Intervention  P(y | do(x), z)    Doing, Intervening         What if? What if I do x?
3. Counterfactual  P(y_x | x', y')  Imagining, Retrospection   Why? Was it x that caused y? What if I had acted differently?

Pearl, Judea. "Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution." Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018.


Page 61: Online Learning with A Lot of Batch Data


Off-Policy Evaluation in POMDPs

Theorem 1 (POMDP Evaluation)

Assume |O| ≥ |S|. Then, under invertibility assumptions on the dynamics, we have an estimator

  v(πe) = f(observable data).


Page 62: Online Learning with A Lot of Batch Data


Off-Policy Evaluation in POMDPs

Theorem 1 (POMDP Evaluation)

Assume |O| ≥ |S| and that the matrices P_b(O_i | a_i, O_{i−1}) are invertible for all i and all a_i ∈ A. For any τ^o ∈ T^o_t define the generalized weight matrices

  W_i(τ^o) = P_b(O_i | a_i, O_{i−1})^{−1} P_b(O_i, o_{i−1} | a_{i−1}, O_{i−2})

for i ≥ 1, and W_0(τ^o) = P_b(O_0 | a_0, O_{−1})^{−1} P_b(O_0).

Denote Π_e(τ^o) = ∏_{i=0}^{t} π_e^{(i)}(a_i | h^o_i) and Ω(τ^o) = ∏_{i=0}^{t} W_{t−i}(τ^o). Then

  P_e(r_t) = Σ_{τ^o ∈ T^o_t} Π_e(τ^o) P_b(r_t, o_t | a_t, O_{t−1}) Ω(τ^o).
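A sketch of the weight-matrix computation in Theorem 1 for a single observable trajectory, assuming the behavior policy's conditional probability tables have already been estimated from the batch data; the full estimator then sums Π_e(τ^o) P_b(r_t, o_t | a_t, O_{t−1}) Ω(τ^o) over observable trajectories.

```python
import numpy as np

def ope_weight_product(P_cond, P_joint, P_init):
    """Build W_0..W_t for one observable trajectory and return Ω(τ^o) = W_t · ... · W_0.

    P_cond[i]  : matrix P_b(O_i | a_i, O_{i-1}) along the trajectory (assumed invertible)
    P_joint[i] : matrix P_b(O_i, o_{i-1} | a_{i-1}, O_{i-2}) for i >= 1 (index 0 unused)
    P_init     : vector P_b(O_0)
    All tables are assumed to be estimated beforehand from the behavior policy's batch data.
    """
    W = [np.linalg.inv(P_cond[0]) @ P_init]                  # W_0
    for i in range(1, len(P_cond)):
        W.append(np.linalg.inv(P_cond[i]) @ P_joint[i])      # W_i, i >= 1
    omega = W[-1]
    for Wi in reversed(W[:-1]):                              # right-multiply down to W_0
        omega = omega @ Wi
    return omega
```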


Page 63: Online Learning with A Lot of Batch Data


POMDP Limitation

• Causal structure of POMDPs is restricting.

• Must invert matrices of dimension |S| even when O ⊂ S.

• Solution:
  • Detach observed and unobserved variables.
  • Decoupled POMDPs (more in our AAAI '20 paper).


Page 64: Online Learning with A Lot of Batch Data


A Special POMDP (DE-POMDP)

POMDP:            Z observations, U state
Decoupled POMDP:  Z & U is the state, O is the observations


Page 65: Online Learning with A Lot of Batch Data


Conclusions

• Unknown states (confounders) produce bias through factors that affect both observed actions and rewards.

• This is a major problem in offline off-policy data.

• Be aware of such biases when using off-policy data that you did not generate yourself.

• Our work is a first step toward introducing OPE for partially observable environments in RL.

• Causality and RL: Better together
