(co)algebraic techniques for - markov decision processes · long-term values in markov decision...

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 1 / 29

(Co)Algebraic Techniques for

Markov Decision ProcessesSYSMICS 2019

Frank Feys Helle Hvid Hansen Larry Moss

Delft University of Technology

Introduction

Joint work with Frank Feys (Delft) and Larry Moss (Indiana):

Long-Term Values in Markov Decision Processes, (Co)AlgebraicallyProc. of Coalgebraic Methods in Computer Science (CMCS 2018)

Motivation:

• Classic theory of MDPs uses “low-level”, analytic methods.

• Some classic proofs have coinductive flavour. Make this precise.

• Develop compositional, (co)algebraic methods for MDPs.

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Outline

1 Coalgebra basics

2 MDP Preliminaries

5 Conclusion

Systems as Coalgebras

Universal Coalgebra: a universal theory of systems [Rutten, m.m.]

• Framework for uniform study of systems and their properties

• Uniform defs and proofs, coinduction principle, ...

T -Coalgebras

Determ. automaton with output in B: X → B × XA

Labelled transition system: X → P(X )A

Kripke model: X → P(X )A × P(P0)Mealy machine : X → (B × X )A

Probabilistic automaton: X → P(D(X ))A

Linear weighted automata: X → R× (RX )A

. . . . . .

T -coalgebra : X → T (X )

Systems as Coalgebras

Universal Coalgebra: a universal theory of systems [Rutten, m.m.]

• Framework for uniform study of systems and their properties

• Uniform defs and proofs, coinduction principle, ...

T -Coalgebras

Determ. automaton with output in B: X → B × XA

Labelled transition system: X → P(X )A

Kripke model: X → P(X )A × P(P0)Mealy machine : X → (B × X )A

Probabilistic automaton: X → P(D(X ))A

Linear weighted automata: X → R× (RX )A

. . . . . .

T -coalgebra : X → T (X )

T -Coalgebras and T -AlgebrasGiven functor T : C→ C on category C,

T -coalgebra T -coalgebra morphism

��

T (X )

��

f // Y

d��

T (X )T (f )

// T (Y )

T -algebra T -algebra morphism

T (X )

a��

T (X )

a��

T (f )f// T (Y )

b��

Xf // Y

Final Coalgebra

Def. A T -coalgebra (Z , ζ) is final if X

∀γ��

∃!beh // Z

��

T (X )T (beh)

// T (Z )

For a state x ∈ X , beh(x) is the “behaviour of x”.

Examples:T (X ) Carrier of final T -coalgebra

B × X Bω, streams over B2× XA 2A

∗, languages over alphabet A

(B × X )A causal stream functions f : Aω → Bω

Pω strongly extensional trees.

Outline

1 Coalgebra basics

2 MDP Preliminaries

5 Conclusion

MDPs: Planning under Uncertainty

Markov decision processes (MDPs) are state-based models ofsequential decision-making under uncertainty.

• The system/agent chooses actions and collects rewards, but doesnot have full control over transitions.

• The decision maker wants to find a policy/plan that maximizesfuture expected long-term rewards.

• Applications: maintenance schedules, production planning,finance, reinforcement learning, ...

• MDPs are one-player stochastic games.

MDPs: Planning under Uncertainty

Markov decision processes (MDPs) are state-based models ofsequential decision-making under uncertainty.

• The system/agent chooses actions and collects rewards, but doesnot have full control over transitions.

• The decision maker wants to find a policy/plan that maximizesfuture expected long-term rewards.

• Applications: maintenance schedules, production planning,finance, reinforcement learning, ...

• MDPs are one-player stochastic games.

MDP Example

A start-up company needs to decide to Advertise or Save money:

PoorPoor

Unknown

Rich Rich

Famous

+10 +10

1/21/2

Markov Decision Processes

Def. A (discrete, time-independent) Markov Decision Process (MDP)is a Set-coalgebra

m = 〈u, t〉 : S → R× (∆S)A

• S finite set of states,

• A is a finite set of actions,

• ∆ is the finite-support distributions functor,

• t : S → (∆S)A is a probabilistic transition function,

• u : S → R is an (immediate) reward function.

(Alternatively, m : S → (R×∆S)A, i.e. rewards are given ontransitions)

Def. A (deterministic, stationary) policy σ is a map σ : S → A.

Markov Decision Processes

Def. A (discrete, time-independent) Markov Decision Process (MDP)is a Set-coalgebra

m = 〈u, t〉 : S → R× (∆S)A

• S finite set of states,

• A is a finite set of actions,

• ∆ is the finite-support distributions functor,

• t : S → (∆S)A is a probabilistic transition function,

• u : S → R is an (immediate) reward function.

(Alternatively, m : S → (R×∆S)A, i.e. rewards are given ontransitions)

Def. A (deterministic, stationary) policy σ is a map σ : S → A.

Markov Reward ProcessGiven m = 〈u, t〉 : S → R× (∆S)A and policy σ : S → A, we getMarkov reward process

mσ = 〈u, tσ〉 : S → R×∆S where tσ(s) := t(s)(σ(s))

; m]σ : ∆S → R×∆S by determinisation (cf. Bartels, Jacobs,

Silva, Sokolova) — using (∆, δ, µ) is monad and ev : ∆(R)→ R isalgebra for monad (∆, δ, µ).

δ��

mσ=〈u,tσ〉// R×∆S

idR×!

��

m]σ66

!��

Rω // R× Rω

Concretely, m]σ = 〈u], t]σ〉 where

u](ϕ) =∑

s∈S u(s) · ϕ(s)

t]σ(ϕ)(s) =∑

s′∈S tσ(s)(s ′) · ϕ(s ′)

m]σ is (R×−)-coalgebra.

Markov Reward ProcessGiven m = 〈u, t〉 : S → R× (∆S)A and policy σ : S → A, we getMarkov reward process

mσ = 〈u, tσ〉 : S → R×∆S where tσ(s) := t(s)(σ(s))

; m]σ : ∆S → R×∆S by determinisation (cf. Bartels, Jacobs,

Silva, Sokolova) — using (∆, δ, µ) is monad and ev : ∆(R)→ R isalgebra for monad (∆, δ, µ).

δ��

idR×!

��

m]σ66

!��

Rω // R× Rω

Concretely, m]σ = 〈u], t]σ〉 where

u](ϕ) =∑

s∈S u(s) · ϕ(s)

t]σ(ϕ)(s) =∑

s′∈S tσ(s)(s ′) · ϕ(s ′)

m]σ is (R×−)-coalgebra.

Expected Rewards via Trace Semantics

δ��

��

idR×!

��

m]σ66

!��

Rω // R× Rω

Trace semantics via finality:

trc(s) = (rσ0 (s), rσ1 (s), rσ2 (s), . . .)

where rσn (s) is the output of m]σ

after n steps starting from δ(s),i.e., expected reward at time stepn starting from s.

Long-term expected value for σ in s depends on how you evaluate

(rσ0 (s), rσ1 (s), rσ2 (s), . . .)

Different evaluation criteria exist...

Expected Rewards via Trace Semantics

δ��

��

idR×!

��

m]σ66

!��

Rω // R× Rω

Trace semantics via finality:

trc(s) = (rσ0 (s), rσ1 (s), rσ2 (s), . . .)

where rσn (s) is the output of m]σ

after n steps starting from δ(s),i.e., expected reward at time stepn starting from s.

Long-term expected value for σ in s depends on how you evaluate

(rσ0 (s), rσ1 (s), rσ2 (s), . . .)

Different evaluation criteria exist...

Long-Term Value via Discounted Sums

Let 0 ≤ γ < 1 be a discount factor.

Def. The long-term value of policy σ according to the discountedsum criterion is V σ : S → R:

V σ(s) =∞∑n=0

γn · rσn (s)

Converges because reward map u : S → R is bounded.

We define:

• σ ≥ τ if for all s, V σ(s) ≥ V τ (s).

• σ is an optimal policy if σ ≥ τ for all τ .

Long-Term Value via Discounted Sums

Let 0 ≤ γ < 1 be a discount factor.

Def. The long-term value of policy σ according to the discountedsum criterion is V σ : S → R:

V σ(s) =∞∑n=0

γn · rσn (s)

Converges because reward map u : S → R is bounded.

We define:

• σ ≥ τ if for all s, V σ(s) ≥ V τ (s).

• σ is an optimal policy if σ ≥ τ for all τ .

Optimal Value

Def. The optimal value function V ∗ : S → R of m is defined as

V ∗(s) = maxσ

V σ(s)

Classical facts (wrt discounted sum criterion), cf. (Puterman, 2014):

• If σ is optimal, then V σ = V ∗.

• Optimal policy always exists.

• Optimal policies need not be unique.

• Stationary (memoryless), deterministic policies suffice.

Optimal Value

Def. The optimal value function V ∗ : S → R of m is defined as

V ∗(s) = maxσ

V σ(s)

Classical facts (wrt discounted sum criterion), cf. (Puterman, 2014):

• If σ is optimal, then V σ = V ∗.

• Optimal policy always exists.

• Optimal policies need not be unique.

• Stationary (memoryless), deterministic policies suffice.

Outline

1 Coalgebra basics

2 MDP Preliminaries

5 Conclusion

V σ as Coalgebra-to-Algebra Morphism• V σ : S → R satisfies for all s ∈ S :

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

i.e. V σ = u + γtσVσ (as linear system)

• V σ arises as a fixpoint of the linear operator Ψσ : RS → RS givenby

Ψσ(v) = u + γtσv

• Observation: we can re-express (1) as V σ being acoalgebra-to-algebra morphism:

Smσ=〈u,tσ〉

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

whereαγ : R× R→ R isαγ(x , y) = x + γ · y .

Let H = R× Id.

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

Smσ=〈u,tσ〉

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

Let H = R× Id.

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

Smσ=〈u,tσ〉

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

Let H = R× Id.

V σ via Universal Property?

• Recall that a corecursive algebra (for functor F ) is an F -algebraα s.t.

X∀f //

∃! f †

��

F (f †)

��

A FAαoo

• Question: Is αγ ◦ H(ev) a corecursive algebra for H∆?

• By (Capretta et al., 2004):

αγ ◦ H(ev) a corecursive algebra for H∆iffαγ corecursive algebra for H

V σ via Universal Property?

• Recall that a corecursive algebra (for functor F ) is an F -algebraα s.t.

X∀f //

∃! f †

��

F (f †)

��

A FAαoo

• Question: Is αγ ◦ H(ev) a corecursive algebra for H∆?

• By (Capretta et al., 2004):

αγ ◦ H(ev) a corecursive algebra for H∆iffαγ corecursive algebra for H

Is αγ Corecursive Algebra?

• αγ : R× R→ R is corecursive for H = R× Id if:for all f : X → R× X there is unique f † : X → Rsuch that f † = αγ ◦ (R× f †) ◦ f .

• Consider H-coalgebra f : X → R× X given by

f = x0a0−→ x1

a1−→ x2a2−→ . . .

• f † is solution iff f †(xn) = an + γ · f †(xn+1), n = 0, 1, 2, . . ..

• This system has infinitely many solutions when γ > 0, even if(an)n is bounded. So, αγ is not corecursive for γ > 0.

• However, if (an)n is bounded then this system has a uniquebounded solution.

f = x0a0−→ x1

a1−→ x2a2−→ . . .

f = x0a0−→ x1

a1−→ x2a2−→ . . .

Bounded Corecursive Algebra (bca)To get uniqueness ⇒ incorporate boundedness information.

• Def. A b-category (C,B) is a category C with a subclassB ⊆ Mor(C) of “bounded” morphisms s.t. for all f :f ∈ B ⇒ f ◦ g ∈ B. (Also known as a sieve.)Main example: (Met,B) where Met is metric spaces with allmaps, and B are all bounded maps.

• Def. Let (C,B) be a b-category and F : C→ C endofunctor. AnF -algebra α : FA→ A is a b-corecursive algebra (bca) if

X∀f ∈B

∃!f †∈B��

Ff †

��

A FAαoo

X∀f ∈B

∃!f †∈B��

Ff †

��

A FAαoo

X∀f ∈B

∃!f †∈B��

Ff †

��

A FAαoo

V σ from Universal Property of bca

We show αγ ◦ H(ev) is bca for H∆:

1 Develop some theory of b-categories (b-functors, b-naturaltransformation, B-preservation properties, . . . )

2 Prove b-version of (Capretta et al.)-result.Theorem: If ... (certain conditions), then we obtain a bca forH∆ from from bca for H.

3 Show that αγ is bca for H (using Banach FPT).

4 Show that conditions for Theorem apply.

(Details are in the paper)

Optimal Value V ∗ from bca

Sm=〈u,t〉

��

R× (∆S)A

R×(∆V ∗)A

��

R R× (∆R)Aαγ◦(R×maxA ◦ evA)oo

We can show directly that we have bca.

But bca is not obtained via Capretta-style Theorem.Problem: maxA is not affine/linear.

Outline

1 Coalgebra basics

2 MDP Preliminaries

5 Conclusion

Policy IterationNotation: For σ : S → A , let `σ : ∆S → R be

`σ(ϕ) =∑s∈S

ϕ(s) · V σ(s)

(ϕ-expectation of long-term value for σ).

Policy Iteration Algorithm:

1 Initialise σ0 to any policy.

2 Compute V σk (e.g. by solving system of linear equations).

3 Define σk+1 by

σk+1(s) := argmaxa∈A {`σk (ta(s))}

4 If σk+1 = σk then stop, else go to step 2.

Termination: X(since AS is finite). Correctness: follows if σk+1 ≥ σk .

Policy IterationNotation: For σ : S → A , let `σ : ∆S → R be

`σ(ϕ) =∑s∈S

ϕ(s) · V σ(s)

(ϕ-expectation of long-term value for σ).

Policy Iteration Algorithm:

1 Initialise σ0 to any policy.

2 Compute V σk (e.g. by solving system of linear equations).

3 Define σk+1 by

σk+1(s) := argmaxa∈A {`σk (ta(s))}

4 If σk+1 = σk then stop, else go to step 2.

Termination: X(since AS is finite). Correctness: follows if σk+1 ≥ σk .

Policy Improvement

By definition,

for all s ∈ S , σk+1(s) = argmaxa∈A {`σk (ta(s))}.

which implies

for all s ∈ S , `σk (tσk+1(s)) ≥ `σk (tσk (s)),

i.e.`σk ◦ tσk+1

≥ `σk ◦ tσk (in pointwise order on RS).

Policy Improvement Lemma:

For all σ, τ : `σ ◦ tτ ≥ `σ ◦ tσ ⇒ V τ ≥ V σ

Policy Improvement

By definition,

for all s ∈ S , σk+1(s) = argmaxa∈A {`σk (ta(s))}.

which implies

for all s ∈ S , `σk (tσk+1(s)) ≥ `σk (tσk (s)),

i.e.`σk ◦ tσk+1

≥ `σk ◦ tσk (in pointwise order on RS).

Policy Improvement Lemma:

Contraction (Co)Induction

Def. Ordered metric space

An ordered (complete) metric space (M, d ,≤) is a (complete) metricspace (M, d) together with a partial order (M,≤) such that for ally ∈ M, {z | z ≤ y} and {z | y ≤ z} are closed in the metric topology.Example: B(X ,R) with the pointwise order and supremum metric.

Theorem: Contraction (Co)Induction

Let M be a non-empty, ordered complete metric space. If f : M → Mis both contractive and order-preserving, then the fixpoint x∗ of f is:(i) a least pre-fixpoint (if f (x) ≤ x , then x∗ ≤ x), and(ii) a greatest post-fixpoint (if x ≤ f (x), then x ≤ x∗).

Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).

Contraction (Co)Induction

Def. Ordered metric space

An ordered (complete) metric space (M, d ,≤) is a (complete) metricspace (M, d) together with a partial order (M,≤) such that for ally ∈ M, {z | z ≤ y} and {z | y ≤ z} are closed in the metric topology.Example: B(X ,R) with the pointwise order and supremum metric.

Theorem: Contraction (Co)Induction

Let M be a non-empty, ordered complete metric space. If f : M → Mis both contractive and order-preserving, then the fixpoint x∗ of f is:(i) a least pre-fixpoint (if f (x) ≤ x , then x∗ ≤ x), and(ii) a greatest post-fixpoint (if x ≤ f (x), then x ≤ x∗).

Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).

Proof of Policy ImprovementPolicy Improvement Lemma:

Proof:

Apply Contraction (Co)induction to Ψσ : RS → RS

(contractive and order-preserving X)

Ψσ(v) = u + γtσv , and V σ is its fixpoint.

We have:

`σ ◦ tτ ≥ `σ ◦ tσ⇒ u + γ · `σ ◦ tτ ≥ u + γ · `σ ◦ tσ⇔ Ψτ (V σ) ≥ Ψσ(V σ) = V σ

(by contr. coind.) ⇒ V τ ≥ V σ

Proof of Policy ImprovementPolicy Improvement Lemma:

Proof: Apply Contraction (Co)induction to Ψσ : RS → RS

(contractive and order-preserving X)

Ψσ(v) = u + γtσv , and V σ is its fixpoint.

We have:

`σ ◦ tτ ≥ `σ ◦ tσ⇒ u + γ · `σ ◦ tτ ≥ u + γ · `σ ◦ tσ⇔ Ψτ (V σ) ≥ Ψσ(V σ) = V σ

(by contr. coind.) ⇒ V τ ≥ V σ

Outline

1 Coalgebra basics

2 MDP Preliminaries

5 Conclusion

ConclusionSummary

• Value functions V σ and V ∗ from b-corecursive algebras.• Coinductive proof of policy improvement theorem.• Still need to resort to Banach FPT to get fixpoints.

Future work

• Stochastic games : Existence of Nash Eq = Kakutani +Contraction (Co)Induction.• Other types of equilibria (Subgame perfect, Markov, ...)• Make connections to

• Coalgebraic infinite games (Abramsky & Winschel, 2017)• Open games (Hedges et al., 2018)• Semantics of equilibria (Pavlovic, 2009)

• Dualities for ordered metric spaces, fixpoints.• Coalgebraic learning: reinforcement learning

Thanks!

ConclusionSummary

Future work

Thanks!

ConclusionSummary

Future work

Thanks!

(co)algebraic techniques for - markov decision processes · long-term values in markov decision...

Documents