(co)algebraic techniques for - markov decision processes · long-term values in markov decision...

49
Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 1 / 29 (Co)Algebraic Techniques for Markov Decision Processes SYSMICS 2019 Frank Feys Helle Hvid Hansen Larry Moss Delft University of Technology

Upload: others

Post on 26-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 1 / 29

(Co)Algebraic Techniques for

Markov Decision ProcessesSYSMICS 2019

Frank Feys Helle Hvid Hansen Larry Moss

Delft University of Technology

Page 2: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Introduction

Joint work with Frank Feys (Delft) and Larry Moss (Indiana):

Long-Term Values in Markov Decision Processes, (Co)AlgebraicallyProc. of Coalgebraic Methods in Computer Science (CMCS 2018)

Motivation:

• Classic theory of MDPs uses “low-level”, analytic methods.

• Some classic proofs have coinductive flavour. Make this precise.

• Develop compositional, (co)algebraic methods for MDPs.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 2 / 29

Page 3: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 3 / 29

Page 4: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 4 / 29

Page 5: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Systems as Coalgebras

Universal Coalgebra: a universal theory of systems [Rutten, m.m.]

• Framework for uniform study of systems and their properties

• Uniform defs and proofs, coinduction principle, ...

T -Coalgebras

Determ. automaton with output in B: X → B × XA

Labelled transition system: X → P(X )A

Kripke model: X → P(X )A × P(P0)Mealy machine : X → (B × X )A

Probabilistic automaton: X → P(D(X ))A

Linear weighted automata: X → R× (RX )A

. . . . . .

T -coalgebra : X → T (X )

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 5 / 29

Page 6: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Systems as Coalgebras

Universal Coalgebra: a universal theory of systems [Rutten, m.m.]

• Framework for uniform study of systems and their properties

• Uniform defs and proofs, coinduction principle, ...

T -Coalgebras

Determ. automaton with output in B: X → B × XA

Labelled transition system: X → P(X )A

Kripke model: X → P(X )A × P(P0)Mealy machine : X → (B × X )A

Probabilistic automaton: X → P(D(X ))A

Linear weighted automata: X → R× (RX )A

. . . . . .

T -coalgebra : X → T (X )

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 5 / 29

Page 7: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

T -Coalgebras and T -AlgebrasGiven functor T : C→ C on category C,

T -coalgebra T -coalgebra morphism

X

c

��

T (X )

X

c

��

f // Y

d��

T (X )T (f )

// T (Y )

T -algebra T -algebra morphism

T (X )

a��

X

T (X )

a��

T (f )f// T (Y )

b��

Xf // Y

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 6 / 29

Page 8: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Final Coalgebra

Def. A T -coalgebra (Z , ζ) is final if X

∀γ��

∃!beh // Z

ζ

��

T (X )T (beh)

// T (Z )

For a state x ∈ X , beh(x) is the “behaviour of x”.

Examples:T (X ) Carrier of final T -coalgebra

B × X Bω, streams over B2× XA 2A

∗, languages over alphabet A

(B × X )A causal stream functions f : Aω → Bω

Pω strongly extensional trees.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 7 / 29

Page 9: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 8 / 29

Page 10: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

MDPs: Planning under Uncertainty

Markov decision processes (MDPs) are state-based models ofsequential decision-making under uncertainty.

• The system/agent chooses actions and collects rewards, but doesnot have full control over transitions.

• The decision maker wants to find a policy/plan that maximizesfuture expected long-term rewards.

• Applications: maintenance schedules, production planning,finance, reinforcement learning, ...

• MDPs are one-player stochastic games.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 9 / 29

Page 11: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

MDPs: Planning under Uncertainty

Markov decision processes (MDPs) are state-based models ofsequential decision-making under uncertainty.

• The system/agent chooses actions and collects rewards, but doesnot have full control over transitions.

• The decision maker wants to find a policy/plan that maximizesfuture expected long-term rewards.

• Applications: maintenance schedules, production planning,finance, reinforcement learning, ...

• MDPs are one-player stochastic games.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 9 / 29

Page 12: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

MDP Example

A start-up company needs to decide to Advertise or Save money:

+0+0

A

PoorPoor

Unknown

Unknown

Rich Rich

Famous

Famous

ASS

S

SA

A

+10 +10

& &

& &

11

1

1/2

1/2

1/2

1/2

1/2

1/2

1/21/2

1/2

1/2

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 10 / 29

Page 13: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Markov Decision Processes

Def. A (discrete, time-independent) Markov Decision Process (MDP)is a Set-coalgebra

m = 〈u, t〉 : S → R× (∆S)A

where

• S finite set of states,

• A is a finite set of actions,

• ∆ is the finite-support distributions functor,

• t : S → (∆S)A is a probabilistic transition function,

• u : S → R is an (immediate) reward function.

(Alternatively, m : S → (R×∆S)A, i.e. rewards are given ontransitions)

Def. A (deterministic, stationary) policy σ is a map σ : S → A.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 11 / 29

Page 14: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Markov Decision Processes

Def. A (discrete, time-independent) Markov Decision Process (MDP)is a Set-coalgebra

m = 〈u, t〉 : S → R× (∆S)A

where

• S finite set of states,

• A is a finite set of actions,

• ∆ is the finite-support distributions functor,

• t : S → (∆S)A is a probabilistic transition function,

• u : S → R is an (immediate) reward function.

(Alternatively, m : S → (R×∆S)A, i.e. rewards are given ontransitions)

Def. A (deterministic, stationary) policy σ is a map σ : S → A.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 11 / 29

Page 15: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Markov Reward ProcessGiven m = 〈u, t〉 : S → R× (∆S)A and policy σ : S → A, we getMarkov reward process

mσ = 〈u, tσ〉 : S → R×∆S where tσ(s) := t(s)(σ(s))

; m]σ : ∆S → R×∆S by determinisation (cf. Bartels, Jacobs,

Silva, Sokolova) — using (∆, δ, µ) is monad and ev : ∆(R)→ R isalgebra for monad (∆, δ, µ).

S

�

mσ=〈u,tσ〉// R×∆S

idR×!

��

∆S

m]σ66

!��

Rω // R× Rω

Concretely, m]σ = 〈u], t]σ〉 where

u](ϕ) =∑

s∈S u(s) · ϕ(s)

t]σ(ϕ)(s) =∑

s′∈S tσ(s)(s ′) · ϕ(s ′)

m]σ is (R×−)-coalgebra.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 12 / 29

Page 16: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Markov Reward ProcessGiven m = 〈u, t〉 : S → R× (∆S)A and policy σ : S → A, we getMarkov reward process

mσ = 〈u, tσ〉 : S → R×∆S where tσ(s) := t(s)(σ(s))

; m]σ : ∆S → R×∆S by determinisation (cf. Bartels, Jacobs,

Silva, Sokolova) — using (∆, δ, µ) is monad and ev : ∆(R)→ R isalgebra for monad (∆, δ, µ).

S

�

mσ=〈u,tσ〉// R×∆S

idR×!

��

∆S

m]σ66

!��

Rω // R× Rω

Concretely, m]σ = 〈u], t]σ〉 where

u](ϕ) =∑

s∈S u(s) · ϕ(s)

t]σ(ϕ)(s) =∑

s′∈S tσ(s)(s ′) · ϕ(s ′)

m]σ is (R×−)-coalgebra.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 12 / 29

Page 17: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Expected Rewards via Trace Semantics

S

�

trc

��

mσ=〈u,tσ〉// R×∆S

idR×!

��

∆S

m]σ66

!��

Rω // R× Rω

Trace semantics via finality:

trc(s) = (rσ0 (s), rσ1 (s), rσ2 (s), . . .)

where rσn (s) is the output of m]σ

after n steps starting from δ(s),i.e., expected reward at time stepn starting from s.

Long-term expected value for σ in s depends on how you evaluate

(rσ0 (s), rσ1 (s), rσ2 (s), . . .)

Different evaluation criteria exist...

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 13 / 29

Page 18: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Expected Rewards via Trace Semantics

S

�

trc

��

mσ=〈u,tσ〉// R×∆S

idR×!

��

∆S

m]σ66

!��

Rω // R× Rω

Trace semantics via finality:

trc(s) = (rσ0 (s), rσ1 (s), rσ2 (s), . . .)

where rσn (s) is the output of m]σ

after n steps starting from δ(s),i.e., expected reward at time stepn starting from s.

Long-term expected value for σ in s depends on how you evaluate

(rσ0 (s), rσ1 (s), rσ2 (s), . . .)

Different evaluation criteria exist...

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 13 / 29

Page 19: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Long-Term Value via Discounted Sums

Let 0 ≤ γ < 1 be a discount factor.

Def. The long-term value of policy σ according to the discountedsum criterion is V σ : S → R:

V σ(s) =∞∑n=0

γn · rσn (s)

Converges because reward map u : S → R is bounded.

We define:

• σ ≥ τ if for all s, V σ(s) ≥ V τ (s).

• σ is an optimal policy if σ ≥ τ for all τ .

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 14 / 29

Page 20: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Long-Term Value via Discounted Sums

Let 0 ≤ γ < 1 be a discount factor.

Def. The long-term value of policy σ according to the discountedsum criterion is V σ : S → R:

V σ(s) =∞∑n=0

γn · rσn (s)

Converges because reward map u : S → R is bounded.

We define:

• σ ≥ τ if for all s, V σ(s) ≥ V τ (s).

• σ is an optimal policy if σ ≥ τ for all τ .

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 14 / 29

Page 21: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Optimal Value

Def. The optimal value function V ∗ : S → R of m is defined as

V ∗(s) = maxσ

V σ(s)

Classical facts (wrt discounted sum criterion), cf. (Puterman, 2014):

• If σ is optimal, then V σ = V ∗.

• Optimal policy always exists.

• Optimal policies need not be unique.

• Stationary (memoryless), deterministic policies suffice.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 15 / 29

Page 22: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Optimal Value

Def. The optimal value function V ∗ : S → R of m is defined as

V ∗(s) = maxσ

V σ(s)

Classical facts (wrt discounted sum criterion), cf. (Puterman, 2014):

• If σ is optimal, then V σ = V ∗.

• Optimal policy always exists.

• Optimal policies need not be unique.

• Stationary (memoryless), deterministic policies suffice.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 15 / 29

Page 23: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 16 / 29

Page 24: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ as Coalgebra-to-Algebra Morphism• V σ : S → R satisfies for all s ∈ S :

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

i.e. V σ = u + γtσVσ (as linear system)

(1)

• V σ arises as a fixpoint of the linear operator Ψσ : RS → RS givenby

Ψσ(v) = u + γtσv

• Observation: we can re-express (1) as V σ being acoalgebra-to-algebra morphism:

Smσ=〈u,tσ〉

//

V σ

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

whereαγ : R× R→ R isαγ(x , y) = x + γ · y .

Let H = R× Id.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 17 / 29

Page 25: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ as Coalgebra-to-Algebra Morphism• V σ : S → R satisfies for all s ∈ S :

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

i.e. V σ = u + γtσVσ (as linear system)

(1)

• V σ arises as a fixpoint of the linear operator Ψσ : RS → RS givenby

Ψσ(v) = u + γtσv

• Observation: we can re-express (1) as V σ being acoalgebra-to-algebra morphism:

Smσ=〈u,tσ〉

//

V σ

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

whereαγ : R× R→ R isαγ(x , y) = x + γ · y .

Let H = R× Id.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 17 / 29

Page 26: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ as Coalgebra-to-Algebra Morphism• V σ : S → R satisfies for all s ∈ S :

V σ(s) = u(s) + γ∑

s′∈S tσ(s)(s ′) · V σ(s ′)

i.e. V σ = u + γtσVσ (as linear system)

(1)

• V σ arises as a fixpoint of the linear operator Ψσ : RS → RS givenby

Ψσ(v) = u + γtσv

• Observation: we can re-express (1) as V σ being acoalgebra-to-algebra morphism:

Smσ=〈u,tσ〉

//

V σ

��

R×∆S

R×∆(V σ)��

R R× Rαγ

oo R×∆RR×evoo

whereαγ : R× R→ R isαγ(x , y) = x + γ · y .

Let H = R× Id.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 17 / 29

Page 27: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ via Universal Property?

• Recall that a corecursive algebra (for functor F ) is an F -algebraα s.t.

X∀f //

∃! f †

��

FX

F (f †)

��

A FAαoo

• Question: Is αγ ◦ H(ev) a corecursive algebra for H∆?

• By (Capretta et al., 2004):

αγ ◦ H(ev) a corecursive algebra for H∆iffαγ corecursive algebra for H

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 18 / 29

Page 28: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ via Universal Property?

• Recall that a corecursive algebra (for functor F ) is an F -algebraα s.t.

X∀f //

∃! f †

��

FX

F (f †)

��

A FAαoo

• Question: Is αγ ◦ H(ev) a corecursive algebra for H∆?

• By (Capretta et al., 2004):

αγ ◦ H(ev) a corecursive algebra for H∆iffαγ corecursive algebra for H

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 18 / 29

Page 29: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Is αγ Corecursive Algebra?

• αγ : R× R→ R is corecursive for H = R× Id if:for all f : X → R× X there is unique f † : X → Rsuch that f † = αγ ◦ (R× f †) ◦ f .

• Consider H-coalgebra f : X → R× X given by

f = x0a0−→ x1

a1−→ x2a2−→ . . .

• f † is solution iff f †(xn) = an + γ · f †(xn+1), n = 0, 1, 2, . . ..

• This system has infinitely many solutions when γ > 0, even if(an)n is bounded. So, αγ is not corecursive for γ > 0.

• However, if (an)n is bounded then this system has a uniquebounded solution.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 19 / 29

Page 30: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Is αγ Corecursive Algebra?

• αγ : R× R→ R is corecursive for H = R× Id if:for all f : X → R× X there is unique f † : X → Rsuch that f † = αγ ◦ (R× f †) ◦ f .

• Consider H-coalgebra f : X → R× X given by

f = x0a0−→ x1

a1−→ x2a2−→ . . .

• f † is solution iff f †(xn) = an + γ · f †(xn+1), n = 0, 1, 2, . . ..

• This system has infinitely many solutions when γ > 0, even if(an)n is bounded. So, αγ is not corecursive for γ > 0.

• However, if (an)n is bounded then this system has a uniquebounded solution.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 19 / 29

Page 31: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Is αγ Corecursive Algebra?

• αγ : R× R→ R is corecursive for H = R× Id if:for all f : X → R× X there is unique f † : X → Rsuch that f † = αγ ◦ (R× f †) ◦ f .

• Consider H-coalgebra f : X → R× X given by

f = x0a0−→ x1

a1−→ x2a2−→ . . .

• f † is solution iff f †(xn) = an + γ · f †(xn+1), n = 0, 1, 2, . . ..

• This system has infinitely many solutions when γ > 0, even if(an)n is bounded. So, αγ is not corecursive for γ > 0.

• However, if (an)n is bounded then this system has a uniquebounded solution.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 19 / 29

Page 32: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Bounded Corecursive Algebra (bca)To get uniqueness ⇒ incorporate boundedness information.

• Def. A b-category (C,B) is a category C with a subclassB ⊆ Mor(C) of “bounded” morphisms s.t. for all f :f ∈ B ⇒ f ◦ g ∈ B. (Also known as a sieve.)Main example: (Met,B) where Met is metric spaces with allmaps, and B are all bounded maps.

• Def. Let (C,B) be a b-category and F : C→ C endofunctor. AnF -algebra α : FA→ A is a b-corecursive algebra (bca) if

X∀f ∈B

//

∃!f †∈B��

FX

Ff †

��

A FAαoo

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 20 / 29

Page 33: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Bounded Corecursive Algebra (bca)To get uniqueness ⇒ incorporate boundedness information.

• Def. A b-category (C,B) is a category C with a subclassB ⊆ Mor(C) of “bounded” morphisms s.t. for all f :f ∈ B ⇒ f ◦ g ∈ B. (Also known as a sieve.)Main example: (Met,B) where Met is metric spaces with allmaps, and B are all bounded maps.

• Def. Let (C,B) be a b-category and F : C→ C endofunctor. AnF -algebra α : FA→ A is a b-corecursive algebra (bca) if

X∀f ∈B

//

∃!f †∈B��

FX

Ff †

��

A FAαoo

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 20 / 29

Page 34: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Bounded Corecursive Algebra (bca)To get uniqueness ⇒ incorporate boundedness information.

• Def. A b-category (C,B) is a category C with a subclassB ⊆ Mor(C) of “bounded” morphisms s.t. for all f :f ∈ B ⇒ f ◦ g ∈ B. (Also known as a sieve.)Main example: (Met,B) where Met is metric spaces with allmaps, and B are all bounded maps.

• Def. Let (C,B) be a b-category and F : C→ C endofunctor. AnF -algebra α : FA→ A is a b-corecursive algebra (bca) if

X∀f ∈B

//

∃!f †∈B��

FX

Ff †

��

A FAαoo

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 20 / 29

Page 35: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

V σ from Universal Property of bca

We show αγ ◦ H(ev) is bca for H∆:

1 Develop some theory of b-categories (b-functors, b-naturaltransformation, B-preservation properties, . . . )

2 Prove b-version of (Capretta et al.)-result.Theorem: If ... (certain conditions), then we obtain a bca forH∆ from from bca for H.

3 Show that αγ is bca for H (using Banach FPT).

4 Show that conditions for Theorem apply.

(Details are in the paper)

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 21 / 29

Page 36: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Optimal Value V ∗ from bca

Sm=〈u,t〉

//

V ∗

��

R× (∆S)A

R×(∆V ∗)A

��

R R× (∆R)Aαγ◦(R×maxA ◦ evA)oo

We can show directly that we have bca.

But bca is not obtained via Capretta-style Theorem.Problem: maxA is not affine/linear.

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 22 / 29

Page 37: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 23 / 29

Page 38: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Policy IterationNotation: For σ : S → A , let `σ : ∆S → R be

`σ(ϕ) =∑s∈S

ϕ(s) · V σ(s)

(ϕ-expectation of long-term value for σ).

Policy Iteration Algorithm:

1 Initialise σ0 to any policy.

2 Compute V σk (e.g. by solving system of linear equations).

3 Define σk+1 by

σk+1(s) := argmaxa∈A {`σk (ta(s))}

4 If σk+1 = σk then stop, else go to step 2.

Termination: X(since AS is finite). Correctness: follows if σk+1 ≥ σk .

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 24 / 29

Page 39: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Policy IterationNotation: For σ : S → A , let `σ : ∆S → R be

`σ(ϕ) =∑s∈S

ϕ(s) · V σ(s)

(ϕ-expectation of long-term value for σ).

Policy Iteration Algorithm:

1 Initialise σ0 to any policy.

2 Compute V σk (e.g. by solving system of linear equations).

3 Define σk+1 by

σk+1(s) := argmaxa∈A {`σk (ta(s))}

4 If σk+1 = σk then stop, else go to step 2.

Termination: X(since AS is finite). Correctness: follows if σk+1 ≥ σk .

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 24 / 29

Page 40: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Policy Improvement

By definition,

for all s ∈ S , σk+1(s) = argmaxa∈A {`σk (ta(s))}.

which implies

for all s ∈ S , `σk (tσk+1(s)) ≥ `σk (tσk (s)),

i.e.`σk ◦ tσk+1

≥ `σk ◦ tσk (in pointwise order on RS).

Policy Improvement Lemma:

For all σ, τ : `σ ◦ tτ ≥ `σ ◦ tσ ⇒ V τ ≥ V σ

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 25 / 29

Page 41: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Policy Improvement

By definition,

for all s ∈ S , σk+1(s) = argmaxa∈A {`σk (ta(s))}.

which implies

for all s ∈ S , `σk (tσk+1(s)) ≥ `σk (tσk (s)),

i.e.`σk ◦ tσk+1

≥ `σk ◦ tσk (in pointwise order on RS).

Policy Improvement Lemma:

For all σ, τ : `σ ◦ tτ ≥ `σ ◦ tσ ⇒ V τ ≥ V σ

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 25 / 29

Page 42: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Contraction (Co)Induction

Def. Ordered metric space

An ordered (complete) metric space (M, d ,≤) is a (complete) metricspace (M, d) together with a partial order (M,≤) such that for ally ∈ M, {z | z ≤ y} and {z | y ≤ z} are closed in the metric topology.Example: B(X ,R) with the pointwise order and supremum metric.

Theorem: Contraction (Co)Induction

Let M be a non-empty, ordered complete metric space. If f : M → Mis both contractive and order-preserving, then the fixpoint x∗ of f is:(i) a least pre-fixpoint (if f (x) ≤ x , then x∗ ≤ x), and(ii) a greatest post-fixpoint (if x ≤ f (x), then x ≤ x∗).

Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 26 / 29

Page 43: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Contraction (Co)Induction

Def. Ordered metric space

An ordered (complete) metric space (M, d ,≤) is a (complete) metricspace (M, d) together with a partial order (M,≤) such that for ally ∈ M, {z | z ≤ y} and {z | y ≤ z} are closed in the metric topology.Example: B(X ,R) with the pointwise order and supremum metric.

Theorem: Contraction (Co)Induction

Let M be a non-empty, ordered complete metric space. If f : M → Mis both contractive and order-preserving, then the fixpoint x∗ of f is:(i) a least pre-fixpoint (if f (x) ≤ x , then x∗ ≤ x), and(ii) a greatest post-fixpoint (if x ≤ f (x), then x ≤ x∗).

Cf. Metric Coinduction (Kozen & Ruozzi, 2009) and (Denardo, 1967).

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 26 / 29

Page 44: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Proof of Policy ImprovementPolicy Improvement Lemma:

For all σ, τ : `σ ◦ tτ ≥ `σ ◦ tσ ⇒ V τ ≥ V σ

Proof:

Apply Contraction (Co)induction to Ψσ : RS → RS

(contractive and order-preserving X)

Ψσ(v) = u + γtσv , and V σ is its fixpoint.

We have:

`σ ◦ tτ ≥ `σ ◦ tσ⇒ u + γ · `σ ◦ tτ ≥ u + γ · `σ ◦ tσ⇔ Ψτ (V σ) ≥ Ψσ(V σ) = V σ

(by contr. coind.) ⇒ V τ ≥ V σ

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 27 / 29

Page 45: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Proof of Policy ImprovementPolicy Improvement Lemma:

For all σ, τ : `σ ◦ tτ ≥ `σ ◦ tσ ⇒ V τ ≥ V σ

Proof: Apply Contraction (Co)induction to Ψσ : RS → RS

(contractive and order-preserving X)

Ψσ(v) = u + γtσv , and V σ is its fixpoint.

We have:

`σ ◦ tτ ≥ `σ ◦ tσ⇒ u + γ · `σ ◦ tτ ≥ u + γ · `σ ◦ tσ⇔ Ψτ (V σ) ≥ Ψσ(V σ) = V σ

(by contr. coind.) ⇒ V τ ≥ V σ

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 27 / 29

Page 46: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

Outline

1 Coalgebra basics

2 MDP Preliminaries

3 Part I: Long-Term Values from b-Corecursive Algebras

4 Part II: Policy Improvement (Co)Inductively

5 Conclusion

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 28 / 29

Page 47: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

ConclusionSummary

• Value functions V σ and V ∗ from b-corecursive algebras.• Coinductive proof of policy improvement theorem.• Still need to resort to Banach FPT to get fixpoints.

Future work

• Stochastic games : Existence of Nash Eq = Kakutani +Contraction (Co)Induction.• Other types of equilibria (Subgame perfect, Markov, ...)• Make connections to

• Coalgebraic infinite games (Abramsky & Winschel, 2017)• Open games (Hedges et al., 2018)• Semantics of equilibria (Pavlovic, 2009)

• Dualities for ordered metric spaces, fixpoints.• Coalgebraic learning: reinforcement learning

Thanks!

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 29 / 29

Page 48: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

ConclusionSummary

• Value functions V σ and V ∗ from b-corecursive algebras.• Coinductive proof of policy improvement theorem.• Still need to resort to Banach FPT to get fixpoints.

Future work

• Stochastic games : Existence of Nash Eq = Kakutani +Contraction (Co)Induction.• Other types of equilibria (Subgame perfect, Markov, ...)• Make connections to

• Coalgebraic infinite games (Abramsky & Winschel, 2017)• Open games (Hedges et al., 2018)• Semantics of equilibria (Pavlovic, 2009)

• Dualities for ordered metric spaces, fixpoints.• Coalgebraic learning: reinforcement learning

Thanks!

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 29 / 29

Page 49: (Co)Algebraic Techniques for - Markov Decision Processes · Long-Term Values in Markov Decision Processes, (Co)Algebraically Proc. of Coalgebraic Methods in Computer Science (CMCS

ConclusionSummary

• Value functions V σ and V ∗ from b-corecursive algebras.• Coinductive proof of policy improvement theorem.• Still need to resort to Banach FPT to get fixpoints.

Future work

• Stochastic games : Existence of Nash Eq = Kakutani +Contraction (Co)Induction.• Other types of equilibria (Subgame perfect, Markov, ...)• Make connections to

• Coalgebraic infinite games (Abramsky & Winschel, 2017)• Open games (Hedges et al., 2018)• Semantics of equilibria (Pavlovic, 2009)

• Dualities for ordered metric spaces, fixpoints.• Coalgebraic learning: reinforcement learning

Thanks!

Helle Hvid Hansen (TU Delft) SYSMICS 2019 Amsterdam, 21-25 Jan 2019 29 / 29