Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by
Pascal Poupart and Nikos Vlassis
(2008 International Symposium on Artificial Intelligence and Math)
Presented by Lihan He
ECE, Duke University
Oct 3, 2008
Outline
Introduction
POMDP represented as a dynamic decision network (DDN)
Partially observable reinforcement learning
    Belief update
    Value function and optimal action
Partially observable BEETLE
    Offline policy optimization
    Online policy execution
Conclusion
1/14
Introduction
Final objective: learn the optimal actions (a policy) that achieve the best expected reward
POMDP: partially observable Markov decision process, a model of sequential decision making under state uncertainty, defined by the tuple ⟨S, A, T, O, R⟩ (states, actions, transition model, observation model, reward model)
Reinforcement learning for POMDPs: solve the decision-making problem from feedback given by the environment, when the dynamics of the environment (T and O) are unknown and only the action-observation sequence is available as history
    model-based: explicitly model the environment
    model-free: avoid explicitly modeling the environment
    online learning: policy learning and execution at the same time
    offline learning: learn the policy first from training data, and then execute it without modifying the policy
2/14
Introduction
This paper:
Bayesian model-based approach
Set the prior belief as a mixture of products of Dirichlets
The posterior belief is again a mixture of products of Dirichlets
The value function (its α-functions) is also represented by mixtures of products of Dirichlets
The number of mixture components increases exponentially with respect to the time step
PO-BEETLE algorithm
3/14
POMDP and DDN
Redefine POMDP as dynamic decision network (DDN)
G = ⟨X ∪ X′, E⟩, with variables X = {A, B, C, D, R}
State variables S = {B, C, D, R}
Observation variables O = {D} ⊆ S, reward variables R = {R} ⊆ S
X, X’ : two consecutive time steps
Observation and reward variables are subsets of the state variables
The conditional probability distributions Pr(s′ | pa_{s′}) of the state variables jointly encode the transition, observation and reward models T, O and R
4/14
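The key property of the DDN is that the one-step dynamics factorize over the state variables. The equation itself did not survive the transcript; a standard factored form consistent with the notation above (my writing, not copied from the slide) is:

Pr(s′_1, …, s′_n | x) = ∏_{s′ ∈ S′} Pr(s′ | pa_{s′}),   where pa_{s′} ⊆ X ∪ X′ are the parents of s′ in the DDN and x includes the chosen action.

Because the observation and reward variables are themselves state variables, this single product covers T, O and R at once.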
POMDP and DDN
Given X, S, R, O, A, the edges E and the dynamics Pr(s′ | pa_{s′}):
Belief update: the belief over states is revised after each action a and observation o′ (standard form written out below)
Objective: find a policy that maximizes the expected total reward
The optimal value function satisfies Bellman's equation
Value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation
5/14
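The belief-update and Bellman equations on this slide were figures in the original deck; the standard flat-POMDP forms, consistent with the tuple ⟨S, A, T, O, R⟩ above, are:

b^{a,o′}(s′) ∝ Pr(o′ | s′, a) Σ_s Pr(s′ | s, a) b(s)

V*(b) = max_a [ Σ_s b(s) R(s, a) + γ Σ_{o′} Pr(o′ | b, a) V*(b^{a,o′}) ]

where b^{a,o′} is the belief after taking action a in belief b and observing o′, and γ is the discount factor.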
POMDP and DDN
For reinforcement learning, assume X, S, R, O, A and the edges E are known, but the dynamics Pr(s′ | pa_{s′}) are unknown.
We augment the graph: the unknown dynamics are included as variables, denoted by the parameters Θ.
If the unknown model is static, the parameters do not change over time.
Belief over s → joint belief over s and θ
6/14
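Spelled out (my notation; the slide's own equation did not survive), the augmentation is the standard Bayes-adaptive construction:

Augmented state: (s, θ), where θ collects the unknown parameters of Pr(s′ | pa_{s′})
Static model: Pr(θ′ | θ) = δ_θ(θ′), i.e. the parameters never change, only the belief about them does
Joint belief update after action a and observation o′:
b^{a,o′}(s′, θ) ∝ Pr(o′ | s′, a, θ) Σ_s Pr(s′ | s, a, θ) b(s, θ)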
PORL: belief update
Prior setting for the belief: a mixture of products of Dirichlets
The posterior belief (after taking action a and receiving observation o′) is again a mixture of products of Dirichlets (spelled out below)
Problem: the number of mixture components increases by a factor of |S| per update (exponential growth with time)
7/14
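In symbols (the indexing is mine, the structure follows the slide): a single mixture component pairs a state with a product of Dirichlets over the unknown multinomials θ, so the prior can be written as

b(s, θ) = Σ_i c_i δ_{s_i}(s) ∏_j Dir(θ_j ; n_j^i)

Because the likelihood Pr(s′, o′ | s, a, θ) is a product of individual θ entries, multiplying a Dirichlet by one of its own parameters simply increments the corresponding count, so each component remains a product of Dirichlets after the update; the sum over the previous state s is what multiplies the number of components by |S|.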
PORL: value function and optimal action
The augmented POMDP is hybrid, with discrete state variables S and continuous model variables Θ
Discrete-state POMDP: V*(b) = max_α α(b), with α(b) = Σ_s α(s) b(s)
Continuous-state POMDP [1]: α(b) = ∫_s α(s) b(s) ds
Hybrid-state POMDP: α(b) = Σ_s ∫_θ α(s, θ) b(s, θ) dθ
The α-function α(s, θ) can also be represented as a mixture of products of Dirichlets
[1] Porta, J. M.; Vlassis, N. A.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329–2367.
8/14
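As a purely illustrative sketch (not from the paper), the discrete-state expression above is just a maximum over dot products; the function name, inputs and toy numbers below are hypothetical:

```python
import numpy as np

def value_and_best_action(b, alphas, actions):
    """Evaluate V*(b) = max_alpha <alpha, b> for a discrete-state POMDP.

    b       : length-|S| belief vector
    alphas  : list of length-|S| alpha-vectors
    actions : action associated with each alpha-vector
    """
    values = [np.dot(alpha, b) for alpha in alphas]   # alpha(b) = sum_s alpha(s) b(s)
    best = int(np.argmax(values))                     # maximizing alpha-vector
    return values[best], actions[best]

# Tiny usage example with 2 states and 2 alpha-vectors
b = np.array([0.7, 0.3])
alphas = [np.array([1.0, 0.0]), np.array([0.2, 0.9])]
print(value_and_best_action(b, alphas, actions=["a0", "a1"]))
```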
PORL: value function and optimal action
Assume the value function for k steps-to-go is V^k(b)
Then the value function for k+1 steps-to-go, V^{k+1}(b), is computed by a Bellman backup decomposed into 3 steps:
1) back up the α-functions for every action a and observation o′
2) find the optimal action for belief b
3) find the corresponding α-function
(one standard way to write these steps is sketched below)
Problem: the number of mixture components increases by a factor of |S| (exponential growth with time)
9/14
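One standard way to write the three steps, following the point-based backup of [1] for the discrete-state case (the hybrid case replaces Σ_s b(s)(·) with Σ_s ∫_θ b(s,θ)(·) dθ; the exact equations on the slide were images):

1) α_j^{a,o′}(s) = Σ_{s′} Pr(s′ | s, a) Pr(o′ | s′, a) α_j(s′),   for every α_j representing V^k
2) a* = argmax_a [ Σ_s b(s) R(s, a) + γ Σ_{o′} max_j Σ_s b(s) α_j^{a,o′}(s) ]
3) α_b(s) = R(s, a*) + γ Σ_{o′} α_{j*(o′)}^{a*,o′}(s),   with j*(o′) = argmax_j Σ_s b(s) α_j^{a*,o′}(s)

so that V^{k+1}(b) = Σ_s b(s) α_b(s).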
PO-BEETLE: offline policy optimization
Policy learning is performed offline, given sufficient training data (action-observation sequence)
10/14
PO-BEETLE: offline policy optimization
Keep the number of mixture components for α-functions bounded:
Approach 1: approximation using basis functions
Approach 2: approximation by important components
11/14
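A minimal sketch of what "approximation by important components" could look like in practice (hypothetical code, not the paper's exact procedure): keep the k components with the largest mixture weights and renormalize.

```python
def truncate_mixture(components, k):
    """Keep the k mixture components with the largest weights.

    components : list of (weight, params) pairs, where params stands in for the
                 product-of-Dirichlets parameters of one component
    k          : maximum number of components to keep
    """
    kept = sorted(components, key=lambda c: c[0], reverse=True)[:k]
    total = sum(w for w, _ in kept)
    return [(w / total, p) for w, p in kept]  # renormalize the kept weights

# Usage: reduce a 5-component mixture to its 3 heaviest components
mix = [(0.40, "dir_A"), (0.05, "dir_B"), (0.30, "dir_C"), (0.20, "dir_D"), (0.05, "dir_E")]
print(truncate_mixture(mix, k=3))
```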
PO-BEETLE: online policy execution
Given policy, the agent executes the policy and updates belief online.
Keep the number of mixture components for belief b bounded:
Approach 1: approximation using importance sampling
12/14
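A hedged illustration of importance sampling in this context (hypothetical code; the paper's procedure may differ): approximate an expectation under the exactly updated belief by sampling from the previous belief as the proposal and reweighting by the likelihood of the new observation. For simplicity the proposal here is a single Dirichlet component.

```python
import numpy as np

def importance_estimate(f, likelihood, prior_alphas, n_samples=1000, rng=None):
    """Self-normalized importance sampling estimate of E_{b'}[f(theta)],
    where b'(theta) ∝ likelihood(theta) * Dir(theta; prior_alphas)."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.dirichlet(prior_alphas, size=n_samples)   # samples from the proposal
    weights = np.array([likelihood(t) for t in thetas])    # unnormalized importance weights
    weights /= weights.sum()
    values = np.array([f(t) for t in thetas])
    return float(np.dot(weights, values))

# Example: posterior mean of theta[0] after observing outcome 0 once
prior = np.array([1.0, 1.0, 1.0])
est = importance_estimate(f=lambda t: t[0], likelihood=lambda t: t[0], prior_alphas=prior)
print(est)  # close to the exact posterior mean 2/4 = 0.5
```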
PO-BEETLE: online policy execution
Approach 2: particle filtering: simultaneously update belief and reduce the number of mixture components
Sample one updated component (after taking a and receiving o’)
The updated belief is represented by k particles
13/14
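A minimal sketch of the particle-filtering idea above (hypothetical code, not the paper's exact procedure; it assumes the observation model is known and only the transition model is unknown, for simplicity): each particle is one mixture component, a state plus one set of Dirichlet counts, and after taking a and observing o′ exactly one updated component is sampled per particle, so the representation never grows beyond k particles.

```python
import numpy as np

def particle_belief_update(particles, a, o_new, obs_model, rng=None):
    """One particle-filter style belief update after taking action a and observing o_new.

    particles : list of (s, counts) pairs; counts[a] is a (|S| x |S|) array of
                Dirichlet counts for the unknown transition model Pr(s'|s,a)
    obs_model : known observation probabilities, obs_model[s', o']
    Returns a resampled particle set of the same size.
    """
    rng = rng or np.random.default_rng(0)
    proposals, weights = [], []
    for s, counts in particles:
        trans_mean = counts[a][s] / counts[a][s].sum()   # E[Pr(s'|s,a)] under the Dirichlet
        p = trans_mean * obs_model[:, o_new]             # proportional to Pr(s', o' | s, a)
        weights.append(p.sum())                          # particle weight ~ Pr(o' | s, a)
        s_new = int(rng.choice(len(p), p=p / p.sum()))   # sample one updated component
        new_counts = {k: v.copy() for k, v in counts.items()}
        new_counts[a][s, s_new] += 1.0                   # Bayesian count increment
        proposals.append((s_new, new_counts))
    w = np.asarray(weights) / np.sum(weights)
    idx = rng.choice(len(proposals), size=len(proposals), p=w)  # resample by weight
    return [proposals[i] for i in idx]
```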
Conclusion
Bayesian model-based reinforcement learning
The prior belief is a mixture of products of Dirichlets
The posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time
α-functions (associated with the value function) are also represented as mixtures of products of Dirichlets that grow exponentially with time
Partially observable BEETLE algorithm.
14/14