A Tutorial on the Partially Observable Markov
Decision Process and Its Applications
Lawrence Carin
June 7, 2006
Outline
Overview of Markov decision processes (MDPs)
Introduction to partially observable Markov decision processes (POMDPs)
Some applications of POMDPs
Conclusions
Overview of MDPs
Introduction to the POMDP model
Some applications of POMDPs
Conclusions
Markov decision processes
The MDP is defined by the tuple < S, A, T, R >
S is a finite set of states of the world.
A is a finite set of actions.
T: S × A → Π(S) is the state-transition function, the probability of an action changing the world state from one to another, T(s, a, s').
R: S × A → ℝ is the reward for the agent in a given world state after performing an action, R(s, a).
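The tuple above can be written down concretely. A minimal sketch with made-up numbers for a hypothetical two-state, two-action problem (states and actions are just indices):

```python
import numpy as np

n_states, n_actions = 2, 2

# T[s, a, s'] = probability that action a moves the world from s to s'.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions out of state 1
])

# R[s, a] = immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

# Sanity check: each T(s, a, .) must be a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
```

Any finite MDP fits this shape; only the array sizes and entries change.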
Markov decision processes
Two properties of the MDP
• The action-dependent state transition is Markovian.
• The state is fully observable after taking action a.

Illustration of MDPs: the agent observes the world state s, the policy π: s → a chooses an action a, and the action changes the world state according to T(s, a, s').
Markov decision processes
Objective of MDPs
• Finding the optimal policy π, mapping state s to action a, in order to maximize the value function V(s):

V(s) = max_a [ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s') ]
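The Bellman equation above can be solved by iterating it to a fixed point. A minimal value-iteration sketch on a hypothetical two-state, two-action MDP (T, R, and γ are made-up example values):

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
gamma = 0.95

V = np.zeros(2)
for _ in range(1000):
    # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
    Q = R + gamma * T @ V
    V_new = Q.max(axis=1)                  # max over actions
    if np.max(np.abs(V_new - V)) < 1e-8:   # converged to the fixed point
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy optimal policy: state -> action
```

Because the Bellman backup is a γ-contraction, the iteration converges to the unique optimal V for any starting point.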
Overview of MDPs
Introduction to POMDPs
Some applications of POMDPs
Conclusions
Introduction to POMDPs
The POMDP is defined by the tuple < S, A, T, R, Ω, O >
S, A, T, and R are defined the same as in MDPs.
Ω is a finite set of observations the agent can experience of its world.
O: S × A → Π(Ω) is the observation function, the probability of making a certain observation o after performing action a and landing in state s', O(s', a, o).
Introduction to POMDPs
Differences between MDPs and POMDPs
• The state is hidden after taking action a.
• The hidden state information is inferred from the action-state dependent observation function O(s', a, o).
This leads to uncertainty about the state s in POMDPs.
Introduction to POMDPs
A new concept in POMDPs: the belief state b(s)

b_t(s) = Pr(s_t = s | o_1, a_1, o_2, a_2, …, o_{t-1}, a_{t-1}, o_t)
Introduction to POMDPs
The belief state b(s) evolves according to Bayes' rule:

b'(s') = O(s', a, o) Σ_{s∈S} T(s, a, s') b(s) / Pr(o | a, b)   (1)

Illustration: from belief b over states s1, s2, s3 with n control intervals remaining, taking action a and observing o1 or o2 yields the updated beliefs b' = τ(b | a, o1) or b' = τ(b | a, o2), each with n−1 control intervals remaining.
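The update (1) is a one-line computation once T and O are stored as arrays. A sketch with made-up numbers for a hypothetical two-state, one-action, two-observation POMDP:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update (1): b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') b(s)."""
    unnorm = O[:, a, o] * (b @ T[:, a, :])   # numerator of (1), indexed by s'
    prob_o = unnorm.sum()                    # normalizer Pr(o | a, b)
    return unnorm / prob_o

# Toy example (made-up numbers): observation 0 is likely in state 0.
T = np.array([[[0.7, 0.3]],
              [[0.4, 0.6]]])                 # T[s, a, s']
O = np.array([[[0.8, 0.2]],
              [[0.1, 0.9]]])                 # O[s', a, o]
b = np.array([0.5, 0.5])
b_new = belief_update(b, a=0, o=0, T=T, O=O)
```

Starting from the uniform belief, seeing observation 0 concentrates the belief sharply on state 0, as (1) predicts.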
Introduction to POMDPs
Illustration of POMDPs: the world evolves according to T(s, a, s') and emits an observation o according to O(s', a, o); inside the agent, a state estimator (SE) updates the belief b' = τ(b, a, o) using (1), and the policy π maps the belief b to the next action a.
Introduction to POMDPs
Objective of POMDPs
• Finding the optimal policy π for POMDPs, mapping belief point b to action a, in order to maximize the value function V(b):

V_n(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} Pr(o | a, b) V_{n-1}(b^a_o) ]
       = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} Σ_{s'∈S} Σ_{s∈S} b(s) Pr(s' | s, a) Pr(o | s', a) V_{n-1}(b^a_o) ]   (2)

The first term, Σ_{s∈S} b(s) R(s, a), is the expected immediate reward; b^a_o is the belief obtained from b via (1) after taking action a and observing o.
Introduction to POMDPs
• Piecewise linearity and convexity of the optimal value function for finite horizon in POMDPs

Optimal value function:

V_n(b) = max_k Σ_{s∈S} α_n^k(s) b(s) = max_k [ α_n^k · b ]   (3)

Illustration: plotting V(b) against Pr(s1) on [0, 1], each α-vector p1, …, p5 is a line, each associated with an action a(p1), a(p2), …, a(p5); the optimal value function is the upper envelope of these lines.
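Evaluating (3) at a belief point is just a max over dot products. A sketch with hypothetical α-vectors for |S| = 2:

```python
import numpy as np

# Made-up alpha-vectors, one row each; in a solved POMDP each row
# would be associated with an action.
alphas = np.array([[10.0, 0.0],
                   [6.0, 6.0],
                   [0.0, 10.0]])

def value(b):
    """V(b) = max_k alpha_k · b, the upper envelope over the alpha-vectors."""
    return np.max(alphas @ b)

b = np.array([0.5, 0.5])
v = value(b)                           # value at the uniform belief
best_k = int(np.argmax(alphas @ b))    # index of the maximizing alpha-vector
```

A max over linear functions is automatically piecewise linear and convex, which is exactly the structure (3) asserts.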
Introduction to POMDPs
Substituting (3) and (1) into (2):

V_n(b) = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} Pr(o | a, b) V_{n-1}(b^a_o) ]
       = max_{a∈A} [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈Ω} max_k Σ_{s'∈S} Σ_{s∈S} b(s) T(s, a, s') O(s', a, o) α_{n-1}^k(s') ]
       = max_{a∈A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{o∈Ω} Σ_{s'∈S} T(s, a, s') O(s', a, o) α_{n-1}^{l(b,a,o)}(s') ]   (4)

Maximizing over k yields the index l(b, a, o); the bracketed term is the α-vector of belief point b, and (4) gives the optimal value of belief point b.
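The backup (4) at a single belief point can be sketched directly from the arrays T[s, a, s'], O[s', a, o], R[s, a]; everything below (the helper name, the toy numbers, the discount γ) is illustrative:

```python
import numpy as np

def backup(b, T, O, R, alphas_prev, gamma):
    """One-step backup (4) at belief b; alphas_prev holds one
    previous-stage alpha-vector per row."""
    n_actions = R.shape[1]
    best_alpha, best_val = None, -np.inf
    for a in range(n_actions):
        alpha_a = R[:, a].copy()          # expected-immediate-reward term
        for o in range(O.shape[2]):
            # g[k, s] = sum_{s'} T(s, a, s') O(s', a, o) alpha_{n-1}^k(s')
            g = alphas_prev @ (T[:, a, :] * O[:, a, o]).T
            l = int(np.argmax(g @ b))     # index l(b, a, o) in (4)
            alpha_a = alpha_a + gamma * g[l]
        if alpha_a @ b > best_val:        # outer max over actions
            best_alpha, best_val = alpha_a, alpha_a @ b
    return best_alpha                     # the alpha-vector of belief point b

# Toy one-action POMDP (made-up numbers); with zero previous-stage
# alpha-vectors the backup reduces to the immediate reward R(., a).
T = np.array([[[0.7, 0.3]], [[0.4, 0.6]]])   # T[s, a, s']
O = np.array([[[0.8, 0.2]], [[0.1, 0.9]]])   # O[s', a, o]
R = np.array([[1.0], [2.0]])                  # R[s, a]
alpha = backup(np.array([0.5, 0.5]), T, O, R, np.zeros((1, 2)), gamma=0.9)
```

Repeating this backup over a set of belief points is the core operation of the point-based methods described next.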
Introduction to POMDPs
Approaches to solving the POMDP problem
• Exact algorithms: find all α-vectors for the whole belief space; exact, but intractable for large problems.
• Approximate algorithms: find α-vectors for a subset of the belief space; fast, and able to handle large problems.
Point-Based Value Iteration (PBVI)
• Focus on a finite set of belief points (e.g., b0, b1, b3, b4, b5 sampled from the belief simplex).
• Maintain an α-vector for each point.
Region-Based Value Iteration (RBVI)
• RBVI maintains an α-vector for each convex region over which the optimal value function is linear.
• RBVI simultaneously determines the α-vectors for all relevant convex regions based on all available belief points.
RBVI (Contd)
The piecewise linear value function V(b) = max_k α_k · b can be reformulated as V(b) = α_{z(b)} · b by introducing hidden variables z(b) = k, denoting b ∈ B_k.
RBVI (Contd)
The belief space is partitioned into the regions B_k using hyper-ellipsoids, one per region.
RBVI (Contd)
The joint distribution of V(b) and b is modeled with the region assignments z(b) as hidden variables, and the parameters are estimated by Expectation-Maximization (EM):
E step: compute the posterior over the hidden assignments z(b) given the current parameters.
M step: update the α-vectors and region parameters given the posterior.
Overview of MDPs
Introduction to the POMDP model
Some applications of POMDPs
Conclusions
Applications of POMDPs
• Application of partially observable Markov decision processes to robot navigation in a minefield
• Application of partially observable Markov decision processes to feature selection
• Application of partially observable Markov decision processes to sensor scheduling
Applications of POMDPs
Some considerations in applying POMDPs to new problems:
• How to define the state
• How to obtain the transition and observation matrices
• How to set the reward
References
1. Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, Vol. 101, 1998.
2. R. D. Smallwood and E. J. Sondik. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Operations Research, 21:1071–1088, 1973.
3. J. Pineau, G. Gordon and S. Thrun. Point-Based Value Iteration: An Anytime Algorithm for POMDPs. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.
4. D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1 & 2. Athena Scientific, Belmont, Massachusetts, 2001.
5. R. Bellman. Dynamic Programming. Princeton University Press, 1957.