Source: …mmv/planning/handouts/pomdps_15887.pdf
TRANSCRIPT

15-887 Planning, Execution and Learning
Planning under Uncertainty: Partially Observable Markov Decision Processes (POMDP)
Maxim Likhachev, Robotics Institute, Carnegie Mellon University

Graph vs. MDP vs. POMDP

• Consider a path planning example
[Figure: a map with robot R, goal G, an outdoors region, and a no-fly zone]

• Assume perfect action execution and full knowledge of the state (i.e., perfect localization)
[Figure: the map discretized into states S0–S6 and goal SG, connected as a graph]

Graph: implicitly defined as {S, A, C}, where S – set of states, A – set of actions, C – costs of all (s, a) pairs.
Each edge is defined as (s, succ(s, a)) for every s in S and every action a in A; the edge cost is given by c(s, a).
[Figure: the map graph over S0–S6 and SG, with a cost of 1 on every edge]
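
To make the implicit definition concrete, here is a minimal sketch (not from the handout) of expanding an implicit graph {S, A, C} into explicit edges; the succ function, state names, and unit costs in the usage example are illustrative assumptions, not the slide's map.

```python
def build_edges(S, A, succ, c):
    """Enumerate the explicit edges (s, succ(s, a)) with cost c(s, a)
    for every state s in S and every action a in A that applies in s."""
    edges = []
    for s in S:
        for a in A:
            s_next = succ(s, a)
            if s_next is not None:  # action a is applicable in s
                edges.append((s, s_next, c(s, a)))
    return edges

# Hypothetical usage on a tiny chain with unit costs (illustrative only):
S = ["S0", "S1", "SG"]
A = ["move"]
succ = lambda s, a: {"S0": "S1", "S1": "SG"}.get(s)
c = lambda s, a: 1
print(build_edges(S, A, succ, c))  # [('S0', 'S1', 1), ('S1', 'SG', 1)]
```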

• Now assume imperfect action execution and full knowledge of the state (i.e., perfect localization)
MDP: let's assume a 50% chance of ending up on the left and a 50% chance of ending up on the right.

MDP: defined as {S, A, T, C}, where S – set of states, A – set of actions, T(s,a,s') = Prob(s'|s,a), C – costs of all (s, a) pairs.
[Figure: the map graph with unit edge costs; the action out of S0 now has two outcomes, S1 with p = 0.5 and S2 with p = 0.5]
What is an optimal policy here?
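
Since the MDP above is fully specified by {S, A, T, C}, the question can be answered by value iteration over expected costs. A minimal sketch, assuming T and C are stored as dictionaries; the usage example is a hypothetical 3-state problem with made-up numbers, not the slide's map.

```python
def value_iteration(S, A, T, C, goal, eps=1e-6):
    """Cost-based value iteration: V(s) = min_a [C(s,a) + sum_s' T(s,a,s') V(s')],
    with V(goal) = 0. T[(s, a)] maps s' -> P(s'|s, a); C[(s, a)] is the cost.
    Assumes every non-goal state has at least one applicable action."""
    V = {s: 0.0 for s in S}

    def q(s, a):  # expected cost of taking a in s, then acting greedily
        return C[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)].items())

    while True:
        delta = 0.0
        for s in S:
            if s == goal:
                continue
            best = min(q(s, a) for a in A if (s, a) in T)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    pi = {s: min((a for a in A if (s, a) in T), key=lambda a: q(s, a))
          for s in S if s != goal}
    return V, pi

# Hypothetical toy problem: a stochastic action from S0 reaches SG or S1
# with p = 0.5 each; all costs are 1 (illustrative numbers only).
S, A = ["S0", "S1", "SG"], ["go"]
T = {("S0", "go"): {"SG": 0.5, "S1": 0.5}, ("S1", "go"): {"SG": 1.0}}
C = {("S0", "go"): 1.0, ("S1", "go"): 1.0}
V, pi = value_iteration(S, A, T, C, goal="SG")
print(V)  # {'S0': 1.5, 'S1': 1.0, 'SG': 0.0}
```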

MDP (rewards version): defined as {S, A, T, R}, where S – set of states, A – set of actions, T(s,a,s') = Prob(s'|s,a), R – rewards for all (s, a) pairs.
[Figure: the same MDP with rewards in place of costs: 0 on every transition except a reward of 5 on the transitions into SG]

• Now assume imperfect action execution and only partial observability of the state (i.e., imperfect localization)
POMDP: let's assume the UAV initially knows it is at S0, and during execution it can only sense adjacent obstacles and being at the goal. After taking the first action, the UAV doesn't know whether it is at state S1 or S2.
What is an optimal policy here?

POMDP: {S, A, T, R, Ω, O}, where S, A, T(s,a,s'), R(s,a) are all as in the MDP, Ω is the set of all possible observation vectors o, and O(s',a,o) = Prob(o|s',a) is the probability of seeing o after executing action a and ending up at state s'.
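
As a data structure, the tuple can be carried around directly. A minimal container sketch; the field layout and dictionary conventions are my assumptions, not the handout's.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    """The tuple {S, A, T, R, Omega, O} from the slide.
    T[(s, a)] maps s' -> P(s'|s, a); R[(s, a)] is the reward;
    O[(s_next, a)] maps o -> P(o|s_next, a)."""
    S: List[str]                                  # states
    A: List[str]                                  # actions
    T: Dict[Tuple[str, str], Dict[str, float]]    # transition model
    R: Dict[Tuple[str, str], float]               # reward model
    Omega: List[str]                              # observations
    O: Dict[Tuple[str, str], Dict[str, float]]    # observation model
```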

[Figure: causal relationship — state s and action a determine the next state s' and the reward R; the observation o depends on s' and a]
Example of POMDP problems where the robot knows its own pose perfectly (perfect localization)?

Belief State Space

• Belief state b: a probability distribution over the states the robot believes it is currently in
POMDP: {S, A, T, R, Ω, O}, where T(s,a,s') = P(s'|s,a), R(s,a), O(s',a,o) = Prob(o|s',a)

b is a vector of size N (the number of states in S), with Σ_i b_i = 1 and b_i ≥ 0 for all i.
Initially, the robot knows it is in S0; thus the initial b = [1, 0, 0, 0, 0, 0, 0, 0]ᵀ, i.e., P(S0) = 1.

What is b after the robot takes the 1st action?
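
One way to see the answer: with the 50/50 transition from S0, the prediction step alone already spreads the belief over S1 and S2. A numeric sketch, assuming the state ordering [S0, S1, …, S6, SG]:

```python
import numpy as np

# Assumed state ordering: [S0, S1, S2, S3, S4, S5, S6, SG].
b0 = np.zeros(8)
b0[0] = 1.0                # the robot knows it starts in S0

Ta = np.zeros((8, 8))      # Ta[i, j] = P(Sj | Si, a) for the first action
Ta[0, 1] = Ta[0, 2] = 0.5  # S0 -> S1 or S2, each with p = 0.5
# (rows for the other states omitted; they are irrelevant when b0 = e_S0)

b1 = b0 @ Ta               # prediction: b'(s') = sum_s T(s, a, s') b(s)
print(b1)                  # [0.  0.5 0.5 0.  0.  0.  0.  0. ]
```

Per the earlier slide, the UAV cannot tell S1 from S2, so the adjacent-obstacle observation must look identical in both states and the belief stays split 50/50.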

[Figure: the belief state space for K actions and M possible observations — from belief b, action a1 followed by observation o1, o2, …, oM leads to successor beliefs b', b'', b''', and similarly for a2 through aK]

Here is how outcome beliefs are computed: b' assigns P(s'|b,a,o) to every s' in S, where
b'(s') = P(s'|b,a,o) = O(s',a,o) · Σ_s [T(s,a,s') · b(s)] / P(o|b,a)
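
A direct transcription of this update into code — a minimal sketch, assuming T[a] and O[a] are numpy matrices indexed as in the comments:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayes-filter step, exactly the slide's formula:
    b'(s') = O(s', a, o) * sum_s T(s, a, s') b(s) / P(o | b, a).
    T[a][s, s'] = P(s'|s, a); O[a][s', o] = P(o|s', a)."""
    pred = b @ T[a]             # prediction:  sum_s T(s, a, s') * b(s)
    unnorm = O[a][:, o] * pred  # correction:  weight by P(o | s', a)
    p_o = unnorm.sum()          # normalizer:  P(o | b, a)
    if p_o == 0.0:
        raise ValueError("observation o is impossible under (b, a)")
    return unnorm / p_o
```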

Derivation (Bayes' rule on o, then marginalizing over the previous state s):
P(s'|b,a,o) = P(o|b,a,s') · P(s'|b,a) / P(o|b,a) = P(o|s',a) · Σ_s [P(s'|s,a) · b(s)] / P(o|b,a)

What is the belief state space? It is an MDP! We just need to compute the transition probabilities τ(b,a,b') = P(b'|b,a) and the reward function ρ(b,a).
[Figure: the Belief MDP — from belief b, each action a1, …, aK leads to successor beliefs b', b'', b''' with probabilities such as P(b'|b,a1) and P(b'''|b,a1), and rewards ρ(b,a1), …, ρ(b,aK)]

τ(b,a,b') = P(b'|b,a) = Σ_{o leading to b'} P(o|b,a) = Σ_{o leading to b'} Σ_{s'} P(o|s',a) · Σ_s P(s'|s,a) · b(s)
ρ(b,a) = Σ_s R(s,a) · b(s)
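
Putting the two formulas together, a sketch that enumerates the successor beliefs of (b, a) with their τ weights and computes ρ; successors are keyed by observation here, so observations leading to the same b' would have their probabilities summed. Same matrix conventions as the belief-update sketch above.

```python
import numpy as np

def belief_successors(b, a, T, O):
    """Successor beliefs of (b, a) in the Belief MDP.
    Returns {o: (P(o|b, a), b')}; tau(b, a, b') is the sum of P(o|b, a)
    over all observations o whose update leads to the same b'."""
    pred = b @ T[a]                       # sum_s P(s'|s, a) b(s)
    out = {}
    for o in range(O[a].shape[1]):
        unnorm = O[a][:, o] * pred
        p_o = unnorm.sum()                # P(o | b, a)
        if p_o > 0.0:
            out[o] = (p_o, unnorm / p_o)  # (weight, successor belief)
    return out

def rho(b, a, R):
    """Belief-MDP reward from the slide: rho(b, a) = sum_s R(s, a) b(s).
    R[a] is a length-|S| vector with R[a][s] = R(s, a)."""
    return float(b @ R[a])
```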

So, finding an optimal policy for a POMDP = finding an optimal policy for the Belief MDP.
We can even use the Value Iteration you studied, can't we?
The catch: the size of the Belief MDP is infinite, since beliefs live in a continuous simplex.

• Popular techniques for solving POMDPs (a toy sketch of the first one follows below):
– by discretizing the belief state space into a finite # of states [Lovejoy, '91]
– by taking advantage of the geometric (piecewise-linear, convex) nature of the value function [Kaelbling, Littman & Cassandra, '98]
– by sampling-based (point-based) approximations [Pineau, Gordon & Thrun, '03]
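
To make the first idea concrete, here is a toy sketch — my own illustration, not Lovejoy's actual algorithm — for a two-state POMDP: represent b by b(s0), snap every updated belief to a fixed grid, and run value iteration over the grid points. All model numbers are made up.

```python
import numpy as np

# Made-up two-state, two-action, two-observation POMDP.
T = {0: np.array([[0.9, 0.1], [0.1, 0.9]]),  # T[a][s, s'] = P(s'|s, a)
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),  # O[a][s', o] = P(o|s', a)
     1: np.array([[0.8, 0.2], [0.3, 0.7]])}
R = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}  # R[a][s] = R(s, a)
gamma, n = 0.95, 101
grid = np.linspace(0.0, 1.0, n)              # b summarized by p = b(s0)

def snap(p):
    """Index of the grid point nearest to b(s0) = p."""
    return int(round(p * (n - 1)))

V = np.zeros(n)
for _ in range(300):                         # value-iteration sweeps
    V_new = np.empty(n)
    for i, p in enumerate(grid):
        b = np.array([p, 1.0 - p])
        q_vals = []
        for a in (0, 1):
            pred = b @ T[a]
            val = float(b @ R[a])            # rho(b, a)
            for o in (0, 1):
                unnorm = O[a][:, o] * pred
                p_o = unnorm.sum()           # P(o | b, a)
                if p_o > 0.0:
                    b_next = unnorm / p_o
                    val += gamma * p_o * V[snap(b_next[0])]
            q_vals.append(val)
        V_new[i] = max(q_vals)
    V = V_new
print(V[snap(1.0)], V[snap(0.5)])            # approx. values at two beliefs
```

The grid resolution trades accuracy for computation: snapping b' to the nearest grid point introduces an approximation error that shrinks as n grows, which is exactly the intractability trade-off the summary below refers to.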

Summary

• MDP generalizes the Graph representation
• POMDP generalizes the MDP representation
• POMDP – a representation for problems where the state of relevant variables is NOT fully known
• Solving a POMDP can be posed as solving a Belief MDP (whose size is infinite, though)
• Approximation techniques exist, but intractability is still a huge issue for using POMDP planning in the real world