Solving POMDPs through Macro Decomposition
Larry Bush
Tony Jimenez
Brian Bairstow
Outline
• Introduction to POMDPs – Brian Bairstow
• Demonstration of POMDPs – Larry Bush
• Approximating POMDPs with Macro Actions – Tony Jimenez
Introduction to POMDPs
• Introduction to POMDPs
  – Markov Decision Processes (MDPs)
  – Value Iteration
  – Partially Observable Markov Decision Processes (POMDPs)
  – Overview of Techniques
• Demonstration of POMDPs
• Approximating POMDPs with Macro Actions
Navigation of a Building
• Robot
  – Knows the building map
  – Wants to reach the goal
  – Has uncertainty in its actions
  – What is the best way to get there?
Markov Decision Processes
• Model
  – States, S
  – Actions, A
  – Transition probabilities, p(s′|s,a)
  – Rewards, r(s,a)
• Process
  – Observe state sₜ in S
  – Choose action aₜ in A
  – Receive reward rₜ = r(sₜ,aₜ)
  – State becomes sₜ₊₁ according to the probabilities p(s′|sₜ,aₜ)
• Goal
  – Create a policy for choosing actions that maximizes the lifetime reward
• Discount factor γ
  – Value = r₀ + γ·r₁ + γ²·r₂ + …
MDP Model Example
• States A, B, C
• Actions 1, 2
• Transition probabilities p and rewards r in the diagram
• C is the terminal state
[Diagram: transition graph over states A, B, C. Reading the probabilities and rewards together with the decision-tree slides: from A, action 1 → A (p = 0.7, r = 0) or C (p = 0.3, r = 1); action 2 → B (p = 0.6, r = 0.2) or C (p = 0.4, r = 1). From B, action 1 → A (p = 0.3, r = 0) or C (p = 0.7, r = 1); action 2 → B (p = 0.5, r = 0.2) or C (p = 0.5, r = 1).]
Decision Tree Representation of MDP
[Decision tree rooted at A, one step: action 1 → A (p = 0.7, reward 0) or C (p = 0.3, reward 1), expected reward 0.3; action 2 → B (p = 0.6, reward 0.2) or C (p = 0.4, reward 1), expected reward 0.52. Max Q value at A: 0.52. Columns: Reward, Expected Reward, Max Q Value.]
Decision Tree Representation of MDP
[Two-step decision tree rooted at A. The one-step subtrees have max Q values 0.52 at A and 0.7 at B, so the total future rewards along the two branches are 0.52 (via A) and 0.9 = 0.2 + 0.7 (via B). Two-step Q values at A: action 1: 0.3·1 + 0.7·0.52 = 0.664; action 2: 0.4·1 + 0.6·(0.2 + 0.7) = 0.94. Max Q value at A: 0.94. Columns: Reward, Expected Reward, Total Future Reward, Max Q Value.]
Value Iteration
• Finite horizon, 1 step:
  Q₁(s,a) = r(s,a)
• Finite horizon, n steps:
  Qₙ(s,a) = r(s,a) + γ·Σₛ′ p(s′|s,a)·max_a′ Qₙ₋₁(s′,a′)
• Infinite horizon:
  Q(s,a) = r(s,a) + γ·Σₛ′ p(s′|s,a)·max_a′ Q(s′,a′)
• Policy:
  π(s) = argmax_a Q(s,a)
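To make the recursion concrete, here is a minimal Python sketch that runs finite-horizon value iteration on the three-state example MDP from the earlier slides. The dictionary layout and state names are illustrative; the transition and reward pairings are the ones read off the decision-tree figures.

```python
# Finite-horizon value iteration on the example MDP (states A, B, C; C is terminal).
GAMMA = 1.0  # the slides' two-step example is undiscounted

# T[s][a] = list of (next_state, probability, reward) triples
T = {
    "A": {1: [("A", 0.7, 0.0), ("C", 0.3, 1.0)],
          2: [("B", 0.6, 0.2), ("C", 0.4, 1.0)]},
    "B": {1: [("A", 0.3, 0.0), ("C", 0.7, 1.0)],
          2: [("B", 0.5, 0.2), ("C", 0.5, 1.0)]},
}

def value_iteration(n_steps):
    """Q_n(s,a) = sum_s' p(s'|s,a) * (r + gamma * max_a' Q_{n-1}(s',a')), n_steps >= 1."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}  # V_0 = 0 everywhere
    Q = {}
    for _ in range(n_steps):
        Q = {s: {a: sum(p * (r + GAMMA * V[s2]) for s2, p, r in outcomes)
                 for a, outcomes in acts.items()}
             for s, acts in T.items()}
        V = {s: max(q.values()) for s, q in Q.items()}
        V["C"] = 0.0  # C stays terminal
    return Q

print(value_iteration(2)["A"])  # {1: 0.664, 2: 0.94}, up to float rounding
```

The two-step values match the decision-tree slide: 0.664 for action 1 and 0.94 for action 2, so the policy at A chooses action 2.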
Q Reinforcement Learning
• Q can be calculated only if p(s′|s,a) and r(s,a) are known
• Otherwise Q can be trained:
  Qₜ₊₁(s,a) = (1−α)·Qₜ(s,a) + α·[R + γ·max_a′ Qₜ(s′,a′)]
• Perform trial runs; get data from observations
• Keep running combinations of s and a until convergence
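A minimal sketch of that training loop, assuming a hypothetical tabular environment with `reset()`, `step(a)`, and an `actions` list (this interface is illustrative, not from the talk):

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[R + gamma*max_a' Q(s',a')]."""
    Q = defaultdict(float)  # Q[(s, a)], implicitly 0 until visited
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration keeps visiting all (s, a) combinations
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, R, done = env.step(a)  # observe reward and next state
            target = R + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```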
Partially Observable MDPs
• Can no longer directly observe the state of the system
• Instead, at each step make an observation O, and know the probability of each observation in each state, p(O|S)
• Also have an initial state distribution p(s₀)
[Diagram: hidden states S at times t₁, t₂, t₃ evolve according to p(s′|s,a) under actions A; each hidden state generates an observable O via p(O|S).]
State Estimation
• States and observations are not independent of past observations
• Maintain a distribution of state probabilities, called a belief b:
  bₜ = p(sₜ | oₜ, aₜ, oₜ₋₁, aₜ₋₁, …, o₀, a₀, s₀)
[Diagram: observation o feeds the State Estimator, which outputs belief b to the Policy, which outputs action a.]
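The state estimator is a Bayes filter. A minimal sketch, assuming tabular models `p_trans[s][a][s2]` for p(s′|s,a) and `p_obs[s][o]` for p(o|s) (the names are illustrative):

```python
def belief_update(b, a, o, p_trans, p_obs, states):
    """One state-estimator step: b'(s') ∝ p(o|s') * sum_s p(s'|s,a) * b(s)."""
    b_new = {}
    for s2 in states:
        predicted = sum(p_trans[s][a][s2] * b[s] for s in states)  # action prediction
        b_new[s2] = p_obs[s2][o] * predicted                       # observation correction
    total = sum(b_new.values())
    return {s2: v / total for s2, v in b_new.items()}              # renormalize to sum to 1
```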
Value Iteration with POMDPs
[Plot: Q(b,a) versus b(s₂) ∈ [0, 1], showing the lines Q(a₁) and Q(a₂), where Q(a₁) = Q(s₁,a₁)·(1 − b(s₂)) + Q(s₂,a₁)·b(s₂)]
• The finite-horizon value function is piecewise linear and convex
  – Depending on the value of b(s₂), a different action is optimal
• This is a one-step finite-horizon problem
  – Need to consider the effect on future belief states
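A small sketch of the picture above: with two states, each action's value is linear in b(s₂), and maximizing over actions gives the piecewise-linear, convex upper surface (the Q numbers here are made up for illustration):

```python
# Q[s][a]: value of taking action a when the state is known to be s (illustrative numbers).
Q = {"s1": {"a1": 1.0, "a2": 0.2},
     "s2": {"a1": 0.0, "a2": 0.8}}

def q_belief(b_s2, a):
    """Q(b,a) = Q(s1,a)*(1 - b(s2)) + Q(s2,a)*b(s2): linear in the belief."""
    return Q["s1"][a] * (1 - b_s2) + Q["s2"][a] * b_s2

def best_action(b_s2):
    """Max over linear functions: piecewise linear and convex in b(s2)."""
    return max(("a1", "a2"), key=lambda a: q_belief(b_s2, a))

print(best_action(0.1), best_action(0.9))  # a1 near certainty in s1, a2 near s2
```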
Decision Tree Representation of POMDP
[Decision tree over beliefs: the root belief b₁ branches on actions a₁ and a₂; each action branches on observations o₁ and o₂ to successor beliefs b₂₁₁, b₂₁₂, b₂₂₁, b₂₂₂, each of which branches on a₁ and a₂ again.]
Approximation: Point-Based
• Concentrate on specific points in the continuous belief space
• How should the beliefs be chosen?
• Problems with high dimensionality
[Diagram: belief points b₁–b₈ scattered through belief space.]
Approximation: Monte Carlo
• Estimate the values of actions for samples of the belief space
• Interpolate for arbitrary values
• Computationally difficult to learn a policy
[Diagram: action a₁ evaluated at sampled beliefs b₁, bᵢ, bⱼ, bₖ.]
Approximation: Grid-Based
• Like the point-based approximation, but with regular spacing
• Can use variable resolution
• Size increases with dimensionality and resolution
[Diagram: a regular grid of beliefs b₁–b₁₄.]
Heuristic: Maximum-Likelihood
• Assume the state is the most likely state, and act according to the corresponding MDP policy
• Only works if this is a good assumption!
b = < 0.08, 0.02, 0.1, 0.8 >   POMDP
b = <    0,    0,   0,   1 >   MDP
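A sketch of the heuristic, where `mdp_policy` is a hypothetical precomputed policy for the underlying MDP:

```python
def most_likely_action(belief, mdp_policy):
    """Maximum-likelihood heuristic: treat the most probable state as the true
    state and return the MDP policy's action for it."""
    s_ml = max(belief, key=belief.get)  # e.g. the 0.8 entry in <0.08, 0.02, 0.1, 0.8>
    return mdp_policy[s_ml]
```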
Heuristic: Information Gathering
• An MDP is easy to solve
• So it can be valuable to take actions that gather information about the state
b = < 1, 0, 0, 0 >   vs.   b = < 0.1, 0.2, 0.4, 0.3 >
Heuristic: Hierarchical Method
• Split into a top-level POMDP and low-level POMDPs
• The smaller state spaces reduce the amount of computation
[Diagram: the belief space partitioned into low-level POMDPs, coordinated by a top-level "POMDP of POMDPs".]
Heuristic: Policy Search
• Reinforcement learning on the policy instead of the value function
• Value and Policy Search
  – Combine value and policy reinforcement learning into a single update rule
Outline
• Introduction to POMDPs
• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations
• Approximating POMDPs with Macro Actions
Demonstrations*
• MATLAB
• MDP
  – Map to Problem Domain
  – Value Iteration
  – Policy Execution
  – Visualization**
• POMDP
  – Map to Problem Domain
  – Belief Space Search
  – Policy Execution
  – Visualization**
* All coding is mine except for some basic MDP value iteration functionality.
** Visualizations run in line with code.
Problem : Introduction
UAV Land at Airport
[Visualization: airport map showing the landing runway, the take-off runway, and the UAV position (+).]
Outline
• Introduction to POMDPs
• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
    • Value Iteration
    • Policy Execution
  – POMDP Visualizations
• Approximating POMDPs with Macro Actions
Problem : Utility Overlay
[Visualization: MDP utility overlay on the map; legend shows the UAV position (+), rewards (#), state utilities, and an exclusion zone. Transition model: step cost 0.04; move probabilities 0.8 straight and 0.1 to each side.]
MDP Value Iteration : Until Convergence
[Visualization: state utilities and the resulting policy after value iteration; legend as above (step cost 0.04; move probabilities 0.8/0.1/0.1).]
MDP Policy Execution
[Visualization: the UAV executing the MDP policy; legend shows the UAV position (+), rewards (#), state utilities, and the policy. Transition model: step cost 0.04; move probabilities 0.8/0.1/0.1.]
Outline
• Introduction to POMDPs
• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations
    • Belief State Space Search
    • Policy Comparison
      – 3 Step Horizon
      – 10 Step Horizon
      – AIMA (Russell & Norvig)
      – Fuel Observation
• Approximating POMDPs with Macro Actions
POMDP Belief State Space Search : No Observations
[Visualization: belief state space search with a 3-step horizon; legend shows the UAV position (+), rewards (#), state probabilities, the policy, and the total belief-state reward.]
POMDP Policy Execution : No Observations
POMDP Policy (3 Iterations): < Right, Right, Right >
[Visualization: legend shows the UAV position (+), rewards (#), state probabilities, the policy, and the total belief-state reward.]
POMDP Policy Execution : No Observations
POMDP Policy (10 Iterations): < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >
[Visualization: legend as above.]
POMDP Policy Execution : No Observations
POMDP Policy (AIMA): < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >
[Visualization: legend as above.]
POMDP Policy Execution : Fuel Observation
POMDP Policy (Fuel Observation): < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >
[Visualization: legend as above.]
Demonstration Conclusion
• POMDPs are difficult to solve exactly
• Efficient approximate methods make them tractable
Outline
• Introduction to POMDPs
• Demonstration of POMDPs
• Approximating POMDPs with Macro Actions
  – Belief Compression using Grid Approximation
  – Macro-Actions in the Semi-MDP Framework
  – Reinforcement Learning with a Model
Approximation Methods
• Point-Based Approximations
• Monte Carlo Method
• Grid-Based Approximations
  – Dynamic Grid-Resolution
Policy Heuristics
• Maximum-Likelihood
• Information Gathering
• Hierarchical Methods
  – Macro-Actions
• Policy Search
Grid Approximation
[Diagram: the belief simplex with corners S1 (1,0,0), S2 (0,1,0), S3 (0,0,1), discretized at resolutions 1, 2, and 4; higher resolutions add regularly spaced grid points.]
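A resolution-r grid contains every belief whose coordinates are multiples of 1/r, i.e. the integer compositions of r scaled down; a minimal sketch of enumerating such a grid:

```python
def simplex_grid(n_states, resolution):
    """All beliefs over n_states whose entries are multiples of 1/resolution
    and sum to 1 (integer compositions of `resolution` into n_states parts)."""
    def compositions(total, parts):
        if parts == 1:
            yield (total,)
            return
        for head in range(total + 1):
            for rest in compositions(total - head, parts - 1):
                yield (head,) + rest
    return [tuple(k / resolution for k in c)
            for c in compositions(resolution, n_states)]

print([len(simplex_grid(3, r)) for r in (1, 2, 4)])  # [3, 6, 15] grid points
```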
Dynamic Grid Allocation
• Allocate grid points from a uniformly spaced grid dynamically
  – Simulate trajectories of the agent through the belief space
• When experiencing a belief state:
  – Find the grid point closest to the belief state
  – Add it to the set of grid points explicitly considered
• The derived set of grid points is small and adapted to the parts of belief space the agent typically inhabits
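A sketch of the snap-and-grow step described above. The largest-remainder rounding keeps the snapped coordinates summing to 1; the function names are illustrative, and the discretization and interpolation in the literature (Freudenthal-style triangulation) are more refined than this:

```python
def nearest_grid_point(b, resolution):
    """Snap a belief vector to a nearby resolution-r grid point by rounding
    r*b to integers that still sum to r (largest-remainder method)."""
    scaled = [x * resolution for x in b]
    floors = [int(x) for x in scaled]
    shortfall = resolution - sum(floors)  # units lost to flooring
    # give the leftover units to the coordinates with the largest remainders
    order = sorted(range(len(b)), key=lambda i: scaled[i] - floors[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return tuple(k / resolution for k in floors)

grid_table = {}  # dynamically grown set of grid points with their values

def touch(b, resolution, initial_value=0.0):
    """Add the grid point nearest to b the first time it is experienced."""
    g = nearest_grid_point(b, resolution)
    grid_table.setdefault(g, initial_value)  # the paper interpolates this from coarser grids
    return g
```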
Macro-Actions
• Temporally extended actions, for example:
  – Move-Down-Corridor
  – Go-to-Chicago
• Versus primitive actions, such as:
  – Turn-Right
  – One-Step-Forward
• POMDPs break down into a hierarchical structure of smaller POMDPs
Macro-Actions
[Diagram: an MDP over primitive one-step actions versus an SMDP whose macro-actions extend over multiple steps.]
Semi-Markov Decision Process
• Defined as a five-tuple (S, A, P, R, F)
  – S : finite set of states
  – A : set of actions
  – P : state and action transition probability function
  – R : reward function
  – F : transition-time probability function for each state-action pair
    • F(s′, τ | s, a) : the joint probability that a transition from state s to state s′ occurs after τ steps with action a
• Q-learning rule for SMDPs:
  Qₜ₊₁(s,a) = (1−α)·Qₜ(s,a) + α·[R + γ^τ·max_a′ Qₜ(s′,a′)]
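The only change from the one-step rule is the γ^τ discount, where τ is the observed transition time; a minimal sketch:

```python
def smdp_q_update(Q, s, a, R, s2, tau, actions, alpha=0.1, gamma=0.95):
    """SMDP Q-learning: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[R + gamma^tau * max_a' Q(s',a')].
    `tau` is the number of primitive steps the action took; `R` is the
    (discounted) reward accumulated during those steps. Q is a dict keyed by (s, a)."""
    target = R + gamma ** tau * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```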
Problem Reformulation
• Monte Carlo Updates of the Q-values for the SMDP are used to learn the policy
[Diagram: POMDP → (grid approximation) → MDP → (macro-actions) → SMDP]
Graphical Overview of the Algorithm
[Diagram: a belief trajectory b → b′ → b″ through the belief simplex; b is discretized to grid point g, and resulting beliefs are evaluated by interpolating over nearby grid points g1, g2, g3.]
Reinforcement Learning With a Model
1. Assume a current true state
   • The physical true location of the agent
   • A state s with b(s) > 0
2. Discretize the current belief state: b → gᵢ
   • gᵢ is the closest grid point in a regular discretization of the belief space
   • If gᵢ is missing, add it to the table of belief states to explicitly consider
   • Interpolate its initial value from coarser resolutions
Reinforcement Learning With a Model
3. Choose the best macro-action from the current Q values
   • Take a random action ε% of the time (exploration)
   • Interpolate over the Q values of the vertices of the sub-simplex that contains b
Reinforcement Learning With a Model
4. Estimate E[R(gᵢ, μ) + γᵗ·V(b′)] by sampling:
   • Sample a state s from the current grid belief state gᵢ
     – Choose a primitive action a according to the macro-action μ
     – Sample the next state s′ and observation z from the models
     – Store the reward and update the belief state
     – Set t = t+1, b = b′, s = s′, and repeat with the next primitive action until μ terminates
   • Compute the value of the resulting belief state b′
     – Interpolate over the vertices of the resulting belief sub-simplex
   • Repeat multiple times and average the estimates
Reinforcement Learning With a Model
5. Update the state-action value:
   Q(gᵢ, μ) = (1−α)·Q(gᵢ, μ) + α·[R + γᵗ·V(b′)]
6. Update the state value:
   V(gᵢ) = max_μ Q(gᵢ, μ)
Reinforcement Learning With a Model
7. Execute the macro-action starting from the belief state until termination
   • Generate observations during execution
   • Set b = b′ and s = s′, and go to step 2
8. Repeat this learning epoch multiple times starting from the same b
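Putting steps 1 through 8 together, a hedged sketch of one learning epoch. Here `terminal`, `sample_macro_rollout`, `interpolate_value`, and `execute_macro` are hypothetical stand-ins for the model, simulator, and sub-simplex interpolation machinery; `nearest_grid_point` is the snapping helper sketched earlier, and `Q` and `V` are assumed to be `defaultdict(float)` tables:

```python
import random

def learn_epoch(b0, s0, macros, resolution, Q, V,
                n_samples=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One epoch of reinforcement learning with a model over grid beliefs."""
    b, s = b0, s0
    while not terminal(b):                                # hypothetical goal test
        g = nearest_grid_point(b, resolution)             # step 2: discretize b -> g
        if random.random() < epsilon:                     # step 3: epsilon-greedy macro
            mu = random.choice(macros)
        else:
            mu = max(macros, key=lambda m: Q[(g, m)])
        # step 4: estimate E[R(g, mu) + gamma^t * V(b')] by sampled rollouts
        estimate = 0.0
        for _ in range(n_samples):
            R, t, b_next = sample_macro_rollout(g, mu)    # hypothetical simulator
            estimate += R + gamma ** t * interpolate_value(V, b_next)
        estimate /= n_samples
        # steps 5-6: update the state-action and state values
        Q[(g, mu)] = (1 - alpha) * Q[(g, mu)] + alpha * estimate
        V[g] = max(Q[(g, m)] for m in macros)
        # step 7: actually execute the macro-action, generating observations
        b, s = execute_macro(b, s, mu)                    # hypothetical executor
    # step 8: the caller repeats this epoch multiple times from the same b0
```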
Tested Results
• Benefits of using dynamic grid-resolution
  – For 1068 world states and a resolution of 4, the number of grid points is:
    • ≈ 5.4 × 10¹⁰ with regular discretization
    • ~3500 with dynamic grid-resolution
    • ~1000 with dynamic grid-resolution and macro-actions
Tested Results
• Benefits of using macro-actions
  – Requires fewer training steps per episode
  – Performs better with increased grid-resolution in the dynamic grid model
  – Better-quality policy overall
• Benefits of using reward-shaping
  – 90~95% successful completion of the test with macro-actions
  – 0~10% without reward-shaping
What Is Next?
• Able to solve POMDPs with 1000+ states
• How do we generate a set of “good” macro-actions?
  – Current research is focused on the dynamic generation of macro-actions
End
An Extra Visualization
• This visualization (and others) was cut from the talk for brevity. However, we thought it was worthwhile to include this one in the tutorial slides.
POMDP Belief State Space Search : Coastal Observation
[Visualization: belief state space search with a 3-step horizon and a coastal observation; legend shows the UAV position (+), rewards (#), state probabilities, the policy, and the total belief-state reward.]
POMDP Policy Execution : Coastal Observation
POMDP Policy (Coastal Observation): < Up, Up, Right, Up, Right, Up, Right, Up, Right >
[Visualization: legend as above.]