an overview of mdp planning research at icapsmausam/papers/tut13.pdf · university of washington,...
TRANSCRIPT
![Page 1: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/1.jpg)
An Overview of MDP Planning Research
at ICAPS
Andrey Kolobov and Mausam
Computer Science and Engineering
University of Washington, Seattle
1
![Page 2: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/2.jpg)
What is ICAPS?
• International Conference onAutomated Planning and Scheduling
– 23rd conference: June 10-14, Rome, Italy.
– 24th conference: June 21-26, Portsmouth, NH
• Traditional community: classical planning
• ~15 years: significant planning under uncertainty– Discrete MDPs
– Discrete POMDPs
– other kinds of uncertainty (e.g., disjunctive)
– ~continuous state spaces
2
![Page 3: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/3.jpg)
What is ICAPS?
• International Conference onAutomated Planning and Scheduling
– 23rd conference: June 10-14, Rome, Italy.
– 24th conference: June 21-26, Portsmouth, NH
• Traditional community: classical planning
• ~15 years: significant planning under uncertainty– Discrete MDPs
– Discrete POMDPs
– other kinds of uncertainty (e.g., disjunctive)
– ~continuous state spaces
3
![Page 4: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/4.jpg)
International Probabilistic Planning Competition
• IPPCs: 2004, 2006, 2008, 2011
• Salience
– shapes the research in the community– shapes the research in the community
• Slightly controversial history
– ~deterministic planners beat MDP planners (‘04, ‘06)
– how to represent a problem: PPDDL vs. RDDL (‘11)
4
![Page 5: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/5.jpg)
Strength/Bias of AI Planning
• Causal planning
• Temporal planning
• Resource planning
• Spatial planning
• …
5
![Page 6: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/6.jpg)
Does Planning Want to be Slow?
• In theory– yes (classical planning in STRIPS is PSPACE)
– YES (MDP planning in factored problems in EXPTIME)
• In practice– NO– NO
– It is what you make it out to be• optimal
• quality bound
• time bound
• anytime
• interleaved planning and execution
• …6
![Page 7: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/7.jpg)
3 Key Messages
• M#0: No need for exploration-exploitation tradeoff
– planning is purely a computational problem (V.I. vs. Q)
• M#1: Search in planning
– states can be ignored or reordered for efficient computation
• M#2: Representation in planning
– develop interesting representations for Factored MDPs
� Exploit structure to design domain-independent algorithms
• M#3: Goal-directed MDPs
– design algorithms that use explicit knowledge of goals
7
![Page 8: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/8.jpg)
Outline
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
8
![Page 9: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/9.jpg)
Shameless Plug
9
![Page 10: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/10.jpg)
Outline
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
10
![Page 11: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/11.jpg)
Stochastic Shortest-Path MDPs
SSP MDP is a tuple <S, A, T, C, G>, where:• S is a finite state space
• A is a finite action set
• T: S x A x S �[0, 1] is a stationary transition function
• C: S x A x S �R is a stationary cost function (= -R: S x A x S �R)
• G is a set of absorbing cost-free goal states
[Bertsekas 1995]
• G is a set of absorbing cost-free goal states
Under two conditions:• There is a proper policy (reaches a goal with P= 1 from all states)
• Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1
11
![Page 12: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/12.jpg)
SSP MDP Objective Function
• Find policy ¼: S � A
• Minimize expected (undiscounted) cost to reach a goal
– agent acts for an indefinite horizon
• A cost-minimizing policy is always proper!
12
![Page 13: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/13.jpg)
Bellman Equations
13
![Page 14: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/14.jpg)
Not an SSP MDP Example
S1 S2
a1
C(s1, a1, s2) = 1
a2
a2
C(s1, a2, s1) = 7.2
C(s2, a2, sG) = 1
SG
C(sG, a2, sG) = 0
T(s2, a2, sG) = 0.3
a
a2
No cost-free “loops” allowed!
14
C(s2, a1, s1) = -1
C(sG, a1, sG) = 0
C(s2, a2, s2) = -3
T(s2, a2, sG) = 0.7
S3
C(s3, a2, s3) = 0.8C(s3, a1, s3) = 2.4
a1 a2
a1
No dead ends allowed!
a1
![Page 15: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/15.jpg)
SSP and Other MDP Classes
SSPInfinite Horizon
Discounted
Reward MDPs
Finite Horizon
MDPs
• Will concentrate on SSP in the rest of the tutorial
15
![Page 16: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/16.jpg)
Discounted Reward MDP � SSP[Bertsekas&Tsitsiklis 95]
S0a
C=-r1
C=-r
T= γ t1
S1
S C=0
S0a
R=r1
T=t1
S1
S
16
C=-r2
T= γ t2
S2
SG
C=0
T=1-γ
C=0
T=1-γR=r2
T=t2
S2
![Page 17: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/17.jpg)
Outline
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
17
![Page 18: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/18.jpg)
Algorithms
• Value Iteration
• Policy Iteration
• Asynchronous Value Iteration
– Prioritized Sweeping• Many priority functions (some better for goal-directed MDPs)• Many priority functions (some better for goal-directed MDPs)
– Backward Value Iteration: priorities w/o priority queue
– Partitioned Value Iteration• Topological Value Iteration
– …
• Linear Programming
18
![Page 19: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/19.jpg)
Backward VI [Dai&Hansen 07]
• Prioritized Sweeping has priority queue overheads
– Prioritized VI without priority queue!
• Backup states in reverse order starting from goal
– don‘t repeat a state in an iteration
M#3: some algorithms
use explicit
knowledge of goals
– don‘t repeat a state in an iteration
– other optimizations • (backup only states in current greedy subgraph)
• Characteristics
– no overhead of priority queue
– good information flow
19
![Page 20: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/20.jpg)
(General) Partitioned VI
20
How to construct a partition?How many backups to perform per partition?
How to construct priorities?
![Page 21: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/21.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
M#1: states can be
reordered for
efficient computation
21
s0
s2
s1
sgPr=0.6
a00 s4
s3
Pr=0.4a01
a21 a1
a20 a40
C=5a41
a3C=2
![Page 22: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/22.jpg)
Other Benefits of Partitioning
• External-memory algorithms
– PEMVI [Dai et al 08, 09]
• partitions live on disk
• get each partition from the disk and backup all states
• Cache-efficient algorithms
– P-EVA algorithm [Wingate&Seppi 04a]
• Parallelized algorithms
– P3VI (Partitioned, Prioritized, Parallel VI) [Wingate&Seppi 04b]
22
![Page 23: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/23.jpg)
Outline of the Tutorial
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
23
![Page 24: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/24.jpg)
Heuristic Search
• Insight 1
– knowledge of a start state to save on computation
~ (all sources shortest path � single source shortest path)
• Insight 2
– additional knowledge in the form of heuristic function
~ (dfs/bfs � A*)
24
![Page 25: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/25.jpg)
Model
• SSP (as before) with an additional start state s0
– denoted by SSPs0
• Solution to an SSP : partial policy closed w.r.t. s• Solution to an SSPs0: partial policy closed w.r.t. s0
– defined for all states s’ reachable by ¼s0
starting from s0
– many states are irrelevant (e.g., not reachable from s0)
25
![Page 26: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/26.jpg)
Heuristic Function
• h(s): S!!!!R
– estimates V*(s)
– gives an indication about “goodness” of a state
– usually used in initialization V0(s) = h(s)– usually used in initialization V0(s) = h(s)
– helps us avoid seemingly bad states
• Define admissible heuristic
– optimistic
– h(s) ···· V*(s)
26
![Page 27: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/27.jpg)
A General Scheme for
Heuristic Search in MDPs• Two (over)simplified intuitions– Focus on states in greedy policy wrt V rooted at s0
– Focus on states with residual > ²
• Find & Revise: [Bonet&Geffner 03]
– repeat– repeat• find a state that satisfies the two properties above
• perform a Bellman backup
– until no such state remains
• Convergence to V* is guaranteed– if heuristic function is admissible
– no state in greedy policy gets starved in 1111 FIND steps
27
![Page 28: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/28.jpg)
A* � LAO*
regular graph
soln:(shortest) path
A*
acyclic AND/OR graph
soln:(expected shortest)acyclic graph
AO* [Nilsson’71]
cyclic AND/OR graph
soln:(expected shortest)cyclic graph
LAO* [Hansen&Zil.’98]
All algorithms able to make effective use of reachability information!
![Page 29: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/29.jpg)
s0
s1 s2 s3 s4
LAO*
s0V(s0) = h(s0)
29
Sgs5 s6 s7 s8
add s0 in the fringe and in greedy graph
![Page 30: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/30.jpg)
s0
s1 s2 s3 s4
LAO*
s0V(s0) = h(s0)
30
Sgs5 s6 s7 s8
FIND: expand best state s on the fringe (in greedy graph)
![Page 31: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/31.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4
V(s0)
h h h h
31
Sgs5 s6 s7 s8
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
![Page 32: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/32.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4
V(s0)
h h h h
32
Sgs5 s6 s7 s8
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
![Page 33: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/33.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h h h
V(s0)
33
Sgs5 s6 s7 s8 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h h
![Page 34: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/34.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h h h
V(s0)
34
Sgs5 s6 s7 s8 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h h
![Page 35: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/35.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h V h
V
35
Sgs5 s6 s7 s8 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h h
![Page 36: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/36.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h V h
V
36
Sgs5 s6 s7 s8 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h h
![Page 37: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/37.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h V hV
V
37
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 38: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/38.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4h h V hV
V
38
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 39: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/39.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V h V hV
V
39
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 40: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/40.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V h V hV
V
40
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 41: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/41.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V V V hV
V
41
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 42: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/42.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V V V hV
V
42
Sgs5 s6 s7 s8 Sgs5 s6 s7
FIND: expand best state s on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
h hh 0
![Page 43: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/43.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V V V hV
V
43
Sgs5 s6 s7 s8 Sgs5 s6 s7
output the greedy graph as the final policy
V hh 0
![Page 44: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/44.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V V V hV
V
44
Sgs5 s6 s7 s8 Sgs5 s6 s7
output the greedy graph as the final policy
V hh 0
![Page 45: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/45.jpg)
s0
s1 s2 s3 s4
LAO*
s0
s1 s2 s3 s4V V V hV
VM#1: some states
can be ignored for
efficient compuation
45
Sgs5 s6 s7 s8 Sgs5 s6 s7
s4 was never expandeds8 was never touched
V hh 0 s8
![Page 46: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/46.jpg)
Real Time Dynamic Programming[Barto et al 95]
• Original Motivation
– agent acting in the real world
• Trial
– simulate greedy policy starting from start state;– simulate greedy policy starting from start state;
– perform Bellman backup on visited states
– stop when you hit the goal
• RTDP: repeat trials forever
– Converges in the limit #trials !!!! 1111
46
![Page 47: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/47.jpg)
Trial
s0
s1 s2 s3 s4
47
Sg
s1 s2 s3 s4
s5 s6 s7 s8
![Page 48: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/48.jpg)
Trial
s0
s1 s2 s3 s4h h h h
V
48
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h h h
start at start state
repeat
perform a Bellman backup
simulate greedy action
![Page 49: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/49.jpg)
Trial
s0
s1 s2 s3 s4h h h h
V
49
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h h h
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 50: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/50.jpg)
Trial
s0
s1 s2 s3 s4h h V h
V
50
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 51: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/51.jpg)
Trial
s0
s1 s2 s3 s4h h V h
V
51
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 52: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/52.jpg)
Trial
s0
s1 s2 s3 s4h h V h
V
52
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
start at start state
repeat
perform a Bellman backup
simulate greedy action
V h
![Page 53: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/53.jpg)
Trial
s0
s1 s2 s3 s4h h V h
V
53
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
start at start state
repeat
perform a Bellman backup
simulate greedy action
until hit the goal
V h
![Page 54: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/54.jpg)
Trial
s0
s1 s2 s3 s4h h V h
V
54
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
start at start state
repeat
perform a Bellman backup
simulate greedy action
until hit the goal
V h
RTDP
repeatforever
![Page 55: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/55.jpg)
Real Time Dynamic Programming[Barto et al 95]
• Original Motivation
– agent acting in the real world
• Trial
– simulate greedy policy starting from start state;– simulate greedy policy starting from start state;
– perform Bellman backup on visited states
– stop when you hit the goal
• RTDP: repeat trials forever
– Converges in the limit #trials !!!! 1111
55
No terminationcondition!
![Page 56: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/56.jpg)
• Admissible heuristic & monotonicity
⇒⇒⇒⇒ V(s) · V*(s)
⇒⇒⇒⇒ Q(s,a) · Q*(s,a)
• Label a state s as solved
– if V(s) has converged
[Bonet&Geffner 03b]
– if V(s) has convergedbest action
ResV(s) < ²²²²
) ) ) ) V(s) won’t change!label s as solved
sgs
![Page 57: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/57.jpg)
Labeling (contd)
best action
sgs
s'
57
ResV(s) < ²²²²
s' already solved) ) ) ) V(s) won’t change!
label s as solved
s'
![Page 58: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/58.jpg)
Labeling (contd)
best action
sgs
s'
best action
sgs
s'best action
M#3: some algorithms
use explicit
knowledge of goals
58
ResV(s) < ²²²²
s' already solved
) ) ) ) V(s) won’t change!
label s as solved
s'
ResV(s) < ²²²²
ResV(s’) < ²²²²
V(s), V(s’) won’t change!label s, s’ as solved
s'
M#1: some states
can be ignored for
efficient computation
![Page 59: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/59.jpg)
Labeled RTDP [Bonet&Geffner 03b]
repeat
s ÃÃÃÃ s0
label all goal states as solved
repeat //trials
REVISE s; identify agreedy
M#3: some algorithms
use explicit
knowledge of goals
REVISE s; identify agreedy
FIND: sample s’ from T(s, agreedy, s’)
s ÃÃÃÃ s’
until s is solved
for all states s in the trial
try to label s as solved
until s0 is solved
59
M#1: some states
can be ignored for
efficient computation
![Page 60: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/60.jpg)
LRTDP Extensions
• Different ways to pick next state
• Different termination conditions
• Bounded RTDP [McMahan et al 05]• Bounded RTDP [McMahan et al 05]
• Focused RTDP [Smith&Simmons 06]
• Value of Perfect Information RTDP [Sanner et al 09]
60
![Page 61: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/61.jpg)
Action Elimination
Q(s,a1) Q(s,a2)
If Ql(s,a1) > Vu(s) then a1 cannot be optimal for s.
61
![Page 62: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/62.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
62
s0
s2
s1
sgPr=0.6
a00 s4
s3
Pr=0.4a01
a21 a1
a20 a40
C=5a41
a3C=2
![Page 63: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/63.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
63
s0
s2
s1
sgPr=0.6
a00 s4
s3
Pr=0.4a01
a21 a1
a20 a40
C=5a41
a3C=2
![Page 64: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/64.jpg)
Focused Topological VI [Dai et al 09]
• Topological VI
– hopes there are many small connected components
– can‘t handle reversible domains…
• FTVI
– initializes Vl and Vu
– LAO*-style iterations to update Vl and Vu
– eliminates actions using action-elimination
– Runs TVI on the resulting graph
64
![Page 65: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/65.jpg)
Where do Heuristics come from?
• Domain-dependent heuristics
• Domain-independent heuristics
– dependent on specific domain representation– dependent on specific domain representation
65
M#2: factored
representations
expose useful
problem structure
![Page 66: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/66.jpg)
Outline of the Tutorial
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
66
![Page 67: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/67.jpg)
Symbolic Representations
• Use of Algebraic Decision Diagrams for MDPs
– SPUDD (Symbolic VI) [Hoey et al 99]
– Symbolic LAO* [Feng&Hansen 02]
– Symbolic RTDP [Feng&Hansen 03]– Symbolic RTDP [Feng&Hansen 03]
• XADDs [Sanner et al 11]
• FODDs [Joshi et al 10]
67
M#2: factored
representations
expose useful
problem structure
![Page 68: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/68.jpg)
Outline of the Tutorial
• Introduction
• MDP Model
• Uninformed Algorithms• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
68
![Page 69: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/69.jpg)
Motivation
• Even π* closed wr.t. s0 is often too large to fit in memory…
• … and/or too slow to compute …
• … for MDPs with complicated characteristics• … for MDPs with complicated characteristics
– Large branching factors/high-entropy transition function
– Large distance to goal
– Etc.
• Must sacrifice optimality to get a “good enough” solution
69
![Page 70: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/70.jpg)
Overview
Determinization-based
techniques
Heuristic search with
inadmissible
heuristics
OfflineOnline
70
Monte-Carlo planning
Hybridized
planning
Hierarchical
planning
Dimensionality
reduction
![Page 71: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/71.jpg)
Overview
• Not a “golden standard” classification
– In some aspects, arguable
– Others possible, e.g., optimal vs. suboptimal in the limit
• Approaches differ in the quality aspect they sacrifice
– Probability of reaching the goal
– Expected cost of reaching the goal
– Both
71
![Page 72: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/72.jpg)
Approximation Algorithms
�Overview
• Determinization-based Algorithms
• Heuristic Search with Inadmissible Heuristics
• Dimensionality Reduction• Dimensionality Reduction
• Monte-Carlo Planning
• Hierarchical Planning
• Hybridized Planning
72
![Page 73: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/73.jpg)
Determinization-based Techniques
• High-level idea:1. Compile MDP into its determinization
2. Generate plans in the determinization
3. Use the plans to choose an action in the curr. state
4. Execute, repeat M#2: factored
representations 4. Execute, repeat
• Key insight:– Classical planners are very fast
• Key assumption:– The MDP is goal-oriented in factored representation
73
M#3: some algorithms
use explicit
knowledge of goals
representations
expose useful
problem structure
![Page 74: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/74.jpg)
Detour: Factored MDPs
• How to describe an MDP instance?– S = {s1, … , sn} – flat representation
– T(si, aj, sk) = pi,j,k for every state, action, state triplet
– …
• Flat representation too cumbersome!• Flat representation too cumbersome!– Real MDPs have billions of billions of states
– Can’t enumerate transition function explicitly
• Flat representation too uninformative!– State space has no meaningful distance measure
– Tabulated transition/reward function doesn’t expose structure
74
![Page 75: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/75.jpg)
Detour: Example Factored MDP
• S = {(x1, …, xn) | x1 in X1, … , xn in Xn}
• A =x1 ┐x3, x5
Effects
…• T =
• C =
75
x2, ┐x4
0.70.3
2
Precondition
…
![Page 76: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/76.jpg)
Back to Determinization-based Planning:
All-Outcome Determinization
Each outcome of each probabilistic action � separate action
x1 ┐x3, x5 x1 ┐x3, x5
76
x2, ┐x4 x2, ┐x4 x2, ┐x4
![Page 77: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/77.jpg)
Most-Likely-Outcome Determinization
x1 ┐x3, x5 ┐x3, x5
0.70.3
77
x2, ┐x4 x2, ┐x4
0.70.3
![Page 78: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/78.jpg)
FF-Replan: Overview & Example1) Find a goal plan in
a determinization
2) Try executing it
in the original MDP
3) Replan&repeat if
unexpected outcome
[Yoon, Fern, Givan, 2007]
G Gs'2s''2
s0
s1
s0
s'1
s0 s'1
s'1
![Page 79: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/79.jpg)
FF-Replan: Theoretical Properties
• Optimizes the MAXPROB criterion – PG of reaching the goal– In SSPs, this is always 1.0 – FF-Replan always tries to avoid cycles!
– Super-efficient on SSPs w/o dead ends
– Won two IPPCs (-2004 and -2006) against much more sophisticated techniques
• Ignores probability of deviation from the found plan– Results in long-winded paths to the goal
– Troubled by probabilistically interesting MDPs [Little, Thiebaux, 2007]• There, an unexpected outcome may lead to catastrophic consequences
• In particular, breaks down in the presence of dead ends– Originally designed for MDPs without them
79
![Page 80: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/80.jpg)
FF-Hindsight: Putting “Probabilistic” Back
Into Planning• FF-Replan is oblivious to probabilities
– Its main undoing
– How do we take them into account?
• Suppose you knew the future– Knew would happen if you executed action a t steps from now– Knew would happen if you executed action a t steps from now
– The problem would be deterministic, could easily solve it
• Don’t actually know the future…– …but can sample several futures probabilistically
– Solutions tell which actions are likely to start successful policies
• Basic idea behind FF-Hindsight
80
![Page 81: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/81.jpg)
FF-Hindsight: ExampleAction
Probabilistic
Outcome
Time 1
Objective: Optimize MAXPROB criterionLeft Outcomes
are more likely
A1 A2
I
[Yoon, Fern, Givan, Kambhampati, 2010]
Time 2
Goal State
81
Action
State
Dead End
A1 A2 A1 A2 A1 A2 A1 A2
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 82: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/82.jpg)
FF-Hindsight: Sampling a Future-1Action
Probabilistic
Outcome
Time 1
Maximize Goal Achievement
A1 A2
I
Time 2
Goal State
82
Action
State
Dead EndA1: 1
A2: 0
A1 A2 A1 A2 A1 A2 A1 A2
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 83: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/83.jpg)
FF-Hindsight: Sampling a Future-2Action
Probabilistic
Outcome
Time 1
Maximize Goal Achievement
A1 A2
I
Time 2
Goal State
83
Action
State
Dead EndA1: 2
A2: 1
A1 A2 A1 A2 A1 A2 A1 A2
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 84: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/84.jpg)
Providing Solution Guarantees
• FF-Replan provides no solution guarantees
– May have PG = 0 on SSPs with dead ends, even if P*G > 0
• FF-Hindsight provides only theoretical guarantees• FF-Hindsight provides only theoretical guarantees
– Practical implementations too distinct from theory
• RFF (Robust FF) provides quality guarantees in practice
– Constructs a policy tree out of deterministic plans
84
![Page 85: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/85.jpg)
RFF: Initialization
G
1. Generate either the AO or MLO determinization. Start with the
policy graph consisting of the initial state s0 and all goal states G
[Teichteil-Königsbuch, Infantes , Kuter, 2010]
85
S0G
![Page 86: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/86.jpg)
RFF: Finding an Initial Plan
G
2. Run FF on the chosen determinization and add all the states
along the found plan to the policy graph.
86
S0 G
![Page 87: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/87.jpg)
RFF: Adding Alternative Outcomes
3. Augment the graph with states to which other outcomes of the
actions in the found plan could lead and that are not in the graph
already. They are the policy graph’s fringe states.
87
S0 G
![Page 88: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/88.jpg)
RFF: Run VI
4. Run VI to propagate heuristic values of the newly added states.
This possibly changes the graph’s fringe and helps avoid dead ends!
88
S0 G
![Page 89: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/89.jpg)
RFF: Computing Replanning Probability
5. Estimate the probability P(failure) of reaching the fringe states
(e.g., using Monte-Carlo sampling) from s0. This is the current
partial policy’s failure probability w.r.t. s0.
89
S0 G
If P(failure) > ε
P(failure) = ?
Else, done!
![Page 90: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/90.jpg)
RFF: Finding Plans from the Fringe
G
6. From each of the fringe states, run FF to find a plan to reach
the goal or one of the states already in the policy graph.
90
S0G
Go back to step 3: Adding Alternative Outcomes
![Page 91: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/91.jpg)
Summary of Determinization Approaches
• Revolutionized SSP MDPs approximation techniques– Harnessed the speed of classical planners
– Eventually, started to take into account probabilities
– Help optimize for a “proxy” criterion, MAXPROB
• Classical planners help by quickly finding paths to a goal• Classical planners help by quickly finding paths to a goal– Takes “probabilistic” MDP solvers a while to find them on their own
• However…– Still almost completely disregard expect cost of a solution
– Often assume uniform action costs (since many classical planners do)
– So far, not useful on FH and IHDR MDPs turned into SSPs• Reaching a goal in them is trivial, need to approximate reward more directly
– Impractical on problems with large numbers of outcomes
91
![Page 92: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/92.jpg)
Approximation Algorithms
�Overview
�Determinization-based Algorithms
• Heuristic Search with Inadmissible Heuristics
• Dimensionality Reduction• Dimensionality Reduction
• Monte-Carlo Planning
• Hierarchical Planning
• Hybridized Planning
92
![Page 93: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/93.jpg)
The FF Heuristic
• Taken directly from deterministic planning
• Uses the all-outcome determinization of an MDP– But ignores the delete effects (negative literals in action outcomes)
[Hoffmann and Nebel, 2001]
x1 ┐x3, x5 x1 ┐x3, x5
• hFF(s) = approximately optimal cost of a plan from s to a goal in the delete relaxation
• Very fast due to using the delete relaxation
• Very informative93
x2, ┐x4 x2, ┐x4 x2, ┐x4
M#3: some
algorithms use
explicit knowledge
of goals
M#2: factored
representations
expose useful
problem structure
![Page 94: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/94.jpg)
The GOTH Heuristic
• Designed for MDPs at the start (not adapted classical)
• Motivation: would be good to estimate h(s) as cost of a
non-relaxed deterministic goal plan from s
[Kolobov, Mausam, Weld, 2010a]
– But too expensive to call a classical planner from every s
– Instead, call from only a few s and generalize estimates to others
• More details in Mausam’s talk…
94
![Page 95: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/95.jpg)
Approximation Algorithms
�Overview
�Determinization-based Algorithms
�Heuristic Search with Inadmissible Heuristics
• Dimensionality Reduction• Dimensionality Reduction
• Monte-Carlo Planning
• Hierarchical Planning
• Hybridized Planning
95
![Page 96: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/96.jpg)
Dimensionality Reduction: Motivation
• No approximate methods so far explicitly try to save space
• Dimensionality reduction attempts to do exactly that
– Insight: V* and π* are functions of ~|S| parameters (states)
– Replace it with an approximation with r << |S| params …– Replace it with an approximation with r << |S| params …
– … in order to save space
• How to do it?
– Factored representations are crucial for this
– View V & π as functions of state variables, not states themselves
– Obtain r basis functions bi, let V*(s) ≈ ∑iwi bi(s), find wi s
96
M#2: factored
representations
expose useful
problem structure
![Page 97: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/97.jpg)
ReTrASE
• Derives basis functions from the all-outcomes
determinization
• Sets V(s) = mini wiBi(s)
[Kolobov, Mausam, Weld, 2009]
• Learns weights wi by running an RTDP-like procedure
• More details in Mausam’s talk…
97
![Page 98: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/98.jpg)
Approximate LP
• Let V*(s) ≈ ∑iwi bi(s), where bi's are given by the user
• Insight: assume each b depends on at most z << |X| vars
[Guestrin, Koller, Parr, Venkataraman, 2003]
• Insight: assume each b depends on at most z << |X| vars
• Then, can formulate LP with only O(2z) constraints to find
wi's
– Much smaller than |S| = O(2|X|)
98
![Page 99: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/99.jpg)
FPG
• Directly learns a policy, not a value function
• For each action, defines a desirability function
fa (X1, …, Xn)θa
[Buffet and Aberdeen, 2006, 2009]
• Mapping from state variable values to action “quality”
– Represented as a neural network
– Parameters to learn are network weights θa,1, …, θa,m for each a
99
X1 Xn… …
θa,1 θa,2 θa,m-1 θa,mθa, …
![Page 100: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/100.jpg)
FPG
• Policy (distribution over actions) is given by a softmax
• To learn the parameters:
– Run trials (similar to RTDP)
– After taking each action, compute the gradient w.r.t. weights
– Adjust weights in the direction of the gradient
– Makes actions causing expensive trajectories to be less desirable
100
![Page 101: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/101.jpg)
Approximation Algorithms
�Overview
�Determinization-based Algorithms
�Heuristic Search with Inadmissible Heuristics
�Dimensionality Reduction�Dimensionality Reduction
• Monte-Carlo Planning
• Hierarchical Planning
• Hybridized Planning
101
![Page 102: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/102.jpg)
What Planning Approaches Couldn’t Do
(Until Recently)
• The Sysadmin problem:
Restart(Ser1)
P(Ser t |Restartt-1(Ser ), Ser t-1, Ser t-1)
Time t-1 Time t
T:A: R: ∑i [Seri = ↑]
102
Restart(Ser2)
Restart(Ser3)
P(Ser1t |Restartt-1(Ser1), Ser1
t-1, Ser2t-1)
P(Ser2t |Restartt-1(Ser2), Ser1
t-1, Ser2t-1, Ser3
t-1)
P(Ser3t |Restartt-1(Ser3), Ser2
t-1, Ser3t-1)
![Page 103: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/103.jpg)
Monte-Carlo Planning: Motivation
• Characteristics of Sysadmin:
– FH MDP turned SSPs0 MDP
• Reaching the goal is trivial, determinization approaches not really helpful
– Enormous reachable state space
– High-entropy T (2|X| outcomes per action, many likely ones)– High-entropy T (2|X| outcomes per action, many likely ones)
• Building determinizations can be super-expensive
• Doing Bellman backups can be super-expensive
• Solution: Monte-Carlo Tree Search
– Does not manipulate T or C/R explicitly – no Bellman backups
– Relies on a world simulator – indep. of MDP description size
103
![Page 104: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/104.jpg)
UCT: A Monte-Carlo Planning Algorithm
• UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it– Similar principle as RTDP
– But action selection, value updates, and guarantees are different
• Success stories:• Success stories:– Go (thought impossible in ‘05, human grandmaster level at 9x9 in ‘08)
– Klondike Solitaire (wins 40% of games)
– General Game Playing Competition
– Real-Time Strategy Games
– Probabilistic Planning Competition (-2011)
– The list is growing…
104
![Page 105: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/105.jpg)
Current World State
1/2
When all node actions tried once, select action according to tree policy
0
Tree
Policy
a1 a2),(
)(ln),(maxarg)(
asn
sncasQs aUCT +=π
UCT Example
1
1
1
0
0
0
0
0
0
0
Slide courtesy of A. Fern105
Rollout
Policy
![Page 106: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/106.jpg)
• Rollout policy:
– Basic UCT uses random
• Tree policy:
UCT Details
:– Q’(s,a) : average reward received in current trajectories after
taking action a in state s
– n(s,a) : number of times action a taken in s
– n(s) : number of times state s encountered
),(
)(ln),('maxarg)(
asn
sncasQs aUCT +=π
106
Exploration term
![Page 107: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/107.jpg)
• Optimizes cumulative regret
– Total amount of regret during state space exploration
– Appropriate in RL
UCT Caveat
• In planning, simple regret is more appropriate
– Regret during state space exploration is fictitious
– Only regret of the final policy counts!
107
![Page 108: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/108.jpg)
BRUE: MCTS for Planning
• Modifies UCT to minimize simple regret directly
• In UCT, simple regret diminishes at a polynomial
rate in # samples
[Feldman and Domshlak, 2012]
rate in # samples
• In BRUE, simple regret diminishes at an
exponential rate
108
![Page 109: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/109.jpg)
Summary
• Surveyed 4 different approximation families
– Dimensionality reduction
– Inadmissible heuristic search
– Dimensionality reduction
– MCTS– MCTS
– Hierarchical planning
– Hybridized planning
• Sacrifice different solution quality aspects
• Lots of work to be done in each of these areas
109
![Page 110: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/110.jpg)
Beyond Stochastic Shortest-Path MDPs
S1 S2
a1
a2
a2
SG
a1
No dead ends
a1
a2
[Kolobov, Mausam, Weld, 2012]
110
S3a1 a2
No dead ends allowed in SSP!
-But real problems do have dead ends!
-What does the “best” policy mean in the presence of dead
ends? How do we find it?
- More details in the talk on pitfalls of goal-oriented MDPs
![Page 111: An Overview of MDP Planning Research at ICAPSmausam/papers/tut13.pdf · University of Washington, Seattle 1. What is ICAPS? •International Conference on Automated Planning and Scheduling](https://reader033.vdocuments.us/reader033/viewer/2022053021/60496127640cb94cdc24bcec/html5/thumbnails/111.jpg)
Conclusion: 3 Key Messages
• M#0: No need for exploration-exploitation tradeoff– Planning is purely a computational problem
– VI, PI
• M#1: Search in planning– States can be ignored or reordered for efficient computation
– TVI
• M#2: Representation in planning– Efficient planning thanks to informative representations
– Det.-based planning, dimensionality reduction
• M#3: Goal-directed MDPs– Design algorithms that use explicit knowledge of goals
– Det.-based planning, heuristic seach
111