MCAI 2013 1
Partially-Observable Markov Decision Processes
Tom Dietterich
Markov Decision Process as a Decision Diagram
[Figure: decision diagram with nodes s_t, a_t, r_t, and s_{t+1}]
Note: We observe s_t before we choose a_t
All states, actions, and rewards are observed
What If We Can’t Directly Observe the State?
[Figure: decision diagram with nodes s_t, o_t, a_t, r_t, s_{t+1}, and o_{t+1}]
Note: We observe o_t before we choose a_t
Only the observations are observed, not the underlying states
POMDPs are Hard to Solve
• Tradeoff between taking actions to gain information and taking actions to change the world
  – Some actions can do both
Optimal Management of Difficult-to-Observe Invasive Species [Regan et al., 2011]
Branched Broomrape (Orobanche ramosa)
• Annual parasitic plant
• Attaches to root system of host plant
• Results in 75-90% reduction in host biomass
• Each plant makes ~50,000 seeds
• Seeds are viable for 12 years
Quarantine Area in S. Australia
375 farms; 70km x 70km area
[Map image: Google Maps]
Formulation as a POMDP: Single Farm
States: {Empty, Seeds, Plants & Seeds}
Actions: {Nothing, Host Denial, Fumigation}
Observations: {Absent, Present}
  – Detection probability
Rewards: Cost(Nothing), Cost(Host Denial), Cost(Fumigation)
Objective: 20-year discounted reward (discount = 0.96)
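The objective can be made concrete with a short sketch. Only the 20-year horizon and the 0.96 discount come from the formulation above; the function name and the per-year cost of 1 unit are hypothetical placeholders, since the slides do not give numeric costs.

```python
def discounted_return(rewards, discount=0.96):
    """Sum of discount**t * rewards[t] over years t = 0, 1, 2, ..."""
    return sum(discount ** t * r for t, r in enumerate(rewards))


HORIZON = 20  # years, from the formulation

# Hypothetical example: a management cost of 1 unit paid every year,
# written as a negative reward.
yearly_rewards = [-1.0] * HORIZON
total = discounted_return(yearly_rewards)
```

With a constant yearly cost the sum is a finite geometric series, so it can be checked against the closed form (1 - 0.96^20) / (1 - 0.96).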
State Diagram
Optimal MDP Policy
If plant is detected, Fumigate; Else Do Nothing
Assumes perfect detection
www.grdc.com.au
Optimal POMDP Policy for
Same as the Optimal MDP Policy
[Figure: finite-state policy diagram with nodes 0 and 1, actions Nothing and Fumigate, and transitions labeled ABSENT and PRESENT. Legend: Action, OBSERVATION, Decision State, After State]
Optimal Policy for
Deny Host for 15 years before switching to Nothing
For Deny Host for 17 years before switching to Nothing
[Figure: finite-state policy diagram with nodes 0, 1, 2, ..., 16, actions Deny, Fumigate, and Nothing, and transitions labeled ABS and PRESENT]
Probability of Eradication
Discussion
• The single-farm POMDP is exactly solvable because the state space is very small
• The real problem is more complex
  – Each farm can have many fields, each with its own hidden state
  – There are 375 farms in the quarantine area
  – 3^375 states if we treat each farm as a single unit
  – Exact solution of large POMDPs is beyond the state of the art
• Notice that there is no tradeoff between acting to gather information and acting to change the world. None of the actions gain information
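To get a feel for the scale, here is a small sketch that counts the joint states when every farm is treated as a single unit with the three hidden states of the single-farm formulation. The variable names are illustrative only.

```python
STATES_PER_FARM = 3   # Empty, Seeds, Plants & Seeds
NUM_FARMS = 375       # farms in the quarantine area

# Each farm's hidden state is independent of the others' labels,
# so the joint state space is a product of 375 three-way choices.
joint_states = STATES_PER_FARM ** NUM_FARMS
num_digits = len(str(joint_states))
```

The count has 179 decimal digits, which is why enumerating joint states (let alone beliefs over them) is out of reach.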
Ways to Avoid a POMDP (1)
State Estimation and State Tracking
• In many problems, we have (or can acquire) enough sensors so that we can estimate the state quite well
  – The state estimate has low uncertainty
  – Let ŝ_t be the most likely hidden state
• In such problems, we can pretend that we have an MDP and that we can directly observe ŝ_t
• We do not need to take actions to gain information, so we do not face this difficult tradeoff
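A minimal sketch of this "pretend it is an MDP" shortcut: take the most likely hidden state from the current state estimate and look up the action an MDP policy would choose for it. The belief values and the MDP policy table below are hypothetical, chosen only to illustrate the mechanics.

```python
def most_likely_state(belief):
    """Return the hidden state with the highest estimated probability."""
    return max(belief, key=belief.get)


def act_as_if_mdp(belief, mdp_policy):
    """Treat the most likely state as if it were directly observed."""
    return mdp_policy[most_likely_state(belief)]


# Hypothetical state estimate with low uncertainty.
belief = {"empty": 0.05, "seeds": 0.10, "plants_and_seeds": 0.85}

# Hypothetical MDP policy (state -> action); not taken from the slides.
mdp_policy = {
    "empty": "nothing",
    "seeds": "host_denial",
    "plants_and_seeds": "fumigation",
}

action = act_as_if_mdp(belief, mdp_policy)
```

This is only reasonable when the estimate is sharply peaked; with a flat belief the most likely state can be badly wrong.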
Ways to Avoid a POMDP (2)
Pure Information-Gathering POMDPs
Consider a medical diagnosis case for a specific disease where there are n tests, t_1, ..., t_n, that can be performed. Our goal is to decide whether the patient has the disease by choosing tests to perform
• Each test has two possible outcomes, 0 and 1
• Each test has a cost
• Given any subset of the outcomes, we can compute the probability that the patient has the disease
• There is a “false positive” cost for incorrectly saying that the patient has the disease and a “false negative” cost for incorrectly saying that the patient does not have the disease
Formulation as an MDP
States: vectors in {0, 1, ?}^n recording the outcome of each test (? = not yet performed)
  – starting state is (?, ?, ..., ?)
Actions:
  – n actions are the medical tests
  – one action says “the patient does not have the disease” and terminates with cost 0 if correct and the false-negative cost if incorrect
  – one action says “the patient has the disease” and terminates with cost 0 if correct and the false-positive cost if incorrect
State Transitions:
  – When we perform test i in state s, the resulting state s' sets the i-th entry in the state to 0 or 1 according to the probability of that outcome given s
  – When we perform a “declare” action, the problem transitions to a terminal state with probability 1
If there aren’t too many tests and we know these outcome probabilities, we can enumerate the states and solve this via standard MDP methods
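The enumeration can be sketched as a memoized recursion over test-outcome vectors. Everything numeric here is hypothetical: the prior, the per-test sensitivities and false-positive rates, and the costs. The sketch also assumes test outcomes are conditionally independent given the disease status, which the slides do not state; it is just one simple way to obtain the required probabilities.

```python
from functools import lru_cache

UNKNOWN = "?"                 # test not yet performed
PRIOR = 0.3                   # hypothetical P(disease)
SENSITIVITY = (0.9, 0.7)      # hypothetical P(outcome 1 | disease)
FALSE_POS_RATE = (0.2, 0.1)   # hypothetical P(outcome 1 | no disease)
TEST_COST = (1.0, 0.5)        # hypothetical cost of each test
COST_FP = 10.0                # hypothetical false-positive cost
COST_FN = 50.0                # hypothetical false-negative cost


def p_disease(state):
    """P(disease | outcomes recorded in state), by Bayes' rule."""
    sick, healthy = PRIOR, 1.0 - PRIOR
    for i, outcome in enumerate(state):
        if outcome == UNKNOWN:
            continue
        sick *= SENSITIVITY[i] if outcome == 1 else 1.0 - SENSITIVITY[i]
        healthy *= FALSE_POS_RATE[i] if outcome == 1 else 1.0 - FALSE_POS_RATE[i]
    return sick / (sick + healthy)


@lru_cache(maxsize=None)
def solve(state):
    """Return (minimum expected cost, best action) from this state."""
    pd = p_disease(state)
    # Declaring "no disease" is wrong with probability pd (false negative).
    best = (pd * COST_FN, "declare_no_disease")
    # Declaring "disease" is wrong with probability 1 - pd (false positive).
    if (1.0 - pd) * COST_FP < best[0]:
        best = ((1.0 - pd) * COST_FP, "declare_disease")
    for i, outcome in enumerate(state):
        if outcome != UNKNOWN:
            continue  # each test is performed at most once
        p_one = pd * SENSITIVITY[i] + (1.0 - pd) * FALSE_POS_RATE[i]
        cost = TEST_COST[i]
        for result, prob in ((1, p_one), (0, 1.0 - p_one)):
            next_state = state[:i] + (result,) + state[i + 1:]
            cost += prob * solve(next_state)[0]
        if cost < best[0]:
            best = (cost, f"test_{i}")
    return best


start = (UNKNOWN,) * len(TEST_COST)
expected_cost, first_action = solve(start)
```

Because every test fills in one entry and entries are never erased, the recursion always terminates; with n tests it visits at most 3^n states.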
Belief States
In general, we can think of a POMDP as being an MDP over a Belief State
In the medical diagnosis case, the belief states have the form (0,1,?,?,0,?)
In the Broomrape case, the belief state is a probability distribution over the 3 states: empty, seeds, weeds + seeds
Belief State Reasoning
Each observation updates the belief state
Example: observing the presence of weeds means weeds are present and seeds might also be present
[Figure: belief over the states empty, seeds, and weeds + seeds, before and after "observe present"]
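The observation update is Bayes' rule: multiply each state's probability by the likelihood of the observation in that state, then renormalize. The sketch below assumes weeds are never reported where none exist and are detected with a hypothetical probability of 0.5 when they do exist; the prior belief is also made up for illustration.

```python
def update_on_observation(belief, likelihood):
    """Bayes update: posterior(s) is proportional to belief(s) * P(obs | s)."""
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: w / total for s, w in unnormalized.items()}


DETECTION_PROB = 0.5  # hypothetical; the slides do not give a value here

# Assumed observation model: P(observe present | state).
P_PRESENT = {"empty": 0.0, "seeds": 0.0, "weeds+seeds": DETECTION_PROB}
P_ABSENT = {s: 1.0 - p for s, p in P_PRESENT.items()}

prior = {"empty": 0.5, "seeds": 0.3, "weeds+seeds": 0.2}  # hypothetical

after_present = update_on_observation(prior, P_PRESENT)
after_absent = update_on_observation(prior, P_ABSENT)
```

Under this model, observing "present" puts all the probability on weeds + seeds, while observing "absent" only lowers that state's probability, because a missed detection remains possible.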
Taking Actions
Each action updates the belief state
Example: fumigate
[Figure: belief over the states empty, seeds, and weeds + seeds, before and after "fumigate"]
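The action update pushes the belief through the transition model: the new probability of each state is the sum, over current states, of the current probability times the chance of moving there. The fumigation transition probabilities and the starting belief below are hypothetical numbers, not values from the slides.

```python
STATES = ("empty", "seeds", "weeds+seeds")

# Hypothetical P(next state | current state) under the fumigate action.
FUMIGATE = {
    "empty":       {"empty": 1.0, "seeds": 0.0, "weeds+seeds": 0.0},
    "seeds":       {"empty": 0.6, "seeds": 0.3, "weeds+seeds": 0.1},
    "weeds+seeds": {"empty": 0.5, "seeds": 0.4, "weeds+seeds": 0.1},
}


def update_on_action(belief, transition):
    """Prediction step: b'(s') = sum over s of b(s) * P(s' | s, action)."""
    return {
        nxt: sum(belief[cur] * transition[cur][nxt] for cur in STATES)
        for nxt in STATES
    }


belief = {"empty": 0.2, "seeds": 0.3, "weeds+seeds": 0.5}  # hypothetical
after_fumigate = update_on_action(belief, FUMIGATE)
```

No renormalization is needed here, since each row of the transition table already sums to 1.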
Belief MDP
• State space: all reachable belief states
• Action space: same actions as the POMDP
• Reward function: expected rewards derived from the underlying states
• Transition function: moves in belief space
Problem: Belief space is continuous and there can be an immense number of reachable states
Monte Carlo Policy Evaluation
Key Insight: It is just as easy to evaluate a policy via Monte Carlo trials in a POMDP as it is in an MDP!
Approach:
• Define a space of policies
• Evaluate them by Monte Carlo trials
• Pick the best one
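The three-step approach can be sketched on a toy single-farm simulator. The simulator knows the hidden state, but each policy only ever sees the observation. All dynamics, costs, the damage term, and the detection probability are hypothetical; only the state, action, and observation names, the 20-year horizon, and the 0.96 discount follow the single-farm formulation.

```python
import random

STATES = ("empty", "seeds", "plants")
# Hypothetical P(next state | action, state), listed in STATES order.
TRANSITIONS = {
    "nothing":     {"empty": (1.0, 0.0, 0.0), "seeds": (0.0, 0.3, 0.7), "plants": (0.0, 0.0, 1.0)},
    "host_denial": {"empty": (1.0, 0.0, 0.0), "seeds": (0.2, 0.8, 0.0), "plants": (0.0, 1.0, 0.0)},
    "fumigation":  {"empty": (1.0, 0.0, 0.0), "seeds": (0.7, 0.3, 0.0), "plants": (0.5, 0.5, 0.0)},
}
COST = {"nothing": 0.0, "host_denial": 1.0, "fumigation": 5.0}  # hypothetical
DAMAGE = 3.0          # hypothetical yearly loss while plants are present
DETECTION_PROB = 0.5  # hypothetical


def simulate(policy, rng, horizon=20, discount=0.96):
    """Run one trial and return its discounted reward."""
    state = rng.choice(STATES)  # hidden from the policy
    memory = None               # whatever the policy wants to remember
    total = 0.0
    for t in range(horizon):
        seen = state == "plants" and rng.random() < DETECTION_PROB
        obs = "present" if seen else "absent"
        action, memory = policy(obs, memory)
        reward = -COST[action] - (DAMAGE if state == "plants" else 0.0)
        total += discount ** t * reward
        state = rng.choices(STATES, weights=TRANSITIONS[action][state])[0]
    return total


def evaluate(policy, trials=2000, seed=0):
    """Average discounted reward over independent Monte Carlo trials."""
    rng = random.Random(seed)
    return sum(simulate(policy, rng) for _ in range(trials)) / trials


def always(action):
    return lambda obs, memory: (action, memory)


def fumigate_if_present(obs, memory):
    return ("fumigation" if obs == "present" else "nothing"), memory


# Step 1: define a (tiny) space of policies.
policies = {
    "always_nothing": always("nothing"),
    "always_deny": always("host_denial"),
    "fumigate_if_present": fumigate_if_present,
}
# Step 2: evaluate each by Monte Carlo trials.  Step 3: pick the best.
values = {name: evaluate(p) for name, p in policies.items()}
best = max(values, key=values.get)
```

Reusing the same seed for every policy is a deliberate choice: it gives each candidate the same stream of random numbers, which reduces noise in the comparison.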
Finite State Machine Policies
In many POMDPs (and MDPs), a policy can be represented as a finite state machine
We can design a set of FSM policies and then evaluate them
There are algorithms for incrementally improving FSM policies
[Figure: finite-state policy diagram with nodes 0, 1, 2, ..., 16, actions Deny, Fumigate, and Nothing, and transitions labeled ABS and PRESENT]
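A finite state machine policy needs only a small internal node and a rule for updating it from each observation. The sketch below is loosely modeled on the "deny host for a number of years, then do nothing" policies from the Broomrape example; the class name, the reset-on-detection rule, and the exact node layout are assumptions, not a transcription of the diagram.

```python
class DenyThenNothingPolicy:
    """FSM policy: fumigate on detection, then deny host for a fixed
    number of detection-free years, then do nothing."""

    def __init__(self, deny_years=15):
        self.deny_years = deny_years
        self.node = 0  # detection-free years counted so far

    def step(self, observation):
        """Consume one observation and return the action to take."""
        if observation == "PRESENT":
            self.node = 0          # restart the count after a detection
            return "Fumigate"
        if self.node < self.deny_years:
            self.node += 1         # one more detection-free year
            return "Deny"
        return "Nothing"


# Usage with a short, hypothetical observation sequence.
policy = DenyThenNothingPolicy(deny_years=3)
observations = ["PRESENT", "ABSENT", "ABSENT", "ABSENT", "ABSENT", "PRESENT", "ABSENT"]
actions = [policy.step(obs) for obs in observations]
```

A family of such policies indexed by `deny_years` is a natural policy space to evaluate by simulation, since each member is fully described by one integer.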
Summary
• Many problems in AI can be formulated as POMDPs
• Formulating a problem as a POMDP doesn’t help much, because they are so hard to solve (PSPACE-hard for finite horizon; undecidable for infinite horizon)
  – Can we do state estimation and pretend that the most likely state is the true state?
  – Are we performing pure observation actions?
  – Can the policy be divided into a pure observation phase and a pure action phase?
  – If so, we can use MDP methods instead
• Unfortunately, many problems in ecosystem management are “essential” POMDPs that mix information gathering and world-changing actions
• Monte Carlo methods (based on policy space search) are one of the most practical ways of finding good POMDP solutions