devika subramanian comp 140 fall 2008 - rice university · (c) devika subramanian, 2008 stochastic...
![Page 1: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/1.jpg)
Decision making
Devika Subramanian
Comp 140
Fall 2008
![Page 2: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/2.jpg)
(c) Devika Subramanian, 2006
Principle of maximum expected utility
If you are in state s and can choose an action a from a set A, then the best action a* at state s is:
$$a^* = \arg\max_{a \in A} EU(a, s)$$
where EU(a, s) is the expected utility of doing a in s.
![Page 3: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/3.jpg)
An example
Should we have a party indoors or outside?
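Decision tree: from the current state s, the action is in or out, and the weather then turns out dry or wet.
in, dry: regret
in, wet: relief
out, dry: perfect
out, wet: disaster
The four outcomes are labeled s1 to s4.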
![Page 4: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/4.jpg)
Utility function
A numerical score over all possible states of the world.
| location | weather | utility |
| --- | --- | --- |
| in | dry | 50 |
| in | wet | 60 |
| out | dry | 100 |
| out | wet | 0 |
![Page 5: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/5.jpg)
Maximizing expected utility
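| action | weather | probability | outcome | utility |
| --- | --- | --- | --- | --- |
| in | dry | 0.7 | Regret | 50 |
| in | wet | 0.3 | Relief | 60 |
| out | dry | 0.7 | Perfect | 100 |
| out | wet | 0.3 | Disaster | 0 |

Choose the action that maximizes expected utility:
EU(in) = 0.7 * 50 + 0.3 * 60 = 53
EU(out) = 0.7 * 100 + 0.3 * 0 = 70
Choose out.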
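The party calculation can be sketched in a few lines of Python. The probabilities and utilities are the ones on the slide; the names `P_WEATHER`, `UTILITY`, `expected_utility` and `best_action` are hypothetical and chosen only for this illustration.

```python
# Weather probabilities and utilities taken from the slide.
P_WEATHER = {"dry": 0.7, "wet": 0.3}
UTILITY = {
    ("in", "dry"): 50,    # regret
    ("in", "wet"): 60,    # relief
    ("out", "dry"): 100,  # perfect
    ("out", "wet"): 0,    # disaster
}


def expected_utility(action):
    """Sum of probability * utility over the possible weather outcomes."""
    return sum(p * UTILITY[(action, weather)] for weather, p in P_WEATHER.items())


def best_action(actions=("in", "out")):
    """Pick the action with the highest expected utility."""
    return max(actions, key=expected_utility)


if __name__ == "__main__":
    for action in ("in", "out"):
        print(action, expected_utility(action))
    print("choose", best_action())
```

Changing the entries of `P_WEATHER` is a quick way to see at what chance of rain the indoor party becomes the better choice.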
![Page 6: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/6.jpg)
(c) Devika Subramanian, 2008
Robot navigation in a grid
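Figure: a grid with columns 1 to 4 and rows 1 to 3, containing one obstacle square, a start state, and terminal states.
Terminal states: no action can take you out of these states.
There are 11 states in the state space.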
![Page 7: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/7.jpg)
Stochastic actions
Actions: N, S, E, W.
Effects of actions: each action achieves its intended effect with probability 0.8, but with probability 0.1 each, the action moves the robot at right angles to its intended direction.
Example: the robot is at (1,1) and executes action N. With probability 0.8 it ends up at square (1,2), with probability 0.1 it goes to square (2,1), and with probability 0.1 it remains at (1,1), because the sideways move W is blocked by the wall.
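A minimal sketch of this action model in Python follows. Squares are written (column, row) as on the slides. The obstacle position (2,2) is an assumption: it is consistent with the 11-state count and with the one-step projections from (3,2), but the transcript does not state it directly. The terminal squares (4,3) and (4,2) come from the reward slide. All function and constant names are hypothetical.

```python
COLS, ROWS = 4, 3
OBSTACLE = (2, 2)                 # assumed position of the obstacle square
TERMINALS = {(4, 3), (4, 2)}      # no action leaves these states
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two directions at right angles to each intended direction.
RIGHT_ANGLES = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def move(state, direction):
    """Deterministic move; the robot stays put if a wall or the obstacle blocks it."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != OBSTACLE else state


def transition(state, action):
    """Return {next_state: probability} for executing `action` in `state`."""
    if state in TERMINALS:
        return {state: 1.0}
    dist = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in RIGHT_ANGLES[action]]
    for direction, p in outcomes:
        nxt = move(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist


if __name__ == "__main__":
    print(transition((1, 1), "N"))
```

For the robot at (1,1) executing N, the sketch gives the same three outcomes as the slide: (1,2) with 0.8, and (2,1) and (1,1) with 0.1 each.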
![Page 8: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/8.jpg)
Markov state transition model
The next state is probabilistically determined by the current state and the current action, i.e.,
$$P(s_{t+1} \mid s_0, a_0, \ldots, s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$$
Example:
P((1,2)|(1,1),N) = 0.8
P((1,1)|(1,1),N) = 0.1
P((2,1)|(1,1),N) = 0.1
![Page 9: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/9.jpg)
Probabilistic forward projection
Where can you get to from (3,2) with one action?

| action | probability 0.8 | probability 0.1 | probability 0.1 |
| --- | --- | --- | --- |
| n | (3,3) | (3,2) | (4,2) |
| s | (3,1) | (3,2) | (4,2) |
| e | (4,2) | (3,1) | (3,3) |
| w | (3,2) | (3,1) | (3,3) |

A move W from (3,2) is blocked, so the robot stays at (3,2).
![Page 10: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/10.jpg)
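Probabilistic forward projection with a plan
Let the start state be (3,2) and let plan = NE. Can we, with probability 1, get to (4,3) while avoiding the black hole?
After N from (3,2): (3,3) with probability 0.8, (4,2) with probability 0.1, (3,2) with probability 0.1.
After E from (3,3): (4,3) with probability 0.8, (3,3) with probability 0.1, (3,2) with probability 0.1.
After E from (3,2): (4,2) with probability 0.8, (3,3) with probability 0.1, (3,1) with probability 0.1.
With probability 0.8 * 0.8 = 0.64, we reach (4,3) with the fixed plan NE.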
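The projection of a fixed plan can be sketched by pushing a probability distribution over states through the transition model, one action at a time. This is an illustrative sketch, not the course's code: the obstacle at (2,2) is an assumption consistent with the projections shown, terminal states are treated as absorbing, and every name (`transition`, `project`, and so on) is hypothetical.

```python
COLS, ROWS = 4, 3
OBSTACLE = (2, 2)                 # assumed position of the obstacle square
TERMINALS = {(4, 3), (4, 2)}      # absorbing: no action leaves them
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
RIGHT_ANGLES = {"N": "WE", "S": "EW", "E": "NS", "W": "SN"}


def move(state, direction):
    """Deterministic move; blocked moves leave the robot where it is."""
    nxt = (state[0] + MOVES[direction][0], state[1] + MOVES[direction][1])
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != OBSTACLE else state


def transition(state, action):
    """Return {next_state: probability} under the 0.8 / 0.1 / 0.1 model."""
    if state in TERMINALS:
        return {state: 1.0}
    dist = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in RIGHT_ANGLES[action]]:
        nxt = move(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist


def project(start, plan):
    """Distribution over states after executing the actions in `plan` in order."""
    belief = {start: 1.0}
    for action in plan:
        new_belief = {}
        for state, p in belief.items():
            for nxt, q in transition(state, action).items():
                new_belief[nxt] = new_belief.get(nxt, 0.0) + p * q
        belief = new_belief
    return belief


if __name__ == "__main__":
    for state, p in sorted(project((3, 2), "NE").items()):
        print(state, round(p, 4))
```

Under these assumptions the probability mass on (4,3) after NE is 0.64, matching the slide, and the remaining mass is spread over the other squares, including (4,2). A fixed plan therefore cannot reach (4,3) with probability 1.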
![Page 11: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/11.jpg)
Value function
V: history of states → ℝ
We will consider additive value or utility functions, and define a reward function r mapping states in the state space to a real number. For a history s0, ..., sn:
$$V(s_0, \ldots, s_n) = \sum_{i=0}^{n} r(s_i)$$
![Page 12: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/12.jpg)
Reward function
Example:
r(s) = -0.01 in non-terminal states
r(s) = +1 in terminal state (4,3)
r(s) = -1 in terminal state (4,2)
The agent is penalized for each step, so the reward function gives the agent an incentive to get out of this grid as quickly as possible via the +1 terminal state.
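To make the step penalty concrete, here is a small sketch that scores a history of states with an additive value function, that is, by summing the rewards of the states visited. The reward values are the ones in the example; the sample history and the names `reward` and `history_value` are hypothetical.

```python
TERMINAL_REWARDS = {(4, 3): 1.0, (4, 2): -1.0}
STEP_REWARD = -0.01  # reward in every non-terminal state


def reward(state):
    """r(s) from the example: +1 or -1 in terminal states, -0.01 elsewhere."""
    return TERMINAL_REWARDS.get(state, STEP_REWARD)


def history_value(history):
    """Additive value of a history: the sum of the rewards of its states."""
    return sum(reward(state) for state in history)


if __name__ == "__main__":
    # A hypothetical history that walks from (1,1) to the +1 terminal state.
    history = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
    print(round(history_value(history), 2))  # 0.95
```

Each extra non-terminal state in a history lowers its value by 0.01, which is why shorter routes to (4,3) score higher.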
![Page 13: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/13.jpg)
Markov Decision Process (MDP)
A finite set S of states
A finite set A of actions
State transitions P: S × A → Pr(S)
Rewards r: S → ℝ
Rewards can be functions of the action chosen in a state, e.g. r: S × A → ℝ.
The initial state may or may not be specified. In case it isn’t, the objective is to find a solution no matter what the initial state is.
![Page 14: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/14.jpg)
Plans and policies
In a Markov environment, if the robot has no sensors, only plans with probabilistic guarantees can be generated.
In a Markov environment that the robot can sense accurately, it can use a solution that specifies what to do for any state that it might reach. Such a mapping from states to actions is called a policy. An optimal policy maximizes the expected utility of the agent in the environment.
![Page 15: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/15.jpg)
Optimal policy
Figure: the optimal policy on the grid, with the obstacle and start state marked, for r(s) = -0.01.
![Page 16: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/16.jpg)
Optimal policy
Figure: the optimal policy on the grid, with the obstacle and start state marked, for r(s) = -2.
![Page 17: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/17.jpg)
Value iteration
Basic idea: calculate the expected utility V(s) of each state s in S, and then choose actions that maximize expected utility.
![Page 18: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/18.jpg)
The Maximum Expected Utility Principle
An agent picks the action a in state s that maximizes the expected utility of the subsequent state:
$$a^* = \arg\max_{a} \sum_{s' \in S} P(s, a, s')\, V(s')$$
![Page 19: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/19.jpg)
Calculating V(s)
The utility of a state is the immediate reward for that state plus the expected utility of the next state, assuming that the agent chooses the optimal action.
Bellman’s equation:
$$V(s) = r(s) + \max_a \sum_{s' \in S} P(s, a, s')\, V(s')$$
Here P(s, a, s') is the probability of reaching s' when action a is taken in state s.
![Page 20: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/20.jpg)
Bellman equation
Figure: a state with its four actions n, s, e, w.
We get n Bellman equations if there are n states in the state space. Unique solutions exist to this system of n equations. (Bellman, 1957)
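As an illustration, one of these equations can be written out for the corner state (1,1), using the 0.8 / 0.1 / 0.1 action model and assuming that a move into a wall leaves the robot in place (as in the earlier example, where N from (1,1) leaves the robot at (1,1) with probability 0.1). The label on each row names the action whose expected utility it gives.

```latex
\[
V(1,1) = r(1,1) + \max \left\{
\begin{array}{ll}
0.8\,V(1,2) + 0.1\,V(2,1) + 0.1\,V(1,1) & \text{(N)} \\
0.9\,V(1,1) + 0.1\,V(1,2) & \text{(W)} \\
0.9\,V(1,1) + 0.1\,V(2,1) & \text{(S)} \\
0.8\,V(2,1) + 0.1\,V(1,2) + 0.1\,V(1,1) & \text{(E)}
\end{array}
\right.
\]
```

For W and S the intended move is blocked by a wall, so the 0.8 joins the 0.1 from the blocked sideways move and the robot stays at (1,1) with probability 0.9.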
![Page 21: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/21.jpg)
Calculating the optimal policy by value iteration
Initialize V_0(s) to be 0, for every s in S.
Loop
    do a Bellman update for every s in S:
    $$V_{t+1}(s) = r(s) + \max_a \sum_{s' \in S} P(s, a, s')\, V_t(s')$$
    t = t + 1
Until successive values of V are the same
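The loop above can be sketched in Python for the grid world. Several choices here are assumptions rather than part of the slides: the obstacle is placed at (2,2), the value of a terminal state is taken to be just its reward (no action is applied there), "the same" is interpreted as a maximum change below a small tolerance, and an iteration cap is added as a safeguard. All names are hypothetical.

```python
COLS, ROWS = 4, 3
OBSTACLE = (2, 2)                           # assumed position of the obstacle
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}     # terminal state -> reward
STEP_REWARD = -0.01                         # reward in non-terminal states
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
RIGHT_ANGLES = {"N": "WE", "S": "EW", "E": "NS", "W": "SN"}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != OBSTACLE]


def reward(state):
    return TERMINALS.get(state, STEP_REWARD)


def move(state, direction):
    """Deterministic move; blocked moves leave the robot where it is."""
    nxt = (state[0] + MOVES[direction][0], state[1] + MOVES[direction][1])
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != OBSTACLE else state


def transition(state, action):
    """Return {next_state: probability} under the 0.8 / 0.1 / 0.1 model."""
    dist = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in RIGHT_ANGLES[action]]:
        nxt = move(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist


def expected_value(state, action, V):
    """Sum over s' of P(s, a, s') * V(s')."""
    return sum(p * V[nxt] for nxt, p in transition(state, action).items())


def value_iteration(tol=1e-9, max_iters=10_000):
    V = {s: 0.0 for s in STATES}            # V_0(s) = 0 for every s
    for _ in range(max_iters):
        new_V = {}
        for s in STATES:
            if s in TERMINALS:
                new_V[s] = reward(s)        # assumption: terminal value is its reward
            else:                           # Bellman update
                new_V[s] = reward(s) + max(expected_value(s, a, V) for a in MOVES)
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V
    return V


def greedy_policy(V):
    """For each non-terminal state, the action maximizing expected utility under V."""
    return {s: max(MOVES, key=lambda a: expected_value(s, a, V))
            for s in STATES if s not in TERMINALS}


if __name__ == "__main__":
    V = value_iteration()
    policy = greedy_policy(V)
    for s in sorted(policy):
        print(s, round(V[s], 3), policy[s])
```

Changing `STEP_REWARD` and rerunning shows how the policy shifts with the size of the per-step penalty.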
![Page 22: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/22.jpg)
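Utility function V*
Grid of V* values by square (column, row); the figure also marks the start state.

| row | column 1 | column 2 | column 3 | column 4 |
| --- | --- | --- | --- | --- |
| 3 | 0.812 | 0.868 | 0.918 | +1 |
| 2 | 0.762 | obstacle | 0.660 | -1 |
| 1 | 0.705 | 0.655 | 0.611 | 0.388 |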
![Page 23: Devika Subramanian Comp 140 Fall 2008 - Rice University · (c) Devika Subramanian, 2008 Stochastic actions Actions: N,S,E,W Effects of actions: each action achieves its intended effect](https://reader034.vdocuments.us/reader034/viewer/2022043006/5fae265c0fda3a0ea06971cb/html5/thumbnails/23.jpg)
Termination criteria for value iteration
RMS error of the utility values: stop when it falls below a chosen threshold.
Policy loss: stop when policies on subsequent iterations are the same.
Value iteration converges and computes the optimal policy; each iteration takes time proportional to the square of the number of states times the number of actions.
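The two stopping tests can be sketched as small helper functions. This is one possible reading of the slide: the RMS error is taken between successive value tables (dictionaries mapping states to values), and the threshold `tol` is an arbitrary illustrative choice. The function names are hypothetical.

```python
import math


def rms_error(V_old, V_new):
    """Root-mean-square difference between two value tables over the same states."""
    return math.sqrt(sum((V_new[s] - V_old[s]) ** 2 for s in V_new) / len(V_new))


def values_converged(V_old, V_new, tol=1e-6):
    """RMS criterion: successive utility values are close enough."""
    return rms_error(V_old, V_new) < tol


def policy_unchanged(policy_old, policy_new):
    """Policy criterion: the policy on subsequent iterations is the same."""
    return policy_old == policy_new


if __name__ == "__main__":
    V_t = {"a": 0.0, "b": 0.0}
    V_t1 = {"a": 1.0, "b": 1.0}
    print(rms_error(V_t, V_t1))                       # 1.0
    print(policy_unchanged({"a": "N"}, {"a": "N"}))   # True
```

The policy test can stop earlier than the RMS test, since the greedy policy may stop changing while the values are still being refined.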