
Page 1

Decision making

Devika Subramanian, Comp 140, Fall 2008


Page 2

Principle of maximum expected utility

If you are in state s and you have actions a from a set A, then the best action a* at state s is:

$a^* = \arg\max_{a \in A} EU(a \mid s)$

where $EU(a \mid s)$ is the expected utility of doing a in s.

Page 3

An example

Should we have a party indoors or outside?

[Figure: decision tree for the party problem. From state s, the actions are in and out; each is followed by the weather outcome dry or wet, giving four outcome states (labelled s1 through s4 in the figure): regret (in, dry), relief (in, wet), perfect (out, dry), and disaster (out, wet).]

Page 4

Utility function

A numerical score over all possible states of the world.

location  weather  utility
in        dry       50
in        wet       60
out       dry      100
out       wet        0

Page 5

Maximizing expected utility

[Figure: decision tree for the party problem with weather probabilities P(dry) = 0.7 and P(wet) = 0.3. Outcome utilities: regret (in, dry) 50, relief (in, wet) 60, perfect (out, dry) 100, disaster (out, wet) 0.]

Choose the action that maximizes expected utility:
EU(in) = 0.7 * 50 + 0.3 * 60 = 53
EU(out) = 0.7 * 100 + 0.3 * 0 = 70
Choose out.
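To make the calculation concrete, here is a minimal Python sketch that reproduces the numbers on this slide; the variable and function names are illustrative, not from the slides.

```python
# Expected-utility calculation for the party example: a minimal sketch
# reproducing the numbers on this slide.

# Utility of each (action, weather) outcome, from the utility table on page 4.
utility = {
    ("in", "dry"): 50, ("in", "wet"): 60,
    ("out", "dry"): 100, ("out", "wet"): 0,
}
# Probability of each weather outcome.
p_weather = {"dry": 0.7, "wet": 0.3}

def expected_utility(action):
    """EU(action) = sum over weather outcomes of P(weather) * utility(action, weather)."""
    return sum(p * utility[(action, w)] for w, p in p_weather.items())

print(expected_utility("in"))    # 53.0
print(expected_utility("out"))   # 70.0
print(max(["in", "out"], key=expected_utility))  # out
```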

Page 6

Robot navigation in a grid

[Figure: a 4 x 3 grid world, columns 1-4 and rows 1-3, giving 11 states in the state space (one square is an obstacle). Two squares are terminal states: no action can take you out of these states. The start state is marked.]

Page 7

Stochastic actions

Actions: N, S, E, W. Effects of actions: each action achieves its intended effect with probability 0.8, but with probability 0.1 each, the action moves the robot at right angles to its intended direction. Example: the robot is at (1,1) and executes action N. With probability 0.8 it ends up at square (1,2), with probability 0.1 it goes to square (2,1), and with probability 0.1 it remains at (1,1), because the westward slip runs into the wall and leaves the robot where it is.

Page 8

Markov state transition model

The next state is probabilistically determined by the current state and the current action, i.e., the model specifies $P(s' \mid s, a)$, the probability of the next state s' given the current state s and action a.

Example:
P((1,2) | (1,1), N) = 0.8
P((1,1) | (1,1), N) = 0.1
P((2,1) | (1,1), N) = 0.1
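The transition model can be written as a small function. The sketch below assumes the 4 x 3 grid of the earlier figure with the obstacle at (2,2) (the obstacle's position is an assumption, consistent with the forward-projection table on the next page), and it treats any move that would leave the grid or enter the obstacle as leaving the robot in place.

```python
# A sketch of the grid world's transition model P(s' | s, a). Assumes a 4 x 3
# grid with an obstacle at (2,2); a blocked move (wall or obstacle) stays put.

COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
MOVE = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, direction):
    """Deterministic move; blocked moves leave the robot where it is."""
    x, y = state
    dx, dy = MOVE[direction]
    nxt = (x + dx, y + dy)
    if not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS) or nxt == OBSTACLE:
        return state
    return nxt

def transition(state, action):
    """Return {next_state: probability}: 0.8 intended, 0.1 for each right-angle slip."""
    dist = {}
    for direction, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        nxt = step(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

print(transition((1, 1), "N"))  # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}
```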

Page 9

Probabilistic forward projection

Where can you get to from (3,2) with one action?

action   next states (probability)
N        (3,3) 0.8,  (3,2) 0.1,  (4,2) 0.1
S        (3,1) 0.8,  (3,2) 0.1,  (4,1) 0.1
E        (4,2) 0.8,  (3,1) 0.1,  (3,3) 0.1
W        (3,2) 0.8,  (3,1) 0.1,  (3,3) 0.1

Page 10

Probabilistic forward projection with a plan

Let the start state be (3,2) and let plan = NE. Can we, with probability 1, get to (4,3) while avoiding the black hole?

[Figure: forward projection tree. From (3,2), action N leads to (3,3) with probability 0.8, to (4,2) with 0.1, and to (3,2) with 0.1. Then action E from (3,3) leads to (4,3) with probability 0.8, to (3,3) with 0.1, and to (3,2) with 0.1; action E from (3,2) leads to (4,2) with probability 0.8, to (3,3) with 0.1, and to (3,1) with 0.1.]

With probability 0.64 (0.8 * 0.8), we reach (4,3) with the fixed plan NE.
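A short sketch of this two-step projection, using the one-step distributions from this page and the previous one; treating the two terminal squares (4,3) and (4,2) as absorbing is an assumption about the figure.

```python
# Forward projection of the fixed plan NE from (3,2), using the one-step
# distributions shown on these slides. Terminal squares are assumed absorbing.

P = {  # P[(state, action)] = {next_state: probability}
    ((3, 2), "N"): {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1},
    ((3, 3), "E"): {(4, 3): 0.8, (3, 3): 0.1, (3, 2): 0.1},
    ((3, 2), "E"): {(4, 2): 0.8, (3, 3): 0.1, (3, 1): 0.1},
}
TERMINAL = {(4, 3), (4, 2)}

def project(dist, action):
    """Push a distribution over states through one action of the plan."""
    out = {}
    for s, p in dist.items():
        nexts = {s: 1.0} if s in TERMINAL else P[(s, action)]
        for s2, q in nexts.items():
            out[s2] = out.get(s2, 0.0) + p * q
    return out

dist = {(3, 2): 1.0}
for a in "NE":
    dist = project(dist, a)
print(round(dist[(4, 3)], 2))  # 0.64: probability of reaching (4,3) with plan NE
```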

Page 11

Value function V: history of states → R

We will consider additive value or utility functions, and define a reward function r mapping states in the state space to a real number:

$V(s_0, s_1, \ldots, s_n) = \sum_{i=0}^{n} r(s_i)$

Page 12

Reward function

Example:
r(s) = -0.01 in non-terminal states
r(s) = +1 in terminal state (4,3)
r(s) = -1 in terminal state (4,2)

The agent is penalized for each step, so the way we have defined the reward function gives the agent an incentive to get out of this grid as quickly as possible via the +1 terminal state.
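Combining this reward function with the additive value function from the previous page gives a small sketch; the example history is illustrative, not from the slides.

```python
# The reward function from this slide, plus the additive value of a state
# history as defined on the previous page. The example history is illustrative.

def r(state):
    """Per-state reward: +1 at (4,3), -1 at (4,2), -0.01 elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.01

def value_of_history(history):
    """Additive value: V(s0, ..., sn) = sum of r(si)."""
    return sum(r(s) for s in history)

# A history that walks up the left column and across the top row to (4,3).
history = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
print(round(value_of_history(history), 2))  # 0.95: five -0.01 steps plus +1
```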

Page 13

Markov Decision Process (MDP)

A finite set S of states
A finite set A of actions
State transitions P: S x A → Pr(S)
Rewards r: S → R

Rewards can be functions of the action chosen in a state, e.g. r: S x A → R. The initial state may or may not be specified. In case it isn't, the objective is to find a solution no matter what the initial state is.
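A minimal sketch of an MDP as a container for these four components; the field names are illustrative, not from the slides.

```python
# A minimal container for the four components of an MDP listed above.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # grid squares, e.g. (1, 1)
Action = str              # "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]                                # finite set S
    actions: List[Action]                              # finite set A
    P: Callable[[State, Action], Dict[State, float]]   # S x A -> Pr(S)
    r: Callable[[State], float]                        # S -> R
```

The transition and reward components can be supplied as functions (as sketched here) or as lookup tables; either matches the definition above.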

Page 14

Plans and policies

In a Markov environment, if the robot has no sensors, only plans with probabilistic guarantees can be generated.

In a Markov environment that the robot can sense accurately, it can use a solution that specifies what to do for any state it might reach. Such a mapping from states to actions is called a policy. An optimal policy maximizes the expected utility of the agent in the environment.

Page 15

Optimal policy

[Figure: the optimal policy for the grid world when r(s) = -0.01 in non-terminal states, shown as an arrow (the chosen action) in each non-terminal square; the start state and the obstacle are marked.]

Page 16

Optimal policy

[Figure: the optimal policy for the grid world when r(s) = -2 in non-terminal states, again shown as an arrow in each non-terminal square.]

Page 17

Value iteration

Basic idea: calculate the expected utility V(s) of each state s in S, and then choose actions that maximize expected utility.

Page 18

The Maximum Expected Utility Principle

An agent picks the action a in state s that maximizes the expected utility of the subsequent state:

$\pi(s) = \arg\max_{a} \sum_{s' \in S} P(s, a, s')\, V(s')$
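A minimal sketch of this greedy choice, given a transition model P and a value table V; the toy states and numbers below are illustrative, not from the slides.

```python
# Greedy action choice: pick the action with the highest expected utility
# of the next state, given P(s, a) -> {s': prob} and a value table V.

def greedy_action(s, actions, P, V):
    """Return argmax over a of sum_{s'} P(s'|s,a) * V(s')."""
    def expected_value(a):
        return sum(prob * V[s2] for s2, prob in P(s, a).items())
    return max(actions, key=expected_value)

# Tiny illustrative example with two actions from state "s".
V = {"good": 1.0, "bad": -1.0}
P = lambda s, a: {"good": 0.9, "bad": 0.1} if a == "safe" else {"good": 0.5, "bad": 0.5}
print(greedy_action("s", ["safe", "risky"], P, V))  # safe
```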

Page 19

Calculating V(s)

The utility of a state is the immediate reward for that state plus the expected utility of the next state, assuming that the agent chooses the optimal action.

Bellman's equation:

$V(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s')\, V(s')$

Page 20

Bellman equation

[Figure: a state with its four actions n, s, e, w.]

We get n Bellman equations if there are n states in the state space. Unique solutions exist to this system of n equations. (Bellman, 1957)

Page 21

Calculating the optimal policy by value iteration

Initialize $V_0(s)$ to be 0, for every s in S.
Loop:
    do a Bellman update for every state s:
        $V_{t+1}(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s')\, V_t(s')$
    t = t + 1
Until successive values of V are the same.
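A sketch of value iteration for the grid world of these slides. It assumes the 4 x 3 layout with an obstacle at (2,2), terminal states (4,3) and (4,2), and the rewards from page 12; the tolerance-based stopping test stands in for "until successive values of V are the same".

```python
# Value iteration for the 4 x 3 grid world, under the layout assumptions above.

COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != OBSTACLE]
ACTIONS = ["N", "S", "E", "W"]
MOVE = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(s, d):
    nxt = (s[0] + MOVE[d][0], s[1] + MOVE[d][1])
    return s if nxt not in STATES else nxt   # off-grid or obstacle: stay put

def P(s, a):
    """Transition distribution: 0.8 intended direction, 0.1 each right-angle slip."""
    dist = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        nxt = step(s, d)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def r(s):
    return TERMINALS.get(s, -0.01)

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {}
        for s in STATES:
            if s in TERMINALS:
                V_new[s] = r(s)   # terminal: no action can take you out
            else:
                V_new[s] = r(s) + max(
                    sum(p * V[s2] for s2, p in P(s, a).items()) for a in ACTIONS)
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

V = value_iteration()
print(round(V[(1, 1)], 3))   # value of the start state
```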

Page 22

Utility function V*

[Figure: optimal utilities V* for the 4 x 3 grid world (obstacle at (2,2), start state at lower left, terminal states +1 and -1 in the right column):]

         col 1    col 2      col 3    col 4
row 3    0.812    0.868      0.918     +1
row 2    0.762    obstacle   0.660     -1
row 1    0.705    0.655      0.611    0.388

Page 23

Termination criteria for value iteration

RMS error of the utility values.
Policy loss: stop when policies on subsequent iterations are the same.

Value iteration converges and computes the optimal policy in time proportional to the square of the number of states times the number of actions.
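A sketch of the RMS-error stopping test mentioned above; the threshold eps is an illustrative parameter, not a value from the slides.

```python
# RMS-error stopping test for value iteration.
import math

def rms_error(V_old, V_new):
    """Root-mean-square difference between two utility tables over the same states."""
    return math.sqrt(sum((V_new[s] - V_old[s]) ** 2 for s in V_old) / len(V_old))

def converged(V_old, V_new, eps=1e-6):
    return rms_error(V_old, V_new) < eps

print(converged({"a": 0.50, "b": 0.30}, {"a": 0.50, "b": 0.30}))  # True
```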