
Page 1

Decision making

Devika Subramanian, Comp 140, Fall 2008


Page 2

Principle of maximum expected utility

If you are in state s and you have actions a from a set A, then the best action a* at state s is:

$a^* = \arg\max_{a \in A} EU(a \mid s)$

where $EU(a \mid s)$ is the expected utility of doing a in s.

Page 3

An example

Should we have a party indoors or outside?

[Figure: decision tree for the party problem. From state s, the actions are in and out; each is followed by the weather outcome dry or wet, giving four outcome states (labelled s1 through s4 in the figure): regret (in, dry), relief (in, wet), perfect (out, dry), and disaster (out, wet).]

Page 4

Utility function

A numerical score over all possible states of the world.

location  weather  utility
in        dry       50
in        wet       60
out       dry      100
out       wet        0

Page 5

Maximizing expected utility

[Figure: decision tree for the party problem with weather probabilities P(dry) = 0.7 and P(wet) = 0.3. Outcome utilities: regret (in, dry) 50, relief (in, wet) 60, perfect (out, dry) 100, disaster (out, wet) 0.]

Choose the action that maximizes expected utility:
EU(in) = 0.7 * 50 + 0.3 * 60 = 53
EU(out) = 0.7 * 100 + 0.3 * 0 = 70
Choose out.
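To make the calculation concrete, here is a minimal Python sketch that reproduces the numbers on this slide; the variable and function names are illustrative, not from the slides.

```python
# Expected-utility calculation for the party example: a minimal sketch
# reproducing the numbers on this slide.

# Utility of each (action, weather) outcome, from the utility table on page 4.
utility = {
    ("in", "dry"): 50, ("in", "wet"): 60,
    ("out", "dry"): 100, ("out", "wet"): 0,
}
# Probability of each weather outcome.
p_weather = {"dry": 0.7, "wet": 0.3}

def expected_utility(action):
    """EU(action) = sum over weather outcomes of P(weather) * utility(action, weather)."""
    return sum(p * utility[(action, w)] for w, p in p_weather.items())

print(expected_utility("in"))    # 53.0
print(expected_utility("out"))   # 70.0
print(max(["in", "out"], key=expected_utility))  # out
```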

Page 6

Robot navigation in a grid

[Figure: a 4 x 3 grid world, columns 1-4 and rows 1-3, giving 11 states in the state space (one square is an obstacle). Two squares are terminal states: no action can take you out of these states. The start state is marked.]

Page 7

Stochastic actions

Actions: N, S, E, W. Effects of actions: each action achieves its intended effect with probability 0.8, but with probability 0.1 each, the action moves the robot at right angles to its intended direction. Example: the robot is at (1,1) and executes action N. With probability 0.8 it ends up at square (1,2), with probability 0.1 it goes to square (2,1), and with probability 0.1 it remains at (1,1), because the westward slip runs into the wall and leaves the robot where it is.

Page 8

Markov state transition model

The next state is probabilistically determined by the current state and the current action, i.e., the model specifies $P(s' \mid s, a)$, the probability of the next state s' given the current state s and action a.

Example:
P((1,2) | (1,1), N) = 0.8
P((1,1) | (1,1), N) = 0.1
P((2,1) | (1,1), N) = 0.1
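The transition model can be written as a small function. The sketch below assumes the 4 x 3 grid of the earlier figure with the obstacle at (2,2) (the obstacle's position is an assumption, consistent with the forward-projection table on the next page), and it treats any move that would leave the grid or enter the obstacle as leaving the robot in place.

```python
# A sketch of the grid world's transition model P(s' | s, a). Assumes a 4 x 3
# grid with an obstacle at (2,2); a blocked move (wall or obstacle) stays put.

COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
MOVE = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, direction):
    """Deterministic move; blocked moves leave the robot where it is."""
    x, y = state
    dx, dy = MOVE[direction]
    nxt = (x + dx, y + dy)
    if not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS) or nxt == OBSTACLE:
        return state
    return nxt

def transition(state, action):
    """Return {next_state: probability}: 0.8 intended, 0.1 for each right-angle slip."""
    dist = {}
    for direction, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        nxt = step(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

print(transition((1, 1), "N"))  # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}
```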

Page 9

Probabilistic forward projection

Where can you get to from (3,2) with one action?

action   next states (probability)
N        (3,3) 0.8,  (3,2) 0.1,  (4,2) 0.1
S        (3,1) 0.8,  (3,2) 0.1,  (4,1) 0.1
E        (4,2) 0.8,  (3,1) 0.1,  (3,3) 0.1
W        (3,2) 0.8,  (3,1) 0.1,  (3,3) 0.1

Page 10

Probabilistic forward projection with a plan

Let the start state be (3,2) and let plan = NE. Can we, with probability 1, get to (4,3) while avoiding the black hole?

[Figure: forward projection tree. From (3,2), action N leads to (3,3) with probability 0.8, to (4,2) with 0.1, and to (3,2) with 0.1. Then action E from (3,3) leads to (4,3) with probability 0.8, to (3,3) with 0.1, and to (3,2) with 0.1; action E from (3,2) leads to (4,2) with probability 0.8, to (3,3) with 0.1, and to (3,1) with 0.1.]

With probability 0.64 (0.8 * 0.8), we reach (4,3) with the fixed plan NE.
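A short sketch of this two-step projection, using the one-step distributions from this page and the previous one; treating the two terminal squares (4,3) and (4,2) as absorbing is an assumption about the figure.

```python
# Forward projection of the fixed plan NE from (3,2), using the one-step
# distributions shown on these slides. Terminal squares are assumed absorbing.

P = {  # P[(state, action)] = {next_state: probability}
    ((3, 2), "N"): {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1},
    ((3, 3), "E"): {(4, 3): 0.8, (3, 3): 0.1, (3, 2): 0.1},
    ((3, 2), "E"): {(4, 2): 0.8, (3, 3): 0.1, (3, 1): 0.1},
}
TERMINAL = {(4, 3), (4, 2)}

def project(dist, action):
    """Push a distribution over states through one action of the plan."""
    out = {}
    for s, p in dist.items():
        nexts = {s: 1.0} if s in TERMINAL else P[(s, action)]
        for s2, q in nexts.items():
            out[s2] = out.get(s2, 0.0) + p * q
    return out

dist = {(3, 2): 1.0}
for a in "NE":
    dist = project(dist, a)
print(round(dist[(4, 3)], 2))  # 0.64: probability of reaching (4,3) with plan NE
```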

Page 11

Value function V: history of states → R

We will consider additive value or utility functions, and define a reward function r mapping states in the state space to a real number:

$V(s_0, s_1, \ldots, s_n) = \sum_{i=0}^{n} r(s_i)$

Page 12

Reward function

Example:
r(s) = -0.01 in non-terminal states
r(s) = +1 in terminal state (4,3)
r(s) = -1 in terminal state (4,2)

The agent is penalized for each step, so the way we have defined the reward function gives the agent an incentive to get out of this grid as quickly as possible via the +1 terminal state.
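Combining this reward function with the additive value function from the previous page gives a small sketch; the example history is illustrative, not from the slides.

```python
# The reward function from this slide, plus the additive value of a state
# history as defined on the previous page. The example history is illustrative.

def r(state):
    """Per-state reward: +1 at (4,3), -1 at (4,2), -0.01 elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.01

def value_of_history(history):
    """Additive value: V(s0, ..., sn) = sum of r(si)."""
    return sum(r(s) for s in history)

# A history that walks up the left column and across the top row to (4,3).
history = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
print(round(value_of_history(history), 2))  # 0.95: five -0.01 steps plus +1
```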

Page 13

Markov Decision Process (MDP)

A finite set S of states
A finite set A of actions
State transitions P: S x A → Pr(S)
Rewards r: S → R

Rewards can be functions of the action chosen in a state, e.g. r: S x A → R. The initial state may or may not be specified. In case it isn't, the objective is to find a solution no matter what the initial state is.
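A minimal sketch of an MDP as a container for these four components; the field names are illustrative, not from the slides.

```python
# A minimal container for the four components of an MDP listed above.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # grid squares, e.g. (1, 1)
Action = str              # "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]                                # finite set S
    actions: List[Action]                              # finite set A
    P: Callable[[State, Action], Dict[State, float]]   # S x A -> Pr(S)
    r: Callable[[State], float]                        # S -> R
```

The transition and reward components can be supplied as functions (as sketched here) or as lookup tables; either matches the definition above.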

Page 14

Plans and policies

In a Markov environment, if the robot has no sensors, only plans with probabilistic guarantees can be generated.

In a Markov environment that the robot can sense accurately, it can use a solution that specifies what to do for any state it might reach. Such a mapping from states to actions is called a policy. An optimal policy maximizes the expected utility of the agent in the environment.

Page 15

Optimal policy

[Figure: the optimal policy for the grid world when r(s) = -0.01 in non-terminal states, shown as an arrow (the chosen action) in each non-terminal square; the start state and the obstacle are marked.]

Page 16

Optimal policy

[Figure: the optimal policy for the grid world when r(s) = -2 in non-terminal states, again shown as an arrow in each non-terminal square.]

Page 17

Value iteration

Basic idea: calculate the expected utility V(s) of each state s in S, and then choose actions that maximize expected utility.

Page 18

The Maximum Expected Utility Principle

An agent picks the action a in state s that maximizes the expected utility of the subsequent state:

$\pi(s) = \arg\max_{a} \sum_{s' \in S} P(s, a, s')\, V(s')$
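A minimal sketch of this greedy choice, given a transition model P and a value table V; the toy states and numbers below are illustrative, not from the slides.

```python
# Greedy action choice: pick the action with the highest expected utility
# of the next state, given P(s, a) -> {s': prob} and a value table V.

def greedy_action(s, actions, P, V):
    """Return argmax over a of sum_{s'} P(s'|s,a) * V(s')."""
    def expected_value(a):
        return sum(prob * V[s2] for s2, prob in P(s, a).items())
    return max(actions, key=expected_value)

# Tiny illustrative example with two actions from state "s".
V = {"good": 1.0, "bad": -1.0}
P = lambda s, a: {"good": 0.9, "bad": 0.1} if a == "safe" else {"good": 0.5, "bad": 0.5}
print(greedy_action("s", ["safe", "risky"], P, V))  # safe
```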

Page 19

Calculating V(s)

The utility of a state is the immediate reward for that state plus the expected utility of the next state, assuming that the agent chooses the optimal action.

Bellman's equation:

$V(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s')\, V(s')$

Page 20

Bellman equation

[Figure: a state with its four actions n, s, e, w.]

We get n Bellman equations if there are n states in the state space. Unique solutions exist to this system of n equations. (Bellman, 1957)

Page 21

Calculating the optimal policy by value iteration

Initialize $V_0(s)$ to be 0, for every s in S.
Loop:
    do a Bellman update for every state s:
        $V_{t+1}(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s')\, V_t(s')$
    t = t + 1
Until successive values of V are the same.
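A sketch of value iteration for the grid world of these slides. It assumes the 4 x 3 layout with an obstacle at (2,2), terminal states (4,3) and (4,2), and the rewards from page 12; the tolerance-based stopping test stands in for "until successive values of V are the same".

```python
# Value iteration for the 4 x 3 grid world, under the layout assumptions above.

COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) != OBSTACLE]
ACTIONS = ["N", "S", "E", "W"]
MOVE = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(s, d):
    nxt = (s[0] + MOVE[d][0], s[1] + MOVE[d][1])
    return s if nxt not in STATES else nxt   # off-grid or obstacle: stay put

def P(s, a):
    """Transition distribution: 0.8 intended direction, 0.1 each right-angle slip."""
    dist = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        nxt = step(s, d)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def r(s):
    return TERMINALS.get(s, -0.01)

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {}
        for s in STATES:
            if s in TERMINALS:
                V_new[s] = r(s)   # terminal: no action can take you out
            else:
                V_new[s] = r(s) + max(
                    sum(p * V[s2] for s2, p in P(s, a).items()) for a in ACTIONS)
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

V = value_iteration()
print(round(V[(1, 1)], 3))   # value of the start state
```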

Page 22

Utility function V*

[Figure: optimal utilities V* for the 4 x 3 grid world (obstacle at (2,2), start state at lower left, terminal states +1 and -1 in the right column):]

         col 1    col 2      col 3    col 4
row 3    0.812    0.868      0.918     +1
row 2    0.762    obstacle   0.660     -1
row 1    0.705    0.655      0.611    0.388

Page 23

Termination criteria for value iteration

RMS error of the utility values.
Policy loss: stop when policies on subsequent iterations are the same.

Value iteration converges and computes the optimal policy in time proportional to the square of the number of states times the number of actions.
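A sketch of the RMS-error stopping test mentioned above; the threshold eps is an illustrative parameter, not a value from the slides.

```python
# RMS-error stopping test for value iteration.
import math

def rms_error(V_old, V_new):
    """Root-mean-square difference between two utility tables over the same states."""
    return math.sqrt(sum((V_new[s] - V_old[s]) ** 2 for s in V_old) / len(V_old))

def converged(V_old, V_new, eps=1e-6):
    return rms_error(V_old, V_new) < eps

print(converged({"a": 0.50, "b": 0.30}, {"a": 0.50, "b": 0.30}))  # True
```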