
Rutgers CS440, Fall 2003

Decisions under uncertainty

Reading: Ch. 16, AIMA 2nd Ed.

Rutgers CS440, Fall 2003

Outline

• Decisions, preferences, utility functions
• Influence diagrams
• Value of information

Rutgers CS440, Fall 2003

Decision making

• Decisions – an irrevocable allocation of domain resources

• Decisions should be made so as to maximize expected utility

• Questions:
– Why make decisions based on average or expected utility?

– Why can one assume that utility functions exist?

– Can an agent act rationally by expressing preferences between states without giving them numeric values?

– Can every preference structure be captured by assigning a single number to every state?

Rutgers CS440, Fall 2003

Simple decision problem

• Party decision problem: inside or outside?

Action \ Weather   Dry        Wet
IN                 Regret     Relief
OUT                Perfect!   Disaster

Rutgers CS440, Fall 2003

Value function

• Numerical score over all possible states of the world

Action Weather Value

OUT Dry $100

IN Wet $60

IN Dry $50

OUT Wet $0

Rutgers CS440, Fall 2003

Preferences

• Agent chooses among prizes (A,B,…) and lotteries (situations with uncertain prizes)

L1 = ( 0.2, $40,000; 0.8, $0 )
L2 = ( 0.25, $30,000; 0.75, $0 )

Notation:
A ≻ B : A is preferred to B
A ≺ B : B is preferred to A
A ~ B : indifference between A and B

Rutgers CS440, Fall 2003

Desired properties for preferences over lotteries

• If you prefer $100 over $0 and p < q, then you should prefer L2 to L1 (L1 ≺ L2), where

L1 = ( p, $100; 1-p, $0 )
L2 = ( q, $100; 1-q, $0 )

Rutgers CS440, Fall 2003

Properties of (rational) preference

Lead to rational agent behavior

1. Orderability: ( A ≻ B ) ∨ ( B ≻ A ) ∨ ( A ~ B )

2. Transitivity: ( A ≻ B ) ∧ ( B ≻ C ) ⇒ ( A ≻ C )

3. Continuity: A ≻ B ≻ C ⇒ ∃p, ( p, A; 1-p, C ) ~ B

4. Substitutability: A ~ B ⇒ ∀p, ( p, A; 1-p, C ) ~ ( p, B; 1-p, C )

5. Monotonicity: A ≻ B ⇒ ( p > q ⇔ ( p, A; 1-p, B ) ≻ ( q, A; 1-q, B ) )

Rutgers CS440, Fall 2003

Preference & expected utility

• Properties of preference lead to existence (Ramsey 1931, von Neumann

& Morgenstern 1944) of utility function U such that

L1 = ( p, $100; 1-p, $0 )
L2 = ( q, $100; 1-q, $0 )

L1 ≺ L2
IFF
EU(L1) = p U($100) + (1-p) U($0)  <  EU(L2) = q U($100) + (1-q) U($0)
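As a small aside (not from the slides), this comparison is easy to compute directly. A minimal Python sketch; the utility function U here is an arbitrary illustrative placeholder:

```python
def expected_utility(lottery, U):
    """Expected utility of a lottery given as (probability, prize) pairs."""
    return sum(p * U(prize) for p, prize in lottery)

# Illustrative concave utility over money (any increasing U could be plugged in).
U = lambda x: x ** 0.5

L1 = [(0.2, 40000), (0.8, 0)]
L2 = [(0.25, 30000), (0.75, 0)]
print(expected_utility(L1, U), expected_utility(L2, U))   # 40.0 vs. ~43.3
```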

Rutgers CS440, Fall 2003

Properties of utility

• Utility is a function that maps states to real numbers

• Standard approach to assessing utilities of states:

1. Compare state A to a standard lottery L = ( p, u_best; 1-p, u_worst ), where
   u_best – best possible event
   u_worst – worst possible event

2. Adjust p until A ~ L

Example: $30 ~ ( 0.999999, continue as before; 0.000001, instant death )
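As a one-line consequence (using the normalized scale introduced on the next slide, with U(best) = 1 and U(worst) = 0), the indifference point gives the utility directly:

U(A) = p · U(best) + (1 - p) · U(worst) = p · 1 + (1 - p) · 0 = p

so, for the example above, U($30) = 0.999999 on that scale.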

Rutgers CS440, Fall 2003

Utility scales

• Normalized utilities: Ubest = 1.0, Uworst = 0.0

• Micromorts: one-millionth chance of death
  – useful for Russian roulette, paying to reduce product risks, etc.

• QALYs: quality-adjusted life years
  – useful for medical decisions involving substantial risk

• Note: behavior is invariant w.r.t. positive linear transformation

U’(s) = A U(s) + B, A > 0

Rutgers CS440, Fall 2003

Utility vs Money

• Utility is NOT monetary payoff

L1 = ( 0.8, $40,000; 0.2, $0 )
L2 = ( 1, $30,000; 0, $0 )

EMV(L) = Σ_i p_i MV_i   (expected monetary value)

EMV(L1) = $32,000 > EMV(L2) = $30,000, yet many people prefer the sure $30,000 of L2.

Rutgers CS440, Fall 2003

Attitudes toward risk

[Figure: utility of money U($reward) vs. $reward, a concave curve. For the lottery L = ( 0.5, $1000; 0.5, $0 ), the expected utility U(L) corresponds to a certain monetary equivalent of about $400, while EMV(L) = $500; the difference is the insurance / risk premium.]

U concave – risk averse
U linear – risk neutral
U convex – risk seeking
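A minimal numeric sketch of risk aversion in Python, assuming an illustrative concave utility U($x) = sqrt(x) (this particular U is not from the slides):

```python
import math

U = math.sqrt                          # concave utility => risk-averse behavior
lottery = [(0.5, 1000.0), (0.5, 0.0)]  # the 50/50 lottery over $1000 and $0

emv = sum(p * x for p, x in lottery)       # expected monetary value = 500.0
eu = sum(p * U(x) for p, x in lottery)     # expected utility of the lottery
ce = eu ** 2                               # certain monetary equivalent: U(ce) = eu
risk_premium = emv - ce                    # what the agent gives up to avoid the risk

print(emv, ce, risk_premium)               # 500.0, ~250.0, ~250.0
```

With this sharply concave U the certain equivalent is about $250; the gentler curve sketched on the slide puts it near $400, but the qualitative point (certain equivalent below EMV for concave U) is the same.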

Rutgers CS440, Fall 2003

Human judgment under uncertainty

• Is decision theory compatible with human judgment under uncertainty?

• Are people “experts” in reasoning under uncertainty? How well do they perform? What kind of heuristics do they use?

Consider two pairs of lotteries:

L1 = ( 0.2, $40,000; 0.8, $0 )   vs.   L2 = ( 0.25, $30,000; 0.75, $0 )
L3 = ( 0.8, $40,000; 0.2, $0 )   vs.   L4 = ( 1, $30,000; 0, $0 )

People commonly prefer L1 in the first pair and L4 in the second (the Allais paradox). Taking U($0) = 0:

L1 ≻ L2 implies .2 U($40k) > .25 U($30k), i.e. .8 U($40k) > U($30k)

L4 ≻ L3 implies .8 U($40k) < U($30k)

No single utility function is consistent with both choices.

Rutgers CS440, Fall 2003

Student group utility

[Plot: elicited group utility curve; probability p on the vertical axis (0 to 1), dollar amount on the horizontal axis ($500 to $10,000).]

• For each $ amount, adjust p until half the class votes for the lottery ($10,000).

Rutgers CS440, Fall 2003

Technology forecasting

• “I think there is a world market for about five computers.”
  - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943

• “There doesn't seem to be any real limit to the growth of the computer industry.”
  - Thomas J. Watson, Jr., Chairman of the Board of IBM, 1968

Rutgers CS440, Fall 2003

Maximizing expected utility

Party decision with P(Dry) = 0.7, P(Wet) = 0.3:

Action   Weather   Value   Utility
IN       Dry       $50     U($50)  = 0.632
IN       Wet       $60     U($60)  = 0.699
OUT      Dry       $100    U($100) = 0.865
OUT      Wet       $0      U($0)   = 0

EU(IN)  = 0.7 * 0.632 + 0.3 * 0.699 = 0.6521
EU(OUT) = 0.7 * 0.865 + 0.3 * 0     = 0.6055

Action* = arg max{ EU(IN), EU(OUT) } = IN
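The same computation as a small Python sketch (probabilities and utilities taken from the slide):

```python
P_weather = {"Dry": 0.7, "Wet": 0.3}
utility = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,
           ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def expected_utility(action):
    # EU(action) = sum over weather states of P(weather) * U(action, weather)
    return sum(p * utility[(action, w)] for w, p in P_weather.items())

eu = {a: expected_utility(a) for a in ("IN", "OUT")}
best_action = max(eu, key=eu.get)
print(eu, best_action)     # {'IN': ~0.6521, 'OUT': ~0.6055}  IN
```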

Rutgers CS440, Fall 2003

Multi-attribute utilities

• Many aspects of an outcome combine to determine our preferences:
  – vacation planning: cost, flying time, beach quality, food quality, etc.

• Medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, etc.

• For rational decision making, must combine all relevant factors into single utility function.

U(a, b, c, …) = f[ f1(a), f2(b), … ]

where f is a simple function such as addition.

• f = + (an additive utility) applies in the case of mutual preferential independence: preferences over the values of any subset of attributes do not depend on the fixed values of the remaining attributes.
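A minimal sketch of an additive multi-attribute utility for the vacation example; the attribute value functions and weights below are made-up placeholders, not values from the slides:

```python
# U(outcome) = sum_i w_i * f_i(attribute_i), an additive form.
value_fns = {
    "cost":        lambda dollars: 1.0 - min(dollars, 5000) / 5000.0,  # cheaper is better
    "flying_time": lambda hours:   1.0 - min(hours, 24) / 24.0,        # shorter is better
    "beach":       lambda quality: quality,                            # already scaled to [0, 1]
}
weights = {"cost": 0.5, "flying_time": 0.2, "beach": 0.3}

def utility(outcome):
    return sum(weights[a] * value_fns[a](outcome[a]) for a in weights)

print(utility({"cost": 2000, "flying_time": 8, "beach": 0.9}))   # ~0.70
```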

Rutgers CS440, Fall 2003

Decision graphs / Influence diagrams

[Influence diagram (flood_decision.net): chance nodes earthquake, burglary, alarm, call, newscast, goods recovered, miss meeting; action node: go home?; utility node: Utility.]

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

Call?   EU( Go home )   EU( Stay )
Yes     ?               ?
No      ?               ?

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

EU( GoHome=Yes | Call=Yes )
  = Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, GoHome=Yes)
  = Σ_{GR,M} U(GR, M) P(M | GoHome=Yes) P(GR | Call=Yes, GoHome=Yes)

where GR = goods recovered and M = miss meeting.

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

Call?   EU( Go home )   EU( Stay )   MEU( Call )
Yes     37              13           37
No      53              83           83

A*(Call=Yes) = Go Home
A*(Call=No)  = Stay
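Reading the policy off this table programmatically (values from the slide):

```python
eu = {("Yes", "Go home"): 37, ("Yes", "Stay"): 13,
      ("No",  "Go home"): 53, ("No",  "Stay"): 83}

policy = {call: max(("Go home", "Stay"), key=lambda a: eu[(call, a)])
          for call in ("Yes", "No")}
print(policy)   # {'Yes': 'Go home', 'No': 'Stay'}
```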

Rutgers CS440, Fall 2003

Value of information

• What is it worth to get another piece of information?

• What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?

• Additional information (if free) cannot make you worse off.

• There is no value-of-information if you will not change your decision.

Rutgers CS440, Fall 2003

Optimal policy with additional evidence

[Same influence diagram as above.]

How much better can we doif we have evidence aboutnewscast?

( Should we ask for evidenceabout newscast? )

Rutgers CS440, Fall 2003

Optimal policy with additional evidence

[Same influence diagram as above.]

Call   Newscast   EU( Go home ) / EU( Stay )
Yes    Quake      44 / 45
Yes    NoQuake    35 / 6
No     Quake      51 / 80
No     NoQuake    52 / 84

Resulting policy:

Call   Newscast   Go home?
Yes    Quake      NO
Yes    NoQuake    YES
No     Quake      NO
No     NoQuake    NO

Rutgers CS440, Fall 2003

Value of perfect information

• The general case: We assume that exact evidence can be obtained about the value of some random variable Ej.

• The agent's current knowledge is E.
• The value of the current best action α is defined by:

EU( α | E ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A )

• With the new evidence E_j, the value of the new best action α_{E_j} will be

EU( α_{E_j} | E, E_j ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A, E_j )

Rutgers CS440, Fall 2003

VPI (cont’d)

• However, we do not yet have this new evidence in hand, so we can only compute the expectation of that quantity over the possible values e of E_j:

Σ_e P( E_j = e | E ) EU( α_e | E, E_j = e )

• The value of perfect information about E_j is then

VPI_E( E_j ) = [ Σ_e P( E_j = e | E ) EU( α_e | E, E_j = e ) ] - EU( α | E )
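A minimal sketch of this definition as code; the function and argument names are mine, not from the slides:

```python
def vpi(p_e_given_E, meu_given_e, meu_current):
    """Value of perfect information about a variable E_j.

    p_e_given_E : dict e -> P(E_j = e | E)
    meu_given_e : dict e -> MEU of the best action given E and E_j = e
    meu_current : MEU of the best action given only the current evidence E
    """
    expected_meu = sum(p_e_given_E[e] * meu_given_e[e] for e in p_e_given_E)
    return expected_meu - meu_current
```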

Rutgers CS440, Fall 2003

Properties of VPI

1. Positive:

∀ E, E_1 :  VPI_E( E_1 ) ≥ 0

2. Non-additive ( in general ):

VPI_E( E_1, E_2 ) ≠ VPI_E( E_1 ) + VPI_E( E_2 )

3. Order-invariant:

VPI_E( E_1, E_2 ) = VPI_E( E_1 ) + VPI_{E,E_1}( E_2 ) = VPI_E( E_2 ) + VPI_{E,E_2}( E_1 )

Rutgers CS440, Fall 2003

Example

• What is the value of information Newscast?

MEU( Call=Yes, Newscast=Quake )
  = max_{GoHome ∈ {Yes, No}} Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, Newscast=Quake, GoHome)

MEU( Call=Yes, Newscast=NoQuake )
  = max_{GoHome ∈ {Yes, No}} Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, Newscast=NoQuake, GoHome)

VPI( Newscast )
  = MEU( Call=Yes, Newscast=Quake ) P( Newscast=Quake | Call=Yes )
  + MEU( Call=Yes, Newscast=NoQuake ) P( Newscast=NoQuake | Call=Yes )
  - MEU( Call=Yes )

Rutgers CS440, Fall 2003

Example (cont’d)

Call?   MEU( Call )
Yes     36.74
No      83.23

Call?   Newscast   MEU( Call, Newscast )
Yes     Quake      45.20
Yes     NoQuake    35.16
No      Quake      80.89
No      NoQuake    83.39

Call?   P( Newscast=Quake | Call )   P( Newscast=NoQuake | Call )
Yes     .1794                        .8206
No      .0453                        .9547

VPI( Newscast | Call=Yes ) = 45.20 * .1794 + 35.16 * .8206 - 36.74 = 36.96 - 36.74 = 0.22
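Plugging the slide's numbers into the vpi sketch given earlier:

```python
p_newscast = {"Quake": 0.1794, "NoQuake": 0.8206}     # P(Newscast | Call=Yes)
meu_given  = {"Quake": 45.20,  "NoQuake": 35.16}      # MEU(Call=Yes, Newscast=e)
print(vpi(p_newscast, meu_given, 36.74))              # ~0.22
```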

Rutgers CS440, Fall 2003

Sequential Decisions

• So far, decisions in static situations. But most situations are dynamic!
  – If I don’t attend CS440 today, will I be kicked out of the class?

– If I don’t attend CS440 today, will I be better off in the future?

[Influence diagram: evidence node E, action node A, chance node State S with P(S | A), utility node U(S).]

A* = arg max_A EU( A | E ) = arg max_A Σ_S U(S) P( S | A, E )

A: Attend / Do not attend
S: Professor hates me / Professor does not care
E: Professor looks upset / not upset
U: Probability of being expelled from class

Rutgers CS440, Fall 2003

Sequential decisions

• Extend static structure over time – just like an HMM, with decisions and utilities.

• One small caveat: a slightly different representation works better…

[Diagram: the static structure unrolled over time; at each step t there is a state S_t, an action A_t, a utility U_t, and evidence E_t (t = 0, 1, 2, …).]

Rutgers CS440, Fall 2003

Partially-observed Markov decision processes (POMDP)

[Diagram: at each time step t, state S_t, action A_t, reward R_t, and evidence E_t; transitions P(S_t | S_{t-1}, A_{t-1}) (arrows into S_t from both S_{t-1} and A_{t-1}), observations P(E_t | S_t), rewards R(S_t).]

• Actions at time t should impact the state at t+1
• Use rewards (R) instead of utilities (U)
• Actions directly determine rewards

Rutgers CS440, Fall 2003

POMDP Problems

• Objective:

Find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total/future “reward”.

E.g.: Find a sequence of actions that takes a car from point A to point B while minimizing time and consumed fuel.

Rutgers CS440, Fall 2003

Example (POMDPs)

• Optimal dialog modeling: e.g., an automated airline reservation system
  – Actions:
    • System prompts: “How may I help you?”, “Please specify your favorite airline”, “Where are you leaving from?”, “Do you mind leaving at a different time?”, …
  – States:
    • (Origin, Destination, Airline, Flight#, Departure, Arrival, …)

“A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies” Levin, Pieraccini, and Eckert, IEEE TSAP, 2000

Rutgers CS440, Fall 2003

Example #2 (POMDP)

• Optimal control
  – States/actions are continuous
  – Objective: design optimal control laws for guiding objects from a start position to a goal position (e.g., the lunar lander)
  – Actions:
    • Engine thrust, robot arm torques, …
  – States:
    • Positions and velocities of objects / robotic arm joints, …
  – Reward:
    • Usually specified as a cost (negative reward): cost of fuel, battery charge loss, energy loss, …

Rutgers CS440, Fall 2003

Markov decision processes (MDPs)

[Diagram: the same model without evidence nodes; at each step t, state S_t, action A_t, reward R_t, with transitions P(S_t | S_{t-1}, A_{t-1}) and rewards R(S_t).]

• Let’s make life a bit simpler – assume we exactly know the state of the world

Rutgers CS440, Fall 2003

Examples

• Blackjack game
  – Objective: have your card sum be greater than the dealer's without exceeding 21.
  – States (200 of them):
    • Current sum (12–21)
    • Dealer's showing card (ace–10)
    • Do I have a usable ace?
  – Reward: +1 for winning, 0 for a draw, -1 for losing
  – Actions: stick (stop receiving cards), hit (receive another card)

Rutgers CS440, Fall 2003

MDP Fundamentals

• We mentioned (in POMDPs) that the goal is to select actions that maximize “reward”

• What “reward”?
  – Immediate?    a_t* = arg max E[ R(s_{t+1}) ]
  – Cumulative?   a_t* = arg max E[ R(s_{t+1}) + R(s_{t+2}) + R(s_{t+3}) + … ]
  – Discounted?   a_t* = arg max E[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + … ]

Rutgers CS440, Fall 2003

Utility & utility maximization in MDPs

• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.

[Diagram: the sequence s_t, a_t, R_t → s_{t+1}, a_{t+1}, R_{t+1} → s_{t+2}, a_{t+2}, R_{t+2}, …]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …

Rutgers CS440, Fall 2003

Utility & utility maximization in MDPs

• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.

• Convert it into a “simplified” model by compounding states, actions, rewards

[Diagram: from s_t, a compound action A (the entire future action sequence) leads to a compound outcome S with utility U.]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …

A* = arg max_A EU( A | s_t )        (maximum expected utility)

Rutgers CS440, Fall 2003

Bellman equations & value iteration algorithm

• Do we need to search over all |A|^T sequences of T future actions? No!

a_t* = arg max_{a_t} Σ_{s_{t+1}} U( s_{t+1} ) P( s_{t+1} | s_t, a_t )            (best immediate action)

U( s_t ) = R( s_t ) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) U( s_{t+1} )   (Bellman update)
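A direct transcription of the Bellman update into code; a sketch assuming the transition model is stored as a nested dict P[s][a][s'] (that data layout is my choice, not the slides'):

```python
def bellman_update(U, R, P, gamma, actions):
    """One synchronous Bellman backup over all states; returns the new utility table."""
    return {s: R[s] + gamma * max(sum(P[s][a][s2] * U[s2] for s2 in U) for a in actions)
            for s in U}
```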

Rutgers CS440, Fall 2003

Proof of Bellman update equation

U(s_t) = max_{a_t, a_{t+1}, …} E[ R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]

       = R(s_t) + max_{a_t, a_{t+1}, …} E[ γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]

       = R(s_t) + max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) max_{a_{t+1}, …} E[ γ R(s_{t+1}) + γ² R(s_{t+2}) + … | s_{t+1} ]

         (the max over the future actions moves inside the sum because those actions can be chosen after s_{t+1} is observed)

       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) max_{a_{t+1}, …} E[ R(s_{t+1}) + γ R(s_{t+2}) + … | s_{t+1} ]

       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) U( s_{t+1} )

Rutgers CS440, Fall 2003

Example

• “If I don’t attend CS440 today, will I be better off in the future?”

• Actions: Attend (A) / Do not attend (NA)
• States: Learned topic (L) / Did not learn topic (NL)
• Reward: +1 if Learned, -1 if Did not learn
• Discount factor: γ = 0.9
• Transition probabilities P( next state | current state, action ):

  Next state            Attended (A)            Do not attend (NA)
                        from L     from NL      from L     from NL
  Learned (L)           0.9        0.5          0.6        0.2
  Did not learn (NL)    0.1        0.5          0.4        0.8

• Bellman equations for this MDP:

U(L)  = R(L)  + γ max{ P(L'|L,A) U(L) + P(NL'|L,A) U(NL),   P(L'|L,NA) U(L) + P(NL'|L,NA) U(NL) }
U(NL) = R(NL) + γ max{ P(L'|NL,A) U(L) + P(NL'|NL,A) U(NL), P(L'|NL,NA) U(L) + P(NL'|NL,NA) U(NL) }

Plugging in the numbers:

U(L)  = 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
U(NL) = -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

Rutgers CS440, Fall 2003

Computing MDP state utilities: value iteration

• How can one solve for U(L) and U(NL) in the previous example?

• Answer: the value-iteration algorithm. Start with some initial utilities U(L), U(NL), then iterate the updates

U(L)  ← 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
U(NL) ← -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

until the values stop changing.

[Plot: U(L) and U(NL) over 50 iterations; both curves level off, with U(L) converging to about 7.16 and U(NL) to about 4.03.]
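A runnable value-iteration sketch for this two-state MDP, reusing the bellman_update helper above with the transition probabilities, rewards, and γ = 0.9 from the example:

```python
P = {"L":  {"A": {"L": 0.9, "NL": 0.1}, "NA": {"L": 0.6, "NL": 0.4}},
     "NL": {"A": {"L": 0.5, "NL": 0.5}, "NA": {"L": 0.2, "NL": 0.8}}}
R = {"L": 1.0, "NL": -1.0}
gamma, actions = 0.9, ("A", "NA")

U = {"L": 0.0, "NL": 0.0}
for _ in range(50):                    # 50 sweeps, matching the plot's horizontal axis
    U = bellman_update(U, R, P, gamma, actions)

print(U)   # roughly U(L) = 7.16, U(NL) = 4.03 after 50 sweeps (about 7.19 / 4.06 at full convergence)
```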

Rutgers CS440, Fall 2003

Optimal policy

• Given the utilities from value iteration, find the optimal policy:

A*(L)  = arg max{ P(L'|L,A) U(L) + P(NL'|L,A) U(NL),   P(L'|L,NA) U(L) + P(NL'|L,NA) U(NL) }
A*(NL) = arg max{ P(L'|NL,A) U(L) + P(NL'|NL,A) U(NL), P(L'|NL,NA) U(L) + P(NL'|NL,NA) U(NL) }

A*(L) = arg max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
      = arg max{ 0.9 * 7.1574 + 0.1 * 4.0324, 0.6 * 7.1574 + 0.4 * 4.0324 }
      = arg max{ 6.8449, 5.9074 } = Attend

A*(NL) = arg max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }
       = arg max{ 5.5949, 4.6574 } = Attend

Rutgers CS440, Fall 2003

Policy iteration

• Instead of iterating in the space of utility values, iterate over policies

1. Start with an initial policy, e.g., A*(L) and A*(NL)

2. Compute the utility values U(L) and U(NL) induced by that policy (policy evaluation)

3. Compute a new policy A*(L), A*(NL) from the utilities U(L) and U(NL) (policy improvement)

4. Repeat steps 2-3 until the policy no longer changes (a sketch follows below)
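A minimal policy-iteration sketch for the same two-state MDP (self-contained; the policy-evaluation step here simply iterates the fixed-policy Bellman equation instead of solving the linear system exactly):

```python
P = {"L":  {"A": {"L": 0.9, "NL": 0.1}, "NA": {"L": 0.6, "NL": 0.4}},
     "NL": {"A": {"L": 0.5, "NL": 0.5}, "NA": {"L": 0.2, "NL": 0.8}}}
R = {"L": 1.0, "NL": -1.0}
gamma, actions = 0.9, ("A", "NA")

def q_value(s, a, U):
    return sum(P[s][a][s2] * U[s2] for s2 in U)

policy = {"L": "NA", "NL": "NA"}                 # arbitrary starting policy (step 1)
while True:
    U = {s: 0.0 for s in P}                      # step 2: evaluate the current policy
    for _ in range(100):
        U = {s: R[s] + gamma * q_value(s, policy[s], U) for s in U}
    new_policy = {s: max(actions, key=lambda a: q_value(s, a, U))   # step 3: improve
                  for s in U}
    if new_policy == policy:                     # step 4: stop when the policy is stable
        break
    policy = new_policy

print(policy)   # {'L': 'A', 'NL': 'A'}: Attend in both states, as on the optimal-policy slide
```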