
Rutgers CS440, Fall 2003

Decisions under uncertainty

Reading: Ch. 16, AIMA 2nd Ed.

Rutgers CS440, Fall 2003

Outline

• Decisions, preferences, utility functions
• Influence diagrams
• Value of information

Rutgers CS440, Fall 2003

Decision making

• Decisions – an irrevocable allocation of domain resources

• Decisions should be made so as to maximize expected utility

• Questions:
– Why make decisions based on average or expected utility?

– Why can one assume that utility functions exist?

– Can an agent act rationally by expressing preferences between states without giving them numeric values?

– Can every preference structure be captured by assigning a single number to every state?

Rutgers CS440, Fall 2003

Simple decision problem

• Party decision problem: inside or outside?

Action \ Weather   Dry        Wet
IN                 Regret     Relief
OUT                Perfect!   Disaster

Rutgers CS440, Fall 2003

Value function

• Numerical score over all possible states of the world

Action Weather Value

OUT Dry $100

IN Wet $60

IN Dry $50

OUT Wet $0

Rutgers CS440, Fall 2003

Preferences

• Agent chooses among prizes (A,B,…) and lotteries (situations with uncertain prizes)

L1 = ( 0.2, $40,000; 0.8, $0 )
L2 = ( 0.25, $30,000; 0.75, $0 )

Notation:
A ≻ B : A is preferred to B
A ≺ B : B is preferred to A
A ~ B : indifference between A and B

Rutgers CS440, Fall 2003

Desired properties for preferences over lotteries

• If you prefer $100 over $0 and p < q, then you should prefer L2 to L1 (L1 ≺ L2), where

L1 = ( p, $100; 1-p, $0 )
L2 = ( q, $100; 1-q, $0 )

Rutgers CS440, Fall 2003

Properties of (rational) preference

Lead to rational agent behavior

1. Orderability: ( A ≻ B ) ∨ ( B ≻ A ) ∨ ( A ~ B )

2. Transitivity: ( A ≻ B ) ∧ ( B ≻ C ) ⇒ ( A ≻ C )

3. Continuity: A ≻ B ≻ C ⇒ ∃p, ( p, A; 1-p, C ) ~ B

4. Substitutability: A ~ B ⇒ ∀p, ( p, A; 1-p, C ) ~ ( p, B; 1-p, C )

5. Monotonicity: A ≻ B ⇒ ( p > q ⇔ ( p, A; 1-p, B ) ≻ ( q, A; 1-q, B ) )

Rutgers CS440, Fall 2003

Preference & expected utility

• Properties of preference lead to existence (Ramsey 1931, von Neumann

& Morgenstern 1944) of utility function U such that

L1 = ( p, $100; 1-p, $0 )
L2 = ( q, $100; 1-q, $0 )

L1 ≺ L2
IFF
EU(L1) = p U($100) + (1-p) U($0)  <  EU(L2) = q U($100) + (1-q) U($0)
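As a small aside (not from the slides), this comparison is easy to compute directly. A minimal Python sketch; the utility function U here is an arbitrary illustrative placeholder:

```python
def expected_utility(lottery, U):
    """Expected utility of a lottery given as (probability, prize) pairs."""
    return sum(p * U(prize) for p, prize in lottery)

# Illustrative concave utility over money (any increasing U could be plugged in).
U = lambda x: x ** 0.5

L1 = [(0.2, 40000), (0.8, 0)]
L2 = [(0.25, 30000), (0.75, 0)]
print(expected_utility(L1, U), expected_utility(L2, U))   # 40.0 vs. ~43.3
```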

Rutgers CS440, Fall 2003

Properties of utility

• Utility is a function that maps states to real numbers

• Standard approach to assessing utilities of states:

1. Compare state A to a standard lottery L = ( p, u_best; 1-p, u_worst ), where
   u_best – best possible event
   u_worst – worst possible event

2. Adjust p until A ~ L

Example: $30 ~ ( 0.999999, continue as before; 0.000001, instant death )
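As a one-line consequence (using the normalized scale introduced on the next slide, with U(best) = 1 and U(worst) = 0), the indifference point gives the utility directly:

U(A) = p · U(best) + (1 - p) · U(worst) = p · 1 + (1 - p) · 0 = p

so, for the example above, U($30) = 0.999999 on that scale.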

Rutgers CS440, Fall 2003

Utility scales

• Normalized utilities: Ubest = 1.0, Uworst = 0.0

• Micromorts: one-millionth chance of death
  – useful for Russian roulette, paying to reduce product risks, etc.

• QALYs: quality-adjusted life years
  – useful for medical decisions involving substantial risk

• Note: behavior is invariant w.r.t. positive linear transformation

U’(s) = A U(s) + B, A > 0

Rutgers CS440, Fall 2003

Utility vs Money

• Utility is NOT monetary payoff

L1 = ( 0.8, $40,000; 0.2, $0 )
L2 = ( 1, $30,000; 0, $0 )

EMV(L) = Σ_i p_i MV_i   (expected monetary value)

EMV(L1) = $32,000 > EMV(L2) = $30,000, yet many people prefer the sure $30,000 of L2.

Rutgers CS440, Fall 2003

Attitudes toward risk

[Figure: utility of money U($reward) vs. $reward, a concave curve. For the lottery L = ( 0.5, $1000; 0.5, $0 ), the expected utility U(L) corresponds to a certain monetary equivalent of about $400, while EMV(L) = $500; the difference is the insurance / risk premium.]

U concave – risk averse
U linear – risk neutral
U convex – risk seeking
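A minimal numeric sketch of risk aversion in Python, assuming an illustrative concave utility U($x) = sqrt(x) (this particular U is not from the slides):

```python
import math

U = math.sqrt                          # concave utility => risk-averse behavior
lottery = [(0.5, 1000.0), (0.5, 0.0)]  # the 50/50 lottery over $1000 and $0

emv = sum(p * x for p, x in lottery)       # expected monetary value = 500.0
eu = sum(p * U(x) for p, x in lottery)     # expected utility of the lottery
ce = eu ** 2                               # certain monetary equivalent: U(ce) = eu
risk_premium = emv - ce                    # what the agent gives up to avoid the risk

print(emv, ce, risk_premium)               # 500.0, ~250.0, ~250.0
```

With this sharply concave U the certain equivalent is about $250; the gentler curve sketched on the slide puts it near $400, but the qualitative point (certain equivalent below EMV for concave U) is the same.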

Rutgers CS440, Fall 2003

Human judgment under uncertainty

• Is decision theory compatible with human judgment under uncertainty?

• Are people “experts” in reasoning under uncertainty? How well do they perform? What kind of heuristics do they use?

Consider two pairs of lotteries:

L1 = ( 0.2, $40,000; 0.8, $0 )   vs.   L2 = ( 0.25, $30,000; 0.75, $0 )
L3 = ( 0.8, $40,000; 0.2, $0 )   vs.   L4 = ( 1, $30,000; 0, $0 )

People commonly prefer L1 in the first pair and L4 in the second (the Allais paradox). Taking U($0) = 0:

L1 ≻ L2 implies .2 U($40k) > .25 U($30k), i.e. .8 U($40k) > U($30k)

L4 ≻ L3 implies .8 U($40k) < U($30k)

No single utility function is consistent with both choices.

Rutgers CS440, Fall 2003

Student group utility

[Plot: elicited group utility curve; probability p on the vertical axis (0 to 1), dollar amount on the horizontal axis ($500 to $10,000).]

• For each $ amount, adjust p until half the class votes for the lottery ($10,000).

Rutgers CS440, Fall 2003

Technology forecasting

• “I think there is a world market for about five computers.”
  - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943

• “There doesn't seem to be any real limit to the growth of the computer industry.”
  - Thomas J. Watson, Jr., Chairman of the Board of IBM, 1968

Rutgers CS440, Fall 2003

Maximizing expected utility

Party decision with P(Dry) = 0.7, P(Wet) = 0.3:

Action   Weather   Value   Utility
IN       Dry       $50     U($50)  = 0.632
IN       Wet       $60     U($60)  = 0.699
OUT      Dry       $100    U($100) = 0.865
OUT      Wet       $0      U($0)   = 0

EU(IN)  = 0.7 * 0.632 + 0.3 * 0.699 = 0.6521
EU(OUT) = 0.7 * 0.865 + 0.3 * 0     = 0.6055

Action* = arg max{ EU(IN), EU(OUT) } = IN
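The same computation as a small Python sketch (probabilities and utilities taken from the slide):

```python
P_weather = {"Dry": 0.7, "Wet": 0.3}
utility = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,
           ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def expected_utility(action):
    # EU(action) = sum over weather states of P(weather) * U(action, weather)
    return sum(p * utility[(action, w)] for w, p in P_weather.items())

eu = {a: expected_utility(a) for a in ("IN", "OUT")}
best_action = max(eu, key=eu.get)
print(eu, best_action)     # {'IN': ~0.6521, 'OUT': ~0.6055}  IN
```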

Rutgers CS440, Fall 2003

Multi-attribute utilities

• Many aspects of an outcome combine to determine our preferences:
  – vacation planning: cost, flying time, beach quality, food quality, etc.

• Medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, etc.

• For rational decision making, must combine all relevant factors into single utility function.

U(a, b, c, …) = f[ f1(a), f2(b), … ]

where f is a simple function such as addition.

• f = + (an additive utility) applies in the case of mutual preferential independence: preferences over the values of any subset of attributes do not depend on the fixed values of the remaining attributes.
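A minimal sketch of an additive multi-attribute utility for the vacation example; the attribute value functions and weights below are made-up placeholders, not values from the slides:

```python
# U(outcome) = sum_i w_i * f_i(attribute_i), an additive form.
value_fns = {
    "cost":        lambda dollars: 1.0 - min(dollars, 5000) / 5000.0,  # cheaper is better
    "flying_time": lambda hours:   1.0 - min(hours, 24) / 24.0,        # shorter is better
    "beach":       lambda quality: quality,                            # already scaled to [0, 1]
}
weights = {"cost": 0.5, "flying_time": 0.2, "beach": 0.3}

def utility(outcome):
    return sum(weights[a] * value_fns[a](outcome[a]) for a in weights)

print(utility({"cost": 2000, "flying_time": 8, "beach": 0.9}))   # ~0.70
```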

Rutgers CS440, Fall 2003

Decision graphs / Influence diagrams

[Influence diagram (flood_decision.net): chance nodes earthquake, burglary, alarm, call, newscast, goods recovered, miss meeting; action node: go home?; utility node: Utility.]

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

Call?   EU( Go home )   EU( Stay )
Yes     ?               ?
No      ?               ?

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

EU( GoHome=Yes | Call=Yes )
  = Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, GoHome=Yes)
  = Σ_{GR,M} U(GR, M) P(M | GoHome=Yes) P(GR | Call=Yes, GoHome=Yes)

where GR = goods recovered and M = miss meeting.

Rutgers CS440, Fall 2003

Optimal policy

[Same influence diagram as above.]

Choose action given evidence: MEU( go home | call )

Call?   EU( Go home )   EU( Stay )   MEU( Call )
Yes     37              13           37
No      53              83           83

A*(Call=Yes) = Go Home
A*(Call=No)  = Stay
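Reading the policy off this table programmatically (values from the slide):

```python
eu = {("Yes", "Go home"): 37, ("Yes", "Stay"): 13,
      ("No",  "Go home"): 53, ("No",  "Stay"): 83}

policy = {call: max(("Go home", "Stay"), key=lambda a: eu[(call, a)])
          for call in ("Yes", "No")}
print(policy)   # {'Yes': 'Go home', 'No': 'Stay'}
```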

Rutgers CS440, Fall 2003

Value of information

• What is it worth to get another piece of information?

• What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?

• Additional information (if free) cannot make you worse off.

• There is no value-of-information if you will not change your decision.

Rutgers CS440, Fall 2003

Optimal policy with additional evidence

[Same influence diagram as above.]

How much better can we doif we have evidence aboutnewscast?

( Should we ask for evidenceabout newscast? )

Rutgers CS440, Fall 2003

Optimal policy with additional evidence

[Same influence diagram as above.]

Call   Newscast   EU( Go home ) / EU( Stay )
Yes    Quake      44 / 45
Yes    NoQuake    35 / 6
No     Quake      51 / 80
No     NoQuake    52 / 84

Resulting policy:

Call   Newscast   Go home?
Yes    Quake      NO
Yes    NoQuake    YES
No     Quake      NO
No     NoQuake    NO

Rutgers CS440, Fall 2003

Value of perfect information

• The general case: We assume that exact evidence can be obtained about the value of some random variable Ej.

• The agent's current knowledge is E.
• The value of the current best action α is defined by:

EU( α | E ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A )

• With the new evidence E_j, the value of the new best action α_{E_j} will be

EU( α_{E_j} | E, E_j ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A, E_j )

Rutgers CS440, Fall 2003

VPI (cont’d)

• However, we do not yet have this new evidence in hand, so we can only compute the expectation of that quantity over the possible values e of E_j:

Σ_e P( E_j = e | E ) EU( α_e | E, E_j = e )

• The value of perfect information about E_j is then

VPI_E( E_j ) = [ Σ_e P( E_j = e | E ) EU( α_e | E, E_j = e ) ] - EU( α | E )
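A minimal sketch of this definition as code; the function and argument names are mine, not from the slides:

```python
def vpi(p_e_given_E, meu_given_e, meu_current):
    """Value of perfect information about a variable E_j.

    p_e_given_E : dict e -> P(E_j = e | E)
    meu_given_e : dict e -> MEU of the best action given E and E_j = e
    meu_current : MEU of the best action given only the current evidence E
    """
    expected_meu = sum(p_e_given_E[e] * meu_given_e[e] for e in p_e_given_E)
    return expected_meu - meu_current
```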

Rutgers CS440, Fall 2003

Properties of VPI

1. Positive:

∀ E, E_1 :  VPI_E( E_1 ) ≥ 0

2. Non-additive ( in general ):

VPI_E( E_1, E_2 ) ≠ VPI_E( E_1 ) + VPI_E( E_2 )

3. Order-invariant:

VPI_E( E_1, E_2 ) = VPI_E( E_1 ) + VPI_{E,E_1}( E_2 ) = VPI_E( E_2 ) + VPI_{E,E_2}( E_1 )

Rutgers CS440, Fall 2003

Example

• What is the value of information Newscast?

MEU( Call=Yes, Newscast=Quake )
  = max_{GoHome ∈ {Yes, No}} Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, Newscast=Quake, GoHome)

MEU( Call=Yes, Newscast=NoQuake )
  = max_{GoHome ∈ {Yes, No}} Σ_{GR,M} U(GR, M) P(GR, M | Call=Yes, Newscast=NoQuake, GoHome)

VPI( Newscast )
  = MEU( Call=Yes, Newscast=Quake ) P( Newscast=Quake | Call=Yes )
  + MEU( Call=Yes, Newscast=NoQuake ) P( Newscast=NoQuake | Call=Yes )
  - MEU( Call=Yes )

Rutgers CS440, Fall 2003

Example (cont’d)

Call?   MEU( Call )
Yes     36.74
No      83.23

Call?   Newscast   MEU( Call, Newscast )
Yes     Quake      45.20
Yes     NoQuake    35.16
No      Quake      80.89
No      NoQuake    83.39

Call?   P( Newscast=Quake | Call )   P( Newscast=NoQuake | Call )
Yes     .1794                        .8206
No      .0453                        .9547

VPI( Newscast | Call=Yes ) = 45.20 * .1794 + 35.16 * .8206 - 36.74 = 36.96 - 36.74 = 0.22
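Plugging the slide's numbers into the vpi sketch given earlier:

```python
p_newscast = {"Quake": 0.1794, "NoQuake": 0.8206}     # P(Newscast | Call=Yes)
meu_given  = {"Quake": 45.20,  "NoQuake": 35.16}      # MEU(Call=Yes, Newscast=e)
print(vpi(p_newscast, meu_given, 36.74))              # ~0.22
```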

Rutgers CS440, Fall 2003

Sequential Decisions

• So far, decisions in static situations. But most situations are dynamic!
  – If I don’t attend CS440 today, will I be kicked out of the class?

– If I don’t attend CS440 today, will I be better off in the future?

[Influence diagram: evidence node E, action node A, chance node State S with P(S | A), utility node U(S).]

A* = arg max_A EU( A | E ) = arg max_A Σ_S U(S) P( S | A, E )

A: Attend / Do not attend
S: Professor hates me / Professor does not care
E: Professor looks upset / not upset
U: Probability of being expelled from class

Rutgers CS440, Fall 2003

Sequential decisions

• Extend static structure over time – just like an HMM, with decisions and utilities.

• One small caveat: a slightly different representation works better…

[Diagram: the static structure unrolled over time; at each step t there is a state S_t, an action A_t, a utility U_t, and evidence E_t (t = 0, 1, 2, …).]

Rutgers CS440, Fall 2003

Partially-observed Markov decision processes (POMDP)

[Diagram: at each time step t, state S_t, action A_t, reward R_t, and evidence E_t; transitions P(S_t | S_{t-1}, A_{t-1}) (arrows into S_t from both S_{t-1} and A_{t-1}), observations P(E_t | S_t), rewards R(S_t).]

• Actions at time t should impact the state at t+1
• Use rewards (R) instead of utilities (U)
• Actions directly determine rewards

Rutgers CS440, Fall 2003

POMDP Problems

• Objective:

Find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total/future “reward”.

E.g.: Find a sequence of actions that takes a car from point A to point B while minimizing time and consumed fuel.

Rutgers CS440, Fall 2003

Example (POMDPs)

• Optimal dialog modeling: e.g., an automated airline reservation system
  – Actions:
    • System prompts: “How may I help you?”, “Please specify your favorite airline”, “Where are you leaving from?”, “Do you mind leaving at a different time?”, …
  – States:
    • (Origin, Destination, Airline, Flight#, Departure, Arrival, …)

“A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies” Levin, Pieraccini, and Eckert, IEEE TSAP, 2000

Rutgers CS440, Fall 2003

Example #2 (POMDP)

• Optimal control
  – States/actions are continuous
  – Objective: design optimal control laws for guiding objects from a start position to a goal position (e.g., the lunar lander)
  – Actions:
    • Engine thrust, robot arm torques, …
  – States:
    • Positions and velocities of objects / robotic arm joints, …
  – Reward:
    • Usually specified as a cost (negative reward): cost of fuel, battery charge loss, energy loss, …

Rutgers CS440, Fall 2003

Markov decision processes (MDPs)

[Diagram: the same model without evidence nodes; at each step t, state S_t, action A_t, reward R_t, with transitions P(S_t | S_{t-1}, A_{t-1}) and rewards R(S_t).]

• Let’s make life a bit simpler – assume we exactly know the state of the world

Rutgers CS440, Fall 2003

Examples

• Blackjack game
  – Objective: have your card sum be greater than the dealer's without exceeding 21.
  – States (200 of them):
    • Current sum (12–21)
    • Dealer's showing card (ace–10)
    • Do I have a usable ace?
  – Reward: +1 for winning, 0 for a draw, -1 for losing
  – Actions: stick (stop receiving cards), hit (receive another card)

Rutgers CS440, Fall 2003

MDP Fundamentals

• We mentioned (in POMDPs) that the goal is to select actions that maximize “reward”

• What “reward”?
  – Immediate?    a_t* = arg max E[ R(s_{t+1}) ]
  – Cumulative?   a_t* = arg max E[ R(s_{t+1}) + R(s_{t+2}) + R(s_{t+3}) + … ]
  – Discounted?   a_t* = arg max E[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + … ]

Rutgers CS440, Fall 2003

Utility & utility maximization in MDPs

• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.

[Diagram: the sequence s_t, a_t, R_t → s_{t+1}, a_{t+1}, R_{t+1} → s_{t+2}, a_{t+2}, R_{t+2}, …]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …

Rutgers CS440, Fall 2003

Utility & utility maximization in MDPs

• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.

• Convert it into a “simplified” model by compounding states, actions, rewards

[Diagram: from s_t, a compound action A (the entire future action sequence) leads to a compound outcome S with utility U.]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …

A* = arg max_A EU( A | s_t )        (maximum expected utility)

Rutgers CS440, Fall 2003

Bellman equations & value iteration algorithm

• Do we need to search over all |A|^T sequences of T future actions? No!

a_t* = arg max_{a_t} Σ_{s_{t+1}} U( s_{t+1} ) P( s_{t+1} | s_t, a_t )            (best immediate action)

U( s_t ) = R( s_t ) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) U( s_{t+1} )   (Bellman update)
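A direct transcription of the Bellman update into code; a sketch assuming the transition model is stored as a nested dict P[s][a][s'] (that data layout is my choice, not the slides'):

```python
def bellman_update(U, R, P, gamma, actions):
    """One synchronous Bellman backup over all states; returns the new utility table."""
    return {s: R[s] + gamma * max(sum(P[s][a][s2] * U[s2] for s2 in U) for a in actions)
            for s in U}
```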

Rutgers CS440, Fall 2003

Proof of Bellman update equation

U(s_t) = max_{a_t, a_{t+1}, …} E[ R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]

       = R(s_t) + max_{a_t, a_{t+1}, …} E[ γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]

       = R(s_t) + max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) max_{a_{t+1}, …} E[ γ R(s_{t+1}) + γ² R(s_{t+2}) + … | s_{t+1} ]

         (the max over the future actions moves inside the sum because those actions can be chosen after s_{t+1} is observed)

       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) max_{a_{t+1}, …} E[ R(s_{t+1}) + γ R(s_{t+2}) + … | s_{t+1} ]

       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P( s_{t+1} | s_t, a_t ) U( s_{t+1} )

Rutgers CS440, Fall 2003

Example

• “If I don’t attend CS440 today, will I be better off in the future?”

• Actions: Attend (A) / Do not attend (NA)
• States: Learned topic (L) / Did not learn topic (NL)
• Reward: +1 if Learned, -1 if Did not learn
• Discount factor: γ = 0.9
• Transition probabilities P( next state | current state, action ):

  Next state            Attended (A)            Do not attend (NA)
                        from L     from NL      from L     from NL
  Learned (L)           0.9        0.5          0.6        0.2
  Did not learn (NL)    0.1        0.5          0.4        0.8

• Bellman equations for this MDP:

U(L)  = R(L)  + γ max{ P(L'|L,A) U(L) + P(NL'|L,A) U(NL),   P(L'|L,NA) U(L) + P(NL'|L,NA) U(NL) }
U(NL) = R(NL) + γ max{ P(L'|NL,A) U(L) + P(NL'|NL,A) U(NL), P(L'|NL,NA) U(L) + P(NL'|NL,NA) U(NL) }

Plugging in the numbers:

U(L)  = 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
U(NL) = -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

Rutgers CS440, Fall 2003

Computing MDP state utilities: value iteration

• How can one solve for U(L) and U(NL) in the previous example?

• Answer: the value-iteration algorithm. Start with some initial utilities U(L), U(NL), then iterate the updates

U(L)  ← 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
U(NL) ← -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

until the values stop changing.

[Plot: U(L) and U(NL) over 50 iterations; both curves level off, with U(L) converging to about 7.16 and U(NL) to about 4.03.]
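A runnable value-iteration sketch for this two-state MDP, reusing the bellman_update helper above with the transition probabilities, rewards, and γ = 0.9 from the example:

```python
P = {"L":  {"A": {"L": 0.9, "NL": 0.1}, "NA": {"L": 0.6, "NL": 0.4}},
     "NL": {"A": {"L": 0.5, "NL": 0.5}, "NA": {"L": 0.2, "NL": 0.8}}}
R = {"L": 1.0, "NL": -1.0}
gamma, actions = 0.9, ("A", "NA")

U = {"L": 0.0, "NL": 0.0}
for _ in range(50):                    # 50 sweeps, matching the plot's horizontal axis
    U = bellman_update(U, R, P, gamma, actions)

print(U)   # roughly U(L) = 7.16, U(NL) = 4.03 after 50 sweeps (about 7.19 / 4.06 at full convergence)
```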

Rutgers CS440, Fall 2003

Optimal policy

• Given the utilities from value iteration, find the optimal policy:

A*(L)  = arg max{ P(L'|L,A) U(L) + P(NL'|L,A) U(NL),   P(L'|L,NA) U(L) + P(NL'|L,NA) U(NL) }
A*(NL) = arg max{ P(L'|NL,A) U(L) + P(NL'|NL,A) U(NL), P(L'|NL,NA) U(L) + P(NL'|NL,NA) U(NL) }

A*(L) = arg max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
      = arg max{ 0.9 * 7.1574 + 0.1 * 4.0324, 0.6 * 7.1574 + 0.4 * 4.0324 }
      = arg max{ 6.8449, 5.9074 } = Attend

A*(NL) = arg max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }
       = arg max{ 5.5949, 4.6574 } = Attend

Rutgers CS440, Fall 2003

Policy iteration

• Instead of iterating in the space of utility values, iterate over policies

1. Start with an initial policy, e.g., A*(L) and A*(NL)

2. Compute the utility values U(L) and U(NL) induced by that policy (policy evaluation)

3. Compute a new policy A*(L), A*(NL) from the utilities U(L) and U(NL) (policy improvement)

4. Repeat steps 2-3 until the policy no longer changes (a sketch follows below)
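A minimal policy-iteration sketch for the same two-state MDP (self-contained; the policy-evaluation step here simply iterates the fixed-policy Bellman equation instead of solving the linear system exactly):

```python
P = {"L":  {"A": {"L": 0.9, "NL": 0.1}, "NA": {"L": 0.6, "NL": 0.4}},
     "NL": {"A": {"L": 0.5, "NL": 0.5}, "NA": {"L": 0.2, "NL": 0.8}}}
R = {"L": 1.0, "NL": -1.0}
gamma, actions = 0.9, ("A", "NA")

def q_value(s, a, U):
    return sum(P[s][a][s2] * U[s2] for s2 in U)

policy = {"L": "NA", "NL": "NA"}                 # arbitrary starting policy (step 1)
while True:
    U = {s: 0.0 for s in P}                      # step 2: evaluate the current policy
    for _ in range(100):
        U = {s: R[s] + gamma * q_value(s, policy[s], U) for s in U}
    new_policy = {s: max(actions, key=lambda a: q_value(s, a, U))   # step 3: improve
                  for s in U}
    if new_policy == policy:                     # step 4: stop when the policy is stable
        break
    policy = new_policy

print(policy)   # {'L': 'A', 'NL': 'A'}: Attend in both states, as on the optimal-policy slide
```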