Rutgers CS440, Fall 2003: Decisions under Uncertainty. Reading: Ch. 16, AIMA 2nd ed.
Outline
• Decisions, preferences, utility functions
• Influence diagrams
• Value of information
Decision making
• Decisions – an irrevocable allocation of domain resources
• Decisions should be made so as to maximize expected utility
• Questions:
– Why make decisions based on average or expected utility?
– Why can one assume that utility functions exist?
– Can an agent act rationally by expressing preferences between states without giving them numeric values?
– Can every preference structure be captured by assigning a single number to every state?
Simple decision problem
• Party decision problem: inside or outside?
Action \ Weather   Dry        Wet
IN                 Regret     Relief
OUT                Perfect!   Disaster
Value function
• Numerical score over all possible states of the world
Action Weather Value
OUT Dry $100
IN Wet $60
IN Dry $50
OUT Wet $0
Preferences
• Agent chooses among prizes (A,B,…) and lotteries (situations with uncertain prizes)
L1 = ( .2, $40,000; .8, $0 )
L2 = ( .25, $30,000; .75, $0 )

Notation:
A ≻ B : A is preferred to B
A ≺ B : B is preferred to A
A ~ B : indifference between A and B
Desired properties for preferences over lotteries
• If you prefer $100 over $0 AND p < q, then L2 should be preferred to L1, where

L1 = ( p, $100; 1−p, $0 )
L2 = ( q, $100; 1−q, $0 )
Properties of (rational) preference
Lead to rational agent behavior
1. Orderability: ( A ≻ B ) ∨ ( A ≺ B ) ∨ ( A ~ B )
2. Transitivity: ( A ≻ B ) ∧ ( B ≻ C ) ⇒ ( A ≻ C )
3. Continuity: A ≻ B ≻ C ⇒ ∃p, ( p, A; 1−p, C ) ~ B
4. Substitutability: A ~ B ⇒ ∀p, ( p, A; 1−p, C ) ~ ( p, B; 1−p, C )
5. Monotonicity: A ≻ B ⇒ ( p > q ⇔ ( p, A; 1−p, B ) ≻ ( q, A; 1−q, B ) )
Preference & expected utility
• Properties of preference lead to existence (Ramsey 1931, von Neumann
& Morgenstern 1944) of utility function U such that
L1 = ( p, $100; 1−p, $0 )
L2 = ( q, $100; 1−q, $0 )

L1 ≺ L2
IFF
EU(L1) = p U($100) + (1−p) U($0) < q U($100) + (1−q) U($0) = EU(L2)
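As a concrete illustration (not from the lecture), here is a minimal Python sketch that compares two lotteries by expected utility; the lotteries are the ones from the earlier slide, and the normalized utility function `U` is an assumed placeholder that could be replaced by any elicited utility.

```python
def expected_utility(lottery, U):
    """Expected utility of a lottery given as a list of (probability, prize) pairs."""
    return sum(p * U(prize) for p, prize in lottery)

L1 = [(0.20, 40_000), (0.80, 0)]   # ( .2, $40,000; .8, $0 )
L2 = [(0.25, 30_000), (0.75, 0)]   # ( .25, $30,000; .75, $0 )

U = lambda x: x / 40_000           # assumed utility, normalized so U($40,000) = 1

eu1, eu2 = expected_utility(L1, U), expected_utility(L2, U)
print(eu1, eu2, "L1 preferred" if eu1 > eu2 else "L2 preferred")
```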
Properties of utility
• Utility is a function that maps states to real numbers
• Standard approach to assessing utilities of states:
1. Compare state A to a standard lottery L = ( p, Ubest; 1−p, Uworst ), where
   Ubest – best possible event
   Uworst – worst possible event
2. Adjust p until A ~ L

Example: $30 ~ ( 0.999999, continue as before; 0.000001, instant death )
Utility scales
• Normalized utilities: Ubest = 1.0, Uworst = 0.0
• Micromorts: one-millionth chance of death
– useful for Russian roulette, paying to reduce product risks, etc.
• QALYs: quality-adjusted life years
– useful for medical decisions involving substantial risk
• Note: behavior is invariant w.r.t. positive linear transformation
U’(s) = A U(s) + B, A > 0
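A small sketch (not from the lecture) of why this invariance holds: the decision depends only on which action has the larger expected utility, so a positive linear rescaling cannot change it. The actions, outcomes, and utility numbers below are made-up placeholders.

```python
# Invariance of decisions under U'(s) = A*U(s) + B with A > 0 (made-up numbers).
outcomes = {"act1": [(0.6, "s1"), (0.4, "s2")],
            "act2": [(0.3, "s1"), (0.7, "s2")]}
U  = {"s1": 0.9, "s2": 0.2}
U2 = {s: 5.0 * u + 7.0 for s, u in U.items()}   # A = 5, B = 7

def best(utils):
    # pick the action with the largest expected utility
    return max(outcomes, key=lambda a: sum(p * utils[s] for p, s in outcomes[a]))

assert best(U) == best(U2)   # same decision under either scale
print(best(U))               # act1
```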
Utility vs Money
• Utility is NOT monetary payoff
L1 = ( .8, $40,000; .2, $0 )
L2 = ( 1.0, $30,000 )

EMV(L) = Σ_i p_i MV_i   (expected monetary value)

EMV(L1) = $32,000 > EMV(L2) = $30,000, yet many people prefer the sure $30,000.
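To make this concrete, here is a small sketch (not from the lecture) that computes EMV and expected utility under an assumed concave utility (square root, purely for illustration); with such a utility the sure $30,000 comes out ahead even though its EMV is lower.

```python
import math

def emv(lottery):
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, U):
    return sum(p * U(x) for p, x in lottery)

L1 = [(0.8, 40_000), (0.2, 0)]
L2 = [(1.0, 30_000)]

U = math.sqrt   # assumed concave (risk-averse) utility

print(emv(L1), emv(L2))                                  # 32000.0  30000.0
print(expected_utility(L1, U), expected_utility(L2, U))  # ~160.0 vs ~173.2 -> L2 preferred
```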
Attitudes toward risk
[Figure: a concave utility curve U( $reward ). The lottery L = ( 0.5, $1000; 0.5, $0 ) has expected monetary value $500 but certain monetary equivalent of about $400; the difference between the two is the insurance risk premium, and U(L) lies below U($500).]
U concave – risk averse; U linear – risk neutral; U convex – risk seeking
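A minimal sketch (not from the lecture) of the certainty equivalent and risk premium for the 50/50 $1000/$0 lottery in the figure, under an assumed concave utility U(x) = sqrt(x). With this particular utility the certainty equivalent comes out to $250 rather than the roughly $400 sketched in the figure, since the figure's exact curve is not specified.

```python
import math

U = math.sqrt
U_inv = lambda u: u ** 2                       # inverse of the assumed utility

lottery = [(0.5, 1000.0), (0.5, 0.0)]
emv = sum(p * x for p, x in lottery)           # 500.0
eu  = sum(p * U(x) for p, x in lottery)        # ~15.81
certainty_equivalent = U_inv(eu)               # the sure amount worth the same as the lottery
risk_premium = emv - certainty_equivalent      # what the agent would pay to avoid the risk

print(emv, certainty_equivalent, risk_premium)
```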
Human judgment under uncertainty
• Is decision theory compatible with human judgment under uncertainty?
• Are people “experts” in reasoning under uncertainty? How well do they perform? What kind of heuristics do they use?
Choice 1: ( .2, $40,000; .8, $0 ) vs ( .25, $30,000; .75, $0 )
Choice 2: ( .8, $40,000; .2, $0 ) vs the sure $30,000

Preferring the first lottery in Choice 1 implies .2 U($40k) > .25 U($30k), i.e., .8 U($40k) > U($30k).
Preferring the sure $30,000 in Choice 2 implies .8 U($40k) < U($30k).
People typically choose this way, and no single utility function can satisfy both inequalities.
Student group utility
[Plot: utility curve elicited from the class; probability p on the vertical axis (0 to 1) against dollar amounts from $500 to $10,000.]
• For each $ amount, adjust p until half the class votes for the standard lottery with best prize $10,000 over the sure amount
Technology forecasting
• “I think there is a world market for about five computers.”
  – Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943
• “There doesn't seem to be any real limit to the growth of the computer industry.”
  – Thomas J. Watson, Jr., Chairman of the Board of IBM, 1968
Maximizing expected utility
Weather probabilities: P(Dry) = 0.7, P(Wet) = 0.3

Action   Weather   Value   Utility
IN       Dry       $50     U($50)  = 0.632
IN       Wet       $60     U($60)  = 0.699
OUT      Dry       $100    U($100) = 0.865
OUT      Wet       $0      U($0)   = 0

EU(IN)  = 0.7 * 0.632 + 0.3 * 0.699 = 0.6521
EU(OUT) = 0.7 * 0.865 + 0.3 * 0     = 0.6055

Action* = arg MEU(IN, OUT) = arg max{ EU(IN), EU(OUT) } = IN
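A minimal Python sketch of this computation, using the probabilities and utilities from the slide:

```python
# Maximizing expected utility for the party problem (numbers from the slide).
P_weather = {"Dry": 0.7, "Wet": 0.3}
utility = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,
           ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def EU(action):
    return sum(P_weather[w] * utility[(action, w)] for w in P_weather)

best = max(["IN", "OUT"], key=EU)
print({a: round(EU(a), 4) for a in ["IN", "OUT"]}, "->", best)
# {'IN': 0.6521, 'OUT': 0.6055} -> IN
```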
Multi-attribute utilities
• Many aspects of an outcome combine to determine our preferences:– vacation planning: cost, flying time, beach quality, food quality, etc.
• Medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, etc.
• For rational decision making, must combine all relevant factors into single utility function.
U(a, b, c, …) = f[ f1(a), f2(b), … ]

where f is a simple function such as addition.
• f = + (an additive utility) can be used under mutual preference independence, which occurs when it is always preferable to increase the value of an attribute given that all other attributes are fixed.
Decision graphs / Influence diagrams
Chance nodes: earthquake, burglary, alarm, call, newscast, goods recovered, miss meeting
Action node: go home?
Utility node: Utility
(Example network: flood_decision.net)
Optimal policy
[Influence diagram as above.]

Choose the action given the evidence: MEU( go home? | call )
Call? EU( Go home ) EU( Stay )
Yes ? ?
No ? ?
Optimal policy
[Influence diagram as above.]

Choose the action given the evidence: MEU( go home? | call )

EU( GoHome = Yes | Call = Yes )
  = Σ_{G,M} U(G, M) P(G, M | Call = Yes, GoHome = Yes)
  = Σ_{G,M} U(G, M) P(M | GoHome = Yes) P(G | Call = Yes, GoHome = Yes)

where G = goods recovered and M = miss meeting.
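A sketch of how this sum can be evaluated by enumerating the chance nodes. The structure follows the diagram, but the probability and utility numbers below are made-up placeholders, not the values used in the lecture.

```python
# Enumerate G (goods recovered) and M (miss meeting) to get EU(GoHome | Call = Yes).
# All numbers here are hypothetical placeholders.
P_M_given_gohome = {("Yes", True): 0.8, ("Yes", False): 0.2,          # P(M | GoHome)
                    ("No",  True): 0.1, ("No",  False): 0.9}
P_G_given_call_gohome = {("Yes", "Yes", True): 0.5, ("Yes", "Yes", False): 0.5,   # P(G | Call, GoHome)
                         ("Yes", "No",  True): 0.2, ("Yes", "No",  False): 0.8}
U = {(True, True): 40, (True, False): 100, (False, True): 0, (False, False): 60}  # U(G, M)

def EU(go_home, call="Yes"):
    return sum(U[(g, m)]
               * P_M_given_gohome[(go_home, m)]
               * P_G_given_call_gohome[(call, go_home, g)]
               for g in (True, False) for m in (True, False))

print(EU("Yes"), EU("No"))   # compare the two actions given Call = Yes
```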
Optimal policy
[Influence diagram as above.]

Choose the action given the evidence: MEU( go home? | call )
Call?   EU( Go home )   EU( Stay )   MEU( Call )
Yes     37              13           37
No      53              83           83

A*(Call = Yes) = Go Home
A*(Call = No)  = Stay
Value of information
• What is it worth to get another piece of information?
• What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?
• Additional information (if free) cannot make you worse off.
• There is no value-of-information if you will not change your decision.
Optimal policy with additional evidence
[Influence diagram as above.]

How much better can we do if we have evidence about newscast?
( Should we ask for evidence about newscast? )
Optimal policy with additional evidence
[Influence diagram as above.]

Call?   Newscast   EU( Go home ) / EU( Stay )
Yes     Quake      44 / 45
Yes     NoQuake    35 / 6
No      Quake      51 / 80
No      NoQuake    52 / 84

Optimal decision:

Call?   Newscast   Go home?
Yes     Quake      No
Yes     NoQuake    Yes
No      Quake      No
No      NoQuake    No
Value of perfect information
• The general case: We assume that exact evidence can be obtained about the value of some random variable Ej.
• The agent's current knowledge is E.
• The value of the current best action a is defined by:

EU( a | E ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A )

• With the new evidence Ej, the value of the new best action a_Ej will be

EU( a_Ej | E, Ej ) = max_A Σ_i U( Result_i(A) ) P( Result_i(A) | E, A, Ej )
VPI (cont’d)
• However, we do not have this new evidence in hand. Hence, we can only say what we expect the expected utility of Ej to be:
Σ_e P( Ej = e | E ) EU( a_e | E, Ej = e )

• The value of perfect information about Ej is then

VPI_E( Ej ) = [ Σ_e P( Ej = e | E ) EU( a_e | E, Ej = e ) ] − EU( a | E )
Properties of VPI
1. Positive:
   ∀ E, E1: VPI_E( E1 ) ≥ 0
2. Non-additive ( in general ):
   VPI_E( E1, E2 ) ≠ VPI_E( E1 ) + VPI_E( E2 )
3. Order-invariant:
   VPI_E( E1, E2 ) = VPI_E( E1 ) + VPI_{E,E1}( E2 ) = VPI_E( E2 ) + VPI_{E,E2}( E1 )
Example
• What is the value of information Newscast?
MEU( Call = Yes, Newscast = Quake )
  = max_{GoHome ∈ {Yes, No}} Σ_{G,M} U(G, M) P(G, M | Call = Yes, Newscast = Quake, GoHome)

MEU( Call = Yes, Newscast = NoQuake )
  = max_{GoHome ∈ {Yes, No}} Σ_{G,M} U(G, M) P(G, M | Call = Yes, Newscast = NoQuake, GoHome)

VPI( Newscast )
  = MEU( Call = Yes, Newscast = Quake ) P( Newscast = Quake | Call = Yes )
  + MEU( Call = Yes, Newscast = NoQuake ) P( Newscast = NoQuake | Call = Yes )
  − MEU( Call = Yes )
Example (cont’d)

Call?   MEU( Call )
Yes     36.74
No      83.23

Call?   Newscast   MEU( Call, Newscast )
Yes     Quake      45.20
Yes     NoQuake    35.16
No      Quake      80.89
No      NoQuake    83.39

Call?   P( Newscast = Quake | Call )   P( Newscast = NoQuake | Call )
Yes     .1794                           .8206
No      .0453                           .9547

VPI( Newscast | Call = Yes ) = 45.20 * .1794 + 35.16 * .8206 − 36.74 = 36.96 − 36.74 = 0.22
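A small sketch reproducing this VPI calculation from the numbers in the tables above:

```python
# VPI of Newscast given Call = Yes, using the MEU values and newscast probabilities above.
meu_with_newscast = {"Quake": 45.20, "NoQuake": 35.16}   # MEU(Call=Yes, Newscast=e)
p_newscast = {"Quake": 0.1794, "NoQuake": 0.8206}        # P(Newscast=e | Call=Yes)
meu_without = 36.74                                      # MEU(Call=Yes)

vpi = sum(p_newscast[e] * meu_with_newscast[e] for e in p_newscast) - meu_without
print(round(vpi, 2))  # ~0.22
```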
Sequential Decisions
• So far, decisions in static situations. But most situations are dynamic!
– If I don’t attend CS440 today, will I be kicked out of the class?
– If I don’t attend CS440 today, will I be better off in the future?
[Diagram: a one-step decision network with a State node S, an Action node A, an Evidence node E, and a Utility node, annotated with P(S | A) and U(S).]

A* = argmax_A EU(A) = argmax_A Σ_S U(S) P( S | A, E )
A: Attend / Do not attend
S: Professor hates me / Professor does not care
E: Professor looks upset / not upset
U: Probability of being expelled from class
Sequential decisions
• Extend static structure over time – just like an HMM, with decisions and utilities.
• One small caveat: a different representation is slightly better…

[Diagram: a dynamic decision network unrolled over time, with states S0, S1, S2, …, actions A0, A1, A2, …, utilities U0, U1, U2, …, and evidence E0, E1, E2, …]
Partially-observed Markov decision processes (POMDP)

• Actions at time t should impact the state at t+1
• Use rewards (R) instead of utilities (U)
• Actions directly determine rewards

[Diagram: the network unrolled over time, with states St, actions At, rewards Rt, and observations Et, annotated with P(St | St−1), P(St | At−1), P(Et | St), and R(St).]
POMDP Problems
• Objective:
Find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total/future “reward”.
E.g.: Find a sequence of actions that takes a car from point A to point B while minimizing time and consumed fuel.
Example (POMDPs)
• Optimal dialog modeling: e.g., an automated airline reservation system
  – Actions:
    • System prompts: “How may I help you?”, “Please specify your favorite airline”, “Where are you leaving from?”, “Do you mind leaving at a different time?”, …
  – States:
    • (Origin, Destination, Airline, Flight#, Departure, Arrival, …)
“A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies” Levin, Pieraccini, and Eckert, IEEE TSAP, 2000
Example #2 (POMDP)
• Optimal control
  – States/actions are continuous
  – Objective: design optimal control laws for guiding objects from a start position to a goal position (Lunar lander)
  – Actions:
    • Engine thrust, robot arm torques, …
  – States:
    • Positions, velocities of objects/robotic arm joints, …
  – Reward:
    • Usually specified in terms of cost (reward⁻¹): cost of fuel, battery charge loss, energy loss, …
Markov decision processes (MDPs)

• Let’s make life a bit simpler – assume we know the state of the world exactly

[Diagram: the same network without observation nodes: states St, actions At, rewards Rt, annotated with P(St | St−1), P(St | At−1), and R(St).]
Examples
• Blackjack game
  – Objective: have your card sum be greater than the dealer’s without exceeding 21.
  – States (200 of them; enumerated in the sketch below):
    • Current sum (12–21)
    • Dealer’s showing card (ace–10)
    • Do I have a usable ace?
  – Reward: +1 for winning, 0 for a draw, −1 for losing
  – Actions: stick (stop receiving cards), hit (receive another card)
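A quick sketch (not from the lecture) enumerating this state space to confirm the count of 200:

```python
# Enumerate the blackjack states described above: current sum 12..21,
# dealer's showing card ace..10, and whether the player holds a usable ace.
from itertools import product

states = list(product(range(12, 22),        # current sum: 12-21
                      range(1, 11),         # dealer's showing card: ace (1) to 10
                      (False, True)))       # usable ace?
print(len(states))  # 200
```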
MDP Fundamentals
• We mentioned (in POMDPs) that the goal is to select actions that maximize “reward”
• What “reward”?
  – Immediate?   a_t* = arg max E[ R(s_{t+1}) ]
  – Cumulative?  a_t* = arg max E[ R(s_{t+1}) + R(s_{t+2}) + R(s_{t+3}) + … ]
  – Discounted?  a_t* = arg max E[ R(s_{t+1}) + γ R(s_{t+2}) + γ² R(s_{t+3}) + … ]   (a small sketch follows below)
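A one-line sketch of the discounted return for a given reward sequence (γ = 0.9 is just an example value, not a number from the lecture):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * rewards[k] over a sequence of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1]))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```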
Utility & utility maximization in MDPs
• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.
[Diagram: the MDP unrolled from time t onward: states s_t, s_{t+1}, s_{t+2}, …, actions a_t, a_{t+1}, …, rewards R_t, R_{t+1}, …]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …
Utility & utility maximization in MDPs
• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.
• Convert it into a “simplified” model by compounding states, actions, rewards
[Diagram: the compounded model: current state s_t, a compound action A = (a_t, a_{t+1}, …), a compound state S, and utility U.]

U = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + …

A* = argmax_A EU( A | s_t )   (maximum expected utility)
Bellman equations & value iteration algorithm
• Do we need to search over all |a|^T possible action sequences A, with horizon T → ∞? No!

Best immediate action:
a_t* = argmax_{a_t} Σ_{s_{t+1}} U(s_{t+1}) P(s_{t+1} | s_t, a_t)

Bellman update:
U(s_t) = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) U(s_{t+1})
Proof of Bellman update equation
U(s_t) = max_{a_t, a_{t+1}, …} E[ R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]
       = R(s_t) + max_{a_t, a_{t+1}, …} E[ γ R(s_{t+1}) + γ² R(s_{t+2}) + … ]
       = R(s_t) + γ max_{a_t, a_{t+1}, …} Σ_{s_{t+1}, s_{t+2}, …} P(s_{t+1}, s_{t+2}, … | s_t, a_t, a_{t+1}, …) [ R(s_{t+1}) + γ R(s_{t+2}) + … ]
       = R(s_t) + γ max_{a_t, a_{t+1}, …} Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) Σ_{s_{t+2}, …} P(s_{t+2}, … | s_{t+1}, a_{t+1}, …) [ R(s_{t+1}) + γ R(s_{t+2}) + … ]
       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}, …} E[ R(s_{t+1}) + γ R(s_{t+2}) + … ]
       = R(s_t) + γ max_{a_t} Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) U(s_{t+1})
Example
• “If I don’t attend CS440 today, will I be better off in the future?”
• Actions: Attend (A) / Do not attend (NA)
• States: Learned topic (L) / Did not learn topic (NL)
• Reward: R(L) = +1, R(NL) = −1
• Discount factor: γ = 0.9
• Transition probabilities P( next state | current state, action ):

                              Attend (A)            Do not attend (NA)
                              from L    from NL     from L    from NL
  next = Learned (L)          0.9       0.5         0.6       0.2
  next = Did not learn (NL)   0.1       0.5         0.4       0.8

Bellman equations:

U(L)  = 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
U(NL) = −1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

In vector form:

[ U(L), U(NL) ]ᵀ = [ R(L), R(NL) ]ᵀ + γ max{ P(A) [ U(L), U(NL) ]ᵀ, P(NA) [ U(L), U(NL) ]ᵀ }

where P(A) and P(NA) are the transition matrices under each action and the max is taken component-wise.
Computing MDP state utilities: value iteration

• How can one solve for U(L) and U(NL) in the previous example?
• Answer: the value-iteration algorithm. Start with some initial utilities U(L), U(NL), then iterate the Bellman update:

[ U(L), U(NL) ]ᵀ ← [ R(L), R(NL) ]ᵀ + γ max{ P(A) [ U(L), U(NL) ]ᵀ, P(NA) [ U(L), U(NL) ]ᵀ }
[Plot: value iteration over 50 iterations; U(L) and U(NL) rise from their initial values and converge to roughly 7 and 4.]
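A compact sketch of value iteration for this two-state example, using the rewards, transition probabilities, and γ = 0.9 from the previous slide:

```python
# Value iteration for the attend / don't-attend MDP (numbers from the slides).
R = {"L": 1.0, "NL": -1.0}
gamma = 0.9
# P[action][current_state] = {next_state: probability}
P = {"A":  {"L": {"L": 0.9, "NL": 0.1}, "NL": {"L": 0.5, "NL": 0.5}},
     "NA": {"L": {"L": 0.6, "NL": 0.4}, "NL": {"L": 0.2, "NL": 0.8}}}

U = {"L": 0.0, "NL": 0.0}                      # arbitrary initial utilities
for _ in range(50):                            # 50 Bellman updates, as in the plot
    U = {s: R[s] + gamma * max(sum(P[a][s][s2] * U[s2] for s2 in U) for a in P)
         for s in U}

print({s: round(u, 4) for s, u in U.items()})
# roughly U(L) = 7.15 and U(NL) = 4.03, close to the values used on the next slide
```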
Optimal policy

• Given the utilities from value iteration, find the optimal policy:

[ A*(L), A*(NL) ]ᵀ = argmax{ P(A) [ U(L), U(NL) ]ᵀ, P(NA) [ U(L), U(NL) ]ᵀ }   (component-wise argmax over actions)

A*(L)  = arg max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
       = arg max{ 0.9*7.1574 + 0.1*4.0324, 0.6*7.1574 + 0.4*4.0324 }
       = arg max{ 6.8449, 5.9074 } = Attend

A*(NL) = arg max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }
       = arg max{ 5.5949, 4.6574 } = Attend
LA