Utilities and MDPs: A Lesson in Multiagent Systems
Based on Jose Vidal’s book *Fundamentals of Multiagent Systems*
Henry Hexmoor, SIUC
Utility
• Preferences are recorded as a utility function u_i : S → ℝ, where S is the set of observable states of the world, u_i is agent i’s utility function, and ℝ is the set of real numbers.
• Utilities impose an ordering on the states of the world.
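As a minimal sketch (the states and numbers here are invented for illustration), a utility function over a finite state set can be stored as a plain mapping, and it induces an ordering of the states:

```python
# Hypothetical utility function u_i : S -> R over a small finite state set.
u_i = {"rainy": 1.0, "cloudy": 3.5, "sunny": 7.0}

# The utility function orders the states of the world.
ordered_states = sorted(u_i, key=u_i.get)
print(ordered_states)  # ['rainy', 'cloudy', 'sunny']
```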
![Page 3: Henry Hexmoor SIUC](https://reader035.vdocuments.us/reader035/viewer/2022062520/568162c0550346895dd350d1/html5/thumbnails/3.jpg)
Properties of Utilities
• Reflexive: u_i(s) ≥ u_i(s).
• Transitive: if u_i(a) ≥ u_i(b) and u_i(b) ≥ u_i(c), then u_i(a) ≥ u_i(c).
• Comparable: for any a, b, either u_i(a) ≥ u_i(b) or u_i(b) ≥ u_i(a).
Selfish agents:
• A rational agent is one that wants to maximize its utility, but intends no harm.
[Figure: nested sets of agents — agents, rational agents, selfish agents, and rational non-selfish agents]
Utility is not money:
• While utility represents an agent’s preferences, it is not necessarily equated with money. In fact, the utility of money has been found to be roughly logarithmic.
Marginal Utility
• Marginal utility is the utility gained from the next event.
Example: getting an A is worth less to a straight-A student than to a B student.
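As a small worked sketch (the numbers are made up), marginal utility is just the difference between the utility after the event and the utility before it:

```python
# Hypothetical utilities (invented numbers) before and after earning an A.
u_b_student = {"before": 4.0, "after": 9.0}   # a B student gains a lot
u_a_student = {"before": 8.5, "after": 9.0}   # an A student gains little

def marginal_utility(u):
    return u["after"] - u["before"]

print(marginal_utility(u_b_student))  # 5.0
print(marginal_utility(u_a_student))  # 0.5
```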
Transition function
The transition function is written T(s, a, s′).
T(s, a, s′) is the probability of reaching state s′ from state s by taking action a.
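As a sketch (the state and action names are invented), T can be stored as a map from a (state, action) pair to a probability distribution over successor states:

```python
# Hypothetical transition function T(s, a, s'): for each (s, a),
# a probability distribution over successor states s'.
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# Each conditional distribution must sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T.values())
```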
Expected Utility
• The expected utility of taking action a in state s is the sum, over successor states s′, of the probability of reaching s′ from s with action a times the utility of that state:

E[u_i, s, a] = Σ_{s′∈S} T(s, a, s′) u_i(s′)

where S is the set of all possible states.
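Continuing the sketch above (the tables are invented), the expected utility is a weighted sum:

```python
def expected_utility(T, u, s, a):
    """E[u, s, a] = sum over s' of T(s, a, s') * u(s')."""
    return sum(p * u[s2] for s2, p in T[(s, a)].items())

# Example, reusing the layout of the hypothetical T from the earlier sketch:
T = {("s1", "a1"): {"s1": 0.2, "s2": 0.8}}
u = {"s1": 0.0, "s2": 1.0}
print(expected_utility(T, u, "s1", "a1"))  # 0.2*0.0 + 0.8*1.0 = 0.8
```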
Value of Information
• The value of the information that the current state is t and not s is the gain in expected utility from acting on the new information:

E_i = E[u_i, t, π(t)] − E[u_i, t, π(s)]

where E[u_i, t, π(t)] is the expected utility when acting on the updated, new information, and E[u_i, t, π(s)] is the expected utility when still acting on the old value.
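A minimal sketch, assuming a policy table pi mapping states to actions (all names invented):

```python
def expected_utility(T, u, s, a):
    return sum(p * u[s2] for s2, p in T[(s, a)].items())

def value_of_information(T, u, pi, t, s):
    """Gain from learning the true state is t rather than s:
    E[u, t, pi(t)] - E[u, t, pi(s)]."""
    return (expected_utility(T, u, t, pi[t])
            - expected_utility(T, u, t, pi[s]))
```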
Markov Decision Processes: MDP
[Figure: graphical representation of a sample Markov decision process, with values for the transition and reward functions; the start state is s1.]
Reward Function: r(s)
• The reward function maps each state to a real-valued reward, r : S → ℝ.
Deterministic vs. Nondeterministic
• Deterministic world: actions have predictable effects. Example: for each state and action, exactly one successor state has T = 1 and all others have T = 0.
• Nondeterministic world: an action may lead to several successor states, each with some probability.
Policy:
• A policy is an agent’s behavior: a mapping from states to actions.
• A policy is written π : S → A.
Optimal Policy
• An optimal policy is a policy that maximizes expected utility.
• The optimal policy π* is written

π*(s) = argmax_{a∈A} E[u_i, s, a]
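A sketch of the argmax (it assumes T defines every (s, a) pair, and reuses the hypothetical expected_utility helper):

```python
def expected_utility(T, u, s, a):
    return sum(p * u[s2] for s2, p in T[(s, a)].items())

def optimal_action(T, u, s, actions):
    """pi*(s) = argmax over a in A of E[u, s, a]."""
    return max(actions, key=lambda a: expected_utility(T, u, s, a))
```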
Discounted Rewards:
• Discounted rewards smoothly reduce the impact of rewards that are farther off in the future:

r(s_0) + γ r(s_1) + γ² r(s_2) + γ³ r(s_3) + …

where γ (0 ≤ γ ≤ 1) is the discount factor.
• Given the utilities, the optimal policy is

π*(s) = argmax_a Σ_{s′} T(s, a, s′) u(s′)
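A sketch of the discounted sum (the reward sequence and γ are invented):

```python
def discounted_return(rewards, gamma):
    """r(s0) + gamma*r(s1) + gamma^2*r(s2) + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```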
Bellman Equation
u(s) = r(s) + γ max_a Σ_{s′} T(s, a, s′) u(s′)

where r(s) is the immediate reward and the second term is the expected future, discounted reward.
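A one-step Bellman backup as a sketch, in the same invented table layout as the earlier examples:

```python
def bellman_backup(T, r, u, s, actions, gamma):
    """u(s) = r(s) + gamma * max over a of sum of T(s, a, s') * u(s')."""
    return r[s] + gamma * max(
        sum(p * u[s2] for s2, p in T[(s, a)].items()) for a in actions
    )
```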
Brute Force Solution
• Write n Bellman equations, one for each of the n states, and solve the system.
• The system is nonlinear because of the max operator, so it cannot be solved with standard linear-equation techniques.
Value Iteration Solution
• Set the values u(s) to random numbers.
• Repeatedly apply the Bellman update equation:

u_{t+1}(s) = r(s) + γ max_a Σ_{s′} T(s, a, s′) u_t(s′)

• The values converge; stop when

‖u_{t+1} − u_t‖ < ε(1 − γ)/γ

where ‖u_{t+1} − u_t‖ is the maximum utility change in one sweep.
Value Iteration Algorithm
VALUE-ITERATION(T, r, γ, ε)
  repeat
    u ← u′
    δ ← 0
    for s ∈ S
      do u′(s) ← r(s) + γ max_a Σ_{s′} T(s, a, s′) u(s′)
        if |u′(s) − u(s)| > δ
          then δ ← |u′(s) − u(s)|
  until δ < ε(1 − γ)/γ
  return u
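A runnable sketch of the same algorithm in Python (the MDP tables are invented for illustration; only T, r, γ, and ε mirror the pseudocode’s inputs):

```python
def value_iteration(T, r, gamma, epsilon):
    """Value iteration: repeat Bellman updates until the largest
    per-state change delta falls below epsilon * (1 - gamma) / gamma."""
    states = list(r)
    actions = {a for (_, a) in T}
    u = {s: 0.0 for s in states}          # arbitrary initial values
    while True:
        u_new, delta = {}, 0.0
        for s in states:
            u_new[s] = r[s] + gamma * max(
                sum(p * u[s2] for s2, p in T[(s, a)].items())
                for a in actions if (s, a) in T
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        u = u_new
        if delta < epsilon * (1 - gamma) / gamma:
            return u

# Example with a tiny hypothetical MDP:
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}
r = {"s1": 0.0, "s2": 1.0}
print(value_iteration(T, r, gamma=0.5, epsilon=0.01))
```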
For the example’s values of γ and ε, the algorithm stops after t = 4.
MDP for one agent
• Naive multiagent view: one agent changes its action while the others stay stationary, keeping the single-agent transition function T(s, a, s′).
• A better approach: replace the single action with a vector a = (a_1, …, a_n) giving each agent’s action, where n is the number of agents; the transition function becomes T(s, a, s′) over joint actions.
• Rewards:
  – dole out equally among the agents, or
  – reward each agent in proportion to its contribution.
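A sketch of the joint-action idea: the action key becomes a tuple with one entry per agent (all names invented):

```python
# Transition function over joint actions: the action is a tuple,
# one entry per agent.
T_joint = {
    ("s1", ("a1", "a2")): {"s2": 1.0},  # agent 0 plays a1, agent 1 plays a2
    ("s1", ("a2", "a2")): {"s1": 1.0},
}

def split_reward_equally(total_reward, n_agents):
    """Dole the reward out equally among the agents."""
    return [total_reward / n_agents] * n_agents
```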
Observation model
• Because of noise, the agent cannot observe the world directly.
• Belief state: a probability distribution over states,

b = ⟨p_1, p_2, p_3, …, p_n⟩

• Observation model: O(s, o) is the probability of observing o while being in state s.
• After taking action a and observing o, the belief is updated by

b′(s′) = α O(s′, o) Σ_s T(s, a, s′) b(s)

where α is a normalization constant.
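A sketch of the belief update, using the same invented table conventions (O maps (state, observation) pairs to probabilities):

```python
def belief_update(b, a, o, T, O):
    """b'(s') = alpha * O(s', o) * sum over s of T(s, a, s') * b(s)."""
    new_b = {
        s2: O[(s2, o)] * sum(T[(s, a)].get(s2, 0.0) * p for s, p in b.items())
        for s2 in b
    }
    alpha = 1.0 / sum(new_b.values())  # normalization constant
    return {s2: alpha * p for s2, p in new_b.items()}
```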
Partially observable MDP
• A POMDP can be recast as an MDP over belief states. The belief transition function is

T(b, a, b′) = Σ_{s′} O(s′, o) Σ_s T(s, a, s′) b(s)   if b′ is the belief obtained from b, a, and o by the update rule above, and 0 otherwise.

• The new reward function is the reward expected under the belief:

ρ(b) = Σ_s b(s) r(s)

• Solving a POMDP is hard.
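A sketch of the belief-state reward (same invented conventions):

```python
def belief_reward(b, r):
    """rho(b) = sum over s of b(s) * r(s): the expected immediate
    reward under the belief b."""
    return sum(p * r[s] for s, p in b.items())
```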