machine learning lecture outline
Multi-Agent Systems, Lecture 10
University “Politehnica” of Bucharest, 2005-2006
Adina Magda Florea, [email protected]
http://turing.cs.pub.ro/blia_06
Machine Learning: lecture outline
Machine Learning
Lecture outline
1 Learning in AI (machine learning)
2 Reinforcement learning
3 Learning in multi-agent systems
  3.1 Learning action coordination
  3.2 Learning individual performance
  3.3 Learning to communicate
  3.4 Layered learning
5 Conclusions
1 Learning in AI
What is machine learning?
Herbert Simon defines learning as: "any change in a system that allows it to perform better the second time on repetition of the same task or another task drawn from the same population" (Simon, 1983).
In ML the agent learns:
- knowledge representation of the problem domain
- problem solving rules, inferences
- problem solving strategies
Classifying learning
In MAS learning the agents should learn:
- what an agent learns in ML, but in the context of MAS (both cooperative and self-interested agents)
- how to cooperate for problem solving (cooperative agents)
- how to communicate (both cooperative and self-interested agents)
- how to negotiate (self-interested agents)
Different dimensions:
- explicitly represented domain knowledge
- how the critic component (performance evaluation) of a learning agent works
- the use of knowledge of the domain/environment
Single agent learning
[Figure: single agent learning. Components: Learning Process; Problem Solving (K & B, Inferences, Strategy); Performance Evaluation; Environment; Teacher. Flows between them: Data, Feed-back, Learning results, Results.]
NB: Both in this diagram and the next, not all components or flow arrows are always present - it depends on the type of agent (cognitive, reactive), type of learning, etc.
Self-interested learning agent
[Figure: self-interested learning agent. Components: Learning Process; Problem Solving (K & B about Self and Other agents, Inferences, Strategy); Performance Evaluation; Environment; other Agents. Flows: Data, Feed-back, Learning results, Results, Communication, Actions.]
Cooperative learning agents
[Figure: cooperative learning agents. Two agents, each with a Learning Process and Problem Solving (K & B about Self and Other agents, Inferences, Strategy), together with a Performance Evaluation component and the Environment. Flows: Data, Feed-back, Learning results, Results, Communication, Actions.]
2 Reinforcement learning
- Combines dynamic programming and AI machine learning techniques
- Trial-and-error interactions with a dynamic environment
- The feedback of the environment: reward or reinforcement
Two main approaches:
- search in the space of behaviors (genetic algorithms)
- learn utility based on statistical techniques and dynamic programming methods
2.1 A reinforcement-learning model
- B: the agent's behavior
- i: input = current state of the environment
- r: value of reinforcement (reinforcement signal)
- T: model of the world
The model consists of:
- a discrete set of environment states $S$ ($s \in S$)
- a discrete set of agent actions $A$ ($a \in A$)
- a set of scalar reinforcement signals, typically {0, 1} or real numbers
- the transition model of the world, $T$
The environment is nondeterministic: $T : S \times A \to P(S)$, where $T(s, a, s')$ is the transition model.
Environment history = a sequence of states that leads to a terminal state
[Figure: agent-environment interaction loop, with components E, T, I, R, B and signals s, i, r, a.]
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1, 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04.
[Figure: 4 x 3 grid (columns 1-4, rows 1-3) with terminal states +1 and -1; an action moves in the intended direction with probability 0.8 and to either side with probability 0.1 each.]
The sequence Up, Up, Right, Right, Right reaches (4,3) along the intended path with probability $0.8^5 = 0.32768$.
2.2 Features varying RL
- accessible / inaccessible environment
- has a model of the environment (T known) / has no model
- learn behavior / learn behavior + model
- reward received only in terminal states or in any state
- passive / active learner:
  - passive learner: learns the utilities of states
  - active learner: also learns what to do
- how the agent represents B, namely its behavior:
  - utility functions on states or state histories (T is known)
  - action-value functions (T is not necessarily known): assign an expected utility to taking a given action in a given state
Agents: state and goals
- Goals: $goal : E \to \{0, 1\}$
- Utilities: $utility : E \to \mathbb{R}$
- Environment: $env : E \times A \to P(E)$
Expected utility of an action a in a state e:
$$U(a, e) = \sum_{e' \in env(e, a)} prob(ex(a, e) = e') \cdot utility(e')$$
Maximum Expected Utility (MEU)
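A minimal sketch of the expected-utility formula and of choosing an action by Maximum Expected Utility. The state and action names, the `outcomes` table and the `utility` values are hypothetical and serve only as an illustration.

```python
def expected_utility(action, state, outcomes, utility):
    """U(a, e): sum over outcomes e' of prob(ex(a, e) = e') * utility(e')."""
    return sum(p * utility[e2] for e2, p in outcomes[(state, action)].items())


def meu_action(state, actions, outcomes, utility):
    """Pick the action with the maximum expected utility in the given state."""
    return max(actions, key=lambda a: expected_utility(a, state, outcomes, utility))


# Hypothetical example: env(e0, a1) = {e1, e2}, env(e0, a2) = {e3}.
outcomes = {
    ("e0", "a1"): {"e1": 0.8, "e2": 0.2},
    ("e0", "a2"): {"e3": 1.0},
}
utility = {"e1": 1.0, "e2": 0.0, "e3": 0.5}

best = meu_action("e0", ["a1", "a2"], outcomes, utility)
```

Here `a1` has expected utility 0.8 and `a2` has 0.5, so the MEU choice in `e0` is `a1`.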
2.3 The RL problem
- The agent has to find a policy = a function which maps states to actions and which maximizes some long-run measure of reinforcement.
- The agent has to learn an optimal behavior = optimal policy = a policy which yields the highest expected utility, $\pi^*$
- The utility function depends on the environment history (a sequence of states)
- In each state s the agent receives a reward, $R(s)$
- $U_h([s_0, s_1, \ldots, s_n])$: utility function on histories
Models of behavior
- Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps, $E\left(\sum_{t=0}^{h} R(s_t)\right)$, where $r_t = R(s_t)$ represents the reward received t steps into the future.
- Infinite-horizon model: optimize the long-run reward, $E\left(\sum_{t=0}^{\infty} R(s_t)\right)$
- Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor $\gamma$: $E\left(\sum_{t=0}^{\infty} \gamma^t R(s_t)\right)$, $0 \le \gamma < 1$.
$\gamma$ can be interpreted in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound an infinite sum.
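The three models differ only in the horizon and in the discount factor. A small sketch for one observed reward sequence (the function name and the sample rewards are illustrative assumptions):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R(s_t) over an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Finite horizon (h = 4, no discounting): five rewards of 20.
finite = discounted_return([20] * 5, 1.0)

# Discounted model: a constant reward of 20 with gamma = 0.9 stays bounded
# by 20 / (1 - 0.9) = 200, however long the sequence gets.
long_run = discounted_return([20] * 1000, 0.9)
```

With $\gamma < 1$ the sum stays finite even for very long sequences, which is the "mathematical trick" mentioned above.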
2.4 Markov systems
Discounted rewards
- An AP gets paid 20/year: 20 + 20 + 20 + ...
- Discounted sum: (reward now) + $\gamma$ (reward at time 1) + $\gamma^2$ (reward at time 2) + ...
A Markov system with rewards:
- a set of states $(S_1, S_2, \ldots, S_n)$
- a transition probability matrix $P_{ij} = Prob(Next = S_j \mid This = S_i)$
- each state has a reward $r_1, r_2, \ldots, r_n$
- a discount factor $\gamma$ in [0, 1]
On each time step:
- assume the state is $S_i$
- get reward $r_i$
- randomly move to another state $S_j$ with probability $P_{ij}$
- all future rewards are discounted by $\gamma$
$U^*(S_i)$ = expected discounted sum of future rewards starting in state $S_i$
$$U^*(S_i) = r_i + \gamma \left(P_{i1} U^*(S_1) + P_{i2} U^*(S_2) + \ldots + P_{in} U^*(S_n)\right), \quad i = 1, \ldots, n$$
Solving these equations gives an exact answer, but with 100 000 states this means solving a 100 000 by 100 000 system of equations.
Value iteration to solve a Markov system:
- $U_1(S_i) = r_i$
- $U_2(S_i) = r_i + \gamma \sum_{j=1}^{N} P_{ij} U_1(S_j)$
- Compute $U_1(S_i)$ for each state
- Compute $U_2(S_i)$ for each state, etc.
- Stop when $|U_{k+1}(S_i) - U_k(S_i)| < \epsilon$
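The iteration above can be sketched as follows. The two-state chain used as an example is hypothetical, and the stopping test takes the largest change over all states, which is one reasonable reading of the stopping rule.

```python
def markov_value_iteration(P, r, gamma, eps=1e-9, max_iters=100_000):
    """Iterate U_{k+1}(S_i) = r_i + gamma * sum_j P[i][j] * U_k(S_j)."""
    n = len(r)
    U = list(r)  # U_1(S_i) = r_i
    for _ in range(max_iters):
        U_next = [
            r[i] + gamma * sum(P[i][j] * U[j] for j in range(n))
            for i in range(n)
        ]
        # Stop when no state changed by more than eps.
        if max(abs(U_next[i] - U[i]) for i in range(n)) < eps:
            return U_next
        U = U_next
    return U


# A single state that loops on itself with reward 20 and gamma = 0.9.
salary = markov_value_iteration([[1.0]], [20.0], 0.9)

# Hypothetical two-state chain: S_1 pays 1 and moves to the absorbing,
# reward-free S_2 half of the time.
chain = markov_value_iteration([[0.5, 0.5], [0.0, 1.0]], [1.0, 0.0], 0.9)
```

For the chain, the exact answer from the linear equations is $U^*(S_1) = 1 / (1 - 0.45) \approx 1.818$ and $U^*(S_2) = 0$, which the iteration approaches.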
2.5 Markov Decision Problem (MDP)
An MDP consists of $\langle S, A, T, R \rangle$:
- S: a set of states
- A: a set of actions
- R: reward function, $R : S \times A \to \mathbb{R}$
- T: transition function, $T : S \times A \to \Pi(S)$, with $\Pi(S)$ the probability distribution over the states S
On each time step:
- assume the state is $S_i$
- get reward $R_i$
- choose action a (from $a_1 \ldots a_k$)
- move to another state $S_j$ with the probability given by $T(S_i, a)$
- all future rewards are discounted by $\gamma$
We shall use $T(s, a, s')$:
$P^a_{ss'} = Prob(Next = s' \mid This = s \text{ and I use action } a)$
Markov Decision Problem (MDP)
- The model is Markov if the state transitions are independent of any previous environment states or agent actions.
- MDPs can be finite-state and finite-action (the focus here) or have infinite state and action spaces.
- For every MDP there exists an optimal policy: a policy such that for every possible start state there is no better option than to follow the policy.
- Finding the optimal policy given a model T = calculate the utility of each state U(state) and use the state utilities to select an optimal action in each state.
Value iteration to solve an MDP
- $U_1(s) = R(s)$
- $U_2(s) = \max_a \left(R(s) + \gamma \sum_{s'} T(s,a,s') \, U_1(s')\right)$
- ...
- $U_{k+1}(s) = \max_a \left(R(s) + \gamma \sum_{s'} T(s,a,s') \, U_k(s')\right)$
- Compute $U_1(s_i)$ for each state $s = s_i$
- Compute $U_2(s_i)$ for each state, etc.
- Stop when $\max_i |U_{k+1}(s_i) - U_k(s_i)| < \epsilon$ (convergence)
This is dynamic programming.
For comparison, value iteration for a Markov system (MS): $U_{k+1}(S_i) = r_i + \gamma \sum_{j=1}^{N} P_{ij} U_k(S_j)$
The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') \, U(s')$$
- Bellman equation: $U(s)$ is its unique solution
The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle:
$$\pi^*(s) = \arg\max_a \left(R(s) + \gamma \sum_{s'} T(s,a,s') \, U(s')\right)$$
- $\pi^*$ is the optimal policy
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1, 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04; $\gamma = 1$.
[Figure: the 4 x 3 grid (columns 1-4, rows 1-3) with terminal states +1 and -1, shown next to the same grid annotated with the utilities of the states.]
Utilities of the states (rows 3 to 1 from top to bottom, columns 1 to 4 from left to right; the cell (2,2) holds no value):
row 3:  0.812  0.868  0.918  +1
row 2:  0.762  -      0.660  -1
row 1:  0.705  0.655  0.611  0.388
Bellman equation for the 4x3 world
Equation for the state (1,1):
$$\begin{aligned}
U(1,1) = -0.04 + \gamma \max \{\; & 0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1), && \text{(Up)} \\
& 0.9\,U(1,1) + 0.1\,U(1,2), && \text{(Left)} \\
& 0.9\,U(1,1) + 0.1\,U(2,1), && \text{(Down)} \\
& 0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) \;\} && \text{(Right)}
\end{aligned}$$
Up is the best action.
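A sketch of value iteration on the 4 x 3 world, useful for checking the utilities and the best action in (1,1). The layout is an assumption taken from the standard textbook version of this example: an obstacle at (2,2), terminal state +1 at (4,3) and terminal state -1 at (4,2). All names in the code are illustrative.

```python
COLS, ROWS = 4, 3
WALL = (2, 2)                                # assumed obstacle cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}      # assumed terminal positions
STEP_REWARD = -0.04
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# The two directions at right angles to each intended direction.
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Left": ("Up", "Down"), "Right": ("Up", "Down")}
STATES = [(x, y) for x in range(1, COLS + 1)
          for y in range(1, ROWS + 1) if (x, y) != WALL]


def step(state, direction):
    """Move one cell; bumping into the wall or the border leaves the state unchanged."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transitions(state, action):
    """T(s, a, s') as a list of (probability, next_state) pairs."""
    side_a, side_b = SIDEWAYS[action]
    return [(0.8, step(state, action)),
            (0.1, step(state, side_a)),
            (0.1, step(state, side_b))]


def reward(state):
    return TERMINALS.get(state, STEP_REWARD)


def expected_utility(state, action, U):
    return sum(p * U[s2] for p, s2 in transitions(state, action))


def value_iteration(gamma=1.0, eps=1e-6, max_iters=10_000):
    U = {s: reward(s) for s in STATES}
    for _ in range(max_iters):
        U_next = {}
        for s in STATES:
            if s in TERMINALS:
                U_next[s] = reward(s)        # terminal utility is its reward
            else:
                U_next[s] = reward(s) + gamma * max(
                    expected_utility(s, a, U) for a in MOVES)
        if max(abs(U_next[s] - U[s]) for s in STATES) < eps:
            return U_next
        U = U_next
    return U


def best_action(state, U):
    return max(MOVES, key=lambda a: expected_utility(state, a, U))


U = value_iteration()
```

Under these assumptions the computed utilities agree with the values listed for the grid to about three decimals, and `best_action((1, 1), U)` returns `"Up"`.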
Value Iteration
- Given the maximal expected utility, the optimal policy is
  $\pi^*(s) = \arg\max_a \left(R(s) + \gamma \sum_{s'} T(s,a,s') \, U(s')\right)$,
  which defines the best action in state s
- Compute $U^*(s)$ using an iterative approach: Value Iteration
  - $U_0(s) = R(s)$
  - $U_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') \, U_t(s')$, computed for all s
  - as $t \to \infty$, the utility values converge to the optimal values
Policy iteration
- Manipulate the policy directly, rather than finding it indirectly via the optimal value function
- Choose an arbitrary policy $\pi$ (randomly)
- At each time t, compute the long-run reward starting in s, using $\pi_t$, i.e. solve the equations
  $U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s') \, U_t(s')$
- Improve the policy at each state:
  $\pi_{t+1}(s) \leftarrow \arg\max_a \left(R(s) + \gamma \sum_{s'} T(s,a,s') \, U_t(s')\right)$
- Involves all next states: complex
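A compact sketch of the two alternating steps. Note one substitution: instead of solving the linear equations exactly, the evaluation step below approximates $U_t$ by repeated sweeps. The two-state MDP used as an example is hypothetical.

```python
def evaluate_policy(policy, states, T, R, gamma, eps=1e-10):
    """Approximate U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U(s') by sweeping."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new = R[s] + gamma * sum(
                p * U[s2] for s2, p in T[(s, policy[s])].items())
            delta = max(delta, abs(new - U[s]))
            U[s] = new
        if delta < eps:
            return U


def policy_iteration(states, actions, T, R, gamma):
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    while True:
        U = evaluate_policy(policy, states, T, R, gamma)
        # R(s) does not depend on a here, so it can be left out of the argmax.
        improved = {
            s: max(actions, key=lambda a: sum(
                p * U[s2] for s2, p in T[(s, a)].items()))
            for s in states
        }
        if improved == policy:                    # policy is stable: done
            return policy, U
        policy = improved


# Hypothetical MDP: state 1 pays 1, state 0 pays 0; "stay" keeps the state,
# "go" switches to the other state, both deterministically.
states, actions = [0, 1], ["stay", "go"]
R = {0: 0.0, 1: 1.0}
T = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}

policy, U = policy_iteration(states, actions, T, R, gamma=0.9)
```

In this example the stable policy is to go from state 0 to state 1 and then stay there, with $U(1) = 1 / (1 - 0.9) = 10$ and $U(0) = 0.9 \cdot 10 = 9$.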
2.6 RL learning
- Use observed rewards to learn an optimal (or near-optimal) policy for the environment
- Example: you play 100 moves and you lose
- In an MDP the agent has a complete model of the environment; now the agent has no such model
- Passive learning: the agent's policy is fixed; the task is to learn the utilities of states (or state-action pairs)
- Active learning: the agent must also learn what to do (exploitation/exploration)
(a) Passive reinforcement learning
- The policy is fixed: in state s always execute $\pi(s)$
- Goal: learn how good the policy is = learn U(s)
- The agent does not know T(s,a,s') and does not know R(s) beforehand
ADP (Adaptive Dynamic Programming) learning
- The problem of calculating an optimal policy in an accessible, stochastic environment.
- ADP = plug the learned $T(s, \pi(s), s')$ and the observed rewards R(s) into the Bellman equations to calculate the utility of states
- Learning T is supervised learning: input = state-action pairs, output = resulting state
- Estimate the transition probabilities T(s,a,s') from the frequencies with which s' is reached after executing a in s
- Example: Right is executed 3 times in (1,3), leading 2 times to (2,3) and 1 time to (1,3) => T((1,3), Right, (2,3)) = 2/3
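The frequency estimate of T can be sketched with two counters. The function name and the observation format, a list of (s, a, s') triples, are assumptions made for illustration.

```python
from collections import Counter


def estimate_transitions(observations):
    """Estimate T(s, a, s') as N_sas'[s, a, s'] / N_sa[s, a] from observed triples."""
    n_sa = Counter()
    n_sas = Counter()
    for s, a, s2 in observations:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
    return {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}


# Right executed three times in (1,3): twice reaching (2,3), once staying in (1,3).
observations = [((1, 3), "Right", (2, 3)),
                ((1, 3), "Right", (2, 3)),
                ((1, 3), "Right", (1, 3))]
T = estimate_transitions(observations)
```

Here `T[((1, 3), "Right", (2, 3))]` evaluates to 2/3 and `T[((1, 3), "Right", (1, 3))]` to 1/3.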
ADP (Adaptive Dynamic Programming) learning
function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             mdp, an MDP with model T, rewards R, discount γ
             U, a table of utilities, initially empty
             Nsa, a table of frequencies for state-action pairs, initially zero
             Nsas', a table of frequencies of state-action-state triples, initially zero
             s, a, the previous state and action, initially null
  if s' is new then U[s'] ← r', R[s'] ← r'
  if s is not null then
    increment Nsa[s,a] and Nsas'[s,a,s']
    for each t such that Nsas'[s,a,t] <> 0 do
      T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← Value-Determination(π, U, mdp)    (according to the MDP: value iteration or policy iteration)
  if Terminal[s'] then s, a ← null else s, a ← s', π[s']
  return a
end
Temporal difference learning (TD learning)
- The value function is no longer implemented by solving a set of linear equations, but is computed iteratively.
- Use observed transitions to adjust the values of the observed states so that they agree with the constraint equations:
  $U(s) \leftarrow U(s) + \alpha \left(R(s) + \gamma U(s') - U(s)\right)$
- $\alpha$ is the learning rate.
- Whatever state is visited, its estimated value is updated to be closer to $R(s) + \gamma U(s')$, since R(s) is the instantaneous reward received and U(s') is the estimated value of the actually occurring next state.
- Simpler: involves only the next state
- $\alpha$ decreases as the number of times the state is visited increases
Temporal difference learning
function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             U, a table of utilities, initially empty
             Ns, a table of frequencies for states, initially zero
             s, a, r, the previous state, action, and reward, initially null
  if s' is new then U[s'] ← r'
  if s is not null then
    increment Ns[s]
    U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] - U[s])
  if Terminal[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
end
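A runnable sketch of the passive TD agent, following the pseudocode step by step. The learning-rate schedule $\alpha(n) = 1/n$, the class and method names, and the three-state chain A -> B -> G used to exercise the agent are assumptions made for illustration.

```python
class PassiveTDAgent:
    def __init__(self, policy, terminals, gamma=1.0, alpha=lambda n: 1.0 / n):
        self.policy = policy          # fixed policy: state -> action
        self.terminals = terminals    # set of terminal states
        self.gamma = gamma
        self.alpha = alpha            # learning rate as a function of the visit count
        self.U = {}                   # table of utilities
        self.N = {}                   # visit counts per state
        self.s = self.a = self.r = None

    def step(self, s2, r2):
        """Process the percept (current state s2, reward r2) and return an action."""
        if s2 not in self.U:
            self.U[s2] = r2
        if self.s is not None:
            self.N[self.s] = self.N.get(self.s, 0) + 1
            rate = self.alpha(self.N[self.s])
            self.U[self.s] += rate * (
                self.r + self.gamma * self.U[s2] - self.U[self.s])
        if s2 in self.terminals:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s2, self.policy[s2], r2
        return self.a


# Hypothetical deterministic chain: A -> B -> G, with G terminal.
agent = PassiveTDAgent(policy={"A": "go", "B": "go"}, terminals={"G"})
for _ in range(1000):
    for state, reward in [("A", -0.04), ("B", -0.04), ("G", 1.0)]:
        agent.step(state, reward)
```

With $\gamma = 1$ the true utilities on this chain are U(B) = 0.96 and U(A) = 0.92. The estimate for B is right after one trial, while the estimate for A approaches 0.92 gradually because each update only looks at the observed successor.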
Temporal difference learning
- Does not need a model to perform its updates
- The environment supplies the connections between neighboring states in the form of observed transitions.
ADP and TD comparison
- ADP and TD both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
- TD adjusts a state to agree with the observed successor
- ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities
(b) Active reinforcement learning
- A passive learning agent has a fixed policy that determines its behavior
- An active learning agent must decide what action to take
- The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
- Compute/learn the utilities that obey the Bellman equation
  $U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') \, U(s')$
  using value iteration or policy iteration
  - if value iteration, then look for the action that maximizes utility
  - if policy iteration, you already have the action
- Exploration/exploitation
  - the representative problem is the n-armed bandit problem
Solutions:
- a fraction 1/t of the time choose random actions, the rest of the time follow $\pi$
- give weights to actions that have not been explored, avoid actions with low utilities
- exploration function f(u, n): determines how greedy (preferring high utility values) or not (exploring) the agent is
Q-learning
Active learning of action-value functions
- action-value function = assigns an expected utility to taking a given action in a given state (Q-values)
- Q(a, s): the value of doing action a in state s (expected utility)
- Q-values are related to utility values by the equation $U(s) = \max_a Q(a, s)$
- Approach 1: $Q(a,s) = R(s) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(a',s')$. This requires a model.
- Approach 2: use TD. The agent does not need to learn a model (model-free).
Q-learning: TD learning, unknown environment
$$Q(a,s) \leftarrow Q(a,s) + \alpha \left(R(s) + \gamma \max_{a'} Q(a',s') - Q(a,s)\right)$$
- calculated after each transition from state s to s'
- Is it better to learn a model and a utility function or to learn an action-value function with no model?
Q-learning
function Q-Learning-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: Q, a table of action values indexed by state and action
             Nsa, a table of frequencies for state-action pairs
             s, a, r, the previous state, action, and reward, initially null
  if s is not null then
    increment Nsa[s,a]
    Q[a,s] ← Q[a,s] + α(Nsa[s,a]) (r + γ max_a' Q[a',s'] - Q[a,s])
  if Terminal[s'] then s, a, r ← null
  else s, a, r ← s', argmax_a' f(Q[a',s'], Nsa[s',a']), r'
  return a
end
Without the exploration function f, the last assignment becomes the greedy choice: s, a, r ← s', argmax_a' Q[a',s'], r'
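A runnable sketch of the Q-learning agent. Several details are assumptions that the pseudocode does not fix: the learning rate $\alpha(n) = 1/n$, an exploration function that returns an optimistic value `r_plus` until an action has been tried `n_e` times, storing the reward of a terminal state as its Q-value under the action `None`, and the one-step environment used to exercise the agent.

```python
from collections import defaultdict


class QLearningAgent:
    def __init__(self, actions, terminals, gamma=1.0, r_plus=2.0, n_e=5):
        self.actions = actions
        self.terminals = terminals
        self.gamma = gamma
        self.r_plus = r_plus          # optimistic value for little-tried actions
        self.n_e = n_e                # try each action at least n_e times
        self.Q = defaultdict(float)   # Q[(action, state)]
        self.N = defaultdict(int)     # N[(state, action)]
        self.s = self.a = self.r = None

    def f(self, u, n):
        """Exploration function: optimistic until the action was tried n_e times."""
        return self.r_plus if n < self.n_e else u

    def actions_in(self, state):
        return [None] if state in self.terminals else self.actions

    def step(self, s2, r2):
        """Process the percept (current state s2, reward r2) and return an action."""
        if s2 in self.terminals:
            self.Q[(None, s2)] = r2   # assumed convention for terminal states
        if self.s is not None:
            self.N[(self.s, self.a)] += 1
            rate = 1.0 / self.N[(self.s, self.a)]
            best_next = max(self.Q[(a2, s2)] for a2 in self.actions_in(s2))
            self.Q[(self.a, self.s)] += rate * (
                self.r + self.gamma * best_next - self.Q[(self.a, self.s)])
        if s2 in self.terminals:
            self.s = self.a = self.r = None
        else:
            self.s, self.r = s2, r2
            self.a = max(self.actions, key=lambda a2: self.f(
                self.Q[(a2, s2)], self.N[(s2, a2)]))
        return self.a


# Hypothetical one-step environment: from A, "right" ends in G (+1),
# "left" ends in L (-1); being in A costs 0.04.
OUTCOME = {"left": ("L", -1.0), "right": ("G", 1.0)}


def run_episode(agent):
    action = agent.step("A", -0.04)
    next_state, next_reward = OUTCOME[action]
    agent.step(next_state, next_reward)


agent = QLearningAgent(actions=["left", "right"], terminals={"L", "G"})
for _ in range(50):
    run_episode(agent)
greedy = max(agent.actions, key=lambda a: agent.Q[(a, "A")])
```

The agent first tries each action five times because of the exploration function, then settles on `"right"`, whose Q-value is 0.96 against -1.04 for `"left"`. No transition model is learned at any point.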
Generalization of RL
- The problem of learning in large spaces: a large number of states
- Generalization techniques allow compact storage of learned information and transfer of knowledge between "similar" states and actions:
  - neural nets
  - decision trees
  - U(state) = U(most similar state in memory)
  - U(state) = average U(most similar states in memory)
3 Learning in MAS
- The credit-assignment problem (CAP) = the problem of assigning feed-back (credit or blame) for an overall performance change of the MAS (increase, decrease) to each agent that contributed to that change
- inter-agent CAP = assigns credit or blame to the external actions of agents
- intra-agent CAP = assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
- the distinction between the two is not always obvious
3.1 Learning action coordination
- s: the current environment state
- Agent i determines the set of actions it can do in s: $A_i(s) = \{A_i^j(s)\}$
- Agent i computes the goal relevance of each action: $E_i^j(s)$
- Agent i announces a bid for each action with $E_i^j(s) > threshold$:
  $B_i^j(s) = (\alpha + \beta) \, E_i^j(s)$
  - $\alpha$: risk factor (small)
  - $\beta$: noise term (to prevent convergence to local minima)
- The action with the highest bid is selected
- Incompatible actions are eliminated
- Repeat the process until all actions in bids are either selected or eliminated
- A: the selected actions = the activity context
- Execute the selected actions
- Update the goal relevance for the actions in A:
  $E_i^j(s) \leftarrow E_i^j(s) - B_i^j(s) + R / |A|$
  where R is the external reward received
- Update the goal relevance for the actions in the previous activity context $A_p$ (actions $A_k^l$):
  $E_k^l(s_p) \leftarrow E_k^l(s_p) + \sum_{A_i^j \in A} B_i^j(s) / |A_p|$
3.2 Learning individual performance
The agent learns how to improve its individual performance in a multi-agent setting.
Examples:
- Cooperative agents: learning organizational roles
- Competitive agents: learning from market conditions
3.2.1 Learning organizational roles (Nagendra et al.)
- Agents learn to adopt a specific role in a particular situation (state) in a cooperative MAS.
- Aim = to increase the utility of final states
- Each agent may play several roles in a situation
- The agents learn to select the most appropriate role
- Use reinforcement learning
- Utility, Probability, and Cost (UPC) estimates of a role in a situation
- Utility: the agent's estimate of a final state's worth for a specific role in a situation; world states are mapped to a smaller set of situations
  $S = \{s_0, \ldots, s_f\}$
  $U_{rs} = U(s_f)$, $s_0 \to \ldots \to s_f$
- Probability: the likelihood of reaching a final state for a specific role in a situation
  $P_{rs} = p(s_f)$, $s_0 \to \ldots \to s_f$
- Cost: the computational cost of reaching a final state for a specific role in a situation
- Potential of a role: estimates the usefulness of a role in discovering pertinent global information and constraints (orthogonal to utilities)
Representation:
- $S_k$: vector of situations for agent k, $S_k^1, \ldots, S_k^n$
- $R_k$: vector of roles for agent k, $R_k^1, \ldots, R_k^m$
- $|S_k| \times |R_k| \times 4$ values to describe UPC and Potential
Functioning
Phase I: Learning
Several learning cycles; in each cycle:
- each agent goes from $s_0$ to $s_f$ and selects its role as the one with the highest probability
- Probability of selecting a role r in a situation s:
  $$P(r) = \frac{f(U_{rs}, P_{rs}, C_{rs}, Pot_{rs})}{\sum_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Pot_{js})}$$
- f: objective function used to rate the roles (e.g., f(U,P,C,Pot) = U*P*C + Pot); it depends on the domain
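A small sketch of the role-selection probabilities, using the example objective function f(U,P,C,Pot) = U*P*C + Pot. The role names and the UPC values are hypothetical, and the sketch assumes all ratings are non-negative with a positive sum.

```python
def f(U, P, C, Pot):
    """Example objective function from the text; domain dependent in general."""
    return U * P * C + Pot


def role_probabilities(estimates, objective=f):
    """P(r) = f(role r) / sum over all roles j of f(role j), for one situation."""
    scores = {role: objective(*upc) for role, upc in estimates.items()}
    total = sum(scores.values())
    return {role: score / total for role, score in scores.items()}


# Hypothetical (U, P, C, Pot) estimates of two roles in one situation.
estimates = {
    "role_1": (1.0, 1.0, 2.0, 1.0),   # rated 1*1*2 + 1 = 3
    "role_2": (1.0, 0.5, 1.0, 0.5),   # rated 1*0.5*1 + 0.5 = 1
}
probs = role_probabilities(estimates)
```

Here `role_1` is selected with probability 0.75 and `role_2` with probability 0.25.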
Use reinforcement learning to update UPC and the potential of a role.
For every $s \in [s_0, \ldots, s_f]$ and chosen role r in s:
$$U_{rs}^{i+1} = (1 - \alpha) \, U_{rs}^{i} + \alpha \, U_{s_f}$$
- i: the learning cycle
- $U_{s_f}$: the utility of a final state
- $0 \le \alpha \le 1$: the learning rate
$$P_{rs}^{i+1} = (1 - \alpha) \, P_{rs}^{i} + \alpha \, O(s_f)$$
- $O(s_f)$ = 1 if $s_f$ is successful, 0 otherwise
$$Pot_{rs}^{i+1} = (1 - \alpha) \, Pot_{rs}^{i} + \alpha \, Conf(Path)$$
- Path = $[s_0, \ldots, s_f]$
- Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise
The update rules for cost are domain dependent.
Phase II: Performing
In a situation s the role r is chosen such that:
$$r = \arg\max_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Pot_{js})$$
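The three update rules share one form: blend the old estimate with the newly observed value using the learning rate $\alpha$. A sketch under that reading, with the cost left untouched because its update is domain dependent. The dictionary layout, the function names and the sample values are assumptions.

```python
def blend(old, observed, alpha):
    """(1 - alpha) * old + alpha * observed."""
    return (1 - alpha) * old + alpha * observed


def update_role(est, alpha, final_utility, success, conflict_free):
    """Update the U, P and Pot estimates of the chosen role after one learning cycle."""
    return {
        "U": blend(est["U"], final_utility, alpha),
        "P": blend(est["P"], 1.0 if success else 0.0, alpha),           # O(s_f)
        "Pot": blend(est["Pot"], 1.0 if conflict_free else 0.0, alpha),  # Conf(Path)
        "C": est["C"],  # cost update is domain dependent, left unchanged here
    }


def choose_role(estimates, objective):
    """Phase II: pick the role with the highest objective value."""
    return max(estimates, key=lambda r: objective(**estimates[r]))


old = {"U": 0.0, "P": 0.5, "C": 1.0, "Pot": 1.0}
new = update_role(old, alpha=0.5, final_utility=1.0,
                  success=True, conflict_free=False)
best = choose_role({"r1": new, "r2": old},
                   lambda U, P, C, Pot: U * P * C + Pot)
```

After this cycle the estimates become U = 0.5, P = 0.75 and Pot = 0.5. With the example objective, `r2` (rated 1.0) is still preferred over `r1` (rated 0.875).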
3.2.2 Learning in market environments (Vidal & Durfee)
- Agents use past experience and evolved models of other agents to better sell and buy goods
- Environment = a market in which agents buy and sell information (electronic marketplace)
- Open environment
- The agents are self-interested (maximize local utility)
- {g}: a set of goods
- P: the set of possible prices for goods
- $Q_g$: the set of possible qualities for a good g
- Information has a cost for the seller and a value for the buyer
- Information is sold at a certain price
- A buyer announces a good it needs
- Sellers bid their prices for delivering the good
- The buyer selects from these bids and pays the corresponding price
- The buyer assesses the quality of the information after it receives it from the seller
Profit of a seller s for selling the good g at price p:
  $Profit_s^g(p) = p - c_s^g$
  where $c_s^g$ is the cost of producing the good g by s, and p is the price
Value of a good g for a buyer b:
  $V_b^g(p, q)$, where p is the price b paid for g and q is the quality of good g
Goal:
- seller: maximize profit in a transaction
- buyer: maximize value in a transaction
3 types of agents
0-level agents
- they set their buying and selling prices based on their own past experience
- they do not model the behavior of other agents
1-level agents
- model other agents based on previous interactions
- they set their buying and selling prices based on these models and on past experience
- they model the other agents as 0-level agents
2-level agents
- same as 1-level agents, but they model the other agents as 1-level agents
Strategy of 0-level agents
0-level buyer
- learns the expected value function, $f^g(p)$, of buying g at price p
- uses reinforcement learning:
  $f_{i+1}^g(p) = (1 - \alpha) \, f_i^g(p) + \alpha \, V_b^g(p, q)$, with $\alpha_{min} \le \alpha \le 1$; for i = 0, $\alpha = 1$
- chooses the seller s* for supplying a good g:
  $$s^* = \arg\max_{s \in S} f^g(p_s^g)$$
0-level seller
- learns the expected profit function, $h^g(p)$, if it offers good g at price p
- uses reinforcement learning:
  $h_{i+1}^g(p) = (1 - \alpha) \, h_i^g(p) + \alpha \, Profit_s^g(p)$
  where $Profit_s^g(p) = p - c_s^g$ if it wins the auction, 0 otherwise
- chooses the price $p_s^*$ to sell the good g so as to maximize profit:
  $$p_s^* = \arg\max_{p \in P \,\&\, p \ge c_s^g} h_s^g(p)$$
Strategy of 1-level agents
1-level buyer
- models sellers for good g
- does not model other buyers
- uses a probability distribution function $q_s^g(x)$ over the qualities x of a good g
- computes the expected utility, $E_s^g$, of buying good g from seller s:
  $$E_s^g = \frac{1}{|Q|} \sum_{x \in Q} q_s^g(x) \, V_b^g(p, x)$$
- chooses the seller s* for supplying a good g that maximizes this expected utility:
  $$s^* = \arg\max_{s \in S} E_s^g$$
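A sketch of the 1-level buyer's choice, following the expected-utility formula literally (including the 1/|Q| factor). The value function `value(p, x) = x - p`, the seller names, the prices and the quality models are all hypothetical.

```python
def expected_utility(price, quality_model, value, qualities):
    """E = (1 / |Q|) * sum over x in Q of q(x) * V(price, x)."""
    total = sum(quality_model.get(x, 0.0) * value(price, x) for x in qualities)
    return total / len(qualities)


def choose_seller(offers, quality_models, value, qualities):
    """Pick the seller whose offer has the highest expected utility."""
    return max(offers, key=lambda s: expected_utility(
        offers[s], quality_models[s], value, qualities))


qualities = [1, 2]


def value(p, x):
    """Hypothetical buyer value: quality received minus price paid."""
    return x - p


offers = {"seller_a": 1.0, "seller_b": 1.0}        # bid prices
quality_models = {
    "seller_a": {1: 0.5, 2: 0.5},                  # sometimes delivers quality 2
    "seller_b": {1: 1.0},                          # always delivers quality 1
}
best = choose_seller(offers, quality_models, value, qualities)
```

Both sellers ask the same price, but the learned quality model makes `seller_a` the better choice: its expected utility is 0.25, against 0 for `seller_b`.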
1-level seller
- models buyers for good g
- models the other sellers s' for good g
Buyer's modeling:
- uses a probability distribution function $m_b^g(p)$: the probability that b will choose price p for good g
Seller's modeling:
- uses a probability distribution function $n_{s'}^g(y)$: the probability that s' will bid price y for good g
- computes the probability of bidding lower than a given seller s' with the price p:
  Prob_of_bidding_lower_than_s' = $\sum_{p'}$ (Prob of bid of s' with p' for which s wins) = $\sum_{p'} N(g,b,s;s',p,p')$
  $N(g,b,s;s',p,p') = n_{s'}^g(p')$ if $m_b^g(p') \le m_b^g(p)$, 0 otherwise
- computes the probability of bidding lower than all other sellers with the price p:
  Prob_of_bidding_lower_with_p = $\prod_{s' \in S - \{s\}}$ (Prob_of_bidding_lower_than_s')
- chooses the best price p* to bid so as to maximize profit:
  $$p^* = \arg\max_{p \in P} \, (p - c_s^g) \cdot Prob\_of\_bidding\_lower\_with\_p$$
3.3 Learning to communicate
- What to communicate (e.g., what information is of interest to the others)
- When to communicate (e.g., when to try doing something by itself or when to look for help)
- With which agents to communicate
- How to communicate (e.g., language, protocol, ontology)
Learning with which agents to communicate (Ohko et al.)
- Learning which agents to ask to perform a task
- Used in a contract net protocol for task allocation, to reduce communication for task announcement
- Goal = acquire and refine knowledge about other agents' task solving abilities
- Case-based reasoning is used for knowledge acquisition and refinement
A case consists of:
(1) A task specification
(2) Information about which agents solved a task or similar tasks in the past and the quality of the provided solution
(1) Task specification
$$T_i = \{A_{i1}\,V_{i1}, \ldots, A_{im_i}\,V_{im_i}\}$$
$A_{ij}$ - task attribute, $V_{ij}$ - value of the attribute
Similar tasks
$$Sim(T_i, T_j) = \sum_r \sum_s Dist(A_{ir}, A_{js}), \quad A_{ir} \in T_i,\ A_{js} \in T_j$$
$$Dist(A_{ir}, A_{js}) = Sim\_Attr(A_{ir}, A_{js}) \cdot Sim\_Vals(V_{ir}, V_{js})$$
Set of similar tasks
$$S(T) = \{T_j : Sim(T, T_j) \ge 0.85\}$$
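The task-similarity computation above can be sketched as follows. The slide does not define Sim_Attr or Sim_Vals, so the exact-match versions used here (`sim_attr`, `sim_vals`) are placeholder assumptions, as are the dictionary representation of a task and the example tasks; the sketch also assumes the similarity is a plain double sum over attribute pairs and that the threshold test is "greater than or equal to 0.85".

```python
from typing import Dict, List

Task = Dict[str, str]  # attribute name -> attribute value (hypothetical representation)


def sim_attr(attr_r: str, attr_s: str) -> float:
    """Placeholder attribute similarity: 1 for identical names, else 0 (assumption)."""
    return 1.0 if attr_r == attr_s else 0.0


def sim_vals(val_r: str, val_s: str) -> float:
    """Placeholder value similarity: 1 for identical values, else 0 (assumption)."""
    return 1.0 if val_r == val_s else 0.0


def dist(task_i: Task, attr_r: str, task_j: Task, attr_s: str) -> float:
    """Dist(A_ir, A_js) = Sim_Attr(A_ir, A_js) * Sim_Vals(V_ir, V_js)."""
    return sim_attr(attr_r, attr_s) * sim_vals(task_i[attr_r], task_j[attr_s])


def sim(task_i: Task, task_j: Task) -> float:
    """Sum Dist over every pair of attributes, one from each task."""
    return sum(dist(task_i, r, task_j, s) for r in task_i for s in task_j)


def similar_tasks(task: Task, case_base: List[Task], threshold: float = 0.85) -> List[Task]:
    """S(T): the stored tasks whose similarity to `task` reaches the threshold."""
    return [stored for stored in case_base if sim(task, stored) >= threshold]


# Made-up tasks, for illustration only.
new_task = {"object": "box", "destination": "room_a"}
stored_tasks = [
    {"object": "box", "destination": "room_b"},
    {"object": "crate", "destination": "room_c"},
]
matches = similar_tasks(new_task, stored_tasks)
```

With the exact-match placeholders, Sim simply counts the attribute-value pairs two tasks share, so any task sharing at least one pair passes the 0.85 threshold; graded similarity measures would make the threshold more selective.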
![Page 55: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/55.jpg)
(2) Which agents performed T or similar tasks in the past
Suitability of agent k
$$Suit(A_k, T) = \frac{1}{|S(T)|} \sum_{T_j \in S(T)} Perform(A_k, T_j)$$
$Perform(A_k, T_j)$ - quality of the solution for $T_j$ provided by agent $A_k$ when performing $T_j$ in the past
The agent computes $\{Suit(A_k, T) : Suit(A_k, T) > 0\}$ and selects the agent k* such that
$$k^* = \arg\max_k Suit(A_k, T)$$
or the first m agents with the best suitability
After each encounter, the agent stores the tasks performed by other agents and the solution quality
Tradeoff between exploitation and exploration
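A minimal sketch of the suitability computation and agent selection follows. The data layout is an assumption made for illustration: tasks are identified by strings, past solution quality is kept in a dictionary `perform` keyed by (agent, task), and a missing entry is treated as quality 0. The agent names and quality values are made up.

```python
from typing import Dict, List, Tuple

PerformTable = Dict[Tuple[str, str], float]  # (agent, task id) -> past solution quality


def suit(agent: str, similar: List[str], perform: PerformTable) -> float:
    """Average past quality of `agent` over the set S(T) of similar tasks."""
    if not similar:
        return 0.0
    total = sum(perform.get((agent, task), 0.0) for task in similar)
    return total / len(similar)


def select_agents(agents: List[str], similar: List[str],
                  perform: PerformTable, m: int = 1) -> List[str]:
    """Return the m agents with the highest strictly positive suitability."""
    scored = [(suit(agent, similar, perform), agent) for agent in agents]
    positive = [(score, agent) for score, agent in scored if score > 0]
    positive.sort(key=lambda pair: pair[0], reverse=True)
    return [agent for _, agent in positive[:m]]


# Made-up case base, for illustration only.
similar_set = ["t1", "t2"]
history = {("A1", "t1"): 0.8, ("A1", "t2"): 0.4, ("A2", "t1"): 0.9}
best_one = select_agents(["A1", "A2", "A3"], similar_set, history, m=1)
best_two = select_agents(["A1", "A2", "A3"], similar_set, history, m=2)
```

Always asking only the top-ranked agents is pure exploitation; the exploration side of the tradeoff would require occasionally announcing the task to agents with little or no recorded history, which this sketch does not do.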
![Page 56: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/56.jpg)
3.4 Layered learning
(Stone & Veloso)
- A hierarchical machine learning paradigm in MAS
- Used in simulated robotic soccer – RoboCup
- Learning a direct mapping Input → Output is intractable
- Decompose the learning task L into subtasks: L1, …, Ln
- Characteristics of the environment:
  - Cooperative MAS
  - Teammates and adversaries
  - Hidden states – agents have a partial world view at any given moment
  - Agents have noisy sensory data and actuators
  - Perception and action cycles are asynchronous
  - Agents must make their decisions in real-time
![Page 57: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/57.jpg)
Problem: the agent receives a moving ball and must decide what to do with it: dribble, pass to a teammate, or shoot towards the goal
Decompose the problem into 3 subtasks:

Layer   Behavior type   Example
L1      Individual      Ball interception
L2      Multiagent      Pass evaluation
L3      Team            Pass selection

- The decomposition into subtasks enables the learning of more complex behaviors
- The hierarchical task decomposition is constructed bottom-up, in a domain-dependent fashion
- Learning methods are chosen to suit the task
- Learning in one layer feeds into the next layer either by providing a portion of the behavior used for training (ball interception → pass evaluation) or by creating the input representation and pruning the action space (pass evaluation → pass selection)
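The bottom-up structure of layered learning can be sketched as a loop in which each layer is trained with the previously learned layers held fixed. Everything below is a toy illustration under stated assumptions: the function names, the distance cutoff and the hand-written rules stand in for the real learners, which are not reproduced here.

```python
from typing import Callable, List, Tuple

Behavior = Callable  # a learned behavior: maps an input to an output
Trainer = Callable[[List[Behavior]], Behavior]  # builds a layer from the lower layers


def train_layered(trainers: List[Trainer]) -> List[Behavior]:
    """Train layers bottom-up; each trainer sees the lower layers as fixed."""
    learned: List[Behavior] = []
    for trainer in trainers:
        learned.append(trainer(list(learned)))
    return learned


def train_interception(lower: List[Behavior]) -> Behavior:
    # Hypothetical L1: a player reaches the ball if it is close enough.
    cutoff = 5.0  # stands in for a value that would be fitted from training examples
    return lambda distance: distance <= cutoff


def train_pass_evaluation(lower: List[Behavior]) -> Behavior:
    intercept = lower[0]  # L1 is reused as part of the L2 behavior
    # Hypothetical L2: a pass succeeds if the receiver reaches the ball
    # and the nearest opponent does not.
    return lambda dists: intercept(dists[0]) and not intercept(dists[1])


def train_pass_selection(lower: List[Behavior]) -> Behavior:
    evaluate = lower[1]  # L2 prunes the action space seen by L3
    return lambda options: [name for name, dists in options if evaluate(dists)]


l1, l2, l3 = train_layered(
    [train_interception, train_pass_evaluation, train_pass_selection])

# Each option: (teammate, (receiver distance to ball, nearest opponent distance to ball)).
options: List[Tuple[str, Tuple[float, float]]] = [
    ("teammate_a", (3.0, 8.0)),
    ("teammate_b", (6.0, 2.0)),
]
viable = l3(options)
```

The point of the sketch is only the data flow: the third layer never looks at raw distances, it works on the set of passes that the second layer has already judged viable.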
![Page 58: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/58.jpg)
L1 – Ball interception
behavior = individual
- Aim: block or intercept opponents' shots or passes, or receive passes from teammates
- Learning method: a fully connected backpropagation NN
- Training: the ball is repeatedly shot towards a defender in front of a goal. The defender collects training examples by acting randomly and noticing when it successfully stops the ball
- Classification:
  - Saves = successful interceptions
  - Goals = unsuccessful attempts
  - Misses = shots that went wide of the goal
![Page 59: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/59.jpg)
L2 – Pass evaluation
behavior = multiagent
- Uses its learned ball-interception skills as part of the behavior for training the multiagent behavior
- Aim: the agent must decide
  - whether to pass the ball to a teammate and
  - whether the teammate will successfully receive the ball (based on the positions and abilities of the teammate to receive or intercept a pass)
- Learning method: decision trees (C4.5)
- Training: kick the ball towards randomly placed teammates interspersed with randomly placed opponents
- The intended pass recipient and the opponents all use the learned ball-interception behavior
- Classification of a potential pass to a receiver:
  - Success, with a confidence factor (c.f.) in (0, 1]
  - Failure, with a c.f. in [-1, 0)
  - Miss (c.f. = 0)
![Page 60: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/60.jpg)
L3 – Pass selection
behavior = team
- Uses its learned pass-evaluation capabilities to create the input and output set for learning pass selection
- Aim: the agent has the ball and must decide
  - to which teammate to pass the ball or
  - whether to shoot on goal
- Learning method: Q-learning of a function that depends on the agent's position on the field
- Training: simulate 2 teams playing with identical behaviors other than their pass-selection policies
- Reinforcement = total goals scored
- Learns:
  - whether to shoot on goal
  - the teammate to which to pass
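As a rough illustration of the learning method named for this layer, the sketch below applies the standard tabular Q-learning update to pass selection. The slide only states that the learned function depends on the agent's position on the field; the discretized position labels, the action names, the epsilon-greedy choice and the values of `alpha`, `gamma` and `epsilon` are all assumptions, and the class `PassSelector` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


class PassSelector:
    """Tabular Q-learning over (field position, action) pairs (illustrative only)."""

    def __init__(self, actions: List[str], alpha: float = 0.1,
                 gamma: float = 0.9, epsilon: float = 0.1, seed: int = 0) -> None:
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # probability of trying a random action
        self.q: Dict[Tuple[str, str], float] = defaultdict(float)
        self.rng = random.Random(seed)

    def choose(self, position: str) -> str:
        """Epsilon-greedy choice among shooting and passing to each teammate."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(position, a)])

    def update(self, position: str, action: str, reward: float,
               next_position: str) -> None:
        """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(self.q[(next_position, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(position, action)] += self.alpha * (target - self.q[(position, action)])


# Made-up episode step: a pass from midfield is followed by a goal (reward 1).
selector = PassSelector(["shoot", "pass_to_teammate_7"], epsilon=0.0)
selector.update("midfield", "pass_to_teammate_7", reward=1.0, next_position="attack")
preferred = selector.choose("midfield")
```

In the setting described above, the candidate pass actions would first be filtered by the learned pass-evaluation layer, and the reward would come from goals scored during simulated games rather than from a single hand-written step.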
![Page 61: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/61.jpg)
4 Conclusions
- There is no unique method or set of methods for learning in MAS
- Many approaches are based on extending ML techniques to a MAS setting
- Many approaches use reinforcement learning, but also NN or genetic algorithms
![Page 62: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/62.jpg)
References
- S. Sen, G. Weiss. Learning in Multiagent Systems. In Multiagent Systems - A Modern Approach to Distributed Artificial Intelligence, G. Weiss (Ed.), The MIT Press, 2001, p. 257-298.
- T. Ohko et al. Addressee learning and message interception for communication load reduction in multiple robot environment. In Distributed Artificial Intelligence Meets Machine Learning, G. Weiss (Ed.), Lecture Notes in Artificial Intelligence, Vol. 1221, Springer-Verlag, 1997, p. 242-258.
- M.V. Nagendra et al. Learning organizational roles in a heterogeneous multi-agent system. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 291-298.
- J.M. Vidal, E.H. Durfee. The impact of nested agent models in an information economy. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 377-384.
- P. Stone, M. Veloso. Layered Learning. Eleventh European Conference on Machine Learning, ECML-2000.
![Page 63: Machine Learning Lecture outline](https://reader035.vdocuments.us/reader035/viewer/2022062804/56814bf0550346895db8d87a/html5/thumbnails/63.jpg)
Web References
- An interesting set of training examples and the connection between decision trees and rules: http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
- Decision trees construction: http://www.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html
- Building Classification Models: ID3 and C4.5: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
- Introduction to Reinforcement Learning: http://www.cs.indiana.edu/~gasser/Salsa/rl.html
- On-line book on Reinforcement Learning: http://www-anw.cs.umass.edu/~rich/book/the-book.html