Multiagent social learning in large repeated games
Jean Oh
Slide 2

Outline: Motivation, Approach, Theoretical, Empirical, Conclusion

Motivation: if agents are short-sighted, selfish solutions can be far from optimal.
Slide 3: Multiagent resource selection problem

- Resources: A = {resource_1, resource_2, …, resource_m}
- Agents: N = {agent_1, agent_2, …, agent_n}
- At each state_t, every agent i chooses a strategy s_i; together these form the strategy profile (s_i, s_-i).
- Cost of agent i: c_i(s_i, s_-i)
- Individual objective: minimize cost.
Slide 4: Congestion game

- Congestion cost depends on the number of agents that have chosen the same resource.
- Individual objective: minimize congestion cost.
- "Selfish solutions" can be arbitrarily suboptimal [Roughgarden 2007].
- An important subject in transportation science, computer networks, and algorithmic game theory.
- Selfish solution: the costs of all paths in use become more or less equal, so no agent wants to deviate from its current path.
- Social welfare: the average cost of all agents.
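The congestion-cost setup on this slide can be sketched in a few lines of Python. The resource names and cost functions below are illustrative assumptions, not from the talk:

```python
from collections import Counter

def congestion_costs(profile, cost_fns):
    """Per-agent congestion cost for a strategy profile.

    profile: list of chosen resources, one per agent (profile[i] = s_i).
    cost_fns: maps each resource to a cost function of its load
              (the number of agents that chose it).
    """
    load = Counter(profile)                      # agents per resource
    return [cost_fns[r](load[r]) for r in profile]

def social_welfare(profile, cost_fns):
    """Average congestion cost over all agents (lower is better)."""
    costs = congestion_costs(profile, cost_fns)
    return sum(costs) / len(costs)

# Hypothetical example: cost grows with load on r1, constant on r2.
fns = {"r1": lambda k: k, "r2": lambda k: 3}
print(social_welfare(["r1", "r1", "r2"], fns))   # (2 + 2 + 3) / 3
```

Note that each agent's cost depends only on its own choice and the load on that resource, which is exactly what makes this a congestion game.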
Slide 5: Example: inefficiency of the selfish solution

Metro vs. Driving [Pigou 1920, Roughgarden 2007]. n agents each choose one of two options:
- Metro: constant cost 1.
- Driving: cost depends on the number of drivers; with x drivers, each pays x/n.

Objective: minimize the average cost.
- Selfish solution (reached by stationary algorithms such as no-regret or fictitious play): everyone drives; average cost = 1.
- Optimal solution (what a central administrator could impose): n/2 ride the metro and n/2 drive; average cost = [n/2 · 1 + n/2 · ½] / n = ¾.
- What about nonlinear cost functions? The gap can grow even larger.
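The metro-vs-driving arithmetic can be checked directly; `avg_cost` is a hypothetical helper written for this example, not code from the talk:

```python
def avg_cost(n_drivers, n):
    """Average cost in Pigou's metro-vs-driving example.

    Metro costs 1 per rider; driving costs (#drivers)/n per driver.
    """
    n_metro = n - n_drivers
    total = n_metro * 1 + n_drivers * (n_drivers / n)
    return total / n

n = 100
print(avg_cost(n, n))       # selfish solution: everyone drives -> 1.0
print(avg_cost(n // 2, n))  # optimal half-and-half split -> 0.75
```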
Slide 6

If a few agents take the alternative route, everyone else is better off. We just need a few altruistic agents to sacrifice; any volunteers?

"Excellent! As long as it's not me."
Slide 7: Related work

Coping with the inefficiency of the selfish solution:
- Increase resource capacity [Korilis 1999]
- Redesign network structure [Roughgarden 2001a] (cf. Braess' paradox)
- Algorithmic mechanism design [Ronen 2000, Calliese & Gordon 2008]
- Centralization [Shenker 1995, Chakrabarty 2005, Blumrosen 2006]
- Periodic policy under the "homo egualis" principle [Nowé et al. 2003]: takes the worst-performing agent into consideration (to avoid inequality)
- Collective Intelligence (COIN) [Wolpert & Tumer 1999]: WLU, the Wonderful Life Utility
- Altruistic Stackelberg strategy [Roughgarden 2001b]: (market) leaders move first, hoping to induce desired actions from the followers; LLF (centralized + selfish) agents
- "Explicit coordination is necessary to achieve the system-optimal solution in congestion games" [Milchtaich 2004]

Can self-interested agents support a mutually beneficial solution without external intervention?
Slide 8: Related work: non-stationary strategies

Explicit threat: grim-trigger. "As long as you stay, we'll be mutually beneficial. If you deviate, I'll punish you with your minimax value forever, whatever you do from then on."

Minimax value: the best agent i can achieve when the rest of the world turns against i.

Obstacles:
- Complete monitoring is required, with "significant coordination overhead."
- Computational intractability: NP-complete [Borgs et al. 2008] and NP-hard [Meyers 2006] results.
- Existing algorithms are limited to 2-player games [Stimpson 2001, Littman & Stone 2003, Sen et al. 2003, Crandall 2005].

Can "self-interested" agents learn to support a mutually beneficial solution efficiently?
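A minimal sketch of the grim-trigger idea described above. The action labels C/D and the history-based rule are illustrative assumptions; the punishment action stands in for the minimax strategy:

```python
def grim_trigger(opp_history, cooperate="C", punish="D"):
    """Cooperate until the opponent deviates once, then punish
    (play the minimax action) forever after."""
    return punish if any(a != cooperate for a in opp_history) else cooperate

# Repeated play: the opponent deviates in round 3,
# triggering permanent punishment from round 4 on.
opp = ["C", "C", "D", "C", "C"]
mine = [grim_trigger(opp[:t]) for t in range(len(opp))]
print(mine)  # ['C', 'C', 'C', 'D', 'D']
```

The determinism of this rule, and its need to observe the deviation, are exactly the properties the later grim-trigger vs. IMPRES comparison contrasts.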
Slide 9

IMPRES: IMplicit REciprocal Strategy Learning
Slide 10: Assumptions

The other agents are ______. (1. opponents, 2. sources of uncertainty, 3. sources of knowledge)
IMPRES: "sources of knowledge"

The agents are ______ in their ability. (1. symmetric, 2. asymmetric)
IMPRES: "may be asymmetric"
Slide 11: Intuition: social learning (IMPRES)

- "Learn to act more rationally by using strategies given by others."
- "Learn to act more rationally by giving strategies to others."
Slide 12: Overview: 2-layered decision making (IMPRES)

Each agent makes decisions in two layers:
1. Meta-layer: whose strategy to use? Each agent chooses a meta-action: solitary, subscriber, or strategist.
2. Inner layer: which path to take? A strategist issues a strategy to its subscribers (e.g., "Take route 2"), which determines the path taken in the environment.
3. Both layers are learned from the congestion cost the environment feeds back.
Slide 13: Meta-learning (IMPRES)

Each agent i maintains a set of meta-actions A with Q-values and a meta-strategy s over A; initially A = {strategist, solitary}, Q = (0, 0), s = (0.5, 0.5). The current meta-action a determines how to select a path from P = {p1, …}; a subscriber uses the strategy of a strategist drawn from the strategist lookup table L.

LOOP:
- p ← selectPath(a); take path p; receive congestion cost c
- Update the Q-value of action a using cost c: Q(a) ← (1-α)Q(a) + α(MaxCost - c)
- new action ← randomPick(strategist lookup table L); A ← A ∪ {new action}
- Update meta-strategy s so that more probability mass goes to low-cost actions:

  s(a) = exp(Q(a)/T) / Σ_{a'∈A} exp(Q(a')/T), for a ∈ A

- a ← select an action according to meta-strategy s; if a = strategist, L ← L ∪ {i}
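The meta-layer update can be sketched as follows. `MAX_COST`, the parameter values, and the action names are illustrative assumptions, not the thesis implementation:

```python
import math
import random

MAX_COST = 10.0  # assumed upper bound on congestion cost

def update_q(q, action, cost, alpha):
    """Q(a) <- (1-alpha)Q(a) + alpha(MaxCost - c): low cost -> high value."""
    q[action] = (1 - alpha) * q[action] + alpha * (MAX_COST - cost)

def meta_strategy(q, temperature):
    """Boltzmann distribution: s(a) = exp(Q(a)/T) / sum_a' exp(Q(a')/T)."""
    weights = {a: math.exp(v / temperature) for a, v in q.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def select_meta_action(q, temperature):
    s = meta_strategy(q, temperature)
    actions, probs = zip(*s.items())
    return random.choices(actions, weights=probs)[0]

q = {"strategist": 0.0, "solitary": 0.0}
update_q(q, "solitary", cost=2.0, alpha=0.1)    # cheap action
update_q(q, "strategist", cost=9.0, alpha=0.1)  # expensive action
s = meta_strategy(q, temperature=1.0)
print(s["solitary"] > s["strategist"])  # True: lower cost gets more mass
```

As the temperature T cools, the softmax concentrates on the cheapest meta-action, which is how exploration fades over time.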
Slide 14: Inner-learning (IMPRES)

Setting: symmetric network congestion games. Let f be the number of subscribers to this strategy; the inner strategy is a joint strategy for f agents (when f = 0, there is no inner-learning).

1. p ← path; take path p; observe the number of agents on the edges of p
2. Predict the traffic generated by others on each edge
3. Select the best joint strategy for the f agents (exploring with small probability)
4. Shuffle the joint strategy

(Figure: a small network with edges e1, e2, e3, e4, comparing f = 2 and f = 0.)
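A rough sketch of inner-layer steps 2 through 4, under strong simplifying assumptions: each path is treated as a single congestible resource whose cost equals its load, joint strategies are enumerated exhaustively (only feasible for small f), and all names are illustrative:

```python
import itertools
import random
from collections import Counter

def best_joint_strategy(paths, f, predicted_other_traffic):
    """Step 3: pick the multiset of f paths minimizing predicted total cost."""
    best, best_cost = None, float("inf")
    for joint in itertools.combinations_with_replacement(paths, f):
        load = Counter(joint)
        # cost of a path = predicted traffic from others + our own agents on it
        cost = sum(predicted_other_traffic[p] + load[p] for p in joint)
        if cost < best_cost:
            best, best_cost = list(joint), cost
    return best

def shuffle_assignment(joint, rng=random):
    """Step 4: shuffle who takes which path so subscribers rotate roles."""
    assignment = list(joint)
    rng.shuffle(assignment)
    return assignment  # assignment[k] = path for subscriber k

others = {"p1": 4, "p2": 0}  # step 2: predicted traffic generated by others
joint = best_joint_strategy(["p1", "p2"], f=2, predicted_other_traffic=others)
print(sorted(joint))  # both subscribers steered away from the busy path
```

The shuffle is what makes the joint strategy fair in expectation: over repeated rounds, each subscriber takes its turn on the costlier paths.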
Slide 15: Non-stationary strategy (IMPRES)

An IMPRES strategy alternates between two component strategies:
- Correlated strategy (the subscriber strategy): mutually beneficial strategies for a strategist and its subscribers.
- Independent strategy (the solitary strategy): implicit reciprocity, i.e., a break from correlation.

Transitions (exploit/explore):
- While Cost(C) < Cost(I), keep exploiting the correlated strategy; when Cost(C) ≥ Cost(I), explore the independent strategy.
- Symmetrically, while Cost(I) < Cost(C), keep exploiting the independent strategy; when Cost(I) ≥ Cost(C), explore the correlated strategy.
Slide 16: Grim-trigger vs. IMPRES (non-stationary strategies)

A grim-trigger strategy: play the correlated strategy while the other player obeys; on observing a deviator, switch to the minimax strategy forever, whatever the other player does.

An IMPRES strategy: alternate between the correlated and independent strategies, exploiting whichever currently has the lower cost and exploring the other.

               Grim-trigger           IMPRES
Monitoring     perfect                imperfect
Computation    intractable            tractable
Coordination   significant overhead   efficient
Strategy       deterministic          stochastic
Slide 17: Main result

General belief: rational agents can support a mutually beneficial outcome (a correlated strategy) with an explicit threat, the minimax strategy.

Main result: under IMPRES, agents support a mutually beneficial outcome without an explicit threat; the learned independent strategy serves as an implicit threat.
Slide 18: Empirical evaluation

Quantifying "mutually beneficial" and "efficient" (congestion cost vs. coordination overhead):
- Selfish solutions: congestion cost arbitrarily suboptimal; no coordination overhead.
- Centralized (1-to-n) solutions: optimal congestion cost; high coordination overhead.
- IMPRES sits between these two extremes.
Slide 19: Evaluation criteria

1. Individual rationality: minimax-safety.
2. Average congestion cost of all agents (social welfare), measured for each problem p as Cost(solution_p) / Cost(optimum_p).
3. Coordination overhead (size of subgroups) relative to a 1-to-n centrally administered system: overhead(solution_p) / overhead(max_p).
4. Agent demographic (based on meta-strategy), e.g., the percentage of solitaries, strategists, and subscribers.
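Criteria 2 and 3 are simple normalized ratios; a sketch, where the example figures echo the experiment summary later in the talk and the raw values are otherwise illustrative:

```python
def cost_ratio(cost_solution, cost_optimum):
    """Criterion 2: congestion cost relative to the optimum (1.0 = optimal)."""
    return cost_solution / cost_optimum

def overhead_ratio(overhead_solution, overhead_max):
    """Criterion 3: coordination overhead relative to a 1-to-n system."""
    return overhead_solution / overhead_max

# E.g., a solution 120% above optimum scores 2.2; one using a quarter
# of the 1-to-n communication bandwidth scores 0.25.
print(cost_ratio(2.2, 1.0))      # 2.2
print(overhead_ratio(1.0, 4.0))  # 0.25
```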
Slide 20: Experimental setup

- Number of agents n = 100 (varied from 2 to 1000)
- All agents use IMPRES (self-play)
- Number of iterations: 20,000 to 50,000
- Results averaged over 10-30 trials
- Learning parameters:

  Parameter  Value                             Description
  α          max(0.01, 1/(10·trials_{i,a}))    Learning step size; bigger steps for actions tried less often
  T          T_0 = 10; T ← 0.95T               Temperature in the update equation
  k          10                                Max number of actions in the meta-layer
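The step-size and temperature schedules in the table can be sketched as follows, assuming `n_trials` counts how often agent i has tried a given meta-action:

```python
def step_size(n_trials):
    """alpha = max(0.01, 1/(10*trials)): bigger steps for rarely tried actions."""
    return max(0.01, 1.0 / (10 * max(n_trials, 1)))

def cool(temperature, rate=0.95):
    """Geometric cooling: T <- 0.95 * T, starting from T0 = 10."""
    return rate * temperature

print(step_size(1))     # 0.1 for a freshly added action
print(step_size(1000))  # floors at 0.01
```

The floor of 0.01 keeps every action's Q-value at least slowly adaptive, which matters once the dynamic-population experiments replace agents mid-run.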
Slide 21: Metro vs. Driving (n = 100)

(Figures: congestion cost over time for metro and driving, the lower the better, and the agent demographic. A fraction of free riders always drive.)
Slide 22: Congestion cost results

Polynomial cost functions; average number of paths = 5. C(s): congestion cost of solution s. Data are based on the average cost after 20,000 iterations.

(Scatter plot: C(s)/C(optimum) against C(selfish solution)/C(optimum) for selfish, optimal, and IMPRES solutions, with the selfish baseline [Fabrikant 2004], the optimal baseline [Meyers 2006], and the line y = x. For one problem, the selfish-to-optimal cost ratio reaches 311.2.)
Slide 23: Coordination overhead

o(s): coordination overhead of solution s, measured as average communication bandwidth and reported relative to a 1-to-n solution: o(solution) / o(1-to-n solution). Polynomial cost functions; average number of paths = 5.

(Plot: congestion cost C(s)/C(optimum) against relative coordination overhead; the optimum and the 1-to-n solution anchor the two axes; lower left is better, upper right is worse.)
Slide 24: On dynamic population

In every i-th round, one randomly selected agent is replaced with a new one. 40 problems with mixed convex cost functions; average number of paths = 5. Data are based on the average cost after 50,000 iterations.

(Plot: C(s)/C(optimum) against C(selfish solution)/C(optimum), with the selfish and optimal baselines.)
Slide 25: Summary of experiments

- Symmetric network congestion games:
  - well-known examples
  - linear, polynomial, exponential, and discrete cost functions
  - scalability: number of alternative paths (|S| = 2 to 15) and population size (n = 2 to 1000)
  - robustness under the dynamic-population assumption
- 2-player matrix games
- Inefficiency of solutions, based on 121 problems:
  - Selfish solutions: 120% higher cost than optimum
  - IMPRES solutions: 30% higher cost than optimum, at 25% of the coordination overhead of the 1-to-n model
Slide 26: Contributions

- Discovered social norms (strategies) that can support mutually beneficial solutions
- Investigated "social learning" in a multiagent context
- Proposed IMPRES, a 2-layered learning algorithm:
  - a significant extension of classical reinforcement learning models
  - the first algorithm that learns non-stationary strategies for more than 2 players under imperfect monitoring
- Demonstrated that IMPRES agents self-organize:
  - every agent is individually rational (minimax-safety)
  - social welfare improves roughly 4-fold over selfish solutions
  - coordination is efficient (overhead within 25% of the 1-to-n model)
Slide 27: Future work

- Short-term goals (more asymmetry):
  - strategists: give more incentive
  - individual thresholds (sightseers vs. commuters)
  - tradeoffs between multiple criteria (weights)
  - the free-rider problem
- Long-term goals:
  - establish the notion of social learning in the artificial-agent learning context:
    - learning by copying the actions of others
    - learning by observing the consequences of other agents' actions
Slide 28: Conclusion

Boundedly rational agents adopting social learning can support mutually beneficial outcomes without an explicit notion of threat.
Slide 29
Thank you.
Slide 30 (backup): Selfish solutions

- Nash equilibrium
- Wardrop's first principle (a.k.a. user equilibrium): the travel times of all routes in use are equal, and less than the travel time a single user would experience on any route not in current use.
- (Wardrop's second principle: the system-optimal solution, based on the average cost of all agents.)