Apr 21, 2023
Multiagent learning using a variable learning rate
Igor Kiselev, University of Waterloo
M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.
University of Waterloo Page 2
Agenda
Introduction
- Motivation to multi-agent learning
- MDP framework
- Stochastic game framework
- Reinforcement Learning: single-agent, multi-agent
- Related work
Multiagent learning with a variable learning rate
- Theoretical analysis of the replicator dynamics
- WoLF Incremental Gradient Ascent algorithm
- WoLF Policy Hill Climbing algorithm
Results
Concluding remarks
Apr 21, 2023
Introduction: Motivation to multi-agent learning
University of Waterloo Page 4
MAL is a Challenging and Interesting Task
- Research goal: enable an agent to learn effectively to act (cooperate, compete) in the presence of other learning agents in complex domains
- Equipping multi-agent systems (MAS) with learning capabilities permits agents to deal with large, open, dynamic, and unpredictable environments
- Multi-agent learning (MAL) is a challenging problem for developing intelligent systems
- Multiagent environments are non-stationary, violating the traditional stationarity assumption underlying single-agent learning
University of Waterloo Page 5
Reinforcement Learning Papers: Statistics
[Figure: number of reinforcement learning papers published per year, 1963-2008, rising to roughly 2500 (source: Google Scholar); selected authors marked include Wasserman, Barto, Ernst, Collins, Bowling, Littman, Sutton, and Singh.]
University of Waterloo Page 6
Various Approaches to Learning / Related Work
Y. Shoham et al., 2003
Apr 21, 2023
Preliminaries: MDP and Stochastic Game Frameworks
University of Waterloo Page 8
Single-agent Reinforcement Learning
- Independent learners act while ignoring the existence of other agents
- The environment is assumed to be stationary
- Each learner learns a policy that maximizes its individual utility ("trial and error")
- Agents perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by others
[Diagram: agent-environment interaction loop. The learning algorithm maintains a policy; the agent sends actions to the world and receives observations/sensations of the state and rewards in return.]
R. S. Sutton, 1997
University of Waterloo Page 9
Markov Decision Processes / MDP Framework
T. M. Mitchell, 1997
The environment is modeled as an MDP, defined by (S, A, R, T):
- S: finite set of states of the environment
- A(s): set of actions possible in state s
- T: S × A → P(S): transition function from state-action pairs to a probability distribution over next states
- R(s, s', a): expected reward on the transition from s to s' under action a
- P(s, s', a): probability of the transition from s to s' under action a
- γ: discount rate for delayed reward
At each discrete time step t = 0, 1, 2, ..., the agent:
- observes state s_t ∈ S
- chooses action a_t ∈ A(s_t)
- receives immediate reward r_{t+1}
- the state changes to s_{t+1}

Trajectory: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → ... → a_{t+f-1} → r_{t+f}, s_{t+f}
University of Waterloo Page 10
Agent’s learning task – find optimal action selection policy
Execute actions in environment, observe results, and learn to construct an optimal action selection policy that maximizes the agent's performance - the long-term Total Discounted Reward
Find a policy π: s ∈ S → a ∈ A(s) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s, a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }
T. M. Mitchell, 1997
University of Waterloo Page 11
Agent’s Learning Strategy – Q-Learning method
Q-function: iterative approximation of Q-values with learning rate β, 0 ≤ β < 1
Q-Learning incremental process1. Observe the current state s
2. Select an action with probability based on the employed selection policy
3. Observe the new state s′
4. Receive a reward r from the environment
5. Update the corresponding Q-value for action a and state s
6. Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′→ s and go back to step 1
Q̂(s, a) ← (1 − β) Q̂(s, a) + β ( r(s, a) + γ max_{a'} Q̂(s', a') )
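A minimal sketch of this update in Python, assuming a hypothetical environment interface (reset, actions, step) and ε-greedy action selection; these interface names are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, beta=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning.

    Q maps (state, action) -> value; env is assumed (hypothetically) to expose
    reset() -> state, actions(state) -> list of actions, and
    step(action) -> (next_state, reward, done).
    """
    s = env.reset()
    done = False
    while not done:
        actions = env.actions(s)
        # epsilon-greedy selection over the current Q estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # Q(s,a) <- (1 - beta) * Q(s,a) + beta * (r + gamma * max_a' Q(s',a'))
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
        Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * (r + gamma * best_next)
        s = s_next
    return Q

# Usage with a hypothetical environment: Q = defaultdict(float); q_learning_episode(my_env, Q)
```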
University of Waterloo Page 12
Multi-agent Framework
Learning in a multi-agent setting:
- all agents are learning simultaneously
- the environment is not stationary (the other agents are evolving)
- problem of a “moving target”
University of Waterloo Page 13
Stochastic Game Framework for addressing MAL
From the perspective of sequential decision making:
- Markov decision processes: one decision maker, multiple states
- Repeated games: multiple decision makers, one state
- Stochastic games (Markov games): extension of MDPs to multiple decision makers, multiple states
University of Waterloo Page 14
Stochastic Game / Notation
S: set of states (n-agent stage games)
Ri(s, a): reward to player i in state s under joint action a
T(s, a, s'): probability of transition from state s to state s' under joint action a
[Diagram: each state s is a stage game in which the players choose joint actions (a1, a2, ...) with payoffs R1(s,a), R2(s,a), ...; the transition function T(s, a, s') moves play to the next stage game s'.]
From the dynamic programming perspective:
Qi(s, a): long-run payoff to player i for taking joint action a in state s and then following equilibrium play
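To make the notation concrete, here is a hedged Python sketch of one way to represent a stochastic game, with matching pennies encoded as a single-state game; the class and field names are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[int, int]   # (action of player 1, action of player 2)

@dataclass
class StochasticGame:
    states: List[str]
    actions: List[int]                                             # same action set for both players
    rewards: Dict[Tuple[str, JointAction], Tuple[float, float]]    # R_i(s, a) for i = 1, 2
    transitions: Dict[Tuple[str, JointAction], Dict[str, float]]   # T(s, a, s')

# Matching pennies as a one-state stochastic game (a repeated matrix game):
# action 0 = heads, action 1 = tails; player 1 wins on a match, player 2 on a mismatch.
HEADS, TAILS = 0, 1
matching_pennies = StochasticGame(
    states=["s0"],
    actions=[HEADS, TAILS],
    rewards={
        ("s0", (HEADS, HEADS)): (1.0, -1.0),
        ("s0", (TAILS, TAILS)): (1.0, -1.0),
        ("s0", (HEADS, TAILS)): (-1.0, 1.0),
        ("s0", (TAILS, HEADS)): (-1.0, 1.0),
    },
    transitions={("s0", a): {"s0": 1.0} for a in
                 [(HEADS, HEADS), (HEADS, TAILS), (TAILS, HEADS), (TAILS, TAILS)]},
)
```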
Apr 21, 2023
Approach: Multiagent learning using a variable learning rate
University of Waterloo Page 16
Evaluation criteria for multi-agent learning
Using convergence to a Nash equilibrium as the criterion is problematic:
- Terminating criterion: equilibrium only identifies conditions under which learning can or should stop
- It is easier to play in equilibrium than to continue computing
- A Nash equilibrium strategy has no “prescriptive force”: it says nothing about play prior to termination
- There are often multiple potential equilibria
- The opponent may not wish to play an equilibrium
- Calculating a Nash equilibrium can be intractable for large games

New criteria: rationality and convergence in self-play
- Converge to a stationary policy, not necessarily a Nash equilibrium
- Learning only terminates once a best response to the play of the other agents has been found
- In self-play, learning therefore only terminates at a stationary Nash equilibrium
University of Waterloo Page 17
Contributions and Assumptions
Contributions:
- A criterion for multi-agent learning algorithms
- A simple Q-learning algorithm that can play mixed strategies
- The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing)

Assumptions (both properties are obtained given that):
- the game is two-player, two-action
- players can observe each other’s mixed strategies (not just the actions played)
- infinitesimally small step sizes can be used
University of Waterloo Page 18
Opponent Modeling or Joint-Action Learners
C. Claus, C. Boutilier, 1998
University of Waterloo Page 19
Joint-Action Learners Method
- Maintains an explicit model of the opponents for each state
- Q-values are maintained for all possible joint actions at a given state
- The key assumption is that the opponent is stationary
- Thus the model of the opponent is simply the frequencies of the actions it has played in the past
- Probability of the opponent playing action a−i: P(a−i) = C(a−i) / n(s), where C(a−i) is the number of times the opponent has played action a−i and n(s) is the number of times state s has been visited
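A hedged Python sketch of the opponent-model bookkeeping described above: empirical action frequencies and the expected value of an action under that model. The function names and data layout are illustrative, not the exact algorithm of Claus and Boutilier (1998).

```python
from collections import defaultdict

def opponent_probability(counts, state, opp_action):
    """Empirical opponent model: P(a_-i) = C(a_-i) / n(s)."""
    n_s = sum(counts[state].values())
    return counts[state][opp_action] / n_s if n_s > 0 else 0.0

def expected_value(Q, counts, state, my_action, opp_actions):
    """Expected value of my_action, marginalizing the joint-action Q-values
    over the empirical model of the opponent's play."""
    return sum(opponent_probability(counts, state, b) * Q[(state, my_action, b)]
               for b in opp_actions)

# Bookkeeping (hypothetical layout): counts[state][opp_action] is incremented each
# time the opponent's action is observed; the agent then plays
#   argmax_a expected_value(Q, counts, state, a, opp_actions)
counts = defaultdict(lambda: defaultdict(int))
Q = defaultdict(float)
```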
University of Waterloo Page 20
Opponent modeling FP-Q learning algorithm
University of Waterloo Page 21
WoLF Principles
- The idea is to use two different strategy update step sizes, one for winning and one for losing situations
- “Win or Learn Fast”: the agent reduces its learning rate when it is performing well and increases it when it is doing badly; this improves the convergence of IGA and of policy hill climbing
- To distinguish between the two situations, the player keeps track of two policies
- The player is considered to be winning if the expected utility of its actual policy is greater than the expected utility of the equilibrium (or average) policy
- If winning, the smaller of the two strategy update steps is chosen
University of Waterloo Page 22
Incremental Gradient Ascent Learners (IGA)
IGA:
- incrementally climbs in the space of mixed strategies
- for two-player, two-action general-sum games
- guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium

WoLF IGA:
- based on the WoLF principle
- guaranteed to converge to a Nash equilibrium for all two-player, two-action general-sum games
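A hedged Python sketch of one WoLF-IGA step for player 1 in a two-player, two-action game. It assumes, as in the theoretical analysis, that the player knows its payoff matrix, can observe the opponent's mixed strategy, and knows an equilibrium strategy alpha_e for the win/lose test; the variable names and step sizes are illustrative.

```python
def wolf_iga_step(alpha, beta, payoff, alpha_e, eta=0.001, delta_win=1.0, delta_lose=4.0):
    """One WoLF-IGA update of player 1's probability alpha of playing action 0.

    payoff[i][j]: player 1's reward when player 1 plays i and player 2 plays j.
    beta: opponent's observed probability of playing action 0.
    alpha_e: player 1's equilibrium strategy, used only for the win/lose test.
    """
    def value(a, b):
        return (a * b * payoff[0][0] + a * (1 - b) * payoff[0][1]
                + (1 - a) * b * payoff[1][0] + (1 - a) * (1 - b) * payoff[1][1])

    # Gradient of player 1's expected payoff with respect to alpha
    grad = beta * (payoff[0][0] - payoff[1][0]) + (1 - beta) * (payoff[0][1] - payoff[1][1])

    # WoLF: learn slowly when winning (doing better than the equilibrium strategy would
    # against the current opponent), learn fast when losing.
    winning = value(alpha, beta) > value(alpha_e, beta)
    delta = delta_win if winning else delta_lose

    # Gradient step, projected back onto the valid probability range [0, 1]
    return min(1.0, max(0.0, alpha + eta * delta * grad))
```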
University of Waterloo Page 23
Information passing in the PHC algorithm
University of Waterloo Page 24
Simple Q-Learner that plays mixed strategies
Updates a mixed strategy by giving more weight to the action that Q-learning believes is the best
Properties:
- guarantees rationality against stationary opponents
- problem: does not converge in self-play
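A hedged sketch of the PHC policy-update step in Python, assuming the Q-values have already been updated by ordinary Q-learning as above; the dict-based layout is an illustrative choice.

```python
def phc_policy_update(pi, Q, state, actions, delta=0.1):
    """Hill-climb the mixed strategy pi[state] toward the action with the highest
    Q-value, moving at most delta/(|A|-1) probability mass away from each other action."""
    best = max(actions, key=lambda a: Q[(state, a)])
    moved = 0.0
    for a in actions:
        if a == best:
            continue
        step = min(pi[state][a], delta / (len(actions) - 1))
        pi[state][a] -= step
        moved += step
    pi[state][best] += moved   # pi[state] remains a valid probability distribution
    return pi
```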
University of Waterloo Page 25
WoLF Policy Hill Climbing algorithm
- Maintains an average policy in addition to the current policy
- Determination of “W” (win) and “L” (lose): compare the expected value of the current policy with the expected value of the average policy
- The agent only needs to observe its own payoff
- Converges for two-player, two-action stochastic games in self-play
- The probability of playing each action is adjusted toward the best action, as sketched below
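Building on the PHC sketch above, a hedged sketch of the WoLF part: maintain an average policy, decide win or lose by comparing expected values, and choose the small step delta_w when winning and the large step delta_l when losing. Names and data layout are again illustrative assumptions.

```python
def wolf_phc_update(pi, pi_avg, counts, Q, state, actions, delta_w=0.05, delta_l=0.2):
    """WoLF-PHC step: update the average policy, decide win/lose, then hill-climb."""
    # Incrementally update the average policy for this state
    counts[state] = counts.get(state, 0) + 1
    for a in actions:
        pi_avg[state][a] += (pi[state][a] - pi_avg[state][a]) / counts[state]

    # "Winning" if the current policy is worth more than the average policy
    v_current = sum(pi[state][a] * Q[(state, a)] for a in actions)
    v_average = sum(pi_avg[state][a] * Q[(state, a)] for a in actions)
    delta = delta_w if v_current > v_average else delta_l

    # Reuse the PHC policy-update step from the previous sketch
    return phc_policy_update(pi, Q, state, actions, delta=delta)
```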
Apr 21, 2023
Theoretical analysis: Analysis of the replicator dynamics
University of Waterloo Page 27
Replicator Dynamics – Simplification Case: Best-response dynamics for rock-paper-scissors
Circular shift from one agent’s policy to the other’s average reward
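As a hedged illustration of why such dynamics cycle rather than converge, here is a small Python sketch that numerically integrates the standard single-population replicator dynamics for rock-paper-scissors; the step size and the starting strategy are arbitrary choices for illustration.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of the replicator dynamics dx_i/dt = x_i * ((Ax)_i - x.Ax)."""
    n = len(x)
    Ax = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    avg = sum(x[i] * Ax[i] for i in range(n))
    x = [x[i] + dt * x[i] * (Ax[i] - avg) for i in range(n)]
    total = sum(x)                     # renormalize to stay on the simplex
    return [xi / total for xi in x]

# Rock-paper-scissors payoffs for the row player (win = 1, loss = -1, tie = 0)
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

x = [0.6, 0.3, 0.1]                    # arbitrary starting mixed strategy
for _ in range(5000):
    x = replicator_step(x, RPS)
# The strategy keeps orbiting around the mixed equilibrium (1/3, 1/3, 1/3)
# rather than converging to it, which is the cycling behaviour discussed here.
```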
University of Waterloo Page 28
A winning strategy against PHC
If winning:
- play probability 1 for the current preferred action, in order to maximize rewards while winning
If losing:
- play a deceiving policy until we are ready to take advantage of the opponent again

[Plot: probability we play heads versus probability the opponent plays heads, both ranging from 0 to 1.]
University of Waterloo Page 29
Ideally we’d like to see this:
[Plot: trajectory through the strategy space, with winning and losing segments marked.]
University of Waterloo Page 30
Ideally we’d like to see this:
[Plot: trajectory through the strategy space, with winning and losing segments marked.]
University of Waterloo Page 31
Convergence dynamics of strategies
Iterated Gradient Ascent:
- again performs a myopic adaptation to the other players’ current strategy
- either converges to a Nash fixed point on the boundary (at least one pure strategy), or reaches a limit cycle
- WoLF: vary the learning rate to be optimal while satisfying both properties (rationality and convergence)
Apr 21, 2023
Results
University of Waterloo Page 33
Experimental testbeds
Matrix games:
- Matching pennies
- Three-player matching pennies
- Rock-paper-scissors
Gridworld
Soccer
University of Waterloo Page 34
Matching pennies
University of Waterloo Page 35
Rock-paper-scissors: PHC
University of Waterloo Page 36
Rock-paper-scissors: WoLF PHC
University of Waterloo Page 37
Summary and Conclusion
Criterion for multi-agent learning algorithms: rationality and convergence
A simple Q-learning algorithm that can play mixed strategies
The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing), which satisfies rationality and convergence
University of Waterloo Page 38
Disadvantages
Analysis for two-player, two-action games: pseudoconvergence
Avoidance of exploitation: there is no guarantee that the learner cannot be deceptively exploited by another agent
Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy.
University of Waterloo Page 39
Pseudoconvergence
University of Waterloo Page 40
Future Work by Authors
Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) “learner”
Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF
Online learning
Other algorithms by the authors: GIGA-WoLF, normal-form games
University of Waterloo Page 41
Discussion / Open Questions
- Investigating other evaluation criteria: no-regret criteria; negative non-convergence regret (NNR); fast reaction (tracking) [Jensen]; performance, i.e. the maximum time for reaching a desired performance level
- Incorporating more algorithms into testing: a deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm, 2003])
- Classification of situations (games) with various values of the delta and alpha variables: which values are good in which situations
- Extending the work to more players
- Online learning and the exploration policy in stochastic games (trade-off)
- The formalism is currently presented for a two-dimensional state space: is there a possibility of extending the formal model (geometrically?)
- What makes Minimax-Q irrational?
- Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
- Connection with control theory and the learning of Complex Adaptive Systems: manifold-adaptive learning?
Apr 21, 2023
Questions
Thank you