Apr 21, 2023
Multiagent learning using a variable learning rate
Igor Kiselev, University of Waterloo
M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.
University of Waterloo Page 2
Agenda
Introduction
- Motivation to multi-agent learning
- MDP framework
- Stochastic game framework
- Reinforcement Learning: single-agent, multi-agent
- Related work
Multiagent learning with a variable learning rate
- Theoretical analysis of the replicator dynamics
- WoLF Incremental Gradient Ascent algorithm
- WoLF Policy Hill Climbing algorithm
Results
Concluding remarks
Apr 21, 2023
Introduction: Motivation to multi-agent learning
University of Waterloo Page 4
MAL is a Challenging and Interesting Task
- Research goal: enable an agent to learn effectively to act (cooperate, compete) in the presence of other learning agents in complex domains
- Equipping multi-agent systems (MAS) with learning capabilities permits agents to deal with large, open, dynamic, and unpredictable environments
- Multi-agent learning (MAL) is a challenging problem for developing intelligent systems
- Multiagent environments are non-stationary, violating the traditional stationarity assumption underlying single-agent learning
University of Waterloo Page 5
Reinforcement Learning Papers: Statistics
[Figure: number of reinforcement learning papers published per year, 1963-2008, rising to roughly 2500 (source: Google Scholar); selected authors marked include Wasserman, Barto, Ernst, Collins, Bowling, Littman, Sutton, and Singh.]
University of Waterloo Page 6
Various Approaches to Learning / Related Work
Y. Shoham et al., 2003
Apr 21, 2023
Preliminaries: MDP and Stochastic Game Frameworks
University of Waterloo Page 8
Single-agent Reinforcement Learning
- Independent learners act while ignoring the existence of other agents
- The environment is assumed to be stationary
- Each learner learns a policy that maximizes its individual utility ("trial and error")
- Agents perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by others
[Diagram: agent-environment interaction loop. The learning algorithm maintains a policy; the agent sends actions to the world and receives observations/sensations of the state and rewards in return.]
R. S. Sutton, 1997
University of Waterloo Page 9
Markov Decision Processes / MDP Framework
T. M. Mitchell, 1997
The environment is modeled as an MDP, defined by (S, A, R, T):
- S: finite set of states of the environment
- A(s): set of actions possible in state s
- T: S × A → P(S): transition function from state-action pairs to a probability distribution over next states
- R(s, s', a): expected reward on the transition from s to s' under action a
- P(s, s', a): probability of the transition from s to s' under action a
- γ: discount rate for delayed reward
At each discrete time step t = 0, 1, 2, ..., the agent:
- observes state s_t ∈ S
- chooses action a_t ∈ A(s_t)
- receives immediate reward r_{t+1}
- the state changes to s_{t+1}

Trajectory: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → ... → a_{t+f-1} → r_{t+f}, s_{t+f}
University of Waterloo Page 10
Agent’s learning task – find optimal action selection policy
Execute actions in environment, observe results, and learn to construct an optimal action selection policy that maximizes the agent's performance - the long-term Total Discounted Reward
Find a policy π: s ∈ S → a ∈ A(s) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s, a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }
T. M. Mitchell, 1997
University of Waterloo Page 11
Agent’s Learning Strategy – Q-Learning method
Q-function: iterative approximation of Q-values with learning rate β, 0 ≤ β < 1
Q-Learning incremental process1. Observe the current state s
2. Select an action with probability based on the employed selection policy
3. Observe the new state s′
4. Receive a reward r from the environment
5. Update the corresponding Q-value for action a and state s
6. Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′→ s and go back to step 1
Q̂(s, a) ← (1 − β) Q̂(s, a) + β ( r(s, a) + γ max_{a'} Q̂(s', a') )
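A minimal sketch of this update in Python, assuming a hypothetical environment interface (reset, actions, step) and ε-greedy action selection; these interface names are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, beta=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning.

    Q maps (state, action) -> value; env is assumed (hypothetically) to expose
    reset() -> state, actions(state) -> list of actions, and
    step(action) -> (next_state, reward, done).
    """
    s = env.reset()
    done = False
    while not done:
        actions = env.actions(s)
        # epsilon-greedy selection over the current Q estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # Q(s,a) <- (1 - beta) * Q(s,a) + beta * (r + gamma * max_a' Q(s',a'))
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
        Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * (r + gamma * best_next)
        s = s_next
    return Q

# Usage with a hypothetical environment: Q = defaultdict(float); q_learning_episode(my_env, Q)
```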
University of Waterloo Page 12
Multi-agent Framework
Learning in a multi-agent setting:
- all agents are learning simultaneously
- the environment is not stationary (the other agents are evolving)
- problem of a “moving target”
University of Waterloo Page 13
Stochastic Game Framework for addressing MAL
From the perspective of sequential decision making:
- Markov decision processes: one decision maker, multiple states
- Repeated games: multiple decision makers, one state
- Stochastic games (Markov games): extension of MDPs to multiple decision makers, multiple states
University of Waterloo Page 14
Stochastic Game / Notation
S: set of states (n-agent stage games)
Ri(s, a): reward to player i in state s under joint action a
T(s, a, s'): probability of transition from state s to state s' under joint action a
[Diagram: each state s is a stage game in which the players choose joint actions (a1, a2, ...) with payoffs R1(s,a), R2(s,a), ...; the transition function T(s, a, s') moves play to the next stage game s'.]
From the dynamic programming perspective:
Qi(s, a): long-run payoff to player i for taking joint action a in state s and then following equilibrium play
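To make the notation concrete, here is a hedged Python sketch of one way to represent a stochastic game, with matching pennies encoded as a single-state game; the class and field names are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[int, int]   # (action of player 1, action of player 2)

@dataclass
class StochasticGame:
    states: List[str]
    actions: List[int]                                             # same action set for both players
    rewards: Dict[Tuple[str, JointAction], Tuple[float, float]]    # R_i(s, a) for i = 1, 2
    transitions: Dict[Tuple[str, JointAction], Dict[str, float]]   # T(s, a, s')

# Matching pennies as a one-state stochastic game (a repeated matrix game):
# action 0 = heads, action 1 = tails; player 1 wins on a match, player 2 on a mismatch.
HEADS, TAILS = 0, 1
matching_pennies = StochasticGame(
    states=["s0"],
    actions=[HEADS, TAILS],
    rewards={
        ("s0", (HEADS, HEADS)): (1.0, -1.0),
        ("s0", (TAILS, TAILS)): (1.0, -1.0),
        ("s0", (HEADS, TAILS)): (-1.0, 1.0),
        ("s0", (TAILS, HEADS)): (-1.0, 1.0),
    },
    transitions={("s0", a): {"s0": 1.0} for a in
                 [(HEADS, HEADS), (HEADS, TAILS), (TAILS, HEADS), (TAILS, TAILS)]},
)
```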
Apr 21, 2023
Approach: Multiagent learning using a variable learning rate
University of Waterloo Page 16
Evaluation criteria for multi-agent learning
Using convergence to a Nash equilibrium as the criterion is problematic:
- Terminating criterion: equilibrium only identifies conditions under which learning can or should stop
- It is easier to play in equilibrium than to continue computing
- A Nash equilibrium strategy has no “prescriptive force”: it says nothing about play prior to termination
- There are often multiple potential equilibria
- The opponent may not wish to play an equilibrium
- Calculating a Nash equilibrium can be intractable for large games

New criteria: rationality and convergence in self-play
- Converge to a stationary policy, not necessarily a Nash equilibrium
- Learning only terminates once a best response to the play of the other agents has been found
- In self-play, learning therefore only terminates at a stationary Nash equilibrium
University of Waterloo Page 17
Contributions and Assumptions
Contributions:
- A criterion for multi-agent learning algorithms
- A simple Q-learning algorithm that can play mixed strategies
- The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing)

Assumptions (both properties are obtained given that):
- the game is two-player, two-action
- players can observe each other’s mixed strategies (not just the actions played)
- infinitesimally small step sizes can be used
University of Waterloo Page 18
Opponent Modeling or Joint-Action Learners
C. Claus, C. Boutilier, 1998
University of Waterloo Page 19
Joint-Action Learners Method
- Maintains an explicit model of the opponents for each state
- Q-values are maintained for all possible joint actions at a given state
- The key assumption is that the opponent is stationary
- Thus the model of the opponent is simply the frequencies of the actions it has played in the past
- Probability of the opponent playing action a−i: P(a−i) = C(a−i) / n(s), where C(a−i) is the number of times the opponent has played action a−i and n(s) is the number of times state s has been visited
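A hedged Python sketch of the opponent-model bookkeeping described above: empirical action frequencies and the expected value of an action under that model. The function names and data layout are illustrative, not the exact algorithm of Claus and Boutilier (1998).

```python
from collections import defaultdict

def opponent_probability(counts, state, opp_action):
    """Empirical opponent model: P(a_-i) = C(a_-i) / n(s)."""
    n_s = sum(counts[state].values())
    return counts[state][opp_action] / n_s if n_s > 0 else 0.0

def expected_value(Q, counts, state, my_action, opp_actions):
    """Expected value of my_action, marginalizing the joint-action Q-values
    over the empirical model of the opponent's play."""
    return sum(opponent_probability(counts, state, b) * Q[(state, my_action, b)]
               for b in opp_actions)

# Bookkeeping (hypothetical layout): counts[state][opp_action] is incremented each
# time the opponent's action is observed; the agent then plays
#   argmax_a expected_value(Q, counts, state, a, opp_actions)
counts = defaultdict(lambda: defaultdict(int))
Q = defaultdict(float)
```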
University of Waterloo Page 20
Opponent modeling FP-Q learning algorithm
University of Waterloo Page 21
WoLF Principles
- The idea is to use two different strategy update step sizes, one for winning and one for losing situations
- “Win or Learn Fast”: the agent reduces its learning rate when it is performing well and increases it when it is doing badly; this improves the convergence of IGA and of policy hill climbing
- To distinguish between the two situations, the player keeps track of two policies
- The player is considered to be winning if the expected utility of its actual policy is greater than the expected utility of the equilibrium (or average) policy
- If winning, the smaller of the two strategy update steps is chosen
University of Waterloo Page 22
Incremental Gradient Ascent Learners (IGA)
IGA:
- incrementally climbs in the space of mixed strategies
- for two-player, two-action general-sum games
- guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium

WoLF IGA:
- based on the WoLF principle
- guaranteed to converge to a Nash equilibrium for all two-player, two-action general-sum games
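A hedged Python sketch of one WoLF-IGA step for player 1 in a two-player, two-action game. It assumes, as in the theoretical analysis, that the player knows its payoff matrix, can observe the opponent's mixed strategy, and knows an equilibrium strategy alpha_e for the win/lose test; the variable names and step sizes are illustrative.

```python
def wolf_iga_step(alpha, beta, payoff, alpha_e, eta=0.001, delta_win=1.0, delta_lose=4.0):
    """One WoLF-IGA update of player 1's probability alpha of playing action 0.

    payoff[i][j]: player 1's reward when player 1 plays i and player 2 plays j.
    beta: opponent's observed probability of playing action 0.
    alpha_e: player 1's equilibrium strategy, used only for the win/lose test.
    """
    def value(a, b):
        return (a * b * payoff[0][0] + a * (1 - b) * payoff[0][1]
                + (1 - a) * b * payoff[1][0] + (1 - a) * (1 - b) * payoff[1][1])

    # Gradient of player 1's expected payoff with respect to alpha
    grad = beta * (payoff[0][0] - payoff[1][0]) + (1 - beta) * (payoff[0][1] - payoff[1][1])

    # WoLF: learn slowly when winning (doing better than the equilibrium strategy would
    # against the current opponent), learn fast when losing.
    winning = value(alpha, beta) > value(alpha_e, beta)
    delta = delta_win if winning else delta_lose

    # Gradient step, projected back onto the valid probability range [0, 1]
    return min(1.0, max(0.0, alpha + eta * delta * grad))
```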
University of Waterloo Page 23
Information passing in the PHC algorithm
University of Waterloo Page 24
Simple Q-Learner that plays mixed strategies
Updates a mixed strategy by giving more weight to the action that Q-learning believes is the best
Properties:
- guarantees rationality against stationary opponents
- problem: does not converge in self-play
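A hedged sketch of the PHC policy-update step in Python, assuming the Q-values have already been updated by ordinary Q-learning as above; the dict-based layout is an illustrative choice.

```python
def phc_policy_update(pi, Q, state, actions, delta=0.1):
    """Hill-climb the mixed strategy pi[state] toward the action with the highest
    Q-value, moving at most delta/(|A|-1) probability mass away from each other action."""
    best = max(actions, key=lambda a: Q[(state, a)])
    moved = 0.0
    for a in actions:
        if a == best:
            continue
        step = min(pi[state][a], delta / (len(actions) - 1))
        pi[state][a] -= step
        moved += step
    pi[state][best] += moved   # pi[state] remains a valid probability distribution
    return pi
```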
University of Waterloo Page 25
WoLF Policy Hill Climbing algorithm
- Maintains an average policy in addition to the current policy
- Determination of “W” (win) and “L” (lose): compare the expected value of the current policy with the expected value of the average policy
- The agent only needs to observe its own payoff
- Converges for two-player, two-action stochastic games in self-play
- The probability of playing each action is adjusted toward the best action, as sketched below
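Building on the PHC sketch above, a hedged sketch of the WoLF part: maintain an average policy, decide win or lose by comparing expected values, and choose the small step delta_w when winning and the large step delta_l when losing. Names and data layout are again illustrative assumptions.

```python
def wolf_phc_update(pi, pi_avg, counts, Q, state, actions, delta_w=0.05, delta_l=0.2):
    """WoLF-PHC step: update the average policy, decide win/lose, then hill-climb."""
    # Incrementally update the average policy for this state
    counts[state] = counts.get(state, 0) + 1
    for a in actions:
        pi_avg[state][a] += (pi[state][a] - pi_avg[state][a]) / counts[state]

    # "Winning" if the current policy is worth more than the average policy
    v_current = sum(pi[state][a] * Q[(state, a)] for a in actions)
    v_average = sum(pi_avg[state][a] * Q[(state, a)] for a in actions)
    delta = delta_w if v_current > v_average else delta_l

    # Reuse the PHC policy-update step from the previous sketch
    return phc_policy_update(pi, Q, state, actions, delta=delta)
```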
Apr 21, 2023
Theoretical analysis: Analysis of the replicator dynamics
University of Waterloo Page 27
Replicator Dynamics – Simplification Case: Best-response dynamics for rock-paper-scissors
Circular shift from one agent’s policy to the other’s average reward
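As a hedged illustration of why such dynamics cycle rather than converge, here is a small Python sketch that numerically integrates the standard single-population replicator dynamics for rock-paper-scissors; the step size and the starting strategy are arbitrary choices for illustration.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of the replicator dynamics dx_i/dt = x_i * ((Ax)_i - x.Ax)."""
    n = len(x)
    Ax = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    avg = sum(x[i] * Ax[i] for i in range(n))
    x = [x[i] + dt * x[i] * (Ax[i] - avg) for i in range(n)]
    total = sum(x)                     # renormalize to stay on the simplex
    return [xi / total for xi in x]

# Rock-paper-scissors payoffs for the row player (win = 1, loss = -1, tie = 0)
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

x = [0.6, 0.3, 0.1]                    # arbitrary starting mixed strategy
for _ in range(5000):
    x = replicator_step(x, RPS)
# The strategy keeps orbiting around the mixed equilibrium (1/3, 1/3, 1/3)
# rather than converging to it, which is the cycling behaviour discussed here.
```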
University of Waterloo Page 28
A winning strategy against PHC
If winning:
- play probability 1 for the current preferred action, in order to maximize rewards while winning
If losing:
- play a deceiving policy until we are ready to take advantage of the opponent again

[Plot: probability we play heads versus probability the opponent plays heads, both ranging from 0 to 1.]
University of Waterloo Page 29
Ideally we’d like to see this:
[Plot: trajectory through the strategy space, with winning and losing segments marked.]
University of Waterloo Page 30
Ideally we’d like to see this:
[Plot: trajectory through the strategy space, with winning and losing segments marked.]
University of Waterloo Page 31
Convergence dynamics of strategies
Iterated Gradient Ascent:
- again performs a myopic adaptation to the other players’ current strategy
- either converges to a Nash fixed point on the boundary (at least one pure strategy), or reaches a limit cycle
- WoLF: vary the learning rate to be optimal while satisfying both properties (rationality and convergence)
Apr 21, 2023
Results
University of Waterloo Page 33
Experimental testbeds
Matrix games:
- Matching pennies
- Three-player matching pennies
- Rock-paper-scissors
Gridworld
Soccer
University of Waterloo Page 34
Matching pennies
University of Waterloo Page 35
Rock-paper-scissors: PHC
University of Waterloo Page 36
Rock-paper-scissors: WoLF PHC
University of Waterloo Page 37
Summary and Conclusion
Criterion for multi-agent learning algorithms: rationality and convergence
A simple Q-learning algorithm that can play mixed strategies
The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing), which satisfies rationality and convergence
University of Waterloo Page 38
Disadvantages
Analysis for two-player, two-action games: pseudoconvergence
Avoidance of exploitation: there is no guarantee that the learner cannot be deceptively exploited by another agent
Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy.
University of Waterloo Page 39
Pseudoconvergence
University of Waterloo Page 40
Future Work by Authors
Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) “learner”
Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF
Online learning
Other algorithms by the authors: GIGA-WoLF, normal-form games
University of Waterloo Page 41
Discussion / Open Questions
- Investigating other evaluation criteria: no-regret criteria; negative non-convergence regret (NNR); fast reaction (tracking) [Jensen]; performance, i.e. the maximum time for reaching a desired performance level
- Incorporating more algorithms into testing: a deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm, 2003])
- Classification of situations (games) with various values of the delta and alpha variables: which values are good in which situations
- Extending the work to more players
- Online learning and the exploration policy in stochastic games (trade-off)
- The formalism is currently presented for a two-dimensional state space: is there a possibility of extending the formal model (geometrically?)
- What makes Minimax-Q irrational?
- Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
- Connection with control theory and the learning of Complex Adaptive Systems: manifold-adaptive learning?
Apr 21, 2023
Questions
Thank you