Multiagent learning using a variable learning rate
Igor Kiselev, University of Waterloo
M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.


TRANSCRIPT

Page 1:

Multiagent learning using a variable learning rate

Igor Kiselev, University of Waterloo

M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.

Page 2:

Agenda

Introduction
Motivation for multi-agent learning
MDP framework
Stochastic game framework
Reinforcement learning: single-agent, multi-agent
Related work

Multiagent learning with a variable learning rate
Theoretical analysis of the replicator dynamics
WoLF Incremental Gradient Ascent algorithm
WoLF Policy Hill Climbing algorithm

Results
Concluding remarks

Page 3: Introduction – Motivation for multi-agent learning

Page 4:

MAL is a Challenging and Interesting Task

The research goal is to enable an agent to learn effectively to act (cooperate, compete) in the presence of other learning agents in complex domains.
Equipping MAS with learning capabilities permits agents to deal with large, open, dynamic, and unpredictable environments.
Multi-agent learning (MAL) is a challenging problem for developing intelligent systems.
Multiagent environments are non-stationary, violating the traditional assumption underlying single-agent learning.

Page 5:

Reinforcement Learning Papers: Statistics

[Chart: number of reinforcement learning papers per year, 1963-2008, growing from near zero to roughly 2500; labeled authors include Wasserman, Barto, Ernst, Collins, Bowling, Littman, Sutton, and Singh. Source: Google Scholar]

Page 6:

Various Approaches to Learning / Related Work

Y. Shoham et al., 2003

Page 7: Preliminaries – MDP and Stochastic Game Frameworks

Page 8:

Single-agent Reinforcement Learning

Independent learners act ignoring the existence of others

Stationary environment

Learn policy that maximizes individual utility (“trial-error”)

Perform their actions, obtain a reward and update their Q-values without regard to the actions performed by others

[Diagram: agent-environment loop – the learning algorithm maps observations/sensations and rewards from the world's state to a policy, which selects the actions performed in the world]

R. S. Sutton, 1997

Page 9:

Markov Decision Processes / MDP Framework

T. M. Mitchell, 1997

The environment is modeled as an MDP, defined by (S, A, R, T):
S – finite set of states of the environment
A(s) – set of actions possible in state s
T: S×A → P(S) – transition function from state-action pairs to probability distributions over states
R(s,s',a) – expected reward on the transition from s to s'
P(s,s',a) – probability of the transition from s to s'
γ – discount rate for delayed reward

At each discrete time t = 0, 1, 2, ... the agent:
observes state s_t ∈ S
chooses action a_t ∈ A(s_t)
receives immediate reward r_{t+1}
the state changes to s_{t+1}

Trajectory: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → ... → r_{t+f}, s_{t+f}

Page 10:

Agent’s learning task – find optimal action selection policy

Execute actions in the environment, observe the results, and learn to construct an optimal action-selection policy that maximizes the agent's performance: the long-term total discounted reward

Find a policy π : s ∈ S → a ∈ A(s)

that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }

T. M. Mitchell, 1997

Page 11:

Agent’s Learning Strategy – Q-Learning method
Q-function: iterative approximation of Q-values with learning rate β, 0 ≤ β < 1

Q-Learning incremental process:
1. Observe the current state s

2. Select an action with probability based on the employed selection policy

3. Observe the new state s′

4. Receive a reward r from the environment

5. Update the corresponding Q-value for action a and state s

6. Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′→ s and go back to step 1

Q̂(s,a) ← (1 − β) Q̂(s,a) + β ( r(s,a) + γ max_{a′} Q̂(s′,a′) )
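Taken together, the loop and the update are straightforward to implement. Below is a minimal tabular Q-learning sketch, assuming a hypothetical environment object with reset(), step(action), and actions(state) methods; the β, γ, and ε values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, beta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- (1-beta) Q(s,a) + beta (r + gamma max_a' Q(s',a'))."""
    Q = defaultdict(float)  # (state, action) -> estimated value, 0 by default
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy selection based on the current Q estimates
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            # Steps 3-4: observe the new state and the reward
            s_next, r, done = env.step(a)
            # Step 5: update Q(s,a) toward r + gamma max_a' Q(s',a')
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
            Q[(s, a)] += beta * (target - Q[(s, a)])
            s = s_next  # step 6: continue from the new state
    return Q
```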

Page 12:

Multi-agent Framework

Learning in a multi-agent setting:
all agents learn simultaneously

environment not stationary (other agents are evolving)

problem of a “moving target”

Page 13:

Stochastic Game Framework for addressing MAL

From the perspective of sequential decision making:

Markov decision processes:
one decision maker
multiple states

Repeated games:
multiple decision makers
one state

Stochastic games (Markov games):
extension of MDPs to multiple decision makers
multiple states

Page 14:

Stochastic Game / Notation

S: Set of states (n-agent stage games)

Ri(s,a): Reward to player i in state s under joint action a

T(s,a,s′): probability of transitioning from s to state s′ under joint action a

[Diagram: states s and s′ drawn as stage-game payoff matrices over joint actions (a1, a2), with entries R1(s,a), R2(s,a), connected by the transition T(s,a,s′)]

From the dynamic programming perspective:
Qi(s,a): long-run payoff to player i for taking joint action a in state s and then playing an equilibrium thereafter
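To make the notation concrete, here is a small sketch of a single stage game s represented by payoff matrices, with the expected payoff of a joint mixed-strategy profile; the matrices and strategy values are invented for the example:

```python
import numpy as np

# One stage game s: rows index player 1's actions a1, columns player 2's actions a2.
R1 = np.array([[ 1, -1],
               [-1,  1]])  # R1(s,a): reward to player 1 under joint action a = (a1, a2)
R2 = -R1                   # R2(s,a): reward to player 2 (zero-sum example)

def expected_payoff(R, pi1, pi2):
    """E[R] = sum over (a1, a2) of pi1(a1) * pi2(a2) * R(a1, a2)."""
    return pi1 @ R @ pi2

pi1 = np.array([0.5, 0.5])  # player 1's mixed strategy
pi2 = np.array([0.7, 0.3])  # player 2's mixed strategy
print(expected_payoff(R1, pi1, pi2), expected_payoff(R2, pi1, pi2))
```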

Page 15: Approach – Multiagent learning using a variable learning rate

Page 16:

Evaluation criteria for multi-agent learning

Use of convergence to a NE as the criterion is problematic:
Terminating criterion: equilibrium identifies conditions under which learning can or should stop
It is easier to play in equilibrium than to continue computation
A Nash equilibrium strategy has no "prescriptive force": it says nothing about play prior to termination
There are multiple potential equilibria
The opponent may not wish to play an equilibrium
Calculating a Nash equilibrium can be intractable for large games

New criteria: rationality and convergence in self-play
Converge to a stationary policy: not necessarily Nash
Learning terminates only once a best response to the play of the other agents is found
In self-play, learning terminates only at a stationary NE

Page 17:

Contributions and Assumptions

Contributions:
A criterion for multi-agent learning algorithms
A simple Q-learning algorithm that can play mixed strategies
The WoLF PHC (Win or Learn Fast Policy Hill Climber)

Assumptions – both properties are obtained given that:
The game is two-player, two-action
Players can observe each other's mixed strategies (not just the played actions)
Players can use infinitesimally small step sizes

Page 18:

Opponent Modeling or Joint-Action Learners

C. Claus, C. Boutilier, 1998

Page 19:

Joint-Action Learners Method

Maintains an explicit model of the opponents for each state.
Q-values are maintained for all possible joint actions at a given state.
The key assumption is that the opponent is stationary.
Thus, the model of the opponent is simply the frequencies of the actions it has played in the past.
Probability of the opponent playing action a−i:

Pr(a−i) = C(a−i) / n(s)

where C(a−i) is the number of times the opponent has played action a−i and n(s) is the number of times state s has been visited.

Page 20:

Opponent modeling FP-Q learning algorithm
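The algorithm listing on this slide is an image that did not survive extraction. As a rough sketch of the idea from the previous slide: the learner keeps Q-values over joint actions, an empirical frequency model of the opponent, and picks the action with the best expected Q-value against that model. The class and method names below are hypothetical, not the authors' code:

```python
from collections import defaultdict

class FictitiousPlayQ:
    """Joint-action learner: Q over joint actions plus a frequency model of the opponent."""

    def __init__(self, actions, beta=0.1, gamma=0.9):
        self.actions = actions
        self.beta, self.gamma = beta, gamma
        self.Q = defaultdict(float)     # (s, a_i, a_opp) -> estimated value
        self.counts = defaultdict(int)  # (s, a_opp) -> times opponent played a_opp in s
        self.visits = defaultdict(int)  # s -> times s has been visited

    def opponent_prob(self, s, a_opp):
        # Empirical frequency C(a_opp) / n(s); uniform before the first visit
        if self.visits[s] == 0:
            return 1.0 / len(self.actions)
        return self.counts[(s, a_opp)] / self.visits[s]

    def value(self, s, a_i):
        # Expected Q-value of a_i against the modeled (assumed stationary) opponent
        return sum(self.opponent_prob(s, a_o) * self.Q[(s, a_i, a_o)]
                   for a_o in self.actions)

    def best_action(self, s):
        return max(self.actions, key=lambda a: self.value(s, a))

    def update(self, s, a_i, a_opp, r, s_next):
        self.visits[s] += 1
        self.counts[(s, a_opp)] += 1
        target = r + self.gamma * max(self.value(s_next, a) for a in self.actions)
        self.Q[(s, a_i, a_opp)] += self.beta * (target - self.Q[(s, a_i, a_opp)])
```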

Page 21:

WoLF Principles

The idea is to use two different strategy update steps: one for winning and one for losing situations.
"Win or Learn Fast": the agent reduces its learning rate when performing well and increases it when doing badly. This improves the convergence of IGA and policy hill-climbing.
To distinguish between these situations, the player keeps track of two policies.
The player is considered to be winning if the expected utility of its current policy is greater than the expected utility of the equilibrium (or average) policy.
If winning, the agent chooses the smaller of the two strategy update steps.
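In symbols, with π the current policy, π̄ the average (or equilibrium) policy, and δ_w < δ_l the two step sizes, the learning rate for each update is chosen by the winning test above:

```latex
\delta =
\begin{cases}
\delta_w & \text{if } \sum_{a} \pi(s,a)\,Q(s,a) > \sum_{a} \bar{\pi}(s,a)\,Q(s,a) \quad \text{(winning)} \\
\delta_l & \text{otherwise (losing)}
\end{cases}
```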

Page 22:

Incremental Gradient Ascent Learners (IGA)

IGA:
incrementally climbs in the mixed-strategy space
for 2-player, 2-action general-sum games
guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium

WoLF IGA:
based on the WoLF principle
guaranteed to converge to a Nash equilibrium in all 2-player, 2-action general-sum games
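A minimal sketch of one WoLF-IGA step for the row player of a 2x2 game, following the description above: climb the gradient of the expected payoff V(alpha, beta) in the player's own action probability, learning slowly when winning relative to the equilibrium payoff and fast when losing. The function name, step sizes, and projection are illustrative:

```python
def wolf_iga_step(alpha, beta_opp, R, V_eq, delta_w=0.01, delta_l=0.04):
    """One WoLF-IGA update of alpha, the row player's probability of action 0.

    R is the row player's 2x2 payoff matrix, beta_opp the opponent's probability
    of its action 0, and V_eq the row player's payoff at the Nash equilibrium,
    used here as the reference point of the win/lose test.
    """
    r11, r12, r21, r22 = R[0][0], R[0][1], R[1][0], R[1][1]
    # Expected payoff of the current joint mixed strategy (alpha, beta_opp)
    V = (alpha * beta_opp * r11 + alpha * (1 - beta_opp) * r12
         + (1 - alpha) * beta_opp * r21 + (1 - alpha) * (1 - beta_opp) * r22)
    # Closed-form gradient of V with respect to the player's own probability alpha
    grad = beta_opp * (r11 - r12 - r21 + r22) + (r12 - r22)
    # WoLF: small step while winning, large step while losing
    delta = delta_w if V > V_eq else delta_l
    # Gradient ascent, projected back into the valid probability interval [0, 1]
    return min(1.0, max(0.0, alpha + delta * grad))
```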

Page 23:

Information passing in the PHC algorithm

Page 24:

Simple Q-Learner that plays mixed strategies

Updates the mixed strategy by giving more weight to the action that Q-learning believes is the best

Properties:
guarantees rationality against stationary opponents
but does not converge in self-play

Page 25:

WoLF Policy Hill Climbing algorithm

Maintains an average policy alongside the current policy

Determination of "Win" and "Lose": compare the expected value of the current policy to that of the average policy

The agent only needs to see its own payoff
Converges in self-play for two-player, two-action stochastic games
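A condensed sketch of the per-state WoLF-PHC policy update, assuming the Q-values for this state are learned by the Q-learning update shown earlier; the final projection onto the probability simplex is simplified relative to the paper's constrained update, and the names and defaults are illustrative:

```python
import numpy as np

def wolf_phc_update(pi, pi_avg, Q, n_visits, delta_w=0.01, delta_l=0.04):
    """One WoLF-PHC policy update for a single state; arrays are indexed by action."""
    n_actions = len(pi)
    # Update the running average policy over all visits to this state
    n_visits += 1
    pi_avg = pi_avg + (pi - pi_avg) / n_visits
    # Win/lose test: expected value of the current policy vs. the average policy
    delta = delta_w if pi @ Q > pi_avg @ Q else delta_l
    # Hill climbing: shift probability mass toward the greedy action
    best = int(np.argmax(Q))
    pi = pi.copy()
    for a in range(n_actions):
        pi[a] += delta if a == best else -delta / (n_actions - 1)
    # Keep pi a valid distribution (simplified projection: clip and renormalize)
    pi = np.clip(pi, 0.0, 1.0)
    pi /= pi.sum()
    return pi, pi_avg, n_visits
```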

Page 26: Theoretical analysis – Analysis of the replicator dynamics

Page 27:

Replicator Dynamics – Simplified Case

Best-response dynamics for Paper-Rock-Scissors: play shifts circularly from one agent's policy to the other's, with the average reward cycling rather than converging
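A small simulation of this cycling, using replicator dynamics for both players in rock-paper-scissors; the step size, iteration count, and initial strategies are arbitrary, and this illustrates the dynamics rather than reproducing the paper's experiment:

```python
import numpy as np

# Rock-paper-scissors payoff to the row player (antisymmetric, zero-sum)
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def replicator_step(x, y, eta=0.01):
    """Euler step of the replicator dynamics: x_i += eta * x_i * ((A y)_i - x . A y)."""
    payoff = A @ y
    x = x + eta * x * (payoff - x @ payoff)
    return x / x.sum()  # guard against numerical drift off the simplex

x = np.array([0.5, 0.3, 0.2])  # row player's mixed strategy
y = np.array([0.2, 0.5, 0.3])  # column player's mixed strategy (-A^T = A here)
for _ in range(5000):
    x, y = replicator_step(x, y), replicator_step(y, x)  # simultaneous updates
print(x, y)  # the strategies orbit around (1/3, 1/3, 1/3) instead of converging
```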

Page 28:

A winning strategy against PHC

If winning:
play the currently preferred action with probability 1, in order to maximize rewards while winning
If losing:
play a deceiving policy until ready to take advantage of the opponent again

[Plot: probability we play heads (x-axis, 0 to 1) vs. probability the opponent plays heads (y-axis, 0 to 1), with the mixed strategy at (0.5, 0.5) marked]

Page 29:

Ideally we’d like to see this:

[Plot: the learner's trajectory, with winning and losing phases marked]

Page 30:

Ideally we’d like to see this:

[Plot: the learner's trajectory, with winning and losing phases marked]

Page 31:

Convergence dynamics of strategies

Iterated Gradient Ascent:
performs a myopic adaptation to the other players' current strategies
either converges to a Nash fixed point on the boundary (at least one pure strategy) or enters limit cycles
WoLF varies the learning rate so as to be optimal while satisfying both properties

Page 32: Results

Page 33:

Experimental testbeds

Matrix games:
Matching pennies

Three-player matching pennies

Rock-paper-scissors

Gridworld

Soccer

Page 34:

Matching pennies
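The payoff matrix on this slide is an image that did not survive extraction; for reference, the standard matching pennies bimatrix (row player's payoff first) is below: player 1 wins if the two pennies match, player 2 wins if they differ, and the unique Nash equilibrium is for both players to randomize 1/2-1/2.

              Heads       Tails
Heads       ( 1, -1)    (-1,  1)
Tails       (-1,  1)    ( 1, -1)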

Page 35:

Rock-paper-scissors: PHC

Page 36:

Rock-paper-scissors: WoLF PHC

Page 37:

Summary and Conclusion

Criterion for multi-agent learning algorithms: rationality and convergence

A simple Q-learning algorithm that can play mixed strategies

The WoLF PHC (Win or Learn Fast Policy Hill Climber), which satisfies rationality and convergence

Page 38:

Disadvantages

Analysis is limited to two-player, two-action games; pseudoconvergence is possible

Avoidance of exploitation: no guarantee that the learner cannot be deceptively exploited by another agent
Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy.

Page 39:

Pseudoconvergence

Page 40:

Future Work by Authors

Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) "learner"

Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF

Online learning

Other algorithms by the authors: GIGA-WoLF, for normal-form games

Page 41:

Discussion / Open Questions

Investigating other evaluation criteria:
No-regret criteria
Negative non-convergence regret (NNR)
Fast reaction (tracking) [Jensen]
Performance: maximum time to reach a desired performance level

Incorporating more algorithms into testing: deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm 2003])
Classifying situations (games) by the values of the delta and alpha parameters: which values are good in which situations
Extending the work to more players
Online learning and the exploration policy in stochastic games (trade-off)
The formalism is currently presented in a two-dimensional state space: is an extension of the formal model (geometrical?) possible?
What makes Minimax-Q irrational?
Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
Connection with control theory and the learning of Complex Adaptive Systems: manifold-adaptive learning?

Page 42: Questions / Thank you