convergent learning in unknown graphical games

Convergent Learning in Unknown Graphical Games

Dr Archie Chapman, Dr David Leslie,Dr Alex Rogers and Prof Nick Jennings

School of Mathematics, University of Bristol and School of Electronics and Computer Science

University of Southampton

david.leslie@bristol.ac.uk

Playing games?

)0,0()1,1()1,1()1,1()0,0()1,1()1,1()1,1()0,0(

PaperScissors

RockPaperScissorsRock

Playing games?

Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment.

Berkeley Engineering

Learning in games• Adapt to observations of past play• Hope to converge to something “good”• Why?!

– Bounded rationality justification of equilibrium– Robust to behaviour of “opponents”– Language to describe distributed optimisation

Notation• Players• Discrete action sets• Joint action set • Reward functions

• Mixed strategies• Joint mixed strategy space• Reward functions extend to

},...,1{ NiiA

RAri :NAAA 1

)( iii AN 1

Best response / Equilibrium• Mixed strategies of all players other than i is

• Best response of player i is

• An equilibrium is a satisfying, for all i,

),(argmax)( iiiii rbii

)( iii b

Fictitious playEstimate strategies of other players

Game matrix

Select best action given estimates

Update estimates

Belief updates• Belief about strategy of player i is the MLE

• Online updating

1111 )( ttt

• Processes of the form

where and• F is set-valued (convex and u.s.c.)• Limit points are chain-recurrent sets of the

differential inclusion

Stochastic approximation

1111 )( tttttt eMXFXX

0)|( 1 tt XME 0te

Best-response dynamics• Fictitious play has M and e identically 0, and

• Limit points are limit points of the best-response differential inclusion

• In potential games (and zero-sum games and some others) the limit points must be Nash equilibria

Generalised weakened fictitious play

• Consider any process such that

where andand also an interplay between and M.

• The convergence properties do not change

tttttt Mbt

111 )(

,0t t0t

Fictitious playEstimate strategies of other players

Game matrix

Update estimates

?)(?,?)(?,?)(?,?)(?,?)(?,?)(?,?)(?,?)(?,?)(?,

PaperScissors

RockPaperScissorsRock

Learning the game

ti earR )(

Reinforcement learning• Track the average reward for each joint

action• Play each joint action frequently enough• Estimates will be close to the expected value

• Estimated game converges to the true game

Q-learned fictitious playEstimate strategies of other players

Game matrix

Estimated game matrix

Update estimates

Theoretical result

Theorem – If all joint actions are played infinitely often then beliefs follow a GWFP

Proof: The estimated game converges to the true game, so selected strategies are -best responses.

Playing games?

Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment.

Berkeley Engineering

It’s impossible!• N players, each with A actions• Game matrix has AN entries to learn

• Each individual must estimate the strategy of every other individual

• It’s just not possible for realistic game scenarios

Marginal contributions• Marginal contribution of player i is

total system reward – system reward if i absent

• Maximised marginal contributions implies system is at a (local) optimum

• Marginal contribution might depend only on the actions of a small number of neighbours

Sensors – rewards• Global reward for action a is

• Marginal reward for i is

• Actually use

jEIEaU )(

eventsobserved is

nobservatio and events

ananiggi

jjEaUaUarby

observed

events)()()(

ananti

by observed

Local learningEstimate strategies of other players

Game matrix

Estimated game matrix

Update estimates

Local learningEstimate strategies of neighbours

Game matrix

Estimated game matrixfor local interactions

Update estimates

Theoretical result

Theorem – If all joint actions of local games are played infinitely often then beliefs follow a GWFP

Proof: The estimated game converges to the true game, so selected strategies are -best responses.

Sensing results

So what?!• Play converges to (local) optimum with only

noisy information and local communication

• An individual always chooses an action to maximise expected reward given information

• If an individual doesn’t “play cricket”, the other individuals will reach an optimal point conditional on the behaviour of the itinerant

Summary• Learning the game while playing is essential

• This can be accommodated within the GWFP framework

• Exploiting the neighbourhood structure of marginal contributions is essential for feasibility

convergent learning in unknown graphical games

zerosum games

strategy of player i

team gamesplaying games

best responses

ismarginal reward

average reward

selected strategies

joint actions

Documents

convergent learning in unknown graphical games dr archie...

mechanisms of convergent egg provisioning in poison...

convergent interviewing

convergent plate boundaries finz 2011. what are the 3 types...

a superlinearly-convergent proximal newton...

convergent charging_config guide

convergent reflection sensing ensures stable detection ·...

convergent charging_install guide

graphical user interface - cisco€¦ · graphical user...

divergent/ convergent

probabilistic graphical models principles and techniques -...

convergent boundaries

convergent boundaries. this is a diagram of a convergent...

edr management, edr billing and convergent invoicing -...

convergent newsroom

convergent evidence for abnormal striatal synaptic...

convergent plate boundaries

convergent technology - basics

convergent close-coupling approach to atomic and molecular...

convergent billing in