Multi-agent learning: Repeated games
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.
Wednesday 30th April, 2014
Repeated games: motivation
1. Much interaction in MASs can be modelled through games.
2. Much learning in MASs can therefore be modelled through learning in games.
3. Learning in games usually takes place through the (gradual) adaptation of strategies (hence, behaviour) in a repeated game.
4. In most repeated games, one game (a.k.a. stage game) is played repeatedly. Possibilities:
■ A finite number of times.
■ An indefinite (same: indeterminate) number of times.
■ An infinite number of times.
5. Therefore, familiarity with the basic concepts and results from the theory of repeated games is essential to understand MAL.
Plan for today
■ NE in normal form games that are repeated a finite number of times.
● Principle of backward induction.
■ NE in normal form games that are repeated an indefinite number of times.
● Discount factor. Models the probability of continuation.
● Folk theorem. (Actually many FTs.) Repeated games generally do have infinitely many Nash equilibria.
● Trigger strategy, on-path vs. off-path play, the threat to "minmax" an opponent.

This presentation draws heavily on (Peters, 2008).

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.
Recap:
Stage games
The expected payoff of a stage game
Prisoners' Dilemma:

                        Other:
                        Cooperate    Defect
You:   Cooperate        (3, 3)       (0, 5)
       Defect           (5, 0)       (1, 1)
■ Suppose the following strategy profile for one game:
● Row player (you) plays with mixed strategy 0.8 on C.
● Column player (other) plays with mixed strategy 0.7 on C.
■ Your expected payoff is 0.8(0.7 · 3 + 0.3 · 0) + 0.2(0.7 · 5 + 0.3 · 1) = 2.44
■ General formula (cf., e.g., Leyton-Brown et al., 2008):

Expected payoff_{i,t}(s) = ∑_{(a_1, ..., a_n) ∈ A^n} ( ∏_{k=1}^{n} s_k(a_k) ) · payoff_i(a_1, ..., a_n)
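To make this concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides; the NumPy representation and variable names are my own) that reproduces the 2.44 above for the two-player case:

import numpy as np

# Stage-game payoffs of the Prisoners' Dilemma above; index 0 = Cooperate, 1 = Defect.
PAYOFF_ROW = np.array([[3, 0],
                       [5, 1]])   # row player's payoffs
PAYOFF_COL = np.array([[3, 5],
                       [0, 1]])   # column player's payoffs

def expected_stage_payoff(payoff, row_mix, col_mix):
    """Two-player instance of the general formula: sum over all action
    profiles of (product of the players' probabilities) times the payoff."""
    return row_mix @ payoff @ col_mix

row_mix = np.array([0.8, 0.2])    # row player: probability 0.8 on C
col_mix = np.array([0.7, 0.3])    # column player: probability 0.7 on C

print(expected_stage_payoff(PAYOFF_ROW, row_mix, col_mix))  # 2.44
print(expected_stage_payoff(PAYOFF_COL, row_mix, col_mix))  # 2.94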
Expected payoffs of two players in the stage PD
[Figure: expected-payoff surface. Player 1 may only move "back–front"; Player 2 may only move "left–right".]
Part I:
Nash equilibria
in normal form games
that are repeated
a finite number of times
Nash equilibria in playing the PD twice
Prisoners' Dilemma:

                        Other:
                        Cooperate    Defect
You:   Cooperate        (3, 3)       (0, 5)
       Defect           (5, 0)       (1, 1)

■ Even if mixed strategies are allowed, the PD possesses one Nash equilibrium, viz. (D, D) with payoffs (1, 1).
■ This equilibrium is Pareto sub-optimal. (Because (3, 3) makes both players better off.)
■ Does the situation change if two parties get to play the Prisoners' Dilemma two times in succession?
■ The following diagram (hopefully) shows that playing the PD two times in succession does not yield an essentially new NE.
Nash equilibria in playing the PD twice
[Game tree of the twice-played PD, with cumulative payoffs: after a first-round outcome CC (3, 3), CD (0, 5), DC (5, 0) or DD (1, 1), each second-round outcome adds its stage payoffs; the sixteen leaves are exactly the entries of the normal form on the next slide.]
Nash equilibria in playing the PD twice
In normal form:
                 Other:
                 CC         CD         DC         DD
You:   CC      (6, 6)     (3, 8)     (3, 8)     (0, 10)
       CD      (8, 3)     (4, 4)     (5, 5)     (1, 6)
       DC      (8, 3)     (5, 5)     (4, 4)     (1, 6)
       DD      (10, 0)    (6, 1)     (6, 1)     (2, 2)

■ The action profile (DD, DD) is the only Nash equilibrium.
■ With 3 successive games, we obtain a 2³ × 2³ matrix, where the action profile (DDD, DDD) still would be the only Nash equilibrium.
■ Generalise to N repetitions: (D^N, D^N) still is the only Nash equilibrium in a repeated game where the PD is played N times in succession.
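As a check (a sketch added here, not from the slides), the table and its unique pure Nash equilibrium can be computed by treating each strategy as a fixed action sequence, as the slide does; all names below are my own:

from itertools import product

# Stage-game payoffs; actions are 0 = C, 1 = D.
STAGE = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def repeated_table(n):
    """Payoff table when both players commit to fixed action sequences of
    length n (the 2^n x 2^n normal form from the slide)."""
    seqs = list(product((0, 1), repeat=n))
    return seqs, {(r, c): tuple(sum(STAGE[(a, b)][i] for a, b in zip(r, c))
                                for i in (0, 1))
                  for r, c in product(seqs, repeat=2)}

def pure_nash(seqs, table):
    """Profiles from which no player gains by a unilateral deviation."""
    return [(r, c) for r, c in product(seqs, repeat=2)
            if all(table[(alt, c)][0] <= table[(r, c)][0] for alt in seqs)
            and all(table[(r, alt)][1] <= table[(r, c)][1] for alt in seqs)]

seqs, table = repeated_table(2)
print(pure_nash(seqs, table))     # [((1, 1), (1, 1))]  -- i.e. (DD, DD)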
Backward induction (version for repeated games)
■ Suppose G is a game in normal form for p players, where all players possess the same arsenal of possible actions A = {a_1, . . . , a_m}.
■ The game G^n arises by playing the stage game G n times in succession.
■ A history h of length k is an element of (A^p)^k, e.g., for p = 3 and k = 10,

a_7 a_5 a_3 a_6 a_1 a_9 a_2 a_7 a_7 a_3
a_6 a_9 a_2 a_4 a_2 a_9 a_9 a_1 a_1 a_4
a_1 a_2 a_7 a_9 a_6 a_1 a_1 a_8 a_2 a_4

is a history of length ten in a game with three players. The set of all possible histories is denoted by H. (Hence, |H_k| = m^{kp}.)
■ A (possibly mixed) strategy for one player is a function H → Pr(A).
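A possible concrete reading of these definitions (a sketch under representation choices of my own: histories as Python lists of action-profile tuples, distributions as dicts):

import random

# A history is a list of action profiles; here the actions are the strings "C"/"D".
def trigger(history):
    """A strategy as a map from histories to a distribution over actions
    (this is the trigger strategy T that reappears in Part III of these slides)."""
    if any("D" in profile for profile in history):
        return {"C": 0.0, "D": 1.0}
    return {"C": 1.0, "D": 0.0}

def sample(dist):
    """Draw one action from a distribution over actions."""
    actions = list(dist)
    return random.choices(actions, weights=[dist[a] for a in actions])[0]

h = [("C", "C"), ("C", "D")]        # a history of length 2 for two players
print(sample(trigger(h)))           # "D": a defection has already occurred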
Backward induction (version for repeated games)
■ For some repeated games of length n, the dominant (read: "clearly best") strategy for all players in round n (the last round) does not depend on the history of play. E.g., for the Prisoners' Dilemma in the last round:

"No matter what happened in rounds 1 . . . n − 1, I am better off playing D."

■ Fixed strategies (D, D) in round n determine play after round n − 1.
■ Independence of history, plus a determined future, leads to the following justification for playing D in round n − 1:

"No matter what happened in rounds 1 . . . n − 2 (the past), and given that I will receive a payoff of 1 in round n (the future), I am better off playing D now."

■ By induction, in round k where k ≥ 1: "No matter what happened in rounds 1 . . . k − 1, and given that I will receive a payoff of (n − k) · 1 in rounds (k + 1) . . . n, I am better off playing D in round k."
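The argument can be mechanised. A small Python sketch of it (an added illustration using the PD payoffs above; the helper names are my own), which exploits exactly the observation on this slide that the continuation value is a constant, so every round reduces to the stage game where D strictly dominates:

# Actions and stage-game payoffs of the Prisoners' Dilemma.
STAGE = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
         ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def dominant_action(player):
    """Return the stage-game action that strictly dominates the other one."""
    def payoff(mine, theirs):
        profile = (mine, theirs) if player == 0 else (theirs, mine)
        return STAGE[profile][player]
    for a in "CD":
        other = "D" if a == "C" else "C"
        if all(payoff(a, b) > payoff(other, b) for b in "CD"):
            return a
    raise ValueError("no strictly dominant action")

def backward_induction(n):
    """Solve the n-fold repetition from the last round backwards (n >= 1).
    Because the continuation value does not depend on the current round's
    actions, every round's solution is the pair of dominant actions."""
    round_play = (dominant_action(0), dominant_action(1))
    value = (0, 0)                      # continuation value after round n
    for _ in range(n):                  # rounds n, n-1, ..., 1
        stage = STAGE[round_play]
        value = (stage[0] + value[0], stage[1] + value[1])
    return round_play, value

print(backward_induction(3))            # (('D', 'D'), (3, 3))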
Part II:
Nash equilibria
in normal form games
that are repeated
an indefinite number of times
Indefinite number of repetitions
■ A Pareto-suboptimal outcome can be avoided in case the following three conditions are met.
1. The Prisoners' Dilemma is repeated an indefinite number of times (rounds).
2. A so-called discount factor δ ∈ [0, 1] determines the probability of continuing the game after each round.
3. The probability to continue, δ, must be large enough.
■ Under these conditions suddenly infinitely many Nash equilibria exist. This is sometimes called an embarrassment of richness (Peters, 2008).
■ Various Folk theorems state the existence of multiple equilibria in infinitely repeated games.
■ We now informally discuss one version of "the" Folk Theorem.
Example: Prisoners’ Dilemma repeated indefinitely
■ Consider the game G∗(δ). This is the PD game, G, played an indefinite number of times.
■ The number of times the stage game G is played is determined by a parameter 0 ≤ δ ≤ 1. The probability that the next stage (and the stages thereafter) will be played is δ.
Therefore, the probability that stage game G_t will be played is δ^t. (What if t = 0?)
■ The PD (of which every G_t is an incarnation) is called the stage game, as opposed to the overall game G∗(δ).
■ A history h of length t of a repeated game is a sequence of action profiles of length t.
■ A realisation h is a countably infinite sequence of action profiles.
Example: Prisoners’ Dilemma repeated indefinitely
■ Example of a history of length t = 10:

Row player:     C D D D C C D D D D
Column player:  C D D D D D D C D D
Round:          0 1 2 3 4 5 6 7 8 9

■ The set of all possible histories (of any length) is denoted by H.
■ A (mixed) strategy for Player i is a function s_i : H → ∆{C, D} such that

Pr( Player i plays C in round |h| + 1 | h ) = s_i(h)(C).

■ A strategy profile s is a combination of strategies, one for each player.
■ The expected payoff for player i given s can be computed. It is

Expected payoff_i(s) = ∑_{t=0}^{∞} δ^t · Expected payoff_{i,t}(s).
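This expected payoff can also be estimated by simulation: stage t is reached with probability δ^t, so the expectation of the realised (undiscounted) total equals the discounted sum above. A rough Python sketch (added here as an illustration; the strategy names are meant to match D∗ and T from the following slides):

import random

STAGE = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
         ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_defect(history):            # D*: defect no matter what
    return "D"

def trigger(history):                  # T: cooperate until a D is observed
    return "D" if any("D" in p for p in history) else "C"

def one_realisation(s_row, s_col, delta, rng):
    """Play G*(delta) once: stage 0 is always played, and after every stage
    the game continues with probability delta. Returns the realised totals."""
    history, total = [], [0.0, 0.0]
    while True:
        profile = (s_row(history), s_col(history))
        total[0] += STAGE[profile][0]
        total[1] += STAGE[profile][1]
        history.append(profile)
        if rng.random() > delta:       # stop: the next stage is not played
            return total

rng = random.Random(0)
runs = [one_realisation(trigger, trigger, 0.9, rng) for _ in range(10_000)]
print(sum(r[0] for r in runs) / len(runs))   # close to 3/(1 - 0.9) = 30
runs = [one_realisation(always_defect, always_defect, 0.9, rng) for _ in range(10_000)]
print(sum(r[0] for r in runs) / len(runs))   # close to 1/(1 - 0.9) = 10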
Subgame perfect equilibria of G∗(δ): D∗
Definition. A subgame perfect Nash equilibrium of an extensive (in this case: repeated) game is a Nash equilibrium of this extensive game whose restriction to every subgame (read: tailgame) also is a Nash equilibrium of that subgame.

Consider the strategy of iterated defection D∗: "always defect, no matter what".¹

Claim. The strategy profile (D∗, D∗) is a subgame perfect equilibrium in G∗(δ).

Proof. Consider any tailgame starting at round t. We are done if we can show that (D∗, D∗) is a NE for this subgame. This is true: given that one player always defects, it never pays off for the other player to play C at any time. Hence, everyone plays D∗.

¹ A notation like D∗ or (worse) D∞ is suggestive. Mathematically it makes no sense, but intuitively it does.
Part III:
Trigger strategies
Example: Cost of deviating in Round 4
Consider the so-called trigger strategy T: "always play C unless D has been played at least once. In that case play D forever".

Claim. The strategy profile (T, T) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large.

Proof. Consider a typical play:
Row player:     C C C C C C D D D D . . .
Column player:  C C C C C D D D D D . . .
Round:          0 1 2 3 4 5 6 7 8 9 . . .
Column player defects after Round 4. By doing so he expects a payoff of
∑_{t=0}^{4} δ^t · 3 + δ^5 · 5 + ∑_{t=6}^{∞} δ^t · 1.
Example: Cost of deviating in Round 4
By cooperating throughout, the column player could have earned

∑_{t=0}^{∞} δ^t · 3

which means he forfeited

∑_{t=0}^{∞} δ^t · 3 − ( ∑_{t=0}^{4} δ^t · 3 + δ^5 · 5 + ∑_{t=6}^{∞} δ^t · 1 ) = −2δ^5 + 2 ∑_{t=6}^{∞} δ^t

by deviating from T. If δ ≠ 0,

−2δ^5 + 2 ∑_{t=6}^{∞} δ^t > 0
  ⇔ (dividing by 2δ^5)  ∑_{t=1}^{∞} δ^t − 1 > 0
  ⇔  ∑_{t=0}^{∞} δ^t − 2 > 0
  ⇔  1/(1 − δ) − 2 > 0
  ⇔  δ > 1/2.
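A quick numerical check of this computation (my own illustration; the infinite sums are truncated at a large horizon):

def forfeit(delta, horizon=5_000):
    """Payoff forfeited by the deviation analysed above: cooperate in rounds
    0..4, first D in round 5, mutual defection from round 6 on."""
    cooperate = sum(delta**t * 3 for t in range(horizon))
    deviate = (sum(delta**t * 3 for t in range(5))
               + delta**5 * 5
               + sum(delta**t * 1 for t in range(6, horizon)))
    return cooperate - deviate

for delta in (0.3, 0.5, 0.7):
    print(delta, round(forfeit(delta), 4))
# 0.3 -0.0028   deviating pays when delta < 1/2
# 0.5  0.0      indifferent at delta = 1/2
# 0.7  0.4482   deviating costs payoff when delta > 1/2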
Cost of deviating in Round N
The player starts to defect at Round N. By doing so he expects a payoff of

∑_{t=0}^{N−1} δ^t · 3 + δ^N · 5 + ∑_{t=N+1}^{∞} δ^t · 1

By playing C throughout, the column player could have earned

∑_{t=0}^{∞} δ^t · 3

which means he forfeited

δ^N · (3 − 5) + ∑_{t=N+1}^{∞} δ^t · (3 − 1) = −2δ^N + 2 ∑_{t=N+1}^{∞} δ^t

by deviating in round N (and onwards).
Cost of deviating in Round N
If δ ≠ 0, the player forfeits payoff by deviating precisely when

−2δ^N + 2 ∑_{t=N+1}^{∞} δ^t > 0.

This holds exactly when δ > 1/2:

−2δ^N + 2 ∑_{t=N+1}^{∞} δ^t > 0
  ⇔ (dividing by 2δ^N)  ∑_{t=N+1}^{∞} δ^{t−N} − 1 > 0
  ⇔  ∑_{t=0}^{∞} δ^{t+1} − 1 > 0
  ⇔  δ · ∑_{t=0}^{∞} δ^t − 1 > 0
  ⇔  δ / (1 − δ) − 1 > 0
  ⇔  δ > 1/2.

Thus, if δ > 1/2 every player forfeits payoff by deviating from T.
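For completeness (a small addition, not on the original slide), the same threshold can be read off in one step from the closed form of the geometric series:

−2δ^N + 2 ∑_{t=N+1}^{∞} δ^t = −2δ^N + 2δ^{N+1} / (1 − δ) = 2δ^N (2δ − 1) / (1 − δ),

which is positive exactly when δ > 1/2 (for 0 < δ < 1), independently of N.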
Example: an alternating trigger strategy
Author: Gerard Vreeswijk. Slides last modified on April 30th , 2014 at 22:58 Multi-agent learning: Repeated games
Yet another subgame perfect equilibrium:
Informal definition of strategies. A and B tacitly agree to alternate actions, i.e. (C, D), (D, C), (C, D), . . . . If one of them deviates, the other party plays D forever. (The party who originally deviated plays D forever thereafter as well.) Notice the CKR aspect!
Let A be the strategy that plays C in Round 1. Let B be the other strategy.
Claim. The strategy profile (A, B) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large.
An analysis of this situation and a proof of this claim can be found in (Peters, 2008), pp. 104–105.*
*H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4.
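A minimal sketch of how the claim can be checked numerically, assuming the Prisoner's Dilemma payoffs (3, 3), (0, 5), (5, 0), (1, 1) used in these slides, writing profiles as (A's action, B's action), and truncating the infinite sums at a large horizon. The function names are illustrative, and the exact threshold on δ is the one derived in Peters (2008); the sketch only compares staying loyal with one plausible deviation.

```python
# A's discounted payoff if both players follow the alternating path
# (C, D), (D, C), (C, D), ...: A's stage payoffs alternate 0, 5, 0, 5, ...
def loyal_value(delta: float, horizon: int = 10_000) -> float:
    return sum((0, 5)[t % 2] * delta ** t for t in range(horizon))

# A's discounted payoff if A deviates in a round where A should play C:
# A grabs the (D, D) payoff 1 now and is answered with D forever after.
def deviation_value(delta: float, horizon: int = 10_000) -> float:
    return sum(1 * delta ** t for t in range(horizon))

for delta in (0.2, 0.3, 0.6, 0.9):
    print(delta, loyal_value(delta) >= deviation_value(delta))
# Loyalty pays once delta is large enough, as the claim asserts.
```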
Generalisation of trigger strategies
■ Every convex combination² of payoffs

α1(3, 3) + α2(0, 5) + α3(5, 0) + α4(1, 1)

can be established by smartly picking appropriate strategy patterns. E.g.: “We play 4 times (C, C). Then we play 7 times (C, D), (D, C), then 4 times (C, C), and so on.”

■ Ensure that, in the long run, (C, C) occurs in a fraction α1 of the stages, (C, D) in a fraction α2, (D, C) in α3, and (D, D) in α4 (a small sketch of this bookkeeping follows below).

■ As long as these limiting average payoffs exceed the payoff of (D, D) for each player (which is 1), associated trigger strategies can be formulated that lead to these payoffs and trigger eternal play of (D, D) after a deviation.

■ For δ high enough, these strategies again form an SGP NE.

² Meaning αi ≥ 0 and α1 + α2 + α3 + α4 = 1.
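A minimal sketch of the bookkeeping referred to above, assuming the Prisoner's Dilemma payoffs of these slides (the dictionary and function names are illustrative): the limiting average payoff of an indefinitely repeated pattern is simply its per-period average, and the pattern is enforceable by trigger strategies when each player's average exceeds 1.

```python
# Stage-game payoffs: key = (player 1's action, player 2's action).
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def limiting_average(pattern):
    """Average payoff pair of a pattern of action profiles repeated forever."""
    n = len(pattern)
    return (sum(PAYOFF[p][0] for p in pattern) / n,
            sum(PAYOFF[p][1] for p in pattern) / n)

# One period of "4 times (C,C), then 7 times (C,D),(D,C)":
pattern = [('C', 'C')] * 4 + [('C', 'D'), ('D', 'C')] * 7
avg = limiting_average(pattern)
print(avg, 'enforceable:', all(v > 1 for v in avg))   # (2.61..., 2.61...) True
```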
Folk theorem for SGP NE in the repeated PD
[Figure: both players' payoffs plotted on axes running from 0 to 5; the four pure payoff pairs (1, 1), (0, 5), (5, 0) and (3, 3) are marked, with the feasible region striped and the enforceable region shaded.]
1. Feasible payoffs (striped): payoff combinations that can be obtained by jointly repeating patterns of actions (more accurately: patterns of action profiles).

2. Enforceable payoffs (shaded): every player resides above his minmax value.

For every payoff pair (x, y) in (1) ∩ (2), there is a δ(x, y) ∈ (0, 1) such that for all δ ≥ δ(x, y) the payoff (x, y) can be obtained as the limiting average in a subgame perfect equilibrium of G∗(δ).
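A minimal sketch of how membership in (1) ∩ (2) can be tested for this stage game. Feasibility is a small linear-feasibility problem (do convex weights over the four pure payoff pairs hit (x, y)?), and enforceability just compares each coordinate with the minmax value 1. The sketch assumes numpy and scipy are available; the names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

PURE = np.array([(3, 3), (0, 5), (5, 0), (1, 1)], dtype=float)  # pure payoff pairs
MINMAX = (1.0, 1.0)                                             # minmax values in the PD

def feasible_and_enforceable(x, y):
    # Feasible: exist alpha >= 0 with sum(alpha) = 1 and alpha @ PURE = (x, y).
    A_eq = np.vstack([PURE.T, np.ones(4)])
    b_eq = np.array([x, y, 1.0])
    res = linprog(c=np.zeros(4), A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 4)
    return res.success, (x > MINMAX[0] and y > MINMAX[1])

print(feasible_and_enforceable(2.0, 2.0))   # (True, True): covered by the folk theorem
print(feasible_and_enforceable(4.0, 4.0))   # (False, True): not feasible
```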
Family of Folk Theorems
There actually exist many Folk Theorems.

■ Horizon. May the game be repeated infinitely (as in our case), or is there an upper bound to the number of plays?

■ Information. Do players act on the basis of CKR (present case), or are certain parts of the history hidden?

■ Reward. Do players collect their payoff through a discount factor (present case) or through average rewards?

■ Equilibrium. Do we consider Nash equilibria, or other forms of equilibria, such as so-called ε-Nash equilibria or so-called correlated equilibria?

■ Subgame perfection. Do we consider subgame perfect equilibria (present case) or just Nash equilibria?
Part IV: non-SGP Nash equilibria
Existence of non-SGP Nash equilibria
■ We have seen that many subgame perfect equilibria exist for repeated games. (At least for repeated games where both players have all information, the horizon is infinite, and further assumptions hold.)

■ What about the existence of non-SGP Nash equilibria in repeated games, i.e., equilibria that are not necessarily subgame perfect?

■ Without the requirement of subgame perfection, deviations can be punished more severely: the equilibrium does not have to induce a Nash equilibrium in the punishment subgame.

■ We will now consider the consequences of relaxing the subgame perfection requirement for a Nash equilibrium in an infinitely repeated game.
Example: A repeated game with a non-SGP NE
Some game             Other: Left (L)   Right (R)
You:   Up (U)                (1, 1)     (0, 0)
       Down (D)              (0, 0)     (−1, 4)
1. The pure profile (U, L) is the only NE, even among mixed strategy profiles.

2. Define trigger strategies (T1, T2) such that the pattern [(D, R), (U, L)³]∗ is played indefinitely. (So we have periods of length 4.)

3. If this pattern is violated:

■ The fallback strategy of the row player (you) is the mixed strategy (0.8, 0.2)∗.

■ The fallback strategy of the column player is the pure strategy R∗.

This combination of fallbacks is not a NE. (Cf. 1.)
Example: a repeated game with a non-SGP NE
Claim. The combination of trigger strategies (T1, T2) is a Nash equilibrium for some δ ∈ (0, 1).

■ T1 ⇒ T2. If you play (the non-degenerate part of) T1, then the column player must play T2, for T2 is a best response to T1.

■ T2 ⇒ T1. If at all, the best moment for you to deviate is at D, for that would give you an incidental advantage of 1. After that your opponent falls back to R∗.

Total payoff for you: 0 (for cheating) + 0 + · · · + 0 (for being punished by your opponent).
Payoff for the row player if he had stayed loyal:

$$\begin{aligned}
(-1 + 1\cdot\delta + 1\cdot\delta^{2} + 1\cdot\delta^{3}) + (-1\cdot\delta^{4} + 1\cdot\delta^{5} + 1\cdot\delta^{6} + 1\cdot\delta^{7}) + \ldots
&= \sum_{k=0}^{\infty}\delta^{k} - 2\sum_{k=0}^{\infty}\delta^{4k}\\
&= \frac{1}{1-\delta} - \frac{2}{1-\delta^{4}}.
\end{aligned}$$

This expression is positive only if δ ≳ 0.54 (the root of a third-degree equation; a numerical check follows below).
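A small numerical check of this threshold (a minimal sketch; the function name is illustrative). It evaluates the closed form above and locates the root of the equivalent cubic δ³ + δ² + δ − 1 = 0 by bisection:

```python
def loyal_payoff(delta: float) -> float:
    """Discounted value of the loyal path: 1/(1-d) - 2/(1-d^4)."""
    return 1 / (1 - delta) - 2 / (1 - delta ** 4)

# The condition loyal_payoff(d) > 0 is equivalent to d^3 + d^2 + d - 1 > 0.
lo, hi = 0.0, 1.0
for _ in range(60):                      # bisection on (0, 1)
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if mid ** 3 + mid ** 2 + mid - 1 > 0 else (mid, hi)

print(round(hi, 4))                      # ~ 0.5437
print(loyal_payoff(0.54) > 0)            # False: cheating still pays
print(loyal_payoff(0.55) > 0)            # True: loyalty pays
```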
Retaliation in repeated games: playing the minmax
In the previous game, your “Plan B” was to play a mixed strategy (0.8, 0.2).

Questions:

■ Why may a mixed strategy (0.8, 0.2) be considered a punishment?

■ Is the mixed strategy (0.8, 0.2) the most severe punishment?

■ If so, why?
Retaliation in repeated games: playing minmax
                Other:  0       1       2
You:   0               (4, 9)  (7, 9)  (5, 8)
       1               (6, 7)  (8, 7)  (4, 8)
       2               (7, 9)  (5, 7)  (6, 7)
First for pure strategies.

■ Which of your actions (you are the row player) minimises the maximum payoff of your opponent?

■ This is the pure minmax: minimise the maximum payoff. It can be found by scanning the opponent's payoffs (the second entry of each cell) row by row.

It turns out that Action 1 keeps the payoff of your opponent below 9.

■ Similarly, if your opponent wishes to punish you, he scans your payoffs (the first entry of each cell) column by column to minimise your payoff ⇒ Action 2. (A sketch of both scans follows below.)

■ Alert 1: minmax may be ≠ maxmin (= security level strategy).

■ Alert 2: mixed minmax may be < pure minmax.
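A minimal sketch of the two scans just mentioned (the variable names are illustrative; each cell holds (your payoff, his payoff)):

```python
# Bimatrix of the example above: GAME[i][j] = (your payoff, his payoff).
GAME = [[(4, 9), (7, 9), (5, 8)],
        [(6, 7), (8, 7), (4, 8)],
        [(7, 9), (5, 7), (6, 7)]]

# Your pure minmax action: scan rows, minimise the opponent's maximum payoff.
his_max = [max(cell[1] for cell in row) for row in GAME]
print(his_max, '-> your punishment action:', his_max.index(min(his_max)))     # [9, 8, 9] -> 1

# His pure minmax action: scan columns, minimise your maximum payoff.
your_max = [max(GAME[i][j][0] for i in range(3)) for j in range(3)]
print(your_max, '-> his punishment action:', your_max.index(min(your_max)))   # [7, 8, 6] -> 2
```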
Minmax: payoff surface of the opponent
[Figure: the opponent's payoff surface as a function of your mix and his mix; your minmax mix pushes the range of his attainable payoffs down to its minimum.]
The minmax value as a threat
■ Your (possibly mixed) minmax strategy can be used as a threat to deter your opponent from deviating from the tacitly agreed path.

Cf. the threat:

If you do not comply with our normal pattern of actions, I am going to minmax you.

■ By actually executing this threat you might harm yourself as well ⇒ non-SGP.

■ For finite two-person strictly competitive games (such as zero-sum games), minmax = maxmin.

(Finite in the sense that the arsenal of playable actions, hence the payoff matrix, is finite.)
Example: A repeated game with a non-SGP NE
            Other:  L        R
You:   U           (1, 1)   (0, 0)
       D           (0, 0)   (−1, 4)
■ Your opponent can punish you maximally by playing R∗. How you can punish your opponent is less obvious.

■ If you play D∗, your opponent will earn 4∗. If you play U∗, your opponent will earn 1∗.

■ It is possible to punish your opponent even more by becoming unpredictable (within CKR!!).

Given your mixed strategy (u, d), your opponent maximises his payoff by choosing the right mix (l, r):

$$\max_{l,r}\;\bigl(ul\cdot 1 + dr\cdot 4\bigr)
= \max_{l}\;\bigl(ul + 4(1-u)(1-l)\bigr)
= \max_{l}\;\bigl((5u-4)\,l + 4 - 4u\bigr).$$

■ If 5u − 4 = 0, it does not matter what your opponent chooses for l: his expected payoff is always 4 − 4·(4/5) = 4/5.
Draw a picture of the payoff surface of the opponent.

[Figure: the opponent's expected payoff (5u − 4)l + 4 − 4u plotted over your mix u ∈ [0, 1] and his mix l ∈ [0, 1]; the corner values are 1, 0, 0 and 4, and the minimum of his best-response payoff, 4/5, is attained at u = 4/5.]

■ If 5u − 4 > 0, your opponent will play l = 1 and expects to earn (5u − 4) + 4 − 4u = u, which is > 4/5.

■ If 5u − 4 < 0, your opponent will play l = 0 and expects to earn 4 − 4u, which, again, is > 4/5.

Hence u = 4/5, i.e. the mix (0.8, 0.2), minimises your opponent's best-response payoff: your mixed minmax strategy holds him down to 4/5. These calculations are done by hand, and do not easily generalise to higher dimensions (a small numerical check follows below).
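A small numerical check of this hand calculation (a minimal sketch; the names are illustrative). It evaluates the opponent's best-response payoff on a grid of your mixes u and reports the minimising u:

```python
# Opponent's best-response payoff against your mix (u, 1-u); since his payoff is
# linear in l, checking the extreme points l = 0 and l = 1 suffices.
def best_response_payoff(u: float) -> float:
    return max((5 * u - 4) * l + 4 - 4 * u for l in (0.0, 1.0))

grid = [i / 1000 for i in range(1001)]
u_star = min(grid, key=best_response_payoff)
print(u_star, best_response_payoff(u_star))   # 0.8 and 0.8: the mix (0.8, 0.2), value 4/5
```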
Literature
Literature on game theory is vast. For me, the following sources were helpful and (no less important) offered different perspectives on repeated games.

* H. Gintis (2009): Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Interaction. Second Edition. Princeton University Press, Princeton. Ch. 9: Repeated Games.

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated Games.

* K. Leyton-Brown & Y. Shoham (2008): Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan & Claypool Publishers. Ch. 6: Repeated and Stochastic Games.

* S.P. Hargreaves Heap & Y. Varoufakis (2004): Game Theory: A Critical Text. Routledge. Ch. 5: The Prisoner's Dilemma - the riddle of co-operation and its implications for collective agency.

* J. Ratliff (1997): Graduate-Level Course in Game Theory (a.k.a. "Jim Ratliff's Graduate-Level Course in Game Theory"). Lecture notes, Dept. of Economics, University of Arizona. (Available through the web but not officially published.) Sec. 5.3: A Folk Theorem Sampler.

* M.J. Osborne & A. Rubinstein (1994): A Course in Game Theory. MIT Press. Ch. 8: Repeated Games.
What next?
Now that we know that infinitely many equilibria exist in repeated games (an embarrassment of riches), there are a number of ways in which we may proceed.

■ Gradient Dynamics. Approximate NE of single-shot games (stage games) through gradient ascent (hill-climbing).

■ Reinforcement Learning. Agents simply execute the action(s) with maximal rewards in the past.

■ No-regret learning. Agents execute the action(s) with maximal virtual rewards in the past.

■ Fictitious Play. Sample the actions of the opponent(s) and play a best response.