
The Iterated Prisoners’ Dilemma

20 Years On

ADVANCES IN NATURAL COMPUTATION

Series Editor: Xin Yao (University of Birmingham, UK)

Assoc. Editors: Hans-Paul Schwefel (University of Dortmund, Germany)
                Byoung-Tak Zhang (Seoul National University, South Korea)
                Martyn Amos (University of Liverpool, UK)

Published

Vol. 1: Applications of Multi-Objective Evolutionary Algorithms
Eds: Carlos A. Coello Coello (CINVESTAV-IPN, Mexico) and Gary B. Lamont (Air Force Institute of Technology, USA)

Vol. 2: Recent Advances in Simulated Evolution and Learning
Eds: Kay Chen Tan (National University of Singapore, Singapore), Meng Hiot Lim (Nanyang Technological University, Singapore), Xin Yao (University of Birmingham, UK) and Lipo Wang (Nanyang Technological University, Singapore)

Vol. 3: Recent Advances in Artificial Life
Eds: H. A. Abbass (University of New South Wales, Australia), T. Bossomaier (Charles Sturt University, Australia) and J. Wiles (The University of Queensland, Australia)

Vol. 4: The Iterated Prisoners’ Dilemma
Eds: Graham Kendall (The University of Nottingham, UK) and Xin Yao (The University of Birmingham, UK)

New Jersey • London • Singapore • Beijing • Shanghai • Hong Kong • Taipei • Chennai

World Scientific

Advances in Natural Computation — Vol. 4

Graham Kendall

Xin Yao

Siang Yew Chong

The Iterated Prisoners’ Dilemma

20 Years On

The University of Nottingham, UK

The University of Birmingham, UK

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN-13 978-981-270-697-3
ISBN-10 981-270-697-6

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd.

Published by

World Scientific Publishing Co. Pte. Ltd.

5 Toh Tuck Link, Singapore 596224

USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601

UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Printed in Singapore.

Advances in Natural Computation — Vol. 4
THE ITERATED PRISONERS’ DILEMMA
20 Years On

Contents

List of Contributors vii

Chapter 1 The Iterated Prisoner’s Dilemma: 20 Years On 1

Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei

Li and Xin Yao

Chapter 2 Iterated Prisoner’s Dilemma and Evolutionary Game Theory 23

Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei

Li and Xin Yao

Chapter 3 Learning IPD Strategies Through Co-evolution 63

Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei

Li and Xin Yao

Chapter 4 How to Design a Strategy to Win an IPD Tournament 89

Jiawei Li

Chapter 5 An Immune Adaptive Agent for the Iterated Prisoner’s

Dilemma 105

Oscar Alonso and Fernando Nino

Chapter 6 Exponential Smoothed Tit-for-Tat 127

Michael Filzmoser

Chapter 7 Opponent Modelling, Evolution, and The Iterated

Prisoner’s Dilemma 139

Philip Hingston, Dan Dyer, Luigi Barone, Tim French

and Graham Kendall

Chapter 8 On Some Winning Strategies for the Iterated Prisoner’s

Dilemma 171

Wolfgang Slany and Wolfgang Kienreich

Chapter 9 Error-Correcting Codes for Team Coordination within a

Noisy Iterated Prisoner’s Dilemma Tournament 205

Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn,

Perukrishnen Vytelingum and Nicholas R. Jennings

Chapter 10 Is it Accidental or Intentional? A Symbolic Approach to

the Noisy Iterated Prisoner’s Dilemma 231

Tsz-Chiu Au and Dana Nau

List of Contributors

Oscar Alonso,

Computer Systems and Industrial Engineering Department,

National University of Colombia, Bogota

Colombia

Email: [email protected]

Tsz-Chiu Au,

Department of Computer Science and Institute for Systems Research,

University of Maryland,

College Park, MD 20742

USA

Email: [email protected]

Luigi Barone,

Department of Computer Science and Software Engineering,

The University of Western Australia,

35 Stirling Highway,

Crawley, WA, 6009

Australia

Email: [email protected]

Siang Yew Chong,

School of Computer Science,

University of Birmingham,

Birmingham, B15 2TT

UK

Email: [email protected]

Rajdeep K. Dash,

Electronics and Computer Science,

University of Southampton,

Southampton, SO17 1BJ

UK

Email: [email protected]

Dan Dyer,

Department of Computer Science and Software Engineering,

The University of Western Australia,

35 Stirling Highway,

Crawley, WA, 6009

Australia

Email: [email protected]

Michael Filzmoser,

School of Business Administration,

Economics, and Statistics,

University of Vienna,

Vienna, A-1210

Austria

Email: [email protected]

Tim French,

Department of Computer Science and Software Engineering,

The University of Western Australia,

35 Stirling Highway,

Crawley, WA, 6009

Australia

Email: [email protected]

Philip Hingston,

School of Computer and Information Science,

Edith Cowan University - Mt Lawley Campus,

2 Bradford Street,

Mt Lawley, WA 6050

Australia

Email: [email protected]

Jan Humble,

School of Computer Science and Information Technology,

University of Nottingham,

Nottingham, NG8 1BB

UK

Email: [email protected]

Nicholas R. Jennings,

Electronics and Computer Science,

University of Southampton,

Southampton, SO17 1BJ,

UK

Email: [email protected]

Graham Kendall,

School of Computer Science and Information Technology,

University of Nottingham,

Nottingham, NG8 1BB

UK

Email: [email protected]

Wolfgang Kienreich,

Know-Center, Inffeldgasse 21a/II,

8010 Graz

Austria

Email: [email protected]

Jiawei Li,

Robot Institute,

Harbin Institute of Technology,

Heilongjiang, 150001,

P. R. China

Email: lijiawei [email protected]

Dana Nau,

Department of Computer Science and Institute for Systems Research,

University of Maryland,

College Park, MD 20742

USA

Email: [email protected]

Fernando Nino,

Computer Systems and Industrial Engineering Department,

National University of Colombia, Bogota

Colombia

Email: [email protected]

Sarvapali D. Ramchurn,

Electronics and Computer Science,

University of Southampton,

Southampton, SO17 1BJ

UK

Email: [email protected]

Alex Rogers,

Electronics and Computer Science,

University of Southampton,

Southampton, SO17 1BJ

UK

Email: [email protected]

Wolfgang Slany,

Institut fur Softwaretechnologie,

Inffeldgasse 16b/II,

TU Graz, A-8010 Graz

Austria

Email: [email protected]

Perukrishnen Vytelingum,

Electronics and Computer Science,

University of Southampton,

Southampton, SO17 1BJ

UK

Email: [email protected]

Xin Yao,

School of Computer Science,

University of Birmingham,

Birmingham, B15 2TT

UK

Email: [email protected]

Chapter 1

The Iterated Prisoner’s Dilemma: 20 Years On

Siang Yew Chong1, Jan Humble2, Graham Kendall2, Jiawei Li2,3,

Xin Yao1

University of Birmingham1, University of Nottingham2, Harbin Institute

of Technology3

1.1. Introduction

In 1984, Robert Axelrod reported the results of two iterated prisoner’s

dilemma (IPD) competitions [Axelrod (1984)]. The book was to be a catalyst for much of the research in this area since that time. It is unlikely that you would write a scientific paper about the IPD without citing Axelrod's 1984 book. The book is all the more remarkable in that it is accessible to a general audience as well as being an important source of inspiration for the scientific community.

In 2001, whilst attending the Congress on Evolutionary Computation

(CEC) conference, we were discussing some of the presentations we had

seen, which reported some of the latest work on the iterated prisoner’s

dilemma. We were paying tribute to the fact that Axelrod’s book had stood

the test of time when somebody made a casual comment suggesting that we

should re-run the competition in 2004, to celebrate the 20th anniversary.

And, so, this book was born.

Of course, since the conversation in Hawaii and the publication of this

book, there have been a lot of people doing a lot of work, not least Robert Axelrod, who was good enough to give up his time to present a plenary talk at the CEC conference in 2004. At that talk he presented his latest work, investigating evolution in a grid-based world.

We owe a debt of thanks to the UK’s EPSRC (Engineering and Physical

Sciences Research Council), the largest of the UK research councils. When we returned from Hawaii, we

submitted a proposal,a which requested a small amount of funds (£23,718)

in order to re-run, and extend, the competitions that Axelrod had run

20 years earlier. The funds we received from EPSRC allowed us to run two

competitions, one in 2004 and one in 2005. The entrants to the competi-

tions were invited to submit a chapter for consideration in this book. These

chapters underwent a peer review process (see later in this chapter for an

acknowledgement of the reviewers) and those chapters that were successful

form the latter part of this book.

As editors, we feel fortunate to have several winning, second- and third-place entries reported in this book. This affords the reader the opportunity to learn, first hand from the authors, what made these strategies so

successful and, perhaps, use some of the ideas and innovations in their own

strategies for future competitions.

1.2. Iterated Prisoner’s Dilemma

Almost every chapter in this book has its own description of the iterated

prisoner’s dilemma. As each chapter can be read in isolation and, for com-

pleteness, we present our own interpretation of the IPD here, along with a

short review of some of the important work in the area.

The prisoner’s dilemma (PD) and iterated prisoner’s dilemma (IPD) have been a rich source of research material since the 1950s. However, the publication of Axelrod’s book [Axelrod (1984)] in the 1980s was largely responsible for bringing this research to the attention of other areas, outside

of game theory, including evolutionary computing, evolutionary biology,

networked computer systems and promoting cooperation between opposing

countries [Goldstein (1991); Fogel (1993); Axelrod and D’Ambrosio (1995)].

Despite the large literature base that now exists (see, for example, [Poundstone (1992); Boyd and Lorberbaum (1987); Maynard Smith (1982); Davis (1997); Axelrod (1997)]), this is an on-going area of research, with Darwen

and Yao [Darwen and Yao (1995, 2001); Yao and Darwen (1999)] carrying

out some recent work. Their 2001 work [Darwen and Yao (2001)] extends

the prisoner’s dilemma by offering more choices, other than simply “coop-

erate” or “defect,” and by providing indirect interactions (reputation).

When you play the prisoner’s dilemma you have to decide whether to

cooperate with an opponent, or defect. Both you and your opponent make a

aThe EPSRC grant reference numbers are GR/S63465/01 and GR/S63472/01.

choice and then your decisions are revealed. You receive a payoff according

to the following matrix (in each cell, the first value is the payoff to the column player and the second is the payoff to the row player).

                   Cooperate        Defect
    Cooperate      R = 3, R = 3     T = 5, S = 0
    Defect         S = 0, T = 5     P = 1, P = 1

• R is a Reward for mutual cooperation. Therefore, if both players coop-

erate then both receive a reward of 3 points.

• If one player defects and the other cooperates then one player receives

the Temptation to defect payoff (5 in this case) and the other player (the

cooperator) receives the Sucker payoff (zero in this case).

• If both players defect then they both receive the Punishment for mutual

defection payoff (1 in this case).

The question arises: what should you do in such a game?

• Suppose you think the other player will cooperate. If you cooperate

then you will receive a payoff of 3 for mutual cooperation. If you defect

then you will receive a payoff of 5 for the Temptation to Defect payoff.

Therefore, if you think the other player will cooperate then you should

defect, to give you a payoff of 5.

• But what if you think the other player will defect? If you cooperate,

then you get the Sucker payoff of zero. If you defect then you would both

receive the Punishment for Mutual Defection of 1 point. Therefore, if

you think the other player will defect, you should defect as well.

So, you should defect, no matter what option your opponent chooses.

Of course, the same logic holds for your opponent. And, if you both de-

fect you receive a payoff of 1 each, whereas the better outcome would have been mutual cooperation with a payoff of 3. The payoff for an individual is less than that which could have been achieved by two cooperating players; thus the dilemma, and the research challenge of finding strategies that promote mutual cooperation.
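To see the dilemma in concrete terms, the reasoning above can be checked mechanically. The short Java sketch below is purely illustrative (the class and method names are our own, not part of any competition software); it encodes the payoff values used above and prints each player's options, confirming that defection earns more than cooperation whatever the opponent does, even though mutual defection is worse for both players than mutual cooperation.

    public class OneShotDilemma {
        // payoff[myMove][opponentMove] is the payoff to "me";
        // move 0 = cooperate, move 1 = defect (values from the matrix above).
        static final int[][] PAYOFF = {
            { 3, 0 },   // I cooperate: R = 3 against C, S = 0 against D
            { 5, 1 }    // I defect:    T = 5 against C, P = 1 against D
        };

        public static void main(String[] args) {
            for (int opp = 0; opp < 2; opp++) {
                System.out.println("Opponent plays " + (opp == 0 ? "C" : "D")
                    + ": cooperate earns " + PAYOFF[0][opp]
                    + ", defect earns " + PAYOFF[1][opp]);
            }
            // Prints 5 > 3 against a cooperator and 1 > 0 against a defector,
            // so defection dominates, yet mutual cooperation (3 each)
            // beats mutual defection (1 each).
        }
    }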

In defining a prisoner’s dilemma, certain conditions have to hold. The

values we used above, to demonstrate the game, are not the only values

that could have been used, but they do have to adhere to the conditions

listed below.

Firstly, the order of the payoffs is important. The best a player can

do is T (temptation to defect). The worst a player can do is to get the

sucker payoff, S. If the two players cooperate then the reward for that

mutual cooperation, R, should be better than the punishment for mutual

defection, P . Therefore, the following must hold.

T > R > P > S . (1.1)

Secondly, players should not be allowed to get out of the dilemma by

taking it in turns to exploit each other. Or, to be a little more precise, the

players should not play the game so that they end up with half the time

being exploited and the other half of the time exploiting their opponent.

In other words, an even chance of being exploited or doing the exploiting is

not as good an outcome as both players mutually cooperating. Therefore,

the reward for mutual cooperation should be greater than the average of

the payoff for the temptation and the sucker. That is, the following must

hold.

R > (S + T )/2 . (1.2)
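For a quick check, conditions (1.1) and (1.2) can be tested directly for any candidate payoff values. The following Java sketch (the class and method names are our own, for illustration only) returns true only when a (T, R, P, S) assignment defines a valid prisoner's dilemma.

    public class PayoffCheck {
        // True only if T > R > P > S (condition 1.1) and R > (S + T)/2 (condition 1.2).
        static boolean isValidPrisonersDilemma(double t, double r, double p, double s) {
            return (t > r && r > p && p > s) && (r > (s + t) / 2.0);
        }

        public static void main(String[] args) {
            System.out.println(isValidPrisonersDilemma(5, 3, 1, 0));   // true: the values used above
            System.out.println(isValidPrisonersDilemma(10, 3, 1, 0));  // false: 3 < (0 + 10)/2, so
                                                                       // alternating exploitation would pay
        }
    }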

Playing a “one-shot” prisoner’s dilemma, it is not difficult to decide

which strategy to adopt, but the question arises: can cooperation evolve

from playing the game over and over again, against the same opponent?

If you know how many times you are to play, then there is an argu-

ment that the game is exactly the same as playing the “one-shot” prisoners

dilemma. This is based on the observation that you will defect on the last

iteration, as that is the sensible thing to do: you are, in effect, playing a

single iteration. Knowing this, it is sensible to defect on the second to last

one as well; and this logic can be applied all the way to the first iteration.

However, this reasoning cannot be used when the number of iterations

is infinite, as you know there is always another iteration. In practice, this

translates to not knowing when the game will end.

Experiments, using human players [Scodel (1962, 1963); Minas et al.

(1960); Scodel and Philburn (1959), Scodel et al. (1959); Scodel et al.

(1960)] showed that they generally did not cooperate, even when it should have been obvious that the other person was going to cooperate provided they did too. It has been a long-term aim to find strategies which cause players to cooperate. If players would only cooperate then their payoff, over an indefinite number of games, could be maximised, rather than tending towards defection and hoping the other player would cooperate. In 1979 Axelrod

organised a prisoner’s dilemma competition and invited game theorists to

submit their strategies [Axelrod (1980a)]. Fourteen entries were received

with an extra one being added (defect or cooperate with equal probabil-

ity). Each strategy competed against every other strategy, including itself. The

winner was Anatol Rapoport who submitted the simple strategy (Tit-for-

Tat) which cooperates on the first move, then does whatever your opponent

did on the previous move. In a second tournament [Axelrod (1980b)], 62

entries were received but, again, the winner was Tit-for-Tat. These two

competitions formed the basis of his important book [Axelrod (1984)].
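Tit-for-Tat itself takes only a few lines to express. The sketch below is a generic illustration of the rule as just described (cooperate on the first move, then copy the opponent's previous move); it is our own rendering, not Rapoport's original entry nor code from the competitions.

    public class TitForTat {
        private char opponentLastMove = 'C';   // treat the opponent as cooperative before play starts

        // Next move of this strategy: 'C' to cooperate, 'D' to defect.
        public char nextMove() {
            return opponentLastMove;           // cooperate first, then copy the opponent
        }

        // Called after each round with the opponent's actual move.
        public void observeOpponent(char move) {
            opponentLastMove = move;
        }
    }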

The prisoners dilemma has a modern day version in the form of the

TV show “Shafted” - a game show recently screened on terrestrial TV in

the UK (note that this show is not a true prisoners dilemma as defined

by Rapoport [Rapoport (1996)], but does demonstrate that the ideas have

wider applicability). At the end of the show two contestants have accu-

mulated a sum of money and they have to decide whether to share the money

or to try and get all the money for themselves. Their decision is made

without the knowledge of what the other person has decided to do. If both

contestants cooperate then they share the money. If they both defect then

they both receive nothing. If one cooperates and the other defects, the one

that defected gets all the money and the contestant that cooperated gets

nothing.

Although the prisoner’s dilemma, in the context of game theory, has been an active research area for at least 50 years [Scodel (1962); Scodel (1963); Minas et al. (1960); Scodel and Philburn (1959); Scodel et al. (1959); Scodel et al. (1960)] (it can be traced back to von Neumann and Morgen-

stern [von Neumann and Morgenstern (1944)] and, of course, John Nash

[Nash (1950, 1953)]), it is still an active research area with, among other

research aims, researchers trying to evolve strategies [O’Riordan (2000)]

that promote cooperation.

Recent research has also considered the prisoner’s dilemma where there

are more than two choices and more than two players. Darwen and Yao have

shown that offering more choices leads to less cooperation [Darwen and Yao

(2001)], although reputation may help [Darwen and Yao (2002); Yao and

Darwen (1999)]. Birk [Birk (1999)] used a multi-player IPD. His model had continuous degrees of cooperation (as opposed to the binary cooperate or defect). He used a robotic environment and showed that a justified-snobism strategy, which tries to cooperate slightly more than the average,

is a successful strategy and is evolutionarily stable (that is, it cannot be

invaded by another strategy). O’Riordan and Bradish (2000) also simulated

a multi-player game where the players are involved in many types of games.

They show that cooperation can emerge in a high percentage of 2-player

games.

As well as the academic papers on the subject, there are many books

devoted to game theory and/or the prisoners dilemma. The 1997 book

by Axelrod (1997) reproduces a range of his papers (with commentary)

ranging from 1986 through to 1997. The papers consider areas such as

promoting cooperation using a genetic algorithm, coping with noise and

promoting norms.

1.3. Contents of the Book

This book does not have to be read from cover to cover. Each chapter can

be read independently, with most of the chapters describing the IPD. This

was a conscious decision by the editors as we realised that the book would

be dipped into and we did not want to make any chapter dependent on

any other. Also, each chapter has its own set of references, rather than

having one complete list of references at the end of the book. The book is

structured as follows.

Chapter 1

This chapter provides a general introduction to the book. In keeping with

the rest of the book, we also briefly describe the IPD, as well as briefly describing each chapter. This chapter also presents the results of the two

competitions that we ran in 2004 and 2005.

Chapter 2

Chapter 2 (“Iterated Prisoner’s Dilemma and Evolutionary Game Theory”)

reviews some of the important work in IPD, with particular emphasis (in

the latter part of the chapter) on evolutionary game theory. The chapter

contains over 250 references, which we hope will be a good starting point

for other researchers who are looking to start work in this area.

We have concentrated on the evolutionary aspects of IPD for two rea-

sons. Firstly, this seemed to be an area that was exploited in the entries

we received. Secondly, the literature on IPD is truly vast (perhaps only

exceeded by literature on the traveling salesman problem), and we had to

draw some boundaries and, given the close links that this competition had

with the Congress on Evolutionary Computation, it seemed appropriate to

report on the evolutionary aspects of IPD.

We apologise to any authors who feel their work should have been in-

cluded in this chapter. We hope you understand that we simply could not

list every paper. However, if you would like to drop us an EMAIL, we would

be happy to consider the inclusion of the reference in any later editions.

Chapter 3

Chapter 3 (“Learning IPD Strategies Through Co-evolution”) reviews an-

other area of IPD that has received scientific interest in recent years; that

of co-evolution. This chapter also discusses an extension to the classic IPD

formulation. That is when there are more than two players and when they

have more than two choices. Similar to chapter two, there is an extensive

list of references for the interested reader.

Chapter 4

This chapter reports the winning strategy from competition 4, from the

event held in 2005. This competition mimics the original ones held by

Axelrod. Only one entry was allowed per person, to stop the cooperating

strategies that had dominated the first competition. Although we believe

that having cooperating strategies is a valid tactic, some competitors felt

that this did not truly mimic the original competitions. For this reason we

introduced an additional competition for the 2005 event. The result was a

win for Jiawei Li, who details his winning strategy in chapter 4, which is

entitled How to Design a Strategy to Win an IPD Tournament.

Chapter 5

The strategy in this chapter attempts to model its opponent using an ar-

tificial immune system. It is interesting to see how relatively new method-

ologies are being used for problems such as IPD, demonstrating that there

is a continuous flow of new ideas which might just be shown to be superior

to all other methods so far. Whilst not appearing in the top ten of any of

the competitions that it entered, it does present an exciting new research

direction for IPD tournaments.

Chapter 6

Michael Filzmoser reports on a variation of tit-for-tat, which he calls Ex-

ponential Smoothed Tit-for-Tat. Whereas tit-for-tat only considers the last

move of the opponent, exponential smoothed tit-for-tat considers the com-

plete history of the opponent. This discussion is extended to IPD with

noise, as well as the more common IPD, where the actions by the player

are reliably reported.
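As a rough illustration of the idea only (the exact formulation is Filzmoser's and is given in chapter 6; the smoothing factor and threshold below are placeholder values of our own choosing), a strategy of this kind can keep an exponentially weighted average of the opponent's past moves and cooperate while that average remains sufficiently cooperative.

    public class ExponentiallySmoothedHistory {
        private double smoothed = 1.0;                 // start by assuming a cooperative opponent
        private static final double ALPHA = 0.2;       // placeholder smoothing factor
        private static final double THRESHOLD = 0.5;   // placeholder cooperation threshold

        // Update the weighted history with the opponent's last move (1 = cooperate, 0 = defect).
        public void observeOpponent(int cooperated) {
            smoothed = ALPHA * cooperated + (1 - ALPHA) * smoothed;
        }

        // Cooperate while the opponent's weighted history looks cooperative enough.
        public boolean cooperateNext() {
            return smoothed >= THRESHOLD;
        }
    }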

Chapter 7

In chapter 7 (“Opponent modelling, Evolution, and the Iterated Prisoner’s

Dilemma”), the authors explore the idea of modeling an opponent. The strategy does

this by playing tit-for-tat for the first 50 moves, whilst trying to model the

moves played by the opponent. After 50 moves, subsequent moves are then

based on the model that has been built.

It is interesting to compare this strategy (which came 3rd in competition

4 in 2005), with the strategy described in chapter 4, which also uses a type

of modeling but over a shorter time period. Perhaps this explains why it

was able to achieve better payoffs, as it was able to exploit opponents much

earlier in the game?

Chapter 8

The strategies reported in this chapter were entered in both the 2004 and

2005 events, and performed well in many of the competitions, winning

competition 1 in the 2005 event.

This chapter, more than any other, touches on the debate about coop-

erating strategies, which is why we introduced competition 4 in the 2005

event. If you followed the discussion at the time, many entrants (with some

justification) questioned if allowing multiple strategies from one person was

in the spirit of the original Axelrod competitions. Whilst we agreed with

this, so introduced a single entry rule in 2005, we also argue that these

competitions were about the research that was being carried out and some

of the chapters in this book report on those results. Of course, as the

authors of chapter 8 admit, there are still ways of flouting the rules by

submitting cooperating entries under different names. We hope that the

other entrants will accept this in the spirit of research under which this

was done. As the authors point out, the organisers failed to recognise that

cooperating strategies had been submitted, but, as they also say, this is a

theoretically difficult problem.

We would also like to take this opportunity to apologise to the authors of chapter 8 for omitting their OTFT strategy from some of the competitions. It is still

unclear to us why this happened.

Chapter 9

A team from Southampton, who took the first three places in competition

1 at the 2004 event, present chapter 9. Their chapter is an excellent

example of how strategies can cooperate. As strategies have no mechanism

to interact directly, the only way to recognise one of your collaborators is

to somehow communicate through the defect/cooperate choices that you

make.

Chapter 10

One of the competitions that we ran included noise, with some low prob-

ability. By noise, we mean that a defect or cooperate signal might be

misinterpreted. This final chapter by Tsz-Chiu Au and Dana Nau explores

this issue using a strategy they call Derived Belief Strategy. It attempts to

model the opponent and then judge whether its choice has been affected by noise. It performed very well in the competition, even when up against

strategies which were cooperating.

1.4. Celebrating the 20th Anniversary: The Competitions

We ran two events. The first was held during the Congress on Evolutionary

Computation Conference in 2004 (June 19-23, Portland, Oregon, USA) and

the next at the Computational Intelligence and Games Conference in 2005

(April 4-6, 2005, Essex UK). At the 2004 event we ran three competitions,

with an additional competition being held in 2005.

(1) The first competition aimed to emulate the original Axelrod competi-

tion. We received some enquiries about whether multiple entries were

allowed. As we had not stated this as a restriction, we allowed it (but

did state we had the right to limit the number, else running the com-

petition may become intractable). At the time, we did not realise the

controversy that this decision would cause, which is why we modified

the competitions in the 2005 event.

(2) The second competition had noise in it. Each decision had a 0.1 prob-

ability of being mis-interpreted.

(3) The third competition allowed competitors to submit a strategy to an

IPD that has more than two players and more than two choices, that is, multi-player and multi-choice.

(4) The fourth competition (which was only run in 2005) emulated the

original Axelrod competition. The definition was exactly the same as

competition 1, but we only allowed one entry per person.

The payoff table we used for competitions 1, 2 and 4 is shown in ta-

ble 1.1. The payoff table for competition 3 is shown in table 1.2.

Table 1.1. Payoff table for all IPD competitions

except for the IPD with multiple players and mul-

tiple choices.

                   Cooperate        Defect
    Cooperate      R = 3, R = 3     T = 5, S = 0
    Defect         S = 0, T = 5     P = 1, P = 1

(in each cell the first value is the payoff to the column player and the second is the payoff to the row player)

Table 1.2. Payoff table for the IPD competition with multiple players and multiple choices. Rows give Player A's level of cooperation, columns give Player B's level of cooperation; each entry is the payoff to Player A.

    Player A \ Player B    1        3/4      1/2      1/4      0
    1                      4        3        2        1        0
    3/4                    4 1/4    3 1/4    2 1/4    1 1/4    1/4
    1/2                    4 1/2    3 1/2    2 1/2    1 1/2    1/2
    1/4                    4 3/4    3 3/4    2 3/4    1 3/4    3/4
    0                      5        4        3        2        1

To support the competitions, we developed a software framework. This

is discussed in the Appendix, and a URL is supplied so that the software

can be downloaded.

1.5. Competition Results

In the following tables we present the top ten entries from each of

the competitions. The full listings of the results can be seen at

http://www.prisoners-dilemma.com. Also available on the web site is a

log containing all the interactions that took place.

Table 1.3. Results from 2004 event, competition 1. There were 223 entries (19

web based entries, 195 java based entries and 9 standard entries (RAND, NEG,

ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank  Player                  Strategy                              Won  Drawn  Lost  Total Points
1     Gopal Ramchurn          StarSN (StarSN)                       105  21     98    117,057
2     Gopal Ramchurn          StarS (StarS)                         113  48     63    110,611
3     Gopal Ramchurn          StarSL (StarSL)                       115  46     63    110,511
4     GRIM (GRIM Trigger) 1   GRIM (GRIM Trigger)                   120  76     28    100,611
5     Wolfgang Kienreich      OTFT (Omega tit for tat)              90   70     64    100,604
6     Wolfgang Kienreich      ADEPT (ADEPT Strategy)                95   72     57    96,291
7     Emp 1                   EMP (Emperor)                         90   73     61    95,927
8     Bingzhong Wang          ()                                    31   94     99    94,161
9     Hannes Payer            PRobbary (PRobbary Historylength 2)   95   75     54    94,123
10    Nanlin Jin              HCO (HCO)                             27   95     102   93,953

Table 1.4. Results from 2004 event, competition 2. There were 223 entries (19 web

based entries, 195 java based entries and 9 standard entries (RAND, NEG, ALLC,

ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank  Player                  Strategy                                 Won  Drawn  Lost  Total Points
1     Gopal Ramchurn          StarSN (StarSN)                          42   2      180   93,962
2     Colm O’Riordan          Mem1 (Mem1)                              5    1      218   83,049
3     Gopal Ramchurn          CoordinateCDCSIAN (CoordinateCDCSIAN)    158  6      60    83,015
4     Gopal Ramchurn          PoorD (PoorD)                            190  7      27    82,890
5     Wolfgang Kienreich      OTFT (Omega tit for tat)                 158  8      58    82,838
6     Wayne Davis             ltft (ltft)                              66   8      150   82,765
7     GRIM (GRIM Trigger) 1   GRIM (GRIM Trigger)                      184  7      33    82,591
8     Gopal Ramchurn          MooD (MooD)                              193  3      28    82,578
9     Gopal Ramchurn          AITFT (AITFT)                            60   9      155   82,504
10    Gopal Ramchurn          GSTFT (GSTFT)                            64   9      151   82,502

Table 1.5. Results from 2004 event, competition 3. There were 15 entries.

Note that there is only one round in this competition.

Rank  Player             Strategy                                                                        Total Points
1     Gopal Ramchurn     AgentSoton (SOTON AGENT)                                                        3,756
2     Gopal Ramchurn     HarshTFT (HarshTFT)                                                             3,756
3     Deirdre Murrihy    PCurvepower1Memory2 (Penalty Curve of 1 using opponent’s previous 2 moves)      3,738
4     Deirdre Murrihy    PCurvepower2Memory2 (Penalty Curve of 2 using opponent’s previous 2 moves)      3,738
5     Deirdre Murrihy    PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent’s previous 2 moves)  3,738
6     Enda Howley        PCurvepower2 (Penalty Curve of 2 using opponent’s previous move)                3,738
7     Enda Howley        PCurvepower1 (Penalty Curve of 1 using opponent’s previous move)                3,738
8     Enda Howley        PCurvepower0.5 (Penalty Curve of 0.5 using opponent’s previous move)            3,738
9     Wolfgang Kienreich CNHM (CosaNostra Hitman)                                                        3,738
10    Wolfgang Kienreich CNHM (CosaNostra Hitman)                                                        3,738

Table 1.6. Results from 2005 event, competition 1. There were 192 entries (41 web

based entries, 142 java based entries and 9 standard entries (RAND, NEG, ALLC,

ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank  Player                    Strategy                                               Won  Drawn  Lost  Total Points
1     Wolfgang Kienreich        CNGF (CosaNostra Godfather)                            48   96     49    100,905
2     Jia-wei Li                IMM01 (Intelligent Machine Master 01)                  46   112    35    98,922
3     Carlos G. Tardon          CLAS- (CLAS-)                                          23   95     75    92,174
4     Perukrishnen Vytelingum   SWIN (Soton Agent RA - Competition 1)                  61   44     88    90,918
5     Constantin Ionescu        LORD (the lord strategy)                               20   102    71    87,617
6     GRIM (GRIM Trigger) 1     GRIM (GRIM Trigger)                                    73   114    6     84,805
7     Tsz-Chiu Au               LSF (Learning of opponent strategy with forgiveness)   28   94     71    84,698
8     Tsz-Chiu Au               DBStft (DBS with TFT)                                  23   97     73    83,867
9     Richard Brunauer          PRobberyL2 (PRobberyL2)                                14   98     81    83,837
10    Carlos G. Tardon          CLAS2 (CLAS2)                                          72   96     25    83,746

Table 1.7. Results from 2005 event, competition 2. There were 165 entries (26 web

based entries, 130 java based entries and 9 standard entries (RAND, NEG, ALLC,

ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank  Player                    Strategy                                        Won  Drawn  Lost  Total Points
1     Perukrishnen Vytelingum   BWIN (S2Agent1 ZEUS - Competition 2)            85   1      80    73,330
2     Jia-wei Li                IMM01 (Intelligent Machine Master 01)           108  7      51    70,506
3     Tsz-Chiu Au               DBSy (DBS (version y))                          35   3      128   68,370
4     Tsz-Chiu Au               DBSz (DBS (version z))                          27   3      136   68,339
5     Tsz-Chiu Au               DBSpl (DBS with learning prevention)            37   2      127   67,979
6     Tsz-Chiu Au               DBSd (Derivative Belief Strategy (version d))   42   6      118   67,392
7     Tsz-Chiu Au               DBSx (DBS (version x))                          19   9      138   66,719
8     Tsz-Chiu Au               TFTIc (TFT improved (ver. c))                   41   4      121   66,409
9     Tsz-Chiu Au               DBSf (Derivative Belief Strategy (version f))   48   2      116   66,269
10    Tsz-Chiu Au               TFTIm (TFT improved (ver. m))                   38   3      125   66,239

Table 1.8. Results from 2005 event, competition 3. There were 34 entries.

Note that there is only one round in this competition.

Rank  Player                    Strategy                                                                        Total Points
1     Perukrishnen Vytelingum   $AgentSoton ($SOTON AGENT)                                                      7,558
2     Deirdre Murrihy           PCurvepower1Memory2 (Penalty Curve of 1 using opponent’s previous 2 moves)      7,521
3     Deirdre Murrihy           PCurvepower2Memory2 (Penalty Curve of 2 using opponent’s previous 2 moves)      7,521
4     Deirdre Murrihy           PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent’s previous 2 moves)  7,521
5     Enda Howley               PCurvepower2 (Penalty Curve of 2 using opponent’s previous move)                7,521
6     Enda Howley               PCurvepower1 (Penalty Curve of 1 using opponent’s previous move)                7,521
7     Enda Howley               PCurvepower0.5 (Penalty Curve of 0.5 using opponent’s previous move)            7,521
8     Wolfgang Kienreich        CNHM (CosaNostra Hitman)                                                        7,521
9     Wolfgang Kienreich        CNHM (CosaNostra Hitman)                                                        7,521
10    Wolfgang Kienreich        CNHM (CosaNostra Hitman)                                                        7,521

Table 1.9. Results from 2005 event, competition 4. There were 50 entries (26 web

based entries, 15 java based entries and 9 standard entries (RAND, NEG, ALLC,

ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank  Player                  Strategy                               Won  Drawn  Lost  Total Points
1     Jia-wei Li              APavlov (Adaptive Pavlov)              11   34     6     30,096
2     Wolfgang Kienreich      OTFT (Omega tit for tat)               9    36     6     29,554
3     Philip Hingston         Mod (Modeller)                         7    36     8     29,003
4     Bruno Beaufils          GRAD (Gradual)                         8    32     11    28,707
5     Tim Romberg             tro1 (tro1)                            13   32     6     28,692
6     Richard Brunauer        DETerminatorL6C4 (DETerminatorL6C4)    12   32     7     28,523
7     Hannes Payer            DETerminatorL4C4 (DETerminatorL4C4)    11   33     7     28,292
8     Bennett McElwee         LOOKDB (LookaheadDB)                   22   11     18    28,110
9     Gerhard Mitterlechner   PRobberyM5C4 (PRobberyM5C4)            11   32     8     27,893
10    Wayne Davis             ltft (ltft)                            1    44     6     27,834

1.6. Acknowledgements

We would like to thank the following people who acted as reviewers for the

chapters in this book.

• Muhammad A. Ahmad

• Oscar Alonso

• Dan Ashlock

• Tsz-Chiu Au

• Carlos Eduardo Rodriguez Calderon

• Michel Charpentier

• Wayne Davis

• Jorg Denzinger

• Eugene Eberbach

• Michael Filzmoser

• Nelis Franken

• Nicholas Gessler

• Michal Glomba

• Philip Hingston

• Enda Howley

• Nick Jennings

• Nanlin Jin

• Jacint Jordana

• Wolfgang Kienreich

• Eun-Youn Kim

• Jia-wei Li

• Helmut A. Mayer

• Bennett McElwee

• Gerhard Mitterlechner

• Colm O’Riordan

• Sarvapali Ramchurn

• Alex Rogers

• Tim Romberg

• Darryl A. Seale

• Wolfgang Slany

• Elpida Tzafestas

• Perukrishnen Vytelingum

• Georgios N. Yannakakis

• Lukas Zebedin

Appendix: Software Framework

A software library and corresponding application were developed to make it easy to implement prisoner's dilemma strategies and tournament competitions between populations of them. Although a vast array of software is available for the same purpose, none of it contained all of our feature requirements. For

several of our experiments we required a game engine that would, among

other things, handle a continuous [normalised] range of moves, arbitrarily

sized payoff matrices, different types of signal noise, multiple (> 2) strate-

gies per game, and logging of partial and completed game results.

The software suite was developed in Java, allowing ease in development

and web deployment. New strategies are easily implemented by imple-

menting a subclass of the Strategy class. The principal requirements are

the implementations of the getMove() and reset() methods, which return the current strategy move and clear the strategy state between games, respectively.
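For illustration, a new strategy might therefore look something like the sketch below. The class name Strategy and the getMove() and reset() methods are taken from the description above, but the exact signatures, the normalised move representation and the setOpponentMove() hook are our assumptions; the real classes in the downloadable library may differ in detail.

    // Minimal stand-in for the framework's Strategy class, for illustration only;
    // the real class in the library may declare different signatures.
    abstract class Strategy {
        public abstract double getMove();             // return this strategy's current move
        public abstract void reset();                 // clear state between games
        public void setOpponentMove(double move) { }  // assumed hook for observing the opponent
    }

    // A Tit-for-Tat-style strategy written against that stand-in, with moves
    // normalised so that 1.0 means full cooperation and 0.0 means full defection.
    class TitForTatStrategy extends Strategy {
        private double opponentLast = 1.0;            // cooperate on the first move

        @Override public double getMove() { return opponentLast; }

        @Override public void reset() { opponentLast = 1.0; }

        @Override public void setOpponentMove(double move) { opponentLast = move; }
    }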

Currently we define two types of games: standard and multi-player. A

standard game involves two competing strategies playing for a number of

rounds, and should mimic the basic game mechanics in the competitions

run by Axelrod. A multi-player game involves several competing strategies, each obtaining a payoff against every other opponent on each round.

A tournament involves every participating strategy and differs for standard

and multi-player type games. A standard tournament pits every strategy

against every other (including self) in a standard game [a la Round Robin].

A multi-player tournament plays a single multi-player game.
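A standard tournament of this kind reduces to a double loop over the entrants. The sketch below is our own illustration, not the framework code: every entrant, as the row player, meets every entrant including itself once, fresh strategy instances are used for each game, and scores are accumulated using the payoff values of table 1.1.

    import java.util.List;
    import java.util.function.Supplier;

    public class RoundRobinTournament {
        // Minimal player interface for this sketch: move 0 = cooperate, 1 = defect.
        interface Player {
            int move();
            void observe(int opponentMove);
        }

        // Payoff to the row player, indexed as [ownMove][opponentMove] (table 1.1 values).
        static final int[][] PAYOFF = { { 3, 0 }, { 5, 1 } };

        // Each entrant, as the row player, plays every entrant (including itself) once;
        // fresh instances per game mirror the reset-between-games rule.
        static long[] run(List<Supplier<Player>> entrants, int rounds) {
            long[] totals = new long[entrants.size()];
            for (int i = 0; i < entrants.size(); i++) {
                for (int j = 0; j < entrants.size(); j++) {
                    Player row = entrants.get(i).get();
                    Player col = entrants.get(j).get();
                    for (int r = 0; r < rounds; r++) {
                        int rowMove = row.move();
                        int colMove = col.move();
                        totals[i] += PAYOFF[rowMove][colMove];   // only the row player's score is
                        row.observe(colMove);                    // credited here; the opponent is
                        col.observe(rowMove);                    // credited when the roles reverse
                    }
                }
            }
            return totals;
        }
    }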

An option is available to introduce a Gaussian distributed random num-

ber of rounds to be played, so as to discourage strategies from using the

knowledge of a predefined or static parameter for an unfair advantage.

There is also an option to introduce noise into the output moves, in prin-

ciple to test the robustness of the algorithms. Besides the programming

API, a graphical user interface is available to set up and run PD tourna-

ment competitions (see Figure 1.1).

The software monitors and allows users to log the output of a tourna-

ment with different degrees of detail. However, detailed logs will degrade

performance.

Besides the standard 2× 2 payoff matrix for classic games, there is the

ability to define an arbitrarily sized payoff matrix allowing for a wider range

of allowable moves. Moves are normalised and payoffs are calculated from

the closest allowable move in the payoff matrix.
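For example, with allowable cooperation levels of 1, 3/4, 1/2, 1/4 and 0 (as in table 1.2), a submitted move of 0.6 would be scored as 1/2. A sketch of that snapping step (our own illustration of the idea, not the library code) is shown below.

    class MoveNormaliser {
        // Snap a normalised move in [0, 1] to the closest allowable level; that level
        // is then used to look up the payoff in the (arbitrarily sized) payoff matrix.
        static double closestAllowableMove(double move, double[] allowableLevels) {
            double best = allowableLevels[0];
            for (double level : allowableLevels) {
                if (Math.abs(level - move) < Math.abs(best - move)) {
                    best = level;
                }
            }
            return best;
        }
        // Example: closestAllowableMove(0.6, new double[] {1.0, 0.75, 0.5, 0.25, 0.0}) yields 0.5.
    }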

Fig. 1.1. IPD tournament application.

A number of standard classic strategies are included in the library.

The software can be downloaded from http://prisoners-dilemma.com.

References

Axelrod, R. (1980a). Effective Choices in the Prisoner’s Dilemma, J. Conflict

Resolution, 24, pp. 3-25.

Axelrod, R. (1980b). More Effective Choices in the Prisoner’s Dilemma, J. Con-

flict Resolution, 24, pp. 379-403.

Axelrod R. M. (1984). The Evolution of Cooperation (BASIC Books, New York).

Axelrod R. and D’Ambrosio L. (1995). Announcement for Bibliography on the

Evolution of Cooperation, Journal of Conflict Resolution 39, pp. 190.

Axelrod R. (1997). The Complexity of Cooperation (Princeton University Press).

Birk A. (1999). Evolution of Continuous Degrees of Cooperations in an N-Player

Iterated Prisoner’s Dilemma, Technical Report under review, Vrije Univer-

siteit Brussel, AI-Laboratory.

Boyd R. and Lorberbaum J. P. (1987). No Pure Strategy is Evolutionary Stable

in the Repeated Prisoner’s Dilemma, Nature, 327, pp. 58-59.

Darwen P. and Yao X. (2002). Co-Evolution in Iterated Prisoners Dilemma with

Intermediate Levels of Cooperation: Application to Missile Defense, In-

ternational Journal of Computational Intelligence and Applications, 2, 1,

pp. 83-107.

Darwen P. and Yao X. (1995). On Evolving Robust Strategies for Iterated Pris-

oners Dilemma, In Progress in Evolutionary Computation, LNAI, 956,

pp. 276-292.

Darwen P. and Yao X. (2001). Why More Choices Cause Less Cooperation in

Iterated Prisoner’s Dilemma, Proc. Congress of Evolutionary Computation,

pp. 987-994.

Davis M. Game Theory. (1997). A Nontechnical Introduction (Dover Publica-

tions).

Fogel D. (1993). Evolving Behaviours in the Iterated Prisoners Dilemma. Evolu-

tionary Computation, 1, 1, pp. 77-97.

Goldstein J. (1991). Reciprocity in Superpower Relations: An Empirical Analysis,

International Studies Quarterly, 35, pp. 195-209.

Maynard Smith J. (1982). Evolution and the Theory of Games (Cambridge Uni-

versity Press).

Minas J. S., Scodel A., Marlowe D. and Rawson H. (1960). Some Descriptive

Aspects of Two-Person, Non-Zero-Sum Games, II, Journal of Conflict Res-

olution, 4, pp. 193-197.

Nash J. (1950). The Bargaining Problem, Econometrica, 18, pp. 150-155.

Nash J. (1953). Two-Person Cooperative Games, Econometrica, 21, pp. 128-140.

O’Riordan and Bradish S. (2000). Experiments in the Iterated Prisoner’s Dilemma

and the Voter’s Paradox. 11th Irish Conference on Artificial Intelligence and

Cognitive Science.

O’Riordan C. (2000). A Forgiving Strategy for the Iterated Prisoner’s Dilemma,

Journal of Artificial Societies and Social Simulation, 3, 1.

Poundstone W. (1992). Prisoner’s Dilemma (Doubleday).

Rapoport A. (1996). Optimal policies for the prisoners dilemma, Tech report

No. 50, Psychometric Laboratory, Univ. North Carolina, NIH Grant, MH-

10006.

Scodel A. and Philburn R. (1959). Some Personality Correlates of Decision Mak-

ing under Conditions of Risk, Behavioral Science, 4, pp. 19-28.

Scodel A., Minas J. S., Ratoosh P. and Lipetz M. (1959). Some Descriptive Aspects

of Two-Person, Non-Zero-Sum Games, Journal of Conflict Resolution, 3,

pp. 114-119.

Scodel A. and Minas J. S. (1960). The Behavior of Prisoners in a “Prisoner’s

Dilemma” Game, Journal of Psychology, 50, pp. 133-138.

Scodel A. (1962). Induced Collaboration in Some Non-Zero-Sum Games, Journal

of Conflict Resolution, 6, pp. 335-340.

Scodel A. (1963). Probability Preferences and Expected Values. Journal of Psy-

chology, 56, pp. 429-434.

von Neumann J. and Morgenstern O. (1944). Theory of Games and Economic

Behavior (Princeton University Press).

Yao X. and Darwen P. (1999). How Important is Your Reputation in a Multi-Agent Environment. Proc. of the 1999 IEEE Conference on Systems, Man

and Cybernetics, IEEE Press, Piscataway, NJ, USA, pp. II-575 – II-580,

Oct.

Chapter 2

Iterated Prisoner’s Dilemma and Evolutionary

Game Theory

Siang Yew Chong1, Jan Humble2, Graham Kendall2, Jiawei Li2,3,

Xin Yao1

University of Birmingham1, University of Nottingham2, Harbin Institute

of Technology3

2.1. Introduction

The prisoner’s dilemma is a type of non-zero-sum game in which two players

try to maximize their payoff by cooperating with, or betraying the other

player. The term non-zero-sum indicates that whatever benefits accrue to

one player do not necessarily imply similar penalties imposed on the other

player. The Prisoner’s dilemma was originally framed by Merrill Flood and

Melvin Dresher working at RAND Corporation in 1950. Albert W. Tucker

formalized the game with prison sentence payoffs and gave it the “Prisoner’s

Dilemma” name. The classical prisoner’s dilemma (PD) is as follows:

Two suspects, A and B, are arrested by the police. The police

have insufficient evidence for a conviction, and, having sepa-

rated both prisoners, visit each of them to offer the same deal:

if one testifies for the prosecution against the other and the

other remains silent, the betrayer goes free and the silent ac-

complice receives the full 10-year sentence. If both stay silent,

the police can sentence both prisoners to only six months in

jail for a minor charge. If each betrays the other, each will re-

ceive a two-year sentence. Each prisoner must make the choice

of whether to betray the other or to remain silent. However,

neither prisoner knows for sure what choice the other prisoner

will make. So the question this dilemma poses is: What will

happen? How will the prisoners act?

The general form of the PD is represented as the following matrix [Scodel

et al. (1959)]:

                               Prisoner 2
                          Cooperate    Defect
    Prisoner 1  Cooperate   (R, R)     (S, T)
                Defect      (T, S)     (P, P)

where R, S, T , and P denote Reward for mutual cooperation, Sucker’s

payoff, Temptation to defect, and Punishment for mutual defection respec-

tively, and T > R > P > S and R > 1/2(S + T ). The two constraints

motivate each player to play noncooperatively and prevent any incentive to

alternate between cooperation and defection [Rapoport (1966, 1999)].

Neither prisoner knows the choice of his accomplice. Even if they were

able to talk to each other, neither could be sure that he could trust the

other. The “dilemma” faced by the prisoners here is that, whatever the

other does, each is better off confessing than remaining silent. However,

the payoff when both confess is worse for each player than the outcome

they would have received if they had both remained silent. Traditional

game theory predicts the outcome of the PD to be mutual defection, based on the

concept of Nash equilibrium. To defect is dominant because if both players

choose to defect, no player has anything to gain by changing their own

strategy [Hardin (1968); Nash (1950, 1951, 1996)].

In the Iterated Prisoner’s Dilemma (IPD) game, two players have to

choose their mutual strategy repeatedly, and have memory of their previ-

ous behaviors. Because players who defect in one round can be “punished”

by defections in subsequent rounds and those who cooperate can be re-

warded by cooperation, the appropriate strategy for self-interested players

is no longer obvious in IPD games. If the precise length of an IPD is

known to the players, then the optimal strategy is to defect on each round

(often called All Defect or AllD) [Luce and Raiffa (1957)]. This single ratio-

nal play strategy which is deduced from propagating the single stage Nash

equilibrium of mutual defection backwards through every stage of the game

prevents players from cooperating to achieve higher payoffs [Selten (1965,

1983, 1988); Noldeke and Samuelson (1993)]. If the game has infinite length

or at least the players are not aware of the length of the game, backward

induction is no longer effective and there exists the possibility that cooper-

ation can take place. In fact, there is still controversy about whether or not

backward induction can be applied to infinite (or finite) IPDs [Sobel (1975,

1976); Kavka (1986); Becker and Cudd (1990); Binmore (1997); Binmore et

al. (2002); Bovens (1997)]. However, in IPD experiments, it was not uncom-

mon to see people cooperate to gain a greater payoff not only in repeated

games but even in one-shot games [Cooper et al. (1996); Croson (2000);

Davis and Holt (1999); Milinski and Wedekind (1998)]. Traditional game

theory interprets the cooperation phenomena in IPDs by means of repu-

tation [Fudenberg and Maskin (1986); Kreps and Wilson (1982); Milgrom

and Roberts (1982)], incomplete information [Harsanyi (1967); Kreps et al.

(1982; Sarin (1999)], or bounded rationality [Anthonisen (1999); Harborne

(1997); Radner (1980, 1986); Simon (1955, 1990); Vegaredondo (1994)].

Evolutionary game theory differs from classical game theory in respect

of focusing on the dynamics of strategy change in a population more than

the properties of strategy equilibrium. In evolutionary game theory, IPD

is an ideal experimental platform for the problem as to how cooperation

occurs and persists, which is considered to be impossible in the static or

deterministic environment. IPD attracted wide interest after Robert Ax-

elrod’s famous book “The Evolution of Cooperation”. In 1979, Robert

Axelrod organized a prisoner’s dilemma tournament and solicited strate-

gies from game theorists [Axelrod (1980a, 1980b)]. Each of the 14 entries

competed against all others (including itself) over a sequence of 200 moves.

The specific payoff function used is as follows.

                               Prisoner 2
                          Cooperate    Defect
    Prisoner 1  Cooperate   (3, 3)     (0, 5)
                Defect      (5, 0)     (1, 1)

The winner of the tournament was “tit-for-tat” (TFT) submitted by

Anatol Rapoport. TFT always cooperates on the first move and then mim-

ics whatever the other player did on the previous move. In a second tourna-

ment with 62 entries, again the winner was TFT. Axelrod discovered that

“greedy” strategies tended to do very poorly in the long run while “altru-

istic” strategies did better when the PD was repeated over a long period of

time with many players. Then genetic algorithms were introduced to show

how these altruistic strategies evolve in the populations that are initially

dominated by selfishness. The prisoner’s dilemma is therefore of interest

to the social sciences such as economics, politics and sociology, and to the

biological sciences such as ethology and evolutionary biology, as well to

the applied mathematics such as evolutionary computing. Many social and

natural processes, for example arms races between states and price setting by

duopolistic firms, have been abstracted into models in which independent

groups or individuals are engaged in PD games [Brelis (1992); Bunn and

Payne (1988); Hauser (1992); Hemelrijk (1991); Surowiecki (2004)].

The optimal strategy for the one-shot PD game is simply defection.

However, in the IPD game the optimal strategy depends upon the strate-

gies of the possible opponents. For example, the strategy of Always Co-

operate (AllC) is dominated by the strategy of Always Defect (AllD), and

AllD is optimal in a population consisting of AllD and AllC. However, in

a population consisting of AllD, AllC, and TFT, AllD is not necessarily

the optimal strategy. It appears that all the strategies in the population

determine which strategy is optimal. Although TFT proved effective in many IPD tournaments and was long considered to be the best

basic strategy, it could be defeated in some specific circumstances [Beaufils,

Delahaye and Mathieu (1996); Wu and Axelrod (1994)]. Therefore, there

is lasting interest for game theorists to find optimal strategies or at least

novel strategies which outperform TFT in IPD tournaments.

Since Axelrod, two types of approaches have been developed to test the efficiency or robustness of a strategy and, further, to derive optimal strategies:

(1) Round-robin tournaments.

(2) Evolutionary dynamics.

A round-robin tournament shows the efficiency of a strategy in competing with others, while an ecological (evolutionary) simulation illustrates the robustness of a strategy in terms of the number of descendants or its survivability in a certain environment. Many novel strategies have been developed and analyzed by means of these approaches.

By using round-robin tournaments, the interactions between different

strategies can be observed and analyzed. If the statistical distribution of

opposing strategies can be determined, an optimal counter-strategy can be

derived mathematically. For example, if the population consists of 50%

TFT and 50% AllC, the optimal strategy should cooperate with TFT and

defect with AllC in order to maximize the payoff. It is easy to design such a strategy: defect on the first two moves, then always play C if the opponent defected on the second move (identifying it as TFT), otherwise always play D.
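As an illustration, the following is a minimal Python sketch of such a probe strategy; the function name and the 'C'/'D' move encoding are our own, and the 50% TFT / 50% AllC population is the assumption from the text.

    def probe_strategy(my_history, opp_history):
        # Defect on the first two moves to probe the opponent.  Afterwards,
        # always cooperate if the opponent retaliated on move two (it behaves
        # like TFT); otherwise always defect (it behaves like AllC).
        if len(my_history) < 2:
            return 'D'
        return 'C' if opp_history[1] == 'D' else 'D'

A similar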

concept in analyzing optimal strategy is Bayesian Nash equilibrium which

is widely used in experimental economics [Bedford and Meilijson (1997);

Gilboa and Schmeidler (2001); Kagel and Roth (1995); Kalai and Lehrer

(1993); Rubinstein (1998)]. In evolutionary dynamics, processes like

natural selection are simulated where individuals with low scores die off,

and those with high scores flourish. The evolutionary rule that describes

what future states follow from the current state is fixed and deterministic:

for a given time interval only one future state follows from the current state


[Katok and Hasselblatt (1996)]. The most common form of the evolution rule is the replicator equation, which assumes infinite populations, continuous time, complete mixing, and that strategies breed true. Given a population

of strategies and the dynamic equations, the evolutionary process can be

simulated, and how strategies evolve in the population over a short or long

time period can be shown. Optimal strategies can be developed in this way

[Axelrod (1987); Darwen and Yao (1995, 1996, 2001); Lindgren (1992);

Miller (1996)].

2.2. Strategies in IPD Tournaments

Axelrod was the first to search for efficient strategies by means of IPD tournaments [Axelrod (1980a, 1980b)]. TFT had long been studied as a strategy for the IPD game [Komorita, Sheposh and Braver (1968); Rapoport and Chammah (1965)]. However, it was only after Axelrod's tournaments that TFT became well known.

According to Axelrod, several conditions are necessary for a strategy to

be successful. These conditions include:

Nice

The most important condition is that the strategy must be “nice”.

That is, it will not defect before its opponent does. Almost all of the top-scoring strategies were nice; hence, even a purely self-interested strategy has a reason never to defect first.

Retaliating

Axelrod contended that a successful strategy must not be a blind op-

timist. It must always retaliate. An example of a non-retaliating

strategy is AllC. This is a very bad choice, as “nasty” strategies will

ruthlessly exploit such strategies.

Forgiving

Another quality of successful strategies is that they must be forgiving.

Though they will retaliate, they will fall back to cooperating if the

opponent does not continue to defect. This stops long runs of revenge

and counter-revenge, thus maximising payoffs.


Clear

The last quality is being clear, that is, making it easy for other strategies to predict its behavior so as to facilitate mutual cooperation. Stochastic strategies, by contrast, are not clear because of the uncertainty in their choices.

In a further study, Axelrod noted that just a few of the 62 entries in the second tournament had a reasonable influence on the performance of a given strategy. He utilized eight strategies as opponents for a simulated evolving population based on a genetic algorithm approach [Axelrod (1987)]. The population consisted of deterministic strategies that use the outcomes of the three previous moves to determine the current move. The simulation was conducted using a population of 20 strategies, drawn from a space of 2^70 possible strategies, executed repeatedly against the eight representatives. Mutation and crossover

were used to generate new strategies. The typical results indicated that

populations initially generated mutual defection, but subsequently evolved

toward mutual cooperation. Moreover, most of the strategies that evolved in the simulation actually resembled TFT, having the properties of “Nice”,

“Forgiving”, and “Retaliating”.

Although TFT has been considered to be the most successful strategy

in the IPD for several decades, there is still some controversy about it. There seems to be a lack of theoretical explanation for strategies like TFT in

traditional game theory. TFT is not subgame perfect, and there are always

subgame perfect equilibria that dominate TFT according to the Folk The-

orem [Binmore (1992); Hargreaves and Varoufakis (1995); Myerson (1991);

Rubinstein (1979); Selten (1965, 1975)]. On the other hand, whether or not

TFT is the most efficient singleton strategy in IPD game is still unclear;

therefore, many researchers are attempting to develop novel strategies that

can outperform TFT.

2.2.1. Heterogeneous TFTs

Since TFT had such success in IPD tournaments and experiments, it is

natural to draw the conclusion that TFT may be improved by slightly

modifying its rule. Many heterogeneous TFTs have been developed in order to overcome TFT’s shortcomings or to adapt to a certain environment, for example the IPD with noise. Examples include Tit-for-Two-Tats (TFTT), Generous TFT (GTFT), and Contrite TFT (CTFT).


A situation that TFT does not handle well is a long series of mutual

retaliations evoked by an occasional defection. The deadlock can be bro-

ken if the co-player behaves more generously than TFT and forgives at

least one defection. TFTT retaliates with defection only after two succes-

sive defections and thus attempts to avoid becoming involved in mutual

retaliations. Usually, TFTT performs well in a population with more coop-

erative strategies but does poorly in a population with more permanently

defective strategies. Similar to TFTT, Benevolent TFT (BTFT) always

cooperates after cooperation and normally defects after defection, but oc-

casionally BTFT responds to defection by cooperation in order to break

up a series of mutual obstruction [Komorita, Sheposh and Braver (1968)].

In experiments of Manarini (1998) and Micko (1997), fixed interval BTFT

strategies were shown to be superior to, or at least equivalent to, TFT in

terms of cooperation as well as in terms of cumulative pay-off. However,

BTFT tends to produce irregularly alternating exploitation and sometimes results in mutual retaliation.

Allowing some percentage of the other player’s defections to go unpun-

ished has been widely accepted as a good way to cope with noise [Molander

(1985); May (1987); Axelrod and Dion (1988); Bendor et al. (1991); God-

fray (1992); Wu and Axelrod (1994)]. A reciprocating strategy such as TFT

can be modified to forgive the other player’s defection with a certain ratio in

order to decrease the influence of noise. GTFT behaves like TFT but coop-

erates with probability q = min[1 − (T − R)/(R − S), (R − P)/(T − P)]

when it would otherwise defect. This prevents a single error from echoing

indefinitely. For example, in the case of T = 5, R = 3, P = 1, and S = 0,

q = 1/3. GTFT is said to take over the dominant position of the population

of homogeneous TFT strategies in an evolutionary environment with noise

[Nowak and Sigmund (1992)].

In a noisy environment, retaliating against an unintended defection often leads to permanent bilateral retaliation. Therefore, forgiving a defection that was itself evoked by an unintended defection provides a quick way to recover from error. It is based upon the idea that one should not be provoked by the other player’s response

to one’s own unintended defection [Sugden (1986); Boyd (1989)]. The strat-

egy of CTFT has three states: “contrite”, “content” and “provoked”. It

begins in a content state, with cooperation and stays there unless there is

a unilateral defection. If it was the victim while content, it becomes pro-

voked and defects until a cooperation from other player causes it to become

content. If it was the defector while content, it becomes contrite and co-

operates. When contrite, it becomes content only after it has successfully


cooperated. CTFT can correct its unintended defection in a noisy envi-

ronment. If one of two CTFT players defects, the defecting player will

contritely cooperate on the next move and the other player will defect, and

then both will be content to cooperate on the following move. However,

CTFT is not effective at correcting the other player’s error. For example, if

CTFT is playing TFT and the TFT player defected by accident, the retal-

iation will continue until another error occurs. In an ecological simulation

with noise, GTFT and CTFT competed with the 63 rules of the Second

Round of the Computer Tournament for the Prisoner’s Dilemma [Axelrod

(1984)]. CTFT was the dominant strategy, making up 97% of the population by generation 2000 [Wu and Axelrod (1994)].
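A minimal Python sketch of the three-state rule described above (our own reading of it; the 'C'/'D' move encoding and the order of the state updates are assumptions) is:

    def ctft(state, my_last, opp_last):
        # Contrite TFT as a three-state machine; states are 'content',
        # 'contrite', 'provoked'.  my_last/opp_last are the moves actually
        # realised in the previous round ('C'/'D'), or None on the first move.
        if my_last is not None:
            if state == 'content':
                if my_last == 'C' and opp_last == 'D':
                    state = 'provoked'     # victim of a unilateral defection
                elif my_last == 'D' and opp_last == 'C':
                    state = 'contrite'     # the (unintended) defector
            elif state == 'provoked' and opp_last == 'C':
                state = 'content'          # opponent's cooperation appeases
            elif state == 'contrite' and my_last == 'C':
                state = 'content'          # own cooperation was successfully played
        return ('D' if state == 'provoked' else 'C'), state

Tracing two such players through an accidental defection reproduces the recovery pattern described above: contrite cooperation, one retaliation, then mutual cooperation.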

2.2.2. Pavlov (Win-Stay Lose-Shift)

A possible drawback of TFT is that it performs poorly in a noisy envi-

ronment. Assume that a population of TFT strategies plays IPD with one

another in a noisy environment, where every choice may be occasionally im-

plemented in error. Although a TFT strategy cooperates with its twin at

the beginning, it would fall out of cooperation as soon as the other player’s action is misinterpreted, and this then induces the other player’s defection in the next round. Therefore, after an error, the result of the game turns out to be a CD, DC, CD . . . cycle. If a second error happens, the outcome is as likely to fall into mutual defection as it is to resume cooperation. Cooperation between TFT strategies is easily broken even when the noise frequency is low [Donninger (1986); Kraines and Kraines (1995)].

The Pavlov strategy, also known as Win-Stay Lose-Shift or Simpleton

[Rapoport and Chammah (1965)], has been shown to outperform TFT in

the environment with noise [Fudenberg and Maskin (1990); Kraines and

Kraines (1995, 2000)]. Pavlov cooperates when both sides have cooperated

or defected on the previous move, and defects otherwise. Pavlov, like TFT, is a memory-one strategy, in which players remember and make use only of their own move and their opponent’s move in the last round. The major difference between Pavlov and TFT is that Pavlov chooses COOPERATE after a mutual defection, where TFT would DEFECT, and this helps Pavlov resume cooperation with cooperative strategies, such as TFT,

in a noisy environment. When restricted to an environment of memory-one

agents interacting in iterated Prisoner's Dilemma games with a 1% noise

level, Pavlov is the only cooperative strategy and one of the very few that

cannot be invaded by a similar strategy [Nowak and Sigmund (1993, 1995)].


Simulation of evolutionary dynamics of win-stay lose-shift strategies

shows that these strategies are able to adapt to the uncertain environment

even when the noise level is high [Posch (1997)]. In simulated stochas-

tic memory-one strategies for the IPD games, Nowak and Sigmund (1993,

1995) report that cooperative agents using a Pavlov type strategy even-

tually dominate a random population. Memory-one strategies can be ex-

pressed in the form of S(p1, p2, p3, p4), where p1 denotes the probability

of playing C (Cooperate) after a CC outcome, p2 denotes the probability

of playing C after a CD outcome, p3 denotes the probability of playing

C after a DC outcome, and p4 denotes the probability of playing C af-

ter a DD outcome. Most of the well-known strategies can be expressed

in this form. For example, AllC = S(1, 1, 1, 1), AllD = S(0, 0, 0, 0), TFT

= S(1, 0, 1, 0), Pavlov = S(1, 0, 0, 1). Noise is conveniently introduced by

restricting the conditional probabilities p_i to lie strictly between 0 and 1. For example, S(0.999, 0.001, 0.999, 0.001) is a TFT strategy that errs with probability 0.001. In a computer simulation starting from a population using the totally random strategy S(0.5, 0.5, 0.5, 0.5), the win-stay lose-shift strategy shows its evolutionary robustness in a noisy environment. Every 100 generations, over a total of 10^7 generations, randomly generated mutant strategies are introduced (10^5 mutants in all). Simulation results show that the populations are dominated by the win-stay lose-shift strategy in 33 of a total of 40 simulations. TFT strategies perform poorly in large part because they do not exploit overly cooperative strategies.
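As an illustration of the S(p1, p2, p3, p4) representation, the following minimal Python sketch (our own, not taken from the cited studies; it uses the payoff values T = 5, R = 3, P = 1, S = 0 given earlier and assumes both players treat the virtual "previous" round as mutual cooperation) plays two stochastic memory-one strategies against each other and reports their average payoffs:

    import random

    PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
              ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

    def move(strategy, my_last, opp_last):
        # strategy = (p1, p2, p3, p4): probability of playing C after the
        # outcomes CC, CD, DC, DD (own move listed first)
        p = strategy[{'CC': 0, 'CD': 1, 'DC': 2, 'DD': 3}[my_last + opp_last]]
        return 'C' if random.random() < p else 'D'

    def average_payoffs(s1, s2, rounds=10000):
        a, b = 'C', 'C'                     # virtual previous round is CC
        total1 = total2 = 0
        for _ in range(rounds):
            a, b = move(s1, a, b), move(s2, b, a)
            p1, p2 = PAYOFF[(a, b)]
            total1 += p1
            total2 += p2
        return total1 / rounds, total2 / rounds

    # Noisy TFT against Pavlov, using the probabilities quoted in the text
    noisy_tft = (0.999, 0.001, 0.999, 0.001)
    pavlov = (1.0, 0.0, 0.0, 1.0)
    print(average_payoffs(noisy_tft, pavlov))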

Simulations reveal that Pavlov loses against AllD but can invade TFT,

and that Pavlov cannot be invaded by AllD [Milinski (1993)].

2.2.3. Gradual

The Gradual strategy is like TFT but responds to the opponent with a gradual pattern. It acts like TFT, except in how it forgives and in that it remembers the past. It cooperates on the first move and then continues to do so as long as the other player cooperates. After the first defection of the other player, it defects one time and cooperates two times; after the second defection of the opponent, it defects two times and cooperates two times; . . . ; after the nth defection it reacts with n consecutive defections and then calms its opponent down with two cooperations [Beaufils, Delahaye and Mathieu (1996)].
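A minimal Python sketch of this rule (our own reading; Gradual is described with slight variations in the literature, for example in how defections that occur during a punishment run are counted):

    def gradual_move(opp_history, state):
        # `state` is a dict carried between calls; moves are encoded 'C'/'D'.
        if opp_history and opp_history[-1] == 'D':
            state['defections'] = state.get('defections', 0) + 1   # running total
        queue = state.setdefault('queue', [])
        if queue:                       # still serving a punishment/appeasement run
            return queue.pop(0)
        if opp_history and opp_history[-1] == 'D':
            n = state['defections']
            # react with n consecutive defections, then two calming cooperations
            queue.extend(['D'] * (n - 1) + ['C', 'C'])
            return 'D'
        return 'C'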

Both round-robin competitions and ecological evolution experiments are

conducted in order to compare the performance of Gradual with TFT.


Gradual wins in experiments where round-robin competitions are conducted with several well-known strategies, such as TFT and GRIM. In ecological evolutionary experiments, Gradual and TFT show the same type of evolution, but the quantitative difference favours Gradual, which finishes far ahead of all other survivors once the population has stabilised. However, these results are sufficient to demonstrate that TFT is not always the best, but not sufficient to prove that Gradual always outperforms TFT. Gradual receives fewer points

than TFT while interacting with AllD because Gradual forgives too many

defections. Therefore, if there are lots of defecting strategies like AllD in

the competition, it would be possible that TFT outperforms Gradual in

this case.

Beaufils, Delahaye and Mathieu (1996) try to improve the performance

of Gradual by using a genetic algorithm. 19 different genes are used and a

fitness function evaluates the quality of the strategies. Several new strate-

gies are found after 150 generations of evolution. One of them beats Gradual and TFT in a round-robin tournament, as well as in an ecological simulation. In both cases it finished first, just ahead of Gradual, with TFT two or three places behind and a wide gap in the score, or in the size of the stabilised population.

The evolutionary dynamics of populations including Gradual have also been

studied in Delahaye and Mathieu (1996), Doebeli and Knowlton (1998),

Glomba, Filak, and Kwasnicka (2005), Beaufils, Delahaye, and Mathieu

(1996).

2.2.4. Adaptive strategies

From the viewpoint of automata, the strategies in IPD games can be regarded as automated agents with or without feedback mechanisms. Most well-known IPD strategies are not adaptive because their responses to any given opponent are fixed. It is impossible to improve their performance since the parameters of their response mechanism cannot be adjusted.

However, there are still some strategies in IPDs which are adaptive. Al-

though there is still no experimental evidence of adaptive strategies out-

performing non-adaptive ones in IPD games, adaptive strategies are worth

studying since creatures with higher intelligence are all adaptive.

There have been two approaches to developing adaptive strategies.

Firstly, adaptive mechanisms can be implemented by making the parame-

ters of a non-adaptive strategy adjustable. Secondly, new adaptive strate-

gies can be developed by using evolutionary computation, reinforcement


learning, and other computational techniques [Darwen and Yao (1995,

1996)].

Tzafestas (2000a, 2000b) introduced the adaptive tit-for-tat (ATFT) strategy, which embeds an adaptive factor into the conventional TFT strategy.

ATFT keeps the advantages of tit-for-tat in the sense of retaliating and

forgiving, and implements some behavioural gradualness that would show

as fewer oscillations between Cooperate and Defect. It uses an estimate

of the opponent’s behavior, whether cooperative or defecting, and reacts

to it in a tit-for-tat manner. To represent degrees of cooperation and defection, a continuous variable named “world”, ranging from 0 (total defection) to 1 (total cooperation), is used. The ATFT strategy can then

be formulated as a simple model:

    # Runnable version of the model; the function signature and the 'C'/'D'
    # move encoding are ours, the update rule is as given by Tzafestas.
    def atft_move(opponent_last_move, world, r):
        if opponent_last_move == 'C':
            world = world + r * (1 - world)
        else:
            world = world + r * (0 - world)
        return ('C' if world >= 0.5 else 'D'), world

r is the adaptation rate here. The TFT strategy corresponds to the case

of r = 1 (immediate convergence to the opponent’s current move). Clearly,

ATFT is an extension of the conventional TFT strategy. Simulations of spatial IPD games between ATFT, AllD, AllC, and TFT on a 2D grid show that ATFT is fairly stable and resistant to perturbations. Since the

use of a fairly small adaptation rate r will allow more gradual behavior,

ATFT tends to be more robust than TFT in a noisy environment.

Since evolutionary computation has been widely used in simulating the

dynamics of IPD games, it is natural to consider obtaining IPD strategies

directly by using evolutionary approaches [Lindgren (1991); Fogel (1993);

Darwen and Yao (1995, 1996)]. Axelrod (1987) studied how to find effective

strategies by using genetic algorithms as a simulation method. He established

an initial population of strategies that is deterministic and uses the outcome

of the three previous moves to make a choice in the current move. By

means of playing IPD games between one another, successful strategies

are selected to have more offspring. Then the new population will display

patterns of behavior that are more like those of the successful strategies of

the previous population, and less like those of the unsuccessful ones. As the

evolution process continues, the strategies with relatively high scores will

flourish while the unsuccessful strategies die out. Simulation results show


that most of the strategies that were evolved in the simulation actually resemble TFT and do substantially better than TFT. However, it would

not be accurate to say that these strategies are better than TFT because

they are probably not very robust in other environments [Axelrod (1987)].

Many researchers have found that evolved strategies may lack robust-

ness, i.e., the strategies did well against the local population, but when something new and innovative appeared they failed [Lindgren (1991); Fogel (1993)]. Darwen and Yao (1996) applied a technique to prevent the genetic algorithm from converging to a single optimum and attempted to develop new IPD strategies without human intervention. They conclude that adding static opponents to the round-robin tournament improves the results of the final population.

Optimal strategies can be determined only if the strategy of the oppo-

nent is known. By means of reinforcement learning, model-based strategies

with the ability of on-line identification of an opponent can be built [Sand-

holm and Crites (1996); Freund et al. (1995); Schmidhuber (1996)]. How

can a player acquire a model of its opponent’s strategy? One possible source

of information available for the player is the history of the game. Another

possible source of information is observed games between the opponent and

other agents. In the case of IPD games, a player can infer an opponent’s

model based on the outcomes of past moves and then adapt its strategy

during the game. Reinforcement learning (RL) is based on the idea that

the tendency to produce an action should be strengthened if it produces

favorable results, and weakened if it produces unfavorable results [Watkins

(1989); Watkins and Dayan (1992); Kaelbling and Moore (1996)]. A model-

based RL approach generates expectations about the opponent’s behavior

by making use of a model of its strategy [Carmel and Markovitch (1997,

1998)]. It is well suited for use in IPD tournaments against an unknown

opponent because of its small computational complexity. The major prob-

lem in designing a model-based strategy (MBS) is the risk involved in the

exploration, and thus the issue of exploitation versus exploration. An ex-

ploring action taken by the MBS tests unfamiliar aspects of the opponent

which can yield a more accurate model of the opponent. However, this

action also carries the risk of putting the MBS into a much worse position.

For example, in order to distinguish the strategy ALLC from GRIM and

TFT in an IPD tournament, an MBS has to defect at least once and therefore

loses the chance to cooperate with GRIM. The exploratory action affects

not only the current payoff but also the future rewards [Berry and Frist-

edt (1985)]. There have been several approaches developed to solve this


problem [Berry and Fristedt (1985); Gittins (1989); Sutton (1990); Naren-

dra and Thathachar (1989); Kaelbling (1993); Moore and Atkeson (1993);

Carmel and Markovitch (1998)]. Since the set of possible strategies for a repeated game is usually infinite, computational complexity is another problem that

needs to be addressed [Ben-porath (1990); Carmel and Markovitch (1998)].

There are few records of an effective MBS in round-robin IPD tournaments. However, the strategy that won Competition 4 in the 2005 IPD tournament, Adaptive Pavlov, is such a strategy [Prisoner’s dilemma tournament

result (2005)]. Furthermore, it seems that each of the strategies that ranked

above TFT incorporated a mechanism to explore the opponent.

2.2.5. Group strategies

In the 2004 IPD competition [20th-anniversary Iterated Prisoner’s Dilemma

competition], a team from Southampton University led by Professor N. Jen-

nings introduced a group of strategies, which proved to be more successful

than Tit-for-Tat (see chapter 9).

The group of strategies were designed to recognise each other through a

known series of five to ten moves at the start. Once two Southampton play-

ers recognized each other, they would take on their “master” or “slave” roles – a master will always defect while a slave will always cooperate in order for the master to win the maximum points. If the program recognized that another player was not a Southampton entry, it would immediately defect to minimise the opposition’s score.

gies succeeded in defeating any non-grouped strategies and won the top

three positions in the competition [Prisoner’s dilemma tournament result

(2004)].

According to Grossman (2004), it was difficult to tell whether a group

strategy would really beat TFT because most of the “slave” group mem-

bers received far lower scores than the average level and were ranked at

the bottom of the table. The average score of the group strategies is not

necessarily higher than that of TFT.

The significance of group strategies may lie in their evolutionary character. None of the known strategies in IPD games is an evolutionarily stable strategy [Boyd and Lorberbaum (1987)]. The strategies that are most likely to be evolutionarily stable, such as AllD or GRIM, can resist the invasion of some types of strategies but cannot resist the invasion of others. For example, a small group of TFT strategies cannot invade a large population of AllD; however, STFT (suspicious TFT, which defects on the first move) can. There exists the possibility


that TFT can successfully invade a population of AllD indirectly. Suppose

that a large population of AllD is continuously attacked by small groups of

STFT. Because every invasion makes a small positive proportion of STFT

remain in the population of AllD, the number of STFT increases gradually.

When the number of STFT is large enough, a small group of TFT can

successfully invade and AllD will die out.

However, group strategies may be evolutionarily stable. By means of

cooperating with group members and defecting against non-group members,

a population of group strategies can prevent any foreigner from successfully

invading. This is, perhaps, the real value of group strategies.

2.3. Evolutionary Dynamics in Games

Traditional game theorists have developed several effective approaches to

study static games based on the assumption of rationality. By using

Von Neumann-Morgenstern utility, refinements of Nash equilibrium, and rea-

soning, both cooperative and non-cooperative games are analyzed within a

theoretical framework. However, in the area of repeated games, especially

in games where dynamics are concerned, few approaches from traditional

game theory are available.

Evolutionary game theory provides novel approaches to solve dynamic

games. If the precise length of an IPD is known to the players, then the

optimal strategy is to defect on each round. If the game has infinite length

or at least the players are not aware of the length of the game, there exists the possibility that cooperation emerges [Dugatkin (1989); Darwen and

Yao (2002); Akiyama and Kaneko (1995); Doebeli, Blarer, and Ackermann

(1997); Axelrod (1999); Glance and Huberman (1993, 1994); Ikegami and

Kaneko (1990); Schweitzer (2002)].

Nowak and May (1992, 1993) showed that cooperators and defectors

coexist in certain circumstances by introducing spatial evolutionary games, in which two types of players – cooperators, who always cooperate, and defectors, who always defect – are placed in a two-dimensional spatial array. In each round, every individual plays the PD game with its immediate neighbors. The selection scheme is that each lattice site is occupied either by its original owner or by one of its neighbors, depending on who scores the highest total in that round, and so on to the next round of the game. Simulation results show that cooperators remain a considerable percentage of the population in some cases, and defectors can invade any single lattice site but cannot occupy the whole area.
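A minimal Python sketch of this imitate-the-best-neighbor update (our own illustration; the non-periodic boundaries, the one-shot PD payoff, and the default payoff values, taken from Fig. 2.3 below, are simplifying assumptions):

    import numpy as np

    def step(grid, T=1.61, R=1.01, P=0.01, S=0.0):
        # One synchronous generation: every cell (1 = cooperator, 0 = defector)
        # plays the one-shot PD with its neighbours, then copies the strategy of
        # the highest-scoring player in its neighbourhood (itself included).
        n = grid.shape[0]
        payoff = {(1, 1): R, (1, 0): S, (0, 1): T, (0, 0): P}
        nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]
        score = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                for di, dj in nbrs:
                    if 0 <= i + di < n and 0 <= j + dj < n:
                        score[i, j] += payoff[(int(grid[i, j]),
                                               int(grid[i + di, j + dj]))]
        new = grid.copy()
        for i in range(n):
            for j in range(n):
                best = (i, j)
                for di, dj in nbrs:
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n and 0 <= jj < n and score[ii, jj] > score[best]:
                        best = (ii, jj)
                new[i, j] = grid[best]
        return new

Iterating step on a random initial grid should reproduce, at least qualitatively, the regimes described in the following paragraphs for the corresponding payoff values.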


When the parameters of the payoff matrix are set to be T = 2.8, R = 1.1,

P = 0.1, and S = 0 and the initial state is set to be a random mixture of the

two types of strategies, the evolutionary dynamics of the local interaction

model lead to a state where each player chooses the strategy Defect, the

only ESS in the prisoner’s dilemma. Figure 2.1 shows that the population

converges to a state where everyone defects and no Cooperate strategy

survives after 5 generations.

Fig. 2.1. Spatial Prisoner's Dilemma with the values T = 2.8, R = 1.1, P = 0.1, and S = 0; the panels show generations 1, 2, 3, and 6 [Nowak and May (1993)].

However, when the parameters of the payoff matrix are set to T = 1.2,

R = 1.1, P = 0.1, and S = 0, the evolutionary dynamics do not converge

to the stable state of defection. Instead, a stable oscillating state emerges in which cooperators and defectors coexist and some regions are occupied in turn by different strategies (Fig. 2.2).

Fig. 2.2. Spatial Prisoner's Dilemma with the values T = 1.2, R = 1.1, P = 0.1, and S = 0; the panels show generations 1, 2, 19, and 20 [Nowak and May (1993)].

Moreover, when the parameters of the payoff matrix are set to T = 1.61,

R = 1.01, P = 0.01, and S = 0, the evolutionary dynamics lead to a chaotic

state: regions occupied predominantly by Cooperators may be successfully


invaded by Defectors, and regions occupied predominantly by Defectors

may be successfully invaded by Cooperators.

Fig. 2.3. Spatial Prisoner's Dilemma with the values T = 1.61, R = 1.01, P = 0.01, and S = 0; the panels show generations 1, 3, 13, and 15 [Nowak and May (1993)].

If the starting configurations are sufficiently symmetrical, this spatial

version of the PD game can generate chaotically changing spatial patterns,

in which cooperators and defectors both persist indefinitely. For example, suppose we set R = 1, P = 0.01, S = 0 and T = 1.4, and the initial state is such that every individual in a square 69 × 69 lattice is a cooperator except a single defector in the middle of the lattice. The structure of the evolving lattice then varies like a kaleidoscope, and the ever-changing sequences of spatial patterns can be

very beautiful, as shown in Fig. 2.4. The role of the spatial interaction in

the evolution of cooperation is further studied by Durrett and Levin (1998),

Schweitzer, Behera, and Muhlenbein (2002), Ifti, Killingback, and Doebeli

(2004).

Fig. 2.4. Spatial Prisoner's Dilemma with the values T = 1.4, R = 1, P = 0.01, and S = 0, where blue, red, green, and yellow denote cooperators, defectors, new cooperators, and new defectors respectively; the panels show generations 10, 40, 4000, and 6000 [Nowak and May (1993)].


2.3.1. Evolutionarily stable strategy

Just like the Nash equilibrium in traditional game theory, Evolutionarily

Stable Strategy (ESS) is an important concept used in theoretical analysis

of evolutionary games. According to Maynard Smith (1982), an ESS is a

strategy such that, if all the members of a population adopt it, then no

mutant strategy could invade the population under the influence of natural

selection. ESS can be seen as an equilibrium refinement to the Nash equilib-

rium. Suppose that a player in a game can choose between two strategies:

I and J. Let E(J, I) denote the payoff he receives if he chooses the strategy J while all other players choose I.

Then, the strategy I is evolutionarily stable if either

(1) E(I, I) > E(J, I), or

(2) E(I, I) = E(J, I) and E(I, J) > E(J, J)

is true for all I ≠ J [Maynard Smith and Price (1973); Maynard Smith (1982)].

Thomas (1985) rewrites the definition of ESS in a different form. Fol-

lowing the terminology given in the first definition above, we have

(1) E(I, I) ≥ E(J, I), and

(2) E(I, J) > E(J, J)

From this alternative form of the definition, we find that the ESSs are just a subset of the Nash equilibria. The benefit of this refinement of Nash equilibrium is not just to eliminate weak Nash equilibria, but to provide an

efficient mathematical tool for dynamic games. Following the concept of

ESS, two approaches to evolutionary game theory have been developed.

The first approach directly applies the concept of ESS to analyze static

games. The second approach simulates the evolutionary process of dynamic

games by constructing a dynamic model, which may take into consideration

the factors of the population, replication dynamics, and strategy fitness.

As an example of using ESS in static games, consider the problem of the

Hawk-Dove game. Two types of animals employ different means to obtain

resources (a favorable habitat, for example) — Hawk always fights for some

resources while Dove never fights. Let V denote the value of the resources,

which can be considered as the gain in Darwinian fitness of an individual obtaining the resource, as described by Maynard Smith (1982). Let E(H, D) denote the

payoff to a Hawk against a Dove opponent. If we assume that (1) whenever

two Hawks meet, conflict eventually results and the two individuals are


equally likely to be injured, (2) the cost of the conflict reduces individual

fitness by some constant value C, (3) when a Hawk meets a Dove, the Dove

immediately retreats and the Hawk obtains the resource, and (4) when two

Doves meet the resource is shared equally between them, the payoff matrix for the Hawk-Dove game will look like this (the row player's payoff is listed first):

                          Hawk                    Dove
    Hawk     ((V − C)/2, (V − C)/2)              (V, 0)
    Dove              (0, V)                   (V/2, V/2)

Then, it is easy to verify that the strategy Dove is not an ESS because E(D, D) < E(H, D), which means that a pure population of Doves can be invaded by a Hawk mutant. In the case that the value V of the resource is greater than the cost C of injury, the strategy Hawk is an ESS because E(H, H) > E(D, H), which means that a Dove mutant cannot invade a group of Hawks. If V < C, the Hawk-Dove game becomes the game of Chicken, popularised by the 1955 movie Rebel Without a Cause. Neither pure Hawk nor pure Dove is an ESS in this game. However, there is an ESS if mixed strategies are permitted [Bishop and Cannings (1978)].
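As a worked illustration (ours, not part of the original text): when V < C, the mixed ESS can be found by requiring Hawk and Dove to earn the same expected payoff against the mixed population. Writing p for the probability of playing Hawk and using the payoff matrix above,

\[
p\,\frac{V-C}{2} + (1-p)\,V \;=\; p\cdot 0 + (1-p)\,\frac{V}{2}
\quad\Longrightarrow\quad p = \frac{V}{C},
\]

so the evolutionarily stable mixture plays Hawk with probability V/C.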

An evolutionarily stable state is a dynamical property of a population

to return to using a strategy, or mix of strategies, if it is perturbed from

that strategy, or mix of strategies [Maynard Smith (1982)]. A population playing an ESS must be evolutionarily stable because it is impossible for any mutant to

invade it. Many biologists and sociologists attempt to explain animal and

human behavior and social structures in terms of ESS [Cohen and Machalek

(1988); Mealey (1995)]. However, a dynamic game does not necessarily converge to a stable state in which an ESS is prevalent. For example, using a

spatial model in which each individual plays the Prisoner’s Dilemma with

his or her neighbors, Nowak and May (1992, 1993) show that the result of

the game depends on the specific form of the payoff matrix.

Now imagine a population of players in a society where each one has to

play Prisoner’s Dilemma with another and whether or not one can survive

and breed is determined by his payoff in the game. How will the population

evolve? In order to show the evolutionary process of the population, a model

of dynamics that takes time t into consideration is needed.

2.3.2. Genetic algorithm

A genetic algorithm maintains a population of sample points from the

search space. Each point is represented by a string of characters, known


as a genotype [Holland (1975, 1992, 1995)]. By defining a fitness function to evaluate the genotypes, a genetic algorithm proceeds to initialize a population of solutions randomly, and then improves it through repeated application of mutation, crossover, and selection operators.
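As a minimal illustration (our own sketch, not Axelrod's original encoding), the following Python fragment evolves deterministic memory-one IPD strategies, encoded as five bits (the first move plus the responses to the outcomes CC, CD, DC, DD), using truncation selection, one-point crossover, and bit-flip mutation against a small fixed set of opponents; all of these choices are simplifying assumptions:

    import random

    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

    def play(g1, g2, rounds=100):
        # g = (first move, response after CC, CD, DC, DD); 1 means cooperate
        def mv(g, mine, theirs):
            if mine is None:
                return 'C' if g[0] else 'D'
            return 'C' if g[1 + {'CC': 0, 'CD': 1, 'DC': 2, 'DD': 3}[mine + theirs]] else 'D'
        a = b = None
        total = 0
        for _ in range(rounds):
            a, b = mv(g1, a, b), mv(g2, b, a)
            total += PAYOFF[(a, b)]
        return total                        # score of the first player only

    OPPONENTS = [(1, 1, 1, 1, 1), (0, 0, 0, 0, 0), (1, 1, 0, 1, 0)]  # AllC, AllD, TFT

    def fitness(g):
        return sum(play(g, opp) for opp in OPPONENTS)

    def evolve(pop_size=20, generations=50, p_mut=0.05):
        pop = [tuple(random.randint(0, 1) for _ in range(5)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[:pop_size // 2]                  # truncation selection
            children = []
            while len(parents) + len(children) < pop_size:
                p, q = random.sample(parents, 2)
                cut = random.randint(1, 4)                 # one-point crossover
                child = [p[k] if k < cut else q[k] for k in range(5)]
                child = tuple(bit ^ (random.random() < p_mut) for bit in child)  # mutation
                children.append(child)
            pop = parents + children
        return max(pop, key=fitness)

    print(evolve())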

The common methodology to study the evolutionary dynamics in games

is through replicator equations. Replicator equations usually assume in-

finite populations, continuous time, complete mixing and that strategies

breed true [Taylor (1979); Maynard Smith (1982); Weibull (1995); Hofbauer

and Sigmund (1998)]. Originating in biology and then introduced into evolutionary game theory by Taylor and Jonker (1978), replicator equations provide a continuous dynamic model for evolutionary games.

Consider a population of n types of strategies, and let x_i be the frequency of type i. Let A be the n × n payoff matrix. With the assumptions that the population is infinitely large, strategies are completely mixed, and the x_i are differentiable functions of time t, a strategy's fitness, or expected payoff, can be written as (Ax)_i if strategies meet one another randomly. The average fitness of the population as a whole can be written as x^T Ax. Then, the replicator equation is

    dx_i/dt = x_i ( (Ax)_i − x^T Ax )                                (2.1)

Evolutionary games with a replicator dynamic as described in (2.1) will converge to a result that strategies with strong fitness bloom in the population.

For the Prisoner's Dilemma, the expected fitness of the strategies Cooperate and Defect, E_C and E_D respectively, are

    E_C = x_C R + x_D S ,   and   E_D = x_C T + x_D P               (2.2)

where x_C and x_D denote the proportions of the strategies of Cooperate and Defect in the population respectively. Let E denote the average fitness of the entire population; then

    E = x_C E_C + x_D E_D                                            (2.3)

Then, the replicator equations for this game are

    dx_C/dt = x_C (E_C − E) ,   dx_D/dt = x_D (E_D − E)              (2.4)

Since T > R and P > S, E_D − E_C = x_C (T − R) + x_D (P − S) > 0 holds, and there must be E_D > E > E_C. Therefore, dx_C/dt < 0 and dx_D/dt > 0. This means that the number of the strategies of Cooperate will always decline while the number of the strategies of Defect increases as the


game goes on. Sooner or later, the proportion of the population choosing the strategy Cooperate will, in theory, fall to zero.
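A minimal numerical illustration of equations (2.2)-(2.4) (our own sketch, using simple Euler integration and the payoff values T = 5, R = 3, P = 1, S = 0):

    def replicator_pd(xC=0.99, T=5, R=3, P=1, S=0, dt=0.01, steps=2000):
        # Euler integration of the two-strategy replicator equations (2.4)
        for _ in range(steps):
            xD = 1 - xC
            eC = xC * R + xD * S        # expected fitness of Cooperate, eq. (2.2)
            eD = xC * T + xD * P        # expected fitness of Defect
            e = xC * eC + xD * eD       # average fitness, eq. (2.3)
            xC += dt * xC * (eC - e)    # eq. (2.4); xD follows from xC + xD = 1
        return xC

    print(replicator_pd())              # the cooperator share decays towards zero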

Besides replicator dynamics, there exist other types of dynamic equations that can be used in modeling evolutionary systems [Akin (1993);

Thomas (1985); Bomze (1998, 2002); Balkenborg and Schlag (2000); Cress-

man, Garay and Hofbauer (2001); Weibull (1995); Hofbauer (1996); Gilboa

and Matsui (1991); Matsui (1992); Fudenberg and Levine (1998); Skyrms

(1990); Swinkels (1993); Smith and Gray (1994)]. Lindgren (1995) and

Hofbauer and Sigmund (2003) have given a comprehensive review of them.

In general, dynamic games are of great complexity. How an evolutionary

system evolves depends not only on the population and dynamic structures

but also on where the evolution starts. Because of dynamic interactions

between multiple players, especially those players with intelligence, genetic

algorithms may converge towards local optima rather than the global opti-

mum. Also, operating on dynamic data sets is difficult as genomes begin to

converge early on towards solutions which may no longer be valid for later

data [Michalewicz (1999); Schmitt (2001)]. Analysis of evolutionary dynamic systems is not just a problem of evolutionary game theory, but a new

direction in applied mathematics [Garay and Hofbauer (2003); Gaunersdor-

fer (1992); Gaunersdorfer, Hofbauer, and Sigmund (1991); Hofbauer (1981,

1984, 1996); Krishna and Sjostrom (1998); Plank (1997); Smith (1995);

Zeeman (1993), Zeeman and Zeeman (2002, 2003)].

2.3.3. Strategies

What strategies should be involved in evolutionary dynamics is a difficult

question. One approach is to take into consideration many representative strategies, for example Axelrod (1984), Dacey and Pendegraft (1988), and Akimov and Soutchanski (1994), since it is impossible to enumerate all possible strategies. However, it is difficult to say which strategies should be included and which should not, and there is little comparability between evolutionary processes with different strategies because the selection of strate-

gies may have great influence on the outcome of the dynamics. Another

approach is to study the interactions between specific strategies, for exam-

ple Nowak and Sigmund (1990, 1992) and Goldstein and Freeman (1990). In

this way, it is convenient to make clear the relationship between strategies in the evolutionary process; however, some of the generality of complex evolutionary systems is lost.


Strategies in PD games (or in non-PD games) can be characterized as

either deterministic or stochastic. Deterministic strategies leave nothing to chance and respond to the opponent with predetermined actions; stochastic

strategies, however, leave some uncertainty in their choices.

Oskamp (1971) presents a thorough review of the early studies on the

strategies involved in PD games and non-PD games, for example AllD,

TFT, and many stochastic strategies that play C or D with certain

probabilities [Lave (1965); Bixenstine, Potash, and Wilson (1963); Solomon

(1960); Crumbaugh and Evans (1967); Wilson (1969); Oskamp and Perlman

(1965); Sermat (1967); Heller (1967); Knapp and Podell (1968); Lynch

(1968); Swingle and Coady (1967); Whitworth and Lucker (1969)].

After Axelrod's IPD tournaments, memory-one strategies that interact with the opponent according to both sides' behavior in the previous move became prevalent. TFT, Pavlov, Grim Trigger, and many other memory-one strategies have been analyzed in a variety of environments: round-robin tourna-

ments, evolutionary dynamics with or without noise [Nowak and Sigmund

(1990, 1992, 1993); Pollock (1989); Wedekind and Milinski (1996); Milinski

and Wedekind (1998); Sigmund (1995); Stephens (2000); Stephens, Mclinn

and Stevens (2002); Sandholm and Crites (1996); Doebeli and Knowlton

(1998); Brauchli, Killingback and Doebeli (1999); Sasaki, Taylor and Fu-

denberg (2000)].

No strategy has been shown to be superior in a dynamic environment,

and even deterministic cooperators can invade defectors in specific circum-

stances. It is not sensible to discuss which strategy is best unless the context

is defined. Comparing TFT with GTFT, Grim (1995) suggests that, in the

non-stochastic Axelrod models, it is TFT that is the general winner; within

a purely stochastic model, the greater generosity of GTFT pays off; in a

model with both stochastic and spatial elements, a level of generosity twice

that of GTFT proves optimal. Pavlov has an obvious advantage over TFT

in noisy environments [Nowak and Sigmund (1993); Kraines and Kraines

(1995)]. In an evolutionary process where AllC, AllD, TFT, and GTFT

strategies are involved, evolution starts off toward defection but then veers

toward cooperation. TFT strategies play a key role in invading the pop-

ulation of defectors. However, GTFT strategies and then more generous

AllCs gradually become dominant once cooperation is widely established,

and this provides an opportunity for AllD to invade again [Nowak and Sig-

mund (1992)]. Additionally, Selten and Stoecker (1986) have studied the

end game behavior in finite IPD supergames, and find that cooperative

behaviors last until shortly before the end of the supergame.


Machine Learning approaches have been introduced into evolutionary

game theory to develop adaptive strategies, especially those for IPD games

[Carmel and Markovitch (1996, 1997, 1998); Littman (1994); Tekol and

Acan (2003); Hingston and Kendall (2004)]. Adaptive strategies, at least

in theory, have obvious advantages over fixed strategies. Among the set of

adaptive strategies, there may be an evolutionarily stable strategy for IPD games and a potential winner of future IPD tournaments.

2.3.4. Population

Population size and structure are of great importance in evolutionary dy-

namics. In general, evolutionary processes in a large population are quite different from those in small populations [Maynard Smith (1982); Fogel and

Fogel (1995); Fogel, Fogel and Andrew (1997, 1998); Ficici and Pollack

(2000)].

Young and Foster (1991) have studied stochastic effects in a population

consisting of three strategies: AllD, AllC, and TFT. They show that the

outcome of the evolutionary process depends crucially on the amount of

noise, which is inversely proportional to the population size. The more

people there are, the more that random variations in their behavior are

smoothed out in the population proportions. For large populations, the

system tends to drift from TFT to AllC, which is then invaded by AllD.

As a result, most of the players behave as AllD, even though initially most

players may have started as TFT. They conclude that cooperation is viable

in the short run, but not stable in the long run in a large population.

Boyd and Richerson (1988, 1989) suggest that reciprocity is unlikely

to evolve in large groups as a result of natural selection because reciproca-

tors punish defection by withholding future cooperation which will penalize

other cooperators in the group. Boyd and Richerson (1990, 1992) analyze

a model in which the punishment response to defection is directed solely at

defectors. In this model, cooperation reinforced by retribution can lead to

the evolution of cooperation in different ways. There is the possibility that

strategies which cooperate and punish defectors, strategies which cooperate

only if punished, and strategies which cooperate but do not punish coexist

in the long run, as well as the possibility that only one type exists. As the

group size grows larger, however, the conditions for cooperators to survive become more difficult to satisfy.

Glance and Huberman (1994) discuss how to achieve cooperation in

groups of various sizes in n-person PD games and find that there are two


stable points in large groups: either there is a great deal or very little co-

operation. Cooperation is more likely in smaller groups than in larger ones

and there is greater cooperation when players are allowed more communica-

tion with each other. Large random fluctuations are related to group size.

Groups beyond a certain size may experience increased difficulty of informa-

tional exchange and coordination; further, reneging on contracts is likely to become prevalent as each member may expect that the effect of his/her action

on other members will be diluted. However, Dugatkin (1990) finds that co-

operation may invade large populations more easily than smaller ones, but it

is likely to represent a smaller proportion of the population in larger groups.

In order to consider the potential importance of the relationship between

population size and cooperative behaviour, two N-person game theoretical

models are presented. The results show that cooperation is frequently not a

pure evolutionarily stable strategy, and that many metapopulations should

be polymorphic for both cooperators and defectors.

It is well accepted that communication among members of a society

leads to more cooperative behaviors [Insko et al. (1987); Orbell, Kragt,

and Dawes (1988)]. Insko et al. (1987, 1988, 1990, 1993) explore the

role of communication on interindividual-intergroup discontinuity in the

context of the extended PD game that adds a third withdrawal choice

to the usual cooperative and uncooperative choices, and interindividual-

intergroup discontinuity is the tendency of intergroup relations to be more

competitive and less cooperative than interindividual relations. The lesser

tendency of individuals to cooperate when there is no communication with

the opponent partially explains the group discontinuity.

Choice and refusal of partners may accelerate the emergence of coop-

eration. Experiments have shown that people who are given the option of

playing or not are more likely to choose to play if they are themselves plan-

ning to cooperate. More cooperative players are more likely to anticipate

that others will be cooperative [Orbell and Dawes (1993)]. Defecting players may be ostracised by cooperators [Schuessler (1989); Kitcher (1992); Batali and Kitcher (1994)]. In the N-person PD game, players may be able to change groups if they are not satisfied with the size of their groups [Hirshleifer and Rasmusen (1989)]. The option of choice and refusal of

partners in IPD means that players will attempt to select partners ratio-

nally. Analytical studies reveal that the subtle interplay between choice

and refusal in N-player IPD games can result in various long-run player

interaction patterns: mutual cooperation; mixed mutual cooperation and

mutual defection; parasitism; and wallflower seclusion. Simulation studies


indicate that choice and refusal can accelerate the emergence of coopera-

tion in evolutionary IPD games [Stanley, Ashlock, and Tesfatsion (1994);

Stanley, Ashlock and Smucker (1995)].

The effects of freedom to play, reciprocity and interchange, coalitions

and alliances, and various sizes of groups on evolution have also been studied [Or-

bell and Robyn (1993); Alexander and Frans (1992); Glance and Bernardo

(1994); Hemelrijk (1991)]. In a specific scenario, the prestructuration of the

population may determine the evolution of the patterns of interaction that

constitute the final social structure [Eckert, Koch, and Mitlohner (2005)].

2.3.5. Selection scheme

Evolutionary selection schemes can be characterized as either generational

or steady-state schemes [Thierens (1997)]. In generational schemes, which are widely used in evolutionary game theory, each generation of a population is replaced in one step by a new generation. In a system with a

steady-state scheme only a small percentage of the population is replaced in

each generation. Evolutionary selection schemes can be further subdivided

as pure or elitist selection schemes in terms of whether or not there is an

overlap between successive generations. Pure selection schemes allow no

overlap between successive generations: all parents from the previous generation are discarded and the next generation is filled entirely with offspring

from these parents. In elitist schemes, successive generations may overlap: parents with higher fitness are transferred to the next generation

and only poorly performing parents are replaced [Mitchell (1996)].

Pure selection schemes are commonly used in IPD research [Axelrod (1987); Axelrod and Dion (1988); Huberman and Glance (1993); Akimov and Soutchanski (1994); Mill (1996)]. These schemes use fitness-proportional selection of the parents in combination with single-point crossover, or use a simple uniform random sample to select the fittest agent

to produce offspring. A robust society of cooperators emerges only if the

level of competition between the players is neither too small nor too large.

In elitist selection schemes, the population is first shuffled randomly and partitioned into pairs of parents. Then, each pair of parents creates two offspring, and a local competition between the parents and their offspring is held. Finally, the best two players of each such family are transferred to the next generation [Thierens and Goldberg (1994)]. In this case, stable societies of highly cooperative players evolve. This shows that a suitable model of the selection process is of crucial importance in terms of simulating


real-world economic situations [Ficici, Melnik, and Pollack (2000); Bragt,

Kemenade and Poutre (2001)].

Selection is clearly an important genetic operator, but opinion is di-

vided over the importance of crossover versus mutation. Some argue that

crossover is the most important, while mutation is only necessary to ensure

that potential solutions are not lost [Grefenstette, Ramsey and Schultz

(1990); Wilson (1987)]. Others argue that crossover in a largely uniform

population only serves to propagate innovations originally found by muta-

tion, and in a non-uniform population crossover is nearly always equivalent

to a very large mutation [Spears (1992)].

2.4. Evolution of Cooperation

A fundamental problem in evolutionary game theory is to explain how

cooperation can emerge in a population of self-interested individuals. Axelrod (1984, 1987) attributes the emergence of cooperation to the “shadow of the future”: the likelihood and importance of future interaction. This implies that the rewards from cooperation are mutually expected payoffs and that to cooperate is a rational choice for self-interested individuals [Martinez-Coll and Hirshleifer (1991)]. Axelrod’s work has been

subjected to a number of criticisms because his conclusions obviously con-

flict with traditional game theory [Binmore (1994, 1998)], as in Nachbar’s

criticism that “Axelrod mistakenly ran an evolutionary simulation of the

finitely repeated Prisoners’ Dilemma. Since the use of a Nash equilibrium

in the finitely repeated Prisoners’ Dilemma necessarily results in both play-

ers always defecting, we then wouldn’t need a computer simulation to know

what would survive if every strategy were present in the initial population of

entries. The winning strategies would never co-operate.” [Nachbar (1992)].

There are also arguments that the conflict stems from the assumption of

Von Neumann-Morgenstern utility. According to Spiro (1988), the prob-

lem with Axelrod’s argument is the oft-discussed problem of interpersonal

utility comparison. Axelrod’s argument, and all game theoretic modeling,

welfare economics, and utilitarian moral philosophy, in fact, would require

that it be possible for one to measure and compare the utilities of different

people. The problem with this assumption is that it is quite impossible to

construct a scale of measurement for human preferences [Rothbard (1997)].

Evolutionary game theory is aimed primarily at dynamic games, while traditional game theory deals with non-dynamic games, but there are still areas of intersection, for instance in the field of repeated games.


Furthermore, although evolutionary game theory mainly depends on ex-

periments and computer simulations, its theoretical foundations, i.e. indi-

vidual utility (or preference) and payoff-maximizing, stem from traditional

game theory. Controversies about Axelrod’s work reflect the bifurcation

between evolutionary approaches and the basic assumptions of game the-

ory. Based on the assumption of “rational players”, traditional game theory

regards a finite repeated game as a combination of many singleton games.

“Backward induction” is applied in order to dissect the link between these

singleton games, and then each of them can be analyzed statically [Harsanyi

and Selten (1988)]. The concept of backward induction was first employed

by Von Neumann and Morgenstern (1944) and then developed by Selten

(1965, 1975) based on Nash equilibrium. First, one determines the optimal

strategy of the player who makes the last move of the game. Then, the

optimal action of the next-to-last moving player is determined taking the

last player’s action as given. The process continues in this way backwards

through time until all players’ actions have been determined. Subgame

perfect Nash equilibrium deduced directly from backward induction is an

equilibrium such that players’ strategies constitute a Nash equilibrium in

every subgame of the original game [Aumann (1995)]. Selten proved that

any game which can be broken into “sub-games” containing a sub-set of all

the available choices in the main game will have a subgame perfect Nash

equilibrium. In the case of a finite number of iterations in IPD games, the

unique subgame perfect Nash equilibrium is AllD. However, many psycho-

logical and economic experiments have shown that subjects would not nec-

essarily apply a strategy like AllD [Kahn and Murnighan (1993); McKelvey

and Palfrey (1992); Cooper et al. (1996)]. Game theorists explain these

experimental results in terms of incomplete information, reputation, and

bounded rationality, which are all based on theoretical analysis [Harsanyi

(1967); Kreps et al. (1982); Simon (1990); Bolton (1991); Bolton and Ock-

enfels (2000); Binmore et al. (2002); Samuelson (2001)]. In some sense,

Axelrod's work parallels these explanations, but his approach seems quite different. Until a sound theoretical explanation is established, the problem of how cooperation emerges remains unsolved.
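As a rough, hedged illustration of the backward-induction argument above (not a reproduction of any cited model), the sketch below reasons from the last round of an n-round Prisoner's Dilemma backwards, assuming the opponent also plays the equilibrium action; since the continuation value is the same constant whichever move is chosen now, defection remains strictly better in every round.

    T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker payoffs

    def backward_induction(rounds_left):
        # Return (plan, total payoff) for a player best-responding to a
        # rational opponent, working backwards from the final round.
        if rounds_left == 0:
            return [], 0
        plan, future = backward_induction(rounds_left - 1)
        # In equilibrium the opponent defects now, so the choice is between
        # P (defect) and S (cooperate) plus the same continuation value.
        if P + future >= S + future:
            return ["D"] + plan, P + future
        return ["C"] + plan, S + future

    print(backward_induction(5))  # (['D', 'D', 'D', 'D', 'D'], 5): defect in every round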

As to the problem of how cooperation can persist during evolution, suffi-

cient evidence has been provided to support the point that cooperation can

survive and flourish in a wide range of circumstances, provided certain conditions

are satisfied. Nowak and Sigmund (1990) have shown that cooperation can

emerge among a population of randomly chosen reactive strategies, as long

as a stochastic version of TFT is added to the population. If cooperators


can recognize each other with the help of some label, they can increase their

payoff by interacting selectively with one another [Frank (1988)]. Social

norms aid in cooperation in many ways [Bendor and Mookherjee (1990);

Kandori (1992); Sethi and Somanathan (1996)]. As to the influence of

payoff variations, Mueller (1988) finds that payoff settings with increas-

ing values of T relative to P promote cooperative behaviour, while Fogel (1993) argues that smaller values of T promote the evolution of cooperative behaviour. Nachbar (1992) selects a payoff setting strongly favouring

the relative reward of cooperating and finds that this setting elicits an in-

creased degree of cooperation. Kirchkamp (1995) finds that the value of S

becomes less important with longer memory. Also, the effects of popula-

tion structure, repetition, and noise have been studied [Hirshleifer and Coll

(1988); Mueller (1988); Boyd (1989); Marinoff (1992); Hoffmann (2001)].

To end, we note that Binmore (1998) stated:

“. . .One simply cannot get by without learning the underlying theory.

Without any knowledge of the theory, one has no way of assessing the

reliability of a simulation and hence no idea of how much confidence to

repose in the conclusions that it suggests”.

There is still a need for an underlying theory for IPD tournaments. Evo-

lutionary game theory has provided us with many experimental approaches;

however, better theoretical explanations are still needed. Even though IPD

tournaments have been run for over 40 years, we suspect there will be more

as we search for new strategies and new theories which explain the complex

interactions that take place.

Finally, this review has been restricted to the IPD literature. Even so,

we have not been able to include every article and there are, no doubt, omis-

sions. However, we hope that this chapter has provided enough information

for the interested reader to follow up on.

References

Akimov V. and Soutchanski M. (1994) Automata simulation of N-person social

dilemma games, Journal of Conflict Resolution, 38, pp. 138-148.

Akin E. (1993) The general topology of dynamical systems, American Mathematical Society, Providence.

Akiyama E. and Kaneko K. (1995) Evolution of cooperation, differentiation, com-

plexity and diversity in an iterated three-person game, Artificial Life, 2,

pp. 293-304.

Alexander H. and Frans B. (1992) Coalitions and Alliances in Humans and Other

Animals. Oxford: Oxford University Press.


Anthonisen N. (1999) Strong rationalizability for two-player noncooperative

games, Economic Theory, 13, pp. 143-169.

Aumann R. (1995) Backward Induction and Common Knowledge of Rationality,

Games and Economic Behavior, 18, pp. 6-19.

Axelrod R. (1980a) Effective choice in the prisoner’s dilemma, Journal of Conflict

Resolution, 24, pp. 3-25.

Axelrod R. (1980b) More effective choice in the prisoner’s dilemma, Journal of

Conflict Resolution, 24, pp. 379-403.

Axelrod R. M. (1984). The Evolution of Cooperation (BASIC Books, New York).

Axelrod R. (1987) The evolution of strategies in the iterated prisoner’s dilemma,

In Davis L., Genetic Algorithms and Simulated Annealing, pp. 32-41.

Axelrod R. (1999) The Complexity of Cooperation: Agent-based Models of Com-

petition and Collaboration. University Press, Princeton, NJ.

Axelrod R. and Dion D. (1988) The further evolution of cooperation, Science,

242, pp. 1385-1390.

Axelrod R. and Hamilton W. (1981) The evolution of cooperation, Science, 211,

4489, pp. 1390-1396.

Balkenborg D. and Schlag K. (2000) Evolutionarily stable sets, International

Journal of Game Theory, 29, pp. 571-595.

Batali J. and Kitcher P. (1994) Evolutionary dynamics of altruistic behaviour

in optional and compulsory versions of the iterated prisoner’s dilemma, In

Rodney A. and Maes P. Artificial Life IV. MIT Press, pp. 343-348.

Beaufils B., Delahaye J., and Mathieu P. (1996) Our meeting with gradual: A

good strategy for the iterated prisoner’s dilemma, Proceedings of the Arti-

ficial Life V, pp. 202-209.

Becker N. and Cudd A. (1990) Indefinitely repeated games: a response to Carroll,

Theory and Decision, 28, pp. 189-195.

Bendor J. and Mookherjee D. (1990) Norms, third-party sanctions, and cooper-

ation, Journal of Law, Economics, and Organization, 6, pp. 33-63.

Bendor R., Kramer M., and Stout S. (1991) When in doubt: cooperation in a

noisy prisoner’s dilemma, Journal of Conflict Resolution, 35, pp. 691-719.

Ben-porath E. (1990) The complexity of computing a best response automaton

in repeated games with mixed strategies, Games and Economic Behavior,

2, pp. 1-12.

Berry D. and Fristedt B. (1985) Bandit problems: sequential allocation of experi-

ments. Chapman and Hall, London.

Binmore K. (1992) Fun and games. Lexington, MA: D.C. Heath and Company.

Binmore K. (1994) Playing fair game theory and the social contract I. MIT Press.

Binmore K. (1997) Rationality and backward induction, Journal of Economic

Methodology, 4, pp. 23-41.

Binmore K. (1998) Review of R. Axelrod’s ‘The complexity of cooperation: agent

based models of competition and collaboration’, Journal of Artificial Soci-

eties and Social Simulation, 1, 1.

Binmore K., McCarthy J., Ponti G., Samuelson L. and Shaked A. (2002) A back-

ward induction experiment, Journal of Economic Theory, 104, pp. 48-88.


Bishop, D. and Cannings, C. (1978) A generalized war of attrition, Journal of

Theoretical Biology, 70, pp. 85-124.

Bixenstine V., Potash H., and Wilson K. (1963) Effects of level of cooperative

choice by the other player on choices in a Prisoner’s Dilemma game, Journal

of Abnormal and Social Psychology, 66, pp. 308-313.

Bolton G. (1991) A comparative model of bargaining: theory and evidence, The

American Economic Review, 81, 5, pp. 1096-1136.

Bolton G. and Ockenfels A. (2000) ERC: a theory of equity, reciprocity, and

competition, The American Economic Review, 90, pp. 166-193.

Bomze I. (1998) Uniform barriers and evolutionarily stable sets, Game Theory,

Experience, Rationality, pp. 225-244.

Bomze I. (2002) Regularity vs. degeneracy in dynamics, games, and optimization:

a unified approach to different aspects, SIAM Review, 44, pp. 394-414.

Boyd R. (1989) Mistakes allow evolutionary stability in the repeated prisoner’s

dilemma game, Journal of Theoretical Biology, 136, 11, pp. 47-56.

Boyd R. (1992) The evolution of reciprocity when conditions vary, Harcourt A.

and Frans B. (eds.) Alliance formation among male baboons: shopping for

profitable partners. Oxford: Oxford University Press, pp. 473-489.

Boyd R. and Lorberbaum J. (1987) No pure strategy is evolutionarily stable in

the repeated Prisoner’s Dilemma game, Nature, 327, pp. 58-59.

Boyd R. and Richerson P. (1988) The evolution reciprocity in sizable groups,

Journal of Theoretical Biology, 132, pp. 337-356.

Boyd R. and Richerson P. (1989) The evolution of indirect reciprocity, Social

Networks, 11, pp. 213-236.

Boyd R. and Richerson P. (1990) Group selection among alternative evolutionarily

stable strategies. Journal of Theoretical Biology, 145, pp. 331-342.

Boyd R. and Richerson P. (1992) Punishment allows the evolution of cooperation

(or anything else) in sizable groups, Ethology and Sociobiology, 13, pp. 171-

195.

Bovens L. (1997) The backward induction argument for the finite iterated prison-

ers dilemma and the surprise exam paradox, Analysis, 57, 3, pp. 179-186.

Bragt D., Kemenade C. and Poutre H. (2001) The influence of evolutionary selec-

tion schemes on the iterated prisoner’s dilemma, Computational Economics,

17, pp. 253-263.

Brauchli K., Killingback T. and Doebeli M. (1999) Evolution of cooperation

in spatially structured populations, Journal of Theoretical Biology, 200,

pp. 405-417.

Brelis M. (1992) Reputed mobster defends his honor. Boston Globe, 1, pp. 23.

Bunn G. and Payne R. (1988) Tit-for-tat and the negotiation of nuclear arms

control, Arms Control, 9, pp. 207-233.

Carmel D. and Markovitch S. (1996) Learning models of intelligent agents, Pro-

ceedings of the 13th National Conference on Artificial Intelligence and the

8th Innovative Applications of Artificial Intelligence Conference, 2, pp. 62-

67.


Carmel D. and Markovitch S. (1997) Model-based learning of interaction strate-

gies in multi-agent systems, Journal of Experimental and Theoretical Arti-

ficial Intelligence, 10, 3, pp. 309-332.

Carmel D. and Markovitch S. (1998) How to explore your opponent’s strategy

(almost) optimally, Proceedings of the International Conference on Multi

Agent Systems, pp. 64-71.

Cohen L. and Machalek R. (1988) A general theory of expropriative crime: an

evolutionary ecological approach, American Journal of Sociology, 94, 3,

pp. 465-501.

Cooper R., Jong D., Forsythe R., and Ross T. (1996) Cooperation without repu-

tation: experimental evidence from prisoner’s dilemma games, Games and

Economic Behavior, 12, 2, pp. 187–218.

Cressman R., Garay J. and Hofbauer J. (2001) Evolutionary stability concepts for

N-species frequency-dependent interactions, Journal of Theoretical Biology,

211, pp. 1-10.

Croson R. (2000) Thinking like a game theorist: Factors affecting the frequency

of equilibrium play, Journal of Economic Behavior and Organization, 41,

3, pp. 299–314.

Crumbaugh C. and Evans G. (1967) Presentation format, other-person strategies,

and cooperative behaviour in the prisoner’s dilemma, Psychological Reports,

20, pp. 895-902.

Dacey R. and Pendegraft N. (1988) The optimality of Tit-For-Tat, International

Interactions, 15, pp. 45-64.

Darwen P. and Yao X. (1995) On evolving robust strategies for iterated prisoner’s

dilemma, Progress in Evolutionary Computation, volume 956 in Lecture

Notes in Artificial Intelligence, Springer, pp. 276-292.

Darwen P. and Yao X. (1996) Automatic modularization by speciation, IEEE

International Conference on Evolutionary Computation, pp. 88-93.

Darwen P. and Yao X. (2001) Why more choices cause less cooperation in Iterated

Prisoner’s Dilemma, Proceedings of the 2001 IEEE Congress on Evolution-

ary Computation.

Darwen P. and Yao X. (2002) Coevolution in iterated prisoner’s dilemma with

intermediate levels of cooperation: Application to missile defense, Interna-

tional Journal of Computational Intelligence and Applications, 2, 1, pp. 83-

107.

Davis D. and Holt C. (1999) Equilibrium cooperation in two-stage games: Exper-

imental evidence, International Journal of Game Theory, 28, 1, pp. 89-109.

Delahaye J. and Mathieu P. (1996) Etude sur les dynamiques du Dilemme Itere

des Prisonniers avec un petit nombre de strategies : Y a-t-il du chaos dans

le Dilemme pur?, Publication Interne IT-294, Laboratoire d’Informatique

Fondamentale de Lille.

Doebeli M., Blarer A., and Ackermann M. (1997) Population dynamics, demo-

graphic stochasticity, and the evolution of cooperation, Proceedings of Na-

tional Academy Society of USA, 94: 5167–5171.

Doebeli M. and Knowlton N. (1998) The evolution of interspecific mutualisms,

Proceedings of the National Academy of Sciences, 95(15): 8676-8680.


Donninger C. (1986) Is it always efficient to be nice?, In Paradoxical effects of so-

cial behavior, edited by Dickmann A. and Mitter P., Heidelberg, Germany:

Physica Verlag, pp. 123-134.

Dugatkin L. (1989) N-person games and the evolution of cooperation: a model

based on predator inspection in fish, Journal of Theoretical Biology, 142,

pp. 123–135.

Dugatkin L. (1990) N-person Games and the Evolution of Co-operation: A Model

Based on Predator Inspection in Fish, Journal of Theoretical Biology, 142,

pp. 123-135.

Durrett R. and Levin S. (1998) Spatial aspects of interspecific competition, The-

oretical Population Biology, 53, 1, pp. 30-43.

Eckert D., Koch S., and Mitlohner J. (2005) Using the iterated prisoner’s dilemma

for explaining the evolution of cooperation in open source communities,

Proceedings of the First Conference on Open Source System, pp. 186-191.

Ficici S., Melnik O., and Pollack J. (2000) A game-theoretic investigation of

selection methods used in evolutionary algorithms, Proceedings of the 2000

Congress on Evolutionary Computation, 2, pp. 880-887.

Ficici S. and Pollack J. (2000) Effects of finite populations on evolutionary stable

strategies, Proceedings of the 2000 Genetic and Evolutionary Computation,

pp. 927-934.

Fogel D. (1993) Evolving behaviors in the iterated prisoners dilemma, Evolution-

ary Computation, 1, 1, pp. 77-97.

Fogel D. and Fogel G. (1995) Evolutionary stable strategies are not always stable

under evolutionary dynamics, Evolutionary Programming IV, pp. 565-577.

Fogel D., Fogel G., and Andrew P. (1997) On the instability of evolutionary stable

strategies, BioSystems, 44, pp. 135-152.

Fogel G., Andrew P., and Fogel D. (1998) On the instability of evolutionary stable

strategies in small populations, Ecological Modelling, 109, pp. 283-294.

Frank R. (1988) Passions within reason. The strategic role of the emotions, New

York: W.W. Norton & Co.

Freund Y., Kearns M., Mansour Y., Ron D., Rubinfeld R., and Schapire R.

(1995) Efficient algorithms for learning to play repeated games against com-

putationally bounded adversaries, Proceedings of the Annual Symposium on

the Foundations of Computer Science, pp. 332–341.

Fudenberg D. and Maskin E. (1986) The Folk Theorem in repeated games with

discounting and incomplete information, Econometrica, 54, pp. 533–554.

Fudenberg D. and Maskin E. (1990) Evolution and cooperation in noisy repeated

games, New Developments in Economic Theory, 80, pp. 274-279.

Fudenberg D. and Levine D. (1998) The theory of learning in games. MIT Press.

Garay B. and Hofbauer J. (2003) Robust permanence for ecological differential

equations: minimax and discretizations, SIAM Journal on Mathematical

Analysis, 34, pp. 1007-1093.

Gaunersdorfer A. (1992) Time averages for heteroclinic attractors, SIAM Journal

on Applied Mathematics, 52, pp. 1476-1489.

Gaunersdorfer A., Hofbauer J., and Sigmund K. (1991) On the dynamics of asym-

metric games, Theoretical Population Biology, 39, pp. 345-357.


Gilboa I. and Matsui A. (1991) Social stability and equilibrium, Econometrica,

59, pp. 859-867.

Gilboa I. and Schmeidler D. (2001) A theory of case-based decisions. Cambridge

University Press.

Gittins J. (1989) Multi-armed bandit allocation indices. Wiley, Chichester, NY.

Glance N. and Huberman B. (1993) The outbreak of cooperation, Journal of

Mathematical sociology, 17, 4, pp. 281–302.

Glance N. and Huberman B. (1994) The dynamics of social dilemmas, Scientific

American, 270, pp. 76-81.

Glomba M., Filak T., and Kwasnicka H. (2005) Discovering effective strategies for

the iterated prisoner’s dilemma using genetic algorithms, 5th International

Conference on Intelligent Systems Design and Applications, pp. 356-363.

Godfray H. (1992) The evolution of forgiveness, Nature, 355, pp. 206-207.

Goldstein J. and Freeman J. (1990) Three-Way Street: Strategic Reciprocity in

World Politics. Chicago: University of Chicago Press.

Grefenstette J., Ramsey C., and Schultz A. (1990) Learning sequential deci-

sion rules using simulation models and competition, Machine Learning, 5,

pp. 355-381.

Grim P. (1995) The greater generosity of the spatialized prisoner’s dilemma, Jour-

nal of Theoretical Biology, 173, pp. 242-248.

Grossman W. (2004) New tack wins Prisoner’s Dilemma, Wired News, Lycos.

Harborne S. (1997) Common belief of rationality in the finitely repeated prisoners’

dilemma, Games and Economic Behavior, 19, 1, pp. 133-143.

Hardin G. (1968) The tragedy of the commons, Science, 162, pp. 1243-1248.

Hargreaves H. and Varoufakis Y. (1995) Game theory: a critical introduction.

Routledge, London.

Harsanyi J. (1967) Games with incomplete information played by Bayesian play-

ers, Management Science, 14, 3, pp. 159-182.

Harsanyi, J., and Selten, R. (1988) A General Theory of Equilibrium Selection in

Games. Cambridge: MIT Press.

Hauser M. (1992) Costs of deception: cheaters are punished in rhesus monkeys

(Macaca mulatta). Proceedings of the National Academy of Sciences, 89,

pp. 12137-12139.

Heller J. (1967) The effects of racial prejudice, feedback strategy, and race on

cooperative-competitive behaviour, Dissertation Abstracts, 27, pp. 2507-

2508.

Hemelrijk C. (1991) Interchange of ’Altruistic’ Acts as an Epiphenomenon. Jour-

nal of Theoretical Biology, 153, pp. 131-139.

Hingston P. and Kendall G. (2004) Learning versus evolution in iterated prisoner’s

dilemma, Proceedings of Congress on Evolutionary Computation, pp. 364-

372.

Hirshleifer J. and Coll J. (1988) What strategies can support the evolutionary

emergence of cooperation?, Journal of Conflict Resolution, 32, 2, pp. 367-

398.

Hirshleifer D. and Rasmusen E. (1989) Cooperation in a repeated prisoner’s

dilemma with ostracism, Journal of Economic Behavior and Organization,

12, pp. 87-106.


Hofbauer J. (1981) On the occurrence of limit cycles in the Volterra-Lotka equa-

tion, Nonlinear Analysis, 5, pp. 1003-1007.

Hofbauer J. (1984) A difference equation model for the hypercycle, SIAM Journal

on Applied Mathematics, 44, pp. 762-772.

Hofbauer J. (1996) Evolutionary dynamics for bimatrix games: a Hamiltonian

system, Journal of Mathematical Biology, 34, pp. 675-688.

Hofbauer J. and Sigmund K. (1998) Evolutionary games and population dynamics.

Cambridge University Press.

Hofbauer J. and Sigmund K. (2003) Evolutionary game dynamics, Bulletin of the

American Mathematical Society, 40, pp. 479-519.

Hoffmann R. (2001) The ecology of cooperation, Theory and Decision, 50,

pp. 101-118.

Holland J. (1975) Adaptation in Natural and Artificial Systems. University of

Michigan Press, Ann Arbor.

Holland J. (1992) Genetic algorithm, Scientific American, 267, 4, pp. 44-50.

Holland J. (1995) Hidden Order - How adaptation builds complexity, Reading,

Mass.: Addison-Wesley.

Huberman B. and Glance N. (1993) Evolutionary games and computer simula-

tions, Proceedings of the National Academy of Sciences, 90, pp. 7716-7718.

Ifti M., Killingback T., and Doebeli M. (2004) Effects of neighborhood size and

connectivity on the spatial continuous prisoner’s dilemma, Journal of The-

oretical Biology, 231, pp. 97-106.

Ikegami T. and Kaneko K. (1990) Computer symbiosis - emergence of symbiotic

behavior through evolution, Physica D, 42, pp. 235-243.

Insko C., Pinkley R., Hoyle R., Dalton B., Hong G., Slim R., Landry P., Holton

B., Ruffin P., and Thibaut J. (1987) Individual-group discontinuity: the

role of intergroup contact, Journal of Experimental Social Psychology, 23,

pp. 250-267.

Insko C., Hoyle R., Pinkley R., and Hong G. (1988) Individual-group discontinu-

ity: the role of a consensus rule, Journal of Experimental Social Psychology,

24, pp. 505-519.

Insko C., Schopler J., Hoyle R., Dardis G., and Graetz K. (1990) Individual-group

discontinuity as a function of fear and greed, Journal of Personality and

Social Psychology, 58, pp. 68-79.

Insko C., Schopler J., Drigotas S., Graetz K., Kennedy J., Cox C., and Bornstein

G. (1993) The role of communication in interindividual-intergroup discon-

tinuity, Journal of Conflict Resolution, 37, pp. 108-138.

Kaelbling L. (1993) Learning in embedded systems. The MIT Press, Cambridge,

MA.

Kaelbling L. and Moore A. (1996) Reinforcement learning: a survey, Journal of

Artificial Intelligence Research, 4, pp. 237-285.

Kagel J. and Roth A. (1995) The Handbook of Experimental Economics. Princeton

University Press.

Kahn L. and Murnighan J. (1993) Conjecture, uncertainty, and cooperation in

Prisoners’ Dilemma games: Some Experimental Evidence, Journal of Eco-

nomic Behavior and Organization, 22, pp. 91-117.


Kalai E. and Lehrer E. (1993) Rational learning leads to Nash equilibrium Econo-

metrica, 61, 5, pp. 1019-1045.

Kandori M. (1992) Social norms and community enforcement, The Review of

Economic Studies, 59, 1, pp. 63-80.

Katok A. and Hasselblatt B. (1996) Introduction to the modern theory of dynam-

ical systems. Cambridge ISBN 0521575575.

Kavka G. (1986) Hobbesean Moral and Political Theory. Princeton: Princeton

University Press.

Kirchkamp O. (1995) Spatial Evolution of Automata in the Prisoners’ Dilemma.

University of Bonn SFB 303, Discussion Paper B-330.

Kitcher P. (1992) Evolution of altruism in repeated optional games, Working

Paper of University of California at San Diego.

Knapp W. and Podell J. (1968) Mental patients, prisoners, and students with

simulated partners in a mixed-motive game, Journal of Conflict Resolution,

12, pp. 235-241.

Komorita S., Sheposh J., and Braver S. (1968) Power, the use of power, and

cooperative choice in a two-person game, Journal of Personality and Social

Psychology, 8, pp. 134-142.

Kraines D. and Kraines V. (1995) Evolution of learning among Pavlov strategies in

a competitive environment with noise, The Journal of Conflict Resolution,

39, 3, pp. 439-466.

Kraines D. and Kraines V. (2000) Natural selection of memory-one strategies

for the iterated Prisoner’s Dilemma, Journal of Theoretical Biology, 203,

pp. 335-355.

Kreps D., Milgrom P., Roberts J., and Wilson R. (1982) Rational cooperation in

the finitely repeated prisoner’s dilemma, Journal of Economic Theory, 27,

pp. 245–252.

Kreps, D., and Wilson R. (1982) Reputation and imperfect information, Journal

of Economic Theory, 27, pp. 253–279.

Krishna V. and Sjostrom T. (1998) On the convergence of fictitious play, Mathe-

matics Operations Research, 23, pp. 479-511.

Lave L. (1965) Factors affecting cooperation in the prisoner’s dilemma, Behavioral

Science, 10, pp. 26-38.

Lindgren K. (1991) Evolutionary phenomena in simple dynamics, In Christopher

G., et al. Santa Fe Institute Studies in the Sciences of Complexity. 10,

pp. 295-312.

Lindgren K. (1992) Evolutionary phenomena in simple dynamics, In Langton C.

(ed.) Artificial Life II. Addison-Wesley.

Lindgren K. (1995) Evolutionary dynamics in game-theoretic models, The econ-

omy as an evolving complex system II, Santa Fe Institute.

Littman M. (1994) Markov games as a framework for multiagent reinforcement

learning, Proceedings of the 11th International Conference on Machine

Learning, pp. 157-163.

Luce R. and Raiffa H. (1957) Games and decisions. New York: Wiley.

Lynch G. (1968) Defense preference and cooperation and competition in a game,

Dissertation Abstracts, 29, pp. 1174.


Manarini S. (1998) The prisoner’s dilemma, experiments for the study of coopera-

tion. Strategies, theories and mathematical models, Ph.D. thesis, University

of Padova.

Marinoff L. (1992) Maximizing expected utilities in the Prisoner’s Dilemma, Jour-

nal of Conflict Resolution, 36, 1, pp. 183-216.

Martinez-Coll J. and Hirshleifer J. (1991) The limits of reciprocity, Rationality

and Society, 3, pp. 35-64.

Matsui A. (1992) Best response dynamics and socially stable strategies, Journal

of Economic Theory, 57, pp. 343-362.

May R. (1987) More evolution of cooperation, Nature, 327, pp. 15-17.

Maynard Smith J. and Price G. (1973) The logic of animal conflict, Nature, 246,

pp. 15-18.

Maynard Smith J. (1982) Evolution and the Theory of Games, Cambridge Uni-

versity Press.

McKelvey R. and Palfrey T. (1992) An experimental study of the centipede game,

Econometrica, 60, pp. 803-836.

Mealey L. (1995) The sociobiology of sociopathy: an integrated evolutionary

model, Behavioral and Brain Sciences, 18, 3, pp. 523-599.

Michalewicz Z. (1999) Genetic Algorithms + Data Structures = Evolution Pro-

grams, Springer-Verlag.

Micko H. (1997) Benevolent tit for tat strategies with fixed intervals between

offers of cooperation, Meeting of Experimental Psychologists, pp. 250-256.

Micko H. (2000) Experimental Matrix games, In Open and Distance Learning-Mathematical Psychology, Institut fur Sozial- und Personlichkeitspsychologie, Universitat Bonn.

Milgrom, P. and Roberts J. (1982): Predation, reputation and entry deterrence,

Journal of Economic Theory, 27, pp. 280-312.

Milinski M. (1993) Cooperation wins and stays, Nature, 364, pp. 12-13.

Milinski M. and Wedekind C. (1998) Working memory constrains human coop-

eration in the prisoner’s dilemma, Proceedings of the National Academy of

Sciences of the United States of America, 95, 23, pp. 13755-13758.

Miller J. (1996) The coevolution of automata in the repeated prisoner’s dilemma,

Journal of Economic Behavior and Organization, 29, pp. 87-112.

Mitchell M. (1996) An introduction to Genetic Algorithms. The MIT Press, Cam-

bridge MA.

Molander P. (1985) The optimal level of generosity in a selfish, uncertain envi-

ronment, Journal of Conflict Resolution, 29, pp. 611-618.

Moore A. and Atkeson C. (1993) Prioritized sweeping: reinforcement learning

with less data and less real time, Machine Learning, 13, pp. 103-130.

Mueller U. (1988) Optimal retaliation for optimal cooperation, Journal of Conflict

Resolution, 31, 4, pp. 692-724.

Myerson R. (1991) Game Theory, Analysis of Conflict. Cambridge, Harvard Uni-

versity Press.

Nachbar J. (1992) Evolution in the finitely repeated Prisoners’ Dilemma, Journal

of Economic Behavior and Organization, 19, pp. 307-326.


Narendra K. and Thathachar M. (1989) Learning automata: an introduction.

Prentice-Hall, Englewood Cliffs, NJ.

Nash J. (1950) Equilibrium points in n-person games, Proceedings of the National

Academy of the USA, 36, 1, pp. 48-49.

Nash J. (1951) Non-cooperative games, The Annals of Mathematics, 54, 2,

pp. 286-295.

Nash J. (1996) Essays on Game Theory. Elgar. Cheltenham.

Noldeke G. and Samuelson L. (1993) An evolutionary analysis of backward and

forward induction, Games and Economic Behaviour, 5, pp. 425-454.

Nowak M., Bonhoeffer S., and May R. (1994) More spatial games, International

Journal of Bifurcation and Chaos, 4, 1, pp. 33-56.

Nowak M. and May R. (1992) Evolutionary games and spatial chaos, Nature,

359, pp. 826-829.

Nowak M. and May R. (1993) The spatial dilemmas of evolution, International

Journal of Bifurcation and Chaos, 3, pp. 35-78.

Nowak M. and Sigmund K. (1990) The evolution of stochastic strategies in the

prisoner’s dilemma, Acta Applicandae Mathematicae, 20, pp. 247-265.

Nowak M. and Sigmund K. (1992) Tit for tat in heterogeneous populations, Na-

ture, 359, pp. 250-253.

Nowak M. and Sigmund K. (1993) A strategy of win-stay lose-shift that outper-

forms Tit-for-Tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.

Nowak M., Sigmund K. and El-Sedy E. (1995) Automata, repeated games, and

noise, Journal of Mathematical Biology, 33, pp. 703-722.

Orbell J., Kragt A., and Dawes R. (1988) Explaining discussion-induced cooper-

ation, Journal of Personality and Social Psychology, 54, pp. 811-819.

Orbell J. and Dawes R. (1993) Social welfare, cooperator’s advantage, and the

option of not playing the game, American Sociological Review, pp. 787-800.

Orbell J. and Robyn M. (1993) Social welfare, cooperators’ advantage, and the

option of not playing the game. American Sociological Review, 58, pp. 787-

800.

Oskamp S. (1971) Effects of programmed strategies on cooperation in the pris-

oner’s dilemma and other mixed-motive games, The Journal of Conflict

Resolution, 15, 2, pp. 225-259.

Oskamp S. and Perlman D. (1965) Factors affecting cooperation in a prisoner’s

dilemma game, Journal of Conflict Resolution, 9, pp. 359-374.

Plank M. (1997) Some qualitative differences between the replicator dynamics of

two player and n player games, Nonlinear Analysis, 30, pp. 1411-1417.

Pollock G. (1989) Evolutionary Stability of Reciprocity in a Viscous Lattice.

Social Networks, 11, pp. 175-212.

Posch M. (1997) Win Stay–Lose Shift: An Elementary Learning Rule

for Normal Form Games, Working Paper of Santa Fe Institute,

http://ideas.repec.org/p/wop/safire/97-06-056e.html.

Prisoner’s dilemma tournament result (2004) http://www.prisoners-dilemma.

com/results/cec04/ipd cec04 full run.html.

Prisoner’s dilemma tournament result (2005) http://www.prisoners-dilemma.

com/results/cig05/cig05.html.


Radner R. (1980) Collusive behaviour in non-cooperative epsilon-equilibria in

oligopolies with long but finite lives, Journal of Economic Theory, 22,

pp. 136-154.

Radner R. (1986) Can bounded rationality resolve the prisoner’s dilemma, In

Mas- Colell A. and Hildenbrand W. Essays in Honor of Gerard Debreu,

pp. 387-399.

Rapoport A. (1966) Optimal policies for the prisoner’s dilemma, Technical Report

No. 50, Psychometric Laboratory, University of North Carolina, MH-10006.

Rapoport A. (1999) Two-person Game Theory. Dover Publications, New York.

Rapoport and Chammah (1965) Prisoner’s dilemma: a study in conflict and

cooperation. Ann Arbor: University of Michigan Press.

Rothbard M. (1997) Toward a Reconstruction of Utility and Welfare Economics,

In The Logic of Action One: Method, Money, and the Austrian School,

pp. 211-55.

Rubinstein A. (1979) Equilibrium in super games with the overtaking criterion,

Journal of Economic Theory, 21, pp. 1-9.

Rubinstein A. (1998) Modeling bounded rationality. The MIT Press, 1998.

Samuelson L. (2001) Introduction to the evolution of preferences, Journal of Eco-

nomic Theory, 97, pp. 225-230.

Sandholm T. and Crites R. (1996) Multiagent reinforcement learning in the iter-

ated Prisoner’s Dilemma, Biosystems, 37, 1-2, pp. 147-66.

Sarin R. (1999) Simple play in the prisoner’s dilemma, Journal of Economic

Behavior and Organization, 40, 1, pp. 105–113.

Sasaki A., Taylor C. and Fudenberg D. (2000) Emergence of cooperation and

evolutionary stability in finite populations, Nature, 428, pp. 646-650.

Schmidhuber J. (1996) A general method for multi-agent learning and incremental

self-improvement in unrestricted environments, In Yao X. (ed.) Evolution-

ary Computation: Theory and Applications. Scientific Publications Co.

Schmitt L. (2001) Theory of genetic algorithms, Theoretical Computer Science,

259, pp. 1-61.

Schuessler R. (1989) Exit threats and cooperation under anonymity, Journal of

Conflict Resolution, 33, pp. 728-749.

Schweitzer F. (2002) Modeling Complexity in Economic and Social Systems. World

Scientific, Singapore.

Schweitzer F., Behera L., and Muhlenbein H. (2002) Evolution of cooperation

in a spatial prisoner's dilemma, Advances in Complex Systems, 5, 2-3,

pp. 269-299.

Scodel A., Minas J., Ratoosh P., and Lipetz M. (1959) Some descriptive aspects of

two-person non-zero sum games, Journal of Conflict Resolution, 3, pp. 114-

119.

Selten, R. (1965) Spieltheoretische behandlung eines oligopolmodells mit nach-

fragetragheit, Zeitschrift fur die Gesamte Staatswissenschaft, 12, pp. 301-

324.

Selten, R. (1975) Reexamination of the perfectness concept for equilibrium points

in extensive games, International Journal of Game Theory, 4, pp. 25-55.


Selten R. (1983) Evolutionary stability in extensive two-person games, Mathe-

matical Social Science, 5, pp. 269-363.

Selten R. (1988) Evolutionary stability in extensive two-person games: correction

and further development, Mathematical Social Science, 16, pp. 223-266.

Selten R. and Stoecker R. (1986) End behaviour in sequences of finite Prisoner’s

Dilemma supergames: a learning theory approach, Journal of Economic

Behaviour and Organisation, 7, pp. 47-70.

Sethi R. and Somanathan E. (1996) The evolution of social norms in common

property resource use, The American Economic Review, 86, 4, pp. 766-788.

Simon H. (1955) A behavioral model of rational choice, Quarterly Journal of

Economics, 69, 1, pp. 99-118.

Simon H. (1990) A mechanism for social selection and successful altruism, Science,

250, 4988, pp. 1665-1668.

Sermat V. (1967) Cooperative behaviour in a mixed-motive game, Journal of

Social Psychology, 62, pp. 217-239.

Sigmund K. (1995) Games of Life: Explorations in Ecology, Evolution and Be-

haviour. Penguin, Harmondsworth.

Skyrms B. (1990) The Dynamics of Rational Deliberation. Harvard UP.

Smith H. (1995) Monotone dynamical systems: an introduction to the theory

of competitive and cooperative systems, AMS Mathematical Surveys and

Monographs, 41.

Smith R. and Gray B. (1994) Co-adaptive genetic algorithms: an example in Oth-

ello strategy, Proceedings of the 1994 Florida Artificial Intelligence Research

Symposium, pp. 259-264.

Sobel J. (1975) Reexamination of the perfectness concept of equilibrium in ex-

tensive games, International Journal of Game Theory, 4, pp. 25-55.

Sobel J. (1976) Utility maximization in iterated Prisoner’s Dilemmas, Dialogue,

15, pp. 38-53.

Solomon L. (1960) The influence of some types of power relationships and game

strategies upon the development of interpersonal trust, Journal of Abnormal

and Social Psychology, 61, pp. 223-230.

Spears W. (1992) Crossover or mutation? Foundations of Genetic Algorithms. 2,

FOGA-92, edited by Whitley D., California: Morgan Kaufmann.

Spiro D. (1988) The state of cooperation in theories of state cooperation: the evo-

lution of a category mistake, Journal of International Affairs, 42, pp. 205-

225.

Stanley E., Ashlock D., and Smucker M. (1995) Iterated prisoner’s dilemma with

choice and refusal of partners: Evolutionary results, Lecture Notes in Arti-

ficial Intelligence, 929, pp. 490-502.

Stanley E., Ashlock D., and Tesfatsion L. (1994) Iterated prisoner’s dilemma

with choice and refusal of partners, In Christopher G. Artificial Life III.

Addison-Wesley, pp. 131-176.

Stephens D. (2000) Cumulative benefit games: achieving cooperation when play-

ers discount the future, Journal of Theoretical Biology, 205, 1, pp. 1-16.

Stephens D., Mclinn C., and Stevens J. (2002) Discounting and Reciprocity in an

Iterated Prisoner’s Dilemma, Science, 298, 5601, pp. 2216-2218.


Sugden, R. (1986) The Economics of Cooperation, Rights and Welfare. Basil

Blackwell.

Surowiecki J. (2004) The Wisdom of Crowds: Why the Many Are Smarter Than

the Few and How Collective Wisdom Shapes Business, Economies, Societies

and Nations. Little, Brown.

Sutton R. (1990) Integrated architectures for learning, planning, and reacting

based on approximating dynamic programming, Proceedings of the 7th In-

ternational Conference on Machine Learning, pp. 216-224.

Swingle P. and Coady H. (1967) Effects of the partner’s abrupt strategy change

upon subject’s responding in the prisoner’s dilemma, Journal of Personality

and Social Psychology, 5, pp. 357-363.

Swinkels J. (1993) Adjustment dynamics and rational play in games, Games and

Economic Behavior, 5, pp. 455-84.

Taylor, P. D. (1979). Evolutionarily stable strategies with two types of players,

Journal of Applied Probability, 16, pp. 76-83.

Taylor, P. and Jonker, L. (1978) Evolutionary stable strategies and game dynam-

ics, Mathematical Biosciences, 40, pp. 145-156.

Tekol Y. and Acan A. (2003) Ants can play Prisoner’s Dilemma, Proceedings of

the 2003 Congress on Evolutionary Computation, pp. 1151-1157.

Thierens D. (1997) Selection schemes, elitist recombination, and selection inten-

sity, Proceedings of the 7th International Conference on Genetic Algorithms,

pp. 152-159.

Thierens D. and Goldberg D. (1994) Elitist recombination: an integrated se-

lection recombination GA, Proceedings of the First IEEE Conference on

Evolutionary Computation, pp. 508-512.

Thomas B. (1985) On evolutionarily stable sets, Journal of Mathematical Biology,

22, pp. 105-115.

Tzafestas E. (2000a) Toward adaptive cooperative behavior, Proceedings of the

Simulation of Adaptive Behavior Conference, pp. 334-340.

Tzafestas E. (2000b) Spatial games with adaptive tit-for-tats, Proceedings of the

6th Parallel Problem Solving from Nature (PPSN-VI), pp. 507-516.

Young H. and Foster D. (1991) Cooperation in the Short and in the Long Run,

Games and Economic Behavior, 3, pp. 145-156.

Vegaredondo F. (1994) Bayesian boundedly rational agents play the finitely re-

peated prisoner’s dilemma, Theory and Decision, 36, 2, pp. 187–206.

Von Neumann J. and Morgenstern O. (1944) Theory of Games and Economic

Behavior. Princeton UP.

Watkins C. (1989) Learning from delayed rewards. Ph.D. thesis, King’s College,

Cambridge, UK.

Watkins C. and Dayan P. (1992) Q-learning, Machine Learning, 8, 3, pp. 279-292.

Wedekind C. and Milinski M. (1996) Human cooperation in the simultaneous

and the alternating Prisoner’s Dilemma: Pavlov versus Generous Tit-for-

Tat, Proceedings of the National Academy of Sciences of the United States

of America, 93, 7, pp. 2686-2689.

Weibull J. (1995) Evolutionary Game Theory. MIT Press, Cambridge, Mass.


Wilson W. (1969) Cooperation and the cooperativeness of the other player, Jour-

nal of Conflict Resolution, 13, pp. 110-117.

Wilson W. (1987) Classifier systems and the animat problem, Machine Learning,

2, pp. 199-228.

Whitworth R. and Lucker W. (1969) Effective manipulation of cooperation with

college and culturally disadvantaged populations, Proceedings of 77th An-

nual Convention of American Psychological Association, 4, pp. 305-306.

Wu J. and Axelrod R. (1995) How to cope with noise in the Iterated Prisoner’s

Dilemma, Journal of Conflict Resolution, 39, pp. 183-189.

Zeeman M. (1993) Hopf bifurcations in competitive three dimensional Lotka-

Volterra systems, Dynamics and Stability of Systems, 8, pp. 189-217

Zeeman E., Zeeman M. (2002) An n-dimensional competitive Lotka-Volterra sys-

tem is generically determined by its edges, Nonlinearity, 15, pp. 2019-2032.

Zeeman E., Zeeman M. (2003) From local to global behavior in competitive

Lotka-Volterra systems, Transaction of American Mathematical Society,

355, pp. 713-734.


Chapter 3

Learning IPD Strategies Through Co-evolution

Siang Yew Chong1, Jan Humble2, Graham Kendall2, Jiawei Li2,3, Xin

Yao1

University of Birmingham1, University of Nottingham2, Harbin Institute

of Technology3

3.1. Introduction

Complex behavioral interactions can be abstracted and modelled using a

game. One particular aspect in modelling interactions that is of great

interest is in understanding the specific conditions that lead to cooperation

between selfish individuals. The iterated prisoner’s dilemma (IPD) game is

one famous example. In its classical form, two players engaged in repeated interactions are each given two choices: cooperate or defect [Axelrod (1984)]. The dilemma of the game is that although both players are better off mutually cooperating than mutually defecting, each is vulnerable to exploitation by a partner who defects. Although the IPD game has

become a popular model to study conditions for cooperation to occur among

selfish individuals, which was due in large part to a series of tournaments

reported in [Axelrod (1980a,b)], it has also received much attention in many

other areas of study, and has been used to model social, economic, and biological

interactions [Axelrod (1984)].

The classical IPD can be easily defined as a nonzero-sum, noncooper-

ative, two-player game [Chellapilla and Fogel (1999)]. It is nonzero-sum

because the benefits that a player obtains do not necessarily lead to similar

penalties given to the other player. It is noncooperative because it assumes

no preplay communication between the two players.

The IPD game can be formulated by considering a predefined payoff

matrix that specifies the payoff that a player receives for the choice it makes

for a particular move given the choice that the opponent makes. Referring


to the payoff matrix given by figure 3.1, both players receive R (reward)

units of payoff if both cooperate. They both receive P (punishment) units

of payoff if they both defect. However, when one player cooperates while

the other defects, the cooperator will receive S (sucker) units of payoff

while the defector receives T (temptation) units of payoff.

With the IPD game, the values R, S, T, and P must satisfy the constraints T > R > P > S and R > (S + T)/2. Axelrod in [Axelrod

(1980a,b)] used the following set of values: R = 3, S = 0, T = 5, and

P = 1. However, any set of values can be used as long as they satisfy the

IPD constraints. The game is played when both players choose between

the two alternative choices over a series of moves (i.e., repeated interac-

tions). Note that the game is fully symmetric, i.e., the same payoff matrix

is applied to both players.

                 Cooperate         Defect
               +---------------+---------------+
   Cooperate   |           R   |           T   |
               |   R           |   S           |
               +---------------+---------------+
   Defect      |           S   |           P   |
               |   T           |   P           |
               +---------------+---------------+

Fig. 3.1. The payoff matrix framework of a two-player, two-choice game. The payoff

given in the lower left-hand corner is assigned to the player (row) choosing the move,

while that of the upper right-hand corner is assigned to the opponent (column).

For the simple case of the one-shot prisoner’s dilemma (both players

only get to make one move), the rational play will be to defect [Chellapilla

and Fogel (1999)]. This can be seen by considering the payoff a player obtains for each choice, given the opponent's choice. For example, a co-

operating player will receive either R (opponent cooperates) or S (opponent

defects). A defecting player will receive either T (opponent cooperates) or

P (opponent defects). As such, from the player’s point of view (i.e., self-

interested), the rational play will be to defect because regardless of the

opponent’s play, a higher payoff is obtained (T > R and P > S).

However, when the game is iterated over many rounds of moves and players can adopt game strategies where responses are based on what

happened in the previous moves, defection is not necessarily the best choice

of play. Instead, many studies have shown cooperative play to be a viable


strategy, starting with the tournaments organized by Axelrod (reported in

[Axelrod (1980a,b)]). More importantly, later studies (of which Axelrod

himself is one of the early pioneers) showed that cooperative strategies can

be learned from an initial, random population using evolutionary algorithms

[Axelrod (1987); Fogel (1991, 1993); Darwen and Yao (1995)].

In particular, studies made in [Axelrod (1987); Fogel (1991, 1993); Dar-

wen and Yao (1995)] (and many others) used a co-evolutionary learning

approach. The motivation for the co-evolutionary learning approach is

the learning of strategy behaviors through an adaptation process on strat-

egy representations based solely on interactions (i.e., game-play). This

approach differs from the classical evolutionary game approach

(and also the ecological game approach used in [Axelrod (1980b); Axel-

rod and Hamilton (1981)]), which is mainly concerned with frequency-dependent reproduction of fixed and predetermined strategies. As such, the co-evolutionary learning approach allows one to construct a game

(i.e., specifying the possible interactions between players, the rules that

govern the interactions, and the payoffs) and then to search for effective

game strategies without the need for human intervention (e.g., specifying vi-

able strategies) [Chellapilla and Fogel (1999)].

Within the framework of the co-evolutionary learning of game strategies,

it is natural to explore more complex interactions that are closer to real-world

interactions compared to highly abstracted models like the classical IPD.

This review aims to provide a survey of studies using the co-evolutionary

learning approach to more complex IPD games since the tournaments orga-

nized by Axelrod that were held almost 20 years ago. In particular, focus

is placed on the motivations of certain extensions to the classical IPD and

the general observations made when co-evolutionary learning systems are

used.

The following section describes the framework of co-evolutionary learn-

ing and the general issues of co-evolving IPD strategies. Section 3.2 surveys

studies that extend the classical IPD with more choices, noise, N-players,

and others. The review concludes with some remarks on the future di-

rections for research in co-evolutionary learning of IPD strategies. It is

emphasized again that this review focusses on the co-evolutionary learning

approach to IPD games, rather than all possible work related to IPD games.


3.2. Co-evolving Strategies for the IPD Game

3.2.1. Co-evolutionary Learning Framework

Co-evolutionary learning refers to a broad class of population-based,

stochastic search algorithms that involve the simultaneous evolution of

competing solutions (to a problem) with coupled fitness [Yao (1994)]. A co-

evolutionary learning system can be implemented using evolutionary algo-

rithms (EAs) [Fogel (1994a); Back et al. (1997)]. That is, a co-evolutionary

learning system iteratively apply the process of variation (e.g., mutation,

crossovers, and others) and selection (e.g., choosing solutions to procreate

in the next iterative step) on the competing solutions in the population.

With this view, the framework of co-evolutionary learning (and also that

of EAs) can be illustrated using figure 3.2.

(1) Initialize the population, X(t=0)

(2) Evaluate the fitness of each individual through a comparison process

with other individuals in X(t)

(3) Select parents from X(t) based on their evaluated fitness

(4) Generate offspring from parents to produce X(t+1)

(5) Repeat steps (2-4) until some termination criteria are reached

Fig. 3.2. The general framework of co-evolutionary learning.
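The steps of figure 3.2 can also be sketched as a short loop; the init_population, pairwise_score, select, and vary routines below are assumed placeholders for whichever representation, selection, and variation operators a given study adopts.

    def coevolve(init_population, pairwise_score, select, vary, generations):
        # A minimal sketch of the framework in figure 3.2. Fitness is relative:
        # pairwise_score(a, b) returns the payoff a earns against b, so each
        # individual's fitness depends on the current population.
        population = init_population()                               # step (1)
        for _ in range(generations):                                 # step (5)
            fitness = [sum(pairwise_score(s, other) for other in population)
                       for s in population]                          # step (2)
            parents = select(population, fitness)                    # step (3)
            population = vary(parents)                               # step (4)
        return population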

Co-evolutionary learning differs from EAs in how it assigns

fitness, i.e., the quality or worth of a solution (Step 2 in Fig. 3.2). EAs are

often viewed and constructed in terms of an optimization context, whereby

an absolute fitness function is required to assign fitnesses to contending so-

lutions. With co-evolutionary learning, the fitness of a solution is obtained

through its interactions with other contending solutions in the population.

That is, a solution's fitness in a co-evolutionary learning system is relative

and dynamic because a solution’s fitness not only depends on the popu-

lation, but also changes as the composition of solutions in the population

changes.

Although the difference between co-evolutionary learning systems and traditional EAs appears small at first, in the context of certain

problems, it can lead to significantly different outcomes. For example,

consider the problem of searching for optimal solutions. In many real-world


problems, designing a suitable fitness function that can guide the search for solutions can be very difficult, if not impossible [Yao (1994)]. However, with co-evolutionary learning, this need for an explicit fitness function is essentially removed. Instead, a co-evolutionary learning system only needs to be able to rank contending solutions based on how they compare to one another.

Here, games are well-suited, natural problem applications for co-

evolutionary learning systems. In particular, although games can be ap-

proached from an optimization context, it may not be possible to construct

a fitness function that fully represents the problem of the game and fully discriminates between solutions found through optimization algorithms. With co-evolutionary learning, however, the search can be directed to find better game strategies (e.g., ones that defeat more strategies) as the evolutionary process

continues [Chellapilla and Fogel (1999)].

In particular, for the IPD game, there have been many different ap-

proaches since Axelrod’s early study in [Axelrod (1987)] that investigated

a particular co-evolutionary learning system. Like the study of EAs (com-

monly known as Evolutionary Computation) [Yao (1994); Fogel (1994a);

Back et al. (1997); Fogel (1995); Back (1996)], there are a wide variety of

specific strategy representations, selection and variation operators in the

co-evolutionary learning approach used for the IPD game. A complete sur-

vey is beyond the scope of this chapter. Instead, the more popular choices

will be reviewed here. The important thing to note is that all the co-

evolutionary learning systems used were based on the framework illustrated

in figure 3.2, i.e., they involved an adaptation process on IPD strategies in some form of representation (involving variation and selection) based on

interactions (game-play between strategies).

For strategy representations, particularly on deterministic and reactive

IPD strategies that were mostly studied, Axelrod and Lindgren [Axelrod

(1987); Lindgren (1991)] were among the first few who used binary strings

of ones (cooperation) and zeroes (defection) encoding for a look-up table

(essentially a binary decision tree) representation. The look-up table in

particular determines the outcome for the strategy based on the pairs of

previous moves made by the strategy and the opponent. Since the strategies

require histories of previous moves in order to make a response, they are

encoded with the necessary histories for previous moves. We [Chong and

Yao (2005)] recently introduced a look-up table representation that directly

represents IPD strategies based on responses to previous moves. For the case of looking back at the previous pair of moves made by the strategy and the opponent, the direct look-up table represents the strategy responses as a


two-dimensional table. Each table element represents the response based

on the pair of previous moves. Instead of some fictitious histories required

to start the game, the direct look-up table specifies the first move directly.
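As a concrete illustration, a minimal sketch (in Python, written for this exposition rather than taken from the cited studies) of a direct look-up table strategy for the classical two-choice, memory-one IPD might look as follows; the class and attribute names are illustrative only:

# A minimal sketch of a direct look-up table strategy for the two-choice,
# memory-one IPD: responses are indexed directly by the pair of previous
# moves, and the first move is stored explicitly.
C, D = 'C', 'D'

class DirectLookupTable:
    def __init__(self, first_move, table):
        self.first_move = first_move      # played when there is no history yet
        self.table = table                # (my_prev, opp_prev) -> next move

    def next_move(self, my_prev=None, opp_prev=None):
        if my_prev is None:               # first move of the game
            return self.first_move
        return self.table[(my_prev, opp_prev)]

# Example: tit for tat expressed as a direct look-up table.
tit_for_tat = DirectLookupTable(
    first_move=C,
    table={(C, C): C, (C, D): D, (D, C): C, (D, D): D},
)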

Fogel among many others [Fogel (1991, 1993, 1996); Miller (1989); Stan-

ley et al. (1995)] used finite state machines (FSMs) for their capability of

representing complex behaviors of IPD strategies. With FSMs, behavioral

responses of an IPD strategy based on previous moves depend on the states

and the next-state transitions. The motivation for using FSM compared

to look-up table is to have a behavioral representation of IPD strategies

instead of the look-up table representation of responses based on histories

of previous moves (see [Fogel (1993)] for the full discussion on the origin of

using FSM and evolution to simulate intelligent behaviors).
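For comparison, the following minimal sketch (illustrative only, not the representation used in the cited studies) encodes GRIM, which cooperates until the opponent's first defection, as a two-state FSM:

# GRIM encoded as a two-state finite state machine: state 0 cooperates,
# state 1 (entered after the opponent's first defection) defects forever.
C, D = 'C', 'D'

class FSMStrategy:
    def __init__(self, initial_state, outputs, transitions):
        self.state = initial_state
        self.outputs = outputs            # state -> move to play
        self.transitions = transitions    # (state, opponent_move) -> next state

    def play(self):
        return self.outputs[self.state]

    def observe(self, opponent_move):
        self.state = self.transitions[(self.state, opponent_move)]

grim = FSMStrategy(
    initial_state=0,
    outputs={0: C, 1: D},
    transitions={(0, C): 0, (0, D): 1, (1, C): 1, (1, D): 1},
)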

In addition to the simple look-up table and FSM, neural network rep-

resentations had also been experimented with and studied [Harrald and

Fogel (1996); Darwen and Yao (2000); Chong and Yao (2005); Franken and

Engelbrecht (2005)]. Although neural networks are primarily used for their

ability to provide nonlinear input-output responses [Chellapilla and Fogel (1999)], the initial motivation for representing IPD strategies this way also includes

the capability of neural networks to process and represent a continuous

range of behaviors [Harrald and Fogel (1996)].

After selecting a strategy representation, the next step is to consider

the design of variation operators that are aimed at providing variations of

IPD strategies in the population. In most cases, variation operators are

dependent on the strategy representation considered. For example, look-up tables encoded as binary strings can use crossover and bit-flip mutation as

in the case of standard genetic algorithms [Axelrod (1987)]. For the case

of FSMs, variation operators may include altering a next-state transition,

adding or removing states, and altering the output symbol (corresponding

to making a choice). With neural networks, especially those that are real-

valued representations, self-adapting mutations based on some probability

distribution (i.e., Gaussian or Cauchy) can be used [Chong and Yao (2005)]

(one of us has provided a comprehensive review on evolving neural networks

in [Yao (1999)]).

As for designing the process of selecting IPD strategies for the next

generation, many other selection operators can be used (those found in

EAs [Fogel (1994a); Back et al. (1997)]), not just the proportional selection used by Axelrod in the first study of co-evolving IPD strategies

[Axelrod (1987)]. For the case of obtaining the fitness for a particular IPD

strategy in the population, payoffs obtained from the IPD game are usually


used. In particular, many studies calculated the expected IPD payoff as fitness using a round-robin tournament whereby all pairs of

strategies compete, including the pair where a strategy plays itself.

3.2.2. Shadow of the Future

In the IPD game, the shadow of the future refers to the situation whereby

the number of moves of a game is known in advance. In this situation,

there is no incentive to cooperate in the last move because there is no risk

of retaliation from the opponent. However, if every player defects on the

last move, then there is no incentive to cooperate in the move prior to the

last one. If every player defects in the last two moves, then there is no

incentive to cooperate in the move before that, and so forth. As such, we

would end up with mutual defection in all moves.

One popular way to address this issue and to allow for cooperation to

emerge is to have a fixed probability in ending the game on every move,

thereby keeping the game length uncertain. Most of the studies that used

the co-evolutionary learning approach considered a fixed game length (num-

ber of moves) in all game plays. For example, Axelrod [Axelrod (1987)]

and others such as [Fogel (1991, 1993); Chong and Yao (2005)] used 150

moves (with moves numbered from 0). Other game lengths can be used, although the

choice depends on the motivation of the study, e.g., a sufficiently long game

length to allow for strategies to reciprocate cooperation. In any case, the

fixed game length is used because the strategy representation cannot count

the number of moves that have been played and how many more remain.
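As an aside, the fixed-probability device mentioned above amounts to drawing the game length from a geometric distribution; a brief illustrative sketch (the probability value is arbitrary, not taken from any of the cited studies) is:

import random

def sample_game_length(end_probability=0.005):
    # After every move the game ends with a fixed probability, so neither
    # player can know in advance which move is the last; the expected game
    # length is 1 / end_probability (200 moves for 0.005).
    length = 1
    while random.random() >= end_probability:
        length += 1
    return length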

3.2.3. Issues for Co-evolutionary Learning of IPD Strategies

For the IPD game, there are two main contexts in which co-evolutionary

learning can be considered. First, co-evolutionary learning can be used to

search for effective strategies, given the specific rules of the game that

govern the complexity of strategy interactions. Second, a co-evolutionary

learning system can serve as a model for investigating how certain condi-

tions (e.g., game rules, co-evolutionary learning system setup, or others)

can lead to the evolution of certain behaviors.

For the context of using co-evolutionary learning to search for effective

strategies, the main issue is to evolve IPD strategies that perform well against (e.g., defeat) a large number of opponents. Axelrod [Axelrod (1987)]

used a co-evolutionary learning system and compared the evolved strategies


with the representative strategies (e.g., tit for tat) obtained from his earlier

tournaments, which accounted for the average performance of all strategies that participated in the tournaments [Axelrod (1980a,b)]. He noted that some of

the evolved strategies outperformed these representative strategies.

Although results obtained from evolving effective IPD strategies were

promising, the study in [Axelrod (1987)] had an important implication for

specifying a principled method to determine the effectiveness (or robustness

[Axelrod and Hamilton (1981)]) of evolved IPD strategies by testing them

against some representative strategies. One of us (Yao) first framed this

particular study in the context of generalization [Darwen and Yao (1995);

Yao et al. (1996)]. In particular, co-evolutionary learning is a machine

learning system that can be analyzed for its generalization performance.

Here, the generalization performance of a co-evolutionary learning system

for the IPD game can be thought of as the performance of the best strategy

in the population or the population itself (e.g., using a gating algorithm that

effectively combines different IPD strategies of the population as a single

strategy entity [Darwen (1996); Darwen and Yao (1997)]) against a large

number of IPD strategies, especially those that the evolved strategies have

yet to play with during evolution.

For the context of using co-evolutionary learning as a model to under-

stand the conditions of how, why, and what IPD strategy behaviors are

evolved, there are many issues that can be studied. First, one can con-

sider the impact of specific IPD game specifications (e.g., payoff matrices

[Fogel (1993)] and duration of interactions or game length [Fogel (1996)])

on evolved IPD strategy behaviors. Second, there are also studies that

have focused on the impact of the interaction or game-play itself, including, but not limited to, noisy interactions [Julstrom (1997)], continuous behavioral responses [Harrald and Fogel (1996)], and the possibility of refusal to interact [Stanley et al. (1995)]. Third, the specific design of the

co-evolutionary learning system itself can have an impact whereby certain

IPD behaviors are favored and persist for a long period (e.g., investigating

whether systems that provided genotypic diversity actually lead to a diverse

population of IPD strategies with a variety of behaviors [Darwen and Yao

(2000, 2001, 2002)]).

3.3. Extending the IPD Game

The primary motivation in most studies that extend the classical IPD game

is to model more complex IPD interactions that are closer to real-world


interactions. This section describes some of the extended IPD games that

have been investigated using the co-evolutionary learning approach. Each

subsection starts with the motivation for extending the IPD game in a

specific manner, and the important issues of studying the more complex

IPD games. Each subsection discusses and concludes general observations

obtained from the co-evolutionary learning of the particular extended IPD

game.

3.3.1. Extending the IPD with More Choices

Several studies have extended the classical IPD beyond the two extreme choices that are available for play. That is, there are intermediate choices between full cooperation and full defection that strategies can respond

with. Fogel [Harrald and Fogel (1996)] investigated a continuous IPD game.

We have investigated the IPD with multiple, discrete levels of cooperation

[Darwen and Yao (2000, 2001, 2002); Chong and Yao (2005)], which could

be used to approximate the continuous IPD game when the number of levels

is sufficiently large.

The main motivation of extending the IPD with more choices is to al-

low for the modelling of subtle behavioral interactions that are not possible

with only two extreme choices. With the classical IPD game, the possible

behaviors that strategies can exhibit are severely limited. For example, a

strategy for the classical IPD game cannot play intermediate choices that

allow for some degree of exploitation of the opponent without risking retal-

iation from an otherwise cooperative opponent [Harrald and Fogel (1996)].

The co-evolutionary learning approach usually considers a neural net-

work strategy representation because it can be used to process a continuous

range of behaviors (i.e., real numbers for representing the degree of coopera-

tion) easily. Furthermore, for the case of IPD games with multiple, discrete

levels of cooperation, a neural network is scalable to the number of levels

considered.

Fogel [Harrald and Fogel (1996)] showed that for the IPD extended with a

continuous range of choices, the evolution of cooperation is unstable, with

fluctuations of average scores representing short periods of cooperation and

defection. We have further shown that with an increasingly higher number of choices to play in the IPD game with multiple, discrete levels of cooperation, evolution to cooperation is more difficult to achieve [Darwen and Yao

(2000, 2001, 2002)].

From these studies, it appears that a co-evolving population of IPD


strategies has a higher tendency of evolving to play full defection. However,

this does not mean that evolution to cooperation is not possible, or that

cooperative behaviors that persist cannot be evolved. For example, it has

been shown that evolving cooperative behaviors depends on the complexity

of strategy representation that is used. In the case of neural networks, the

number of nodes in the hidden layer can affect the ability of the co-evolutionary learning system to produce IPD strategies with cooperative responses [Harrald and

Fogel (1996)].

In addition to the complexity of strategy representation, another impor-

tant factor for evolving cooperative strategies is that of behavioral diversity.

Early studies [Darwen and Yao (2000, 2001)] have shown that genetic di-

versity (i.e., variations at the genotypic level of strategy representations)

does not equate to behavioral diversity (i.e., variations of IPD strategy

responses) in the population. Without sufficient behavioral diversity, the

co-evolving population can overspecialize to a specific strategy behavior

that is vulnerable to invasion (e.g., cycles between tit for tat, naive coop-

erators, and defectors). As such, increasing the level of genetic diversity

in the co-evolutionary learning system does not necessarily lead to an in-

crease in behavioral diversity that can help with the evolution of cooperative

strategies.

We have recently further shown that strategy representation also

plays an important role in introducing behavioral diversity in the co-

evolutionary learning system [Chong and Yao (2005)]. We considered the

n-choice IPD game, which was obtained based on the following linear in-

terpolation:

pA = 2.5 − 0.5 cA + 2 cB,   −1 ≤ cA, cB ≤ 1,

where pA is the payoff to player A, given that cA and cB are the cooperation

levels of the choices that players A and B make, respectively. Fogel [Harrald

and Fogel (1996)] also considered a similar interpolation process. However,

we considered multiple, discrete levels of cooperation. For example, we used

the four-choice IPD game, where the four cooperation levels are represented as +1 (full cooperation), +1/3, −1/3, and −1 (full defection). These choices

can be used with the linear interpolation equation shown above to obtain

the payoff. Figure 3.3 illustrates the payoff matrix of a four-choice IPD

game that was used [Chong and Yao (2005)].

Note that in generating the payoff matrix for an n-choice IPD game, the

following conditions must be satisfied [Chong and Yao (2005)]:


                                      PLAYER B
                          +1        +1/3       −1/3       −1
             +1           4         2 2/3      1 1/3      0
PLAYER A     +1/3         4 1/3     3          1 2/3      1/3
             −1/3         4 2/3     3 1/3      2          2/3
             −1           5         3 2/3      2 1/3      1

Fig. 3.3. The payoff matrix for the two-player four-choice IPD used in [Chong and Yao

(2005)]. Each element of the matrix gives the payoff for Player A.

(1) For cA < c′A and constant cB: pA(cA, cB) > pA(c′A, cB),
(2) For cA ≤ c′A and cB < c′B: pA(cA, cB) < pA(c′A, c′B), and
(3) For cA < c′A and cB < c′B: pA(c′A, c′B) > (pA(cA, c′B) + pA(c′A, cB))/2.

These conditions are analogous to those for the classical IPD’s. The first

condition ensures that defection always pays more. The second condition

ensures that mutual cooperation has a higher payoff than mutual defec-

tion. The third condition ensures that alternating between cooperation

and defection does not pay in comparison to just playing cooperation.
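The payoff rule and the three conditions above can be checked mechanically; the following illustrative sketch (written for this exposition, with hypothetical function names) reproduces the interpolation and verifies the conditions for the four-choice case:

from itertools import product

def payoff_A(c_a, c_b):
    # Linear interpolation used above: pA = 2.5 - 0.5*cA + 2*cB.
    return 2.5 - 0.5 * c_a + 2.0 * c_b

def check_conditions(choices):
    # Verify the three n-choice IPD conditions for a list of cooperation levels.
    for ca, ca2, cb, cb2 in product(choices, repeat=4):
        if ca < ca2 and payoff_A(ca, cb) <= payoff_A(ca2, cb):
            return False                                   # condition (1)
        if ca <= ca2 and cb < cb2 and payoff_A(ca, cb) >= payoff_A(ca2, cb2):
            return False                                   # condition (2)
        if (ca < ca2 and cb < cb2 and
                payoff_A(ca2, cb2) <= 0.5 * (payoff_A(ca, cb2) + payoff_A(ca2, cb))):
            return False                                   # condition (3)
    return True

four_choices = [-1.0, -1.0 / 3, 1.0 / 3, 1.0]
assert check_conditions(four_choices)
# payoff_A(1, 1) is 4 and payoff_A(-1, -1) is 1, matching figure 3.3.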

We investigated two strategy representations: neural networks and the direct look-up table. We considered these two strategy representations because they allow the investigation of the impact of strategy representation on

the introduction and maintenance of variations of behavioral responses in

the population of IPD strategies. On the one hand, the neural network

indirectly represents the input-output response mappings of IPD strate-

gies, with possibilities of many-to-one mappings between representations

and actual behavioral responses [Fogel (1994b); Atmar (1994)]. On the

other hand, the direct look-up table directly represents the input-output

response mappings of IPD strategies. We hypothesized that a more direct

representation of IPD strategies will allow more behavioral variations to be

introduced and maintained in the population through co-evolution.

For the neural network representation, we used a fixed-architecture feed-

forward multilayer perceptron (MLP) [Chong and Yao (2005)]. Specifically,

the neural network consists of an input layer, a single hidden layer of ten

nodes, and an output node. The network is fully connected and strictly

layered (i.e., no short-cut connections from the input layer to the output node). The transfer (activation) function used for all nodes is the hyperbolic


tangent function, tanh(x). The input layer consists of the following four

input nodes:

(1) The neural network’s previous choice, i.e., level of cooperation, in [−1,

+1].

(2) The opponent’s previous level of cooperation.

(3) An input of +1 if the opponent played a lower cooperation level com-

pared to the neural network, and 0 otherwise.

(4) An input of +1 if the neural network played a lower cooperation level

compared to the opponent, and 0 otherwise.

The input layer is a function of two variables (e.g., neural network’s previous

choice and the opponent’s previous choice) since the last two inputs are

derived from the first two inputs. These additional inputs are to facilitate

learning the recognition of being exploited and exploiting. Given the inputs,

the neural network’s output determines the choice for its next move. The

output is a real value between +1 and −1 that is discretized to either +1,

+1/3, −1/3 or −1, depending on which discrete value the neural network

output is closest to.
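The following sketch (illustrative only; the weight initialization is arbitrary rather than evolved) shows how such a fixed-architecture MLP could map the four inputs described above to one of the four discrete choices:

import numpy as np

CHOICES = np.array([1.0, 1.0 / 3, -1.0 / 3, -1.0])

class MLPStrategy:
    # Fixed architecture: 4 inputs, one hidden layer of 10 tanh nodes, and a
    # single tanh output that is discretized to the nearest of the four choices.
    def __init__(self, rng):
        self.w1 = rng.normal(0.0, 0.5, size=(10, 4))   # input-to-hidden weights
        self.b1 = rng.normal(0.0, 0.5, size=10)        # hidden biases
        self.w2 = rng.normal(0.0, 0.5, size=10)        # hidden-to-output weights
        self.b2 = rng.normal(0.0, 0.5)                 # output bias

    def next_choice(self, my_prev, opp_prev):
        # Inputs 3 and 4 flag "the opponent played lower than me" and vice versa.
        x = np.array([my_prev, opp_prev,
                      1.0 if opp_prev < my_prev else 0.0,
                      1.0 if my_prev < opp_prev else 0.0])
        hidden = np.tanh(self.w1 @ x + self.b1)
        out = np.tanh(self.w2 @ hidden + self.b2)          # real value in (-1, +1)
        return CHOICES[np.argmin(np.abs(CHOICES - out))]   # nearest discrete level

strategy = MLPStrategy(np.random.default_rng(0))
choice = strategy.next_choice(my_prev=1.0, opp_prev=-1.0 / 3)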

We considered self-adaptive mutation as the variation operator for the real-valued representation of neural networks that we used [Chong and Yao (2005)]. This approach associates a neural network with a self-adaptive parameter vector [σi(j)] that controls the mutation step sizes of the respective weights and biases of the neural network [wi(j)]. Offspring neural networks ([w′i(j)] and [σ′i(j)]) are generated from parent neural networks ([wi(j)] and [σi(j)]) through mutations. Two different mutations based on Gaussian and

Cauchy distributions were used in order to further investigate the impact of

indirect strategy representation on variation operators that could increase

genetic diversity but not necessarily lead to increase in behavioral diversity.

For the self-adaptive Gaussian mutation, offspring neural networks are

generated according to the following equations:

σ′i(j) = σi(j) ∗ exp(τ ∗ Nj(0, 1)),   i = 1, . . . , 15,  j = 1, . . . , Nw,

w′i(j) = wi(j) + σ′i(j) ∗ Nj(0, 1),   i = 1, . . . , 15,  j = 1, . . . , Nw,

where Nw = 63, τ = (2(Nw)^0.5)^−0.5 = 0.251, and Nj(0, 1) is a Gaussian
random variable (zero mean and standard deviation of one) resampled for
every j. Nw is the total number of weights, biases, and the pre-game inputs
required for an IPD strategy based on memory length of one.


For the self-adaptive Cauchy mutation that is known to provide bigger

changes to the neural network weights (i.e., provide more genetic diversity)

[Yao et al. (1999)], the following equations are used:

σ′i(j) = σi(j) ∗ exp(τ ∗ Nj(0, 1)),   i = 1, . . . , 15,  j = 1, . . . , Nw,

w′i(j) = wi(j) + σ′i(j) ∗ Cj(0, 1),   i = 1, . . . , 15,  j = 1, . . . , Nw,

where Cj(0, 1) is a Cauchy random variable (centered at zero and with a

scale parameter of 1) resampled for every j. All other variables remain the

same as those in the self-adaptive Gaussian mutation.
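Both self-adaptive mutations can be sketched compactly as follows (illustrative code under the parameter settings quoted above, Nw = 63 and τ ≈ 0.251; only the noise distribution used to perturb the weights differs between the two variants):

import numpy as np

N_W = 63                              # weights, biases, and pre-game inputs
TAU = (2.0 * np.sqrt(N_W)) ** -0.5    # approximately 0.251

def self_adaptive_mutation(w, sigma, rng, cauchy=False):
    # Step sizes are mutated log-normally; the weights are then perturbed by
    # Gaussian (or, for the second variant, Cauchy) noise scaled by the
    # mutated step sizes.
    sigma_child = sigma * np.exp(TAU * rng.standard_normal(N_W))
    noise = rng.standard_cauchy(N_W) if cauchy else rng.standard_normal(N_W)
    return w + sigma_child * noise, sigma_child

rng = np.random.default_rng(1)
parent_w, parent_sigma = rng.standard_normal(N_W), np.full(N_W, 0.05)
child_w, child_sigma = self_adaptive_mutation(parent_w, parent_sigma, rng)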

For the direct look-up table representation, the details can be illustrated

by figure 3.4 [Chong and Yao (2005)], which shows the behavioral response

of a four-choice IPD strategy. mij specifies the choice to be made, given the

inputs i (player’s own previous choice) and j (opponent’s previous choice).

Rather than using pre-game inputs (two for memory length one strategies),

the first move is specified independently. Each of the table elements can

take any of the possible four choices (+1, +1/3, −1/3, −1).

                                Opponent's Previous Move
                          +1        +1/3       −1/3       −1
Player's     +1           m11       m12        m13        m14
Previous     +1/3         m21       m22        m23        m24
Move         −1/3         m31       m32        m33        m34
             −1           m41       m42        m43        m44

Fig. 3.4. The look-up table representation for the two-player IPD with four choices and

memory length one [Chong and Yao (2005)].

A simple mutation operator was used to generate offspring. Mutation

replaces the original element, mij, by one of the other three possible choices with an equal probability. For example, if mutation occurs at m13 = +1/3, then the mutated element m′13 can take either +1, −1/3, or −1 with an equal probability. Each table element has a fixed probability, pm, of being replaced by one of the remaining three choices. The value pm is not op-

timized. Crossover is not used in any of the experiments. With a direct

representation of IPD strategy behaviors, a simple mutation is more than

sufficient to provide behavioral diversity in the population.
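A short sketch of this mutation operator (illustrative only; whether the separately specified first move is mutated in the same way is an assumption made here, not a detail stated above) is:

import random

CHOICES = [1.0, 1.0 / 3, -1.0 / 3, -1.0]

def mutate_lookup_table(table, first_move, p_m=0.05):
    # Each of the 16 table elements is replaced, with probability p_m, by one
    # of the other three choices chosen uniformly at random; the first move is
    # mutated in the same way here (an assumption for this sketch).
    child = [row[:] for row in table]
    for i in range(4):
        for j in range(4):
            if random.random() < p_m:
                child[i][j] = random.choice([c for c in CHOICES if c != child[i][j]])
    if random.random() < p_m:
        first_move = random.choice([c for c in CHOICES if c != first_move])
    return child, first_move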


The following co-evolutionary procedure was used [Chong and Yao

(2005)]:

(1) Generation step, t = 0:

Initialize N/2 parent strategies, Pi, i = 1, 2, ..., N/2, randomly.

(2) Generate N/2 offspring, Oi, i = 1, 2, ..., N/2, from N/2 parents using a

variation operator.

(3) All pairs of strategies compete, including the pair where a strategy plays

itself (i.e., round-robin tournament). For N strategies in a population,

every strategy competes a total of N games.

(4) Select the best N/2 strategies based on total payoffs of all games played.

Increment generation step, t = t + 1.

(5) Steps 2 to 4 are repeated until the termination criterion (i.e., a fixed number of generations) is met.

In particular, we used N = 30, and repeated the co-evolutionary pro-

cess for 600 generations (which is sufficiently long to observe an evolutionary

outcome, e.g., persistent cooperation). A fixed game length of 150 itera-

tions is used for all games. Experiments are repeated for 30 independent

runs. Note that additional steps were taken to ensure that the initial pop-

ulation has sufficient behavioral diversity in addition to genotypic diversity

[Darwen and Yao (2000)] to avoid early convergence of results. All details

are available in [Chong and Yao (2005)]. The procedure involves setting

particular parameters for specific strategy representation and resampling

for new strategies to make sure that the frequency at which each of the

four choices (+1, +1/3, −1/3, −1) is played is approximately equal so

that there is no bias to play a particular choice early in the evolution.
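The generational loop described above can be sketched as follows (a simplified illustration, not the original implementation; random_strategy, mutate, and play_game are placeholders for the representation-specific routines, with play_game assumed to return the payoff to each player):

def coevolve(random_strategy, mutate, play_game,
             pop_size=30, generations=600, game_length=150):
    # N/2 parents produce N/2 offspring; every strategy plays every strategy
    # (including itself) in a round-robin tournament, and the best N/2 by
    # total payoff become the next generation's parents.
    parents = [random_strategy() for _ in range(pop_size // 2)]
    for _ in range(generations):
        population = parents + [mutate(p) for p in parents]
        totals = [0.0] * len(population)
        for i, a in enumerate(population):
            for b in population:
                payoff_a, _ = play_game(a, b, game_length)
                totals[i] += payoff_a
        ranked = sorted(range(len(population)), key=lambda k: totals[k], reverse=True)
        parents = [population[k] for k in ranked[: pop_size // 2]]
    return parents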

Results showed that there were fewer runs where the popu-

lation evolved to play mutual cooperation in experiments that used neural

network representations [Chong and Yao (2005)]. For example, some runs

had intermediate outcomes while a few had defection outcomes (Fig. 3.5).

This is quite different from the case for classical IPD games [Axelrod (1987);

Darwen and Yao (1995)] where each run converged to mutual cooperation

quite consistently and quickly. Increasing genetic diversity (e.g., using self-

adaptive Cauchy mutation) does not necessarily lead to more behavioral di-

versity in the population since some runs still evolved to intermediate or

defection outcomes (Fig. 3.6). The results further illustrate that more

choices have made cooperation more difficult to evolve.

However, when direct look-up table representation was used, results


[Plot: average payoff (between 1 and 4) against generation (0 to 600).]

Fig. 3.5. Five sample runs of a co-evolutionary learning system that used neural network

representation with a self-adaptive Gaussian mutation in the four-choice IPD [Chong and

Yao (2005)].

[Plot: average payoff (between 1 and 4) against generation (0 to 600).]

Fig. 3.6. Five sample runs of a co-evolutionary learning system that used neural network

representation with a self-adaptive Cauchy mutation in the four-choice IPD [Chong and

Yao (2005)].

showed that the evolution to cooperation was not difficult [Chong and Yao

(2005)]. For example, results showed that even when a simple mutation

with a low probability of mutation (e.g., pm = 0.05) was used, no run evolved to mutual defection even though intermediate outcomes were obtained (Fig. 3.7). However, increasing the probability of mutation resulted in all populations in all runs evolving to mutual cooperation play. The

results showed that the choice of strategy representation can have an impact

on the evolution of cooperation if it allows for greater behavioral diversity

in the population.

3.3.2. IPD with Noise

A natural extension to the classical IPD is to consider the impact of noisy

interactions on the evolution of certain behaviors. Axelrod noted two types


[Plot: average payoff (between 1 and 4) against generation (0 to 600).]

Fig. 3.7. Five sample runs of a co-evolutionary learning system that used direct look-up

table representation with a simple mutation at pm = 0.05 in the four-choice IPD [Chong

and Yao (2005)].

of noise, i.e., misimplementation and misperception, that can affect a strat-

egy’s response to the opponent’s choice of play [Axelrod and Dion (1988)].

With misimplementation, the strategy knows that a mistaken play was made but the opponent does not. With misperception, one or both interacting strategies

may not know that a different choice was made. The main motivation for

this extension is to study the impact of noise on the learning of certain be-

haviors through co-evolution when interactions can be noisy. In particular,

one issue that can be considered is whether cooperative strategies based

on reciprocity (such as tit for tat) can still perform well when noise, which

affects strategy behavioral response based on previous moves, is present.

Julstrom [Julstrom (1997)] investigated the effects of noise in the two-

choice IPD through a co-evolutionary learning system. In particular, noise

was modelled as mistakes. That is, there is a probability that the choice

played by a strategy is changed to the other choice (e.g., defection is played

instead of the original cooperation, and vice versa). Results from the ex-

periments showed that noise (starting around 2%) can reduce the level of

cooperation in the population.

Recently, we further extended the IPD game with more choices by in-

troducing noise and used a co-evolutionary learning system as a model for

investigations [Chong and Yao (2005)], which we have detailed in the earlier

subsection. We also modelled noise as mistakes that a player makes. For

the four-choice IPD game, there is a certain probability of occurrence, pn, fixed throughout a game, that a strategy intends to play a particular choice but ends up with a different choice instead. For example, with pn = 0.05, there will be a 0.05 probability that if +1/3 is intended to be played, one of the other three possible cooperation levels, i.e., +1, −1/3,


and −1, will be chosen uniformly at random.
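A brief sketch of this noise model (illustrative code; the helper name is hypothetical) is:

import random

CHOICES = [1.0, 1.0 / 3, -1.0 / 3, -1.0]

def apply_noise(intended_choice, p_n=0.05):
    # With probability p_n the intended choice is replaced by one of the other
    # three cooperation levels, chosen uniformly at random.
    if random.random() < p_n:
        return random.choice([c for c in CHOICES if c != intended_choice])
    return intended_choice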

Results from experiments again showed the importance of behavioral

diversity for the evolution of cooperation for noisy IPD games with more

choices. For noise introduced at very low probabilities (less than 1.5% or

pn = 0.0015), evolution to cooperation is more likely than the case when

noise was not introduced. Strategies were observed to be more forgiving,

confirming the predictions of other studies noted in [Axelrod and Dion

(1988); Wu and Axelrod (1995)]. However, when noise was introduced at

high probabilities (starting around 5% or pn = 0.05), evolution to coop-

eration was more difficult. The population was more likely to evolve to

defection.

Despite this, if the co-evolutionary learning system has sufficient behav-

ioral diversity (e.g., using direct look-up table representation that allows

for behavioral diversity to be introduced and maintained more easily and

effectively), evolution of cooperation is not greatly affected [Chong and Yao

(2005)]. Evolved strategies still played high levels of cooperation even when

there are more choices to play and the interactions can be noisy, both of which can make cooperative behaviors more difficult to evolve. For example, table 3.1 compares different co-evolutionary learning systems

with different levels of behavioral diversity, e.g., C-CEP (neural network

and self-adaptive Gaussian mutation), C-FEP (neural network and self-

adaptive Cauchy mutation), C-PM05 (direct look-up table and mutation

at pm = 0.05) for different noise levels (%) [Chong and Yao (2005)]. Results show the number of runs in each experiment that evolved to mutual defection, i.e., average payoff less than 1.5. The table showed that no runs

evolved to mutual defection when direct look-up table representation was

used in the co-evolutionary learning system [Chong and Yao (2005)].

Table 3.1. Comparison of results for three

different co-evolutionary learning systems.

Noise (%) C-CEP C-FEP C-PM05

0 4 1 0

5 4 9 0

10 7 11 0

15 8 17 0

20 18 26 0

It should be noted that although both mutation and noise can be consid-

ered as sources of behavioral variations in models that encourage coopera-


tion [Mcnamara et al. (2004)], they produce behavioral diversity differently.

Mutation introduces strategies with different behaviors into the population.

Noise allows other parts of a strategy’s behavior that are not played other-

wise in a noiseless IPD game to be accessed. Our results [Chong and Yao

(2005)] showed that noise does not necessarily promote behavioral diversity

in the population that lead to a stable evolution to cooperation, although

noise at low levels does help. With higher levels of noise, closer inspection

of evolved strategies showed the population to overspecialize to a specific

behavior that is vulnerable to invasion, leading to cyclic dynamics in the

evolutionary process between cooperation and defection.

In particular, noise and mutation have different impacts on the evolu-

tionary process [Chong and Yao (2005)]. For example, increasingly higher

levels of noise lead to mutual defection outcomes. Given a very noisy en-

vironment, strategies overspecialized to play defection only. This was not

observed in the noiseless case of the IPD with increasingly more mutations.

For example, increasingly higher mutation rates in the co-evolutionary

learning system that used direct look-up table representation did not lead to

mutual defection outcomes. Strategies were not observed to overspecialize

to play defection, or any specific play.

3.3.3. N-Player IPD

Real-world interactions may involve more than two players. One famous

example is the “tragedy of the commons” [Hardin (1968)], which illustrates

the problem of players acting out of self-interest on a shared public good for short-term rewards, leading to a situation where everyone loses out in the end. For the case of the IPD, the original two-player formulation can be extended to N-player interactions [Axelrod and Dion (1988)]. This

allows for the study of whether cooperative behaviors are possible when

interactions involve more than two players since strategies that are effective

for the two-player case may not be effective (or worse, fail) in large group

interactions [Glance and Huberman (1994)].

One of us formulated an N-player IPD or NIPD game for investiga-

tions using the co-evolutionary learning approach [Yao and Darwen (1994)]

(other studies include [Bankes (1994); Lindgren and Johansson (2001)]).

The NIPD game is defined by the following three properties [Colman (1982)]

(page 159):

• Each player faces a choice between two options: cooperation and defection.


• Defection is dominant for each player, i.e., each player is better off

defecting than cooperating regardless of how many of the other players cooperate.

• The dominant defection strategies intersect in a deficit equilibrium. In

particular, the outcome if all players choose their non-dominant coop-

eration strategies is preferable from every player’s point of view to the

one in which everyone chooses defection, but no one is motivated to

deviate unilaterally from defection.

The payoff matrix (Fig. 3.8) for the NIPD game can then be constructed

based on the following conditions that must be satisfied [Yao and Darwen

(1994)]:

• Di > Ci for 0 ≤ i ≤ n − 1.
• Di+1 > Di and Ci+1 > Ci for 0 ≤ i ≤ n − 1.
• Ci > (Di + Ci−1)/2 for 0 ≤ i ≤ n − 1.

A large number of values satisfy these conditions. For the study in [Yao and Darwen (1994)], the values are chosen such that if nc is the number of cooperators in the NIPD game, then the payoff for cooperation is 2nc − 2 and the payoff for defection is 2nc + 1 (Fig. 3.9). For this payoff matrix, the average per-move payoff a can be calculated as follows if Nc cooperative moves are made out of N moves:

a = 1 + (Nc/N)(2n − 3),

which allows the measurement of how common cooperation was by examining the average per-round payoff.

                 Number of cooperators among the remaining n−1 players
                          0       1       2       …       n−1
Player A     C            C0      C1      C2      …       Cn−1
             D            D0      D1      D2      …       Dn−1

Fig. 3.8. The payoff matrix for the NIPD game. The value in the table gives the payoff

to the player based on its choice of play [Yao and Darwen (1994)].


                 Number of cooperators among the remaining n−1 players
                          0       1       2       …       n−1
Player A     C            0       2       4       …       2(n−1)
             D            1       3       5       …       2(n−1)+1

Fig. 3.9. An example of the payoff matrix for the NIPD game [Yao and Darwen (1994)].
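As an illustration of this payoff scheme, the following sketch (written for this exposition) computes each player's payoff in one NIPD round according to figure 3.9, together with the average per-move payoff formula quoted above:

def nipd_payoffs(moves):
    # moves: list of booleans (True = cooperate) for one NIPD round. Following
    # figure 3.9, a cooperator earns 2 * (number of other cooperators) and a
    # defector earns 2 * (number of other cooperators) + 1.
    payoffs = []
    for cooperated in moves:
        others = sum(moves) - (1 if cooperated else 0)
        payoffs.append(2 * others if cooperated else 2 * others + 1)
    return payoffs

def average_per_move_payoff(nc_moves, total_moves, group_size):
    # a = 1 + (Nc/N)(2n - 3): equals 1 when all moves defect and 2(n - 1)
    # when all moves cooperate.
    return 1.0 + (nc_moves / total_moves) * (2 * group_size - 3)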

NIPD game interactions took the form of a large number of randomly selected groups of N players, sampled with replacement (e.g., 1000 NIPD games

for a population of 100 strategies). Results from the experiments in [Yao

and Darwen (1994)] showed the group size (i.e., the value of N in the NIPD

game) has a negative impact on the evolution of cooperation. As N in-

creases, there are fewer runs where the population evolved to

play cooperation. For example, in the case of memory two strategies, only

one out of 20 runs had defection outcomes for 3IPD. However, the number

of runs with defection outcomes increased to nine for 6IPD. Increasing N to

16 (i.e., 16IPD) resulted in all runs evolving to defection outcomes [Yao

and Darwen (1994)].

3.3.4. Other Extensions

There are many other extensions to the classical IPD game, or even fur-

ther extensions to already extended IPD games (such as the NIPD) that

can be studied through a co-evolutionary learning approach. For example,

we examined the impact of localized interactions of the NIPD games in

[Seo et al. (1999, 2000)]. The earlier study for the NIPD [Yao and Darwen

(1994)] showed that the evolution of cooperation is more difficult to achieve

through a co-evolutionary learning process as N increases. However, in some

real-world interactions, it is unlikely that a player interacts with everybody

(or that it has equal probability of interacting with anyone in the popu-

lation). Instead, a player might interact with other specific players (e.g.,

neighbours, relatives, or at the workplace). Such localized interactions may

involve spatial models [Nowak and May (1992); Ishibuchi and Namikawa

(2005)]. In particular, localized interactions can have a positive impact

on the evolution of cooperation in the NIPD game. That is, population


structured in a spatial model is more likely to evolve cooperation [Seo et al.

(1999); Lindgren and Johansson (2001)].

Another extension that can be considered is to incorporate indirect in-

teractions into the IPD game, which originally only considers direct interactions

between strategies. Most of the previous studies have focused on modelling

direct interactions (e.g., cooperative behaviors through direct reciprocity

that involves repeated encounters, i.e., IPD games [Axelrod (1984)]) or

indirect interactions (e.g., cooperative behaviors through mechanisms of

indirect reciprocity such as reputation, where an individual receives cooperation from third parties due to the individual's cooperative behaviors to others [Nowak and Sigmund (1998b)]).

However, it has been suggested that complex real-world interactions involve

both direct and indirect interactions (although for simplicity of modelling

and analysis, only one of the interactions is considered at one time) [Nowak

and Sigmund (1998a)]. For this aspect, we have investigated a model with

both direct and indirect interactions [Yao and Darwen (1999)]. In partic-

ular, each strategy is tagged with a reputation score, which is calculated

based on payoffs received from a small random sample of pre-games. A

co-evolutionary approach was used to show that with the addition of reputation, co-

operative outcomes are possible and more likely even for the case of the IPD

with more choices and shorter game durations [Yao and Darwen (1999)].

In addition, another extension is to consider the adaptation

of payoff matrices. We recently conducted a preliminary study on evolving

strategy payoff matrices, and how such an adaptation process can affect

the learning of strategy behaviors [Chong and Yao (2006)]. The motivation

for the study is to relax the assumption of having a fixed, symmetric payoff

matrix for all evolving strategies. This assumption may not be realistic,

considering that not all players are similar in real-world interactions. We

focus specifically on an adaptation process of payoff matrix based on past

behavioral interactions. In particular, a simple update rule that provides

a reinforcement feedback process between strategy behaviors and payoff

matrices during the co-evolutionary process is used. Results from exper-

iments [Chong and Yao (2006)] showed that the evolutionary outcome is

dependent on the adaptation process of both behaviors (i.e., strategy be-

havioral responses) and utility expectations that determine how behaviors

are rewarded (i.e., strategy payoff matrices). Defection outcomes are more

likely to be obtained if IPD-like update rules that favor the exploitation of

opponents are used. However, cooperative outcomes can be easily obtained

when mutualism-like update rules that favor mutual cooperation are used.


3.4. Conclusion and Future Directions

The greatest advantage and the most important feature of co-evolutionary

learning is the process of adaptation on a representation that is de-

pendent on the interactions between members of the population. In this

aspect, the co-evolutionary learning approach is well-suited to solving the

problem of IPD games in two contexts. First, co-evolutionary learning can

be used as a search algorithm for effective strategies without requiring hu-

man knowledge. All that is required is the rules of the game. Second,

the adaptation process of strategy behaviors based on interactions in co-

evolution provides a natural way to investigate conditions that lead to the

evolution of certain behaviors. In both of these contexts, the advantage of

co-evolutionary learning to other approaches is that strategy behaviors are

not fixed or predefined. Instead, co-evolutionary learning provides a means

to realize strategy behavioral responses that are not necessarily bounded

by expert human knowledge, thus providing new insight to the problem.

Since the first study of co-evolutionary learning on the classical IPD by

Axelrod [Axelrod (1987)], there has been a wide range of studies that fur-

ther extended the classical IPD game with additional features such as, but

not limited to, continuous or multiple levels of cooperation, noisy interac-

tions, N-player interactions, spatial interactions, and indirect interactions.

The motivation in all of these studies is to bridge the gap between the ab-

stract IPD interactions and the complex real-world interactions. As such,

by understanding the specific conditions that lead to the evolution of spe-

cific IPD strategy behaviors, these studies have further helped to provide a

more in-depth view on complex real-world interactions such as those found

in the human society.

There is still much more that can be explored using the co-evolutionary learning approach. One direction will be to further extend the more complex IPD games and investigate the impact of the additional extensions.

This is important because the extensions might interact with one another

in some unknown and nonlinear fashion. Understanding these interactions

will help to further unravel complex human interactions. Another direction

will be to investigate a more rigorous approach to determine the robust-

ness of evolved strategy behaviors. In this particular aspect, the notion of

generalization might provide a more natural approach for co-evolutionary

learning in addition to the classical evolutionary game theory approach of evolutionarily stable strategies.


References

Atmar, W. (1994). Notes on the simulation of evolution, IEEE Transactions on

Neural Networks 5, 1, pp. 130–147.

Axelrod, R. (1980a). Effective choice in the prisoner’s dilemma, The Journal of

Conflict Resolution 24, 1, pp. 3–25.

Axelrod, R. (1980b). More effective choice in the prisoner’s dilemma, The Journal

of Conflict Resolution 24, 3, pp. 379–403.

Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York).

Axelrod, R. (1987). The evolution of strategies in the iterated prisoner’s dilemma,

in L. D. Davis (ed.), Genetic Algorithms and Simulated Annealing, chap. 3

(Morgan Kaufmann, New York), pp. 32–41.

Axelrod, R. and Dion, D. (1988). The further evolution of cooperation, Science

242, 4884, pp. 1385–1390.

Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science

211, pp. 1390–1396.

Back, T. (1996). Evolutionary Algorithms in Theory and Practice (Oxford Uni-

versity Press, New York).

Back, T., Hammel, U. and Schwefel, H. P. (1997). Evolutionary computation:

Comments on the history and current state, IEEE Transactions on Evolu-

tionary Computation 1, 1, pp. 3–17.

Bankes, S. (1994). Exploring the foundations of artificial societies: Experiments

in evolving solutions to iterated n-player prisoner’s dilemma, in R. Brookes

and P. Maes (eds.), Artificial Life IV (Addison-Wesley), pp. 337–342.

Chellapilla, K. and Fogel, D. B. (1999). Evolution, neural networks, games, and

intelligence, Proc. IEEE 87, 9, pp. 1471–1496.

Chong, S. Y. and Yao, X. (2005). Behavioral diversity, choices, and noise in the

iterated prisoner’s dilemma, IEEE Transactions on Evolutionary Compu-

tation 9, 6, pp. 540–551.

Chong, S. Y. and Yao, X. (2006). Self-adaptive payoff matrices in repeated in-

teractions, in 2006 IEEE Symposium on Computational Intelligence and

Games (CIG’06) (IEEE Press, Piscataway, NJ), pp. 103–110.

Colman, A. M. (1982). Game Theory and Experimental Games (Pergamon Press,

Oxford).

Darwen, P. and Yao, X. (1995). On evolving robust strategies for iterated pris-

oner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in

Artificial Intelligence, Vol. 956, pp. 276–292.

Darwen, P. and Yao, X. (2000). Does extra genetic diversity maintain escalation

in a co-evolutionary arms race, International Journal of Knowledge-Based

Intelligent Engineering Systems 4, 3, pp. 191–200.

Darwen, P. and Yao, X. (2001). Why more choices cause less cooperation in

iterated prisoner’s dilemma, in Proc. 2001 Congress on Evolutionary Com-

putation (CEC’01) (IEEE Press, Piscataway, NJ), pp. 987–994.

Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner’s dilemma with

intermediate levels of cooperation: Application to missile defense, Inter-

national Journal of Computational Intelligence and Applications 2, 1, pp.


83–107.

Darwen, P. J. (1996). Co-evolutionary Learning by Automatic Modularization with

Speciation, Ph.D. thesis, University of New South Wales, Sydney, Australia.

Darwen, P. J. and Yao, X. (1997). Speciation as automatic categorical modulariza-

tion, IEEE Transactions on Evolutionary Computation 1, 2, pp. 101–108.

Fogel, D. B. (1991). The evolution of intelligent decision making in gaming, Cy-

bernetics and Systems: An International Journal 22, pp. 223–236.

Fogel, D. B. (1993). Evolving behaviors in the iterated prisoner’s dilemma, Evo-

lutionary Computation 1, 1, pp. 77–97.

Fogel, D. B. (1994a). An introduction to simulated evolutionary optimization,

IEEE Transactions on Neural Networks 5, 1, pp. 3–14.

Fogel, D. B. (1994b). An introduction to simulated evolutionary optimization,

IEEE Transactions on Neural Networks 5, 1, pp. 3–14.

Fogel, D. B. (1995). Evolutionary Computation: Toward a New Philosophy of

Machine Intelligence (IEEE Press, Piscataway, NJ).

Fogel, D. B. (1996). On the relationship between the duration of an encounter and

the evolution of cooperation in the iterated prisoner’s dilemma, Evolution-

ary Computation 3, 3, pp. 349–363.

Franken, N. and Engelbrecht, A. P. (2005). Particle swarm optimization ap-

proaches to coevolve strategies for the iterated prisoner’s dilemma, IEEE

Transactions on Evolutionary Computation 9, 6, pp. 562–579.

Glance, N. S. and Huberman, B. A. (1994). The dynamics of social dilemmas,

Scientific American , pp. 58–63.

Hardin, G. (1968). The tragedy of the commons, Science 162, pp. 1243–1248.

Harrald, P. G. and Fogel, D. B. (1996). Evolving continuous behaviors in the

iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s

Dilemma 37, pp. 135–145.

Ishibuchi, H. and Namikawa, N. (2005). Evolution of iterated prisoner’s dilemma

game strategies in structured demes under random pairing in game playing,

IEEE Transactions on Evolutionary Computation 9, 6, pp. 552–561.

Julstrom, B. A. (1997). Effects of contest length and noise on reciprocal altruism,

cooperation, and payoffs in the iterated prisoner’s dilemma, in Proc. 7th

International Conf. on Genetic Algorithms (ICGA’97) (Morgan Kauffman,

San Francisco, CA), pp. 386–392.

Lindgren, K. (1991). Evolutionary phenomena in simple dynamics, in C. G. Lang-

ton, C. Taylor, J. D. Farmer and S. Rasmussen (eds.), Artificial Life II

(Addison-Wesley), pp. 295–312.

Lindgren, K. and Johansson, J. (2001). Coevolution of strategies in n-person

prisoner’s dilemma, in J. Crutchfield and P. Schuster (eds.), Evolutionary

Dynamics - Exploring the Interplay of Selection, Neutrality, Accident, and

Function (Addison-Wesley).

Mcnamara, J. M., Barta, Z. and Houston, A. I. (2004). Variation in behaviour

promotes cooperation in the prisoner’s dilemma, Nature 428, pp. 745–748.

Miller, J. (1989). The coevolution of automata in the iterated prisoner’s dilemma,

Tech. Rep. 89-003, Santa Fe Institute Report.

Nowak, M. A. and May, R. M. (1992). Evolutionary games and spatial chaos,


Nature 355, pp. 250–253.

Nowak, M. A. and Sigmund, K. (1998a). The dynamics of indirect reciprocity,

Journal of Theoretical Biology 194, pp. 561–574.

Nowak, M. A. and Sigmund, K. (1998b). Evolution of indirect reciprocity by

image scoring, Nature 393, pp. 573–577.

Seo, Y. G., Cho, S. B. and Yao, X. (1999). Emergence of cooperative coalition

in nipd game with localization of interaction and learning, in Proc. IEEE

1999 Congress on Evolutionary Computation (CEC’99) (IEEE Press, Pis-

cataway, NJ), pp. 877–884.

Seo, Y. G., Cho, S. B. and Yao, X. (2000). Exploiting coalition in co-evolutionary

learning, in Proc. IEEE 2000 Congress on Evolutionary Computation

(CEC’00) (IEEE Press, Piscataway, NJ), pp. 1268–1275.

Stanley, E. A., Ashlock, D. and Smucker, M. D. (1995). Prisoner’s dilemma with

choice and refusal of partners: Evolutionary results, in Proc. Third Euro-

pean Conf. on Advances in Artificial Life, pp. 490–502.

Wu, J. and Axelrod, R. (1995). How to cope with noise in the iterated prisoner’s

dilemma, The Journal of Conflict Resolution 39, 1, pp. 183–189.

Yao, X. (1994). Introduction, Informatica (Special Issue on Evolutionary Com-

putation) 18, pp. 375–376.

Yao, X. (1999). Evolving artificial neural networks, Proc. IEEE 87, 9, pp. 1423–

1447.

Yao, X. and Darwen, P. (1999). How important is your reputation in a multi-

agent environment, in Proc. 1999 Conf. on Systems, Man, and Cybernetics

(SMC’99) (IEEE Press, Piscataway, NJ), pp. 575–580.

Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated

prisoner’s dilemma games, Informatica 18, pp. 435–450.

Yao, X., Liu, Y. and Darwen, P. J. (1996). How to make best use of evolutionary

learning, in R. Stocker, H. Jelinck, B. Burnota and T. Bossomaier (eds.),

Complex Systems - From Local Interactions to Global Phenomena (IOS

Press, Amsterdam), pp. 229–242.

Yao, X., Liu, Y. and Lin, G. (1999). Evolutionary programming made faster,

IEEE Transactions on Evolutionary Computation 3, 2, pp. 82–102.



Chapter 4

How to Design a Strategy to Win an IPD Tournament

Jiawei Li

University of Nottingham, Harbin Institute of Technology

4.1. Introduction

Imagine that a player in an IPD tournament knows the strategy of each of

his opponents; he will defect against opponents such as ALLC or ALLD and

cooperate with opponents such as GRIM or TFT in order to maximize his

payoff. This means that he can interact with each opponent optimally and

receive higher payoffs. Although this information a priori is not possible,

one can identify a strategy during the game. For example, if a strategy

cooperated with its opponent in the previous 10 rounds while its opponent

defected, it seems sensible to deduce that it will always cooperate. In fact,

each strategy will gradually reveal itself through the IPD game; moreover,

it is not after the game that we can identify the strategy but possibly after

a few rounds. With an efficient identification mechanism, it is possible for

a strategy to interact with most of its opponent optimally.

However, two main problems must be solved in designing an efficient

identification mechanism. Firstly, it is impossible, in theory, for a strat-

egy to identify an opponent within a finite number of rounds because the

number of possible strategies is huge. Only can the types of strategies be-

longing to a preconcerted finite set be identified, which may be just a small

proportion of all those possible because identification will be of no use if

it takes too long. Secondly, there exists a risk of exploring an opponent

putting the player into a much worse position. In other words, such an

action may have negative effect on future rewards. For example, in order

to distinguish between ALLC and GRIM, a strategy has to defect at least

once and loses the chance to cooperate with GRIM in the future.

In this chapter we will discuss how to resolve these problems, how to


design an identification mechanism for IPD games, and how the strategy of

Adaptive Pavlov was designed, which was ranked first in Competition 4 of

the 2005 IPD tournament.

4.2. Analysis of strategies involved in IPD games

Every strategy may have its disadvantages as well as its advantages. A

strategy may receive high payoffs when its opponent belongs to some set of

strategies, and receive lower payoffs when an opponent belongs to another

set of strategies. However, some strategies always do better than others in

IPD tournaments.

The strategies involved in IPDs can be classified according to whether

or not they respond to their opponents. One set of strategies is fixed and

plays a predetermined action no matter what their opponent does. ALLD,

ALLC and RAND are typical. Other strategies are more complicated and

their actions depend on their opponent’s behavior. TFT, for example, starts

with COOPERATE and then repeats his opponent’s last move. The second

set is obviously superior to the former since the strategies like TFT, TFTT

and GRIM have always performed better than ’fixed’ strategies in past IPD

tournaments.

Then, the question is what the optimal response to every opponent is.

Is TFT’s imitation of opponent’s last move the best response? Although

TFT has been shown to be superior to many other strategies, it is not good

enough to win every IPD tournament.

Let's consider a simulation of an IPD tournament with 9 players. These

players are ALLC, ALLD, RAND, GRIM, TFT, STFT, TFTT, TTFT, and

Pavlov. The descriptions of the strategies of these players are as shown in

Table 4.1. These strategies are simple and representative, and have all

appeared in past IPD tournaments.

The rule of our simulation is that each strategy will play a 200-round

IPD game with every strategy (including itself). The payoffs in a round

are as shown in Fig. 4.1. The total payoff received by any given strategy

is the summation of the payoffs throughout the tournament.

The results of the tournaments vary because there are random choices

in the strategies of Pavlov and RAND. In order to decrease the variability

of the result, the tournament is repeated several times and the average

score for each strategy is calculated. Simulation results show that TFT,

TFTT and GRIM acquire higher scores than the others and their average

scores across several tournaments are quite close. TFTT, however, wins


Table 4.1. Description of the players of the IPD simulation.

Players   Descriptions
ALLC      Always plays COOPERATE.
ALLD      Always plays DEFECT.
RAND      Plays DEFECT or COOPERATE with probability 1/2.
GRIM      Starts with COOPERATE, but after one defection always plays DEFECT.
TFT       Starts with COOPERATE, and then repeats the opponent's moves.
TFTT      Like TFT, but it plays DEFECT only after two consecutive defections.
STFT      Like TFT, but its first move is DEFECT.
TTFT      Like TFT, but it plays DEFECT twice after an opponent's defection.
Pavlov    The result of each move is divided into two groups: SUCCESS (payoff 5 or 3) and DEFEAT (payoff 1 or 0). If the last result belongs to the SUCCESS group it plays the same move, otherwise it plays the other move.

                             Player 2's choice
                             COOPERATE   DEFECT
Player 1's   COOPERATE         (3,3)      (0,5)
choice       DEFECT            (5,0)      (1,1)

Fig. 4.1. Payoff table of the IPD tournament. The numbers in brackets denote the payoffs the two players receive in a round of a game.

TFTT, however, wins individual tournaments more often than the others. For example, TFTT wins 11 tournaments out of a total of 20, while TFT wins 4 and GRIM wins 5. In addition, if Pavlov and RAND are removed, TFTT always wins.
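For concreteness, the round-robin setup just described can be sketched in Python as follows. This is a minimal illustrative sketch, not the author's code: only three of the Table 4.1 strategies are shown, and PAYOFF encodes the values in Fig. 4.1.

# Payoffs from Fig. 4.1, indexed by (my move, opponent's move); 'C' = COOPERATE, 'D' = DEFECT.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def allc(my_hist, opp_hist):          # ALLC: always cooperate
    return 'C'

def alld(my_hist, opp_hist):          # ALLD: always defect
    return 'D'

def tft(my_hist, opp_hist):           # TFT: start with C, then copy the opponent's last move
    return 'C' if not opp_hist else opp_hist[-1]

def play(s1, s2, rounds=200):
    """Play one IPD game of the given length and return the first player's total payoff."""
    h1, h2, total = [], [], 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        total += PAYOFF[(m1, m2)]
        h1.append(m1)
        h2.append(m2)
    return total

def tournament(strategies, rounds=200):
    """Each strategy plays every strategy (including itself); its score is the sum over all its games."""
    return {name: sum(play(s, other, rounds) for other in strategies.values())
            for name, s in strategies.items()}

print(tournament({'ALLC': allc, 'ALLD': alld, 'TFT': tft}))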

One of the limitations of TFT is that it inevitably runs into a defecting-defected cycle (meaning that TFT plays COOPERATE while its opponent defects, and then TFT plays DEFECT while its opponent cooperates) whenever its opponent happens to be STFT. However, if TFT cooperated once more after its opponent's defection, cooperation would be achieved, resulting in higher payoffs. TFTT is superior to TFT in this regard, and it is for this reason that TFTT wins more tournaments than TFT in the above IPD simulation. It is easy to verify that TFT would not get lower scores than TFTT if STFT were removed from the simulation.

Thus, we can improve TFT in the following way: when TFT enters a defecting-defected cycle (for example, a sequence of three defecting-defected pairs), it chooses COOPERATE in two consecutive rounds. This modified TFT (MTFT) achieves higher payoffs than TFT when its opponent is STFT. Substituting MTFT for TFT, IPD experiments show that MTFT gets the highest average score and wins more individual tournaments than the others.
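One way to realise this modification is sketched below in Python. This is an illustrative sketch based on our reading of the rule, not the author's code; the detection window of three mismatched rounds is the example value given above.

def mtft(my_hist, opp_hist, window=3):
    """Modified TFT: behave like TFT, but break a defecting-defected cycle
    by cooperating in two consecutive rounds."""
    if not opp_hist:
        return 'C'                                   # start like TFT
    n = len(my_hist)
    # A defecting-defected cycle shows up as rounds in which the two moves differ.
    cycle_now = n >= window and all(my_hist[-i] != opp_hist[-i] for i in range(1, window + 1))
    cycle_before = n >= window + 1 and all(my_hist[-i - 1] != opp_hist[-i - 1] for i in range(1, window + 1))
    if cycle_now or cycle_before:
        return 'C'                                   # the two conciliatory moves
    return opp_hist[-1]                              # otherwise play exactly like TFT

Against STFT, the cycle C-D, D-C, C-D is detected after three rounds; the two extra COOPERATE moves then let both players settle into mutual cooperation.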

MTFT uses an identification technique: it identifies STFT by detecting defecting-defected cycles during an IPD game. When the opponent is judged to be STFT, the optimal action (cooperating in two consecutive rounds) is carried out in order to maximize future payoffs. It is then natural to expect that MTFT can be further improved so that it identifies more strategies and interacts with each of them optimally.

In the following sections, an approach to identify each strategy in a

finite set will be introduced. A strategy can interact with the opponents

almost optimally by using this identification mechanism.

4.3. Estimation of possible strategies in an IPD tournament

In this section, we seek to define a finite set of strategy types to be identified. Since the number of possible IPD strategies is infinite, it is impossible to identify each of them in a finite number of rounds. For example, suppose that a strategy cooperated with its opponent for 10 consecutive rounds while its opponent defected continuously. Although it is very likely to be ALLC, there are always other possibilities: it may be a GRIM-like strategy whose trigger is 11 defections; it may be RAND, which just happened to play 10 consecutive COOPERATEs; or it may be a combination of ALLC and TFT that will behave like TFT in the following rounds. However, since only ALLC belongs to the identification set, those other possibilities are eliminated.

How the identification set is chosen depends on prior knowledge and subjective estimation: some strategies, like TFT, are likely to appear, while others are designated as default strategies.

There are numerous strategies one could design for an IPD tournament. However, most of them seldom appear, because their chances of winning are very small. For example, consider a strategy that cooperates in the first two rounds, defects in the following two rounds, and then keeps alternating between cooperation and defection. Few players would enter such a strategy, because it is unlikely to win any IPD tournament. In general, the strategies that usually win appear frequently and the others appear infrequently.

We define two classes of IPD strategies: cooperating and defecting. Cooperating strategies, for example TFT and TFTT, wish to cooperate with their opponents and never defect first. Defecting strategies, for example ALLD and Pavlov beginning with DEFECT (PavlovD), wish to defect in order to maximize their payoffs, and they always defect first.

The cooperating strategies differ in how they respond to the opponent's defections. For example, TFTT is more forgiving than TFT, as it retaliates only if its opponent has defected twice; GRIM is sterner than TFT, as it never forgives a defection. The rules of these strategies are the same as those described in the previous simulation, and they can be ordered according to their responses to the opponent's defections, as shown in Fig. 4.2.

ALLC --- TFTT --- TFT --- TTFT --- GRIM
Forgiving  <------------------------>  Stern

Fig. 4.2. The cooperating strategies.

The defecting strategies differ in how insistently they defect. PavlovD is a representative strategy in this set: it starts with DEFECT; if the opponent is too forgiving to retaliate, it defects forever, and otherwise it tries to cooperate with the opponent.a The defecting strategies can be classified as shown in Fig. 4.3.

STFT --- PavlovD --- ALLD
Defect less  <------------>  Defect more

Fig. 4.3. The defecting strategies.

Other simple strategies which lack a clear objective differ from the co-

operating and defecting strategies and hardly ever get high scores in IPD

tournaments.

At present, most of the players in an IPD tournament will be cooperating strategies, since cooperating strategies have been dominant in most tournaments. There will also be a small number of defecting strategies. Based on this observation, we have designed the Adaptive Pavlov strategy, which applies a simple mechanism to distinguish cooperating strategies and several representative defecting strategies.

a. Although PavlovD tries to cooperate with an opponent when the opponent retaliates upon its defection, it seldom succeeds. For example, even if PavlovD meets a forgiving strategy like TFTT, the two cannot keep cooperating in the game. In fact, if PavlovD only cooperated one more time, cooperation could be achieved. We have examined a modified PavlovD (MPavlovD) strategy that starts with DEFECT and cooperates twice when the opponent retaliates. Simulation results show that MPavlovD always gains a higher score than PavlovD.

4.4. Interaction with a strategy optimally

For any strategy there must be another strategy that deals with it optimally. Because ALLC, ALLD and RAND are independent of the opponent's behavior, ALLD is the optimal response to them. Because GRIM, TFT, STFT and TTFT retaliate as soon as their opponent defects, the optimal strategy against them is to always cooperate but defect in the last round. TFTT is more charitable and forgives a single defection; therefore, its opponent can maximize its payoff by alternately choosing DEFECT and COOPERATE. If Pavlov starts with COOPERATE, its opponent should always cooperate except in the last round; otherwise, its opponent should start with DEFECT and then always cooperate except in the last round. Table 4.2 shows the optimal strategies for dealing with each strategy in Table 4.1.

Table 4.2. Optimal strategies to interact with a known strategy.

Strategy   Optimal strategy of opponent
ALLC       Always play DEFECT.
ALLD       Always play DEFECT.
RAND       Always play DEFECT.
GRIM       Always play COOPERATE, except DEFECT on the last move.
TFT        Always play COOPERATE, except DEFECT on the last move.
TFTT       Start with DEFECT, and then play COOPERATE and DEFECT in turn.
STFT       Always play COOPERATE, except DEFECT on the last move.
TTFT       Always play COOPERATE, except DEFECT on the last move.
Pavlov     If Pavlov starts with DEFECT, start with DEFECT and then always play COOPERATE, except DEFECT in the last round; if Pavlov starts with COOPERATE, always play COOPERATE, except DEFECT in the last round.

Given an IPD tournament with n players, a player will win the tournament if it interacts with each of its opponents optimally. For example, a single ALLD will win when the other n − 1 players in an IPD tournament are all ALLC. Hence, the winning strategy of an IPD tournament must interact optimally with most of the others.

Although a player's strategy is unknown to its opponent before a game, the strategy gradually emerges as the game progresses. It is not difficult for a human player to identify the strategy of an opponent, but it is much harder for a computer program to possess this ability. To make it feasible, a method is needed to distinguish each type of strategy from the others, so that a computer program can respond to different types of strategies appropriately. Under the assumption that every player belongs to a pre-defined finite set of strategies, an example is given below to show how the identification method is realized and how the winning strategy is designed.

Consider an IPD tournament with 10 players. Besides the players shown

in Table 4.1, let us add a new player MyStrategy (MS) which applies an

identification mechanism to identify its opponent. The rules are the same

as those described in the previous simulation.

MS starts with DEFECT. If its opponent chooses DEFECT in the first

round, MS chooses COOPERATE in round two, otherwise MS chooses

DEFECT. MS always chooses COOPERATE in the third round. In this

way, most of the strategies can be identified after just three rounds.
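This probing rule, and the elimination-style reasoning illustrated in the example that follows, can be sketched in Python as below. This is an illustrative sketch only; the candidate strategy functions are assumed to take the form strategy(my_hist, opp_hist) and to be deterministic, so RAND cannot be eliminated this way.

def ms_probe_move(round_no, opp_hist):
    """MS's opening policy: DEFECT in round 1; in round 2 COOPERATE only if the
    opponent defected in round 1; always COOPERATE in round 3."""
    if round_no == 1:
        return 'D'
    if round_no == 2:
        return 'C' if opp_hist[0] == 'D' else 'D'
    return 'C'

def surviving_candidates(candidates, my_moves, opp_moves):
    """Keep only the candidate strategies whose predicted moves match what the
    opponent actually played, given the moves MS has made so far."""
    alive = {}
    for name, strategy in candidates.items():
        cand_hist, ms_hist, consistent = [], [], True
        for my, opp in zip(my_moves, opp_moves):
            if strategy(cand_hist, ms_hist) != opp:
                consistent = False
                break
            cand_hist.append(opp)
            ms_hist.append(my)
        if consistent:
            alive[name] = strategy
    return alive

Any observed sequence that eliminates every deterministic candidate is attributed to RAND, exactly as in the deduction below.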

For example, suppose that the choices of MS and its opponent in the first three rounds are as shown in Fig. 4.4. The opponent's strategy can then be confirmed to be RAND. Because the opponent starts with DEFECT, it must be one of ALLD, STFT, RAND or Pavlov. Since MS defects in the first round and the opponent cooperates in round two, it cannot be ALLD or STFT. Since MS and the opponent both cooperate in the second round, the opponent would not defect in the third round if it were Pavlov. Therefore, the opponent must be RAND. The optimal strategy for interacting with RAND is ALLD, so MS behaves as ALLD in the remaining rounds of the game.

                   Round 1     Round 2     Round 3
MS's moves         Defect      Cooperate   Cooperate
Opponent's moves   Defect      Cooperate   Defect

Fig. 4.4. A possible process of a game (showing that the opponent is RAND).

Some possible identification results for the 9 strategies are listed in Table 4.3, where 'C' denotes COOPERATE and 'D' denotes DEFECT. Because RAND chooses its moves randomly, it may behave like any other strategy over a short period; therefore, more rounds are needed to distinguish RAND from the other strategies. If a process occurs that differs from those shown in Table 4.3, the opponent's strategy must be RAND.

Table 4.3. Identification of the 9 strategies.

Players          Possible moves of two players   Identification result
MyStrategy       D C C                           Pavlov (RAND)
The opponent     D C C
MyStrategy       D C C                           ALLD (RAND)
The opponent     D D D
MyStrategy       D C C                           STFT (RAND)
The opponent     D D D C
MyStrategy       D D C                           ALLC (RAND)
The opponent     C C C C
MyStrategy       D D C                           TFTT (RAND)
The opponent     C C C D
MyStrategy       D D C                           Pavlov (RAND)
The opponent     C C D C
MyStrategy       D D C C                         TFT (RAND)
The opponent     C C D D C
MyStrategy       D D C C C                       TTFT (RAND)
The opponent     C D D D C
MyStrategy       D D C C C                       GRIM (RAND)
The opponent     C D D D D

In this way, a strategy can be identified after several rounds of the game, and the optimal response can then be applied.

Ten IPD tournaments with the above 10 players were carried out.b The simulation results are shown in Fig. 4.5. They show that MS gains the highest average payoff of all the strategies and achieves the highest score in every single tournament.

b. The number of rounds in an IPD game is usually not fixed, in order to prevent players from knowing when the game will end. The simulation uses a fixed number of rounds in order to reduce the computational cost. However, MS does not exploit this to gain extra payoff; that is to say, MS does not deliberately choose DEFECT in the last round of a game.


Players Points in 10 tournaments Average Rank

MS 6134 6213 6179 6127 6202 6175 6152 6172 6212 6187 6175.3 1

TFTT 5957 5996 5970 6003 5994 5959 5965 5969 5966 5976 5975.5 2

TFT 5961 5936 5919 5946 5959 5938 5940 5929 5954 5978 5946.0 3

Pavlov 5718 5691 5725 5775 5816 5763 5748 5763 5733 5745 5747.7 4

TTFT 5725 5723 5725 5717 5719 5725 5746 5732 5722 5716 5725.0 5

GRIM 5404 5394 5416 5410 5440 5468 5322 5400 5390 5384 5402.8 6

ALLC 5115 5091 5103 5127 5103 5103 5103 5082 5109 5091 5102.7 7

RAND 4339 4349 4254 4340 4216 4219 4258 4241 4228 4274 4271.8 8

STFT 4165 4187 4160 4169 4179 4144 4173 4158 4142 4158 4163.5 9

ALLD 3800 3792 3852 3792 3848 3856 3832 3864 3832 3832 3830.0 10

Fig. 4.5. Simulation results of 10 IPD tournaments.


The reason for MS's success is that it has interacted almost optimally with most of the strategies in this IPD tournament.

Most IPD strategies, such as TFT or Pavlov, are memory-one strategies that can only respond to the opponent's last move; however, the past history of the game contains more information. The identification mechanism of MS extracts information about the opponent's strategy, so MS responds not just to the opponent's past moves but to the opponent's strategy. By identifying different opponents, MS makes use of more information than the simple strategies do. This is why MS is able to win IPD tournaments.

Different identification approaches may lead to different results for MS. For example, GRIM, TFT and ALLC all start with COOPERATE and will not defect if their opponents do not. To tell these strategies apart, MS starts with DEFECT and thereby loses the chance to cooperate with GRIM. On the other hand, if MS does not defect first, it cannot distinguish the three strategies and cannot interact with ALLC optimally. The risk involved in probing the opponent must be considered in order to choose an efficient, payoff-maximizing identification approach.

4.5. Escape from the trap of defection

When a player begins to probe its opponent, there is a risk that the identification process will put the player into a much worse position. Some strategies, especially those with a trigger mechanism such as GRIM, change their behavior at the trigger point. For example, the strategy MS described in the previous section defects at the beginning of IPD games in order to distinguish the cooperating strategies ALLC, TFT and GRIM from one another; as a result, the chance to cooperate with GRIM is lost. In IPD games, the main risk of identification is the trap of defection: an identification process that leads the opponent to keep defecting, with nothing that can be done to rescue the situation.

It might appear that a strategy will not run into the trap of defection if it never defects first, but this is not the case. Suppose a strategy keeps playing COOPERATE as long as its opponent defects, and defects forever once its opponent cooperates; then any cooperating strategy will be defected against when interacting with it, while against most defecting strategies it will keep cooperating. If this reverse-GRIM strategy is as likely to appear in a game as GRIM, then cooperating and defecting carry equal risks of provoking future defection. This means that the risk of the defection trap always exists, whether or not an identification mechanism is applied.


One may argue that reverse-GRIM strategies will not appear as frequently as GRIMs in IPDs, so that cooperating is safer than defecting and the MS strategy is more likely to run into the defection trap than TFT. That is true, but it does not show that the defection trap is inevitable for a strategy with an identification mechanism, because many different identification approaches can be applied. For example, a simple way to avoid retaliation from GRIM is not to defect first. The identification mechanism that Adaptive Pavlov used in the 2005 IPD tournament only probed defecting strategies, in order to maintain cooperation with every cooperating strategy.

Again, which identification mechanism should be applied depends on prior knowledge and subjective estimation. If there are enough ALLC strategies in an IPD game, it is worth distinguishing them from other cooperating strategies; but if GRIMs are prevalent, it is better not to defect first. Generally speaking, we can compare different identification approaches and choose the most efficient one, although uncertainty still remains.

4.6. Adaptive Pavlov and Competition 4 of the 2005 IPD tournament

The 2005 IPD tournament comprised 4 competitions. Competition 4 mir-

rored the original competition of Axelrod. There were a total of 50 players

including 8 default strategies. The strategy of Adaptive Pavlov (AP) that

was ranked first in Competition 4 will be analyzed in this section.

The AP strategy groups every 6 consecutive rounds into a period and applies different tactics in different periods. AP behaves as a TFT strategy in the first period, and then changes its strategy according to the identification of its opponent.

AP classifies the possible opponents into 5 categories: cooperating strategies, STFT, PavlovD, ALLD and RAND.c By identifying the opponent's strategy at the end of each period, AP shifts its strategy in the new period in order to deal with each opponent optimally.

AP is never the first to defect, and thus it cooperates with every cooperating strategy. AP tries to cooperate with STFT and PavlovD, and defects against strategies such as ALLD and RAND. The processes of AP interacting with cooperating strategies, ALLD, STFT, and PavlovD in the first 6 rounds are shown in Fig. 4.6 (where AP behaves as TFT).

c. RAND is claimed to be a default strategy.


(a)  Round:    1 2 3 4 5 6          (b)  Round:    1 2 3 4 5 6
     AP:       C C C C C C               AP:       C D D D D D
     Co-op:    C C C C C C               ALLD:     D D D D D D

(c)  Round:    1 2 3 4 5 6          (d)  Round:    1 2 3 4 5 6
     AP:       C D C D C D               AP:       C D D C D D
     STFT:     D C D C D C               PavlovD:  D D C D D C

Fig. 4.6. Identifying the opponent according to the process of interaction in six rounds. (a) AP cooperates with any cooperating strategy. (b) The ALLD strategy always defects. (c) If a strategy alternately plays D and C when interacting with TFT, it is identified as STFT. (d) If a strategy periodically plays D-D-C when interacting with TFT, it is identified as PavlovD.


For example, when a process of interaction such as that shown in Fig. 4.6(c) occurs, the opponent is identified as STFT, and AP will cooperate twice in the next period in order to establish cooperation. If the opponent is determined to be PavlovD, AP will defect once and then always cooperate in the next period. If a process of interaction occurs that differs from those shown in Fig. 4.6, the opponent is identified as RAND; in this way, any strategy that is not included in the identification set is likely to be identified as RAND. Once cooperation has been established, AP always cooperates unless a defection occurs. Identification of the opponent is performed in every period throughout the IPD tournament, in order to correct misidentification and to deal with players who change their strategies during a game.
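A simplified sketch of this period-based identification is given below in Python. It is illustrative only, based on the patterns of Fig. 4.6; the exact rules used by AP in the tournament may differ in detail.

def classify_period(opp_moves):
    """Classify the opponent from one six-round period in which AP played TFT,
    following Fig. 4.6: all C = cooperating, all D = ALLD, alternating D/C = STFT,
    periodic D-D-C = PavlovD; anything else is treated as RAND."""
    assert len(opp_moves) == 6
    if all(m == 'C' for m in opp_moves):
        return 'cooperating'
    if all(m == 'D' for m in opp_moves):
        return 'ALLD'
    if opp_moves == list('DCDCDC'):
        return 'STFT'
    if opp_moves == list('DDCDDC'):
        return 'PavlovD'
    return 'RAND'

def next_period_policy(category):
    """AP's reaction in the following period, as described in the text."""
    return {
        'cooperating': 'keep cooperating (continue to behave like TFT)',
        'STFT':        'cooperate twice, then cooperate as long as the opponent does',
        'PavlovD':     'defect once, then always cooperate',
        'ALLD':        'always defect',
        'RAND':        'always defect',
    }[category]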

As we have mentioned, most of the players will be cooperating strategies. The results confirm that there are 34 cooperating strategies in Competition 4 (including the 4 default strategies TFT, TFTT, GRIM and ALLC). Apart from the default strategies, there are also 3 strategies that behave like ALLD, 5 that behave like STFT, and 2 that behave like NEG. As shown in Table 4.4, AP can identify most of the strategies involved in Competition 4.d

d. AP regards NEG as RAND. It still maximizes its score when interacting with strategies like NEG, because the optimal strategy for interacting with either NEG or RAND is ALLD.

Table 4.4. Categories of the strategies in Competition 4.

Categories Number of the strategies

Cooperating strategies 34

Strategies like STFT 6

Strategies like ALLD 4

Strategies like NEG 3

Strategies like RAND 1

Others 2

4.7. Discussion and conclusion

AP is a type of adaptive automaton for the IPD. However, it differs from other adaptive strategies in how adaptation is achieved; AP's approach belongs squarely to the family of artificial intelligence approaches. Rather than adjusting parameters when computing responses, as most adaptive strategies do, AP uses an identification mechanism that acts as an expert system.


Knowledge about different opponents is expressed in the form of "if ..., then ..." rules; for example, if the opponent cooperates for 6 rounds, then it is determined to be ALLC. In this way, the information that is acquired and used can be expressed transparently, and AP can tell which strategy the opponent is using.

Recent years have seen many AI approaches applied to evolutionary game theory and the IPD, for example reinforcement learning, artificial neural networks, and fuzzy logic [Sandholm and Crites (1996); Macy and Carley (1996); Fort and Perez (2005)]. Computing a best response to an unknown strategy has been one of the objectives of these AI approaches. The problem is, in general, intractable because of its computational complexity, and finding the best response to an arbitrary strategy can even be non-computable [Papadimitriou (1992); Nachbar and Zame (1996)].

Reinforcement learning, which is based on the idea that the tendency to produce an action should be reinforced if it produces favourable results and weakened if it produces unfavourable results [Gilboa (1988); Gilboa and Zemel (1989)], is widely used to let automata learn from interaction with others. With respect to the IPD, several approaches have been developed to learn optimal responses to a deterministic or mixed strategy [Carmel and Markovitch (1998); Darwen and Yao (2002)]. However, computational complexity remains the main difficulty in applying these approaches in real IPD tournaments.

in real IPD tournaments. AP’s identification mechanism is implemented in

a simple way by making use of a priori knowledge, which greatly reduces the

computational complexity and makes it practical for AP to respond to the

opponent almost optimally. First, a priori knowledge about what strategies

are more likely to appear in the IPD tournament is used in determining the

identification set. The size of the identification set is restricted in order to

reduce computational complexity. Second, a priori knowledge about how

well different identification approaches will work in a certain environment

is used in selecting an efficient identification approach, with which AP can

avoid the risk of identification and maximize the payoffs. Third, a priori

knowledge about how to identify the opponent according to the process

of interaction is used in constructing the identification rules. With these

simple rules, the AP strategy is easy to understand.

The identification set can obviously be extended to include more strategies that can be identified; however, more computation will be involved as the size of the identification set increases. We have to make a tradeoff between the wish to identify any strategy and the wish to keep the strategy simple. Compared to the NP-completeness


of those reinforcement learning approaches [Papadimitriou (1992)], AP's computational complexity is between O(√n) and O(n), depending on the similarities among the strategies to be identified. Therefore, the algorithm of AP is suitable for real IPD tournaments.

An identification mechanism can also work in a noisy environment, where each strategy might, with some probability, misperceive the outcome of a round. Noise blurs the boundaries between different strategies. However, identification is still applicable if a small identification error is admitted. In this setting, we can set a threshold such that the opponent is considered identified once the probability of misidentification falls below this value. As in the case of identifying RAND, the probability of mistakenly identifying a strategy decreases towards zero as the process of observation and identification is repeated.
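As a toy illustration of such a threshold test (a sketch only, not the mechanism used by AP): if the opponent's observed moves have matched a deterministic candidate for n rounds, the probability that an unbiased RAND player would have produced those same n moves is (1/2)^n, so the candidate can be accepted once this probability falls below the chosen error threshold.

def accept_candidate(n_matching_rounds, error_threshold=0.01):
    """Accept a deterministic candidate once the chance that RAND (playing C or D
    with probability 1/2) produced the same n moves drops below the threshold."""
    return 0.5 ** n_matching_rounds < error_threshold

# With a 1% threshold, seven matching rounds suffice: 0.5**7 is about 0.0078.
print(accept_candidate(7))   # True
print(accept_candidate(6))   # False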

Information plays a key role in intelligent activities; individuals with more information consequently gain an advantage over others in most circumstances. With an identification mechanism, strategies such as AP acquire information about their opponents, and they are more intelligent than well-known strategies such as TFT or Pavlov. These types of strategies are suitable for modeling the decision-making processes of human beings, in which learning and improvement frequently occur.

References

Carmel, D. and Markovitch, S. (1998). How to explore your opponent's strategy (almost) optimally, in Proceedings of the International Conference on Multi Agent Systems, pp. 64–71.

Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner's dilemma with intermediate levels of cooperation: Application to missile defense, International Journal of Computational Intelligence and Applications 2, 1, pp. 83–107.

Fort, H. and Perez, N. (2005). The fate of spatial dilemmas with different fuzzy measures of success, Journal of Artificial Societies and Social Simulation 8, 3.

Gilboa, I. (1988). The complexity of computing best response automata in repeated games, Journal of Economic Theory 45, pp. 342–352.

Gilboa, I. and Zemel, E. (1989). Nash and correlated equilibria: some complexity considerations, Games and Economic Behavior 1, pp. 80–93.

Macy, M. and Carley, K. (1996). Natural selection and social learning in prisoner's dilemma: co-adaptation with genetic algorithms and artificial neural networks, Sociological Methods and Research 25, 1, pp. 103–137.

Nachbar, J. and Zame, W. (1996). Non-computable strategies and discounted repeated games, Economic Theory 8, pp. 103–122.

Papadimitriou, C. (1992). On players with bounded number of states, Games and Economic Behavior 4, pp. 122–131.

Sandholm, T. and Crites, R. (1996). Multiagent reinforcement learning in the iterated prisoner's dilemma, Biosystems 37, 1-2, pp. 147–166.


Chapter 5

An Immune Adaptive Agent for the Iterated Prisoner’s

Dilemma

Oscar Alonso, Fernando Nino

National University of Colombia

5.1. Introduction

The Prisoner’s Dilemma [Tucker (1950)] is a game in which two players have

to decide between two options: cooperate, doing something that is good for

both players, and defect, doing something that is worse for the other player

but better for oneself. No pre-play communication is permitted between

the players. The dilemma arises because, no matter what the other does, each player does better by defecting than by cooperating; yet if both players defect, both do worse than if both had cooperated [Alonso et al.]. The

payoff obtained by each player is given by a payoff matrix, as shown in

table 5.1. The first number in each cell represents the payoff for the row

player, and the second value represents the payoff for the column player.

Table 5.1. Payoff matrix

        C        D
C     3, 3     0, 5
D     5, 0     1, 1

When the game is played several times between the same players, and

the players are able to remember past interactions, it is called the Iterated

Prisoner’s Dilemma (IPD). Each player is said to have a strategy, i.e., a way

to decide its next move depending on previous interactions. Accordingly,

complex patterns of strategic interactions may emerge, which may lead to

exploitation, retaliation or mutual cooperation.

The Iterated Prisoner’s Dilemma game has attracted the interest of

many researchers in a wide set of fields, including game theorists, social


scientists, economists and computer scientists [Axelrod (1984); Angeline;

Hofstadter (1985); Yao and Darwen (1994)]. From the computational point

of view, there has been a deep interest in the development of effective strate-

gies for the IPD game [Yao and Darwen (1994); Axelrod (1984); Delahaye

and Mathieu (1995)]. Most well-known IPD strategies have been proposed

by humans, specifying the decision rules that a player will follow depend-

ing on the opponent's behaviour [Beaufils et al. (1997); Nowak and Sigmund

(1993)]. Clearly, this has mainly depended on the researcher’s assumptions

about the game. In a first computational approach, Axelrod explored hu-

man designed strategies by confronting them through a tournament [Axel-

rod (1984)].

Conversely, there has also been some interest in obtaining IPD strate-

gies using evolutionary computation, coevolution, reinforcement learning

and other computational techniques, without explicitly specifying the de-

cision rules [Sandholm and Crites (1995); Darwen and Yao (1995)]. These

methods have found good IPD strategies, requiring little or no intervention

from a human. For instance, in Axelrod’s work, human-designed strategies

were compared to strategies obtained through evolution and coevolution

[Axelrod (1984)]. Further research has been done towards finding strate-

gies that generalise well without human intervention. Studies have focused

on coevolutionary approaches, since no human intervention is required in

the evaluation process. For instance, Darwen and Yao [Darwen and Yao

(1996)] proposed a speciation scheme in order to get a modular system that

played the IPD, in which coevolution and fitness sharing were used in order

to get a diverse population that played as a whole against the opponent.

The scheme showed a significant degree of generalisation.

The model proposed in this work falls into this second class of methods. The main goal of this research is to generate an agent that learns to play the IPD game and is able to adapt to the opponent's behaviour. Learning, memory and adaptation capabilities are argued to be desirable in an IPD agent; consequently, the agent is implemented using artificial immune networks, a computational technique inspired by the natural immune system that exhibits such capabilities.

The rest of this chapter is organised as follows. First, some funda-

mentals about artificial immune systems, namely, immune networks are

summarised. Subsequently, a general model for an adaptive agent is intro-

duced. Then, a specific immune-based model of this agent is explained in

detail. An implementation of the immune model was developed and some

experiments were carried out to validate the agent capabilities. The imple-


mented agent showed adaptation and learning; however, in some cases, the

immune agent exhibited a poor performance.

5.2. Immune network fundamentals

Antigens are substances capable of inducing a specific immune response.

They may be viruses, bacteria, fungi, or protozoa. They are invaders

assumed to cause harm in the body. However, an antigen may be harmless,

such as grass pollen [Jonathan (2001)].

On the other hand, antibodies are proteins found in the blood, produced

by specialised white blood cells, called B-cells. B-cells make antibodies

when the body recognises that something foreign (antigen) is present. An-

tibodies are the antigen-binding proteins that are present on the B-cell

membrane. They are also secreted by plasma cells.

The affinity between an antigen and an antibody is given by the com-

plementarity of their binding proteins. If the antigen/antibody affinity is

higher than an affinity threshold, the corresponding B-cell becomes stim-

ulated. In the early stages of the immune response, the affinity between

the antibodies and antigens may be low, but as the B-cells undergo clonal

selection, the binding B-cells mutate and clone again and again to improve

the affinity of the binding between a particular antigen and a B-cell. Then,

the mature and activated B-cells produce plasma cells, which secrete antibodies with a high antigen/antibody binding affinity.

The Immune Network Theory tries to explain the way in which a natural

immune system achieves immunological memory [Perelson and Weisbuch

(1997)]. Jerne [Jerne (1974)] hypothesised that the immune system is a

regulated network of molecules and cells that recognise one another even in

the absence of antigens, rather than being a set of isolated cells that respond

only when stimulated by antigens. Though in immune network theory

the main elements are B-cells, most models only consider the antibodies

attached to the B-cell membranes. Therefore, here only antibodies will be

considered.

The basic idea behind immune network theory is that antibodies are

stimulated not only by antigens, but also by other antibodies, allowing the

generated antibodies to be preserved over time for future encounters with

the same or similar antigens. Therefore, when the same antigen reappears,

the immune response is faster, since the immune system already contains suitable antibodies to deal with such an antigen. This is known as the secondary response, which is depicted in figure 5.1 [Jonathan (2001)].


Fig. 5.1. Secondary Response. The amount of antibodies is greater and the response

time is shorter when the antigen is presented for the second time to the immune system

Even though antibodies stimulate each other, there is also a suppres-

sion relation between them, which controls the size of the network. Thus,

the network structure is a result of the interactions among antibodies. A

graphical representation of an immune network model is shown in figure

5.2.

An Artificial Immune Network (AIN) is a computational model based

on immune network theory. In a broad sense, immune networks are mainly

suitable to solve clustering and classification problems, due to their natural

dynamics by which affine antibodies stimulate each other, thus forming

clusters of antibodies with similar features. Typically, an immune network

is stimulated by a set of antigens, corresponding to input data to a problem,

and the resulting structure of the immune network will give the solution to

the related problem [Castro and Zuben (2000)].

When an antigen is presented to the AIN, the internal dynamics of the

AIN develops antibodies with high affinity to the antigen, through a process

called affinity maturation. This process implies selection of high affinity

antibodies and a mutation process called somatic hypermutation; this is

an evolutionary process that, in a short period of time, evolves antibodies capable of dealing with the presented antigen.

Fig. 5.2. Immune Network Theory

Several computational models for immune networks have been proposed,

which are mainly derived from aiNet, a model used for optimisation and

data clustering proposed by de Castro, and RAIN, a model proposed by

Timmis, also used for data analysis [Castro and Zuben (2000); Castro

(2003)].

In the RAIN model, the resulting set of antibodies exhibits a spatial

distribution that reflects the data concentration in the data space. On the

other hand, the result of the aiNet model does not present this behaviour,

as highly concentrated data are considered redundant and then eliminated.

In the aiNet model the interaction among antibodies leads to network sup-

pression, i.e., antibodies that are affine (close) will suppress each other in

order to control the size of the network and eliminate redundant informa-

tion. Consequently, in this work the aiNet model will be used. Notice that

this model does not consider stimulation among antibodies.

When using an immune network to solve a problem, it is necessary to

specify the following aspects:


(1) Identify the entities of the problem and find the corresponding elements

in an immune network, i.e., antibodies and antigens;

(2) define an appropriate representation of such elements;

(3) define an affinity measure between antigens and antibodies, and among

antibodies themselves; and

(4) establish the algorithms that model the behaviour of the immune net-

work.

5.3. A general adaptive agent model

In this section, a general adaptive agent model to play the IPD game is

proposed. The model is based on trying to figure out the opponent’s strat-

egy, which is further used to determine the next move of the agent. The

information about the recent history of the game is used to model the op-

ponent’s strategy. Accordingly, in order to decide the next move, the IPD

agent will accomplish the following three phases:

(1) Recognition of the opponent’s strategy

(2) Development of a good strategy to face the opponent

(3) Selection of the next move to play

In the first phase, the Agent attempts to guess the strategy the opponent

is playing, based on the recent history of moves from both players. As a

result of this phase, an IPD strategy which resembles the behaviour of the

opponent is obtained, which will be used in the next stage. In the second

phase, the Agent generates a strategy which obtains a good score when it

is faced with the strategy generated in the first phase. Finally, in the third

phase, the Agent uses the strategy obtained in the second phase to decide

its next move.

The adaptive IPD agent consists of a memory, a recognition module,

strategy generation module and a decision module (see figure 5.3), which

are explained next.

• The memory is responsible for storing the recent history of moves played by both the agent and the opponent.
• The recognition module is responsible for recognising the opponent's strategy based on the recent history; it produces a strategy that resembles that of the opponent.

• The strategy generation module is responsible for generating a strategy which obtains a good score when faced with the strategy that resembles the opponent's.
• The decision module is responsible for using the strategy obtained by the strategy generation module in order to decide the next move that the agent will play.

Fig. 5.3. General model of the IPD agent

Though the model may look simple at first, it should be emphasised

that the implementation of each one of the modules is not trivial. The

recognition module should try to infer the strategy that the opponent is

playing, which may be a difficult task. Also, the strategy generation module

should be able to adapt to the changes in the opponent’s strategy.

5.4. Immune agent model

The definition of a particular agent based on the general model presented

above requires the stipulation of each module, as well as the representation

that will be used for the strategies.

The recognition of the opponent and the generation of a good strat-

egy against it require adaptability and learning. Additionally, it would

be desirable to preserve the strategies generated, which implies a memory


mechanism. For these reasons, Artificial Immune Networks are used to im-

plement the recognition and strategy generation modules. The structure of

the general IPD agent can be seen in figure 5.4, and the global IPD decision

making process is described in algorithm 5.1.

Fig. 5.4. Structure of the immune agent

Algorithm 5.1. Decision making algorithm

Decision making

1 while playing

2 do

3 Present history to the recognition AIN

4 Find recognised strategy from the recognition AIN

5 Present recognised strategy to strategy generation AIN

6 Find best payoff strategy from strategy generation AIN

7 Obtain suggested next move from best payoff strategy

8 Play next move


5.4.1. Strategy representation

First, each strategy is represented using a look up table [Axelrod (1984)].

This representation indicates the next move to play, based on the n previous

moves of both players. The representation consists of a vector of moves,

where each position in the vector indicates the next move to be played given

a specific history of the game. Thus, there are 2^(2n) possible histories given a memory of n previous moves. Additionally, since there is no initial history, this representation requires 2n assumed pre-game moves at the beginning of the game. Hence, the total length of the vector of moves is 2^(2n) + 2n, and given that each position of the vector has 2 possible values, cooperate and defect, the number of strategies that can be represented is 2^(2^(2n) + 2n). An example of a look-up table is shown in figure 5.5.

Fig. 5.5. Example of a Look up Table representing the strategy TFT
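A minimal Python sketch of this look-up-table representation is given below. It is illustrative only; details such as the bit ordering of the index and the split of the pre-game moves are assumptions, not the chapter's implementation.

class LookUpTableStrategy:
    """A strategy encoded as a vector of moves indexed by the joint history of the
    last n moves of both players, plus 2n assumed pre-game moves (0 = C, 1 = D)."""

    def __init__(self, n, table, pregame):
        assert len(table) == 2 ** (2 * n) and len(pregame) == 2 * n
        self.n, self.table, self.pregame = n, table, pregame

    def next_move(self, my_hist, opp_hist):
        # Pad short real histories with the assumed pre-game moves.
        my = (self.pregame[:self.n] + my_hist)[-self.n:]
        opp = (self.pregame[self.n:] + opp_hist)[-self.n:]
        index = 0
        for bit in my + opp:               # the ordering of bits is an arbitrary choice here
            index = (index << 1) | bit
        return self.table[index]

# Example: a memory-1 table that behaves like TFT (it replies with the opponent's
# last move), with pre-game moves chosen so that the first real move is COOPERATE.
tft_like = LookUpTableStrategy(n=1, table=[0, 1, 0, 1], pregame=[0, 0])
print(tft_like.next_move([], []))      # 0 -> COOPERATE on the first move
print(tft_like.next_move([0], [1]))    # opponent defected last time -> 1 (DEFECT)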

5.4.2. Memory

The memory of the agent is represented by 2 vectors, containing the last k

moves played by the agent and the opponent.

5.4.3. Recognition module

An antibody of the Recognition AIN is represented by an IPD strategy.

The Recognition AIN will receive as an antigen the history of recent moves

of both the opponent and the agent itself.

As the agent should obtain a strategy similar to the opponent's, the antibodies are stimulated according to their similarity to the opponent. This similarity is measured by presenting the moves played by the agent to each candidate strategy and comparing its responses with those of the opponent; the measure is given by the Hamming distance between the move sequences of the strategy and of the opponent.

Additionally, the AIN model requires a measure of stimulation between

antibodies. Such a measure is given by the similarity between the strate-

gies. The similarity between two strategies is measured indirectly as fol-

lows: both strategies play against a randomly generated sequence of moves.

Then, the moves of the strategies are compared using the Hamming dis-

tance and the percentage of coincidences determines the similarity of the

strategies. The interaction between antibodies leads to suppression, i.e.,

similar strategies suppress each other.
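These two measures might be sketched as follows (illustrative Python; a strategy is assumed to be a callable strategy(my_hist, opp_hist) returning 0 for cooperate and 1 for defect, as in the look-up-table sketch above):

import random

def response_similarity(strategy, agent_moves, opponent_moves):
    """Antigen/antibody affinity: feed the agent's past moves to the candidate strategy
    and count how often its replies coincide with what the opponent actually played
    (a small Hamming distance corresponds to a high affinity)."""
    cand_hist, agent_hist, matches = [], [], 0
    for agent_move, opp_move in zip(agent_moves, opponent_moves):
        matches += (strategy(cand_hist, agent_hist) == opp_move)
        cand_hist.append(opp_move)
        agent_hist.append(agent_move)
    return matches / len(agent_moves)

def strategy_similarity(s1, s2, length=20, seed=None):
    """Antibody/antibody affinity: both strategies answer the same randomly generated
    move sequence; the fraction of coinciding replies is their similarity."""
    rng = random.Random(seed)
    h1, h2, opp, matches = [], [], [], 0
    for _ in range(length):
        move = rng.randint(0, 1)
        r1, r2 = s1(h1, opp), s2(h2, opp)
        matches += (r1 == r2)
        h1.append(r1)
        h2.append(r2)
        opp.append(move)
    return matches / length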

After presenting the history of recent moves to the Recognition AIN,

the most stimulated antibody is taken, because it represents the strategy

that best resembles the opponent.

A summary of the representation of the elements in the recognition AIN

is given in table 5.2.

Table 5.2. Recognition immune network representation

Immune network              Representation
Antigen                     History of moves
Antibody                    IPD strategy
Antibody/Antigen affinity   Similarity between the strategy and the opponent's
Antibody/Antibody affinity  Similarity between the strategies

In addition, the process of affinity maturation requires the strategies

to be mutated. Particularly, strategies will be mutated in two fashions.

The first one consists of changing the number of previous interactions re-

membered by the strategy (memory length), and the second one consists

of mutating each position of the vector that defines the strategy according

to the mutation rate.

The process of changing the memory length is performed as follows: the

new length is selected randomly between one and the maximum allowed

memory length. If the new length of the strategy is same as the old one,

nothing has to be done. If it is longer, the new positions of the vector are

filled in such a way that the strategy presents the same decision rules as

before. This operation is shown in figure 5.6.

Fig. 5.6. Example of mutation when new LuT memory length is larger

If the new history length is shorter than before, the process is done as follows: notice that there are four histories which are different in only the

last move. Thus, removing the last move will cause those four histories to be

compressed into one. Therefore, the corresponding value of the compressed

history will be the value that has the majority in the corresponding histories

of the original vector. If there is a tie, it is resolved as Defect. This operation

is shown in figure 5.7.
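The shortening operation can be sketched abstractly as below (illustrative Python; here a strategy is viewed as a dict from history tuples of (my, opponent) move pairs to a move, which sidesteps the vector-indexing details, and the choice of which remembered round is dropped is an assumption about the encoding).

from collections import Counter

def shorten_memory(table):
    """Drop one remembered round: the four long histories that agree on everything
    except that round collapse into one short history, whose move is decided by
    majority vote; ties are resolved as Defect (1)."""
    groups = {}
    for history, move in table.items():
        short_history = history[1:]            # forget one (my, opp) pair
        groups.setdefault(short_history, []).append(move)
    shortened = {}
    for short_history, moves in groups.items():
        counts = Counter(moves)
        shortened[short_history] = 1 if counts[1] >= counts[0] else 0
    return shortened

# Example with memory length 2; the four entries collapse into one, and the
# 2-2 tie between Cooperate (0) and Defect (1) is resolved as Defect.
table = {((0, 0), (0, 1)): 1, ((0, 1), (0, 1)): 1,
         ((1, 0), (0, 1)): 0, ((1, 1), (0, 1)): 0}
print(shorten_memory(table))   # {((0, 1),): 1}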

5.4.3.1. Immune network model

In the aiNet model, all the antigens are known a priori and they are pre-

sented to the network many times until the structure of the network adapts

to the antigen set. In contrast, in the proposed IPD agent the opponents are not known a priori, and the agent has to adapt to the opponents as they appear. Accordingly, to deal with this problem, a slightly modified

version of the aiNet algorithm will be used.

The main modification of the aiNet algorithm is introduced in the mech-

anism used by the network to add antibodies to the memory. An antibody

interacts with the antibodies that have been already memorised. If the sup-

pression it receives from memorised antibodies is less than the suppression

threshold, it is added to the memory and will never be removed. Notice

that if an antibody is suppressed by the memorised ones, it means that

an antibody capable of recognising such antigen is already present in the

memory. Thus, in order to avoid redundancy, this new antibody is not added to the memory.

Fig. 5.7. Example of mutation when new LuT memory length is shorter

When a new opponent starts playing a game, there is not yet enough information to consider that the recognised antibodies correspond to the opponent; therefore, adding antibodies at the very beginning of the game is not a good idea. Additionally, since the agent confronts the same opponent for many moves, it is not necessary to add antibodies to the memory after every move, given that the history of moves does not change significantly with a single new move. Thus, in this situation, it is more efficient to add the antibodies that have been generated periodically, every k moves.

The modified version of the aiNet algorithm is summarised in algorithm

5.2.

Algorithm 5.2. Modified aiNet algorithm


Modified aiNet

1 for each antigen

2 do

3 Add new random antibodies to the network

4 Calculate antigen/antibody affinity

5 Select the n antibodies with highest affinity

6 Clone and hypermutate selected antibodies

7 Re-calculate antigen/antibody affinity

8 Re-select a percentage of highest affinity antibodies

9 Remove low affinity antibodies

10 Calculate suppression among antibodies

11 Remove highly suppressed antibodies

12 Add resultant antibodies to the memory

In the algorithm, the affinity (suppression) of the antibodies is normalised to the interval [0,1]. It is then considered low (high) in relation to an affinity (suppression) threshold, which is a parameter of the algorithm. Additionally, in the hypermutation process, the mutation rate is inversely related to the affinity with the antigen; in particular, it is defined as 1 − affinity. This means that high-affinity antibodies are mutated less than low-affinity antibodies, which helps keep good antibodies while exploring new regions of the search space. When a new antigen is presented, the network dynamics develops antibodies with high affinity (similarity) to it.
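For instance, the affinity-dependent hypermutation of a cloned look-up table could be sketched like this (illustrative Python; the per-position flip corresponds to the second mutation fashion described in Sec. 5.4.3):

import random

def hypermutate(table, normalised_affinity, rng=random):
    """Flip each position of the strategy vector with probability (1 - affinity),
    so that high-affinity clones change the least."""
    rate = 1.0 - normalised_affinity
    return [1 - move if rng.random() < rate else move for move in table]

clone = hypermutate([0, 1, 0, 1], normalised_affinity=0.9)   # roughly 10% of positions flipped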

5.4.4. Strategy generation module

For this module, the antibodies are also represented as game strategies.

In this case, the strategy obtained in phase one is presented as an

antigen for the second AIN. As the agent is interested in obtaining a good

strategy against the one obtained in the first phase, the antibodies are

stimulated according to the result of a short IPD game between the antigen

and each antibody, beginning from the current history of the game between

the agent and the opponent. The affinity between antibodies, the mutation

operator and the immune network algorithm are defined in the same way

as in the recognition AIN.

Therefore, the most stimulated antibody corresponds to the best strat-

egy against the one that resembles the opponent, and is selected as the

output of this phase.
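The antigen/antibody affinity in this second network can be sketched as the payoff of a short game (illustrative Python, using the payoffs of Table 5.1 and the strategy(my_hist, opp_hist) convention from the earlier sketches):

PAYOFF = {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1}   # 0 = cooperate, 1 = defect

def short_game_affinity(antibody, antigen, my_hist, opp_hist, rounds=20):
    """Affinity of a candidate reply strategy (antibody) against the recognised
    opponent model (antigen): the antibody's payoff in a short game continued
    from the current history of the real match."""
    h_ab, h_ag = list(my_hist), list(opp_hist)
    payoff = 0
    for _ in range(rounds):
        move_ab = antibody(h_ab, h_ag)
        move_ag = antigen(h_ag, h_ab)
        payoff += PAYOFF[(move_ab, move_ag)]
        h_ab.append(move_ab)
        h_ag.append(move_ag)
    return payoff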


A summary of the representation in the strategy generation AIN is

shown in table 5.3.

Table 5.3. Strategy generation immune network representation

Immune network Representation

Antigen IPD strategy

Antibody IPD strategy

Antibody/Antigen affinity Payoff of a short IPD game

Antibody/Antibody affinity Similarity between the strategies

5.4.5. Decision module

Once a good strategy against the opponent has been found, it is used to

look up the next move that the agent will play, given the recent history of

the game.

5.5. Experimental results

Some experiments were carried out in order to explore the capabilities of

the proposed agent. All the experiments used a payoff matrix where Temptation = 5, Punishment = 1, Sucker's Payoff = 0 and Reward = 4.

The values of the parameters of an immune network affect some aspects

of it, such as the number of antibodies of the network and the performance

of the affinity maturation process. After testing several values for the pa-

rameters, the following were found to provide a good behaviour to the agent:

the suppression threshold was 0.8, and the affinity threshold was 0.9; the

number of stimulated antibodies that were selected in each iteration was 5,

and the percentage of stimulated antibodies that were selected after being

cloned and hypermutated was 20%. In each iteration of the immune net-

works, four new random antibodies were added to the network. In clonal

selection, the minimum number of clones that a stimulated antibody could generate was 5, and the maximum was 10. New antibodies were

added to the memory of each network every 20 moves.

In the recognition process, the length of the history of moves was 10,

and the maximum length of the memory of the lookup table representation

was set to 3 previous moves.

The experiments were designed to answer some key questions about the

agent’s capabilities, which are addressed in the following subsections.


5.5.1. Can the agent adapt to a new opponent?

In order to test the adaptability of the immune agent when confronting one

opponent, it was faced with opponents playing the well-known strategies TFT,

ALLD, Pavlov and GRIM. The length of the game was 100 moves, and there

were 100 repetitions for each opponent. The average score obtained in this

experiment is shown in figure 5.8.

[Figure 5.8: four panels a)–d) plotting average payoff against move number (0–100) for the agent against TFT, ALLD, Pavlov and GRIM respectively; each panel compares the agent's average payoff with the optimal payoff.]

Fig. 5.8. Adaptability tests. Optimal is obtained from mutual cooperation in a), c) and d), and mutual defection in b).

As can be seen, the agent adapts its behaviour to that of the opponent, which leads to an increase in the mean payoff over the first 20 moves, after which it stabilises.

5.5.2. Can the agent adapt to consecutive opponents?

The agent was confronted with two opponents one after the other, in order

to evaluate the adaptability of the agent to further opponents (i.e. not

only the first opponent it confronts). The results for consecutive opponents

playing ALLD-TFT and PAVLOV-GRIM can be seen in figure 5.9.


[Figure 5.9: two plots of average payoff against move number (0–200) for the consecutive opponent sequences ALLD-TFT and Pavlov-GRIM; each plot compares the agent's average payoff with the optimal payoff.]

Fig. 5.9. Tests of adaptation to consecutive opponents.

Experimental results showed that the immune agent adapts to every new opponent it confronts. Moreover, the curves described by the mean


payoff are very similar to those found in the first experiment, which shows that the agent preserves its adaptability through multiple games.

5.5.3. Can the agent remember previous opponents?

Since immune networks possess a memory mechanism, this was evaluated

in the agent. In this setup, the agent first confronts an opponent, then it

is faced with another opponent and once again it is confronted by the first

opponent. In this case two experiments were performed, the first confronted

TFT-ALLD-TFT, and the second one confronted Pavlov-GRIM-Pavlov-

GRIM. Also 100 repetitions of the experiment were carried out and the

length of every game was 100 moves. The average payoff can be seen in

figure 5.10.

The results showed that the average payoff curves stabilised faster the second time the agent faced an opponent, as a result of the memory capability of the agent. However, the mean value at which the payoff stabilises did not increase.

5.5.4. Results from the IPD competition

The agent proposed in this chapter participated in the IPD competitions

held at CEC 2004 and CIG 2005, under the name of 'Immune Based Agent'. It competed twice in the first competition and was ranked 126

out of 223 and 160 out of 223. In the second competition, it participated

in the category # 4 (one entry per participant), and it was ranked 40 out

of 50.

5.6. Discussion

Experimental results show that the proposed agent presents the expected

behaviour: it adapts its behaviour to the opponents it confronts in order

to increase its payoff, and is also able to remember its interactions with

opponents in order to recognise them faster in future encounters. However,

the following was observed in the agent's behaviour:

• The payoff stabilises at a mean value which is less than the best possible payoff.

• For some opponents such as GRIM, the performance of the agent is

very poor: it obtains a payoff much lower than the best possible one.


[Figure 5.10: two plots of average payoff against move number for the opponent sequences TFT-ALLD-TFT (0–300 moves) and Pavlov-GRIM-Pavlov-GRIM (0–400 moves); each plot compares the agent's average payoff with the optimal payoff.]

Fig. 5.10. Test of memory of previously met opponents.

Some explanations of the agent's behaviour could be hypothesised as

follows:


• The recognition module finds a strategy similar to the opponent’s,

which is good enough for most cases. However, a history of moves

could correspond to several opponent strategies, which makes it very difficult for the recognition module to find the exact strategy that the

opponent is playing. For instance, a history where all previous moves

are COOPERATE could correspond to players playing TFT or ALLC,

and the best response is different in every case.

• Since the recognition process is imperfect, the strategy found by the

strategy generation module may not be the most appropriate, and could lead the agent to make bad decisions. This produces a non-optimal payoff and, with some strongly retaliatory opponents, it may lead to

mutual defection and low payoffs. This explains why the immune agent

does not obtain a good payoff confronting strategies such as GRIM,

since it tries to take advantage of the opponent and, consequently, it

receives a strong retaliation from GRIM.

• The model does not implement a feedback mechanism which may help to determine how good the selected strategy is. Notice that the agent knows the history of moves, but it does not analyse whether the strategy it is currently using is good or bad in order to change it if the strategy is performing badly.

The experiments also show that since the agent does not reach the best

possible payoff, it is slightly exploited by very uncooperative strategies,

such as ALLD.

An analysis of the performance of the strategy during the competition

showed that the agent frequently evolved to mutual defection with oppo-

nents that were not fully uncooperative, such as go by majority. There were

also some cases where the agent was exploited by some opponents, probably due to the perception limitations of the agent discussed above. As a

consequence, the agent performed poorly in the competition.

5.7. Conclusions

This work presented an agent model that played the IPD game. The model

is based on artificial immune systems in order to achieve adaptability, learn-

ing and memory.

Some experiments were carried out in order to evaluate the behaviour

of the proposed agent. The results showed that the agent presents the

expected capabilities: it adapted its own behaviour to suit the opponent’s


one, going through a learning process which produced an increase of the

mean payoff until it reached a stable value. Additionally, the learning

process was faster when the agent met the opponent for the second time,

which evidenced a memory mechanism.

However, although the mean payoff increased and stabilised due to the

learning process, it did not reach the optimum value. Additionally, for some

strategies such as GRIM, the agent did not even obtain a payoff close to the

best possible. This shows that although the agent performs as expected, it

still needs to be tuned in order to avoid poor performance in some special

cases.

Particularly, the proposed model could be modified by using different

computational techniques, such as evolutionary algorithms, to implement

some of the modules. It may also be extended to include multiple levels

of cooperation and multiple opponents. Additionally, the agent could be

endowed with a feedback mechanism, such as reinforcement learning.

References

Alonso, O., Nino, F. and Velez, M. (2004). A robust immune based approach

to the iterated prisoner’s dilemma, in Proceedings of the 3rd International

Conference on Artificial Immune Systems, pp. 290–301.

Angeline, P. J. (1994). An alternate interpretation of the iterated prisoner’s

dilemma and the evolution of non-mutual cooperation, in Proceedings 4th

Artificial Life Conference, pp. 353–358.

Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York, USA).

Beaufils, B., Delahaye, J.-P. and Mathieu, P. (1997). Our meeting with gradual:

A good strategy for the iterated prisoner’s dilemma, in Artificial Life V

(Proceedings of the Fifth Int’l Workshop on the Synthesis and Simulation

of Living Systems) (MIT Press), pp. 202–209.

Castro, L. N. D. (2003). The immune response of an artificial immune network

(ainet), in Congress on Evolutionary Computation (CEC’03) (Canberra),

pp. 146–153.

Castro, L. N. D. and Zuben, F. J. V. (2000). An evolutionary immune network

for data clustering, in IEEE Brazilian Symposium on Artificial Neural Net-

works (Rio de Janeiro), pp. 84–89.

Darwen, P. and Yao, X. (1995). On evolving robust strategies for iterated pris-

oner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in

Artificial Intelligence, Vol. 956, pp. 276–292.

Darwen, P. and Yao, X. (1996). Automatic modularization by speciation, in Proc.

of the 1996 IEEE Int’l Conf. on Evolutionary Computation (ICEC’96)

(IEEE Press, Nagoya, Japan), pp. 88–93.

Delahaye, J.-P. and Mathieu, P. (1995). Complex strategies in the iterated pris-

oner’s dilemma, in A. Albert (ed.), Chaos and Society, Frontiers in Arti-


ficial Intelligence and Applications, Vol. 29 (IOS Press, Amsterdam), pp.

283–292.

Hofstadter, D. R. (1985). The prisoner’s dilemma computer tournaments and the

evolution of cooperation, in Metamagical Themas: Questing for the essence

of mind and pattern (Basic Books, New York).

Jerne, N. K. (1974). Towards a network theory of the immune system, Ann.

Immunol. 125, pp. 373–389.

Jonathan, T. (2001). Artificial Immune Systems: A novel data analysis technique

inspired by the immune network theory, Ph.D. thesis, University of Wales,

Aberystwyth, Wales.

Nowak, M. A. and Sigmund, K. (1993). A strategy of win-stay lose-shift that

outperforms tit-for-tat in the prisoner’s dilemma game, Nature 364, pp.

56–58.

Perelson, A. S. and Weisbuch, R. (1997). Immunology for physicists, Rev. Modern

Physics 69, pp. 1219–1267.

Sandholm, T. and Crites, R. (1995). Multiagent reinforcement learning in the

iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s

Dilemma 37, pp. 147–166.

Tucker, A. W. (1950). A two person dilemma.

Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated

prisoner’s dilemma games, Informatica 18, pp. 435–450.


Chapter 6

Exponential Smoothed Tit-for-Tat

Michael Filzmoser

University of Vienna

Reciprocating strategies, such as Tit-for-Tat, have been shown to be very successful in IPD tournaments without noise, while other tournaments and analytical studies show that they perform rather poorly in noisy environments. The implementation of generosity or contrition into reciprocating strategies was proposed as a solution for this poor performance. We propose a third possibility, a mitigation of the provocability property of reciprocating strategies, which we design by exponential smoothing. This chapter explores how exponential smoothing and Tit-for-Tat can be combined in 'Exponential Smoothed Tit-for-Tat' strategies for the Iterated Prisoners' Dilemma and how these strategies perform, compared to Tit-for-Tat, in competitions with and without noise.

6.1. Introduction

Robert Axelrod (1980a,b, 1984) was the first to perform computer tour-

naments of the Iterated Prisoners’ Dilemma (IPD). In these tournaments

strategies played the Prisoners’ Dilemma repeatedly with additional infor-

mation about the history of their own moves as well as of the moves of

the opponent strategy. In two tournaments with 14 and 62 entries respec-

tively the winner both times was Tit-for-Tat (TFT), submitted by Anatol

Rapoport, the simplest of all participating strategies. TFT starts with coop-

eration and afterwards mirrors the opponent’s move of the previous round.

Niceness and provocability were identified to be important properties of

successful strategies in the IPD and both are embodied in TFT. Niceness

in this context denotes that a strategy never should be the first to defect,

while provocability denotes that an ’uncalled for’ defection of the opponent

should be punished by a defection immediately (Axelrod, 1980b).


In recent years the original IPD has been extended by the integration

of ’noise’. Noise in the context of IPD can either denote measurement er-

rors — a strategy receives incorrect information that its opponent defected

while it actually cooperated and vice versa — or implementation errors —

a strategy which is intended to cooperate in a given situation erroneously

defects and vice versa (Bendor, 1993).a Axelrod and Wu (1995) state that

noise is an important feature of real-world interaction as errors in the imple-

mentation of choice can never be completely excluded. It has been shown

analytically (Molander, 1985; Bendor, 1993) as well as by further IPD tour-

naments which incorporated noise (Donninger, 1986; Bendor et al., 1991)

that the existence of noise undermines the performance of reciprocating

strategies like TFT dramatically. Bendor, Kramer and Stout (1991) argue

that a main reason for the poor performance of TFT in noisy environments

is the unintended involvement in vendettas of mutual or alternating defec-

tion with other nice and provocable strategies, which can be caused by one

single implementation error on either side.

aWe focus exclusively on implementation errors, as this was the category of noise implemented in the IPD tournament of G. Kendall, P. Darwen, and X. Yao performed in April 2005, on which this study is based (see http://www.prisoners-dilemma.com).

For coping with noise Axelrod and Wu (1995) propose to make recipro-

cating strategies more generous or more contrite. Generosity denotes that

some of the opponent’s defections are not punished as they could be the

result of noise. Such generosity of course can be exploited easily, but prevents an echoing of a single error throughout the whole game and

therefore can maintain mutual cooperation among reciprocating strategies.

Contrition on the other hand means that a defection as a reaction to a

defection of the opponent in the last round, which in turn occurred as an

answer to one’s own implementation error in the round before last, should

be avoided.

While generosity can be conceived as a correction of the opponent’s im-

plementation errors, contrition can be interpreted as the correction of one’s

own implementation errors in a noisy environment. However, both of these further developments of reciprocating strategies for noisy environments are one-sided insofar as they focus on correcting either their own or the opponent's implementation errors only; neither of these concepts attempts to correct both kinds.

Moreover the mitigation of the provocability property of reciprocating

strategies takes place in a rather indiscriminate way by an increase of gen-

erosity or the implementation of contrition. In an effort to improve the



performance of reciprocating strategies if they play against other recipro-

cating strategies one must not neglect the existence of non-reciprocating

strategies. Such strategies could capitalize on the combination of noise and

generosity by infrequent but intentional defections. Moreover if the history

of the opponent’s moves consists of a series of defections, a single coopera-

tion, which could be an implementation error of the opponent, should not

induce a reciprocating strategy to switch from defection to cooperation. In

such a case of continuous defection of the opponent an increase of generosity

will only reduce the performance of a reciprocating strategy.

We share the opinion that a mitigation of the provocability property is

essential to overcome the comparatively poor performance of reciprocating

strategies like TFT in IPD tournaments with noise. To do so we propose a

third alternative beside generosity and contrition. We hold the view that

the whole history of the opponent’s moves as well as the misperceptions

should be taken into consideration by a reciprocating strategy in the deci-

sion to cooperate or defect. Generous or contrite reciprocating strategies as

proposed by Axelrod and Wu (1995) only take into account the last move

of the opponent and use some additional modification rules to adapt to the

situation of noise. The analysis of the entire series of moves of the opponent

should allow filtering out reactions to our own implementation errors as well

as the opponent’s implementation errors, which in turn should improve the

performance of a so-designed reciprocating strategy.

In section 6.2 we present exponential smoothing which we suggest as a

method to implement the concept of considering the whole history of moves

in the decision making process of reciprocating IPD strategies. Further-

more ’Exponential Smoothed Tit-for-Tat’ (ESTFT) strategies are developed.

Section 6.3 reports on the performance of the ESTFT strategies in an IPD

tournament in competitions with and without noise, and in comparison to

TFT. Section 6.4 summarizes the main results and concludes.

6.2. Exponential Smoothed Tit-for-Tat

The intention of ESTFT is to incorporate the two properties of TFT, niceness

and provocability, which have been demonstrated to be important ingredients of

successful strategies in the IPD without noise, and mitigate provocability to

adjust to the existence of noise. To do so ESTFT uses exponential smooth-

ing. Exponential smoothing was used by Tzafestas (2000) as the basis for the development of his meta-regulated adaptive TFT (a strategy that drops the cooperation rate when the opponent is perceived as cooperative and increases


it otherwise) and by Ashlock et al. (1996) for memory weighting in a study

on partner selection for the IPD. However exponential smoothing has not

yet been applied to cope with the problem of noise in the IPD, which is

the focus of this study. In the next two subsections, first the concept of

exponential smoothing will be briefly presented and afterwards applied for

the design of exponential smoothed Tit-for-Tat strategies for competitions

with and without noise.

6.2.1. Exponential Smoothing

Exponential smoothing was originally a time series analysis approach, which

can be used for the analysis of time series that exhibit neither trend nor

seasonal components. It allows for weighting past – possibly not so im-

portant – observations differently than the recent ones. From the original

time series X_t, the exponential smoothed time series S_t can be calculated by (6.1). For the calculations it is necessary to indicate a starting value S_0, as for the first period no observations of the original time series exist.

S_t = \begin{cases} S_0 & \text{if } t = 0, \\ (1-\alpha)\, S_{t-1} + \alpha\, X_{t-1} & \text{otherwise} \end{cases} \qquad (6.1)

In (6.1) α is the smoothing parameter that indicates the weight assigned to the last observation.b The higher α, the lower the smoothing of the time series, so for α = 1 exponential smoothing reproduces the original time series, while for α = 0 the smoothed time series is a constant, S_t = S_0.

bThe notion 'smoothing parameter' is somewhat misleading, as a higher value for this parameter leads to a stronger consideration of currently observed values and therefore results in a less smoothed time series.

Exponential smoothing can be customized for the design of reciprocat-

ing or simple deterministic IPD strategies. We conceive the series of the

transformations of the opponent's moves m_t as the observations that are to be smoothed, where the opponent's moves are transformed into discrete numbers by applying (6.2).

m_t = \begin{cases} 1 & \text{if the opponent's move in } t \text{ is 'cooperate'} \\ 0 & \text{if the opponent's move in } t \text{ is 'defect'} \end{cases} \qquad (6.2)

With the adapted exponential smoothing formula (6.3), different kinds of simple deterministic strategies can be designed that either defect if S_t = 0 or cooperate if S_t = 1. The parameter combination S_0 = 1 and α = 1 exactly equals the TFT strategy (Tzafestas, 2000); with α = 0 and S_0 = 1 (respectively S_0 = 0) a constant series of cooperations (respectively defections), and therefore an ALLC (respectively ALLD) strategy, can be modelled. Many other combinations of the two variables, starting value S_0 and smoothing parameter α, are possible, which allows modelling a large number of strategies.

S_t = \begin{cases} S_0 & \text{if } t = 0 \\ (1-\alpha)\, S_{t-1} + \alpha\, m_{t-1} & \text{otherwise} \end{cases} \qquad (6.3)

We refer to the internal register St

as the ’mood’ of the strategy. This

mood is a continuous variable ranging from 0 — in case of total defection

of the opponent — to 1 — for total cooperation of the opponent. Inter-

mediate values between these two extremes represent different degrees of

cooperation (closer to 1) and defection (closer to 0). In the spirit of TFT the

next own move (either cooperate or defect) is derived from this mood by a

threshold rule (see section 6.2.2). Furthermore we need an initial mood I —

an expectation about the opponent’s behavior — to calculate the bounds

on the smoothing parameter α for the ESTFT strategies designed for the

competition with noise (see section 6.2.2). We derive I from the optimistic

assumption that the opponent strategy is cooperative or reciprocating and

therefore will cooperate when it plays against ESTFT strategies, except for the expected 10% of implementation errors due to noise (i.e. I = 0.9).
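A minimal Python sketch of the mood register defined by (6.2) and (6.3); the names are ours, and the snippet only illustrates the recurrence (the threshold rule itself is introduced in section 6.2.2). The special parameter combinations mentioned above can be checked directly: with S_0 = 1 and α = 1 the mood simply mirrors the opponent's last move.

```python
def transform(opponent_move):
    """Map the opponent's move to the observation m_t of (6.2)."""
    return 1.0 if opponent_move == 'C' else 0.0

def update_mood(mood, opponent_move, alpha):
    """One step of the recurrence (6.3): S_t = (1 - alpha) * S_{t-1} + alpha * m_{t-1}."""
    return (1.0 - alpha) * mood + alpha * transform(opponent_move)

# With S_0 = 1 and alpha = 1 the mood copies the opponent's last move (TFT-like);
# with alpha = 0 the mood stays at S_0, which gives ALLC (S_0 = 1) or ALLD (S_0 = 0).
mood = 1.0
for move in ['C', 'D', 'D', 'C']:
    mood = update_mood(mood, move, alpha=1.0)
    print(mood)    # prints 1.0, 0.0, 0.0, 1.0
```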

6.2.2. Strategies for Competitions with and without Noise

The ESTFT strategies were actually a two-parameter family of IPD strategies, where the two decision parameters are i) the smoothing parameter α

and ii) the threshold rule determining for which values of S_t the strategy should cooperate or defect. For all ESTFT strategies the threshold rule for cooperation and defection in round t is determined as follows: for S_t ≥ 0.5 ESTFT cooperates, otherwise it defects (see (6.4)).

\text{move in } t = \begin{cases} \text{'cooperate'} & \text{if } 0.5 \le S_t \le 1 \\ \text{'defect'} & \text{if } 0 \le S_t < 0.5 \end{cases} \qquad (6.4)

In defining one threshold rule for all ESTFT strategies, the only variable parameter of these strategies is the α-value (the starting value S_0 can be derived from α). For the ESTFT strategies


designed for the competition with noise we demand two additional char-

acteristics to cope with the problem of noise in the IPD: i) they should never defect in response to a single defection of the opponent, as this single defection could be an implementation error by the opponent or a reaction to an implementation error of the ESTFT strategy itself, and ii) they should react with defection in response to two consecutive defections of the opponent, to avoid exploitation by the opponent. These additional requirements restrict the range of possible values for the smoothing parameter α. The possible range according to restrictions i) and ii) is calculated in (6.5) and (6.6) respectively, for an initial mood of I = 0.9.

Restriction i): ESTFT should cooperate after a single defection

S_t = (1-\alpha)\, I + \alpha\, m_{t-1} \ge 0.5 \quad \text{for } m_{t-1} = 0 \text{ and } I = 0.9 \;\Rightarrow\; \alpha \le 1 - \frac{0.5}{0.9} = 0.44444 \qquad (6.5)

Restriction ii): ESTFT should defect after two consecutive defections

S_{t-1} = (1-\alpha)\, I + \alpha\, m_{t-2}, \quad S_t = (1-\alpha)\, S_{t-1} + \alpha\, m_{t-1} < 0.5 \quad \text{for } m_{t-2} = m_{t-1} = 0 \text{ and } I = 0.9 \;\Rightarrow\; \alpha > 1 - \sqrt{\frac{0.5}{0.9}} = 0.25464 \qquad (6.6)

From (6.5) we derive that the ESTFT strategy will not defect after a single

defection (mt−1 = 0) of assumed cooperative or reciprocating strategies

(I = 0.9) when α ≤ 0.44444 as for this value of the smoothing parameter St

remains above the threshold for defection of 0.5. An α > 0.25464 guarantees

that for two consecutive defections of the opponent (m_{t-2} = m_{t-1} = 0), S_t lies below the threshold value and the ESTFT strategy therefore defects (6.6).
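The restrictions can also be checked numerically. The sketch below computes the bounds on α implied by (6.5) and (6.6) for an arbitrary initial mood I, and implements the threshold rule (6.4); the function names are ours and this is only an illustration of the derivation.

```python
import math

def alpha_bounds(initial_mood=0.9, threshold=0.5):
    """Bounds on alpha from restriction i) (upper) and restriction ii) (lower)."""
    upper = 1.0 - threshold / initial_mood              # (6.5): tolerate a single defection
    lower = 1.0 - math.sqrt(threshold / initial_mood)   # (6.6): punish two consecutive defections
    return lower, upper

def move(mood, threshold=0.5):
    """Threshold rule (6.4): cooperate whenever the mood is at or above 0.5."""
    return 'C' if mood >= threshold else 'D'

lower, upper = alpha_bounds()
print(round(lower, 5), round(upper, 5))   # 0.25464 0.44444, matching (6.6) and (6.5)
```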

Three ESTFT strategies lowESTFT_noise, mediumESTFT_noise, and

highESTFT_noise were designed for the competition with noise using α-

values that represent the upper bound, the lower bound, and the average

between these extremes (α ≈ 0.34) respectively. As large numbers for the

smoothing parameter α lead to a higher weighting of the current observations and to a less smoothed value, the border induced by (6.5) is applied in the lowESTFT_noise strategy, the border induced by (6.6) in the highESTFT_noise strategy, and the average between these borders in the mediumESTFT_noise strategy. For the competition without noise, neither of the two restrictions


mentioned above is necessary as the true move of the opponent can be ob-

served with certainty. We determine smoothing parameters of α = 0.2 for the highESTFT_classic, α = 0.35 for the mediumESTFT_classic, and α = 0.5 for the lowESTFT_classic strategy respectively, and start with cooperation. Due to the decision rule mentioned above, values above 0.5

would not change the result and are therefore omitted. The results these

strategies achieved compared to TFT in competitions with and without noise

are summarized in the next section.

6.3. Tournament Results

The IPD computer tournament organized by Graham Kendall, Paul Darwen, and Xin Yao in April 2005 offered an excellent possibility to test the ESTFT strategies and to compare them to TFT in situations with and without noise. In addition to the classical competition without noise (competition 1) – a re-run of Robert Axelrod's original tournaments – a competition with a 10% chance of noise in the form of implementation error (competition 2) was conducted. The Java applet that was used to run the tournament, as well as the entries, was based on the Java IPDLX software library; in addition, simple deterministic strategies with a history of at most three rounds could be entered via a web interface. In each of the competitions five runs were performed, and each of these runs lasted 200 rounds.

For each competition we calculate the average of the payoffs over all five runs that the ESTFT strategies and TFT reached when playing against a specific opponent. By using the average we can filter out random effects induced by noise or by the strategies themselves (e.g. RAND). Furthermore, we consider only the payoffs against the 141 strategies that are represented in both the classical competition and the competition with noise. This establishes a common basis for analysis that allows us to perform paired tests on the difference between the average payoffs against each of these 141 reference strategies. Figure 6.1 presents box-whisker diagrams for the six ESTFT strategies and TFT (the corresponding data can be taken from Table 6.1).

First we apply a non-parametric paired Wilcoxon test to test the dif-

ference in the payoffs of the ESTFT strategies and TFT between the com-

petition without noise and the competition with noise. The alternative

hypothesis that payoffs are higher in the competition without noise than

in the competition with noise can be accepted for all seven reciprocating


strategies. The results are highly significant, as can be seen from Table 6.2.d

[Figure 6.1: box-whisker plot of the payoff (averaged over the five runs) against the smoothing parameter α = 0.20, 0.26, 0.34, 0.35, 0.44, 0.5 and 1 (TFT).]

Fig. 6.1. Box-whisker plot of the average payoff of ESTFT and TFT strategies for the competition with noise.

dIn Tables 6.2 and 6.3 the column α represents the specific value of the smoothing parameter for this strategy, µ the mean of the payoff averaged over the five runs per competition, ± the standard deviation of payoffs, V the test statistic and p the significance of the non-parametric paired Wilcoxon test.

Next we test the difference in the payoff between the ESTFT strate-

gies and TFT in the competitions without and with noise. Again we use a

non-parametric paired Wilcoxon test. The alternative hypothesis that the

payoffs of the focal ESTFT strategy are greater than the payoffs of TFT can

be accepted only for the lowESTFT_classic and mediumESTFT_classic

(p < 0.05).
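The paired comparisons reported here can be reproduced along the following lines, assuming the per-opponent average payoffs are available as arrays. This is only a sketch using SciPy's signed-rank test, shown with dummy data; it is not the script used for the chapter's analysis.

```python
import numpy as np
from scipy.stats import wilcoxon   # paired, non-parametric signed-rank test

# Dummy data standing in for the per-opponent average payoffs (over five runs)
# of one strategy; in the chapter these arrays would have 141 entries each.
payoff_without_noise = np.array([467.0, 480.5, 455.2, 470.1, 462.3])
payoff_with_noise = np.array([401.2, 410.7, 395.4, 404.0, 399.8])

# H1: payoffs are higher in the competition without noise (one-sided paired test).
stat, p = wilcoxon(payoff_without_noise, payoff_with_noise, alternative="greater")
print(f"V = {stat}, p = {p:.4f}")
```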

From Table 6.2 we see that the tournament results reproduce what has been argued analytically and shown in previous IPD tournaments (Molander, 1985; Donninger, 1986; Bendor et al., 1991; Bendor, 1993): reciprocating strategies like TFT or ESTFT are less successful in noisy environments than in environments without noise.


Table 6.1. Data for the box-whisker plot of the average payoff of ESTFT and TFT strategies in the competition with noise

  strategy               min     1. quartile   median   3. quartile   max
  highESTFT_classic      232.6   255.4         432.8    500.2         613.4
  highESTFT_noise        242.0   268.8         435.4    515.6         594.2
  mediumESTFT_noise      241.0   260.2         430.6    514.0         585.6
  mediumESTFT_classic    243.0   270.6         430.4    505.0         593.2
  lowESTFT_noise         240.0   264.2         435.4    513.0         597.6
  lowESTFT_classic       245.8   300.6         428.0    512.2         586.4
  TFT                    223.4   269.0         430.6    502.6         614.2

Table 6.2. Performance of the ESTFT and TFT strategies in environments with and without noise

                                  without noise       with noise
  strategy               α        µ        ±          µ        ±          V          p
  highESTFT_classic      0.20     467.22   181.14     397.73   123.93     7,817.0    < 0.0001
  highESTFT_noise        0.26     470.97   180.20     399.73   117.57     8,131.0    < 0.0001
  mediumESTFT_noise      0.34     470.43   180.45     399.45   120.43     8,115.5    < 0.0001
  mediumESTFT_classic    0.35     468.47   179.45     404.26   114.16     7,733.5    < 0.0001
  lowESTFT_noise         0.44     469.43   181.72     401.94   120.24     8,000.5    < 0.0001
  lowESTFT_classic       0.50     469.76   179.64     408.43   109.95     7,734.0    < 0.0001
  TFT                    1.00     467.49   181.06     400.39   121.78     7,615.0    < 0.0001

Table 6.3. Comparison of the TFT and ESTFT strategies in environments with and without noise

                                  without noise       with noise
  strategy               α        V        p          V          p
  highESTFT_classic      0.20     3.0      0.6054     4,779.0    0.6798
  highESTFT_noise        0.26     449.0    0.9980     5,342.5    0.2443
  mediumESTFT_noise      0.34     366.0    0.9998     4,962.0    0.5361
  mediumESTFT_classic    0.35     500.5    0.6830     6,071.0    0.0142
  lowESTFT_noise         0.44     282.0    1.0000     5,308.5    0.1758
  lowESTFT_classic       0.50     628.5    0.2342     5,961.0    0.0247

The comparison of the performance of the ESTFT strategies with that of TFT for the competition without and the competition with noise, summarized in Table 6.3, shows two noteworthy results. First, while there are no significant differences in performance between ESTFT strategies and TFT in the case of no noise, the ESTFT strategies lowESTFT_classic and mediumESTFT_classic are significantly better (p < 0.05) in the presence of noise. Moreover, the three ESTFT strategies designed for the competition with noise (lowESTFT_noise,


mediumESTFT_noise, and highESTFT_noise) were, though still less successful than TFT, able to reduce the distance to TFT in the competition with noise compared to the one without.

Above we mentioned that the α-value is the only parameter that varies

across the six ESTFT strategies. In the competition without noise the average performance of all ESTFT strategies except highESTFT_classic – the strategy with the lowest α-value – exceeded the performance of TFT; however, these results are not significant according to the non-parametric paired Wilcoxon tests (see Table 6.3). From Table 6.2 one can see that in the competition with noise the three ESTFT strategies with the higher α-values (mediumESTFT_classic, lowESTFT_noise, and lowESTFT_classic) reach higher average payoffs than TFT, while the three ESTFT strategies with lower α-values (highESTFT_classic, highESTFT_noise, and mediumESTFT_noise)

reach lower average payoffs. Generally the performance of the ESTFT strate-

gies in the competition with noise increases with the smoothing parameter.

That two ESTFT strategies designed for the classical competition with-

out noise outperformed TFT in the competition with noise while the ESTFT

strategies designed for the competitions with noise did rather poorly, does

not necessarily contradict the statements made above. We stated that an

unbalanced mitigation of the provocability property of reciprocating strate-

gies or too much generosity is insufficient to improve the performance of

reciprocating strategies in the IPD with noise. A one-sided reduction of provocability, which just focuses on not punishing some of the opponent's defections as they could be the direct or indirect result of implementation errors, neglects the possibility of implementation errors in combination with the opponent's cooperation, while too much generosity could cause exploitation.

On the one hand the ESTFT strategies for the competition with noise were

more generous as they only defect when two consecutive defections of the

opponent occur or the smoothed value declines below a limit for coopera-

tion. On the other hand, highESTFT_classic probably smoothed too much, which reduced its performance. The two strategies that outperformed TFT used higher values for the smoothing parameter, which leads to a higher weighting of the currently observed opponent's moves and reduces the smoothing effect.

6.4. Conclusions

Based on the shortfalls of existing approaches that attempt to improve the

poor performance of reciprocating strategies for the IPD with noise, we


suggest that exponential smoothing is an approach that allows a balanced mitigation of the provocability property of reciprocating strategies.

By exponential smoothing the whole series of the opponent’s moves rather

than only the previous move of the opponent can be taken into considera-

tion in the decision of cooperation or defection. Six ESTFT strategies were

designed and participated in an IPD tournament in competitions with and

without noise.

The results of the tournament show that in noisy environments the per-

formance of ESTFT strategies increases with the smoothing parameter and

that low exponential smoothing improves the performance of reciprocating

strategies. While exponential smoothing improves the ability of TFT to deal

with noise in the IPD, it still does not deal with it very well. Moreover the

results indicate that our design concept for determining smoothing parameters for strategies for the competition with noise seems to be inadequate,

as in the competition with noise strategies designed for the competition

without noise outperformed strategies designed especially for this environ-

ment. The optimistic assumptions concerning the initial mood of the ESTFT

strategies and the simplistic restrictions for the smoothing parameter α for

strategies for the competition with noise may be the cause of this weaker

than expected performance. While we found a seemingly promising way to

improve the performance of reciprocating strategies for noisy environments,

obviously further research in this direction is necessary. Moreover, we used TFT — as the most important representative of reciprocating strategies — as a benchmark; clearly, ESTFT strategies have to be compared to other (reciprocating) strategies as well.

References

Ashlock, D., Smucker, M. D., Stanley, E. A. and Tesfatsion, L. (1996). Pref-

erential partner selection in an evolutionary study of prisoner’s dilemma,

BioSystems 37, pp. 99–125.

Axelrod, R. (1980a). Effective choice in the prisoner’s dilemma, Journal of Con-

flict Resolution 24, 2, pp. 3–25.

Axelrod, R. (1980b). More effective choice in the prisoner’s dilemma, Journal of

Conflict Resolution 24, 3, pp. 379–403.

Axelrod, R. (1984). Genetic algorithms and simulated annealing, chap. The evo-

lution of strategies in the iterated prisoner’s dilemma (Pitman, London),

pp. 32–41.

Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner’s

dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.


Bendor, J. (1993). Uncertainty and the evolution of cooperation, Journal of Con-

flict Resolution 37, 4, pp. 709–734.

Bendor, J., Kramer, R. M. and Stout, S. (1991). When in doubt... Cooperation

in a noisy prisoner’s dilemma, Journal of Conflict Resolution 35, 4, pp.

691–719.

Donninger, C. (1986). Paradoxical effects of social behavior. Essays in honor of

Anatol Rapoport, chap. Is it always efficient to be nice? A computer simu-

lation of Axelrod’s computer tournament (Physica, Heidelberg).

Molander, P. (1985). The optimal level of generosity in a selfish, uncertain envi-

ronment, Journal of Conflict Resolution 29, 4, pp. 611–618.

Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of

the simulation of Adaptive behavior conference (Paris).


Chapter 7

Opponent Modelling, Evolution, and the Iterated

Prisoner’s Dilemma

Philip Hingston1, Dan Dyer2, Luigi Barone2, Tim French2,

Graham Kendall3

Edith Cowan University1 , The University of Western Australia2 ,

The University of Nottingham3

In this chapter, we report on a series of studies exploring the interplay

between evolution and intelligence. The evolutionary setting is a population

of agents playing the Iterated Prisoner's Dilemma, a setting which provides a choice between cooperative and selfish behaviour in interactions between

agents. Intelligence is represented using opponent modelling agents. Our

studies show that, while opponent modellers can survive in such a setting,

an evolving population of less intelligent agents can limit their success. We

also report on the performance of our opponent modelling agent, which

competed in the CIG’05 IPD competition.

7.1. Introduction

IPD has served as a model for cooperation between self-interested individ-

uals for 40 years. Sometimes, these individuals are taken to be animals,

sometimes humans, and sometimes some other kind of agency, such as a

corporation or a nation. A useful way to categorise studies based on the

IPD model is by what is assumed about the cognitive abilities of the players.

On an increasing scale of rationality, well-studied assumptions include

• That populations of players can evolve good strategies. This is

the traditional evolutionary computation approach.

• That the players can learn good strategies. This is the traditional

machine learning approach.

• That the players are perfectly rational. This is the traditional

mathematical game theory approach.


But there is another point on this scale, somewhere between the last

two points, that has, surprisingly, been largely neglected — the assumption

that players adapt their play based on a learned model of their

opponents’ play. In this chapter, we will argue the merits of opponent

modelling as a realistic approach to the study of IPD, and present some

results of experiments designed to explore this approach.

The expanded, four-point scale corresponds roughly to some theories

concerning stages in the evolution of intelligence. A recent and controver-

sial example is the Machiavellian intelligence hypothesis – “that apes and

humans have evolved special cognitive adaptations for predicting and ma-

nipulating the behaviour of other individuals” [Miller (1997), p313]. There

are various stronger or weaker interpretations of this hypothesis. A strong

version postulates a “theory of mind”, a module that attributes beliefs and

desires to others in order to better predict their behaviour. In other words,

in our terms, apes and humans use opponent modelling.

Researchers disagree about whether or not, and to what degree, vari-

ous primates have such a module. Everyone seems to agree that humans

do, but there is evidence supporting both sides of the argument regarding

distinctions between strepsirrhine primates (lemurs and lorises), haplorhine

primates (the rest), or between monkeys, great apes and ancient and mod-

ern humans. A well known example is the ability of great apes to recognize

themselves in mirrors, whereas monkeys cannot [Parker et al. (1994), as cited in Miller (1997)].

If IPD is to teach us about human behaviour, then it makes sense to

model intelligence at the correct level. To evolve agents that play pre-

determined, fixed strategies seems appropriate for studies of animals with

low levels of intelligence. Game theorists might argue that corporations or

nations are best modelled as perfectly rational, though many popular com-

mentators would disagree. For humans, neither of these seems realistic –

humans do not ignore what experience teaches. Rote learning of strategies,

via some mechanism such as reinforcement learning, or learning by imita-

tion, may be sufficient to explain animal behaviours, and some aspects of

human behaviour, but even our great ape cousins are known to go beyond

this. Thus, we believe, to realistically model strategies employed by hu-

mans, we must include learning to predict our opponent’s behaviour, and

applying our reasoning abilities to devise a plan based on our predictions.

Just how sophisticated the prediction and reasoning method needs to be is

another question, but some kind of opponent modelling is called for.


In the field of multi-agent systems, our approach would be called model-

based learning, as distinct from model-free learning. We see many of the

acknowledged advantages and disadvantages of model-based learning in this

study. Many variations and subtleties of approaches to problems of learning

in multi-agent systems have been studied (see, e.g. [Markovitch and Reger

(2005)] for one such variation, and a nice overview). It is not our aim

in this study to survey this field. Likewise, there are many aspects of

intelligence that we do not concern ourselves with, including the question

of what intelligence actually is! One of our reviewers pointed out that

our “four point scale” could have many other intermediate points on it,

depending on the level of sophistication of the modeller. For example,

should the modeller assume that the other players are also modellers and

reason about them on that level (we have chosen to answer “no”)? Again,

should the modeller model only individual players, or the population of

players in its environment (we have chosen the former)? We neglect these

questions not because we see them as uninteresting – far from it – in fact

it is because there are so many aspects, so many possibilities, that we must

make our one set of choices and stick with them. From our point of view,

the key requirement is that the players must be opponent modellers of some

sort, and that we want to learn about what happens when such players are

subjected to the forces of evolution.

In the following pages, we discuss the advantages of opponent modelling,

and the problems that a successful opponent modeller must solve. With this

background, we then describe our opponent modelling entry for the IPD

competition held at CIG’05, the 2005 IEEE Computational Intelligence in

Games conference. This entry was adapted from an earlier IPD opponent

modeller used to study the role of intelligence in the evolution of cooperation

[Hingston and Kendall (2004)]. We revisit this work, and follow up with a

report on some new experiments carried out recently to better understand

our earlier findings.

7.1.1. Opponent modelling

Opponent modelling is the term used to describe the process of constructing

some form of representation (called the model) for an opponent’s strategy,

typically in order to exploit inherent weaknesses in their play. It is worth

pointing out here that we take the view that all game players are ultimately

self-interested. Even in games where cooperation is possible, players only


cooperate because it is to their advantage to do so. This is not so much

a value judgement on our part – since we are going to be dealing with

evolution, those who are not self-interested (or at least those whose genes

are not self-interested!) will cease to be relevant.

Consider then a simple example, the two-player game of rock, paper,

scissors (also known as Roshambo). In this game, each player selects one of

the three options, rock, paper, or scissors, ensuring their selection is hidden

from the other player. After both players have made their selection, players

reveal their choices and the winner is determined as follows: rock defeats

scissors, scissors defeats paper, and paper defeats rock. Should both players

select the same option, the game is deemed a draw.

Simple analysis shows that if an opponent truly selects randomly, the

best a player can do is to also choose randomly, thus assuring the overall

expectation is neutral (each player winning one third of games, each player

losing one third of games, with the remaining third of games drawn). How-

ever, if the opponent is not selecting randomly (or truly randomly), a player

can potentially do better than this neutral expectation by “guessing” (or in

artificial intelligence speak, “predicting”) which option the opponent will

select next. Using this prediction, the player can then choose the option

(the counter-strategy) that ensures victory in the game (choosing rock if

the prediction suggests the opponent will select paper, choosing scissors if

the prediction suggests the opponent will select paper, and choosing paper

if the prediction suggests the opponent will select rock). This is the do-

main of opponent modelling – building a model, typically from observation

or experience, of the next most likely action (move) of the opponent.
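As a concrete illustration of this idea — not part of the agent described later in this chapter — a very simple opponent modeller for Roshambo can predict the opponent's next move from observed frequencies and then play the counter-move; all names below are ours.

```python
from collections import Counter
import random

# The move that defeats each possible opponent move.
BEATS = {'rock': 'paper', 'paper': 'scissors', 'scissors': 'rock'}

def predict(opponent_history):
    """A crude model: predict the opponent's most frequent move so far."""
    if not opponent_history:
        return random.choice(list(BEATS))
    return Counter(opponent_history).most_common(1)[0][0]

def counter_move(opponent_history):
    """Play the option that defeats the predicted opponent move."""
    return BEATS[predict(opponent_history)]

# Against an always-rock opponent, the modeller quickly locks onto 'paper'.
history = []
for _ in range(5):
    my_move = counter_move(history)
    history.append('rock')        # the opponent's observed move
print(my_move)                    # 'paper'
```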

Note that building a model of the opponent’s next most likely action is

equivalent to building a model of the opponent’s strategy directly since a

player’s strategy directly determines the next move of the player. Once a

model of an opponent’s strategy is determined, the model can be analysed

(or deconstructed) to identify weaknesses in the opponent’s play. From this

analysis, a counter-strategy that best “improves” the player’s position in the

game (typically by exploiting any identified weaknesses in the opponent’s

strategy) can then be determined and executed, allowing the player to

maximise personal gain in the game.

All other things being equal, the overall success of a player employing

opponent modelling (an opponent modeller) depends on the accuracy of its

prediction of the opponent’s next action. For example, imagine an opponent

that always selects rock as their hidden choice in the fore-mentioned game

of Roshambo. Obviously, the optimal counter-strategy is to select paper,


thus ensuring a win against the opponent’s rock selection. An opponent

modeller that is able to correctly deduce this strategy weakness is then

able to ensure victory in all games against this opponent.

While this type of obvious strategy flaw is unlikely, experience shows

most players of games contain some form of strategy weakness, especially

in games containing many different game states. For example, standard

5-card poker has over 2,500,000 ways of forming a hand; 7-card vari-

ants have over 133 million ways. Factoring in the complications due to

betting, considerations of the different playing abilities and styles, and

the large number of situations a player must respond to, every poker

player is likely to contain some (and probably many) predictabilities in

their strategy (if they didn’t, playing the game would be pointless —

ignoring short-term variance, all players would end up level in the long

run).

Even in simple games like Roshambo, players often contain subtle weak-

nesses in their play that can be exploited by an opponent modeller (world

championships in the game pit players’ abilities to determine and exploit

these weaknesses). For example, some players may always choose rock af-

ter winning the previous game with paper. Other players may never select

any single option four times in a row. Both of these examples demonstrate

non-random choices by an opponent and hence can be exploited by an op-

ponent modeller capable of deciphering predictable patterns in opponent

behaviour. Indeed, any non-random choice in the selection of the hidden

option may be exploited. If the game is played often enough, subtle weak-

nesses in strategy may well give the advantage to the opponent modeller

in the long run. Artificial intelligence research into opponent modelling is

interested in just this – finding subtle flaws in an opponent’s strategy in

order to maximise personal gain in the game.

While it seems opponent modelling is an obviously good way of identify-

ing strategy weaknesses, care must be taken to ensure against over-reliance

on the inferred model of the opponent's strategy. Two immediate problems can arise: the inferred model may be incomplete or, even worse, incorrect for certain scenarios; and even if the model is correct at some moment in time, an opponent may dynamically modify their strategy over time, invalidating the model.

or incomplete models affect an opponent modeller’s capability to identify

weaknesses in an opponent’s strategy and hence determine the next best

action to select. The second problem motivates the need for adaptation –

the opponent modeller must constantly re-assess its inferences and resulting


counter-strategies in order to stay abreast of the strategy employed by the

opponent.

For example, consider the always-select-rock strategy flaw discussed ear-

lier for the game of Roshambo. Obviously, an opponent modeller capable of

correctly inferring this strategy weakness will have no problem exploiting

the weakness to ensure victory against this opponent. However, with such

an obvious weakness in strategy, the opponent is likely soon to realise their

flaw and try another strategy instead. The opponent modeller must now

adapt their counter-strategy in order to exploit the new strategy, otherwise

they run the risk of becoming predictable and may well be exploited them-

selves (recall that in Roshambo, a player must select randomly, otherwise

an opponent may be able to predict their next action). The opponent may

be “setting-up” the opponent modeller with a false model in order to exploit

the opponent modeller later on with rapid successive changes in strategy

(the hunter becoming the hunted).

The other major problem for an opponent modeller is striking a balance

between exploring unknown regions of an opponent’s strategy to discover

new information (and new weaknesses) and using existing information to ex-

ploit weaknesses in the strategy. A trade-off occurs: insufficient exploration

may prevent the opponent modeller from finding better counter-strategies

that yield higher returns, but exploration is costly since it is a distraction

from the primary task of exploiting the opponent by using the information

the player already has. Exploring new counter-strategies may mean sacri-

ficing short-term performance (the player may need to accept short-term

losses), and in the worse case, may even lead to inescapable sub-graphs of

the opponent’s strategy that yield sub-optimal returns in the long run (for

example, exploring the strategy of defecting against a grim-like player in

IPD – see later, in 7.1.4).
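One standard way to strike such a balance — offered purely as an illustration, not as the mechanism used by the agents in this chapter — is an ε-greedy rule that usually exploits the best known counter-move but occasionally explores.

```python
import random

def epsilon_greedy(expected_payoff, epsilon=0.1):
    """Exploit the move with the highest estimated payoff most of the time,
    but explore a uniformly random move with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(expected_payoff))        # explore
    return max(expected_payoff, key=expected_payoff.get)   # exploit

# Hypothetical long-run payoff estimates per move against the current opponent.
estimates = {'C': 3.1, 'D': 2.4}
print(epsilon_greedy(estimates))   # usually 'C', occasionally a random move
```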

Opponent modelling is not only useful in games, but also in other situa-

tions involving responding to opponent actions. Examples include evolving

cooperative behaviour, stock market prediction, negotiation and diplomacy,

and military strategy planning. These types of problems can benefit from

opponent modelling — building a model of the behaviour of the opponent

in order to exploit strategy weaknesses and to respond “well” to opponent

actions. Indeed, most environments containing adversarial situations can

benefit from opponent modelling — that is, exploitation of opponent weak-

nesses in order to maximise personal gain in the game. The question is to

what extent.


IPD is a game often touted as a model of human behaviour.

Due to the iterated nature of the game, players may choose to take their

opponent’s previous actions into account when deciding how to act in sub-

sequent rounds of the game. This opens up the possibility of a player being

predictable, and hence the possibility of exploiting the player’s predictabil-

ities. This is the thesis of our work — that the use of opponent modelling

to construct a model of an opponent’s strategy can offer an advantage to

a player in the IPD game. Using this approach, we aim to construct au-

tomated computer players capable of exploiting observable strategy weak-

nesses in opponent’s strategies in IPD. This means that we need some way

to automatically construct a model of the opponent’s strategy, some way of

automatically analysing the constructed model to determine weaknesses in

the strategy, and some way of automatically determining the best counter-

strategy to counter-act the inferred strategy of the opponent. In general,

all three of these tasks may indeed be difficult. In the next section, we

detail how we did these things in the context of an IPD competition.

7.1.2. Modeller, the competition entry

In this section, we describe Modeller, the strategy that we entered into the

IPD competition held in conjunction with CIG’05, the 2005 IEEE Com-

putational Intelligence in Games conference. It is a modified version of an

opponent modelling agent described in a paper presented at CEC’04, the

IEEE Congress on Evolutionary Computation in Seattle in 2004, [Hingston

and Kendall (2004)]. The focus of that work was the interplay of evolution

and learning, which was explored by simulating co-evolving populations

of IPD playing agents using fixed strategies with agents using opponent

modelling. It is discussed further in the next section, Opponent modelling

versus evolution. We made some minor changes to the opponent modelling

agent, for the purposes of the competition, but the precise details of the

implementation are less important than the overall spirit of the opponent

modelling approach.

7.1.3. Anatomy of the modeller

Modeller plays tit-for-tat for a fixed number (50) of moves. (Recall that

tit-for-tat cooperates on the first move, and copies the opponent’s previous

move from then on.) During that time, it builds up a predictive model

of the opponent. After the fixed number of moves, it uses the model to


calculate expected future payoffs for each possible move, depending on the

game position, choosing the move with the highest expected future payoff.

In the case of ties, it chooses randomly between the moves with the highest

expected future payoff.

The opponent model used is a 1st order lookup table. It is assumed

that the opponent’s probability of cooperation on a given move is deter-

mined by what happened on the previous move, e.g. both cooperated, or

we cooperated but our opponent defected, etc. This assumption was prob-

ably incorrect for most of the strategies entered in the competition, but

we hoped that it would be approximately true, or at least true enough to

obtain good average scores.

We could employ more complicated models. For example, the oppo-

nent’s probability of cooperation could depend on the previous two moves,

or even more generally, could be described by a probabilistic finite state

automaton. We opted for the simplest choice that demonstrates the op-

ponent modelling approach. In any case, we reasoned, more complicated

models might not be warranted for the competition, because they have

more parameters to estimate, requiring more time to learn. However, if the

expected game length was very long, and our opponents were sophisticated,

we conjecture that using more complicated models would produce a more

capable strategy.

The hypothetical probabilities that determine a 1st order model are

estimated by counting how many times the opponent cooperated or defected

after each possible previous move. These counters are initialized with values

that are consistent with the opponent playing tit-for-tat, that is, we used

tit-for-tat as an a priori model. This seemed like a good choice for the

competition, as we expected that many opponents would play variants of tit-

for-tat. The counters are used to compute an estimate of the probability of

the opponent cooperating as the ratio of the cooperation counter to the sum

of the cooperation and defection counters. We continue to update the model

by incrementing the counters during subsequent play. More sophisticated

updating could also be used, for example, weighting evidence by recency,

to respond faster to opponents with dynamic strategies, as was done in

Hingston and Kendall (2004). Since our aim was to see how well opponent

modelling would do in the competition environment, rather than to compare

and tweak implementation details, we again decided to opt for simplicity.
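To make this concrete, a minimal sketch of such a counter-based 1st order model is given below. This is not the competition code: the class name, the parameter names, and the choice of a prior weight of one observation per game position are our own illustrative assumptions.

    # A minimal sketch of a counter-based 1st order opponent model with a
    # tit-for-tat prior: the opponent is assumed to cooperate after we
    # cooperate and defect after we defect, whatever its own last move was.
    class FirstOrderModel:
        STATES = ('cc', 'cd', 'dc', 'dd')  # our move first, opponent's move second

        def __init__(self, prior_weight=1.0):
            # counts[state] = [times opponent cooperated, times opponent defected]
            self.counts = {s: [0.0, 0.0] for s in self.STATES}
            for state in self.STATES:
                if state[0] == 'c':
                    self.counts[state][0] += prior_weight  # TFT prior: cooperate after we cooperate
                else:
                    self.counts[state][1] += prior_weight  # TFT prior: defect after we defect
            self.first_counts = [prior_weight, 0.0]        # TFT cooperates on the first move

        def update(self, state, opponent_cooperated):
            """Record the opponent's observed move following the given game position."""
            self.counts[state][0 if opponent_cooperated else 1] += 1.0

        def p_cooperate(self, state=None):
            """Estimated cooperation probability: the cooperation counter divided by
            the sum of the cooperation and defection counters."""
            coop, defect = self.first_counts if state is None else self.counts[state]
            return coop / (coop + defect)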

Assuming that the opponent model is correct, and given knowledge of the

probability of the game continuing to another move, we can calculate the

expected future payoff for any move in any game position.

Fig. 7.1. Game tree for the first few moves of a game of IPD.

According to

the competition rules, this probability was constant at 1-0.00346, giving an

expected game length of 200 moves.

To see how expected payoffs can be calculated, consider the initial seg-

ment of an IPD game tree shown in figure 7.1. Starting at the bottom,

there is one branch for our choice to cooperate (labeled C) and one for

our choice to defect (labeled D). Following the C branch, there is then one

branch for our opponent’s choice to cooperate (labeled C0) and one for his

choice to defect (labeled 1-C0). These labels represent the probability that

our opponent will cooperate, or respectively, defect, on the first move of the

game. Following the “cooperate” branch, we reach a node that represents

the game position in which both players cooperated on the previous move

(labeled cc). There is then a branch representing our next move (C or D),

and then our opponent’s next move, where the label C1(cc) represents the

probability that the opponent will cooperate when both players cooper-

ated on the previous move. Likewise, the label on the rightmost branch,

1 − C1(dd) represents the probability that the opponent will defect when

both players defected on the previous move. The topmost nodes are shown

with labels like cccd, representing a game position where both players coop-

erated two moves ago, and we cooperated but our opponent defected on the

previous move. Since we are assuming that our opponent (and therefore we

also) only consider the previous move, when deciding on his next play, this

node might as well be labeled cd, and be identified with the other nodes

labeled cd. Thus the infinite game tree collapses to become a finite graph

(not drawn).

Thus, to determine a counter-strategy, we need only decide on our choice

at the start of the game, and at each of the game positions cc, cd, dc and


dd. Thus, there are 2^5 = 32 possible counter-strategies to consider. We can

choose between these by calculating their expected payoffs.

Let V (cc) be the value of the game at position cc, by which we mean the

expected discounted future payoff starting from this position. Define the

value of the game for the other positions similarly. Let δ be the probability

of continuing the game, P be the penalty for mutual defection, R be the

reward for mutual cooperation, T be the temptation to defect, and S be

the sucker payoff. If we choose to cooperate at position cc, then the ex-

pected future payoff, V(cc), is equal to the probability that the opponent

cooperates (that is C1(cc)) times future payoff given that he cooperates,

plus the probability that the opponent defects (that is 1 − C1(cc)) times

the future payoff given that he defects. The future payoff given that he

cooperates is equal to the immediate payoff for both cooperating (R), plus

the expected future payoff after that (V(cc)) times the probability that

the game continues (δ). The future payoff given that he defects is equal

to the immediate payoff for us cooperating while he defects (S), plus the

expected future payoff after that (V(cd)) times the probability that the

game continues (δ). Putting that all together:

V(cc) = C1(cc) × (R + δ × V(cc)) + (1 − C1(cc)) × (S + δ × V(cd)).

Similarly, if we choose to defect, then:

V(cc) = C1(cc) × (T + δ × V(dc)) + (1 − C1(cc)) × (P + δ × V(dd)).

Analogous equations hold for the other positions, giving a system of equations that can be solved for the values V(·). Finally, the value of the game at the start of the game is either

V = C0 × (R + δ × V(cc)) + (1 − C0) × (S + δ × V(cd)), or

V = C0 × (T + δ × V(dc)) + (1 − C0) × (P + δ × V(dd))

depending on whether we choose to cooperate or defect on the first move.

The best counter-strategy is that set of 5 choices which maximizes the value

of V(·) for the current game position. After playing a move selected by this

method, and observing the opponent’s move, the model is updated and the

calculation above must be repeated to choose our next move. This tech-

nique can be extended to calculate a best response against any finite-order

stochastic strategy, or indeed against any strategy defined by a probabilistic

finite state automaton.
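As an illustration, a minimal sketch of this calculation follows. It is not the authors' code: it assumes the usual Axelrod payoffs T = 5, R = 3, P = 1, S = 0, and the helper names are ours. For a fixed pure counter-strategy (one choice per game position) the equations above become a 4 × 4 linear system; the choice on the very first move is handled analogously using C0 and is omitted here for brevity.

    from itertools import product
    import numpy as np

    STATES = ('cc', 'cd', 'dc', 'dd')                                       # our move first
    PAYOFF = {('c', 'c'): 3, ('c', 'd'): 0, ('d', 'c'): 5, ('d', 'd'): 1}   # R, S, T, P

    def plan_values(plan, p_coop, delta):
        """Solve the value equations for a fixed plan mapping each state to 'c' or 'd',
        given p_coop[state] = modelled probability that the opponent cooperates there."""
        idx = {s: i for i, s in enumerate(STATES)}
        A = np.eye(4)
        b = np.zeros(4)
        for s in STATES:
            my_move, p = plan[s], p_coop[s]
            for opp_move, prob in (('c', p), ('d', 1.0 - p)):
                next_state = my_move + opp_move
                A[idx[s], idx[next_state]] -= prob * delta   # move the delta*V(next) term left
                b[idx[s]] += prob * PAYOFF[(my_move, opp_move)]
        return dict(zip(STATES, np.linalg.solve(A, b)))

    def best_move(current_state, p_coop, delta=1 - 0.00346):
        """Enumerate the pure counter-strategies and return the move, at the current
        game position, of the one with the highest expected discounted future payoff.
        (Ties are broken arbitrarily here; the competition entry broke them at random.)"""
        best_value, best_choice = float('-inf'), 'c'
        for choices in product('cd', repeat=4):
            plan = dict(zip(STATES, choices))
            value = plan_values(plan, p_coop, delta)[current_state]
            if value > best_value:
                best_value, best_choice = value, plan[current_state]
        return best_choice

Against the learned model, p_coop would hold the four estimated cooperation probabilities from the counters described earlier, and best_move would be re-run after every observed move, as described above.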

This implementation of an opponent modeller can likely be improved

(in terms of achieving a higher score in IPD competitions), by more careful


choices of the target class of opponent models, exploration/exploitation

balance, and updating method. We intend to test this claim in future IPD

competitions! However, the good performance of this simple implementa-

tion in the CIG’05 competition is evidence that opponent modelling is a

viable strategy for IPD.

7.1.4. Competition performance

While our main target was Competition 4, we also entered Modeller in

Competitions 1 and 2. Competition 4 was a faithful reproduction of Ax-

elrod’s original conception [Axelrod (1984)] Modeller performed very well

in this competition, placing 3rd out of 50 entries in four runs out of five,

and 5th on the remaining run. In all cases, it was just over 3% behind the

winning entry, and 1% ahead of the next best entry.

Despite this good performance, a detailed examination of individual

games reveals some weaknesses in our implementation. One thorny issue

that we side-stepped by using tit-for-tat as an a priori model, is the “cost”

of learning. In order to develop an accurate model of an opponent, one

would like to “explore” — that is, to sample the opponent’s moves in every

possible game position, many times. As discussed earlier, there are two

barriers to this. One is that the opponent’s play may be such that some

game positions are never reached. The second, and more troublesome, is

that such exploration does not come for free: if we deliberately play a

certain move to see what our opponent will do, our decision will affect

the payoff that we receive for this move, and possibly for future moves

too. Playing against the grim strategy is an extreme example. A grim

player cooperates on the first move of a game, and continues to cooperate

as long as the opponent does, but if ever the opponent defects, then grim

continues to defect forever more. In Hingston and Kendall (2004), we used

the device of deliberately playing the “wrong” move from time to time – the

so-called “trembling hand” device. Because of this, the opponent modeller

was frequently punished by grim-like opponents. A single experimental

defection against grim ensures that the opponent will defect for the rest of

the game, locking both players into low payoffs.

For the purpose of the competition, we avoided this problem by playing

tit-for-tat at first, and then the moves that we calculate to be optimal.

This is simple and reduces our risk of offending grim-like opponents, but

also reduces the accuracy of our models, so that we may miss the chance of

truly optimal play. For example, Modeller loses badly in games against the


fixed strategy Always Defect, which simply defects at all times. The reason

for this is outlined below.

During the first 50 moves against Always Defect, Modeller gets no infor-

mation about what the opponent would do if both players cooperated last

move, or if we defected while he cooperated (because he never cooperates).

Also, since we play tit-for-tat up to this point, we only cooperate on the

first move, and thereafter defect, so we only see one example of the oppo-

nent’s play after we cooperate and he defects. The problem of this lack of

data is the reason we begin with an a priori model, specifically, tit-for-tat.

After 50 moves against Always Defect, the model looks like this:

Probability of cooperating after we both cooperate = 1

Probability of cooperating after I cooperate and he defects = 0.5

Probability of cooperating after I defect and he cooperates = 0

Probability of cooperating after we both defect = 0

Thus, when first called on to apply the model, Modeller reasons like

this:

“We just both defected. If I cooperate on this move, I’m sure he’ll

defect. After that, there’s a 50% chance he’ll cooperate on the next move.

(If not, I can try again.) If I cooperate too, that makes a 50% chance that

we will both cooperate. From then on, I’m sure we’ll keep cooperating.”

So Modeller expects to reach mutual cooperation and good payoffs after

a few more moves, if he continues to cooperate. The problem is that the

50% estimate is wrong (the true probability is 0). Although this incorrect

value will continue to be updated, it will take many more sucker moves

before the model is accurate enough for Modeller to make the right choice

(defect).

So we see that, for several reasons, the models learned by Modeller are

imperfect. Sometimes, this hurts us, but we push ahead regardless, hoping

that, on average, it will not hurt us too much. At least in this competition,

this was a reasonable assumption.

We expected that Modeller would have a tougher time in Competitions 1

and 2, because, in these competitions, collusion between entries was allowed.

With collusion allowed, the best approaches will use a “champion” strategy

that takes advantage of conditions by relying on “confederate” strategies

to sacrifice themselves by cooperating whilst their champion constantly de-

fects. In addition, confederate strategies can damage other competitors by

constantly defecting against them. The idea is for champion and confed-

erates to use the first few moves to identify each other. It is clear that no


non-colluding strategy can hope to compete in this environment. Never-

theless, we entered Modeller to provide an opponent for other entries, and

out of curiosity. It performed creditably, finishing 62nd, 61st, 64th, 60th

and 65th out of 192 entries in the five runs. It would be interesting to know

where it was placed among the non-colluding entries.

Modeller fared better in Competition 2, which allowed collusion, but

introduced “noise” – that is, with low probability, signals may be misinter-

preted by players. This upsets colluders by interfering with the identifica-

tion of confederates, but doesn’t inconvenience Modeller much at all, as it

is designed to deal with stochastic opponent strategies. In this competition,

Modeller finished 20th, 18th, 5th, 13th and 18th out of 165 entries in the

five runs.

It could be argued that, by allowing collusion, Competitions 1 and 2

changed the nature of the problem under consideration. The problem be-

comes one of teamwork, rather than one of cooperation with a self-interested

other. One can think of real-world scenarios that IPD-with-collusion use-

fully models — for example, teams of riders in the Tour de France, in which

team members sacrifice their own chances in order to protect a teammate

and improve his chance of a high-placed finish. The analogy is imperfect,

though, as in the Tour, teammates do not compete against each other di-

rectly. It is harder to think of examples from Nature. At least at first

sight, it would seem that colluding strategies would not work very well in

simulated evolution experiments like those that we describe in the next sec-

tion. Strategies acting as confederates would be selected against. In such

a scenario, it would be the average fitness of all members of the species

that determined reproductive success of the species as a whole. We won-

der what the average scores of the teams entered in Competitions 1 and 2

were, but we cannot calculate this because we do not know who was col-

luding with whom. Perhaps there are examples in Nature that we are not

aware of, and it may be a matter of appropriately structuring the simula-

tion to make collusion profitable. It would be interesting to hear of such

examples.

7.2. Opponent Modelling Versus Evolution

The opponent modeller described in the previous section was based on that

used in the CEC’04 study, which had nothing to do with the competition,

or, really, with Axelrod’s original competitions. It did take inspiration,

though, from Axelrod's experiments with evolution and IPD.


Those experiments by Axelrod were motivated in part by an apparent

anomaly in evolutionary theory. Cooperation between organisms in nature

entails one organism changing its behaviour in order to benefit another,

possibly to its own detriment. A commonly used example is that of the

lookout in groups of social animals, that makes an alarm call to warn the

rest of the group of the presence of a predator, placing itself at risk by

calling attention to itself. If evolution favours survival of the fittest, then

why does it not work against this kind of cooperation? Would it not rather

favour the cheat, who benefits from the alarm calls of others, but stays

silent when his own turn comes to act as lookout?

One can make plausible arguments to resolve this puzzle, invoking ideas

like kin-selection [Maynard-Smith (1988), pp. 192-193], or social reputation

[Maynard-Smith and Harper (2003), pp. 121-122], or one can build and

analyse mathematical models to test hypothetical mechanisms to explain

it, as in evolutionary game theory [Maynard-Smith (1988), pp. 194-200].

Or one can design and carry out simulated evolution experiments, which

is what Axelrod did, using IPD as his model. Subsequently, many others

have carried out their own, similar experiments, using variations on the

classic IPD model, exploring issues such as spatial effects [Nowak and May

(1992)], more complex strategies [Fogel (1993); Miller (1996)], the ability

to choose partners [Ashlock et al. (1996)] and so on.

There is another natural phenomenon that evolutionary theory must ex-

plain – intelligence. The central question is: Why and how did intelligence

evolve? This is a large topic, much debated, and one that has many facets.

Theories include Calvin’s “throwing theory” (that bigger brains evolved in

order to better throw rocks) [Calvin (1983)], the theory that greater intel-

ligence resulted as a response to the last ice age [Calvin (1991)], that the

evolution of intelligence was a result of sexual selection [Miller (1997)], and

the idea that intelligence is about being better at deceiving and detect-

ing deception – as in the Machiavellian intelligence hypothesis [Byrne and

Whiten (1988); Whiten and Byrne (1997)].

Our CEC’04 study was in part our attempt to contribute to the debate.

Just as Axelrod used IPD to study the evolution of cooperation, we used

similar experiments to study the evolution of intelligence. To better explain

what we did, we ask the reader to keep in mind the following hypothetical

scenario:

Imagine a world inhabited by simple creatures who interact and ex-

change resources by playing IPD with each other. Those who get bet-

ter payoffs live longer and have more offspring. These creatures are not


intelligent. Their moves are determined by their genes. They can recog-

nize each other, and they can only remember what moves were played the

last time they played each other — but no more than that. They act in-

stinctively, and never learn anything at all. This is the world of Axelrod’s

experiments.

Now imagine that a strange mutation arises amongst these creatures.

Mutants have abnormally large brains, large enough for them to remember

quite a bit about what happened in previous encounters with each other

creature. Enough for them to be able to make a good guess about what

move the other will make the next time they play. In fact, their brains are so

big and complex, that they can use this information to plan what move they

should make next, and what would happen after that, and so on, and choose

a move that will maximize the payoff in games against each other in the

future. These mutants are intelligent. What will happen to these intelligent

mutants? What will happen to the original, unintelligent creatures? This

is the world, and these are the questions that were addressed in the CEC’04

study.

This was not the first study to consider how an intelligent player might play IPD, nor the first to use simulation to study the evolution of intelligence,

but it may be the first to consider opponent modelling as an approach to

IPD, and also the first to consider combining evolution and intelligence in

the context of IPD.

The opponent modelling implementation for this study was similar to

that used in the competition, except that there was no initial period in

which the model was not used, the model update method was different,

and the players were all equipped with a “trembling hand”. As explained

earlier, the competition variant used an initial waiting period as a safeguard

against grim-like opponents, and because we guessed that many competi-

tion opponents would be tit-for-tat-like. The model update method in the

CEC’04 study used a “forgetting factor”, γ, to give greater weight to more

recent events. After each move, both counters pertaining to the current

game position were multiplied by γ, and the relevant counter was incre-

mented by 2× (1− γ), keeping the sum of the two counters constant. All

players in the CEC’04 study had a “trembling hand”. That is, they would

occasionally play defect when they intended to cooperate, or vice versa.

The advantage of this is that it makes all parts of an opponent model

reachable, and offers some hope of recovery against a grim-like opponent.
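A minimal sketch of this update and of the trembling hand follows. It is not the CEC'04 code: the values of γ and of the tremble probability shown are illustrative, and we assume the two counters for each game position are initialised to sum to 2, as the description above requires.

    # Recency-weighted counter update and "trembling hand" move selection.
    import random

    def forgetful_update(counts, state, opponent_cooperated, gamma=0.95):
        """Multiply both counters for the current game position by gamma and add
        2*(1-gamma) to the relevant one, so the two counters always sum to 2."""
        coop, defect = counts[state]
        coop, defect = coop * gamma, defect * gamma
        if opponent_cooperated:
            coop += 2 * (1 - gamma)
        else:
            defect += 2 * (1 - gamma)
        counts[state] = [coop, defect]

    def trembling_hand(intended_move, tremble=0.05):
        """Occasionally play the opposite of the intended move."""
        if random.random() < tremble:
            return 'd' if intended_move == 'c' else 'c'
        return intended_move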

The unintelligent players used 1st order lookup tables as their strategies.

Fig. 7.2. A typical run with unintelligent strategies (mean fitness and cooperation level plotted against generation).

Only pure strategies were used – that is, ones in which each

cooperation probability is either 0 or 1. No crossover was used, and mutation simply flipped a probability from 0 to 1, or vice versa. Though

they were not reported in the paper, experiments were also conducted with

stochastic strategies, giving broadly similar results.

There were two experiments reported. In the first experiment, fixed

strategy, unintelligent players were evolved in a simulation similar to Axel-

rod’s:

(1) An initial population is created.

(2) A round-robin IPD tournament is held between the members of the population. Every player plays every other player in a game of IPD in which the game continues to another round with probability δ (set to 0.96, for an average game length of 25 moves). The fitness of each individual is assigned to be that player's average payoff per move in the tournament.

(3) Fitness-proportionate selection is used to select parents for the next generation (stochastic uniform selection).

(4) Each parent, when selected, produces one child, by a process of copying the genome of the parent (with a low mutation rate – the probability of mutation as each gene is copied), and the development of a new individual from this genome. The children become the next generation.

(5) Repeat steps 2–4 for 1000 generations.
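A minimal sketch of this loop (steps 1–5) is given below. It is not the authors' code: the population size, the mutation rate, the game-play helper, and the use of simple roulette-wheel sampling in place of stochastic uniform selection are illustrative assumptions.

    import random

    STATES = ('cc', 'cd', 'dc', 'dd')
    PAYOFF = {('c', 'c'): (3, 3), ('c', 'd'): (0, 5), ('d', 'c'): (5, 0), ('d', 'd'): (1, 1)}

    def move(genome, state):
        """genome = [first-move bit, bit for cc, cd, dc, dd]; bit 1 means cooperate."""
        bit = genome[0] if state is None else genome[1 + STATES.index(state)]
        return 'c' if bit else 'd'

    def play_game(a, b, delta=0.96):
        """One IPD game between two genomes; returns each player's average payoff per move."""
        state_a = state_b = None
        total_a = total_b = moves = 0
        while True:
            ma, mb = move(a, state_a), move(b, state_b)
            pa, pb = PAYOFF[(ma, mb)]
            total_a, total_b, moves = total_a + pa, total_b + pb, moves + 1
            state_a, state_b = ma + mb, mb + ma        # each player's own move listed first
            if random.random() > delta:                # the game continues with probability delta
                return total_a / moves, total_b / moves

    def evolve(pop_size=100, generations=1000, mutation_rate=0.01):
        population = [[random.randint(0, 1) for _ in range(5)] for _ in range(pop_size)]
        for _ in range(generations):
            fitness = [0.0] * pop_size
            for i in range(pop_size):                  # step 2: round-robin tournament
                for j in range(i + 1, pop_size):
                    fi, fj = play_game(population[i], population[j])
                    fitness[i] += fi                   # proportional to average payoff per move
                    fitness[j] += fj
            # step 3: fitness-proportionate selection of parents (roulette wheel here)
            parents = random.choices(population, weights=fitness, k=pop_size)
            # step 4: each parent produces one child by copying with bit-flip mutation
            population = [[1 - g if random.random() < mutation_rate else g for g in parent]
                          for parent in parents]
        return population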

Table 7.1. Summary statistics for evolution of unintelligent strategies, n = 20, mean ± std. dev.

  Mean fitness      Mean coop%    grim%         tft%
  2.783 ± 0.013     86.7 ± 0.8    26.4 ± 1.5    19.7 ± 1.8

Fig. 7.3. Percentage of grim and tit-for-tat strategies for the run in figure 7.2.

The results of this experiment were similar to those reported by Axelrod. The populations evolved in a few generations to a mixture of generally

cooperative players, cooperating around 87% of the time. As can be seen

in table 7.1, the mean reward was close to the mean of 2.783 in all the

runs. The average percentages of grim and tit-for-tat (TFT) strategies

were around 26% and 20% respectively. Figure 7.2 shows a typical run,

with defection initially popular, and cooperation taking over after about

20 generations. Although the mean reward and degree of cooperation of the

population have stabilised, the composition of the population is constantly

fluctuating, with grim and tit-for-tat always present in large numbers, ap-

pearing to be loosely tied together in a cycle of period about 100 genera-

tions. Figure 7.3 shows the percentages for the same typical run.

In the second experiment, the players’ genomes were extended by adding

a “smart bit”. With the smart bit turned on, the player becomes an in-

telligent mutant, and plays as an opponent modeller. With the smart bit

turned off, the player remains an unintelligent player. In the initial pop-

ulation, all the smart bits were off. The scene is set. The mutants are

equipped to exploit the weak amongst the normal players. Will this ability

enable them to take over the population? Will they merely weed out the

exploitable players?


Fig. 7.4. A typical run with unintelligent players and opponent modellers.

Table 7.2. Summary statistics for coevolution of unintelligent players with opponent modellers, n = 20, mean ± std. dev.

  Mean fitness   Mean modeller fitness   Mean coop%   modeller%    grim%        tft%
  2.67 ± 0.01    2.51 ± 0.01             80.6 ± 0.6   13.5 ± 0.7   21.7 ± 2.0   24.1 ± 2.0

Figure 7.4 shows the mean fitness and level of cooperation in a typical

run. The picture is similar to that of the first experiment, with a slightly

lower degree of cooperation at around 81%, and slightly lower mean re-

wards around 2.67. Figure 7.5 shows the percentage of tit-for-tat and grim

strategies and also the percentage of opponent modellers for the same run.

As table 7.2 shows, a significant proportion of opponent modellers, a mean of around 13.5% of the population, is able to survive. Compared to the first

experiment, some of the grim strategies have been displaced, but the per-

centage of tit-for-tat strategies has actually increased. We conjecture that

the increase in tit-for-tat was at the expense of more exploitable strategies,

which are under pressure from the opponent modellers. While grim play-

ers can’t be exploited, they are involved in a lot of unprofitable mutual

defection with opponent modellers, and also suffer.

Although opponent modellers are able to survive in this simulated envi-

ronment, their mean fitness is lower than that of the rest of the population.

Without mutation, they would be driven to extinction.

Fig. 7.5. Percentage of opponent modellers, grim and tit-for-tat strategies for the run in figure 7.4.

One problem for

opponent modellers is the poor payoff from games with grim. But, the

main reason for their relatively poor performance is that when two oppo-

nent modellers meet, their average payoff is only 1.69. (In the competition,

this doesn’t happen, as after the first 50 moves, each thinks the other is

playing tit-for-tat, so mutual cooperation is locked in. In any case, each

player only plays itself once in the competition, so self-play is not an im-

portant factor.)

As an explanation of how intelligence might evolve, this model has raised

some questions. One could regard it as an illustration of the self-limiting

nature of exploitative behaviour in human and animal societies. Taking

these results as a starting point, one could ask under what conditions intel-

ligent players would do better, or worse, against unintelligent opponents,

than they did in this experiment. Answers to this question might provide

clues as to how and why intelligence has evolved in Nature, and why various

successful species have varying degrees of intelligence. In the next section,

we describe some new experiments in which we investigate some of the

effects that contributed to the results of this section.

7.2.1. The new experiments

As described above, one finding of the CEC’04 work was that the presence

of opponent modellers in an evolving population of IPD playing agents has


an influence on the kinds of fixed strategy players favoured by evolution.

The experiments described in this section seek to explore this further.

The main difference between these experiments and the earlier ones is

that opponent modellers do not directly take part in the evolution process,

but are used to test the fitness of the members of an evolving population of

fixed strategy IPD players. This makes it possible to isolate and manipulate

the influence of the opponent modellers.

There are some minor differences between the implementation of oppo-

nent modelling used in these experiments and the one used in the earlier

study. Instead of using a default model for an opponent strategy (based

on TFT) as used in the CEC’04 work, for these new experiments, we in-

stead start with an empty model of the opponent. As before, we count the

number of times the opponent cooperated for each game state to determine

a probability of cooperation for that game state based on observation of

the opponent’s moves. From this probability of cooperation, we are able

to calculate the next best move by calculating the best expectation for

all possibilities by looking ahead in the game state graph to consider the

consequences of each possible course of action. Since look-ahead is compu-

tationally expensive, we consider only a look-ahead of 5 moves (sufficient

to prevent short-term gains from taking precedence over long-term consid-

erations). Unlike the earlier work, we do not include a recency factor to

discount older observations as we do not make any short-cut assumption

about the opponent strategy at the start.

Exploration of an opponent’s strategy is also undertaken differently. In

the earlier CEC’04 work, a trembling hand was used for exploration of the

opponent strategy. In this work, exploration is more immediate – the op-

ponent modeller makes random decisions when it encounters game states

for which it has no information about the opponent’s strategy. The ad-

vantage of this approach is that exploration occurs earlier in the modelling

process, thus meaning more information is available earlier in the game,

hopefully leading to better exploitation of an opponent’s weaknesses in the

short term.
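A minimal sketch of this look-ahead, together with the random exploration at unseen game positions, is given below. It is not the authors' code: the payoff values, the coin-toss estimate used for unobserved states inside the look-ahead, and the function names are our own choices.

    import random

    STATES = ('cc', 'cd', 'dc', 'dd')                        # our move first, opponent's second
    PAYOFF = {('c', 'c'): 3, ('c', 'd'): 0, ('d', 'c'): 5, ('d', 'd'): 1}

    def cooperation_probability(counts, state):
        """Observed cooperation frequency for this game position (0.5 if never seen)."""
        coop, defect = counts.get(state, (0, 0))
        return 0.5 if coop + defect == 0 else coop / (coop + defect)

    def move_values(counts, state, depth):
        """Expected payoff of cooperating and of defecting, looking `depth` moves ahead."""
        p = cooperation_probability(counts, state)
        values = {}
        for my_move in ('c', 'd'):
            values[my_move] = sum(
                prob * (PAYOFF[(my_move, opp)] + best_value(counts, my_move + opp, depth - 1))
                for opp, prob in (('c', p), ('d', 1.0 - p)))
        return values

    def best_value(counts, state, depth):
        return 0.0 if depth == 0 else max(move_values(counts, state, depth).values())

    def choose_move(counts, state, depth=5):
        """Explore randomly at positions never observed; otherwise take the move with
        the best expectation over the next `depth` moves."""
        coop, defect = counts.get(state, (0, 0))
        if coop + defect == 0:
            return random.choice(('c', 'd'))
        values = move_values(counts, state, depth)
        return max(values, key=values.get)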

Of course, the key difference between our new experiments and those

presented at CEC’04 has to do with the effects of the opponent modeller on

the course of the evolutionary process. In the CEC’04 work, the opponent

modeller was considered another instance of the evolving population that

could replicate (so there could be multiple copies of the opponent modeller),

and needed to compete to earn their position in the population in order to

survive (i.e., opponent modellers were subjected to the same evolutionary


selection pressure as the unintelligent players). Results from the CEC’04

work showed that due to poor performance against other opponent mod-

ellers (the average return for self-play was 1.69), the number of opponent

modellers in the population fell away over the course of the evolutionary

run. In these new experiments, we take a different approach – we do not

involve the opponent modeller in the evolutionary process (so the oppo-

nent modeller is not subjected to the same evolutionary pressures), and it

is instead treated separately from the evolving population. As before, we

still maintain a population of unintelligent players that must compete for

their right to remain (and reproduce) in the population, but now assess-

ment of an individual’s ability (its fitness) is calculated as a weighted sum

of its performance against the other (unintelligent) members of the evolv-

ing population and its performance against the separate opponent modeller.

Below, we report on experiments with different weightings to determine and

isolate the effects of the opponent modeller on the evolutionary process.
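Concretely, one natural reading of this weighted sum (the exact normalisation is not spelled out above, so the form shown here is an assumption on our part) is

fitness(i) = (1 − w) × payoff(i, population) + w × payoff(i, modeller),

where w is the weighting "against Modeller" reported in the tables below, payoff(i, population) is player i's average payoff in games against the other members of the evolving population, and payoff(i, modeller) is its average payoff in games against the opponent modeller.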

While there are a number of differences between these two studies,

analysis shows that the results are mostly robust with respect to these

differences. Indeed, compensating for the effects of self-play in the ear-

lier CEC’04 work yields results mostly similar to the results found using

this new methodology (some differences occur due to the differences in ex-

ploration between the two approaches). We use our simpler approach in

the experiments below, thus allowing us to explore longer-term effects and

longer-term IPD games (these new experiments investigate games last-

ing 1000 rounds while the earlier work investigated games lasting only

25 rounds).

Our baseline experiment is to play the opponent modeller against a se-

lection of eight commonly known IPD strategies. Each strategy is played

against each of the others for 1000 iterations, giving a total of 8000 iter-

ations for each strategy. The results of the round-robin tournament are

presented below in table 7.3. Table 7.4 reports a breakdown of the oppo-

nent modeller’s performance (average payoff) against each of the strategies

in table 7.3.

Table 7.3 lists a couple of strategies we have yet to describe. STFT (suspicious tit-for-tat) is like tit-for-tat except that it defects on the first move. Gradual is another variation on tit-for-tat: this strategy acts as tit-for-tat, except that after the first defection of the other player, it defects once and cooperates twice; after the second defection of the opponent, it defects twice and cooperates twice, and so on. The Pavlov strategy

is similar to grim, except that it is more forgiving. Based around the principle of continuing to do the same thing when performing well and only changing when performing poorly, Pavlov starts cooperating and continues to cooperate until its opponent defects. Upon defection, Pavlov switches to defection. The difference between grim and Pavlov is that Pavlov will return to cooperation if defection does not prove to be profitable (i.e., if its opponent also begins to defect), hoping to return to a state of mutual cooperation.

Table 7.3. Round-robin tournament results involving the opponent modeller against eight other commonly known IPD strategies.

  Rank    Strategy             Average payoff
  1       Opponent Modeller    2.74
  2       Gradual              2.68
  3       TFT                  2.59
  4       Grim                 2.26
  5       STFT                 2.22
  6       Pavlov               2.15
  7       Always Cooperate     2.07
  8       Always Defect        2.05
  9       Random               1.64

Table 7.4. The opponent modeller's average payoff against each of the strategies in table 7.3, together with each opponent's average payoff in the same games.

  Strategy            Average payoff (Opponent Modeller)    Average payoff (Opponent)
  Gradual             2.87                                   2.75
  TFT                 2.99                                   2.99
  Grim                1.00                                   1.01
  STFT                2.99                                   3.00
  Pavlov              3.00                                   0.50
  Always Cooperate    5.00                                   0.01
  Always Defect       1.00                                   1.00
  Random              3.04                                   0.51
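For reference, those of the fixed strategies just described that can be expressed as 1st order pure strategies can be written down directly. The sketch below is our own encoding, not the authors' code; the Grim and Pavlov rows are our reading of the descriptions above, and Gradual and Random are not 1st order strategies, so they are omitted.

    C, D = 'c', 'd'

    FIRST_ORDER_STRATEGIES = {
        # name: (first move, move to play after each previous outcome; own move listed first)
        'TFT':    (C, {'cc': C, 'cd': D, 'dc': C, 'dd': D}),   # copy the opponent's last move
        'STFT':   (D, {'cc': C, 'cd': D, 'dc': C, 'dd': D}),   # tit-for-tat, but defect first
        'Grim':   (C, {'cc': C, 'cd': D, 'dc': D, 'dd': D}),   # defect forever once a defection occurs
        'Pavlov': (C, {'cc': C, 'cd': D, 'dc': D, 'dd': C}),   # revert to cooperation when defection stops paying
        'AllC':   (C, {'cc': C, 'cd': C, 'dc': C, 'dd': C}),   # Always Cooperate
        'AllD':   (D, {'cc': D, 'cd': D, 'dc': D, 'dd': D}),   # Always Defect
    }

    def next_move(strategy, previous_outcome=None):
        """Move for the given previous round ('cc', 'cd', 'dc', 'dd'; None on the first move)."""
        first, table = strategy
        return first if previous_outcome is None else table[previous_outcome]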

That the opponent modeller emerged as the winner of the tournament

is encouraging, but is not particularly significant given the arbitrary se-

lection of opponents in the tournament. What is more interesting is the

performance of the opponent modeller against each individual strategy.

The first thing to note is the ability of the opponent modeller to suc-

cessfully identify Always Defect as the best counter-strategy against the

non-reactive opponents (Always Cooperate, Always Defect, and random),


achieving near-perfect scores against Always Cooperate and Always Defect,

and the best possible result against random.

Against tit-for-tat and STFT, the opponent modeller is able to identify

cooperation as the best course of action without falling into the defection

echo trap. As expected, the inevitable strategy exploration against grim

is punished, resulting in a poor score for the opponent modeller. The rel-

atively poor performance of Pavlov in the round-robin tournament is at

least partially due to the opponent modeller settling on an Always Defect

counter-strategy, rather than the equally effective Always Cooperate alter-

native.

Our next experiments examine the effect of the opponent modeller on

the course of a population of IPD players subjected to evolutionary selec-

tion pressure. As seen in the earlier CEC’04 work, the presence of op-

ponent modellers in the population affects the kinds (and distribution) of

fixed strategy players selected by evolution. These new experiments further

elaborate on these effects.

First, we report on the performance of the opponent modeller against

an evolving population of fixed pure strategies. Figure 7.6 plots the average

payoff for the opponent modeller against each member of the population

along with the average payoff of the evolving population.

Fig. 7.6. Average payoffs for the evolving population of fixed pure strategies and an opponent modeller playing against each member of the population over time.


We can see that in most generations, the opponent modeller is able

to outperform the evolving population, obtaining a higher average payoff

than the average payoff of the evolving population. However, there are sev-

eral generations where the population outperforms the opponent modeller.

Analysis of the population composition at these points reveals that this

occurs when there are a large number of grim strategies in the population

(recall that exploration against the unforgiving grim is fatal – one defection

against grim locks the opponent modeller into a payoff at best 1.0 from then

on). For example, in generation 988, where the opponent modeller is at its

least effective (scoring on average 1.10 less than the evolving population),

the number of grim strategies reaches its peak – 68% of the population.

The first row of table 7.6 reports the composition of the evolving popu-

lation for the corresponding experiments plotted in figure 7.6. We can see

grim, tit-for-tat, and Pavlov are the most prevalent in the population.

The results from these experiments show that the opponent modeller

is successful against an evolving population of fixed pure IPD strategies,

provided the proportion of grim strategies in the population is not high.

However, these experiments have not rewarded fixed strategies that score

well against the opponent modeller, only those that perform well against

the rest of the evolving population. Next, we examine experiments that

incorporate scores achieved against the opponent modeller into the fitness

evaluations of the fixed strategies.

Table 7.5 reports the average payoffs for the members of the evolving

population of pure strategies and the opponent modeller, along with the

composition of selected strategies in the population (dashed entries indicate

low numbers) for different ratios of the weighted sum that constitutes the

fitness of a member of the evolving population.

Table 7.5. Average payoffs for the members of an evolving population of pure strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.

  Weighting              Population average payoff                 Modeller
  (against Modeller)     Against population    Against Modeller    average payoff
  0                      2.50 (0.19)           1.07                2.62 (0.32)
  0.05                   2.51 (0.21)           1.31                2.81 (0.27)
  0.1                    2.60 (0.12)           1.60                2.94 (0.21)
  0.2                    2.68 (0.09)           2.08                2.99 (0.12)
  0.5                    2.65 (0.09)           2.53                2.93 (0.06)
  1.0                    2.14 (0.25)           2.72                2.91 (0.05)


Table 7.6. Distribution of strategies for the members of an evolving population of pure strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum as in table 7.5.

  Weighting              Number of each fixed strategy
  (against Modeller)     Grim      TFT       Pavlov    STFT
  0                      25 (10)   15 (7)    10 (5)    -
  0.05                   15 (6)    25 (10)   7 (5)     -
  0.1                    12 (6)    33 (9)    5 (3)     -
  0.2                    8 (3)     49 (8)    -         8 (3)
  0.5                    4 (1)     61 (6)    -         13 (4)
  1.0                    2 (1)     38 (11)   -         41 (11)

The first obvious difference of the experiments that include performance

against the opponent modeller in fitness calculations (rows 2 onwards in

table 7.5) is the increased average payoff of the opponent modeller. In

comparison to the first row of the Table, we see that the average payoff of

the opponent modeller increases up to a point, before leveling off at around

2.95. This is due to the changes in the composition of the resulting evolved

population (see table 7.6). As we saw in our baseline experiment, grim and

Pavlov do not perform well against the opponent modeller (scoring 1.01

and 0.50 respectively) and hence even a very low degree of influence from

the opponent modeller on fitness scores is enough to reduce the appearance of these strategies in the evolving population. With a weighting of

0.2, Pavlov is unable to score highly enough to survive in any significant

quantities and the presence of grim is much reduced. The reduction in the

number of grim strategies explains the increase in the average payoff of

the opponent modeller (recall that the opponent modeller performs poorly

against grim because of the high cost of strategy exploration). With a 0.5

weighting, grim becomes marginalised. The increasing number of STFT

strategies at the higher weightings explains the small decrease in average

payoff of the opponent-modeller – STFT is not exploitable and indeed may

benefit from its suspicious nature at the beginning of the game. At the

higher weightings, the only strategies other than tit-for-tat and STFT to

appear in the population are single-step (differing in just one state) mu-

tants from tit-for-tat and STFT (including grim) induced by the mutation

in the evolutionary process. These mutants are not able to survive in the

evolving population and are quickly eliminated.

Variance in the performance of the opponent modeller also decreases as

we increase the relative importance of performance against the opponent


modeller in the evaluation of the success of a population member. This is

because the opponent modeller acts as a stabilising influence on the fitness

of the evolving population since it is a constant in the environment. The

more that the fitness is derived from games against the remainder of the

population (low weightings), the more performance is affected by changes

in the population.

As seen in column 3 of table 7.5, the average payoff of the evolving pop-

ulation against the opponent modeller increases as the relative importance

of performance against the opponent modeller in fitness calculations for a

population member increases. This is as expected, because survival in the

population now depends more and more on this metric than on performance against the other members of the evolving population. Indeed, at a weight-

ing of 1.0, performance against the opponent modeller is maximal, at a

sacrifice of performance against the other members of the evolving popula-

tion. Somewhat strangely though, as the weighting increases from 0 to 0.2,

performance of the evolving population against other members of the pop-

ulation increases, even though fitness now depends more on performance

against the opponent modeller. This is due to the decreased numbers of

grim strategies (driven out by the opponent modeller) – defection is no

longer as costly as it was before. At the highest weighting, even the small

numbers of grim strategies ensure that the abundance of STFT strate-

gies perform relatively poorly, lowering the average-payoff of the evolving

population in play against each other.

Importantly, we observe in table 7.5 that while the performance of the

evolving population against the opponent modeller increases as the weight-

ing increases, the evolving population is never able to obtain a level of

performance comparable to that of the opponent modeller (contrast col-

umn 4 of table 7.5 against column 3). Of course, evolution does its best –

evolving a population consisting of predominantly non-exploitable strate-

gies (tit-for-tat and STFT ). However, due to the stochastic nature of the

evolutionary process in the mutation operation, other strategies find their

way into the population, thus allowing the opponent modeller to exploit

weaknesses and obtain an average payoff higher than that of the evolving

population.

Our analysis of table 7.5 shows the opponent modeller to be effective against populations of pure strategies, outperforming the evolving

population in terms of average payoff in play against each other. The oppo-

nent modeller is able to outperform the evolving population, learning with

a high degree of certainty what its opponent will do in any given situation


(game state). In our next experiment, we repeat these tests using stochastic

strategies in place of pure strategies.

Stochastic IPD strategies differ from pure IPD strategies as they allow

the player the flexibility of selecting a cooperate/defect action probabilis-

tically given a particular game state. Whereas a pure strategy will always

select the same action for a given game state, a stochastic strategy may

(probabilistically) decide which action to take. This means that successive

calls of a stochastic strategy for the same input game state may produce

different output actions. This cannot occur for a pure strategy – the pure

strategy will always select the same response given an input game state.

Stochastic strategies are implemented as follows: for each unique game state (recall, we are assuming 1st order strategies only), the stochastic strategy stores the probability of cooperating in that game state, and each action is chosen directly according to this stored probability. Mutation of

a stochastic strategy occurs by adjusting each internal probability by a

randomly sampled variable taken from a Gaussian distribution with mean

0 and a standard deviation of 0.025.
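A minimal sketch of such a stochastic strategy and of its Gaussian mutation follows. It is not the authors' code: the clamping of mutated values back into [0, 1] is an assumption on our part.

    import random

    STATES = ('first', 'cc', 'cd', 'dc', 'dd')

    def act(strategy, state):
        """strategy maps each state to a stored probability of cooperating."""
        return 'c' if random.random() < strategy[state] else 'd'

    def mutate(strategy, sigma=0.025):
        """Perturb every stored probability by Gaussian noise with mean 0 and the
        standard deviation given in the text, keeping it a valid probability."""
        return {s: min(1.0, max(0.0, p + random.gauss(0.0, sigma)))
                for s, p in strategy.items()}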

Against stochastic strategies, the opponent modeller can still estimate

the probability with which its opponent will cooperate, but it cannot be

sure that the opponent will cooperate on any given move. This experiment

against stochastic strategies will report on the effects of this uncertainty in

behaviour on the performance of the opponent modeller.

Table 7.7 reports the average payoffs for the members of the evolving population of stochastic strategies and the opponent modeller, along with the composition of selected strategies in the population (dashed entries indicate low numbers) for different ratios of the weighted sum that constitutes the fitness of a member of the evolving population.

Table 7.7. Average payoffs for the members of an evolving population of stochastic strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.

  Weighting              Population average payoff                 Modeller
  (against Modeller)     Against population    Against Modeller    average payoff
  0                      2.08 (0.61)           1.29                2.16 (0.59)
  0.05                   2.43 (0.22)           2.23                2.54 (0.22)
  0.1                    2.59 (0.14)           2.49                2.68 (0.12)
  0.2                    2.61 (0.12)           2.76                2.68 (0.10)
  0.5                    2.54 (0.11)           2.96                2.62 (0.09)
  1.0                    2.13 (0.18)           3.08                2.44 (0.08)

Table 7.8. Distribution of strategies for the members of an evolving population of stochastic strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum as in table 7.7.

  Weighting              Number of each fixed strategy
  (against Modeller)     Grim      TFT       Pavlov    STFT
  0                      22 (18)   12 (14)   7 (8)     2 (2)
  0.05                   6 (10)    23 (23)   -         23 (22)
  0.1                    -         22 (19)   -         37 (27)
  0.2                    -         40 (29)   -         29 (28)
  0.5                    -         40 (30)   -         36 (27)
  1.0                    -         60 (29)   -         31 (28)

Against a population of evolved stochastic strategies (row 1 of table 7.7),

the opponent modeller, on average, does outscore the evolving population,

performing well in certain generations, but not in others. As in the equiv-

alent experiment against pure strategies, this performance depends on the

number of grim-like strategies in the evolving population – when the num-

ber of grim-like strategies is high, performance is relatively weak; when

the number of grim-like strategies is low, performance is relatively high.

However, unlike the experiment involving pure strategies, the performance

of the opponent modeller is more unstable, perhaps due to large dynamic

changes in the composition of the opponent strategies observed in the evo-

lution of a population of stochastic strategies. No such large-scale changes

in strategy composition were evident in the evolution of a population of

pure strategies (contrast the variance in the numbers of each fixed strategy

in table 7.6 and table 7.8).

As in the experiment with pure strategies, as the importance of the per-

formance against the opponent modeller increases, the average payoff of the

evolving population against the opponent modeller increases. However, in

contrast to the experiments involving pure strategies, we see that the evolv-

ing population is able to obtain a higher average payoff than the opponent

modeller for weightings greater than 0.2 (recall previously that an evolv-

ing population of pure strategies was unable to surpass the performance

of an opponent modeller regardless of the relative weighting). Indeed, at

a weighting of 1.0, the evolving population is able to achieve an average

payoff of greater than 3 against the opponent modeller, whilst the oppo-

nent modeller scores less than 2.5 on average, suggesting that the evolving


population is exploiting the opponent modeller. Why is the opponent mod-

eller scoring less than its opponent in these scenarios? Does this represent

a failure for our opponent modelling approach, or even opponent modelling

in general? The key to understanding these observations has to do with the

composition of the evolving population.

At the higher weightings, tit-for-tat-like and STFT-like strategies ac-

count for the majority of strategies making up the evolving population

(indeed, grim-like strategies have mostly disappeared). If these were pure

strategies, we would expect to see them achieve an average payoff of no

more than 3 (mutual cooperation). However, these strategies are not pure,

instead behaving stochastically, acting mostly like their pure strategy coun-

terpart, but sometimes not. This means that a stochastic TFT-like strategy

will typically play like tit-for-tat and enter into mutual cooperation. How-

ever, occasionally, this stochastic tit-for-tat-like strategy will attempt an

unprovoked defection.

To understand why these stochastic variants are successful, particularly

against the opponent modeller, we need to consider the nature of the game.

As IPD is not a zero-sum game, and since the objective of the opponent

modeller is to achieve the highest payoff it can (and not to achieve a higher

payoff than its opponent), it is often better for the opponent modeller to

accept the occasional defection without retaliating in order to achieve a

higher average payoff in the long run (provided the defection doesn’t occur

too frequently). Indeed, if the opponent modeller was to reciprocate every

defection by its opponent, it would be able to prevent its opponent from

significantly out-scoring it, but at the cost of lowering its own average payoff

(for example, it is better to accept an average payoff of 2 and allow your

opponent an average payoff of 4 than to retaliate and restrict both players

to an average payoff of 1).
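A small worked example may make this concrete (our own illustration, assuming the standard payoff values R = 3, S = 0, P = 1): if the opponent defects once every k rounds and cooperates otherwise, tolerating the defections yields an average payoff of

\[ \frac{(k-1)R + S}{k} = \frac{3(k-1)}{k}, \]

which already exceeds the mutual-defection payoff P = 1 for every k > 1.5, so even fairly frequent unprovoked defections are worth tolerating rather than punishing.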

This scenario provides an interesting demonstration of the interactions

between evolution and learning in a competitive environment. We have ob-

served that evolution produces IPD players that can improve their average

payoffs against the opponent modeller by employing occasional unprovoked

defections. It would seem that as long as the evolved strategies do not de-

fect often enough to evoke retaliation from the opponent modeller, they will

achieve higher average payoffs than the opponent modeller. In response, the

opponent modeller seemingly recognises that, although it is being exploited,

it will achieve better future rewards by not retaliating, since its opponent

will resume cooperation after each unprovoked defection. Indeed, this sug-

gests that Axelrod’s third guideline for playing IPD (“always reciprocate


cooperation and defection”) does not apply against stochastic strategies.

We still deem this a success for the opponent modeller – indeed, the oppo-

nent modeller is still able to achieve the highest payoff possible against this

particularly “nasty” opponent. Sometimes, you just have to grin and bear

it.

7.3. Conclusions

IPD is a game that models human choices in self-interested environments.

Previous studies of the game have focused on both evolution and standard

artificial intelligence techniques to study game strategies. However, some-

thing has been missed in these previous investigations – the role of a theory

of mind, specifically, of adapting one’s play based upon a learned model of

an opponent’s strategy. This is the area of opponent modelling – building

a representation of an opponent’s strategy, typically from experience, in or-

der to exploit weaknesses in their play. The trade-off between exploration

(searching for better ways to exploit an opponent) and exploitation (taking

advantage of the weaknesses in an opponent’s strategy) is paramount to

the success of the opponent modeller – too much strategy exploration and

the opponent modeller may not solidify its advantage; too little strategy

exploration and the opponent modeller may be sacrificing potential gains.

A balance between the two must be achieved for near-optimal play.

Using an observational model of the choices made by an opponent and a

simple technique to select the best choice given the next most likely action of

the opponent, we have introduced a simple approach to construct computer

IPD players capable of exploiting observable strategy weaknesses in oppo-

nents’ play. Our experiments show that a computer opponent modelling

IPD player is able to outperform an evolving population of fixed pure-

strategy opponents in terms of average payoff in play against each other

and perform as well as possible against a population of stochastic-strategy

opponents. Further, the strong performance of our entry in the IPD compe-

tition held at CIG’05, the 2005 IEEE Computational Intelligence in Games

conference, supports our claims of the benefits of opponent modelling – our

entry, based on the ideas presented in this work, consistently finished in

the top five in the classical IPD competitions, and performed honourably

in the collusion-based competitions.

Beyond the IPD game, this work makes a contribution to the question of

how intelligent behaviour evolves. Higher intelligence is more than simple

mimicry or rote learning, requiring the ability to predict and respond to


specific “opponent” choices. Our work reflects on a Machiavellian view of

intelligence, in which the manipulation of the behaviour of other individuals

is crucial. High levels of intelligence are not universal in Nature – the

majority of life is simple and unintelligent, and human-level intelligence is

unique. A traditional explanation for this invokes cost in terms of energy

needs of a highly developed brain. One of our reviewers pointed out that our

approach offers a fundamentally different explanation in terms of the cost

of exploration. Yet another explanation is the self-limiting dynamics of

having an intelligent sub-population. Our experiments show, for example,

that opponent modelling is a viable strategy in an IPD environment, and

moreover, that the presence of opponent modellers affects the success of

other strategies, which in turn alters the characteristics of that environment.

This may be an important factor to consider in any study of the evolution of

intelligence. The subtleties and parameters of such interactions might offer

an explanation as to why the varying requirements of different ecological

niches lead to co-existence of species having different levels of intelligence.

Further study is needed to understand such interactions and the factors

that determine their outcomes.

References

Ashlock, D., Smucker, M. D., Stanley, E. A., and Tesfatsion, L. (1996) Pref-

erential partner selection in an evolutionary study of prisoner’s dilemma,

BioSystems, 37, pp. 99-125.

Axelrod, R. (1984) The Evolution of Cooperation. New York, Basic Books.

Byrne, R. W. and Whiten, A. (1988) Machiavellian Intelligence: Social Exper-

tise and the Evolution of Intellect in Monkeys, Apes and Humans. Oxford,

Clarendon Press.

Calvin, W. H. (1983) A Stone’s Throw and its Launch Window: Timing Pre-

cision and its Implications for Language and Hominid Brains, Journal of

Theoretical Biology, 104, pp. 121-135.

Calvin, W. H. (1991) The Ascent of Mind, Bantam.

Fogel, D. B. (1993) Evolving behaviors in the iterated prisoner’s dilemma. Evo-

lutionary Computation, 1, 1, pp. 77-97.

Hingston, P. and Kendall, G. (2004) Learning versus Evolution in Iterated Prisoner's Dilemma, Proceedings of the IEEE Congress on Evolutionary Computation (CEC'04), Portland, IEEE, pp. 364-372.

Markovitch, S. and Reger, R. (2005) Learning and Exploiting Relative Weaknesses

of Opponent Agents, Autonomous Agents and Multi-Agent Systems, 10,

pp. 103-130.

Maynard-Smith, J. (1988) Did Darwin get it right? Essays on Games, Sex and

Evolution, Penguin Books.


Maynard-Smith, J. and Harper, D. (2003) Animal Signals. Oxford, Oxford Uni-

versity Press.

Miller, G. F. (1997) Protean primates: The evolution of adaptive unpredictability

in competition and courtship. Machiavellian Intelligence II: Extensions and

Evaluations. Cambridge, Cambridge University Press: 312-340.

Miller, J. H. (1996) The coevolution of automata in the repeated prisoner’s

dilemma, Journal of Economic Behavior and Organization, 29, pp. 87-112.

Nowak, M. and May, R. (1992) Evolutionary games and spatial chaos, Nature,

359, pp. 826-829.

Parker, S. T., Mitchell, R.W. and Boccia, M.L., Ed. (1994). Self-awareness in

Animals and Humans: Developmental Perspectives. Cambridge, Cambridge

University Press.

Whiten, A. B., and Byrne, R. W. (1997) Machiavellian Intelligence II: Extensions

and Evaluations. Cambridge, Cambridge University Press.


Chapter 8

On some winning strategies for the Iterated Prisoner’s

Dilemma or Mr. Nice Guy and the Cosa Nostra

Wolfgang Slany and Wolfgang Kienreich

Technical University, Graz, Austria

We submitted two kinds of strategies to the iterated prisoner’s dilemma

(IPD) competitions organized by Graham Kendall, Paul Darwen and Xin

Yao in 2004 and 2005.a Our strategies performed exceedingly well in both

years. One type is an intelligent and optimistic enhanced version of the well-known TitForTat strategy, which we named OmegaTitForTat. It recognizes

common behaviour patterns and detects and recovers from repairable mu-

tual defect deadlock situations, otherwise behaving much like TitForTat.

OmegaTitForTat was placed as the first or second individual strategy in

both competitions in the leagues in which it took part. The second type

consists of a set of strategies working together as a team. The call for par-

ticipation of the competitions explicitly stated that cooperative strategies

would be allowed to participate. This allowed a form of implicit communi-

cation which is not in keeping with the original IPD idea, but represents a

natural extension to the study of cooperative behaviour in reality as it is

aimed at through the study of the simple, yet insightful, iterated prisoner’s

dilemma model. Indeed, one’s behaviour towards another person in reality

is very often influenced by one’s relation to the other person.

In particular, we submitted three sets of strategies that work together as

groups. In the following, we will refer to these types of strategies as group

strategies. We submitted the CosaNostra,b the StealthCollusion, and the

EmperorAndHisClones group strategies. These strategies each have one dis-

tinguished individual strategy, respectively called the CosaNostraGodfather

aSee http://www.prisoners-dilemma.com/ for more details.

bOne of us, Slany, had submitted the CosaNostra group strategy previously to an iterated

prisoner’s dilemma competition organized by Thomas Grechenig in 1988. Our submitted

group strategies are inspired by this first formulation of such a group strategy that we

are aware of.


(called ADEPT in 2004), the Lord strategy, and the Emperor, that heavily

profit from the behaviour of the other members of their respective groups:

the CosaNostraHitmen (10 to 20 members), the Peons (open number of

members), and the CloneArmy (with more than 10,000 individually named

members), which willingly let themselves be abused by their masters while lowering the scores of all other players as much as possible, thus further maximizing the performance of their masters relative to other

participants. Our group strategies took first, second and third places

in several leagues of the competitions and also likely were the most efficient

of all group strategies that took part in the competitions. Such group

strategies have since been described as collusion group strategies. We will

show that the study of collusion in the simplified framework of the iterated

prisoner’s dilemma allows us to draw parallels to many common aspects

of reality both in Nature as well as Human Society, and therefore further

extends the scope of the iterated prisoner’s dilemma as a metaphor for the

study of cooperative behaviour in a new and natural direction. We fur-

ther provide evidence that it will be unavoidable that such group strategies

will dominate all future iterated prisoner’s dilemma competitions as they

can be stealthily camouflaged as non-group strategies with arbitrary sub-

tlety. Moreover, we show that the general problem of recognizing stealth

colluding strategies is undecidable in the theoretical sense.

The organization of this chapter is as follows: Section 8.1 introduces the terminology. Section 8.2 evaluates our results in the competitions. Section 8.3 describes our strategies. Section 8.4 analyses the performance of our and similar strategies and proves the undecidability of recognizing collusion. Section 8.5 relates the findings to phenomena observed in Nature and Human Society and draws conclusions.

8.1. Introduction

The payoff values in an iterated prisoner’s dilemma are traditionally called

T (for temptation to betray a cooperating opponent), S (for sucker’s payoff

when being betrayed while cooperating oneself), P (for punishment when

both players betray each other), and R (for reward when both players coop-

erate with each other). Their values vary from formulation to formulation

of the prisoner’s dilemma. Nevertheless, the inequalities S < P < R < T

and 2R > T + S are always observed between them. The last one ensures

that cooperating twice (2R) pays more than alternating one’s own betrayal

of one’s partner (T) with allowing oneself to be betrayed by him or her (S)


[Kuhn (2003)]. In the iterated prisoner’s dilemma competitions organized

by Graham Kendall, Paul Darwen and Xin Yao in 2004 and 2005, these

values were, respectively, S = 0, P = 1, R = 3, and T = 5. Note that

the general results in Section 8.4 are true for arbitrary values constrained by

the inequalities stated above.
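As a quick check (our own arithmetic, not part of the competition rules), the 2004 and 2005 values satisfy both constraints:

\[ S < P < R < T: \; 0 < 1 < 3 < 5, \qquad 2R = 6 > 5 = T + S. \]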

8.2. Analysis of the Tournament Results

The strategies we submitted to the competitions were the OmegaTitForTat

individual, single-player strategy (OTFT), the CosaNostra group strategy,

the StealthCollusion group strategy, and the EmperorAndHisClones group

strategy. The following subsections summarize the results, followed by two

sections commenting on real and presumed irregularities in some of the

results.

8.2.1. 2004 competition, league 1 (standard IPD rules, with

223 participating strategies)

• Our OTFT was the best non-group, individual strategy.

• Our Godfather strategy (called ADEPT in 2004) of our CosaNostra group

was the second best group strategy (with less than 10 members) after the

STAR group strategy of Gopal Ramchurn (with 112 members, though

we are not sure that all strategies colluded as one group). Note that

even badly performing group strategies can score arbitrarily higher than

individually better group strategies by sheer numerical superiority (see

below and Section 0). We also initially noted with one eyebrow raised

that 112 is exactly the smallest integer larger than 223 divided by 2, so the

STAR group members were just more than 50% of the total population.

However, we now believe that this might have been just a coincidence.

• Our EmperorAndHisClones group strategy was not allowed to fully compete but would have won by a large margin (it had more than 10,000 individually named clones, of which unfortunately only one was eventually allowed to participate); for payoff values see below. EMP scored as well as ADEPT as it was following the same recognition protocol.

• Our StealthCollusion group strategy (sent in by a virtual person Con-

stantin Ionescu and called LORD and PEON) participated as a proof of

the collusion concept, apparently without detection of the collusion by

the organizers, as further variants of members of the CosaNostra group

strategy. Constantin asked the organizers to clone his PEON strategy


as often as possible; however, only one copy was eventually allowed to

participate. Read more about Constantin later in Section 8.3.2.4.

Simple calculations show that a numerical advantage would have vastly

improved the results of our ADEPT and EmperorAndHisClones strate-

gies. In all the following calculations we neglect protocol losses among

group members as they insignificantly increase the numbers reported be-

low compared to the scores that would really have been achieved had the

competitions taken place as described. Table 8.1a shows the results of

the tournament with the number of clones actually allocated. Table 8.1b

shows the estimated results if 100 additional clones had been allowed for

our collusion strategy. Table 8.1c shows how 10,000 additional clones would

have influenced the results. These results were computed for an average of

200 turns per game. On the one hand, this gives the full temptation payoff value T to EMP/ADEPT from their CosaNostraHitmen, Peons, and clones of the CloneArmy, whereas EMP/ADEPT played OmegaTitForTat against all strategies outside our group and thus achieved the same result against these as if the very well performing OmegaTitForTat strategy had been used by itself. On the other hand, the CosaNostraHitmen, Peons, clones of the CloneArmy, and EMP/ADEPT always cooperated with their EMP/ADEPT bosses while permanently betraying all strategies outside our group, thus resulting in the full punishment payoff value P, or even the sucker's payoff value S, per turn both for themselves and for their opponents outside our group. Clearly, had our strategies been composed of as many members

as the STAR strategy or, even better, as many as we had submitted, it

very plausibly would have won by large factors (43% with additional 100

members, 800% with additional 10,000 members as we had submitted). We

can therefore plausibly conjecture, under the assumption that the STAR strategy had more than 100 strategies colluding with each other, that our group strategies would have been vastly more efficient than the winning STAR group strategy and would have won, had we been allowed to play as we had submitted our strategies. Allowing this was positively hinted at by one of the organisers, in a mail received from Graham Kendall on May 29, 2004, when we submitted our strategies; otherwise we would have inflated our stealth collusion strategies, since we had prepared a respectable number of virtual persons similar to Constantin Ionescu, as described in Section 8.3.2.4. Also note that a sufficiently large group of real people (e.g., one of us, Slany, has to teach 750 computer science students each year who in theory could all be enticed to participate) would have produced a similar effect.
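The following small Java sketch (our own illustration for this text, not code used in the tournament; the class and method names are hypothetical) recomputes the estimates of Tables 8.1b and 8.1c from the raw scores of Table 8.1a under exactly the assumptions stated above: an average of 200 turns per game, every additional clone yielding the temptation payoff T = 5 per turn to its Godfather, and strategies outside the group being held to the punishment payoff P = 1 per turn against each clone.

    final class CloneEstimate {
        static final int TURNS = 200;   // assumed average turns per game
        static final int T = 5;         // temptation payoff
        static final int P = 1;         // punishment payoff

        // Estimated score of a Godfather (ADEPT/EMP) with extra clones added.
        static long godfatherScore(long rawScore, int extraClones) {
            return rawScore + (long) extraClones * TURNS * T;   // +1,000 per clone
        }

        // Estimated score of a strategy outside the group, e.g., StarSN or GRIM.
        static long outsiderScore(long rawScore, int extraClones) {
            return rawScore + (long) extraClones * TURNS * P;   // +200 per clone
        }

        public static void main(String[] args) {
            System.out.println(godfatherScore(96291, 100));     // 196,291 (Table 8.1b)
            System.out.println(godfatherScore(96291, 10000));   // 10,096,291 (Table 8.1c)
            System.out.println(outsiderScore(117057, 100));     // 137,057 (Table 8.1b)
            System.out.println(outsiderScore(117057, 10000));   // 2,117,057 (Table 8.1c)
        }
    }

Running the sketch reproduces the adjusted scores shown for ADEPT and StarSN in the two tables below.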


Table 8.1a. Original tournament results.

Rank  Player                  Strategy                    Score
1     Gopal Ramchurn          StarSN (StarSN)             117,057
2     Gopal Ramchurn          StarS (StarS)               110,611
3     Gopal Ramchurn          StarSL (StarSL)             110,511
4     GRIM (GRIM Trigger) 1   GRIM (GRIM Trigger)         100,611
5     Wolfgang Kienreich      OTFT (Omega tit for tat)    100,604
6     Wolfgang Kienreich      ADEPT (ADEPT Strategy)       96,291
7     Emp 1                   EMP (Emperor)                95,927
8     Bingzhong Wang          (noname)                     94,161
9     Hannes Payer            Probbary                     94,123
10    Nanlin Jin              HCO (HCO)                    93,953

Table 8.1b. Tournament results with additional 100 clones.

Rank  Player                  Strategy                    Score
1     Wolfgang Kienreich      ADEPT (ADEPT Strategy)      196,291
2     Emp 1                   EMP (Emperor)               195,927
3     Gopal Ramchurn          StarSN (StarSN)             137,057
4     Gopal Ramchurn          StarS (StarS)               130,611
5     Gopal Ramchurn          StarSL (StarSL)             130,511
6     GRIM (GRIM Trigger) 1   GRIM (GRIM Trigger)         120,611
7     Wolfgang Kienreich      OTFT (Omega tit for tat)    120,604
8     Bingzhong Wang          (noname)                    114,161
9     Hannes Payer            Probbary                    114,123
10    Nanlin Jin              HCO (HCO)                   113,953

8.2.2. 2004 competition, league 2 (uncertainty IPD vari-

ant, same 223 participating strategies as in the first

league)

• OTFT was a very close 2nd.

• ADEPT and other Godfather variants ranked as the 2nd group strategy.

8.2.3. 2005 competition, league 1 (standard IPD rules, with

192 participating strategies)

• CosaNostra Godfather was the overall winner, with 20 CosaNostra Hitmen

participating in the CosaNostra group strategy.

• OTFT did not participate; it remains unclear why.


Table 8.1c. Tournament results with additional 10,000 clones.

Rank  Player                  Strategy                    Score
1     Wolfgang Kienreich      ADEPT (ADEPT Strategy)      10,096,291
2     Emp 1                   EMP (Emperor)               10,095,927
3     Gopal Ramchurn          StarSN (StarSN)              2,117,057
4     Gopal Ramchurn          StarS (StarS)                2,110,611
5     Gopal Ramchurn          StarSL (StarSL)              2,110,511
6     GRIM (GRIM Trigger) 1   GRIM (GRIM Trigger)          2,100,611
7     Wolfgang Kienreich      OTFT (Omega tit for tat)     2,100,604
8     Bingzhong Wang          (noname)                     2,094,161
9     Hannes Payer            Probbary                     2,094,123
10    Nanlin Jin              HCO (HCO)                    2,093,953

• Our StealthCollusion group strategy member LORD was placed 5th, the

collusion again apparently being undetected by the organizers.

8.2.4. 2005 competition, league 4 (standard IPD rules, but

only non-group, individual strategies were allowed to

participate; 50 participating strategies)

OTFT was a very close 2nd. Detailed analysis of results initially suggested

that the first placed strategy APavlov might have been a member of a stealth colluding group strategy; this later turned out most likely not to be true. However, our most likely mistaken analysis of some strategies

that seemed to be involved illustrates how difficult it can be to clearly

differentiate between stealth collusion strategies and strategies that only

appear to behave as colluding strategies, seemingly showing a cooperative

behaviour that in fact emerges randomly among strategies that actually

are not consciously cooperating with each other. A more detailed analysis

follows in the discussion below.

8.2.5. Analysis of OmegaTitForTat’s (OTFT) performance

In the following, we review the performance of our single player, individual

OTFT strategy in more detail. In the first league of the 2004 competi-

tion, which was intended to be a replay of the famous first iterated pris-

oner’s dilemma competition organized by Robert Axelrod in 1984 [Axelrod

(1984)], our OTFT strategy was arguably placed second together with the

default GRIM strategy out of a total of 223 participating strategies. Ac-

tually OTFT was placed third after the GRIM strategy, GRIM leading

by a mere 0.007%. However, this lead was later seriously put into


question by the fact that GRIM on average had played 0.92% more games

than OTFT in the tournament, as pointed out by Abraham Heifets in an

email sent to the organizers on March 29, 2005, which the organizers kindly

forwarded to us. More rounds obviously add to the score so this difference

was significant. When results are scaled to reflect the difference, OTFT

would have been placed as the first non-group strategy before GRIM, with

an estimated payoff of 101,530 points compared to the 100,611 of GRIM.

OTFT and GRIM were clearly outperformed only by a winning strategy

being member of the same stealth colluding group of strategies sent in by

Gopal Ramchurn.
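As a back-of-the-envelope check of that estimate (our own arithmetic, not from the tournament report), scaling OTFT's raw score by the 0.92% difference in the number of games played gives

\[ 100{,}604 \times 1.0092 \approx 101{,}530 > 100{,}611, \]

which is why OTFT would edge ahead of GRIM once the difference is corrected for.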

In the following we will refer to Ramchurn’s group as the STAR group

strategy. More on group strategies against individual strategies will fol-

low in Section 0. Let us just remark here that we will show in Section 0

that group strategies can perform arbitrarily better than non-group, single-

player strategies. This basically means that OTFT was the best single-

player strategy. Moreover, the good results of GRIM are very likely due to

the tournament having been dominated by the STAR group strategy, with

its individual group members accounting for more than 50% of the partic-

ipating strategies. GRIM scores best against STAR group members that

always defect against members outside their group, the purpose being to

damage competing strategies by always defecting (ALLD), because GRIM

has a very short (one turn) interval of determination before it switches

to ALLD itself. OTFT loses some points in comparison because of inter-

spaced recovery trials during which OTFT cooperates instead of continuing

to defect. However, in Section 0 we show that, with and without a high

percentage of ALLD strategies OTFT is robustly superior to GRIM.

In the second league of the 2004 competition, which was the league with

a small probability of erroneous interpretation of the other player’s last

move, OTFT was placed as the second best non-group, individual strategy,

placed third after three members of Ramchurn’s STAR group and an indi-

vidual strategy sent in by Colm O’Riordan.c GRIM again ranked high but

was slightly outperformed by OTFT, a result that was to be expected in the

slightly randomized setting of this league. Miscommunication does happen

in the real world, so this illustrates again that in a non-perfect environment

an optimistic strategy like OTFT fares better than one with a pessimistic

world-view such as GRIM. It also shows that OTFT was again among the

cOne of our reviewers learned from O'Riordan that this strategy is actually very similar

to OTFT.


best single-player strategies, now also in an environment in which miscom-

munication happens inherently.

For reasons that remain unclear to the authors, OTFT was not allowed

to participate in the first and second leagues in the 2005 competition.

However, OTFT achieved a second place in league number four in the

2005 competition, which was the league allowing participation of only one

strategy by each team, thereby supposedly eliminating the participation of

group strategies. Winner was the strategy APavlov sent in by Jia-Wei Li,

outperforming our second placed OTFT by 1.2%.

8.2.6. The practical difficulty of detecting collusion

The small margin by which APavlov outperformed OTFT caused us to take

a very close look at the tournament results of the single-player league. We

first note that in the general results, there were strategies present which

achieved a lower score than ALLC (always cooperates), RAND (randomly

cooperates or defects), NEG (always plays the opposite from what the op-

ponent played last, first move is random) and the other standard strategies

usually ranking lowest in tournaments with only single-player strategies

present. These scores are shown in Table 8.2.

It takes quite an amount of ingenuity to achieve scores as low as the last

three candidates. Each one scored even lower than standard RAND and

NEG, and all the scores are within an interval below the variance introduced

by the RAND strategy. We initially suspected that the last three strategies

represented part of a collusion strategy somebody tried to introduce into

Table 8.2. Strategies having the lowest score in

2005’s league 4.

Rank Player Strategy Score

39 (Standard) ALLC 22,182

40 Oscar Alonso IBA 22,054

41 Oliver Jackson OJ 21,694

42 Bin Xiang A1 19,586

43 Quek Han Yang SPILA 19,518

44 (Standard) ALLD 18,764

45 Kaname Narukawa (noname) 18,592

46 (Standard) RAND 18,153

47 (Standard) NEG 17,176

48 Bernat Ricardo ALT 16,934

49 Yusuke Nojima (noname) 16,383

50 Yannis Aikater TCO3 16,228


Table 8.3. Collusion suspects: TCO3 and ALT cooperating with Apav.

TCO3 C D D C C D D C C C C C C...

ALT C D D C C D D C C C C C C...

APav C C D D C C D D D D D D D...

Table 8.4. Collusion suspects: TCO3 and ALT cooperating with OTFT.

TCO3 C D D C C D D C C D D C C...

ALT C D D C C D D C C D D C C...

OTFT C C D D C C D D C C D D D...

Table 8.5. Collusion suspect: TCO3 showing TFT a cold shoulder.

TCO3 C D D C C D D C C D D C...

TFT C C D D C C D D C C D D...

the single player league and therefore took a closer look at their style of

play with respect to standard strategies and to player strategies, including the winning strategy APavlov and our OTFT strategy.

Analysis of two suspect strategies looked very much as if they cooperated

with the winning APavlov strategy (compare Table 8.3) but also with our

OTFT strategy (compare Table 8.4), raising their score by cooperating in

the face of continuous defection. On the other hand, the suspect strategies

did not exhibit this kind of cooperative behaviour against defection by

standard strategies (compare Table 8.5).

Obviously, a trigger sequence of moves similar to the protocol exchange

employed by our CosaNostra strategy (see Section 8.3.2) caused the switch to an

exploitable ALLC behaviour in the strategies analysed above.

Now, we cannot speak for the authors of APavlov, but we swear on our

honour and solemnly declare that we did not consciously implement collu-

sion features into OTFT, nor did we introduce any of the suspect strategies

above ourselves. Both OTFT and APavlov, if its name is any indicator of

the type of algorithm used, are strategies that try to correct for occasional

mistakes. Such strategies have generally been known to outperform Tit-

ForTat (see, for example, [Nowak and Sigmund (1993)]) and rank highly in

single player tournaments. In this case, the correction algorithm in both

dOne reviewer suggested that swearing on our honour and solemnly declaring this would

not be necessary. However, since this chapter involves so many aspects of stealth collu-

sion, we felt it would help make sure that readers would trust us that OTFT was not

involved in any collusion.


strategies obviously triggered the exploitable behaviour in the collusion

suspects, effectively “taking over someone else’s hitman” in the terminology

of our CosaNostra collusion strategy (compare Section 8.3.2.1).

We conclude that in the presence of strategies which exhibit exploitable

behaviour based on very simple trigger mechanisms, collusion as a concept

is essentially undetectable. It is not possible to denounce a strategy for us-

ing collusion if the behaviour triggering the collusion is entirely reasonable

in the context of standard strategies playing to win. In the case of IPD competitions in which cooperation and defection can be expressed gradually, that is, when multiple payoff levels and multiple choices exist, as in league 3 of the two competitions of 2004 and 2005, this cooperation can be hidden with even more subtlety. In Section 8.4 we will show that deciding in general whether a set of strategies is involved in a collusion group is among the most difficult questions that can theoretically arise.

8.3. Details of Our Strategies

8.3.1. OmegaTitForTat, or Mr. Nice Guy meets the iterated

prisoner’s dilemma

The OmegaTitForTat (OTFT) strategy is based on heuristics targeting

several tournament situations which have been identified, by tests and sta-

tistical analysis, as being both common and damaging to conventional

strategies for the IPD. In a tournament environment, certain types of strat-

egy behaviour are very common both in standard strategies added to get

a performance comparison base as well as in custom strategies designed

to dominate. Several such types of behaviour have been identified, and

solutions to optimize the interaction with them have been implemented in

OTFT. Let us note that, while we constructed OTFT from scratch, similar

forgiving strategies have been described in the literature, see, for exam-

ple, [Nowak and Sigmund (1993); Beaufils, Delahaye, and Mathieu (1996);

Tzafestas (2000); O’Riordan (2000)].

8.3.1.1. Suspicion

A common trait of many strategies, including the SuspiciousTitForTat

(STFT) strategy from the standard set of strategies used in the tournament,

is suspicion: The strategy starts by playing defect, or plays defect after a

succession of mutual cooperation. Such a move can prove beneficial for a

strategy if the opponent strategy does not immediately counter a defection;


Table 8.6. Deadlock between TFT and STFT.

TFT   C D C D C D C D ...
STFT  D C D C D C D C ...

for example, TFTT (TitForTwoTat) would not react to occasional, singular

defections, thus giving a suspicious strategy a clear advantage. Note that

suspicious strategies do not need to keep defecting after an initial defect:

The STFT strategy, for example, simply plays standard TFT but starts

each game with a defection.

The problem many strategies encounter when facing suspicion is that of

deadlock: If a strategy is programmed to counter defection in a TitForTat

manner, and the suspicious strategy itself is programmed the same way,

one suspicious defection can cause a mutual exchange of defects between

two strategies which could cooperate perfectly if only one player would

once forgive a defection. In general, we define deadlock as any situation

where a succession of defects is being played by two strategies because of

an out-of-phase TitForTat behaviour, as shown in Table 8.6.
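For illustration, the following minimal simulation (our own, not part of any submitted strategy; the class name is hypothetical) reproduces the alternating pattern of Table 8.6; with the standard payoffs the two locked-in players average only (T + S)/2 = 2.5 per turn instead of the R = 3 they could earn through mutual cooperation.

    public final class DeadlockDemo {
        public static void main(String[] args) {
            char tft = 'C', stft = 'D';        // TFT opens with C, STFT opens with D
            for (int turn = 0; turn < 8; turn++) {
                System.out.println("turn " + turn + ": TFT plays " + tft + ", STFT plays " + stft);
                char nextTft = stft;           // TFT copies STFT's last move
                char nextStft = tft;           // STFT thereafter copies TFT's last move
                tft = nextTft;
                stft = nextStft;
            }
        }
    }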

OTFT counters deadlocks by forgiving a certain number of defections

when a strategy has cooperated for a long time. OTFT starts by cooper-

ating and then tracks the number of cooperations encountered. The initial

idea was that for a certain amount of cooperation, a certain number of de-

fections would be forgivable. The final OTFT algorithm incorporates this

idea, together with other adaptations, into a single strategy as described

below.

8.3.1.2. Randomness

Randomness, in the form of cooperative and defective moves varying with-

out any discernible pattern, can be introduced by simulated noise in the

command transmission, as used in several specific tournament environ-

ments, or it can be a trait of a strategy as such. Strategies trying to gain

by finding a cooperative base with an opponent are faced with a difficult

problem when the opponent is acting erratically: Finding a cooperative

base requires some small sacrifice (for example, STFT and TFTT, in con-

trast to TFT, can cooperate for the whole game because TFTT sacrifices

the initial defection). However a random strategy is highly likely to not

stick to a cooperative behaviour, resulting in the sacrifice cost mounting

and damaging the score of an otherwise successful, cooperative strategy.


As a consequence, randomness must be detected in an opponent’s be-

haviour, and countered appropriately: by playing ALLD (full defection).

There is no way to gain from mutual cooperation if an opponent plays

completely random. Nevertheless, a strategy can at least deny such an op-

ponent gains by playing defection itself, and moreover, thereby profit from

defecting on any unrelated cooperative moves from the random strategy.
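A rough expected-value argument (our own arithmetic, with the standard payoffs) shows why: against an opponent that cooperates with probability 1/2,

\[ E[\text{ALLD}] = \tfrac{T+P}{2} = 3 \quad\text{versus}\quad E[\text{ALLC}] = \tfrac{R+S}{2} = 1.5 \text{ per turn}, \]

while the random opponent itself is held to (S + P)/2 = 0.5 per turn against ALLD instead of the (R + T)/2 = 4 it would collect against ALLC.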

OTFT counters randomness by playing ALLD when a strategy has exhibited a certain amount of random behaviour. The initial idea was to cut

losses against the standard RAND strategy. However, in the final OTFT

algorithm, the random detection routine was merged with other traits into

a single strategy described below.

8.3.1.3. Exploits

Many strategies can be devised that try to exploit forgiving behaviour. For

example, a simple strategy could be designed to check once if it is playing

against any type of TFTT opponent, who forgives one defection “for free”,

and to exploit such behaviour. Table 8.7 shows the result of such an exploit

strategy at work on TFTT.
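The arithmetic behind the exploit (our own, with the standard payoffs): over each three-round D-D-C cycle of Table 8.7 the exploiter collects

\[ \frac{2T + S}{3} = \frac{10}{3} \approx 3.3 > R = 3, \qquad \text{while TFTT averages only } \frac{2S + T}{3} = \frac{5}{3}. \]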

Fully countering such exploits leads to a strategy similar to PAV: Con-

stant checks would ensure that an opponent does not gain more from the

current play mode than oneself. When devising a scheme to implement

such checks, a solution was found which incorporates the above mentioned

problems of randomness and suspicion. The result is the final version of

the OTFT algorithm.

8.3.1.4. OTFT

The OTFT algorithm starts by playing C, then TFT. It then maintains a

variable noting the behaviour of the opponent according to typical situa-

tions as described above: For every time the opponent’s move differs from

the opponent's previous move, and for every time the opponent's move dif-

fers from OTFT’s previous move, the variable is increased. For every time

the opponent cooperated with OTFT, the variable is decreased. These rules

allow tracking of randomness and exploits: Based on mutual cooperation

Table 8.7. A strategy exploiting TFTT.

EXPL  D D C D D C D D C D D ...
TFTT  C C D C C D C C D C C ...


as the mutually most beneficial case, each change of move of the opponent

indicates some kind of either randomness, or of a try of exploitation of

the TFT behaviour used by OTFT. When the so-called exploit tracker in

OTFT reaches a certain value, the algorithm switches to all-out defection

ALLD to cut losses against an opponent repeatedly breaking cooperation.

A second mechanism is at work and allows recovery from deadlocks

as described above. When OTFT plays standard TFT, it is vulnerable

to deadlock, so independently of the exploit tracker described, a second

variable counts the number of times the opponent’s move was the opposite

of OTFT’s move. If this so-called deadlock tracker encounters a certain

number of exchanges of C and D, an additional C is played and the deadlock

counter is reset. As a consequence, OTFT is able to recover from deadlocks

occurring anywhere in a given exchange of moves.

8.3.1.5. Examples

Table 8.8 demonstrates how the desired avoidance of deadlocks is achieved

in a game played by OTFT versus STFT.

8.3.1.6. OTFT’s behaviour laid bare

In the end, there is no more detailed and exact description of OTFT’s inner

workings than the source code of its implementation. Luckily, the code is

short and easy to understand. We therefore reproduce it in Table 8.10,

leaving aside only the general parts required for the IPDLX framework

that was used in the competitions.e

Table 8.8. Deadlock resolved by OTFT.

OTFT C D C D C C C C C...

STFT D C D C D C C C C...

Table 8.9 shows how OTFT counters random strategies with all-out

defection after a certain amount of random behaviour has been detected.

Table 8.9. Random recognized and countered by OTFT.

OTFT C C D C D C C D C C C D D D D...

RAND C D C D D D C C C D D C D C Cs&Ds...

eFor details of IPDLX see http://www.prisoners-dilemma.com/competition.html#java


Table 8.10. Main parts of OTFT’s source code.

private static final int DEADLOCK_THRESHOLD = 3;
private static final int RANDOMNESS_THRESHOLD = 8;

public void reset() {
    super.reset();
    deadlockCounter = 0;
    randomnessMeasure = 0;
    opponentMove = COOPERATE;
    opponentsPreviousMove = COOPERATE;
    myPreviousMove = COOPERATE;
}

public double getMove() {
    if( deadlockCounter >= DEADLOCK_THRESHOLD ) {
        // OTFT assumes a deadlock and tries to break it by cooperating ...
        myReply = COOPERATE;                                   // ... twice ...
        if( deadlockCounter == DEADLOCK_THRESHOLD )
            deadlockCounter = DEADLOCK_THRESHOLD + 1;
        else // ... and then assumes the deadlock has been broken
            deadlockCounter = 0;
    } else { // OTFT assumes that there is no deadlock (yet)
        // OTFT assesses the randomness of the opponent's behaviour
        if( opponentMove == COOPERATE && opponentsPreviousMove == COOPERATE )
            randomnessMeasure--;
        if( opponentMove != opponentsPreviousMove ) randomnessMeasure++;
        if( opponentMove != myPreviousMove ) randomnessMeasure++;
        if( randomnessMeasure >= RANDOMNESS_THRESHOLD ) {
            // OTFT switches to ALLD (randomnessMeasure can only increase)
            myReply = DEFECT;
        } else { // OTFT assumes the opponent is not (yet) behaving randomly
            // OTFT behaves like TFT ...
            myReply = opponentMove;
            // ... but checks whether a deadlock situation seems to arise
            if( opponentMove != opponentsPreviousMove )
                deadlockCounter++;
            else // OTFT recognizes that there is no sign of a deadlock
                deadlockCounter = 0;
        }
    }
    // OTFT memorizes the current moves for the next round
    opponentsPreviousMove = opponentMove;
    myPreviousMove = myReply;
    return( super.getFinalMove(myReply) );
}


8.3.2. Our group strategies

8.3.2.1. The CosaNostra group strategy, or Organized crime meets

the iterated prisoner’s dilemma

The CosaNostra strategy is based on the concept of one strategy, denoted

Godfather, exploiting another strategy, denoted Hitman, to achieve a higher

total score in an IPD tournament scenario. In this context, exploitation

denotes the ability to deliberately extract cooperative moves from a strat-

egy while playing defect, a situation yielding high payoff for the exploiting

strategy. It is obvious that most opponents would avoid such a situation,

stopping to cooperate with an opponent who repeatedly played defection

in the past. Hence, a special opponent strategy, the Hitman, is designed to

provide this kind of behaviour, and is introduced into the tournament in as

large a number as possible.

A Hitman strategy which indiscriminatingly plays cooperation, however,

is of no use for a Godfather. In mimicking the ALLC standard strategy,

such a Hitman would be beneficial for all other strategies in a tournament

able to recognize and exploit ALLC. Consequentially, the Hitman must be

able to conditionally exhibit two types of behaviour:

• By default, Hitman must play a strategy which does not benefit other

strategies, which is not easily exploitable. Extending the idea, Hitman

should play a strategy most damaging to other strategies to lower their

score. Such a strategy is simple ALLD.

• When confronted with a certain stimulus, Hitman must switch to the

cooperative behaviour defined above.

Complementing the Hitman, Godfather should by default play the best

standard strategy available against any non-Hitman and switch to ALLD

when it encounters a Hitman, relying on the Hitman’s unconditional coop-

eration to raise its score. In our case, the Godfather plays OTFT when not

playing against a Hitman.

The critical part of CosaNostra is the identification of opponents, the

way in which Godfather detects a Hitman, and a Hitman detects a God-

father. We have employed sequences of Defections and Cooperations to

implement a bit-wise protocol which both sides use to mutually establish,

and check, identities (in case of multiple choices and multiple payoffs, this

protocol could be made very short, depending on the number of choices,

possibly to one exchange). If Godfather is aware he is not facing a Hitman,


he must switch to a good non-group strategy like OTFT or GRIM, and if

Hitman is aware it is not facing a Godfather, he must switch to the ALLD

strategy strafing all strategies that are not in their group. This occurs in

the following cases:

• “Unhonorable behaviour”: A presumed Hitman defecting or a presumed

Godfather cooperating outside protocol exchanges

• “Protocol breach”: Both not following the rules during protocol ex-

changes

Putting the rules in other words, the CosaNostra strategy is based on a

Godfather which can be sure that the next n moves of its opponent will be

cooperation, because it identifies the opponent through a simple exchange

protocol. A problematic aspect of such a strategy is the notion of Godfather

or Hitman being “taken over”: Both are prone to wrongly identify an op-

ponent as their strategic counterpart and grant it an advantage (in the case

of Hitman) or depend on predefined behaviour (in the case of Godfather)

and thus lower their score.

The effect if Godfather is taken over: Godfather thinks it is exploiting a Hitman and plays DEFECT, but the opponent plays DEFECT, too, so Godfather only gets the low punishment payoff for the exchange. This situation is easy

to counter: If Godfather detects any defects when it believes it is exploiting

a Hitman, it assumes takeover and switches to its good non-group strategy

like OTFT or GRIM.

The effect of a Hitman being taken over is more subtle: Hitman thinks

he is being exploited by Godfather and plays COOP, a behaviour which

benefits the opponent. Countering this situation is complex: A first solu-

tion would be for Hitman to start playing ALLD as soon as it detects a

cooperative move outside the defined protocol exchanges (Hitman assumes

to be exploited). But another strategy could still play mostly DEFECT

and sometimes cooperate, thus fooling a Hitman: For example, a random

opponent strategy with 1/10 of all its moves being cooperative could by

chance emulate a protocol exchange which takes place when an interval of

fixed length ten is used by Hitman (and Godfather), at least for some time.

CosaNostra solves the takeover problem by varying intervals of

cooperation-protocol exchange, with the time between exchanges (the num-

ber of turns) in one interval being communicated within the protocol ex-

change. Godfather and Hitman both have an internal counter which tells

them when to synchronize by executing a protocol exchange, and check for


the other strategy truly being part of CosaNostra. Godfather communi-

cates to the Hitman a modification to the interval during each handshake.

Thus, no other strategy is likely to take over a Hitman or manipulate a

Godfather.

The communication protocol contains a 1-bit signature plus a 2-bit sequence coding the length of the next interval, as depicted in Table 8.11

(the numbers at the beginning of the lines are countdown steps until the

start of the next interval).

A sample exchange will then look as illustrated in Table 8.12:

In this example an offset of 2 (CD = 01, binary = 2) is encoded. In-

ternally, the offset is used to select an interval length from a table roughly

adjusted to match the number of iterations in the tournament, in our im-

plementation as follows: DD → 26, DC → 15, CD → 19, and CC → 8.

Thus, after sending DD, Godfather will defect 23 times, then again start-

ing the hand shake with a single COOPERATE followed by the next two

bits encoding the length of the following interval, and so on. On his side,

the Hitman will expect 23 DEFECTs and then one COOPERATE from

Table 8.11. CosaNostra communication protocol.

Countdown index 3:
    Godfather plays C    (a single signature COOPERATE)
    Hitman plays D       (a single signature DEFECT)

Countdown index 2:
    Godfather plays D/C  (first message bit)
    Hitman plays C       (COOPERATEs to minimize protocol loss)

Countdown index 1:
    Godfather plays D/C  (second message bit)
    Hitman plays C       (stores first message bit, COOPERATEs to minimize protocol loss)

Countdown restart (e.g., 26):
    Godfather plays D    (starts new interval and plays DEFECT from now on until the next handshake, when the countdown index is 3 again)
    Hitman plays C       (stores second message bit, starts new interval, plays COOPERATE from now on until the next handshake, when the countdown index is 3 again)


Table 8.12. Synchronization between Godfather and Hitman.

#           ...  6  5  4  3  2  1  19  ...
Godfather    D   D  D  D  C  C  D   D   D ...
Hitman       C   C  C  C  D  C  C   C   C ...

his opponent to continue to believe the opponent is a Godfather and thus

allowed to exploit the Hitman.
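For clarity, the two-bit lookup described above can be written as a small helper (our own paraphrase of the behaviour, with a hypothetical class and method name; the submitted code in Tables 8.13 and 8.14 realises the same mapping through the offsets 9, 7, and 11 followed by one countdown step):

    final class IntervalLookup {
        // Maps the Godfather's two protocol bits to the length of the next
        // all-defect interval: DD -> 26, DC -> 15, CD -> 19, CC -> 8.
        static int nextIntervalLength(char firstBit, char secondBit) {
            int length = 8;                       // CC case
            if (firstBit == 'D')  length += 7;
            if (secondBit == 'D') length += 11;
            return length;
        }
    }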

Note that it would be easy to make the communication protocol more

efficient by increasing the length of each new interval according to a func-

tion known only to the Hitman and his Godfather, e.g., by adding a number

proportional to the number of intervals completed so far. Indeed, the like-

lihood that a non-Godfather strategy by coincidence can continually fool

a Hitman into believing he is serving his Godfather while instead allow-

ing the non-Godfather to take advantage of the Hitman, is decreasing very

quickly with each successful exchange. Conversely, the longer the opponent

of Hitman is following the Godfather’s behaviour, the more likely it is that

the opponent really is his Godfather, and so it becomes safer and safer for

the Hitman to let the opponent abuse him for longer and longer interval

lengths.

The bootstrap for the two strategies is that the Hitman starts with a

defection and the Godfather with cooperation, mimicking step 3 as shown

above. The initial cooperation move is important for Godfather's stan-

dard strategy: To achieve a good score against certain standard opponents

(GRIM being an extreme example), it is necessary to start off with Coop-

eration.

Godfather’s protocol loss per interval is at a minimum 5 points (the

single protocol cooperation), at a maximum 9 for the Godfather: A base

loss of 5 for the single protocol bit is inevitable. Then, at worst, Godfather

sends CC, the Hitman cooperates to minimize loss, yielding 3 + 3 = 6

instead of 5 + 5 = 10 in the best case where Godfather sends two defections

as protocol bits.
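In other words (our own restatement of the bound with the standard payoffs), per handshake the Godfather loses

\[ T - S = 5 \text{ points at best, and } (T - S) + 2(T - R) = 5 + 4 = 9 \text{ points at worst.} \]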

The CosaNostra group strategies have not been designed to fare well in

a noisy environment as in league 2 of the 2004 competition, though they

in practice did quite well (see Section 8.2.2). Note that it would not be very

difficult to make them more noise resistant by introducing some error cor-

recting mechanism such as, e.g., allowing a certain number of mistakes (or

unexpected replies but explainable as answers to possibly wrongly commu-

nicated signals from oneself) of the other player until deciding that he is

not part of one’s group.


Table 8.13. Main parts of CosaNostra Godfather’s source code.

>> private variables and constants like in Table 8.10 <<

private static final int SYNC_GF_COOPERATES = 3;
private static final int SYNC_HM_REPLIES_WITH_DEFECT = 2;
private static final int GF_SENDS_FIRST_MESSAGE_BIT = 2;
// private static final int GF_SENDS_SECOND_MESSAGE_BIT = 1;
private int nextCountdownRestartValue;

public void reset() {
    >> Content of OTFT's reset() method from Table 8.10 <<
    countdownIndex = SYNC_GF_COOPERATES;             // First COOPERATE
    opponentPlayedSoFarLikeHitman = true;
}

public double getMove() {
    if( opponentPlayedSoFarLikeHitman ) {
        // Did the opponent just break the Hitman behaviour pattern?
        if( ( countdownIndex == SYNC_HM_REPLIES_WITH_DEFECT
                && opponentMove == COOPERATE )
            || ( countdownIndex != SYNC_HM_REPLIES_WITH_DEFECT
                && opponentMove == DEFECT ) ) {
            // Yes, so the opponent cannot be a Hitman, so Godfather ...
            myReply = DEFECT;                        // ... defects and switches ...
            opponentPlayedSoFarLikeHitman = false;   // ... to OTFT
        } else { // No, the opponent again played like a Hitman.
            if( countdownIndex > SYNC_GF_COOPERATES ) {
                myReply = DEFECT;                    // Godfather thus exploits Hitman
            } else if( countdownIndex == SYNC_GF_COOPERATES ) {
                myReply = COOPERATE;                 // COOPERATE once to synchronize
                nextCountdownRestartValue = 9;       // GF starts to prepare
            } else if( countdownIndex == GF_SENDS_FIRST_MESSAGE_BIT ) {
                myReply = (Math.random() > 0.5) ? DEFECT : COOPERATE;
                nextCountdownRestartValue += (myReply == DEFECT) ? 7 : 0;
            } else { // if( countdownIndex == GF_SENDS_SECOND_MESSAGE_BIT )
                myReply = (Math.random() > 0.5) ? DEFECT : COOPERATE;
                nextCountdownRestartValue += (myReply == DEFECT) ? 11 : 0;
                countdownIndex = nextCountdownRestartValue;      // restart
            }
            countdownIndex--;
        }
    } else {
        // Opponent surely is no Hitman and thus Godfather plays OTFT
        >> Content of OTFT's getMove() method from Table 8.10 <<
    }
}


Table 8.14. Main parts of CosaNostra Hitman’s source code.

private static final int SYNC_HM_DEFECTS = 3;
private static final int SYNC_GF_REPLIES_WITH_COOPERATE = 2;
private static final int FIRST_MESSAGE_BIT_FROM_GF = 1;
private static final int SECOND_MESSAGE_BIT_FROM_GF = 0;
private int nextCountdownRestartValue;

public void reset() {
    super.reset();
    opponentPlayedSoFarLikeGodfather = true;     // Assume the best
    opponentMove = DEFECT;                       // As a Godfather would have been doing
    countdownIndex = SYNC_HM_DEFECTS;            // First DEFECT to synchronize
}

public double getMove() {
    if( opponentPlayedSoFarLikeGodfather ) {
        // Did the opponent just break the Godfather behaviour pattern?
        if( ( countdownIndex == SYNC_GF_REPLIES_WITH_COOPERATE
                && opponentMove == DEFECT )
            || ( countdownIndex > SYNC_GF_REPLIES_WITH_COOPERATE
                && opponentMove == COOPERATE ) ) {
            // Yes, so the opponent cannot be a Godfather, so Hitman ...
            myReply = DEFECT;                            // ... defects and switches ...
            opponentPlayedSoFarLikeGodfather = false;    // ... to ALLD
        } else { // No, the opponent again played like a Godfather.
            if( countdownIndex != SYNC_HM_DEFECTS ) {
                myReply = COOPERATE;                     // Godfather thus can exploit Hitman
                if( countdownIndex == FIRST_MESSAGE_BIT_FROM_GF ) {
                    nextCountdownRestartValue += (opponentMove == DEFECT) ? 7 : 0;
                } else if( countdownIndex == SECOND_MESSAGE_BIT_FROM_GF ) {
                    nextCountdownRestartValue += (opponentMove == DEFECT) ? 11 : 0;
                    countdownIndex = nextCountdownRestartValue - 1;   // restart
                }
            } else { // if( countdownIndex == SYNC_HM_DEFECTS )
                myReply = DEFECT;                        // Hitman DEFECTs once to synchronize
                nextCountdownRestartValue = 9;           // HM starts to prepare
            }
            countdownIndex--;
        }
    } else {
        // Opponent surely is no Godfather and thus Hitman ...
        myReply = DEFECT;                                // ... plays ALLD
    }
    return( super.getFinalMove(myReply) );
}


8.3.2.2. The gory details of the CosaNostra group strategy

As in OTFT’s case, there is no more detailed and exact description of the

CosaNostra group strategy’s inner workings than the source code of its im-

plementation. Again, the code is short and easy to understand. We there-

fore reproduce it in Tables 8.13 for the Godfather and 8.14 for the Hitman

strategy, again leaving aside only the general parts required for the IPDLX

framework that was used in the competitions5. As Godfather uses the

OTFT strategy against strategies other than Hitman, the part of the code

of Godfather that is identical to the one of OTFT in Table 8.10 is not

repeated but referred to.
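Since the countdown and message-bit mechanism is documented only through the code, the following fragment sketches our reading of how both parties keep their counters aligned. It is an illustration added here, not additional code from the original submission, and the helper name is ours.

// Hypothetical helper illustrating one reading of Tables 8.13 and 8.14:
// at the two "message bit" positions of each countdown cycle the Godfather
// plays a random move; both sides add 7 and/or 11 to the base restart value
// of 9 whenever the corresponding move was a DEFECT, so Godfather and Hitman
// restart their countdowns with the same value (9, 16, 20 or 27), up to the
// fixed one-step offset between their synchronization positions.
final class CountdownSketch {
  static int nextCountdownRestartValue(boolean firstMessageBitWasDefect,
                                       boolean secondMessageBitWasDefect) {
    int restart = 9;                                 // set when a new cycle is prepared
    if (firstMessageBitWasDefect)  restart += 7;     // first message bit
    if (secondMessageBitWasDefect) restart += 11;    // second message bit
    return restart;                                  // 9, 16, 20 or 27
  }
}

Because the restart value is re-randomised in every cycle, a strategy that merely replays a fixed opening handshake cannot keep passing as a Hitman: both partners continually re-verify each other, while the Godfather defects against the Hitman in all remaining rounds of each cycle.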

8.3.2.3. TheEmperorAndHisCloneWarriors

This group strategy is based on the same principles as the CosaNostra group

strategy, with one emperor playing the role of the Godfather, and his clone

warriors playing the Hitman strategy in large numbers (the number being

the major difference), each clone strategy having an individual number in

its name since it was required in the submission procedure to the competi-

tion to give each individual strategy a different name. We had trusted the

organizers after enquiring via email that open group strategies would be

allowed in the 2004 competition and accordingly had submitted the Em-

perorAndHisClones strategy with altogether 11,110 individually numbered

clones as one group strategy, as it was not clear how large groups would be

permitted to be. For reasons that, especially in hindsight, are not entirely clear to us, the organizers decided to let altogether only one clone (together with the emperor) participate in the competitions. We are still perplexed by this decision. In particular, we had initially been prepared to submit a much larger collusion group within the CosaNostra group strategy but — after hearing that groups would be allowed — decided to submit only one such collusion strategy as a proof of concept, counting on our clone army to overwhelm all competitors.

8.3.2.4. The StealthCollusion group strategy

As a proof of concept (see previous section), we submitted under the name

of Constantin Ionescu a group strategy that cooperates with our CosaNos-

tra group strategy, though not perfectly so. The mail with which we submitted the strategy was deliberately written with some typos, a few grammatical glitches, and sloppy formatting, all in order to add to the apparent authenticity of the submission and to distract from its real intention. It was sent from a free mail account hosted in Romania, the sender claiming to be a student of informatica from the technical school of Timisoara. As expected, the deception went undetected.

8.4. Analysis of the Performance of the Strategies

8.4.1. OmegaTitForTat

Table 8.15 shows how OTFT clearly dominates a standard tournament with

strategies commonly used as test cases. Table 8.16 illustrates how OTFT

dominates in harsh environments where a lot of unconditional defection

occurs. Table 8.17 demonstrates OTFT’s dominance in random environ-

ments. The slight lead of GRIM in league 4 of the 2005 competition was due

to the higher number of games GRIM was allowed to play as we explained

already in Section 0.

8.4.2. Group strategies

In this section we study general characteristics of important possible group

strategies. We first classify and name group strategy classes as follows:

• Democracy during peace (DP): All group members are equals and treat

each other nicely by always cooperating, and play TFT or a better strat-

egy such as OTFT or GRIM outside of their community.

• Democracy at war (DW): All group members are equals and treat each

other nicely, however they continually defect (ALLD) against all other

strategies (after a short recognition interval).

Table 8.15. OTFT in a standard environment, standard strategy sample, 200 turns.

Rank  Strategy  Score
1     OTFT      5,978
2     GRIM      5,538
3     TFT       5,180
4     TFTT      5,134
5     ALLC      4,515
6     RAND      4,062
7     STFT      4,018
8     ALLD      4,016
9     NEG       3,726


Table 8.16. OTFT in a harsh environment, 50% ALLD opponents, 200 turns.

Rank  Strategy  Score
1     OTFT      7,358
2     GRIM      6,959
3     TFT       6,577
4     TFTT      6,524
5     ALLD      5,512
6     ALLD      5,464
7     ALLD      5,452
8     ALLD      5,428
9     ALLD      5,428
10    ALLD      5,416
11    STFT      5,415
12    ALLD      5,404
13    ALLD      5,400
14    RAND      4,658
15    ALLC      4,530
16    NEG       3,728

Table 8.17. OTFT in a random environment with 50% RAND opponents, 200 turns.

Rank  Strategy  Score
1     OTFT      10,114
2     GRIM      9,867
3     TFT       8,338
4     ALLD      8,236
5     TFTT      7,806
6     RAND      7,357
7     RAND      7,212
8     RAND      7,195
9     STFT      7,192
10    RAND      7,150
11    RAND      7,150
12    RAND      7,099
13    RAND      7,099
14    RAND      7,082
15    NEG       6,947
16    ALLC      6,624

• Empire during peace (EP): There is one special group member, the em-

peror, which is allowed to take advantage of all other members of his

empire by playing defect while they cooperate with him. The subjects

otherwise cooperate among each other, and play TFT or a better strategy


such as OTFT or GRIM outside their community, after a short recogni-

tion interval.

• Empire at war (EW): Again, the emperor is allowed to take advantage of

all other members of his empire by playing defect while they cooperate

with him. Again, the subjects otherwise cooperate among each other,

but now they play, after a short recognition interval, ALLD against all

other strategies.

In the following, we will show that groups can perform arbitrarily better than individual strategies, that, for equal group sizes, EW groups can achieve arbitrarily higher payoffs (for the emperor) than EP groups, that EP groups can achieve arbitrarily higher payoffs (for the emperor) than members of a DP group, and that these in turn can achieve arbitrarily higher payoffs than members of a DW group. When group sizes vary, we show that even the weak DW group members can achieve arbitrarily higher payoffs than the emperor of a competing EW group by sheer numerical superiority.

First some preliminaries: We know that the payoff values satisfy the relations S < P < R < T and 2R > T + S of Section 0. Let us assume in the following that the group in the democracy variants and the group of subjects in the empire variants are of size m (for members), that there are n players in total (so m < n), and that each pairing plays i iterations during the IPD competition.

We further assume that:

• The best single-player (non-group) strategy IOPT (for individual optimal

strategy) achieves payoff X · i after i iterations.

• The emperor strategy achieves payoff E · i after i iterations.

• The individual members (or subjects) achieve payoff M·i after i iterations.

• The loss due to recognition of members of the same group is negligible

due to the size of i.

• We further assume that the emperor always plays the best non-group

strategy against non-members of his group.

• During peace, individual members always play the best non-group strat-

egy against non-members of their group.

• We assume that the best single-player strategy achieves an average payoff

of A against other non-group strategies. The relations P < A < T

are plausible, and a value of A near R is likely under the assumption

that most individual strategies are similar to TFT. We therefore assume


that A = R in the following unless stated otherwise. This implies that

members of groups of type DP achieve more or less the same payoff

as the best individual strategy IOPT, so we assume that MDP = XDP.

This assumption simplifies the calculations in the following claim without

sacrificing the fundamental relations between the different strategies.

• We also assume that most single-player strategies achieve an average

score near A (and thus near R according to the previous assumption)

when playing against other single-player strategies (so more or less all of

them are optimal) and against DP, EP, or emperors of EW strategies (so

they all play fairly against each other), and an average score of P when

playing against members of groups at war. This would roughly corre-

spond to the pay-off achievable by OTFT and similar strategies. Again,

this assumption simplifies the calculations in the following claim without

sacrificing the fundamental relations between the different strategies.

Claim 8.1: Under the above assumptions and unless stated otherwise,

the following relations hold:

(1) Members of groups of type DW can achieve larger payoffs than members

of groups of type DP only when the DW members constitute more than

50% of the total population. When group sizes are equal and there are

other strategies, DP has an advantage over DW. By increasing i, this

advantage can be made arbitrarily large: mDP ≥ mDW → MDP · i >>

MDW · i.

(2) Emperors from EP groups can achieve larger payoffs than members of

groups of type DP (assuming equal group size). By increasing i, this

advantage can be made arbitrarily large: EEP · i >> MDP · i. Because

of our assumption that MDP = XDP the relation also holds for the best

individual strategy IOPT, so emperors from EP groups can achieve

arbitrarily larger payoffs than the best individual strategy.

(3) Emperors from EW groups can achieve larger payoffs than an emperor

from an EP group (assuming equal group size). By increasing i, this

advantage can be made arbitrarily large: EEW · i >> EEP · i.

(4) When two groups of unequal size compete, then:

(a) Independently of the group sizes and the values of S, P, R, and

T, emperors (at war or during peace) fare better than democrats

at peace. By increasing i, this advantage can be made arbitrarily

large: EE · i >> MDP · i.


(b) Depending on the values of P, R, and T, and when i increases, a

democracy at war can fare arbitrarily better than an emperor (at

war or during peace) when it is sufficiently large: mDW >> mE →

MDW · i >> EE · i.

(5) We now assume that IOPT scores a higher average payoff value A against non-group strategies than the group strategies achieve against non-group strategies; let B with B < A < T be the (bad) score that an emperor achieves on average against non-group strategies (we here deliberately drop the initial assumption that emperors play IOPT against non-group strategies). In order for the emperor to nevertheless win despite playing worse in general than IOPT, the following inequalities must be satisfied. In case of EP,

    mEP > (A − B)/(T − B) n ,

and in case of EW,

    mEW > (A − B)/(T − B − P + A) n .

Again, larger group size helps even when the strategies are badly performing. We also see that as B approaches A, emperors can win against IOPT even with very few other group members.

(6) When two DW, EP, or EW groups of the same type but of different size and with different "efficiencies" compete (we here again deliberately drop the initial assumption that emperors play IOPT against non-group strategies), larger group size can compensate for less efficiency, and vice versa. Note that this is not true for DP groups.

Proof.

(1) MDP = R (n − mDW) + P mDW and MDW = R mDW + P (n − mDW), assuming that no other group at war is present in the population. Thus, MDW > MDP if and only if mDW > n/2.

(2) MDP = R n and EEP = R(n−m) + T m. Since T > R, EEP > MDP.

(3) EEP = R (n− 2m) + T m + P m and EEW = R (n−m) + T m. Since

R > P, EEW > EEP.

(4) For groups of unequal size:

(a) It suffices to show that EEP > MDP is independent of the size of

the groups. EEP = R (n−mEP) + T mEP and MDP = R n. Since

T > R, EEP > MDP holds independently of the size of the groups.


(b) It suffices to show that there exists a large enough mDW such that

MDW > EEW. MDW = R (n − mEW) + P mEW and EEW = R

(n − mEW − mDW) + T mEW + P mDW. Then MDW > EEW if

and only if mDW > (T − P)/(R − P)mEW. In the 2004 and 2005

competitions, P = 1, R = 3, and T = 5, so mDW would have to

be larger than 2mEW. In case only the two group strategies would

compete, this would mean that the DW strategy would need 2/3

of the strategies in the whole population.

(5) In case of EP: EEP = B (n −mEP) + T mEP and XEP = A n. Then

EEP > XEP if and only if mEP > (A − B)/(T − B)n (assuming that

T > A > B). In case of EW: EEW = B (n − mEW) + T mEW and

XEW = A (n − mEW) + P mEW. Then EEW > XEW if and only if

mEW > (A− B)/(T− B− P + A)n.

(6) We show it here for two unequal EW strategies, and note that similar

arguments work for the cases EP and DW. Let B1 and B2 be the scores

that the two emperors achieve on average against non-group strategies,

with B1 < B2 and |B1−B2| = α (T−P) with 0 < α < 1. Then E1 > E2

if and only if

m1 > (1− α)/(1 + α)m2 + α/(1 + α)n .

Example: suppose B1 = 2.5 and B2 = 2.6, and as before P = 1 and T = 5, so that α = 0.025 and m1 > 0.9512 m2 + 0.0244 n. Thus, when m2 = 20 and n = 100, then m1 must be at least 22 so that the first emperor can triumph over his more efficient opponent.
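As a numerical illustration of the relations established in this proof (our own check, not part of the original chapter), the following sketch evaluates the payoff expressions from points (1) and (2) with the competition values P = 1, R = 3 and T = 5; the class and method names are ours.

// Sketch: per-iteration payoff sums from the proof of Claim 8.1,
// evaluated with P = 1, R = 3, T = 5 (the 2004/2005 competition values).
public class GroupPayoffCheck {
    static final double P = 1, R = 3, T = 5;

    // Point (1): a DP member facing a DW group of size mDW in a population of n.
    static double mDP(int n, int mDW) { return R * (n - mDW) + P * mDW; }
    // Point (1): a DW member in the same population.
    static double mDW(int n, int mDW) { return R * mDW + P * (n - mDW); }
    // Point (2): an emperor during peace with m subjects (no group at war present).
    static double eEP(int n, int m)   { return R * (n - m) + T * m; }

    public static void main(String[] args) {
        int n = 100;
        System.out.println(mDW(n, 20) > mDP(n, 20)); // false: 140 < 260, since 20 < n/2
        System.out.println(mDW(n, 60) > mDP(n, 60)); // true:  220 > 180, since 60 > n/2
        System.out.println(eEP(n, 20) > R * n);      // true:  340 > 300, emperor beats M_DP = R*n
    }
}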

8.4.3. Collusion detection is an undecidable problem

The practical difficulty of detecting collusion has been shown in previous parts of this chapter. The difficulty of recognizing collusion is also supported by the difficulty of the problem from a theoretical point of view: We show below that the general question of whether two strategies whose source code is known and that do not depend on any third-party source of randomness are actually colluding or not is undecidable — of course it is even harder when the strategies are only known as black boxes, without access to their source code. Simpler arguments than ours would also do, but in our approach we try to define the formal collusion problem as closely as possible to the practical collusion detection problem.


Remember the definition of the Halting problem: Is there a finite de-

terministic Turing machine H that is able to decide in finitely many steps

whether an arbitrary finite deterministic Turing machine M ultimately will

halt or not? It is well known that the Halting problem has been shown to

be undecidable by Turing. Exact definitions of Turing machines and other

notions appearing in this section as well as references to the original sources

can easily be found, e.g., in any theoretical computer science reference book

such as Papadimitriou (1994).

Let the Simplified Collusion problem formally be defined as follows: Is

there a deterministic Turing machine SC that is able to decide in finitely

many steps whether, given two arbitrary integers i and j, two arbitrary

finite deterministic Turing machines S1 and S2 will both output a sequence

of at least length i+j characters (one character per tape position) composed

only of the letters “C” and “D” on their two separate write-once output

tapes T1 and T2, such that the j letters starting from tape position i + 1

will all be “D”s on T1 and all be “C”s on T2?
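To make the combinatorial core of this definition concrete, the following fragment (an illustrative sketch of ours, not part of the formal construction) checks the required window on two already-produced output strings; the undecidability lies, of course, in deciding whether two arbitrary Turing machines ever produce outputs that long, not in this finite check.

// Sketch: the finite check at the heart of the Simplified Collusion problem.
// t1 and t2 are the outputs of S1 and S2 over the alphabet {'C','D'};
// tape positions are counted from 1, as in the definition.
final class SimplifiedCollusionCheck {
    static boolean collusionWindow(String t1, String t2, int i, int j) {
        if (t1.length() < i + j || t2.length() < i + j) return false; // outputs too short
        for (int k = i; k < i + j; k++) {             // positions i+1 .. i+j
            if (t1.charAt(k) != 'D' || t2.charAt(k) != 'C') return false;
        }
        return true;
    }
}

For example, collusionWindow("CCDDD", "CCCCC", 2, 3) is true, matching an exploiter/victim pair of outputs.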

This simplistic definition covers many (but surely not all) real collusion cases. It also implies that strategies usually not considered to be consciously colluding, such as ALLD as S1 and ALLC as S2, would be classified as colluding in the Simplified Collusion terminology. However, ALLD really could be colluding with a large group of ALLC where other, more cautious strategies like OTFT would not be able to take advantage of ALLC, since they would never defect first. Thus, when a player or a group of players is able to introduce an ALLD and many ALLC into a competition, they could well be part of an intentional collusion, and thus the classification in the Simplified Collusion terminology would not be completely wrong. Ultimately, deciding what really is a collusion and what is not cannot be settled by formal methods alone. Nevertheless, we can at least show the following:

Claim 8.2: The Simplified Collusion problem is undecidable.

Proof. To formally show the undecidability of the Simplified Collusion

problem, we follow the standard argument by reducing the Halting problem

to it. Take any finite deterministic one-tape Turing machine M for which we

want to know whether it halts or not. Without loss of generality, we assume

that the tape of M is infinite in both directions, that each combination of

the finitely many characters of the alphabet, which includes the letters “C”

and “D”, and of the finitely many states of M defines exactly one of the

finitely many rules of M, and that only the special state h stops M.


To decide whether M halts or not, we construct for each M two new

Turing machines N1 and N2. N1, in comparison to M, is defined as follows:

It has an additional initially empty output tape T, an additional tape IJ

that initially contains the numbers i and j in binary with the character “:”

written between the two numbers, an additional state s, and a constant

number of other states needed to be able to countdown the two binary

numbers and do the other things described below, and almost the same set

of rules as M, with only the following changes: each rule of M leading to h

instead leads to state s, and there is a constant number of additional rules

that make sure the following: When N1 enters state s, it will countdown

from i to zero, each time writing one letter "C" on its output tape T and then moving one position to the right on T, so that at the end a sequence of i "C"s is written on T. Then it will count down from j to zero, each time writing one letter "D" on T and then moving one position to the right on T, so that at the end a sequence of i "C"s followed by j "D"s is written on T. Then

it will change to state h and halt. N2 is defined as follows: it simply writes

i + j letters “C” to its output tape T. Finally, we choose the two numbers

i and j, e.g., i = 1 and j = 1.

It is clear that this construction always leads to a valid instance of the

Simplified Collusion problem. It is also clear that the question posed in the Simplified Collusion problem will have a positive answer for the constructed instance if and only if M halts.

Now, if a finite deterministic Turing machine SC that is able to decide the Simplified Collusion problem in finitely many steps existed, then we could also decide the Halting problem in finitely many steps, as follows: We would

define a new finite deterministic Turing machine R that for any given Turing

machine M (properly encoded for R on R’s input tape), first constructs

(in finitely many steps) an encoding of corresponding finite deterministic

Turing machines N1 and N2 with i = 1 and j = 1 as described above

(this surely can be done in finitely many steps), then simulates SC applied

to this instance of the Simplified Collusion problem, thereby deciding in

finitely many steps (SC takes only finitely many steps, and simulating it on R is also easily feasible in finitely many steps) whether it is a yes- or a no-instance, and returns this answer of SC as the answer of R, which must also be the answer to the question of whether M halts or not. So, if the Simplified Collusion problem is decidable, then the Halting problem must also be decidable. Since we know for sure the latter is not true, the former also cannot be true, and thus the Simplified Collusion problem is undecidable.


8.5. Conclusion

We have described our submissions to the iterated prisoner’s dilemma (IPD)

competitions of 2004 and 2005, the OmegaTitForTat (OTFT) single-player

strategy and the CosaNostra group strategy composed of one Godfather

(CNGF) and several Hitman (CNHM). We also studied their performance

in the different leagues of the competitions.

The observed slight superiority of OTFT in comparison to GRIM is psychologically a reassuring result. The charm of OTFT compared to GRIM is that OTFT is an intelligent, forgiving strategy, whereas GRIM, as the name implies, is an unforgiving, iron-handed pig-head that falls into an eternal revenge mode after being deceived a single time.

We also have established a taxonomy of generalized group strategies

for IPD competitions. In it, the types of group strategies are classified

according to their behaviour towards other members of the same group

and towards strategies outside of their group. We labelled the four classes

of group strategies studied as democracies during peace (DP), democ-

racies at war (DW), empires during peace (EP), and empires at war

(EW). As we have shown in the previous section, group strategies can

easily outperform any individual strategy by sheer numerical superior-

ity. Group strategies appear at every place in Nature and Human So-

ciety, and group strategies competing in IPD competitions can serve as

simplified study objects of the former. It is interesting to note that in

the analysis in the last section, individual strategies that are members of a DW group fare less well than those of a DP group, and that this relation is reversed for empires, EW faring better than EP, not because the emperor itself fares better, but because his competitors are harmed more. This is clear from the fact that members of DW lose individually more than members of DP, whereas emperors at war (EW) fare better than emperors during peace (EP), and these better than DW and DP. E.g., emperors at war do not have to suffer from their aggressive acts, and actually do better in comparison to their opponents by letting the payoff of individuals that are not members of their group be lowered by their other, underling members, while at the same time retaliation from others does not hit them directly (think of real emperors, Mafia bosses, etc.).

But it does not even have to be a fight for life and death, war, or outright genocide: the same pattern appears in business, where larger or more advanced companies (in particular their owners) that are more or less aggressive can crush competitors or, in extreme cases, take advantage of cheap child-slave labour, thus extremely abusing their own workforce.

It is also interesting to note that better resources, be it people, money,

or technology, corresponding to a higher number of individual strategies

in the group, or better average payoff values against non-group strategies,

positively influence the overall payoff values of the groups. Thus, numerical

superiority does not have to mean that the number of soldiers is higher,

but can also be due to better technology, be it military, commercial, or

biological. It is also not surprising that, as described in point 4.a of Claim 8.1 of Section 0, individual strategies in democracies during peace always "lose" against an emperor, the latter always being able to get more from his subjects than what he gives in return, and certainly more than his unorganized competitors. However, given enough superiority, again either in number,

money, or technology, even democracies at war can win against empires at

war (point 4.b of the claim in Section 0), the Second World War for instance

having several examples of such situations.

We also showed that group strategies can be subtly camouflaged to

look like unrelated single-player strategies. These stealth collusion group

strategies will elude detection with high probability, e.g., by introducing

a certain amount of noise in the interaction with one’s group members to

make the collusion less evident. We showed that the differentiation between

colluding and non-colluding behaviour can be very difficult in practice and

is generally undecidable from a theoretical point of view.

In the study of economics, collusion takes place within an industry when

rival companies cooperate for their mutual benefit. According to game the-

ory, the independence of suppliers forces prices to their minimum, increas-

ing efficiency and decreasing the price determining ability of each individual

firm. If one firm decreases its price, other firms will follow suit in order to

maintain sales, and if one firm increases its price, its rivals are unlikely

to follow, as their sales would only decrease. These rules are used as the

basis of kinked-demand theory. If firms collude to increase prices as a co-

operative, however, loss of sales is minimized as consumers lack alternative

choices at lower prices. This benefits the colluding firms at the cost of

efficiency to society [Wikipedia: Collusion 2005].

There was some discussion about whether collusion group strategies were ac-

tually cheating in the 2004 and 2005 IPD competitions, but since the orga-

nizers clearly said that cooperating strategies were to be allowed, it would

have been strange to deny participation to such group strategies. What we

can say at least is that the detection of StealthCollusion, both in future IPD


competitions as well as in real life, is in practice very difficult. The Mafia, or, for that matter, any human organization that is not readily recognizable as a group, be it Masonic lodges, secret religious groups, or corporate cartels, exist and as such are certainly worth modelling. Being able to

secretly communicate, thereby “colluding” in a general sense, is quite com-

mon, and in practice forbidding it is nearly infeasible whenever intelligent

individuals exchange information repeatedly. An exception, a biological occurrence of an IPD without information exchange, has been described by Turner and Chao (1999). They show that certain viruses that infect and reproduce in the same host cells seem to be engaged in a survival-of-the-fittest-driven prisoner's dilemma. However, in light of the ways different types of bird flu viruses infecting the same human cells can exchange RNA in order to increase their fitness, it can be argued that such emerging colluding group behaviour already appears at this relatively low level of life.

In commerce, collusion is largely illegal due to antitrust law, but im-

plicit collusion in the form of price leadership and tacit understandings is

unavoidable. Several recent examples of explicit collusion in the United

States include [Wikipedia: Collusion 2005]:

• Price fixing and market division among manufacturers of heavy electrical

equipment in the 1960s.

• An attempt by Major League Baseball owners to restrict players’ salaries

in the mid-1980s.

• Price fixing within food manufacturers providing cafeteria food to schools

and the military in 1993.

• Market division and output determination of livestock feed additive by

companies in the US, Japan and South Korea in 1996.

There are many ways that implicit collusion tends to develop

[Wikipedia: Collusion 2005]:

• The practice of stock analyst conference calls and meetings of indus-

try almost necessarily cause tremendous amounts of strategic and price

transparency. This allows each firm to see how and why every other firm

is pricing their products. Again, the line between insider information and

just being better informed is often very thin.

• If the practice of the industry causes more complicated pricing, which is

hard for the consumer to understand (such as risk based pricing, hidden

taxes and fees in the wireless industry, negotiable pricing), this can cause


competition based on price to be meaningless (because it would be too

complicated to explain to the customer in a short ad). This causes in-

dustries to have essentially the same prices and compete on advertising

and image, something theoretically as damaging to a consumer as normal

price fixing.

We predict that all iterated prisoner’s dilemma competitions in the fu-

ture will be dominated by group strategies. Even if, in a future IPD competition, all strategies were chosen by one and the same person who consciously tries to prevent any "group cooperation" among his strategies, random and involuntary cooperation that is mathematically identical to voluntary cooperation can never be excluded. Actually,

group cooperation can be self-emerging in a population, some strategies

involuntarily faring better together and possibly against other groups or

individuals, however loosely they are constituted. We predict that when

evolutionary algorithms are used to breed new species of IPD strategies,

such cooperation will automatically emerge at a certain point.

Cooperation in groups of strategies in IPD competitions mimics co-

operation of groups in Nature and Human Society — it therefore allows

modelling another common aspect of cooperative behaviour that so far was

not explicitly studied in the IPD framework: more or less open coopera-

tion of subgroups versus other subgroups or individuals. The number of

members of the group does not have to correspond to the actual number of

individuals. Instead, it could also mean the amount of money involved, or

the technological advantage of one subgroup relative to another one.

Acknowledgments

The authors would like to thank the anonymous reviewers for many useful

comments and corrections.

References

Axelrod, R. (1984) The evolution of cooperation. Basic Books.

Beaufils, B., Delahaye, J.-P., and Mathieu, P. (1996). Our meeting with gradual:

A good strategy for the iterated prisoner’s dilemma, Proceedings Artificial

Life V, Nara, Japan, 1996.

Kuhn, S. (2003) Prisoner’s Dilemma. The Stanford Encyclopedia of Philoso-

phy (Fall 2003 Edition), Edward N. Zalta (ed.), http://plato.stanford.edu/

archives/fall2003/entries/prisoner-dilemma/.


Mehlmann, A. (2000) The Game’s Afoot! Game Theory in Myth and Paradox.

AMS Press.

Nowak, M. and K. Sigmund (1993) A strategy of win-stay, lose-shift that outper-

forms tit-for-tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.

O'Riordan, C. (2000) A Forgiving Strategy for the Iterated Prisoner's Dilemma. Journal of Artificial Societies and Social Simulation, 3, 4.

Papadimitriou, C. H. (1994) Computational Complexity. Addison-Wesley.

Turner, P. and L. Chao (1999). Prisoner’s dilemma in an RNA virus, Nature,

398, pp. 441-443.

Tzafestas, E.S. (2000) Toward adaptive cooperative behavior, From Animals to

animats, Proceedings of the 6th International Conference on the Simulation

of Adaptive Behavior (SAB-2000), 2, pp. 334-340.

Wikipedia: Collusion (2005). http://en.wikipedia.org/w/index.php?title=

Collusion&oldid=33029071.


Chapter 9

Error-Correcting Codes for Team Coordination within a

Noisy Iterated Prisoner’s Dilemma Tournament

Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn, Perukrishnen

Vytelingum, Nicholas R. Jennings

University of Southampton

9.1. Introduction

The mechanism by which cooperation arises within populations of selfish

individuals has generated significant research within the biological, social

and computer sciences. Much of this interest derives from the original re-

search of Axelrod and Hamilton[Axelrod and Hamilton (1981)], and, in

particular, the two computer tournaments that Axelrod organised in or-

der to investigate successful strategies for playing the Iterated Prisoner’s

Dilemma (IPD)[Axelrod (1984)]. These tournaments were so significant as

they demonstrated that a simple strategy based on reciprocity, namely tit-

for-tat, was extremely effective in promoting and maintaining cooperation

when playing against a wide range of seemingly more complex opponents.

To mark the twentieth anniversary of the publication of this work,

these two computer tournaments were recently recreated (see http://www.

prisoners-dilemma.com/) with separate events being hosted at the 2004

IEEE Congress on Evolutionary Computing (CEC’04) and the 2005 IEEE

Symposium on Computational Intelligence and Games (CIG’05). To stim-

ulate novel research, the rules of Axelrod’s original tournaments were ex-

tended in two key ways. Firstly, noise was introduced, whereby the moves

of each player would be mis-executed with some small probability. Sec-

ondly, and most significantly, researchers were invited to enter more than

one player into the round-robin style tournament. This second extension to the original rules prompted several researchers to enter teams of players into the tournament. This choice was motivated by the intuition that the members of such a team could, in principle, recognise and collaborate with



one another in order to gain an advantage over other competing players.

This proved to be the case, and teams of players performed well in both

competitions. Indeed, a member of such a team, entered by the authors,

won the noisy IPD tournaments held at both events.

Now, for this approach to be effective in practice, two key questions have

to be addressed. Firstly, the players, who have no access to external means

of communication, have to be able to recognise one another when they meet

within the IPD tournament. Secondly, having achieved this recognition, the

players have to adopt a strategy that increases the probability that one of

their own kind wins the tournament. In this chapter, we present our work

investigating these two questions. Specifically:

(1) We show how our players are able to use a pre-agreed sequence of

moves, that they make at the start of each interaction, to transmit a

covert signal to one another, and thus detect whether they are facing

a competing player or a member of their own team.

(2) We show that by recognising and then cooperating with one another,

the members of the team can act together to mutually improve their

performance within the tournament. In addition, by recognising and

acting preferentially toward a single member of the team, the team

can further increase the probability that this member wins the overall

tournament. In both cases, this can be achieved with a team that is

small in comparison to the population (typically less than 15%).

(3) Given this approach, we show with an experimental IPD tournament

that the performance of our team is highly dependent on the length

of the pre-agreed sequence of moves. The length of this sequence de-

termines both the cost and the effectiveness of the signalling between

team members, and these factors contribute to an optimum sequence

length that is independent of both the size of the team and the number

of competing players within the tournament.

(4) Using the results of these experimental IPD tournaments, we show that

signalling with a pre-agreed sequence of moves, within the noisy IPD

tournament, is exactly analogous to the problem, studied in informa-

tion theory, of communicating reliably over a noisy channel. Thus we

demonstrate that we can implement error correcting codes in order to

further optimise the performance of the team.

(5) Finally, we discuss how the results of these investigations guided the

design of the teams that we entered into the two recent IPD competi-


tions, and thus we follow this analysis with a discussion of the results

of these competitions.

The remainder of this chapter is organised as follows: section 9.2 describes

the Iterated Prisoner’s Dilemma setting and related work. Section 9.3 de-

scribes the team players that we implemented in our investigations and

section 9.4 describes the results of the experimental IPD tournaments that

we implemented. In section 9.5 we analyse these results and in section 9.6

we discuss our use of coding theory to optimise the performance of the

team. Finally, we discuss the application of these techniques within the

two computer tournaments in section 9.7 and we conclude in section 9.8.

9.2. The Iterated Prisoner’s Dilemma and Related Work

In our investigations, we consider the standard Iterated Prisoner’s Dilemma

(IPD) as used by Axelrod in his original computer tournaments. Thus, in

each individual IPD game, two players engage in repeated rounds of the

normal form Prisoner’s Dilemma game, where, at each round, they must

choose one of two actions: either to cooperate (C) or to defect (D). These

actions are chosen simultaneously and depending on the combination of

moves revealed, each player receives the payoff indicated in the game matrix

shown in table 9.1. For example, should player 1 cooperate (C) whilst player

2 defects (D), then player 1 receives zero points whilst player 2 receives five

points. The scores of each player in the overall IPD game are then simply

the sum of the payoffs achieved in each of these rounds. In our experiments

we assume that each IPD game consists of 200 such rounds; however, this number is of course unknown to the players participating.
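For concreteness, the per-round payoffs just described can be expressed as a small helper (a sketch of ours using the values of table 9.1; the class and method names are not part of any tournament framework):

/** Sketch of the one-round payoff from the normal form game of table 9.1. */
public final class PayoffMatrix {
    public static final char C = 'C', D = 'D';

    /** Returns the payoff to the first player, given both players' moves. */
    public static int payoff(char ownMove, char opponentMove) {
        if (ownMove == C && opponentMove == C) return 3; // mutual cooperation
        if (ownMove == C && opponentMove == D) return 0; // exploited cooperator
        if (ownMove == D && opponentMove == C) return 5; // successful defection
        return 1;                                        // mutual defection
    }
}

A player's score in one IPD game is then simply the sum of payoff(ownMove, opponentMove) over the 200 rounds.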

As in the original tournaments, a large number of such players (each

using a different strategy to choose its actions in each individual IPD game)

are entered into a round-robin tournament. In such a tournament, each

player faces every other player (including a copy of itself) in separate IPD

games, and the winner of the tournament is the player whose total score,

summed over each of these individual interactions, is the greatest.

Given this problem description, the goal of Axelrod’s original tourna-

ments was to find the most effective strategies that the players should adopt.

Whilst in a single instance of the Prisoner’s Dilemma game it is a domi-

nant strategy for each player to defect, in the iterated game this immediate

temptation is tempered by the possibility of cooperation in future rounds.

This is often termed the shadow of the future[Trivers (1971)], and, thus, in


Table 9.1. Pay-off matrix of the normal form Prisoner's Dilemma game.

                        Player 2
                       C        D
  Player 1    C      3,3      0,5
              D      5,0      1,1

order to perform well in an IPD tournament, it is preferable for a player to

attempt to establish mutual cooperation with the opponent. Thus, strate-

gies based on reciprocity have proved to be successful, and, indeed, the

simplest such strategy, tit-for-tat (i.e. start by cooperating and then de-

fect whenever the opponent defected in the last move) famously won both

tournaments[Axelrod (1984)].

More recent research has extended this reciprocity-based approach, and has led to strategies that out-perform tit-for-tat in general populations.

For example, Gradual[Beaufils et al. (1997)] is an adaptation of tit-for-tat that

incrementally increases the severity of its retaliation to defections (i.e. the

first defection is punished by a single defection, the second by two consec-

utive defections, and so on). Likewise, Adaptive[Tzafestas (2000)] follows

the same intuition as Gradual but addresses the fact that the opponent’s

behaviour may change over time and thus a permanent count of past de-

fections may not be the best approach. Rather, it maintains a continually

updated estimate of the opponent’s behaviour, and uses this estimate to

condition its future actions.

However, this reciprocity is challenged within the noisy IPD tourna-

ment. Here, there is a small possibility (typically around 1 in 10) that the

moves proposed by either of the players are mis-executed. Thus a player who intended to cooperate may defect accidentally (or vice versa)^a, and this noise makes maintaining mutual cooperation much more difficult. For example, a single accidental defection in a game where two players are using the tit-for-tat strategy will lead to a series of mutual defections in which each player's score is reduced. This detrimental effect is often resolved by implementing more generous strategies which do not retaliate immediately. For example, tit-for-two-tats (TFTT) will only retaliate after two successive defections[Axelrod (1997); Axelrod and Wu (1995)] and generous tit-for-tat (GTFT) only retaliates a small percentage of the times that tit-for-tat would[Axelrod and Wu (1995)]. However, whilst these strategies manage to maintain mutual cooperation when playing against similar generous strategies, their generosity is also vulnerable to exploitation by more complex strategies. Thus effective strategies for noisy IPD tournaments must carefully balance generosity against vulnerability to exploitation, and in practice this is difficult to achieve.

^a Note that this noise can be implemented in two different ways: either the cooperation is actually mis-executed as a defection, or it is simply perceived by the other player as a defection. The difference between these two implementations results in different payoffs to the players in that round of the IPD game. Whilst this does result in slightly different scores in the overall IPD tournament, it does not significantly affect the results, as, in general, the performance of a player is determined by its actions in the moves that follow either the real or perceived defection. In our experiments, we use the first implementation and assume that noisy moves are actually mis-executed.
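As an illustration of the generous strategies just described (a sketch based on our reading of generous tit-for-tat; the forgiveness probability is an assumed parameter, not a value given in the chapter):

import java.util.Random;

// Sketch of a generous tit-for-tat decision rule: reciprocate cooperation,
// but forgive a defection with some fixed probability instead of retaliating.
public final class GenerousTitForTat {
    private final double forgiveness;        // assumed parameter, e.g. 0.1
    private final Random random = new Random();

    public GenerousTitForTat(double forgiveness) { this.forgiveness = forgiveness; }

    /** opponentDefectedLastRound is false on the very first round. */
    public char nextMove(boolean opponentDefectedLastRound) {
        if (!opponentDefectedLastRound) return 'C';              // cooperate by default
        return random.nextDouble() < forgiveness ? 'C' : 'D';    // occasionally forgive
    }
}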

Now, the possibility of entering a team of players within a noisy IPD

tournament offers an alternative to this reciprocity based approach. If the

members of the team are able to recognise one another, they can uncon-

ditionally mutually cooperate and thus do not need to retaliate against

defections that are the result of mis-executed moves. In addition, by de-

fecting against players who they do not recognise as fellow team members,

they are immune to exploitation from these competing players. As such,

this approach resembles the notion of kin selection from the evolutionary

biology literature, where individuals act altruistically toward those that

they recognise as being their genetic relatives[Hamilton (1963, 1964)].

However, to use this approach in practice, we must address two specific

issues. Firstly, we must enable the players to recognise one another and

we do so by using a pre-agreed sequence of moves that each player makes

at the start of each IPD interaction. Secondly, since our goal is to ensure

that one member of the team wins the tournament, we explicitly identify

one team member as the team leader, and have the other team members

favour this individual. We describe these steps, in more detail, in the next

section.

9.3. Team Players

Thus, as described in the previous section, we initially implement a team

of players who recognise one another through the initial sequence of moves

they make at the start of each IPD interaction. To this end, each team

player uses a fixed length binary code word to describe this initial sequence

of moves. Specifically, we denote 0 as defect and 1 as cooperate, and the

binary code word indicates the fixed sequence of moves that the player

should make, regardless of the actions of the opponent. This binary code

word is known to all members of the team, and by comparing the moves


Fig. 9.1. Diagram showing the sequence of actions played by each of the team members: after transmitting the 'team member code', a team member plays CCCCCCCC... (unconditional cooperation) if it recognises another team member, and DDDDDDDD... (unconditional defection) otherwise.

of their opponents against this code word, players within the team can

recognise if they are playing against another member of the team or against

an unknown opponent.^b
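A minimal sketch of this recognition step is given below (our own illustrative code, not the authors' tournament entry; the class name and the exact bookkeeping are assumptions):

// Sketch: play a pre-agreed code word and check whether the opponent's
// opening moves match it; true encodes cooperate, false encodes defect.
public final class CodeWordRecognition {
    private final boolean[] codeWord;     // pre-agreed L-bit code word
    private int round = 0;
    private boolean matchesSoFar = true;

    public CodeWordRecognition(boolean[] codeWord) { this.codeWord = codeWord; }

    /** Own move while still inside the signalling phase. */
    public char signallingMove() { return codeWord[round] ? 'C' : 'D'; }

    /** Record the opponent's move for the current signalling round. */
    public void observeOpponent(boolean opponentCooperated) {
        if (round < codeWord.length && opponentCooperated != codeWord[round]) {
            matchesSoFar = false;         // pattern broken: treat as a competing player
        }
        round++;
    }

    public boolean signallingFinished() { return round >= codeWord.length; }
    public boolean recognisedTeamMate() { return signallingFinished() && matchesSoFar; }
}

This exact-match version is deliberately simple: under noise a single mis-executed move breaks recognition, which is precisely the weakness addressed by the error-correcting codes discussed in section 9.6.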

Now, whenever a team member meets another team member within the

IPD tournament, they can recognise one another and then cooperate with

one another unconditionally. In addition, the team members can recognise

when they are playing against a competing player and then defect contin-

ually (see figure 9.1). In this way, since the team players no longer have to

reciprocate any mis-executed moves in order to maintain cooperation, they

achieve close to the maximum possible score whenever they play against

other team members. In addition, since they defect against competing

players, they are also immune to exploitation from these players. Thus

given a sufficient number of team members within the IPD tournament,

the team players perform well, compared to reciprocity based strategies.

However, our goal is to form a team that maximises the probability that

one of its members will be the most successful player within the IPD tourna-

ment. Thus, we can improve the performance of the team by identifying one

of the team members as the team leader, and allowing the other ordinary

team members to act preferentially towards this team leader. Thus, when

the ordinary team members encounter the team leader, they continually

cooperate, whilst allowing the team leader to exploit them by continually

defecting. In this way, whilst competing players derive the minimum possi-

ble score in interactions with the ordinary team members, the team leader

derives the maximum possible score in these same interactions. Hence,

by allowing the team leader to exploit them, the ordinary team members

sacrifice their own chance of winning the tournament, but by changing the

tournament environment, they are able to increase the chance that the team

leader will win.^c

^b Note that this recognition will not be perfectly reliable; the code word may be corrupted by noise, or competing players may accidentally make a sequence of moves that matches the team code word. These are effects that we explicitly consider in section 9.6.

^c Thus the team that we implement is similar to the 'master' and 'slave' approach suggested by Delahaye and Mathieu[Delahaye and Mathieu (1993)]. However, unlike this example, where the slaves were simple strategies that could potentially be exploited by any member of the population, all of our team players explicitly recognise one another and condition their actions on this recognition.


Fig. 9.2. Diagram showing the sequence of actions played by each of the team players. The team leader transmits the 'team leader code' and then plays DDDDDDDD... against a recognised team member, CCCCCCCC... against a recognised team leader (a copy of itself), and CC followed by tit-for-tat otherwise. An ordinary team member transmits the 'team member code' and then plays CCCCCCCC... against a recognised team member or team leader, and DDDDDDDD... otherwise.

The case above describes the instances in which the team leader encoun-

ters another team member. However, when the team leader encounters any

other competing players it should adopt some default strategy. Clearly,

using the best performing strategy available will increase the chances of

the team leader winning the tournament. However, since our purpose here

is to demonstrate the factors that influence the effectiveness of the team,

rather than to optimise a single example case, in the investigations that

we present here, we use tit-for-tat as this default strategy. As such, tit-

for-tat is well understood, and whilst it does not exploit other strategies

as effectively as the more recently developed alternatives discussed in the

previous section, it is immune to being exploited itself. Thus in the case

that the team leader does not recognise another team player, it cooperates

on the next two moves in an attempt to reestablish cooperation and then

continues by playing tit-for-tat for the rest of the interaction.

Finally, since the rules of the IPD tournament mean that each player

must play against a copy of themselves, we also enable the team leader to

recognise and cooperate with a copy of itself. Thus, the actions of both the

ordinary team members and the team leader are shown schematically in

figure 9.2. Note that it is not strictly necessary to implement two different

codes (i.e. one for the team leader and one for ordinary team members),

however, we do so to reduce the chances of a competing player exploiting

the ordinary team members (see section 9.7 for a more detailed discussion).



9.4. Experimental Results

Now, given the team players described in the previous section, two imme-

diate questions are posed: (i) how does the number of team players within

the population effect the probability that the team leader does in fact win

the tournament? and (ii) how does the length of the code word (i.e. the

length of the initial sequence of moves that the team players use to signal

to one another) affect the performance of the team leader? In order to

address these questions and to test the effectiveness of the team, we imple-

ment an IPD tournament (with and without noise) using a representative

population of competing players. To ensure consistency between differ-

ent comparisons within the literature, we adopt the same test population

as previous researchers[Beaufils et al. (1997); O'Riordan (2000); Tzafestas

(2000)] and thus the population consists of eighteen players implementing

the base strategies used in the original Axelrod competition (e.g. All C,

All D, Random and Negative), simple strategies that play periodic moves

(e.g. periodic CD, CCD and DDC) and state-of-the-art strategies that have

been shown to outperform these simple strategies (e.g. Adaptive, Forgiving

and Gradual). A full list and description of the strategies adopted by these

players is provided in Appendix A.

We first run this tournament, using this fixed competing population,

whilst varying the number of team players within the population, from 2

to 5 (i.e. one team leader and 1 to 4 ordinary team members), and varying

the length of code word, L, from 1 to 16 bits. To ensure representative

results, we also average over all possible code words, and in total, we run

the tournament 1000 times and average the results. Since our aim is to

show the benefit that the team has yielded, compared to the default strategy of the team leader (in this case tit-for-tat), we divide the total score of the team leader by the total score of the player adopting the simple tit-for-tat strategy. Thus, we calculate 〈Score_Leader〉 / 〈Score_TFT〉 and note that the greater this value, the better the performance of the team. The

results of these experiments are shown in figure 9.3 for the noise free IPD

tournament and in figure 9.5 for the noisy IPD tournament. In these figures,

the experimental results are plotted with error bars, along with a continuous

best fit curve (see section 9.5 for a discussion of the calculation of this line).

Now, in order to investigate the effect of larger population sizes, we

also run experiments where we fix the number of team players within the

population to be five (again composed of one team leader and four ordi-

nary team members), but then generate competing populations of differ-


Fig. 9.3. Experimental results showing the benefit of the team in a noise free IPD tournament (〈Score_Leader〉 / 〈Score_TFT〉 plotted against code word length L). Results show code word lengths from 1 to 16 bits where the total population consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members) and 18 competing players. Results are averaged over 1000 tournament runs.

Fig. 9.4. Experimental results showing the benefit of the team in a noise free IPD tournament (〈Score_Leader〉 / 〈Score_TFT〉 plotted against code word length L). Results show code word lengths from 1 to 16 bits where the total population consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6, 12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs.


Fig. 9.5. Experimental results showing the benefit of the team in a noisy IPD tournament (〈Score_Leader〉 / 〈Score_TFT〉 plotted against code word length L). Results show code word lengths from 1 to 16 bits where the total population consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members) and 18 competing players. Results are averaged over 1000 tournament runs.

Fig. 9.6. Experimental results showing the benefit of the team in a noisy IPD tournament (〈Score_Leader〉 / 〈Score_TFT〉 plotted against code word length L). Results show code word lengths from 1 to 16 bits where the total population consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6, 12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs.


ent sizes by randomly selecting players from our pool of 18 base strate-

gies (always ensuring that we have at least one player using the tit-for-tat

strategy). We run the tournament 10000 times (more than before, as we must also average over the stochastic competing population) and again calculate 〈Score_Leader〉 / 〈Score_TFT〉. Figure 9.4 shows these results for the noise free IPD tournament and figure 9.6 shows the results for the noisy IPD tournament.

The results clearly indicate that, as expected, increasing the number

of team players, or more exactly, increasing the percentage of the popula-

tion represented by the team, improves the performance of the team (i.e.

increases 〈Score_Leader〉 / 〈Score_TFT〉). In addition, in both the noise free

and noisy IPD tournaments there is clearly an optimum code word length

whereby the benefit of the team decreases when the code word length is

longer or shorter than this optimum. Most significantly, this optimum code

word length is clearly independent of both the size of the team and the

population. In addition, in the case of the noisy IPD tournament, the re-

sults are very sensitive to this optimum code word length and, overall, the

benefit of the team is much less than that achieved in the noise free IPD

tournament. In the next section, we analyse these results and propose error

correcting codes to improve performance in the noisy IPD tournament.

9.5. Analysis

The optimum code word lengths observed in the previous experimental re-

sults are the result of a number of opposing factors. If we initially consider

the noise free IPD tournament, we can identify two such factors. The first

represents the cost of the signalling between team players. As the length of

the code word is increased, the team players have fewer remaining moves in which to manipulate the outcome of the tournament and, thus,

this factor favours shorter code word lengths. However, for this signalling

to be effective, the team players must be able to distinguish between com-

peting players and other team players. If the code word becomes too short,

it becomes increasingly likely that a competing player will, through pure chance, make the sequence of moves that corresponds to either of the code

words of the team players. Thus the second factor represents the effec-

tiveness of the signalling. It has the opposite effect of the first and thus

favours longer code word lengths. The balance of these two opposing factors gives rise to the behaviour seen in figures 9.3 and 9.4 where we observe

an optimum code length near seven bits; at greater lengths we observe an

approximately linear decrease in performance, whilst at shorter lengths, we


[Figure: Probability of Discrimination (P_d) plotted against code word length L (1 to 16 bits).]

Fig. 9.7. Experimental and theoretical results showing the probability of a team player successfully discriminating between another team player and a competing player in an IPD tournament.

observe a more rapid decrease in performance.

When noise is added to the IPD tournament, a third factor, which also

affects the effectiveness of the signalling, becomes apparent. In order for

the team players to recognise one another, the sequence of moves made

by each player must be correctly executed. In the noisy IPD tournament,

there is a small probability that one or more of the moves that constitute

these code words will be mis-executed and, in this case, the team players

will fail to recognise one another. The effect of this additional factor is

clearly seen in a comparison of figures 9.3 and 9.4 and figures 9.5 and 9.6.

In the noisy IPD tournament the optimum code word length is significantly

shorter than the noise free case and there is a very rapid non-linear decrease

in performance at code word lengths greater than this optimum. This final

factor is very significant, and thus in the noisy IPD tournament, the team

yields much less benefit than that in the noise free IPD tournament.

Now, the two factors that describe the effectiveness of the signalling can

usefully be expressed as two probabilities. These are the probability that a

team player will successfully discriminate a competing player from another

team player, Pd, and the probability that two team players will successfully

recognise one another, Pr. We can directly measure these probabilities from

the experimental results presented in the last section, and then compare


them to theoretical predictions.

Thus, to calculate the probability of successful discrimination, P_d, we consider that out of the 2^L possible code words, one is required for the team leader code and one for the team member code. Thus, when we consider the average over all possible code words, this probability is given by:

    P_d = 1 - \frac{2}{2^L}    (9.1)

In the case of the probability of successful recognition, Pr, we require that

both code word sequences are played with no mis-executed moves. If the

probability of mis-executing a move is γ (in our case γ = 1/10), then this

probability is simply given by:

    P_r = (1 - \gamma)^{2L}    (9.2)
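As a quick numerical check, the short Python sketch below (our own illustration, not part of the original study) evaluates equations (9.1) and (9.2) for the chapter's noise level γ = 1/10 and code word lengths from 1 to 16 bits; the two probabilities pull in opposite directions as L grows, which is the trade-off underlying the optimum code word length discussed above.

    # Evaluate equations (9.1) and (9.2) for gamma = 0.1 and L = 1..16.
    gamma = 0.1  # probability that a single move is mis-executed

    for L in range(1, 17):
        P_d = 1 - 2 / 2**L           # probability of discriminating a competing player, eq. (9.1)
        P_r = (1 - gamma)**(2 * L)   # probability that two team players recognise one another, eq. (9.2)
        print(f"L={L:2d}  P_d={P_d:.3f}  P_r={P_r:.3f}")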

Figures 9.7 and 9.8 show a comparison of these analytical results against the

probabilities measured from the experimental results presented in the last

section. Clearly the theoretical predictions match the experimental data

extremely well^d and these results indicate that the benefit of the team is

strongly dependent on the effectiveness of the signalling between the team

members. Most surprisingly, in the case of the noisy IPD tournament, with anything but the very shortest code word lengths, the chances of two team players successfully recognising one another are extremely small. At

first sight, this result suggests that the use of teams is unlikely to be very

effective in noisy environments. However, the problem that we face here

(i.e. how to reliably recognise code words in the presence of mis-executed

moves) is exactly analogous to the problem, studied in information theory, of communicating reliably over a noisy channel. As such, we can use the results

of this field (specifically error correcting codes), to increase the probability that the team members successfully recognise one another, and thus, in turn, increase the benefit that the team will yield.

^d Further confirmation of this analysis is provided by the observation that the best-fit lines shown in figures 9.3 to 9.6 are calculated by postulating that the shape of the line is given by y = A + Bx + C/2^x + D(1 − γ)^{2x}. The coefficients A, B, C and D are then found via regression so as to minimise the sum of the squared error between observed and calculated results. In the case of the noise free IPD tournament, the value of D is fixed at zero.

9.6. Error Correcting Codes

The problem of communicating reliably over a noisy channel, or in our case, reliably recognising code words when moves of the IPD game are subject to


[Figure: Probability of Recognition (P_r) plotted against code word length L (1 to 16 bits).]

Fig. 9.8. Experimental and theoretical results showing the probability of two team players successfully recognising one another in a noisy IPD tournament.

mis-execution, is fundamental to the field of information theory[Shannon

(1948)]. One of the most widely used results of this work is the concept

of error correcting codes; codes that allow random transmission errors to

be detected and corrected[MacKay (2003); Peterson and Weldon (1972)].

Such codes typically take a binary code word of length L_c and encode it into a longer binary message of length L_m (i.e. L_m > L_c). Should any

errors occur in the transmission of this message (e.g. a 1 transmitted by

the sender is interpreted as a 0 by the receiver), the decoding procedure and

the redundancy that has been incorporated into the longer message, mean

that these errors can be corrected and the original code word retrieved.

Different coding algorithms are distinguished by the length of the initial

code word, the degree of redundancy added to the message and by the

number of errors that they can correct. Thus, in our application, all the

team members must implement the same coding algorithm, but now, rather

than using the code word directly to describe their initial sequence of moves,

they use the longer encoded message. Likewise, they observe the moves of

their opponent and then compare the results of the decoding algorithm to

their reference code words.
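To make the encode/observe/decode cycle concrete, the toy Python sketch below implements a single-block Hamming (7,4) code of the kind considered later in this section. It is only an illustration under our own conventions (bits 0/1 stand for the two possible moves), not the authors' tournament code.

    # Toy illustration: a 4-bit team code word is encoded into a 7-bit opening
    # sequence of moves, and a received sequence (possibly corrupted by one
    # mis-executed move) is decoded back to the original code word.

    def hamming74_encode(d):                      # d = [d1, d2, d3, d4]
        p1 = d[0] ^ d[1] ^ d[3]                   # parity over codeword positions 1, 3, 5, 7
        p2 = d[0] ^ d[2] ^ d[3]                   # parity over codeword positions 2, 3, 6, 7
        p3 = d[1] ^ d[2] ^ d[3]                   # parity over codeword positions 4, 5, 6, 7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def hamming74_decode(r):                      # r = received 7-bit sequence
        s1 = r[0] ^ r[2] ^ r[4] ^ r[6]
        s2 = r[1] ^ r[2] ^ r[5] ^ r[6]
        s3 = r[3] ^ r[4] ^ r[5] ^ r[6]
        pos = s1 + 2 * s2 + 4 * s3                # 1-based position of a single error, 0 if none
        r = list(r)
        if pos:
            r[pos - 1] ^= 1                       # correct the single mis-executed move
        return [r[2], r[4], r[5], r[6]]           # recover the 4-bit code word

    word = [1, 0, 1, 1]
    moves = hamming74_encode(word)
    moves[5] ^= 1                                 # noise flips one move
    assert hamming74_decode(moves) == word        # the team code word is still recognised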

The improvement that such error-correcting codes can achieve is signifi-

cant but we have several requirements when selecting an appropriate coding


algorithm. The coding algorithm should increase the effectiveness of the

signalling, by increasing the probability that the team members can suc-

cessfully discriminate between team members and other competing players

(i.e. increase Pd) and by increasing the probability that the team members

recognise one another successfully (i.e. increase Pr). However, it should not

increase the cost of the signalling such that this increase in effectiveness is

lost. The need to limit the increase in the cost of signalling, and thus limit

the length of the encoded message, L_m, is the key factor in restricting our

choice of coding algorithm. As shown in figures 9.3 and 9.4, even with the

perfect recognition that is achieved in the noise free case, the performance

of the team begins to degrade when L_m > 7, and whilst many coding algo-

rithms exist, the vast majority generate message lengths far in excess of this

value[Peterson and Weldon (1972)]. Thus, our choice of coding algorithm

is limited to the three presented below:

(1) A single block Hamming code that takes a 4 bit code word and generates

a seven bit message that can be corrected for a single error.

(2) A two block Hamming code that simply concatenates two four bit words

and thus produces a fourteen bit message that can be corrected for a

single error in each 7 bit block.

(3) A [15,5] Bose-Chaudhuri-Hocquenghem (BCH) code that encodes a

five bit code word into a fifteen bit message, but is capable of correcting

up to three errors.

Now, in each case, the probability of successfully discriminating between

team players and competing players is still determined by the initial code

word length (i.e. the decoding algorithm maps the 2^{L_m} possible encoded messages onto 2^{L_c} possible code words), and thus, as before, is given by:

    P_d = 1 - \frac{2}{2^{L_c}}    (9.3)

However, the probability that the team players successfully recognise one

another is determined by the message length and by the error correcting

ability of the code. Thus, for the Hamming code with n blocks, this prob-

ability is given by the probability that fewer than two errors occur in each seven bit encoded message:

    P_r = \left[ \sum_{k=0}^{1} \binom{7}{k} \gamma^k (1-\gamma)^{7-k} \right]^{2n}    (9.4)


Table 9.2. Calculated results for the probability of discrimination, P_d, and the probability of recognition, P_r, for the three different error correcting codes considered.

                                        Direct   Hamming    Hamming     BCH
                                        (L=3)    (1 block)  (2 blocks)  [15,5]
L_c – Code Word Length                    3         4          8          5
L_m – Message Length                      3         7         14         15
P_d – Probability of Discrimination    0.750     0.875      0.992      0.937
P_r – Probability of Recognition       0.531     0.723      0.527      0.892

For the [15,5] BCH code, the probability of recognition is given by consid-

ering that the code word can be correctly decoded if fewer than four errors occur in the fifteen bit encoded message, and thus:

    P_r = \left[ \sum_{k=0}^{3} \binom{15}{k} \gamma^k (1-\gamma)^{15-k} \right]^{2}    (9.5)
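The entries of table 9.2 can be reproduced directly from equations (9.3) to (9.5). The following short sketch (our own check, assuming γ = 1/10 as elsewhere in the chapter) computes the same values.

    from math import comb

    gamma = 0.1

    def p_discriminate(L_c):
        return 1 - 2 / 2**L_c                                        # equation (9.3)

    def p_recognise_hamming(n_blocks):
        block_ok = sum(comb(7, k) * gamma**k * (1 - gamma)**(7 - k) for k in range(2))
        return block_ok ** (2 * n_blocks)                            # equation (9.4)

    def p_recognise_bch():
        msg_ok = sum(comb(15, k) * gamma**k * (1 - gamma)**(15 - k) for k in range(4))
        return msg_ok ** 2                                           # equation (9.5)

    print(p_discriminate(3), (1 - gamma)**6)           # direct code, L = 3
    print(p_discriminate(4), p_recognise_hamming(1))   # single block Hamming
    print(p_discriminate(8), p_recognise_hamming(2))   # two block Hamming
    print(p_discriminate(5), p_recognise_bch())        # [15,5] BCH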

These calculated values are shown in table 9.2 for the three coding algo-

rithms considered, along with the original case results in which the direct

code words are used (we use the value of L = 3 which was shown to be

optimal for the noisy IPD tournament presented in section 9.4). Note that all of the coding algorithms result in improvements in P_d, since they all implement a code word of length greater than three. However, only the single block Hamming code and the [15,5] BCH code improve upon P_r. In the case

of the two block Hamming code, the error correcting ability is not sufficient

to overcome the long message length that results. Of the three algorithms,

the [15,5] BCH code is superior; it creates the longest message length, yet

its error correcting ability is such that it also displays the best probability of

recognition. This result is confirmed by implementing the different coding

algorithms within the team players and repeating the experimental noisy

IPD tournament, with a fixed competing population, described in section

9.4. As before, to ensure representative results, we run the tournament

1000 times and average over all possible choices of code words. Table 9.3

shows the results of this comparison when 2 to 5 team players (i.e. one

team leader and 1 to 4 ordinary team members) are included within the

population. As expected, the [15,5] BCH code outperforms the others and,

in the case where there are five team members, the performance of the

[15,5] BCH algorithm is very close to the best achieved in the noise free

IPD tournament presented in figure 9.3.


Table 9.3. Experimental results for 〈Score_Leader〉 / 〈Score_TFT〉 for the three different error correcting codes considered here. Tournaments are averaged over 1000 runs and the standard error of the mean is ±0.002.

Number of        Direct   Hamming    Hamming     BCH
Team Players     (L=3)    (1 block)  (2 blocks)  [15,5]
2                1.043     1.055      1.044      1.062
3                1.079     1.101      1.083      1.120
4                1.112     1.145      1.121      1.173
5                1.141     1.184      1.159      1.221

Finally, we present results from implementing this [15,5] BCH code in

the noisy IPD tournament, again with a fixed competing population. In

table 9.4 we show the total scores achieved by each player when the number

of team players increases from 2 to 5. To enable comparison with other

populations, we normalise these scores and divide the total score achieved

by each player, by the size of the population and by the number of rounds

in each IPD game (in this case 200). Thus, the values shown are the

ranked average pay-off received by the player in each round of the Prisoner’s

Dilemma game. Within this table, the competing players are denoted by

the mnemonic given in Appendix A, the team leader is denoted by LEAD

and the ordinary team members by MEMB.

Clearly, as more team members are added to the population, they are

increasingly able to change the environment in which the team leader must

interact and thus they are able to influence the outcome of the tournament

in favour of the team leader. In three out of the four cases, the team leader

is in fact the winner of the tournament, despite the fact that this player is

based upon the tit-for-tat strategy that performs relatively poorly against

this population (see the results shown in Appendix A). In addition, these

results also clearly show that the mutual cooperation of the other team

members, also leads them to perform well. Indeed, when the team consists

of five (or more) such team members, all five occupy the top positions.

In table 9.5, rather than showing the averaged scores of the tournament

players, we present the probability that one of the team players actually

wins the overall noisy IPD tournament. In addition to the previous results

where the probability that a move was mis-executed was 1/10, we present

a range of values from 0 to 1/5. The results indicate that whilst we have

assumed a noise level of 1/10 throughout the analysis, our results are not

particularly sensitive to this value. Indeed, the more significant factor is


Table 9.4. Experimental results showing the results of the noisy IPD tournament when

the team players implement a [15,5] BCH coding algorithm and there are increasing

numbers of team players (a) to (d). The tournaments are averaged over 1000 runs and

the standard error of the mean is ±0.002.

(a) 2 team players

Player Score

GRAD 2.347

LEAD 2.344

ADAP 2.263

SMAJ 2.256

GRIM 2.239

ALLD 2.219

MEMB 2.219

TFT 2.207

TFTT 2.175

FORG 2.171

GTFT 2.160

PCD 2.138

PCCD 2.136

STFT 2.124

HMAJ 2.109

RAND 2.101

PAVL 2.099

PDDC 2.072

NEG 2.049

ALLC 1.996

(b) 3 team players

Player Score

LEAD 2.427

GRAD 2.298

MEMB 2.246

MEMB 2.246

ADAP 2.228

SMAJ 2.221

GRIM 2.221

ALLD 2.192

TFT 2.168

TFTT 2.135

FORG 2.126

GTFT 2.114

PCD 2.091

STFT 2.090

HMAJ 2.084

PCCD 2.078

RAND 2.058

PAVL 2.047

PDDC 2.033

NEG 1.991

ALLC 1.934

(c) 4 team players

Player Score

LEAD 2.503

MEMB 2.273

MEMB 2.272

MEMB 2.271

GRAD 2.256

ADAP 2.191

SMAJ 2.186

GRIM 2.181

ALLD 2.161

TFT 2.133

TFTT 2.099

FORG 2.086

GTFT 2.068

STFT 2.061

HMAJ 2.054

PCD 2.047

PCCD 2.027

RAND 2.013

PDDC 2.005

PAVL 2.004

NEG 1.938

ALLC 1.877

(d) 5 team players

Player Score

LEAD 2.568

MEMB 2.296

MEMB 2.294

MEMB 2.294

MEMB 2.292

GRAD 2.218

ADAP 2.164

SMAJ 2.157

GRIM 2.156

ALLD 2.136

TFT 2.103

TFTT 2.062

FORG 2.054

STFT 2.036

GTFT 2.031

HMAJ 2.030

PCD 1.999

PCCD 1.982

RAND 1.969

PDDC 1.969

PAVL 1.966

NEG 1.886

ALLC 1.820

the loss of performance of the competing players as the noise level increases.

The table shows that with just two team members and no noise, a team

player will win the tournament just 3.4% of the time. However, as the noise

level increases, the performance of the other players within the tournament

degrades at a faster rate than that at which the effectiveness of the signalling

between team members diminishes. At a noise level of 1/5 the same team

members win 70.2% of the time. Indeed with 3 or 4 team members, the

results are independent of the noise level within this range.


Table 9.5. Experimental results showing the probability that one of the team members wins the noisy IPD tournament. Results are for different numbers of team members and a range of noise levels. Results are averaged over 1000 tournament runs and the standard error of the mean for each result is ±0.5.

Number of         Noise Level (γ)
Team Players     0.00     0.05     0.10     0.15     0.20
2                2.8 %   10.6 %   22.4 %   30.0 %   32.6 %
3                3.4 %   81.0 %   80.4 %   81.6 %   70.2 %
4               97.6 %   99.0 %   96.4 %   96.6 %   97.2 %
5               97.4 %   96.6 %   97.2 %   96.6 %   96.8 %

9.7. Competition Entry

The results of the previous sections clearly indicate that there is an advan-

tage to be gained by entering a team of players into the noisy IPD tour-

nament. However, when using these results to actually design the players

for the IPD competition entries, a number of additional factors must be

considered. Firstly, in our experimental investigations we have averaged

over all possible code words to produce representative results. However,

for the competition entry we must actually select two code words: one for

the team members and one for the team leader. Whilst the probability of

recognising a team player is independent of the choice of code word (this is a property of the codes that are implemented), the probability of successfully discriminating between team and competing players is not. Clearly,

code words that are close (in Hamming distance) to the initial moves of

competing players are more likely to be corrupted by noise and thus falsely

recognised. Thus we must select code words that are most unlike the moves

that we expect to observe from competing players. Actually making this

choice is complicated by the fact that we do not know the strategies that

the competing players will use, and the moves that they make will them-

selves depend on the actual code words that the team players use. Thus, we

again use our test population of eighteen default strategies and, by exhaustive testing, we select two code words which most often lead to the correct

recognition of team players and the correct discrimination of competing

players.

Secondly, throughout these investigations, we have not considered the

possibility of another competing player learning the code words of the team

members and then attempting to exploit them. Within our competition en-

tries, we greatly reduce the possibility of this occurring by having each team

player monitor the behaviour of their opponent, in order to check that they


behave as expected. Thus, if an ordinary team member recognises their

opponent to be another ordinary team member, they check that the oppo-

nent does in fact cooperate in the subsequent rounds of the game. Should

the opponent attempt to defect (with some allowance for the possibility

of mis-executed moves), it is assumed that the opponent has been falsely

recognised and thus the team member begins to defect to avoid the possibil-

ity of being exploited. Given this additional checking, the only possibility

of exploitation is that a competing player learns the code word of the team

leader, and thus tricks the ordinary team members into allowing themselves

to be exploited. However, in the IPD tournament, this is extremely unlikely

to occur. The players within the tournament only interact with each other

once, thus, whilst a competing player may encounter several ordinary team

members, there is little possibility of them learning the code word of the

team leader in this single interaction. This is the reason for implementing

separate team member and team leader code words.

Finally, we must decide how many team members to submit into the

competition. Clearly, our results indicate that the larger the number of

players, the better the performance of the team leader. However, typically,

this number is limited by the rules of the competition (e.g. the rules of the

second IPD tournament capped this number at 20), and thus, we should

submit the maximum allowable number of players.

Thus, the teams that we entered into the two recent IPD competitions

held at the 2004 IEEE Congress on Evolutionary Computing (CEC’04)

and the 2005 IEEE Symposium on Computational Intelligence and Games

(CIG’05), followed these guidelines and were successful. In the first compe-

tition, we entered several teams that used the single block Hamming code

and a range of default strategies for the team leader. Whilst a few other

researchers entered teams of players, the policy was not widely adopted and

the team leader from the largest team won with a clear advantage.

In the second round of competitions we entered a single team using

the more complex [15,5] BCH coding scheme, and, as in our investigations

here, we used tit-for-tat as the default strategy of the team leader. In this

competition, separate noise free and noisy IPD tournaments were held, and

these tournaments were more competitive as, given the results of the first competition, many more researchers adopted the policy of submitting a

team of players. Within the noise free IPD tournament, three of the top

four positions were occupied by representatives of different teams. However,

within the noisy IPD tournament, our team leader again won with a clear

advantage, despite using tit-for-tat as a default strategy. The other


teams entered into this tournament performed poorly compared to the noise

free IPD tournament. Thus, these results clearly illustrate the advantage

that the use of error-correcting codes has yielded by enabling our team

players to recognise one another in the noisy environment.

9.8. Conclusions

In this chapter, we presented our investigations into the use of a team of

players within an Iterated Prisoner’s Dilemma tournament. We have shown

that if the team players are capable of recognising one another, they can

condition their actions to increase the probability that one of their mem-

bers wins the tournament. Since outside means of communication are not

available to these players, we have shown that they are able to make use

of a covert channel (specifically, a pre-agreed sequence of moves that they

make at the start of each interaction) to signal to one another and thus

perform this recognition. By carefully considering both the cost and effec-

tiveness of the signalling, we have shown that we can use error correcting

codes to optimise the performance of the team and that this coding allows

the teams to be extremely effective in the noisy IPD tournament; a noisy

environment which initially appears to preclude their use.

Our future work in this area concerns the use of these team players in

an evolutionary model of the IPD tournament. That is, rather than the

static IPD tournament presented here (where the population of competing

players is fixed), we consider a model where the population of competing

players evolves over time (i.e. the survival of any individual within the

population is dependent on their performance within an IPD tournament

held at each generation). Here we are particularly interested in searching

for evolutionarily stable strategies (ESS), and thus are interested in whether an

explicit team leader is required (or indeed, can even be implemented) and

how team players may attempt to exploit other team players to their own

advantage. As such, this work attempts to compare the roles of kin selection

and reciprocity for maintaining cooperation in noisy environments.

A.1. Test Population

The test population consists of eighteen players implementing the base

strategies used in the original Axelrod competition (e.g. All C, All D,

Random and Negative) plus simple strategies that play periodic moves (e.g.

periodic CD, CCD and DDC) and state-of-the-art strategies that have been


shown to outperform these simple strategies (e.g. Adaptive, Forgiving and

Gradual). A full list and description of the strategies adopted by these

players is shown in table A.1, and table A.2 shows the results of running

noise free and noisy IPD tournaments using just these players. To ensure

repeatable results, we run the tournament 1000 times and present the aver-

age results. To allow easy comparison with other publications, we normalise

the scores and thus divide them by the size of the population and the num-

ber of rounds in each IPD game (in this case 200). Thus, the values shown

are the ranked average pay-off received by the player in each round of the

Prisoner’s Dilemma game.

Note that in this population, tit-for-tat performs relatively poorly and

is easily beaten by a number of strategies. In addition, in general the scores

in the noisy IPD tournament are less than those in the noise free tourna-

ment, since it is far harder to ensure mutual cooperation in the presence of

accidental defections.


Table A.1. Description of the strategies adopted by the competing players in the

test population.

Strategy Name Description

Adaptive ADAP Uses a continuously updated estimate of the

opponent player’s propensity to defect to

condition future actions[Tzafestas (2000)].

All C ALLC Cooperates continually.

All D ALLD Defects continually.

Forgiving FORG Modified tit-for-tat strategy that attempts

to reestablish mutual cooperation after

a sequence of mutual defections[O'Riordan (2000)].

Gradual GRAD Modified tit-for-tat strategy that uses progressively longer sequences of defections in

retaliation[Beaufils et al. (1997)].

Grim GRIM Cooperates until a strategy defects against

it. From that point on defects continually.

Generous Tit-For-Tat GTFT Like tit-for-tat but cooperates 1/3 of the

times that tit-for-tat would defect[Axelrod

and Wu (1995)].

Hard Majority HMAJ Plays the majority move of the opponent.

On the first move, or when there is a tie, it

cooperates.

Negative NEG Plays the negative of the opponent's last move.

Pavlov PAVL Plays win-stay, lose-shift[Nowak and Sig-

mund (1993)].

Periodic CD PCD Plays ‘cooperate, defect’ periodically.

Periodic CCD PCCD Plays ‘cooperate, cooperate, defect’ period-

ically.

Periodic DDC PDDC Plays ‘defect, defect, cooperate’ periodically.

Random RAND Cooperates and defects at random.

Suspicious Tit-For-Tat STFT Identical to tit-for-tat but starts by defecting.

Soft Majority SMAJ Plays the majority move of the opponent.

On the first move, or when there is a tie, it

defects.

Tit-For-Tat TFT Starts by cooperating and then plays the last

move of the opponent.

Tit-For-Two-Tats TFTT Like tit-for-tat but only defects after two

consecutive defections against it.


Table A.2. Reference performance of the test popu-

lation in the (a) noise free and (b) noisy IPD tourna-

ment. Results are averaged over 1000 repeated tour-

naments and the standard error of the mean is ±0.002.

(a) Noise free IPD tournament

Strategy Score

ADAP 2.888

GRAD 2.860

GRIM 2.773

TFT 2.647

FORG 2.627

GTFT 2.591

SMAJ 2.575

TFTT 2.544

PAVL 2.390

ALLC 2.332

PCD 2.279

HMAJ 2.277

STFT 2.233

PCCD 2.190

ALLD 2.175

RAND 2.114

NEG 2.111

PDDC 2.081

(b) Noisy IPD tournament

Strategy Score

GRAD 2.410

ADAP 2.329

GRIM 2.297

SMAJ 2.292

ALLD 2.278

TFT 2.245

FORG 2.211

TFTT 2.204

GTFT 2.198

PCCD 2.185

PCD 2.179

STFT 2.155

RAND 2.143

PAVL 2.140

HMAJ 2.134

NEG 2.112

PDDC 2.110

ALLC 2.043

References

Axelrod, R. (1984). The Evolution of Cooperation (Basic Books).

Axelrod, R. (1997). The Complexity of Cooperation (Princeton University Press).

Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science

211, pp. 1390–1396.

Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner’s

dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.

Beaufils, B., Delahaye, J. P. and Mathieu, P. (1997). Our meeting with gradual:

A good strategy for the iterated prisoner’s dilemma, in Proceedings of the

Fifth International Workshop on the Synthesis and Simulation of Living

Systems (MIT Press), pp. 202–212.

Delahaye, J. P. and Mathieu, P. (1993). L’altruisme perfectionne, Pour la Science

187, pp. 102–107.

Hamilton, W. D. (1963). The evolution of altruistic behaviour, Am. Nat. 97, pp.

354–356.

Hamilton, W. D. (1964). The genetical evolution of social behaviour, J. Theor.

Biol. 7, pp. 1–16.

MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms

(Cambridge University Press).


Nowak, M. and Sigmund, K. (1993). A strategy of win-stay, lose-shift that out-

performs tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.

O'Riordan, C. (2000). A forgiving strategy for the iterated prisoner's dilemma, Journal of Artificial Societies and Social Simulation 3, 4, pp. 56–58.

Peterson, W. W. and Weldon, E. J. (1972). Error-Correcting Codes (MIT Press).

Shannon, C. E. (1948). A mathematical theory of communication, The Bell Sys-

tem Technical Journal 27, pp. 379–423, 623–656.

Trivers, R. (1971). The evolution of reciprocal altruism, Quarterly Review of

Biology 46, pp. 35–57.

Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of

the Sixth International Conference on the Simulation of Adaptive Behavior

(SAB-2000), Vol. 2, pp. 334–340.


Chapter 10

Is it Accidental or Intentional? A Symbolic Approach to

the Noisy Iterated Prisoner’s Dilemma

Tsz-Chiu Au, Dana Nau

University of Maryland

10.1. Introduction

The Iterated Prisoner’s Dilemma (IPD) has become well known as an ab-

stract model of a class of multi-agent environments in which agents accu-

mulate payoffs that depend on how successful they are in their repeated

interactions with other agents. An important variant of the IPD is the

Noisy IPD, in which there is a small probability, called the noise level, that

accidents will occur. In other words, the noise level is the probability of

executing “cooperate” when “defect” was the intended move, or vice versa.

Accidents can cause difficulty in cooperation with others in real-life situations, and the same is true in the Noisy IPD. Strategies that do quite well

in the ordinary (non-noisy) IPD may do quite badly in the Noisy IPD [Axel-

rod and Dion (1988); Bendor (1987); Bendor et al. (1991); Molander (1985);

Mueller (1987); Nowak and Sigmund (1990)]. For example, if two players

both use the well-known Tit-For-Tat (TFT) strategy, then an accidental

defection may cause a long series of defections by both players as each of

them punishes the other for defecting.

This chapter reports on a strategy called the Derived Belief Strategy

(DBS), which was the best-performing non-master-slave strategy in Cate-

gory 2 (noisy environments) of the 2005 Iterated Prisoner’s Dilemma com-

petition (see Table 10.1).

Like most opponent-modeling techniques, DBS attempts to learn a

model of the other player’s strategy (i.e., the opponent model∗) during the

∗The term “opponent model” appears to be the most common term for a model of the

other player, even though this player is not necessarily an “opponent” (since the IPD is

not zero-sum).


Table 10.1. Scores of the best programs in Competition 2 (IPD with Noise). The

table shows each program’s average score for each run and its overall average over

all five runs. The competition included 165 programs, but we have listed only the

top 25.

Rank Program Author Run1 Run2 Run3 Run4 Run5 Avg.

1 BWIN P. Vytelingum 441.7 431.7 427.1 434.8 433.5 433.8

2 IMM01 J.W. Li 424.7 414.6 414.7 409.1 407.5 414.1

3 DBSz T.C. Au 411.7 405.0 406.5 407.7 409.2 408.0

4 DBSy T.C. Au 411.9 407.5 407.9 407.0 405.5 408.0

5 DBSpl T.C. Au 409.5 403.8 411.4 403.9 409.1 407.5

6 DBSx T.C. Au 401.9 410.5 407.7 408.4 404.4 406.6

7 DBSf T.C. Au 399.2 402.2 405.2 398.9 404.4 402.0

8 DBStft T.C. Au 398.4 394.3 402.1 406.7 407.3 401.8

9 DBSd T.C. Au 406.0 396.0 399.1 401.8 401.5 400.9

10 lowESTFT classic M. Filzmoser 391.6 395.8 405.9 393.2 399.4 397.2

11 TFTIm T.C. Au 399.0 398.8 395.0 396.7 395.3 397.0

12 Mod P. Hingston 394.8 394.2 407.8 394.1 393.7 396.9

13 TFTIz T.C. Au 397.7 396.1 390.7 392.1 400.6 395.5

14 TFTIc T.C. Au 400.1 401.0 389.5 388.9 389.2 393.7

15 DBSe T.C. Au 396.9 386.8 396.7 394.5 393.7 393.7

16 TTFT L. Clement 389.1 395.8 394.1 393.4 394.7 393.4

17 TFTIa T.C. Au 389.5 394.4 395.1 389.6 397.7 393.3

18 TFTIb T.C. Au 391.7 390.0 390.5 401.0 392.4 393.1

19 TFTIx T.C. Au 398.3 391.3 390.8 391.0 393.7 393.0

20 mediumESTFT classic M. Filzmoser 396.7 392.6 398.3 390.8 386.0 392.9

21 TFTIy T.C. Au 391.7 394.6 390.8 392.1 394.9 392.8

22 TFTId T.C. Au 395.6 393.1 388.8 385.7 391.3 390.9

23 TFTIe T.C. Au 396.7 391.1 385.2 388.2 393.5 390.9

24 DBSb T.C. Au 393.2 386.1 392.6 391.1 391.0 390.8

25 T4T D. Fogel 391.5 387.6 400.4 387.3 383.5 390.0

games. Our main innovation involves how to reason about noise using the

opponent model.

The key idea used in DBS is something that we call symbolic noise

detection—the use of the other player’s deterministic behavior to tell

whether an action has been affected by noise. More precisely, DBS builds

a symbolic model of how the other player behaves, and watches for any

deviation from this model. If the other player’s next move is inconsistent

with its past behavior, this inconsistency can be due either to noise or to

a genuine change in its behavior; and DBS can often distinguish between

these two cases by waiting to see whether this inconsistency persists in the

Page 244: The Iterated Prisoners Dilemma 20 Years on Advances in Natural Computation.9789812706973.28764

March 14, 2007 8:42 World Scientific Review Volume - 9in x 6in chapter10

Is it Accidental or Intentional? 233

next few iterations of the game.†

Of the nine different versions of DBS that we entered into the competition, all of them placed in the top 25, and seven of them placed among the top ten (see Table 10.1). Our best version, DBSz, placed third; and the two

players that placed higher were both masters of master-and-slave teams.

DBS operates in a distinctly different way from the master-and-slaves

strategy used by several other entrants in the competition. Each participant

in the competition was allowed to submit up to 20 programs as contestants.

Some participants took advantage of this to submit collections of programs

that worked together in a conspiracy in which 19 of their 20 programs (the

“slaves”) worked to give as many points as possible to the 20th program

(the “master”). DBS does not use a master-and-slaves strategy, nor does it

conspire with other programs in any other way. Nonetheless, DBS remained

competitive with the master-and-slaves strategies in the competition, and

performed much better than the master-and-slaves strategies if the score of

each master is averaged with the scores of its slaves. Furthermore, a more

extensive analysis [Au and Nau (2005)] shows that if each master-and-slaves

team had been limited to 10 programs or less, DBS would have placed first

in the competition.

10.2. Motivation and Approach

The techniques used in DBS are motivated by a British army officer’s storythat was quoted in (Axelrod, 1997, page 40):

I was having tea with A Company when we heard a lot of

shouting and went out to investigate. We found our men and

the Germans standing on their respective parapets. Suddenly

a salvo arrived but did no damage. Naturally both sides got

down and our men started swearing at the Germans, when all

at once a brave German got onto his parapet and shouted out:

“We are very sorry about that; we hope no one was hurt. It

is not our fault. It is that damned Prussian artillery.” (Rutter

1934, 29)

Such an apology was an effective way of resolving the conflict and preventing

a retaliation because it told the British that the salvo was not the intention

of the German infantry, but instead was an unfortunate accident that the

German infantry neither expected nor desired. The apology was convincing because it was consistent with the German infantry's past

†An iteration has also been called a period or a round by some authors.


behavior. The British had ample evidence to believe that the German

infantry wanted to keep the peace just as much as the British infantry did.

More generally, an important question for conflict prevention in noisy

environments is whether misconduct is intentional or accidental. A devia-

tion from the usual course of action in a noisy environment can be explained

in either way. If we form the wrong belief about which explanation is cor-

rect, our response may potentially destroy our long-term relationship with

the other player. If we ground our belief on evidence accumulated before

and after the incident, we should be in a better position to identify the true

cause and prescribe an appropriate solution. To accomplish this, DBS uses

the following key techniques:

(1) Learning about the other player’s strategy. DBS uses an induc-

tion technique to identify a set of rules that model the other player’s

recent behavior. The rules give the probability that the player will

cooperate under different situations. As DBS learns these probabili-

ties during the game, it identifies a set of deterministic rules that have

either 0 or 1 as the probability of cooperation.

(2) Detecting noise. DBS uses the above rules to detect anomalies that

may be due either to noise or a genuine change in the other player’s

behavior. If a move is different from what the deterministic rules pre-

dict, this inconsistency triggers an evidence collection process that will

monitor the persistence of the inconsistency in the next few iterations

of the game. The purpose of the evidence-collection process is to deter-

mine whether the violation is likely to be due to noise or to a change

in the other player’s policy. If the inconsistency does not persist, DBS

asserts that the deviation is due to noise; if the inconsistency persists,

DBS assumes there is a change in the other player’s behavior.

(3) Temporarily tolerating possible misbehaviors by the other

player. Until the evidence-collection process finishes, DBS assumes

that the other player’s behavior is still as described by the determin-

istic rules. Once the evidence collection process has finished, DBS de-

cides whether to believe the other player’s behavior has changed, and

updates the deterministic rules accordingly.
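The following Python fragment is a deliberately simplified sketch of this violate-then-verify idea. The names (RuleMonitor, PROMOTION_THRESHOLD) and the threshold value are our own assumptions for illustration; the actual DBS bookkeeping, outlined in Section 10.5, is richer.

    # Simplified sketch of symbolic noise detection: a rule violation is tolerated
    # for a few iterations, and a change of behavior is accepted only if it persists.
    PROMOTION_THRESHOLD = 3   # assumed: consecutive violations needed to accept a change

    class RuleMonitor:
        def __init__(self):
            self.rules = {}        # condition (our last move, their last move) -> predicted move
            self.violations = {}   # consecutive violation count per condition

        def observe(self, condition, their_move):
            predicted = self.rules.get(condition)
            if predicted is None or predicted == their_move:
                self.rules[condition] = their_move        # consistent: keep (or create) the rule
                self.violations[condition] = 0
                return "consistent"
            self.violations[condition] = self.violations.get(condition, 0) + 1
            if self.violations[condition] >= PROMOTION_THRESHOLD:
                self.rules[condition] = their_move        # the inconsistency persisted: accept the change
                self.violations[condition] = 0
                return "behavior changed"
            return "possibly noise"                       # tolerate it for now

    monitor = RuleMonitor()
    print([monitor.observe(('C', 'C'), m) for m in ['C', 'C', 'D', 'C']])
    # -> ['consistent', 'consistent', 'possibly noise', 'consistent']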

Since DBS emphasizes the use of deterministic behaviors to distinguish

noise from the change of the other player’s behavior, it works well when

the other player uses a pure (i.e., deterministic) strategy or a strategy that

makes decisions deterministically most of the time. Fortunately, determin-

istic behaviors are abundant in the Iterated Prisoner’s Dilemma. Many


well-known strategies, such as TFT and GRIM, are pure strategies. Some

strategies such as Pavlov or Win-Stay, Lose-Shift strategy (WSLS) [Kraines

and Kraines (1989, 1993, 1995); Nowak and Sigmund (1993)] are not pure

strategies, but a large part of their behavior is still deterministic. The rea-

son for the prevalence of determinism is discussed by Axelrod in [Axelrod

(1984)]: clarity of behavior is an important ingredient of long-term cooper-

ation. A strategy such as TFT benefits from its clarity of behavior, because

it allows other players to make credible predictions of TFT’s responses to

their actions. We believe that our strategy succeeded in the competition because this clarity of behavior also helps us to fend off noise.

The results of the competition show that the techniques used in DBS

are indeed an effective way to fend off noise and maintain cooperation in

noisy environments. When DBS defers judgment about whether the other

player’s behavior has changed, the potential cost is that DBS may not

be able to respond to a genuine change of the other player’s behavior as

quickly as possible, thus losing a few points by not retaliating immediately.

But this delay is only temporary, and after it DBS will adapt to the new

behavior. More importantly, the techniques used in DBS greatly reduce

the probability that noise will cause it to end a cooperation and fall into

a mutual-defect situation. Our experience has been that it is hard to re-

establish cooperation from a mutual-defection situation, so it is better avoid

getting into mutual defection situations in the first place. When compared

with the potential cost of ending an cooperation, the cost of temporarily

tolerating some defections is worthwhile.

Temporary tolerance also benefits us in another way. In the noisy It-

erated Prisoner’s Dilemma, there are two types of noise: one that affects

the other player’s move, and the other affects our move. While our method

effectively handles the first type of noise, it is the other player’s job to deal

with the second type of noise. Some players such as TFT are easily pro-

voked by the second type of noise and retaliate immediately. Fortunately, if

the retaliation is not a permanent one, our method will treat the retaliation

in the same way as the first type of noise, thus minimizing its effect.

10.3. Iterated Prisoner’s Dilemma with Noise

In the Iterated Prisoner’s Dilemma, two players play a finite sequence of

classical prisoner’s dilemma games, whose payoff matrix is:


                           Player 2
                      Cooperate        Defect
Player 1  Cooperate   (u_CC, u_CC)    (u_CD, u_DC)
          Defect      (u_DC, u_CD)    (u_DD, u_DD)

where u_DC > u_CC > u_DD > u_CD and 2u_CC > u_DC + u_CD. In the competition, u_DC, u_CC, u_DD and u_CD are 5, 3, 1 and 0, respectively.

At the beginning of the game, each player knows nothing about the

other player and does not know how many iterations it will play. In each

iteration, each player chooses either to cooperate (C) or defect (D), and

their payoffs in that iteration are as shown in the payoff matrix. We call

this decision a move or an action. After both players choose a move, they

will each be informed of the other player’s move before the next iteration

begins.

If a_k, b_k ∈ {C, D} are the moves of Player 1 and Player 2 in iteration k, then we say that (a_k, b_k) is the interaction of iteration k. If there are N iterations in a game, then the total scores for Player 1 and Player 2 are \sum_{1 \le k \le N} u_{a_k b_k} and \sum_{1 \le k \le N} u_{b_k a_k}, respectively.
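As a small worked example (our own helper, not from the chapter), the two totals can be computed directly from the interaction history using the competition payoffs quoted above.

    # Competition payoffs: u_DC = 5, u_CC = 3, u_DD = 1, u_CD = 0.
    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

    def total_scores(history):
        # history is the list of interactions (a_k, b_k) of a game
        score1 = sum(PAYOFF[(a, b)] for a, b in history)   # Player 1: sum of u_{a_k b_k}
        score2 = sum(PAYOFF[(b, a)] for a, b in history)   # Player 2: sum of u_{b_k a_k}
        return score1, score2

    print(total_scores([('C', 'C'), ('C', 'C'), ('C', 'D')]))   # -> (6, 11)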

The Noisy Iterated Prisoner’s Dilemma is a variant of the Iterated Pris-

oner’s Dilemma in which there is a small probability that a player’s moves

will be mis-implemented. The probability is called the noise level.‡ In other

words, the noise level is the probability of executing C when D was the in-

tended move, or vice versa. The incorrect move is recorded as the player’s

move, and determines the interaction of the iteration.§ Furthermore, nei-

ther player has any way of knowing whether the other player’s move was

executed correctly or incorrectly.¶
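For illustration, this kind of mis-implementation can be modelled in a few lines of Python (our own sketch; the 0.1 noise level is the one used in the competition, as noted in the footnote).

    import random

    def execute(intended_move, noise_level=0.1):
        # Return the move that is actually recorded: with probability noise_level
        # the intended move is flipped, and neither player is told that this happened.
        if random.random() < noise_level:
            return 'D' if intended_move == 'C' else 'C'
        return intended_move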

For example, suppose Player 1 chooses C and Player 2 chooses D in

iteration k, and noise occurs and affects the Player 1’s move. Then the

interaction of iteration k is (D,D). However, since both players do not

know that the Player 1’s move has been changed by noise, Player 1 and

Player 2 perceive the interaction differently: for Player 1, the interaction is

(C,D), but for Player 2, the interaction is (D,D). As in real life, this mis-

understanding would become an obstacle in establishing and maintaining

‡The noise level in the competition was 0.1.

§Hence, a mis-implementation is different from a misperception, which would not change

the interaction of the iteration. The competition included mis-implementations but no

misperceptions.

¶As far as we know, the definitions of “mis-implementation” used in the existing litera-

ture are ambiguous about whether either of the players should know that an action has

been mis-executed.


cooperation between the players.

10.4. Strategies, Policies, and Hypothesized Policies

A history H of length k is the sequence of interactions of all iterations up to and including iteration k. We write H = 〈(a_1, b_1), (a_2, b_2), . . . , (a_k, b_k)〉. Let ℋ = {(C,C), (C,D), (D,C), (D,D)}^* be the set of all possible histories. A strategy M : ℋ → [0, 1] associates with each history H a real number called the degree of cooperation. M(H) is the probability that M chooses to cooperate at iteration k + 1, where k = |H| is H's length.

For example, TFT can be considered as a function M_TFT, such that (1) M_TFT(H) = 1.0 if k = 0 or a_k = C (where k = |H|), and (2) M_TFT(H) = 0.0 otherwise; Tit-for-Two-Tats (TFTT), which is like TFT except that it defects only after it receives two consecutive defections, can be considered as a function M_TFTT, such that (1) M_TFTT(H) = 0.0 if k ≥ 2 and a_{k−1} = a_k = D, and (2) M_TFTT(H) = 1.0 otherwise.

We can model a strategy as a policy. A condition Cond : ℋ → {True, False} is a mapping from histories to boolean values. A history H satisfies a condition Cond if and only if Cond(H) = True. A policy schema Ω is a set of conditions such that each history in ℋ satisfies exactly one of the conditions in Ω.

of the conditions in Ω. A rule is a pair (Cond, p), which we will write as

Cond → p, where Cond is a condition and p is a degree of cooperation

(a real number in [0, 1] ). A rule is deterministic if p is either 0.0 or 1.0;

otherwise, the rule is probabilistic. In this paper, we define a policy to be a

set of rules whose conditions constitute a policy schema.

M_TFT can be modeled as a policy as follows: we define Cond_{a,b} to be a condition about the interactions of the last iteration of a history, such that Cond_{a,b}(H) = True if and only if (1) k ≥ 1, a_k = a and b_k = b (where k = |H|), or (2) k = 0 and a = b = C. For simplicity, we also write Cond_{a,b} as (a, b). The policy for M_TFT is π_TFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}. Notice that the policy schema for π_TFT is Ω = {(C,C), (C,D), (D,C), (D,D)}.

Given a policy π and a history H, there is one and only one rule Cond → p in π such that Cond(H) = True. We write p as π(H). A policy π is complete for a strategy M if and only if π(H) = M(H) for any H ∈ ℋ. In other words, a complete policy for a strategy is one that completely models the strategy. For instance, π_TFT is a complete policy for M_TFT.

Some strategies are much more complicated than TFT—we need a large

number of rules in order to completely model these strategies. If the number

Page 249: The Iterated Prisoners Dilemma 20 Years on Advances in Natural Computation.9789812706973.28764

March 14, 2007 8:42 World Scientific Review Volume - 9in x 6in chapter10

238 T-C. Au and D. Nau

of iterations is small and the strategy is complicated enough, it is difficult

or impossible for DBS to obtain a complete model of the other player’s

strategy. Therefore, DBS does not aim at obtaining a complete policy of

the other player’s strategy; instead, DBS leans an approximation of the

other player’s strategy during a game, using a small number of rules. In

order to distinguish this approximation from the complete policies for a

strategy, we call this approximation a hypothesized policy.

Given a policy schema Ω, DBS constructs a hypothesized policy π whose

policy schema is Ω. The degrees of cooperation of the rules in π are esti-

mated by a learning function (e.g., the learning methods in Section 10.6),

which computes the degrees of cooperation according to the current his-

tory. For example, suppose the other player’s strategy is MTFTT

, the given

policy schema is Ω = (C,C), (C,D), (D,C), (D,D), and the current his-

tory is H = (C,C), (D,C), (C,C), (D,C), (D,C), (D,D), (C,D), (C,C).

If we use a learning method which computes the degrees of cooperation by

averaging the number of time the next action is C when a condition holds,

then the hypothesized policy is π = (C,C)→ 1.0, (C,D)→ 1.0, (D,C)→

0.66, (D,D) → 0.0. Notice that the rule (D,C) → 0.66 does not accu-

rately model MTFTT

; this probabilistic rule is just an approximation of

what MTFTT

does when the condition (D,C) holds. This approximation is

inaccurate as long as the policy schema contains (D,C)—there is no com-

plete policy for MTFTT

whose policy schema contains (D,C). If we want

to model MTFTT

correctly, we need a different policy schema that allows

us to specify more complicated rules.

We interpret a hypothesized policy as a belief of what the other player

will do in the next few iterations in response to our next few moves. This

belief does not necessarily hold in the long run, since the other player can

behave differently at different times in a game. Even worse, there is no

guarantee that this belief is true in the next few iterations. Nonetheless,

hypothesized policies constructed by DBS usually have a high degree of

accuracy in predicting what the other player will do.

This belief is subjective—it depends on the choice of the policy schema

and the learning function. We formally define this subjective viewpoint as

follows. The hypothesized policy space spanned by a policy schema Ω and a learning function L : Ω × ℋ → [0, 1] is a set of policies Π = {π(H) : H ∈ ℋ}, where π(H) = {Cond → L(Cond, H) : Cond ∈ Ω}. Let H be a history of a game in which the other player's strategy is M. The set of all possible hypothesized policies for M in this game is {π(H_k) : H_k ∈ prefixes(H)} ⊆ Π, where prefixes(H) is the set of all prefixes of H, and H_k is the prefix


of length k of H. We say π(H_k) is the current hypothesized policy of M in the iteration k. A rule Cond → p in π(H_k) describes a particular behavior of the other player's strategy in the iteration k. The behavior is deterministic if p is either zero or one; otherwise, the behavior is random or probabilistic. If π(H_k) ≠ π(H_{k+1}), we say there is a change of the hypothesized policy in the iteration k + 1, and the behaviors described by the rules in (π(H_k) \ π(H_{k+1})) have changed.

10.5. Derived Belief Strategy

In the ordinary Iterated Prisoner’s Dilemma (i.e., without any noise), if

we know the other player’s strategy and how many iterations in a game,

we can compute an optimal strategy against the other player by trying

every possible sequence of moves to see which sequence yields the highest

score, assuming we have sufficient computational power. However, we are

missing both pieces of information. So it is impossible for us to compute

an optimal strategy, even with sufficient computing resources. Therefore,

we can at most predict the other player’s moves based on the history of a

game, bearing in mind that the game may terminate at any time.

Some strategies for the Iterated Prisoner’s Dilemma do not predict the

other player’s moves at all. For example, Tit-for-Tat and GRIM react de-

terministically to the other player’s previous moves according to fixed sets

of rules, no matter how the other player actually plays. Many strategies

adapt to the other player’s strategy over the course of the game: for exam-

ple, Pavlov [Kraines and Kraines (1989)] adjusts its degree of cooperation

according to the history of a game. However, these strategies do not take

any prior information about the other player’s strategy as an input; thus

they are unable to make use of this important piece of information even

when it is available.

Let us consider a class of strategies that make use of a model of the other

player’s strategy to make decisions. Figure 10.1 shows an abstract represen-

tation of these strategies. Initially, these strategies start out by assuming

that the other player’s strategy is TFT or some other strategy. In every

iteration of the game, the model is updated according to the current history

(using UpdateModel). These strategies decide which move to make

in each iteration using a move generator (GenerateMove), which depends

on the current model of the other player's strategy in that iteration.
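As an illustration of this abstract loop (a minimal, runnable sketch only; InitialModel, UpdateModel, and GenerateMove are reduced to trivial placeholders that a strategy such as DBS would replace with the methods described below):

    # A toy version of the loop in Figure 10.1; all names are illustrative.
    def initial_model():
        # assume-TFT model: probability the other player cooperates next
        return {('C','C'): 1.0, ('C','D'): 1.0, ('D','C'): 0.0, ('D','D'): 0.0}

    def update_model(model, history):
        return model            # placeholder: DBS re-estimates the model here

    def generate_move(model, history):
        if not history:
            return 'C'
        return 'C' if model[history[-1]] >= 0.5 else 'D'   # placeholder rule

    def play(opponent_move, iterations=5):
        """opponent_move: callable returning the other player's move each iteration."""
        model, history = initial_model(), []
        a = generate_move(model, history)
        for _ in range(iterations):
            b = opponent_move(history)
            history.append((a, b))
            model = update_model(model, history)
            a = generate_move(model, history)
        return history

    print(play(lambda h: 'C'))   # against an always-cooperating player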

DBS belongs to this class of strategies. DBS maintains a model of

the other player in the form of a hypothesized policy throughout a game, and


Procedure StrategyUsingModelOfTheOtherPlayer()
    π ← InitialModel()          // the current model of the other player
    H ← ∅                       // the current history
    a ← GenerateMove(π, H)      // the initial move
    Loop until the end of the game
        Output our move a and obtain the other player's move b
        H ← ⟨H, (a, b)⟩
        π ← UpdateModel(π, H)
        a ← GenerateMove(π, H)
    End Loop

Fig. 10.1. An abstract representation of a class of strategies that generate moves using

a model of the other player.

makes decisions based on this hypothesized policy. The key issue for DBS in

this process is how to maintain a good approximation of the other player’s

strategy, despite the fact that some actions in the history are affected by noise. A

good approximation will increase the quality of moves generated by DBS,

since the move generator in DBS depends on an accurate model of the other

player’s behavior.

The approach DBS uses to minimize the effect of noise on the hypoth-

esized policy has been discussed in Section 10.2: temporarily tolerate pos-

sible misbehaviors by the other player, and then update the hypothesized

policy only if DBS believes that the misbehavior is due to a genuine change

of behaviors. Figure 10.2 shows an outline of the implementation of this

approach in DBS. As we can see, DBS does not maintain the hypothesized

policy explicitly; instead, DBS maintains three sets of rules: the default

rule set (Rd), the current rule set (Rc), and the probabilistic rule set (Rp).

DBS combines these rule sets to form a hypothesized policy for move gen-

eration. In addition, DBS maintains several auxiliary variables (promotion

counts and violation counts) to facilitate the update of these rule sets. We

will explain every line in Figure 10.2 in detail in the next section.

10.6. Learning Hypothesized Policies in Noisy Environments

We will describe how DBS learns and maintains a hypothesized policy for

the other player’s strategy in this section. Section 10.6.1 describes how

DBS uses discounted frequencies for each behavior to estimate the degree of
cooperation of each rule in the hypothesized policy. Section 10.6.2 explains
why using discounted frequencies alone is not sufficient for constructing an
accurate model of the other player's strategy in the presence of noise, and
how symbolic noise detection and temporary tolerance can help overcome
the difficulty of using discounted frequencies alone. Section 10.6.3 presents
the induction technique DBS uses to identify deterministic behaviors in the
other player. Section 10.6.4 illustrates how DBS defers judgment about
whether an anomaly is due to noise. Section 10.6.5 discusses how DBS
updates the hypothesized policy when it detects a change of behavior.

Procedure DerivedBeliefStrategy()
 1. Rd ← πTFT                     // the default rule set
 2. Rc ← ∅                        // the current rule set
 3. a0 ← C ; b0 ← C ; H ← ⟨(a0, b0)⟩ ; π ← Rd ; k ← 1 ; v ← 0
 4. a1 ← MoveGen(π, H)
 5. Loop until the end of the game
 6.     Output ak and obtain the other player's move bk
 7.     r+ ← ((ak−1, bk−1) → bk)
 8.     r− ← ((ak−1, bk−1) → ({C, D} \ {bk}))
 9.     If r+, r− ∉ Rc, then
10.         If ShouldPromote(r+) = true, then insert r+ into Rc.
11.     If r+ ∈ Rc, then set the violation count of r+ to zero
12.     If r− ∈ Rc and ShouldDemote(r−) = true, then
13.         Rd ← Rc ∪ Rd ; Rc ← ∅ ; v ← 0
14.     If r− ∈ Rd, then v ← v + 1
15.     If v > RejectThreshold, or (r+ ∈ Rc and r− ∈ Rd), then
16.         Rd ← ∅ ; v ← 0
17.     Rp ← {(Cond → p′) ∈ ψk+1 : Cond does not appear in Rc or Rd}
18.     π ← Rc ∪ Rd ∪ Rp          // construct a hypothesized policy
19.     H ← ⟨H, (ak, bk)⟩ ; ak+1 ← MoveGen(π, H) ; k ← k + 1
20. End Loop

Fig. 10.2. An outline of the DBS strategy. ShouldPromote first increases r+'s promotion
count, and then if r+'s promotion count exceeds the promotion threshold, ShouldPromote
returns true and resets r+'s promotion count. Likewise, ShouldDemote first increases
r−'s violation count, and then if r−'s violation count exceeds the violation threshold,
ShouldDemote returns true and resets r−'s violation count. Rp in Line 17 is the proba-
bilistic rule set; ψk+1 in Line 17 is calculated from Equation 10.1.


10.6.1. Learning by Discounted Frequencies

We now describe a simple way to estimate the degree of cooperation of

the rules in the hypothesized policy. The idea is to maintain a discounted

frequency for each behavior: instead of keeping an ordinary frequency count

of how often the other player cooperates under a condition in the past, DBS

applies discount factors based on how recent each occurrence of the behavior

was.

Given a history H = ⟨(a1, b1), (a2, b2), . . . , (ak, bk)⟩, a real number α
between 0 and 1 (called the discount factor), and an initial hypothesized
policy π0 = {Cond_1 → p_1^0, Cond_2 → p_2^0, . . . , Cond_n → p_n^0} whose policy
schema is C = {Cond_1, Cond_2, . . . , Cond_n}, the probabilistic policy at iteration
k + 1 is ψk+1 = {Cond_1 → p_1^{k+1}, Cond_2 → p_2^{k+1}, . . . , Cond_n → p_n^{k+1}},
where p_i^{k+1} is computed by the following equation:

    p_i^{k+1} = ( Σ_{0≤j≤k} α^{k−j} g_j ) / ( Σ_{0≤j≤k} α^{k−j} f_j )          (10.1)

and where

    g_j = p_i^0   if j = 0,
          1       if 1 ≤ j ≤ k, Cond_i(H_{j−1}) = True, and b_j = C,
          0       otherwise;

    f_j = p_i^0   if j = 0,
          1       if 1 ≤ j ≤ k and Cond_i(H_{j−1}) = True,
          0       otherwise;

    H_{j−1} = ∅   if j = 1,
              ⟨(a1, b1), (a2, b2), . . . , (a_{j−1}, b_{j−1})⟩   otherwise.

In short, the current history H has k + 1 possible prefixes, and f_j is basically
a boolean function indicating whether the prefix of H up to the (j − 1)'th
iteration satisfies Cond_i; g_j is a restricted version of f_j.

When α = 1, p_i is approximately equal to the frequency with which the
other player cooperates when Cond_i holds. When α is less than 1, p_i becomes a weighted sum of

the frequencies that gives more weight to recent events than earlier ones.

For our purposes, it is important to use α < 1, because it may happen that

the other player changes its behavior suddenly, and therefore we should

forget about its past behavior and adapt to its new behavior (for instance,

when GRIM is triggered). In the competition, we used α = 0.75.
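As an illustration (a simplified sketch, not our competition implementation; the function and variable names are hypothetical), the following Python fragment evaluates Equation 10.1 for a single condition Cond_i:

    # Discounted-frequency estimate of the degree of cooperation for one rule.
    # 'history' is a list of (a, b) pairs, 'cond' tests a history prefix,
    # 'p0' is the initial degree of cooperation, 'alpha' is the discount factor.
    def discounted_degree_of_cooperation(history, cond, p0, alpha=0.75):
        k = len(history)
        num = den = alpha ** k * p0          # j = 0 term: g_0 = f_0 = p0
        for j in range(1, k + 1):
            if cond(history[:j - 1]):        # Cond_i(H_{j-1}) = True
                b_j = history[j - 1][1]      # the other player's move in iteration j
                num += alpha ** (k - j) * (1.0 if b_j == 'C' else 0.0)
                den += alpha ** (k - j)
        return num / den if den > 0 else p0  # fall back to p0 if never applicable

    # The (D,C) rule from the example in Section 10.4, now with discounting:
    history = [('C','C'), ('D','C'), ('C','C'), ('D','C'),
               ('D','C'), ('D','D'), ('C','D'), ('C','C')]
    cond_dc = lambda h: len(h) >= 1 and h[-1] == ('D', 'C')
    print(discounted_degree_of_cooperation(history, cond_dc, p0=1.0))
    # about 0.57: lower than the plain average 0.66, since the recent defection
    # receives more weight.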

An important question is how large a policy schema to use for the hy-

pothesized policy. If the policy schema is too small, the policy schema won’t


provide enough detail to give useful predictions of the other player’s behav-

ior. But if the policy schema is too large, DBS will be unable to compute

an accurate approximation of each rule’s degree of cooperation, because the

number of iterations in the game will be too small. In the competition, we

used a policy schema of size 4: {(C,C), (C,D), (D,C), (D,D)}. We have

found this to be good enough for modeling a large number of strategies.

It is essential to have a good initial hypothesized strategy because at

the beginning of the game the history is not long enough for us to derive

any meaningful information about the other player’s strategy. In the com-

petition, the initial hypothesized policy is πTFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}.

10.6.2. Deficiencies of Discounted Frequencies in Noisy Environments

It may appear that the probabilistic policy learned by the discounted-

frequency learning technique should be inherently capable of tolerating

noise, because it takes many, if not all, moves in the history into account:

if the number of terms in the calculation of the average or weighted average

is large enough, the effect of noise should be small. However, there is a

problem with this reasoning: it neglects the effect of multiple occurrences

of noise within a small time interval.

A mis-implementation that alters the move of one player would distort

an established pattern of behavior observed by the other player. The gen-

eral effect of such a distortion on Equation 10.1 is hard to predict; it varies

with the value of the parameters and the history. But if several distortions

occur within a small time interval, the distortion may be big enough to al-

ter the probabilistic policy and hence change our decision about what move

to make. This change of decision may potentially destroy an established

pattern of mutual cooperation between the players.

At first glance, it might seem rare for several noise events to occur at

nearly the same time. But if the game is long enough, the probability of it

happening can be quite high. The probability of getting two noise events in

two consecutive iterations out of a sequence of i iterations can be computed

recursively as X_i = p(p + qX_{i−2}) + qX_{i−1}, provided that X_0 = X_1 = 0,
where p is the probability of a noise event and q = 1 − p. In the competition,
the noise level was p = 0.1 and i = 200, which gives X_200 ≈ 0.84. Similarly,

the probabilities of getting three and four noises in consecutive iterations

are 0.16 and 0.018, respectively.
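This recursion is easy to check numerically; the following small Python fragment (an illustrative sketch, not part of the original analysis) reproduces X_200 ≈ 0.84 for p = 0.1:

    # Probability of two noise events in two consecutive iterations within i iterations.
    def prob_two_consecutive_noises(i, p=0.1):
        q = 1.0 - p
        x = [0.0, 0.0]                       # X_0 = X_1 = 0
        for _ in range(2, i + 1):
            x.append(p * (p + q * x[-2]) + q * x[-1])
        return x[i]

    print(round(prob_two_consecutive_noises(200), 2))   # about 0.84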


In the 2005 competition, there were 165 players, and each player played

each of the other players five times. This means every player played 825

games. On average, there were 693 games having two noises in two consecu-

tive iterations, 132 games having three noises in three consecutive iterations,

and 15 games having four noises in four consecutive iterations. Clearly, we

did not want to ignore situations in which several noises occur nearly at

the same time.

Symbolic noise detection and temporary tolerance outlined in Sec-

tion 10.2 provide a way to reduce the amount of susceptibility to multi-

ple occurrences of noise in a small time interval. Deterministic rules enable

DBS to detect anomalies in the observed behavior of the other player. DBS

temporarily ignores the anomalies which may or may not be due to noise,

until a better conclusion about the cause of the anomalies can be drawn.

This temporary tolerance prevents DBS from learning from the moves that

may be affected by noise, and hence protects the hypothesized policy from

the influence of errors due to noise. Since the amount of tolerance (and the

accuracy of noise detection) can be controlled by adjusting parameters in

DBS, we can reduce the amount of susceptibility to multiple occurrences of

noise by increasing the amount of tolerance, at the expense of a higher cost

of noise detection—losing more points when a change of behavior occurs.

10.6.3. Identifying Deterministic Rules Using Induction

As we discussed in Section 10.2, deterministic behaviors are abundant in the

Iterated Prisoner’s Dilemma. Deterministic behaviors can be modeled by

deterministic rules, whereas random behavior would require probabilistic

rules.

A nice feature of deterministic rules is that they have only two

possible degrees of cooperation: zero or one, as opposed to an infinite set of

possible degrees of cooperation of the probabilistic rules. Therefore, there

should be ways to learn deterministic rules that are much faster than the

discounted frequency method described earlier. For example, if we knew at

the outset which rules were deterministic, it would take only one occurrence

to learn each of them: each time the condition of a deterministic rule was

satisfied, we could assign a degree of cooperation of 1 or 0 depending on

whether the player’s move was C or D.

The trick, of course, is to determine which rules are deterministic. We

have developed an inductive-reasoning method to distinguish deterministic

rules from probabilistic rules during learning and to learn the correct degree


of cooperation for the deterministic rules.

In general, induction is the process of deriving general principles from

particular facts or instances. To learn deterministic rules, the idea of induc-

tion can be used as follows. If a certain kind of behavior occurs repeatedly

several times, and during this period of time there is no other behavior

that contradicts this kind of behavior, then we will hypothesize that the

chance of the same kind of behavior occurring in the next few iterations is

pretty high, regardless of how the other player behaved in the remote past.

More precisely, let K ≥ 1 be a number which we will call the promotion

threshold. Let H = ⟨(a1, b1), (a2, b2), . . . , (ak, bk)⟩ be the current history.
For each condition Cond_j ∈ C, let I_j be the set of indexes such that for
all i ∈ I_j, i < k and Cond_j(⟨(a1, b1), (a2, b2), . . . , (ai, bi)⟩) = True. Let I′_j
be the set of the largest K indexes in I_j. If |I_j| ≥ K and for all i ∈ I′_j,
b_{i+1} = C (i.e., the other player chose C when the previous history up to the
i'th iteration satisfies Cond_j), then we will hypothesize that the other player
will choose C whenever Cond_j is satisfied; hence we will use Cond_j → 1
as a deterministic rule. Likewise, if |I_j| ≥ K and for all i ∈ I′_j, b_{i+1} = D,
we will use Cond_j → 0 as a deterministic rule. See Line 7 to Line 10 in
Figure 10.2 for an outline of the induction method we use in DBS.
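As an illustration (a simplified sketch with hypothetical names, not our competition code), the induction test for a single condition can be written as:

    # Promote a condition to a deterministic rule once its last K occurrences
    # were all followed by the same move from the other player.
    def induce_deterministic_rule(history, cond, K=3):
        """history: list of (a, b) pairs; cond: predicate over a history prefix.
        Returns 1.0 or 0.0 for an induced deterministic rule, or None."""
        followers = [history[i][1]               # b_{i+1}: the move after the prefix
                     for i in range(len(history))
                     if cond(history[:i])]
        last_k = followers[-K:]                  # moves after the largest K indexes
        if len(last_k) < K:
            return None
        if all(b == 'C' for b in last_k):
            return 1.0
        if all(b == 'D' for b in last_k):
            return 0.0
        return None

    cond_cc = lambda h: len(h) >= 1 and h[-1] == ('C', 'C')
    print(induce_deterministic_rule([('C','C')] * 4, cond_cc, K=3))   # 1.0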

The induction method can be faster at learning deterministic rules than

the discounted frequency method that regards a rule as deterministic when

the degree of cooperation estimated by discounted frequencies is above or

below certain thresholds. As can be seen in Figure 10.3, the induction

method takes only three iterations to infer the other player’s moves cor-

rectly, whereas the discounted frequency technique takes six iterations to

obtain a 95% degree of cooperation, and it never becomes 100%.‖ We may

want to set the threshold in the discounted frequency method to be less than

0.8 to make it faster than the induction method. However, this will increase

the chance of incorrectly identifying a random behavior as deterministic.

A faster learning speed allows us to infer deterministic rules with a

shorter history, and hence increase the effectiveness of symbolic noise de-

tection by having more deterministic rules at any time, especially when a

change of the other player’s behavior occurs. The promotion threshold K

controls the speed of the identification of deterministic rules. The larger the

value of K, the slower the speed of identification, but the less likely we will

mistakenly hypothesize that the other player’s behavior is deterministic.

‖If we modify Equation 10.1 to discard the early interactions of a game, the degree of

cooperation of a probabilistic rule can attain 100%.


[Figure 10.3: learning curves (degree of cooperation versus iteration) for the induction method and the discounted frequency method.]

Fig. 10.3. Learning speeds of the induction method and the discounted frequency
method when the other player always cooperates. The initial degree of cooperation
is zero, the discounted rate is 0.75, and the promotion threshold is 3.

10.6.4. Symbolic Noise Detection and Temporary Tolerance

Once DBS has identified the set of deterministic rules, it can readily use

them to detect noise. As we said earlier, if the other player's move violates
a deterministic rule, the violation can be caused either by noise or by a change in
the other player's behavior, and DBS uses an evidence collection process
to figure out which is the case. More precisely, once a deterministic rule
Cond_i → o_i is violated (i.e., the history up to the previous iteration satisfies
Cond_i but the other player's move in the current iteration is different
from o_i), DBS keeps the violated rule but marks it as violated. Then DBS
starts an evidence collection process, which in the implementation of our
competition entries is violation counting: for each violated deterministic rule
DBS maintains a counter called the violation count to record how many
violations of the rule have occurred (Line 12).∗∗ In the subsequent iterations,
DBS increases the violation count by one every time a violation of
the rule occurs. However, if DBS encounters a positive example of the rule,
DBS resets the violation count to zero and unmarks the rule (Line 11). If
any violation count exceeds a threshold called the violation threshold, DBS
concludes that the violation is not due to noise; it is due to a change of
the other player's behavior. In this case, DBS invokes a special procedure

∗∗We believe that a better evidence collection process should be based on statistical

hypothesis testing.


(described in Section 10.6.5) to handle this situation (Line 13).

This evidence collection process takes advantage of the fact that the

pattern of moves affected by noise is often quite different from the pat-

tern of moves generated by the new behavior after a change of behavior

occurs. Therefore, it can often distinguish noise from a change of behavior

by observing moves in the next few iterations and gathering enough evidence.

As discussed in Section 10.6.2, we want to set a larger violation threshold

in order to avoid the drawback of the discount frequency method in dealing

with several misinterpretations caused by noise within a small time inter-

val. However, if the threshold is too large, it will slow down the speed of

adaptation to changes in the other player’s behavior. In the competition,

we entered DBS several times with several different violation thresholds;

and in the one that performed the best, the violation threshold was 4.
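As an illustration (a simplified sketch with hypothetical names; the actual bookkeeping in DBS is given in Figure 10.2), violation counting for a single deterministic rule can be written as:

    # Symbolic noise detection with temporary tolerance for one rule.
    class DeterministicRule:
        def __init__(self, cond, move, violation_threshold=4):
            self.cond = cond                  # condition, e.g. last interaction == (C,C)
            self.move = move                  # predicted move, 'C' or 'D'
            self.violations = 0
            self.threshold = violation_threshold

        def observe(self, prev_interaction, actual_move):
            """Returns 'ok', 'tolerate' (possible noise), or 'behavior_changed'."""
            if not self.cond(prev_interaction):
                return 'ok'                   # rule not applicable this iteration
            if actual_move == self.move:
                self.violations = 0           # positive example: unmark the rule
                return 'ok'
            self.violations += 1              # anomaly: tolerate it for now
            if self.violations > self.threshold:
                return 'behavior_changed'     # enough evidence of a genuine change
            return 'tolerate'

    # Example: the other player is modeled as cooperating after mutual cooperation.
    rule = DeterministicRule(lambda prev: prev == ('C', 'C'), 'C')
    print(rule.observe(('C', 'C'), 'D'))      # 'tolerate'  (could be noise)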

10.6.5. Coping with Ignorance of the Other Player's New Behavior

When the evidence collection process detects a change in the other player’s

behavior, DBS knows little about the other player’s new behavior. How

DBS copes with this ignorance is critical to its success.

Since DBS knows little about the other player's new behavior when
it detects a change of the other player's behavior, DBS temporarily uses

the previous hypothesized policy as the current hypothesized policy, un-

til it deems that this substitution no longer works. More precisely, DBS

maintains two sets of deterministic rules: the current rule set Rc

and the

default rule set Rd. R

cis the set of deterministic rules that is learned after

the change of behavior occurs, while Rd

is the set of deterministic rules

before the change of behavior occurs. At the beginning of a game, Rd

is

πTFT

and Rc

is an empty set (Line 1 and Line 2). When DBS constructs a

hypothesized policy π for move generation, it uses every rule in Rc

and Rd.

In addition, for any missing rule (i.e., the rule those condition are differ-

ent from any rule’s condition in Rc

or Rd), we regard it as a probabilistic

rule and approximate its degree of cooperation by Equation 10.1 (Line 17).

These probabilistic rules form the probabilistic rule set Rp⊆ ψ

k+1.

While DBS can insert any newly found deterministic rule in Rc, it insert

rules into Rd

only when the evidence collection process detects a change of

the other player’s behavior. When it happens, DBS copies all the rules in

Rc

to Rd, and then set R

cto an empty set (Line 13).

The default rule set is designed to be rejected: we maintain a violation


count to record the number of violations of any rule in Rd. Every time any
rule in Rd is violated, the violation count is increased by 1 (Line 14). Once
the violation count exceeds a rejection threshold, we drop the default rule
set entirely (set it to an empty set) and reset the violation count (Line 15
and Line 16). We also reject Rd whenever any rule in Rc contradicts any
rule in Rd (Line 15).

We preserve the rules in Rc mainly for the sake of providing a smooth
transition: we don't want to convert all deterministic rules to probabilistic rules
at once, as doing so might suddenly alter the course of our moves, since the move
generator in DBS generates moves according to the current hypothesized
policy only. This sudden change in DBS's behavior can potentially disrupt
the cooperative relationship with the other player. Furthermore, some of
the rules in Rc may still hold, and we don't want to learn them from scratch.

Notice that symbolic noise detection and temporary tolerance make use
of the rules in Rc but not the rules in Rd, although DBS makes use of the
rules in both Rc and Rd when it decides the next move (Line 18). We do
not use Rd for symbolic noise detection and temporary tolerance because
when DBS inserts rules into Rd, a change of the other player's behavior
has already occurred; there is little reason to believe that anomalies detected
using the rules in Rd are due to noise. Furthermore, we want to turn
off symbolic noise detection and temporary tolerance temporarily when a
change of behavior occurs, in order to identify a whole new set of deterministic
rules from scratch.

10.7. The Move Generator in DBS

We devised a simple and reasonably effective move generator for DBS. As

shown in Figure 10.1, the move generator takes the current hypothesized

policy π and the current history Hcurrent, whose length is l = |Hcurrent|,

and then decides whether DBS should cooperate in the current iteration.

It is difficult to devise a good move generator, because our move could lead

to a change of the hypothesized policy and complicate our projection of

the long-term payoff. Perhaps the move generator should take the other
player's model of DBS into account [Carmel and Markovitch (1994)]. However,
we found that by making the assumption that the hypothesized policy

will not change for the rest of the game, we can devise a simple move gen-

erator that generates fairly good moves. The idea is that we compute the

maximum expected score we can possibly earn for the rest of the game, us-

ing a technique that combines some ideas from both game-tree search and


Markov Decision Processes (MDPs). Then we choose the first move in the

set of moves that leads to this maximum expected score as our move for

the current iteration.

To accomplish the above, we consider all possible histories whose prefix

is Hcurrent

as a tree. In this tree, each path starting from the root represents

a possible history, which is a sequence of past interactions in Hcurrent

plus

a sequence of possible interactions in future iterations. Each node on a path

represents the interaction of an iteration of a history. Figure 10.4 shows an

example of such a tree. The root node of the tree represents the interaction

of the first iteration.

Let interaction(S) be the interaction represented by a node S. Let

⟨S0, S1, . . . , Sk⟩ be a sequence of nodes on the path from the root S0
to Sk. We define the depth of Sk to be k − l, and the history of Sk to
be H(Sk) = ⟨interaction(S1), interaction(S2), . . . , interaction(Sk)⟩. Si is
called the current node if the depth of Si is zero; the current node represents
the interaction of the last iteration and H(Si) = Hcurrent. As we
do not know when the game will end, we assume it will go on for N∗ more
iterations; thus each path in the tree has length of at most l + N∗.

Our objective is to compute a non-negative real number called the max-

imum expected score E(S) for each node S with a non-negative depth. Like

a conventional game tree search in computer chess or checkers, the maxi-

mum expected scores are defined recursively: the maximum expected score

of a node at depth i is determined by the maximum expected scores of its

children nodes at depth i + 1. The maximum expected score of a node S

of depth N∗ is assumed to be the value computed by an evaluation func-

tion f . This is a mapping from histories to non-negative real numbers,

such that E(S) = f(H(S)). The maximum expected score of a node S of

depth k, where 0 ≤ k < N∗, is computed by the maximizing rule: suppose
the four possible nodes after S are SCC, SCD, SDC, and SDD, and
let p be the degree of cooperation predicted by the current hypothesized
policy π (i.e., p is the right-hand side of a rule (Cond → p) in π such that
H(S) satisfies the condition Cond). Then E(S) = max{EC(S), ED(S)},
where EC(S) = p(uCC + E(SCC)) + (1 − p)(uCD + E(SCD)) and ED(S) =
p(uDC + E(SDC)) + (1 − p)(uDD + E(SDD)). Furthermore, we let move(S) be
the decision made by the maximizing rule at each node S, i.e., move(S) = C
if EC(S) ≥ ED(S) and move(S) = D otherwise. By applying this maximizing
rule recursively, we obtain the maximum expected score of every
node with a non-negative depth. The move that we choose for the current
iteration is move(Si), where Si is the current node.


[Figure 10.4: a tree of possible histories, from the first iteration (root node) through the previous iteration (current node) and possible future interactions at depths 0, 1, and 2.]

Fig. 10.4. An example of the tree that we use to compute the maximum expected scores.

Each node denotes the interaction of an iteration. The top four nodes constitute a path

representing the current history Hcurrent. The length of Hcurrent is l = 2, and the

maximum depth N∗ is 2. There are four edges emanating from each node S after the

current node; each of these edges corresponds to a possible interaction of the iteration

after S. The maximum expected scores (not shown) of the nodes with depth 2 are set by

an evaluation function f ; these values are then used to calculate the maximum expected

scores of the nodes with depth 1 by using the maximizing rule. Similarly, the maximum

expected score of the current node is calculated using the four maximum expected scores

of the nodes with depth 1.

The number of nodes in the tree increases exponentially with N∗. Thus,
the tree can be huge; there are over a billion nodes when N∗ ≥ 15.
It is infeasible to compute the maximum expected score for every node
one by one. Fortunately, we can use dynamic programming to speed
up the computation. As an example, suppose the hypothesized policy is
π = {(C,C) → pCC, (C,D) → pCD, (D,C) → pDC, (D,D) → pDD}, and
suppose the evaluation function f returns a constant f_{o1o2} for any history
that satisfies the condition (o1, o2), where o1, o2 ∈ {C, D}. Then, given our
assumption that the hypothesized policy does not change, it is not hard to
show by induction that all nodes whose histories have the same length and
satisfy the same condition have the same maximum expected score. By
using this property, we construct a table of size 4 × (N∗ + 2) in which each
entry, denoted by E^k_{o1o2}, stores the maximum expected score of the nodes
whose histories have length l + k and satisfy the condition (o1, o2), where
o1, o2 ∈ {C, D}. We also have another table of the same size to record the
decisions the procedure makes; the entry m^k_{o1o2} of this table is the decision
being made at E^k_{o1o2}. Initially, we set E^{N∗+1}_{CC} = fCC, E^{N∗+1}_{CD} = fCD,
E^{N∗+1}_{DC} = fDC, and E^{N∗+1}_{DD} = fDD. Then the maximum expected scores in
the remaining entries can be computed by the following recursive equation:

    E^k_{o1o2} = max( p_{o1o2}(uCC + E^{k+1}_{CC}) + (1 − p_{o1o2})(uCD + E^{k+1}_{CD}),
                      p_{o1o2}(uDC + E^{k+1}_{DC}) + (1 − p_{o1o2})(uDD + E^{k+1}_{DD}) ),

where o1, o2 ∈ {C, D}. Similarly, m^k_{o1o2} = C if p_{o1o2}(uCC + E^{k+1}_{CC}) + (1 −
p_{o1o2})(uCD + E^{k+1}_{CD}) ≥ p_{o1o2}(uDC + E^{k+1}_{DC}) + (1 − p_{o1o2})(uDD + E^{k+1}_{DD}), and
m^k_{o1o2} = D otherwise. If the interaction of the previous iteration is (o1, o2),
we pick m^0_{o1o2} as the move for the current iteration. The pseudocode of
this dynamic programming algorithm is shown in Figure 10.5.

this dynamic programming algorithm is shown in Figure 10.5.

Procedure MoveGen(π, H)
    ⟨pCC, pCD, pDC, pDD⟩ ← π
    ⟨(a1, b1), (a2, b2), . . . , (ak, bk)⟩ ← H
    (a0, b0) ← (C, C) ; (a, b) ← (ak, bk)
    ⟨E^{N∗+1}_{CC}, E^{N∗+1}_{CD}, E^{N∗+1}_{DC}, E^{N∗+1}_{DD}⟩ ← ⟨fCC, fCD, fDC, fDD⟩
    For k = N∗ down to 0
        For each (o1, o2) in {(C,C), (C,D), (D,C), (D,D)}
            F^k_{o1o2} ← p_{o1o2}(uCC + E^{k+1}_{CC}) + (1 − p_{o1o2})(uCD + E^{k+1}_{CD})
            G^k_{o1o2} ← p_{o1o2}(uDC + E^{k+1}_{DC}) + (1 − p_{o1o2})(uDD + E^{k+1}_{DD})
            E^k_{o1o2} ← max(F^k_{o1o2}, G^k_{o1o2})
            If F^k_{o1o2} ≥ G^k_{o1o2}, then m^k_{o1o2} ← C
            If F^k_{o1o2} < G^k_{o1o2}, then m^k_{o1o2} ← D
        End For
    End For
    Return m^0_{ab}

Fig. 10.5. The procedure for computing a recommended move for the current iteration.
In the competition, we set N∗ = 60, fCC = 3, fCD = 0, fDC = 5, and fDD = 1.
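As an illustration (a simplified sketch, assuming as above that the hypothesized policy stays fixed; it is not our competition implementation), the dynamic program of Figure 10.5 can be written in Python as:

    # 'policy' maps the previous interaction (a, b) to the predicted probability
    # that the other player cooperates next; payoffs and terminal values follow
    # the numbers stated in the caption (3, 0, 5, 1).
    U = {('C','C'): 3, ('C','D'): 0, ('D','C'): 5, ('D','D'): 1}

    def move_gen(policy, prev_interaction=('C', 'C'), n_star=60):
        conds = list(U)
        E = dict(U)                    # E^{N*+1}: evaluation-function values f
        m = {}
        for _ in range(n_star + 1):    # k = N*, N*-1, ..., 0
            newE, newm = {}, {}
            for cond in conds:
                p = policy[cond]
                ec = p * (U[('C','C')] + E[('C','C')]) + (1 - p) * (U[('C','D')] + E[('C','D')])
                ed = p * (U[('D','C')] + E[('D','C')]) + (1 - p) * (U[('D','D')] + E[('D','D')])
                newE[cond] = max(ec, ed)
                newm[cond] = 'C' if ec >= ed else 'D'
            E, m = newE, newm
        return m[prev_interaction]     # m^0 for the last interaction

    # Against a player modeled as TFT, the recommended move is to cooperate,
    # which matches the observation in Section 10.9.
    pi_tft = {('C','C'): 1.0, ('C','D'): 1.0, ('D','C'): 0.0, ('D','D'): 0.0}
    print(move_gen(pi_tft, ('C','C')))   # 'C'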

10.8. Competition Results

The 2005 IPD Competition was actually a set of four competitions, each for

a different version of the IPD. The one for the Noisy IPD was Category 2,

which used a noise level of 0.1.

Of the 165 programs entered into the competition, eight of them were

provided by the organizer of the competition. These programs included


ALLC (always cooperates), ALLD (always defects), GRIM (cooperates un-

til the first defection of the other player, and thereafter it always defects),

NEG (cooperate (or defect) if the other player defects (or cooperates) in

the previous iteration), RAND (defects or cooperates with probability 1/2),
STFT (suspicious TFT, which is like TFT except it defects in the
first iteration), TFT, and TFTT. All of these strategies are well known in

the literature on IPD.

The remaining 157 programs were submitted by 36 different partici-

pants. Each participant was allowed to submit up to 20 programs. We

submitted the following 20:

• DBS. We entered nine different versions of DBS into the competition,

each with a different set of parameters or different implementation.

The one that performed best was DBSz, which makes use of the exact

set of features we mentioned in this chapter. Versions that have fewer

features or additional features did not do as well.

• Learning of Opponent’s Strategy with Forgiveness (LSF). Like

DBS, LSF is a strategy that learns the other player’s strategy during

the game. The difference between LSF and DBS is that LSF does not

make use of symbolic noise detection. It uses the discount frequency

(Equation 10.1) to learn the other player’s strategy, plus a forgiveness

strategy that decides when to cooperate if mutual defection occurs. We

entered one instance of LSF. It placed around 30th in three of the
runs and around 70th in the other two runs. We believe the poor

ranking of LSF is due to the deficiency of using discount frequency

alone as we discussed at the beginning of Section 10.6.

• Tit-for-Tat Improved (TFTI). TFTI is a strategy based on a to-

tally different philosophy from DBS’s. It is not an opponent-modeling

strategy, in the sense that it does not model the other player’s behavior

using a set of rules. Instead, it is a variant of TFT with a sophisticated

forgiveness policy that aims at overcoming some of the deficiencies of

TFT in noisy environments. We entered ten instantiations of TFTI in

the competition, each with a different set of parameters or some dif-

ferences in the implementation. The best of these, TFTIm, did well in

the competition (see Table 10.1), but not as well as DBS.

Three of the other participants each entered the full complement

of twenty programs: Wolfgang Kienreich, Jia-wei Li, and Perukrishnen

Vytelingum. All three of them appear to have adopted the master-and-

slaves strategy that was first proposed by Vytelingum’s team from the Uni-


versity of Southampton. A master-and-slaves strategy is not a strategy for

a single program, but instead for a team of collaborating programs. One of

the programs in such a team is the master, and the remaining programs are

slaves. The basic idea is that at the start of a run, the master and slaves

would each make a series of moves using a predefined protocol, in order to

identify themselves to each other. From then on, the master program would

always play “defect” when playing with the slaves, and the slave programs

would always play “cooperate” when playing with the master, so that the

master would gain the highest possible payoff at each iteration. Further-

more, a slave would always play “defect” when playing with a program

other than the master, in order to try to minimize that player’s score.

Wolfgang Kienreich’s master program was CNGF (CosaNostra Godfa-

ther), and its slaves were 19 copies of CNHM (CosaNostra Hitman). Jia-wei

Li’s master program was IMM01 (Intelligent Machine Master 01), and its

slaves were IMS02, IMS03, . . . , IMS20 (Intelligent Machine Slave n, for

n = 02, 03, . . . , 20). Perukrishnen Vytelingum’s master program was BWIN

(S2Agent1 ZEUS), and its slaves were BLOS2, BLOS3, . . . , BLOS20 (like

BWIN, these programs also had longer names based on the names of ancient

Greek gods).

We do not know what strategies the other participants used in their

programs.

10.8.1. Overall Average Scores

Category 2 (IPD with noise) consisted of five runs. Each run was a round-

robin tournament in which each program played with every program, in-

cluding itself. Each program participated in 166 games in each run (recall

that there is one game in which a player plays against itself, which counts

as two games for that player). Each game consisted of 200 iterations. A

program’s score for a game is the sum of its payoffs over all 200 iterations

(note that this sum will be at least 0 and at most 1000). The program’s

total score for an entire run is the sum of its scores over all 166 games. On

the competition’s website, there is a ranking for each of the five runs, in which
each program is ranked according to its total score for the run.

A program’s average score within a run is its total score for the run

divided by 166. The program’s overall average score is its average over all

five runs, i.e., its total over all five runs divided by 830 = 5× 166.

Table 10.1 shows the average scores in each of the five runs

of the top twenty-five programs when the programs are ranked by their


overall average scores. Of our nine different versions of DBS, all nine of

them are among the top twenty-five programs, and they dominate the top

ten places. This phenomenon implies that DBS’s performance is insensitive

to the parameters in the programs and the implementation details of an

individual program. The same phenomenon happens to TFTI—nine out of

ten programs using TFTI are ranked between the 11th place and the 25th

place, and the last one is at the 29th place.

10.8.2. DBS versus the Master-and-Slaves Strategies

Recall from Table 10.1 that DBSz placed third in the competition: it lost

only to BWIN and IMM01, the masters of two master-and-slaves strategies.

DBS does not use a master-and-slaves strategy, nor does it conspire with

other programs in any other way—but in contrast, BWIN’s and IMM01’s

performance depended greatly on the points fed to them by their slaves. In

particular,

(1) If we average the score of each master with the scores of its slaves, we get

379.9 for BWIN and 351.7 for IMM01, both of which are considerably

less than DBSz’s score of 408.

(2) A more extensive analysis [Au and Nau (2005)] shows that if the size of

each master-and-slaves team had been limited to less than or equal to

10, DBSz would have outperformed BWIN and IMM01 in the compe-

tition, even without averaging the score of each master with its slaves.

The reason for the above two phenomena is that the master-and-slaves

strategies did not cooperate with the other players as much as they did amongst

themselves. In particular, Table 10.2 gives the percentages of each of the

four possible interactions when any program from one group plays with any

program from another group. Note that:

• When BWIN and IMM01 play with their slaves, about 64% and 47% of

the interactions are (D,C), but when non-master-and-slaves strategies

play with each other, only 19% of the interactions are (D,C).

• When the slave programs play with non-master-and-slaves programs,

over 60% of interactions are (D,D), but when non-master-and-slaves

programs play with other non-master-and-slaves programs, only 31%

of the interactions are (D,D).

• The master-and-slaves strategies decrease the overall percentage of

(C,C) from 31% to 13%, and increase the overall percentage of (D,D)

from 31% to 55%.


Table 10.2. Percentages of different interactions. “All but

M&S” means all 105 programs that did not use master-and-slaves

strategies, and “all” means all 165 programs in the competition.

Player 1          Player 2          (C,C)  (C,D)  (D,C)  (D,D)
BWIN              BWIN's slaves      12%    5%    64%    20%
IMM01             IMM01's slaves     10%    6%    47%    38%
CNGF              CNGF's slaves       2%   10%    10%    77%
BWIN's slaves     all but M&S         5%    9%    24%    62%
IMM01's slaves    all but M&S         7%    9%    23%    61%
CNGF's slaves     all but M&S         4%    8%    24%    64%
TFT               all but M&S        33%   20%    20%    27%
DBSz              all but M&S        54%   15%    13%    19%
TFTT              all but M&S        55%   20%    11%    14%
TFT               all                23%   19%    16%    42%
DBSz              all                36%   14%    11%    39%
TFTT              all                38%   21%    10%    31%
all but M&S       all but M&S        31%   19%    19%    31%
all               all                13%   16%    16%    55%

10.8.3. A comparison between DBSz, TFT, and TFTT

Next, we consider how DBSz performs against TFT and TFTT. Table 10.2

shows that when playing with another cooperative player, TFT cooperates

((C,C) in the table) 33% of the time, DBSz does so 54% of the time, and

TFTT does so 55% of the time. Furthermore, when playing with a player

who defects, TFT defects ((D,D) in the table) 27% of the time, DBSz

does so 19% of the time, and TFTT does so 14% of the time. From this,

one might think that DBSz’s behavior is somewhere between TFT’s and

TFTT’s.

But on the other hand, when playing with a player who defects, DBSz

cooperates ((C,D) in the table) only 15% of the time, which is a lower

percentage than for TFT and TFTT (both 20%). Since cooperating with

a defector generates no payoff, this makes TFT and TFTT perform worse

than DBSz overall. DBSz’s average score was 408 and it ranked 3rd, but

TFTT’s and TFT’s average scores were 388.4 and 388.2 and they ranked

30th and 33rd.

10.9. Related Work

Early studies of the effect of noise in the Iterated Prisoner’s Dilemma fo-

cused on how TFT, a highly successful strategy in noise-free environments,

would do in the presence of noise. TFT is known to be vulnerable to noise;

for instance, if two players use TFT at the same time, noise would trig-


ger long sequences of mutual defections [Molander (1985)]. A number of

people confirmed the negative effects of noise on TFT [Molander (1985);

Bendor (1987); Mueller (1987); Axelrod and Dion (1988); Nowak and Sig-

mund (1990); Bendor et al. (1991)]. Axelrod found that TFT was still the

best decision rule in the rerun of his first tournament with a one percent

chance of misperception (Axelrod, 1984, page 183), but TFT finished sixth

out of 21 in the rerun of Axelrod’s second tournament with a 10 percent

chance of misperception [Donninger (1986)]. In Competition 2 of the 2005

IPD competition, the noise level was 0.1, and TFT’s overall average score

placed it 33rd out of 165.

The oldest approach to remedy TFT’s deficiency in dealing with noise

is to be more forgiving in the face of defections. A number of studies found

that more forgiveness promotes cooperation in noisy environments [Bendor

et al. (1991); Mueller (1987)]. For instance, Tit-For-Two-Tats (TFTT), a

strategy submitted by John Maynard Smith to Axelrod’s second tourna-

ment, retaliates only when it receives two defections in two previous itera-

tions. TFTT can tolerate isolated instances of defections caused by noise

and more readily avoids long sequences of mutual defections caused by

noise. However, TFTT is susceptible to exploitation of its generosity and

was beaten in Axelrod’s second tournament by TESTER, a strategy that

may defect every other move. In Competition 2 of the 2005 IPD Competi-

tion, TFTT ranked 30th, a slightly better ranking than TFT’s. In contrast

to TFTT, DBS can tolerate not only an isolated defection but also a se-

quence of defections caused by noise, and at the same time DBS monitors

the other player’s behavior and retaliates when exploitation behavior is

detected (i.e., when the exploitation causes a change of the hypothesized

policy, which initially is TFT). Furthermore, the retaliation caused by ex-

ploitation continues until the other player shows a high degree of remorse

(i.e., cooperations when DBS defects) that changes the hypothesized policy

to one with which DBS favors cooperations instead of defections.

[Molander (1985)] proposed to mix TFT with ALLC to form a new

strategy which is now called Generous Tit-For-Tat (GTFT) [Nowak and

Sigmund (1992)]. Like TFTT, GTFT avoids an infinite echo of defections

by cooperating when it receives a defection in certain iterations. The differ-

ence is that GTFT forgives randomly: for each defection GTFT receives it

randomly chooses to cooperate with a small probability (say 10%) and defect

otherwise. DBS, however, does not make use of forgiveness explicitly as in

GTFT; its decisions are based entirely on the hypothesized policy that it

learned. But temporary tolerance can be deemed a form of forgiveness,


since DBS does not retaliate immediately when a defection occurs in a mu-

tual cooperation situation. This form of forgiveness is carefully planned

and there is no randomness in it.

Another way to improve TFT in noisy environments is to use contrition:

unilaterally cooperate after making mistakes. One strategy that makes use

of contrition is Contrite TFT (CTFT) [Sugden (1986); Boyd (1989); Wu

and Axelrod (1995)], which does not defect when it knows that noise has

occurred and affected its previous action. However, this is less useful in the

Noisy IPD since a program does not know whether its action is affected by

noise or not. DBS does not make use of contrition, though the effect of

temporary tolerance resembles contrition.

A family of strategies called “Pavlovian” strategies, or simply called

Pavlov, was found to be more successful than TFT in noisy environ-

ments [Kraines and Kraines (1989, 1993, 1995); Nowak and Sigmund

(1993)]. The simplest form of Pavlov is called Win-Stay, Lose-Shift [Nowak

and Sigmund (1993)], because it cooperates only after mutual cooperation

or mutual defection, an idea similar to Simpleton [Rapoport and Chammah

(1965)]. When an accidental defection occurs, Pavlov can resume mu-

tual cooperation in a smaller number of iterations than TFT [Kraines and

Kraines (1989, 1993)]. Pavlov learns by conditioned response through re-

wards and punishments; it adjusts its probability of cooperation according

to the previous interaction. Like Pavlov, DBS learns from its past experi-

ence and makes decisions accordingly. DBS, however, has an intermediate

step between learning from experience and decision making: it maintains a

model of the other player’s behavior, and uses this model to reason about

noise. Although there are probabilistic rules in the hypothesized policy,

there is no randomness in its decision making process.

For readers who are interested, there are several surveys on the Iterated

Prisoner’s Dilemma with noise [Axelrod and Dion (1988); Hoffmann (2000);

O’Riordan (2001); Kuhn (2001)].

The use of opponent modeling is common in games of imperfect infor-

mation such as Poker [Billings et al. (1998); Barone and While (1998, 1999,

2000); Davidson et al. (2000); Billings et al. (2003)] and RoShamBo [Egnor

(2000)]. One entry in Axelrod’s original IPD tournament used opponent

modeling, but it was not successful. There have been many works on learn-

ing the opponent’s strategy in the non-noisy IPD [Dyer (2004); Hingston

and Kendall (2004); Powers and Shoham (2005)]. By assuming the oppo-

nent’s next move depends only on the interactions of the last few iterations,

these works model the opponent’s strategy as probabilistic finite automata,


and then use various learning methods to learn the probabilities in the au-

tomata. For example, [Hingston and Kendall (2004)] proposed an adaptive

agent called an opponent modeling agent (OMA) of order n, which main-

tains a summary of the moves made up to n previous iterations. Like DBS,

OMA learns the probabilities of cooperation of the other player in dif-
ferent situations using an updating rule similar to Equation 10.1, and

generates a move based on the opponent model by searching a tree similar

to that shown in Figure 10.4. The opponent model in [Dyer (2004)] also

has a similar construct. The main way they differ from DBS is how they

learn the other player’s strategy, but there are several other differences: for

example, the tree they used has a maximum depth of 4, whereas ours has

a depth of 60.

The agents of both [Hingston and Kendall (2004)] and [Dyer (2004)]

learned the other player’s strategy by exploration—deliberately making

moves in order to probe the other player’s strategy. The use of exploration

for learning opponent’s behaviors was studied by [Carmel and Markovitch

(1998)], who developed a lookahead-based exploration strategy to balance

between exploration and exploitation and avoid making risky moves during

exploration. [Hingston and Kendall (2004)] and [Dyer (2004)] used a differ-

ent exploration strategy than [Carmel and Markovitch (1998)]; [Hingston

and Kendall (2004)] introduced noise to 1% of their agent’s moves (they

call this method the trembling hand), whereas the agent in [Dyer (2004)]

makes decisions at random when it uses the opponent’s model and finds a

missing value in the model. Both of their agents used a random opponent

model at the beginning of a game.

DBS does not make deliberate moves to attempt to explore the other

player’s strategy, because we believe that this is a high-risk, low-payoff

business in IPD. We believe it incurs a high risk because many programs in

the competition are adaptive; our defections made in exploration may affect

our long-term relationship with them. We believe it has a low payoff because

the length of a game is usually too short for us to learn any non-trivial

strategy completely. Moreover, the other player may alter its behavior at

the middle of a game, and therefore it is difficult for any learning method

to converge. This is especially true in the noisy IPD, since noise can provoke the

other player (e.g., GRIM). Furthermore, our objective is to cooperate with

the other players, not to exploit their weakness in order to beat them. So as

long as the opponent cooperates with us there is no need to bother with their

other behaviors. For these reasons, DBS does not aim at learning the other

player’s strategy completely; instead, it learns the other player’s recent


behavior, which is subject to change. In contrast to the OMA strategy

described earlier in this section, most of our DBS programs cooperated

with each other in the competition.

Our decision-making algorithm combines elements of both minimax

game tree search and the value iteration algorithm for Markov Decision

Processes. In contrast to [Carmel and Markovitch (1994)], we do not model

the other player’s model of our strategy; we assume that the hypothesized

policy does not change for the rest of the game. Obviously this assump-

tion is not valid, because our decisions can affect the decisions of the other

players in the future. Nonetheless, we found that the moves returned by

our algorithm are fairly good responses. For example, if the other player

behaves like TFT, the move returned by our algorithm is to cooperate re-

gardless of the previous interactions; if the other player does not behave

like TFT, our algorithm is likely to return defection, a good move in many

situations.

To the best of our knowledge, ours is the first work on using opponent

models in the IPD to detect errors in the execution of another agent’s

actions.

10.10. Summary and Future Work

For conflict prevention in noisy environments, a critical problem is to distin-

guish between situations where another player has misbehaved intentionally

and situations where the misbehavior was accidental. That is the problem

that DBS was formulated to deal with. DBS’s impressive performance in

the 2005 Iterated Prisoner’s Dilemma competition occurred because DBS

was better able to maintain cooperation in spite of noise than any other

program in the competition.

To distinguish between intentional and unintentional misbehaviors, DBS

uses a combination of symbolic noise detection plus temporary tolerance: if

an action of the other player is inconsistent with the player’s past behavior,

we continue as if the player’s behavior has not changed, until we gather

sufficient evidence to see whether the inconsistency was caused by noise or

by a genuine change in the other player’s behavior.

Since clarity of behavior is an important ingredient of long-term coop-

eration in the IPD, most IPD programs have behavior that follows clear

deterministic patterns. The clarity of these patterns made it possible for

DBS to construct policies that were good approximations of the other play-

ers’ strategies, and to use these policies to fend off noise.


We believe that clarity of behavior is also likely to be important in

other multi-agent environments in which agents have to cooperate with

each other. Thus it seems plausible that techniques similar to those used

in DBS may be useful in those domains.

In the future, we are interested in studying the following issues:

• The evidence collection process takes time, and the delay may invite

exploitation. For example, the policy of temporary tolerance in DBS

may be exploited by a “hypocrite” strategy that behaves like TFT most

of the time but occasionally defects even though DBS did not defect

in the previous iteration. DBS cannot distinguish this kind of inten-

tional defection from noise, even though DBS has a built-in mechanism
to monitor exploitation. We are interested in seeing how to avoid this

kind of exploitation.

• In multi-agent environments where agents can communicate with each

other, the agents might be able to detect noise by using a predefined

communication protocol. However, we believe there is no protocol that

is guaranteed to tell which action has been affected by noise, as long as

the agents cannot completely trust each other. It would be interesting

to compare these alternative approaches with symbolic noise detection

to see how symbolic noise detection could enhance these methods or

vice versa.

• The type of noise in the competition assumes that no agent knows

whether an execution of an action has been affected by noise or not.

Perhaps there are situations in which some agents may be able to ob-

tain partial information about the occurrence of noise. For example,

some agents may obtain a plan of the malicious third party by counter-

espionage. We are interested in seeing how to incorporate this information
into symbolic noise detection.

• It would be interesting to put DBS in an evolutionary environment to

see whether it can survive after a number of generations. Is it evolu-

tionarily stable?

Acknowledgment. This work was supported in part by ISLE contract

0508268818 (subcontract to DARPA’s Transfer Learning program), UC

Berkeley contract SA451832441 (subcontract to DARPA’s REAL program),

and NSF grant IIS0412812. The opinions in this paper are those of the au-

thors and do not necessarily reflect the opinions of the funders.

This work is based on an earlier work: Accident or Intention: That Is

the Question (in the Noisy Iterated Prisoner’s Dilemma), in AAMAS’06


(May 8–12, 2006), © ACM, 2006.

We would like to thank the anonymous reviewers for their comments.
