
Page 1:

Combining Genetics, Learning and Parenting

Michael Berger

Based on:

“When to Apply the Fifth Commandment: The Effects of Parenting on Genetic and Learning Agents” / Michael Berger and Jeffrey S. Rosenschein Submitted to AAMAS 2004

Page 2:

Abstract Problem

• Hidden state
• Metric defined over the state space
• Condition C1: When the state changes, it changes only to an "adjacent" state
• Condition C2: State changes occur at a low, but positive, rate

Page 3:

The Environment

• 2D grid
• Food patches
  - Probabilistic
  - Unlimited (reduces analytical complexity)
  - May move to adjacent squares (retains structure)
• Cyclic (reduces analytical complexity); a code sketch of such an environment follows below

[Figure: example grid with food patches; each patch square is labeled with its food-presence probability (values 0.2, 0.5, 0.7).]
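A minimal sketch of such an environment, assuming the food patch is represented as a mapping from square offsets to food probabilities; the class and parameter names are illustrative, not taken from the paper:

```python
import random

class GridEnvironment:
    """Cyclic 2D grid with a probabilistic food patch that may drift to adjacent squares."""

    def __init__(self, width, height, patch, move_prob=0.0):
        self.width, self.height = width, height
        self.patch = patch          # dict: (dx, dy) offset within the patch -> food probability
        self.origin = (0, 0)        # current position of the patch's reference corner
        self.move_prob = move_prob  # chance per round that the patch shifts one square

    def step(self):
        """Possibly move the patch to an adjacent position (cyclic wrap-around)."""
        if random.random() < self.move_prob:
            dx, dy = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
            ox, oy = self.origin
            self.origin = ((ox + dx) % self.width, (oy + dy) % self.height)

    def food_present(self, x, y):
        """Sample food presence (reward 0 or 1) at square (x, y)."""
        ox, oy = self.origin
        offset = ((x - ox) % self.width, (y - oy) % self.height)
        return 1 if random.random() < self.patch.get(offset, 0.0) else 0
```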

Page 4:

Agent Definitions

• Reward = Food Presence (0 or 1)
• Perception = <Position, Food Presence>
• Action ∈ {NORTH, EAST, SOUTH, WEST, HALT}
• Memory = <<Per, Ac>, …, <Per, Ac>, Per>
  - Memory length = |Mem| = number of elements in the memory
  - Number of possible memories = $(2 \cdot W \cdot H)^{|Mem|} \cdot 5^{|Mem|-1}$, where W and H are the grid width and height (a worked example follows this list)
• MAM - Memory-Action Mapper: a table with one entry for every possible memory
• ASF - Action-Selection Filter
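As a concrete check of the memory-count formula, here is a small sketch (plugging in the 20×20 grid and the memory lengths used later in the experiment section):

```python
def num_memories(width, height, mem_len):
    # (2 * W * H)^|Mem| perception combinations times 5^(|Mem|-1) intermediate actions
    return (2 * width * height) ** mem_len * 5 ** (mem_len - 1)

print(num_memories(20, 20, 1))  # 800 MAM entries for |Mem| = 1 (the experiment's setting)
print(num_memories(20, 20, 2))  # 3,200,000 -- the table grows very quickly with memory length
```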

Page 5:

Genetic Algorithm (I)

• The algorithm operates on a complete population, not on a single agent
• Requires the introduction of generations
  - Every generation consists of a new group of agents
  - Each agent is created at the beginning of a generation and terminated at its end
  - Agent's life cycle: Birth --> Run (foraging) --> Possible matings --> Death

Page 6:

Genetic Algorithm (II)

• Each agent carries a gene sequence
  - Each gene has a key (memory) and a value (action)
  - A given memory determines the resultant action
  - The gene sequence remains constant during an agent's lifetime
  - The gene sequence is determined at the mating stage of the agent's parents

Page 7:

Genetic Algorithm (III)

• Mating consists of two stages:
  - Selection stage - determining mating rights. Should be performed according to two principles:
    • Survival of the fittest (as indicated by performance during the lifetime)
    • Preservation of genetic variance
  - Offspring creation stage:
    • One or more parents create one or more offspring
    • Offspring inherit some combination of the parents' gene sequences
• Each of the stages has many variants

Page 8:

Genetic Algorithm Variant

• Selection: will be discussed later.
• Offspring creation: two parents mate and create two offspring (see the sketch below)
  - The parents' gene sequences are aligned against one another, and then two processes occur:
    • Random crossover
    • Random mutation
  - The resulting pair of gene sequences is inherited by the offspring (one by each offspring).
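A minimal sketch of this variant, assuming each gene sequence is stored as a dictionary from memory (key) to action (value) with both parents sharing the same key set; names and the exact representation are my own:

```python
import random

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def mate(parent1, parent2, cross_p, mut_p):
    """Align two gene sequences, apply per-gene-pair crossover and per-gene mutation.

    parent1/parent2: dicts mapping memory (key) -> action (value), identical key sets.
    Returns two offspring gene sequences.
    """
    child1, child2 = {}, {}
    for key in parent1:
        v1, v2 = parent1[key], parent2[key]
        if random.random() < cross_p:      # crossover: swap the values of this gene pair
            v1, v2 = v2, v1
        if random.random() < mut_p:        # mutation: replace a value with a random action
            v1 = random.choice(ACTIONS)
        if random.random() < mut_p:
            v2 = random.choice(ACTIONS)
        child1[key], child2[key] = v1, v2
    return child1, child2
```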

Page 9:

Genetic Inheritance

[Diagram: Parent1's gene sequence (K1,V1)…(K5,V5) is aligned with Parent2's (K1,U1)…(K5,U5). Crossover swaps the values of some gene pairs between the two sequences, and mutation (marked with *) alters individual values. The two resulting sequences are inherited by Offspring1 and Offspring2.]

Page 10:

Genetic Agent

• MAM: every entry is considered a gene
  - First column - possible memory (key)
  - Second column - action to take (value)
  - No changes after creation
• Parameters:
  - m_Gen: memory length
  - P_GenCros: crossover probability for each gene pair
  - P_GenMut: mutation probability for each gene

Page 11:

Learning Algorithm

• A Reinforcement Learning type algorithm: after performing an action, the agent receives a signal indicating how good its choice of action was (in this case, the reward)
• Selected algorithm: Q-learning with Boltzmann exploration

Page 12:

Basic Q-Learning (I)

• Q-learning attempts to maximize the agent's expected discounted sum of rewards as a function of any given memory at any round n
• Definitions:
  - $\gamma$ - discount factor (non-negative, less than 1)
  - $r_j$ - reward at round $j$
  - Rewards' discounted sum at round $n$: $R_n = \sum_{i \ge 0} \gamma^i\, r_{n+i}$

Page 13:

Basic Q-Learning (II)

• Q(s,a) - "Q-value": the expected discounted sum of future rewards for an agent whose memory is s, given that it selects action a and follows an optimal policy thereafter
• Q(s,a) is updated every time the agent selects action a at memory s. After executing the action, the agent receives reward r and holds memory s'. Q(s,a) is then updated as follows (see the sketch below):

$$Q(s,a) \leftarrow Q(s,a) + \alpha_{Lrn}\,\big[\,r + \gamma_{Lrn}\,\max_{b} Q(s',b) - Q(s,a)\,\big]$$
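A minimal sketch of this update rule, with the Q-table stored as a dictionary keyed by (memory, action); the parameter names follow the slide's α_Lrn and γ_Lrn, everything else is illustrative:

```python
def q_update(Q, s, a, r, s_next, alpha_lrn, gamma_lrn, actions):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).

    Q is a dict keyed by (memory, action); missing entries default to 0.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha_lrn * (r + gamma_lrn * best_next - old)
```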

Page 14:

Basic Q-Learning (III)

• Q(s,a) values can be stored in different forms:
  - A neural network
  - A table (nicknamed a Q-table)
• When stored as a Q-table, each row corresponds to a possible memory s, and each column to a possible action a
• When an agent holds memory s, it should simply select the action a that maximizes Q(s,a) - right??? WRONG!!!

Page 15:

Boltzmann Exploration (I)

• Full exploitation of a Q-value might hide other, better Q-values
• Exploration of Q-values is needed, at least in the early stages
• Boltzmann exploration - the probability of selecting action $a_i$:

$$p(a_i) = \frac{e^{Q(s,a_i)/t}}{\sum_{a} e^{Q(s,a)/t}}$$
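A minimal sketch of Boltzmann action selection over a dictionary Q-table; the function name and representation are illustrative assumptions:

```python
import math
import random

def boltzmann_select(Q, s, actions, t):
    """Pick an action with probability proportional to exp(Q(s,a) / t)."""
    weights = [math.exp(Q.get((s, a), 0.0) / t) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```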

Page 16:

Boltzmann Exploration (II)

• t - an annealing temperature
• At round n: $t = f_{LrnTemp}(n)$
• As t decreases, exploration decreases and exploitation increases: for a given s, the probability of selecting its best Q-value approaches 1 as n increases
• The variant used here adds a freezing temperature $t_{LrnFreeze}$: when t drops below it, exploration is replaced by full exploitation
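A small sketch of this schedule using the concrete values from the experiment section (f_LrnTemp(n) = 5 · 0.999^n, t_LrnFreeze = 0.2); returning None to signal "exploit greedily" is my own convention:

```python
def temperature(n, t_freeze=0.2):
    """Annealing schedule t = 5 * 0.999^n; below the freezing temperature,
    Boltzmann exploration is replaced by greedy (full-exploitation) selection."""
    t = 5 * (0.999 ** n)
    return None if t < t_freeze else t   # None signals "select argmax Q(s,a) directly"
```

With these constants, t falls below the freezing temperature after roughly 3200 rounds of a generation.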

Page 17:

Learning Agent

• MAM: a Q-table (dynamic)
• Parameters:
  - m_Lrn: memory length
  - α_Lrn: learning rate
  - γ_Lrn: rewards' discount factor
  - f_LrnTemp(n): temperature annealing function
  - t_LrnFreeze: freezing temperature

Page 18:

Parenting Algorithm

• There is no classical "parenting" algorithm available, so it needs to be simulated
• Selected algorithm: Monte-Carlo (another Reinforcement Learning type algorithm)

Page 19:

Monte-Carlo (I)

• Some similarity to Q-learning:
  - A table (nicknamed an "MC-table") stores values ("MC-values") that describe how good it is to take action a given memory s
  - The table dictates an action-selection policy
• Major differences from Q-learning:
  - The table isn't modified after every round, but only after episodes of rounds (in our case, a generation)
  - Q-values and MC-values have different meanings

Page 20:

Monte-Carlo (II)

• "Off-line" version of Monte-Carlo:
  - After completing an episode (generation) in which one table has dictated the action-selection policy, a new, second table is constructed from scratch to evaluate how good any action a is for a given memory s
  - The second table dictates the policy in the next episode (generation)
  - This is equivalent to considering the second table as being built during the current episode, as long as it isn't used in the current episode

Page 21:

Monte-Carlo (III)

• MC(s,a) is defined as the average of all rewards received after memory s was encountered and action a was selected
• What if (s,a) was encountered more than once?
• "Every-visit" variant (see the sketch below):
  - The average of all subsequent rewards is calculated for each occurrence of (s,a)
  - MC(s,a) is the average of all the calculated averages
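A minimal sketch of building the MC-table with the every-visit rule from one generation's history; representing the episode as (memory, action, reward) triples, and counting the reward of the step itself among the "subsequent" rewards, are my reading of the slide, not something it states explicitly:

```python
def build_mc_table(episode):
    """Every-visit Monte-Carlo: MC(s,a) = average, over occurrences of (s,a),
    of the average reward received from that occurrence onward.

    episode: list of (memory, action, reward) triples for one generation.
    """
    sums, counts = {}, {}
    rewards = [r for (_, _, r) in episode]
    for i, (s, a, _) in enumerate(episode):
        future = rewards[i:]                       # rewards from this occurrence onward
        avg = sum(future) / len(future)            # average of subsequent rewards
        sums[(s, a)] = sums.get((s, a), 0.0) + avg
        counts[(s, a)] = counts.get((s, a), 0) + 1
    return {sa: sums[sa] / counts[sa] for sa in sums}
```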

Page 22:

Monte-Carlo (IV)

• The "every-visit" variant is more suitable than the "first-visit" variant (where only the first encounter with (s,a) counts): the environment can change a lot after the first encounter with (s,a)
• Exploration variants are not used here:
  - For a given memory s, the action a with the highest MC-value is selected
  - Full exploitation is acceptable here because we have the experience of the previous episode of rounds

Page 23:

Parenting Agent

• MAM: an MC-table (it doesn't matter whether it is dynamic or static)
  - Dictates action selection for offspring only
• ASF: selects between the actions suggested by the two parents, with equal probability
• Parameters:
  - m_Par: memory length

Page 24:

Complex Agent (I)

• Contains a genetic agent, a learning agent and a parenting agent in a subsumption architecture
• Mating selection (the debt from before) occurs among complex agents:
  - At a generation's end, each agent's average reward serves as its score
  - Agents receive mating rights according to score "strata" (determined by the scores' average and standard deviation); a hypothetical stratification is sketched below

Page 25:

Complex Agent (II)

• Mediates between the inner agents and the environment
• Perceptions are passed directly to the inner agents
• The actions suggested by all inner agents are passed through an ASF, which selects one of them (see the sketch below)
• Parameters:
  - P_CompGen: ASF's probability of selecting the genetic action
  - P_CompLrn: ASF's probability of selecting the learning action
  - P_CompPar: ASF's probability of selecting the parenting action

Page 26:

Complex Agent - Mating

[Diagram: two complex agents of the previous generation, each containing genetic, learning and parenting inner agents (memory, MAM and ASFs), mate to produce a complex agent of the current generation; the environment is shown alongside.]

Page 27:

Complex Agent - Perception

[Diagram: the same architecture; a perception from the environment is passed by the current-generation complex agent directly to each of its inner agents (genetic, learning, parenting).]

Page 28:

Complex Agent - Action

[Diagram: each inner agent suggests an action; the complex agent's ASF selects one of them with probabilities P_Gen, P_Lrn, P_Par and sends it to the environment.]

Page 29:

Experiment (I)

• Measures:
  - Eating-rate: the average reward of a given agent (throughout its generation)
  - BER: Best Eating-Rate (in a generation)
• Framework:
  - 20 agents per generation
  - 9500 generations
  - 30000 rounds per generation
• Dependent variable: success measure (Lambda) - the average of the BERs over the last 1000 generations (see the sketch below)
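A small sketch of computing the success measure from a run's per-generation BERs; the function name is illustrative:

```python
def success_measure(best_eating_rates, tail=1000):
    """Lambda: the average of the per-generation Best Eating-Rates (BERs)
    over the last `tail` generations of a run."""
    recent = best_eating_rates[-tail:]
    return sum(recent) / len(recent)
```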

Page 30:

Experiment (II)

• Environment:
  - Grid: 20 x 20
  - A single food patch, 5 x 5 in size
  - Food-presence probabilities within the patch: 0.8 at the center square, 0.4 in the 8 squares surrounding it, 0.2 in the 16 border squares

Page 31:

Experiment (III)

• Constant values:
  - P_GenCros = 0.02
  - P_GenMut = 0.005
  - m_Gen = 1
  - m_Lrn = 1
  - α_Lrn = 0.2
  - γ_Lrn = 0.95
  - f_LrnTemp(n) = 5 · 0.999^n
  - t_LrnFreeze = 0.2
  - m_Par = 1

Page 32:

Experiment (IV)

• Independent variables:
  - Complex agent parameters: the ASF probabilities (P_Gen, P_Lrn, P_Par) - 111 combinations
  - Environment parameter: P_EnvMov, the "movement probability" - the probability that in a given round the food patch moves in a random direction (0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1)
• One run for each combination of values

Page 33:

Results: Static Environment

• Best combination: a Genetic-Parenting hybrid (P_Lrn = 0) with P_Gen > P_Par
• Pure genetics doesn't perform well: the GA converges more slowly when not assisted by learning or parenting
• Pure parenting performs poorly
• For a given P_Par, success improves as P_Lrn decreases

(Graph for movement probability 0)

Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
0            (0.7, 0, 0.3)                0.7988

Page 34:

Results: Low Dynamic Rate

• Best combination: a Genetic-Learning-Parenting hybrid with P_Lrn > P_Gen + P_Par and P_Par >= P_Gen
• Pure parenting performs poorly

(Graph for movement probability 10^-4)

Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
10^-6        (0.15, 0.7, 0.15)            0.7528
10^-5        (0, 0.9, 0.1)                0.7011
10^-4        (0.03, 0.9, 0.07)            0.6021
10^-3        (0.02, 0.8, 0.18)            0.3647

Page 35:

Results: High Dynamic Rate

• Best combination: pure learning (P_Gen = 0, P_Par = 0)
• Pure parenting performs poorly
• Parenting loses effectiveness: non-parenting agents have better success

(Graph for movement probability 10^-2)

Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
10^-2        (0, 1, 0)                    0.1834
10^-1        (0, 1, 0)                    0.0698

Page 36:

Conclusions

• Pure parenting doesn't work
• Agent algorithm A is defined as an action-augmentor of agent algorithm B if:
  - A and B are always used for receiving perceptions
  - B is applied for executing an action in most steps
  - A is applied for executing an action in at least 50% of the other steps
• In a static environment (C1 + ~C2), parenting helps when used as an action-augmentor for genetics
• In slowly changing environments (C1 + C2), parenting helps when used as an action-augmentor for learning
• In quickly changing environments (C1 only), parenting doesn't work - pure learning is best

Page 37:

Bibliography (I)

• Genetic Algorithm:
  - R. Axelrod. The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, 1997.
  - H.G. Cobb and J.J. Grefenstette. Genetic algorithms for tracking changing environments. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 523-530, San Mateo, 1993.
• Q-Learning:
  - T.W. Sandholm and R.H. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37: 147-166, 1996.
• Monte-Carlo methods, Q-Learning, Reinforcement Learning:
  - R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Page 38:

Bibliography (II)

• Genetic-Learning combinations:
  - G.E. Hinton and S.J. Nowlan. How learning can guide evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 447-454. Addison-Wesley, 1996.
  - T.D. Johnston. Selective costs and benefits in the evolution of learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 315-358. Addison-Wesley, 1996.
  - M. Littman. Simulations combining evolution and learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 465-477. Addison-Wesley, 1996.
  - G. Mayley. Landscapes, learning costs and genetic assimilation. Evolutionary Computation, 4(3): 213-234, 1996.

Page 39:

Bibliography (III)

• Genetic-Learning combinations (cont.):
  - S. Nolfi, J.L. Elman and D. Parisi. Learning and evolution in neural networks. Adaptive Behavior, 3(1): 5-28, 1994.
  - S. Nolfi and D. Parisi. Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, 5(1): 75-98, 1997.
  - D. Parisi and S. Nolfi. The influence of learning on evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 419-428. Addison-Wesley, 1996.
  - P.M. Todd and G.F. Miller. Exploring adaptive agency II: Simulating the evolution of associative learning. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 306-315, San Mateo, 1991.

Page 40:

Bibliography (IV)

• Exploitation vs. Exploration:
  - D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems, 2(2): 141-172, 1999.
• Subsumption architecture:
  - R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1): 14-23, March 1986.

Page 41:

Backup - Qualitative Data

Page 42:

Qual. Data: Mov. Prob. 0

[Graph: curves for Pure Parenting, Pure Learning, Pure Genetics, and the best combination (0.7, 0, 0.3).]

Page 43:

Qual. Data: Mov. Prob. 10^-4

[Graph: curves for Pure Parenting, Pure Learning, and the best combination (0.03, 0.9, 0.07).]

Page 44:

Qual. Data: Mov. Prob. 10^-2

[Graph: curves for Pure Parenting, the combination (0.09, 0.9, 0.01), and the best combination (pure learning).]