phd presentation - ulisboaweb.ist.utl.pt/.../documents/phdpresentation4gaips.pdf ·...

PhD Presentation

Biologically-inspired Models for Learning Agents

http://web.ist.utl.pt/~pedro.sequeira/phd





Introduction

Motivation

Case Studies

Conclusions



Prof. Francisco Melo as thesis co-supervisor

CAT in mid-July

Objectives

General problem

General solution

Focus on case studies / experiments

Main idea

Provide learning models to autonomous agents

Inspired by biological models






Definitions [Franklin & Graesser, 1997; Maes, 1994]

situated in dynamic environments

have and actively pursue goals

satisfy their needs

respond to external events from the environment

MAS - live and interact with other agents



Requirements [Franklin & Graesser, 1997; Maes, 1994]

mechanisms to distinguish perceived features

focus on relevant features, ignore non-important ones

adapt to and learn new knowledge from the environment

take the right action at each decision time

structures that represent the acquired knowledge

update representations over time to reflect experience



Building Agents

Key is ADAPTATION

Provide prior knowledge

sufficient for the agent to perceive its environment

Use learning mechanisms

update the agent’s knowledge



Problems

Prior knowledge

lots of pre-programming of behaviors

large knowledge bases

Perceptual limitations

world dynamics, good states

Acting limitations

good actions

Learning

which paradigm / framework to use?



Parallel between natural and artificial agents

Inhabit highly dynamic environments

Have to make complex decisions under uncertainty

Limited perceptual and acting capabilities

Focus on important events

Live in organized societies



Inspiration from biological models

Evolutionary adaptive mechanisms

Simple but powerful survival tools

Improve performance with experience

Make the most of the perceived information

Lead to a greater fitness



Inspiration from several research areas

Psychology

Biology

Ethology

Neuroscience

…



Classical conditioning in RL

Improve learning speed

State-space reduction

Emotion-based Intrinsic Motivated RL

Single-agent event-processing mechanism

Use emotions as intrinsic rewards

Clues from agent-environment relationship

Improve agent fitness

Socially-aware IMRL

Multi-agent social processing mechanism

Use affiliation / cooperation

Improve population fitness






Inspired by animal learning

Teach an animal to respond in a certain way

Provide reward and punishment appropriately

Main Ideas [Sutton & Barto, 1998]

Learn from experience

Situations + Actions → Reward

Reward is external feedback signal

Objective: maximize the reward received over time

Task: discover which actions maximize reward in each state

Trial-and-error search

Mind subsequent (delayed) rewards
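The point about delayed rewards can be illustrated with the discounted return used in Sutton & Barto's formulation. This is a minimal sketch: the discount factor `gamma` and the two example episodes are assumptions for illustration and do not come from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward t steps ahead is scaled by gamma**t."""
    total = 0.0
    # Walk backwards so each step folds in the discounted future return.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episodes: a large reward that arrives late versus a small
# reward that arrives immediately.
late_reward = discounted_return([0.0, 0.0, 1.0])
early_reward = discounted_return([0.1, 0.0, 0.0])
```

With these assumed numbers the late reward is still worth more than the early one, which is why the agent cannot judge an action by its immediate reward alone.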



Main Idea

Inspiration from classical conditioning paradigm

Partition observations into stimuli

Propose a measure for distance between states

Learn the value of states based on nearby states

Propagated learning

Reduce state space

Reduce learning time



Classical Conditioning [Pavlov, 1927]

Contingency between stimuli in the environment

Independent of the animal's behavior

Animal does not learn behavior consequences

Advantages

Predict the outcomes of new events from already-known situations

Create new contexts for behavior activation

[Diagram: US (food delivery) → UR (salivation); CS (bell), after training, → CR (conditioned salivation)]



Model Based on Sensory Pattern Mining [Sequeira & Antunes, 2010]

Partition observations into stimuli

e.g. see bone, has ball, hear “Fetch!”

Build tree containing frequent patterns



Model

Use the Jaccard index [Jaccard, 1912]

frequency of intersection between stimuli over frequency of union of the stimuli:

J(A, B) = freq(A ∩ B) / freq(A ∪ B)

Advantages

Sensitive to particular correlations between stimuli

Rapid access to frequent patterns
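The Jaccard index described above can be sketched directly over a list of observations, each treated as a set of stimuli. The function name and the stimulus labels are hypothetical, and the sketch counts observations by brute force rather than using the pattern tree from the slides.

```python
def jaccard(observations, stimulus_a, stimulus_b):
    """Frequency of co-occurrence over frequency of either stimulus occurring."""
    both = sum(1 for obs in observations
               if stimulus_a in obs and stimulus_b in obs)
    either = sum(1 for obs in observations
                 if stimulus_a in obs or stimulus_b in obs)
    # If neither stimulus was ever observed, report no association.
    return both / either if either else 0.0


# Hypothetical observation history, one set of stimuli per time step.
history = [
    {"see_bone", "hear_fetch"},
    {"see_bone"},
    {"has_ball", "hear_fetch"},
    {"see_bone", "hear_fetch"},
]
bone_fetch = jaccard(history, "see_bone", "hear_fetch")
bone_ball = jaccard(history, "see_bone", "has_ball")
```

Here `bone_fetch` is 0.5 (two co-occurrences out of four observations containing either stimulus) and `bone_ball` is 0.0, since the two stimuli never appear together.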



Learning Model

Extend Q-learning algorithm [Watkins, 1989]

Determine similar states using the pattern tree

State distance measure:

Propagated multi-state update of values

New state receives information from similar states
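One possible reading of the propagated multi-state update is sketched below. The slides do not give the state distance measure, so this sketch assumes a plain Jaccard similarity between the stimulus sets of two states; the `threshold` parameter and the similarity-scaled step size are also assumptions, not the author's formulation.

```python
from collections import defaultdict


def set_similarity(state_a, state_b):
    """Assumed similarity: Jaccard index between two stimulus sets."""
    union = state_a | state_b
    return len(state_a & state_b) / len(union) if union else 1.0


def propagated_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, threshold=0.5):
    """Q-learning update on `state`, then a scaled copy on similar states."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    # Propagate the same error to every known state that is similar enough.
    known_states = {s for (s, _) in list(q)}
    for other in known_states:
        if other == state:
            continue
        similarity = set_similarity(state, other)
        if similarity >= threshold:
            q[(other, action)] += alpha * similarity * td_error
    return td_error


# Hypothetical states, each a frozenset of stimuli.
q_values = defaultdict(float)
bone_and_fetch = frozenset({"see_bone", "hear_fetch"})
bone_only = frozenset({"see_bone"})
ball_only = frozenset({"has_ball"})
q_values[(bone_only, "Eat")] = 0.0
q_values[(ball_only, "Eat")] = 0.0
propagated_update(q_values, bone_and_fetch, "Eat", 1.0, ball_only,
                  actions=["Eat", "Pick"])
```

In this example the visited state moves to 0.1, the state sharing the bone stimulus receives half of that update, and the unrelated state is left untouched.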



Experiment

Inspired by animal training

stimuli: 3 visual, 2 tactile, 2 auditory

actions: Pick, Drop, Eat, Approach Trainer, Approach Ball

4 phases: acquisition, extinction, association, substitution

Objectives

form associations between co-occurring stimuli

evoke innate responses to new stimuli

discover new contexts for already-known responses



Main results

Faster initial learning

Secondary conditioning (e.g. “Fetch!” heard in more cells)

New contexts for actions (e.g. Eat when bone is present)



Main Ideas [Singh et al., 2010]

Reward behaviors rather than consequences

Agent receives augmented reward

extrinsic reward

“normal” reward in RL, related to the task (e.g. fulfillment of needs)

intrinsic reward

not directly related to the task (e.g. play or explore)

Objective: maximize total reward
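The augmented reward can be sketched as the extrinsic reward plus a weighted intrinsic term. The first-visit exploration bonus and the `intrinsic_weight` parameter below are hypothetical stand-ins for an "explore" style intrinsic reward; the slides do not specify either.

```python
def total_reward(extrinsic, intrinsic, intrinsic_weight=1.0):
    """Augmented reward: task reward plus a weighted intrinsic term."""
    return extrinsic + intrinsic_weight * intrinsic


class ExplorationBonus:
    """Assumed intrinsic reward: 1 the first time a cell is visited, else 0."""

    def __init__(self):
        self.visited = set()

    def __call__(self, cell):
        is_new = cell not in self.visited
        self.visited.add(cell)
        return 1.0 if is_new else 0.0


bonus = ExplorationBonus()
first_visit = total_reward(0.0, bonus((0, 0)), intrinsic_weight=0.5)
second_visit = total_reward(0.0, bonus((0, 0)), intrinsic_weight=0.5)
```

The agent is rewarded for the exploring behavior itself on the first visit, even though no task reward was received there.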



Main Idea

Inspiration from emotional appraisal mechanisms

Mathematical adaptation of dimensions

Emotions as intrinsic rewards

Integrate with IMRL framework

Provide clues from agent-environment relationship

Enhance single agent fitness



Emotions [Dawkins, 2000; Cardinal et al., 2002]

Evolutionary adaptive mechanism

Combined with learning, signal advantageous and dangerous situations

help when seeking food and avoiding harm

Bias decision making [Naqvi et al., 2006]

maximizing reward and minimizing punishment

In humans [Phelps & LeDoux, 2005]

memory enhancement, sensory plasticity, attention facilitation, regulation of social behavior, regulation and inhibition of emotional responses



Appraisal theories of emotion [Ellsworth & Scherer, 2003; Leventhal & Scherer, 1987]

Emotions arise from evaluations

Characterize subject-environment relationship

Significance for the person’s well-being or goals

Appraisal dimensions

each dimension evaluates a specific aspect



Model of emotions in IMRL

Inspired by appraisal theories of emotions

Adopt four common appraisal dimensions

novelty, motivation, valence, control

each evaluates agent-environment relationship

numerical value represents dimension activation

Use dimension adaptations as reward features

each feature is component of intrinsic reward



Affective reward features

Adaptation from Major Dimensions of Appraisal [Ellsworth & Scherer, 2003]

intentionally did not adapt social dimensions

Problem: appraisal theories usually deal with high-level psychological processes

complex concepts (e.g. causal attribution, norms)

Solution: inspiration from the Multilevel Process Theory of Emotion [Leventhal & Scherer, 1987]

appraise events at different levels

emotions range from reflex-like responses to complex cognitive patterns

Evaluate aspects of the agent’s history of interaction



Affective reward features

Novelty: degree of familiarity of events

Valence: innate pleasure detector, learned preferences

Motivation: relevance of event for goals or needs

Control: degree of correctness of the world-model
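How such features might feed a reward signal can be sketched as follows. The slides do not give the feature formulas, so the count-based novelty measure here is an assumption chosen only to match "degree of familiarity of events", and the weighted sum is one simple way to combine features with a weight vector.

```python
from collections import Counter


class NoveltyFeature:
    """Assumed novelty: decays as the same state-action pair is seen again."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, state, action):
        seen = self.counts[(state, action)]
        self.counts[(state, action)] += 1
        return 1.0 / (1.0 + seen)


def combined_reward(weights, features):
    """Weighted sum of reward features, keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())


novelty = NoveltyFeature()
first = novelty("cell_3", "Eat")    # never seen before
second = novelty("cell_3", "Eat")   # seen once
reward = combined_reward(
    {"novelty": 0.2, "extrinsic": 1.0},       # hypothetical weight vector
    {"novelty": second, "extrinsic": 1.0},
)
```

The other three dimensions (valence, motivation, control) would enter `combined_reward` the same way once a numerical definition is chosen for each.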



Experiments

Grid-world scenarios inspired by foraging environments

agent is a predator that tries to eat prey in the environment

observations: cell position, see prey

actions: N, S, E, W, Eat

Dyna-Q/prioritized sweeping algorithm [Moore & Atkeson, 1993]

Objectives

maximize the agent’s fitness (extrinsic reward)

optimize feature weight vector
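The slides do not say how the weight vector is optimized. One simple option, shown here only as an assumed sketch, is an exhaustive search over weight vectors on a coarse grid that sum to 1, scoring each with a fitness function. The `evaluate_fitness` callable is hypothetical; in the experiments it would run the learning agent and return its accumulated extrinsic reward.

```python
from itertools import product


def candidate_weights(n_features, step=0.5):
    """Yield weight vectors on a grid whose components sum to 1."""
    levels = [i * step for i in range(int(round(1 / step)) + 1)]
    for weights in product(levels, repeat=n_features):
        if abs(sum(weights) - 1.0) < 1e-9:
            yield weights


def best_weights(evaluate_fitness, n_features, step=0.5):
    """Return the candidate weight vector with the highest fitness."""
    return max(candidate_weights(n_features, step), key=evaluate_fitness)


def toy_fitness(weights):
    # Stand-in for a full simulation: peaks when both weights equal 0.5.
    return -sum((w - 0.5) ** 2 for w in weights)


chosen = best_weights(toy_fitness, n_features=2, step=0.5)
```

A finer `step` gives a denser grid at the cost of many more simulation runs, since the number of candidates grows quickly with the number of features.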



Exploration scenario

One prey

Eat prey, r_ext = 1

Non-Markovian

Results

optimal weight vector

optimal fitness: 1,902.2

"only extrinsic" fitness: 135.9



Persistence scenario

Two prey types: rabbit and hare

Eat prey

rabbit: r_ext = 0.1; hare: r_ext = 1

Fence

crossing takes n North actions; the next time, n+1

Non-Markovian

Results






Prey-season scenario

Two prey types: rabbit and hare

Eat prey


Two seasons: rabbit and hare

10,000 steps

if 10 rabbits are eaten, r_ext = -1

Non-Markovian

Results






Different rewards scenario

Two prey types: rabbit and hare

always available

Eat prey


Markovian

Results



"only extrinsic" fitness: 87,890.8



Conclusions

Intrinsic reward features based on emotional appraisal

Guide the agent during learning

Focus on specific aspects of the environment

Balance between different strategies

Bring attention to advantageous states

Ignore less favorable states



Main Idea

Integrate with IMRL framework

Multi-agent scenarios

Inspiration from affiliation and altruism

Mathematical adaptation of social signals

Emergence of socially-aware behaviors

Raise the fitness of the population

Even the fitness of each agent



Affiliation [Dörner, 1999; Bach, 2009]

Urge to affiliate / interact with other agents

Send and receive legitimacy signals

reward socially-acceptable behaviors (l-signals)

punish unsuccessful interactions (anti l-signals)

internally reward or punish socially-aware behaviors (internal l-signals)

Altruism [de Waal, 2008]

Intrinsic reward when the social group benefits

initial cost but subsequent compensation



Model for socially-motivated learning

Fitness measured at the population level:

Intrinsic reward: two social features

external signal: received from other agents, based on l-signal

internal signal: generated by the agent, based on internal l-signal

represent level of satisfaction of affiliation need

Total reward



Social features for limited resource scenarios

Extrinsic reward

r_ext = IsFull - 0.1 × IsHungry

External reward feature

rsExt = LastToEat AND SeeFood AND SeeOther AND !Eat

Internal reward feature

rsInt = LastToEat AND SeeFood AND !Eat
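The three reward terms above translate almost directly into code. This sketch assumes each observation flag is a boolean and that the boolean features are cast to 0 or 1 before being weighted; the function names are hypothetical.

```python
def extrinsic_reward(is_full, is_hungry):
    """r_ext = IsFull - 0.1 * IsHungry, with flags read as 0 or 1."""
    return float(is_full) - 0.1 * float(is_hungry)


def external_social_feature(last_to_eat, see_food, see_other, eat):
    """rsExt: fires when the last eater leaves visible food to a visible peer."""
    return float(last_to_eat and see_food and see_other and not eat)


def internal_social_feature(last_to_eat, see_food, eat):
    """rsInt: fires when the last eater refrains from eating visible food."""
    return float(last_to_eat and see_food and not eat)


# Hypothetical step: the agent ate last, sees food and the other agent,
# and chooses not to eat.
ext_signal = external_social_feature(True, True, True, eat=False)
int_signal = internal_social_feature(True, True, eat=False)
hungry_step = extrinsic_reward(is_full=False, is_hungry=True)
```

The only difference between the two social features is the `SeeOther` condition: the external signal depends on another agent being present to send it, while the internal one is generated by the agent itself.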



Experiments

Grid-world scenarios inspired by foraging environments

two predator agents

observations: position, SeeFood, SeeOther, LastToEat, IsHungry

actions: N, S, E, W, Eat

rewards: r_eat = 1, r_hungry = -0.1

agents become hungry after 30 timesteps

Objectives

maximize the population fitness (sum of extrinsic rewards)

optimize feature weight vector



Single-food scenario

One food resource

Agent that eats starts closer to food resource (bottom-right)

Results



"only extrinsic" fitness: -19,991.3



Equal-resource scenario

Two food resources

Agent that eats starts bottom-right

Possibility of both eating

Results






Stronger-agent scenario

One food resource

Both start bottom-right

One agent is stronger

When both try to eat, only one succeeds

Results









Biologically-inspired learning models

Provide built-in prior knowledge

Learning framework based on RL and IMRL

Rewards based on agent-environment relationship

Results

Speed up learning

State-space reduction

Intrinsic features provide clues on important aspects

Lead to different strategies

Not directly related to the task, but increase fitness

Lead to “socially-aware” behaviors



Improve classical conditioning model

Support more learning paradigms

Improve multi-agent model

Inspiration from cooperation

Evolutionary Game Theory

CAT…

