TRANSCRIPT
Autonomous Learning Laboratory – Department of Computer Science
Perspectives on Computational Reinforcement Learning
Andrew G. Barto
Autonomous Learning Laboratory, Department of Computer Science
University of Massachusetts Amherst
Searching in the Right Space
[Diagram: Computational Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence (machine learning), Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience]
Computational Reinforcement Learning
“Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” —Dayan 2001
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
The View from Machine Learning
Unsupervised Learning: recode data based on some given principle
Supervised Learning: “learning from examples”, “learning with a teacher”; related to Classical (or Pavlovian) Conditioning
Reinforcement Learning: “learning with a critic”; related to Instrumental (or Thorndikian) Conditioning
Classical Conditioning
Tone (CS: Conditioned Stimulus)
Food (US: Unconditioned Stimulus)
Salivation (UR: Unconditioned Response)
Anticipatory salivation (CR: Conditioned Response)
Pavlov, 1927
Edward L. Thorndike (1874-1949)
puzzle box
Learning by “Trial-and-Error”
Trial-and-Error = Error Correction
Artificial Neural Network:
learns from a set of examples via error-correction
“Least-Mean-Square” (LMS) Learning Rule
input pattern $x = (x_1, \ldots, x_n)$; weights $w_1, \ldots, w_n$; actual output $V$; desired output $z$

[Figure: an Adaline element compares its actual output $V$ with the desired output $z$ and uses the difference to adjust the weights]

$$\Delta w_i = \alpha\,(z - V)\,x_i$$

“delta rule”, Adaline, Widrow and Hoff, 1960
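To make the rule concrete, here is a minimal Python sketch of the LMS/delta rule as written above; the learning rate, the random training patterns, and the linear target mapping are illustrative assumptions, not from the talk.

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS ("delta rule") step: adjust weights by the error z - V."""
    V = w @ x                       # actual output: weighted sum of inputs
    return w + alpha * (z - V) * x  # Δw_i = α (z - V) x_i

# Illustrative usage: learn a 2-input linear target from random patterns.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(500):
    x = rng.uniform(-1.0, 1.0, size=2)
    z = 0.5 * x[0] - 0.3 * x[1]     # hypothetical desired output
    w = lms_update(w, x, z)
print(w)                            # approaches [0.5, -0.3]
```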
Trial-and-Error?
“The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The ‘error’ signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment.”
Widrow and Hoff, “Adaptive Switching Circuits,”
1960 IRE WESCON Convention Record
MENACE Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[Figure: array of tic-tac-toe board positions marked with x’s and o’s, one matchbox per position]
Essence of RL (for me at least!): Search + Memory
Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . .
Memory: remember what worked best for each situation and start from there next time
RL is about caching search results (so you don’t have to keep searching!)
Generate-and-Test
Generator should be smart:
• Generate lots of things that are likely to be good, based on prior knowledge and prior experience
• But also take chances …
Tester should be smart too:
• Evaluate based on real criteria, not convenient surrogates
• But be able to recognize partial success
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
Key Players
Harry Klopf, Rich Sutton, Me
Arbib, Kilmer, and Spinelli, “Neural Models and Memory,”
in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett (eds.), 1974
A. Harry Klopf
“Brain Function and Adaptive Systems -- A Heterostatic Theory,” Air Force Cambridge Research Laboratories Technical Report, 3 March 1972

“…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist.”
Klopf’s theory (very briefly!)
Inspiration: the nervous system is a society of self-interested agents.
• Nervous Systems = Social Systems
• Neuron = Man
• Man = Hedonist
• Neuron = Hedonist
• Depolarization = Pleasure
• Hyperpolarization = Pain

A neuronal model:
• A neuron “decides” when to fire by comparing a spatial and temporal summation of weighted inputs with a threshold.
• A neuron is in a condition of heterostasis from time $t$ to $t+\Delta$ if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval.
• Two ways to adapt weights to do this:
  • Push excitatory weights to upper limits; zero out inhibitory weights
  • Make the neuron control its input.
Heterostatic Adaptation
When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances.
The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.
The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.
Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).
Key Components of Klopf’s Theory
• Eligibility
• Closed-loop control by neurons
• Extremization (e.g., maximization) as goal, instead of zeroing something
• “Generalized Reinforcement”: reinforcement is not delivered by a specialized channel
The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence
A. Harry Klopf, Hemisphere Publishing Corporation, 1982
Eligibility Traces
Klopf, 1972: when input $x$ contributes to firing the output $y$, the synapse becomes eligible; eligibility is a decaying trace $\overline{xy}_t$ of the product of pre- and postsynaptic activity, and weight changes are driven by subsequent changes in activity:

$$\Delta w_t = \alpha\,\dot y_t\,\overline{xy}_t$$

Optimal ISI: the eligibility trace has the same curve as the reinforcement-effectiveness curve in conditioning: max at 400 ms; 0 after approx 4 s.

[Figure: the trace interpreted as a histogram of the lengths of the feedback pathways in which the neuron is embedded]
Later Simplified Eligibility Traces
[Figure: visits to state $s$ over time; the accumulating trace increments at each visit, while the replacing trace resets to 1]
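A small Python sketch contrasting the two simplified traces; the decay rate and the three-state setup are illustrative assumptions.

```python
import numpy as np

def trace_step(e, visited, decay=0.9, kind="accumulate"):
    """Decay all state traces, then mark the visited state.

    kind="accumulate": add 1 to the visited state's trace (it can exceed 1).
    kind="replace":    reset the visited state's trace to exactly 1.
    """
    e = decay * e
    if kind == "accumulate":
        e[visited] += 1.0
    else:
        e[visited] = 1.0
    return e

# Four consecutive visits to state 0 show the difference.
e_acc, e_rep = np.zeros(3), np.zeros(3)
for _ in range(4):
    e_acc = trace_step(e_acc, 0, kind="accumulate")
    e_rep = trace_step(e_rep, 0, kind="replace")
print(e_acc[0], e_rep[0])  # accumulating trace > 1; replacing trace == 1
```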
Rich Sutton
BA Psychology, Stanford, 1978
As an undergrad, discovered Klopf’s 1972 tech report
Two unpublished undergraduate reports:
• “Learning Theory Support for a Single Channel Theory of the Brain,” 1978
• “A Unified Theory of Expectation in Classical and Instrumental Conditioning,” 1978 (?)
Rich’s first paper:
• “Single Channel Theory: A Neuronal Theory of Learning,” Brain Theory Newsletter, 1978
Sutton’s Theory
$A_j$: level of activation of mode $j$ at time $t$
$V_{ij}$: sign and magnitude of the association from mode $i$ to mode $j$ at time $t$
$E_{ij}$: eligibility of $V_{ij}$ for undergoing changes at time $t$; proportional to the average of the product $A_i(t)A_j(t)$ over some small past time interval (or an average of the logical AND)
$P_j$: expected level of activation of mode $j$ at time $t$ (a prediction of the level of activation of mode $j$)
$C_{ij}$: a constant depending on the particular association being changed

$$\frac{d}{dt}V_{ij} = C_{ij}\,\big(A_j - P_j\big)\,E_{ij}$$
What exactly is Pj?
Based on recent activation of the mode: the higher the activation within the last few seconds, the higher the level expected for the present . . .

$P_j(t)$ is proportional to the average of the activation level over some small time interval (a few seconds or less) before $t$.

At the level of a single unit, with output trace $\bar y_t$ and eligibility trace $\overline{xy}_t$:

$$\Delta w_t = \alpha\,\big(y_t - \bar y_t\big)\,\overline{xy}_t$$
Sutton’s theory
Contingent Principle: based on the reinforcement a neuron receives after it fires and on the synapses involved in the firing, the neuron modifies its synapses so that they will cause it to fire when firing increases the neuron’s expected subsequent reinforcement.
• Basis of Instrumental, or Thorndikian, conditioning

Predictive Principle: if a synapse’s activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity will come to have an effect on the neuron similar to that of reinforcement.
• Basis of Classical, or Pavlovian, conditioning
Sutton’s Theory
Main addition to Klopf’s theory: a difference term, i.e., a temporal difference term
Showed the relationship to the Rescorla-Wagner model (1972) of Classical Conditioning:
• Blocking
• Overshadowing
Sutton’s model was a real-time model of both classical and instrumental conditioning
Emphasized conditioned reinforcement
Rescorla Wagner Model, 1972
$$\Delta V_A = \alpha\,\big(\lambda - V_\Sigma\big)$$

$\Delta V_A$: change in associative strength of CS A
$\alpha$: parameter related to CS intensity
$\lambda$: parameter related to US intensity
$V_\Sigma$: sum of associative strengths of all CSs present (“composite expectation”)

“Organisms only learn when events violate their expectations.”

A “trial-level” model
Conditioned Reinforcement
Stimuli associated with reinforcement take on reinforcing properties themselves
Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978)
“In principle this chaining can go back for any length …” (Sutton, 1978)
Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning
Where was I coming from?
Studied at the University of Michigan (PhD in 1975): at the time a hotbed of genetic algorithm activity due to John Holland’s influence
Holland talked a lot about the exploration/exploitation tradeoff
But I studied dynamic system theory, the relationship between state-space and input/output representations of systems, convolution and harmonic analysis, and finally cellular automata
Fascinated by how simple local rules can generate complex global behavior:
• Dynamic systems
• Cellular automata
• Self-organization
• Neural networks
• Evolution
• Learning
Sutton and Barto, 1981
“Toward a Modern Theory of Adaptive Networks: Expectation and Prediction,” Psych Review 88, 1981
Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of $x$ instead of $xy$
Emphasized the anticipatory nature of the CR
Related to “Adaptive System Theory”:
• Other neural models (Hebb, Widrow & Hoff’s LMS, Uttley’s “Informon”, Anderson’s associative memory networks)
• Pointed out the relationship between the Rescorla-Wagner model and the Adaline, or LMS, algorithm
• Studied algorithm stability
• Reviewed possible neural mechanisms: e.g., eligibility = intracellular Ca ion concentration
“SB Model” of Classical Conditioning
$$\Delta w_t = c\,\big(y_t - \bar y_t\big)\,\bar x_t, \qquad \bar y_t = y_{t-1}, \qquad \bar x_{t+1} = \alpha\,\bar x_t + x_t$$

where $x_t$ is the CS input, $\bar x_t$ its (eligibility) trace, $y_t$ the output, and $\bar y_t$ a one-step trace of the output.
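A hedged Python sketch of one SB-model step under the three equations above; the parameter values are arbitrary, and letting the US enter the output directly follows the point made below that the US both activates the unit and directs learning.

```python
import numpy as np

def sb_step(w, xbar, x, us, y_prev, alpha=0.8, c=0.1):
    """One SB-model step.

    w:      associative strengths for the CS components
    xbar:   stimulus (eligibility) trace of the CS vector x
    us:     US input, added directly to the output (generalized reinforcement)
    y_prev: previous output, serving as the output trace ybar_t = y_{t-1}
    """
    y = float(w @ x) + us            # output: CS contribution plus US
    w = w + c * (y - y_prev) * xbar  # Δw_t = c (y_t - ybar_t) xbar_t
    xbar = alpha * xbar + x          # xbar_{t+1} = α xbar_t + x_t
    return w, xbar, y
```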
Temporal Primacy Overrides Blocking in SB model
Kehoe, Schreurs, and Graham 1987
our simulation
Intratrial Time Courses (part 2 of blocking)
Adaline Learning Rule
input pattern $x_t$; target output $z_t$; weights $w_t$; output $y_t = w_t^{\mathsf T} x_t$

$$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$$

LMS rule, Widrow and Hoff, 1960
“Rescorla–Wagner Unit”
CS input vector $x_t$, with $w_t$ a vector of “associative strengths”; US input $z_t$ drives the UR; the output $y_t = w_t^{\mathsf T} x_t$ is the “composite expectation” and drives the CR

$$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$$
Important Notes
The “target output” of LMS corresponds to the US input of the Rescorla-Wagner model
In both cases, this input is specialized: it does not directly activate the unit but only directs learning
The SB model is different: the US input both activates the unit and directs learning
Hence, the SB model can do secondary reinforcement
The SB model stayed with Klopf’s idea of “generalized reinforcement”
One Neural Implementation of S-B Model
A Major Problem: US offset
e.g., if a CS has the same time course as the US, the weights change until the US is cancelled out.

[Figure: US and CS time courses; in the final result the US response is cancelled]

Why? Because the rule is trying to zero out $y_t - y_{t-1}$.
Associative Memory Networks
Kohonen et al. 1976, 1977; Anderson et al. 1977
Associative Search Network Barto, Sutton, & Brouwer 1981
Input vector $X(t) = (x_1(t), \ldots, x_n(t))$, randomly chosen from the set $\{X^1, X^2, \ldots, X^k\}$
Output vector in response to $X(t)$ is $Y(t) = (y_1(t), \ldots, y_m(t))$
For each $X^\alpha$ the payoff is a scalar $Z^\alpha(Y(t))$
$X(t)$ is the context vector at time $t$

$$y(t) = \begin{cases} 1 & \text{if } s(t) + \text{NOISE}(t) > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad \text{where } s(t) = \sum_{i=1}^{n} w_i(t)\,x_i(t)$$
Associative Search NetworkBarto, Sutton, Brouwer, 1981
Problem of context transitions: add a predictor

Without the predictor:
$$\Delta w_i(t) = c\,\big[z(t) - z(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1)$$

With the predictor $p$:
$$\Delta w_i(t) = c\,\big[z(t) - p(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1)$$
$$\Delta w^p_i(t) = c_p\,\big[z(t) - p(t-1)\big]\,x_i(t-1) \qquad \text{(a “one-step-ahead LMS predictor”)}$$
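A Python sketch of one ASN update with the one-step-ahead LMS predictor, following the equations above; the Gaussian noise, the gains, and the bookkeeping of past values are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def asn_step(w, wp, x_prev, x, y_prev, y_prev2, z, c=0.5, cp=0.5):
    """One associative-search-network update (single output unit).

    x_prev, y_prev, y_prev2: context and actions remembered from earlier steps
    z: payoff received now for the action taken in context x_prev
    """
    p_prev = float(wp @ x_prev)  # prediction p(t-1) made for this context
    # search weights: payoff vs. prediction, gated by the change in action
    w = w + c * (z - p_prev) * (y_prev - y_prev2) * x_prev
    # predictor weights: one-step-ahead LMS
    wp = wp + cp * (z - p_prev) * x_prev
    # choose a new (noisy, binary) action for the current context x
    y = 1.0 if float(w @ x) + rng.normal() > 0 else 0.0
    return w, wp, y
```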
Relation to Klopf/Sutton Theory
$$\Delta w_i(t) = c\,\big[z(t) - p(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1), \qquad \text{eligibility} = \dot y\,x$$
Did not include generalized reinforcementsince z(t) is a specialized reward input
Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan
Associative Search Network
“Landmark Learning” Barto & Sutton 1981
An illustration of associative search
“Landmark Learning”
“Landmark Learning”
swap E and W landmarks
Note: Diffuse Reward Signal
[Figure: three units with shared inputs $x_1, x_2, x_3$, outputs $y_1, y_2, y_3$, and a single diffuse reward signal broadcast to all]

Units can learn different things despite receiving identical inputs . . .
Provided there is variability
The ASN just used noisy units to introduce variability
Variability drives the search
It needs an element of “blindness”, as in “blind variation”: i.e., the outcome is not completely known beforehand
BUT it does not have to be random
IMPORTANT POINT: blind variation does not have to be random, or dumb
Pole Balancing
Widrow & Smith, 1964, “Pattern Recognizing Control Systems”
Michie & Chambers, 1968, “Boxes: An Experiment in Adaptive Control”
Barto, Sutton, & Anderson 1984
MENACE Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[Figure: the same array of tic-tac-toe board positions shown earlier]
The “Boxes” Idea
“Although the construction of this Matchbox Educable Noughts and Crosses Engine (Michie 1961, 1963) was undertaken as a ‘fun project’, there was present a more serious intention to demonstrate the principle that it may be easier to learn to play many easy games than one difficult one. Consequently it may be advantageous to decompose a game into a number of mutually independent sub-games even if much relevant information is put out of reach in the process.”
Michie and Chambers, “Boxes: An Experiment in Adaptive Control”
Machine Intelligence 2, 1968
Boxes
Actor-Critic Architecture
ACE = adaptive critic element; ASE = associative search element
The Actor
ASE: associative search element

$$y(t) = \begin{cases} +1 & \text{if } s(t) + \text{NOISE}(t) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

$$\Delta w_i(t) = \alpha\,r(t)\,e_i(t), \qquad \text{where eligibility } e_i(t) = \overline{y\,x_i}(t)$$

Note: 1) move from changes in evaluation to just $r$; 2) move from $\dot y$ to just $y$ in the eligibility.
The Critic
ACE: adaptive critic element

$$p(t) = \sum_{i=1}^{n} v_i(t)\,x_i(t)$$

$$\Delta v_i(t) = \beta\,\big[r(t) + \gamma\,p(t) - p(t-1)\big]\,e_i(t), \qquad \text{where eligibility } e_i(t) = \bar x_i(t)$$

Note differences with the SB model:
1) The reward has been pulled out of the weighted sum
2) Discount factor $\gamma$: the decay rate of predictions if not sustained by external reinforcement
Putting them Together
[Figure: taking action $y$ moves the system from state $s$, with lower reward prediction $p(s)$, to state $s'$, with higher reward prediction $p(s')$, making action $y$ in state $s$ more likely]

$$\hat r(t) = r(t) + \gamma\,p(t) - p(t-1)$$

internal reinforcement = effective reinforcement = temporal-difference (TD) error $\delta(t)$
Actor & Critic learning almost identical
[Figure: the actor and the adaptive critic side by side. Both are trained by the same effective reinforcement, built from the primary reward $r$ and the critic’s prediction $p$; the actor additionally injects noise to produce its action. The critic’s eligibility $e$ is a trace of presynaptic activity only; the actor’s $e$ is a trace of pre- and postsynaptic correlation]
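A compact Python sketch of one actor-critic (ASE/ACE) step built from the equations above; collapsing the eligibility traces to a single step and the noise scale are simplifying assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def actor_critic_step(w, v, x, x_next, r, alpha=0.5, beta=0.5, gamma=0.95):
    """One ASE/ACE-style update with feature vectors x (e.g., one-hot states).

    The effective reinforcement r + γp(t) - p(t-1) drives both elements:
    the critic with eligibility = presynaptic trace (here just x), the
    actor with eligibility = pre/post correlation (here just y * x).
    """
    y = 1 if float(w @ x) + rng.normal(scale=0.1) >= 0 else -1  # noisy action
    delta = r + gamma * float(v @ x_next) - float(v @ x)        # TD error
    v = v + beta * delta * x                                    # critic update
    w = w + alpha * delta * y * x                               # actor update
    return w, v, y
```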
“Credit Assignment Problem”
Spatial
Temporal
Getting useful training information to the right places at the right times
Marvin Minsky, 1961
“Associative Reward-Penalty Element” (AR-P)
Barto & Anandan 1985
$$\Delta w_i(t) = \begin{cases} \rho\,\big[r(t)\,y(t) - E\{y(t) \mid s(t)\}\big]\,x_i(t) & \text{if } r(t) = +1 \\ \lambda\,\rho\,\big[r(t)\,y(t) - E\{y(t) \mid s(t)\}\big]\,x_i(t) & \text{if } r(t) = -1 \end{cases}$$

$$y(t) = \begin{cases} +1 & \text{if } s(t) + \text{NOISE}(t) > 0 \\ -1 & \text{otherwise} \end{cases}
\qquad \text{where } s(t) = \sum_{i=1}^{n} w_i(t)\,x_i(t) \quad \text{(same as the ASE)}$$

$$\rho > 0, \qquad 0 \le \lambda \le 1$$
(the AR-P rule again, for reference)

If $\lambda = 0$: the “Associative Reward-Inaction Element” AR-I
Think of $r(t)\,y(t)$ as the desired response
Stochastic version of Widrow et al.’s “Selective Bootstrap Element” [Widrow, Gupta, & Maitra, “Punish/Reward: Learning with a Critic in Adaptive Threshold Systems,” 1973]
Associative generalization of LR-P, a “stochastic learning automaton” algorithm (with roots in Tsetlin’s work and in mathematical psychology, e.g., Bush & Mosteller, 1955)
Where we got the term “Critic”
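A Python sketch of one AR-P update; the closed form for E{y(t) | s(t)} assumes zero-mean, unit-variance Gaussian NOISE, and the parameter values are illustrative.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def arp_step(w, x, r, rho=0.1, lam=0.05):
    """One A_R-P update; r is +1 (reward) or -1 (penalty).

    For Gaussian noise, y = +1 with probability Phi(s), so
    E{y | s} = 2 Phi(s) - 1 = erf(s / sqrt(2)).
    """
    s = float(w @ x)
    y = 1.0 if s + rng.normal() > 0 else -1.0
    Ey = erf(s / sqrt(2.0))               # E{y(t) | s(t)}
    scale = rho if r > 0 else lam * rho   # penalty updates scaled by λ
    return w + scale * (r * y - Ey) * x, y
```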
AR-P Convergence Theorem
Input patterns linearly independent
Each input has nonzero probability of being presented on a trial
NOISE has a cumulative distribution that is strictly monotonically increasing (excludes the uniform distribution and the deterministic case)
$\rho$ has to decrease as usual….
For all stochastic reward contingencies, as $\lambda$ approaches 0, the probability of each correct action approaches 1.
BUT, it does not work when $\lambda = 0$.
Contingency Space: 2 actions (two-armed bandit)
Explore/Exploit Dilemma
Interesting follow up to AR-P theorem
Williams’ REINFORCE class of algorithms (1987) generalizes AR-I (i.e., $\lambda = 0$).
He showed that the weights change according to an unbiased estimate of the gradient of the reward function
BUT NOTE: this is the case for which our theorem isn’t true!
Recent “policy gradient” methods generalize REINFORCE algorithms
Learning by Statistical Cooperation Barto 1985
Feedforward networks of AR-P units
Most reward achieved when the network implements the identity map
each unit has an (unshown) constant input
Identity Network Results
$\lambda = .04$
XOR Network
Most reward achieved when the network implements XOR
XOR Network Behavior
Visible element
Hidden element
$\lambda = .08$
Notes on AR-P Nets
None of these networks work with $\lambda = 0$. They almost always converge to a local maximum.
Elements face non-stationary reward contingencies; they have to converge for all contingencies, even hard ones.
Rumelhart, Hinton, & Williams published the backprop paper shortly after this (in 1986).
AR-P networks and backprop networks do pretty much the same thing, BUT backprop is much faster.
Barto & Jordan, “Gradient Following without Backpropagation in Layered Networks,” First IEEE Conference on Neural Networks, 1987.
On the speeds of various layered network algorithms
Backprop: slow
Boltzmann Machine: glacial
Reinforcement Learning: don’t ask!

My recollection of a talk by Geoffrey Hinton, c. 1988
“Credit Assignment Problem”
Spatial
Temporal
Getting useful training information to the right places at the right times
Marvin Minsky, 1961
Teams of Learning Automata
Tsetlin, M. L. Automata Theory and Modeling of Biological Systems, Academic Press NY, 1973
e.g. the “Goore Game”
Real games were studied too…
Neurons and Bacteria
Koshland’s (1980) model of bacterial tumbling
Barto (1989), “From Chemotaxis to Cooperativity: Abstract Exercises in Neuronal Learning Strategies,” in The Computing Neuron, Durbin, Miall, & Mitchison (eds.), Addison-Wesley, Wokingham, England
TD Model of Pavlovian Conditioning
The adaptive critic (slightly modified) as a model of Pavlovian conditioning
Sutton & Barto 1990
$$p(t) = \sum_{i=1}^{n} v_i(t)\,x_i(t)$$

$$\Delta v_i(t) = \beta\,\big[\lambda(t) + \gamma\,\lfloor p(t)\rfloor - \lfloor p(t-1)\rfloor\big]\,e_i(t), \qquad \text{where eligibility } e_i(t) = \bar x_i(t)$$

$\lfloor\cdot\rfloor$ is a “floor” (predictions are clipped at zero); the US intensity $\lambda(t)$ appears instead of $r$.
TD Model
Predictions of what?
“imminence-weighted sum of future USs”
i.e., discounting
TD Model
“Complete Serial Compound”: the CS is represented by a “tapped delay line”
[Figure: a tapped delay line spreading the CS over time]
Summary of Part I
• Eligibility
• Neurons as closed-loop controllers
• Generalized reinforcement
• Prediction
• Real-time conditioning models
• Conditioned reinforcement
• Adaptive system/machine learning theory
• Stochastic search
• Associative Reinforcement Learning
• Teams of self-interested units
Key Computational Issues
Trial-and-error $\ne$ error-correction
Essence of RL (for me): search + memory
Variability is essential
Variability needs to be somewhat blind, but not dumb
Smart generator; smart tester
The “Boxes” idea: break up a large search into many small searches
Prediction is important
What to predict: total future reward
Changes in prediction are useful local evaluations
Credit assignment problems
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
Part II: The Modern View
Shift from animal learning to sequential decision problems: stochastic optimal control
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
RL as approximate DP
Give up the neural models…
Samuel’s Checkers Player 1959
[Figure: an evaluation function (value function) $V$ maps the current board position to a score, e.g., +20]
Arthur L. Samuel
“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”
Some Studies in Machine Learning Using the Game of Checkers, 1959
TD-Gammon
Tesauro, 1992–1995

Start with a random network
Play very many games against self
Learn a value function from this simulated experience
Value = estimated probability of winning
This produces (arguably) the best player in the world

STATES: configurations of the playing board (about $10^{20}$)
ACTIONS: moves
REWARDS: win: +1; lose: 0
Sequential Decision Problems
Decisions are made in stages.
The outcome of each decision is not fully predictable but can be observed before the next decision is made.
The objective is to maximize a numerical measure of total reward over the entire sequence of stages: called the return.
Decisions cannot be viewed in isolation: one needs to balance the desire for immediate reward against the possibility of high reward in the future.
The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
• Agent observes state at step $t$: $s_t \in S$
• produces action at step $t$: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \Re$
• and resulting next state: $s_{t+1}$

$$\cdots\, s_t \;\xrightarrow{\,a_t\,}\; r_{t+1},\, s_{t+1} \;\xrightarrow{\,a_{t+1}\,}\; r_{t+2},\, s_{t+2} \;\xrightarrow{\,a_{t+2}\,}\; r_{t+3},\, s_{t+3} \;\xrightarrow{\,a_{t+3}\,}\; \cdots$$
Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
• state and action sets
• one-step “dynamics” defined by transition probabilities:
$$P^a_{s s'} = \Pr\big\{s_{t+1} = s' \mid s_t = s,\, a_t = a\big\} \quad \text{for all } s, s' \in S,\; a \in A(s)$$
• reward expectations:
$$R^a_{s s'} = E\big\{r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s'\big\} \quad \text{for all } s, s' \in S,\; a \in A(s)$$
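A toy finite MDP written directly in this P/R form; the states, actions, and numbers are invented purely for illustration.

```python
# P[s][a][s'] = Pr{s_{t+1} = s' | s_t = s, a_t = a}
# R[s][a][s'] = expected reward for that transition
P = {
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0}, "right": {"s1": 1.0, "s0": 0.0}},
    "s1": {"left": {"s0": 0.0}, "right": {"s1": 0.5}},
}
```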
Elements of the MDP view
• Policies
• Return: e.g., discounted sum of future rewards
• Value functions
• Optimal value functions
• Optimal policies
• Greedy policies
• Models: probability models, sample models
• Backups
• Etc.
Backups

[Backup diagram: the value of state $s$ is updated from its possible successors: $V(s) \leftarrow\ ?$]
Stochastic Dynamic Programming

[Backup diagram: from state $s$, every action and every possible successor state $s'$ with reward $r$ is considered]

$$V(s) \leftarrow \max_a E\big[r + \gamma\,V(\text{successor of } s \text{ under } a)\big]$$

Needs a probability model to compute all the required expected values
e.g., Value Iteration

A SWEEP = update the value of each state once using the max backup
Lookup-table storage of $V$

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^*$$
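A lookup-table value-iteration sketch over the toy P and R dictionaries above (assuming both sketches run in the same session); each sweep applies the max backup once to every state.

```python
def value_iteration(P, R, gamma=0.9, sweeps=100):
    """Sweep repeatedly: V(s) <- max_a sum_{s'} P (R + gamma V(s'))."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: max(
                sum(p * (R[s][a][s2] + gamma * V[s2])
                    for s2, p in P[s][a].items())
                for a in P[s]
            )
            for s in P
        }
    return V

print(value_iteration(P, R))  # converges toward V*
```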
Dynamic Programming
Bellman 195?
“… it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination which will possibly give it a pejorative meaning. It’s impossible. … It was something not even a Congressman could object to.”
Bellman
Stochastic Dynamic Programming
COMPUTATIONALLY COMPLEX:
• Multiple exhaustive sweeps
• Complex “backup” operation
• Complete storage of evaluation function $V$
NEEDS ACCURATE PROBABILITY MODEL
Approximating Stochastic DP
AVOID EXHAUSTIVE SWEEPS OF STATE SET
• To which states should the backup operation be applied?
SIMPLIFY THE BACKUP OPERATION
• Can one avoid evaluating all possible next states in each backup operation?
REDUCE DEPENDENCE ON MODELS
• What if details of the process are unknown or hard to quantify?
COMPACTLY APPROXIMATE $V$
• Can one avoid explicitly storing all of $V$?
Avoiding Exhaustive Sweeps
Generate multiple sample paths: in reality or with a simulation (sample) model
FOCUS backups around the sample paths
Accumulate results in $V$
Simplifying Backups

[Backup diagram: $V(s) \leftarrow\ ?$, computed from only part of the full tree below $s$]
Simple Monte Carlo

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\mathrm{REWARD}(\text{path})$$

• no probability model needed
• real or simulated experience
• relatively efficient on very large problems
Temporal Difference Backup

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big[r + \gamma\,V(s')\big]$$

• no probability model needed
• real or simulated experience
• incremental
• but less informative than a DP backup
Rewrite this:

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big[r + \gamma\,V(s')\big]$$

to get:

$$V(s) \leftarrow V(s) + \alpha\,\big[r + \gamma\,V(s') - V(s)\big]$$

Our familiar TD error.
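In code the rewritten backup is one line; a tabular sketch with illustrative step-size and discount values:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular TD(0): move V(s) toward the one-step sample r + gamma V(s')."""
    delta = r + gamma * V[s_next] - V[s]  # the TD error
    V[s] += alpha * delta
    return V, delta
```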
Why TD?
[Figure: a “New” state leads to a state known to be “Bad”, which has previously led to Loss 90% of the time and Win 10% of the time]
Compactly Approximate $V$

Function approximation methods: e.g., artificial neural networks

[Figure: an ANN maps a description of state $s$ to an evaluation $V(s)$]
Q-Learning (Watkins 1989; Leigh Tesfatsion)

Action values: $Q^*(s,a)$ = expected return for taking action $a$ in state $s$ and following an optimal policy thereafter

Let $Q(s,a)$ = current estimate of $Q^*(s,a)$

For any state $s$, any action with a maximal optimal action value is an optimal action:
$$a^* = \arg\max_a Q^*(s,a) \qquad (\text{an optimal action in } s)$$
The Q-Learning Backup

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\big[r + \gamma\,\max_b Q(s',b)\big]$$

Does not need a probability model (for either learning or performance)
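A tabular sketch of this backup; Q is assumed to be a dict of per-state action-value dicts, and the step-size and discount values are illustrative.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning backup from a sampled transition (s, a, r, s')."""
    target = r + gamma * max(Q[s_next].values())   # max over next actions b
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q
```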
Another View: Temporal Consistency
$$V_t = r_{t+1} + \gamma\,r_{t+2} + \gamma^2 r_{t+3} + \cdots \qquad\quad V_{t-1} = r_t + \gamma\,r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

so:
$$V_{t-1} = r_t + \gamma\,V_t$$

or:
$$r_t + \gamma\,V_t - V_{t-1} = 0 \qquad \text{“TD error”}$$
Review
MDPs
Dynamic Programming
Backups
Bellman equations (temporal consistency)
Approximating DP:
• Avoid exhaustive sweeps
• Simplify backups
• Reduce dependence on models
• Compactly approximate $V$
A good case can be made for using RL to approximate solutions to large MDPs
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
A Common View

[Figure: the Agent sends actions to the Environment and receives state and reward in return]
A Less Misleading Agent View…

[Figure: the RL agent sits inside a larger system: external sensations and internal sensations come in, memory and state are maintained, the reward signal is generated internally, and actions go out]
Motivation
“Forces” that energize an organism to act and that direct its activity.
Extrinsic Motivation: being moved to do something because of some external reward ($$, a prize, etc.).
Intrinsic Motivation: being moved to do something because it is inherently enjoyable.
Intrinsic Motivation
An activity is intrinsically motivated if the agent does it for its own sake rather than as a step toward solving a specific problem
Curiosity, exploration, manipulation, play, learning itself . . .
Can an artificial learning system be intrinsically motivated? Specifically, can a Reinforcement Learning system be intrinsically motivated?

Working with Satinder Singh
The Usual View of RL
Reward looks extrinsic
The Less Misleading View
All reward is intrinsic.
So What is IMRL?
Key distinction:
• Extrinsic reward = problem specific
• Intrinsic reward = problem independent
Learning phases:
• Developmental Phase: gain general competence
• Mature Phase: learn to solve specific problems
Why important: open-ended learning via hierarchical exploration
Scaling Up: Abstraction

Ignore irrelevant details:
• Learn and plan at a higher level
• Reduce search space size
Hierarchical planning and control
Knowledge transfer: quickly react to new situations
c.f. macros, chunks, skills, behaviors, . . .
Temporal abstraction: ignore temporal details (as opposed to aggregating states)
The “Macro” Idea
A sequence of operations with a name; can be invoked like a primitive operation
• Can invoke other macros . . . hierarchy
• But: an open-loop policy
Closed-loop macros:
• A decision policy with a name; can be invoked like a primitive control action
• behavior (Brooks, 1986), skill (Thrun & Schwartz, 1995), mode (e.g., Grudic & Ungar, 2000), activity (Harel, 1987), temporally-extended action, option (Sutton, Precup, & Singh, 1997)
Options (Precup, Sutton, & Singh, 1997)

A generalization of actions to include temporally-extended courses of action.

An option is a triple $o = \langle I, \pi, \beta \rangle$:
• $I$: initiation set: the set of states in which $o$ may be started
• $\pi$: the policy followed during $o$
• $\beta$: termination condition: gives the probability of terminating in each state

Example: robot docking
• $I$: all states in which the charger is in sight
• $\pi$: a pre-defined controller
• $\beta$: terminate when docked or when the charger is not visible
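A minimal sketch of the option triple as a data structure; the field names and the toy docking option are illustrative, not from the options papers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta>."""
    initiation_set: Set[Any]             # I: states where o may start
    policy: Callable[[Any], Any]         # pi: state -> action while o runs
    termination: Callable[[Any], float]  # beta: state -> prob. of terminating

# Toy docking option in the spirit of the robot example above:
dock = Option(
    initiation_set={"charger_visible"},
    policy=lambda s: "servo_toward_charger",  # stands in for a controller
    termination=lambda s: 1.0 if s in ("docked", "charger_lost") else 0.0,
)
```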
Options cont.
Policies can select from a set of options & primitive actions
Generalizations of the usual concepts:
• Transition probabilities (“option models”)
• Value functions
• Learning and planning algorithms
Intra-option off-policy learning:
• Can simultaneously learn policies for many options from the same experience
Options define a Semi-Markov Decision Process

[Figure: state-vs-time trajectories at three levels. MDP: discrete time, homogeneous discount. SMDP: continuous time, discrete events, interval-dependent discount. Options over MDP: discrete time, overlaid discrete events, interval-dependent discount. A discrete-time SMDP overlaid on an MDP can be analyzed at either level]
Where do Options come from?
Dominant approach: hand-crafted from the start
How can an agent create useful options for itself?
• Several different approaches (McGovern, Digney, Hengst, ….). All involve defining subgoals of various kinds.
Canonical Illustration: Rooms Example
[Figure: a four-room gridworld with hallways; options O1 and O2 take the agent to a room’s two hallways; G marks candidate goal locations]

4 rooms, 4 hallways
8 multi-step options (to each room’s 2 hallways)
4 unreliable primitive actions (up, down, left, right); fail 33% of the time
Given a goal location, quickly plan the shortest route
All rewards zero; goal states are given a terminal value of 1; $\gamma = .9$
Task-Independent Subgoals
“Bottlenecks”, “Hubs”, “Access States”, …
Surprising events
Novel events
Incongruous events
Etc. …
A Developmental Approach
Subgoals: events that are “intrinsically interesting”; not in the service of any specific task
Create options to achieve them
Once an option is well learned, the triggering event becomes less interesting
Previously learned options are available as actions in learning new option policies
When facing a specific problem: extract a “working set” of actions (primitive and abstract) for planning and learning
For Example:
Built-in salient stimuli: changes in lights and sounds
Intrinsic reward generated by each salient event, proportional to the error in the prediction of that event according to the option model for that event (“surprise”); see the sketch below
Motivated in part by the novelty responses of dopamine neurons
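A sketch of such a surprise-based intrinsic reward; the function name and signature are hypothetical.

```python
def intrinsic_reward(event_occurred, predicted_prob, scale=1.0):
    """Intrinsic reward for a salient event, proportional to the option
    model's prediction error for that event ("surprise")."""
    if not event_occurred:
        return 0.0
    return scale * (1.0 - predicted_prob)  # large when the event was unexpected
```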
Creating Options
Upon the first occurrence of a salient event, create an option and initialize its:
• Initiation set
• Policy
• Termination condition
• Option model
All options and option models are updated all the time using intra-option learning
The Playroom Domain
Agent has eye, hand, visual marker
Actions:
• move eye to hand
• move eye to marker
• move eye N, S, E, or W
• move eye to random object
• move hand to eye
• move hand to marker
• move marker to eye
• move marker to hand
• if both eye and hand are on an object: turn on light, push ball, etc.
The Playroom Domain cont.
Switch controls room lights
Bell rings and moves one square if ball hits it
Pressing the blue/red block turns music on/off
Lights have to be on to see colors
Can push blocks
Monkey cries out if bell and music both sound in dark room
Skills
To make the monkey cry out:
• Move eye to switch
• Move hand to eye
• Turn lights on
• Move eye to blue block
• Move hand to eye
• Turn music on
• Move eye to switch
• Move hand to eye
• Turn lights off
• Move eye to bell
• Move marker to eye
• Move eye to ball
• Move hand to ball
• Kick ball to make bell ring

Using skills (options):
• Turn lights on
• Turn music on
• Turn lights off
• Ring bell
Reward for Salient Events
Speed of Learning Various Skills
Learning to Make the Monkey Cry Out
Connects with Previous RL Work
Schmidhuber; Thrun and Moller; Sutton; Kaplan and Oudeyer; Duff; others….

But these did not have the option framework and related algorithms available
Beware the “Fallacy of Misplaced Concreteness”
Alfred North Whitehead
We have a tendency to mistake our models for reality, especially when they are good models.
Thanks to all my past PhD Students
Rich Sutton Chuck Anderson Stephen Judd Robbie Jacobs Jonathan Bachrach Vijay Gullapalli Satinder Singh Bob Crites Steve Bradtke Mike Duff Amy McGovern Ted Perkins Mike Rosenstein Balaraman Ravindran
And my current students
Colin Barringer Anders Jonsson George D. Konidaris Ashvin Shah Özgür Şimşek Andrew Stout Chris Vigorito Pippin Wolfe
And the funding agencies
AFOSR, NSF, NIH, DARPA
Whew!
Thanks!