TRANSCRIPT
Autonomous Learning Laboratory – Department of Computer Science
Perspectives on Computational Reinforcement Learning
Andrew G. Barto
Autonomous Learning Laboratory, Department of Computer Science
University of Massachusetts Amherst
Searching in the Right Space
[Diagram: Computational Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence (machine learning), Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience]
Computational Reinforcement Learning
“Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” —Dayan 2001
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
The View from Machine Learning
Unsupervised Learning: recode data based on some given principle
Supervised Learning: “learning from examples”, “learning with a teacher”; related to Classical (or Pavlovian) Conditioning
Reinforcement Learning: “learning with a critic”; related to Instrumental (or Thorndikian) Conditioning
Classical Conditioning
Tone (CS: Conditioned Stimulus)
Food (US: Unconditioned Stimulus)
Salivation (UR: Unconditioned Response)
Anticipatory salivation (CR: Conditioned Response)
Pavlov, 1927
Edward L. Thorndike (1874-1949)
puzzle box
Learning by “Trial-and-Error”
Trial-and-Error = Error Correction
Artificial Neural Network:
learns from a set of examples via error-correction
“Least-Mean-Square” (LMS) Learning Rule
input pattern $x = (x_1, \ldots, x_n)$; weights $w_1, \ldots, w_n$; actual output $V$; desired output $z$

[Figure: an Adaline element compares its actual output $V$ with the desired output $z$ and uses the difference to adjust the weights]

$$\Delta w_i = \alpha\,(z - V)\,x_i$$

“delta rule”, Adaline, Widrow and Hoff, 1960
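To make the rule concrete, here is a minimal Python sketch of the LMS/delta rule as written above; the learning rate, the random training patterns, and the linear target mapping are illustrative assumptions, not from the talk.

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS ("delta rule") step: adjust weights by the error z - V."""
    V = w @ x                       # actual output: weighted sum of inputs
    return w + alpha * (z - V) * x  # Δw_i = α (z - V) x_i

# Illustrative usage: learn a 2-input linear target from random patterns.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(500):
    x = rng.uniform(-1.0, 1.0, size=2)
    z = 0.5 * x[0] - 0.3 * x[1]     # hypothetical desired output
    w = lms_update(w, x, z)
print(w)                            # approaches [0.5, -0.3]
```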
Trial-and-Error?
“The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The ‘error’ signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment.”
Widrow and Hoff, “Adaptive Switching Circuits,”
1960 IRE WESCON Convention Record
MENACE Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[Figure: array of tic-tac-toe board positions marked with x’s and o’s, one matchbox per position]
Essence of RL (for me at least!): Search + Memory
Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . .
Memory: remember what worked best for each situation and start from there next time
RL is about caching search results (so you don’t have to keep searching!)
Generate-and-Test
Generator should be smart:
• Generate lots of things that are likely to be good, based on prior knowledge and prior experience
• But also take chances …
Tester should be smart too:
• Evaluate based on real criteria, not convenient surrogates
• But be able to recognize partial success
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
Key Players
Harry Klopf, Rich Sutton, Me
Arbib, Kilmer, and Spinelli, “Neural Models and Memory,”
in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett (eds.), 1974
A. Harry Klopf
“Brain Function and Adaptive Systems -- A Heterostatic Theory,” Air Force Cambridge Research Laboratories Technical Report, 3 March 1972

“…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist.”
Klopf’s theory (very briefly!)
Inspiration: the nervous system is a society of self-interested agents.
• Nervous Systems = Social Systems
• Neuron = Man
• Man = Hedonist
• Neuron = Hedonist
• Depolarization = Pleasure
• Hyperpolarization = Pain

A neuronal model:
• A neuron “decides” when to fire by comparing a spatial and temporal summation of weighted inputs with a threshold.
• A neuron is in a condition of heterostasis from time $t$ to $t+\Delta$ if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval.
• Two ways to adapt weights to do this:
  • Push excitatory weights to upper limits; zero out inhibitory weights
  • Make the neuron control its input.
Heterostatic Adaptation
When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances.
The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.
The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.
Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).
Key Components of Klopf’s Theory
• Eligibility
• Closed-loop control by neurons
• Extremization (e.g., maximization) as goal, instead of zeroing something
• “Generalized Reinforcement”: reinforcement is not delivered by a specialized channel
The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence
A. Harry Klopf, Hemisphere Publishing Corporation, 1982
Eligibility Traces
Klopf, 1972: when input $x$ contributes to firing the output $y$, the synapse becomes eligible; eligibility is a decaying trace $\overline{xy}_t$ of the product of pre- and postsynaptic activity, and weight changes are driven by subsequent changes in activity:

$$\Delta w_t = \alpha\,\dot y_t\,\overline{xy}_t$$

Optimal ISI: the eligibility trace has the same curve as the reinforcement-effectiveness curve in conditioning: max at 400 ms; 0 after approx 4 s.

[Figure: the trace interpreted as a histogram of the lengths of the feedback pathways in which the neuron is embedded]
Later Simplified Eligibility Traces
[Figure: visits to state $s$ over time; the accumulating trace increments at each visit, while the replacing trace resets to 1]
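A small Python sketch contrasting the two simplified traces; the decay rate and the three-state setup are illustrative assumptions.

```python
import numpy as np

def trace_step(e, visited, decay=0.9, kind="accumulate"):
    """Decay all state traces, then mark the visited state.

    kind="accumulate": add 1 to the visited state's trace (it can exceed 1).
    kind="replace":    reset the visited state's trace to exactly 1.
    """
    e = decay * e
    if kind == "accumulate":
        e[visited] += 1.0
    else:
        e[visited] = 1.0
    return e

# Four consecutive visits to state 0 show the difference.
e_acc, e_rep = np.zeros(3), np.zeros(3)
for _ in range(4):
    e_acc = trace_step(e_acc, 0, kind="accumulate")
    e_rep = trace_step(e_rep, 0, kind="replace")
print(e_acc[0], e_rep[0])  # accumulating trace > 1; replacing trace == 1
```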
Rich Sutton
BA Psychology, Stanford, 1978
As an undergrad, discovered Klopf’s 1972 tech report
Two unpublished undergraduate reports:
• “Learning Theory Support for a Single Channel Theory of the Brain,” 1978
• “A Unified Theory of Expectation in Classical and Instrumental Conditioning,” 1978 (?)
Rich’s first paper:
• “Single Channel Theory: A Neuronal Theory of Learning,” Brain Theory Newsletter, 1978
Sutton’s Theory
$A_j$: level of activation of mode $j$ at time $t$
$V_{ij}$: sign and magnitude of the association from mode $i$ to mode $j$ at time $t$
$E_{ij}$: eligibility of $V_{ij}$ for undergoing changes at time $t$; proportional to the average of the product $A_i(t)A_j(t)$ over some small past time interval (or an average of the logical AND)
$P_j$: expected level of activation of mode $j$ at time $t$ (a prediction of the level of activation of mode $j$)
$C_{ij}$: a constant depending on the particular association being changed

$$\frac{d}{dt}V_{ij} = C_{ij}\,\big(A_j - P_j\big)\,E_{ij}$$
What exactly is Pj?
Based on recent activation of the mode: the higher the activation within the last few seconds, the higher the level expected for the present . . .

$P_j(t)$ is proportional to the average of the activation level over some small time interval (a few seconds or less) before $t$.

At the level of a single unit, with output trace $\bar y_t$ and eligibility trace $\overline{xy}_t$:

$$\Delta w_t = \alpha\,\big(y_t - \bar y_t\big)\,\overline{xy}_t$$
Sutton’s theory
Contingent Principle: based on the reinforcement a neuron receives after it fires and on the synapses involved in the firing, the neuron modifies its synapses so that they will cause it to fire when firing increases the neuron’s expected subsequent reinforcement.
• Basis of Instrumental, or Thorndikian, conditioning

Predictive Principle: if a synapse’s activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity will come to have an effect on the neuron similar to that of reinforcement.
• Basis of Classical, or Pavlovian, conditioning
Sutton’s Theory
Main addition to Klopf’s theory: a difference term, i.e., a temporal difference term
Showed the relationship to the Rescorla-Wagner model (1972) of Classical Conditioning:
• Blocking
• Overshadowing
Sutton’s model was a real-time model of both classical and instrumental conditioning
Emphasized conditioned reinforcement
Rescorla Wagner Model, 1972
$$\Delta V_A = \alpha\,\big(\lambda - V_\Sigma\big)$$

$\Delta V_A$: change in associative strength of CS A
$\alpha$: parameter related to CS intensity
$\lambda$: parameter related to US intensity
$V_\Sigma$: sum of associative strengths of all CSs present (“composite expectation”)

“Organisms only learn when events violate their expectations.”

A “trial-level” model
Conditioned Reinforcement
Stimuli associated with reinforcement take on reinforcing properties themselves
Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978)
“In principle this chaining can go back for any length …” (Sutton, 1978)
Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning
Where was I coming from?
Studied at the University of Michigan (PhD in 1975): at the time a hotbed of genetic algorithm activity due to John Holland’s influence
Holland talked a lot about the exploration/exploitation tradeoff
But I studied dynamic system theory, the relationship between state-space and input/output representations of systems, convolution and harmonic analysis, and finally cellular automata
Fascinated by how simple local rules can generate complex global behavior:
• Dynamic systems
• Cellular automata
• Self-organization
• Neural networks
• Evolution
• Learning
Sutton and Barto, 1981
“Toward a Modern Theory of Adaptive Networks: Expectation and Prediction,” Psych Review 88, 1981
Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of $x$ instead of $xy$
Emphasized the anticipatory nature of the CR
Related to “Adaptive System Theory”:
• Other neural models (Hebb, Widrow & Hoff’s LMS, Uttley’s “Informon”, Anderson’s associative memory networks)
• Pointed out the relationship between the Rescorla-Wagner model and the Adaline, or LMS, algorithm
• Studied algorithm stability
• Reviewed possible neural mechanisms: e.g., eligibility = intracellular Ca ion concentration
“SB Model” of Classical Conditioning
$$\Delta w_t = c\,\big(y_t - \bar y_t\big)\,\bar x_t, \qquad \bar y_t = y_{t-1}, \qquad \bar x_{t+1} = \alpha\,\bar x_t + x_t$$

where $x_t$ is the CS input, $\bar x_t$ its (eligibility) trace, $y_t$ the output, and $\bar y_t$ a one-step trace of the output.
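A hedged Python sketch of one SB-model step under the three equations above; the parameter values are arbitrary, and letting the US enter the output directly follows the point made below that the US both activates the unit and directs learning.

```python
import numpy as np

def sb_step(w, xbar, x, us, y_prev, alpha=0.8, c=0.1):
    """One SB-model step.

    w:      associative strengths for the CS components
    xbar:   stimulus (eligibility) trace of the CS vector x
    us:     US input, added directly to the output (generalized reinforcement)
    y_prev: previous output, serving as the output trace ybar_t = y_{t-1}
    """
    y = float(w @ x) + us            # output: CS contribution plus US
    w = w + c * (y - y_prev) * xbar  # Δw_t = c (y_t - ybar_t) xbar_t
    xbar = alpha * xbar + x          # xbar_{t+1} = α xbar_t + x_t
    return w, xbar, y
```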
Temporal Primacy Overrides Blocking in SB model
Kehoe, Schreurs, and Graham 1987
our simulation
Intratrial Time Courses (part 2 of blocking)
Adaline Learning Rule
input pattern $x_t$; target output $z_t$; weights $w_t$; output $y_t = w_t^{\mathsf T} x_t$

$$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$$

LMS rule, Widrow and Hoff, 1960
“Rescorla–Wagner Unit”
CS input vector $x_t$, with $w_t$ a vector of “associative strengths”; US input $z_t$ drives the UR; the output $y_t = w_t^{\mathsf T} x_t$ is the “composite expectation” and drives the CR

$$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$$
Important Notes
The “target output” of LMS corresponds to the US input of the Rescorla-Wagner model
In both cases, this input is specialized: it does not directly activate the unit but only directs learning
The SB model is different: the US input both activates the unit and directs learning
Hence, the SB model can do secondary reinforcement
The SB model stayed with Klopf’s idea of “generalized reinforcement”
One Neural Implementation of S-B Model
A Major Problem: US offset
e.g., if a CS has the same time course as the US, the weights change until the US is cancelled out.

[Figure: US and CS time courses; in the final result the US response is cancelled]

Why? Because the rule is trying to zero out $y_t - y_{t-1}$.
Associative Memory Networks
Kohonen et al. 1976, 1977; Anderson et al. 1977
Associative Search Network Barto, Sutton, & Brouwer 1981
Input vector $X(t) = (x_1(t), \ldots, x_n(t))$, randomly chosen from the set $\{X^1, X^2, \ldots, X^k\}$
Output vector in response to $X(t)$ is $Y(t) = (y_1(t), \ldots, y_m(t))$
For each $X^\alpha$ the payoff is a scalar $Z^\alpha(Y(t))$
$X(t)$ is the context vector at time $t$

$$y(t) = \begin{cases} 1 & \text{if } s(t) + \text{NOISE}(t) > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad \text{where } s(t) = \sum_{i=1}^{n} w_i(t)\,x_i(t)$$
Associative Search NetworkBarto, Sutton, Brouwer, 1981
Problem of context transitions: add a predictor

Without the predictor:
$$\Delta w_i(t) = c\,\big[z(t) - z(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1)$$

With the predictor $p$:
$$\Delta w_i(t) = c\,\big[z(t) - p(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1)$$
$$\Delta w^p_i(t) = c_p\,\big[z(t) - p(t-1)\big]\,x_i(t-1) \qquad \text{(a “one-step-ahead LMS predictor”)}$$
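A Python sketch of one ASN update with the one-step-ahead LMS predictor, following the equations above; the Gaussian noise, the gains, and the bookkeeping of past values are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def asn_step(w, wp, x_prev, x, y_prev, y_prev2, z, c=0.5, cp=0.5):
    """One associative-search-network update (single output unit).

    x_prev, y_prev, y_prev2: context and actions remembered from earlier steps
    z: payoff received now for the action taken in context x_prev
    """
    p_prev = float(wp @ x_prev)  # prediction p(t-1) made for this context
    # search weights: payoff vs. prediction, gated by the change in action
    w = w + c * (z - p_prev) * (y_prev - y_prev2) * x_prev
    # predictor weights: one-step-ahead LMS
    wp = wp + cp * (z - p_prev) * x_prev
    # choose a new (noisy, binary) action for the current context x
    y = 1.0 if float(w @ x) + rng.normal() > 0 else 0.0
    return w, wp, y
```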
Relation to Klopf/Sutton Theory
$$\Delta w_i(t) = c\,\big[z(t) - p(t-1)\big]\,\big[y(t-1) - y(t-2)\big]\,x_i(t-1), \qquad \text{eligibility} = \dot y\,x$$
Did not include generalized reinforcementsince z(t) is a specialized reward input
Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan
Associative Search Network
“Landmark Learning” Barto & Sutton 1981
An illustration of associative search
“Landmark Learning”
“Landmark Learning”
swap E and W landmarks
Note: Diffuse Reward Signal
[Figure: three units with shared inputs $x_1, x_2, x_3$, outputs $y_1, y_2, y_3$, and a single diffuse reward signal broadcast to all]

Units can learn different things despite receiving identical inputs . . .
Provided there is variability
The ASN just used noisy units to introduce variability
Variability drives the search
It needs an element of “blindness”, as in “blind variation”: i.e., the outcome is not completely known beforehand
BUT it does not have to be random
IMPORTANT POINT: blind variation does not have to be random, or dumb
Pole Balancing
Widrow & Smith, 1964, “Pattern Recognizing Control Systems”
Michie & Chambers, 1968, “Boxes: An Experiment in Adaptive Control”
Barto, Sutton, & Anderson 1984
MENACE Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[Figure: the same array of tic-tac-toe board positions shown earlier]
The “Boxes” Idea
“Although the construction of this Matchbox Educable Noughts and Crosses Engine (Michie 1961, 1963) was undertaken as a ‘fun project’, there was present a more serious intention to demonstrate the principle that it may be easier to learn to play many easy games than one difficult one. Consequently it may be advantageous to decompose a game into a number of mutually independent sub-games even if much relevant information is put out of reach in the process.”
Michie and Chambers, “Boxes: An Experiment in Adaptive Control”
Machine Intelligence 2, 1968
Boxes
Actor-Critic Architecture
ACE = adaptive critic element; ASE = associative search element
The Actor
ASE: associative search element

$$y(t) = \begin{cases} +1 & \text{if } s(t) + \text{NOISE}(t) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

$$\Delta w_i(t) = \alpha\,r(t)\,e_i(t), \qquad \text{where eligibility } e_i(t) = \overline{y\,x_i}(t)$$

Note: 1) move from changes in evaluation to just $r$; 2) move from $\dot y$ to just $y$ in the eligibility.
The Critic
ACE: adaptive critic element

$$p(t) = \sum_{i=1}^{n} v_i(t)\,x_i(t)$$

$$\Delta v_i(t) = \beta\,\big[r(t) + \gamma\,p(t) - p(t-1)\big]\,e_i(t), \qquad \text{where eligibility } e_i(t) = \bar x_i(t)$$

Note differences with the SB model:
1) The reward has been pulled out of the weighted sum
2) Discount factor $\gamma$: the decay rate of predictions if not sustained by external reinforcement
Putting them Together
[Figure: taking action $y$ moves the system from state $s$, with lower reward prediction $p(s)$, to state $s'$, with higher reward prediction $p(s')$, making action $y$ in state $s$ more likely]

$$\hat r(t) = r(t) + \gamma\,p(t) - p(t-1)$$

internal reinforcement = effective reinforcement = temporal-difference (TD) error $\delta(t)$
Actor & Critic learning almost identical
[Figure: the actor and the adaptive critic side by side. Both are trained by the same effective reinforcement, built from the primary reward $r$ and the critic’s prediction $p$; the actor additionally injects noise to produce its action. The critic’s eligibility $e$ is a trace of presynaptic activity only; the actor’s $e$ is a trace of pre- and postsynaptic correlation]
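A compact Python sketch of one actor-critic (ASE/ACE) step built from the equations above; collapsing the eligibility traces to a single step and the noise scale are simplifying assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def actor_critic_step(w, v, x, x_next, r, alpha=0.5, beta=0.5, gamma=0.95):
    """One ASE/ACE-style update with feature vectors x (e.g., one-hot states).

    The effective reinforcement r + γp(t) - p(t-1) drives both elements:
    the critic with eligibility = presynaptic trace (here just x), the
    actor with eligibility = pre/post correlation (here just y * x).
    """
    y = 1 if float(w @ x) + rng.normal(scale=0.1) >= 0 else -1  # noisy action
    delta = r + gamma * float(v @ x_next) - float(v @ x)        # TD error
    v = v + beta * delta * x                                    # critic update
    w = w + alpha * delta * y * x                               # actor update
    return w, v, y
```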
“Credit Assignment Problem”
Spatial
Temporal
Getting useful training information to the right places at the right times
Marvin Minsky, 1961
“Associative Reward-Penalty Element” (AR-P)
Barto & Anandan 1985
$$\Delta w_i(t) = \begin{cases} \rho\,\big[r(t)\,y(t) - E\{y(t) \mid s(t)\}\big]\,x_i(t) & \text{if } r(t) = +1 \\ \lambda\,\rho\,\big[r(t)\,y(t) - E\{y(t) \mid s(t)\}\big]\,x_i(t) & \text{if } r(t) = -1 \end{cases}$$

$$y(t) = \begin{cases} +1 & \text{if } s(t) + \text{NOISE}(t) > 0 \\ -1 & \text{otherwise} \end{cases}
\qquad \text{where } s(t) = \sum_{i=1}^{n} w_i(t)\,x_i(t) \quad \text{(same as the ASE)}$$

$$\rho > 0, \qquad 0 \le \lambda \le 1$$
(the AR-P rule again, for reference)

If $\lambda = 0$: the “Associative Reward-Inaction Element” AR-I
Think of $r(t)\,y(t)$ as the desired response
Stochastic version of Widrow et al.’s “Selective Bootstrap Element” [Widrow, Gupta, & Maitra, “Punish/Reward: Learning with a Critic in Adaptive Threshold Systems,” 1973]
Associative generalization of LR-P, a “stochastic learning automaton” algorithm (with roots in Tsetlin’s work and in mathematical psychology, e.g., Bush & Mosteller, 1955)
Where we got the term “Critic”
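A Python sketch of one AR-P update; the closed form for E{y(t) | s(t)} assumes zero-mean, unit-variance Gaussian NOISE, and the parameter values are illustrative.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def arp_step(w, x, r, rho=0.1, lam=0.05):
    """One A_R-P update; r is +1 (reward) or -1 (penalty).

    For Gaussian noise, y = +1 with probability Phi(s), so
    E{y | s} = 2 Phi(s) - 1 = erf(s / sqrt(2)).
    """
    s = float(w @ x)
    y = 1.0 if s + rng.normal() > 0 else -1.0
    Ey = erf(s / sqrt(2.0))               # E{y(t) | s(t)}
    scale = rho if r > 0 else lam * rho   # penalty updates scaled by λ
    return w + scale * (r * y - Ey) * x, y
```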
AR-P Convergence Theorem
Input patterns linearly independent
Each input has nonzero probability of being presented on a trial
NOISE has a cumulative distribution that is strictly monotonically increasing (excludes the uniform distribution and the deterministic case)
$\rho$ has to decrease as usual….
For all stochastic reward contingencies, as $\lambda$ approaches 0, the probability of each correct action approaches 1.
BUT, it does not work when $\lambda = 0$.
Contingency Space: 2 actions (two-armed bandit)
Explore/Exploit Dilemma
Interesting follow up to AR-P theorem
Williams’ REINFORCE class of algorithms (1987) generalizes AR-I (i.e., $\lambda = 0$).
He showed that the weights change according to an unbiased estimate of the gradient of the reward function
BUT NOTE: this is the case for which our theorem isn’t true!
Recent “policy gradient” methods generalize REINFORCE algorithms
Learning by Statistical Cooperation Barto 1985
Feedforward networks of AR-P units
Most reward achieved when the network implements the identity map
each unit has an (unshown) constant input
Identity Network Results
$\lambda = .04$
XOR Network
Most reward achieved when the network implements XOR
XOR Network Behavior
Visible element
Hidden element
$\lambda = .08$
Notes on AR-P Nets
None of these networks work with $\lambda = 0$. They almost always converge to a local maximum.
Elements face non-stationary reward contingencies; they have to converge for all contingencies, even hard ones.
Rumelhart, Hinton, & Williams published the backprop paper shortly after this (in 1986).
AR-P networks and backprop networks do pretty much the same thing, BUT backprop is much faster.
Barto & Jordan, “Gradient Following without Backpropagation in Layered Networks,” First IEEE Conference on Neural Networks, 1987.
On the speeds of various layered network algorithms
Backprop: slow
Boltzmann Machine: glacial
Reinforcement Learning: don’t ask!

My recollection of a talk by Geoffrey Hinton, c. 1988
“Credit Assignment Problem”
Spatial
Temporal
Getting useful training information to the right places at the right times
Marvin Minsky, 1961
Teams of Learning Automata
Tsetlin, M. L. Automata Theory and Modeling of Biological Systems, Academic Press NY, 1973
e.g. the “Goore Game”
Real games were studied too…
Neurons and Bacteria
Koshland’s (1980) model of bacterial tumbling
Barto (1989), “From Chemotaxis to Cooperativity: Abstract Exercises in Neuronal Learning Strategies,” in The Computing Neuron, Durbin, Miall, & Mitchison (eds.), Addison-Wesley, Wokingham, England
TD Model of Pavlovian Conditioning
The adaptive critic (slightly modified) as a model of Pavlovian conditioning
Sutton & Barto 1990
$$p(t) = \sum_{i=1}^{n} v_i(t)\,x_i(t)$$

$$\Delta v_i(t) = \beta\,\big[\lambda(t) + \gamma\,\lfloor p(t)\rfloor - \lfloor p(t-1)\rfloor\big]\,e_i(t), \qquad \text{where eligibility } e_i(t) = \bar x_i(t)$$

$\lfloor\cdot\rfloor$ is a “floor” (predictions are clipped at zero); the US intensity $\lambda(t)$ appears instead of $r$.
TD Model
Predictions of what?
“imminence-weighted sum of future USs”
i.e., discounting
TD Model
“Complete Serial Compound”: the CS is represented by a “tapped delay line”
[Figure: a tapped delay line spreading the CS over time]
Summary of Part I
• Eligibility
• Neurons as closed-loop controllers
• Generalized reinforcement
• Prediction
• Real-time conditioning models
• Conditioned reinforcement
• Adaptive system/machine learning theory
• Stochastic search
• Associative Reinforcement Learning
• Teams of self-interested units
Key Computational Issues
Trial-and-error $\ne$ error-correction
Essence of RL (for me): search + memory
Variability is essential
Variability needs to be somewhat blind, but not dumb
Smart generator; smart tester
The “Boxes” idea: break up a large search into many small searches
Prediction is important
What to predict: total future reward
Changes in prediction are useful local evaluations
Credit assignment problems
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
Part II: The Modern View
Shift from animal learning to sequential decision problems: stochastic optimal control
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
RL as approximate DP
Give up the neural models…
Samuel’s Checkers Player 1959
[Figure: an evaluation function (value function) $V$ maps the current board position to a score, e.g., +20]
Arthur L. Samuel
“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”
Some Studies in Machine Learning Using the Game of Checkers, 1959
TD-Gammon
Tesauro, 1992–1995

Start with a random network
Play very many games against self
Learn a value function from this simulated experience
Value = estimated probability of winning
This produces (arguably) the best player in the world

STATES: configurations of the playing board (about $10^{20}$)
ACTIONS: moves
REWARDS: win: +1; lose: 0
Sequential Decision Problems
Decisions are made in stages.
The outcome of each decision is not fully predictable but can be observed before the next decision is made.
The objective is to maximize a numerical measure of total reward over the entire sequence of stages: called the return.
Decisions cannot be viewed in isolation: one needs to balance the desire for immediate reward against the possibility of high reward in the future.
The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
• Agent observes state at step $t$: $s_t \in S$
• produces action at step $t$: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \Re$
• and resulting next state: $s_{t+1}$

$$\cdots\, s_t \;\xrightarrow{\,a_t\,}\; r_{t+1},\, s_{t+1} \;\xrightarrow{\,a_{t+1}\,}\; r_{t+2},\, s_{t+2} \;\xrightarrow{\,a_{t+2}\,}\; r_{t+3},\, s_{t+3} \;\xrightarrow{\,a_{t+3}\,}\; \cdots$$
Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
• state and action sets
• one-step “dynamics” defined by transition probabilities:
$$P^a_{s s'} = \Pr\big\{s_{t+1} = s' \mid s_t = s,\, a_t = a\big\} \quad \text{for all } s, s' \in S,\; a \in A(s)$$
• reward expectations:
$$R^a_{s s'} = E\big\{r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s'\big\} \quad \text{for all } s, s' \in S,\; a \in A(s)$$
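A toy finite MDP written directly in this P/R form; the states, actions, and numbers are invented purely for illustration.

```python
# P[s][a][s'] = Pr{s_{t+1} = s' | s_t = s, a_t = a}
# R[s][a][s'] = expected reward for that transition
P = {
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0}, "right": {"s1": 1.0, "s0": 0.0}},
    "s1": {"left": {"s0": 0.0}, "right": {"s1": 0.5}},
}
```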
Elements of the MDP view
• Policies
• Return: e.g., discounted sum of future rewards
• Value functions
• Optimal value functions
• Optimal policies
• Greedy policies
• Models: probability models, sample models
• Backups
• Etc.
Backups

[Backup diagram: the value of state $s$ is updated from its possible successors: $V(s) \leftarrow\ ?$]
Stochastic Dynamic Programming

[Backup diagram: from state $s$, every action and every possible successor state $s'$ with reward $r$ is considered]

$$V(s) \leftarrow \max_a E\big[r + \gamma\,V(\text{successor of } s \text{ under } a)\big]$$

Needs a probability model to compute all the required expected values
e.g., Value Iteration

A SWEEP = update the value of each state once using the max backup
Lookup-table storage of $V$

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^*$$
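A lookup-table value-iteration sketch over the toy P and R dictionaries above (assuming both sketches run in the same session); each sweep applies the max backup once to every state.

```python
def value_iteration(P, R, gamma=0.9, sweeps=100):
    """Sweep repeatedly: V(s) <- max_a sum_{s'} P (R + gamma V(s'))."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: max(
                sum(p * (R[s][a][s2] + gamma * V[s2])
                    for s2, p in P[s][a].items())
                for a in P[s]
            )
            for s in P
        }
    return V

print(value_iteration(P, R))  # converges toward V*
```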
Dynamic Programming
Bellman 195?
“… it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination which will possibly give it a pejorative meaning. It’s impossible. … It was something not even a Congressman could object to.”
Bellman
Stochastic Dynamic Programming
COMPUTATIONALLY COMPLEX:
• Multiple exhaustive sweeps
• Complex “backup” operation
• Complete storage of evaluation function $V$
NEEDS ACCURATE PROBABILITY MODEL
Approximating Stochastic DP
AVOID EXHAUSTIVE SWEEPS OF STATE SET
• To which states should the backup operation be applied?
SIMPLIFY THE BACKUP OPERATION
• Can one avoid evaluating all possible next states in each backup operation?
REDUCE DEPENDENCE ON MODELS
• What if details of the process are unknown or hard to quantify?
COMPACTLY APPROXIMATE $V$
• Can one avoid explicitly storing all of $V$?
Avoiding Exhaustive Sweeps
Generate multiple sample paths: in reality or with a simulation (sample) model
FOCUS backups around the sample paths
Accumulate results in $V$
Simplifying Backups

[Backup diagram: $V(s) \leftarrow\ ?$, computed from only part of the full tree below $s$]
Simple Monte Carlo

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\mathrm{REWARD}(\text{path})$$

• no probability model needed
• real or simulated experience
• relatively efficient on very large problems
Temporal Difference Backup

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big[r + \gamma\,V(s')\big]$$

• no probability model needed
• real or simulated experience
• incremental
• but less informative than a DP backup
Rewrite this:

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big[r + \gamma\,V(s')\big]$$

to get:

$$V(s) \leftarrow V(s) + \alpha\,\big[r + \gamma\,V(s') - V(s)\big]$$

Our familiar TD error.
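In code the rewritten backup is one line; a tabular sketch with illustrative step-size and discount values:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular TD(0): move V(s) toward the one-step sample r + gamma V(s')."""
    delta = r + gamma * V[s_next] - V[s]  # the TD error
    V[s] += alpha * delta
    return V, delta
```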
Why TD?
[Figure: a “New” state leads to a state known to be “Bad”, which has previously led to Loss 90% of the time and Win 10% of the time]
Compactly Approximate $V$

Function approximation methods: e.g., artificial neural networks

[Figure: an ANN maps a description of state $s$ to an evaluation $V(s)$]
Q-Learning (Watkins 1989; Leigh Tesfatsion)

Action values: $Q^*(s,a)$ = expected return for taking action $a$ in state $s$ and following an optimal policy thereafter

Let $Q(s,a)$ = current estimate of $Q^*(s,a)$

For any state $s$, any action with a maximal optimal action value is an optimal action:
$$a^* = \arg\max_a Q^*(s,a) \qquad (\text{an optimal action in } s)$$
The Q-Learning Backup

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\big[r + \gamma\,\max_b Q(s',b)\big]$$

Does not need a probability model (for either learning or performance)
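A tabular sketch of this backup; Q is assumed to be a dict of per-state action-value dicts, and the step-size and discount values are illustrative.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning backup from a sampled transition (s, a, r, s')."""
    target = r + gamma * max(Q[s_next].values())   # max over next actions b
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q
```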
Another View: Temporal Consistency
$$V_t = r_{t+1} + \gamma\,r_{t+2} + \gamma^2 r_{t+3} + \cdots \qquad\quad V_{t-1} = r_t + \gamma\,r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

so:
$$V_{t-1} = r_t + \gamma\,V_t$$

or:
$$r_t + \gamma\,V_t - V_{t-1} = 0 \qquad \text{“TD error”}$$
Review
MDPs
Dynamic Programming
Backups
Bellman equations (temporal consistency)
Approximating DP:
• Avoid exhaustive sweeps
• Simplify backups
• Reduce dependence on models
• Compactly approximate $V$
A good case can be made for using RL to approximate solutions to large MDPs
The Plan
High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL
A Common View

[Figure: the Agent sends actions to the Environment and receives state and reward in return]
A Less Misleading Agent View…

[Figure: the RL agent sits inside a larger system: external sensations and internal sensations come in, memory and state are maintained, the reward signal is generated internally, and actions go out]
Motivation
“Forces” that energize an organism to act and that direct its activity.
Extrinsic Motivation: being moved to do something because of some external reward ($$, a prize, etc.).
Intrinsic Motivation: being moved to do something because it is inherently enjoyable.
Intrinsic Motivation
An activity is intrinsically motivated if the agent does it for its own sake rather than as a step toward solving a specific problem
Curiosity, exploration, manipulation, play, learning itself . . .
Can an artificial learning system be intrinsically motivated? Specifically, can a Reinforcement Learning system be intrinsically motivated?

Working with Satinder Singh
The Usual View of RL
Reward looks extrinsic
The Less Misleading View
All reward is intrinsic.
So What is IMRL?
Key distinction:
• Extrinsic reward = problem specific
• Intrinsic reward = problem independent
Learning phases:
• Developmental Phase: gain general competence
• Mature Phase: learn to solve specific problems
Why important: open-ended learning via hierarchical exploration
Scaling Up: Abstraction

Ignore irrelevant details:
• Learn and plan at a higher level
• Reduce search space size
Hierarchical planning and control
Knowledge transfer: quickly react to new situations
c.f. macros, chunks, skills, behaviors, . . .
Temporal abstraction: ignore temporal details (as opposed to aggregating states)
The “Macro” Idea
A sequence of operations with a name; can be invoked like a primitive operation
• Can invoke other macros . . . hierarchy
• But: an open-loop policy
Closed-loop macros:
• A decision policy with a name; can be invoked like a primitive control action
• behavior (Brooks, 1986), skill (Thrun & Schwartz, 1995), mode (e.g., Grudic & Ungar, 2000), activity (Harel, 1987), temporally-extended action, option (Sutton, Precup, & Singh, 1997)
Options (Precup, Sutton, & Singh, 1997)

A generalization of actions to include temporally-extended courses of action.

An option is a triple $o = \langle I, \pi, \beta \rangle$:
• $I$: initiation set: the set of states in which $o$ may be started
• $\pi$: the policy followed during $o$
• $\beta$: termination condition: gives the probability of terminating in each state

Example: robot docking
• $I$: all states in which the charger is in sight
• $\pi$: a pre-defined controller
• $\beta$: terminate when docked or when the charger is not visible
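A minimal sketch of the option triple as a data structure; the field names and the toy docking option are illustrative, not from the options papers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta>."""
    initiation_set: Set[Any]             # I: states where o may start
    policy: Callable[[Any], Any]         # pi: state -> action while o runs
    termination: Callable[[Any], float]  # beta: state -> prob. of terminating

# Toy docking option in the spirit of the robot example above:
dock = Option(
    initiation_set={"charger_visible"},
    policy=lambda s: "servo_toward_charger",  # stands in for a controller
    termination=lambda s: 1.0 if s in ("docked", "charger_lost") else 0.0,
)
```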
Options cont.
Policies can select from a set of options & primitive actions
Generalizations of the usual concepts:
• Transition probabilities (“option models”)
• Value functions
• Learning and planning algorithms
Intra-option off-policy learning:
• Can simultaneously learn policies for many options from the same experience
Options define a Semi-Markov Decision Process

[Figure: state-vs-time trajectories at three levels. MDP: discrete time, homogeneous discount. SMDP: continuous time, discrete events, interval-dependent discount. Options over MDP: discrete time, overlaid discrete events, interval-dependent discount. A discrete-time SMDP overlaid on an MDP can be analyzed at either level]
Where do Options come from?
Dominant approach: hand-crafted from the start
How can an agent create useful options for itself?
• Several different approaches (McGovern, Digney, Hengst, ….). All involve defining subgoals of various kinds.
Canonical Illustration: Rooms Example
[Figure: a four-room gridworld with hallways; options O1 and O2 take the agent to a room’s two hallways; G marks candidate goal locations]

4 rooms, 4 hallways
8 multi-step options (to each room’s 2 hallways)
4 unreliable primitive actions (up, down, left, right); fail 33% of the time
Given a goal location, quickly plan the shortest route
All rewards zero; goal states are given a terminal value of 1; $\gamma = .9$
Task-Independent Subgoals
“Bottlenecks”, “Hubs”, “Access States”, …
Surprising events
Novel events
Incongruous events
Etc. …
A Developmental Approach
Subgoals: events that are “intrinsically interesting”; not in the service of any specific task
Create options to achieve them
Once an option is well learned, the triggering event becomes less interesting
Previously learned options are available as actions in learning new option policies
When facing a specific problem: extract a “working set” of actions (primitive and abstract) for planning and learning
For Example:
Built-in salient stimuli: changes in lights and sounds
Intrinsic reward generated by each salient event, proportional to the error in the prediction of that event according to the option model for that event (“surprise”); see the sketch below
Motivated in part by the novelty responses of dopamine neurons
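A sketch of such a surprise-based intrinsic reward; the function name and signature are hypothetical.

```python
def intrinsic_reward(event_occurred, predicted_prob, scale=1.0):
    """Intrinsic reward for a salient event, proportional to the option
    model's prediction error for that event ("surprise")."""
    if not event_occurred:
        return 0.0
    return scale * (1.0 - predicted_prob)  # large when the event was unexpected
```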
Creating Options
Upon the first occurrence of a salient event, create an option and initialize its:
• Initiation set
• Policy
• Termination condition
• Option model
All options and option models are updated all the time using intra-option learning
The Playroom Domain
Agent has eye, hand, visual marker
Actions:
• move eye to hand
• move eye to marker
• move eye N, S, E, or W
• move eye to random object
• move hand to eye
• move hand to marker
• move marker to eye
• move marker to hand
• if both eye and hand are on an object: turn on light, push ball, etc.
The Playroom Domain cont.
Switch controls room lights
Bell rings and moves one square if ball hits it
Pressing the blue/red block turns music on/off
Lights have to be on to see colors
Can push blocks
Monkey cries out if bell and music both sound in dark room
Skills
To make the monkey cry out:
• Move eye to switch
• Move hand to eye
• Turn lights on
• Move eye to blue block
• Move hand to eye
• Turn music on
• Move eye to switch
• Move hand to eye
• Turn lights off
• Move eye to bell
• Move marker to eye
• Move eye to ball
• Move hand to ball
• Kick ball to make bell ring

Using skills (options):
• Turn lights on
• Turn music on
• Turn lights off
• Ring bell
Reward for Salient Events
Speed of Learning Various Skills
Learning to Make the Monkey Cry Out
Connects with Previous RL Work
Schmidhuber; Thrun and Moller; Sutton; Kaplan and Oudeyer; Duff; others….

But these did not have the option framework and related algorithms available
Beware the “Fallacy of Misplaced Concreteness”
Alfred North Whitehead
We have a tendency to mistake our models for reality, especially when they are good models.
Thanks to all my past PhD Students
Rich Sutton Chuck Anderson Stephen Judd Robbie Jacobs Jonathan Bachrach Vijay Gullapalli Satinder Singh Bob Crites Steve Bradtke Mike Duff Amy McGovern Ted Perkins Mike Rosenstein Balaraman Ravindran
And my current students
Colin Barringer Anders Jonsson George D. Konidaris Ashvin Shah Özgür Şimşek Andrew Stout Chris Vigorito Pippin Wolfe
And the funding agencies
AFOSR, NSF, NIH, DARPA
Whew!
Thanks!