TRANSCRIPT
Unconditioned stimulus (food) causes unconditioned response (saliva). Conditioned stimulus (bell) causes conditioned response (saliva).
Rescorla-Wagner Rule
• v = wu, with stimulus u ∈ {0, 1}, weight w, and predicted response v. Adapt w to minimize the quadratic error between reward and prediction.
The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
– Blocking: a learned association s1 → r prevents learning of the association s2 → r
– Inhibition: s2 reduces the prediction when combined with any predicting stimulus
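A minimal sketch of the multi-stimulus Rescorla-Wagner rule showing blocking; the learning rate, trial counts, and stimulus schedule are assumed for illustration:

```python
import numpy as np

eps = 0.1                     # learning rate (assumed)
w = np.zeros(2)               # weights for stimuli s1, s2

# Phase 1: s1 alone paired with reward r=1, so w[0] -> 1.
# Phase 2: s1 and s2 presented together with the same reward.
trials = [((1, 0), 1.0)] * 100 + [((1, 1), 1.0)] * 100

for u, r in trials:
    u = np.array(u, dtype=float)
    v = w @ u                 # prediction v = w . u
    w += eps * (r - v) * u    # delta rule: only active stimuli are updated
print(w)                      # ~[1.0, 0.0]: prior learning of s1 blocks s2
```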
Temporal difference learning
• Interpret v(t) as ‘total future expected reward’
• v(t) is predicted from past stimuli: v(t) = Σ_τ w(τ) u(t − τ) (Eq. 9.6)
After learning, δ(t) = 0 implies:
– v(0) is the sum of expected future rewards
– v(t) constant: expected reward r(t) = 0
– v(t) decreasing: positive expected reward
Explanation of fig. 9.2:
– Since u(t) = δ_{t,0}, Eq. 9.6 becomes v(t) = w(t)
– Eq. 9.7 becomes Δw(t) = ε δ(t), thus Δv(t) = ε (r(t) + v(t+1) − v(t))
– Reward r(t) = δ_{t,T}
– Step 1: the only change is v(T) ← v(T) + ε
– Step 2: v(T−1) and v(T) change
– Etc.
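A small simulation sketch of this backward propagation of value; the reward time T, learning rate, and trial count are assumed:

```python
import numpy as np

T, eps, n_trials = 20, 0.5, 500     # reward time, learning rate, trials (assumed)
v = np.zeros(T + 2)                 # v(t); v(T+1) = 0 stays fixed (no reward after T)
r = np.zeros(T + 1); r[T] = 1.0     # reward delivered only at t = T

for _ in range(n_trials):
    for t in range(T + 1):
        delta = r[t] + v[t + 1] - v[t]   # TD error delta(t)
        v[t] += eps * delta              # since u(t) = delta_{t,0}, v(t) = w(t)

print(np.round(v[:T + 1], 2))  # after learning v(t) ~ 1 for 0 <= t <= T, so delta ~ 0
```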
Dopamine
• A monkey releases a button and presses another after a stimulus to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials. This is similar to δ in the TD rule, fig. 9.2.
Dopamine
• Dopamine neurons encode the reward prediction error (δ). B: withholding the reward reduces neural firing, in agreement with the δ interpretation.
Static action choice
• Rewards result from actions
• Bees visit flowers whose color (blue, yellow) predicts the reward (sugar).
• m_b, m_y are action values that encode the expected reward; β implements exploration (softmax action choice, sketched below).
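A minimal sketch of such a softmax (sigmoidal) choice rule; the action values and β values below are only illustrative:

```python
import numpy as np

def p_blue(m_b, m_y, beta):
    """Softmax probability of choosing blue; beta trades exploration vs exploitation."""
    return 1.0 / (1.0 + np.exp(-beta * (m_b - m_y)))

print(p_blue(1.0, 2.0, beta=0.5))  # ~0.38: low beta keeps exploring
print(p_blue(1.0, 2.0, beta=5.0))  # ~0.007: high beta almost always exploits
```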
The indirect actor model
Learn the average nectar volumes for each flower and act accordingly.
Implemented by on-line learning: when visiting a blue flower, update m_b ← m_b + ε(r_b − m_b), and leave the yellow estimate m_y unchanged (see the sketch after the figure note below).
Fig: r_b = 1, r_y = 2 for t = 1:100 and reversed for t = 101:200. A: m_y, m_b; B–D: cumulated reward for low β (B) and high β (C, D).
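A minimal sketch of the indirect actor on this two-flower schedule; learning rate, β, and random seed are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.1, 1.0                     # learning rate and exploration (assumed)
m = {"b": 0.0, "y": 0.0}                 # action values = estimated nectar volumes
total_reward = 0.0

for t in range(200):
    r = {"b": 1.0, "y": 2.0} if t < 100 else {"b": 2.0, "y": 1.0}
    p_b = 1.0 / (1.0 + np.exp(-beta * (m["b"] - m["y"])))
    a = "b" if rng.random() < p_b else "y"   # softmax choice of flower
    m[a] += eps * (r[a] - m[a])              # update only the visited flower
    total_reward += r[a]

print(m, total_reward)
```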
Bumble bees
• Blue: r = 2 for all flowers; yellow: r = 6 for 1/3 of the flowers. When the contingencies are switched at t = 15, the bees adapt fast.
Bumble bees
• Model with m = ⟨f(r)⟩, where f is concave (a subjective utility), so that m_b = f(2) is larger than m_y = (1/3) f(6).
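For example, with an assumed concave utility f(r) = √r, m_b = √2 ≈ 1.41 while m_y = (1/3)√6 ≈ 0.82, so the model bee prefers the certain blue reward even though both colours have the same mean volume of 2.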
Direct actor (policy gradient)
Direct actor
Stochastic gradient ascent (a sketch follows the figure note below):
Fig: two sessions as in fig. 9.4 with good and bad behaviour. Problem: the size of m prevents exploration.
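A sketch of the direct actor as stochastic gradient ascent on the expected reward, using the session schedule of fig. 9.4; the reward baseline update rate, β, and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta, rbar = 0.1, 1.0, 0.0        # learning rate, exploration, reward baseline (assumed)
m = np.zeros(2)                        # action parameters for (blue, yellow)

for t in range(200):
    r = (1.0, 2.0) if t < 100 else (2.0, 1.0)       # sessions as in fig. 9.4
    p = np.exp(beta * m); p /= p.sum()              # softmax policy
    a = rng.choice(2, p=p)
    rbar += 0.05 * (r[a] - rbar)                    # running-average reward baseline
    grad = -p; grad[a] += 1.0                       # d log P[a] / dm, up to the factor beta
    m += eps * (r[a] - rbar) * beta * grad          # stochastic gradient ascent step
print(m)
```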
Sequential action choice
• Reward obtained after sequence of actions
• Credit assignment problem.
Sequential action choice
• Policy iteration:
– Critic: use TD to evaluate v(state) under the current policy
– Actor: improve the policy m(state)
Policy evaluation
• Policy is random left/right at each turn.
• Implemented as TD: v(u) ← v(u) + ε δ, with δ = r + v(u') − v(u) (a sketch follows).
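A sketch of TD(0) evaluation under this random policy, assuming the textbook maze layout with rewards 5 and 0 at the ends of arm B and 0 and 2 at the ends of arm C; learning rate and trial count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, n_trials = 0.05, 5000                # learning rate and trial count (assumed)
# Assumed maze: from A go to B or C; leaves of B pay (5, 0), leaves of C pay (0, 2).
v = {"A": 0.0, "B": 0.0, "C": 0.0}

for _ in range(n_trials):
    u = "A"
    while u is not None:
        a = rng.integers(2)                            # random left/right policy
        if u == "A":
            u_next, r = ("B", "C")[a], 0.0
        elif u == "B":
            u_next, r = None, (5.0, 0.0)[a]
        else:  # u == "C"
            u_next, r = None, (0.0, 2.0)[a]
        v_next = v[u_next] if u_next is not None else 0.0
        v[u] += eps * (r + v_next - v[u])              # TD(0): delta = r + v(u') - v(u)
        u = u_next

print({k: round(val, 2) for k, val in v.items()})  # fluctuates around A~1.75, B~2.5, C~1.0
```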
Policy improvement
• Can be understood as a policy gradient rule in which r_a − r̄ is replaced by the TD error δ = r + v(u') − v(u), and m becomes state dependent, m(u; a).
Example: current state is A
Policy improvement
• Policy improvement changes the policy, so the policy must be re-evaluated for convergence to be guaranteed.
• Interleaving policy improvement and policy evaluation is called actor-critic (a sketch follows).
• Fig: actor-critic learning of the maze. NB: learning at C is slow.
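A sketch of the interleaved actor-critic on the same assumed maze; the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
eps, beta = 0.1, 1.0                          # learning rate and exploration (assumed)
# Same assumed maze: A -> {B, C}; leaves of B pay (5, 0), leaves of C pay (0, 2).
v = {u: 0.0 for u in "ABC"}                   # critic: state values
m = {u: np.zeros(2) for u in "ABC"}           # actor: state-dependent action values

def step(u, a):
    if u == "A":
        return ("B", "C")[a], 0.0
    payoff = (5.0, 0.0) if u == "B" else (0.0, 2.0)
    return None, payoff[a]

for _ in range(2000):
    u = "A"
    while u is not None:
        p = np.exp(beta * m[u]); p /= p.sum()               # softmax policy at state u
        a = rng.choice(2, p=p)
        u_next, r = step(u, a)
        delta = r + (v[u_next] if u_next else 0.0) - v[u]   # TD error from the critic
        v[u] += eps * delta                                  # critic: policy evaluation
        g = -p; g[a] += 1.0
        m[u] += eps * delta * g                              # actor: policy improvement
        u = u_next

print({u: np.round(m[u], 2) for u in "ABC"})  # left preferred at B and A, right at C
```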
Generalizations
• Discounted reward: v(t) = ⟨Σ_τ γ^τ r(t + τ)⟩ with discount factor 0 < γ < 1
• The TD rule changes to δ(t) = r(t) + γ v(t + 1) − v(t)
• TD(λ): apply the TD rule not only to update the value of the current state but also the values of recently visited past states (a sketch follows). TD(0) = TD; TD(1) updates all past states.
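A sketch of TD(λ) with accumulating eligibility traces; the episode, λ, and learning rate are illustrative:

```python
import numpy as np

def td_lambda_episode(v, states, rewards, eps=0.1, lam=0.9, gamma=1.0):
    """One episode of TD(lambda) with accumulating eligibility traces.

    states[k] is the state index visited at step k, rewards[k] the reward received
    on leaving it; the episode is assumed to end in a terminal state of value 0.
    """
    e = np.zeros_like(v)                        # eligibility trace per state
    for k, u in enumerate(states):
        v_next = v[states[k + 1]] if k + 1 < len(states) else 0.0
        delta = rewards[k] + gamma * v_next - v[u]
        e[u] += 1.0                             # mark the current state as eligible
        v += eps * delta * e                    # update current AND recently visited states
        e *= gamma * lam                        # traces decay; lam = 0 recovers TD(0)
    return v

v = td_lambda_episode(np.zeros(5), states=[0, 1, 2, 3, 4], rewards=[0, 0, 0, 0, 1.0])
print(np.round(v, 3))   # reward information spreads back to earlier states in one pass
```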
Water maze
• State u = 493 place cells, 8 actions
• AC rules:
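What such actor-critic rules could look like with the place-cell state representation; a sketch assuming a linear critic v(u) = w·f(u), a linear actor m(u; a) = z_a·f(u), softmax action choice, and illustrative constants (in a real simulation the place-cell activities f would come from tuning curves over position):

```python
import numpy as np

n_cells, n_actions = 493, 8            # place-cell state representation, 8 directions
w = np.zeros(n_cells)                  # critic weights: v(u) = w . f(u)
z = np.zeros((n_actions, n_cells))     # actor weights:  m(u; a) = z[a] . f(u)
eps, beta, gamma = 0.05, 2.0, 0.98     # learning rate, exploration, discount (assumed)

def choose_action(f, rng):
    p = np.exp(beta * (z @ f)); p /= p.sum()       # softmax over the 8 directions
    return rng.choice(n_actions, p=p)

def actor_critic_update(f, a, r, f_next):
    delta = r + gamma * (w @ f_next) - (w @ f)     # TD error
    w[:] += eps * delta * f                        # critic: policy evaluation
    z[a] += eps * delta * f                        # actor: policy improvement

# Example call with dummy place-cell activities:
rng = np.random.default_rng(0)
f, f_next = rng.random(n_cells), rng.random(n_cells)
a = choose_action(f, rng)
actor_critic_update(f, a, r=0.0, f_next=f_next)
```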
Comparing rats and model
• RL predicts the initial learning well, but not the change to a new task.
Markov decision process
• State transitions P(u’|u,a).
• Absorbing states:
• Find the policy M that maximizes the expected total future reward from every state u.
• Solution: solve Bellman equation
Policy iteration
• Policy iteration = policy evaluation + policy improvement.
• Evaluation step: find the value v(u) of the current policy M (by solving the Bellman equations for that policy).
• RL evaluates the right-hand side stochastically:
v(u) ← v(u) + ε δ(t)
• Improvement step: maximize {...} wrt a
Requires knowledge of P(u’|u,a).
• The earlier policy-gradient formula can be derived as a stochastic version of this improvement step (a sketch of the full iteration with known P(u'|u,a) follows).
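A minimal sketch of exact policy iteration when P(u'|u,a) is known, combining the evaluation and improvement steps above; the array layout and the toy two-state MDP are assumptions:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact policy iteration for a known MDP.

    P[a, u, u'] are transition probabilities, R[u, a] expected immediate rewards
    (shapes and reward convention are assumed for this sketch).
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation step: solve the linear Bellman equations for the current policy.
        P_pi = P[policy, np.arange(n_states)]          # P(u'|u, M(u))
        r_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Improvement step: maximize r(u,a) + gamma * sum_u' P(u'|u,a) v(u') over a.
        q = R.T + gamma * P @ v                        # q[a, u]
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Tiny two-state example (assumed numbers), just to show the call:
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])    # action 1: switch
R = np.array([[0.0, 1.0],                   # rewards r(u, a)
              [2.0, 0.0]])
print(policy_iteration(P, R))               # optimal policy [1, 0], values ~[19, 20]
```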