TRANSCRIPT
Lyle Ungar, University of Pennsylvania
Learning and Memory
Reinforcement Learning
Learning Levels
Darwinian: trial -> death or children
Skinnerian: reinforcement learning
Popperian: our hypotheses die in our stead
Gregorian: tools and artifacts
Machine Learning
Unsupervised: cluster similar items; find associations (no "right" answer)
Supervised: for observations/features, a teacher gives the correct "answer" (e.g., learn to recognize categories)
Reinforcement: take an action, observe the consequence ("bad dog!")
Pavlovian Conditioning
Pavlov: food causes salivation; sound before food -> sound causes salivation
The dog learns to associate the sound with food
Operant Conditioning
Associative Memory
Hebbian Learning: when two connected neurons are both excited, the connection between them is strengthened
"Neurons that fire together, wire together"
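The Hebbian rule can be sketched numerically. A minimal sketch, assuming a simple product form Δw = (learning rate) × (presynaptic activity) × (postsynaptic activity); the neuron counts, activity values, and learning rate below are illustrative, not from the slides:

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """One Hebbian step: each weight grows in proportion to the joint
    activity of its presynaptic and postsynaptic neuron."""
    return [w + lr * pre_i * post for w, pre_i in zip(weights, pre)]

# Two input neurons feeding one output neuron (assumed toy values).
weights = [0.0, 0.0]
pre = [1.0, 0.0]   # only the first input fires
post = 1.0         # the output neuron fires too
for _ in range(5):
    weights = hebbian_update(weights, pre, post)
# Only the co-active connection is strengthened; the silent input's
# weight stays at zero ("fire together, wire together").
```

Note that this basic form only strengthens connections; real models add decay or normalization so weights do not grow without bound.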
Explanations of Pavlov
S-S (stimulus-stimulus): dogs learn to associate the sound with food (and salivate based on "thinking" of food)
S-R (stimulus-response): dogs learn to salivate directly from the tone (without "thinking" of food)
How to test? Do dogs think lights are food?
Conditioning in humans
Two pathways: the "slow" pathway dogs use, and cognitive (conscious) learning
How to test this hypothesis: learn to blink based on a stimulus associated with a puff of air
Blocking
Tone -> Shock -> Fear; afterward, Tone -> Fear
Tone + Light -> Shock -> Fear; afterward, Light -> ?
Rescorla-Wagner Model
Hypothesis: learn from observations that are surprising
Vn <- Vn + c (Vmax - Vn), i.e., ΔVn = c (Vmax - Vn)
Vn is the strength of the association between the US and the CS
c is the learning rate
Predicts contingency effects
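The update can be simulated directly. A minimal sketch (the learning rate, trial counts, and Vmax = 1 are assumed for illustration): with the standard compound-cue form, where all cues present on a trial share one prediction error, the model reproduces the blocking effect from the earlier slide.

```python
def rescorla_wagner(trials, c=0.3, v_max=1.0):
    """Rescorla-Wagner rule: on each trial (a tuple of cues, US present),
    every present cue's strength V moves toward v_max by a fraction c
    of the shared surprise (prediction error)."""
    V = {}
    for cues in trials:
        prediction = sum(V.get(cue, 0.0) for cue in cues)
        surprise = v_max - prediction
        for cue in cues:
            V[cue] = V.get(cue, 0.0) + c * surprise
    return V

# Blocking: pretrain tone alone, then pair tone + light with the same US.
V = rescorla_wagner([("tone",)] * 20 + [("tone", "light")] * 20)
# The tone already predicts the US, so little surprise remains for the
# light to absorb: V["light"] stays near zero.
```

This is exactly why surprise-driven learning explains blocking: no prediction error, no new association.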
Limitations of Rescorla-Wagner
Tone -> food
Light -> food
Tone + Light -> ?
Reinforcement Learning
Many times one takes a long sequence of actions and only discovers the result of these actions later (e.g., when you win or lose a game)
Q: How can one ascribe credit (or blame) to one action in a sequence of actions?
A: By noting surprises
Consider a game
Estimate the probability of winning
Take an action; see how the opponent (or the world) responds
Re-estimate the probability of winning
If it is unchanged, you learned nothing
If it is higher, the initial state was better than you thought
If it is lower, the state was worse than you thought
Tic-tac-toe example
Decision tree: alternate layers give the possible moves for each player
Reinforcement Learning
State: e.g., board position
Action: e.g., move
Policy: state -> action
Reward function: state -> utility
Model of the environment: state, action -> state
Definitions of key terms
State: what you need to know about the world to predict the effect of an action
Policy: what action to take in each state
Reward function: the cost or benefit of being in a state (e.g., points won or lost, happiness gained or lost)
Value Iteration
Value function: expected value of a policy over time = sum of the expected rewards
V(s) <- V(s) + c[V(s') - V(s)]
s = state before the move; s' = state after the move
This is "temporal difference" learning
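The update V(s) <- V(s) + c[V(s') - V(s)] can be applied along the states visited in an episode, with rewards folded into fixed terminal values (1 for a win, 0 for a loss), as in the game example above. The two-state chain, win rate, and learning rate below are an assumed toy setup, not from the slides:

```python
import random

def td_episode(V, path, c=0.1):
    """Apply the temporal-difference update along one episode's states.
    Terminal states have fixed values and are never updated (they only
    ever appear as s', the state after a move)."""
    for s, s_next in zip(path, path[1:]):
        V[s] += c * (V[s_next] - V[s])
    return V

# Assumed toy game: state A leads to state B, which wins 80% of the time.
V = {"A": 0.5, "B": 0.5, "win": 1.0, "loss": 0.0}
random.seed(0)
for _ in range(1000):
    outcome = "win" if random.random() < 0.8 else "loss"
    V = td_episode(V, ["A", "B", outcome])
# V["B"] hovers near the true win probability (0.8), and V["A"]
# is pulled toward V["B"]: surprise propagates backward through states.
```

A surprising outcome changes V("B"), and that change is itself a surprise that later updates V("A"), which is how credit flows back to earlier actions.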
Mouse in Maze Example
[Figure: the maze, showing the mouse's policy and the corresponding value function]
Dopamine & Reinforcement
Exploration - Exploitation
Exploration: always try a different route to work
Exploitation: always take the best route to work that you have found so far
Learning requires exploration (unless the environment is noisy)
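A standard way to mix the two is epsilon-greedy selection: exploit the best-known option most of the time, but explore a random one with small probability. This specific scheme is not named on the slide; the commute example and its route values are assumed for illustration.

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random option (explore);
    otherwise pick the option with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)

# Assumed example: estimated value of each route to work (negative minutes).
route_values = [-25.0, -22.0, -30.0]
random.seed(1)
choices = [epsilon_greedy(route_values) for _ in range(1000)]
# Route 1, the best found so far, dominates, but every route is still
# tried occasionally, so a route that improves can still be discovered.
```

Without the occasional random choice, a route that was unlucky on its first try would never be sampled again.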
RL can be very simple
A simple learning algorithm leads to an optimal policy:
without predicting the effects of the agent's actions
without predicting immediate payoffs
without planning
without an explicit model of the world
How to play chess
Computer: evaluation function for board positions + fast search
Human (grandmaster): memorize tens of thousands of board positions and what to do, then do a much smaller search!
AI and Games
Chess: deterministic; position evaluation + search
Backgammon: stochastic; policy
Scaling up value functions
For a small number of states: learn the value function of each state
Not possible for Backgammon (~10^20 states): learn a mapping from features to value instead
Then use reinforcement learning to get improved value estimates
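One common way to realize this feature-to-value mapping is a linear value function trained by temporal-difference updates (here in the standard TD(0) form with an explicit reward term). The two-state setup, features, and learning rate are an assumed toy example, not Backgammon:

```python
def td_update(w, phi_s, phi_s_next, reward, c=0.05):
    """TD update on a linear value function V(s) = sum_i w_i * phi_i(s):
    nudge the weights along the state's features by the TD error."""
    v_s = sum(wi * fi for wi, fi in zip(w, phi_s))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_s_next))
    error = reward + v_next - v_s
    return [wi + c * error * fi for wi, fi in zip(w, phi_s)]

# Assumed toy chain: state A -> state B -> terminal (reward 1),
# each state described by a two-feature indicator vector.
w = [0.0, 0.0]
phi_a, phi_b, phi_end = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w = td_update(w, phi_a, phi_b, reward=0.0)    # A -> B, no reward
    w = td_update(w, phi_b, phi_end, reward=1.0)  # B -> terminal, reward 1
# V(B) = w[1] heads toward 1, and V(A) = w[0] is pulled toward V(B).
```

The payoff is that states sharing features share value estimates, so 10^20 states no longer require 10^20 table entries.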
Q-learning
Instead of the value of a state, learn the value Q(s, a) of taking an action a from a state s
Optimal policy: take the best action, argmax_a Q(s, a)
Learning rule:
Q(s, a) <- Q(s, a) + c[r_t + max_b Q(s', b) - Q(s, a)]
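A minimal sketch of this rule (note the slide's form has no discount factor, so none is used here). The two-action toy problem, state names, and learning rate are assumed for illustration:

```python
def q_learning_step(Q, s, a, reward, s_next, actions, c=0.5):
    """The slide's rule: Q(s,a) <- Q(s,a) + c [r + max_b Q(s',b) - Q(s,a)].
    Unseen state-action pairs default to a value of 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + c * (reward + best_next - old)
    return Q

# Assumed toy problem: from "start", "right" reaches the goal (reward 1)
# and "left" reaches a dead end (reward 0).
ACTIONS = ["left", "right"]
Q = {}
for _ in range(20):
    Q = q_learning_step(Q, "start", "right", 1.0, "goal", ACTIONS)
    Q = q_learning_step(Q, "start", "left", 0.0, "dead_end", ACTIONS)
# argmax_a Q(("start", a)) now selects "right".
```

Because Q is indexed by (state, action), the optimal policy can be read off directly, with no model of the environment needed.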
Learning to Sing
A zebra finch hears its father's song and memorizes it
It then practices for months to learn to reproduce it
What kind of learning is this?
Controversies?
Is conditioning good?
How much learning do people do?
Innateness, learning, and free will