Lyle Ungar, University of Pennsylvania
Learning and Memory: Reinforcement Learning



Page 1

Lyle Ungar, University of Pennsylvania

Learning and Memory

Reinforcement Learning

Page 2

Learning Levels

Darwinian: trial -> death or children
Skinnerian: reinforcement learning
Popperian: our hypotheses die in our stead
Gregorian: tools and artifacts

Page 3

Machine Learning

Unsupervised: cluster similar items; find associations (no "right" answer)
Supervised: for observations/features, a teacher gives the correct "answer" (e.g., learn to recognize categories)
Reinforcement: take an action, observe the consequence ("bad dog!")

Page 4

Pavlovian Conditioning

Pavlov: food causes salivation; sound presented before food -> sound causes salivation
Learn to associate the sound with food

Page 5

Operant Conditioning

Page 6

Associative Memory

Hebbian Learning: when two connected neurons are both excited, the connection between them is strengthened
"Neurons that fire together, wire together"
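
A minimal sketch of a Hebbian weight update in Python (the rate-based units, the learning rate eta, and the array sizes are illustrative assumptions, not from the slides):

    import numpy as np

    def hebbian_update(w, pre, post, eta=0.01):
        # Strengthen connections between units that are active together: dw = eta * post * pre
        return w + eta * np.outer(post, pre)

    # Toy example: two input units, two output units (all names here are illustrative)
    w = np.zeros((2, 2))             # weights from presynaptic to postsynaptic units
    pre = np.array([1.0, 0.0])       # only the first input unit fires
    post = np.array([1.0, 0.0])      # only the first output unit fires
    for _ in range(10):
        w = hebbian_update(w, pre, post)
    print(w)                         # only w[0, 0] has grown: fire together, wire together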

Page 7

Explanations of Pavlov

S-S (stimulus-stimulus): dogs learn to associate the sound with food (and salivate based on "thinking" of food)
S-R (stimulus-response): dogs learn to salivate based on the tone (and salivate directly, without "thinking" of food)

How to test? Do dogs think lights are food?

Page 8

Conditioning in humans

Two pathways: the "slow" pathway dogs use, and cognitive (conscious) learning
How to test this hypothesis: learn to blink based on a stimulus associated with a puff of air.

Page 9

Blocking

Tone -> Shock -> Fear
Tone -> Fear
Tone + Light -> Shock -> Fear
Light -> ?

Page 10

Rescorla-Wagner Model

Hypothesis: learn from observations that are surprising
V_n <- V_n + c (V_max - V_n), i.e. ΔV_n = c (V_max - V_n)
V_n is the strength of association between US and CS
V_max is the maximum (asymptotic) association the US can support
c is the learning rate

Predictions: contingency
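
A minimal sketch of the Rescorla-Wagner update in Python, using the summed-prediction form for compound stimuli; the variable names and the blocking demonstration are illustrative, but they show why a pre-trained tone leaves little surprise for the light:

    def rescorla_wagner(V, present, us_present, c=0.1, V_max=1.0):
        # One trial: every cue that is present moves by c * surprise,
        # where surprise = (V_max if the US occurred, else 0) - total prediction.
        prediction = sum(V[s] for s in present)
        surprise = (V_max if us_present else 0.0) - prediction
        for s in present:
            V[s] += c * surprise
        return V

    V = {"tone": 0.0, "light": 0.0}
    for _ in range(100):                                  # phase 1: tone -> shock
        rescorla_wagner(V, ["tone"], us_present=True)
    for _ in range(100):                                  # phase 2: tone + light -> shock
        rescorla_wagner(V, ["tone", "light"], us_present=True)
    print(V)                                              # tone near 1.0, light near 0.0: blocking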

Page 11

Limitations of Rescorla-Wagner

Tone -> food
Light -> food
Tone + light -> ?

Page 12

Reinforcement Learning

Many times one takes a long sequence of actions, and only discovers the result of these actions later (e.g. when you win or lose a game)

Q: How can one ascribe credit (or blame) to one action in a sequence of actions?
A: By noting surprises

Page 13

Consider a game

Estimate the probability of winning
Take an action, see how the opponent (or the world) responds
Re-estimate the probability of winning
If it is unchanged, you learned nothing
If it is higher, the initial state was better than you thought
If it is lower, the state was worse than you thought

Page 14

Tic-tac-toe example

Decision tree: alternate layers give the possible moves for each player

Page 15

Reinforcement Learning

State: e.g. board position
Action: e.g. move
Policy: state -> action
Reward function: state -> utility
Model of the environment: state, action -> state

Page 16

Definitions of key terms

State: what you need to know about the world to predict the effect of an action
Policy: what action to take in each state
Reward function: the cost or benefit of being in a state (e.g. points won or lost, happiness gained or lost)
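
A small sketch of how these pieces fit together in code; the type names and the episode loop below are illustrative assumptions, not anything defined on the slides:

    from typing import Callable, Tuple

    State = Tuple[int, int]                        # e.g. a board or grid position
    Action = str                                   # e.g. a move such as "up" or "right"
    Policy = Callable[[State], Action]             # state -> action
    RewardFn = Callable[[State], float]            # state -> utility
    Model = Callable[[State, Action], State]       # state, action -> next state

    def run_episode(start: State, policy: Policy, reward: RewardFn,
                    model: Model, steps: int = 10) -> float:
        # Follow the policy through the model of the environment and add up the rewards.
        s, total = start, 0.0
        for _ in range(steps):
            a = policy(s)            # the policy picks an action in the current state
            s = model(s, a)          # the model says which state the action leads to
            total += reward(s)       # the reward function scores the resulting state
        return total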

Page 17

Value Iteration

Value function: expected value of a policy over time = sum of the expected rewards

V(s) <- V(s) + c[V(s') - V(s)]
s = state before the move
s' = state after the move
"temporal difference" learning
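
A minimal sketch of this temporal-difference update for a game whose only reward arrives at the end; the default value of 0.5 and the function names are illustrative assumptions:

    from collections import defaultdict

    V = defaultdict(lambda: 0.5)     # estimated probability of winning from each state
    c = 0.1                          # learning rate

    def td_update(s_before, s_after, final_reward=None):
        # Move V(s) toward the value of the state actually reached
        # (or toward the final reward once the game is over).
        target = final_reward if final_reward is not None else V[s_after]
        V[s_before] += c * (target - V[s_before])

    # After each move:        td_update(previous_state, new_state)
    # At the end of the game: td_update(previous_state, final_state, final_reward=1.0)  # or 0.0 for a loss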

Page 18

Mouse in Maze Example

[Figures: the mouse's policy and the value function for the maze]

Page 19

Dopamine & Reinforcement

Page 20

Exploration - Exploitation

Exploration: always try a different route to work
Exploitation: always take the best route to work that you have found so far

Learning requires exploration (unless the environment is noisy)
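
One common way to balance the two is an epsilon-greedy rule; the slides do not prescribe a particular scheme, so this sketch and its names are only illustrative:

    import random

    def epsilon_greedy(action_values, epsilon=0.1):
        # With probability epsilon explore a random action; otherwise exploit the best known one.
        if random.random() < epsilon:
            return random.choice(list(action_values))     # explore: try a different route
        return max(action_values, key=action_values.get)  # exploit: best route found so far

    # Estimated reward of each route to work (here, minus the commute time in minutes)
    routes = {"highway": -25.0, "back roads": -22.0, "bridge": -30.0}
    print(epsilon_greedy(routes))    # usually "back roads", occasionally a random experiment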

Page 21

RL can be very simple

A simple learning algorithm leads to an optimal policy
Without predicting the effects of the agent's actions
Without predicting immediate payoffs
Without planning
Without an explicit model of the world

Page 22

How to play chess

Computer: evaluation function for board positions + fast search
Human (grandmaster): memorize tens of thousands of board positions and what to do; do a much smaller search!

Page 23

AI and Games

Chess: deterministic; position evaluation + search
Backgammon: stochastic; policy

Page 24

Scaling up value functions

For a small number of states: learn the value function of each state
Not possible for Backgammon (~10^20 states): learn a mapping from features to value
Then use reinforcement learning to get improved value estimates
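
A minimal sketch of learning the value from features with a linear model and the same temporal-difference style of update; the feature function and the names are illustrative assumptions:

    import numpy as np

    w = np.zeros(4)                  # one weight per feature, instead of one value per state
    c = 0.01                         # learning rate

    def features(state):
        # Map a state to a small feature vector (placeholder: assumes the state
        # is already a length-4 sequence of numbers).
        return np.asarray(state, dtype=float)

    def value(state):
        return float(w @ features(state))

    def td_update(s_before, s_after, final_reward=None):
        # Adjust the weights so value(s_before) moves toward what actually followed.
        global w
        target = final_reward if final_reward is not None else value(s_after)
        w += c * (target - value(s_before)) * features(s_before)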

Page 25

Q-learning

Instead of the value V of a state, learn the value Q(s,a) of taking an action a from a state s.

Optimal policy: take the action a that maximizes Q(s,a)

Learning rule:
Q(s,a) <- Q(s,a) + c[r_t + max_b Q(s',b) - Q(s,a)]
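
A minimal sketch of this rule with a tabular Q stored in a dictionary; the names are illustrative, and the update follows the slide as written (no discount factor shown):

    from collections import defaultdict

    Q = defaultdict(float)           # Q[(state, action)] -> estimated value of that action in that state
    c = 0.1                          # learning rate

    def best_action(state, actions):
        # Optimal policy: pick the action with the largest Q(s, a).
        return max(actions, key=lambda a: Q[(state, a)])

    def q_update(s, a, reward, s_next, actions):
        # Q(s,a) <- Q(s,a) + c * [r + max_b Q(s',b) - Q(s,a)]
        target = reward + max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += c * (target - Q[(s, a)])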

Page 26

Learning to Sing

A zebra finch hears its father's song
Memorizes it, then practices for months to learn to reproduce it

What kind of learning is this?

Page 27

Controversies?

Is conditioning good?
How much learning do people do?
Innateness, learning, and free will