TRANSCRIPT
Lyle Ungar, University of Pennsylvania
Learning and Memory
Reinforcement Learning
Learning Levels
Darwinian: trial -> death or children
Skinnerian: reinforcement learning
Popperian: our hypotheses die in our stead
Gregorian: tools and artifacts
Machine Learning
Unsupervised: cluster similar items; find associations (no "right" answer)
Supervised: for observations/features, a teacher gives the correct "answer" (e.g., learn to recognize categories)
Reinforcement: take an action, observe the consequence ("bad dog!")
Pavlovian Conditioning
Pavlov: food causes salivation; sound before food -> sound causes salivation
The dog learns to associate the sound with food
Operant Conditioning
Associative Memory
Hebbian Learning: when two connected neurons are both excited, the connection between them is strengthened
"Neurons that fire together, wire together"
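The Hebbian rule can be sketched numerically. A minimal sketch, assuming a simple product form Δw = (learning rate) × (presynaptic activity) × (postsynaptic activity); the neuron counts, activity values, and learning rate below are illustrative, not from the slides:

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """One Hebbian step: each weight grows in proportion to the joint
    activity of its presynaptic and postsynaptic neuron."""
    return [w + lr * pre_i * post for w, pre_i in zip(weights, pre)]

# Two input neurons feeding one output neuron (assumed toy values).
weights = [0.0, 0.0]
pre = [1.0, 0.0]   # only the first input fires
post = 1.0         # the output neuron fires too
for _ in range(5):
    weights = hebbian_update(weights, pre, post)
# Only the co-active connection is strengthened; the silent input's
# weight stays at zero ("fire together, wire together").
```

Note that this basic form only strengthens connections; real models add decay or normalization so weights do not grow without bound.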
Explanations of Pavlov
S-S (stimulus-stimulus): dogs learn to associate the sound with food (and salivate based on "thinking" of food)
S-R (stimulus-response): dogs learn to salivate directly from the tone (without "thinking" of food)
How to test? Do dogs think lights are food?
Conditioning in humans
Two pathways: the "slow" pathway dogs use, and cognitive (conscious) learning
How to test this hypothesis: learn to blink based on a stimulus associated with a puff of air
Blocking
Tone -> Shock -> Fear; afterward, Tone -> Fear
Tone + Light -> Shock -> Fear; afterward, Light -> ?
Rescorla-Wagner Model
Hypothesis: learn from observations that are surprising
Vn <- Vn + c (Vmax - Vn), i.e., ΔVn = c (Vmax - Vn)
Vn is the strength of the association between the US and the CS
c is the learning rate
Predicts contingency effects
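The update can be simulated directly. A minimal sketch (the learning rate, trial counts, and Vmax = 1 are assumed for illustration): with the standard compound-cue form, where all cues present on a trial share one prediction error, the model reproduces the blocking effect from the earlier slide.

```python
def rescorla_wagner(trials, c=0.3, v_max=1.0):
    """Rescorla-Wagner rule: on each trial (a tuple of cues, US present),
    every present cue's strength V moves toward v_max by a fraction c
    of the shared surprise (prediction error)."""
    V = {}
    for cues in trials:
        prediction = sum(V.get(cue, 0.0) for cue in cues)
        surprise = v_max - prediction
        for cue in cues:
            V[cue] = V.get(cue, 0.0) + c * surprise
    return V

# Blocking: pretrain tone alone, then pair tone + light with the same US.
V = rescorla_wagner([("tone",)] * 20 + [("tone", "light")] * 20)
# The tone already predicts the US, so little surprise remains for the
# light to absorb: V["light"] stays near zero.
```

This is exactly why surprise-driven learning explains blocking: no prediction error, no new association.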
Limitations of Rescorla-Wagner
Tone -> food
Light -> food
Tone + Light -> ?
Reinforcement Learning
Many times one takes a long sequence of actions and only discovers the result of these actions later (e.g., when you win or lose a game)
Q: How can one ascribe credit (or blame) to one action in a sequence of actions?
A: By noting surprises
Consider a game
Estimate the probability of winning
Take an action; see how the opponent (or the world) responds
Re-estimate the probability of winning
If it is unchanged, you learned nothing
If it is higher, the initial state was better than you thought
If it is lower, the state was worse than you thought
Tic-tac-toe example
Decision tree: alternate layers give the possible moves for each player
Reinforcement Learning
State: e.g., board position
Action: e.g., move
Policy: state -> action
Reward function: state -> utility
Model of the environment: state, action -> state
Definitions of key terms
State: what you need to know about the world to predict the effect of an action
Policy: what action to take in each state
Reward function: the cost or benefit of being in a state (e.g., points won or lost, happiness gained or lost)
Value Iteration
Value function: expected value of a policy over time = sum of the expected rewards
V(s) <- V(s) + c[V(s') - V(s)]
s = state before the move; s' = state after the move
This is "temporal difference" learning
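The update V(s) <- V(s) + c[V(s') - V(s)] can be applied along the states visited in an episode, with rewards folded into fixed terminal values (1 for a win, 0 for a loss), as in the game example above. The two-state chain, win rate, and learning rate below are an assumed toy setup, not from the slides:

```python
import random

def td_episode(V, path, c=0.1):
    """Apply the temporal-difference update along one episode's states.
    Terminal states have fixed values and are never updated (they only
    ever appear as s', the state after a move)."""
    for s, s_next in zip(path, path[1:]):
        V[s] += c * (V[s_next] - V[s])
    return V

# Assumed toy game: state A leads to state B, which wins 80% of the time.
V = {"A": 0.5, "B": 0.5, "win": 1.0, "loss": 0.0}
random.seed(0)
for _ in range(1000):
    outcome = "win" if random.random() < 0.8 else "loss"
    V = td_episode(V, ["A", "B", outcome])
# V["B"] hovers near the true win probability (0.8), and V["A"]
# is pulled toward V["B"]: surprise propagates backward through states.
```

A surprising outcome changes V("B"), and that change is itself a surprise that later updates V("A"), which is how credit flows back to earlier actions.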
Mouse in Maze Example
[Figure: the maze, showing the mouse's policy and the corresponding value function]
Dopamine & Reinforcement
Exploration - Exploitation
Exploration: always try a different route to work
Exploitation: always take the best route to work that you have found so far
Learning requires exploration (unless the environment is noisy)
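A standard way to mix the two is epsilon-greedy selection: exploit the best-known option most of the time, but explore a random one with small probability. This specific scheme is not named on the slide; the commute example and its route values are assumed for illustration.

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random option (explore);
    otherwise pick the option with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)

# Assumed example: estimated value of each route to work (negative minutes).
route_values = [-25.0, -22.0, -30.0]
random.seed(1)
choices = [epsilon_greedy(route_values) for _ in range(1000)]
# Route 1, the best found so far, dominates, but every route is still
# tried occasionally, so a route that improves can still be discovered.
```

Without the occasional random choice, a route that was unlucky on its first try would never be sampled again.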
RL can be very simple
A simple learning algorithm leads to an optimal policy:
without predicting the effects of the agent's actions
without predicting immediate payoffs
without planning
without an explicit model of the world
How to play chess
Computer: evaluation function for board positions + fast search
Human (grandmaster): memorize tens of thousands of board positions and what to do, then do a much smaller search!
AI and Games
Chess: deterministic; position evaluation + search
Backgammon: stochastic; policy
Scaling up value functions
For a small number of states: learn the value function of each state
Not possible for Backgammon (~10^20 states): learn a mapping from features to value instead
Then use reinforcement learning to get improved value estimates
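One common way to realize this feature-to-value mapping is a linear value function trained by temporal-difference updates (here in the standard TD(0) form with an explicit reward term). The two-state setup, features, and learning rate are an assumed toy example, not Backgammon:

```python
def td_update(w, phi_s, phi_s_next, reward, c=0.05):
    """TD update on a linear value function V(s) = sum_i w_i * phi_i(s):
    nudge the weights along the state's features by the TD error."""
    v_s = sum(wi * fi for wi, fi in zip(w, phi_s))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_s_next))
    error = reward + v_next - v_s
    return [wi + c * error * fi for wi, fi in zip(w, phi_s)]

# Assumed toy chain: state A -> state B -> terminal (reward 1),
# each state described by a two-feature indicator vector.
w = [0.0, 0.0]
phi_a, phi_b, phi_end = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w = td_update(w, phi_a, phi_b, reward=0.0)    # A -> B, no reward
    w = td_update(w, phi_b, phi_end, reward=1.0)  # B -> terminal, reward 1
# V(B) = w[1] heads toward 1, and V(A) = w[0] is pulled toward V(B).
```

The payoff is that states sharing features share value estimates, so 10^20 states no longer require 10^20 table entries.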
Q-learning
Instead of the value of a state, learn the value Q(s, a) of taking an action a from a state s
Optimal policy: take the best action, argmax_a Q(s, a)
Learning rule:
Q(s, a) <- Q(s, a) + c[r_t + max_b Q(s', b) - Q(s, a)]
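A minimal sketch of this rule (note the slide's form has no discount factor, so none is used here). The two-action toy problem, state names, and learning rate are assumed for illustration:

```python
def q_learning_step(Q, s, a, reward, s_next, actions, c=0.5):
    """The slide's rule: Q(s,a) <- Q(s,a) + c [r + max_b Q(s',b) - Q(s,a)].
    Unseen state-action pairs default to a value of 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + c * (reward + best_next - old)
    return Q

# Assumed toy problem: from "start", "right" reaches the goal (reward 1)
# and "left" reaches a dead end (reward 0).
ACTIONS = ["left", "right"]
Q = {}
for _ in range(20):
    Q = q_learning_step(Q, "start", "right", 1.0, "goal", ACTIONS)
    Q = q_learning_step(Q, "start", "left", 0.0, "dead_end", ACTIONS)
# argmax_a Q(("start", a)) now selects "right".
```

Because Q is indexed by (state, action), the optimal policy can be read off directly, with no model of the environment needed.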
Learning to Sing
A zebra finch hears its father's song and memorizes it
It then practices for months to learn to reproduce it
What kind of learning is this?
Controversies?
Is conditioning good?
How much learning do people do?
Innateness, learning, and free will