Exploring Effects of Intrinsic Motivation in Reinforcement Learning Agents
Nitish Srivastava and T Sudhamsh Goutham
CS 365: Artificial Intelligence
Department of Computer Science and Engineering, IIT Kanpur
April 12, 2010
The RL Framework
The Problem
How should an agent take actions in an environment so as to maximize long-term reward?
1. The entity that learns and performs actions: the agent
2. A set of environment states $S$
3. A set of actions $A$
4. A set of scalar rewards $R$
Formalization as MDP
Markov Decision Process - MDP
Agent interacts with the environment at discrete time steps $t = 0, 1, 2, \ldots$
At each $t$, it perceives the state of the environment $s_t$
Chooses an action $a_t$ on the basis of $s_t$ and performs it
Environment returns a reward $r_{t+1}$ and the new state $s_{t+1}$
In our experiments
Agent is a child, modeled as objects: HAND, EYE, MARKER
Environment is a Playroom: a light switch, a ball, a bell, buttons for turning music on/off, and a toy monkey
Actions - move eye, hand, marker, use objects
This environment was first used by Singh et al. [3] to demonstrate intrinsic motivation.
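To make the interaction loop concrete, here is a minimal sketch in Python; the env/agent interface (reset, step, act, learn) is our own illustration, not code from the original experiments:

```python
# Minimal agent-environment interaction loop (illustrative interface).
def run_episode(env, agent, max_steps=1000):
    s = env.reset()                     # perceive initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.act(s)                # choose a_t on the basis of s_t
        s_next, r = env.step(a)         # environment returns r_{t+1}, s_{t+1}
        agent.learn(s, a, r, s_next)    # update the agent from the transition
        total_reward += r
        s = s_next
    return total_reward
```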
Formalization as MDP
Markov Policy
A mapping from a state-action pair to the probability of taking that action in that state:
$$\pi : S \times A \to [0, 1]$$
Value functions
State-value function
$$V^\pi(s) = E\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \pi\right]$$
Action-value function
$$Q^\pi(s, a) = E\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, a_t = a, \pi\right]$$
Objective: find the policy $\pi^* = \arg\max_\pi V^\pi(s_0)$
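As an illustration, $V^\pi(s_0)$ can be estimated by averaging sampled discounted returns. This Monte Carlo sketch assumes the same illustrative env interface as above:

```python
# Monte Carlo estimate of V^pi(s0): average discounted returns of sampled
# rollouts. `env` and `policy` are illustrative stand-ins, not from the slides.
def estimate_value(env, policy, gamma=0.9, n_rollouts=1000, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # sample a_t ~ pi(s_t, .)
            s, r = env.step(a)
            ret += discount * r         # accumulate gamma^k * r_{t+k+1}
            discount *= gamma
        total += ret
    return total / n_rollouts           # approximates V^pi(s_0)
```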
RL as a Dynamic Programming Problem
Optimal value functions
$$V^*(s) = \max_{a \in A_s}\Big[r^a_s + \gamma \sum_{s'} p^a_{ss'}\, V^*(s')\Big]$$
$$Q^*(s, a) = r^a_s + \gamma \sum_{s'} p^a_{ss'}\, \max_{a' \in A_{s'}} Q^*(s', a')$$
The Bellman equations relate the optimal value functions recursively to themselves
If we treat $V^*$ or $Q^*$ as unknowns, these equations can be used as update rules in dynamic programming algorithms, as in the sketch below
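For instance, a minimal value-iteration sketch that applies the Bellman optimality equation as an update rule; the tabular arrays `R[s, a]` and `P[s, a, s']` are our own notation, not from the slides:

```python
import numpy as np

# Value iteration: repeatedly apply the Bellman optimality backup.
# R[s, a] is the expected reward r^a_s; P[s, a, s2] is p^a_{ss'}.
def value_iteration(R, P, gamma=0.9, tol=1e-6):
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = r^a_s + gamma * sum_{s'} p^a_{ss'} V(s')
        Q = R + gamma * (P @ V)         # P @ V sums over s', giving shape (S, A)
        V_new = Q.max(axis=1)           # V*(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```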
Q-Learning
Update Method
A learning method for action-value functions that uses the value-iteration update:
$$Q(s_t, a_t) \leftarrow (1-\alpha)\,Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)\big]$$
Here,
$\alpha$: learning rate, the weight given to new information over old information
$\gamma$: discount rate, the weight given to future rewards
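A tabular Q-learning step might look like the following sketch; the dictionary-backed table and $\epsilon$-greedy exploration are our own illustrative choices:

```python
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(s, a)] -> value, 0 by default

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s_t, a_t) <- (1 - alpha) Q(s_t, a_t) + alpha [r + gamma max_a Q(s_{t+1}, a)]
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def epsilon_greedy(s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit
```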
Learning Temporal Abstractions: Options
Options are multi-step actions
The number of steps may differ from option to option
MDPs can now be treated as SMDPs (semi-Markov decision processes) over options
An option $o$ is characterized by $[I, \pi, \beta]$, where $I$ is the initiation set, $\pi$ is the option's policy, and $\beta$ is the termination condition (see the sketch below)
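A sketch of this structure; the class and field names are ours, purely illustrative:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State, Action = Any, Any

@dataclass
class Option:
    initiation_set: Set[State]             # I: states where the option may start
    policy: Callable[[State], Action]      # pi: action to take while executing
    termination: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```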
Option Learning
Options are learnt the same way as actions
Bellman equations and Q-learning for options are similar to those for actions
The value functions are temporal generalisations of action-value functions, which depend on the length of the option
Actions can be treated as single-step options
Policies: let $\mu$ be the SMDP policy defined over options. While an option $o$ is executing in state $s_t$, actions are chosen according to $\pi_o$. If $s_t$ is a termination state of $o$, the next option is chosen according to $\mu$ (see the sketch below)
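A minimal sketch of this two-level execution, reusing the Option structure above; all interface names are illustrative:

```python
import random

# Two-level execution: the SMDP policy mu picks an option; the option's
# internal policy pi_o picks primitive actions until beta(s) fires.
def run_smdp_episode(env, mu, max_steps=1000):
    s = env.reset()
    o, total_reward = mu(s), 0.0         # mu chooses the first option
    for _ in range(max_steps):
        a = o.policy(s)                  # primitive action from pi_o
        s, r = env.step(a)
        total_reward += r
        if random.random() < o.termination(s):
            o = mu(s)                    # option terminated: mu chooses the next one
    return total_reward
```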
Intra-option learning
Intra-option learning is used to learn the option models and values using $[I, \pi, \beta]$
Options are learnt from the experience and knowledge gained within a single option's execution
Intra-option methods are significantly faster than SMDP methods
Similar Bellman equations exist for intra-option learning
We use an intra-option version of Q-learning, sketched below
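A simplified sketch of an intra-option Q-learning update: after one primitive transition $(s, a, r, s')$, every option whose policy agrees with the taken action is updated. This is our own rendering of the idea, not the authors' code:

```python
from collections import defaultdict

Q = defaultdict(float)                   # Q[(state, option_index)] -> value

def intra_option_update(options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # One primitive transition (s, a, r, s') updates every option whose
    # policy would also have taken action a in state s.
    best = max(Q[(s_next, j)] for j in range(len(options)))
    for i, o in enumerate(options):
        if o.policy(s) != a:
            continue
        beta = o.termination(s_next)
        # U(s', o): keep following o with prob (1 - beta), else switch options.
        u = (1 - beta) * Q[(s_next, i)] + beta * best
        Q[(s, i)] += alpha * (r + gamma * u - Q[(s, i)])
```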
Intrinsic Motivation
Doing something for its own sake, not for solving problems or getting rewards.
No external critic present, so no external reward.
The reward depends on the unpredictability of the event, i.e. its surprisal.
Previous Work:
Singh et al. [3]: learning a hierarchical collection of skills
Schmidhuber [2]: algorithmic models of various emotions
Laird [1]: intrinsic motivation in the Soar cognitive architecture
Where do rewards come from?
Where do happiness, curiosity, fear, surprisal come from?
Neurological motivation
Unpredicted, biologically salient events cause a stereotypic short-latency (70-100 ms), short-duration (100-200 ms) burst of dopamine activity
$$r^i_t = \tau\,\big(1 - P(s_{t+1} \mid s_t, o)\big)$$
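A minimal sketch of this intrinsic reward, estimating $P$ from empirical transition counts; the count-based model and the default value of the scaling constant $\tau$ are our assumptions:

```python
from collections import defaultdict

counts = defaultdict(int)                # (s, o, s_next) -> occurrences seen
totals = defaultdict(int)                # (s, o) -> total transitions seen

def intrinsic_reward(s, o, s_next, tau=1.0):
    # r^i_t = tau * (1 - P(s_{t+1} | s_t, o)), with P estimated from counts.
    p = counts[(s, o, s_next)] / totals[(s, o)] if totals[(s, o)] else 0.0
    counts[(s, o, s_next)] += 1
    totals[(s, o)] += 1
    return tau * (1.0 - p)               # surprising transitions earn more reward
```

As a transition becomes predictable, $P$ approaches 1 and the intrinsic reward decays toward zero; this produces the boredom effect seen in the experiments below.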
Environment - Playroom
A grid with different objects
light switch
ball
bell
movable buttons
toy that can make sounds
Agent Description
Agent has:
Eye
Hand
Visual Marker
Agent can:
Move eye to hand
Move hand to eye
Move eye to marker
Move marker to eye
Move hand to marker
Move marker to hand
Move eye to random object
If both hand and eye are on the same object, the agent can use the object
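One way to encode this action set; an illustrative enumeration, not the original experiment code:

```python
from enum import Enum, auto

class PlayroomAction(Enum):
    MOVE_EYE_TO_HAND = auto()
    MOVE_HAND_TO_EYE = auto()
    MOVE_EYE_TO_MARKER = auto()
    MOVE_MARKER_TO_EYE = auto()
    MOVE_HAND_TO_MARKER = auto()
    MOVE_MARKER_TO_HAND = auto()
    MOVE_EYE_TO_RANDOM_OBJECT = auto()
    USE_OBJECT = auto()      # only effective when hand and eye share an object
```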
Simple Q-Learning of one option
Agent learns to switch the light on/off very quickly
[Plot: Reward vs. Episode, over 100 episodes]
Experiment with 1 salient event
Agent learns to switch the light on/off but, in the absence of any other event, gets bored.
[Plots: two learning curves over 40 episodes]
More than 1 salient event
Agent learns to switch the light on/off and keeps doing it.
[Plots: Total Reward vs. Episode, and number of times the salient event occurred vs. Episode, over 100 episodes]
The Vulcans are defeated: Emotional agent learns better
Results from Singh et al. [3]
References
[1] John E. Laird. Extending the Soar cognitive architecture. In Proceedings of the 2008 Conference on Artificial General Intelligence, pages 224-235, Amsterdam, The Netherlands, 2008. IOS Press.
[2] Jürgen Schmidhuber. Simple algorithmic principles of discovery, subjective beauty, selective attention, curiosity and creativity. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, ALT, volume 4754 of Lecture Notes in Computer Science, pages 32-33. Springer, 2007.
[3] Satinder P. Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In NIPS, 2004.