Reinforcement Learning 1
COMP538 Reinforcement Learning: Recent Development
Group 7:
Chan Ka Ki
Fung On Tik Andy
Li Yuk Hin
Instructor: Nevin L. Zhang
Reinforcement Learning 2
Outline
- Introduction
- 3 Solving Methods
- Main Consideration
  - Exploration vs. Exploitation: Directed / Undirected Exploration
  - Function Approximation
- Planning and Learning
  - Direct RL vs. Indirect RL
  - Dyna-Q and Prioritized Sweeping
- Conclusion on recent development
Reinforcement Learning 3
Introduction
- Agent interacts with environment
- Goal-directed learning from interaction
[Diagram: the AI agent observes state s(t), takes action a, and the environment returns reward r and the next state s(t + 1).]
Reinforcement Learning 4
Key Features
- The agent is NOT told which actions to take, but learns by itself: by trial-and-error, from experiences
- Explore and exploit
  - Exploitation = the agent takes the best action based on its current knowledge
  - Exploration = the agent tries a non-best action to gain more knowledge
Reinforcement Learning 5
Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
Reinforcement Learning 6
Dynamic Programming
- Model-based: compute optimal policies given a perfect model of the environment as a Markov decision process (MDP)
- Bootstrap: update estimates based in part on other learned estimates, without waiting for a final outcome
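To make the bootstrapping idea concrete, here is a minimal value-iteration sketch (not from the slides); the transition structure P, reward function R, and the parameter values are hypothetical placeholders.

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
    # Each sweep bootstraps: V(s) is updated from the current estimates of successor states.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:      # stop when no state value changes by more than theta
            return V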
Reinforcement Learning 7
Dynamic Programming
[Backup diagram: DP performs a full one-step backup over all possible successor states; T marks terminal states.]
Reinforcement Learning 8
Monte Carlo
- Model-free
- Does NOT bootstrap
- Entire episode included
- Only one choice at each state (unlike DP)
- Time required to estimate one state does not depend on the total number of states
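A minimal first-visit Monte Carlo prediction sketch (my illustration; it assumes episodes arrive as lists of (state, reward) pairs, and generate_episode is a hypothetical helper).

from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    # First-visit MC: average the complete return observed after each state's first visit.
    V, counts = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()          # [(state, reward), ...] for one full episode
        G, tagged = 0.0, []
        for state, reward in reversed(episode):
            G = gamma * G + reward            # return following this time step
            tagged.append((state, G))
        seen = set()
        for state, G in reversed(tagged):     # back to chronological order
            if state not in seen:             # first visit only
                seen.add(state)
                counts[state] += 1
                V[state] += (G - V[state]) / counts[state]
    return V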
Reinforcement Learning 9
Monte Carlo
[Backup diagram: a Monte Carlo backup follows a single sampled trajectory all the way to a terminal state T.]
Reinforcement Learning 10
Temporal Difference
- Model-free
- Bootstrap
- Partial episode included
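For comparison, a one-step TD(0) backup sketch (illustrative; the step-size and discount values are hypothetical): the update bootstraps from the current estimate of the next state instead of waiting for the full return.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Move V[s] toward the bootstrapped target r + gamma * V[s_next].
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
    return V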
Reinforcement Learning 11
Temporal Difference
[Backup diagram: a TD backup samples one step and then bootstraps from the estimated value of the next state; T marks terminal states.]
Reinforcement Learning 12
Example: Driving home
Reinforcement Learning 13
Driving home
[Figure: changes recommended by Monte Carlo methods vs. changes recommended by TD methods for the driving-home example.]
Reinforcement Learning 14
N-step TD Prediction
- MC and TD are the two extreme cases!
Reinforcement Learning 15
Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: backup an average of several returns, e.g. backup half of the 2-step return and half of the 4-step return:
  R_t^avg = (1/2) R_t^(2) + (1/2) R_t^(4)
- Called a complex backup
  - Draw each component
  - Label with the weights for that component
Reinforcement Learning 16
Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups
  - weight each n-step return by (1 - λ) λ^(n-1) (time since visitation)
  - λ-return:
    R_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) R_t^(n)
- Backup using the λ-return:
  ΔV_t(s_t) = α [ R_t^λ - V_t(s_t) ]
Reinforcement Learning 17
Forward View of TD(λ)
- Look forward from each state to determine its update from future states and rewards.
Reinforcement Learning 18
Backward View of TD(λ)
- The forward view was for theory; the backward view is for mechanism
- New variable called the eligibility trace e_t(s)
- On each step, decay all traces by γλ and increment the trace for the current state by 1
- Accumulating trace:
  e_t(s) = γλ e_{t-1}(s)        if s ≠ s_t
  e_t(s) = γλ e_{t-1}(s) + 1    if s = s_t
Reinforcement Learning 19
Backward View
- Shout δ_t backwards over time
- The strength of your voice decreases with temporal distance by γλ
  δ_t = r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t)
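A tabular sketch of one backward-view step with accumulating traces (my illustration; alpha, gamma, and lam are hypothetical parameter choices): every previously visited state's value is nudged by the current TD error in proportion to its trace.

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    # TD error, "shouted" backwards over time
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                   # accumulating trace: increment current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
        e[state] *= gamma * lam                  # decay every trace by gamma * lambda
    return V, e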
Reinforcement Learning 20
Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
Adaptive Exploration in Reinforcement Learning
Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada
Reinforcement Learning 22
Objectives
- Explain the trade-off between exploitation and exploration
- Introduce two categories of exploration methods:
  - Undirected Exploration: ε-greedy exploration
  - Directed Exploration: counter-based exploration, past-success directed exploration
- Function approximation: backpropagation algorithm and Fuzzy ARTMAP
Reinforcement Learning 23
Introduction
- Main problem: how to make the learning process adapt to a non-stationary environment?
- Sub-problems:
  - How to balance exploitation and exploration when the environment changes?
  - How can the function approximators adapt to the environment?
Reinforcement Learning 24
Exploitation and Exploration
- Exploit or explore?
  - To maximize reward, a learner must exploit the knowledge it already has
  - Exploring an action with a small immediate reward may yield more reward in the long run
- An example: choosing a job
  - Suppose you are working at a small company with a $25,000 salary
  - You have another offer from a large enterprise, but it only starts at $12,000
  - Keeping the job at the small company guarantees a stable income
  - Working at the enterprise may offer more opportunities for promotion, which increase income in the long run
Reinforcement Learning 25
Undirected Exploration
- Not biased: purely random
- E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as the next-to-best
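A minimal ε-greedy sketch (illustrative; Q is assumed to map each available action of the current state to its value estimate): exploration here is undirected, a uniformly random action regardless of how bad it looks.

import random

def epsilon_greedy(Q, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)           # explore: every action equally likely
    return max(actions, key=lambda a: Q[a])     # exploit: current best-looking action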
Reinforcement Learning 26
Directed Exploration
- Memorizes exploration-specific knowledge
- Biased by some features of the learning process
- E.g. counter-based techniques: favor the choice of actions resulting in a transition to a state that has not been frequently visited
- The main idea is to encourage the learner to explore:
  - parts of the state space that have not been sampled often
  - parts that have not been sampled recently
Reinforcement Learning 27
Past-Success Directed Exploration
- Based on ε-greedy exploration
- Biased to adapt to the environment from the learning process:
  - Increase the exploitation rate if reward is received at an increasing rate
  - Increase the exploration rate when the agent stops receiving reward
- Average discounted reward
  - Reflects the amount and frequency of received immediate rewards
  - The further back in time a reward was received, the less effect it has on the average
Reinforcement Learning 28
Past-Success Directed Exploration
- Average discounted reward, defined as:
  r̄_t = ( Σ_{k=1}^{t} γ^{t-k} r_k ) / ( Σ_{k=1}^{t} γ^{t-k} )
  where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t
- Apply it to the ε-greedy algorithm: the exploration rate ε_t(s) is set as a decreasing function of r̄_t
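A sketch of the average discounted reward as reconstructed above (my illustration; the recursive numerator/denominator form is just an algebraic rearrangement, and the mapping from r̄_t to the exploration rate is not shown).

def average_discounted_reward(rewards, gamma=0.95):
    # r_bar_t = sum_{k=1..t} gamma^(t-k) * r_k  /  sum_{k=1..t} gamma^(t-k)
    num = den = 0.0
    for r in rewards:            # rewards in chronological order r_1 ... r_t
        num = gamma * num + r    # older rewards contribute less and less
        den = gamma * den + 1.0
    return num / den if den else 0.0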
Reinforcement Learning 29
Gradient Descent Method
- Why use a gradient descent method?
  - RL applications use a table to store the value function
  - A large number of states makes a table practically impossible
  - Solution: use a function approximator to predict the values
- Error backpropagation algorithm
  - Catastrophic interference: cannot learn incrementally in a non-stationary environment; acquiring new knowledge makes it forget much of its previous knowledge
Reinforcement Learning 30
Gradient Descent Method

Initialize w arbitrarily and e = 0
Repeat (for each episode):
  Initialize s
  Pass s through each network and obtain Q_a
  a ← argmax_a Q_a
  With probability ε: a ← a random action in A(s)
  Repeat (for each step of episode):
    e ← γλ e;  e_a ← e_a + ∇_w Q_a
    Take action a, observe reward r and next state s'
    δ ← r - Q_a
    Pass s' through each network and obtain Q'_a
    a' ← argmax_a Q'_a
    With probability ε: a' ← a random action in A(s')
    δ ← δ + γ Q'_{a'}
    w ← w + α δ e
    a ← a'
  until s' is terminal

where a' ← argmax_a Q'_a means a' is set to the action for which Q' is maximal; α is a constant step-size parameter, the learning rate; ∇_w Q_a is the partial derivative of Q_a with respect to the weights w; γ is the discount factor; e is the vector of eligibility traces; λ ∈ (0, 1] is the eligibility trace parameter.
Reinforcement Learning 31
Fuzzy ARTMAP
- ARTMAP: Adaptive Resonance Theory mapping between an input vector and an output pattern
- A neural network specifically designed to deal with the stability/plasticity dilemma
- This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference
Reinforcement Learning 32
Experiments
- Gridworld with a non-stationary environment
- The learning agent can move up, down, left or right
- Two gates: the agent must pass through one of them to get from the start state to the goal state
- For the first 1000 episodes, gate 1 is open and gate 2 is closed
- For episodes 1001-5000, gate 1 is closed and gate 2 is open
- Tests how well the algorithm adapts to the changed environment
Reinforcement Learning 33
Results
- Backpropagation algorithm, after the 1000th episode:
  - the average discounted reward drops rapidly and monotonically
  - surges to maximum exploitation
- Fuzzy ARTMAP, after the 1000th episode:
  - reward drops for a few episodes and goes back to high values
  - a temporary surge in exploration
Reinforcement Learning 34
Planning and Learning
Objectives:
- Use of environment models
- Integration of planning and learning methods
Reinforcement Learning 35
Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: description of all possibilities and their probabilities
  - e.g., P_{ss'}^a and R_{ss'}^a for all s, s', and a ∈ A(s)
- Sample model: produces sample experiences
  - e.g., a simulation model, a set of data
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to obtain
Reinforcement Learning 36
Planning
- Planning: any computational process that uses a model to create or improve a policy
  model → [planning] → policy
- We take the following view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience
  model → simulated experience → [backups] → values → policy
Reinforcement Learning 37
Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.
Reinforcement Learning 38
Direct vs. Indirect RL
- Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions
- Direct methods: simpler; not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Reinforcement Learning 39
The Dyna-Q Architecture (Sutton 1990)
Reinforcement Learning 40
The Dyna-Q Architecture (Sutton 1990)
- Dyna uses experience to build the model (T, R), uses experience to adjust the policy, and uses the model to adjust the policy.
- For each interaction with the environment, experiencing <s, a, s', r>:
  1. Use experience to adjust the policy:
     Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s', a') - Q(s,a) ]
  2. Use experience to update the model (T, R):
     Model(s,a) ← (s', r)
  3. Use the model to simulate experience to adjust the policy:
     s ← random previously observed state, a ← random previously taken action
     (s', r) ← Model(s,a)
     Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s', a') - Q(s,a) ]
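A compact sketch of the three steps above in the tabular case (my illustration; the env object with reset()/step() returning (next_state, reward, done), the action list, and all constants are hypothetical placeholders).

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)            # Q[(state, action)]
    model = {}                        # model[(state, action)] = (next_state, reward)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 1. act, then use real experience to adjust the policy (Q-learning backup)
            a = random.choice(actions) if random.random() < epsilon \
                else max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            # 2. use real experience to update the model
            model[(s, a)] = (s2, r)
            # 3. planning: N simulated backups from previously observed (s, a) pairs
            for _ in range(n_planning):
                (ps, pa), (ps2, pr) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q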
Reinforcement Learning 41
The Dyna-Q Algorithm
[Algorithm figure: the steps are annotated as direct RL, model learning, and planning.]
Reinforcement Learning 42
Dyna-Q Snapshots: Midway in 2nd Episode
Reinforcement Learning 43
Dyna-Q Properties
- The Dyna algorithm requires about N times the computation of Q-learning per instance
- But this is typically vastly less than that of a naïve model-based method
- N can be determined by the relative speed of computation and of taking actions
- What if the environment changes? It can change to become harder or easier.
Reinforcement Learning 44
Blocking Maze
- The changed environment is harder
Reinforcement Learning 45
Shortcut Maze
- The changed environment is easier
Reinforcement Learning 46
What is Dyna-Q+?
- Uses an "exploration bonus":
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - The agent actually "plans" how to visit long-unvisited states
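The bonus used in the planning backups typically has the form r + κ·sqrt(τ), where τ is the number of time steps since the state-action pair was last tried for real and κ is a small constant (this form is from Sutton and Barto's description of Dyna-Q+, not spelled out on the slide); a one-line sketch:

import math

def bonus_reward(r, steps_since_tried, kappa=0.001):
    # Long-untried transitions look extra rewarding during planning, encouraging re-exploration.
    return r + kappa * math.sqrt(steps_since_tried)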
Reinforcement Learning 47
Prioritized Sweeping
- The updating of the model is no longer random
- Instead, store additional information in the model in order to make the appropriate choice of updates
- Store the change of each state value ΔV(s), and use it to modify the priority of the predecessors of s, according to their transition probability T(s,a,s')
[Example diagram: states s1-s5; two value changes (ΔV = 10 and ΔV = 5) give the priority ordering S4, S5, S2, S1, S3 from high to low.]
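A sketch of the priority-queue mechanics (my illustration; the model, predecessor table, and constants are hypothetical, and Q is assumed to be a dict with an entry for every (state, action) that appears).

import heapq

def sweep(queue, Q, model, predecessors, actions, alpha=0.1, gamma=0.95, theta=1e-4, n_updates=5):
    # queue holds (-priority, (state, action)) so the largest value change is popped first.
    for _ in range(n_updates):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)
        s2, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        for ps, pa in predecessors.get(s, ()):        # pairs predicted to lead into s
            pr = model[(ps, pa)][1]
            p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if p > theta:                             # only queue changes worth propagating
                heapq.heappush(queue, (-p, (ps, pa)))
    return Q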
Reinforcement Learning 48
Prioritized Sweeping
Reinforcement Learning 49
Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction
Reinforcement Learning 50
Full and Sample (One-Step) Backups
Reinforcement Learning 51
Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning
Reinforcement Learning 52
RL Recent Development: Problem Modeling

                         Model of environment
                         Known                        Unknown
Completely Observable    MDP                          Traditional RL
Partially Observable     Partially Observable MDP     Hidden State RL
Reinforcement Learning 53
Research topics
- Exploration-exploitation tradeoff
- Problem of delayed reward (credit assignment)
- Input generalization: function approximator
- Multi-agent reinforcement learning
  - Global goal vs. local goal
  - Achieve several goals in parallel
  - Agent cooperation and communication
Reinforcement Learning 54
RL Application: TD-Gammon
- Tesauro 1992, 1994, 1995, ...
- 30 pieces, 24 locations implies an enormous number of configurations
- Effective branching factor of 400
- TD(λ) algorithm
- Multi-layer neural network
- Near the level of the world's strongest grandmasters
Reinforcement Learning 55
RL Application: Elevator Dispatching
- Crites and Barto 1996
Reinforcement Learning 56
RL Application
- Elevator dispatching: conservatively about 10^22 states
  - 18 hall call buttons: 2^18 combinations
  - positions and directions of cars: 18^4 (rounding to the nearest floor)
  - motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
  - 40 car buttons: 2^40
  - 18 discretized real numbers are available giving the elapsed time since each hall button was pushed
  - the set of passengers riding each car and their destinations is observable only through the car buttons
Reinforcement Learning 57
RL Application
- Dynamic channel allocation: Singh and Bertsekas 1997
- Job-shop scheduling: Zhang and Dietterich 1995, 1996
Reinforcement Learning 58
Q & A