Reinforcement Learning 1
COMP538 Reinforcement Learning: Recent Development
Group 7:
Chan Ka Ki
Fung On Tik Andy
Li Yuk Hin
Instructor: Nevin L. Zhang
Reinforcement Learning 2
Outline
- Introduction
- 3 Solving Methods
- Main Consideration
  - Exploration vs. Exploitation: Directed / Undirected Exploration
  - Function Approximation
- Planning and Learning
  - Direct RL vs. Indirect RL
  - Dyna-Q and Prioritized Sweeping
- Conclusion on recent development
Reinforcement Learning 3
Introduction
- Agent interacts with environment
- Goal-directed learning from interaction
[Diagram: the AI agent observes state s(t), takes action a, and the environment returns reward r and the next state s(t + 1).]
Reinforcement Learning 4
Key Features
- The agent is NOT told which actions to take, but learns by itself: by trial-and-error, from experiences
- Explore and exploit
  - Exploitation = the agent takes the best action based on its current knowledge
  - Exploration = the agent tries a non-best action to gain more knowledge
Reinforcement Learning 5
Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
Reinforcement Learning 6
Dynamic Programming
- Model-based: compute optimal policies given a perfect model of the environment as a Markov decision process (MDP)
- Bootstrap: update estimates based in part on other learned estimates, without waiting for a final outcome
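To make the bootstrapping idea concrete, here is a minimal value-iteration sketch (not from the slides); the transition structure P, reward function R, and the parameter values are hypothetical placeholders.

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
    # Each sweep bootstraps: V(s) is updated from the current estimates of successor states.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:      # stop when no state value changes by more than theta
            return V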
Reinforcement Learning 7
Dynamic Programming
[Backup diagram: DP performs a full one-step backup over all possible successor states; T marks terminal states.]
Reinforcement Learning 8
Monte Carlo
- Model-free
- Does NOT bootstrap
- Entire episode included
- Only one choice at each state (unlike DP)
- Time required to estimate one state does not depend on the total number of states
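A minimal first-visit Monte Carlo prediction sketch (my illustration; it assumes episodes arrive as lists of (state, reward) pairs, and generate_episode is a hypothetical helper).

from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    # First-visit MC: average the complete return observed after each state's first visit.
    V, counts = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()          # [(state, reward), ...] for one full episode
        G, tagged = 0.0, []
        for state, reward in reversed(episode):
            G = gamma * G + reward            # return following this time step
            tagged.append((state, G))
        seen = set()
        for state, G in reversed(tagged):     # back to chronological order
            if state not in seen:             # first visit only
                seen.add(state)
                counts[state] += 1
                V[state] += (G - V[state]) / counts[state]
    return V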
Reinforcement Learning 9
Monte Carlo
[Backup diagram: a Monte Carlo backup follows a single sampled trajectory all the way to a terminal state T.]
Reinforcement Learning 10
Temporal Difference
- Model-free
- Bootstrap
- Partial episode included
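For comparison, a one-step TD(0) backup sketch (illustrative; the step-size and discount values are hypothetical): the update bootstraps from the current estimate of the next state instead of waiting for the full return.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Move V[s] toward the bootstrapped target r + gamma * V[s_next].
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
    return V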
Reinforcement Learning 11
Temporal Difference
[Backup diagram: a TD backup samples one step and then bootstraps from the estimated value of the next state; T marks terminal states.]
Reinforcement Learning 12
Example: Driving home
Reinforcement Learning 13
Driving home
[Figure: changes recommended by Monte Carlo methods vs. changes recommended by TD methods for the driving-home example.]
Reinforcement Learning 14
N-step TD Prediction
- MC and TD are the two extreme cases!
Reinforcement Learning 15
Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: backup an average of several returns, e.g. backup half of the 2-step return and half of the 4-step return:
  R_t^avg = (1/2) R_t^(2) + (1/2) R_t^(4)
- Called a complex backup
  - Draw each component
  - Label with the weights for that component
Reinforcement Learning 16
Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups
  - weight each n-step return by (1 - λ) λ^(n-1) (time since visitation)
  - λ-return:
    R_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) R_t^(n)
- Backup using the λ-return:
  ΔV_t(s_t) = α [ R_t^λ - V_t(s_t) ]
Reinforcement Learning 17
Forward View of TD(λ)
- Look forward from each state to determine its update from future states and rewards.
Reinforcement Learning 18
Backward View of TD(λ)
- The forward view was for theory; the backward view is for mechanism
- New variable called the eligibility trace e_t(s)
- On each step, decay all traces by γλ and increment the trace for the current state by 1
- Accumulating trace:
  e_t(s) = γλ e_{t-1}(s)        if s ≠ s_t
  e_t(s) = γλ e_{t-1}(s) + 1    if s = s_t
Reinforcement Learning 19
Backward View
- Shout δ_t backwards over time
- The strength of your voice decreases with temporal distance by γλ
  δ_t = r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t)
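A tabular sketch of one backward-view step with accumulating traces (my illustration; alpha, gamma, and lam are hypothetical parameter choices): every previously visited state's value is nudged by the current TD error in proportion to its trace.

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    # TD error, "shouted" backwards over time
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                   # accumulating trace: increment current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
        e[state] *= gamma * lam                  # decay every trace by gamma * lambda
    return V, e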
Reinforcement Learning 20
Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
Adaptive Exploration in Reinforcement Learning
Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada
Reinforcement Learning 22
Objectives
- Explain the trade-off between exploitation and exploration
- Introduce two categories of exploration methods:
  - Undirected Exploration: ε-greedy exploration
  - Directed Exploration: counter-based exploration, past-success directed exploration
- Function approximation: backpropagation algorithm and Fuzzy ARTMAP
Reinforcement Learning 23
Introduction
- Main problem: how to make the learning process adapt to a non-stationary environment?
- Sub-problems:
  - How to balance exploitation and exploration when the environment changes?
  - How can the function approximators adapt to the environment?
Reinforcement Learning 24
Exploitation and Exploration
- Exploit or explore?
  - To maximize reward, a learner must exploit the knowledge it already has
  - Exploring an action with a small immediate reward may yield more reward in the long run
- An example: choosing a job
  - Suppose you are working at a small company with a $25,000 salary
  - You have another offer from a large enterprise, but it only starts at $12,000
  - Keeping the job at the small company guarantees a stable income
  - Working at the enterprise may offer more opportunities for promotion, which increase income in the long run
Reinforcement Learning 25
Undirected Exploration
- Not biased: purely random
- E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as the next-to-best
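A minimal ε-greedy sketch (illustrative; Q is assumed to map each available action of the current state to its value estimate): exploration here is undirected, a uniformly random action regardless of how bad it looks.

import random

def epsilon_greedy(Q, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)           # explore: every action equally likely
    return max(actions, key=lambda a: Q[a])     # exploit: current best-looking action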
Reinforcement Learning 26
Directed Exploration
- Memorizes exploration-specific knowledge
- Biased by some features of the learning process
- E.g. counter-based techniques: favor the choice of actions resulting in a transition to a state that has not been frequently visited
- The main idea is to encourage the learner to explore:
  - parts of the state space that have not been sampled often
  - parts that have not been sampled recently
Reinforcement Learning 27
Past-Success Directed Exploration
- Based on ε-greedy exploration
- Biased to adapt to the environment from the learning process:
  - Increase the exploitation rate if reward is received at an increasing rate
  - Increase the exploration rate when the agent stops receiving reward
- Average discounted reward
  - Reflects the amount and frequency of received immediate rewards
  - The further back in time a reward was received, the less effect it has on the average
Reinforcement Learning 28
Past-Success Directed Exploration
- Average discounted reward, defined as:
  r̄_t = ( Σ_{k=1}^{t} γ^{t-k} r_k ) / ( Σ_{k=1}^{t} γ^{t-k} )
  where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t
- Apply it to the ε-greedy algorithm: the exploration rate ε_t(s) is set as a decreasing function of r̄_t
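A sketch of the average discounted reward as reconstructed above (my illustration; the recursive numerator/denominator form is just an algebraic rearrangement, and the mapping from r̄_t to the exploration rate is not shown).

def average_discounted_reward(rewards, gamma=0.95):
    # r_bar_t = sum_{k=1..t} gamma^(t-k) * r_k  /  sum_{k=1..t} gamma^(t-k)
    num = den = 0.0
    for r in rewards:            # rewards in chronological order r_1 ... r_t
        num = gamma * num + r    # older rewards contribute less and less
        den = gamma * den + 1.0
    return num / den if den else 0.0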
Reinforcement Learning 29
Gradient Descent Method
- Why use a gradient descent method?
  - RL applications use a table to store the value function
  - A large number of states makes a table practically impossible
  - Solution: use a function approximator to predict the values
- Error backpropagation algorithm
  - Catastrophic interference: cannot learn incrementally in a non-stationary environment; acquiring new knowledge makes it forget much of its previous knowledge
Reinforcement Learning 30
Gradient Descent Method

Initialize w arbitrarily and e = 0
Repeat (for each episode):
  Initialize s
  Pass s through each network and obtain Q_a
  a ← argmax_a Q_a
  With probability ε: a ← a random action in A(s)
  Repeat (for each step of episode):
    e ← γλ e;  e_a ← e_a + ∇_w Q_a
    Take action a, observe reward r and next state s'
    δ ← r - Q_a
    Pass s' through each network and obtain Q'_a
    a' ← argmax_a Q'_a
    With probability ε: a' ← a random action in A(s')
    δ ← δ + γ Q'_{a'}
    w ← w + α δ e
    a ← a'
  until s' is terminal

where a' ← argmax_a Q'_a means a' is set to the action for which Q' is maximal; α is a constant step-size parameter, the learning rate; ∇_w Q_a is the partial derivative of Q_a with respect to the weights w; γ is the discount factor; e is the vector of eligibility traces; λ ∈ (0, 1] is the eligibility trace parameter.
Reinforcement Learning 31
Fuzzy ARTMAP
- ARTMAP: Adaptive Resonance Theory mapping between an input vector and an output pattern
- A neural network specifically designed to deal with the stability/plasticity dilemma
- This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference
Reinforcement Learning 32
Experiments
- Gridworld with a non-stationary environment
- The learning agent can move up, down, left or right
- Two gates: the agent must pass through one of them to get from the start state to the goal state
- For the first 1000 episodes, gate 1 is open and gate 2 is closed
- For episodes 1001-5000, gate 1 is closed and gate 2 is open
- Tests how well the algorithm adapts to the changed environment
Reinforcement Learning 33
Results
- Backpropagation algorithm, after the 1000th episode:
  - the average discounted reward drops rapidly and monotonically
  - surges to maximum exploitation
- Fuzzy ARTMAP, after the 1000th episode:
  - reward drops for a few episodes and goes back to high values
  - a temporary surge in exploration
Reinforcement Learning 34
Planning and Learning
Objectives:
- Use of environment models
- Integration of planning and learning methods
Reinforcement Learning 35
Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: description of all possibilities and their probabilities
  - e.g., P_{ss'}^a and R_{ss'}^a for all s, s', and a ∈ A(s)
- Sample model: produces sample experiences
  - e.g., a simulation model, a set of data
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to obtain
Reinforcement Learning 36
Planning
- Planning: any computational process that uses a model to create or improve a policy
  model → [planning] → policy
- We take the following view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience
  model → simulated experience → [backups] → values → policy
Reinforcement Learning 37
Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.
Reinforcement Learning 38
Direct vs. Indirect RL
- Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions
- Direct methods: simpler; not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Reinforcement Learning 39
The Dyna-Q Architecture (Sutton 1990)
Reinforcement Learning 40
The Dyna-Q Architecture (Sutton 1990)
- Dyna uses experience to build the model (T, R), uses experience to adjust the policy, and uses the model to adjust the policy.
- For each interaction with the environment, experiencing <s, a, s', r>:
  1. Use experience to adjust the policy:
     Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s', a') - Q(s,a) ]
  2. Use experience to update the model (T, R):
     Model(s,a) ← (s', r)
  3. Use the model to simulate experience to adjust the policy:
     s ← random previously observed state, a ← random previously taken action
     (s', r) ← Model(s,a)
     Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s', a') - Q(s,a) ]
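A compact sketch of the three steps above in the tabular case (my illustration; the env object with reset()/step() returning (next_state, reward, done), the action list, and all constants are hypothetical placeholders).

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)            # Q[(state, action)]
    model = {}                        # model[(state, action)] = (next_state, reward)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 1. act, then use real experience to adjust the policy (Q-learning backup)
            a = random.choice(actions) if random.random() < epsilon \
                else max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            # 2. use real experience to update the model
            model[(s, a)] = (s2, r)
            # 3. planning: N simulated backups from previously observed (s, a) pairs
            for _ in range(n_planning):
                (ps, pa), (ps2, pr) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q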
Reinforcement Learning 41
The Dyna-Q Algorithm
[Algorithm figure: the steps are annotated as direct RL, model learning, and planning.]
Reinforcement Learning 42
Dyna-Q Snapshots: Midway in 2nd Episode
Reinforcement Learning 43
Dyna-Q Properties
- The Dyna algorithm requires about N times the computation of Q-learning per instance
- But this is typically vastly less than that of a naïve model-based method
- N can be determined by the relative speed of computation and of taking actions
- What if the environment changes? It can change to become harder or easier.
Reinforcement Learning 44
Blocking Maze
- The changed environment is harder
Reinforcement Learning 45
Shortcut Maze
- The changed environment is easier
Reinforcement Learning 46
What is Dyna-Q+?
- Uses an "exploration bonus":
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - The agent actually "plans" how to visit long-unvisited states
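The bonus used in the planning backups typically has the form r + κ·sqrt(τ), where τ is the number of time steps since the state-action pair was last tried for real and κ is a small constant (this form is from Sutton and Barto's description of Dyna-Q+, not spelled out on the slide); a one-line sketch:

import math

def bonus_reward(r, steps_since_tried, kappa=0.001):
    # Long-untried transitions look extra rewarding during planning, encouraging re-exploration.
    return r + kappa * math.sqrt(steps_since_tried)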
Reinforcement Learning 47
Prioritized Sweeping
- The updating of the model is no longer random
- Instead, store additional information in the model in order to make the appropriate choice of updates
- Store the change of each state value ΔV(s), and use it to modify the priority of the predecessors of s, according to their transition probability T(s,a,s')
[Example diagram: states s1-s5; two value changes (ΔV = 10 and ΔV = 5) give the priority ordering S4, S5, S2, S1, S3 from high to low.]
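A sketch of the priority-queue mechanics (my illustration; the model, predecessor table, and constants are hypothetical, and Q is assumed to be a dict with an entry for every (state, action) that appears).

import heapq

def sweep(queue, Q, model, predecessors, actions, alpha=0.1, gamma=0.95, theta=1e-4, n_updates=5):
    # queue holds (-priority, (state, action)) so the largest value change is popped first.
    for _ in range(n_updates):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)
        s2, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        for ps, pa in predecessors.get(s, ()):        # pairs predicted to lead into s
            pr = model[(ps, pa)][1]
            p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if p > theta:                             # only queue changes worth propagating
                heapq.heappush(queue, (-p, (ps, pa)))
    return Q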
Reinforcement Learning 48
Prioritized Sweeping
Reinforcement Learning 49
Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction
Reinforcement Learning 50
Full and Sample (One-Step) Backups
Reinforcement Learning 51
Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning
Reinforcement Learning 52
RL Recent Development: Problem Modeling

                         Model of environment
                         Known                        Unknown
Completely Observable    MDP                          Traditional RL
Partially Observable     Partially Observable MDP     Hidden State RL
Reinforcement Learning 53
Research topics
- Exploration-exploitation tradeoff
- Problem of delayed reward (credit assignment)
- Input generalization: function approximator
- Multi-agent reinforcement learning
  - Global goal vs. local goal
  - Achieve several goals in parallel
  - Agent cooperation and communication
Reinforcement Learning 54
RL Application: TD-Gammon
- Tesauro 1992, 1994, 1995, ...
- 30 pieces, 24 locations implies an enormous number of configurations
- Effective branching factor of 400
- TD(λ) algorithm
- Multi-layer neural network
- Near the level of the world's strongest grandmasters
Reinforcement Learning 55
RL Application: Elevator Dispatching
- Crites and Barto 1996
Reinforcement Learning 56
RL Application
- Elevator dispatching: conservatively about 10^22 states
  - 18 hall call buttons: 2^18 combinations
  - positions and directions of cars: 18^4 (rounding to the nearest floor)
  - motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
  - 40 car buttons: 2^40
  - 18 discretized real numbers are available giving the elapsed time since each hall button was pushed
  - the set of passengers riding each car and their destinations is observable only through the car buttons
Reinforcement Learning 57
RL Application
- Dynamic channel allocation: Singh and Bertsekas 1997
- Job-shop scheduling: Zhang and Dietterich 1995, 1996
Reinforcement Learning 58
Q & A