

The Utility of Temporal Abstraction in Reinforcement Learning

Nicholas K. Jong Todd Hester Peter Stone

Department of Computer Sciences
The University of Texas at Austin

The Seventh International Conference on Autonomous Agents and Multiagent Systems


Outline

1 Motivation: Hierarchical Reinforcement Learning

2 Experimental Results
  Learning with Options
  Options and Random Exploration
  Other Applications of Options


Goal: Learn Agent Behaviors Autonomously

Reinforcement learning algorithms:

Given experience with an unknown environment
Estimates the value of states
Learns a policy

Problem
How to learn more efficiently?

[Figure: gridworld with goal state G]


Intuition: Decompose Tasks into Subtasks

Standard RL assumes flat state and action spaces.
Real-world applications have hierarchical structure.
Abstract actions:

  Represent sequences of primitive actions
  Achieve subgoals


The Most Popular Framework for Hierarchical RL

Options: analogous to macro-operators
  Initiation set (precondition)
  Termination function (postcondition)
  Option policy (implementation)

Typically used to augment an action space
Can be treated simply as temporally extended actions
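To make the three components concrete, here is a minimal sketch of an option as a data structure. The Python names and the grid-state representation are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]  # e.g., (row, col) in a gridworld

@dataclass
class Option:
    """A temporally extended action: precondition, postcondition, and implementation."""
    initiation_set: Set[State]             # states where the option may be invoked
    termination: Callable[[State], float]  # probability of terminating in a given state
    policy: Dict[State, int]               # primitive action the option takes in each state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```

Treating such an object as just another entry in the action set is what "temporally extended action" means: once selected, the option's policy keeps control until its termination function fires.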


The Benefits of Options

Prior work: options are good
Future work: where do the options come from?

Key Question
How precisely does the addition of options affect learning?


Replicating Results in Option Discovery

Apply standard Q-learning with ε-greedy exploration
Introduce options after 20 episodes

  One option for each of four given subgoals
  Option policies learned from experience replay
  Initiation set: states that can reach subgoal

[Figure: one of four options shown in the gridworld; steps per episode vs. episodes for Q-learning and for options introduced after 20 episodes]
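As a rough sketch of this setup, the loop below runs tabular Q-learning with ε-greedy exploration and only offers the options once 20 episodes have elapsed. The environment interface (`env.reset`, `env.step`) and the convention that a step returns the accumulated reward and the number of primitive steps taken are assumptions for illustration, not the authors' code.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def run(env, primitive_actions, options, num_episodes=50, introduce_at=20):
    """Tabular Q-learning with epsilon-greedy exploration; options join the action set later."""
    Q = defaultdict(float)                    # Q[(state, action)]
    steps_per_episode = []
    for episode in range(num_episodes):
        actions = list(primitive_actions)
        if episode >= introduce_at:           # options are only offered from episode 20 on
            actions += list(options)
        s, done, steps = env.reset(), False, 0
        while not done:
            if random.random() < EPSILON:     # explore
                a = random.choice(actions)
            else:                             # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            # env.step is assumed to return the reward accumulated while the (possibly
            # multi-step) action ran and k, the number of primitive steps it took.
            s2, r, done, k = env.step(s, a)
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += ALPHA * (r + (GAMMA ** k) * best_next - Q[(s, a)])
            s, steps = s2, steps + k
        steps_per_episode.append(steps)
    return Q, steps_per_episode
```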


Hierarchical Reasoning or Additional Computation?

Observation
The technique used to obtain the option policy can also be used to improve the value function without using options at all!

[Figure: steps per episode vs. episodes for Q-learning, options after 20 episodes, and experience replay at 20 episodes]

Better baseline: just experience replay after 20 episodes
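The better baseline can be expressed as nothing more than re-running the one-step Q-learning update over stored transitions. A minimal sketch, assuming transitions were logged as (s, a, r, s', done) tuples and reusing the hypothetical Q-table from the previous sketch:

```python
def experience_replay(Q, replay_buffer, actions, sweeps=10, alpha=0.1, gamma=0.95):
    """Reuse logged one-step transitions to refine the value function offline."""
    for _ in range(sweeps):
        for (s, a, r, s2, done) in replay_buffer:
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Invoking this once at episode 20 captures the flavor of the "Experience replay at 20" curve: the same data that would have trained the option policies is used to sharpen the flat value function directly.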


Options Can Degrade Learning Performance

Isolating the effect of hierarchy
  Give only subgoals (at start)
  Learn option policies online

Subgoals can degrade performance initially.
Correct options can severely degrade performance!

[Figure: steps per episode vs. episodes for Q-learning, subgoals given from the start, and options given from the start]
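One common way to realize "learn option policies online", sketched below, is a pseudo-reward scheme: each option keeps its own Q-table and treats arrival at its subgoal as a terminal reward of 1. This is a standard construction offered for illustration; the talk's exact mechanism may differ.

```python
def update_option(option_Q, s, a, s2, subgoal, actions, alpha=0.1, gamma=0.95):
    """One internal Q-learning update for an option whose target is `subgoal`."""
    if s2 == subgoal:
        pseudo_reward, best_next = 1.0, 0.0   # the option terminates at its subgoal
    else:
        pseudo_reward = 0.0
        best_next = max(option_Q[(s2, act)] for act in actions)
    option_Q[(s, a)] += alpha * (pseudo_reward + gamma * best_next - option_Q[(s, a)])
```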


Options Change the Environment Structure

[Figures: random walk in the original environment; random walk in the augmented environment]
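A toy simulation (not the talk's gridworld) of why augmenting the action set reshapes a random walk: in a simple corridor, adding a single "go to the subgoal" option to the uniformly random choice concentrates visits around that subgoal. All names and constants here are illustrative assumptions.

```python
import random
from collections import Counter

N, SUBGOAL, STEPS = 20, 5, 100_000  # corridor length, subgoal state, walk length

def random_walk(use_option):
    """Count state visits under uniformly random action choices in a corridor of N states."""
    visits, s = Counter(), 0
    for _ in range(STEPS):
        choices = ["left", "right"] + (["to_subgoal"] if use_option else [])
        a = random.choice(choices)
        if a == "to_subgoal":
            s = SUBGOAL                     # the option runs until it reaches its subgoal
        else:
            s = max(0, s - 1) if a == "left" else min(N - 1, s + 1)
        visits[s] += 1
    return visits

print("most-visited states, primitives only:", random_walk(False).most_common(3))
print("most-visited states, with the option:", random_walk(True).most_common(3))
```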


Restricting the Initiation Set

Idea: Limit options to certain states
Requires domain expertise

[Figures: initiation set of one option; steps per episode vs. episodes for Q-learning, options, and limited options]
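Restricting the initiation set amounts to gating which actions are offered in each state: an option appears in the action set only where its hand-designed initiation set says it applies. A minimal sketch, assuming the hypothetical `Option` structure from earlier:

```python
def available_actions(s, primitive_actions, options, restrict_initiation=True):
    """Return the actions the agent may choose in state s."""
    acts = list(primitive_actions)
    for opt in options:
        if not restrict_initiation or opt.can_start(s):  # domain knowledge lives in the initiation set
            acts.append(opt)
    return acts
```

With `restrict_initiation=False` this reduces to offering every option everywhere; turning the restriction on is the change the "Limited options" curve isolates.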


Delaying Option Deployment

Idea: wait until value function partially learned
Somewhat brittle

[Figures: value function at option deployment; steps per episode vs. episodes for Q-learning, options, and delayed options]


Options and Optimistic Exploration

Observation
We can blame some of the performance degradation on random exploration.

Alternative: optimism in the face of uncertainty
Optimism offers solid theoretical benefits.
Heuristic implementation: optimistic initialization of the value function

[Figure: steps per episode vs. episodes; curves for Q-learning, optimistic Q-learning, with subgoals, and with options]

Thorough exploration eliminates the impact of options!
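The heuristic implementation named on the slide is simply to start every Q-value at an upper bound on achievable return, so that anything untried looks at least as good as anything already tried. A minimal sketch; the bound and the defaultdict representation are assumptions.

```python
from collections import defaultdict

def optimistic_q_table(r_max=1.0, gamma=0.95):
    """Initialize all Q-values to an upper bound on the discounted return."""
    v_max = r_max / (1.0 - gamma)   # no discounted return can exceed this if rewards lie in [0, r_max]
    return defaultdict(lambda: v_max)

Q = optimistic_q_table()
# Greedy action selection over this table drives the agent toward state-action pairs
# it has not yet tried, giving systematic rather than random exploration.
```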


Options that Abstract Instead of Augment

Remove primitive actions superseded by options.

[Figures: initiation set of one option; availability of primitive actions; steps per episode vs. episodes for primitive actions only, options only then both, and options only then primitives only]


Temporal Abstraction in Other Algorithms

Observation
Q-learning may not be the best baseline algorithm for studying hierarchy.

Q-learning uses each piece of experience exactly once.
It therefore confounds data acquisition (exploration) with computation (planning).

See also
In ICML 2008: Jong and Stone, “Hierarchical Model-Based Reinforcement Learning: R-MAX + MAXQ”


Summary

Options do not always help reinforcement learning; in some cases, they can severely hinder learning.
Hierarchical methods impact learning by biasing or constraining exploration.
