
An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning

Haitham Bou Ammar Eric Eaton

Gerhard Weiss Kurt Driessens Karl Tuyls

Decebal Mocanu Matthew Taylor

Introduction

Reinforcement learning (RL) is a key technique for learning through interaction with the environment

Problem Definition:

RL problems are formalized as Markov Decision Processes (MDPs):

$\langle S, A, P, R, \gamma \rangle$

•  $S$: State Space

•  $A$: Action Space

•  $P$: Transition Probability

•  $R$: Reward Function

•  $\gamma$: Discount Factor

Goal

Learn optimal policy by maximizing

$$Q(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \right]$$
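As a small illustration of the quantity inside the expectation, the following Python sketch computes the discounted return of a single sampled reward sequence. The function name `discounted_return` and the finite horizon are assumptions made for illustration only; Q(s, a) is the expectation of this quantity over trajectories that start in s with action a.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite reward sequence."""
    total = 0.0
    # Iterate backwards so each step is a single multiply-add (Horner's rule).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this value over many sampled trajectories would give a Monte Carlo estimate of the expectation.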

Motivation

Problem: Reinforcement learners are slow to learn

Possible Solution: Reuse knowledge from other sources

Impressive Results:

•  Learning from Demonstration

•  Transfer Learning

Transfer Learning

A new target task is given, along with a pool of source tasks from the same domain.

Questions to answer:

1. How to transfer? (lots of approaches)

2. What to transfer? (lots of approaches)

3. When to transfer? (less progress has been achieved; needs a task similarity measure)

RBDist: Similarity Measure Between MDPs



Our measure is based on Restricted Boltzmann Machines (RBMs):

•  Set of visible units

•  Set of hidden units

•  RBM energy function

•  Probability distribution

•  Weights are trained using contrastive divergence

[Figure: RBM with a visible layer and a hidden layer]
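The formulas on this slide did not survive extraction. For orientation only, the standard textbook form of a binary RBM is sketched below; it is an assumption that the slide used this form (continuous state variables are often modeled with Gaussian visible units instead). The symbols $v$, $h$, $W$, $a$, $b$, $Z$, and $\eta$ are introduced here for illustration and do not come from the slides.

```latex
% Assumed notation: v visible units, h hidden units, W weights,
% a and b visible/hidden biases, Z partition function, \eta learning rate.
\begin{align}
E(v, h) &= -a^\top v - b^\top h - v^\top W h \\
p(v, h) &= \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-E(v, h)} \\
\Delta W &= \eta \left( \langle v h^\top \rangle_{\text{data}}
            - \langle v h^\top \rangle_{\text{recon}} \right)
\end{align}
```

The last line is the usual contrastive divergence weight update, which contrasts statistics measured on the data with statistics measured on reconstructions.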

Step 1: Train an RBM to approximate the source task’s dynamics

Sampled trajectories capturing the source dynamics are separated into i.i.d. tuples $\langle s, a, s' \rangle$, which are used to train the RBM.

[Figure: RBM with a visible layer and a hidden layer]

The RBM learns a generative model that captures the source dynamics.

Key Idea: If the dynamics of a source and target domain are similar, the RBM trained on the source task should be able to reconstruct trajectories from the target task.
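The separation of trajectories into transition tuples can be sketched as follows. This is a minimal illustration, assuming each trajectory is stored as a list of (state, action) steps; the names `trajectories_to_tuples` and `to_visible_vector` and this storage format are hypothetical and not taken from the slides.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Transition = Tuple[State, Action, State]


def trajectories_to_tuples(
    trajectories: Sequence[Sequence[Tuple[State, Action]]],
) -> List[Transition]:
    """Flatten trajectories of (state, action) steps into (s, a, s') tuples."""
    tuples: List[Transition] = []
    for trajectory in trajectories:
        # Pair every step with its successor; the final step has no s'.
        for (s, a), (s_next, _) in zip(trajectory, trajectory[1:]):
            tuples.append((s, a, s_next))
    return tuples


def to_visible_vector(transition: Transition) -> List[float]:
    """Concatenate s, a and s' into one flat vector for the visible layer."""
    s, a, s_next = transition
    return [*s, *a, *s_next]
```

Treating the resulting tuples as independent samples discards the temporal order within a trajectory, which matches the i.i.d. assumption stated above.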

Step 2: Reconstruct target task trajectories by sampling the trained RBM

Trajectories from the target task are presented to the visible layer. Sampling the trained RBM then yields a reconstruction of the target trajectories based on the source task’s dynamics.

[Figure: RBM with a visible layer and a hidden layer]
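One way to picture the reconstruction step is a single up-down pass through the network. The sketch below assumes a binary RBM with inputs scaled to [0, 1]; the slides do not specify the unit types, so this is an illustration rather than the authors' implementation, and the function and parameter names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reconstruct(
    v: Sequence[float],
    weights: Sequence[Sequence[float]],  # weights[i][j]: visible i <-> hidden j
    vis_bias: Sequence[float],
    hid_bias: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """One up-down pass: sample hidden units from v, return visible means."""
    n_vis, n_hid = len(vis_bias), len(hid_bias)
    # Upward pass: sample each binary hidden unit given the visible vector.
    h = []
    for j in range(n_hid):
        p_h = sigmoid(hid_bias[j] + sum(v[i] * weights[i][j] for i in range(n_vis)))
        h.append(1.0 if rng.random() < p_h else 0.0)
    # Downward pass: mean activation of each visible unit given the sample.
    return [
        sigmoid(vis_bias[i] + sum(weights[i][j] * h[j] for j in range(n_hid)))
        for i in range(n_vis)
    ]
```

Here `weights`, `vis_bias`, and `hid_bias` would come from the RBM trained on the source task, and `v` would be a flattened target tuple.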

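Step 3: Measure reconstruction error of sampled target trajectories as RBDist

$$\mathrm{RBDist} = \frac{1}{n} \sum_{k=1}^{n} e_k, \qquad e_k = L_2\left( \left\langle s_2^{(k)}, a_2^{(k)}, s_2^{\prime(k)} \right\rangle_0, \left\langle s_2^{(k)}, a_2^{(k)}, s_2^{\prime(k)} \right\rangle_1 \right)$$

The tuple with subscript 0 is the original tuple; the tuple with subscript 1 is the reconstructed tuple.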
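The averaging in the formula above can be sketched in a few lines of Python. This is a minimal sketch, assuming each tuple has been flattened into a vector and that a trained model is available as a callable `reconstruct`; both names, and the choice of the unsquared Euclidean norm for $L_2$, are assumptions, since the slides do not state whether the distance is squared.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def l2_distance(x: Vector, y: Vector) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def rbdist(
    target_tuples: Sequence[Vector],
    reconstruct: Callable[[Vector], Vector],
) -> float:
    """Mean L2 error between target tuples and their reconstructions."""
    if not target_tuples:
        raise ValueError("need at least one target tuple")
    errors: List[float] = [l2_distance(t, reconstruct(t)) for t in target_tuples]
    return sum(errors) / len(errors)
```

A model that reproduces every target tuple exactly gives a distance of 0; larger values indicate that the source-trained model explains the target dynamics less well.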

Experiments & Results


Dynamical Systems & Benchmarks

•  Inverted Pendulum: Swing and balance pole upright by applying torques

•  Cart Pole: Balance pole upright by applying linear forces

•  Mountain Car: Control car to reach goal by oscillating around the valley


Results: Dynamical Phases

RBDist can automatically cluster dynamical phases

[Figure panels: Inverted Pendulum, Mountain Car, Cart Pole]

Results: Transfer Performance

RBDist correlates with transfer performance

[Figure: Jump Start; panels: Inverted Pendulum, Mountain Car, Cart Pole; x-axis (Cart Pole panel): Different Cartpoles; y-axis: Reward]


Thank you!

This work was supported in part by ONR N00014-11-1-0139,

AFOSR FA8750-14-1-0069, and NSF IIS-1149917.

Please send correspondence to:

Haitham Bou Ammar Eric Eaton [email protected] [email protected]
