
Post on 24-Feb-2020


Page 1: An Automated Measure of MDP Similarity for Transfer in ...eeaton/papers/slides-BouAmmar2014Automated.pdfAn Automated Measure of MDP Similarity for Transfer in Reinforcement Learning

An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning

Haitham Bou Ammar Eric Eaton

Gerhard Weiss Kurt Driessens Karl Tuyls

Decebal Mocanu Matthew Taylor


Introduction

Reinforcement learning (RL) is a key technique for learning through interaction with the environment.

Problem Definition:

RL problems are formalized as Markov Decision Processes (MDPs) $\langle S, A, P, R, \gamma \rangle$:

$S$: State Space
$A$: Action Space
$P$: Transition Probability
$R$: Reward Function
$\gamma$: Discount Factor

Goal

Learn optimal policy by maximizing

$$Q(s, a) = E\left[ \sum_{t=0}^{\infty} \gamma^t R_t \right]$$

2  Bou Ammar, Eaton, et al.
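To make the objective concrete, the following minimal sketch computes the discounted sum inside the expectation for one finite reward sequence, and averages it over several sequences as a simple Monte Carlo estimate of $Q(s, a)$. The function names `discounted_return` and `estimate_q`, and the truncation of the infinite sum to a finite horizon, are assumptions made for illustration; they do not come from the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over a finite reward sequence (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def estimate_q(reward_sequences, gamma):
    """Average discounted return over sampled reward sequences.

    Each sequence is assumed to start from the same state-action pair,
    so the average approximates the expectation in Q(s, a).
    """
    if not reward_sequences:
        raise ValueError("need at least one reward sequence")
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Hypothetical rewards, chosen only to show the arithmetic.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller discount factor weights near-term rewards more heavily, which is why the third reward contributes only 0.25 in the example.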


Motivation

Problem: Reinforcement learners are slow to learn

Possible Solution: Reuse knowledge from other sources (impressive results)

•  Learning from Demonstration

•  Transfer Learning


Transfer Learning

[Figure: pool of source tasks from same domain; new target task]

Questions to answer:

1.  How to transfer? (lots of approaches)

2.  What to transfer? (lots of approaches)

3.  When to transfer? (less progress has been achieved; needs a task similarity measure)


RBDist: Similarity Measure Between MDPs


RBDist: Similarity Measure Between MDPs

Our measure is based on Restricted Boltzmann Machines (RBMs):

•  Set of visible units

•  Set of hidden units

[Figure: RBM with a visible layer and a hidden layer]

RBM Energy Function

Probability distribution

Weights are trained using contrastive divergence
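The formulas on this slide did not survive extraction. For reference, the standard energy function and joint distribution of an RBM with binary units are given below. The symbols $v$, $h$, $W$, $a$, $b$, and $Z$ follow common textbook notation and are an assumption here; the slide may use a different variant (for example, real-valued visible units).

```latex
E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i W_{ij} h_j

p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad Z = \sum_{v, h} e^{-E(v, h)}
```

Lower energy corresponds to higher probability, so training lowers the energy of configurations that resemble the data.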


RBDist: Similarity Measure Between MDPs

Step 1: Train an RBM to approximate the source task's dynamics

Sampled trajectories capturing source dynamics

Separate into i.i.d. $\langle s, a, s' \rangle$ tuples and train RBM

[Figure: RBM with a visible layer and a hidden layer]

The RBM learns a generative model that captures the source dynamics.

Key Idea: If the dynamics of a source and target domain are similar, the RBM trained on the source task should be able to reconstruct trajectories from the target task.
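The splitting step can be sketched as follows. This is a minimal illustration that assumes a trajectory is stored as a time-ordered list of (state, action) pairs with list-valued states and actions; the names `to_transition_tuples` and `build_training_set` and this data layout are hypothetical and not taken from the slides.

```python
import random


def to_transition_tuples(trajectory):
    """Turn one trajectory into flat [s..., a..., s'...] vectors.

    trajectory: time-ordered list of (state, action) pairs, where state and
    action are lists of floats. The last pair only supplies the final s'.
    """
    tuples = []
    for (s, a), (s_next, _) in zip(trajectory, trajectory[1:]):
        tuples.append(list(s) + list(a) + list(s_next))
    return tuples


def build_training_set(trajectories, seed=None):
    """Pool tuples from all trajectories; optionally shuffle them so the
    model sees them in an order unrelated to time (treated as i.i.d.)."""
    pooled = []
    for trajectory in trajectories:
        pooled.extend(to_transition_tuples(trajectory))
    if seed is not None:
        random.Random(seed).shuffle(pooled)
    return pooled
```

Each resulting vector would serve as one visible-layer input, so the number of visible units equals twice the state dimension plus the action dimension under this layout.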


RBDist: Similarity Measure Between MDPs

Step 2: Reconstruct target task trajectories by sampling the trained RBM

Trajectories from target task

[Figure: sampling the RBM's hidden layer and visible layer]

Reconstruction of target trajectories based on source task's dynamics
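Training and reconstruction can be sketched with a very small RBM. The code below is an illustrative sketch only: it assumes binary hidden units, visible values scaled to [0, 1], single-step contrastive divergence (CD-1), and arbitrary hyperparameters. The class `TinyRBM` and all of its method names are hypothetical, and the slides do not state which RBM variant or settings the authors used.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Binary-hidden RBM over visible vectors with entries in [0, 1]."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j]
                                        for i in range(len(self.a))))
                for j in range(len(self.b))]

    def visible_probs(self, h):
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j]
                                        for j in range(len(self.b))))
                for i in range(len(self.a))]

    def sample_hidden(self, v):
        return [1.0 if self.rng.random() < p else 0.0
                for p in self.hidden_probs(v)]

    def cd1_update(self, v0, lr=0.1):
        """One contrastive divergence (CD-1) step on a single example."""
        ph0 = self.hidden_probs(v0)
        h0 = [1.0 if self.rng.random() < p else 0.0 for p in ph0]
        v1 = self.visible_probs(h0)  # mean-field reconstruction
        ph1 = self.hidden_probs(v1)
        for i in range(len(self.a)):
            for j in range(len(self.b)):
                # Data-driven statistics minus reconstruction statistics.
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(len(self.b)):
            self.b[j] += lr * (ph0[j] - ph1[j])

    def train(self, data, epochs=10, lr=0.1):
        for _ in range(epochs):
            for v in data:
                self.cd1_update(v, lr)

    def reconstruct(self, v):
        """Sample hidden units from v, then map back to the visible layer."""
        return self.visible_probs(self.sample_hidden(v))
```

Under these assumptions, `train` would be called on tuples from the source task and `reconstruct` on tuples from the target task, so the reconstruction reflects what the source-trained model expects the transition to look like.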

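Step 3: Measure reconstruction error of sampled target trajectories as RBDist

$$\mathrm{RBDist} = \frac{1}{n} \sum_{k=1}^{n} e_k, \qquad e_k = L_2\left( \left\langle s_2^{(k)}, a_2^{(k)}, s_2^{\prime(k)} \right\rangle_0, \left\langle s_2^{(k)}, a_2^{(k)}, s_2^{\prime(k)} \right\rangle_1 \right)$$

where subscript 0 marks the original tuple and subscript 1 marks the reconstructed tuple.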
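The averaging in Step 3 can be sketched as below. This is a minimal illustration that assumes each target tuple is a flat list of floats and that `reconstruct` is any callable returning a same-length list, standing in for the RBM trained on the source task. The function names `l2_distance` and `rbdist` are hypothetical.

```python
import math


def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def rbdist(target_tuples, reconstruct):
    """Mean L2 error between each original target tuple and its reconstruction.

    target_tuples: list of flat <s, a, s'> vectors from the target task.
    reconstruct:   callable mapping a tuple to its reconstruction
                   (assumed to wrap a model trained on the source task).
    """
    if not target_tuples:
        raise ValueError("need at least one target tuple")
    errors = [l2_distance(v, reconstruct(v)) for v in target_tuples]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # A perfect reconstruction yields a distance of zero.
    print(rbdist([[0.0, 0.0], [1.0, 1.0]], lambda v: list(v)))  # 0.0
```

A smaller value means the source-trained model reconstructs target transitions more faithfully, which the slides take as a sign that the two tasks have similar dynamics.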


Experiments & Results


Dynamical Systems & Benchmarks

Inverted Pendulum: Swing and balance pole upright by applying torques

Cart Pole: Balance pole upright by applying linear forces

Mountain Car: Control car to reach goal by oscillating around the valley


Results: Dynamical Phases

[Figure panels: Inverted Pendulum, Mountain Car, Cart Pole]

RBDist can automatically cluster dynamical phases


Results: Transfer Performance

[Figure: "Jump Start Inverted Pendulum"; x-axis: Different Cartpoles; y-axis: Reward. Panels: Inverted Pendulum, Mountain Car, Cart Pole]

RBDist correlates with transfer performance


Thank you!

This work was supported in part by ONR N00014-11-1-0139, AFOSR FA8750-14-1-0069, and NSF IIS-1149917.

Please send correspondence to: Haitham Bou Ammar ([email protected]), Eric Eaton ([email protected])