Guiding Policies with Language via Meta-Learning John D (JD) Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, John DeNero, Pieter Abbeel, Sergey Levine

TRANSCRIPT

Page 1:

Guiding Policies with Language via Meta-Learning

John D (JD) Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri,

John DeNero, Pieter Abbeel, Sergey Levine

Page 2:

Ideal Robot

Page 3:

Learning new tasks quickly

● Want diverse range of skills

● Cost of supervision can be high

● Want to learn new things with as little supervision as possible

Page 4:

Meta-RL

Prior Experience (Meta-Training) → Fast Learning of New Tasks (Meta-Testing)

Leverage prior experience to quickly learn new tasks
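
Below is a minimal sketch of this meta-RL framing in Python. The Task and MetaPolicy stubs, method names, and loop structure are illustrative assumptions, not the specific algorithm used in this work.

```python
import random


class Task:
    """Stand-in for an RL task; rollout() runs the policy and returns experience."""
    def rollout(self, policy):
        return []  # placeholder trajectory


class MetaPolicy:
    """Stand-in for a policy with shared, meta-learned parameters."""
    def update(self, experience):
        pass  # slow, across-task learning during meta-training

    def adapt(self, experience):
        pass  # fast, within-task adaptation during meta-testing


def meta_train(train_tasks, policy, iterations=1000):
    # Meta-training: accumulate prior experience over many tasks.
    for _ in range(iterations):
        task = random.choice(train_tasks)
        policy.update(task.rollout(policy))
    return policy


def meta_test(new_task, policy, adaptation_steps=5):
    # Meta-testing: adapt quickly to an unseen task from a few interactions.
    for _ in range(adaptation_steps):
        policy.adapt(new_task.rollout(policy))
    return policy
```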

Page 5:

Problem with reward design

● Hard to design

● Hard to provide

● Hard to learn from

Page 6:

More natural way to provide supervision

Human feedback

Page 7:

Replace reward with human feedback

Human-in-the-loop supervision: human feedback → RL algorithm

Prior work: Deep TAMER (Warnell et al.), Preferences (Christiano et al.)

Page 8:

Why are current methods insufficient?

Scalar feedback: very few bits of information per intervention (~1 bit) → significant human effort

Language feedback: more bits of information per intervention (~230 bits) → less human effort
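
For intuition on where figures like these come from (a rough estimate, not stated on the slide): a binary good/bad signal carries at most 1 bit, while a short English correction of roughly 50 characters, at about log2(26) ≈ 4.7 bits per character, carries on the order of 50 × 4.7 ≈ 230 bits.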

Page 9:

Language Corrections

Page 10:

Problem Setting

The agent is provided with an ambiguous/incomplete instruction.

Goal: quickly incorporate language corrections in the loop.

Page 11:

Language Guided Policy Model

The model improves based on previous trajectories and corrections.

Three modules: correction, policy, and instruction modules.
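
A schematic of such a three-module model is sketched below (an illustrative PyTorch-style sketch; the layer sizes, encoder choices, and the way embeddings are combined are assumptions, and conditioning on previous trajectories is omitted here).

```python
import torch
import torch.nn as nn


class LanguageGuidedPolicy(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, state_dim=10, n_actions=5):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Instruction module: encodes the (possibly ambiguous) instruction.
        self.instruction_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Correction module: encodes the latest language correction.
        self.correction_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Policy module: maps the state plus both language embeddings to action logits.
        self.policy = nn.Sequential(
            nn.Linear(state_dim + 2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state, instruction_tokens, correction_tokens):
        # state: (batch, state_dim); *_tokens: (batch, seq_len) integer word ids
        _, (inst_h, _) = self.instruction_enc(self.word_embed(instruction_tokens))
        _, (corr_h, _) = self.correction_enc(self.word_embed(correction_tokens))
        features = torch.cat([state, inst_h[-1], corr_h[-1]], dim=-1)
        return self.policy(features)  # logits over discrete actions
```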

Page 12:

Algorithm Overview

Meta-Training → Meta-Testing
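
A hedged pseudocode sketch of this overview follows. The rollout, update, correction-generation, and human-query calls are illustrative stand-ins; the paper's exact supervision and update rule are not reproduced here.

```python
def meta_training(train_tasks, model, generate_correction, rounds=3):
    """Meta-training (illustrative): corrections come from an automated source
    (`generate_correction`), so no human is needed in the inner loop."""
    for task in train_tasks:
        corrections = []
        for _ in range(rounds):
            trajectory = model.rollout(task, task.instruction, corrections)
            corrections.append(generate_correction(task, trajectory))
            # Update the model to act better given the instruction, the
            # trajectories so far, and all corrections received.
            model.update(task, trajectory, corrections)


def meta_testing(new_task, model, ask_human, rounds=4):
    """Meta-testing (illustrative): a human supplies each correction in the loop."""
    corrections = []
    for _ in range(rounds):
        trajectory = model.rollout(new_task, new_task.instruction, corrections)
        if new_task.solved(trajectory):
            break
        corrections.append(ask_human(trajectory))
    return corrections
```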

Page 13:

Meta-Training

Page 14:

Meta-Training

Page 15:

Experimental Setup

Multi-room domain

Instruction: Move green triangle to yellow goal.

Instruction: Move red square to yellow goal.

Page 16:

Experimental Setup

Block pushing domain

Instruction: Move red block above magenta block.

Instruction: Move cyan block left of blue block.

Page 17:

Quick Learning of New Tasks

Instruction: Move blue triangle to green goal.

Correction 1: Enter the blue room.

Correction 2: Enter the red room.

Correction 3: Exit the blue room.

Correction 4: Pick up the blue triangle.

Solved

Page 18:

Quick Learning of New Tasks

Instruction: Move cyan block below magenta block.

Correction 1: Touch cyan block.

Correction 2: Move closer to magenta block.

Correction 3: Move a lot up.

Correction 4: Move a little up.

Solved

Page 19:

Quantitative Evaluation

Success Rates on New Tasks

Much quicker learning than using rewards

RL – PPO with the reward used to train the expert

GPL – ours (Guiding Policies with Language)

Page 20:

Summary

• Avoid demos/reward functions by using human-in-the-loop supervision

• Language provides more information per intervention

• Ground language in a multi-task setup; learn new tasks quickly with corrections

Page 21:

Thank you

Abhishek Gupta Suvansh Sanjeev Nick Altieri

John DeNero Pieter Abbeel Sergey Levine