SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning




Page 1: SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning (webhome.auburn.edu/~dzl0053/pubs/aaai19slides.pdf)

Daoming Lyu^1, Fangkai Yang^2, Bo Liu^1, Steven Gustafson^3

^1 Auburn University, Auburn, AL, USA
^2 NVIDIA Corporation, Redmond, WA, USA
^3 Maana Inc., Bellevue, WA, USA

Page 2: Sequential Decision-Making

Sequential decision-making (SDM) concerns an agent making a sequence of actions based on its behavior in the environment.

Deep reinforcement learning (DRL) achieves tremendous success on sequential decision-making problems using deep neural networks (Mnih et al., 2015).

Page 3: Challenge: Montezuma's Revenge

The avatar: climbs down the ladder, jumps over a rotating skull, picks up the key (+100), goes back and uses the key to open the right door (+300).

Vanilla DQN achieves a score of 0 (Mnih et al., 2015).

Page 4: Challenge: Montezuma's Revenge

Problem: long-horizon sequential actions with sparse and delayed rewards.

This leads to poor data efficiency and a lack of interpretability.

Page 5: Our Solution

Solution: task decomposition

Symbolic planning: subtask scheduling (high-level plan).

DRL: subtask learning (low-level control).

Meta-learner: subtask evaluation.

Goal

Symbolic planning drives learning, improving task-level interpretability.

DRL learns feasible subtasks, improving data-efficiency.

Page 6: Background: Action Language

Action language (Gelfond & Lifschitz, 1998): a formal, declarative, logic-based language that describes dynamic domains.

Dynamic domains can be represented as a transition system.

Page 7: Action Language BC

Action Language BC (Lee et al., 2013) describes a transition system using a set of causal laws.

Dynamic laws describe transitions between states:

move(x, y1, y2) causes on(x, y2) if on(x, y1).

Static laws describe the values of fluents inside a state:

intower(x, y2) if intower(x, y1), on(y1, y2).
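To make the reading of these two laws concrete, here is a minimal Python sketch on a toy blocks world. This is my own illustration, not from the slides or the paper; the tuple encoding of fluents and the seeding of intower from on are assumptions made only for this example.

```python
# Hypothetical encoding of the two causal laws above on a tiny blocks world.
# A symbolic state is a set of ground fluents, e.g. {("on", "a", "b")}.

def apply_move(state, x, y1, y2):
    """Dynamic law: move(x, y1, y2) causes on(x, y2) if on(x, y1)."""
    if ("on", x, y1) not in state:
        return state                              # precondition fails: no transition
    return (state - {("on", x, y1)}) | {("on", x, y2)}

def derive_intower(state):
    """Static law: intower(x, y2) if intower(x, y1), on(y1, y2).
    intower is seeded from on (an assumption for this toy) and closed under the law."""
    derived = {("intower", x, y) for (p, x, y) in state if p == "on"}
    changed = True
    while changed:
        changed = False
        for (_, x, y1) in list(derived):
            for (p, a, y2) in state:
                if p == "on" and a == y1 and ("intower", x, y2) not in derived:
                    derived.add(("intower", x, y2))
                    changed = True
    return state | derived

s0 = {("on", "a", "b"), ("on", "b", "table")}
s1 = apply_move(s0, "a", "b", "table")            # {on(a, table), on(b, table)}
print(derive_intower(s0))                         # includes intower(a, table)
```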

Page 8: Background: Reinforcement Learning

Reinforcement learning is defined on a Markov Decision Process (S, A, P^a_{ss′}, r, γ). To achieve optimal behavior, a policy π : S × A ↦ [0, 1] is learned.

An option is defined on the tuple (I, π, β), which enables the decision-making to have a hierarchical structure:

the initiation set I ⊆ S,
the policy π : S × A ↦ [0, 1],
the probabilistic termination condition β : S ↦ [0, 1].
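As a quick illustration of the option tuple (I, π, β), a minimal Python sketch follows; the class name, field names, and the deterministic simplification of π are my assumptions, not definitions from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """Sketch of an option (I, pi, beta); names are illustrative only."""
    initiation_set: Set[State]              # I ⊆ S: states where the option may be invoked
    policy: Callable[[State], Action]       # pi, simplified here to a deterministic map S -> A
    termination: Callable[[State], float]   # beta: S -> [0, 1], probability of terminating

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

# Toy usage.
opt = Option({"at_ladder"}, lambda s: "climb_down",
             lambda s: 1.0 if s == "at_skull" else 0.0)
print(opt.can_start("at_ladder"))   # True
```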

Page 9: SDRL: Symbolic Deep Reinforcement Learning

Symbolic Planner: orchestrates the sequence of subtasks using a high-level symbolic plan.

Controller: uses DRL approaches to learn the subpolicy for each subtask with intrinsic rewards.

Meta-Controller: measures the learning performance of subtasks and updates the intrinsic goal to enable reward-driven plan improvement.
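This division of labor can be glued together roughly as below. This is a simplified sketch under my own assumptions: the function and variable names are placeholders, and the real planner, controller, and meta-controller logic is omitted.

```python
import random

def run_episode(plan, controllers, evaluate_subtask):
    """plan: ordered list of subtasks proposed by the symbolic planner.
    controllers: subtask -> callable that executes/trains its DRL subpolicy and
    returns (success, environment_reward).
    evaluate_subtask: meta-controller evaluation, returning an extrinsic reward."""
    extrinsic = {}
    for subtask in plan:
        success, env_reward = controllers[subtask]()
        extrinsic[subtask] = evaluate_subtask(subtask, success, env_reward)
        if not success:
            break                 # failed subtask: stop and let the planner re-plan
    return extrinsic              # fed back to the planner to drive plan improvement

# Toy usage with stand-in controllers.
plan = ["get_key", "open_door"]
controllers = {g: (lambda: (random.random() < 0.5, 100.0)) for g in plan}
print(run_episode(plan, controllers, lambda g, ok, r: r if ok else -10.0))
```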

Page 10: Symbolic Planner

Page 11: Symbolic Planner: Planning with Intrinsic Goal

Intrinsic goal: a linear constraint on plan quality, quality ≥ quality(Π_t), where Π_t is the plan at episode t.

Plan quality: a utility function

quality(Π_t) = Σ_{⟨s_{i−1}, g_{i−1}, s_i⟩ ∈ Π_t} ρ_{g_{i−1}}(s_{i−1}),

where ρ_{g_i} is the gain reward for subtask g_i.

Symbolic planner: generates a new plan that either explores new subtasks or exploits more rewarding subtasks.
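A small sketch of the utility above; the data layout and the default of 0 for unexplored subtasks are my assumptions, not part of the slides.

```python
def plan_quality(plan, gain_reward):
    """plan: list of symbolic transitions (s, g, s_next).
    gain_reward: dict mapping (s, g) -> learned gain reward rho_g(s)."""
    return sum(gain_reward.get((s, g), 0.0) for (s, g, _s_next) in plan)

# The intrinsic goal then asks the planner for a new plan whose quality is at
# least plan_quality(current_plan), pushing it to explore untried subtasks or
# exploit more rewarding ones.
old_plan = [("at_start", "get_key", "has_key"), ("has_key", "open_door", "door_open")]
rho = {("at_start", "get_key"): 100.0, ("has_key", "open_door"): 300.0}
print(plan_quality(old_plan, rho))   # 400.0
```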

Page 12: From Symbolic Transition to Subtask

Assumption: given the set 𝒮 of symbolic states and the set S of sensory inputs, we assume there is an oracle for symbol grounding, F : 𝒮 × S ↦ {t, f}.

Given F and a pair of symbolic states s̃, s̃′ ∈ 𝒮:

the initiation set I = {s ∈ S : F(s̃, s) = t},
π : S ↦ A is the subpolicy for the corresponding subtask,
β is the termination condition such that β(s′) = 1 if F(s̃′, s′) = t for s′ ∈ S, and β(s′) = 0 otherwise.
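A minimal sketch of how this construction could look in code; this is my assumption, not the authors' implementation, and the toy oracle below grounds a symbolic state by comparing a single hypothetical feature.

```python
def make_subtask(F, s_sym, s_sym_next, subpolicy):
    """F(symbolic_state, sensory_state) -> bool is the grounding oracle.
    Returns (initiation_test, subpolicy, termination) for the induced subtask."""
    def initiation(s):                     # I = {s in S : F(s_sym, s) = t}
        return F(s_sym, s)
    def termination(s_next):               # beta(s') = 1 iff F(s_sym_next, s') = t
        return 1.0 if F(s_sym_next, s_next) else 0.0
    return initiation, subpolicy, termination

# Toy oracle and usage.
F = lambda sym, s: s.get("location") == sym
init, pi, beta = make_subtask(F, "at_key", "at_door", subpolicy=lambda s: "noop")
print(init({"location": "at_key"}), beta({"location": "at_door"}))   # True 1.0
```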

Page 13: Controller

Page 14: Controllers: DRL with Intrinsic Reward

Intrinsic reward: a pseudo-reward crafted by a human designer.

Given a subtask defined on (I, π, β), the intrinsic reward is

r_i(s′) = φ if β(s′) = 1, and r_i(s′) = r otherwise,

where φ is a positive constant encouraging the achievement of subtasks and r is the reward from the environment at state s′.
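Read as code, the intrinsic reward is just a branch on the termination condition; in this sketch the value of φ is an arbitrary choice of mine, since the slide does not specify it.

```python
PHI = 1.0   # positive bonus for completing the subtask (illustrative value)

def intrinsic_reward(beta, s_next, env_reward):
    """r_i(s') = PHI if the subtask terminates in s' (beta(s') = 1),
    otherwise the environment reward r observed at s'."""
    return PHI if beta(s_next) == 1.0 else env_reward
```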

Page 15: Meta-Controller

Page 16: Meta-Controller: Evaluation with Extrinsic Reward

Extrinsic rewards: r_e(s, g) = f(ε), where ε can measure the competence of the learned subpolicy for each subtask.

For example, letting ε be the success ratio, f can be defined as

f(ε) = −ψ if ε < threshold, and f(ε) = r(s, g) if ε ≥ threshold,

where ψ is a positive constant to punish selecting unlearnable subtasks, and r(s, g) is the cumulative environmental reward obtained by following the subtask g.
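And the extrinsic evaluation, again as a sketch; ψ and the success-ratio threshold are illustrative constants of my own, not values taken from the slides.

```python
PSI = 1.0          # penalty for selecting an unlearnable subtask (illustrative)
THRESHOLD = 0.9    # success ratio required before a subtask counts as learned (illustrative)

def extrinsic_reward(success_ratio, cumulative_env_reward):
    """f(eps) = -PSI if eps < THRESHOLD, else the cumulative environmental
    reward r(s, g) collected while following the subtask g."""
    return -PSI if success_ratio < THRESHOLD else cumulative_env_reward
```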

Page 17: Experimental Results I.

Page 18: Experimental Results II.

Baseline: Kulkarni et al., "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation", NeurIPS 2016.

Page 19: Conclusion

We present an SDRL framework that features:

high-level symbolic planning based on an intrinsic goal,
low-level policy control with DRL,
subtask learning evaluation by a meta-learner.

This is the first work integrating symbolic planning with DRL that achieves both task-level interpretability and data-efficiency for decision-making.

Future work.
