Page 1:

Towards Online Planning for Dialogue Management with Rich Domain Knowledge

Pierre Lison
Language Technology Group, University of Oslo

IWSDS, November 29, 2012

Page 3:

Introduction

• Dialogue management (DM) as a decision-theoretic planning problem

• Most approaches perform this planning offline (precomputed policies)

Offline planning

+ Action selection is straightforward and quite efficient (direct lookup)

− Must consider all possible situations
− Policy must be recomputed after every model change

Page 5:

Introduction

Alternative: perform planning online, at execution time

Offline planning

+ Action selection is straightforward (and efficient)

− Must consider all possible situations
− Policy must be recomputed after every model change

Online planning

+ Must only plan for the current situation
+ Easier to adapt to runtime changes

− Must meet real-time requirements

Page 6:

Approach

• To address these real-time constraints, we must ensure that the planner concentrates on relevant actions and states

• Intuition: we can exploit prior domain knowledge to filter the space of possible actions and transitions

• We report here on our ongoing work on the use of probabilistic rules to encode the probability and reward models of our domain

Page 7:

Approach

• We represent the dialogue state as a Bayesian network (a toy sketch is given below)

• Probabilistic rules are applied to this dialogue state to update or extend it

• Advantages:

• (exponentially) fewer parameters to estimate

• Can incorporate prior domain knowledge

[Pierre Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012]
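The slides do not show how the dialogue state is implemented; as a rough illustration only (hypothetical variable names and API, not the actual system), the Python sketch below represents a dialogue state as a container of categorical variables, here with a single uncertain variable a_u for the user's dialogue act. A real Bayesian network would additionally encode dependencies between variables, which this sketch leaves out.

from dataclasses import dataclass, field

@dataclass
class DialogueState:
    # Hypothetical representation: each chance variable maps values to probabilities.
    variables: dict = field(default_factory=dict)

    def set_distribution(self, name, distribution):
        """Add or replace a variable with a (normalised) categorical distribution."""
        total = sum(distribution.values())
        self.variables[name] = {value: prob / total for value, prob in distribution.items()}

    def probability(self, name, value):
        """Probability of a given value for a given variable (0.0 if absent)."""
        return self.variables.get(name, {}).get(value, 0.0)

# Example: uncertain user dialogue act a_u after noisy speech recognition.
state = DialogueState()
state.set_distribution("a_u", {"Want(Mug)": 0.7, "Want(Box)": 0.2, "None": 0.1})
print(state.probability("a_u", "Want(Mug)"))   # 0.7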

Page 9:

Probabilistic rules

• The rules take the form of structured if...then...else cases, i.e. mappings from conditions to (probabilistic) effects:

if (condition1 holds) then P(effect1) = θ1, P(effect2) = θ2
else if (condition2 holds) then P(effect3) = θ3

• For action-selection rules, the effect associates rewards with particular actions:

if (condition1 holds) then R(actions) = θ1
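As an illustration of this rule format (a sketch only, assuming a hypothetical Python encoding rather than the actual rule language; the θ values below are arbitrary), a probability rule can be viewed as an ordered list of condition–effect cases, where the first case whose condition holds determines the distribution over effects:

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Case:
    condition: Callable[[dict], bool]    # predicate over the current dialogue state
    effects: Dict[str, float]            # effect label -> probability θ

@dataclass
class ProbabilityRule:
    cases: List[Case]

    def apply(self, state: dict) -> Dict[str, float]:
        """Return the effect distribution of the first case whose condition holds."""
        for case in self.cases:
            if case.condition(state):
                return case.effects
        return {"void": 1.0}             # no condition holds: the rule has no effect

# if (condition1 holds) then P(effect1)=θ1, P(effect2)=θ2
# else if (condition2 holds) then P(effect3)=θ3
rule = ProbabilityRule(cases=[
    Case(lambda s: bool(s.get("condition1")), {"effect1": 0.7, "effect2": 0.3}),
    Case(lambda s: bool(s.get("condition2")), {"effect3": 1.0}),
])
print(rule.apply({"condition1": True}))  # {'effect1': 0.7, 'effect2': 0.3}

A reward rule has the same shape, except that each case maps to a utility value for particular actions rather than a distribution over effects.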

Page 12:

Rule examples

• Example r1 (probability rule):

if (am = AskRepeat) then P(au' = au) = 0.9, P(au' ≠ au) = 0.1

• Example r2 (reward rule):

if (au ≠ None) then R(am = AskRepeat) = -0.5

[Figure: graphical models of the two rules: r1 is a node with inputs au and am and output au'; r2 is a node with input au, attached to the action node am]
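For concreteness, these two rules could be written as plain functions in the same illustrative style (a hypothetical encoding, not the actual rule language; the 'other' label simply stands for any user act different from au):

def r1_transition(a_m: str, a_u: str) -> dict:
    """r1: if the system asks for a repeat, the user most likely repeats a_u."""
    if a_m == "AskRepeat":
        return {a_u: 0.9, "other": 0.1}   # P(a_u' = a_u) = 0.9, P(a_u' != a_u) = 0.1
    return {}                             # rule not applicable: no constraint on a_u'

def r2_reward(a_u: str, a_m: str) -> float:
    """r2: asking for a repeat carries a small cost whenever the user said something."""
    if a_u != "None" and a_m == "AskRepeat":
        return -0.5
    return 0.0

print(r1_transition("AskRepeat", "Want(Mug)"))   # {'Want(Mug)': 0.9, 'other': 0.1}
print(r2_reward("Want(Mug)", "AskRepeat"))       # -0.5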

Page 18:

Approach

• The transition, observation and reward models of the domain are all encoded in terms of these probability & reward rules

• We devised a simple forward planning algorithm that:

• samples a collection of trajectories (sequences of actions) starting from the current state, until a specific horizon is reached

• and records the return obtained for each trajectory (see the discounted return below)

• The algorithm then searches for the action sequence with the highest average return, and executes the first action in this sequence
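For reference, the return recorded for a trajectory sampled up to horizon H is the discounted sum of its rewards, accumulated incrementally in the planning-algorithm pseudocode below as Q = Q + γ^t R:

Q = \sum_{t=0}^{H-1} \gamma^{t} R_t

where R_t is the reward received at step t and γ is the discount factor.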

Page 36:

Planning algorithm

Loop (for each trajectory, until enough trajectories are collected):
    Q = 0
    Loop (until horizon is reached):
        Sample system action am
        Q = Q + γ^t · R
        Predict next state s
        Sample next user action au
        Do belief update
    Record Q for trajectory
For each trajectory, calculate the average Q
Return the trajectory with max. Q

[Figure: one sampled trajectory unrolled as s0 → am0 (reward R0) → s1 → au1 → am1 (reward R1) → ..., until the horizon is reached]
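The loop above can be made concrete with a minimal, self-contained Python sketch (a toy model with hypothetical action names and transition, user and reward functions; the actual planner operates over Bayesian networks and probabilistic rules, and performs a proper belief update rather than the simple bookkeeping used here):

import random
from collections import defaultdict

ACTIONS = ["PickUp(Mug)", "PickUp(Box)", "AskRepeat"]    # hypothetical action set

def predict_next_state(state, a_m):
    """Toy prediction: the user's intention persists across turns."""
    return {"intention": state["intention"], "last_user_act": state["last_user_act"]}

def sample_user_act(state):
    """Toy user model: the user mostly restates their intention."""
    return state["intention"] if random.random() < 0.9 else "None"

def belief_update(state, a_u):
    """Toy belief update: just record the simulated user act."""
    return {"intention": state["intention"], "last_user_act": a_u}

def reward(state, a_m):
    """Toy reward: fulfilling the intention pays off, asking again has a small cost."""
    if a_m == "PickUp(" + state["intention"] + ")":
        return 5.0
    if a_m == "AskRepeat":
        return -0.5
    return -1.0

def plan(state, horizon=3, n_trajectories=500, gamma=0.9):
    returns = defaultdict(list)                  # action sequence -> sampled returns
    for _ in range(n_trajectories):
        s, q, seq = dict(state), 0.0, []
        for t in range(horizon):
            a_m = random.choice(ACTIONS)         # sample system action am
            q += (gamma ** t) * reward(s, a_m)   # Q = Q + γ^t · R
            s = predict_next_state(s, a_m)       # predict next state s
            a_u = sample_user_act(s)             # sample next user action au
            s = belief_update(s, a_u)            # do belief update
            seq.append(a_m)
        returns[tuple(seq)].append(q)            # record Q for this trajectory
    # for each action sequence, calculate the average Q and keep the best one
    best = max(returns, key=lambda seq: sum(returns[seq]) / len(returns[seq]))
    return best[0]                               # execute the first action of that sequence

print(plan({"intention": "Mug", "last_user_act": "None"}))

Grouping the sampled returns by action sequence and averaging them corresponds to the "calculate average Q" step; real-time performance then hinges on how tightly the action sampling can be restricted, which is where the rules come in.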

Page 37:

Discussion

• What is original in our approach?

• The rules provide high-level constraints on the relevant actions (and on subsequent future states)

• For instance, if the user intention is Want(Mug), the rules will provide utilities for the actions PickUp(Mug) or AskRepeat, but will consider PickUp(Box) as irrelevant (see the sketch below)

• In other words, they enable the algorithm to quickly focus the search on high-utility regions

• ... and to discard irrelevant actions and predictions
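As a rough illustration of this filtering effect (hypothetical rule conditions and utility values, not the actual rule set), only the actions to which some reward rule assigns a utility are kept as candidates for the planner:

def candidate_actions(user_intention: str) -> dict:
    """Map each relevant system action to the utility a reward rule would assign."""
    utilities = {}
    if user_intention.startswith("Want(") and user_intention.endswith(")"):
        requested = user_intention[len("Want("):-1]      # e.g. "Mug"
        utilities["PickUp(" + requested + ")"] = 5.0     # fulfil the request
        utilities["AskRepeat"] = -0.5                    # clarification allowed, but costly
    # actions such as PickUp(Box) receive no utility here,
    # so the planner never needs to sample or evaluate them
    return utilities

print(candidate_actions("Want(Mug)"))   # {'PickUp(Mug)': 5.0, 'AskRepeat': -0.5}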

Page 38:

Discussion

• We did some preliminary experiments with an implementation of this algorithm

• The rules did make a difference in guiding the search for «relevant» trajectories

• But unfortunately, the algorithm doesn't yet scale to real-time performance (sorry)

• We need better heuristics to aggressively prune the applied rules, improve the action sampling, and speed up the inference algorithm

Page 39:

Conclusions

• Online planning can help us build more adaptive dialogue systems

• It can be combined with offline planning, using a precomputed policy to guide the search of an online planner!

• Using prior domain knowledge (encoded with e.g. probabilistic rules) should help the planner focus on relevant actions and filter out irrelevant ones

• More work is needed to find tractable planning techniques and to collect empirical results

Page 41:

Online reinforcement learning

• Links with the reinforcement learning literature:

• Our approach can be seen as a model-based approach to reinforcement learning, where a model is learned and then used to plan the best actions

• Most other approaches rely on a model-free reinforcement learning paradigm and try to learn the optimal action directly from experience, without estimating explicit models (and planning over them)