towards online planning for dialogue management with rich ... · rich domain knowledge pierre lison...
TRANSCRIPT
![Page 1: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/1.jpg)
Towards Online Planning for Dialogue Management with
Rich Domain Knowledge
Pierre LisonLanguage Technology Group,
University of Oslo
IWSDS, November 29, 2012
![Page 2: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/2.jpg)
Introduction
• Dialogue management (DM) as a decision-theoretic planning problem
• Most approaches perform this planning offline (precomputed policies)
2
![Page 3: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/3.jpg)
Introduction
• Dialogue management (DM) as a decision-theoretic planning problem
• Most approaches perform this planning offline (precomputed policies)
2
+ -
Offline planning
Action selection is straightforward and quite efficient (direct lookup)
• Must consider all possible situations• Policy must be recomputed after every model change
![Page 4: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/4.jpg)
Introduction
Alternative: perform planning online, at execution time
3
![Page 5: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/5.jpg)
Introduction
Alternative: perform planning online, at execution time
3
+ -
Offline planning
Action selection is straightforward (and efficient)
• Must consider all possible situations• Policy must be recomputed after every model change
Online planning
• Must only plan for current situation• Easier to adapt to runtime changes
• Must meet real-time requirements
![Page 6: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/6.jpg)
Approach
• To address these real-time constraints, we must ensure the planner concentrates on relevant actions and states
• Intuition: we can exploit prior domain knowledge to filter the space of possible actions and transitions
• We report here on our ongoing work with the use of probabilistic rules to encode the probability/reward models of our domain
4
![Page 7: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/7.jpg)
Approach
• We represent the dialogue state as a Bayesian Network
• Probabilistic rules are applied upon this dialogue state to update / extend it
• Advantages:
• (exponentially) fewer parameters to estimate
• Can incorporate prior domain knowledge
5
[Pierre Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012]
![Page 8: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/8.jpg)
Approach
• We represent the dialogue state as a Bayesian Network
• Probabilistic rules are applied upon this dialogue state to update / extend it
• Advantages:
• (exponentially) fewer parameters to estimate
• Can incorporate prior domain knowledge
5
[Pierre Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012]
![Page 9: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/9.jpg)
Probabilistic rules
• The rules take the form of structured if...then...else cases, mappings from conditions to (probabilistic) effects:
• For action-selection rules, the effect associates rewards to particular actions:
6
if (condition1 holds) then P(effect1)= θ1, P(effect2)= θ2
else if (condition2 holds) then P(effect3) = θ3
if (condition1 holds) then R(actions)= θ1
![Page 10: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/10.jpg)
Rule examples
• Example r1 (probability rule):
• Example r2 (reward rule):
7
if (am = AskRepeat) then P(au’ = au) = 0.9 P(au’ ≠ au) = 0.1
if (au ≠ None) then R(am = AskRepeat) = -0.5’
![Page 11: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/11.jpg)
Rule examples
• Example r1 (probability rule):
• Example r2 (reward rule):
7
if (am = AskRepeat) then P(au’ = au) = 0.9 P(au’ ≠ au) = 0.1
if (au ≠ None) then R(am = AskRepeat) = -0.5
au
am
r1 au’
am
au
’
![Page 12: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/12.jpg)
Rule examples
• Example r1 (probability rule):
• Example r2 (reward rule):
7
if (am = AskRepeat) then P(au’ = au) = 0.9 P(au’ ≠ au) = 0.1
if (au ≠ None) then R(am = AskRepeat) = -0.5
am
r2au
’
au
am
r1 au’
am
au
’
![Page 13: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/13.jpg)
Approach
8
![Page 14: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/14.jpg)
Approach
• The transition, observation and reward models for the domain are all encoded in terms of these probability & reward rules
8
![Page 15: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/15.jpg)
Approach
• The transition, observation and reward models for the domain are all encoded in terms of these probability & reward rules
• We devised a simple forward planning algorithm that:
8
![Page 16: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/16.jpg)
Approach
• The transition, observation and reward models for the domain are all encoded in terms of these probability & reward rules
• We devised a simple forward planning algorithm that:
• samples a collection of trajectories (sequence of actions) starting from the current state, until a specific horizon is reached
8
![Page 17: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/17.jpg)
Approach
• The transition, observation and reward models for the domain are all encoded in terms of these probability & reward rules
• We devised a simple forward planning algorithm that:
• samples a collection of trajectories (sequence of actions) starting from the current state, until a specific horizon is reached
• and records the return obtained for each trajectory
8
![Page 18: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/18.jpg)
Approach
• The transition, observation and reward models for the domain are all encoded in terms of these probability & reward rules
• We devised a simple forward planning algorithm that:
• samples a collection of trajectories (sequence of actions) starting from the current state, until a specific horizon is reached
• and records the return obtained for each trajectory
• The algorithm then searches for the action sequence with highest average return, and executes the first action in this sequence
8
![Page 19: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/19.jpg)
Planning algorithm
9
![Page 20: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/20.jpg)
Planning algorithm
9
s0
![Page 21: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/21.jpg)
Planning algorithm
9
s0 Loop (for each trajectory):
![Page 22: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/22.jpg)
Planning algorithm
9
s0 Loop (for each trajectory):Q = 0
![Page 23: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/23.jpg)
Planning algorithm
9
s0 Loop (for each trajectory):Q = 0Loop (until horizon reached):
![Page 24: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/24.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
![Page 25: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/25.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRR0
![Page 26: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/26.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state s
R0
s1
![Page 27: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/27.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
R0
au1s1
![Page 28: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/28.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief update
R0
au1s1
![Page 29: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/29.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief update
R0
au1s1
![Page 30: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/30.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief update
R0
au1
am1
s1
![Page 31: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/31.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief update
R0
au1
am1
R1
s1
![Page 32: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/32.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief update
R0
au1
am1
R1
until horizon is reached...
s1
![Page 33: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/33.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief updateRecord Q for trajectory
R0
au1
am1
R1
until horizon is reached...
s1
![Page 34: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/34.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief updateRecord Q for trajectory
Repeat until
enough trajectories
are collected
R0
au1
am1
R1
until horizon is reached...
s1
![Page 35: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/35.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief updateRecord Q for trajectory
∀ trajectory, calculate average Q
Repeat until
enough trajectories
are collected
R0
au1
am1
R1
until horizon is reached...
s1
![Page 36: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/36.jpg)
Planning algorithm
9
s0
am0
Loop (for each trajectory):Q = 0Loop (until horizon reached):
Sample system action am
Q = Q + γtRPredict next state sSample next user action au
Do belief updateRecord Q for trajectory
∀ trajectory, calculate average Q
Return trajectory with max. Q
Repeat until
enough trajectories
are collected
R0
au1
am1
R1
until horizon is reached...
s1
![Page 37: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/37.jpg)
Discussion
• What is original in our approach?
• The rules provide high-level constraints on the relevant actions (and subsequent future states)
• For instance, if the user intention is Want(Mug), the rules will provide utilities for the actions PickUp(Mug) or AskRepeat, but will consider PickUp(Box) as irrelevant
• In other words, they enable the algorithm to quickly focus the search towards high-utility regions
• ... and discard irrelevant actions and predictions
10
![Page 38: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/38.jpg)
Discussion
• We did some preliminary experiments with an implementation of this algorithm
• The rules did make a difference in guiding the search for «relevant» trajectories
• But unfortunately, the algorithm doesn’t yet scale to real-time performance (sorry)
• Need to find better heuristics to aggressively prune the applied rules, improve action sampling and the performance of the inference algorithm
11
![Page 39: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/39.jpg)
Conclusions
• Online planning can help us build more adaptive dialogue systems
• Can be combined with offline planning, using a precomputed policy to guide the search of an online planner!
• Using prior domain knowledge (encoded with e.g. probabilistic rules) should help the planner focus on relevant actions and filter out irrelevant ones
• More work needed to find tractable planning techniques, and collect empirical results
12
![Page 40: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/40.jpg)
13
![Page 41: Towards Online Planning for Dialogue Management with Rich ... · Rich Domain Knowledge Pierre Lison Language Technology Group, University of Oslo IWSDS, November 29, 2012 . Introduction](https://reader033.vdocuments.us/reader033/viewer/2022060909/60a3ce7de8b98541e011b613/html5/thumbnails/41.jpg)
Online reinforcement learning
• Links with the reinforcement learning literature:
• Our approach can be seen as a model-based approach to reinforcement learning, where a model is learned and then used to plan the best actions
• Most other approaches rely on a model-free reinforcement learning paradigm, and try to learn the optimal action directly from experience, without trying to estimate explicit models (and plan over them)
14