TRANSCRIPT
DQA meeting: 17.07.2007:
Learning more effective dialogue strategies using limited dialogue move features
Matthew Frampton & Oliver Lemon, Coling/ACL-2006
Presented by: Mark Hepple
Data-driven methodology for SDS Development
• Broader context = realising a “data-driven” methodology for creating SDSs, with the following steps:
– 1. Collect data (using a prototype or WOZ)
– 2. Build a probabilistic user simulation from the data (covering user behaviour, ASR errors)
– 3. [Feature selection – using the user simulation]
– 4. Learn a dialogue strategy, using reinforcement learning over system interactions with the simulation
Task
• Information-seeking dialogue systems
– Specifically task-oriented, slot-filling dialogues, leading to a database query
– E.g. getting user requirements for a flight booking (c.f. COMMUNICATOR task)
– Aim is to achieve an effective system strategy for such dialogue interactions
Reinforcement Learning
• System modelled as a Markov Decision Process (MDP)
– models decision making in situations where outcomes are partly random, partly under system control
• Reinforcement learning is used to learn an effective policy (a generic sketch follows below)
– the policy determines the best action to take in each situation
• Aim is to maximize overall reward
– needs a reward function, assigning reward values to different dialogues
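The slides do not specify the exact learning algorithm or its parameters, so the following is a minimal, hypothetical tabular Q-learning sketch of how a policy over dialogue states and actions can be learned; the function names and hyperparameter values are illustrative, not the paper's.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch. ALPHA is the learning rate, GAMMA the
# discount factor, EPSILON the exploration rate; values are illustrative.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)  # maps (state, action) pairs to estimated values

def choose_action(state, actions):
    """Epsilon-greedy selection over the system's dialogue actions."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """One-step Q-learning backup after observing a reward and next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```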
Action set of dialog system
– 1. Ask an open question (“How may I help you?”)
– 2. Ask value for slot 1..n
– 3. Explicitly confirm a slot 1..n
– 4. Ask for slot k, whilst implicitly confirming slot k-1 or k+1
– 5. Give help
– 6. Pass to human operator
– 7. Database query (an illustrative encoding follows below)
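A hypothetical Python encoding of this action set; the `Action` names are mine, not the paper's, and the slot-specific actions (2-4) would in practice be parameterised by a slot index.

```python
from enum import Enum, auto

# Hypothetical encoding of the system's seven action types.
class Action(Enum):
    ASK_OPEN = auto()          # 1. "How may I help you?"
    ASK_SLOT = auto()          # 2. ask value for slot i
    EXPLICIT_CONFIRM = auto()  # 3. explicitly confirm slot i
    IMPLICIT_CONFIRM = auto()  # 4. ask slot k, implicitly confirming k-1/k+1
    GIVE_HELP = auto()         # 5. give help
    PASS_TO_OPERATOR = auto()  # 6. pass to human operator
    DB_QUERY = auto()          # 7. database query
```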
Reward function
• Reward function is “all-or-nothing” (sketched in code below):
– 1. DB query, all slots confirmed = +100
– 2. Any other DB query = -75
– 3. User simulation hangs up = -100
– 4. System passes to human operator = -50
– 5. Each system turn = -5
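The reward scheme above can be written directly in code; the values come from the slide, but the outcome labels and function name below are hypothetical.

```python
TURN_PENALTY = -5  # applied at every system turn

def final_reward(outcome, all_slots_confirmed):
    """Terminal reward for a dialogue; the outcome labels are hypothetical."""
    if outcome == "db_query":
        return 100 if all_slots_confirmed else -75
    if outcome == "hang_up":
        return -100
    if outcome == "operator":
        return -50
    raise ValueError(f"unknown outcome: {outcome}")
```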
N-Gram User Simulation
• Employs the n-gram user simulation of Georgila, Lemon & Henderson:
– Derived from an annotated version of the COMMUNICATOR data
– Treats a dialogue as a sequence of pairs of DAs/tasks
– Outputs the next user “utterance” as a DA/task pair, based on the last n-1 pairs
– Incorporates the effects of ASR errors (built from user utterances as recognised by the ASR components of the original COMMUNICATOR systems)
– Separate 4-gram and 5-gram simulations are used for training/testing (see the sketch below)
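A simplified sketch of such an n-gram user simulation, predicting the next user DA/task pair from the previous n-1 pairs. The real simulations also model ASR errors and use smoothing/back-off, which are omitted here; the class and method names are mine.

```python
import random
from collections import defaultdict

class NGramUserSim:
    """Toy n-gram user simulation over (dialogue_act, task) pairs."""

    def __init__(self, n):
        self.n = n
        # context (tuple of n-1 pairs) -> counts of the following pair
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, dialogues):
        """dialogues: lists of (dialogue_act, task) pairs."""
        for dlg in dialogues:
            for i in range(len(dlg) - self.n + 1):
                context = tuple(dlg[i:i + self.n - 1])
                self.counts[context][dlg[i + self.n - 1]] += 1

    def next_utterance(self, history):
        """Sample the next user DA/task pair given the dialogue history."""
        context = tuple(history[-(self.n - 1):])
        dist = self.counts.get(context)
        if not dist:
            return None  # unseen context; real systems back off or smooth
        pairs, weights = zip(*dist.items())
        return random.choices(pairs, weights=weights)[0]
```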
Key Question: what context features to use
• Past work has used only limited state information
– based on the number/fill-status of slots
• Proposal: include richer context information, specifically (illustrative encodings below):
– Dialogue act (DA) of the last system turn
– DA of the last user turn
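Illustrative state encodings for the baseline and the two augmented strategies compared in the experiments that follow; the function names are mine, and slot status is shown here as simple fill/confirm booleans.

```python
# Hypothetical state encodings: the baseline uses only slot status,
# while the augmented strategies add the last user and/or system DA.
def baseline_state(slots):
    """slots: sequence of (filled, confirmed) booleans, one per slot."""
    return tuple(slots)

def user_da_state(slots, last_user_da):
    """Slot status plus the last user dialogue act (Strat 2)."""
    return (tuple(slots), last_user_da)

def full_da_state(slots, last_user_da, last_system_da):
    """Slot status plus both last user and system DAs (Strat 3)."""
    return (tuple(slots), last_user_da, last_system_da)

# Example: two slots, the first filled and confirmed, the second empty.
state = full_da_state([(True, True), (False, False)], "provide_info", "ask_slot")
```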
Experiments
• Compare 3 systems:
– Baseline: slot features only
– Strat 2: slot features + last user DA
– Strat 3: slot features + last user and system DAs
• Train with the 4-gram user simulation and test with the 5-gram one, and vice versa
Results
• Main reported result is the improvement in the average reward level of dialogues for the augmented strategies, compared to the baseline:
– Strat 2 improves over the baseline by 4.9%
– Strat 3 improves over the baseline by 7.8%
• All 3 strategies achieve 100% slot filling and confirmation
• Augmented strategies also improve over baseline w.r.t. average dialogue length
Qualitative Analysis
• Learns to:
– only query the DB when all slots are filled
– not pass to the operator
– use implicit confirmation where possible
• Emergent behaviour:
– When the baseline system fails to fill/confirm a slot from user input, the state remains the same, so the system repeats the same action
– For the augmented systems, the ‘state’ changes, so they can learn to take a different action, e.g. ask about a different slot, or use the “give help” action
Questions/Comments
• Value of performance improvement figures based on reward?
• Does improvement w.r.t. the reward function imply improvement for human-machine dialogues?
• Validity of comparisons to the COMMUNICATOR systems?
• Why does system performance improve?
– Is avoidance of repetition the key?