TRANSCRIPT
DQA meeting: 17.07.2007:
Learning more effective dialogue strategies using limited dialogue move features
Matthew Frampton & Oliver Lemon, Coling/ACL-2006
Presented by: Mark Hepple
Data-driven methodology for SDS Development
• Broader context = realising a “data-driven” methodology for creating SDSs, with the following steps:
– 1. Collect data (using a prototype or WOZ)
– 2. Build a probabilistic user simulation from the data (covering user behaviour, ASR errors)
– 3. [Feature selection – using the user simulation]
– 4. Learn a dialogue strategy, using reinforcement learning over system interactions with the simulation
Task
• Information-seeking dialogue systems
– Specifically task-oriented, slot-filling dialogues, leading to a database query
– E.g. getting user requirements for a flight booking (c.f. COMMUNICATOR task)
– Aim is to achieve an effective system strategy for such dialogue interactions
Reinforcement Learning
• System modelled as a Markov Decision Process (MDP)
– models decision making in situations where outcomes are partly random, partly under system control
• Reinforcement learning is used to learn an effective policy (a generic sketch follows below)
– the policy determines the best action to take in each situation
• Aim is to maximize overall reward
– needs a reward function, assigning reward values to different dialogues
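The slides do not specify the exact learning algorithm or its parameters, so the following is a minimal, hypothetical tabular Q-learning sketch of how a policy over dialogue states and actions can be learned; the function names and hyperparameter values are illustrative, not the paper's.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch. ALPHA is the learning rate, GAMMA the
# discount factor, EPSILON the exploration rate; values are illustrative.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)  # maps (state, action) pairs to estimated values

def choose_action(state, actions):
    """Epsilon-greedy selection over the system's dialogue actions."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """One-step Q-learning backup after observing a reward and next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```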
Action set of dialog system
– 1. Ask an open question (“How may I help you?”)
– 2. Ask value for slot 1..n
– 3. Explicitly confirm a slot 1..n
– 4. Ask for slot k, whilst implicitly confirming slot k-1 or k+1
– 5. Give help
– 6. Pass to human operator
– 7. Database query (an illustrative encoding follows below)
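A hypothetical Python encoding of this action set; the `Action` names are mine, not the paper's, and the slot-specific actions (2-4) would in practice be parameterised by a slot index.

```python
from enum import Enum, auto

# Hypothetical encoding of the system's seven action types.
class Action(Enum):
    ASK_OPEN = auto()          # 1. "How may I help you?"
    ASK_SLOT = auto()          # 2. ask value for slot i
    EXPLICIT_CONFIRM = auto()  # 3. explicitly confirm slot i
    IMPLICIT_CONFIRM = auto()  # 4. ask slot k, implicitly confirming k-1/k+1
    GIVE_HELP = auto()         # 5. give help
    PASS_TO_OPERATOR = auto()  # 6. pass to human operator
    DB_QUERY = auto()          # 7. database query
```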
Reward function
• Reward function is “all-or-nothing” (sketched in code below):
– 1. DB query, all slots confirmed = +100
– 2. Any other DB query = -75
– 3. User simulation hangs up = -100
– 4. System passes to human operator = -50
– 5. Each system turn = -5
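The reward scheme above can be written directly in code; the values come from the slide, but the outcome labels and function name below are hypothetical.

```python
TURN_PENALTY = -5  # applied at every system turn

def final_reward(outcome, all_slots_confirmed):
    """Terminal reward for a dialogue; the outcome labels are hypothetical."""
    if outcome == "db_query":
        return 100 if all_slots_confirmed else -75
    if outcome == "hang_up":
        return -100
    if outcome == "operator":
        return -50
    raise ValueError(f"unknown outcome: {outcome}")
```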
N-Gram User Simulation
• Employs the n-gram user simulation of Georgila, Lemon & Henderson:
– Derived from an annotated version of the COMMUNICATOR data
– Treats a dialogue as a sequence of pairs of DAs/tasks
– Outputs the next user “utterance” as a DA/task pair, based on the last n-1 pairs
– Incorporates the effects of ASR errors (built from user utterances as recognised by the ASR components of the original COMMUNICATOR systems)
– Separate 4-gram and 5-gram simulations are used for training/testing (see the sketch below)
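A simplified sketch of such an n-gram user simulation, predicting the next user DA/task pair from the previous n-1 pairs. The real simulations also model ASR errors and use smoothing/back-off, which are omitted here; the class and method names are mine.

```python
import random
from collections import defaultdict

class NGramUserSim:
    """Toy n-gram user simulation over (dialogue_act, task) pairs."""

    def __init__(self, n):
        self.n = n
        # context (tuple of n-1 pairs) -> counts of the following pair
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, dialogues):
        """dialogues: lists of (dialogue_act, task) pairs."""
        for dlg in dialogues:
            for i in range(len(dlg) - self.n + 1):
                context = tuple(dlg[i:i + self.n - 1])
                self.counts[context][dlg[i + self.n - 1]] += 1

    def next_utterance(self, history):
        """Sample the next user DA/task pair given the dialogue history."""
        context = tuple(history[-(self.n - 1):])
        dist = self.counts.get(context)
        if not dist:
            return None  # unseen context; real systems back off or smooth
        pairs, weights = zip(*dist.items())
        return random.choices(pairs, weights=weights)[0]
```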
Key Question: what context features to use
• Past work has used only limited state information
– based on the number/fill-status of slots
• Proposal: include richer context information, specifically (illustrative encodings below):
– Dialogue act (DA) of the last system turn
– DA of the last user turn
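Illustrative state encodings for the baseline and the two augmented strategies compared in the experiments that follow; the function names are mine, and slot status is shown here as simple fill/confirm booleans.

```python
# Hypothetical state encodings: the baseline uses only slot status,
# while the augmented strategies add the last user and/or system DA.
def baseline_state(slots):
    """slots: sequence of (filled, confirmed) booleans, one per slot."""
    return tuple(slots)

def user_da_state(slots, last_user_da):
    """Slot status plus the last user dialogue act (Strat 2)."""
    return (tuple(slots), last_user_da)

def full_da_state(slots, last_user_da, last_system_da):
    """Slot status plus both last user and system DAs (Strat 3)."""
    return (tuple(slots), last_user_da, last_system_da)

# Example: two slots, the first filled and confirmed, the second empty.
state = full_da_state([(True, True), (False, False)], "provide_info", "ask_slot")
```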
Experiments
• Compare 3 systems:
– Baseline: slot features only
– Strat 2: slot features + last user DA
– Strat 3: slot features + last user and system DAs
• Train with the 4-gram user simulation and test with the 5-gram one, and vice versa
Results
• Main reported result is the improvement in the average reward level of dialogues for the augmented strategies, compared to the baseline:
– Strat 2 improves over the baseline by 4.9%
– Strat 3 improves over the baseline by 7.8%
• All 3 strategies achieve 100% slot filling and confirmation
• Augmented strategies also improve over baseline w.r.t. average dialogue length
Qualitative Analysis
• Learns to:
– only query the DB when all slots are filled
– not pass to the operator
– use implicit confirmation where possible
• Emergent behaviour:
– When the baseline system fails to fill/confirm a slot from user input, the state remains the same, so the system repeats the same action
– For the augmented systems, the ‘state’ changes, so they can learn to take a different action, e.g. ask about a different slot, or use the “give help” action
Questions/Comments
• Value of performance improvement figures based on reward?
• Does improvement w.r.t. the reward function imply improvement for human-machine dialogues?
• Validity of comparisons to the COMMUNICATOR systems?
• Why does system performance improve?
– Is avoidance of repetition the key?