TRANSCRIPT
Partially observable Markov decision processes for spoken dialog systems
Jason D. Williams (AT&T Labs), Steve Young (Cambridge University)
2007, Computer Speech and Language, 21(2)
Outline
Introduction
Partially observable Markov decision processes
Spoken Dialog System
SDS-POMDP
Comparing
Empirical support
POMDP (1)
Partially observable Markov decision processes
POMDP = {S, A, T, R, O, Z, γ, b0}
S set of states describing the agent's world
A set of actions the agent may take
T transition probability P(s'|s, a)
R reward r(s, a)
O set of observations about the world
Z observation probability P(o'|s', a)
POMDP (2)
POMDP = {S, A, T, R, O, Z, γ, b0}
γ geometric discount factor
b0 initial belief state b0(s)
POMDP (3)
Circle: random variable
Square: decision node
Diamond: utility node
Shaded: unobserved
Arrow: causal effect
Dashed arrow: distribution is used
RL: reinforcement learning
POMDP (Example)
Dialog system for saving/deleting messages
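A minimal runnable sketch of this example: the user's goal (save or delete the message) is the hidden state, and the machine only sees noisy recognition results. All probabilities here are made-up assumptions for illustration; the paper's actual model factors the state further.

```python
# Illustrative voicemail example: the user wants to SAVE or DELETE a
# message, but the machine only sees noisy speech recognition output.
# Probabilities below are invented for the sketch.

STATES = ["save", "delete"]          # S: the user's (hidden) goal
OBSERVATIONS = ["save", "delete"]    # O: what the recognizer reports

# Z: observation probability P(o | s) -- recognizer is right 80% of the time
P_OBS = {
    "save":   {"save": 0.8, "delete": 0.2},
    "delete": {"save": 0.2, "delete": 0.8},
}

def belief_update(belief, observation):
    """One POMDP belief-update step (the user's goal is assumed static)."""
    new_belief = {s: belief[s] * P_OBS[s][observation] for s in STATES}
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

# b0: uniform initial belief over the two goals
b = {"save": 0.5, "delete": 0.5}
b = belief_update(b, "save")   # recognizer reports "save"
print(b)                       # belief shifts toward "save"
```

Each observation reweights the belief by how likely the recognizer would have produced it under each hidden goal, which is exactly how the POMDP keeps multiple state hypotheses alive instead of committing to one.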
Spoken Dialog System
Su internal user state
Sd dialog state (user view)
Au user action (intention)
Spoken Dialog System
Yu user audio signal
Ãu action recognized by machine
C confidence score
Sm dialog state (machine view)
Spoken Dialog System
Am machine action
Ym machine audio signal
Ãm action recognized by user
Mapping SDS to POMDP
POMDP = {S, A, T, R, O, Z, , b0}
SDS = {Su, Sd, Sm, C, Au, Ãu, Am}
SDS-POMDP
s = (su, au, sd)
sm = b(s) = b(su, au, sd)
Math behind
Formula for new belief
Exact algorithms rarely scale beyond roughly 10 states, actions, and observations.
Effective approximate solutions exist.
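The "formula for new belief" is the standard POMDP belief update, consistent with the tuple defined earlier (η is a normalizing constant):

```latex
b'(s') = \eta \, P(o' \mid s', a) \sum_{s \in S} P(s' \mid s, a) \, b(s)
```

The transition term predicts where the state moves under action a, and the observation term reweights that prediction by how well each candidate state explains the new observation o'.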
Comparing SDS-POMDP
Better than current approaches
Current approaches are simplifications or special cases
Approaches:
Parallel state hypotheses
Local use of confidence score
Automated action planning
Parallel state hypotheses
Traditional: 1 state
Uncertainty: multiple states
2 techniques:
Greedy decision-theoretic approaches
M-Best list
Greedy decisions
Maximizes immediate reward
Doesn't perform planning
Handcrafting + ad hoc tuning
M-Best list
Considers only the top hypotheses
= POMDP with handcrafted action selection
Subspace of belief space
Local use of confidence score
Handcrafted update rules
Ac = {expl-confirm, imp-confirm, reject}
Useful, but hard for long-term goals
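Handcrafted rules of this kind can be sketched as simple thresholding over the confidence score; the 0.4 / 0.8 cut-offs below mirror the Reject / Low / High bands used later in the evaluation slides, and the function name is illustrative.

```python
# Sketch of a handcrafted, *local* confidence-score rule: pick a grounding
# action from Ac = {explicit-confirm, implicit-confirm, reject} by
# thresholding the recognizer's confidence for the current turn only.
# Thresholds 0.4 / 0.8 follow the Reject / Low / High bands on the slides.

def choose_grounding_action(confidence: float) -> str:
    if confidence < 0.4:       # too unreliable: reject and re-prompt
        return "reject"
    elif confidence < 0.8:     # plausible: confirm explicitly before acting
        return "explicit-confirm"
    else:                      # high confidence: confirm implicitly, move on
        return "implicit-confirm"

print(choose_grounding_action(0.3))   # reject
print(choose_grounding_action(0.9))   # implicit-confirm
```

Because each decision looks only at the current turn's score, rules like this cannot trade off a confirmation now against dialog length later, which is why the slide notes they are hard to reconcile with long-term goals.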
Automated action selection
Handcrafted planning:
Unforeseen dialog situations
POMDP with single state
2 main techniques:
Supervised learning
Markov decision processes
Supervised learning
Training data:
Human-human: much richer
Human-machine: machine errors
Single state
Markov decision process
A fully observable MDP is a simplification of a POMDP
Assumes that the world state is known exactly
Single state
Empirical support
Based on simulations
Benefits of POMDP for:
Parallel state hypotheses
Confidence score
Automated planning
Real data
Parallel state hypotheses (1)
Parallel state hypotheses (2)
Parallel state hypotheses (3)
Confidence score (1)
Confidence score bands: Reject (below 0.4), Low (0.4–0.8), High (above 0.8)
Confidence score (2)
Confidence score (3)
Confidence score (4)
Automated planning (1)
HC1
HC2
HC3
Automated planning (2)
Automated planning (3)
Real data (1)
SACTI-1 Corpus:
144 human-human dialogs in the travel domain
Real data (2)
Conclusion
Significant improvement in robustness
Current approaches are simplifications or special cases
Scales poorly
Unique
Future work
Other approaches:
Information State Update
Hidden Information State
Evaluating on real users
Questions?
Thank you!