TRANSCRIPT
Partially observable Markov decision processes for spoken dialog systems
Jason D. Williams (AT&T Labs), Steve Young (Cambridge University)
2007, Computer Speech and Language, 21(2)
Outline
Introduction
Partially observable Markov decision processes
Spoken Dialog System
SDS-POMDP
Comparing
Empirical support
POMDP (1)
Partially observable Markov decision processes
POMDP = {S, A, T, R, O, Z, γ, b0}
S set of states describing the agent's world
A set of actions the agent may take
T transition probability P(s'|s, a)
R reward r(s, a)
O set of observations about the world
Z observation probability P(o'|s', a)
POMDP (2)
POMDP = {S, A, T, R, O, Z, γ, b0}
γ geometric discount factor
b0 initial belief state b0(s)
POMDP (3)
Circle: random variable
Square: decision node
Diamond: utility node
Shaded: unobserved
Arrow: causal effect
Dashed arrow: distribution is used
RL: reinforcement learning
POMDP (Example)
Dialog system for saving/deleting messages
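A minimal runnable sketch of this example: the user's goal (save or delete the message) is the hidden state, and the machine only sees noisy recognition results. All probabilities here are made-up assumptions for illustration; the paper's actual model factors the state further.

```python
# Illustrative voicemail example: the user wants to SAVE or DELETE a
# message, but the machine only sees noisy speech recognition output.
# Probabilities below are invented for the sketch.

STATES = ["save", "delete"]          # S: the user's (hidden) goal
OBSERVATIONS = ["save", "delete"]    # O: what the recognizer reports

# Z: observation probability P(o | s) -- recognizer is right 80% of the time
P_OBS = {
    "save":   {"save": 0.8, "delete": 0.2},
    "delete": {"save": 0.2, "delete": 0.8},
}

def belief_update(belief, observation):
    """One POMDP belief-update step (the user's goal is assumed static)."""
    new_belief = {s: belief[s] * P_OBS[s][observation] for s in STATES}
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

# b0: uniform initial belief over the two goals
b = {"save": 0.5, "delete": 0.5}
b = belief_update(b, "save")   # recognizer reports "save"
print(b)                       # belief shifts toward "save"
```

Each observation reweights the belief by how likely the recognizer would have produced it under each hidden goal, which is exactly how the POMDP keeps multiple state hypotheses alive instead of committing to one.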
Spoken Dialog System
Su internal user state
Sd dialog state (user view)
Au user action (intention)
Spoken Dialog System
Yu user audio signal
Ãu action recognized by machine
C confidence score
Sm dialog state (machine view)
Spoken Dialog System
Am machine action
Ym machine audio signal
Ãm action recognized by user
Mapping SDS to POMDP
POMDP = {S, A, T, R, O, Z, , b0}
SDS = {Su, Sd, Sm, C, Au, Ãu, Am}
SDS-POMDP
s = (su, au, sd)
sm = b(s) = b(su, au, sd)
Math behind
Formula for new belief
Exact algorithms rarely scale beyond roughly 10 states, actions, and observations.
Effective approximate solutions exist.
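The "formula for new belief" is the standard POMDP belief update, consistent with the tuple defined earlier (η is a normalizing constant):

```latex
b'(s') = \eta \, P(o' \mid s', a) \sum_{s \in S} P(s' \mid s, a) \, b(s)
```

The transition term predicts where the state moves under action a, and the observation term reweights that prediction by how well each candidate state explains the new observation o'.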
Comparing SDS-POMDP
Better than current approaches
Current approaches are simplifications or special cases
Approaches:
Parallel state hypotheses
Local use of confidence score
Automated action planning
Parallel state hypotheses
Traditional: 1 state
Uncertainty: multiple states
2 techniques:
Greedy decision-theoretic approaches
M-Best list
Greedy decisions
Maximizes immediate reward
Doesn't perform planning
Handcrafting + ad hoc tuning
M-Best list
Considers only the top hypotheses
= POMDP with handcrafted action selection
Subspace of belief space
Local use of confidence score
Handcrafted update rules
Ac = {expl-confirm, imp-confirm, reject}
Useful, but hard for long-term goals
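Handcrafted rules of this kind can be sketched as simple thresholding over the confidence score; the 0.4 / 0.8 cut-offs below mirror the Reject / Low / High bands used later in the evaluation slides, and the function name is illustrative.

```python
# Sketch of a handcrafted, *local* confidence-score rule: pick a grounding
# action from Ac = {explicit-confirm, implicit-confirm, reject} by
# thresholding the recognizer's confidence for the current turn only.
# Thresholds 0.4 / 0.8 follow the Reject / Low / High bands on the slides.

def choose_grounding_action(confidence: float) -> str:
    if confidence < 0.4:       # too unreliable: reject and re-prompt
        return "reject"
    elif confidence < 0.8:     # plausible: confirm explicitly before acting
        return "explicit-confirm"
    else:                      # high confidence: confirm implicitly, move on
        return "implicit-confirm"

print(choose_grounding_action(0.3))   # reject
print(choose_grounding_action(0.9))   # implicit-confirm
```

Because each decision looks only at the current turn's score, rules like this cannot trade off a confirmation now against dialog length later, which is why the slide notes they are hard to reconcile with long-term goals.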
Automated action selection
Handcrafted planning:
Unforeseen dialog situations
POMDP with single state
2 main techniques:
Supervised learning
Markov decision processes
Supervised learning
Training data:
Human-human: much richer
Human-machine: machine errors
Single state
Markov decision process
A fully observable MDP is a simplification of a POMDP
Assumes that the world state is known exactly
Single state
Empirical support
Based on simulations
Benefits of POMDP for:
Parallel state hypotheses
Confidence score
Automated planning
Real data
Parallel state hypotheses (1)
Parallel state hypotheses (2)
Parallel state hypotheses (3)
Confidence score (1)
Confidence score bands: Reject (below 0.4), Low (0.4–0.8), High (above 0.8)
Confidence score (2)
Confidence score (3)
Confidence score (4)
Automated planning (1)
HC1
HC2
HC3
Automated planning (2)
Automated planning (3)
Real data (1)
SACTI-1 Corpus:
144 human-human dialogs in the travel domain
Real data (2)
Conclusion
Significant improvement in robustness
Current approaches are simplifications or special cases
Scales poorly
Unique
Future work
Other approaches:
Information State Update
Hidden Information State
Evaluating on real users
Questions?
Thank you!