Topics of Active Research in Reinforcement Learning Relevant to Spoken Dialogue Systems
Pascal Poupart, David R. Cheriton School of Computer Science, University of Waterloo


Page 1:

Topics of Active Research in Reinforcement Learning
Relevant to Spoken Dialogue Systems

Pascal Poupart
David R. Cheriton School of Computer Science

University of Waterloo

Page 2:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 3:

Automated System

• Abstraction:

[Diagram: agent–environment loop — the agent receives percepts and goals and sends actions to the environment]

• Problems:
  – Uncertain action effects
  – Imprecise percepts
  – Unknown environment

Page 4:

Automated System

• Spoken Dialogue System:

[Diagram: booking system ↔ user loop — the user's goal is to book a flight; user utterances pass through a speech recognizer before reaching the system, and the system replies with utterances]

• Problems:
  – User may not hear/understand system utterances
  – Imprecise speech recognizer
  – Unknown user model

Page 5:

Markov Models

[Diagram: family of Markov models — a Markov Process extends to an MDP (decision making), to an HMM / DBN (partial observability), and to a POMDP (both); adding learning gives RL and PO-RL]

Page 6:

Stochastic Process

• World state changes over time
• Convention:
  – Circle: random variable
  – Arc: conditional dependency
• Stochastic dynamics: Pr(st+1|st, …, s0)

[Diagram: chain s0, s1, s2, s3, s4 with arcs from every earlier state — too many dependencies!]

Page 7:

Markov Process

• Markov assumption: the current state depends only on a finite history of past states
  – K-order Markov process: Pr(st|st-1,…,st-k,st-k-1,…,s0) = Pr(st|st-1,…,st-k)

• Example:
  – N-gram model: Pr(wordi|wordi-1, …, wordi-n+1)

[Diagram: chain s0 → s1 → s2 → s3 → s4 with arcs only between consecutive states]
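
As a concrete illustration of a first-order Markov process, the sketch below builds a small transition table Pr(next | current) and samples a sequence from it. The states and probabilities are invented for illustration (they could just as well be words in an N-gram language model).

```python
import random

# Hypothetical first-order transition model Pr(next | current); each row sums to 1.
transitions = {
    "greet":    {"ask_date": 0.7, "ask_city": 0.3},
    "ask_date": {"ask_city": 0.6, "confirm": 0.4},
    "ask_city": {"ask_date": 0.3, "confirm": 0.7},
    "confirm":  {"greet": 1.0},
}

def sample_next(state, rng=random):
    """Sample s_{t+1} ~ Pr(. | s_t) from the transition table."""
    r, cum = rng.random(), 0.0
    for nxt, p in transitions[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point round-off

state, chain = "greet", ["greet"]
for _ in range(6):
    state = sample_next(state)
    chain.append(state)
print(" -> ".join(chain))
```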

Page 8:

Markov Process

• Stationary assumption: dynamics do not change
  – Pr(st|st-1,…,st-k) is the same for all t

• Two slices are sufficient for a first-order Markov process…
  – Graph: St-1 → St
  – Dynamics: Pr(st|st-1)
  – Prior: Pr(s0)

Page 9:

Markov Decision Process

• Intuition: a (first-order) Markov process with…
  – Decision nodes
  – Utility nodes

[Diagram: state chain s0 → s1 → s2 → s3 → s4 with decision nodes a0…a3 and reward nodes r1…r4]

Page 10:

Markov Decision Process

• Definition
  – Set of states: S
  – Set of actions: A
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Reward model: R(st,at) = rt
  – Discount factor: 0 ≤ γ ≤ 1

• Goal: find an optimal policy
  – Policy π: S → A
  – Value: Vπ(s) = Eπ[Σt γ^t rt]
  – Optimal policy π*: Vπ*(s) ≥ Vπ(s) ∀π, s

Page 11:

MDPs for SDS

• MDPs for SDS: Biermann and Long (1996), Levin and Pieraccini (1997), Singh et al. (1999), Levin et al. (2000)

• Flight booking example:
  – State: assignment of values to dep. date, dep. time, dep. city and dest. city
  – Actions: any utterance (e.g., question, confirmation)
  – User model: Pr(user response | sys. utterance, state)
  – Rewards: positive reward for a correct booking, negative reward for an incorrect booking
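
To make the flight-booking formulation concrete, here is one possible encoding of a heavily simplified, single-slot version as the MDP tuple (S, A, T, R, γ). The states, actions, probabilities and rewards below are hypothetical illustrations, not taken from any of the cited systems.

```python
# Hypothetical single-slot booking MDP: the system must elicit the departure
# city and then book the flight.
S = ["empty", "filled", "booked_ok", "booked_wrong"]   # what the system knows
A = ["ask", "book"]
gamma = 0.95

# Transition model T[s][a] = Pr(. | s, a): asking fills the slot most of the
# time (the user may be misunderstood); booking outcome depends on the slot.
T = {
    "empty":        {"ask":  {"filled": 0.8, "empty": 0.2},
                     "book": {"booked_wrong": 1.0}},
    "filled":       {"ask":  {"filled": 1.0},
                     "book": {"booked_ok": 1.0}},
    "booked_ok":    {"ask": {"booked_ok": 1.0},    "book": {"booked_ok": 1.0}},
    "booked_wrong": {"ask": {"booked_wrong": 1.0}, "book": {"booked_wrong": 1.0}},
}

# Reward model R[s][a]: small cost per question, large +/- reward for booking.
R = {
    "empty":        {"ask": -1.0, "book": -20.0},
    "filled":       {"ask": -1.0, "book": 20.0},
    "booked_ok":    {"ask": 0.0, "book": 0.0},
    "booked_wrong": {"ask": 0.0, "book": 0.0},
}
```

Running value iteration on this toy model (see the sketch after the next slide) recovers the obvious policy: ask while the slot is empty, then book once it is filled.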

Page 12:

Value Iteration

• Three families of algorithms:
  – Value iteration, policy iteration, linear programming

• Bellman’s equation:
  – V(st) = maxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)
  – at* = argmaxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)

• Value iteration (backward induction from horizon h):
  – V(sh) = R(sh)
  – V(sh-1) = maxah-1 R(sh-1,ah-1) + γ Σsh Pr(sh|sh-1,ah-1) V(sh)
  – V(sh-2) = maxah-2 R(sh-2,ah-2) + γ Σsh-1 Pr(sh-1|sh-2,ah-2) V(sh-1)
  – …
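
The Bellman backup above maps almost line for line onto code. Below is a minimal infinite-horizon sketch that iterates the backup to convergence and then extracts the greedy policy; it works on dict-based MDPs like the toy booking model above, and the tiny two-state MDP at the bottom is invented purely to exercise the function.

```python
def value_iteration(S, A, T, R, gamma, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s') ] until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                        for a in A)
                 for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < tol:
            break
    # Greedy policy: a*(s) = argmax_a R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')
    pi = {s: max(A, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in S}
    return V, pi

# Tiny invented 2-state MDP, just to run the function end to end.
S = ["low", "high"]
A = ["wait", "act"]
T = {"low":  {"wait": {"low": 1.0},              "act": {"high": 0.6, "low": 0.4}},
     "high": {"wait": {"high": 0.8, "low": 0.2}, "act": {"high": 1.0}}}
R = {"low":  {"wait": 0.0, "act": -1.0},
     "high": {"wait": 2.0, "act": 1.0}}
print(value_iteration(S, A, T, R, gamma=0.9))
```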

Page 13:

Unrealistic Assumptions

• Transition (user) model known:
  – How to learn a good user model?
• Reward model known:
  – How to assess user preferences?
• Speech recognizer flawless:
  – How to account for ASR errors?

[Diagram: the corresponding remedies — RL (learn the transition model), IRL (learn the reward model), and POMDP / PORL (handle ASR errors under partial observability)]

Page 14:

Reinforcement Learning

• Markov Decision Process:
  – S: set of states
  – A: set of actions
  – R(s,a) = r: reward model
  – T(s,a,s’) = Pr(s’|s,a): transition function (unknown in the RL setting)

• RL for SDS: Walker et al. (1998), Singh et al. (1999), Scheffler and Young (1999), Litman et al. (2000), Levin et al. (2000), Pietquin (2004), Georgila et al. (2005), Lewis & Di Fabbrizio (2006)

Page 15:

Algorithms for RL

• Model-based RL:
  – Estimate T from s,a,s’ triples
    • E.g., maximum likelihood: Pr(s’|s,a) = #(s,a,s’) / #(s,a,•)
  – Model learning: offline (corpus of s,a,s’ triples) and/or online (s,a,s’ directly from the environment)

• Model-free RL:
  – Estimate V* and/or π* directly
    • E.g., temporal difference (Q-learning): Q(s,a) ← Q(s,a) + α [R(s,a) + γ maxa’ Q(s’,a’) – Q(s,a)]
  – Learning: offline (s,a,s’ from a simulator) and/or online (s,a,s’ directly from the environment)
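
The temporal-difference rule on this slide is the tabular Q-learning update. Below is a minimal sketch that runs it against a small simulated environment; the two-state dynamics and rewards are invented purely for illustration.

```python
import random
from collections import defaultdict

# Invented 2-state, 2-action simulator standing in for the environment.
def step(s, a):
    if s == "noisy" and a == "confirm":
        s2 = "clear" if random.random() < 0.7 else "noisy"
    elif s == "noisy":               # a == "ask"
        s2 = "clear" if random.random() < 0.4 else "noisy"
    else:                            # s == "clear": episode effectively resets
        s2 = "noisy"
    r = 1.0 if s2 == "clear" else -0.1
    return s2, r

alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["ask", "confirm"]
Q = defaultdict(float)               # Q[(s, a)], initialised to 0

s = "noisy"
for t in range(5000):
    # epsilon-greedy action selection
    a = random.choice(actions) if random.random() < epsilon \
        else max(actions, key=lambda a_: Q[(s, a_)])
    s2, r = step(s, a)
    # TD / Q-learning update: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```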

Page 16:

Successes of RL

• Backgammon [Tesauro 1995]
  – Temporal difference learning
  – Trained by self-play
  – Simulator: the opponent model is the agent itself
  – Offline learning: simulated millions of games

• Helicopter control [Ng et al. 2003, 2004]
  – PEGASUS: stochastic gradient descent
  – Offline learning with a flight simulator

Page 17:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 18:

Assistive Technologies

• Handwashing assistant [Boger et al., IJCAI-05]

• Use RL to adapt to users
  – Start with a basic user model
  – Online learning: adjust the model as the system interacts with users

• Bear cost of actions
• Cannot explore too much
• Real-time response

Page 19:

Bayesian Model-based RL

• Formalized in Operations Research by Howard and his students at MIT in the 1960s
• In AI: Kaelbling (1992), Meuleau and Bourgine (1999), Dearden et al. (1998, 1999), Strens (2000), Duff (2003), Wang et al. (2005), Poupart et al. (2006)

• Advantages
  – Optimal exploration/exploitation tradeoff
  – Encode prior knowledge → less data required

Page 20:

Bayesian Model-based RL

• Disadvantage:
  – Computationally complex

• Poupart et al. (ICML 2006):
  – Optimal value function has a simple parameterization
    • i.e., the upper envelope of a set of multivariate polynomials
  – BEETLE: Bayesian Exploration/Exploitation Tradeoff in LEarning
    • Exploits the polynomial parameterization

Page 21:

Bayesian RL

• Basic idea: encode the unknown probabilities as random variables θ
  – i.e., θsas’ = Pr(s’|s,a): random variable in [0,1]
  – i.e., θsa = Pr(•|s,a): multinomial distribution

• Model learning: update Pr(θ)
• Pr(θ) tells us which parts of the model are not well known and therefore worth exploring

Page 22:

Model Learning

• Assume a prior b(θsa) = Pr(θsa)
• Learning: compute the posterior given s,a,s’
  – bsas’(θsa) = k Pr(θsa) Pr(s’|s,a,θsa) = k b(θsa) θsas’

• Conjugate prior: Dirichlet prior → Dirichlet posterior
  – b(θsa) = Dir(θsa; nsa) = k Πs’’ (θsas’’)^(nsas’’ – 1)
  – bsas’(θsa) = k b(θsa) θsas’ = k Πs’’ (θsas’’)^(nsas’’ – 1 + δ(s’,s’’)) = k Dir(θsa; nsa + δ(s’,s’’))
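
Because the Dirichlet is conjugate, the posterior update amounts to incrementing a count. A minimal sketch (the state and action names are placeholders):

```python
states = ["empty", "filled"]
actions = ["ask"]

# Dirichlet hyperparameters n[(s,a)][s'] for each unknown distribution Pr(.|s,a),
# starting from a uniform prior (one pseudo-count per successor state).
n = {(s, a): {s2: 1.0 for s2 in states} for s in states for a in actions}

def update(s, a, s2):
    """Posterior after observing a transition: Dir(n_sa) -> Dir(n_sa + delta(s'))."""
    n[(s, a)][s2] += 1.0

def predictive(s, a, s2):
    """Posterior predictive Pr(s'|s,a) = E[theta_sas'] = n_sas' / sum_s'' n_sas''."""
    return n[(s, a)][s2] / sum(n[(s, a)].values())

update("empty", "ask", "filled")
update("empty", "ask", "filled")
print(predictive("empty", "ask", "filled"))   # (1 + 2) / (2 + 2) = 0.75
```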

Page 23:

Prior Knowledge

• Structural priors
  – Tie identical parameters
    • If Pr(•|s,a) = Pr(•|s’,a’) then θsa = θs’a’
  – Factored representation
    • DBN: unknown conditional distributions

• Informative priors
  – No knowledge: uniform Dirichlet
  – If (θ1, θ2) ~ (0.2, 0.8), then set (n1, n2) to (0.2k, 0.8k)
    • k indicates the level of confidence

[Figure: Dirichlet (Beta) densities over p ∈ [0,1] — Dir(p; 1, 1), Dir(p; 2, 8) and Dir(p; 20, 80) become increasingly peaked around p = 0.2 as confidence grows]
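
The (0.2k, 0.8k) recipe is a one-liner; a small sketch with an invented prior guess:

```python
def dirichlet_counts(guess, k):
    """Turn a guessed distribution and a confidence level k into Dirichlet pseudo-counts."""
    return {outcome: p * k for outcome, p in guess.items()}

guess = {"understands": 0.8, "misunderstands": 0.2}   # hypothetical prior belief
for k in (1, 10, 100):
    print(k, dirichlet_counts(guess, k))
# Larger k means more pseudo-counts, so the posterior moves away from the guess more slowly.
```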

Page 24:

Policy Optimization

• Classic RL:
  – V*(s) = maxa R(s,a) + γ Σs’ Pr(s’|s,a) V*(s’)
  – Hard to tell what needs to be explored
  – Exploration heuristics: ε-greedy, Boltzmann, etc.

• Bayesian RL:
  – V*(s,b) = maxa R(s,a) + γ Σs’ Pr(s’|s,b,a) V*(s’,bsas’)
  – Belief b tells us which parts of the model are not well known and therefore worth exploring
  – Optimal exploration/exploitation tradeoff
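
To see what planning in the augmented (state, belief) space looks like, the sketch below performs a brute-force depth-limited lookahead on a tiny invented two-state problem, using Dirichlet counts as the belief and the posterior mean as Pr(s'|s,b,a). It is only meant to expose the recursion in the Bayesian Bellman equation, not to stand in for an efficient solver such as BEETLE (next slides).

```python
GAMMA = 0.9
STATES = ["s1", "s2"]
ACTIONS = ["a1", "a2"]
R = {("s1", "a1"): 0.0, ("s1", "a2"): 0.5,      # invented rewards
     ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}

def predictive(counts, s, a):
    """Pr(s'|s,b,a) under a Dirichlet belief b = counts: the posterior mean."""
    row = counts[(s, a)]
    total = sum(row.values())
    return {s2: c / total for s2, c in row.items()}

def V(s, counts, depth):
    """Depth-limited V*(s,b) = max_a R(s,a) + gamma * sum_s' Pr(s'|s,b,a) V*(s', b_sas')."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        q = R[(s, a)]
        for s2, p in predictive(counts, s, a).items():
            # Updated belief b_sas': increment the corresponding Dirichlet count.
            counts2 = {k: dict(v) for k, v in counts.items()}
            counts2[(s, a)][s2] += 1.0
            q += GAMMA * p * V(s2, counts2, depth - 1)
        best = max(best, q)
    return best

prior = {(s, a): {s2: 1.0 for s2 in STATES} for s in STATES for a in ACTIONS}
print(V("s1", prior, depth=3))
```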

Page 25:

Value Function Parameterization

• Theorem: V* is the upper envelope of a set of multivariate polynomials: Vs(θ) = maxi polyi(θ)
• Proof: by induction
  – Define the value function in terms of θ instead of b
    • i.e., V*(s,b) = ∫θ b(θ) Vs(θ) dθ
  – Bellman’s equation:
    • Vs(θ) = maxa R(s,a) + γ Σs’ Pr(s’|s,a,θ) Vs’(θ)
            = maxa ka + γ Σs’ θsas’ maxi polyi(θ)
            = maxj polyj(θ)

Page 26:

BEETLE Algorithm

• Sample a set of reachable belief points B
• V ← {0}
• Repeat
  – V’ ← {}
  – For each b in B, compute a multivariate polynomial:
    • polyas’(θ) ← argmaxpoly∈V ∫θ bsas’(θ) poly(θ) dθ
    • a* ← argmaxa ∫θ b(θ) [R(s,a) + γ Σs’ θsas’ polyas’(θ)] dθ
    • poly(θ) ← R(s,a*) + γ Σs’ θsa*s’ polya*s’(θ)
    • V’ ← V’ ∪ {poly}
  – V ← V’

Page 27:

Bayesian RL

• Summary:
  – Optimizes the exploration/exploitation tradeoff
  – Easily encodes prior knowledge to reduce exploration

• Potential for SDS:
  – Online user modeling:
    • Tailor the model to a specific user with as little exploration as possible
  – Offline user modeling:
    • Large corpus of unlabeled dialogues
    • Labeling takes time
    • Automated selection of a subset of dialogues to be labeled
    • Active learning: Jaulmes, Pineau et al. (2005)

Page 28:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 29:

Reward Function

• MDPs: T and R → π
• RL: s,a,s’ triples and R → π

• But R is often difficult to specify!

• SDS booking system:
  – Correct booking: large positive reward
  – Incorrect booking: large negative reward
  – Cost per question: ???
  – Cost per confirmation: ???
  – User frustration: ???

Page 30:

Apprenticeship Learning

• Sometimes the expert policy π+ is observable

• Apprenticeship learning:
  – Imitation: π+ → π
    • When T doesn’t change, just imitate the policy directly
  – Inverse RL: π+ and s,a,s’ → R
    • When T could change, estimate R
    • Then do RL: s,a,s’ and R → π

• Different SDSs have different policies because of different scenarios, but perhaps the same R can be used.

Page 31:

Inverse RL

• In AI: Ng and Russell (2000), Abbeel and Ng (2004), Ramachandran and Amir (2006)

• Bellman’s equation:
  – V(st) = maxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)

• Idea: find R such that π+ is optimal according to Bellman’s equation.

• Bayesian Inverse RL:
  – Prior: Pr(R)
  – Posterior: Pr(R|s,a,s’,a’,s’’,a’’,…)
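
One crude way to see the idea (not the algorithm of any of the cited papers): sample candidate reward functions from a prior, solve each candidate MDP, and keep the rewards whose optimal policy reproduces the observed expert policy; the surviving samples approximate a posterior Pr(R | expert behaviour) under that prior. Everything below (the two-state MDP, the expert policy, the uniform prior over rewards) is invented for illustration.

```python
import random

STATES, ACTIONS, GAMMA = ["s1", "s2"], ["a1", "a2"], 0.9
# Known transition model (invented), as in the IRL setting where T is given.
T = {("s1", "a1"): {"s1": 0.9, "s2": 0.1}, ("s1", "a2"): {"s2": 1.0},
     ("s2", "a1"): {"s1": 1.0},            ("s2", "a2"): {"s2": 0.8, "s1": 0.2}}
expert = {"s1": "a2", "s2": "a2"}          # observed expert policy pi+

def greedy_policy(R, iters=100):
    """Solve the MDP for reward R by value iteration and return the greedy policy."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in ACTIONS) for s in STATES}
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] +
                   GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in STATES}

# Rejection sampling: uniform prior over rewards in [-1, 1], keep the samples
# whose optimal policy matches the expert everywhere.
posterior = []
for _ in range(2000):
    R = {(s, a): random.uniform(-1, 1) for s in STATES for a in ACTIONS}
    if greedy_policy(R) == expert:
        posterior.append(R)
print(len(posterior), "of 2000 sampled reward functions explain the expert policy")
```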

Page 32:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 33:

Partially Observable RL

• States are rarely observable
• Noisy sensors: measurements are correlated with states of the world
• Extend Markov models to account for sensor noise

• Recall:
  – Markov Process → HMM
  – MDP → POMDP
  – RL → PORL

Page 34:

Hidden Markov Model

• Intuition: Markov process with…
  – Observation variables

• Example: speech recognition

[Diagram: hidden state chain s0 → s1 → s2 → s3 → s4 with observations o1…o4 emitted by the corresponding states]

Page 35:

Hidden Markov Model

• Definition
  – Set of states: S
  – Set of observations: O
  – Transition model: Pr(st|st-1)
  – Observation model: Pr(ot|st)
  – Prior: Pr(s0)

• Belief monitoring:
  – Prior: b(s) = Pr(s)
  – Posterior: bao(s’) = Pr(s’|a,o) = k Σs b(s) Pr(s’|s,a) Pr(o|s’)
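
Belief monitoring is a normalized sum, exactly as in the formula above. A minimal sketch with an invented two-state model (the action-conditioned transition Pr(s'|s,a) anticipates the POMDP case on the following slides):

```python
# Invented two-state model: the user's goal is either "book" or "cancel".
states = ["book", "cancel"]

def trans(s2, s, a):
    """Transition model Pr(s'|s,a): here the goal simply persists."""
    return 1.0 if s2 == s else 0.0

def obs(o, s2):
    """Observation model Pr(o|s'): the ASR output is right 80% of the time."""
    return 0.8 if o == s2 else 0.2

def belief_update(b, a, o):
    """b_ao(s') = k * sum_s b(s) Pr(s'|s,a) Pr(o|s')."""
    unnorm = {s2: obs(o, s2) * sum(b[s] * trans(s2, s, a) for s in states)
              for s2 in states}
    k = 1.0 / sum(unnorm.values())
    return {s2: k * p for s2, p in unnorm.items()}

b = {"book": 0.5, "cancel": 0.5}           # uniform prior
for o in ["book", "book", "cancel"]:       # a noisy sequence of ASR outputs
    b = belief_update(b, a="ask", o=o)
    print(o, {s: round(p, 3) for s, p in b.items()})
```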

Page 36:

Partially Observable MDP

• Intuition: HMM with…
  – Decision nodes
  – Utility nodes

[Diagram: state chain s0 → s1 → s2 → s3 → s4 with action nodes a0…a3, reward nodes r1…r4, and observation nodes o1…o4]

Page 37:

Partially Observable MDP

• Definition
  – Set of states: S
  – Set of actions: A
  – Set of observations: O
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Observation model: Z(st,ot) = Pr(ot|st)
  – Reward model: R(st,at) = rt

• POMDPs for SDS: Roy et al. (2000), Zhang et al. (2001), Williams et al. (2006), Atrash and Pineau (2006)

Page 38:

Partially Observable RL

• Definition (same components as a POMDP)
  – Set of states: S
  – Set of actions: A
  – Set of observations: O
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Observation model: Z(st,ot) = Pr(ot|st)
  – Reward model: R(st,at) = rt

• NB: S is generally unknown since it is an unobservable quantity

Page 39:

PORL Algorithms

• Model-free PORL:
  – Stochastic gradient descent

• Model-based PORL:
  – Assume S; learn T and Z from a,o,a’,o’,… sequences
    • E.g., EM algorithm for HMMs
    • But S is really unknown
    • In SDS, S may refer to user intentions, mental state, language knowledge, etc.
  – Learn S, T and Z from a,o,a’,o’,… sequences
    • E.g., predictive state representations

Page 40:

Sufficient Statistics

• Beliefs are sufficient statistics to predict future observations
  – Pr(o|b) = Σs b(s) Pr(o|s)
  – Pr(o’|b,o,a) = k Σs b(s) Pr(o|s) Σs’ Pr(s’|s,a) Pr(o’|s’) = Σs’ bo,a(s’) Pr(o’|s’) = Pr(o’|bo,a)
  – …

• Are there more succinct sufficient statistics?

Page 41:

Predictive State Representations

• Belief b
  – Vector of probabilities
  – Information to predict future observations
  – After each o,a pair, b is updated to bo,a using T and Z

• Idea: find a sufficient statistic x such that
  – x is a vector of real numbers
  – x is a smaller vector than b
  – There exist functions f and g such that
    • f(x) = Pr(o|b)
    • g(x,a,o) = xo,a and f(g(x,a,o)) = Pr(o’|b,o,a)
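
For intuition, here is a hand-constructed linear PSR for a tiny, fully observable two-state Markov chain with a single action. The prediction vector x holds the predictions of two core tests, and f and g take the standard linear form Pr(o|h,a) = m_ao · x and x' = (per-test weights · x) / (m_ao · x). The weight vectors below were worked out by hand for this toy process (A → A w.p. 0.9, B → A w.p. 0.5, observation = current state); in practice they would be learned from a,o sequences, as discussed on the next slide.

```python
import numpy as np

# Core tests: q0 = null test (prediction always 1), q1 = "next observation is A".
# PSR state x = [Pr(q0|h), Pr(q1|h)] = [1, Pr(next obs = A | history)].

# Prediction weights m_ao:  Pr(o | h, a) = m_ao . x
m_ao = {("a", "A"): np.array([0.0, 1.0]),
        ("a", "B"): np.array([1.0, -1.0])}

# Update weights (one row per core test), hand-derived for this chain.
m_ao_q = {("a", "A"): np.array([[0.0, 1.0],     # q0: Pr("a A" | h)
                                [0.0, 0.9]]),   # q1: Pr("a A a A" | h)
          ("a", "B"): np.array([[1.0, -1.0],    # q0: Pr("a B" | h)
                                [0.5, -0.5]])}  # q1: Pr("a B a A" | h)

def f(x, a, o):
    """Pr(o | history, a), linear in the core-test predictions x."""
    return float(m_ao[(a, o)] @ x)

def g(x, a, o):
    """Updated prediction vector after executing a and observing o."""
    return (m_ao_q[(a, o)] @ x) / (m_ao[(a, o)] @ x)

x = np.array([1.0, 0.5])            # e.g. the last observation was B
for o in ["A", "A", "B"]:
    print("Pr(%s) = %.2f" % (o, f(x, "a", o)))
    x = g(x, "a", o)
print("final x:", x)                # x[1] is 0.9 right after seeing A, 0.5 after B
```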

Page 42:

Predictive State Representations

• References: Littman et al. (2002), Poupart and Boutilier (2003), Singh et al. (2003), Rudary and Singh (2004), James and Singh (2004), etc.

• Potential for SDS
  – Instead of handcrafting state variables, learn a state representation of users from data
  – Learn a smaller user model

Page 43:

Conclusion

• Overview of Markov models for SDS
• RL topics of active research relevant to SDS:
  – Bayesian RL: tailor the model to specific users
  – Inverse RL: learn the reward model
  – PSR: learn a state representation of the user model

• The fields of machine learning and user modeling could offer more techniques to advance SDS