Topics of Active Research in Reinforcement Learning Relevant to Spoken Dialogue Systems
Pascal Poupart, David R. Cheriton School of Computer Science, University of Waterloo


Page 1:

Topics of Active Research in Reinforcement Learning
Relevant to Spoken Dialogue Systems

Pascal Poupart
David R. Cheriton School of Computer Science

University of Waterloo

Page 2:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 3:

Automated System

• Abstraction:

[Diagram: agent–environment loop — the agent receives percepts and goals and sends actions to the environment]

• Problems:
  – Uncertain action effects
  – Imprecise percepts
  – Unknown environment

Page 4:

Automated System

• Spoken Dialogue System:

[Diagram: booking system ↔ user loop — the user's goal is to book a flight; user utterances pass through a speech recognizer before reaching the system, and the system replies with utterances]

• Problems:
  – User may not hear/understand system utterances
  – Imprecise speech recognizer
  – Unknown user model

Page 5:

Markov Models

[Diagram: family of Markov models — a Markov Process extends to an MDP (decision making), to an HMM / DBN (partial observability), and to a POMDP (both); adding learning gives RL and PO-RL]

Page 6:

Stochastic Process

• World state changes over time
• Convention:
  – Circle: random variable
  – Arc: conditional dependency
• Stochastic dynamics: Pr(st+1|st, …, s0)

[Diagram: chain s0, s1, s2, s3, s4 with arcs from every earlier state — too many dependencies!]

Page 7:

Markov Process

• Markov assumption: the current state depends only on a finite history of past states
  – K-order Markov process: Pr(st|st-1,…,st-k,st-k-1,…,s0) = Pr(st|st-1,…,st-k)

• Example:
  – N-gram model: Pr(wordi|wordi-1, …, wordi-n+1)

[Diagram: chain s0 → s1 → s2 → s3 → s4 with arcs only between consecutive states]
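
As a concrete illustration of a first-order Markov process, the sketch below builds a small transition table Pr(next | current) and samples a sequence from it. The states and probabilities are invented for illustration (they could just as well be words in an N-gram language model).

```python
import random

# Hypothetical first-order transition model Pr(next | current); each row sums to 1.
transitions = {
    "greet":    {"ask_date": 0.7, "ask_city": 0.3},
    "ask_date": {"ask_city": 0.6, "confirm": 0.4},
    "ask_city": {"ask_date": 0.3, "confirm": 0.7},
    "confirm":  {"greet": 1.0},
}

def sample_next(state, rng=random):
    """Sample s_{t+1} ~ Pr(. | s_t) from the transition table."""
    r, cum = rng.random(), 0.0
    for nxt, p in transitions[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point round-off

state, chain = "greet", ["greet"]
for _ in range(6):
    state = sample_next(state)
    chain.append(state)
print(" -> ".join(chain))
```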

Page 8:

Markov Process

• Stationary assumption: dynamics do not change
  – Pr(st|st-1,…,st-k) is the same for all t

• Two slices are sufficient for a first-order Markov process…
  – Graph: St-1 → St
  – Dynamics: Pr(st|st-1)
  – Prior: Pr(s0)

Page 9:

Markov Decision Process

• Intuition: a (first-order) Markov process with…
  – Decision nodes
  – Utility nodes

[Diagram: state chain s0 → s1 → s2 → s3 → s4 with decision nodes a0…a3 and reward nodes r1…r4]

Page 10:

Markov Decision Process

• Definition
  – Set of states: S
  – Set of actions: A
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Reward model: R(st,at) = rt
  – Discount factor: 0 ≤ γ ≤ 1

• Goal: find an optimal policy
  – Policy π: S → A
  – Value: Vπ(s) = Eπ[Σt γ^t rt]
  – Optimal policy π*: Vπ*(s) ≥ Vπ(s) ∀π, s

Page 11:

MDPs for SDS

• MDPs for SDS: Biermann and Long (1996), Levin and Pieraccini (1997), Singh et al. (1999), Levin et al. (2000)

• Flight booking example:
  – State: assignment of values to dep. date, dep. time, dep. city and dest. city
  – Actions: any utterance (e.g., question, confirmation)
  – User model: Pr(user response | sys. utterance, state)
  – Rewards: positive reward for a correct booking, negative reward for an incorrect booking
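
To make the flight-booking formulation concrete, here is one possible encoding of a heavily simplified, single-slot version as the MDP tuple (S, A, T, R, γ). The states, actions, probabilities and rewards below are hypothetical illustrations, not taken from any of the cited systems.

```python
# Hypothetical single-slot booking MDP: the system must elicit the departure
# city and then book the flight.
S = ["empty", "filled", "booked_ok", "booked_wrong"]   # what the system knows
A = ["ask", "book"]
gamma = 0.95

# Transition model T[s][a] = Pr(. | s, a): asking fills the slot most of the
# time (the user may be misunderstood); booking outcome depends on the slot.
T = {
    "empty":        {"ask":  {"filled": 0.8, "empty": 0.2},
                     "book": {"booked_wrong": 1.0}},
    "filled":       {"ask":  {"filled": 1.0},
                     "book": {"booked_ok": 1.0}},
    "booked_ok":    {"ask": {"booked_ok": 1.0},    "book": {"booked_ok": 1.0}},
    "booked_wrong": {"ask": {"booked_wrong": 1.0}, "book": {"booked_wrong": 1.0}},
}

# Reward model R[s][a]: small cost per question, large +/- reward for booking.
R = {
    "empty":        {"ask": -1.0, "book": -20.0},
    "filled":       {"ask": -1.0, "book": 20.0},
    "booked_ok":    {"ask": 0.0, "book": 0.0},
    "booked_wrong": {"ask": 0.0, "book": 0.0},
}
```

Running value iteration on this toy model (see the sketch after the next slide) recovers the obvious policy: ask while the slot is empty, then book once it is filled.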

Page 12:

Value Iteration

• Three families of algorithms:
  – Value iteration, policy iteration, linear programming

• Bellman’s equation:
  – V(st) = maxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)
  – at* = argmaxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)

• Value iteration (backward induction from horizon h):
  – V(sh) = R(sh)
  – V(sh-1) = maxah-1 R(sh-1,ah-1) + γ Σsh Pr(sh|sh-1,ah-1) V(sh)
  – V(sh-2) = maxah-2 R(sh-2,ah-2) + γ Σsh-1 Pr(sh-1|sh-2,ah-2) V(sh-1)
  – …
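
The Bellman backup above maps almost line for line onto code. Below is a minimal infinite-horizon sketch that iterates the backup to convergence and then extracts the greedy policy; it works on dict-based MDPs like the toy booking model above, and the tiny two-state MDP at the bottom is invented purely to exercise the function.

```python
def value_iteration(S, A, T, R, gamma, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s') ] until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                        for a in A)
                 for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < tol:
            break
    # Greedy policy: a*(s) = argmax_a R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')
    pi = {s: max(A, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in S}
    return V, pi

# Tiny invented 2-state MDP, just to run the function end to end.
S = ["low", "high"]
A = ["wait", "act"]
T = {"low":  {"wait": {"low": 1.0},              "act": {"high": 0.6, "low": 0.4}},
     "high": {"wait": {"high": 0.8, "low": 0.2}, "act": {"high": 1.0}}}
R = {"low":  {"wait": 0.0, "act": -1.0},
     "high": {"wait": 2.0, "act": 1.0}}
print(value_iteration(S, A, T, R, gamma=0.9))
```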

Page 13:

Unrealistic Assumptions

• Transition (user) model known:
  – How to learn a good user model?
• Reward model known:
  – How to assess user preferences?
• Speech recognizer flawless:
  – How to account for ASR errors?

[Diagram: the corresponding remedies — RL (learn the transition model), IRL (learn the reward model), and POMDP / PORL (handle ASR errors under partial observability)]

Page 14:

Reinforcement Learning

• Markov Decision Process:
  – S: set of states
  – A: set of actions
  – R(s,a) = r: reward model
  – T(s,a,s’) = Pr(s’|s,a): transition function (unknown in the RL setting)

• RL for SDS: Walker et al. (1998), Singh et al. (1999), Scheffler and Young (1999), Litman et al. (2000), Levin et al. (2000), Pietquin (2004), Georgila et al. (2005), Lewis & Di Fabbrizio (2006)

Page 15:

Algorithms for RL

• Model-based RL:
  – Estimate T from s,a,s’ triples
    • E.g., maximum likelihood: Pr(s’|s,a) = #(s,a,s’) / #(s,a,•)
  – Model learning: offline (corpus of s,a,s’ triples) and/or online (s,a,s’ directly from the environment)

• Model-free RL:
  – Estimate V* and/or π* directly
    • E.g., temporal difference (Q-learning): Q(s,a) ← Q(s,a) + α [R(s,a) + γ maxa’ Q(s’,a’) – Q(s,a)]
  – Learning: offline (s,a,s’ from a simulator) and/or online (s,a,s’ directly from the environment)
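
The temporal-difference rule on this slide is the tabular Q-learning update. Below is a minimal sketch that runs it against a small simulated environment; the two-state dynamics and rewards are invented purely for illustration.

```python
import random
from collections import defaultdict

# Invented 2-state, 2-action simulator standing in for the environment.
def step(s, a):
    if s == "noisy" and a == "confirm":
        s2 = "clear" if random.random() < 0.7 else "noisy"
    elif s == "noisy":               # a == "ask"
        s2 = "clear" if random.random() < 0.4 else "noisy"
    else:                            # s == "clear": episode effectively resets
        s2 = "noisy"
    r = 1.0 if s2 == "clear" else -0.1
    return s2, r

alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["ask", "confirm"]
Q = defaultdict(float)               # Q[(s, a)], initialised to 0

s = "noisy"
for t in range(5000):
    # epsilon-greedy action selection
    a = random.choice(actions) if random.random() < epsilon \
        else max(actions, key=lambda a_: Q[(s, a_)])
    s2, r = step(s, a)
    # TD / Q-learning update: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```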

Page 16:

Successes of RL

• Backgammon [Tesauro 1995]
  – Temporal difference learning
  – Trained by self-play
  – Simulator: the opponent model is the agent itself
  – Offline learning: simulated millions of games

• Helicopter control [Ng et al. 2003, 2004]
  – PEGASUS: stochastic gradient descent
  – Offline learning with a flight simulator

Page 17:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 18:

Assistive Technologies

• Handwashing assistant [Boger et al., IJCAI-05]

• Use RL to adapt to users
  – Start with a basic user model
  – Online learning: adjust the model as the system interacts with users

• Bear cost of actions
• Cannot explore too much
• Real-time response

Page 19:

Bayesian Model-based RL

• Formalized in Operations Research by Howard and his students at MIT in the 1960s
• In AI: Kaelbling (1992), Meuleau and Bourgine (1999), Dearden et al. (1998, 1999), Strens (2000), Duff (2003), Wang et al. (2005), Poupart et al. (2006)

• Advantages
  – Optimal exploration/exploitation tradeoff
  – Encode prior knowledge → less data required

Page 20:

Bayesian Model-based RL

• Disadvantage:
  – Computationally complex

• Poupart et al. (ICML 2006):
  – Optimal value function has a simple parameterization
    • i.e., the upper envelope of a set of multivariate polynomials
  – BEETLE: Bayesian Exploration/Exploitation Tradeoff in LEarning
    • Exploits the polynomial parameterization

Page 21:

Bayesian RL

• Basic idea: encode the unknown probabilities as random variables θ
  – i.e., θsas’ = Pr(s’|s,a): random variable in [0,1]
  – i.e., θsa = Pr(•|s,a): multinomial distribution

• Model learning: update Pr(θ)
• Pr(θ) tells us which parts of the model are not well known and therefore worth exploring

Page 22:

Model Learning

• Assume a prior b(θsa) = Pr(θsa)
• Learning: compute the posterior given s,a,s’
  – bsas’(θsa) = k Pr(θsa) Pr(s’|s,a,θsa) = k b(θsa) θsas’

• Conjugate prior: Dirichlet prior → Dirichlet posterior
  – b(θsa) = Dir(θsa; nsa) = k Πs’’ (θsas’’)^(nsas’’ – 1)
  – bsas’(θsa) = k b(θsa) θsas’ = k Πs’’ (θsas’’)^(nsas’’ – 1 + δ(s’,s’’)) = k Dir(θsa; nsa + δ(s’,s’’))
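
Because the Dirichlet is conjugate, the posterior update amounts to incrementing a count. A minimal sketch (the state and action names are placeholders):

```python
states = ["empty", "filled"]
actions = ["ask"]

# Dirichlet hyperparameters n[(s,a)][s'] for each unknown distribution Pr(.|s,a),
# starting from a uniform prior (one pseudo-count per successor state).
n = {(s, a): {s2: 1.0 for s2 in states} for s in states for a in actions}

def update(s, a, s2):
    """Posterior after observing a transition: Dir(n_sa) -> Dir(n_sa + delta(s'))."""
    n[(s, a)][s2] += 1.0

def predictive(s, a, s2):
    """Posterior predictive Pr(s'|s,a) = E[theta_sas'] = n_sas' / sum_s'' n_sas''."""
    return n[(s, a)][s2] / sum(n[(s, a)].values())

update("empty", "ask", "filled")
update("empty", "ask", "filled")
print(predictive("empty", "ask", "filled"))   # (1 + 2) / (2 + 2) = 0.75
```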

Page 23:

Prior Knowledge

• Structural priors
  – Tie identical parameters
    • If Pr(•|s,a) = Pr(•|s’,a’) then θsa = θs’a’
  – Factored representation
    • DBN: unknown conditional distributions

• Informative priors
  – No knowledge: uniform Dirichlet
  – If (θ1, θ2) ~ (0.2, 0.8), then set (n1, n2) to (0.2k, 0.8k)
    • k indicates the level of confidence

[Figure: Dirichlet (Beta) densities over p ∈ [0,1] — Dir(p; 1, 1), Dir(p; 2, 8) and Dir(p; 20, 80) become increasingly peaked around p = 0.2 as confidence grows]
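
The (0.2k, 0.8k) recipe is a one-liner; a small sketch with an invented prior guess:

```python
def dirichlet_counts(guess, k):
    """Turn a guessed distribution and a confidence level k into Dirichlet pseudo-counts."""
    return {outcome: p * k for outcome, p in guess.items()}

guess = {"understands": 0.8, "misunderstands": 0.2}   # hypothetical prior belief
for k in (1, 10, 100):
    print(k, dirichlet_counts(guess, k))
# Larger k means more pseudo-counts, so the posterior moves away from the guess more slowly.
```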

Page 24:

Policy Optimization

• Classic RL:
  – V*(s) = maxa R(s,a) + γ Σs’ Pr(s’|s,a) V*(s’)
  – Hard to tell what needs to be explored
  – Exploration heuristics: ε-greedy, Boltzmann, etc.

• Bayesian RL:
  – V*(s,b) = maxa R(s,a) + γ Σs’ Pr(s’|s,b,a) V*(s’,bsas’)
  – Belief b tells us which parts of the model are not well known and therefore worth exploring
  – Optimal exploration/exploitation tradeoff
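
To see what planning in the augmented (state, belief) space looks like, the sketch below performs a brute-force depth-limited lookahead on a tiny invented two-state problem, using Dirichlet counts as the belief and the posterior mean as Pr(s'|s,b,a). It is only meant to expose the recursion in the Bayesian Bellman equation, not to stand in for an efficient solver such as BEETLE (next slides).

```python
GAMMA = 0.9
STATES = ["s1", "s2"]
ACTIONS = ["a1", "a2"]
R = {("s1", "a1"): 0.0, ("s1", "a2"): 0.5,      # invented rewards
     ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}

def predictive(counts, s, a):
    """Pr(s'|s,b,a) under a Dirichlet belief b = counts: the posterior mean."""
    row = counts[(s, a)]
    total = sum(row.values())
    return {s2: c / total for s2, c in row.items()}

def V(s, counts, depth):
    """Depth-limited V*(s,b) = max_a R(s,a) + gamma * sum_s' Pr(s'|s,b,a) V*(s', b_sas')."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        q = R[(s, a)]
        for s2, p in predictive(counts, s, a).items():
            # Updated belief b_sas': increment the corresponding Dirichlet count.
            counts2 = {k: dict(v) for k, v in counts.items()}
            counts2[(s, a)][s2] += 1.0
            q += GAMMA * p * V(s2, counts2, depth - 1)
        best = max(best, q)
    return best

prior = {(s, a): {s2: 1.0 for s2 in STATES} for s in STATES for a in ACTIONS}
print(V("s1", prior, depth=3))
```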

Page 25:

Value Function Parameterization

• Theorem: V* is the upper envelope of a set of multivariate polynomials: Vs(θ) = maxi polyi(θ)
• Proof: by induction
  – Define the value function in terms of θ instead of b
    • i.e., V*(s,b) = ∫θ b(θ) Vs(θ) dθ
  – Bellman’s equation:
    • Vs(θ) = maxa R(s,a) + γ Σs’ Pr(s’|s,a,θ) Vs’(θ)
            = maxa ka + γ Σs’ θsas’ maxi polyi(θ)
            = maxj polyj(θ)

Page 26:

BEETLE Algorithm

• Sample a set of reachable belief points B
• V ← {0}
• Repeat
  – V’ ← {}
  – For each b in B, compute a multivariate polynomial:
    • polyas’(θ) ← argmaxpoly∈V ∫θ bsas’(θ) poly(θ) dθ
    • a* ← argmaxa ∫θ b(θ) [R(s,a) + γ Σs’ θsas’ polyas’(θ)] dθ
    • poly(θ) ← R(s,a*) + γ Σs’ θsa*s’ polya*s’(θ)
    • V’ ← V’ ∪ {poly}
  – V ← V’

Page 27:

Bayesian RL

• Summary:
  – Optimizes the exploration/exploitation tradeoff
  – Easily encodes prior knowledge to reduce exploration

• Potential for SDS:
  – Online user modeling:
    • Tailor the model to a specific user with as little exploration as possible
  – Offline user modeling:
    • Large corpus of unlabeled dialogues
    • Labeling takes time
    • Automated selection of a subset of dialogues to be labeled
    • Active learning: Jaulmes, Pineau et al. (2005)

Page 28:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 29:

Reward Function

• MDPs: T and R → π
• RL: s,a,s’ triples and R → π

• But R is often difficult to specify!

• SDS booking system:
  – Correct booking: large positive reward
  – Incorrect booking: large negative reward
  – Cost per question: ???
  – Cost per confirmation: ???
  – User frustration: ???

Page 30:

Apprenticeship Learning

• Sometimes the expert policy π+ is observable

• Apprenticeship learning:
  – Imitation: π+ → π
    • When T doesn’t change, just imitate the policy directly
  – Inverse RL: π+ and s,a,s’ → R
    • When T could change, estimate R
    • Then do RL: s,a,s’ and R → π

• Different SDSs have different policies because of different scenarios, but perhaps the same R can be used.

Page 31:

Inverse RL

• In AI: Ng and Russell (2000), Abbeel and Ng (2004), Ramachandran and Amir (2006)

• Bellman’s equation:
  – V(st) = maxat R(st,at) + γ Σst+1 Pr(st+1|st,at) V(st+1)

• Idea: find R such that π+ is optimal according to Bellman’s equation.

• Bayesian Inverse RL:
  – Prior: Pr(R)
  – Posterior: Pr(R|s,a,s’,a’,s’’,a’’,…)
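
One crude way to see the idea (not the algorithm of any of the cited papers): sample candidate reward functions from a prior, solve each candidate MDP, and keep the rewards whose optimal policy reproduces the observed expert policy; the surviving samples approximate a posterior Pr(R | expert behaviour) under that prior. Everything below (the two-state MDP, the expert policy, the uniform prior over rewards) is invented for illustration.

```python
import random

STATES, ACTIONS, GAMMA = ["s1", "s2"], ["a1", "a2"], 0.9
# Known transition model (invented), as in the IRL setting where T is given.
T = {("s1", "a1"): {"s1": 0.9, "s2": 0.1}, ("s1", "a2"): {"s2": 1.0},
     ("s2", "a1"): {"s1": 1.0},            ("s2", "a2"): {"s2": 0.8, "s1": 0.2}}
expert = {"s1": "a2", "s2": "a2"}          # observed expert policy pi+

def greedy_policy(R, iters=100):
    """Solve the MDP for reward R by value iteration and return the greedy policy."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in ACTIONS) for s in STATES}
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] +
                   GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in STATES}

# Rejection sampling: uniform prior over rewards in [-1, 1], keep the samples
# whose optimal policy matches the expert everywhere.
posterior = []
for _ in range(2000):
    R = {(s, a): random.uniform(-1, 1) for s in STATES for a in ACTIONS}
    if greedy_policy(R) == expert:
        posterior.append(R)
print(len(posterior), "of 2000 sampled reward functions explain the expert policy")
```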

Page 32:

Outline

• Review
  – Markov Models
  – Reinforcement Learning

• Some areas of active research relevant to SDS
  – Bayesian Reinforcement Learning (BRL)
  – Inverse Reinforcement Learning (IRL)
  – Predictive State Representations (PSRs)

• Conclusion

Page 33:

Partially Observable RL

• States are rarely observable
• Noisy sensors: measurements are correlated with states of the world
• Extend Markov models to account for sensor noise

• Recall:
  – Markov Process → HMM
  – MDP → POMDP
  – RL → PORL

Page 34:

Hidden Markov Model

• Intuition: Markov process with…
  – Observation variables

• Example: speech recognition

[Diagram: hidden state chain s0 → s1 → s2 → s3 → s4 with observations o1…o4 emitted by the corresponding states]

Page 35:

Hidden Markov Model

• Definition
  – Set of states: S
  – Set of observations: O
  – Transition model: Pr(st|st-1)
  – Observation model: Pr(ot|st)
  – Prior: Pr(s0)

• Belief monitoring:
  – Prior: b(s) = Pr(s)
  – Posterior: bao(s’) = Pr(s’|a,o) = k Σs b(s) Pr(s’|s,a) Pr(o|s’)
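
Belief monitoring is a normalized sum, exactly as in the formula above. A minimal sketch with an invented two-state model (the action-conditioned transition Pr(s'|s,a) anticipates the POMDP case on the following slides):

```python
# Invented two-state model: the user's goal is either "book" or "cancel".
states = ["book", "cancel"]

def trans(s2, s, a):
    """Transition model Pr(s'|s,a): here the goal simply persists."""
    return 1.0 if s2 == s else 0.0

def obs(o, s2):
    """Observation model Pr(o|s'): the ASR output is right 80% of the time."""
    return 0.8 if o == s2 else 0.2

def belief_update(b, a, o):
    """b_ao(s') = k * sum_s b(s) Pr(s'|s,a) Pr(o|s')."""
    unnorm = {s2: obs(o, s2) * sum(b[s] * trans(s2, s, a) for s in states)
              for s2 in states}
    k = 1.0 / sum(unnorm.values())
    return {s2: k * p for s2, p in unnorm.items()}

b = {"book": 0.5, "cancel": 0.5}           # uniform prior
for o in ["book", "book", "cancel"]:       # a noisy sequence of ASR outputs
    b = belief_update(b, a="ask", o=o)
    print(o, {s: round(p, 3) for s, p in b.items()})
```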

Page 36:

Partially Observable MDP

• Intuition: HMM with…
  – Decision nodes
  – Utility nodes

[Diagram: state chain s0 → s1 → s2 → s3 → s4 with action nodes a0…a3, reward nodes r1…r4, and observation nodes o1…o4]

Page 37:

Partially Observable MDP

• Definition
  – Set of states: S
  – Set of actions: A
  – Set of observations: O
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Observation model: Z(st,ot) = Pr(ot|st)
  – Reward model: R(st,at) = rt

• POMDPs for SDS: Roy et al. (2000), Zhang et al. (2001), Williams et al. (2006), Atrash and Pineau (2006)

Page 38:

Partially Observable RL

• Definition (same components as a POMDP)
  – Set of states: S
  – Set of actions: A
  – Set of observations: O
  – Transition model: T(st-1,at-1,st) = Pr(st|at-1,st-1)
  – Observation model: Z(st,ot) = Pr(ot|st)
  – Reward model: R(st,at) = rt

• NB: S is generally unknown since it is an unobservable quantity

Page 39:

PORL Algorithms

• Model-free PORL:
  – Stochastic gradient descent

• Model-based PORL:
  – Assume S; learn T and Z from a,o,a’,o’,… sequences
    • E.g., EM algorithm for HMMs
    • But S is really unknown
    • In SDS, S may refer to user intentions, mental state, language knowledge, etc.
  – Learn S, T and Z from a,o,a’,o’,… sequences
    • E.g., predictive state representations

Page 40:

Sufficient Statistics

• Beliefs are sufficient statistics to predict future observations
  – Pr(o|b) = Σs b(s) Pr(o|s)
  – Pr(o’|b,o,a) = k Σs b(s) Pr(o|s) Σs’ Pr(s’|s,a) Pr(o’|s’) = Σs’ bo,a(s’) Pr(o’|s’) = Pr(o’|bo,a)
  – …

• Are there more succinct sufficient statistics?

Page 41:

Predictive State Representations

• Belief b
  – Vector of probabilities
  – Information to predict future observations
  – After each o,a pair, b is updated to bo,a using T and Z

• Idea: find a sufficient statistic x such that
  – x is a vector of real numbers
  – x is a smaller vector than b
  – There exist functions f and g such that
    • f(x) = Pr(o|b)
    • g(x,a,o) = xo,a and f(g(x,a,o)) = Pr(o’|b,o,a)
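
For intuition, here is a hand-constructed linear PSR for a tiny, fully observable two-state Markov chain with a single action. The prediction vector x holds the predictions of two core tests, and f and g take the standard linear form Pr(o|h,a) = m_ao · x and x' = (per-test weights · x) / (m_ao · x). The weight vectors below were worked out by hand for this toy process (A → A w.p. 0.9, B → A w.p. 0.5, observation = current state); in practice they would be learned from a,o sequences, as discussed on the next slide.

```python
import numpy as np

# Core tests: q0 = null test (prediction always 1), q1 = "next observation is A".
# PSR state x = [Pr(q0|h), Pr(q1|h)] = [1, Pr(next obs = A | history)].

# Prediction weights m_ao:  Pr(o | h, a) = m_ao . x
m_ao = {("a", "A"): np.array([0.0, 1.0]),
        ("a", "B"): np.array([1.0, -1.0])}

# Update weights (one row per core test), hand-derived for this chain.
m_ao_q = {("a", "A"): np.array([[0.0, 1.0],     # q0: Pr("a A" | h)
                                [0.0, 0.9]]),   # q1: Pr("a A a A" | h)
          ("a", "B"): np.array([[1.0, -1.0],    # q0: Pr("a B" | h)
                                [0.5, -0.5]])}  # q1: Pr("a B a A" | h)

def f(x, a, o):
    """Pr(o | history, a), linear in the core-test predictions x."""
    return float(m_ao[(a, o)] @ x)

def g(x, a, o):
    """Updated prediction vector after executing a and observing o."""
    return (m_ao_q[(a, o)] @ x) / (m_ao[(a, o)] @ x)

x = np.array([1.0, 0.5])            # e.g. the last observation was B
for o in ["A", "A", "B"]:
    print("Pr(%s) = %.2f" % (o, f(x, "a", o)))
    x = g(x, "a", o)
print("final x:", x)                # x[1] is 0.9 right after seeing A, 0.5 after B
```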

Page 42:

Predictive State Representations

• References: Littman et al. (2002), Poupart and Boutilier (2003), Singh et al. (2003), Rudary and Singh (2004), James and Singh (2004), etc.

• Potential for SDS
  – Instead of handcrafting state variables, learn a state representation of users from data
  – Learn a smaller user model

Page 43:

Conclusion

• Overview of Markov models for SDS
• RL topics of active research relevant to SDS:
  – Bayesian RL: tailor the model to specific users
  – Inverse RL: learn the reward model
  – PSR: learn a state representation of the user model

• The fields of machine learning and user modeling could offer more techniques to advance SDS