
  • Stochastic optimal control theory with applications in neuroscience

    Bert Kappen

    SNN Donders Institute, Radboud University, Nijmegen

    Gatsby Unit, UCL London

    August 26, 2013

  • How to control a device?

    Plant is unknown

    Exploration of state space

    Motor babbling in infants

    Problem for brains and for robots


  • How to find your way home?

    How to navigate to previously visited locations?


  • Intractability due to uncertainty

    Noise affects the optimal control qualitatively.

    Optimal control computation is only tractable for simple cases:
    - deterministic problems, using the PMP approach
    - LQ problems


  • The big idea

    Linear Bellman equation and path integral: express a control computation as an inference computation.

    Approximate inference: intractable inference problems can be made efficient using statistical physics methods.


  • Outline

    • Link between control theory, inference and statistical physics
      – Hopf '50, Fleming Mitter '82, Kappen '05

    • How to control a device?
      – Motor babbling as importance sampling

    • How to find your way home?
      – KL control theory
      – Efficient alternative for RL
      – Model of hippocampus
      – Computation by simulation

  • Discrete time optimal control

    Consider the control of a discrete time deterministic dynamical system:

    x_{t+1} = x_t + f(x_t, u_t), \qquad t = 0, 1, \ldots, T-1

    x_t describes the state and u_t specifies the control or action at time t.

    Given x_0 and u_{0:T-1}, we can compute x_{1:T}.

    Define a cost for each sequence of controls:

    C(x_0, u_{0:T-1}) = \sum_{t=0}^{T-1} R(x_t, u_t)

    Find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).

  • Dynamic programming

    Find the minimal cost path from A to J.

    [Figure: stage-wise graph from A to J with edge costs]

    C(J) = 0, \quad C(H) = 3, \quad C(I) = 4

    C(F) = \min(6 + C(H), 3 + C(I)) = 7

    The minimal cost at time t is easily expressible in terms of the minimal cost at time t+1.

  • Discrete time optimal control

    Dynamic programming uses the concept of the optimal cost-to-go J(t, x).

    One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

    J(t, x_t) = \min_{u_{t:T-1}} \sum_{s=t}^{T-1} R(x_s, u_s) = \min_{u_t} \left( R(x_t, u_t) + J(t+1, x_t + f(x_t, u_t)) \right)

    J(T, x) = 0

    J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})

    This is called the Bellman equation.

    Computes u_t(x) for all intermediate t, x.
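    To make the backward recursion concrete, here is a minimal Python sketch (not on the original slides) of the sweep on the A-to-J example. Only C(H) = 3, C(I) = 4 and the two edges into F are fixed by the slides; all other edge costs are assumptions for illustration.

    # Backward Bellman sweep on the A-to-J example. Edge costs marked
    # "assumed" are illustrative; the slide only fixes C(H) = 3, C(I) = 4
    # and the edges leaving F.
    edges = {
        "A": {"B": 2, "C": 4},   # assumed
        "B": {"F": 1, "G": 6},   # assumed
        "C": {"F": 3, "G": 3},   # assumed
        "F": {"H": 6, "I": 3},   # from the slide
        "G": {"H": 2, "I": 2},   # assumed
        "H": {"J": 3},           # gives C(H) = 3
        "I": {"J": 4},           # gives C(I) = 4
    }

    cost_to_go = {"J": 0.0}      # boundary condition at the goal
    policy = {}
    for stage in (["H", "I"], ["F", "G"], ["B", "C"], ["A"]):
        for x in stage:          # Bellman: C(x) = min_u (edge cost + C(next))
            y, c = min(((y, w + cost_to_go[y]) for y, w in edges[x].items()),
                       key=lambda pair: pair[1])
            cost_to_go[x], policy[x] = c, y

    print(cost_to_go["F"])       # 7.0, matching C(F) = min(6 + 3, 3 + 4) = 7
    print(policy)                # optimal successor at each node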

  • Stochastic optimal control

    Consider a stochastic dynamical system

    dx_i = f_i(x, u)\,dt + d\xi_i, \qquad \langle d\xi_i\, d\xi_j \rangle = \nu_{ij}\,dt

    Given x(0), find the control sequence u(0 \to T) that minimizes the expected future cost

    C = \left\langle \phi(x(T)) + \int_0^T dt\, R(x(t), u(t)) \right\rangle

    The expectation is over all trajectories given the control path.

    J(t, x) = \min_u \left( R(x, u)\,dt + \langle J(t+dt, x+dx) \rangle \right)

    -\partial_t J(t, x) = \min_u \left( R(x, u) + f(x, u)^T \nabla_x J(x, t) + \frac{1}{2} \mathrm{Tr}\left( \nu \nabla_x^2 J(x, t) \right) \right)

    with boundary condition J(x, T) = \phi(x). This is the Hamilton-Jacobi-Bellman (HJB) equation.

  • Path integral control theory

    dx = f(x, t)\,dt + g(x, t)(u\,dt + d\xi)

    C = \left\langle \phi(x(T)) + \int_t^T ds \left( V(x(s), s) + \frac{1}{2} u^T R u \right) \right\rangle

    with \langle d\xi_a\, d\xi_b \rangle = \nu_{ab}\,dt and R = \lambda \nu^{-1}, \lambda > 0.

    The HJB equation becomes

    -\partial_t J = \min_u \left( \frac{1}{2} u^T R u + V + (f + gu)^T \nabla J + \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 J \right) \right)

    with boundary condition J(x, T) = \phi(x).

  • Path integral control theory

    Minimization wrt u yields the non-linear HJB:

    u = -R^{-1} g^T \nabla J

    -\partial_t J = -\frac{1}{2} (\nabla J)^T g R^{-1} g^T (\nabla J) + V + f^T \nabla J + \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 J \right)

    Define \psi(x, t) through J(x, t) = -\lambda \log \psi(x, t). We obtain a linear HJB:

    \partial_t \psi = \left( \frac{V}{\lambda} - f^T \nabla - \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 \right) \right) \psi
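    A short check, filled in here because the cancellation is the crux of the linearization (this algebra is not spelled out on the slide). Substituting J = -\lambda \log \psi gives

    \nabla J = -\lambda \frac{\nabla \psi}{\psi}, \qquad
    \nabla^2 J = -\lambda \left( \frac{\nabla^2 \psi}{\psi} - \frac{\nabla \psi\, \nabla \psi^T}{\psi^2} \right), \qquad
    \partial_t J = -\lambda \frac{\partial_t \psi}{\psi}

    so that, using R^{-1} = \nu / \lambda,

    -\frac{1}{2} (\nabla J)^T g R^{-1} g^T (\nabla J) = -\frac{\lambda}{2} \frac{(\nabla \psi)^T g \nu g^T \nabla \psi}{\psi^2}

    \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \nabla^2 J \right) = -\frac{\lambda}{2} \frac{\mathrm{Tr}\left( g \nu g^T \nabla^2 \psi \right)}{\psi} + \frac{\lambda}{2} \frac{(\nabla \psi)^T g \nu g^T \nabla \psi}{\psi^2}

    The two terms quadratic in \nabla \psi cancel, and multiplying the remaining equation by \psi / \lambda yields the linear HJB above.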

  • Feynman-Kac formula

    Denote by Q(\tau | x, t) the distribution over uncontrolled trajectories that start at x, t:

    dx = f(x, t)\,dt + g(x, t)\,d\xi

    with \tau a trajectory x(t \to T). Then

    \psi(x, t) = \int dQ(\tau | x, t) \exp\left( -\frac{S(\tau)}{\lambda} \right) = E_Q\left( e^{-S/\lambda} \right)

    S(\tau) = \phi(x(T)) + \int_t^T ds\, V(x(s), s)

    \psi can be computed by forward sampling of the uncontrolled process.
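    As a sanity check, a minimal Monte Carlo sketch of this forward sampling (not from the slides; the 1-d choices of f, g, V, \phi below are assumptions for illustration):

    import numpy as np

    # Estimate psi(x, t) = E_Q[exp(-S/lambda)] by forward sampling of the
    # uncontrolled process dx = f dt + g dxi.
    def psi_mc(x0, t, T, lam=1.0, nu=1.0, n_samples=100_000, dt=0.01):
        f = lambda x: 0.0 * x            # assumed uncontrolled drift
        g = 1.0                          # assumed constant noise gain
        V = lambda x: 0.0 * x            # assumed path cost
        phi = lambda x: 0.5 * x**2       # assumed end cost
        n_steps = int(round((T - t) / dt))
        x = np.full(n_samples, float(x0))
        S = np.zeros(n_samples)
        for _ in range(n_steps):
            S += V(x) * dt
            x += f(x) * dt + g * np.sqrt(nu * dt) * np.random.randn(n_samples)
        S += phi(x)
        return np.exp(-S / lam).mean()   # psi(x0, t); J = -lam log(psi)

    # With f = V = 0, x(T) ~ N(x0, nu (T - t)), so for x0 = 0, T = 1,
    # lam = nu = 1 the exact value is E[exp(-x^2/2)] = 1/sqrt(2) ~ 0.707.
    print(psi_mc(x0=0.0, t=0.0, T=1.0))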

  • Posterior distribution over optimal trajectories

    \psi(x, t) can be interpreted as a partition sum for the distribution over paths under optimal control:

    P(\tau | x, t) = \frac{1}{\psi(x, t)} Q(\tau | x, t) \exp\left( -\frac{S(\tau)}{\lambda} \right)

    The optimal cost-to-go is a free energy:

    J(x, t) = -\lambda \log E_Q\left( e^{-S/\lambda} \right)

    The optimal control is an expectation wrt P:

    u(x, t)\,dt = E_P(d\xi) = \frac{E_Q\left( d\xi\, e^{-S/\lambda} \right)}{E_Q\left( e^{-S/\lambda} \right)}

  • Recap

    Control problem:

    dx = f\,dt + g(u\,dt + d\xi), \qquad C = \left\langle \phi + \int_t^T \left( V + \frac{1}{2} u^T R u \right) \right\rangle, \qquad R = \lambda \nu^{-1}

    HJB is linear:

    \partial_t \psi = H\psi, \qquad J = -\lambda \log \psi

    The solution is given by the Feynman-Kac formula: \psi = E_Q\left( e^{-S/\lambda} \right).

    Q is the distribution over the uncontrolled dynamics (u = 0).

    The optimal control is an expectation value: u\,dt = \frac{E_Q\left( d\xi\, e^{-S/\lambda} \right)}{E_Q\left( e^{-S/\lambda} \right)}

  • Motor babbling: estimate the optimal control by importance sampling

    Initialize \hat{u} = 0.

    Iterate:

    • Generate samples from Q'(\tau) using the random control \hat{u}\,dt + d\xi, \nu = \lambda R^{-1}:

      dx = f\,dt + g(\hat{u}\,dt + d\xi)

      [Figure: plant diagram, x_t and u_t in, x_{t+dt} out]

      This can be computed using a simulator, without knowledge of f, g.

    • Update the control (see the sketch below):

      u\,dt = \hat{u}\,dt + \frac{E_{Q'}\left( d\xi\, e^{-S'/\lambda} \right)}{E_{Q'}\left( e^{-S'/\lambda} \right)}

    Converges to the optimal stochastic control solution.
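    A minimal sketch of this iteration on an assumed 1-d plant (f = 0, g = 1, no path cost, quadratic end cost pulling the state to x = 1; all constants are assumptions). The correction of S' for sampling with nonzero \hat{u} is the standard importance-sampling (Girsanov) term, filled in here:

    import numpy as np

    lam, nu, dt, T = 1.0, 1.0, 0.02, 1.0
    n_steps, n_samples, n_iters = int(round(T / dt)), 1000, 20
    R = lam / nu                                 # R = lam nu^{-1}
    phi = lambda x: 10.0 * (x - 1.0) ** 2        # assumed end cost

    def plant_step(x, u, dxi):                   # black-box simulator step
        return x + u * dt + dxi                  # here f = 0, g = 1

    u_hat = np.zeros(n_steps)                    # current feed-forward control
    for _ in range(n_iters):
        x = np.zeros(n_samples)
        S = np.zeros(n_samples)
        dxi = np.sqrt(nu * dt) * np.random.randn(n_samples, n_steps)
        for k in range(n_steps):
            # Girsanov correction for sampling under u_hat instead of u = 0
            S += 0.5 * u_hat[k] ** 2 * R * dt + u_hat[k] * R * dxi[:, k]
            x = plant_step(x, u_hat[k], dxi[:, k])
        S += phi(x)                              # S' = corrections + end cost
        w = np.exp(-(S - S.min()) / lam)         # importance weights
        w /= w.sum()
        u_hat += (w @ dxi) / dt                  # u dt = u_hat dt + E_P[dxi]

    print(u_hat[:5])                             # controls pushing x toward 1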

  • Acrobot

    Joint angles q_1, q_2:

    d_{11}(q)\ddot{q}_1 + d_{12}(q)\ddot{q}_2 + h_1(q, \dot{q}) + \phi_1(q) = 0

    d_{21}(q)\ddot{q}_1 + d_{22}(q)\ddot{q}_2 + h_2(q, \dot{q}) + \phi_2(q) = u

    We can write these equations in standard form

    dx_i = f_i(x)\,dt + g_i(x)\,u\,dt

    with x_1 = q_1, x_2 = q_2, x_3 = \dot{q}_1, x_4 = \dot{q}_2.

  • Acrobot

    [Figure: four panels of learning curves over 100 iterations; quantities shown include the cost J and the control increments (mean and std).]

    100 iterations. At each iteration 50 stochastic trajectories were generated. Noise was lowered at each iteration. Top left: final height for each stochastic trajectory at each iteration (red) and for each deterministic solution (blue).

  • Acrobot

    (movie92.mp4)

    Result after 100 trials


  • Darmstadt simulator: Beer pong

    (beer pong videos)

    Left: a PID controller provides a trajectory-based solution for one particular target location. Right: we demonstrate that PI feedback control can adapt to a changing target location and/or noise.

  • Application in robotics

    (ICRA2011.mp4)

    (Theodorou et al. 2010)

  • KL control theory

    x denotes the state of the agent and x_{1:T} is a path through state space from time t = 1 to T.

    q(x_{1:T} | x_0) denotes a probability distribution over possible future trajectories given that the agent at time t = 0 is in state x_0, with

    q(x_{1:T} | x_0) = \prod_{t=0}^{T-1} q(x_{t+1} | x_t)

    q(x_{t+1} | x_t) implements the allowed moves.

    V(x_{1:T}) = \sum_{t=1}^{T} V(x_t) is the total cost when following path x_{1:T}.

    The KL control problem is to find the probability distribution p(x_{1:T} | x_0) that minimizes

    C(p | x_0) = \sum_{x_{1:T}} p(x_{1:T} | x_0) \left( \log \frac{p(x_{1:T} | x_0)}{q(x_{1:T} | x_0)} + V(x_{1:T}) \right) = KL(p \| q) + \langle V \rangle_p

  • KL control theory

    p(x_{1:T} | x_0) and q(x_{1:T} | x_0) are distributions over trajectories.

    Given q, find the p that minimizes

    C(p | x_0) = KL(p \| q) + \langle V \rangle_p

    The solution and the optimal control cost are

    p(x_{1:T} | x_0) = \frac{1}{Z(x_0)} q(x_{1:T} | x_0) \exp(-V(x_{1:T}))

    C = -\log Z(x_0)

    Z(x_0) = \sum_{x_{1:T}} q(x_{1:T} | x_0) \exp(-V(x_{1:T}))

    NB: Z(x_0) is an integral over paths.
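    A one-step variational derivation of this solution (standard, filled in here for completeness): add a Lagrange multiplier \gamma for the normalization of p and set the derivative with respect to p(x_{1:T} | x_0) to zero:

    \frac{\partial}{\partial p} \left[ \sum_{x_{1:T}} p \left( \log \frac{p}{q} + V \right) + \gamma \left( \sum_{x_{1:T}} p - 1 \right) \right] = \log \frac{p}{q} + 1 + V + \gamma = 0 \;\Rightarrow\; p \propto q\, e^{-V}

    Normalization gives p = q \exp(-V) / Z(x_0), and substituting back into C(p | x_0) yields C = -\log Z(x_0).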

  • KL control theory

    The optimal control at time t = 0 is given by

    p(x_1 | x_0) = \sum_{x_{2:T}} p(x_{1:T} | x_0) \propto q(x_1 | x_0) \exp(-V(x_1)) \beta_1(x_1)

    with \beta_t(x) the backward messages.

    [Figure: Markov chain x_0 \to \cdots \to x_{T-2} \to x_{T-1} \to x_T]

    \beta_T(x_T) = 1

    \beta_{t-1}(x_{t-1}) = \sum_{x_t} q(x_t | x_{t-1}) \exp(-V(x_t)) \beta_t(x_t)
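    A minimal sketch of the backward recursion on an assumed toy state space (random q and hand-picked V, not from the slides):

    import numpy as np

    n, T = 4, 10
    rng = np.random.default_rng(0)
    q = rng.random((n, n))
    q /= q.sum(axis=1, keepdims=True)     # q[x, y] = q(y | x)
    V = np.array([1.0, 0.5, 0.0, 2.0])    # assumed state costs

    beta = np.ones(n)                     # beta_T(x) = 1
    for t in range(T, 1, -1):             # recurse down to beta_1
        beta = q @ (np.exp(-V) * beta)

    x0 = 0                                # optimal first step from x0:
    p1 = q[x0] * np.exp(-V) * beta        # p(x1|x0) ~ q(x1|x0) e^{-V(x1)} beta_1(x1)
    p1 /= p1.sum()
    print(p1)                             # biased toward low-cost states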

  • Link to continuous path integral formulation

    The previous continuous path integral control can be obtained as a special case of the KL control formulation.

    dx = f(x, t)\,dt + g(x, t)(u\,dt + d\xi), \qquad \langle d\xi^2 \rangle = \nu\,dt

    p(x_{t+dt} | x_t, u_t) = \mathcal{N}(x_{t+dt} | x_t + f(x_t, t)\,dt + g(x_t, t)\,u_t\,dt, \Xi(x_t, t))

    q(x_{t+dt} | x_t) = \mathcal{N}(x_{t+dt} | x_t + f(x_t, t)\,dt, \Xi(x_t, t))

    C(p | x_0) = KL(p \| q) + \langle V \rangle = \sum_{x_{dt:T}} p(x_{dt:T} | x_0) \left( \sum_{t=dt}^{T} \frac{1}{2} u_t^T \nu^{-1} u_t + V(x_t) \right)
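    The quadratic control cost is just the KL divergence between the two Gaussians above, which share the covariance \Xi = g \nu g^T dt (a standard identity, filled in here; the last step assumes invertible g):

    KL\left( \mathcal{N}(\mu_p, \Xi) \,\|\, \mathcal{N}(\mu_q, \Xi) \right) = \frac{1}{2} (\mu_p - \mu_q)^T \Xi^{-1} (\mu_p - \mu_q) = \frac{1}{2} (g u\, dt)^T (g \nu g^T dt)^{-1} (g u\, dt) = \frac{1}{2} u^T \nu^{-1} u\, dt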

  • Average cost KL control

    When T \to \infty and q is ergodic, the backward message recursion

    \beta_{t-1}(x_{t-1}) = \sum_{x_t} H(x_{t-1}, x_t) \beta_t(x_t), \qquad H(x, y) = q(y | x) \exp(-V(y))

    becomes the computation of the Perron-Frobenius eigenpair (\beta(\cdot), \lambda):

    H\beta = \lambda\beta

    The optimal control satisfies

    p(y | x) = q(y | x) \exp(-V(y)) \frac{\beta(y)}{\lambda \beta(x)}

    C(x_0) = -\log \beta(x_0) - T \log \lambda

    Todorov 2006
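    A minimal power-iteration sketch for the Perron-Frobenius pair, on an assumed toy problem (random q, hand-picked V; not from the slides):

    import numpy as np

    n = 4
    rng = np.random.default_rng(0)
    q = rng.random((n, n))
    q /= q.sum(axis=1, keepdims=True)    # q[x, y] = q(y | x)
    V = np.array([1.0, 0.5, 0.0, 2.0])

    H = q * np.exp(-V)[None, :]          # H[x, y] = q(y|x) exp(-V(y))
    beta = np.ones(n)
    for _ in range(1000):
        beta = H @ beta
        lam = beta.sum()                 # normalization fixes the scale
        beta /= lam                      # at convergence, H beta = lam beta

    # optimal transitions: p(y|x) = H[x, y] beta(y) / (lam beta(x))
    p = H * beta[None, :] / (lam * beta[:, None])
    print(lam)
    print(p.sum(axis=1))                 # each row of p sums to 1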

  • KL-learning

    Goal: find the Perron-Frobenius solution H\beta = \lambda\beta, with H = [q(y | x) \exp(-V(y))], while stepping through state space according to q and observing the incurred cost.

    Algorithm (KL-learning):

    Initialize \beta_0 random and \lambda_0 = \sum_x \beta_0(x). Initialize x_0 random.

    For t = 1, 2, \ldots do

        x_t \sim q(\cdot | x_{t-1})

        \Delta = \frac{\exp(-V(x_t)) \beta_{t-1}(x_t)}{\lambda_{t-1}} - \beta_{t-1}(x_{t-1})

        \beta_t(x_{t-1}) = \beta_{t-1}(x_{t-1}) + \eta\Delta

        \lambda_t = \lambda_{t-1} + \eta\Delta

    A generalization of z-learning (Todorov) to \lambda \neq 1.

    Bierkens, Kappen 2012
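    A minimal sketch of these updates on the same assumed toy problem, compared against the exact eigenvalue. With a constant learning rate \eta the estimate fluctuates around the exact value; a decreasing \eta gives convergence.

    import numpy as np

    n, eta, n_steps = 4, 0.05, 200_000
    rng = np.random.default_rng(1)
    q = rng.random((n, n))
    q /= q.sum(axis=1, keepdims=True)         # q[x, y] = q(y | x)
    V = np.array([1.0, 0.5, 0.0, 2.0])        # assumed state costs

    beta = rng.random(n)                      # beta_0 random
    lam = beta.sum()                          # lambda_0 = sum_x beta_0(x)
    x = int(rng.integers(n))                  # x_0 random
    for t in range(n_steps):
        y = rng.choice(n, p=q[x])             # step x_t ~ q(. | x_{t-1})
        delta = np.exp(-V[y]) * beta[y] / lam - beta[x]
        beta[x] += eta * delta                # update only the visited state
        lam += eta * delta
        x = y

    H = q * np.exp(-V)[None, :]               # H(x, y) = q(y|x) exp(-V(y))
    print(lam, max(abs(np.linalg.eigvals(H))))  # online vs exact lambda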

  • Planning of goal directed behaviour

    Effective navigation requires planning to goal locations that have been previously visited.

    The hippocampus has long been associated with navigation. Hippocampal place cells fire selectively when an animal occupies a restricted location in an environment.

    4 well-trained rats performed a spatial memory task in a 2 × 2 meter open area, with recordings of up to 250 hippocampal place cells. Two phases: forage to obtain reward in an unknown location; obtain reward in a predictable reward location.

    Neural activity during many candidate events revealed temporally compressed, two-dimensional trajectories across the environment.

    Pfeiffer and Foster 2013

  • Observations and assumptions

    • the place cell activity moves from current location to goal location, sequentially activating intermediate place cells.

    • can be understood as a type of gradient flow in a potential field

    • the potential field is shaped around the food locations, which change each day


  • Thinking rats

    Pfeiffer and Foster 2013


  • Finite state model

    Two-dimensional grid of hippocampal place cells as a finite state model.

    Each state x corresponds to one place cell firing and all other place cells silent.

    We assume a grid world with a one-to-one pre-learned correspondence between the place cells and the grid locations.

    Four food locations.

    [Figure: 15 × 15 grid world with the four food locations]

  • [Figure: sequence of 15 × 15 grid-world panels]

  • Attractor dynamics

    [Figure: two panels vs. learning steps t; left: minimal distance to target; right: path length to minimal distance.]

    Quality of the controlled dynamics as a function of the learning steps t. Left: average minimal distance of trajectories starting at (8,8) and of length 50 to one of the food locations. Right: average corresponding path length.

  • Changing locations

    run file7 movie.m



  • Discussion

    KL control as a simple alternative for RL:
    - only a single eigenvalue computation
    - actor-critic or policy iteration requires multiple policy evaluations
    - Q-learning requires a state and action representation

    KL learning:
    - model-free learning
    - model-based thinking

    Accelerations:
    - learn a representation of the uncontrolled dynamics while exploring
    - update β in parallel for all states, not only the state that is visited

    Neural issues:
    - neural 'blob' (Amari, Kohonen) for place cell activity
    - topological map learning for place fields (Kohonen)
    - β(x) (and λ) as an extra layer of neurons or thresholds


  • Conclusion

    Path integral control problems are inference problems
    - decision making by sampling
    - phase transitions
    - efficient computational methods

    [Figure: left panel: cost difference vs. noise; right panel: CPU time vs. number of agents.]

    Theory for sensori-motor integration
    - learning (motor babbling) approach for robotics
    - hippocampal model for learning goal-directed behavior

    www.snn.ru.nl/~bertk