Stochastic Optimal Control – part 2: discrete time, Markov Decision Processes, Reinforcement Learning
Marc Toussaint
Machine Learning & Robotics Group – TU [email protected]
ICML 2008, Helsinki, July 5th, 2008
• Why stochasticity?
• Markov Decision Processes
• Bellman optimality equation, Dynamic Programming, Value Iteration
• Reinforcement Learning: learning from experience
Why consider stochasticity?
1) the system is inherently stochastic
2) the true system is actually deterministic, but
a) the system is described at a level of abstraction/simplification, which makes the model approximate and stochastic
b) sensors/observations are noisy – we never know the exact state
c) we can handle only a part of the whole system
– partial knowledge → uncertainty
– decomposed planning; factored state representation
[figure: agent–world interaction loop]
• probabilities are a tool to represent information and uncertainty
– there are many sources of uncertainty
Machine Learning models of stochastic processes
• Markov Processes: defined by random variables x0, x1, ... and transition probabilities P(xt+1 | xt)

[figure: graphical model x0 → x1 → x2]
• non-Markovian processes
– higher-order Markov processes, autoregression models
– structured models (hierarchical, grammars, text models)
– Gaussian processes (both discrete and continuous time)
– etc.

• continuous-time processes
– stochastic differential equations
Markov Decision Processes
• Markov process on the random variables of states xt, actions at, and rewards rt

[figure: graphical model of the MDP – states x0, x1, x2, actions a0, a1, a2, rewards r0, r1, r2, and policy π]
P(xt+1 | at, xt)          transition probability
P(rt | at, xt)            reward probability
P(at | xt) = π(at | xt)   policy
• we will assume stationarity: no explicit dependency on time
– P(x' | a, x) and P(r | a, x) are invariable properties of the world
– the policy π is a property of the agent
optimal policies
• value (expected discounted return) of policy π when started in x

V^π(x) = E{ r0 + γ r1 + γ² r2 + ··· | x0 = x; π }

(cf. the cost function C(x0, a0:T) = φ(xT) + Σ_{t=0}^{T−1} R(t, xt, at))
• optimal value function:

V*(x) = max_π V^π(x)
• a policy π* is optimal iff

∀x : V^{π*}(x) = V*(x)
(simultaneously maximizing the value in all states)
• There always exists (at least one) optimal deterministic policy!
Bellman optimality equation
V^π(x) = E{ r0 + γ r1 + γ² r2 + ··· | x0 = x; π }
       = E{ r0 | x0 = x; π } + γ E{ r1 + γ r2 + ··· | x0 = x; π }
       = R(π(x), x) + γ Σ_{x'} P(x' | π(x), x) E{ r1 + γ r2 + ··· | x1 = x'; π }
       = R(π(x), x) + γ Σ_{x'} P(x' | π(x), x) V^π(x')
• Bellman optimality equation

V*(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]
π*(x) = argmax_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]

(if π selected an action other than argmax_a[·], it wouldn't be optimal: the policy π' that equals π everywhere except π'(x) = argmax_a[·] would be better)
• this is the principle of optimality in the stochastic case
(related to Viterbi, max-product algorithm)
Dynamic Programming
• Bellman optimality equation

V*(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V*(x') ]

• Value iteration (initialize V0(x) = 0, iterate k = 0, 1, ...)

∀x : V_{k+1}(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]

– stopping criterion: max_x |V_{k+1}(x) − V_k(x)| ≤ ε
(see the script for a proof of convergence)

• once it has converged, choose the policy

π_k(x) = argmax_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]
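The iteration above can be sketched in a few lines of code. The toy 2-state, 2-action MDP below (its transition probabilities, rewards, and γ) is a hypothetical example for illustration, not taken from the slides.

```python
# Value iteration: V_{k+1}(x) = max_a [R(a,x) + gamma * sum_x' P(x'|a,x) V_k(x')]
# on a hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
gamma, eps = 0.9, 1e-6

# P[x][a] = list of (probability, next_state); R[x][a] = reward
P = {0: {0: [(1.0, 0)], 1: [(0.5, 0), (0.5, 1)]},
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 0.0},
     1: {0: 0.0, 1: 1.0}}

def backup(V, x, a):
    # one Bellman backup for state x and action a
    return R[x][a] + gamma * sum(p * V[x2] for p, x2 in P[x][a])

V = {x: 0.0 for x in P}                       # initialize V_0(x) = 0
while True:
    V_new = {x: max(backup(V, x, a) for a in P[x]) for x in P}
    diff = max(abs(V_new[x] - V[x]) for x in P)
    V = V_new
    if diff <= eps:                           # stopping criterion
        break

policy = {x: max(P[x], key=lambda a: backup(V, x, a)) for x in P}
```

Here action 1 in state 1 yields reward 1 and stays put, so V(1) converges to 1/(1 − γ) = 10; the greedy policy picks action 1 in both states.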
maze example
• typical example of a value function in navigation

[online demo – or switch to Terran Lane's lecture...]
comments
• Bellman’s principle of optimality is the core of the methods
• it refers to recursive reasoning about what makes a path optimal
– the recursive property of the optimal value function
• related to Viterbi, max-product algorithm
Learning from experience
• Reinforcement Learning problem: the models P(x' | a, x) and P(r | a, x) are not known; only exploration is allowed
[figure: overview of RL methods – experience {xt, at, rt} feeds model learning (yielding the MDP model P, R), TD learning and Q-learning (yielding the value/Q-function V, Q), and direct policy search / EM policy optimization (yielding the policy); Dynamic Programming maps the model to V, Q, and a policy update maps V, Q to the policy]
Model learning
• trivial on a direct discrete representation: use experience data to estimate the model

P̂(x' | a, x) ∝ #(x' ← x | a)
– for non-direct representations: Machine Learning methods
• use DP to compute optimal policy for estimated model
• Exploration-Exploitation is not a dilemma
– possible solutions: E3 algorithm, Bayesian RL (see later)
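The count-based estimate P̂(x' | a, x) ∝ #(x' ← x | a) can be sketched as below; the experience tuples are hypothetical.

```python
from collections import defaultdict

# Estimate the transition model from experience by relative frequencies:
# P_hat(x'|a,x) proportional to the count #(x' <- x | a).
experience = [(0, 'r', 1), (0, 'r', 1), (0, 'r', 0), (1, 'l', 0)]  # (x, a, x') tuples

counts = defaultdict(lambda: defaultdict(int))
for x, a, x2 in experience:
    counts[(x, a)][x2] += 1                   # increment #(x' <- x | a)

def P_hat(x2, a, x):
    n = counts[(x, a)]
    total = sum(n.values())
    return n[x2] / total if total else 0.0    # normalized count
```

For example, two of the three observed transitions from (x=0, a='r') went to x'=1, so P̂(1 | 'r', 0) = 2/3; the resulting model can then be fed to Dynamic Programming.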
Temporal Difference
• recall Value Iteration

∀x : V_{k+1}(x) = max_a [ R(a, x) + γ Σ_{x'} P(x' | a, x) V_k(x') ]

• Temporal Difference learning (TD): given experience (xt, at, rt, xt+1)

V_new(xt) = (1 − α) V_old(xt) + α [ rt + γ V_old(xt+1) ]
          = V_old(xt) + α [ rt + γ V_old(xt+1) − V_old(xt) ]

• ... this is a stochastic variant of Dynamic Programming
→ one can prove convergence with probability 1 (see Q-learning in the script)

• reinforcement:
– more reward than expected (rt + γ V_old(xt+1) > V_old(xt)) → increase V(xt)
– less reward than expected (rt + γ V_old(xt+1) < V_old(xt)) → decrease V(xt)
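The TD update can be sketched directly; the three-state episode and the values of α and γ below are hypothetical.

```python
# TD(0): V_new(x_t) = V_old(x_t) + alpha * [r_t + gamma * V_old(x_{t+1}) - V_old(x_t)]
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0}
episode = [(0, 0.0, 1), (1, 1.0, 2)]          # (x_t, r_t, x_{t+1}) experience tuples

for x, r, x2 in episode:
    td_error = r + gamma * V[x2] - V[x]       # positive: more reward than expected
    V[x] += alpha * td_error
```

After this single episode only V(1), the state just before the reward, has moved; V(0) learns about the reward only on later episodes (or, see below, with eligibility traces).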
Q-learning convergence with probability 1

• Q-learning

Q^π(a, x) = E{ r0 + γ r1 + γ² r2 + ··· | x0 = x, a0 = a; π }
Q*(a, x) = R(a, x) + γ Σ_{x'} P(x' | a, x) max_{a'} Q*(a', x')

Q-value iteration (Q-VI):
∀a, x : Q_{k+1}(a, x) = R(a, x) + γ Σ_{x'} P(x' | a, x) max_{a'} Q_k(a', x')

Q-learning update:
Q_new(xt, at) = (1 − α) Q_old(xt, at) + α [ rt + γ max_a Q_old(xt+1, a) ]

• Q-learning is a stochastic approximation of Q-VI:
Q-VI is deterministic: Q_{k+1} = T(Q_k)
Q-learning is stochastic: Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]
η_k has zero mean!
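A tabular Q-learning sketch with ε-greedy exploration follows; the 2-state chain (deterministic transitions, reward 1 for action 1 in state 1) and all constants are hypothetical.

```python
import random

# Tabular Q-learning on a hypothetical deterministic 2-state chain.
# Action 1 in state 1 pays reward 1 and stays there, so Q*(1,1) = 1/(1-gamma) = 10.
random.seed(0)
gamma, alpha, epsilon = 0.9, 0.5, 0.1
step = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}      # next state
reward = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
Q = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}

x = 0
for _ in range(5000):
    if random.random() < epsilon:                        # epsilon-greedy exploration
        a = random.choice((0, 1))
    else:                                                # greedy action
        a = max((0, 1), key=lambda act: Q[(x, act)])
    x2, r = step[(x, a)], reward[(x, a)]
    # Q_new = (1 - alpha) Q_old + alpha [r + gamma max_a' Q_old(x', a')]
    Q[(x, a)] += alpha * (r + gamma * max(Q[(x2, 0)], Q[(x2, 1)]) - Q[(x, a)])
    x = x2
```

Note the off-policy character: the behavior is ε-greedy, but the update target uses max over actions, so Q approaches Q* anyway.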
Q-learning impact
• Q-Learning (Watkins, 1988) is the first provably convergent direct adaptive optimal control algorithm

• great impact on the field of Reinforcement Learning
– smaller representation than models
– automatically focuses attention where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration-versus-exploitation issue
– ε-greedy, optimistic initialization, etc.
Eligibility traces

• Temporal Difference:

V_new(x0) = V_old(x0) + α [ r0 + γ V_old(x1) − V_old(x0) ]

• longer reward sequence: r0 r1 r2 r3 r4 r5 r6 r7
temporal credit assignment – thinking further backwards, receiving r3 also tells us something about V(x0):

V_new(x0) = V_old(x0) + α [ r0 + γ r1 + γ² r2 + γ³ V_old(x3) − V_old(x0) ]

• online implementation: remember where you've been recently (the "eligibility trace") and update those values as well:

e(xt) ← e(xt) + 1
∀x : V_new(x) = V_old(x) + α e(x) [ rt + γ V_old(xt+1) − V_old(xt) ]
∀x : e(x) ← γλ e(x)

• core topic of the Sutton & Barto book
– a great improvement
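The three update lines above translate directly into code; the short episode (reward only on the second transition) and the values of α, γ, λ are hypothetical.

```python
# TD(lambda) with accumulating eligibility traces: receiving a reward also
# updates earlier states, discounted by (gamma * lam) per step of distance.
alpha, gamma, lam = 0.1, 0.9, 0.8
V = {0: 0.0, 1: 0.0, 2: 0.0}
e = {x: 0.0 for x in V}                        # eligibility trace e(x)
episode = [(0, 0.0, 1), (1, 1.0, 2)]           # (x_t, r_t, x_{t+1}) tuples

for x, r, x2 in episode:
    e[x] += 1.0                                # e(x_t) <- e(x_t) + 1
    delta = r + gamma * V[x2] - V[x]           # the usual TD error
    for s in V:
        V[s] += alpha * e[s] * delta           # update all eligible states
        e[s] *= gamma * lam                    # decay the trace
```

After the single episode the reward r1 = 1 has already propagated back: V(1) = αδ = 0.1 and V(0) = α·γλ·δ = 0.072, whereas plain TD(0) would have left V(0) at zero.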
comments
• again, Bellman’s principle of optimality is the core of the methods
• TD(λ), Q-learning, and eligibility traces are all methods that converge to a function obeying the Bellman optimality equation
E3: Explicit Explore or Exploit

• (John Langford) from observed data, construct two MDPs:

(1) MDP_known includes sufficiently often visited states and executed actions, with (rather exact) estimates of P and R
(a model which captures what you know)

(2) MDP_unknown = MDP_known, except the reward is 1 for all actions which leave the known states and 0 otherwise
(a model which captures the optimism of exploration)

• the algorithm:

(1) if the last x is not in Known: choose the least previously used action
(2) else:
(a) [seek exploration] if V_unknown > ε, act according to V_unknown until the state is unknown (or t mod T = 0), then goto (1)
(b) [exploit] else act according to V_known
E3 – Theory
• for any (unknown) MDP:
– the total number of actions and computation time required by E3 are poly(|X|, |A|, T*, 1/ε, ln 1/δ)
– performance guarantee: with probability at least (1 − δ), the expected return of E3 will exceed V* − ε

• details
– actual return: (1/T) Σ_{t=1}^T rt
– let T* denote the (unknown) mixing time of the MDP
– one key insight: even the optimal policy will take time O(T*) to achieve an actual return that is near-optimal

• a straightforward & intuitive approach!
– the exploration-exploitation dilemma is not a dilemma!
– cf. active learning, information seeking, curiosity, variance analysis
Bayesian Reinforcement Learning

• initially, we don't know the MDP
– but based on experience we can estimate the MDP

• parametrize the MDP by a parameter θ
(e.g., direct parametrization: P(x' | a, x) = θ_{x'ax})
– given experience x0 a0 x1 a1 ··· we can estimate the posterior

b(θ) = P(θ | x0:t, a0:t)

• given a "posterior belief b about the world", plan to maximize the policy value in this distribution of worlds

V*(x, b) = max_a [ R(a, x) + γ Σ_{x'} ∫_θ P(x' | a, x; θ) b(θ) dθ · V*(x', b') ]

(see last year's ICML tutorial; old theory from Operations Research)
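For the direct parametrization, the posterior over θ can be maintained in closed form from transition counts. The sketch below assumes a symmetric Dirichlet prior (a standard conjugate choice, not something the slide specifies) and hypothetical experience.

```python
from collections import defaultdict

# Posterior mean of theta_{x'ax} under a symmetric Dirichlet prior with
# pseudo-count `prior` per successor state; counts come from observed data.
states = (0, 1)
prior = 1.0
counts = defaultdict(int)

experience = [(0, 'a', 1), (0, 'a', 1), (0, 'a', 0)]   # hypothetical (x, a, x') data
for x, a, x2 in experience:
    counts[(x, a, x2)] += 1

def posterior_mean(x2, a, x):
    # E[theta_{x'ax} | data] = (prior + n(x')) / (|X| * prior + n_total)
    n_total = sum(counts[(x, a, s)] for s in states)
    return (prior + counts[(x, a, x2)]) / (len(states) * prior + n_total)
```

With no data the estimate falls back to the uniform prior mean 1/|X|; as counts grow it approaches the relative frequency used in plain model learning.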
comments
• there is no fundamentally unsolved exploration-exploitation dilemma!
– but an efficiency issue
further topics
• function approximation:
– Laplacian eigenfunctions as value function representation (Mahadevan)
– Gaussian processes (Carl Rasmussen, Yaakov Engel)

• representations:
– macro (hierarchical) policies, abstractions, options (Precup, Sutton, Singh, Dietterich, Parr)
– predictive state representations (PSRs; Littman, Sutton, Singh)
• partial observability (POMDPs)
• we conclude with a demo from Andrew Ng
– helicopter flight
– actions: standard remote control
– reward functions: hand-designed for specific tasks
[rolls] [inverted flight]
appendix: discrete time continuous state control

• same framework as MDPs, different conventional notation (control ↔ MDPs):

– ut  control  ↔  at  action
– xt+1 = xt + f(t, xt, ut) + ξt (discrete) or dx = f(t, x, u) dt + dξ (continuous) system  ↔  P(xt+1 | at, xt)  transition probability
– φ(xT) final cost, R(t, xt, ut) local cost  ↔  P(rt | at, xt)  reward probability
– C(x0, u0:T)  expected trajectory cost  ↔  V^π(x)  policy value
– J(t, x)  optimal cost-to-go  ↔  V*(x)  optimal value function
• discrete time stochastic controlled system
xt+1 = f(xt, ut) + ξ ,   ξ ∼ N(0, Q)
P(xt+1 | ut, xt) = N(xt+1 | f(xt, ut), Q)

• the objective is to minimize the expectation of the cost

C(x0:T, u0:T) = Σ_{t=0}^T R(t, xt, ut)
appendix: discrete time continuous state control
• just as in the MDP case, the value function obeys the Bellman optimality equation

J_t(x) = min_u [ R(t, x, u) + ∫_{x'} P(x' | u, x) J_{t+1}(x') ]

• 2 types of optimal control problems:

open-loop: find a control sequence u*_{1:T} that minimizes the expected cost

closed-loop: find a control law π* : (t, x) ↦ ut (that exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost
appendix: Linear-quadratic-Gaussian (LQG) case

• consider a linear control process with Gaussian noise and quadratic costs,

P(xt | xt−1, ut) = N(xt | A xt−1 + B ut, Q)
C(x1:T, u1:T) = Σ_{t=1}^T ( xtᵀ R xt + utᵀ H ut )

• assume we know the exact cost-to-go Jt(x) at time t and that it has the form Jt(x) = xᵀ Vt x. Then,

Jt−1(x) = min_u [ xᵀ R x + uᵀ H u + ∫_y N(y | A x + B u, Q) yᵀ Vt y dy ]
        = min_u [ xᵀ R x + uᵀ H u + (A x + B u)ᵀ Vt (A x + B u) + tr(Vt Q) ]
        = min_u [ xᵀ R x + uᵀ (H + Bᵀ Vt B) u + 2 uᵀ Bᵀ Vt A x + xᵀ Aᵀ Vt A x + tr(Vt Q) ]

minimization yields

0 = 2 (H + Bᵀ Vt B) u* + 2 Bᵀ Vt A x   ⇒   u* = −(H + Bᵀ Vt B)⁻¹ Bᵀ Vt A x

and so (dropping the constant tr(Vt Q), which depends on neither x nor u)

Jt−1(x) = xᵀ Vt−1 x ,   Vt−1 = R + Aᵀ Vt A − Aᵀ Vt B (H + Bᵀ Vt B)⁻¹ Bᵀ Vt A

• this is called the Riccati equation
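The backward recursion can be iterated numerically; the sketch below uses the scalar (1-D) case with hypothetical values A = B = R = H = 1, where all matrix products reduce to ordinary multiplication.

```python
# Scalar backward Riccati recursion:
# V_{t-1} = R + A V_t A - A V_t B (H + B V_t B)^{-1} B V_t A
A, B, R, H = 1.0, 1.0, 1.0, 1.0               # hypothetical scalar system/cost
V = R                                         # terminal cost-to-go J_T(x) = x V x
for _ in range(200):
    K = B * V * A / (H + B * V * B)           # optimal feedback gain, u* = -K x
    V = R + A * V * A - A * V * B * K
```

For these numbers the recursion settles at the stationary solution of V = R + V − V² / (H + V), which works out to the golden ratio (1 + √5)/2 ≈ 1.618; the corresponding gain K gives the stationary feedback controller.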