
Part 3: Core Theory

Intro to Reinforcement Learning

Interactive Example: You are the algorithm!

Finite Markov decision processes (finite MDPs)

states: $S_t \in \mathcal{S}$
actions: $A_t \in \mathcal{A}$
rewards: $R_t \in \mathcal{R}$

policy: $\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$, with $\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$

dynamics: $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1]$, with $p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$

Experience (time flows left to right; the policy selects each action, the dynamics produce each reward and next state):

$S_0 \xrightarrow{\pi} A_0 \xrightarrow{p} R_1, S_1 \xrightarrow{\pi} A_1 \xrightarrow{p} R_2, S_2 \xrightarrow{\pi} A_2 \xrightarrow{p} R_3, \ldots$

Rewards and returns

• The objective in RL is to maximize long-term future reward
• That is, to choose $A_t$ so as to maximize $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$
• But what exactly should be maximized?
• The discounted return at time t:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots$, where $\gamma \in [0,1)$ is the discount rate

Reward sequence         γ              Return
1 0 0 0 …               0.5 (or any)   1
0 0 2 0 0 0 …           0.5            0.5
0 0 2 0 0 0 …           0.9            1.62
-1 2 6 3 2 0 0 0 …      0.5            2
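As a quick check of the table above, here is a minimal Python sketch (not from the slides; the function name discounted_return is just illustrative) that computes G_t for a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    for a finite reward sequence (all later rewards assumed zero)."""
    g = 0.0
    # Accumulate backward: G_k = R_{k+1} + gamma * G_{k+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 0, 0], 0.5))               # 1.0
print(discounted_return([0, 0, 2, 0, 0, 0], 0.5))         # 0.5
print(discounted_return([0, 0, 2, 0, 0, 0], 0.9))         # ≈ 1.62
print(discounted_return([-1, 2, 6, 3, 2, 0, 0, 0], 0.5))  # 2.0
```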

Values are expected returns

• The value of a state, given a policy:
  $v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s, A_{t:\infty} \sim \pi\}$,  $v_\pi : \mathcal{S} \to \mathbb{R}$

• The value of a state-action pair, given a policy:
  $q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s, A_t = a, A_{t+1:\infty} \sim \pi\}$,  $q_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$

• The optimal value of a state:
  $v_*(s) = \max_\pi v_\pi(s)$,  $v_* : \mathcal{S} \to \mathbb{R}$

• The optimal value of a state-action pair:
  $q_*(s, a) = \max_\pi q_\pi(s, a)$,  $q_* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$

• Optimal policy: $\pi_*$ is an optimal policy if and only if
  $\pi_*(a \mid s) > 0$ only where $q_*(s, a) = \max_b q_*(s, b)$,  $\forall s \in \mathcal{S}$

• in other words, $\pi_*$ is optimal iff it is greedy wrt $q_*$
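The greediness condition above translates directly into code. The following is a small hypothetical Python sketch, assuming a tabular policy pi and action-value function q stored as dicts keyed by (state, action); it checks whether pi puts positive probability only on maximizing actions:

```python
def is_greedy(pi, q, states, actions, tol=1e-9):
    """True iff pi(a|s) > 0 only where q(s,a) = max_b q(s,b), for every state."""
    for s in states:
        best = max(q[(s, b)] for b in actions)
        for a in actions:
            if pi[(s, a)] > 0 and q[(s, a)] < best - tol:
                return False  # positive probability on a non-maximizing action
    return True
```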

4 value functions

               state values   action values
  prediction       v_π             q_π
  control          v_*             q_*

• All theoretical objects, mathematical ideals (expected values)
• Distinct from their estimates: $V(s)$, $Q(s, a)$, $\hat{v}(s; \mathbf{w}_t)$, $\hat{q}(s, a; \mathbf{w}_t)$


Gridworld

❐ Actions: north, south, east, west; deterministic.
❐ An action that would take the agent off the grid leaves the state unchanged but gives reward = –1
❐ All other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown (A → A′ with reward +10, B → B′ with reward +5)
❐ The figure shows the state-value function for the equiprobable random policy; γ = 0.9
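That state-value function can be computed by iterative policy evaluation. Below is a minimal Python sketch, not from the slides, assuming the 5×5 layout with A at (0, 1) jumping to A′ at (4, 1) and B at (0, 3) jumping to B′ at (2, 3), as in the figure:

```python
import numpy as np

N, GAMMA = 5, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east

# Special states (row, col): leaving A yields +10 and jumps to A';
# leaving B yields +5 and jumps to B' (positions assumed from the figure).
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)

def step(state, action):
    """Deterministic gridworld dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # would leave the grid: no move, reward -1

# Iterative policy evaluation for the equiprobable random policy (prob 1/4 each)
V = np.zeros((N, N))
while True:
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for a in ACTIONS:
                (r2, c2), reward = step((r, c), a)
                V_new[r, c] += 0.25 * (reward + GAMMA * V[r2, c2])
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

print(np.round(V_new, 1))  # top row comes out roughly 3.3, 8.8, 4.4, 5.3, 1.5
```

The repeated sweep is a contraction for γ < 1, so the loop converges; the special states A and B have values below +10 and +5 because their successors lie near the edge, where the random policy keeps bumping into the wall.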


Golf

❐ State is ball location
❐ Reward of –1 for each stroke until the ball is in the hole
❐ Value of a state?
❐ Actions:
  – putt (use putter)
  – driver (use driver)
❐ putt succeeds anywhere on the green

[Figure: contour plots of v_putt (upper) and q*(s, driver) (lower) over the golf course, marking the sand and the green; values become more negative with distance from the hole (–1 on the green out to about –6 for v_putt), and the sand has value –∞ under v_putt.]


Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to v* is an optimal policy.

Therefore, given v*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

[Figure: a) the gridworld, with special states A → A′ (+10) and B → B′ (+5); b) v*; c) π*]

v*:
22.0  24.4  22.0  19.4  17.5
19.8  22.0  19.8  17.8  16.0
17.8  19.8  17.8  16.0  14.4
16.0  17.8  16.0  14.4  13.0
14.4  16.0  14.4  13.0  11.7
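One-step-ahead search with v* follows directly from the definitions. This is a small hypothetical Python sketch; the dynamics interface p(s, a), returning a list of (probability, next_state, reward) triples, is an assumption for illustration:

```python
def greedy_action(s, actions, p, v_star, gamma):
    """Pick the action maximizing the one-step lookahead
    sum over (s', r) of p(s', r | s, a) * (r + gamma * v_star(s'))."""
    def backup(a):
        return sum(prob * (r + gamma * v_star(s2)) for prob, s2, r in p(s, a))
    return max(actions, key=backup)
```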


Optimal Value Function for Golf

❐ We can hit the ball farther with the driver than with the putter, but with less accuracy
❐ q*(s, driver) gives the value of using the driver first, then using whichever actions are best afterward

[Figure: the same contour plots of v_putt and q*(s, driver) shown above.]


Solving unknown MDPs

• Tabular case: no function approximation; each state has dedicated memory, and the states are visible

• Given a policy π and returns from following it, we can average the returns (for each state-action pair) to approximate its action-value function q_π

• Given q_π, we can form a new policy π′ that is greedy with respect to it:
  $\pi'(a \mid s) > 0$ only where $q_\pi(s, a) = \max_b q_\pi(s, b)$,  $\forall s \in \mathcal{S}$

• The new policy is guaranteed to be an improvement:
  $v_{\pi'}(s) \ge v_\pi(s)$  $\forall s \in \mathcal{S}$, with equality only if both are optimal

This is called policy improvement. Schematically:

  π → (policy evaluation) → v_π → (greedification) → π′

It follows then that repeated policy improvement,

  π_1 → (eval) → v_{π_1} → (greedy) → π_2 → (eval) → v_{π_2} → (greedy) → π_3 → (eval) → v_{π_3} → (greedy) → … → π_* → (eval) → v_{π_*} → (greedy) → π_*,

converges to an optimal policy in a finite number of “iterations”.

This is called policy iteration. It is the basis for almost all (control) solution methods.
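As a sketch of the greedification and policy-iteration loop just described (hypothetical helper names; Q is assumed to be a tabular estimate of q_π stored as a dict keyed by (state, action), and evaluate_q is a user-supplied policy-evaluation routine, e.g. one that averages sampled returns):

```python
def greedify(Q, states, actions):
    """Return a deterministic policy that is greedy with respect to Q:
    pi_new[s] is an action a maximizing Q[(s, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

def policy_iteration(evaluate_q, states, actions, pi0, max_iters=100):
    """Skeleton of policy iteration. pi0 maps each state to an action;
    evaluate_q(pi) must return a dict approximating q_pi."""
    pi = pi0
    for _ in range(max_iters):
        Q = evaluate_q(pi)                      # policy evaluation
        new_pi = greedify(Q, states, actions)   # greedification / improvement
        if new_pi == pi:                        # stable: greedy wrt its own q
            break
        pi = new_pi
    return pi
```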


Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor for convergence of GPI is given by the two-lines diagram in the Chapter 4 excerpt that follows.

This overall schema for GPI is illustrated in Figure 4.7.

[Figure 4.7: Generalized policy iteration: value and policy functions interact until they are optimal and thus consistent with each other. The diagram shows π and v linked by evaluation (v → v_π) and improvement (π → greedy(v)), converging together to v* and π*.]

It is easy to see that if both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus, both processes stabilize only when a policy has been found that is greedy with respect to its own evaluation function. This implies that the Bellman optimality equation (4.1) holds, and thus that the policy and the value function are optimal.

The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy. In the long run, however, these two processes interact to find a single joint solution: the optimal value function and an optimal policy.

One might also think of the interaction between the evaluation and improvement processes in GPI in terms of two constraints or goals—for example, as two lines in two-dimensional space:

[Diagram: two lines in a plane, v = v_π and π = greedy(v); starting from v_0, π_0, alternating steps drive the process toward each line in turn, converging to their intersection at v*, π*.]

Although the real geometry is much more complicated than this, the diagram suggests what happens in the real case. Each process drives the value function or policy toward one of the lines representing a solution to one of the two goals. The goals interact because the two lines are not orthogonal. Driving directly toward one goal causes some movement away from the other goal. Inevitably, however, the joint process is brought closer to the overall goal of optimality. The arrows in this diagram correspond to the behavior of policy iteration in that each takes the system all the way to achieving one of the two goals completely. In GPI one could also take smaller, incomplete steps toward each goal. In either case, the two processes together achieve the overall goal of optimality even though neither is attempting to achieve it directly.

4.7 Efficiency of Dynamic Programming

DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP methods are actually quite efficient. If we ignore a few technical details, then the (worst case) time DP methods take to find an optimal policy is polynomial in the number of states and actions. If n and m denote the number of states and actions, this means that a DP method takes a number of computational operations that is less than some polynomial function of n and m. A DP method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is m^n. In this sense, DP is exponentially faster than any direct search in policy space could be, because direct search would have to exhaustively examine each policy to provide the same guarantee. Linear programming methods can also be used to solve MDPs, and in some cases their worst-case convergence guarantees are better than those of DP methods. But linear programming methods become impractical at a much smaller number of states than do DP methods (by a factor of about 100). For the largest problems, only DP methods are feasible.

DP is sometimes thought to be of limited applicability because of the curse of dimensionality.


Summary

❐ Agent-environment interaction
  – States
  – Actions
  – Rewards

❐ Policy: stochastic rule for selecting actions

❐ Return: the function of future rewards agent tries to maximize

❐ Markov Decision Process

❐ Value functions
  – State-value function for a policy
  – Action-value function for a policy
  – Optimal state-value function
  – Optimal action-value function

❐ Optimal policies
❐ Generalized policy iteration

Next up:

A Monte Carlo Learning Example: Solving Blackjack