Post on 02-Jan-2016
Approximate Dynamic Programming
Presented by Yu-Shun, Wang
Page 2
Agenda
Introduction
  Background
  Policy Evaluation Algorithms
General Issues of Cost Approximation
  Approximate Policy Iteration
  Direct and Indirect Approximation
  The Role of Monte Carlo Simulation
Direct Policy Evaluation - Gradient Methods
  Batch Gradient Methods for Policy Evaluation
  Incremental Gradient Methods for Policy Evaluation
Comparison with our approach
Page 4
Background
A principal aim of these methods is to address problems with a very large number of states n.
Another aim of the methods of this chapter is to address model-free situations, i.e., problems where a mathematical model is unavailable or hard to construct, but where the system and cost structure can be simulated; for example, a queuing network with complicated but well-defined service disciplines.
Page 5
Background
The assumption here is that:
There is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u).
It also generates a corresponding transition cost g(i, u, j).
It may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging.
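The averaging idea above can be sketched as follows. The simulator, its state space, and its cost structure are synthetic assumptions made up for illustration, not a specific queuing model:

```python
import random
from collections import defaultdict

random.seed(0)

# A toy "black-box" simulator: given (i, u), it returns a successor state j
# and a transition cost g.  Its internal probabilities are hidden from us;
# we can only sample it, as assumed above.
def simulate(i, u):
    j = (i + u + (1 if random.random() < 0.3 else 0)) % 5
    return j, 1.0 + 0.5 * u

# Estimate the transition probabilities p_ij(u) and the expected stage cost
# by averaging over repeated simulation.
def estimate(i, u, num_samples=10000):
    counts, cost_sum = defaultdict(int), 0.0
    for _ in range(num_samples):
        j, g = simulate(i, u)
        counts[j] += 1
        cost_sum += g
    p_hat = {j: c / num_samples for j, c in counts.items()}
    return p_hat, cost_sum / num_samples

# For this simulator, (i=2, u=1) moves to state 3 or 4 with probabilities
# roughly 0.7 and 0.3, which the averages recover.
p_hat, g_hat = estimate(i=2, u=1)
```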
Page 6
Background
We will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs.
In another type of method, which we will discuss only briefly, we use a gradient method and simulation data to approximate directly an optimal policy.
Page 8
Policy Evaluation Algorithms
With this class of methods, we aim to approximate the cost function Jμ(i) of a policy μ with a parametric architecture of the form J̃(i, r), where r is a parameter vector.
Alternatively, the approximation may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, with one-step or multistep lookahead.
Page 9
Policy Evaluation Algorithms
We focus primarily on two types of methods. In the first class, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture to the samples.
The second, and currently more popular, class of methods is called indirect. Here, we obtain the parameter vector r by solving an approximate version of Bellman's equation.
Page 14
Approximate Policy Iteration
Suppose that the current policy is μ, and that for a given r, J̃(i, r) is an approximation of Jμ(i). We generate an "improved" policy μ̄ using the formula

μ̄(i) = arg min_{u ∈ U(i)} Σ_j p_ij(u) ( g(i, u, j) + α J̃(j, r) ),   for all i.

When the sequence of policies obtained actually converges to some μ̄, then it can be proved that μ̄ is optimal to within

2αδ / (1 − α)²,

where δ is an upper bound on the approximation error, i.e., max_i | J̃(i, r) − Jμ(i) | ≤ δ for the policies μ generated.
Page 15
Approximate Policy Iteration
Block diagram of approximate policy iteration
Page 16
Approximate Policy Iteration
A simulation-based implementation of the algorithm is illustrated in the following figure. It consists of four modules:
The simulator, which, given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.
The decision generator, which generates the control μ̄(i) of the improved policy μ̄ at the current state i for use in the simulator.
The cost-to-go approximator, which is the function J̃(j, r) that is used by the decision generator.
The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation J̃(·, r̄) of the cost of μ̄.
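A minimal end-to-end sketch of how the four modules can be wired together. The simulator, the linear features (1, i), and the crude Monte Carlo fitting routine standing in for the cost approximation algorithm are all illustrative assumptions:

```python
import random

random.seed(0)
alpha, N_STATES, CONTROLS = 0.9, 5, (0, 1)

# Module 1: the simulator -- sample (j, g) for a state-control pair (i, u).
def simulate(i, u):
    j = (i + u + (1 if random.random() < 0.5 else 0)) % N_STATES
    return j, 1.0 + 0.3 * u + 0.1 * i

# Module 3: the cost-to-go approximator J~(j, r), linear in features (1, j).
def J_hat(j, r):
    return r[0] + r[1] * j

# Module 2: the decision generator -- one-step lookahead using J~(., r),
# with the expectation itself estimated by sampling the simulator.
def mu_bar(i, r, m=50):
    def q(u):
        draws = [simulate(i, u) for _ in range(m)]
        return sum(g + alpha * J_hat(j, r) for j, g in draws) / m
    return min(CONTROLS, key=q)

# Module 4: the cost approximation algorithm -- fit r_bar by simple linear
# regression of truncated Monte Carlo returns on the start state.
def fit(policy, horizon=40, runs=4):
    pts = []
    for i in range(N_STATES):
        for _ in range(runs):
            s, ret, disc = i, 0.0, 1.0
            for _ in range(horizon):
                s2, g = simulate(s, policy(s))
                ret += disc * g
                disc *= alpha
                s = s2
            pts.append((i, ret))
    n = len(pts)
    mx = sum(i for i, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((i - mx) ** 2 for i, _ in pts)
    sxy = sum((i - mx) * (y - my) for i, y in pts)
    b = sxy / sxx
    return (my - b * mx, b)

# Approximate policy iteration: alternate policy improvement and evaluation.
r = (0.0, 0.0)
for _ in range(3):
    policy = lambda i, r=r: mu_bar(i, r)
    r = fit(policy)
```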
Page 17
Approximate Policy Iteration
Page 18
Approximate Policy Iteration
There are two policies, μ and μ̄, and two parameter vectors, r and r̄, involved in this algorithm.
In particular, r corresponds to the current policy μ, and the approximation J̃(·, r) is used in the policy improvement to generate the new policy μ̄.
At the same time, μ̄ drives the simulation that generates the samples that determine the parameter r̄, which will be used in the next policy iteration.
Page 19
Approximate Policy Iteration
[Block diagram: system simulator, decision generator, cost-to-go approximator, and cost approximation algorithm]
Page 21
Direct and Indirect Approximation
An important generic difficulty with simulation-based policy iteration is that the cost samples are biased: states that are unlikely to occur under the simulated policy are underrepresented in the samples.
As a result, the cost-to-go estimates of these states may be highly inaccurate, causing serious errors in the calculation of the improved control policy.
This difficulty is known as inadequate exploration of the system's dynamics, and it stems from the use of a fixed policy for simulation.
Page 22
Direct and Indirect Approximation
One possibility for adequate exploration is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset.
A related approach, called iterative resampling, is to derive an initial cost evaluation of μ, simulate the next policy obtained on the basis of this initial evaluation to obtain a set of representative states S, and repeat the evaluation of μ using additional trajectories initiated from S.
Page 23
Direct and Indirect Approximation
The most straightforward algorithmic approach for approximating the cost function is the direct one.
It finds an approximation J̃ ∈ S that best matches Jμ in some normed error sense, i.e., it solves

min_{J̃ ∈ S} || J̃ − Jμ ||,

where S is the approximation subspace spanned by the columns of a matrix Φ, and || · || is usually some (weighted) Euclidean norm.
Page 24
Direct and Indirect Approximation
If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as

J̃ = ΠJμ,

where Π denotes projection with respect to || · || onto the subspace S.
A major difficulty is that the specific cost function values Jμ(i) can only be estimated through their simulation-generated cost samples.
Page 25
Direct and Indirect Approximation
Page 26
Direct and Indirect Approximation
An alternative approach, referred to as indirect, is to approximate the solution of Bellman's equation J = TμJ on the subspace S, i.e., to solve the equation

Φr = ΠTμ(Φr).

We can view this equation as a projected form of Bellman's equation.
Page 27
Direct and Indirect Approximation
Solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods.
The use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods from the Galerkin methods.
Page 28
Direct and Indirect Approximation
Page 30
The Role of Monte Carlo Simulation
The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces.
The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms.
Page 31
The Role of Monte Carlo Simulation
Example: Approximate Policy Evaluation
Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

J = g + αPJ,

where g is the vector of expected stage costs, P is the transition probability matrix, and α is the discount factor.
Let us adopt a hard aggregation approach, whereby we divide the n states into two disjoint subsets I1 and I2 with I1 ∪ I2 = {1, . . . , n}, and we use the piecewise constant approximation

J(i) = r1 if i ∈ I1,   J(i) = r2 if i ∈ I2.
Page 32
The Role of Monte Carlo Simulation
Example: Approximate Policy Evaluation (cont.)
This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix whose column components are 1 or 0, depending on whether the corresponding state belongs to I1 or I2.
Substituting into the Bellman equation, we obtain the approximate equations

J(i) ≈ g(i) + α ( Σ_{j ∈ I1} p_ij ) r1 + α ( Σ_{j ∈ I2} p_ij ) r2,   i = 1, . . . , n.
Page 33
The Role of Monte Carlo Simulation
Example: Approximate Policy Evaluation (cont.)
We can reduce these n equations to just two by forming two weighted sums (with equal weights) of the equations corresponding to the states in I1 and I2, respectively, where n1 and n2 are the numbers of states in I1 and I2. We thus obtain the aggregate system of the following two equations in r1 and r2:

r1 = (1/n1) Σ_{i ∈ I1} [ g(i) + α ( Σ_{j ∈ I1} p_ij ) r1 + α ( Σ_{j ∈ I2} p_ij ) r2 ],
r2 = (1/n2) Σ_{i ∈ I2} [ g(i) + α ( Σ_{j ∈ I1} p_ij ) r1 + α ( Σ_{j ∈ I2} p_ij ) r2 ].
Page 34
The Role of Monte Carlo Simulation
Example: Approximate Policy Evaluation (cont.)
Here the challenge, when the number of states n is very large, is the calculation of the large sums on the right-hand side, which can be of order O(n²).
Simulation allows the approximate calculation of these sums with complexity that is independent of n.
This is similar to the advantage that Monte-Carlo integration holds over numerical integration.
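As a sketch of this point, one of the aggregate coefficients of the example, (1/n1) Σ_{i ∈ I1} Σ_{j ∈ I1} p_ij, can be computed exactly at O(n²)-style cost, or estimated by sampling at a cost independent of n. The uniform chain below is a synthetic assumption, chosen so that the exact value is available for comparison:

```python
import random

random.seed(0)
n, n_samples = 1000, 20000
I1, I2 = range(0, 600), range(600, n)   # hard-aggregation groups
n1 = len(I1)

# Synthetic transitions: from any i, move to a uniformly random state,
# so p_ij = 1/n and the exact coefficient is checkable.
exact = sum(1.0 / n for i in I1 for j in I1) / n1   # O(n^2)-style double sum

# Monte Carlo: sample i uniformly from I1, then j ~ p_i., and count hits in
# I1.  The work is O(n_samples), independent of n.
hits = 0
for _ in range(n_samples):
    i = random.choice(I1)
    j = random.randrange(n)     # sample j ~ p_ij (uniform in this toy chain)
    if j in I1:
        hits += 1
mc = hits / n_samples
```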
Page 36
Batch Gradient Methods for Policy Evaluation
Suppose that the current policy is μ, and that for a given r, J̃(i, r) is an approximation of Jμ(i). We generate an "improved" policy μ̄ using the formula

μ̄(i) = arg min_{u ∈ U(i)} Σ_j p_ij(u) ( g(i, u, j) + α J̃(j, r) ),   for all i.

To evaluate approximately Jμ̄, we select a subset of "representative" states S̃ (obtained by simulation), and for each i ∈ S̃, we obtain M(i) samples of the cost Jμ̄(i).
The mth such sample is denoted by c(i, m), and it can be viewed as Jμ̄(i) plus some simulation error.
Page 37
Batch Gradient Methods for Policy Evaluation
We obtain the corresponding parameter vector r by solving the following least squares problem:

min_r Σ_{i ∈ S̃} Σ_{m=1}^{M(i)} ( J̃(i, r) − c(i, m) )².

This problem can be solved exactly if a linear approximation architecture is used.
However, when a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem.
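For the linear case, the least squares fit reduces to the normal equations, which the following sketch solves in closed form for a hypothetical two-feature architecture J̃(i, r) = φ(i)·r. The feature map, states, and noisy cost samples c(i, m) are synthetic assumptions:

```python
import random

random.seed(0)

def phi(i):                      # hypothetical 2-dim feature vector of state i
    return (1.0, float(i))

# Synthetic cost samples: c(i, m) = true cost 3 + 0.5*i plus simulation noise,
# with M(i) = 10 samples per representative state.
true_r = (3.0, 0.5)
states = list(range(20))
samples = [(i, sum(p * w for p, w in zip(phi(i), true_r))
               + random.gauss(0, 0.1))
           for i in states for _ in range(10)]

# Normal equations: A r = b with A = sum phi(i) phi(i)^T, b = sum phi(i) c(i,m).
A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for i, c in samples:
    f = phi(i)
    for a in range(2):
        b[a] += f[a] * c
        for d in range(2):
            A[a][d] += f[a] * f[d]

# Solve the 2x2 system by Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
r = ((A[1][1] * b[0] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det)
```

With 200 samples and small noise, the fitted r recovers the underlying (3.0, 0.5) closely.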
Page 38
Batch Gradient Methods for Policy Evaluation
Let us focus on an N-transition portion (i0, . . . , iN) of a simulated trajectory, also called a batch. We view the numbers

Σ_{t=k}^{N−1} α^(t−k) g(i_t, μ(i_t), i_{t+1}),   k = 0, . . . , N − 1,

as cost samples, one per initial state i0, . . . , iN−1, which can be used for least squares approximation of the parametric architecture J̃(i, r):

min_r Σ_{k=0}^{N−1} ( J̃(i_k, r) − Σ_{t=k}^{N−1} α^(t−k) g(i_t, μ(i_t), i_{t+1}) )².
Page 39
Batch Gradient Methods for Policy Evaluation
One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with the batch is updated at time N by

r := r − γ Σ_{k=0}^{N−1} ∇J̃(i_k, r) ( J̃(i_k, r) − Σ_{t=k}^{N−1} α^(t−k) g(i_t, μ(i_t), i_{t+1}) ).

Here, ∇ denotes gradient with respect to r, and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment).
Page 40
Batch Gradient Methods for Policy Evaluation
The update of r is done after processing the entire batch, and the gradients ∇J̃(i_k, r) are evaluated at the preexisting value of r, i.e., the one before the update.
In a traditional gradient method, the gradient iteration is repeated until convergence to the solution.
Page 41
Batch Gradient Methods for Policy Evaluation
However, there is an important tradeoff relating to the size N of the batch:
In order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N;
yet to keep the work per gradient iteration small, it is necessary to use a small N.
Page 42
Batch Gradient Methods for Policy Evaluation
To address the issue of size of N, batches may be changed after one or more iterations.
Thus, the N-transition batch comes from a potentially longer simulated trajectory, or from one of many simulated trajectories.
We leave the method for generating simulated trajectories and forming batches open for the moment.
But we note that it influences strongly the result of the corresponding least squares optimization.
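The batch update above can be sketched as follows for a linear architecture J̃(i, r) = φ(i)·r, so that ∇J̃(i, r) = φ(i). The chain, costs, features, and stepsize schedule are synthetic assumptions; batches here come from one long simulated trajectory, as just described:

```python
import random

random.seed(0)
alpha = 0.9                       # discount factor

def phi(i):                       # hypothetical feature vector of state i
    return (1.0, float(i % 5))

def step(i):                      # simulate (j, g(i, mu(i), j)) under policy mu
    return random.randrange(10), 1.0 + 0.1 * (i % 5)

def batch_gradient_update(r, states, costs, gamma):
    """One update r := r - gamma * sum_k grad * error over an N-transition batch."""
    N = len(costs)
    # cost sample for i_k: sum_{t=k}^{N-1} alpha^(t-k) g_t, computed backward
    samples, acc = [0.0] * N, 0.0
    for k in reversed(range(N)):
        acc = costs[k] + alpha * acc
        samples[k] = acc
    grad = [0.0, 0.0]
    for k in range(N):
        f = phi(states[k])
        err = f[0] * r[0] + f[1] * r[1] - samples[k]   # J~(i_k, r) - sample
        grad[0] += f[0] * err
        grad[1] += f[1] * err
    return [r[0] - gamma * grad[0], r[1] - gamma * grad[1]]

# Successive N = 20 batches from one long trajectory, diminishing stepsize.
r, i = [0.0, 0.0], 0
for it in range(1, 200):
    states, costs = [], []
    for _ in range(20):
        states.append(i)
        j, g = step(i)
        costs.append(g)
        i = j
    r = batch_gradient_update(r, states, costs, 0.01 / it)
```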
Page 44
Incremental Gradient Methods for Policy Evaluation
We now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches.
But we will see that the method is suitable for use with a single very long simulated trajectory, viewed as a single batch.
Page 45
Incremental Gradient Methods for Policy Evaluation
For a given N-transition batch (i0, . . . , iN), the batch gradient method processes the N transitions all at once and updates r a single time.
The incremental method updates r a total of N times, once after each transition.
Each time, it adds to r the portion of the gradient sum that can be calculated using the newly available simulation data.
Page 46
Incremental Gradient Methods for Policy Evaluation
Thus, after each transition (ik, ik+1):
We evaluate the gradient ∇J̃(i_k, r) at the current value of r.
We sum all the terms that involve the transition (ik, ik+1), and we update r by making a correction along their sum:

r := r − γ ( ∇J̃(i_k, r) J̃(i_k, r) − g(i_k, μ(i_k), i_{k+1}) Σ_{t=0}^{k} α^(k−t) ∇J̃(i_t, r) ).
Page 47
Incremental Gradient Methods for Policy Evaluation
By adding the "incremental" correction in the above iteration, we see that after N transitions, all the terms of the batch iteration will have been accumulated.
But there is a difference:
In the incremental version, r is changed during the processing of the batch, and the gradient ∇J̃(i_t, r) is evaluated at the most recent value of r [after the transition (it, it+1)].
Page 48
Incremental Gradient Methods for Policy Evaluation
By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the end of the batch.
Note that the gradient sum Σ_{t=0}^{k} α^(k−t) ∇J̃(i_t, r) can be conveniently updated following each transition, thereby resulting in an efficient implementation.
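The incremental iteration can be sketched as follows for a linear architecture J̃(i, r) = φ(i)·r, where ∇J̃(i, r) = φ(i). The running sum z_k = Σ_{t=0}^{k} α^(k−t) φ(i_t) is maintained with one multiply-add per transition, which is the efficient implementation noted above. The chain, costs, and stepsize schedule are synthetic assumptions:

```python
import random

random.seed(0)
alpha = 0.9

def phi(i):                       # hypothetical feature vector of state i
    return (1.0, float(i % 5))

def step(i):                      # simulate (j, g(i, mu(i), j)) under mu
    return random.randrange(10), 1.0 + 0.1 * (i % 5)

r = [0.0, 0.0]
z = [0.0, 0.0]                    # running discounted gradient sum
i = 0
for k in range(5000):             # one very long trajectory, one batch
    gamma = 0.01 / (1.0 + k / 500.0)                 # diminishing stepsize
    f = phi(i)
    z = [alpha * zc + fc for zc, fc in zip(z, f)]    # z := alpha*z + phi(i_k)
    j, g = step(i)
    jhat = f[0] * r[0] + f[1] * r[1]                 # J~(i_k, r)
    # r := r - gamma * ( phi(i_k)*J~(i_k, r) - g(i_k, mu(i_k), i_{k+1}) * z )
    r = [rc - gamma * (fc * jhat - g * zc)
         for rc, fc, zc in zip(r, f, z)]
    i = j
```

Note that r is updated after every transition while z carries the accumulated, discounted gradients of all earlier states, so the end of the batch plays no special role.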
Page 49
Incremental Gradient Methods for Policy Evaluation
It can be seen that, because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant.
In this case, for each state i, we will have one cost sample for every time state i is encountered in the simulation.
Accordingly, state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.
Page 50
Incremental Gradient Methods for Policy Evaluation
Generally, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts.
However, the rate of convergence of gradient methods of this type can still be very slow.
Page 52
Comparison with our approach
Approximate DP vs. our approach:

1. The method of gradient and stepsize determination
   Approximate DP: open problems
   Our approach: collect meaningful and easy-to-take information during simulation
2. The extent of the use of simulation
   Approximate DP: calculation only
   Our approach: coupled with the whole solution approach
3. Problem structure
   Approximate DP: single
   Our approach: double
4. Considered target
   Approximate DP: estimate the cost from n states
   Our approach: minimize the aggregate effect of n categories of attackers
5. Progress
   Approximate DP: architecture only
   Our approach: architecture and implementation
Page 53
The End