DYNAMIC PROGRAMMING

Submitted to: Dr. AWDHESH KUMAR
Submitted by: NITIN KAPOOR (2K5/CE/433), PANKAJ MOHAN KAUSHAL (2K5/CE/434), PHOOL SINGH (2K5/CE/435)


Page 1: Title slide.

Page 2:

Dynamic Programming is an enumerative technique developed by Richard Bellman in 1953 and is based on the Bellman principle of optimality.

The word “programming” in “dynamic programming” comes from the term “mathematical programming” a synonym for optimization.

This technique is used to get the optimum solution to a problem which can be represented as a MULTISTAGE DECISION PROCESS.

An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with respect to the state resulting from the first decision.

Page 3:

Optimal substructure

In general, we can solve a problem with optimal substructure using a three-step process:

1. Break the problem into smaller subproblems.
2. Solve these subproblems optimally by applying this three-step process recursively.
3. Use these optimal solutions to construct an optimal solution for the original problem.
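As a concrete illustration of this three-step process, here is a minimal Python sketch (added to this transcript, not part of the original slides) that computes the n-th Fibonacci number by breaking it into the two smaller subproblems F(n-1) and F(n-2), solving them recursively, and combining the results. The memo dictionary is an added assumption so that each subproblem is solved only once.

```python
def fib(n, memo=None):
    """Fibonacci via the three-step optimal-substructure process."""
    if memo is None:
        memo = {}
    if n in memo:                      # subproblem already solved
        return memo[n]
    if n <= 1:                         # smallest subproblems
        memo[n] = n
    else:
        # 1. break into subproblems, 2. solve them recursively,
        # 3. combine their optimal solutions
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

print(fib(10))   # 55
```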

Page 4:

• Dynamic programming usually takes one of two approaches:

• Top-down approach: The problem is broken into sub-problems, these sub-problems are solved, and the solutions are remembered in case they are needed again. This is recursion and memoization combined.

• Bottom-up approach: All sub-problems that might be needed are solved in advance and then used to build up solutions to larger problems, but it is sometimes not intuitive to figure out all the sub-problems needed for solving the given problem.
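A minimal sketch (an assumed example written for this transcript, not from the slides) contrasting the two approaches on the same recurrence: the top-down version is the memoized recursion shown earlier; the bottom-up version below tabulates every subproblem in advance and builds up to the answer.

```python
def fib_bottom_up(n):
    """Bottom-up: solve all smaller subproblems first, then build up."""
    if n <= 1:
        return n
    table = [0] * (n + 1)              # table[k] holds the solved subproblem F(k)
    table[1] = 1
    for k in range(2, n + 1):          # every subproblem solved in order
        table[k] = table[k - 1] + table[k - 2]
    return table[n]

print(fib_bottom_up(10))  # 55, same answer as the top-down version
```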

Page 5:

Dynamic Programming

• As an algorithm, it is a powerful procedure for solving sequential decision problems. Many problems in water resources involve a sequence of decisions from one period to the next and are known as sequential decision problems.

• An important feature of Dynamic Programming is that non-linearity and constraints can be readily accommodated. In fact, constraints serve to reduce the region to be covered in the computations and are helpful in this sense.

Page 6:

• In a Dynamic Programming problem formulation, the dynamic behavior of a system is expressed using three types of variables:

• State variables: define the condition of the system. For example, the amount of water stored in the reservoir may represent its state. If a problem has one state variable per stage, it is called a one-dimensional problem; a multi-dimensional problem has more than one state variable per stage. Thus, the optimization of the operation of a system of two reservoirs will have two state variables, one for each reservoir.

• Stage variables: define the order in which events occur in the system. Most commonly, time is the stage variable. There must be a finite number of possible states at each stage.

• Control variables: represent the controls applied at a particular stage.

Page 7:

The principle of Dynamic Programming can be illustrated by the following example:

• A system of three reservoirs is to be constructed. The yield versus cost at the reservoir sites is given in the following table.

Find the minimum cost combination to get a total system yield of 60 and 80.

Yield | Cost (Reservoir 1) | Cost (Reservoir 2) | Cost (Reservoir 3)
0     | 0                  | 0                  | 0
20    | 15                 | 10                 | 20
40    | 30                 | 35                 | 40

Page 8:

• Solution

• * indicates optimal solution for that yield.

Total Yield | Yield from Reservoir 1 | Yield from Reservoir 2 | Cost of Reservoir 1 | Cost of Reservoir 2 | Total cost at Stage 2
0           | 0                      | 0                      | 0                   | 0                   | 0
20          | 20                     | 0                      | 15                  | 0                   | 15
20          | 0                      | 20                     | 0                   | 10                  | 10 *
40          | 40                     | 0                      | 30                  | 0                   | 30
40          | 20                     | 20                     | 15                  | 10                  | 25 *
40          | 0                      | 40                     | 0                   | 35                  | 35
60          | 20                     | 40                     | 15                  | 35                  | 50
60          | 40                     | 20                     | 30                  | 10                  | 40 *
80          | 40                     | 40                     | 30                  | 35                  | 65 *

Page 9:

• Proceeding to stage 3, all three reservoirs are considered. Now, for a total system yield of 60 and 80, the possible combinations and corresponding costs are given in the following table:

• * indicates optimal solution for that yield.

Total Yield | Yield from Reservoir 3 | Yield from Stage 2 | Cost of Reservoir 3 | Cost of Stage 2 | Total Cost
60          | 40                     | 20                 | 40                  | 10              | 50
60          | 20                     | 40                 | 20                  | 25              | 45
60          | 0                      | 60                 | 0                   | 40              | 40 *
80          | 20                     | 60                 | 20                  | 40              | 60 *
80          | 40                     | 40                 | 40                  | 25              | 65

Page 10:

• OBSERVATION MADE

• To get a yield of 60, reservoir 3 should not be constructed and a yield of 60 units should be obtained from stage 2. From computation of stage 2, one can note that reservoir 1 should give a yield of 40 and a yield of 20 units must be obtained from reservoir 2.

• Similarly, for a total system yield of 80, a yield of 20 must be planned from reservoir 3, and a yield of 60 from stage 2. The table for stage 2 shows that a yield of 40 should be obtained from reservoir 1 and reservoir 2 must provide a yield of 20.
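The stagewise calculation above can be reproduced with a short dynamic-programming sketch (a minimal illustration written for this transcript, not part of the original slides). It adds one reservoir per stage and keeps, for every feasible total yield, the cheapest combination found so far; the printed results match the starred entries in the tables.

```python
# Yield-versus-cost data from the slide (yield: cost) for each reservoir.
costs = [
    {0: 0, 20: 15, 40: 30},   # Reservoir 1
    {0: 0, 20: 10, 40: 35},   # Reservoir 2
    {0: 0, 20: 20, 40: 40},   # Reservoir 3
]

# Stage 0: nothing built yet, zero yield at zero cost.
best = {0: (0, [])}           # total yield -> (min cost, yields chosen so far)

for reservoir in costs:       # one DP stage per reservoir
    new_best = {}
    for total, (cost_so_far, choice) in best.items():
        for y, c in reservoir.items():
            candidate = (cost_so_far + c, choice + [y])
            key = total + y
            if key not in new_best or candidate[0] < new_best[key][0]:
                new_best[key] = candidate
    best = new_best

for target in (60, 80):
    cost, choice = best[target]
    print(f"yield {target}: min cost {cost}, yields per reservoir {choice}")
# yield 60: min cost 40, yields per reservoir [40, 20, 0]
# yield 80: min cost 60, yields per reservoir [40, 20, 20]
```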

Page 11:

Curse of dimensionality

Consider a two-reservoir problem. If reservoir 1 takes on 40 feasible states and reservoir 2 takes on 20 feasible states, the DP recursive equation would have to be evaluated at 40 × 20 = 800 points in each period.

In general, if there are n state variables at each stage and each state variable can take m discrete values, then the objective function must be evaluated at m^n points.

This problem, arising from the storage and comparison of an abnormally large number of values, was termed by Bellman the curse of dimensionality.
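A small illustration (added here, not in the slides) of how quickly the grid grows: with m discretization levels per state variable and n state variables, the recursion must visit m**n grid points per period.

```python
def grid_points(m, n):
    """Grid points per period: m discrete values for each of n state variables."""
    return m ** n

print(40 * 20)    # two reservoirs with 40 and 20 feasible states: 800 points
for n in (1, 2, 3, 5):
    print(n, "state variables, 20 values each:", grid_points(20, n))
# 1 -> 20, 2 -> 400, 3 -> 8000, 5 -> 3200000
```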

Page 12:

How to overcome the curse of dimensionality?

1. Intuitively, the number of values to be stored can be reduced by adopting a coarser grid for the initial computations.

2. After the optimal solution is located, a finer grid can be constructed in the vicinity of this solution.

Limitation: in this scheme, one may miss the global optimum, and the solution may converge to a local optimum.

Discrete Differential Dynamic Programming (DDDP)

• This technique, which uses the concept of increments for the state variables, was introduced by Larson (1968) and termed state increment DP (SIDP).

• Heidari et al. (1971) used this concept for reservoir operation studies and referred to it as Discrete Differential Dynamic Programming (DDDP). The major difference between Larson's SIDP and DDDP is the time interval used in the computations, which is variable in the former and fixed in the latter. In fact, DDDP is a generalization of SIDP.

Page 13:

• The DDDP procedure starts with an assumed trial state trajectory, which is a sequence of feasible state vectors resulting in a corresponding initial policy, and an initial value of the objective function.

• The procedure is assumed to have converged to a local optimum when the trajectories in two successive iterations are the same and a better value of the objective function cannot be found. This can be interpreted as a sort of successive approximation scheme. An initial estimate of the policy is made and this is used to construct an improved estimate.

• The scheme cannot assure the global optimum and may converge to a local optimum.

Page 14:

By starting from different initial solutions, the possibility of finding the global optimum is increased. This technique is particularly suitable for invertible systems, and water resources systems are mostly invertible. For example, assuming that the inflows to a reservoir are known, the releases from it can be determined if the states of the reservoir at different times are known.

Page 15:

DDDP (cont'd)

To obtain quick convergence, two procedures have been suggested to compute the increments of state variables.

The first is to keep the increments small and constant throughout an iteration.

The second is to reduce the size of increments as the iterations proceed.
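The iteration described on the last few slides can be outlined roughly as follows (an illustrative sketch written for this transcript, with a hypothetical helper `dp_over_corridor` and increment schedule; it is not the original authors' code). Each pass restricts the DP to a corridor of states within plus or minus one increment of the current trial trajectory, and the increment is reduced as the iterations proceed, per the second procedure above.

```python
def dddp(initial_trajectory, increments, dp_over_corridor, max_iters=50):
    """Rough outline of a DDDP iteration (illustrative only).

    initial_trajectory : list of trial states, one per stage
    increments         : decreasing sequence of state increments (delta)
    dp_over_corridor   : hypothetical routine that runs ordinary DP, but only
                         over states within +/- delta of the given trajectory,
                         and returns (better_trajectory, objective_value)
    """
    trajectory = list(initial_trajectory)
    best_value = float("inf")
    for delta in increments:                      # reduce increment size over time
        for _ in range(max_iters):
            new_traj, value = dp_over_corridor(trajectory, delta)
            if new_traj == trajectory and value >= best_value:
                break                             # converged for this increment
            trajectory = new_traj
            best_value = min(best_value, value)
    return trajectory, best_value                 # may be only a local optimum
```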

Page 16:

Stochastic Dynamic Programming

The DP formulation which takes into account the stochastic nature of the variables is known as Stochastic Dynamic Programming (SDP).

Since many water resource variables are stochastic in nature, the DP approach is frequently modified to account for this stochasticity.

The basic idea of this procedure is to generate a number of synthetic streamflow sequences which match the properties of the observed inflow series. For each of these series, a DP formulation is used to get the optimum policy, so there will be as many policies as there are synthetic sequences.

In this method, it is necessary to analyze streamflows on a time-period basis and express the relation between them as transition or conditional probabilities of period-to-period flows.

Page 17:

Where the probabilities of the various values of a variable depend on the value of that variable in a previous time period, the sequence of events so described is called a Markov Chain. When the probability of being in a given state after another given state is a fixed quantity, it is termed a constant or stationary conditional probability. Many hydrologic variables display this property.

The derivation of the optimal policy is based on the assumption that the system is ergodic.

For an ergodic system, the final system state is independent of the starting state. For example, in a reservoir operation problem, this is equivalent to stating that no matter what the state of the reservoir is at the start of the computations, the steady state of the system will be independent of that starting state.

Page 18:

Difference between deterministic and stochastic dynamic programming

In a deterministic problem, the action taken at the current state completely determines what the next state will be.

In a probabilistic (stochastic) problem, the action taken at the current state alters the probability law of the next state of the process, but the next state is still a random variable.
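A minimal sketch (added for this transcript, with generic made-up function names, not from the slides) of how the Bellman backup differs in the two cases: in the deterministic case the next state is known exactly, whereas in the stochastic case we take an expectation over the transition probabilities of the next state.

```python
def deterministic_backup(state, actions, cost, next_state, J_next):
    """J(state) = min_u [ cost(state, u) + J_next(next_state(state, u)) ]"""
    return min(cost(state, u) + J_next[next_state(state, u)] for u in actions)

def stochastic_backup(state, actions, cost, transition, J_next):
    """J(state) = min_u [ cost(state, u) + E over s' of J_next(s') ],
    where transition(state, u) yields (next_state, probability) pairs."""
    return min(
        cost(state, u)
        + sum(p * J_next[s_next] for s_next, p in transition(state, u))
        for u in actions
    )
```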

Page 19:

Advantages of dynamic programming

• It reduces a single N-dimensional problem to a sequence of N one-dimensional problems.

• It can determine many important structural features of a solution even in cases where the problem cannot be solved completely.

• The use of structural properties of the solution and the reduction in dimension combine to furnish computing techniques which greatly reduce the time required to solve the original problem.

Page 20:

• Let us now begin by specifying the mathematical structure of the problem, starting with some notation.

• The control variable at every time (or stage) t will be denoted by ut.

• Let At denote the set of possible values the control can take at time t, i.e. for every t the control ut ∈ At.

• For example, At could be the set of non-negative integers (in which case it is the same for all t), or At = {ut : 0 ≤ ut ≤ k(t)}, where k(t) is a positive number whose value depends on t.

• Also note that ut can be vector valued, in which case At specifies the values the vector can take.

Page 21:

• Let Ut denote the sequence of controls over a horizon [0, t], i.e. Ut = (u0, u1, ..., ut). This is just the set of controls taken on the interval [0, t]; it is often called a policy.

• Assuming we start at a point where our process has the value x, let the cost incurred in adopting the sequence {uk}, k = 0, ..., t, or equivalently in adopting the policy Ut, be denoted by J(x, Ut) = J(x, u0, u1, ..., ut).

• The above is often written JU(0, t)(x). In words, it is the cost incurred over the horizon [0, t] when starting from x and using the controls u0, u1, ..., ut.

Page 22:

• Now suppose the horizon of interest is [0, T], i.e. we wish to stop at time (or stage) T.

• Let J*(0, T)(x) denote the optimal cost, J*(0, T)(x) = inf over U(T−1) of JU(0, T)(x), where we have used inf rather than min since the minimum may not be attained without some conditions on At.

• Note that we specify the control actions only up to T − 1, since the initial value and the controls up to T − 1 determine where the trajectory of the system will be at time T, and we have no interest in proceeding further (as we terminate the problem at T).

Page 23:

• Now, by applying the Principle of Optimality we obtain: J*(t, T)(x) = inf over ut ∈ At of J(t, T)(x, ut, U*(t+1, T−1)), where U*(t, s) denotes the optimal values of the controls chosen in the interval [t, s].

• Thus, in order to solve for J*(0, T)(x), we need to specify the terminal cost J(T, T)(x). Once we have this we can work backwards. Note that in the equation above x is just a variable which specifies that we start from the point x at time t.

• Without further assumptions on the structure of the cost function we cannot say much more about the behaviour of J* and hence cannot determine the optimal controls u*.

In the sequel we will assume the following structure for the problem:

• We will assume that the objective function, or cost, has an additive structure. By this we mean that the overall cost over the interval of interest (called the horizon) is the sum of the individual costs incurred at each stage or time point. The individual costs are called running costs or stage costs.

Page 24:

• First note that in the definition of J(0, t) we carry the initial value x and the vector U(t−1). This is because, by specifying the initial condition and the set of control values over a given interval, we specify the value of the trajectory at the end of the interval. Thus, as t increases, the vector (x, u0, u1, ..., ut) grows in dimension, which is very inconvenient. We can reduce the information needed to define our process or system by introducing the concept of the state of a system.

(The state of a system)

• The state of a system is a quantity which encapsulates the past of the system. Specifically, the state of the system at time (or stage) t, denoted by xt, is such that knowing xt and the set of inputs (ut, ut+1, ..., uT) allows us to determine xT+1 completely.

• In our context this is equivalent to saying that knowing x0 = x and u0 determines x1 at time 1, knowing x1 and u1 determines x2, and so on. This suggests the following equation for the evolution of the system:

xk+1 = ak(xk, uk),  k = 0, 1, ...,  with initial state x0 = x given.

Page 25:

• With the above definition of the state, we can now define the general form of the additive costs we will treat:

JU(0, T)(x) = Σ (k = 0 to T−1) ck(xk, uk) + kT(xT)

where the terms ck(xk, uk) denote the running or stage costs (which depend on the time k, the state xk and the control or decision uk), and kT(xT) denotes the terminal cost.

Remark: We have written the running costs in terms of the state at time k and the control used at time k. In terms of our earlier discussion, if we did not use the concept of the state, each term would have to be of the form ck(x, Uk). Using the state is therefore a considerable saving, both in terms of interpretation and because (see below) the optimization at each stage is carried out over the current control variable uk at that stage, instead of over the whole vector Uk.
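To make the additive-cost definition concrete, here is a small sketch (added for this transcript; the dynamics ak, stage costs ck and terminal cost kT are generic placeholders, not anything specified on the slides) that simulates the system xk+1 = ak(xk, uk) forward under a given control sequence and accumulates JU(0, T)(x).

```python
def evaluate_policy(x0, controls, a, c, k_T):
    """Total additive cost J_U(0,T)(x0) = sum_k c(k, x_k, u_k) + k_T(x_T)."""
    x, total = x0, 0.0
    for k, u in enumerate(controls):      # controls = (u_0, ..., u_{T-1})
        total += c(k, x, u)               # running (stage) cost
        x = a(k, x, u)                    # state evolution x_{k+1} = a_k(x_k, u_k)
    return total + k_T(x)                 # add the terminal cost

# Tiny made-up example: x_{k+1} = x_k + u_k, stage cost u_k^2, terminal cost x_T^2.
print(evaluate_policy(0.0, [1, 2, -1],
                      a=lambda k, x, u: x + u,
                      c=lambda k, x, u: u * u,
                      k_T=lambda x: x * x))   # 1 + 4 + 1 + 2**2 = 10.0
```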

Page 26:

Dynamic Programming Applications

Page 27:

What are we doing here?

Learning to make long-term decisions …

Page 28:

Sequential decision making model and ingredients

[Block diagram: at each stage the system takes the present state and an action, incurs a cost, and moves to the next state.]

Page 29:

Planning ahead

Present decisions affect future events by:
– making certain opportunities available
– precluding others
– altering the costs of still others

Trade-off: low cost now vs. high costs in the future.
DP: techniques for making interrelated sequential decisions.

Page 30:

Objective

Reflects the decision maker's inter-temporal tradeoffs. Minimize or maximize:
• total expected return
• total discounted return
• average reward per stage
• worst-case expected return
• expected utility
• preference ordering
• multi-objective (e.g. mean-variance)

Page 31:

Tools

• Decision rule: specifies the action to be taken at a particular time.

• Policy: sequence of decision rules; prescription for taking actions in the future.

• Optimal policy: policy that optimizes objective.

Page 32:

Problem Types

• finite vs infinite state set
• finite vs infinite horizon
• discrete vs continuous time
• deterministic vs stochastic system

Page 33:

The Stagecoach story

• Some 150 years ago there was a salesman travelling west by stagecoach ...

[Network diagram: origin A, intermediate stages with nodes B, C, D; E, F, G; H, I; and destination J.]

Page 34:

Insurance Costs

[The same network diagram, now with the insurance cost of each stagecoach run written on the corresponding arc; the individual arc costs are not legible in this transcript.]

Page 35:

The Stagecoach (cont'd)

• Greedy: A-B-F-I-J costs $13
• But A-D-F = 4 < A-B-F = 6, even though A-D = 3 > A-B = 2
• Not being greedy pays off!
• Trial and error (exhaustive enumeration) takes forever!
• Idea: work backwards!

Page 36:

The Stagecoach Solution

• F(X) = min cost from X to J ("cost-to-go")
• F(J) = 0
• F(H) = 3, F(I) = 4
• F(G) = 6, F(F) = 7, F(E) = 4
• F(D) = 8, F(C) = 7, F(B) = 11
• F(A) = 11 on A-D-F-I-J (not unique!)
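The cost-to-go values above can be reproduced with a short backward recursion. The arc costs below are an assumption: they are the insurance costs of the classic stagecoach example (the figure in this transcript is not legible), chosen because they reproduce every number quoted on these slides (the greedy cost of 13, A-D-F = 4, A-B-F = 6, and all of the F values).

```python
# Assumed arc costs for the stagecoach network (not recoverable from the
# garbled figure; taken from the classic stagecoach example and consistent
# with all values quoted on the slides).
cost = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6},
    'C': {'E': 3, 'F': 2, 'G': 4},
    'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4},
    'F': {'H': 6, 'I': 3},
    'G': {'H': 3, 'I': 3},
    'H': {'J': 3},
    'I': {'J': 4},
}

# Backward recursion: F(X) = min over successors Y of [ cost(X, Y) + F(Y) ].
F = {'J': 0}
best_next = {}
for node in ['H', 'I', 'E', 'F', 'G', 'B', 'C', 'D', 'A']:  # later stages first
    nxt = min(cost[node], key=lambda y: cost[node][y] + F[y])
    best_next[node] = nxt
    F[node] = cost[node][nxt] + F[nxt]

print(F['A'])          # 11
path, node = ['A'], 'A'
while node != 'J':
    node = best_next[node]
    path.append(node)
print('-'.join(path))  # one optimal route (A-C-E-H-J here); A-D-F-I-J also costs 11
```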

Page 37:

Deterministic Dynamical System (DDS): the state at the next stage is completely determined by the state and decision at the current stage.

[Block diagram: state xt and control ut enter the system at time t, a cost gt(xt, ut) is incurred, and the state moves to xt+1 at time t+1.]

State, control, time horizon:

xt+1 = f(xt, ut, t) = ft(xt, ut),  t = 0, 1, ..., N−1

Page 38:

DDP Ingredients

• Deterministic dynamic system described by state xt ∈ St (St = state space at time t).
• Control/action to be selected at time t: ut ∈ Ut(xt) (Ut(xt) = action set at time t in state xt).
• Dynamics (plant equation): xt+1 = ft(xt, ut), t = 0, 1, ..., N−1
• Total cost function, additive over time:

gN(xN) + Σ (t = 0 to N−1) gt(xt, ut)

where gt(xt, ut) = cost of decision ut.

Page 39:

Policies

• Rule for choosing the value of the control variables under all possible circumstances, as a function of the perceived circumstances (= strategy, control law).
• Actions are taken in real time, whereas a policy is formulated in advance.
• Closed-loop (or feedback) control: ut = u(xt, t); sequential decisions depend on the current state.
• Open-loop control: ut = u*(x0, t); all decisions are made at time t = 0 (actions are determined by the clock, as opposed to the current state).
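A small sketch (illustrative only, with a made-up one-dimensional plant, not from the slides) of the distinction: a closed-loop policy looks at the current state at every stage, while an open-loop plan fixes all actions at t = 0 and only follows the clock.

```python
def simulate(x0, policy, f, N):
    """Run the plant x_{t+1} = f(x_t, u_t) for N steps with u_t = policy(x_t, t)."""
    x, trajectory = x0, [x0]
    for t in range(N):
        u = policy(x, t)            # closed-loop policies may use the state x
        x = f(x, u)
        trajectory.append(x)
    return trajectory

f = lambda x, u: x + u              # made-up plant: next state = state + control

closed_loop = lambda x, t: -0.5 * x          # feedback law u(x_t, t)
open_loop_plan = [1.0, 0.0, -1.0]            # all decisions fixed at t = 0
open_loop = lambda x, t: open_loop_plan[t]   # depends only on the clock t

print(simulate(2.0, closed_loop, f, 3))   # [2.0, 1.0, 0.5, 0.25]
print(simulate(2.0, open_loop, f, 3))     # [2.0, 3.0, 3.0, 2.0]
```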

Page 40:

Principle of Optimality

• Given the current state, an optimal policy for the remaining stages is independent of the policy adopted in previous stages.

• From any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding subproblem initiated at that point.

• Action: select a decision that minimizes the sum of the cost incurred at the current stage and the least total cost that could be incurred over all subsequent stages, consequent on the present decision.

Page 41:

Bellman's principle

• Jt(xt) = optimal cost ("cost-to-go") starting in state xt at stage t.
• Bellman's principle of optimality:

JN(xN) = gN(xN)
Jt(xt) = min over ut ∈ Ut(xt) of { gt(xt, ut) + Jt+1(ft(xt, ut)) }

• Optimal cost for the overall problem: J0(x0).
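A minimal, generic implementation of this backward recursion (written for this transcript; the state set, action sets, dynamics ft and costs gt are placeholders to be supplied by the caller, not anything specified on the slides):

```python
def solve_dp(N, states, actions, f, g, g_N):
    """Backward recursion J_N = g_N, J_t(x) = min_u [ g(t,x,u) + J_{t+1}(f(t,x,u)) ].

    states  : iterable of all states (assumed the same at every stage here)
    actions : actions(t, x) -> iterable of admissible controls U_t(x)
    f       : f(t, x, u)    -> next state (must lie in `states`)
    g       : g(t, x, u)    -> stage cost
    g_N     : g_N(x)        -> terminal cost
    Returns the cost-to-go tables J[t][x] and a greedy policy pi[t][x].
    """
    J = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)
    for t in range(N - 1, -1, -1):            # the later policy is decided first
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in actions(t, x):
                c = g(t, x, u) + J[t + 1][f(t, x, u)]
                if c < best_cost:
                    best_u, best_cost = u, c
            J[t][x], pi[t][x] = best_cost, best_u
    return J, pi
```

Supplying concrete states, dynamics and costs reproduces recursions like the reservoir and stagecoach calculations earlier in the deck.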

Page 42:

Project Planning and Critical Path Analysis

• Project: K activities of known durations
• Some activities must be completed before others can start
• Find the minimum completion time and the critical activities
• Nodes = completion of some project phase; node 1 = start, node N = end of project
• Arc (i, j) = activity that starts once phase i is completed and has duration tij
• Acyclic network with all nodes reachable from node 1

Page 43:

Critical Path Analysis

• Path 1 → i: p = {(1, j1), (j1, j2), ..., (jk, i)}
• Duration: Dp = t(1, j1) + t(j1, j2) + ... + t(jk, i)
• Completion time of phase i: Ti = max{ Dp : paths p from 1 to i }
• Longest path problem, i.e. a shortest path problem SP(G, −tij): a shortest path on the graph with negated arc lengths.

Page 44:

Critical Path Analysis (cont'd)

• Let Sk = { i : all paths 1 → i have at most k arcs }, S0 = {1} (nodes reachable within k steps from node 1).
• Threshold property: there exists k* such that Sk = S for all k ≥ k*, and Sk ⊂ S otherwise.
• Longest path recursion:

Ti = max over arcs (j, i) with j ∈ Sk−1 of { Tj + tji },  for all i ∈ Sk, i ∉ Sk−1

• This is a forward DP algorithm.

Page 45:

Critical Path Analysis: example

S0 = {1}, S1 = {1, 2}, S2 = {1, 2, 3}, S3 = {1, 2, 3, 4}, S4 = {1, 2, 3, 4, 5}

Completion times: T1 = 0, T2 = 3, T3 = 4, T4 = 6, T5 = 10

Critical path: 1 → 2 → 3 → 4 → 5

[Project network diagram: node 1 = start, node 5 = end; the arcs are activities (hire, order, transport, construction, training) with their durations; the figure itself is not recoverable from this transcript.]
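The forward recursion can be sketched as follows. Since the exact activity durations are not legible in the transcript's figure, the durations below are a hypothetical set chosen only so that the computed completion times match the slide (T1 = 0, T2 = 3, T3 = 4, T4 = 6, T5 = 10) with critical path 1 → 2 → 3 → 4 → 5.

```python
# Hypothetical durations t[j][i] for arcs (j, i); chosen to reproduce the
# completion times quoted on the slide, not taken from the original figure.
t = {
    1: {2: 3, 3: 2},
    2: {3: 1, 4: 2},
    3: {4: 2, 5: 5},
    4: {5: 4},
}

# Forward DP: T_i = max over incoming arcs (j, i) of [ T_j + t_ji ].
T = {1: 0}
pred = {}
for i in (2, 3, 4, 5):                       # nodes in topological order
    j_best = max((j for j in t if i in t[j]), key=lambda j: T[j] + t[j][i])
    T[i] = T[j_best] + t[j_best][i]
    pred[i] = j_best

print(T)           # {1: 0, 2: 3, 3: 4, 4: 6, 5: 10}
node, path = 5, [5]
while node != 1:
    node = pred[node]
    path.append(node)
print(path[::-1])  # critical path [1, 2, 3, 4, 5]
```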

Page 46:

Guidelines for DP algorithms

• View solution as a sequence of decisions occurring in stages and incurring additive costs

• Define state as a summary of all relevant past decisions

• Determine which state transitions are possible and identify their corresponding costs.

• Write a recursion on the optimal cost from the origin state to a destination state

Page 47:

To Remember

• The optimal ut is a function only of the state xt and the time t.
• The DP equation expresses the optimal ut in closed-loop form. It is optimal whatever the past control policy may have been.
• The DP equation is backward induction in time; the later policy is always decided first.

Page 48:

References (Google search and e-books):
1. Dynamic Programming, Richard Bellman.
2. Water Resources Systems Planning and Management, Sharad Kumar Jain and Vijay P. Singh.

Page 49:

THANK YOU