A Real Time Approximate Dynamic Programming Algorithm for Planning
Nikolaos E. Pratikakis , Jay H. Lee , and Matthew J. Realff
Agenda
Motivation
Background information
The Curse of Dimensionality
Algorithm
Exploitation Vs Exploration
Results
Conclusions and Future Directions
Motivation – Capacity Planning (1)
Main Processing
Station 1
Queue
Completed Jobs
Reconstruction Area
D (Demand)
R(Recirculation
Rate)
1-R
Testing Area
Station 2
Queue
Station 3
Queue
Motivation – Capacity Planning (2)
MIP with a deterministic future is well studied
MIP with an uncertain future faces computational bottlenecks:
(a) Solving for an expected value by sampling the future. This misses the opportunity to revise the actions depending on the state.
(b) Solving the full problem, where actions can depend on the state. This requires branching the future into scenarios:
• The number of branching points and scenarios grows quickly
• Fairly restrictive assumptions are needed about how the actions and the future interact
A rolling horizon is a compromise between (a) and (b)
Our solution strategy relates to Approximate Dynamic Programming (ADP). Key advantage:
ADP is based on a procedural representation of the problem (simulation code), whereas MIP must declare the alternatives explicitly before it starts
Background Information
Markov Decision Processes (MDP): A mathematical representation of a sequential decision making problem in which:
A system evolves through time
A decision maker controls it by taking actions at pre-specified points in time
Actions incur immediate costs or rewards and affect the subsequent system state
Basic Model Ingredients
State space S (generic state s)
Action space A (generic action α)
Rewards r(s, α)
Transition probabilities P(s′ | s, α)
A model is called stationary if rewards and transition probabilities are independent of t
Dynamic Programming (DP) is the computational tool to address MDPs
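For a small stationary MDP, the optimality equations can be solved exactly by value iteration. A minimal sketch, assuming a hypothetical 2-state, 2-action MDP with discounting (an illustration only, not the paper's planning model):

```python
import numpy as np

# Hypothetical toy MDP: P[a][s, s'] = transition probability, r[a][s] = reward
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.4, 0.6]])}
r = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 2.0])}
gamma = 0.9  # discount factor

def value_iteration(P, r, gamma, tol=1e-8):
    J = np.zeros(2)
    while True:
        # Bellman backup: J(s) = max_a { r(s,a) + gamma * sum_s' P(s'|s,a) J(s') }
        Q = np.stack([r[a] + gamma * P[a] @ J for a in P])
        J_new = Q.max(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmax(axis=0)  # value function and greedy policy
        J = J_new

J, policy = value_iteration(P, r, gamma)
```

This works only because the toy state space is enumerable; the slides that follow address what happens when it is not.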
Model Ingredients
State Space:
Queue lengths (wi)
Finished stock (St)
States of random variables (D, R)
A conservative estimate gives more than 1 billion discrete states
Action Space:
Number of machines available at each stage
Percentage of machines used at each stage
More than 1 million discrete controls per state
To achieve good performance:
Meet demand
The stock level is controlled around SSP
The queue levels (w2, w3) are minimized
Transition equations (material balances)
Demand and recirculation are modeled as first-order Markov processes
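A first-order Markov model for demand (recirculation is analogous) can be sketched as follows; the demand levels and transition matrix here are illustrative assumptions, not the paper's data:

```python
import numpy as np

demand_levels = [10, 20, 30]            # hypothetical discrete demand levels
T = np.array([[0.7, 0.2, 0.1],          # T[i, j] = P(next level j | current level i)
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def sample_demand_path(T, start=0, horizon=5, rng=None):
    """Draw one demand trajectory; each step depends only on the current level."""
    rng = rng or np.random.default_rng(0)
    path, i = [start], start
    for _ in range(horizon):
        i = rng.choice(len(T), p=T[i])  # first-order Markov transition
        path.append(i)
    return [demand_levels[i] for i in path]
```

Simulating such paths is what makes the procedural (simulation-based) representation of the problem possible.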
Formal Definition of Value Function
Given a policy π, the value function of state s0 is the expected reward:

$$J^{\pi}(s_0) = E\Big[\sum_{t} r(s_t, \pi(s_t)) \,\Big|\, s_0\Big]$$

The optimal value function corresponds to the best policy:

$$J^{*}(s) = \max_{\pi} J^{\pi}(s)$$

Optimal value functions are the solution of the optimality equations:

$$J^{*}(s) = \max_{a} \Big\{ r(s, a) + \sum_{s'} P(s' \mid s, a)\, J^{*}(s') \Big\}$$

The optimal action can easily be computed, if one has knowledge of the optimal value functions for all states
[Grid-world figure: a path from the starting state to the goal state]
The Curse of Dimensionality – Motivation for ADP
The Cardinality of the State Space
Approximate Dynamic Programming (Lee et al., 2004)
Explicit, Explore or Exploit (E3) (Kearns, 1998)
Real Time Dynamic Programming (Barto, 1995)
The Cardinality of the Action Space
No significant effort in the literature addresses this source of computational bottleneck (convergence cannot be guaranteed)
The Calculation of the Expectation over all dimensions of our random quantities
Stochastic gradient methods (Powell, 2005)
Monte Carlo Sampling
Bellman equation:

$$J(s_i) = \max_{a} \Big\{ r(s_i, a) + \sum_{j} P(s_j \mid s_i, a)\, J(s_j) \Big\}$$
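The expectation in the Bellman equation can be approximated by Monte Carlo sampling when the transition probabilities are only available through simulation. A sketch under assumed interfaces (`simulate` and the state encoding are hypothetical, not the paper's code):

```python
def mc_bellman_backup(state, actions, simulate, J, n_samples=100):
    """Return (best_action, value) using a sample-average Bellman backup.

    simulate(state, action) -> (reward, next_state) draws one random transition.
    J is a dict mapping visited states to current value estimates.
    """
    best_a, best_q = None, float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(n_samples):
            reward, s_next = simulate(state, a)
            # Unvisited successors fall back to 0.0 (an under-estimating init)
            total += reward + J.get(s_next, 0.0)
        q = total / n_samples            # Monte Carlo estimate of the expectation
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q
```

Sampling replaces the sum over all successor states, which is exactly the third bottleneck listed above.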
Trial-Based Real Time Dynamic Programming (Barto, 1995)
The algorithm proposed in Barto et al. (1995) to address stochastic shortest path problems (one goal state and a finite but large state space)
Inputs:
Initialize number of trials n.
Initialize the value functions for all the states in S
Basic Steps for each Trial:
1. Start from starting state st
2. Use the Bellman equation and pick α*
3. a) Simulate the system using α*; b) Update J(st)
4. Set st ← st+1
5. End at the goal state
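The trial loop above can be sketched as follows; `actions`, `simulate`, and `q_value` are hypothetical callables standing in for the problem-specific simulation code, not Barto's original implementation:

```python
def rtdp_trial(start, goal, actions, simulate, J, q_value):
    """One trial: follow greedy actions from start to goal, updating only
    the values of states actually visited (asynchronous backups)."""
    s = start
    while s != goal:
        # Steps 1-2: greedy action from the Bellman equation at the current state
        a_star = max(actions(s), key=lambda a: q_value(s, a, J))
        # Step 3a: simulate one transition under the greedy action
        reward, s_next = simulate(s, a_star)
        # Step 3b: back up only the visited state's value
        J[s] = q_value(s, a_star, J)
        # Step 4: advance; step 5: the loop terminates at the goal state
        s = s_next
    return J
```

Because only visited states are backed up, storage grows with the trajectories actually seen rather than with |S|.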
On Modifying RTDP
The concept of the relevant state space (S_REL)
We want to solve for the value function only in the region in which the system normally operates
A greedy heuristic, for example, is a good initial policy to define the operating space
RTDP modification:
Evolving the state space from an empty one
The “adaptive action set”
Tuning exploration vs. exploitation through the initialization of the value function for unseen successor states
The use of k-NN (local value function approximators)
Schematic Illustration Concerning State Space Terminology
$$S' = \{\, s_i : S_t > S_1 \ \text{or}\ w_2 > S_2 \,\}$$
The Proposed Method to Overcome the COD in the State, Action and Uncertainty Space
[Diagram: from an initial state s_i, each candidate action α in the adaptive action set A_sub is evaluated; possible successive states s_j are sampled from the uncertain transitions and scored with the Bellman equation to select the optimal action α*.]
Pratikakis, N.E., Realff, M.J. and Lee, J.H., “Strategic Capacity Decisions in Manufacturing Using Real-Time Adaptive Dynamic Programming”, submitted to Naval Research Logistics.
The Adaptive Action Set
Designed to circumvent the curse of dimensionality concerning the action space A
The idea: Use optimization and heuristics to selectively choose a small number of controls for each state
How to construct the A_sub?
Heuristic actions
Mathematical Programming actions
Random actions
Best known Actions
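Assembling A_sub from the four sources above can be sketched as follows; all generator callables here are hypothetical placeholders for the problem-specific heuristic and optimization routines:

```python
import random

def build_action_set(state, heuristic, solve_mip, best_known, full_space,
                     n_random=3, rng=None):
    """Selectively gather a small A_sub for one state instead of enumerating A."""
    rng = rng or random.Random(0)
    a_sub = set()
    a_sub.add(heuristic(state))                     # heuristic action
    a_sub.add(solve_mip(state))                     # mathematical-programming action
    a_sub.update(best_known.get(state, []))         # best actions seen so far
    a_sub.update(rng.sample(full_space, n_random))  # a few random actions
    return a_sub
```

Only these few candidates are then evaluated with the Bellman equation, rather than the million-plus controls of the full action space.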
On Evaluating the Actions in the A_sub
1st Scenario: All sj's belong to S_REL. Retrieve J(sj) from the look-up table.
2nd Scenario: Some of the sj's do not belong to S_REL. Use k-NN approximation.
3rd Scenario: What if |N(sj)| < k? Use an initial estimation scheme:
• Underestimating the optimal value function for all sj
• Overestimating the optimal value function for all sj
For all $s_i \in S_{REL}$:

$$J(s_i) = \max_{a \in A_{sub}} \Big\{ r(s_i, a) + \sum_{j} P(s_j \mid s_i, a)\, J(s_j) \Big\}$$
k-NN estimation for an unseen state $s_j$:

$$N(s_j) \stackrel{\text{def}}{=} \{\, s_i \in S_{REL} : d(s_i, s_j) = (s_i - s_j)^T W (s_i - s_j) \le \varepsilon \,\}$$

$$\text{If } |N(s_j)| \ge k: \quad J(s_j) = \frac{1}{k} \sum_{x \in N(s_j)} J(x)$$

Scenario 3 applies when $|N(s_j)| < k$
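The k-NN local approximation can be sketched as below; the weighted distance is simplified to Euclidean here for illustration, and the radius and fallback value are hypothetical parameters:

```python
import math

def knn_value(s_j, visited, k=3, radius=5.0, fallback=0.0):
    """Estimate J(s_j) from the k nearest already-visited states within a radius.

    visited: dict mapping state tuples to current value estimates J(s).
    """
    neighbors = [(math.dist(s_i, s_j), J) for s_i, J in visited.items()
                 if math.dist(s_i, s_j) <= radius]
    if len(neighbors) < k:      # Scenario 3: not enough neighbors
        return fallback         # fall back to the initial estimation scheme
    neighbors.sort(key=lambda t: t[0])
    return sum(J for _, J in neighbors[:k]) / k
```

The fallback value is exactly the initialization knob the next slide turns into an exploration/exploitation parameter.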
Exploration vs. Exploitation
The initialization of the state-space values turns out to be an important parameter of the algorithm for controlling exploitation vs. exploration
A consistent underestimation scheme as initialization of the optimal value function intuitively leads to minimum exploration
A consistent overestimation scheme as initialization of the optimal value function intuitively leads to maximum exploration
[Figure: |S'| vs. number of iterations during simulation with RTADP; over-estimation grows the explored state space faster than under-estimation]
a) RTADP with under-estimator: α* = α1 (bias to choose actions that lead to already explored space)
b) RTADP with over-estimator: α* = α2 (bias to choose actions that lead to unexplored regions of the state space)
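The bias described in (a) and (b) can be seen in a toy greedy selection; the states and values below are purely illustrative:

```python
def greedy_successor(current, successors, J, init):
    """Pick the successor with the highest value estimate; unseen states
    fall back to the initialization value `init`."""
    return max(successors, key=lambda s: J.get(s, init))

J = {"explored": 5.0}               # one successor already has a learned value
succ = ["explored", "unexplored"]

under = greedy_successor("s0", succ, J, init=0.0)    # under-estimator: stays put
over = greedy_successor("s0", succ, J, init=100.0)   # over-estimator: explores
```

A low initialization makes known states look best (exploitation); a high one makes unseen states look best (exploration).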
Results
RTADP with different exploration strategies
Comparison: MIP with full information vs. RTADP vs. heuristic vs. MIP with rolling horizon
Behavior of the architectures in time series (smooth stock level control)
RTADP combined with different exploration strategies
| Strategy | Value function estimate for sj's with no neighbors (Scenario 3) | Value function estimate for sj's that belong to set S′ |
|---|---|---|
| RTADP – a (under-estimator) | 0 | 0 |
| RTADP – b (prior knowledge) | 0 | -10 |
| RTADP – c (over-estimator) | 0 | $J^0(s_j) = J(s_i)$ |
RTADP combined with different exploration strategies (2)
[Figures: evolution of the explored state space for RTADP + under-estimation, RTADP + prior knowledge, and RTADP + over-estimation]
Performance Comparison
13.3% performance gap between RTADP-a and MIP with full information
Is multi-stage uncertainty not very dominant?
The Control of Stock Level using RTADP in Time Series
Conclusions and Future Directions
RTADP successfully implemented for this manufacturing job shop
Alleviate curse of dimensionality
• Adaptive action set
• Evolving state space
Usage of k-NN as a local approximator
The performance gap relative to the upper-bound (full information) solution is 13.3%
Importance of Multistage uncertainty
Initialization of the optimal value function can be used as a tuning parameter for exploitation vs. exploration
Possible extensions
Incorporate coherent risk measures into the objective function to shape the profit (cost) distribution accordingly
Acknowledgements
Advisors:
Jay H. Lee
Matthew J. Realff
ISSICL Group members
Financial Support: NSF (CTS#03019993)