Meso-Parametric Value Function Approximation for
Dynamic Customer Acceptances in Delivery Routing
Marlin W. Ulmer Barrett W. Thomas
Abstract
In this paper, we introduce a novel method of value function approximation (VFA). In
stochastic, dynamic decision problems, VFAs approximate the reward-to-go. Conventionally,
VFAs are either parametric or non-parametric. Parametric VFAs (P-VFAs) approximate the
value function using a particular functional form. Non-parametric VFAs (N-VFAs) approxi-
mate value functions without assuming a functional form. Both VFAs have advantages and
shortcomings. While P-VFAs provide fast and reliable approximation, reliable in the sense
that there is an approximate value for every state, the approximation is often inaccurate. N-
VFAs can provide more accurate approximations, but require significant computational effort
to do so. To combine the advantages and to alleviate the shortcomings of P-VFA and N-VFA
used individually, we present a novel method, meso-parametric value function approximation
(M-VFA). This method combines P-VFA and N-VFA approximations. Most importantly, we
demonstrate that simultaneous tuning of the approximations leads to better outcomes than either N- or P-VFA individually or an ex-post combination. Using a benchmark problem
that allows combining elements of routing problems, problems for which N-VFA has shown
superior performance, and knapsack problems, problems for which P-VFA has shown supe-
rior performance, we compare the proposed approach with the individual VFAs and online
rollout algorithms. We show how M-VFA offers the advantages of the individual VFAs while
alleviating their shortcomings.
Keywords: Dynamic Customer Acceptances, Dynamic Vehicle Routing, Dynamic Multi-
Dimensional Knapsack Problem, Approximate Dynamic Programming, Value Function Ap-
proximation
1 Introduction
Many decision-making problems involve a sequence of decisions in which one must decide to
allocate a finite set of resources for an immediate reward or to conserve resources to maintain the
possibility of taking advantage of some future and yet unknown opportunity. Examples of such
decision-making problems include project scheduling, fleet management, and capital budgeting.
These sequential decision making problems under uncertainty are naturally modeled as Markov
Decision Processes (MDPs). Given the scale of most real-world problems, solutions of these MDPs
rely on approximate dynamic programming (ADP) techniques.
A common ADP technique is value function approximation (VFA). VFAs approximate the
cost-to-go of the optimality equation. VFAs generally operate by reducing the dimensionality of
the state through the selection of a set of features to which all states can be mapped. The cost-to-go
is then approximated via this set of features using either parametric or non-parametric methods.
Parametric VFA (P-VFA) approximates the cost-to-go by using the feature set as variables in a pre-
specified functional form. Non-parametric VFA (N-VFA) operates by directly approximating the
value of an observed instance of features, often using look-up tables. Both P- and N-VFA offer the
computational advantage that they can be tuned offline. P-VFAs are known for providing “reliable”
approximations. That is, P-VFAs can return a value for any given state. Further, P-VFA provides
a reliable approximation “using a relatively small number of observations” (Powell, 2011, p.237).
In addition, given the tuned parameters, P-VFAs can be quickly evaluated. However, even if it
can return a value, the value returned by P-VFAs is often inaccurate as a result of the simplifying
functional assumptions. Because they do not rely on any particular functional form, N-VFAs can be
“very accurate approximations of very general functions, as long as we have enough observations”
(Powell, 2011, p.238). In practice, it can be challenging to “have enough observations.” As Bertsimas and Demir (2002) show, N-VFAs therefore tend to provide unreliable approximations for large problem sizes.
To mitigate the challenges associated with each method without diminishing each method’s
advantages, we develop a meso-parametric value function approximation (M-VFA) that combines
both P- and N-VFA. The proposed M-VFA simultaneously tunes the parameter values of both the
N-VFA and P-VFA and combines the two. In this paper, we propose a state-space aggregation via
a lookup table for the N-VFA and a linear basis function for our P-VFA. We propose combining
the two via a linear combination, but theoretically, a variety of methods could be used to combine
or even choose between the two approximations.
We demonstrate the effectiveness of the proposed method by applying it to the capacitated
customer acceptance problem with stochastic requests (CAPSR). In the CAPSR, throughout the
day, a dispatcher receives random customer requests for service. The dispatcher must instantly
accept or reject each request. Each request comprises a location in the service area, a required
capacity, and a revenue. Accepted requests are delivered the next day by a capacitated vehicle
within a working shift. To determine whether or not a request can be accepted, the dispatcher must
determine whether or not a request can be served feasibly, and if it can be, whether the revenue that
would be earned is worth the consumption of resources. The objective is to maximize the expected
overall revenue.
We select the CAPSR as our test environment for two reasons. First, it is a problem important
for delivery companies (Esser and Kurte, 2015; Savelsbergh and Van Woensel, 2016). Second, the
problem represents a combination of a routing problem with a dynamic knapsack problem. Both
individual problems exhibit different value function structures. For knapsack problems, paramet-
ric VFAs perform well (Bertsimas and Demir, 2002). With the routing component, we experience
interdependencies between decisions and time consumption. Therefore, the value function struc-
ture may be complex, and it is difficult to identify a particularly effective functional form. In such
cases, N-VFA has been more effective (Ulmer et al., 2017).
This paper makes the following contributions. First, we introduce the M-VFA and demonstrate
how to simultaneously tune N- and P-VFA to create M-VFA. Using the CAPSR as a test environ-
ment, we then show that the M-VFA performs better than either the N- or P-VFA alone. To analyze
the performance of M-VFA for different value function structures, we also systematically shift the
focus between the dynamic knapsack and the dynamic routing problem. As expected, P-VFA per-
forms well for the knapsack problem and N-VFA performs well for the routing problem. However,
both methods are significantly outperformed by M-VFA.
The paper is outlined as follows. In Section 2, we present the literature on VFA. In Section 3,
we provide an overview of ADP. Section 4 defines the M-VFA and describes the procedure for
tuning it. In Section 5, we formally present the CAPSR. We also describe the details of our tuning
approach for the M-VFA applied to the CAPSR and present the benchmark policies. For a variety
of instances based on customer data from Iowa City, we evaluate and analyze the approaches in
Section 6. The paper concludes with a summary and an outlook in Section 7.
2 Related Literature
Our work presents a general and novel ADP method for a dynamic decision problem. In our literature review, we present an overview of the methodological literature on related ADP methods. We first present literature analyzing the performance of N- and P-VFA as well as developing methods to reinforce their functionality. We then give an overview of work combining different ADP methods.
Generally, the literature confirms that P-VFAs are reliable but inaccurate, while N-VFAs are
accurate but not always reliable (He et al., 2012; Fang et al., 2013; Powell and Meisel, 2016). That
is, P-VFAs can produce a value for any state, but the value may not well represent the value of
being in the state. On the other hand, N-VFAs can offer more accurate approximations, but there
is often a large computational burden in doing so and even then it might not be possible to return
an estimate for every state.
The only systematic comparison of N- and P-VFA is conducted by Bertsimas and Demir (2002)
who compare the two on a deterministic multi-dimensional knapsack problem. The N-VFA pro-
posed by Bertsimas and Demir (2002) approximates the individual value for every potential state.
The P-VFA approximates the value function using a linear function based on each dimension’s
remaining capacity. Bertsimas and Demir (2002) show that the success of N-VFA and P-VFA
depends on the problem’s dimensions. Results in He et al. (2012) suggest that when the functional form is known, the P-VFA can offer significantly better performance than the N-VFA.
2.1 N-VFA Literature
N-VFA methods have a long history in the literature. See Powell (2011) and Bertsekas and Tsitsiklis (1996) for overviews. N-VFAs have been shown to be particularly effective in domains in which
the functional form of the value-function is challenging to quantify. One such application area is
dynamic vehicle routing (Goodson et al., 2013, 2016; Ulmer et al., to appear, 2017), the domain to
which we apply the method proposed in this paper.
In this paper, we focus on N-VFAs that store VFAs in the form of a look-up table. Methods us-
ing value-function approximations in the form of a lookup table are often referred to as state-space
aggregation. Powell and Meisel (2016) state that N-VFAs based on “[lookup tables] are particularly
impacted by the curses of dimensionality.” The key challenge with lookup tables or state-space ag-
gregation is determining at what level to aggregate or partition the state space. Partitioning at too
fine a level leads to a lookup table that is too large, and even if it can be stored in memory, there are
often areas of the partition for which there are no values. Coarser approximations can overcome
the problem of having empty partitions, but they do so at the risk of greater approximation error.
George et al. (2008) refer to these as sampling and aggregation error, respectively.
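The trade-off between sampling and aggregation error can be made concrete with a small sketch: the same value observations stored in a fine and a coarse one-dimensional lookup table. The bin widths, features, and values below are invented for illustration and are not from any particular study.

```python
# Illustrative lookup-table aggregation: the same observations stored at a
# fine and a coarse partition of a one-dimensional feature.
from collections import defaultdict

def make_table(bin_width):
    sums = defaultdict(float)
    counts = defaultdict(int)

    def update(feature, observed_value):
        key = int(feature // bin_width)   # partition cell of the feature
        sums[key] += observed_value
        counts[key] += 1

    def estimate(feature):
        key = int(feature // bin_width)
        if counts[key] == 0:
            return None                   # empty cell: no reliable estimate
        return sums[key] / counts[key]    # cell mean as approximated value

    return update, estimate

fine_update, fine_est = make_table(bin_width=1.0)
coarse_update, coarse_est = make_table(bin_width=10.0)

for feat, val in [(2.5, 4.0), (2.7, 6.0), (8.1, 20.0)]:
    fine_update(feat, val)
    coarse_update(feat, val)
```

With these three observations, the fine table has no value for a query at feature 5.3 (the sampling-error risk), while the coarse table answers every query with a single cell mean that blurs very different observations together (the aggregation-error risk).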
To alleviate this shortcoming of lookup tables, researchers propose several methods to improve
the performance of the approximation. George et al. (2008) introduce a method that uses multiple
lookup tables, each with increasing levels of coarseness. For any given state, the value of the future
is given by combining the values in each of the lookup tables. Fang et al. (2013) demonstrate the
effectiveness of the technique for a supply chain sourcing problem.
As an alternative to the method of George et al. (2008), Ulmer et al. (2017) propose a method
that dynamically partitions the (post-decision) state space in response to the learning process. We
refer to the method as dynamic lookup table (DLT). Ulmer et al. (2017) show that for routing
problems, the value function is often complex and not amenable to any particular functional form.
They further show that DLT outperforms the approach of George et al. (2008) in effectiveness and
efficiency. Thus, we incorporate the approach of Ulmer et al. (2017) into our M-VFA to reinforce
the N-VFA component. We further compare M-VFA with N-VFA based on DLT.
Papadaki and Powell (2002) introduce a method for applying lookup tables to monotonic value
functions. One important feature of the proposed algorithm is that it takes advantage of the mono-
tonicity and updates neighboring cells of a just observed cell in a lookup table. This approach
improves the convergence of the estimates in the lookup table. This feature of the algorithm pre-
sented in Papadaki and Powell (2002) also updates regions of the lookup table in which there are
no observations. The combination of the N-VFA with the P-VFA in this paper also allows for
values in unexplored regions of the lookup table, but does not require monotonicity to do so. Jiang
and Powell (2015b) generalize the work in Papadaki and Powell (2002) and introduce a provably
convergent algorithm. Jiang and Powell (2015a) demonstrate the application of the technique in
solving an energy management problem.
Because of their ability to theoretically approximate any continuous function (Hornik et al.,
1989), neural networks are also often used to approximate state values. Bertsekas and Tsitsiklis
(1996) provide a well known overview of the methods with a recent overview available in Liu et al.
(2017). With the rise of “deep learning” (see LeCun et al. (2015) for an introduction to deep learn-
ing), the use of neural nets to approximate value functions has recently received renewed interest.
The best known example is the work of Mnih et al. (2013) that uses deep learning combined with
Q-learning, a method similar to post-decision state lookup table methods, to learn to play Atari
2600 games at levels similar to human players. As we note in Section 2.3, our method can be used
with neural-net-based approximations.
We note that one could view lookahead methods as a form of N-VFA. Lookahead methods
approximate the value function either by solving value functions by looking a limited number of
steps into the future or by approximating the future with a heuristic policy. These methods are non-
parametric in the sense that they do not assume any particular functional form for the approximated
values. Lookahead methods are what are known as “online” VFA in that the approximations are
solved at runtime. In contrast, the work in this paper focuses on offline methods for which the
approximations are determined offline in advance of execution. Given their success in solving both
knapsack and dynamic routing problems (see Goodson et al. (2017) and Ulmer et al. (to appear)),
we compare the M-VFA to an online lookahead method known as rollout. Powell (2011) provides an overview of lookahead methods, with Goodson et al. (2017) providing the latest advances in
rollout algorithms.
2.2 P-VFA Literature
Like N-VFAs, P-VFAs have a long history in the literature. Powell (2011) provides a general
overview. Geist and Pietquin (2013) provide an overview of determining parameters in P-VFA.
The most common P-VFA is a linear basis function approximation. A basis function maps fea-
tures of the state into real values and then linearly combines the values. Examples of successful
application of linear basis functions to approximate value functions include ambulance redeploy-
ment (Maxwell et al., 2010; Schmid, 2012), dynamic vehicle routing (Meisel, 2011), technician
scheduling (Chen et al., 2017), and truckload trucking (Simao et al., 2009).
There has also been a large body of literature exploiting known non-linear functional forms.
Piecewise linear approximations have proven particularly successful. Examples of piecewise linear
VFAs can be found in fleet management (Godfrey and Powell, 2002a,b; Topaloglu and Powell,
2006), infertility treatment (He et al., 2012), and inventory management (Godfrey and Powell,
2001). To the best of the authors’ knowledge, piecewise linear approximations in the literature
rely on monotonicity. Godfrey and Powell (2001) introduce a method for tuning a piecewise linear
approximation of concave functions. In some applications, monotonicity may not hold across
all states. To overcome this challenge, He et al. (2012) seek to improve the quality of a piecewise
linear approximation by partitioning the state space and finding piecewise linear approximations
for each partition.
Piecewise linear approximations for nonlinear value functions have the advantage that the
preservation of linearity often allows for the application of efficient math programming techniques
to solve the approximate value function. Yet, there are fields, particularly economics and finance,
in which continuous, nonlinear approximations are favored. An overview of nonlinear P-VFA and
a discussion of numerous applications can be found in Cai and Judd (2014). Recent work in the
area focuses on the challenges of solving nonlinear approximate Bellman equations. Examples
include Cai et al. (2017) and Shen and Wang (2015).
For the M-VFA proposed in this paper, we use linear basis functions for our P-VFA. We do
so because we know of no particular functional form that fits the problem that we are studying.
Further, our results demonstrate that even a linear approximation improves solution quality. How-
ever, the general idea of our proposed scheme does not rely on a linear form of the approximation
and the combination of N- and P-VFA proposed in this paper could use a different functional
approximation than linear.
2.3 Literature on Combining VFAs
In our proposed algorithm, we combine N- and P-VFAs. The literature on such methods is limited.
Both Powell (2011, pp.242) and Bertsekas and Tsitsiklis (1996, pp.70) propose approximations
that combine N- and P-VFA. However, neither presents an application of the proposed approaches.
Powell (2011, pp.242) proposes embedding a lookup table into a P-VFA. The lookup-table values are filled a priori by a domain expert. Our method differs in two ways. First, instead
of drawing on a domain expert, we use offline simulation to fill the values of the lookup table. Sec-
ond, in our method, the lookup-table values are approximated not sequentially but simultaneously
with the P-VFA. We show the advantage of the simultaneous approximation in our computational
evaluation by comparing M-VFA to an ex-post combination of the individual VFAs.
Bertsekas and Tsitsiklis (1996, pp.70) propose combining a neural-net-based approximation
with a linear basis function. The authors propose a two-stage scheme in which they first tune the
neural network, and then having fixed the neural-network approximation, they learn the values of
the basis function. Again, we propose learning the N- and P-VFA values simultaneously. Our
results demonstrate the advantage of this approach.
Additional work combines online lookahead methods with offline VFAs. Online methods de-
termine a state’s value during the decision-making process. Thus, in contrast to offline methods,
online methods require real-time computation time. Online methods often provide detailed ap-
proximation while offline methods generally offer a more reliable approximation based on many
simulation runs. Li and Womer (2015) and Ulmer et al. (to appear) present online rollout algo-
rithms (RAs) with VFAs as base policies. Ulmer and Hennig (2016) limit the horizon of an RA
and estimate the remaining horizon via a VFA value. A similar idea is sketched by Powell et al. (2012), who simulate the vehicle routing via an online lookahead and estimate the value of the resulting vehicle locations with a VFA. These methods are related to Monte Carlo tree search, of-
ten applied to generate policies for complex games with long horizons such as Go (Browne et al.,
2012). Our method differs from the online/offline methods in that our approximation scheme is
based on the combination of two offline methods.
3 Approximate Dynamic Programming
In this section, we provide an overview of approximate dynamic programming. We first recall
the terminology of finite Markov decision processes and then describe the approximate Bellman
Equation.
Figure 1: Markov Decision Tree
3.1 Markov Decision Process
Markov decision processes (MDPs) are models of sequences of decisions, and stochastic dynamic
decision problems are generally modeled as MDPs. In the following, we recall the terminology of
an MDP to later illustrate the procedure of the M-VFA. The terminology is reflected in the Markov
decision tree shown in Figure 1. An MDP contains a sequence of decision points k = 0, . . . , K.
Parameter K may be a random variable. At each decision point k, a decision state Sk and a set
of potential decisions X (Sk) is given. In Figure 1, decision states are represented by the squares
and decisions by the solid arrows. Each decision x ∈ X (Sk) for state Sk provides a reward
R(Sk, x). This reward may be an expectation. The application of a decision x to a state Sk leads to
a deterministic transition to a post-decision state S^x_k, represented by a circle in Figure 1. We utilize post-decision states in the M-VFA. A realization ω_k ∈ Ω_k(S^x_k) of an exogenous random variable, indicated by the dashed arrows in Figure 1, leads to a new decision state S_{k+1} = (S^x_k, ω_k). This
procedure continues until a termination state SK is reached.
A policy π : S → X is a sequence of decision rules that assigns a decision X^π_k(S_k) ∈ X(S_k) to every state S_k ∈ S. The decision X^π_k(S_k) is the decision made in state S_k under policy π at decision point k. An optimal policy π* maximizes the expected rewards over all decision points beginning from an initial state S_0. Formally, π* is given by
$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}\left[ \sum_{k=0}^{K} R(S_k, X_k^\pi(S_k)) \,\middle|\, S_0 \right]. \quad (1)$$
3.2 The Approximate Bellman Equation
Equation (1) can be rewritten recursively as
$$V(S_k) = \max_{x \in X(S_k)} \left\{ R(S_k, x) + \mathbb{E}\left[ V(S_{k+1}) \mid S_k \right] \right\}. \quad (2)$$
The value function V represents the expected reward-to-go originating from a given state. Tradi-
tionally, Equation (2) is solved by backward induction. For most real-world applications, however,
the backward induction approach suffers from the well known “curses of dimensionality.” To over-
come this challenge, researchers turn to solving approximate forms of Equation (2). This method
is often called approximate dynamic programming (ADP). Powell (2011) provides an overview of
ADP. In ADP, we replace the second term of Equation (2) with an approximated value resulting in
the approximate Bellman Equation given by
$$\bar{V}(S_k) = \max_{x \in X(S_k)} \left\{ R(S_k, x) + \mathbb{E}\left[ \bar{V}(S_{k+1}) \mid S_k \right] \right\}. \quad (3)$$
In this paper, we will operate on an equivalent approximate Bellman Equation, the post-decision
approximate Bellman Equation, given as:
$$\bar{V}(S_k) = \max_{x \in X(S_k)} \left\{ R(S_k, x) + \bar{V}(S_k^x) \right\}, \quad (4)$$
where V̄(S^x_k) is known as the value of the post-decision state.
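To make the post-decision form of the approximate Bellman Equation concrete, the following sketch selects the decision maximizing immediate reward plus the approximated post-decision value. The state representation, decision set, and the approximation `v_bar` are hypothetical placeholders, not a specific implementation from the literature.

```python
# Hedged sketch of decision selection via the post-decision approximate
# Bellman Equation: argmax over x of R(S_k, x) + V_bar(S_k^x).

def select_decision(state, decisions, reward, transition, v_bar):
    """Return (best_decision, best_value) for a decision state.

    reward(state, x)     -- immediate reward R(S_k, x)
    transition(state, x) -- deterministic post-decision state S_k^x
    v_bar(post_state)    -- approximated value of the post-decision state
    """
    best_x, best_v = None, float("-inf")
    for x in decisions:
        post_state = transition(state, x)
        v = reward(state, x) + v_bar(post_state)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v


# Toy example (all numbers invented): accept (1) or reject (0) a request
# worth 5, with a post-decision value proportional to remaining capacity.
state = {"capacity": 10}
decision, value = select_decision(
    state,
    decisions=[0, 1],
    reward=lambda s, x: 5 * x,
    transition=lambda s, x: {"capacity": s["capacity"] - 3 * x},
    v_bar=lambda s: 0.8 * s["capacity"],
)
```

Here accepting yields 5 + 0.8 · 7 = 10.6 versus 8.0 for rejecting, so the request is accepted; with a steeper capacity valuation the same logic would conserve the resource instead.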
4 Meso-Parametric Value Function Approximation
In this section, we describe the proposed M-VFA. We first formalize N- and P-VFA. We then use
N- and P-VFA to describe M-VFA. We conclude the section by describing the approximate value
iteration for M-VFA (AVI-M-VFA), the method that we use to tune the M-VFA. The key feature
of AVI-M-VFA is the simultaneous approximation of the N- and P-VFA in the creation of a VFA
that is a combination of both.
4.1 M-VFA: Combining P-VFA and N-VFA
Generally, in VFA, states are represented by quantifications based on a (sub-)set of state dimensions
called features φ ∈ Φ. These features are functions mapping states to real numbers, indicators, or
ordinal numbers for specific state characteristics. For example, consider a state that includes the
location of a vehicle at a particular time. We could map this state to a single feature, the point of
time. Both P-VFA and N-VFA as well as M-VFA use features to approximately evaluate states.
Our proposed M-VFA is a combination of N- and P-VFA. To apply P-VFA, two assumptions
are made. First, we assume that there is a known subset of features Φ^p = (φ^p_1, . . . , φ^p_{l_p}) ⊆ Φ. Second, we assume a general functional form f^V is given (e.g., linear, polynomial, logarithmic, etc.). The functional form may be a sum of individual functions f^V_1, . . . , f^V_m, such as monomials in a polynomial. These individual functions may draw on all or on subsets of the features Φ^p. A P-VFA is fitted to a particular problem using a set of tunable parameters Θ = (θ_1, . . . , θ_m), usually one for each individual function in f^V, resulting in a specific function f^V(Θ). In this paper, we
focus on what are known as linear basis functions, which, for a post-decision state S^x, results in
$$\bar{V}^p(S^x) = f^V(\Phi^p(S^x), \Theta) = \theta_0 + \sum_{i=1}^{m} \theta_i \phi_i^p(S^x). \quad (5)$$
In contrast to P-VFA, N-VFAs do not assume a functional form. In this paper, we focus on the methods known as state-space aggregation. In these methods, and similar to the case of the P-VFA, the state is mapped to a set of features Φ^n = (φ^n_1, . . . , φ^n_{l_n}) ⊂ Φ, and the N-VFA approximates the value for each individual feature combination. The resulting approximated values V_LT are stored in an l_n-dimensional lookup table, each dimension representing a feature. Thus, the value of a post-decision state S^x is V̄^n(S^x) = V_LT(φ^n_1(S^x), . . . , φ^n_{l_n}(S^x)).
The M-VFA is a combination of the two approximations, V̄^p and V̄^n, which we represent generally as V̄ = g(V̄^p, V̄^n). While the algorithm for determining the values of V̄^p and V̄^n is agnostic to the form of the combination, in this paper, we focus on a convex combination of V̄^p and V̄^n. Given a post-decision state S^x and a user-defined parameter λ, the M-VFA is given by
$$\bar{V}^\lambda(S^x) = g^\lambda(\bar{V}^p, \bar{V}^n)(S^x) = (1 - \lambda) \times \bar{V}^p(S^x) + \lambda \times \bar{V}^n(S^x). \quad (6)$$
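A minimal sketch of the linear basis function and its convex combination with a lookup table in code. The feature values, coefficients, and in particular the fallback to the P-VFA value for unobserved lookup cells are our own illustrative assumptions, not the paper's specification.

```python
# Sketch of an M-VFA value of a post-decision state: a linear basis function
# (P-component) convexly combined with a lookup-table estimate (N-component).

def p_vfa(features, theta0, theta):
    # linear basis function: theta_0 + sum_i theta_i * phi_i(S^x)
    return theta0 + sum(t * f for t, f in zip(theta, features))

def n_vfa(features, lookup, default):
    # lookup-table value V_LT(phi_1, ..., phi_ln); fall back to a default
    # when the cell has never been observed (the N-VFA's reliability gap)
    return lookup.get(tuple(features), default)

def m_vfa(features, lam, theta0, theta, lookup):
    v_p = p_vfa(features, theta0, theta)
    v_n = n_vfa(features, lookup, default=v_p)  # assumption: back off to P-VFA
    # convex combination: (1 - lambda) * V^p + lambda * V^n
    return (1 - lam) * v_p + lam * v_n

features = (3, 2)                  # e.g., a discretized feature vector
lookup = {(3, 2): 12.0}            # one observed lookup cell (invented value)
value = m_vfa(features, lam=0.5, theta0=1.0, theta=[2.0, 1.5], lookup=lookup)
```

With these numbers the P-component evaluates to 10.0 and the observed cell to 12.0, so the combined value is 11.0; for an unobserved cell the sketch degrades gracefully to the P-VFA value alone.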
4.2 Approximate Value Iteration for the M-VFA
In this section, we define our method for determining the values of the parametric and non-
parametric components of the M-VFA. We denote these two components as M-VFA(P) for para-
metric and M-VFA(N) for non-parametric. We refer to our method as AVI-M-VFA.
AVI-M-VFA is based on approximate value iteration (AVI) (see Powell (2011) for an overview of AVI). Like AVI, AVI-M-VFA iterates through a set of sample-path realizations. At each iteration and each step in a given sample-path realization, the algorithm either explores the state space or exploits the current value function. AVI-M-VFA solves the approximate Bellman Equation using the current approximated values V̄ of the sampled post-decision states. These values are a combination of the current values of M-VFA(N) (V̄^n) and M-VFA(P) (V̄^p). The key difference between AVI-M-VFA and AVI, as well as between AVI-M-VFA and related methods discussed in the literature review, is that AVI-M-VFA updates V̄^n and V̄^p simultaneously.
The details of AVI-M-VFA are presented in Algorithm 1. Input for the algorithm is the initial parametric approximate value function M-VFA(P), V̄^p. The set of M-VFA(N) values is initially empty. Throughout, the algorithm carries the observed states and their approximated values. This set of observations O is initially empty.
After initialization, the algorithm generates a series of sample paths. For each sample path, the algorithm records the realized post-decision states and the running value of the rewards. These values are stored respectively in the sets 𝒮^x and ℛ, which are empty at the start of each iteration of the algorithm.
Given a decision state S_k along sample path i, the algorithm solves the approximate Bellman Equation. To this end, the algorithm iterates through the potential decisions, evaluates the approximate Bellman Equation for each post-decision state S^x_k resulting from the current state and a decision, and selects the decision that maximizes it. We note that the algorithm can be modified to include some randomization in the decision selection.
Algorithm 1: Meso-Parametric Value Function Approximation
Input: Initial M-VFA(P) V̄^p
Output: M-VFA(N) V̄^n, M-VFA(P) V̄^p

 1: // Initialization
 2: i ← 1
 3: V̄^n ← ∅
 4: O ← ∅
 5: // Simulation
 6: while i ≤ N do
 7:     k ← −1
 8:     x ← ∅
 9:     S^x_{−1} ← ∅
10:     𝒮^x ← ∅
11:     ℛ ← ∅
12:     R_{−1} ← 0
13:     while S^x_k ≠ S_K do
14:         k ← k + 1
15:         ω^i_k ← GenerateExogenous(S^x_{k−1})
16:         S_k ← (S^x_{k−1}, ω^i_k)
17:         v ← −BigM
18:         for all x ∈ X(S_k) do
19:             S^x_k ← (S_k, x)
20:             v_temp ← R(S_k, x) + g(V̄^p, V̄^n)(S^x_k)
21:             if v_temp > v then
22:                 v ← v_temp
23:                 x* ← x
24:             end
25:         end
26:         S^x_k ← (S_k, x*)
27:         R_k ← R_{k−1} + R(S_k, x*)
28:         𝒮^x ← 𝒮^x ∪ {S^x_k}
29:         ℛ ← ℛ ∪ {R_k}
30:     end
31:     // Update
32:     O ← UpdateObservations(O, V̄^n, V̄^p, 𝒮^x, ℛ)
33:     V̄^n ← UpdateN(V̄^n, O)
34:     V̄^p ← UpdateP(V̄^p, O)
35:     i ← i + 1
36: end
37: // Termination
38: return V̄^n, V̄^p, O
A sample path ends upon reaching a termination state. Before beginning a new sample path,
the states and values observed during the sample path are added to the set of observations. Most
importantly, M-VFA(N) and M-VFA(P) are updated. The specific updates depend on the design of
M-VFA(N) and M-VFA(P). In Section 5.4, we provide an example related to the test problem used
in this paper.
After exploring N sample paths, the algorithm returns the VFAs of M-VFA(N) and M-VFA(P)
as well as the observation information O, potentially required to determine the values of V .
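The loop of Algorithm 1 can be sketched for a generic problem interface as follows. The environment methods, the single scalar feature, the running-average N-update, and the least-squares P-update over recent observations are simplifying stand-ins for the problem-specific design choices described later; the fallback to the P-VFA value for empty lookup cells is likewise our own assumption.

```python
# Compact sketch of an AVI-M-VFA-style loop: sample a path greedily with the
# current combined approximation, then update the N- and P-components
# simultaneously from the realized values-to-go.

def fit_line(obs):
    # ordinary least squares for value ~ theta0 + theta1 * feature
    n = len(obs)
    mx = sum(f for f, _ in obs) / n
    my = sum(v for _, v in obs) / n
    sxx = sum((f - mx) ** 2 for f, _ in obs)
    slope = 0.0 if sxx == 0 else sum((f - mx) * (v - my) for f, v in obs) / sxx
    return [my - slope * mx, slope]

def avi_m_vfa(env, n_paths, lam, window=100):
    theta = [0.0, 0.0]            # P-VFA coefficients: theta0 + theta1 * feature
    table, counts = {}, {}        # N-VFA lookup table and observation counts
    observations = []             # (feature, realized value-to-go)

    def v_bar(feature):
        v_p = theta[0] + theta[1] * feature
        v_n = table.get(feature, v_p)      # assumption: back off to P-VFA
        return (1 - lam) * v_p + lam * v_n

    for _ in range(n_paths):
        path, total = [], 0.0
        state = env.initial_state()
        while not env.is_terminal(state):
            # solve the approximate Bellman Equation over the decisions
            x = max(env.decisions(state),
                    key=lambda d: env.reward(state, d)
                    + v_bar(env.feature(env.post_state(state, d))))
            total += env.reward(state, x)
            post = env.post_state(state, x)
            path.append((env.feature(post), total))
            state = env.next_state(post)
        # realized value-to-go of each visited post-decision state
        new_obs = [(f, total - r) for f, r in path]
        observations.extend(new_obs)
        # simultaneous update of both components
        for f, v in new_obs:               # N: running average per lookup cell
            counts[f] = counts.get(f, 0) + 1
            table[f] = table.get(f, 0.0) + (v - table.get(f, 0.0)) / counts[f]
        if observations:                   # P: least squares on recent obs
            theta[:] = fit_line(observations[-window:])
    return theta, table
```

Because `v_bar` reads the live `theta` and `table`, every sampled path immediately uses the latest joint approximation, which is the simultaneity the section emphasizes.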
5 Application: The Customer Acceptance Problem with Stochas-
tic Requests
In this section, we define the CAPSR and model it as a Markov decision process (MDP). We
then introduce an implementation of M-VFA specific to CAPSR and present a set of benchmark
policies from the literature. For a review of the literature related to CAPSR, we refer the reader to
Appendix A.2.
5.1 Problem Statement
In the CAPSR, a dispatcher receives orders from customers located in a given service area. These customers place requests dynamically during the horizon [0, t^c_max], and the orders are unknown at the start of the horizon. Each requesting customer C offers an individual revenue P(C) and requires a specific capacity κ(C).
Accepted orders are served by a vehicle with capacity κ_max that delivers orders during a delivery phase of duration t^d_max; the request horizon and the delivery phase do not overlap. Each delivered order consumes the same service time ζ, and the travel time between two customers and/or the depot is d(·, ·).
Upon receiving a request for service, the dispatcher must immediately accept or reject the
request. Once accepted, an order must be served. An order can be accepted only if the addition of
the order to the vehicle does not violate the capacity and if a feasible planned tour τ incorporating the new request exists. This means the overall travel and service duration d(τ) does not exceed the time limit of the delivery phase t^d_max. The dispatcher can also reject a request. The dispatcher seeks
to maximize the expected sum of revenues.
5.2 Markov Decision Process
We model the CAPSR as a route-based MDP (see Ulmer et al. (2016a) for an overview of route-
based MDPs). An example of the CAPSR can be found in Appendix A.3.
A decision point k occurs when a new order is issued. A state S_k = (t_k, C_k, C^new_k, τ_k) contains the point of time t_k ∈ [0, t^c_max] at which the order occurs, the set of already accepted orders C_k = {C^1_k, . . . , C^m_k}, the new order C^new_k, and the currently planned tour τ_k = (D, C^{τ_k}_1, . . . , C^{τ_k}_m, D) through the already accepted customers, starting and ending at the depot D.
At each decision point k, a decision x(S_k) is made about whether to accept or reject the customer C^new_k and, if the customer is accepted, how to accommodate it in τ_k. A decision x is feasible if the resulting tour duration d(τ^x_k) does not exceed t^d_max and the sum of capacities does not exceed the overall capacity,
$$\kappa_{max} - \sum_{C \in \mathcal{C}_k^x} \kappa(C) \geq 0.$$
The reward is R(S_k, x) = P(C^new_k) if the customer C^new_k is accepted and R(S_k, x) = 0 otherwise.
The decision to accept customer C^new_k leads to a transition in which the customer C^new_k is added to the set C^x_k and the tour τ^x_k is updated to include C^new_k, resulting in the post-decision state S^x_k = (t_k, C^x_k, τ^x_k). The realization of the next request ω_k leads to a new decision state S_{k+1} = (t_{k+1}, C^x_k, C^new_{k+1}, τ^x_k).
The MDP is initialized at the point of the first order with S_0 = (t_0, ∅, C^new_0, (D, D)). The initial tour contains only the depot. The termination state is S_K = (t^c_max, C^x_{K−1}, τ^x_{K−1}).
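The feasibility side of a decision x(S_k) can be sketched as follows: a request is acceptable only if the capacity suffices and it can be inserted into the planned tour within the delivery-phase duration. Cheapest insertion is our illustrative stand-in for updating the tour; the Euclidean travel times, depot location, and all constants are assumptions for the example.

```python
# Sketch of a CAPSR-style accept check: capacity test plus cheapest feasible
# insertion of the request location into the planned tour.
import math

def travel(a, b):
    return math.dist(a, b)   # Euclidean travel time (assumption)

def tour_duration(tour, service_time):
    legs = sum(travel(tour[i], tour[i + 1]) for i in range(len(tour) - 1))
    return legs + service_time * (len(tour) - 2)   # depot needs no service

def try_accept(tour, used_capacity, request, kappa_max, t_d_max, service_time):
    """Return the cheapest feasible tour including the request, or None."""
    loc, demand = request
    if used_capacity + demand > kappa_max:
        return None                                # capacity violated
    best = None
    for i in range(1, len(tour)):                  # candidate insertion points
        candidate = tour[:i] + [loc] + tour[i:]
        if tour_duration(candidate, service_time) <= t_d_max:
            if best is None or (tour_duration(candidate, service_time)
                                < tour_duration(best, service_time)):
                best = candidate
    return best                                    # None if no feasible slot

depot = (0.0, 0.0)
tour = [depot, (4.0, 0.0), depot]                  # one accepted customer
new_tour = try_accept(tour, used_capacity=3,
                      request=((4.0, 3.0), 2),
                      kappa_max=10, t_d_max=20.0, service_time=1.0)
```

A `None` result corresponds to a forced rejection; otherwise the returned tour would become τ^x_k in the post-decision state.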
5.3 M-VFA for CAPSR
In the following, we describe how we apply and tune the M-VFA for the CAPSR. We describe the
selected features and the required tuning of M-VFA. Finally, even though the M-VFA overcomes
the curse of dimensionality related to the state space, the CAPSR is also challenged by the dimen-
sionality of the action space. Thus, we reduce the decision space by applying a routing heuristic.
We start with the parametric and non-parametric components of M-VFA and then describe how the
steps of Algorithm 1 are executed.
Parametric and Non-Parametric Components
For both the M-VFA(P) and M-VFA(N) that we apply to the CAPSR, we use as features the free time budget b^x_k and the free capacity κ^x_k of a post-decision state S^x_k. The free time budget b^x_k, 0 ≤ b^x_k ≤ t^d_max, is computed as
$$b_k^x = t_{max}^d - d(\tau_k^x).$$
The free capacity κ^x_k follows from the currently consumed capacity and is computed as
$$\kappa_k^x = \kappa_{max} - \sum_{C \in \mathcal{C}_k^x} \kappa(C).$$
An example of these two features is presented in Appendix A.3.
For the purpose of presentation, we write b and κ in the remainder of this section. We also use
the current point of time t. For both the M-VFA(P) and M-VFA(N), a state is therefore represented
by a three-dimensional feature-vector Φn = Φp = (t, b, κ).
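As a sketch, the feature mapping above can be written as follows; the constants and function name are illustrative.

```python
# Illustrative limits: delivery-phase duration and vehicle capacity.
T_D_MAX = 480.0
KAPPA_MAX = 100.0

def features(t: float, tour_duration: float, consumed_capacity: float):
    """Map a post-decision state to the feature vector (t, b, kappa)."""
    b = T_D_MAX - tour_duration            # free time budget b = t_d_max - d(tau)
    kappa = KAPPA_MAX - consumed_capacity  # free capacity kappa_max - sum kappa(C)
    return (t, b, kappa)
```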
For the parametric component M-VFA(P), we approximate a linear function fV . Our choice
is motivated by Bertsimas and Demir (2002) who demonstrate the effectiveness of such a function
when applied to a knapsack problem, a problem related to the capacity component of the CAPSR.
Because preliminary tests integrating t as a feature into the M-VFA(P) resulted in inferior policies,
we discretize time into unit intervals and derive a function fVι (b, κ) for each of the resulting time
intervals ι ∈ T , where T is the set of intervals resulting from discretizing [0, tcmax]. The overall
function is therefore stepwise-linear over the time-dimension. The function takes as variables b
and κ and is formally written as
$$V^p(S^x) = \theta^b_\iota \cdot b + \theta^\kappa_\iota \cdot \kappa + \theta^a_\iota, \qquad (7)$$
where $\iota$ is the time interval in $T$ associated with $S^x$.
The coefficients $\Theta_\iota = (\theta^b_\iota, \theta^\kappa_\iota, \theta^a_\iota) \in \mathbb{R}^3$ determine the function and are approximated. Coefficient $\theta^a_\iota$ represents the intercept of the function. This term is zero for the optimal value function because, by definition, a budget of zero in the free time and the vehicle's capacity leads to a value of zero. Yet, we add this parameter to increase the number of considered functions, since preliminary tests that did not include an intercept led to inferior results. To estimate the coefficients $\theta^b_\iota, \theta^\kappa_\iota, \theta^a_\iota$ for each interval $\iota$, we draw on multiple linear regression, minimizing the mean-squared error associated with the realized values and functional values over the last $\nu$ observations. Based on preliminary tests, we set $\nu = 100$.
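The per-interval regression can be sketched as a sliding buffer of the last ν = 100 observations fit by least squares. The class and buffer design are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from collections import deque

NU = 100  # number of most recent observations used per time interval

class IntervalRegression:
    """Least-squares fit of V = theta_b*b + theta_kappa*kappa + theta_a
    for one time interval iota (illustrative sketch)."""

    def __init__(self):
        self.obs = deque(maxlen=NU)  # (b, kappa, realized value-to-go)

    def add(self, b, kappa, value):
        self.obs.append((b, kappa, value))

    def fit(self):
        """Minimize the mean-squared error over the stored observations."""
        data = np.array(self.obs)
        X = np.column_stack([data[:, 0], data[:, 1], np.ones(len(data))])
        theta, *_ = np.linalg.lstsq(X, data[:, 2], rcond=None)
        return theta  # (theta_b, theta_kappa, theta_a)

    def value(self, b, kappa):
        tb, tk, ta = self.fit()
        return tb * b + tk * kappa + ta
```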
The non-parametric component M-VFA(N) approximates values VLT of a three-dimensional
LT, one dimension for the point of time t, one for the budget b, and one for the capacity κ. To
allow a fast and efficient approximation, we draw on a dynamic state space partitioning scheme,
the dynamic lookup table (DLT) introduced in Ulmer et al. (2017). The DLT partitions the three-
dimensional vector space in response to observed values. The partitioning is defined by intervals
in all three dimensions. The DLT starts with large intervals to achieve a first approximation. Then,
the DLT creates finer partitions for “important” and “reliable” areas, those in which there are
many observations and thus represent states that are frequently visited, of the vector space. The
DLT further refines the partitions in areas with high volatility across the observed values. This
means that areas with high variance across the observed values and with a sufficient number of
observations are partitioned into smaller intervals. Given a current partition of the DLT, the value
of a post-decision state is calculated as
$$V^n(S^x) = V_{\text{DLT}}(t, b, \kappa). \qquad (8)$$
As proposed in Ulmer et al. (2017), the DLTs start with an interval length of 16 units in both time and capacity, decreasing to 1. We set the DLT-threshold parameter to 3.0 as proposed in Ulmer et al. (2016b). This parameter controls the speed at which the DLT algorithm creates new partitions. The values for M-VFA(N) are updated after each run to the running average.
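A minimal lookup-table sketch of the non-parametric component: values are stored per (t, b, κ) cell and updated to the running average. The actual DLT additionally refines cell sizes in frequently visited, high-variance areas; here the interval length is fixed at the paper's initial value of 16, and all names are illustrative.

```python
CELL = 16  # fixed interval length (the DLT would refine this dynamically)

class LookupTable:
    def __init__(self):
        self.sum = {}    # cell -> sum of observed values
        self.count = {}  # cell -> number of observations

    def _cell(self, t, b, kappa):
        # Partition each dimension into intervals of length CELL.
        return (int(t) // CELL, int(b) // CELL, int(kappa) // CELL)

    def update(self, t, b, kappa, value):
        c = self._cell(t, b, kappa)
        self.sum[c] = self.sum.get(c, 0.0) + value
        self.count[c] = self.count.get(c, 0) + 1

    def value(self, t, b, kappa):
        """Running average V_DLT(t, b, kappa), or None for an unvisited cell."""
        c = self._cell(t, b, kappa)
        if c not in self.count:
            return None
        return self.sum[c] / self.count[c]
```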
5.4 AVI-M-VFA Implementation for the CAPSR
In this section, we describe the implementation of AVI-M-VFA (Algorithm 1) for the CAPSR.
For each value of λ = 0, 0.1, 0.2, . . . , 1, we run N = 1 million trials of Algorithm 1. Because
the action space is so large, we make routing decisions using an insertion routing heuristic. The
routing heuristic is described in Appendix A.4. This leads to at most two decisions, accept or
reject, for the new request. To evaluate a decision, we use the result of the routing heuristic (in
the case of an accept decision) to transition to a post-decision state and then solve the approximate
Bellman Equation. In this case, the approximate Bellman Equation is given by
V λ(Sx) = (1− λ)× (θbι × b+ θκι × κ+ θaι ) + λ× VDLT(t, b, κ), (9)
where b, κ, t, and ι are the features and time interval associated with the post-decision state Sx.
We note that, if no LT-entry for the given post-decision state exists at the time of the decision, we set $V(S_k^x) = V^p(S_k^x)$.
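A sketch of this combined value estimate, including the fallback to the parametric value when no LT-entry exists. Here λ weights the parametric component, consistent with the extreme cases discussed below (λ = 1 ↔ P-VFA, λ = 0 ↔ N-VFA); all names are illustrative.

```python
def m_vfa_value(lam, theta, lt_value, b, kappa):
    """Combined M-VFA estimate.

    theta    -- (theta_b, theta_kappa, theta_a) for the state's time interval
    lt_value -- V_DLT(t, b, kappa), or None if the cell is still unvisited
    """
    v_p = theta[0] * b + theta[1] * kappa + theta[2]
    if lt_value is None:
        return v_p  # unobserved state: fall back to the parametric value
    return lam * v_p + (1.0 - lam) * lt_value
```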
At the end of a given trial, we update the M-VFA(N) and M-VFA(P) values based on the
observations. For the DLT, this additionally means that the partitions are updated. For M-VFA(P),
we use the last ν = 100 observations for each time interval, and the new parameters Θι of function
fVι of M-VFA(P) are determined by means of multiple linear regression.
Having determined an M-VFA for each λ = 0, 0.1, 0.2, . . . , 1, we then determine the best
setting for λ for each instance setting. To do this, we compare the M-VFA for each λ across an
additional 10,000 trials and choose the λ whose M-VFA leads to the best performance for the given
instance setting.
5.5 Benchmark Policies
In this section, we present benchmark policies. We are interested in the performance of the M-VFA
procedure and the general performance of VFA for the CAPSR. To this end, we compare M-VFA
with VFA-methods from the literature and with an online rollout algorithm.
To show the advantages of combining N- and P-VFA, we first compare our approach with
conventional N- and P-VFA. To do so, we approximate both VFAs individually. The individual
P-VFA can be seen as our M-VFA with λ = 1. The N-VFA is similar to the M-VFA with λ = 0,
but differs in cases in which we observe a new post-decision state and hence a potentially empty
LT-entry. If an empty partition is observed during the tuning phase, the N-VFA selects the unvisited
partition to force exploration.
To examine the impact of the simultaneous tuning of the N- and P-VFA, we also compare
the M-VFA to an ex-post combination of the just described individual N- and P-VFA. We call
these policies E-VFA for ex-post combination. To create E-VFA, we first tune N- and P-VFA
individually using 1 million simulation runs. Then, to use these individually derived N- and P-
VFAs in Equation (9), we must find a value of λex. To do so, we run 10,000 trials for each
λ = 0, 0.1, 0.2, . . . , 1 and choose the best λex for each instance setting. We emphasize that the
selection of λex is different from the selection of λ described in the previous section. In the previous
section, the N- and P-VFA are tuned simultaneously. Here, we combine the two VFAs after having
tuned them individually. In contrast to the use of λ in the tuning of the M-VFA, λex is not used in
the tuning phase, but only in the execution phase of E-VFA. Again, the M-VFA and E-VFA policies
are identical for λ = λex = 1 and similar for λ = λex = 0.
We also compare the M-VFA to an online rollout algorithm policy RA. Our benchmark RA is
motivated by Campbell and Savelsbergh (2005) in which the expected reward-to-go for a particular
state is estimated using the number of future feasible customers, their request probability, and their
revenues. Because for the CAPSR the number of potential customers is vast and their revenues
are unknown, we sample a set of requests. We also extend the method proposed in Campbell and
Savelsbergh (2005) by adding a time dimension. This means that we do not assume that customers request all at once, but rather that they request individually over the time horizon. Ulmer et al. (to appear) show that
the addition of a time-dimension leads to a better approximation and a better rollout policy for
dynamic routing problems.
To evaluate the second term of the Bellman Equation in state Sk using the RA, for each decision
x ∈ X(Sk), the RA samples a set of m realizations starting in post-decision state Sxk and ending
in SK . Within the sampled realization, the RA draws on a myopic base-policy that accepts every
feasible request. The average over the realized revenues for each simulated realization is then the
estimate of the reward-to-go for Sxk .
One drawback of rollout algorithms is that they are online. That is, unlike the proposed M-VFA,
rollout algorithms perform their computation at the time of execution. Given that the time available
for computation in real time is highly limited, we limit the number of simulated realizations. Based
on preliminary tests, we set m = 16.
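The rollout estimate can be sketched as the average revenue of m sampled realizations completed by the myopic base policy that accepts every feasible request. The request sampler below is a stand-in for the CAPSR request process (ignoring routing and using only the capacity constraint); the distributions follow the instance description, but all other details are illustrative.

```python
import random

def rollout_value(remaining_capacity, m=16, seed=0):
    """Average revenue over m sampled realizations under the myopic policy."""
    rng = random.Random(seed)
    totals = []
    for _ in range(m):
        cap, revenue = remaining_capacity, 0.0
        for _ in range(20):              # sampled future requests
            demand = rng.randint(1, 10)  # kappa ~ U[1, 10]
            price = rng.randint(1, 10)   # P ~ U[1, 10]
            if demand <= cap:            # myopic base policy: accept if feasible
                cap -= demand
                revenue += price
        totals.append(revenue)
    return sum(totals) / m
```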
6 Computational Evaluation
In this section, we analyze the approaches for a variety of instances defined in §6.1. We present the
results in Section 6.2. We analyze the results for the different VFAs in Section 6.3 and the M-VFA
approximation process in Section 6.4. Finally, we compare the performance of N-VFA and P-VFA
with respect to varying resource shortages.
6.1 Instances
The customer locations are provided by Ulmer and Thomas (2016) and based on Iowa City census
data. The depot is located in the upper left corner of the service area. We calculate the distances
d(·, ·) using the Haversine distance measure (Shumaker and Sinnott, 1984). This distance measure is the equivalent of Euclidean distance on the surface of a globe. We multiply each resulting distance by
1.4 to account for the impact of traveling on a road network (Boscoe et al., 2012). We set both the capture phase and the delivery phase to tcmax = tdmax = 480 minutes, which is equivalent to assuming that orders are placed the day before deliveries take place. To ensure a surplus in orders, we set the expected number of orders to 50 and the service time to ζ = 10. This means that it is generally not possible to service every order and a selection is necessary. The orders are
generated over time by means of a minute-by-minute Poisson process. The revenue per customer
P ∈ U [1, 10] is discretely uniformly distributed. This can be seen as an extension of Campbell
and Savelsbergh (2005). Following Bertsimas and Demir (2002), we set the discrete capacity
distribution to κ ∈ U [1, 10]. With these parameters, we generate 10,000 trials to compare the
M-VFA to each of the benchmarks.
To analyze the impact of resource shortages, we vary the vehicle's capacity and the vehicle's travel speed. We define varying speeds of v = 20 km/h, 25 km/h, and 30 km/h. All travel durations are rounded to the minute. These speeds reflect heavy, moderate, and light traffic conditions. We further define different maximal capacities κmax = 100, 120, 140, 160. Capacity
κmax = 100 represents a transport van allowing only for service of around 18 customer orders
while κmax = 160 represents a truck that can serve about 30 customer orders. The variation of
capacities and speeds results in 12 different instance settings. We note that our instance settings
result in up to 481 × 481 × 161 = 37,249,121 different feature combinations for the instance
settings with a capacity of 160. Given the use of continuous time and service areas, the state space is infinite.

Figure 2: Improvement Compared to RA-Policy
6.2 Solution Quality
In this section, we analyze the performance of the VFAs. First, we compare the improvement of
the VFA policies compared to the RA-policy. For a detailed presentation of the individual results,
see Table A1 in Appendix A.1. LetQ(π, i) be the average revenue for policy π and instance setting
i. We then define the improvement of policy π to the RA-policy by calculating
$$\frac{Q(\pi, i) - Q(\text{RA}, i)}{Q(\text{RA}, i)} \times 100\%.$$
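For completeness, the improvement metric as computed for the comparisons in this section:

```python
def improvement(q_policy, q_ra):
    """Percentage improvement of a policy's average revenue over the RA-policy,
    i.e. (Q(pi, i) - Q(RA, i)) / Q(RA, i) * 100."""
    return (q_policy - q_ra) / q_ra * 100.0
```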
Figure 2 presents the average improvement over all instance settings. On the x-axis, the policy
is depicted. On the y-axis, the improvement compared to the RA-policy is shown.
All VFA-approaches significantly outperform the RA-policy. The M-VFA provides the greatest
improvement with 25.5% with the N-VFA, P-VFA, and E-VFA returning improvements of 18.7%,
20.6%, and 21.1%, respectively. Interestingly, the RA has an advantage over the VFAs in that
it evaluates states using all the detail of the state and not just features extracted from the states.
However, the RA uses a myopic policy in its estimation of the reward-to-go. For the CAPSR at least, the results indicate that the value of the full information of the state does not overcome the poor quality of the myopic heuristic policy.

Figure 3: Improvement of M-VFA Compared to N-VFA, P-VFA, and E-VFA
Figure 3 presents the average improvement of the M-VFA compared to the other VFAs. The
percentage differences in Figure 3 are computed similarly to those for Figure 2. On average, M-
VFA outperforms N-VFA by 5.4%, P-VFA by 3.9%, and E-VFA by 3.5%. Further, as shown in
Table A1 in the Appendix A.1, M-VFA outperforms all other policies not only on average, but also
for each of the individual instance settings. Overall, these results show that combining the N- and P-VFA, whether during the approximation or ex-post, has value relative to either the N- or P-VFA individually. However, significant further improvement is possible if the two are combined during the approximation phase, which is the difference between the M-VFA method and the E-VFA.
Also of interest is the fact that, on average, P-VFA provides better solution quality than N-
VFA. This result supports the conclusion of Bertsimas and Demir (2002) that high-dimensional
state spaces result in unreliable approximation by N-VFA and that P-VFA is superior in such
circumstances. Yet, as we show in Section 6.5, the performance of N- and P-VFA strongly depends
on the instance settings. In particular, the time budget and capacity have an important role.
Figure 4: Comparison of M-VFA and E-VFA per λ
6.3 The Value of Simultaneous Approximation: M-VFA vs. E-VFA
In this section, we analyze the impact of simultaneous approximation by M-VFA. To this end,
we compare M-VFA with E-VFA to show why the simultaneous approximation of LT-values and
feature coefficients is advantageous. Policy class E-VFA first approximates N-VFA and P-VFA
individually and later combines the two components in the determination of the values. As shown
in Section 6.2, E-VFA provides inferior results on average. In the following, we first show the
average improvement compared to the RA for both policy classes for varying λ and λex. We then
analyze the structure of the approximated value function in one example.
Figure 4 shows the average improvement of M- and E-VFA compared to the RA for varying
λ. The x-axis depicts the parameter λ = λex = 0, 0.1, . . . , 1. The improvement is shown on the
y-axis.
First, we analyze the extreme cases λ = λex = 0 and λ = λex = 1. Because they both result
in pure P-VFA, both policies M- and E-VFA perform similarly for λ = λex = 1. There is a gap
between M- and E-VFA with λ = λex = 0. This result is surprising because, with λ = λex = 0,
both only draw on the N-VFA and M-VFA(N) values. Yet, due to our rule based on the observa-
tions, M-VFA with λ = 0 can access the M-VFA(P) component in the evaluation of unobserved
states. Therefore, the initial approximation is more reliable and the overall approximation quality
is higher. This finding indicates that it may be generally beneficial to initiate an N-VFA with the
values of a P-VFA.
We now analyze the development for 0.0 ≤ λ = λex ≤ 1.0. For E-VFA, we see a slight, but
constant increase in solution quality up to λ = 0.9. In essence, instead of utilizing the advantages of
N- and P-VFA, the ex-post combination just provides the convex combination of the two weighted
by λex. For M-VFA, we observe a peak at an intermediate λ. The best results are generally
achieved by 0.4 ≤ λ ≤ 0.5 (for details on the best λ per instance setting, we refer to Table A2 in Appendix A.1). These results suggest that the M-VFA is doing more than returning a weighted value of the M-VFA(N) and M-VFA(P). Rather, the simultaneous approximation leads to an approximation that is greater than the sum of its parts.
To better understand the simultaneous approximation, we present the example of the instance setting v = 25 km/h and κmax = 120. For this instance setting, the best M-VFA values are approximated with λ = 0.4. Because the approximated value functions are three-dimensional, we fix the
point of time and time budget parameters to t = 240 and b = 120, respectively. Figure 5 shows the
values by capacity. The capacity is depicted on the x-axis. States with capacities higher than 70
are usually not observed for this instance setting when t = 240 and b = 120. The y-axis depicts
the approximated values. The gray lines show the values for the individual approximation. The
black lines indicate the values for M-VFA(N) and M-VFA(P). The dashed lines represent P-VFA
or M-VFA(P) while the solid lines represent N-VFA or M-VFA(N).
The parametric VFAs have higher values than the non-parametric VFAs. This result can be
explained by the fact that the parametric coefficients are determined for all potential values of b.
For this particular b, they are therefore slightly higher. However, this phenomenon does not say
anything about the quality of approximation. The plateaus of the non-parametric approximations
are the result of the dynamic LT-partitioning.
The most notable feature shown in Figure 5 is the difference between the N- and P-VFA versus
the difference between M-VFA(N) and M-VFA(P). The N-VFA and P-VFA values show a signifi-
cant difference. The difference between the M-VFA components is less distinct. This result occurs
because the two components of the M-VFA, M-VFA(N) and M-VFA(P), are tuned simultaneously,
and thus reinforce one another.
Figure 5: Approximate Value Functions for Time t = 240, Free Budget b = 120

A result of the simultaneous tuning can be seen in the non-monotonic nature of the N-VFA, whose value drops around κ = 30, even though increasing capacity should yield increasing values, as is the case for M-VFA(N). The decrease results from the fact that the N-VFA's first observation of the entry at κ = 30 produced a poor approximation; the AVI for the N-VFA subsequently avoided states represented by this entry. In the case of M-VFA(N), observations leading to such results are overcome by the M-VFA(P), as demonstrated in the figure. This shows how the combination of M-VFA(N) and M-VFA(P) leads to an approximation that is both reliable and accurate.
6.4 Approximation Process
In the following, we analyze how the combination impacts the approximation process. Figure 6
depicts the approximation process for the example used in the previous section. Each line in
the spiderweb represents the parameter λ starting on top with λ = 0.0 and moving clockwise
λ increases by 0.1 up to λ = 1.0. The values represent the improvement of M-VFA compared
to the RA, starting at −10% in the center of the spiderweb. We show four different steps of the approximation process, after 1k, 10k, 100k, and 1 million approximation runs, which are indicated by the dotted line, the two dashed lines, and the solid line, respectively.
Figure 6: Approximation Process of M-VFA: Improvement per λ Compared to the Rollout Algorithm

We first analyze the solution quality after 1k approximation runs. Starting with λ = 0.0, we observe a constant increase in solution quality with increasing λ. The best results are achieved by λ = 1.0. This means that with only a few approximation runs, a focus on reliable approximation as given by P-VFA provides better results than a focus on accurate approximation as given by N-VFA. With an increase in approximation runs, the best tuning shifts from λ = 1.0 for 1k to λ = 0.9 for 10k runs and to λ = 0.7 for 100k runs before eventually reaching λ = 0.4 for 1
million approximation runs. Importantly, the improvement for λ = 1.0 stagnates after the early
approximation phase and the improvement for λ = 0.0 after 1 million approximation runs is still
low at 21.8%. Thus, except in the case of P-VFA for a very low number of trials, neither the
N- nor the P-VFA is capable of providing the best results. Rather, an explicit combination of non-
parametric and parametric VFAs with 0 < λ < 1 significantly strengthens the entire approximation
process.
Figure 7: Non-Parametric vs. Parametric VFA
6.5 Resource Shortages: Routing vs. Knapsack Problem
Finally, we analyze the results with respect to each instance's resource shortages. We show that the performance of N- and P-VFA depends on the relative importance of routing (time) and capacity in the instance. Particularly, we show that the P-VFA is superior when we observe a shortage in the vehicle's capacity and thus the CAPSR is closer to a dynamic knapsack problem. If routing decisions become more important, N-VFA outperforms P-VFA.
To demonstrate this behavior, we analyze the results with respect to different vehicle capacities
as shown in Figure 7. The x-axis shows the capacity. For the M-VFA policy, the consumption of
the time budget and capacity averaged over the three speeds is indicated by the dashed lines and
the left y-axis. For the time budget, we observe a constant increase with respect to the vehicle’s
capacity. That means that a low capacity of 100 may lead to fewer acceptances, shorter routes, and
less consumption of the available time. Thus, when capacity is low, the number of orders is limited
by the capacity, and routing is less important. If the available capacity increases, more orders can
be served and the routing dimension gains in importance.
We now analyze how the different resource shortages affect the performance of N- and P-VFA.
We depict the improvement of N-VFA compared with the P-VFA by the solid line and the right
y-axis in Figure 7. For a low capacity, P-VFA provides a better solution quality than N-VFA.
This can be explained by the linear approximation providing good results for the dynamic knap-
sack problem with independent capacity consumptions of the items. With increasing capacity, the
routing becomes more important and N-VFA outperforms P-VFA. This confirms the observation
by Ulmer et al. (2017) that, for dynamic routing problems, the structure of the value function is
complex. Hence, a functional (in our case linear) approximation may not be able to capture this
complexity.
7 Conclusion
In this research, we have presented a new ADP-method that combines the advantages of non-
parametric and parametric VFAs. Further, as tuning can be done offline, the M-VFA allows im-
mediate responses to real-time requests. Using the CAPSR as a testbed, we demonstrate that
the proposed method provides excellent solution quality. Importantly, our results demonstrate the
value of simultaneously tuning the two components of the M-VFA.
Future research may focus on both extensions of the M-VFA and the CAPSR. The M-VFA is
a novel and general ADP-method. Hence, the M-VFA may be applicable to a variety of dynamic
and stochastic decision problems. Further, it may be interesting to analyze the performance of
M-VFA analytically for different artificial value-function structures. Finally, the procedure of M-
VFA may be further improved. Our computational study indicates that VFAs benefit from an
initial, reliable parametric approximation followed by a detailed N-VFA approximation. To this
end, more sophisticated rules based on the observations may dynamically adapt the combination-
parameter λ with respect to the approximation process and even determine state-dependent values
of λ. For example, it might be possible to base the variation of λ on the observations in the LT-entries.
For the CAPSR, fleets of vehicles may be considered as well as additional constraints like time
windows. Because it leads to an increase in dimensions of state and action space, the consider-
ation of fleets is challenging. These problems may be approached with features differing for the
parametric and the non-parametric VFA component. Particularly, because its dimensionality is
unaffected by an increase in the number of features, the parametric component of the M-VFA may
capture additional features. Time windows may require the determination of additional features
and/or change the structure of the value function. In this case, the non-parametric component may
28
provide significant benefit. Finally, the presented results for the CAPSR may be used to develop
anticipatory pricing algorithms.
Acknowledgment
The authors thank Warren Powell and Ulf Jesper for their valuable advice.
References
Bertsekas, Dimitri P, John N Tsitsiklis. 1996. Neuro-dynamic Programming. Athena Scientific, Belmont, MA.
Bertsimas, Dimitris, Ramazan Demir. 2002. An approximate dynamic programming approach to
multidimensional knapsack problems. Management Science 48(4) 550–565.
Boscoe, Francis P., Kevin A. Henry, Michael S. Zdeb. 2012. A Nationwide Comparison of Driving Distance Versus Straight-Line Distance to Hospitals.
Browne, Cameron B, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling,
Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, Simon Colton.
2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational
Intelligence and AI in Games 4(1) 1–43.
Cai, Yongyang, Kenneth L Judd. 2014. Advances in numerical dynamic programming and new
applications. Karl Schmedders, Kenneth L Judd, eds., Computational Economics, Handbooks
of computational economics, vol. 3. North-Holland, Amsterdam, 479–516.
Cai, Yongyang, Kenneth L Judd, Thomas S Lontzek, Valentina Michelangeli, Che-Lin Su. 2017.
A nonlinear programming method for dynamic programming. Macroeconomic Dynamics 21(2)
336–361.
Campbell, Ann M, Martin Savelsbergh. 2005. Decision support for consumer direct grocery ini-
tiatives. Transportation Science 39(3) 313–327.
Chen, Xi, Mike Hewitt, Barrett W. Thomas. 2017. Approximate dynamic programming for the
multi-period technician scheduling with experience-based service times and stochastic cus-
tomers. Submitted for publication.
Ehmke, Jan F, Ann M Campbell. 2014. Customer acceptance mechanisms for home deliveries in
metropolitan areas. European Journal of Operational Research 233(1) 193–207.
Esser, Klaus, Judith Kurte. 2015. KEP 2015: Marktanalyse. Bundesverband Paket und Expresslogistik e.V.
Fang, Jiarui, Lei Zhao, Jan C Fransoo, Tom Van Woensel. 2013. Sourcing strategies in supply
risk management: An approximate dynamic programming approach. Computers & Operations
Research 40(5) 1371–1382.
Geist, Matthieu, Olivier Pietquin. 2013. Algorithmic survey of parametric value function approxi-
mation. IEEE Transactions on Neural Networks and Learning Systems 24(6) 845–867.
George, Abraham, Warren B Powell, Sanjeev R Kulkarni, Sridhar Mahadevan. 2008. Value func-
tion approximation using multiple aggregation for multiattribute resource management. Journal
of Machine Learning Research 9(10) 2079–2111.
Godfrey, Gregory A, Warren B Powell. 2001. An adaptive, distribution-free algorithm for the
newsvendor problem with censored demands, with applications to inventory and distribution.
Management Science 47(8) 1101–1112.
Godfrey, Gregory A, Warren B Powell. 2002a. An adaptive dynamic programming algorithm for
dynamic fleet management, I: Single period travel times. Transportation Science 36(1) 21–39.
Godfrey, Gregory A, Warren B Powell. 2002b. An adaptive dynamic programming algorithm for
dynamic fleet management, II: Multiperiod travel times. Transportation Science 36(1) 40–54.
Goodson, Justin C., Jeffrey W. Ohlmann, Barrett W. Thomas. 2013. Rollout policies for dynamic
solutions to the multivehicle routing problem with stochastic demand and duration limits. Op-
erations Research 61(1) 138–154.
Goodson, Justin C., Barrett W. Thomas, Jeffrey W. Ohlmann. 2016. Restocking-based rollout poli-
cies for the vehicle routing problem with stochastic demand and duration limits. Transportation
Science 50(2) 591–607.
Goodson, Justin C, Barrett W Thomas, Jeffrey W Ohlmann. 2017. A rollout algorithm frame-
work for heuristic solutions to finite-horizon stochastic dynamic programs. European Journal
of Operational Research 258(1) 216–229.
He, Miao, Lei Zhao, Warren B Powell. 2012. Approximate dynamic programming algorithms for
optimal dosage decisions in controlled ovarian hyperstimulation. European Journal of Opera-
tional Research 222(2) 328–340.
Hornik, Kurt, Maxwell Stinchcombe, Halbert White. 1989. Multilayer feedforward networks are
universal approximators. Neural networks 2(5) 359–366.
Jabali, Ola, Roel Leus, Tom Van Woensel, Ton de Kok. 2013. Self-imposed time windows in
vehicle routing problems. OR Spectrum 37(2) 331–352.
Jiang, Daniel R, Warren B Powell. 2015a. An approximate dynamic programming algorithm for
monotone value functions. Operations Research 63(6) 1489–1511.
Jiang, Daniel R, Warren B Powell. 2015b. Optimal hour ahead bidding in the real time electricity
market with battery storage using approximate dynamic programming. INFORMS Journal on
Computing 27(3) 525–543.
Klapp, Mathias A, Alan L Erera, Alejandro Toriello. 2016. The one-dimensional dynamic dispatch
waves problem. Transportation Science.
Kleywegt, Anton J, Jason D Papastavrou. 1998. The dynamic and stochastic knapsack problem.
Operations Research 46(1) 17–35.
Kleywegt, Anton J, Jason D Papastavrou. 2001. The dynamic and stochastic knapsack problem
with random sized items. Operations Research 49(1) 26–41.
LeCun, Yann, Yoshua Bengio, Geoffrey Hinton. 2015. Deep learning. Nature 521(7553) 436–444.
Li, H., N. Womer. 2015. Solving stochastic resource-constrained project scheduling problems by
closed-loop approximate dynamic programming. European Journal of Operational Research
246 20–33.
Liu, Derong, Qinglai Wei, Ding Wang, Xiong Yang, Hongliang Li. 2017. Adaptive Dynamic
Programming with Applications in Optimal Control. Advances in Industrial Control, Springer,
Cham, Switzerland.
Maxwell, Matthew S, Mateo Restrepo, Shane G Henderson, Huseyin Topaloglu. 2010. Approx-
imate dynamic programming for ambulance redeployment. INFORMS Journal on Computing
22(2) 266–281.
Meisel, Stephan. 2011. Anticipatory Optimization for Dynamic Decision Making, Operations
Research/Computer Science Interfaces Series, vol. 51. Springer.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. preprint
arXiv:1312.5602, arXiv.
Papadaki, Katerina P., Warren B. Powell. 2002. Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. European Journal of Operational Research 142(1) 108–127.
Papastavrou, Jason D, Srikanth Rajagopalan, Anton J Kleywegt. 1996. The dynamic and stochastic
knapsack problem with deadlines. Management Science 42(12) 1706–1718.
Powell, Warren B. 2011. Approximate Dynamic Programming: Solving the Curses of Dimension-
ality, Wiley Series in Probability and Statistics, vol. 842. John Wiley & Sons, New York.
Powell, Warren B, Stephan Meisel. 2016. Tutorial on stochastic optimization in energy—Part II:
An energy storage illustration. IEEE Transactions on Power Systems 31(2) 1468–1475.
Powell, Warren B, Hugo P Simao, Belgacem Bouzaiene-Ayari. 2012. Approximate dynamic pro-
gramming in transportation and logistics: a unified framework. EURO Journal on Transporta-
tion and Logistics 1(3) 237–284.
Ritzinger, Ulrike, Jakob Puchinger, Richard F Hartl. 2015. A survey on dynamic and stochastic
vehicle routing problems. International Journal of Production Research 1–17.
Savelsbergh, Martin, Tom Van Woensel. 2016. 50th anniversary invited article—city logistics:
Challenges and opportunities. Transportation Science 50(2) 579–590.
Schmid, Verena. 2012. Solving the dynamic ambulance relocation and dispatching problem using
approximate dynamic programming. European Journal of Operational Research 219(3) 611–
621.
Shen, Weiwei, Jun Wang. 2015. Transaction costs-aware portfolio optimization via fast Löwner-John ellipsoid approximation. Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence. AAAI Press, Menlo Park, California, 1854–1860.
Shumaker, BP, RW Sinnott. 1984. Astronomical computing: 1. computing under the open sky. 2.
virtues of the haversine. Sky and Telescope 68 158–159.
Simao, Hugo P, Jeff Day, Abraham P George, Ted Gifford, John Nienow, Warren B Powell. 2009.
An approximate dynamic programming algorithm for large-scale fleet management: A case
application. Transportation Science 43(2) 178–197.
Topaloglu, Huseyin, Warren B. Powell. 2006. Dynamic-programming approximations for stochas-
tic time-staged integer multicommodity-flow problems. INFORMS Journal on Computing 18(1)
31–42.
Ulmer, Marlin W, Justin C Goodson, Dirk C Mattfeld, Marco Hennig. to appear. Offline-online
approximate dynamic programming for dynamic vehicle routing with stochastic requests. Trans-
portation Science.
Ulmer, Marlin W, Justin C Goodson, Dirk C Mattfeld, Barrett W Thomas. 2016a. Route-based
Markov decision processes for dynamic vehicle routing problems. Submitted.
Ulmer, Marlin W, Marco Hennig. 2016. Value function approximation-based limited horizon roll-
out algorithms for dynamic multi-period routing. Submitted.
Ulmer, Marlin W, Dirk C Mattfeld, Marco Hennig, Justin C Goodson. 2015. A rollout algorithm for
vehicle routing with stochastic customer requests. Logistics Management. Springer, 217–227.
Ulmer, Marlin W., Dirk C. Mattfeld, Felix Köster. 2017. Budgeting time for dynamic vehicle
routing with stochastic customer requests. Transportation Science.
Ulmer, Marlin W, Dirk C Mattfeld, Ninja Soeffker. 2016b. Dynamic multi-period vehicle routing:
approximate value iteration based on dynamic lookup tables. Submitted.
Ulmer, Marlin W, Barrett W Thomas. 2016. Enough waiting for the cable guy - estimating arrival
times for service vehicle routing. Submitted.
Yang, Xinan, Arne K Strauss. 2016. An approximate dynamic programming approach to attended
home delivery management. Submitted.
Yang, Xinan, Arne K Strauss, Christine SM Currie, Richard Eglese. 2014. Choice-based demand
management and vehicle routing in e-fulfillment. Transportation Science 50(2) 473–488.
Appendix
A.1 Results
In this appendix, we present the results for every individual instance setting as well as the related literature for the CAPSR. Table A1 shows the average revenue of the policies for varying speed and capacity. The best tuning parameters λ for M-VFA and E-VFA per instance setting are depicted in Table A2.
A.2 Literature for the CAPSR
In this section, we present the literature related to the CAPSR. The work most closely related to
the CAPSR is that of Ehmke and Campbell (2014) and Campbell and Savelsbergh (2005). As in the CAPSR, the problems studied in these papers focus on customer acceptance decisions. However,
they differ in objective and constraints. Ehmke and Campbell (2014) determine customer accep-
tances based on the probability that the integration of the customer does not lead to time window
Table A1: Results: Revenue

Speed Capacity M-VFA  E-VFA  P-VFA  N-VFA  RA
20    100      166.41 159.61 159.06 157.95 126.93
20    120      178.44 170.49 169.02 169.25 142.92
20    140      184.90 174.67 173.77 173.93 150.05
20    160      185.48 179.94 174.86 179.25 150.13
25    100      167.98 163.40 163.40 158.97 126.64
25    120      184.04 177.80 177.80 172.93 145.44
25    140      196.17 189.36 189.36 186.32 161.17
25    160      202.47 195.42 193.98 191.66 169.76
30    100      168.32 163.93 163.63 159.86 127.37
30    120      185.52 180.44 180.44 172.72 145.04
30    140      200.14 194.08 193.96 188.80 162.07
30    160      209.61 202.59 202.50 196.33 175.75
violations. These probabilities are determined with respect to stochastic travel times, which contrasts with the CAPSR, in which acceptance decisions are determined with regard to potential new
requests.
In Campbell and Savelsbergh (2005), customers from a known set of customers dynamically
request service. Each potential customer has a request probability and time-window preferences
known at the start of the horizon. Customers can choose from a set of time slots offered by the
service provider. The objective is to maximize the expected revenue. Campbell and Savelsbergh
(2005) determine acceptance by solving the static stochastic vehicle routing problem on a rolling
horizon. They evaluate a planned tour in terms of expected revenue. The approach does not
consider the dynamic development resulting from dynamically requesting customers. Because the number of customers in the CAPSR is vast and the vehicle is capacitated, a direct transfer of the
approach in Campbell and Savelsbergh (2005) to the CAPSR is not possible. Thus, we present
an online rollout algorithm that extends the approach of Campbell and Savelsbergh (2005) by
subsequently sampling requests over a simulated request horizon. We use this rollout algorithm as
a benchmark for the M-VFA.
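The mechanics of such a rollout can be illustrated with a deliberately simplified sketch. The state is reduced to a scalar capacity and the routing component is omitted, so this is not the paper's implementation; all names are illustrative. Each candidate decision (accept or reject) is scored by averaging, over sampled future request sequences, the reward a greedy base policy collects from the resulting state:

```python
import random

def greedy_policy(capacity, item):
    """Base policy: accept a request whenever it still fits."""
    _, weight = item
    return weight <= capacity

def simulate(capacity, future_items):
    """Reward collected by the base policy over one sampled request sequence."""
    total = 0.0
    for reward, weight in future_items:
        if greedy_policy(capacity, (reward, weight)):
            total += reward
            capacity -= weight
    return total

def rollout_accept(capacity, item, sample_future, n_samples=50, rng=None):
    """Decide on `item` by comparing accept vs. reject; each alternative is
    scored by averaging base-policy rewards over sampled future sequences."""
    rng = rng or random.Random(0)
    reward, weight = item
    if weight > capacity:
        return False                      # acceptance is infeasible
    accept_val = reject_val = 0.0
    for _ in range(n_samples):
        future = sample_future(rng)       # one sampled request trajectory
        accept_val += reward + simulate(capacity - weight, future)
        reject_val += simulate(capacity, future)
    return accept_val >= reject_val
```

Replacing the scalar capacity with the full route state and sampling requests over the remaining request horizon would correspond, roughly, to the benchmark RA described above.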
Papastavrou et al. (1996), Kleywegt and Papastavrou (1998), and Kleywegt and Papastavrou
(2001) consider a dynamic knapsack problem in which items of random weight and reward arrive
over time and must be either accepted or rejected for inclusion in the knapsack. The problem is
similar to the CAPSR but without the routing component. Papastavrou et al. (1996), Kleywegt and
Table A2: Results: Best Parameter λ

Speed Capacity M-VFA E-VFA
20    100      0.4   0.9
20    120      0.4   0.8
20    140      0.4   0.5
20    160      0.3   0.2
25    100      0.6   1.0
25    120      0.5   1.0
25    140      0.4   0.9
25    160      0.4   0.9
30    100      0.4   0.9
30    120      0.6   1.0
30    140      0.5   0.9
30    160      0.5   0.9
Papastavrou (1998), and Kleywegt and Papastavrou (2001) characterize the optimal policies for a
variety of versions of the problem. Because of the CAPSR's routing component, however, these results do not apply
to the CAPSR. More recently, Goodson et al. (2017) demonstrate the effectiveness of a rollout
algorithm (RA) applied to a variant of the problem presented in Kleywegt and Papastavrou (1998).
We use this RA as one of our benchmark algorithms.
The customer acceptance decision making in the CAPSR is also related to work on dynamic
routing with stochastic requests. For an overview on dynamic routing, the interested reader is
referred to Ritzinger et al. (2015). In these dynamic routing problems, vehicles are already on
the road when new requests occur. For such a problem, Ulmer et al. (2017) present a customer acceptance policy that evaluates the remaining free time budget by means of a non-parametric VFA. The
non-parametric part of the M-VFA presented in this paper can be seen as a generalization of this
approach adapted to the needs of the CAPSR, especially, considering delivery routing and capacity
constraints. Other work has shown rollout algorithms (RAs) to be effective approaches for dynamic
routing with stochastic requests (Klapp et al., 2016; Ulmer et al., 2015, to appear). As noted
previously, in our computational study, we apply an RA as a benchmark.
Finally, customer acceptance in delivery routing is related to time-slot pricing. In these problems, customers select time windows for delivery, but the selection can be influenced by the dispatcher; in the extreme case, no time slot is offered and the customer is rejected.
Recently, Yang and Strauss (2016) present a pricing policy estimating delivery costs via parametric VFA. For a general overview of time-slot pricing, the interested reader is referred to Yang et al. (2014).

Figure A.1: State, Decision, Post-Decision State, Transition (figure graphic not reproduced)
A.3 CAPSR Example
In the following, we present an example for the MDP. Figure A.1 depicts a state, decision, post-decision state, and stochastic information for the seventh decision point, $k = 7$. For the purpose of presentation, we assume a Manhattan-style grid with a travel duration of 10 minutes for each segment. We further assume a service time of $\zeta = 10$ minutes, a time duration for both capture and delivery phase of $t^c_{\max} = t^d_{\max} = 480$ minutes, and a maximal capacity of $\kappa_{\max} = 100$. The depot is indicated by the gray square, the accepted customers by the gray circles, and the new request by the white circle. The reward and the capacity required by each customer are indicated by the adjacent white squares. As an example, the previous acceptance of Customer 2 led to a reward of $P(C_2) = 3$ and a capacity consumption of $\kappa(C_2) = 5$.

The state $S_7 = (100, \{C_2, C_3, C_4, C_5, C_7\}, (D, C_2, C_3, C_5, C_4, D))$ is depicted on the left. The current point of time is $t = 100$ minutes. This means there are still 380 minutes to receive orders before the actual delivery starts. Four customers are already accepted, $\mathcal{C}_k = \{C_2, C_3, C_4, C_5\}$. Customer $C_7$ requested service in $t = 100$. The current planned tour $\tau_k$ starts and ends at the depot and traverses Customers 2, 3, 5, and 4. The current tour duration $d(\tau_k)$ is the sum of travel and service times, $160 + 4 \times 10 = 200$ minutes. The current capacity consumed is $5 + 2 + 8 + 7 = 22$.

Decision $x$ accepts Customer $C_7$ and updates the tour $\tau_k$ accordingly. The immediate reward of decision $x$ is $R(S_7, x) = P(C_7) = 5$. The new planned tour is $\tau^x_7 = (D, C_2, C_7, C_3, C_5, C_4, D)$. This leads to the post-decision state $S^x_7 = (100, \{C_2, C_3, C_4, C_5, C_7\}, (D, C_2, C_7, C_3, C_5, C_4, D))$, depicted in the center of Figure A.1. The new tour duration is $d(\tau^x_7) = 160 + 5 \times 10 = 210$. The capacity consumed is $22 + 10 = 32$. These values are reflected in the features free time budget $b^x_7$ and free capacity $\kappa^x_7$. Generally, the free time budget $b^x_k$ with $0 \le b^x_k \le t^d_{\max}$ is defined as
$$b^x_k = t^d_{\max} - d(\tau^x_k).$$
In the example, the free time budget is $b^x_7 = 480 - 210 = 270$ minutes. This means that 270 minutes of travel and service time are free to integrate new customers. The currently consumed capacity determines the free capacity $\kappa^x_k$ as
$$\kappa^x_k = \kappa_{\max} - \sum_{C \in \mathcal{C}^x_k} \kappa(C).$$
In the example, the free capacity is $\kappa^x_7 = 100 - 32 = 68$. The next decision point $k = 8$ occurs when the next stochastic customer $C_8$ requests service. The new decision state
$$S_8 = (100, \{C_2, C_3, C_4, C_5, C_7, C_8\}, (D, C_2, C_7, C_3, C_5, C_4, D))$$
is depicted on the right side of Figure A.1.
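The feature computations of the example can be reproduced in a few lines of code. The numbers below are taken from the worked example (160 minutes of travel, five stops, capacity consumptions 5, 2, 8, 7, and 10); the helper names are illustrative, not from the paper:

```python
SERVICE_TIME = 10   # zeta: service time per customer stop (minutes)
T_D_MAX = 480       # t^d_max: delivery-phase time limit (minutes)
KAPPA_MAX = 100     # kappa_max: vehicle capacity

def tour_duration(travel_time, n_stops):
    """d(tau): travel time plus service time at every customer stop."""
    return travel_time + n_stops * SERVICE_TIME

def free_time_budget(travel_time, n_stops):
    """b^x_k = t^d_max - d(tau^x_k)."""
    return T_D_MAX - tour_duration(travel_time, n_stops)

def free_capacity(consumptions):
    """kappa^x_k: kappa_max minus the capacity consumed by accepted customers."""
    return KAPPA_MAX - sum(consumptions)

# Post-decision state of the example: 160 minutes of travel, five stops,
# and capacity consumptions 5, 2, 8, 7 (accepted) plus 10 (new request).
b_7 = free_time_budget(160, 5)             # 480 - (160 + 5 * 10) = 270
kappa_7 = free_capacity([5, 2, 8, 7, 10])  # 100 - (22 + 10) = 68
```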
A.4 Routing Heuristic
Since $\tau^x_k$ needs to be determined in real time while the customer is waiting, M-VFA and the benchmark heuristics draw on the efficient cheapest insertion routing heuristic (CI) as applied by Campbell and Savelsbergh (2005) for a problem similar to the CAPSR. At each decision point, CI maintains the current route $\tau_k$ and inserts the new request at the position leading to a minimal extension of the route. As Ulmer and Thomas (2016) show, CI provides competitive tours compared to optimal TSP solutions for the Iowa City data set while requiring significantly less calculation time.
Further, CI allows for the instant communication of approximate delivery times, a feature often
desired by customers (Jabali et al., 2013).
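As a sketch, cheapest insertion can be implemented in a few lines. The Manhattan metric below stands in for the paper's actual travel times, and the function names are illustrative:

```python
def manhattan(a, b):
    """Grid travel time between two points (stand-in for real travel times)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cheapest_insertion(tour, request, dist=manhattan):
    """Insert `request` into `tour` (a depot-to-depot sequence of locations)
    at the position causing the smallest increase in tour length."""
    best_pos, best_delta = None, float("inf")
    for i in range(len(tour) - 1):
        a, b = tour[i], tour[i + 1]
        # Extra travel caused by visiting `request` between stops a and b.
        delta = dist(a, request) + dist(request, b) - dist(a, b)
        if delta < best_delta:
            best_pos, best_delta = i + 1, delta
    return tour[:best_pos] + [request] + tour[best_pos:], best_delta
```

For instance, inserting a request at (1, 0) into the tour (0, 0) → (2, 0) → (0, 0) adds zero travel time, since the request lies on the existing path.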