Tail Analysis without Tail Information: A Worst-case Perspective

Henry Lam
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, [email protected]

Clementine Mottet
Department of Mathematics and Statistics, Boston University, Boston, MA 02215, [email protected]
Tail modeling refers to the task of selecting the best probability distributions that describe the occurrences
of extreme events. One common bottleneck in this task is that, due to their very nature, tail data are often
very limited. The conventional approach uses parametric fitting, but the validity of the choice of a parametric
model is usually hard to verify. This paper describes a reasonable alternative that does not require any
parametric assumption. The proposed approach is based on a worst-case analysis under the geometric premise
of tail convexity, a feature shared by all known parametric tail distributions. We demonstrate that the
worst-case convex tail behavior is either extremely light-tailed or extremely heavy-tailed. We also construct
low-dimensional nonlinear programs that can both distinguish between the two cases and find the worst-
case tail. Numerical results show that the proposed approach performs competitively against conventional parametric methods.
Key words: tail modeling, robust analysis, nonparametric
1. Introduction
Modeling extreme behaviors is a fundamental task in analyzing and managing risk. The earliest
applications arose in environmental contexts, as hydrologists and climatologists tried to predict
the risk of flooding and pollution based on historical data of sea levels or air pollutants (Gumbel
(2012)). In non-life or casualty insurance, insurers rely on accurate prediction of large losses to
price insurance policies (McNeil (1997), Beirlant and Teugels (1992), Embrechts et al. (1997)).
Calculating risk measures related to large losses is also a key focus of financial portfolio management
(Glasserman and Li (2005), Glasserman et al. (2007, 2008)). In engineering, measurement of system
reliability often involves modeling the tail behaviors of individual components’ failure times (Nicola
et al. (1993), Heidelberger (1995)).
Despite its importance in various disciplines, tail modeling is an intrinsically difficult task
because, by their own nature, tail data are often very limited. Consider these two examples:
Example 1 (Adapted from McNeil (1997)). There were 2,156 Danish fire losses over one
million Danish Krone (DKK) from 1980 to 1990. The empirical cumulative distribution function
(ECDF) and the histogram (in log scale) are plotted in Figure 1. For a concrete use of the data,
an insurance company might be interested in pricing a high-excess contract with reinsurance, which
has a payoff of X − 50 (in million DKK) when 50 < X ≤ 200, 150 when X > 200, and 0 when
X ≤ 50, where X is the loss amount (the marks 50 and 200 are labeled with vertical lines in Figure
1). Pricing this contract would require, among other information, E[payoff]. However, only seven
data points are above 50 (the loss amount above which the payoff is non-zero).
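To make the target concrete, here is a minimal Python sketch (ours; the function name and the loss array are illustrative and not from McNeil's study) of the contract payoff:

```python
import numpy as np

def payoff(x):
    """Payoff of the high-excess reinsurance contract, in million DKK."""
    # 0 below 50, linear between 50 and 200, capped at 150 above 200
    return np.clip(x - 50.0, 0.0, 150.0)

# With the losses stored in `losses` (in million DKK, assumed available),
# the naive estimate of E[payoff] is the sample mean -- driven here by
# only seven observations:
# np.mean(payoff(losses))
```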
Figure 1: ECDF and histogram for Danish fire losses from 1980 to 1990
Example 2. A more extreme situation is a synthetic data set of size 200 generated from an
unknown distribution, whose histogram is shown in Figure 2. Suppose the quantity of interest is
P(4 < X < 5). This appears to be an ill-posed problem, since the interval [4, 5] contains no data at all. This situation is not uncommon in applications where one tries to extrapolate the tail from a small sample.
Figure 2 Histogram of a synthetic data set with sample size 200
The primary purpose of this paper is to construct a theoretically justified methodology to esti-
mate quantities of interest such as those depicted in the examples above. This requires drawing
information properly from data not in the tail. We will illustrate how to do this and revisit the
two examples later with numerical performance of our method.
Another important objective of this paper is to identify the qualitative choice of tail that one
should adopt under limited data. To motivate this, mathematical studies of tail behavior in com-
plex systems, notably large deviations analysis, often rely crucially on the types of tail in the
underlying individual components. One prominent distinction is light versus heavy tails. It is well known that, to exhibit large deviations behavior, light-tailed components each contribute a small amount (through exponential twisting of the probability measure; e.g., Bucklew
(2004), Dembo and Zeitouni (2009)), whereas heavy-tailed systems do so by making the so-called
“big jumps” (i.e., only one or two components exhibit large values while all others remain in the
usual domain; e.g., Denisov et al. (2008)). In the examples above, with little or no data in the tail,
should one fit a light-tailed or a heavy-tailed distribution? This paper aims to give a meaningful answer.
2. Our Approach and Main Contributions
We adopt a nonparametric approach. Rather than fitting a parametric curve to the tail, where there may be few or no observations, we base our analysis on one geometric premise: that
the tail density is convex. We emphasize that this condition is satisfied by all known parametric
distributions (e.g. normal, lognormal, exponential, gamma, Weibull, Pareto etc.). For this reason
we believe it is a natural and minimal assumption to make.
In any given problem, there can be infinitely many feasible candidate convex
tails. The central idea of our method is a worst-case characterization. Formally, given information
on the non-tail part of the distribution and a target quantity of interest (e.g., P (4 < X < 5) in
Example 2), we aim to find a convex tail, consistent with the non-tail part, that gives rise to the
worst-case value of the target (e.g., the largest possible value of P (4<X < 5)). This value serves
as the best bound for the target that is robust with respect to the uncertainty on the tail, without
any knowledge other than our a priori assumption of convexity.
Our proposed approach requires solving an optimization over a potentially infinite-dimensional
space of convex tails. As our key contributions, we show that this problem has a very simple
optimality structure, and find its solution via low-dimensional nonlinear programs. In particular:
1. We both qualitatively and quantitatively characterize the worst-case tail behavior under the
tail convexity condition. We show that the worst-case tail, for any bounded target quantity of
interest, is either extremely light-tailed or extremely heavy-tailed in a well-defined sense. Both cases
can be characterized by piecewise linear densities, the distinction being whether the pieces form a
bounded support distribution or lead to probability masses that escape to infinity.
2. We provide efficient algorithms to distinguish between the two cases above, and to solve for
the optimal distribution in each case. For a large class of objectives, the classification only requires a
one-dimensional line search, and solving for the optimal distribution requires at most an additional
two-dimensional nonlinear program.
We further illustrate how to integrate the above results with data to obtain statistically valid
worst-case bounds. This requires suitable relaxation of the proposed worst-case optimizations to
incorporate interval estimates of certain key parameters of the distribution in the non-tail region.
Our framework gets around the difficulty faced by conventional parametric methods (discussed in detail in the next section) in directly estimating the tail curve: the worst-case analysis shifts the estimation burden entirely to the central part of the density curve, where more data are available. However, we pay a price in conservativeness: our method can generate a worst-case bound that is over-pessimistic. We therefore believe it is most suitable for small sample sizes, where some conservativeness is an unavoidable price for statistical validity.
The remainder of this paper is organized as follows. Section 3 discusses some previous techniques
and reviews the relevant literature. Section 4 presents our formulation and results for an abstract
setting. Section 5 describes some elementary numerical examples. Section 6 gives our main math-
ematical arguments. Section 7 explains how to integrate our formulation with data, and Section 8
concludes and discusses some future work. All proofs are left to the Appendix.
3. Related Work
3.1. Overview of Common Tail-fitting Techniques
As far as we know, all existing techniques for handling extreme values are parametric-based, in
the sense that a “best” parametric curve is chosen and the parameters are fit to the tail data.
The classic text of Hogg and Klugman (2009) provides a comprehensive discussion on the common
choices of parametric tail densities. While exploratory data analysis, such as quantile plots and
mean excess plots, can provide guidance regarding the class of parametric curves to use (such as
heavy, middle or light tail), this approach is limited by its reliance on a large amount of tail data and by the subjectivity of the choice of parametric curve.
Beyond the goodness-of-fit approach, there are two widely used results on the parametric choice
that is provably suitable for extreme values. The Fisher-Tippett-Gnedenko Theorem (Fisher and
Tippett (1928), Gnedenko (1943)) states that sample maxima, after suitable scaling, must converge to a generalized extreme value (GEV) distribution, provided they converge at all to some non-degenerate distribution. This result is useful if the data are known to derive from the maximum
of some distributions. For instance, environmental data on sea level and river heights are often
collected as annual maxima (Davison and Smith (1990)), and in this scenario it is sensible to fit the
GEV distribution. In other scenarios, the data have to be pre-divided into blocks and blockwise
maxima have to be taken in order to apply GEV, but this blockwise approach is statistically
wasteful (Embrechts et al. (2005)).
The Pickands-Balkema-de Haan Theorem (Pickands III (1975), Balkema and De Haan (1974)) does not require the data to come from maxima. Rather, the theorem states that the
excess losses over thresholds converge to a generalized Pareto distribution (GPD) as the thresh-
olds approach infinity, under the same conditions as the Fisher-Tippett-Gnedenko Theorem. The
Pickands-Balkema-de Haan theorem provides a solid mathematical justification for using GPD to
fit the tail portion of data (McNeil (1997), Embrechts et al. (2005)). Fitting GPD can be done by
well-studied procedures such as maximum likelihood estimation (Smith (1985)), and the method of
probability-weighted moments (Hosking and Wallis (1987)). The Hill estimator (Hill et al. (1975),
Davis and Resnick (1984)) is also a widely used alternative.
Despite the attraction and frequent usage, fitting GPD suffers from two pitfalls: First, there is
no convergence rate result that tells how high a threshold should be for the GPD approximation to
be valid (e.g. McNeil (1997)). Hence, picking the threshold is an ad hoc task in practice. Second,
and more importantly, even if the threshold chosen is sufficiently high for the theorem to hold, a
large amount of data above it is needed to accurately estimate the parameters in GPD. In our two
examples, especially Example 2, this is plainly impossible.
3.2. Related Literature on our Methodology
Our mathematical formulation and techniques are related to two lines of literature. The use of
convexity and other shape constraints (such as log-concavity) has appeared in density estimation
(Cule et al. (2010), Seregin and Wellner (2010), Koenker and Mizera (2010)) and convex regression
(Seijo et al. (2011), Hannah and Dunson (2013), Lim and Glynn (2012)) in statistics. A major
reason for using convexity in these statistical problems is the removal of tuning parameters, such
as bandwidth, as otherwise required. Besides the difference in motivation, these problems often
involve optimizations that are finite-dimensional (in the data size), and so are different from our
infinite-dimensional formulations.
The second line of related literature is optimization over probability distributions, which has
appeared in decision analysis (Smith (1995), Bertsimas and Popescu (2005), Popescu (2005)),
robust control theory (Iyengar (2005), El Ghaoui and Nilim (2005), Petersen et al. (2000), Hansen
and Sargent (2008)), distributionally robust optimization (Delage and Ye (2010), Goh and Sim
(2010)), and stochastic programming (Birge and Wets (1987), Birge and Dula (1991)). The typical
formulation involves optimization of some objective governed by a probability distribution that
is partially specified via constraints like moments (Karr (1983), Winkler (1988)) and statistical
distances (Ben-Tal et al. (2013)). Our formulation differs from these studies because it pertains
more to tail modeling (i.e., knowledge of certain regions of the density, but none beyond it). Among
all the previous works, only Popescu (2005) has considered a convex density assumption, as an
instance of a proposed class of geometric conditions that are added to moment constraints. While
the result bears similarity to ours in that a piecewise linearity structure shows up in the solution,
our qualitative classification of the tail, the solution techniques, and the data-driven relaxation all
differ from the semidefinite programming approach in Popescu (2005).
4. Abstract Formulation and Results
We begin by considering an abstract formulation assuming full information on the distribution up
to some threshold, and no information beyond. The next sub-sections give the details.
4.1. Formulation and Notation
Consider a continuous probability distribution on R whose density exists and is denoted by f(x). We assume that f is known up to a certain large threshold, say a ∈ R. The goal is to extrapolate f.
We impose the assumption that f(x), for x ≥ a, is convex. Figure 3 shows an example of an f(x) known up to a, and Figures 4 and 5 show examples of convex and non-convex extrapolations, respectively.
Observe that the convex tail assumption excludes any “surprising” bumps (and falls) in the density
curve.
Figure 3: A probability density f(x) known up to a threshold a
Figure 4: An example of convex tail extrapolation
Figure 5: An example of non-convex tail extrapolation
Figure 6: The parameters η, ν, β
Now suppose we are given a target objective or performance measure E[h(X)], where E[·] denotes
the expectation under f, and h : R → R is a bounded function. The goal is to calculate the worst-case value of E[h(X)] under the assumption that f is convex beyond a. That is, we want to obtain max E[h(X)] = max ∫_{-∞}^{∞} h(x)f(x) dx, where the maximization is over all f(x) that are convex for x ≥ a and satisfy the properties of a probability density function. For this formulation, we need three constants extracted from f(x), x < a, which we denote as η, ν, β > 0 respectively:
1. η is the value of the density f at a, i.e. f(a) = η.
2. −ν is the left derivative of f at a. Since f is convex beyond a, it is differentiable with non-decreasing derivative almost everywhere (a.e.) in that range. The right derivative at a, f'_+(a), must be at least as large as the left derivative f'_-(a), which is set equal to −ν.
3. β is the tail probability at a. Since f is known up to a, ∫_{-∞}^{a} f(x) dx is known to be equal to some number 1 − β, and ∫_a^∞ f(x) dx must equal β.
Figure 6 illustrates these quantities. Our formulation can be written as
\[
\begin{aligned}
\max \quad & \int_a^\infty h(x) f(x)\,dx \\
\text{subject to} \quad & \int_a^\infty f(x)\,dx = \beta \\
& f(a) = \eta \\
& f'_+(a) \ge -\nu \\
& f \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{aligned}
\tag{1}
\]
Note that we have set our objective to be E[h(X);X ≥ a], since E[h(X);X < a] is completely
known in this setting.
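To illustrate how these constants arise, the following sketch (ours) computes η, ν and β numerically for an assumed standard exponential density known up to a threshold a; in this special case all three equal e^{-a}:

```python
import numpy as np
from scipy.integrate import quad

a = 1.0                            # assumed threshold up to which f is known
f = lambda x: np.exp(-x)           # assumed known density: standard exponential

eta = f(a)                         # eta = f(a)
nu = -(f(a + 1e-6) - f(a)) / 1e-6  # nu = -f'(a), by a finite difference
beta, _ = quad(f, a, np.inf)       # beta = P(X > a)
# For Exp(1), eta = nu = beta = exp(-a); these are the inputs to (1)
```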
4.2. Optimality Characterization
The solution structure of (1) turns out to be extremely simple and is characterized by one of two closely related cases (focusing on the region x ≥ a), as presented in the following theorem:
Theorem 1. Suppose h is bounded. Consider the optimization (1). If it is feasible, then either
1. An optimal probability density exists. In this case, there is an optimal density that is con-
tinuous, piecewise linear and has bounded support for x≥ a. Moreover, it has three line segments,
the first one continuing from a with slope −ν, and the last linking to the horizontal axis (with the
possibility that one or both of the first two segments are degenerate, i.e., zero length).
2. An optimal probability density does not exist. In this case, there is a sequence of feasible
probability densities whose objective values converge to the optimal value of (1). Each density in
this sequence is continuous, piecewise linear and has bounded support for x≥ a. It has three line
segments, the first one continuing from a with slope −ν, and the last linking to the horizontal axis
(with the possibility that the first segment is degenerate). As the sequence progresses, the last segment gets both closer and more parallel to the horizontal axis.
Under the following additional assumption on h, Theorem 1 can be further simplified:
Assumption 1. The function h : R → R is non-decreasing in (a, c) and non-increasing in (c, ∞) for some constant a ≤ c ≤ ∞ (i.e., c can possibly be ∞).
Theorem 2. Under Assumption 1 and the assumption that h is bounded, the optimal density in the first case in
Theorem 1 has two line segments, the first continuing from a with slope −ν and the second linking
to the horizontal axis (with the possibility that the first segment is degenerate).
The proofs of Theorems 1 and 2 are discussed in Section 6 and detailed in Appendix EC.1.2.
Figures 7 and 8 show the tail behaviors for the two cases when Assumption 1 holds. Qualitatively,
with a bounded support, the first case in Theorem 1 or 2 clearly possesses the lightest possible
tail. The second case in the theorems can be interpreted as an extreme heavy-tail. To explain this,
compare the optimal sequence of densities with any given fixed density. At any fixed, large enough x on the real line, as the sequence progresses, the decay rate (i.e., slope) at that point eventually becomes slower than that of the given density. Since a slower decay rate is the characteristic of a
fatter tail, this optimal density sequence can be interpreted as qualitatively capturing the heaviest
possible tail.
Figure 7: Behavior of an optimal light-tailed extrapolation
Figure 8: Behavior of an element in an optimal heavy-tailed extrapolation sequence
4.3. Optimization Procedure
We focus on h that satisfies Assumption 1, since it covers many natural scenarios including the
two examples in the Introduction. Given information about the density up to a, our algorithm will:
1) classify the two cases of light versus heavy tails, and 2) solve for the optimal density or the
optimal sequence of densities. The main idea is to search for the kinks of piecewise linear densities
that Theorem 1 shows to be optimal for problem (1). We first state the algorithm, and will provide
explanation momentarily and more details in Section 6:
Algorithm 1: Procedure for Solving (1)

Inputs:
1. The cost function h that satisfies Assumption 1 and is bounded and w.l.o.g. non-negative.
2. The parameters β, η, ν > 0.

Procedure:

Exclusion of trivial scenarios:
1. If η² > 2βν, there is no feasible density.
2. If η² = 2βν, then:
- Optimal value: νH(µ)
- There is only one feasible density, given by f(x) = η − ν(x − a).
3. Otherwise continue.

Main procedure:
Let
\[
\mu = \frac{\eta}{\nu}, \qquad \sigma = \frac{2\beta}{\nu}, \qquad H(x) = \int_0^x \int_0^u h(v+a)\,dv\,du, \qquad \lambda = \limsup_{x \to \infty} \frac{H(x)}{x^2} < \infty \tag{2}
\]
Consider the optimization
\[
\max_{x_1 \in [0,\mu)} \; W(x_1) = \frac{\sigma - \mu^2}{\sigma - 2\mu x_1 + x_1^2}\, H(x_1) + \frac{(\mu - x_1)^2}{\sigma - 2\mu x_1 + x_1^2}\, H\!\left(\frac{\sigma - \mu x_1}{\mu - x_1}\right) \tag{3}
\]
Either of the following two cases occurs:

Case 1 (light tail): There is an optimal solution for (3), given by x_1^* ∈ [0, µ). Then:
- Optimal value: νW(x_1^*)
- Optimal density:
\[
f(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le x_1^* + a \\
\eta - \nu x_1^* - \nu p_2^* (x - a - x_1^*) & \text{for } x_1^* + a \le x \le x_2^* + a \\
0 & \text{for } x \ge x_2^* + a
\end{cases}
\]
where
\[
x_2^* = \frac{\sigma - \mu x_1^*}{\mu - x_1^*}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}}
\]

Case 2 (heavy tail): There does not exist an optimal solution for (3). This occurs when there is a sequence x_1^{(k)} → µ such that W(x_1^{(k)}) ↗ W*, where W* is the optimal value of (3), but W* > max_{x_1 ∈ [0, µ−ε]} W(x_1) for any ε > 0.

Consider further the optimization
\[
\begin{aligned}
\max_{x_1 \in [0,\mu),\; \rho \in [\mu^2, \sigma]} \; V(x_1, \rho) = {} & \frac{\rho - \mu^2}{\rho - 2\mu x_1 + x_1^2} \left( H(x_1) - \lambda x_1^2 \right) \\
& + \frac{(\mu - x_1)^2}{\rho - 2\mu x_1 + x_1^2} \left( H\!\left(\frac{\rho - \mu x_1}{\mu - x_1}\right) - \lambda \left(\frac{\rho - \mu x_1}{\mu - x_1}\right)^2 \right) + \lambda\sigma
\end{aligned}
\tag{4}
\]
One of the following two sub-cases occurs:

Case 2.1: There exists an optimal solution for (4), given by (x_1^*, ρ*) ∈ [0, µ) × [µ², σ]. Then:
- Optimal value: νV(x_1^*, ρ*)
- There is no optimal density. The optimal value is achieved by a sequence f^{(k)}(x), x ≥ a, given by
\[
f^{(k)}(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le x_1^* + a \\
\eta - \nu x_1^* - \nu p_2^* (x - a - x_1^*) & \text{for } x_1^* + a \le x \le x_2^* - \delta^{(k)} + a \\
\eta - \nu x_1^* - \nu p_2^* (x_2^* - \delta^{(k)} - x_1^*) - \nu \gamma^{(k)} (x - a - (x_2^* - \delta^{(k)})) & \text{for } x_2^* - \delta^{(k)} + a \le x \le x_3^{(k)} + a \\
0 & \text{for } x \ge x_3^{(k)} + a
\end{cases}
\]
where
\[
x_2^* = \frac{\rho^* - \mu x_1^*}{\mu - x_1^*}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\rho^* - 2\mu x_1^* + x_1^{*2}}
\]
and
\[
x_3^{(k)} \to \infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2} \to \lambda \text{ as } k \to \infty, \qquad
\delta^{(k)} = \frac{\sigma - \rho^*}{p_2^* (x_3^{(k)} - x_2^*)}, \qquad \gamma^{(k)} = \frac{(\sigma - \rho^*)\, p_2^*}{p_2^* (x_3^{(k)} - x_2^*)^2 + (\sigma - \rho^*)}
\]

Case 2.2: There does not exist an optimal solution for (4). This occurs when there is a sequence (x_1^{(k)}, ρ^{(k)}), with x_1^{(k)} → µ, such that V(x_1^{(k)}, ρ^{(k)}) ↗ V*, where V* is the optimal value of (4), but V* > max_{x_1 ∈ [0, µ−ε], ρ ∈ [µ², σ]} V(x_1, ρ) for any ε > 0. Then:
- Optimal value: ν(H(µ) + λ(σ − µ²))
- There is no optimal density. The optimal value is achieved by a sequence f^{(k)}(x), x ≥ a, given by
\[
f^{(k)}(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le \mu - \delta^{(k)} + a \\
\eta - \nu(\mu - \delta^{(k)}) - \nu \gamma^{(k)} (x - a - (\mu - \delta^{(k)})) & \text{for } \mu - \delta^{(k)} + a \le x \le x_2^{(k)} + a \\
0 & \text{for } x \ge x_2^{(k)} + a
\end{cases}
\]
where
\[
x_2^{(k)} \to \infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2} \to \lambda \text{ as } k \to \infty, \qquad
\delta^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu}, \qquad \gamma^{(k)} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma}
\]
Note that the functions W(x_1) in (3) and V(x_1, ρ) in (4) are constructed to find the kinks of the line segments in an optimal piecewise linear density or sequence of densities; x_1 represents the x-coordinate of the first kink (relative to the threshold a).
Algorithm 1 uses a one-dimensional line search, namely (3), to classify the type of optimality into
light or heavy tail. In the light-tailed case (Case 1), an optimal density can be obtained without
further optimization. In the heavy-tailed case (Case 2), an additional two-dimensional nonlinear
program (4) needs to be solved to find an optimal sequence of densities. These programs can be
solved readily by standard nonlinear solvers, modulo the detection of non-existence of optimal
solution due to the boundary x1 = µ, which we discuss next.
We can detect whether or not a solution exists for (3) and (4) by introducing tiny "tolerance parameters" 0 < ε < ε′. For (3), replace the problem with max_{x_1 ∈ [0, µ−ε]} W(x_1), which must possess an optimal solution x_1^*. If x_1^* < µ − ε′, then we fall into Case 1, whereas x_1^* ∈ [µ − ε′, µ − ε] falls into Case 2. This captures the behavior that if the optimality of W is characterized by a sequence x_1^{(k)} → µ, then, fixing a small value for ε′ − ε, the optimal solution for max_{x_1 ∈ [0, µ−ε]} W(x_1) must occur in the interval [µ − ε′, µ − ε] as ε → 0. Similar statements hold for (4). Namely, max_{x_1 ∈ [0, µ−ε], ρ ∈ [µ², σ]} V(x_1, ρ) must have an optimal solution (x_1^*, ρ*), and the scenario x_1^* < µ − ε′ leads to Case 2.1, whereas x_1^* ∈ [µ − ε′, µ − ε] leads to Case 2.2. Furthermore, when the function H(x)/x² is eventually non-decreasing as x → ∞, one can simply put ε′ = ε in the above procedures.
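As a minimal sketch of this detection step (the parameter values and the indicator objective h(x) = 1{4 < x < 5} are our own illustrative choices; W is the function in (3)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative inputs (ours): threshold a and made-up beta, eta, nu
# satisfying the feasibility condition eta^2 < 2*beta*nu.
a, beta, eta, nu = 3.3, 0.05, 0.04, 0.03
mu, sigma = eta / nu, 2 * beta / nu
left, right = 4.0 - a, 5.0 - a      # support of h(. + a), shifted to start at 0

def H(x):
    """H(x) = int_0^x int_0^u h(v + a) dv du for the indicator objective."""
    if x <= left:
        return 0.0
    if x <= right:
        return 0.5 * (x - left) ** 2
    return 0.5 * (right - left) ** 2 + (right - left) * (x - right)

def W(x1):
    """Objective of the one-dimensional line search (3)."""
    d = sigma - 2.0 * mu * x1 + x1 ** 2
    x2 = (sigma - mu * x1) / (mu - x1)
    return (sigma - mu ** 2) / d * H(x1) + (mu - x1) ** 2 / d * H(x2)

eps, eps2 = 1e-6, 1e-4              # tolerance parameters 0 < eps < eps'
# W may be multimodal; a bounded scalar search suffices for this sketch
res = minimize_scalar(lambda x: -W(x), bounds=(0.0, mu - eps), method="bounded")
if res.x < mu - eps2:
    print("Case 1 (light tail); worst-case bound =", -nu * res.fun)
else:
    print("Case 2 (heavy tail); proceed to the nonlinear program (4)")
```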
We close this section with a few further remarks on Algorithm 1:
1. If h is not non-negative as the algorithm requires, then, because h is bounded, one can always add a sufficiently large constant to make it non-negative; the optimality properties are obviously retained.
2. Regarding the exclusion of trivial scenarios in the algorithm, η² > 2βν implies that β is smaller than the area under the straight line starting from the point (a, η) and going down to the x-axis with slope −ν. It is easy to see that no convex extrapolation can be drawn under this condition. Expressed in terms of µ and σ, the condition is equivalent to σ < µ².
3. We have λ < ∞ because h is bounded and hence H(x) grows at most quadratically in x.
4. W(x_1) and V(x_1, ρ) in (3) and (4) are both bounded as x_1 ↗ µ (see Appendix EC.4.2 for the proof).
5. In the heavy-tailed case (Case 2), the sequence f^{(k)}(x) that approaches optimality possesses a pointwise limit, but the limit is not a valid density: a probability mass "escapes" to positive infinity. In other words, f^{(k)}(x) does not converge weakly to any probability measure, even though the sequence of objective values does converge.
5. Elementary Numerical Examples
We demonstrate Algorithm 1 with two examples.
Entropic Risk Measure: The entropic risk measure (e.g., Föllmer and Schied (2011)) captures
the risk aversion of users through the exponential utility function. It is defined as
\[
\rho(X) = \frac{1}{\theta} \log\left(E\left[e^{-\theta X}\right]\right) \tag{5}
\]
where θ > 0 is the parameter of risk aversion. In the case when the distribution of the random variable X is known only up to some point a, we can find the worst-case value of the entropic risk measure subject to tail uncertainty by solving the optimization problem
\[
\max_{P \in \mathcal{A}} \frac{1}{\theta} \log\left(E\left[e^{-\theta X}\right]\right) = \frac{1}{\theta} \log\left(E\left[e^{-\theta X}; X \le a\right] + \max_{P \in \mathcal{A}} E\left[e^{-\theta X}; X > a\right]\right) \tag{6}
\]
where A denotes the set of convex tails that match the given non-tail region. Since the function
e^{-θx} satisfies Assumption 1, we can apply Algorithm 1 to the second term on the RHS of (6).
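For this choice of h, the function H in (2) is available in closed form, which makes the line search (3) inexpensive; a short derivation (ours) reads
\[
H(x) = \int_0^x \int_0^u e^{-\theta(v+a)}\,dv\,du = \frac{e^{-\theta a}}{\theta} \int_0^x \left(1 - e^{-\theta u}\right) du = \frac{e^{-\theta a}}{\theta^2}\left(\theta x - 1 + e^{-\theta x}\right),
\]
so that λ = limsup_{x→∞} H(x)/x² = 0, since H grows only linearly.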
The thick line in Figure 9 represents the worst-case value of the entropic risk measure for different
values of the parameter θ in the case when X is known to have a standard exponential distribution
Exp(1) up to a = −log(0.7) (i.e., a is the point with tail probability 0.7, so that β = η = ν = 0.7). For comparison,
we also calculate and plot the entropic risk measure for several fitted probability distributions:
Exp(1), two-segment continuous piecewise linear tail denoted as 2-PLT (two such instances in
Figure 9), and mixtures of 2-PLT and shifted Pareto. Clearly, the worst-case values bound those
calculated from the candidate parametric models, with the gap diminishing as θ increases.
The Newsvendor Problem: The classical newsvendor problem maximizes the profit of selling
a perishable product by fulfilling demand using a stock level decision, i.e.,
\[
\max_q \; E[p \min(q, D)] - cq \tag{7}
\]
where D is the demand random variable, p and c are the selling and purchase prices per product,
and q is the stock quantity to be determined. We assume that p > c. The optimal solution to (7)
is given by Littlewood's rule q* = F^{-1}((p − c)/p), where F^{-1} is the quantile function of D (Talluri
and Van Ryzin (2006)).
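As a quick illustration of Littlewood's rule under the nominal lognormal model (a sketch with our own moment-matching assumptions; this is the non-robust q*, not the solution of the robust formulation below):

```python
import numpy as np
from scipy.stats import lognorm

p, c = 7.0, 1.0
mean, sd = 50.0, 20.0
# Moment-match a lognormal to mean 50 and standard deviation 20
s2 = np.log(1.0 + (sd / mean) ** 2)   # variance of log(D)
m = np.log(mean) - s2 / 2.0           # mean of log(D)
D = lognorm(s=np.sqrt(s2), scale=np.exp(m))

q_star = D.ppf((p - c) / p)           # Littlewood's rule: F^{-1}((p - c)/p)
print(q_star)                         # roughly 70 under the full lognormal
```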
Figure 9 Optimal upper bound and comparison with parametric extrapolations for the entropic risk measure.
Suppose the distribution of D is only known to have the shape of a lognormal distribution
with mean 50 and standard deviation 20 in the interval [0, a), where a is the 70-percentile of the
lognormal distribution. A robust optimization formulation for (7) is
\[
\max_q \min_{P \in \mathcal{A}} E[p \min(q, D)] - cq = \max_q \left\{ E[p \min(q, D); D \le a] + \min_{P \in \mathcal{A}} E[p \min(q, D); D > a] - cq \right\} \tag{8}
\]
where A denotes the set of convex tails that match the given non-tail region. The outer optimization
in (8) is a concave program. We concentrate on the inner optimization. Since p min(q, D) is a non-decreasing function of D on [0, ∞), its negation is non-increasing and Assumption 1 holds (the minimization can be carried out by maximizing the negation). We can therefore
apply Algorithm 1 (with β = 0.7, η ≈ 0.007, and ν ≈ 0.0003). Figure 10 shows the optimal lower
bound of the inner optimization when p = 7, c = 1 and q varies between 0 and 193.26 (which is
the 95-percentile of the lognormal distribution). The curve peaks at q= 55.7, which is the solution
to problem (8). As a comparison, we also show different candidate values of the expectation that
are obtained by fitting the tails of lognormal, 2-PLT (two instances) and mixture of shifted Pareto
and 2-PLT (see Figure 10).
Figure 10 Optimal objective values of the inner optimization of the robust newsvendor problem.
6. Main Mathematical Developments
This section explains our main mathematical arguments for the optimality characterization and
Algorithm 1. There are four steps:
Step 1: Conversion of (1) into a moment-constrained optimization program. The first key observation in solving (1) is that it can be reduced to a moment-constrained program, by re-expressing the decision variable as f'(x) and identifying it, through a linear transformation, with a probability distribution function. This is summarized as:
Theorem 3. Suppose h is bounded. Denote H(x) = ∫_0^x ∫_0^u h(v+a) dv du. Then the optimal value of (1) is the same as
\[
\begin{aligned}
\max \quad & \nu E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& \mathbf{P} \in \mathcal{P}[0,\infty)
\end{aligned}
\tag{9}
\]
where µ = η/ν and σ = 2β/ν. Here the decision variable is a probability distribution P ∈ P[0,∞), where P[0,∞) denotes the set of all probability measures on [0,∞), and E[·] is the corresponding expectation. Moreover, there is a one-to-one correspondence (up to measure zero) between the feasible solutions to (1) and (9), given by f'(x+a) = ν(p(x) − 1) a.e., where f is a feasible solution of (1) and p is a probability distribution function over [0,∞).
The proof is left to Appendix EC.1.1. The result allows us to focus on program (9) and at the
end convert its solution back to that of the original formulation (1).
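To see where the moment constraints in (9) come from, write p(x) = 1 + f'(x+a)/ν and integrate by parts; a sketch of the two key identities (ours, using f(x) → 0 and x f(x+a) → 0 as x → ∞) is
\[
E[X] = \int_0^\infty (1 - p(x))\,dx = -\frac{1}{\nu}\int_0^\infty f'(x+a)\,dx = \frac{f(a)}{\nu} = \frac{\eta}{\nu} = \mu,
\]
\[
E[X^2] = 2\int_0^\infty x\,(1 - p(x))\,dx = \frac{2}{\nu}\int_0^\infty f(x+a)\,dx = \frac{2\beta}{\nu} = \sigma.
\]
A similar double integration by parts turns the objective ∫_a^∞ h(x)f(x) dx into νE[H(X)].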
Step 2: Characterization of the form of the optimal solution for (9). Program (9) is an infinite-dimensional linear program (LP). Using existing terminology, we call an optimization program consistent if
there exists a feasible solution, and solvable if there exists an optimal solution. We start with the
immediate observation:
Lemma 1. Program (9) is consistent if and only if σ ≥ µ². Correspondingly, program (1) is consistent if and only if η² ≤ 2βν. When σ = µ², (9) has only one feasible solution, given by a point mass at µ. Correspondingly, when η² = 2βν, (1) has only one feasible solution, given by f(x) = η − ν(x − a) for x ≥ a.
Lemma 1 justifies the beginning step on the exclusion of trivial scenarios in Algorithm 1.
To proceed further, denote S_n = {(p_1, ..., p_n) ∈ R_+^n : Σ_{i=1}^n p_i = 1} as the n-dimensional probability simplex. Denote P_n[0,∞) as the set of all finite-support distributions on [0,∞) with at most n support points, i.e., each probability measure in P_n[0,∞) has point masses (p_1, p_2, ..., p_n) ∈ S_n on some distinct points x_1, ..., x_n ∈ [0,∞). For convenience, denote OPT(C) as the program
\[
\begin{aligned}
\max \quad & E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& \mathbf{P} \in \mathcal{C}
\end{aligned}
\]
Moreover, we introduce the following assumption:
Assumption 2. H is convex and H' satisfies a convex-concave property, i.e., H'(x) is convex for x ∈ (0, c) and concave for x ∈ (c, ∞), for some 0 ≤ c ≤ ∞.
Note that this assumption holds if h is non-negative, bounded, satisfies Assumption 1, and relates to H by H(x) = ∫_0^x ∫_0^u h(v+a) dv du.
The following result characterizes the optimality structure of (9):
Proposition 1. The optimal value of OPT(P[0,∞)) is identical to that of OPT(P_3[0,∞)). In addition, under Assumption 2, the existence of an optimal solution for OPT(P[0,∞)) implies that there is an optimal solution for OPT(P[0,∞)) that lies in P_2[0,∞).
Proposition 1 implies the optimality characterization of our original problem (1) as follows. The first part concludes that there must be a sequence of feasible solutions in P_3[0,∞) that converges to optimality for (9). When converted back to formulation (1) using Theorem 3, these solutions correspond exactly to piecewise linear densities with three line segments, which leads to Theorem 1. In the second part of Proposition 1, an optimal solution in P_2[0,∞) for (9) converts exactly to a piecewise linear density with two line segments for (1), which leads to Theorem 2.
The detailed proof of Proposition 1 and how it implies Theorems 1 and 2 are shown in Appendix
EC.1.2.
Step 3: Solving (9). Knowing the optimality structure through Proposition 1, the remaining task is to find an optimal solution (or sequence of solutions). This can be done by searching for the support points and their masses. For convenience, we encode a generic element in P_2[0,∞) by (x_1, x_2, p_1, p_2), where (x_1, x_2) are the support points in [0,∞) and (p_1, p_2) ∈ S_2 are the associated probability masses. Similarly, we encode a generic element in P_3[0,∞) by (x_1, x_2, x_3, p_1, p_2, p_3).

The line search in (3) is posed to find an optimal element (x_1^*, x_2^*, p_1^*, p_2^*) in P_2[0,∞) for (9), which reduces to finding just x_1^*. If the optimal x_1^* < µ, we conclude that (9) is solvable, which subsequently leads to Case 1 (i.e., the light-tailed case) in Algorithm 1. Otherwise, (9) is not solvable and needs further analysis. This is summarized as follows:
Proposition 2. Suppose σ > µ² and Assumption 2 holds. Then either one of the following cases happens:
1. The program (3) is solvable, i.e., there is an optimal solution x_1^* ∈ [0, µ). Then (9) is solvable and has an optimal solution in P_2[0,∞). Moreover, this solution is given by (x_1^*, x_2^*, p_1^*, p_2^*), where
\[
x_2^* = \frac{\sigma - \mu x_1^*}{\mu - x_1^*}, \qquad p_1^* = \frac{\sigma - \mu^2}{\sigma - 2\mu x_1^* + x_1^{*2}}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}} \tag{10}
\]
2. The program (3) is not solvable, i.e., W(x_1) < W* < ∞ for any x_1 ∈ [0, µ). Then (9) is not solvable, and there is a feasible sequence P^{(k)} ∈ P_3[0,∞) with objective value converging to the optimal value of (9). Each P^{(k)} is represented by the support points (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}) and masses (p_1^{(k)}, p_2^{(k)}, p_3^{(k)}), such that either (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) → (x_1^*, x_2^*, ∞, p_1^*, p_2^*, 0) for some x_1^*, x_2^* ∈ [0,∞) (possibly identical) and (p_1^*, p_2^*) ∈ S_2, or (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) → (x_1^*, ∞, ∞, 1, 0, 0) for some x_1^* ∈ [0,∞).
The first case in Proposition 2 leads to the optimality structure of Case 1 (i.e. the light-tailed
case) in Algorithm 1. The proof of Proposition 2 is in Appendix EC.1.3.
Step 4: Relaxed program to solve for the optimal sequence in the case of non-existence of optimal
solution. We have the following further characterization:
Proposition 3. Suppose σ > µ² and Assumption 2 holds. In the case that (9) is not solvable, its optimal value is equivalent to that of
\[
\begin{aligned}
\max \quad & E[r(X)] + \lambda\sigma \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] \le \sigma \\
& \mathbf{P} \in \mathcal{P}_2[0,\infty)
\end{aligned}
\tag{11}
\]
where r(x) = H(x) − λx².
Program (11) is obtained by identifying all the limits of the sequences in P_3[0,∞) represented in Case 2 of Proposition 2. The proof of Proposition 3 is left to Appendix EC.1.4.
Finally, one can search for the support points and probability masses of an optimal solution of (11) by posing the nonlinear program (4), and from there explicitly construct an optimality-approaching sequence for (9). This is summarized as:
Proposition 4. Suppose σ > µ² and Assumption 2 holds. In the situation that (9) is not solvable, either of the following two cases must happen:
1. There exists an optimal solution for (4), given by (x_1^*, ρ*) ∈ [0, µ) × [µ², σ]. Then a sequence of feasible P^{(k)} ∈ P_3[0,∞) represented by (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) will have objective value in (9) converging to the optimum, where
\[
\begin{aligned}
& x_1^{(k)} = x_1^* \text{ for all } k \\
& x_2^{(k)} = x_2^* - \delta^{(k)} = \frac{\rho^* - \mu x_1^*}{\mu - x_1^*} - \delta^{(k)} \text{ for all } k \\
& x_3^{(k)} \to \infty \text{ is a sequence such that } H(x_3^{(k)})/(x_3^{(k)})^2 \to \lambda \text{ as } k \to \infty \\
& p_1^{(k)} = p_1^* = \frac{\rho^* - \mu^2}{\rho^* - 2\mu x_1^* + x_1^{*2}} \\
& p_2^{(k)} = p_2^* - \gamma^{(k)} = \frac{(\mu - x_1^*)^2}{\rho^* - 2\mu x_1^* + x_1^{*2}} - \gamma^{(k)} \\
& p_3^{(k)} = \gamma^{(k)}
\end{aligned}
\tag{12}
\]
and
\[
\delta^{(k)} = \frac{\sigma - \rho^*}{p_2^* (x_3^{(k)} - x_2^*)}, \qquad \gamma^{(k)} = \frac{(\sigma - \rho^*)\, p_2^*}{p_2^* (x_3^{(k)} - x_2^*)^2 + (\sigma - \rho^*)} \tag{13}
\]
2. There does not exist an optimal solution for (4). Then a sequence of feasible P^{(k)} ∈ P_2[0,∞) represented by (x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)}) will have objective value in (9) converging to the optimal value H(µ) + λ(σ − µ²), where
\[
\begin{aligned}
& x_1^{(k)} = \mu - \delta^{(k)} \\
& x_2^{(k)} \to \infty \text{ is a sequence such that } H(x_2^{(k)})/(x_2^{(k)})^2 \to \lambda \text{ as } k \to \infty \\
& p_1^{(k)} = 1 - \gamma^{(k)} \\
& p_2^{(k)} = \gamma^{(k)}
\end{aligned}
\tag{14}
\]
and
\[
\delta^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu}, \qquad \gamma^{(k)} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma} \tag{15}
\]
This concludes the optimality structure of Case 2 (i.e. the heavy-tailed case) in Algorithm 1.
The proof of Proposition 4 is left to Appendix EC.1.4.
7. Data-driven Worst-case Tail Analysis
7.1. Relaxed Formulation and Combination with Interval Estimators
This section focuses on integrating our framework with data such as those in Examples 1 and 2 in Section 1. Still presuming a threshold a chosen in advance, we are now interested in the scenario where the exact density below a is unknown and must be estimated from the data. Using interval estimates for the quantities η, ν and β, the worst-case bound for E[h(X); X > a] is
\[
\begin{aligned}
\max \quad & \int_a^\infty h(x) f(x)\,dx \\
\text{subject to} \quad & \underline{\beta} \le \int_a^\infty f(x)\,dx \le \overline{\beta} \\
& \underline{\eta} \le f(a) \le \overline{\eta} \\
& f'_+(a) \ge -\overline{\nu} \\
& f(x) \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{aligned}
\tag{16}
\]
where $[\underline{\beta}, \overline{\beta}]$ and $[\underline{\eta}, \overline{\eta}]$ are the joint (1 − α) confidence intervals (CIs) for P(X > a) and f(a), and $-\overline{\nu}$ is the joint lower confidence bound for f'_-(a). It is clear that the optimal value of (16) carries the following statistical guarantee:
Proposition 5. Suppose that $[\underline{\beta}, \overline{\beta}]$, $[\underline{\eta}, \overline{\eta}]$ and $-\overline{\nu}$ are the joint (1 − α) CIs for P(X > a) and f(a), and the lower confidence bound for f'_-(a). Then with probability (1 − α) the optimization (16) gives an upper bound for the worst-case value of E[h(X); X > a] under the assumption that f(x) is convex for x ≥ a.
The optimality characterization of (16) is the same as that of (1) (the proof is discussed in
Appendix EC.3):
Theorem 4. Suppose h is non-negative. Theorems 1 and 2 hold if (1) is replaced by (16).
Our algorithm for solving (16) (hereafter Algorithm 2) is conceptually similar to Algorithm 1 but
possesses additional parameters for handling inequality instead of equality constraints. Algorithm
2, which uses the parameters $\underline{\beta}$, $\overline{\beta}$, $\underline{\eta}$, $\overline{\eta}$ and $\overline{\nu}$ as inputs, requires two two-dimensional nonlinear
programs to distinguish between existence and non-existence of optimal density, and needs at most
two additional programs, each with at most three dimensions, to solve for an optimal density or
sequence of densities (see Appendix EC.2 for the details).
The next sub-section applies formulation (16) to the two examples in the Introduction and gives
the numerical results.
7.2. Synthetic Data: Example 2 Revisited
Consider the synthetic data set of size 200 in Example 2. This data set is actually generated from a lognormal distribution with parameters (µ, σ) = (0, 0.5), but we assume that only the data are available to us. We are interested in the quantity P(4 < X < 5), and for this we will solve program (16) to generate an upper bound that is valid with 95% confidence.
We compute the interval estimates for β, η and ν as follows. First, we obtain point estimates for these parameters through a standard kernel density estimator (KDE) in the R statistical package. To obtain interval estimates, we run 1,000 bootstrap resamples and take the appropriate quantiles of the 1,000 resampled point estimates. To account for the fact that three parameters are estimated simultaneously, we apply a Bonferroni correction, so that the confidence level used for each individual estimator is 1 − 0.05/3.
For a sense of how to choose a, Figure 11 shows the density and density derivative estimates
and compares them to those of the lognormal distribution. The KDE suggests that convexity holds
starting from around x = 1.5 (the point where the density derivative estimate turns from decreasing to increasing). Thus, it is reasonable to confine the choice of a to values larger than 1.5. In fact, this number is quite close to the true inflection point 1.15.
Since the data become progressively scarcer as x grows larger, and the KDE is designed to utilize
neighborhood data, the interval estimators for the necessary parameters β, η and ν become less
reliable for larger choices of a. For instance, Figure 11 shows that the bootstrapped KDE CI of
the density derivative covers the truth only up to x = 3.3. In general, a good choice of a should be as large as possible while still having some data in its neighborhood, so that the interval estimators for β, η and ν remain reliable; choosing a small a can make the tail extrapolation bound more conservative.
Figure 11 Bootstrapped kernel estimation of the distribution, density and density derivative for the synthetic data.
As a first attempt, we run Algorithm 2 using a = 3.3 to estimate an upper bound for the probability P(4 < X < 5), which gives 6.6 × 10⁻³, while the truth is 2.1 × 10⁻³. Thus, this estimated upper bound does cover the truth and also has the same order of magnitude. We perform the following two other procedures for comparison:
1. GPD approach: As discussed in Section 3.1, this is a common approach for tail modeling. Fit the data above a threshold u to the density function
\[
(1 - F(u))\, g_{\zeta,\beta}(x - u)
\]
where F(u) is the estimated ECDF at u, and g_{\zeta,\beta}(·) is the GPD density, whose distribution function is defined as
\[
G_{\zeta,\beta}(x) = \begin{cases}
1 - (1 + \zeta x/\beta)^{-1/\zeta} & \text{if } \zeta \neq 0 \\
1 - \exp(-x/\beta) & \text{if } \zeta = 0
\end{cases}
\]
for x ≥ 0 and β > 0. Set the threshold u to be 0.8, such that a linear trend is observed on both the exponential Q-Q plot and the mean excess plot of the data, as recommended by McNeil (1997). Estimate F(u) by the sample mean of I(X_i ≤ u), where I(·) denotes the indicator function, and obtain the parameter estimates ζ and β from maximum likelihood estimation. Then obtain a 95% CI for the quantity P(4 < X < 5) from the delta method. (A code sketch of this fitting step is given after the list.)
2. Worst-case approach with known parameter values: Assume β, η and ν are known at a= 3.3.
Then run Algorithm 1 to obtain the upper bound.
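A sketch of the GPD fitting step referenced in the first item (the delta-method CI is omitted for brevity; `data` and the threshold u are as above, and the variable names are ours):

```python
import numpy as np
from scipy.stats import genpareto

u = 0.8
exceed = data[data > u] - u                      # excesses over the threshold
zeta, _, psi = genpareto.fit(exceed, floc=0)     # MLE of shape and scale
F_u = np.mean(data <= u)                         # ECDF at u

# Point estimate of P(4 < X < 5) under the fitted tail model
p_hat = (1 - F_u) * (genpareto.cdf(5 - u, zeta, scale=psi)
                     - genpareto.cdf(4 - u, zeta, scale=psi))
```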
Table 1 shows the upper bounds obtained from the above approaches, and also records the obvious fact that the ECDF alone cannot estimate P(4 < X < 5), since there are no data in the interval [4, 5]. The worst-case approach with known parameters gives an upper bound of 2.9 × 10⁻³, which is smaller than that of the data-driven version. The difference between these numbers can be interpreted as the price of estimating β, η and ν. Note that for this particular setup, the worst-case approach correctly covers the true value, whereas GPD fitting actually gives an invalid upper bound, showing that either the data size or the threshold level is insufficient to support a good fit of the GPD. This is an instance where the worst-case approach outperforms GPD in terms of correctness.
The above discussion focuses on only one realization of the data set, which raises the question of whether the finding holds more generally. Therefore, we repeat the experiment 200 times to obtain an
empirical coverage probability, i.e. the probability that the estimated upper bound indeed covers
the truth. We also vary the position of the interval of interest in P (c <X < d) from [c, d] = [4,5] to
[9,10], and try two different a’s: 3.3 and 2.8. Tables 2 and 3 show the true probabilities, the mean
upper bounds from the 200 experiments, and the empirical coverage probabilities.
Method Estimated upper bound
Truth 2.1 × 10⁻³
Worst-case approach 6.6 × 10⁻³
GPD 1.5 × 10⁻³
Worst-case with known parameters 2.9 × 10⁻³
ECDF N/A
Table 1 Estimated upper bounds of the probability P(4 < X < 5) for the synthetic data in Example 2.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 7.96 × 10⁻³ 0.93
5 6 4.73 × 10⁻⁴ 4.34 × 10⁻³ 0.96
6 7 1.19 × 10⁻⁴ 2.98 × 10⁻³ 1
7 8 3.37 × 10⁻⁵ 2.27 × 10⁻³ 1
8 9 1.04 × 10⁻⁵ 1.83 × 10⁻³ 1
9 10 3.49 × 10⁻⁶ 1.53 × 10⁻³ 1
Table 2 Mean upper bounds and empirical coverage probabilities using the worst-case approach with threshold a = 3.3.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 1.09 × 10⁻² 1
5 6 4.73 × 10⁻⁴ 6.87 × 10⁻³ 1
6 7 1.19 × 10⁻⁴ 4.98 × 10⁻³ 1
7 8 3.37 × 10⁻⁵ 3.90 × 10⁻³ 1
8 9 1.04 × 10⁻⁵ 3.20 × 10⁻³ 1
9 10 3.49 × 10⁻⁶ 2.71 × 10⁻³ 1
Table 3 Mean upper bounds and empirical coverage probabilities using the worst-case approach with threshold a = 2.8.
The coverage probabilities in Tables 2 and 3 are mostly 1, which suggests that our procedure
is conservative. For a = 3.3 and intervals that are close to a, i.e. [c, d] = [4,5] and [5,6], the cov-
erage probability is not 1 but rather is close to the prescribed confidence level of 95%. Further
investigation reveals that our procedure fails to cover the truth only in the case when the joint
CI of the parameters η, β and ν does not contain the true values, which is consistent with the
rationale of our method. Although we have not tried lower values of a, it is very likely that the
coverage probabilities will stay mostly 1, but the mean upper bounds will increase as the level of
conservativeness increases.
As a comparison, Table 4 shows the results of the GPD fit using threshold u = 0.8. Here, all of the
coverage probabilities are far from the prescribed level of 95%, which suggests that either GPD is
the wrong parametric choice to use since the threshold is not high enough, or that the estimation
error of its parameters is too large due to the lack of data. Note that we have used a two-sided 95%
CI for the GPD approach here. If we had used a one-sided upper confidence bound, then the upper
bounding value would be even lower and the coverage probability would drop further. However, the
mean upper bounds using GPD fit do cover the truth in all cases. Since the coverage probability
is well below 95%, this suggests that the estimation of GPD parameters is highly sensitive to the
realization of data.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 4.06 × 10⁻³ 0.67
5 6 4.73 × 10⁻⁴ 9.93 × 10⁻⁴ 0.54
6 7 1.19 × 10⁻⁴ 3.01 × 10⁻⁴ 0.44
7 8 3.37 × 10⁻⁵ 1.06 × 10⁻⁴ 0.37
8 9 1.04 × 10⁻⁵ 4.38 × 10⁻⁵ 0.30
9 10 3.49 × 10⁻⁶ 1.95 × 10⁻⁵ 0.30
Table 4 Mean upper bounds and empirical coverage probabilities using the GPD approach.
In summary, Tables 2, 3 and 4 show the pros and cons of our worst-case approach and GPD
fitting. GPD appears to perform well in capturing the magnitude of the target quantity, but its confidence upper bound can fall short of the prescribed coverage probability (in fact, it covers the truth only between 30% and 70% of the time in Table 4). On the other hand, our approach
gives a reasonably tight upper bound when the interval in consideration (i.e. [c, d]) is close to the
threshold a, and tends to be more conservative far out. This is a drawback, but sensibly so, given
that the uncertainty of extrapolation increases as it gets farther away from what is known.
Both our worst-case approach and GPD fitting require choosing a threshold parameter. In GPD
fitting, it is important to choose a threshold parameter high enough so that the GPD becomes a
valid model. GPD fitting, however, is difficult for a small data set when the lack of data prohibits
choosing a high threshold. On the other hand, the threshold in our worst-case approach can be
chosen much higher, because our method relies on the data below the threshold, not above it.
7.3. Fire Insurance Data: Example 1 Revisited
Consider the fire insurance data in Example 1. The quantity of interest is the expected payoff of a high-excess policy with reinsurance, given by h(x) = (x − 50)I(50 ≤ x < 200) + 150I(x ≥ 200). The
data set has only seven observations above 50.
We apply our worst-case approach to estimate an upper bound for the expected payoff by using
a = 29.03, the cutoff above which 15 observations are available. As in Section 7.2, we use the
bootstrapped KDE to obtain CIs for β, η and ν. The estimates in Figure 12 appear to be very
stable for this example, thanks to the relatively large data size.
We run Algorithm 2 to obtain a 95% confidence upper bound of 1.61. For comparison, we fit a GPD using threshold u = 10, following McNeil (1997) as the choice that roughly balances the bias-variance tradeoff and minimizes the length of the CI. The 95% CI from the GPD fit is [0.04, 0.16].
Thus, the worst-case approach gives an upper bound that is one order of magnitude higher, a
finding that resonates with that in Section 7.2. Our recommendation is that a modeler who cares
only about the order of magnitude would be better off choosing GPD, whereas a more risk-averse
Figure 12 Bootstrapped kernel estimation of the distribution, density and density derivative for the Danish fire losses data in Example 1.
modeler who wants a bound on the risk quantity with high probability guarantee would be better
off choosing the worst-case approach.
8. Conclusion
This paper proposed a worst-case, nonparametric approach to estimating tail quantities, based on finding an optimal tail extrapolation under a convexity assumption. The approach was developed to handle cases with tremendous uncertainty in the tail region. Its implementability was demonstrated by reducing the infinite-dimensional optimization problem to low-dimensional nonlinear programs. Moreover, a data-driven optimization formulation was developed via a suitable relaxation that takes into account interval estimates for parameters of the non-tail region. With two data sets, one synthetic and one real, the proposed approach was compared to existing tail-fitting techniques, and its relative strength in producing correct tail estimates in data-deficient environments was
demonstrated. The level of conservativeness of the proposed approach, which is a limitation of the
method, was also examined.
We suggest two extensions of our research. First is to generalize the proposed method to mul-
tivariate distributions, perhaps through separate modeling on the marginal distributions and the
dependency structure. Second is to study means to reduce the level of conservativeness. This can
involve mathematical transformations of the variable, the addition of extra information from other
available data (e.g., data that can be used for validation) and the incorporation of subjective expert
opinion.
References
Balkema, August A, Laurens De Haan. 1974. Residual life time at great age. The Annals of Probability
792–804.
Beirlant, Jan, Jozef L Teugels. 1992. Modeling large claims in non-life insurance. Insurance: Mathematics
and Economics 11(1) 17–29.
Ben-Tal, Aharon, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, Gijs Rennen. 2013. Robust
solutions of optimization problems affected by uncertain probabilities. Management Science 59(2)
341–357.
Bertsimas, Dimitris, Ioana Popescu. 2005. Optimal inequalities in probability theory: A convex optimization
approach. SIAM Journal on Optimization 15(3) 780–804.
Birge, John R, Jose H Dula. 1991. Bounding separable recourse functions with limited distribution informa-
tion. Annals of Operations Research 30(1) 277–298.
Birge, John R, Roger J-B Wets. 1987. Computing bounds for stochastic programming problems by means
of a generalized moment problem. Mathematics of Operations Research 12(1) 149–162.
Bucklew, James. 2004. Introduction to rare event simulation. Springer Science & Business Media.
Cule, Madeleine, Richard Samworth, Michael Stewart. 2010. Maximum likelihood estimation of a multi-
dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology) 72(5) 545–607.
Davis, Richard, Sidney Resnick. 1984. Tail estimates motivated by extreme value theory. The Annals of
Statistics 1467–1487.
Davison, Anthony C, Richard L Smith. 1990. Models for exceedances over high thresholds. Journal of the
Royal Statistical Society. Series B (Methodological) 393–442.
Delage, Erick, Yinyu Ye. 2010. Distributionally robust optimization under moment uncertainty with appli-
cation to data-driven problems. Operations Research 58(3) 595–612.
Dembo, Amir, Ofer Zeitouni. 2009. Large deviations techniques and applications, vol. 38. Springer Science
& Business Media.
Denisov, Denis, AB Dieker, Vsevolod Shneer, et al. 2008. Large deviations for random walks under subex-
ponentiality: the big-jump domain. The Annals of Probability 36(5) 1946–1991.
El Ghaoui, Laurent, A Nilim. 2005. Robust solutions to Markov decision problems with uncertain transition
matrices. Operations Research 53(5).
Embrechts, Paul, Rüdiger Frey, Alexander McNeil. 2005. Quantitative risk management. Princeton Series in
Finance, Princeton 10.
Embrechts, Paul, Claudia Klüppelberg, Thomas Mikosch. 1997. Modelling extremal events, vol. 33. Springer
Science & Business Media.
Fisher, Ronald Aylmer, Leonard Henry Caleb Tippett. 1928. Limiting forms of the frequency distribution of
the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical
Society , vol. 24. Cambridge University Press, 180–190.
Föllmer, Hans, Alexander Schied. 2011. Stochastic finance: an introduction in discrete time. Walter de
Gruyter.
Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2007. Large deviations in multifactor portfolio credit
risk. Mathematical Finance 17(3) 345–379.
Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2008. Fast simulation of multifactor portfolio credit
risk. Operations Research 56(5) 1200–1217.
Glasserman, Paul, Jingyi Li. 2005. Importance sampling for portfolio credit risk. Management Science
51(11) 1643–1656.
Gnedenko, Boris. 1943. Sur la distribution limite du terme maximum d'une série aléatoire. Annals of Mathematics 423–453.
Goh, Joel, Melvyn Sim. 2010. Distributionally robust optimization and its tractable approximations. Operations Research 58(4-part-1) 902–917.
Gumbel, Emil Julius. 2012. Statistics of Extremes. Courier Corporation.
Hannah, Lauren A, David B Dunson. 2013. Multivariate convex regression with adaptive partitioning. The
Journal of Machine Learning Research 14(1) 3261–3294.
Hansen, Lars Peter, Thomas J Sargent. 2008. Robustness. Princeton University Press.
Heidelberger, Philip. 1995. Fast simulation of rare events in queueing and reliability models. ACM Transactions on Modeling and Computer Simulation (TOMACS) 5(1) 43–85.
Hill, Bruce M, et al. 1975. A simple general approach to inference about the tail of a distribution. The
Annals of Statistics 3(5) 1163–1174.
Hogg, Robert V, Stuart A Klugman. 2009. Loss Distributions, vol. 249. John Wiley & Sons.
Hosking, Jonathan RM, James R Wallis. 1987. Parameter and quantile estimation for the generalized Pareto distribution. Technometrics 29(3) 339–349.
Iyengar, Garud N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2) 257–280.
Karr, Alan F. 1983. Extreme points of certain sets of probability measures, with applications. Mathematics
of Operations Research 8(1) 74–85.
Koenker, Roger, Ivan Mizera. 2010. Quasi-concave density estimation. The Annals of Statistics 2998–3027.
Lim, Eunji, Peter W Glynn. 2012. Consistency of multidimensional convex regression. Operations Research
60(1) 196–208.
McNeil, A. J. 1997. Estimating the tails of loss severity distributions using extreme value theory. The Journal
of the International Actuarial Association 27 117–137.
Nicola, Victor F, Marvin K Nakayama, Philip Heidelberger, Ambuj Goyal. 1993. Fast simulation of highly
dependable systems with general failure and repair processes. Computers, IEEE Transactions on 42(12)
1440–1452.
Petersen, Ian R, Matthew R James, Paul Dupuis. 2000. Minimax optimal control of stochastic uncertain
systems with relative entropy constraints. Automatic Control, IEEE Transactions on 45(3) 398–412.
Pickands III, James. 1975. Statistical inference using extreme order statistics. The Annals of Statistics
119–131.
Popescu, Ioana. 2005. A semidefinite programming approach to optimal-moment bounds for convex classes
of distributions. Mathematics of Operations Research 30(3) 632–657.
Seijo, Emilio, Bodhisattva Sen, et al. 2011. Nonparametric least squares estimation of a multivariate convex
regression function. The Annals of Statistics 39(3) 1633–1657.
Seregin, Arseni, Jon A Wellner. 2010. Nonparametric estimation of multivariate convex-transformed densities. The Annals of Statistics 38(6) 3751.
Smith, James E. 1995. Generalized Chebychev inequalities: theory and applications in decision analysis. Operations Research 43(5) 807–825.
Smith, Richard L. 1985. Maximum likelihood estimation in a class of nonregular cases. Biometrika 72(1)
67–90.
Talluri, Kalyan T, Garrett J Van Ryzin. 2006. The theory and practice of revenue management, vol. 68. Springer Science & Business Media.
Winkler, Gerhard. 1988. Extreme points of moment sets. Mathematics of Operations Research 13(4) 581–587.
Appendix
EC.1. Technical Proofs for Section 6
EC.1.1. Proofs in Step 1 in Section 6
Proof of Theorem 3. Without loss of generality, we can let $a=0$ (by replacing $f(x)$ with $f(x+a)$, and $h(x)$ with $h(x+a)$ respectively). Hence (1) becomes
\[
\begin{array}{ll}
\max & \int_0^\infty h(x)f(x)\,dx\\
\text{subject to} & \int_0^\infty f(x)\,dx=\beta\\
& f(0)=\eta\\
& f_+'(0)\ge-\nu\\
& f \text{ convex for } x\ge 0\\
& f(x)\ge 0 \text{ for } x\ge 0
\end{array}
\tag{EC.1}
\]
This can be rewritten as
\[
\begin{array}{ll}
\max & \int_0^\infty h(x)f(x)\,dx\\
\text{subject to} & \int_0^\infty f(x)\,dx=\beta\\
& f(0)=\eta\\
& -\nu\le f'(x)\le 0 \text{ a.e. for } x\ge 0\\
& f' \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.2}
\]
Here we have translated the convexity of $f$ into the constraint that $f'$ exists and is non-decreasing a.e. Note that we have removed the constraint $f(x)\ge 0$ in (EC.1) since it is redundant. To see this, suppose that $f(x_0)<0$ for some $x_0>0$. If $f(x)$ has non-positive slope for all $x\ge x_0$, then $\int_0^\infty f(x)\,dx=-\infty$. Else, if $f(x)$ has positive slope for some $x>x_0$, then $f(x)$ shoots to infinity because of convexity, and hence $\int_0^\infty f(x)\,dx=\infty$. Either case contradicts the constraint $\int_0^\infty f(x)\,dx=\beta<\infty$.

In (EC.2) we have also imposed the additional condition that $f'(x)\le 0$ for $x\ge 0$, because otherwise $f'(x_0)>0$ for some $x_0\ge 0$, and by convexity $f'(x)>f'(x_0)$ a.e. for $x\ge x_0$, hence $\int_0^\infty f(x)\,dx=\infty$, again violating the constraint $\int_0^\infty f(x)\,dx=\beta<\infty$.
As a key step, we re-express the decision variable in (EC.2) as the derivative $f'$, which exists a.e. For convenience, we let $\bar H(x)=\int_0^x h(u+a)\,du$ and $H(x)=\int_0^x \bar H(u)\,du$. This definition of $H$ is obviously consistent with the definition in (2). First consider the objective function; using integration by parts,
\[
\int_0^\infty h(x)f(x)\,dx=\bar H(x)f(x)\Big|_0^\infty-\int_0^\infty \bar H(x)f'(x)\,dx=-\int_0^\infty \bar H(x)f'(x)\,dx
=-H(x)f'(x)\Big|_0^\infty+\int_0^\infty H(x)\,df'(x)=\int_0^\infty H(x)\,df'(x)
\tag{EC.3}
\]
where the second and the fourth equalities follow from a simple observation in Lemma EC.9 (in Section EC.4.1), and the fact that $\bar H(x)$ grows at most linearly and $H(x)$ at most quadratically as $x\to\infty$ by the boundedness of $h$.
For the constraints in (EC.2), we can write
\[
\int_0^\infty f(x)\,dx=\int_0^\infty \frac{x^2}{2}\,df'(x)
\tag{EC.4}
\]
by merely viewing $h\equiv 1$ in (EC.3). Also, we can use integration by parts again to write
\[
f(0)=-\int_0^\infty f'(x)\,dx=-xf'(x)\Big|_0^\infty+\int_0^\infty x\,df'(x)=\int_0^\infty x\,df'(x)
\tag{EC.5}
\]
where the third equality follows from Lemma EC.9 again. Therefore, (EC.2) can be written as
\[
\begin{array}{ll}
\max & \int_0^\infty H(x)\,df'(x)\\
\text{subject to} & \int_0^\infty \frac{x^2}{2}\,df'(x)=\beta\\
& \int_0^\infty x\,df'(x)=\eta\\
& -\nu\le f'(x)\le 0 \text{ a.e. for } x\ge 0\\
& f' \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.6}
\]
Finally, let $p(x)=f'(x)/\nu+1$. Then (EC.6) can be rewritten as
\[
\begin{array}{ll}
\max & \nu\int_0^\infty H(x)\,dp(x)\\
\text{subject to} & \int_0^\infty x^2\,dp(x)=\frac{2\beta}{\nu}\\
& \int_0^\infty x\,dp(x)=\frac{\eta}{\nu}\\
& 0\le p(x)\le 1 \text{ a.e. for } x\ge 0\\
& p \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.7}
\]
In the formulation (EC.7), one can identify $p$ as a well-defined probability distribution function, and so (EC.7) is nothing but maximizing $E[H(X)]$ (scaled by the constant $\nu$) subject to the constraints that the first and second moments equal $\eta/\nu$ and $2\beta/\nu$ respectively. This concludes the result. $\square$
EC.1.2. Proofs in Step 2 in Section 6 and Theorems 1 and 2
Proof of Lemma 1. It follows from Jensen's inequality that, for any valid probability distribution $P$, $E[X^2]\ge E[X]^2$, which gives $\sigma\ge\mu^2$ in (9). On the other hand, if $\sigma\ge\mu^2$, it is also rudimentary to find a measure in $\mathcal P[0,\infty)$ with first moment $\mu$ and second moment $\sigma$, e.g., a probability measure with two suitably chosen support points. Substituting $\mu=\eta/\nu$ and $\sigma=2\beta/\nu$, we get $\eta^2\le 2\beta\nu$. Lastly, equality holds in $E[X^2]\ge E[X]^2$ if and only if $P$ is a point mass. This concludes the second part of the lemma. $\square$
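For the converse direction just used, here is a quick numerical illustration (a sketch; the values of $\mu$, $\sigma$, and the fixed support point are assumed): whenever $\sigma>\mu^2$, fixing any support point $x_1<\mu$ and solving the two moment equations yields a valid two-point measure.

```python
# A quick sketch (assumed mu, sigma, x1): a two-point measure on [0, inf)
# matching a given mean mu and second moment sigma > mu^2.
mu, sigma = 2.0, 5.0          # any pair with sigma > mu^2 works
x1 = 0.5                      # arbitrary support point below mu
x2 = (sigma - mu * x1) / (mu - x1)
p1 = (sigma - mu**2) / (sigma - 2 * mu * x1 + x1**2)
p2 = 1.0 - p1

assert abs(p1 * x1 + p2 * x2 - mu) < 1e-9           # first moment matches
assert abs(p1 * x1**2 + p2 * x2**2 - sigma) < 1e-9  # second moment matches
```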
The proof of Proposition 1 follows from a few lemmas below. For convenience, we denote Pn(D)
as the set of all finite support distributions on D with at most n support points. So, similar to
the notation Pn[0,∞), Pn[0, c] is the set of all finite support distributions on [0, c] with at most n
support points.
Lemma EC.1 (Adapted from Theorem 3.2 in Winkler (1988)). The optimal value of
OPT (P[0, c]) is identical to that of OPT (P3[0, c]) for any c > 0. Similarly, the optimal value of
OPT (P[0,∞)) is identical to that of OPT (P3[0,∞)).
Lemma EC.1 implies that it suffices to focus on OPT($\mathcal P_3[0,\infty)$) in order to solve (9). For further convenience, we write $Z(P)=E[H(X)]$ for the objective function of (9) or OPT($\mathcal P_3[0,\infty)$), and we let $Z^*$ be the optimal value of (9) or OPT($\mathcal P_3[0,\infty)$). For any $P\in\mathcal P_n(D)$ and $D$, we can represent $P$ in terms of the support points $(x_1,\ldots,x_n)\in D^n$ (possibly some being identical) and masses $(p_1,\ldots,p_n)\in S_n$, and we use the shorthand notation $P\sim(x_1,\ldots,x_n,p_1,\ldots,p_n)$ for this representation.
Lemma EC.2. Consider OPT($\mathcal P_3[0,\infty)$) that is consistent. The optimal value is either achieved at some $P^*\in\mathcal P_3[0,\infty)$, or there exists a sequence of feasible $P^{(k)}\in\mathcal P_3[0,\infty)$ such that $Z(P^{(k)})$ converges to the optimal value. In the second case, each $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$ is such that either $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ for some $x_1^*,x_2^*\in[0,\infty)$ (possibly identical) and $(p_1^*,p_2^*)\in S_2$, or $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,\infty,\infty,1,0,0)$ for some $x_1^*\in[0,\infty)$.
Proof of Lemma EC.2. By the definition of optimality, clearly we have either the existence of an optimal solution lying in $\mathcal P_3[0,\infty)$, which is the first case in the lemma's statement, or there exists a feasible sequence $P^{(k)}\in\mathcal P_3[0,\infty)$ such that $Z(P^{(k)})\to Z^*$. Let $P^{(k)}$ be represented by the masses $(p_1^{(k)},p_2^{(k)},p_3^{(k)})\in S_3$ on the support points $(x_1^{(k)},x_2^{(k)},x_3^{(k)})\in[0,\infty)^3$ (possibly with some support points identical). Suppose that the $x_i^{(k)}$'s are all bounded above by a number, say $M$. Then, since $[0,M]^3\times S_3$ is a compact set, by the Bolzano–Weierstrass theorem we must have a subsequence of $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$, say $(x_1^{(k_j)},x_2^{(k_j)},x_3^{(k_j)},p_1^{(k_j)},p_2^{(k_j)},p_3^{(k_j)})$, that converges to $(x_1^*,x_2^*,x_3^*,p_1^*,p_2^*,p_3^*)$ in $[0,M]^3\times S_3$. Since $H(x)$ is continuous by construction, we have $Z(P^{(k_j)})=\sum_{i=1}^3 H(x_i^{(k_j)})p_i^{(k_j)}\to\sum_{i=1}^3 H(x_i^*)p_i^*=Z(P^*)$, where $P^*$ is represented by $(x_1^*,x_2^*,x_3^*,p_1^*,p_2^*,p_3^*)$. As $Z(P^{(k_j)})$ is a subsequence of $Z(P^{(k)})$, $Z(P^*)$ must equal $Z^*$, and so $P^*$ is an optimal solution, which reduces to the first case in the lemma. Therefore, for the second case, we should focus on the scenario that at least one $x_i^{(k)}$ satisfies $\limsup_{k\to\infty}x_i^{(k)}=\infty$.

Without loss of generality, we fix the convention that $x_1^{(k)}\le x_2^{(k)}\le x_3^{(k)}$. If at least one of the $x_i^{(k)}$ satisfies $\limsup_{k\to\infty}x_i^{(k)}=\infty$, we must have $\limsup_{k\to\infty}x_3^{(k)}=\infty$. Note that in order for $P^{(k)}$ to be feasible, $E^{(k)}[X]=\mu$ holds and so $x_1^{(k)}\le\mu$ for all $k$. We now distinguish two cases: either $x_2^{(k)}$ is uniformly bounded, say by a large number $M\ge\mu$, or $\limsup_{k\to\infty}x_2^{(k)}=\infty$ also. Consider the first case. First, we find a subsequence, indexed by $k_j$, such that $x_3^{(k_j)}\nearrow\infty$. Since $(x_1^{(k_j)},x_2^{(k_j)})\in[0,M]^2$, which is compact, we can choose a subsequence $k_{j'}$ such that $(x_1^{(k_{j'})},x_2^{(k_{j'})},x_3^{(k_{j'})})\to(x_1^*,x_2^*,\infty)$ where $(x_1^*,x_2^*)\in[0,M]^2$. Now, since $(p_1^{(k_{j'})},p_2^{(k_{j'})},p_3^{(k_{j'})})\in S_3$, which is also compact, we can choose a further subsequence $k_{j''}$ such that $(p_1^{(k_{j''})},p_2^{(k_{j''})},p_3^{(k_{j''})})\to(p_1^*,p_2^*,p_3^*)\in S_3$. Note that by the constraint $E^{(k_{j''})}[X^2]=p_1^{(k_{j''})}(x_1^{(k_{j''})})^2+p_2^{(k_{j''})}(x_2^{(k_{j''})})^2+p_3^{(k_{j''})}(x_3^{(k_{j''})})^2=\sigma$, we must have $p_3^{(k_{j''})}=(\sigma-p_1^{(k_{j''})}(x_1^{(k_{j''})})^2-p_2^{(k_{j''})}(x_2^{(k_{j''})})^2)/(x_3^{(k_{j''})})^2\le\sigma/(x_3^{(k_{j''})})^2\to 0$. In conclusion, in this case we end up being able to find a sequence of measures $P^{(k')}\sim(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})$ with $(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ where $x_1^*,x_2^*\in[0,\infty)$ and $(p_1^*,p_2^*)\in S_2$.
For the second case, namely when $\limsup_{k\to\infty}x_i^{(k)}=\infty$ for both $i=2$ and $3$, we can argue similarly that there is a sequence of measures $P^{(k')}\sim(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})$ such that $x_2^{(k')},x_3^{(k')}\to\infty$ and $p_2^{(k')},p_3^{(k')}\to 0$. In other words, $(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})\to(x_1^*,\infty,\infty,1,0,0)$ where $x_1^*\in[0,\infty)$. $\square$
Next, we borrow the following result:
Lemma EC.3 (Adapted from Theorem 5.1 in Birge and Dula (1991)). Consider OPT($\mathcal P[0,c]$) for any $c<\infty$. Suppose $H$ is convex with derivative $H'$ convex on $(0,\tilde c)$ and concave on $(\tilde c,c)$ for some $0\le\tilde c\le c$. Then an optimal solution exists and lies in $\mathcal P_2[0,c]$.
Lemma EC.4. Under Assumption 2, if OPT (P3[0,∞)) is solvable, then there is an optimal solu-
tion in P2[0,∞).
Proof of Lemma EC.4. Suppose there is an optimal solution P∗ to OPT (P3[0,∞)). Clearly, P∗
must have bounded support, say covered by [0,M ] for some large M . It follows that P∗ is also
the optimal solution to the restricted program OPT (P3[0,M ]). Now, by Lemma EC.1, the optimal
value of OPT (P3[0,M ]) is the same as that of OPT (P[0,M ]). In other words, P∗ is an optimal
solution to OPT (P[0,M ]).
By Lemma EC.3, OPT (P[0,M ]) has an optimal solution P∗∗ ∈ P2[0,M ]. We then must have
Z(P∗∗) =Z(P∗), so that P∗∗ is optimal for OPT (P3[0,∞)). This concludes the lemma.
Proof of Proposition 1. The second statement in Lemma EC.1 immediately implies the first
statement in Proposition 1. Then Lemma EC.4 implies the second statement in Proposition 1.
Proof of Theorem 1. Convert the original optimization (1) into (9) by Theorem 3. The optimal solution of (9) is characterized by Proposition 1. The result follows by noting that any $P\in\mathcal P_3[0,\infty)$ represented by $(x_1,x_2,x_3,p_1,p_2,p_3)$ (with potentially identical $x_i$'s) translates, through the one-to-one correspondence $f'(x+a)=\nu(p(x)-1)$ in Theorem 3, into
\[
f'(x)=\begin{cases}
-\nu & \text{for } a<x<x_1+a\\
-\nu(1-p_1) & \text{for } x_1+a<x<x_2+a\\
-\nu(1-p_1-p_2) & \text{for } x_2+a<x<x_3+a\\
0 & \text{for } x>x_3+a
\end{cases}
\]
and hence
\[
f(x)=\begin{cases}
\eta-\nu(x-a) & \text{for } a\le x\le x_1+a\\
\eta-\nu x_1-\nu(1-p_1)(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
\eta-\nu x_1-\nu(1-p_1)(x_2-x_1)-\nu(1-p_1-p_2)(x-a-x_2) & \text{for } x_2+a\le x\le x_3+a\\
0 & \text{for } x\ge x_3+a
\end{cases}
\]
$\square$
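The piecewise-linear form above is easy to instantiate. The sketch below (support points and masses are assumed for illustration) builds such a density; note that continuity at $x_3+a$ forces $\eta=\nu(p_1x_1+p_2x_2+p_3x_3)$, i.e., $\eta/\nu$ must equal the first moment.

```python
# A sketch of the Theorem 1 worst-case density (assumed support/masses):
# piecewise linear with kinks at x1+a, x2+a, x3+a, hitting 0 at x3+a.
import numpy as np

a, nu = 1.0, 0.5
x1, x2, x3 = 0.4, 1.5, 3.0
p1, p2, p3 = 0.3, 0.5, 0.2
eta = nu * (p1 * x1 + p2 * x2 + p3 * x3)   # continuity: eta/nu = E[X]

def f(x):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a, x <= x3 + a],
        [eta - nu * (x - a),
         eta - nu * x1 - nu * (1 - p1) * (x - a - x1),
         eta - nu * x1 - nu * (1 - p1) * (x2 - x1)
             - nu * (1 - p1 - p2) * (x - a - x2)],
        default=0.0)

print(f(np.linspace(a, x3 + a, 7)))  # non-negative, non-increasing, convex
```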
Proof of Theorem 2. Assume w.l.o.g. that $h$ is non-negative. Then Assumption 1 is equivalent to Assumption 2. Similar to the proof of Theorem 1, the result here follows by noting that any $P\in\mathcal P_2[0,\infty)$ represented by $(x_1,x_2,p_1,p_2)$ (with potentially identical $x_i$'s) translates, through the one-to-one correspondence $f'(x+a)=\nu(p(x)-1)$ in Theorem 3, into
\[
f'(x)=\begin{cases}
-\nu & \text{for } a<x<x_1+a\\
-\nu p_2 & \text{for } x_1+a<x<x_2+a\\
0 & \text{for } x>x_2+a
\end{cases}
\]
and hence
\[
f(x)=\begin{cases}
\eta-\nu(x-a) & \text{for } a\le x\le x_1+a\\
\eta-\nu x_1-\nu p_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
0 & \text{for } x\ge x_2+a
\end{cases}
\]
$\square$
EC.1.3. Proofs in Step 3 in Section 6
Proof of Proposition 2. Suppose for now that (9) admits an optimal probability measure in $\mathcal P_2[0,\infty)$ represented by $(x_1,x_2,p_1,p_2)$ (where $x_1,x_2$ must be distinct since otherwise $\sigma=\mu^2$). Adopting a similar line of analysis as in Birge and Dula (1991), we let $x_1\le x_2$ without loss of generality. For a two-support-point distribution to be feasible, we must have $\mu^2<\sigma$ and $x_1<\mu$. The constraints enforce $p_1x_1+p_2x_2=\mu$, $p_1x_1^2+p_2x_2^2=\sigma$ and $p_1+p_2=1$. Hence $p_2=1-p_1$, which gives $p_1x_1+(1-p_1)x_2=\mu$ and $p_1x_1^2+(1-p_1)x_2^2=\sigma$. From the first equation we get $p_1=(x_2-\mu)/(x_2-x_1)$. Putting this into $p_1x_1^2+(1-p_1)x_2^2=\sigma$, we further get $x_2=(\sigma-\mu x_1)/(\mu-x_1)$. Now, putting this in turn into $p_1=(x_2-\mu)/(x_2-x_1)$, we obtain $p_1=(\sigma-\mu^2)/(\sigma-2\mu x_1+x_1^2)$ and hence $p_2=1-p_1=(\mu-x_1)^2/(\sigma-2\mu x_1+x_1^2)$. Therefore, the objective value of (9) is given by
\[
\max_{x_1\in[0,\mu)} p_1H(x_1)+p_2H(x_2)=\max_{x_1\in[0,\mu)} \frac{\sigma-\mu^2}{\sigma-2\mu x_1+x_1^2}H(x_1)+\frac{(\mu-x_1)^2}{\sigma-2\mu x_1+x_1^2}H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)
\tag{EC.8}
\]
which is exactly $\max_{x_1\in[0,\mu)}W(x_1)$ in (3).
Let $W^*$ be the optimal value of (3) or (EC.8). We distinguish two cases: either $W^*=\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for some $\epsilon>0$, or not. We will argue that these correspond exactly to the two cases in the proposition.

Consider the first case, i.e., $W^*=\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for some $\epsilon>0$. Since $[0,\mu-\epsilon]$ is compact, there exists an optimal solution $x_1^*\in[0,\mu-\epsilon]$ for $\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$, which is also an optimal solution for (3). Clearly (3) is then solvable.

We now argue that this implies (9) is solvable and has an optimal solution in $\mathcal P_2[0,\infty)$. First, note that by Lemma EC.4, the program OPT($\mathcal P[0,c]$) for any $c<\infty$ (sufficiently large so that the program is consistent) must admit an optimal solution in $\mathcal P_2[0,\infty)$. This optimal solution can be found by solving (3), but defining $H(x)=-\infty$ for $x>c$. Now, the optimal solution $x_1^*$ for (3), together with its corresponding values for $x_2^*,p_1^*,p_2^*$ given in (10), then clearly gives an optimal solution for OPT($\mathcal P[0,c]$) for any sufficiently large $c<\infty$.

Then we argue that this must imply that $P^*\sim(x_1^*,x_2^*,p_1^*,p_2^*)$ as given by (3) and (10) is an optimal solution for (9). Suppose the latter does not hold; then there must exist some $P'\in\mathcal P_3[0,\infty)$ with $Z(P')>Z(P^*)$, where $Z(\cdot)$ is the objective function of (9). Let the support points of $P'$ be $(x_1',x_2',x_3')$. This implies that $P^*$ is not an optimal solution for OPT($\mathcal P_3[0,\max\{x_1',x_2',x_3'\}]$), or equivalently OPT($\mathcal P[0,\max\{x_1',x_2',x_3'\}]$). This leads to a contradiction.
For the second case, we have $W^*>\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for any $\epsilon>0$. The only scenario in which this can occur is that there exists a sequence $x_1^{(k)}$ such that $x_1^{(k)}\to\mu$ and $W^*=\lim_{k\to\infty}W(x_1^{(k)})$ (this sequence can be found, for example, by picking an $x_1^{(k)}$ in $[\mu-\epsilon_k,\mu)$ such that $W(x_1^{(k)})>\max_{x_1\in[0,\mu-\epsilon_k]}W(x_1)$ as $\epsilon_k\searrow 0$). In this case, there does not exist an optimal solution $P^*\in\mathcal P_2[0,\infty)$ for (9), and so (9) does not have any optimal solution by Lemma EC.4. By Proposition 1, we must have a sequence of probability measures $P^{(k)}\in\mathcal P_3[0,\infty)$ that approaches the optimal value of (9). Finally, by Lemma EC.2 this sequence can be identified as $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$, such that either $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ for some $x_1^*,x_2^*\in[0,\infty)$ (possibly identical) and $(p_1^*,p_2^*)\in S_2$, or $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,\infty,\infty,1,0,0)$ for some $x_1^*\in[0,\infty)$. $\square$
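To make the dichotomy concrete, the following sketch (the cost $h(x)=1-e^{-x}$ and the moment values are assumed for illustration, with $a=0$, so that $H(x)=x^2/2-x+1-e^{-x}$ and $\lambda=1/2$) scans $W(x_1)$ over a fine grid on $[0,\mu)$; a maximizer in the interior corresponds to the first case, while a supremum approached only as $x_1\to\mu$ signals the second case.

```python
# A sketch (assumed h, mu, sigma; grid search stands in for the
# one-dimensional nonlinear program (3)).
import numpy as np

mu, sigma = 2.0, 5.0                      # assumed moments with sigma > mu^2

def H(x):                                 # H for h(x) = 1 - exp(-x), a = 0
    return x**2 / 2 - x + 1 - np.exp(-x)

def W(x1):
    x2 = (sigma - mu * x1) / (mu - x1)
    p1 = (sigma - mu**2) / (sigma - 2 * mu * x1 + x1**2)
    return p1 * H(x1) + (1 - p1) * H(x2)

grid = np.linspace(0.0, mu - 1e-6, 100000)
vals = W(grid)
print(grid[np.argmax(vals)], vals.max())  # candidate maximizer/value of (3)
```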
EC.1.4. Proofs in Step 4 in Section 6
Propositions 3 and 4 are closely related and we shall prove them together.
Proof of Propositions 3 and 4. By Proposition 2, if (3) (and subsequently (9) or
OPT (P3[0,∞))) is not solvable, then there must be a feasible sequence P(k) ∈ P3[0,∞) that
approaches optimality, which can be denoted by P(k) ∼ (x(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 ). There are
two cases: either (x(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 )→ (x∗1, x
∗2,∞, p∗1, p∗2,0) for some x∗1, x
∗2 ∈ [0,∞) and
(p∗1, p∗2) ∈ S2 (with x∗1, x
∗2 possibly identical), or (x
(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 )→ (x∗1,∞,∞,1,0,0)
for some x∗1 ∈ [0,∞). We shall characterize the form of the limiting objective value and feasible
region for both cases.
Consider the first case, i.e., $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to x_2^*$ and $x_3^{(k)}\to\infty$. By the second constraint in (9) we must have $p_3^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2)/(x_3^{(k)})^2$, so that
\[
\lim_{k\to\infty}p_3^{(k)}x_3^{(k)}=\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{x_3^{(k)}}=0
\]
Therefore, the first constraint in (9) entails that $p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}+p_3^{(k)}x_3^{(k)}\to p_1^*x_1^*+p_2^*x_2^*=\mu$ as $k\to\infty$. Then, the second constraint in turn entails that $p_3^{(k)}(x_3^{(k)})^2=\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2\to\sigma-p_1^*(x_1^*)^2-p_2^*(x_2^*)^2\ge 0$, and so $p_1^*(x_1^*)^2+p_2^*(x_2^*)^2\le\sigma$.
Note that the objective value $Z(P^{(k)})$ for (9) satisfies
\[
\lim_{k\to\infty}Z(P^{(k)})=\lim_{k\to\infty}p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
=p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{(x_3^{(k)})^2}H(x_3^{(k)})
\le p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lambda\big(\sigma-p_1^*(x_1^*)^2-p_2^*(x_2^*)^2\big)
\]
by the continuity of $H$ and the definition of $\lambda$. This implies that the program
\[
\begin{array}{ll}
\max_{x_1,x_2,p_1,p_2} & p_1H(x_1)+p_2H(x_2)+\lambda(\sigma-p_1x_1^2-p_2x_2^2)\\
\text{subject to} & p_1x_1+p_2x_2=\mu\\
& p_1x_1^2+p_2x_2^2\le\sigma\\
& (x_1,x_2)\in[0,\infty)^2,\ (p_1,p_2)\in S_2
\end{array}
\tag{EC.9}
\]
provides an upper bound to the optimal value of (9). Note that the objective function in (EC.9) can be written as $p_1(H(x_1)-\lambda x_1^2)+p_2(H(x_2)-\lambda x_2^2)+\lambda\sigma=p_1r(x_1)+p_2r(x_2)+\lambda\sigma$, where $r(x)$ is defined by $r(x)=H(x)-\lambda x^2$. Hence (EC.9) is equivalent to
\[
\begin{array}{ll}
\max & E[r(X)]+\lambda\sigma\\
\text{subject to} & E[X]=\mu\\
& E[X^2]\le\sigma\\
& P\in\mathcal P_2[0,\infty)
\end{array}
\tag{EC.10}
\]
Now consider the second case, namely when $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to\infty$ and $x_3^{(k)}\to\infty$. Similar to the analysis above, from the second constraint in (9), we must have $p_2^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_3^{(k)}(x_3^{(k)})^2)/(x_2^{(k)})^2$ and $p_3^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2)/(x_3^{(k)})^2$, so that $\lim_{k\to\infty}p_2^{(k)}x_2^{(k)}=\lim_{k\to\infty}p_3^{(k)}x_3^{(k)}=0$. Note that, from the second constraint in (9), we have $p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2=\sigma-p_1^{(k)}(x_1^{(k)})^2\to\sigma-p_1^*(x_1^*)^2$. For the objective value of (9), we have
\[
\lim_{k\to\infty}Z(P^{(k)})=\lim_{k\to\infty}p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
\]
\[
=p_1^*H(x_1^*)+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_3^{(k)}(x_3^{(k)})^2}{(x_2^{(k)})^2}H(x_2^{(k)})+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{(x_3^{(k)})^2}H(x_3^{(k)})
\]
\[
\le p_1^*H(x_1^*)+\lambda\Big(2\sigma-2p_1^*(x_1^*)^2-\lim_{k\to\infty}\big(p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2\big)\Big)
=p_1^*H(x_1^*)+\lambda\big(2\sigma-2p_1^*(x_1^*)^2-(\sigma-p_1^*(x_1^*)^2)\big)
=p_1^*H(x_1^*)+\lambda(\sigma-p_1^*(x_1^*)^2)
\tag{EC.11}
\]
Note that $p_1^*$ must be $1$ since $p_2^{(k)},p_3^{(k)}\to 0$, and from the first constraint in (9) we must have $x_1^*=\mu$. Thus (EC.11) reduces to $H(\mu)+\lambda(\sigma-\mu^2)$, which is achieved by plugging the delta measure at $\mu$ into the program (EC.9). Hence it suffices to consider (EC.9) or (EC.10) in the sequel.
We shall show that (EC.10) not only provides an upper bound to (9) but actually matches its optimal value. For this we shall explicitly construct a sequence of $P^{(k)}\in\mathcal P_3[0,\infty)$, feasible for (9), whose objective value converges to the optimal value of (EC.10). For this purpose, we shall first solve (EC.10).
Suppose that an optimal solution exists for (EC.10). This optimal solution has either two-point or one-point support. In the first case, this optimal solution can be obtained by putting
\[
\rho^*=p_1^*(x_1^*)^2+p_2^*(x_2^*)^2,\quad x_2^*=\frac{\rho^*-\mu x_1^*}{\mu-x_1^*},\quad p_1^*=\frac{\rho^*-\mu^2}{\rho^*-2\mu x_1^*+(x_1^*)^2},\quad p_2^*=\frac{(\mu-x_1^*)^2}{\rho^*-2\mu x_1^*+(x_1^*)^2}
\tag{EC.12}
\]
and searching for $x_1^*\in[0,\mu)$ and $\rho^*\in[\mu^2,\sigma]$ in (4). This can be seen by a similar argument leading to the optimization (3), but with $\rho$ also varying here. Now, in the case when the optimal solution has one-point support, it must be the point mass at $\mu$, which gives an objective value $r(\mu)+\lambda\sigma$. There is no optimal solution to (4) in this scenario; i.e., there exists only a sequence $(x_1^{(k)},\rho^{(k)})$ such that $x_1^{(k)}\to\mu$, $\rho^{(k)}\to\mu^2$, and $V(x_1^{(k)},\rho^{(k)})\nearrow V^*$, where $V^*$ is the optimal value of (4).
Next, consider also the situation where there does not exist an optimal solution for (EC.10). By a similar argument as that in Lemma EC.2, we must have a feasible sequence $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},p_1^{(k)},p_2^{(k)})\in\mathcal P_2[0,\infty)$ that converges to optimality and such that $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to\infty$, $p_1^{(k)}\to 1$ and $p_2^{(k)}\le(\sigma-p_1^{(k)}(x_1^{(k)})^2)/(x_2^{(k)})^2\to 0$. Hence $x_1^*=\mu$. Moreover, since $\limsup_{x\to\infty}r(x)/x^2=\limsup_{x\to\infty}H(x)/x^2-\lambda=0$ by construction, the objective value of (EC.10) will be $p_1^{(k)}r(x_1^{(k)})+p_2^{(k)}r(x_2^{(k)})+\lambda\sigma\le p_1^{(k)}r(x_1^{(k)})+\big((\sigma-p_1^{(k)}(x_1^{(k)})^2)/(x_2^{(k)})^2\big)r(x_2^{(k)})+\lambda\sigma\to r(\mu)+\lambda\sigma$, which is equal to the objective value of (EC.10) attained by the delta measure at $\mu$. This implies a contradiction, and so (EC.10) must be solvable.
In Appendix EC.4.3, we shall show that in the case where (4) is solvable, and hence there is an optimal solution of (EC.10) that has two-point support, the sequence defined by $x_1^{(k)}=x_1^*$, $x_2^{(k)}=x_2^*-\delta^{(k)}$, $x_3^{(k)}\to\infty$ such that $H(x_3^{(k)})/(x_3^{(k)})^2\to\lambda$ as $k\to\infty$, $p_1^{(k)}=p_1^*$, $p_2^{(k)}=p_2^*-\gamma^{(k)}$ and $p_3^{(k)}=\gamma^{(k)}$, where $\delta^{(k)}$ and $\gamma^{(k)}$ are defined in (13), is feasible for (9) and has objective value converging to the optimal value of (EC.10). On the other hand, if (4) is not solvable, and hence the optimal solution of (EC.10) is a delta measure at $\mu$, the sequence defined by $x_1^{(k)}=\mu-\delta^{(k)}$, $x_2^{(k)}\to\infty$ such that $H(x_2^{(k)})/(x_2^{(k)})^2\to\lambda$ as $k\to\infty$, $p_1^{(k)}=1-\gamma^{(k)}$ and $p_2^{(k)}=\gamma^{(k)}$, where $\delta^{(k)}$ and $\gamma^{(k)}$ are defined in (15), is feasible for (9) and has objective value converging to the optimal value of (EC.10), namely $r(\mu)+\lambda\sigma$. $\square$
EC.2. Procedure for Solving (16)
For convenience, recall $H(x)=\int_0^x\int_0^u h(v+a)\,dv\,du$ and $\lambda=\limsup_{x\to\infty}H(x)/x^2<\infty$, and introduce the additional notation:
\[
\mathcal W(x,\omega,\rho)=\frac{\rho-\omega^2}{\rho-2\omega x+x^2}H(x)+\frac{(\omega-x)^2}{\rho-2\omega x+x^2}H\!\left(\frac{\rho-\omega x}{\omega-x}\right)
\]
\[
\mathcal V(x,\omega,\rho,\kappa)=\frac{\rho-\omega^2}{\rho-2\omega x+x^2}\big(H(x)-\lambda x^2\big)+\frac{(\omega-x)^2}{\rho-2\omega x+x^2}\left(H\!\left(\frac{\rho-\omega x}{\omega-x}\right)-\lambda\left(\frac{\rho-\omega x}{\omega-x}\right)^2\right)+\lambda\kappa
\]
\[
K_2(x;x_1,x_2,s_0,s_1,s_2)=\begin{cases}
s_0-s_1(x-a) & \text{for } a\le x\le x_1+a\\
s_0-s_1x_1-s_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
0 & \text{for } x\ge x_2+a
\end{cases}
\]
\[
K_3(x;x_1,x_2,x_3,s_0,s_1,s_2,s_3)=\begin{cases}
s_0-s_1(x-a) & \text{for } a\le x\le x_1+a\\
s_0-s_1x_1-s_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
s_0-s_1x_1-s_2(x_2-x_1)-s_3(x-a-x_2) & \text{for } x_2+a\le x\le x_3+a\\
0 & \text{for } x\ge x_3+a
\end{cases}
\]
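These four functions translate directly into code. The sketch below (a transcription for illustration only; $H$, $\lambda$, and $a$ are assumed user-supplied inputs, and the function names are ours) is reused by the driver shown after Algorithm 2.

```python
# Sketch implementations of the helper notation above (H, lam, a are
# assumed user-supplied; np.select encodes the piecewise-linear cases).
import numpy as np

def W_fun(x, omega, rho, H):
    d = rho - 2 * omega * x + x**2
    y = (rho - omega * x) / (omega - x)
    return (rho - omega**2) / d * H(x) + (omega - x)**2 / d * H(y)

def V_fun(x, omega, rho, kappa, H, lam):
    d = rho - 2 * omega * x + x**2
    y = (rho - omega * x) / (omega - x)
    return ((rho - omega**2) / d * (H(x) - lam * x**2)
            + (omega - x)**2 / d * (H(y) - lam * y**2) + lam * kappa)

def K2(x, x1, x2, s0, s1, s2, a):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a],
        [s0 - s1 * (x - a), s0 - s1 * x1 - s2 * (x - a - x1)],
        default=0.0)

def K3(x, x1, x2, x3, s0, s1, s2, s3, a):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a, x <= x3 + a],
        [s0 - s1 * (x - a),
         s0 - s1 * x1 - s2 * (x - a - x1),
         s0 - s1 * x1 - s2 * (x2 - x1) - s3 * (x - a - x2)],
        default=0.0)
```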
Algorithm 2: Procedure for Solving (16)

Inputs:
1. The cost function $h$ that satisfies Assumption 1 and is bounded and non-negative.
2. The parameters $\underline\beta,\bar\beta,\underline\eta,\bar\eta,\nu>0$.

Procedure:
Let
\[
\underline\mu=\frac{\underline\eta}{\nu},\quad \bar\mu=\frac{\bar\eta}{\nu},\quad \underline\sigma=\frac{2\underline\beta}{\nu},\quad \bar\sigma=\frac{2\bar\beta}{\nu}
\]
1. If $\underline\mu^2>\bar\sigma$, there is no feasible density.
2. If $\underline\mu^2=\bar\sigma$, then:
- Optimal value: $\nu H(\underline\mu)$
- There is only one feasible density, given by $f(x)=\underline\eta-\nu(x-a)$.
3. Otherwise run Subprograms 1 and 2 below. Record the optimal value and density/sequence of densities for each of them as a "candidate" for the optimum. The final optimal value for (16) is $\max\{Z_1^*,Z_2^*\}$, where $Z_1^*$ and $Z_2^*$ are the optimal values for Subprograms 1 and 2 respectively. A final optimal density/sequence of densities is given by the corresponding candidate maximizer.
Subprogram 1: Proceed only if $\bar\mu^2<\bar\sigma$. Consider
\[
\max_{x_1\in[0,\bar\mu),\ \rho\in[\underline\sigma\vee\bar\mu^2,\ \bar\sigma]}\mathcal W(x_1,\bar\mu,\rho)
\tag{EC.13}
\]
Either one of the following two cases occurs:

Case 1: There is an optimal solution for (EC.13), given by $(x_1^*,\rho^*)\in[0,\bar\mu)\times[\underline\sigma\vee\bar\mu^2,\bar\sigma]$. Then:
- Candidate optimal value: $\nu\mathcal W(x_1^*,\bar\mu,\rho^*)$
- Candidate optimal density: $f(x)=K_2(x;x_1^*,x_2^*,\bar\eta,\nu,\nu p_2^*)$ where
\[
x_2^*=\frac{\rho^*-\bar\mu x_1^*}{\bar\mu-x_1^*},\quad p_2^*=\frac{(\bar\mu-x_1^*)^2}{\rho^*-2\bar\mu x_1^*+(x_1^*)^2}
\]

Case 2: There does not exist an optimal solution for (EC.13), i.e., $\mathcal W^*>\mathcal W(x_1,\bar\mu,\rho)$ for any $x_1\in[0,\bar\mu)$, $\rho\in[\underline\sigma\vee\bar\mu^2,\bar\sigma]$, where $\mathcal W^*$ is the optimal value of (EC.13). Then there exists a sequence of feasible solutions $(x_1^{(k)},\rho^{(k)})$, with $x_1^{(k)}\to\bar\mu$, such that $\mathcal W(x_1^{(k)},\bar\mu,\rho^{(k)})\to\mathcal W^*$.

Consider further the optimization
\[
\max_{x_1\in[0,\bar\mu),\ \rho\in[\bar\mu^2,\bar\sigma]}\mathcal V(x_1,\bar\mu,\rho,\bar\sigma)
\tag{EC.14}
\]
One of the following two sub-cases occurs:

Case 2.1: There exists an optimal solution for (EC.14), given by $(x_1^*,\rho^*)\in[0,\bar\mu)\times[\bar\mu^2,\bar\sigma]$. Then:
- Candidate optimal value: $\nu\mathcal V(x_1^*,\bar\mu,\rho^*,\bar\sigma)$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_3(x;x_1^*,x_2^*-\delta^{(k)},x_3^{(k)},\bar\eta,\nu,\nu p_2^*,\nu\gamma^{(k)})$ where
\[
x_2^*=\frac{\rho^*-\bar\mu x_1^*}{\bar\mu-x_1^*},\quad p_2^*=\frac{(\bar\mu-x_1^*)^2}{\rho^*-2\bar\mu x_1^*+(x_1^*)^2}
\]
and
\[
x_3^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)},\quad \gamma^{(k)}=\frac{(\bar\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\bar\sigma-\rho^*)}
\]

Case 2.2: There does not exist an optimal solution for (EC.14). This occurs when there is a sequence $(x_1^{(k)},\rho^{(k)})$, with $x_1^{(k)}\to\bar\mu$, such that $\mathcal V(x_1^{(k)},\bar\mu,\rho^{(k)},\bar\sigma)\nearrow\mathcal V^*$, where $\mathcal V^*$ is the optimal value of (EC.14), but $\mathcal V^*>\max_{x_1\in[0,\bar\mu-\epsilon],\ \rho\in[\bar\mu^2,\bar\sigma]}\mathcal V(x_1,\bar\mu,\rho,\bar\sigma)$ for any $\epsilon>0$. Then:
- Candidate optimal value: $\nu(H(\bar\mu)+\lambda(\bar\sigma-\bar\mu^2))$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_2(x;\bar\mu-\delta^{(k)},x_2^{(k)},\bar\eta,\nu,\nu\gamma^{(k)})$ where
\[
x_2^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\bar\mu^2}{x_2^{(k)}-\bar\mu},\quad \gamma^{(k)}=\frac{\bar\sigma-\bar\mu^2}{(x_2^{(k)})^2-2\bar\mu x_2^{(k)}+\bar\sigma}
\]
Subprogram 2: Consider
\[
\max_{x_1\in[0,\omega),\ \omega\in[\underline\mu,\ \bar\mu\wedge\sqrt{\bar\sigma}]}\mathcal W(x_1,\omega,\bar\sigma)
\tag{EC.15}
\]
Either one of the following two cases occurs:

Case 1: There is an optimal solution for (EC.15), given by $(x_1^*,\omega^*)\in[0,\bar\mu\wedge\sqrt{\bar\sigma})\times[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$ with $x_1^*<\omega^*$. Then:
- Candidate optimal value: $\nu\mathcal W(x_1^*,\omega^*,\bar\sigma)$
- Candidate optimal density: $f(x)=K_2(x;x_1^*,x_2^*,\nu\omega^*,\nu,\nu p_2^*)$ where
\[
x_2^*=\frac{\bar\sigma-\omega^*x_1^*}{\omega^*-x_1^*},\quad p_2^*=\frac{(\omega^*-x_1^*)^2}{\bar\sigma-2\omega^*x_1^*+(x_1^*)^2}
\]

Case 2: There does not exist an optimal solution for (EC.15), i.e., $\mathcal W^*>\mathcal W(x_1,\omega,\bar\sigma)$ for any $x_1\in[0,\omega)$, $\omega\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, where $\mathcal W^*$ is the optimal value of (EC.15). Then there exists a sequence of feasible solutions $(x_1^{(k)},\omega^{(k)})$, with $\omega^{(k)}\to\omega^*$ and $x_1^{(k)}\to\omega^*$ for some $\omega^*\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, such that $\mathcal W(x_1^{(k)},\omega^{(k)},\bar\sigma)\to\mathcal W^*$. One of the following three cases must happen:

Case 2.1: If $\omega^*=\bar\mu\wedge\sqrt{\bar\sigma}$, then:
- Candidate optimal value: $\nu H(\bar\mu\wedge\sqrt{\bar\sigma})$
- Candidate optimal density: $f(x)=\nu(\bar\mu\wedge\sqrt{\bar\sigma})-\nu(x-a)$ for $x\ge a$

Otherwise, consider further the optimization
\[
\max_{x_1\in[0,\omega),\ \omega\in[\underline\mu,\ \bar\mu\wedge\sqrt{\bar\sigma}],\ \rho\in[\omega^2,\bar\sigma]}\mathcal V(x_1,\omega,\rho,\bar\sigma)
\tag{EC.16}
\]
Either of the following two cases occurs:

Case 2.2: There exists an optimal solution for (EC.16), given by $(x_1^*,\omega^*,\rho^*)\in[0,\bar\mu\wedge\sqrt{\bar\sigma})\times[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]\times[\underline\mu^2,\bar\sigma]$ with $x_1^*<\omega^*$ and $\rho^*\ge(\omega^*)^2$. Then:
- Candidate optimal value: $\nu\mathcal V(x_1^*,\omega^*,\rho^*,\bar\sigma)$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_3(x;x_1^*,x_2^*-\delta^{(k)},x_3^{(k)},\nu\omega^*,\nu,\nu p_2^*,\nu\gamma^{(k)})$ where
\[
x_2^*=\frac{\rho^*-\omega^*x_1^*}{\omega^*-x_1^*},\quad p_2^*=\frac{(\omega^*-x_1^*)^2}{\rho^*-2\omega^*x_1^*+(x_1^*)^2}
\]
and
\[
x_3^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)},\quad \gamma^{(k)}=\frac{(\bar\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\bar\sigma-\rho^*)}
\]

Case 2.3: There does not exist an optimal solution for (EC.16). This occurs when there is a sequence $(x_1^{(k)},\omega^{(k)},\rho^{(k)})$, with $\omega^{(k)}\to\omega^*\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, $x_1^{(k)}\to\omega^*$, and $\rho^{(k)}\to\rho^*\in[(\omega^*)^2,\bar\sigma]$ such that $\mathcal V(x_1^{(k)},\omega^{(k)},\rho^{(k)},\bar\sigma)\nearrow\mathcal V^*$, the optimal value of (EC.16), but $\mathcal V^*>\max_{x_1\in[0,\omega-\epsilon],\ \omega\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}],\ \rho\in[\omega^2,\bar\sigma]}\mathcal V(x_1,\omega,\rho,\bar\sigma)$ for any $\epsilon>0$. Then:
- Candidate optimal value: $\nu(H(\omega^*)+\lambda(\bar\sigma-(\omega^*)^2))$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_2(x;\omega^*-\delta^{(k)},x_2^{(k)},\nu\omega^*,\nu,\nu\gamma^{(k)})$ where
\[
x_2^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-(\omega^*)^2}{x_2^{(k)}-\omega^*},\quad \gamma^{(k)}=\frac{\bar\sigma-(\omega^*)^2}{(x_2^{(k)})^2-2\omega^*x_2^{(k)}+\bar\sigma}
\]
The introduction of the two subprograms 1 and 2 comes from analyzing the possibility of each
inequality constraint being binding in the moment problem that is transformed from (16) (a step
that is analogous to Step 1 in Section 6). Further details are given in the next section.
EC.3. Technical Proofs for Section 7.1
The general framework for the proof of Theorem 4, and for why Algorithm 2 works, is similar to that for the abstract formulation in Section 4. Hence, for succinctness, we only focus on the new issues here. First, an almost verbatim repetition of the arguments for Theorem 3 leads to the following reduction for (16):

Theorem EC.1. Suppose $h$ is bounded. Denote $H(x)=\int_0^x\int_0^u h(v+a)\,dv\,du$. Then the optimal value of (16) is the same as
\[
\begin{array}{ll}
\max & \nu E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal P[0,\infty)
\end{array}
\tag{EC.17}
\]
where $\underline\mu=\underline\eta/\nu$, $\bar\mu=\bar\eta/\nu$, $\underline\sigma=2\underline\beta/\nu$, $\bar\sigma=2\bar\beta/\nu$. Here the decision variable is a probability distribution $P\in\mathcal P[0,\infty)$, where $\mathcal P[0,\infty)$ denotes the set of all probability measures on $[0,\infty)$, and $E[\cdot]$ is the corresponding expectation. Moreover, there is a one-to-one correspondence (up to measure zero) between the feasible solutions of (16) and (EC.17), given by $f'(x+a)=\nu(p(x)-1)$ a.e., where $f$ is a feasible solution of (16) and $p$ is a probability distribution function on $[0,\infty)$.
For convenience, we denote by OPT($\mathcal C$) the program
\[
\begin{array}{ll}
\max & \nu E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal C
\end{array}
\]
Therefore, (EC.17) can be written as OPT($\mathcal P[0,\infty)$). We also let OPT($\mathcal C;\omega,\rho$) be the program OPT($\mathcal C$) defined before, but with the equality constraints $E[X]=\omega$ and $E[X^2]=\rho$ in place of the moment ranges.
Lemma EC.5. The optimal value of OPT($\mathcal P[0,c]$) is identical to that of OPT($\mathcal P_3[0,c]$) for any $c>0$. Similarly, the optimal value of OPT($\mathcal P[0,\infty)$) is identical to that of OPT($\mathcal P_3[0,\infty)$).

Proof of Lemma EC.5. Since the inequality constraints in OPT($\mathcal C$) are not all linearly independent, the results in Winkler (1988) do not directly apply. Instead, we shall prove the current lemma by extending Lemma EC.1 via a simple observation. Consider OPT($\mathcal P_3[0,c]$). Note that it can be written as $\max_{\omega\in[\underline\mu,\bar\mu],\ \rho\in[\underline\sigma,\bar\sigma]}U^*(\omega,\rho)$, where $U^*(\omega,\rho)$ denotes the optimal value of OPT($\mathcal P[0,c];\omega,\rho$). Now suppose $Z^*$ is the optimal value of OPT($\mathcal P[0,c]$). Then there must exist a sequence $(\omega^{(k)},\rho^{(k)})$ such that $U^*(\omega^{(k)},\rho^{(k)})\to Z^*$. For each $(\omega^{(k)},\rho^{(k)})$, we already know from Lemma EC.1 that it suffices to look at the set $\mathcal P_3[0,c]$, and that there must be a sequence $P_{\omega^{(k)},\rho^{(k)}}\in\mathcal P_3[0,c]$ such that its objective value converges to $U^*(\omega^{(k)},\rho^{(k)})$. Combining the above, we can choose a sequence $P^{(k)}\in\mathcal P_3[0,c]$ such that its objective value converges to $Z^*$. This shows the first statement of the lemma. A similar argument holds for OPT($\mathcal P[0,\infty)$). $\square$
The following gives the rule to exclude trivial scenarios in the algorithm:
Lemma EC.6. The program OPT($\mathcal P_3[0,\infty)$) is consistent if and only if $\bar\sigma\ge\underline\mu^2$.

Proof of Lemma EC.6. Similar to Lemma 1 and hence skipped. $\square$
The following explains the origin of the two subprograms:
Lemma EC.7. Suppose $h$ is non-negative. The optimal value of OPT($\mathcal P_3[0,\infty)$) is given by $\max\{Z_1^*,Z_2^*\}$, where $Z_1^*$ is the optimal value of
\[
\begin{array}{ll}
\max & E[H(X)]\\
\text{subject to} & E[X]=\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal P_3[0,\infty)
\end{array}
\tag{EC.18}
\]
and $Z_2^*$ is the optimal value of
\[
\begin{array}{ll}
\max & E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& E[X^2]=\bar\sigma\\
& P\in\mathcal P_3[0,\infty)
\end{array}
\tag{EC.19}
\]
respectively. The corresponding optimal solution or sequence of feasible solutions that converges to
optimality for either (EC.18) or (EC.19) will be optimal for OPT (P3[0,∞)) as well.
Proof of Lemma EC.7. We argue that, to solve OPT($\mathcal P_3[0,\infty)$), it suffices to restrict attention to the feasible region $\{P\in\mathcal P_3[0,\infty):E[X]=\bar\mu,\ \underline\sigma\le E[X^2]\le\bar\sigma\}\cup\{P\in\mathcal P_3[0,\infty):\underline\mu\le E[X]\le\bar\mu,\ E[X^2]=\bar\sigma\}$. To explain, suppose the optimal value of OPT($\mathcal P_3[0,\infty)$) is not zero (otherwise, since $h$ is assumed non-negative, $H$ is also non-negative and the result is trivial). It suffices to look at $P\sim(x_1,x_2,x_3,p_1,p_2,p_3)\in\mathcal P_3[0,\infty)$ with at least one $x_i$ having $H(x_i)>0$ and $p_i>0$ (otherwise the objective value is zero, which is clearly not optimal). Now suppose $P$ satisfies $E[X]<\bar\mu$ and $E[X^2]<\bar\sigma$. We can increase the $x_i$'s so that $E[X]\le\bar\mu$ and $E[X^2]\le\bar\sigma$ remain satisfied, while the objective value is at least as large as before since $H(x)$ is non-decreasing. Hence any $P$ such that $E[X]<\bar\mu$ and $E[X^2]<\bar\sigma$ is dominated by, or equal in value to, a solution in $\{P\in\mathcal P_3[0,\infty):E[X]=\bar\mu,\ \underline\sigma\le E[X^2]\le\bar\sigma\}\cup\{P\in\mathcal P_3[0,\infty):\underline\mu\le E[X]\le\bar\mu,\ E[X^2]=\bar\sigma\}$. This proves the lemma. $\square$
Lemma EC.8. Under Assumption 2, if (EC.18) is solvable, then there is an optimal solution in $\mathcal P_2[0,\infty)$. The same statement holds for (EC.19).

Proof of Lemma EC.8. Let $P^*$ be an optimal solution for the solvable (EC.18). Let $\sigma^*=E^*[X^2]$. Then $P^*$ must also be an optimal solution for OPT($\mathcal P_3[0,\infty);\bar\mu,\sigma^*$). By Lemma EC.4, there must be a solution $P^{**}\in\mathcal P_2[0,\infty)$ that has the same objective value as $P^*$. A similar argument holds for (EC.19). This concludes the proof. $\square$
By Lemmas EC.5 and EC.7, it suffices to look at the two subprograms (EC.18) and (EC.19)
in order to solve (EC.17). Together with Lemma EC.8, this concludes Theorem 4 immediately.
Algorithm 2 then follows the same line of analysis as Steps 3 and 4 in Section 6, now applied to
the subprograms (EC.18) and (EC.19). The inequality constraints in each of these subprograms
lead to one extra dimension in the respective nonlinear programs in the algorithm, but other than
that the analysis is almost completely the same, and hence we skip the details here.
EC.4. Auxiliary Lemmas and Proofs
EC.4.1. A Lemma for the Proof of Theorem 3
Lemma EC.9. Suppose a probability density $f$ is convex for $x\ge a$ for some $a\in\mathbb R$. Then $xf(x)\to 0$ and $x^2f'(x)\to 0$ a.e. as $x\to\infty$.
Proof of Lemma EC.9. Note that a density $f$ that is convex in the tail must be non-increasing there, since otherwise $\int f\,dx=\infty$. We denote $g(x)=xf(x)-F(x)$, where $F$ is the distribution function of $f$. Consider, for $(a\vee 0)\le x_1\le x_2$,
\[
\begin{aligned}
g(x_2)-g(x_1)&=x_2f(x_2)-x_1f(x_1)-(F(x_2)-F(x_1))\\
&\le x_2f(x_2)-x_1f(x_1)-f(x_2)(x_2-x_1)\quad\text{since } f(x) \text{ is non-increasing}\\
&=x_1[f(x_2)-f(x_1)]\\
&\le 0\quad\text{again since } f \text{ is non-increasing}
\end{aligned}
\]
Therefore $g$ is non-increasing for large enough $x$, and since $xf(x)\ge 0$ and $0\le F(x)\le 1$, we have $g$ bounded from below. This implies that $g$ must converge to a limit, say $c$, as $x\to\infty$. In other words, $xf(x)-F(x)\to c$, and since $F(x)\to 1$, we have $xf(x)\to c+1$. There are two cases: $c+1>0$ or $c+1=0$. The first case implies that $xf(x)\ge\epsilon>0$ for some $\epsilon$ for all large enough $x$. This means $f(x)\ge\epsilon/x$ for all large enough $x$, and $\int f\,dx=\infty$, which is impossible. Hence $xf(x)$ must converge to $0$. This proves the first part of the lemma.
To prove the second part, observe first that $f'(x)$ must be non-decreasing a.e. for $x\ge a$ since $f(x)$ is convex, and also non-positive since otherwise $\int f(x)\,dx=\infty$. We now define $\bar g(x)=-x^2f'(x)+2\bar F(x)$, where $\bar F(x)=-\int_x^\infty tf'(t)\,dt$. For any $(a\vee 0)\le x_1\le x_2$,
\[
\begin{aligned}
\bar g(x_2)-\bar g(x_1)&=x_1^2f'(x_1)-x_2^2f'(x_2)+2\bar F(x_2)-2\bar F(x_1)\\
&\le x_1^2f'(x_1)-x_2^2f'(x_2)+f'(x_2)(x_2^2-x_1^2)\quad\text{since } f'(x) \text{ is non-decreasing}\\
&=x_1^2(f'(x_1)-f'(x_2))\\
&\le 0\quad\text{again since } f'(x) \text{ is non-decreasing}
\end{aligned}
\]
Therefore, $\bar g(x)$ is non-increasing. Note that $-x^2f'(x)\ge 0$. Also, since $\bar F(x)=-tf(t)\big|_x^\infty+\int_x^\infty f(t)\,dt=xf(x)+\int_x^\infty f(t)\,dt$ by using integration by parts and $\lim_{x\to\infty}xf(x)=0$ as we have just proved, we have $\bar F(x)\to 0$ as $x\to\infty$, so $\bar F$ is bounded as well. Therefore $\bar g$ is bounded from below. This implies that $\bar g$ must converge to a limit, say $\bar c$, as $x\to\infty$. Since $\bar F(x)\to 0$, we have $-x^2f'(x)\to\bar c$. There are two cases: either $\bar c>0$ or $\bar c=0$. The former case implies that $-xf'(x)\ge\epsilon/x$ for some $\epsilon>0$ and all large enough $x$, and so $\bar F(x_0)=-\int_{x_0}^\infty xf'(x)\,dx=\infty$ for some large $x_0$, which is a contradiction. Therefore $-x^2f'(x)\to 0$. This proves the second part of the lemma. $\square$
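As a quick numerical illustration of the lemma (the density $f(x)=e^{-x}$ on $[0,\infty)$ is an assumed example; it is convex in its tail), both quantities indeed vanish:

```python
# A tiny check (assumed example density f(x) = exp(-x)) of the two
# limits in Lemma EC.9.
import numpy as np

x = np.array([1.0, 10.0, 50.0, 100.0])
f = np.exp(-x)            # density value
fprime = -np.exp(-x)      # its derivative
print(x * f)              # x*f(x) -> 0
print(x**2 * fprime)      # x^2*f'(x) -> 0
```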
EC.4.2. Verification for Remark Point 4 at the End of Section 4.3
First, we show that the objective in (3) is bounded as $x_1\nearrow\mu$. By $\eta^2<2\beta\nu$, or equivalently $\sigma>\mu^2$, we have $\sigma-2\mu x_1+x_1^2\to\sigma-\mu^2>0$ as $x_1\nearrow\mu$. Moreover, by $\lambda<\infty$,
\[
\limsup_{x_1\nearrow\mu}(\mu-x_1)^2H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)=\limsup_{x_1\nearrow\mu}(\sigma-\mu x_1)^2\left(\frac{\mu-x_1}{\sigma-\mu x_1}\right)^2H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)=\lambda(\sigma-\mu^2)^2<\infty
\]
Hence both the first and the second term in (3) are bounded.

A similar argument works for the program (4) when $\rho>\mu^2$ is fixed. In fact, we will show that the objective $V(x_1,\rho)$ in (4) is bounded as $x_1\to\mu$ and over any $\rho$, even around the point $(x_1,\rho)=(\mu,\mu^2)$. To this end, the first term of $V(x_1,\rho)$ is clear since $0\le(\rho-\mu^2)/(\rho-2\mu x_1+x_1^2)\le 1$ for any $x_1<\mu$ and $\rho>\mu^2$. For the second term, we write it as
\[
\frac{(\mu-x_1)^2}{\rho-2\mu x_1+x_1^2}\left(H\!\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)-\lambda\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)^2\right)
=\frac{(\rho-\mu x_1)^2}{\rho-2\mu x_1+x_1^2}\left(\left(\frac{\mu-x_1}{\rho-\mu x_1}\right)^2H\!\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)-\lambda\right)
\]
Since $((\mu-x_1)/(\rho-\mu x_1))^2H((\rho-\mu x_1)/(\mu-x_1))-\lambda$ is bounded for any $x_1<\mu$, $\rho>\mu^2$, by the fact that $\limsup_{x\to\infty}H(x)/x^2-\lambda=0$, we are left to show that $(\rho-\mu x_1)^2/(\rho-2\mu x_1+x_1^2)$ is bounded. For this, we let $x_1=\mu-b$ and $\rho=\mu^2+d$ where $b,d>0$ and $b,d\to 0$. Moreover, we write $b=dc$, where $c$ is a positive sequence such that $b\to 0$ as $d\to 0$. Then we have
\[
\frac{(\rho-\mu x_1)^2}{\rho-2\mu x_1+x_1^2}=\frac{(d+\mu b)^2}{d+b^2}=\frac{d(1+\mu c)^2}{1+dc^2}=\frac{d+2\mu dc+\mu^2dc^2}{1+dc^2}
\]
which is bounded since $d$ and $dc=b$ are bounded, and the expression is bounded for any positive $dc^2$. We have thus proved our claim.
EC.4.3. Verification of the Optimal Sequence in the Proofs of Propositions 3 and 4
We consider both cases of (4) being solvable and not solvable. For the first case, given an optimal solution $x_1^*,x_2^*,p_1^*,p_2^*$ for (EC.9), and also $\rho^*=p_1^*(x_1^*)^2+p_2^*(x_2^*)^2$, solved by using (EC.12), we define a solution sequence in $\mathcal P_3[0,\infty)$ exactly as given by (12) and (13). We first verify that the sequence defined in (12) and (13) satisfies all constraints in (9). In fact, we will show that (13) is obtained by setting
\[
p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}+p_3^{(k)}x_3^{(k)}=\mu
\tag{EC.20}
\]
and
\[
p_1^{(k)}(x_1^{(k)})^2+p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2=\sigma
\tag{EC.21}
\]
for all $k$. Note that (EC.20) and (EC.21) can be written as
\[
p_1^*x_1^*+(p_2^*-\gamma^{(k)})(x_2^*-\delta^{(k)})+\gamma^{(k)}x_3^{(k)}=\mu
\]
and
\[
p_1^*(x_1^*)^2+(p_2^*-\gamma^{(k)})(x_2^*-\delta^{(k)})^2+\gamma^{(k)}(x_3^{(k)})^2=\sigma
\]
respectively. Since $p_1^*x_1^*+p_2^*x_2^*=\mu$ and $p_1^*(x_1^*)^2+p_2^*(x_2^*)^2=\rho^*$, they become
\[
-\delta^{(k)}p_2^*-\gamma^{(k)}(x_2^*-\delta^{(k)})+\gamma^{(k)}x_3^{(k)}=0
\tag{EC.22}
\]
and
\[
\delta^{(k)}p_2^*(-2x_2^*+\delta^{(k)})+\gamma^{(k)}\big((x_3^{(k)})^2-(x_2^*-\delta^{(k)})^2\big)=\sigma-\rho^*
\tag{EC.23}
\]
From (EC.22), we have
\[
\gamma^{(k)}=\frac{\delta^{(k)}p_2^*}{\delta^{(k)}+x_3^{(k)}-x_2^*}
\tag{EC.24}
\]
Plugging this into (EC.23), we get
\[
\delta^{(k)}p_2^*(-2x_2^*+\delta^{(k)})+\frac{\delta^{(k)}p_2^*}{\delta^{(k)}+x_3^{(k)}-x_2^*}\big((x_3^{(k)})^2-(x_2^*-\delta^{(k)})^2\big)=\sigma-\rho^*
\]
Solving for $\delta^{(k)}$ implies
\[
p_2^*\delta^{(k)}\big(x_3^{(k)}-x_2^*\big)=\sigma-\rho^*
\]
which gives
\[
\delta^{(k)}=\frac{\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)}
\tag{EC.25}
\]
Substituting into (EC.24), we have
\[
\gamma^{(k)}=\frac{(\sigma-\rho^*)p_2^*}{p_2^*\big(x_3^{(k)}-x_2^*\big)^2+(\sigma-\rho^*)}
\tag{EC.26}
\]
Now the corresponding objective value for (9) is
\[
p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
=p_1^*H(x_1^*)+(p_2^*-\gamma^{(k)})H(x_2^*-\delta^{(k)})+\gamma^{(k)}H(x_3^{(k)})
\]
\[
=p_1^*H(x_1^*)+(p_2^*-\gamma^{(k)})H(x_2^*-\delta^{(k)})+\frac{(\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\sigma-\rho^*)}H(x_3^{(k)})
\to p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lambda(\sigma-\rho^*)
\]
by using the definition of $\lambda$, which is exactly the optimal objective value of (EC.9). This shows that solving (EC.9) via (4) will identify an optimality-approaching sequence for (9).
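The algebra above is easy to check numerically. The sketch below (the two-point solution and moment values are assumed for illustration) confirms that the constructed sequence keeps both moment constraints of (9) exact for every $k$ while pushing the third support point to infinity.

```python
# A sanity-check sketch (assumed values) for the sequence (EC.25)-(EC.26).
mu, sigma = 2.0, 6.0
x1s, x2s, p1s = 0.5, 8.0 / 3.0, 4.0 / 13.0   # two-point measure with mean mu
p2s = 1.0 - p1s
rho = p1s * x1s**2 + p2s * x2s**2            # rho* = 5 < sigma leaves slack

for x3 in [10.0, 100.0, 1000.0]:             # x3^(k) -> infinity
    delta = (sigma - rho) / (p2s * (x3 - x2s))                       # (EC.25)
    gamma = (sigma - rho) * p2s / (p2s * (x3 - x2s)**2 + (sigma - rho))  # (EC.26)
    m1 = p1s * x1s + (p2s - gamma) * (x2s - delta) + gamma * x3
    m2 = p1s * x1s**2 + (p2s - gamma) * (x2s - delta)**2 + gamma * x3**2
    print(round(m1, 10), round(m2, 10))      # stays at (mu, sigma) = (2, 6)
```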
In the second case, i.e., when (4) is not solvable, we have
\[
p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}=\mu
\tag{EC.27}
\]
and
\[
p_1^{(k)}(x_1^{(k)})^2+p_2^{(k)}(x_2^{(k)})^2=\sigma
\tag{EC.28}
\]
for all $k$. Using $x_1^{(k)}=\mu-\delta^{(k)}$ and $p_1^{(k)}=1-\gamma^{(k)}$, (EC.27) and (EC.28) can be written as
\[
(1-\gamma^{(k)})(\mu-\delta^{(k)})+\gamma^{(k)}x_2^{(k)}=\mu
\]
and
\[
(1-\gamma^{(k)})(\mu-\delta^{(k)})^2+\gamma^{(k)}(x_2^{(k)})^2=\sigma
\]
respectively, which further give
\[
\gamma^{(k)}\big(\delta^{(k)}+x_2^{(k)}-\mu\big)-\delta^{(k)}=0
\tag{EC.29}
\]
and
\[
\gamma^{(k)}\big((x_2^{(k)})^2-(\mu-\delta^{(k)})^2\big)+(\mu-\delta^{(k)})^2=\sigma
\tag{EC.30}
\]
From (EC.29) we have
\[
\gamma^{(k)}=\frac{\delta^{(k)}}{\delta^{(k)}+x_2^{(k)}-\mu}
\tag{EC.31}
\]
Putting this into (EC.30), we get
\[
\frac{\delta^{(k)}}{\delta^{(k)}+x_2^{(k)}-\mu}\big((x_2^{(k)})^2-(\mu-\delta^{(k)})^2\big)+(\mu-\delta^{(k)})^2=\sigma
\]
Solving for $\delta^{(k)}$, we have
\[
\delta^{(k)}\big(x_2^{(k)}+\mu-\delta^{(k)}\big)+(\mu-\delta^{(k)})^2=\sigma
\]
which gives
\[
\delta^{(k)}=\frac{\sigma-\mu^2}{x_2^{(k)}-\mu}
\tag{EC.32}
\]
Plugging this into (EC.31), we have
\[
\gamma^{(k)}=\frac{\sigma-\mu^2}{(\sigma-\mu^2)+\big(x_2^{(k)}-\mu\big)^2}
\tag{EC.33}
\]
Moreover, the objective value for (9) is
\[
p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})=(1-\gamma^{(k)})H(\mu-\delta^{(k)})+\gamma^{(k)}H(x_2^{(k)})
=(1-\gamma^{(k)})H(\mu-\delta^{(k)})+\frac{\sigma-\mu^2}{(\sigma-\mu^2)+\big(x_2^{(k)}-\mu\big)^2}H(x_2^{(k)})
\to H(\mu)+\lambda(\sigma-\mu^2)
\]
by using the definition of $\lambda$. This also matches the objective value of (EC.9).
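A parallel check for this non-solvable case (a sketch with the same assumed $H$, for which $\lambda=1/2$): the sequence built from (EC.32) and (EC.33) keeps the moments at $(\mu,\sigma)$ exactly, and its objective value approaches $H(\mu)+\lambda(\sigma-\mu^2)$.

```python
# A numerical sketch (assumed H, mu, sigma) for the sequence
# (EC.32)-(EC.33).
import numpy as np

mu, sigma, lam = 2.0, 5.0, 0.5

def H(x):
    return x**2 / 2 - x + 1 - np.exp(-x)

for x2 in [10.0, 100.0, 1000.0]:
    delta = (sigma - mu**2) / (x2 - mu)                          # (EC.32)
    gamma = (sigma - mu**2) / ((sigma - mu**2) + (x2 - mu)**2)   # (EC.33)
    m1 = (1 - gamma) * (mu - delta) + gamma * x2
    m2 = (1 - gamma) * (mu - delta)**2 + gamma * x2**2
    obj = (1 - gamma) * H(mu - delta) + gamma * H(x2)
    print(m1, m2, obj)    # moments stay (2, 5); obj -> H(2) + 0.5*(5 - 4)
```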