Tail Analysis without Tail Information: A Worst-case Perspective

Henry Lam
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, [email protected]

Clementine Mottet
Department of Mathematics and Statistics, Boston University, Boston, MA 02215, [email protected]
Tail modeling refers to the task of selecting the best probability distributions that describe the occurrences
of extreme events. One common bottleneck in this task is that, due to their very nature, tail data are often
very limited. The conventional approach uses parametric fitting, but the validity of the choice of a parametric
model is usually hard to verify. This paper describes a reasonable alternative that does not require any
parametric assumption. The proposed approach is based on a worst-case analysis under the geometric premise
of tail convexity, a feature shared by all known parametric tail distributions. We demonstrate that the
worst-case convex tail behavior is either extremely light-tailed or extremely heavy-tailed. We also construct
low-dimensional nonlinear programs that can both distinguish between the two cases and find the worst-
case tail. Numerical results show that the proposed approach performs competitively against conventional parametric methods.
Key words: tail modeling, robust analysis, nonparametric
1. Introduction
Modeling extreme behaviors is a fundamental task in analyzing and managing risk. The earliest
applications arose in environmental contexts, as hydrologists and climatologists tried to predict
the risk of flooding and pollution based on historical data of sea levels or air pollutants (Gumbel
(2012)). In non-life or casualty insurance, insurers rely on accurate prediction of large losses to
price insurance policies (McNeil (1997), Beirlant and Teugels (1992), Embrechts et al. (1997)).
Calculating risk measures related to large losses is also a key focus of financial portfolio management
(Glasserman and Li (2005), Glasserman et al. (2007, 2008)). In engineering, measurement of system
reliability often involves modeling the tail behaviors of individual components’ failure times (Nicola
et al. (1993), Heidelberger (1995)).
Despite its importance in various disciplines, tail modeling is an intrinsically difficult task
because, by their own nature, tail data are often very limited. Consider these two examples:
Example 1 (Adapted from McNeil (1997)). There were 2,156 Danish fire losses over one
million Danish Krone (DKK) from 1980 to 1990. The empirical cumulative distribution function
(ECDF) and the histogram (in log scale) are plotted in Figure 1. For a concrete use of the data,
an insurance company might be interested in pricing a high-excess contract with reinsurance, which
has a payoff of X − 50 (in million DKK) when 50 < X ≤ 200, 150 when X > 200, and 0 when
X ≤ 50, where X is the loss amount (the marks 50 and 200 are labeled with vertical lines in Figure
1). Pricing this contract would require, among other information, E[payoff]. However, only seven
data points are above 50 (the loss amount above which the payoff is non-zero).
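To make the target concrete, here is a minimal Python sketch (ours; the function name and the loss array are illustrative and not from McNeil's study) of the contract payoff:

```python
import numpy as np

def payoff(x):
    """Payoff of the high-excess reinsurance contract, in million DKK."""
    # 0 below 50, linear between 50 and 200, capped at 150 above 200
    return np.clip(x - 50.0, 0.0, 150.0)

# With the losses stored in `losses` (in million DKK, assumed available),
# the naive estimate of E[payoff] is the sample mean -- driven here by
# only seven observations:
# np.mean(payoff(losses))
```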
Figure 1: ECDF and histogram for Danish fire losses from 1980 to 1990
Example 2. A more extreme situation is a synthetic data set of size 200 generated from an
unknown distribution, whose histogram is shown in Figure 2. Suppose the quantity of interest is
P(4 < X < 5). This appears to be an ill-posed problem, since the interval [4, 5] contains no data at all. This situation is not uncommon in applications where one tries to extrapolate the tail from a small sample.
Figure 2 Histogram of a synthetic data set with sample size 200
The primary purpose of this paper is to construct a theoretically justified methodology to esti-
mate quantities of interest such as those depicted in the examples above. This requires drawing
information properly from data not in the tail. We will illustrate how to do this and revisit the
two examples later with numerical performance of our method.
Another important objective of this paper is to identify the qualitative choice of tail that one
should adopt under limited data. To motivate this, mathematical studies of tail behavior in com-
plex systems, notably large deviations analysis, often rely crucially on the types of tail in the
underlying individual components. One prominent distinction is light versus heavy tails. It is well known that, to exhibit large deviations behavior, light-tailed components each contribute a small amount (through exponential twisting of the probability measure; e.g., Bucklew
(2004), Dembo and Zeitouni (2009)), whereas heavy-tailed systems do so by making the so-called
“big jumps” (i.e., only one or two components exhibit large values while all others remain in the
usual domain; e.g., Denisov et al. (2008)). In the examples above, with little or no data in the tail,
should one fit a light-tailed or a heavy-tailed distribution? This paper aims to give a meaningful answer.
2. Our Approach and Main Contributions
We adopt a nonparametric approach. Rather than fitting a parametric curve to the tail, where there may be few or no observations, we base our analysis on one geometric premise: that
the tail density is convex. We emphasize that this condition is satisfied by all known parametric
distributions (e.g. normal, lognormal, exponential, gamma, Weibull, Pareto etc.). For this reason
we believe it is a natural and minimal assumption to make.
In any given problem, there can be infinitely many feasible candidate convex
tails. The central idea of our method is a worst-case characterization. Formally, given information
on the non-tail part of the distribution and a target quantity of interest (e.g., P (4 < X < 5) in
Example 2), we aim to find a convex tail, consistent with the non-tail part, that gives rise to the
worst-case value of the target (e.g., the largest possible value of P (4<X < 5)). This value serves
as the best bound for the target that is robust with respect to the uncertainty on the tail, without
any knowledge other than our a priori assumption of convexity.
Our proposed approach requires solving an optimization over a potentially infinite-dimensional
space of convex tails. As our key contributions, we show that this problem has a very simple
optimality structure, and find its solution via low-dimensional nonlinear programs. In particular:
1. We both qualitatively and quantitatively characterize the worst-case tail behavior under the
tail convexity condition. We show that the worst-case tail, for any bounded target quantity of
interest, is either extremely light-tailed or extremely heavy-tailed in a well-defined sense. Both cases
can be characterized by piecewise linear densities, the distinction being whether the pieces form a
bounded support distribution or lead to probability masses that escape to infinity.
2. We provide efficient algorithms to distinguish between the two cases above, and to solve for
the optimal distribution in each case. For a large class of objectives, the classification only requires a
one-dimensional line search, and solving for the optimal distribution requires at most an additional
two-dimensional nonlinear program.
We further illustrate how to integrate the above results with data to obtain statistically valid
worst-case bounds. This requires suitable relaxation of the proposed worst-case optimizations to
incorporate interval estimates of certain key parameters of the distribution in the non-tail region.
Our framework gets around the difficulty faced by conventional parametric methods (discussed in detail in the next section) in directly estimating the tail curve: the worst-case analysis shifts the estimation burden entirely to the central part of the density curve, where more data are available. However, we pay a price in conservativeness: our method can generate a worst-case bound that is over-pessimistic. We therefore believe it is most suitable for small sample sizes, where some conservativeness is an unavoidable price for statistical validity.
The remainder of this paper is organized as follows. Section 3 discusses some previous techniques
and reviews the relevant literature. Section 4 presents our formulation and results for an abstract
setting. Section 5 describes some elementary numerical examples. Section 6 gives our main math-
ematical arguments. Section 7 explains how to integrate our formulation with data, and Section 8
concludes and discusses some future work. All proofs are left to the Appendix.
3. Related Work
3.1. Overview of Common Tail-fitting Techniques
As far as we know, all existing techniques for handling extreme values are parametric-based, in
the sense that a “best” parametric curve is chosen and the parameters are fit to the tail data.
The classic text of Hogg and Klugman (2009) provides a comprehensive discussion on the common
choices of parametric tail densities. While exploratory data analysis, such as quantile plots and
mean excess plots, can provide guidance regarding the class of parametric curves to use (such as
heavy, middle or light tail), this approach is limited by its reliance on a large amount of tail data and by the subjectivity of the choice of parametric curve.
Beyond the goodness-of-fit approach, there are two widely used results on the parametric choice
that is provably suitable for extreme values. The Fisher-Tippett-Gnedenko Theorem (Fisher and
Tippett (1928), Gnedenko (1943)) states that sample maxima, after suitable scaling, must converge to a generalized extreme value (GEV) distribution, provided they converge at all to some non-degenerate distribution. This result is useful if the data are known to derive from the maximum
of some distributions. For instance, environmental data on sea level and river heights are often
collected as annual maxima (Davison and Smith (1990)), and in this scenario it is sensible to fit the
GEV distribution. In other scenarios, the data have to be pre-divided into blocks and blockwise
maxima have to be taken in order to apply GEV, but this blockwise approach is statistically
wasteful (Embrechts et al. (2005)).
The Pickands-Balkema-de Haan Theorem (Pickands III (1975), Balkema and De Haan (1974)) does not require the data to come from maxima. Rather, the theorem states that the
excess losses over thresholds converge to a generalized Pareto distribution (GPD) as the thresh-
olds approach infinity, under the same conditions as the Fisher-Tippett-Gnedenko Theorem. The
Pickands-Balkema-de Haan theorem provides a solid mathematical justification for using GPD to
fit the tail portion of data (McNeil (1997), Embrechts et al. (2005)). Fitting GPD can be done by
well-studied procedures such as maximum likelihood estimation (Smith (1985)), and the method of
probability-weighted moments (Hosking and Wallis (1987)). The Hill estimator (Hill et al. (1975),
Davis and Resnick (1984)) is also a widely used alternative.
Despite the attraction and frequent usage, fitting GPD suffers from two pitfalls: First, there is
no convergence rate result that tells how high a threshold should be for the GPD approximation to
be valid (e.g. McNeil (1997)). Hence, picking the threshold is an ad hoc task in practice. Second,
and more importantly, even if the threshold chosen is sufficiently high for the theorem to hold, a
large amount of data above it is needed to accurately estimate the parameters in GPD. In our two
examples, especially Example 2, this is plainly impossible.
3.2. Related Literature on our Methodology
Our mathematical formulation and techniques are related to two lines of literature. The use of
convexity and other shape constraints (such as log-concavity) has appeared in density estimation
(Cule et al. (2010), Seregin and Wellner (2010), Koenker and Mizera (2010)) and convex regression
(Seijo et al. (2011), Hannah and Dunson (2013), Lim and Glynn (2012)) in statistics. A major
reason for using convexity in these statistical problems is the removal of tuning parameters, such
as bandwidth, as otherwise required. Besides the difference in motivation, these problems often
involve optimizations that are finite-dimensional (in the data size), and so are different from our
infinite-dimensional formulations.
The second line of related literature is optimization over probability distributions, which has
appeared in decision analysis (Smith (1995), Bertsimas and Popescu (2005), Popescu (2005)),
robust control theory (Iyengar (2005), El Ghaoui and Nilim (2005), Petersen et al. (2000), Hansen
and Sargent (2008)), distributionally robust optimization (Delage and Ye (2010), Goh and Sim
(2010)), and stochastic programming (Birge and Wets (1987), Birge and Dula (1991)). The typical
formulation involves optimization of some objective governed by a probability distribution that
is partially specified via constraints like moments (Karr (1983), Winkler (1988)) and statistical
distances (Ben-Tal et al. (2013)). Our formulation differs from these studies because it pertains
more to tail modeling (i.e., knowledge of certain regions of the density, but none beyond it). Among
all the previous works, only Popescu (2005) has considered a convex density assumption, as an
instance of a proposed class of geometric conditions that are added to moment constraints. While
the result bears similarity to ours in that a piecewise linearity structure shows up in the solution,
our qualitative classification of the tail, the solution techniques, and the data-driven relaxation all
differ from the semidefinite programming approach in Popescu (2005).
4. Abstract Formulation and Results
We begin by considering an abstract formulation assuming full information on the distribution up
to some threshold, and no information beyond. The next sub-sections give the details.
4.1. Formulation and Notation
Consider a continuous probability distribution on R whose density exists and is denoted by f(x). We assume that f is known up to a certain large threshold, say a ∈ R. The goal is to extrapolate f.
We impose the assumption that f(x), for x ≥ a, is convex. Figure 3 shows an example of an f(x) known up to a, and Figures 4 and 5 show examples of convex and non-convex extrapolations, respectively.
Observe that the convex tail assumption excludes any “surprising” bumps (and falls) in the density
curve.
Figure 3: A probability density f(x) known up to a threshold a
Figure 4: An example of convex tail extrapolation
Figure 5: An example of non-convex tail extrapolation
Figure 6: The parameters η, ν, β
Now suppose we are given a target objective or performance measure E[h(X)], where E[·] denotes
the expectation under f, and h : R → R is a bounded function. The goal is to calculate the worst-case value of E[h(X)] under the assumption that f is convex beyond a. That is, we want to obtain max E[h(X)] = max ∫_{-∞}^{∞} h(x)f(x) dx, where the maximization is over all f(x) that are convex for x ≥ a and satisfy the properties of a probability density function. For this formulation, we need three constants extracted from f(x), x < a, which we denote as η, ν, β > 0 respectively:
1. η is the value of the density f at a, i.e. f(a) = η.
2. −ν is the left derivative of f at a. Since f is convex beyond a, it is differentiable with non-decreasing derivative almost everywhere (a.e.) in that range. The right derivative at a, f'_+(a), must be at least as large as the left derivative f'_-(a), which is set equal to −ν.
3. β is the tail probability at a. Since f is known up to a, ∫_{-∞}^{a} f(x) dx is known to be equal to some number 1 − β, and ∫_a^∞ f(x) dx must equal β.
Figure 6 illustrates these quantities. Our formulation can be written as
\[
\begin{aligned}
\max \quad & \int_a^\infty h(x) f(x)\,dx \\
\text{subject to} \quad & \int_a^\infty f(x)\,dx = \beta \\
& f(a) = \eta \\
& f'_+(a) \ge -\nu \\
& f \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{aligned}
\tag{1}
\]
Note that we have set our objective to be E[h(X);X ≥ a], since E[h(X);X < a] is completely
known in this setting.
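To illustrate how these constants arise, the following sketch (ours) computes η, ν and β numerically for an assumed standard exponential density known up to a threshold a; in this special case all three equal e^{-a}:

```python
import numpy as np
from scipy.integrate import quad

a = 1.0                            # assumed threshold up to which f is known
f = lambda x: np.exp(-x)           # assumed known density: standard exponential

eta = f(a)                         # eta = f(a)
nu = -(f(a + 1e-6) - f(a)) / 1e-6  # nu = -f'(a), by a finite difference
beta, _ = quad(f, a, np.inf)       # beta = P(X > a)
# For Exp(1), eta = nu = beta = exp(-a); these are the inputs to (1)
```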
4.2. Optimality Characterization
The solution structure of (1) turns out to be extremely simple and is characterized by one of two closely related cases (focusing on the region x ≥ a), as presented in the following theorem:
Theorem 1. Suppose h is bounded. Consider the optimization (1). If it is feasible, then either
1. An optimal probability density exists. In this case, there is an optimal density that is con-
tinuous, piecewise linear and has bounded support for x≥ a. Moreover, it has three line segments,
the first one continuing from a with slope −ν, and the last linking to the horizontal axis (with the
possibility that one or both of the first two segments are degenerate, i.e., zero length).
2. An optimal probability density does not exist. In this case, there is a sequence of feasible
probability densities whose objective values converge to the optimal value of (1). Each density in
this sequence is continuous, piecewise linear and has bounded support for x≥ a. It has three line
segments, the first one continuing from a with slope −ν, and the last linking to the horizontal axis
(with the possibility that the first segment is degenerate). As the sequence progresses, the last segment gets both closer and more parallel to the horizontal axis.
Under the following additional assumption on h, Theorem 1 can be further simplified:
Assumption 1. The function h : R → R is non-decreasing in (a, c) and non-increasing in (c, ∞) for some constant a ≤ c ≤ ∞ (i.e., c can possibly be ∞).
Theorem 2. Under Assumption 1 and the assumption that h is bounded, the optimal density in the first case in
Theorem 1 has two line segments, the first continuing from a with slope −ν and the second linking
to the horizontal axis (with the possibility that the first segment is degenerate).
The proofs of Theorems 1 and 2 are discussed in Section 6 and detailed in Appendix EC.1.2.
Figures 7 and 8 show the tail behaviors for the two cases when Assumption 1 holds. Qualitatively,
with a bounded support, the first case in Theorem 1 or 2 clearly possesses the lightest possible
tail. The second case in the theorems can be interpreted as an extreme heavy-tail. To explain this,
compare the optimal sequence of densities with any given fixed density. At any fixed, large enough x on the real line, as the sequence progresses, the decay rate (i.e., slope) at that point eventually becomes slower than that of the given density. Since a slower decay rate is the characteristic of a
fatter tail, this optimal density sequence can be interpreted as qualitatively capturing the heaviest
possible tail.
Figure 7: Behavior of an optimal light-tailed extrapolation
Figure 8: Behavior of an element in an optimal heavy-tailed extrapolation sequence
4.3. Optimization Procedure
We focus on h that satisfies Assumption 1, since it covers many natural scenarios including the
two examples in the Introduction. Given information about the density up to a, our algorithm will:
1) classify the two cases of light versus heavy tails, and 2) solve for the optimal density or the
optimal sequence of densities. The main idea is to search for the kinks of piecewise linear densities
that Theorem 1 shows to be optimal for problem (1). We first state the algorithm, and will provide
explanation momentarily and more details in Section 6:
Algorithm 1: Procedure for Solving (1)

Inputs:
1. The cost function h that satisfies Assumption 1 and is bounded and w.l.o.g. non-negative.
2. The parameters β, η, ν > 0.

Procedure:

Exclusion of trivial scenarios:
1. If η² > 2βν, there is no feasible density.
2. If η² = 2βν, then:
- Optimal value: νH(µ)
- There is only one feasible density, given by f(x) = η − ν(x − a).
3. Otherwise continue.

Main procedure:
Let
\[
\mu = \frac{\eta}{\nu}, \qquad \sigma = \frac{2\beta}{\nu}, \qquad H(x) = \int_0^x \int_0^u h(v+a)\,dv\,du, \qquad \lambda = \limsup_{x \to \infty} \frac{H(x)}{x^2} < \infty \tag{2}
\]
Consider the optimization
\[
\max_{x_1 \in [0,\mu)} \; W(x_1) = \frac{\sigma - \mu^2}{\sigma - 2\mu x_1 + x_1^2}\, H(x_1) + \frac{(\mu - x_1)^2}{\sigma - 2\mu x_1 + x_1^2}\, H\!\left(\frac{\sigma - \mu x_1}{\mu - x_1}\right) \tag{3}
\]
Either of the following two cases occurs:

Case 1 (light tail): There is an optimal solution for (3), given by x_1^* ∈ [0, µ). Then:
- Optimal value: νW(x_1^*)
- Optimal density:
\[
f(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le x_1^* + a \\
\eta - \nu x_1^* - \nu p_2^* (x - a - x_1^*) & \text{for } x_1^* + a \le x \le x_2^* + a \\
0 & \text{for } x \ge x_2^* + a
\end{cases}
\]
where
\[
x_2^* = \frac{\sigma - \mu x_1^*}{\mu - x_1^*}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}}
\]

Case 2 (heavy tail): There does not exist an optimal solution for (3). This occurs when there is a sequence x_1^{(k)} → µ such that W(x_1^{(k)}) ↗ W*, where W* is the optimal value of (3), but W* > max_{x_1 ∈ [0, µ−ε]} W(x_1) for any ε > 0.

Consider further the optimization
\[
\begin{aligned}
\max_{x_1 \in [0,\mu),\; \rho \in [\mu^2, \sigma]} \; V(x_1, \rho) = {} & \frac{\rho - \mu^2}{\rho - 2\mu x_1 + x_1^2} \left( H(x_1) - \lambda x_1^2 \right) \\
& + \frac{(\mu - x_1)^2}{\rho - 2\mu x_1 + x_1^2} \left( H\!\left(\frac{\rho - \mu x_1}{\mu - x_1}\right) - \lambda \left(\frac{\rho - \mu x_1}{\mu - x_1}\right)^2 \right) + \lambda\sigma
\end{aligned}
\tag{4}
\]
One of the following two sub-cases occurs:

Case 2.1: There exists an optimal solution for (4), given by (x_1^*, ρ*) ∈ [0, µ) × [µ², σ]. Then:
- Optimal value: νV(x_1^*, ρ*)
- There is no optimal density. The optimal value is achieved by a sequence f^{(k)}(x), x ≥ a, given by
\[
f^{(k)}(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le x_1^* + a \\
\eta - \nu x_1^* - \nu p_2^* (x - a - x_1^*) & \text{for } x_1^* + a \le x \le x_2^* - \delta^{(k)} + a \\
\eta - \nu x_1^* - \nu p_2^* (x_2^* - \delta^{(k)} - x_1^*) - \nu \gamma^{(k)} (x - a - (x_2^* - \delta^{(k)})) & \text{for } x_2^* - \delta^{(k)} + a \le x \le x_3^{(k)} + a \\
0 & \text{for } x \ge x_3^{(k)} + a
\end{cases}
\]
where
\[
x_2^* = \frac{\rho^* - \mu x_1^*}{\mu - x_1^*}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\rho^* - 2\mu x_1^* + x_1^{*2}}
\]
and
\[
x_3^{(k)} \to \infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2} \to \lambda \text{ as } k \to \infty, \qquad
\delta^{(k)} = \frac{\sigma - \rho^*}{p_2^* (x_3^{(k)} - x_2^*)}, \qquad \gamma^{(k)} = \frac{(\sigma - \rho^*)\, p_2^*}{p_2^* (x_3^{(k)} - x_2^*)^2 + (\sigma - \rho^*)}
\]

Case 2.2: There does not exist an optimal solution for (4). This occurs when there is a sequence (x_1^{(k)}, ρ^{(k)}), with x_1^{(k)} → µ, such that V(x_1^{(k)}, ρ^{(k)}) ↗ V*, where V* is the optimal value of (4), but V* > max_{x_1 ∈ [0, µ−ε], ρ ∈ [µ², σ]} V(x_1, ρ) for any ε > 0. Then:
- Optimal value: ν(H(µ) + λ(σ − µ²))
- There is no optimal density. The optimal value is achieved by a sequence f^{(k)}(x), x ≥ a, given by
\[
f^{(k)}(x) = \begin{cases}
\eta - \nu(x - a) & \text{for } a \le x \le \mu - \delta^{(k)} + a \\
\eta - \nu(\mu - \delta^{(k)}) - \nu \gamma^{(k)} (x - a - (\mu - \delta^{(k)})) & \text{for } \mu - \delta^{(k)} + a \le x \le x_2^{(k)} + a \\
0 & \text{for } x \ge x_2^{(k)} + a
\end{cases}
\]
where
\[
x_2^{(k)} \to \infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2} \to \lambda \text{ as } k \to \infty, \qquad
\delta^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu}, \qquad \gamma^{(k)} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma}
\]
Note that the functions W(x_1) in (3) and V(x_1, ρ) in (4) are constructed to find the kinks of the line segments in an optimal piecewise linear density or sequence of densities; x_1 represents the x-coordinate of the first kink (relative to the threshold a).
Algorithm 1 uses a one-dimensional line search, namely (3), to classify the type of optimality into
light or heavy tail. In the light-tailed case (Case 1), an optimal density can be obtained without
further optimization. In the heavy-tailed case (Case 2), an additional two-dimensional nonlinear
program (4) needs to be solved to find an optimal sequence of densities. These programs can be
solved readily by standard nonlinear solvers, modulo the detection of non-existence of optimal
solution due to the boundary x1 = µ, which we discuss next.
We can detect whether or not a solution exists for (3) and (4) by introducing tiny "tolerance parameters" 0 < ε < ε′. For (3), replace the problem with max_{x_1 ∈ [0, µ−ε]} W(x_1), which must possess an optimal solution x_1^*. If x_1^* < µ − ε′, then we fall into Case 1, whereas x_1^* ∈ [µ − ε′, µ − ε] falls into Case 2. This captures the behavior that if the optimality of W is characterized by a sequence x_1^{(k)} → µ, then, fixing a small value for ε′ − ε, the optimal solution for max_{x_1 ∈ [0, µ−ε]} W(x_1) must occur in the interval [µ − ε′, µ − ε] as ε → 0. Similar statements hold for (4). Namely, max_{x_1 ∈ [0, µ−ε], ρ ∈ [µ², σ]} V(x_1, ρ) must have an optimal solution (x_1^*, ρ*), and the scenario x_1^* < µ − ε′ leads to Case 2.1, whereas x_1^* ∈ [µ − ε′, µ − ε] leads to Case 2.2. Furthermore, when the function H(x)/x² is eventually non-decreasing as x → ∞, one can simply put ε′ = ε in the above procedures.
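As a minimal sketch of this detection step (the parameter values and the indicator objective h(x) = 1{4 < x < 5} are our own illustrative choices; W is the function in (3)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative inputs (ours): threshold a and made-up beta, eta, nu
# satisfying the feasibility condition eta^2 < 2*beta*nu.
a, beta, eta, nu = 3.3, 0.05, 0.04, 0.03
mu, sigma = eta / nu, 2 * beta / nu
left, right = 4.0 - a, 5.0 - a      # support of h(. + a), shifted to start at 0

def H(x):
    """H(x) = int_0^x int_0^u h(v + a) dv du for the indicator objective."""
    if x <= left:
        return 0.0
    if x <= right:
        return 0.5 * (x - left) ** 2
    return 0.5 * (right - left) ** 2 + (right - left) * (x - right)

def W(x1):
    """Objective of the one-dimensional line search (3)."""
    d = sigma - 2.0 * mu * x1 + x1 ** 2
    x2 = (sigma - mu * x1) / (mu - x1)
    return (sigma - mu ** 2) / d * H(x1) + (mu - x1) ** 2 / d * H(x2)

eps, eps2 = 1e-6, 1e-4              # tolerance parameters 0 < eps < eps'
# W may be multimodal; a bounded scalar search suffices for this sketch
res = minimize_scalar(lambda x: -W(x), bounds=(0.0, mu - eps), method="bounded")
if res.x < mu - eps2:
    print("Case 1 (light tail); worst-case bound =", -nu * res.fun)
else:
    print("Case 2 (heavy tail); proceed to the nonlinear program (4)")
```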
We close this section with a few further remarks on Algorithm 1:
1. If h is not non-negative as the algorithm requires, then, because h is bounded, one can always add a sufficiently large constant to make it non-negative; the optimality properties are obviously retained.
2. Regarding the exclusion of trivial scenarios in the algorithm, η² > 2βν implies that β is smaller than the area under the straight line starting from the point (a, η) and going down to the x-axis with slope −ν. It is easy to see that no convex extrapolation can be drawn under this condition. Expressed in terms of µ and σ, the condition is equivalent to σ < µ².
3. We have λ < ∞ because h is bounded and hence H(x) grows at most quadratically in x.
4. W(x_1) and V(x_1, ρ) in (3) and (4) are both bounded as x_1 ↗ µ (see Appendix EC.4.2 for the proof).
5. In the heavy-tailed case (Case 2), the sequence f^{(k)}(x) that approaches optimality possesses a pointwise limit, but the limit is not a valid density: a probability mass "escapes" to positive infinity. In other words, f^{(k)}(x) does not converge weakly to any probability measure, even though the sequence of objective values does converge.
5. Elementary Numerical Examples
We demonstrate Algorithm 1 with two examples.
Entropic Risk Measure: The entropic risk measure (e.g., Föllmer and Schied (2011)) captures
the risk aversion of users through the exponential utility function. It is defined as
\[
\rho(X) = \frac{1}{\theta} \log\left(E\left[e^{-\theta X}\right]\right) \tag{5}
\]
where θ > 0 is the parameter of risk aversion. In the case when the distribution of the random variable X is known only up to some point a, we can find the worst-case value of the entropic risk measure subject to tail uncertainty by solving the optimization problem
\[
\max_{P \in \mathcal{A}} \frac{1}{\theta} \log\left(E\left[e^{-\theta X}\right]\right) = \frac{1}{\theta} \log\left(E\left[e^{-\theta X}; X \le a\right] + \max_{P \in \mathcal{A}} E\left[e^{-\theta X}; X > a\right]\right) \tag{6}
\]
where A denotes the set of convex tails that match the given non-tail region. Since the function
e^{-θx} satisfies Assumption 1, we can apply Algorithm 1 to the second term on the RHS of (6).
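For this choice of h, the function H in (2) is available in closed form, which makes the line search (3) inexpensive; a short derivation (ours) reads
\[
H(x) = \int_0^x \int_0^u e^{-\theta(v+a)}\,dv\,du = \frac{e^{-\theta a}}{\theta} \int_0^x \left(1 - e^{-\theta u}\right) du = \frac{e^{-\theta a}}{\theta^2}\left(\theta x - 1 + e^{-\theta x}\right),
\]
so that λ = limsup_{x→∞} H(x)/x² = 0, since H grows only linearly.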
The thick line in Figure 9 represents the worst-case value of the entropic risk measure for different
values of the parameter θ in the case when X is known to have a standard exponential distribution
Exp(1) up to a = −log(0.7) (i.e., a is the point with tail probability 0.7, so that β = η = ν = 0.7). For comparison,
we also calculate and plot the entropic risk measure for several fitted probability distributions:
Exp(1), two-segment continuous piecewise linear tail denoted as 2-PLT (two such instances in
Figure 9), and mixtures of 2-PLT and shifted Pareto. Clearly, the worst-case values bound those
calculated from the candidate parametric models, with the gap diminishing as θ increases.
The Newsvendor Problem: The classical newsvendor problem maximizes the profit of selling
a perishable product by fulfilling demand using a stock level decision, i.e.,
\[
\max_q \; E[p \min(q, D)] - cq \tag{7}
\]
where D is the demand random variable, p and c are the selling and purchase prices per product,
and q is the stock quantity to be determined. We assume that p > c. The optimal solution to (7)
is given by Littlewood's rule q* = F^{-1}((p − c)/p), where F^{-1} is the quantile function of D (Talluri
and Van Ryzin (2006)).
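As a quick illustration of Littlewood's rule under the nominal lognormal model (a sketch with our own moment-matching assumptions; this is the non-robust q*, not the solution of the robust formulation below):

```python
import numpy as np
from scipy.stats import lognorm

p, c = 7.0, 1.0
mean, sd = 50.0, 20.0
# Moment-match a lognormal to mean 50 and standard deviation 20
s2 = np.log(1.0 + (sd / mean) ** 2)   # variance of log(D)
m = np.log(mean) - s2 / 2.0           # mean of log(D)
D = lognorm(s=np.sqrt(s2), scale=np.exp(m))

q_star = D.ppf((p - c) / p)           # Littlewood's rule: F^{-1}((p - c)/p)
print(q_star)                         # roughly 70 under the full lognormal
```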
Figure 9 Optimal upper bound and comparison with parametric extrapolations for the entropic risk measure.
Suppose the distribution of D is only known to have the shape of a lognormal distribution
with mean 50 and standard deviation 20 in the interval [0, a), where a is the 70-percentile of the
lognormal distribution. A robust optimization formulation for (7) is
\[
\max_q \min_{P \in \mathcal{A}} E[p \min(q, D)] - cq = \max_q \left\{ E[p \min(q, D); D \le a] + \min_{P \in \mathcal{A}} E[p \min(q, D); D > a] - cq \right\} \tag{8}
\]
where A denotes the set of convex tails that match the given non-tail region. The outer optimization
in (8) is a concave program. We concentrate on the inner optimization. Since p min(q, D) is a non-decreasing function of D on [0, ∞), its negation is non-increasing and Assumption 1 holds (the minimization can be carried out by maximizing the negation). We can therefore
apply Algorithm 1 (with β = 0.7, η ≈ 0.007, and ν ≈ 0.0003). Figure 10 shows the optimal lower
bound of the inner optimization when p = 7, c = 1 and q varies between 0 and 193.26 (which is
the 95-percentile of the lognormal distribution). The curve peaks at q= 55.7, which is the solution
to problem (8). As a comparison, we also show different candidate values of the expectation that
are obtained by fitting the tails of lognormal, 2-PLT (two instances) and mixture of shifted Pareto
and 2-PLT (see Figure 10).
Figure 10 Optimal objective values of the inner optimization of the robust newsvendor problem.
6. Main Mathematical Developments
This section explains our main mathematical arguments for the optimality characterization and
Algorithm 1. There are four steps:
Step 1: Conversion of (1) into a moment-constrained optimization program. The first key observation in solving (1) is that it can be reduced to a moment-constrained program, by re-expressing the decision variable as f'(x) and identifying it, through a linear transformation, with a probability distribution function. This is summarized as:
Theorem 3. Suppose h is bounded. Denote H(x) = ∫_0^x ∫_0^u h(v+a) dv du. Then the optimal value of (1) is the same as
\[
\begin{aligned}
\max \quad & \nu E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& \mathbf{P} \in \mathcal{P}[0,\infty)
\end{aligned}
\tag{9}
\]
where µ = η/ν and σ = 2β/ν. Here the decision variable is a probability distribution P ∈ P[0,∞), where P[0,∞) denotes the set of all probability measures on [0,∞), and E[·] is the corresponding expectation. Moreover, there is a one-to-one correspondence (up to measure zero) between the feasible solutions to (1) and (9), given by f'(x+a) = ν(p(x) − 1) a.e., where f is a feasible solution of (1) and p is a probability distribution function over [0,∞).
The proof is left to Appendix EC.1.1. The result allows us to focus on program (9) and at the
end convert its solution back to that of the original formulation (1).
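To see where the moment constraints in (9) come from, write p(x) = 1 + f'(x+a)/ν and integrate by parts; a sketch of the two key identities (ours, using f(x) → 0 and x f(x+a) → 0 as x → ∞) is
\[
E[X] = \int_0^\infty (1 - p(x))\,dx = -\frac{1}{\nu}\int_0^\infty f'(x+a)\,dx = \frac{f(a)}{\nu} = \frac{\eta}{\nu} = \mu,
\]
\[
E[X^2] = 2\int_0^\infty x\,(1 - p(x))\,dx = \frac{2}{\nu}\int_0^\infty f(x+a)\,dx = \frac{2\beta}{\nu} = \sigma.
\]
A similar double integration by parts turns the objective ∫_a^∞ h(x)f(x) dx into νE[H(X)].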
Step 2: Characterization of the form of the optimal solution for (9). Program (9) is an infinite-dimensional linear program (LP). Using existing terminology, we call an optimization program consistent if
there exists a feasible solution, and solvable if there exists an optimal solution. We start with the
immediate observation:
Lemma 1. Program (9) is consistent if and only if σ ≥ µ². Correspondingly, program (1) is consistent if and only if η² ≤ 2βν. When σ = µ², (9) has only one feasible solution, given by a point mass at µ. Correspondingly, when η² = 2βν, (1) has only one feasible solution, given by f(x) = η − ν(x − a) for x ≥ a.
Lemma 1 justifies the beginning step on the exclusion of trivial scenarios in Algorithm 1.
To proceed further, denote S_n = {(p_1, ..., p_n) ∈ R_+^n : Σ_{i=1}^n p_i = 1} as the n-dimensional probability simplex. Denote P_n[0,∞) as the set of all finite-support distributions on [0,∞) with at most n support points, i.e., each probability measure in P_n[0,∞) has point masses (p_1, p_2, ..., p_n) ∈ S_n on some distinct points x_1, ..., x_n ∈ [0,∞). For convenience, denote OPT(C) as the program
\[
\begin{aligned}
\max \quad & E[H(X)] \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] = \sigma \\
& \mathbf{P} \in \mathcal{C}
\end{aligned}
\]
Moreover, we introduce the following assumption:
Assumption 2. H is convex and H' satisfies a convex-concave property, i.e., H'(x) is convex for x ∈ (0, c) and concave for x ∈ (c, ∞), for some 0 ≤ c ≤ ∞.
Note that this assumption holds if h is non-negative, bounded, satisfies Assumption 1, and relates to H by H(x) = ∫_0^x ∫_0^u h(v+a) dv du.
The following result characterizes the optimality structure of (9):
Proposition 1. The optimal value of OPT(P[0,∞)) is identical to that of OPT(P_3[0,∞)). In addition, under Assumption 2, the existence of an optimal solution for OPT(P[0,∞)) implies that there is an optimal solution for OPT(P[0,∞)) that lies in P_2[0,∞).
Proposition 1 implies the optimality characterization of our original problem (1) as follows. The first part concludes that there must be a sequence of feasible solutions in P_3[0,∞) that converges to optimality for (9). When converted back to formulation (1) using Theorem 3, these solutions correspond exactly to piecewise linear densities with three line segments, which leads to Theorem 1. In the second part of Proposition 1, an optimal solution in P_2[0,∞) for (9) converts exactly to a piecewise linear density with two line segments for (1), which leads to Theorem 2.
The detailed proof of Proposition 1 and how it implies Theorems 1 and 2 are shown in Appendix
EC.1.2.
Step 3: Solving (9). Knowing the optimality structure through Proposition 1, the remaining task is to find an optimal solution (or sequence of solutions). This can be done by searching for the support points and their masses. For convenience, we encode a generic element in P_2[0,∞) by (x_1, x_2, p_1, p_2), where (x_1, x_2) are the support points in [0,∞) and (p_1, p_2) ∈ S_2 are the associated probability masses. Similarly, we encode a generic element in P_3[0,∞) by (x_1, x_2, x_3, p_1, p_2, p_3).

The line search in (3) is posed to find an optimal element (x_1^*, x_2^*, p_1^*, p_2^*) in P_2[0,∞) for (9), which reduces to finding just x_1^*. If the optimal x_1^* < µ, we conclude that (9) is solvable, which subsequently leads to Case 1 (i.e., the light-tailed case) in Algorithm 1. Otherwise, (9) is not solvable and needs further analysis. This is summarized as follows:
Proposition 2. Suppose σ > µ² and Assumption 2 holds. Then either one of the following cases happens:
1. The program (3) is solvable, i.e., there is an optimal solution x_1^* ∈ [0, µ). Then (9) is solvable and has an optimal solution in P_2[0,∞). Moreover, this solution is given by (x_1^*, x_2^*, p_1^*, p_2^*), where
\[
x_2^* = \frac{\sigma - \mu x_1^*}{\mu - x_1^*}, \qquad p_1^* = \frac{\sigma - \mu^2}{\sigma - 2\mu x_1^* + x_1^{*2}}, \qquad p_2^* = \frac{(\mu - x_1^*)^2}{\sigma - 2\mu x_1^* + x_1^{*2}} \tag{10}
\]
2. The program (3) is not solvable, i.e., W(x_1) < W* < ∞ for any x_1 ∈ [0, µ). Then (9) is not solvable, and there is a feasible sequence P^{(k)} ∈ P_3[0,∞) with objective value converging to the optimal value of (9). Each P^{(k)} is represented by the support points (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}) and masses (p_1^{(k)}, p_2^{(k)}, p_3^{(k)}), such that either (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) → (x_1^*, x_2^*, ∞, p_1^*, p_2^*, 0) for some x_1^*, x_2^* ∈ [0,∞) (possibly identical) and (p_1^*, p_2^*) ∈ S_2, or (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) → (x_1^*, ∞, ∞, 1, 0, 0) for some x_1^* ∈ [0,∞).
The first case in Proposition 2 leads to the optimality structure of Case 1 (i.e. the light-tailed
case) in Algorithm 1. The proof of Proposition 2 is in Appendix EC.1.3.
Step 4: Relaxed program to solve for the optimal sequence in the case of non-existence of optimal
solution. We have the following further characterization:
Proposition 3. Suppose σ > µ² and Assumption 2 holds. In the case that (9) is not solvable, its optimal value is equivalent to that of
\[
\begin{aligned}
\max \quad & E[r(X)] + \lambda\sigma \\
\text{subject to} \quad & E[X] = \mu \\
& E[X^2] \le \sigma \\
& \mathbf{P} \in \mathcal{P}_2[0,\infty)
\end{aligned}
\tag{11}
\]
where r(x) = H(x) − λx².
Program (11) is obtained by identifying all the limits of the sequences in P_3[0,∞) represented in Case 2 of Proposition 2. The proof of Proposition 3 is left to Appendix EC.1.4.
Finally, one can search for the support points and probability masses of an optimal solution of (11) by posing the nonlinear program (4), and from there explicitly construct an optimality-approaching sequence for (9). This is summarized as:
Proposition 4. Suppose σ > µ² and Assumption 2 holds. In the situation that (9) is not solvable, either of the following two cases must happen:
1. There exists an optimal solution for (4), given by (x_1^*, ρ*) ∈ [0, µ) × [µ², σ]. Then a sequence of feasible P^{(k)} ∈ P_3[0,∞) represented by (x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, p_1^{(k)}, p_2^{(k)}, p_3^{(k)}) will have objective value in (9) converging to the optimum, where
\[
\begin{aligned}
& x_1^{(k)} = x_1^* \text{ for all } k \\
& x_2^{(k)} = x_2^* - \delta^{(k)} = \frac{\rho^* - \mu x_1^*}{\mu - x_1^*} - \delta^{(k)} \text{ for all } k \\
& x_3^{(k)} \to \infty \text{ is a sequence such that } H(x_3^{(k)})/(x_3^{(k)})^2 \to \lambda \text{ as } k \to \infty \\
& p_1^{(k)} = p_1^* = \frac{\rho^* - \mu^2}{\rho^* - 2\mu x_1^* + x_1^{*2}} \\
& p_2^{(k)} = p_2^* - \gamma^{(k)} = \frac{(\mu - x_1^*)^2}{\rho^* - 2\mu x_1^* + x_1^{*2}} - \gamma^{(k)} \\
& p_3^{(k)} = \gamma^{(k)}
\end{aligned}
\tag{12}
\]
and
\[
\delta^{(k)} = \frac{\sigma - \rho^*}{p_2^* (x_3^{(k)} - x_2^*)}, \qquad \gamma^{(k)} = \frac{(\sigma - \rho^*)\, p_2^*}{p_2^* (x_3^{(k)} - x_2^*)^2 + (\sigma - \rho^*)} \tag{13}
\]
2. There does not exist an optimal solution for (4). Then a sequence of feasible P^{(k)} ∈ P_2[0,∞) represented by (x_1^{(k)}, x_2^{(k)}, p_1^{(k)}, p_2^{(k)}) will have objective value in (9) converging to the optimal value H(µ) + λ(σ − µ²), where
\[
\begin{aligned}
& x_1^{(k)} = \mu - \delta^{(k)} \\
& x_2^{(k)} \to \infty \text{ is a sequence such that } H(x_2^{(k)})/(x_2^{(k)})^2 \to \lambda \text{ as } k \to \infty \\
& p_1^{(k)} = 1 - \gamma^{(k)} \\
& p_2^{(k)} = \gamma^{(k)}
\end{aligned}
\tag{14}
\]
and
\[
\delta^{(k)} = \frac{\sigma - \mu^2}{x_2^{(k)} - \mu}, \qquad \gamma^{(k)} = \frac{\sigma - \mu^2}{(x_2^{(k)})^2 - 2\mu x_2^{(k)} + \sigma} \tag{15}
\]
This concludes the optimality structure of Case 2 (i.e. the heavy-tailed case) in Algorithm 1.
The proof of Proposition 4 is left to Appendix EC.1.4.
7. Data-driven Worst-case Tail Analysis
7.1. Relaxed Formulation and Combination with Interval Estimators
This section focuses on integrating our framework with data such as those in Examples 1 and 2 in Section 1. Still presuming a threshold a chosen in advance, we are now interested in the scenario where the exact density below a is unknown and must be estimated from the data. Using interval estimates for the quantities η, ν and β, the worst-case bound for E[h(X); X > a] is
\[
\begin{aligned}
\max \quad & \int_a^\infty h(x) f(x)\,dx \\
\text{subject to} \quad & \underline{\beta} \le \int_a^\infty f(x)\,dx \le \overline{\beta} \\
& \underline{\eta} \le f(a) \le \overline{\eta} \\
& f'_+(a) \ge -\overline{\nu} \\
& f(x) \text{ convex for } x \ge a \\
& f(x) \ge 0 \text{ for } x \ge a
\end{aligned}
\tag{16}
\]
where $[\underline{\beta}, \overline{\beta}]$ and $[\underline{\eta}, \overline{\eta}]$ are the joint (1 − α) confidence intervals (CIs) for P(X > a) and f(a), and $-\overline{\nu}$ is the joint lower confidence bound for f'_-(a). It is clear that the optimal value of (16) carries the following statistical guarantee:
Proposition 5. Suppose that $[\underline{\beta}, \overline{\beta}]$, $[\underline{\eta}, \overline{\eta}]$ and $-\overline{\nu}$ are the joint (1 − α) CIs for P(X > a) and f(a), and the lower confidence bound for f'_-(a). Then with probability (1 − α) the optimization (16) gives an upper bound for the worst-case value of E[h(X); X > a] under the assumption that f(x) is convex for x ≥ a.
The optimality characterization of (16) is the same as that of (1) (the proof is discussed in
Appendix EC.3):
Theorem 4. Suppose h is non-negative. Theorems 1 and 2 hold if (1) is replaced by (16).
Our algorithm for solving (16) (hereafter Algorithm 2) is conceptually similar to Algorithm 1 but
possesses additional parameters for handling inequality instead of equality constraints. Algorithm
2, which uses the parameters $\underline{\beta}$, $\overline{\beta}$, $\underline{\eta}$, $\overline{\eta}$ and $\overline{\nu}$ as inputs, requires two two-dimensional nonlinear
programs to distinguish between existence and non-existence of optimal density, and needs at most
two additional programs, each with at most three dimensions, to solve for an optimal density or
sequence of densities (see Appendix EC.2 for the details).
The next sub-section applies formulation (16) to the two examples in the Introduction and gives
the numerical results.
7.2. Synthetic Data: Example 2 Revisited
Consider the synthetic data set of size 200 in Example 2. This data set is actually generated from a lognormal distribution with parameters (µ, σ) = (0, 0.5), but we assume that only the data are available to us. We are interested in the quantity P(4 < X < 5), and for this we will solve program (16) to generate an upper bound that is valid with 95% confidence.
We compute the interval estimates for β, η and ν as follows. First, we obtain point estimates for these parameters through a standard kernel density estimator (KDE) in the R statistical package. To obtain interval estimates, we run 1,000 bootstrap resamples and take the appropriate quantiles of the 1,000 resampled point estimates. To account for the fact that three parameters are estimated simultaneously, we apply a Bonferroni correction, so that the confidence level used for each individual estimator is 1 − 0.05/3.
For a sense of how to choose a, Figure 11 shows the density and density derivative estimates
and compares them to those of the lognormal distribution. The KDE suggests that convexity holds
starting from around x = 1.5 (the point where the density derivative estimate turns from decreasing to increasing). Thus, it is reasonable to confine the choice of a to values larger than 1.5. In fact, this number is quite close to the true inflection point 1.15.
Since the data become progressively scarcer as x grows larger, and the KDE is designed to utilize
neighborhood data, the interval estimators for the necessary parameters β, η and ν become less
reliable for larger choices of a. For instance, Figure 11 shows that the bootstrapped KDE CI of
the density derivative covers the truth only up to x = 3.3. In general, a good choice of a should be as large as possible while still having some data in its neighborhood, so that the interval estimators for β, η and ν remain reliable; choosing a small a can make the tail extrapolation bound more conservative.
Figure 11 Bootstrapped kernel estimation of the distribution, density and density derivative for the synthetic data.
As a first attempt, we run Algorithm 2 using a = 3.3 to estimate an upper bound for the probability P(4 < X < 5), which gives 6.6 × 10⁻³, while the truth is 2.1 × 10⁻³. Thus, this estimated upper bound does cover the truth and also has the same order of magnitude. We perform the following two other procedures for comparison:
1. GPD approach: As discussed in Section 3.1, this is a common approach for tail modeling. Fit the data above a threshold u to the density function
\[
(1 - F(u))\, g_{\zeta,\beta}(x - u)
\]
where F(u) is the estimated ECDF at u, and g_{\zeta,\beta}(·) is the GPD density, whose distribution function is defined as
\[
G_{\zeta,\beta}(x) = \begin{cases}
1 - (1 + \zeta x/\beta)^{-1/\zeta} & \text{if } \zeta \neq 0 \\
1 - \exp(-x/\beta) & \text{if } \zeta = 0
\end{cases}
\]
for x ≥ 0 and β > 0. Set the threshold u to be 0.8, such that a linear trend is observed on both the exponential Q-Q plot and the mean excess plot of the data, as recommended by McNeil (1997). Estimate F(u) by the sample mean of I(X_i ≤ u), where I(·) denotes the indicator function, and obtain the parameter estimates ζ and β from maximum likelihood estimation. Then obtain a 95% CI for the quantity P(4 < X < 5) from the delta method. (A code sketch of this fitting step is given after the list.)
2. Worst-case approach with known parameter values: Assume β, η and ν are known at a= 3.3.
Then run Algorithm 1 to obtain the upper bound.
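A sketch of the GPD fitting step referenced in the first item (the delta-method CI is omitted for brevity; `data` and the threshold u are as above, and the variable names are ours):

```python
import numpy as np
from scipy.stats import genpareto

u = 0.8
exceed = data[data > u] - u                      # excesses over the threshold
zeta, _, psi = genpareto.fit(exceed, floc=0)     # MLE of shape and scale
F_u = np.mean(data <= u)                         # ECDF at u

# Point estimate of P(4 < X < 5) under the fitted tail model
p_hat = (1 - F_u) * (genpareto.cdf(5 - u, zeta, scale=psi)
                     - genpareto.cdf(4 - u, zeta, scale=psi))
```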
Table 1 shows the upper bounds obtained from the above approaches, and also records the obvious fact that the ECDF alone cannot estimate P(4 < X < 5), since there are no data in the interval [4, 5]. The worst-case approach with known parameters gives an upper bound of 2.9 × 10⁻³, which is smaller than that of the data-driven version. The difference between these numbers can be interpreted as the price of estimating β, η and ν. Note that for this particular setup, the worst-case approach correctly covers the true value, whereas GPD fitting actually gives an invalid upper bound, showing that either the data size or the threshold level is insufficient to support a good fit of the GPD. This is an instance where the worst-case approach outperforms GPD in terms of correctness.
The above discussion focuses on only one realization of the data set, which raises the question of whether the finding holds more generally. Therefore, we repeat the experiment 200 times to obtain an
empirical coverage probability, i.e. the probability that the estimated upper bound indeed covers
the truth. We also vary the position of the interval of interest in P (c <X < d) from [c, d] = [4,5] to
[9,10], and try two different a’s: 3.3 and 2.8. Tables 2 and 3 show the true probabilities, the mean
upper bounds from the 200 experiments, and the empirical coverage probabilities.
Method Estimated upper bound
Truth 2.1 × 10⁻³
Worst-case approach 6.6 × 10⁻³
GPD 1.5 × 10⁻³
Worst-case with known parameters 2.9 × 10⁻³
ECDF N/A
Table 1 Estimated upper bounds of the probability P(4 < X < 5) for the synthetic data in Example 2.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 7.96 × 10⁻³ 0.93
5 6 4.73 × 10⁻⁴ 4.34 × 10⁻³ 0.96
6 7 1.19 × 10⁻⁴ 2.98 × 10⁻³ 1
7 8 3.37 × 10⁻⁵ 2.27 × 10⁻³ 1
8 9 1.04 × 10⁻⁵ 1.83 × 10⁻³ 1
9 10 3.49 × 10⁻⁶ 1.53 × 10⁻³ 1
Table 2 Mean upper bounds and empirical coverage probabilities using the worst-case approach with threshold a = 3.3.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 1.09 × 10⁻² 1
5 6 4.73 × 10⁻⁴ 6.87 × 10⁻³ 1
6 7 1.19 × 10⁻⁴ 4.98 × 10⁻³ 1
7 8 3.37 × 10⁻⁵ 3.90 × 10⁻³ 1
8 9 1.04 × 10⁻⁵ 3.20 × 10⁻³ 1
9 10 3.49 × 10⁻⁶ 2.71 × 10⁻³ 1
Table 3 Mean upper bounds and empirical coverage probabilities using the worst-case approach with threshold a = 2.8.
The coverage probabilities in Tables 2 and 3 are mostly 1, which suggests that our procedure
is conservative. For a = 3.3 and intervals that are close to a, i.e. [c, d] = [4,5] and [5,6], the cov-
erage probability is not 1 but rather is close to the prescribed confidence level of 95%. Further
investigation reveals that our procedure fails to cover the truth only in the case when the joint
CI of the parameters η, β and ν does not contain the true values, which is consistent with the
rationale of our method. Although we have not tried lower values of a, it is very likely that the
coverage probabilities will stay mostly 1, but the mean upper bounds will increase as the level of
conservativeness increases.
As a comparison, Table 4 shows the results of the GPD fit using threshold u = 0.8. Here, all of the
coverage probabilities are far from the prescribed level of 95%, which suggests that either GPD is
the wrong parametric choice to use since the threshold is not high enough, or that the estimation
error of its parameters is too large due to the lack of data. Note that we have used a two-sided 95%
CI for the GPD approach here. If we had used a one-sided upper confidence bound, then the upper
bounding value would be even lower and the coverage probability would drop further. However, the
mean upper bounds using GPD fit do cover the truth in all cases. Since the coverage probability
is well below 95%, this suggests that the estimation of GPD parameters is highly sensitive to the
realization of data.
c d Truth Mean upper bound Coverage probability
4 5 2.13 × 10⁻³ 4.06 × 10⁻³ 0.67
5 6 4.73 × 10⁻⁴ 9.93 × 10⁻⁴ 0.54
6 7 1.19 × 10⁻⁴ 3.01 × 10⁻⁴ 0.44
7 8 3.37 × 10⁻⁵ 1.06 × 10⁻⁴ 0.37
8 9 1.04 × 10⁻⁵ 4.38 × 10⁻⁵ 0.30
9 10 3.49 × 10⁻⁶ 1.95 × 10⁻⁵ 0.30
Table 4 Mean upper bounds and empirical coverage probabilities using the GPD approach.
In summary, Tables 2, 3 and 4 show the pros and cons of our worst-case approach and GPD
fitting. GPD appears to perform well in capturing the magnitude of the target quantity, but its confidence upper bound can fall short of the prescribed coverage probability (in fact, it covers the truth only between 30% and 70% of the time in Table 4). On the other hand, our approach
gives a reasonably tight upper bound when the interval in consideration (i.e. [c, d]) is close to the
threshold a, and tends to be more conservative far out. This is a drawback, but sensibly so, given
that the uncertainty of extrapolation increases as it gets farther away from what is known.
Both our worst-case approach and GPD fitting require choosing a threshold parameter. In GPD
fitting, it is important to choose a threshold parameter high enough so that the GPD becomes a
valid model. GPD fitting, however, is difficult for a small data set when the lack of data prohibits
choosing a high threshold. On the other hand, the threshold in our worst-case approach can be
chosen much higher, because our method relies on the data below the threshold, not above it.
7.3. Fire Insurance Data: Example 1 Revisited
Consider the fire insurance data in Example 1. The quantity of interest is the expected payoff of a high-excess policy with reinsurance, given by h(x) = (x − 50)I(50 ≤ x < 200) + 150I(x ≥ 200). The
data set has only seven observations above 50.
We apply our worst-case approach to estimate an upper bound for the expected payoff by using
a = 29.03, the cutoff above which 15 observations are available. As in Section 7.2, we use the
bootstrapped KDE to obtain CIs for β, η and ν. The estimates in Figure 12 appear to be very
stable for this example, thanks to the relatively large data size.
We run Algorithm 2 to obtain a 95% confidence upper bound of 1.61. For comparison, we fit a GPD using threshold u = 10, following McNeil (1997) as the choice that roughly balances the bias-variance tradeoff and minimizes the length of the CI. The 95% CI from the GPD fit is [0.04, 0.16].
Thus, the worst-case approach gives an upper bound that is one order of magnitude higher, a
finding that resonates with that in Section 7.2. Our recommendation is that a modeler who cares
only about the order of magnitude would be better off choosing GPD, whereas a more risk-averse
Figure 12 Bootstrapped kernel estimation of the distribution, density and density derivative for the Danish fire losses data in Example 1.
modeler who wants a bound on the risk quantity with high probability guarantee would be better
off choosing the worst-case approach.
8. Conclusion
This paper proposed a worst-case, nonparametric approach to estimating tail quantities, based on finding an optimal tail extrapolation under a convexity assumption. The approach was developed to handle cases with tremendous uncertainty in the tail region. Its implementability was demonstrated by reducing the infinite-dimensional optimization problem to low-dimensional nonlinear programs. Moreover, a data-driven optimization formulation was developed via a suitable relaxation that takes into account interval estimates for parameters of the non-tail region. With two data sets, one synthetic and one real, the proposed approach was compared to existing tail-fitting techniques, and its relative strength in producing correct tail estimates in data-deficient environments was
demonstrated. The level of conservativeness of the proposed approach, which is a limitation of the
method, was also examined.
We suggest two extensions of our research. First is to generalize the proposed method to mul-
tivariate distributions, perhaps through separate modeling on the marginal distributions and the
dependency structure. Second is to study means to reduce the level of conservativeness. This can
involve mathematical transformations of the variable, the addition of extra information from other
available data (e.g., data that can be used for validation) and the incorporation of subjective expert
opinion.
References
Balkema, August A, Laurens De Haan. 1974. Residual life time at great age. The Annals of Probability
792–804.
Beirlant, Jan, Jozef L Teugels. 1992. Modeling large claims in non-life insurance. Insurance: Mathematics
and Economics 11(1) 17–29.
Ben-Tal, Aharon, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, Gijs Rennen. 2013. Robust
solutions of optimization problems affected by uncertain probabilities. Management Science 59(2)
341–357.
Bertsimas, Dimitris, Ioana Popescu. 2005. Optimal inequalities in probability theory: A convex optimization
approach. SIAM Journal on Optimization 15(3) 780–804.
Birge, John R, Jose H Dula. 1991. Bounding separable recourse functions with limited distribution informa-
tion. Annals of Operations Research 30(1) 277–298.
Birge, John R, Roger J-B Wets. 1987. Computing bounds for stochastic programming problems by means
of a generalized moment problem. Mathematics of Operations Research 12(1) 149–162.
Bucklew, James. 2004. Introduction to rare event simulation. Springer Science & Business Media.
Cule, Madeleine, Richard Samworth, Michael Stewart. 2010. Maximum likelihood estimation of a multi-
dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology) 72(5) 545–607.
Davis, Richard, Sidney Resnick. 1984. Tail estimates motivated by extreme value theory. The Annals of
Statistics 1467–1487.
Davison, Anthony C, Richard L Smith. 1990. Models for exceedances over high thresholds. Journal of the
Royal Statistical Society. Series B (Methodological) 393–442.
Delage, Erick, Yinyu Ye. 2010. Distributionally robust optimization under moment uncertainty with appli-
cation to data-driven problems. Operations Research 58(3) 595–612.
Dembo, Amir, Ofer Zeitouni. 2009. Large deviations techniques and applications, vol. 38. Springer Science
& Business Media.
Denisov, Denis, AB Dieker, Vsevolod Shneer, et al. 2008. Large deviations for random walks under subex-
ponentiality: the big-jump domain. The Annals of Probability 36(5) 1946–1991.
El Ghaoui, Laurent, A Nilim. 2005. Robust solutions to Markov decision problems with uncertain transition
matrices. Operations Research 53(5).
Embrechts, Paul, Rüdiger Frey, Alexander McNeil. 2005. Quantitative risk management. Princeton Series in
Finance, Princeton 10.
Embrechts, Paul, Claudia Klüppelberg, Thomas Mikosch. 1997. Modelling extremal events, vol. 33. Springer
Science & Business Media.
Fisher, Ronald Aylmer, Leonard Henry Caleb Tippett. 1928. Limiting forms of the frequency distribution of
the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical
Society , vol. 24. Cambridge University Press, 180–190.
Föllmer, Hans, Alexander Schied. 2011. Stochastic finance: an introduction in discrete time. Walter de
Gruyter.
Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2007. Large deviations in multifactor portfolio credit
risk. Mathematical Finance 17(3) 345–379.
Glasserman, Paul, Wanmo Kang, Perwez Shahabuddin. 2008. Fast simulation of multifactor portfolio credit
risk. Operations Research 56(5) 1200–1217.
Glasserman, Paul, Jingyi Li. 2005. Importance sampling for portfolio credit risk. Management Science
51(11) 1643–1656.
Gnedenko, Boris. 1943. Sur la distribution limite du terme maximum d'une série aléatoire. Annals of Mathematics 423–453.
Goh, Joel, Melvyn Sim. 2010. Distributionally robust optimization and its tractable approximations. Operations Research 58(4-part-1) 902–917.
Gumbel, Emil Julius. 2012. Statistics of Extremes. Courier Corporation.
Hannah, Lauren A, David B Dunson. 2013. Multivariate convex regression with adaptive partitioning. The
Journal of Machine Learning Research 14(1) 3261–3294.
Hansen, Lars Peter, Thomas J Sargent. 2008. Robustness. Princeton University Press.
Heidelberger, Philip. 1995. Fast simulation of rare events in queueing and reliability models. ACM Transactions on Modeling and Computer Simulation (TOMACS) 5(1) 43–85.
Hill, Bruce M, et al. 1975. A simple general approach to inference about the tail of a distribution. The
Annals of Statistics 3(5) 1163–1174.
Hogg, Robert V, Stuart A Klugman. 2009. Loss Distributions, vol. 249. John Wiley & Sons.
Hosking, Jonathan RM, James R Wallis. 1987. Parameter and quantile estimation for the generalized Pareto distribution. Technometrics 29(3) 339–349.
Iyengar, Garud N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2) 257–280.
Karr, Alan F. 1983. Extreme points of certain sets of probability measures, with applications. Mathematics
of Operations Research 8(1) 74–85.
Koenker, Roger, Ivan Mizera. 2010. Quasi-concave density estimation. The Annals of Statistics 2998–3027.
Lim, Eunji, Peter W Glynn. 2012. Consistency of multidimensional convex regression. Operations Research
60(1) 196–208.
McNeil, A. J. 1997. Estimating the tails of loss severity distributions using extreme value theory. The Journal
of the International Actuarial Association 27 117–137.
Nicola, Victor F, Marvin K Nakayama, Philip Heidelberger, Ambuj Goyal. 1993. Fast simulation of highly
dependable systems with general failure and repair processes. Computers, IEEE Transactions on 42(12)
1440–1452.
Petersen, Ian R, Matthew R James, Paul Dupuis. 2000. Minimax optimal control of stochastic uncertain
systems with relative entropy constraints. Automatic Control, IEEE Transactions on 45(3) 398–412.
Pickands III, James. 1975. Statistical inference using extreme order statistics. The Annals of Statistics
119–131.
Popescu, Ioana. 2005. A semidefinite programming approach to optimal-moment bounds for convex classes
of distributions. Mathematics of Operations Research 30(3) 632–657.
Seijo, Emilio, Bodhisattva Sen, et al. 2011. Nonparametric least squares estimation of a multivariate convex
regression function. The Annals of Statistics 39(3) 1633–1657.
Seregin, Arseni, Jon A Wellner. 2010. Nonparametric estimation of multivariate convex-transformed densities. The Annals of Statistics 38(6) 3751.
Smith, James E. 1995. Generalized Chebychev inequalities: theory and applications in decision analysis. Operations Research 43(5) 807–825.
Smith, Richard L. 1985. Maximum likelihood estimation in a class of nonregular cases. Biometrika 72(1)
67–90.
Talluri, Kalyan T, Garrett J Van Ryzin. 2006. The theory and practice of revenue management, vol. 68. Springer Science & Business Media.
Winkler, Gerhard. 1988. Extreme points of moment sets. Mathematics of Operations Research 13(4) 581–587.
Appendix
EC.1. Technical Proofs for Section 6
EC.1.1. Proofs in Step 1 in Section 6
Proof of Theorem 3. Without loss of generality, we can let $a=0$ (by replacing $f(x)$ with $f(x+a)$, and $h(x)$ with $h(x+a)$ respectively). Hence (1) becomes
\[
\begin{array}{ll}
\max & \int_0^\infty h(x)f(x)\,dx\\
\text{subject to} & \int_0^\infty f(x)\,dx=\beta\\
& f(0)=\eta\\
& f_+'(0)\ge-\nu\\
& f \text{ convex for } x\ge 0\\
& f(x)\ge 0 \text{ for } x\ge 0
\end{array}
\tag{EC.1}
\]
This can be rewritten as
\[
\begin{array}{ll}
\max & \int_0^\infty h(x)f(x)\,dx\\
\text{subject to} & \int_0^\infty f(x)\,dx=\beta\\
& f(0)=\eta\\
& -\nu\le f'(x)\le 0 \text{ a.e. for } x\ge 0\\
& f' \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.2}
\]
Here we have translated the convexity of $f$ into the constraint that $f'$ exists and is non-decreasing a.e. Note that we have removed the constraint $f(x)\ge 0$ in (EC.1) since it is redundant. To see this, suppose that $f(x_0)<0$ for some $x_0>0$. If $f(x)$ has non-positive slope for all $x\ge x_0$, then $\int_0^\infty f(x)\,dx=-\infty$. Else, if $f(x)$ has positive slope for some $x>x_0$, then $f(x)$ shoots to infinity because of convexity, and hence $\int_0^\infty f(x)\,dx=\infty$. Either case contradicts the constraint $\int_0^\infty f(x)\,dx=\beta<\infty$.

In (EC.2) we have also imposed the additional condition that $f'(x)\le 0$ for $x\ge 0$, because otherwise $f'(x_0)>0$ for some $x_0\ge 0$, and by convexity $f'(x)>f'(x_0)$ a.e. for $x\ge x_0$, hence $\int_0^\infty f(x)\,dx=\infty$, again violating the constraint $\int_0^\infty f(x)\,dx=\beta<\infty$.
As a key step, we re-express the decision variable in (EC.2) as the derivative $f'$, which exists a.e. For convenience, we let $\bar H(x)=\int_0^x h(u+a)\,du$ and $H(x)=\int_0^x \bar H(u)\,du$. This definition of $H$ is obviously consistent with the definition in (2). First consider the objective function; using integration by parts,
\[
\int_0^\infty h(x)f(x)\,dx=\bar H(x)f(x)\Big|_0^\infty-\int_0^\infty \bar H(x)f'(x)\,dx=-\int_0^\infty \bar H(x)f'(x)\,dx
=-H(x)f'(x)\Big|_0^\infty+\int_0^\infty H(x)\,df'(x)=\int_0^\infty H(x)\,df'(x)
\tag{EC.3}
\]
where the second and the fourth equalities follow from a simple observation in Lemma EC.9 (in Section EC.4.1), and the fact that $\bar H(x)$ grows at most linearly and $H(x)$ at most quadratically as $x\to\infty$ by the boundedness of $h$.
For the constraints in (EC.2), we can write
\[
\int_0^\infty f(x)\,dx=\int_0^\infty \frac{x^2}{2}\,df'(x)
\tag{EC.4}
\]
by merely viewing $h\equiv 1$ in (EC.3). Also, we can use integration by parts again to write
\[
f(0)=-\int_0^\infty f'(x)\,dx=-xf'(x)\Big|_0^\infty+\int_0^\infty x\,df'(x)=\int_0^\infty x\,df'(x)
\tag{EC.5}
\]
where the third equality follows from Lemma EC.9 again. Therefore, (EC.2) can be written as
\[
\begin{array}{ll}
\max & \int_0^\infty H(x)\,df'(x)\\
\text{subject to} & \int_0^\infty \frac{x^2}{2}\,df'(x)=\beta\\
& \int_0^\infty x\,df'(x)=\eta\\
& -\nu\le f'(x)\le 0 \text{ a.e. for } x\ge 0\\
& f' \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.6}
\]
Finally, let $p(x)=f'(x)/\nu+1$. Then (EC.6) can be rewritten as
\[
\begin{array}{ll}
\max & \nu\int_0^\infty H(x)\,dp(x)\\
\text{subject to} & \int_0^\infty x^2\,dp(x)=\frac{2\beta}{\nu}\\
& \int_0^\infty x\,dp(x)=\frac{\eta}{\nu}\\
& 0\le p(x)\le 1 \text{ a.e. for } x\ge 0\\
& p \text{ non-decreasing a.e. for } x\ge 0
\end{array}
\tag{EC.7}
\]
In the formulation (EC.7), one can identify $p$ as a well-defined probability distribution function, and so (EC.7) is nothing but maximizing $E[H(X)]$ (scaled by the constant $\nu$) subject to the constraints that the first and second moments equal $\eta/\nu$ and $2\beta/\nu$ respectively. This concludes the result. $\square$
EC.1.2. Proofs in Step 2 in Section 6 and Theorems 1 and 2
Proof of Lemma 1. It follows from Jensen's inequality that, for any valid probability distribution $P$, $E[X^2]\ge E[X]^2$, which gives $\sigma\ge\mu^2$ in (9). On the other hand, if $\sigma\ge\mu^2$, it is also rudimentary to find a measure in $\mathcal P[0,\infty)$ with first moment $\mu$ and second moment $\sigma$, e.g., a probability measure with two suitably chosen support points. Substituting $\mu=\eta/\nu$ and $\sigma=2\beta/\nu$, we get $\eta^2\le 2\beta\nu$. Lastly, equality holds in $E[X^2]\ge E[X]^2$ if and only if $P$ is a point mass. This concludes the second part of the lemma. $\square$
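For the converse direction just used, here is a quick numerical illustration (a sketch; the values of $\mu$, $\sigma$, and the fixed support point are assumed): whenever $\sigma>\mu^2$, fixing any support point $x_1<\mu$ and solving the two moment equations yields a valid two-point measure.

```python
# A quick sketch (assumed mu, sigma, x1): a two-point measure on [0, inf)
# matching a given mean mu and second moment sigma > mu^2.
mu, sigma = 2.0, 5.0          # any pair with sigma > mu^2 works
x1 = 0.5                      # arbitrary support point below mu
x2 = (sigma - mu * x1) / (mu - x1)
p1 = (sigma - mu**2) / (sigma - 2 * mu * x1 + x1**2)
p2 = 1.0 - p1

assert abs(p1 * x1 + p2 * x2 - mu) < 1e-9           # first moment matches
assert abs(p1 * x1**2 + p2 * x2**2 - sigma) < 1e-9  # second moment matches
```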
The proof of Proposition 1 follows from a few lemmas below. For convenience, we denote Pn(D)
as the set of all finite support distributions on D with at most n support points. So, similar to
the notation Pn[0,∞), Pn[0, c] is the set of all finite support distributions on [0, c] with at most n
support points.
Lemma EC.1 (Adapted from Theorem 3.2 in Winkler (1988)). The optimal value of
OPT (P[0, c]) is identical to that of OPT (P3[0, c]) for any c > 0. Similarly, the optimal value of
OPT (P[0,∞)) is identical to that of OPT (P3[0,∞)).
Lemma EC.1 implies that it suffices to focus on OPT($\mathcal P_3[0,\infty)$) in order to solve (9). For further convenience, we write $Z(P)=E[H(X)]$ for the objective function of (9) or OPT($\mathcal P_3[0,\infty)$), and we let $Z^*$ be the optimal value of (9) or OPT($\mathcal P_3[0,\infty)$). For any $P\in\mathcal P_n(D)$ and $D$, we can represent $P$ in terms of the support points $(x_1,\ldots,x_n)\in D^n$ (possibly some being identical) and masses $(p_1,\ldots,p_n)\in S_n$, and we use the shorthand notation $P\sim(x_1,\ldots,x_n,p_1,\ldots,p_n)$ for this representation.
Lemma EC.2. Consider OPT($\mathcal P_3[0,\infty)$) that is consistent. The optimal value is either achieved at some $P^*\in\mathcal P_3[0,\infty)$, or there exists a sequence of feasible $P^{(k)}\in\mathcal P_3[0,\infty)$ such that $Z(P^{(k)})$ converges to the optimal value. In the second case, each $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$ is such that either $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ for some $x_1^*,x_2^*\in[0,\infty)$ (possibly identical) and $(p_1^*,p_2^*)\in S_2$, or $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,\infty,\infty,1,0,0)$ for some $x_1^*\in[0,\infty)$.
Proof of Lemma EC.2. By the definition of optimality, clearly we have either the existence of an optimal solution lying in $\mathcal P_3[0,\infty)$, which is the first case in the lemma's statement, or there exists a feasible sequence $P^{(k)}\in\mathcal P_3[0,\infty)$ such that $Z(P^{(k)})\to Z^*$. Let $P^{(k)}$ be represented by the masses $(p_1^{(k)},p_2^{(k)},p_3^{(k)})\in S_3$ on the support points $(x_1^{(k)},x_2^{(k)},x_3^{(k)})\in[0,\infty)^3$ (possibly with some support points identical). Suppose that the $x_i^{(k)}$'s are all bounded above by a number, say $M$. Then, since $[0,M]^3\times S_3$ is a compact set, by the Bolzano–Weierstrass theorem we must have a subsequence of $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$, say $(x_1^{(k_j)},x_2^{(k_j)},x_3^{(k_j)},p_1^{(k_j)},p_2^{(k_j)},p_3^{(k_j)})$, that converges to $(x_1^*,x_2^*,x_3^*,p_1^*,p_2^*,p_3^*)$ in $[0,M]^3\times S_3$. Since $H(x)$ is continuous by construction, we have $Z(P^{(k_j)})=\sum_{i=1}^3 H(x_i^{(k_j)})p_i^{(k_j)}\to\sum_{i=1}^3 H(x_i^*)p_i^*=Z(P^*)$, where $P^*$ is represented by $(x_1^*,x_2^*,x_3^*,p_1^*,p_2^*,p_3^*)$. As $Z(P^{(k_j)})$ is a subsequence of $Z(P^{(k)})$, $Z(P^*)$ must equal $Z^*$, and so $P^*$ is an optimal solution, which reduces to the first case in the lemma. Therefore, for the second case, we should focus on the scenario that at least one $x_i^{(k)}$ satisfies $\limsup_{k\to\infty}x_i^{(k)}=\infty$.

Without loss of generality, we fix the convention that $x_1^{(k)}\le x_2^{(k)}\le x_3^{(k)}$. If at least one of the $x_i^{(k)}$ satisfies $\limsup_{k\to\infty}x_i^{(k)}=\infty$, we must have $\limsup_{k\to\infty}x_3^{(k)}=\infty$. Note that in order for $P^{(k)}$ to be feasible, $E^{(k)}[X]=\mu$ holds and so $x_1^{(k)}\le\mu$ for all $k$. We now distinguish two cases: either $x_2^{(k)}$ is uniformly bounded, say by a large number $M\ge\mu$, or $\limsup_{k\to\infty}x_2^{(k)}=\infty$ also. Consider the first case. First, we find a subsequence, indexed by $k_j$, such that $x_3^{(k_j)}\nearrow\infty$. Since $(x_1^{(k_j)},x_2^{(k_j)})\in[0,M]^2$, which is compact, we can choose a subsequence $k_{j'}$ such that $(x_1^{(k_{j'})},x_2^{(k_{j'})},x_3^{(k_{j'})})\to(x_1^*,x_2^*,\infty)$ where $(x_1^*,x_2^*)\in[0,M]^2$. Now, since $(p_1^{(k_{j'})},p_2^{(k_{j'})},p_3^{(k_{j'})})\in S_3$, which is also compact, we can choose a further subsequence $k_{j''}$ such that $(p_1^{(k_{j''})},p_2^{(k_{j''})},p_3^{(k_{j''})})\to(p_1^*,p_2^*,p_3^*)\in S_3$. Note that by the constraint $E^{(k_{j''})}[X^2]=p_1^{(k_{j''})}(x_1^{(k_{j''})})^2+p_2^{(k_{j''})}(x_2^{(k_{j''})})^2+p_3^{(k_{j''})}(x_3^{(k_{j''})})^2=\sigma$, we must have $p_3^{(k_{j''})}=(\sigma-p_1^{(k_{j''})}(x_1^{(k_{j''})})^2-p_2^{(k_{j''})}(x_2^{(k_{j''})})^2)/(x_3^{(k_{j''})})^2\le\sigma/(x_3^{(k_{j''})})^2\to 0$. In conclusion, in this case we end up being able to find a sequence of measures $P^{(k')}\sim(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})$ with $(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ where $x_1^*,x_2^*\in[0,\infty)$ and $(p_1^*,p_2^*)\in S_2$.
For the second case, namely when $\limsup_{k\to\infty}x_i^{(k)}=\infty$ for both $i=2$ and $3$, we can argue similarly that there is a sequence of measures $P^{(k')}\sim(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})$ such that $x_2^{(k')},x_3^{(k')}\to\infty$ and $p_2^{(k')},p_3^{(k')}\to 0$. In other words, $(x_1^{(k')},x_2^{(k')},x_3^{(k')},p_1^{(k')},p_2^{(k')},p_3^{(k')})\to(x_1^*,\infty,\infty,1,0,0)$ where $x_1^*\in[0,\infty)$. $\square$
Next, we borrow the following result:
Lemma EC.3 (Adapted from Theorem 5.1 in Birge and Dula (1991)). Consider OPT($\mathcal P[0,c]$) for any $c<\infty$. Suppose $H$ is convex with derivative $H'$ convex on $(0,\tilde c)$ and concave on $(\tilde c,c)$ for some $0\le\tilde c\le c$. Then an optimal solution exists and lies in $\mathcal P_2[0,c]$.
Lemma EC.4. Under Assumption 2, if OPT (P3[0,∞)) is solvable, then there is an optimal solu-
tion in P2[0,∞).
Proof of Lemma EC.4. Suppose there is an optimal solution P∗ to OPT (P3[0,∞)). Clearly, P∗
must have bounded support, say covered by [0,M ] for some large M . It follows that P∗ is also
the optimal solution to the restricted program OPT (P3[0,M ]). Now, by Lemma EC.1, the optimal
value of OPT (P3[0,M ]) is the same as that of OPT (P[0,M ]). In other words, P∗ is an optimal
solution to OPT (P[0,M ]).
By Lemma EC.3, OPT (P[0,M ]) has an optimal solution P∗∗ ∈ P2[0,M ]. We then must have
Z(P∗∗) =Z(P∗), so that P∗∗ is optimal for OPT (P3[0,∞)). This concludes the lemma.
Proof of Proposition 1. The second statement in Lemma EC.1 immediately implies the first
statement in Proposition 1. Then Lemma EC.4 implies the second statement in Proposition 1.
Proof of Theorem 1. Convert the original optimization (1) into (9) by Theorem 3. The optimal solution of (9) is characterized by Proposition 1. The result follows by noting that any $P\in\mathcal P_3[0,\infty)$ represented by $(x_1,x_2,x_3,p_1,p_2,p_3)$ (with potentially identical $x_i$'s) translates, through the one-to-one correspondence $f'(x+a)=\nu(p(x)-1)$ in Theorem 3, into
\[
f'(x)=\begin{cases}
-\nu & \text{for } a<x<x_1+a\\
-\nu(1-p_1) & \text{for } x_1+a<x<x_2+a\\
-\nu(1-p_1-p_2) & \text{for } x_2+a<x<x_3+a\\
0 & \text{for } x>x_3+a
\end{cases}
\]
and hence
\[
f(x)=\begin{cases}
\eta-\nu(x-a) & \text{for } a\le x\le x_1+a\\
\eta-\nu x_1-\nu(1-p_1)(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
\eta-\nu x_1-\nu(1-p_1)(x_2-x_1)-\nu(1-p_1-p_2)(x-a-x_2) & \text{for } x_2+a\le x\le x_3+a\\
0 & \text{for } x\ge x_3+a
\end{cases}
\]
$\square$
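The piecewise-linear form above is easy to instantiate. The sketch below (support points and masses are assumed for illustration) builds such a density; note that continuity at $x_3+a$ forces $\eta=\nu(p_1x_1+p_2x_2+p_3x_3)$, i.e., $\eta/\nu$ must equal the first moment.

```python
# A sketch of the Theorem 1 worst-case density (assumed support/masses):
# piecewise linear with kinks at x1+a, x2+a, x3+a, hitting 0 at x3+a.
import numpy as np

a, nu = 1.0, 0.5
x1, x2, x3 = 0.4, 1.5, 3.0
p1, p2, p3 = 0.3, 0.5, 0.2
eta = nu * (p1 * x1 + p2 * x2 + p3 * x3)   # continuity: eta/nu = E[X]

def f(x):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a, x <= x3 + a],
        [eta - nu * (x - a),
         eta - nu * x1 - nu * (1 - p1) * (x - a - x1),
         eta - nu * x1 - nu * (1 - p1) * (x2 - x1)
             - nu * (1 - p1 - p2) * (x - a - x2)],
        default=0.0)

print(f(np.linspace(a, x3 + a, 7)))  # non-negative, non-increasing, convex
```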
Proof of Theorem 2. Assume w.l.o.g. that $h$ is non-negative. Then Assumption 1 is equivalent to Assumption 2. Similar to the proof of Theorem 1, the result here follows by noting that any $P\in\mathcal P_2[0,\infty)$ represented by $(x_1,x_2,p_1,p_2)$ (with potentially identical $x_i$'s) translates, through the one-to-one correspondence $f'(x+a)=\nu(p(x)-1)$ in Theorem 3, into
\[
f'(x)=\begin{cases}
-\nu & \text{for } a<x<x_1+a\\
-\nu p_2 & \text{for } x_1+a<x<x_2+a\\
0 & \text{for } x>x_2+a
\end{cases}
\]
and hence
\[
f(x)=\begin{cases}
\eta-\nu(x-a) & \text{for } a\le x\le x_1+a\\
\eta-\nu x_1-\nu p_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
0 & \text{for } x\ge x_2+a
\end{cases}
\]
$\square$
EC.1.3. Proofs in Step 3 in Section 6
Proof of Proposition 2. Suppose for now that (9) admits an optimal probability measure in $\mathcal P_2[0,\infty)$ represented by $(x_1,x_2,p_1,p_2)$ (where $x_1,x_2$ must be distinct since otherwise $\sigma=\mu^2$). Adopting a similar line of analysis as in Birge and Dula (1991), we let $x_1\le x_2$ without loss of generality. For a two-support-point distribution to be feasible, we must have $\mu^2<\sigma$ and $x_1<\mu$. The constraints enforce $p_1x_1+p_2x_2=\mu$, $p_1x_1^2+p_2x_2^2=\sigma$ and $p_1+p_2=1$. Hence $p_2=1-p_1$, which gives $p_1x_1+(1-p_1)x_2=\mu$ and $p_1x_1^2+(1-p_1)x_2^2=\sigma$. From the first equation we get $p_1=(x_2-\mu)/(x_2-x_1)$. Putting this into $p_1x_1^2+(1-p_1)x_2^2=\sigma$, we further get $x_2=(\sigma-\mu x_1)/(\mu-x_1)$. Now, putting this in turn into $p_1=(x_2-\mu)/(x_2-x_1)$, we obtain $p_1=(\sigma-\mu^2)/(\sigma-2\mu x_1+x_1^2)$ and hence $p_2=1-p_1=(\mu-x_1)^2/(\sigma-2\mu x_1+x_1^2)$. Therefore, the objective value of (9) is given by
\[
\max_{x_1\in[0,\mu)} p_1H(x_1)+p_2H(x_2)=\max_{x_1\in[0,\mu)} \frac{\sigma-\mu^2}{\sigma-2\mu x_1+x_1^2}H(x_1)+\frac{(\mu-x_1)^2}{\sigma-2\mu x_1+x_1^2}H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)
\tag{EC.8}
\]
which is exactly $\max_{x_1\in[0,\mu)}W(x_1)$ in (3).
Let $W^*$ be the optimal value of (3) or (EC.8). We distinguish two cases: either $W^*=\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for some $\epsilon>0$, or not. We will argue that these correspond exactly to the two cases in the proposition.

Consider the first case, i.e., $W^*=\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for some $\epsilon>0$. Since $[0,\mu-\epsilon]$ is compact, there exists an optimal solution $x_1^*\in[0,\mu-\epsilon]$ for $\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$, which is also an optimal solution for (3). Clearly (3) is then solvable.

We now argue that this implies (9) is solvable and has an optimal solution in $\mathcal P_2[0,\infty)$. First, note that by Lemma EC.4, the program OPT($\mathcal P[0,c]$) for any $c<\infty$ (sufficiently large so that the program is consistent) must admit an optimal solution in $\mathcal P_2[0,\infty)$. This optimal solution can be found by solving (3), but defining $H(x)=-\infty$ for $x>c$. Now, the optimal solution $x_1^*$ for (3), together with its corresponding values for $x_2^*,p_1^*,p_2^*$ given in (10), then clearly gives an optimal solution for OPT($\mathcal P[0,c]$) for any sufficiently large $c<\infty$.

Then we argue that this must imply that $P^*\sim(x_1^*,x_2^*,p_1^*,p_2^*)$ as given by (3) and (10) is an optimal solution for (9). Suppose the latter does not hold; then there must exist some $P'\in\mathcal P_3[0,\infty)$ with $Z(P')>Z(P^*)$, where $Z(\cdot)$ is the objective function of (9). Let the support points of $P'$ be $(x_1',x_2',x_3')$. This implies that $P^*$ is not an optimal solution for OPT($\mathcal P_3[0,\max\{x_1',x_2',x_3'\}]$), or equivalently OPT($\mathcal P[0,\max\{x_1',x_2',x_3'\}]$). This leads to a contradiction.
For the second case, we have $W^*>\max_{x_1\in[0,\mu-\epsilon]}W(x_1)$ for any $\epsilon>0$. The only scenario in which this can occur is that there exists a sequence $x_1^{(k)}$ such that $x_1^{(k)}\to\mu$ and $W^*=\lim_{k\to\infty}W(x_1^{(k)})$ (this sequence can be found, for example, by picking an $x_1^{(k)}$ in $[\mu-\epsilon_k,\mu)$ such that $W(x_1^{(k)})>\max_{x_1\in[0,\mu-\epsilon_k]}W(x_1)$ as $\epsilon_k\searrow 0$). In this case, there does not exist an optimal solution $P^*\in\mathcal P_2[0,\infty)$ for (9), and so (9) does not have any optimal solution by Lemma EC.4. By Proposition 1, we must have a sequence of probability measures $P^{(k)}\in\mathcal P_3[0,\infty)$ that approaches the optimal value of (9). Finally, by Lemma EC.2 this sequence can be identified as $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})$, such that either $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,x_2^*,\infty,p_1^*,p_2^*,0)$ for some $x_1^*,x_2^*\in[0,\infty)$ (possibly identical) and $(p_1^*,p_2^*)\in S_2$, or $(x_1^{(k)},x_2^{(k)},x_3^{(k)},p_1^{(k)},p_2^{(k)},p_3^{(k)})\to(x_1^*,\infty,\infty,1,0,0)$ for some $x_1^*\in[0,\infty)$. $\square$
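To make the dichotomy concrete, the following sketch (the cost $h(x)=1-e^{-x}$ and the moment values are assumed for illustration, with $a=0$, so that $H(x)=x^2/2-x+1-e^{-x}$ and $\lambda=1/2$) scans $W(x_1)$ over a fine grid on $[0,\mu)$; a maximizer in the interior corresponds to the first case, while a supremum approached only as $x_1\to\mu$ signals the second case.

```python
# A sketch (assumed h, mu, sigma; grid search stands in for the
# one-dimensional nonlinear program (3)).
import numpy as np

mu, sigma = 2.0, 5.0                      # assumed moments with sigma > mu^2

def H(x):                                 # H for h(x) = 1 - exp(-x), a = 0
    return x**2 / 2 - x + 1 - np.exp(-x)

def W(x1):
    x2 = (sigma - mu * x1) / (mu - x1)
    p1 = (sigma - mu**2) / (sigma - 2 * mu * x1 + x1**2)
    return p1 * H(x1) + (1 - p1) * H(x2)

grid = np.linspace(0.0, mu - 1e-6, 100000)
vals = W(grid)
print(grid[np.argmax(vals)], vals.max())  # candidate maximizer/value of (3)
```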
EC.1.4. Proofs in Step 4 in Section 6
Propositions 3 and 4 are closely related and we shall prove them together.
Proof of Propositions 3 and 4. By Proposition 2, if (3) (and subsequently (9) or
OPT (P3[0,∞))) is not solvable, then there must be a feasible sequence P(k) ∈ P3[0,∞) that
approaches optimality, which can be denoted by P(k) ∼ (x(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 ). There are
two cases: either (x(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 )→ (x∗1, x
∗2,∞, p∗1, p∗2,0) for some x∗1, x
∗2 ∈ [0,∞) and
(p∗1, p∗2) ∈ S2 (with x∗1, x
∗2 possibly identical), or (x
(k)1 , x
(k)2 , x
(k)3 , p
(k)1 , p
(k)2 , p
(k)3 )→ (x∗1,∞,∞,1,0,0)
for some x∗1 ∈ [0,∞). We shall characterize the form of the limiting objective value and feasible
region for both cases.
Consider the first case, i.e., $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to x_2^*$ and $x_3^{(k)}\to\infty$. By the second constraint in (9) we must have $p_3^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2)/(x_3^{(k)})^2$, so that
\[
\lim_{k\to\infty}p_3^{(k)}x_3^{(k)}=\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{x_3^{(k)}}=0
\]
Therefore, the first constraint in (9) entails that $p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}+p_3^{(k)}x_3^{(k)}\to p_1^*x_1^*+p_2^*x_2^*=\mu$ as $k\to\infty$. Then, the second constraint in turn entails that $p_3^{(k)}(x_3^{(k)})^2=\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2\to\sigma-p_1^*(x_1^*)^2-p_2^*(x_2^*)^2\ge 0$, and so $p_1^*(x_1^*)^2+p_2^*(x_2^*)^2\le\sigma$.
Note that the objective value $Z(P^{(k)})$ for (9) satisfies
\[
\lim_{k\to\infty}Z(P^{(k)})=\lim_{k\to\infty}p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
=p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{(x_3^{(k)})^2}H(x_3^{(k)})
\le p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lambda\big(\sigma-p_1^*(x_1^*)^2-p_2^*(x_2^*)^2\big)
\]
by the continuity of $H$ and the definition of $\lambda$. This implies that the program
\[
\begin{array}{ll}
\max_{x_1,x_2,p_1,p_2} & p_1H(x_1)+p_2H(x_2)+\lambda(\sigma-p_1x_1^2-p_2x_2^2)\\
\text{subject to} & p_1x_1+p_2x_2=\mu\\
& p_1x_1^2+p_2x_2^2\le\sigma\\
& (x_1,x_2)\in[0,\infty)^2,\ (p_1,p_2)\in S_2
\end{array}
\tag{EC.9}
\]
provides an upper bound to the optimal value of (9). Note that the objective function in (EC.9) can be written as $p_1(H(x_1)-\lambda x_1^2)+p_2(H(x_2)-\lambda x_2^2)+\lambda\sigma=p_1r(x_1)+p_2r(x_2)+\lambda\sigma$, where $r(x)$ is defined by $r(x)=H(x)-\lambda x^2$. Hence (EC.9) is equivalent to
\[
\begin{array}{ll}
\max & E[r(X)]+\lambda\sigma\\
\text{subject to} & E[X]=\mu\\
& E[X^2]\le\sigma\\
& P\in\mathcal P_2[0,\infty)
\end{array}
\tag{EC.10}
\]
Now consider the second case, namely when $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to\infty$ and $x_3^{(k)}\to\infty$. Similar to the analysis above, from the second constraint in (9), we must have $p_2^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_3^{(k)}(x_3^{(k)})^2)/(x_2^{(k)})^2$ and $p_3^{(k)}=(\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2)/(x_3^{(k)})^2$, so that $\lim_{k\to\infty}p_2^{(k)}x_2^{(k)}=\lim_{k\to\infty}p_3^{(k)}x_3^{(k)}=0$. Note that, from the second constraint in (9), we have $p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2=\sigma-p_1^{(k)}(x_1^{(k)})^2\to\sigma-p_1^*(x_1^*)^2$. For the objective value of (9), we have
\[
\lim_{k\to\infty}Z(P^{(k)})=\lim_{k\to\infty}p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
\]
\[
=p_1^*H(x_1^*)+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_3^{(k)}(x_3^{(k)})^2}{(x_2^{(k)})^2}H(x_2^{(k)})+\lim_{k\to\infty}\frac{\sigma-p_1^{(k)}(x_1^{(k)})^2-p_2^{(k)}(x_2^{(k)})^2}{(x_3^{(k)})^2}H(x_3^{(k)})
\]
\[
\le p_1^*H(x_1^*)+\lambda\Big(2\sigma-2p_1^*(x_1^*)^2-\lim_{k\to\infty}\big(p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2\big)\Big)
=p_1^*H(x_1^*)+\lambda\big(2\sigma-2p_1^*(x_1^*)^2-(\sigma-p_1^*(x_1^*)^2)\big)
=p_1^*H(x_1^*)+\lambda(\sigma-p_1^*(x_1^*)^2)
\tag{EC.11}
\]
Note that $p_1^*$ must be $1$ since $p_2^{(k)},p_3^{(k)}\to 0$, and from the first constraint in (9) we must have $x_1^*=\mu$. Thus (EC.11) reduces to $H(\mu)+\lambda(\sigma-\mu^2)$, which is achieved by plugging the delta measure at $\mu$ into the program (EC.9). Hence it suffices to consider (EC.9) or (EC.10) in the sequel.
We shall show that (EC.10) not only provides an upper bound to (9) but actually matches its optimal value. For this we shall explicitly construct a sequence of $P^{(k)}\in\mathcal P_3[0,\infty)$, feasible for (9), whose objective value converges to the optimal value of (EC.10). For this purpose, we shall first solve (EC.10).
Suppose that an optimal solution exists for (EC.10). This optimal solution has either two-point or one-point support. In the first case, this optimal solution can be obtained by putting
\[
\rho^*=p_1^*(x_1^*)^2+p_2^*(x_2^*)^2,\quad x_2^*=\frac{\rho^*-\mu x_1^*}{\mu-x_1^*},\quad p_1^*=\frac{\rho^*-\mu^2}{\rho^*-2\mu x_1^*+(x_1^*)^2},\quad p_2^*=\frac{(\mu-x_1^*)^2}{\rho^*-2\mu x_1^*+(x_1^*)^2}
\tag{EC.12}
\]
and searching for $x_1^*\in[0,\mu)$ and $\rho^*\in[\mu^2,\sigma]$ in (4). This can be seen by a similar argument leading to the optimization (3), but with $\rho$ also varying here. Now, in the case when the optimal solution has one-point support, it must be the point mass at $\mu$, which gives an objective value $r(\mu)+\lambda\sigma$. There is no optimal solution to (4) in this scenario; i.e., there exists only a sequence $(x_1^{(k)},\rho^{(k)})$ such that $x_1^{(k)}\to\mu$, $\rho^{(k)}\to\mu^2$, and $V(x_1^{(k)},\rho^{(k)})\nearrow V^*$, where $V^*$ is the optimal value of (4).
Next, consider also the situation where there does not exist an optimal solution for (EC.10). By a similar argument as that in Lemma EC.2, we must have a feasible sequence $P^{(k)}\sim(x_1^{(k)},x_2^{(k)},p_1^{(k)},p_2^{(k)})\in\mathcal P_2[0,\infty)$ that converges to optimality and such that $x_1^{(k)}\to x_1^*$, $x_2^{(k)}\to\infty$, $p_1^{(k)}\to 1$ and $p_2^{(k)}\le(\sigma-p_1^{(k)}(x_1^{(k)})^2)/(x_2^{(k)})^2\to 0$. Hence $x_1^*=\mu$. Moreover, since $\limsup_{x\to\infty}r(x)/x^2=\limsup_{x\to\infty}H(x)/x^2-\lambda=0$ by construction, the objective value of (EC.10) will be $p_1^{(k)}r(x_1^{(k)})+p_2^{(k)}r(x_2^{(k)})+\lambda\sigma\le p_1^{(k)}r(x_1^{(k)})+\big((\sigma-p_1^{(k)}(x_1^{(k)})^2)/(x_2^{(k)})^2\big)r(x_2^{(k)})+\lambda\sigma\to r(\mu)+\lambda\sigma$, which is equal to the objective value of (EC.10) attained by the delta measure at $\mu$. This implies a contradiction, and so (EC.10) must be solvable.
In Appendix EC.4.3, we shall show that in the case where (4) is solvable, and hence there is an optimal solution of (EC.10) that has two-point support, the sequence defined by $x_1^{(k)}=x_1^*$, $x_2^{(k)}=x_2^*-\delta^{(k)}$, $x_3^{(k)}\to\infty$ such that $H(x_3^{(k)})/(x_3^{(k)})^2\to\lambda$ as $k\to\infty$, $p_1^{(k)}=p_1^*$, $p_2^{(k)}=p_2^*-\gamma^{(k)}$ and $p_3^{(k)}=\gamma^{(k)}$, where $\delta^{(k)}$ and $\gamma^{(k)}$ are defined in (13), is feasible for (9) and has objective value converging to the optimal value of (EC.10). On the other hand, if (4) is not solvable, and hence the optimal solution of (EC.10) is a delta measure at $\mu$, the sequence defined by $x_1^{(k)}=\mu-\delta^{(k)}$, $x_2^{(k)}\to\infty$ such that $H(x_2^{(k)})/(x_2^{(k)})^2\to\lambda$ as $k\to\infty$, $p_1^{(k)}=1-\gamma^{(k)}$ and $p_2^{(k)}=\gamma^{(k)}$, where $\delta^{(k)}$ and $\gamma^{(k)}$ are defined in (15), is feasible for (9) and has objective value converging to the optimal value of (EC.10), namely $r(\mu)+\lambda\sigma$. $\square$
EC.2. Procedure for Solving (16)
For convenience, recall $H(x)=\int_0^x\int_0^u h(v+a)\,dv\,du$ and $\lambda=\limsup_{x\to\infty}H(x)/x^2<\infty$, and introduce the additional notation:
\[
\mathcal W(x,\omega,\rho)=\frac{\rho-\omega^2}{\rho-2\omega x+x^2}H(x)+\frac{(\omega-x)^2}{\rho-2\omega x+x^2}H\!\left(\frac{\rho-\omega x}{\omega-x}\right)
\]
\[
\mathcal V(x,\omega,\rho,\kappa)=\frac{\rho-\omega^2}{\rho-2\omega x+x^2}\big(H(x)-\lambda x^2\big)+\frac{(\omega-x)^2}{\rho-2\omega x+x^2}\left(H\!\left(\frac{\rho-\omega x}{\omega-x}\right)-\lambda\left(\frac{\rho-\omega x}{\omega-x}\right)^2\right)+\lambda\kappa
\]
\[
K_2(x;x_1,x_2,s_0,s_1,s_2)=\begin{cases}
s_0-s_1(x-a) & \text{for } a\le x\le x_1+a\\
s_0-s_1x_1-s_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
0 & \text{for } x\ge x_2+a
\end{cases}
\]
\[
K_3(x;x_1,x_2,x_3,s_0,s_1,s_2,s_3)=\begin{cases}
s_0-s_1(x-a) & \text{for } a\le x\le x_1+a\\
s_0-s_1x_1-s_2(x-a-x_1) & \text{for } x_1+a\le x\le x_2+a\\
s_0-s_1x_1-s_2(x_2-x_1)-s_3(x-a-x_2) & \text{for } x_2+a\le x\le x_3+a\\
0 & \text{for } x\ge x_3+a
\end{cases}
\]
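These four functions translate directly into code. The sketch below (a transcription for illustration only; $H$, $\lambda$, and $a$ are assumed user-supplied inputs, and the function names are ours) is reused by the driver shown after Algorithm 2.

```python
# Sketch implementations of the helper notation above (H, lam, a are
# assumed user-supplied; np.select encodes the piecewise-linear cases).
import numpy as np

def W_fun(x, omega, rho, H):
    d = rho - 2 * omega * x + x**2
    y = (rho - omega * x) / (omega - x)
    return (rho - omega**2) / d * H(x) + (omega - x)**2 / d * H(y)

def V_fun(x, omega, rho, kappa, H, lam):
    d = rho - 2 * omega * x + x**2
    y = (rho - omega * x) / (omega - x)
    return ((rho - omega**2) / d * (H(x) - lam * x**2)
            + (omega - x)**2 / d * (H(y) - lam * y**2) + lam * kappa)

def K2(x, x1, x2, s0, s1, s2, a):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a],
        [s0 - s1 * (x - a), s0 - s1 * x1 - s2 * (x - a - x1)],
        default=0.0)

def K3(x, x1, x2, x3, s0, s1, s2, s3, a):
    x = np.asarray(x, dtype=float)
    return np.select(
        [x <= x1 + a, x <= x2 + a, x <= x3 + a],
        [s0 - s1 * (x - a),
         s0 - s1 * x1 - s2 * (x - a - x1),
         s0 - s1 * x1 - s2 * (x2 - x1) - s3 * (x - a - x2)],
        default=0.0)
```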
Algorithm 2: Procedure for Solving (16)

Inputs:
1. The cost function $h$ that satisfies Assumption 1 and is bounded and non-negative.
2. The parameters $\underline\beta,\bar\beta,\underline\eta,\bar\eta,\nu>0$.

Procedure:
Let
\[
\underline\mu=\frac{\underline\eta}{\nu},\quad \bar\mu=\frac{\bar\eta}{\nu},\quad \underline\sigma=\frac{2\underline\beta}{\nu},\quad \bar\sigma=\frac{2\bar\beta}{\nu}
\]
1. If $\underline\mu^2>\bar\sigma$, there is no feasible density.
2. If $\underline\mu^2=\bar\sigma$, then:
- Optimal value: $\nu H(\underline\mu)$
- There is only one feasible density, given by $f(x)=\underline\eta-\nu(x-a)$.
3. Otherwise run Subprograms 1 and 2 below. Record the optimal value and density/sequence of densities for each of them as a "candidate" for the optimum. The final optimal value for (16) is $\max\{Z_1^*,Z_2^*\}$, where $Z_1^*$ and $Z_2^*$ are the optimal values for Subprograms 1 and 2 respectively. A final optimal density/sequence of densities is given by the corresponding candidate maximizer.
Subprogram 1: Proceed only if $\bar\mu^2<\bar\sigma$. Consider
\[
\max_{x_1\in[0,\bar\mu),\ \rho\in[\underline\sigma\vee\bar\mu^2,\ \bar\sigma]}\mathcal W(x_1,\bar\mu,\rho)
\tag{EC.13}
\]
Either one of the following two cases occurs:

Case 1: There is an optimal solution for (EC.13), given by $(x_1^*,\rho^*)\in[0,\bar\mu)\times[\underline\sigma\vee\bar\mu^2,\bar\sigma]$. Then:
- Candidate optimal value: $\nu\mathcal W(x_1^*,\bar\mu,\rho^*)$
- Candidate optimal density: $f(x)=K_2(x;x_1^*,x_2^*,\bar\eta,\nu,\nu p_2^*)$ where
\[
x_2^*=\frac{\rho^*-\bar\mu x_1^*}{\bar\mu-x_1^*},\quad p_2^*=\frac{(\bar\mu-x_1^*)^2}{\rho^*-2\bar\mu x_1^*+(x_1^*)^2}
\]

Case 2: There does not exist an optimal solution for (EC.13), i.e., $\mathcal W^*>\mathcal W(x_1,\bar\mu,\rho)$ for any $x_1\in[0,\bar\mu)$, $\rho\in[\underline\sigma\vee\bar\mu^2,\bar\sigma]$, where $\mathcal W^*$ is the optimal value of (EC.13). Then there exists a sequence of feasible solutions $(x_1^{(k)},\rho^{(k)})$, with $x_1^{(k)}\to\bar\mu$, such that $\mathcal W(x_1^{(k)},\bar\mu,\rho^{(k)})\to\mathcal W^*$.

Consider further the optimization
\[
\max_{x_1\in[0,\bar\mu),\ \rho\in[\bar\mu^2,\bar\sigma]}\mathcal V(x_1,\bar\mu,\rho,\bar\sigma)
\tag{EC.14}
\]
One of the following two sub-cases occurs:

Case 2.1: There exists an optimal solution for (EC.14), given by $(x_1^*,\rho^*)\in[0,\bar\mu)\times[\bar\mu^2,\bar\sigma]$. Then:
- Candidate optimal value: $\nu\mathcal V(x_1^*,\bar\mu,\rho^*,\bar\sigma)$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_3(x;x_1^*,x_2^*-\delta^{(k)},x_3^{(k)},\bar\eta,\nu,\nu p_2^*,\nu\gamma^{(k)})$ where
\[
x_2^*=\frac{\rho^*-\bar\mu x_1^*}{\bar\mu-x_1^*},\quad p_2^*=\frac{(\bar\mu-x_1^*)^2}{\rho^*-2\bar\mu x_1^*+(x_1^*)^2}
\]
and
\[
x_3^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)},\quad \gamma^{(k)}=\frac{(\bar\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\bar\sigma-\rho^*)}
\]

Case 2.2: There does not exist an optimal solution for (EC.14). This occurs when there is a sequence $(x_1^{(k)},\rho^{(k)})$, with $x_1^{(k)}\to\bar\mu$, such that $\mathcal V(x_1^{(k)},\bar\mu,\rho^{(k)},\bar\sigma)\nearrow\mathcal V^*$, where $\mathcal V^*$ is the optimal value of (EC.14), but $\mathcal V^*>\max_{x_1\in[0,\bar\mu-\epsilon],\ \rho\in[\bar\mu^2,\bar\sigma]}\mathcal V(x_1,\bar\mu,\rho,\bar\sigma)$ for any $\epsilon>0$. Then:
- Candidate optimal value: $\nu(H(\bar\mu)+\lambda(\bar\sigma-\bar\mu^2))$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_2(x;\bar\mu-\delta^{(k)},x_2^{(k)},\bar\eta,\nu,\nu\gamma^{(k)})$ where
\[
x_2^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\bar\mu^2}{x_2^{(k)}-\bar\mu},\quad \gamma^{(k)}=\frac{\bar\sigma-\bar\mu^2}{(x_2^{(k)})^2-2\bar\mu x_2^{(k)}+\bar\sigma}
\]
Subprogram 2: Consider
\[
\max_{x_1\in[0,\omega),\ \omega\in[\underline\mu,\ \bar\mu\wedge\sqrt{\bar\sigma}]}\mathcal W(x_1,\omega,\bar\sigma)
\tag{EC.15}
\]
Either one of the following two cases occurs:

Case 1: There is an optimal solution for (EC.15), given by $(x_1^*,\omega^*)\in[0,\bar\mu\wedge\sqrt{\bar\sigma})\times[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$ with $x_1^*<\omega^*$. Then:
- Candidate optimal value: $\nu\mathcal W(x_1^*,\omega^*,\bar\sigma)$
- Candidate optimal density: $f(x)=K_2(x;x_1^*,x_2^*,\nu\omega^*,\nu,\nu p_2^*)$ where
\[
x_2^*=\frac{\bar\sigma-\omega^*x_1^*}{\omega^*-x_1^*},\quad p_2^*=\frac{(\omega^*-x_1^*)^2}{\bar\sigma-2\omega^*x_1^*+(x_1^*)^2}
\]

Case 2: There does not exist an optimal solution for (EC.15), i.e., $\mathcal W^*>\mathcal W(x_1,\omega,\bar\sigma)$ for any $x_1\in[0,\omega)$, $\omega\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, where $\mathcal W^*$ is the optimal value of (EC.15). Then there exists a sequence of feasible solutions $(x_1^{(k)},\omega^{(k)})$, with $\omega^{(k)}\to\omega^*$ and $x_1^{(k)}\to\omega^*$ for some $\omega^*\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, such that $\mathcal W(x_1^{(k)},\omega^{(k)},\bar\sigma)\to\mathcal W^*$. One of the following three cases must happen:

Case 2.1: If $\omega^*=\bar\mu\wedge\sqrt{\bar\sigma}$, then:
- Candidate optimal value: $\nu H(\bar\mu\wedge\sqrt{\bar\sigma})$
- Candidate optimal density: $f(x)=\nu(\bar\mu\wedge\sqrt{\bar\sigma})-\nu(x-a)$ for $x\ge a$

Otherwise, consider further the optimization
\[
\max_{x_1\in[0,\omega),\ \omega\in[\underline\mu,\ \bar\mu\wedge\sqrt{\bar\sigma}],\ \rho\in[\omega^2,\bar\sigma]}\mathcal V(x_1,\omega,\rho,\bar\sigma)
\tag{EC.16}
\]
Either of the following two cases occurs:

Case 2.2: There exists an optimal solution for (EC.16), given by $(x_1^*,\omega^*,\rho^*)\in[0,\bar\mu\wedge\sqrt{\bar\sigma})\times[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]\times[\underline\mu^2,\bar\sigma]$ with $x_1^*<\omega^*$ and $\rho^*\ge(\omega^*)^2$. Then:
- Candidate optimal value: $\nu\mathcal V(x_1^*,\omega^*,\rho^*,\bar\sigma)$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_3(x;x_1^*,x_2^*-\delta^{(k)},x_3^{(k)},\nu\omega^*,\nu,\nu p_2^*,\nu\gamma^{(k)})$ where
\[
x_2^*=\frac{\rho^*-\omega^*x_1^*}{\omega^*-x_1^*},\quad p_2^*=\frac{(\omega^*-x_1^*)^2}{\rho^*-2\omega^*x_1^*+(x_1^*)^2}
\]
and
\[
x_3^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_3^{(k)})}{(x_3^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)},\quad \gamma^{(k)}=\frac{(\bar\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\bar\sigma-\rho^*)}
\]

Case 2.3: There does not exist an optimal solution for (EC.16). This occurs when there is a sequence $(x_1^{(k)},\omega^{(k)},\rho^{(k)})$, with $\omega^{(k)}\to\omega^*\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}]$, $x_1^{(k)}\to\omega^*$, and $\rho^{(k)}\to\rho^*\in[(\omega^*)^2,\bar\sigma]$ such that $\mathcal V(x_1^{(k)},\omega^{(k)},\rho^{(k)},\bar\sigma)\nearrow\mathcal V^*$, the optimal value of (EC.16), but $\mathcal V^*>\max_{x_1\in[0,\omega-\epsilon],\ \omega\in[\underline\mu,\bar\mu\wedge\sqrt{\bar\sigma}],\ \rho\in[\omega^2,\bar\sigma]}\mathcal V(x_1,\omega,\rho,\bar\sigma)$ for any $\epsilon>0$. Then:
- Candidate optimal value: $\nu(H(\omega^*)+\lambda(\bar\sigma-(\omega^*)^2))$
- There is no candidate optimal density. Optimality is achieved by a sequence $f^{(k)}(x)=K_2(x;\omega^*-\delta^{(k)},x_2^{(k)},\nu\omega^*,\nu,\nu\gamma^{(k)})$ where
\[
x_2^{(k)}\to\infty \text{ is a sequence such that } \frac{H(x_2^{(k)})}{(x_2^{(k)})^2}\to\lambda \text{ as } k\to\infty,
\]
\[
\delta^{(k)}=\frac{\bar\sigma-(\omega^*)^2}{x_2^{(k)}-\omega^*},\quad \gamma^{(k)}=\frac{\bar\sigma-(\omega^*)^2}{(x_2^{(k)})^2-2\omega^*x_2^{(k)}+\bar\sigma}
\]
The introduction of the two subprograms 1 and 2 comes from analyzing the possibility of each
inequality constraint being binding in the moment problem that is transformed from (16) (a step
that is analogous to Step 1 in Section 6). Further details are given in the next section.
EC.3. Technical Proofs for Section 7.1
The general framework for the proof of Theorem 4, and for why Algorithm 2 works, is similar to that for the abstract formulation in Section 4. Hence, for succinctness, we only focus on the new issues here. First, an almost verbatim repetition of the arguments for Theorem 3 leads to the following reduction for (16):

Theorem EC.1. Suppose $h$ is bounded. Denote $H(x)=\int_0^x\int_0^u h(v+a)\,dv\,du$. Then the optimal value of (16) is the same as
\[
\begin{array}{ll}
\max & \nu E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal P[0,\infty)
\end{array}
\tag{EC.17}
\]
where $\underline\mu=\underline\eta/\nu$, $\bar\mu=\bar\eta/\nu$, $\underline\sigma=2\underline\beta/\nu$, $\bar\sigma=2\bar\beta/\nu$. Here the decision variable is a probability distribution $P\in\mathcal P[0,\infty)$, where $\mathcal P[0,\infty)$ denotes the set of all probability measures on $[0,\infty)$, and $E[\cdot]$ is the corresponding expectation. Moreover, there is a one-to-one correspondence (up to measure zero) between the feasible solutions of (16) and (EC.17), given by $f'(x+a)=\nu(p(x)-1)$ a.e., where $f$ is a feasible solution of (16) and $p$ is a probability distribution function on $[0,\infty)$.
For convenience, we denote by OPT($\mathcal C$) the program
\[
\begin{array}{ll}
\max & \nu E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal C
\end{array}
\]
Therefore, (EC.17) can be written as OPT($\mathcal P[0,\infty)$). We also let OPT($\mathcal C;\omega,\rho$) be the program OPT($\mathcal C$) defined before, but with the equality constraints $E[X]=\omega$ and $E[X^2]=\rho$ in place of the moment ranges.
Lemma EC.5. The optimal value of OPT($\mathcal P[0,c]$) is identical to that of OPT($\mathcal P_3[0,c]$) for any $c>0$. Similarly, the optimal value of OPT($\mathcal P[0,\infty)$) is identical to that of OPT($\mathcal P_3[0,\infty)$).

Proof of Lemma EC.5. Since the inequality constraints in OPT($\mathcal C$) are not all linearly independent, the results in Winkler (1988) do not directly apply. Instead, we shall prove the current lemma by extending Lemma EC.1 via a simple observation. Consider OPT($\mathcal P_3[0,c]$). Note that it can be written as $\max_{\omega\in[\underline\mu,\bar\mu],\ \rho\in[\underline\sigma,\bar\sigma]}U^*(\omega,\rho)$, where $U^*(\omega,\rho)$ denotes the optimal value of OPT($\mathcal P[0,c];\omega,\rho$). Now suppose $Z^*$ is the optimal value of OPT($\mathcal P[0,c]$). Then there must exist a sequence $(\omega^{(k)},\rho^{(k)})$ such that $U^*(\omega^{(k)},\rho^{(k)})\to Z^*$. For each $(\omega^{(k)},\rho^{(k)})$, we already know from Lemma EC.1 that it suffices to look at the set $\mathcal P_3[0,c]$, and that there must be a sequence $P_{\omega^{(k)},\rho^{(k)}}\in\mathcal P_3[0,c]$ such that its objective value converges to $U^*(\omega^{(k)},\rho^{(k)})$. Combining the above, we can choose a sequence $P^{(k)}\in\mathcal P_3[0,c]$ such that its objective value converges to $Z^*$. This shows the first statement of the lemma. A similar argument holds for OPT($\mathcal P[0,\infty)$). $\square$
The following gives the rule to exclude trivial scenarios in the algorithm:
Lemma EC.6. The program OPT($\mathcal P_3[0,\infty)$) is consistent if and only if $\bar\sigma\ge\underline\mu^2$.

Proof of Lemma EC.6. Similar to Lemma 1 and hence skipped. $\square$
The following explains the origin of the two subprograms:
Lemma EC.7. Suppose $h$ is non-negative. The optimal value of OPT($\mathcal P_3[0,\infty)$) is given by $\max\{Z_1^*,Z_2^*\}$, where $Z_1^*$ is the optimal value of
\[
\begin{array}{ll}
\max & E[H(X)]\\
\text{subject to} & E[X]=\bar\mu\\
& \underline\sigma\le E[X^2]\le\bar\sigma\\
& P\in\mathcal P_3[0,\infty)
\end{array}
\tag{EC.18}
\]
and $Z_2^*$ is the optimal value of
\[
\begin{array}{ll}
\max & E[H(X)]\\
\text{subject to} & \underline\mu\le E[X]\le\bar\mu\\
& E[X^2]=\bar\sigma\\
& P\in\mathcal P_3[0,\infty)
\end{array}
\tag{EC.19}
\]
respectively. The corresponding optimal solution or sequence of feasible solutions that converges to
optimality for either (EC.18) or (EC.19) will be optimal for OPT (P3[0,∞)) as well.
Proof of Lemma EC.7. We argue that, to solve OPT($\mathcal P_3[0,\infty)$), it suffices to restrict attention to the feasible region $\{P\in\mathcal P_3[0,\infty):E[X]=\bar\mu,\ \underline\sigma\le E[X^2]\le\bar\sigma\}\cup\{P\in\mathcal P_3[0,\infty):\underline\mu\le E[X]\le\bar\mu,\ E[X^2]=\bar\sigma\}$. To explain, suppose the optimal value of OPT($\mathcal P_3[0,\infty)$) is not zero (otherwise, since $h$ is assumed non-negative, $H$ is also non-negative and the result is trivial). It suffices to look at $P\sim(x_1,x_2,x_3,p_1,p_2,p_3)\in\mathcal P_3[0,\infty)$ with at least one $x_i$ having $H(x_i)>0$ and $p_i>0$ (otherwise the objective value is zero, which is clearly not optimal). Now suppose $P$ satisfies $E[X]<\bar\mu$ and $E[X^2]<\bar\sigma$. We can increase the $x_i$'s so that $E[X]\le\bar\mu$ and $E[X^2]\le\bar\sigma$ remain satisfied, while the objective value is at least as large as before since $H(x)$ is non-decreasing. Hence any $P$ such that $E[X]<\bar\mu$ and $E[X^2]<\bar\sigma$ is dominated by, or equal in value to, a solution in $\{P\in\mathcal P_3[0,\infty):E[X]=\bar\mu,\ \underline\sigma\le E[X^2]\le\bar\sigma\}\cup\{P\in\mathcal P_3[0,\infty):\underline\mu\le E[X]\le\bar\mu,\ E[X^2]=\bar\sigma\}$. This proves the lemma. $\square$
Lemma EC.8. Under Assumption 2, if (EC.18) is solvable, then there is an optimal solution in $\mathcal P_2[0,\infty)$. The same statement holds for (EC.19).

Proof of Lemma EC.8. Let $P^*$ be an optimal solution for the solvable (EC.18). Let $\sigma^*=E^*[X^2]$. Then $P^*$ must also be an optimal solution for OPT($\mathcal P_3[0,\infty);\bar\mu,\sigma^*$). By Lemma EC.4, there must be a solution $P^{**}\in\mathcal P_2[0,\infty)$ that has the same objective value as $P^*$. A similar argument holds for (EC.19). This concludes the proof. $\square$
By Lemmas EC.5 and EC.7, it suffices to look at the two subprograms (EC.18) and (EC.19)
in order to solve (EC.17). Together with Lemma EC.8, this concludes Theorem 4 immediately.
Algorithm 2 then follows the same line of analysis as Steps 3 and 4 in Section 6, now applied to
the subprograms (EC.18) and (EC.19). The inequality constraints in each of these subprograms
lead to one extra dimension in the respective nonlinear programs in the algorithm, but other than
that the analysis is almost completely the same, and hence we skip the details here.
EC.4. Auxiliary Lemmas and Proofs
EC.4.1. A Lemma for the Proof of Theorem 3
Lemma EC.9. Suppose a probability density $f$ is convex for $x\ge a$ for some $a\in\mathbb R$. Then $xf(x)\to 0$ and $x^2f'(x)\to 0$ a.e. as $x\to\infty$.
Proof of Lemma EC.9. Note that a density $f$ that is convex in the tail must be non-increasing there, since otherwise $\int f\,dx=\infty$. We denote $g(x)=xf(x)-F(x)$, where $F$ is the distribution function of $f$. Consider, for $(a\vee 0)\le x_1\le x_2$,
\[
\begin{aligned}
g(x_2)-g(x_1)&=x_2f(x_2)-x_1f(x_1)-(F(x_2)-F(x_1))\\
&\le x_2f(x_2)-x_1f(x_1)-f(x_2)(x_2-x_1)\quad\text{since } f(x) \text{ is non-increasing}\\
&=x_1[f(x_2)-f(x_1)]\\
&\le 0\quad\text{again since } f \text{ is non-increasing}
\end{aligned}
\]
Therefore $g$ is non-increasing for large enough $x$, and since $xf(x)\ge 0$ and $0\le F(x)\le 1$, we have $g$ bounded from below. This implies that $g$ must converge to a limit, say $c$, as $x\to\infty$. In other words, $xf(x)-F(x)\to c$, and since $F(x)\to 1$, we have $xf(x)\to c+1$. There are two cases: $c+1>0$ or $c+1=0$. The first case implies that $xf(x)\ge\epsilon>0$ for some $\epsilon$ for all large enough $x$. This means $f(x)\ge\epsilon/x$ for all large enough $x$, and $\int f\,dx=\infty$, which is impossible. Hence $xf(x)$ must converge to $0$. This proves the first part of the lemma.
To prove the second part, observe first that $f'(x)$ must be non-decreasing a.e. for $x\ge a$ since $f(x)$ is convex, and also non-positive since otherwise $\int f(x)\,dx=\infty$. We now define $\bar g(x)=-x^2f'(x)+2\bar F(x)$, where $\bar F(x)=-\int_x^\infty tf'(t)\,dt$. For any $(a\vee 0)\le x_1\le x_2$,
\[
\begin{aligned}
\bar g(x_2)-\bar g(x_1)&=x_1^2f'(x_1)-x_2^2f'(x_2)+2\bar F(x_2)-2\bar F(x_1)\\
&\le x_1^2f'(x_1)-x_2^2f'(x_2)+f'(x_2)(x_2^2-x_1^2)\quad\text{since } f'(x) \text{ is non-decreasing}\\
&=x_1^2(f'(x_1)-f'(x_2))\\
&\le 0\quad\text{again since } f'(x) \text{ is non-decreasing}
\end{aligned}
\]
Therefore, $\bar g(x)$ is non-increasing. Note that $-x^2f'(x)\ge 0$. Also, since $\bar F(x)=-tf(t)\big|_x^\infty+\int_x^\infty f(t)\,dt=xf(x)+\int_x^\infty f(t)\,dt$ by using integration by parts and $\lim_{x\to\infty}xf(x)=0$ as we have just proved, we have $\bar F(x)\to 0$ as $x\to\infty$, so $\bar F$ is bounded as well. Therefore $\bar g$ is bounded from below. This implies that $\bar g$ must converge to a limit, say $\bar c$, as $x\to\infty$. Since $\bar F(x)\to 0$, we have $-x^2f'(x)\to\bar c$. There are two cases: either $\bar c>0$ or $\bar c=0$. The former case implies that $-xf'(x)\ge\epsilon/x$ for some $\epsilon>0$ and all large enough $x$, and so $\bar F(x_0)=-\int_{x_0}^\infty xf'(x)\,dx=\infty$ for some large $x_0$, which is a contradiction. Therefore $-x^2f'(x)\to 0$. This proves the second part of the lemma. $\square$
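As a quick numerical illustration of the lemma (the density $f(x)=e^{-x}$ on $[0,\infty)$ is an assumed example; it is convex in its tail), both quantities indeed vanish:

```python
# A tiny check (assumed example density f(x) = exp(-x)) of the two
# limits in Lemma EC.9.
import numpy as np

x = np.array([1.0, 10.0, 50.0, 100.0])
f = np.exp(-x)            # density value
fprime = -np.exp(-x)      # its derivative
print(x * f)              # x*f(x) -> 0
print(x**2 * fprime)      # x^2*f'(x) -> 0
```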
EC.4.2. Verification for Remark Point 4 at the End of Section 4.3
First, we show that the objective in (3) is bounded as $x_1\nearrow\mu$. By $\eta^2<2\beta\nu$, or equivalently $\sigma>\mu^2$, we have $\sigma-2\mu x_1+x_1^2\to\sigma-\mu^2>0$ as $x_1\nearrow\mu$. Moreover, by $\lambda<\infty$,
\[
\limsup_{x_1\nearrow\mu}(\mu-x_1)^2H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)=\limsup_{x_1\nearrow\mu}(\sigma-\mu x_1)^2\left(\frac{\mu-x_1}{\sigma-\mu x_1}\right)^2H\!\left(\frac{\sigma-\mu x_1}{\mu-x_1}\right)=\lambda(\sigma-\mu^2)^2<\infty
\]
Hence both the first and the second term in (3) are bounded.

A similar argument works for the program (4) when $\rho>\mu^2$ is fixed. In fact, we will show that the objective $V(x_1,\rho)$ in (4) is bounded as $x_1\to\mu$ and over any $\rho$, even around the point $(x_1,\rho)=(\mu,\mu^2)$. To this end, the first term of $V(x_1,\rho)$ is clear since $0\le(\rho-\mu^2)/(\rho-2\mu x_1+x_1^2)\le 1$ for any $x_1<\mu$ and $\rho>\mu^2$. For the second term, we write it as
\[
\frac{(\mu-x_1)^2}{\rho-2\mu x_1+x_1^2}\left(H\!\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)-\lambda\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)^2\right)
=\frac{(\rho-\mu x_1)^2}{\rho-2\mu x_1+x_1^2}\left(\left(\frac{\mu-x_1}{\rho-\mu x_1}\right)^2H\!\left(\frac{\rho-\mu x_1}{\mu-x_1}\right)-\lambda\right)
\]
Since $((\mu-x_1)/(\rho-\mu x_1))^2H((\rho-\mu x_1)/(\mu-x_1))-\lambda$ is bounded for any $x_1<\mu$, $\rho>\mu^2$, by the fact that $\limsup_{x\to\infty}H(x)/x^2-\lambda=0$, we are left to show that $(\rho-\mu x_1)^2/(\rho-2\mu x_1+x_1^2)$ is bounded. For this, we let $x_1=\mu-b$ and $\rho=\mu^2+d$ where $b,d>0$ and $b,d\to 0$. Moreover, we write $b=dc$, where $c$ is a positive sequence such that $b\to 0$ as $d\to 0$. Then we have
\[
\frac{(\rho-\mu x_1)^2}{\rho-2\mu x_1+x_1^2}=\frac{(d+\mu b)^2}{d+b^2}=\frac{d(1+\mu c)^2}{1+dc^2}=\frac{d+2\mu dc+\mu^2dc^2}{1+dc^2}
\]
which is bounded since $d$ and $dc=b$ are bounded, and the expression is bounded for any positive $dc^2$. We have thus proved our claim.
EC.4.3. Verification of the Optimal Sequence in the Proofs of Propositions 3 and 4
We consider both cases of (4) being solvable and not solvable. For the first case, given an optimal solution $x_1^*,x_2^*,p_1^*,p_2^*$ for (EC.9), and also $\rho^*=p_1^*(x_1^*)^2+p_2^*(x_2^*)^2$, solved by using (EC.12), we define a solution sequence in $\mathcal P_3[0,\infty)$ exactly as given by (12) and (13). We first verify that the sequence defined in (12) and (13) satisfies all constraints in (9). In fact, we will show that (13) is obtained by setting
\[
p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}+p_3^{(k)}x_3^{(k)}=\mu
\tag{EC.20}
\]
and
\[
p_1^{(k)}(x_1^{(k)})^2+p_2^{(k)}(x_2^{(k)})^2+p_3^{(k)}(x_3^{(k)})^2=\sigma
\tag{EC.21}
\]
for all $k$. Note that (EC.20) and (EC.21) can be written as
\[
p_1^*x_1^*+(p_2^*-\gamma^{(k)})(x_2^*-\delta^{(k)})+\gamma^{(k)}x_3^{(k)}=\mu
\]
and
\[
p_1^*(x_1^*)^2+(p_2^*-\gamma^{(k)})(x_2^*-\delta^{(k)})^2+\gamma^{(k)}(x_3^{(k)})^2=\sigma
\]
respectively. Since $p_1^*x_1^*+p_2^*x_2^*=\mu$ and $p_1^*(x_1^*)^2+p_2^*(x_2^*)^2=\rho^*$, they become
\[
-\delta^{(k)}p_2^*-\gamma^{(k)}(x_2^*-\delta^{(k)})+\gamma^{(k)}x_3^{(k)}=0
\tag{EC.22}
\]
and
\[
\delta^{(k)}p_2^*(-2x_2^*+\delta^{(k)})+\gamma^{(k)}\big((x_3^{(k)})^2-(x_2^*-\delta^{(k)})^2\big)=\sigma-\rho^*
\tag{EC.23}
\]
From (EC.22), we have
\[
\gamma^{(k)}=\frac{\delta^{(k)}p_2^*}{\delta^{(k)}+x_3^{(k)}-x_2^*}
\tag{EC.24}
\]
Plugging this into (EC.23), we get
\[
\delta^{(k)}p_2^*(-2x_2^*+\delta^{(k)})+\frac{\delta^{(k)}p_2^*}{\delta^{(k)}+x_3^{(k)}-x_2^*}\big((x_3^{(k)})^2-(x_2^*-\delta^{(k)})^2\big)=\sigma-\rho^*
\]
Solving for $\delta^{(k)}$ implies
\[
p_2^*\delta^{(k)}\big(x_3^{(k)}-x_2^*\big)=\sigma-\rho^*
\]
which gives
\[
\delta^{(k)}=\frac{\sigma-\rho^*}{p_2^*(x_3^{(k)}-x_2^*)}
\tag{EC.25}
\]
Substituting into (EC.24), we have
\[
\gamma^{(k)}=\frac{(\sigma-\rho^*)p_2^*}{p_2^*\big(x_3^{(k)}-x_2^*\big)^2+(\sigma-\rho^*)}
\tag{EC.26}
\]
Now the corresponding objective value for (9) is
\[
p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})+p_3^{(k)}H(x_3^{(k)})
=p_1^*H(x_1^*)+(p_2^*-\gamma^{(k)})H(x_2^*-\delta^{(k)})+\gamma^{(k)}H(x_3^{(k)})
\]
\[
=p_1^*H(x_1^*)+(p_2^*-\gamma^{(k)})H(x_2^*-\delta^{(k)})+\frac{(\sigma-\rho^*)p_2^*}{p_2^*(x_3^{(k)}-x_2^*)^2+(\sigma-\rho^*)}H(x_3^{(k)})
\to p_1^*H(x_1^*)+p_2^*H(x_2^*)+\lambda(\sigma-\rho^*)
\]
by using the definition of $\lambda$, which is exactly the optimal objective value of (EC.9). This shows that solving (EC.9) via (4) will identify an optimality-approaching sequence for (9).
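The algebra above is easy to check numerically. The sketch below (the two-point solution and moment values are assumed for illustration) confirms that the constructed sequence keeps both moment constraints of (9) exact for every $k$ while pushing the third support point to infinity.

```python
# A sanity-check sketch (assumed values) for the sequence (EC.25)-(EC.26).
mu, sigma = 2.0, 6.0
x1s, x2s, p1s = 0.5, 8.0 / 3.0, 4.0 / 13.0   # two-point measure with mean mu
p2s = 1.0 - p1s
rho = p1s * x1s**2 + p2s * x2s**2            # rho* = 5 < sigma leaves slack

for x3 in [10.0, 100.0, 1000.0]:             # x3^(k) -> infinity
    delta = (sigma - rho) / (p2s * (x3 - x2s))                       # (EC.25)
    gamma = (sigma - rho) * p2s / (p2s * (x3 - x2s)**2 + (sigma - rho))  # (EC.26)
    m1 = p1s * x1s + (p2s - gamma) * (x2s - delta) + gamma * x3
    m2 = p1s * x1s**2 + (p2s - gamma) * (x2s - delta)**2 + gamma * x3**2
    print(round(m1, 10), round(m2, 10))      # stays at (mu, sigma) = (2, 6)
```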
In the second case, i.e., when (4) is not solvable, we have
\[
p_1^{(k)}x_1^{(k)}+p_2^{(k)}x_2^{(k)}=\mu
\tag{EC.27}
\]
and
\[
p_1^{(k)}(x_1^{(k)})^2+p_2^{(k)}(x_2^{(k)})^2=\sigma
\tag{EC.28}
\]
for all $k$. Using $x_1^{(k)}=\mu-\delta^{(k)}$ and $p_1^{(k)}=1-\gamma^{(k)}$, (EC.27) and (EC.28) can be written as
\[
(1-\gamma^{(k)})(\mu-\delta^{(k)})+\gamma^{(k)}x_2^{(k)}=\mu
\]
and
\[
(1-\gamma^{(k)})(\mu-\delta^{(k)})^2+\gamma^{(k)}(x_2^{(k)})^2=\sigma
\]
respectively, which further give
\[
\gamma^{(k)}\big(\delta^{(k)}+x_2^{(k)}-\mu\big)-\delta^{(k)}=0
\tag{EC.29}
\]
and
\[
\gamma^{(k)}\big((x_2^{(k)})^2-(\mu-\delta^{(k)})^2\big)+(\mu-\delta^{(k)})^2=\sigma
\tag{EC.30}
\]
From (EC.29) we have
\[
\gamma^{(k)}=\frac{\delta^{(k)}}{\delta^{(k)}+x_2^{(k)}-\mu}
\tag{EC.31}
\]
Putting this into (EC.30), we get
\[
\frac{\delta^{(k)}}{\delta^{(k)}+x_2^{(k)}-\mu}\big((x_2^{(k)})^2-(\mu-\delta^{(k)})^2\big)+(\mu-\delta^{(k)})^2=\sigma
\]
Solving for $\delta^{(k)}$, we have
\[
\delta^{(k)}\big(x_2^{(k)}+\mu-\delta^{(k)}\big)+(\mu-\delta^{(k)})^2=\sigma
\]
which gives
\[
\delta^{(k)}=\frac{\sigma-\mu^2}{x_2^{(k)}-\mu}
\tag{EC.32}
\]
Plugging this into (EC.31), we have
\[
\gamma^{(k)}=\frac{\sigma-\mu^2}{(\sigma-\mu^2)+\big(x_2^{(k)}-\mu\big)^2}
\tag{EC.33}
\]
Moreover, the objective value for (9) is
\[
p_1^{(k)}H(x_1^{(k)})+p_2^{(k)}H(x_2^{(k)})=(1-\gamma^{(k)})H(\mu-\delta^{(k)})+\gamma^{(k)}H(x_2^{(k)})
=(1-\gamma^{(k)})H(\mu-\delta^{(k)})+\frac{\sigma-\mu^2}{(\sigma-\mu^2)+\big(x_2^{(k)}-\mu\big)^2}H(x_2^{(k)})
\to H(\mu)+\lambda(\sigma-\mu^2)
\]
by using the definition of $\lambda$. This also matches the objective value of (EC.9).
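A parallel check for this non-solvable case (a sketch with the same assumed $H$, for which $\lambda=1/2$): the sequence built from (EC.32) and (EC.33) keeps the moments at $(\mu,\sigma)$ exactly, and its objective value approaches $H(\mu)+\lambda(\sigma-\mu^2)$.

```python
# A numerical sketch (assumed H, mu, sigma) for the sequence
# (EC.32)-(EC.33).
import numpy as np

mu, sigma, lam = 2.0, 5.0, 0.5

def H(x):
    return x**2 / 2 - x + 1 - np.exp(-x)

for x2 in [10.0, 100.0, 1000.0]:
    delta = (sigma - mu**2) / (x2 - mu)                          # (EC.32)
    gamma = (sigma - mu**2) / ((sigma - mu**2) + (x2 - mu)**2)   # (EC.33)
    m1 = (1 - gamma) * (mu - delta) + gamma * x2
    m2 = (1 - gamma) * (mu - delta)**2 + gamma * x2**2
    obj = (1 - gamma) * H(mu - delta) + gamma * H(x2)
    print(m1, m2, obj)    # moments stay (2, 5); obj -> H(2) + 0.5*(5 - 4)
```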