models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance...

27
Models for the assessment of treatment improvement: the ideal and the feasible. P. C. ´ Alvarez-Esteban 1 , E. del Barrio 1 , J.A. Cuesta-Albertos 2 and C. Matr´ an 1 1 Departamento de Estad´ ıstica e Investigaci´ on Operativa and IMUVA, Universidad de Valladolid 2 Departamento de Matem´ aticas, Estad´ ıstica y Computaci´ on, Universidad de Cantabria November 17, 2018 Abstract Comparisons of different treatments or production processes are the goals of a significant fraction of applied research. Unsurprisingly, two-sample problems play a main role in Statistics through natural questions such as ‘Is the the new treatment significantly better than the old?’. However, this is only partially answered by some of the usual statistical tools for this task. More importantly, often practitioners are not aware of the real meaning behind these statistical procedures. We analyze these troubles from the point of view of the order between distributions, the stochastic order, showing evidence of the limitations of the usual approaches, paying special attention to the classical comparison of means under the normal model. We discuss the unfeasibility of statistically proving stochastic dominance, but show that it is possible, instead, to gather statistical evidence to conclude that slightly relaxed versions of stochastic dominance hold. Keywords: Stochastic dominance, similarity, two-sample comparison, trimmed distributions, winsorized distributions, Behrens-Fisher problem, index of stochastic dominance. 1 Introduction Comparison is an essential activity in any field of life, one upon which a significant part of human knowledge is founded. In fact, one of the main achievements of mankind –numbers– are just a wonderful sophistication of the comparison process. Whether by curiosity or 1 arXiv:1612.01291v2 [stat.ME] 18 Apr 2017

Upload: others

Post on 23-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Models for the assessment of treatmentimprovement: the ideal and the feasible.

P. C. Alvarez-Esteban1, E. del Barrio1, J.A. Cuesta-Albertos2

and C. Matran1

1Departamento de Estadıstica e Investigacion Operativa and IMUVA,Universidad de Valladolid

2 Departamento de Matematicas, Estadıstica y Computacion,Universidad de Cantabria

November 17, 2018

Abstract

Comparisons of different treatments or production processes are the goals of asignificant fraction of applied research. Unsurprisingly, two-sample problems play amain role in Statistics through natural questions such as ‘Is the the new treatmentsignificantly better than the old?’. However, this is only partially answered by someof the usual statistical tools for this task. More importantly, often practitioners arenot aware of the real meaning behind these statistical procedures. We analyze thesetroubles from the point of view of the order between distributions, the stochasticorder, showing evidence of the limitations of the usual approaches, paying specialattention to the classical comparison of means under the normal model. We discussthe unfeasibility of statistically proving stochastic dominance, but show that it ispossible, instead, to gather statistical evidence to conclude that slightly relaxedversions of stochastic dominance hold.

Keywords: Stochastic dominance, similarity, two-sample comparison, trimmed distributions,

winsorized distributions, Behrens-Fisher problem, index of stochastic dominance.

1 Introduction

Comparison is an essential activity in any field of life, one upon which a significant part ofhuman knowledge is founded. In fact, one of the main achievements of mankind –numbers–are just a wonderful sophistication of the comparison process. Whether by curiosity or

1

arX

iv:1

612.

0129

1v2

[st

at.M

E]

18

Apr

201

7

Page 2: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

necessity we are continuously involved in comparing objects, leading to assessments likebigger/smaller, shorter/taller, better/worse,. . . It is therefore natural the prominent roleplayed in Statistics by procedures looking for some kind of ordering. In fact, ‘Two-Sample Problems’, i.e. comparing two populations or two treatments, is probably the mostcommon situation encountered in statistical practice. Not surprisingly, every textbook onstatistics explores the topic to some extent.

In many cases the practitioner using a two-sample procedure has the goal of gatheringevidence to conclude that a new treatment is better than the old standard. To fix ideas,let us assume that treatment refers to a particular training program for athletes. To assessthe possible improvement provided by a new training program the researcher collects someexperimental data from athletes training under the two different programs. Of course noteverything comes from the type of training and one expects that a naturally talentedathlete will perform better, whatever the training program, than a less gifted one. In thesimplest case in which performance is measured in terms of a simple univariate outcome,we can think of a training program as a nondecreasing transform of the level of naturaltalent. If, in some scale, this talent level is measured as T and the different trainingprograms result in hold(T ) and hnew(T ) levels of performance, respectively, we wouldsay that the new treatment is better than the old if hold(t) ≤ hnew(t) for all t.

What type of conclusion is drawn from the most standard use of the two-sample pro-cedures? The most commonly used test, namely, the t-test, even in the Welch versionrelated to the famous Behrens-Fisher problem, would simply aim at rejecting that themean of hold(T ) is greater than the mean of hnew(T ). However, as we show in thispaper, even under the normality assumptions implicit in the use of the t-test, evidence ofa significantly greater mean under the new treatment is compatible with a worse perfor-mance of, say, 40% of athletes. We believe that practitioners should be aware of this fact.We argue in this paper that, most often, they would rather be interested in gatheringevidence for stochastic dominance than for an increase in mean values.

In this goal of gathering evidence to support the claim that the new treatment yields animprovement over the old one, we cannot forget that testing hypothesis theory is designedto provide evidence to reject the null hypothesis and that lack of rejection does not meanevidence for the null. While this is a well known fact in the statistical community, in theabsence of other approaches, practitioners often resort to widely used procedures withoutfull conscience of their true meaning. Perhaps the best example of such a situationis the generalized use of goodness of fit tests, as the Kolmogorov-Smirnov test, as away to justify a parametric model assumption, such as normality. A closer example toour present framework is that of testing homogeneity in a two sample setup. In bothsituations, regardless of the obtained result, we will not be able to confirm the modelbut, at best, we would just get lack of statistical evidence to reject it. This fact hasbeen pointed out by several authors, notably by Dette and Munk (1998) or Munk andCzado (1998). In the particular case of stochastic dominance, we should test the null thatstochastic dominance does not hold against the alternative that it does hold if we want

2

Page 3: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

evidence supporting it (hence, evidence supporting that the new treatment is better).Unfortunately, no reasonable statistical test can help in this task, see Berger (1988) andour discussion in subsection 2.1 below.

On the other hand, a model is merely an approximation to reality, so, in order tovalidate a model, we should be conscious of what are the admissible deviations to themodel. This is the starting point for the discussion of practical vs. statistical significancein Hodges and Lehmann (1954) continued in a series of papers (see e.g. Rudas et al.(1994), Liu and Lindsay (2009), Alvarez-Esteban et al. (2008), Alvarez-Esteban et al.(2012), Alvarez-Esteban et al. (2014)) having the common goal of testing the approximatevalidity of statistical hypothesis. In this paper we discuss two relaxations of the stochasticdominance model for which there are consistent statistical tests. More precisely, we showthat there are consistent tests that allow the practitioner to conclude, up to some smallprobability of error, that the new treatment is within a small neighborhood of being betterthan the old one.

The remaining sections of this paper are organized as follows. In Section 2 we fur-ther discuss the convenience of considering stochastic order rather than growth in meanwhen trying to assess the improvement given by a new treatment. Details about theabove mentioned lack of valid inferential methods for concluding stochastic dominanceare also given, together with two relaxations of stochastic dominance, which can be usedto produce indices of deviation from the ideal model of stochastic order. We include asubsection that illustrates the behavior of these indices through the important exampleof distributions that differ only in changes in location or scale, showing that growth inthe mean is well compatible with a worse performance under a new treatment for a verysubstantial fraction of the population. We invite to a careful inspection of the graphics infigures 3 and 5 to get a visual impact of the departures of stochastic dominance measuredthrough such indices, when compared with changes in location and scale in the normalmodel. Section 3 provides valid inferential methods for gathering statistical evidence thatthe relaxed stochastic order models hold, hence, showing that these deviations from theideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide a simulation study showing the performance of theinferential methods in finite samples. Finally, the proofs of some results introduced inthis work are included in an Appendix.

2 Models of treatment improvement.

2.1 Distributional dominance vs. mean comparisons.

Let us briefly explore the use and real meaning of the most common approach to assessimprovement in two sample problems. Assuming independence between the samples andnormality in the parent distributions, the t-test is based on the comparison of the meansof the distributions. However, relations between the means, or any other feature of the

3

Page 4: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

distributions, must be cautiously evaluated to assess some kind of improvement in aproduction process or of a treatment with respect to another.

To motivate our discussion let us assume that the probability laws of the variable ofinterest under two different production processes are normal, say Pi = N(µi, σ

2i ), i = 1, 2.

If we could conclude that µ1 > µ2 we would be only allowed to claim that ‘in the mean’the first process produces larger values than the second. To better explain the meaningof such a statement, we can resort to the Strong Law of Large Numbers: for large enoughsamples obtained from both processes, the mean of the sample obtained from P1 wouldbe almost surely greater than that of the obtained from P2. We should stress the factthat this statement does not depend on the values σi. Thus, it is compatible with thesituations displayed in Figure 1. The relevant question is whether these situations arecompatible with our intuitive understanding of the statement that the first process leadsto greater values than the second.

−4 −2 0 2 4 6

0.0

0.1

0.2

0.3

0.4

−4 −2 0 2 4 6

0.0

0.1

0.2

0.3

0.4

−4 −2 0 2 4 6

0.0

0.1

0.2

0.3

0.4

−4 −2 0 2 4 6

0.0

0.1

0.2

0.3

0.4

Figure 1: Red: standard normal density; green: densities of N(1, σ2), σ2 = 1, 2, 3, 6.

An informal statement such as ‘men are taller than women’ can be better detailedsaying that a short man would not be as short among women, a medium-sized manwould be tall among women, and a tall man would be even taller among women. Thesecomparisons involve in a natural way the relative position or status of every item in bothpopulations. With greater precision, the whole comparison involves checking the newstatus of each element of the first population when we consider it as an element of thesecond population. If it is always greater, then we could say that the variable in the first

4

Page 5: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

population is greater than in the second.To analyze if this relation holds, a notable simplification is achieved through a previous

arrangement of each population ordering their elements by their status. In that way, therelative position of each element in its population is easily obtained, and normalization ineach population by its size to get comparable status leads to a simple relation. If F andG are respectively the distribution functions of the variable of interest in the first and thesecond populations, and F−1, G−1 are the corresponding quantile functions,

F−1(t) ≤ G−1(t) for every t ∈ (0, 1), (1)

would mean that the variable in the first population is lower than in the second. Werecall that for a general distribution function on the real line, F , the associated quantilefunction, that we will denote as F−1, is defined as

F−1(t) := min{x : t ≤ F (x)} t ∈ (0, 1).

It is well known that the relation (1) is equivalent to the classical definition of stochasticorder, usually attributed to Lehmann (1955), but already used at least in Mann andWhitney (1947). We say that F is stochastically smaller than G, and write F ≤st G, if

F (x) ≥ G(x) for every x ∈ R. (2)

In the econometric literature, where more general classes of stochastic orders are consid-ered, usually linked to preferences related with families of utility functions, this relationis often invoked as first order stochastic dominance (see, e.g., Shaked and Shanthikumar(2007) or Muller and Stoyan (2002) for several extensions of the concept).

An interesting fact about quantile functions is that if T is uniformly distributed on(0, 1) then F−1(T ) has distribution function F . Let us return for a moment to the dis-cussion in the introduction and think of F and G as the distribution functions of theperformances of athletes training under the old and new programs, respectively. Let Tbe a measure of the natural talent of a randomly chosen athlete and let FT be its d.f.If we assume that FT is continuous, then, it is well known that T ∗ := FT (T ) is uniformon (0, 1). Therefore, just making a modification on the measurement scale, we have thatthe variable giving the natural talent of the athletes is uniform on (0, 1). Then, we cansee F−1(T ∗) and G−1(T ∗) as the effects of the training programs on that natural talent.Hence, F−1 plays the role of hold and G−1 that of hnew in the discussion in the introduc-tion and we see from the interpretation there that the new training program was betterthan the old if hold(t) ≤ hnew(t) for all t coincides with first order stochastic dominance.

In view of the arguments above, we think that a sound answer to question ‘is the newtreatment better than the old’ should be based on the assessment of stochastic order. Alook at Figure 2 shows that this cannot be done by simply comparing the means. In fact,the mean is the same for all distributions in green, but stochastic order only holds in the

5

Page 6: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

2

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

2

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

2

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

2

Figure 2: Red: standard normal quantile function; green: quantile functions of N(1, σ2),σ2 = 1, 2, 3, 6.

comparisons N(0, 1) vs N(1, 1). Specific inferential methods for assessing F ≤st G areneeded.

There is an abundant literature in statistical and econometric journals concerningtesting problems related to stochastic dominance. Some references, tracing back to Mannand Whitney (1947), belong to an order restricted inferential approach, that is, assumingthat stochastic order holds they focus on concluding that strict stochastic order holds(F <st G if F (x) ≥ G(x) for all x with F (x) > G(x) for at least one x). More precisely,they consider the problem of testing the null hypothesisH0 : F = G against the alternativeHa : F <st G. Obvious as it may be, it is relevant to note that some caution should beadopted in applying these procedures, since, both H0 and Ha can be simultaneously false.

A different testing problem with a number of references in the literature (see e.g.McFadden (1989), Anderson (1996), Barrett and Donald (2003), Davidson and Duclos(2000), Linton et al. (2005), Linton et al. (2010)) is that of testing the null H0: F ≤st G,versus Ha: F 6≤st G. This is a kind of goodness-of-fit test. The statistical meaning of notrejecting the null is simply to acknowledge that there is not evidence enough to guaranteethat stochastic dominance does not hold. However, this ‘accepting’ the null is sometimesinvoked as a guarantee that one random variable is stochastically larger than other, butthis is simply wrong.

The available analyses do not address the main goal of gathering statistical evidence

6

Page 7: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

to assess that stochastic dominance, F ≤st G, holds. This would be the natural result ofrejecting the null in the problem of testing

H0 : F 6≤st G, versus Ha : F ≤st G. (3)

Unfortunately, as already noted in Berger (1988), Davidson and Duclos (2013) and Alvarez-Esteban et al. (2014), the statistical assessment of distributional dominance is impossiblebecause small variations in the tails of a distribution could avoid or facilitate a relationof stochastic dominance. In fact, there is no good α-level test for (3): the ‘no data’ test,rejecting H0 with probability α regardless of the data is uniformly most powerful. This isshowed in Berger (1988) in the one-sample setting, but the result can be easily generalizedto the two-sample setup considered here.

2.2 Relaxations of stochastic dominance.

As it often happens in Statistics, the concept of stochastic dominance is excessively rigidas to try to confirm it on the basis of a sample. It is a too strong assumption in problemsin which one is inclined to believe that a population X is somehow smaller than anotherpopulation Y . This difficulty seems to be a big justification for the common practice ofbasing the comparisons on features of the distributions, like the means, when trying toassess some kind of order between distributions. For some parametric models, as in thenormal model, this approach has the additional advantage of leading to true stochasticorder for distributions with the same variance. Since optimal testing for this problemcan be achieved through an exact test, the two-sample t-test, the approach seems almostperfect. For the celebrated Behrens-Fisher problem, when the variances are not assumedequal, approximations such as Welch’s proposal give satisfactory enough solutions to facethe comparison of means problem. However, even in the normal model, this may be givinga right answer to a wrong question.

Stochastic order is a 0-1 relation. It is either true or false (of course, the same canbe said for higher order choices of stochastic dominance). In the case of normal laws, forinstance, stochastic order holds only in the case of equal variances and increasing means,see Section 2.3 for related results on location-scale models (LS-models in the sequel).However, looking back at Figure 2, it is tempting to say that the degree of deviationfrom stochastic order is higher in the example in the lower right corner than in those inthe lower left or upper right corners. We could say that in these last cases stochasticdominance nearly holds or, even, that ‘in practice’, it holds. Some measurement of thelevel of agreement with stochastic order would be helpful.

Motivated by the unfeasibility of consistently testing (3), Berger (1988) considered theidea of ‘restricted stochastic dominance’, which amounts to looking for the relation F (x) ≥G(x) on a fixed closed interval, excluding the tails of the sampled distribution. The choiceof the interval is somewhat arbitrary. The same approach had already been consideredin Lehmann and Rojo (1992), as a weak version of the stochastic order, stressing the fact

7

Page 8: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

that it is not always appropriate to require that the comparison holds for all values ofx. An analogous version has been developed in the two-sample setting by Davidson andDuclos (2013). Given the equivalence between (1) and (2), it would make sense, as well,to fix an interval contained in (0, 1) and check whether (1) holds within this interval. Onthe other hand, the normalization given by the quantile transform gives some additionaladvantage. Rather than facing the arbitrary choice of the interval on which we want tocheck that (1) holds, we can look at the length of the the set where it does not hold,namely,

γ(F,G) = `(t ∈ (0, 1) : F−1(t) > G−1(t)

), (4)

where ` stands for Lebesgue measure. This yields a useful index to measure how farF and G are from stochastic order, with γ(F,G) = 0 corresponding to perfect fit. Ifwe turn back to the example of the old and new training programs for an athlete, thenγ(F,G) = 0.05 means that 95% of athletes get better results with the training programassociated to G than with that of F . When restricted to a specific model as the normalmodel (and more generally to LS-models) the computation and the meaning of this indexis easy and very informative. The contour-plot in Figure 3 gives a nice insight into thefact that moderate and even high levels of disagreement with stochastic order (up toγ(F,G) ∼ 0.5) are compatible with an increase in mean, even in the normal model. As anexample, if F denotes the standard normal law, N(0, 1), and G corresponds to the law,N(.337, 1.52), then we have γ(F,G) = 0.25. Again, in the training program example, wesee that the new program can yield an improvement in the mean performance of athletesand, yet, result in worse results for 25% of them. We recall that our initial motivation wasto discuss the suitability of the usual methods involved in the validation of domination orimprovement. From Figure 3 and the above discussion, the inadequacy of the two samplet-test to validate a real improvement for most of the population (unless both distributionssatisfy some strong additional assumptions such as normality plus equal variances) shouldbe obvious. Of course this is not an objection to the use of, say, the Welch version ofthe t-test for testing an increase in mean. The key point is the meaning of these meancomparisons for the task of showing improvement of treatments or production processes.Later, in Section 2.3 we return to the meaning of definition (4) for normal and, moregenerally, LS-models.

While γ(F,G) is a natural measure of deviation from stochastic order, it is not the onlypossible choice. In Leshno and Levy (2002) the authors introduce the so-called AlmostStochastic Dominance which, easily, leads to an index to measure this deviation defining

α(F,G) =

∫{G>F}(G(x)− F (x))dx∫∞−∞ |F (x)−G(x)|dx

.

Although γ(F,G) is well defined for any pair of d.f.’s, some assumptions on F and Gare needed in the case of α(F,G). In Leshno and Levy (2002) the authors require the

8

Page 9: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Figure 3: Contour-plot of γ(N(0, 1), N(µ, σ2)) as in (4) for different values of µ (X-axis)and σ (Y-axis)

distributions to be bounded (and limit themselves to the cases in which α(F,G) < .5).This can be relaxed, but F and G should at least have finite mean.

This index enjoys nice properties related to the expectation of utility functions (see alsoTsetlin et al. (2015)). However, it lacks an important property: the stochastic dominanceis preserved by monotone functions. This remains true for γ(F,G) but not for α(F,G).Returning again to the athletes example, let h be a strictly increasing function, andassume that we decide to measure the performance of the athletes using the values ofh(hold) and h(hnew). If Fh and Gh represent the distribution functions of the new r.v.’s,then γ(F,G) = γ(Fh, Gh), while, there is no guarantee that α(F,G) = α(Fh, Gh). We donot pursue further the analysis of the α index in this paper.

Another alternative approach to measure agreement with stochastic order has beenintroduced in Alvarez-Esteban et al. (2016) (see also Alvarez-Esteban et al. (2014)). It isbased on looking for statistical evidence supporting that, for a given (small enough) π,there exist mixture decompositions

F = (1− π)F + πHF

G = (1− π)G+ πHG,

for some d.f.’s F , G such that F ≤st G. (5)

We mention some facts in favor of this approach. First, it allows a robust treatment ofthe problem of stochastic dominance because the decompositions above can be interpretedas contamination neighborhoods of some latent distributions F and G. In this sense we

9

Page 10: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

should recall that statistical practitioners often process the samples to avoid ‘rarities’ ornoise, hence, a methodological approach including an adequate treatment for this kind ofprocedure is helpful. On the other hand, by taking π large enough, model (5) always holds(the extreme choice π = 1 will always do). The smallest π for which a such decompositionis possible measures the fraction of the population intrinsically outside the stochastic ordermodel. This provides an index of disagreement with the stochastic order model, similarto the lack of fit index introduced in Rudas et al. (1994) for multinomial models or inAlvarez-Esteban et al. (2012) as a relaxation of the homogeneity model. The key fact touse the contamination model to measure deviation from stochastic order is given by thefollowing result, which is contained in Alvarez-Esteban et al. (2016).

Proposition 2.1 For arbitrary d.f.’s, F,G, and π ∈ [0, 1), (5) holds if and only if π ≥π(F,G), where

π(F,G) := supx∈R

(G(x)− F (x)). (6)

Notice that, when π(F,G) ∈ (0, 1) if F and G have continuous densities f and g,respectively, then, there exists x0 such that π(F,G) = G(x0) − F (x0) and x0 satisfiesf(x0) = g(x0). Also, as γ(F,G), π(F,G) is invariant for strictly increasing transformations(see Remark 2.6.1 in Alvarez-Esteban et al. (2016)).

A better insight into the meaning of model (5) is gained through the idea of trimmedprobabilities. An α-trimming of a probability, P , is any other probability, say Q, suchthat

Q(A) =

∫A

τdP

for some function τ taking values in [0, 11−α ] and every event A. The role of the function

τ is to allow to discard or downplay the influence of some regions on the sample space(having probability up to 1 − α) on the model, mimicking the common use in robuststatistics of removing disturbing observations. Identifying a probability on the line withits d.f., if we write Rα(F ) for the set of trimmings of F , then F = (1 − α)F + αHF forsome d.f. HF if and only if F ∈ Rα(F ). Hence, model (5) holds if and only if there existF ∈ Rα(F ) and G ∈ Rα(G) such that F ≤st G. Even more, this happens if and only if,after trimming the right tail of F and the left tail of G (removing an α fraction in bothcases), the resulting F and G satisfy F ≤st G. We refer to Alvarez-Esteban et al. (2016)for details.

Figure 4 shows the comparison of the empirical distributions corresponding to twosamples of heights of boys and girls (12 years old) from a data set discussed in Alvarez-Esteban et al. (2014). The value π(Gm, Fn) = 0.046, means that it suffices to trim thefraction π = 0.046 of shortest girls and of taller boys to achieve stochastic order betweenthe trimmed distributions and shows, up to 0.046 contamination, girls (in the sample)are taller than boys at age 12. On the other hand, to get the stochastic dominanceof boys over girls we should allow a considerably higher contamination level (because

10

Page 11: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

π(Fn, Gm) = 0.123). See Alvarez-Esteban et al. (2014) for a more detailed analysis of thisproblem.

130 140 150 160 170 180

0.0

0.2

0.4

0.6

0.8

1.0

original distributions

Gm Age 12 Boys Fn Age 12 Girls

130 140 150 160 170 180

0.0

0.2

0.4

0.6

0.8

1.0

trimmed distributions

Age 12 BoysAge 12 Girls

Figure 4: Sample d.f.’s of the heights of boys, Gm, and girls, Fn, aged 12 in the NHANESdataset.

Coming back to the role of π(F,G) as a measure of agreement with the stochasticorder, the case of the normal model is shown in the contour-plot in Figure 5. As in thecase of γ(F,G) we see that an increase in the mean is compatible with a high level ofdisagreement with respect to stochastic order, that is, with an improvement under the newtreatment. Thus, testing the null assumption that π(FN(µ1,σ2

1), FN(µ2,σ22)) ≥ 0.05 against

the alternative π(FN(µ1,σ21), FN(µ2,σ2

2)) < 0.05 would allow to conclude, upon rejection, thatthe second treatment results in improvement if we are willing to remove 5% of observationson each side, while no similar conclusion would be obtained from testing µ2 ≤ µ1 vsµ2 > µ1. In Section 3 we will analyze this possibility in the light of the testing proceduredeveloped in Alvarez-Esteban et al. (2016).

A simple comparison between the deviation indices defined in (4) and (6) arises fromthe following simple observation. For every couple (X, Y ) of random variables withmarginal d.f.’s F,G, we have:

G(x) = P (Y ≤ x,X ≤ Y ) + P (Y ≤ x,X > Y )

≤ P (X ≤ x,X ≤ Y ) + P (X > Y )

≤ F (x) + P (X > Y ),

thus considering the quantile representations (X, Y ) = (F−1, G−1), since P (X > Y ) =`(F−1 > G−1) = γ(F,G), from Proposition 2.1 we get the following statement.

Proposition 2.2 For any pair of d.f.’s, F and G,

π(F,G) ≤ γ(F,G).

11

Page 12: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Figure 5: Contour-plot of π(N(0, 1), N(µ, σ2)) as in (6) for different values of µ (X-axis)and σ (Y-axis)

Now, both π(F,G) and γ(F,G) are deviation indices from the stochastic order model,taking values in [0, 1] and such that γ(F,G) = 0 if and only if π(F,G) = 0, with anyof these identities being also equivalent to F ≤st G. Later, in Section 3 below, we showhow to consistently test H0 : π(F,G) ≥ δ0 against Ha : π(F,G) < δ0, which, in caseof rejection, would provide statistical evidence that stochastic order holds aproximately.Under some additional assumptions we will also provide inferential methods for reachingthe more restrictive conclusion (in view of Proposition 2.2) that γ(F,G) < δ0.

To conclude this section we would like to mention that π(F,G) and γ(F,G) are relatedto the concepts of trimming and Winsorizing, respectively. Winsorizing and trimming arevery popular robustification procedures in Data Analysis, designed to avoid an excessiveinfluence of the tails, mainly in presence of outliers. Recall that π(F,G) ≤ δ0 if and onlyif the d.f.’s F and G that we obtain from F and G after removing the δ0 fraction of theupper tail of F and of the lower tail of G, respectively, satisfy F ≤st G. Winsorizing,in turn, consists in replacing the tails of the distribution with the percentile value fromeach end. If we assume that F−1(t) ≤ G−1(t) for t in some interval (γ1, 1 − γ2) thenγ(F,G) ≤ δ0 if γ1 + γ2 ≤ δ0. Of course, F−1 ≤ G−1 in (γ1, 1− γ2) if and only if the d.f.’sF and G that we obtain from F and G by Winsorizing (both at quantiles γ1 and 1− γ2)satisfy F ≤st G.

12

Page 13: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

2.3 Stochastic order and location-scale models

In this subsection we will specialize our analysis to the comparisons under a LS-model.In fact, often, and particularly in the simulation study below, we will focus on a normalmodel.

Let F0 be any d.f. on the real line. A d.f. F is said to belong to the LS-model basedon F0 if it satisfies F (x) = F0(x−θ

λ), x ∈ R for some ‘location’, θ ∈ R, and ‘scale’, λ > 0

parameters. The dependence on these parameters will be included in the notation throughthe corresponding subindices in the way Fθ,λ. By resorting to the quantile functions, weobtain the characterization

F−1θ,λ (y) = λF−1

0 (y) + θ, for every y ∈ (0, 1),

from which the stochastic order Fθ1,λ1 ≤st Fθ2,λ2 would require

(λ1 − λ2)F−10 (y) ≤ θ2 − θ1 for every y ∈ (0, 1). (7)

Under the assumption that F0 is continuous and strictly increasing, (something that wewill assume from now on in the LS-model setup), condition (7) holds if and only if λ1 = λ2

and θ1 ≤ θ2. Moreover, two quantile functions in the LS-family have a crossing point,say y0, if and only if λ1 6= λ2 and F−1

0 (y0) = θ2−θ1λ1−λ2 , therefore, if the crossing point exists,

it is unique. In other words, the set {y : F−1θ1,λ1

(y) ≤ F−1θ2,λ2

(y)} is (0,F0( θ2−θ1λ1−λ2 )] or

[F0( θ2−θ1λ1−λ2 ), 1) and `({F−1

θ1,λ1> F−1

θ2,λ2}) is 1 − F0( θ2−θ1

λ1−λ2 ) or F0( θ2−θ1λ1−λ2 ) (depending of the

sign of λ1 − λ2). Moreover note that

γ(Fθ1,λ1 , Fθ2,λ2) = γ(F0,1, F θ2−θ1λ1

,λ2λ1

), (8)

hence, we can focus our analysis on comparison to the reference d.f., F0.Now, given two d.f.’s F,G in the LS-model, if we are interested in guaranteeing an

agreement with stochastic dominance of G over F of, say, 95% for the γ(F,G) index, itwould suffice to consider the crossing point of the quantile functions and check whetherthe interval corresponding to {F−1 ≤ G−1} has, at least, length .95. For the normalmodel, when the reference d.f. is Φ, the standard normal d.f., a simple computation(which generalizes to any LS-model replacing Φ with the reference d.f., F0) shows that

γ(N(0, 1), N(µ, σ2)) = 1− Φ(

µ|σ−1|

), σ 6= 1,

while γ(N(0, 1), N(µ, 1)) = 0 if µ ≥ 0 and γ(N(0, 1), N(µ, 1)) = 1 if µ < 0. This identityhas been used to obtain the contour-plot in Figure 3. We see that γ(N(0, 1), N(µ, σ2)) isconstant along rays {(µ, σ) : µ = C|σ − 1|, σ) : σ > 0}, for some C > 0, and becomessingular at µ = 0, σ = 1. We also note that as σ grows bigger 1 (the case of highervariance in the second sample), we can have µ > 0 while γ(N(0, 1), N(µ, σ2))→ 1

2. This

13

Page 14: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

shows again that the conclusion µnew > µold is compatible with a worse performancewith the new treatment for up to 50% of the population.

We analyze now the behavior of π(F,G) under the LS-model. The fact that

supx∈R

(Fθ2,λ2(x)− Fθ1,λ1(x)) = supx∈R

(F θ2−θ1

λ1,λ2λ1

(x)− F0,1(x)

), (9)

shows that, as in (8), we have

π(Fθ1,λ1 , Fθ2,λ2) = π(F0,1, F θ2−θ1λ1

,λ2λ1

),

and we can consider only the case F = F0,1. There is no simple, general expression forπ(F0,1, Fθ,λ) for every LS-model, since the maximization problem in (9) depends on F0.In the particular case of the normal model some elementary computations show that, forσ 6= 1 and µ ≥ 0, π(N(0, 1), N(µ, σ2)) = Φ( x−µ

σ)− Φ(x), with

x =µ±

√µ2σ2 + 2(σ2 − 1) log σ

1− σ2, (10)

where the positive sign is taken for σ > 1 and the negative for σ < 1. Also note thatfor the same values of µ and σ, π(N(µ, σ2), N(0, 1)) = Φ(x) − Φ( x−µ

σ), with x the other

solution in (10), a fact that allows also to get the solution for nonpositive µ. Of course,when σ = 1 and µ ≥ 0, π(N(0, 1), N(µ, 1)) = 0, while π(N(µ, 1), N(0, 1)) is attainedat the only crossing point of both density functions x = µ/2. These computations havebeen used to produce the contour-plot in Figure 5. We see that π(N(0, 1), N(µ, σ2))has a smoother behavior than γ(N(0, 1), N(µ, σ2)). For a better understanding of thedifferent roles of π(N(0, 1), N(µ, σ2)) and γ(N(0, 1), N(µ, σ2)) we include Figure 6 below.We see that γ(N(0, 1), N(µ, σ2)) equals the common value of the d.f.’s at the crossingpoint, while π(N(0, 1), N(µ, σ2)) equals the difference between d.f.’s at the point wherethe density functions have a crossing point and the N(µ, σ2) d.f. is above the standardnormal d.f.. In the next section we will use this characterization of γ(F,G) in terms ofcrossing points (following a similar approach to that in Hawkins and Kochar (1991)) todesign a test for the null H0 : γ(F,G) ≥ δ0 against the alternative Ha : γ(F,G) < δ0.Similarly, we will discuss how to consistenly test H0 : π(F,G) ≥ δ0 against the alternativeHa : π(F,G) < δ0. These will be, according to the discussion above, feasible ways togather statistical evidence that for approximate stochastic dominance (that is, to concludethat, essentially, the new treatment is better than the old).

3 Testing approximate stochastic order

In this section we will succinctly analyze feasible test procedures to provide statisticalevidence of stochastic dominance. We keep in mind that no valid inferential procedure

14

Page 15: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

í� í���'��������������&��������������������������� � � �

���

���

���

���

���

���

Figure 6: Distribution and density functions N(0, 1) (red) and N(1, 6) (green). AbscissaC (resp. D) corresponds to crossing points of the distribution (resp. density) functions.Lengths of vertical lines at C and D are γ(N(0, 1), N(1, 6)) and π(N(0, 1), N(1, 6)), resp.

can handle the testing problem (3) and, consequently, we have to settle with the lessambitious testing problems

H0 : π(F,G) ≥ π0 against the alternative Ha : π(F,G) < π0, (11)

orH0 : γ(F,G) ≥ γ0 against the alternative Ha : γ(F,G) < γ0, (12)

with π(F,G) and γ(F,G) defined in (6) and (4), respectively. We insist that rejection ofthe null in (11) would provide statistical evidence that stochastic order, up to some smallcontamination, holds. In (12) rejection would lead to guarantee, at the desired level, thattreatment G produces better results than treatment F for at least a fraction of size 1−γ0

of the population.For this we will first include some results, obtained in Alvarez-Esteban et al. (2016),

that allow to tackle (11) and present some new results for dealing with (12) in a purelynonparametric way. Moreover we will include some variations that can be used in thenormal model. We will finish giving a comparative analysis under the normal model,based on a simulation study, providing a picture of the feasibility and performance ofthe different approaches. In both (11) and (12), resorting to the usual duality betweenone sided testing and confidence bounds, we would be interested in obtaining an upperconfidence bound, say for π(F,G), say for γ(F,G). If U = U(X1, ..., Xn, Y1, ..., Ym) (resp.V ) were an (asymptotic) upper confidence bound for π(F,G) (resp. for γ(F,G)), rejection

15

Page 16: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

of H0 in (11) (resp. in (12)) when U < π0 (resp. when V < γ0) would yield a test with(asymptotically) controlled type I error probability. This will be done in two differentsettings corresponding to nonparametric and parametric points of view.

Throughout the section we will assume that X1, ..., Xn and Y1, ..., Ym are independenti.i.d. r.v.’s obtained from the d.f.’s F and G, respectively. We recall that Fn and Gm

will denote the sample distribution functions based on the X ′s and Y ′s samples. As acommon assumption in both setups, we will suppose that

F and G are continuous; n,m→∞, λn,m := nn+m→ λ ∈ (0, 1). (13)

3.1 Testing approximate stochastic dominance with the π index

The role of π(F,G) in Proposition 2.1 suggests addressing the testing problem (11) onthe basis of the Kolmogorov-Smirnov statistic

π(Fn, Gm) = supx∈R

(Gm(x)− Fn(x)).

Consistency in the strong sense follows from the Glivenko-Cantelli theorem, which implies

π(Fn, Gm)→ π(F,G) almost surely, as m,n→∞,

while the asymptotic behavior of its law, under assumption (13), was obtained by Ragha-vachari (1973), extending well known results by Kolmogorov and Smirnov, in the way√

mnm+n

(π(Fn, Gm)− π(F,G))→w B(F,G, λ). (14)

The limit law in (14) is that of the maximization of a Gaussian Process on a suitable set

B(F,G, λ) := supt∈T (F,G,π(F,G))

(√λ B1(t)−

√1− λ B2(t− π(F,G))

),

where B1(t) and B2(t) are i.i.d. Brownian Bridges on [0, 1], and the set is

T (F,G, π) := {t ∈ [π, 1] : G(x) = t and F (x) = t− π for some x ∈ R}. (15)

Also Alvarez-Esteban et al. (2016) provides new results, involving bootstrap and com-putational issues, as well as exponential bounds for both types of error probabilities intesting on the basis of this statistic. Moreover (see the extended version Alvarez-Estebanet al. (2014)) it is shown that, for α ∈ (0, 1/2) there are sharp lower bounds for theα−quantiles of the law of B(F,G, λ), given by the quantiles of a “least favorable” normallaw N(0, σ2

π(F,G)(λ)) depending of λ and of π(F,G).

In fact, (in the interesting cases where λπ ≤ 12

and (1 − λ)π ≤ 12) taking σ2

π(λ) =14− π2λ(1− λ), rejection of the null in (11) when√

nmn+m

(π(Fn, Gm)− π0) < σπ0(λn,m)Φ−1(α), (16)

16

Page 17: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

provides a test of uniform asymptotic level α: If PF,G denotes for the probability whenthe samples are obtained from F and G, then

limn→∞

sup(F,G)∈H0

PF,G[√

nmn+m

(πn,m − π0) < σπ0(λn,m)Φ−1(α)]

= α.

Moreover, it detects alternatives with power exponentially close to one (see Proposition3.3 in Alvarez-Esteban et al. (2016)).

Two modifications of (16) result in improving the finite sample performance of thetest. The first relies on the (classical) substitution of the least favorable variance by anappropriate estimation. The second, with a more important effect, tries to correct theintrinsic bias of π(Fn, Gm), a task that often is successfully carried by resorting to theaverage of a set of boostrap estimates. The final proposal, based on these modifications,πn,m,BOOT of π(Fn, Gm) and σn,m of σπ0(λn,m) (see details in Alvarez-Esteban et al. (2016)),consists in rejection of H0 : π(F,G) ≥ π0 if√

nm

n+m(πn,m,BOOT − π0) < σn,mΦ−1(α), (17)

which defines a test of asymptotic level α with quickly decreasing type I and type II errorprobabilities away from the null hypothesis boundary. Of course, by defining

U := πn,m,BOOT −√

n+mnm

σn,mΦ−1(α) (18)

we get an upper bound with asymptotic confidence level at least 1− α for π(F,G).Let us notice that Section 3.4 in Alvarez-Esteban et al. (2016) is devoted to the

adaptation of these statistical tools to the dependent data setup.

3.2 Testing approximate stochastic dominance with the γ index

In this subsection we introduce a new procedure for the testing problem (12) under someadditional assumption on F and G. Basically, we assume that the d.f.’s F and G havea single crossing point. This kind of assumption has been considered in some othersetups. Hawkins and Kochar (1991) (see also Chen et al (2002)) proposed and analyzedan estimator of this crossing point, x∗. They also stress on the interest of this point forcomparison of lifetimes under treatments, because, if the r.v. of interest is the survivaltime, then x∗ is the threshold such that, say, the control subjects have a lower chance ofsurvival to any age x < x∗, while they have a higher chance of survival to ages x > x∗.Here we will make the same assumption, but since our interest concerns the commonvalue γ∗ = F (x∗) = G(x∗) at this point, we will state it in terms of the quantile functionsin the alternative way: there is a unique γ∗ ∈ (0, 1) such that F−1(γ∗) = G−1(γ∗) and,

F−1(t)−G−1(t) has opposite signs on (0, γ∗) and on (γ∗, 1). (19)

17

Page 18: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

In the same spirit as in Hawkins and Kochar (1991), we introduce

ψ(γ) =

∫ γ

0

(F−1(t)−G−1(t)

)dt−

∫ 1

γ

(F−1(t)−G−1(t)

)dt

= 2

∫ γ

0

(F−1(t)−G−1(t)

)dt− (µF − µG),

where µF , µG denote the means of F and G. The following result gives the link betweenψ and γ(F,G).

Proposition 3.1 If F and G have finite means and positive density functions on R andsatisfy (19), then γ∗ is the unique maximizer of |ψ(t)| on (0, 1), ψ(γ∗) 6= 0 and

γ(F,G) =

{γ∗ if ψ(γ∗) > 0

1− γ∗ if ψ(γ∗) < 0.

This suggests that we consider to estimate γ∗ by

γ∗n,m = min(argmaxγ∈(0,1)|ψn,m(γ)|

),

with ψn,m(γ) = 2∫ γ

0(F−1

n (t)−G−1m (t)) dt− (Xn − Ym), and γ(F,G) by

γn,m =

{γ∗n,m if ψn,m(γ∗n,m) ≥ 0

1− γ∗n,m if ψn,m(γ∗n,m) < 0.

The asymptotic behavior of γn,m is given next.

Proposition 3.2 Assume that F and G satisfy the conditions of Proposition 3.1 withdensities, f and g which are continuous in a neighborhood of x∗, and such that f(x∗) 6=g(x∗). Then, if the sample sizes satisfy (13),√

nmn+m

(γn,m − γ(F,G))→w N(0, σ2),

where

σ2 =γ∗(1− γ∗) [(1− λ)g2(x∗) + λf 2(x∗)]

(g(x∗)− f(x∗))2 .

We can now base our rejection rule for (12) on a bootstrap estimation of σ2. Moreprecisely, we will reject H0 : γ(F,G) ≥ γ0 if√

nmn+m

(γn,m − γ0) < σn,mΦ−1(α), (20)

where σn,m is the bootstrap estimator of σ. This rejection rule provides a consistent testof asymptotic level α. Also, as in (18),

V := γn,m −√

n+mnm

σn,mΦ−1(α) (21)

provides an upper confidence bound for γ(F,G) with asymptotic level 1− α.

18

Page 19: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Remark 3.2.1 We must to point out the singularity of the procedure if the density func-tions coincide at the cross point of the d.f.’s. This fact produces instability of the approachwhen the densities are very similar. In particular, under a LS-model this will happen whenthe scales are very similar, a case that should lead to guarantee an stochastic dominanceon the basis of the estimates of the location parameters.

3.3 The parametric point of view

If we assume that F and G belong to the same LS-model, with F0 continuous and strictlyincreasing, then the values π(F,G) and γ(F,G) can be obtained from the parameters.Therefore the considered nonparametric approaches have natural competitors based onthe estimation of the parameters. Of course, such parametric alternatives will be highlynonrobust, specially for small values of the population sizes, that are the interesting onesbut strongly depend on the tails of the distributions. However we will consider theseparametric alternatives in order to explore the relative performance of our above propos-als under perfect conditions. In this subsection we assume that the parent distributionfunctions F,G are respectively FN(µ1,σ2

1), FN(µ2,σ22), although other parametric LS-families

could be treated in the same way.By considering the maximum likelihood estimators Xn, Ym and S2

X , S2Y for the involved

parameters, we can use the plugin estimators

πn,m := π(FN(Xn,S2X), FN(Ym,S2

Y )) for π(F,G) (22)

γn,m := γ(FN(Xn,S2X), FN(Ym,S2

Y )) for γ(F,G) (23)

Since these estimators are differentiable functions of the means and the standard devia-tions of the samples (excepting when σ1 = σ2, for γ), they will be asymptotically normalwith the possible exception of γ in the case σ1 = σ2. Avoiding this case, the consistencyand asymptotic normality would be guaranteed, and once more the bootstrap can be usedto approximate the distributions of (22) and (23). The derivation of the tests and upperbounds would parallel those in subsections 3.1 and 3.2.

3.4 Some simulations

We present now a simulation study that shows the power of the procedures discussedin subsections 3.1 and 3.2 for the assessment of approximate stochastic order. In oursimulations we have generated pairs of independent i.i.d. random samples of sizes 100,1000 and 5000 and report the rejection frequencies of the null hypotheses (11) and (12)for three choices of π0 and γ0 (π0, γ0 ∈ {0.01, 0.05, 0.1}). In all cases we have chosen theunderlying distribution functions, F and G, to be normal. This allows to compare theperformance of the nonparametric tests (16) (with the bootstrap bias correction) and (20)to the parametric procedures discussed in subsection 3.3. Needless to say, these parametric

19

Page 20: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

procedures are inconsistent if the unknown random generators F and G do not exactly fitinto the parametric model. On the other hand, if F and G satisfy the parametric modelthen the parametric procedures should be more efficient. Hence, the performance of theseparametric procedures should be taken as an ideal benchmark to which we compare theperformance of the nonparametric, consistent procedures. More extensive simulations,showing the performance of the test (16) in different setups, including the least favorablecases, can be found in the Online Appendix 2 to Alvarez-Esteban et al. (2016).

The results are reported in Tables 3.3 and 3.4. In all cases the nominal level of thetest is α = 0.05, F is the N(0, 1) distribution and G is a N(µ, σ2). Table 3.3 deals withthe testing problem (11). Several choices of σ (σ ∈ {0.7, 1, 1.5}) have been considered.For each of these three choices of sigma there are three different values of µ, chosen tomake π(F,G) = 0.01 (left column), π(F,G) = 0.05 (central column) and π(F,G) = 0.1(right column). Then, for each combination of sample sizes, of parameters, µ and σ, andof tolerance level, π0, there are two reported rejection frequencies, with the upper rowcorresponding to the nonparametric test and the lower row to the parametric test.

Table 3.3 Rejection rates for π(N(0, 1), N(µ, σ2)) ≥ π0 at the level α = .05 along 1,000simulations. Upper (resp. lower) rows show the results for nonparametric (resp. parame-tric) comparisons. The means for each σ have been chosen to satisfy π(N(0, 1), N(µ, σ2))equal to 0.01, 0.05 and 0.10 (respectively: first, second and third columns).

σ = 0.7 σ = 1 σ = 1.5Sample means means means

π0 size .443 .143 −.050 −.025 −.125 −.251 .770 .287 −.017.01 100 .132 .010 .000 .026 .011 .001 .110 .009 .000

.095 .001 .000 .002 .000 .000 .113 .002 .0001000 .067 .000 .000 .018 .000 .000 .048 .000 .000

.085 .000 .000 .044 .001 .000 .080 .000 .0005000 .044 .000 .000 .016 .000 .000 .049 .000 .000

.062 .000 .000 .072 .000 .000 .069 .000 .000.05 100 .379 .069 .005 .071 .020 .006 .431 .059 .003

.675 .096 .008 .230 .084 .016 .730 .103 .0071000 .993 .031 .000 .397 .015 .000 .997 .065 .000

1.000 .051 .000 .737 .060 .000 1.000 .083 .0005000 1.000 .035 .000 .979 .028 .000 1.000 .037 .000

1.000 .057 .000 1.000 .057 .000 1.000 .052 .000.10 100 .788 .270 .033 .222 .087 .027 .822 .247 .046

.960 .450 .083 .543 .290 .088 .978 .489 .0771000 1.000 .934 .035 .990 .615 .028 1.000 .967 .046

1.000 .992 .059 1.000 .877 .053 1.000 .996 .0615000 1.000 1.000 .029 1.000 .998 .032 1.000 1.000 .039

1.000 1.000 .055 1.000 1.000 .049 1.000 1.000 .053

20

Page 21: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

We see that the rejection frequencies for the nonparametric procedure show either areasonable agreement to the nominal level of the test or are slightly conservative, while theparametric procedure is slightly liberal. On the other hand, the nonparametric procedureis able to reject the null with remarkably high power. As an illustration, consider, forinstance, the block σ = 0.7, π0 = 0.05. Within this block the boundary between thenull and the alternative hypotheses corresponds to the middle column (µ = .143; thenπ(F,G) = 0.05). The observed rejection frequencies for the nonparametric procedure are.031 and .035 for sample sizes n = m = 1000 and n = m = 5000, respectively, a bit belowthe nominal level of the test (α = 0.05). As we move to the alternative (the left column,µ = .443, π(F,G) = 0.01) we see that samples of size n = m = 1000 are enough to rejectthe null hypotesis π(F,G) ≥ 0.05 (and conclude that F and G satisfy stochastic order upto less than 5% contamination) with high probability (the observed rejection frequencyis .993). The worst behaviour in terms of power corresponds to the case σ = 1 (middlegroup). In this case rejection of the null with high power (90% or higher) requires samplesizes n = m = 5000. We note, nevertheless, that testing for approximate stochastic orderis a hard inferential problem and, on the other hand, sample sizes in this range are notunusual in many fields of application. As for the parametric procedure introduced forcomparison (bottom rows) we observe that it presents a better performance in terms ofpower but it is a bit liberal in some cases (and recall, again, that it is not a consistentprocedure as we move away from this LS setup).

The results for the testing problem (12) are reported in Table 3.4. In this setup thecases σ = σ0 and σ = 1 − σ0 are symmetrical and we have focused on the case σ ≥ 1.The case σ = 1 would need a different handling, since γ(N(0, 1), N(µ, 1)) only takesthe values 0 and 1 depending on when µ ≥ 0 or µ < 0. For these reasons we have fixedσ ∈ {1.1, 1.5, 2}, choosing then µ accordingly to get γ(N(0, 1), N(µ, σ2)) ∈ {.01, .05, 0.10}.

We see in this case that the tests (both the nonparametric and parametric) can betoo liberal if the two samples have similar variances. This is not surprising, since theasymptotic variance in Proposition 3.2 tends to ∞ as σ → 1. As we move away from thissingular case we see a somewhat better degree of agreement to the nominal level of thetest, as well as a generally good performance in terms of power.

Deviations from the ideal model of stochastic order in terms of the γ(F,G)-index ad-mit, arguably, a simpler interpretation than deviations in π(F,G)-index, but we see thatthe assessment of stochastic order up to a small deviation in π(F,G)-index is, from thepoint of view of statistical inference, a better posed problem, less affected by the similarityof variances. Finally, we remark that, although both indices are intrinsically nonparamet-ric in nature, the scope of π(F,G) is considerably larger, since the single crossing pointassumption required for the validity of the asymptotic theory for the γ(F,G)-index couldbe too restrictive for some real applications.

21

Page 22: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Table 3.4 Rejection rates for γ(N(0, 1), N(µ, σ2)) ≥ γ0 at the level α = .05 along 1,000simulations. Upper (resp. lower) rows show the results for nonparametric (resp. parame-tric) comparisons. The means for each σ have been chosen to satisfy γ(N(0, 1), N(µ, σ2))equal to 0.01, 0.05 and 0.10 (respectively: first, second and third columns).

σ = 1.1 σ = 1.5 σ = 2Sample means means means

γ0 size .233 .164 .128 1.163 .822 .641 2.326 1.645 1.282.01 100 .000 .000 .000 .007 .000 .000 .000 .006 .000

.001 .000 .000 .070 .006 .000 .124 .001 .0001000 .013 .000 .000 .095 .000 .000 .102 .000 .000

.038 .004 .000 .085 .000 .000 .083 .000 .0005000 .046 .001 .001 .097 .000 .000 .065 .000 .000

.097 .003 .000 .076 .000 .000 .048 .000 .000.05 100 .015 .003 .000 .338 .058 .015 .611 .089 .017

.038 .008 .001 .491 .112 .035 .764 .116 .0161000 .209 .043 .007 .919 .088 .002 .998 .082 .000

.319 .079 .015 .992 .089 .001 1.000 .066 .0005000 .654 .084 .008 1.000 .048 .000 1.000 .044 .000

.799 .105 .011 1.000 .053 .000 1.000 .053 .000.10 100 .061 .027 .007 .702 .256 .095 .926 .390 .121

.090 .035 .016 .810 .310 .141 .978 .461 .1121000 .540 .212 .073 1.000 .661 .084 1.000 .884 .066

.672 .258 .092 1.000 .795 .074 1.000 .988 .0575000 .964 .395 .105 1.000 .993 .060 1.000 1.000 .049

.987 .437 .102 1.000 .999 .062 1.000 1.000 .055

Appendix: Proofs.

To the best of our knowledge, the index γ(F,G) has been introduced just here, thus weinclude in this appendix some technical details to justify our claims about the asymptoticsfor the proposed estimator γn,m.

Proof of Proposition 3.1. We begin noting that ψ(1) = µF − µG = −ψ(0) and alsothat ψ is differentiable in (0, 1), with derivative ψ′(t) = 2(F−1(t)−G−1(t)).

Consider first the case

F−1(t) > G−1(t), for t ∈ (0, γ∗), while F−1(t) < G−1(t), for t ∈ (γ∗, 1),

we have γ(F,G) = γ∗ with ψ′(t) > 0, t ∈ (0, γ∗), while ψ′(t) < 0, t ∈ (γ∗, 1). In particular,γ∗ is the unique maximizer of ψ.

22

Page 23: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Now, if ψ(0) ≥ 0 then ψ(1) ≤ 0, ψ(γ∗) > ψ(0) ≥ 0 and ψ(t) ∈ [ψ(1), ψ(γ∗)] for allt ∈ (0, 1) and we see that γ∗ is the unique maximizer of |ψ(t)|. If, on the other hand,ψ(0) < 0 then ψ(γ∗) > ψ(1) > 0, ψ(t) ∈ [ψ(0), ψ(γ∗)] for all t ∈ (0, 1) and, again, γ∗ isthe unique maximizer of |ψ(t)|.

In the other case F−1(t) < G−1(t), t ∈ (0, γ∗), and F−1(t) > G−1(t), t ∈ (γ∗, 1), wehave γ(F,G) = 1− γ∗ and, arguing as above, we see that γ∗ is the unique minimizer of ψand the unique maximizer of |ψ(t)| and satisfies ψ(γ∗) < 0.

�By considering the quantile functions F−1

n associated to the empirical d.f.’s Fn asa random function, we get in a natural way the quantile process defined by uFn (t) =√n(F−1

n (t) − F−1(t)) for t ∈ (0, 1). The study of this statistically meaningful stochasticprocess was addressed in the second half of the past century. For use in the proof ofProposition 3.2 we provide the following lemma.

Lemma 3.5 Let F (resp. G) be a d.f. with continuous and positive derivative f (resp. g)on the interval [F−1(p)− ε, F−1(q) + ε] (resp. [G−1(p)− ε,G−1(q) + ε]). Under the inde-pendence assumption on the samples obtained from F and G, let uFn (·) and uGm(·)) be thecorresponding quantile processes. Then there exist independent versions uFn (·) and uGm(·),of these processes (with the same joint distribution that the originals) and independentstandard Brownian bridges B1 and B2, such that

supt∈[p,q]

∣∣∣(√ m

n+muFn (t)−

√n

n+muGm(t)

)−(√

1− λ B1(t)

f(F−1(t))−√λ

B2(t)

g(G−1(t))

)∣∣∣ a.s.→ 0.

Proof: The hypotheses on F and G guarantee (see Example 3.9.24 in van der Vaart andWellner (1996)):

uFn (·) w→ B1(·)f(F−1(·))

and uGm(·) w→ B2(·)g(G−1(·))

in the space L∞[p, q],

where B1, B2 are independent standard Brownian bridges. Moreover, the independenceof the samples implies that of the quantile processes, thus the joint convergence(

uFn (·), uGm(·))

w→( B1(·)f(F−1(·))

,B2(·)

g(G−1(·))

).

Now we can resort to the Skorohod-Dudley-Wichura almost surely representation theorem(see e.g. Theorem 1.10.3 in van der Vaart and Wellner (1996)), providing a sequence of

pairs(uFn (·), uGm(·)

)d=(uFn (·), uGm(·)

)and a pair

(B1(·)

f(F−1(·)) ,B2(·)

g(G−1(·))

)d=(

B1(·)f(F−1(·)) ,

B2(·)g(G−1(·))

)such that

(uFn (·), uGm(·)

)a.s.→(

B1(·)f(F−1(·)) ,

B2(·)g(G−1(·))

)in the space L∞[p, q]. From here, the re-

sult is straightforward.�

23

Page 24: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Proof of Proposition 3.2. We note first that

supγ∈[0,1]

|ψn,m(γ)− ψ(γ)| ≤∫ 1

0

|F−1n (t)− F−1(t)|dt+

∫ 1

0

|G−1m (t)−G−1(t)|dt→ 0 a.s.

(to check that∫ 1

0|F−1n (t)− F−1(t)|dt vanishes asymptotically we can use the fact that it

equals the Wasserstein distance between Fn and F , see del Barrio et al. (1999)). As aconsequence we have that γ∗m,n → γ∗ a.s..

Therefore, in the case F−1−G−1 > 0 in (0, γ∗), F−1−G−1 < 0 in (γ∗, 1), we will havethat a.s. ψm,n(γ∗m,n) > 0 eventually and γm,n = γ∗m,n. Similarly, in the case F−1−G−1 < 0in (0, γ∗), F−1 − G−1 > 0 in (γ∗, 1), we will have a.s. that eventually γm,n = 1 − γ∗m,n.Hence, in a probability one set, eventually√

n+mnm

(γm,n − γ(F,G)) = ±√

n+mnm

(γ∗m,n − γ∗),

with the positive sign in the former case and the negative in the latter.By symmetry of the centered normal laws it suffices to prove the convergence for√n+mnm

(γ∗m,n − γ∗). From this point we assume that we are in the case F−1 −G−1 > 0 in

(0, γ∗), F−1 −G−1 < 0 in (γ∗, 1) and note that in this case γ∗ is also the maximizer of ψ.We note also that we can replace ψ by φ(γ) =

∫ γδ

(F−1(t) − G−1(t))dt for some fixedδ ∈ (0, 1) and still have that γ∗ is the maximizer of φ(γ), γ ∈ (δ, δ′) for some otherδ′ ∈ (0, 1). We similarly set φn,m(γ) =

∫ γδ

(F−1n (t) − G−1

m (t))dt. The assumptions ensurethat we can choose δ and δ′ such that f and g are continuous and bounded away from 0in (δ, δ′). Then, with the notation introduced for the quantile processes associated to thesamples obtained from F and G√

nm

n+m(φn,m(γ)− φ(γ)) =

∫ γ

δ

√m

n+muFn (t)−

√n

n+muGm(t)dt.

The application of Lemma 3.5 to the quantile processes in L∞[δ, δ′], implies that thereare versions of uFn , uGm, B1 and B2 such that

supt∈[δ,δ′]

∣∣∣(√ m

n+muFn (t)−

√n

n+muGm(t)

)−(√

1− λ B1(t)

f(F−1(t))−√λ

B2(t)

g(G−1(t))

)∣∣∣ a.s.→ 0.

From this we conclude that for any sequence verifying γn,m = γ∗ + oP (1),√nm

n+m

(φn,m(γn,m)− φ(γm,n)

)−√

nm

n+m

(φn,m(γ∗)− φ(γ∗)

)=

∫ γm,n

γ∗

(√m

n+muFn (t)−

√n

n+muGm(t)

)dt

= (γm,n − γ∗)Z + (γm,n − γ∗)oP (1), (24)

24

Page 25: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

with Z =√

1− λB1(γ∗)/f(x∗)−√λB2(γ∗)/g(x∗). Since we already obtained the consis-

tency γ∗m,n → γ∗ a.s., we can apply now Theorem 3.6 below to conclude that√nm

n+m(γ∗n,m − γ∗) = − Z

1f(x∗)

− 1g(x∗)

+ oP (1),

which completes the proof.�

The following Theorem is a suitable version of Theorem 3.2.16 in van der Vaart andWellner (1996) (see the final comments there leading to this simplified statement). Itallows to obtain the asymptotic law of an estimator, like that involved in Proposition 3.2,based on an “argmax” procedure. This kind of argument is one of the best known tools toaddress the asymptotics of M-estimators (see Section 3.2.4 in van der Vaart and Wellner(1996))

Theorem 3.6 (see Theorem 3.2.16 in van der Vaart and Wellner (1996)) Let Mn bestochastic processes indexed by an open interval Θ ⊂ R and M : Θ → R a deterministicfunction. Assume that θ → M(θ) is twice continuously derivable at a point of maximumθ0 with second-derivative M ′′(θ0) 6= 0. Suppose that for some sequence rn →∞

rn

(Mn(θn)−M(θn)

)− rn

(Mn(θ0)−M(θ0)

)= (θn − θ0)Z + oP

(|θn − θ0|

),

for every random sequence θn = θ0 + oP (1) and a random variable Z. If the sequence

θnP→ θ0 and satisfies Mn(θn) ≥ supθMn(θ)− oP (r−2

n ) for every n, then

rn(θn − θ0) = − Z

M ′′(θ0)+ oP (1).

ACKNOWLEDGEMENTS

Research partially supported by the Spanish Ministerio de Economıa y Competitividady fondos FEDER, grants MTM2014-56235-C2-1-P and MTM2014-56235-C2-2, and byConsejerıa de Educacion de la Junta de Castilla y Leon, grant VA212U13.

References

Alvarez-Esteban, P.C.; del Barrio, E.; Cuesta-Albertos, J.A. and Matran, C. (2008).Trimmed comparison of distributions. J. Amer. Statist. Assoc. 103, No. 482, 697–704.

Alvarez-Esteban, P.C.; del Barrio, E.; Cuesta-Albertos, J.A. and Matran, C. (2012).Similarity of samples and trimming. Bernoulli, 18, 606–634.

25

Page 26: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Alvarez-Esteban, P.C.; del Barrio, E.; Cuesta-Albertos, J.A. and Matran, C.(2014). A contamination model for approximate stochastic order: extended version.http://arxiv.org/abs/1412.1920

Alvarez-Esteban, P.C.; del Barrio, E.; Cuesta-Albertos, J.A. and Matran, C. (2016). Acontamination model for stochastic order. Test, 25, 751-774.

Anderson, G. (1996). Nonparametric tests for stochastic dominance. Econometrica 64,1183–1193.

Barrett, G.F., and Donald, S.G., (2003). Consistent tests for stochastic dominance. Econo-metrica 71, 71–104

Berger, R.L. (1988). A nonparametric, intersection-union test for stochastic order. InStatistical Decision Theory and Related Topics IV. Volume 2. (eds. Gupta, S. S., andBerger, J. O.). Springer-Verlag, New York.

Chen, G., Chen, J., and Chen, Y. (2002). Statistical inference on comparing two distri-bution functions with a possible crossing point. Statist. Probab. Letters, 60, 329–341.

Davidson, R., and Duclos, J.-Y. (2000). Statistical inference for stochastic dominance andfor the measurement of poverty and inequality. Econometrica 68, 1435–1464.

Davidson, R., and Duclos, J. (2013). Testing for Restricted Stochastic Dominance. Econo-metric Reviews 32(1), 84–125.

del Barrio, E., Gine, E. and Matran, C. (1999). Central limit theorems for the Wassersteindistance between the empirical and the true distributions. Ann. Probab., 27, 1009–1071.

Dette, H. and A. Munk (1998). Validation of linear regression models. Ann. Statist., 26,778–800.

Hawkins, D. L., and Kochar, S. C. (1991). Inference for the crossing point of two contin-uous CDF’s. Ann. Statist., 19, 1626–1638.

Hodges, J.L. and Lehmann, E.L. (1954). Testing the approximate validity of statisticalhypotheses. J. R. Statist. Soc. B 16, 261–268.

Lehmann, E. L. (1955). Ordered families of distributions. Ann. Math. Statist. 26, 399–419.

Lehmann, E. L., and Rojo, J. (1992). Invariant directional orderings. Ann. Statist., 20,2100–2110.

Leshno, M. and Levy, H. (2002). Preferred by “All” and preferred by “Most” decisionmakers: almost stochastic dominance. Management Science, 48, 1074-1085

26

Page 27: Models for the assessment of treatment improvement: the ... · ideal model of stochastic dominance are tractable models from the point of view of sta-tistical inference. We also provide

Linton, O., Maasoumi, E., and Whang, Y.-J. (2005). Consistent testing for stochasticdominance under general sampling schemes. Review Economic Studies 72, 735–765.

Linton, O., Song, K., and Whang, Y.-J. (2010). An improved bootstrap test for stochasticdominance. Journal of Econometrics 154, 186–202.

Liu, J., and Lindsay, B.G. (2009). Building and using semiparametric tolerance regionsfor parametric multinomial models. Ann. Statist., 37, 3644–3659.

Mann, H. B., and Whitney, D. R. (1947). On a test of whether one of two random variablesis stochastically larger than the other. Ann. Math. Statist., 18, 50–60.

McFadden, D. (1989). Testing for stochastic dominance. In Studies in the Economics ofUncertainty: In Honor of Josef Hadar, ed. by T. B. Fomby and T. K. Seo. Springer.

Muller, A., and Stoyan, D. (2002). Comparison Methods for Stochastic Models and Risks.Chichester, England. Wiley.

Munk, A., and Czado, C. (1998). Nonparametric validation of similar distributions andassessment of goodness of fit. J.R. Statist. Soc. B, 60, 223–241.

Raghavachari, M. (1973). Limiting distributions of the Kolmogorov-Smirnov type statis-tics under the alternative. Ann. Statist., 1, 67–73.

Rudas, T., Clogg, C.C., and Lindsay, B. G. (1994). A new index of fit based on mixturemethods for the analysis of contingency tables. J. R. Statist. Soc. B 56, No 4, 623–639.

Shaked, M., and Shanthikumar, J.G. (2007). Stochastic Orders. New York. Springer.

Tsetlin, I. R. L. Winkler, R. J. Huang, L. Y. Tzeng (2015) Generalized Almost StochasticDominance. Operations Research 63, 363-377.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes.Springer.

27