
AN ESTIMATION OF K SAMPLE SURVIVAL FUNCTIONS UNDER STOCHASTIC ORDERING CONSTRAINT

DANIEL HEALY, AMY KO

1. Abstract

Stochastic ordering plays an important role in statistics. It was introduced by Lehmann (1955), who defined a random variable X with distribution function F to be stochastically larger than a random variable Y with distribution function G if F(x) ≤ G(x) for all x. In terms of the survival functions P = 1 − F and Q = 1 − G, this is equivalent to P ≥ Q. Stochastic ordering for two populations has already been studied by many statisticians. Lo (1987) suggested the two estimators F̂_m = min(F_m, G_n) and Ĝ_n = max(F_m, G_n), and showed that, under suitable conditions, they are asymptotically minimax for a large class of loss functions and strongly uniformly consistent as m and n both tend to infinity. Lo's estimators satisfy the stochastic ordering between the two populations, but they are not consistent unless both m and n go to infinity. To solve this problem, Rojo (2004) proposed estimators F*_m (G*_n) that are strongly uniformly consistent for F (G) as m (n) alone goes to infinity. In this research we examine, in terms of mean squared error and bias, the most precise method for estimating survival functions in the multiple-sample case under a stochastic ordering constraint. Censored data are also considered. Finally, we apply our algorithm to real data from Kalbfleisch and Prentice (1980) and to umbrella ordering.

2. Literature review

The stochastic ordering between two populations was first defined by Lehmann (1955). Rojo (2004) then proposed the idea of the benchmark function

R_{m+n}(x) = (n/(m+n)) Q_n(x) + (m/(m+n)) P_m(x),    (1)

defined as the empirical survival function of the two samples combined. As a result, the estimators become

1


P̃_{mn}(x) = max(P_m(x), R_{m+n}(x))    (2)

and

Q̃_{mn}(x) = min(Q_n(x), R_{m+n}(x)),    (3)

which satisfy the stochastic order constraint. Since min(P_m, Q_n) ≤ R_{m+n} ≤ max(P_m, Q_n), it follows that P̃_{mn} ≤ P̂_{mn} and Q̃_{mn} ≥ Q̂_{mn}, where P̂_{mn} = max(P_m, Q_n) and Q̂_{mn} = min(P_m, Q_n) are Lo's estimators written in survival-function form.

We applied the idea of the benchmark function to derive an algorithm that minimizes mean squared error and bias for both uncensored data and censored data.
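For two samples evaluated on a common grid of points, equations (1)-(3) can be sketched as follows (an illustrative Python translation, not the authors' R code; the function names are ours):

```python
# Illustrative sketch of equations (1)-(3). P_m and Q_n are survival-function
# values evaluated on a common grid; m and n are the sample sizes.
def benchmark(P_m, Q_n, m, n):
    """Benchmark R_{m+n}: size-weighted mixture of the two survival curves."""
    w = m + n
    return [(n * q + m * p) / w for p, q in zip(P_m, Q_n)]

def ordered_pair(P_m, Q_n, m, n):
    """Project (P_m, Q_n) onto the stochastic order P >= Q via the benchmark."""
    R = benchmark(P_m, Q_n, m, n)
    P_tilde = [max(p, r) for p, r in zip(P_m, R)]  # equation (2)
    Q_tilde = [min(q, r) for q, r in zip(Q_n, R)]  # equation (3)
    return P_tilde, Q_tilde
```

Because P̃ takes a maximum against R and Q̃ a minimum against it, the two outputs are ordered pointwise by construction.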

2.1. Uncensored Data.

Let X_1, . . . , X_k denote random samples, possibly of different sizes, drawn from k different distributions. These distributions must be stochastically ordered, meaning that at each point the values must go from least to greatest from population 1 to population k:

X_1 ≤st X_2 ≤st X_3 ≤st · · · ≤st X_k.

From these samples, the empirical cumulative distribution functions are computed. Each empirical cumulative distribution function is evaluated at a common set of x values, called t_i, so that the differing sample sizes do not affect the comparison and every empirical function has the same length. The t_i values are obtained by constructing a true exponential survival function at a specific λ value; inverting this true survival function yields x values spread across the entire survival curve. The empirical survival functions are then obtained by subtracting each empirical cumulative distribution function from 1:

S_i = 1 − F̂_i,  i = 1, . . . , k.
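The grid construction can be sketched in a few lines (the point count and the evenly spaced survival probabilities are our assumptions, not the paper's exact choices): inverting S(t) = exp(−λt) at survival probability p gives t = −ln(p)/λ.

```python
import math

# Sketch of the evaluation grid: invert the true exponential survival curve
# S(t) = exp(-lam * t) at evenly spaced survival probabilities in (0, 1),
# so the t_i cover the whole curve.
def time_grid(lam, n_points):
    probs = [(i + 1) / (n_points + 1) for i in range(n_points)]
    return [-math.log(p) / lam for p in probs]

def empirical_survival(sample, t):
    """Empirical survival function 1 - F_n(t) of a sample at time t."""
    return sum(1 for x in sample if x > t) / len(sample)
```

Every empirical survival function is then evaluated at the same t_i, regardless of its sample size.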

Using the asymptotically minimax estimators proposed by Lo (1987) and the benchmark estimating function of Rojo (1996), which averages two curves, it is possible to extend the estimation from two distributions to k distributions. The benchmark function weights each curve by the length of its sample to find the average line between the two curves. When estimating multiple distributions, it is very difficult to preserve stochastic order once random samples are used, as discussed by Shaked and Shanthikumar (1997). Because survival functions take the shape of exponential curves, the estimators could be tested by comparing a true exponential curve with the empirical survival curve of randomly generated exponential data at the same λ value. The real-world relevance is that no real-life data follow a specific model exactly; they are randomized. When using survival analysis to find the probability of surviving to a certain age, the data will not be perfect and will not trace a true exponential curve. That is the point of a randomly generated survival curve: by comparing randomly generated data with the true survival curve they came from, it is possible to test the effectiveness of an estimator. Although for true exponential functions a larger λ always means smaller survival values, this is not the case for randomly generated values. For example, comparing two exponential functions with λ = 1.1 and λ = 1.2, the true curves are two lines very close to one another, with the λ = 1.1 curve always larger at each x value; randomly generated curves need not preserve this order.

The figure compares a true exponential survival function with λ = 1, represented by dots, with the survival function of random exponential data generated at λ = 1. The random curve is never exactly on the true curve. This creates a problem when comparing multiple distributions: the randomly generated survival curve with the larger λ value could lie above its neighbor, something that would never happen with the true survival curves.

This creates a problem for stochastic ordering, since the estimated survival curves must satisfy X_1 ≤st X_2 ≤st X_3 ≤st · · · ≤st X_k. To resolve this dilemma, it is necessary to construct estimators that are stochastically ordered even though the randomly generated curves are not. This models real-world data: gathered data will not perfectly fit an exponential curve, and stage 4 cancer may, at certain times, show a higher survival rate than stage 3. To enforce the restriction, we created an estimation scheme that takes, for the stochastically largest estimator s*_n, the maximum of the largest randomly generated survival function and the benchmark function of the two largest curves. This ensures that the estimated survival curve at each point lies either on the randomly generated survival function or on the benchmark line between the largest curve and the next curve below it. The next highest curve, s*_{n−1}, is then estimated by taking the minimum of the estimator above it and that same benchmark function. To make these estimators work for an arbitrary number of survival functions, from the third largest estimator s*_{n−2} down to the smallest s*_1, the method follows a fixed pattern: each estimator is the minimum of the estimator above it and the benchmark of that curve and the curve above it.

Algorithm

s*_n = max(s_n, R_{n−1,n})
s*_{n−1} = min(s*_n, R_{n−1,n})
s*_{n−2} = min(s*_{n−1}, R_{n−2,n−1})
. . .
s*_2 = min(s*_3, R_{2,3})
s*_1 = min(s*_2, R_{1,2})
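A minimal Python sketch of this recursion (ours, not the authors' R code; it assumes all curves are pre-evaluated on a shared grid of t_i values and indexed from stochastically smallest to largest):

```python
# curves[0..k-1]: survival values on a shared grid, ordered from the
# stochastically smallest population to the largest; sizes: sample sizes
# used as benchmark weights.
def order_estimators(curves, sizes):
    k = len(curves)

    def R(i, j):  # benchmark of curves i and j: size-weighted average
        w = sizes[i] + sizes[j]
        return [(sizes[i] * a + sizes[j] * b) / w
                for a, b in zip(curves[i], curves[j])]

    est = [None] * k
    # largest curve: pushed up to at least its benchmark with the next curve
    est[k - 1] = [max(a, b) for a, b in zip(curves[k - 1], R(k - 2, k - 1))]
    # remaining curves, top down: each capped by the estimator above it
    for i in range(k - 2, -1, -1):
        est[i] = [min(a, b) for a, b in zip(est[i + 1], R(i, i + 1))]
    return est
```

Because each estimator is a pointwise minimum against the one above it, the output curves are stochastically ordered by construction.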

The optimal estimation method was found by comparing the mean squared error and bias of the estimated survival function against the true survival function. Many methods were tested; the one proven best started from the stochastically largest curve and moved downward, and the second best applied the same idea from smallest to largest. A drawback of these methods is that as the number of distributions increases, the MSE and bias worsen. To ensure that the estimators are stochastically ordered, they need to depend on each other: there is no way to estimate the next largest distribution without making sure it is smaller than the one above it. With this in mind, many tests were conducted to determine which function should be paired with the previous estimator when taking the minimum. Taking the benchmark function indexed by the current s* estimator and the one above it proved to be the most efficient.


These two graphs present the mean squared error and bias for 4 populations. The maximum MSE is approximately 0.0022 and the maximum bias is approximately 0.03.

2.2. Censored Case.

Censored data were also tested with the previous method. As mentioned before, P̃_{mn} and Q̃_{mn} follow the stochastic ordering Q̃_{mn} ≤ R_{m+n} ≤ P̃_{mn}. Following Kaplan and Meier (1958), when there is no censoring the Kaplan-Meier estimator can be written as

Ŝ(t) = ∏_{t_i < t} (n_i − d_i)/n_i,

where n_i denotes the number at risk (the survivors) and d_i the number of deaths at time t_i. With censored data, the Kaplan-Meier estimator becomes

Ŝ(t) = ∏_{t_(i) ≤ t} (1 − 1/(n − i + 1))^{δ_i},

where the t_(i) are the ordered observation times. By using P̃_{mn} and Q̃_{mn}, it is possible to modify the Kaplan-Meier estimator. The indicator δ flags whether or not an observation is censored. Suppose there are two random samples X_m and Y_n: when an observation of X_m is smaller than the corresponding value of Y_n, δ equals 1; otherwise δ is 0.
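The product-limit formula above can be illustrated with a small hand-rolled version (the paper itself fits the Kaplan-Meier estimator with R's "survival" package; this Python sketch only demonstrates the formula):

```python
# Hand-rolled product-limit (Kaplan-Meier) sketch.
def kaplan_meier(times, events):
    """times: observed times; events: 1 = death, 0 = censored.
    Returns (death times, survival values) as parallel lists."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    out_times, out_surv = [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_t = [e for tt, e in data if tt == t]  # all observations tied at t
        deaths = sum(at_t)
        if deaths > 0:
            surv *= (n_at_risk - deaths) / n_at_risk
            out_times.append(t)
            out_surv.append(surv)
        n_at_risk -= len(at_t)   # censored observations leave the risk set
        i += len(at_t)
    return out_times, out_surv
```

Censored observations contribute no factor to the product but still shrink the risk set, which is exactly what the δ exponent expresses.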

Creating a method for censored data was very similar to the method for uncensored data, with some additions needed to accommodate censoring and the Kaplan-Meier estimator. The initial steps were the same: a true exponential survival curve was constructed and inverted to find the x values at which to compare the true and random curves. The next step was to create randomly censored data in each of the 10,000 simulations. This was done by generating random exponential samples X_1, . . . , X_k at the same λ values as the true exponentials, together with another random exponential A to compare against X_1, . . . , X_k. For each value in X, if it is less than the corresponding value of A, it counts as a death and is given a 1; otherwise it is given a 0 and is censored. These indicators form the set δ, and the minimum of X and A is taken to create X*, so that censored observations come from the random exponential A and uncensored observations come from X. The X* values are then put into a data frame with δ, because the Kaplan-Meier estimator in R requires the survival values and the censoring indicators in one data set. Using the "survival" package in R, the Kaplan-Meier survival function is estimated from X* and δ. It is then necessary to extract the survival values at each specific time in each of the 10,000 Kaplan-Meier fits. In the uncensored case it was possible to evaluate the empirical cumulative distribution function at a specific t_i, but this is not directly possible with the Kaplan-Meier function in the survival library. Since the survival values and times were extracted from the Kaplan-Meier fit, a method was created to place the t_i values on the survival curve: find the survival times that bracket each t_i and assign that t_i the corresponding Kaplan-Meier survival value. This evaluates the Kaplan-Meier survival curve at each t_i. From all this, there is an estimated survival curve and a true curve to compare by taking the mean squared error and bias, just as in the uncensored case. In each of the 10,000 iterations, the estimated curve is subtracted from the true curve to give the bias contribution, and that difference is squared for the mean squared error; both are then divided by 10,000 to obtain averages.
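One simulation step of the scheme above can be sketched as follows (the variable names and the censoring rate lam_c are our assumptions; the paper's own implementation is the R code in the appendix):

```python
import random

# Lifetimes X ~ Exp(lam) are censored by an independent A ~ Exp(lam_c):
# we observe X* = min(X, A) and delta = 1 when the death is actually seen
# (X <= A), 0 when the observation is censored.
def simulate_censored(lam, lam_c, n, rng):
    X = [rng.expovariate(lam) for _ in range(n)]
    A = [rng.expovariate(lam_c) for _ in range(n)]
    observed = [min(x, a) for x, a in zip(X, A)]
    delta = [1 if x <= a else 0 for x, a in zip(X, A)]
    return observed, delta

def mse_and_bias(true_curve, est_curve):
    """Average squared and signed differences between the true and the
    estimated survival curve on a common grid."""
    diffs = [t - e for t, e in zip(true_curve, est_curve)]
    mse = sum(d * d for d in diffs) / len(diffs)
    bias = sum(diffs) / len(diffs)
    return mse, bias
```

With equal rates lam = lam_c, roughly half of the observations are censored, which matches the construction described in the text.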


Generally, the mean squared errors and biases for censored data tend to be higher than those for uncensored data. The mean squared errors for censored data range from 0 to 0.01 and the biases from −0.05 to 0.035.

3. Example

3.1. El Barmi and Mukerjee (2005).

We used data set 2 from "The Statistical Analysis of Failure Time Data" (Kalbfleisch and Prentice) to show that our method works accurately compared with El Barmi and Mukerjee's (2005) results on the identical data set. Data set 2 has 4 levels of N stage, from 0 to 3; the closer the stage is to 3, the greater the number of nodes involved for a patient. A status of 0 means the observation is censored, while a status of 1 means death. The number of survival days ranges from 11 to 1823. The graph below is the Kaplan-Meier estimator based on data set 2.

Denote the four populations by n0, n1, n2, and n3. We created two benchmark functions applying the Test 15 algorithm.

R_{3,4} = (length(n3)·n3 + length(n2)·n2) / (length(n3) + length(n2))
R_{2,3} = (length(n1)·n1 + length(n2)·n2) / (length(n1) + length(n2))

Then we applied those benchmark functions to obtain s*_1, s*_2, s*_3, and s*_4 following the Test 15 algorithm.

s*_4 = max(s_4, R_{3,4})
s*_3 = min(s*_4, R_{3,4})
s*_2 = min(s*_3, R_{2,3})
s*_1 = min(s*_2, s_1)

The graph we created looks similar to the graph of El Barmi and Mukerjee (2005). The difference is that the lowest curve comes from population 3 in El Barmi and Mukerjee's graph, while it comes from population 0 in our graph.

For this comparison, we changed our algorithm and added an extra benchmark function:

R_{1,2} = (length(n1)·n1 + length(n0)·n0) / (length(n1) + length(n0))


s*_1 = min(R_{1,2}, s_0)
s*_2 = max(s*_1, R_{1,2})
s*_3 = max(s*_2, R_{2,3})
s*_4 = max(s*_3, R_{3,4})

Although the individual populations are not clearly distinguishable, the two graphs look almost identical.

3.2. Umbrella ordering.

Umbrella ordering is notable in statistics because it is common for the response variable to increase with the treatment level up to some unknown point and then decrease with further increases in the treatment level. In short, it is an ordering with an up-then-down response pattern (Mack and Wolfe, 1981). According to Pan (1997), there exists one and only one set of critical points c_1, c_2, c_3, . . . , c_n for an umbrella ordering with a unique peak.

We assumed s*_1 ≤ s*_2 ≤ s*_3 ≥ s*_4 ≥ s*_5 and modified s*_1, s*_2, . . . , s*_5 to satisfy the umbrella ordering.

s*_5 = min(s_5, R_{4,5})
s*_4 = min(s*_5, R_{4,5})
s*_3 = max(s*_4, R_{3,4})
s*_2 = max(s*_3, R_{1,2})
s*_1 = min(s*_2, s_1)
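A generic way of enforcing such a constraint can be sketched as a pointwise projection for a known peak index (our own simplification, not the paper's exact recursion):

```python
# Enforce an umbrella order s_1 <= ... <= s_peak >= ... >= s_k at every grid
# point by capping each curve with its neighbor, moving outward from the peak.
def umbrella_project(curves, peak):
    est = [list(c) for c in curves]
    # left arm: nondecreasing up to the peak
    for i in range(peak - 1, -1, -1):
        est[i] = [min(a, b) for a, b in zip(est[i], est[i + 1])]
    # right arm: nonincreasing after the peak
    for i in range(peak + 1, len(curves)):
        est[i] = [min(a, b) for a, b in zip(est[i], est[i - 1])]
    return est
```

The peak curve is left untouched and every other curve can only be pushed down, so the up-then-down pattern holds pointwise after projection.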

After adjusting the algorithm for umbrella ordering, it again shows low values of mean squared error and bias, with a pattern similar to the uncensored case. In detail, population 1 has the largest mean squared error and bias while population 4 has the smallest. In other words, this algorithm works for umbrella ordering as well.

4. Summary

Combining methods from previous researchers guided us in creating our own. The benchmark function R_{m+n}(x) = (n/(m+n)) Q_n(x) + (m/(m+n)) P_m(x) of Rojo (2004) and the minimax estimators of Lo (1987) made it possible to find an optimal method for estimating stochastically ordered curves. The optimal stochastic ordering algorithm was:

s*_n = max(s_n, R_{n−1,n})
s*_{n−1} = min(s*_n, R_{n−1,n})
s*_{n−2} = min(s*_{n−1}, R_{n−2,n−1})
. . .
s*_2 = min(s*_3, R_{2,3})
s*_1 = min(s*_2, R_{1,2}).

To show that the algorithm is sufficiently accurate, we measured mean squared error and bias for both uncensored and censored data with different populations. The mean squared errors for uncensored data with 4 populations and λ = 1, 1.2, 1.4, 1.6 ranged from 0 to 0.003, with biases ranging from −0.035 to 0.015. We then applied our algorithm to previous work on the k-sample case: El Barmi and Mukerjee's (2005) analysis of data from Kalbfleisch and Prentice (1980), and umbrella ordering.

There were some difficulties in comparing our algorithm with the earlier work of El Barmi and Mukerjee. When inputting the data, it was clear that some values had been entered incorrectly: observations counted as deaths were in reality censored. The reason may have been to make more lines cross on the graph and better illustrate their ordering method, but it is unhelpful when using the data for comparison. Another problem was that El Barmi and Mukerjee did not state an algorithm in their paper to test against ours, so it is very difficult to know which is better. They also did not include values or graphs showing the mean squared error and bias of their estimation method. The only way we could compare our method with theirs was by inspecting graphs of the estimators and judging which looked more accurate. On an initial comparison, their method looked more accurate; after much testing, this turned out to be because our method solved for the stochastically largest estimator first, whereas theirs started from the stochastically smallest. After altering our method to start from the smallest, our graph looked very similar to theirs and actually showed more variability in the lines. The biggest problem we faced was the inability to compare our method with others by measurements rather than by visual characteristics. Even so, our method appears very close to El Barmi and Mukerjee's. To keep the estimators stochastically ordered, each estimator must be used in computing the next. This decreases the accuracy of estimation, but it was the best approach when used with the benchmark function, which keeps all the estimators stochastically ordered. Last but not least, we applied our algorithm to umbrella ordering, which has an up-then-down pattern, by changing where the minimum and maximum estimators are taken. The resulting mean squared error and bias show a pattern similar to the uncensored case: population 1 has the largest MSE and bias while population 4 has the smallest.

5. Appendix

5.1. Uncensored R code.


5.2. Censored R code.


5.3. El Barmi and Mukerjee R code.

5.4. Umbrella ordering R code.

References

[1] Dykstra, R. L. (1982). Maximum likelihood estimation of the survival functions of two stochastically ordered random variables. Journal of the American Statistical Association, 77, 621-628.

[2] El Barmi, H. and Mukerjee, H. (2005). Inferences under a stochastic ordering constraint: the k-sample case. Journal of the American Statistical Association, 100(469), 252-261.

[3] Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.

[4] Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457-481.

[5] Lehmann, E. L. (1955). Ordered families of distributions. The Annals of Mathematical Statistics, 26(3), 399-419. doi:10.1214/aoms/1177728487.

[6] Lo, S.-H. (1987). Estimation of distribution functions under order restrictions. Statistics and Decisions, 5, 251-262.

[7] Mack, G. A. and Wolfe, D. A. (1981). K-sample rank tests for umbrella alternatives. Journal of the American Statistical Association, 76, 175-181.

[8] Pan, G. (1997). Confidence subset containing the unknown peaks of an umbrella ordering. Journal of the American Statistical Association, 92(437), 307-314.

[9] Rojo, J. and Ma, Z. (1996). On the estimation of stochastically ordered survival functions. Journal of Statistical Computation and Simulation, 55, 1-21.

[10] Rojo, J. (2004). On the estimation of survival functions under a stochastic order constraint. The First Erich L. Lehmann Symposium: Optimality, 37-61. Institute of Mathematical Statistics, Beachwood, OH.

[11] Shaked, M. and Shanthikumar, J. G. (1997). Supermodular stochastic orders and positive dependence of random vectors. Journal of Multivariate Analysis, 61, 86-101.