AN ESTIMATION OF K SAMPLE SURVIVAL FUNCTIONS UNDER STOCHASTIC ORDERING CONSTRAINT
DANIEL HEALY, AMY KO
1. Abstract
Stochastic ordering plays an important part in statistics. It was introduced by Lehmann (1955), who defined a random variable X with distribution function F to be stochastically larger than a random variable Y with distribution function G if F(x) ≤ G(x) for all x. In terms of the survival functions P = 1 − F and Q = 1 − G, this is equivalent to P ≥ Q. Stochastic ordering for two populations has already been studied by many statisticians. Lo (1987) suggested the estimators F̂m = min(Fm, Gn) and Ĝn = max(Fm, Gn), and showed that, under suitable conditions, these estimators are asymptotically minimax for a large class of loss functions and strongly uniformly consistent when m and n both tend to infinity. Lo's estimators satisfy the stochastic ordering between the two populations, but they fail when only one of m and n goes to infinity. To solve this problem, Rojo (2004) proposed estimators F*m (G*n) that are strongly uniformly consistent for F (G) when m (n) alone goes to infinity. In this research, we examine methods for estimating the survival functions of k samples under a stochastic ordering constraint and compare their precision by mean squared error and bias. Censored data are also considered. Finally, we apply our algorithm to actual data from Kalbfleisch and Prentice (1980) and to umbrella ordering.
2. Literature review
The definition of stochastic ordering between two populations was first introduced by Lehmann (1955). Rojo (2004) then proposed the idea of a benchmark function,

Rm+n(x) = n/(m + n) · Qn(x) + m/(m + n) · Pm(x),  (1)

so that the benchmark function is the empirical survival function of the two combined samples. As a result, the estimators become
P̃mn(x) = max(Pm(x), Rm+n(x))  (2)

and

Q̃mn(x) = min(Qn(x), Rm+n(x))  (3)

satisfying the stochastic order constraint. Since min(Pm, Qn) ≤ Rm+n ≤ max(Pm, Qn), it follows that P̃mn ≤ P̂mn and Q̃mn ≥ Q̂mn.
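The paper's own implementation is in R (see the appendix); as a language-neutral illustration, the two-sample estimators of Eqs. (1)-(3) can be sketched in Python as follows. The function names are ours, and both empirical survival functions are evaluated on a common grid of points, an assumption made explicit here.

```python
import numpy as np

def empirical_survival(sample, grid):
    """Empirical survival function: fraction of the sample exceeding each grid point."""
    sample = np.asarray(sample)
    return np.array([(sample > t).mean() for t in grid])

def order_restricted_pair(x, y, grid):
    """Two-sample estimators of Eqs. (1)-(3): the benchmark R is the
    survival function of the pooled sample, and the order-restricted
    estimators push P_m up / Q_n down until P >= Q holds everywhere."""
    m, n = len(x), len(y)
    P_m = empirical_survival(x, grid)
    Q_n = empirical_survival(y, grid)
    R = (m * P_m + n * Q_n) / (m + n)   # Eq. (1), pooled survival
    P_tilde = np.maximum(P_m, R)        # Eq. (2)
    Q_tilde = np.minimum(Q_n, R)        # Eq. (3)
    return P_tilde, Q_tilde, R
```

By construction P̃ ≥ R ≥ Q̃ at every grid point, so the stochastic ordering constraint holds even when the raw empirical curves cross.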
We applied the idea of the benchmark functions in order to derive an algorithm that minimizes mean squared error and bias for both uncensored and censored data.
2.1. Uncensored Data.
Let X1, ..., Xk be random samples, possibly of different sizes, from k distributions. These distributions must be stochastically ordered:

X1 ≤st X2 ≤st · · · ≤st Xk

From each sample, the empirical cumulative distribution function F1, ..., Fk is computed. Each empirical distribution function is evaluated at a common set of points ti, so that the differing sample sizes do not affect the comparison and every empirical function has the same length. The ti values are obtained from a true exponential survival function at a specific λ value: by inverting this true survival function, we find x values, the ti, that are spread across the entire survival curve. The empirical survival functions are then obtained by subtracting the empirical distribution functions from 1:

S1 = 1 − F1, S2 = 1 − F2, . . . , Sk = 1 − Fk
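As a small illustration of how such a grid can be built (a Python sketch with names of our own choosing; the paper's simulations are in R), one can pick equally spaced survival levels and invert S(t) = exp(−λt):

```python
import numpy as np

def exponential_grid(lam, n_points):
    """Evaluation points t_i spread across the true survival curve
    S(t) = exp(-lam * t): choose equally spaced survival levels p
    and invert, t = -log(p) / lam."""
    # endpoints 0 and 1 are excluded: the inverse there is infinite or zero
    p = np.linspace(0.99, 0.01, n_points)
    return -np.log(p) / lam
```

The grid is dense where the survival curve changes quickly and sparse in the tail, which is the point of inverting the curve rather than spacing the ti uniformly in time.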
Using the asymptotic minimax estimators of Lo (1987) and the benchmark estimating function of Rojo (1996), which averages two curves, estimation can be extended from two distributions to k distributions. The benchmark function uses the sample sizes as weights to form the weighted average of two curves. When estimating multiple distributions, it is very difficult to maintain stochastic order when random samples are used, as noted by Shaked and Shanthikumar (1997). Because the survival functions of exponential data take the shape of exponential curves, the estimators can be tested by comparing a true exponential curve to the empirical survival function of randomly generated exponential data at the same λ value. The real-world motivation is that actual data never follow a parametric curve exactly: when using survival analysis to find the probability of surviving to a certain age, the data will not be perfect and will not trace a true exponential curve. This is the point of using a randomly generated survival curve. By comparing randomly generated data to the true survival curve it came from, it is possible to test the effectiveness of an estimator. Although for true exponential survival functions it holds that as λ gets bigger the survival values get smaller, this need not be the case for randomly generated values. For example, when comparing two exponential samples with λ = 1.1 and λ = 1.2, the true survival functions are two curves very close to one another, with the λ = 1.1 curve larger at every x value; the empirical curves, however, may cross.
The figure compares a true exponential survival function with λ = 1, represented by dots, with the empirical survival function of random exponential data generated with λ = 1. Clearly, the random curve is never exactly on the true curve. This creates a problem when comparing multiple distributions: a randomly generated survival curve could be greater even when its λ value is greater, which would never happen with the true survival curves.
This creates a problem for stochastic ordering, since the estimated survival curves must be ordered as X1 ≤st X2 ≤st · · · ≤st Xk. To fix this, it is necessary to create estimators that follow stochastic ordering even though the randomly generated curves do not. This models real-world data: gathered data will not perfectly fit an exponential curve, and, for instance, stage 4 cancer may show a higher observed survival rate than stage 3 at certain times. To enforce the restriction, we created an estimation scheme that sets the largest estimator, s*n, to the maximum of the stochastically largest empirical curve and the benchmark function of the two largest curves. This ensures that the estimate at each point lies either on the empirical survival function or on the weighted-average line between the largest curve and the next one below it. After the largest curve is handled, the next curve, s*n−1, is estimated by taking the minimum of the estimator above it and the same benchmark function. For an arbitrary number of survival functions, from the third largest estimator s*n−2 down to the lowest s*1, the method follows a fixed pattern: each estimator is the minimum of the estimator one level above it and the benchmark of that level with the level above it.
Algorithm:

s*n = max(sn, Rn−1,n)
s*n−1 = min(s*n, Rn−1,n)
s*n−2 = min(s*n−1, Rn−2,n−1)
...
s*2 = min(s*3, R2,3)
s*1 = min(s*2, R1,2)
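The steps above can be sketched as follows (a Python illustration; the paper's own code is in R, and the function and variable names here are ours). Each survival curve is assumed to be evaluated on a common grid, and each pairwise benchmark R is the size-weighted average of two curves:

```python
import numpy as np

def order_restricted_estimators(surv, sizes):
    """Apply the k-sample algorithm: surv[0], ..., surv[k-1] are empirical
    survival curves on a common grid, ordered so surv[k-1] is the
    stochastically largest; sizes are the sample sizes (benchmark weights)."""
    def benchmark(i, j):
        # pairwise benchmark: size-weighted average of curves i and j
        return (sizes[i] * surv[i] + sizes[j] * surv[j]) / (sizes[i] + sizes[j])

    k = len(surv)
    est = [None] * k
    # largest curve: max of its empirical curve and the top benchmark
    est[k - 1] = np.maximum(surv[k - 1], benchmark(k - 2, k - 1))
    # each lower curve: min of the estimator above it and the benchmark
    # of that level with the level above
    for i in range(k - 2, -1, -1):
        est[i] = np.minimum(est[i + 1], benchmark(i, i + 1))
    return est
```

Because each lower estimator is a pointwise minimum involving the estimator above it, the output curves are stochastically ordered by construction, even when the raw empirical curves cross.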
The optimal estimation method was found by comparing the mean squared error and bias of the estimated survival function against the true survival function. Many methods were tested, and the best proved to be starting from the stochastically largest curve and moving downward; the second best used the same scheme but moved from the least to the greatest. A drawback of these methods is that as the number of distributions increases, the MSE and bias worsen. To ensure that the estimators are stochastically ordered, they need to depend on each other: there is no way to estimate the next largest distribution without making sure it is smaller than the one above it. With this in mind, many tests were conducted to determine which curve should be paired with the previous estimator when taking the minimum of the two. Taking the benchmark function indexed by the level of the s* estimator and the level above it proved the most efficient.
These two graphs present the mean squared error and bias for 4 populations. The maximum MSE is approximately 0.0022 and the maximum bias is approximately 0.03.
2.2. Censored Case.
Censored data were also tested with the previous method. As mentioned before, P̃mn and Q̃mn follow the stochastic ordering Q̃mn ≤ Rm+n ≤ P̃mn. Following Kaplan and Meier (1958), the Kaplan-Meier estimator of the survival function when there is no censoring can be written as

Ŝ(t) = ∏_{ti < t} (ni − di) / ni,

where ni is the number at risk and di the number of deaths at time ti. With censored data, the Kaplan-Meier estimator can be written as

Ŝ(t) = ∏_{Li,j ≤ t} (1 − 1/(ni − i + 1))^δi,j.

Using P̃mn and Q̃mn, the Kaplan-Meier estimator can be modified. Here δ indicates whether or not an observation is censored. Suppose there are two random samples Xm and Yn: when an observation from Xm is smaller than the corresponding value of Yn, δ equals 1; otherwise δ is 0.
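A minimal Python sketch of the Kaplan-Meier product may make the formula concrete (the paper itself uses the R survival package; this hand-rolled version, with names of our own, is only illustrative). It processes observations in time order and multiplies in a factor (at_risk − 1)/at_risk at each death:

```python
import numpy as np

def kaplan_meier(times, delta):
    """Kaplan-Meier survival estimates at the sorted observation times.
    times: observed times (min of event time and censoring time);
    delta: 1 for a death, 0 for a censored observation."""
    order = np.argsort(times)
    times = np.asarray(times, dtype=float)[order]
    delta = np.asarray(delta)[order]
    at_risk = len(times)
    surv, s = [], 1.0
    for d in delta:
        if d == 1:
            s *= (at_risk - 1) / at_risk   # one death among those at risk
        at_risk -= 1                        # this subject leaves the risk set
        surv.append(s)
    return times, np.array(surv)
```

With no censoring (all δ = 1) this reduces to the empirical survival function; a censored observation shrinks the risk set without contributing a death factor.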
Creating a method for censored data was very similar to the method for uncensored data, with some additions needed to accommodate censoring and the Kaplan-Meier estimator. The initial steps were the same as in the uncensored method: a true exponential survival curve was created, and its inverse was used to find the x values at which to compare the true and random curves. The next step was to create randomly censored data in each of the 10,000 simulations. This was done by generating random exponential samples X1, . . . , Xk at the same λ values as the true exponentials, as well as another random exponential sample A to compare against them. For the values in each X, if a value is less than the corresponding value of A, it counts as a death and is given a 1; otherwise it is given a 0 and is censored. These indicators form the set δ, and the elementwise minimum of X and A gives X*, so that censored observations come from the random exponential A and uncensored observations come from X. The X* values are then put into a data frame together with δ, because the Kaplan-Meier routines in R require the survival times and the censoring indicator in one data set. Using the "survival" package in R, the Kaplan-Meier survival function is estimated from X* and δ. It is then necessary to extract the survival values at each specific time in each of the 10,000 Kaplan-Meier fits. In the uncensored case it was possible to evaluate the empirical cumulative distribution function directly at each ti, but this is not possible with the Kaplan-Meier object from the survival library. Since the survival values and times can be extracted from the fitted object, a method was created to locate each ti on the survival curve: find the event times bracketing each ti and assign that ti the corresponding Kaplan-Meier survival value. This evaluates the Kaplan-Meier survival curve at each ti. From all this, there is an estimated survival curve and a true curve to compare via mean squared error and bias, just as in the uncensored case. In each of the 10,000 iterations, the estimated curve is subtracted from the true curve to obtain the bias contribution, and that difference is squared for the mean squared error; both are then divided by 10,000 to obtain the averages.
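The grid-evaluation and error-accumulation steps can be sketched as follows (a Python illustration with hypothetical names; the paper's simulations are in R). `step_eval` assigns each ti the value of the last Kaplan-Meier step at or before it, and `mse_and_bias` averages true-minus-estimated differences over the simulated curves:

```python
import numpy as np

def step_eval(km_times, km_surv, t_grid):
    """Evaluate a right-continuous Kaplan-Meier step function at arbitrary
    grid points: each t gets the survival value at the last event time
    <= t, and 1.0 before the first event."""
    idx = np.searchsorted(km_times, t_grid, side="right") - 1
    surv = np.asarray(km_surv)
    return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, None)])

def mse_and_bias(estimates, truth):
    """Monte Carlo MSE and bias on the grid: estimates has one row per
    simulation; differences are taken as true minus estimated."""
    diffs = truth - np.asarray(estimates)
    return (diffs ** 2).mean(axis=0), diffs.mean(axis=0)
```

Evaluating at fixed ti in every simulation keeps all 10,000 estimated curves on the same grid, so the squared and signed differences can simply be averaged columnwise.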
Generally, the mean squared errors and biases for censored data tend to be higher than those for uncensored data. The mean squared errors for censored data range from 0 to 0.01 and the biases range from -0.05 to 0.035.
3. Example
3.1. Elbarmi and Mukerjee (2005).
We used data set 2 from "The Statistical Analysis of Failure Time Data" (Kalbfleisch and Prentice, 1980) to show that our method works accurately compared with Elbarmi and Mukerjee (2005)'s results on the identical data set. Data set 2 has 4 levels of N stage, from 0 to 3; a stage closer to 3 indicates greater nodal involvement for a patient. A status of 0 means the observation is censored, while a status of 1 means death. The number of survival days ranges from 11 to 1823. The graph below is the Kaplan-Meier estimator graph based on data set 2.
Suppose the four populations are n0, n1, n2 and n3. We created two benchmark functions following the Test 15 algorithm:

R3,4 = (length(n3) · n3 + length(n2) · n2) / (length(n3) + length(n2))
R2,3 = (length(n1) · n1 + length(n2) · n2) / (length(n1) + length(n2))

Then we applied those benchmark functions to set s*1, s*2, s*3 and s*4 following the Test 15 algorithm:

s*4 = max(s4, R3,4)
s*3 = min(s*4, R3,4)
s*2 = min(s*3, R2,3)
s*1 = min(s*2, s1)
The graph we created looks similar to the graph made by Elbarmi and Mukerjee (2005). The difference is that the lowest curve corresponds to population 3 in Elbarmi and Mukerjee's graph, while it corresponds to population 0 in ours.
For this comparison, we changed our algorithm and added an extra benchmark function:

R1,2 = (length(n1) · n1 + length(n0) · n0) / (length(n1) + length(n0))
s*1 = min(R1,2, s0)
s*2 = max(s*1, R1,2)
s*3 = max(s*2, R2,3)
s*4 = max(s*3, R3,4)

Although the populations are not clearly labeled, the two graphs look almost identical.
3.2. Umbrella ordering.
Umbrella ordering is one of the most notable orderings in statistics, because it is common for a response variable to increase with the treatment level up to a certain unknown point and then decrease with further increases in the treatment level. In short, it is an ordering with an up-then-down response pattern (Mack and Wolfe, 1981). According to Pan (1997), there exists one and only one set of critical points c1, c2, c3, . . . , cn for an umbrella ordering with a unique peak.
We assumed s*1 ≤ s*2 ≤ s*3 ≥ s*4 ≥ s*5 and modified the updates for s*1, s*2, . . . , s*5 to satisfy the umbrella ordering.
s*5 = min(s5, R4,5)
s*4 = min(s*5, R4,5)
s*3 = max(s*4, R3,4)
s*2 = max(s*3, R1,2)
s*1 = min(s*2, s1)
After adjusting the algorithm to umbrella ordering, it also shows low values of mean squared error and bias. The MSE and bias show a pattern similar to the uncensored case: population 1 has the largest mean squared error and bias, while population 4 has the smallest. In other words, the algorithm works for umbrella ordering as well.
4. Summary
Combining methods from previous researchers guided us in creating our method. The benchmark function Rm+n(x) = n/(m + n) · Qn + m/(m + n) · Pm proposed by Rojo (2004) and the minimax estimators of Lo (1987) made it possible to find an optimal method of estimating stochastically ordered curves. The optimal stochastic ordering algorithm was:

s*n = max(sn, Rn−1,n)
s*n−1 = min(s*n, Rn−1,n)
s*n−2 = min(s*n−1, Rn−2,n−1)
...
s*2 = min(s*3, R2,3)
s*1 = min(s*2, R1,2).
To show that the algorithm is sufficiently accurate, we measured mean squared error and bias for both uncensored and censored data with different numbers of populations. For uncensored data with 4 populations and λ = 1, 1.2, 1.4, 1.6, the mean squared errors ranged from 0 to 0.003 and the biases from -0.035 to 0.015. We then applied our algorithm to previous work on the k-sample case: Elbarmi and Mukerjee (2005)'s analysis of the data from Kalbfleisch and Prentice (1980), and umbrella ordering.
There were some difficulties in comparing our algorithm to the previous work of Elbarmi and Mukerjee. When inputting the data, it was clear that some values had been entered incorrectly, counted as deaths when in reality they were censored. The reason may have been to make more lines cross on the graph and better illustrate their ordering method, but it is not helpful when using the data for comparison. Another problem was that Elbarmi and Mukerjee did not state an algorithm in their paper to test against ours, so it is very difficult to know which one is better. They also did not include values or graphs showing the mean squared error and bias of their estimation method. The only way we could compare our method to theirs was by judging which estimator graphs look more accurate. When initially comparing our method to Elbarmi and Mukerjee's, their method looked more accurate. After much testing, this turned out to be because our method solved for the greatest stochastically ordered estimator first, whereas theirs started
from the least stochastically ordered estimator. After altering our method to start from the least stochastically ordered curve, our graph looked very similar to theirs and actually had more variability in the lines. The biggest problem we faced was the inability to compare our method to others with numerical measurements rather than visual characteristics. Even so, our method appears very close to Elbarmi and Mukerjee's. To keep the estimates stochastically ordered, it was necessary to use the previous estimator in computing the next one. This decreases the accuracy of estimation, but it was the best method when combined with the benchmark function that keeps all the estimators stochastically ordered. Last but not least, we applied our algorithm to umbrella ordering, which has an up-then-down pattern, by changing where the minimum and maximum are taken so as to satisfy the umbrella ordering. Consequently, the mean squared error and bias show a pattern similar to the uncensored case: population 1 has the largest MSE and bias, while population 4 has the smallest.
5. Appendix
5.1. Uncensored R code.
5.2. Censored R code.
5.3. Elbarmi and Mukerjee R code.
5.4. Umbrella ordering R code.
References
[1] Dykstra, R. L. (1982). Maximum likelihood estimation of the survival functions of two stochastically ordered random variables. Journal of the American Statistical Association, 77, 621-628.
[2] El Barmi, H. and Mukerjee, H. (2005). Inferences under a stochastic ordering constraint: the k-sample case. Journal of the American Statistical Association, 100(469), 252-261.
[3] Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.
[4] Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457-481.
[5] Lehmann, E. L. (1955). Ordered families of distributions. The Annals of Mathematical Statistics, 26(3), 399-419. doi:10.1214/aoms/1177728487.
[6] Lo, S.-H. (1987). Estimation of distribution functions under order restrictions. Statistics and Decisions, 5, 251-262.
[7] Mack, G. A. and Wolfe, D. A. (1981). K-sample rank tests for umbrella alternatives. Journal of the American Statistical Association, 76, 175-181.
[8] Pan, G. (1997). Confidence subset containing the unknown peaks of an umbrella ordering. Journal of the American Statistical Association, 92(437), 307-314.
[9] Rojo, J. and Ma, Z. (1996). On the estimation of stochastically ordered survival functions. Journal of Statistical Computation and Simulation, 55, 1-21.
[10] Rojo, J. (2004). On the estimation of survival functions under a stochastic order constraint. The First Erich L. Lehmann Symposium - Optimality, 37-61. Institute of Mathematical Statistics, Beachwood, OH.
[11] Shaked, M. and Shanthikumar, J. G. (1997). Supermodular stochastic orders and positive dependence of random vectors. Journal of Multivariate Analysis, 61, 86-101.