Post on 03-Jan-2016
Lecture 11: Hypothesis Testing III
Stratified Tests, Renyi and Other Tests
Stratified Tests
• Adjust for a covariate
• Allows you to control for a confounder without using a regression approach
• However
– Like regression, if interaction is present, it won’t be detected
– Assumes the ‘treatment’ effect is the same across strata
Sometimes Confusing
• “Stratified” analysis can mean either of two things
– Subgroup analysis
– Stratified “combined” test
• In this case, we mean the combined test
• Recall the Mantel-Haenszel odds ratio
Notation
• Now three variables
– Outcome (time to event)
– Group variable (e.g. treatment)
– Strata variable (e.g. gender, cancer grade)
• j = 1, 2, …, K indexes groups
• s = 1, 2, …, M indexes strata
Similar to the Standard Test
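• Formal hypothesis
$H_0: h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t), \quad s = 1, 2, \ldots, M; \; t \le \tau$
• Now, $Z_{j.}(\tau)$ is represented by a sum
$Z_{j.}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau), \qquad \hat{\sigma}_{jg.} = \sum_{s=1}^{M} \hat{\sigma}_{jgs}$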
From there, inference is the same
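• Chi-square test with K – 1 d.f., where $\hat{\Sigma}^{-1}$ is the inverse of the estimated variance-covariance matrix
$\chi^2 = (Z_{1.}(\tau), \ldots, Z_{K-1.}(\tau)) \, \hat{\Sigma}^{-1} \, (Z_{1.}(\tau), \ldots, Z_{K-1.}(\tau))' \sim \chi^2_{K-1}$
• For the 2 group scenario it can be reduced to a Z-score
$Z = \frac{\sum_{s=1}^{M} Z_{1s}(\tau)}{\sqrt{\sum_{s=1}^{M} \hat{\sigma}_{11s}}}$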
Asymptotics
• Just like the unstratified test, requires large N
• Here it requires an even larger N: think about dividing the sample into M strata
• In most cases, there probably is not sufficient N
Small Example
• 20 subjects received 1 of two treatments
– 9 patients received treatment 1
– 11 patients received treatment 2
• Patients also categorized by disease type
– 2 strata
• Question:
– Do the data show a treatment effect after adjusting for disease type?
Time  Death  Censor  Trt  Disease
1     1      0       1    1
5     1      0       1    1
5     0      1       1    1
6     0      1       1    2
8     1      0       1    1
37    1      0       1    2
49    1      0       1    1
58    1      0       1    2
79    0      1       1    2
11    0      1       2    1
50    1      0       2    1
51    1      0       2    2
62    1      0       2    1
67    1      0       2    1
73    1      0       2    1
86    1      0       2    1
90    1      0       2    2
96    1      0       2    2
97    1      0       2    2
97    1      0       2    2
What first
• Data in standard format
– Trt 1: 1, 5, 5+, 6+, 8, 37, 49, 58, 79+
– Trt 2: 11+, 50, 51, 62, 67, 73, 86, 90, 96, 97, 97
• We might first conduct a global test
– What is our hypothesis?
$H_0: h_{trt1}(t) = h_{trt2}(t) \quad vs. \quad H_A: h_{trt1}(t) \ne h_{trt2}(t)$
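Constructing Statistic
[Table, one row per event time. Columns: $t_i$; $d_i$; $Y_i$; $d_{i,trt1}$; $Y_{i,trt1}$; $d_{i,trt1} - Y_{i,trt1}\frac{d_i}{Y_i}$; $\frac{Y_{i,trt1}}{Y_i}\left(1 - \frac{Y_{i,trt1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$]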
Calculate Statistic
• Z-statistic
• χ² statistic
Now Let’s Adjust for Disease Type
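• Steps:
1. Divide the data according to strata
2. Calculate $Z_{js}(\tau)$ and $\hat{\sigma}_{jgs}$
$Z_{js}(\tau) = \sum_{i=1}^{D} W(t_i)\left[d_{ijs} - Y_{ijs}\frac{d_{is}}{Y_{is}}\right]$
$\hat{\sigma}_{jgs} = \sum_{i=1}^{D} W(t_i)^2 \, \frac{Y_{ijs}}{Y_{is}}\left(\delta_{jg} - \frac{Y_{igs}}{Y_{is}}\right)\left(\frac{Y_{is} - d_{is}}{Y_{is} - 1}\right) d_{is}$, where $\delta_{jg} = 1$ if $j = g$ and 0 otherwise
3. Sum $Z_{js}(\tau)$ and $\hat{\sigma}_{jgs}$ across strata to get $Z_{j.}(\tau)$ & $\hat{\sigma}_{jg.}$
4. Calculate your test statistic according to
$\chi^2 = (Z_{1.}(\tau), Z_{2.}(\tau), \ldots, Z_{K-1.}(\tau)) \, \hat{\Sigma}^{-1} \, (Z_{1.}(\tau), Z_{2.}(\tau), \ldots, Z_{K-1.}(\tau))'$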
Divide data By Strata
• Disease 1
Time  Death  Censor  Trt
1     1      0       1
5     1      0       1
5     0      1       1
8     1      0       1
49    1      0       1
11    0      1       2
50    1      0       2
62    1      0       2
67    1      0       2
73    1      0       2
86    1      0       2

• Disease 2
Time  Death  Censor  Trt
6     0      1       1
37    1      0       1
58    1      0       1
79    0      1       1
51    1      0       2
90    1      0       2
96    1      0       2
97    1      0       2
97    1      0       2
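Calculate $Z_{js}(\tau)$ and $\hat{\sigma}_{jgs}$: Disease 1
[Table, one row per event time in the Disease 1 stratum. Columns: $t_i$; $d_i$; $Y_i$; $d_{i,trt1}$; $Y_{i,trt1}$; $d_{i,trt1} - Y_{i,trt1}\frac{d_i}{Y_i}$; $\frac{Y_{i,trt1}}{Y_i}\left(1 - \frac{Y_{i,trt1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$]
Column sums give $Z_{Trt1,Dis1}$ and $\hat{\sigma}^2_{Trt1,Dis1}$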
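Calculate $Z_{js}(\tau)$ and $\hat{\sigma}_{jgs}$: Disease 2
[Table, one row per event time in the Disease 2 stratum. Columns: $t_i$; $d_i$; $Y_i$; $d_{i,trt1}$; $Y_{i,trt1}$; $d_{i,trt1} - Y_{i,trt1}\frac{d_i}{Y_i}$; $\frac{Y_{i,trt1}}{Y_i}\left(1 - \frac{Y_{i,trt1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$]
Column sums give $Z_{Trt1,Dis2}$ and $\hat{\sigma}^2_{Trt1,Dis2}$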
Calculate the Statistic
• Z (or chi-square)
• What is our conclusion
R Code
> library(survival)
> times<-c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
> trt<-c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
> strat<-c(1,1,1,2,1,1,2,1,1,2,2,1,1,1,2,1,2,2,2,2)
> death<-c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)
> st<-Surv(times, death)   # survival object used in the calls below
> #Global
> survdiff(st~trt)
Call:
survdiff(formula = st ~ trt)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  9        6     2.63     4.329       6.1
trt=2 11       10    13.37     0.851       6.1

 Chisq= 6.1  on 1 degrees of freedom, p= 0.0136
R Code
> #Stratified
> survdiff(st~trt + strata(strat))
Call:
survdiff(formula = st ~ trt + strata(strat))

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  9        6     2.27      6.16      9.46
trt=2 11       10    13.73      1.02      9.46

 Chisq= 9.5  on 1 degrees of freedom, p= 0.0021
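The survdiff results for the 20-subject example can be cross-checked by hand. The following is a minimal base R sketch, assuming log-rank weights W(t) = 1 and two groups; the helper name `logrank_parts` is hypothetical and is not part of the survival package. It computes Z and its variance for treatment 1, once on the pooled data and once within each disease stratum, then sums across strata.

```r
# Z and variance for group 1 in a two-group log-rank test (W(t) = 1 assumed).
# `logrank_parts` is an illustrative helper, not a survival-package function.
logrank_parts <- function(time, status, group) {
  event_times <- sort(unique(time[status == 1]))
  z <- 0; v <- 0
  for (ti in event_times) {
    Y  <- sum(time >= ti)                               # at risk, both groups
    Y1 <- sum(time >= ti & group == 1)                  # at risk, group 1
    d  <- sum(time == ti & status == 1)                 # deaths, both groups
    d1 <- sum(time == ti & status == 1 & group == 1)    # deaths, group 1
    z <- z + (d1 - Y1 * d / Y)
    # variance term is taken as 0 when only one subject remains at risk
    if (Y > 1) v <- v + (Y1 / Y) * (1 - Y1 / Y) * ((Y - d) / (Y - 1)) * d
  }
  c(Z = z, V = v)
}

times <- c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
trt   <- c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
strat <- c(1,1,1,2,1,1,2,1,1,2,2,1,1,1,2,1,2,2,2,2)
death <- c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)

# Global (unstratified) test
global <- logrank_parts(times, death, trt)
chisq_global <- unname(global["Z"]^2 / global["V"])

# Stratified test: compute Z and V within each stratum, then sum
per_stratum <- sapply(sort(unique(strat)), function(s) {
  keep <- strat == s
  logrank_parts(times[keep], death[keep], trt[keep])
})
Z_dot <- sum(per_stratum["Z", ])
V_dot <- sum(per_stratum["V", ])
chisq_strat <- Z_dot^2 / V_dot

round(c(global = chisq_global, stratified = chisq_strat), 2)
```

The two values should agree with the survdiff chi-square statistics shown above (about 6.1 and 9.46).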
BMT: Hodgkin’s & Non-Hodgkin’s Lymphoma
• Study included 43 BMT patients
• Is there a difference in hazard rates between
– Allogenic transplant = HLA matched sibling donor (N=16)
– Autogenic transplant = Own “cleaned” marrow (N=27)
• But want to adjust for disease state
– Non-Hodgkin’s lymphoma (N=23)
– Hodgkin’s disease (N=20)
Global Test
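Columns: $t_i$; $d_i$; $Y_i$; $d_{i,Allo}$; $Y_{i,Allo}$; Z term $= d_{i,Allo} - Y_{i,Allo}\frac{d_i}{Y_i}$; Var term $= \frac{Y_{i,Allo}}{Y_i}\left(1 - \frac{Y_{i,Allo}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$

t_i   d_i  Y_i  d_i,Allo  Y_i,Allo  Z term   Var term
2     1    43   1         16         0.628   0.234
4     1    42   1         15         0.643   0.230
28    1    41   1         14         0.659   0.225
30    1    40   0         13        -0.325   0.219
32    1    39   1         13         0.667   0.222
…
132   1    22   0         7         -0.318   0.217
140   1    21   0         7         -0.333   0.222
252   1    18   0         7         -0.389   0.238
357   1    16   1         7          0.563   0.246
Sum                                  0.886   5.841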
Global Results
• Global Test Results
> dat<-read.csv("C:\\BJW\\AutoAllo.csv")
> d<-dat$death; t<-dat$time
> dis<-dat$disease; type<-dat$graft
> nostrat<-survdiff(Surv(t, d)~type)
> nostrat
Call:
survdiff(formula = Surv(t, d) ~ type)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 16       10     9.11    0.0862     0.134
type=2 27       16    16.89    0.0465     0.134

 Chisq= 0.1  on 1 degrees of freedom, p= 0.714
Stratified by Disease Type
Non-Hodgkin’s Lymphoma subjects
Columns: $t_i$; $d_i$; $Y_i$; $d_{i,Allo}$; $Y_{i,Allo}$; Z term $= d_{i,Allo} - Y_{i,Allo}\frac{d_i}{Y_i}$; Var term $= \frac{Y_{i,Allo}}{Y_i}\left(1 - \frac{Y_{i,Allo}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$

t_i   d_i  Y_i  d_i,Allo  Y_i,Allo  Z term   Var term
28    1    23   1         11         0.522   0.250
32    1    22   1         10         0.545   0.248
42    1    21   0         9         -0.429   0.245
49    1    20   1         9          0.550   0.248
53    1    19   0         8         -0.421   0.244
57    1    18   0         8         -0.444   0.247
63    1    17   0         8         -0.471   0.249
81    2    16   0         8         -1.000   0.467
84    1    14   1         8          0.429   0.245
140   1    13   0         7         -0.538   0.249
252   1    11   0         7         -0.636   0.231
357   1    10   1         7          0.300   0.210
524   1    8    0         6         -0.750   0.188
Sum                                 -2.344   3.319
Stratified by Disease Type
Hodgkin’s Disease subjects
Columns: $t_i$; $d_i$; $Y_i$; $d_{i,Allo}$; $Y_{i,Allo}$; Z term $= d_{i,Allo} - Y_{i,Allo}\frac{d_i}{Y_i}$; Var term $= \frac{Y_{i,Allo}}{Y_i}\left(1 - \frac{Y_{i,Allo}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i$

t_i   d_i  Y_i  d_i,Allo  Y_i,Allo  Z term   Var term
2     1    20   1         5          0.750   0.188
4     1    19   1         4          0.789   0.166
30    1    18   0         3         -0.167   0.139
36    1    17   0         3         -0.176   0.145
41    1    16   0         3         -0.188   0.152
52    1    15   0         3         -0.200   0.160
62    1    14   0         3         -0.214   0.168
72    1    13   1         3          0.769   0.178
77    1    12   1         2          0.833   0.139
79    1    11   1         1          0.909   0.083
108   1    10   0         0          0.000   0.000
132   1    9    0         0          0.000   0.000
Sum                                  3.106   1.518
Stratified Results
• Stratified Test Results
> strat<-survdiff(Surv(t, d)~type + strata(dis))
> strat
Call:
survdiff(formula = Surv(t, d) ~ type + strata(dis))

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 16       10     9.24    0.0629      0.12
type=2 27       16    16.76    0.0347      0.12

 Chisq= 0.1  on 1 degrees of freedom, p= 0.729
Stratified Results
• Stratified Test Results
$\chi^2 = 0.120, \quad p = 0.729$
• Again we fail to reject
• This seems in error (recall that our survival curves looked VERY different)
Problem?
• The treatment effect is not the same in the 2 disease states
• They are in different directions (Z for the allogenic group within each disease stratum)
– ZAllo, NHL = -2.344
– ZAllo, Hodgkin’s = 3.106
• Stratified approach is NOT appropriate
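The cancellation can be seen directly from the column sums of the two stratum tables. The following is a small arithmetic sketch in R using only those reported sums; the variable names are illustrative.

```r
# Column sums reported in the two stratum tables (allogenic group)
Z_strata <- c(NHL = -2.344, Hodgkins = 3.106)
V_strata <- c(NHL = 3.319, Hodgkins = 1.518)

# Within-stratum chi-square statistics (1 d.f. each)
chisq_by_stratum <- Z_strata^2 / V_strata

# Stratified (combined) statistic: sum Z and variance across strata first
chisq_combined <- sum(Z_strata)^2 / sum(V_strata)
p_combined <- pchisq(chisq_combined, df = 1, lower.tail = FALSE)

round(chisq_by_stratum, 2)
round(c(chisq = chisq_combined, p = p_combined), 3)
```

Because the two Z values have opposite signs, they largely cancel in the sum, so the combined statistic is small even though each stratum on its own shows a noticeable difference.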
Alternative to Stratified Analysis
• Alternatives
– Define 4 groups and conduct a K-sample log rank test
• Allogenic and NHL
• Allogenic and Hodgkin’s
• Autogenic and NHL
• Autogenic and Hodgkin’s
– Subgroup analysis (by disease) should be performed
• Allo vs. Auto | Hodgkin’s
• Allo vs. Auto | Non-Hodgkin’s
R Code: K-sample test
> allgrp<-ifelse(dis==1 & type==1, 1, 0)
> allgrp<-ifelse(dis==1 & type==2, 2, allgrp)
> allgrp<-ifelse(dis==2 & type==1, 3, allgrp)
> allgrp<-ifelse(dis==2 & type==2, 4, allgrp)
> grp4<-survdiff(Surv(t, d)~allgrp)
> grp4
Call:
survdiff(formula = Surv(t, d) ~ allgrp)

          N Observed Expected (O-E)^2/E (O-E)^2/V
allgrp=1 11        5     7.67     0.927     1.350
allgrp=2 12        9     7.45     0.324     0.459
allgrp=3  5        5     1.45     8.721     9.567
allgrp=4 15        7     9.44     0.631     0.997

 Chisq= 11.1  on 3 degrees of freedom, p= 0.0113
R Code: Subgroup analysis
> ### Subgroup (NHL)
> subNHL<-survdiff(Surv(t,d)[which(dis==1)]~type[which(dis==1)])
> subNHL
Call:
survdiff(formula = Surv(t, d)[which(dis == 1)] ~ type[which(dis == 1)])

                         N Observed Expected (O-E)^2/E (O-E)^2/V
type[which(dis == 1)]=1 11        5     7.34     0.748      1.66
type[which(dis == 1)]=2 12        9     6.66     0.825      1.66

 Chisq= 1.7  on 1 degrees of freedom, p= 0.198
> ### Subgroup (Hodgkins)
> subHD<-survdiff(Surv(t,d)[which(dis==2)]~type[which(dis==2)])
> subHD
Call:
survdiff(formula = Surv(t, d)[which(dis == 2)] ~ type[which(dis == 2)])

                         N Observed Expected (O-E)^2/E (O-E)^2/V
type[which(dis == 2)]=1  5        5     1.89     5.095      6.36
type[which(dis == 2)]=2 15        7    10.11     0.955      6.36

 Chisq= 6.4  on 1 degrees of freedom, p= 0.0117
Summary: Stratified Testing
• Alternative to a regression approach to control for a 2nd covariate when examining treatment effect.
• Sample size needs to be larger than in the unstratified K-group test for the test results to be valid.
• One needs to be cautious about misinterpreting null results when interactions exist.
• We can use a subgroup approach if this fails.
Renyi Tests
• Previous tests we discussed all use weighted integral of estimated difference in cumulative hazard rates
• Doesn’t address situation where early differences favor one group, and later differences favor another group
• Solution: Renyi tests
– i.e. addresses the issue of crossing hazard rates
Renyi Test
• Censored-data analog of the Kolmogorov-Smirnov statistic for comparing two uncensored samples
• Recall KS is a test of equality of one-dimensional probability distributions used to compare two samples
Kolmogorov-Smirnov Test
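• Recall empirical distribution function
$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$
• Hypothesis
$H_0: F_{1,n}(x) = F_{2,n'}(x) \quad vs. \quad H_A: F_{1,n}(x) \ne F_{2,n'}(x)$
• The KS statistic is
$D_{n,n'} = \sup_x \left|F_{1,n}(x) - F_{2,n'}(x)\right|$
Reject $H_0$ if: $\sqrt{\frac{n n'}{n + n'}} \, D_{n,n'} > K_\alpha$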
Example of a KS test
• Two groups observed for a continuous outcome:
– 1: -0.2, 3.7, 4.3, 5.0, 7.7, 8.6– 2: -0.9, 0.4, 0.5, 2.6, 3.0, 12.1
• We want to determine whether the distributions of the outcomes differ (without assuming any distributional form…)
Constructing KS statistic
x      P(X1 ≤ x)  P(X2 ≤ x)  |P(X1 ≤ x) - P(X2 ≤ x)|
-0.9   0          1/6        1/6
-0.2   1/6        1/6        0
0.4    1/6        1/3        1/6
0.5    1/6        1/2        1/3
2.6    1/6        2/3        1/2
3.0    1/6        5/6        2/3
3.7    1/3        5/6        1/2
4.3    1/2        5/6        1/3
5.0    2/3        5/6        1/6
7.7    5/6        5/6        0
8.6    1          5/6        1/6
12.1   1          1          0
K-S Test
$D_{n,n'} = \frac{2}{3}$
$\sqrt{\frac{6 \cdot 6}{6 + 6}} \cdot \frac{2}{3} = 1.15$
$p \approx 0.142$
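The KS example can be reproduced in a few lines. The sketch below uses only base R: `ecdf` builds the two empirical distribution functions, and `ks.test` from the stats package is shown for comparison. Variable names are illustrative.

```r
x1 <- c(-0.2, 3.7, 4.3, 5.0, 7.7, 8.6)
x2 <- c(-0.9, 0.4, 0.5, 2.6, 3.0, 12.1)

# Evaluate both empirical CDFs at every observed value
grid <- sort(c(x1, x2))
F1 <- ecdf(x1)(grid)
F2 <- ecdf(x2)(grid)

# KS statistic: largest absolute gap between the two step functions
D <- max(abs(F1 - F2))

# Scaled statistic that is compared with the critical value
n <- length(x1); m <- length(x2)
scaled <- sqrt(n * m / (n + m)) * D

# Built-in two-sample test for comparison
ks <- ks.test(x1, x2)
c(D = D, scaled = scaled, p = ks$p.value)
```

The hand-computed D matches the maximum of the last column in the table (2/3), and the built-in test reports the same statistic.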
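Renyi Test
• Approach
– Find the value of $Z(t_i)$ for each failure time
• Note: this is different from $Z(\tau)$, which sums over all $t_i \le \tau$
– Calculate the series of $Z(t_i)$:
$Z(t_i) = \sum_{t_k \le t_i} W(t_k)\left[d_{k1} - Y_{k1}\frac{d_k}{Y_k}\right], \quad i = 1, 2, \ldots, D$
– Estimate the standard error of $Z(\tau)$ (all times)
$\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2 \, \frac{Y_{k1}}{Y_k}\,\frac{Y_{k2}}{Y_k}\left(\frac{Y_k - d_k}{Y_k - 1}\right) d_k$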
Renyi Statistic
• When hazard rates cross, the absolute value of $Z(t)$ will have its max value at some $t \le \tau$
• Hypothesis test:
$H_0: h_1(t) = h_2(t) \text{ for all } t \le \tau$
$H_A: h_1(t) \ne h_2(t) \text{ for some } t \le \tau$
• Note that multiple tests are made, because we are taking the max over $Z(t)$
Test Statistic Q
• Use the same variance estimate for the test statistic as in the standard two-sample approach
• Test statistic
$Q = \sup\{|Z(t)|, \; t \le \tau\} / \sigma(\tau)$
• Q is approximated by the distribution of sup{|B(x)|, 0 < x < 1}, where B is a standard Brownian motion process
• Use table C.5 to find the associated p-value
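When the table is not at hand, the tail probability of sup{|B(x)|, 0 ≤ x ≤ 1} can be evaluated from its standard series expansion. The following is a minimal R sketch; the function name `renyi_pvalue` and the truncation at 50 terms are illustrative choices, not something taken from the lecture.

```r
# P( sup |B(x)| > q ) for standard Brownian motion on [0, 1],
# using the series 1 - (4/pi) * sum_k (-1)^k / (2k+1) * exp(-pi^2 (2k+1)^2 / (8 q^2)).
# The series converges quickly, so truncating at 50 terms is generous.
renyi_pvalue <- function(q, terms = 50) {
  k <- 0:terms
  series <- sum((-1)^k / (2 * k + 1) * exp(-pi^2 * (2 * k + 1)^2 / (8 * q^2)))
  1 - (4 / pi) * series
}

# Log-rank Q values reported in the kidney and gastric cancer examples of these notes
round(renyi_pvalue(1.590442), 5)
round(renyi_pvalue(2.200066), 5)
```

These two calls should reproduce the supremum-test p-values listed for the log-rank weight in the kidney infection and gastric cancer outputs (0.22347 and 0.05560).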
Small Example
• Given the following data
– Group 1: (7, 8+, 9, 15, 17)
– Group 2: (1, 4, 5+, 6, 19)
Constructing the statistic
[Table, one row per event time. Columns: $t_i$; $d_k$; $d_{k1}$; $Y_k$; $Y_{k1}$; $d_{k1} - Y_{k1}\frac{d_k}{Y_k}$; $Z(t_i)$; Var]
Calculating Q
• First we can calculate Q
• Once we have Q we compare to table C.5
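The construction for the small example can be sketched in base R. This assumes log-rank weights W(t) = 1 and treats the variance contribution as 0 when only one subject remains at risk; `renyi_q` is a hypothetical helper name, not a function from any package.

```r
# Running Z(t_i), sigma(tau), and Q for a two-group Renyi test (W(t) = 1 assumed).
renyi_q <- function(time, status, group) {
  event_times <- sort(unique(time[status == 1]))
  z_path <- numeric(length(event_times))
  z <- 0; v <- 0
  for (i in seq_along(event_times)) {
    tk <- event_times[i]
    Y  <- sum(time >= tk)                               # at risk, both groups
    Y1 <- sum(time >= tk & group == 1)                  # at risk, group 1
    d  <- sum(time == tk & status == 1)                 # events, both groups
    d1 <- sum(time == tk & status == 1 & group == 1)    # events, group 1
    z <- z + (d1 - Y1 * d / Y)                          # running Z(t_i)
    if (Y > 1) v <- v + (Y1 / Y) * ((Y - Y1) / Y) * ((Y - d) / (Y - 1)) * d
    z_path[i] <- z
  }
  list(times = event_times, Z = z_path, sigma = sqrt(v),
       Q = max(abs(z_path)) / sqrt(v))
}

# Group 1: 7, 8+, 9, 15, 17; Group 2: 1, 4, 5+, 6, 19 (+ marks censoring)
time   <- c(7, 8, 9, 15, 17, 1, 4, 5, 6, 19)
status <- c(1, 0, 1, 1, 1, 1, 1, 0, 1, 1)
group  <- rep(1:2, each = 5)

res <- renyi_q(time, status, group)
data.frame(t = res$times, Z = round(res$Z, 3))
round(c(sigma = res$sigma, Q = res$Q), 3)
```

The largest |Z(t_i)| occurs at t = 6, after the three early events in group 2, which is the value that enters Q.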
Example 2: Kidney Infection
• Data on 119 kidney dialysis patients
• Comparing time to kidney infection between two groups
– Catheters placed percutaneously (n = 76)
– Catheters placed surgically (n = 43)
Example: Kidney Infection
R Code: Kidney Infection
> kidney<-read.csv("H:\\public_html\\BMRTY722_Summer2015\\Data\\Kidney.csv")
> time<-kidney$Time
> infect<-kidney$d
> percut<-kidney$cath
> st<-Surv(time, infect)
> LRtest<-survdiff(st~percut)
> LRtest
Call:
survdiff(formula = st ~ percut)

          N Observed Expected (O-E)^2/E (O-E)^2/V
percut=1 43       15       11      1.42      2.53
percut=2 76       11       15      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112
How to Test This in R?
• We could write our own R function to conduct the Renyi test…
• BUT, it turns out there was a package released in April (survMisc) that has the Renyi test (and all weight functions from K & M included)
R Code: Kidney Infection
> library(survMisc)
> RYtest<-comp(survfit(st~percut))
> RYtest
$tne
       t   n  e n_percut=1 e_percut=1 n_percut=2 e_percut=2
 1:  0.5 119  6         76          6         43          0
 2:  1.5 103  1         60          0         43          1
…
16: 26.5   5  1          3          0          2          1

$tests$lrTests
                                     ChiSq df       p
Log-rank                       2.529506318  1 0.11174
Gehan-Breslow (mod~ Wilcoxon)  0.002084309  1 0.96359
Tarone-Ware                    0.402738202  1 0.52568
Peto-Peto                      1.399160019  1 0.23686
Mod~ Peto-Peto (Andersen)      1.275908836  1 0.25866
Flem~-Harr~ with p=1, q=1      9.834062861  1 0.00171

$tests$supTests
                                        Q       p
Log-rank                         1.590442 0.22347
Gehan-Breslow (mod~ Wilcoxon)    1.430499 0.30511
Tarone-Ware                      1.260498 0.41467
Peto-Peto                        1.166979 0.48551
Mod~ Peto-Peto (Andersen)        1.185549 0.47085
Renyi Flem~-Harr~ with p=1, q=1  7.460348 0.00000
R Code: Kidney Infection
> library(survMisc)
> RYtest<-comp(survfit(st~percut), FHp=0, FHq=0)
> RYtest
$tne
       t   n  e n_percut=1 e_percut=1 n_percut=2 e_percut=2
 1:  0.5 119  6         76          6         43          0
 2:  1.5 103  1         60          0         43          1
…
16: 26.5   5  1          3          0          2          1

$tests$lrTests
                                     ChiSq df       p
Log-rank                       2.529506318  1 0.11174
Gehan-Breslow (mod~ Wilcoxon)  0.002084309  1 0.96359
Tarone-Ware                    0.402738202  1 0.52568
Peto-Peto                      1.399160019  1 0.23686
Mod~ Peto-Peto (Andersen)      1.275908836  1 0.25866
Flem~-Harr~ with p=0, q=0      2.529506318  1 0.11174

$tests$supTests
                                         Q       p
Log-rank                         1.5904422 0.22347
Gehan-Breslow (mod~ Wilcoxon)    1.4304991 0.30511
Tarone-Ware                      1.2604976 0.41467
Peto-Peto                        1.1669791 0.48551
Mod~ Peto-Peto (Andersen)        1.1855486 0.47085
Renyi Flem~-Harr~ with p=0, q=0  0.9743145 0.65287
Example 3: Gastric Cancer
• Clinical trial of chemotherapy vs. chemotherapy combined with radiotherapy
• 45 patients randomized to each of two arms
• Followed for up to 8 years
R Code: Gastric Cancer
> RYtest<-comp(survfit(Surv(tm, dth)~x, data=dat))
> RYtest
$tne
       t n_x=1 e_x=1 n_x=2 e_x=2  n  e
 1:    1    45     1    45     0 90  1
…
80: 2363     3     1     6     0  9  1

$tests$lrTests
                                    ChiSq df       p
Log-rank                       0.23192760  1 0.63010
Gehan-Breslow (mod~ Wilcoxon)  3.99653918  1 0.04559
Tarone-Ware                    1.92661766  1 0.16513
Peto-Peto                      4.02844247  1 0.04474
Mod~ Peto-Peto (Andersen)      4.12061234  1 0.04236
Flem~-Harr~ with p=1, q=1      0.01112868  1 0.91598

$tests$supTests
                                        Q       p
Log-rank                         2.200066 0.05560
Gehan-Breslow (mod~ Wilcoxon)    2.951879 0.00632
Tarone-Ware                      2.677299 0.01484
Peto-Peto                        2.965941 0.00604
Mod~ Peto-Peto (Andersen)        2.997885 0.00544
Renyi Flem~-Harr~ with p=1, q=1  9.388643 0.00000
Compare To Log Rank
• Renyi test 0.05< p <0.06
• What would you expect to see from the log rank test? More or less significant?
LR Results
> LRtest<-survdiff(Surv(tm, dth)~x)
> LRtest
Call:
survdiff(formula = Surv(tm, dth) ~ x)

     N Observed Expected (O-E)^2/E (O-E)^2/V
x=1 45       43     45.1     0.102     0.232
x=2 45       39     36.9     0.125     0.232

 Chisq= 0.2  on 1 degrees of freedom, p= 0.63
Final Comments on the Renyi Test
• Simulations comparing the Renyi vs. log-rank
– When hazards cross, the Renyi test performs better
– Renyi test has little loss of power if the proportional hazards assumption holds (with limited censoring)
– However, with large amounts of censoring, advantages of the Renyi test decline
• So this test provides a good alternative when hazard rates cross.
• But caution still needs to be taken when there is a large amount of censoring.
Other Tests for Crossing Hazards
• Cramer-von Mises test(s):
– Based on the integrated squared difference between two curves
• T-test analog:
– Requires estimation of the mean
– Compares area under S1(t) and S2(t)
• Brookmeyer-Crowley
– Censored version of two-sample median test
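Cramer-Von Mises Test
• Based on the Nelson-Aalen estimator for the cumulative hazard rate and its associated variance
Group specific:
$\tilde{H}_j(t) = \sum_{t_i \le t} \frac{d_{ij}}{Y_{ij}} \quad \& \quad \sigma_j^2(t) = \sum_{t_i \le t} \frac{d_{ij}}{Y_{ij}(Y_{ij} - 1)}, \quad j = 1, 2$
Across groups:
$\sigma^2(t) = \sigma_1^2(t) + \sigma_2^2(t)$
• Ideally we integrate over time 0 to $\tau$, but this integral is estimated by summing over distinct death times
$Q_1 = \left(\frac{1}{\sigma^2(\tau)}\right)^2 \int_0^{\tau} \left[\tilde{H}_1(t) - \tilde{H}_2(t)\right]^2 d\sigma^2(t)$
Estimated by:
$Q_1 = \left(\frac{1}{\sigma^2(\tau)}\right)^2 \sum_{t_i \le \tau} \left[\tilde{H}_1(t_i) - \tilde{H}_2(t_i)\right]^2 \left[\sigma^2(t_i) - \sigma^2(t_{i-1})\right]$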
2-Sample T-test analog
• Again this test is based on the difference in the area under the survival curve between two groups
• Components of the test include:
– Order all observed times (event and censored)
– Calculate $d_{ij}$, $c_{ij}$, and $Y_{ij}$ for both groups
– Calculate the KM estimator for survival and censoring
$\hat{S}_j(t) = \prod_{t_i \le t}\left(1 - \frac{d_{ij}}{Y_{ij}}\right) \quad \& \quad \hat{G}_j(t) = \prod_{t_i \le t}\left(1 - \frac{c_{ij}}{Y_{ij}}\right)$
– Calculate the pooled KM estimate of survival, $\hat{S}_p(t)$
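2-Sample T-test analog
• Once these estimates are obtained:
– Construct weight function
$w(t) = \frac{n \hat{G}_1(t) \hat{G}_2(t)}{n_1 \hat{G}_1(t) + n_2 \hat{G}_2(t)}$
– Construct the test statistic
$W_{KM} = \sqrt{\frac{n_1 n_2}{n}} \sum_{i=1}^{D-1} (t_{i+1} - t_i) \, w(t_i) \left[\hat{S}_1(t_i) - \hat{S}_2(t_i)\right]$
– Construct the variance of the test statistic
$\hat{\sigma}_p^2 = \sum_{i=1}^{D-1} \frac{A_i^2}{\hat{S}_p(t_i)\hat{S}_p(t_{i-1})} \cdot \frac{n_1 \hat{G}_1(t_{i-1}) + n_2 \hat{G}_2(t_{i-1})}{n \hat{G}_1(t_{i-1}) \hat{G}_2(t_{i-1})} \left[\hat{S}_p(t_{i-1}) - \hat{S}_p(t_i)\right]$
where $A_i = \sum_{k=i}^{D-1} (t_{k+1} - t_k) \, w(t_k) \, \hat{S}_p(t_k)$
– Calculate a Z-score according to
$Z = \frac{W_{KM}}{\hat{\sigma}_p} \sim N(0, 1)$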
Summary of Other 2-Sample Tests
• When the hazard rates cross, both the Cramer-Von Mises and the 2-sample t-test analog have greater power than log-rank.
• When hazard rates are proportional, both show power loss relative to log-rank.
• Performance is similar to the Renyi test when hazards cross but Renyi has better power for proportional hazards.
Test Based on Fixed Points in Time
• Complicated description in K&M (chapter 7.8)
• However, pretty simple idea when you are comparing two groups:
$Z = \frac{\hat{S}_1(t_0) - \hat{S}_2(t_0)}{\sqrt{\hat{V}[\hat{S}_1(t_0)] + \hat{V}[\hat{S}_2(t_0)]}}$, evaluated at a fixed time $t_0$
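As an illustration of the two-group fixed-point comparison, the sketch below computes the Kaplan-Meier estimate and its variance at a chosen time for each group in base R. It assumes Greenwood's formula for the variance, reuses the small Renyi example data, and picks t0 = 10 arbitrarily; the helper name `km_at` is hypothetical.

```r
# KM estimate and Greenwood variance at a fixed time t0 (illustrative helper)
km_at <- function(time, status, t0) {
  event_times <- sort(unique(time[status == 1 & time <= t0]))
  S <- 1; gw <- 0
  for (ti in event_times) {
    Y <- sum(time >= ti)                   # number at risk just before ti
    d <- sum(time == ti & status == 1)     # events at ti
    S <- S * (1 - d / Y)
    if (Y > d) gw <- gw + d / (Y * (Y - d))
  }
  c(S = S, V = S^2 * gw)
}

# Group 1: 7, 8+, 9, 15, 17; Group 2: 1, 4, 5+, 6, 19 (+ marks censoring)
g1 <- km_at(c(7, 8, 9, 15, 17), c(1, 0, 1, 1, 1), t0 = 10)
g2 <- km_at(c(1, 4, 5, 6, 19),  c(1, 1, 0, 1, 1), t0 = 10)

# Z-score comparing the two survival estimates at t0, with a two-sided p-value
Z <- unname((g1["S"] - g2["S"]) / sqrt(g1["V"] + g2["V"]))
p <- 2 * pnorm(-abs(Z))
round(c(S1 = unname(g1["S"]), S2 = unname(g2["S"]), Z = Z, p = p), 3)
```

With samples this small the normal approximation is rough, so the output is only meant to show the mechanics of the calculation.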
Next Time
• We will begin our discussion of semi-parametric regression modeling in survival analysis.