drawbacks to integer scoring for ordered categorical data



BIOMETRICS 57, 567-570 June 2001

Drawbacks to Integer Scoring for Ordered Categorical Data

Anastasia Ivanova Department of Biostatistics, The University of North Carolina at Chapel Hill,

CB 7400, Chapel Hill, North Carolina 27599-7400, U.S.A. email: [email protected]

and

Vance W. Berger Biometry Research Group, DCP, NCI,

Executive Plaza North, Suite 344, Bethesda, Maryland 20892-7354, U.S.A.

SUMMARY. Linear rank tests are widely used when testing for independence against stochastic order in a 2 x J contingency table with two treatments and J ordered outcome levels. For this purpose, numerical scores are assigned, possibly by default, to the J outcome levels. When the choice of scores is not apparent, integer (equally spaced) scores are often assigned. We show that this practice generally leads to unnecessarily conservative tests. The use of slightly perturbed scores will result in a less conservative and uniformly more powerful test.

KEY WORDS: Conservatism; Contingency table; Linear rank test; Permutation test.

1. Introduction

It is common to compare two treatments on the basis of ordered categorical data. Ignoring the ordering among the categories or collapsing categories will result in a loss of power (Emerson and Moses, 1985). To exploit the ordering, numerical scores may be assigned to the outcome levels. When subject matter considerations offer no indication of what scores to assign, Graubard and Korn (1987) argued for equally spaced (integer) scores. In this article, we show the test with integer scores to be excessively conservative. Using slightly perturbed scores leads to a uniformly more powerful test.

To illustrate, consider an ovarian cancer example. Patients were randomized to receive either placebo or diethyldithiocarbamate (Gandara et al., 1995). The objective tumor response data were (9,24,12,16) in the diethyldithiocarbamate group and (9,22,8,13) in the placebo group, where the categories are progressive disease, stable disease, partial response, and complete response. Following the intent-to-treat principle, patients who had been withdrawn from the study early are included. We consider these patients to be in the progressive disease category. Combining the partial response and complete response categories into a single response category, as is routinely done, yields {(9,24,28); (9,22,21)}. The linear rank test with scores (1,2,3) yields a one-sided p-value of 0.030, yet with perturbed scores (1,2.01,3), the p-value is reduced to 0.021, which is significant at the one-sided 0.025 level.

In the remainder of the article, we explain why this phenomenon is not unusual. In Section 2, we present notation and preliminaries. In Section 3, we show, with power calculations, that certain sets of scores (generally including equally spaced scores) result in excessive conservatism. In Section 4, we discuss rank-based tests whose reliance on assigning scores is rarely made explicit. In Section 5, we offer strategies for choosing column scores.

2. Notation and Preliminaries

Consider testing the null hypothesis of independence between rows and columns against the one-sided alternative of stochastic order. The vectors $\pi_1 = (\pi_{11}, \pi_{12}, \pi_{13})$ and $\pi_2 = (\pi_{21}, \pi_{22}, \pi_{23})$ of cell probabilities each sum to one. The corresponding trinomial random vectors $C_1 = (C_{11}, C_{12}, C_{13})$ and $C_2 = (C_{21}, C_{22}, C_{23})$ sum to $n_1$ and $n_2$, respectively. The row margins $n = (n_1, n_2)$ are fixed by design (product multinomial sampling). The sample space $\Gamma$ is the set of 2 x 3 contingency tables with nonnegative integer-valued cell counts with row totals $n$ and fixed column totals $T = (T_1, T_2, T_3)$. Given $T$, $n$, and $c = (C_{11}, C_{12})$, we reconstruct the entire 2 x 3 contingency table as $C_{13} = n_1 - C_{11} - C_{12}$ and $C_2 = T - C_1$, so we let $c$ denote a point of $\Gamma$.

Figure 1 displays $C_{12}$ plotted against $C_{11}$ for each of the 76 tables of $\Gamma$ for a small (relative to {(9,24,28); (9,22,21)}, for which $\Gamma$ has 1194 points) real example, {(7,3,2); (18,4,14)}, from Emerson and Moses (1985). Observed table (7,3) is circled. The conditional null probability of each table can be calculated using the hypergeometric distribution. The exact conditional linear rank test with scores $(v_1, v_2, v_3)$, $v_1 < v_3$, orders tables in $\Gamma$ according to the difference $A_1 - A_2$ between two weighted sums,


[Figure 1: three panels, each plotting C12 against C11 (0 to 12). Panel 1: Linear Rank Test (0.00, 0.50, 1.00), p = 0.2084. Panel 2: Linear Rank Test (0.00, 0.49, 1.00), p = 0.1476. Panel 3: Linear Rank Test (0.00, 0.51, 1.00), p = 0.2076.]

Figure 1. Critical regions for three linear rank tests for {(7,3,2); (18,4,14)}.

$$A_1 = \frac{C_{11}v_1 + C_{12}v_2 + C_{13}v_3}{n_1} \quad \text{and} \quad A_2 = \frac{C_{21}v_1 + C_{22}v_2 + C_{23}v_3}{n_2},$$

rejecting $H_0$ for tables with large values of $A_2 - A_1$. Without loss of generality, the scores can be chosen as $(0, v, 1)$, with $v = (v_2 - v_1)/(v_3 - v_1)$. The linear rank test above is equivalent to $\varphi_v$, which rejects $H_0$ for large values of $z_v(c) = C_{11} + (1 - v)C_{12}$. We consider the class of exact level-$\alpha$ linear rank tests, $\{\varphi_v : 0 \le v \le 1\}$. If, e.g., the three categories are assumed to be equally spaced, then $v = 0.5$.
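The equivalence between the difference of weighted sums and $z_v(c)$ is stated without proof. One way to see it, offered here as a sketch that uses only the definitions in this section (the intermediate scores $s_j$ and the statistics $A_i^{(s)}$ are notation introduced for this illustration), is the following.

```latex
% Step 1: reduce general scores to (0, v, 1). Write v_j = v_1 + (v_3 - v_1) s_j
% with s = (0, v, 1). Since each row of counts sums to n_i,
\begin{align*}
A_i &= v_1 + (v_3 - v_1)\,A_i^{(s)}, \qquad
A_2 - A_1 = (v_3 - v_1)\bigl(A_2^{(s)} - A_1^{(s)}\bigr), \quad v_3 - v_1 > 0 .
\end{align*}
% Step 2: with scores (0, v, 1), substitute C_{13} = n_1 - C_{11} - C_{12},
% C_{22} = T_2 - C_{12}, and C_{23} = T_3 - C_{13}:
\begin{align*}
A_1^{(s)} &= \frac{v\,C_{12} + C_{13}}{n_1}
           = 1 - \frac{C_{11} + (1 - v)\,C_{12}}{n_1}
           = 1 - \frac{z_v(c)}{n_1}, \\
A_2^{(s)} &= \frac{v\,C_{22} + C_{23}}{n_2}
           = \frac{v\,T_2 + T_3 - n_1}{n_2} + \frac{z_v(c)}{n_2}, \\
A_2^{(s)} - A_1^{(s)} &= \frac{v\,T_2 + T_3 - n_1}{n_2} - 1
           + \Bigl(\frac{1}{n_1} + \frac{1}{n_2}\Bigr) z_v(c).
\end{align*}
```

Because the margins are fixed, the first two terms of the last line are constants on $\Gamma$, so ordering tables by $A_2 - A_1$ and ordering them by $z_v(c)$ give the same test.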

3. Conservatism of the Linear Rank Test with Integer Scores, $\varphi_{0.5}$

Let $M_v(c) = \{c^* \in \Gamma \mid z_v(c^*) \ge z_v(c)\}$ denote the $\varphi_v$ extreme region of $c$, with $p_v(c) = P_0\{M_v(c) \mid T\}$ the corresponding p-value, where $P_0$ is the probability under the null hypothesis. Clearly, $p_v(\cdot)$ is a monotonic set function for any $v$, so if $M_v(c) \subset M_v(c^*)$, then $p_v(c) \le p_v(c^*)$. In the first panel of Figure 1, $M_{0.5}(7,3)$ is shown by dark dots. Fix $c = (C_{11}, C_{12})$ and consider $c^* = (C_{11}^*, C_{12}^*) \in \Gamma - c$. Then $z_v(c^*) = z_v(c)$ if and only if $v = v_{c,c^*} = 1 - (C_{11} - C_{11}^*)/(C_{12}^* - C_{12})$. Letting $c^*$ range over $\Gamma - c$, $V(c) = \{v_1(c), v_2(c), \ldots, v_{K(c)}(c)\}$ is the ordered set of values $v_{c,c^*} \in \mathbb{R}'$. In our example, {(7,3,2); (18,4,14)}, if $c = (7,3)$, then $K(7,3) = 38$ and

$$V(c) = \{-\infty, -6, -5, -4, -3, -\tfrac{5}{2}, -2, -\tfrac{3}{2}, -\tfrac{4}{3}, -1, \ldots\}.$$
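The set V(c) for the small example can be enumerated directly from its definition. The sketch below is an illustration rather than the authors' code: the function names `reference_set` and `tie_scores` are hypothetical, and it assumes the convention that a table with the same $C_{12}$ as $c$ contributes $-\infty$ when its $C_{11}$ is smaller and $+\infty$ when it is larger. Exact rational arithmetic avoids spurious near-ties.

```python
from fractions import Fraction
import math


def reference_set(row1, row2):
    """All points (C11, C12) of Gamma for the margins of the observed 2 x 3 table."""
    n1 = sum(row1)
    T = [a + b for a, b in zip(row1, row2)]
    return [(c11, c12)
            for c11 in range(min(n1, T[0]) + 1)
            for c12 in range(min(n1 - c11, T[1]) + 1)
            if n1 - c11 - c12 <= T[2]]          # C13 must not exceed T3


def tie_scores(c, tables):
    """Ordered set V(c): every v at which some other table ties with c."""
    finite, neg_inf, pos_inf = set(), False, False
    for a, b in tables:
        if (a, b) == c:
            continue
        d = c[0] - a            # C11 - C11*
        e = b - c[1]            # C12* - C12
        if e == 0:              # no finite v gives a tie (assumed convention)
            if d > 0:
                neg_inf = True
            else:
                pos_inf = True
        else:
            finite.add(1 - Fraction(d, e))
    values = sorted(finite)
    if neg_inf:
        values.insert(0, -math.inf)
    if pos_inf:
        values.append(math.inf)
    return values


gamma = reference_set((7, 3, 2), (18, 4, 14))
V = tie_scores((7, 3), gamma)
print(len(gamma), len(V))                 # number of tables and K(7,3)
print([str(v) for v in V[:10]])           # smallest ten elements of V(7,3)
print(V.index(0) + 1, V.index(Fraction(1, 2)) + 1, V.index(1) + 1)
```

Under these assumptions the enumeration gives 76 tables and K(7,3) = 38, with 0, 0.5, and 1 in positions 16, 19, and 21 of the ordered set, which agrees with the indices used in the text.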

If $v \in V(c)$, then $z_v(c) = z_v(c^*)$ for some $c^* \in \Gamma - c$, so $c^* \in M_v(c)$. Letting $v^* = v + \epsilon$ or $v^* = v - \epsilon$, so that $z_{v^*}(c) > z_{v^*}(c^*)$, $c^*$ would not be in $M_{v^*}(c)$ and would not inflate $p_{v^*}(c)$. We see that $p_v(c)$ is maximized locally when $v \in V(c)$.

THEOREM: If $v_k(c) < v < v_{k+1}(c)$ for some $k$, then $M_v(c)$ is independent of $v$, $M_v(c)$ is a subset of both $M_{v_k(c)}(c)$ and $M_{v_{k+1}(c)}(c)$, $p_{v_k(c)}(c) \ge p_v(c)$, and $p_{v_{k+1}(c)}(c) \ge p_v(c)$.

Proof. The line through $c$ with slope $1/(v - 1)$ separates $M_v(c) - c$ from $\Gamma - M_v(c)$ and intersects with neither because $v \notin V(c)$ (panel 2 of Figure 1). Decreasing $v$ will not change the sets $M_v(c) - c$ or $\Gamma - M_v(c)$ until $v = v_k(c)$. If $W_{v_k(c)}(c) = \{c^* \in \Gamma - c \mid z_{v_k(c)}(c^*) = z_{v_k(c)}(c)\}$, then $M_{v_k(c)}(c) = M_v(c) \cup W_{v_k(c)}(c)$, and $W_{v_k(c)}(c) \cap \{\Gamma - M_v(c)\}$ represents the set of points that migrated into the extreme region when $v$ became $v_k(c)$, making $M_{v_k(c)}(c)$ strictly larger than $M_v(c)$. By the monotonicity of $p_v$, then, $p_{v_k(c)}(c) \ge p_v(c)$. The same argument applies as $v$ increases to $v_{k+1}(c)$. □

Because $0.5 = v_{19}(7,3) \in V(7,3)$, $\varphi_{0.5}$ assigns the same value of the test statistic to four points in the reference set; each counts in the calculation of the p-value of the other three. Because $0.49 \notin V(7,3)$ and $0.51 \notin V(7,3)$, the $\varphi_{0.5}$ extreme region and p-value are larger than those of $\varphi_{0.49}$ and $\varphi_{0.51}$. The fact that there are fewer points for which the $\varphi_{0.5}$ p-value is below 0.05 (or any other $\alpha$-level) makes the $\varphi_{0.50}$ critical region a proper subset of the $\varphi_{0.49}$ and $\varphi_{0.51}$ critical regions. Consequently, $\varphi_{0.5}$ is more conservative and less powerful than $\varphi_{0.49}$ and $\varphi_{0.51}$. Notice in Figure 1 that $M_{0.49}(7,3) \subset M_{0.50}(7,3)$ (the points in $M_{0.50}(7,3) - M_{0.49}(7,3)$ are marked by crosses) and $M_{0.51}(7,3) \subset M_{0.50}(7,3)$. In fact, $p_{0.50}(7,3) = 0.2084$, $p_{0.49}(7,3) = 0.1476$, and $p_{0.51}(7,3) = 0.2076$.

Table 1 shows exact unconditional power comparisons of $\varphi_{0.5}$ to $\varphi_{0.49}$ and $\varphi_{0.51}$. We considered all 4356 tables with $n_1 = n_2 = 10$, and we let $\pi_1 = (0.3, 0.4, 0.3)$ while $\pi_2$ varies. The starred entries in Table 1 are the best powers among the tests considered. The last line of Table 1 shows that the actual sizes of $\varphi_{0.49}$ (0.034) and $\varphi_{0.51}$ (0.034) are closer to the nominal size of 0.05 than the actual size of $\varphi_{0.5}$ (0.026). This excessive conservatism of $\varphi_{0.5}$ is reflected in the power calculations: Specifically, both $\varphi_{0.49}$ and $\varphi_{0.51}$ are uniformly more powerful than $\varphi_{0.5}$. Note that $0 = v_{16}(7,3) \in V(7,3)$ and $1 = v_{21}(7,3) \in V(7,3)$, so binary tests on collapsed categories are overly conservative. In practice, when linear rank tests are used, $v$ is generally chosen from $V = \bigcup_{c \in \Gamma} V(c)$, making $\varphi_v$ overly conservative. Now $0.5 \in V$ for most margins, but if $T_1 T_2 T_3 = 0$, then at least one column margin is zero and there exists $k$ such that $C_{11} + kC_{12}$ is constant on $\Gamma$. In this case, $\varphi_v$ is the same test as $\varphi_{v^*}$ provided that $(v - k)(v^* - k) > 0$.

Table 1. Exact power comparisons of three linear rank tests with $n_1 = n_2 = 10$, with alternative $\pi_1 = (0.3, 0.4, 0.3)$ and $\pi_2$ varying, at nominal $\alpha = 0.05$. The last row of the table represents the actual size of the test.

    pi_2               phi_0.50    phi_0.49    phi_0.51
    (0.1, 0.0, 0.9)    0.683       0.797*      0.683
    (0.1, 0.1, 0.8)    0.562       0.641*      0.569
    (0.1, 0.2, 0.7)    0.437       0.489*      0.456
    (0.1, 0.3, 0.6)    0.320       0.354*      0.354*
    (0.1, 0.4, 0.5)    0.219       0.240       0.267*
    (0.1, 0.5, 0.4)    0.139       0.151       0.195*
    (0.1, 0.6, 0.3)    0.082       0.088       0.139*
    (0.3, 0.4, 0.3)    0.026       0.034       0.034

    * Best power among the tests considered (bold in the original).

Table 2. Eleven 2 x 3 contingency tables from Emerson and Moses (1985) with one-sided p-values for $\varphi_{0.50}$, $\varphi_{0.49}$, and $\varphi_{0.51}$.
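The Section 3 p-values for the observed table (7,3) can be recomputed from the definitions of $z_v(c)$ and $M_v(c)$. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the conditional null probability is taken to be the multivariate hypergeometric probability of the first row given the margins, which is how the hypergeometric remark in Section 2 is read here.

```python
from fractions import Fraction
from math import comb


def reference_set(row1, row2):
    """All points (C11, C12) of Gamma for the margins of the observed table."""
    n1 = sum(row1)
    T = [a + b for a, b in zip(row1, row2)]
    return [(c11, c12)
            for c11 in range(min(n1, T[0]) + 1)
            for c12 in range(min(n1 - c11, T[1]) + 1)
            if n1 - c11 - c12 <= T[2]]


def null_prob(c, row1, row2):
    """Multivariate hypergeometric probability of table c given both margins."""
    n1 = sum(row1)
    T = [a + b for a, b in zip(row1, row2)]
    c13 = n1 - c[0] - c[1]
    numerator = comb(T[0], c[0]) * comb(T[1], c[1]) * comb(T[2], c13)
    return Fraction(numerator, comb(sum(T), n1))


def z(c, v):
    """Test statistic z_v(c) = C11 + (1 - v) * C12."""
    return c[0] + (1 - v) * c[1]


def p_value(c, v, row1, row2):
    """Exact conditional p-value: null mass of the extreme region M_v(c)."""
    return sum(null_prob(t, row1, row2)
               for t in reference_set(row1, row2)
               if z(t, v) >= z(c, v))


row1, row2 = (7, 3, 2), (18, 4, 14)
for v in (Fraction(1, 2), Fraction(49, 100), Fraction(51, 100)):
    print(float(v), round(float(p_value((7, 3), v, row1, row2)), 4))

# Tables tied with (7,3) under integer scoring (v = 0.5)
half = Fraction(1, 2)
ties = [t for t in reference_set(row1, row2) if z(t, half) == z((7, 3), half)]
print(ties)
```

Passing $v$ as a `Fraction` keeps the comparison `z(t, v) >= z(c, v)` exact, so a tie at $v = 0.5$ is detected as a tie and not decided by floating-point rounding. Under these assumptions the three p-values agree with the 0.2084, 0.1476, and 0.2076 quoted in the text, and four tables are tied at $v = 0.5$.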

4. Tests That Surreptitiously Use Scores

Potentially unbeknownst to the uncritical data analyst, the Cochran-Mantel-Haenszel test assigns column numbers (integer scores) as default scores. Other tests that rely on the assignment of scores (that the user is rarely prompted to supply) include those based on correlation coefficients or ridits. The Wilcoxon rank-sum test (Emerson and Moses, 1985), applied to this problem, is equivalent to a linear rank test with scores equal to midranks. In our second example, such midrank scores would be $v_1 = 13$ (25 observations are tied in the first category), $v_2 = 29$ (the midrank of ranks 26-32 for the second category), and $v_3 = 40.5$. The standardized midscore is $v = (29 - 13)/(40.5 - 13) = 0.58$. Graubard and Korn (1987) warned that "midrank scores can be unreasonable in applications when the column margin is far from uniform." Of greater concern to us is that, when the column margin is exactly uniform, i.e., $T_1 = T_2 = T_3$, the midrank scores will be exactly equivalent to the integer scores and hence the test will be overly conservative.
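The midrank arithmetic in this section can be checked with a few lines of code. This is an illustrative sketch (the function names `midrank_scores` and `standardized_middle_score` are made up for the example); it takes only the column totals as input.

```python
from fractions import Fraction


def midrank_scores(column_totals):
    """Midrank of each ordered category, given the column totals T."""
    scores, ranks_used = [], 0
    for t in column_totals:
        # Ranks ranks_used + 1, ..., ranks_used + t are tied; take their average.
        scores.append(ranks_used + Fraction(t + 1, 2))
        ranks_used += t
    return scores


def standardized_middle_score(scores):
    """v = (v2 - v1) / (v3 - v1) for three ordered scores."""
    v1, v2, v3 = scores
    return (v2 - v1) / (v3 - v1)


# Column totals of {(7,3,2); (18,4,14)}
scores = midrank_scores((25, 7, 16))
print([float(s) for s in scores])                       # 13.0, 29.0, 40.5
print(round(float(standardized_middle_score(scores)), 2))

# A uniform column margin reproduces integer scoring, v = 0.5
print(standardized_middle_score(midrank_scores((10, 10, 10))))
```

With a uniform margin the midranks are equally spaced, so the standardized middle score is exactly 1/2, which is the case the section warns about.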

5. Discussion

When analyzing ordered categorical data, we suggest that the researcher select a test based on the trade-off between power and simplicity. If power is the primary objective, then use the adaptive test (Berger, 1998) or convex hull test (Berger, Permutt, and Ivanova, 1998). The Smirnov test is the simplest nonlinear rank test and uses as the test statistic the largest of 0, $D_1^* = C_{11}/n_1 - C_{21}/n_2$, and $D_2^* = (C_{11} + C_{12})/n_1 - (C_{21} + C_{22})/n_2$, or equivalently $D_1 = (C_{22} + C_{23})/n_2 - (C_{12} + C_{13})/n_1$ and $D_2 = C_{23}/n_2 - C_{13}/n_1$. This combination of $\varphi_1$ and $\varphi_0$ retains the excessive conservatism of its components. A combination of $\varphi_{0.99}$ and $\varphi_{0.01}$ instead, with the test statistic the largest of 0, $D_1 = (0.99C_{22} + C_{23})/n_2 - (0.99C_{12} + C_{13})/n_1$, and $D_2 = (0.01C_{22} + C_{23})/n_2 - (0.01C_{12} + C_{13})/n_1$, will be less conservative and more powerful. This would be a good choice if power and simplicity are both important. If simplicity is of paramount importance, then a linear rank test may be used, thereby forcing one to select column scores. The column scores are neither data (observable from the sample) nor parameters (observable from the population) yet are said to be correct when they reflect the subject matter (Graubard and Korn, 1987). Suppose, e.g., that the question "How much would you pay, out-of-pocket, for stable disease instead of progressive disease?" would meet with an unqualified response of $M_1$. Likewise, suppose that one could assign a monetary value ($M_2$) to shifting from stable disease to response. If known, then $M_1 > 0$ and $M_2 > 0$ would provide a clear basis for spacing the three outcome levels relative to each other, with column scores of $(0, M_1, M_1 + M_2)$, or equivalently, $(0, M_1/[M_1 + M_2], 1)$. One would then prefer whichever treatment provides a larger mean score. However, the monetary values, and consequently the sets of column scores, would vary both across individuals and within individuals over time. With ordinal (as opposed to interval-scaled) data, there is, by definition, no subject-matter basis for the selection of scores, and the decision needs to be based on performance.

Table 2 entries (one-sided p-values):

    2 x 3 table               phi_0.50    phi_0.49    phi_0.51
    ...                       0.0006      0.0006      0.0002
    ...                       0.1518      0.1518      0.0973
    ...                       0.0028      0.0012      0.0028
    ...                       0.0332      0.0295      0.0313
    ...                       0.0006      0.0005      0.0005
    {(7,3,2); (18,4,14)}      0.2084      0.1476      0.2076
    ...                       0.0000      0.0000      0.0000
    ...                       0.2306      0.1544      0.2271
    ...                       0.0052      0.0051      0.0031
    ...                       0.3935      0.3931      0.3219
    ...                       0.1172      0.1114      0.1389
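The Smirnov statistic of Section 5 and its perturbed variant differ only in the weight given to the middle category, which a short sketch makes concrete. The code below is an illustration under stated assumptions: the helper names are hypothetical, it computes the test statistics only (not their exact p-values), and it uses exact fractions so that the two variants can be compared without rounding.

```python
from fractions import Fraction


def weighted_difference(row1, row2, w):
    """(w*C22 + C23)/n2 - (w*C12 + C13)/n1 for middle-category weight w."""
    n1, n2 = sum(row1), sum(row2)
    return (w * row2[1] + row2[2]) / Fraction(n2) - (w * row1[1] + row1[2]) / Fraction(n1)


def smirnov_type_statistic(row1, row2, w_high, w_low):
    """Largest of 0, D1 (weight w_high), and D2 (weight w_low)."""
    d1 = weighted_difference(row1, row2, w_high)
    d2 = weighted_difference(row1, row2, w_low)
    return max(Fraction(0), d1, d2)


row1, row2 = (7, 3, 2), (18, 4, 14)

# Smirnov test: combination of phi_1 and phi_0
classic = smirnov_type_statistic(row1, row2, Fraction(1), Fraction(0))

# Perturbed version: combination of phi_0.99 and phi_0.01
perturbed = smirnov_type_statistic(row1, row2, Fraction(99, 100), Fraction(1, 100))

print(classic, float(classic))
print(perturbed, float(perturbed))
```

Setting the weights to 1 and 0 recovers $D_1$ and $D_2$ of the Smirnov test, while 0.99 and 0.01 give the perturbed combination. Turning either statistic into a p-value would still require summing the null probabilities over the reference set, which is omitted here.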

If one would like to use $\varphi_v$ and $v \in V(c)$ for some $c \in \Gamma$, then we suggest using instead either $\varphi_{v-\epsilon}$ or $\varphi_{v+\epsilon}$ since both are valid and neither will give a larger p-value than $\varphi_v$ provided that $\epsilon$ is small enough to ensure that $v = V(c) \cap [v - \epsilon, v + \epsilon]$. Table 2 illustrates this point, using $v = 0.50$ and $\epsilon = 0.01$ for a variety of real data sets. Sometimes $p_{v-\epsilon} < p_{v+\epsilon}$ and sometimes $p_{v-\epsilon} > p_{v+\epsilon}$. To ensure validity, one needs to select $\varphi_{v-\epsilon}$ or $\varphi_{v+\epsilon}$ before unblinding the treatment codes. We suggest making this selection after the margins are known.

One might then use one of the following criteria for choosing either $\varphi_{v-0.01}$ or $\varphi_{v+0.01}$, illustrated by example. For {(9,24,28); (9,22,21)}, with $v = 0.5$, $\varphi_{0.49}$ ($\varphi_{0.51}$) yields a smaller p-value than $\varphi_{0.50}$ for 1011 (1046) tables, so $\varphi_{0.51}$ would be used. For {(7,3,2); (18,4,14)}, both $\varphi_{0.49}$ and $\varphi_{0.51}$ yield a smaller p-value than $\varphi_{0.50}$ for 51 tables. However, $\varphi_{0.49}$ yields a smaller (larger) p-value than $\varphi_{0.51}$ for 51 (21) tables, so $\varphi_{0.49}$ would be used.

The second criterion depends on $\alpha$ and is based on whichever perturbation shifts more tables into the critical region. For {(7,3,2); (18,4,14)}, with $c_1 = (9,1)$, $c_2 = (8,3)$, and $c_3 = (7,5)$, $p_{0.5}(c) = 0.058$ for $c = c_1$, $c = c_2$, and $c = c_3$, yet $p_{0.49}(c_2) = 0.034$, $p_{0.49}(c_3) = 0.025$, and $p_{0.51}(c_1) = 0.049$. So $\varphi_{0.49}$ would be chosen because it shifts two out of three tables to the critical region (when $\alpha = 0.05$). When analyzing R x J contingency tables, the same considerations apply, but there are more perturbations to consider.
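Both selection criteria depend only on the margins, so they can be evaluated before the treatment codes are unblinded. The sketch below applies them to the margins of {(7,3,2); (18,4,14)}. It is an illustration, not the authors' procedure: the names are hypothetical, the null probabilities are assumed to be multivariate hypergeometric, and a table is counted as in the critical region when its p-value is at most $\alpha$.

```python
from fractions import Fraction
from math import comb

ROW1, ROW2 = (7, 3, 2), (18, 4, 14)
N1 = sum(ROW1)
T = [a + b for a, b in zip(ROW1, ROW2)]

# Reference set Gamma, indexed by (C11, C12)
GAMMA = [(c11, c12)
         for c11 in range(min(N1, T[0]) + 1)
         for c12 in range(min(N1 - c11, T[1]) + 1)
         if N1 - c11 - c12 <= T[2]]

# Exact conditional null probability of every table
PROB = {c: Fraction(comb(T[0], c[0]) * comb(T[1], c[1]) * comb(T[2], N1 - c[0] - c[1]),
                    comb(sum(T), N1))
        for c in GAMMA}


def p_values(v):
    """p_v(c) for every c in Gamma, using z_v(c) = C11 + (1 - v) * C12."""
    z = {c: c[0] + (1 - v) * c[1] for c in GAMMA}
    return {c: sum(PROB[t] for t in GAMMA if z[t] >= z[c]) for c in GAMMA}


p50 = p_values(Fraction(1, 2))
p49 = p_values(Fraction(49, 100))
p51 = p_values(Fraction(51, 100))

# Criterion 1: which perturbation more often gives the smaller p-value?
n49_below_50 = sum(p49[c] < p50[c] for c in GAMMA)
n51_below_50 = sum(p51[c] < p50[c] for c in GAMMA)
n49_below_51 = sum(p49[c] < p51[c] for c in GAMMA)
n51_below_49 = sum(p51[c] < p49[c] for c in GAMMA)
print(n49_below_50, n51_below_50, n49_below_51, n51_below_49)

# Criterion 2: which perturbation moves more tables into the critical region?
ALPHA = Fraction(5, 100)
shifted_49 = [c for c in GAMMA if p50[c] > ALPHA and p49[c] <= ALPHA]
shifted_51 = [c for c in GAMMA if p50[c] > ALPHA and p51[c] <= ALPHA]
print(shifted_49, shifted_51)
```

The last two counts printed for criterion 1 correspond to the 51 (21) comparison reported in the text. For criterion 2, the tables $c_2$ and $c_3$ enter the critical region under $\varphi_{0.49}$ and only $c_1$ under $\varphi_{0.51}$, in line with the p-values quoted above.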

RÉSUMÉ

Linear rank tests are widely used to test independence in a 2 x J contingency table with two treatments and J ordered response levels. For this purpose, numerical scores are assigned, sometimes by default, to the J levels. When the choice of scores is not made explicit, integer (equally spaced) scores are assigned. We show that this practice generally leads to overly conservative tests. Using slightly modified scores leads to a less conservative and uniformly more powerful test.

REFERENCES

Berger, V. W. (1998). Admissibility of exact conditional tests of stochastic order. Journal of Statistical Planning and Inference 66, 39-50.

Berger, V. W., Permutt, T., and Ivanova, A. (1998). Convex hull test for ordered categorical data. Biometrics 54, 1541-1550.

Emerson, J. D. and Moses, L. E. (1985). A note on the Wilcoxon-Mann-Whitney test for 2 x k ordered tables. Biometrics 41, 303-309.

Gandara, D. R., Nahhas, W. A., Adelson, M. D., et al. (1995). Randomized placebo-controlled multicenter evaluation of diethyldithiocarbamate for chemoprotection against cisplatin-induced toxicities. Journal of Clinical Oncology 13, 490-496.

Graubard, B. I. and Korn, E. L. (1987). Choice of column scores for testing independence in ordered 2 x k contingency tables. Biometrics 43, 471-476.

Received December 1999. Revised October 2000. Accepted December 2000.