1
Propensity Scores Methodology for Receiver Operating Characteristic
(ROC) Analysis.
Marina Kondratovich, Ph.D.
U.S. Food and Drug Administration, Center for Devices and Radiological Health
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be
inferred.
September, 2003
2
Outline
• Introduction
• Place for propensity scores
• Distributions of covariates (details)
• Distributions of New Test results (details)
• Bias of naïve AUC estimation
• Matching for one covariate
• Weighted ROC analysis
• Stratification for one covariate
• Relationship between AUC by matching and by stratification
• Propensity score – pre-test risk of disease
• Conjunction of a New Test with other diagnostic tests
3
ROC Analysis
New Test is quantitative.
New Test variable: X for the Diseased population, Y for the Non-Diseased population.
ROC curve = relationship between sensitivity and specificity of a New Test over all possible cut-off values. The AUC (area under the curve) is the most common measure of test performance.
AUC = sensitivity averaged over all values of specificity = specificity averaged over all values of sensitivity.
AUC = P{X>Y}: the probability that a randomly selected Diseased subject has a test value larger than that of a randomly selected Non-Diseased subject.
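The probabilistic reading AUC = P{X>Y} can be illustrated with a minimal sketch that counts, over all (Diseased, Non-Diseased) pairs, how often the Diseased value is larger, with ties counted as one half. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def psi(a, b):
    """Pair kernel: 1 if a > b, 1/2 if a == b, 0 if a < b."""
    if a > b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0


def empirical_auc(diseased, non_diseased):
    """Fraction of (Diseased, Non-Diseased) pairs with X > Y (ties count 1/2)."""
    total = sum(psi(x, y) for x in diseased for y in non_diseased)
    return total / (len(diseased) * len(non_diseased))


# Hypothetical New Test values, for illustration only.
x = [2.1, 3.4, 1.9, 4.0]   # Diseased subjects
y = [1.2, 2.1, 0.7]        # Non-Diseased subjects
auc = empirical_auc(x, y)
print(auc)
```

This pairwise estimate is the empirical counterpart of P{X>Y}; it is the same quantity that the Wilcoxon-Mann-Whitney statistic computes.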
4
• To estimate the diagnostic accuracy of a New Test correctly, we should compare the values of the New Test in the Diseased state with its values in the Non-Diseased state for the same subjects. Each subject has two potential values of the New Test: a value X that would be observed if the subject were Diseased and a value Y that would be observed if the subject were Non-Diseased. But X and Y cannot be observed jointly for the same subject.
Subject = {New Test, Covariates (e.g., C1=Age, C2=BMI)}
• If we could randomly assign subjects to the Diseased and Non-Diseased clinical states, the Diseased and Non-Diseased groups would be comparable with respect to covariates and the diagnostic accuracy of the Test would be evaluated correctly. But such a random assignment is impossible.
5
Biased estimators of AUC occur if
I. Distributions of covariates are different for the Diseased and Non-Diseased study groups;
and
II. Distributions of New Test results are different for different sets of covariates.
Problem: Consider M randomly selected Diseased subjects and N randomly selected Non-Diseased subjects. When I and II both hold, the naïve estimate of AUC is biased (usually overstated).
Consider these two conditions in more detail for one covariate, Age.
6
I. Different Age distributions in Diseased and Non-Diseased study groups.
Target population Age distribution (t1, t2, t3): t1=0.5; t2=0.3; t3=0.2
Pre-test risk of Disease (Age):
Age1    π1 = 0.1
Age2    π2 = 0.3
Age3    π3 = 0.5
πpopulation = π1·t1 + π2·t2 + π3·t3 = 0.24
7
I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
M = m1 + m2 + m3;  N = n1 + n2 + n3
E[mi/M] = pi;  E[ni/N] = qi
Age distributions:
Diseased:      p1=0.21; p2=0.38; p3=0.41
Non-Diseased:  q1=0.59; q2=0.28; q3=0.13
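The study age distributions p and q follow from the target-population age distribution t and the pre-test risks π by Bayes' rule. The slides do not spell this step out, so the sketch below is an assumption about how the numbers were obtained; the variable names are illustrative.

```python
# Target-population age distribution and pre-test risk of Disease per age group
# (values taken from the slides).
t = [0.5, 0.3, 0.2]
pi = [0.1, 0.3, 0.5]

# Overall pre-test risk in the population: sum of pi_i * t_i.
pi_population = sum(r * w for r, w in zip(pi, t))

# Bayes' rule: age distribution among Diseased and among Non-Diseased subjects.
p = [r * w / pi_population for r, w in zip(pi, t)]
q = [(1 - r) * w / (1 - pi_population) for r, w in zip(pi, t)]

print(round(pi_population, 2))
print([round(v, 3) for v in p])
print([round(v, 3) for v in q])
```

Under this assumption the computed values agree with the slide's p and q to within rounding (the third component of p evaluates to about 0.417).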
8
I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.
Pre-test risk of Disease in the study (Agei) = mi/(mi + ni), related to the pre-test risk of Disease in the population:
\[ \frac{m_i}{m_i+n_i} \approx \frac{1}{1 + \dfrac{N}{M}\cdot\dfrac{\pi_{population}}{1-\pi_{population}}\cdot\dfrac{1-\pi_i}{\pi_i}} \]
Monotonic function of πi; depends on πpopulation and πstudy.
Pre-test Risk of Disease
        Study (N=M)   Population
Age1    0.26          0.1
Age2    0.58          0.3
Age3    0.80          0.5
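As a rough check of how the study pre-test risk relates to the population pre-test risk, the sketch below assumes mi ≈ M·pi and ni ≈ N·qi, with pi and qi obtained from πi and the population age distribution by Bayes' rule. Under that assumption mi/(mi + ni) is a monotonic function of πi. The function name `study_risk` and its `n_over_m` parameter are hypothetical.

```python
def study_risk(pi_i, pi_population, n_over_m=1.0):
    """Approximate m_i / (m_i + n_i) from the population pre-test risk pi_i.

    Assumes m_i ~ M * p_i and n_i ~ N * q_i (expected stratum counts).
    """
    population_odds = pi_population / (1.0 - pi_population)
    odds_against = (1.0 - pi_i) / pi_i
    return 1.0 / (1.0 + n_over_m * population_odds * odds_against)


pi_population = 0.24
risks = [study_risk(r, pi_population) for r in (0.1, 0.3, 0.5)]
print([round(r, 2) for r in risks])
```

With N = M, the first two values round to the 0.26 and 0.58 shown in the table; with these inputs the third evaluates to about 0.76 rather than the tabulated 0.80.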
9
II. The distribution of the Test variable depends on Age.
New Test variables of Diseased subjects: X1, X2, X3 with c.d.f. F1(x), F2(x), F3(x).
New Test variables of Non-Diseased subjects: Y1, Y2, Y3 with c.d.f. G1(y), G2(y), G3(y).
Example. Disease=Fracture, Non-Diseased=No Fracture, New Test=Ultrasound test for a body site. This is a hypothetical relationship between the average ultrasound test value and age. Usually, ultrasound values become lower with increasing age.
PSA test values (for prostate cancer) increase with increasing age; BNP test values (for congestive heart failure) increase with increasing age.
10
This is a typical picture of the data (ultrasound test for the bone status).
I. The age distributions for Diseased and Non-Diseased subjects are different.
II. The values of the New Test depend on age.
Prostate cancer is more prevalent in older men; congestive heart failure is more prevalent in older people.
11
PROBLEM: Naïve estimation of AUC is biased (usually overstated). Indeed, the Wilcoxon-Mann-Whitney statistic is
\[ \widehat{AUC} = \frac{1}{MN}\sum_{k=1}^{3}\sum_{s=1}^{3}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\Psi(X_{k,i}, Y_{s,j}) \]
where Ψ(A,B) = 1 if A>B; ½ if A=B; 0 if A<B. Its expected value is
\[ E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3} p_k\, q_s\, AUC_{k,s} = p^T AUC\, q \]
where
\[ AUC_{k,s} = P\{X_k > Y_s\} = \int G_s(x)\, dF_k(x) \]
is the area under the ROC curve when the Diseased subjects are Agek years old and the Non-Diseased subjects are Ages years old.
12
Example.
X1, Y1 ~ N(1, 1/4)
X2, Y2 ~ N(2, 1/4)
X3, Y3 ~ N(3, 1/4)
New Test does not have diagnostic ability: it cannot discriminate Diseased and Non-Diseased subjects within any age group.
AUC matrix (rows: Diseased; columns: Non-diseased):
                  Non-diseased
                  Age1   Age2   Age3
Diseased  Age1    0.50   0.16   0.02
          Age2    0.84   0.50   0.16
          Age3    0.98   0.84   0.50
The age distribution of the Diseased subjects is pT=(0.21; 0.38; 0.41); the age distribution of the Non-Diseased subjects is qT=(0.59; 0.28; 0.13).
The two groups, Diseased and Non-Diseased, appear different with respect to the values of the New Test.
13
Example (continued).
AUC matrix (rows: Diseased; columns: Non-diseased):
                  Non-diseased
                  Age1   Age2   Age3
Diseased  Age1    0.50   0.16   0.02
          Age2    0.84   0.50   0.16
          Age3    0.98   0.84   0.50
If the age distribution of the Diseased subjects is pT=(0.21; 0.38; 0.41) and the age distribution of the Non-Diseased subjects is qT=(0.59; 0.28; 0.13), then the mean value of the Wilcoxon-Mann-Whitney statistic, pTAUCq, is 0.68.
The matrix element AUC3,1=0.98, which corresponds to the largest age group of Diseased subjects (p3=0.41) and the largest age group of Non-Diseased subjects (q1=0.59), makes the largest contribution to the bilinear form pTAUCq.
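The value 0.68 can be reproduced by evaluating the bilinear form pTAUCq directly from the matrix and age distributions given in this example. The sketch below is only an arithmetic check; the helper name `bilinear` is hypothetical.

```python
# AUC matrix from the example: rows = Diseased age group, columns = Non-Diseased age group.
AUC = [
    [0.50, 0.16, 0.02],
    [0.84, 0.50, 0.16],
    [0.98, 0.84, 0.50],
]
p = [0.21, 0.38, 0.41]   # age distribution of Diseased subjects
q = [0.59, 0.28, 0.13]   # age distribution of Non-Diseased subjects


def bilinear(u, matrix, v):
    """Compute u^T * matrix * v for plain Python lists."""
    return sum(u[k] * matrix[k][s] * v[s]
               for k in range(len(u)) for s in range(len(v)))


naive_expected_auc = bilinear(p, AUC, q)

# Contribution of each (Diseased age, Non-Diseased age) cell to the bilinear form.
contributions = {(k, s): p[k] * AUC[k][s] * q[s]
                 for k in range(3) for s in range(3)}
largest_cell = max(contributions, key=contributions.get)

print(round(naive_expected_auc, 2))   # 0.68
print(largest_cell)                   # (2, 0), i.e. AUC3,1
```

Although every diagonal entry is 0.50 (no diagnostic ability within any age group), the unequal age distributions push the expected naïve estimate to about 0.68.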
14
Adjustments for one covariate
Three common methods of adjusting for
one confounding covariate:
– Matching
– Stratification
– Covariate adjustment through logistic regression
15
Matching
Matching of Diseased and Non-Diseased subjects means that the age distributions of these subjects are the same. Let the Diseased and Non-Diseased subjects be matched with common age distribution φT = (φ1, φ2, φ3). Then
\[ E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3} \varphi_k\, \varphi_s\, AUC_{k,s} = \varphi^T AUC\, \varphi \]
Theorem. Suppose a New Test cannot discriminate the Diseased and Non-Diseased populations within any age group. Then the expected value of the Mann-Whitney statistic is 0.5 for any age distribution in the age-matched samples of Diseased and Non-Diseased subjects.
The Wilcoxon-Mann-Whitney statistic correctly evaluates the test performance (area under the ROC curve) only for age-matched samples.
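The theorem can be checked numerically on the earlier example in which the New Test has no diagnostic ability within any age group. When Fk = Gk for every k and the distributions are continuous, AUCk,s + AUCs,k = 1, so the quadratic form φTAUCφ equals ½(Σφk)² = 0.5 for any φ. The sketch below evaluates the form for a few arbitrary, hypothetical age distributions.

```python
# AUC matrix of the "no diagnostic ability" example: AUC[k][s] + AUC[s][k] == 1.
AUC = [
    [0.50, 0.16, 0.02],
    [0.84, 0.50, 0.16],
    [0.98, 0.84, 0.50],
]


def matched_expected_auc(phi, matrix):
    """Expected Mann-Whitney statistic phi^T AUC phi under age matching."""
    n = len(phi)
    return sum(phi[k] * matrix[k][s] * phi[s] for k in range(n) for s in range(n))


# Arbitrary common age distributions, chosen only for illustration.
candidates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]
values = [matched_expected_auc(phi, AUC) for phi in candidates]
print([round(v, 6) for v in values])
```

Each value is 0.5, in contrast to the 0.68 obtained for the unmatched age distributions p and q.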
16
Matching (continued)
By matching, we create a “quasi-randomized” experiment. That is, if we find two subjects, one in the Diseased and one in the Non-Diseased group, with the same pre-test risk of Disease (same age), then we can imagine that there was one subject for whom the value of the New Test was observed both when this subject was Diseased and when this subject was Non-Diseased. The age-matched study groups are similar with respect to Age (AUC for the covariate Age is exactly 0.5). Then we are sure that the difference in the New Test distributions for the Diseased and Non-Diseased groups is not due to the difference in age.
Problem: The data of unmatched subjects are not used in the AUC estimate. To use them, weighted ROC analysis should be applied.
17
Weighted ROC Analysis
Data set: Diseased and Non-Diseased subjects are not age-matched. We want these two samples to be age-matched with the common age distribution φ, where φk = dk/D, dk = min(mk, nk), and D = d1 + d2 + d3.
        Age distribution   Diseased                      Non-diseased
        for matching
Age1    d1=3               m1=3: X1,1 X1,2 X1,3          n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5
Age2    d2=3               m2=4: X2,1 X2,2 X2,3 X2,4     n2=3: Y2,1 Y2,2 Y2,3
Age3    d3=1               m3=2: X3,1 X3,2               n3=1: Y3,1
18
Weighted ROC Analysis (continued)
For each age Agek, we can take
• some set of size dk of the mk Diseased subjects; there are \binom{m_k}{d_k} different variants;
• some set of size dk of the nk Non-Diseased subjects; there are \binom{n_k}{d_k} different variants.
For Age1, 10 variants; for Age2, 4 variants; for Age3, 2 variants. Total number of different matched sets: 80 (= 10 x 4 x 2).
Using a particular age-matched set of D Diseased and D Non-Diseased subjects, we can estimate the age-matched AUC using the Wilcoxon statistic.
Then we consider all possible matched sets, estimate AUC for each set, and take the average of AUC over all these sets.
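The count of matched sets can be reproduced with binomial coefficients. The sketch below uses the stratum sizes from the example (m = 3, 4, 2 and n = 5, 3, 1); the variable names are illustrative.

```python
from math import comb, prod

m = [3, 4, 2]   # Diseased subjects per age group
n = [5, 3, 1]   # Non-Diseased subjects per age group

# Matched size per age group and the common age distribution phi_k = d_k / D.
d = [min(mk, nk) for mk, nk in zip(m, n)]
D = sum(d)
phi = [dk / D for dk in d]

# Ways to choose d_k Diseased and d_k Non-Diseased subjects in each age group.
variants = [comb(mk, dk) * comb(nk, dk) for mk, nk, dk in zip(m, n, d)]
total_matched_sets = prod(variants)

print(variants)             # [10, 4, 2]
print(total_matched_sets)   # 80
```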
19
Weighted ROC Analysis (continued)
This is equivalent to calculating AUC with all M Diseased subjects with weights dk/mk and with all N Non-Diseased subjects with weights dk/nk:
\[ \widehat{AUC}_{weighted} = \frac{1}{D^2}\sum_{k=1}^{K}\sum_{s=1}^{K}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\frac{d_k}{m_k}\,\frac{d_s}{n_s}\,\Psi(X_{k,i}, Y_{s,j}) \]
The weighted ROC analysis is equivalent to consideration of all possible variants of age-matching with common age distribution φ.
Also, the weighted estimate of AUC can be obtained using the bootstrap technique.
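The equivalence between the weighted AUC and the average over all matched sets can be checked by brute force on a small data set. The sketch below uses the stratum sizes of the example (3/5, 4/3, 2/1), but the test values themselves are hypothetical, and the weighted estimator is written under the assumption that each Diseased subject in age group k gets weight dk/mk and each Non-Diseased subject in age group s gets weight ds/ns.

```python
from itertools import combinations, product


def psi(a, b):
    """Pair kernel: 1 if a > b, 1/2 if a == b, 0 if a < b."""
    return 1.0 if a > b else (0.5 if a == b else 0.0)


def weighted_auc(x_by_age, y_by_age):
    """Weighted AUC using all subjects, weights d_k/m_k and d_s/n_s."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(x_by_age, y_by_age)]
    total = 0.0
    for k, xs in enumerate(x_by_age):
        for s, ys in enumerate(y_by_age):
            weight = (d[k] / len(xs)) * (d[s] / len(ys))
            total += weight * sum(psi(x, y) for x in xs for y in ys)
    return total / sum(d) ** 2


def average_over_matched_sets(x_by_age, y_by_age):
    """Average Wilcoxon AUC over every possible age-matched subset."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(x_by_age, y_by_age)]
    big_d = sum(d)
    x_choices = [list(combinations(xs, dk)) for xs, dk in zip(x_by_age, d)]
    y_choices = [list(combinations(ys, dk)) for ys, dk in zip(y_by_age, d)]
    aucs = []
    for x_sel in product(*x_choices):
        xs = [x for group in x_sel for x in group]
        for y_sel in product(*y_choices):
            ys = [y for group in y_sel for y in group]
            aucs.append(sum(psi(x, y) for x in xs for y in ys) / big_d ** 2)
    return sum(aucs) / len(aucs), len(aucs)


# Hypothetical test values per age group (Age1, Age2, Age3).
x_by_age = [[1.1, 1.4, 0.9], [2.0, 2.6, 1.8, 2.2], [3.1, 2.9]]
y_by_age = [[0.8, 1.0, 1.3, 1.2, 0.7], [1.9, 2.1, 2.4], [3.0]]

auc_w = weighted_auc(x_by_age, y_by_age)
auc_avg, n_sets = average_over_matched_sets(x_by_age, y_by_age)
print(n_sets, round(auc_w, 6), round(auc_avg, 6))
```

The enumeration visits all 80 matched sets, and the two estimates coincide up to floating-point error, because each subject is included in a randomly chosen matched set with probability equal to its weight.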
20
Weighted ROC Analysis (continued)
        Age distribution   Diseased (weight per subject)          Non-diseased (weight per subject)
        for matching
Age1    d1=3               m1=3: X1,1 X1,2 X1,3 (1 each)          n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5 (3/5 each)
Age2    d2=3               m2=4: X2,1 X2,2 X2,3 X2,4 (3/4 each)   n2=3: Y2,1 Y2,2 Y2,3 (1 each)
Age3    d3=1               m3=2: X3,1 X3,2 (1/2 each)             n3=1: Y3,1 (1)
21
Weighted ROC Analysis (continued)
The weighted AUC is an unbiased estimate of the φ-age-matched AUC:
\[ E[\widehat{AUC}_{weighted}] = E[\widehat{AUC}_{matched}] = \varphi^T AUC\, \varphi \]
The variance of the weighted estimate, var(AUCweighted), is the sum of a triple sum over age groups k, s, t = 1, …, K of terms with components indexed 10 and 01, and a double sum over k, s = 1, …, K of terms with components indexed 11, 10 and 01, with coefficients built from dk, mk, nk and D [the full expression is not recoverable from this transcript].
If dk ≤ min(mk, nk) (all weights are at most 1), then this variance is smaller than the variance for one matched set.
22
Stratification
Strata are defined, and Diseased and Non-Diseased subjects who are in the same stratum are compared.
        Diseased                      Non-diseased                       Stratum AUC
Age1    m1=3: X1,1 X1,2 X1,3          n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5     AUC1,1
Age2    m2=4: X2,1 X2,2 X2,3 X2,4     n2=3: Y2,1 Y2,2 Y2,3               AUC2,2
Age3    m3=2: X3,1 X3,2               n3=1: Y3,1                         AUC3,3
23
Stratification (continued)
The overall diagnostic accuracy of the New Test can be summarized by a weighted average of AUC1,1, AUC2,2, and AUC3,3.
We can consider the linear combination
\[ \sum_{k=1}^{3} \varphi_k\, AUC_{k,k} \]
where φ is the same as in matching, φk = dk/D (dk = min(mk, nk)).
If AUC1,1=AUC2,2=AUC3,3=AUC, then the weights φk are similar to the weights inversely proportional to the variances of the stratum AUCs.
Is there a relationship between AUC by matching,
\[ \varphi^T AUC\, \varphi = \sum_{i=1}^{3}\sum_{j=1}^{3} \varphi_i\, \varphi_j\, AUC_{i,j}, \]
and AUC by stratification,
\[ \sum_{k=1}^{3} \varphi_k\, AUC_{k,k}\ ? \]
24
Example. New Test = Ultrasound test for bone status. The results of the ultrasound test are normal variables with means that differ by age and with the same standard deviation of 130 m/sec.
        Means for Diseased (m/sec)   Means for Non-Diseased (m/sec)
Age1    4,005                        4,027
Age2    3,904                        3,953
Age3    3,885                        3,942
Matrix AUC:
0.55   0.39   0.37
0.75   0.65   0.58
0.83   0.70   0.68
φT = (0.2; 0.5; 0.3)
AUC by matching: φTAUCφ = 0.624
AUC by stratification: \sum_{k=1}^{3} \varphi_k AUC_{k,k} = 0.639
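Both summary values can be recomputed from the AUC matrix and φ given in this example. The sketch below is an arithmetic check only; it takes the matrix entries as printed rather than deriving them from the normal means.

```python
# AUC matrix from the ultrasound example: rows = Diseased age, columns = Non-Diseased age.
AUC = [
    [0.55, 0.39, 0.37],
    [0.75, 0.65, 0.58],
    [0.83, 0.70, 0.68],
]
phi = [0.2, 0.5, 0.3]   # common age distribution used for matching

# AUC by matching: quadratic form phi^T AUC phi (uses every cell of the matrix).
auc_matching = sum(phi[k] * AUC[k][s] * phi[s]
                   for k in range(3) for s in range(3))

# AUC by stratification: weighted average of the diagonal (same-age) cells only.
auc_stratification = sum(phi[k] * AUC[k][k] for k in range(3))

print(round(auc_matching, 3))         # 0.624
print(round(auc_stratification, 3))   # 0.639
```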
25
Relationship between AUC by matching and AUC by stratification
Theorem. Let φT=(φ1, φ2, φ3) be the age distribution in the age-matched Diseased and Non-Diseased groups. Then
\[ \sum_{k=1}^{3} \varphi_k\, AUC_{k,k} = \varphi^T AUC\, \varphi + \varphi^T \Delta\, \varphi, \]
where the matrix Δ is a symmetric matrix with elements
\[ \Delta_{k,s} = \Delta_{s,k} = (AUC_{k,k} + AUC_{s,s} - AUC_{k,s} - AUC_{s,k}) / 2. \]
Matrix Δ from the previous Example:
0       0.030   0.015
0.030   0       0.025
0.015   0.025   0
For a broad class of distributions,
\[ \varphi^T AUC\, \varphi \;\le\; \sum_{k=1}^{3} \varphi_k\, AUC_{k,k}, \]
that is, AUC by matching ≤ AUC by stratification.
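The decomposition in the theorem can be verified numerically on the ultrasound example: the matrix Δ is built from the AUC matrix, and the stratified AUC is compared with the matched AUC plus the quadratic form in Δ. The helper name `quad_form` is hypothetical.

```python
AUC = [
    [0.55, 0.39, 0.37],
    [0.75, 0.65, 0.58],
    [0.83, 0.70, 0.68],
]
phi = [0.2, 0.5, 0.3]
K = len(phi)

# Delta[k][s] = (AUC[k][k] + AUC[s][s] - AUC[k][s] - AUC[s][k]) / 2, symmetric, zero diagonal.
delta = [[(AUC[k][k] + AUC[s][s] - AUC[k][s] - AUC[s][k]) / 2 for s in range(K)]
         for k in range(K)]


def quad_form(v, matrix):
    """Compute v^T * matrix * v for plain Python lists."""
    return sum(v[k] * matrix[k][s] * v[s]
               for k in range(len(v)) for s in range(len(v)))


auc_matching = quad_form(phi, AUC)
auc_stratification = sum(phi[k] * AUC[k][k] for k in range(K))
gap = quad_form(phi, delta)

print([[round(x, 3) for x in row] for row in delta])
print(round(auc_matching + gap, 3), round(auc_stratification, 3))
```

In this example every element of Δ is non-negative, so the gap φTΔφ is non-negative and the matched AUC does not exceed the stratified AUC.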
26
Covariates (C1, C2, …, CL)
3km 5kn
Matching based on many covariates is difficult.
Stratification: As the number of covariates increases, the number of strata grows exponentially.
27
Propensity Scores
Replace the collection of confounding covariates with one scalar function of these covariates: the propensity score.
Propensity score (PS): the conditional probability of being in the Diseased group rather than the Non-Diseased group, given a collection of observed covariates.
PS (C1, C2, …, CL) = Pr (Disease | C1, C2, …, CL).
Propensity Score = Pre-test risk of Disease given a collection of covariates, C1, C2, …, CL.
28
Construction of propensity score (pre-test risk)
• Logistic regression or other models (neural networks, ...).
• Outcome: Disease – 1, Non-Disease – 0.
• Predictors: all measured covariates, some interaction terms or squared terms, and so on.
• New Test is not included.
• AUC for the combined covariates is a measure of covariate imbalance.
The distributions of the X and Y variables, the values of a New Test for the Diseased and Non-Diseased groups, depend on the covariates, but this dependence is approximated well through the pre-test risk:
F (x, C1, C2, …, CL) = F (x, PS(C1, C2, …, CL));
G (y, C1, C2, …, CL) = G (y, PS(C1, C2, …, CL)).
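A propensity score model of this kind might be fitted as in the minimal sketch below. It uses a plain gradient-ascent logistic regression written in standard-library Python rather than any particular statistics package, and the data (ages, BMI values, disease status) are simulated purely for illustration; none of the names or numbers come from the slides. The New Test is deliberately left out of the predictors.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def standardize(column):
    """Center and scale one covariate so gradient ascent behaves well."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def fit_logistic(rows, labels, steps=500, lr=0.5):
    """Gradient ascent on the logistic log-likelihood (intercept + slopes)."""
    beta = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, label in zip(rows, labels):
            pred = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))
            err = label - pred
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / len(rows) for b, g in zip(beta, grad)]
    return beta


# Simulated covariates and disease status (illustrative assumption only).
random.seed(0)
ages = [random.uniform(40, 80) for _ in range(200)]
bmis = [random.uniform(18, 35) for _ in range(200)]
disease = [1 if random.random() < sigmoid(0.08 * (a - 60) - 0.1 * (b - 26)) else 0
           for a, b in zip(ages, bmis)]

# Predictors: covariates only; the New Test is not included.
rows = list(zip(standardize(ages), standardize(bmis)))
beta = fit_logistic(rows, disease)

# Estimated propensity score (pre-test risk of Disease) for every subject.
ps = [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row))) for row in rows]
```

The resulting `ps` values play the role of the single scalar covariate on which subjects are then matched or stratified.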
29
Propensity Scores (continued)
• Calculate estimated propensity scores (pre-test risk) for all subjects using the propensity score model.
• Sort all subjects by propensity scores.
• Divide subjects into strata that have similar PS.
• Estimate AUC by matching (use weighted AUC) or AUC by stratification.
[Figure: Age versus BMI plot with mk Diseased and nk Non-Diseased subjects per stratum. Five strata based on logistic regression model of age and BMI (linear terms).]
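The steps above can be sketched as follows, assuming each subject is represented by a hypothetical record (propensity score, New Test value, disease flag). Subjects are sorted by propensity score, cut into equal-sized strata, and the AUC is then estimated both by stratification and by matching (the weighted form). The tiny data set and the choice of two strata are illustrative only.

```python
def psi(a, b):
    """Pair kernel: 1 if a > b, 1/2 if a == b, 0 if a < b."""
    return 1.0 if a > b else (0.5 if a == b else 0.0)


def pair_auc(xs, ys):
    """Empirical P{X > Y} between two groups of test values."""
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def strata_by_propensity(records, n_strata):
    """Sort by propensity score and split into n_strata equal-sized strata.

    Returns a list of (diseased_values, non_diseased_values) per stratum;
    strata missing one of the two groups are dropped because they cannot be matched.
    """
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered) / n_strata
    strata = []
    for k in range(n_strata):
        chunk = ordered[round(k * size):round((k + 1) * size)]
        xs = [value for _, value, flag in chunk if flag == 1]
        ys = [value for _, value, flag in chunk if flag == 0]
        if xs and ys:
            strata.append((xs, ys))
    return strata


def stratified_and_matched_auc(strata):
    d = [min(len(xs), len(ys)) for xs, ys in strata]
    phi = [dk / sum(d) for dk in d]
    n = len(strata)
    stratified = sum(phi[k] * pair_auc(*strata[k]) for k in range(n))
    matched = sum(phi[k] * phi[s] * pair_auc(strata[k][0], strata[s][1])
                  for k in range(n) for s in range(n))
    return stratified, matched


# Hypothetical records: (propensity score, New Test value, 1 = Diseased / 0 = Non-Diseased).
records = [
    (0.1, 1.0, 0), (0.2, 1.1, 1), (0.3, 0.8, 0), (0.4, 1.2, 0),
    (0.6, 2.5, 1), (0.7, 2.0, 0), (0.8, 1.9, 1), (0.9, 3.0, 1),
]
strata = strata_by_propensity(records, n_strata=2)
auc_stratified, auc_matched = stratified_and_matched_auc(strata)

naive = pair_auc([v for _, v, f in records if f == 1],
                 [v for _, v, f in records if f == 0])
print(round(naive, 4), round(auc_stratified, 4), round(auc_matched, 4))
```

In this toy data the naïve AUC is the largest of the three, because Diseased subjects are concentrated in the high-propensity stratum where test values are also higher.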
30
Propensity Scores (continued). Example: conjunction of a New Test with other diagnostic tests
A New Test is used in conjunction with other clinical tests to detect the clinical state “Disease”. The propensity score technique is a convenient tool for matching based on all available prior information (covariates) about the subjects.
Example: “Disease” = any stenosis during coronary angiography; New Test; C1 = Age; C2 = Gender; C3 = Total cholesterol; C4 = HDL (“good” cholesterol); C5 = LDL (“bad” cholesterol).
In order to correctly evaluate the diagnostic ability of a New Test, matched AUC analysis should be performed. Matching based on the propensity score is recommended.
31
Use of matched ROC analysis when New Test results do not depend on the covariates.
Suppose the distribution of the New Test results is the same in each stratum (F1=F2=F3=F, G1=G2=G3=G), but we have no information about that and use the matched ROC analysis. How is the matched estimate of AUC related to the usual empirical estimate?
Theorem. The matched estimate of the AUC is an unbiased estimate of AUC, but the variance of the matched estimate is inflated.
The proof is based on Hölder’s inequality (see [1]).
32
Summary
• If the results of a New Test depend on covariates and the distributions of covariates in the Diseased and Non-Diseased groups are different, then only matched ROC analysis correctly evaluates the diagnostic accuracy of the New Test.
• Matching based on propensity scores (pre-test risk of Disease) reduces bias.
• The propensity score is seriously degraded when important covariates influencing pre-test risk have not been collected.
• Weighted ROC analysis makes more effective use of all the data.
33
References
1. Kondratovich, Marina V. (2000). Methodology of removing the effect of confounding variables in receiver operating characteristic (ROC) analysis. Proceedings of the 2000 Joint Statistical Meeting, Biopharmaceutical Section, Indianapolis, IN.
2. Kondratovich, Marina V. (2002). Matched receiver operating characteristic (ROC) analysis and propensity scores. Proceedings of the 2002 Joint Statistical Meeting, Biopharmaceutical Section, New York, NY.
3. Zweig, M.H. and Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, p. 561-577.
The propensity score technique is well developed in the context of observational studies and studies of therapeutic devices. In the context of diagnostic studies, however, there have been few papers.
34
References for the propensity scores technique
• Rubin, DB, Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127:757-763
• Grunkemeier, GL, et al., Propensity score analysis of stroke after off-pump coronary artery bypass grafting, Ann Thorac Surg 2002; 74:301-305
• Winkelmayer, WC, et al., Comparing mortality of elderly patients on hemodialysis versus peritoneal dialysis: A propensity score approach, J Am Soc Nephrol 2002; 13:2353-2362
• Rosenbaum, PR, Rubin, DB, Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:516-524
• Blackstone, EH, Comparing apples and oranges, J. Thoracic and Cardiovascular Surgery, January 2002; 1:8-15
• D'Agostino, RB, Jr., Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group, Statistics in Medicine 1998; 17:2265-2281