Biostat Methods STAT 5820/6910 – Handout #5: Logistic Regression
(with Overdispersion, Separation of Points, and Inverse Interval Estimation)
Example 1: 102 patients with acute myelogenous leukemia (AML) in remission were
enrolled in a study of a new anti-relapse treatment (ACT). Patients were randomly
assigned to receive a 10-day infusion of ACT or a placebo (PBO) and were followed
for 90 days. The outcome of interest was whether a patient suffered a major 'relapse'
event during the 90 days, defined as relapse, death, or a major intervention such as
bone marrow transplant. The time in remission since diagnosis or prior relapse ('x',
in months) at study enrollment was considered an important covariate in predicting
relapse. Is there any evidence that ACT leads to a decreased relapse rate compared
to PBO?
                         Relapse (y)
Treatment (trt)       No (0)    Yes (1)
  PBO (0)               20         30
  ACT (1)               29         23
/* Define options */
ods html image_dpi=300 style=journal;
data aml; input group $ x relapse $ @@;
trt = (group='ACT');
y = (relapse='Y');
label x = 'Months in Remission';
cards;
ACT 3 N ACT 3 Y ACT 3 Y ACT 6 Y ACT 15 N ACT 6 Y
ACT 6 Y ACT 6 Y ACT 15 N ACT 15 N ACT 12 N ACT 18 N
ACT 6 Y ACT 15 N ACT 6 Y ACT 15 N ACT 12 Y ACT 9 N
ACT 6 Y ACT 6 N ACT 6 N ACT 6 N ACT 3 Y ACT 18 N
ACT 9 N ACT 12 Y ACT 6 N ACT 9 Y ACT 9 Y ACT 3 N
ACT 9 Y ACT 12 N ACT 12 N ACT 3 N ACT 12 N ACT 12 N
ACT 12 N ACT 9 Y ACT 6 Y ACT 12 N ACT 6 N ACT 15 Y
ACT 9 N ACT 3 Y ACT 9 N ACT 9 N ACT 9 N ACT 9 N
ACT 9 Y ACT 12 Y ACT 3 Y ACT 6 Y PBO 9 Y PBO 3 N
PBO 12 Y PBO 3 Y PBO 3 Y PBO 15 Y PBO 9 Y PBO 12 Y
PBO 3 Y PBO 9 Y PBO 15 Y PBO 9 Y PBO 6 Y PBO 9 Y
PBO 6 Y PBO 12 N PBO 9 N PBO 15 N PBO 15 Y PBO 9 N
PBO 9 N PBO 12 Y PBO 3 Y PBO 6 Y PBO 6 Y PBO 12 N
PBO 12 N PBO 12 Y PBO 3 Y PBO 12 Y PBO 3 Y PBO 12 Y
PBO 6 Y PBO 6 Y PBO 9 Y PBO 15 N PBO 15 N PBO 12 N
PBO 9 N PBO 12 N PBO 15 N PBO 18 Y PBO 12 N PBO 15 Y
PBO 15 N PBO 15 N PBO 18 N PBO 18 Y PBO 18 N PBO 18 N
;
/* Run usual chi-square test */
proc freq data=aml;
tables trt*y / chisq nopercent nocol;
title1 'Chi-square test of association';
title2 '(ignoring covariate)';
run;
Chi-square test of association
(ignoring covariate)
The FREQ Procedure
Table of trt by y
(cell entries: Frequency, Row Pct)

trt           y=0        y=1      Total
0              20         30         50
            40.00      60.00
1              29         23         52
            55.77      44.23
Total          49         53        102
Statistics for Table of trt by y
Statistic                      DF      Value      Prob
Chi-Square                      1     2.5394    0.1110
Likelihood Ratio Chi-Square     1     2.5505    0.1103
Continuity Adj. Chi-Square      1     1.9469    0.1629
Mantel-Haenszel Chi-Square      1     2.5145    0.1128
Phi Coefficient                      -0.1578
Contingency Coefficient               0.1559
Cramer's V                           -0.1578
Fisher's Exact Test
Cell (1,1) Frequency (F) 20
Left-sided Pr <= F 0.0813
Right-sided Pr >= F 0.9637
Table Probability (P) 0.0450
Two-sided Pr <= P 0.1189
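The Pearson chi-square reported above can be checked by hand from the four cell counts. The DATA step below is a minimal sketch using the shortcut formula for a 2x2 table; the data set name chisq_check and the variable names a, b, c, d are illustrative and not part of the original handout.

```sas
/* Sketch: Pearson chi-square for the trt-by-y table, computed by hand.
   Names (chisq_check, a, b, c, d) are illustrative assumptions. */
data chisq_check;
   a = 20; b = 30;   /* PBO: y=0, y=1 */
   c = 29; d = 23;   /* ACT: y=0, y=1 */
   n = a + b + c + d;
   /* Shortcut formula for a 2x2 table: n(ad - bc)^2 / (product of margins) */
   chisq = n * (a*d - b*c)**2 / ((a+b) * (c+d) * (a+c) * (b+d));
   p = 1 - probchi(chisq, 1);   /* upper-tail probability, 1 df */
   put chisq= 8.4 p= 8.4;
run;
```

The values written to the log should agree with the Chi-Square row of the PROC FREQ output (2.5394 with p = 0.1110).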
/* Do equivalent test in logistic regression */
proc logistic data=aml;
model y(event='1') = trt;
title1 'Logistic regression';
title2 '(ignoring covariate)';
run;
Logistic regression
(ignoring covariate)
Response Profile
Ordered Value    y    Total Frequency
            1    0                 49
            2    1                 53
Probability modeled is y=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
             Intercept    Intercept and
Criterion         Only       Covariates
AIC            143.245          142.695
SC             145.870          147.945
-2 Log L       141.245          138.695
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 2.5505 1 0.1103
Score 2.5394 1 0.1110
Wald 2.5178 1 0.1126
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1     0.4055      0.2887       1.9728       0.1602
trt           1    -0.6373      0.4016       2.5178       0.1126
Odds Ratio Estimates
                             95% Wald
Effect    Point Estimate     Confidence Limits
trt                0.529      0.241      1.162
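With a single binary predictor, the logistic regression estimates can be recovered directly from the 2x2 counts, which shows why this model is equivalent to the chi-square analysis. The DATA step below is a sketch under that assumption; or_check and its variable names are illustrative.

```sas
/* Sketch: odds ratio, log odds ratio, and 95% Wald limits from the 2x2 counts.
   Names (or_check, a, b, c, d) are illustrative assumptions. */
data or_check;
   a = 20; b = 30;   /* PBO: y=0, y=1 */
   c = 29; d = 23;   /* ACT: y=0, y=1 */
   or_hat = (d/c) / (b/a);               /* odds of relapse, ACT vs PBO */
   beta   = log(or_hat);                 /* slope for trt */
   se     = sqrt(1/a + 1/b + 1/c + 1/d); /* large-sample SE of the log odds ratio */
   z      = quantile('NORMAL', 0.975);
   lower  = exp(beta - z*se);
   upper  = exp(beta + z*se);
   put or_hat= 6.3 beta= 8.4 se= 8.4 lower= 6.3 upper= 6.3;
run;
```

The log should show an odds ratio of 0.529 with limits 0.241 and 1.162, a slope of -0.6373, and a standard error of 0.4016, matching the PROC LOGISTIC output.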
/* Fit logistic regression model with covariate */
proc logistic data=aml plots(only)=roc;
model y(event='1') = trt x ;
title1 'Logistic regression';
title2 '(accounting for covariate)';
run;
Logistic regression
(accounting for covariate)
Response Profile
Ordered Value    y    Total Frequency
            1    0                 49
            2    1                 53
Probability modeled is y=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
             Intercept    Intercept and
Criterion         Only       Covariates
AIC            143.245          129.376
SC             145.870          137.251
-2 Log L       141.245          123.376
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 17.8687 2 0.0001
Score 16.4848 2 0.0003
Wald 14.0612 2 0.0009
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1     2.6135      0.7149      13.3662       0.0003
trt           1    -1.1191      0.4669       5.7446       0.0165
x             1    -0.1998      0.0560      12.7187       0.0004
Odds Ratio Estimates
                             95% Wald
Effect    Point Estimate     Confidence Limits
trt                0.327      0.131      0.815
x                  0.819      0.734      0.914
Association of Predicted Probabilities and
Observed Responses
Percent Concordant 68.5 Somers' D 0.454
Percent Discordant 23.1 Gamma 0.496
Percent Tied 8.4 Tau-a 0.229
Pairs 2597 c 0.727
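To see what the fitted model with the covariate implies, the estimated relapse probability can be computed at each observed remission time for both treatments. The DATA step below is a sketch that plugs in the rounded estimates from the output above; pred_check and its variable names are illustrative.

```sas
/* Sketch: predicted relapse probabilities from the fitted model
   logit(p) = 2.6135 - 1.1191*trt - 0.1998*x.
   Names (pred_check, b0, b_trt, b_x) are illustrative assumptions. */
data pred_check;
   b0 = 2.6135; b_trt = -1.1191; b_x = -0.1998;
   do trt = 0 to 1;
      do x = 3 to 18 by 3;
         eta  = b0 + b_trt*trt + b_x*x;   /* linear predictor (log odds) */
         phat = 1 / (1 + exp(-eta));      /* inverse logit */
         output;
      end;
   end;
   keep trt x eta phat;
run;
proc print data=pred_check noobs;
   format eta phat 6.3;
run;
```

At x = 9 months, for example, the predicted relapse probability is roughly 0.69 under PBO and roughly 0.42 under ACT.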
/* Fit equivalent logistic regression model,
and look at 'dose-response' curves
for each level of group variable */
proc logistic data=aml;
class group;
model y(event='1') = group x ;
effectplot fit(plotby=group x=x);
title1 'Logistic regression';
title2 '(with dose-response curve)';
run;
Logistic regression
(with dose-response curve)
Analysis of Maximum Likelihood Estimates
                                Standard         Wald
Parameter      DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept       1     2.0539      0.5967      11.8477       0.0006
group ACT       1    -0.5595      0.2335       5.7446       0.0165
x               1    -0.1998      0.0560      12.7187       0.0004
Odds Ratio Estimates
                                    95% Wald
Effect              Point Estimate  Confidence Limits
group ACT vs PBO             0.327   0.131      0.815
x                            0.819   0.734      0.914
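The group estimate here (-0.5595) is half of the trt estimate from the earlier fit (-1.1191) because the CLASS statement in PROC LOGISTIC uses effect coding by default (ACT = +1, PBO = -1), which also shifts the intercept to 2.6135 - 0.5595. The odds ratio table is unchanged. The sketch below requests reference coding instead, with PBO as the reference level, so that the parameter estimates should match the trt = 0/1 fit.

```sas
/* Sketch: same model with reference (0/1) coding for the CLASS variable */
proc logistic data=aml;
   class group (ref='PBO') / param=ref;
   model y(event='1') = group x;
   title1 'Logistic regression';
   title2 '(reference coding for group)';
run;
```

Under reference coding the group ACT row should show an estimate of -1.1191 with standard error 0.4669, and the Wald chi-square and p-value are the same under either coding.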
/***************************************/
/* Look at inverse interval estimation */
/***************************************/
/* First get 'weighted' version of data */
proc sort data=aml; by trt x;
proc means data=aml sum n noprint;
by trt x;
var y;
output out=out1 n=total sum=resp;
proc print data=out1;
title1 'Weighted version of AML data';
run;
Weighted version of AML data
Obs trt x _TYPE_ _FREQ_ total resp
1 0 3 0 7 7 6
2 0 6 0 6 6 6
3 0 9 0 10 10 6
4 0 12 0 12 12 6
5 0 15 0 10 10 4
6 0 18 0 5 5 2
7 1 3 0 8 8 5
8 1 6 0 14 14 9
9 1 9 0 12 12 5
10 1 12 0 10 10 3
11 1 15 0 6 6 1
12 1 18 0 2 2 0
/* Get 'weighted' data in order with trt=1 first
-- that way the ORDER=DATA option in PROC PROBIT
will make trt=1 be the indicated factor level
since it will occur first in the data set. */
proc sort data=out1; by descending trt;
run;
/* Get (and plot) inverse intervals for response probabilities
when trt=0 [need to give an x-level (6 here), but not used] */
data trt0; input trt x ;
cards;
0 6
;
proc probit data=out1 order=data xdata=trt0 plot=ippplot;
class trt;
model resp/total = trt x / d=logistic inversecl lackfit
covb;
/* NOTE: Put the 'dose' variable as the first continuous
(or non-CLASS) variable in the MODEL statement.
INVERSECL applies to first continuous predictor;
all predictors must have levels set in XDATA set */
title1 'Inverse Interval Estimation (trt=0; PBO)';
run;
Inverse Interval Estimation
(trt=0; PBO)
The Probit Procedure
Model Information
Data Set                  WORK.OUT1
Events Variable           resp
Trials Variable           total
Number of Observations    12
Number of Events          53
Number of Trials          102
Name of Distribution      Logistic
Log Likelihood            -61.68822985
Number of Observations Read 12
Number of Observations Used 12
Number of Events 53
Number of Trials 102
Algorithm converged.
Goodness-of-Fit Tests
Statistic              Value   DF   Value/DF   Pr > ChiSq
Pearson Chi-Square    3.2825    9     0.3647       0.9520
L.R. Chi-Square       4.5899    9     0.5100       0.8685
Note: Since the Pearson Chi-Square is small (p > 0.1000), fiducial limits will be calculated using a z value of 1.96.
Response-Covariate Profile
Response Levels 2
Number of Covariate Values 12
Type III Analysis of Effects
                    Wald
Effect   DF   Chi-Square   Pr > ChiSq
trt       1       5.7449       0.0165
x         1      12.7192       0.0004
Analysis of Maximum Likelihood Parameter Estimates
                               Standard     95% Confidence
Parameter     DF   Estimate       Error         Limits         Chi-Square   Pr > ChiSq
Intercept      1     2.6136      0.7149    1.2125    4.0147         13.37       0.0003
trt  1         1    -1.1191      0.4669   -2.0343   -0.2040          5.74       0.0165
trt  0         0     0.0000           .         .         .             .            .
x              1    -0.1998      0.0560   -0.3095   -0.0900         12.72       0.0004
Estimated Covariance Matrix
Intercept trt1 x
Intercept 0.511028 -0.209445 -0.035981
trt1 -0.209445 0.218010 0.009684
x -0.035981 0.009684 0.003137
Probit Analysis on x
Probability x 95% Fiducial Limits
0.01 36.0862 27.0084 66.4687
0.02 32.5655 24.6797 58.7092
0.03 30.4845 23.2938 54.1320
…
0.35 16.1822 12.9927 23.4513
0.40 15.1131 12.0206 21.3599
…
0.97 -4.3177 -24.1121 1.8153
0.98 -6.3988 -28.6721 0.4122
0.99 -9.9194 -36.4121 -1.9359
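The point estimates in the x column come from inverting the fitted logistic curve: for a target probability p, solve Intercept + b_x * x = logit(p) for x (with trt=0 as the reference level, no treatment term is added). The DATA step below is a sketch of that inversion using the rounded estimates from the output; it does not attempt the fiducial limits, and the names inverse_check, b0, and b_x are illustrative.

```sas
/* Sketch: inverse point estimates of x for selected probabilities (trt=0).
   Uses rounded PROBIT estimates, so results agree with the table only
   to about two decimal places. Names are illustrative assumptions. */
data inverse_check;
   b0 = 2.6136; b_x = -0.1998;
   do p = 0.01, 0.35, 0.40, 0.99;
      logit_p = log(p / (1 - p));
      x_p = (logit_p - b0) / b_x;   /* solve b0 + b_x*x = logit(p) for x */
      output;
   end;
run;
proc print data=inverse_check noobs;
   format logit_p x_p 8.3;
run;
```

For p = 0.40 this gives x of about 15.11 months, close to the 15.1131 in the table. Negative values of x at high probabilities are extrapolations outside the observed range of remission times (3 to 18 months).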
/* Get (and plot) inverse intervals for response probabilities
when trt=1 [need to give an x-level (6 here), but not used] */
data trt1; input trt x ;
cards;
1 6
;
proc probit data=out1 order=data xdata=trt1 plot=ippplot;
class trt;
model resp/total = trt x /
d=logistic inversecl lackfit covb;
title1 'Inverse Interval Estimation (trt=1; ACT)';
run;
NOTE: Output here is the same as for the trt=0 case, except for the “95% Fiducial
Limits” table and corresponding figure:
Example 2: Erectile Dysfunction Data
48 male subjects in an anti-impotence study had experienced erectile dysfunction
following prostate surgery. Subjects were randomly assigned to receive a new drug
(trt=1) or placebo (trt=0) and kept a diary for one month, recording the number of
attempts at sexual intercourse after taking the medication and the number of those
attempts that were successful. Subject age was also recorded.
Does the new drug have a higher success rate than the placebo?
data ED; input trt age successes attempts @@;
ID = _n_;
cards;
0 41 3 6 1 57 3 8
0 44 5 15 1 54 10 12
0 62 0 4 1 65 0 0
0 44 1 2 1 51 5 8
0 70 3 8 1 53 8 10
0 35 4 8 1 44 17 22
0 72 1 6 1 66 2 3
0 34 5 15 1 55 9 11
0 61 1 7 1 37 6 8
0 35 5 5 1 40 2 4
0 52 6 8 1 44 9 16
0 66 1 7 1 64 5 9
0 35 4 10 1 78 1 3
0 61 4 8 1 51 6 12
0 55 2 5 1 67 5 11
0 41 7 9 1 44 3 3
0 53 2 4 1 65 7 18
0 72 4 6 1 69 0 2
0 58 0 0 1 53 4 14
0 56 12 17 1 49 5 8
0 53 8 15 1 74 10 15
0 45 3 4 1 39 4 9
0 40 14 20 1 35 8 10
1 47 4 5
1 46 6 7
;
proc logistic data=ED;
model successes/attempts = trt age;
title1 'ED Data Analysis';
run;
ED Data Analysis
Number of Observations Read 48
Number of Observations Used 46
Sum of Frequencies Read 417
Sum of Frequencies Used 417
Response Profile
Ordered Value    Binary Outcome    Total Frequency
            1    Event                         234
            2    Nonevent                      183
Note: 2 observations with invalid response values have been deleted. Either the number of trials was less than or equal to zero or less than the number of events, or the number of events was negative.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1     1.2111      0.4597       6.9410       0.0084
trt           1     0.5265      0.2041       6.6574       0.0099
age           1    -0.0243     0.00881       7.5816       0.0059
Odds Ratio Estimates
                             95% Wald
Effect    Point Estimate     Confidence Limits
trt                1.693      1.135      2.526
age                0.976      0.959      0.993
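The note about 2 deleted observations refers to subjects whose events/trials response is invalid. A quick way to see which records are affected is to list those meeting the conditions named in the note. The step below is a sketch; the WHERE conditions simply restate the note.

```sas
/* Sketch: list records that PROC LOGISTIC drops from an events/trials model */
proc print data=ED noobs;
   where attempts <= 0 or successes > attempts or successes < 0;
   var ID trt age successes attempts;
   title1 'ED records with invalid events/trials response';
run;
```

In these data the listing should contain the two subjects with attempts = 0, one in each treatment group, which is why 46 of the 48 observations are used.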
/* Check whether the subject strata (and the associated
dependence of observations) have caused overdispersion */
proc logistic data=ED;
model successes/attempts = trt age / scale=pearson;
output out=out1 p=phat;
title1 'ED Data Analysis';
title2 '(Also Check for Overdispersion)';
run;
ED Data Analysis
(Also Check for Overdispersion)
Note: 2 observations with invalid response values have been deleted. Either the number of trials was less than or equal to zero or less than the number of events, or the number of events was negative.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 70.3355 43 1.6357 0.0053
Pearson 63.7235 43 1.4819 0.0216
Number of events/trials observations: 46
Note: The covariance matrix has been multiplied by the heterogeneity factor (Pearson Chi-Square / DF) 1.48194.
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1     1.2111      0.5596       4.6837       0.0305
trt           1     0.5265      0.2484       4.4923       0.0340
age           1    -0.0243      0.0107       5.1160       0.0237
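The SCALE=PEARSON adjustment leaves the estimates unchanged and multiplies the covariance matrix by the heterogeneity factor, so each standard error is inflated by the square root of that factor and each Wald chi-square is divided by it. The DATA step below is a sketch that applies this to the unscaled output; scale_check and its variable names are illustrative.

```sas
/* Sketch: reproduce the overdispersion-adjusted SEs and Wald tests
   from the unscaled fit. Names are illustrative assumptions. */
data scale_check;
   length parameter $ 9;
   phi = 1.48194;                        /* Pearson chi-square / DF */
   input parameter $ se_unscaled wald_unscaled;
   se_scaled   = se_unscaled * sqrt(phi);
   wald_scaled = wald_unscaled / phi;
   p_scaled    = 1 - probchi(wald_scaled, 1);
   cards;
Intercept 0.4597  6.9410
trt       0.2041  6.6574
age       0.00881 7.5816
;
proc print data=scale_check noobs;
   format se_scaled wald_scaled p_scaled 8.4;
run;
```

For trt this gives a standard error of about 0.2484 and a Wald chi-square of about 4.4923, in line with the adjusted output. The treatment effect remains significant at the 0.05 level, but with weaker evidence than the unadjusted fit suggested.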
Example 3: Menopause Data
Age and menopause status (menopause=1 for post-menopausal, 0 otherwise) are
recorded for 370 female patients. Age is also categorized into a variable agecat:
1 for age < 50, 2 for 50 ≤ age < 60, 3 for 60 ≤ age < 70, and 4 for age ≥ 70.
How does the menopause rate depend on age?
filename myurl url "http://www.stat.usu.edu/jrstevens/biostat/data/bcancer.csv"
lrecl=800;
data bcancer;
infile myurl dsd delimiter = "," firstobs=2 missover;
input menopause age agecat;
run;
data bcancer; set bcancer;
ID = _n_;
run;
/* Run regular logistic regression */
proc logistic data=bcancer plots=(roc effect);
model menopause(event='1') = age ;
title1 'Logistic regression on menopause data';
run;
Logistic regression on menopause data
Response Profile
Ordered Value    menopause    Total Frequency
            1            0                 59
            2            1                301
Probability modeled is menopause=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 124.1456 1 <.0001
Score 81.0669 1 <.0001
Wald 49.7646 1 <.0001
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1   -12.8675      1.9360      44.1735       <.0001
age           1     0.2829      0.0401      49.7646       <.0001
Odds Ratio Estimates
                             95% Wald
Effect    Point Estimate     Confidence Limits
age                1.327      1.227      1.436
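The odds ratio of 1.327 is per one-year increase in age, which can understate how steep the trend is. Two derived quantities are often easier to interpret: the odds ratio for a 10-year increase and the age at which the fitted probability equals 0.5. The sketch below uses the UNITS statement for the first and a DATA step with the rounded estimates for both; the data set name age50 and its variables are illustrative.

```sas
/* Sketch: odds ratio for a 10-year age increase via the UNITS statement */
proc logistic data=bcancer;
   model menopause(event='1') = age;
   units age = 10;
   title1 'Logistic regression on menopause data (10-year odds ratio)';
run;

/* Sketch: same quantity by hand, plus the age where fitted p = 0.5.
   Names (age50, b0, b_age) are illustrative assumptions. */
data age50;
   b0 = -12.8675; b_age = 0.2829;
   age_50  = -b0 / b_age;       /* logit(0.5) = 0, so b0 + b_age*age = 0 */
   or_10yr = exp(10 * b_age);
   put age_50= 6.2 or_10yr= 6.2;
run;
```

With these estimates the fitted probability crosses 0.5 at about 45.5 years, and the odds of being post-menopausal multiply by roughly 16.9 for each additional decade of age.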
/* What happens if the trend were even more 'clear'? */
data new; set bcancer;
if menopause = 1 & age < 57 then delete;
if menopause = 0 & age > 55 then delete;
proc logistic data=new plots=(roc effect);
model menopause(event='1') = age ;
title1 'Logistic regression on menopause subset';
run;
Logistic regression on menopause subset
Response Profile
Ordered Value    menopause    Total Frequency
            1            0                 58
            2            1                180
Probability modeled is menopause=1.
Note: 1 observation was deleted due to missing values for the response or explanatory variables.
Model Convergence Status
Complete separation of data points detected.
Warning: The maximum likelihood estimate does not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results
shown are based on the last maximum likelihood iteration. Validity of the
model fit is questionable.
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 263.5557 1 <.0001
Score 153.7178 1 <.0001
Wald 5.4176 1 0.0199
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1     -131.3     56.7926       5.3459       0.0208
age           1     2.3737      1.0198       5.4176       0.0199
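Complete separation means that some cutoff on the predictor classifies every observation correctly, so the likelihood keeps increasing as the slope grows and no finite maximum exists. With a single continuous predictor this can be confirmed by comparing the age ranges of the two response groups. The step below is a minimal sketch of that check.

```sas
/* Sketch: compare age ranges by menopause status to confirm separation */
proc means data=new n min max;
   class menopause;
   var age;
   title1 'Age range by menopause status (subset)';
run;
```

If the maximum age in the menopause=0 group is below the minimum age in the menopause=1 group, the groups do not overlap and the data are completely separated. Given how the subset was built, the first should be at most 55 and the second at least 57.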
/* How to fix this? */
proc logistic data=new plots=(roc effect);
model menopause(event='1') = age / firth;
title1 'Bias-corrected logistic regression on menopause subset';
run;
Bias-corrected logistic regression on menopause subset
Model Information
Data Set WORK.NEW
Likelihood Penalty Firth's bias correction
Model Convergence Status
Complete separation of data points detected.
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1   -88.0491     28.3566       9.6414       0.0019
age           1     1.5948      0.5078       9.8639       0.0017
/* A simple fix for separation with one continuous
predictor */
data new; set new;
dummy = (age >= 56);
proc logistic data=new;
model menopause(event='1') = dummy;
title1 'logistic with dummy';
run;
logistic with dummy
Model Convergence Status
Complete separation of data points detected.
Analysis of Maximum Likelihood Estimates
                              Standard         Wald
Parameter    DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept     1   -10.0502     19.9843       0.2529       0.6150
dummy         1    19.6361     21.9147       0.8029       0.3702
proc freq data=new;
tables dummy*menopause / chisq norow nocol nopercent;
title1 'chi-square with dummy';
run;
chi-square with dummy
Table of dummy by menopause
(cell entries: Frequency)

dummy    menopause=0    menopause=1    Total
0                 58              0       58
1                  0            180      180
Total             58            180      238

Statistics for Table of dummy by menopause
Statistic DF Value Prob
Chi-Square 1 238.0000 <.0001
Effective Sample Size = 238
Frequency Missing = 1