predicting hospital productivity from capacity metrics - a linear regression model
Post on 15-Jan-2017
PREDICTING HOSPITAL PRODUCTIVITY FROM
CAPACITY METRICS: A LINEAR REGRESSION
MODEL
Madeleine Organ
ACMS 30600: Prof. Huebner
April 27, 2016
Disclaimer: This paper was the final project for my Statistical Methods and Data
Analysis course. Its purpose is to demonstrate understanding of rigorous statistical
methods, not robust research. Its findings are therefore necessarily limited and should
be viewed only in this context.
I. INTRODUCTION
In this paper, I will analyze capacity metrics of hospitals in the top 200 Hospital Referral
Regions (HRRs) across America. I have constructed a linear regression model predicting medical
discharges per 1,000 Medicare enrollees (collapsed over gender) in 2012 from four capacity or
labor metrics: hosp.phys measures hospital-based physician rate (per 100,000 residents); hosp.rn
similarly measures hospital-based registered nurse rate (per 1,000 residents); hosp.emp measures
hospital employee rate (per 1,000 residents); finally, beds measures the acute care hospital bed
rate (per 1,000 residents). The different denominators do not affect the fit of the regression or
the significance of any predictor; they only rescale the slope. Because hosp.phys is measured per
100,000 residents, its values are one hundred times larger, and its slope therefore one hundred
times smaller, than if it were measured per 1,000 residents. It is only in comparing the relative
importance of the predictors (i.e., comparing the absolute values of the slope parameters) that
the slope for hosp.phys would need to be multiplied by 100 to compare it with the slope
parameters for hosp.rn, hosp.emp, and beds. In addition, data for
physician rates was collected in 2011 while other metrics were collected in 2012; while not ideal,
this should not cause a big disturbance in the model since it is reasonable to assume these rates
stay fairly similar, on average, from one year to the next. A summary of the variable names is
provided in Table 1, below.
Table 1: Variable Names and Descriptions for Model 1
Variable    Description
hosp.phys   Hospital-Based Physicians per 100,000 Residents (collapsed over specialty), 2011
hosp.rn     FTE Hospital-Based Registered Nurses per 1,000 Residents, 2012
hosp.emp    FTE Hospital Employees per 1,000 Residents, 2012
beds        Acute Care Hospital Beds per 1,000 Residents, 2012
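The effect of the different denominators on the hosp.phys slope can be written out explicitly. The following is a short sketch; the superscripts (1k) and (100k) are notation introduced here, not in the original analysis, for the physician rate per 1,000 and per 100,000 residents.

```latex
\begin{align*}
x^{(100k)} &= 100\, x^{(1k)} \\
\beta^{(100k)}\, x^{(100k)} &= \beta^{(100k)} \cdot 100\, x^{(1k)} = \beta^{(1k)}\, x^{(1k)} \\
\Rightarrow \quad \beta^{(100k)} &= \frac{\beta^{(1k)}}{100}
\end{align*}
```

The standard error of the slope rescales by the same factor, so the t statistic and p-value for hosp.phys are the same under either unit.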
The discharges response variable essentially measures the productive capacity of the
hospital; a higher-producing hospital will discharge more patients. Because the discharges
variable is a rate as well, the population of the HRR does not confound the analysis. The goal of
this analysis is to determine the extent to which these four variables – rates of hospital
physicians, nurses, general employees, and beds – affect the overall productive capacity of the
hospital, as seen in their discharge rates. Significant results would give hospital strategists and
administrators insight into ways to target resources that will have the biggest impact on their
production. Healthcare decisions in general have huge implications on the surrounding
populations not only for the direct delivery of care but also because of the vast sums of money
spent on healthcare each year by individuals and governments. Studying healthcare data sets like
this one can provide insight into the most meaningful types of spending and indicate where
spending could be reduced, if necessary.
II. DATA
The data used in this report were obtained from the Dartmouth Atlas of Health Care [1]. This
online database houses Medicare data from hospitals, grouped locally, regionally, and
nationally and indexed by hospital and affiliated physicians. The Medicare population as
used in this dataset includes individuals between the ages of 65 and 99 not enrolled in a
risk-bearing health maintenance organization. The insurance claims databases come from a federal
agency, the Centers for Medicare and Medicaid Services (CMS), that collects data for every
person and provider using Medicare health insurance. In addition, some data in the Atlas were
obtained from the U.S. Census, the American Hospital Association (AHA), the American
Medical Association (AMA), and the National Center for Health Statistics. Topics covered in
this database include Medicare reimbursement data, demographics of the Medicare population of
each region, interaction with/use of the health care system (contact days, number of clinicians
seen, etc.), cancer screening, prescription drug use, quality/effective care, hospital use data,
medical discharge data, and hospital and physician capacity, among others. The topics utilized in
this project were mainly hospital use and capacity. The Dartmouth Atlas project has collected
data for 20 years, and all Atlas reports and publications are available on its web site. The Atlas
uses small area analysis, a population-based methodology, to focus on the experience of the
population living in a defined geographic area or using a specific hospital (alternative
methodologies consider only the number of procedures performed at a hospital, without
correcting for size of population served). The data used in this analysis were grouped by Hospital
Referral Region (HRR); each represents a regional health care market and contains at least one
hospital that regularly performs cardiovascular surgery and neurosurgery. There are 306 designated
HRRs, and each has a minimum population size of 120,000; this analysis utilizes the top 200.
[1] http://www.dartmouthatlas.org/data/table.aspx?ind=186&tf=34&ch=32&loc=&loct=3&addn=ind-140_ch-6_tf-32,ind-139_tf-34,ind-135_tf-34,ind-138_tf-34&fmt=221
III. REGRESSION ANALYSIS
Exploratory data analysis via scatter plots indicates a positive correlation exists between
discharges and hospital nurses, discharges and general hospital employees, and discharges and
beds, as shown in Figure 1 (a) – (c) below:
Figure 1(a). Scatter Plot of Hospital Nurses vs. Discharges (2012)
Figure 1(b). Scatter Plot of Hospital Employees vs. Discharges (2012)
Figure 1(c). Scatter Plot of Beds vs. Discharges (2012)
As indicated in Table 2, below, the strongest correlation is between hospital nurses and
general hospital employees (correlation = 0.7682). This indicates that there is a possibility for a
severe multicollinearity problem; however, VIF analysis indicated that no such severe problem
exists (see Appendix, part C). Practically, it is reasonable that nurses and general hospital
employees would be correlated, as a large portion of hospital employees are nurses.
Table 2. Correlation Matrix for Predictor Variables
            hosp.phys     hosp.rn    hosp.emp        beds
hosp.phys   1.0000000  -0.2967407  -0.2433095  -0.4530859
hosp.rn    -0.2967407   1.0000000   0.7682780   0.6540995
hosp.emp   -0.2433095   0.7682780   1.0000000   0.5511914
beds       -0.4530859   0.6540995   0.5511914   1.0000000
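The VIF check mentioned above can also be reproduced directly from the correlation matrix in Table 2, without refitting any auxiliary regressions. The sketch below relies on a standard identity (the VIF of a predictor equals the matching diagonal entry of the inverse correlation matrix); the variable names `R`, `vif`, and `severe` are illustrative and do not appear in the original code.

```r
# Correlation matrix of the four predictors, copied from Table 2.
vars <- c("hosp.phys", "hosp.rn", "hosp.emp", "beds")
R <- matrix(c( 1.0000000, -0.2967407, -0.2433095, -0.4530859,
              -0.2967407,  1.0000000,  0.7682780,  0.6540995,
              -0.2433095,  0.7682780,  1.0000000,  0.5511914,
              -0.4530859,  0.6540995,  0.5511914,  1.0000000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j is the j-th diagonal entry of the inverse correlation matrix,
# which equals 1 / (1 - R_j^2) from regressing predictor j on the others.
vif <- diag(solve(R))
print(round(vif, 3))

# Rule of thumb used in this paper: VIF > 10 signals severe multicollinearity.
severe <- names(vif)[vif > 10]
print(severe)
```

This should agree, up to rounding, with the VIF values computed from the auxiliary regressions in the appendix code.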
A first regression model (mod1) is fitted using all four predictor variables (hospital
physicians, hospital nurses, general hospital employees, and beds) to predict the response
variable of discharges. The multiple coefficient of determination (R^2) for mod1 is 0.3686; the
adjusted R^2 is 0.3557. A summary of the model is below:
> summary(mod1)
Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds + hosp.emp)
Residuals:
    Min      1Q  Median      3Q     Max
-76.666 -20.942  -0.593  20.254  76.463
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  124.584     26.243   4.747 3.98e-06 ***
hosp.phys     -1.041      0.756  -1.377   0.1702
hosp.rn       19.861      4.959   4.005 8.80e-05 ***
beds          24.406      5.798   4.210 3.90e-05 ***
hosp.emp      -1.845      1.063  -1.735   0.0843 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.32 on 195 degrees of freedom
Multiple R-squared:  0.3686, Adjusted R-squared:  0.3557
F-statistic: 28.46 on 4 and 195 DF,  p-value: < 2.2e-16
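As a quick check on how the adjusted value relates to the multiple R-squared in this output, the adjustment can be computed by hand. This is a minimal sketch; the helper name `adj_r2` is hypothetical, and n = 200 is inferred from the 195 residual degrees of freedom plus five estimated coefficients.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 200                               # 195 residual df + 4 slopes + 1 intercept
adj_mod1 <- adj_r2(0.3686, n, p = 4)   # uses the rounded R^2 from the summary
print(adj_mod1)
```

Because the input R-squared is rounded to four decimals, the result can differ from R's reported value in the last digit.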
A second, reduced model was constructed to test the significance of the hospital physicians
variable, since it was not individually significant in the summary of the initial
model above. A partial F test (ANOVA) was conducted on the following hypotheses:
H0: β_hosp.phys = 0; hosp.phys can be removed
Ha: β_hosp.phys ≠ 0; hosp.phys cannot be removed
The results of the ANOVA analysis are provided in the following R output:
Analysis of Variance Table
Model 1: discharges ~ hosp.rn + beds + hosp.emp
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    196 181032
2    195 179290  1    1742.5 1.8952 0.1702
The p-value of 0.1702 exceeds α = 0.05; therefore, we fail to reject H0. The variable hosp.phys
can be removed from the model, as its relationship with discharges is not significant once the
other three predictors are included.
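The F statistic and p-value in the ANOVA table can be recovered from the two residual sums of squares alone. The sketch below uses the unrounded RSS values that appear in the stepAIC output later in this section; the function name `partial_f` is hypothetical.

```r
# Partial F test comparing a reduced model with a full model.
partial_f <- function(rss_reduced, rss_full, df_reduced, df_full) {
  q <- df_reduced - df_full                      # number of dropped terms
  f <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
  p <- pf(f, q, df_full, lower.tail = FALSE)     # upper-tail p-value
  c(F = f, p.value = p)
}

# Dropping hosp.phys from the four-predictor model.
res <- partial_f(rss_reduced = 181032.4, rss_full = 179289.9,
                 df_reduced = 196, df_full = 195)
print(round(res, 4))
```

With a single dropped term, the F statistic is the square of that term's t statistic in the full model (here 1.377^2, roughly 1.896), which is why the p-value matches the one in the summary of mod1.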
While dropping hosp.phys simplifies the model, two variable selection methods were used to
determine the optimal model: first, stepAIC() and second, principal
components.
The stepAIC procedure indicated that the optimal model does not include the variable
hosp.phys; results of the analysis are given below:
> library(MASS)
> optimal.hosp<-stepAIC(mod1)
Start:  AIC=1369.69
discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
            Df Sum of Sq    RSS    AIC
- hosp.phys  1    1742.5 181032 1369.6
<none>                   179290 1369.7
- hosp.emp   1    2768.9 182059 1370.8
- hosp.rn    1   14750.2 194040 1383.5
- beds       1   16293.4 195583 1385.1
Step:  AIC=1369.62
discharges ~ hosp.rn + beds + hosp.emp
           Df Sum of Sq    RSS    AIC
<none>                  181032 1369.6
- hosp.emp  1    2821.4 183854 1370.7
- hosp.rn   1   14832.9 195865 1383.4
- beds      1   23318.8 204351 1391.9
> optimal.hosp$anova #display results
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
Final Model:
discharges ~ hosp.rn + beds + hosp.emp
         Step Df Deviance Resid. Df Resid. Dev      AIC
1                               195   179289.9 1369.688
2 - hosp.phys  1 1742.486       196   181032.4 1369.623
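The AIC values in this output follow the convention used by extractAIC() for linear models, n * log(RSS / n) + 2 * (number of estimated coefficients). A small sketch that reproduces the two values from the reported residual sums of squares; the helper name `aic_lm` is hypothetical.

```r
# AIC as reported by extractAIC()/stepAIC() for a linear model.
aic_lm <- function(rss, n, n_coef) n * log(rss / n) + 2 * n_coef

n <- 200
aic_mod1 <- aic_lm(179289.9, n, n_coef = 5)   # intercept + 4 predictors
aic_mod2 <- aic_lm(181032.4, n, n_coef = 4)   # intercept + 3 predictors
print(round(c(mod1 = aic_mod1, mod2 = aic_mod2), 3))
```

This scale differs from the one used by AIC() by an additive constant that depends only on n, so model rankings are the same under either function. The gap between the two models here is well under one AIC unit.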
Thus, the resultant second model (mod2) predicts discharges from hospital nurses, general
hospital employees, and beds. Summary statistics for this model are given below.
> summary(mod2)
Call:
lm(formula = discharges ~ hosp.rn + beds + hosp.emp)
Residuals:
    Min      1Q  Median      3Q     Max
-73.978 -20.790   1.214  18.521  79.607
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   92.784     12.483   7.433 3.22e-12 ***
hosp.rn       19.916      4.970   4.007 8.72e-05 ***
beds          27.263      5.426   5.025 1.13e-06 ***
hosp.emp      -1.862      1.065  -1.748   0.0821 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.39 on 196 degrees of freedom
Multiple R-squared:  0.3625, Adjusted R-squared:  0.3527
F-statistic: 37.15 on 3 and 196 DF,  p-value: < 2.2e-16
Because of the correlation between hospital nurses and hospital employees (see Table 2), I
conducted an additional principal components analysis on the predictors of this reduced model
(that is, without hosp.phys). If this third model retained predictive power, it might be
preferable to the second model because principal components are uncorrelated with one another,
which removes multicollinearity among the predictors. My interpretation of the scree plot in
Figure 3, below, is that two principal components should be used to replace the three
remaining variables.
Figure 3. Scree Plot used for Principal Components Analysis
The third model (mod3) was constructed predicting discharges from the first two principal
components; its summary statistics are reported below.
> mod3<-lm(discharges~PC1+PC2)
> summary(mod3)
Call:
lm(formula = discharges ~ PC1 + PC2)
Residuals:
    Min      1Q  Median      3Q     Max
-70.237 -22.231   1.038  17.896  88.048
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  206.799      2.179  94.911  < 2e-16 ***
PC1          -13.458      1.434  -9.384  < 2e-16 ***
PC2          -11.985      3.202  -3.742 0.000239 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.81 on 197 degrees of freedom
Multiple R-squared:  0.3413, Adjusted R-squared:  0.3346
F-statistic: 51.03 on 2 and 197 DF,  p-value: < 2.2e-16
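To illustrate why a regression on principal components has no multicollinearity among its predictors, the sketch below runs the same prcomp() steps on simulated data. The data are simulated purely for illustration and are not the Dartmouth Atlas values; the names `X`, `scores`, `var_share`, and `pc_cor` are assumptions of this sketch.

```r
set.seed(1)   # simulated data only, not the values used in this paper
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n, sd = 0.5),   # three correlated predictors
           x2 = z + rnorm(n, sd = 0.5),
           x3 = z + rnorm(n, sd = 1.0))

pca <- prcomp(scale(X))        # PCA on z-scores of the predictors
scores <- pca$x[, 1:2]         # first two principal component scores

# Share of total variance carried by each component (the scree plot heights).
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Component scores are uncorrelated with one another by construction.
pc_cor <- cor(scores[, 1], scores[, 2])
print(round(pc_cor, 10))
```

The trade-off is interpretability: each component is a weighted mix of the original variables, so its slope no longer refers to a single capacity metric.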
I conducted an investigation of both Model 2 and Model 3 to test for influential
observations and outliers; neither influential observations nor outliers were found in either model.
> #Check for influential observations
> ##cooks.distance(mod2)
> which(cooks.distance(mod2)>1)
named integer(0)
> which(cooks.distance(mod3)>1)
named integer(0)
> # Check for Outliers
> which(abs(rstandard(mod2))>3)
named integer(0)
> which(abs(rstandard(mod3))>3)
named integer(0)
In choosing a final, optimal model, I considered various factors, including adjusted R^2 [2],
AIC, and the ANOVA and stepAIC procedures. My results are summarized in Table 3(a) below.
Table 3(a). Summary of Metrics of Model Optimality
MODEL NAME   PREDICTORS                           ADJUSTED R^2   AIC
mod1         hosp.phys, hosp.rn, hosp.emp, beds   0.3557         1369.69
mod2         hosp.rn, hosp.emp, beds              0.3527         1369.62
mod3         PC1, PC2                             0.3346         1374.166
NOTES
mod1: By F-test (ANOVA), hosp.phys was not deemed a significant predictor; this model
      should be eliminated from consideration.
mod2: Optimal model designated by StepAIC.
mod3: Principal component analysis was conducted on the already reduced model (i.e., the
      model without hosp.phys). This model has no multicollinearity.
After analysis of these factors, I choose Model 2, the model with hosp.rn,
hosp.emp and beds as predictors. With Model 1 eliminated by the F test, Model 2 has the highest
adjusted R-squared value of the remaining candidates and the lowest AIC value of all three
models, and it was designated by the stepAIC procedure to be the optimal model. A secondary
candidate would be Model 3, whose predictors, being principal components, are uncorrelated with
one another, but it has a lower adjusted R-squared and higher AIC value, which makes it less
optimal than Model 2. The criterion values are quite close, however; an argument for
Model 3 would be its lack of multicollinearity and its model simplicity (it uses only two
variables in comparison with three in Model 2). However, since there was no real need to
perform the principal components analysis (the VIF analysis indicated that the multicollinearity
was not severe) and because retaining variables in their original form makes for more
straightforward interpretation of slope parameters, I will use Model 2 for all further analysis.
[2] Since the models under investigation here have differing numbers of variables, I use adjusted
R^2 rather than simple R^2 to compare predictive power; as with R^2, a higher adjusted R^2 is
preferable.
Model 2, my final choice, gives:
predicted discharges = 92.784 + 19.916 * (hosp.rn) + 27.263 * (beds) - 1.862 * (hosp.emp)
Finally, I perform model diagnostics on Model 2 in order to check that it satisfies the
model assumptions [3]. Figure 4 below shows a residuals vs. fitted value plot; the residuals form
a roughly circular cloud with no systematic pattern, which is consistent with the assumptions of
mean zero and constant variance.
Figure 4. Residuals vs. Fitted Values Plot
Figure 5, below, is a quantile-quantile plot used to check Assumption (1); namely, that the
error term is normally distributed:
Figure 5. Quantile-Quantile Plot (Model 2)
Since the points in the quantile-quantile plot fall close to a straight line, the residuals are
consistent with Assumption (1); the error term appears to be normally distributed.
[3] The assumptions of the model are as follows: the error term (1) is normally distributed and
(2) has mean 0 with (3) constant variance σ^2, and (4) all pairs of error terms are uncorrelated.
A bootstrap resampling technique was used to validate the model: 100 samples of size n were
drawn with replacement from the data, the model was refit on each, and the R^2 of each refit
model on its own bootstrap sample was compared with its R^2 on the original data to estimate how
over-fit the model is (see Appendix, Part A). The Evaluation R^2 obtained for this validation
test was 0.3356, which is quite close to the original (Apparent) value of 0.3527 (the adjusted
R^2 of Model 2). If the Evaluation R^2 were substantially lower than the Apparent R^2, we might
conclude that the model was over-fit and too optimistic; however, this is not the case for
Model 2. Thus, Model 2 (predicting discharges from hospital employees, hospital nurses, and beds)
shows little evidence of over-fitting, and its modest fit can be expected to carry over to new
data of the same kind.
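The bootstrap validation described above can be sketched compactly as an optimism correction. The version below runs on simulated data, since the original data file is not reproduced here; the function names `r2` and `boot_r2` and the simulated variables are assumptions of this sketch, not part of the original code.

```r
# R^2 of a set of fitted values against observed values.
r2 <- function(y, fitted) 1 - sum((y - fitted)^2) / sum((y - mean(y))^2)

# Optimism-corrected R^2: apparent R^2 minus the average amount by which a
# bootstrap-fitted model looks better on its own sample than on the original data.
boot_r2 <- function(formula, data, B = 100) {
  fit <- lm(formula, data = data)
  y_name <- all.vars(formula)[1]
  apparent <- r2(data[[y_name]], fitted(fit))
  optimism <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit_b <- lm(formula, data = boot)
    r2(boot[[y_name]], fitted(fit_b)) -
      r2(data[[y_name]], predict(fit_b, newdata = data))
  })
  c(apparent = apparent, evaluation = apparent - mean(optimism))
}

set.seed(5)   # simulated data for illustration only
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(200, sd = 3)
res <- boot_r2(y ~ x1 + x2, sim)
print(round(res, 4))
```

A small gap between the apparent and evaluation values suggests little over-fitting. It does not by itself say the model predicts well, only that its in-sample fit is not badly inflated.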
IV. RESULTS
Since the Bootstrapping technique confirmed that Model 2 is valid, I now use it to
conduct inferences.
First, I determine a 95% confidence interval for β_hosp.rn, the slope parameter for hosp.rn;
this gives a range in which we are 95% confident the true value of the slope falls.
> # CI for B.hosp.rn
> CI.hosp.rn.low<-19.916-(1.96)*4.970
> CI.hosp.rn.up<-19.916+(1.96)*4.970
> CI.hosp.rn.low; CI.hosp.rn.up
[1] 10.1748
[1] 29.6572
Thus, for the hospital nurses variable, we can be 95% confident that the true value of
β_hosp.rn is within the interval [10.1748, 29.6572]. Since this interval does not contain zero,
we can also say that the relationship between hospital nurses and discharges is significant; that
is, we are 95% confident the slope parameter between the two is not zero. Practically, this
indicates that one additional FTE nurse per 1,000 residents is associated with between 10.17 and
29.66 additional discharges per 1,000 Medicare enrollees, holding beds and hospital employees
fixed. Hospital strategists can consider discharge payoffs such as these when making nurse labor
force hiring and scheduling decisions. The range is somewhat large, which is reflective of the
uncertainty in this estimate and the low predictive power of this model.
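The interval above uses the normal critical value 1.96. With 196 residual degrees of freedom, the exact t critical value is slightly larger, so a t-based interval is marginally wider. A brief sketch of the comparison, using the estimate and standard error from the summary of mod2; the names `ci_t` and `ci_z` are illustrative.

```r
est <- 19.916; se <- 4.970; df <- 196    # hosp.rn row of summary(mod2)

t_crit <- qt(0.975, df)                  # exact two-sided 95% critical value
ci_t <- est + c(-1, 1) * t_crit * se     # t-based interval
ci_z <- est + c(-1, 1) * 1.96 * se       # normal approximation used in the text

print(round(rbind(t = ci_t, z = ci_z), 4))
```

With the fitted model object available, confint(mod2) would return the t-based intervals for all coefficients directly. The conclusion here is the same under either approach, since both intervals exclude zero.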
Likewise, I determine a 95% confidence interval for β_hosp.emp: we can be 95% confident
that the true value of the slope for hosp.emp is within the range [-3.9494, 0.2254]. This is a
somewhat troubling result, however, because the interval contains zero; that is,
there might be no relationship between hospital employees and discharges. Practically, this
leaves hospital planners with a great deal of uncertainty as they make staffing decisions about
employee numbers. To check this result, I created two additional models and conducted a partial
F test comparing each with the full model (see Appendix, part B, for summaries of these models):
> mod4<-lm(discharges~hosp.rn+beds) #model does not include hosp.emp or hosp.phys
> anova(mod4,mod1)
Analysis of Variance Table
Model 1: discharges ~ hosp.rn + beds
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    197 183854
2    195 179290  2    4563.9 2.4819 0.08622 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> mod5<-lm(discharges~hosp.phys+hosp.rn+beds) #model does not include hosp.emp
> anova(mod5,mod1)
Analysis of Variance Table
Model 1: discharges ~ hosp.phys + hosp.rn + beds
Model 2: discharges ~ hosp.phys + hosp.rn + beds + hosp.emp
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    196 182059
2    195 179290  1    2768.9 3.0116 0.08425 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For each comparison, the hypotheses are as follows:
H0: β = 0 for every variable in the removed subset; the subset can be removed
Ha: β ≠ 0 for at least one variable in the removed subset; the subset cannot be removed
For Model 4, the p-value of 0.08622 exceeds α = 0.05; therefore, we fail to reject H0. Both
variables hosp.phys and hosp.emp can be removed from the model, as their joint relationship with
discharges is not significant.
For Model 5, the p-value of 0.08425 exceeds α = 0.05; therefore, we fail to reject H0. The
single variable hosp.emp can be removed from the model, as its relationship with discharges is
not significant.
This analysis indicates that hospital employees is not a significant variable at the 0.05 level
and could be removed from the model. However, as determined by the bootstrap resampling
validation technique, Model 2 (which includes hospital employees as a variable) showed little
over-fitting; moreover, it was selected by the stepAIC procedure as the optimal model. Therefore,
I will retain hospital employees as a predictor and Model 2 as my model of choice; practically,
however, the effect of additional hospital employees is either ambiguous (it likely depends on
factors beyond the scope of this model, or is confounded by the generality of the variable) or
quite low in magnitude. It should be noted that both the confidence interval and the hypothesis
test require the standard error in their computations. Standard errors are driven upward by
high levels of multicollinearity. While there was no problem of severe multicollinearity (that is,
no VIF > 10), as documented earlier (see Appendix, part C), hosp.rn and hosp.emp did have a
high correlation (0.7682). This could play a role in inflating the standard error, which widens the
confidence interval and inflates the p-value (thereby prompting a failure to reject H0 when it
perhaps should be rejected). For these reasons as well as the AIC analysis, I will retain Model 2
as my optimal model. An updated summary table of all the tested models is given below.
Table 3(b). Summary of Metrics of Model Optimality
MODEL NAME   PREDICTORS                           ADJUSTED R^2   AIC
mod1         hosp.phys, hosp.rn, hosp.emp, beds   0.3557         1369.69
mod2         hosp.rn, hosp.emp, beds              0.3527         1369.62
mod3         PC1, PC2                             0.3346         1374.166
mod4         hosp.rn, beds                        0.346          1370.716
mod5         hosp.phys, hosp.rn, beds             0.3491         1370.75
NOTES
mod1: By F-test (ANOVA), hosp.phys was not deemed a significant predictor; this model
      should be eliminated from consideration.
mod2: Optimal model designated by StepAIC.
mod3: Principal component analysis was conducted on the already reduced model (i.e., the
      model without hosp.phys). This model has no multicollinearity.
mod4: Performed because CI hypothesis test indicated that hosp.emp was not a significant
      predictor (likewise does not include hosp.phys).
mod5: Performed because CI hypothesis test indicated that hosp.emp was not a significant
      predictor (however, does include hosp.phys).
Finally, I consider a confidence interval and prediction interval for the predicted
discharges of an observation under the following conditions (chosen because they are the
conditions for the HRR of South Bend, IN):
hosp.phys.val<-27.2
hosp.rn.val<-3.8
beds.val<-1.9
hosp.emp.val<-14.2
The estimated value for discharges is ŷ = 193.8255 (the hand calculation below, which uses the
rounded coefficients, gives 193.8241); this compares with the observed value of y = 199.3
discharges (the discrepancy here is reflective of the relatively low predictive power of this
model).
y.hat.SB<-92.784+19.916*hosp.rn.val+27.263*beds.val-1.862*hosp.emp.val
y.hat.SB
[1] 193.8241
A 95% confidence interval for the mean response at these values was calculated to be
[188.87, 198.78]; that is, one can be 95% confident that the mean discharges for all hospitals
under the designated conditions is contained in this interval.
CI.yhat.SB<-predict(mod2, data.frame(hosp.rn=3.8, beds=1.9, hosp.emp=14.2), interval = "confidence", level=0.95)
CI.yhat.SB
       fit      lwr      upr
1 193.8255 188.8666 198.7843
A 95% prediction interval for a new observation at these values was calculated to be
[133.68, 253.97]; that is, one can be 95% confident that an individual hospital's discharges
under the designated conditions will fall in this interval.
PI.yhat.SB<-predict(mod2, data.frame(hosp.rn=3.8, beds=1.9, hosp.emp=14.2), interval = "prediction", level=0.95)
PI.yhat.SB
       fit      lwr      upr
1 193.8255 133.6846 253.9663
The prediction interval is logically wider because of the greater inherent variability in an
individual observation in comparison with a mean.
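The relationship between the two intervals can be checked numerically: the prediction interval adds the residual variance of a single observation to the variance of the estimated mean. The sketch below reconstructs the prediction interval from the confidence interval output and the residual standard error of Model 2 (30.39 on 196 degrees of freedom); the variable names are illustrative.

```r
fit <- 193.8255; ci_lwr <- 188.8666    # from the confidence interval output
sigma <- 30.39; df <- 196              # residual standard error of mod2

t_crit <- qt(0.975, df)
se_fit <- (fit - ci_lwr) / t_crit      # standard error of the estimated mean
se_pred <- sqrt(se_fit^2 + sigma^2)    # adds the variance of one new observation
pred_int <- fit + c(-1, 1) * t_crit * se_pred

print(round(pred_int, 2))
```

Because the residual standard error dominates the standard error of the mean here, the prediction interval is roughly twelve times as wide as the confidence interval.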
This inference may be the most useful for hospital strategists as they can now utilize the
model with their specific rates for hospital nurses, general hospital employees, and beds, and
predict how many discharges their hospital will amass per 1,000 Medicare enrollees in a given
year. Moreover, the model enables such strategists to simulate new configurations of resources
(at far lower cost than actually trying them) to determine the best outcomes possible from
their scarce resources.
V. CONCLUSIONS
While this model provides a good framework for beginning to understand the predictors
of hospital capacity as measured by discharge rates, its limited predictive power (low R^2 and
adjusted R^2) indicates that these variables are not adequate on their own to predict hospital
discharges. In particular, the R^2 for Model 2 is 0.3625 (adjusted R^2 = 0.3527); this means that
100 - 36.25 = 63.75% of the variation in discharges cannot be explained by the predictors in this
model. This is
reasonable, as a huge limitation of the dataset is lack of data on quality of the doctors and nurses
in a given HRR. Differences in quality of care can impact healthcare consumers’ decisions about
which hospitals they choose for their care; for example, if a consumer is near the border of two
designated HRRs, they will likely choose their care based on the quality of physicians or
facilities in each, neither of which is accounted for in this model. In addition, the model uses the
simplifying assumption of discharges as a measure of hospital productivity; however, this can be
misleading since any one person (accounting for only one discharge) can amass a multitude of
tests and procedures during their stay in the hospital which are not reflected in the data. This
problem may be intensified by the fact that higher-quality hospitals typically take on sicker
patients because of their more skilled physicians and greater medical and technical resources. It
is these patients that will either amass many procedures and still count as simply one patient, or
alternatively, fail to survive treatment (that is, are not discharged) and would not appear in this
dataset at all. This model could be improved by further data assessing the total number of
procedures performed (perhaps even weighted to account for different difficulty or risk factors)
or patient hours in the hospital as a better measure of hospital productivity. Further, it should
include a measure of physician and other caregiver quality rather than simply a rate of how many
caregivers a particular HRR employs.
Analysis of this model prompts several future questions. This model found no statistically
significant relationship between the number of physicians in a given HRR and the number of
discharges; however, it seems unlikely that physicians have no effect. Perhaps this potential
error manifests itself in the argument above, that physician quality, rather than quantity,
should be considered, but more likely the lack of a physician component in predicting hospital
discharges is due to a general omitted variable bias: this model simply
does not test the right variables to obtain a reliable prediction. Further research should be done,
then, to determine the degree to which physicians affect hospital productivity, and in particular if
this relationship is driven by simply quantity or by quality. Another avenue for research would
be comparing the relative need for physicians in comparison with needs for nurses and other
hospital staff. Here the quantitative model may be useful, albeit limited by the unknown quality
of the employees – perhaps one exceptional nurse could do the work of three average ones, and
there is no place for such a distinction in this model. Hospital administrators can utilize the
results from analyses like these to make strategic planning and employment decisions in order to
maximize production under conditions of limited resources.
VI. APPENDIX
Part A. R Code:
#PREDICTING HOSPITAL DISCHARGES FROM CAPACITY MEASURES:
#read in data
data2<-read.table("Capacity2.txt", header=T)
attach(data2)
#____________________________________________________________________
# EXPLORATORY DATA ANALYSIS
#make scatterplots of X variables vs. Y
plot(hosp.phys,discharges)
plot(hosp.rn, discharges)
plot(hosp.emp, discharges)
plot(beds, discharges)
#____________________
#correlation matrix
cor(cbind(hosp.phys, hosp.rn, hosp.emp, beds))
#____________________
# check VIF: regress each predictor on the other three; VIF = 1/(1 - R^2)
mod1a<-lm(hosp.phys~hosp.rn+hosp.emp+beds)
R2a<-summary(mod1a)$r.squared   # 0.2054
VIFa<-1/(1-R2a)
mod1b<-lm(hosp.rn~hosp.phys+hosp.emp+beds)
R2b<-summary(mod1b)$r.squared   # 0.6667
VIFb<-1/(1-R2b)
mod1c<-lm(hosp.emp~hosp.phys+hosp.rn+beds)
R2c<-summary(mod1c)$r.squared   # 0.5944
VIFc<-1/(1-R2c)
mod1d<-lm(beds~hosp.phys+hosp.rn+hosp.emp)
R2d<-summary(mod1d)$r.squared   # 0.5602
VIFd<-1/(1-R2d)
VIFa; VIFb; VIFc; VIFd
#____________________________________________________________________
# REGRESSION ANALYSIS
# full model
mod1<-lm(discharges~hosp.phys+hosp.rn+beds+hosp.emp)
summary(mod1)
#____________________
# conduct F test for removal of [hosp.phys] variable (this is the
#one of least significance)
mod2<-lm(discharges~hosp.rn+beds+hosp.emp) #reduced model
anova(mod2,mod1) # F test
#____________________
# variable selection/data reduction method:
# stepAIC:
library(MASS)
optimal.hosp<-stepAIC(mod1)
optimal.hosp$anova #display results
summary(mod2)
# -------------------
# principal components:
#find z-scores for original x's
x.scale<-scale(cbind(hosp.rn, beds, hosp.emp))
pca<-prcomp(x.scale)
#check the scree plot
screeplot(pca,type="lines")
#calculate principal components:
PC1<-pca$x[,1]
PC2<-pca$x[,2]
#use PC's in new model:
mod3<-lm(discharges~PC1+PC2)
summary(mod3)
#____________________
#Check for influential observations
##cooks.distance(mod2)
which(cooks.distance(mod2)>1)
which(cooks.distance(mod3)>1)
# Check for Outliers
which(abs(rstandard(mod2))>3)
which(abs(rstandard(mod3))>3)
#____________________
# indicate which is final model
extractAIC(mod1)[2]
extractAIC(mod2)[2]
extractAIC(mod3)[2]
#mod2!
#____________________
# perform model diagnostics
# residual vs. fitted plot
plot(mod2$fitted.values,mod2$residuals)
# normal plot of residuals - demonstrate model assumptions
plot(mod2$residuals)
qqnorm(mod2$residuals)
#--------------------------------------
## MODEL VALIDATION
####(3) Bootstrap validation
#Create new data set including the new variables
disch <- data.frame(discharges, hosp.rn, hosp.emp, beds)
n <- nrow(disch)
#Fit model to original data
mod.disch <- lm(discharges~hosp.rn+hosp.emp+beds, data=disch)
summary(mod.disch)
###Draw a bootstrap sample, fit a linear regression model, and calculate
## (i) the R^2 for this bootstrap model applied to the bootstrap data,
## (ii) the R^2 for this bootstrap model applied to the original sample.
##Perform this process 100 times, recording quantities (i) and (ii) for
##each bootstrap sample.
##Then use formula Evaluation=[Apparent-average(bootstrap-test)]
#Helper: R^2 of fitted values against observed values
r2 <- function(y, fitted) 1 - sum((y-fitted)^2)/sum((y-mean(y))^2)
#Set seed to reproduce results
set.seed(5)
#Set up vectors to store (i) and (ii)
R2.boot <- numeric(100)
R2.test <- numeric(100)
#Begin bootstrap sampling
for (b in 1:100){
  #Draw bootstrap sample, i.e., sample of n WITH REPLACEMENT from original data
  disch.boot <- disch[sample(1:n,n,replace=TRUE),]
  #Fit linear model on bootstrap sample (the original call used glm() with its
  #default gaussian family, which gives the same coefficients as lm())
  mod.boot <- lm(discharges~hosp.rn+hosp.emp+beds, data=disch.boot)
  #(i) R^2 on bootstrap sample using model fit to bootstrap sample
  R2.boot[b] <- r2(disch.boot$discharges, predict(mod.boot, newdata=disch.boot))
  #(ii) R^2 on original sample using model fit to bootstrap sample; our "test"
  R2.test[b] <- r2(disch$discharges, predict(mod.boot, newdata=disch))
}
#From above, Apparent R^2 was taken as 0.3527 (the adjusted R^2 of mod2)
evaluation <- 0.3527-mean(R2.boot-R2.test)
evaluation
#----------------------------------------------------
#___________________________________________________________________
#___________________________________________________________________
# SECTION 4: RESULTS
# CI for B.hosp.rn
CI.hosp.rn.low<-19.916-(1.96)*4.970
CI.hosp.rn.up<-19.916+(1.96)*4.970
CI.hosp.rn.low; CI.hosp.rn.up
#____________________
# CI for B.hosp.emp
CI.hosp.emp.low<--1.862-(1.96)*1.065
CI.hosp.emp.up<--1.862+(1.96)*1.065
CI.hosp.emp.low; CI.hosp.emp.up
# ------- The below is not used in analysis -------------#
# CI for B.beds
CI.beds.low<-27.263-(1.96)*5.426
CI.beds.up<-27.263+(1.96)*5.426
CI.beds.low; CI.beds.up
#--------------------------------------------------------#
mod4<-lm(discharges~hosp.rn+beds) # model does not include hosp.phys or hosp.emp
summary(mod4)
anova(mod4,mod1)
extractAIC(mod4)[2]
mod5<-lm(discharges~hosp.phys+hosp.rn+beds) # model does not include hosp.emp
summary(mod5)
anova(mod5,mod1)
extractAIC(mod5)[2]
#____________________
# CI and PI for fitted value similar to South Bend, IN:
hosp.phys.val<-27.2
hosp.rn.val<-3.8
beds.val<-1.9
hosp.emp.val<-14.2
y.hat.SB<-92.784+19.916*hosp.rn.val+27.263*beds.val-1.862*hosp.emp.val
y.hat.SB
CI.yhat.SB<-predict(mod2, data.frame(hosp.rn=hosp.rn.val,
beds=beds.val, hosp.emp=hosp.emp.val), interval = "confidence",
level=0.95)
CI.yhat.SB
PI.yhat.SB<-predict(mod2, data.frame(hosp.rn=hosp.rn.val,
beds=beds.val, hosp.emp=hosp.emp.val), interval = "prediction",
level=0.95)
PI.yhat.SB
#____________________
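To make the difference between the confidence interval and the prediction interval concrete, here is a minimal sketch. The first line reproduces the point estimate from the mod2 coefficients and the South Bend-like values above. The rest uses synthetic data (the data frame `dat`, the model `fit`, and the value `x = 4` are assumptions for illustration, not the hospital data) to show how the two intervals from `predict()` relate to the standard error of the mean response and the residual standard error.

```r
# Point estimate from the mod2 coefficients reported in this paper
y_hat_SB <- 92.784 + 19.916 * 3.8 + 27.263 * 1.9 - 1.862 * 14.2
y_hat_SB

# Synthetic stand-in data (NOT the hospital data)
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 5 + 3 * dat$x + rnorm(n, sd = 4)
fit <- lm(y ~ x, data = dat)

new_obs <- data.frame(x = 4)
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

# The same intervals by hand
x0 <- c(1, 4)                                   # intercept term and x value
s <- summary(fit)$sigma                         # residual standard error
xtx_inv <- summary(fit)$cov.unscaled            # (X'X)^{-1}
se_mean <- s * sqrt(drop(t(x0) %*% xtx_inv %*% x0))  # SE of the mean response
se_pred <- sqrt(s^2 + se_mean^2)                # SE for a new observation
t_crit <- qt(0.975, df.residual(fit))

manual_ci <- conf_int[1, "fit"] + c(-1, 1) * t_crit * se_mean
manual_pi <- pred_int[1, "fit"] + c(-1, 1) * t_crit * se_pred
```

The prediction interval is always wider because it adds the residual variance of a single new observation to the uncertainty in the estimated mean.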
Part B: Summaries of Models
> summary(mod1)

Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-76.666 -20.942  -0.593  20.254  76.463

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  124.584     26.243   4.747 3.98e-06 ***
hosp.phys     -1.041      0.756  -1.377   0.1702
hosp.rn       19.861      4.959   4.005 8.80e-05 ***
beds          24.406      5.798   4.210 3.90e-05 ***
hosp.emp      -1.845      1.063  -1.735   0.0843 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.32 on 195 degrees of freedom
Multiple R-squared: 0.3686, Adjusted R-squared: 0.3557
F-statistic: 28.46 on 4 and 195 DF, p-value: < 2.2e-16

> summary(mod2)

Call:
lm(formula = discharges ~ hosp.rn + beds + hosp.emp)

Residuals:
    Min      1Q  Median      3Q     Max
-73.978 -20.790   1.214  18.521  79.607

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   92.784     12.483   7.433 3.22e-12 ***
hosp.rn       19.916      4.970   4.007 8.72e-05 ***
beds          27.263      5.426   5.025 1.13e-06 ***
hosp.emp      -1.862      1.065  -1.748   0.0821 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.39 on 196 degrees of freedom
Multiple R-squared: 0.3625, Adjusted R-squared: 0.3527
F-statistic: 37.15 on 3 and 196 DF, p-value: < 2.2e-16

> summary(mod3)

Call:
lm(formula = discharges ~ PC1 + PC2)

Residuals:
    Min      1Q  Median      3Q     Max
-70.237 -22.231   1.038  17.896  88.048

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  206.799      2.179  94.911  < 2e-16 ***
PC1          -13.458      1.434  -9.384  < 2e-16 ***
PC2          -11.985      3.202  -3.742 0.000239 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.81 on 197 degrees of freedom
Multiple R-squared: 0.3413, Adjusted R-squared: 0.3346
F-statistic: 51.03 on 2 and 197 DF, p-value: < 2.2e-16

> summary(mod4)

Call:
lm(formula = discharges ~ hosp.rn + beds)

Residuals:
    Min      1Q  Median      3Q     Max
-71.498 -21.105   1.814  17.660  82.158

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   90.046     12.449   7.233 1.02e-11 ***
hosp.rn       14.305      3.813   3.751 0.000231 ***
beds          26.310      5.427   4.848 2.52e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.55 on 197 degrees of freedom
Multiple R-squared: 0.3525, Adjusted R-squared: 0.346
F-statistic: 53.63 on 2 and 197 DF, p-value: < 2.2e-16

> summary(mod5)

Call:
lm(formula = discharges ~ hosp.phys + hosp.rn + beds)

Residuals:
   Min     1Q Median     3Q    Max
-72.64 -21.70   1.07  19.07  80.24

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.3442    26.3456   4.644 6.26e-06 ***
hosp.phys    -1.0562     0.7598  -1.390 0.166074
hosp.rn      14.3015     3.8042   3.759 0.000225 ***
beds         23.4198     5.7993   4.038 7.72e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.48 on 196 degrees of freedom
Multiple R-squared: 0.3589, Adjusted R-squared: 0.3491
F-statistic: 36.57 on 3 and 196 DF, p-value: < 2.2e-16
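The summaries above can be cross-checked with a little arithmetic. The sketch below is a hedged illustration: it recomputes each Adjusted R-squared from the reported Multiple R-squared, and it rebuilds the partial F statistic for dropping hosp.phys (mod2 nested in mod1) from the two R-squared values. The sample size n = 200 follows from the residual degrees of freedom plus the number of coefficients; the names `adj_r2`, `r2`, `p`, and `F_partial` are my own illustrative choices, and small discrepancies come from rounding in the printed output.

```r
n <- 200  # e.g. mod1: 195 residual df + 5 estimated coefficients

# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Multiple R-squared values and predictor counts read off the summaries
r2 <- c(mod1 = 0.3686, mod2 = 0.3625, mod3 = 0.3413, mod4 = 0.3525, mod5 = 0.3589)
p  <- c(mod1 = 4,      mod2 = 3,      mod3 = 2,      mod4 = 2,      mod5 = 3)
adj <- adj_r2(r2, n, p)
round(adj, 4)

# Partial F test for dropping hosp.phys: mod2 (reduced) vs mod1 (full)
q <- 1            # number of predictors dropped
df_full <- 195    # residual df of the full model
F_partial <- unname(((r2["mod1"] - r2["mod2"]) / q) / ((1 - r2["mod1"]) / df_full))
p_value <- pf(F_partial, q, df_full, lower.tail = FALSE)
F_partial
p_value
```

Because only one predictor is dropped, the partial F should be close to the square of the hosp.phys t value in summary(mod1), (-1.377)^2, which is also what anova(mod2, mod1) tests.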
Part C: Evaluating VIFs:
# check VIF
mod1a<-lm(hosp.phys~hosp.rn+hosp.emp+beds)
summary(mod1a)
R2a<-0.2054
VIFa<-1/(1-R2a)
mod1b<-lm(hosp.rn~hosp.phys+hosp.emp+beds)
summary(mod1b)
R2b<-0.6667
VIFb<-1/(1-R2b)
mod1c<-lm(hosp.emp~hosp.phys+hosp.rn+beds)
summary(mod1c)
R2c<-0.5944
VIFc<-1/(1-R2c)
mod1d<-lm(beds~hosp.phys+hosp.rn+hosp.emp)
summary(mod1d)
R2d<-0.5602
VIFd<-1/(1-R2d)
> VIFa; VIFb; VIFc; VIFd
[1] 1.258495
[1] 3.0003
[1] 2.465483
[1] 2.273761
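The four auxiliary regressions above all follow the same pattern, so they can be wrapped in one function. The sketch below is an illustration on synthetic predictors, not the hospital data: `vif_manual` is a hypothetical helper, and `x1`, `x2`, `x3` are made-up variables with `x2` deliberately built to correlate with `x1`. As a sanity check it compares the result with the diagonal of the inverse correlation matrix, which equals the VIFs.

```r
# Synthetic stand-in predictors (NOT the hospital data)
set.seed(30600)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # correlated with x1 by construction
x3 <- rnorm(n)                        # roughly independent of the others
dat <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on the others and
# return VIF_j = 1 / (1 - R_j^2)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(pred) {
    others <- setdiff(predictors, pred)
    aux <- lm(reformulate(others, response = pred), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(dat, c("x1", "x2", "x3"))
vifs

# Cross-check: VIFs are the diagonal of the inverse correlation matrix
vifs_check <- diag(solve(cor(dat)))
```

Reading R^2 from the fitted auxiliary model avoids retyping rounded values such as 0.6667, which is why the hand-entered version above gives 3.0003 for hosp.rn.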
Part D: Checking for Influential Observations/Outliers
> #Check for influential observations
> ##cooks.distance(mod2)
> which(cooks.distance(mod2)>1)
named integer(0)
> which(cooks.distance(mod3)>1)
named integer(0)
> # Check for Outliers
> which(abs(rstandard(mod2))>3)
named integer(0)
> which(abs(rstandard(mod3))>3)
named integer(0)
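Since both checks return `named integer(0)` for these models, it may help to see what the same calls report when a problem observation is present. The sketch below uses synthetic data with one planted outlier (observation 10, shifted by an arbitrary 15 units); the data, the model `fit`, and the additional 4/n cutoff for Cook's distance are assumptions for illustration only. The 4/n cutoff is a common rule of thumb, not a threshold used in this paper.

```r
# Synthetic stand-in data (NOT the hospital data) with one planted outlier
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[10] <- y[10] + 15          # planted outlier at observation 10
fit <- lm(y ~ x)

# Influential observations: Cook's distance above 1 (threshold used above)
flag_cook <- which(cooks.distance(fit) > 1)
# A stricter rule-of-thumb cutoff of 4/n (assumption, not from this paper)
flag_cook_4n <- which(cooks.distance(fit) > 4 / n)

# Outliers: standardized residuals beyond 3 in absolute value
flag_out <- which(abs(rstandard(fit)) > 3)

flag_cook
flag_cook_4n
flag_out
```

An outlier in the response is not automatically influential: a point can have a large standardized residual and still fall below the Cook's distance cutoff of 1 if its leverage is low.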