Analysis of CDC Chronic Disease Indicators: US Compared with Georgia
DESCRIPTION
Applied Statistics final project using SAS to complete an epidemiological analysis of CDC Chronic Disease Indicators in the US compared with the state of Georgia. Note: we are not very healthy in the state of Georgia.
TRANSCRIPT
D. Fullerton, STAT 3010.W01
Final Project, 07/19/09
STAT 3010.W01 Final Project: Analysis of Center for Disease Control Chronic Disease Indicators of the United States and Georgia for Year 2005
This report presents a statistical analysis of the Center for Disease Control Chronic Disease Indicators of the United States and Georgia for the year 2005. The analysis covered three points: 1) determine descriptive statistics and describe the distributions of the variables in the data set; 2) compare chronic disease indicator rates between the United States and Georgia separately for each of five categories; and 3) draw a random sample of 20 items from the data set, estimate the mean Chronic Disease Indicator rate in the United States and in Georgia using 95% and 99% confidence intervals, and determine whether the population mean rate of all 50 original data points was captured by each interval. The analysis was performed in SAS 9.1.3 SP4, with graphics from SAS and Minitab 15.
This dataset was chosen for its relation to healthcare and for its size and complexity. The five variables (three categorical and two quantitative) of the Center for Disease Control Chronic Disease Indicators of the United States and Georgia for Year 2005 were obtained by filtering a data set from the Center for Disease Control website (http://apps.nccd.cdc.gov/cdi/Default.aspx). The United States and Georgia were selected for comparison. The data and definitions were originally developed by the Council of State and Territorial Epidemiologists together with epidemiologists and chronic disease program directors at the state and federal level; they were refined between 1999 and 2002, and a survey was then made for 2005.
These data have proved useful in Georgia for developing a database of the indicators by 19 health districts, available via the internet. In addition, the Division of Diabetes Translation at the Center for Disease Control uses the data to assist diabetes programs with their surveillance and epidemiological activities. Table 1 shows a short selection of the data, and the variable names used in Table 1 are described in Table 2. There are 50 data points from the year 2005; six other data points from different years were trimmed from the data set before analysis. The results and analysis are therefore valid only for the year 2005. The occurrences per 100,000 people in the United States and in Georgia are assessed by Chronic Disease Indicator category.
The assessment of the quantitative and categorical variables shows the following. Table 3 gives the descriptive statistics for the Chronic Disease Indicators of the United States and Georgia; in both, the mean is far larger than the median (102.23 versus 25.95 for the United States, and 100.37 versus 25.90 for Georgia). Figures 1 and 2 show that the distributions of occurrences for the United States and for Georgia are both unimodal and positively skewed. Figures 3 and 4 further demonstrate this trend. Although the distributions are drastically skewed, no outliers are shown. The most representative measure of central tendency is therefore the median: 25.95 for the United States and 25.90 for Georgia.
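The gap between mean and median can be checked by hand on the eight United States rates that are visible in Table 1. The following is a minimal Python sketch for illustration only (the report itself used SAS, and the variable names are hypothetical); it does not reproduce Table 3, which is based on all 50 observations.

```python
from statistics import mean, median

# UNITED_STATES values from the eight rows shown in Table 1
# (observations 1-5 and 48-50); the other 42 rows are not listed there.
us_rates = [9.3, 8.9, 469.8, 458.4, 188.6, 618.6, 1.3, 1.3]

us_mean = mean(us_rates)
us_median = median(us_rates)

# A mean well above the median is the usual sign of positive skew.
print(f"mean = {us_mean:.2f}, median = {us_median:.2f}")
print("mean exceeds median:", us_mean > us_median)
```

Even on this small subset the mean sits well above the median, which is the same pattern Table 3 reports for the full data set.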
Table 4 shows the frequency of observations by category. Cancer dominates the data with 36 of the 50 observations; this modal category has over four times as many observations as the next leading category, Cardiovascular Disease (8). Figures 5 and 6 reinforce this. It is notable, however, that Cancer has a broader range of results and is skewed, whereas Cardiovascular Disease has a more even distribution.
A new categorical variable, size, was created for the occurrences in the United States and in Georgia. The occurrences were broken up into bins roughly 150 wide (the SAS code in Appendix III uses cut points of 145, 300, 450, and 600). The contingency table in Table 5 shows that Cancer statistics for the United States fall mostly in the “X-Small” range: 32 of the 36 data points in this category were below the first cut point.
Occurrences for Georgia (Table 6) differ in that some results fall into the “Medium” range, and 50% of the Cardiovascular Disease results fall into the “X-Small” category.
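The size variable amounts to a simple binning rule. The Python below is a hedged illustration rather than the report's method (the report used SAS IF statements, listed in Appendix III); the function name `size_label` is hypothetical, and the cut points 145, 300, 450, and 600 are taken from that SAS code.

```python
def size_label(rate: float) -> str:
    """Map a rate per 100,000 to a size bin (cut points from the SAS code)."""
    if rate < 145:
        return "X-Small"
    if rate < 300:
        return "Small"
    if rate < 450:
        return "Medium"
    if rate < 600:
        return "Large"
    return "X-Large"


# UNITED_STATES values from Table 1, observations 1, 3, 5 and 48.
for rate in (9.3, 469.8, 188.6, 618.6):
    print(rate, size_label(rate))
```

Applying the same rule to the GEORGIA column gives the second size variable used in Table 6.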
The category variable is also shown in Figures 7 through 10. They stress again that Cancer is the leading category by far, at 72% of all observations (Table 4). Figure 7 groups the three smallest categories into one, “Other”. The breakdowns by category for the United States and for Georgia continue to stress that Cancer and Cardiovascular Disease are the factors that warrant further study.
Figures 11 and 12 again show the breakdown of occurrences by the newly created variable, size. Each shows that most occurrences for both the United States and Georgia fall into the “X-Small” category, with a frequency of nearly 40 out of 50 in each (38 and 39, respectively). Figures 13 and 14 show category by size on stacked bar charts for the United States and Georgia. Cancer results in the United States fall mostly in the “X-Small” category, and Cardiovascular Disease results fall mostly in the “Small” category. The results for Georgia show that “X-Small” leads in every category except Overarching Conditions, and accounts for the vast majority of the Cancer category.
Finally, a random sample of 20 data points was produced in SAS. Both the 95% and 99% confidence intervals captured the true population means. For the United States, the intervals were 38.69 to 200.55 (95%) and 9.00 to 230.24 (99%), where the true mean is 102.23; for Georgia, they were 35.69 to 210.84 (95%) and 3.56 to 242.97 (99%), where the true mean is 100.37.
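A t-based confidence interval for a mean has the form sample mean plus or minus t times s divided by the square root of n. The Python sketch below is an illustration only, not the report's SAS procedure: `t_interval` and `captures` are hypothetical helper names, and the capture check uses the interval endpoints from Table 7 and the population means from Table 3.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval(sample, t_crit):
    """Return (lower, upper) for mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)
    half_width = t_crit * stdev(sample) / sqrt(n)
    return centre - half_width, centre + half_width


def captures(interval, population_mean):
    """True when the population mean lies inside the interval."""
    lower, upper = interval
    return lower <= population_mean <= upper


# Interval endpoints as reported in Table 7, population means from Table 3.
reported = {
    "UNITED_STATES": {"95%": (38.69, 200.55), "99%": (9.00, 230.24), "mean": 102.23},
    "GEORGIA": {"95%": (35.69, 210.84), "99%": (3.56, 242.97), "mean": 100.37},
}
for name, row in reported.items():
    for level in ("95%", "99%"):
        print(name, level, captures(row[level], row["mean"]))
```

For a sample of 20 there are 19 degrees of freedom, so `t_interval(sample, 2.093)` would give a 95% interval and `t_interval(sample, 2.861)` a 99% interval; these critical values are standard t-table entries, not figures taken from the report.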
APPENDIX I: SAS TABLES AND FIGURES
Table 1: Abbreviated Display of the Center for Disease Control Chronic Disease Indicators of the United States and Georgia for Year 2005
Obs | CATEGORY | INDICATOR | YEAR | MEASURE | UNITED_STATES | GEORGIA
1 | Tobacco and Alcohol | Chronic liver disease - mortality | 2005 | Crude Rate | 9.3 | 7.5
2 | Tobacco and Alcohol | Chronic liver disease - mortality | 2005 | Age-adjusted Rate | 8.9 | 8.1
3 | Cancer | Invasive cancer (all sites combined) - incidence | 2005 | Crude Rate | 469.8 | 402.6
4 | Cancer | Invasive cancer (all sites combined) - incidence | 2005 | Age-adjusted Rate | 458.4 | 452.0
5 | Cancer | Cancer (all sites combined) - mortality | 2005 | Crude Rate | 188.6 | 157.2
. | . | . | . | . | . | .
48 | Overarching Conditions | Premature mortality among adults aged 45-64 years | 2005 | Age-adjusted Rate | 618.6 | 711.1
49 | Other Diseases and Risk Factors | Asthma - mortality | 2005 | Crude Rate | 1.3 | 1.3
50 | Other Diseases and Risk Factors | Asthma - mortality | 2005 | Age-adjusted Rate | 1.3 | 1.5
NOTE: The data for other years were minimal and thus eliminated from this data set (the numeration “Obs” was added automatically by SAS).
Table 2: Summary of Variables Contained in Center for Disease Control Chronic Disease Indicators of the United States and Georgia for Year 2005
Variable Name | Label | General Type | Specific Type | Measurement Units
Obs | Observation number | Categorical | Identifier Variable | N/A
CATEGORY | Disease category | Categorical | Nominal | N/A
INDICATOR | Disease indicator | Categorical | Nominal | N/A
YEAR | Survey year (only 2005 used) | Categorical | Nominal | N/A
MEASURE | Crude or age-adjusted rate | Categorical | Nominal | N/A
UNITED_STATES | - | Quantitative | Interval/Ratio | Number of instances per 100,000 persons*
GEORGIA | - | Quantitative | Interval/Ratio | Number of instances per 100,000 persons*
* standardized by the direct method to the year 2000 standard U.S. population based on single years of age from the Census P25-1130 series estimates
Table 3: Descriptive Statistics of Center for Disease Control Chronic Disease Indicators of the United States and Georgia for Year 2005
Variable | N | Mean | Median | Std Dev | Range | Minimum | Maximum
UNITED_STATES | 50 | 102.23 | 25.95 | 153.70 | 628.60 | 1.30 | 629.90
GEORGIA | 50 | 100.37 | 25.90 | 163.10 | 719.70 | 1.30 | 721.00
Table 4: Frequency Table of Center for Disease Control Chronic Disease Indicators by Category
CATEGORY | Frequency | Percent | Cumulative Frequency | Cumulative Percent
Cancer | 36 | 72.00 | 36 | 72.00
Cardiovascular Disease | 8 | 16.00 | 44 | 88.00
Other Diseases and Risk Factors | 2 | 4.00 | 46 | 92.00
Overarching Conditions | 2 | 4.00 | 48 | 96.00
Tobacco and Alcohol | 2 | 4.00 | 50 | 100.00
Figure 1: Histogram of Occurrences United States (per 100,000 people)
[Histogram: x-axis UNITED STATES, 0 to 600; y-axis Percent, 0 to 70]
Figure 2: Histogram of Occurrences Georgia (per 100,000 people)
[Histogram: x-axis GEORGIA, 0 to 720; y-axis Percent, 0 to 70]
Figure 3: Box Plot of Occurrences United States (year 2005 per 100,000 people)
[Box plot: x-axis YEAR (2005); y-axis UNITED STATES, 0 to 800]
Figure 4: Box Plot of Occurrences Georgia (year 2005 per 100,000 people)
[Box plot: x-axis YEAR (2005); y-axis GEORGIA, 0 to 800]
Figure 5: Side by Side Box Plot of Occurrences United States (per 100,000 people)
[Box plots: x-axis CATEGORY (Cancer through Tobacco and Alcohol); y-axis UNITED STATES, 0 to 800]
Figure 6: Side by Side Box Plot of Occurrences Georgia (per 100,000 people)
[Box plots: x-axis CATEGORY (Cancer through Tobacco and Alcohol); y-axis GEORGIA, 0 to 800]
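Table 5: Contingency Table Category of Occurrences by United States Size
Each cell lists Frequency / Percent / Row Pct / Col Pct; the Total column and Total row list Frequency / Percent.
CATEGORY | Large | Small | X-Large | X-Small | Total
Cancer | 2 / 4.00 / 5.56 / 100.00 | 2 / 4.00 / 5.56 / 25.00 | 0 / 0.00 / 0.00 / 0.00 | 32 / 64.00 / 88.89 / 84.21 | 36 / 72.00
Cardiovascular Disease | 0 / 0.00 / 0.00 / 0.00 | 6 / 12.00 / 75.00 / 75.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 25.00 / 5.26 | 8 / 16.00
Other Diseases and Risk Factors | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 5.26 | 2 / 4.00
Overarching Conditions | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 100.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00
Tobacco and Alcohol | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 5.26 | 2 / 4.00
Total | 2 / 4.00 | 8 / 16.00 | 2 / 4.00 | 38 / 76.00 | 50 / 100.00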
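Table 6: Contingency Table Category of Occurrences by Georgia Size
Each cell lists Frequency / Percent / Row Pct / Col Pct; the Total column and Total row list Frequency / Percent.
CATEGORY | Large | Medium | Small | X-Large | X-Small | Total
Cancer | 1 / 2.00 / 2.78 / 100.00 | 1 / 2.00 / 2.78 / 50.00 | 3 / 6.00 / 8.33 / 50.00 | 0 / 0.00 / 0.00 / 0.00 | 31 / 62.00 / 86.11 / 79.49 | 36 / 72.00
Cardiovascular Disease | 0 / 0.00 / 0.00 / 0.00 | 1 / 2.00 / 12.50 / 50.00 | 3 / 6.00 / 37.50 / 50.00 | 0 / 0.00 / 0.00 / 0.00 | 4 / 8.00 / 50.00 / 10.26 | 8 / 16.00
Other Diseases and Risk Factors | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 5.13 | 2 / 4.00
Overarching Conditions | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 100.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00
Tobacco and Alcohol | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 2 / 4.00 / 100.00 / 5.13 | 2 / 4.00
Total | 1 / 2.00 | 2 / 4.00 | 6 / 12.00 | 2 / 4.00 | 39 / 78.00 | 50 / 100.00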
Figure 7: Pie Chart Category of Occurrences (per 100,000 people)
Figure 8: Pie Chart Category of Occurrences United States (per 100,000 people)
Figure 9: Pie Chart Category of Occurrences Georgia (per 100,000 people)
Figure 10: Bar Chart of Category of Occurrences (per 100,000 people)
Figure 11: Bar Chart of Category of Occurrences United States (per 100,000 people)
[Bar chart: x-axis US_SIZE (Large, Small, X-Large, X-Small); y-axis FREQUENCY, 0 to 40]
Figure 12: Bar Chart of Category of Occurrences Georgia (per 100,000 people)
[Bar chart: x-axis GA_SIZE (Large, Medium, Small, X-Large, X-Small); y-axis FREQUENCY, 0 to 40]
Figure 13: Stacked Bar Chart of Category of Occurrences United States (per 100,000 people)
Figure 14: Stacked Bar Chart of Category of Occurrences Georgia (per 100,000 people)
Table 7: 95 and 99% Confidence Intervals for United States and Georgia 20 set Sample
Variable | Label | N | Lower 95% CL for Mean | Upper 95% CL for Mean
UNITED_STATES | UNITED STATES | 20 | 38.69 | 200.55
GEORGIA | GEORGIA | 20 | 35.69 | 210.84
Variable | Label | N | Lower 99% CL for Mean | Upper 99% CL for Mean
UNITED_STATES | UNITED STATES | 20 | 9.00 | 230.24
GEORGIA | GEORGIA | 20 | 3.56 | 242.97
Appendix II: Figures Generated in Minitab
Figure 15: Histogram of UNITED_STATES
[Histogram: x-axis occurrences/100k, 0 to 640; y-axis Frequency, 0 to 25]
Figure 16: Histogram of GEORGIA
[Histogram: x-axis occurrences/100k, 0 to 700; y-axis Frequency, 0 to 30]
Figure 17: Boxplot of UNITED_STATES
[Box plot: y-axis occurrences/100k, 0 to 700]
Figure 18: Boxplot of GEORGIA
[Box plot: y-axis occurrences/100k, 0 to 800]
Figure 19: Boxplot of UNITED_STATES by CATEGORY
[Box plots: x-axis CATEGORY; y-axis occurrences/100k, 0 to 700]
Figure 20: Boxplot of GEORGIA by CATEGORY
[Box plots: x-axis CATEGORY; y-axis occurrences/100k, 0 to 800]
Figure 21: Pie Chart of CATEGORY
[Pie chart: legend Cancer, Cardiovascular Disease, Other Diseases and Risk Factors, Overarching Conditions, Tobacco and Alcohol; slice labels as extracted: 4.0%, 4.0%, 4.0%, 16.0%, 72.0%]
Figure 22: Pie Chart of CATEGORY for UNITED_STATES
[Pie chart: same legend; slice labels as extracted: 0.3%, 24.5%, 0.0%, 27.6%, 47.5%]
Figure 23: Pie Chart of CATEGORY for GEORGIA
[Pie chart: same legend; slice labels as extracted: 0.3%, 28.6%, 0.0%, 25.7%, 45.3%]
Figure 24: Bar Chart of CATEGORY
[Bar chart: x-axis CATEGORY; y-axis Count, 0 to 40]
Figure 25: Bar Chart of United States Size
[Bar chart: x-axis US_SIZE (X-Small, X-Large, Small, Large); y-axis Count, 0 to 40]
Figure 26: Bar Chart of Georgia Size
[Bar chart: x-axis GA_SIZE (X-Small, X-Large, Small, Medium, Large); y-axis Count, 0 to 40]
Figure 27: Stacked Bar Chart of CATEGORY by United States Size
[Stacked bar chart: x-axis CATEGORY; y-axis Count, 0 to 40; legend US_SIZE (X-Small, X-Large, Small, Large)]
Figure 28: Stacked Bar Chart of CATEGORY by Georgia Size
[Stacked bar chart: x-axis CATEGORY; y-axis Count, 0 to 40; legend GA_SIZE (X-Small, X-Large, Small, Medium, Large)]
Appendix III: SAS Code
* FULLERTON, STAT 3010.W01, FINAL PROJECT: DATA ANALYSIS OF Center for Disease Control Chronic Disease Indicators (CDC - CDI) of the United States and Georgia for Year 2005;
* SETTING SYSTEM OPTIONS;
DM 'LOG;CLEAR;OUT;CLEAR;';OPTIONS LS=100 PS=75 FORMDLIM="=";QUIT;
* Loading previously saved data set;
DATA NEWCDICDC;SET 'V:\final.project\CDICDC';
RUN;
* Saving the data as a permanent SAS data set;
DATA CDICDC;SET 'V:\final.project\CDICDC';
RUN;
* To view data in SAS;
PROC PRINT DATA = CDICDC;RUN;
* SETTING LIBREF;
* Saving data as a permanent SAS data set;
LIBNAME W2 'V:\final.project';
DATA W2.CDICDC;SET CDICDC;
RUN;
* IMPORT CDC - CDI DATA;
PROC IMPORT DATAFILE = 'V:\final.project\FilChrDisIndCDC.xls' OUT = T1 REPLACE;
RUN; QUIT;
* Variable View in SAS;
PROC CONTENTS DATA = W2.CDICDC;RUN;
* Table 1 Dataset;
ODS RTF;
PROC PRINT DATA = W2.CDICDC;VAR CATEGORY INDICATOR YEAR MEASURE UNITED_STATES GEORGIA;
RUN;
ODS RTF CLOSE;
* Descriptive Statistics for Quantitative Variables;
ODS RTF;PROC MEANS DATA = W2.CDICDC MAXDEC=2 N MEAN MEDIAN STD RANGE MIN MAX;
VAR UNITED_STATES GEORGIA;RUN;ODS RTF CLOSE;
* Frequency Tables of Category Variables;
ODS RTF;PROC FREQ DATA = W2.CDICDC;
TABLES CATEGORY INDICATOR MEASURE;RUN;ODS RTF CLOSE;
* Histograms and Boxplots;
DM 'LOG; CLEAR; OUT; CLEAR;';
PROC UNIVARIATE DATA = W2.CDICDC;VAR UNITED_STATES GEORGIA;HISTOGRAM;
RUN;
PROC SORT DATA = W2.CDICDC;BY YEAR;
PROC BOXPLOT DATA = W2.CDICDC;PLOT UNITED_STATES*YEAR; PLOT GEORGIA*YEAR;
RUN;
* Boxplot of Occurrences by Category;
DM 'LOG; CLEAR; OUT; CLEAR; GRAPH; CLEAR';PROC SORT DATA = W2.CDICDC;
BY CATEGORY;PROC BOXPLOT DATA = W2.CDICDC;
PLOT UNITED_STATES*CATEGORY;PLOT GEORGIA*CATEGORY;
RUN;
* Creating new variable (size) for contingency table analysis;
DM 'LOG;CLEAR;OUT;CLEAR';
DATA T1;
SET T1;
* Size bins for the United States (cut points 145, 300, 450, 600);
LENGTH US_SIZE $ 7;
IF UNITED_STATES < 145 THEN US_SIZE = 'X-Small';
IF (UNITED_STATES GE 145) AND (UNITED_STATES < 300) THEN US_SIZE = 'Small';
IF (UNITED_STATES GE 300) AND (UNITED_STATES < 450) THEN US_SIZE = 'Medium';
IF (UNITED_STATES GE 450) AND (UNITED_STATES < 600) THEN US_SIZE = 'Large';
IF (UNITED_STATES GE 600) THEN US_SIZE = 'X-Large';
* Size bins for Georgia, using the same cut points;
LENGTH GA_SIZE $ 7;
IF GEORGIA < 145 THEN GA_SIZE = 'X-Small';
IF (GEORGIA GE 145) AND (GEORGIA < 300) THEN GA_SIZE = 'Small';
IF (GEORGIA GE 300) AND (GEORGIA < 450) THEN GA_SIZE = 'Medium';
IF (GEORGIA GE 450) AND (GEORGIA < 600) THEN GA_SIZE = 'Large';
IF (GEORGIA GE 600) THEN GA_SIZE = 'X-Large';
RUN;
PROC PRINT DATA = T1;RUN;
* Contingency Tables;
DM 'LOG;CLEAR;OUT;CLEAR';
ODS RTF;PROC FREQ DATA = T1;
TABLES CATEGORY*US_SIZE;RUN;ODS RTF CLOSE;
ODS RTF;PROC FREQ DATA = T1;
TABLES CATEGORY*GA_SIZE;RUN;ODS RTF CLOSE;
* Pie Charts;
PROC GCHART DATA = W2.CDICDC;PIE CATEGORY;GOPTIONS HTEXT = 1;
LEGEND;RUN;QUIT;
PROC GCHART DATA = W2.CDICDC;PIE CATEGORY / SUMVAR = UNITED_STATES PERCENT = INSIDE;
GOPTIONS HTEXT = 1; LEGEND;RUN;QUIT;
PROC GCHART DATA = W2.CDICDC;PIE CATEGORY / SUMVAR = GEORGIA PERCENT = INSIDE;
GOPTIONS HTEXT = 1;LEGEND;RUN;QUIT;
* Bar Charts;
PROC GCHART DATA = W2.CDICDC;VBAR CATEGORY / TYPE = FREQ;
GOPTIONS HTEXT = 1;LEGEND;RUN;
PROC GCHART DATA = T1;VBAR US_SIZE / TYPE = FREQ;
GOPTIONS HTEXT = 1;LEGEND;RUN;
PROC GCHART DATA = T1;VBAR GA_SIZE / TYPE = FREQ;
GOPTIONS HTEXT = 1;LEGEND;RUN;
* Stacked Bar Charts;
PROC GCHART DATA = T1;VBAR CATEGORY / SUBGROUP = US_SIZE;GOPTIONS HTEXT = 1;LEGEND;
RUN;
PROC GCHART DATA = T1;VBAR CATEGORY / SUBGROUP = GA_SIZE;GOPTIONS HTEXT = 1;LEGEND;
RUN;
* Generate Random sample set of data with seed to replicate data;
DATA CDICDCN;
SET W2.CDICDC;GROUP = RANUNI(123456);
PROC PRINT DATA = CDICDCN;RUN;
* Sort random data to show only the first 20 observations;
PROC SORT DATA = CDICDCN;
BY GROUP;
DATA CDICDCNN;
SET CDICDCN;IF _n_ < 21;
PROC PRINT DATA = CDICDCNN;RUN;
* Confidence Intervals on ratio scale variables;
DM 'LOG;CLEAR;OUT;CLEAR;';
ODS RTF;PROC MEANS DATA = CDICDCNN MAXDEC=2 N CLM ALPHA = .05;
VAR UNITED_STATES GEORGIA;RUN;PROC MEANS DATA = CDICDCNN MAXDEC=2 N CLM ALPHA = .01;
VAR UNITED_STATES GEORGIA;RUN;ODS RTF CLOSE;
* Export Data to Minitab;
PROC EXPORT
OUTFILE = 'V:\final.project\FilChrDisIndCDC.csv' DATA = W2.CDICDC REPLACE;
RUN;
PROC EXPORT
OUTFILE = 'V:\final.project\FilChrDisIndCDCT1.csv' DATA = T1 REPLACE;
RUN;QUIT;