Introduction to SAS - Stony Brook Medicine
Xiaoyue Zhang, MS.
Biostatistician
Biostatistical Consulting Core
Department of Family, Population and Preventive Medicine
Stony Brook University
Wei Hou, Ph.D.
Research Associate Professor, Division of Epidemiology and Biostatistics
Department of Family, Population and Preventive Medicine
Adjunct Associate Professor, Department of Applied Mathematics and Statistics
Voluntary Faculty, Department of Pathology
Stony Brook University
January 14, 2019
Biostatistical Consulting Core (BCC)
In collaboration with Clinical Translational Science Center (CTSC) and
the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).
SAS (Statistical Analysis System)
• SAS is a statistical software package that captures, stores, modifies, and presents data and
performs various operations on it.
• SAS programs provide an "extraordinary range of data analysis and data management tasks"
[1], but SAS is more difficult to use and learn than other statistical software.
Why SAS?
• Undisputed market leader in statistical analysis and modeling.
• Offers a huge array of statistical functions and a comprehensive user’s guide
(https://support.sas.com/documentation/cdl/en/statug/63962/PDF/default/statug.pdf).
• Stable, reliable and powerful.
• Widely used for regulatory submissions to the FDA.
[1] Acock, Alan C (November 2005). "SAS, Stata, SPSS: A Comparison". Journal of Marriage and Family. 67 (4): 1093–1095.
OUTLINE
Part I – Getting Started in SAS
• Using Virtual SINC Site
• SAS windows and SAS data
Part II – Data Processing
• Data Step Programming
• Select Variables/Observations
• Merge/Concatenate Data Sets
• Sort Data set
• IF-THEN/ELSE statement
Part III – Categorical Data
• Frequency tables
• Chi-square test
• Fisher’s exact test
Part IV – Continuous Data
• Descriptive statistics
• T-test
• ANOVA
• Correlation coefficient
Part I: Getting Started in SAS
Using Virtual SINC Site
The Virtual SINC Site provides a way for students, faculty, and staff members of SBU to
access site-licensed, academic software titles directly from their personal computers from
either on or off campus 24 hours a day, 7 days a week.
https://it.stonybrook.edu/services/virtual-sinc-site
Click “LAUNCH VIRTUAL SINC SITE”.
SAS Window Environment
Log window
Editor window
Output window
Results window
Explorer window
Store SAS data file
Return to the EXPLORER window
Store temporary dataset
The Results window contains a running record of the output. This figure displays the results of a PROC FREQ step, which is covered in Part III.
Data Input
Annotations on the example program:
• Comments
• Specify the new dataset
• Specify variable names in the new dataset
• $: a variable name followed by a dollar sign is a character variable; otherwise the variable is numeric
• Specify values for each variable
• End of the program
• Execute the program
• Variables: columns in a SAS dataset.
• Observations: rows in SAS dataset.
• Variable types:
Numeric: store numbers;
Character: contain text.
• Every SAS statement ends with a semicolon (;).
• Statements can begin in any column and a single statement can span multiple lines.
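The points above can be sketched as a small DATA step. This is only an illustration: the data set name demo and the three rows of values are made up, and the variable names simply mirror the workshop data.

```sas
/* Hypothetical example: data set name and values are made up */
data demo;
    input ID SEX $ AGE;   /* $ marks SEX as character; ID and AGE are numeric */
    datalines;
1 Female 34
2 Male 52
3 Female 47
;
run;

/* Print the new data set to check what was read */
proc print data=demo;
run;
```

Each row after DATALINES becomes one observation, and each name in the INPUT statement becomes one variable.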
Example Data Used for This Workshop: Bariatric.xlsx
• Made-up bariatric surgery data.
• Sample size: 300 patients.
• Variables:
o ID: unique patient ID;
o Sex: categorical variable, Female vs Male;
o Age: continuous variable, ≥18;
o Race: categorical variable, White vs Black vs Asian;
o Height (HGT): continuous variable;
o Height unit (HGTUNIT): unit of height;
o Weight (WGT): continuous variable, weight before surgery;
o Weight unit (WGTUNIT): unit of weight;
o PRE_BMI: continuous variable, BMI before surgery;
o POST_BMI: continuous variable, BMI after surgery;
o Surgery type (SURG): categorical variable, Bypass vs Sleeve.
• THIS DATASET IS USED ONLY FOR THE SAS TUTORIAL; ANY CLINICAL RESULTS
CONCLUDED FROM IT HAVE NO SCIENTIFIC BASIS.
Download the dataset from BCC Education Series website:
https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education
and save it to your Stony Brook network drive (X: \\mysbfiles.campus.stonybrook.edu).
DATA IMPORT
Member name: the data set name used in SAS for the data you imported. In this
workshop, we set it as “mydata”.
CLICK FINISH
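The import wizard steps can also be written as code. The following is a minimal sketch, assuming the file was saved directly under the X: drive (the exact path is an assumption) and that the installation can read XLSX files through PROC IMPORT.

```sas
/* Assumed path: replace with the folder where Bariatric.xlsx was saved */
proc import datafile="X:\Bariatric.xlsx"
    out=work.mydata      /* member name used in this workshop */
    dbms=xlsx
    replace;             /* overwrite mydata if it already exists */
    getnames=yes;        /* first row holds the variable names */
run;
```

As with the wizard, check the LOG window afterwards to confirm how many observations and variables were read.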
PLEASE NOTE:
• Check the LOG window each time after running a program.
• Close the dataset before running a program that modifies it.
Part II: Data Processing
Data Step Programming
A new SAS data set can be created from an existing SAS data set with the DATA and
SET statements.
DATA name of new SAS dataset;
SET name of existing SAS dataset;
<other statements;>
RUN;
Select Variables
Use the KEEP or DROP statement to control which variables are written to the new dataset.
data test1; set mydata;
keep ID SURG SEX AGE RACE;
run;
data test2; set mydata;
drop SURG SEX AGE RACE;
run;
Select Observations
Use the WHERE statement to select observations that meet a certain condition and save them into
a new dataset.
data test3; set test1;
where SEX='Male';
run;
Use logical and comparison operators in the WHERE statement to select observations
that meet the selection criteria.
o Comparison operators: =; ^= (not equal); >; <;…
o Logical operators: And (&); Or (|); Not (^).
data test4; set test1;
where SEX='Female' & age<=50;
run;
Sort Data
Rearrange the observations of a data set with the SORT procedure according to the variables
named in the BY statement. Data can be sorted on more than one variable at a time.
proc sort data=test1;
by AGE;
run;
This is the simplest way to use the SORT procedure: it directly modifies the original data set and replaces it with the sorted version.
• To save the sorted data as a new data set instead of modifying the input data set, add the
OUT= option to the PROC SORT statement. We recommend this option because it avoids
altering the original data; sorting in place cannot be undone.
proc sort data=test1 out=test5;
by AGE;
run;
• To sort data and remove observations with duplicate values of the BY variables, add the NODUPKEY option to PROC
SORT: it keeps only the first observation it encounters for each BY value.
/* ATTENTION: CLOSE TEST5 BEFORE RUNNING THE FOLLOWING CODE */
proc sort data=test1 out=test5 nodupkey;
by AGE;
run;
Compare test5 with test1 and check the difference in the AGE variable.
Note: Be careful with NODUPKEY. It deletes observations, so use it only when you are sure that duplicate
values of the BY variables should not exist.
By default, SAS sorts data in ascending order. To reverse it, add the keyword DESCENDING to
the BY statement before each variable that should be sorted from highest to lowest.
proc sort data=test2 out=test6; by descending HGT WGT; run;
Compare the sorted order of HGT and WGT:
HGT is in descending order.
WGT is automatically in ascending order.
Match-Merging
Horizontally combine observations from multiple data sets into a single observation
in a new data set, matching observations on common variables with the MERGE and BY
statements.
Input data sets used for the merge must have at least one common variable.
Input data sets must be sorted by the common variable(s) using the SORT
procedure.
For example, merge data sets test5 and test6 by patient ID:
proc sort data=test5 out=test5_1; by ID; run;
proc sort data=test6 out=test6_1; by ID; run;
data test7; merge test5_1 test6_1;
by ID;
run;
SAS keeps every value of the common variable(s) found in any input data set, even
when it has no match in the other data set. For such unmatched observations, the variables
that come only from the other data set are set to missing in the new data set.
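When only matched observations are wanted, one common approach is the IN= data set option. The sketch below assumes the sorted data sets test5_1 and test6_1 from the merge example; the output name test7_matched and the flag names in5 and in6 are made up for illustration.

```sas
data test7_matched;
    /* in5 and in6 are temporary 0/1 flags: 1 if that data set contributed to the row */
    merge test5_1(in=in5) test6_1(in=in6);
    by ID;
    if in5 and in6;   /* keep only IDs present in both inputs */
run;
```

Comparing the number of observations in test7_matched with test7 shows how many IDs were unmatched.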
Concatenate Datasets
Vertically stack datasets one after the other.
No requirement for input datasets to have common variables.
If common variables exist, they must have the same type (numeric or
character) [2].
The new dataset includes all variables found in any input dataset, even
if they do not appear in every input dataset. In that case, observations from a dataset that lacks a variable get missing values for
that variable.
data test8; set test3 test4; run;
[2] Introduction to SAS Informats and Formats, https://support.sas.com/publishing/pubcat/chaps/59498.pdf
IF-THEN/ELSE Statement
Use IF-THEN/ELSE to modify observations that meet specific conditions. For example, based on patients’ ages, add a
categorical variable named AGEC that divides patients into 3 groups: 18-29; 30-49;
>=50.
data mydata; set mydata;
if age<30 then agec="18-29";
else if age<50 then agec="30-49";
else agec=">=50";
run;
An IF-THEN/ELSE chain is a logical and efficient way to avoid overlap across levels.
Alternatively, you can write separate IF statements, but then you must spell out the full
condition for each level.
data mydata; set mydata;
if age<30 then agec="18-29";
if age>=30 & age<50 then agec="30-49";
if age>=50 then agec=">=50";
run;
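One caveat worth checking: SAS treats a missing numeric value as smaller than any number, so if any AGE value were missing, the condition age<30 would place that patient in the "18-29" group. A minimal sketch that guards against this is shown here; the output data set test_agec and the variable agec_m are hypothetical names, and the workshop data may not contain any missing ages.

```sas
data test_agec;
    set mydata;
    length agec_m $5;                        /* declare the new character variable */
    if missing(age) then agec_m=" ";         /* leave the group blank when AGE is missing */
    else if age<30 then agec_m="18-29";
    else if age<50 then agec_m="30-49";
    else agec_m=">=50";
run;
```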
Create New Variable(s)
In the previous slide, we used AGE to create a new variable AGEC without declaring the
new variable. Its type, length, and other attributes were set automatically at the first
place SAS encountered the variable.
AGEC is a character variable because it is assigned a string. Its length is set to 5 bytes based on the first value assigned to it (“18-29”).
If the content of AGEC for another observation is longer than 5 bytes, it is truncated and only
the first 5 characters are saved.
To be safe, we recommend adding a LENGTH statement before creating a new
variable.
data test9; set mydata;
length agec_1 $30;
if age<30 then agec_1="Younger than 30";
else if age<50 then agec_1="Age between 30 to 49";
else agec_1="Older than or equal to 50";
run;
Delete the LENGTH statement and compare the output data with test9.
Questions:
• How do we generate a descriptive table of patients’ demographic information by surgery type?
• How do we analyze the relationship between demographic information and surgery type?
• How do we compare population means across groups?
• How do we check the correlation between numeric variables?
Part III: Categorical Data
• The simplest categorical data tell us which of two categories a subject is in,
e.g. Male or Female, Diseased or Non-Diseased, etc. This type of data is called binary or
dichotomous.
• Categorical data can also have more than two categories (levels).
• In this section, we introduce the FREQ procedure in SAS to summarize and analyze
categorical data. PROC FREQ is a descriptive and statistical procedure that produces
one-way to n-way frequency and contingency tables. It can also perform analyses and
statistical tests.
      Level 1: Female   Level 2: Male   Total
Sex   236               64              300
One-Way Frequency Table
proc freq data=mydata;
tables SEX;
run;
• proc freq: Initiate FREQ procedure.
• data: Specify dataset.
• tables: Specify variable for frequency table.
• run: End of the procedure.
Request a plot by adding the PLOTS= option to the TABLES statement. Separate the requested tables
from the options with a slash (/).
proc freq data=mydata;
tables sex / plots=freqplot;
run;
Request multiple one-way frequency tables at one time for different categorical variables.
proc freq data=mydata;
tables agec race / plots=freqplot;
run;
Question: among the 236 female patients, how many had bypass? How many
had sleeve?
Use a contingency table to see the joint frequency distribution of the two variables:
          Sex
Surgery   Female   Male   Total
Bypass    ?
Sleeve    ?
Total
Two-Way Frequency Table
proc freq data=mydata;
tables surg*sex;
run;
Based on the column percent (Col Pct): among
236 female patients, 130 (55.08%) had
bypass and 106 (44.92%) had sleeve; among
64 male patients, 30 (46.88%) had
bypass and 34 (53.13%) had sleeve.
Based on the row percent (Row Pct): among 160
bypass patients, 130 (81.25%) are female and
30 (18.75%) are male; among 140
sleeve patients, 106 (75.71%) are female and 34
(24.29%) are male.
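The same two-way table can be rebuilt from the four cell counts alone, which is handy when only a summary table is available. This sketch uses the counts reported above; the data set name surg_sex and the variable COUNT are made-up names.

```sas
/* Cell counts taken from the two-way table above; names are hypothetical */
data surg_sex;
    input SURG $ SEX $ COUNT;
    datalines;
Bypass Female 130
Bypass Male 30
Sleeve Female 106
Sleeve Male 34
;
run;

proc freq data=surg_sex;
    weight COUNT;       /* each row stands for COUNT patients */
    tables SURG*SEX;
run;
```

The WEIGHT statement tells PROC FREQ to treat each row as COUNT observations rather than one.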
• FREQ procedure excludes observations with missing values from the table but displays
the total frequency of missing observations below each table.
• SAS offers a lot of options in TABLES to control the output table information and you can
specify multiple options at the same time:
NOCOL: suppresses display of column percentages.
NOROW: suppresses display of row percentages.
NOCUM: suppresses display of cumulative frequencies and percentages.
CUMCOL: display cumulative column percentages.
…
proc freq data=mydata;
tables surg*sex / cumcol norow;
run;
Generate bar chart to compare the proportions by group.
proc freq data=mydata;
tables surg*sex / plots=FreqPlot(type=barchart groupby=column
twoway=cluster scale=grouppercent);
run;
Options of FREQPLOT:
• type=barchart: request a bar chart.
• groupby=column: show distributions by the column variable. Available options: column,
row.
• twoway=cluster: display the distributions from the same group side by side
in one figure. Available options: cluster, stacked.
• scale=grouppercent: display group percentages. Available options:
grouppercent, freq, percent, log, sqrt.
Hypothesis Test
Based on the previous contingency table, 55.08% of female patients had bypass, while 46.88%
of male patients had bypass. It seems that females are more likely than males to have bypass
rather than sleeve.
How can we compare these two proportions to see whether a significant difference exists? Is
there any association between bariatric surgery type and patient’s gender?
We wish to show that there is a difference between genders in the choice of bariatric
surgery type, and we test this hypothesis.
A hypothesis is an educated guess; it should be testable and quantifiable by evidence or
data.
A hypothesis test works like a proof by contradiction.
We assume that the two proportions are the same (null hypothesis). If the data provide
enough evidence to cast doubt on that assumption, we
conclude the opposite (alternative hypothesis).
For example:
π1: the proportion of female patients who had bypass.
π2: the proportion of male patients who had bypass.
The null hypothesis is what we are trying to disprove: 𝐻0: π1 = π2.
The alternative hypothesis is what we are trying to show is true: 𝐻𝑎: π1 ≠ π2.
Steps of Hypothesis Testing
1. State the null (𝑯𝟎) and alternative (𝑯𝒂) hypotheses.
2. Choose a significance level 𝜶, usually 0.05.
3. From the sample, calculate the test statistic, p-value, and confidence interval using
the theoretical distribution behind the test statistic.
4. Compare the p-value with the significance level.
If P-value < 𝛼, reject the null hypothesis.
If P-value ≥ 𝛼, fail to reject the null hypothesis.
5. Make a decision (reject or fail to reject the null hypothesis) and state your conclusion.
Note: Failing to reject the null hypothesis does not mean that the null hypothesis is
true.
Chi-Square Test
The chi-square test of independence can be used to examine the relationship between two
categorical variables. The frequency of each category for one categorical variable is compared
across the categories of the second categorical variable.
Each variable can have more than 2 categories; that is, the chi-square test can be
applied to r*k contingency tables, where r>2 and/or k>2.
To test the hypothesis that there is a difference between genders in the choice of bariatric
surgery type:
Null hypothesis: π1 = π2.
Alternative hypothesis: π1 ≠ π2.
π1: the proportion of female patients who had bypass.
π2: the proportion of male patients who had bypass.
Use PROC FREQ to perform the chi-square test: add the CHISQ option in TABLES
statement.
proc freq data=mydata;
tables surg*sex / chisq;
run;
Based on the p-value, we fail to reject the null hypothesis at the 5% significance level
(0.2429>0.05) and conclude that there is no significant difference between the proportions of bypass patients among female and male patients.
P-value from the Chi-Square test
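As a rough check on where this p-value comes from, the Pearson chi-square statistic can be worked out by hand from the cell counts reported in the two-way table (130, 30, 106, 34). The notation O for observed counts, E for expected counts, and n for the total sample size is introduced here for illustration and does not appear in the slides.

```latex
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

E_{\text{Bypass, Female}} = \frac{160 \times 236}{300} \approx 125.87,
\qquad
\chi^2 \approx 1.36, \quad \text{df} = (2-1)(2-1) = 1
```

A chi-square value of about 1.36 on 1 degree of freedom is consistent with the reported p-value of 0.2429.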
Small Sample Situation
Note that the chi-square test is not suitable when the sample size of a contingency table is
small, for example when the expected count in any cell of the table is less than 5.
For a 2*2 table with small sample size, use Fisher’s exact test.
For an r*k table with small sample size, use the chi-square test with a p-value from Monte Carlo
simulation.
For 2*2 tables, the CHISQ option automatically provides the Fisher’s exact test result. For r*k
tables, you can request Monte Carlo simulation by adding an EXACT statement.
Monte Carlo simulation
Test the relationship between surgery type and race.
proc freq data=mydata;
tables surg*race;
exact chisq / mc;
run;
Two cells have small values, and SAS issues a WARNING that the result of the chi-square
test may not be valid.
Conclusion: Based on the p-value, we reject the null hypothesis at the 5% significance level
and conclude that there is a significant difference across races in the choice of bariatric surgery
type.
P-value from the Chi-square test
based on Monte Carlo simulation
Part IV: Continuous Data
In our sample dataset, height, weight, pre-surgery BMI and post-surgery BMI are continuous
data.
o What are the mean, median, minimum, maximum of these variables?
o How dispersed are these variables?
o Is there any difference in pre-surgery BMI across age/race/gender groups?
o Is there any difference between pre-surgery and post-surgery BMI?
o What is the correlation between pre-surgery and post-surgery BMI?
In this section, we first introduce how to summarize continuous data and calculate
descriptive statistics, such as the mean, median, and minimum/maximum. Next, we introduce
how to examine the relationships between variables by measuring correlation and conducting
t-tests and ANOVA.
Descriptive Statistics
Use PROC MEANS to summarize the values of numeric variables. By default, this procedure
calculates 5 statistics: N, mean, standard deviation, minimum, and maximum.
proc means data=mydata;
var pre_BMI;
run;
Note that the number of patients with pre-surgery BMI information is N=298. Based on
our previous analyses of categorical variables, there should be 300 patient records in total.
Thus 2 patients have no pre-surgery BMI value. How can we output the missing count?
To calculate other descriptive statistics, add keywords to the PROC MEANS statement: median,
nmiss, Q1, Q3, QRANGE,…
To analyze data by group, specify the group variable in a CLASS statement.
proc means data=mydata nmiss n q1 median q3 max qrange;
var pre_BMI;
class SEX;
run;
proc means data=mydata nmiss n q1 median q3 max qrange;
var pre_BMI;
run;
Box Plot, Histogram
To request a box plot and histogram of a continuous variable (height, for example), use
PROC UNIVARIATE with the PLOT option.
proc univariate data=mydata plot;
var HGT;
run;
To request plots of a continuous variable by group, say race, first sort the data
by the group variable. Next, name the group variable in a BY statement in PROC UNIVARIATE.
The procedure defines a BY group as a set of contiguous observations that have
the same values for the BY variable. Each level of the group variable gets its own set of
output, listed after the name of the level, and a grouped box plot appears at the end of the results.
proc sort data=mydata out=mydata2; by race; run;
proc univariate data=mydata2 plot;
var pre_bmi;
by race;
run;
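If only the grouped box plot is needed, PROC SGPLOT is an alternative that does not require sorted input. This is a sketch of a different procedure from the one used in the slides, applied to the same mydata variables.

```sas
/* Alternative sketch: one box of pre_bmi per level of race, no sorting needed */
proc sgplot data=mydata;
    vbox pre_bmi / category=race;
run;
```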
Two-Sample T-test
Used to compare a continuous variable between two populations or between two groups of a categorical
variable.
It determines whether there is a significant difference between the means of the 2 samples
(independent groups).
Key assumptions underlying the two-sample t-test:
• Sample size large enough (n1>30 and n2>30, say);
• Randomly sampled data;
• Two populations/groups are independent;
• If sample sizes are small, data from each population need to be approximately normally
distributed.
Examine whether pre-surgery BMI is different between gender groups:
Null hypothesis: 𝜇1 = 𝜇2.
Alternative hypothesis: 𝜇1 ≠ 𝜇2.
𝜇1: the mean of pre-surgery BMI of female patients.
𝜇2: the mean of pre-surgery BMI of male patients.
By default, SAS will output the two-sided test result:
proc ttest data=mydata;
var pre_BMI;
class sex;
run;
Check the homogeneity of variances of the 2 groups
with an F-test. Since p-value<0.05, reject the null
hypothesis of equal variances and conclude that the variances are not
equal. For unequal variances, use Satterthwaite’s
method.
P-values of the two-sample t-test. Fail to reject the null
hypothesis and conclude that
the means of pre-surgery BMI
do not differ significantly between genders.
The difference between the two
means is close to 0 and the
confidence interval includes 0.
Recommendation:
“For the problem of testing the equality of means from two independent normally
distributed populations where the ratio of the variances is unknown, directly apply
Satterthwaite's Approximate F test without using any preliminary variance test. ”
[Reference]: Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite's approximate F
test. Communications in Statistics-Theory and Methods, 18(11), 3963-3975.
Check the assumption of normality:
Distribution plots
Q-Q plots
The male data do not seriously deviate from the fitted line, but the female data are skewed
to the right. Since the sample size is large, the t-test is still valid.
To request a lower or upper one-sided test, add the SIDES= option to the PROC TTEST statement.
• Lower one-sided test:
Null hypothesis: 𝜇1 = 𝜇2.
Alternative hypothesis: 𝜇1 < 𝜇2.
proc ttest data=mydata sides=L;
var pre_BMI;
class sex;
run;
• Upper one-sided test:
Null hypothesis: 𝜇1 = 𝜇2.
Alternative hypothesis: 𝜇1 > 𝜇2.
proc ttest data=mydata sides=U;
var pre_BMI;
class sex;
run;
Paired T-test
• Paired data arise when two measurements are taken from the same subject under
different experimental conditions (e.g., before and after treatment).
• Both measurements must be available for the same subjects.
• The analysis focuses on the within-subject difference between the two measurements.
• Validity of the paired t-test:
Paired observations;
Sample size large enough (n>30, say);
If the sample size is small, the differences must be approximately normal.
Examine the BMI difference before and after surgery.
Null hypothesis: 𝜇𝐷=0;
Alternative hypothesis: 𝜇𝐷 ≠ 0;
𝜇𝐷: the mean difference of BMI before and after the surgery for the entire population.
proc ttest data=mydata;
paired pre_bmi*post_bmi;
run;
P-values from the paired t-test. Reject the null
hypothesis and conclude that the mean BMI
before surgery and the mean BMI after surgery are different.
The estimated mean difference is not close
to 0.
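Because the paired t-test works on within-subject differences, it should give the same t statistic and p-value as a one-sample t-test on the difference. A minimal sketch, where the data set bmi_diff and the variable diff are made-up names:

```sas
/* Compute the within-patient BMI change, then test whether its mean is 0 */
data bmi_diff;
    set mydata;
    diff = pre_bmi - post_bmi;   /* missing if either BMI is missing */
run;

proc ttest data=bmi_diff h0=0;
    var diff;
run;
```

Running both versions side by side is a simple way to see what the PAIRED statement does.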
Check the normality of the differences:
Distribution plot
Q-Q plot
In the distribution and Q-Q plots, the differences do not seriously deviate from the fitted
line. This indicates that the differences are approximately normally distributed and the test is valid.
ANOVA
We use the t-test to compare two groups and ANOVA to compare across N>2 groups.
We are still interested in population means:
Null hypothesis: 𝐻0: 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑁.
Alternative hypothesis: 𝐻𝑎: one or more of 𝜇1, 𝜇2, … , 𝜇𝑁 are different.
If the null hypothesis is true, we expect the sample means to be close together and the
between-group variance to be relatively small.
Key assumptions underlying the ANOVA:
• Groups are independent and randomly sampled
• Each group comes from a normally distributed population
• The population variances are equal across groups
The last assumption forms the basis for the analysis.
ANOVA test statistic is based on the ratio of between-group variance and within-group
variance.
If the ratio is “big”, we reject the null hypothesis and conclude the means differ.
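The ratio described above can be written out explicitly. In this sketch, N is the number of groups as in the hypotheses above, while n (total sample size), n_i (size of group i), and the y-bar symbols for the group means and the overall mean are notation introduced here for illustration.

```latex
F = \frac{\sum_{i=1}^{N} n_i (\bar{y}_i - \bar{y})^2 / (N - 1)}
         {\sum_{i=1}^{N} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 / (n - N)}
```

The numerator is the between-group variance and the denominator is the within-group variance, so group means that are far apart relative to the spread inside each group give a large F.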
To examine the difference of means of pre-surgery BMI across age groups:
proc glm data=mydata plots=diagnostics;
class agec;
model pre_bmi=agec;
means agec;
quit;
The following 2 options are not strictly necessary to fit the model:
• plots=diagnostics: requests a panel of summary diagnostics for the fit.
• means agec: requests means and standard deviations within each age group.
Recall the assumptions: we can check normality and equal variances through the residuals.
• Normality can be checked through the Q-Q plot. Most data points fall approximately along the reference line, so the normality assumption is approximately met.
• Equality of population SDs can be checked through the residual plot. The variances of the age groups are close to each other, so the assumption of homogeneity of variances is met.
Conclusion: Based on the p-value (0.0380<0.05), we reject the null hypothesis at the 5%
significance level and conclude that at least one of the age-group means is different from the others.
If the normality assumption is not met, use a nonparametric test (PROC NPAR1WAY): the Wilcoxon rank-sum test (two groups) or the Kruskal-Wallis test (more than two groups).
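A minimal sketch of the nonparametric alternative, using the same variables as the ANOVA example. With the three AGEC groups, the WILCOXON option produces the Kruskal-Wallis test.

```sas
/* Rank-based comparison of pre-surgery BMI across age groups */
proc npar1way data=mydata wilcoxon;
    class agec;
    var pre_bmi;
run;
```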
Bonferroni procedure
The ANOVA test only tells us whether at least one of the means is different; it doesn’t tell
us how the means differ from each other. In this workshop, we introduce the Bonferroni
adjustment procedure to test all possible pairwise comparisons.
The procedure works for a fixed set of 𝐼 null hypotheses to test or parameters to estimate.
In the previous hypothesis tests, we rejected the null hypothesis if the p-value was less than the
significance level (𝛼). The Bonferroni adjustment procedure instead runs each test
at level 𝛼/𝐼.
proc glm data=mydata;
class agec;
model pre_bmi=agec;
means agec / bon;
quit;
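As a worked example of the adjusted level, the three AGEC groups (18-29, 30-49, >=50) give three pairwise comparisons, so with the usual 5% overall level each comparison is tested at roughly 1.67%:

```latex
I = \binom{3}{2} = 3, \qquad \frac{\alpha}{I} = \frac{0.05}{3} \approx 0.0167
```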
Correlation Coefficient
Instead of comparing a continuous variable across groups, the correlation coefficient
measures the linear relationship between two continuous variables.
The Pearson’s Correlation Coefficient
Pearson’s correlation coefficient (r) measures the strength and direction (positive or negative)
of the linear relationship between two variables (e.g. X and Y).
−1 ≤ 𝑟 ≤ 1
If r=1 then Y increases with X along a perfect line.
If r=-1 then Y decreases with X along a perfect line.
If r=0 then Y and X are not linearly associated.
The closer r is to -1 or 1, the closer the points lie to a straight line.
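For reference, the sample version of the coefficient for n paired observations can be written as follows. The symbols x_i, y_i, and the bar notation for sample means are introduced here for illustration.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```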
For example, test the correlation between weight and BMI:
Null hypothesis: r=0;
Alternative hypothesis: r≠0.
proc corr data=mydata;
var WGT pre_BMI;
run;
r=0.80687, close to 1. The p-value is <.0001, so we reject the null
hypothesis and conclude that
weight and pre-surgery BMI have a positive
linear relationship.
To request the scatter plot, add the PLOTS=SCATTER request to the PROC CORR statement. By
default, the output includes the Pearson correlation and a prediction ellipse with 95% confidence.
proc corr data=mydata plots=scatter;
var WGT pre_BMI;
run;
The Spearman’s Rank-Correlation Coefficient
• Pearson’s correlation is highly sensitive to outliers or extreme values, which means it
is particularly sensitive to non-normality.
• Spearman’s rank correlation coefficient is a non-parametric alternative to Pearson’s correlation
coefficient. It can be used when data contain extreme values or are otherwise far from
normal, or when data are ordinal.
• To compute Spearman’s correlation coefficient, add the SPEARMAN option to the PROC CORR
statement.
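A minimal sketch of the Spearman request, reusing the variables from the Pearson example:

```sas
/* Rank-based correlation between weight and pre-surgery BMI */
proc corr data=mydata spearman;
    var WGT pre_BMI;
run;
```

When SPEARMAN is the only correlation option given, the Pearson table is not printed; adding the PEARSON option as well requests both.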
Result Export
• Data export: Choose “Export Data” under “File”, choose the corresponding library
and member name of the data, and follow the directions to save the data.
• Save SAS results: Right-click the table or figure you want to export, then export it as an Excel file or
save it as a picture.
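The menu-driven data export can also be scripted. The following is a sketch only: the output path is an assumption, test9 is used simply because it was created earlier in the workshop, and writing XLSX files depends on the installation supporting that format.

```sas
/* Assumed output path: adjust to your own folder */
proc export data=work.test9
    outfile="X:\test9.xlsx"
    dbms=xlsx
    replace;             /* overwrite the file if it already exists */
run;
```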
Summary
Examine Relationships between Variables:
Categorical data
• Large sample: Chi-square test.
• Small sample:
2*2 table: Fisher’s exact test;
r*k table: Chi-square test with Monte Carlo simulation.
Continuous data
• Compare the means of 2 populations:
2 independent groups: two-sample t-test;
2 paired groups: paired t-test.
• Compare the means across >=3 groups: ANOVA.
Multiple comparison of means across >=3 groups: Bonferroni adjustment.
• Linear relationship between 2 continuous variables: Pearson’s correlation coefficient.
Reference
1. PennState Eberly College of Science, Introduction to SAS, https://newonlinecourses.science.psu.edu/stat480/node/95/.