introduction to sas - stony brook medicine

1 of 81

Introduction to SAS

Xiaoyue Zhang, MS.

Biostatistician

Biostatistical Consulting Core

Department of Family, Population and Preventive Medicine

Stony Brook University

Wei Hou, Ph.D.

Research Associate Professor, Division of Epidemiology and Biostatistics

Department of Family, Population and Preventive Medicine

Adjunct Associate Professor, Department of Applied Mathematics and Statistics

Voluntary Faculty, Department of Pathology

Stony Brook University

January 14, 2019

Biostatistical Consulting Core (BCC)

In collaboration with Clinical Translational Science Center (CTSC) and

the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

2 of 81

SAS (Statistical Analysis System)

• SAS is a statistical software package that captures, stores, modifies, and presents data and performs a wide range of operations on it.

• SAS programs provide an "extraordinary range of data analysis and data management tasks" [1], but SAS is more difficult to learn and use than other statistical software.

Why SAS?

• Undisputed market leader in statistical analysis and modeling.

• Offers a huge array of statistical functions and a comprehensive user’s guide

(https://support.sas.com/documentation/cdl/en/statug/63962/PDF/default/statug.pdf).

• Stable, reliable and powerful.

• Widely used for regulatory submissions to the FDA.

[1] Acock, Alan C (November 2005). "SAS, Stata, SPSS: A Comparison". Journal of Marriage and Family. 67 (4): 1093–1095.


3 of 81

OUTLINE

Part I – Getting Started in SAS

• Using Virtual SINC Site

• SAS windows and SAS data

Part II – Data Processing

• Data Step Programming

• Select Variables/Observations

• Merge/Concatenate Data Sets

• Sort Data set

• IF-THEN/ELSE statement

Part III – Categorical Data

• Frequency tables

• Chi-square test

• Fisher’s exact test

Part IV – Continuous Data

• Descriptive statistics

• T-test

• ANOVA

• Correlation coefficient

4 of 81

Part I: Getting Started in SAS

5 of 81


Using Virtual SINC Site

The Virtual SINC Site provides a way for students, faculty, and staff members of SBU to

access site-licensed, academic software titles directly from their personal computers from

either on or off campus 24 hours a day, 7 days a week.

https://it.stonybrook.edu/services/virtual-sinc-site

Click “LAUNCH VIRTUAL SINC SITE”.


6 of 81



7 of 81



8 of 81



9 of 81


SAS Window Environment

Log window

Editor window

Output window

Results window

Explorer window

Store SAS data file

10 of 81



Return to the EXPLORER window

Store temporary dataset

11 of 81



The Results window contains a running record of the output. This figure displays the results from a PROC FREQ step, which is covered in Part III.

12 of 81

Data Input

Annotations on the example program:

• Execute the program

• Specify the new dataset

• Specify variable names in the new dataset

• Specify values for each variable

• End of the program

• Comments

$: a variable followed by a dollar sign is a character variable; otherwise it is numeric.
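The program shown in the screenshot is not reproduced in this text. A minimal sketch of the kind of DATA step these annotations describe follows; the dataset name (demo), the variables, and the values are made up for illustration and are not part of the workshop data.

```sas
/* Comment: create a small dataset by typing the values in directly.   */
/* The dataset name (demo) and its contents are hypothetical.          */
data demo;                 /* specify the new dataset                   */
   input ID SEX $ AGE;     /* variable names; $ marks SEX as character  */
   datalines;              /* the values for each variable follow       */
1 Female 34
2 Male 51
3 Female 27
;
run;                       /* end of the step; submit it to execute     */
```

After submitting the step, the dataset demo should appear in the WORK library in the Explorer window.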

13 of 81


Data Input

• Variables: columns in a SAS dataset.

• Observations: rows in a SAS dataset.

• Variable types:

Numeric: store numbers;

Character: contain text.

• Every SAS statement ends with a semicolon (;).

• Statements can begin in any column and a single statement can span multiple lines.

14 of 81


Example Data Used for This Workshop: Bariatric.xlsx

• Made-up bariatric surgery data.

• Sample size: 300 patients.

• Variables:

o ID: unique patient ID;

o Sex: categorical variable, Female vs Male;

o Age: continuous variable, ≥18;

o Race: categorical variable, White vs Black vs Asian;

o Height (HGT): continuous variable;

o Height unit (HGTUNIT): unit of height;

o Weight (WGT): continuous variable, weight before surgery;

o Weight unit (WGTUNIT): unit of weight;

o PRE_BMI: continuous variable, BMI before surgery;

o POST_BMI: continuous variable, BMI after surgery;

o Surgery type (SURG): categorical variable, Bypass vs Sleeve.

• THIS DATASET IS USED ONLY FOR THE SAS TUTORIAL; ANY CLINICAL RESULTS CONCLUDED FROM IT HAVE NO SCIENTIFIC BASIS.

15 of 81


16 of 81

Download the dataset from the BCC Education Series website:

https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education

and save it to your Stony Brook network drive (X: \\mysbfiles.campus.stonybrook.edu).



17 of 81

DATA IMPORT

18 of 81


Member name: the data set name used in SAS for the data you imported. In this

workshop, we set it as “mydata”.

CLICK FINISH
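The Import Wizard steps can also be written as code. A minimal sketch, assuming the workbook was saved as X:\Bariatric.xlsx (a hypothetical path; adjust it to where you saved the file) and that the XLSX engine (SAS/ACCESS Interface to PC Files) is available in your installation:

```sas
/* The file path below is an assumption; point it at your own copy. */
proc import datafile="X:\Bariatric.xlsx"
            out=work.mydata      /* member name used in this workshop */
            dbms=xlsx
            replace;             /* overwrite mydata if it exists     */
   getnames=yes;                 /* first row holds variable names    */
run;
```

As with the wizard, check the LOG window afterwards for the number of observations and variables that were read.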

19 of 81


20 of 81


PLEASE NOTE:

• Check the LOG window each time after running a program.

• Close the dataset before running a program that modifies it.

21 of 81


Part II: Data Processing

22 of 81

Data Step Programming

A new SAS data set can be created from an existing SAS data set with the DATA and SET statements.

DATA name of new SAS dataset;

SET name of existing SAS dataset;

<other statements;>

RUN;


23 of 81

Select Variables

Use the KEEP or DROP statement to control which variables are written to the new dataset.

data test1; set mydata;

keep ID SURG SEX AGE RACE;

run;


24 of 81

Select Variables


data test2; set mydata;

drop SURG SEX AGE RACE;

run;


25 of 81


Select Observations

Use the WHERE statement to select observations that meet a condition and save them into a new dataset.

data test3; set test1;

where SEX='Male';

run;

26 of 81

Select Observations

Use logical and comparison operators in the WHERE statement to select observations that meet the selection criteria.

o Comparison operators: = (equal); ^= (not equal); >; <; …

o Logical operators: And (&); Or (|); Not (^).

data test4; set test1;

where SEX='Female' & age<=50;

run;


27 of 81

Sort Data

Rearrange the observations of a data set with the SORT procedure, according to the variables named in the BY statement. Data can be sorted by more than one variable at a time.

proc sort data=test1;

by AGE;

run;


This is the simplest way to use the SORT procedure: it directly modifies the original data set, replacing it with the sorted version.

28 of 81


Sort Data

• To save the sorted data as a new data set instead of modifying the input data set, add the OUT= option to the PROC SORT statement. We recommend this option because it avoids altering the original data, which is irreversible.

proc sort data=test1 out=test5;

by AGE;

run;

• To sort data and remove observations with duplicate BY values, add the NODUPKEY option to PROC SORT: it automatically keeps the first observation it encounters for each BY value.

/* ATTENTION: CLOSE TEST5 BEFORE RUNNING THE FOLLOWING CODE */

proc sort data=test1 out=test5 nodupkey;

by AGE;

run;

Compare test5 with test1 and check the difference in the AGE variable.

Note: Use NODUPKEY with care. It is appropriate only when you are sure that no two valid records should share the same BY values.
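One way to see exactly what NODUPKEY removes is the DUPOUT= option of PROC SORT, which writes the discarded observations to a separate dataset. A minimal sketch; test5_dups is a hypothetical dataset name chosen for this illustration:

```sas
/* test5_dups is a hypothetical name for the dataset of removed rows. */
proc sort data=test1 out=test5 nodupkey dupout=test5_dups;
   by AGE;
run;

/* Inspect what was dropped before relying on the deduplicated data. */
proc print data=test5_dups;
run;
```

Because many patients share the same age, sorting by AGE with NODUPKEY discards most of the records, which is why the choice of BY variables matters.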

29 of 81


Sort Data

By default, SAS sorts data in ascending order. To reverse it, add the keyword DESCENDING to the BY statement before each variable that should be sorted from highest to lowest.

proc sort data=test2 out=test6; by descending HGT WGT; run;

Compare the sorted order of HGT and WGT:

HGT is in descending order.

WGT, which has no DESCENDING keyword, is in ascending order within each value of HGT.

30 of 81

Match-Merging

Horizontally combine observations from multiple data sets into a single observation in a new data set, matching observations with the MERGE and BY statements.

Input data sets used for a merge must have at least one common variable.

Input data sets must be sorted by the common variable(s) using the SORT procedure.

For example: merge data sets test5 and test6 by patients’ ID:

proc sort data=test5 out=test5_1; by ID; run;

proc sort data=test6 out=test6_1; by ID; run;

data test7; merge test5_1 test6_1;

by ID;

run;


31 of 81

Match-Merging

SAS automatically keeps every value of the common variable(s) found in any input data set, even when it has no match in the others. In such cases, the new data set assigns missing values to the variables that come from the data sets without a match.
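If only the matched observations are wanted, the IN= data set option can flag which input contributed to each observation. A minimal sketch; the flag names in5 and in6 and the output name test7_matched are hypothetical:

```sas
/* in5 and in6 are temporary 0/1 flags created by the IN= option;     */
/* test7_matched is a hypothetical name for the output dataset.       */
data test7_matched;
   merge test5_1(in=in5) test6_1(in=in6);
   by ID;
   if in5 and in6;   /* keep only IDs present in both input datasets */
run;
```

Changing the condition to `if in5;` would instead keep every ID from test5_1, matched or not.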


32 of 81

Concatenate Datasets

Vertically stack datasets one after the other.

No requirement for input datasets to have common variables.

If common variables exist, they must have the same type (numeric or character) [2].

The new dataset includes all variables found in the input datasets, even if a variable does not appear in every input. In such cases, the new dataset assigns missing values to those variables.

data test8; set test3 test4; run;

[2] Introduction to SAS Informats and Formats, https://support.sas.com/publishing/pubcat/chaps/59498.pdf



33 of 81


IF-THEN/ELSE Statement

Use IF-THEN/ELSE to modify observations that meet specific conditions. For example, based on patients’ ages, add a categorical variable named AGEC that divides patients into 3 groups: 18-29; 30-49; >=50.

data mydata; set mydata;

if age<30 then agec="18-29";

else if age<50 then agec="30-49";

else agec=">=50";

run;

An IF / ELSE IF / ELSE chain is a logical and efficient way to avoid overlap across levels. Alternatively, you can write separate IF statements, but then the condition for each level must be written out in full.

data mydata; set mydata;

if age<30 then agec="18-29";

if age>=30 & age<50 then agec="30-49";

if age>=50 then agec=">=50";

run;
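One caveat with conditions such as age<30: SAS treats a missing numeric value as smaller than any number, so a patient with a missing AGE would be labeled "18-29". A minimal sketch of a guard against this; the output name mydata_agec and the choice to leave AGEC blank for missing ages are assumptions made for illustration:

```sas
/* mydata_agec is a hypothetical output name; leaving AGEC blank for  */
/* patients with a missing AGE is an illustrative choice.             */
data mydata_agec;
   length agec $5;
   set mydata;
   if missing(age) then agec=" ";
   else if age<30 then agec="18-29";
   else if age<50 then agec="30-49";
   else agec=">=50";
run;
```

If AGE has no missing values, this step gives the same grouping as the code above.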

34 of 81

Create New Variable(s)

On the previous slide, we used AGE to create a new variable AGEC without declaring it. Its type, length, and other attributes were set automatically at the first place SAS encountered the variable.

AGEC is a character variable because it is assigned a string. Its length is set to 5 bytes based on the first value assigned to it (“18-29”).

If the content of AGEC for another observation is longer than 5 bytes, it is truncated and only the first 5 characters are saved.

To be safe, we recommend using a LENGTH statement before creating a new variable.


data test9; set mydata;

length agec_1 $30;

if age<30 then agec_1="Younger than 30";

else if age<50 then agec_1="Age between 30 to 49";

else agec_1="Older than or equal to 50";

run;

Delete the LENGTH statement and compare the output data with test9.
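To confirm the type and length that SAS actually assigned, PROC CONTENTS lists the attributes of every variable in a dataset. A short sketch, assuming the step above wrote its output to test9:

```sas
/* Lists each variable's type (Num/Char) and length in bytes. */
proc contents data=test9;
run;
```

Running it with and without the LENGTH statement shows how the length of agec_1 changes.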

35 of 81

Questions:

• How to generate descriptive table of patients’ demographic information by surgery type?

• How to analyze relationship between demographic information and surgery type?

• How to examine population means across groups?

• How to check the correlation between numeric variables?

36 of 81

Part III: Categorical Data

37 of 81


• The simplest categorical data tell us which of two categories a subject is in, e.g., Male or Female, Diseased or Non-Diseased. This type of data is called binary or dichotomous.

• Categorical data can also have more than 2 levels.

• In this section, we introduce the FREQ procedure in SAS to summarize and analyze categorical data. PROC FREQ is a descriptive and statistical procedure that produces one-way to n-way frequency and contingency tables. It can also perform analyses and statistical tests.

       Level 1: Female   Level 2: Male   Total

Sex    236               64              300

38 of 81


One-Way Frequency Table

proc freq data=mydata;

tables SEX;

run;

• proc freq: Initiate FREQ procedure.

• data: Specify dataset.

• tables: Specify variable for frequency table.

• run: End of the procedure.

39 of 81



Request a plot by adding the PLOTS= option to the TABLES statement. Separate the requested variables from the options with a slash (/).

proc freq data=mydata;

tables sex / plots=freqplot;

run;

40 of 81



Request multiple one-way frequency tables at one time for different categorical variables.

proc freq data=mydata;

tables agec race / plots=freqplot;

run;

41 of 81

Question: among the 236 female patients, how many had bypass? How many had sleeve?

Use a contingency table to see the joint frequency distribution of two variables.

Surgery   Female   Male   Total

Bypass    ?        ?

Sleeve    ?        ?

Total

42 of 81


Two-Way Frequency Table


proc freq data=mydata;

tables surg*sex;

run;

Based on the column percent (Col Pct): among 236 female patients, 130 (55.08%) had bypass and 106 (44.92%) had sleeve; among 64 male patients, 30 (46.88%) had bypass and 34 (53.13%) had sleeve.

Based on the row percent (Row Pct): among 160 bypass patients, 130 (81.25%) were female and 30 (18.75%) were male; among 140 sleeve patients, 106 (75.71%) were female and 34 (24.29%) were male.

43 of 81



• The FREQ procedure excludes observations with missing values from the table but displays the total frequency of missing observations below each table.

• SAS offers many options in the TABLES statement to control the information in the output table, and you can specify several at the same time:

NOCOL: suppresses display of column percentages.

NOROW: suppresses display of row percentages.

NOCUM: suppresses display of cumulative frequencies and percentages.

CUMCOL: displays cumulative column percentages.

…

proc freq data=mydata;

tables surg*sex / cumcol norow;

run;
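To count missing values as a category of their own instead of excluding them, the TABLES statement also accepts the MISSING option. A minimal sketch using the same variables:

```sas
proc freq data=mydata;
   /* MISSING treats missing values as a valid level, so they are   */
   /* shown in the table and included in the percentages.           */
   tables surg*sex / missing norow nocol;
run;
```

If neither variable has missing values, the table is the same as the one without the option.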

44 of 81



Generate a bar chart to compare the proportions by group.

proc freq data=mydata;

tables surg*sex / plots=FreqPlot(type=barchart groupby=column twoway=cluster scale=grouppercent);

run;

Options of FREQPLOT:

• type=barchart: request a bar chart.

• groupby=column: show distributions by the column variable. Available options: column, row.

• twoway=cluster: display the bars from the same group side by side in one figure. Available options: cluster, stacked.

• scale=grouppercent: display group percentages. Available options: grouppercent, freq, percent, log, sqrt.

45 of 81



46 of 81


Hypothesis Test

Based on the previous contingency table, 55.08% of female patients had bypass, while 46.88% of male patients had bypass. It seems that females are more likely than males to have bypass rather than sleeve.

How can we compare these two proportions to see whether the difference is significant? Is there any association between bariatric surgery type and patient’s gender?

We wish to show that there is a difference between genders in the choice of bariatric surgery type, and we test this hypothesis.

A hypothesis is an educated guess; it should be testable and quantifiable, by evidence or data.

47 of 81


Hypothesis Test

A hypothesis test works like a proof by contradiction.

We assume that the two proportions are the same (null hypothesis) and look for evidence in the data that casts doubt on this assumption. If the evidence is strong enough, we conclude the opposite (alternative hypothesis).

For example:

π1: the proportion of female patients who had bypass.

π2: the proportion of male patients who had bypass.

Null hypothesis is what we are trying to disprove: 𝐻0: π1 = π2.

Alternative hypothesis is what we are trying to show is true: 𝐻𝑎: π1 ≠ π2.

48 of 81


Steps of Hypothesis Testing

1. State the null (𝑯𝟎) and alternative (𝑯𝒂) hypotheses.

2. Choose a significance level 𝜶, usually 0.05.

3. Based on the sample, calculate the test statistic, p-value and confidence interval from the theoretical distribution of the test statistic.

4. Compare the p-value with the significance level.

If P-value < 𝛼, reject the null hypothesis.

If P-value ≥ 𝛼, fail to reject the null hypothesis.

5. Make a decision (reject or fail to reject the null hypothesis) and state your conclusion.

Note: Failing to support the alternative hypothesis does not mean that the null hypothesis is true.

49 of 81


Chi-Square Test

The chi-square test of independence can be used to examine the relationship between two

categorical variables. The frequency of each category for one categorical variable is compared

across the categories of the second categorical variable.

Each variable can have more than 2 categories; that is, the chi-square test can be applied to r*k contingency tables, where r>2 and/or k>2.

To test the hypothesis that there is a difference between genders in the choice of bariatric surgery type:

Null hypothesis: π1 = π2.

Alternative hypothesis: π1 ≠ π2.

π1: the proportion of female patients who had bypass.

π2: the proportion of male patients who had bypass.

50 of 81


Chi-Square Test

Use PROC FREQ to perform the chi-square test: add the CHISQ option to the TABLES statement.

proc freq data=mydata;

tables surg*sex / chisq;

run;

Based on the p-value, we fail to reject the null hypothesis at the 5% significance level (0.2429>0.05) and conclude that there is no significant difference between the proportions of bypass patients among female and male patients.

P-value from the Chi-Square test
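The same test can be run from the four cell counts alone, which is useful when only a published table is available. A minimal sketch using the counts from the two-way table above; the dataset name surg_sex_counts and the variable COUNT are hypothetical names introduced here:

```sas
/* Cell counts copied from the two-way table of SURG by SEX.         */
/* surg_sex_counts and COUNT are hypothetical names.                 */
data surg_sex_counts;
   input SURG $ SEX $ COUNT;
   datalines;
Bypass Female 130
Bypass Male 30
Sleeve Female 106
Sleeve Male 34
;
run;

proc freq data=surg_sex_counts;
   weight COUNT;                       /* each row stands for COUNT patients */
   tables SURG*SEX / chisq expected;   /* EXPECTED prints expected counts    */
run;
```

Working by hand from these counts gives a chi-square statistic of about 1.36 on 1 degree of freedom, which is consistent with the p-value of 0.2429 reported above.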

51 of 81


Small Sample Situation

Note that the chi-square test is not suitable when the sample size of a contingency table is small, for example when the expected count in any cell of the table is less than 5.

For a 2*2 table with a small sample size, use Fisher’s exact test.

For an r*k table with a small sample size, use the chi-square test with a p-value from Monte Carlo simulation.

For 2*2 tables, the CHISQ option automatically provides the Fisher’s exact test result. For r*k tables, you can request Monte Carlo simulation by adding an EXACT statement.

52 of 81


Monte Carlo simulation

Test the relationship between surgery type and race.


proc freq data=mydata;

tables surg*race;

exact chisq / mc;

run;

Two cells have small values, and a WARNING states that the result from the chi-square test may not be valid.
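Because a Monte Carlo p-value is estimated from random samples, it can change slightly from run to run. A minimal sketch that fixes the random seed and the number of samples so the result is reproducible; the seed value is arbitrary and chosen only for illustration:

```sas
proc freq data=mydata;
   tables surg*race;
   /* SEED= fixes the random number stream; N= sets the number of   */
   /* Monte Carlo samples. The seed value here is arbitrary.        */
   exact chisq / mc seed=20190114 n=10000;
run;
```

A larger N= gives a more precise estimate at the cost of a longer run time.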

53 of 81

Monte Carlo simulation

P-value from the chi-square test based on Monte Carlo simulation

Conclusion: Based on the p-value, we can reject the null hypothesis at the 5% significance level and conclude that there is a significant difference across races in the choice of bariatric surgery type.

54 of 81

Part IV: Continuous Data

55 of 81


In our sample dataset, height, weight, pre-surgery BMI and post-surgery BMI are continuous

data.

o What are the mean, median, minimum, maximum of these variables?

o How dispersed are these variables?

o Is there any difference in pre-surgery BMI across age/race/gender groups?

o Is there any difference between pre-surgery and post-surgery BMI?

o What is the correlation between pre-surgery and post-surgery BMI?

In this section, we first introduce how to summarize continuous data and calculate descriptive statistics, such as the mean, median, and minimum/maximum. Next, we introduce how to examine the relationships between variables by measuring correlation and conducting t-tests and ANOVA.

56 of 81


Descriptive Statistics

Use PROC MEANS to summarize the values of numeric variables. By default, this procedure calculates 5 statistics: N, mean, standard deviation, minimum, and maximum.

proc means data=mydata;

var pre_BMI;

run;

Note that the number of patients with pre-surgery BMI information is N=298. Based on our previous analyses of categorical variables, there should be 300 patient records in total. Thus 2 patients have no pre-surgery BMI value. How can we output the missing count?

57 of 81


Descriptive Statistics

To calculate other descriptive statistics, add keywords to the PROC MEANS statement: MEDIAN, NMISS, Q1, Q3, QRANGE, …

To analyze data by groups, specify the group variable in the CLASS statement.

proc means data=mydata nmiss n q1 median q3 max qrange;

var pre_BMI;

class SEX;

run;

proc means data=mydata nmiss n q1 median q3 max qrange;

var pre_BMI;

run;

58 of 81


Box Plot, Histogram

To request a box plot and histogram of a continuous variable, height for example, use the UNIVARIATE procedure and add the PLOT option.

proc univariate data=mydata plot;

var HGT;

run;
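PROC UNIVARIATE also has a HISTOGRAM statement, which gives more control over the histogram than the PLOT option. A minimal sketch that overlays a fitted normal curve on the histogram of the same variable:

```sas
proc univariate data=mydata;
   var HGT;
   histogram HGT / normal;   /* overlay a fitted normal density curve */
run;
```

Comparing the bars with the fitted curve is a quick visual check of how close the variable is to a normal distribution.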

59 of 81


Box Plot, Histogram

To request plots of a continuous variable by group, say race, first sort the data by the group variable. Next, name the group variable in a BY statement in PROC UNIVARIATE. The procedure defines a BY group as a set of contiguous observations that have the same value of the BY variable. Each level of the group variable gets its own set of output, listed after the name of the level, and a grouped box plot appears at the end of the results.

proc sort data=mydata out=mydata2; by race; run;

proc univariate data=mydata2 plot;

var pre_bmi;

by race;

run;

60 of 81


Box Plot, Histogram

61 of 81


Two-Sample T-test

Used to compare a continuous variable between two populations or two groups of a categorical variable.

It determines whether there is a significant difference between the means of the 2 samples (independent groups).

Key assumptions underlying the two-sample t-test:

• Sample size large enough (n1>30 and n2>30, say);

• Randomly sampled data;

• Two populations/groups are independent;

• If sample sizes are small, data from each population need to be approximately normal.

62 of 81


Two-Sample T-test

Examine whether pre-surgery BMI is different between gender groups:

Null hypothesis: 𝜇1 = 𝜇2.

Alternative hypothesis: 𝜇1 ≠ 𝜇2.

𝜇1: the mean of pre-surgery BMI of female patients.

𝜇2: the mean of pre-surgery BMI of male patients.

By default, SAS will output the two-sided test result:

proc ttest data=mydata;

var pre_BMI;

class sex;

run;

63 of 81

Two-Sample T-test


Check the homogeneity of variances of the 2 groups based on an F-test. Since p-value<0.05, reject the null hypothesis and conclude that the variances are not equal. For data with unequal variances, use Satterthwaite’s method.

P-values of the two-sample t-test. Fail to reject the null hypothesis and conclude that the mean pre-surgery BMI does not differ significantly between genders.

The difference between the two means is close to 0 and the confidence interval includes 0.

64 of 81

Two-Sample T-test

Recommendation:

“For the problem of testing the equality of means from two independent normally

distributed populations where the ratio of the variances is unknown, directly apply

Satterthwaite's Approximate F test without using any preliminary variance test. ”

[Reference]: Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite's approximate F

test. Communications in Statistics-Theory and Methods, 18(11), 3963-3975.
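For reference, the unequal-variance statistic (Satterthwaite's approximation, also known as Welch's t-test) can be written as follows. The symbols are notation introduced here, not taken from the slides: x̄ is a sample mean, s² a sample variance, n a sample size, and ν the approximate degrees of freedom.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1}
      + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
```

Unlike the pooled test, this form does not assume that the two groups share a common variance.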


65 of 81


Two-Sample T-test

Check the assumption of normality:

Distribution plots.

Q-Q plots.

The male data do not deviate seriously from the fitted line, but the female data are skewed to the right. Since the sample size is large, the t-test is still valid.
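Normality can also be examined group by group with PROC UNIVARIATE. A minimal sketch, not the method used to produce the plots above, that requests normality tests and a Q-Q plot of pre-surgery BMI for each sex:

```sas
proc univariate data=mydata normal;   /* NORMAL requests tests for normality   */
   class SEX;                         /* separate results for each group       */
   var pre_BMI;
   qqplot pre_BMI / normal(mu=est sigma=est);   /* reference line from the data */
run;
```

With large samples, formal normality tests can flag small departures that have little effect on the t-test, so the plots are usually the more useful guide.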

66 of 81


Two-Sample T-test

To specify a lower or upper one-sided test, add the SIDES= option to the PROC TTEST statement.

• Lower one-sided test:


Alternative hypothesis: 𝜇1 < 𝜇2.

proc ttest data=mydata sides=L;

var pre_BMI;

class sex;

run;

• Upper one-sided test:


Alternative hypothesis: 𝜇1 > 𝜇2.

proc ttest data=mydata sides=U;

var pre_BMI;

class sex;

run;

67 of 81


Paired T-test

• Paired data arise when two measurements are taken from the same subject under different experimental conditions (e.g., before and after treatment).

• Both measurements must be available for the same subjects.

• The analysis focuses on the difference in response between the two conditions.

• Validity of the paired t-test:

Paired observations;

Sample size large enough (n>30, say);

If the sample size is small, the differences must be approximately normal.

68 of 81


Paired T-test

Examine the BMI difference before and after surgery.

Null hypothesis: 𝜇𝐷=0;

Alternative hypothesis: 𝜇𝐷 ≠ 0;

𝜇𝐷: the mean difference of BMI before and after the surgery for the entire population.

proc ttest data=mydata;

paired pre_bmi*post_bmi;

run;

P-values from the paired t-test. Reject the null

hypothesis and conclude that the means of BMI

before surgery and after surgery are different.

The mean difference for the

entire population is not close

to 0.
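A paired t-test is equivalent to a one-sample t-test on the within-patient differences. A minimal sketch of that equivalent form; the dataset name bmi_diff and the variable DIFF are hypothetical names introduced here:

```sas
/* bmi_diff and DIFF are hypothetical names used for this sketch. */
data bmi_diff; set mydata;
   DIFF = pre_bmi - post_bmi;   /* missing if either BMI is missing */
run;

proc ttest data=bmi_diff h0=0;  /* test whether the mean difference is 0 */
   var DIFF;
run;
```

The t statistic and p-value should match those from the PAIRED statement, since both analyze the same differences.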

69 of 81


Paired T-test

Check the normality of differences:

Distribution Plot.

Q-Q Plot

The distribution and Q-Q plots show that the differences do not deviate seriously from the fitted line. They indicate that the differences are approximately normally distributed and the test is valid.

70 of 81


ANOVA

We use the t-test to compare two groups and ANOVA to compare across N>2 groups.

We are still interested in population means:

Null hypothesis: 𝐻0: 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑁.

Alternative hypothesis: 𝐻𝑎: one or more of 𝜇1, 𝜇2, … , 𝜇𝑁 are different.

If the null hypothesis is true, we expect the sample means to be close together and the

between-group variance to be relatively small.

Key assumptions underlying the ANOVA:

• Groups are independent and randomly sampled

• Each group comes from a normally distributed population

• The population variances are equal across groups

The last assumption forms the basis for the analysis.

ANOVA test statistic is based on the ratio of between-group variance and within-group

variance.

If the ratio is “big”, we reject the null hypothesis and conclude the means differ.

71 of 81

ANOVA

To examine the difference in mean pre-surgery BMI across age groups:

proc glm data=mydata plots=diagnostics;

class agec;

model pre_bmi=agec;

means agec;

quit;

The following 2 options are not strictly necessary to fit the model:

• plots=diagnostics: request a panel of summary diagnostics for the fit.

• means agec: request means and standard deviations within each age group.


72 of 81

ANOVA

Recall the assumptions: we can check normality and equality of variances through the residuals.

Normality can be checked through the Q-Q plot.

Equality of population SDs can be checked through the residual plot.

Most data points fall approximately along the reference line -> the normality assumption is approximately met.

The variances of the age groups are close to each other -> the assumption of homogeneity of variances is met.

73 of 81


ANOVA

Conclusion: Based on the p-value (0.0380<0.05), we can reject the null hypothesis at the 5% significance level and conclude that at least one of the age-group means differs from the others.

What if the normality assumption is not met? Use a nonparametric test (PROC NPAR1WAY): Wilcoxon’s rank-sum test, Kruskal-Wallis test.
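A minimal sketch of the nonparametric alternative with PROC NPAR1WAY, using the same variables as the ANOVA. With more than two groups, the WILCOXON option reports the Kruskal-Wallis test:

```sas
proc npar1way data=mydata wilcoxon;
   class agec;      /* grouping variable */
   var pre_bmi;     /* response variable */
run;
```

These tests compare the groups using ranks, so they do not require the data to be normally distributed.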

74 of 81


Bonferroni procedure

The ANOVA test only tells us whether at least one of the means is different; it does not tell us how the means differ from each other. In this workshop, we introduce the Bonferroni adjustment procedure to test all possible pairwise comparisons.

The procedure works for a fixed set of I null hypotheses to test or parameters to estimate.

In the previous hypothesis tests, we rejected the null hypothesis if the p-value was less than the significance level (𝛼). The Bonferroni adjustment procedure instead performs each of the I tests at level 𝛼/𝐼.

proc glm data=mydata;

class agec;

model pre_bmi=agec;

means agec / bon;

quit;
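With the 3 age groups there are I = 3 pairwise comparisons, so each comparison is tested at 0.05/3 ≈ 0.0167. An alternative sketch, not the code from the slides, uses the LSMEANS statement to print a Bonferroni-adjusted p-value for every pair of groups:

```sas
proc glm data=mydata;
   class agec;
   model pre_bmi=agec;
   /* PDIFF requests all pairwise differences; ADJUST=BON applies */
   /* the Bonferroni adjustment to their p-values.                */
   lsmeans agec / pdiff adjust=bon;
quit;
```

The adjusted p-values are compared directly with 𝛼 = 0.05, because the adjustment is already built into them.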

75 of 81

Correlation Coefficient

Instead of measuring the difference in a continuous variable between groups, the correlation coefficient measures the linear relationship between two continuous variables.


76 of 81

The Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (r) measures the strength and direction (positive or negative) of the linear relationship between two variables (e.g., X and Y).

−1 ≤ 𝑟 ≤ 1

If r=1 then Y increases with X along a perfect line.

If r=-1 then Y decreases with X along a perfect line.

If r=0 then Y and X are not linearly associated.

The closer r is to -1 or 1, the closer the points lie to a straight line.
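For reference, the sample Pearson correlation for n pairs of observations can be written as follows. The notation is introduced here, not taken from the slides: x̄ and ȳ are the sample means of X and Y.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
```

The numerator is positive when X and Y tend to be above or below their means together, which is why the sign of r gives the direction of the relationship.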


77 of 81


For example, test the correlation between weight and pre-surgery BMI, where ρ is the population correlation estimated by r:

Null hypothesis: ρ=0;

Alternative hypothesis: ρ≠0.

proc corr data=mydata;

var WGT pre_BMI;

run;


r=0.80687, close to 1. p-value<.0001: reject the null hypothesis and conclude that weight and BMI have a positive linear relationship.

78 of 81


To request a scatter plot, add the PLOTS=SCATTER option to the PROC CORR statement. The default output includes the Pearson correlation and a prediction ellipse with 95% confidence.

proc corr data=mydata plots=scatter;

var WGT pre_BMI;

run;


79 of 81

The Spearman’s Rank-Correlation Coefficient

• Pearson’s correlation is highly sensitive to outliers or extreme values, which means it is particularly sensitive to non-normality.

• Spearman’s rank coefficient is a non-parametric alternative to Pearson’s correlation coefficient, which can be used when the data contain extreme values or are otherwise far from normal, or when the data are ordinal.

• To compute Spearman’s correlation coefficient, add the SPEARMAN option to the PROC CORR statement.
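A minimal sketch of the Spearman request, using the same two variables as the Pearson example:

```sas
proc corr data=mydata spearman;
   var WGT pre_BMI;
run;
```

Adding PEARSON alongside SPEARMAN in the PROC CORR statement prints both coefficients for comparison.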


80 of 81

Result Export

• Data Export: Choose “Export Data” under “File”, choose the library and member name of the data, and follow the directions to save the data.

• Save SAS results: Right-click the table or figure you want to export, then export it as an Excel file or save it as a picture.
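The data export can also be scripted with PROC EXPORT. A minimal sketch, assuming the XLSX engine is available; the output path and file name are hypothetical and should be replaced with your own:

```sas
/* The output path below is an assumption; change it to your own folder. */
proc export data=work.mydata
            outfile="X:\mydata_export.xlsx"
            dbms=xlsx
            replace;        /* overwrite the file if it already exists */
run;
```

A scripted export is easier to repeat than the menu steps when the data are updated.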

81 of 81

Summary

Examine Relationships between Variables: Categorical data

• Chi-square test for large sample.

• Small sample:

2*2 table: Fisher’s exact test;

r*k table: Chi-square test with Monte Carlo simulation.

Continuous data

• Compare the means of 2 populations:

2 independent groups: two-sample t-test;

2 paired groups: paired t-test.

• Compare the means across >=3 groups: ANOVA.

Multiple comparison of means across >=3 groups: Bonferroni adjustment.

• Linear relationship between 2 continuous variables: Pearson’s correlation coefficient.

Reference

1. PennState Eberly College of Science, Introduction to SAS, https://newonlinecourses.science.psu.edu/stat480/node/95/.

