Page 1: Introduction to SAS - Stony Brook Medicine

1 of 81

Introduction to SAS

Xiaoyue Zhang, MS

Biostatistician

Biostatistical Consulting Core

Department of Family, Population and Preventive Medicine

Stony Brook University

Wei Hou, Ph.D.

Research Associate Professor, Division of Epidemiology and Biostatistics

Department of Family, Population and Preventive Medicine

Adjunct Associate Professor, Department of Applied Mathematics and Statistics

Voluntary Faculty, Department of Pathology

Stony Brook University

January 14, 2019

Biostatistical Consulting Core (BCC)

In collaboration with Clinical Translational Science Center (CTSC) and

the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

Page 2: Introduction to SAS - Stony Brook Medicine

2 of 81

SAS (Statistical Analysis System)

• SAS is a statistical software package that captures, stores, modifies, and presents data and

performs various operations on them.

• SAS programs provide an "extraordinary range of data analysis and data management tasks

[1]," but SAS is more difficult to learn and use than other statistical software.

Why SAS?

• Widely regarded as a market leader in statistical analysis and modeling.

• Offers a huge array of statistical procedures and a comprehensive user’s guide

(https://support.sas.com/documentation/cdl/en/statug/63962/PDF/default/statug.pdf).

• Stable, reliable and powerful.

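• Standard in FDA regulatory submissions: the FDA requires study datasets in the SAS transport (XPORT) file format, although it does not mandate SAS for the analysis itself.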

[1] Acock, Alan C (November 2005). "SAS, Stata, SPSS: A Comparison". Journal of Marriage and Family. 67 (4): 1093–1095.

Page 3: Introduction to SAS - Stony Brook Medicine

3 of 81

OUTLINE

Part I – Getting Started in SAS

• Using Virtual SINC Site

• SAS windows and SAS data

Part II – Data Processing

• Data Step Programming

• Select Variables/Observations

• Merge/Concatenate Data Sets

• Sort Data set

• IF-THEN/ELSE statement

Part III – Categorical Data

• Frequency tables

• Chi-square test

• Fisher’s exact test

Part IV – Continuous Data

• Descriptive statistics

• T-test

• ANOVA

• Correlation coefficient

Page 4: Introduction to SAS - Stony Brook Medicine

4 of 81

Part I: Getting Started in SAS

Page 5: Introduction to SAS - Stony Brook Medicine

5 of 81

PART I: GETTING STARTED IN SAS

Using Virtual SINC Site

The Virtual SINC Site provides a way for students, faculty, and staff members of SBU to

access site-licensed, academic software titles directly from their personal computers from

either on or off campus 24 hours a day, 7 days a week.

https://it.stonybrook.edu/services/virtual-sinc-site

Click “LAUNCH VIRTUAL SINC SITE”.

Page 9: Introduction to SAS - Stony Brook Medicine

9 of 81

PART I: GETTING STARTED IN SAS

SAS Window Environment

[Figure: the SAS window environment, with labels for the Log window, Editor window, Output window, Results window, and Explorer window, and for where SAS data files are stored.]

Page 10: Introduction to SAS - Stony Brook Medicine

10 of 81

PART I: GETTING STARTED IN SAS

SAS Window Environment

[Figure: the Explorer window, with labels showing how to return to the Explorer window and where temporary datasets are stored (the WORK library).]

Page 11: Introduction to SAS - Stony Brook Medicine

11 of 81

PART I: GETTING STARTED IN SAS

SAS Window Environment

The Results window contains a running record of the output. This figure

displays the results of the PROC FREQ step that is shown in Part III.

Page 12: Introduction to SAS - Stony Brook Medicine

12 of 81

PART I: GETTING STARTED IN SAS

Data Input

[Figure: annotated SAS program for data input. The annotations mark:]

• Comments

• Specify the new dataset

• Specify variable names in the new dataset

• Specify values for each variable

• End of the program

• Execute the program

$: a variable name followed by a dollar sign holds character data; otherwise the variable is numeric.
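The annotated program on this slide survives only as its labels. A minimal sketch of such a data-input program is given here, assuming a made-up dataset named demo with three illustrative variables; the dataset name and all values are hypothetical and are not taken from the workshop data.

```sas
/* Hypothetical example: the dataset name DEMO and all values are made up. */
data demo;                      /* specify the new dataset */
   input ID SEX $ AGE;          /* variable names; $ marks SEX as character */
   datalines;
1 Female 34
2 Male 51
3 Female 45
;
run;                            /* execute the program */

proc print data=demo;           /* list the observations to check the input */
run;
```

The lone semicolon after the data lines marks the end of the values, and RUN submits the step. Checking the LOG window afterwards should confirm how many observations and variables were read.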

Page 13: Introduction to SAS - Stony Brook Medicine

13 of 81

PART I: GETTING STARTED IN SAS

Data Input

• Variables: columns in a SAS dataset.

• Observations: rows in SAS dataset.

• Variable types:

Numeric: store numbers;

Character: contain text.

• Every SAS statement ends with a semicolon (;).

• Statements can begin in any column and a single statement can span multiple lines.

Page 14: Introduction to SAS - Stony Brook Medicine

14 of 81

PART I: GETTING STARTED IN SAS

Example Data Used for This Workshop: Bariatric.xlsx

• Made-up bariatric surgery data.

• Sample size: 300 patients.

• Variables:

o ID: unique patient ID;

o Sex: categorical variable, Female vs Male;

o Age: continuous variable, ≥18;

o Race: categorical variable, White vs Black vs Asian;

o Height (HGT): continuous variable;

o Height unit (HGTUNIT): unit of height;

o Weight (WGT): continuous variable, weight before surgery;

o Weight unit (WGTUNIT): unit of weight;

o PRE_BMI: continuous variable, BMI before surgery;

o POST_BMI: continuous variable, BMI after surgery;

o Surgery type (SURG): categorical variable, Bypass vs Sleeve.

• THIS DATASET IS USED ONLY FOR THE SAS TUTORIAL; ANY CLINICAL RESULTS

DRAWN FROM IT HAVE NO SCIENTIFIC BASIS.

Page 16: Introduction to SAS - Stony Brook Medicine

16 of 81

Download the dataset from the BCC Education Series website:

https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education

and save it to your Stony Brook network drive (X: \\mysbfiles.campus.stonybrook.edu).

PART I: GETTING STARTED IN SAS

Page 17: Introduction to SAS - Stony Brook Medicine

17 of 81

DATA IMPORT

Page 18: Introduction to SAS - Stony Brook Medicine

18 of 81

PART I: GETTING STARTED IN SAS

Member name: the name SAS will use for the imported data set. In this

workshop, we set it to “mydata”.

Click Finish.

Page 20: Introduction to SAS - Stony Brook Medicine

20 of 81

PART I: GETTING STARTED IN SAS

PLEASE NOTE:

• Check the LOG window each time you run a program.

• Close any open view of a dataset before running a program that modifies it.

Page 21: Introduction to SAS - Stony Brook Medicine

21 of 81

PART II: DATA PROCESSING

Part II: Data Processing

Page 22: Introduction to SAS - Stony Brook Medicine

22 of 81

Data Step Programming

A new SAS data set can be created from an existing SAS data set using the DATA and

SET statements.

DATA <name of new SAS dataset>;

SET <name of existing SAS dataset>;

<other statements;>

RUN;

PART II: DATA PROCESSING

Page 23: Introduction to SAS - Stony Brook Medicine

23 of 81

Select Variables

Use the KEEP or DROP statement to control which variables are written to the new dataset.

data test1; set mydata;

keep ID SURG SEX AGE RACE;

run;

PART II: DATA PROCESSING

Page 24: Introduction to SAS - Stony Brook Medicine

24 of 81

Select Variables

data test2; set mydata;

drop SURG SEX AGE RACE;

run;

PART II: DATA PROCESSING

Page 25: Introduction to SAS - Stony Brook Medicine

25 of 81

PART II: DATA PROCESSING

Select Observations

Use the WHERE statement to select the observations that meet a given condition and save them in a

new dataset.

data test3; set test1;

where SEX='Male';

run;

Page 26: Introduction to SAS - Stony Brook Medicine

26 of 81

Select Observations

Use comparison and logical operators in the WHERE statement to select observations

that meet the selection criteria.

o Comparison operators: =; ^= (not equal); >; <; …

o Logical operators: And (&); Or (|); Not (^).

data test4; set test1;

where SEX='Female' & age<=50;

run;

PART II: DATA PROCESSING

Page 27: Introduction to SAS - Stony Brook Medicine

27 of 81

Sort Data

Rearrange the observations of a data set with the SORT procedure, according to the variables

named in the BY statement. Data can be sorted on more than one variable at a time.

proc sort data=test1;

by AGE;

run;

PART II: DATA PROCESSING

This is the simplest way to use the

SORT procedure: it overwrites

the original data set with

the sorted version.

Page 28: Introduction to SAS - Stony Brook Medicine

28 of 81

PART II: DATA PROCESSING

Sort Data

• To save the sorted data as a new data set instead of modifying the input data set, add the

OUT= option to the PROC SORT statement. We recommend this option because it avoids

altering the original data; an in-place sort cannot be undone.

proc sort data=test1 out=test5;

by AGE;

run;

• To keep only one observation for each value of the BY variables, add the NODUPKEY option to PROC

SORT: it keeps the first observation it encounters for each BY value and discards the rest.

/* ATTENTION: CLOSE TEST5 BEFORE RUNNING THE FOLLOWING CODE */

proc sort data=test1 out=test5 nodupkey;

by AGE;

run;

Compare test5 with test1 and check the difference in the AGE variable.

Note: use NODUPKEY only when you are sure that no duplicate

records should exist for the BY variables.
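One way to see exactly what NODUPKEY discarded is the DUPOUT= option of PROC SORT, which writes the removed observations to a separate data set. The sketch below assumes test1 exists as created earlier; the data set name test5_dups is hypothetical and chosen only for illustration.

```sas
/* test5_dups is a hypothetical dataset name chosen for this illustration. */
proc sort data=test1 out=test5 nodupkey dupout=test5_dups;
   by AGE;
run;

/* Inspect the observations that NODUPKEY removed. */
proc print data=test5_dups;
run;
```

Reviewing the discarded rows before relying on the deduplicated data is a cheap safeguard against dropping records that were not true duplicates.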

Page 29: Introduction to SAS - Stony Brook Medicine

29 of 81

PART II: DATA PROCESSING

Sort Data

By default, SAS sorts data in ascending order. To reverse it, add the keyword DESCENDING to

the BY statement before each variable that should be sorted from highest to lowest.

proc sort data=test2 out=test6; by descending HGT WGT; run;

Compare the sorted order of HGT and WGT:

HGT is in descending order.

WGT is in ascending order (the default) within each value of HGT.

Page 30: Introduction to SAS - Stony Brook Medicine

30 of 81

Match-Merging

Horizontally combine observations from multiple data sets into single observations

in a new data set, matching them on common variables with the MERGE and BY

statements.

The input data sets must have at least one common variable.

The input data sets must be sorted by the common variable(s) using the SORT

procedure.

For example, merge data sets test5 and test6 by patient ID:

proc sort data=test5 out=test5_1; by ID; run;

proc sort data=test6 out=test6_1; by ID; run;

data test7; merge test5_1 test6_1;

by ID;

run;

PART II: DATA PROCESSING

Page 31: Introduction to SAS - Stony Brook Medicine

31 of 81

Match-Merging

SAS automatically keeps every observation from each input data set, even when its BY

value has no match in the other data set. In such cases, the variables that come only from

the other data set are set to missing in the new data set.
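If only the matched observations are wanted, the IN= data set option can flag which input contributed to each merged row. This is a sketch under the assumption that test5_1 and test6_1 were created and sorted by ID as on the previous slide; the names in5, in6, and test7_matched are hypothetical.

```sas
/* in5, in6 and test7_matched are hypothetical names for this illustration. */
data test7_matched;
   merge test5_1(in=in5) test6_1(in=in6);
   by ID;
   if in5 and in6;   /* subsetting IF: keep only IDs found in both inputs */
run;
```

Changing the condition to `if in5;` would instead keep every ID from test5_1, whether or not it has a match in test6_1.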

PART II: DATA PROCESSING

Page 32: Introduction to SAS - Stony Brook Medicine

32 of 81

Concatenate Datasets

Vertically stack datasets one after the other.

The input datasets are not required to have common variables.

If common variables exist, they must have the same type (numeric or

character) [2].

The new dataset includes all variables found in the input datasets, even

if they do not appear in every input. In such cases, observations from a dataset that lacks a variable get missing values for

that variable.

data test8; set test3 test4; run;

[2] Introduction to SAS Informats and Formats, https://support.sas.com/publishing/pubcat/chaps/59498.pdf

PART II: DATA PROCESSING

Page 33: Introduction to SAS - Stony Brook Medicine

33 of 81

PART II: DATA PROCESSING

IF-THEN/ELSE Statement

Use IF-THEN/ELSE to modify observations that meet specific conditions. For example, based on patients’ ages, add a

categorical variable named AGEC that divides patients into 3 groups: 18-29; 30-49;

>=50.

data mydata; set mydata;

if age<30 then agec="18-29";

else if age<50 then agec="30-49";

else agec=">=50";

run;

An IF / ELSE IF / ELSE chain is a logical and efficient way to avoid overlap between levels.

Alternatively, you can write separate IF statements, but then you must spell out the full

condition for each level.

data mydata; set mydata;

if age<30 then agec="18-29";

if age>=30 & age<50 then agec="30-49";

if age>=50 then agec=">=50";

run;

Page 34: Introduction to SAS - Stony Brook Medicine

34 of 81

Create New Variable(s)

On the previous slide, we used AGE to create a new variable AGEC without declaring the

new variable. Its type, length, and other attributes were set automatically at the first

place SAS encountered the variable.

AGEC is a character variable because it is assigned a string. Its length is set to 5 bytes based on the first value it is assigned (“18-29”).

If the value of AGEC for another observation is longer than 5 bytes, it is truncated and only

the first 5 characters are saved.

To be safe, we recommend using a LENGTH statement before creating a new

variable.

data test9; set mydata;

length agec_1 $30;

if age<30 then agec_1="Younger than 30";

else if age<50 then agec_1="Age between 30 to 49";

else agec_1="Older than or equal to 50";

run;

Delete the LENGTH statement and compare the output data with test9.
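A related point when recoding: SAS treats a missing numeric value as smaller than any number, so a condition such as age<30 is also true when AGE is missing. The source does not say whether AGE has missing values, so the following is only a defensive sketch; the names test9_safe and agec_2 are hypothetical.

```sas
/* test9_safe and agec_2 are hypothetical names for this illustration. */
data test9_safe; set mydata;
   length agec_2 $5;                       /* declare the length up front */
   if missing(age) then agec_2=" ";        /* leave the group blank when AGE is missing */
   else if age<30 then agec_2="18-29";
   else if age<50 then agec_2="30-49";
   else agec_2=">=50";
run;
```

Without the first branch, a patient with missing AGE would be placed in the "18-29" group.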

Page 35: Introduction to SAS - Stony Brook Medicine

35 of 81

Questions:

• How do we generate a descriptive table of patients’ demographic information by surgery type?

• How do we analyze the relationship between demographic information and surgery type?

• How do we compare population means across groups?

• How do we check the correlation between numeric variables?

Page 36: Introduction to SAS - Stony Brook Medicine

36 of 81

Part III: Categorical Data

Page 37: Introduction to SAS - Stony Brook Medicine

37 of 81

PART III: CATEGORICAL DATA

• The simplest categorical data tell us which of two categories a subject is in,

e.g. Male or Female, Diseased or Non-Diseased. This type of data is called binary or

dichotomous.

• Categorical data can also have more than two levels.

• In this section, we introduce the FREQ procedure in SAS to summarize and analyze

categorical data. PROC FREQ is a descriptive and statistical procedure that produces

one-way to n-way frequency and contingency tables. It can also perform analyses and

statistical tests.

        Level 1: Female    Level 2: Male    Total

Sex     236                64               300

Page 38: Introduction to SAS - Stony Brook Medicine

38 of 81

PART III: CATEGORICAL DATA

One-Way Frequency Table

proc freq data=mydata;

tables SEX;

run;

• proc freq: Initiate FREQ procedure.

• data: Specify dataset.

• tables: Specify variable for frequency table.

• run: End of the procedure.

Page 39: Introduction to SAS - Stony Brook Medicine

39 of 81

PART III: CATEGORICAL DATA

One-Way Frequency Table

Request a plot by adding the PLOTS= option to the TABLES statement. Separate the table request

from its options with a slash (/).

proc freq data=mydata;

tables sex / plots=freqplot;

run;

Page 40: Introduction to SAS - Stony Brook Medicine

40 of 81

PART III: CATEGORICAL DATA

One-Way Frequency Table

Request multiple one-way frequency tables at one time for different categorical variables.

proc freq data=mydata;

tables agec race / plots=freqplot;

run;

Page 41: Introduction to SAS - Stony Brook Medicine

41 of 81

PART III: CATEGORICAL DATA

Question: among the 236 female patients, how many had bypass? How many

had sleeve?

Use a contingency table to see the joint frequency distribution of the two variables.

Surgery \ Sex    Female    Male    Total

Bypass           ?         ?

Sleeve           ?         ?

Total

Page 42: Introduction to SAS - Stony Brook Medicine

42 of 81

PART III: CATEGORICAL DATA

Two-Way Frequency Table

proc freq data=mydata;

tables surg*sex;

run;

Based on the column percentages (Col Pct): among

236 female patients, 130 (55.08%) had

bypass and 106 (44.92%) had sleeve; among

64 male patients, 30 (46.88%) had

bypass and 34 (53.13%) had sleeve.

Based on the row percentages (Row Pct): among 160

bypass patients, 130 (81.25%) were female and

30 (18.75%) were male; among 140

sleeve patients, 106 (75.71%) were female and 34

(24.29%) were male.

Page 43: Introduction to SAS - Stony Brook Medicine

43 of 81

PART III: CATEGORICAL DATA

Two-Way Frequency Table

• The FREQ procedure excludes observations with missing values from the table but displays

the number of missing observations below each table.

• SAS offers many options in the TABLES statement to control the output, and you can

specify several options at once:

NOCOL: suppresses display of column percentages.

NOROW: suppresses display of row percentages.

NOCUM: suppresses display of cumulative frequencies and percentages (one-way tables).

CUMCOL: displays cumulative column percentages.

proc freq data=mydata;

tables surg*sex / cumcol norow;

run;

Page 44: Introduction to SAS - Stony Brook Medicine

44 of 81

PART III: CATEGORICAL DATA

Two-Way Frequency Table

Generate a bar chart to compare the proportions by group.

proc freq data=mydata;

tables surg*sex / plots=FreqPlot(type=barchart groupby=column

twoway=cluster scale=grouppercent);

run;

Options of FREQPLOT:

• type=barchart: request a bar chart.

• groupby=column: show distributions by the column variable. Available options: column,

row.

• twoway=cluster: display the bars from the same group side by side

in one figure. Available options include cluster and stacked.

• scale=grouppercent: display group percentages. Available options:

grouppercent, freq, percent, log, sqrt.

Page 46: Introduction to SAS - Stony Brook Medicine

46 of 81

PART III: CATEGORICAL DATA

Hypothesis Test

Based on the previous contingency table, 55.08% of female patients had bypass, while 46.88%

of male patients had bypass. It seems that females are more likely than males to have bypass

rather than sleeve.

How can we compare these two proportions to see whether the difference is significant? Is

there any association between bariatric surgery type and patient’s gender?

We wish to show that there is a difference between genders in the choice of bariatric

surgery type, and to test this hypothesis.

A hypothesis is an educated guess; it should be testable and quantifiable by evidence or

data.

Page 47: Introduction to SAS - Stony Brook Medicine

47 of 81

PART III: CATEGORICAL DATA

Hypothesis Test

A hypothesis test works like a proof by contradiction.

We assume that the two proportions are the same (null hypothesis). If the data provide

strong enough evidence against this assumption, we reject it in favor of

the opposite (alternative hypothesis).

For example:

π1: the proportion of female patients who had bypass.

π2: the proportion of male patients who had bypass.

The null hypothesis is what we are trying to disprove: 𝐻0: π1 = π2.

The alternative hypothesis is what we are trying to show is true: 𝐻𝑎: π1 ≠ π2.

Page 48: Introduction to SAS - Stony Brook Medicine

48 of 81

PART III: CATEGORICAL DATA

Steps of Hypothesis Testing

1. State the null (𝑯𝟎) and alternative (𝑯𝒂) hypotheses.

2. Choose a significance level 𝜶, usually 0.05.

3. From the sample, calculate the test statistic, p-value, and confidence interval, based

on the theoretical distribution of the test statistic.

4. Compare the p-value with the significance level.

If p-value < 𝛼, reject the null hypothesis.

If p-value ≥ 𝛼, fail to reject the null hypothesis.

5. Make a decision (reject or fail to reject the null hypothesis) and state your conclusion.

Note: Failing to reject the null hypothesis does not mean that the null hypothesis is

true.

Page 49: Introduction to SAS - Stony Brook Medicine

49 of 81

PART III: CATEGORICAL DATA

Chi-Square Test

The chi-square test of independence can be used to examine the relationship between two

categorical variables. The frequency of each category for one categorical variable is compared

across the categories of the second categorical variable.

The number of categories of each variable can be larger than 2; that is, the chi-square test can be

applied to r*k contingency tables, where r>2 and/or k>2.

To test the hypothesis that there is a difference between genders in the choice of bariatric

surgery type:

Null hypothesis: π1 = π2.

Alternative hypothesis: π1 ≠ π2.

π1: the proportion of female patients who had bypass.

π2: the proportion of male patients who had bypass.

Page 50: Introduction to SAS - Stony Brook Medicine

50 of 81

PART III: CATEGORICAL DATA

Chi-Square Test

Use PROC FREQ to perform the chi-square test: add the CHISQ option to the TABLES

statement.

proc freq data=mydata;

tables surg*sex / chisq;

run;

Based on the p-value, we fail to reject the null hypothesis at the 5% significance level

(0.2429>0.05) and conclude that there is no significant difference between the proportions of bypass patients among the female patients and among the male patients.

P-value from the Chi-Square test
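The chi-square test depends only on the four cell counts, so it can also be run from a summary table instead of the patient-level data. The sketch below enters the counts reported in the two-way table (130, 30, 106, 34) and uses the WEIGHT statement; the data set name surg_sex_counts and the variable COUNT are hypothetical.

```sas
/* SURG_SEX_COUNTS and COUNT are hypothetical names; the four counts are the
   cell frequencies reported in the two-way frequency table. */
data surg_sex_counts;
   length SURG $6 SEX $6;
   input SURG $ SEX $ COUNT;
   datalines;
Bypass Female 130
Bypass Male 30
Sleeve Female 106
Sleeve Male 34
;
run;

proc freq data=surg_sex_counts;
   weight COUNT;                          /* each row stands for COUNT patients */
   tables SURG*SEX / chisq expected;      /* EXPECTED prints expected cell counts */
run;
```

Worked by hand from these counts, the Pearson chi-square statistic is about 1.36 on 1 degree of freedom, which is consistent with the reported p-value of 0.2429. The EXPECTED option is added here only to make the expected counts visible when checking the small-sample condition.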

Page 51: Introduction to SAS - Stony Brook Medicine

51 of 81

PART III: CATEGORICAL DATA

Small Sample Situation

Note that the chi-square test is not suitable when the sample size of a contingency table is

small, for example when the expected count in any cell of the table is less than 5.

For a 2*2 table with a small sample size, use Fisher’s exact test.

For an r*k table with a small sample size, use the chi-square test with a p-value from Monte Carlo

simulation.

For 2*2 tables, the CHISQ option automatically provides Fisher’s exact test. For r*k

tables, you can request Monte Carlo simulation by adding an EXACT statement.
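Fisher’s exact test can also be requested on its own, without the other chi-square output. A minimal sketch, assuming the same mydata data set and the 2*2 table of surgery type by sex used above:

```sas
proc freq data=mydata;
   tables surg*sex / fisher;   /* request Fisher's exact test for the table */
run;
```

For this particular table the sample is large, so the exact test is shown only to illustrate the syntax; it matters most when expected cell counts are small.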

Page 52: Introduction to SAS - Stony Brook Medicine

52 of 81

PART III: CATEGORICAL DATA

Monte Carlo simulation

Test the relationship between surgery type and race.

proc freq data=mydata;

tables surg*race;

exact chisq / mc;

run;

Two cells have small counts, and SAS prints a WARNING that the result of the chi-square

test may not be valid.

Page 53: Introduction to SAS - Stony Brook Medicine

53 of 81

PART III: CATEGORICAL DATA

Monte Carlo simulation

Conclusion: Based on the p-value, we reject the null hypothesis at the 5% significance level

and conclude that there is a significant difference across races in the choice of bariatric surgery

type.

[Figure label: p-value from the chi-square test based on Monte Carlo simulation]

Page 54: Introduction to SAS - Stony Brook Medicine

54 of 81

Part IV: Continuous Data

Page 55: Introduction to SAS - Stony Brook Medicine

55 of 81

PART IV: CONTINUOUS DATA

In our sample dataset, height, weight, pre-surgery BMI and post-surgery BMI are continuous

data.

o What are the mean, median, minimum, maximum of these variables?

o How dispersed are these variables?

o Is there any difference in pre-surgery BMI across age/race/gender groups?

o Is there any difference between pre-surgery and post-surgery BMI?

o What is the correlation between pre-surgery and post-surgery BMI?

In this section, we first introduce how to summarize continuous data and calculate

descriptive statistics, such as the mean, median, and minimum/maximum. Next, we introduce

how to examine the relationships between variables by measuring correlation and conducting

t-tests and ANOVA.

Page 56: Introduction to SAS - Stony Brook Medicine

56 of 81

PART IV: CONTINUOUS DATA

Descriptive Statistics

Use PROC MEANS to summarize numeric variables. By default, this procedure

calculates 5 statistics: N, mean, standard deviation, minimum, and maximum.

proc means data=mydata;

var pre_BMI;

run;

Note that the number of patients with a pre-surgery BMI value is N=298. Based on

our previous analyses of categorical variables, there are 300 patient records in total.

Thus 2 patients have no pre-surgery BMI value. How can we output the missing count?

Page 57: Introduction to SAS - Stony Brook Medicine

57 of 81

PART IV: CONTINUOUS DATA

Descriptive Statistics

To calculate other descriptive statistics, add keywords to the PROC MEANS statement: MEDIAN,

NMISS, Q1, Q3, QRANGE, …

To analyze data by group, specify the grouping variable in a CLASS statement.

proc means data=mydata nmiss n q1 median q3 max qrange;

var pre_BMI;

class SEX;

run;

proc means data=mydata nmiss n q1 median q3 max qrange;

var pre_BMI;

run;

Page 58: Introduction to SAS - Stony Brook Medicine

58 of 81

PART IV: CONTINUOUS DATA

Box Plot, Histogram

To request a box plot and histogram of a continuous variable, height for example, use

PROC UNIVARIATE with the PLOT option.

proc univariate data=mydata plot;

var HGT;

run;

Page 59: Introduction to SAS - Stony Brook Medicine

59 of 81

PART IV: CONTINUOUS DATA

Box Plot, Histogram

To request plots of a continuous variable by group, say race, first sort the data

by the group variable. Next, name the group variable in a BY statement in PROC UNIVARIATE.

The procedure defines a BY group as a set of contiguous observations that have

the same value of the BY variable. Each level of the group variable gets its own set of

output, listed after the name of the level, and a grouped box plot appears at the end of the results.

proc sort data=mydata out=mydata2; by race; run;

proc univariate data=mydata2 plot;

var pre_bmi;

by race;

run;
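As an alternative sketch, grouped box plots and a histogram can be drawn with PROC SGPLOT, which does not need the data to be sorted first. PROC SGPLOT is not part of this workshop's material, so treat this as an assumption about what is available in your SAS installation.

```sas
/* Alternative sketch: PROC SGPLOT is not used elsewhere in this workshop. */
proc sgplot data=mydata;
   vbox pre_BMI / category=race;   /* one box per race; no prior sort needed */
run;

proc sgplot data=mydata;
   histogram pre_BMI;              /* histogram of pre-surgery BMI, all patients */
run;
```

PROC UNIVARIATE remains the better choice when the accompanying descriptive statistics are also wanted.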

Page 61: Introduction to SAS - Stony Brook Medicine

61 of 81

PART IV: CONTINUOUS DATA

Two-Sample T-test

Used to compare a continuous variable between two populations or two groups of a categorical

variable.

It determines whether there is a significant difference between the means of the 2 samples

(independent groups).

Key assumptions underlying the two-sample t-test:

• Sample size large enough (n1>30 and n2>30, say);

• Randomly sampled data;

• Two populations/groups are independent;

• If sample sizes are small, the data from each population need to be approximately normal.

Page 62: Introduction to SAS - Stony Brook Medicine

62 of 81

PART IV: CONTINUOUS DATA

Two-Sample T-test

Examine whether pre-surgery BMI is different between gender groups:

Null hypothesis: 𝜇1 = 𝜇2.

Alternative hypothesis: 𝜇1 ≠ 𝜇2.

𝜇1: the mean of pre-surgery BMI of female patients.

𝜇2: the mean of pre-surgery BMI of male patients.

By default, SAS will output the two-sided test result:

proc ttest data=mydata;

var pre_BMI;

class sex;

run;

Page 63: Introduction to SAS - Stony Brook Medicine

63 of 81

Two-Sample T-test

PART IV: CONTINUOUS DATA

Check the homogeneity of variances of the 2 groups

with an F-test. Since p-value<0.05, reject the null

hypothesis of equal variances and conclude that the variances are not

equal. For data with unequal variances, use Satterthwaite’s

method.

P-values of the two-sample t-test. Fail to reject the null

hypothesis: there is no significant difference in

mean pre-surgery BMI between genders.

The difference between the two

means is close to 0 and the

confidence interval includes 0.

Page 64: Introduction to SAS - Stony Brook Medicine

64 of 81

Two-Sample T-test

Recommendation:

“For the problem of testing the equality of means from two independent normally

distributed populations where the ratio of the variances is unknown, directly apply

Satterthwaite's Approximate F test without using any preliminary variance test.”

[Reference]: Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite's approximate F

test. Communications in Statistics-Theory and Methods, 18(11), 3963-3975.

PART IV: CONTINUOUS DATA

Page 65: Introduction to SAS - Stony Brook Medicine

65 of 81

PART IV: CONTINUOUS DATA

Two-Sample T-test

Check the assumption of normality:

Distribution plots

Q-Q plots

The male data do not deviate seriously from the fitted line, but the female data are skewed

to the right. Since the sample size is large, the t-test is still valid.

Page 66: Introduction to SAS - Stony Brook Medicine

66 of 81

PART IV: CONTINUOUS DATA

Two-Sample T-test

To request a lower or upper one-sided test, add the SIDES= option to the PROC TTEST statement.

• Lower one-sided test:

Null hypothesis: 𝜇1 = 𝜇2.

Alternative hypothesis: 𝜇1 < 𝜇2.

proc ttest data=mydata sides=L;

var pre_BMI;

class sex;

run;

• Upper one-sided test:

Null hypothesis: 𝜇1 = 𝜇2.

Alternative hypothesis: 𝜇1 > 𝜇2.

proc ttest data=mydata sides=U;

var pre_BMI;

class sex;

run;

Page 67: Introduction to SAS - Stony Brook Medicine

67 of 81

PART IV: CONTINUOUS DATA

Paired T-test

• Paired data arise when two measurements are taken on the same subject under

different experimental conditions (e.g., before and after treatment).

• Both measurements must be available for the same subjects.

• The analysis focuses on the within-subject difference between the two measurements.

• Validity of the paired t-test:

Paired observations;

Sample size large enough (n>30, say);

If the sample size is small, the differences need to be approximately normal.

Page 68: Introduction to SAS - Stony Brook Medicine

68 of 81

PART IV: CONTINUOUS DATA

Paired T-test

Examine the BMI difference before and after surgery.

Null hypothesis: 𝜇𝐷=0;

Alternative hypothesis: 𝜇𝐷 ≠ 0;

𝜇𝐷: the mean difference of BMI before and after the surgery for the entire population.

proc ttest data=mydata;

paired pre_bmi*post_bmi;

run;

P-value from the paired t-test. Reject the null

hypothesis and conclude that mean BMI

before surgery and after surgery differ.

The estimated mean difference

is not close

to 0.
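The paired t-test is the same as a one-sample t-test on the within-patient differences. The sketch below makes that explicit by computing the difference first; the data set name bmi_paired and the variable bmi_diff are hypothetical.

```sas
/* bmi_paired and bmi_diff are hypothetical names for this illustration. */
data bmi_paired; set mydata;
   bmi_diff = pre_bmi - post_bmi;   /* missing if either BMI is missing */
run;

proc ttest data=bmi_paired h0=0;    /* one-sample t-test of mean difference = 0 */
   var bmi_diff;
run;
```

Having the difference as its own variable also makes it easy to inspect its distribution directly, for example with PROC UNIVARIATE.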

Page 69: Introduction to SAS - Stony Brook Medicine

69 of 81

PART IV: CONTINUOUS DATA

Paired T-test

Check the normality of differences:

Distribution Plot.

Q-Q Plot

The distribution and Q-Q plots show that the differences do not deviate seriously from the fitted

line. This suggests that the differences are approximately normally distributed and the test is valid.

Page 70: Introduction to SAS - Stony Brook Medicine

70 of 81

PART IV: CONTINUOUS DATA

ANOVA

We use the t-test to compare two groups and ANOVA to compare N>2 groups.

We are still interested in population means:

Null hypothesis: 𝐻0: 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑁.

Alternative hypothesis: 𝐻𝑎: one or more of 𝜇1, 𝜇2, … , 𝜇𝑁 are different.

If the null hypothesis is true, we expect the sample means to be close together and the

between-group variance to be relatively small.

Key assumptions underlying the ANOVA:

• Groups are independent and randomly sampled

• Each group comes from a normally distributed population

• The population variances are equal across groups

The last assumption forms the basis for the analysis.

ANOVA test statistic is based on the ratio of between-group variance and within-group

variance.

If the ratio is “big”, we reject the null hypothesis and conclude the means differ.

Page 71: Introduction to SAS - Stony Brook Medicine

71 of 81

ANOVA

To examine the difference in mean pre-surgery BMI across age groups:

proc glm data=mydata plots=diagnostics;

class agec;

model pre_bmi=agec;

means agec;

run;

quit;

The following 2 options are not strictly necessary to fit the model:

• plots=diagnostics: request a panel of summary diagnostics for the fit.

• means agec: request means and standard deviations within each age group.

PART IV: CONTINUOUS DATA

Page 72: Introduction to SAS - Stony Brook Medicine

72 of 81

ANOVA

Recall the assumptions: we can check normality and equality of variances through the residuals.

• Normality can be checked with the Q-Q plot of the residuals. Most data points fall approximately along the reference line, so the normality assumption is approximately met.

• Equality of population SDs can be checked with the residual plot. The spread of the residuals is similar across age groups, so the assumption of homogeneity of variances is met.
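The graphical checks above can be supplemented with formal tests. The following is a hedged sketch that goes beyond what the slides show: it adds Levene's test for equal variances via the HOVTEST= option and saves the residuals for a normality test. The data set name resid_out and the variable name resid are hypothetical.

```sas
/* Sketch only: resid_out and resid are hypothetical names. */
proc glm data=mydata plots=diagnostics;
  class agec;
  model pre_bmi=agec;
  means agec / hovtest=levene;      /* Levene's test for homogeneity of variance */
  output out=resid_out r=resid;     /* save the residuals to a new data set */
run;
quit;

/* Normality tests and a Q-Q plot for the saved residuals */
proc univariate data=resid_out normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

A small p-value from Levene's test would suggest unequal variances, and a small p-value from the normality tests would suggest non-normal residuals.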


Conclusion: Based on the p-value (0.0380 < 0.05), we reject the null hypothesis at the 5% significance level and conclude that at least one of the age-group means differs from the others.

If the normality assumption is not met, use a nonparametric test (PROC NPAR1WAY): the Wilcoxon rank-sum test for two groups or the Kruskal-Wallis test for more than two groups.
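A minimal sketch of the nonparametric alternative, reusing the same variables as the ANOVA example. When the CLASS variable has more than two levels, the WILCOXON option makes PROC NPAR1WAY report the Kruskal-Wallis test.

```sas
/* Sketch: rank-based comparison of pre-surgery BMI across age groups */
proc npar1way data=mydata wilcoxon;
  class agec;      /* grouping variable */
  var pre_bmi;     /* response variable */
run;
```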


Bonferroni procedure

The ANOVA test only tells us whether at least one of the means is different; it does not tell us how the means differ from each other. In this workshop, we introduce the Bonferroni adjustment procedure to test all possible pairwise comparisons.

The procedure works for a fixed set of 𝐼 null hypotheses to test or parameters to estimate.

In the previous hypothesis tests, we rejected the null hypothesis if the p-value was less than the significance level (𝛼). The Bonferroni adjustment procedure instead performs each of the 𝐼 tests at level 𝛼/𝐼.
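As an illustration of the adjusted level: with N groups there are N(N−1)/2 pairwise comparisons. The value N = 4 below is a hypothetical example, not the number of age groups in the workshop data.

```latex
I = \binom{N}{2} = \frac{N(N-1)}{2}, \qquad
N = 4 \;\Rightarrow\; I = 6, \qquad
\frac{\alpha}{I} = \frac{0.05}{6} \approx 0.0083
```

In this illustration, each pairwise comparison is declared significant only if its p-value is below roughly 0.0083, which keeps the overall chance of any false rejection at or below 0.05.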

proc glm data=mydata;
  class agec;
  model pre_bmi=agec;
  means agec / bon;    /* Bonferroni-adjusted pairwise comparisons of group means */
run;
quit;
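If Bonferroni-adjusted p-values and confidence limits for each pairwise difference are wanted, one alternative is the LSMEANS statement. This is a sketch of an alternative, not the approach used in the slides.

```sas
/* Sketch: Bonferroni-adjusted p-values for all pairwise differences */
proc glm data=mydata;
  class agec;
  model pre_bmi=agec;
  lsmeans agec / pdiff adjust=bon cl;   /* pdiff: pairwise differences; cl: confidence limits */
run;
quit;
```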


Correlation Coefficient

Instead of comparing a continuous variable across groups, the correlation coefficient measures the linear relationship between two continuous variables.


The Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (r) measures the strength and direction (positive or negative) of the linear relationship between two variables (e.g., X and Y).

−1 ≤ 𝑟 ≤ 1

If r=1 then Y increases with X along a perfect line.

If r=-1 then Y decreases with X along a perfect line.

If r=0 then Y and X are not linearly associated.

The closer r is to -1 or 1, the more closely the points lie along a straight line.
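For reference, the usual sample formula for r is given below. The notation is an assumption added here: (x_i, y_i) are the n paired observations and x̄, ȳ are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when X and Y tend to be above or below their means together, which gives r its sign. The denominator rescales the result so that it stays between -1 and 1.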


The Pearson’s Correlation Coefficient

For example, test the correlation between weight and BMI, where 𝜌 denotes the population correlation estimated by r:

Null hypothesis: 𝜌=0;

Alternative hypothesis: 𝜌≠0.

proc corr data=mydata;

var WGT pre_BMI;

run;


r=0.80687, which is close to 1. Since the p-value is <.0001, we reject the null hypothesis and conclude that weight and BMI have a positive linear relationship.


The Pearson’s Correlation Coefficient

To request a scatter plot, add the PLOTS=SCATTER option to the PROC CORR statement. By default, the plot shows the Pearson correlation and a 95% prediction ellipse.

proc corr data=mydata plots=scatter;

var WGT pre_BMI;

run;


The Spearman’s Rank-Correlation Coefficient

• Pearson’s correlation is highly sensitive to outliers and extreme values, and therefore to departures from normality.

• Spearman’s rank correlation coefficient is a non-parametric alternative to Pearson’s correlation coefficient. It can be used when the data contain extreme values or are otherwise far from normal, or when the data are ordinal.

• To compute Spearman’s correlation coefficient, add the SPEARMAN option to the PROC CORR statement.
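A minimal sketch of the Spearman request, reusing the variables from the Pearson example. When SPEARMAN is specified on its own, PROC CORR does not print the Pearson correlation, so the PEARSON option is added here to show both.

```sas
/* Sketch: Spearman rank correlation alongside Pearson for weight and BMI */
proc corr data=mydata pearson spearman;
  var WGT pre_BMI;
run;
```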


Result Export

• Data Export: Choose “Export Data” under “File”, select the library and member name of the data set, then follow the prompts to save the data.

• Save SAS result: Right-click the table or figure you want to save, then export it as an Excel file or save it as a picture.
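The same tasks can also be scripted instead of using the menus. The sketch below is an illustration only: both file paths are hypothetical and should be replaced with locations that exist on your machine.

```sas
/* Sketch only: both file paths are hypothetical. */

/* Export a data set to a CSV file */
proc export data=mydata
    outfile="C:\temp\mydata.csv"
    dbms=csv
    replace;                         /* overwrite the file if it already exists */
run;

/* Send the tables from a procedure to an RTF document */
ods rtf file="C:\temp\corr_results.rtf";
proc corr data=mydata;
  var WGT pre_BMI;
run;
ods rtf close;                       /* close the destination to finish writing the file */
```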


Summary

Examine Relationships between Variables:

Categorical data

• Large sample: Chi-square test.

• Small sample:

2*2 table: Fisher’s exact test;

r*k table: Chi-square test with Monte Carlo simulation.

Continuous data

• Compare the means of 2 populations:

2 independent groups: two-sample t-test;

2 paired groups: paired t-test.

• Compare the means across >=3 groups: ANOVA.

Multiple comparisons of means across >=3 groups: Bonferroni adjustment.

• Linear relationship between 2 continuous variables: Pearson’s correlation coefficient.

Reference

1. PennState Eberly College of Science, Introduction to SAS, https://newonlinecourses.science.psu.edu/stat480/node/95/.