
Fall 2006

StatLab Workshop Series:

Introduction to SAS

Prepared and conducted by:

Katie McLaughlin, M.S. Department of Psychology

Department of Epidemiology & Public Health Yale University

Chandra Erdman

Department of Statistics Yale University

StatLab: Main Classroom 140 Prospect Street

October 6th 2006 12:30-2:30pm

Yale University StatLab Fall 2006 Introduction to SAS Workshop


This workshop is designed to introduce new users to some of the basic concepts necessary for using the SAS System by covering the following topics:

1. The SAS window system

2. How to construct a SAS program, review the SAS log and SAS output

3. How to input and read raw data files and SAS data files

4. Difference between a DATA step and a PROC step

5. Difference between a temporary SAS dataset and a permanent SAS dataset

6. The LIBNAME statement and librefs

7. Simple ways to manipulate the data and the output

8. Simple procedures available for analysis

9. Importance of the semi-colon in SAS

INTRODUCTION to the SAS SYSTEM

The Statistical Analysis System (SAS) was created in 1977 for conducting agricultural research. SAS is

essentially a series of computer programs used for data management, analysis, and presentation. It is also considered a 4th generation programming language because it requires only the names of operations in

order to perform them. The programs have already been written and the code/syntax used simply invokes

those programs. 3rd generation programs like BASIC, FORTRAN, C, and Pascal all required very involved code to perform simple operations.

Generally, SAS entails writing a SAS program, submitting it for analysis, and reviewing the log and

output files to help you understand your data and the results of your analyses.

Click on the SAS icon to start the SAS System. Three main windows appear. You can also use the pull-

down menu “View” to open the different types of windows in the SAS environment.

1. ENHANCED EDITOR window (in v. 8.0 and above; provides color codes for syntax)

SAS Program – <filename>.sas

2. LOG window

SAS Log – <filename>.log

3. OUTPUT window

SAS Output – <filename>.out

Consider each of these to be a file. Differentiate between them based on the extension (sas, log, out) after the dot. The SAS environment looks similar to the Windows environment. There are pull down menus

and icon buttons for controlling some of the features of SAS. Much of SAS, however, requires writing

code using the SAS programming language.



OVERVIEW for WRITING the SAS PROGRAM

General syntax rules when writing SAS code:

1. SAS statements can be written in upper or lower case (or mixed).

2. SAS statements can begin and end in any column, but a word may not be split between any two lines.

3. More than one SAS statement can be entered per line.

4. Blank lines can be used and are recommended to aid readability.

5. SAS processes statements in steps, so make sure that the data steps and procedure steps are in the proper order.

6. Every SAS statement ends with a semicolon (;).

Building a SAS Program:

1. Practice writing a program in the Editor window. Save the file with a name like TEST.SAS.

2. Format the output with some OPTIONS listed at the top of the program.

OPTIONS PS=66 LS=165 NOCENTER NOFMTERR;

PS= tells SAS to fit up to 66 lines of output per page; ranges from 15 - 32,767 lines

LS= tells SAS to fit up to 165 columns of output per line; ranges from 64 – 256 columns

NOCENTER tells SAS to print output flush left instead of the default centered.

NOFMTERR tells SAS to continue processing when it cannot find a format that has been assigned to a variable

(a missing format would otherwise produce an error message).

3. Document as much as possible directly in the program with comments – it helps you keep track

of what you are doing in your program and why you are doing it. Comments can go anywhere in

your program.

A SAS comment can start with an asterisk and end with a semicolon:

* PROGRAMMER: KAM

DATE: 10/06/06

PROJECT: SAS BASICS WORKSHOP;

You can also use the following, which helps avoid semicolon mishaps:

/* LEARN HOW TO INPUT A RAW DATA FILE */

Within a comment, do not use semicolons and avoid using quotation marks (single & double).

4. Handling data and invoking procedures always occurs in one of 2 steps in SAS.

DATA step: builds a SAS data set (e.g., adds variables, merges datasets) OR

PROC step: processes a SAS data set (e.g., produce means, frequencies)



WORKING with RAW DATA

There are a number of different ways to input and read raw data in SAS (i.e., the instructions given to

SAS about the location and format of the variables).

1. Characteristics of a raw data file:

A. Each row represents an observation, containing data values for one subject.

B. Each column represents a variable across all subjects: e.g., sex, birth date, test scores

C. Values assigned to variables can be:

Numeric – includes only numbers

Character – includes letters, sometimes letters and numbers (alphanumeric)

D. The kind of values that are assigned to variables can influence the way in which SAS reads the data and performs certain analyses. It’s important to gain familiarity with your raw data.

2. To create the raw data file, key in the lines of data using Word, Notepad, or any text editor and

save the file as <filename>.dat or .txt. We will use a raw data file called testdata.txt that is saved in ‘c:\temp\sasbasics’.

3. Avoid errors in keying the data.

A. In the raw data file, data must be entered starting on Line 1.

B. Leave no blank lines at the top or bottom of the file. If data are missing for a subject, do not skip the line; key in “pretend data” by using the space bar to fill the columns that would have held the values if they were not missing.

C. Make sure variables are keyed into the correct column

D. Right-justify numeric data

E. Left-justify character data

F. Use blank columns between variables to aid in readability

Two of the main ways to input a raw dataset with a SAS program include:

1. Using the INPUT and DATALINES (or CARDS) commands to input the actual raw data within

the SAS program. (Note: Use when you have a small set of data.)

2. Using an INFILE or IMPORT command to refer SAS to an external raw data file saved

somewhere (e.g., floppy disk, hard drive, network). (Note: Better to use when you have a large set

of data.)



I. INPUT and DATALINES Examples:

Example 1. General Template:

DATA <data-set-name>;

INPUT (variable-name1) (variable-name2) (variable-name3);

DATALINES;

Keyed in lines of data go here – each row an observation, each column a variable

;

RUN;

PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;

VAR <name of variables to be processed>;

RUN;

Example 1. Sample Program:

* PROGRAMMER: KAM

DATE: 10/06/06

PROJECT: SAS BASICS WORKSHOP;

/* LEARN HOW TO INPUT A RAW DATA FILE */

/* Example 1 using INPUT and CARDS commands */

DATA TEMP;

INPUT SUBJECT SATV SATM;

CARDS;

1 520 490

2 610 590

3 470 450

4 410 390

5 510 460

6 580 350

;

RUN;

* COMMENT: Below is a PROC step, which allows you to manipulate and

analyze your SAS data set. This produces means for SATV and SATM;

PROC MEANS DATA=TEMP;

VAR SATV SATM;

RUN;



Example 2. General Template:

DATA <data-set-name>;

INPUT #line-number @ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.) ;

CARDS;

Keyed in lines of data go here – each row an observation, each column a variable

;

RUN;

PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;

VAR <name of variables to be processed>;

RUN;

Example 2. Sample Program:

/* Example 2 using INPUT and CARDS commands */

data temp;

input #1 @ 1 (V1) (1.)

@ 2 (V2) (1.)

@ 3 (V3) (1.)

@ 4 (V4) (1.)

@ 5 (V5) (1.)

@ 6 (V6) (1.)

@ 7 (V7) (1.)

@ 9 (AGE) (2.)

@ 12(IQ) (3.)

@ 16 (NUMBER) (1.)

;

cards;

2234243 22  98 1

3424325 20 105 2

3242424 32  90 3

3242323 19 119 4

3232143 18 101 5

;

run;

* COMMENT: Below is a PROC step, which allows you to manipulate and

analyze your SAS data set. This produces means for V1, V2, AGE, and IQ;

proc means;

var v1 v2 age iq;

run;



The parts of the example programs above are described below:

1. DATA statement:

General form: DATA <data-set-name>;

Data-set-name: TEMP

2. INPUT statement (as in Example 2):

INPUT #line-number @ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.) ;

3. Line number directions:

#line-number → Tells SAS what line to start on to read each subject’s data

INPUT #1 → In this example, it starts at line 1

4. Column location, variable name, and column width directions:

@ column-number → # of the column at which each variable begins

(variable-name) → name given to each variable

(column-width.) → # of columns to be occupied by each variable

Note: Column width must be followed by a period. The period marks the value as an informat (w.d) rather than a column number, and any digits after it give the number of implied decimal places. Also, above IQ was given 3 columns (even though some IQ values were only 2 digits).

Don’t forget the semicolon at the end of the INPUT statement!

INPUT #1 @ 1 (V1) (1.) ; → At Line 1, Column 1, the variable is called V1 and is 1 column wide

5. DATALINES (or CARDS) statement:

Right after the INPUT statement is the DATALINES statement to tell SAS that there is raw data. There must be a semicolon after the word DATALINES and again after the raw data.

6. Data lines:

Data lines are the values for each row/observation/subject. Again, leave no blank lines (otherwise

SAS will think that a subject has missing data) and very carefully check the columns of the

variables to make sure they are aligned correctly. Make sure you have a semicolon on the line right below your last line of data.

7. PROC and RUN Statements:

PROC tells SAS to perform a given procedure or statistical analysis: e.g., CONTENTS, MEANS,

TTEST, UNIVARIATE, FREQ, GLM, ANOVA, or PRINT.

RUN tells SAS to execute the PROC.



8. General rules for data set names and variable names:

A. Must begin with a letter or an underscore (not a number)

B. May be no more than 32 characters long in SAS version 7 and above (8 characters in earlier versions)

C. May contain no special characters such as “*” or “#”

D. May contain no blank spaces

II. INFILE Example:

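General Template:

DATA <data-set-name>;

INFILE '<directory-path-and-name-of-data-file>';

INPUT #line-number @ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.)

@ column-number (variable-name) (column-width.) ;

RUN;

PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;

VAR <name of variables to be processed>;

RUN;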

Example Program:

/* COMMENT: Example using INFILE command */

/* COMMENT: INFILE indicates the name of the data file in which the raw data

exists and where it can be found (need to specify the directory path if the

file is not located in the same place as the program). The INPUT statement

indicates the structure of the data file (as referred to by the INFILE

command). */

DATA TEMP;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT #1 @ 1 (V1) (1.)

@ 2 (V2) (1.)

@ 3 (V3) (1.)

@ 4 (V4) (1.)

@ 5 (V5) (1.)

@ 6 (V6) (1.)

@ 7 (V7) (1.)

@ 9 (AGE) (2.)

@ 12(IQ) (3.)

@ 16(NUMBER) (1.)

;

run;

/* This will produce a mean for each variable in the data set because vars

were not specified */

PROC MEANS DATA=TEMP;

RUN;



The code is identical to using CARDS, but the INFILE statement is added and the CARDS statement and data lines are deleted. Instead of including the raw data in the program, the INFILE statement indicates

where to find the raw data. The INPUT statement is still needed to tell SAS the structure of the raw data.

Another General Template using INFILE:

DATA <data-set-name>;

INFILE '<directory-path-and-name-of-data-file>';

INPUT variable-name startcol-endcol variable-name startcol-endcol

variable-name startcol-endcol variable-name startcol-endcol;

RUN;

PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;

VAR <name of variables to be processed>;

RUN;

Example Program:

/*COMMENT: Another way to enter data from a text file*/

DATA TEMP;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT V1 1 V2 2 V3 3 V4 4 V5 5 V6 6 V7 7 age 9-10 IQ 12-14 number 16;

RUN;

/*This will produce descriptive statistics for age and IQ*/

proc univariate data=temp;

var age IQ;

run;

Additional tips for handling variables when inputting data:

• Input a string of variables with the same prefix and different numeric suffixes. Think about the

variables V1-V7 from above. The prefix (V) is the same, but the suffix is a different number. This is

useful when you have a survey or questionnaire with many items. If you have multiple surveys, the prefix could be some abbreviated form of what the particular survey is.

INPUT #1 @1 (V1-V7) (1.)    ← saves lines of code because it’s a string of variables

@9 (AGE) (2.)

@12 (IQ) (3.) ;

• Inputting character variables requires that you indicate in the INPUT statement that it is a character

variable. The use of a $ before the number of columns required tells SAS that it’s a character variable.



For example, if we added a variable called SEX, it could be inputted with values of M or F instead of

values of 1 or 2.

INPUT #1 @1 (V1-V7) (1.)

@9 (AGE) (2.)

….

@18 (SEX) ($1.)    ← $ is included to indicate character values for SEX

• Sometimes multiple lines of data are needed for each subject.

INPUT #1 @ 1 (V1-V7) (1.)

@ 9 (AGE) (2.) @ 12 (IQ) (3.)

@ 16 (NUMBER) (1.)

@ 18 (SEX) ($1.)

#2 @ 1 (SATV) (3.) @ 5 (SATM) (3.) ;

Raw data for this input statement would look like this for 3 subjects:

2234243 22  98 1 M

520 490                ← Subject 1 has data for SATV and SATM

3424325 20 105 2 M

  .   .                ← Subject 2 is missing data for SATV and SATM

3242424 32  90 3 F

390 420

If data is missing for an observation, leave the space there as if it were present so SAS doesn’t

misalign the rows or use a period (.) to indicate that a value is missing.

• Create decimal places on input for numeric variables so you don’t have to key in the decimal point:

If you had a variable called GPA, key it in without the decimals

3.56 → 356

2.20 → 220

INPUT #1 @ 1 (GPA) (3.2) ;    ← Tells SAS to read 3 columns and treat the last 2 digits as decimal places

CARDS;

356

220

;

• Inputting “check all that apply questions” as multiple variables:

Treat single questions with multiple parts to them as a set of questions. For each question there

can be a value of either 0 (not checked) or 1 (checked) – making each question a dichotomous variable using dummy coding.
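As a minimal sketch of this coding scheme, suppose a hypothetical question 5 offers three options that a subject may check. The data set name CHECKALL and the variables Q5A, Q5B, Q5C, and NCHECKED are made up for illustration; each option becomes its own 0/1 variable.

```sas
/* Hypothetical check-all-that-apply item entered as three 0/1 variables */
DATA CHECKALL;
INPUT SUBJECT Q5A Q5B Q5C;
/* Count how many of the three options each subject checked */
NCHECKED = SUM(Q5A, Q5B, Q5C);
DATALINES;
1 1 0 1
2 0 0 0
3 1 1 1
;
RUN;

/* One frequency table per option shows how often it was checked */
PROC FREQ DATA=CHECKALL;
TABLES Q5A Q5B Q5C;
RUN;
```

Keeping one variable per option means each option can be tabulated or analyzed on its own, which a single variable holding several checked values would not allow.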



III. IMPORT Example

If your data are saved as an Excel file rather than a text file, you will need to use the PROC IMPORT

command to enter your data into SAS. The best way to structure your Excel file to make it SAS-readable

is to use the first row to enter variable names and begin entering your data values in the second row. SAS

reads the first row of data as variable names.

General Template:

PROC IMPORT DATAFILE = '<directory-path-and-name-of-data-file.xls>'

DBMS = excel

OUT = <data-set-name>

REPLACE ;

RUN;

PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;

RUN;

**Note that the lines of the PROC IMPORT command above do not each end in a semicolon. This is not an exception to the semicolon rule: DATAFILE=, DBMS=, OUT=, and REPLACE are all options of a single PROC IMPORT statement that happens to span several lines, so one semicolon follows the last option.

Example Program:

PROC IMPORT

DATAFILE = 'c:\temp\sasbasics\testdata.xls'

DBMS = excel

OUT = test

REPLACE

;

RUN;

/*This will produce output that lists the number of variables and

observations in your SAS dataset -- This is a useful way to double-check that

your data were entered correctly*/

PROC CONTENTS DATA = test;

RUN;



WORKING with TEMPORARY and PERMANENT DATASETS

The DATA statement tells SAS to build a SAS data set.

1. Building a Temporary SAS Data Set

The syntax for building a temporary SAS data set is:

DATA <data-set-name> ;

INFILE 'drive:\path\filename.dat' ;

INPUT variable information ;

Here, the DATA statement refers to the data-set-name as the name of a temporary SAS data set. TEMP

was used in the previous programs as a data set name.

Example Program:

DATA TEMP;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT #1 @ 1 (V1) (1.)

@ 2 (V2) (1.)

@ 3 (V3) (1.)

@ 4 (V4) (1.)

@ 5 (V5) (1.)

@ 6 (V6) (1.)

@ 7 (V7) (1.)

@ 9 (AGE) (2.)

@ 12 (IQ) (3.)

@ 16 (NUMBER) (1.)

;

run;

This code will not create a physical SAS dataset called testdata. Instead, the code reads the physical raw data file called testdata.txt and creates a temporary dataset called TEMP. SAS stores temporary datasets in the WORK library, so TEMP remains available to later DATA and PROC steps for as long as the current SAS session is open. When you exit SAS, TEMP is deleted. A permanent data set, however, is kept on disk even after the SAS session ends.

2. Building a Permanent SAS Data Set

The syntax that creates a permanent SAS data set is:

LIBNAME libref 'drive:\path';

DATA <libref.filename>;    ← Two-level name

The LIBNAME statement defines a libref, or a nickname, for the drive and the directory path in which to save or to find the permanent SAS data set.

A libref is 1-8 characters long, contains no spaces, and can start with an “_” or a letter, but not a number. It works by giving a nickname to the 'drive:\path' (which must be enclosed in quotes) for the duration of the current SAS session.



Define all librefs at the beginning of a SAS program to document where permanent SAS data sets are

saved (or used) by the SAS program.

The DATA step tells SAS to create a permanent SAS data set by using a two-level name, i.e.,

<libref.filename>.

The 1st level of the name is the libref, or the previously defined nickname in the LIBNAME statement to represent the ‘drive:\path’ where the permanent SAS data set is stored. The libref

name is followed by a period.

The 2nd level of the name is the filename of the permanent SAS data set stored in the libref. SAS

automatically appends the extension .sas7bdat to permanent SAS data sets.

Example Program:

/* COMMENT: Example of saving a permanent SAS dataset from the raw dataset */

LIBNAME data 'c:\temp\sasbasics' ;

* Step below creates a permanent SAS dataset called testdata.sas7bdat ;

DATA data.testdata ;

* Step below uses the raw dataset to create testdata.sas7bdat ;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT #1 @ 1 (V1) (1.)

@ 2 (V2) (1.)

@ 3 (V3) (1.)

@ 4 (V4) (1.)

@ 5 (V5) (1.)

@ 6 (V6) (1.)

@ 7 (V7) (1.)

@ 9 (AGE) (2.)

@ 12 (IQ) (3.)

@ 16 (NUMBER) (1.)

;

PROC CONTENTS DATA=data.testdata;

RUN;

The LIBNAME statement above uses data as the libref to refer to ‘c:\temp\sasbasics’.

The DATA step (using the two-level name) tells SAS to create a permanent SAS data set called testdata

and to save it in data (a.k.a. c:\temp\sasbasics).

The INFILE statement tells SAS where the raw data file exists in order to create testdata.sas7bdat, the SAS

dataset.

After this program is run, check that the permanent SAS data set called testdata.sas7bdat exists in

c:\temp\sasbasics. Also, check the output to see the contents of testdata.sas7bdat.



3. Processing a Temporary SAS Data Set

Now that we have created a SAS dataset, we can use it to process the dataset temporarily. This is helpful

when you are testing out some code and don’t necessarily want to save the changes you are making. Note

that we no longer need to use the INFILE statement to indicate where to find the file; instead, we use a

SET statement.

LIBNAME <libref> 'drive:\path';

DATA <data-set-name> ;    ← A temporary SAS dataset used as the working file for the code that follows

SET <libref.filename> ;    ← A permanent dataset (in some cases it can be a temporary SAS dataset) must be named here using the SET statement so a temporary data set can be created from it.

Example Program:

/* COMMENT: Setting a permanent SAS dataset to process temporarily */

LIBNAME data 'c:\temp\sasbasics' ;

* Step below creates a temporary SAS dataset called TEMP ;

DATA TEMP;

SET data.testdata ;

PROC CONTENTS DATA=TEMP;

RUN;

The DATA statement identifies the name of the temporary SAS dataset, and the SET statement identifies the dataset that was used to create the temporary dataset (in this case c:\temp\sasbasics\testdata.sas7bdat).

Note that no physical SAS dataset file is saved in c:\temp\sasbasics called temp.sas7bdat. In the output,

the contents will indicate that this data set is called TEMP.

4. Processing a Permanent SAS Data Set

One way to process a permanent SAS data set to perform a procedure is illustrated in this syntax:

LIBNAME libref 'drive:\path';

PROC <name-of-statistical-procedure> DATA = <libref.filename> ;    ← Two-level name

Note that the INPUT and INFILE statements are not needed now.

The LIBNAME statement defines the libref so that it refers to the ‘drive:\path’ where the permanent

SAS data set is stored.

The PROC statement tells SAS to perform a procedure on the SAS data set. Follow PROC with the name

of the procedure you want SAS to perform (e.g., MEANS, PRINT).



After the PROC name, but before the semicolon, comes a DATA= option that uses the libref to tell SAS

where the permanent SAS data set is stored (the directory), followed by a period and the filename of the permanent SAS data set.

Example Program:

/* COMMENT: Setting a permanent SAS dataset on a PROC step */

LIBNAME data 'c:\temp\sasbasics' ;

* Step below prints the data out for the permanent dataset called testdata;

PROC PRINT DATA=data.testdata ;

RUN;

The PROC statement tells SAS to perform the PRINT procedure on the permanent SAS data set

testdata.sas7bdat stored in c:\temp\sasbasics (as referred to by the libref we created using the

LIBNAME statement, “data”).

Another way to work with permanent data sets is to SET an existing permanent SAS data set in order to

make a new permanent data set with a different name and with changes to the data.

Example Program:

/* COMMENT: Creating a new permanent SAS dataset by setting a permanent SAS

dataset */

LIBNAME data 'c:\temp\sasbasics' ;

* Step below saves a new data set called testdat2.sas7bdat that is identical

to the data set called testdata.sas7bdat but with a new variable called

drink;

DATA data.testdat2 ;

SET data.testdata ;

* Create a variable called drink based on age (21 and over = 1, under 21 = 0) ;

if age >= 21 then drink = 1;

if age < 21 then drink = 0;

run;

PROC PRINT DATA=data.testdat2;

RUN;

The DATA statement tells SAS to save a new permanent SAS data set called testdat2.sas7bdat stored in ‘c:\temp\sasbasics’ and the SET statement tells SAS to use the data set called testdata.sas7bdat to make

this new dataset. Check in ‘c:\temp\sasbasics’ to make sure that it was created. Also, check the output to

see that the new variable is included in the dataset.

Note that an INFILE statement tells SAS what raw data set to use, whereas a SET statement tells SAS

what existing or permanent SAS data set to use.



WAYS to MANIPULATE the DATA

Data-manipulation will transform the data set in some way, e.g., add new variables or change existing

variables. Data manipulation code can go on a DATA step usually in one of two places:

1) Immediately after the INPUT statement (whether you use CARDS or INFILE)

Example Program:

DATA TEMP;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT #1 @ 1 (V1-V7) (1.)

@ 9 (AGE) (2.)

@ 12 (IQ) (3.)

@ 16 (NUMBER) (1.) ;

if age >= 21 then drink = 1;    /* data-manipulation & data-subsetting statements go here */

RUN;

PROC PRINT DATA = TEMP;

RUN;

2) Immediately after the creation of a new data set:

Example Program:

DATA TEMP;

INFILE 'c:\temp\sasbasics\testdata.txt';

INPUT #1 @ 1 (V1-V7) (1.)

@ 9 (AGE) (2.)

@ 12 (IQ) (3.)

@ 16 (NUMBER) (1.) ;

RUN;

DATA TEMP2;    /* name of new data set to create */

SET TEMP;    /* name of existing data set */

if age >= 21 then drink = 1;    /* data-manipulation & data-subsetting statements go here */

RUN;

PROC PRINT DATA = TEMP;    /* the variable DRINK will not be in this dataset */

RUN;

PROC PRINT DATA = TEMP2;    /* the variable DRINK will be in this dataset */

RUN;



Ways to manipulate the data can include creating variables in a DATA step with an assignment statement (see syntax below). Variables can be created or recoded in a DATA step, but not in a PROC step.

1. Create duplicate variables with new variable names:

General syntax:

<new-variable-name> = <existing-variable-name> ;

Examples:

BDI1 = V1;

LEGAL = DRINK;

2. Duplicating variables vs. renaming variables:

In the previous examples, the variables were not renamed; instead, duplicate variables were

created with new names. Both original and duplicate variables remain in the data set. There is

also a RENAME statement that renames variables without duplicating them.
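A minimal sketch of renaming with RENAME in a DATA step is shown here. The data values and the data set names TEMP and TEMP2 are illustrative only; V1 and BDI1 follow the naming used in the examples above.

```sas
/* Illustrative data: one questionnaire item (V1) and age */
DATA TEMP;
INPUT V1 AGE;
DATALINES;
2 22
3 20
;
RUN;

DATA TEMP2;
SET TEMP;
/* V1 is stored as BDI1 in TEMP2; no duplicate copy of V1 is kept */
RENAME V1 = BDI1;
RUN;

PROC PRINT DATA=TEMP2;
RUN;
```

Within the DATA step that contains the RENAME statement, other statements still refer to the variable by its old name (V1); the new name applies to the output data set.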

3. Create new variables from existing variables:

Use these symbols in SAS to perform operations on variables: ( +, - , * , / , = )

Use parentheses and follow rules for order of operations.

Use SAS functions such as SUM, MEAN, or ROUND in an assignment statement.

Always check created variables to verify that they were created correctly.

General syntax:

<new-variable-name> = <formula-including-existing-variable-name> ;

Examples:

VTOTAL = V1 + V2 + V3 + V4 ;    ← VTOTAL is set to missing for any observation with a missing value in V1-V4

VTOTAL = SUM(V1,V2,V3,V4) ;    ← SAS ignores missing values & computes the sum based on the values present

Summing variables V1 through V4 creates a new variable called VTOTAL.

4. Recode variables to have a different value:

SAS can overwrite existing variables or create a new variable to store recoded values. Variable values can be recoded upon INPUT or recoded after they are saved in a SAS data set.

SAS can recode variable values or ranges into user-specified values with IF-THEN statements.

Example:

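IF SEX = 1 THEN SEXC = 'M' ;

Note: A numeric variable cannot hold a character value, so when SEX is numeric (1 or 2) the recoded letter is stored in a new character variable (here given the illustrative name SEXC).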
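Recoding a range of values works the same way with a chain of IF-THEN/ELSE statements. The sketch below is illustrative only: the data set name AGES, the variable AGEGRP, and the cut points are assumptions, not part of the workshop data.

```sas
/* Recode AGE into three illustrative groups, keeping AGE intact */
DATA AGES;
INPUT SUBJECT AGE;
/* Handle missing ages first, because SAS treats a missing value as
   smaller than any number in comparisons */
IF AGE = . THEN AGEGRP = .;
ELSE IF AGE < 21 THEN AGEGRP = 1;
ELSE IF AGE < 30 THEN AGEGRP = 2;
ELSE AGEGRP = 3;
DATALINES;
1 22
2 20
3 32
4 .
;
RUN;

PROC PRINT DATA=AGES;
RUN;
```

Using ELSE means each observation stops at the first condition it satisfies, so the ranges do not need upper and lower bounds on every line.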

5. Recode reversed variables:

Sometimes questionnaires have reversed items – a question is stated so that the meaning is the

opposite of the meaning of the other items on the questionnaire.



In general, perform the reversal before other data manipulations are performed on those items. It

is good practice to store recoded variable values as a new variable and leave the existing variable intact.

<new-variable> = <constant> - <existing-variable> ;

For items scored 1, 2, 3, and so on, the constant is equal to the number of response options for the item plus 1.

V1R = 6 - V1; (in the case of 5 response options, scored 1-5)
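The reversal can be checked on a few rows before it is applied to a full data set. This is a small sketch with made-up values; the data set name ITEMS is hypothetical, and V1 is assumed to be scored 1-5.

```sas
/* Reverse-score V1 (scored 1-5) into a new variable V1R */
DATA ITEMS;
INPUT SUBJECT V1 V2;
/* 6 - V1 maps 1 to 5, 2 to 4, 3 to 3, 4 to 2, and 5 to 1 */
V1R = 6 - V1;
DATALINES;
1 1 4
2 5 2
3 3 3
;
RUN;

/* Print the original and reversed item side by side to verify the recode */
PROC PRINT DATA=ITEMS;
VAR SUBJECT V1 V1R;
RUN;
```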

SUBSETTING DATA

Data-subsetting will eliminate unwanted observations from a sample so only a specified subgroup is in

the data set. For example, you only want to look at males and not females, or a particular age range.

Use what is called a sub-setting IF statement to perform analyses on only a subset of observations included in the data set.

General syntax:

DATA <new-data-set-name> ;

SET <existing-data-set-name> ;

IF statement;

Example:

To obtain the mean for each variable only for ages greater than 20 in the data set:

DATA TEMP2;

SET TEMP;

IF AGE > 20 ;

PROC MEANS DATA = TEMP2;    ← This will display means only for those subjects older than 20.

RUN;

LABELS for VARIABLES

Use the LABEL statement to associate a label with any or all of the variables. Many SAS procedures print a variable name followed by its label to help document what is in the output.

General syntax:

LABEL var1 = 'label for var1'    ← The label can be up to 256 characters (including blanks) in SAS version 7 and above; earlier versions allowed 40

var2 = 'label for var2'

…

var[n] = 'label for var[n]' ;



The LABEL statement tells SAS to associate the label “label for var1” with the variable var1, the label

“label for var2” with variable var2, and so on.

• Use the LABEL statement within a DATA step to associate the label(s) permanently with the

variable(s). These labels will be used in subsequent PROCs.

• Use the LABEL statement within a PROC step to associate the label(s) temporarily with the

variable(s). Labels associated with variables in a PROC step will be used in that PROC only.
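The difference between the two placements can be sketched as follows. The data values and the label text are illustrative assumptions; TEMP, AGE, and IQ follow the names used in earlier examples.

```sas
/* Labels assigned in the DATA step are stored with the data set */
DATA TEMP;
INPUT SUBJECT AGE IQ;
LABEL AGE = 'Age in years'
      IQ  = 'Full-scale IQ score';
DATALINES;
1 22 98
2 20 105
3 32 90
;
RUN;

/* This PROC picks up the stored labels */
PROC MEANS DATA=TEMP;
VAR AGE IQ;
RUN;

/* A LABEL statement in a PROC step overrides the stored label
   for this PROC only; the data set itself is unchanged */
PROC MEANS DATA=TEMP;
VAR AGE;
LABEL AGE = 'Age at testing';
RUN;
```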

FORMATS for VARIABLES

A format is a set of instructions that tells SAS how to print variable values in the output. A format can be associated with one or more variables temporarily in a PROC step or permanently in a DATA step.

You need to provide a place for SAS to keep the format library that you create. You use the LIBNAME

statement to do this. The libref LIBRARY is always used to refer to the format library. SAS will store the formats in a separate catalog file (formats.sas7bcat in SAS version 7 and above; the extension was .SC2 in version 6). This file must always be available along with the SAS data file or else you will

encounter errors.

1. To associate a format temporarily, use the FORMAT statement on a PROC step.

Example:

LIBNAME library 'c:\temp\sasbasics' ;

PROC FORMAT LIBRARY=LIBRARY;

VALUE $sex 'M' = 'Male' 'F' = 'Female' ;

VALUE affinity 1 = 'not at all'

2 = 'a little'

3 = 'in the middle'

4 = 'a lot'

5 = 'I LOVE IT' ;

RUN;

PROC MEANS DATA=data.testdata;

VAR v1 v2 v3 v4 v5 v6 v7 ;

FORMAT v1-v7 affinity. ;

RUN;

2. To associate a format permanently, use the FORMAT statement on a DATA step.

Example:

LIBNAME library 'c:\temp\sasbasics' ;

PROC FORMAT LIBRARY=LIBRARY;

VALUE $sex 'M' = 'Male' 'F' = 'Female' ;

VALUE affinity 1 = 'not at all'

2 = 'a little'

3 = 'in the middle'

4 = 'a lot'

5 = 'I LOVE IT' ;

RUN;

DATA TEMP;

SET data.testdata;

FORMAT v1-v7 affinity. ;

RUN;

PROCEDURES

1. Examining the variables in a SAS data set

To print descriptor information about a SAS data set, use PROC CONTENTS.

General syntax:

PROC CONTENTS DATA = <libref.filename> or <filename>;

For example, PROC CONTENTS DATA = TEMP; tells SAS to run the CONTENTS procedure on the temporary SAS data set called TEMP.

PROC CONTENTS will list the name, type (numeric or character), length in bytes, and ordinal

position in the SAS data set, for each variable in alphabetical order.

General syntax with options:

PROC CONTENTS DATA = <libref.filename> or <filename> POSITION;

You can use statement options to change the defaults for PROC CONTENTS:

POSITION – will list variables in the order of their position in the SAS data set

SHORT – will print only a list of the variable names in the SAS data set

2. Examining the values in a SAS data set

To print the actual data (the actual observations) in a SAS data set, use PROC PRINT.

General syntax:

PROC PRINT DATA = <libref.filename> or <filename>;

For example, PROC PRINT DATA = TEMP; tells SAS to run the PRINT procedure on the temporary SAS data set TEMP. PRINT numbers each observation and lists variable values in columns under the variable name.

General syntax with options:

PROC PRINT DATA = <libref.filename> or <filename> DOUBLE NOOBS ;



You can use statement options to change the defaults for PROC PRINT:

DOUBLE – double-spaces output

NOOBS – suppresses printing of the observation number

UNIFORM – formats all pages uniformly (by default, SAS fits as much per page as

possible)

3. Producing frequency tables and crosstabulations

To produce frequency tables and/or crosstabulations and any relevant statistics use PROC FREQ.

General syntax:

PROC FREQ DATA = <libref.filename> or <filename>;

TABLES var

var * var

var * var * var / options ;

• var = simple (one-way) frequency table

• var * var = crosstabulation (two-way table) where values of the variable before the asterisk (*)

will occupy the rows of the table and the values of the variable after the asterisk will occupy the columns of the table (row * column).

• var * var * var = crosstabulations of the second variable by the third variable for each level of

the first (control) variable (control * row * column).

• The slash (/) introduces options for the tables, such as optional statistics (e.g., / CHISQ ; )

• The list option ( / LIST) tells SAS to present all crosstabulations in one table (this is particularly useful when you are examining 3 or more variables)
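The table requests above can be sketched on a small made-up data set. The data set name SURVEY, the variable GROUP, and all values are hypothetical and serve only to show the syntax.

```sas
/* Illustrative data: sex, a two-level grouping variable, and a 0/1 outcome */
DATA SURVEY;
INPUT SUBJECT SEX $ GROUP DRINK;
DATALINES;
1 M 1 1
2 F 1 0
3 M 2 1
4 F 2 1
5 M 1 0
6 F 2 0
;
RUN;

PROC FREQ DATA=SURVEY;
/* One-way frequency table */
TABLES SEX;
/* Two-way table (SEX in rows, DRINK in columns) with a chi-square test */
TABLES SEX * DRINK / CHISQ;
/* Three-way request printed as a single list instead of one table per GROUP */
TABLES GROUP * SEX * DRINK / LIST;
RUN;
```

Options after the slash apply to every table requested in the same TABLES statement, which is why the requests are split across three statements here. With so few observations SAS is expected to warn that the chi-square test may not be valid; the sketch only demonstrates where the option goes.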

4. Producing univariate descriptive statistics

To calculate univariate descriptive statistics (e.g., mean, standard deviation, maximum, minimum,

median, percentiles) for one or more numeric variables use PROC UNIVARIATE.

General syntax:

PROC UNIVARIATE DATA = <libref.filename> or <filename>;

VAR var1 var2 … var[n] ;

PROC UNIVARIATE can provide additional detail on the distribution of a variable including plots, frequency tables, and a test to determine whether the data are normally distributed. Add the

PLOT, FREQ, and/or NORMAL option to the PROC UNIVARIATE statement to include this

information in the output.

General syntax with options:

PROC UNIVARIATE DATA = <libref.filename> or <filename> PLOT FREQ NORMAL ;

VAR var1 var2 … var[n] ;

PROC UNIVARIATE will print a separate page of output for each variable. It is useful for

examining percentiles and outliers. Use PROC MEANS to print univariate descriptive statistics for more than one variable on the same page.
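A short sketch contrasting the two procedures is given below. The SUBJECTS data set and its values are hypothetical; the variable names AGE and IQ are borrowed from the sample program later in this handout.

```sas
/* Hypothetical data, made up for illustration only */
DATA subjects;
  INPUT age iq;
  DATALINES;
23 101
35 96
41 118
29 107
52 89
38 112
44 99
31 125
;
RUN;

/* Detailed output for each variable, with plots and tests of normality */
PROC UNIVARIATE DATA = subjects PLOT NORMAL;
  VAR age iq;
RUN;

/* Compact summary of both variables in one table */
PROC MEANS DATA = subjects N MEAN STD MIN MAX;
  VAR age iq;
RUN;
```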



5. General form for regression analyses

SAS has a very useful procedure that can be used to run a number of different regression analyses (e.g.,

linear, logistic, Poisson, etc.): PROC GENMOD

General syntax:

PROC GENMOD DATA = <libref.filename> or <filename>;

MODEL <dependent variable> = <independent variable 1> <IV2> <IV3> / DIST = <name of distribution> LINK = <name of link function>;

RUN;

PROC GENMOD is a particularly flexible procedure because it allows you to specify the underlying

distribution and link function for your analyses. For example, to run a logistic regression a binomial distribution and logit link would be specified:

PROC GENMOD DATA = <libref.filename> or <filename>;

MODEL <DV> = <IV1> <IV2> <IV3> / DIST = binomial LINK = logit;

RUN;
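As a concrete sketch of the logistic regression syntax, the program below fits a model to a small hypothetical data set (TRIAL, with made-up variables IMPROVED, DOSE, and AGE). The DESCENDING option is an addition not covered above: by default PROC GENMOD models the probability of the lower response value (here 0), and DESCENDING makes it model the probability that IMPROVED = 1 instead.

```sas
/* Hypothetical data, made up for illustration only */
DATA trial;
  INPUT improved dose age;
  DATALINES;
1 1 29
0 1 34
0 1 51
1 1 60
0 2 45
1 2 38
0 2 60
1 3 41
1 3 33
0 3 58
1 4 47
1 4 36
0 4 62
;
RUN;

/* Logistic regression: binomial distribution with a logit link */
PROC GENMOD DATA = trial DESCENDING;
  MODEL improved = dose age / DIST = binomial LINK = logit;
RUN;
```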

Below is a list of some commonly used distributions and their corresponding link functions. For each distribution, the first link listed is the default.

Distribution        DIST =                Link Function                       LINK =
Normal              NORMAL or N           Identity                            IDENTITY or ID
Poisson             POISSON or P          Log                                 LOG
Negative Binomial   NEGBIN or NB          Log                                 LOG
Binomial            BINOMIAL or B         Logit                               LOGIT
Binomial            BINOMIAL or B         Probit                              PROBIT
Binomial            BINOMIAL or B         Complementary log-log               CLOGLOG or CLL
Multinomial         MULTINOMIAL or MULT   Cumulative logit                    CUMLOGIT or CLOGIT
Multinomial         MULTINOMIAL or MULT   Cumulative probit                   CUMPROBIT or CPROBIT
Multinomial         MULTINOMIAL or MULT   Cumulative complementary log-log    CUMCLL or CCLL

Additive Odds/Linear Odds Model: use the binomial distribution (BINOMIAL or B). You must specify both the forward link and the backwards (inverse) link yourself, using the FWDLINK and INVLINK statements. The forward link is written in terms of _MEAN_ and the inverse link in terms of _XBETA_:

FWDLINK link = _MEAN_ / (1 - _MEAN_) ;

INVLINK ilink = _XBETA_ / (1 + _XBETA_) ;

Note that there are many, many more procedures that SAS uses to perform analyses.



TITLES

Document your output with the use of titles. Titles can be used anywhere in the program.

General syntax:

TITLE '<Insert your title here: This is a title to be printed on line 1 of each page of output>' ;

Note that SAS processes a program in steps. A step begins with either a DATA or a PROC statement. A step ends with another DATA or PROC statement (or the end of the program). All TITLEs encountered

from the beginning of the step until the beginning of the next step are used for the current step. Use

optional RUN; statements to end a step at a specific point. Suppress a TITLE by writing the TITLE; statement with no text following it.
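The step-by-step behavior of titles can be sketched as follows, using a small hypothetical data set (SCORES). Each RUN; statement ends its step, so the title in effect at that point is the one printed with that step's output.

```sas
/* Hypothetical data, made up for illustration only */
DATA scores;
  INPUT id score;
  DATALINES;
1 82
2 91
3 77
;
RUN;

TITLE 'Listing of the SCORES data set';
PROC PRINT DATA = scores;
RUN;                       /* ends the step; the title above is used here */

TITLE;                     /* no text: suppresses the title from here on */
PROC MEANS DATA = scores;
  VAR score;
RUN;                       /* this output is printed without the earlier title */
```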

CATEGORICAL VARIABLES

An important difference between SAS and some other commonly used statistical packages (e.g., SPSS) is that SAS requires that you specify whether a variable is categorical each time you use it in an analysis. Categorical variables are identified with the CLASS statement, which goes before the MODEL statement.

General Syntax:

PROC STATEMENT DATA = <libref.filename> or <filename>;

CLASS <name of categorical variable>;

MODEL STATEMENT;

Syntax (using PROC GENMOD as an example):

PROC GENMOD DATA = <libref.filename> or <filename>;

CLASS <name of categorical variable>;

MODEL <dependent variable> = <independent variable> / DIST = <name of distribution> LINK = <name of link function>;

RUN;
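A minimal sketch of the CLASS statement in a Poisson regression follows. The CLINIC data set, the count outcome VISITS, and the categorical predictor GROUP are hypothetical names chosen for illustration.

```sas
/* Hypothetical data, made up for illustration only */
DATA clinic;
  INPUT visits group $ age;
  DATALINES;
3 A 34
5 A 51
2 A 29
7 B 45
6 B 38
9 B 60
4 C 41
8 C 58
5 C 33
;
RUN;

PROC GENMOD DATA = clinic;
  CLASS group;                                       /* treat GROUP as categorical */
  MODEL visits = group age / DIST = poisson LINK = log;
RUN;
```

Without the CLASS statement SAS would reject GROUP in the MODEL statement here, because it is a character variable.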

INTERACTIONS

Another important difference between SAS and other statistical packages involves interactions among variables. In some packages (e.g., SPSS), interactions between main effect variables will be automatically included in certain types of analyses, such as ANOVA models. This does not occur in SAS: an interaction is tested only if you explicitly include it in the MODEL statement. One way to do this, shown here, is to create the interaction variable in the DATA step and then add it to the MODEL statement of any analysis in which the interaction is to be tested. (Many procedures, including PROC GENMOD, also accept the form var1*var2 directly in the MODEL statement.)

General Syntax for creating interaction variables:

DATA <libref.filename> or <filename>;

SET <libref.filename> or <filename>;

Interaction = (variable1)*(variable2);

RUN;

Syntax for including interactions in analyses (using PROC GENMOD as an example):

PROC GENMOD DATA = <libref.filename> or <filename>;

MODEL <DV> = <IV1> <IV2> <Interaction> / DIST = <name of distribution> LINK = <name of link function>;

RUN;
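The two steps can be sketched together as follows. The GROWTH data set and its variables (YIELD, WATER, FERT) are hypothetical, and the product term WATER_FERT is a name made up for this illustration.

```sas
/* Hypothetical data, made up for illustration only */
DATA growth;
  INPUT yield water fert;
  DATALINES;
12 1 1
15 2 1
14 1 2
22 2 2
13 1 1
16 2 1
15 1 2
24 2 2
;
RUN;

/* Step 1: create the interaction as the product of the two predictors */
DATA growth2;
  SET growth;
  water_fert = water * fert;
RUN;

/* Step 2: include the main effects and the interaction in the model */
PROC GENMOD DATA = growth2;
  MODEL yield = water fert water_fert / DIST = normal LINK = identity;
RUN;

/* For these numeric predictors, writing water*fert in place of water_fert
   in the MODEL statement requests the same interaction term. */
```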



PUTTING A PROGRAM TOGETHER

/* PUT A PROGRAM TOGETHER */

/* Assign formats to the variables. Numbers generally don't require formats.

This step just lays out the formats, but does not permanently assign them. */

PROC FORMAT;

VALUE $sex 'M' = 'Male'

'F' = 'Female' ;

VALUE affinity 1 = 'not at all'

2 = 'a little'

3 = 'in the middle'

4 = 'a lot'

5 = 'I LOVE IT';

LIBNAME data 'c:\temp\sasbasics';

DATA data.testdata ;

INFILE '<path to raw data file>' ;

INPUT #1 @ 1 (V1) (1.)

@ 2 (V2) (1.)

@ 3 (V3) (1.)

@ 4 (V4) (1.)

@ 5 (V5) (1.)

@ 6 (V6) (1.)

@ 7 (V7) (1.)

@ 9 (AGE) (2.)

@ 12 (IQ) (3.)

@ 16 (NUMBER) (1.)

;

run;

/* Create a permanent data set called TESTDAT2. We need a new data set

because we are about to change the data by adding labels and formats to the

variables and creating new variables. We need to SET the data set we want to

work from (called TESTDATA) in order to create the new version (called

TESTDAT2). */

DATA data.testdat2 ;

SET data.testdata ;

/* Create some new variables */

if 1 <= number <= 3 then sex = 'M';

if number > 3 then sex = 'F';

GENDER = SEX;

VTOTAL = V1 + V2 + V3 + V4;



/* Assign labels to the variables. */

LABEL

V1 = 'Variable 1'

V2 = 'Variable 2'

V3 = 'Variable 3'

V4 = 'Variable 4'

V5 = 'Variable 5'

V6 = 'Variable 6'

V7 = 'Variable 7'

age = 'Age of Subject'

IQ = 'IQ of Subject'

number = 'ID Number'

gender = 'Gender of Subject'

vtotal = 'Total sum of V1-V4'

;

/* Permanently assign the formats to the variables. V1-V7 use the same

format. */

FORMAT gender $sex. V1-V7 affinity. ;

run;

/* When you want SAS to use the most recently created data set for a

procedure, you do not need to identify it in the PROC statement. SAS defaults

to the last data set created - in this case, it is TESTDAT2. */

/* Print the variables in the data set for each person. */

PROC PRINT DOUBLE;

TITLE 'Print of data in TESTDAT2';

RUN;

/* Produce means of the variables in the data set */

PROC MEANS;

TITLE 'Means of numeric variables in TESTDAT2.SD2';

RUN;

/* Correlate age and IQ */

PROC CORR;

VAR AGE IQ;

TITLE 'Correlation b/t age and IQ';

RUN;

AFTER RUNNING YOUR SAS PROGRAM

Always check the log file that is produced when you run a SAS program. Check the number of observations read. The log will indicate if there are any errors in the program that must be fixed. The log

also provides comments about what SAS did with your program. When an error is found, return to the

program and, starting from the beginning of the program, edit one thing at a time and re-run the program (this helps isolate where the problem is located because the log doesn’t always specify exactly where the

problem occurred).



MISCELLANEOUS NOTES

• An excellent collection of searchable SAS resources: http://www.ats.ucla.edu/stat/sas/

• Another excellent resource for using SAS which lists all possible options for each PROC (website is

organized alphabetically by PROC): http://www.csc.fi/cschelp/sovellukset/stat/sas/sasdoc/sashtml/mindex/a-index.htm

• SAS can be fairly abstract, but it is also very powerful.

• SAS is great for large data sets with hundreds or thousands of observations and variables.

• SAS relies heavily on programming code as opposed to using icons and pull-down menus to execute commands.

• SAS is a very logical language and is useful for planning out the steps necessary to do complicated data work. Also, note that certain statements must go before other statements.

• One of the hardest concepts to grasp is the distinction between a temporary data set and a permanent

data set.

• Know your data well. Know what kind of file you will be working from. Think about whether you need to build a data file from scratch or utilize an existing data file.

• There are MANY ways to accomplish the same goal in SAS. Go with what feels most comfortable.

• You can always look up how to do things in SAS if you can’t remember!
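Since the distinction between temporary and permanent data sets is singled out above as a hard one, here is a minimal sketch of it. The data set names TEMPSCORES and PERMSCORES and the data values are hypothetical; the libref and folder are the ones used in the sample program, and the folder is assumed to exist already.

```sas
LIBNAME data 'c:\temp\sasbasics';   /* assumes this folder already exists */

/* One-level name: a temporary data set, stored in the WORK library
   and deleted when the SAS session ends */
DATA tempscores;
  INPUT id score;
  DATALINES;
1 82
2 91
;
RUN;

/* Two-level name (libref.filename): a permanent data set,
   saved in the folder assigned to the libref DATA */
DATA data.permscores;
  SET tempscores;
RUN;
```

In a later session, reissuing the LIBNAME statement makes data.permscores available again, while tempscores would have to be recreated.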
