Introduction to SAS®
Version 1.4, updated 9/29/2002

by Kazuaki Uekawa, Ph.D.
Visiting Scholar, The Department of Sociology, The University of Chicago; Population Research Center at NORC
Address: 1155 E. 60th St., Room 340, Chicago, IL 60637
Download from www.src.uchicago.edu/users/ueka

Copyright © 2002 by Kazuaki Uekawa. All rights reserved.

Table of Contents

I. Introduction
II. How to start?
III. LIBNAME: Assigning library name
IV. Create SAS data for a practice
V. Creating New Variables
VI. Procedures
    A. PROC CONTENTS: Description of Contents
    B. PROC PRINT: See Data
    C. PROC SORT: Sorting Observations based on a value of variable
    D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
    E. PROC FREQ: Get Frequencies
    F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
    G. PROC PLOT: Plotting Two Variables
    H. PROC TIMEPLOT: Time Plot
    I. PROC CORR: Correlation
    J. PROC REG: OLS Regression
    K. PROC LOGISTIC: Logistic Regression
    L. MAKE AN ASCII FILE
VII. More Procedures
    M. PROC STANDARD: Standardize Values
    N. PROC RANK: Rank observations
    O. PROC SQL: Creating group-level mean variables
VIII. Merging Data Sets
IX. Temporary and Permanent Data Sets


I. Introduction

I recommend SAS® over other statistical packages because:

a) ODS (Output Delivery System) lets users save statistical results as data sets. You can then build tables from the result data set within one single program (as opposed to printing the results on paper and finishing the tables in Excel). The table can be as sophisticated as

http://www.src.uchicago.edu/users/ueka/SAS/proc_mixed_example1output.txt

and it can be further saved in Excel format using PROC EXPORT.

b) A rich array of macro functions.

c) Email support service with quick response.

d) Users come from many fields, including the social and natural sciences, as well as business. Thus, SAS® programming skill can be an asset in the job market.

I discuss both ODS and macros in Introduction SAS 2, a document available from the same website.

Idiosyncrasy of this document

I am writing this document on my Japanese PC, where the backslash is displayed as a yen sign (¥). If you see ¥ in a path name, read it as a backslash (\).

University of Chicago users can access the SAS online documentation for version 8 on the web:

http://gsbapp2.uchicago.edu/sas/sashtml/main.htm

Note on SAS email support:

When you email SAS support with a question, you need to identify yourself as a legitimate

SAS customer. Look at the head of a log file and copy and paste the information at the

beginning of your email text.

NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.

NOTE: SAS (r) Proprietary Software Release 8.2 (TS2M0)

Licensed to UNIVERSITY OF XXXXX, Site XXXXX.

NOTE: This session is executing on the WIN_ME platform.

II. How to start?

1. Start SAS. You can find the shortcut by going from START → PROGRAMS → The SAS System.

2. Type syntax in the EDITOR window. Syntax is what you learn in this document.

3. Click on the runner icon to run the program. Alternatively, you can highlight the part of the syntax that you want to run and then click the runner to run only that part. (A downside of using UNIX instead of WINDOWS is that UNIX does not let you do this selective run.)

The LOG file contains messages. Watch for the words ERROR and WARNING.

The OUTPUT file contains output.

If you mistype syntax and want to undo it, press Ctrl-Z. This is the same command that is used in Microsoft Office products.

To cancel a run while it is in progress, click on the stop icon (which looks like “!”) right next to the runner icon.

III. LIBNAME: Assigning library name

Typing full directory paths every time (e.g., C:\temp\abc\old) is tedious, so we give them nicknames at the beginning of a program.

libname here "C:\TEMP";
libname there "C:\";

So from now on,

here.abc means the data set named “abc” placed in the directory nicknamed “here.”

there.xyz means the data set named “xyz” placed in the directory nicknamed “there.”
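
To make the nickname idea concrete, here is a minimal sketch. It assumes the folder C:\TEMP already exists; the variable x is made up purely for illustration.

```sas
libname here "C:\TEMP";      /* nickname for the folder (assumed to exist) */

data here.abc;               /* stored in C:\TEMP as abc.sas7bdat */
  x = 1;                     /* x is a made-up variable for illustration */
run;

proc print data=here.abc;    /* read it back through the same nickname */
run;
```

After running this, you should find the file abc.sas7bdat in C:\TEMP, and it stays there after you close SAS.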

IV. Create SAS data for a practice

Description of Practice Data

The data come from TIMSS (Third International Mathematics and Science Study), in which three population groups (3rd and 4th graders, 7th and 8th graders, and high school seniors) from some 40 nations participated. I aggregated the data at the national level. The variables are:

acro: acronym for the participant nation
nation: short name of the country
name: complete name of the country
mat7: 7th graders’ average math test score
mat8: 8th graders’ average math test score
GNP14: GNP per capita
prop: proportion of 8th graders in schooling
NATEXAM: administers a national-level exam
NATSYLB: syllabus is decided at the national level
NATTEXT: textbooks are chosen at the national level
block: regional block of the country (e.g., weuro, seasia)



data kaz;
/*NATION is read from columns 6-14 and NAME from columns 15-33, so keep the spacing of the data lines exactly as shown*/
input
acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM
NATSYLB NATTEXT block $;
cards;
aus  Australi Australia           498 529.63 -0.15526 84 0 1 0 ocea
aut  Austria  Austria             509 539.43 -0.29163 100 0 0 1 weuro
bfl  Belgi_FL Belgium (Fl)        558 565.18 -0.25157 100 1 1 0 weuro
bfr  Belgi_FR Belgium (Fr)        507 526.26 -0.25157 100 0 1 0 weuro
can  Canada   Canada              494 527.24 0.07184 88 0 0 0 namer
col  Colombia Colombia            369 384.76 -0.23699 62 0 1 0 samer
cyp  Cyprus   Cyprus              446 473.59 -0.41906 95 0 1 1 seuro
csk  Czech    Czech Republic      523 563.75 -0.34840 86 0 1 0 eeuro
dnk  Denmark  Denmark             465 502.29 -0.34057 100 1 0 0 weuro
fra  France   France              492 537.83 0.55791 100 0 1 0 weuro
deu  Germany  Germany             484 509.16 0.91992 100 0 0 0 weuro
grc  Greece   Greece              440 483.90 -0.32620 99 0 1 1 seuro
hkg  HongKong Hong Kong           564 588.02 -0.31638 98 1 1 1 seasia
hun  Hungary  Hungary             502 537.26 -0.37602 81 0 0 0 eeuro
isl  Iceland  Iceland             459 486.78 -0.42606 100 0 0 0 neuro
irn  Iran     Iran, Islamic Rep.  401 428.33 -0.17095 66 0 1 1 meast
irl  Ireland  Ireland             500 527.40 -0.38919 100 1 1 0 weuro
isr  Israel   Israel              . 521.59 -0.35464 87 0 1 0 meast
jpn  Japan    Japan               571 604.77 1.85543 96 0 1 0 seasia
kor  Korea    Korea               577 607.38 -0.01168 93 0 1 1 seasia
kwt  Kuwait   Kuwait              . 392.18 -0.40359 60 0 1 1 meast
lva  Latvia   Latvia (LSS)        462 493.36 -0.42319 87 0 0 0 eeuro
ltu  Lithuani Lithuania           428 477.23 -0.41785 78 1 1 1 eeuro
nld  Netherla Netherlands         516 540.99 -0.18184 93 1 0 0 weuro
nzl  NewZeala New Zealand         472 507.80 -0.38319 100 1 1 0 ocea
nor  Norway   Norway              461 503.29 -0.35450 100 0 1 1 neuro
prt  Portugal Portugal            423 454.45 -0.32588 81 0 1 0 weuro
rom  Romania  Romania             454 481.55 -0.35396 82 1 1 1 eeuro
rus  RussianF Russian Federation  501 535.47 0.12827 88 1 0 0 eeuro
sco  Scotland Scotland            463 498.46 0.48017 100 0 0 0 weuro
sgp  Singapor Singapore           601 643.30 -0.37279 84 1 1 1 seasia
slv  SlovakRe Slovak Republic     508 547.11 -0.40217 89 0 1 0 eeuro
svn  Slovenia Slovenia            498 540.80 -0.41310 85 0 1 1 eeuro
esp  Spain    Spain               448 487.35 0.03461 100 0 1 1 weuro
swe  Sweden   Sweden              477 518.64 -0.30049 99 0 1 0 neuro
che  Switzerl Switzerland         506 545.44 -0.27916 91 0 0 0 weuro
tha  Thailand Thailand            495 522.37 -0.14533 37 0 1 1 seasia
usa  USA      United States       476 499.76 5.37506 97 0 0 0 namer
;
run;

/*this prints out the data*/

proc print;

run;

Advanced Topic:

Alternatively, you can save the data above (just the data lines) as a plain text file named kaz.txt in the temp directory of your C drive. (If you only have this document as a hard copy, visit www.src.uchicago.edu/users/ueka for a digital version, so you can copy and paste.) Then use the program below to read in the file.

/*these two lines are not crucial in this example, but let’s just put these at the beginning of

your program*/



data kaz;
infile "C:\TEMP\kaz.txt" missover;
input
acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8
GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
run;

MISSOVER tells SAS not to jump to the next line when a data line ends before all the variables have been read; the variables left without values are set to missing. It is safe to use.

$ means that the variable named just before it is a character variable, as opposed to numeric.

V. Creating New Variables

Data kaz2;
set kaz;
/*ADDITION*/
var1=mat7+mat8;
/*OR*/
var2=sum(of mat7 mat8);
/*SUBTRACTION*/
var3=mat8-mat7;
/*MULTIPLICATION*/
var4=mat7*mat8;
/*DIVISION*/
var5=mat7/mat8;
/*Use parentheses effectively*/
var6=1/(mat7+mat8);
/*MEAN of several variables*/
var7=mean(of mat7 mat8);
/*MAX of several variables*/
var8=max(of mat7 mat8);
/*MIN of several variables*/
var9=min(of mat7 mat8);
/*LOG (natural log): the value entered must be positive*/
var10=log(mat7);
/*Absolute values: this removes negative signs*/
var11=abs(gnp14);
run;

/*TO SEE WHAT YOU DID, USE PROC PRINT*/
proc print data=kaz2;
title "Lots of manipulations: See results";
var mat7 mat8 var1 var2 var3 var4 var5
var6 var7 var8 var9 var10 var11;
run;

Advanced Topics:

How is Z=mean(of X1 X2 X3) different from Z=(X1+X2+X3)/3;?

How is Z=sum(of X1 X2 X3) different from Z=X1+X2+X3;?

Functions such as mean(of …) or sum(of …) compute their statistics from the non-missing values only. They return a value even when some of the variables in the parentheses are missing. For example, if X1 is missing:

Z=mean(of X1 X2 X3); will return the average of X2 and X3.

In contrast,

Z=(X1+X2+X3)/3; will return a missing value, namely, “.”
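
To see the difference for yourself, here is a small sketch using made-up values (the data set demo and its variables are not part of the practice data):

```sas
data demo;
  x1 = .; x2 = 4; x3 = 6;           /* x1 is deliberately missing */
  z_func  = mean(of x1 x2 x3);      /* 5: average of the two non-missing values */
  z_arith = (x1 + x2 + x3) / 3;     /* . : a missing value propagates through + */
  s_func  = sum(of x1 x2 x3);       /* 10: sum of the non-missing values */
  s_arith = x1 + x2 + x3;           /* . */
run;

proc print data=demo;
run;
```

The log will also carry a note that missing values were generated by the arithmetic versions, which is a useful warning sign in real work.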

Read this after you study PROC REG later in the document.

When we compare several regression models (e.g., coefficients, R², goodness of fit), we want the number of observations to be the same across the models. Because predictors can have different patterns of missing values, this does not happen automatically; you have to make it happen. For example, mat7 (7th graders’ mathematics score) has some missing cases, because some nations let only their 8th graders participate in this international test. Use the NMISS function to create a new variable, john, that counts the missing values.

data kaz2;set kaz;
john=nmiss(of GNP14 mat8 mat7);/*this returns the number of missing values among these three variables*/
run;

/*check what the data look like now*/
proc print data=kaz2;
var name gnp14 mat8 mat7 john;
run;

/*Apply OLS regression to cases with complete data (no missing values). In this way, model 1 and model 2 will have the same number of cases, or to be more precise, the same data.*/
proc reg data=kaz2;
where john=0; /*Use only cases where john=0, namely, the number of missing values is 0*/
model mat8=mat7;
model mat8=mat7 gnp14;
run;

VI. Procedures

A. PROC CONTENTS: Description of Contents

PROC CONTENTS data=kaz;

run;

Advanced topic: The variables are listed in alphabetical order. They can also be shown by their position in the data set (left to right) by adding the “position” option:

proc contents data=kaz position;

run;

I like this option because in this way you can find related variables close to each other.

B. PROC PRINT: See Data

PROC PRINT data=kaz;

VAR nation mat7 mat8 natexam; /*without this, all variables will be printed*/

run;

Advanced topic: You can selectively print observations.

/*print only when natexam=1*/

proc print data=kaz;where natexam=1;var nation mat7 mat8;run;

/*print by group units: PROC PRINT must read the sorted data set kaz2*/

proc sort data=kaz out=kaz2;by block;run;

proc print data=kaz2;by block;var nation mat7 mat8;run;

/*print only up to a certain number of observations*/

/*you want to do this when your data set is big and you don’t want to print every observation*/

data kaz2;set kaz;

john=_n_;

/*this creates a new variable holding the sequence number (row number) of each observation*/

run;

proc print data=kaz2;where john < 5;run; /*this shows the first 4 observations*/
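
As a shorter alternative sketch, the OBS= data set option stops reading after a given number of observations, so no extra variable is needed. It assumes the data set kaz from Section IV exists.

```sas
proc print data=kaz(obs=4);   /* read only the first 4 observations */
  var nation mat7 mat8;
run;
```

The john=_n_ approach is still handy when you want to keep the row number as a variable for later use.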

If you want a nicer print-out, try proc report.

C. PROC SORT: Sorting Observations based on a value of variable

You will use this procedure a lot, but be careful with large data sets: sorting consumes a lot of computation time.

PROC SORT data=kaz out=kaz2;
/*If you don’t want to create a new data set, write “out=kaz” or simply omit OUT=*/
by mat8;
run;

Advanced topics:

proc sort data=kaz out=kaz2 nodupkey;

by block;

run;

proc print data=kaz2;run;

With NODUPKEY, this keeps only the first observation of each block. Imagine data that contain individual-level variables (e.g., for 100 students) and group-level variables (e.g., for 10 schools), and you want to get school-level information from them. The procedure above would keep just the first observation of each school and give you ten lines of data for the 10 schools. The individual-level variables in those lines describe only that first student, so ignore them.

You can use more than one variable in the BY statement.

proc sort data=kaz out=kaz2;

by natexam block;

run;

/*What does the new data set look like?*/

proc print data=kaz2;run;

D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)

PROC MEANS data=kaz;

VAR mat7 mat8;

run;

Advanced topic: Group means.

/*Report group means*/

proc sort data=kaz out=kaz2;by block;run;

proc means data=kaz2;

by block;

var mat7 mat8;

run;

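You can also use a “class” statement instead of a “by” statement. The CLASS statement is easier because you don’t need to sort the data by the group variable first. One trade-off is that CLASS holds the statistics for all groups in memory at once, so BY can be the safer choice for very large data sets with many groups.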

proc means data=kaz2; /*now, kaz2 does not have to be sorted by block*/

class block;

var mat7 mat8;

run;

/*Save group means*/

ods listing close; /*printing of results suppressed*/

proc means data=kaz2; /*make sure kaz2 is already sorted by group ID*/

by block;

var mat7 mat8;

ods output summary=john; /*Output Delivery System Used. See SAS manual 2*/

run;

ods listing; /*printing of results resumed*/

proc print data=john;

run;

/*Get standard errors by adding STDERR*/

/*Once you name any statistic, only the ones you name are reported, so list the others you want as well: MEAN, N, STD, MAX, and MIN*/

PROC MEANS data=kaz mean n std max min stderr;

VAR mat7 mat8;

run;

I recommend reading the chapter on PROC MEANS in the SAS online documentation. It is a very versatile procedure.

E. PROC FREQ: Get Frequencies

PROC FREQ data=kaz;

Tables natexam ;

Run;

Advanced topics:

Get cross tabulation:

PROC FREQ data=kaz;

tables natexam*block;

run;

F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot

PROC UNIVARIATE PLOT DATA=KAZ;

var mat7 mat8 gnp14;

run;

Advanced topic: Get a box-and-whisker plot by subgroup, so you can compare group values. But the output is text-based and pretty ugly.

proc sort data=kaz out=kaz2;

by block;

run;

PROC UNIVARIATE data=kaz2 plot;

by block;

var mat8;

run;

G. PROC PLOT: Plotting Two Variables

This is a text-based graph. Use PROC GPLOT for nicer graphics.

PROC PLOT data=KAZ;

Plot mat7*mat8;

run;

H. PROC TIMEPLOT: Time Plot

proc timeplot data=KAZ;

plot mat8= '*';

id NAME;

run;

Advanced topics:

/*Sort first by the variable of interest and then plot it*/

/*you will see a ranking of nations*/

proc sort data=kaz out=kaz2;

by mat8;

run;

proc timeplot data=KAZ2;

plot mat8= '*';

id NAME;

run;

Add bells and whistles. Below, I am asking, “Does GNP have anything to do with test scores?”

/*First sort by GNP*/

proc sort data=kaz out=kaz2;

by gnp14;

run;

proc timeplot data=KAZ2;

title "TIMSS countries sorted by GNP";

plot mat7 mat8/overlay hiloc npp ;

id NAME block gnp14 prop;

run;

I. PROC CORR: Correlation

PROC CORR DATA=KAZ;

VAR mat7 mat8 gnp14;

Run;

J. PROC REG: OLS Regression

PROC REG DATA=KAZ;

MODEL mat8=natexam gnp14;

Run;

Advanced Topic:

See www.src.uchicago.edu/users/ueka for an example of creating a table of OLS results. Also see the PROC IML instruction on the same page to learn how OLS coefficients are estimated.

K. PROC LOGISTIC: Logistic Regression

/*I don’t know if natexam can be considered a dependent variable, but for the sake of

demonstration*/

PROC logistic data=kaz;

Model natexam=gnp14;

run;
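
One thing to watch: by default, PROC LOGISTIC models the probability of the lower ordered value of the dependent variable (here, natexam=0), which flips the signs of the coefficients relative to what most people expect. A minimal sketch, assuming you want to model natexam=1 instead, adds the DESCENDING option:

```sas
proc logistic data=kaz descending;   /* model the probability that natexam=1 */
  model natexam=gnp14;
run;
```

SAS reports which value is being modeled, so it is worth checking that before interpreting the coefficients.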

L. MAKE AN ASCII FILE

To use a stand-alone software program, you may have to create a simple ASCII (plain text) file. I rarely do this lately because many software packages read SAS data directly.

data timss;set kaz;
file "ascii_example.txt";
put (nation) ($10.) (mat7 mat8) (8.0); /*nation is a character variable, so it needs a character format*/
run;

VII. More Procedures

M. PROC STANDARD: Standardize Values

Make Z-scores with a mean of 0 and a standard deviation of 1.

proc standard data=kaz out=kaz2 mean=0 std=1;
var mat7 mat8;
run;

/*then see what you did*/
proc print data=kaz2;
run;

Advanced technique: Standardize within groups.

/*First sort by group ID*/

proc sort data=kaz out=kaz2;

by block;

run;

/*Use by statement*/

proc standard data=kaz2 out=kaz3 mean=0 std=1;

by block;

var mat7 mat8;

run;

N. PROC RANK: Rank observations

proc rank data=kaz out=kaz2 groups=3;
/*Creates 3 groups. The new values will be 0, 1, and 2. */
var mat7 mat8;
RANKS Rmat7 Rmat8;
/*give names to the new variables*/
Run;

/*see what happened*/
proc print data=kaz2;
var mat7 Rmat7 mat8 Rmat8;
RUN;

Research Tip:

Why do we use ranks?

a. We can split the sample based on the rank, e.g., a high-SES student sample versus a low-SES student sample.

b. We can create dummy variables quickly by specifying groups=2, e.g., high-SES students receive 1 and all others 0. This split occurs at the median of the variable, which may or may not be the best strategy. An alternative is to assign 1 and 0 based on some meaningful threshold. For example, with temperature data I may split at the median if that makes sense, but I may instead use 0 degrees (the freezing point) as a meaningful point to split the data.
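
The threshold alternative can be sketched in a DATA step. The cut-off of 500 and the variable name high8 are made up purely for illustration; pick a threshold that means something in your own data.

```sas
data kaz2; set kaz;
  /* 500 is an arbitrary cut-off used only for illustration */
  if mat8 = . then high8 = .;          /* keep missing as missing */
  else if mat8 >= 500 then high8 = 1;
  else high8 = 0;
run;

proc freq data=kaz2;
  tables high8;
run;
```

The first IF line matters: SAS treats a missing value as smaller than any number, so without it a nation with a missing score would be coded 0.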

O. PROC SQL: Creating group-level mean variables

One could use PROC MEANS to derive group-level means, but I don’t recommend it, since it involves the extra step of merging the means back into the main data set. Extra steps always create room for errors. PROC SQL does it all at once.

proc sql;
create table kaz2 as
select *,
mean(mat7) as mean_mat7,
mean(mat8) as mean_mat8,
mean(gnp14) as mean_gnp
from kaz
group by block;
quit; /*PROC SQL runs each statement immediately and is ended with QUIT rather than RUN*/

proc print data=kaz2;
run;

VIII. Merging Data Sets

libname here "C:\TEMP";

/*Create two data sets A and B.*/

data A;
set kaz; /*I am assuming that you already have the data set “kaz” from running the program in Section IV of this document. */
keep nation mat7;
run;

data B;

set kaz;

keep nation mat8;

run;

/*MERGE DATA SETS*/

/*First sort them by a common ID*/

/*Here they are already sorted, so the following two lines are not really necessary*/

proc sort data=A;by nation;run;

proc sort data=B;by nation;run;

data NEW;

merge A B;

by nation;

run;

/*Confirm*/

proc print data=NEW;

run;
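
When the two data sets do not contain exactly the same nations, it helps to control which observations survive the merge. A minimal sketch using the IN= data set option follows; the flag names inA and inB are made up for illustration, and A and B are assumed to be sorted by nation as above.

```sas
data NEW;
  merge A(in=inA) B(in=inB);   /* inA and inB are temporary 0/1 flags */
  by nation;
  if inA and inB;              /* keep only nations found in both data sets */
run;
```

Changing the last condition to "if inA;" would instead keep every nation in A, with mat8 missing where B has no match.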

IX. Temporary and Permanent Data Sets

There are temporary and permanent SAS data sets. When you turn off SAS, the temporary

data will be erased. Throughout the exercise, you have seen “kaz” and “kaz2.” They are

temporary data sets.

To actually see these data, go to the Explorer (leftish side of the SAS window),

then to Libraries, and find folders in there. The default directory is called Work. (You will

also find folders that you nicknamed.) Click them to open and find data in them.

If you want to make them permanent, so they don’t disappear when you turn off

SAS, add the directory nickname in front of the new data set. For example:

Data here.abc;set kaz;
keep nation growth;
growth=mat8-mat7;
run;

You are bringing in the temporary data set “kaz” and creating a new permanent data set called abc in the directory “C:\TEMP” (nicknamed “here” by a LIBNAME statement). You are creating a variable called “growth,” and it is now in here.abc. Only nation and growth are kept in the new data set.

You can also do the opposite: bring in a permanent data set and create a temporary one.

Data xyz; set here.abc;
growth=mat8-mat7;
drop mat8 mat7;
run;

You are bringing in the permanent data set called “abc” placed in C:\TEMP and creating a new data set xyz in SAS’s default directory (Work). You created a variable called “growth,” and it is now in xyz. Mat8 and mat7 are dropped from the new data set. (This example assumes that here.abc still contains mat8 and mat7. The here.abc created above kept only nation and growth, so remove that KEEP statement first if you want to run this.)

(Of course, reading in a permanent data set and creating another permanent data set is also possible: “data here.xyz; set here.abc;”)

Research Tip:

I recommend that you make permanent data sets as infrequently as possible. Just save your syntax program and create fresh temporary data each time you start. This saves disk space, since you keep only the small syntax program. Research is also a lot easier if you have only a few programs and data sets. For example, I use this single program to extract the data for one study:

http://www.src.uchicago.edu/users/ueka/SAS/Dataextractor8.3.txt

Every time I need to work on this study, I just run this one program to reproduce the data. I don’t have to remember the naming convention and location of the data sets that I have to deal with. For this particular study, I only need the file above and one more file that actually does the analyses:

http://www.src.uchicago.edu/users/ueka/SAS/MakeFinalTables7.2.txt

If I need to make changes to my analyses, I know I just have to look into these two files. This would be impossible if I had too many files and data sets scattered all over the place, even within one directory.

HOWEVER, if your data are huge (e.g., census data), then you may be better off saving permanent data sets, so that your programs run more quickly.

END of Document
