Introduction to SAS Programming
Introduction to SAS ® by Kaz Download from www.src.uchicago.edu/users/ueka
Introduction to SAS® Version 1.4 updated 9/29/2002
by Kazuaki Uekawa, Ph.D. Visiting Scholar, The Department of Sociology, The
University of Chicago; Population Research Center at NORC; Address: 1155 E. 60th. St,
Room 340, Chicago, IL 60637
www.src.uchicago.edu/users/ueka
Copyright © 2002 By Kazuaki Uekawa All rights reserved.
Table of Contents
I. Introduction
II. How to start?
III. LIBNAME: Assigning library name
IV. Create SAS data for a practice
V. Creating New Variables
VI. Procedures
A. PROC CONTENTS: Description of Contents
B. PROC PRINT: See Data
C. PROC SORT: Sorting Observations based on a value of variable
D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
E. PROC FREQ: Get Frequencies
F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
G. PROC PLOT: Plotting Two Variables
H. PROC TIMEPLOT: Time Plot
I. PROC CORR: Correlation
J. PROC REG: OLS Regression
K. PROC LOGISTIC: Logistic Regression
L. MAKE AN ASCII FILE
VII. More Procedures
M. PROC STANDARD: Standardize Values
N. PROC RANK: Rank observations
O. PROC SQL: Creating group-level mean variables
VIII. Merging Data Sets
IX. Temporary and Permanent Data Sets
I. Introduction
I recommend SAS® over other statistical packages because:
a) ODS (Output Delivery System) allows users to save statistical results as data. A
user can create tables from the result data set within one single program (as opposed to
printing out the results on paper and using Excel to finish the tables). The table can be as
sophisticated as
http://www.src.uchicago.edu/users/ueka/SAS/proc_mixed_example1output.txt and
can be further saved in Excel format using PROC EXPORT.
b) A rich array of macro functions.
c) Email support service with quick response. [email protected]
d) Users come from many fields, including social and natural sciences, as well as
business. Thus, SAS ® programming skill can be an asset in the job market.
I discuss both ODS and macros in Introduction SAS 2, a companion document
available from the same website.
Idiosyncrasy of this document
I am writing this document on my Japanese PC, where the backslash is displayed as a yen
sign (¥). If you see ¥ in a path name, read it as a backslash (\).
U. of Chicago people can access the SAS on-line documentation on the web!
SAS On-line for version 8:
http://gsbapp2.uchicago.edu/sas/sashtml/main.htm
Note on SAS email support:
When you email SAS support with a question, you need to identify yourself as a legitimate
SAS customer. Look at the head of a log file and copy and paste the information at the
beginning of your email text.
NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2M0)
Licensed to UNIVERSITY OF XXXXX, Site XXXXX.
NOTE: This session is executing on the WIN_ME platform.
II. How to start?
1. Start SAS. You can find the shortcut by going from START > PROGRAM > The SAS
System.
2. Type in syntax in EDITOR window. Syntax is something you learn in this
document.
3. Click on the runner icon to run the program. Alternatively, you can highlight the
part of the syntax that you want to run and then click the runner to run only that
part. (The downside of using UNIX instead of Windows is that UNIX does not
let you do this selective run.)
LOG file contains messages. Watch for the words error and warning.
OUTPUT file contains output.
If you ever mistype syntax and want to undo it, press Ctrl-Z. This is the
same command that can be used with Microsoft Office products.
To cancel the run while it is happening, click on the stop icon (which
looks like “!”) right next to the runner icon.
III. LIBNAME: Assigning library name
Typing full path names for directories (e.g., C:\temp\abc\old) is tedious, so we
give nicknames to them at the beginning of a program.
libname here "C:\TEMP";
libname there "C:\";
So from now on,
here.abc means the data set named “abc” placed in the directory nicknamed “here.”
there.xyz means the data set named “xyz” placed in the directory nicknamed
“there.”
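As a minimal sketch of how a library nickname is used in practice: the step below writes a tiny data set through the nickname and reads it back. It assumes that the folder C:\TEMP exists and is writable; the data set name abc and the variable x are made up for illustration.

```sas
/* Assumes C:\TEMP exists; abc and x are hypothetical names */
libname here "C:\TEMP";

data here.abc;               /* creates the data set abc in C:\TEMP */
  x = 1;
run;

proc print data=here.abc;    /* reads it back through the nickname */
run;
```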
IV. Create SAS data for a practice
Description of Practice Data
The data come from TIMSS (Third International Mathematics and Science Study), in
which three population groups (3rd and 4th graders, 7th and 8th graders,
and high school seniors) from some 40 nations participated. I aggregated the data at the
national level. The variables are:
acro: acronym for the participating nation
nation: short name of the country
name: complete name of the country
mat8: 8th graders' average math test score
mat7: 7th graders' average math test score
GNP14: GNP per capita
prop: proportion of 8th graders in schooling
NATEXAM: a national-level exam is administered
NATSYLB: the syllabus is decided at the national level
NATTEXT: the text (textbook) is chosen at the national level
block: regional block of the nation (e.g., weuro, seasia)
libname here "C:\TEMP";
libname there "C:\";
data kaz;
input
acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM
NATSYLB NATTEXT block $;
cards;
aus  Australi Australia           498 529.63 -0.15526 84 0 1 0 ocea
aut  Austria  Austria             509 539.43 -0.29163 100 0 0 1 weuro
bfl  Belgi_FL Belgium (Fl)        558 565.18 -0.25157 100 1 1 0 weuro
bfr  Belgi_FR Belgium (Fr)        507 526.26 -0.25157 100 0 1 0 weuro
can  Canada   Canada              494 527.24 0.07184 88 0 0 0 namer
col  Colombia Colombia            369 384.76 -0.23699 62 0 1 0 samer
cyp  Cyprus   Cyprus              446 473.59 -0.41906 95 0 1 1 seuro
csk  Czech    Czech Republic      523 563.75 -0.34840 86 0 1 0 eeuro
dnk  Denmark  Denmark             465 502.29 -0.34057 100 1 0 0 weuro
fra  France   France              492 537.83 0.55791 100 0 1 0 weuro
deu  Germany  Germany             484 509.16 0.91992 100 0 0 0 weuro
grc  Greece   Greece              440 483.90 -0.32620 99 0 1 1 seuro
hkg  HongKong Hong Kong           564 588.02 -0.31638 98 1 1 1 seasia
hun  Hungary  Hungary             502 537.26 -0.37602 81 0 0 0 eeuro
isl  Iceland  Iceland             459 486.78 -0.42606 100 0 0 0 neuro
irn  Iran     Iran, Islamic Rep.  401 428.33 -0.17095 66 0 1 1 meast
irl  Ireland  Ireland             500 527.40 -0.38919 100 1 1 0 weuro
isr  Israel   Israel              . 521.59 -0.35464 87 0 1 0 meast
jpn  Japan    Japan               571 604.77 1.85543 96 0 1 0 seasia
kor  Korea    Korea               577 607.38 -0.01168 93 0 1 1 seasia
kwt  Kuwait   Kuwait              . 392.18 -0.40359 60 0 1 1 meast
lva  Latvia   Latvia (LSS)        462 493.36 -0.42319 87 0 0 0 eeuro
ltu  Lithuani Lithuania           428 477.23 -0.41785 78 1 1 1 eeuro
nld  Netherla Netherlands         516 540.99 -0.18184 93 1 0 0 weuro
nzl  NewZeala New Zealand         472 507.80 -0.38319 100 1 1 0 ocea
nor  Norway   Norway              461 503.29 -0.35450 100 0 1 1 neuro
prt  Portugal Portugal            423 454.45 -0.32588 81 0 1 0 weuro
rom  Romania  Romania             454 481.55 -0.35396 82 1 1 1 eeuro
rus  RussianF Russian Federation  501 535.47 0.12827 88 1 0 0 eeuro
sco  Scotland Scotland            463 498.46 0.48017 100 0 0 0 weuro
sgp  Singapor Singapore           601 643.30 -0.37279 84 1 1 1 seasia
slv  SlovakRe Slovak Republic     508 547.11 -0.40217 89 0 1 0 eeuro
svn  Slovenia Slovenia            498 540.80 -0.41310 85 0 1 1 eeuro
esp  Spain    Spain               448 487.35 0.03461 100 0 1 1 weuro
swe  Sweden   Sweden              477 518.64 -0.30049 99 0 1 0 neuro
che  Switzerl Switzerland         506 545.44 -0.27916 91 0 0 0 weuro
tha  Thailand Thailand            495 522.37 -0.14533 37 0 1 1 seasia
usa  USA      United States       476 499.76 5.37506 97 0 0 0 namer
;
run;
/*this prints out the data*/
proc print;
run;
Advanced Topic:
Alternatively, you can save the data above (just the data lines) as a plain text file named
kaz.txt in your C drive's temp directory. (In case you only have this document
as a hard copy, visit www.src.uchicago.edu/users/ueka for a digital version of this
document, so you can copy and paste.) Then use the program below to read in the file.
/*these two lines are not crucial in this example, but let’s just put these at the beginning of
your program*/
libname here "C:\TEMP";
libname there "C:\";
data kaz;
infile "C:\TEMP\kaz.txt" missover;
input
acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8
GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
run;
MISSOVER means that when an input line ends before all the variables have been given a
value, the remaining variables are set to missing instead of SAS moving on to the next line to
look for them. It is safe to use it.
$ means that the variable named just before it is a character variable as opposed to numeric.
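To see what MISSOVER changes, here is a small sketch that reads the same two lines with and without it. The data set names flow and miss and the variables a, b, and c are made up for illustration; the data are typed in-line so the example does not depend on any file.

```sas
/* Without MISSOVER: the first line is short, so SAS takes c from the next line */
data flow;
  input a b c;
  datalines;
1 2
3 4 5
;
run;

/* With MISSOVER: c is set to missing on the short line */
data miss;
  infile datalines missover;
  input a b c;
  datalines;
1 2
3 4 5
;
run;

proc print data=flow; run;   /* one observation: a=1 b=2 c=3 */
proc print data=miss; run;   /* two observations: 1 2 . and 3 4 5 */
```

In the first step the log should also carry a note that SAS went to a new line, which is a useful warning sign when reading raw data.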
V. Creating New Variables
Data kaz2;
set kaz;
/*ADDITION*/
var1=mat7+mat8;
/*OR*/
var2=sum(of mat7 mat8);
/*SUBTRACTION*/
var3=mat8-mat7;
/*MULTIPLICATION*/
var4=mat7*mat8;
/*DIVISION*/
var5=mat7/mat8;
/*Use brackets effectively*/
var6=1/(mat7+mat8);
/*MEAN of several variables*/
var7=mean(of mat7 mat8);
/*MAX of several variables*/
var8=max(of mat7 mat8);
/*MIN of several variables*/
var9=min(of mat7 mat8);
/*LOG: the value must be positive*/
var10=log(mat7);
/*Absolute values: this takes out negative signs*/
var11=abs(gnp14);
run;
/*TO SEE WHAT YOU DID, USE PROC PRINT*/
proc print data=kaz2;
title "Lots of manipulations: See results";
var mat7 mat8 var1 var2 var3 var4 var5
var6 var7 var8 var9 var10 var11;
run;
Advanced Topics:
How is Z=mean(of X1 X2 X3) different from Z=(X1+X2+X3)/3;?
How is Z=sum(of X1 X2 X3) different from Z=X1+X2+X3;?
Functions, such as mean(of …) or sum(of …), compute statistics from the non-missing values. They
return a value even when some of the variables in the parentheses are missing. For example,
if X1 is missing:
Z=mean(of X1 X2 X3); will return the average of X2 and X3.
In contrast,
Z=(X1+X2+X3)/3; will return a missing value, namely, “.”
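The difference can be checked with a two-line data set, one line with a missing X1 and one complete line. This is only a sketch; the data set name demo_missing and the new variable names are made up for illustration.

```sas
data demo_missing;
  input x1 x2 x3;
  z_mean_fn   = mean(of x1 x2 x3);     /* uses non-missing values only */
  z_mean_expr = (x1 + x2 + x3) / 3;    /* missing if any of x1-x3 is missing */
  z_sum_fn    = sum(of x1 x2 x3);      /* uses non-missing values only */
  z_sum_expr  = x1 + x2 + x3;          /* missing if any of x1-x3 is missing */
  datalines;
. 4 6
2 4 6
;
run;

proc print data=demo_missing;
run;
/* First line:  z_mean_fn=5, z_mean_expr=., z_sum_fn=10, z_sum_expr=.  */
/* Second line: z_mean_fn=4, z_mean_expr=4, z_sum_fn=12, z_sum_expr=12 */
```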
Read this after you study PROC REG later in the document.
When we compare several regression models (e.g., coefficients, R², goodness of fit),
we want to keep the number of observations the same across the models. Because
predictors may have different patterns of missing values, this does not happen
automatically; you have to make it happen. For example, mat7, the 7th graders'
mathematics score, has some missing cases because some nations let only their 8th
graders participate in this international test.
Use the NMISS function to create a new variable, john, that counts the missing values in each observation.
data kaz2;set kaz;
john=nmiss(of GNP14 mat8 mat7);/*this returns the number of missing values among the listed variables*/
run;
/*check what the data look like now*/
proc print data=kaz2;
var name gnp14 mat8 mat7 john;
run;
/*Apply OLS regression with cases with perfect data (no missing cases). In this way,
model 1 and model 2 will have the same number of cases, or to be more precise, the same
data.*/
proc reg data=kaz2;
where john=0; /*Run only when john=0, namely, number of missing cases is 0*/
model mat8=mat7;
model mat8=mat7 gnp14;
run;
VI. Procedures
A. PROC CONTENTS: Description of Contents
PROC CONTENTS data=kaz;
run;
Advanced topic: the variables are listed in alphabetical order. They can also be shown
by position in the data set (left to right) by adding the option “position”:
proc contents data=kaz position;
run;
I like this option because in this way you can find related variables close to each other.
B. PROC PRINT: See Data
PROC PRINT data=kaz;
VAR nation mat7 mat8 natexam; /*without this, all variables will be printed*/
run;
Advanced topic: You can selectively print observations.
/*print only when natexam=1*/
proc print data=kaz;where natexam=1;var nation mat7 mat8;run;
/*print by group units*/
proc sort data=kaz out=kaz2;by block;run;
proc print data=kaz2;by block;var nation mat7 mat8;run;
/*print only up to a certain number of observations*/
/*you want to do this when your data is big and you don't want to print every observation*/
data kaz2;set kaz;
john=_n_;
/*this creates a new variable indicating the row number of each observation*/
run;
proc print data=kaz2;where john < 5;run; /*this shows the first 4 observations*/
If you want a nicer print-out, try proc report.
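Printing only the first few observations can also be done without creating a counter variable, by using the OBS= data set option. A minimal sketch, assuming the data set kaz from Section IV already exists:

```sas
/* OBS=4 stops reading after the 4th observation */
proc print data=kaz (obs=4);
  var nation mat7 mat8;
run;
```

This avoids the extra data step, at the cost of always starting from the first observation unless FIRSTOBS= is also given.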
C. PROC SORT: Sorting Observations based on a value of variable
You will use this procedure a lot, but be careful with large data sets: sorting
consumes a lot of computation time.
PROC SORT data=kaz out=kaz2;
/*If you don't want to create a new data set, write "out=kaz" or simply omit out=*/
by mat8;
run;
Advanced topics:
proc sort data=kaz out=kaz2 nodupkey;
by block;
run;
proc print data=kaz2;run;
NODUPKEY keeps only the first observation of each block. Imagine that you have data
with individual-level variables (e.g., 100 students) and group-level
variables (e.g., 10 schools), and you want to get school-level information from
this data. The above procedure would take just the first observation of each school and
give you ten lines of data for 10 schools. Ignore the individual-level variables in the
result, however, because they belong to whichever student happened to come first.
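One way to avoid carrying meaningless individual-level values into the group-level file is to keep only the group-level variables while sorting. The sketch below uses the KEEP= data set option and assumes the data set kaz from Section IV; the output name blocks is made up for illustration.

```sas
/* Keep only the group ID, then keep one line per group */
proc sort data=kaz (keep=block) out=blocks nodupkey;
  by block;
run;

proc print data=blocks;
run;
```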
You can use more than one variable in the BY statement.
proc sort data=kaz out=kaz2;
by natexam block;
run;
/*What would the new data look like?*/
proc print data=kaz2;run;
D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
PROC MEANS data=kaz;
VAR mat7 mat8;
run;
Advanced topic: Group means.
/*Report group means*/
proc sort data=kaz out=kaz2;by block;run;
proc means data=kaz2;
by block;
var mat7 mat8;
run;
You can also use a CLASS statement instead of a BY statement. CLASS is easier
because you don't need to sort the data by the grouping variable first. One downside is
that CLASS keeps all the groups in memory at once, which can matter with very large data and many groups.
proc means data=kaz2; /*now, kaz2 does not have to be sorted by block*/
class block;
var mat7 mat8;
run;
/*Save group means*/
ods listing close; /*printing of results suppressed*/
proc means data=kaz2; /*make sure kaz2 is already sorted by group ID*/
by block;
var mat7 mat8;
ods output summary=john; /*Output Delivery System Used. See SAS manual 2*/
run;
ods listing; /*printing of results resumed*/
proc print data=john;
run;
/*Get standard errors by adding STDERR*/
/*But specifying STDERR alone would give only the standard error, so you must also list the
other statistics you want. Here: mean, N, STD, MAX, and MIN*/
PROC MEANS data=kaz mean n std max min stderr;
VAR mat7 mat8;
run;
I recommend reading the chapter on PROC MEANS in the SAS on-line documentation. It is a very versatile
procedure.
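Group means can also be saved as a data set with the OUTPUT statement of PROC MEANS, without going through ODS. This is a sketch rather than a replacement for the ODS approach above; it assumes the data set kaz from Section IV, and the names block_means, mean_mat7, and mean_mat8 are made up for illustration.

```sas
/* NWAY keeps only the per-group lines; NOPRINT suppresses the printed table */
proc means data=kaz nway noprint;
  class block;
  var mat7 mat8;
  output out=block_means mean=mean_mat7 mean_mat8;
run;

proc print data=block_means;
run;
```

The output data set also carries the automatic variables _TYPE_ and _FREQ_, the latter being the number of observations in each group.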
E. PROC FREQ: Get Frequencies
PROC FREQ data=kaz;
Tables natexam ;
Run;
Advanced topics:
Get cross tabulation:
PROC FREQ data=kaz;
tables natexam*block;
run;
F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
PROC UNIVARIATE PLOT DATA=KAZ;
var mat7 mat8 gnp14;
run;
Advanced topic: Get a box-and-whisker plot by subgroups, so you can compare group values. But
the output is text-based and pretty ugly.
proc sort data=kaz out=kaz2;
by block;
run;
PROC UNIVARIATE data=kaz2 plot;
by block;
var mat8;
run;
G. PROC PLOT: Plotting Two Variables
This produces a text-based graph. Use PROC GPLOT for nicer graphics.
PROC PLOT data=KAZ;
Plot mat7*mat8;
run;
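For comparison, the same scatter plot with PROC GPLOT might look like the sketch below. It assumes that SAS/GRAPH is installed and licensed at your site and that the data set kaz from Section IV exists.

```sas
symbol1 value=dot;       /* draw each nation as a dot */

proc gplot data=kaz;
  plot mat7*mat8;        /* vertical axis: mat7, horizontal axis: mat8 */
run;
quit;
```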
H. PROC TIMEPLOT: Time Plot
proc timeplot data=KAZ;
plot mat8= '*';
id NAME;
run;
Advanced topics:
/*Sort first by the variable of your interest and see it*/
/*you will be seeing a ranking of nations*/
proc sort data=kaz out=kaz2;
by mat8;
run;
proc timeplot data=KAZ2;
plot mat8= '*';
id NAME;
run;
Add bells and whistles. Below, I am asking, “Does GNP have anything to do with test scores?”
/*First sort by GNP*/
proc sort data=kaz out=kaz2;
by gnp14;
run;
proc timeplot data=KAZ2;
title "TIMSS countries sorted by GNP";
plot mat7 mat8/overlay hiloc npp ;
id NAME block gnp14 prop;
run;
I. PROC CORR: Correlation
PROC CORR DATA=KAZ;
VAR mat7 mat8 gnp14;
Run;
J. PROC REG: OLS Regression
PROC REG DATA=KAZ;
MODEL mat8=natexam gnp14;
Run;
Advanced Topic:
See www.src.uchicago.edu/users/ueka for the creation of an OLS table using ODS. Also see
the PROC IML instructions on the same page to learn how OLS estimates its coefficients.
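As a rough illustration of the idea (not necessarily what the web page shows), the textbook formula b = (X'X)^-1 X'y can be computed by hand in PROC IML and compared with the PROC REG estimates above. The sketch assumes that SAS/IML is available, that the data set kaz from Section IV exists, and that mat8, natexam, and gnp14 have no missing values.

```sas
/* Sketch only: requires SAS/IML */
proc iml;
  use kaz;
  read all var {natexam gnp14} into x;   /* predictors */
  read all var {mat8} into y;            /* outcome */
  close kaz;
  n = nrow(x);
  x = j(n, 1, 1) || x;                   /* add a column of 1s for the intercept */
  b = inv(t(x) * x) * t(x) * y;          /* b = (X'X)^-1 X'y */
  print b;                               /* intercept, natexam, gnp14 */
quit;
```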
K. PROC LOGISTIC: Logistic Regression
/*I don’t know if natexam can be considered a dependent variable, but for the sake of
demonstration*/
PROC logistic data=kaz;
Model natexam=gnp14;
run;
L. MAKE AN ASCII FILE
To use a stand-alone software program, you may have to create a simple ASCII file. But I
rarely do this lately because many software programs read SAS data directly.
data timss;set kaz;
file "ascii_example.txt";
put (nation) ($10.) (mat7 mat8) (8.2); /*$10. for the character variable; 8.2 keeps two decimals*/
run;
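If the receiving program accepts comma-separated files, PROC EXPORT is another way to write a text file. A minimal sketch, assuming the data set kaz from Section IV and a writable folder C:\TEMP; the file name kaz.csv is made up for illustration.

```sas
/* Writes all variables of kaz, with a header line of variable names */
proc export data=kaz
  outfile="C:\TEMP\kaz.csv"
  dbms=csv
  replace;               /* overwrite the file if it already exists */
run;
```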
VII. More Procedures
M. PROC STANDARD: Standardize Values
Create z-scores with a mean of 0 and a standard deviation of 1.
proc standard data=kaz out=kaz2 mean=0 std=1;
var mat7 mat8;
run;
/*then see what you did*/
proc print data=kaz2;
run;
Advanced technique: Standardize within groups.
/*First sort by group ID*/
proc sort data=kaz out=kaz2;
by block;
run;
/*Use by statement*/
proc standard data=kaz2 out=kaz3 mean=0 std=1;
by block;
var mat7 mat8;
run;
N. PROC RANK: Rank observations
proc rank data=kaz out=kaz2 groups=3;
/*Creates 3 groups. The new values will be 0, 1, and 2. */
var mat7 mat8;
RANKS Rmat7 Rmat8;
/*give names to the new variables*/
Run;
/*see what happened*/
proc print data=kaz2;
var mat7 Rmat7 mat8 Rmat8;
RUN;
Research Tip:
Why do we use rank?
a. We can split the sample based on the rank, e.g., a high SES student sample versus a low
SES student sample.
b. We can create dummy variables quickly by specifying groups=2, e.g., high SES students
will receive 1; else 0. This grouping occurs at the median point of a variable, which may or
may not be the best strategy. An alternative is to assign 1 and 0 based on some
meaningful threshold. For example, if I have temperature data, I may use the median point to
split the data if it makes sense, but I may instead use 0 degrees (the freezing point) as a
meaningful point to split the data.
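The threshold approach can be sketched in a short data step. The cut-off of 0 on gnp14 is chosen purely for illustration, and the names kaz_dummy and high_gnp are made up; the sketch assumes the data set kaz from Section IV.

```sas
data kaz_dummy;
  set kaz;
  /* A comparison such as (gnp14 > 0) returns 1 if true and 0 if false.    */
  /* Missing values compare as smaller than any number, so guard for them. */
  if gnp14 = . then high_gnp = .;
  else high_gnp = (gnp14 > 0);
run;

proc freq data=kaz_dummy;
  tables high_gnp;
run;
```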
O. PROC SQL: Creating group-level mean variables
One could use PROC MEANS to derive group-level means. I don't recommend this since it
involves the extra steps of merging the mean data back into the main data set. Extra steps
always create room for errors. PROC SQL does it all at once.
proc sql;
create table kaz2 as
select *,
mean(mat7) as mean_mat7,
mean(mat8) as mean_mat8,
mean(gnp14) as mean_gnp
from kaz
group by block;
quit; /*PROC SQL ends with QUIT rather than RUN*/
proc print data=kaz2;
run;
VIII. Merging Data Sets
libname here "C:\TEMP";
/*Create two data sets A and B.*/
data A;
set kaz; /*I am assuming that you already have the data set "kaz" from running the program
in Section IV of this document. */
keep nation mat7;
run;
data B;
set kaz;
keep nation mat8;
run;
/*MERGE DATA SETS*/
/*First sort them by a common ID*/
/*Here they are already sorted, so the following two lines are not really necessary*/
proc sort data=A;by nation;run;
proc sort data=B;by nation;run;
data NEW;
merge A B;
by nation;
run;
/*Confirm*/
proc print data=NEW;
run;
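When the two data sets may not contain exactly the same nations, the IN= data set option lets you check where each observation came from. A minimal sketch, assuming A and B were created and sorted by nation as above; the names new_checked, inA, and inB are made up for illustration.

```sas
data new_checked;
  merge A (in=inA) B (in=inB);
  by nation;
  if inA and inB;   /* keep only nations found in both A and B */
run;

proc print data=new_checked;
run;
```

Here A and B come from the same source, so nothing is dropped; with real data, comparing the number of observations before and after is a quick check on the merge.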
IX. Temporary and Permanent Data Sets
There are temporary and permanent SAS data sets. When you turn off SAS, temporary
data sets are erased. Throughout the exercises, you have seen “kaz” and “kaz2.” They are
temporary data sets.
To actually see these data, go to the Explorer (leftish side of the SAS window),
then to Libraries, and find folders in there. The default directory is called Work. (You will
also find folders that you nicknamed.) Click them to open and find data in them.
If you want to make them permanent, so they don’t disappear when you turn off
SAS, add the directory nickname in front of the new data set. For example:
Data here.abc;set kaz;
keep nation growth;
growth=mat8-mat7;
run;
You are bringing in a temporary data set “kaz” and creating a new permanent data set
called abc in the directory “C:\TEMP” (nicknamed “here” by a LIBNAME statement). You are
creating a variable called “growth” and it is now in here.abc. Only nation and growth are
kept in the new data set.
You can also do the opposite: bring in a permanent data set and create a temporary
data set.
Data xyz; set here.abc;
growth=mat8-mat7;
drop mat8 mat7;
run;
You are bringing in a permanent data set called “abc” placed in C:\TEMP and creating a new
temporary data set xyz in SAS's default directory (Work). You created a variable called
“growth” and it is now in xyz. Mat8 and mat7 are dropped from the new data set. (This
assumes that here.abc contains mat8 and mat7; the here.abc created in the earlier example
keeps only nation and growth.)
(Of course, reading in a permanent data set and creating a permanent data set is also
possible with “data here.xyz; set here.xyz;”)
Research Tip:
I recommend that you make permanent data sets as infrequently as possible. Just save your
syntax program and create fresh temporary data each time you start; this saves disk space, since
you only have to keep your small syntax program. Research is also a lot easier if you
have only a few programs and data sets. Here is a program that I use for one of my studies:
http://www.src.uchicago.edu/users/ueka/SAS/Dataextractor8.3.txt
Every time I need to work on this study, I can just run this one single program to reproduce
the data. I don't have to remember the naming convention and location of the data sets that I
have to deal with.
For this particular study, I only need to deal with the file above and one more file
that actually does the analyses.
http://www.src.uchicago.edu/users/ueka/SAS/MakeFinalTables7.2.txt
If I need to make changes to my analyses, I know I just have to look into these two
files. This would be impossible if I had too many files and data sets scattered all over the
place, even within one directory.
HOWEVER, if your data is huge (e.g., census data), then you may be better off
saving permanent data sets, so that you do not have to recreate them each time.
END of Document