sas ® proc sql or vanilla flavor cecilia mauldin january 31 2007
TRANSCRIPT
What is Special in SAS PROC SQL?
• Things you already know from data step programming
• Some differences between data step and proc sql syntax
• Order of execution• How good is my code?
DATA dogs; LENGTH name breed $20 dob $10; INPUT name breed dob;CARDS;Nala Airdale 11jan1990Nina AMBR 22apr1997Jadzia Pointer 15feb1999Rocco MMBNR
08may1980Bruja Rottweiler
15aug2000Chacal Rottweiler 29jan1990Chata Boxer UNKTyson Rottweiler
15mar2000;RUN;
DATA parents; LENGTH parent name $20; INPUT parent name;CARDS;Ina NalaIna ChataIna BrujaIna FidongaChavo RoccoCecilia NinaCecilia Jadzia;RUN;
The Data
What is Special in SAS PROC SQL?
• Things you already know from data step programming
• Some differences between data step and Proc SQL syntax
• Order of execution• How good is my code?
Things You Already Know from Data Step Programming• Our old friends: where, keep,
drop and rename• In data step
– How do we use them?– Why do we use them?
• In SAS PROC SQL– How do we use them?– Why do we use them?
Things You Already Know from Data Step Programming
DATA new_data (KEEP= RENAME=( ) DROP= WHERE=( ));
SET old_data (KEEP= RENAME=() DROP= WHERE=( ));
SAS data steps … ;RUN;
It is written from the PDV when specified on an output data set.
Is applied when the dataset OLD_DATA is read into the PDV.
Data Step: Why Do We Use Our Old Friends?
No need to keep observations and/or variables that are not used in the output data set.
No need to carry observations or variables that are not going to be used in the PDV.
DATA new_data (KEEP= RENAME= ( ) DROP= WHERE=( ));
SET old_data (KEEP= RENAME=() DROP= WHERE=( ));
SAS data steps … ;RUN;
PROC SQL SQL_OPTIONS; CREATE TABLE example (KEEP= …. RENAME=( ….) DROP= WHERE=( ….))AS SELECT * FROM data_set (KEEP= var1 var2…. RENAME=(var1 = newvar1 var2 = newvar2 ….) DROP= WHERE=(….));QUIT;
Proc SQL: How Do We Use Them?
Typing Convenience
• Your dataset has 50 variables, you need to use 45– Do you want to type 45 variables?– Do you carry 5 variables that you don’t
need?
Typing Convenience cont’d
SELECT * FROM data_set (DROP=var46 … var50);
SELECT var1, var2, ∙∙∙, var45 FROM data_set;
Typing Convenience cont’d
SELECT *FROM dogs (DROP=dob RENAME=(name=firstname));
SELECT name AS firstname, breedFROM dogs;
First name Breed Nala AirdaleNina AMBRJadzia PointerRocco MMBNRBruja RottweilerChacal RottweilerChata BoxerTyson Rottweiler
Joins
• Avoid ‘warning’ messages when creating a data set because two variables in two different datasets have the same name
• Limit the records that come in a join
Joins cont’d
• Avoid WARNING messages when 2 variables have the same name
CREATE TABLE final AS SELECT a.*, b.* FROM data_set1 a, data_set2 b WHERE a.var1=b.var1 AND a.var2=b.var2 AND a.var3=b.var3;
Joins cont’d
• Avoid WARNING messages when 2 variables have the same name
CREATE TABLE family AS SELECT a.*, b.* FROM dogs a, parent b WHERE a.name=b.name ;
WARNING: Variable Name already exists on file WORK.FAMILY.NOTE: Table WORK.FAMILY created, with 6 rows and 4 columns.
Joins cont’d
• Avoid WARNING messages when 2 variables have the same name
CREATE TABLE final (DROP=VAR1B VAR2B VAR3B)
AS SELECT a.*, b.* FROM data_set1 A, data_set2 (RENAME=(var1=var1b var2=var2b var3=var3b)) B WHERE a.var1=b.var1b AND a.var2=b.var2b AND a.var3=b.var3b;
Joins cont’d
CREATE TABLE family (DROP=nickname)
AS SELECT a.*, b.*
FROM dogs a, parents (RENAME=(name=nickname)) b
WHERE a.name=b.nickname ; Name breed DOB Parent
Nala Airdale 11JAN1990 InaNina AMBR 22APR2001 CeciliaJadzia Pointer 15FEB1999 CeciliaRocco MMBNR 08MAY2001 ChavoBruja Rottweiler 15AUG2000 InaChata Boxer . Ina
Joins cont’d
SELECT a.*, b.*
FROM dogs a LEFT JOIN parents b
ON a.name=b.name
AND a.breed='Rottweiler' ; Name Breed DOB Parent name
Bruja Rottweiler 15AUG2000 Ina BrujaChacal Rottweiler 29JAN1990Chata Boxer .Jadzia Pointer 15FEB1999Nala Airdale 11JAN1990Nina AMBR 22APR2001Rocco MMBNR 08MAY2001Tyson Rottweiler 15MAR2000
Do we get a warning?
Joins cont’d
• Limit records
CREATE TABLE final AS SELECT a.*, b.*
FROM data_set1 (WHERE=(var1=’keep’)) a
LEFT JOIN data_set2 b
ON a.var1=B.var1b
AND a.var2=B.var2b
AND a.var3=B.var3b ;
Joins cont’d
SELECT a.*, b.*FROM dogs (WHERE=(breed='Rottweiler'))a LEFT JOIN parents b ON a.name=b.name ;
Name breed DOB Parent nameBruja Rottweiler 15AUG2000 Ina BrujaChacal Rottweiler 29JAN1990Tyson Rottweiler 15MAR2000
What is special in SAS PROC SQL?
• Some differences between data step and SQL syntax
•Order of execution
•How good is my code?
•Things you already know from data step programming
Some Differences Between Data Step and SQL Syntax
• Reserved word• Format modifiers• Descending sorting
Reserved Word
• CASE– Why to use it
• Is there something similar in data step?
– How to use it– What can be the problem– How to avoid a problem
CASE: Why to Use It
• There is no IF or SELECT statements in SQL• Is similar to the SELECT clause in data step
CASE: Why to Use It
DATA final;SET parents;SELECT (name); WHEN (‘Nina') education =‘Novice'; WHEN (‘Nala') education =‘None'; OTHERWISE education =‘There is hope’;END;
CASE: How to Use It
SELECT var1, var2, CASE [vark] WHEN (“value1”) THEN [‘constant value’ |
function] WHEN (‘value3’) THEN [‘other constant’ |
function] … ELSE [‘last value’ | function ] END AS newvar FROM data_set ;
CASE: How to Use It cont’d
SELECT name, CASE name WHEN ('Nala') THEN 'None' WHEN ('Nina') THEN 'Novice' ELSE 'There is hope' END AS education LENGTH = 15 FORMAT
$15., parent FROM parent ;
Name Education ParentNala None Ina Chata There is hope Ina Bruja There is hope Ina Fidonga There is hope Ina Rocco There is hope Chavo Nina Novice Cecilia Jadzia There is hope Cecilia
CASE: How to Use It cont’d
SELECT name,breed, CASE WHEN INDEX (breed,'MB') THEN PUT(breed,
$UKC.) ELSE breed END AS lbreed label "Long Breed" FROM dogs;
Name Breed Long breed Bruja Rottweiler Rottweiler Chacal Rottweiler Rottweiler Chata Boxer Boxer Jadzia Pointer Pointer Nala Airdale Airdale Nina AMBR American Mixed Breed Registry Rocco MMBNR Mexican Mixed Breed Non-RegisteredTyson Rottweiler Rottweiler
CASE: What Can Be the Problem
DATA new; LENGTH case $20; SET parents; IF parent ='Ina' THEN case='None'; ELSE IF parent ='Chavo' THEN case='Spoiled'; ELSE case='Work';RUN;
PROC PRINT DATA=new;RUN;
Obs Case Parent Name 1 None Ina Nala 2 None Ina Chata 3 None Ina Bruja 4 None Ina Fidonga 5 Spoiled Chavo Rocco 6 Work Cecilia Nina 7 Work Cecilia Jadzia
CASE: What Can Be the Problem cont’d
SELECT * FROM new WHERE
parent='Ina';
Case Parent NameNone Ina Nala None Ina Chata None Ina Bruja None Ina Fidonga
CASE: What Can Be the Problem cont’d
SELECT * FROM new WHERE
case='None';
select * from new WHERE case='None';+ 22+ 76+ERROR 22-322: Syntax error, expecting one of the following: + a name, a quoted string, a numeric constant, + a datetime constant, a missing value, (, +, -, + BTRIM, CALCULATED, CASE, EXISTS, INPUT, LOWER, + NOT, PUT, SELECT, SUBSTRING, TRANSLATE, UPPER, + USER, WHEN, ^, ~. +ERROR 76-322: Syntax error, statement will be ignored.
Some Differences Between Data Step and SQL Syntax
• Reserved word
• Format modifiers
• Descending sorting
Format Modifiers
• What are format modifiers?• Format modifiers in data step• Format modifiers in proc SQL
Format Modifiers in Data StepDATA ndogs; SET dogs; idate=INPUT(dob,date9.); IF _ERROR_=1 THEN PUT _ALL_;RUN;
NOTE: Invalid argument to function INPUT at line 196 column 9.name=Chata breed=Boxer dob=UNK idate=. _ERROR_=1 _N_=3NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values. Each place is given by: (Number of times) at (Line):(Column). 1 at 196:9
Format Modifiers in Data Step cont’d
DATA ndogs; SET dogs; idate=INPUT(dob, ?? date9.); IF _ERROR_=1 THEN PUT _ALL_;RUN;
NOTE: There were 8 observations read from the data set WORK.DOGS.NOTE: The data set WORK.NDOGS has 8 observations and 4 variables.
Format Modifiers in PROC SQL
+ERROR: Invalid day value NOTE: Table WORK.NDOGS created, with 8 rows and 3 columns.
CREATE TABLE ndogs AS SELECT name, dob,INPUT(dob, date9.) AS idate FORMAT mmddyy8. FROM dogs;
Format Modifiers in PROC SQL cont’d
CREATE TABLE ndogs AS SELECT name, dob,
INPUT(dob,? date9.) AS idate FORMAT mddyy8. FROM dogs;
NOTE: Table WORK.NDOGS created, with 8 rows and 3 columns.
Some Differences Between Data Step and SQL Syntax
• Reserved word• Format modifiers
• Descending sorting
Descending Sorting
PROC SORT DATA = ndogs; BY DESCENDING idate;RUN;
SELECT * FROM ndogs ORDER BY idate descending;
•PROC SQL
• Data step and PROCS
Descending Sorting cont’d
• A little lagniappe (pronounced nańap)SELECT name, dob, INPUT( dob,? date9.) AS idate FORMAT
date9. FROM dogs ORDER BY 3 DESC;
Name DOB IdateBruja 15Aug2000 15AUG2000Tyson 15Mar2000 15MAR2000Jadzia 15Feb1999 15FEB1999Nina 22APR1997 22APR1997Chacal 29Jan1990 29JAN1990Nala 11jan1990 11JAN1990Rocco 08may1980 08MAY1980Chata UNK .
What is Special in SAS PROC SQL?
• The program data vector (PDV)• Some differences between data
step and SQL syntax
• Order of execution
• How good is my code?
The Order of Execution
SELECT VAR1, VAR2, … FROM data_set [ FULL | LEFT | RIGHT | INNER ] JOIN
TABLE2
ON condition 1 AND condition 2
WHERE condition k AND condition l
GROUP BY VAR1, VAR….
HAVING condition p AND condition qORDER BY var1, var2… ;
The Order of Execution cont’dPROC SQL NUMBER;SELECT a.breed,
INPUT(a.dob,? date9.) AS ndob format date9.,
COALESE(a.name, b.name) AS name, b.parentFROM dogs a FULL JOIN parents bON a.name=b.name;
Breed NDOB Name Parent 1 Rottweiler 15AUG2000
Bruja Ina 2 Rottweiler 29JAN1990
Chacal 3 Boxer Chata Ina 4 . Fidonga Ina 5 Pointer 15FEB1999 Jadzia Cecilia 6 Airdale 11JAN1990 Nala Ina 7 AMBR 22APR1997 Nina Cecilia 8 MMBNR 08MAY1980 Rocco Chavo
The Order of Execution cont’dSELECT a.breed, INPUT( a.dob, ? date9.) AS ndob FORMAT date9., COALESCE (a.name, b.name) AS name, b.parentFROM dogs a FULL JOIN parents bON a.name=b.nameAND breed='Rottweiler';
Breed NDOB Name Parent 1 Rottweiler 15AUG2000
Bruja Ina 2 Rottweiler 29JAN1990
Chacal 3 Boxer Chata Ina 4 . Fidonga Ina 5 Pointer 15FEB1999 Jadzia Cecilia 6 Airdale 11JAN1990 Nala Ina 7 AMBR 22APR1997 Nina Cecilia 8 MMBNR 08MAY1980 Rocco Chavo
The Order of Execution cont’d
SELECT a.breed,b.parent,COALESCE ( a.name, b.name) FROM dogs a FULL JOIN parent bON a.name=b.nameWHERE breed='Rottweiler';
SELECT a.breed, b.parent,COALESCE (b.name, a.name)FROM dogs a FULL JOIN parent bON a.breed='Rottweiler'WHERE a.name=b.name;
Breed Parent Rottweiler Ina Bruja Rottweiler Chacal Rottweiler Tyson
Breed Parent Rottweiler Ina Bruja
The Order of Execution cont’d
SELECT a.breed, INPUT(a.dob, ? date9.) AS ndob format date9.,COALESCE(a.name, b.name) AS name,b.parentFROM dogs a FULL JOIN parent bON a.name=b.nameWHERE breed='Rottweiler'HAVING ndob GT '01jan2000'd;
Breed NDOB Name ParentRottweiler 15AUG2000 Bruja Ina Rottweiler 15MAR2000 Tyson
How Good is My Code?
• Does the options used in data step make my code faster?
• Is speed important?
How Good is My Code? cont’d
• Do data step options make my code faster?
data all; merge dogs parent; by name; do i=1 to 100; do j=1 to 100; output; end; end;run;
NOTE: There were 8 observations read from the data set WORK.DOGS.NOTE: There were 7 observations read from the data set WORK.PARENT.NOTE: The data set WORK.ALL has 891000 observations and 6 variables.
How Good is My Code? cont’d
• Does the statements used to control the size of my input data set make my code faster?
proc sql stimer;
NOTE: SQL statement used: real time 0.01 seconds cpu time 0.02 seconds
How Good is My Code? cont’d
NOTE: SQL statement used: real time 10.82 seconds cpu time 10.50 seconds
SELECT * FROM all WHERE breed='Rottweiler';
SELECT * FROMall (WHERE=(breed='Rottweiler'));
NOTE: SQL statement used: real time 10.85 seconds cpu time 10.61 seconds
• Does the statements used to control the size of my input data set make my code faster?
SELECT DISTINCT name, breed, dob FROM all WHERE breed='Rottweiler';
SELECT DISTINCT name, breed, dob FROM all ( WHERE=(breed='Rottweiler'));
SELECT DISTINCT name, breed, dobFROM all (WHERE=(breed='Rottweiler') KEEP=name breed dob);
• Does the statements used to control the size of my input data set make my code faster?
real time 5.89 secondscpu time 4.30 seconds
real time 9.42 secondscpu time 4.57 seconds
real time 4.90 secondscpu time 4.35 seconds
How Good is My Code? cont’d
Conclusions
• This presentation showed examples on how and when to use statements that we already know– KEEP– RENAME– DROP– WHERE (things you already know)
Conclusions
• Possible language problems (line mines)
– Reserved word (CASE)– Format modifiers– Descending order
(things you think you already know)
Conclusions cont’d
• Be careful with the order of the conditions
• The order of the conditions affect the result
Conclusions cont’d
• When you use SQL, consider if you need all your variables and observations
• If you don’t need them consider:– It there is a significant difference in
computer use– Are you a good typist?
Conclusions cont’d
• You can have plain vanilla SAS PROC SQL or you can SASsy it up with data step statements
Acknowledgements
• PPD management – Kim Sturgen and Bonnie Duncan– Andy Barnett– Darrel Edgley– Neil Howard, Larry Sleeper and Craig
Mauldin– For their time, although if there are
mistakes, they are all mine; I made modifications and additions after they read the paper
Contact Information
Angelina Cecilia Mauldin
Abraxis BioScience4505 Emperor Blvd. #400Durham, NC 27703919 433 [email protected]