sas ® proc sql or vanilla flavor cecilia mauldin january 31 2007

59
SAS ® PROC SQL or Vanilla Flavor Cecilia Mauldin January 31 2007

Upload: bernard-miller

Post on 31-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

SAS® PROC SQL or Vanilla Flavor

Cecilia MauldinJanuary 31 2007

What is Special in SAS PROC SQL?

• Things you already know from data step programming

• Some differences between data step and proc sql syntax

• Order of execution• How good is my code?

DATA dogs; LENGTH name breed $20 dob $10; INPUT name breed dob;CARDS;Nala Airdale 11jan1990Nina AMBR 22apr1997Jadzia Pointer 15feb1999Rocco MMBNR

08may1980Bruja Rottweiler

15aug2000Chacal Rottweiler 29jan1990Chata Boxer UNKTyson Rottweiler

15mar2000;RUN;

DATA parents; LENGTH parent name $20; INPUT parent name;CARDS;Ina NalaIna ChataIna BrujaIna FidongaChavo RoccoCecilia NinaCecilia Jadzia;RUN;

The Data

What is Special in SAS PROC SQL?

• Things you already know from data step programming

• Some differences between data step and Proc SQL syntax

• Order of execution• How good is my code?

Things You Already Know from Data Step Programming• Our old friends: where, keep,

drop and rename• In data step

– How do we use them?– Why do we use them?

• In SAS PROC SQL– How do we use them?– Why do we use them?

Things You Already Know from Data Step Programming

DATA new_data (KEEP= RENAME=( ) DROP= WHERE=( ));

SET old_data (KEEP= RENAME=() DROP= WHERE=( ));

SAS data steps … ;RUN;

It is written from the PDV when specified on an output data set.

Is applied when the dataset OLD_DATA is read into the PDV.

Data Step: Why Do We Use Our Old Friends?

No need to keep observations and/or variables that are not used in the output data set.

No need to carry observations or variables that are not going to be used in the PDV.

DATA new_data (KEEP= RENAME= ( ) DROP= WHERE=( ));

SET old_data (KEEP= RENAME=() DROP= WHERE=( ));

SAS data steps … ;RUN;

PROC SQL SQL_OPTIONS; CREATE TABLE example (KEEP= …. RENAME=( ….) DROP= WHERE=( ….))AS SELECT * FROM data_set (KEEP= var1 var2…. RENAME=(var1 = newvar1 var2 = newvar2 ….) DROP= WHERE=(….));QUIT;

Proc SQL: How Do We Use Them?

Why Do We Use Them?

• Typing convenience• Right, left and full joins

Typing Convenience

• Your dataset has 50 variables, you need to use 45– Do you want to type 45 variables?– Do you carry 5 variables that you don’t

need?

Typing Convenience cont’d

SELECT * FROM data_set (DROP=var46 … var50);

SELECT var1, var2, ∙∙∙, var45 FROM data_set;

Typing Convenience cont’d

SELECT *FROM dogs (DROP=dob RENAME=(name=firstname));

SELECT name AS firstname, breedFROM dogs;

First name Breed Nala AirdaleNina AMBRJadzia PointerRocco MMBNRBruja RottweilerChacal RottweilerChata BoxerTyson Rottweiler

Joins

• Avoid ‘warning’ messages when creating a data set because two variables in two different datasets have the same name

• Limit the records that come in a join

Joins cont’d

• Avoid WARNING messages when 2 variables have the same name

CREATE TABLE final AS SELECT a.*, b.* FROM data_set1 a, data_set2 b WHERE a.var1=b.var1 AND a.var2=b.var2 AND a.var3=b.var3;

Joins cont’d

• Avoid WARNING messages when 2 variables have the same name

CREATE TABLE family AS SELECT a.*, b.* FROM dogs a, parent b WHERE a.name=b.name ;

WARNING: Variable Name already exists on file WORK.FAMILY.NOTE: Table WORK.FAMILY created, with 6 rows and 4 columns.

Joins cont’d

• Avoid WARNING messages when 2 variables have the same name

CREATE TABLE final (DROP=VAR1B VAR2B VAR3B)

AS SELECT a.*, b.* FROM data_set1 A, data_set2 (RENAME=(var1=var1b var2=var2b var3=var3b)) B WHERE a.var1=b.var1b AND a.var2=b.var2b AND a.var3=b.var3b;

Joins cont’d

CREATE TABLE family (DROP=nickname)

AS SELECT a.*, b.*

FROM dogs a, parents (RENAME=(name=nickname)) b

WHERE a.name=b.nickname ; Name breed DOB Parent

Nala Airdale 11JAN1990 InaNina AMBR 22APR2001 CeciliaJadzia Pointer 15FEB1999 CeciliaRocco MMBNR 08MAY2001 ChavoBruja Rottweiler 15AUG2000 InaChata Boxer . Ina

Joins cont’d

SELECT a.*, b.*

FROM dogs a LEFT JOIN parents b

ON a.name=b.name

AND a.breed='Rottweiler' ; Name Breed DOB Parent name

Bruja Rottweiler 15AUG2000 Ina BrujaChacal Rottweiler 29JAN1990Chata Boxer .Jadzia Pointer 15FEB1999Nala Airdale 11JAN1990Nina AMBR 22APR2001Rocco MMBNR 08MAY2001Tyson Rottweiler 15MAR2000

Do we get a warning?

Joins cont’d

• Limit records

CREATE TABLE final AS SELECT a.*, b.*

FROM data_set1 (WHERE=(var1=’keep’)) a

LEFT JOIN data_set2 b

ON a.var1=B.var1b

AND a.var2=B.var2b

AND a.var3=B.var3b ;

Joins cont’d

SELECT a.*, b.*FROM dogs (WHERE=(breed='Rottweiler'))a LEFT JOIN parents b ON a.name=b.name ;

Name breed DOB Parent nameBruja Rottweiler 15AUG2000 Ina BrujaChacal Rottweiler 29JAN1990Tyson Rottweiler 15MAR2000

What is special in SAS PROC SQL?

• Some differences between data step and SQL syntax

•Order of execution

•How good is my code?

•Things you already know from data step programming

Some Differences Between Data Step and SQL Syntax

• Reserved word• Format modifiers• Descending sorting

Reserved Word

• CASE– Why to use it

• Is there something similar in data step?

– How to use it– What can be the problem– How to avoid a problem

CASE: Why to Use It

• There is no IF or SELECT statements in SQL• Is similar to the SELECT clause in data step

CASE: Why to Use It

DATA final;SET parents;SELECT (name); WHEN (‘Nina') education =‘Novice'; WHEN (‘Nala') education =‘None'; OTHERWISE education =‘There is hope’;END;

CASE: How to Use It

SELECT var1, var2, CASE [vark] WHEN (“value1”) THEN [‘constant value’ |

function] WHEN (‘value3’) THEN [‘other constant’ |

function] … ELSE [‘last value’ | function ] END AS newvar FROM data_set ;

CASE: How to Use It cont’d

SELECT name, CASE name WHEN ('Nala') THEN 'None' WHEN ('Nina') THEN 'Novice' ELSE 'There is hope' END AS education LENGTH = 15 FORMAT

$15., parent FROM parent ;

Name Education ParentNala None Ina Chata There is hope Ina Bruja There is hope Ina Fidonga There is hope Ina Rocco There is hope Chavo Nina Novice Cecilia Jadzia There is hope Cecilia

CASE: How to Use It cont’d

SELECT name,breed, CASE WHEN INDEX (breed,'MB') THEN PUT(breed,

$UKC.) ELSE breed END AS lbreed label "Long Breed" FROM dogs;

Name Breed Long breed Bruja Rottweiler Rottweiler Chacal Rottweiler Rottweiler Chata Boxer Boxer Jadzia Pointer Pointer Nala Airdale Airdale Nina AMBR American Mixed Breed Registry Rocco MMBNR Mexican Mixed Breed Non-RegisteredTyson Rottweiler Rottweiler

CASE: What Can Be the Problem

DATA new; LENGTH case $20; SET parents; IF parent ='Ina' THEN case='None'; ELSE IF parent ='Chavo' THEN case='Spoiled'; ELSE case='Work';RUN;

PROC PRINT DATA=new;RUN;

Obs Case Parent Name 1 None Ina Nala 2 None Ina Chata 3 None Ina Bruja 4 None Ina Fidonga 5 Spoiled Chavo Rocco 6 Work Cecilia Nina 7 Work Cecilia Jadzia

CASE: What Can Be the Problem cont’d

SELECT * FROM new WHERE

parent='Ina';

Case Parent NameNone Ina Nala None Ina Chata None Ina Bruja None Ina Fidonga

CASE: What Can Be the Problem cont’d

SELECT * FROM new WHERE

case='None';

select * from new WHERE case='None';+ 22+ 76+ERROR 22-322: Syntax error, expecting one of the following: + a name, a quoted string, a numeric constant, + a datetime constant, a missing value, (, +, -, + BTRIM, CALCULATED, CASE, EXISTS, INPUT, LOWER, + NOT, PUT, SELECT, SUBSTRING, TRANSLATE, UPPER, + USER, WHEN, ^, ~. +ERROR 76-322: Syntax error, statement will be ignored.

CASE: How to Avoid a Problem

SELECT * FROM new

(RENAME=(case=education)) WHERE education='None';

Some Differences Between Data Step and SQL Syntax

• Reserved word

• Format modifiers

• Descending sorting

Format Modifiers

• What are format modifiers?• Format modifiers in data step• Format modifiers in proc SQL

Format Modifiers in Data StepDATA ndogs; SET dogs; idate=INPUT(dob,date9.); IF _ERROR_=1 THEN PUT _ALL_;RUN;

NOTE: Invalid argument to function INPUT at line 196 column 9.name=Chata breed=Boxer dob=UNK idate=. _ERROR_=1 _N_=3NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values. Each place is given by: (Number of times) at (Line):(Column). 1 at 196:9

Format Modifiers in Data Step cont’d

DATA ndogs; SET dogs; idate=INPUT(dob, ?? date9.); IF _ERROR_=1 THEN PUT _ALL_;RUN;

NOTE: There were 8 observations read from the data set WORK.DOGS.NOTE: The data set WORK.NDOGS has 8 observations and 4 variables.

Format Modifiers in PROC SQL

+ERROR: Invalid day value NOTE: Table WORK.NDOGS created, with 8 rows and 3 columns.

CREATE TABLE ndogs AS SELECT name, dob,INPUT(dob, date9.) AS idate FORMAT mmddyy8. FROM dogs;

Format Modifiers in PROC SQL cont’d

CREATE TABLE ndogs AS SELECT name, dob,

INPUT(dob,? date9.) AS idate FORMAT mddyy8. FROM dogs;

NOTE: Table WORK.NDOGS created, with 8 rows and 3 columns.

Some Differences Between Data Step and SQL Syntax

• Reserved word• Format modifiers

• Descending sorting

Descending Sorting

PROC SORT DATA = ndogs; BY DESCENDING idate;RUN;

SELECT * FROM ndogs ORDER BY idate descending;

•PROC SQL

• Data step and PROCS

Descending Sorting cont’d

• A little lagniappe (pronounced nańap)SELECT name, dob, INPUT( dob,? date9.) AS idate FORMAT

date9. FROM dogs ORDER BY 3 DESC;

Name DOB IdateBruja 15Aug2000 15AUG2000Tyson 15Mar2000 15MAR2000Jadzia 15Feb1999 15FEB1999Nina 22APR1997 22APR1997Chacal 29Jan1990 29JAN1990Nala 11jan1990 11JAN1990Rocco 08may1980 08MAY1980Chata UNK .

What is Special in SAS PROC SQL?

• The program data vector (PDV)• Some differences between data

step and SQL syntax

• Order of execution

• How good is my code?

The Order of Execution

SELECT VAR1, VAR2, … FROM data_set [ FULL | LEFT | RIGHT | INNER ] JOIN

TABLE2

ON condition 1 AND condition 2

WHERE condition k AND condition l

GROUP BY VAR1, VAR….

HAVING condition p AND condition qORDER BY var1, var2… ;

The Order of Execution cont’dPROC SQL NUMBER;SELECT a.breed,

INPUT(a.dob,? date9.) AS ndob format date9.,

COALESE(a.name, b.name) AS name, b.parentFROM dogs a FULL JOIN parents bON a.name=b.name;

Breed NDOB Name Parent 1 Rottweiler 15AUG2000

Bruja Ina 2 Rottweiler 29JAN1990

Chacal 3 Boxer Chata Ina 4 . Fidonga Ina 5 Pointer 15FEB1999 Jadzia Cecilia 6 Airdale 11JAN1990 Nala Ina 7 AMBR 22APR1997 Nina Cecilia 8 MMBNR 08MAY1980 Rocco Chavo

The Order of Execution cont’dSELECT a.breed, INPUT( a.dob, ? date9.) AS ndob FORMAT date9., COALESCE (a.name, b.name) AS name, b.parentFROM dogs a FULL JOIN parents bON a.name=b.nameAND breed='Rottweiler';

Breed NDOB Name Parent 1 Rottweiler 15AUG2000

Bruja Ina 2 Rottweiler 29JAN1990

Chacal 3 Boxer Chata Ina 4 . Fidonga Ina 5 Pointer 15FEB1999 Jadzia Cecilia 6 Airdale 11JAN1990 Nala Ina 7 AMBR 22APR1997 Nina Cecilia 8 MMBNR 08MAY1980 Rocco Chavo

The Order of Execution cont’d

SELECT a.breed,b.parent,COALESCE ( a.name, b.name) FROM dogs a FULL JOIN parent bON a.name=b.nameWHERE breed='Rottweiler';

SELECT a.breed, b.parent,COALESCE (b.name, a.name)FROM dogs a FULL JOIN parent bON a.breed='Rottweiler'WHERE a.name=b.name;

Breed Parent Rottweiler Ina Bruja Rottweiler Chacal Rottweiler Tyson

Breed Parent Rottweiler Ina Bruja

The Order of Execution cont’d

SELECT a.breed, INPUT(a.dob, ? date9.) AS ndob format date9.,COALESCE(a.name, b.name) AS name,b.parentFROM dogs a FULL JOIN parent bON a.name=b.nameWHERE breed='Rottweiler'HAVING ndob GT '01jan2000'd;

Breed NDOB Name ParentRottweiler 15AUG2000 Bruja Ina Rottweiler 15MAR2000 Tyson

How Good is My Code?

• Does the options used in data step make my code faster?

• Is speed important?

How Good is My Code? cont’d

• Do data step options make my code faster?

data all; merge dogs parent; by name; do i=1 to 100; do j=1 to 100; output; end; end;run;

NOTE: There were 8 observations read from the data set WORK.DOGS.NOTE: There were 7 observations read from the data set WORK.PARENT.NOTE: The data set WORK.ALL has 891000 observations and 6 variables.

How Good is My Code? cont’d

• Does the statements used to control the size of my input data set make my code faster?

proc sql stimer;

NOTE: SQL statement used: real time 0.01 seconds cpu time 0.02 seconds

How Good is My Code? cont’d

NOTE: SQL statement used: real time 10.82 seconds cpu time 10.50 seconds

SELECT * FROM all WHERE breed='Rottweiler';

SELECT * FROMall (WHERE=(breed='Rottweiler'));

NOTE: SQL statement used: real time 10.85 seconds cpu time 10.61 seconds

• Does the statements used to control the size of my input data set make my code faster?

SELECT DISTINCT name, breed, dob FROM all WHERE breed='Rottweiler';

SELECT DISTINCT name, breed, dob FROM all ( WHERE=(breed='Rottweiler'));

SELECT DISTINCT name, breed, dobFROM all (WHERE=(breed='Rottweiler') KEEP=name breed dob);

• Does the statements used to control the size of my input data set make my code faster?

real time 5.89 secondscpu time 4.30 seconds

real time 9.42 secondscpu time 4.57 seconds

real time 4.90 secondscpu time 4.35 seconds

How Good is My Code? cont’d

Conclusions

• This presentation showed examples on how and when to use statements that we already know– KEEP– RENAME– DROP– WHERE (things you already know)

Conclusions

• Possible language problems (line mines)

– Reserved word (CASE)– Format modifiers– Descending order

(things you think you already know)

Conclusions cont’d

• Be careful with the order of the conditions

• The order of the conditions affect the result

Conclusions cont’d

• When you use SQL, consider if you need all your variables and observations

• If you don’t need them consider:– It there is a significant difference in

computer use– Are you a good typist?

Conclusions cont’d

• You can have plain vanilla SAS PROC SQL or you can SASsy it up with data step statements

Acknowledgements

• PPD management – Kim Sturgen and Bonnie Duncan– Andy Barnett– Darrel Edgley– Neil Howard, Larry Sleeper and Craig

Mauldin– For their time, although if there are

mistakes, they are all mine; I made modifications and additions after they read the paper

Contact Information

Angelina Cecilia Mauldin

Abraxis BioScience4505 Emperor Blvd. #400Durham, NC 27703919 433 [email protected]