notes for sas programming

83
Notes for SAS programming Econ424 Fall 2007

Upload: cora-flores

Post on 31-Dec-2015

52 views

Category:

Documents


5 download

DESCRIPTION

Notes for SAS programming. Econ424 Fall 2007. Why SAS?. Able to process large data set(s) Easy to cope with multiple variables Able to track all the operations on the data set(s) Generate systematic output Summary statistics Graphs Regression results - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Notes for SAS programming

Notes for SAS programming

Econ424

Fall 2007

Page 2: Notes for SAS programming

Why SAS?

• Able to process large data set(s)

• Easy to cope with multiple variables

• Able to track all the operations on the data set(s)

• Generate systematic output

• Summary statistics

• Graphs

• Regression results

• Most government agencies and private sectors use SAS

Page 3: Notes for SAS programming

Roadmap

• Basic rules

• Read in data

• Data cleaning commands

• Summary statistics

• Combine two or more datasets

• Hypothesis testing

• Regression

Page 4: Notes for SAS programming

Basic rules (1) – organize files

• .sas – program file• .log – notes, errors, warnings • .lst – output• .sas7bdat – data file • library – a cabinet to put data in

– Default: Work library• temporary, erased after you close the session

– Permanent library• libname mylib “m:\”; • mylib.mydata

= a sas data file named “mydata” in library “mylib”

• run and recall .sas

Page 5: Notes for SAS programming

Basic rules (2) -- program

• every command ends with ;• format does not matter

if x=1 then y=1; else y=2; is the same asif x=1 then y=1;

else y=2;

• case insensitive• comment

* this is comment;/* this is comment */;

Page 6: Notes for SAS programming

Basic rule (3) – variable

• Type– numeric (default, 8 digit, . stands for missing value) – character ($, default 8 digit, blank stands for missing)

• Variable names – <=32 characters if SAS 9.0 or above– <=8 characters if SAS 8 or below– case insensitive

• Must start with letter or “_” _name, my_name, zip5, u_and_me -name, my-name, 5zip, per%, u&me, my@w, my$sign

Page 7: Notes for SAS programming

Basic rules (4) – data step

• Data stepDATA distancenew;

set distance;

miles=26.22;

kilometers=1.61*miles;

input data “distance”

obs1

obs2

obs n

define “miles”

define “kilometers”

output data “distancenew”

obs1

obs n

Page 8: Notes for SAS programming

Basic rules (5) – proc step

• PROC stepPROC PRINT data=distancenew;

var miles kilometers;

title “print out distancenew”;

run;

Page 9: Notes for SAS programming

Read in data (1) – table editor

• Tools – table editor

• choose an existing library or define a new library

• rename variable

• save as

Page 10: Notes for SAS programming

Read in data (2) -- datalines

data testdata1;infile datalines; input id height weight gender $ age;datalines; 1 68 144 M 232 78 . M 343 62 99 F 37; /* you only need one semicolon at the end of all data lines, but the

semicolon must stand alone in one line */

proc contents data=testdata1; run;proc print data=testdata1; run;

No ; until you finish all the data lines

Page 11: Notes for SAS programming

Read in data (3) – more datalines

/* read in data in fixed columns */

data testdata1;

infile datalines;

input id 1 height 2-3 weight 4-6 gender $7 age 8-9;

datalines;

168144M23

278 M34

36299 F37

;

Page 12: Notes for SAS programming

Read in data (4) – more datalines

data testdata1;

infile datalines;

input id : 1. height : 2. weight : 3. gender : $1. age : 2.;

datalines;

1 68 144 M 23

2 78 . M 34

3 62 99 F 37

;

Page 13: Notes for SAS programming

Read in data (4) – more datalines

*alternatively;

data testdata1;

infile datalines;

informat id 1. height 2. weight 3. gender $1. age 2.;

input id height weight gender age;

datalines;

1 68 144 M 23

2 78 . M 34

3 62 99 F 37

;

Page 14: Notes for SAS programming

Read in data (5) – .csv

data testdata1;infile datalines dlm=‘,’ dsd missover;

/* what if you do not have dsd and/or missover */input id height weight gender $ age ;datalines; 1, 68, 144, M, 23 2, 78, , M, 343, 62, 99, F, 37; /* what if you forget to type 23 */run;

Page 15: Notes for SAS programming

Read in data (6) – .csv file

/* save the midterm data in a comma delimited file (.csv) before running the following codes */

libname mylib "M:\";

filename midterm "M:\midterm-fall2005-1997rank.csv";

data mylib.midterm;

infile midterm firstobs=3 dlm=',' dsd missover lrecl=900;

input year: 4. name: $50. score reputation accept_rate graduate_rate;

run;

proc contents data=mylib.midterm;

run;

proc print data=mylib.midterm;

run;

Note the options in the infile command!

Page 16: Notes for SAS programming

16

Read in data (7) – from excel

filename myexcel “M:\midterm-fall2005-1997rank.xls”;proc import datafile=myexcel out=readin1

DBMS=excel replace; run;data readin1;

set readin1;acceptper=accept_rate*100;gradper=graduate_rate*100;

run;proc contents data=readin1; run;proc print data=readin1; run;

No ; until the end of the whole sentence

Page 17: Notes for SAS programming

17

Read in data (7) – from excel

Be careful !SAS will read the first line as variable names, our original data has two non-data lines – the first as labels, the second as variable names. You need to delete the first line to make sure correct data input into SAS. What happens if you don’t delete the first line? What happens if you delete the second line?SAS assigns numeric and character type automatically. Sometime it does make mistake.

Page 18: Notes for SAS programming

Data cleaning (1) – if then

Format: IF condition THEN action;ELSE IF condition THEN action;ELSE action;

Note: (1) the if-then-else can be nested as many as you

want(2) if you need multiple actions instead of one

action, use “DO; action1; action2; END; ”

Page 19: Notes for SAS programming

Data cleaning (1) – if then

• = or EQ means equals• ~= or NE means not equal• > or GT means greater than• < or LT means less than• >= or GE means greater than or equal • <= or LE means less than or equal • in means subset

– if gender in (‘M’, ‘F’) then ..;

• Multiple conditions: AND (&), OR(|)

Page 20: Notes for SAS programming

Data cleaning (1) – if then

*reading in program of testdata1 is on page 14;data testdata1;set mylib.testdata1;IF gender='M' AND age<=25 THEN testgroup=0; ELSE testgroup=1; run;proc contents data=testdata1; run;proc print data=testdata1; run;Note: (1) the code is less efficient if you replace ELSE ..; with IF gender~= 'M' OR age>25 THEN ..;(2) missing value is always counted as the smallest negative, so age=. will

satisfy the condition age<=25. If you want to ignore the obs with age=. set the condition as 0<age<=25.

Page 21: Notes for SAS programming

Data cleaning (1) – if then

* Multiple actions in each branch;

data testdata1;set testdata1;IF gender='M' AND age<=25 THEN DO;

group=0; activity='ballgame'; END;

ELSE DO; group=1; activity='movie'; END;

run;proc print data=testdata1; run;

Page 22: Notes for SAS programming

Data cleaning (1) – if then

*Use if commands to choose a subsample;

data testdata2; /* note here we generate a new data set */set testdata1;IF age=. Then delete;If age>18 & weight~=.;run;proc print data=testdata2; run;

Page 23: Notes for SAS programming

Data cleaning (1) – exercise

still use testdata1.define

weight in pounds bmi=--------------------------------- x 703

(height in inches)^2

define group = 1 if bmi<18.5 (underweight) 2 if 18.5 <= bmi < 25 (normal) 3 if 25<= bmi <30 (overweight) 4 if bmi>=30 (obese).

Page 24: Notes for SAS programming

Data cleaning (1) – exercise answer

data testdata1;set testdata1;bmi=(weight/(height*height))*703;if bmi<18.5 then group=1;else if bmi<25 then group=2;

else if bmi<30 then group=3; else group=4;

run;proc contents data=testdata1; run;proc print data=testdata1; run;Question: What if one observation has weight=.?

Page 25: Notes for SAS programming

Pages 24-27 are optional material for data cleaning.

They are not required, but you may find them useful in the future.

We skip them in the regular class.

Page 26: Notes for SAS programming

Data cleaning (2) – convert variable type

Numeric to character:age1=put(age, $2.);age and age1 have the same contents but different formats

Character to numeric:age2=input(age1, 1.);now age and age2 are both numeric, but age2 is chopped at the first digit

Take a sub string of a characterage3=substr(age1,2,1); now age3 is a sub string of age1, starting from the second digit of age1 (the meaning of “2”) and having one digit in total (the meaning of “1”).

Page 27: Notes for SAS programming

Data cleaning (2) - example

* we want to convert studid 012345678 to 012-345-678;data testdata2;infile datalines;input studid : 9. studname : $1.;datalines;012345678 A135792468 B009876543 C;proc print; run;data testdata2;set testdata2;if studid LT 1E+7 then studid1= '00’||compress(put(studid, $9.));else if 1E+7 LE studid LT 1E+8 then studid1='0'||compress(put(studid,

$9.)); else studid1= put(studid, $9.);

studid2=substr(studid1,1,3)||'-'||substr(studid1,4,3)||'-'||substr(studid1,7,3);

proc print; run;

Page 28: Notes for SAS programming

Data cleaning (2) - exercise

You have the following data, variables in sequence are

SSN, score1, score2, score3, score4, score5:

123-45-6789 100 98 96 95 92

344-56-7234 69 79 82 65 88

898-23-1234 80 80 82 86 92

Calculate the average and standard deviation of the five scores for each individual. Use if-then command to find out who has the highest average score, and report his SSN without dashes.

Page 29: Notes for SAS programming

Data summary roadmap

• Proc contents – variable definitions• Proc print – raw data• Proc format – make your print look nicer• Proc sort – sort the data• Proc means – basic summary statistics• Proc univariate – detailed summary stat• Proc freq – frequency count• Proc chart – histogram• proc plot – scatter plot

Page 30: Notes for SAS programming

proc format (1)

*Continue the data cleaning exercise on page 23;

data testdata1;

set testdata1;

bmi=(weight/(height*height))*703;

if bmi<18.5 then group=1;

else if bmi<25 then group=2;

else if bmi<30 then group=3;

else group=4;

run;

Defining “group” as a numeric variable will

save space

Page 31: Notes for SAS programming

proc format (2)

proc format; value bmigroup 1=‘under weight’

2=‘normal weight’3=‘overweight’4=‘obese’;

run;proc print data=testdata1; format group bmigroup. bmi 2.1; title ‘report bmi group in words’; label bmi=‘body mass index’ group=‘bmi group 1-4’; var bmi group; run;

no ; here until the end of the

label command

no ; here until the end of the

value command

Page 32: Notes for SAS programming

proc sort

proc sort data=testdata1;

by gender age;

run;

proc sort data=testdata1 out=testdata2;

by gender descending age;

run;

* note that missing value is always counted as the smallest;

Page 33: Notes for SAS programming

proc means and proc univariate

proc means data=testdata1;class gender;var height weight bmi;run;

proc sort data=testdata1; by gender; run;

proc univariate data=testdata1;by gender;var bmi;run;

By default, proc means report mean, stdev, min, max

Could choose what to report:proc means data=testdata1

n mean median;

By default, proc univariate report median, and many

other statistics

Page 34: Notes for SAS programming

Notes on proc means and proc univariate

*if you do not use class or by command, the statistics are based on the full sample. If you use class or by var x, the statistics are based on the subsample defined by each value of var x.

*You can use class or by in proc means, but only by in proc univariate;

*whenever you use “by var x”, the data set should be sorted by var x beforehand;

Page 35: Notes for SAS programming

More example onproc means and proc univariate

data readin1; set readin1;if accept_rate<0.20 then acceptgrp=“low”; else acceptgrp=“high”; run;

proc means data=readin1;class acceptgrp;var reputation accept_rate graduate_rate;run;

proc sort data=readin1; by acceptgrp; run;proc univariate data=readin1;

by acceptgrp;var reputation;run;

By default, proc means report mean, stdev, min, max

Could choose what to report:proc means data=readin1

n mean median;

By default, proc univariate report median, and many

other statistics

Page 36: Notes for SAS programming

proc freq

data readin1;set readin1;if graduate_rate<0.5 then gradgrp=“low”;else if 0.5<=graduate_rate<0.8 then gradgrp=‘middle’;

else gradgrp=‘high’;run;

proc freq data=readin1;tables acceptgrp

gradgrp acceptgrp*gradgrp;

run;

Page 37: Notes for SAS programming

proc chart -- histogram

proc chart data=readin1;title ‘histogram for acceptgrp’;vbar acceptgrp;run;

proc chart data=readin1;title ‘frequency by two variables’;vbar acceptgrp / group=gradgrp;run;

Page 38: Notes for SAS programming

proc chart – continuous variable

proc chart data=readin1;

title “histogram for continuous variable’;

vbar accept_rate;

run;

proc chart data=readin1;

title ‘histogram with specific midpoints’;

vbar accept_rate / midpoints=0 to 1 by 0.1;

run;

Page 39: Notes for SAS programming

proc plot – scatter plot

proc plot data=readin1;

title ‘scatter plot of accept rate and grad rate’;

plot accept_rate*graduate_rate;

run;

Page 40: Notes for SAS programming

fancy proc means

proc means data=readin1;class acceptgrp gradgrp;var score reputation;output out = summary1

mean = avgscore avgreputn;run;

proc print data=summary1;run;

Page 41: Notes for SAS programming

The following page may be useful in practice, but I am not

going to cover it in class.

Page 42: Notes for SAS programming

some summary stat. in proc printfilename myexcel “M:\midterm-fall2005-1997rank.xls”;;proc import datafile=myexcel out=readin1 DBMS=excel replace; run;data readin1;

set readin1;if accept_rate<0.2 then acceptgrp=‘low’;

else if 0.2<=accept_rate<0.5 then acceptgrp=‘middle’; else acceptgrp=‘high’; run;proc sort data=readin1; by acceptgrp; run;proc print data=readin1 n;

where graduate_rate>=0.8; by acceptgrp; sum score; var name score accept_rate graduate_rate; run;

Page 43: Notes for SAS programming

merge and append

readin1: name score reputation acceptgrp gradgrp

Harvard 100 3.9 low high

summary1: acceptgrp gradgrp avgscore avgreputn

low high 80 3.3

merged: name score reputation acceptgrp gradgrp avgscore avgreputn

Harvard 100 3.9 low high 80 3.3

appended:

name score reputation acceptgrp gradgrp avgscore avgreputn

Harvard 100 3.9 low high . .

. . low high 80 3.3

Page 44: Notes for SAS programming

merge two datasets

proc sort data=readin1;by acceptgrp gradgrp;run;

proc sort data=summary1;by acceptgrp gradgrp;run;

data merged;merge readin1 (in=one) summary1 (in=two);by acceptgrp gradgrp;if one=1 & two=1; run; What if this line is

“if one=1 OR two=1;”?

Page 45: Notes for SAS programming

Keep track of matched and unmatched records

data allrecords;merge readin1 (in=one) summary1 (in=two);by acceptgrp gradgrp;myone=one;mytwo=two;if one=1 or two=1; run;

proc freq data=allrecords;tables myone*mytwo;run;

SAS will drop variables “one” and “two”

automatically at the end of the DATA step. If you want to keep them, you can copy

them into new variables “myone” and “mytwo”

Page 46: Notes for SAS programming

be careful about merge!

• always put the merged data into a new data set• must sort by the key variables before merge• ok for one-to-one, multi-to-one, one-to-multi, but

no good for multi-to-multi• be careful of what records you want to keep, and

what records you want to delete• what if variable x appears in both datasets, but x

is not in the “by” statement?– after the merge x takes the value defined in the last

dataset of the “merge” statement

Page 47: Notes for SAS programming

append

data appended;

set readin1 summary1;

run;

proc print data=appended;

run;

proc print data=merged;

run;

Page 48: Notes for SAS programming

Application : Project 5

rest_rating:

Restname Ratings Price Loc1 Phone1 Loc2 Phone2

Addie’s 1 30 Rockville 301-881-0081

Andalucia 2 30 Rockville 301-770-1880

Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501

city_demo:

Loc State Pop Income

Bethesda MD 62936 38976

Falls Church VA 9578 26709

Rockville MD 44835 21484

How to merge the two data sets?

Page 49: Notes for SAS programming

Steps to merge the datasets

1. reshape data set 1 – focus on loc1rest_rating:

Restname Ratings Price Loc1 Phone1 Loc2 Phone2

Addie’s 1 30 Rockville 301-881-0081

Andalucia 2 30 Rockville 301-770-1880

Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501

restloc1

restname .. loc phone

Addie’s Rockville 301-881-0081

Andalucia Rockville 301-770-1880

Haandi Bethesda 301-718-0121

data restloc1;

set rest_rating;

loc=loc1;

phone=phone1;

drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4;

run;

Page 50: Notes for SAS programming

Steps to merge the datasets

1. reshape data set 1 – focus on loc2rest_rating:

Restname Ratings Price Loc1 Phone1 Loc2 Phone2

Addie’s 1 30 Rockville 301-881-0081

Andalucia 2 30 Rockville 301-770-1880

Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501

data restloc2;

set rest_rating;

loc=loc2;

phone=phone2;

drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4;

If loc=‘’ then delete;

run;

restloc2

restname .. loc phone

Haandi Falls Church 703-533-3501

Page 51: Notes for SAS programming

Steps to merge the datasets

Similarly create restloc3 and restloc4

Page 52: Notes for SAS programming

Steps to merge the datasets

2. append restloc1, restloc2, restloc3, restloc4 into restloc

restloc2

restname .. loc phone

Haandi Falls Church 703-533-3501

restloc1

restname .. loc phone

Addie’s Rockville 301-881-0081

Andalucia Rockville 301-770-1880

Haandi Bethesda 301-718-0121

restloc

restname .. loc phone

Addie’s Rockville 301-881-0081

Andalucia Rockville 301-770-1880

Haandi Bethesda 301-718-0121

Haandi Falls Church 703-533-3501

data restloc;

set restloc1 restloc2 restloc3 restloc4;

run;

Page 53: Notes for SAS programming

Steps to merge the datasets

3. merge restloc with citydemo by loc

restloc

restname .. loc phone

Addie’s Rockville 301-881-0081

Andalucia Rockville 301-770-1880

Haandi Bethesda 301-718-0121

Haandi Falls Church 703-533-3501

city_demo:

Loc State Pop Income

Bethesda MD 62936 38976

Falls Church VA 9578 26709

Rockville MD 44835 21484

proc sort data=restloc;

by loc; run;

proc sort data=city_demo;

by loc; run;

data merged;

merge restloc (in=one)

city_demo (in=two);

by loc;

if one=1 | two=1;

inone=one; intwo=two;

run;

Page 54: Notes for SAS programming

Questions

• How many restaurants have two locations? how many have three?

• How many restaurants locate in Rockville?• How many restaurants do not have any match

with city demographics?• How many restaurants in cities where per capita

income is at least 30,000?• How many are asian and locate in areas with per

capita income at least 30,000?

Page 55: Notes for SAS programming

mean comparison: two groups* compare the mean of two groups;* specific question: do Asian and non-Asian restaurants charge systematically different prices?;* define a dummy variable asianstyle=1 if the food style is Asian, Chinese, Japanese, Thai, Vietnamese,

Lebanese, or Korean, 0 otherwise;

proc glm data=merged;class asianstyle;model price=asianstyle; /* alternatively, model price=asianstyle/solution; */means asianstyle;run;The F-value tests

H0: regression has no power explaining variation in incomeH1: regression has some power.

In our context, this is equivalent to H0: income is no different across asianstyle or non-asianstyleH1: income is different.

This amounts to a two-tail test of equal mean.

If you use “/solution” at the end of the model command, the output would report regression results. The t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test or a one-tail test of equal mean.

Page 56: Notes for SAS programming

mean comparison: two groups

Alternatively, you can use

proc reg data=merged;model price=asianstyle;Run;

The t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test or a one-tail test of equal mean.

Page 57: Notes for SAS programming

mean comparison: three or more groups

* compare the mean of four groups – ratings =1,2,3,4;* specific question: do restaurants of different ratings charge

systematically different prices?;proc glm data=merged;class ratings;model price=ratings;means ratings/waller;means ratings/lsd cldiff;run;

Page 58: Notes for SAS programming

notes on mean comparison

1. The logic of mean comparison is the same as in Excel

2. Be careful about one-tail and two-tail tests. The standard SAS output of coefficient t-stat and p-value are based on a two-tail test of H0: coeff=0, but could be used for a one-tail test with appropriate adjustment on p-value. The F-value reported as the overall regression statistics only applies to the right tail of F distribution, because F is bounded to be positive. The F-value can test H0: regression explains no variation in your dependent variable, against H1: regression explains some variation. This is equivalent to a two-tail test of equal mean.

3. Comparison across more than two groupsH0: all groups have the same mean F-test of whole regressionORH0: group x and group y has the same mean waller or lsd

statistics

Page 59: Notes for SAS programming

regression in SAS• specific question: how do restaurant price relate

to ratings, city population and per capita income?

libname n “N:\share\notes”;

proc reg data=n.merged;

model price=ratings population per_capita_income;

run;Equivalent to:

Price=alpha+beta1*ratings+beta2*pop+beta3*per_cap_inc

Page 60: Notes for SAS programming

regression in SAS

* specific question: how do restaurant prices relate to ratings, city population, per capita income, and state specifics?

* note that state itself is a character variable, and we would like to separate MD, VA and DC. This amounts to define three dummy variables denoting each state. This is easy to achieve in SAS;

proc glm data=merged;class state;model price=ratings population per_capita_income

state / solution;run;

Page 61: Notes for SAS programming

Regression: exercise 1

• The coefficient of ratings in the first regression tells us how much price will change by a unit change in ratings.

• However, one may think the change from 1 star to 2 star is different from the change from 2 star to 3 star, or from 3 star to 4 star.

• To capture these differences, we need more than one coefficents for ratings.

proc glm data=merged;class ratings;model price=ratings population per_capita_income

/ solution;run;

Page 62: Notes for SAS programming

Regression: exercise 2• What if we also want to control for states?• SAS does not allow more than one variables in the “class ” command, so we need to define one set

of dummy variables by ourselves;• Below we define 4 dummy variables for each value of ratings. Since the sum of the four variables

is always 1, we need to drop one of them in the regression to avoid their colinearity with the intercept.

data merged;set merged;if ratings=1 then rating_1star=1; else rating_1star=0;if ratings=2 then rating_2star=1; else rating_2star=0;if ratings=3 then rating_3star=1; else rating_3star=0;if ratings=4 then rating_4star=1; else rating_4star=0;

proc glm data=merged;class state;model price=rating_2star rating_3star rating_4star population per_capita_income state / solution;run;

Page 63: Notes for SAS programming

Regression: exercise 3• What if we focus on the average price change by one unit increase of ratings (like in regression 1 on page 59), but we

would like to see whether this price change vary by states?• We need to define ratings*(dummy for MD), ratings*(dummy for VA),ratings*(dummy for DC)

data merged;set merged;if state=“MD”then state_MD=1; else state_MD=0;if state=“VA” then state_VA=1; else state_VA=0;if state=“DC” then state_DC=1; else state_DC=0;ratings_MD=ratings*state_MD;ratings_VA=ratings*state_VA;ratings_DC=ratings*state_DC;run;

proc reg data=merged;model price=ratings_MD ratings_VA ratings_DC population per_capita_income;run;

* Control for states;proc glm data=merged;class state;model price=ratings_MD ratings_VA ratings_DC population per_capita_income state/solution;run;

Page 64: Notes for SAS programming

A comprehensive example

A review of

1. readin data

2. summary statistics

3. mean comparison

4. regression

reg-cityreg.sas in N:\share\notes\

Page 65: Notes for SAS programming

A Case Study of Los Angeles Restaurants

Nov. 16-18, 1997 CBS 2 News “Behind the Kitchen Door”

January 16, 1998, LA county inspectors start issuing hygiene

grade cards

A grade if score of 90 to 100

B grade if score of 80 to 89

C grade if score of 70 to 79

score below 70 actual score shown

Grade cards are prominently displayed

in restaurant windows

Score not shown on grade cards

Page 66: Notes for SAS programming
Page 67: Notes for SAS programming

Take the ideas to data

better Information

better quality

better revenue

regulation

by county

by city

hygiene scores

restaurant revenue calculated by sales tax

Page 68: Notes for SAS programming

Data complications

Unit of analysis:

individual restaurant? city? zipcode? census tract?

Unit of time:

each inspection? per month? per quarter? per year?

Define information:

county regulation? city regulation? the date of passing the regulation? days since passing the regulation? % of days under regulation?

Define quality:

average hygiene score? the number of A restaurants? % of A restaurants?

Define revenue: (only collected by quarter!)

total revenue? average revenue per restaurant?

Page 69: Notes for SAS programming

How to test the idea?

• Regression 1:quality = α+β*information+error

+something else?

• Regression 2:revenue=α+β*quality+γ*information+error

+something else?

What is the something else? year trend, seasonality, quality*information, ....

Page 70: Notes for SAS programming

real test

reg-cityreg.sas in N:\share\notes\

Page 71: Notes for SAS programming

Questions

• How many observations in the sample? – log of the first data step, or output from proc contents

• How many variables in the sample? How many are numerical, how many are characters?– Output from proc contents

• On average, how much tax revenue can we collect from one city in one quarter?– Output from proc means, on sumtax

• How many percentage of restaurants have A quality in a typical city-month?– Output from proc means, on per_A

• What is the maximum tax revenue in a city-quarter? What is the minimum?– Output from proc means, on sumtax

Page 72: Notes for SAS programming

Questions

• What is the difference between cityreg and ctyreg? We know county regulation came earlier than city regulation, is that reflected in our data?– Yes, cityreg<=ctyreg in every observation– We can check this in proc means for cityreg and ctyreg, or add a proc

print to eyeball each obs

• What is the difference between cityreg and citymper? What is the mean of cityreg? What is the mean of citymper? Are they consistent with their definitions?– The unit of cityreg is # of days, so it should be a non-negative integer– The unit of citymper is % of days, so it should be a real number between

0 and 1– To check this, we can add a proc means for cityreg and citymper

Page 73: Notes for SAS programming

Questions

• Economic theories suggest quality be higher after the regulation if regulation gives consumers better information. Is that true in summary statistics? – by summary statistics, I mean a simple mean comparison of

quality by different regulation environments. Which procedure tells us that?

– Yes, the two proc means, class citym_g or ctym_g• How about revenue? Do summary statistics suggest that

better information better revenue?– Yes, the two proc means, class citym_g or ctym_g– The two proc means only show us group-average. For formal

tests, we can use proc glm or proc reg as in our lecture notes on mean comparison (do you know how to do it?)

Page 74: Notes for SAS programming

Questions

Facts reported in summary statistics often reflect many economic factors, not only the one in our mind. Summary statistics are also crude for easy presentation. That is why we explore regressions.

In the quality regression, does more city regulation associate with higher quality?

– is the coefficient of city regulation positive and significantly different from zero? (proc reg)

– is the coefficient of county regulation positive and significantly different from zero? (proc reg)

– Do we omit other sensible explanations for quality changes? What are they? (yes, they are ….)

Page 75: Notes for SAS programming

Questions

• What does “proc means” do on page 3? Why does it allow us to collapse the data of city-months into city-quarters?

• Why do we use lnsumtax as the dependent variable instead of sumtax? – Reg of lnsumtax tells us the effect of regulation on the

percentage change in sumtax

• In the first regression, what does the “cityqper” coefficient mean? what does the “per_A” coefficient mean? What does the “cityq_A” (which equals to cityqper*per_A) coefficient mean?

Page 76: Notes for SAS programming

Questions

• We report four sets of regression results for the revenue regression. How do they differ from each other?

• How does the coefficient of “cityqper” change across these regressions? What does the change suggest?

• Which set of results makes more sense?

Page 77: Notes for SAS programming

Count duplicates (not required)

http://support.sas.com/ctx/samples/index.jsp?sid=67&tab=code

data dups nodups ; set clasdata ; by name class ; /* If the combination of NAME and CLASS is in the data

set once, output NODUPS, else output DUPS. */ if first.class and last.class then output nodups ; else output dups ; run;

Page 78: Notes for SAS programming

Some project 6 ideas from previous students

• Does super-bowl advertising boost stock price?• Does financial aid promote GDP growth?• What are the most important determinants of

house price?• Advice on smart investment: which stock(s)

should John Doe invest in given his preference for return and risk?

• Does a chain store charge different prices in different locations?

Page 79: Notes for SAS programming

Idea I: Does super-bowl advertising boost stock price?

• Theory– advertising in super-bowl is extremely expensive and therefore

increase the cost of the advertiser negative effect on stock price

– advertising at the superbowl signals good finanical status of the advertiser

positive effect on stock price

• Data: – list of superbowl advertisers from major newspapers– list of non-advertisers who are major competitors with the

advertisers in the same industry– New York Stock Exchange

• Timing: right before and after super-bowl (event study)

Page 80: Notes for SAS programming

Idea II: Does financial aid promote GDP growth?

• Theory: – positive effect because aid means help– negative effect because aids tend to go to the most

under-developed countries

• Data– focus on Sub-Sahara Africa– World Development Indicators (WDI 2002) database

from the World Bank– 1990-2000: GDP per capita, aid, FDI, adjusted net

savings, health expenditures, adjult illiteracy rate

Page 81: Notes for SAS programming

Idea III: determinants of house price?

• Theory: – house attributes– location– time of sale

• Data– zillows.org

Page 82: Notes for SAS programming

Idea IV: Smart Investment

• Theory: CAPM

• John Doe’s preference: risk and return

• Sample:– treasury bills– S&P 500 index– six individual stocks– monthly data in the past 60 months

Page 83: Notes for SAS programming

Idea V: Does a chain store charge different prices in different locations?

• Theory: – Yes: A chain store will charge different prices because

different locations have different rent costs and different consumer willingness to pay

– No: same pricing facilitates advertising from the chain headquarter

• Data: – Hand collected price for 10 CVS stores– Choose a basket of goods that represent CVS sales– Land price and demographic data from Census