sas programming sample -sta 9850 project 2

29
John Rabovich STA 9850 Project 2 “Analyzing home loan trends in the five boroughs across ethnicity and gender” I wanted to analyze home loan data in New York City. Specifically, I wanted to see if there were differences in the origination of home loans in the five boroughs by ethnicity. It seems that race plays a factor in many facets of our society, and I wanted to see if there were differences in home loan origination by ethnicity. The question that I am posing is “Are there differences in the average value of the loans originated due to different ethnicities?” The second question I want to ask is ”Are there are differences in the average value of loans originated between genders?” It seems that while our country has come a long way in terms of gender relationships and gender discrimination, it doesn’t appear that we are at a place of complete gender equality yet. To answer these questions, I went to the consumerfinance.gov website and from there went to the Consumer Financial Protection Bureau section. I filtered home loan origination in 2014 by the five boroughs, types of housing, whether the person who took out the loan occupied the home as their personal dwelling, whether the loan was for a home purchase or refinancing, as well as if the ethnicity was “Hispanic or latino” or “not Hispanic or Latino.” This gave me a large dataset of 33,407 records from which I could analyze number of loans given out, the amount of the loans, and I could break it down by race and gender. Here is the link to the site: http://www.consumerfinance.gov/hmda/explore#filters Finding out if there are differences in the amount of loans given out by race using bootstrap resampling techniques Before I used resampling techniques I wanted to get some descriptive statistics about my dataset as well as load the data, etc... Below is my initial code to load the data: proc import out=WORK.loan_data datafile="C:\Users\jr167238\Desktop\home_loan_data_2014.xlsx" dbms=excel replace ; SHEET="hmda_lar"; getnames=yes; run; proc print data=loan_data(obs=15); run; Data loan_data1; Set loan_data; Where applicant_race_name_1 NE .; run; Proc contents data = loan_data varnum; run;

Upload: john-rabovich

Post on 24-Jan-2018

176 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

John Rabovich STA 9850 Project 2 “Analyzing home loan trends in the five boroughs across ethnicity and gender”

I wanted to analyze home loan data in New York City. Specifically, I wanted to see if there were differences in the origination of home loans in the five boroughs by ethnicity. It seems that race plays a factor in many facets of our society, and I wanted to see if there were differences in home loan origination by ethnicity. The question that I am posing is “Are there differences in the average value of the loans originated due to different ethnicities?” The second question I want to ask is ”Are there are differences in the average value of loans originated between genders?” It seems that while our country has come a long way in terms of gender relationships and gender discrimination, it doesn’t appear that we are at a place of complete gender equality yet.

To answer these questions, I went to the consumerfinance.gov website and from there went to the Consumer Financial Protection Bureau section. I filtered home loan origination in 2014 by the five boroughs, types of housing, whether the person who took out the loan occupied the home as their personal dwelling, whether the loan was for a home purchase or refinancing, as well as if the ethnicity was “Hispanic or latino” or “not Hispanic or Latino.” This gave me a large dataset of 33,407 records from which I could analyze number of loans given out, the amount of the loans, and I could break it down by race and gender. Here is the link to the site: http://www.consumerfinance.gov/hmda/explore#filters

Finding out if there are differences in the amount of loans given out by

race using bootstrap resampling techniques

Before I used resampling techniques I wanted to get some descriptive statistics about my

dataset as well as load the data, etc... Below is my initial code to load the data:

proc import out=WORK.loan_data

datafile="C:\Users\jr167238\Desktop\home_loan_data_2014.xlsx"

dbms=excel replace

;

SHEET="hmda_lar";

getnames=yes;

run;

proc print data=loan_data(obs=15);

run;

Data loan_data1;

Set loan_data;

Where applicant_race_name_1 NE .;

run;

Proc contents data = loan_data varnum;

run;

Page 2: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Proc means data = loan_data n nmiss mean std min max range skew kurtosis

maxdec=2 ;

var _numeric_;

run;

proc freq data = loan_data;

tables applicant_race_name_1 * loan_amount_000s

;

Run;

The data below is a screenshot of the proc print just to doublecheck that the data loaded into

SAS matches the raw data in the Excel file. I cut it off after the first fifteen observations.

Below is the output from proc means. As you can see, the mean for loan amount(variable name

loan_amount_000s) is 466.14 or $466,140. This will be useful as a baseline amount from which to

analyze data later on.

Page 3: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

When I ran proc freq for loan amount by race, the data is pretty staggering. 20.9% of the total

amount of the loans were given to Asian people, black people or African Americans had 12.4% of the total amount, and white people had 64.15%

To analyze the difference in average value of loans by race, I used the bootstrap method of

resampling using proc surveyselect and then I looked at the descriptive statistics of this resampled data

using proc univariate. I compared the resampled data against the original data by using proc univariate on the original data and the resampled data. Below is the code I used:

Page 4: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

proc surveyselect data=loan_data out=outboot

seed=3045956

method=urs

samprate=0.2

outhits

rep=100;

run;

proc sort data=outboot;

by applicant_race_name_1;

run;

proc univariate data=outboot;

var loan_amount_000s;

by applicant_race_name_1;

output out=outall mean=sampmean;

run;

proc sort data=loan_data;

by applicant_race_name_1;

run;

proc univariate data=loan_data;

var loan_amount_000s;

by applicant_race_name_1;

output out=outall mean=sampmean;

run;

Page 5: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Using bootstrap resampling techniques to determine loan amount by race output

Native Americans(American Indian or Alaskan Native)

Page 6: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Comparing the Native American resampled data to the original data, we see that the means of the

amount of loans given out to Native Americans are slightly different to each other. The resampled mean

is 388.60 and the mean for the original data is 380.28. However compared to the means for the total

data of the value of loans given out of 466.14, we see that Native Americans average loan values were

significantly smaller than the mean loan value . We also see that according to Kolomogorov-Smirnov p-

values and the graphs for Normality that the data does not seem to be normally distributed.

Page 7: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Asian

Page 8: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Comparing the Asian resampled data to the original, we see that the means of the value of loans given

out to Asians are remarkably consistent. The resampled mean is 410.41 and the mean for the original

data is 409.24. However compared to the means for the total data of the value of loans given out of

466.14, we see that Asians were given smaller loan amounts than the average. We also see that

according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem

to be normally distributed.

Page 9: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Black or African American

Page 10: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Comparing the African American resampled data to the original, we see that the means of loan amounts

given to African Ameicans are remarkably consistent. The resampled mean is 326.99 and the mean for

the original data is 327.24. However compared to the means for the total data for the loan values out of

466.14, we see that African Ameicans were given much smaller loan amounts than the average. We also

see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.

Page 11: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Native Hawaiian or Pacific Islander

Page 12: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Comparing the Native Hawaiian or Pacific Islander resampled data to the original, we see that the means

of the amount of loans given out to Native Hawaiian or Pacific Islanders are a little different. The

resampled mean is 330.66 and the mean for the original data is 333.95. However compared to the

means for the total data of loan values of 466.14, we see that Native Hawaiian or Pacific Islander were

given much smaller loan amounts than the average. We also see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.

Page 13: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

White

Page 14: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Comparing the Whites resampled data to the original, we see that the means of the amount of loans

given out to Whites are remarkably consistent. The resampled mean and the mean for the original data

is 514.65. Compared to the means for the total data for the amount of loans given out of 466.14, we see

that Whites had much larger loan values than average. We also see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.

Conclusions

According to the data, white people’s loans were much larger on average than every other ethnicity. Not

just that, but much more white people were given loans than any other ethnicity as well. Excluding

Native Americans due to their smaller sample size, it appears that average loan amounts given to other

races was in the $300,000s. The exception to this were people of Asian descent, who were given average

loan amounts of $410,000. While this is significantly higher than African Americans, Native Americans,

and Pacific Islanders on average, it still pales in comparison to the average loan amount of $514,000 given to white people. You can see the distribution by ethnicity graphically below:

Page 15: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 16: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Finding out if there are differences in the value of loans given out by

gender using cross-validation with resampling macro techniques

For the question as to whether there were differences in the amount of loans given out due to gender, I

wanted to get some initial baseline statistics using proc freq and proc means. Using proc freq, you can

see from the table below that 36.14% of the total amount of the loans originated were granted to

women while 63.86% of the total amount of the loans originated were granted to men. We also see that the average amount for each loan regardless of gender is $466,140.

Page 17: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

I also used the procnpar1way procedure to see what the mean for the amount of the loans originated

were for males and females. The mean loan amount for males is $516,542 while the mean loan amount

for females is $377,075.

I also wanted to use the proc univariate procedure on the sample data to compare descriptive statistics before I used resampling techniques.

Page 18: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

The means for the proc univariate procedure for loan amount by gender correspond to the means used

in the procnpar1way procedure. Specifically, the mean for loan amount for females is 377.07 for both

procedures while the mean for loan amount by males is 516.54

Now I want to use cross validation techniques to resample the data in SAS. Please see the code I used below:

proc import out=WORK.loan_data

datafile="C:\Users\jr167238\Desktop\home_loan_data_2014.xlsx"

dbms=excel replace

;

SHEET="hmda_lar";

getnames=yes;

run;

proc freq data = loan_data;

tables applicant_sex_name * loan_amount_000s

;

Run;

Proc means data = loan_data n nmiss mean std min max range skew kurtosis

maxdec=2 ;

Page 19: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

var _numeric_;

run;

proc npar1way data = loan_data;

class applicant_sex_name ;

var loan_amount_000s;

exact scores=data / mc n= 9999 alpha = .05;

run;

data analysis validate;

set loan_data;

RETAIN k 16704 n 33407;

IF RANUNI(35879) < = K/N THEN DO; /* random seed */

k = k-1;

OUTPUT analysis;

END;

ELSE OUTPUT VALIDATE;

n = n-1;

DROP k n; /* the DROP statement removes "k" and "n" constraints from the

original formulation, ie k =267 and n=534*/

RUN;

proc print data=analysis(obs=15); run; /* runs data separately */

proc print data=validate(obs=15); run; /* runs data separately */

proc sort data=analysis;

by applicant_sex_name;

run;

proc univariate data=analysis normal mode plot;

title 'Analysis dataset';

by applicant_sex_name;

var loan_amount_000s;

histogram / normal;

run;

proc sort data= validate;

by applicant_sex_name;

run;

proc univariate data=validate normal mode plot;

title "Validate dataset";

by applicant_sex_name;

var loan_amount_000s;

histogram / normal;

run;

proc sort data=loan_data;

by applicant_sex_name;

proc univariate data=loan_data normal mode plot;

class applicant_sex_name;

var loan_amount_000s;

histogram / normal;

run;

Page 20: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

The idea behind this procedure is randomly splitting the data from the original loan_data dataset into

two datasets, Analysis and Validate, and then comparing the statistics we get from each dataset to each

other as well as the statistics from the original sample.

Output for Analysis dataset:

Page 21: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 22: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 23: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 24: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Output for Validate dataset:

Page 25: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 26: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 27: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 28: SAS PROGRAMMING SAMPLE -STA 9850 Project 2
Page 29: SAS PROGRAMMING SAMPLE -STA 9850 Project 2

Conclusions:

The statistics from the original dataset are remarkably consistent with the two randomly

resampled datasets that were created using the cross-validation techniques with the resampling macros.

The mean amount of home loans for females is 377.07 for the original dataset, 377.21 for the analysis

dataset and the 376.93 for the validate dataset. The mean amount of home loans for males is 516.54 in

the original dataset, 514.31 in the analysis dataset, and 518.74. The means for home loan amounts by

males is slightly different for each dataset whereas it is incredibly consistent for women.

We can also see that the data does not appear to be normally distributed for any of the datasets

according to the Test for Normality statistics’ p-values as well as the Normality graphs. We can also see

that there were many more loans given out in general to men according to the distribution of loan

amounts by gender charts.

Everything in the resampled datasets confirms what we saw in the descriptive statistics for the

original datasets. There seems to be a massive disparity between the average value of loans originated between genders.