
Stat 342 - Wk 10: Regression

Loading data with datalines

Regression - with interactions

(Proc glm) - with polynomial terms

- with categorical variables

(Proc glmselect) - with model selection

(this is mostly chapter 6 material)

Stat 342 Notes. Week 10 Page 1 / 57

In Week 8, we saw correlations, which are the first step to regression.

In Week 9, we saw ANOVA, but treated it like a regression on categorical variables.

This week we look at a suite of examples surrounding regression and PROC GLM.

As time permits, we will also look at t-tests and power analysis.

Stat 342 Notes. Week 10 Page 2 / 57

First, let's load up the 'mtcars' dataset.

Rather than relying on a .csv file, let's try loading it in through a data step and the DATALINES command.

The advantages of loading text this way (a full sketch of the data step follows on the next page) are...

1) It can be done without knowing in advance the folder structure of your system.

2) Complete control over how variables are interpreted.

Stat 342 Notes. Week 10 Page 3 / 57
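(For reference, a minimal sketch of the whole DATA step described over the next few pages. The variable names cylinders, displacement, hp and the column order are assumptions based on the data lines shown on page 8, and only three rows and a few columns are included.)

DATA mtcars;
LENGTH Make $ 10 Model $ 22; /* reserve enough characters for the two strings */
INFILE DATALINES TRUNCOVER; /* read from the data lines below; short lines leave the rest missing */
INPUT Make $ Model $ mpg cylinders displacement hp; /* the real step lists more variables (weight, gear, ...) */
DATALINES;
Mazda RX4 21.0 6 160.0 110
Mazda RX4_Wag 21.0 6 160.0 110
Volvo 142E 21.4 4 121.0 109
;
RUN;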

Stat 342 Notes. Week 10 Page 4 / 57

LENGTH Make $ 10. Model $ 22.;

Establish the variables 'Make' and 'Model' as 10 and 22 characters long, respectively. If this is not done, SAS will assume that the variables are 8 characters long and will cut off anything after that.

INFILE DATALINES TRUNCOVER;

The source isn't an external file, but a set of data lines given later in this same data step.

Stat 342 Notes. Week 10 Page 5 / 57

INFILE DATALINES TRUNCOVER;

'TRUNCOVER' tells SAS what to do when a data line runs out of values: each line of the datalines becomes one observation, and any variables that line didn't fill are left missing (or truncated) rather than being read from the next line.

Other options include 'MISSOVER' (similar, but a partially read value is set to missing rather than truncated) and the default 'FLOWOVER', which keeps filling variables by moving on to the next line.

Stat 342 Notes. Week 10 Page 6 / 57

INPUT Make $ Model $ mpg...

Take the following datalines and put them in the variables 'make' (character string), 'model' (character string), 'mpg' (numeric), ... and so on.

Every space starts a new variable. You could also tell SAS to allow a single space inside 'model' (so that two words go into it) with the & modifier, such as

INPUT Model & $

The value then ends at two consecutive blanks in the data.

Stat 342 Notes. Week 10 Page 7 / 57

DATALINES

Mazda RX4 21.0 6 160.0 110 ...

Mazda RX4_Wag 21.0 6 160.0 110 ...

...

Volvo 142E 21.4 4 121.0 109

;

The actual data to be entered. Only one semicolon is used, at the very end of the data. (If you need a semicolon IN the data somewhere, use DATALINES4 instead, which ends the data block with four semicolons.)

Stat 342 Notes. Week 10 Page 8 / 57

Finally, I wanted the company ('make') to show up in the 'model' column as well, so I concatenated make and model together (and put the result into model).

Stat 342 Notes. Week 10 Page 9 / 57

To concatenate two strings means to take one and put it on the end of the other.

Three or more strings can also be concatenated.

DATA mtcars;

SET mtcars;

model = cat(make,model);

run;
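(A side note, not from the slides: the CAT function joins the strings exactly as stored, so trailing blanks in 'make' can end up in the middle of the result, and there is no separator. If you want each piece trimmed and joined with a single space, CATX is an alternative.)

DATA mtcars;
SET mtcars;
model = catx(' ', make, model); /* CATX trims each argument and joins them with the given delimiter */
run;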

Stat 342 Notes. Week 10 Page 10 / 57

The result:

Stat 342 Notes. Week 10 Page 11 / 57


Stat 342 Notes. Week 10 Page 12 / 57

Now let's dig into the actual regression, starting with a simple one: fuel economy vs weight.

proc glm data = mtcars;

model mpg = weight / solution;

run;

The SOLUTION option for the model tells SAS to print the estimates of the intercept and slope coefficients.

Without it, we get much simpler model summaries.

Stat 342 Notes. Week 10 Page 13 / 57

Stat 342 Notes. Week 10 Page 14 / 57

For simple regression, we also get a scatterplot with a line of best fit (i.e. the least-squares line, the regression line) with two bands around it:

The inner band (shaded) shows the confidence limits of the MEAN, also called the confidence interval. This is where the true line could plausibly be, given the uncertainty in the estimated coefficients (95% of the time).

The outer band (dotted lines) shows the confidence limits of INDIVIDUAL PREDICTIONS. This is where new data points could fall if we predicted them from this model (95% again).
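(For reference, the textbook formulas behind these two bands in simple regression; these are standard results, not from the slides. The extra 1 under the square root is what makes the prediction band wider.)

Mean response at $x_0$:  $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{\tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{S_{xx}}}$

New observation at $x_0$:  $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{S_{xx}}}$

where $s$ is the root MSE and $S_{xx} = \sum_i (x_i - \bar{x})^2$.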

Stat 342 Notes. Week 10 Page 15 / 57

Stat 342 Notes. Week 10 Page 16 / 57

The 'clparm' option gives you the confidence limits of the parameters. You can use alpha to change the confidence level of these limits, as well as the confidence bands.

model mpg = weight / solution clparm;

Stat 342 Notes. Week 10 Page 17 / 57

You can use alpha to change the confidence level of these limits, as well as the confidence bands.

model mpg = weight / solution clparm alpha=0.01;

Stat 342 Notes. Week 10 Page 18 / 57

Other options, like p and clm, add predictions and confidence limits of the mean response for each observation.

These are output into a separate table.

proc glm data = mtcars;

model mpg = weight / p clm;

run;

Stat 342 Notes. Week 10 Page 19 / 57

Stat 342 Notes. Week 10 Page 20 / 57

...and this table can be appended to the existing dataset so you can do further processing.

proc glm data = mtcars;

model mpg = weight / p clm;

output out = mtcars_model

P=predicted_mpg R=residual_mpg;

run;
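(The OUTPUT statement also accepts keywords for the interval limits themselves. A sketch assuming you want the mean-response limits; the variable names lower_mean_mpg and upper_mean_mpg are made up for illustration.)

proc glm data = mtcars;
model mpg = weight / p clm;
output out = mtcars_model
P=predicted_mpg R=residual_mpg
LCLM=lower_mean_mpg UCLM=upper_mean_mpg; /* confidence limits for the mean response */
run;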

Stat 342 Notes. Week 10 Page 21 / 57

(SAS demo)

Stat 342 Notes. Week 10 Page 22 / 57

...such as comparing residuals to predicted values, which is very useful for detecting unequal variance. The most typical sign of unequal variance in this plot is a fan or cone shape.

proc sgplot data=mtcars_model;

scatter x=predicted_mpg

y=residual_mpg;

ellipse x=predicted_mpg

y=residual_mpg;

run;
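(A small variation, not in the notes: a horizontal reference line at zero makes a fan or cone shape easier to spot.)

proc sgplot data=mtcars_model;
scatter x=predicted_mpg y=residual_mpg;
refline 0 / axis=y lineattrs=(pattern=dash); /* residuals should scatter evenly around this line */
run;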

Stat 342 Notes. Week 10 Page 23 / 57

Stat 342 Notes. Week 10 Page 24 / 57

All of these options just tell SAS to add them to the list of things to calculate and/or include in the output. You can use all of them together.

One drawback/feature is that 'alpha' will apply to ALL the output of that model.

proc glm data = mtcars;

model mpg = weight / alpha = 0.025 solution clparm p clm;

run;

Stat 342 Notes. Week 10 Page 25 / 57


Stat 342 Notes. Week 10 Page 26 / 57

Let's try a more sophisticated model of fuel economy. Instead of just looking at the weight of a car, let's also look at its horsepower and displacement.

proc glm data = mtcars;

model mpg = weight hp displacement / solution;

run;

Stat 342 Notes. Week 10 Page 27 / 57

What if there's an interaction between weight and horsepower? We can include an interaction term with an asterisk.

Note that I've explicitly included the main effects 'weight' and 'hp' in here as well. This is good statistical practice.

proc glm data = mtcars;

model mpg = weight hp weight*hp displacement / solution;

run;

Stat 342 Notes. Week 10 Page 28 / 57

(SAS demo, comparing these two models)

(Document camera work)

Stat 342 Notes. Week 10 Page 29 / 57

What about polynomial terms?

Option one is to make an interaction of a variable with itself.

The following code will let you see how the fuel economy of a car changes with horsepower AND with horsepower squared.

proc glm data = mtcars;

model mpg = hp hp*hp / solution;

run;

Stat 342 Notes. Week 10 Page 30 / 57

However, other mathematical functions won't work in the model statement.

proc glm data = mtcars;

model mpg = hp hp**2 / solution;

run;

...not even premade ones

proc glm data = mtcars;

model mpg = hp sqrt(hp) / solution;

run;

Stat 342 Notes. Week 10 Page 31 / 57

To regress against transformations of variables, or polynomial terms of variables, you need to create these transformations with a data step.

data mtcars;

set mtcars;

hp2 = hp**2;

hp_sqrt = sqrt(hp);

hp_log = log(hp);

run;

Stat 342 Notes. Week 10 Page 32 / 57

Then you can regress against these

proc glm data = mtcars;

model mpg = hp hp2 hp_sqrt hp_log

/ solution;

run;

Stat 342 Notes. Week 10 Page 33 / 57

What about categorical variables, like number of cylinders?

We have cars with 4, 6, or 8 cylinders, but it doesn't make sense to treat this as a continuous variable.

Predicting the fuel economy of a car with 5.5 cylinders is meaningless, because no such car exists.

This is where the CLASS statement from last week comes back into play.

Stat 342 Notes. Week 10 Page 34 / 57

We need to specify to SAS which variables are categorical. After that, we can use those variables like any other in a model.

Each category's coefficient is the amount the mean response is increased or decreased for observations in that category, all else being equal.

proc glm data = mtcars;

class cylinders;

model mpg = weight cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 35 / 57

Stat 342 Notes. Week 10 Page 36 / 57

(document camera work)

Stat 342 Notes. Week 10 Page 37 / 57

We can even include interactions between numeric and categorical variables.

This will produce a separate slope coefficient under each category.

proc glm data = mtcars;

class cylinders;

model mpg = weight cylinders weight*cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 38 / 57

Stat 342 Notes. Week 10 Page 39 / 57

Note that the LAST category is considered the baseline. This is the opposite of R, which uses the first level as the baseline by default.

Stat 342 Notes. Week 10 Page 40 / 57

(Document camera work)

Stat 342 Notes. Week 10 Page 41 / 57

As with two-way (or multi-way) ANOVA, we can include more than one categorical variable

proc glm data = mtcars;

class cylinders;

model mpg = weight hp cylinders weight*cylinders hp*cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 42 / 57

Stat 342 Notes. Week 10 Page 43 / 57

PROC GLM is very flexible, but also very generalist.

For more detailed results from regression, you can use PROC REG, which includes options like...

cross-validation (Does a model derived from part of your data fit 'new' observations from the rest of your data?)

Stat 342 Notes. Week 10 Page 44 / 57

model selection (is the model you're using now the best one? How do I efficiently compare many different models?)

Diagnostics (are some of my observations overly influential?)

PROC REG is less general, however.

It's designed for simple and multiple regression where all the explanatory variables are continuous.

Stat 342 Notes. Week 10 Page 45 / 57

To incorporate categorical data, we would need to manually create dummy variables from categories using a data step.
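(For example, a minimal sketch of hand-made dummy variables for 'cylinders'; the names mtcars_dummy, cyl6, and cyl8 are made up for illustration.)

data mtcars_dummy;
set mtcars;
cyl6 = (cylinders = 6); /* a comparison evaluates to 1 (true) or 0 (false) */
cyl8 = (cylinders = 8); /* 4-cylinder cars are the baseline: both dummies equal 0 */
run;

proc reg data = mtcars_dummy;
model mpg = weight cyl6 cyl8;
run;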

Stat 342 Notes. Week 10 Page 46 / 57

However... PROC GLMSELECT has the advantages of both.

Stat 342 Notes. Week 10 Page 47 / 57

Model selection methods aim to find models that do two things...

1. Fit the data well. That is, models with small residuals, high r-squared, and low root-mean-square-error (RMSE).

2. Describe the data simply / parsimoniously. This means having few terms in the model, estimating few parameters, and using few degrees of freedom.

Stat 342 Notes. Week 10 Page 48 / 57

Stat 342 Notes. Week 10 Page 49 / 57

We can take our previous model of weight, horsepower, cylinders, and two interaction terms and apply a model selection method called 'stepwise' to determine if this model is the best.

proc glmselect data = mtcars;

class cylinders;

model mpg = weight hp cylinders weight*cylinders hp*cylinders

/ selection=stepwise(select = AIC);

run;

Stat 342 Notes. Week 10 Page 50 / 57

Stat 342 Notes. Week 10 Page 51 / 57

Stat 342 Notes. Week 10 Page 52 / 57

'Stepwise' is just one method of model selection. It's popular and it's a nice combination of 'forward selection' and 'backward elimination', but it's somewhat outdated.

A much more popular method these days is LASSO, which is also available in SAS with...

selection=lasso

(In R, LASSO requires an add-on package such as 'glmnet'.) The LASSO method can handle HUNDREDS of different variables at once, even if there are more variables than observations!
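(For reference, the standard definition of LASSO, not from these notes: it minimizes the least-squares criterion plus an L1 penalty on the coefficients, which shrinks some coefficients exactly to zero.)

$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\left\{ \sum_i \big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_j |\beta_j| \right\}$

The penalty weight $\lambda$ controls how many terms survive.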

Stat 342 Notes. Week 10 Page 53 / 57

Likewise, the Akaike Information Criterion (AIC) is just one selection criterion for models.

Other options include (standard formulas are sketched after this list):

BIC (a stronger preference for simpler models when the sample size is large)

AICc (AIC with a small-sample correction)

ADJRSQ (adjusted R-squared: your typical coefficient of determination with a penalty per term)
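(For reference, the standard textbook forms of these criteria; PROC GLMSELECT's exact computational formulas may differ by constants.)

$\mathrm{AIC} = 2k - 2\ln\hat{L}$,  $\mathrm{AICc} = \mathrm{AIC} + \dfrac{2k(k+1)}{n-k-1}$,  $\mathrm{BIC} = k\ln n - 2\ln\hat{L}$,  $\mathrm{Adj}\,R^2 = 1 - (1-R^2)\dfrac{n-1}{n-p-1}$

where $k$ is the number of estimated parameters, $p$ the number of predictors, $n$ the sample size, and $\hat{L}$ the maximized likelihood. Lower is better for AIC, AICc, and BIC; higher is better for adjusted R-squared.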

Stat 342 Notes. Week 10 Page 54 / 57

plots = criterionpanel

will show you if the other criteria agree.

Stat 342 Notes. Week 10 Page 55 / 57

Try something with LOTS of variables.

proc glmselect data = mtcars plots=criterionpanel;

class cylinders gear;

model mpg = weight hp cylinders weight*cylinders hp*cylinders hp*gear displacement*gear weight*weight

/ selection=lasso;

run;

Stat 342 Notes. Week 10 Page 56 / 57

Additional Proc GLMSELECT slides from

http://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/Winnipeg-User-Group/SylvainTremblay-PROCGLMSELECT-Spring2012.pdf

GLMSELECT for Model Selection

Winnipeg SAS User Group Meeting

May 11, 2012

Sylvain Tremblay

SAS Canada – Education

Stat 342 Notes. Week 10 Page 57 / 57