
Stat 342 - Wk 10: Regression

Loading data with datalines

Regression - with interactions

(Proc glm) - with polynomial terms

- with categorical variables

(Proc glmselect) - with model selection

(this is mostly chapter 6 material)

Stat 342 Notes. Week 10 Page 1 / 57

In Week 8, we saw correlations, which are the first step to regression.

In Week 9, we saw ANOVA, but treated it like a regression on categorical variables.

This week we look at a suite of examples surrounding regression and PROC GLM.

As time permits, we will also look at t-tests and power analysis.

Stat 342 Notes. Week 10 Page 2 / 57

First, let's load up the 'mtcars' dataset.

Rather than relying on a .csv file, let's try loading it in through a data step and the DATALINES command.

The advantages of loading text this way (a full sketch of the data step follows on the next page) are...

1) It can be done without knowing in advance the folder structure of your system.

2) Complete control over how variables are interpreted.

Stat 342 Notes. Week 10 Page 3 / 57
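(For reference, a minimal sketch of the whole DATA step described over the next few pages. The variable names cylinders, displacement, hp and the column order are assumptions based on the data lines shown on page 8, and only three rows and a few columns are included.)

DATA mtcars;
LENGTH Make $ 10 Model $ 22; /* reserve enough characters for the two strings */
INFILE DATALINES TRUNCOVER; /* read from the data lines below; short lines leave the rest missing */
INPUT Make $ Model $ mpg cylinders displacement hp; /* the real step lists more variables (weight, gear, ...) */
DATALINES;
Mazda RX4 21.0 6 160.0 110
Mazda RX4_Wag 21.0 6 160.0 110
Volvo 142E 21.4 4 121.0 109
;
RUN;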

Stat 342 Notes. Week 10 Page 4 / 57

LENGTH Make $ 10. Model $ 22.;

Establish the variables 'Make' and 'Model' as 10 and 22 characters long, respectively. If this is not done, SAS will assume that the variables are 8 characters long and will cut off anything after that.

INFILE DATALINES TRUNCOVER;

The source isn't an external file, but a set of data lines given later in this same data step.

Stat 342 Notes. Week 10 Page 5 / 57

INFILE DATALINES TRUNCOVER;

'TRUNCOVER' tells SAS what to do when a data line runs out of values: each line of the datalines becomes one observation, and any variables that line didn't fill are left missing (or truncated) rather than being read from the next line.

Other options include 'MISSOVER' (similar, but a partially read value is set to missing rather than truncated) and the default 'FLOWOVER', which keeps filling variables by moving on to the next line.

Stat 342 Notes. Week 10 Page 6 / 57

INPUT Make $ Model $ mpg...

Take the following datalines and put them in the variables 'make' (character string), 'model' (character string), 'mpg' (numeric), ... and so on.

Every space starts a new variable. You could also tell SAS to allow a single space inside 'model' (so that two words go into it) with the & modifier, such as

INPUT Model & $

The value then ends at two consecutive blanks in the data.

Stat 342 Notes. Week 10 Page 7 / 57

DATALINES

Mazda RX4 21.0 6 160.0 110 ...

Mazda RX4_Wag 21.0 6 160.0 110 ...

...

Volvo 142E 21.4 4 121.0 109

;

The actual data to be entered. Only one semicolon is used, at the very end of the data. (If you need a semicolon IN the data somewhere, use DATALINES4 instead, which ends the data block with four semicolons.)

Stat 342 Notes. Week 10 Page 8 / 57

Finally, I wanted the company ('make') to show up in the 'model' column as well, so I concatenated make and model together (and put the result into model).

Stat 342 Notes. Week 10 Page 9 / 57

To concatenate two strings means to take one and put it on the end of the other.

Three or more strings can also be concatenated.

DATA mtcars;

SET mtcars;

model = cat(make,model);

run;
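(A side note, not from the slides: the CAT function joins the strings exactly as stored, so trailing blanks in 'make' can end up in the middle of the result, and there is no separator. If you want each piece trimmed and joined with a single space, CATX is an alternative.)

DATA mtcars;
SET mtcars;
model = catx(' ', make, model); /* CATX trims each argument and joins them with the given delimiter */
run;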

Stat 342 Notes. Week 10 Page 10 / 57

The result:

Stat 342 Notes. Week 10 Page 11 / 57


Stat 342 Notes. Week 10 Page 12 / 57

Now let's dig into the actual regression, starting with a simple one: fuel economy vs weight.

proc glm data = mtcars;

model mpg = weight / solution;

run;

The SOLUTION option for the model tells SAS to print the estimates of the intercept and slope coefficients.

Without it, we get much simpler model summaries.

Stat 342 Notes. Week 10 Page 13 / 57

Stat 342 Notes. Week 10 Page 14 / 57

For simple regression, we also get a scatterplot with a line of best fit (i.e. the least-squares line, the regression line) with two bands around it:

The inner band (shaded) shows the confidence limits of the MEAN, also called the confidence interval. This is where the true line could plausibly be, given the uncertainty in the estimated coefficients (95% of the time).

The outer band (dotted lines) shows the confidence limits of INDIVIDUAL PREDICTIONS. This is where new data points could fall if we predicted them from this model (95% again).
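(For reference, the textbook formulas behind these two bands in simple regression; these are standard results, not from the slides. The extra 1 under the square root is what makes the prediction band wider.)

Mean response at $x_0$:  $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{\tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{S_{xx}}}$

New observation at $x_0$:  $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{S_{xx}}}$

where $s$ is the root MSE and $S_{xx} = \sum_i (x_i - \bar{x})^2$.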

Stat 342 Notes. Week 10 Page 15 / 57

Stat 342 Notes. Week 10 Page 16 / 57

The 'clparm' option gives you the confidence limits of the parameters. You can use alpha to change the confidence level of these limits, as well as the confidence bands.

model mpg = weight / solution clparm;

Stat 342 Notes. Week 10 Page 17 / 57

You can use alpha to change the confidence level of these limits, as well as the confidence bands.

model mpg = weight / solution clparm alpha=0.01;

Stat 342 Notes. Week 10 Page 18 / 57

Other options, like p and clm, add predictions and confidence limits of the mean response for each observation.

These are output into a separate table.

proc glm data = mtcars;

model mpg = weight / p clm;

run;

Stat 342 Notes. Week 10 Page 19 / 57

Stat 342 Notes. Week 10 Page 20 / 57

...and this table can be appended to the existing dataset so you can do further processing.

proc glm data = mtcars;

model mpg = weight / p clm;

output out = mtcars_model

P=predicted_mpg R=residual_mpg;

run;
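(The OUTPUT statement also accepts keywords for the interval limits themselves. A sketch assuming you want the mean-response limits; the variable names lower_mean_mpg and upper_mean_mpg are made up for illustration.)

proc glm data = mtcars;
model mpg = weight / p clm;
output out = mtcars_model
P=predicted_mpg R=residual_mpg
LCLM=lower_mean_mpg UCLM=upper_mean_mpg; /* confidence limits for the mean response */
run;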

Stat 342 Notes. Week 10 Page 21 / 57

(SAS demo)

Stat 342 Notes. Week 10 Page 22 / 57

...such as comparing residuals to predicted values, which is very useful for detecting unequal variance. The most typical sign of unequal variance in this plot is a fan or cone shape.

proc sgplot data=mtcars_model;

scatter x=predicted_mpg

y=residual_mpg;

ellipse x=predicted_mpg

y=residual_mpg;

run;
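(A small variation, not in the notes: a horizontal reference line at zero makes a fan or cone shape easier to spot.)

proc sgplot data=mtcars_model;
scatter x=predicted_mpg y=residual_mpg;
refline 0 / axis=y lineattrs=(pattern=dash); /* residuals should scatter evenly around this line */
run;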

Stat 342 Notes. Week 10 Page 23 / 57

Stat 342 Notes. Week 10 Page 24 / 57

All of these options just tell SAS to add them to the list of things to calculate and/or include in the output. You can use all of them together.

One drawback/feature is that 'alpha' will apply to ALL the output of that model.

proc glm data = mtcars;

model mpg = weight / alpha = 0.025 solution clparm p clm;

run;

Stat 342 Notes. Week 10 Page 25 / 57


Stat 342 Notes. Week 10 Page 26 / 57

Let's try a more sophisticated model of fuel economy. Instead of just looking at the weight of a car, let's also look at its horsepower and displacement.

proc glm data = mtcars;

model mpg = weight hp displacement / solution;

run;

Stat 342 Notes. Week 10 Page 27 / 57

What if there's an interaction between weight and horsepower? We can include an interaction term with an asterisk.

Note that I've explicitly included the main effects 'weight' and 'hp' in here as well. This is good statistical practice.

proc glm data = mtcars;

model mpg = weight hp weight*hp displacement / solution;

run;

Stat 342 Notes. Week 10 Page 28 / 57

(SAS demo, comparing these two models)

(Document camera work)

Stat 342 Notes. Week 10 Page 29 / 57

What about polynomial terms?

Option one is to make an interaction of a variable with itself.

The following code will let you see how the fuel economy of a car changes with horsepower AND with horsepower squared.

proc glm data = mtcars;

model mpg = hp hp*hp / solution;

run;

Stat 342 Notes. Week 10 Page 30 / 57

However, other mathematical functions won't work in the model statement.

proc glm data = mtcars;

model mpg = hp hp**2 / solution;

run;

...not even premade ones

proc glm data = mtcars;

model mpg = hp sqrt(hp) / solution;

run;

Stat 342 Notes. Week 10 Page 31 / 57

To regress against transformations of variables, or polynomial terms of variables, you need to create these transformations with a data step.

data mtcars;

set mtcars;

hp2 = hp**2;

hp_sqrt = sqrt(hp);

hp_log = log(hp);

run;

Stat 342 Notes. Week 10 Page 32 / 57

Then you can regress against these

proc glm data = mtcars;

model mpg = hp hp2 hp_sqrt hp_log

/ solution;

run;

Stat 342 Notes. Week 10 Page 33 / 57

What about categorical variables, like number of cylinders?

We have cars with 4, 6, or 8 cylinders, but it doesn't make sense to treat this as a continuous variable.

Predicting the fuel economy of a car with 5.5 cylinders is meaningless, because no such car exists.

This is where the CLASS statement from last week comes back into play.

Stat 342 Notes. Week 10 Page 34 / 57

We need to specify to SAS which variables are categorical. After that, we can use those variables like any other in a model.

Each category's coefficient is the amount the mean response is increased or decreased for observations in that category, all else being equal.

proc glm data = mtcars;

class cylinders;

model mpg = weight cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 35 / 57

Stat 342 Notes. Week 10 Page 36 / 57

(document camera work)

Stat 342 Notes. Week 10 Page 37 / 57

We can even include interactions between numeric and categorical variables.

This will produce a separate slope coefficient under each category.

proc glm data = mtcars;

class cylinders;

model mpg = weight cylinders weight*cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 38 / 57

Stat 342 Notes. Week 10 Page 39 / 57

Note that the LAST category is considered the baseline. This is the opposite of R, which uses the first level as the baseline by default.

Stat 342 Notes. Week 10 Page 40 / 57

(Document camera work)

Stat 342 Notes. Week 10 Page 41 / 57

As with two-way (or multi-way) ANOVA, we can include more than one categorical variable

proc glm data = mtcars;

class cylinders;

model mpg = weight hp cylinders weight*cylinders hp*cylinders / solution;

run;

Stat 342 Notes. Week 10 Page 42 / 57

Stat 342 Notes. Week 10 Page 43 / 57

PROC GLM is very flexible, but also very generalist.

For more detailed results from regression, you can use PROC REG, which includes options like...

cross-validation (Does a model derived from part of your data fit 'new' observations from the rest of your data?)

Stat 342 Notes. Week 10 Page 44 / 57

model selection (is the model you're using now the best one? How do I efficiently compare many different models?)

Diagnostics (are some of my observations overly influential?)

PROC REG is less general, however.

It's designed for simple and multiple regression where all the explanatory variables are continuous.

Stat 342 Notes. Week 10 Page 45 / 57

To incorporate categorical data, we would need to manually create dummy variables from categories using a data step.
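(For example, a minimal sketch of hand-made dummy variables for 'cylinders'; the names mtcars_dummy, cyl6, and cyl8 are made up for illustration.)

data mtcars_dummy;
set mtcars;
cyl6 = (cylinders = 6); /* a comparison evaluates to 1 (true) or 0 (false) */
cyl8 = (cylinders = 8); /* 4-cylinder cars are the baseline: both dummies equal 0 */
run;

proc reg data = mtcars_dummy;
model mpg = weight cyl6 cyl8;
run;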

Stat 342 Notes. Week 10 Page 46 / 57

However... PROC GLMSELECT has the advantages of both.

Stat 342 Notes. Week 10 Page 47 / 57

Model selection methods aim to find models that do two things...

1. Fit the data well. That is, models with small residuals, high r-squared, and low root-mean-square-error (RMSE).

2. Describe the data simply / parsimoniously. This means having few terms in the model, estimating few parameters, and using few degrees of freedom.

Stat 342 Notes. Week 10 Page 48 / 57

Stat 342 Notes. Week 10 Page 49 / 57

We can take our previous model of weight, horsepower, cylinders, and two interaction terms and apply a model selection method called 'stepwise' to determine if this model is the best.

proc glmselect data = mtcars;

class cylinders;

model mpg = weight hp cylinders weight*cylinders hp*cylinders

/ selection=stepwise(select = AIC);

run;

Stat 342 Notes. Week 10 Page 50 / 57

Stat 342 Notes. Week 10 Page 51 / 57

Stat 342 Notes. Week 10 Page 52 / 57

'Stepwise' is just one method of model selection. It's popular and it's a nice combination of 'forward selection' and 'backward elimination', but it's somewhat outdated.

A much more popular method these days is LASSO, which is also available in SAS with...

selection=lasso

(In R, LASSO requires an add-on package such as 'glmnet'.) The LASSO method can handle HUNDREDS of different variables at once, even if there are more variables than observations!
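(For reference, the standard definition of LASSO, not from these notes: it minimizes the least-squares criterion plus an L1 penalty on the coefficients, which shrinks some coefficients exactly to zero.)

$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\left\{ \sum_i \big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_j |\beta_j| \right\}$

The penalty weight $\lambda$ controls how many terms survive.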

Stat 342 Notes. Week 10 Page 53 / 57

Likewise, the Akaike Information Criterion (AIC) is just one selection criterion for models.

Other options include (standard formulas are sketched after this list):

BIC (a stronger preference for simpler models when the sample size is large)

AICc (AIC with a small-sample correction)

ADJRSQ (adjusted R-squared: your typical coefficient of determination with a penalty per term)
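(For reference, the standard textbook forms of these criteria; PROC GLMSELECT's exact computational formulas may differ by constants.)

$\mathrm{AIC} = 2k - 2\ln\hat{L}$,  $\mathrm{AICc} = \mathrm{AIC} + \dfrac{2k(k+1)}{n-k-1}$,  $\mathrm{BIC} = k\ln n - 2\ln\hat{L}$,  $\mathrm{Adj}\,R^2 = 1 - (1-R^2)\dfrac{n-1}{n-p-1}$

where $k$ is the number of estimated parameters, $p$ the number of predictors, $n$ the sample size, and $\hat{L}$ the maximized likelihood. Lower is better for AIC, AICc, and BIC; higher is better for adjusted R-squared.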

Stat 342 Notes. Week 10 Page 54 / 57

plots = criterionpanel

will show you if the other criteria agree.

Stat 342 Notes. Week 10 Page 55 / 57

Try something with LOTS of variables.

proc glmselect data = mtcars plots=criterionpanel;

class cylinders gear;

model mpg = weight hp cylinders weight*cylinders hp*cylinders hp*gear displacement*gear weight*weight

/ selection=lasso;

run;

Stat 342 Notes. Week 10 Page 56 / 57

Additional Proc GLMSELECT slides from

http://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/Winnipeg-User-Group/SylvainTremblay-PROCGLMSELECT-Spring2012.pdf

GLMSELECT for Model Selection

Winnipeg SAS User Group Meeting

May 11, 2012

Sylvain Tremblay

SAS Canada – Education

Stat 342 Notes. Week 10 Page 57 / 57