Regression Using R

Upload: sanjay-mishra

Post on 14-Apr-2018


  • 7/27/2019 Regression Using R


    Regression Using R

    This page is one of a (small) set of additional appendices to the book Applying Regression and Correlation, by Jeremy Miles and Mark Shevlin.

    R is not the most user-friendly program in the known universe, but it is one of the most powerful. And it's free (as in beer, and also as in speech). To make it a little easier for beginners, here is a guide to doing regression in R.

    1. First, you need to get hold of R. Typing R into Google will find you the R homepage, or you can go straight to the Comprehensive R Archive Network (http://cran.r-project.org/) and download and install it from a mirror near you.


    2. When you have R installed, and you run it, you will see the R console: a window with some startup text and a > prompt, waiting for you to type commands.

    3. We need to get some data into R. We'll cover two types of data: first tab-delimited, then SPSS.

    For either kind of file, it will make life easier if you change the working directory to the same directory that your data are currently stored in. To do this, click on the File menu, and choose Change dir...:


    Then choose the directory where your data are stored.
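    If you prefer typing to clicking, you can do the same thing with the setwd() command (the path below is just an example; use wherever your data actually live), and check where R is currently looking with getwd():

    > setwd("C:/mydata")
    > getwd()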

    4. To read a tab-delimited file, with variable names at the top, we'll assume that your data look like:

    HASSLES  ANX
    38       10
    10       12
    60       21
    90       16
    88       27
    96       30
    1         9
    41        7
    86       32
    59       11
    ...

    You can find the whole file here.

    We use the command read.table, and type:

    > data1 <- read.table("hassles.dat", header = TRUE)

    (Use whichever file name you saved the data under; header = TRUE tells read.table that the first row of the file contains the variable names.)
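    To read an SPSS file instead, one option is the read.spss() function from the foreign package, which ships with R (the file name here is just an example):

    > library(foreign)
    > data2 <- read.spss("mydata.sav", to.data.frame = TRUE)

    The to.data.frame = TRUE argument asks for an ordinary data frame, rather than a list.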


    The problem with the equals sign is that it is ambiguous. If we write x = 4, does that mean "put the value 4 into the box labelled x", or does it mean "is x equal to 4"? In R, for the first, we would write x <- 4; for the second, x == 4.
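    You can see the difference at the console. Assignment stores a value and prints nothing; the double equals sign asks a question and prints TRUE or FALSE:

    > x <- 4
    > x == 4
    [1] TRUE
    > x == 5
    [1] FALSE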


    > data1[[1]]
     [1] 38 10 60 90 88 96  1 41 86 59 25  5  3 16 22 41 29 72 55 36 96 36 91 47 63
    [26] 64 98 81 99 90 95  5 71 82 97 47 30 75 35 78

    Notice the difference? The second time, we didn't get the variable name at the top. We can also draw a histogram of the variable:

    > hist(data1$HASSLES)

    Will give:
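    If you want a tidier histogram, hist() accepts main and xlab arguments for the title and x-axis label (the labels here are just examples):

    > hist(data1$HASSLES, main = "Histogram of hassles", xlab = "Hassles score")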


    And:

    > plot(data1$HASSLES, data1$ANX)


    will give a scatterplot of ANX against HASSLES.
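    As with hist(), you can label the scatterplot with the xlab and ylab arguments (labels of your own choosing):

    > plot(data1$HASSLES, data1$ANX, xlab = "Hassles", ylab = "Anxiety")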

    6. Now we've got our data in, we can go ahead and do our regression, but first we are going to do an extra step, because we are, fundamentally, lazy.

    It's a bit dull having to retype data1$ each time we want to access a variable. (It's useful, because we can have multiple datasets open, but it's still dull.) So, we are going to attach our dataset. When a dataset is attached, you only need to type the variable name, and R takes it from the dataset which is attached. We type:

    > attach(data1)


    and our data are attached.
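    (When you have finished with a dataset, or want to avoid clashes between variables with the same name in different datasets, you can reverse this with:

    > detach(data1)

    but leave it attached for now, since the commands below rely on it.)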

    7. The basic regression command in R is glm (general, or generalised, linear model). We use the command:

    glm(outcome ~ predictor1 + predictor2 + predictor3)

    For our first regression, the analysis we want to do is:

    glm(ANX ~ HASSLES)

    And R outputs:

    Call:  glm(formula = ANX ~ HASSLES)

    Coefficients:
    (Intercept)      HASSLES
         5.4226       0.2526

    Degrees of Freedom: 39 Total (i.e. Null);  38 Residual
    Null Deviance:     4627
    Residual Deviance: 2159        AIC: 279
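    As an aside, for an ordinary linear regression like this one, the lm() command gives the same estimates:

    > lm(ANX ~ HASSLES)

    We stick with glm() because, as we'll see at the end, the same command extends to other kinds of regression.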

    But this is now gone - we have the estimates, but no more. We need to have the output stored somewhere, so we can do something with it. To do this, we use the assignment operator, as before:

    > glm.linear <- glm(ANX ~ HASSLES)


    We met the summary() function earlier, where it gave a different kind of result. When applied to a glm object, it gives the full regression output:

    > summary(glm.linear)

    Call:

    glm(formula = ANX ~ HASSLES)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -13.3153   -5.0549   -0.3794    4.5765   17.5913

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.42265    2.46541   2.199    0.034 *
    HASSLES      0.25259    0.03832   6.592 8.81e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for gaussian family taken to be 56.80561)

        Null deviance: 4627.1  on 39  degrees of freedom
    Residual deviance: 2158.6  on 38  degrees of freedom
    AIC: 279.05

    Number of Fisher Scoring iterations: 2

    And that is what we would consider to be our answer from the regression.
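    Rather than hunting through the printed summary, you can also ask for particular pieces directly. Two functions worth knowing (not used elsewhere on this page) are coef(), which returns the estimates, and confint(), which returns confidence intervals for them:

    > coef(glm.linear)
    > confint(glm.linear)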

    8. However, we can do more. There are other functions that we can apply, to get different information from our regression analysis. We can ask for residuals, using the command (and this will surprise you):

    > residuals(glm.linear)

    The trouble is that the output isn't very helpful:

              1           2           3           4           5           6
    -5.02121610  4.05141414  0.42171728 -12.15610083 -0.65091296  0.32833554
              7           8           9          10          11          12
     3.32475957 -8.77899791  4.85427492 -9.32568878  1.26250508 -1.68561618
             13          14          15          16          17          18
    13.81957170 -0.46414949  0.02028689  7.22100209 -6.74787067 -2.60940996
             19          20          21          22          23          24
    -13.31531303  4.48397177 -12.67166446 -1.51602823 17.59130523 -0.29456154
             25          26          27          28          29          30
    -5.33606453  1.41134153  9.82314767  4.11724460 12.57055373 -5.15610083
             31          32          33          34          35          36
     9.58092948  9.31438382 -10.35681603  4.86465066  7.07574161 -1.29456154
             37          38          39          40
    -5.00046461 -8.36719178 -3.26343429 -2.12497359
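    One simple way to make that printout easier on the eye is to round it, using the round() function (here to two decimal places):

    > round(residuals(glm.linear), 2)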

    But we can do stuff with that output to make it more useful. Specifically, we could put it into a variable, using the assignment operator:

    > glm.linear.resids <- residuals(glm.linear)

    And then we could draw a histogram of that variable:

    > hist(glm.linear.resids)

    Which gives the following:


    9. Finally, we could find the predicted values, using the fitted.values() function, again storing the result in a variable, and then plot the predicted values against the residuals:

    > glm.linear.preds <- fitted.values(glm.linear)
    > plot(glm.linear.preds, glm.linear.resids)
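    A horizontal line at zero makes this plot easier to read, since the residuals should scatter evenly around it; abline() adds one to the existing plot:

    > abline(h = 0)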


    That was all rather hard work, wasn't it? Why bother, when we (probably) have access to SPSS, Excel, or other programs that can do regression? There are, as I see it, two reasons.

    First, R is more than a statistics package: it's a programming environment. You can access parts of the regression output, and do whatever you want to with them. Because glm.linear is an object, you can access it like a dataset. The first element in that object is referred to as glm.linear[1]; if you type this you get:


    > glm.linear[1]
    $coefficients
    (Intercept)     HASSLES
      5.4226465   0.2525939

    The coefficients are themselves another object. You can extract a part of that object by asking for its first element:

    > glm.linear[[1]][[1]]
    [1] 5.422646

    And because that's a bit of output, you can make it a variable, and do stuff with it.

    > x <- glm.linear[[1]][[1]]
    > x
    [1] 5.422646
    > x * 3
    [1] 16.26794
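    As a small example of why this is useful, we can reproduce R's first residual by hand. The predicted value for the first person (HASSLES = 38) is the intercept plus the slope times 38, and their residual is their observed ANX (10) minus that prediction (b, pred1 are just names chosen here for illustration):

    > b <- glm.linear[[1]]
    > pred1 <- b[[1]] + b[[2]] * 38
    > 10 - pred1

    which matches (to rounding) the first value that residuals(glm.linear) gave us.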

    It's a bit like the output management system in SPSS, but much easier to use. (Do you use SPSS? Have you ever used the output management system? Thought not - because it's fiddly.)

    The second reason to bother with this is that, although R has a steep learning curve to start with, it gets flatter, faster, than other programs, because all of the commands are very similar. To do a logistic regression in (say) SPSS, you need to use a different interface, with different rules. In R, to do a logistic regression, you add 'family = binomial'; to do a Poisson regression, you add 'family = poisson'. Multilevel models in R are similar: instead of the glm() command, we use the lme() command (for linear mixed effects) or nlme() for non-linear mixed effects. The commands are very similar, so once you've learned one, you've learned them all.
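    For example, if we had a binary outcome (say, a hypothetical 0/1 variable DEPRESSED in place of ANX), the logistic regression would be:

    > glm(DEPRESSED ~ HASSLES, family = binomial)

    Everything else - storing the result, summary(), residuals(), fitted.values() - works exactly as before.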