Regression Using R

Upload: sanjay-mishra

Post on 14-Apr-2018


  • 7/27/2019 Regression Using R


    Regression Using R

    This page is one of a (small) set of additional appendices to the book Applying Regression and Correlation, by Jeremy Miles and Mark Shevlin.

    R is not the most user-friendly program in the known universe, but it is one of the most powerful. And it's free (as in beer, and also as in speech). To make it a little easier for beginners, here is a guide to doing regression in R.

    1. First, you need to get hold of R. Typing R into Google will find you the R homepage, or you can go straight to the Comprehensive R Archive Network (http://cran.r-project.org/) and download and install it from a mirror near you.


    2. When you have R installed, and you run it, you will see the R console: a window with some startup text and a > prompt, waiting for you to type commands.

    3. We need to get some data into R. We'll cover two types of data: first tab-delimited, then SPSS.

    For either kind of file, it will make life easier if you change the working directory to the same directory that your data are currently stored in. To do this, click on the File menu, and choose Change dir...:


    Then choose the directory where your data are stored.
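    If you prefer typing to clicking, you can do the same thing with the setwd() command (the path below is just an example; use wherever your data actually live), and check where R is currently looking with getwd():

    > setwd("C:/mydata")
    > getwd()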

    4. To read a tab-delimited file, with variable names at the top, we'll assume that your data look like:

    HASSLES  ANX
    38       10
    10       12
    60       21
    90       16
    88       27
    96       30
    1         9
    41        7
    86       32
    59       11
    ...

    You can find the whole file here.

    We use the command read.table, and type:

    > data1 <- read.table("hassles.dat", header = TRUE)

    (Use whichever file name you saved the data under; header = TRUE tells read.table that the first row of the file contains the variable names.)
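    To read an SPSS file instead, one option is the read.spss() function from the foreign package, which ships with R (the file name here is just an example):

    > library(foreign)
    > data2 <- read.spss("mydata.sav", to.data.frame = TRUE)

    The to.data.frame = TRUE argument asks for an ordinary data frame, rather than a list.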


    The problem with the equals sign is that it is ambiguous. If we write x = 4, does that mean "put the value 4 into the box labelled x", or does it mean "is x equal to 4"? In R, for the first, we would write x <- 4; for the second, x == 4.
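    You can see the difference at the console. Assignment stores a value and prints nothing; the double equals sign asks a question and prints TRUE or FALSE:

    > x <- 4
    > x == 4
    [1] TRUE
    > x == 5
    [1] FALSE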


    > data1[[1]]
     [1] 38 10 60 90 88 96  1 41 86 59 25  5  3 16 22 41 29 72 55 36 96 36 91 47 63
    [26] 64 98 81 99 90 95  5 71 82 97 47 30 75 35 78

    Notice the difference? The second time, we didn't get the variable name at the top. We can also draw a histogram of the variable:

    > hist(data1$HASSLES)

    Will give:
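    If you want a tidier histogram, hist() accepts main and xlab arguments for the title and x-axis label (the labels here are just examples):

    > hist(data1$HASSLES, main = "Histogram of hassles", xlab = "Hassles score")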


    And:

    > plot(data1$HASSLES, data1$ANX)


    will give a scatterplot of ANX against HASSLES.
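    As with hist(), you can label the scatterplot with the xlab and ylab arguments (labels of your own choosing):

    > plot(data1$HASSLES, data1$ANX, xlab = "Hassles", ylab = "Anxiety")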

    6. Now we've got our data in, we can go ahead and do our regression, but first we are going to do an extra step, because we are, fundamentally, lazy.

    It's a bit dull having to retype data1$ each time we want to access a variable. (It's useful, because we can have multiple datasets open, but it's still dull.) So, we are going to attach our dataset. When a dataset is attached, you only need to type the variable name, and R takes it from the dataset which is attached. We type:

    > attach(data1)


    and our data are attached.
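    (When you have finished with a dataset, or want to avoid clashes between variables with the same name in different datasets, you can reverse this with:

    > detach(data1)

    but leave it attached for now, since the commands below rely on it.)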

    7. The basic regression command in R is glm (general, or generalised, linear model). We use the command:

    glm(outcome ~ predictor1 + predictor2 + predictor3)

    For our first regression, the analysis we want to do is:

    glm(ANX ~ HASSLES)

    And R outputs:

    Call:  glm(formula = ANX ~ HASSLES)

    Coefficients:
    (Intercept)      HASSLES
         5.4226       0.2526

    Degrees of Freedom: 39 Total (i.e. Null);  38 Residual
    Null Deviance:     4627
    Residual Deviance: 2159        AIC: 279
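    As an aside, for an ordinary linear regression like this one, the lm() command gives the same estimates:

    > lm(ANX ~ HASSLES)

    We stick with glm() because, as we'll see at the end, the same command extends to other kinds of regression.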

    But this is now gone - we have the estimates, but no more. We need to have the output stored somewhere, so we can do something with it. To do this, we use the assignment operator, as before:

    > glm.linear <- glm(ANX ~ HASSLES)


    We met the summary() function earlier, where it gave a different kind of result. When applied to a glm object, it gives the full regression output:

    > summary(glm.linear)

    Call:

    glm(formula = ANX ~ HASSLES)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -13.3153   -5.0549   -0.3794    4.5765   17.5913

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  5.42265    2.46541   2.199    0.034 *
    HASSLES      0.25259    0.03832   6.592 8.81e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for gaussian family taken to be 56.80561)

        Null deviance: 4627.1  on 39  degrees of freedom
    Residual deviance: 2158.6  on 38  degrees of freedom
    AIC: 279.05

    Number of Fisher Scoring iterations: 2

    And that is what we would consider to be our answer from the regression.
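    Rather than hunting through the printed summary, you can also ask for particular pieces directly. Two functions worth knowing (not used elsewhere on this page) are coef(), which returns the estimates, and confint(), which returns confidence intervals for them:

    > coef(glm.linear)
    > confint(glm.linear)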

    8. However, we can do more. There are other functions that we can apply, to get different information from our regression analysis. We can ask for residuals, using the command (and this will surprise you):

    > residuals(glm.linear)

    The trouble is that the output isn't very helpful:

              1           2           3           4           5           6
    -5.02121610  4.05141414  0.42171728 -12.15610083 -0.65091296  0.32833554
              7           8           9          10          11          12
     3.32475957 -8.77899791  4.85427492 -9.32568878  1.26250508 -1.68561618
             13          14          15          16          17          18
    13.81957170 -0.46414949  0.02028689  7.22100209 -6.74787067 -2.60940996
             19          20          21          22          23          24
    -13.31531303  4.48397177 -12.67166446 -1.51602823 17.59130523 -0.29456154
             25          26          27          28          29          30
    -5.33606453  1.41134153  9.82314767  4.11724460 12.57055373 -5.15610083
             31          32          33          34          35          36
     9.58092948  9.31438382 -10.35681603  4.86465066  7.07574161 -1.29456154
             37          38          39          40
    -5.00046461 -8.36719178 -3.26343429 -2.12497359
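    One simple way to make that printout easier on the eye is to round it, using the round() function (here to two decimal places):

    > round(residuals(glm.linear), 2)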

    But we can do stuff with that output to make it more useful. Specifically, we could put it into a variable, using the assignment operator:

    > glm.linear.resids <- residuals(glm.linear)

    And then we could draw a histogram of that variable:

    > hist(glm.linear.resids)

    Which gives the following:


    9. Finally, we could find the predicted values, using the fitted.values() function, again storing the result in a variable, and then plot the predicted values against the residuals:

    > glm.linear.preds <- fitted.values(glm.linear)
    > plot(glm.linear.preds, glm.linear.resids)
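    A horizontal line at zero makes this plot easier to read, since the residuals should scatter evenly around it; abline() adds one to the existing plot:

    > abline(h = 0)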


    That was all rather hard work, wasn't it? Why bother, when we (probably) have access to SPSS, Excel, or other programs that can do regression? There are, as I see it, two reasons.

    First, R is more than a statistics package: it's a programming environment. You can access parts of the regression output, and do whatever you want to with them. Because glm.linear is an object, you can access it like a dataset. The first element in that object is referred to as glm.linear[1]; if you type this you get:


    > glm.linear[1]
    $coefficients
    (Intercept)     HASSLES
      5.4226465   0.2525939

    The coefficients are themselves another object. You can extract a part of that object by asking for its first element:

    > glm.linear[[1]][[1]]
    [1] 5.422646

    And because that's a bit of output, you can make it a variable, and do stuff with it.

    > x <- glm.linear[[1]][[1]]
    > x
    [1] 5.422646
    > x * 3
    [1] 16.26794
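    As a small example of why this is useful, we can reproduce R's first residual by hand. The predicted value for the first person (HASSLES = 38) is the intercept plus the slope times 38, and their residual is their observed ANX (10) minus that prediction (b, pred1 are just names chosen here for illustration):

    > b <- glm.linear[[1]]
    > pred1 <- b[[1]] + b[[2]] * 38
    > 10 - pred1

    which matches (to rounding) the first value that residuals(glm.linear) gave us.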

    It's a bit like the output management system in SPSS, but much easier to use. (Do you use SPSS? Have you ever used the output management system? Thought not - because it's fiddly.)

    The second reason to bother with this is that, although R has a steep learning curve to start with, it gets flatter, faster, than other programs, because all of the commands are very similar. To do a logistic regression in (say) SPSS, you need to use a different interface, with different rules. In R, to do a logistic regression, you add 'family = binomial'; to do a Poisson regression, you add 'family = poisson'. Multilevel models in R are similar: instead of the glm() command, we use the lme() command (for linear mixed effects) or nlme() for non-linear mixed effects. The commands are very similar, so once you've learned one, you've learned them all.
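    For example, if we had a binary outcome (say, a hypothetical 0/1 variable DEPRESSED in place of ANX), the logistic regression would be:

    > glm(DEPRESSED ~ HASSLES, family = binomial)

    Everything else - storing the result, summary(), residuals(), fitted.values() - works exactly as before.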