Analysis of quantitative outcomes (AS13)

EPM304 Advanced Statistical Methods in Epidemiology

Course: PG Diploma/MSc Epidemiology

This document contains a copy of the study material located within the computer-assisted learning (CAL) session. If you have any questions regarding this document or your course, please contact DLsupport via [email protected]. Important note: this document does not replace the CAL material found on your module CD-ROM. When studying this session, please ensure you work through the CD-ROM material first. This document can then be used for revision purposes, to refer back to specific sessions. These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale or further copying. © London School of Hygiene & Tropical Medicine September 2013 v1.1



  • Section 1: Analysis of quantitative outcomes

    Aims

    To give an introduction to analysing quantitative outcomes in regression.

    Objectives

    By the end of this session you will be able to:
    model the relationship between a quantitative outcome and explanatory variable(s) using linear regression;
    interpret the parameters of the regression model and use significance tests to assess the strength of evidence of an association;
    use regression diagnostics to check the model assumptions;
    use regression modelling to adjust for confounding of an explanatory variable by another variable.

    Section 2: Planning your study

    In SC14 and SC15 you were introduced to assessment of correlation between two quantitative variables, as well as linear regression of a quantitative outcome (note: quantitative variables are also referred to as continuous variables). The aim of this session is to recap and extend this work.

    If you need to review any materials before you continue, refer to the appropriate sessions below.

    Correlation: SC14
    Linear regression: SC15

    2.1: Planning your study

    To illustrate the concepts and methods of quantitative regression we will use a simple example and data from one study: the In-Vitro Fertilization study.

    Click on this study to see the details below.

    Interaction: Hyperlink: The In-Vitro Fertilization study:

    Output (appears in new window): In-Vitro Fertilization study

  • This study was set up to compare babies conceived following in-vitro fertilization to those from the general population. The data used here refer to the records of 641 singleton births following in-vitro fertilization (IVF).

    Section 3: Background - correlation

    We want to examine the association between birth weight and gestational age in our dataset. Firstly we use a scatter plot:

    What statistic could we calculate to determine the strength of association between these two variables?

    Interaction: thought bubble: Output (appears below): Pearson's correlation coefficient, r, is a measure between -1 and +1 that reflects the direction and strength of the linear relationship between two quantitative variables. If one increases as the other increases, r is positive; if one decreases as the other increases, r is negative. If there is no linear relationship between the two variables then r is 0. In a scatter plot of one variable against the other, the closer the points are to a straight line the closer the value of r will be to +1 or -1, i.e. the stronger the linear relationship between the variables.
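As an aside (not part of the original CAL session), r can be computed directly from this definition. A minimal sketch, using made-up data for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance term (numerator) and the two standard-deviation terms
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return sxy / (sx * sy)

# A perfectly linear increasing relationship gives r = +1
print(round(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # 1.0
```

A perfectly linear decreasing relationship would give r = -1, and unrelated scatter gives r near 0.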

    3.1: Background - correlation

    Below are some examples of some different correlation coefficients of hypothetical data. Use the drop-down menu to explore what different values of r might look like graphically.

    Interaction: Pulldown: r = 0.9:

  • [Scatter plot of hypothetical data, y against x, with r = 0.9]

    Interaction: Pulldown: r = 0.7:

    [Scatter plot of hypothetical data, y against x, with r = 0.7]

    Interaction: Pulldown: r = 0.3:

  • [Scatter plot of hypothetical data, y against x, with r = 0.3]

    Interaction: Pulldown: r = -0.5:

    [Scatter plot of hypothetical data, y against x, with r = -0.5]
    3.2: Background - correlation

    Returning to the association between birth weight and gestational age, displayed in the graph below, the Pearson correlation coefficient was 0.74.

  • What does this indicate?

    Interaction: Hotspot: Weak positive association Output (appears in new window): No: although the correlation coefficient is positive, indicating a positive relationship, we would generally consider such a large correlation coefficient to indicate a strong association. Note, however, that there are no fixed rules on what defines a strong versus weak association.

    Interaction: Hotspot: Strong negative association Output (appears in new window): No: although this is quite a large correlation coefficient, indicating a strong association, it is positive, indicating a positive association. Note, however, that there are no fixed rules on what defines a strong versus weak association.

    Interaction: Hotspot: Strong positive association Output (appears in new window): Yes: this is quite a large correlation coefficient, indicating a strong association, and it is positive, indicating a positive association. Note, however, that there are no fixed rules on what defines a strong versus weak association.

    3.3: Background - why do we need linear regression?

  • The correlation coefficient tells us the strength and direction of the linear relationship, but it does not allow us to quantify the relationship between the two variables, for example, by how much one variable changes, on average, with a unit change in the other. Nor does it allow us to quantify the relationship between two variables while adjusting for confounding by a third variable. It also assumes a linear relationship between the two variables, which may not be the case. For such situations we need linear regression.

    3.4: Background - linear regression

    Simple linear regression between two continuous variables uses a method known as least squares to derive an equation y = a + bx, with a and b chosen to ensure the best fit of the line to the data. Note: y are the values of the dependent or outcome variable, plotted on the y-axis (the vertical axis), and x are the values of the independent or exposure variable, plotted on the x-axis (the horizontal axis).

    For example:

    Note that a is the predicted value of y when x is zero, and b is the gradient (slope) of the line, i.e. a one unit change in x is predicted to lead to a change in y of b units. In our example, a would be the predicted birth weight for a gestational age of zero and b would be the increase in birth weight in grams for a one week increase in gestational age.

    3.5: Background - linear regression

    The least squares method works by minimising the distances between each point and the line. We will now illustrate this using a subset of the data of only seven points:

  • 010

    0020

    0030

    0040

    00B

    irthw

    eigh

    t (gr

    ams)

    25 30 35 40gestational age in weeks

    The vertical distance from each point to the line is called a residual. Press swap to see these illustrated.

    Interaction: Button:

    Swap: Output (changes to figure below):

  • 010

    0020

    0030

    0040

    00B

    irthw

    eigh

    t (gr

    ams)

    25 30 35 40gestational age in weeks

    3.6: Background - least squares

    The least squares estimates of a and b are derived by minimising the sum of the squares of the residuals. By squaring the residuals we penalise larger residuals: for example, two residuals of 0 and 10 would have a sum of squares of 0 + 100 = 100, whereas two residuals of 5 and 5 would have a sum of squares of 25 + 25 = 50, so we would prefer the two residuals of 5 and 5.

    Calculate the sum of squares of these residuals to one decimal place:

  • 010

    0020

    0030

    0040

    00B

    irthw

    eigh

    t (gr

    ams)

    25 30 35 40gestational age in weeks

    Interaction: Calculation: (calc) Output (appears in new window):

    Incorrect answer: No. The sum of squares of the residuals is 623.3² + 305.8² + 26.6² + 555.2² + 800.3² + 133.6² + 666.0² = 1892856.2

    Correct answer: Correct. Yes, the sum of squares of the residuals is 623.3² + 305.8² + 26.6² + 555.2² + 800.3² + 133.6² + 666.0² = 1892856.2

    3.7: Background - least squares
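The calculation above can be checked with a couple of lines (not part of the original CAL session):

```python
# Residuals read off the seven-point figure in this section
residuals = [623.3, -305.8, 26.6, 555.2, -800.3, -133.6, -666.0]
ss = sum(r ** 2 for r in residuals)
print(round(ss, 1))  # 1892856.2
```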

    The estimates of a and b that minimise the sum of squares of the residuals are given by:

    b = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²

    a = ȳ - b x̄

    Note: these equations are given for information and you are not expected to memorise them.
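These formulas translate directly into code. A minimal sketch (not part of the original session), checked on made-up data where the answer is known exactly:

```python
def least_squares(x, y):
    """Least squares estimates:
    b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2), a = ybar - b * xbar.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

# The points lie exactly on y = 3 + 2x, so a = 3 and b = 2
a, b = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(a, b)  # 3.0 2.0
```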


  • 3.8: Background - likelihood theory

    An alternative way to estimate a and b is by likelihood theory. Firstly we rewrite the equation to explicitly include the residuals eᵢ as follows:

    yᵢ = a + bxᵢ + eᵢ

    Then we assume these residuals eᵢ are Normally distributed with mean zero and variance σ², i.e. N(0, σ²). So we can now re-write the log likelihood for the eᵢ using:

    eᵢ = yᵢ - a - bxᵢ

    If we now maximise the log likelihood, which is written in terms of the data yᵢ and xᵢ and the parameters a and b, we obtain the maximum likelihood estimates of a and b. These turn out to be exactly the same as the least squares estimates, i.e.

    b = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²

    a = ȳ - b x̄

    3.9: Background - interpreting output

    Returning to our original example, we fit a linear regression line to the data:

  • 010

    0020

    0030

    0040

    0050

    00B

    irthw

    eigh

    t (gr

    ams)

    25 30 35 40 45gestational age in weeks

    In this example, a will be the predicted value of the birth weight for a baby with a gestational age of zero weeks. This does not make much sense, so we first centre the gestational age around a central point, in this case the mean gestational age in our sample, which was 38.7 weeks, i.e. we subtract 38.7 from each gestational age in our dataset. This is called mean-centering. Now a represents the predicted birth weight in grams at the average gestational age in our sample of 38.7 weeks.

    3.10: Background - interpreting output

    If we fit a linear regression line to the data (with bweight representing the birth weight in grams and mgest representing the mean-centred gestational age in weeks) we get the following output.

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     206.6412    7.484572   27.61   0.000    191.9439    221.3386
        _cons     3129.137    17.42493  179.58   0.000     3094.92    3163.354

    How would you interpret the constant coefficient of 3129.137?

    Interaction: thought bubble: Output (appears below):

    This is the estimate of a, so it is the predicted birth weight when the mean-centred gestational age is zero, i.e. the predicted birth weight is 3129.137 g when the gestational age is 38.7 weeks.

  • Note: the mean birth weight in our dataset is 3129.137, i.e. the same as a; the regression line will always go through the point (x̄, ȳ). How would you interpret the coefficient of mgest (the mean-centred gestational age)?

    Interaction: thought bubble:

    Output (appears below):

    For every increase in gestational age of one week, the predicted birth weight increases by 206.6 grams.

    3.11: Background - interpreting output

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     206.6412    7.484572   27.61   0.000    191.9439    221.3386
        _cons     3129.137    17.42493  179.58   0.000     3094.92    3163.354
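The Wald statistic in this output can be reproduced from the other columns (a quick check, not part of the original session):

```python
# Coefficient and standard error for mgest from the output above
coef, se = 206.6412, 7.484572

# The Wald test statistic is the ratio of the estimate to its standard error
t = coef / se
print(round(t, 2))  # 27.61
```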

    The estimate of the slope, b, is the expected change in the outcome (i.e. birth weight) for a unit increase in the exposure variable (i.e. gestational age). Here it is estimated to be 206.6 grams. The output gives the 95% confidence interval for this parameter as 191.9 g to 221.3 g. If there were no association between gestational age and birth weight, the true value of the parameter would be zero and the points on the scatter plot would be randomly scattered about the mean value of birth weight. However, based on this analysis, the lower limit of the 95% confidence interval is substantially above zero, indicating there is strong evidence of an association. We can confirm this by looking at the Wald test, which compares the ratio of the parameter estimate to its standard error with a t distribution. The larger the value of b compared to its standard error, the larger the test statistic and the smaller the P-value (stronger evidence of an association). The test statistic t is given as 27.6 under the column labelled t in the output. The null hypothesis of the test is that b is zero, or in other words that there is no association between birth weight and gestational age. The P-value is reported as 0.000 (i.e. P < 0.001), confirming that there is very strong evidence of an association between birth weight and gestational age, when the relationship is modelled as linear.

    3.12: Background - regression equation

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     206.6412    7.484572   27.61   0.000    191.9439    221.3386
        _cons     3129.137    17.42493  179.58   0.000     3094.92    3163.354

  • We can see from the output that the best prediction of birth weight will be given by the equation:

    Birth weight = 3129.1 + 206.6*mgest = 3129.1 + 206.6*(gestational age - 38.7)

    What is the best prediction of the birth weight for a gestational age of 30 weeks (to the nearest gram)?

    Interaction: Calculation: (calc) Output (appears in new window):

    Incorrect answer: No. The best prediction is: 3129.1 + 206.6*(gestational age - 38.7) = 3129.1 + 206.6*(30 - 38.7) = 3129.1 - 206.6*8.7 = 1331.68 = 1332 to the nearest gram

    Correct answer: Correct. Yes. The best prediction is: 3129.1 + 206.6*(gestational age - 38.7) = 3129.1 + 206.6*(30 - 38.7) = 3129.1 - 206.6*8.7 = 1331.68 = 1332 to the nearest gram

    Section 4: Checking model assumptions

    The assumptions of a linear regression model are:
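The prediction rule above can be written as a short function, using the rounded coefficients from the output (a sketch, not part of the original session):

```python
def predict_bweight(gest_age_weeks):
    """Predicted birth weight (g) from the fitted line, rounded coefficients."""
    mgest = gest_age_weeks - 38.7   # mean-centred gestational age
    return 3129.1 + 206.6 * mgest

print(round(predict_bweight(30)))  # 1332
```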

    1. The residuals come from a Normal distribution and are independent from each other (i.e. no correlation in the residuals between observations).

    2. The variance of the residuals is constant across y and x.

    3. The correct relationship between y and x has been modelled.

    We can check whether these assumptions seem reasonable using plots of the residuals. This is most easily done using standardised residuals (i.e. with mean = 0 and standard deviation = 1). These are obtained by dividing each residual by the standard deviation of all the residuals.

    4.1: Normality assumption

    The Normality assumption can be checked by producing a histogram of the residuals. Note: because these are standardised residuals they should come from a standard Normal distribution, e.g. we expect about 95% of values to lie between -2 and +2.
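The simple standardisation described above can be sketched as follows (not part of the original session; note that statistical packages may use a slightly different, leverage-adjusted definition of standardised residuals):

```python
import math

def standardise(residuals):
    """Divide each residual by the standard deviation of all the residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [r / sd for r in residuals]

# Applied to the seven residuals from the earlier figure; by construction
# the standardised residuals have standard deviation 1
std = standardise([623.3, -305.8, 26.6, 555.2, -800.3, -133.6, -666.0])
```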

  • [Histogram of the standardised residuals (-4 to 4), with density on the vertical axis]

    The histogram looks symmetrical and reasonably bell-shaped, therefore the assumption that the residuals come from a Normal distribution is reasonable. The larger the sample size, the less the shape of the distribution of the residuals will affect the model estimation.

    4.2: Constant variance

    The second assumption, of constant variance of the residuals across y and x, can be checked from a scatter plot of the residuals versus the predicted values ŷ (also called fitted values), i.e. for each point on the graph below, the fitted value is plotted against the residual.

  • [Scatter plot of birth weight (grams) against gestational age in weeks (25-45), with the fitted line]

    The residuals should be randomly scattered about zero with constant variance over the predicted values.

    [Scatter plot of standardised residuals (-4 to 4) against fitted values (0 to 4000): each point's predicted (fitted) value plotted against its residual]

  • There is no obvious relationship between the residuals and ŷ, and no evidence of a change in the variance of the residuals with ŷ.

    4.3: Linear relationship

    We started our analysis with a scatter plot of the data. The importance of this is highlighted by scatter plots of some hypothetical data below. Use the drop-down menu to examine the different plots. Note: the correlation coefficient is the same in each example.

    Interaction: Pulldown: linear relationship:

    This scatter plot shows a roughly linear relationship between y and x and so linear regression is appropriate.

    Interaction: Pulldown: remote point:

    There is a remote point, an observation which is far from the range of the other data.

    Interaction: Pulldown: outlier:

  • There is an outlier, a point which is not well fitted by the model. Remote points and outliers can change the regression parameters substantially, and in this case they are known as influential points. To identify influential points, the model can be re-fitted with one observation left out each time, or they can be spotted by eye in a scatter plot and the regression parameters then estimated with and without the observation in the model. The first step once an influential point has been identified is to check whether there has been a data-entry error, if this check is possible. If there is no data-entry error, the observation should not be removed from the data unless there is very good reason to do so. One option is to report the results of the analysis both with the observation included and with it removed, in order to demonstrate how sensitive the results are to this observation.

    Interaction: Pulldown: non-linear:

    In this example, the values of y seem to initially decrease with increasing values of x, but then start to increase with even greater values of x. Hence, a linear regression will not give a good fit to such data and would be inappropriate.

    Interaction: Pulldown: two clusters:

  • There are two clusters of points, with what appears to be random scatter in each cluster. This may suggest some kind of threshold in x, below which y takes one average value and above which y takes another average value.

    Interaction: Pulldown: two lines:

    There appear to be two different lines shown here. It may well be that the value of y depends on x and on another binary variable.

    Section 5: Categorical variables

    The model we have fitted assumes that birth weight is linearly associated with gestational age within the range of gestational age in the data. Other models we have fitted in this course have used categories or grouped data to examine the relationship between an outcome variable (e.g. log odds) and an explanatory variable. We can do this with a quantitative outcome too.

    5.1: Categorical variables

    For our birth weight data we can group the data according to gestational age.

  • Categories of gestational age    Freq.    Mean of birth weights    Standard deviation of birth weights
    40 wks                            188      3,452.223                441.366

    The means of birth weight for the four groups increase sequentially from around 2.5 kg to 3.5 kg.

    5.2: Categorical variables

    We can fit a simple linear regression model of birth weight with three indicators for the highest three groups of gestational age and estimate the mean differences between each of these three groups and the first (i.e. the baseline) group:

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
    gestcat 2     740.4562    61.27115   12.08   0.000    620.1383    860.7741
    gestcat 3     919.6259      57.522   15.99   0.000    806.6702    1032.582
    gestcat 4     1006.657    55.86212   18.02   0.000    896.9604    1116.353
        _cons     2445.567    41.23697   59.31   0.000     2364.59    2526.544

    5.3: Categorical variables

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
    gestcat 2     740.4562    61.27115   12.08   0.000    620.1383    860.7741
    gestcat 3     919.6259      57.522   15.99   0.000    806.6702    1032.582
    gestcat 4     1006.657    55.86212   18.02   0.000    896.9604    1116.353
        _cons     2445.567    41.23697   59.31   0.000     2364.59    2526.544

    The intercept, estimated as 2445.567, is the predicted mean value of birth weight in the baseline category of gestational age (i.e. the first group).

  • The estimated increases can be added to the estimated mean in the baseline category to get the predicted mean value of birth weight in the four categories. The resulting equations for the predicted values are y = 2445.567; y = 2445.567 + 740.4562 = 3186.023; y = 2445.567 + 919.6259 = 3365.193; and y = 2445.567 + 1006.657 = 3452.223, respectively, for babies in the first, second, third and fourth categories. These values correspond exactly to those shown in the column headed Mean of birth weights in the previous table.
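This addition of coefficients can be verified with a few lines (not part of the original session; small discrepancies reflect rounding in the printed output):

```python
# Coefficients from the gestational-age-category model output above
cons = 2445.567
increments = [0, 740.4562, 919.6259, 1006.657]
predicted_means = [cons + inc for inc in increments]

# These reproduce the group means of birth weight, up to rounding
table_means = [2445.567, 3186.023, 3365.193, 3452.223]
for pred, tab in zip(predicted_means, table_means):
    assert abs(pred - tab) < 0.01

print([round(m, 1) for m in predicted_means])  # [2445.6, 3186.0, 3365.2, 3452.2]
```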

    Interaction: Button: Show:

    Output (appears below):

    Categories of gestational age    Freq.    Mean of birth weights    Standard deviation of birth weights
    40 wks                            188      3452.223                 441.3662

    5.4: Categorical variables

    The predicted values for this regression model can be seen on the scatter plot below.

  • [Scatter plot of birth weight (grams) against gestational age in weeks (25-45), with the predicted values from the categorical model]

    You can click swap to add the original line to the same graph.

    Interaction: Button: Swap:

    Output (figure changes to this and text appears below):

  • [The same scatter plot with both the categorical-model predictions and the original fitted regression line]

    We can see that the categorical variable gives a reasonable fit to the data in the middle values of gestational age, but does particularly poorly at younger gestational ages.

    Section 6: Quadratic relationships

    The first model we fitted assumed a linear relationship between birth weight and gestational age, while the model using gestational age categories is more flexible about the shape of the relationship between the two variables, though clearly it still makes some strong assumptions about what that relationship is, e.g. it assumes that all babies born before 38 weeks have the same mean birth weight. Another option is to model a quadratic relationship between birth weight and gestational age, which would allow some departure from a linear relationship.

    6.1: Quadratic relationships

    Consider the (hypothetical) example below, in which it is possible that there is some curvature.

  • [Scatter plot of hypothetical data, y against x, with a fitted straight line]

    A straight line ignores any curvature there may be between y and x.

    [The same hypothetical data with fitted category means]

  • Categorising x allows a non-linear relationship between y and x.

    [The same hypothetical data with a fitted quadratic curve]

    Fitting a quadratic allows some departure from a linear relationship.

    6.2: Quadratic relationships

    To fit a quadratic relationship the model becomes y = a + bx + cx², where a is still the estimated birth weight of a baby when x = 0 (i.e. with mean gestational age) but now the slope of the line is described by two coefficients, b and c. In this case, if b and c were both positive this would mean that the baby's growth accelerates over the period of gestation that we are examining. If b is positive and c is negative then the baby's growth rate would be slowing down over the period of gestation that we are examining. Note that interpretation of the two parameter estimates, b and c, is less straightforward than interpretation of the parameter estimates when x was categorical.

    6.3: Quadratic relationships

    We can fit this model by first calculating a new variable that is the square of the mean-centred gestational age, e.g. mgest2 = mgest^2, and then fitting a linear regression model on both the mean-centred gestational age and its square:

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     193.2323    10.73898   17.99   0.000    172.1443    214.3203
       mgest2    -2.796836    1.608682   -1.74   0.083   -5.955789    .3621158
        _cons     3144.296    19.46009  161.58   0.000    3106.083     3182.51

    So the regression equation is now:

    Birth weight = 3144.3 + 193.2*mgest - 2.8*mgest2 = 3144.3 + 193.2*(age - 38.7) - 2.8*(age - 38.7)²
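The quadratic prediction rule can be sketched in the same way as the linear one (not part of the original session; rounded coefficients from the output):

```python
def predict_quadratic(gest_age_weeks):
    """Predicted birth weight (g) from the quadratic fit, rounded coefficients."""
    m = gest_age_weeks - 38.7   # mean-centred gestational age
    return 3144.3 + 193.2 * m - 2.8 * m ** 2

# At the mean gestational age both quadratic terms vanish,
# leaving the intercept
print(round(predict_quadratic(38.7), 1))  # 3144.3
```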

  • 6.4: Quadratic relationships

    The graph below shows both the fitted linear and quadratic relationships. We can see that between 32 and 42 weeks' gestation the lines are virtually identical, and it is only for babies born before 32 weeks that there appears to be any curvature. There are very few babies in the dataset born before about 32 weeks, so it is unsurprising that there is little statistical evidence of departure from a linear relationship (P = 0.08 for mgest2 in the output).

    [Scatter plot of birth weight (grams) against gestational age in weeks (25-45), with both the linear and quadratic fitted lines]

    Section 7: Multivariable regression

    Just as we have done with Poisson and logistic regression models, we can include several covariates in a linear regression model. For example, we can fit a model of birth weight on gestational age and gender (0 = male, 1 = female):

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     206.4446    7.363321   28.04   0.000    191.9853    220.9039
       gender    -161.7075    34.29034   -4.72   0.000   -229.0431   -94.37194
        _cons     3208.604    24.03779  133.48   0.000    3161.401    3255.806


  • 7.1: Multivariable regression

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
        mgest     206.4446    7.363321   28.04   0.000    191.9853    220.9039
       gender    -161.7075    34.29034   -4.72   0.000   -229.0431   -94.37194
        _cons     3208.604    24.03779  133.48   0.000    3161.401    3255.806

    The value for mgest of 206.4 gives us the estimated increase in birth weight for each additional week of gestation, after adjusting for gender. The value -161.7 shows that females (coded 1 in the data) are estimated to be 161.7 g lighter than males (coded 0 in the data), after adjusting for week of gestation. Note that gender is not a confounder between gestational age and birth weight here, because adjusting for it barely changes the mgest estimate (from 206.6412 to 206.4446). However, we can see from the P-value and confidence interval that gender is independently associated with birth weight. What is the regression equation fitted here?

    Interaction: thought bubble:

    Output (appears below): Birth weight = 3208.6 + 206.4*mgest - 161.7*gender = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*gender

    where gender takes the value 0 for males and 1 for females.

    7.2: Multivariable regression

    We can plot two separate lines of predicted birth weight by gestational age for males and females on the scatter plot.

  • [Scatter plot of birth weight (grams) against gestational age in weeks (25-45), with two parallel fitted lines, one for males and one for females, 161.7 g apart]

    This makes clear that we are fitting two parallel lines to the data, with equal gradients but different intercepts.

    What are the regression equations for males and females separately? Interaction: thought bubble:

    Output (appears below):

    Males: Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*0 = 3208.6 + 206.4*(gestational age - 38.7)

    Females: Birth weight = 3208.6 + 206.4*(gestational age - 38.7) - 161.7*1 = 3208.6 - 161.7 + 206.4*(gestational age - 38.7) = 3046.9 + 206.4*(gestational age - 38.7)
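The fitted equation can be coded once with gender as an argument (a sketch, not part of the original session; rounded coefficients):

```python
def predict(gest_age_weeks, gender):
    """Predicted birth weight (g); gender: 0 = male, 1 = female."""
    return 3208.6 + 206.4 * (gest_age_weeks - 38.7) - 161.7 * gender

# The two fitted lines are parallel: at any gestational age the
# male-female difference is the gender coefficient, 161.7 g
print(round(predict(38.7, 0) - predict(38.7, 1), 1))  # 161.7
```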

    So we can see that the gradients are the same, but the intercepts are different.

    7.3: Multivariable regression

    When there is more than one covariate in the model, the goodness-of-fit checking should also include plots of the residuals against each covariate. With one covariate this is not necessary, as ŷ is determined by x.


  • 7.4: Multivariable regression

    Note that the size of the effect estimate depends on the units of the covariate. For example, here gestational age is measured in weeks, but if it had been measured in days the parameter for gestational age would be 7 times smaller (since it would be the estimated increase per day, not per week). It is useful to consider the clinical impact of the covariate on the outcome in the wider population. For example, a baby born at 32 weeks is estimated to be 1.65 kg lighter than a baby born at 40 weeks (206.44 × -8 = -1652 g).

    Section 8: Analysis of variance for goodness of fit

    When testing for the significance of an association between the quantitative outcome and a covariate that has a single parameter, we can use the P-value for the parameter in the output table, as we did for the covariate gender above. However, if we want to test the impact of several parameters simultaneously we need to use an F test from the analysis of variance table. The F test in linear regression is analogous to the likelihood ratio test in logistic or Poisson regression.

    8.1: Analysis of variance for goodness of fit

    A measure that allows us to evaluate how well a particular linear regression model fits the data is the sum of squares of the differences between the observed values of the outcome variable and the values predicted by the model, i.e. the residual sum of squares:

    Σᵢ (yᵢ - ŷᵢ)²

    This is obtained from an analysis of variance table. For example, the analysis of variance table produced when fitting the model with gestational age categories was:

      Source           SS    df           MS
       Model    102656030     3   34218676.7
    Residual    170064092   637   266976.596
       Total    272720122   640    426125.19

    Here the residual sum of squares (SS) is 170064092.

    8.2: Analysis of variance for goodness of fit

      Source           SS    df           MS
       Model    102656030     3   34218676.7
    Residual    170064092   637   266976.596
       Total    272720122   640    426125.19

    If the residual sum of squares were zero, the regression line would fit the data perfectly, that is, every observed point would lie on the fitted line. By contrast, larger values indicate worse fits, since the deviations of the observed points from the regression line are larger. Two possible factors contribute to a high residual sum of squares: either there is a lot of variation in the data, i.e. σ² is large, or the model does not explain much of the variation observed.

    8.3: Analysis of variance for goodness of fit

      Source           SS    df           MS
       Model    102656030     3   34218676.7
    Residual    170064092   637   266976.596
       Total    272720122   640    426125.19

    We can obtain an estimate of σ² as:

    σ̂² = residual sum of squares / df

    where df is the degrees of freedom. This can be calculated as 170064092/637 = 266976.596. This value is known as the residual mean square (MS, shown in the last column of the table), and its square root is called the Root MSE (root mean square error = 516.7). The larger the Model MS compared to the Residual MS, the better the model fits the data. Under the null hypothesis of no effect of gestational age, the Model MS and the Residual MS are two independent estimates of σ² and their ratio is expected to be 1. We can then test the significance of the current model by dividing these two terms:

    F = Model MS / Residual MS

    This is known as an F test. Under a null hypothesis of no effect of gestational age on birth weight, this statistic would be expected to follow the F distribution with the appropriate degrees of freedom. This is written as F(3, 637), as we have fitted three parameters for the age categories and we are then left with 641 observations minus these three parameters minus the constant, leaving 637 degrees of freedom for the residual SS. Calculate the F statistic for the null hypothesis of no relationship between birth weight and gestational age, to one decimal place.

    Interaction: Calculation: (calc)

    Output (appears in new window):

    Incorrect answer: No. F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2

    Correct answer: Correct. Yes. F = Model MS / Residual MS = 34218676.7 / 266976.596 = 128.2
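The F statistic can be reproduced from the analysis of variance table (a quick check, not part of the original session):

```python
# Sums of squares and degrees of freedom from the ANOVA table above
model_ss, model_df = 102656030, 3
resid_ss, resid_df = 170064092, 637

model_ms = model_ss / model_df    # mean square for the model
resid_ms = resid_ss / resid_df    # residual mean square (estimate of sigma^2)
f_stat = model_ms / resid_ms
print(round(f_stat, 1))  # 128.2
```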

    8.4: Analysis of variance for goodness of fit

    If there is a strong relationship between the outcome and the exposure variables used in the model, then the Model MS will be much larger than the Residual MS, and F will be (substantially) greater than 1. If there is no relationship then, on average, the Model MS will be equal to the Residual MS and F will be 1. In this example the value of F is much larger than 1, so there is evidence that birth weight changes with gestational age. The P-value for the null hypothesis that there is no relationship, based on F = 128.2, is less than 0.001.

    8.5: Analysis of variance for goodness of fit

    We can also fit the model with both the gestational age categories and gender:

      bweight        Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
    gestcat 2     736.1429    60.54885   12.16   0.000     617.243    855.0427
    gestcat 3     909.2086    56.89301   15.98   0.000    797.4877    1020.929
    gestcat 4     1012.249    55.21226   18.33   0.000    903.8286    1120.669
       gender    -164.2448    40.44732   -4.06   0.000   -243.6713   -84.81839
        _cons     2528.212    45.54496   55.51   0.000    2438.776    2617.649

  • The analysis of variance table for this model is at the top. For this model, our null hypothesis is that there is no effect of gestational age (considered in categories) or gender. A test of whether the true effects of both explanatory variables are zero is made by examining F(4, 636) = 102.59 and referring this to the F distribution. The P-value for this hypothesis is again less than 0.001.
  • Multivariable linear regression

    As for logistic, Poisson and Cox regression, multiple variables can be included in the regression equation to produce estimates that are adjusted for the other variables. The partial F test is used instead of the likelihood ratio test to examine the evidence for some variables adjusted for other variable(s).
