
Introductory Statistical Modelling using R

Professor Trevor Bailey
School of Engineering, Computing and Mathematics
University of Exeter, UK

Rio de Janeiro, 25-27 March, 2008

© 2008 [email protected]

Aims of this course

My objectives in this short course are to:

• Provide a brief overview of statistical modelling, what can (and can't) be expected from it, and what toolbox of inferential approaches it has to draw upon (including use of likelihood, Bayesian/Markov chain Monte Carlo (MCMC) methods, and resampling ideas)
• Describe in outline some examples of Linear (LM) and Generalised Linear (GLM) models, the most widely used classes of parametric statistical models (including both likelihood-based and Bayesian approaches to fitting and assessing such models)
• Describe in outline some examples of non-parametric modelling (and associated approaches to fitting and assessing such models)

At the same time I want to:

• Illustrate statistical software available to implement these various modelling methods, including a short basic introduction to the use of R (a public domain general purpose statistical language) and of WinBUGS (a special purpose public domain package for Bayesian modelling), and also to the links between R and WinBUGS (example data sets are included in the course R workspace, which is provided as a computer file)
  - R from: http://www.R-project.org/ (for the course you also need the R packages lattice, R2WinBUGS and mgcv)
  - WinBUGS from: http://www.mrc-bsu.cam.ac.uk/bugs
• Provide notes, examples and some further reading on what I discuss

Preliminaries

• Before I embark on that, first note that this is a very far-ranging agenda (the contents could comprise several courses on various aspects of the subject)
• In the short time available I can therefore do little more than touch the surface of many of the topics included
• My focus will therefore be very much on establishing an overview of the range and scope of statistical modelling (not on mathematical depth) and on illustrating software (rather than training you to use it)
• Going further than that is up to you!

• Second, note that we will only consider univariate statistical modelling. Data usually involve several variables, but in this course we will be interested in modelling only one of these (the response variable) in terms of the others (the explanatory variables or predictors)
• Only this single response variable will be considered stochastic (i.e. a random variable whose values arise from a probability distribution; I will try to call it y). Explanatory variables will always be considered deterministic (i.e. not random variables); I will try to refer to these collectively as the vector x. We model the univariate conditional probability distribution of the response given known values of the explanatory variables, i.e. p(y|x)
• We do not consider multivariate models, where several response variables are modelled simultaneously, allowing for potential correlation between them (e.g. height, weight and skin-fold thickness as a function of a baby's age)

• Third, note that we will restrict our attention to general purpose modelling techniques applicable to simple data structures (i.e. independent observations on the phenomenon of interest)
• There is a whole range of specialised statistical modelling techniques designed to handle particular, or more complex, data structures (e.g. survival data, time series, spatial data, extreme values)
• Some of the techniques we discuss can be used for simple analyses of such data, but usually modifications and extensions to the methods we discuss are required in order to address the particular problems posed (e.g. non-independence between observations, special purpose distributional assumptions etc.)

• Finally, note that my focus will be on statistical methodology, and I fully acknowledge that good statistical modelling involves much more than methodology (however sophisticated)
• It is a blend of essentially three factors:
  - an understanding of the relevant subject matter
  - access to appropriate data (good survey design included)
  - use of suitable methodology
• Don't forget that these lectures are mostly about the third part of that process

Arrangement of topics

• Introduction
  - What are statistical models? Response and explanatory variables. Fitting models. The R software environment for statistical modelling. Likelihood-based inference. Assessing models. Comparing models. What is the difference between parametric and non-parametric models?
• Illustrations of parametric statistical modelling
  - Linear regression. Generalised Linear Models (GLMs). Direct use of maximum likelihood for non-standard models. Bayesian/MCMC methods.
• Illustrations of non-parametric statistical modelling
  - Scatter plot smoothers. Additive models. Generalised Additive Models (GAMs).

What is a statistical model?

• Much of modern statistics (perhaps all of it!) is about modelling data:

  Data ⇒ Trend + Error ⇒ STATISTICAL MODEL

• The systematic (trend) and random (error) components are both important in statistical modelling, since we not only want to know how phenomena behave on average, but also how volatile their behaviour is: the data represent 'one throw of the dice' and we want to know what might happen 'when they are thrown again'
• So a statistical model for data values y = (y1, . . . , yn) actually comprises a joint probability distribution (mass or density) for the data, which we denote p(y), where the expected value or mean, µ = (µ1, . . . , µn), of this distribution is the systematic component of the model, and other characteristics of the probability distribution (e.g. type, shape, variability about the mean etc.) represent the random component
• The job of statistical modelling is to find out as much as possible about p(y)

• When the observations (y1, . . . , yn) are (or can be assumed to be) an independent random sample on the phenomenon studied (always the case in the situations we consider), the joint probability distribution for the data is just the product of the univariate probability distributions for each observation, i.e. $p(y) = \prod_{i=1}^{n} p(y_i)$, and so we only need to focus on modelling each marginal distribution, p(yi)
• In other cases (which we will not consider explicitly) observations may not be independent (e.g. in a time series), in which case the model, p(y), needs to be handled as a multivariate distribution which incorporates appropriate covariance (or correlation) structure between the observations (y1, . . . , yn). Conditional distributions, $p(y_i \mid y_{j \neq i})$, then become important

Response versus explanatory variables

• We mentioned earlier that we will only be considering univariate models, so our response, y = (y1, . . . , yn), will always correspond to observations on just a single stochastic variable. Note that yi may be metric, ordinal (e.g. low, average, high), or categorical (e.g. 'failed' or not, 'blondes' or 'brunettes')
• We also said that a vector, xi = (xi1, . . . , xip)′, of explanatory variables (or predictors) may be (and usually is) present for each response observation, yi. The xi values influence the systematic component of the distribution of the response (and perhaps the random component as well, e.g. its variance). Note that xi may be metric, ordinal (e.g. age group) or categorical (e.g. gender)
• Categorical predictors are referred to as factors and are handled in modelling by converting them into a set of associated binary (1, 0) auxiliary predictors, one for each level or category of the factor: an observation in category k of a factor has the kth associated auxiliary variable equal to 1 and the others equal to 0 (a short R illustration of this coding follows below)
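As a small illustration of how R expands a factor into binary auxiliary predictors, the sketch below builds design matrices for a made-up three-level factor (the variable name is invented purely for illustration).

    # Hypothetical three-level factor, e.g. a treatment group for each observation
    group <- factor(c("A", "B", "C", "B", "A"))

    # Default treatment coding: one binary column per non-reference level
    model.matrix(~ group)

    # Dropping the intercept gives one 0/1 column per level, as described above
    model.matrix(~ group - 1)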

Fitting models

• So in general µi = f(xi) = f(xi1, . . . , xip), and a key object in modelling is to obtain from the data a 'good' estimate, µ̂i or f̂(xi), of this systematic model component (and perhaps also 'good' estimates of any additional unknown characteristics of the random component of the model)
• We therefore need methods to fit the model to the data. Note that the predicted means for each of the data points under the fitted model, i.e. ŷi = µ̂i, are referred to as fitted values. The predicted 'errors' (differences between the observed and fitted values), i.e. ε̂i = yi − ŷi = yi − µ̂i, are referred to as residuals
• Fitting involves statistical inference: 'guessing' from data the properties of the generating probabilistic process. Once the model is fitted, statistical inference also concerns how confident we are in our estimates of various aspects of the model or predictions from it (e.g. standard errors, confidence intervals etc.)

• Methods for estimation of the model components (model fitting) and statistical inference concerning them, or model predictions derived from them, are traditionally based on likelihood theory
• Bayesian approaches are now also increasingly used for complex models
• Various resampling methods are sometimes also used when likelihood analysis is mathematically intractable, or when one wishes to avoid specific assumptions about the form of the systematic or random components of the model
• We should not think of these as competing approaches, but as complementary: use no more than is needed
• We will return later to a fuller discussion of aspects of the different approaches as and when required. Before that, however, let's talk about software.

Software for statistical modelling

• There is a very wide range of statistical software packages (some commercial, some public domain), all of which provide for some forms of statistical modelling (e.g. SPSS, SAS, Minitab, GLIM, Genstat etc.). In addition, other types of software often contain a limited range of statistical tools (even spreadsheets such as Microsoft Excel)
• Functionality in these packages varies enormously (special vs. general purpose), as does their user friendliness (menu vs. language driven) and their ability to cope with non-standard analyses ('black box' vs. user flexibility)
• Amongst those professionally involved in statistical modelling, flexible statistical languages tend to be preferred to more restrictive menu-driven environments based around 'black box' routines (however extensive the functionality of the latter may be, e.g. SPSS or SAS)

• Nowadays the mainstream language environments used for serious statistical modelling are probably MatLab, S-Plus and R
• MatLab provides few statistical functions in its base release. An add-on 'statistics toolbox' is available but incurs additional cost (although libraries of MatLab code for performing many statistical functions may be downloaded free from the web). MatLab is fast, powerful and flexible, but is not particularly statistically focussed
• S-Plus and R are much more statistically orientated. They are very similar and both derive originally from the language S created by AT&T Bell Laboratories. S-Plus is a commercial product. R is public domain and may be freely downloaded from the web

An introduction to R

• We will focus on R, rather than MatLab or S-Plus, in this course
• R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:
  - an effective data handling and storage facility
  - a suite of operators for calculations on vectors and matrices
  - a large, coherent, integrated collection of intermediate tools for data analysis
  - graphical facilities for data analysis and display
  - a well developed, simple and effective programming language (based on S) which includes logical conditions, loops, user defined functions and input and output facilities

• R is a statistical 'environment' (a fully planned and coherent system) rather than an incremental collection of very specific and inflexible tools (frequently the case with other data analysis software)
• R has developed rapidly. It is now very widely used and has been extended by a large and growing collection of add-on packages (all free). If you can't do it in R you probably can't do it
• See 'R for Beginners' and 'An Introduction to R'

Likelihood-based inference

• As we have seen, a statistical model comprises a general form (a specification of the systematic and random components) which involves a set of unknowns. Let's refer to the unknowns by the vector θ. So the model is actually a joint probability distribution (or density) for y which depends on values for θ, written p(y; θ)
• Viewed as a function of y for fixed θ, p(y; θ) is a probability distribution. But what if we view it as a function of θ for a fixed value of y (i.e. the observed data)? Well, it is not a probability distribution for θ (it does not integrate to unity), so perhaps we should denote it L(θ; y) to avoid any confusion
• That said, under the proposed model, L(θ; y) does in some relative sense represent 'how likely it is that you would get the particular observed data y, for different values of the parameters θ'. We therefore call it the likelihood for θ. So for independent observations y = (y1, . . . , yn) with yi ∼ p(yi; θ) the likelihood is $L(\theta; y) = \prod_{i=1}^{n} p(y_i; \theta)$. More generally, L(θ; y) = p(y; θ)

• One way to fit the model (i.e. to find 'good' estimates for θ) then becomes obvious: choose values, θ̂, for the unknowns θ that maximise the likelihood L(θ; y), i.e. use maximum likelihood estimates (mle)
• The method of maximum likelihood is broadly applicable and provides estimates that have many desirable statistical properties. Once the mle is derived, the general theory of maximum-likelihood estimation then provides standard errors, statistical tests, and other results useful for statistical inference
• A disadvantage is that it requires fairly strong assumptions about the form of the model (both systematic and random components); otherwise the mathematical form of the likelihood is not sufficiently well specified to proceed

Simple example of maximum-likelihood estimation

• Suppose we want to estimate the probability θ of getting a head upon flipping a particular coin. We flip the coin independently 10 times (i.e. we sample n = 10 flips), obtaining the following result: HHTHHHTTHH
• The probability of obtaining these data depends upon the unknown θ:
  $p(\mathrm{HHTHHHTTHH}; \theta) = \theta\theta(1-\theta)\theta\theta\theta(1-\theta)(1-\theta)\theta\theta = \theta^7(1-\theta)^3$
• The parameter θ has a fixed value, but this value is unknown, and so we let it vary in our imagination between 0 and 1, treating the probability of the observed data as a function of θ, i.e. the likelihood: $L(\theta; \mathrm{HHTHHHTTHH}) = \theta^7(1-\theta)^3$
• The probability distribution for the data and the likelihood function are the same, but the former is a function of the data with the value of the unknowns fixed, while the likelihood is a function of the unknowns with the data fixed

• Some values of this likelihood for different values of θ are:

  θ      L(θ; data) = θ^7 (1 − θ)^3
  0.0    0.0
  0.1    0.0000000729
  0.2    0.00000655
  0.3    0.0000750
  0.4    0.000354
  0.5    0.000977
  0.6    0.00179
  0.7    0.00222
  0.8    0.00168
  0.9    0.000478
  1.0    0.0

  [Figure: the likelihood θ^7(1 − θ)^3 plotted against θ over (0, 1), peaking at θ = 0.7]
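The table and figure above are easy to reproduce; a minimal R sketch evaluating and plotting the likelihood for the ten coin flips is:

    theta <- seq(0, 1, by = 0.01)
    lik   <- theta^7 * (1 - theta)^3     # likelihood for 7 heads in 10 flips

    plot(theta, lik, type = "l", xlab = expression(theta), ylab = "likelihood")

    # Value of theta giving the largest likelihood on this grid
    theta[which.max(lik)]                # 0.7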

• Although each value of L(θ; data) is a notional probability, the function L(θ; data) is not a probability or density function (it does not enclose an area of one). The probability of obtaining the sample of data that we have in hand, HHTHHHTTHH, is small regardless of the true value of θ. This is usually the case: any specific sample result (including the one that is realised) will have a low probability
• Nevertheless, the likelihood contains useful information about the unknown parameter θ. For example, θ cannot be zero or one, and is 'unlikely' to be close to zero or one
• Reversing this reasoning, the value of θ that is most supported by the data is the one for which the likelihood is largest. This value is the maximum-likelihood estimate (mle), denoted θ̂. Here, that value is θ̂ = 0.7, which is the sample proportion of heads, 7/10

• The more general case is also easy to work through for this simple example. Let y be the number of heads in n tosses, with θ the probability of a head; the model for the data y is a binomial distribution: $p(y; \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$
• The log likelihood (equivalent to the likelihood and usually easier to work with) is therefore
  $\log\!\left(\frac{n!}{y!(n-y)!}\right) + y\log(\theta) + (n-y)\log(1-\theta)$
  and differentiating with respect to θ and setting this derivative equal to zero (for a maximum) gives:
  $\frac{y}{\theta} = \frac{n-y}{1-\theta} \quad\text{or}\quad \hat\theta = \frac{y}{n}$
• So the maximum likelihood estimate of θ is always just the sample proportion of heads (as you would expect!)
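The same maximisation can also be done numerically in R; a minimal sketch using optimize() on the binomial log-likelihood for y = 7 heads in n = 10 tosses confirms the analytic result θ̂ = y/n:

    y <- 7; n <- 10

    # Binomial log-likelihood as a function of theta
    loglik <- function(theta) dbinom(y, size = n, prob = theta, log = TRUE)

    # Maximise over (0, 1); the maximum is at the sample proportion y/n = 0.7
    optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum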

Likelihood-based inference

• Likelihood provides a 'good' method to fit models: maximum likelihood estimators are consistent (they vary less from the true value as the sample size increases), asymptotically unbiased (on average equal to the true value), and asymptotically efficient (they vary less from the true value than other estimators for a given sample size)
• In addition, the shape of the likelihood (in particular 'how peaked' it is at the maximum) can provide a measure of the confidence that can be placed in such estimates (i.e. standard errors and confidence intervals)
• Ratios of likelihoods for alternative models provide a basis for comparing models and testing hypotheses (note these ideas are equivalent: testing a hypothesis involves comparing a model constrained by the hypothesis with one that is not)
• We consider more details on these issues as we go (see 'A Review of Likelihood Theory')

Assessing models

• All models involve various assumptions about the data to which they relate. Thus, once a model is fitted, the validity of any assumptions involved needs to be checked critically, especially if the model is going to be used to draw conclusions about relationships or to make predictions
• Correspondence between fitted values ŷi = µ̂i and observed values yi is obviously important in this respect, and differences between these, i.e. the raw residuals ε̂i = (yi − ŷi), are therefore particularly useful diagnostic tools
• But it needs to be appreciated that raw residuals are themselves estimates of real errors (ŷi is a prediction) and, further, that depending on the model the errors might be expected to have differential variability. It will thus usually be preferable to use some form of standardised residual (the appropriate form will depend on the type of model)

• Appropriately standardised residuals can be inspected in various ways:
  - to ensure that their distribution is in line with what one would expect given the random component of the model
  - to ensure they are independent and do not cluster together according to some characteristic of the sample observations (e.g. group effects or a temporal dependence)
  - to ensure that they contain no patterns which imply that the systematic component of the model could be improved
  - to ensure that no outliers are present, particularly outliers that are also influential, i.e. ones that may be distorting the model fit

• Various useful diagnostic plots for standardised residuals are:
  - residuals against fitted values (look for funnelling)
  - residuals against individual predictor variables (look for trend or curvature)
  - Q-Q plots, i.e. ordered residuals against the expected ordered values for a same-size sample from some comparative theoretical distribution, usually N(0, 1) (look for departure from a straight line)
  - influence plots for the observations, usually based upon some function of the difference in standardised residuals between a model fitted when the observation is excluded from the data (sometimes referred to as jackknifed residuals) and one when it is not (look for large influences)
• We will look in more detail at some of these ideas later in the context of specific modelling frameworks; a short R illustration follows below
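For linear models fitted with lm() in R, most of these diagnostics are available directly; the sketch below uses R's built-in cars data (not one of the course data sets) purely for illustration.

    fit <- lm(dist ~ speed, data = cars)   # simple illustrative linear model

    par(mfrow = c(2, 2))
    plot(rstandard(fit) ~ fitted(fit),     # residuals vs fitted values (funnelling?)
         xlab = "fitted values", ylab = "standardised residuals")
    plot(rstandard(fit) ~ cars$speed,      # residuals vs a predictor (trend/curvature?)
         xlab = "speed", ylab = "standardised residuals")
    qqnorm(rstandard(fit)); qqline(rstandard(fit))   # Q-Q plot against N(0, 1)
    plot(cooks.distance(fit), type = "h",  # influence of each observation
         ylab = "Cook's distance")
    par(mfrow = c(1, 1))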

Comparing models

• Given a fitted model with assumptions that have been critically assessed as reasonable, an obvious next consideration is: how good is it? To what extent does the model 'explain' the data? Are all the predictors necessary, or can some of them be ignored without worsening the fit? Is the model formulation preferable to alternatives?
• As suggested earlier, one way to consider these questions is via likelihood ratios. Given a specified random component, a saturated model can always be proposed which fits the data perfectly: one in which the systematic component is µi = yi (i.e. n unknowns, one for each data point). Suppose the maximised likelihood for this model is $L_{M_s}$ (clearly the associated mles will be µ̂i = yi)
• Suppose that some other model M has a likelihood maximised over its p < n unknowns, $L_M$; then the likelihood ratio $\Lambda = L_M / L_{M_s}$ intuitively measures how well the model M fits relative to the saturated model Ms

• In fact likelihood theory tells us that minus twice the log likelihood ratio, i.e. $-2\log\Lambda = 2(\log L_{M_s} - \log L_M)$ (known as the scaled deviance for model M), has approximately a $\chi^2_{n-p}$ distribution. The expected value of a χ² distribution is equal to its d.f., and so an 'adequate' model with p unknowns should have a scaled deviance roughly equal to its residual d.f., i.e. n − p (note the term residual deviance is sometimes used instead of 'scaled deviance'; these are effectively equivalent, although the former may be a constant multiple of the latter)
• Differences in scaled deviances can also be used to compare any two nested models. Suppose $M_1 \subset M_2$, where M1 has p1 unknowns and M2 has p2 unknowns (p2 > p1); then the difference in their scaled deviances, i.e. $2(\log L_{M_2} - \log L_{M_1})$, has approximately a $\chi^2_{p_2-p_1}$ distribution, which provides a test of whether the fit of the two models differs significantly
• Such ideas also form the basis of various model selection procedures used to establish appropriate subsets of explanatory variables to include in the model (in R these comparisons are made as in the sketch below)
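In R, deviance-based comparisons of nested models are carried out with anova(); a minimal sketch, again using the built-in cars data only for illustration, is:

    m1 <- lm(dist ~ speed, data = cars)               # smaller model
    m2 <- lm(dist ~ speed + I(speed^2), data = cars)  # larger nested model

    # Test based on the difference in residual sums of squares (F test for lm)
    anova(m1, m2)

    # For GLM objects the analogous call is anova(g1, g2, test = "Chisq") or test = "F"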

• As well as a saturated model for any proposed random component, there is also a null model, Mn, in which the systematic component is µi = µ (i.e. a single unknown corresponding to a constant mean for all observations, implying the predictors have no effect on the response)
• A rough assessment of the predictive power of any other model, M, is therefore to ask whether it explains a large proportion of the scaled deviance of the null model, i.e. we look at the R² associated with M, where:
  $R^2 = \frac{\text{scaled dev } M_n - \text{scaled dev } M}{\text{scaled dev } M_n}$
• Since the above measure will always increase with p (the number of unknowns in M), it is customary to penalise it for models with large numbers of unknowns and use the adjusted R² instead, where: $R^2_a = 1 - (1 - R^2)\left(\frac{n-1}{n-p}\right)$

• While R² or R²a provide rough guides to the comparative predictive power of models, alternative information-based criteria are generally to be preferred
• There are several variants, but the most commonly used is the Akaike Information Criterion (AIC), defined for a model M with p unknowns as:
  $-2\log\Lambda + 2p = 2(\log L_{M_s} - \log L_M) + 2p$
• So a model is preferred if it has the lowest AIC
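In R the AIC of fitted lm or glm objects is returned by AIC(); a minimal illustration (built-in cars data again) is:

    m1 <- lm(dist ~ speed, data = cars)
    m2 <- lm(dist ~ speed + I(speed^2), data = cars)

    AIC(m1, m2)   # the model with the lower AIC is preferred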

• Clearly the ultimate test of how well a model performs is the accuracy of its predictions. But this has to be in relation to data that were not involved in fitting the model, otherwise the saturated model will always win out
• One can assess this to some extent by examining confidence intervals, derived from the general likelihood methodology, for model predictions at a range of values of the predictor variables
• Alternatively we can use the resampling technique of cross-validation. The general idea here is to repeatedly partition the data into one set for building the model (the training set) and one for evaluating the model predictions (the testing set). Predictive performance is then assessed over the many testing sets used

• The most general form is called K-fold cross-validation. The data set is split into K partitions of roughly equal size. Each partition in turn is reserved for testing, with the rest of the data used to fit the model in each case
• We thus effectively end up with a prediction ŷ(i) of each point yi in the data set, derived from a model which did not include yi in its training set
• The mean square prediction error (MSE), $\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat y_{(i)}\right)^2$, then provides an overall assessment of the model's predictive power
• The case where K = n ('leave one out') is often used and is essentially an example of the general jackknife resampling technique
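A minimal hand-rolled K-fold cross-validation in R might look like the sketch below (built-in cars data and K = 5, chosen here only for illustration; packages such as boot provide ready-made versions, e.g. cv.glm).

    set.seed(1)
    K    <- 5
    data <- cars
    fold <- sample(rep(1:K, length.out = nrow(data)))   # random fold assignment

    pred <- numeric(nrow(data))
    for (k in 1:K) {
      train <- data[fold != k, ]
      test  <- data[fold == k, ]
      fit   <- lm(dist ~ speed, data = train)           # fit on the training folds
      pred[fold == k] <- predict(fit, newdata = test)   # predict the held-out fold
    }

    mean((data$dist - pred)^2)   # mean square prediction error (MSE)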

Parametric models

• Most traditional statistical models are parametric
• The form (linear, logistic etc.) of the function f(xi) representing the systematic model component is specified in advance and involves a number of unknown parameters whose values then need to be estimated from the data (e.g. the probability of a head in tossing coins in our earlier example)
• Similarly, the form of the random component (normal, Poisson, binomial etc.) is also specified in advance and may (or may not) involve additional unknown auxiliary parameters which also need to be estimated from the data (e.g. its variance, if this can be specified independently of the systematic component)

Non-parametric models

• Non-parametric models are distinguished by not specifying in advance a parametric form for the systematic model component f(xi). The object is to estimate f(·) directly, rather than to estimate the relationship through imposing a parametric form
• It is, however, necessary to assume something about f(xi) in order to be able to come up with a method to fit the model. Most methods assume that f(xi) is in some sense a 'smooth' continuous function and then proceed to estimate it by some kind of smoothing splines or other forms of localised polynomial approximation
• Note the justification for the fitting method will still usually require some broad assumptions about the form of the random component of the model

Parametric versus non-parametric models

• It is tempting to think that non-parametric models will automatically be 'better' than parametric models because they make fewer assumptions about the data
• Certainly non-parametric approaches are valuable. But:
  - Statistical modelling is about explaining data, not just reproducing it: one needs to guard against over-fitting
  - There is a trade-off between the strength of model assumptions and our ability to distinguish systematic from random components; building too little structure into a model can be just as bad as building in too much. Sometimes non-parametric approaches are unsatisfactory because they assume too little
• Furthermore:
  - Modelling is not only about estimating model components, but also about being able to specify how confident you are of those estimates, or of predictions derived from them. Parametric model frameworks will generally allow for better statistical inference concerning model estimates and predictions (e.g. standard errors, confidence intervals, model comparisons etc.)
  - Finally, fitting of non-parametric models will generally be more computationally intensive than use of parametric models
• So both parametric and non-parametric approaches have their place in statistical modelling. It depends what the objectives are. They should be viewed as complementary and not competitive

Illustrations of parametric modelling: the linear model

• The most basic parametric statistical model is the traditional linear model (or linear regression), where the systematic component is a function of the predictors which is linear in the unknown parameters, and the random component comprises additive, independent, zero-mean, normally distributed errors with constant variance
• So that $y_i = x_i'\beta + \varepsilon_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ for i = 1, . . . , n, or more concisely $y_i \sim N\!\left(\sum_{j=0}^{p}\beta_j x_{ij},\, \sigma^2\right)$, where it is understood that xi0 = 1 and defines the intercept term in the model
• Note this framework includes the case where some predictors are categorical (i.e. factors, so the raw explanatory variable will actually correspond to a set of auxiliary 0, 1 predictors). It also includes the case where some predictors are functions of the raw explanatory variables (e.g. $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \beta_3 x_{i2} + \beta_4 x_{i1}x_{i2} + \varepsilon_i$ is still a linear model)

The linear model

• Likelihood theory for the linear model leads to model fitting by least squares. Mle parameter estimates for the systematic component are $\hat\beta = (X'X)^{-1}X'y$, where X is the matrix with columns x0, x1, . . . , xp, often known as the model or design matrix (recall x0 is a column of ones)
• Estimated variances for the parameter estimates (and hence standard errors and confidence intervals) follow from $\mathrm{var}(\hat\beta) = \sigma^2(X'X)^{-1}$ and the fact that $\frac{\hat\beta_i - \beta_i}{\sqrt{\mathrm{var}(\hat\beta_i)}}$ is distributed as N(0, 1)
• In practice the variance σ² (i.e. the parameter of the random component) will also be unknown and will need to be replaced by a suitable estimate σ̂² (see later), in which case $\widehat{\mathrm{var}}(\hat\beta) = \hat\sigma^2(X'X)^{-1}$ and $\frac{\hat\beta_i - \beta_i}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_i)}}$ is distributed as $t_{n-p-1}$
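These least-squares formulae are easy to verify directly in R; the sketch below computes $\hat\beta = (X'X)^{-1}X'y$ and its estimated covariance by hand and checks them against lm() (built-in cars data used purely for illustration).

    y <- cars$dist
    X <- cbind(1, cars$speed)                 # design matrix with a column of ones

    beta.hat <- solve(t(X) %*% X, t(X) %*% y) # (X'X)^{-1} X'y
    res      <- y - X %*% beta.hat
    n <- nrow(X); p <- ncol(X) - 1
    sigma2   <- sum(res^2) / (n - p - 1)      # mean square residual (see next slide)
    vcov.hat <- sigma2 * solve(t(X) %*% X)    # estimated var(beta.hat)

    beta.hat
    sqrt(diag(vcov.hat))                      # standard errors

    coef(summary(lm(dist ~ speed, data = cars)))  # same estimates and standard errors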

• For a linear model M, the general log likelihood ratio statistic (i.e. scaled deviance) mentioned earlier, $-2\log\Lambda = 2(\log L_{M_s} - \log L_M)$ (where Ms is the saturated model), corresponds mathematically to $\sum_{i=1}^{n}(y_i - \hat y_i)^2 / \sigma^2$. In other words, the deviance for the linear model, M, is the residual sum of squares ($\mathrm{rss}_M$), and the scale factor is σ²
• It follows that the R² value for a model M is simply the difference between the rss for the null model, Mn, and that for M, expressed as a proportion of the rss for the null model, i.e. $1 - \mathrm{rss}_M/\mathrm{rss}_{M_n}$
• Further, since the scaled deviance has an asymptotic $\chi^2_{n-p-1}$ distribution with expected value n − p − 1, it follows (by equating observed and expected values) that a sensible estimate for σ² is given by the mean square error or mean square residual, i.e. $\hat\sigma^2 = \frac{\mathrm{rss}_M}{n-p-1} = \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{n-p-1} = \frac{\sum_{i=1}^{n}\hat\varepsilon_i^2}{n-p-1}$

• It also follows that tests of differences in fit between nested models M1, M2 with numbers of parameters p1 < p2 should be based on the likelihood ratio statistic $\frac{\mathrm{rss}_{M_1} - \mathrm{rss}_{M_2}}{\sigma^2}$
• General likelihood theory says this is approximately $\chi^2_{p_2-p_1}$ distributed if there is no difference in model fit. But σ² needs to be replaced by an estimate, and the 'best' such estimate must correspond to the mean square residual for the model with the more parameters, i.e. $\hat\sigma^2 = \frac{\mathrm{rss}_{M_2}}{n-p_2-1}$
• If we substitute this estimate and also slightly modify the likelihood ratio statistic for comparing M1 and M2 to $\frac{(\mathrm{rss}_{M_1} - \mathrm{rss}_{M_2})/(p_2-p_1)}{\mathrm{rss}_{M_2}/(n-p_2-1)}$
• then, for the linear model, this can be shown to have an exact $F_{p_2-p_1,\,n-p_2-1}$ distribution under no difference in model fit (n.b. the square root of this is $t_{n-p-1}$ when omitting one variable from a p-variable model)

• Prediction of values from a linear model is simple: we predict a response y by its fitted mean µ̂
• Thus the prediction for the mean response at explanatory variable values x = (1, x1, . . . , xp)′ is $\hat y = x'\hat\beta$, and standard errors and confidence intervals for this then follow from the estimated variance of ŷ, i.e. $\widehat{\mathrm{var}}(\hat y) = \hat\sigma^2 x'(X'X)^{-1}x$
• The prediction for an individual response at x is the same, but with larger estimated variance: $\widehat{\mathrm{var}}(\hat y) = \hat\sigma^2\left(1 + x'(X'X)^{-1}x\right)$

• The residuals ε̂i = (yi − ŷi) are 'predicted' rather than real errors. Thus they in turn have standard errors, which may be derived from the relationships for the standard errors of prediction
• It follows that appropriately standardised residuals for the linear model are given by:
  $\varepsilon_i^* = \frac{y_i - \hat y_i}{\sqrt{\hat\sigma^2\left(1 - x_i'(X'X)^{-1}x_i\right)}}$
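In R these predictions, intervals and standardised residuals are available from predict() and rstandard(); a minimal sketch with the built-in cars data:

    fit <- lm(dist ~ speed, data = cars)
    new <- data.frame(speed = c(10, 20))

    predict(fit, newdata = new, interval = "confidence")  # interval for the mean response
    predict(fit, newdata = new, interval = "prediction")  # wider interval for an individual response

    head(rstandard(fit))   # internally standardised residuals, as defined above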

Examples of using the linear model in R

• To illustrate the simple linear model in R, consider data on 102 occupations derived from a Canadian social survey. The response variable is an occupational 'prestige' score (y). There are two possible explanatory variables: the average income (x1) and the average educational level (x2) of those in the occupation
• See the notes on worked example 1 (a skeleton of the kind of R call involved is sketched below)
• Another example is provided by data on harmful algae in rivers. Algae 'blooms' are a potential ecological problem with an impact on river life forms and water quality. Monitoring and forecasting blooms is therefore important. The data comprise 148 water samples from European rivers over a year. For each sample, 8 chemical properties were measured, along with the concentrations of 7 types of algae. Other characteristics of the sample were the season of the year, river size and river speed
• See the notes on worked example 2
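The details are in the worked example notes, but a typical fit for the prestige data would look like the sketch below. It assumes a data frame, here called prestige, with columns prestige, income and education; these names are chosen for illustration and are not necessarily those used in the course workspace.

    # Assumed data frame 'prestige' with columns prestige, income, education
    fit <- lm(prestige ~ income + education, data = prestige)

    summary(fit)            # coefficients, standard errors, t tests, R^2
    confint(fit)            # confidence intervals for the parameters
    par(mfrow = c(2, 2)); plot(fit); par(mfrow = c(1, 1))  # standard diagnostic plots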

Illustrations of parametric models: the GLM

• A Generalised Linear Model (GLM) is a development of the linear model to accommodate (in a clean and straightforward way) both non-normal response distributions and transformations to linearity in the systematic model component
• In a GLM we have independent observations on a response, y = (y1, . . . , yn), where the distribution of yi is in the exponential family, i.e. it has the form:
  $p(y_i; \theta_i, \phi) = \exp\left\{\frac{a_i}{\phi}\left(y_i\theta_i - b(\theta_i)\right) + c(y_i, \phi)\right\}$
  where θi and φ are parameters, b(·) and c(·) are known functions, and ai is a known observation-specific weight (usually 1)
• The parameters θi and φ of this distribution are essentially location and scale (dispersion) parameters, which relate to the mean and variance of yi via $\mu_i = \frac{db(\theta_i)}{d\theta_i}$ and $\mathrm{var}(y_i) = \frac{\phi}{a_i}\frac{d^2 b(\theta_i)}{d\theta_i^2}$

Exponential family distributions

• The exponential family includes as special cases the normal, binomial, Poisson, exponential and gamma distributions, and so provides for a wide range of modelling options for the probability distribution of a response
• E.g. a Gaussian (normal) distribution N(µi, σ²) is obtained from the exponential form by setting $\theta_i = \mu_i$, $\phi = \sigma^2$, $a_i = 1$, $b(\theta_i) = \theta_i^2/2$ and $c(y_i, \phi) = -\tfrac{1}{2}\left[y_i^2/\phi + \log(2\pi\phi)\right]$
• whereas a Poisson distribution with mean µi is obtained by setting $\theta_i = \log\mu_i$, $\phi = 1$, $a_i = 1$, $b(\theta_i) = e^{\theta_i}$ and $c(y_i, \phi) = -\log y_i!$
• and a binomial B(ni, πi) is obtained by setting $\theta_i = \log\left(\pi_i/(1-\pi_i)\right)$, $\phi = 1$, $a_i = 1$, $b(\theta_i) = n_i\log(1 + e^{\theta_i})$ and $c(y_i, \phi) = \log\binom{n_i}{y_i}$

The generalised linear model

• To complete the GLM specification we suppose that there are explanatory variables (predictors) xi1, xi2, . . . , xip whose values may influence the distribution of the response yi through a function called the linear predictor, which is linear in the unknown parameters: $\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$
• As with the linear model, the explanatory variables can be quantitative or categorical, transformations of predictors, polynomial terms etc.
• The mean value of yi is related to the linear predictor through a smooth invertible function $g(\mu_i) = \eta_i$, known as the link function. So $\mu_i = g^{-1}(\eta_i)$
• Note we do not transform the response yi, but rather its mean µi (e.g. a linear model with response log yi is not the same as a GLM with normal error and a log link)

• The GLM specification is loose enough to encompass a wide class of models useful in statistical practice, but tight enough to allow the development of a unified methodology of estimation and inference (at least approximately)
• Some common distributions, link functions and the resulting means and variances are shown below:

  Family              Link       ηi                   µi                  var(yi)
  binomial            logit      log(µi/(ni − µi))    ni/(1 + e^−ηi)      µi(1 − µi/ni)
  gaussian (normal)   identity   µi                   ηi                  φ
  gamma               inverse    1/µi                 1/ηi                φµi²
  poisson             log        log µi               e^ηi                µi

• A range of alternative link functions are sometimes used for some of these families

• Likelihood theory for the GLM leads to model fitting by iteratively re-weighted least squares. Parameter estimates for the linear predictor are obtained by starting from a trial estimate of the parameters, β̂, and then iterating using:
  $\hat\beta = (X'WX)^{-1}X'Wz$
  where X is the design matrix (as in the linear model), z = (z1, . . . , zn) is a working vector with
  $z_i = \hat\eta_i + (y_i - \hat\mu_i)\left.\frac{d\eta_i}{d\mu_i}\right|_{\hat\beta}$
  and W is a diagonal weighting matrix whose ith diagonal element is
  $w_{ii} = a_i \left/ \left[\left.\frac{d^2 b(\theta_i)}{d\theta_i^2}\right|_{\hat\beta}\left(\left.\frac{d\eta_i}{d\mu_i}\right|_{\hat\beta}\right)^2\right]\right.$
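This iteration is what glm() carries out internally. As a rough illustration only, the sketch below hand-codes the IRLS steps for a Poisson model with log link (for which dη/dµ = 1/µ and b″(θ) = µ, so the weights are simply µ) on simulated data, and compares the result with glm().

    set.seed(1)
    x <- runif(100)
    y <- rpois(100, lambda = exp(1 + 2 * x))   # simulated Poisson responses
    X <- cbind(1, x)

    beta <- c(0, 0)                            # crude starting values
    for (it in 1:25) {
      eta <- drop(X %*% beta)
      mu  <- exp(eta)                          # inverse of the log link
      z   <- eta + (y - mu) / mu               # working vector: eta + (y - mu) d eta/d mu
      W   <- mu                                # weights: 1 / (b''(theta) (d eta/d mu)^2) = mu
      beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
    }

    drop(beta)
    coef(glm(y ~ x, family = poisson))         # agrees with the hand-coded iteration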

• Standard errors for the estimates (and hence confidence intervals) follow from $\mathrm{var}(\hat\beta) = \phi(X'WX)^{-1}$, with $\frac{\hat\beta_i - \beta_i}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_i)}}$ being approximately distributed as N(0, 1)
• If the scale parameter φ is unknown, it must be replaced by a suitable estimate φ̂ (see later), in which case $\widehat{\mathrm{var}}(\hat\beta) = \hat\phi(X'WX)^{-1}$ and $\frac{\hat\beta_i - \beta_i}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_i)}}$ is approximately distributed as $t_{n-p-1}$
• Note that with a normal error and identity link these GLM expressions for the parameter estimates and their variances essentially collapse to the results given earlier for the linear model

• For a GLM, M, the general log likelihood ratio statistic (scaled deviance) mentioned earlier, $-2\log\Lambda = 2(\log L_{M_s} - \log L_M)$ (where Ms is the saturated model), corresponds mathematically to a quantity $D_M/\phi$, where $D_M$ depends on the data y and the estimated linear predictor but does not depend on φ. The quantity $D_M$ is thus the deviance for the GLM, the scale factor being φ
• It follows that a 'pseudo' R² value for a GLM, M, is simply the difference between the deviance for the null model, Mn, and that for M, expressed as a proportion of the deviance for the null model, i.e. $1 - D_M/D_{M_n}$
• Further, since the scaled deviance has an asymptotic $\chi^2_{n-p-1}$ distribution with expected value n − p − 1, it follows that where φ is unknown a sensible estimate is given by $\hat\phi = \frac{D_M}{n-p-1}$

• As a consequence, tests of differences in fit between nested models M1, M2 with numbers of parameters p1 < p2 should be based on the likelihood ratio statistic $\frac{D_{M_1} - D_{M_2}}{\phi}$
• General likelihood theory says this is approximately $\chi^2_{p_2-p_1}$ distributed if there is no difference in model fit
• But if φ is unknown it needs to be replaced by an estimate, and the 'best' such estimate must be based on the model with the more parameters, i.e. $\hat\phi = \frac{D_{M_2}}{n-p_2-1}$
• Where φ is unknown we thus substitute its estimate and slightly modify the likelihood ratio statistic for comparing M1 and M2 to $\frac{(D_{M_1} - D_{M_2})/(p_2-p_1)}{D_{M_2}/(n-p_2-1)}$
• which, for a GLM, has an approximate $F_{p_2-p_1,\,n-p_2-1}$ distribution under no difference in model fit

• Prediction of values from a GLM is similar to that used for a linear regression: we predict a response y by its fitted mean µ̂
• However, we need to remember that in a GLM the mean µ is actually a function of the linear predictor η. So the prediction for the response at explanatory variable values x = (1, x1, . . . , xp)′ is $\hat y = g^{-1}(x'\hat\beta)$
• Derivation of standard errors and confidence intervals for a GLM prediction ŷ then follows (we do not give the mathematical details here)

• Similar considerations lead to the determination of appropriate standardisation for residuals. In fact several kinds of residuals can be defined for a GLM
• The raw residuals, i.e. (yi − ŷi), are known as response residuals; Pearson residuals are a standardised version of the response residuals; working residuals are an alternative derived from the last stage of the iterative fitting process, i.e. $(y_i - \hat\mu_i)\big/\frac{d\mu_i}{d\eta_i}$; and a further alternative is deviance residuals (the square roots of each individual observation's contribution to the deviance, with the sign of (yi − ŷi) attached)
• For Gaussian families all standardised types of residuals are equivalent. More generally, there are arguments for and against the use of the different types. Deviance residuals tend to be the most useful for model diagnostic purposes and should follow a normal distribution if the distributional assumptions are valid

• Finally, we should note that for certain GLM response families (e.g. binomial and Poisson) the variance of the response is purely determined by its mean: there is no additional scale parameter which is free to vary
• However, there may be data for which a Poisson or binomial assumption seems appropriate, but where in practice the random component has a variance which is larger than that implied by the mean-variance relationship for such distributions (i.e. the scaled deviance shows a clear lack of fit)
• This situation is often described as overdispersion. Failure to take account of overdispersion can lead to serious underestimation of standard errors and hence to misleading inference about model parameters or predictions derived from them

• One way to handle overdispersion for the binomial and Poisson is to use the GLM framework as the estimation mechanism for the parameters of the linear predictor (this does not depend on the scale parameter), but then to use a more general variance function than that given by the standard theory, e.g. $\mathrm{var}(y_i) = \phi\mu_i(1 - \mu_i/n_i)$ for the binomial or $\mathrm{var}(y_i) = \phi\mu_i$ for the Poisson. The scale parameter φ is then estimated from the deviance as described earlier
• Strictly, when doing this we are not using a GLM, since the response distribution is no longer from the exponential family. The approach is actually an example of a quasi-likelihood model, i.e. one where the precise response distribution is not explicitly specified; instead we simply specify a link function for the mean of the response and then a variance function in terms of the mean
• Quasi-likelihood uses formally identical estimation techniques to those for a GLM, so it provides a way of fitting GLMs with non-standard links or variance functions (in R this is the quasipoisson/quasibinomial route sketched below)
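In R this is done by choosing the quasipoisson (or quasibinomial) family in glm(); a minimal sketch with simulated overdispersed counts is:

    set.seed(1)
    x  <- runif(200)
    mu <- exp(1 + x)
    y  <- rnbinom(200, mu = mu, size = 2)     # counts more variable than Poisson

    g1 <- glm(y ~ x, family = poisson)        # standard errors too small here
    g2 <- glm(y ~ x, family = quasipoisson)   # same estimates, scale estimated from the data

    summary(g1)$coefficients
    summary(g2)$coefficients                  # larger, more realistic standard errors
    summary(g2)$dispersion                    # estimated scale parameter phi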

Examples of using the GLM in R

• To illustrate a simple binomial GLM using R, consider a data set concerning failure of 'O' rings on the Challenger space shuttle (a skeleton of the kind of call involved is sketched below)
• See the notes on worked example 3
• A further GLM example using R, this time a Poisson model, is provided by a data set relating to football-related crime in the UK
• See the notes on worked example 4
• A final, more extended example of a GLM using R is provided by a data set relating to predicting recessions in the US economy
• See the notes on worked example 5
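Again the details are in the worked example notes, but a binomial GLM of the O-ring type is fitted along the lines of the sketch below. It assumes a data frame, here called orings, holding for each launch the number of damaged rings (out of 6) and the launch temperature; both the data frame and column names are illustrative and not necessarily those used in the course workspace.

    # Assumed data frame 'orings' with columns damaged (0-6) and temp
    fit <- glm(cbind(damaged, 6 - damaged) ~ temp,
               family = binomial, data = orings)

    summary(fit)                          # coefficients on the logit (log-odds) scale
    predict(fit, newdata = data.frame(temp = 31),
            type = "response")            # estimated failure probability at a low temperature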

Direct use of likelihood for non-standard models

• So far so good! But what if a proposed model is non-standard and falls outside the frameworks of the linear or generalised linear model?
• Well, the general purpose likelihood-based inferential method remains applicable, but clearly non-standard models will need to be implemented through direct maximisation of the likelihood, rather than being able to use a general purpose framework such as least squares or iteratively re-weighted least squares
• It may therefore be worth briefly reviewing in more detail some further general aspects of likelihood-based inference, as opposed to its specific manifestation in the context of linear and generalised linear models

• We mentioned before that the shape of the likelihood (in particular 'how peaked' it is at the maximum) provides a measure of the confidence that can be placed in maximum likelihood estimates (i.e. standard errors and confidence intervals). We now discuss this in more detail
• The first derivative of the log likelihood is called the score function, u(θ). It is a vector of first partial derivatives, one for each element of θ. The mle is found by setting the score to zero, i.e. u(θ) = 0, and solving the resulting set of equations

• The second derivative indicates the extent to which a function is peaked rather than flat, and for the likelihood a useful related measure is the information matrix I(θ), i.e. the expected value under the model (over possible data values, y) of minus the matrix of second derivatives of the log-likelihood, or:
  $I(\theta) = -E\left[\frac{\partial^2 \log L(\theta; y)}{\partial\theta\,\partial\theta'}\right]$
• Under reasonable conditions, the mle estimator θ̂ can be shown to have an asymptotic (multivariate) normal distribution with variance-covariance matrix given by the inverse of the information matrix, i.e. $\mathrm{cov}(\hat\theta) = I^{-1}(\theta)$

• Since $I^{-1}(\theta)$ depends on θ (which is unknown), we estimate its value by replacing θ with its mle θ̂, so obtaining $\mathrm{cov}(\hat\theta) \approx I^{-1}(\hat\theta)$
• Further, since the expectation involved in the information matrix I(·) can be difficult to derive, we may sometimes replace $I^{-1}(\hat\theta)$ by its observed equivalent, i.e. $-H^{-1}(\hat\theta; y)$, where H(θ; y) is called the Hessian: the matrix of second derivatives of the log-likelihood evaluated at θ̂ and at the data values y that were actually observed
• So overall we have the useful result that maximum likelihood estimates θ̂ are approximately (multivariate) normally distributed with variance-covariance matrix $\mathrm{cov}(\hat\theta) \approx -H^{-1}(\hat\theta; y)$
• This provides approximate standard errors and confidence intervals for such estimates (often attributed to Wald)
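In R, direct maximisation is usually done with optim(), which can also return the Hessian needed for the Wald standard errors just described. A minimal sketch for the earlier coin-tossing log-likelihood (y = 7 heads in n = 10 tosses) is below; because optim() minimises the negative log-likelihood, the inverse of the returned Hessian is the approximate covariance matrix without a further sign change.

    y <- 7; n <- 10
    negloglik <- function(theta) -dbinom(y, size = n, prob = theta, log = TRUE)

    fit <- optim(par = 0.5, fn = negloglik, method = "L-BFGS-B",
                 lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)

    fit$par                         # mle, approximately y/n = 0.7
    sqrt(diag(solve(fit$hessian)))  # approximate (Wald) standard error from the Hessian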

• An alternative approach to confidence intervals for mle estimates is to base inference on the log likelihood-ratio statistic
• As discussed earlier, $-2\log\Lambda = 2\left(\log L(\hat\theta; y) - \log L(\theta; y)\right)$ intuitively measures how well a model with parameter values θ compares with one where the parameters assume the maximum likelihood values θ̂. Furthermore, this quantity has approximately a $\chi^2_p$ distribution under the hypothesis that there is no difference in fit between these two models (where p is the dimension of θ)
• Thus to find a confidence interval for θ we determine a region of θ values for which the value of $2\left(\log L(\hat\theta; y) - \log L(\theta; y)\right)$ is not inconsistent with an observation from a $\chi^2_p$ distribution
• The Wald and likelihood-ratio results are asymptotically equivalent and give the same results if the log likelihood is locally quadratic in the vicinity of the maximum

Example of direct use of likelihood for non-standard models in R

• To illustrate the direct maximisation of a log-likelihood in R, and the inference derived from that, we consider fitting a non-linear model with Gaussian errors to a set of fictitious data consisting of a response and a single explanatory variable
• See the notes on worked example 6

Bayesian inference

• Maximum likelihood is fine as a general approach, but if the form of the likelihood L(θ; y) is complex and/or the number of individual unknowns involved in θ is large, then the approach may prove either very difficult or infeasible to implement
• If so, then a Bayesian approach to model fitting may prove useful
• In the Bayesian approach we think of the unknowns as 'random quantities' (rather than fixed constants)
• The statistical model then becomes a joint probability distribution for both the data and the parameters, p(y, θ) (the likelihood is now the conditional distribution of y 'given' the parameter values, p(y|θ))

• Elementary probability theory then allows us to relate p(y, θ) to the likelihood:
  $p(y, \theta) = p(y \mid \theta)\,p(\theta)$
  where p(θ) is called the prior probability distribution for the parameters. This prior expresses our uncertainty about θ before taking the data into account. It will usually be chosen to be 'non-informative'
• Bayes' Theorem then allows derivation of a posterior probability distribution for the parameters given the observed data:
  $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)} = \frac{p(y \mid \theta)\,p(\theta)}{\int_\theta p(y \mid \theta)\,p(\theta)\,d\theta}$
  i.e. the 'posterior' is proportional to 'likelihood' × 'prior'; the denominator is just a normalising constant independent of the parameters (but unfortunately difficult to calculate because it involves a 'nasty' multidimensional integral)

• The posterior p(θ|y) expresses our uncertainty about θ after taking the data into account
• So any characteristics of interest f(θ) concerning the parameters (e.g. mean, standard deviation, mode, median, quantiles etc.) may be derived from the corresponding characteristics of the posterior
• For example, an obvious choice of a point estimate, θ̂, for the parameter values θ is the posterior mean of the parameters:
  $\hat\theta = E[\theta \mid y] = \int_\theta \theta\, p(\theta \mid y)\, d\theta$
• But it's important to stress that Bayes gives us a full posterior distribution, p(θ|y), for θ, and thus allows us to examine any aspect of θ we choose and make associated probability statements

• Let's return to our earlier simple coin-tossing example to clarify these ideas
• Recall we were estimating the probability of a head from a sample of n tosses, where y is the number of heads obtained
• As before, let θ be the probability of a head; then the likelihood for y is a binomial distribution:
  $p(y \mid \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$
• Suppose we take the prior for θ to be U(0, 1), i.e. p(θ) = 1 for 0 ≤ θ ≤ 1 (this says θ is equally likely to be anywhere in the (0, 1) range)

• Then the posterior is given by:
  $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int_\theta p(y \mid \theta)\,p(\theta)\,d\theta} = \frac{\binom{n}{y}\theta^y(1-\theta)^{n-y}}{\int_0^1 \binom{n}{y}\theta^y(1-\theta)^{n-y}\,d\theta} = \frac{\binom{n}{y}\theta^y(1-\theta)^{n-y}}{\frac{1}{n+1}}$
• A reasonable point estimate for θ is the mean of the posterior, i.e.
  $\hat\theta = E[\theta \mid y] = \int_0^1 \theta\,(n+1)\binom{n}{y}\theta^y(1-\theta)^{n-y}\,d\theta = \frac{y+1}{n+2}$
• Recall that the mle of θ was y/n. For this prior that is the mode of the posterior, or MAP estimate (rather than the posterior mean). In fact y/n is the posterior mean when the prior is taken to be uniform for the log odds, i.e. for $\log\frac{\theta}{1-\theta}$
• This illustrates that the choice of prior can be tricky: which is the most sensible estimate of θ here?
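These posterior quantities are easy to compute in R. For y = 7 heads in n = 10 tosses and the uniform prior, the posterior written above is the Beta(y + 1, n − y + 1) density, so a minimal sketch is:

    y <- 7; n <- 10
    theta <- seq(0, 1, length.out = 200)

    # Posterior under the U(0,1) prior: Beta(y + 1, n - y + 1)
    plot(theta, dbeta(theta, y + 1, n - y + 1), type = "l",
         xlab = expression(theta), ylab = "posterior density")

    (y + 1) / (n + 2)                          # posterior mean, 8/12
    qbeta(c(0.025, 0.975), y + 1, n - y + 1)   # 95% credible interval for theta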

• In summary, the Bayesian modelling approach is:
  - choose a (joint) probability model (likelihood) for the data, p(y|θ)
  - choose a (joint) prior for the parameters, p(θ)
  - derive the posterior p(θ|y)
  - estimate any characteristic of interest involving one or more of the parameters by the corresponding characteristic of the posterior, e.g. the posterior mean for a point estimate or the posterior standard deviation for a standard error
• This is a very general and flexible approach to statistical modelling, capable of handling very complex modelling frameworks. The problem is that you have to be able to integrate to find the posterior distribution in order to use the method! So why is it any more useful than maximum likelihood?

Markov Chain Monte Carlo (MCMC) methods

• It's true that until relatively recently the integrations involved in determining the posterior have presented practical difficulties in Bayesian modelling, especially when large numbers of parameters are involved
• In many applications mathematical evaluation of the posterior is impossible because of the multidimensional integration involved in determining the normalising denominator
• But now the 'engineering' approach of Monte Carlo integration can be used
• This evaluates the posterior by simulating many sample values from it. If the samples are numerous and representative of the posterior then they can provide virtually complete information about it
• Any characteristic of the posterior is then approximated by the corresponding characteristic of these samples

• For example, suppose f(θ) is some function of the parameters of interest (e.g. a prediction from the model)
• Let $\theta^{(1)}, \ldots, \theta^{(n)}$, where $\theta^{(i)} = (\theta_1^{(i)}, \ldots, \theta_p^{(i)})$, denote n samples from the posterior p(θ|y)
• Then if n is large enough: $\hat f(\theta) = E[f(\theta) \mid y] \approx \frac{1}{n}\sum_{i=1}^{n} f(\theta^{(i)})$
• This is called Monte Carlo integration
• The simplest example is where f(θ) = θ, in which case $\hat\theta = E[\theta \mid y] \approx \frac{1}{n}\sum_{i=1}^{n}\theta^{(i)}$

Page 37: Introductory Statistical Modelling using R - INPE

'

&

$

%

Markov Chain Monte Carlo (MCMC) methods

ä That’s nice! But the problem is then how to simulate samples from the posterior? Direct

sampling from p(θ|y) is difficult (because you don’t know what it is!). But indirect

sampling from a Markov Chain (MC) with p(θ|y) as its stationary (equilibrium)

distribution is feasible.

ä Sequence {θ(i)} is an MC if p(θ(i+1)|θ(i), . . . , θ(1)) = p(θ(i+1)|θ(i)) i.e. next

value θ(i+1) depends only on current value θ(i) and not previous values. Subject to

certain conditions, MCs gradually ‘forget’ their initial value and converge to a stationary

distribution (overall probability of taking any value remains same and this is

independent of the original starting value). Subsequent values of the chain are then

samples from this stationary distribution

ä Hence construct an MC with a stationary distribution identical to the posterior and use

values from that MC chain after a sufficiently long burn in as simulated samples from

the posterior. This is called Markov Chain Monte Carlo (MCMC)

'

&

$

%

Markov Chain Monte Carlo (MCMC) methods

ä The reason that this is so useful in practice is because it is surprisingly easy to

construct a Markov Chain which has a given stationary distribution.

ä This is achieved via the general Metropolis-Hastings algorithm.

ä Furthermore this algorithm only requires the stationary distribution (in our case p(θ|y)

i.e. the posterior) to be specified up to the normalising constant—i.e. we just need the

product of the prior and the likelihood — no nasty integration required!


Markov Chain Monte Carlo (MCMC) methods

ä The Metropolis-Hastings algorithm constructs a Markov Chain to converge to the target

distribution by sampling a candidate for the next value of the chain from a proposal

distribution and then either accepting it or rejecting it according to an acceptance

probability which depends upon the proposal distribution, the target distribution, the

current state of the chain and the candidate value

ä The proposal distribution can have any form subject to certain regularity conditions. It

will be chosen to be appropriate to the particular target distribution required and so that

it is easy to sample from


Markov Chain Monte Carlo (MCMC) methods

ä Given a target distribution, p(θ|y), the Metropolis-Hastings algorithm proceeds as follows (a small R sketch is given after the steps):

ß Set i = 0 and choose arbitrary starting values θ(0) for θ

ß Sample a candidate, θ(∗), for the next state of the chain given the current state θ(i)

from a pre-selected proposal distribution, q(θ(∗)|θ(i))

ß Compute an acceptance probability

α(θ(i), θ(∗)) = min{ 1, [ p(θ(∗)|y) q(θ(i)|θ(∗)) ] / [ p(θ(i)|y) q(θ(∗)|θ(i)) ] }

(note target distribution appears only as a ratio, so unknown constant of

proportionality involved in p(θ|y) cancels)

ß Sample u such that u ∼ U(0, 1). If u ≤ α(θ(i), θ(∗)) then set θ(i+1) = θ(∗),

else set θ(i+1) = θ(i)

ß Set i = i + 1 and return to step 2 for a new candidate (repeat 1000’s of times)
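
ä To make these steps concrete, here is a minimal random-walk Metropolis sketch in R (illustrative only, with an arbitrarily chosen target): the proposal is a symmetric normal centred on the current value, so the q terms cancel in the acceptance ratio

# Random-walk Metropolis sketch; the target density is known only up to a constant
log.target <- function(theta) dnorm(theta, mean = 3, sd = 2, log = TRUE)  # log p(theta|y) + const
n <- 10000
theta <- numeric(n)
theta[1] <- 0                                 # arbitrary starting value
for (i in 1:(n - 1)) {
  cand <- rnorm(1, mean = theta[i], sd = 1)   # candidate from the proposal q
  alpha <- min(1, exp(log.target(cand) - log.target(theta[i])))  # acceptance probability
  theta[i + 1] <- if (runif(1) <= alpha) cand else theta[i]
}
mean(theta[-(1:2000)])                        # posterior mean estimate after a burn-in of 2000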


Markov Chain Monte Carlo (MCMC) methods

ä Particular versions of the general Metropolis-Hastings algorithm need to be ‘hand

crafted’ to fit different applications through suitable choice of the proposal distribution so

as to obtain:

ß a good rate of convergence (short burn-in needed to achieve stationary distribution)

ß and good rate of mixing (fast movement around the support of the stationary

distribution once it is achieved)

ä But all this is generally easier than maximising the equivalent likelihoods and you get a

full posterior distribution for the parameters from it, rather than just point estimates and

standard errors


Markov Chain Monte Carlo (MCMC) methods

ä There are several variants of the general Metropolis-Hastings algorithm (the Metropolis

sampler, the independence sampler etc.)

ä One variant is Gibbs sampling, which always accepts a candidate. This can only be

used when full conditional posterior distributions of each parameter given values of all

the others are known

ä Suppose θ = (θ1, . . . , θp) then in Gibbs sampling we:

ß Set i = 0 and choose arbitrary starting values (θ1(0), . . . , θp(0))

ß Sample θ1(i+1) from p(θ1 | θ2(i), . . . , θp(i), y)

Sample θ2(i+1) from p(θ2 | θ1(i+1), θ3(i), . . . , θp(i), y)

...

Sample θp(i+1) from p(θp | θ1(i+1), . . . , θp−1(i+1), y)

ß Set i = i + 1 and repeat the sampling cycle (do this many 1000s of times; a small R sketch follows below)
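
ä A minimal Gibbs sketch (illustrative only, not from the notes): for a bivariate normal ‘posterior’ with correlation ρ both full conditionals are univariate normals, so each update in the cycle above is a single rnorm() draw

# Gibbs sampler sketch for a bivariate normal posterior with correlation rho (illustrative)
rho <- 0.8
n <- 10000
theta <- matrix(0, nrow = n, ncol = 2)   # arbitrary starting values (0, 0)
for (i in 2:n) {
  theta[i, 1] <- rnorm(1, mean = rho * theta[i - 1, 2], sd = sqrt(1 - rho^2))  # theta1 | theta2, y
  theta[i, 2] <- rnorm(1, mean = rho * theta[i, 1],     sd = sqrt(1 - rho^2))  # theta2 | theta1, y
}
colMeans(theta[-(1:2000), ])             # marginal posterior means after burn-in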


Markov Chain Monte Carlo (MCMC) methods

ä After sufficient ‘burn-in’ the successive θ(i) = (θ1(i), . . . , θp(i)) formed from general Metropolis-Hastings or some variant such as Gibbs sampling settle down to samples from a Markov chain with stationary distribution p(θ|y).

ä Samples from marginal posteriors (e.g. p(θj |y)) are approximated by simply picking

out the values for one parameter from the samples ignoring the other parameters.

ä Characteristics concerning a parameter are then estimated from the marginal posterior

samples via their sample equivalents (e.g. mean, mode, median, standard deviation,

quantiles etc.)


Markov Chain Monte Carlo (MCMC) methods

ä As said, important issues in MCMC to ensure good estimates are convergence

(‘burn-in’ required) and mixing (required number of samples after convergence)

ä There are formal ways to assess convergence but the essential point is that samples for

any parameter should be a random scatter about a stable mean value—convergence is

to a target distribution not a single value. Check convergence by several long runs with

different starting values (multiple chains). Statistics such as ‘Gelman’s R hat’ help

assess whether the multiple chains have converged (Rule of thumb: its value should be

< 1.2 for each parameter)

ä After convergence, sufficient samples are required to ensure posterior variance is

estimated accurately. Again formal techniques exist. A useful statistic is the MC

standard error for each parameter. Ideally want MC error small in relation to posterior

st. dev. (Rule of thumb: run simulation until MC error for each parameter < 5% of

sample (posterior) st. dev)
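
ä If the samples are available in R, the coda package provides these checks; a hedged sketch, assuming two chains for the same parameter are held in numeric objects chain1 and chain2 (hypothetical names):

# Sketch of convergence and mixing checks using the coda package
library(coda)
chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
gelman.diag(chains)     # Gelman's 'R hat'; rule of thumb: < 1.2 for each parameter
summary(chains)         # includes naive and time-series ('MC') standard errors
effectiveSize(chains)   # effective number of independent samples (a guide to mixing)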


Markov Chain Monte Carlo (MCMC) methods

ä Choice of suitable prior distributions in Bayesian modelling can be controversial.

ä Conjugate priors are priors which lead to the posterior being in the same family as the

prior. These are useful, but unfortunately conjugate priors do not exist for all likelihoods.

MCMC methods make conjugacy less important.

ä In some cases the prior for the basic model parameters p(θ) will itself involve some

additional parameters, γ , i.e. the prior may be of the form p(θ|γ). Then we have a

hierarchical model. Parameters of the prior are known as hyperparameters.


Markov Chain Monte Carlo (MCMC) methods

ä One beauty of the Bayesian framework is that it easily incorporates these extra

unknown quantities.

ä We just extend the same game and specify a joint hyperprior p(γ) for the

hyperparameters γ and then use:

p(θ, γ|y) = p(y|θ) p(θ|γ) p(γ) / p(y) = p(y|θ) p(θ|γ) p(γ) / ∫θ ∫γ p(y|θ) p(θ|γ) p(γ) dγ dθ

ä Essentially the hyperparameters γ are treated on the same footing as the primary

parameters θ


Markov Chain Monte Carlo (MCMC) methods

MCMC methods make Bayesian modelling of complex situations involving many parameters

a practical feasibility. Use is now widespread. In summary the full Bayesian MCMC

approach is:

ä Choose appropriate (joint) probability model (likelihood) for the data—p(y|θ)

ä Choose appropriate (joint) prior for the parameters—p(θ|γ)

ä Choose appropriate (joint) hyperprior for the hyperparameters—p(γ)

ä Use MCMC (Gibbs Sampling or general Metropolis-Hastings) to generate numerous

samples (θ, γ)(i), i = 1, . . . , n from posterior p(θ, γ|y) using:

p(θ, γ|y) ∝ p(y|θ)p(θ|γ)p(γ)

ä Estimate any characteristic of interest involving one or more of the parameters or

hyperparameters by the equivalent characteristic of the posterior samples


Gibbs sampling and WinBUGS

ä As said, to do Gibbs sampling we need to be able to specify the full conditional

posterior distributions of each parameter given the values of the others. That is we

need p(θj |θ1, . . . , θj−1, θj+1, . . . , θp, y)

ä Turns out that these are relatively easy to derive for a wide range of commonly used

models

ä We also need to be able to simulate observations from each of these distributions and

this again turns out to be relatively easy since each is one-dimensional and often ‘log

concave’. Which means that general techniques such as adaptive rejection sampling

can be used

ä Hence Gibbs sampling can be used in a wide variety of Bayesian models. It forms

the basis of the MCMC method in the public domain WinBUGS package (Bayesian

Inference Using Gibbs Sampling)


Markov Chain Monte Carlo (MCMC) methods

ä Lets see all this in action on a very simple example. Cholesterol level (mg/ml) and age

(years) was measured for 24 patients diagnosed with hyperlipoproteinaemia and

resulted in the following scatter plot:

[Figure: ‘Cholesterol Level against Age’ — scatter plot of Cholesterol Level (mg/ml) against Age (years) for the 24 patients, with the fitted regression line indicated]

ä Sample correlation between age and cholesterol is strong (≈ 0.9) and a standard

linear regression model (indicated in the plot) results in the following model:

yi (Cholesterol level) = α (1.2799) + β (0.0526) × agei

with residual standard deviation σ equal to 0.334.


Markov Chain Monte Carlo (MCMC) methods

ä In the notation we have been using, we can represent this as a Bayesian model by

taking the parameters as: θ = (α, β, σ) and the data y = (y1, . . . , y24) to comprise

independent observations with p(yi|θ) ∼ N(α + βagei, σ2) so that:

p(y|θ) = ∏i=1,...,24 p(yi|θ)

ä As priors for both α and β we use Normal distributions with zero means and large

variances.

ä As a prior for σ2 we take τ = 1/σ2 (known as the precision) to have a Gamma distribution with mean 1 and a large variance.

ä These choices are pretty standard for this situation and represent relatively

uninformative priors. Note here the priors do not involve any hyperparameters


Markov Chain Monte Carlo (MCMC) methods

In WinBUGS we require the following specification:

Model

for(i in 1:N){
Y[i] ~ dnorm(mu[i], tau)        # normal distribution for data, mean mu, precision tau
mu[i] <- alpha + beta * age[i]  # linear model for mean mu
}
alpha ~ dnorm(0, 1.0E-6)   # diffuse normal prior for alpha
beta ~ dnorm(0, 1.0E-6)    # diffuse normal prior for beta
tau ~ dgamma(.001, .001)   # vague gamma prior for tau
sigma <- 1/sqrt(tau)       # st. deviation for Y derived from tau

Data

list(N = 24, Y = c(3.5, 1.9, ..., 3.3), age = c(46, 20, ..., 50))

Initial values for the MCMC sampler

list(alpha = 0, beta = 0, tau = 1)


Markov Chain Monte Carlo (MCMC) methods

ä We can now run this model to generate samples from the posterior distribution and

collect summary statistics from those samples. WinBUGS itself derives the conditional

distributions required for the Gibbs Sampling from the dependency structure specified

in the model.

ä In this case 10,000 samples with a ‘burn-in’ of 5000 values gave the following (kernel

density) estimates for the marginal posterior distributions of each parameter:

[Figure: kernel density estimates of the marginal posteriors p(α|y), p(β|y) and p(σ|y)]


Markov Chain Monte Carlo (MCMC) methods

ä The corresponding summary statistics were:

posterior   mean      sd        2.5%     median   97.5%
p(α|y)      1.27800   0.224300  0.83170  1.27900  1.73800
p(β|y)      0.05265   0.005406  0.04178  0.05259  0.06324
p(σ|y)      0.34600   0.055730  0.25660  0.33930  0.47360

ä Results from the conventional regression were:

parameter   estimate   sd
α           1.279900   0.215700
β           0.052625   0.005192
σ           0.334000

ä Note that changing the model from p(yi|θ) ∼ N(α + βagei, σ2) to one with a different

distributional assumption, or with a mean which is a non-linear function of the parameters would

imply that regression cannot be used (a GLM or direct maximum likelihood is then required).

However, in the Bayesian case it means only a simple adjustment to the model specification; the basic approach remains unchanged.


Example of use of MCMC in R

ä Note that good links have been developed between WinBUGS and R (and its many add

on packages).

ä For example, the R package R2WinBUGS allows one to set up data and model

specification in R and then use this to call WinBUGS to do the MCMC sampling and

return the results to R for further analysis

ä To illustrate this we return to the Challenger ‘O’ ring data previously discussed.

ä See the notes on worked example 7


Illustrations of non-parametric statistical models

ä As said earlier in our discussion, non-parametric models are distinguished by not

specifying in advance a parametric form for the systematic model component f(xi).

The object is to estimate f(·) directly, rather than to estimate the relationship through

imposing a parametric form

ä There are now a whole range of non-parametric or data based modelling approaches

including additive and generalised additive models, projection pursuit regression,

multiple adaptive regression splines, regression and classification trees etc.

ä We only touch upon the subject here by briefly discussing the first of these classes of

models


Non-parametric statistical models

ä A key general point to bear in mind is that non-parametric and parametric models can

often be applied in similar situations, but they serve somewhat different purposes.

ä Parametric models emphasize estimation and inference for the parameters of the

model, while non-parametric models simply look for ‘best functions’ of variables that will

provide good prediction or explanation of the response for the particular data available.

Their focus is more on exploring the data set and visualizing the relationship between

the response and explanatory variables

ä Because non-parametric models make few assumptions about the data, inference from

such models (e.g. confidence intervals for model components or predictions) tends to

rely heavily on various kinds of resampling techniques, rather than likelihood theory.


Resampling ideas

ä We have come across one example of resampling in our discussion of cross-validation

for assessing predictions from parametric models.

ä There are a whole range of related techniques.

ä The bootstrap is particularly widely used and involves evaluating the accuracy of an

estimated model component or prediction, by repeatedly recalculating the quantity in

question using data samples of the same size drawn randomly with replacement from

the original data set
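
ä For example, a minimal sketch of a non-parametric bootstrap in base R (illustrative only; the boot package offers a fuller interface), taking the quantity of interest to be the median of a sample y:

# Bootstrap sketch: standard error and percentile interval for the median of y
set.seed(1)
y <- rexp(50)                                # some illustrative data
boot.med <- replicate(2000, median(sample(y, replace = TRUE)))
sd(boot.med)                                 # bootstrap standard error of the median
quantile(boot.med, c(0.025, 0.975))          # simple percentile interval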


Additive models

ä The simplest non-parametric alternative to the conventional linear regression model is

the additive model in which: yi = β0 + f1(xi1) + . . . + fp(xip) + εi where the

partial-regression functions fj(·) are assumed to be smooth and are to be estimated

from the data. This model retains the assumption that explanatory variable effects are

additive. Thus the response is modelled as the sum of arbitrary smooth univariate

functions of the explanatory variables.

ä Note, one normally needs a reasonably large sample size to estimate each fj(·)

ä Variations of this include models in which interactions are allowed between some

predictors (e.g. yi = β0 + f12(xi1, xi2) + f3(xi3) + . . . + fp(xip) + εi). Also

semi-parametric models where some predictors enter linearly (e.g.

yi = β0 + β1xi1 + f2(xi2) + . . . + fp(xip) + εi). The latter are particularly useful

when some predictors are categorical.


Scatter Plot smoothers

ä The simplest form of the additive regression model is one where there is only one

explanatory variable i.e. effectively yi = f(xi) + εi, this is often referred to as a

scatter plot smoother

ä There are three common methods used to estimate the systematic component f(xi) of

the simple scatter plot smoother:

ß kernel estimation

ß localised regression (a generalisation of kernel estimation)

ß smoothing splines

ä The same sorts of ideas are used to estimate fj(xj) in additive models with multiple

explanatory variables


Scatter Plot smoothers

ä Kernel estimation proceeds as follows. Let x denote a focal point at which f(·) is to

be estimated. Define a symmetric, unimodal weight function, or kernel, wτ (·), centred

on x, that goes to zero (or nearly so) at the boundaries of a suitable neighbourhood

around x. Obtain fitted value at x by: f(x) = Σ wτ(xi) yi / Σ wτ(xi), the sums taken over the data points xi

ä Repeat for a range of x values spanning the data and connect the fitted values

ä The neighbourhood depends upon τ , the bandwidth. Choice of τ determines the

‘smoothness’ of the kernel estimation. Choice of the form of the kernel is not so critical.

Often the tricube weight function is used:

wτ(xi) = [1 − (|xi − x|/τ)³]³   for |xi − x|/τ < 1

wτ(xi) = 0   for |xi − x|/τ ≥ 1

A Gaussian (normal) density function is another common choice
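
ä A hedged R sketch of this idea: ksmooth() in the stats package computes exactly such a locally weighted average (with a box or normal kernel), the bandwidth playing the role of τ

# Kernel (local average) scatter plot smoother sketch on simulated data
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
plot(x, y)
lines(ksmooth(x, y, kernel = "normal", bandwidth = 2))  # larger bandwidth gives a smoother fit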


Scatter Plot smoothers

ä Local polynomial regression (lowess) is similar to kernel estimation, but the fitted

values are produced by locally weighted regression rather than by local averaging

ä That is f(x) is obtained as a prediction from a regression which minimises the

weighted sum of squared residuals∑

wτ (xi)(yi − b0 − b1xi − b2x2i − . . . − bpx

pi )

2

ä Most commonly the order of the local polynomial is taken as p = 1 or p = 2.

ä Often τ is defined simply in terms of the proportion of the data points to be used in the local regression, in which case it is known as the span: the nearest 100τ% of the data points to x are given a weight of 1, the rest a weight of zero
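
ä In R this is available as lowess() (or loess(), used in worked example 8); a quick sketch on simulated data, with the span controlled by the argument f:

# Localised regression sketch: two different spans
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
plot(x, y)
lines(lowess(x, y, f = 0.3))                  # span = 30% of the data at each focal point
lines(lowess(x, y, f = 0.8), lty = "dashed")  # larger span gives a smoother fit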


Scatter Plot smoothers

ä Smoothing splines are the solution to the penalised regression problem: find f(x) to minimise

Σ [yi − f(xi)]² + τ ∫ [f′′(x)]² dx   (the integral taken over the observed range of x)

where τ is a smoothing parameter analogous to that controlling the neighbourhood in

kernel estimation or localised regression and f ′′(·) is the second derivative of the

regression function (taken as a measure of roughness)

ä At one extreme, when the smoothing constant is set to τ = 0, f(x) simply interpolates

the data (similar to localised regression with span = 1/n)

ä At the other extreme if τ is large, f(x) will be selected so that f ′′(x) is everywhere 0,

which implies a globally linear least-squares fit to the data (equivalent to localised

regression with infinite neighborhoods)


Scatter Plot smoothers

ä The function f(x) that minimizes the penalised regression is a natural cubic spline

with knots at the distinct observed values xi (splines are piece-wise polynomial

functions that fit together at the knots. For cubic splines, the first and second derivatives

are also continuous at the knots. Natural splines place two additional knots at the ends

of the data, and constrain the function to be linear beyond these points)

ä Although this result seems to imply that n parameters are required (when all xi-values

are distinct), the roughness penalty imposes additional constraints on the solution,

typically reducing the equivalent number of parameters for the smoothing spline

substantially, and preventing f(x) from interpolating the data. Indeed, it is common to

select the smoothing parameter τ indirectly by setting the equivalent number of

parameters for the smoother.
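
ä In R, smooth.spline() fits a cubic smoothing spline and allows the smoothing parameter to be set indirectly through the equivalent degrees of freedom, as just described; a minimal sketch on simulated data:

# Smoothing spline sketch: smoothing controlled via equivalent degrees of freedom
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
fit <- smooth.spline(x, y, df = 6)            # roughly 6 'equivalent parameters'
plot(x, y)
lines(predict(fit, x = seq(0, 10, length = 200)))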


Scatter Plot smoothers

ä Because there is an explicit objective-function to optimize, smoothing splines are more

elegant mathematically than localised regression.

ä However the fits of smoothing splines and localised regression with the same equivalent

number of parameters are usually very similar

ä Both localised regression and smoothing splines have an adjustable smoothing

parameter τ . This parameter may be selected by visual trial and error, picking a value

that balances smoothness against fidelity to the data.

ä More formal methods of selecting smoothing parameters typically try to minimize the

mean-squared error of the fit, either by employing a formula approximating the

mean-square error, or by some form of cross-validation


Additive models

ä The generalisation of a scatter plot smoother such as localised regression to several

predictors i.e. yi = β0 + f(xi1, . . . , xip) + εi is mathematically straightforward

(more difficult for smoothing splines, but variants such as thin plate splines can be

employed)

ä However this is often problematic in practice (the curse of dimensionality —

multidimensional spaces grow exponentially more sparse and very large samples are

required for non-parametric estimation of multidimensional functions)

ä The additive model yi = β0 +f1(xi1)+ . . .+fp(xip)+ εi mentioned earlier offers

a viable multi-variable alternative; where the partial-regression functions fj(·) are each

fit using a univariate smoother, such as localised regression or smoothing splines


Generalised Additive models

ä The generalized additive model (GAM), extends the additive model in the same spirit

as the GLM extends the linear model, namely in allowing a link function and in allowing

non-normal distributions from the exponential family

ä The linear predictor is of the form: ηi = β0 + f1(xi1) + . . . + fp(xip)

ä A GAM model can be fitted by iteratively fitting weighted additive models by localised

regression or smoothing splines in an analogous way as the iteratively weighted least

squares procedure relates to ordinary least squares

ä We have already commented on variations of the additive model which include models

in which interactions are allowed between some predictors and also semi-parametric

models where some predictors enter linearly and others in a non-parametric way.

These extensions also apply to GAMs


Examples of non-parametric modelling in R

ä To illustrate the use of the additive model we return to the data on prestige scores for

Canadian occupations

ä See the notes on worked example 8

ä As a further example of non-parametric modelling, this time using a GAM consider

again the data on predicting US recession

ä See the notes on worked example 9


Introductory Statistical Modelling in R — worked examples

Data sets for these examples are included in the course R workspace which is provided as a computer file.

1. A simple linear model

Consider data on 102 occupations derived from a Canadian social survey and contained in the data frame prestige. The response variable is an occupational ‘prestige’ score (y); there are two possible explanatory variables — the average income (x1) and average educational level (x2) of those in the occupation.

Find out what’s in the data frame:

> names(prestige)
[1] "prestige" "income" "education"

Explore relationships with a scatterplot matrix:

> library(lattice)
> splom(~prestige)

[Figure: scatter plot matrix of prestige, income and education]

Next fit a linear regression model: y ∼ N(β0 + β1x1 + β2x2, σ2)

> prestige.lm = lm(prestige ~ income + education, data=prestige)
> summary(prestige.lm)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.847779   3.218977   -2.13    0.036
income       0.001361   0.000224    6.07    0.000
education    4.137444   0.348912   11.86    0.000

Residual standard error: 7.81 on 99 degrees of freedom
Multiple R-Squared: 0.798, Adjusted R-squared: 0.794
F-statistic: 196 on 2 and 99 DF, p-value: 0.000

Overall fit is not bad (R2 ≈ 0.8) and both variables are significant. How well do fitted values of prestige correspond to those observed? Is there any pattern in the residuals? Are the residuals normally distributed?

> plot(prestige.lm$fitted.values,prestige$prestige,xlab="predicted prestige",
+ ylab="observed prestige")
> plot(prestige.lm$fitted.values,prestige.lm$residuals,xlab="predicted prestige",
+ ylab="residuals")
> plot(prestige.lm,which=2,sub="")

[Figures: observed against predicted prestige, residuals against predicted prestige, and normal Q-Q plot of the standardised residuals]

What does the fitted surface look like?


> inc=seq(min(prestige$income),max(prestige$income),len=25)

> ed=seq(min(prestige$education),max(prestige$education),len=25)

> newdat=expand.grid(income=inc,education=ed)

> fit.prestige=matrix(predict(prestige.lm,newdat),25,25)

> persp(inc,ed,fit.prestige,theta=45,phi=30,ticktype="detailed",xlab="Income",

+ ylab="Education",zlab="Prestige",expand=2/3,shade=0.5)

[Figure: perspective plot of the fitted prestige surface against income and education]

2. A more complex multiple regression

High concentrations of certain harmful algae in rivers are a serious ecological problem with a strong impact not only on river life forms, but also on water quality. Being able to monitor and perform an early forecast of algae blooms is essential to improve the quality of rivers. Chemical monitoring of water samples is cheap and easily automated, but biological analysis to identify the algae present involves microscopic examination, requires trained manpower and is therefore both expensive and slow. As such, models that are able to accurately predict algae frequencies based on chemical properties would facilitate the creation of cheap and automated systems for monitoring harmful algae blooms. To build these we need a better understanding of the factors influencing the algae frequencies. Namely, we want to understand how these frequencies are related to certain chemical attributes of water samples as well as other characteristics of the samples (like season of the year, type of river, etc.)

The data frame algae contains data on 148 water samples collected in different European rivers. Each water sample is described by 11 potential explanatory variables. Three of these variables are categorical and describe the season of the year when the sample was collected, the size of the river, and the water speed of the river. The 8 remaining variables are values of different chemical measures in the water sample, viz:

Maximum pH value

Minimum value of O2 (Oxygen)

Mean value of Cl (Chloride)

Mean value of NO- (Nitrates)

Mean value of NH+ (Ammonium)

Mean of PO3- 4 (Orthophosphate)

Mean of total PO4 (Phosphate)

Mean of Chlorophyll

The response variables (a1 — a7) associated with each of the water samples are frequency numbers of 7 different types of harmful algae. We focus here on just algae type a1. First we get an idea of the distribution of a1:

> hist(algae$a1, prob=T, xlab="",main="Histogram of algae type a1")

> lines(density(algae$a1))

> rug(jitter(algae$a1))

[Figure: histogram of algae type a1 with superimposed kernel density estimate and rug]

This is a skewed distribution (suggesting maybe a log transform; we return to that later). Plots would be useful to see how the response depends on the quantitative explanatory variables:

> par(mfrow=c(2,2))

> for (i in 4:7)plot(algae[,i],algae$a1,ylab="algae a1",xlab=colnames(algae)[i])

> for (i in 8:11)plot(algae[,i],algae$a1,ylab="algae a1",xlab=colnames(algae)[i])

[Figures: scatter plots of algae a1 against each of mxPH, mnO2, C1, NO3, NH4, oPO4, PO4 and Chla]

If we want to study how the response depends on variables which are categorical (factors), different tools are more useful. For instance, we can obtain a set of box plots for the variable a1, each for a different value of the variable river size. These graphs allow us to study how this categorical variable may influence the distribution of the values of a1. The code is:

> library(lattice)
> bwplot(size ~ a1, data=algae,ylab="River Size",xlab="Alga a1")
> bwplot(season ~ a1, data=algae,ylab="Season",xlab="Algae a1")
> bwplot(speed ~ a1, data=algae,ylab="River Speed",xlab="Algae a1")

The first argument of this instruction can be read as ‘plot a1 for each value of size’.

[Figures: box plots of alga a1 by river size, by season and by river speed]

This type of conditioned plot is not restricted to nominal variables, nor to a single factor. You can carry out the same kind of conditioning study with continuous variables as long as you previously discretise them. For example, the behaviour of the frequency of algae a1 conditioned by season and mnO2 would be obtained by:

> mnO2x=equal.count(algae$mnO2,number=4,overlap=1/5)
> stripplot(season ~ a3|mnO2x,data=algae)

[Figure: strip plots of the algae frequency by season for four overlapping ranges of mnO2]

We now try a linear regression with all possible explanatory variables:

> lm.a1=lm(a1~.,data=algae[,1:12])

> summary(lm.a1)

Residuals:
     Min       1Q   Median       3Q      Max
 -36.735  -10.185   -3.658    9.033   67.047

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept)  42.0025989  34.9258548   1.203   0.2313
seasonspring -1.9058846   5.1069487  -0.373   0.7096
seasonsummer -0.6260164   4.7353343  -0.132   0.8950
seasonwinter  3.4902516   4.7524073   0.734   0.4640
sizemedium   -1.7054810   4.6460746  -0.367   0.7141
sizesmall     6.3655384   5.0300145   1.266   0.2079
speedlow      4.8056057   6.2165099   0.773   0.4409
speedmedium  -2.5158051   4.0088750  -0.628   0.5314
mxPH         -1.6130011   4.0320894  -0.400   0.6898
mnO2          0.2575212   0.9676965   0.266   0.7906
C1           -0.0244247   0.0421359  -0.580   0.5631
NO3          -1.1875016   0.7050358  -1.684   0.0945 .
NH4           0.0009854   0.0013435   0.733   0.4646
oPO4         -0.0198570   0.0500693  -0.397   0.6923
PO4          -0.0549143   0.0409323  -1.342   0.1820
Chla         -0.1116463   0.1043608  -1.070   0.2867
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.15 on 132 degrees of freedom
Multiple R-Squared: 0.3317, Adjusted R-squared: 0.2557
F-statistic: 4.367 on 15 and 132 DF, p-value: 1.363e-06

Note how R handled the three categorical variables. For each factor variable with k levels, R will create k - 1 auxiliary variables. These variables have the values 0 or 1. A value of 1 means that the associated value of the factor is ‘present’, and that will also mean that the other auxiliary variables will have the value 0. If all k - 1 variables are 0 then it means that the factor variable has the remaining kth value. Looking at the summary presented above, we can see that R has created three auxiliary variables for the factor season (seasonspring, seasonsummer and seasonwinter). This means that if we have a water sample with the value ‘autumn’ in the variable season, all these three auxiliary variables will be set to zero.

Although there is a significant overall relationship between the response and the explanatory variables (the p value of the overall F test is very small), the proportion of variance explained by this model is not very impressive (around 26.0%). Some further diagnostics on the model residuals may be obtained by:

> plot(lm.a1, which=1)

> plot(lm.a1, which=2)

> plot(lm.a1, which=4)

[Figures: residuals vs fitted values, normal Q-Q plot and Cook's distance plot for lm.a1]

Here there is clear evidence of non-constant variance and non-normality in the standardised residuals, plus the presence of a highly influential point. Perhaps a transformation of the response would be sensible, together with removal of observation 117. A log() transformation is an obvious choice, so the model is now log(y) ∼ N(β0 + β1x1 + . . . + βpxp, σ2)

> algae=algae[-117,]
> lm.a1=lm(log(a1)~.,data=algae[,1:12])

> summary(lm.a1)

> plot(lm.a1, which=1)

> plot(lm.a1, which=2)

> plot(lm.a1, which=4)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4507 -0.7201  0.1151  0.6433  2.1242

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.461e+00  1.971e+00   2.263   0.0253 *
seasonspring -1.582e-01  2.929e-01  -0.540   0.5901
seasonsummer  7.518e-02  2.674e-01   0.281   0.7790
seasonwinter  1.099e-01  2.686e-01   0.409   0.6829
sizemedium   -1.386e-01  2.624e-01  -0.528   0.5984
sizesmall     4.947e-01  2.883e-01   1.716   0.0885 .
speedlow      4.682e-01  3.552e-01   1.318   0.1897
speedmedium  -1.988e-01  2.309e-01  -0.861   0.3907
mxPH         -1.178e-01  2.276e-01  -0.518   0.6054
mnO2         -3.482e-02  5.463e-02  -0.637   0.5250
C1           -1.408e-03  2.392e-03  -0.589   0.5572
NO3          -8.367e-02  4.479e-02  -1.868   0.0640 .
NH4          -8.145e-05  1.349e-04  -0.604   0.5469
oPO4         -2.672e-03  2.842e-03  -0.940   0.3488
PO4          -3.666e-03  2.357e-03  -1.555   0.1223
Chla         -7.911e-03  6.009e-03  -1.316   0.1903
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.024 on 131 degrees of freedom
Multiple R-Squared: 0.4318, Adjusted R-squared: 0.3667
F-statistic: 6.637 on 15 and 131 DF, p-value: 1.983e-10

[Figures: residuals vs fitted values, normal Q-Q plot and Cook's distance plot for the log-transformed model]

The residual diagnostics are generally better (although still not good). But looking at the significance of some of the coefficients (given the presence of the other variables) we may question the inclusion of some of them in the model.

We can get a breakdown of the sequential reduction in the residual sum of squares (deviance) as each explanatory variable is added to the model via:


> anova(lm.a1)

Analysis of Variance Table

Response: log(a1)
           Df  Sum Sq Mean Sq F value    Pr(>F)
season      3   3.006   1.002  0.9554 0.4159644
size        2  29.535  14.768 14.0808 2.890e-06 ***
speed       2  15.062   7.531  7.1807 0.0010987 **
mxPH        1   0.904   0.904  0.8621 0.3548704
mnO2        1   0.491   0.491  0.4683 0.4949787
C1          1  12.737  12.737 12.1445 0.0006702 ***
NO3         1  14.015  14.015 13.3628 0.0003703 ***
NH4         1   0.345   0.345  0.3286 0.5674589
oPO4        1  22.247  22.247 21.2123 9.615e-06 ***
PO4         1   4.247   4.247  4.0490 0.0462477 *
Chla        1   1.818   1.818  1.7331 0.1903182
Residuals 131 137.389   1.049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

One way (there are many!) to reduce the model size while retaining the same predictive power is to sequentially drop out the least significant variables until all of those remaining are significant (known as backward elimination); the criterion used to select the order of dropping in R is the largest reduction in the AIC criterion. The model is then refitted and terms reassessed etc. This process can be achieved in R by using:

> final.lm=step(lm.a1)

Start:  AIC= 22.06
log(a1) ~ season + size + speed + mxPH + mnO2 + C1 + NO3 + NH4 +
    oPO4 + PO4 + Chla

         Df Sum of Sq     RSS    AIC
- season  3     1.533 138.923 17.692
- mxPH    1     0.281 137.671 20.362
- C1      1     0.363 137.753 20.449
- NH4     1     0.383 137.772 20.470
- mnO2    1     0.426 137.815 20.516
- oPO4    1     0.927 138.317 21.050
- Chla    1     1.818 139.207 21.993
<none>                137.389 22.061
- PO4     1     2.536 139.926 22.750
- speed   2     5.346 142.735 23.672
- NO3     1     3.660 141.049 23.925
- size    2     8.057 145.446 26.438

Step:  AIC= 17.69
log(a1) ~ size + speed + mxPH + mnO2 + C1 + NO3 + NH4 + oPO4 +
    PO4 + Chla

         Df Sum of Sq     RSS    AIC
- NH4     1     0.201 139.123 15.904
- mxPH    1     0.234 139.156 15.939
- mnO2    1     0.296 139.218 16.005
- C1      1     0.463 139.386 16.182
- oPO4    1     0.985 139.907 16.730
- Chla    1     1.774 140.697 17.558
<none>                138.923 17.692
- PO4     1     2.710 141.633 18.532
- NO3     1     3.297 142.219 19.140
- speed   2     5.814 144.737 19.719
- size    2     7.935 146.858 21.858

...etc....

Step:  AIC= 10.16
log(a1) ~ size + speed + NO3 + PO4 + Chla

         Df Sum of Sq     RSS    AIC
<none>                141.274 10.159
- Chla    1     1.962 143.235 10.186
- NO3     1     4.858 146.132 13.129
- speed   2     7.605 148.879 13.867
- size    2     8.421 149.694 14.670
- PO4     1    32.546 173.820 38.635

So the final model contains just five of the original 11 explanatory variables.


The associated coefficients and diagnostics are:

> summary(final.lm)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4043 -0.7531  0.1851  0.6418  2.2262

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.130710   0.254561  12.298  < 2e-16 ***
sizemedium  -0.020330   0.236579  -0.086   0.9316
sizesmall    0.541427   0.261681   2.069   0.0404 *
speedlow     0.553802   0.323625   1.711   0.0893 .
speedmedium -0.225629   0.204490  -1.103   0.2718
NO3         -0.085147   0.038946  -2.186   0.0305 *
PO4         -0.005717   0.001010  -5.659 8.38e-08 ***
Chla        -0.007637   0.005497  -1.389   0.1670
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 139 degrees of freedom
Multiple R-Squared: 0.4157, Adjusted R-squared: 0.3863
F-statistic: 14.13 on 7 and 139 DF, p-value: 9.095e-14

[Figures: residuals vs fitted values, normal Q-Q plot and Cook's distance plot for final.lm]

Still far from marvellous, but maybe OK to try some predictions (leave that up to you — see the exercises).
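
As a possible starting point for those predictions (a sketch only; remember the model is for log(a1), so predictions need back-transforming to the original scale):

> # Sketch: predicted a1 with prediction intervals for the first few water samples
> pred = predict(final.lm, newdata=algae[1:5,], interval="prediction")
> exp(pred)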

3. A simple Binomial GLM

In 1986 the NASA space shuttle Challenger exploded shortly after it was launched. After an investigation it was concluded that this had occurred as a result of an ‘O’ ring failure. ‘O’ rings are toroidal seals and in the shuttle six are used to prevent hot gases escaping and coming into contact with fuel supply lines.

Data had been collected from 23 previous shuttle flights on the ambient temperature at the launch and the number of ‘O’ rings, out of the six, that were damaged during the launch. This data is available in the data frame orings.

NASA staff analysed the data to assess whether the probability of ‘O’ ring failure damage was related to temperature, but it is reported that they excluded the zero responses (i.e., none of the rings damaged) because they believed them to be uninformative. The resulting analysis led them to believe that the risk of damage was independent of the ambient temperature at the launch. The temperatures for the 23 previous launches ranged from 53 to 81 degrees Fahrenheit, while the launch temperature when the accident occurred was 31 degrees Fahrenheit (i.e., -0.6 degrees Centigrade).

Let’s look at the data frame:

> orings[,1:3]

> orings

   Failures Total Temperature
1         2     6          53
2         1     6          57
3         1     6          58
...etc....
17        0     6          75
22        0     6          79
23        0     6          81


A binomial GLM with ni equal to ‘Total’ and yi equal to ni − ‘Failures’, and with ‘Temperature’ as the explanatory variable, would seem sensible here. The R function to fit a GLM is glm() which uses the form glm(formula, family=family.generator, data=data.frame). The only difference from lm() is the family.generator, which is the instrument by which the random component is described. Some possible values are: binomial, gaussian, Gamma, poisson, quasibinomial, quasipoisson. Where there is a choice of link functions for a family, the name of the link may also be supplied with the family name, in parentheses as a parameter; if it is not supplied, the default link for that family will be used.

So a call such as glm(y ~ x1 + x2, family = gaussian, data = sales) achieves the same result as lm(y ~ x1 + x2, data=sales), but much less efficiently, since iteratively reweighted least squares is far more computationally demanding than least squares.

For the ‘O’ ring data we are interested in the binomial family and we will use the default logistic link function. To fit a binomial model using glm() there are three possibilities for the response: if the response is a vector it is assumed to hold binary data, and so must be a 0/1 vector; if the response is a factor, its first level is taken as failure (0) and all other levels as success (1); if the response is a two-column matrix it is assumed that the first column holds the number of ‘successes’ and the second holds the number of ‘failures’. Here we need the last of these conventions so we use:

> orings$success.matrix=cbind(orings$Total-orings$Failures,orings$Failures)
> orings.glm = glm(success.matrix ~ Temperature, family=binomial, data=orings)
> summary(orings.glm)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-2.65152  0.04379  0.54117  0.78299  0.95227

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.08498    3.05247  -1.666   0.0957 .
Temperature  0.11560    0.04702   2.458   0.0140 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 24.230 on 22 degrees of freedom
Residual deviance: 18.086 on 21 degrees of freedom
AIC: 35.647

Note that the effect of temperature is significant. The residual deviance shows no overdispersion. The residual diagnostics are:

> plot(orings.glm,which=1,sub="")

> plot(orings.glm,which=2,sub="")

> plot(orings.glm,which=4,sub="")

[Figures: residuals vs fitted values, normal Q-Q plot of the standardised deviance residuals and Cook's distance plot for orings.glm]

Not particularly good, mainly because we have few data points which have non-zero numbers of failures. We can see what the fitted relationship looks like for the probability of success, and associated standard errors of prediction, by:

> newfits=predict(orings.glm,newdata=data.frame(Temperature=c(20:80)),type="response",se=T)

> plot(orings$Temperature,orings.glm$y,xlab="Temperature",ylab="Probability of success",xlim=c(20,80),ylim=c(0,1))

> lines(c(20:80),newfits$fit)

> lines(c(20:80),newfits$fit-1.96*newfits$se.fit,lty="dotted")

> lines(c(20:80),newfits$fit+1.96*newfits$se.fit,lty="dotted")

[Figure: fitted probability of success against temperature (20 to 80 degrees) with approximate 95% confidence limits (dotted) and the observed data]

Note that the probability of success at 31 degrees is very uncertain and could well be very low. One should be very worried about launch under these circumstances. If we exclude from the data set all observations with 0 failures and then refit the model we get:

> orings1=orings[orings$Failures!=0,]
> orings1.glm = glm(success.matrix ~ Temperature, family=binomial, data=orings1)
> summary(orings1.glm)

Deviance Residuals:
[1] -0.6891  0.2836  0.2850  0.2919  0.3015  0.3015 -0.6560

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.389528   3.195752   0.435    0.664
Temperature -0.001416   0.049773  -0.028    0.977

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1.3347 on 6 degrees of freedom
Residual deviance: 1.3339 on 5 degrees of freedom
AIC: 18.894

Temperature now has no significant effect on the probability of success, which is effectively constant at all temperatures. The estimate at 31 degrees (and all other temperatures) and its standard error are:

> predict(orings1.glm,newdata=data.frame(Temperature=31),type="response",se=T)

$fit
[1] 0.7934151

$se.fit
[1] 0.274281

This gives an approximate 95% confidence interval of (0.79 ± 0.53). Still not very confident of success! But I suppose NASA might argue this gives a reasonable chance (as they mistakenly did!)

4. A Poisson GLM

The data frame soccer contains data published in the Times, on August 19, 2003, on numbers of football-related arrests during the 2002-2003 season for football clubs in the 4 divisions in England and Wales. The variables are ‘club’, ‘division’, ‘totarr’ (total arrests) and ‘viol’ (arrests for violent disorder). We might consider a Poisson GLM here with the default log link. The response is ‘viol’ (or perhaps ‘totarr’) and the possible explanatory variables are ‘club’ and ‘division’ (both categorical, i.e. factors). We first try a model with just ‘division’:


> soccer.glm = glm(viol ~ division, family=poisson, data=soccer)
> summary(soccer.glm)

glm(formula = viol ~ division, family = poisson, data = soccer)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.4496 -1.8930 -1.1821  0.2318  6.7886

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.69562    0.09578  17.703  < 2e-16 ***
division2   -0.26850    0.13847  -1.939   0.0525 .
division3   -1.11247    0.18008  -6.178 6.51e-10 ***
divisionp    0.08778    0.13258   0.662   0.5079
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 610.70 on 87 degrees of freedom
Residual deviance: 548.67 on 84 degrees of freedom
AIC: 754.87

Division clearly has a significant effect here, although the model is heavily overdispersed (the residual deviance is about 7 times what it should be). Since R has taken division 1 to be the base level and introduced auxiliary variables for the other divisions, the coefficient values for these represent differences from division 1. The indication is no difference between division 1 and the premier division, very significantly lower numbers of violent arrests in division 3 than division 1, and somewhat lower levels in division 2 than division 1. There is little point in using ‘club’ since this represents a saturated model (one mean for each observation), although I suppose it will show which clubs have significantly different numbers of violent arrests compared with whichever club R takes as the base (try it for yourself). Perhaps we should try a quasi-Poisson model with ‘division’ to try and compensate for the overdispersion:

> soccer.glm = glm(viol ~ division, family=quasipoisson, data=soccer)
> summary(soccer.glm)

glm(formula = viol ~ division, family = quasipoisson, data = soccer)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.4496 -1.8930 -1.1821  0.2318  6.7886

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.69562    0.28219   6.009 4.64e-08 ***
division2   -0.26850    0.40796  -0.658    0.512
division3   -1.11247    0.53056  -2.097    0.039 *
divisionp    0.08778    0.39061   0.225    0.823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 8.679932)

Null deviance: 610.70 on 87 degrees of freedom
Residual deviance: 548.67 on 84 degrees of freedom
AIC: NA

Notice that the estimates are exactly the same as before, but the scale (dispersion) parameter is now estimated as 8.7 (as opposed to the theoretical value of 1 for the Poisson) and thus the standard errors for the estimates have been adjusted accordingly. It now appears there is a significantly lower number of violent arrests in division 3, but the other divisions are not significantly different. (It may be interesting to see what the situation is for ‘total arrests’ — try it for yourselves; one possible starting point is sketched below.)
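
A hedged sketch for that check (results not shown; try it and compare with the analysis for ‘viol’):

> # Sketch: the same quasi-Poisson analysis with total arrests as the response
> totarr.glm = glm(totarr ~ division, family=quasipoisson, data=soccer)
> summary(totarr.glm)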

5. A further Binomial GLM

Recession is associated with bad times like high unemployment and low production, or more generally with a stagnant economy. Strictly speaking though, a recession is the period when overall economic activity is actually declining and production, employment, and sales are falling below normal. A recession starts just after a business cycle peak, the high point in the level of economic activity, and ends at the business cycle trough, the low point. A popular rule of thumb is that two consecutive quarterly declines in real GDP signal a recession. This rule is consistent with the dispersion and duration requirements for a recession and with the average recessionary path of real GDP; however, two very small quarterly declines might not produce the depth required for a recession.


The most widely accepted determination of business cycle peaks and troughs in the US is made by the National Bureau of Economic Research, NBER. NBER identified six US recessions from January 1960 through September 1999. Here we consider whether the yield curves can serve as a predictor of these US recessions. We use a binomial model with a logistic link to directly relate the probability of being in a recession to the yield curves of the 10-year treasury bond and the 3-month treasury bill six months before. We analyze the model and evaluate a six month ahead forecast.

The data are contained in the data frame recession and comprise monthly figures from 1966 through 2001 on variables ‘date’, ‘recession’ (a 1/0 variable), ‘tbills3m’ (3 Month Treasury Bills), ‘tbonds10y’ (10 Years Treasury Bonds). First we rearrange the data slightly into a new data frame where the variables are suitably lagged and time is converted from year/month format into a suitable metric measurement:

> n = length(recession[,1])

> response = recession[,2][7:n]

> predictor1 = recession[,3][1:(n-6)]

> predictor2 = recession[,4][1:(n-6)]

> data = data.frame(cbind(response, predictor1, predictor2))

> time = floor(recession[,1]/100) + (recession[,1]-100*floor(recession[,1]/100)-0.5)/12

> time = time[1:(n-6)]

Next we fit the binomial model:

> fit = glm(response ~ predictor1 + predictor2, family=binomial, data=data)

> summary(fit)

glm(formula = response ~ predictor1 + predictor2, family = binomial,
    data = data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.2443 -0.3962 -0.2637 -0.1334  2.1966

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -4.3101     0.6256  -6.890 5.58e-12 ***
predictor1    1.3769     0.1828   7.533 4.95e-14 ***
predictor2   -0.9159     0.1686  -5.432 5.58e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 353.80 on 426 degrees of freedom
Residual deviance: 217.37 on 424 degrees of freedom
AIC: 223.37

Both the predictors are significant. The model is somewhat underdispersed, but not badly so. The residual diagnostics are:

> plot(fit,which=1,sub="")

> plot(fit,which=2,sub="")

> plot(fit,which=4,sub="")

[Figures: residuals vs fitted values, normal Q-Q plot of the standardised deviance residuals and Cook's distance plot for the recession model]

Not particularly good (probably because of a lack of independence — this is typical with time-series data and violates one of the assumptions of the GLM). Nevertheless, we will proceed to predict the probabilities of recession and plot these along with the original data:

> plot(time, response, type="h", col=10, main="GLM predictions")

> lines(time, predict(fit, newdata=data, type="response"))

[Figure: ‘GLM predictions’ — the observed recession indicator over time (1965 to 2000) with the fitted recession probabilities superimposed]

6. An example of a non-standard model fitted by direct maximisation of likelihood

The data frame nlmodel contains fictitious data on a response variable y and a single explanatory variable x. A plot of the data shows a clear non-linear relationship:

> plot(nlmodel$x,nlmodel$y,xlab="x",ylab="y")

[Figure: scatter plot of y against x for the nlmodel data, showing a clear non-linear relationship]

Suppose for these data we wish to consider the model:

yi = θ1 xi / (θ2 + xi) + εi   with εi ∼ N(0, σ2)

Because of the form of the systematic component f(θ1, θ2, xi) = θ1 xi / (θ2 + xi), this model cannot be directly handled in a linear or GLM framework; we therefore consider model fitting by direct maximisation of the likelihood. Given the assumption of a Gaussian random component, the likelihood here is:

L(θ1, θ2, σ2) = (2πσ2)^(−n/2) exp( −Σi [yi − f(θ1, θ2, xi)]² / (2σ2) )

So the log-likelihood is:

−(n/2) log(2πσ2) − Σi [yi − f(θ1, θ2, xi)]² / (2σ2) = −(n/2) log(2πσ2) − Σi [yi − θ1 xi / (θ2 + xi)]² / (2σ2)

It is clear that for any σ2 > 0 this log-likelihood will be maximised when Σi [yi − θ1 xi / (θ2 + xi)]² is minimised, so we first write an R function to evaluate this quantity:

> fn = function(theta) sum((nlmodel$y - (theta[1]*nlmodel$x)/(theta[2] + nlmodel$x))^2)

Then in order to do the fit we use the R function nlm(). We need initial estimates of the parameters θ1 and θ2 to use this, and one way to find sensible starting values is to plot the data, guess some parameter values, and superimpose the model curve using those values.


> plot(nlmodel$x, nlmodel$y,xlab="x",ylab="y")

> xfit = seq(.02, 1.1, .05)

> yfit = 200 * xfit/(0.1 + xfit)

> lines(xfit, yfit)

[Figure: scatter plot of y against x with the model curve for the guessed starting values superimposed]

We could do better, but the starting values of θ1 = 200 and θ2 = 0.1 seem adequate. Now do the fit:

> out = nlm(fn, p= c(200, 0.1), hessian = T)

After the fitting, out$estimate holds the mle estimates of the parameters, out$minimum is the corresponding value of the function minimised, and out$hessian is the second derivative (Hessian) evaluated at the minimum.

> out$estimate
[1] 212.68384222   0.06412146

We thus plot the maximum likelihood fit by:

> plot(nlmodel$x, nlmodel$y,xlab="x",ylab="y")

> xfit = seq(.02, 1.1, .05)

> yfit = out$estimate[1] * xfit/(out$estimate[2] + xfit)

> lines(xfit, yfit)

[Figure: scatter plot of y against x with the maximum likelihood fit superimposed]

Looking back to the original log-likelihood, the profile log-likelihood for σ2 (i.e. the log-likelihood with all other parameters replaced by their mle estimates) is:

−(n/2) log(2πσ2) − out$minimum / (2σ2)

which is maximised when σ2 = out$minimum/n (the customary mean square residual — find the actual value for yourself!).

In general we obtain approximate standard errors of the mle estimates for θ1 and θ2 from minus the inverse of the Hessian of the log-likelihood evaluated at the maximum. So here that amounts to (convince yourself!):

> sqrt(diag(2*out$minimum/(length(nlmodel$y)) * solve(out$hessian)))

[1] 8.57393083 0.01045205

A 95% confidence interval would be given by the parameter estimate ± 1.96 times the standard error.
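
For completeness, a small sketch of how these quantities can be computed from the objects above (values deliberately not shown; compute them yourself as suggested):

> # Sketch: mle of sigma^2, standard errors and approximate 95% confidence intervals
> sigma2.hat = out$minimum/length(nlmodel$y)
> se = sqrt(diag(2*out$minimum/length(nlmodel$y) * solve(out$hessian)))
> cbind(lower = out$estimate - 1.96*se, estimate = out$estimate, upper = out$estimate + 1.96*se)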

7. A simple binomial GLM fitted via a Bayesian approach using WinBUGS

We return to the data on ‘O’ ring failures for the Challenger space shuttle and consider fitting the previous model using a Bayesian approach and the link between R and WinBUGS.


First we construct a text file, say ex7.txt (in the current R working directory), containing the appropriate WinBUGS code to fit the model and do predictions of the response at 31 degrees, i.e.

model{
for(i in 1:N){
success[i]~dbin(p[i],total[i])
logit(p[i])<-beta0+beta1*temperature[i]
}
logit(probsuccess)<-beta0+beta1*31.0
numsuccess~dbin(probsuccess,6)
beta0~dnorm(0,0.001)
beta1~dnorm(0,0.001)
}

We now set up the data and initial values for the MCMC chains in R and then call WinBUGS via the R2WinBUGS package. This provides a function bugs() which takes data and starting values as input, then automatically writes a WinBUGS script, calls a specified model file, does the MCMC simulation in WinBUGS and saves the simulated posterior values for easy access in R.

> library(R2WinBUGS)

> N=nrow(orings)

> success=orings$Total-orings$Failures

> total=orings$Total

> temperature=orings$Temperature

> bugsdata=list("N","success","total","temperature")

> init1<-list(beta0 = 0, beta1 = 0,numsuccess=0)

> init2<-list(beta0 = 0.5, beta1 = 0.5,numsuccess=0)

> bugsinits=list(init1,init2)

> bugsparam=c("beta0","beta1","probsuccess","numsuccess")

> bugsresults=bugs(data=bugsdata,inits=bugsinits,parameters.to.save=bugsparam,model.file="ex7.txt",

+ digits=5,n.thin=10,n.burnin=5000,n.iter=10000,n.chains=2,DIC=T,debug=T)
> bugsresults$summary

                  mean         sd          2.5%         25%        50%       75%      97.5% Rhat
beta0       -4.9182481 3.07638421 -11.072750000 -7.03425000 -4.9360000 -2.737500  0.5851300 1.01
beta1        0.1145268 0.04700652   0.031937250  0.08035500  0.1141500  0.145950  0.2113075 1.01
probsuccess  0.2845935 0.24944414   0.009319077  0.07479438  0.1993996  0.451475  0.8276609 1.01
numsuccess   1.7140000 1.73355392   0.000000000  0.00000000  1.0000000  3.000000  6.0000000 1.02
deviance    33.6255500 1.91256838  31.719749962 32.28000000 33.0649996 34.430000 39.0017485 1.00

The convergence traces for the two MCMC chains (provided by default in WinBUGS) for each parameter and the deviance should be inspected before the results are used; the Rhat values close to 1 in the summary above also suggest that the chains have converged.
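If the WinBUGS trace windows are not to hand (for example when running with debug=F), a rough check can also be made back in R from the saved simulations. A minimal sketch, assuming the bugsresults object above and that its sims.array component is indexed as (stored iteration, chain, parameter) with the parameters stored in the order given in parameters.to.save:

> param.names = c("beta0","beta1","probsuccess","numsuccess")
> par(mfrow=c(2,2))
> for(j in 1:4){
+   plot(bugsresults$sims.array[,1,j], type="l", col="red",
+        xlab="stored iteration", ylab=param.names[j], main=paste("Trace for", param.names[j]))
+   lines(bugsresults$sims.array[,2,j], col="blue")    # second chain overlaid
+ }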


On return to R we can plot the full posteriors for all the model parameters and the predictions:

> par(mfrow=c(2,2))

> y<-density(bugsresults$sims.matrix[,1])

> plot(y,type="l",col="red",main=paste("beta0","(",signif(bugsresults$mean[[1]],3),")"),
+ xlab="",ylab="")

> y<-density(bugsresults$sims.matrix[,2])

> plot(y,type="l",col="red",main=paste("beta1","(",signif(bugsresults$mean[[2]],3),")"),

+ xlab="",ylab="")

> y<-density(bugsresults$sims.matrix[,3])

> plot(y,type="l",col="red",main=paste("Probability of success at 31 deg","(",signif(bugsresults$mean[[3]],3),")"),

+ xlab="",ylab="")

> y<-density(bugsresults$sims.matrix[,4])

> plot(y,type="l",col="red",main=paste("Number successes at 31 deg","(",signif(bugsresults$mean[[4]],3),")"),

+ xlab="",ylab="")

[Figure: posterior density plots for beta0 (posterior mean −4.92), beta1 (0.115), the probability of success at 31 deg (0.285) and the number of successes at 31 deg (1.71)]

This strongly confirms that the chances of multiple O-ring failures were very high on the launch day!
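To make that statement a little more concrete, the posterior predictive draws of numsuccess used in the last panel above (column 4 of sims.matrix) can be converted into a posterior probability of seeing two or more O-ring failures out of six at 31 degrees. A minimal sketch:

> numsuccess.draws = bugsresults$sims.matrix[,4]   # predicted number of successes out of 6 at 31 deg
> failures.draws = 6 - numsuccess.draws            # corresponding number of O-ring failures
> mean(failures.draws >= 2)                        # posterior probability of multiple failures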

8. A simple additive model

Consider again the data on 102 occupations derived from a Canadian social survey and contained in the data frame prestige. Recall that the response variable is an occupational ‘prestige’ score (y) and that there are two possible explanatory variables: average income (x1) and average educational level (x2) of those in the occupation. We first fit a simple scatter-plot smoother y = f(x1) + ε to average income using localised regression (with various spans and degrees) and plot the fitted functions.

> plot(prestige$income,prestige$prestige,xlab="Average Income",ylab="Prestige")

> lines(seq(0,25000,10),predict(loess(prestige~income,data=prestige,span=0.5,degree=2),
+ data.frame(income=seq(0,25000,10))))
> lines(seq(0,25000,10),predict(loess(prestige~income,data=prestige,span=0.5,degree=1),
+ data.frame(income=seq(0,25000,10))),lty="dashed")
> lines(seq(0,25000,10),predict(loess(prestige~income,data=prestige,span=0.8,degree=1),
+ data.frame(income=seq(0,25000,10))),lty="dotted")

and we do the same, y = f(x2) + ε, for average educational level:

> plot(prestige$education,prestige$prestige,xlab="Average Education",ylab="Prestige")
> lines(5:16,predict(loess(prestige~education,data=prestige,span=0.5,degree=2),
+ data.frame(education=5:16)))
> lines(5:16,predict(loess(prestige~education,data=prestige,span=0.5,degree=1),
+ data.frame(education=5:16)),lty="dashed")
> lines(5:16,predict(loess(prestige~education,data=prestige,span=0.8,degree=1),
+ data.frame(education=5:16)),lty="dotted")


[Figure: prestige plotted against average income (left) and average education (right), with the loess fits for the various spans and degrees superimposed]

Next we try an additive non-parametric regression model, y = β0 + f1(x1) + f2(x2) + ε, where the partial-regression functions fj(·) are fitted using smoothing splines, via the gam() function in the mgcv package in R:

> library(mgcv)
> prestige.am = gam(prestige ~ s(income) + s(education), data=prestige)

The s(·) function, used in specifying the model formula, indicates that each term is to be fit with a smoothing spline. What does the additive regression surface look like?

> inc=seq(min(prestige$income),max(prestige$income),len=25)

> ed=seq(min(prestige$education),max(prestige$education),len=25)

> newdat=expand.grid(income=inc,education=ed)

> fit.prestige = matrix(predict(prestige.am, newdat), 25, 25)

> persp(inc, ed, fit.prestige, theta=45, phi=30, ticktype="detailed",xlab="Income", ylab="Education",

+ zlab="Prestige", expand=2/3,shade=0.5)

[Figure: perspective plot of the fitted additive regression surface of prestige over income and education]


Because slices of the additive-regression surface in the direction of one predictor (holding the other predictor constant) are parallel, we can also graph each partial-regression function separately.

> par(mfrow=c(1,2))

> plot(prestige.am)

[Figure: estimated partial-regression functions s(income, 3.12) and s(education, 3.18)]

How well do the fitted values of prestige correspond to those observed?

> plot(prestige.am$fitted.values,prestige$prestige,xlab="predicted prestige",ylab="observed prestige")

[Figure: observed prestige plotted against predicted prestige]
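As a simple numerical complement to this plot, one might also look at the squared correlation between fitted and observed prestige and at the model summary. A minimal sketch, assuming the prestige.am fit above:

> cor(prestige.am$fitted.values, prestige$prestige)^2   # proportion of variation reproduced
> summary(prestige.am)                                  # edf of each smooth, approximate significance, adjusted R-sq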

9. An example of using a GAM

We return to the US recession example and show how to fit and forecast the data within the generalized additive modelling approach, using a binomial family and spline smoothing.

> n = length(recession[,1])

> response = recession[,2][7:n]

> predictor1 = recession[,3][1:(n-6)]

> predictor2 = recession[,4][1:(n-6)]

> data = data.frame(cbind(response, predictor1, predictor2))

> time = floor(recession[,1]/100) + (recession[,1]-100*floor(recession[,1]/100)-0.5)/12

> time = time[1:(n-6)]

> fit = gam(response ~ s(predictor1) + s(predictor2), family=binomial, data=data)

> summary(fit)

Family: binomial
Link function: logit

Formula:
response ~ s(predictor1) + s(predictor2)

Parametric coefficients:
           Estimate  std. err.  t ratio   Pr(>|t|)
constant     -9.309      3.444   -2.703  0.0068786

Approximate significance of smooth terms:
                 edf  chi.sq     p-value
s(predictor1)  5.818  29.779  3.6550e-05
s(predictor2)  7.068  35.06   1.1597e-05

R-sq.(adj) = 0.677   Deviance explained = 71.3%
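The edf values well above 1 for both smooth terms already suggest departures from straight-line effects on the logit scale. One simple way to compare with a purely linear (GLM-type) fit of the same lagged predictors is to refit without the smooths and compare AICs. A minimal sketch, assuming the data frame data and the fit object above (fit.glm is just a name introduced here):

> fit.glm = gam(response ~ predictor1 + predictor2, family=binomial, data=data)  # linear terms only
> AIC(fit.glm, fit)                                   # lower AIC suggests the preferred model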


We then plot the resulting forecast values (compare back with those obtained from the GLM):

> plot(time, response, type="h", col=10, main="GAM predictions")

> lines(time, predict(fit, newdata=data, type="response"))

[Figure: ‘GAM predictions’ of the recession probability plotted against time (1965-2000), with the observed response shown as vertical lines]


Introductory Statistical Modelling in R — further examples

Data for the further examples are included in the course R workspace which is provided as a computer file.

1. The data frame mammals contains average brain and body weights for 62 species of animals, where body weight is in kg and brain weight is in g. Explore these data using linear model(s), check the residual diagnostics and make suitable transformations of the variables if necessary. Compare your linear model analysis with the use of a scatter-plot smoother. Summarise and justify your conclusions about the relationship between brain and body weights.

2. The data frame funding contains data on per student government funding (£) to UK Universities at 2001-2 prices for academic years from 1989-2002, along with associated student numbers (000's). It was published in the Independent newspaper on November 28, 2003 under the headline ‘Amid the furore, one thing is agreed: university funding is in a severe crisis’. Examine this data using linear model(s) (and possibly also non-parametric approaches). What are your conclusions, and do you agree with the Independent?

3. The data frame fat contains data from a study investigating a new method of measuring body composition. The body fat percentage, age and gender are given for 18 adults aged between 23 and 61. There are 18 observations on three variables:
   Age: The age of the subject in (complete) years
   Percent.Fat: The body fat percentage of the subject
   Gender: The gender of the subject
Fit a linear model to these data using one factor and one quantitative covariate and report your conclusions.

4. The data frame algae1 contains a further 122 water samples on algae blooms with the same format as the data in the data frame algae. These samples can be regarded as a test set for the model developed on the original data. Predict values for algae a1 for these data using the final model derived from the original data, and compare your predictions with the actual values in the test set. Compute the mean square error of prediction.

5. The data frame sexist contains data published in the Independent, October 8, 1992, under the headline ‘Irish and Italians are the sexists of Europe’. The data give the percentage of people surveyed in twelve European countries who had equal confidence in the competence of both sexes for four occupations: 1 (bus/train driver), 2 (surgeon), 3 (barrister) and 4 (MP). Fit a linear model with two factors to these data, provide a suitable anova table, interpret it appropriately and report your conclusions. Do you agree with the Independent? Do you have any reservations about whether a Gaussian linear model is sensible for these data?

6. The data frame doses contains data obtained from a dose response experiment, comprising the number (y) responding to the drug out of a total (n) in eight groups receiving different dose regimes (x). Analyse these data using a binomial GLM. Report and justify your conclusions.

7. The data frame contraceptive contains contraceptive use data from 1607 currently married and fecund women interviewed in a fertility survey conducted in Fiji, according to age, education, desire for more children and current use of contraception. Analyse these data using a binomial (and if necessary a quasi-binomial) GLM. Report and justify your conclusions.

8. The data frame pilots contains data obtained from the Civil Aviation Safety Authority of Australia and the Australian Transport Safety Bureau on registered numbers and reported deaths of private pilots in various age groups from 1992 to 1999 in Australia. There are 64 observations on four variables:
   Year: The year of interest
   Numbers: The number of (registered) pilots
   Deaths: The number of (reported) pilot deaths
   Age: The age of the pilots
Analyse these data using an appropriate Poisson GLM and associated analyses. Report and justify your conclusions concerning the impact of age and year on private pilot death rates.

9. The table below gives the confirmed and probable variant CJD patients reported to the end of December 2001, sorted into males (the first number given in each pair) and females (the second number of the pair), according to their year of onset of the illness.

   Year of onset:    1994  1995  1996  1997  1998  1999  2000
   Males, Females:   3,5   5,5   4,7   7,7   8,9   20,9  12,11

Propose a sensible Poisson model for these data, read the data into R, fit the model, carry out relevant diagnostic checks and report and justify your conclusions.


10. The data frame maths gives the results from men and women studying Honours in England attempting four mathematics questions. There are four variables on 24 observations:
   Result: The result of the question: Correct, Incorrect or No.attempt
   Gender: Either Male or Female
   Question: One of A, B, C or D
   Count: The number of students in each category
Analyse these data using a Poisson GLM. Report and justify your conclusions.

11. The data frame spam consists of data on 4601 email items, of which 1813 were identified as spam. The variables are:
   crl.tot: total length of words in capitals
   dollar: number of occurrences of the ‘$’ symbol
   bang: number of occurrences of the ‘!’ symbol
   money: number of occurrences of the word ‘money’
   n000: number of occurrences of the string ‘000’
   make: number of occurrences of the word ‘make’
   yesno: a factor with levels ‘n’ (not spam) and ‘y’ (spam)
Fit suitable parametric or non-parametric models to these data in order to predict the probability that an email item is ‘spam’. Report and justify your conclusions. Comment on the predictive validity of your final model.

12. Using the data on algae blooms in the data frame algae, develop an additive model to predict algae a3. Test predictions of a3 from this model using the water sample values in the data frame algae1 and report on your conclusions.
