
Notes On Nonlinear Regression

STAT 8230 — Applied Nonlinear Regression: Lecture Notes

Linear vs. Nonlinear Models

Linear regression, analysis of variance, analysis of covariance, and most of multivariate analysis are concerned with linear statistical models.

These models describe the dependence relationship between one or more continuously distributed response random variables and a set of explanatory variables or factors.

• These models are parametric because, when fully specified, they assume that the probability distribution of the response variable(s), including a model for the dependence between response and explanatory variables, is known except for the values of a small number of unknown constants called parameters.

• These models are linear in the sense that the regression parameters (the parameters that describe the dependence of the mean response on explanatory variables) enter into the models linearly.

– The model is linear in the parameters, not the explanatory variables.


For example, the following is the general form of the classical multiple regression model:

yi = β1 xi1 + β2 xi2 + · · · + βp xip + ei,  i = 1, . . . , n,    (∗)

where e1, . . . , en iid∼ N(0, σ2).

• Here, we assume that we have a random sample of n observations (yi, xi1, . . . , xip), i = 1, . . . , n, of a response variable Y and a set of p explanatory variables X1, . . . , Xp.

• In addition, the notation iid∼ N(0, σ2) means "are independent, identically distributed random variables, each with a normal distribution with mean 0 and variance σ2."

• Typically, xi1 is equal to one for all i in multiple linear regression models, but this need not be so.

• In model (*) the parameters are β1, . . . , βp, σ2. The regression parameters are β1, . . . , βp.

In regression models, the explanatory variables (xi1, . . . , xip, i = 1, . . . , n, above) are treated as nonrandom, either by assumption that they have been set to their observed values by design or some other nonrandom mechanism, or, more generally, by making the model conditional on the observed X values.

• That is, regression models specify the conditional distribution of Y | X1, . . . , Xp. In particular, regression models are primarily concerned with the mean of this distribution: E(Y | X1, . . . , Xp).

• E(Y | X1, . . . , Xp), the conditional expectation of the response given the values of the explanatory variables, is known as the regression function.

– Since we always condition on the explanatory variables in linear and nonlinear regression models, we will often drop the conditioning from the notation for convenience and write E(Y) in place of E(Y | X1, . . . , Xp).


In the multiple linear regression model (*), the regression function for the ith subject (unit of observation), call it µi, is

µi ≡ E(yi) = E(β1 xi1 + β2 xi2 + · · · + βp xip + ei)
           = β1 xi1 + β2 xi2 + · · · + βp xip + E(ei)    (and E(ei) = 0)
           = β1 xi1 + β2 xi2 + · · · + βp xip.

Notice that the regression parameters β1, . . . , βp enter into the regression function in a linear fashion.

• Recall that a linear combination of z1, . . . , zk is a weighted sum a1 z1 + a2 z2 + · · · + ak zk of the zj's with coefficients a1, . . . , ak.

• Of course, the multiple linear regression model is linear in the βj's and in the xij's, but the fact that it is linear in the βj's is what makes it a linear model.

In this course, a nonlinear regression model is still going to be a regression model describing the relationship between a continuously distributed response variable yi and explanatory variables xi1, . . . , xip, but now we drop the linearity assumption,

µi = β1 xi1 + · · · + βp xip,

and allow the parameters θ1, . . . , θp and explanatory variables xi1, . . . , xik to enter into the regression function in a nonlinear way.

• Notice we've switched to calling the parameters θ's instead of β's. In addition, the number of explanatory variables, k, is not necessarily equal to the number of parameters, p.


That is, in the nonlinear regression models under study in this course, µi = f(xi1, . . . , xik, θ1, . . . , θp), where f(·) is a function not necessarily linear in the θ's. Otherwise, the model is the same:

yi = f(xi1, . . . , xik, θ1, . . . , θp) + ei,  i = 1, . . . , n,    (∗∗)

where e1, . . . , en iid∼ N(0, σ2).

• Note that we've restricted attention here somewhat in the class of all possible not-linear models. We retain the assumption of a continuous response, additive error, and (usually) normally distributed errors.

• Excluded from consideration are several important classes of regression models that are nonlinear in some sense.

– In particular, we exclude generalized linear models (GLMs, e.g., logistic regression models, Poisson loglinear models, etc.). Many cases of GLMs are for discrete data, and GLMs retain a (modified) linearity in the parameters assumption.


Example — Onion Data:

The following table and scatterplot display data on the dry weight (Y) of 15 onion bulbs randomly assigned to 15 growing times (X) until measurement.

Growing Time   Dry Weight     Growing Time   Dry Weight
 1              16.08           9             590.03
 2              33.83          10             651.92
 3              65.8           11             724.93
 4              97.2           12             699.56
 5             191.55          13             689.96
 6             326.20          14             637.56
 7             386.87          15             717.41
 8             520.53

[Figure: scatterplot of Dry Weight versus Growing Time.]


Suppose we wanted to fit a model to describe how the mean dry weight of onions depends upon growing time. From the data and scatterplot, it is clear that weight tends to increase with growing time in a nonlinear (in growing time) fashion.

However, a linear (in the parameters) model can still be used to capture this nonlinear pattern of growth by considering polynomial models in growing time. That is, consider models of the form

yi = β1 + β2 xi + β3 x_i^2 + · · · + βp x_i^{p−1} + ei,  i = 1, . . . , n,

where e1, . . . , en are i.i.d. each with mean 0 and variance σ2 (constant variance).

Alternatively, we might consider a nonlinear model. In particular, consider the 3-parameter logistic model (a.k.a. simple logistic model):

yi = θ1 / (1 + exp{(θ2 − xi)/θ3}) + ei,  i = 1, . . . , n,

with the same assumptions on the ei's.
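As a concrete illustration, here is a minimal R sketch of how the two candidate models might be fit to the onion data in the table above. The data frame name onion is an assumption (the course handouts are not reproduced here), and the logistic fit uses R's self-starting SSlogis(), which is parameterized exactly as the simple logistic model above.

    ## Onion data transcribed from the table above
    onion <- data.frame(
      time   = 1:15,
      weight = c(16.08, 33.83, 65.8, 97.2, 191.55, 326.20, 386.87, 520.53,
                 590.03, 651.92, 724.93, 699.56, 689.96, 637.56, 717.41)
    )

    ## A linear (in the parameters) competitor: cubic polynomial in growing time
    m.poly <- lm(weight ~ time + I(time^2) + I(time^3), data = onion)

    ## 3-parameter logistic model fit by nonlinear least squares;
    ## SSlogis() supplies its own starting values for (theta1, theta2, theta3)
    m.logis <- nls(weight ~ SSlogis(time, theta1, theta2, theta3), data = onion)
    summary(m.logis)

Both fitted curves can then be overlaid on the scatterplot for comparison.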

How do we choose between a linear model (e.g., polynomial in growing time) and a nonlinear model (e.g., simple logistic model) in this problem?

From a purely empirical point of view, we might choose the model that fits the data most closely.

However, we need to be a little bit careful here to balance fit against parsimony and generality. If we include enough terms in a polynomial model we can fit the data perfectly.


In particular, an (n − 1)th degree polynomial can fit n points exactly. Such a model has n parameters and is equivalent to (is just a reparameterization of) the model

yi = βi + ei,  i = 1, . . . , n.    (†)

Such a model clearly doesn't summarize or simplify the data at all and can't be expected to generalize beyond the particular features of this one randomly drawn data set.

• In addition, a model such as (†) leaves no degrees of freedom with which to estimate the error variance ⇒ we can't do inference (test hypotheses, form confidence intervals) based on the model.

If we think of each of the n observations as an independent piece of information (or degree of freedom) from which to fit a model, then we use up one of these pieces of information (degrees of freedom) for every (nonredundant) parameter estimated in the model.

• n regression parameters to be estimated ⇒ n d.f. used to estimate the model (model d.f.) ⇒ n − n = 0 d.f. left to estimate the error variance parameter σ2 (0 d.f. for error).

The smaller the number of regression parameters in the model, the more d.f. available to estimate error variance ⇒ more power for hypothesis tests, more precision in confidence intervals.

• Parsimonious models that fit the main features of the data are preferred.

Consider the fits of the simple logistic model and polynomial models of order 2, 3, and 4 to the onion data, shown below.


[Figure: four panels plotting Dry Weight versus Growing Time with fitted curves — the 3-parameter logistic model (nonlinear) and the 3-, 4-, and 5-parameter (quadratic, cubic, quartic) linear polynomial models.]


• Notice that it requires a less parsimonious (more parameters) linear model to fit the main features of the data than a nonlinear model.

• In addition, while the quadratic (3-parameter linear) model clearly underfits the general shape of the curve, the cubic and quartic linear models appear to overfit the data.

• So, from a purely empirical point of view, the logistic model appears preferable.

• In addition, the logistic model has a big advantage in terms of parameter interpretability for a growth model such as this one.

Interpretations of the parameters in the simple logistic model:

[Figure C.7: The simple logistic model, plotted as y versus x, showing the parameters φ1, the horizontal asymptote as x → ∞; φ2, the value of x for which y = φ1/2; and φ3, a scale parameter on the x-axis. If φ3 < 0 the curve will be monotone decreasing instead of monotone increasing and φ1 will be the horizontal asymptote as x → −∞.]


• θ1 (φ1 in the figure above) represents the asymptote of the curve (the limit of onion weight as growth time increases toward its maximum value).

• θ2 represents the x-value at which y is equal to θ1/2, one-half of its asymptotic value: the growth time at which onions have achieved half of their total potential weight.

• θ3 is a scale parameter that does not have as natural an interpretation.

• In contrast, the polynomial parameters are not as meaningful in the context of the problem.

So fit, parsimony, and parameter interpretability can point to using nonlinear models over linear ones. A further important motivation for using nonlinear models over linear ones is subject-matter theory.

• It may be that we have a mechanistic theory that explains the nature of onion growth and which implies a nonlinear functional form for the relationship between weight (or some other measure of size) and growing time.

• We will see plenty of examples of theoretically motivated nonlinear models as the course progresses.


Review of Linear Regression Models

Before discussing nonlinear regression we need to review linear regression.

Why?

• Because many of the ideas and methods from linear regression transfer directly or with minor modification to the nonlinear case.

• And because many of the methods of estimation and inference in NLMs are linear methods applied to a linear approximation to the NLM.

Again, we assume that we observe a sample of independent pairs (y1, x1), . . . , (yn, xn), where yi is a response variable and xi = (xi1, . . . , xip)^T is a p × 1 vector of explanatory variables (assumed fixed).

The classical linear model can be written

yi = β1 xi1 + · · · + βp xip + ei = x_i^T β + ei,  i = 1, . . . , n,

where e1, . . . , en iid∼ N(0, σ2). Equivalently, we can stack these n equations and write the model as

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
+
\begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}

or y = Xβ + e.

Our assumptions on e1, . . . , en can be equivalently restated as

e ∼ Nn(0, σ2 In),

where Nn(µ, M) denotes the n-dimensional multivariate normal distribution with mean µ and variance-covariance matrix M.


Multivariate normal distribution:

The multivariate normal distribution is to a random vector (vector of random variables) as the univariate (usual) normal distribution is to a random variable. It is the version of the normal distribution appropriate to the joint distribution of several random variables (collected and stacked as a random vector) rather than the distribution of a single random variable.

• E.g., for a bivariate random vector x = (x1, x2)^T that has a bivariate normal distribution, the density function of x maps out a bell over the (x1, x2) plane.

The Nn(µ, Σ) distribution is completely described by the two parameters µ, the mean of the distribution, and Σ, the variance-covariance matrix of the distribution.

• That is, for x = (x1, x2, . . . , xn)^T ∼ Nn(µ, Σ),

\mu = \begin{pmatrix} E(x_1) \\ E(x_2) \\ \vdots \\ E(x_n) \end{pmatrix},
\quad \text{and} \quad
\Sigma = \begin{pmatrix}
\mathrm{var}(x_1) & \mathrm{cov}(x_1, x_2) & \cdots & \mathrm{cov}(x_1, x_n) \\
\mathrm{cov}(x_2, x_1) & \mathrm{var}(x_2) & \cdots & \mathrm{cov}(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(x_n, x_1) & \mathrm{cov}(x_n, x_2) & \cdots & \mathrm{var}(x_n)
\end{pmatrix}

describe the location and dispersion (spread), respectively, of the bell-shaped distribution of possible values for x.


The probability density function (p.d.f.) of x generalizes the univariate normal p.d.f.

• Recall for X ∼ N(µ, σ2) the p.d.f. of X is

f(x) = (2πσ2)^{−1/2} exp{ −(1/2)(x − µ)^2/σ2 },  −∞ < x < ∞.

• In the multivariate case, for x ∼ Nk(µ, Σ), the p.d.f. of x is

f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) },  x ∈ R^k.

– Here |Σ| denotes the determinant of the var-cov matrix Σ.

In the CLM, since e ∼ Nn(0, σ2 In) and y = Xβ + e, it follows that y ∼ Nn too, with mean

E(y) = E(Xβ + e) = Xβ + E(e) = Xβ    (since E(e) = 0)

and var-cov matrix

\mathrm{var}(y) = \mathrm{var}(X\beta + e) = \mathrm{var}(e)
= \sigma^2 \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}
= \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}.    (∗)

• Assumption (*) says that the yi's are uncorrelated (cov(yi, yi′) = 0 for i ≠ i′) and have constant variance (var(y1) = · · · = var(yn) = σ2).


So, the assumptions of the CLM can be stated quite succinctly as:

y ∼ Nn(Xβ, σ2In). (†)

Therefore, in the CLM y is assumed to have the joint p.d.f.

f(y; β, σ2) = (2π)^{−n/2} |σ2 In|^{−1/2} exp{ −(1/2)(y − Xβ)^T (σ2 In)^{−1} (y − Xβ) }
            = (2πσ2)^{−n/2} exp{ −(1/(2σ2)) (y − Xβ)^T (y − Xβ) }
            = (2πσ2)^{−n/2} exp{ −(1/(2σ2)) ∥y − Xβ∥2 },  y ∈ R^n,

where y − Xβ = e.

• Here, ∥v∥ = √(v^T v) denotes the norm, or length, of the vector v. Therefore, ∥y − Xβ∥ denotes the length of the difference between y and Xβ; i.e., the (Euclidean) distance between y and Xβ.

• (Note that we've used here the fact that the determinant of a diagonal matrix (a matrix whose off-diagonal elements are all 0) is the product of the diagonal elements ⇒ |σ2 In| = (σ2)^n.)

• Actually, many of the results in the theory of linear models can be established with the weaker assumptions obtained by dropping normality from (†); that is, under the assumptions E(y) = Xβ, var(y) = σ2 In and y1, . . . , yn are independent.

• As mentioned previously, the mean of y (conditional on X) is known as the regression function or (as we'll call it) the expectation function of the model. In the linear regression model, the expectation function is E(y) = Xβ.


Notice that in the linear model,

∂E(yi)/∂βj = ∂(x_i^T β)/∂βj = ∂(xi1 β1 + · · · + xip βp)/∂βj = xij.

To denote the derivative of the vector µ = E(y) with respect to the vector β we use the notation

\frac{\partial \mu}{\partial \beta^T} =
\begin{pmatrix} \partial \mu_1 / \partial \beta^T \\ \partial \mu_2 / \partial \beta^T \\ \vdots \\ \partial \mu_n / \partial \beta^T \end{pmatrix}
=
\begin{pmatrix}
\partial \mu_1 / \partial \beta_1 & \partial \mu_1 / \partial \beta_2 & \cdots & \partial \mu_1 / \partial \beta_p \\
\partial \mu_2 / \partial \beta_1 & \partial \mu_2 / \partial \beta_2 & \cdots & \partial \mu_2 / \partial \beta_p \\
\vdots & \vdots & \ddots & \vdots \\
\partial \mu_n / \partial \beta_1 & \partial \mu_n / \partial \beta_2 & \cdots & \partial \mu_n / \partial \beta_p
\end{pmatrix}.

• So we see that the derivative of Xβ with respect to β gives the matrix X. For this reason X is called the derivative matrix.

– Note that X is also sometimes called the model matrix, or design matrix, in linear models.

– In linear regression, notice that the derivative matrix ∂µ/∂β^T does not depend on β. This will not be the case in nonlinear regression.


Estimation of β and σ2:

Maximum likelihood estimation:

In general, the likelihood function is just the probability density function, but thought of as a function of the parameters rather than of the data.

• For example, in the CLM, the p.d.f. of the response variable y is

f(y; β, σ2) = (2πσ2)^{−n/2} exp{ −(1/(2σ2)) ∥y − Xβ∥2 }.

– This p.d.f. is a function of the observed response y and the parameters β and σ2, but we think of it primarily as a function of y.

– In the discrete case, the p.d.f. gives the probability of observing its argument, the data (y above), for given values of the parameters. In the continuous case the interpretation is very similar, but slightly more complicated.

Since the p.d.f. involves both the parameters (β and σ2) and the data (y), once the data are observed, we can think of it as a function of the parameters given the data.

This re-interpretation of the density function is given a new name, the likelihood function, and written as primarily a function of the parameters.

E.g., in the CLM the likelihood function is

L(β, σ2; y) = (2πσ2)^{−n/2} exp{ −(1/(2σ2)) ∥y − Xβ∥2 }.


• The idea behind maximum likelihood estimation is to find the values of β and σ2 under which the data are most likely. That is, we find the β and σ2 that maximize the likelihood function (and p.d.f.) for the value of y actually observed. These values are the maximum likelihood estimators (MLEs) of the parameters.

• Note that since the natural logarithm is an increasing function, maximizing L(β, σ2; y) with respect to the parameters is equivalent to (produces the same answer as) maximizing ℓ(β, σ2; y) ≡ log{L(β, σ2; y)}. Since taking logarithms is often mathematically convenient and it doesn't change the problem, we'll typically work with this loglikelihood function rather than the likelihood function.

For the CLM, the loglikelihood is

ℓ(β, σ2; y) = −(n/2) log(2π) − (n/2) log(σ2) − (1/(2σ2)) ∥y − Xβ∥2,

where the first term is a constant and the last two terms are the kernel of ℓ.

• Note that it is equivalent to maximize the kernel of the loglikelihood — that portion of the loglikelihood depending upon the parameters. Terms not involving the parameters can be ignored.

Obtaining the MLEs in the CLM: This can be done in two steps.

1. Maximize ℓ(β, σ2; y) with respect to β, treating σ2 as known. Call the resulting estimator β̂.

2. Then maximize ℓ(β̂, σ2; y) with respect to σ2. Call the resulting estimator σ̂2.

Then β̂, σ̂2 will be the MLEs of β, σ2.


1. In step 1 we treat σ2 as known and maximize ℓ(β, σ2; y) with respect to β. Note that this is equivalent to maximizing the third term,

−(1/(2σ2)) ∥y − Xβ∥2,

which is equivalent to minimizing

∥y − Xβ∥2 = ∥e∥2 = Σ_{i=1}^n e_i^2 = Σ_{i=1}^n (yi − x_i^T β)^2 ≡ S(β).    (∗)

• Thus the MLE of β minimizes S(β), which is known as the least squares criterion.

– So, the estimators of β given by ML and least squares coincide.

We can obtain the MLE/LSE, β̂, by solving the normal equations, which are obtained by differentiating S(β) and setting the result equal to 0.

S(β) can be written

S(β) = (y − Xβ)^T (y − Xβ) = y^T y − y^T Xβ − β^T X^T y + β^T X^T Xβ
     = y^T y − 2 y^T Xβ + β^T X^T Xβ    (using β^T X^T y = y^T Xβ).

• Need ∂S/∂β^T.


To take the necessary derivatives, we need some results on matrix differentiation. For x a vector and A a matrix,

i. ∂(Ax)/∂x^T = A.

ii. ∂(x^T A x)/∂x^T = 2 x^T A.

Using (i) and (ii) we get

∂S/∂β^T = −2 y^T X + 2 β^T X^T X,

so that the normal equations become

2 β̂^T X^T X = 2 y^T X, or X^T X β̂ = X^T y.

If X^T X is invertible (nonsingular) we can multiply through on both sides by (X^T X)^{−1} to give the MLE/least squares estimator of β as:

β̂ = (X^T X)^{−1} X^T y.    (‡)

• We'll assume (X^T X) is invertible henceforth unless stated otherwise. Note that if (X^T X) is not invertible a (no longer unique) estimator of β is obtained simply by replacing the matrix inverse in (‡) with a generalized matrix inverse.


2. Now we maximize ℓ(β̂, σ2; y) with respect to σ2. Taking derivatives with respect to σ2 we get

∂ℓ/∂(σ2) = −n/(2σ2) + (1/2)∥y − Xβ̂∥2/σ^4 = 0,

which has solution

σ̂2 = (1/n) ∥y − Xβ̂∥2 = (1/n) Σ_{i=1}^n (yi − x_i^T β̂)^2.

• We will see that β̂ has a number of desirable properties including some optimality properties. However, σ̂2 is not typically the preferred estimator.

One fault with σ̂2 is that it is biased. It can be shown that

E(σ̂2) = ((n − p)/n) σ2.

Therefore, an unbiased estimator that is generally superior to σ̂2 can be formed by taking

(n/(n − p)) σ̂2 = (1/(n − p)) ∥y − Xβ̂∥2 = (1/(n − p)) Σ_{i=1}^n (yi − x_i^T β̂)^2 ≡ s2.

• We call this estimator s2, the mean squared error.
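As a concrete check of these formulas, here is a small R sketch that computes β̂ = (X^T X)^{−1} X^T y and s2 directly and compares them with lm(); the simulated data and variable names are illustrative assumptions, not part of the notes.

    set.seed(1)
    n <- 20; p <- 2
    x <- runif(n, 0, 10)
    y <- 2 + 0.5 * x + rnorm(n)

    X        <- cbind(1, x)                           # n x p model (derivative) matrix
    beta.hat <- solve(crossprod(X), crossprod(X, y))  # solves the normal equations (X'X)b = X'y
    s2       <- sum((y - X %*% beta.hat)^2) / (n - p) # s^2 = ||y - X beta.hat||^2 / (n - p)

    fit <- lm(y ~ x)
    cbind(beta.hat, coef(fit))       # identical regression coefficient estimates
    c(s2, summary(fit)$sigma^2)      # identical estimates of the error variance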


Properties of β̂ and methods of inference on β:

Several properties of β̂ follow from the fact that it is a linear function of y.

(That is, β̂ = (X^T X)^{−1} X^T y is of the form My for M a matrix of constants.)

This linearity combined with the model equation y = Xβ + e leads to some nice, simple properties.

Notice,

β̂ = (X^T X)^{−1} X^T y = (X^T X)^{−1} X^T (Xβ + e)
  = (X^T X)^{−1} X^T Xβ + (X^T X)^{−1} X^T e
  = β + (X^T X)^{−1} X^T e.

It follows that

1. β̂ is unbiased, since

E(β̂) = E{β + (X^T X)^{−1} X^T e} = β + (X^T X)^{−1} X^T E(e) = β    (since E(e) = 0).

2. β̂ has var-cov matrix

var(β̂) = var{β + (X^T X)^{−1} X^T e} = var{(X^T X)^{−1} X^T e}
        = (X^T X)^{−1} X^T var(e) X (X^T X)^{−1} = σ2 (X^T X)^{−1}    (using var(e) = σ2 I).

(Here we've used the fact that if w is an n × 1 random vector with var-cov matrix Σ and B is an m × n matrix of constants, then var(Bw) = B var(w) B^T = B Σ B^T.)

3. (normality) β̂ ∼ Np(β, σ2 (X^T X)^{−1}) (if e is assumed normal).


4. β̂ is the Best (minimum variance) estimator in the class of all Linear Unbiased Estimators (β̂ is BLUE). This result doesn't require the assumption of normality on the errors of the CLM.

5. Under the assumption of normal errors, β̂ and s2 are minimum variance unbiased estimators (best in the class of unbiased, but not-necessarily linear, estimators).

6. Since var(β̂) = σ2 (X^T X)^{−1} (by (2)), var(β̂j) = σ2 (X^T X)^{−1}_{jj}. Since σ2 is typically unknown, we must estimate it with s2 to get an estimate of var(β̂j). The square root of this estimated variance is the standard error of β̂j:

s.e.(β̂j) = s √{(X^T X)^{−1}_{jj}}.

With this standard error in hand, methods of inference (hypothesis tests, confidence intervals) follow from the fact that

(β̂j − βj) / s.e.(β̂j) ∼ t(n − p),

the t distribution with n − p d.f.

– ⇒ β̂j ± t_{1−α/2}(n − p) s.e.(β̂j) forms a 100(1 − α)% marginal confidence interval for βj.

– For an α-level test of H0 : βj = β0 versus H1 : βj ≠ β0 we use the rule: reject H0 if

|β̂j − β0| / s.e.(β̂j) > t_{1−α/2}(n − p).


7. Inference on the entire vector β is based on the fact that

(β̂ − β)^T (X^T X)(β̂ − β) / (p s2) ∼ F(p, n − p),

the F distribution with p and n − p d.f.

– A 100(1 − α)% joint confidence region for β is given by the set of all β such that

(β̂ − β)^T (X^T X)(β̂ − β) / (p s2) ≤ F_{1−α}(p, n − p).

This region consists of the surface and interior of an ellipsoid (a p-dimensional ellipse; e.g., a watermelon for p = 3).

– For an α-level test, we reject H0 : β = β0 in favor of H1 : β ≠ β0 if

(β̂ − β0)^T (X^T X)(β̂ − β0) / (p s2) > F_{1−α}(p, n − p).

– The previous result is sometimes of use when β0 = 0, but often we want to test that some linear function of β (e.g., a subvector of β) is equal to 0 or some other null value b. In that case, it is useful to have a generalization of the above F test for H0 : Aβ = b versus H1 : Aβ ≠ b, where A is a k × p matrix of constants and b is a k × 1 vector of constants. The appropriate test has rejection rule: reject if

F = (Aβ̂ − b)^T {A(X^T X)^{−1} A^T}^{−1} (Aβ̂ − b) / (k s2) > F_{1−α}(k, n − p).

(Note that not every such hypothesis is testable. We require that A has full row rank. Essentially, this means there is no redundancy in the statement of the null hypothesis.)


8. A 100(1 − α)% C.I. for the expected response at a given value of the vector of explanatory variables x0 is given by

x_0^T β̂ ± t_{1−α/2}(n − p) √{s2 x_0^T (X^T X)^{−1} x_0}.

– Similarly, we can form a C.I. for any linear combination of the β's that can be written in the form c^T β for c a vector of constants by replacing x0 with c. A 100(1 − α)% C.I. for c^T β is given by

c^T β̂ ± t_{1−α/2}(n − p) √{s2 c^T (X^T X)^{−1} c}.

9. A 100(1 − α)% C.I. for the predicted response (not the mean response over the population, but a single new observation of the response variable) at a given value of the vector of explanatory variables x0 is given by

x_0^T β̂ ± t_{1−α/2}(n − p) √{s2 [1 + x_0^T (X^T X)^{−1} x_0]}.

• Such an interval is usually called a prediction interval rather than a confidence interval.

10. A 100(1 − α)% confidence band for the response function at any x is given by

x^T β̂ ± √{F_{1−α}(p, n − p)} √{p s2 x^T (X^T X)^{−1} x}.

• Result (8) gives a C.I. at a single given point (x0), whereas result (10) gives a confidence band that holds for all values of x considered simultaneously.
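The t-based quantities in (6), (8), and (9) are easy to compute in R. Here is a minimal sketch on simulated data; the data themselves and the choice x0 = 0.5 are illustrative assumptions.

    set.seed(2)
    x <- runif(30); y <- 1 + 2 * x + rnorm(30, sd = 0.5)
    fit <- lm(y ~ x)

    ## (6): marginal 95% t intervals for the coefficients, and the same interval by hand
    confint(fit, level = 0.95)
    b   <- coef(fit)[2]
    se  <- coef(summary(fit))[2, "Std. Error"]
    dfE <- df.residual(fit)                       # n - p
    b + c(-1, 1) * qt(0.975, dfE) * se

    ## (8) C.I. for the mean response and (9) prediction interval, both at x0 = 0.5
    x0 <- data.frame(x = 0.5)
    predict(fit, x0, interval = "confidence")
    predict(fit, x0, interval = "prediction")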


Example – PCB in Trout (fitting a linear model in R):

The data below consist of PCB (polychlorinated biphenyls, a toxin) concentrations in Lake Cayuga (NY) trout of various ages.

Age (years)  PCB Conc. (ppm)    Age (years)  PCB Conc. (ppm)
 1            0.6                6            3.4
 1            1.6                6            9.7
 1            0.5                6            8.6
 1            1.2                7            4.0
 2            2.0                7            5.5
 2            1.3                7           10.5
 2            2.5                8           17.5
 3            2.2                8           13.4
 3            2.4                8            4.5
 3            1.2                9           30.4
 4            3.5               11           12.4
 4            4.1               12           13.4
 4            5.1               12           26.2
 5            5.7               12            7.4

• See the handout labelled trout1. The first page of this handout contains R commands contained in the R script file trout1.R. Pages 2–3 contain the text output of these commands and p. 4 the graphics output.

• The first plot on p. 4 of trout1 contains a scatterplot of PCB concentration versus age. From the plot there appears to be some nonlinearity in age and heteroscedasticity (nonconstant, in this case increasing with age, variance).

• From these observations it appears that the assumptions of the CLM preclude its use here. However, transformations of the response and explanatory variables can often induce linearity, normality and constant variance, thereby making the CLM an appropriate tool.


• One useful class of transformations is the Box-Cox family of power transformations:

g(Y; λ) = (Y^λ − 1)/λ,  if λ ≠ 0;    g(Y; λ) = log Y,  if λ = 0.

– For CLMs with an intercept term, this family is equivalent to the "simple" family of power transformations given by

gS(Y; λ) = Y^λ,  if λ ≠ 0;    gS(Y; λ) = log Y,  if λ = 0,

but the Box-Cox family is slightly more convenient mathematically.

• The specific transformation in the Box-Cox family can be chosen by estimating λ by ML estimation. Box and Cox showed that the ML estimator (MLE) of λ can be obtained as the maximizer of the function

−(n/2) log SSE{z(λ)},

where z(λ) has ith element g(yi; λ)/ẏ^{λ−1}, ẏ is the geometric mean of the elements of y, and SSE{z(λ)} denotes the error sum of squares for the regression of z(λ) on X.

– This function is known as the profile likelihood for λ.


• Therefore, it is possible to obtain the MLE of λ by plotting

−(n/2) log SSE{z(λ)}

over a range of λ values and selecting the λ-value that maximizes this function. This is automated in the boxcox macro in R (part of the MASS package).

• See trout1.R. trout1.R contains R commands to select the appropriate transformation of PCB concentration and then to fit a linear regression model to the transformed data. These commands also produce the plots on p. 4 of trout1.

• The boxcox macro is part of the MASS (Modern Applied Statistics with S-PLUS — the book by Venables and Ripley mentioned in the syllabus) library. This library comes standard with R, but must be loaded into an R session with the library(MASS) command.

• The par(mfrow=c(2,3)) command sets the graphical parameter mfrow so that plots are laid out 2 × 3 on a page.

• boxcox plots the profile likelihood for λ (see p. 4 of the trout1 handout, top-middle plot). From this plot we see that the MLE λ̂ is close to 0. Rather than using the exact MLE, it's preferable to round λ̂ to the nearest interpretable value (e.g., 0, 1/4, 1/3, 1/2). In this case we take λ = 0 and use the log transformation.

• A plot of log(PCB Conc.) against age looks much more linear and homoscedastic. A cube-root transformation of age improves the situation even further (we omit discussion of selecting transformations of the explanatory variables, but this subject is discussed in Box and Tidwell (1962) and elsewhere).


• A linear regression of the form

yi = β1 + β2 xi + ei,  i = 1, . . . , 28,

where e1, . . . , e28 iid∼ N(0, σ2), yi = log(PCBi), and xi = age_i^{1/3}, is fit using the lm function (a sketch follows below). The "data frame" trout contains the variables on their original scale. The I() function just allows the computation of the transformation to be done within the call to lm.
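A minimal sketch of these steps in R follows. The trout data frame is rebuilt here from the table above, and its column names (age, conc) are my own; the actual trout1.R script is not reproduced in these notes.

    library(MASS)   # for boxcox()

    trout <- data.frame(
      age  = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5,
               6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 11, 12, 12, 12),
      conc = c(0.6, 1.6, 0.5, 1.2, 2.0, 1.3, 2.5, 2.2, 2.4, 1.2, 3.5, 4.1, 5.1, 5.7,
               3.4, 9.7, 8.6, 4.0, 5.5, 10.5, 17.5, 13.4, 4.5, 30.4, 12.4, 13.4, 26.2, 7.4)
    )

    ## Profile likelihood for the Box-Cox parameter lambda;
    ## the maximum is near 0, pointing to the log transformation.
    boxcox(conc ~ I(age^(1/3)), data = trout)

    ## Linear model on the transformed scales
    m1trout.lm <- lm(log(conc) ~ I(age^(1/3)), data = trout)
    summary(m1trout.lm)
    names(m1trout.lm)    # components stored in the fitted-model object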

• R is an object-oriented language. This means that quite complicated entities like model fits can be stored and operated on as a single object. E.g., m1trout.lm is assigned the entire model fit. It is stored as a list containing all of the results of the model fit, listed by the function names(m1trout.lm). The fitted model can be summarized with the command summary(m1trout.lm) and various other functions (like coef) exist for extracting results from m1trout.lm.

• The remainder of the code in trout1.R computes confidence intervals and regions for β and produces the rest of the plots on p. 4 of trout1.

• For example, we can obtain the 95% confidence band for the average log(PCB) value at all x (values of age^{1/3}) by plotting

(1  x)(β̂1, β̂2)^T ± √{F_{.95}(2, 28 − 2)} √{2 s2 (1  x)(X^T X)^{−1}(1  x)^T}

or

(−2.391 + 2.300x) ± √3.369 √{2(.246)(.637 − .718x + .214x^2)}

over the range of x-values observed in the data.


The Geometry of Linear Least Squares:

Again, let S(β) = ∥y − Xβ∥2, the squared (Euclidean, or straight-line) distance from y to its mean according to the model, Xβ.

• The linear least squares estimator β̂ minimizes S(β).

Calculating S(β) consists of 2 steps:

1. Using the n × p derivative matrix X and the p × 1 parameter vector β to form the expected response vector η(β) = Xβ.

2. Calculating the squared distance from the expected response η(β) to the observed response y.

• Though η(β) lies in n-space (has n components), the set of all possible values it can take is not n-space. We can only vary the p parameters in β to get different values of η(β). That is, η(β) lies in a p-dimensional subspace of n-dimensional space.

• We call the set of all possible values of η(β) the expectation surface of the model.

– In a linear model, η(β) = Xβ is a linear combination of the columns of X (Xβ = β1 x1 + · · · + βp xp, where xj is the jth column of X), so we call the expectation surface the expectation plane of the model.


Very Simple Example — n = 2, p = 1:

Suppose we have a response vector with just two components, y = (4, 2)^T, to which we'd like to fit the linear model

yi = β + ei,  i = 1, 2,

or

y = βx + e, where x = (1, 1)^T.

The response vector y falls in two-dimensional space. We can plot y as follows:

[Figure: the response vector y = (4, 2)^T and the expectation line through x = (1, 1)^T in the plane.]

• Least-squares estimate of β: the β̂ so that η(β̂) is the closest point on the expectation plane to y.

– Since η(β̂) = (3, 3)^T is the closest point to y = (4, 2)^T, it is easy to find that the β̂ that yields β̂ (1, 1)^T = (3, 3)^T is β̂ = 3.


A Slightly Less Simple Example — n = 3, p = 2:

Consider again the PCB in trout data and suppose we want to fit our simple linear regression model as before,

yi = β1 + β2 xi + ei,  i = 1, . . . , n,

for y = log(PCB) and x = age^{1/3}, but now suppose we have only n = 3 observations:

age^{1/3}   log(PCB)
 1.26        0.92
 1.82        2.15
 2.22        2.52

The derivative matrix here is

X = \begin{pmatrix} 1 & 1.26 \\ 1 & 1.82 \\ 1 & 2.22 \end{pmatrix},

with columns x1 = (1, 1, 1)^T and x2 = (1.26, 1.82, 2.22)^T.


• Since the response has n = 3 components, the response space is 3-dimensional space. We can plot x1 and x2 in that space (below, plot (b)).

• The expectation plane is the set of all η such that η = β1 x1 + β2 x2 for some constants β1 and β2. This plane is depicted in plot (c) above.

• η(β̂) is the point on this plane that is closest to y = (0.92, 2.15, 2.52)^T (it is the (Euclidean) projection of y onto that plane).

• β̂ is the value of β that yields this closest point to y.


So, to find β̂ we

1. find the η̂ that is closest to y; then

2. find the β̂ such that η(β̂) = η̂.

We know from differentiating the least squares criterion that β̂ solves the normal equation

(X^T X)β̂ = X^T y.    (∗)

• Another way we can derive (*) is from the geometry of the problem:

– The expectation plane is the set of all n × 1 vectors that can be written as Xa for some p × 1 vector a (the set of all possible linear combinations of the columns of X).

– We know that the residual vector y − Xβ̂ must be orthogonal (i.e., perpendicular) to the expectation plane, so the angle between y − Xβ̂ and Xa must be 90° for any p × 1 vector a.

– Algebraically, two vectors b and c form a 90° angle if and only if their inner product b^T c equals 0.

– Therefore, the geometry of least squares implies that the least squares estimator of β satisfies

(y − Xβ̂)^T (Xa) = 0, for all a,
or y^T Xa = β̂^T X^T Xa, for all a,
which implies y^T X = β̂^T X^T X,
or, equivalently, X^T y = X^T Xβ̂ (the normal equation),

and again we obtain that β̂ must satisfy the normal equation.


Assumptions of the CLM:

Model:
y = Xβ + e,  e ∼ Nn(0, σ2 In).

1. Expectation function is correctly specified as E(y) = Xβ. Expectation form assumed linear in β and contains all important predictor variables, each on the right scale.

2. Additive error. We assume y = Xβ + e rather than, for example, yi = (x_i^T β) ei. This assumption implies

– Distribution of y − Xβ doesn't depend on β.

– β should be estimated to make y − Xβ "small."

3. The distribution of e does not depend on X. That is, the effect of X on y is completely captured by Xβ. Often justified by randomization.

4. Each ei has mean 0. This is a consequence of (1) and (2). Not at all restrictive in a linear model with intercept, but deserves some attention in a nonlinear model with no intercept.

5. Homoscedasticity. Each ei has the same variance, σ2. Implies y1, . . . , yn all have the same variance, σ2.

6. Independence. The ei's are independent ⇒ the yi's are independent. Often justified by randomization.

7. Normality. Assume each ei follows a normal (Gaussian) distribution.

• Assumptions (4)–(7) imply that the length of y − Xβ should be measured with Euclidean distance: ∥y − Xβ∥.

• Assumption (7) is not necessary to motivate least squares and establish its optimality (BLUE-ness). However, classical inference methods rely on (7).


Model fitting is an iterative process: Make assumptions. Fit model. Check assumptions. Revise model. Check revised model's assumptions. Etc.

Verifying the Assumptions:

Since most of the assumptions are made on the error terms, it makes sense to check whether these assumptions appear to hold for the estimated error terms, or residuals.

Let ŷ = Xβ̂ be the vector of fitted values (a.k.a. predicted values) and let

ê = y − ŷ    (raw residuals).

Most of the strongest and/or most commonly violated assumptions of the CLM can be checked by plotting the residuals. E.g., to check (1) we could plot ê1, . . . , ên versus the sample values of other potential explanatory variables.

• For most residual plots, presence of any pattern indicates violation. In addition, we look for large residuals.

• Since the size of raw residuals depends upon the units of y, it's useful to standardize the residuals in some way.

• Several possibilities: one simple way is to look at studentized residuals, which are simply the raw residuals divided by their estimated standard deviation.

Raw residuals:

ê = y − ŷ = y − Xβ̂ = y − X(X^T X)^{−1}X^T y = (I − H)y, where H ≡ X(X^T X)^{−1}X^T.

• Here H = X(X^T X)^{−1}X^T is known as the hat matrix because ŷ = Hy (H is the matrix that puts the "hat" on y).


To studentize the elements of ê we divide each element by its estimated standard deviation. It can be shown that

var(ê) = σ2(I − H),

which we estimate with

v̂ar(ê) = s2(I − H).    (∗)

Therefore, the estimated standard deviation of êi is the square root of the ith diagonal element of (*), that is, s√(1 − hii), where hii is the ith diagonal element of H. Thus, the studentized residuals are

(yi − ŷi) / (s√(1 − hii)),  i = 1, . . . , n.

• A simpler way to standardize residuals is to use the Pearson residuals:

(yi − ŷi) / √(v̂ar(yi)) = (yi − ŷi)/s,  i = 1, . . . , n.

• Other types of standardized residuals are possible, but different choices usually lead to the same conclusions and, for many purposes, there's little reason to prefer one definition over another. Unfortunately, there's considerable variability in the terminology used for various types of residuals.
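A small R sketch of these quantities on simulated data (an illustrative assumption); note that what these notes call studentized residuals are returned by rstandard() in R, which reserves rstudent() for an externally studentized version.

    set.seed(3)
    x <- runif(30); y <- 1 + 2 * x + rnorm(30, sd = 0.5)
    fit <- lm(y ~ x)

    X   <- model.matrix(fit)                     # model (derivative) matrix
    H   <- X %*% solve(crossprod(X)) %*% t(X)    # hat matrix
    hii <- diag(H)                               # leverages; same as hatvalues(fit)

    s      <- summary(fit)$sigma                 # s = sqrt(mean squared error)
    e.stud <- resid(fit) / (s * sqrt(1 - hii))   # studentized residuals as defined above

    all.equal(unname(e.stud), unname(rstandard(fit)))
    plot(fitted(fit), e.stud)                    # residuals vs. fitted values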


Residual plots (see, e.g., Draper & Smith, Ch. 3):

1. versus the fitted values. Should see no pattern, few large (in absolute value) residuals. Violations can indicate heteroscedasticity, incorrect specification of the expectation functional form, outliers, correlation.

2. versus the predictor variables. Should see no pattern. Patterns can indicate heteroscedasticity, need for extra terms (e.g., a term quadratic in the predictor).

3. versus potential predictor variables. Should indicate no pattern; otherwise, the potential predictor should be included.

4. versus time (or some other potential ordering index of the responses). Should see no pattern. A pattern can reveal autocorrelation (dependence through time), heteroscedasticity, or the need to include time as a predictor variable.

5. quantile-quantile plot (normal probability plot). Plot the sample quantiles versus the expected quantiles under the assumption that the ei's are normally distributed. Should be a straight line. For moderate to large sample sizes, non-straight plots indicate non-normality. Not useful in small samples (n < 30 or so).


Example — Scottish Hill Races Data:

The S-PLUS library MASS contains a data set containing record fastest times in 35 Scottish hill races (running races) against distance and total height climbed in the race.

• See handout, hills1. This handout contains hills1.SSC, a file containing S-PLUS commands to analyze these data; the associated output; and associated graphics from the analysis.

• On line 2 of hills1.SSC we print out the data. The par(mfrow=c(2,2)) command sets up 4 plots per page in a 2 × 2 grid. In the first plot (labelled "(a)") we simply plot time vs. dist. As we should expect, there appears to be an increasing relationship between time and distance.

• We first fit a simple linear regression model of the form

timei = β0 + β1 disti + ei,  i = 1, . . . , 35,    (m1)

and add the fitted regression line to plot (a). This model appears to fit reasonably well, but we should check residuals.


• The functions fitted() and stdres() extract the fitted values and studentized residuals (as I've defined them) from the fitted model. We plot these residuals vs. fitteds in plot (b). Notice that there appear to be several outliers and, perhaps, some increasing variance. In addition, the residuals don't appear to be centered around zero as much as would be desirable. This is probably an effect of fitting outliers.

• Next we consider plot (c), a plot of residuals vs. the potential predictor, climb. There appears to be an increasing pattern, suggesting that climb belongs in the model.

• So, next we fit model m2hills.lm, which is of the form

timei = β0 + β1 disti + β2 climbi + ei,  i = 1, . . . , 35.    (m2)

• These models are examples of nested models. Note that m1hills.lm is nested in model m2hills.lm in the sense that m1 is a special case of m2 that occurs when β2 is fixed at 0. We can test m1 versus m2 by testing H0 : β2 = 0 versus H1 : β2 ≠ 0 in model m2. Notice that this testing situation is a special case of the F test for H0 : Aβ = b described earlier, with

Aβ − b = (0  0  1)(β0, β1, β2)^T − 0.


• In nested-model testing situations such as this, the F test statistic given earlier (for H0 : Aβ = b) has an algebraically equivalent, and more convenient, form in terms of the degrees of freedom and sums of squares for error for the two models:

F = [(SSE0 − SSE)/(dfE0 − dfE)] / [SSE/dfE],

where SSE0 and SSE are the sums of squares for error associated with the null model (the model in which H0 holds) and the alternative model, respectively, and dfE0 and dfE are the degrees of freedom for error for these two models.

– In general, the SSE and dfE for a (full rank) model with p × 1 regression parameter β are given by

SSE = ∥y − ŷ∥2 = Σ_{i=1}^n (yi − x_i^T β̂)^2,   dfE = n − p,

where ŷi = x_i^T β̂.

– We reject H0 at significance level α if F > F_{1−α}(dfE0 − dfE, dfE).

• The anova() function in S-PLUS automates the testing of nested models using the test described above (an R sketch follows below). It takes as arguments two fitted model objects and tests the null hypothesis that the smaller (null) model holds versus the alternative that it does not, under the maintained hypothesis that the larger model holds. In the example, we reject H0 : {model m1 holds} in favor of model m2 (F = 29.02, p < .0001).
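In R the same nested-model comparison can be sketched as follows; the hills data frame ships with the MASS package, and the manual calculation mirrors the F formula above.

    library(MASS)          # contains the hills data (dist, climb, time)

    m1 <- lm(time ~ dist, data = hills)
    m2 <- lm(time ~ dist + climb, data = hills)

    anova(m1, m2)          # F test of m1 (H0: beta2 = 0) versus m2

    ## The same F statistic from the error sums of squares and d.f.
    SSE0 <- deviance(m1); dfE0 <- df.residual(m1)
    SSE  <- deviance(m2); dfE  <- df.residual(m2)
    Fstat <- ((SSE0 - SSE) / (dfE0 - dfE)) / (SSE / dfE)
    Fstat; 1 - pf(Fstat, dfE0 - dfE, dfE)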

• After adding climb to our model, we recheck the residuals versus climb plot (plot (d)). There still appears to be a pattern to this plot, although now it looks different — a convex curve. This suggests adding a climb^2 term.


• In m3hills.lm we fit

timei = β0 + β1 disti + β2 climbi + β3 climb_i^2 + ei,  i = 1, . . . , 35.    (m3)

Again, using the anova() function, we see that m3 fits significantly better than m2.

• However, the residuals versus climb plot (plot (e)) and the residuals versus fitteds plot (plot (f)) still don't look particularly good. In plot (f) we can identify the outlier using the identify() function. Type ?identify in S-PLUS to get a description of this function.

• We print out the predicted (based on model m3) and observed data for this outlier using the predict function. Notice that the predicted and observed times are about an hour apart. It is possible that this data point was misrecorded. We complete the analysis under this assumption, omitting this point from further models.

• Model m4hills.lm refits m3 with the outlier removed. Plot (g) displays the residuals versus fitteds from this model. These residuals look fairly good.

• Although there doesn't appear to be any heteroscedasticity in plot (g), it seems intuitively reasonable that variability in race times should increase with the length of the race. If this were the case, we might account for it by transforming the response so that it had constant variance on the transformed scale.

• Alternatively, we might consider a model such as m5hills.lm. This model is identical to m4, but instead of the constant variance assumption var(ei) = σ2, i = 1, . . . , n, we assume the error variance is proportional to race length squared: var(ei) = σ2 dist_i^2 (i.e., the error standard deviation is proportional to dist).

• This is an example of a linear model fit with weighted least squares.


Weighted Least Squares

Suppose we have a linear model

y = Xβ + e (†)

where e has variance var(e) = σ2V, where V is a known positive-definite matrix not necessarily equal to In.

• For such a V it is always possible to find a square-root matrix V^{1/2} that has the property (V^{1/2})^T V^{1/2} = V.

Notice that we can multiply both sides of (†) by V^{−T/2} ≡ {(V^{1/2})^T}^{−1} to obtain an equivalent transformed model,

V^{−T/2}y = V^{−T/2}Xβ + V^{−T/2}e,  or  y∗ = X∗β + e∗,    (∗)

where y∗ ≡ V^{−T/2}y, X∗ ≡ V^{−T/2}X, and e∗ ≡ V^{−T/2}e.

Notice that the *'ed model satisfies the CLM assumptions because

E(e∗) = E(V^{−T/2}e) = V^{−T/2} E(e) = 0    (since E(e) = 0)

and

var(e∗) = var(V^{−T/2}e) = V^{−T/2} var(e) V^{−1/2} = σ2 V^{−T/2} V V^{−1/2}
        = σ2 V^{−T/2} V^{T/2} V^{1/2} V^{−1/2} = σ2 In,

using var(e) = σ2V and V = V^{T/2}V^{1/2}.


Therefore, the (ordinary) least squares estimator of β in (*),

β̂ = {(X∗)^T X∗}^{−1} (X∗)^T y∗
   = {(V^{−T/2}X)^T V^{−T/2}X}^{−1} (V^{−T/2}X)^T V^{−T/2}y
   = (X^T V^{−1} X)^{−1} X^T V^{−1} y,

is the optimal (BLUE) estimator of β.

• Because this estimator differs from the ordinary least squares estimator (X^T X)^{−1}X^T y by the inclusion of a weight matrix V^{−1} in the formula, we call this estimator the weighted least squares estimator of β.

• It can be shown that the WLS estimator β̂ minimizes

(y − Xβ)^T V^{−1} (y − Xβ)    (WLS criterion)

instead of the OLS criterion ∥y − Xβ∥2 = (y − Xβ)^T In (y − Xβ).

– This is to say that the OLS estimator minimizes the length of y − Xβ with respect to Euclidean distance and the WLS estimator minimizes the length of y − Xβ with respect to a more general statistical distance metric.

– The statistical distance used in WLS is weighted Euclidean distance, or Karl Pearson distance, in the case that V is a diagonal matrix. In this case, we are just accounting for heteroscedasticity.

– The statistical distance used in WLS is Mahalanobis distance in the case that V is non-diagonal (a.k.a. generalized least squares). In this case, we are accounting for heteroscedasticity and correlation among the ei's.

• It's also straightforward to show that the WLS estimator β̂ maximizes the log-likelihood of model (†) under the assumption e ∼ Nn(0, σ2V), so that WLS estimation = ML estimation in this model.
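A minimal R sketch of WLS for the heteroscedastic specification used in the hills example (var(ei) ∝ dist_i^2). In lm(), the weights argument takes weights proportional to 1/var(ei); the direct calculation below is the general (X^T V^{−1} X)^{−1} X^T V^{−1} y formula with V = diag(dist^2). The outlier handling from the handout is omitted, so this is only an illustration, not a reproduction of m5hills.lm.

    library(MASS)   # hills data

    ## WLS via lm(): weights proportional to 1 / var(e_i) = 1 / dist^2
    m.wls <- lm(time ~ dist + climb + I(climb^2), data = hills,
                weights = 1 / dist^2)

    ## The same estimator computed directly
    X    <- model.matrix(m.wls)
    Vinv <- diag(1 / hills$dist^2)
    beta.wls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% hills$time)

    cbind(coef(m.wls), beta.wls)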


Back to the Scottish Hills Races Example:

• In model m5hills.lm, we use the weights argument of the function lm() to fit the model using WLS.

• The residuals from this model (plot (h)) don't look any better than those from model m4. It is not possible to test m4 versus m5 using an F test for nested models (e.g., using the anova() function) because these are not nested models. They have the same linear predictor and the same total number of parameters.

• However, it is possible to informally compare the two models using information criteria. Two of the most popular information criteria are AIC (Akaike's Information Criterion) and BIC (Bayesian Information Criterion, a.k.a. Schwarz's Bayesian Criterion).

• Both of these quantities are penalized versions of the maximized log-likelihood function. That is, they measure how likely the data are according to the model (as quantified by the loglikelihood function evaluated at the MLEs of the model parameters) but then penalize this quantity by an amount related to the complexity of the model. (This same penalization-for-lack-of-parsimony idea is the idea behind adjusted R2.)

• For a model with k × 1 parameter vector θ (including all parameters, not just regression parameters) AIC and BIC are defined as

AIC = −2ℓ(θ̂; y) + 2k,
BIC = −2ℓ(θ̂; y) + k log(n),

where θ̂ is the MLE of θ.

– AIC and BIC are sometimes given in other forms, but in this form, the model with the smallest value of AIC (or BIC, if that criterion is used) is the winner.

– It's hard to say which criterion is best, but BIC tends to lead to more parsimonious models than AIC.


• The S-PLUS functions AIC() and BIC() extract these information criteria from fitted model objects. The function logLik() can be used to obtain the maximized log-likelihood values. The same functions exist in R (a sketch follows below).
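For example, in R (the two fits below are illustrative stand-ins for m4hills.lm and m5hills.lm, since the outlier removal from the handout is not repeated here):

    library(MASS)
    m.ols <- lm(time ~ dist + climb + I(climb^2), data = hills)
    m.wls <- lm(time ~ dist + climb + I(climb^2), data = hills, weights = 1 / dist^2)

    logLik(m.ols); logLik(m.wls)    # maximized log-likelihoods
    AIC(m.ols, m.wls)               # smaller value is preferred
    BIC(m.ols, m.wls)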

• According to both AIC and BIC, m4 is preferred to m5, and we abandon the idea of accounting for heteroscedasticity in this example.

• Although m4 fits pretty well, it is possible to obtain a more parsimonious model for these data that fits even better. Venables and Ripley (1999, Ch. 6) consider regressing inverse speed (time/distance) on the race course gradient (climb/distance). We fit this model,

speed_i^{−1} = β0 + β1 gradi + ei,  i = 1, . . . , n,    (m6)

as model m6hills.lm.

• The residuals versus fitted values for m6 (see plot (i)) look as good as or better than those for any previous model.

• We produce a Q-Q plot for model m6 using the qqnorm() and qqline() functions. This plot (plot (j)) indicates there are more extreme values in the data set than expected according to a normal distribution. This suggests that a more appropriate distribution for the errors in model (m6) might be a distribution with fatter tails than the normal (e.g., the t(ν)-distribution with ν small). See Venables and Ripley (1999, Ch. 6) for discussion of such a robust regression approach to analyzing these data.

• The Scottish hill race data are also analyzed in Ch. 6 of Maindonald & Braun's book, Data Analysis and Graphics Using R, which was distributed in class.


Nonlinear Regression (Ch.2 of Bates & Watts)

We assume that we observe data (y1, x1), . . . , (yn, xn), where yi is a scalar response variable and xi is an m × 1 vector of explanatory variables.

Model:
yi = f(xi, θ) + ei,  i = 1, . . . , n,  where e1, . . . , en iid∼ N(0, σ2),

where

f(·) is a known function (the expectation or regression function),

θ is a p × 1 parameter vector,

the ei's are i.i.d. error terms, and

f(·) is a nonlinear function of θ. That is, ∂f(x, θ)/∂θj depends on θ for some j.

• If f(·) is nonlinear in any component of θ, it is a nonlinear model.

Let ηi(θ) = f(xi, θ) and η = (η1, . . . , ηn)^T. Then we can write our model equivalently as

y = η(θ) + e,  e ∼ Nn(0, σ2 In).

Some examples of expectation functions:

f(x, θ) = θ1 x / (θ2 + x)    (Michaelis-Menten Model)

f(x, θ) = θ1 / (1 + exp{(θ2 − x)/θ3})    (Simple Logistic Model)

f(x, θ) = θ1 + (θ2 − θ1) exp{− exp(θ3) x}    (Asymptotic Regression Model)

f(x, θ) = θ1 + θ2 x1 + θ3 x2^{θ4}    (regression w/ power transformation of x2)


How do we choose f(·)?

1. Mechanistic models. Often there is some scientific theory available that describes the data-generating mechanism. This theory suggests the form of f.

2. Empirical models. At other times no theory is available or we simply want to describe the relationship between y and x in a simple model or develop a model that produces good predictions (unconcerned by how those predictions come about). In such cases we simply try to "fit the data". For data that follow certain general shapes of curves (e.g., sigmoidal, parabolic, etc.), "promising candidates" for nonlinear expectation functions are available (e.g., Ratkowsky, 1990) that can be tried.

• In linear modelling, empirical models are most common. In nonlinear modelling, mechanistic models are more common.

Example — Wind Speed:

Consider the following five measurements of wind speed (y, in cm/sec) at various heights (x, in cm):

  x     y
  40   490.2
  80   585.3
 160   673.7
 320   759.2
 640   837.5


These data are plotted below.

[Figure: scatterplot of Wind Speed (cm/sec) versus Height (cm).]

• See handout, “wind1”.

• For these data we might consider fitting linear models.

Model 1 (simple linear regression):

yi = β0 + β1 xi + ei,  i = 1, . . . , 5,

yields R2 = 0.847 and residual standard deviation s = √mse = 62.13.

Model 2 (quadratic regression):

yi = β0 + β1 xi + β2 x_i^2 + ei,  i = 1, . . . , 5,

yields R2 = 0.975 and residual standard deviation s = √mse = 30.52.


• The fitted regression lines for these models are plotted on the last page of wind1. Clearly, model 1 is inadequate. Model 2 fits ok, but not great. We've already used 3 out of 5 d.f. to fit this model, so going to a cubic to improve the fit is not particularly attractive.

Theory: Under adiabatic conditions, wind speed is related to height as

windspeed = θ1 log{height(1 − θ2/θ3) − 1/θ3},    (∗)

where

θ1 = friction velocity,
θ2 = zero point displacement,
θ3 = roughness length.

This relationship is not likely to hold exactly for our data due to measurement error and deviations from the ideal conditions under which the relationship is theorized to hold. Therefore, we fit a stochastic version of (*):

yi = θ1 log{xi(1 − θ2/θ3) − 1/θ3} + ei,  i = 1, . . . , 5.    (∗∗)

• Notice the model (**) is nonlinear since ∂f(x, θ)/∂θj depends upon θ for all j = 1, 2, 3. E.g.,

∂f(x, θ)/∂θ1 = log{xi(1 − θ2/θ3) − 1/θ3}.

• For now, we skip the details of how to fit a nonlinear model such as (**), but in R it can be done using nonlinear least squares with the nls() function. The parameter estimates (standard errors) turn out to be

θ̂1 = 115.1 (2.04),  θ̂2 = −.0595 (.00546),  θ̂3 = .0454 (.0132).


• The fitted regression line for model (**) fits the observed data much more closely than that of either linear model. In addition, the residual standard deviation of the nonlinear model, s = 1.87, is much smaller than for the linear models.

• Perhaps most importantly, this model reflects what is known about the relationship between wind speed and the height at which it is measured, and the parameter estimates have specific, useful interpretations in terms of the physics/meteorology of the problem.

Classes of Nonlinear Models:

1. Yield-Density Models.

– Common in agriculture (e.g., forestry).

– Models describe the relationship between the yield of a crop and the density of planting.

Let

X = plant density in plants/unit area,
R = yield/plant.

Then W = XR = total yield per unit area (e.g., acre).

Two common yield-density relationships:

i. "asymptotic relationship" (most crops, e.g., carrots, beans, tomatoes)


ii. “parabolic relationship” (e.g., sweet corn, cotton)

A common model for the asymptotic case is

R = (θ1 + θ2 X)^{−1}.

Notice that as X → 0, R → 1/θ1 ⇒ θ1^{−1} has interpretation as the "genetic potential" of a crop uninhibited by competition. In addition, as X → ∞, W = XR → 1/θ2 ⇒ θ2^{−1} = "environmental potential".

For observed data, the model

Ri = (θ1 + θ2 Xi)^{−1} exp(ei)

is often used; or, transforming,

Yi = log(θ1 + θ2 Xi) + ei,

where Yi = − log(Ri).

– This model is not based on any theory; it's simply an empirical model whose parameters have meaningful interpretations in this context.


An important model that is used in the asymptotic case of the yield-density curve but also in a wide variety of other applications is the Asymptotic Regression Model:

Yi = α − β γ^{Xi} + ei,  i = 1, . . . , n.

This model yields a curve that has the following shape:

[Figure: the asymptotic regression curve, approaching a horizontal asymptote as X increases.]

• As with many named classes of nonlinear models, a variety of different parameterizations of the asymptotic regression model are used (a brief R sketch follows below):

Yi = θ1 − θ2 e^{−θ3 Xi} + ei    (here, e^{−θ3} = γ),

Yi = θ1 − θ1 e^{−(Xi + θ2)θ3} + ei,

Yi = θ1 − e^{−(θ2 + θ3 Xi)} + ei.

• In nonlinear regression, the parameterization you use affects the algorithms used to solve the models, the properties of the estimators, and the accuracy of approximations used for inference. To a much greater degree than in linear models, it is important to choose the right parameterization! We will return to this point later.
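For reference, R's self-starting SSasymp() function fits the asymptotic regression model in the θ1 + (θ2 − θ1) exp{−exp(θ3)x} parameterization listed among the example expectation functions earlier. A minimal sketch on simulated data (the data and parameter values are invented purely for illustration):

    set.seed(4)
    x <- seq(0, 10, length = 30)
    y <- 5 + (1 - 5) * exp(-exp(-0.5) * x) + rnorm(30, sd = 0.2)
    dat <- data.frame(x, y)

    ## SSasymp parameterizes the curve as Asym + (R0 - Asym) * exp(-exp(lrc) * x),
    ## i.e., theta1 + (theta2 - theta1) * exp(-exp(theta3) * x); no starting values needed.
    fit.asymp <- nls(y ~ SSasymp(x, Asym, R0, lrc), data = dat)
    summary(fit.asymp)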


2. Growth Models.

– These models relate the size or change in size of some entity totime.

The most common shape for growth curves is a “sigmoidal curve”:

Let R =a measure of size, and X =time. There are several com-monly used growth curves that capture the sigmoidal shape for R asa function of X; e.g.,

R = θ1 exp{− exp(θ2 − θ3X)}, (Gompertz)

R =θ1

1 + exp(θ2 − θ3X), (simple logistic)

R =θ1X

θ2 + θ3θ4Xθ2 + θ3

, (Morgan-Mercer-Flodin)

R = θ1{1− exp(−θ2X)}θ3 , (Chapman-Richards)

• In fitting these models to data, either an additive or multiplicativeerror term may be appropriate. E.g., in the case of the logistic modelwe might consider an additive version:

Yi = θ1 / {1 + exp(θ2 − θ3Xi)} + ei,   where Yi = Ri,

or a multiplicative version:

Yi = log(θ1) − log{1 + exp(θ2 − θ3Xi)} + ei,   where Yi = log(Ri).
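Both versions can be fit with nls(). The following is only a sketch on simulated data (none of the growth data sets mentioned in these notes are reproduced here), with starting values chosen by the rough rules discussed later under Starting Values:

```r
## Sketch: additive- vs multiplicative-error fits of the simple logistic curve
## R = theta1 / (1 + exp(theta2 - theta3 * X)), using simulated data.
set.seed(2)
X <- seq(1, 20, length = 60)
R <- 50 / (1 + exp(4 - 0.5 * X)) * exp(rnorm(60, sd = 0.05))
start <- list(th1 = max(R), th2 = 4, th3 = 0.5)        # rough starting values

add.fit  <- nls(R ~ th1 / (1 + exp(th2 - th3 * X)), start = start)
mult.fit <- nls(log(R) ~ log(th1) - log(1 + exp(th2 - th3 * X)), start = start)
```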


3. Compartmental Models and Other Models Based on Systems of Dif-ferential Equations.

– Compartmental models are mechanistic models where one ormore measurements of some physical process is related to time,inputs to the system, and other explanatory variables througha compartmental system.

– A compartmental system is “a system which is made up of afinite number of macroscopic subsystems, called compartmentsor pools, each of which is homogeneous and well mixed, and thecompartments interact by exchanging materials. There may beinputs from the environment into one or more of the compart-ments, and there may be outputs (excretion) from one or morethe compartments to the environment.” (Seber & Wild, p.367)

– Compartmental models are common in chemical kinetics, toxi-cology, hydrology, geology, and pharmacokinetics.

As an example from pharmacokinetics, consider the data in the fol-lowing scatterplot on tetracycline concentration over time.

[Scatterplot: Tetracycline Concentration (µg/ml) vs. Time (hrs)]


The data come from a study in which a tetracycline compound wasadministered to a subject orally, and the concentration of tetracyclinehydrochloride in the blood serum was measured over a period of 16hours (the data are in Appendix A1.14 of Bates & Watts).

A simple compartmental model for the biological system determiningtetracycline concentration in serum is one that hypothesizes

a. a gut compartment into which the chemical is introduced,

b. a blood compartment which absorbs the chemicals from the gut,and

c. an elimination path.

This simple two-compartment open model can be represented in acompartment diagram as follows:

Here, γ1 and γ2 represent the concentrations of the chemical in compartments 1 and 2, respectively, and θ1 and θ2 represent the rates of transfer into and out of compartment 2, respectively.


Under the assumption of first-order (linear) kinetics, it is assumedthat at time t, the rate of elimination from any compartment is pro-portional to γ(t), the concentration currently in that compartment.

Thus the rates of change in the concentrations in the two compart-ments in the model represented above are

∂γ1(t)/∂t = −θ1 γ1(t)

∂γ2(t)/∂t = θ1 γ1(t) − θ2 γ2(t)

Differential equations solutions for linear compartmental models gen-erally take the form of linear combinations of exponentials. Therefore,these models are nonlinear models that can be fit using methods sim-ilar to those used for yield-density models, growth curve models, etc.

For example, the solution for γ2(t), the concentration in blood serum at time t, is

γ2(t) = θ3 θ1 (e^{−θ1 t} − e^{−θ2 t}) / (θ2 − θ1).

• Here, θ3 is the amount of drug (tetracycline) ingested initially (attime t = 0).


Therefore, we might try the additive error model

yi = θ3 θ1 (e^{−θ1 ti} − e^{−θ2 ti}) / (θ2 − θ1) + ei,   i = 1, . . . , n,

to model the tetracycline data. The resulting fitted regression curve is displayed below.

[Scatterplot with fitted compartmental-model curve: Tetracycline Concentration (µg/ml) vs. Time (hrs)]
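In R, a fit of this additive-error compartmental model could be set up along the following lines; the data frame name, column names, and starting values below are assumptions for illustration only (the tetracycline data are in Bates & Watts and are not reproduced here):

```r
## Sketch only: NLS fit of the two-compartment model, assuming a data frame
## `tetra` with columns `time` (hrs) and `conc` (serum concentration).
comp.fit <- nls(conc ~ th3 * th1 * (exp(-th1 * time) - exp(-th2 * time)) / (th2 - th1),
                data  = tetra,
                start = list(th1 = 0.2, th2 = 0.5, th3 = 10))
summary(comp.fit)
```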

4. Multiphase and Spline Regressions.

In multiphase and spline regression models, the expectation functionfor the regression of y on x, E(y) = f(x;θ), is obtained by piecingtogether different curves over different intervals.


That is, in multiphase and spline models,

f(x; θ, α) =
  f1(x; θ1),   x ≤ α1;
  f2(x; θ2),   α1 < x ≤ α2;
  . . .
  fD(x; θD),  αD−1 < x.

– Here, the expectation function f(x;θ,α) is defined by differentfunctions (f1, f2, . . . , fD) on different intervals, and typically theendpoints of the intervals are unknown and must be estimated.

– The D submodels are referred to as phases, and the α’s arechangepoints.

– Multiphase models are intended for situations in which (a) thenumber of phases is small; (b) the behavior in each phase iswell-described by a simple parametric function like a line orquadratic; and (c) there are fairly abrupt changes between regimes.

– Spline models have the same piecewise form, but the individualphase models fd, d = 1, . . . , D, are always polynomials and theemphasis is on joining these “splines” to obtain a smooth andvery flexible function to capture the underlying regression form.

• After presenting methodology for the general nonlinear regressionmodel, we’ll come back and spend some more time on certain spe-cial class of nonlinear models as time allows.


Special Types of Nonlinear Models:

1. Transformably Linear Models:

Suppose we observe zi, xi, i = 1, . . . , n, where

E(zi) = f(xi;θ).

If it is possible to find a transformation h(·) such that yi = h(zi)satisfies a linear regression model, then we say that the expectationfunction f is transformably linear.

• We must be careful about assumptions on error terms when usinglinearizing transformations!

SupposeE(zi) = eα+βxi .

If the error in zi is proportional to the expected magnitude of zi butotherwise independent of xi (“constant relative error”) then we canwrite a model for zi as

zi = eα+βxi(1 + ei),

where

E(ei) = 0, and var(ei) = σ2 (constant variance).

Equivalently,zi = eα+βxi + e∗i (∗)

where e∗i = E(zi)ei has mean 0 and variance var(e∗i ) = σ2{E(zi)}2.

If we transform to linearity by taking logs we get

yi = α + βxi + log(1 + ei) = α + βxi + ẽi,   where yi = log(zi),

and where ẽi = log(1 + ei) has mean E(ẽi) ≈ E(ei) = 0 (for small ei) and variance var(ẽi) that is independent of xi.


– That is, in (*) we had a model in the original scale with an addi-tive error with variance proportional to the square of the mean.This transformed to a model with nearly constant variance onthe log scale.

– Another way to say this is that if the original model had had a multiplicative error,

zi = e^{α+βxi} e^{ui} = e^{α+βxi+ui},

where ui is the error term with E(ui) = 0 and var(ui) = σ2, then the transformed model would be

yi = α + βxi + ui,   where E(ui) = 0, var(ui) = σ2.

However, if instead of (*) we had additive homoscedastic errors on the original scale:

zi = e^{α+βxi} + ei,   where E(ei) = 0, var(ei) = σ2,
   = e^{α+βxi} (1 + ei/E(zi)),

then

yi = α + βxi + log(1 + ei/E(zi)) = α + βxi + vi,

where now the additive error vi = log(1 + ei/E(zi)) has mean E(vi) ≈ E{ei/E(zi)} = 0 (for ei small compared with E(zi)) and variance var(vi) that varies with E(zi) (heteroscedasticity).


To summarize:

– homoscedasticity on original scale generally induces heteroscedas-ticity on transformed scale. In this case, we should either fit ahomoscedastic nonlinear model to the original data (NLS) or useWLS to fit a heteroscedastic linear model on the transformedscale.

– Certain types of heteroscedasticity on the original scale can lead to homoscedasticity on the transformed scale. In this case, either nonlinear WLS on the original scale or linear OLS on the transformed scale can be used.

– As long as the error variance is accounted for correctly, work-ing on either the linear or nonlinear scale may be appropriate.Desirability of interpretable parameter estimates may argue fornonlinear model on original scale.

– We assume nonlinear model will be fit. Even in this case avail-ability of a linear transformation can be very useful (e.g., forobtaining starting values).

– Note that transformations affect entire distribution of the errorterms, not just their variance. ⇒ normal additive errors on orig-inal scale lead to non-normal additive errors on the transformedscale.

2. Partially Linear Models:

Consideryi = θ1(1− e−θ2xi) + ei.

If θ2 is known (i.e., fixed) then the model is linear:

yi = θ1wi + ei, where wi = 1− e−θ2xi .


θ1 is a conditionally linear parameter if for fixed θ2, . . . , θp themodel is linear.

If there is at least one conditionally linear parameter in the modelthen the model is partially linear.

– For partially linear models, some procedures (e.g., obtainingstarting values) are simplified.

Geometry of the Expectation Surface:

Consider the linear model with n = 2, p = 1 and model equation

yi = θxi + ei, i = 1, 2

where x1 = 1, x2 = 2.

The expectation plane is the set of all 2×1 vectors η(θ) = θ (1, 2)^T. We plot this expectation plane below. (The plot is not reproduced here.)


Features:

1. Linearity (with respect to θ).

a. The effect of changing θ by δ units is the same for all θ.

b. θ points with equal spacing correspond to η points with equalspacing.

c. Line segments in the parameter space correspond to line seg-ments in the expectation plane.

– (a)–(c) above are all essentially restatements of the same idea,linearity.

2. Expectation is of infinite extent.

Now consider the nonlinear model:

yi = 1 / (1 + e^{−θ xi}) + ei.   (∗)

Again, take x = (x1, x2)^T = (1, 2)^T. Then the expectation surface is

η(θ) = ( (1 + e^{−θ})^{−1}, (1 + e^{−2θ})^{−1} )^T,   (a 1-dimensional surface in 2-space)


We can plot this surface in 2-space by plugging a few values of θ into the formula for η(θ) = (η1(θ), η2(θ))^T:

 θ      η1     η2
 −∞     0      0
 −2     .119   .0180
 −1     .269   .119
  0     .500   .500
  1     .731   .881
  2     .881   .982
  ∞     1      1

[Plot: Expectation Surface for Example Nonlinear Model — the curve traced out by η(θ) in the (η1, η2) plane, with points marked at θ = −∞, −2, −1, 0, 1, 2, ∞]


Features:

1. Nonlinearity:

a. Effect of changing θ by 1 unit depends upon the value of θ.

b. θ points with equal spacing correspond to η points with unequalspacing.

c. Line segments in the parameter space correspond to curves inthe expectation space.

2. The expectation surface may be of finite extent.

• The curved shape of the expectation surface is invariant to reparam-eterization. That is, the degree of curvature is the same no matterwhat parameterization is used. This aspect of nonlinearity is knownas intrinsic nonlinearity.

• The extent to which equally spaced θ points map to unequally spaced η points is known as parameter-effects nonlinearity. This type of nonlinearity depends upon the parameterization chosen for the model. A good parameterization leads to small parameter-effects nonlinearity.

For example, consider the following reparameterization of model (*) on p.63:

yi = 1 / (1 + ϕ^{−xi}) + ei,   where ϕ = e^θ.


We can again plot the expectation surface based on the new parameteriza-tion. The result is as follows:

[Two plots of the expectation surface under the ϕ parameterization — left: the entire expectation surface; right: the portion with ϕ ≥ 0, with points marked at ϕ = 0, 1, 2, 3, 4, 10, 10^10]

• Notice that under the reparameterization, the expectation surface isof infinite extent.

• Restricting attention to the region ϕ ≥ 0 (corresponding to the en-tire range of θ) depicted in the right-hand plot, we see that theθ-parameterization has the exact same intrinsic nonlinearity as theϕ−parameterization.

• However, the θ-parameterization has considerably less parameter-effectsnonlinearity than the ϕ-parameterization. This is so because underthe ϕ-parameterization, η-curve segments corresponding to values ofϕ one unit apart differ in length to a much greater extent than underthe θ-parameterization.


• Note that the degree of both intrinsic nonlinearity and parameter-effects nonlinearity in a particular model depend upon the values ofx. Two models with the same form and parameterization but differentvalues of the explanatory variables will have different nonlinearitiesof both types.

E.g., if we change x from (1, 2)T to (−1, 5)T we get the following curves.

[Two plots for the new x vector: the expectation surface with marked θ values (left panel, “theta parameterization, x=(1,5)”) and marked ϕ values (right panel, “phi parameterization, x=(1,5)”)]

• Later in the course we will come back to the ideas of intrinsic andparameter-effects nonlinearity and discuss measures of these two-typesof curvature. For now, suffice it to say that it is desirable to chooseparameterizations with low parameter-effects nonlinearity.


Linear Approximations:

Let h(·) denote a (twice differentiable) scalar-valued function of a scalar argument. For any fixed points u and u∗, Taylor's Theorem says

h(u) = h(u∗) + h′(u∗)(u − u∗) + (1/2) h″(u∗∗)(u − u∗)^2,

where h′(u∗) = ∂h(u)/∂u evaluated at u = u∗, h″(u∗∗) = ∂^2 h(u)/∂u^2 evaluated at u = u∗∗, and u∗∗ is a point between u and u∗.

If u∗ is close to u, then the last term in this expansion will be small relative to the rest, and we have

h(u) ≈ h(u∗) + h′(u∗)(u − u∗) ≡ h̃(u)

for u close to u∗.

• This is known as a first-order (linear) Taylor series approxima-tion.

This approximation is

i. Linear. (h̃(u) is a linear function of u.)

ii. Local. (Only valid for u near u∗.)

Now suppose h is a scalar-valued function of u, a p × 1 vector. For this situation the above linear Taylor series approximation generalizes to

h(u) ≈ h(u∗) + [∂h(u∗)/∂u^T] (u − u∗),

where ∂h(u∗)/∂u^T is 1 × p and (u − u∗) is p × 1, for u close to u∗ (that is, when ∥u − u∗∥ is small).

Here,

u = (u1, . . . , up)^T,   u∗ = (u∗1, . . . , u∗p)^T,   and   ∂h(u∗)/∂u^T = ( ∂h(u)/∂u1, ∂h(u)/∂u2, . . . , ∂h(u)/∂up ) evaluated at u = u∗.


If we prefer a non-vector notation, we can multiply out the second term and write this approximation in the equivalent form

h(u) ≈ h(u∗) + (u1 − u∗1) ∂h(u)/∂u1 |_{u=u∗} + · · · + (up − u∗p) ∂h(u)/∂up |_{u=u∗}.

Consider

yi = f(xi, θ) + ei = ηi(θ) + ei,   i = 1, . . . , n,

or, in vector form, y = η(θ) + e.

Estimation and inference about θ is easy if f(xi,θ) is linear in θ. Thissuggests using a linear Taylor series approximation of f(xi,θ).

For θ near θ∗,

f(xi, θ) ≈ f(xi, θ∗) + (θ1 − θ∗1)Vi1 + · · · + (θp − θ∗p)Vip,

for each i = 1, . . . , n, where

Vij = ∂f(xi, θ)/∂θj |_{θ=θ∗}.

Stacking these n approximations on top of one another in vector form, we have

η(θ) ≈ η(θ∗) + V(θ∗)(θ − θ∗),

where V(θ∗) is the n × p matrix with (i, j)th element Vij. (I.e., V(θ∗) has ith row ∂f(xi, θ)/∂θ^T |_{θ=θ∗}.)


A picture of the Taylor series linear approximation taken at θ∗ = 2:

[Plot: the expectation surface and its tangent-line (linear Taylor) approximation at θ∗ = 2]

Estimation of θ:

Under the assumption that e ∼ Nn(0, σ2 In), the MLE of θ is also the nonlinear least-squares estimator. I.e., it is the value of θ that minimizes

∥y − η(θ)∥2 = Σ_{i=1}^{n} {yi − f(xi, θ)}2.

• This is still the (ordinary) least squares criterion. The only difference is that f(xi, θ) is no longer of the linear form xi^T θ (nonlinearity).

• As in linear least squares, we need to find the point on the expectationsurface, η(θ), that is closest to y in terms of Euclidean distance.

• This is a harder problem because the expectation surface is no longera plane.


E.g., Suppose y = (.85, .85)T in our example from p.63–64.

[Plot: the expectation surface together with the observation point y = (.85, .85)^T]

• In the nonlinear case, finding the point η on the expectation surfaceis hard because of intrinsic nonlinearity.

• Once we find η we must find the value of θ solving η(θ) = η. Inlinear regression this step is easy because the mapping from β to ηis linear and invertible. In nonlinear regression, this step is harder,because of both intrinsic and parameter-effects nonlinearity.


Nonlinear least squares: find the value θ̂ of θ that minimizes

S(θ) = ∥y − η(θ)∥2 = Σ_{i=1}^{n} {yi − f(xi, θ)}2.

Q: How do we minimize S(θ)?

A: Solve normal equations.

Taking derivatives we get

∂S(θ)/∂θj = −2 Σ_{i=1}^{n} {yi − f(xi, θ)} ∂f(xi, θ)/∂θj,   j = 1, . . . , p,

or, in matrix notation,

∂S(θ)/∂θ^T = −2 [y − η(θ)]^T V(θ),

where V(θ) is the n × p derivative matrix. Therefore, θ̂ satisfies

[y − η(θ̂)]^T V(θ̂) = 0.   (†)

• Recall that in linear regression, the derivative matrix was V(θ) = X and our normal equation took the form of the orthogonality condition:

[y − Xβ̂]^T X = 0,

which says that the residual vector y − Xβ̂ is orthogonal to the expectation plane (all vectors of the form Xβ).

• In the nonlinear case, (†) says that the residual vector y − η(θ̂) is orthogonal to the “tangent plane” to the expectation surface at θ = θ̂.


• Notice (†) is nonlinear in θ and doesn’t, in general, yield a closed form

expressionfor the solution, θ.

Q: How do we solve a nonlinear set of equations like (†)?

A: Usually requires an iterative method. For example,...

Gauss-Newton Method

The G-N method proceeds by

1. Obtaining a starting value θ0.

2. Using a linear approximation to η(θ) for θ near θ0.

3. Using linear regression methods to “estimate θ”; i.e., to update θ0 toθ1.

4. Repeat steps 2 and 3 until convergence.

Let V0 = V(θ0). Then

η(θ) ≈ η(θ0) + V0(θ − θ0)

for θ near θ0. It follows that

y − η(θ) ≈ [y − η(θ0)] − V0(θ − θ0) ≡ z0 − V0 δ,

where z0 ≡ y − η(θ0) and δ ≡ θ − θ0.

Therefore, for θ near θ0, choosing θ to minimize ∥y − η(θ)∥2 is “approxi-mately equivalent to” (should give nearly the same result as) the problemof choosing δ to minimize ∥z0 −V0δ∥2.

• This minimization can be carried out using linear regression. E.g., wecould use δ = {(V0)TV0}−1(V0)T z0. However, this formula, whilecorrect, is not the best way to do the computations - either here, orin regular linear regression.


This process consists of two steps:

a. Obtaining the point η∗ = V0δ.

b. Determining δ from η∗.

Once we have δ̂, we can easily translate to θ. From its definition,

δ = θ − θ0  ⇒  θ = θ0 + δ.

Therefore, we can update θ0 to θ1 via

θ1 = θ0 + δ̂.   (∗)

• Because of this updating formula, we call δ̂ the Gauss-Newton increment.

• θ is updated this way until convergence.

Complication: There is no guarantee that the update will reduce the objective function S(θ). I.e., there is no guarantee that S(θ1) < S(θ0).

We can fix this problem simply enough. If S(θ1) ≥ S(θ0), then replace the update (*) by

θ1 = θ0 + λ δ̂,   where 0 < λ ≤ 1.

To choose λ, start with λ = 1 and then try λ values 1, 1/2, 1/4, 1/8, . . . until we find a λ value that gives S(θ1) < S(θ0).

• λ is called a step factor.
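A bare-bones sketch of these steps in R, applied to the one-parameter example model used in these notes (x = (1, 2)^T, y = (.85, .85)^T, starting value θ0 = 2); qr.solve() computes the least-squares increment via a QR decomposition, which is discussed next:

```r
## Sketch: Gauss-Newton with a step factor for eta_i(theta) = 1/(1 + exp(-theta*x_i)).
x <- c(1, 2);  y <- c(0.85, 0.85)
eta <- function(th) 1 / (1 + exp(-th * x))
V   <- function(th) matrix(x * exp(-th * x) / (1 + exp(-th * x))^2, ncol = 1)
S   <- function(th) sum((y - eta(th))^2)

th <- 2                                     # starting value theta^0
for (k in 1:50) {
  z     <- y - eta(th)                      # z^k = y - eta(theta^k)
  delta <- drop(qr.solve(V(th), z))         # G-N increment: min ||z - V delta||^2
  lam   <- 1                                # step factor
  while (S(th + lam * delta) >= S(th) && lam > 1e-10) lam <- lam / 2
  th.new <- th + lam * delta
  if (abs(th.new - th) < 1e-8 * (abs(th) + 1e-8)) { th <- th.new; break }
  th <- th.new
}
th                                          # NLS/ML estimate of theta
```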


Each of the steps a and b in the description of the G-N algorithm above ismade easier by using the QR Decomposition

The QR Decomposition:

• Our interest right now is in using the QR decomposition in the non-linear model, but just to keep things simple, let’s return to the linearmodel for a moment.

In the linear model, the least squares problem is to find the value of β that minimizes

∥y − Xβ∥2.

We know that (if X^T X is invertible) the answer is

β̂ = (X^T X)^{−1} X^T y.

• However, the computation of β via this formula can be computa-tionally inefficient and error-prone. A better way is with the QRdecomposition.

In general, an n × p (n ≥ p) matrix X can be decomposed as

X = QR,

where Q is an n × n orthogonal matrix (it has the property QQ^T = Q^T Q = In) and R is an n × p matrix of the form

R = ( R1 ; 0 ),

where R1 is a p × p upper-triangular matrix (it has zeros below the diagonal) stacked on top of an (n − p) × p block of zeros.

• From the fact that the last n − p rows of R contain 0's, we can write

X = QR = (Q1, Q2) ( R1 ; 0 ) = Q1 R1   (with Q1 of dimension n × p and Q2 of dimension n × (n − p)),


where Q1 consists of the first p columns of Q and R1 contains the first p rows of R.

Since we know that β̂ = (X^T X)^{−1} X^T y, it also follows that the mean of y, Xβ, has least squares estimate

Xβ̂ = X (X^T X)^{−1} X^T y.

Applying the QR decomposition to X we get

Xβ̂ = Q1 R1 (R1^T Q1^T Q1 R1)^{−1} R1^T Q1^T y      (using Q1^T Q1 = I)
    = Q1 R1 R1^{−1} (R1^T)^{−1} R1^T Q1^T y
    = Q1 Q1^T y
    = Q ( Q1^T y ; 0 ) = Q ( w1 ; 0 ),

where w1 = Q1^T y.

• So, we have that

Xβ̂ = Q ( w1 ; 0 ),   (♣)

which allows us to get the least squares estimate of the mean of y without computing a matrix inverse - a computationally demanding and error-prone task.


All that's left to do is find β̂ once we have Xβ̂. This is easy, because applying the QR decomposition to X again in (♣), we get

Xβ̂ = Q ( w1 ; 0 )  ⇒  QRβ̂ = Q ( w1 ; 0 )  ⇒  Q^T Q Rβ̂ = Q^T Q ( w1 ; 0 )  ⇒  Rβ̂ = ( w1 ; 0 )  ⇒  R1 β̂ = w1,   (♢)

where we have used Q^T Q = I.

• Solving for β̂ is now easy because R1 is upper-triangular!

E.g., suppose that in a simple linear regression problem we did the computations and found that

w1 = ( 2, −1 )^T   and   R1 = ( 1   11/3
                                0    1/3 ).

Then to obtain β̂ we would solve R1 β = w1, i.e.,

β1 + (11/3) β2 = 2
       (1/3) β2 = −1,

which solve easily to give β̂2 = −3, β̂1 = 13.
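In R, both the QR decomposition and the back-substitution step are built in; a quick check of the small example above:

```r
## The worked example: solve R1 %*% beta = w1 by back-substitution.
R1 <- matrix(c(1, 11/3,
               0,  1/3), nrow = 2, byrow = TRUE)
w1 <- c(2, -1)
backsolve(R1, w1)        # exploits the upper-triangular structure; returns 13, -3

## More generally, for a design matrix X and response y one could use:
## QR <- qr(X); backsolve(qr.R(QR), qr.qty(QR, y)[1:ncol(X)])
```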


Back to the G-N method for the nonlinear model:

Steps a and b above were aimed at finding the G-N increment δ̂ by minimizing ∥z0 − V0 δ∥2. The steps were:

a. Obtaining the point η∗ = V0 δ̂.

b. Determining δ̂ from η∗.

By utilizing the QR decomposition V0 = QR = Q1 R1 in step a, we obtain the computationally efficient formula

η∗ = Q1 Q1^T z0.   (compare p. 76)

(Note that η∗ lies on the tangent plane to the expectation surface; it's not on the expectation surface itself - hence the star notation. I.e., η∗ isn't equal to η(θ) for any θ.)

Then in step b, δ̂ can be determined from η∗ by solving the linear equation η∗ = V0 δ. Or, since we can write this equation as Q1 R1 δ = η∗ = Q1 Q1^T z0, this step reduces to solving

R1 δ = Q1^T z0,   (compare (♢))

which is easy because R1 is upper triangular.


Geometry of Nonlinear Least-Squares:

The (k + 1)th G-N iteration consists of

1. approximation of η(θ) by η(θk) + Vk(θ − θk) near θk. This stepconsists of two parts:

a. replacement of η(θ) by the tangent plane (planar assumption).

b. using a linear coordinate system Vk(θ − θk) on the tangentplane (uniform coordinate system).

(a) and (b) are separate aspects of the linear approximation. That is, it is possible to have a linear (in some sense) approximation with non-uniform coordinates. E.g., if f(xi, θ) = θ^2 xi then the expectation surface for this model is a plane with nonuniform coordinates with respect to θ:

[Plot: expectation surface η(θ) = x θ^2, x = (1, 2)^T — a straight line (plane) with nonuniform coordinates, points marked at θ = 0, 1, 2]


2. Finding the point on the tangent plane closest to y. I.e., minimizing∥zk −Vkδ∥2.

3. Updating θk to θk+1 = θk + δ.

Example — n = 2, p = 1

Consider again the example from pp.63–64. Let y = (.85, .85)T and suppose

we take θ0 = 2 as our starting value. Then the first (k = 1) G-N iterationlooks like this geometrically:

[Plot: geometry of the first Gauss-Newton iteration — the point on the tangent plane at θ0 = 2 closest to y, mapped back to the expectation surface]

• Once we obtain the point η(θ0) +V0δ on the tangent plane, we map

back to the point η(θ0 + δ) = η(θ1) on the expectation surface.

• If η(θk + δ) = η(θk) +Vkδ, then the linear approximation is exactand the MLE has been found. Unless the problem is linear, we willnever obtain exact equality. However, we will obtain approximateequality when the G-N increment δ is small. When that happens, wesay the algorithm has converged.


Convergence of the Algorithm:

Let θk denote the estimate from the kth iteration. Convergence can be measured in several ways:

1. Look at the change in the parameter estimates:

If θk ≈ θk−1 we may consider the sequence θ0, θ1, θ2, . . . to have converged and take θ̂ = θk as the MLE/NLS estimate.

The most common way to establish that θk ≈ θk−1 is to look at

max_j |θ_j^k − θ_j^{k−1}| / |θ_j^{k−1}|.

If this criterion is less than some tolerance value (1 × 10^{−6}, say), the algorithm stops.

2. Look at the change in the objective function S(θ):

If S(θk) ≈ S(θk−1) then take θ̂ = θk. To judge this we can use the criterion

[S(θk−1) − S(θk)] / S(θk).

Again, if this criterion is smaller than some tolerance value, then stop.

Q: Which approach, 1 or 2, is better?

A: It depends. If the main interest is in the mean response, η(θ), then (2) may be more appropriate. If the regression parameters themselves are of more interest, then perhaps (1) is better.


Both criteria have the disadvantage that they don't indicate when θk is close to θ̂ or when S(θk) is close to S(θ̂).

E.g., if we knew the minimum value of the least squares criterion, S(θ̂), was equal to 10, say, then we could stop the algorithm based on judging when S(θk) ≈ 10.

Of course, we never know S(θ̂) or θ̂, so we cannot judge convergence to the optimal value in this manner. However, we do know that θ̂ must satisfy the normal equation

[y − η(θ̂)]^T V(θ̂) = 0;

therefore we can stop iterating when

[y − η(θk)]^T V(θk) ≈ 0,

where [y − η(θk)] is the residual vector for θk.

This suggests a third convergence criterion:

3. Relative offset criterion. Based on θk, the residual vector [y−η(θk)],can be written as the sum of two components:

– a vector in the tangent plane

– a vector ⊥ to the tangent plane


These two components can be seen in the following plot

[Plot: the residual vector y − η(θk) decomposed into a component in the tangent plane and a component orthogonal to it]

This is to say that [y − η(θk)] can be written as

[y − η(θk)] = ψ1 + ψ2,

where ψ1 is in the tangent plane and ψ2 is ⊥ to the tangent plane.

Then one measure of “how orthogonal” y − η(θk) is to the tangent plane is

∥ψ1∥ / ∥ψ2∥,

the relative offset criterion.

• When this criterion is close to zero, the ψ1 component of y − η(θk) is negligible, implying y − η(θk) is approximately ⊥ to the tangent plane; i.e., [y − η(θk)]^T V(θk) ≈ 0.

• It can be shown that the components ψ1 and ψ2 have lengths

∥ψ1∥ = ∥Q1^T [y − η(θk)]∥,   ∥ψ2∥ = ∥Q2^T [y − η(θk)]∥,

where Q = [Q1, Q2] is the Q matrix from a QR decomposition of V(θk).


Inference Using the Linear Approximation:

• The basic approach to inference in nonlinear models is to use the linearapproximation to reduce the situation to the linear case and then usestandard linear models results. Because of the approximation, theseresults will be approximate rather than exact.

Recall that the (nonlinear) least squares estimator θ̂ minimizes

S(θ) = ∥y − η(θ)∥2.

Let θ∗ denote the true value of θ. Using the linear Taylor series approximation, we can approximate η(θ) for θ close to θ∗ as

η(θ) ≈ η(θ∗) + V(θ∗)(θ − θ∗).

It follows that

y − η(θ) ≈ [y − η(θ∗)] − V(θ∗)(θ − θ∗) = z∗ − V(θ∗) δ∗,

where z∗ ≡ y − η(θ∗) and δ∗ ≡ θ − θ∗, so that

S(θ) = ∥y − η(θ)∥2 ≈ ∥z∗ − V(θ∗) δ∗∥2 ≡ S∗(θ).


We know from linear models that the value of δ∗ = θ − θ∗ that minimizes S∗(θ) is

{(V∗)^T V∗}^{−1} (V∗)^T z∗,

where V∗ ≡ V(θ∗). It follows that the value of θ that minimizes S∗(θ) is

{(V∗)^T V∗}^{−1} (V∗)^T z∗ + θ∗.

It's also true that the value of θ that minimizes S(θ) is θ̂.

Therefore, since S(θ) ≈ S∗(θ) in a neighborhood of θ∗ (which we should expect to contain θ̂, since θ̂ is a “good estimator”), these two minimizers should be approximately equal:

θ̂ ≈ {(V∗)^T V∗}^{−1} (V∗)^T z∗ + θ∗,

or, equivalently,

θ̂ ≈ θ∗ + {(V∗)^T V∗}^{−1} (V∗)^T [y − η(θ∗)] = θ∗ + {(V∗)^T V∗}^{−1} (V∗)^T e,

since y − η(θ∗) = e. This result,

θ̂ ≈ θ∗ + {(V∗)^T V∗}^{−1} (V∗)^T e,   (†)

is an approximate version of the exact result

β̂ = β + (X^T X)^{−1} X^T e   (see p. 21)

that we had for the linear model, where the derivative matrix has changed from X to V(θ∗).


From (†), analogues of all of the linear model inference results can be obtained for the nonlinear model, although they are now only approximate. E.g.,

var(θ̂) ≈ var(θ∗ + {(V∗)^T V∗}^{−1} (V∗)^T e)
        = {(V∗)^T V∗}^{−1} (V∗)^T var(e) V∗ {(V∗)^T V∗}^{−1}     (var(e) = σ2 In)
        = σ2 {(V∗)^T V∗}^{−1} ≡ avar(θ̂),

where avar(·) stands for the “asymptotic” variance-covariance matrix (approximate, based on a large-sample approximation).

We estimate this asymptotic/approximate var-cov matrix as

âvar(θ̂) = s2 (V̂^T V̂)^{−1},   where V̂ ≡ V(θ̂),

and

s2 = 1/(n − p) Σ_{i=1}^{n} {yi − f(xi, θ̂)}2.

An asymptotic standard error for θ̂j is given by the square root of the jth diagonal element of âvar(θ̂):

a.s.e.(θ̂j) = s √[(V̂^T V̂)^{−1}]_{jj}.

All other inference results 1–9 (except 4 & 5) on pp. 21–24 hold in an analogous fashion by replacing X^T X by V̂^T V̂.


In particular, the following linear approximation methods of inference hold for the nonlinear model with NLS parameter estimator θ̂:

1. Although θ̂ is no longer exactly unbiased, it is asymptotically unbiased (i.e., consistent). This is to say that θ̂ gets closer and closer to θ∗ (in probability) as the sample size increases.

2. θ̂ has an asymptotic/approximate var-cov matrix that can be estimated as

âvar(θ̂) = s2 (V̂^T V̂)^{−1},   where V̂ ≡ V(θ̂).

3. (Asymptotic normality) θ̂ ·∼ Np(θ, σ2 {(V∗)^T V∗}^{−1}) (if e is assumed normal). Here, “·∼” means “approximately distributed as.”

6. An asymptotic/approximate standard error of θ̂j, the jth estimated regression parameter, is

a.s.e.(θ̂j) = s √[(V̂^T V̂)^{−1}]_{jj};

an approximate 100(1 − α)% confidence interval for θj is given by

θ̂j ± t_{1−α/2}(n − p) a.s.e.(θ̂j);

and a test of H0 : θj = θ0 that has approximate level α has the rejection rule: reject H0 if

|θ̂j − θ0| / a.s.e.(θ̂j) > t_{1−α/2}(n − p).

7. A joint confidence region for θ with approximate coverage probability 100(1 − α)% is given by the set of all θ such that

(θ̂ − θ)^T (V̂^T V̂)(θ̂ − θ) / (p s2) ≤ F_{1−α}(p, n − p).

For an approximate α-level test, we reject H0 : θ = θ0 in favor of H1 : θ ≠ θ0 if

F1 ≡ (θ̂ − θ0)^T (V̂^T V̂)(θ̂ − θ0) / (p s2) > F_{1−α}(p, n − p).   (∗)


Recall that in the linear model, the test statistic F1 given in (*) was algebraically equivalent to the test statistic

F2 = [(SSE0 − SSE)/(dfE0 − dfE)] / (SSE/dfE),

where SSE0 and SSE are the sums of squares for error associated with the null model (the model in which H0 holds) and the alternative model, respectively, and dfE0 and dfE are the corresponding degrees of freedom for these two models.

In the nonlinear model, it is important to note that the statistics F1 and F2 are no longer equal and yield different* tests of H0!

Which is better?

• Since F2 depends upon θ̂ only through η(θ̂), the estimated mean of y, and not directly through θ̂, models with two different parameterizations will have the same value of F2.

– Hence, F2 will be affected only by the intrinsic nonlinearity of the model and not by its parameter-effects nonlinearity.

– In contrast, F1 will be affected by both intrinsic and parameter-effects nonlinearity.

• The result is that the asymptotic (approximate) distribution of F2 is often much more accurate than that of F1 in finite samples, and therefore the test based on F2 is superior.

• Both F1 and F2 can be generalized in an obvious way to test the more general hypothesis H0 : Aθ = b for a matrix and vector of constants A and b, respectively.

* (though asymptotically equivalent)
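In R, the F2 (extra-sum-of-squares) comparison is what anova() reports for two nested nls fits; fit0 and fit1 below are hypothetical names for the fitted null and alternative models:

```r
## F2 = [(SSE0 - SSE)/(dfE0 - dfE)] / (SSE/dfE), computed from two nested nls fits.
anova(fit0, fit1)
```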


8. An approximate 100(1 − α)% CI for the mean response at a given vector of explanatory variables x0 is given by

f(x0; θ̂) ± t_{1−α/2}(n − p) √[ s2 f̂0^T (V̂^T V̂)^{−1} f̂0 ],

where

f̂0^T = ( ∂f(x0; θ)/∂θ1, ∂f(x0; θ)/∂θ2, . . . , ∂f(x0; θ)/∂θp ) evaluated at θ = θ̂.

(f̂0 is the gradient of the expectation function at x0, f(x0; θ), with respect to θ, evaluated at our estimate of θ, θ̂.)

An approximate 100(1 − α)% CI for a linear combination of the θ's of the form c^T θ, for some vector of constants c, is given by

c^T θ̂ ± t_{1−α/2}(n − p) √[ s2 c^T (V̂^T V̂)^{−1} c ].

9. An approximate 100(1 − α)% prediction interval for the response y0 at a given vector of explanatory variables x0 is given by

f(x0; θ̂) ± t_{1−α/2}(n − p) s √[ 1 + f̂0^T (V̂^T V̂)^{−1} f̂0 ].

10. An approximate 100(1 − α)% confidence band for the response function that holds for all possible x0 is given by

f(x0; θ̂) ± √[F_{1−α}(p, n − p)] √[ p s2 f̂0^T (V̂^T V̂)^{−1} f̂0 ].


• The “approximateness” of most of the above approximate inferenceresults in the nonlinear model depends upon the accuracy of thelinear approximation. This depends upon both the planar assump-tion (intrinsic nonlinearity) and the uniform coordinates assumption(parameter-effects nonlinearity).

– In particular, results 6, 7 (the use of the F statistic F1, notF2), 8, 9, 10 are collectively known as “linear approximation”inference methods and are affected by both the intrinsic andparameter-effects nonlinearity of the model.

A better approach to inference is based upon the profile likelihood function or, equivalently, the profile t function. The statistic F2 is based on this approach. We will describe this approach in detail later, but for now, we can use linear approximation results for inference.
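For reference, the linear-approximation quantities above are what R reports for an nls fit; `fit` below stands for any nls() object (such as the sketches earlier in these notes):

```r
## Linear-approximation inference from an nls fit.
summary(fit)$coefficients   # theta.hat, a.s.e.(theta.hat), t ratios
vcov(fit)                   # s^2 (V'V)^{-1}: estimated asymptotic var-cov matrix
confint.default(fit)        # Wald (linear-approximation) intervals
## Profile-based intervals (discussed later) are available via confint(fit),
## which profiles the sum of squares rather than relying on the linear approximation.
```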


Practical Considerations in Nonlinear Regression

1. Model Specification:

There are two components of any model relating y to x:

1. Deterministic component — the expectation function.

2. Stochastic component — the error (or disturbance) term.

Specification of the expectation function:

Ideally, the expectation function is implied by some contextual theory. I.e.,the expectation function is based on a mechanistic model.

Examples:

• Wind Speed: Recall our wind speed model. Based on a theory of the relationship between wind speed and height under adiabatic conditions, we have

windspeed = θ1 log{height(1 − θ2/θ3) − 1/θ3},   (∗)

where

θ1 = friction velocity,
θ2 = zero point displacement,
θ3 = roughness length.

• Pressure and Temperature in Steam: The relationship between pressure and temperature in saturated steam can be written as

pressure = θ1 (10)^{θ2 · temperature/(θ3 + temperature)},

where θ1, θ2, θ3 are constants.


• Chemical Reaction: According to chemical theory, the reaction rate R for a certain chemical reaction is expected to be related to x1, x2, the partial pressures of reactant and product, respectively, according to

R = θ1 θ3 x1 / (1 + θ1 x1 + θ2 x2),

where θ1 and θ2 are adsorption equilibrium constants for reactant and product, respectively, and θ3 is the effective reaction rate constant.

• A Compartment Model: According to the two-compartment open model introduced on pp. 55–57, the concentration of tetracycline in blood serum over time (t) satisfies

concentration(t) = θ1 θ3 (e^{−θ1 t} − e^{−θ2 t}) / (θ2 − θ1).

• A Growth Model: A simple assumption leading to growth subject to an upper limit α on size is that growth is proportional to the size remaining; i.e., if x denotes time and f denotes size, this assumption can be expressed as the differential equation

∂f/∂x = κ(α − f).

This differential equation has general solution

f(x) = α − β e^{−κx},

which can be reparameterized in a variety of ways, including

f(x) = θ1(1 − e^{−θ2(x−θ3)})   (monomolecular growth model)

and

f(x) = ϕ1 + ϕ2 ϕ3^x   (asymptotic regression model).

The above relationship is also Newton's law of cooling for a body cooling over time (in that context f(x) represents temperature as a function of time, x).


Often, however, no mechanistic model is available, and an empirical approach must be taken. That is, we need to choose an appropriate expectation form that fits the data well. Preferably, such a model will have interpretable parameters and low parameter-effects nonlinearity.

Approaches to choosing an empirical model:

• Search the literature. Is there a model (possibly a mechanistic one) that has been used in a similar context previously? If so, someone else has already done the work of selecting an empirical model, and their choice may be appropriate for the data currently under consideration.

• Classify the problem. Are we modeling growth? Are we modelingyield? Are we modeling concentration over time? Are we model-ing some kind of rate of change? For some general classes of prob-lems, some “standard models” for these kinds of problems exist. (SeeRatkowsky, Handbook of Nonlinear Models, for a collection of com-monly used nonlinear models.)

• Examine the relationship between the response and the primary co-variate(s).

– Is the curve sigmoidal? Then consider a logistic model, or one ofthe other common sigmoidal forms: Gompertz, Von Bertalanffy,Richards, Weibull, Fletcher, or Morgan-Mercer-Flodin. We’llcome back and talk about some of these in some detail whenwe discuss growth curves. These models are treated in detail inSeber & Wild, Ch. 7.

– Does the curve rise monotonically to an asymptote and have aconcave form? Then a Michaelis-Menten, asymptotic regres-sion, asymptotic yield-density curve, or a simplified growth-curve model might be appropriate.

– Is the curve parabolic? Then a Michaelis-Menten curve with aquadratic term in the denominator may be useful, or a parabolicyield density curve. Several yield-density models, both asymp-totic and parabolic, are presented in Seber & Wild, §7.6.


– Does the curve show exponential decay? Then perhaps a sin-gle exponential model (E(y) = θ1e

−θ2x) or biexponential form(E(y) = θ1e

−θ2x + θ3e−θ4x) would be useful.

Specification of the stochastic component:

The question of how the error term should enter into the model can be an-swered based on theory, but usually it is addressed empirically. The choiceis often between an additive homoscedastic vs. a multiplicative homoscedas-tic error term, although later we will discuss a wider class of models thatcan accommodate heteroscedastic additive errors.

Let E(R) = h(x,θ) be the deterministic component relating some responseR to explanatory variables x. The additive situation is one in which

R = h(x,θ) + u

where E(u) = 0 and var(u) = σ2. In this case, we formulate our model bytaking y = R, f(x,θ) = h(x,θ) and e = u giving our standard form

y = f(x,θ) + e, where E(e) = 0, var(e) = σ2.

The multiplicative situation is one in which

R = h(x,θ)(1 + u)

= h(x,θ)z

where u has mean 0 so that z = (1 + u) has mean E(z) = 1. In this case,we formulate our model by taking y = log(R), f(x,θ) = log{h(x,θ)} ande = log(z) giving our standard additive-error form

y = f(x,θ) + e (∗)

• Note that the assumption E(e) = 0 here is appropriate because alinear Taylor series approximation of log(z) about its mean 1 gives

E(e) = E{log(z)} ≈ E{log(1) + 1(z − 1)} = E{z − 1} = 0.


• However, the appropriateness of the assumption var(e) = σ2 depends upon the nature of the variance of R. Using the same Taylor linearization, we have log(z) ≈ z − 1 = R/h(x, θ) − 1. Therefore,

var(e) = var{log(z)} ≈ var(R) / {h(x, θ)}2.

• Therefore, a homoscedastic error assumption on e in model (*) isappropriate in the multiplicative error case ifR has standard deviationproportional to its mean. This is a special, but common, form of non-constant variance.

• If the multiplicative-error model forR is not log-transformable to a ho-moscedastic additive-error model, or if we would prefer to work withan additive-error model for R rather than taking a transformation,then it will be necessary to drop the assumption of homoscedastic-ity. As in linear regression, it is possible to use weighted/generalizednonlinear least squares to fit a heteroscedastic nonlinear model.

• To judge the nature of the variability in R, a plot of R versus each ofthe covariates in x can be helpful. In addition, residual plots such asresiduals versus fitteds can be used.

• In the case that there are multiple observations of R at each level ofx we can calculate the sample standard deviation of R at each leveland examine how these SDs change over increasing x.

Example — Age of Rabbits Measured “By Eye”:

The European rabbit Oryctolagus cuniculus is a major pest in Australia. Areliable method of age determination for rabbits caught in the wild wouldbe of importance in ecological studies. In a study by Dudzinski and Myky-towycz (1961), the dry weight of the eye lens was measured for 71 free-livingwild rabbits of known age. Eye lens weight tends to vary much less withenvironmental conditions than does total body weight, and therefore maybe a much better indicator of age


The rabbits were born and lived free in an experimental 1.7 acre enclosureat Gungahlin, ACT. The birth data and history of each individual wereaccurately known. Rabbits in the enclosure depended on the natural foodsupply. In this experiment, 18 of the eye lenses were collected from rabbitsthat died in the course of the study from various causes such as coccidiosis,bird predation or starvation. The remaining 53 rabbits were deliberatelykilled, immediately after being caught in the enclosure or after they hadbeen kept for some time in cages. The lenses were preserved and their dryweight determined.

Here we take

R = eye lens weight in mg, and
x = age in days.

Dudzinski and Mykytowycz suggest the deterministic relationship

E(R) = θ1 exp{ −θ2 / (θ3 + x) }.

• The data from this example are contained in file rabbiteye.dat avail-able on the course web site.

• See handout Rabbit1. In this handout, we plot the data, fit somemodels, and check some residual plots to determine the appropriatescale for the error term.

• The first thing we do is plot both Lens (lens weight) and log(Lens)against Age. From these plots it is clear that a multiplicative errorterm is more appropriate than an additive one.

• To follow up on this conclusion, we fit the models

Lensi = θ1 exp{ −θ2 / (θ3 + xi) } + ei,   var(ei) = σ2,   (m1rabbit.nls)

and

log(Lensi) = θ1 − θ2 / (θ3 + xi) + ei,   var(ei) = σ2.   (m2rabbit.nls)


• The fitted curves from these models are given as dotted lines on thebottom two plots on the first page of plots. Both fits look pretty goodhere, but the corresponding residual vs. fitted plots (top of last pageof Rabbit1) show the failure of the homoscedasticity assumption inm1rabbit.nls.

• The function gnls() in S-PLUS will allow the user to fit a nonlinear model using weighted/generalized least squares. We use this function to fit m1rabbit.gnls. This model is as follows:

Lensi = θ1 exp{ −θ2 / (θ3 + xi) } + ei,   var(ei) = σ2 Agei.   (m1rabbit.gnls)

• The fitted curve from m1rabbit.gnls is overlaid in the bottom-left plot on the first page of plots in Rabbit1. It is almost indistinguishable from the OLS fit, and the OLS and WLS parameter estimates are very similar. Note, however, that the standard errors between m1rabbit.nls and m1rabbit.gnls have changed substantially (the gnls ones are more appropriate).

• In general, gnls will fit models for which

var(ei) = σ2g2(µi,vi, δ),

where µi = E(yi), vi is a vector of variance covariates, δ is a vectorof variance parameters to be estimated, and g(·) is a known variancefunction.

• In our example, there was just one variance covariate, vi = Agei, and there were no variance parameters, so g2(µi, vi, δ) = g2(µi, Agei) = Agei and var(yi) = σ2 Agei.


• Alternatively, we could have estimated the power of the variance covariate Age by taking g2(·) as g2(µi, Agei, δ) = Agei^{2δ} and treating δ as an unknown parameter to be estimated. This could be done in gnls() with the option weights = varPower(form = ~Age). (A code sketch of these weighting options follows these bullets.)

– Note that as soon as we include a parameter δ to be estimated in the variance function g2(·), this takes us out of the WLS context. Remember that in WLS, the variance is assumed proportional to a known value. Assuming var(ei) = σ2 Agei^{2δ} only corresponds to WLS if δ is known. Otherwise, we need ML estimation.

– Furthermore, in the last two examples, g2(µi, vi, δ) did not depend upon µi. If it does, then the model for the mean µi appears both in the expectation function and the variance function of the model. This takes us further afield, requiring pseudo-likelihood estimation. For example, the option weights = varPower(form = ~fitted(.)) specifies that the error variance is proportional to some to-be-estimated power of the mean: var(ei) = σ2 |µi|^{2δ}.

• In fact, gnls() can fit not only weighted least squares (heteroscedastic)models, but also generalized least squares (correlated errors) models.We will return to this capability later.
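In R, gnls() lives in the nlme package. A sketch of the fits described above, assuming a data frame `rabbit` with columns `Lens` and `Age` (the data frame name and starting values are illustrative, not taken from the handout):

```r
## Sketch: weighted nonlinear fits of the rabbit eye-lens model with gnls().
library(nlme)
m1 <- gnls(Lens ~ th1 * exp(-th2 / (th3 + Age)),
           data    = rabbit,
           start   = c(th1 = 250, th2 = 51, th3 = 21),  # rough values from the notes
           weights = varFixed(~ Age))                   # var(e_i) = sigma^2 * Age_i
m2 <- update(m1, weights = varPower(form = ~ Age))        # var = sigma^2 * Age^(2*delta)
m3 <- update(m1, weights = varPower(form = ~ fitted(.)))  # var = sigma^2 * |mu_i|^(2*delta)
```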

Typically, either an additive or a multiplicative error structure is appropriate for most nonlinear models. However, these are not the only possibilities. gnls() provides several different choices for g2(·), so that the variance assumption on an additive error term can be chosen to be appropriate for whatever scale a homoscedastic error enters the model.

• One example of a model in which a random component enters themodel in neither a multiplicative nor an additive way, is an errors-in-variables regression model. In such a model we assume that one ormore of the explanatory variables is observed subject to measurement(or some other kind of) error.


• E.g., in the rabbit eye example, the true age of the rabbits might have been estimated subject to error. An appropriate model for this situation is

log(Lensi) = θ1 − θ2 / (θ3 + (xi + ũi)) + ei,

where xi + ũi plays the role of the unobserved “true x”, var(ũi) = σ2_ũ, and var(ei) = σ2_e.

• Specialized techniques for errors-in-variables models are necessary in both the linear and nonlinear regression contexts. See Seber & Wild, Ch. 10, and the book-length treatments by Fuller and by Carroll et al.

2. Derivatives:

The G-N algorithm utilizes the derivative matrix V(θ) = ∂f(x, θ)/∂θ^T.

Q: How do the functions nls() and gnls() know what these derivatives are?

A: By default, these routines use numerical derivatives rather than analytic derivatives. However, numerical derivatives can be inaccurate and slow (especially when second derivatives are taken, as in, e.g., the Newton-Raphson algorithm).

Forward Difference Approximations:

The forward difference method of computing numerical derivatives costs less computer time than central differences, but is less accurate. For many applications the accuracy of forward differences is sufficient, though.

Recall from calculus the definition of the derivative at some point x of a simple function g(x) of a scalar argument x:

lim_{h→0} [g(x + h) − g(x)] / h.

More generally, the partial derivative of a function g(x, y) of two scalar variables x and y with respect to x is

∂g/∂x = lim_{h→0} [g(x + h, y) − g(x, y)] / h.


• The forward difference approximation just replaces the limit as h→ 0with a very small value of h.

Suppose we have a function g(θ) of a p-dimensional parameter θ and (pos-sibly) other variables or parameters (e.g., for g the expectation function of anonlinear model, it also depends on xi, the vector of explanatory variablesfor the ith subject).

Let ji be a p × 1 vector with a 1 in the ith position and 0s elsewhere. Then the forward difference approximation is

∂g/∂θi ≈ [g(θ + hi ji) − g(θ)] / hi,

where hi is a very small number (as small as possible subject to the limits of the computational precision of the computer).

• Note that θ + hiji just adds hi to the ith element of θ.

• The value of hi can be taken to be the same for all elements of θ,but because some elements of θ may have very different scales thanothers, hi is usually scaled up or down depending on the magnitudeof θi.

• Specifically, I recommend hi =√ϵ(1 + |θi|) where ϵ is the machine

precision of the computer on which the calculations are being done.

• Machine precision is a constant specific to any given computer hard-ware that quantifies the limit of that machine’s ability to distinguishsmall (in magnitude) double-precision floating point numbers.

• For the PC in my office, ϵ = 2.2204e− 016. Most statistical softwarecan look up ϵ. E.g., in SAS, use the function: constant(’maceps’);in S-PLUS and R ϵ is held in the constant .Machine$double.eps; inMatlab ϵ is held in the constant eps.


E.g., suppose we have a model with expectation function f(x, θ) = θ1/(θ2 + x) and we want ∂f(x, θ)/∂θ2 evaluated at θ = (1, 1)^T and x = 2. Then the forward difference approximation is

∂f(x, θ)/∂θ2 ≈ (1/h2) [ θ1/(θ2 + h2 + x) − θ1/(θ2 + x) ],   evaluated at θ = (1, 1)^T, x = 2,

where h2 = √ϵ (1 + |θ2|) = √(2.2204e−016) × (1 + 1) = 2.9802e−008, so

∂f(x, θ)/∂θ2 ≈ (1/2.9802e−008) [ 1/(1 + 2.9802e−008 + 2) − 1/(1 + 2) ] = −0.1111111100763083.

• The exact derivative is −1/9 = −.1111 . . .

Central Difference Approximations:

Some improvement in the accuracy of numerical derivatives can be obtained by using central differences rather than forward differences.

In the general set-up described above, the central difference approximation is

∂g/∂θi ≈ [g(θ + hi ji) − g(θ − hi ji)] / (2 hi).

• For central differences, a larger hi value is recommended: hi = ϵ^{1/3} (1 + |θi|).

In the example, the central difference approximation is

∂f(x, θ)/∂θ2 ≈ (1/(2 h2)) [ θ1/(θ2 + h2 + x) − θ1/(θ2 − h2 + x) ],   evaluated at θ = (1, 1)^T, x = 2,
             = −0.1111111111129642.

• Note the improved accuracy of the central difference approximation. See NumDerivExamp.R for an implementation of this example.
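A sketch along the lines of the NumDerivExamp.R script referenced above (written here from the formulas in the notes, not copied from that script):

```r
## Forward and central difference approximations to d f / d theta2 at
## theta = (1, 1), x = 2, where f(x, theta) = theta1 / (theta2 + x).
f   <- function(th, x) th[1] / (th[2] + x)
th  <- c(1, 1);  x <- 2
eps <- .Machine$double.eps

h.f <- sqrt(eps) * (1 + abs(th[2]))                        # forward-difference step
(f(th + c(0, h.f), x) - f(th, x)) / h.f                    # ~ -0.11111111

h.c <- eps^(1/3) * (1 + abs(th[2]))                        # central-difference step
(f(th + c(0, h.c), x) - f(th - c(0, h.c), x)) / (2 * h.c)  # closer to -1/9

-1 / (th[2] + x)^2                                         # exact value: -1/9
```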


In nls() and gnls() numerical derivatives are used by default. However,it is best to supply these functions with analytic derivatives instead. Aconvenient way to do this is with the deriv() function.

• The deriv() function uses symbolic mathematics (as in Maple or Mathematica) to produce the analytic derivatives. This saves the user from the (very) error-prone task of taking the derivatives him/herself.

For example, consider the model fit to log(Lens) in model m2rabbit.nls.See handout Rabbit2.

• The Rabbit2 handout contains an R script rabbit2.R and its out-put. It also contains a SAS program, rabbit1.sas, and its output,rabbit1.lst. In rabbit2.R, we first demonstrate that a function canbe coded that represents the expectation function of the model. Icall this function m2rabbit.

• We can refit model m2rabbit.nls by using the m2rabbit() function.This simplifies things slightly. (Later when we consider adding co-variates to nonlinear models, we’ll see that this step is very handy.)


• An advantage to coding the function f(x, θ) = θ1 − θ2/(θ3 + x) as m2rabbit(x, th1, th2, th3) is that we can add a derivative (gradient) attribute to m2rabbit() so that it returns not only the function's value, but also the values of the partial derivatives with respect to each parameter at each evaluation.

• This is done using the deriv() function. We can see what's going on by printing the redefined function m2rabbit and this function evaluated at some given values of Age and θ. See the R documentation on deriv() for more information. (A minimal sketch appears after this list of remarks.)

• Now if we refit the model using the m2rabbit() function, analytic derivatives will be used, because they are available as an output of this function.

• In rabbit1.sas the same model is fit using SAS’ PROC NLIN. As ofversion 8, PROC NLIN uses analytic derivatives by default. It usedto be the case that derivatives had to be coded into PROC NLINusing der.parm statements.

• The automatic calculation of analytic derivatives with symbolic mathis a very helpful recent development. Mistakes in the calculation ofanalytic derivatives by hand are very common, and the elimination ofthis step by the software makes the fitting of nonlinear models muchmore reliable. When using software that cannot provide analyticderivatives, it is recommended that a symbolic math program suchas Maple be used to compute derivatives when possible.

• As you can see, PROC NLIN and the nls() function produce identical results.
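A minimal sketch of the deriv() mechanism described above (written to mirror the idea of the Rabbit2 handout, not copied from it; the evaluation values are the rough starting values derived later in these notes):

```r
## Build f(x, theta) = th1 - th2/(th3 + Age) together with its gradient.
m2rabbit <- deriv(~ th1 - th2 / (th3 + Age),
                  namevec      = c("th1", "th2", "th3"),
                  function.arg = function(Age, th1, th2, th3) {})

## Evaluating it returns the fitted values with a "gradient" attribute,
## which nls()/gnls() will use in place of numerical derivatives.
m2rabbit(Age = c(15, 50, 200), th1 = 5.51, th2 = 51.4, th3 = 21.3)
```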


3. Starting Values:

For the G-N and other commonly used algorithms for fitting nonlinearmodels, starting values of the parameters are needed. Bad starting valuescan lead to slow or no convergence, or possibly convergence to the wrongsolution.

There are several methods for selecting starting values and often somecombination of these methods is necessary for a given problem:

1. Analysis of the Expectation Function.

Starting values can often be obtained by considering the behavior off(x,θ) as the x-values approach some limiting value (e.g., 0, ±∞).This information is often used with the observed y-values.

E.g., if one of the parameters has an interpretation as the asymptotein a growth model, then we can take the maximum y-value as aninitial guess of that parameter.

E.g., consider the simple logistic model

f(x, θ) = θ1 / [1 + exp{(θ2 − x)/θ3}].

As x → ∞, the denominator → 1, which implies that θ1 is the limiting value for large x. If this is an upper asymptote, as in a growth model, then take θ01 = ymax, where ymax is the largest response.

In general, it may be possible to solve

yi = f(xi,θ)

for one or more elements of θ for certain well chosen pairs (yi,xi).


Another example: in the asymptotic regression model,

f(x,θ) = θ1 + θ2e−θ3x,

letting x → ∞ we obtain the asymptote θ1, and letting x = 0 givesthe initial value θ1 + θ2. Therefore, we might take θ01 = ymax and

θ02 = y0 − ymax where y0 is the (observed or “expected”) y-value atx = 0.

2. Analysis of Derivatives of the Expectation Function.

Sometimes the derivatives of f(x,θ) with respect to x are simplerthan f itself and can be solved for one or more components of θ.

Suppose x = x consists of just one variable and let x(1), . . . , x(n) rep-resent the ordered (e.g., smallest to largest) x-values and y(1), . . . , y(n)

the corresponding y-values.

Then if we can solve

[y(i) − y(i−1)] / [x(i) − x(i−1)]   (the “empirical derivative”)   = ∂f(x, θ)/∂x |_{x = x(i)}

for θj, then take this solution as the initial value for θj.

have∂f(x,θ)

∂x

∣∣∣∣x=0

=θ1θ2

(θ2 + x)2

∣∣∣∣x=0

=θ1θ2

In addition, θ1 has an interpretation as an upper asymptote sincelimx→∞ f(x,θ) = θ1. Therefore, we take θ01 = ymax and solve

y(1) − y(0)

x(1) − x(0)=

ymax

θ02

for θ02 to get an an initial value for θ2.


3. Transformation of the Expectation Function.

If the expectation function is transformably linear, then we can fit a linear model to the transformed model (we don't need to worry about the error term to get roughly accurate parameter estimates) to get initial values.

E.g., if the expectation function for response R is f(x, θ) = θ1 x1^{θ2} x2^{θ3}, then the log transformation yields

log{f(x, θ)} = log(θ1) + θ2 log(x1) + θ3 log(x2).

Therefore, by fitting the linear regression model

log(Ri) = β1 + β2 log(x1i) + β3 log(x2i),   i = 1, . . . , n,

we can obtain starting values:

θ01 = e^{β̂1},   θ02 = β̂2,   θ03 = β̂3.
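A quick sketch of this in R, assuming a data frame `dat` with columns R, x1, and x2 (hypothetical names):

```r
## Starting values for E(R) = th1 * x1^th2 * x2^th3 from an OLS fit on the log scale.
lin   <- lm(log(R) ~ log(x1) + log(x2), data = dat)
b     <- unname(coef(lin))
start <- list(th1 = exp(b[1]), th2 = b[2], th3 = b[3])
```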

4. Conditional Linearity.

In many model functions, one or more of the parameters are condi-tionally linear. In that case, once starting values for the nonlinearparameters have been obtained, then linear regression conditionalon the nonlinear parameters’ starting values can be used to obtainstarting values for the linear parameters.

For example: consider the asymptotic regression model y = θ1 + θ2 e^{−θ3 x}. If a starting value θ03 is available, then a simple linear regression of y on z = e^{−θ03 x} will yield starting values for θ1 and θ2.


5. Reducing Dimensions.

Usually, more than one of the above approaches are necessary andare combined in succession. After an initial value of each successiveparameter has been obtained, the problem of obtaining starting val-ues for all parameters is simplified because it has been reduced indimension.

E.g., in the last example, once θ03 has been obtained, it becomes easier to obtain θ01 and θ02 because the problem is then conditionally linear.

Another example is provided by the method of exponential peeling. This method is useful when the expectation function is a sum of exponentials.

Consider the biexponential model

f(x, θ) = θ1 e^(−θ2 x) + θ3 e^(−θ4 x)

with θ1, θ2, θ3, θ4 assumed positive.

– Notice that in such a model the two pairs of parameters (θ1, θ2)and (θ3, θ4) are exchangeable, meaning that the values of thepairs can be exchanged without changing the value of f .

– This means that the parameters of the model are not identifiable. To get around this we create an identifiable parameterization by requiring θ2 > θ4.


The following plot gives an example of a biexponential where θ = (3.5, 4, 1.5, 1)T:

[Figure: “A Biexponential Expectation Function” — plot of f(x, θ) versus x for 0 ≤ x ≤ 5, with f = f1 + f2, f1 = 3.5*exp(-4*x), f2 = 1.5*exp(-x).]

Because θ2 > θ4, the function behaves like a simple exponential θ3 e^(−θ4 x) for large x, and like θ1 e^(−θ2 x) + θ3 for small x.

This suggests splitting the data into a portion corresponding to the largest x values and the remaining portion corresponding to the smallest x values. We then obtain starting values by fitting the components as follows (a short R sketch follows the two steps):

i. First fit the simple linear regression

log(yi) = α0 + α1xi,

for the portion of the data with the largest values of x and setθ03 = eα0 and θ04 = −α1.


ii. Then calculate the residuals, ri = yi − θ03 exp(−θ04 xi), for the portion of the data with the smallest x values. Fit the simple linear regression model

log(|ri|) = β0 + β1xi

and set θ01 = eβ0 and θ02 = −β1.
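A rough R sketch of exponential peeling (the data frame dat and the rule used to split the data are assumptions; in practice the split is usually chosen by inspecting a plot):

    ## Step i: fit log(y) on x for the large-x portion.
    large <- dat$x > median(dat$x)                       # assumed split rule
    f1    <- lm(log(y) ~ x, data = dat, subset = large)
    th3.0 <- unname(exp(coef(f1)[1]));  th4.0 <- unname(-coef(f1)[2])

    ## Step ii: peel off the fitted slow component, then fit the residuals
    ## for the small-x portion.
    small <- !large
    r     <- dat$y[small] - th3.0 * exp(-th4.0 * dat$x[small])
    f2    <- lm(log(abs(r)) ~ dat$x[small])
    th1.0 <- unname(exp(coef(f2)[1]));  th2.0 <- unname(-coef(f2)[2])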

Example 1 — Rabbit Age and Eye Weight:

Again consider the model

yi = θ1 − θ2/(θ3 + xi) + ei

wherey = log(eye lens weight)

x = age in days

From a plot of y versus x we saw that y reaches an asymptote as x getslarge.

As x → ∞, f(x,θ) → θ1 so we can take

θ01 = ymax = log(246.70) = 5.51

As x → 0, f(x, θ) → θ1 − θ2/θ3. The minimum x value in the data was x = 15, for which there were three observations with y values log(21.66), log(22.75), and log(22.30). We take their average, 3.10, to represent y when x is small.

Now solve

θ01 − θ2/θ3 = 3.10

to get θ2/θ3 = 5.51 − 3.10 = 2.41


Since f is approximately linear near x = 0 we can examine

∂f(x, θ)/∂x = θ2/(θ3 + x)² → θ2/θ3²   as x → 0.

The second smallest x value was x = 18 with a corresponding y value ofy = log(31.25). Therefore, we can solve

[log(31.25) − 3.10] / (18 − 15)   (the empirical derivative)   = θ2/θ3²   ⇒   θ2/θ3² = .113

We now have the two equations in two unknowns:

θ2/θ3 = 2.41   and   θ2/θ3² = .113,

which can be solved to get θ03 = 21.3 and θ02 = 51.4 (dividing the first equation by the second gives θ3 = 2.41/.113 ≈ 21.3, and then θ2 = 2.41 θ3 ≈ 51.4).

• So, our starting values are given by θ0 = (5.51, 51.4, 21.3)T.


Example — Pressure vs. Temperature in Saturated Steam:

Data:

Temperature   Pressure
0             4.14
10            8.52
20            16.31
30            32.18
40            64.62
50            98.76
60            151.13
70            224.74
80            341.35
85            423.36
90            522.78
95            674.32
100           782.04
105           920.01

Let y = pressure and x = temperature. Then the model we consider is

yi = θ1 exp{θ2xi/(θ3 + xi)}+ ei, i = 1, . . . , n = 14

where e1, . . . , en are i.i.d. N(0, σ2).

Since f(0,θ) = θ1 we take θ01 = 4.14.


Now note that

yi/θ1 ≈ exp{θ2 xi/(θ3 + xi)}

⇒ log(yi/θ1) ≈ θ2 xi/(θ3 + xi) = θ2/(1 + θ3/xi)

⇒ 1/log(yi/θ1) ≈ 1/θ2 + (θ3/θ2)(1/xi)

This suggests that we obtain starting values for θ2, θ3 by performing a simple linear regression of 1/log(yi/θ1) on 1/xi for cases 2–14 (we need to exclude x = 0).

• See handout Pressure1.

This simple linear regression yields

1/θ02 = .0553 ⇒ θ02 = 1/.0553 = 18.07,

and θ03/θ02 = 13.28 ⇒ θ03 = 13.28 θ02 = 240.03,

so that θ0 = (4.14, 18.1, 240.1)T .
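A short R sketch of this calculation (the data frame steam, with columns temp and pres, is an assumption; the handout Pressure1 may do this differently):

    ## theta1^0 = pressure at temp = 0; then regress 1/log(pres/theta1^0)
    ## on 1/temp for the cases with temp > 0.
    th1.0 <- steam$pres[steam$temp == 0]            # = 4.14
    f0    <- lm(I(1/log(pres/th1.0)) ~ I(1/temp),
                data = steam, subset = temp > 0)
    th2.0 <- unname(1/coef(f0)[1])                  # intercept = 1/theta2
    th3.0 <- unname(coef(f0)[2] * th2.0)            # slope     = theta3/theta2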

• A very convenient feature in the nlme library is the ability to defineself-starting models.

• We’ve already seen that it can be convenient to code the expectationfunction as an R function, and use the deriv() function so that thisfunction returns not only the value of the expectation function, butalso the values of its partial derivatives. E.g., m2rabbit() returned agradient component as well as the function value.


• If we can code the process used to obtain starting values, we can extend this idea so that the function returns its value, its gradient, and appropriate starting values all automatically.

• In pressure1.R we create a self-starting function SSPresMod(). The first step is to code the function PresMod to return the expectation function of the model and its derivatives. This is done with the deriv() function.

• Next we code a function PresModInit() to automate the process we just went through to obtain starting values. Then SSPresMod is created by combining PresMod() with the initial value function PresModInit() using the selfStart() function. See the R online help for selfStart() and also see §8.1.2 of Pinheiro and Bates (2000) for more information on creating self-starting functions.

• Once we have created SSPresMod() we can check that it works by asking it to produce starting values for our model and the observed data. This is done with the getInitial() function, and you can see that SSPresMod() does return the values we obtained less automatically.

• We can now fit our nonlinear regression model using SSPresMod() in the nls() function. We no longer have to specify starting values because they are computed automatically. In addition, any new, similarly-behaving data set can be fit with this model without computing a new set of starting values “by hand”.

Several self-starting functions for commonly used nonlinear regression models (e.g., biexponential, Michaelis-Menten, many more) are included in S-PLUS through the nlme3 library. Appendix C in Pinheiro and Bates (2000, on reserve) contains descriptions of these self-starting functions, including details concerning how starting values are obtained and interpretations of the expectation functions of the models. This appendix is a very useful reference.
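For example, the built-in three-parameter logistic model can be used as follows (the data frame and variable names are assumptions); the same pattern applies to any self-starting model, including one we write ourselves such as SSPresMod():

    library(nlme)    # supplies additional self-starting models; SSlogis() itself is in stats
    ## Display the automatically computed starting values ...
    getInitial(y ~ SSlogis(x, Asym, xmid, scal), data = dat)
    ## ... and fit without supplying start= at all.
    fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = dat)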


Example — Pasture Yield vs. Growing Time

Ratkowsky (1983) reports a small data set on pasture regrowth over timefollowing cutting. The data are as follows:

Growing Time   Pasture Yield
9              8.93
14             10.80
21             18.59
28             22.33
42             39.35
57             56.11
63             61.73
70             64.62
79             67.08

We consider fitting a Gompertz Model to these data of the form

yi = θ1 − eθ2−θ3xi + ei, i = 1, . . . , n = 9,

where y = log(pasture yield), x = time, and e1, . . . , e9 are i.i.d. with mean zero, each with variance σ2.

Starting values:

First note that increasing yields over time imply that

0 < ∂f(x, θ)/∂x = θ3 e^(θ2 − θ3 x)   ⇒   θ3 > 0.

Therefore, as x → ∞, f(x,θ) → θ1, so θ1 is the upper asymptote. So, wetake

θ01 = log(67.08) = 4.21

Now to get other starting values, we could transform to linearity. However,for illustrative purposes, we’ll do something else.


As x → 0, f(x, θ) → θ1 − e^θ2. So, plugging in θ01 for θ1 and min(y) = log(8.93) = 2.19 for f(0, θ), we can solve the following equation for θ2:

4.21− eθ2 = 2.19 ⇒ θ2 = log(4.21− 2.19) = .70

and take θ02 = .70.

Note that

ȳ = θ1 − (1/n) ∑_{i=1}^n e^(θ2 − θ3 xi) + ē ≈ θ1 − e^(θ2 − θ3 x̄),

the approximation holding roughly because the plot of y versus x is ap-proximately linear.

Substituting the initial values for θ1 and θ2, along with ȳ = 3.42 and x̄ = 42.56, we solve

3.42 = 4.21 − e^(.70 − θ3(42.56))   ⇒   .70 − θ3(42.56) = log(4.21 − 3.42)

to get θ03 = .022.

• We use these starting values, θ0 = (4.21, .70, .022)T, to fit the Gompertz model to these data in handout Pasture1; a sketch of such a call in nls() appears below.
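A minimal sketch of the fit, assuming a data frame pasture with columns time and yield (the handout's actual code may differ):

    fit <- nls(log(yield) ~ th1 - exp(th2 - th3 * time),
               data  = pasture,
               start = list(th1 = 4.21, th2 = 0.70, th3 = 0.022))
    summary(fit)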

4. Parameter Transformations:

In nonlinear regression models, choosing a “good parameterization” is amuch more significant and difficult problem than in linear regression.

Parameter transformations can be effective means to reduce the parameter-effects nonlinearity of a model. This improves the linear approximation, which increases the accuracy of approximate inference methods and speeds convergence.


We will come back to the issue of decreasing parameter-effects nonlinearitylater. For now, we focus on parameter transformations

a. to enforce constraints on parameters; and

b. to speed convergence.

Constrained Parameters:

In most nonlinear models, parameters are restricted to regions that makesense scientifically.

For example, in growth and yield models, asymptote parameters typicallymust be positive; in exponential models, parameters in the exponent (oftenwith minus signs in front) must be positive.

Often, it is possible to ignore parameter constraints in the fitting process. One simply checks the converged parameter estimates to ensure that they are in their feasible regions. If so, and the estimated model fits the data well, then the NLSE/MLE has been reached and there is no problem.

• It can happen, though, that during the iterations to fit the model,parameter estimates can go “out of range”. This may slow or stopthe algorithm, or lead to convergence to incorrect solutions.

• A general solution to this is to use constrained optimization (maxi-mization or minimization) rather than an unconstrained method likeG-N.

– E.g., for equality constraints, can use method of Lagrange mul-tipliers. For inequality constraints (more common), fancier,more difficult methods are necessary.


An easier approach is simply to reparameterize so that, under the reparameterization, the constraint is enforced.

• The most common use of this approach is to enforce positivity. If θ must be positive, reparameterize to ϕ = log(θ). This ensures that throughout the iterations, θ = eϕ remains positive. (A short sketch of this device in nls() follows this list.)

• An interval constraint of the form

a ≤ θ ≤ b

can be enforced by a logistic transformation of the form

θ = a + (b − a)/(1 + e^(−ϕ)).

• Although less commonly used, one can even use a transformation(see reference given on p.78 of Bates & Watts) to enforce an orderconstraint of the form

a ≤ θj ≤ θj+1 ≤ · · · ≤ θk ≤ b
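A brief R sketch of the positivity reparameterization (the model, data frame, and starting values are assumptions for illustration):

    ## Exponential model with both parameters forced positive by working on
    ## the log scale: theta1 = exp(phi1) > 0, theta2 = exp(phi2) > 0.
    fit <- nls(y ~ exp(phi1) * exp(-exp(phi2) * x),
               data = dat, start = list(phi1 = 0, phi2 = 0))
    exp(coef(fit))     # back-transform to theta1-hat, theta2-hat

    ## An interval constraint a <= theta <= b can be coded the same way,
    ## using theta = a + (b - a)/(1 + exp(-phi)) inside the model formula.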

Besides scientific meaningfulness, a parameter constraint can arise to forceidentifiability of a model.

For a nonlinear regression model

yi = f(xi;θ) + ei,

where the ei’s are i.i.d. with E(ei) = 0, var(ei) = σ2, the model is said to be identifiable if

f(xi; θ) = f(xi; θ̃) implies θ = θ̃.


Recall this was not the case in the unconstrained biexponential model because for θ = (1, 2, 3, 4)T and θ̃ = (3, 4, 1, 2)T,

f(x, θ) = f(x, θ̃) = 1·e^(−2x) + 3·e^(−4x).

• This problem can be fixed by requiring

0 ≤ θ2 ≤ θ4

which can be done with the transformation

θ2 = eϕ2

θ4 = eϕ2(1 + eϕ4)

Another example of a nonidentifiable model:

f(x, θ) = exp(−θ2 θ3 x1) + (θ1/θ2){1 − exp(−θ2 θ3 x1)} x2.

Here, θ3 only occurs in the model through the product θ2θ3. Therefore, θ̃ = (cθ1, cθ2, θ3/c)T gives the same value of f(x, θ) for any nonzero constant c.

A solution is to reparameterize so that the model has only two (nonredun-dant) parameters. Under the reparameterization, take

ϕ1 = θ2θ3, ϕ2 = θ1/θ2

to fix the problem and make the model identifiable.


Facilitating Convergence:

Two simple methods to facilitate convergence are centering and scaling.

By centering, we mean replacing the explanatory variables xi1, . . . , xim by their deviations from their respective means, x̄1, . . . , x̄m.

For example, Bates and Watts suggest reparameterizing the exponentialmodel

f(x,θ) = θ1e−θ2x (∗)

by replacing x by x − x̄. That is, (*) can be written as

f(x, θ) = θ1 e^(−θ2(x − x̄ + x̄)) = θ1 e^(−θ2 x̄) e^(−θ2(x − x̄)) = ϕ1 e^(−ϕ2(x − x̄))

where we have reparameterized using

ϕ1 = θ1 e^(−θ2 x̄),   ϕ2 = θ2.

A centering reparameterization such as this can make the columns of thederivative matrix V less collinear (linearly dependent) than they would beotherwise and thus can stabilize the GN algorithm and facilitate conver-gence.


It’s also often helpful to scale the parameters and the data to ensure goodconditioning of the derivative matrix. For example, if the model was

f(x,θ) = θ1e−θ2x

with the response in the range 0 < y < 100 and the regressor x in therange 0 < x < 1000, and θ1 ≈ 100 while θ2 ≈ 0.001, then it would beprudent to write

f(x,θ) = 100ϕ1e−ϕ2(x/1000).

This way both ϕ1 and ϕ2 are approximately 1 and the derivatives are of similar magnitude.
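A sketch of both devices in nls() (the data frame and the starting values are assumptions, intended only to show the reparameterized formulas):

    ## Centering: phi1 = theta1 * exp(-theta2 * xbar), phi2 = theta2.
    dat$xc <- dat$x - mean(dat$x)
    fit.c  <- nls(y ~ phi1 * exp(-phi2 * xc), data = dat,
                  start = list(phi1 = 50, phi2 = 0.01))

    ## Scaling: write the model so both parameters are roughly 1.
    fit.s  <- nls(y ~ 100 * phi1 * exp(-phi2 * (x/1000)), data = dat,
                  start = list(phi1 = 1, phi2 = 1))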

• In linear and nonlinear regression, centering and scaling are helpful to reduce the multicollinearity (near linear dependence) among the columns of the derivative matrix (X in the linear case, V in the nonlinear case). This multicollinearity or ill-conditioning of the regression problem can be thought of as “approximate non-identifiability.”

Ill-Conditioning:

Consider the linear model

y = Xβ + e, E(e) = 0, var(e) = σ2I.

• This model is identifiable (i.e., β is estimable) if and only if X is of full rank. X is full rank just means that the columns of X are linearly independent, or, mathematically, that Xa = 0 if and only if a = 0.


If X is not of full rank (and so has linearly dependent columns), there exists a vector c ≠ 0 so that Xc = 0 and hence

∥Xc∥2 = (Xc)T (Xc) = cT (XTX)c = 0

If X has nearly linearly dependent columns, then there exists a vector c ≠ 0 so that

∥Xc∥2 = cT (XTX)c ≈ 0

I.e., for any β there exists a β̃ ≠ β so that β − β̃ ≠ 0 and

∥X(β − β̃)∥2 = (β − β̃)T (XTX)(β − β̃) ≈ 0

⇒ ∥Xβ − Xβ̃∥2 ≈ 0

⇒ Xβ ≈ Xβ̃

• The model is identifiable if Xβ = Xβ̃ implies β = β̃. Under multicollinearity, we have just seen that we obtain approximately equal expectation functions for distinct values β and β̃. Hence multicollinearity can be thought of as approximate nonidentifiability.

• Two measures of multicollinearity are the determinant of XTX and the minimum eigenvalue of XTX. Both of these quantities will be 0 under linear dependence and will be close to 0 under approximate linear dependence (or multicollinearity). (A short R sketch of these diagnostics follows this list.)

• When this happens XTX will be nearly singular (uninvertible) and we say XTX is ill-conditioned.
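A short R sketch of these diagnostics for a generic design or derivative matrix (here called V; in the linear case substitute X):

    XtX <- crossprod(V)                                   # V^T V
    det(XtX)                                              # near 0  => ill-conditioned
    min(eigen(XtX, symmetric = TRUE, only.values = TRUE)$values)
    kappa(V)                                              # condition number: very large
                                                          # values also indicate trouble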


Consequences:

a. Numerical Instability:

It’s possible to obtain two substantially different estimators β̂1 and β̂2 that give nearly equal expectation functions: Xβ̂1 ≈ Xβ̂2. This means it’s hard to find the “right” β̂.

This is minimized by using a good algorithm to obtain the LSE (e.g., use the QR decomposition).

b. Large variances for some parameters (or for some linear combinationsof the parameters).

This can be seen from the fact that var(β̂) = σ2(XTX)−1 and (XTX)−1 is equal to 1/det(XTX) times some matrix. If det(XTX) ≈ 0, then the elements of this var-cov matrix will blow up.

• In linear regression, the cure for multicollinearity is to eliminate oneor more of the explanatory variables.

In nonlinear regression, the matrix V(θ) plays the role of X. If the columns of V(θ) for θ near θ̂ are approximately linearly dependent, then we may have

a. Numerical instability;

b. Large variances; and

c. Slow or non-convergence.

• In the nonlinear case, this ill-conditioning is difficult to correct be-cause it could be due to

i. the data (as in the linear case); and/or

ii. the model.


Example — Corn Yield

The following table contains data on R, the mean dry kernel weight (4plants) of corn, at various levels of x, the time since silking.

Time Since Silking   Mean Kernel Weight
17.125               14.26
25.625               50.51
29.625               60.83
39.625               104.78
46.375               94.46
54.250               97.02
62.125               172.41

For these data we consider the model

yi = θ1 − θ4 log(1 + eθ2−θ3xi) + ei, i = 1, . . . , n = 7

where yi = log(Ri) and we make the usual error assumptions.

• This model is a four-parameter form of the Richards Model withmultiplicative error.

• See the handout corn1. Here, we attempt to fit the above model.Methods for obtaining starting values for Richards models are dis-cussed in Ratkowsky (1983, §8.3.3) and Seber & Wild (1989, §7.3.6),but we ignore that issue in this example.

• In corn1, we use the gnls() function rather than the nls() function, simply because it allows greater control over the fitting algorithm and because the returnObject option allows the function to return the “fitted model” at the end of the fitting algorithm, even if the algorithm has not converged. The model we are fitting is an OLS-type model, though, with homoscedastic, independent errors as usual.


• Notice that the algorithm did not converge. We could reduce the convergence criterion in this model further, and we’d see that the parameter estimates would continue to jump around without settling down.

• Notice that at the point at which the algorithm stopped, the correlations between the parameter estimates were very high in magnitude. In particular, the estimated corr(θ̂2, θ̂4) = −1.000 (to three decimal places). Large-magnitude correlations like this are indicators of ill-conditioning and often suggest that the model is overparameterized.

– As a rule of thumb, correlations > .99 in absolute value arestrong indicators of overparameterization.

• In corn1 we also compute the determinant and eigenvalues of {V(θ)}TV(θ), taking θ to be the parameter value at which the G-N algorithm stopped. The determinant is 4.65e-6 and the minimum eigenvalue is 2.61e-9, both tiny, indicating ill-conditioning.

• The solution here is to simplify the model. A four-parameter modelis overkill here. Four parameters are sometimes necessary in growthcurve models like the Richards model above to establish the shapeof the entire curve from beginning to end of the growth process. Inour data, we’re not seeing the convex part of the sigmoidal growthcurve, and we’re certainly not seeing all of the “tail” behavior, so wereally don’t have data appropriate to estimate parameters related tothe point of inflection and the asymptotes. Doing so is essentiallyextrapolating from the data.

• A three-parameter version of this model arises by setting θ4 = 1. The resulting model is a reparameterization of the 3-parameter logistic regression model. We successfully fit that model in corn1 and it appears to fit the data well.


Q: Why do high correlations among the θj ’s indicate ill-conditioning?

A: Consider the linear case. There, it is not hard to show that corr(β̂j, β̂k) is equal to −1 times the partial correlation between xj and xk, the jth and kth explanatory variables, respectively.

• Recall that the partial correlation between xj and xk measures thelinear association between xj and xk after removing the effects of allother explanatory variables.

• Therefore, |corr(β̂j, β̂k)| ≈ 1 means that xj and xk have almost perfect (positive or negative) partial correlation (i.e., they are collinear).

• The situation is somewhat more complex in the nonlinear case, but the result is the same: large-magnitude correlations among the θ̂j’s indicate collinearity among the columns of V(θ̂) and ill-conditioning of the {V(θ̂)}TV(θ̂) matrix.

5. Alternative Fitting Algorithms:

There are several alternatives to the GN algorithm for finding θ̂, the value of θ that minimizes

S(θ) = ∥y − η(θ)∥2.


A. Newton-Raphson Algorithm.

θ̂ satisfies the normal equation

∂S/∂θ (θ̂) = 0,

where ∂S/∂θ (θ̂) ≡ ∂S(θ)/∂θ evaluated at θ = θ̂.

Using a linear Taylor series approximation of ∂S/∂θ (θ̂) we have

0 = ∂S/∂θ (θ̂) ≈ ∂S/∂θ (θ) + ∂²S/∂θ∂θT (θ) (θ̂ − θ)

for θ close to θ̂.

Rearranging,

θ̂ − θ ≈ −{∂²S/∂θ∂θT (θ)}−1 ∂S/∂θ (θ),

where the inverted Hessian factor is p × p and the gradient is p × 1.

This approximation suggests the Newton-Raphson updating formula: given a starting value θ0, we update via

θj = θj−1 − {∂²S/∂θ∂θT (θj−1)}−1 ∂S/∂θ (θj−1),   j = 1, 2, . . . , until convergence,

to obtain θ̂.

• This approach is valid for any minimization/maximization problem.


In nonlinear regression,

S(θ) = {y − η(θ)}T {y − η(θ)} = ∑_{i=1}^n {yi − f(xi, θ)}²

so

∂S/∂θ (θ) = −2 ∑_{i=1}^n {yi − f(xi, θ)} ∂f(xi, θ)/∂θ = −2{V(θ)}T {y − η(θ)},

where V(θ) is the n × p matrix with (i, j)th element ∂f(xi, θ)/∂θj.

In addition, the second derivative matrix has (j, k)th element

∂²S(θ)/∂θj∂θk = ∂/∂θk ( ∂S(θ)/∂θj ) = ∂/∂θk ( −2 ∑_i {yi − f(xi, θ)} ∂f(xi, θ)/∂θj )

= 2 ∑_{i=1}^n [∂f(xi, θ)/∂θk][∂f(xi, θ)/∂θj] − 2 ∑_{i=1}^n {yi − f(xi, θ)} ∂²f(xi, θ)/∂θj∂θk

or, in matrix notation, the entire second derivative matrix is

∂²S(θ)/∂θ∂θT = 2{V(θ)}T {V(θ)} − 2 [∂{V(θ)}T/∂θ] {y − η(θ)} ≡ 2{V(θ)}T {V(θ)} − 2D,

where ∂{V(θ)}T/∂θ is a three-dimensional array of derivatives and D denotes the second term, [∂{V(θ)}T/∂θ]{y − η(θ)}.


So in the context of nonlinear least squares regression, the Newton-Raphsonupdate is given by

θj = θj−1 + ({V(θj−1)}T {V(θj−1)} − D)−1 {V(θj−1)}T {y − η(θj−1)}

Recall that the GN update was

θj = θj−1 + ({V(θj−1)}T {V(θj−1)})−1 {V(θj−1)}T {y − η(θj−1)}

• These algorithms are obviously very similar. As compared with NR,the GN algorithm just replaces

∂²S(θ)/∂θ∂θT = 2{V(θ)}T {V(θ)} − 2D

by its expected value, 2{V(θ)}T {V(θ)}. That is, GN ignores the term D, which has mean 0, in the NR update formula above.

• In a more general context, the GN modification of the NR algorithmis often called Newton-Raphson with Fisher scoring, or sometimessimply, the Fisher scoring algorithm.

• Advantage of GN: only requires first derivatives of the expectationfunction.

• Advantage of NR: sometimes has better convergence properties thanGN.

• Often, the two algorithms will perform about the same.


B. Quasi-Newton Methods.

• The main drawback to the N-R method is having to compute thesecond derivatives. QN methods avoid this by using a numericalapproximation to the second derivative matrix.*

• This approximation starts out crudely, but is updated (improved)after each step of the algorithm to get progressively more accurate.

Let

H(θ) = ∂²S/∂θ∂θT (θ)   and   g(θ) = ∂S/∂θ (θ)

(‘H’ for Hessian, ‘g’ for gradient).

In the NR algorithm we had the updating formula

θj+1 = θj − {H(θj)}−1g(θj).

In the QN algorithm we want to avoid computing second derivatives, sowe replace {H(θj)}−1 by an approximating matrix Bj and use

θj+1 = θj −Bjg(θj). (♡)

• At step 0, B0 is often taken to be something quite simple and crude,such as B0 = I and then improved at each step of the algorithm.

How do we update the approximate inverse Hessian Bj?

* (or, to be more accurate, its inverse, which is really what is needed)


Note that a first-order Taylor approximation of the gradient vector gives

g(θj+1) ≈ g(θj) +H(θj)(θj+1 − θj)

for θj+1 close to θj .

Rearranging, we have

{H(θj)}−1{g(θj+1) − g(θj)} ≈ (θj+1 − θj)   (a quantity that depends on Bj),

or {H(θj)}−1 qj ≈ pj, (♠)

where qj = g(θj+1) − g(θj) and pj = θj+1 − θj.

• Note that (♠) holds exactly if S(θ) (the objective function) is quadraticin θ (e.g., in a linear model).

The Quasi-Newton method uses (♠) to obtain an updated estimate of theinverse Hessian of the form

Bj+1 = Bj +Ej (♢)

• Here, Ej updates Bj to give Bj+1 which is used as an estimate of{H(θj+1)}−1 for the next step of the algorithm .

Turning (♠) into an equality and substituting (♢), we get

(Bj +Ej)qj = pj (1)


Equation (1) now gives a means for choosing Ej , the update matrix to im-prove our estimate of the inverse Hessian. There is no unique Ej satisfying(1), but typically, Ej is taken to be of the form

Ej = auuT + bvvT

where auuT and bvvT are symmetric matrices each of rank one chosen sothat (1) is satisfied.

• QN methods with b = 0 are called rank one update methods; QN methods with b ≠ 0 are known as rank two methods.

• Rank two methods are the most widely used, with the Davidon-Fletcher-Powell (DFP) and, especially, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update methods most common in modern imple-mentations of the QN method.

The BFGS update takes the form:

Ej = (1 + qjT Bj qj / pjT qj) (pj pjT / pjT qj) − (pj qjT Bj + Bj qj pjT) / pjT qj.

• As in the GN method, QN methods usually make use of a step sizeλ. That is, instead of (♡), we update via

θj+1 = θj − λBjg(θj).

where λ is chosen to ensure a decrease in the objective function or, alternatively, a decrease in the length of the gradient vector.

• Typically, a line search is done to obtain a near optimal value of λ,not just one that works. E.g., instead of just picking a λ value thatdecreases the objective function, we attempt to find the λ value thatproduces the greatest decrease in the objective function.


Comparisons with GN, NR:

– Speed of convergence: (slowest) GN≤QN≤NR (fastest).

– Difficulty of use: (easiest) GN=QN≤NR (hardest).

• Convergence of GN, QN approaches that of NR when S(θ) is ap-proximately quadratic in θ.

• QN is appropriate for a problem that is not highly nonlinear but where second derivatives are tedious to calculate.
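As noted in the software summary later in these notes, R's optim() provides a BFGS (quasi-Newton) implementation. A minimal sketch minimizing S(θ) directly for, say, the asymptotic regression model (the data frame and starting values are assumptions):

    ## S(theta) = residual sum of squares for y = th1 + th2*exp(-th3*x) + e.
    S <- function(theta, x, y) {
      mu <- theta[1] + theta[2] * exp(-theta[3] * x)
      sum((y - mu)^2)
    }
    fit <- optim(par = c(th1 = 1, th2 = 1, th3 = 0.1), fn = S,
                 x = dat$x, y = dat$y, method = "BFGS")
    fit$par        # estimates at convergence; fit$convergence == 0 indicates success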


C. Steepest Descent.

Iterative methods for calculating θ have the general form:

θj+1 = θj + δj ,

where the increment δj usually has 2 components:

i. the direction of the step dj ; and

ii. the length of the step, λj ;

where δj = λjdj .

• In GN algorithm, we chose λj so that

S(θj+1) < S(θj).

dj is a descent direction if

∂/∂λ S(θj + λ dj) |λ=0 < 0.

If dj is a descent direction, then for small enough λj , the sum of squareswill decrease. I.e.,

S(θj + λjdj) < S(θj).

Note that

∂/∂λ S(θj + λ dj) |λ=0 = {S′(θj + λ dj)|λ=0}T dj = {∂S(θj)/∂θ}T dj ≡ (♣)

• The method of steepest descent says choose dj to minimize (♣)(make it as negative as possible).


Equivalently, we can minimize

(1/∥dj∥) {∂S(θj)/∂θ}T dj

over all vectors dj .

By the Cauchy–Schwarz Inequality (|xTy| ≤ ∥x∥∥y∥, with equality iff x = cy) we have

| {∂S(θj)/∂θ}T dj | ≤ ∥dj∥ ∥∂S(θj)/∂θ∥,

where the quantity inside the absolute value is negative for a descent direction.

Therefore, we have

{∂S(θj)/∂θ}T dj / ∥dj∥ ≥ −∥∂S(θj)/∂θ∥   (the lower bound),

with equality if and only if dj is a multiple of −∂S(θj)/∂θ.

Thus, the steepest descent iteration becomes

θj+1 = θj − λ ∂S(θj)/∂θ (∗)

where λ is chosen to minimize S(θj+1).


Note that in (*), ∂S(θj)/∂θ is just the gradient of S(θ) evaluated at θj.

Thus, we have the following updates:

θj+1 = θj − λ[E{H(θj)}]−1g(θj) (GN)

θj+1 = θj − λ{H(θj)}−1g(θj) (NR)

θj+1 = θj − λBjg(θj) (QN)

θj+1 = θj − λIg(θj) (SD)

• Advantages to Steepest Descent: Relatively insensitive to the start-ing values. May converge when GN and NR do not.

• Disadvantage: Can be (very) slow relative to other methods.


D. Levenberg-Marquardt Method.

If {V(θ)}T {V(θ)} is close to singular for θ near θ̂, the GN method may become erratic. This problem can be diminished by a good algorithm for obtaining the GN increment, but not completely solved.

One helpful approach is to modify the GN increment to

δj(k) ≡ [{V(θj)}T {V(θj)}+ kB]−1{V(θj)}T {y − η(θj)}

where B is a diagonal matrix and k is a constant called the conditioningfactor.

Two standard choices of B are

i. B = I (Levenberg); and

ii. B a diagonal matrix with diagonal elements taken from the diagonalof {V(θj)}T {V(θj)} (Marquardt).

• When k = 0 this method (either choice of B) reduces to the G-Nmethod.

• For B = I and k → ∞ this method becomes steepest descent.

• Therefore, the LM method is a compromise between G-N and steep-est descent.

To choose k, the following method is often used:

i. Start with small k.

ii. If S(θ0 + δ0(k)) < S(θ0), then decrease k on next iteration.

iii. If S(θ0 + δ0(k)) ≥ S(θ0), then increase k until (ii) occurs.

• For well-behaved problems, this approach will lead to k → 0 and themethod will behave like GN.

• For ill-conditioned problems, k will increase toward∞ and the methodwill become a variant of steepest descent.


• Advantage: LM method is useful when {V(θj)}T {V(θj)} is ill-conditioned.

• Disadvantage: None, really. This is a good default method.

E. Golub-Pereyra Algorithm for Partially Linear Models.

In partially linear models, the conditionally linear parameters canbe estimated by linear least-squares, conditional on the values of thenonlinear parameters.

That is, if we partition θ as θ = (βT ,ϕT )T where β (p1 × 1) con-tains the conditionally linear elements of θ and ϕ ((p− p1)× 1) theremaining elements of θ, then we can write

η(θ) = η(β,ϕ) = A(ϕ)β

where A(ϕ) is an n × p1 matrix depending only on the nonlinearparameters.

For fixed ϕ, we can estimate β as

β̂(ϕ) = [{A(ϕ)}T {A(ϕ)}]−1{A(ϕ)}T y   (which depends on ϕ).

Golub and Pereyra formulated a version of the GN algorithm tominimize the least-squares criterion expressed solely as a function ofϕ:

S2(ϕ) = ∥y −A(ϕ)β(ϕ)∥2.

The nonlinear least squares estimator of θ is then θ̂ = (ϕ̂T, β̂(ϕ̂)T)T, where ϕ̂ is the minimizer of S2(ϕ).

• Advantages: Reduces the number of starting values needed. Reducesthe dimension of the nonlinear estimation.

• Disadvantages: Requires the derivatives ∂A(ϕ)/∂ϕ, which can be complicated. Only applies to partially linear models, so it is not a competitor to the other methods in general.


Example — Hahn Data:

To illustrate the Golub-Pereyra algorithm, we consider data fromthe National Institute of Standards and Technology (NIST) pertaining toa study of the thermal expansion of copper. The data set contains thefollowing variables:

y = coefficient of thermal expansion

x = temperature (Kelvin)

and we consider the model

yi = (θ1 + θ2 xi + θ3 xi² + θ4 xi³) / (1 + θ5 xi + θ6 xi² + θ7 xi³) + ei.

• See handout Hahn1. The data and fitted curve are displayed on thelast page of this handout.

• Starting values for these data were provided by NIST. We use thesestarting values along with the usual G-N algorithm to fit the modelas m1Hahn.nls. The trace of this model fitting procedure revealsthat the G-N algorithm took 10 steps to converge, with objectivefunction S(θ0) = 2, 275, 330 at step 0.

• Alternatively, we can use the Golub-Pereyra algorithm to fit this model by specifying the option algorithm=plinear in the call to nls(). We also must change the model specification so that the right-hand side is the derivative matrix of the linear parameters (see the sketch after these bullets).

• Notice that when using the plinear algorithm in m2Hahn.nls, we need only specify starting values for the nonlinear parameters (3 of the 7). In fact, we can even get away with specifying starting values of 0 for these parameters and still achieve convergence to the correct solution (see m3Hahn.nls).

• In addition, the plinear algorithm converges after only 7 iterations. Furthermore, the objective function is S2(ϕ0) = 52.38 at initialization. This smaller objective function speeds and stabilizes the algorithm.
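A sketch of the plinear call for the Hahn model (the data frame name hahn is an assumption; the handout's actual code may differ):

    ## The RHS is A(phi): an n x 4 matrix whose columns multiply the linear
    ## parameters theta1-theta4; start= covers only the nonlinear phi's.
    fit <- nls(y ~ cbind(1, x, x^2, x^3) / (1 + phi1*x + phi2*x^2 + phi3*x^3),
               data      = hahn,
               start     = list(phi1 = 0, phi2 = 0, phi3 = 0),
               algorithm = "plinear")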


Implementations of Various Fitting Algorithms:

S-PLUS/R: nls()

• GN (default)
• Golub-Pereyra (plinear)

SAS: PROC NLIN:

• GN (METHOD=GAUSS, the default)
• NR (METHOD=NEWTON)
• LM (METHOD=MARQUARDT)
• Steepest descent (METHOD=GRADIENT)

SAS: PROC NLMIXED:

• QN (TECHNIQUE=QUANEW, the default). Can control the line search and update methods separately.

• NR w/ line search (TECHNIQUE=NEWRAP). Also uses “ridging”, which is essentially the technique of the LM method applied to the Hessian.
• NR w/o line search (TECHNIQUE=NRRIDG). Also uses “ridging”.
• Several other optimization methods (trust region, double dogleg, conjugate gradient, Nelder-Mead simplex).

In addition, there are a variety of equation solvers and function optimiz-ers available that are generic; that is, they are not designed specificallyfor nonlinear regression or even statistical estimation problems, but aregeneral tools. Nevertheless, they can be quite useful.

• In R, the function optim() will do general function optimization via either QN, Nelder-Mead, or conjugate-gradient methods. Also available for S-PLUS in the MASS library.

• In S-PLUS, the function nlmin() implements the QN method using numerical derivatives for the gradient, and nlminb() uses trust-region optimization.


• Matlab has an optimization toolbox with several different functionsincluding fsolve (equation solving), fminunc (unconstrained mini-mization), and fmincon (constrained minimization). fminunc doesunconstrained optimization via the QN method.

6. Obtaining Convergence

When using any iterative method, one of the following occurs:

1. Convergence to “reasonable” parameter estimates.

2. Convergence to “unreasonable” parameter estimates or divergenceto extreme values (e.g., values toward ±∞).

3. Failure of the algorithm to converge.

Under outcome 2, divergence to extremes is typically easy to identify asan incorrect solution. Convergence to a local minimum of the objectivefunction can be a bit harder to detect. These problems are typically dueto

a. incorrectly entered data; or

b. incorrectly entered, or otherwise incorrect, starting values.


Suppose we converge to a parameter estimate θ̂. Is it a reasonable value for θ?

To answer this:

1. Use our knowledge of the subject matter/science of the problem.

2. Plot the estimated (fitted) expectation function f(x; θ) along withthe observed data.

– If x = x is one-dimensional, then plot f(x, θ) as a function ofx.

– If x is multi-dimensional, then plot the fitted values y = f(x; θ)vs. each component of x along with the observed data.

Example — Pressure vs. Temperature in Saturated Steam

• See pressure1.sas. Recall we considered a model of the form

presi = θ1 exp{θ2tempi/(θ3 + tempi)}+ ei

• In pressure1.sas we fit that model again using “bad” starting values: θ0 = (.1, .1, .1)T. Although these starting values lead to divergence in nls(), in PROC NLIN we converge to θ̂ = (13.89, 3.88e10, 9.65e11)T. Recall that using “good” starting values, the parameter estimate was θ̂ = (5.27, 19.72, 295.00)T.

• By plotting the fitted curve under each parameter estimate, we seethat the latter estimate (from the “good” starting values) fits thedata best. Often, spurious converged parameter estimates will givemuch worse fits than the correct parameter estimates.

• Here, the best criterion to choose between the parameter estimatesis scientific reasonableness, but fit also plays a role.


• If the algorithm converges to an “unreasonable” estimate, the mostlikely reason is bad starting values.

• It is a good idea to plot f(x, θ0), the curve generated by the startingvalues, to judge the quality of those starting values.

Outcome 3, failure to converge can be a much more stubborn problem.Again, it may be due to any of several problems:

a. Bad, or incorrectly specified starting values.

b. the expectation function and/or derivatives have been incorrectlycoded.

c. {V(θ)}TV(θ) is singular or nearly singular for θ near θ̂.

• If (a) starting values are good, and (b) the model and its derivatives are correct, then it can be shown that (c) {V(θ)}TV(θ) must be singular, or nearly so.

§3.6, p.86 of Bates & Watts provides a useful list of questions to ask/thingsto check when convergence is a problem:

• Is the expectation function correctly specified and coded?
• Are the derivatives correctly specified and coded?
• Are the data entered correctly and are they “reasonable”?
• Are the response and explanatory variables correctly identified?
• Are the starting values correctly matched to the corresponding parameters?

If the answer to all of these questions is yes, then tracing the details of thefitting algorithm can help diagnose the problem (e.g., use the trace=T orverbose=T options in nls() and gnls(), respectively).


Other points to consider:

• Are the data rich enough to fit the model that you have specified?The model may have parameters that describe features of the ex-pectation function that the data don’t reveal (e.g., an asymptoteparameter when no data are collected in the asymptotic region ofthe curve). In this case, simplification of the model may be neces-sary.

• If the estimation algorithm sends parameter values into infeasibleregions, then reparameterization of the model may help.

• For large problems, it is often difficult to fit the final model on thefirst try, but success may be achieved by building up gradually fromsmaller, simpler models to the eventual large final model.

– E.g., if the data set involves data from several subjects, try fit-ting the model to data from one or two subjects first, then addin subjects gradually, refitting several times to progressivelybigger data sets.

– Or if the model involves many parameters, it may be fruitfulto first fit a simplified model with fewer parameters and thenintroduce additional parameters one step at a time.


7. Model Extensions — Heteroscedasticity and Correlation.

As in the linear regression model, it is possible to relax the assumptione ∼ N(0, σ2I) to account for heteroscedasticity and correlation among theerrors/responses.

Consider the nonlinear regression model

y = f(X,θ) + e, e ∼ Nn(0, σ2Λ) (†)

where Λ is a known, positive-definite matrix. Here,

f(X, θ) = (f(x1, θ), . . . , f(xn, θ))T   and   X is the matrix with rows x1T, . . . , xnT.

Because Λ is positive definite (the matrix version of a number being posi-tive), it has an invertible square root Λ1/2 with inverse Λ−1/2 such that

Λ = ΛT/2Λ1/2 and Λ−1 = Λ−1/2Λ−T/2.


Let y∗ = Λ−T/2 y,

f∗(X,θ) = Λ−T/2f(X,θ)

e∗ = Λ−T/2e

Then we can premultiply both sides of our original model (†) by Λ−T/2 to get an equivalent transformed model

y∗ = f∗(X, θ) + e∗,   e∗ ∼ Nn(Λ−T/2 0, σ2 Λ−T/2ΛΛ−1/2) = Nn(0, σ2 I)   (∗)

Since this model is of the homoscedastic, uncorrelated form for whichnonlinear least squares is appropriate, we can fit model (†) by simplyapplying (ordinary) NLS to (*). That is, choose θ to minimize

∥y∗ − f∗(X, θ)∥2 = {y∗ − f∗(X, θ)}T {y∗ − f∗(X, θ)} = {y − f(X, θ)}T Λ−1 {y − f(X, θ)},

which is known as the generalized least-squares criterion.

The major limitation of this approach is that it is unusual that we knowΛ. Typically, we don’t know Λ, although it will often be the case that wewill be willing to make an assumption about the form of Λ.

• That is, we will often be willing to assume that Λ = Λ(λ) is a functionof an unknown parameter λ typically of much smaller dimension thanthe number of non-repeated elements of Λ (n(n+ 1)/2).

• Under such an assumption, the model parameters are now θ,λ, andσ2.

There are a variety of methods that can be used to estimate such anextended nonlinear regression model, but for now, we focus on maximumlikelihood estimation.


ML Estimation in the Extended Nonlinear Regression Model:

Recall that in ML estimation, we choose as our estimator of a parameter the value that maximizes the joint probability density function of the data, viewed as a function of the parameter.

For the model (†), the joint density of y is a multivariate normal density.Thus the loglikelihood function is

ℓ(θ, σ2, λ; y) = −(1/2) { n log(2πσ2) + log det{Λ(λ)} + ∥y∗ − f∗(X, θ)∥2/σ2 }.

For fixed θ and λ, the maximum likelihood estimator of σ2 is

σ̂2(θ, λ) = (1/n) ∥y∗ − f∗(X, θ)∥2.

Plugging this value into the loglikelihood ℓ(θ, σ2, λ; y) we obtain the profile likelihood, which is the loglikelihood as a function of θ and λ only:

pℓ(θ, λ; y) ≡ ℓ(θ, σ̂2(θ, λ), λ; y) = −(1/2) { n[log(2π/n) + 1] + log det{Λ(λ)} + n log ∥y∗ − f∗(X, θ)∥2 }


The MLEs of θ and λ are obtained by maximizing pℓ(θ,λ;y) with respectto these parameters. This is done by alternating between fixing λ andmaximizing w.r.t. θ and fixing θ and maximizing w.r.t. λ. This procedureis iterated to convergence.

That is, at the kth iteration, we fix λ at its current estimate λ̂k−1 and maximize pℓ(θ, λ̂k−1; y) w.r.t. θ. Note that the resulting maximizer, θ̂k, is the minimizer of

∥y∗,k−1 − f∗,k−1(X, θ)∥2 = ∥{Λ(λ̂k−1)}−T/2{y − f(X, θ)}∥2,

where λ̂k−1 is fixed. So this step is just a usual nonlinear least squares problem, which can be solved by the G-N method.

Then we fix θ at θ̂k and obtain λ̂k by maximizing pℓ(θ̂k, λ; y) w.r.t. λ.

We then repeat these steps until convergence.

The MLE of σ2 can be obtained at convergence as σ̂2 = σ̂2(θ̂, λ̂) (by simply plugging the MLEs θ̂, λ̂ into the formula from the previous page). However, to reduce the bias of this estimator, we instead use an MSE-type estimator:

σ̂2 = ∥Λ̂−T/2{y − f(X, θ̂)}∥2 / (n − p),

where Λ̂ = Λ(λ̂).

Inference for θ is based on “classical” asymptotic theory for ML estimation. The asymptotic distribution of θ̂ is

θ̂ ∼ Np(θ, σ2 [{V(θ)}T Λ−1 V(θ)]−1)   (asymptotically),

with var-cov matrix that can be consistently estimated as

âvar(θ̂) = σ̂2 [{V(θ̂)}T Λ̂−1 V(θ̂)]−1

In addition, [(n − p)/σ2] σ̂2 ∼ χ2(n − p) asymptotically, and θ̂ and σ̂2 are asymptotically independent, so the usual approximate F and t tests for inference on θ can be performed.


We can take advantage of this methodology to fit a much broader class of nonlinear models than we have so far considered. To describe this broader class, it’s convenient to decompose var(e) = σ2Λ as follows:

var(e) = σ2Λ = σ2V1/2CV1/2

where

V1/2 = (1/σ) diag( √var(e1), √var(e2), . . . , √var(en) )

and

C = corr(e) =

   1              corr(e1, e2)   corr(e1, e3)   · · ·   corr(e1, en)
   corr(e2, e1)   1              corr(e2, e3)   · · ·   corr(e2, en)
   ...            ...            ...            . . .   ...
   corr(en, e1)   corr(en, e2)   corr(en, e3)   · · ·   1

Heteroscedasticity specification: We assume, for now, that

var(ei) = σ2g2(vi, δ)

where vi is a vector of variance covariates, δ is a vector of variance param-eters to be estimated (part of λ), and g2(·) is a known variance function.

• Later, we will allow g2(·) to depend also on µi = E(yi), the mean response, but that will take us out of the context of ML estimation.


Correlated errors are often appropriate when there is some temporal orspatial dependence among the observations.

For example, suppose we have data on CO2-uptake in plants measuredover a one hour period in which the plant was exposed to different lightintensities. Suppose these measurements were taken during consecutiveone hour periods. E.g., suppose our data are as follows:

Plant   CO2-uptake   Light Intensity   Time (hrs)
1       0            0                 1
1       .33          20                2
1       2.5          80                3
1       .            120               4
1       6.1          150               5
1       6.3          250               6
2       0            0                 1
...

In such a situation, we may expect there to be a serial dependence structureto the errors, where observations close together in time are correlated withthe strength of the correlation decreasing with the time lag.

E.g., we may use an autoregressive structure of order 1 (AR(1)), where corr(ei, ej) = ρ^|ti−tj| and ti is an integer-valued measurement time for response yi.


Correlation Specification: In general, our correlation model will be

corr(ei, ej) = h{d(pi,pj),ρ}

where ρ is a vector of correlation parameters, h(·) is a known correlationfunction, pi,pj are position vectors (often scalars for serial (time) corre-lation) for observations yi,yj , and d(·, ·) is a known distance function.

• The correlation function h(·) is assumed continuous in ρ, returningvalues in [−1,+1]. In addition, h(0,ρ) = 1, so that observations thatare 0 distance apart (identical observations) are perfectly correlated.

• In our example above, pi = ti, d(ti, tj) = |ti − tj | and ρ was a scalarρ.

• Vector valued positions pi arise when modeling spatial data. E.g.,pi could be bivariate containing the longitude and latitude of yi.

A wide variety of models can be specified within this general frameworkand fit using ML estimation as described above.

These models all assume var(e) = σ2Λ(λ) where the variance-covarianceparameter λ consists of two parts: δ, the variance parameter; and ρ, thecorrelation parameter. That is, λ = (δT ,ρT )T .


Variance Functions Available in the nlme Software (e.g., in gnls()):

• Variance functions in the nlme software are described in §5.2.1 inPinheiro and Bates (2000) (see also ?varClasses in the nlme docu-mentation). Here, we give only brief descriptions.

1. varFixed. The varFixed variance function is g2(vi) = vi. That is,

var(ei) = σ2vi,

the error variance is proportional to the value of a covariate. This isthe common weighted least squares form.

2. varIdent. This variance specification corresponds to different variances at each level of some stratification variable s. That is, suppose si takes values in the set {1, 2, . . . , S} corresponding to S different groups (strata) of observations. Then we assume that observations in stratum 1 have variance σ2, observations in stratum 2 have variance σ2δ2, . . ., and observations in stratum S have variance σ2δS.

That is, var(ei) = σ2 δsi, so that g2(si, δ) = δsi,

where, for identifiability, we take δ1 = 1.


3. varPower. This generalizes the varFixed function so that the error variance can be a to-be-estimated power of the magnitude of a variance covariate:

var(ei) = σ2 |vi|^(2δ), so that g2(vi, δ) = |vi|^(2δ).

The power is taken to be 2δ rather than δ so that s.d.(ei) = σ|vi|δ.

A very useful specification is to take the variance covariate to be themean response. That is,

var(ei) = σ2 |µi|^(2δ).

However, this corresponds to g2(µi, δ) = |µi|^(2δ) depending upon the mean. Such a model is fit with a variant of the ML estimation algorithm. However, this technique is not maximum likelihood, and indeed ML estimation is not recommended for such a model. Instead the method is what is known as pseudo-likelihood estimation. (A sketch of this specification in gnls() follows this list.)

4. varConstPower. The idea behind this specification is that varPowercan often be unrealistic when the variance covariate takes valuesclose to 0. The varConstPower model specifies

var(ei) = σ2 (δ1 + |vi|^δ2)²,   δ1 > 0.

That is, for δ2 > 0 (as is usual), the variance function is approximately constant and equal to δ1² for values of the variance covariate close to 0, and then it increases as a power of |vi| as vi increases in magnitude away from 0.

5. varExp. The variance model for varExp is

var(ei) = σ2 exp(2δvi)

6. varComb. Finally, the varComb class allows the preceding variance classes to be combined so that the variance function of the model is a product of two or more component variance functions.
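A hedged sketch of the power-of-the-mean specification in gnls() (the mean model, data frame, and starting values are placeholders, not taken from any handout):

    library(nlme)
    ## var(e_i) proportional to |mu_i|^(2*delta), requested by giving
    ## fitted(.) as the variance covariate (pseudo-likelihood fitting).
    fit <- gnls(y ~ th1 * exp(-th2 * x),
                data    = dat,
                start   = c(th1 = 100, th2 = 0.001),
                weights = varPower(form = ~ fitted(.)))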


Correlation Structures Available in the nlme Software (e.g., in gnls()):

• nlme includes correlation structures to account for time dependence(serial correlation structures) and spatial dependence (spatial corre-lation structures). It also has a couple of generally applicable corre-lation structures.

• Correlation structures in the nlme software are described in §5.3 inPinheiro and Bates (2000) (see also ?corClasses in the nlme docu-mentation). Here, we give brief descriptions of the serial and generalcorrelation structures.

Serial Correlation Structures:

The work-horse class of models in time-series analysis is the class of Auto-regressive-Moving Average (ARMA) models.

We will apply these models to the errors, ei, but for notational conveniencelet’s index e by t to indicate time.

In an Autoregressive (AR) model, we assume the current observation etis a linear function of previous observations plus “white noise” ( a meanzero, constant variance error term):

et = ϕ1et−1 + · · ·+ ϕpet−p + at, E(at) = 0, var(at) = σ2.

• The number of previous observations on which et depends, p, is calledthe order of the process and we write AR(p).

• The simplest, and practically most useful, AR model is an AR(1):

et = ϕet−1 + at, where −1 < ϕ < +1.

• For an AR(1) model,

corr(et, es) = ϕ|t−s|

and ϕ represents the correlation between two observations one timeunit apart.


A Moving-Average model is one in which the observation et at time t is alinear combination (weighted average, in some sense) of past independentand identically distributed white noise error terms plus a current timewhite noise error:

et = θ1at−1 + · · ·+ θqat−q + at

• The number of past errors on which et depends is the order of theprocess, so above we have an MA(q) process.

• Again, an order 1, in this case MA(1), process is often useful. Foran MA(1),

corr(et, es) = 1 if s = t; θ1/(1 + θ1²) if |s − t| = 1; and 0 otherwise.

• In general, MA(q) processes have nonzero correlation for observa-tions ≤ q time units apart and 0 correlation for observations > qtime units apart.

Combining an AR(p) process with a MA(q) process we get an ARMA(p, q)process:

et = ∑_{i=1}^p ϕi et−i + ∑_{j=1}^q θj at−j + at.

• It is always possible to model any autocovariance structure to anarbitrarily small level of precision with a high enough order AR orMA process. Often, we will find that a very low order AR, MA, orARMA model will suffice.


1. corAR1. This correlation structure is specified as corAR1(value, form = one-sided formula), where value specifies an (optional) initial value for estimating the AR(1) parameter ϕ and the one-sided formula has the form:

∼ covariate | grouping variable

Here, the covariate is an integer-valued time index and | grouping variable is an optional group specification. Groups are specified to be units of observation on which repeated measurements through time are taken.

• For example, in the CO2-uptake example above, the specification corAR1(.8, form = ∼ time | Plant) would yield the following correlation matrix for plant 1 (a full gnls() call using this structure is sketched after this list):

C =

   1   ρ    ρ²   ρ⁴   ρ⁵
       1    ρ    ρ³   ρ⁴
            1    ρ²   ρ³
                 1    ρ
                      1

(upper triangle shown; the gaps in the exponents arise because the time-4 response for plant 1 is missing, so the observed times are 1, 2, 3, 5, 6)

with initial value of .8 for ρ.

2. corCAR1. This correlation structure is a continuous-time version of an AR(1) correlation structure. The specification is the same as in corAR1, but now the covariate indexing time can take any non-negative non-repeated value and we restrict ϕ ≥ 0.

3. corARMA. This correlation structure corresponds to an ARMA(p, q) model. AR(p) and MA(q) models can be specified with this function, but keep in mind that the corAR1 specification is more efficient than specifying corARMA with p = 1 and q = 0.

We can specify an ARMA(1,1) model with initial values of ϕ = .8, θ = .4 via corARMA(value = c(.8, .4), form = ∼ covariate | grouping variable, p = 1, q = 1).
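A hedged sketch of how the corAR1 structure might be supplied to gnls() (the mean model, data frame name, and starting values here are placeholders chosen for illustration, not the CO2 example's actual analysis):

    library(nlme)
    fit <- gnls(uptake ~ th1 / (1 + exp((th2 - intensity)/th3)),  # illustrative logistic mean
                data        = co2dat,
                start       = c(th1 = 7, th2 = 100, th3 = 50),
                correlation = corAR1(0.8, form = ~ Time | Plant))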


General Correlation Structures:

1. corCompSymm. In this structure,

corr(ei, ej) = 1 if i = j, and ρ if i ≠ j.

That is, the correlation between any two distinct observations isthe same. Like many of the correlation structures, this structureis often useful within groups. For our example, the specificationcorCompSymm(value=.3, form = ∼ 1 | Plant) defines the correlationmatrix

C =

   1   ρ   ρ   ρ   ρ
       1   ρ   ρ   ρ
           1   ρ   ρ
               1   ρ
                   1

with initial value of .3 for ρ.

2. corSymm. Specifies a completely general correlation structure witha separate parameter for every non-redundant correlation. E.g., inour example corSymm(form = ∼ 1 | Plant) specifies the correlationmatrix

C =

   1   ρ1   ρ2   ρ3   ρ4
       1    ρ5   ρ6   ρ7
            1    ρ8   ρ9
                 1    ρ10
                      1

where initial values for ρ can be supplied with an optional value=specification.


Spatial Correlation Structures:

• A classic reference on spatial statistics is Cressie, Statistics for Spa-tial Data. The following material is based on Pinheiro and Bates(2000, §5.3), who base their treatment on material in Cressie’s book.

Let ep denote the observation (error term in our nonlinear model) corresponding to position p = (p1, p2, . . . , pr)T.

• Often r = 2 and p = (p1, p2)T gives two dimensional coordinates.

Time series correlation structures are typically described by their auto-correlation function (which we’ve denoted h(·) above). Spatial correlationstructures are usually described by their semivariogram.

For a given distance function d(·), the semivariogram is a function γ ofthe distance between two points ep and eq say, and a parameter ρ, thatmeasures the association between two points that distance apart:

γ{d(ep, eq), ρ} = (1/2) var(ep − eq)

We assume the observations have been standardized to have E(ep) = 0 andvar(ep) = 1 for all p. Such a standardization does not alter the correlationstructure.

In that case, it is easy to see the relationship between the semivariogramγ(·) and the autocorrelation function h(·):

γ(s,ρ) = 1− h(s,ρ).

From this relationship it is clear that observations 0 distance apart have h(0, ρ) = 1 and thus γ(0, ρ) = 0. The autocorrelation function h increases continuously to 1 as the distance decreases to 0. Hence the semivariogram decreases continuously to 0 as distance decreases to 0.


In some applications it is useful to violate this by introducing a nuggeteffect into the definition of the semivariogram. This nugget effect is aparameter c0 that forces γ(0,ρ) = c0 where 0 < c0 < 1 rather thanγ(0,ρ) = 0 when the distance between the observations is 0.

The following spatial correlation structures are implemented in the nlmesoftware in R. All have a scalar-valued correlation parameter ρ. Thisparameter is known as the range in the spatial literature.

1. corExp. (Exponential) This structure corresponds to the semivari-ogram

γ(s, ρ) = 1− exp(−s/ρ)

and the autocorrelation function h(s, ρ) = exp(−s/ρ).

2. corGaus. (Gaussian) This structure corresponds to the semivariogram

γ(s, ρ) = 1− exp{−(s/ρ)2}

and the autocorrelation function h(s, ρ) = exp{−(s/ρ)2}.

3. corLin. (Linear) This structure corresponds to the semivariogram

γ(s, ρ) = 1− (1− s/ρ)1{s<ρ}

and the autocorrelation function h(s, ρ) = (1 − s/ρ)1{s<ρ}. Here1{A} represents the indicator variable that equals 1 when conditionA is true, 0 otherwise.

4. corRatio. (Rational Quadratic) This structure corresponds to the semivariogram

γ(s, ρ) = (s/ρ)² / {1 + (s/ρ)²}

and the autocorrelation function h(s, ρ) = {1 + (s/ρ)2}−1.

5. corSpher. (Spherical) This structure corresponds to the semivariogram

γ(s, ρ) = 1 − {1 − 1.5(s/ρ) + .5(s/ρ)³} 1{s<ρ}.


• A nugget effect can be added to any of these structures. With anugget effect c0, the semivariogram with the nugget effect γnugg(·) isdefined in terms of the semivariogram without the nugget effect γ(·)as follows:

γnugg(s, c0, ρ) = c0 + (1 − c0)γ(s, ρ) if s > 0, and 0 otherwise.

• When using the above spatial correlation structures, the user can choose between distance metrics. Currently implemented distance metrics are Euclidean distance, d(ep, eq) = ∥p − q∥ = √(∑_{i=1}^r (pi − qi)²); Manhattan distance, d(ep, eq) = ∑_{i=1}^r |pi − qi|; and maximum distance, d(ep, eq) = max_{i=1,...,r} |pi − qi|.

• One can get a feel for these various semivariogram models by exam-ining them as functions of distance for different choices of the rangeparameter ρ and the nugget effect c0. The 5 semivariograms listedabove are plotted below for ρ = 1, c0 = .1.


Q: How do we choose a correlation structure?

A: This is a hard question that would be the focus of several weeks worthof study in a time series or spatial statistics course.

In a regression context, inference on the regression parameters is the pri-mary interest. We need to account for a correlation structure if one existsto get those inferences right, but we’re typically not interested in the cor-relation structure in and of itself. Therefore, we opt for simple correlationstructures that capture “most of the correlation” without getting caughtup in extensive correlation modeling.

In a time-dependence context, AR(1) models are often sufficient.

If we are willing to consider other ARMA models, two tools that are useful in selecting the right ARMA model are the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF).

Let

r_i = (y_i − ŷ_i) / √(v̂ar(e_i))

denote the standardized residuals from a fitted nonlinear model.

The sample autocorrelation at lag ℓ is defined as

ρ̂(ℓ) = [Σ_{k=1}^{n−ℓ} r_k r_{k+ℓ} / (n − ℓ)] / [Σ_{k=1}^{n} r_k^2 / n],   ℓ = 1, 2, . . .

The sample partial autocorrelation at lag ℓ is a sample estimate of the correlation between e_t and e_{t−ℓ} after removing the effects of e_{t−1}, . . . , e_{t−ℓ+1}. The estimate is obtained by a recursive formula not worth reproducing here.
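As a small illustration of the formula above, the sample ACF of a vector of standardized residuals r could be computed directly as follows (a sketch; in practice one would use acf() or nlme's ACF()):

    ## Direct implementation of the sample autocorrelation formula above.
    ## r is a vector of standardized residuals from a fitted model.
    sample.acf <- function(r, max.lag = 10) {
      n <- length(r)
      denom <- sum(r^2) / n
      sapply(1:max.lag, function(l)
        (sum(r[1:(n - l)] * r[(l + 1):n]) / (n - l)) / denom)
    }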


• AR(p) models have PACFs that are non-zero for lags ≤ p and 0 for lags > p. Therefore, we can look at the magnitude of the sample PACF to try to identify the order of an AR process that will fit the data. The number of "significant" partial autocorrelations is a good guess at the order of an appropriate AR process.

• MA(q) models have ACFs that are nonzero for lags ≤ q and 0 for lags > q. Again, we can look at the sample ACF to choose q.

• ARMA(p, q) models will have sample ACFs and PACFs that don't fit these simple rules. The following table (from Seber & Wild, Ch. 6, p.320) describes the general behavior of the ACF and PACF functions for various ARMA models.

Similarly, in a spatial setting, we can estimate and plot the semivariogram to help choose an appropriate spatial correlation model. The Variogram() function in the nlme library will compute either one of two semivariogram estimators: the classical estimator, and a robust estimator that reduces the influence of outliers. See Pinheiro and Bates (2000, p.231) for formulas.


Example — Cumulative Bole Volume of a Sweetgum Tree

For purposes of forest inventory and planning, it is useful to have bole-volume equations to predict the volume of trees while standing based on easily measured tree size variables such as diameter at breast height (DBH) and stem height. Merchantable volume is the volume from stump height to a specified upper-bole diameter which establishes the merchantability limit. Since this limit is subject to change according to technological capabilities and economic conditions, it is useful to predict cumulative volume to an upper diameter d that is variable. Typically, a sample of trees spanning the range of relevant sizes is felled. For each tree, cumulative bole volumes Vd and stem diameters d are measured at a series of ascending heights, and the DBH and total stem height H are measured. In this example, we consider data from just one tree (tree #5) in a larger data set analyzed by Gregoire and Schabenberger (1996, JABES) and Davidian & Giltinan (1995, §11.3).

A partial listing of the data is given below:

Tree No.   DBH   Stem Height (H)   Diameter (d)   Measurement Height   Cumulative Volume
    5      11.8       97.6             10.7              7.2                 5.27
    5      11.8       97.6             10.4             10.2                 7.09
    5      11.8       97.6             10.3             13.2                 8.85
   ...      ...        ...              ...              ...                  ...
    5      11.8       97.6              1.1             88.2                27.09


• It seems reasonable to expect that cumulative volume measurements here will be subject to spatial correlation. Measurements taken at locations that are close to one another, especially observations at adjacent locations, can be expected to be correlated, with correlation that we expect to decrease with distance.

• Although the correlation is spatial, there is only one dimension here (length along the stem) and measurements are taken at equally spaced locations along this dimension. Therefore, we can handle these data as though they represent a time series, with time index t = 1, 2, . . . , 24 (t = (Measurement Ht − 4.2)/3).

• See handout sweetgum1.

• In this R script we follow Davidian and Giltinan in considering models for cumulative bole volume V as a function of x = log(DBH − d). We first plot V versus x. This plot reveals a sigmoidal form. Since the response is cumulative volume, this form is not unexpected. It has the same general form as a cumulative distribution function.

• While a variety of sigmoidal curves might be chosen to model these data, we follow Davidian and Giltinan in considering models with a logistic expectation function,

f(x, θ) = θ1 / [1 + exp{(θ2 − x)/θ3}].

• We first consider m1, a logistic model with spherical errors. A plot of the residuals doesn't show a pattern suggestive of heteroscedasticity. However, to investigate possible heteroscedasticity, we refit the model, now as model m2, with variance function

g^2(µ_i, δ) = |µ_i|^{2δ}.

The anova() function provides a LRT, as well as AIC and BIC information criteria, to compare models m1 and m2. According to the test and the criteria, m2 does not fit significantly better than m1, and the residual plot for m2 looks almost identical to that for m1. Therefore, we reject model m2 and conclude that the data are homoscedastic.


• Next we obtain plots of the sample ACF and PACF for model m1. The nlme library contains an ACF() function that computes the sample ACF directly from a fitted gnls object. We specify the number of lags for which we want autocorrelations computed with the maxLag= option. When plotting the ACF we can request error bars on the ACF function for any given confidence level 100(1 − α)%. These error bars are placed at

±z_{1−α/2} / √n(ℓ),   ℓ = 1, 2, . . . ,

where n(ℓ) is the number of residual pairs that went into the calculation of the autocorrelation at lag ℓ.

• There is no PACF function in nlme, but we can obtain the PACF* by first extracting the standardized residuals from our model (e.g., resid(m1, type="p")) and inputting them into the ar() function. The results of the ar() function can then be input into acf.plot to yield the PACF plot. This plot also contains a 95% error bar by default. (The error bars here replace n(ℓ) with n in the formula above, so they stay constant over increasing lags. There seems to be some difference of opinion in the literature on which is more appropriate, n(ℓ) or n.)
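A sketch of these diagnostics in R (object names follow the discussion above; using pacf() directly on the standardized residuals is an assumption about one convenient alternative to the ar()/acf.plot route):

    ## Residual ACF from the fitted gnls object, with 95% error bars
    plot(ACF(m1, maxLag = 10), alpha = 0.05)

    ## PACF: extract standardized residuals and apply a PACF routine
    r <- resid(m1, type = "p")
    pacf(r)    # base-R partial ACF with constant 95% bounds (uses n, not n(l))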

• The sample ACF and PACF plots are not definitive in this example. What we'd really like to see is no significant (partial) autocorrelations, or only one significant (partial) autocorrelation at lag 1. Without a clear indication of the true correlation structure from these plots, we consider several ARMA structures, using a trial and error approach.

* Note that this approach is only appropriate for data taken over an equally spaced time index from a single subject/unit. This is not appropriate for longitudinal/repeated measures from several subjects in the same dataset.


• In models m3–m7 we fit an AR(1) model, an AR(2), an ARMA(1,1), an AR(3), and an ARMA(2,1) model, respectively. ACF plots are produced for all of these models, and a PACF plot is produced for m3. According to the AIC criterion, model m3 (the AR(1) model) is the winner, and its ACF and PACF plots look pretty good.
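A hedged sketch of how such a sequence of correlation structures could be fit and compared with gnls(); the object names (m1, the time index t) are assumptions:

    ## Refit the homoscedastic logistic model with various correlation structures
    m3 <- update(m1, correlation = corAR1(form = ~ t))                 # AR(1)
    m4 <- update(m1, correlation = corARMA(form = ~ t, p = 2, q = 0))  # AR(2)
    m5 <- update(m1, correlation = corARMA(form = ~ t, p = 1, q = 1))  # ARMA(1,1)
    anova(m3, m4, m5)                                   # AIC/BIC comparison
    ## the residual ACF of the chosen model should look like white noise
    plot(ACF(m3, maxLag = 10, resType = "normalized"), alpha = 0.05)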

• It's very important to note that the PACF and ACF plots for models with correlation should always be based on the normalized residuals. The vector of normalized residuals is defined as

r = σ̂^(−1) (Λ̂^{−T/2}) (y − ŷ).

If the estimated variance-covariance structure is correct, r should be approximately mean 0 with var-cov matrix I. Therefore, if the mean, correlation and heteroscedasticity models are correct, then the ACF from r should reflect independent white noise (no pattern and no significant autocorrelations).

• Note that while the parameter estimates don't differ much between models m1 and m3, their standard errors do. It is important to account for correlation (and heteroscedasticity) to obtain correct inferences.

• The general strategy for choosing an error var-cov structure that we have taken here is to combine residual analysis with LRTs and information criteria.

• Finally, note that the ACF() function in the nlme library will work both for a single time series like this one and for grouped data, where we essentially have several time series that we are analyzing simultaneously (e.g., if we had been modeling the cumulative volume of several trees at once).

– However, for data from multiple time series (i.e., longitudinal data), the steps we took above to plot the PACF will not work and should not be followed!

– At present there is no easy way to plot the PACF for a model fit to longitudinal data in S-PLUS or R.


Because of its usefulness, consider again the AR(1) model.

Suppose the homoscedastic AR(1) model holds. That is, suppose the true model is

y_t = f(x_t, θ) + e_t,   where e_t = ϕ e_{t−1} + a_t,

for t = 1, . . . , n.

A simple approach to fitting this model is to transform it to a model with uncorrelated errors by subtracting ϕ times the model at the previous time period:

y_t − ϕ y_{t−1} = f(x_t, θ) − ϕ f(x_{t−1}, θ) + (e_t − ϕ e_{t−1}) = f(x_t, θ) − ϕ f(x_{t−1}, θ) + a_t,   t = 2, . . . , n.

Equivalently,

y_t = ϕ y_{t−1} + f(x_t, θ) − ϕ f(x_{t−1}, θ) + a_t,   t = 2, . . . , n,   (∗)

where a_t has mean 0 and constant variance.

Thus, we can estimate θ and the AR(1) parameter ϕ by fitting (∗) with ordinary nonlinear least squares. Note that we fit the model to observations 2, . . . , n. This approach is known as conditional least squares (CLS).

Because we must throw out one observation in this approach, we can expect it to be less efficient than ML estimation, especially when n is small.

• Although this is an appealing, simple approach, it is not recommended in general.


A big improvement to CLS can be made without necessitating anything more sophisticated than ordinary nonlinear least squares. The technique is as follows:

0. Fit the model

y_t = f(x_t, θ) + a_t,   t = 1, . . . , n,

with ordinary NLS.

1. Compute the moment estimate ϕ̂ of the AR(1) parameter ϕ based on the residuals {â_t} of the most recently fit model:

ϕ̂ = Σ_{t=2}^{n} â_t â_{t−1} / Σ_{t=1}^{n} â_t^2.

(This is just the sample ACF at lag 1.)

2. Fit the model

z_t = g(x_t, θ) + a_t,   t = 1, . . . , n,

where

z_1 = √(1 − ϕ̂^2) y_1,   g(x_1, θ) = √(1 − ϕ̂^2) f(x_1, θ),

and for t = 2, 3, . . . , n,

z_t = y_t − ϕ̂ y_{t−1},   g(x_t, θ) = f(x_t, θ) − ϕ̂ f(x_{t−1}, θ).

3. Go to step 1. Repeat until convergence.

• This method is called iterated two-stage estimation and will usually give an answer very similar to ML estimation. In fact, the one-step version (omit step 3) will often work well.
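A minimal sketch of the one-step scheme for the sweetgum logistic model, assuming a data frame sweetgum with columns V and x and rough starting values; iterating steps 1-2 gives the iterated version:

    f <- function(x, Asym, xmid, scal) Asym / (1 + exp((xmid - x) / scal))

    ## Step 0: ordinary NLS ignoring the correlation
    m0 <- nls(V ~ f(x, Asym, xmid, scal), data = sweetgum,
              start = c(Asym = 30, xmid = 1, scal = 0.5))

    ## Step 1: lag-1 moment estimate of the AR(1) parameter phi
    a   <- residuals(m0)
    n   <- length(a)
    phi <- sum(a[-1] * a[-n]) / sum(a^2)

    ## Step 2: transform the response and the mean function, refit by ordinary NLS
    y <- sweetgum$V; x <- sweetgum$x
    z <- c(sqrt(1 - phi^2) * y[1], y[-1] - phi * y[-n])
    m1.ts <- nls(z ~ c(sqrt(1 - phi^2) * f(x[1], Asym, xmid, scal),
                       f(x[-1], Asym, xmid, scal) - phi * f(x[-n], Asym, xmid, scal)),
                 start = as.list(coef(m0)))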


Sweetgum Example, Continued:

• We illustrate CLS and two-stage estimation on the sweetgum example. In model m8 we refit the AR(1) model using CLS.

• We see that the residual ACF plot for the CLS model m8 looks about the same as the residual ACF plot for the AR(1) model m3. In addition, the regression parameter estimates are similar in the two models. However, there is some disagreement in the standard errors.

• Finally, we implement two-stage estimation in model m9 by taking ϕ̂ = .4047 to be the ML estimate of ϕ from the AR(1) model m3, and performing only one iteration. Notice that the results for model m9 are very similar to those for m3.

An interesting result in our sweetgum example is the effect of accounting for a non-zero correlation structure on the parameter estimates and their standard errors.

For the two homoscedastic models m1 and m3, with independence and AR(1) correlation structures, respectively, we had:

m1 (Indep)                                     m3 (AR(1))
Coefficients:                                  Coefficients:
         Value    Std.Error   t-value                   Value    Std.Error   t-value
Asym  30.12624   0.6464298   46.60404          Asym  30.17798   0.9306283   32.42753
xmid   0.99812   0.0328612   30.37395          xmid   0.99676   0.0491328   20.28710
scal   0.56360   0.0289948   19.43791          scal   0.57493   0.0414744   13.86223

• The parameter estimates are similar, but the s.e.'s increase substantially when we account for correlation. Why?


• Roughly speaking, n dependent observations "contain less information" about the marginal mean than n independent observations. If we fit n dependent observations with an independence model, we "overestimate" the sample size/precision of our parameter estimates.

• In general, for serially correlated data fit with an independence model, parameters associated with time-invariant explanatory variables tend to have their s.e.'s underestimated, and parameters associated with time-varying explanatory variables tend to have their s.e.'s overestimated.

8. Comparison of Models.

Suppose we are considering two or more models and we wish to choose between them to select the model that best describes the data.

We will break this problem into two cases: Nested vs. Non-Nested Models.

Let f1(x, θ), f2(x, ϕ) denote two possible expectation functions. That is, we are faced with two possible models:

y_i = f1(x_i, θ) + e_i   and   y_i = f2(x_i, ϕ) + ϵ_i,   i = 1, . . . , n,

where we assume

i. the error var-cov structure is the same in the two models, and

ii. the error var-cov structure is spherical in the two models.

• We will eventually drop assumption (ii.), but we keep it for now.

We say the models are nested if one is a special case of the other.


Examples:

1. Obviously nested models:

f1(x, θ) = θ1/(θ2 + x),    f2(x, ϕ) = ϕ3 + ϕ1/(ϕ2 + x).

Clearly, f1 is a special case of f2 corresponding to ϕ1 = θ1, ϕ2 = θ2, ϕ3 = 0.

2. Not-so-obviously nested models:

f1(x, θ) = θ1/(θ2 + x),    f2(x, ϕ) = 1/(ϕ1 + ϕ2 x + ϕ3 x^2).

These are nested as well, because we can re-parameterize f1 as

f1 = 1/(ϕ1 + ϕ2 x),

where ϕ1 = θ2/θ1 and ϕ2 = 1/θ1; that is, f1 is f2 with ϕ3 = 0.

3. Non-nested models:

f1(x, θ) = θ1(1 − e^{−θ2 x}),    f2(x, ϕ) = ϕ1/(1 + e^{ϕ2 − ϕ3 x}).

To show these models are non-nested one need only show that for one choice of θ it is not possible to find a ϕ so that f1 = f2. E.g., for θ = (1, 1)^T,

f1(x, θ) = 1 − e^{−x} ≠ ϕ1/(1 + e^{ϕ2 − ϕ3 x}) for any ϕ.


1. Nested Models.

Let f1, f2 denote expectation functions and suppose f1 is a special case of f2.

Q: Is the added complexity of f2 needed?

That is, we wish to test

H0 : (f1 and f2 both hold), versus H1 : (only f2 holds).

By analogy with the linear model, we might consider the test statistic

F = [(SSE1 − SSE2)/(dfE1 − dfE2)] / [SSE2/dfE2],

where SSEj and dfEj are the error sum of squares and error d.f., respectively, under model fj, j = 1, 2.

For an α-level test of H0 versus H1, we compare this test statistic with the upper αth quantile of its distribution under H0. Under H0, F is approximately distributed as F(dfE1 − dfE2, dfE2), so an approximate test has rejection rule: reject H0 if

F > F_{1−α}(dfE1 − dfE2, dfE2).

• Note that this test is a likelihood ratio test.

Quite generally, in a model with a parametric likelihood function L(γ; y) depending on parameter γ and data y, under certain regularity conditions nested models can be tested by comparing the values of the likelihood when maximized under the competing models.


That is, we reject H0 for large values of the ratio

λ ≡ L(γ̂; y) / L(γ̂0; y),

where

γ̂0 = MLE under H0 (under the null, or partial, model)
γ̂ = MLE under H1 (under the alternative, or full, model)

Logic: If the observed data are much less likely under H0 than under H1, then λ >> 1 and we should reject H0.

In our spherical errors nonlinear regression model, suppose we wish to test

H0: Aθ = b   versus   H1: Aθ ≠ b,

where A is k × p. The likelihood function is

L(θ, σ^2; y) = (2πσ^2)^(−n/2) exp{ −‖y − f(X, θ)‖^2 / (2σ^2) }

(cf. p.16).

Therefore, the LR is

λ = L(θ̂, σ̂^2; y) / L(θ̂0, σ̂0^2; y)

  = [ (2πσ̂^2)^(−n/2) exp{ −‖y − f(X, θ̂)‖^2 / (2σ̂^2) } ] / [ (2πσ̂0^2)^(−n/2) exp{ −‖y − f(X, θ̂0)‖^2 / (2σ̂0^2) } ].

Substituting the MLEs σ̂^2 = ‖y − f(X, θ̂)‖^2/n and σ̂0^2 = ‖y − f(X, θ̂0)‖^2/n, both exponential terms reduce to exp(−n/2), so

λ = (σ̂0^2/σ̂^2)^(n/2) = ( ‖y − f(X, θ̂0)‖^2 / ‖y − f(X, θ̂)‖^2 )^(n/2) = ( S(θ̂0)/S(θ̂) )^(n/2).


We reject H0 if λ is large compared to its distribution under H0.

Equivalently, we can reject if some increasing function of λ is large compared with its distribution (which may be easier to calculate). In particular, we reject at level α based on

(λ^(2/n) − 1)(n − p)/k = {[S(θ̂0) − S(θ̂)]/k} / {S(θ̂)/(n − p)} = F ∼ F(k, n − p) (approximately),

which is the same test we presented before arguing simply by analogy.

• Transforming λ to a ratio of mean squares gives us a test statistic with an approximate F distribution. Alternatively, there is a famous result known as Wilks' Theorem that gives the asymptotic distribution of 2 log(λ) under quite general conditions. The result says

2 log(λ) ∼ χ^2(# of nonredundant restrictions on γ made by H0) (approximately).

• Therefore, in our spherical errors nonlinear regression model, an alternative test of H0: Aθ = b is to reject H0 at level α if

2 log(λ) > χ^2_{1−α}(k).

• This test is asymptotically equivalent to the F-test version of the LRT given above, but the F version of this test performs better than the chi-square version in small samples.

• In linear regression, the F test given above was exact, and was algebraically equivalent to the F test based on

(Aβ̂ − b)^T {A(X^T X)^(−1) A^T}^(−1) (Aβ̂ − b) / (k s^2) ∼ F(k, n − p).   (∗)

(cf. p.23).


• In the nonlinear regression context, (∗) becomes

(Aθ̂ − b)^T {A(V̂^T V̂)^(−1) A^T}^(−1) (Aθ̂ − b) / (k s^2) ∼ F(k, n − p) (approximately).   (∗∗)

However, the equivalence no longer holds!

• The statistic (∗∗) is a special case of a Wald test rather than a LRT. It can still be used as an approximate F-test statistic. However, the validity of the Wald test is affected by both parameter-effects nonlinearity and intrinsic nonlinearity, whereas the LRT is affected only by intrinsic nonlinearity. This makes the LRT the preferred choice for testing nested models.

What if the errors are heteroscedastic and/or correlated?

Consider the problem of testing

H0: Aθ = b   vs.   H1: Aθ ≠ b

based on the (full) model

y = f(X, θ) + e,   where var(e) = σ^2 Λ(λ).

• When testing such a hypothesis on the mean parameter θ using a LRT, the variance-covariance model should be kept fixed!

The reason for this is that we do not want to confound our question concerning the mean structure with issues concerning the variance-covariance structure.

Q: Where do we fix the value of Λ(λ)?

A: At Λ(λ̂), where λ̂ is our best estimate of λ.

• λ̂ should be obtained from the full (alternative) model under consideration in the hypothesis test, or, better yet, the fullest model among all of those under consideration in the analysis.


• The degrees of freedom for the test is still given by k, because we are not fixing λ at a known value, but rather at an estimated value. We still have just as many variance-covariance parameters to estimate in both the null and alternative models.

Example — High-Flux Hemodialyzer Ultrafiltration Rates

Vonesh and Carter (1992, Biometrics) describe and analyze data measured on high-flux hemodialyzers to assess their in vivo ultrafiltration characteristics. The ultrafiltration rates (in ml/hr) of 20 high-flux dialyzers were measured at 7 ascending transmembrane pressures (in dmHg). The in vivo evaluation of the dialyzers used bovine blood at flow rates of either 200 dl/min or 300 dl/min. These data are described in Appendix A.6 of Pinheiro and Bates (2000) and analyzed in their book in §5.2.2, §5.4, and §8.3.3. The data are included as the groupedData object Dialyzer in the nlme library for S-PLUS. We give a partial listing of the data below.

Obs. No.   Subject   Blood Flow Rate (QB)   Transmembrane Pressure   Ultrafiltration Rate   W/in Subject Index
     1        1              200                    0.240                    0.645                  1
     2        1              200                    0.505                   20.115                  2
     3        1              200                    0.995                   38.460                  3
     4        1              200                    1.485                   44.985                  4
     5        1              200                    2.020                   51.765                  5
     6        1              200                    2.495                   46.575                  6
     7        1              200                    2.970                   40.815                  7
     8        2              200                    0.240                    3.720                  1
     9        2              200                    0.540                   18.885                  2
   ...      ...              ...                      ...                      ...                ...
   139       20              300                    2.510                   53.625                  6
   140       20              300                    3.000                   56.430                  7


The model we consider for these data is an asymptotic regression model with an offset (as in Appendix C.2 of Pinheiro and Bates (2000)). We model y = ultrafiltration rate as the following function of x = transmembrane pressure:

y_i = θ1{1 − exp[−e^{θ2}(x_i − θ3)]} + e_i,   i = 1, . . . , n.   (∗)

The parameters here have the following interpretations: θ1 = the maximum ultrafiltration rate that can be obtained (the upper asymptote); θ2 = the log of the hydraulic permeability transport rate; and θ3 = the transmembrane pressure required to offset the oncotic pressure.

• See handout dialyzer1.

• Since Dialyzer is a groupedData object it is very easy to plot using the simple command plot(Dialyzer, outer = ~ QB). This yields plots of ultrafiltration rate (y) versus transmembrane pressure (x) separately by blood flow rate (QB=200 vs. QB=300). From this plot we can see the asymptotic form of the relationship, with possibly different values of the parameters for the two QB groups.

• The nlme library contains a handy function nlsList() that allows one to fit separate nonlinear regressions within each of several groups, using nls() to fit each separate model. We use this function to fit separate models of the form (∗) for each value of QB. We see that θ1 and θ2 appear to change across QB groups, but θ3 does not.

• A useful technique for comparing the θ values across groups is to plot 95% intervals for each component of θ, separately by group. This is easily done by extracting the intervals from m1Dial.lis with the intervals() function and feeding the result into the plot() function. Clearly, there is little overlap in the intervals for θ1 and θ2, but considerable overlap for θ3. A sketch of these steps follows below.

• These results suggest fitting a model with dummy variables to allow θ1 and θ2 to change with QB.
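A sketch of the steps just described, using nlsList() with the self-starting asymptotic-regression-with-offset model SSasympOff() (which has the form of (∗)); it assumes the Dialyzer groupedData object distributed with nlme, with columns rate, pressure, and QB:

    library(nlme)
    plot(Dialyzer, outer = ~ QB)       # rate vs. pressure, separately by flow rate

    ## Separate fits of model (*) within each QB group
    m1Dial.lis <- nlsList(rate ~ SSasympOff(pressure, Asym, lrc, c0) | QB,
                          data = Dialyzer)
    m1Dial.lis

    ## 95% intervals for each parameter, plotted by group
    plot(intervals(m1Dial.lis))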


Dummy Variables:

Suppose we have grouped data y_ij where y_ij represents the jth observation from the ith group:

y_ij,   i = 1, . . . , a;   j = 1, . . . , n_i.

Suppose our model for y_ij involves a parameter γ, say, that we wish to change across groups. That is, we want the value of γ associated with y_ij to change with i but not j.

There are many different ways to choose a parameterization to accomplish this. Perhaps the simplest is to replace γ with

γ_i = γ1 1{i=1} + γ2 1{i=2} + · · · + γa 1{i=a},

where

1{A} = 1 if condition A is true, and 0 otherwise.

• Here, γ_i has the interpretation as the γ-parameter associated with group i, i = 1, . . . , a.

Alternatively, we can replace γ with

γ1 + ϕ2 1{i=2} + ϕ3 1{i=3} + · · · + ϕa 1{i=a}.

• Here, the interpretations are as follows:

γ1 = the γ parameter for group 1
γ1 + ϕ2 = the γ parameter for group 2
γ1 + ϕ3 = the γ parameter for group 3
...
γ1 + ϕa = the γ parameter for group a

Hence, ϕ_i has the interpretation as the additive effect of being in group i versus being in group 1. This is an especially convenient parameterization when group 1 corresponds to a standard of comparison (e.g., a control group, or the standard treatment group).


A third option is to replace γ with

γ0 + γ_i = γ0 + γ1 1{i=1} + γ2 1{i=2} + · · · + γa 1{i=a},

where γ1, γ2, . . . , γa are constrained to sum to 0.

• Here, the interpretations are as follows:

γ0 = the "average" γ parameter across all groups
γ1 = the effect up or down from γ0 associated with group 1
γ2 = the effect up or down from γ0 associated with group 2
...
γa = the effect up or down from γ0 associated with group a

In this ANOVA-type parameterization, note that for a = 2 groups we can use the constraint γ1 + γ2 = 0 to write γ2 = −γ1, so that γ becomes

γ0 + γ1 1{i=1} + γ2 1{i=2} = γ0 + γ1 1{i=1} − γ1 1{i=2} = γ0 + γ1(1{i=1} − 1{i=2}),

and the hypothesis of equal γ values across groups can be tested by testing H0: γ1 = 0.


While all three approaches can be generalized to more than one grouping factor, the ANOVA coding is particularly convenient. Suppose now we have two 2-level grouping factors, F1 and F2, and we have data y_ijk, the kth observation at the ith level of F1 combined with the jth level of F2.

E.g., suppose we have three replicates at each combination of F1 and F2:

y_ijk   F1 (i)   F2 (j)   Replicate (k)
y_111     1        1           1
y_112     1        1           2
y_113     1        1           3
y_121     1        2           1
y_122     1        2           2
y_123     1        2           3
y_211     2        1           1
y_212     2        1           2
y_213     2        1           3
y_221     2        2           1
y_222     2        2           2
y_223     2        2           3

Then we can let γ differ across the four groups by replacing γ by

γ0 + α_i + β_j + γ_ij,   (†)

where we constrain the parameters so that Σ_i α_i = Σ_j β_j = Σ_i γ_ij = Σ_j γ_ij = 0. By substituting these constraints into (†) it's not hard to see that this simplifies to

γ0 + α1(1{i=1} − 1{i=2}) + β1(1{j=1} − 1{j=2}) + γ11(1{i=1,j=1} − 1{i=1,j=2} − 1{i=2,j=1} + 1{i=2,j=2}),

and we can test main effects and interaction between F1 and F2 with the following hypotheses:

H0: γ11 = 0 ⇒ no interaction between F1 and F2
H0: α1 = 0 ⇒ no main effect of F1
H0: β1 = 0 ⇒ no main effect of F2


Back to the example:

• For now we assume all three parameters θ1, θ2, θ3 differ across QB groups, and we use the reference-group-type parameterization to do this. That is, we make the substitutions

θ1 = ϕ1 + γ1 Q_i,   θ2 = ϕ2 + γ2 Q_i,   θ3 = ϕ3 + γ3 Q_i,

where

Q_i = 0 if observation i comes from the QB=200 group, and 1 if observation i comes from the QB=300 group.

Thus, our model becomes

y_i = (ϕ1 + γ1 Q_i){1 − exp[−e^{ϕ2 + γ2 Q_i}{x_i − (ϕ3 + γ3 Q_i)}]} + e_i.

We fit this model as m2Dial.gnls.
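A hedged sketch of such a fit with gnls(); the indicator construction, object names, and rough starting values (which in practice would come from the group-wise nlsList() fits) are assumptions, not the handout's code:

    Dial <- as.data.frame(Dialyzer)
    Dial$Q <- as.numeric(Dial$QB == "300")    # Q = 0 for QB=200, Q = 1 for QB=300

    m2Dial.gnls <- gnls(rate ~ (phi1 + gamma1 * Q) *
                          (1 - exp(-exp(phi2 + gamma2 * Q) *
                                     (pressure - (phi3 + gamma3 * Q)))),
                        data = Dial,
                        start = c(phi1 = 45, gamma1 = 15, phi2 = 0.75,
                                  gamma2 = -0.3, phi3 = 0.2, gamma3 = 0))
    summary(m2Dial.gnls)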

• Since we suspect that θ3 does not depend on group, we may want to test this hypothesis. To do so, we can use either a Wald-type test of H0: γ3 = 0 or a LRT. Before doing so, though, we must consider the adequacy of the assumed variance-covariance structure.

• m2Dial.gnls assumes a spherical var-cov structure. A plot of the residuals versus fitted values and versus the values of the covariate x = transmembrane pressure suggests heterogeneity. We consider variance functions

g^2 = |µ_i|^{2δ}   and   g^2 = |x_i|^{2δ}

in models m3Dial.gnls and m4Dial.gnls, respectively. The latter fits slightly better according to AIC and BIC, but either could be used.


• Next we examine the within-subject autocorrelation. Data were collected at 7 consecutive, equally spaced measurement times, with x-values increasing over those times. This may induce some serial correlation within subject, and we do indeed see evidence of strong autocorrelation in the ACF plot for model m4Dial.gnls.

• An AR(1) model within subject is fit to the data in m5Dial.gnls. A LRT comparing the models with and without autocorrelation (m5 and m4, respectively) indicates that the AR(1) model fits substantially better than the independence model. In addition, the residual ACF from model m5 indicates that the AR(1) model is sufficient.

• To test equal θ3-values across QB groups we now test H0: γ3 = 0. The Wald test of this hypothesis is given in the summary of model m5Dial.gnls as the t-statistic for γ3: t = 1.25 (p = .2131).

• For the LRT, we drop γ3 from the model and refit, fixing the heterogeneity and autocorrelation parameters at their estimated values from the full model m5Dial.gnls. This is done with the fixed= options in varPower() and corAR1(), as sketched below. The test statistic is then equal to 2[logLik(m5Dial.gnls) − logLik(m6Dial.gnls)] = 1.6227, which we compare with a χ^2 distribution with k = 1 d.f. (We're testing 1 restriction: γ3 = 0.) Note that the d.f. and p-value from the anova() function are incorrect here.
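A sketch of the reduced-model fit with the variance and correlation parameters held fixed; delta.hat and phi.hat stand for the estimates taken from the full model m5Dial.gnls, and the variable names (pressure, index, Subject) are assumptions:

    ## Reduced model: gamma3 = 0, with delta and phi fixed at the full-model estimates
    ## (delta.hat, phi.hat: the variance-power and AR(1) estimates from m5Dial.gnls)
    m6Dial.gnls <- gnls(rate ~ (phi1 + gamma1 * Q) *
                          (1 - exp(-exp(phi2 + gamma2 * Q) * (pressure - phi3))),
                        data = Dial,
                        start = c(phi1 = 45, gamma1 = 15, phi2 = 0.75,
                                  gamma2 = -0.3, phi3 = 0.2),
                        weights = varPower(fixed = delta.hat, form = ~ pressure),
                        correlation = corAR1(value = phi.hat,
                                             form = ~ index | Subject, fixed = TRUE))

    lrt <- 2 * (logLik(m5Dial.gnls) - logLik(m6Dial.gnls))
    pchisq(as.numeric(lrt), df = 1, lower.tail = FALSE)   # k = 1 restriction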

• Notice that the p-value of .2027 from this LRT is very close to the Wald test result, p = .2131. The Wald test is substantially easier to implement, but can be affected by parameter-effects nonlinearity. Thus there are pros and cons for the Wald vs. LRT approaches, but the LRT is generally preferable on statistical grounds.


2. Non-nested Models.

Choosing between competing non-nested models using formal means such as hypothesis tests is a hard problem. See Seber & Wild (1989, §5.9.6) for a brief discussion of some approaches that have been used and some references. We take a less formal approach to the problem based on the following considerations:

i. Theory — any model suggested by theory should have some precedence.

ii. Parsimony — simpler models and/or models with fewer parameters should be favored.

iii. Analysis of residuals — models with "more random/unstructured" patterns in the residual plots should be favored.

iv. Curvature — models with low parameter-effects nonlinearity should be preferred.

v. Model selection criteria — information/model selection criteria such as AIC and BIC are also useful. However, one should be aware of some misuses and caveats:

• AIC or BIC are not comparable across models that have different response variables. In particular, one cannot use information criteria to compare a model with response variable y to a model with response variable g(y) (e.g., log(y) or any other transformation of y).


• Information criteria are not comparable across models involving different data sets. This may seem obvious, but the mistake is often made, especially when small differences in the data sets are present that the analyst may not realize.

– E.g., model 1 might involve x1, and model 2 might involve x1 and x2. If missing values exist on x2, those observations will often be omitted automatically by the software from the data to which model 2 is fit. In such a situation information criteria cannot be used to compare the models, because they are fit to different data sets.

• There is some disagreement about how to define AIC and, especially, BIC in situations in which the data are not all independent.

– In particular, in longitudinal data and other clustered data settings, it is not so clear whether the penalty term for BIC should involve n, the total number of observations (which are not all independent in this context), or the number of independent subjects (clusters) in the data set, or some intermediate quantity.

– E.g., in mixed model software such as PROCs MIXED and NLMIXED in SAS, the penalty involves the number of levels of the "outermost" random effect (in a clustered data context this will typically be the number of clusters). So, in two models fit to the same data set, one with cluster-specific random effects and one without, the penalty terms for BIC will differ and the criteria will not be comparable.

To choose among non-nested models, some combination of (i)–(v) should be used, along with the judgement/experience of the analyst.


Example — Rabbit Eyes Again

Recall that previously we fit these data using the model

y_i = θ1 − θ2/(θ3 + x_i) + e_i,   i = 1, . . . , n,   (1)

where e1, . . . , en are i.i.d. with E(e_i) = 0, var(e_i) = σ^2, and

y = log(eye lens weight)
x = age in days

Since these data exhibit an apparently asymptotic form, we might also consider the asymptotic regression model as an alternative model. Here we parameterize it as in SSasymp, the self-starting version of the asymptotic regression model provided in the nlme software:

y_i = θ1 + (θ2 − θ1) exp[−e^{θ3} x_i] + e_i,   i = 1, . . . , n,   (2)

with the same assumptions on the errors.

• See handout rabbit3.

• In rabbit3.R we refit model 1 (previously fit in rabbit1.R) as m1rabbit.nls and we fit model 2 as m2rabbit.nls. Before examining the fitted models, we should first determine whether one of these models has a theoretical motivation that would give it precedence. Model 1 is based on the model originally proposed by Dudzinski and Mykytowycz (1961). It would require going back to the original paper to determine whether Dudzinski and Mykytowycz's original model was a mechanistic one. We assume for illustration purposes that it was not.

• Since both models are 3-parameter models, neither is more parsimonious.


• On the first page of plots in rabbit3, the fitted curves for models 1 and 2 are displayed. In addition, on the second page of plots there are plots of the residuals versus fitted values for both models and residuals versus the covariate Age for both models.

• Notice that model 2 does not appear to fit the data as well near the elbow in the curve and for large values of Age. This poor fit is especially obvious in the residual plots for model 2. In the Residuals vs. Age plots, the homoscedasticity assumption appears to be violated in both models, with variance apparently decreasing with Age.

• To deal with this heteroscedasticity, we refit these models with gnls() and then add in heteroscedasticity of the form

var(e_i) = σ^2 |Age_i|^{2δ}   (∗)

using the varPower(form = ~ Age) specification. The heteroscedastic versions of models 1 and 2 are m1arabbit.gnls and m2arabbit.gnls, respectively, and based on the LRTs and information criteria produced by the anova() function, these models fit substantially better than their homoscedastic counterparts.

• On the final page of plots in rabbit3, we reproduce the residual plots for the models with heteroscedasticity. Now model (1) appears to fit well, but model (2) still shows substantial misspecification in the expectation function.

• On this basis we prefer model (1) with heteroscedastic errors as in (∗). Note that δ is estimated as δ̂ = −.266, indicating that the error standard deviation decreases with Age, as expected.


Models Defined by Systems of Differential Equations

• Read Ch. 5 of Bates & Watts (handout). See also Ch. 8 of Seber & Wild.

An important and large subclass of nonlinear models occurs when the response is described by a system of ordinary differential equations. These models are used in a wide variety of fields, but one important area of application is pharmacokinetics, where they are called compartment models.

• These models were introduced briefly on pp. 54–57 of the notes and we recall some of that material now:

• Compartmental models are mechanistic models in which one or more measurements of some physical process are related to time, inputs to the system, and other explanatory variables through a compartmental system.

• A compartmental system is "a system which is made up of a finite number of macroscopic subsystems, called compartments or pools, each of which is homogeneous and well mixed, and the compartments interact by exchanging materials. There may be inputs from the environment into one or more of the compartments, and there may be outputs (excretion) from one or more of the compartments to the environment." (Seber & Wild, p.367)

• Compartmental models are common in chemical kinetics, toxicology, hydrology, geology, and pharmacokinetics.


As an example from pharmacokinetics, consider the data in the following scatterplot of tetracycline concentration over time.

[Scatterplot: Tetracycline Concentration (µg/ml) vs. Time (hrs)]

The data come from a study in which a tetracycline compound was administered to a subject orally, and the concentration of tetracycline hydrochloride in the blood serum was measured over a period of 16 hours (the data are in Appendix A1.14 of our text).

A simple compartmental model for the biological system determining tetracycline concentration in serum is one that hypothesizes

a. a gut compartment into which the chemical is introduced,

b. a blood compartment which absorbs the chemicals from the gut, and

c. an elimination path.


This simple two-compartment open model can be represented in a compartment diagram as follows:

Here, γ1 and γ2 represent the concentrations of the chemical in compartments 1 and 2, respectively, and θ1 and θ2 represent the rates of transfer into and out of compartment 2, respectively.

• The model above is an example of an open compartment model. Compartment models with no interchange with the environment are said to be closed; otherwise they are open.

Under the assumption of first-order (linear) kinetics, it is assumed that at time t, the rate of elimination from any compartment is proportional to γ(t), the concentration currently in that compartment.

Thus the rates of change in the concentrations in the two compartments in the model represented above are

γ̇1 ≡ ∂γ1(t)/∂t = −θ1 γ1(t)
γ̇2 ≡ ∂γ2(t)/∂t = θ1 γ1(t) − θ2 γ2(t)

Here, the dot denotes differentiation with respect to time.

• We will restrict attention to first-order or linear compartment models.

• Another restriction of our scope is to ordinary differential equations. In particular, we exclude models based on systems of partial differential equations.


The solutions of the differential equations for linear compartmental models generally take the form of linear combinations of exponentials. Therefore, these models are nonlinear models that can be fit using methods similar to those used for yield-density models, growth curve models, etc.

• E.g., the biexponential model that we've worked with several times already comes up often in the analysis of compartment models.

For example, under the assumptions that at time 0, γ1(0) = θ3 and γ2(0) = 0, the solution for γ2(t), the concentration in blood serum at time t, is

γ2(t) = θ3 θ1 (e^{−θ1 t} − e^{−θ2 t}) / (θ2 − θ1).

Therefore, we might try the additive error model

y_i = θ3 θ1 (e^{−θ1 t_i} − e^{−θ2 t_i}) / (θ2 − θ1) + e_i,   i = 1, . . . , n,

to model the tetracycline data. The resulting fitted regression curve is displayed below.

to model the tetracycline data. The resulting fitted regression curve isdisplayed below.

• •

Time (hrs)

Tet

racy

clin

e C

once

ntra

tion

(mug

/ml)

5 10 15

0.4

0.6

0.8

1.0

1.2

1.4

Tetracycline Concentration vs. Time


In the general compartment model with K compartments we write the concentrations at time t as

γ(t) = (γ1(t), γ2(t), . . . , γK(t))^T.

Assuming first-order kinetics with rate constants θ1, θ2, . . . , θp, the concentrations obey the linear system of differential equations

∂γ/∂t = γ̇(t) = A γ(t) + i(t),

where A is a K × K matrix known as the transfer matrix and i(t) is a K × 1 vector-valued function of time representing inputs to the system.

• A contains the rate constants (elements of θ) and is determined by the model specification (the assumed form of the model: how many compartments and how they exchange material).

• Input into the system at time t, i(t), is often assumed to be of the form

i(t) = i if t ≥ 0 (constant in time), and 0 if t < 0.

Such an input function specifies continuous infusion of material if t is continuous, and step input if time is indexed discretely (t = 0, 1, 2, 3, . . .).

• Another common input specification is a bolus or instantaneous injection of material. In that case i(t) can be replaced by a vector of initial conditions

γ(0) = γ0.


Tetracycline Example:

In the tetracycline example, we can write the model-defining differential equations as

( γ̇1(t) )   ( −θ1    0 ) ( γ1(t) )
( γ̇2(t) ) = (  θ1  −θ2 ) ( γ2(t) ),   t > 0,

i.e., γ̇(t) = A γ(t), and

γ(t) = γ0 = (θ3, 0)^T,   t = 0.

Another Example — Brunhilda Data

Measurements were taken on the radioactivity of blood samples taken from a baboon named Brunhilda at a number of specified times after a bolus injection of radioactive sulfate. The data are given below:

Time (min)   Count     Time (min)   Count     Time (min)   Count
     2      151117         25       70593         90       53915
     4      113601         30       67041        110       50938
     6       97652         40       64313        130       48717
     8       90935         50       61554        150       45996
    10       84820         60       59940        160       44968
    15       76891         70       57698        170       43602
    20       73342         80       56440        180       42668


We consider a three-compartment open model of the following form for these data:

The measurements are treated as coming from compartment 1, and the bolus injection was taken as going into compartment 1.

The linear system defining the model is

γ̇1 = −(θ1 + θ2) γ1(t) + θ3 γ2(t)
γ̇2 = θ2 γ1(t) − (θ3 + θ4) γ2(t) + θ5 γ3(t)
γ̇3 = θ4 γ2(t) − θ5 γ3(t)

subject to γ(0) = γ0 = (θ6, 0, 0)^T. Here,

    ( −(θ1 + θ2)       θ3         0  )
A = (      θ2     −(θ3 + θ4)     θ5  )
    (       0          θ4       −θ5  )


Estimation in Compartment Models

There are several approaches to estimating parameters in compartment models:

1. Most obvious is to obtain the analytic solution to the system of differential equations and use that as the expectation function in an ordinary NLS fitting routine.

– A drawback is that it is often difficult, and sometimes impossible, to derive closed form expressions for the expectation function and its derivatives with respect to the parameters.

2. Again we can use an ordinary NLS fitting routine, but now with an expectation function which is calculated numerically by solving the differential equations with quadrature (numerical integration).

3. A method of historical interest is as follows: A K-compartment model generally has a solution to its differential equations that takes the form of a sum of exponentials where the coefficients and exponents are functions of the rate parameters θ1, θ2, . . . , θK. Therefore, fit a generic sum-of-exponentials model of the form

γ_j(t) = Σ_{k=1}^{K} β_k e^{λ_k t} + e,   (∗)

then use the relationship between the exponential parameters (β, λ) and the system parameters θ to solve for θ from (β̂, λ̂).

– Since we need the analytic solution of the differential equations to know the relationship between θ and (β, λ), it would seem that we would always prefer method (1) to this method. This is true if (∗) were being fit by NLS, because it would be just as easy to use NLS with (∗) parameterized in terms of θ. However, historically (∗) was fit using the method of exponential peeling (a method adequate for starting values, but pretty crude as a method of estimation), so (∗) could be fit easily in the (β, λ)-parameterization but not in the θ-parameterization.


– In addition to the crudeness of estimation of (∗) with peeling, another problem is that there may be fewer system parameters in θ than exponential parameters in (β, λ), so that it may not be possible to solve for θ.

– This approach is obsolete.

4. A fourth method, known as the matrix exponential approach, is very useful because it does not require an analytic solution to the system of differential equations and it computes the expectation function and its derivatives in a unified, efficient manner.

The Matrix Exponential Method:

The general solution to the linear differential equation system γ̇(t) = A γ(t) + i(t) is given by

γ(t) = e^{At} γ0 + e^{At} ∗ i(t),

where the matrix exponential e^{At} represents the convergent power series

e^{At} = I + At/1! + (At)^2/2! + · · ·

and ∗ denotes the convolution,

e^{At} ∗ i(t) = ∫_0^t e^{A(t−u)} i(u) du,

where the integration is performed componentwise.

• Therefore, if we can evaluate the matrix exponential e^{At} and the convolution integral e^{At} ∗ i(t), we can evaluate the expectation function of the model.

• Note that it is rarely useful to sum the power series representation of e^{At} to compute it. In addition, the matrix convolution e^{At} ∗ i(t) can be reduced to easier-to-compute scalar convolutions. Both computations are simplified by using the spectral decomposition of A.


Suppose it is possible to decompose A as

A = U Λ U^(−1),

where Λ = diag(λ1, . . . , λK) is a diagonal matrix containing the eigenvalues of A, and U contains as its columns the eigenvectors of A. Then we can write

e^{At} = I + At/1! + (At)^2/2! + · · ·
       = I + U Λ U^(−1) t/1! + (U Λ U^(−1) t)^2/2! + · · ·
       = U ( I + Λt/1! + (Λt)^2/2! + · · · ) U^(−1) ≡ U M U^(−1),

where M has (i, j)th element

M_ij = 0 if i ≠ j, and M_ii = 1 + λ_i t/1! + (λ_i t)^2/2! + · · · = e^{λ_i t}.

Therefore,

e^{At} = U e^{Λt} U^(−1),   (†)

where e^{Λt} = diag(e^{λ1 t}, . . . , e^{λK t}).

• By using the spectral decomposition, we've reduced the matrix exponential to a simple matrix multiplication given by (†).
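A small R sketch of (†), assuming A has real eigenvalues and a complete set of eigenvectors; the numerical values used in the example lines are the rate-constant estimates reported for the tetracycline fit later in these notes:

    ## Evaluate exp(A t) via the spectral decomposition (dagger)
    expAt <- function(A, t) {
      ev <- eigen(A)                       # eigenvalues and eigenvectors of A
      U  <- ev$vectors
      U %*% diag(exp(ev$values * t), nrow(A)) %*% solve(U)
    }

    ## Example: two-compartment transfer matrix with theta1 = .183, theta2 = .434
    A      <- matrix(c(-0.183, 0.183, 0, -0.434), 2, 2)
    gamma0 <- c(5.994, 0)                  # bolus initial conditions
    expAt(A, 1) %*% gamma0                 # gamma(t) at t = 1 for a bolus input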


For a bolus input, this is all we really need, because the convolution integral e^{At} ∗ i(t) drops out of the solution. That is, for a bolus input, the system of differential equations defining the model can be written

γ̇(t) = A γ(t) for t > 0, with initial condition γ(0) = γ0,

and this system of equations has solution

γ(t) = e^{At} γ0 = U e^{Λt} U^(−1) γ0.

For other types of input we need to evaluate the convolution integral e^{At} ∗ i(t). Again, using the spectral decomposition of A we have

e^{At} ∗ i(t) = ∫_0^t e^{A(t−u)} i(u) du
             = ∫_0^t U e^{Λ(t−u)} U^(−1) i(u) du
             = U ∫_0^t e^{Λ(t−u)} κ(u) du
             = U [ e^{Λt} ∗ κ(t) ],

where κ(t) = U^(−1) i(t).

Thus if we define ζ(t) = U^(−1) γ(t) we have

ζ(t) = U^(−1) ( e^{At} γ0 + e^{At} ∗ i(t) ) = e^{Λt} ζ0 + e^{Λt} ∗ κ(t),

where ζ0 = U^(−1) γ0.


In the case of continuous infusion/step input (recall this is when i(t) = i, t ≥ 0), we have κ(t) = κ = U^(−1) i, and the convolution integral becomes

e^{Λt} ∗ κ,

which evaluates to a vector with ith element

{e^{Λt} ∗ κ}_i = κ_i (e^{λ_i t} − 1)/λ_i  if λ_i ≠ 0,  and  κ_i t  if λ_i = 0.

• This gives us all we need to calculate the solution to the system of differential equations defining the model (that is, to calculate the expectation function of our model) in an efficient manner.

• Note that this approach assumes that a spectral decomposition exists for A and that the eigenvalues of A are all real. These conditions do not always hold, and Bates and Watts give generalizations of this procedure to cover those situations in Appendix A5 of their text.

How about computing derivatives of the expectation function with respect to the parameters θ1, . . . , θp for the G-N method?

These can be calculated using the same tools. But before we describe how that works, we need to talk about dead time, because it can add a parameter to our model that we want to account for.


Dead time:

Often there is a delay or lag between time 0, when the system is initialized (e.g., the bolus dose is given), and when the system reacts (e.g., before the drug concentration changes from its initial value). This delay is called dead time, and it can be built into our model as an extra parameter to be estimated.

We denote the time at which the system reacts as t0. In the presence of dead time, t0 > 0 and the model-defining system of differential equations becomes

γ̇(τ) = A γ(τ) + i(τ), if τ > 0,
γ(τ) = γ0, if τ = 0,   (∗)

where

τ = t − t0 if t ≥ t0, and τ = 0 otherwise.

• Here t0 can be assumed known or treated as an unknown parameter to be estimated.

With dead time, the general solution to (∗) becomes

γ(τ) = γ0 if τ = 0, and γ(τ) = e^{Aτ} γ0 + e^{Aτ} ∗ i(τ) if τ > 0.   (∗∗)

• Evaluation of the expectation function of the model goes through essentially as before, but with t replaced by τ.


To illustrate how dead time can be useful, we refit the two-compartment model of p.181 to the tetracycline data with an unknown dead time parameter t0 added to the model. The resulting fitted curve fits the data considerably better than the model without dead time:

[Plot: Tetracycline Concentration (µg/ml) vs. Time (hrs), comparing the 2-compartment model without dead time and the 2-compartment model with dead time]


Back to computing derivatives of the expectation function:

Notation: we will denote a derivative with respect to θj by a (j)-subscript. E.g., the derivatives we are after are

γ_(j)(τ) ≡ ∂γ(τ)/∂θj,   j = 1, . . . , p.

In the general solution given by (∗∗), γ(τ) depends on γ0, A, τ, and i. For any given parameter θj for which we seek the derivative γ_(j), one, some, or all of these quantities may depend upon θj. Therefore, we consider the formula for γ_(j) by cases.

Case 1: τ depends on θj; γ0, A, and i do not.

In this case we use the chain rule. By the chain rule,

γ_(j)(τ) ≡ ∂γ(τ)/∂θj = (∂γ(τ)/∂τ)(∂τ/∂θj) = γ̇(τ) τ_(j) = τ_(j) [A γ(τ) + i(τ)].

Case 2: Suppose A, γ0, and/or i depend on θj, but τ does not. Then we can differentiate (∗) to obtain

γ̇_(j)(τ) = A γ_(j)(τ) + A_(j) γ(τ) + i_(j).   (‡)

Since we're trying to compute γ_(j)(τ), not γ̇_(j)(τ), (‡) does not directly give us what we seek. However, (‡) does give us a differential equation whose solution is what we seek: γ_(j)(τ).


The differential equation (‡) is of the same form as before (see (∗)), but with input function A_(j) γ(τ) + i_(j). Therefore, (∗∗) gives us the form of the solution; we just need to change the input function in (∗∗) and change γ to γ_(j). This solution is

γ_(j)(τ) = e^{Aτ} γ_(j)(0) + e^{Aτ} ∗ [A_(j) γ(τ) + i_(j)].

Cases 1 & 2 combined: putting these two cases together, we can obtain an expression for γ_(j) no matter which of γ0, A, τ, and i depend on θj:

γ_(j)(τ) = e^{Aτ} γ_(j)(0) + e^{Aτ} ∗ [A_(j) γ(τ) + i_(j)] + τ_(j) [A γ(τ) + i(τ)].

• This expression holds for any input function of a bolus (impulse) or continuous infusion/step type.

• Bates and Watts provide details on efficiently computing the convolutions in this expression in Appendix A5 and pseudo-code for doing so in Appendix A3.

• I have programmed this pseudo-code in the R functions formcompmodel() and compmodel(). These functions can be found in the file compmodel.R, which can be obtained from the course web site.


Example — Tetracycline

Recall that the transfer matrix in the two-compartment model that we considered previously for these data is

A = ( −θ1    0
       θ1  −θ2 ).

This is a simple model with a bolus input function, so obtaining the analytic solution for γ(t) = (γ1(t), γ2(t))^T is particularly easy.

Recall (top of p.189) that the solution is given by

γ(t) = U e^{Λt} U^(−1) γ0.   (∗)

The eigenvalues and eigenvectors of A aren't difficult to calculate here, especially with a symbolic math program like Maple or Mathematica (see tetra1.mws, a Maple worksheet). These calculations lead to

Λ = ( −θ2    0
        0  −θ1 ),

U = ( 0   (θ2 − θ1)/θ1
      1         1      ),

and

U^(−1) = (  θ1/(θ1 − θ2)   1
           −θ1/(θ1 − θ2)   0 ).

Plugging into (*), we have

γ(t) =

(0 θ2−θ1

θ11 1

)(e−θ2t 00 e−θ1t

)( θ1θ1−θ2

1

− θ1θ1−θ2

0

)(θ30

)=

(θ3e

−θ1t

θ3θ1(e−θ1t−e−θ2t)θ2−θ1

)


• See handout tetra1. Here we fit this two-compartment model in two ways: using the analytic solution γ2(t) on the previous page as the expectation function of the model, and using the matrix exponential approach, in which we need not obtain the analytic solution. We also fit the corresponding model with dead time using the latter method.

• In both cases we parameterize the rate constants using an exponential transformation to ensure that the rate constants are positive. That is, we use ϕ's to represent the transfer rates rather than θ's. With this notation change the transfer matrix is

A = ( −ϕ1    0
       ϕ1  −ϕ2 ),

and we use an unconstrained θ-parameterization where ϕi = e^{θi}, i = 1, . . . , 3. Our model becomes

y_i = e^{θ3 + θ1} (exp{−e^{θ1} t_i} − exp{−e^{θ2} t_i}) / (e^{θ2} − e^{θ1}) + e_i,   i = 1, . . . , n.

• In the compmodel() function that implements the matrix exponential approach to computing the expectation function and its derivatives for linear compartment models, the rate constants and initial concentrations are all parameterized as exponential parameters like this. Dead time parameters are not transformed.

• In tetra1.R we code the analytic solution γ2(t) as a function tetramod(). By using the deriv() function, tetramod() will return not only the function value, but also the values of its analytic derivatives with respect to the parameters θ1, θ2, θ3.

• We will use tetramod() as the expectation function in an nls() fit. But first, we demonstrate that the matrix exponential function compmodel() returns the same value and gradient as tetramod().


• The function formcompmodel() sets up the model from an input matrix J. formcompmodel() is always called as the first step in compmodel(), but can also be called on its own. The model is "set up" from a matrix J that describes the compartment model. Specifically, J consists of several rows, one for each parameter in the model and/or arrow in the compartment diagram. In addition, J has three columns:

1. the parameter number;
2. the source compartment (0 if the parameter is a dead time parameter);
3. the destination compartment (0 if the parameter is a dead time parameter or if the destination is excretion; −1 if the parameter is an initial value).

• For example, our 2-compartment tetracycline model is specified as

        ( 1  1   2 )
    J = ( 2  2   0 )
        ( 3  1  −1 )

Adding a dead time parameter θ4 = t0, the model would be specified as

        ( 1  1   2 )
    J = ( 2  2   0 )
        ( 3  1  −1 )
        ( 4  0   0 )

• From input J, θ, and γ0fix, the fixed (known) portion of the initial conditions vector, formcompmodel(J, θ, γ0fix) returns A, γ0, t0, ∂A/∂θ, ∂γ0/∂θ^T, and ∂t0/∂θ.

• compmodel() takes arguments J; θ = (θ1, . . . , θp)^T; γ0fix, a K × 1 vector of initial values for the K compartments (0 should be used for any compartment whose initial value is a parameter to be estimated); t, a time vector; and k, the compartment from which data are assumed to come and for which a solution is sought.


• compmodel() first calls formcompmodel() to set up the model and then computes and returns the model solution for compartment k. A gradient attribute is also returned containing ∂γk(τ)/∂θ^T for each value of the time indexing variable τ.

• To demonstrate that compmodel() returns the correct model solution and gradient, we evaluate the model with tetramod() (assigned to analyticeval) and compmodel() (assigned to matrixexptialeval) and print out a portion of both results to verify that they are the same.

• Notice that both the values and gradients agree, whether we use the analytic solution given by tetramod() or the matrix exponential evaluation given by compmodel().

• We fit the model with tetramod() first, as m1tetra.nls. Then we refit using compmodel(), as m2tetra.nls. Starting values can be obtained by exponential peeling. Here, we omit that step and just take θ^(0) = (log(.2), log(.4), log(6))^T. In each case, the nls() function converges to θ̂ = (−1.698, −.834, 1.791)^T, or in the ϕ-parameterization, ϕ̂ = exp(θ̂) = (.183, .434, 5.994)^T. (A sketch of the tetramod()-based fit is given below.)

• We refit the model allowing for dead time by changing J and using compmodel() again. This model is m3tetra.nls. An extra sum of squares test (LRT) of model m2tetra.nls vs. m3tetra.nls tests the necessity of dead time (tests H0: θ4 = 0). We reject this hypothesis and prefer the model with dead time. Residuals from this model look reasonable.


Practical Considerations:

Parameter Transformations:

• Rate constants, initial concentrations, and infusion rates must be positive.

A convenient way of ensuring this positivity is with a parameterization in which the constrained rate constant (say) ϕ is parameterized as the exponential of an unconstrained parameter θ:

ϕ = e^θ  ⇒  θ = log(ϕ).

• This parameter transformation not only imposes the desired constraint, but θ, the log rate constant, has a convenient relationship with t_{1/2}, the half-life associated with the exchange of material that has rate ϕ.

A single-compartment elimination model looks like this:


Such a model satisfies the differential equation γ̇(t) = −ϕ γ(t), which has solution γ(t) = e^{−ϕt} γ0. The half-life t_{1/2} is the time at which half of the initial concentration γ0 has been eliminated from the compartment. That is, t_{1/2} satisfies

γ0 e^{−ϕ t_{1/2}} / γ0 = 1/2,   or   e^{ϕ t_{1/2}} = 2,   or   t_{1/2} = log(2)/ϕ.

Thus,

log(t_{1/2}) = log[log(2)] − log(ϕ) = −.367 − θ,

so that the width of a linear approximation interval for log(t_{1/2}) is the same as the width of the interval for θ = log(ϕ).

Another derived quantity of interest is the volume of distribution in a compartment. With a bolus injection, the dose D is known, but the concentration γ0 is estimated because the volume V of the compartment in which that dose is distributed is unknown. The relationship between initial concentration γ0, dose D, and initial volume of distribution V is given by

γ0 = D/V  ⇒  log(γ0) = log(D) − log(V),

so again, a linear approximation CI on log(V) will have the same width as that of log(γ0).

207


Identifiability.

We have already seen that linear combinations of exponentials can have identifiability problems (recall the biexponential model).

For example, the following three-compartment model with initial conditions γ0 = (1, 0, 0)ᵀ and data collected in compartment 3 yields the same γ3(t) curve for the parameter pair θ = (a, b) as for θ = (b, a).

• Here (a, b) and (b, a) are exchangeable. In such a case the model is only locally identifiable because discrete sets of parameters give the same predicted response.

A more serious situation is global unidentifiability, where continuous sets of parameters give the same predictions.

• Identifiability in compartment models is a big topic with lots of research. We don't attempt to give general results on when a compartment model is identifiable.

• However, a simple way to check identifiability is as follows: fix a set of design times and generate the parameter derivative matrix V(θ) at a number of different choices of θ. If the matrix {V(θ)}ᵀV(θ) is computationally singular for all choices of θ, then the model can be assumed to be unidentifiable. Note that one should check several choices of θ, because one can get unlucky with a particular choice without the model being unidentifiable. A rough numerical version of this check is sketched below.
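• A minimal sketch of such a check; the finite-difference derivative matrix, the biexponential example function f(), and the design times are illustrative choices, not part of any handout:

  ## Build V(theta) by finite differences and inspect the condition
  ## number of V'V; huge values suggest computational singularity.
  check_ident <- function(f, theta, times, eps = 1e-6) {
    V <- sapply(seq_along(theta), function(j) {
      th <- theta; th[j] <- th[j] + eps
      (f(th, times) - f(theta, times)) / eps
    })
    kappa(crossprod(V))
  }
  ## Example: biexponential with exchangeable rate constants
  f <- function(theta, t) exp(-theta[1] * t) + exp(-theta[2] * t)
  check_ident(f, theta = c(0.3, 0.3), times = 1:10)  # enormous: singular when the rates coincide
  check_ident(f, theta = c(0.3, 1.5), times = 1:10)  # moderate: well conditioned at this theta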

• Nonidentifiability is especially problematic in univariate compartmental systems in which only one compartment is measured. Multiresponse experiments allow one to fit much richer and more complex compartment models.

208


Starting Values.

• The method of exponential peeling is widely used to obtain starting values for compartment models.

• A second approach is to build up an appropriate model from a very simple one. At each stage only a small number of parameters (often just one) are added, so that starting values can be easily obtained.

Example — Lipoproteins

• See handout lipo1. Here we fit the models in §5.4 of our text to the lipoprotein data of Appendix 1, §A1.16. The response variable is the percentage concentration of a tracer in the serum of a baboon given a bolus injection at time 0. We assume that the initial concentration is 100% in compartment one and 0 in all other compartments.

• Before fitting any models, we plot the data on both the original and log scale. These plots appear below. From the log-scale plot, it is apparent that at least two compartments will be necessary since this plot is not linear. However, we start by fitting a one-compartment model and build up from there.

[Figures: Lipoprotein data plotted as Concentration (%) vs. Time (days) and as Log(Concentration) vs. Time (days).]

209


• We first fit a one-compartment elimination model:

• The data near time t = 0 may approximately satisfy such a model. According to this model, γ1(t) = γ0 e^{−ϕt} = 100 e^{−ϕt}, or

y = 100 e^{−ϕt}  ⇒  log(y) = log(100) − ϕt  ⇒  ϕ = [log(100) − log(y)]/t = −log(y/100)/t.

At time t = .5 we have y = 46. Plugging in these data we have

ϕ = −log(.46)/.5 = 1.55,

so we take θ0 = log(1.55) as our initial value for θ.

• In lipo1.R we fit this simple model using the analytic solution 100 exp(−e^θ t) as the expectation function. This model is assigned to m1Lipo.nls and yields ϕ = exp(θ) = 1.31 and a residual standard error of 3.48 on 11 residual df.
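• A hedged sketch of what such a call could look like (the data frame lipo and the variable names conc and time are assumptions; the actual names in lipo1.R may differ):

  m1Lipo.nls <- nls(conc ~ 100 * exp(-exp(theta) * time),
                    data  = lipo,
                    start = c(theta = log(1.55)))
  summary(m1Lipo.nls)
  exp(coef(m1Lipo.nls))   # phi, the elimination rate constant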

• The residuals from this model (not shown in lipo1) indicate substantial lack of fit. This was expected, and we proceed to a two-compartment model of the following form:

• Initially, we assume ϕ2 = ϕ3 so that we introduce only one additional parameter. The elimination from compartment 1 is now ϕ1 + ϕ2, so we set 1.31 = ϕ1 + ϕ2 and try (somewhat arbitrarily) the initial value ϕ0 = (1.00, 0.31)ᵀ, or θ0 = (log(1.00), log(.31))ᵀ.

210


• This model is fit as m2aLipo.nls. Note that without the constraint ϕ2 = ϕ3 the J matrix is

J =
1  1  0
2  1  2
3  2  1

To impose the constraint, we just include two rows for ϕ2:

J =
1  1  0
2  1  2
2  2  1

Model m2aLipo.nls yields ϕ = (.992, .663)ᵀ.

• The model without the constraint ϕ2 = ϕ3 is fit as m2bLipo.nls. For starting values we use ϕ0 = (.99, .67, .65)ᵀ. This leads to ϕ = (1.028, .662, .820)ᵀ and a residual standard error of .374 on 9 residual df.

• While this two-compartment model fits substantially better than the one-compartment model, the residuals of m2bLipo.nls still clearly indicate that the model is inadequate. To better fit the data we have several choices for extending this model. We choose to try to add a third compartment to the model. This can be done in several ways, but two simple choices are a three-compartment catenary system and a three-compartment mamillary system.

• In the catenary system, compartments are chained together as follows:

211


• In the mamillary system, there is a central mother compartment with which each peripheral compartment exchanges material:

• We will fit both the catenary and mamillary versions of the three-compartment model. In each case, we could fit fewer than 5 parameters by constraining some rate constants to be equal. Instead, we will fit the full 5-parameter version of these models and then consider reducing the model by equating parameters.

• The catenary and mamillary models are described by the matrices

J =
1  1  0
2  1  2
3  2  1
4  2  3
5  3  2

and

J =
1  1  0
2  1  2
3  2  1
4  1  3
5  3  1

respectively. These models are fit as m3acatLipo.nls and m3amamLipo.nls, respectively. For starting values we use ϕ0 = (1.0, .66, .82, .5, .2)ᵀ for both models. The first three starting values are taken from the fit of the previous model. Arbitrarily, ϕ04 and ϕ05 are chosen to be smaller than, and distinct from, ϕ02 and ϕ03.

• Models m3acatLipo.nls and m3amamLipo.nls converge to ϕ = (.990, .763, 1.01, .240, .352)ᵀ and ϕ = (.990, .531, 1.34, .231, .266)ᵀ, respectively. Both models have the same residual standard error of .0787 on 7 df. In fact, these two models are equivalent (and not identifiable from one another) when only compartment 1 is measured.

212


• Finally, we consider reducing these models by constraining some parameters to be equal to one another. In both models, the parameter estimates for θ4 and θ5 are closest to each other relative to their standard errors (as compared with any other pair of θj's).

• Therefore, we refit the two three-compartment models with θ4 = θ5 in each case. For the constrained models, the J matrices become

J =
1  1  0
2  1  2
3  2  1
4  2  3
4  3  2

and

J =
1  1  0
2  1  2
3  2  1
4  1  3
4  3  1

for m3bcatLipo.nls and m3bmamLipo.nls, respectively. The 4-parameter versions of these models are no longer equivalent, and they yield distinct residual standard errors of .0880 and .0792, respectively.

• Based on extra sum of squares analyses (LRTs), the 4-parameter versions of these models fit as well as the 5-parameter versions, and are therefore preferred based on parsimony. In addition, the residuals from these models look good.

• Based on the residual standard error, the 4-parameter mamillary model fits slightly better than the 4-parameter catenary model. We can also obtain the AIC and BIC values for these models; these criteria also point to model m3bmamLipo.nls. However, such criteria should be secondary to biological/pharmacologic considerations when choosing a model.

213


Growth Models

There are two traditions in the development of models to describe growth.

1. “Statistical” Approach. This is a purely empirical approach in which polynomial models in time (linear models) are fit to the data using multivariate methods.

– Parameters have no biological interpretation.

– Models are not necessarily parsimonious.

– Extrapolation (e.g., prediction of future growth) is always dangerous, but especially so for these models.

– The models are linear, so the methodology and theory are easier.

– These models are often discussed in multivariate texts. See, for example, Timm (2002, Applied Multivariate Analysis) for a good treatment.

2. “Biological” Approach. Models have a mechanistic motivation, although in practice they are often used in a purely empirical way. Models are usually nonlinear, with relatively few, biologically interpretable parameters.

• We concentrate on models in the latter tradition.

Exponential and Monomolecular Models:

The simplest organisms begin to grow by the binary splitting of cells. If we let t denote time and f(t) denote size at time t, then this leads to exponential growth, in which the growth rate is proportional to the current size f(t):

∂f(t)/∂t = κf(t),  or  f(t) = e^{κ(t−γ)}.   (∗)

214


The time-power model

f(t) = αt^β

does not increase as fast as the exponential, but is sometimes useful.

• Both of these models imply unlimited growth, which makes them unsuitable for many applications (except perhaps as models of early growth).

We can change (*) to imply growth bounded by an upper limit by assuming that the growth rate is proportional to the size remaining:

∂f(t)/∂t = κ[α − f(t)],  where κ > 0.

The solution to this differential equation can be parameterized in a variety of ways, including

f(t) = α − (α − β)e^{−κt},  α > β > 0.

Here α is the final size (asymptote), β the initial size, and κ dictates the growth rate. Alternative parameterizations are

f(t) = α − βe^{−κt}

and

f(t) = θ1(1 − e^{−θ2(t−θ3)})   (monomolecular growth model)

and

f(t) = ϕ1 + ϕ2·ϕ3^t   (asymptotic regression model).
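The monomolecular parameterization is available in R as the self-starting function SSasympOff() (which reappears later in these notes). A minimal sketch on simulated data (the data and parameter values below are invented purely for illustration):

  ## SSasympOff(t, Asym, lrc, c0) = Asym * (1 - exp(-exp(lrc) * (t - c0)))
  set.seed(1)
  t <- seq(0.5, 10, by = 0.5)
  y <- 20 * (1 - exp(-0.6 * (t - 0.3))) + rnorm(length(t), sd = 0.4)
  fit <- nls(y ~ SSasympOff(t, Asym, lrc, c0))
  coef(fit)   # Asym near 20, exp(lrc) near 0.6, c0 near 0.3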

Sigmoidal Models:

From the fact that the above model can be reparameterized as the asymptotic regression model, it is clear that this model has an asymptotic form with growth rate decreasing through time.

Since growth rate may increase early in development, we may prefer a curve with sigmoidal form.

215


One way that a sigmoidal curve may be achieved is by assuming that the current growth rate is the product of functions of the current size f and the remaining growth on a transformed scale:

∂f/∂t ∝ g(f)[h(α) − h(f)],   (†)

where g(·) and h(·) are increasing functions with g(0) = h(0) = 0.

Various choices of g and h lead to the logistic, Gompertz, and von Bertalanffy models.

1. Logistic (Autocatalytic) Model: If we take g(f) = f and h(f) = f, then (†) becomes

∂f/∂t = (κ/α) f [α − f],

where κ > 0 and 0 < f < α. Here α is the upper limit of growth, and we've chosen to parameterize it so that κ/α is the proportionality constant.

This differential equation has solution

f(t) = α / {1 + e^{−κ(t−γ)}},  −∞ < t < ∞.

This is the 3-parameter (simple) logistic model. It has asymptotes f = 0 as t → −∞ and f = α as t → ∞.

Again, a variety of parameterizations are possible, including

f(t) = α / (1 + βe^{−κt})

and

f(t) = α / {1 + e^{(γ−t)/β}}   (the SSlogis function in nlme).
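For reference, SSlogis() uses the argument names Asym, xmid, and scal for α, γ, and β in the last form above; a quick check of the parameterization (with arbitrary values):

  t <- seq(0, 10, by = 2)
  cbind(SSlogis = SSlogis(t, Asym = 10, xmid = 5, scal = 2),
        formula = 10 / (1 + exp((5 - t) / 2)))   # identical columns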

216


2. Gompertz Model: Here we take g(f) = f and h = log. This leads to a model in which growth is not symmetric about the point of inflection:

f(t) = α exp{−e^{−κ(t−γ)}}.

Here, the point of inflection is at time t = γ, when the size is f(γ) = α/e. Again, α is the asymptote as t → ∞.

3. Von Bertalanffy Model: von Bertalanffy hypothesized that the growth rate of an animal with weight f is the difference between the metabolic forces of anabolism and catabolism.

Roughly, anabolism is the process of assimilating new material (e.g., eating, breathing) and catabolism is the loss of building material (e.g., excretion, loss of dead cells, etc.).

By a mix of empiricism and theory, he assumed anabolism was proportional to the 2/3 power of weight and catabolism was proportional to weight. Therefore, the growth rate is given by

∂f/∂t = ηf^{2/3} − ζf.

Four-Parameter Sigmoidal Models:

4. Richards Model: Richards doubted the theory underlying von Bertalanffy's model, but noted that if we replace the power 2/3 by an unknown parameter δ, then the differential equation leads to a flexible family of curves with an arbitrarily placed point of inflection.

The Richards model can be parameterized in a variety of ways, including

f(t) = α[1 + (δ − 1)e^{−κ(t−γ)}]^{1/(1−δ)},  δ ≠ 1.

• Richards' model generalizes the preceding growth models: it includes the monomolecular model (δ = 0), the von Bertalanffy (δ = 2/3), the logistic (δ = 2), and the Gompertz (as the limit δ → 1).

217


• The Richards model and all of its special cases are of the monomolecular form for some transformation of size f.

• The Richards model is not to be confused with the so-called Chapman-Richards model commonly used in forestry,

f(t) = α{1 − exp(−κt)}^δ,

which is not a Richards model at all.

5. Weibull Model: The Richards model can be obtained by assuming that a transformation of size, namely f^{1−δ}, is monomolecular. The Weibull family is obtained by assuming that size f is monomolecular in some power transformation t^δ of time.

In what follows we provide another derivation in terms of the distribution function of the Weibull probability distribution.

Since the cumulative distribution function (c.d.f.) of any continuous random variable with a unimodal distribution is sigmoidal, such c.d.f.s are a natural place to start in trying to build a sigmoidal growth curve.

Let F(x; ν) = Pr(X ≤ x) denote the c.d.f. of a continuous, unimodal random variable X. Here the distribution is assumed to be described by a possibly vector-valued parameter ν.

A couple of different ways of using F to form a sigmoidal growth curve f(t) have been used. One is to set

f(t) = αF(κ(t − γ); ν)   (1)

and the other is to set

f(t) = β + (α − β)F(κt; ν).   (2)

218


In (1) the time variable is shifted by γ and rescaled by κ. This shifts and expands/contracts the curve in the horizontal direction. In addition, F is rescaled by α so the asymptote is at α rather than 1.

In (2) the time scale is expanded or contracted by κ, and then the curve is shifted vertically to have asymptote β when t → −∞ and asymptote α when t → ∞ (assuming κ > 0).

The one-parameter Weibull distribution has c.d.f.

F(x; δ) = 1 − exp(−x^δ),  x > 0.

Using method (1) we obtain the model

f(t) = α(1 − exp{−[κ(t − γ)]^δ}),

and using (2) we obtain

f(t) = α − (α − β) exp{−(κt)^δ}.   (‡)

• These two families of curves are different, and each could be used as a legitimate 4-parameter sigmoidal growth curve. However, (‡) is more commonly used and is the one that is typically meant when people say the Weibull growth curve model.

6. Morgan-Mercer-Flodin Model: The M-M-F model is given by

f(t) = α − (α − β) / [1 + (κt)^δ].

This model can be obtained, as was the Weibull model (‡), by using (2) with c.d.f.

F(x; δ) = x^δ / (1 + x^δ),  0 ≤ x < ∞.

In the parameterization given above, α is the horizontal asymptote as t → ∞, β = f(0) is the size at time 0, and δ and κ are shape and scale parameters, respectively.

219


7. Four-Parameter Logistic Model: A simple generalization of the simple logistic model is to allow both asymptotes to be parameters of the model (for increasing growth, the simple logistic model assumes the lower asymptote is 0).

This can be done by adding a fourth parameter to the logistic model. The parameterization used in SSfpl in the nlme software is

f(t) = θ2 + (θ1 − θ2) / {1 + exp[(θ3 − t)/θ4]}.

Here, θ1 is the horizontal asymptote as t → ∞ and θ2 is the horizontal asymptote as t → −∞.

• In general, four-parameter sigmoidal models provide more flexibility to fit the curve closely to the data. However, the flip side of that statement is that they need more information from the data to do so. That is, they need more data, and data that cover the entire nonlinear range of the curve, to fit all four parameters with any precision.

• Of the four-parameter models, some studies have suggested that the Richards model has the most parameter-effects nonlinearity in typical applications and is often more difficult to fit than the others.

• The book Handbook of Nonlinear Regression Models by Ratkowsky (1990) provides more examples of growth curve models, and many different parameterizations of the models I have presented here. He also provides some guidance on which models and parameterizations tend to have less parameter-effects nonlinearity, and he discusses methods of obtaining starting values for most of the models he considers.

220


The usual approach to fitting growth curve models is to fit a model of the form

yi = f(ti; θ) + ei,  i = 1, . . . , n.   (∗)

For cross-sectional data (data from independent units at each separate time point), such a model with independent errors is reasonable.

Often, however, growth data aren't collected cross-sectionally. Instead they are collected longitudinally, where individual units are re-measured at several points through time.

For longitudinal measurements of growth, the assumption of independent errors is typically inappropriate. In such situations, we may instead assume var(e) = σ²Λ, try to find an appropriate correlation structure for Λ (e.g., an ARMA model), and then fit model (*) with ML or GLS.

This approach often works reasonably well, but sometimes not. Two difficulties are that

i. we often have relatively short series, so that fitting a high-order ARMA model is difficult; and

ii. errors from growth data are seldom stationary (some would say that growth data are inherently nonstationary).

221


For longitudinal data from multiple subjects (the typical situation), such a model can be fit to the data from each subject separately, and then these subject-specific models can be pooled or averaged somehow to estimate population-level parameters (see, e.g., Davidian & Giltinan, 1995, Ch. 5).

This approach has a long history and is an important technique. However, it suffers from some disadvantages:

• It can be difficult to fit models of the form (*) to the data from each and every subject, especially if the number of observations per subject is small, the model (e.g., the correlation structure) is complex, and/or the data are highly variable.

• Although it is possible to assume a common variance-covariance structure with a common var-cov parameter in separately fit models, to do so requires iterative fitting and is cumbersome.

• This approach tends to overfit the data.

• There is no good software to implement this approach easily.

Instead, we concentrate on an approach in which a single nonlinear model is fit to the data from all subjects at once, and which accounts for the within-subject correlation and between-subject heterogeneity in an elegant, unified way. This approach, based on the class of nonlinear mixed-effects models, will be discussed next, and for the remainder of the course.

222


Nonlinear Mixed Effects Models

A Motivating Example — Circumference of Orange Trees

Recall the data in the table below on the circumference of five orange trees over time. We analyzed these data using ordinary (fixed effects) nonlinear regression models in homework #4.

                        Tree No.
Time (days)     1     2     3     4     5
    118        30    33    30    32    30
    484        58    69    51    62    49
    664        87   111    75   112    81
   1004       115   156   108   167   125
   1231       120   172   115   179   142
   1372       142   203   139   209   174
   1582       145   203   140   214   177

A plot of the data, with observations from the same tree connected, appears below.

[Figure: Orange Tree Data with NLS Fit — trunk circumference (mm) vs. time since December 31, 1968 (days), with observed growth curves for Trees 1–5 and the common fitted growth curve overlaid.]

223


Also displayed in this plot is the fitted curve from a logistic function fit with NLS. That is, if we let yij = the circumference of the ith tree at age tij, i = 1, . . . , 5, j = 1, . . . , 7, then the fitted model is

yij = θ1 / {1 + exp[−(tij − θ2)/θ3]} + eij   (∗)

where {eij} iid∼ N(0, σ²).

• Clearly, model (*) is inadequate.

One obvious deficiency is that, while the fitted curve goes through the center of the combined data from all trees, the growth curves of individual trees, especially large and small trees, are poorly estimated.

• Because the growth curves of the different trees spread out as the trees get older, this misspecification will manifest itself as a cone-shaped residuals vs. fitted values plot, suggesting heteroscedasticity.

• In fact, though, it is only (or at least mainly) between-tree variability that is increasing over time. Within-tree error variance looks to be homoscedastic. Simply adding a heteroscedasticity specification to model (*) is not an appropriate solution.

Another deficiency of model (*) is that it treats the observations as independent. There are two obvious potential sources of correlation in these data:

1. Grouping. The data are grouped, or clustered, by tree. Whenever we have grouped data, there is reason to suspect that observations from the same group (tree) will tend to be more similar than observations from different groups. That is, there is often positive within-group or within-cluster correlation and between-cluster independence.

– Minimized by very homogeneous groups.

224


2. Serial Dependence. As we've noted previously, when data are collected through time, it is often the case that observations close together in time will tend to be correlated more highly than observations far apart in time.

– Often reduced by long lags between measurements and/or homogeneous environmental conditions through time.

The first of these sources almost certainly affects the orange tree data, and the second may as well.

To deal with the two model deficiencies described above, in homework #4 we fit models in which we allowed the asymptote parameter to differ across trees, and we tried to introduce an appropriate within-tree correlation structure to our model.

That is, we fit a model with 5 separate asymptote parameters, one for each tree:

yij = θ1i / {1 + exp[−(tij − θ2)/θ3]} + eij   (∗∗)

and we assumed

corr(ei, ei′) = 0 if i ≠ i′,  and  corr(ei, ei) = C(ρ).

Here C is an assumed form for the within-group correlation matrix, depending on an unknown parameter ρ.

While this approach is clearly an improvement over (*), it has some disadvantages:

A. Number of parameters grows with sample size. In (**) we've introduced a distinct fixed asymptote parameter for each tree. Therefore, if we had measured 500 trees, our model would have 502 regression parameters.

225


Having the number of parameters increase with the sample size introduces a number of problems:

• Theoretical: in ML and LS estimation, the asymptotic arguments establishing consistency and optimality break down.

• Computational: It is difficult to optimize a criterion of estimation with respect to many parameters, and hard to form the V(θ) matrix and solve equations involving the large-dimension {V(θ)}ᵀV(θ) matrix.

• Interpretation: We have 500 separate asymptotes and no single parameter describing the average limit of growth. Do we really care what the limit of growth was for tree #391?

• Conceptual: θ1i is the asymptote parameter for tree i. That is, it's the fixed theoretical population constant for the limit of growth for tree i. But what's the population? And why is the asymptote of tree i a fixed constant? Wasn't tree i randomly selected from a population of trees? If so, the asymptote of this randomly drawn tree should be regarded as a random variable, not a parameter.

B. Correlation structure. The correlation structure in model (**) accounts for within-group (e.g., within-tree) correlation by modelling source 2 (serial correlation) rather than source 1 (grouping correlation). It is often difficult and unnecessary to model both sources of correlation, but for short time series, modelling 2 is often harder than modelling 1.

That is, it is often not easy to fit an ARMA model to the within-group observations through time. This can be so because of:

• Short series.

• Non-stationary series.

• Unbalanced/missing data and/or irregular or continuous time indexing.

226


An alternative: a nonlinear mixed-effects model (NLMM) for the orange tree data.

Our fixed-effects nonlinear model (**) with 5 separate tree-specific asymptotes is

yij = θ1i / {1 + exp[−(tij − θ2)/θ3]} + eij   (∗∗)

Using an ANOVA-type parameterization for θ1i, we can write θ1i = θ̄1 + τi, where Σ_{i=1}^5 τi = 0. Here θ̄1 is the average or typical θ1-value (asymptote) and τi is the deviation from the typical value for the ith tree.

Under this parameterization, model (**) becomes

yij = (θ̄1 + τi) / {1 + exp[−(tij − θ2)/θ3]} + eij,   Σ_{i=1}^5 τi = 0.

In the fixed-effects (ordinary) nonlinear regression model, the θ's and the τ's are all considered to be fixed unknown parameters, a.k.a. fixed effects.

In the NLMM, we consider the τi's to be random variables, or random effects. τi is the deviation from θ̄1 of the asymptote of the ith tree; it is considered to be random because the tree itself is a randomly selected representative element of the population of trees to which we want to generalize.

Since τi is now a random variable, we'd prefer to represent it with a Latin letter rather than a Greek one, so we replace the τi's with bi's and the model becomes

yij = (θ1 + bi) / {1 + exp[−(tij − θ2)/θ3]} + eij,

b1, . . . , b5 iid∼ N(0, σ²_b),   {eij} iid∼ N(0, σ²).   (†)

Here we’ve also dropped the bar from θ1.

• Since model (†) contains both fixed effects (the θ's) and random effects (the bi's), it is called a mixed-effects model.

227


• To completely specify the model we must make distributional assumptions on whatever random variables (error terms, random effects) are in the model. We assume that the random effects are independent normal, with mean 0 (corresponding to the assumption that the τi's sum to zero) and variance σ²_b (distinct from the error variance σ²).

• In the simplest case, the errors are assumed i.i.d. spherical normal, as in the classical nonlinear model. However, this assumption can be relaxed to accommodate heteroscedasticity and/or correlation in the errors.

• We assume the bi’s are uncorrelated with the eij ’s.

• Now the asymptote for the ith tree is θ1 + bi, a random variable because bi is a random variable. The asymptote for the typical tree is θ1 (when bi = 0).

• If we write θ1i ≡ θ1 + bi, then we have that the 5 asymptotes are randomly distributed around θ1: θ11, . . . , θ15 iid∼ N(θ1, σ²_b).

Fitting Model (†):

The fact that the random effects {bi} enter into the NLMM (†) nonlinearly complicates the methodology and theory of NLMMs substantially as compared to ordinary NLMs.

To focus on the motivation, interpretation, and basic ideas of NLMMs, we temporarily skip this material and just assume that the nlme() function in R can fit an NLMM with a “good” method.

• See handout Orange1.

• In this R script, we refit the fixed-effects models (*) as m1Oran.gnls and (**) as m2Oran.gnls. We then fit the NLMM (†) as m1Oran.nlme using the nlme() function.

228


• nlme() is called in a manner similar to that used for gnls(), with fixed= replacing params=. In addition, nlme() takes an argument random=, which is used to specify which parameter(s) should have an associated random effect.
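• A hedged sketch of what the nlme() call for (†) could look like, using the built-in Orange data and the SSlogis parameterization (the actual code and starting values in the Orange1 handout may differ):

  library(nlme)
  ## SSlogis(age, Asym, xmid, scal) = Asym / (1 + exp(-(age - xmid)/scal)),
  ## so (Asym, xmid, scal) play the roles of (theta1, theta2, theta3).
  m1Oran.nlme <- nlme(circumference ~ SSlogis(age, Asym, xmid, scal),
                      data   = Orange,
                      fixed  = Asym + xmid + scal ~ 1,
                      random = Asym ~ 1 | Tree,
                      start  = c(Asym = 190, xmid = 720, scal = 350))
  summary(m1Oran.nlme)   # fixed effects near (191.0, 722.6, 344.2)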

• Notice that the NLMM (†) has estimated regression parameter θ = (191.0, 722.6, 344.2)ᵀ, similar to the estimated regression parameter in the fixed-effects model (*): θ = (192.7, 728.8, 353.5)ᵀ.

• Variability in the asymptotes from tree to tree is captured through bi, which is assumed normal with mean 0 and estimated variance σ²_b = (31.48)². The error variance is estimated to be σ² = (7.85)².

• The NLMM (†) has AIC = 273.2, BIC = 280.9 for 5 estimated parameters: θ1, θ2, θ3, σ²_b, σ². This compares with AIC = 324.8, BIC = 331.0 for the 4-parameter model (*) and AIC = 254.1, BIC = 266.5 for the 8-parameter model (**).

• So, the addition of a random effect in the asymptote to (*) only costs us 1 df and results in a vast improvement in fit. We can do even better by fitting separate asymptotes to each tree, but that shouldn't be surprising. In (†) we save on df in comparison to (**) by making a parametric assumption on the distribution of the random effects: that they're normal, with only an unknown variance to estimate. In contrast, model (**) doesn't make any assumption about the tree-to-tree variability in asymptotes; it separately estimates each asymptote.

• Of course, the residuals of model (*) looked terrible because the individual trees were poorly fit by the average curve. The residuals of model (**) and model (†) look about equally good.

229


• In model (†) we only fit one asymptote parameter, θ1 = 191.0. An individual tree's (the ith, say) asymptote is θ1 + bi. Here, bi is an unobserved (latent) random variable, so we don't usually speak of estimating it. Instead, we predict its value from the observed data with a prediction b̂i. Then our fitted model for tree i at time tij is

ŷij = (θ̂1 + b̂i) / {1 + exp[−(tij − θ̂2)/θ̂3]}

• The b̂i's aren't estimated parameters of the model. They're predicted quantities based on the fitted model, the data, and the assumption that b1, . . . , b5 iid∼ N(0, σ²_b).

• The b̂i's can be obtained from the fitted model using the ranef() function. ranef(m1Oran.nlme) yields b̂ = (−29.4, 31.6, −37.0, 40.0, −5.18)ᵀ (not shown in the handout), so that the predicted circumference of tree 1, say, at time t1j is given by

ŷ1j = (191.0 − 29.4) / {1 + exp[−(t1j − 722.6)/344.2]}

• The predicted curves for individual trees can be obtained from plot(augPred(m1Oran.nlme, level=0:1)), which yields the plots on p. 6 of Orange1. The level=0:1 argument here asks for predictions at level 0 (the population level averaged over all trees — corresponds to bi = 0) and at level 1 (here, the tree level).

• Finally in Orange1, we examine the ACFs for models (*), (**), and (†). The ACF for (*) is affected by both mean misspecification and var-cov misspecification, so it is not meaningful as a diagnostic of var-cov structure.

230


• The ACFs of models (**) and (†) are similar — the two models “account for” the residual correlation structure similarly here. This may be somewhat surprising considering that the two models make different assumptions about the correlation between repeated measures.

In particular, the NLM treats all observations, including those from the same group (tree), as independent. According to (**),

cov(yij, yij′) = cov(eij, eij′) = 0,

which implies independence under normality.

In contrast, model (†) implies

cov(yij, yij′) = cov( (θ1 + bi)/{1 + exp[−(tij − θ2)/θ3]} + eij ,  (θ1 + bi)/{1 + exp[−(tij′ − θ2)/θ3]} + eij′ )

             = cov( (θ1 + bi)/{1 + exp[−(tij − θ2)/θ3]} ,  (θ1 + bi)/{1 + exp[−(tij′ − θ2)/θ3]} )

             = cov(bi, bi) / [ {1 + exp[−(tij − θ2)/θ3]}{1 + exp[−(tij′ − θ2)/θ3]} ]

             = σ²_b / [ {1 + exp[−(tij − θ2)/θ3]}{1 + exp[−(tij′ − θ2)/θ3]} ]  ≠ 0

• NLMMs imply that observations that share a random effect are correlated! E.g., observations from the same tree in a model with tree-specific random effects are correlated.

• The corresponding NLM with fixed tree effects assumed that observations from the same tree are independent.

Q: Then why do the two models have (approximately) the same ACF?

A: Because both models assume that, for a given tree, the observations are independent. The NLMM only accounts for correlation due to tree-to-tree differences, which are also accounted for in the NLM with tree effects.

231


• We will see, however, that residual correlation can be added to an NLMM. In addition, more complicated random-effects specifications can be made, allowing NLMMs to model correlation among clustered data much more flexibly.

• In addition, inclusion of fixed effects for each group/cluster/subject in a fixed-effects model is not a feasible or attractive option, in general.

• If the random effects enter into the model in a linear fashion, then one can obtain a closed-form expression for the implied marginal correlation matrix. E.g., in our example, letting zi denote the 7 × 1 vector with jth element {1 + exp[−(tij − θ2)/θ3]}⁻¹, we have

cov(yi) = σ²_b zi ziᵀ + σ²I,

so that

corr(yi) = diag(vi)^{−1/2} (σ²_b zi ziᵀ + σ²I) diag(vi)^{−1/2},

where vi denotes the diagonal of σ²_b zi ziᵀ + σ²I. From our fitted model we can estimate this correlation matrix by plugging in parameter estimates to obtain

ĉorr(yi) =
1  .41  .45  .48  .49  .49  .49
    1   .70  .75  .77  .77  .77
         1   .83  .84  .84  .85
              1   .90  .90  .91
                   1   .92  .92
                        1   .93
                             1
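• This plug-in calculation is easy to reproduce; a small sketch using the estimates reported above (only the numbers come from the fitted model, not the handout's code):

  theta   <- c(191.0, 722.6, 344.2)   # fixed-effect estimates
  sigma_b <- 31.48                    # SD of the random asymptote effects
  sigma   <- 7.85                     # residual SD
  t_ij    <- c(118, 484, 664, 1004, 1231, 1372, 1582)
  z <- 1 / (1 + exp(-(t_ij - theta[2]) / theta[3]))
  V <- sigma_b^2 * tcrossprod(z) + sigma^2 * diag(length(z))
  round(cov2cor(V), 2)   # reproduces the correlation matrix shown above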

232


Another Example — Pharmacokinetics of Indomethacin

Pinheiro and Bates (2000, §6.2) present and analyze data from a laboratory study by Kwan et al. (1976) on the pharmacokinetics of indomethacin. Six human volunteers received bolus injections of the same dose of indomethacin and had their plasma concentrations of the drug (in mcg/ml) measured 11 times between 15 minutes and 8 hours post-injection. The data are included in the nlme library as a groupedData object called Indometh.

A plot of the data appears below.

[Figure: Indomethacin concentration (mcg/ml) vs. time since drug administration (hr), plotted separately for each of the six subjects.]

Kwan et al. (1976) found that the plasma concentrations for each individual subject were adequately described by a two-compartment open model. We will fit two-compartment nonlinear models to these data, where we take the model in the sum-of-exponentials form:

yij = θ1 exp[−e^{θ2} tij] + θ3 exp[−e^{θ4} tij] + eij,   i = 1, . . . , 6,  j = 1, . . . , 11,   (1)

where yij = the concentration of indomethacin in plasma at time tij.

• See handout indometh1.

• In indometh1.R we first plot the concentration-over-time curves separately by subject in a couple of different ways. It is clear that there is a similar general form across all 6 subjects, but that there is also some subject-to-subject variability in the shape of these curves.

233


• We fit model (1) with spherical errors as model m1.nls. The third and fourth plots in the handout display the residuals separately by subject. As in the orange tree example, individual curves are poorly fit by a fixed-effects model without subject-specific parameters or random effects in the parameters.

• As in the orange tree example, we can account for subject-to-subject heterogeneity by including random effects in the parameters. In this case, though, the decision about which parameters should be modelled with random effects is not as obvious.

• A useful aid for making this decision is to fit model (1) separately by subject (see m1.lis) and then compare the subject-specific estimates of θ1, . . . , θ4. This can be done graphically via plot(intervals(m1.lis)).
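• A hedged sketch of these subject-by-subject fits, using the self-starting SSbiexp() function, which implements exactly the sum-of-exponentials form in (1) (A1, lrc1, A2, lrc2 correspond to θ1, θ2, θ3, θ4); the handout's code may differ:

  library(nlme)
  m1.lis <- nlsList(conc ~ SSbiexp(time, A1, lrc1, A2, lrc2), data = Indometh)
  plot(intervals(m1.lis))   # compare subject-specific confidence intervals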

• The only parameter whose subject-specific confidence intervals all overlap is θ4, but there appears to be significant subject-to-subject variability in θ1, θ2, θ3, and (possibly) θ4. Therefore, we consider a mixed-effects model of the form

yij = θ1i exp[−e^{θ2i} tij] + θ3i exp[−e^{θ4i} tij] + eij,   (2)

where all four of the biexponential parameters have subject-specific random effects:

θ1i = θ1 + b1i,  θ2i = θ2 + b2i,  θ3i = θ3 + b3i,  θ4i = θ4 + b4i.

• Here, there are four subject-specific random effects in the model. We can think of this as a single 4-dimensional subject-specific random effect bi = (b1i, b2i, b3i, b4i)ᵀ. Since bi is 4-variate, we need to assume a 4-variate distribution for it. Typically, we assume normality:

b1, b2, . . . , b6 iid∼ N(0, Ψ).

– Here Ψ is the var-cov matrix of each bi, and its structure must be assumed as part of the specification of the model.

234


– Since the four elements of bi are each effects specific to the same subject, there is no obvious reason why these effects wouldn't be correlated. So, the minimal assumption is that Ψ is simply a positive-definite matrix, with 4(4 + 1)/2 = 10 non-redundant elements to be estimated as parameters of the model:

Ψ =
ψ11  ψ12  ψ13  ψ14
ψ12  ψ22  ψ23  ψ24
ψ13  ψ23  ψ33  ψ34
ψ14  ψ24  ψ34  ψ44

– Often, a completely general positive-definite form as above results in convergence problems, and it is easier to fit a model where Ψ is assumed to be diagonal initially (corresponding to uncorrelated subject-specific random effects). Once convergence has been obtained for the initial model, we may consider relaxing the diagonal-Ψ assumption to allow a general positive-definite form, or possibly some intermediate form (e.g., block-diagonal).

• In indometh1.R, we fit model (2) as m1.nlme. A summary of m1.nlme reveals that the sd of b4i is estimated as √ψ44 = 3.44 × 10⁻⁶, which is very small. This was somewhat expected from our plot of the confidence intervals from m1.lis. Therefore, we consider eliminating the random effect from θ4.
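• A hedged sketch of these two fits (the actual calls in indometh1.R may differ). Starting from the nlsList fit, nlme() can impose a diagonal Ψ via pdDiag(), and the random effect for the fourth parameter (lrc2, the θ4 analogue) can then be dropped:

  m1.nlme <- nlme(m1.lis, random = pdDiag(A1 + lrc1 + A2 + lrc2 ~ 1))
  m2.nlme <- update(m1.nlme, random = pdDiag(A1 + lrc1 + A2 ~ 1))
  anova(m1.nlme, m2.nlme)   # naive LRT; its p-value is conservative (see below)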

• The resulting model is m2.nlme. Comparing models with nested random-effects structures (like m1.nlme and m2.nlme) via LRTs seems like a reasonable approach. However, this situation is one in which the usual χ² reference distribution is no longer appropriate. The problem is a non-standard one, not falling under the general theory of Wilks' Theorem, because the null hypothesis places ψ44 on the boundary of its parameter space. The bottom line is that the naive approach of using anova(m1.nlme, m2.nlme), which compares

2[logLik(m1.nlme) − logLik(m2.nlme)]

to a χ²(1) distribution, results in a conservative test (overestimated p-value). For more on this issue see Pinheiro and Bates (2000, §2.4.1).

235


• We go ahead and use anova(m1.nlme, m2.nlme) in indometh1.R, knowing that the p-value will be overestimated. However, the p-value from this comparison (p = .9512) is so large that there can be no doubt that model m2.nlme is to be preferred over m1.nlme. The two models have nearly identical loglikelihoods.

• Next we fit model m3.nlme, in which we allow b1i, b2i, and b3i to be correlated. That is, with bi = (b1i, b2i, b3i)ᵀ,

var(bi) = Ψ =
ψ11  ψ12  ψ13
ψ12  ψ22  ψ23
ψ13  ψ23  ψ33

• An examination of summary(m3.nlme) reveals that the only pair of random effects that are estimated to be highly correlated are b1i and b2i, so we consider a model intermediate between m2.nlme and m3.nlme in which Ψ is assumed to have block-diagonal structure:

var(bi) = Ψ =
ψ11  ψ12   0
ψ12  ψ22   0
 0    0   ψ33

• This model is fit as m4.nlme. In this case, since the null hypothesis H0: ψ13 = ψ23 = 0 doesn't place the parameters on the boundary of their parameter space, we could go ahead and use an LRT to compare models m3.nlme and m4.nlme. However, an easier and more widely valid approach to selecting the variance-covariance structure is to use AIC or BIC. Both of these criteria point to model m4.nlme over m3.nlme and m2.nlme.

• The final two plots in indometh1 display the residuals and fitted curves from model m4.nlme. Both indicate that the model fits the data fairly well. The last plot displays the population-average curve (solid line), corresponding to bi = 0, and the subject-specific predicted concentration-over-time curves (dotted lines).

• Finally, note that the parameter estimates from model m4.nlme, θ = (2.81, .849, .587, −1.11)ᵀ, are similar to those from the fixed-effects model m1.nls, but the standard errors have changed appreciably.

236


The NLME Model Formulation

By far the most important area of application of NLMMs is to grouped or clustered data, particularly longitudinal or repeated measures data. In describing the NLMM, we first present the single-level-of-grouping case and then extend to multilevel data.

• E.g., in an educational context, single-level data might be repeated measures through time on each of several students in Mrs. Smith's third-grade class at Barrow Elementary School. Multilevel (in this case 3-level) data might be repeated measures through time on each of several students (level 3) in each of several classes (level 2) in each of several schools (level 1) in the Athens-Clarke County School District.

Formulation for Single Level Data:

Let yij denote the jth observation (e.g., through time) on the ith group (i.e., cluster; e.g., subject), where we have M groups and ni observations in the ith group. Let xij be a vector of covariates corresponding to response yij.

Then our NLMM is

yij = f(θij, xij) + eij,   i = 1, . . . , M,  j = 1, . . . , ni,

where θij = Aij β + Bij bi,

b1, . . . , bM iid∼ N(0, Ψ),   {eij} iid∼ N(0, σ²).   (∗)

Here, β is a p × 1 vector of fixed effects, and bi is a q × 1 vector of random effects specific to the ith group, with var-cov matrix Ψ. The matrices Aij and Bij are model matrices for the fixed and random effects, respectively.

237


Model (*) can be equivalently expressed in a more succinct matrix form as

yi = fi(θi, xi) + ei,
θi = Ai β + Bi bi,   (∗∗)

for i = 1, . . . , M, where

yi = (yi1, . . . , yi,ni)ᵀ,   θi = (θi1ᵀ, . . . , θi,niᵀ)ᵀ,   ei = (ei1, . . . , ei,ni)ᵀ,

fi(θi, xi) = (f(θi1, xi1), . . . , f(θi,ni, xi,ni))ᵀ,   xi = (xi1ᵀ, . . . , xi,niᵀ)ᵀ,

and Ai and Bi are obtained by stacking Ai1, . . . , Ai,ni and Bi1, . . . , Bi,ni, respectively. We assume

b1, . . . , bM iid∼ Nq(0, Ψ),   {ei} iid∼ Nni(0, σ²Ini),

and the random effects {bi} are independent of the errors {ei}.

238


Example — Orange Tree Data

To illustrate the model formulation, we write model (†) that we used for these data in the form (**). Model (†) can be written as

yij = θ1ij / {1 + exp[−(tij − θ2ij)/θ3ij]} + eij,

where

θij = (θ1ij, θ2ij, θ3ij)ᵀ = Aij β + Bij bi,  with  Aij = I3 (the 3 × 3 identity matrix),  β = (θ1, θ2, θ3)ᵀ,  Bij = (1, 0, 0)ᵀ,  and  bi = b1i.

Here bi = b1i is a scalar, so q = 1, and

b1, . . . , bM iid∼ N(0, σ²_b)  (so Ψ = σ²_b),   {eij} iid∼ N(0, σ²).

In this simple example, the individual coefficients θij and the model matrices Aij and Bij are indexed by j but don't vary with j (don't change over time). The var-cov matrix of the random effects, Ψ, is the scalar variance σ²_b.

239


Another Example — Indomethacin Data

The model that we chose for these data in indometh1 can be written in the form of (*) as follows:

yij = θ1ij exp[−e^{θ2ij} tij] + θ3ij exp[−e^{θ4ij} tij] + eij,

where

θij = (θ1ij, θ2ij, θ3ij, θ4ij)ᵀ = Aij β + Bij bi,  with  Aij = I4 (the 4 × 4 identity matrix),  β = (θ1, θ2, θ3, θ4)ᵀ,  bi = (b1i, b2i, b3i)ᵀ,  and

Bij =
1  0  0
0  1  0
0  0  1
0  0  0

and where

b1, . . . , bM iid∼ N3(0, Ψ),   Ψ =
ψ11  ψ12   0
ψ12  ψ22   0
 0    0   ψ33

{eij} iid∼ N(0, σ²).

• Again, here the individual coefficients θij and the design matrices Aij and Bij do not vary over j (time).

Multilevel Formulation:

A Two-level Example:

The groupedData object Wafer in the nlme library for R/S-PLUS contains data from an experiment conducted at Lucent Technologies to study different sources of variability in the manufacture of analog MOS circuits.

The intensity of current (in mA) at 0.8, 1.2, 1.6, 2.0, and 2.4 V was measured on manufactured electronic devices. Measurements were made on eight sites in each of ten wafers selected from the same lot (batch of product).

The main objective of the experiment was to construct an empirical model for simulating the behavior of similar circuits.

240


In this example, there are two nested blocking factors: wafer, and site within wafer. We expect there may be heterogeneity from site to site and wafer to wafer, and consequently correlation for observations obtained from the same site or from the same wafer.

• It is reasonable to expect that two observations from different wafers will be independent (and hence uncorrelated).

• We expect that two observations from the same wafer, but from different sites, will be correlated.

• We expect two observations from the same wafer and the same site within that wafer to be even more strongly correlated.

A natural way to account for this expected correlation structure is with an NLMM in which we have random effects for wafers and random effects for sites within wafers.

• Here, the levels of site are nested within the levels of wafer. That is, site 1 in wafer 1 is not the same site as site 1 in wafer 2.

• Therefore, the site-specific random effects are nested within the wafer-specific random effects in our model.

• This is an example of two-level grouped data.

Our convention on numbering the levels is that level 0 is the population level (averaged over all wafers and sites), level 1 is the wafer level (coarsest level of grouping), and level 2 is the site level (finest level of grouping).

• As an example of > 2 levels of nesting, suppose the experiment was run on devices from several lots. Then we would have a 3-level situation with data grouped by lots (level 1), wafers within lots (level 2), and sites within wafers within lots (level 3).

241


For two-level data, let yijk = the kth observation on the jth second-level group (e.g., site) within the ith first-level group (e.g., wafer).

We suppose we have M first-level groups (e.g., wafers), Mi second-level groups within the ith first-level group (Mi sites within wafer i), and nij observations on the jth second-level unit within the ith first-level unit (nij observations on the jth site within the ith wafer).

Then the two-level NLMM is

yijk = f(θijk, vijk) + eijk,   i = 1, . . . , M,  j = 1, . . . , Mi,  k = 1, . . . , nij,

where θijk = Aijk β + Bi,jk bi + Bijk bij,

b1, . . . , bM iid∼ N(0, Ψ1),   the bij (i = 1, . . . , M, j = 1, . . . , Mi) iid∼ N(0, Ψ2),   {eijk} iid∼ N(0, σ²).

• Here the bi's, bij's, and eijk's are assumed uncorrelated with each other.

• Extension to NLMMs with more than two levels follows in a straightforward, but notationally tedious, way.

• NLMMs with crossed (rather than nested) random effects can be formulated. However, such models are much more difficult to fit and we will not use them at all in this course. Fortunately, situations for which they are appropriate are much less common than those calling for single-level or multilevel nested random-effects models.

242


Estimation and Inference in NLMMs

Over the last 20–25 years, a huge literature has appeared on estimation and inference in NLMMs and in the closely related class of generalized linear mixed-effects models (GLMMs). We follow the treatment of Pinheiro and Bates (2000, ch. 7) and restrict attention to methods based on the likelihood function. More comprehensive treatments can be found in the reserve books by Davidian and Giltinan (1995) and Vonesh and Chinchilli (1997).

For ML estimation in the NLMM, as in general, we choose the values of the parameters that maximize the (log)likelihood (the parameter values under which the data are most likely). To do this maximization, we need to compute the likelihood function and its derivatives with respect to the parameters.

• Unfortunately, in general this is hard for NLMMs because the random effects enter into the model in a nonlinear fashion.

For a 1-level NLMM the parameters of the model are β, σ², and the unknown elements of Ψ. So, we will write the likelihood function based on all of the data as L(β, σ², Ψ; y), where y = (y1ᵀ, . . . , yMᵀ)ᵀ is the combined data vector from all groups.

L(β, σ², Ψ; y) is just equal to the joint probability density function

p(y; β, σ², Ψ) = ∏_{i=1}^M p(yi; β, σ², Ψ)   (‡)

(since y1, . . . , yM, the observation vectors from distinct lowest-level groups, are independent).

• Here we've used p(·) to denote a density function rather than f(·) to avoid confusion with the model function f.

We need to translate our model specification into an expression for p(yi; β, σ², Ψ).

243


Q: What does our model imply about the density of yi?

A: Directly, nothing. But indirectly, the model assumptions imply the form of p(yi; β, σ², Ψ).

How?:

First of all, it is clear that the conditional density p(yi | bi; β, σ², Ψ) of yi given bi is a normal density, inherited from the normality of the vector of error terms ei.

This is clear from (**), p.238. Conditional on bi, θi is non-random, so that yi = fi(θi, xi) + ei has the usual form of a fixed-effects NLM, and ei ∼ N(0, σ²I) implies

yi | bi ∼ N(fi(θi, xi), σ²I).

Therefore, p(yi | bi; β, σ², Ψ) is given by a multivariate normal density with mean fi(θi, xi) and var-cov matrix σ²I.

Secondly, the model directly assumes the density p(bi; Ψ) of the random-effects vector bi to be multivariate normal.

Thirdly, we have the general relationship between a marginal density such as p(yi; β, σ², Ψ) and a conditional one:

p(yi; β, σ², Ψ) = ∫ p(yi | bi; β, σ², Ψ) p(bi; Ψ) dbi.

• Here the integral is with respect to bi, so it is a multidimensional (q-dimensional) integral.

244


Putting these results together, we have that p(yi; β, σ², Ψ) is given by the integral of a product of multivariate normal densities. Substituting these normal densities and using (‡), we have

p(y; β, σ², Ψ) = ∏_{i=1}^M ∫ (2πσ²)^{−ni/2} exp[ −(1/(2σ²)) ∥yi − fi(θi, xi)∥² ] × (2π)^{−q/2} |Ψ|^{−1/2} exp[ −(1/2) biᵀ Ψ⁻¹ bi ] dbi.

To simplify and work with this expression, it's convenient to re-express Ψ⁻¹ as follows: let ∆ be a square-root matrix of σ²Ψ⁻¹, so that

Ψ⁻¹ = σ⁻² ∆ᵀ∆.

Then, with a fair amount of algebra, we can simplify p(y; β, σ², Ψ) above as

p(y; β, σ², Ψ) = ∏_{i=1}^M |∆| / (2πσ²)^{(ni+q)/2} ∫ exp[ −(∥yi − fi(β, bi)∥² + ∥∆bi∥²) / (2σ²) ] dbi.

• Here we have changed notation, replacing fi(θi, xi) with fi(β, bi) to make explicit the dependence of this quantity on the random-effects vector bi.

Thus, we have an expression for the likelihood function:

L(β, σ², Ψ; y) = ∏_{i=1}^M |∆| / (2πσ²)^{(ni+q)/2} ∫ exp[ −(∥yi − fi(β, bi)∥² + ∥∆bi∥²) / (2σ²) ] dbi.

• Unfortunately, because the random effects can enter into the model nonlinearly, this likelihood involves an integral that does not, in general, have a closed-form expression.

• The presence of the integral in L(β, σ², Ψ; y) is what makes NLMMs especially hard to work with, statistically.

245


Many authors have proposed methods to fit NLMMs that deal with the integral in L(β, σ², Ψ; y) in various ways. The approaches generally fall into three main categories:

1. Numerical integration (a.k.a. quadrature) methods.

2. Monte Carlo (simulation-based) methods.

3. Approximate maximum likelihood/estimating equation methods.

a. Methods based on approximations to the loglikelihood involving Taylor series/Laplace approximations with expansions taken about E(bi) = 0, the mean of the random-effects vector.

b. Methods based on approximations to the loglikelihood involving Taylor series/Laplace approximations with expansions taken about b̂i, some predicted value of the random-effects vector.

We will describe only two methods:

A. The LME approximation method of Lindstrom and Bates (1990). (This falls in category 3b, and is the method implemented in the nlme software in R.)

B. Adaptive Gaussian quadrature. (This method falls in category 1, and is implemented in SAS' PROC NLMIXED, for 1-level models.)

246


The LME Approximation:

The LME approximation can be derived in several different ways, but the easiest motivation is to obtain the approximate loglikelihood of model (**) by using the loglikelihood of a linear approximation of model (**), p.238.

Recall our single-level NLMM from p.238:

yi = fi(θi, xi) + ei ≡ fi(β, bi) + ei,
θi = Ai β + Bi bi,   i = 1, . . . , M.   (∗∗)

Taking a first-order (linear) Taylor series approximation of fi(β, bi) about b̂i, a predictor of bi, we have

yi ≈ fi(β, b̂i) + Ẑi(bi − b̂i) + ei,

where

Ẑi = ∂fi/∂biᵀ evaluated at bi = b̂i.

Rearranging, we have

yi + Ẑi b̂i = fi(β, b̂i) + Ẑi bi + ei.   (♣)

• Note that now bi enters into model (♣) linearly, not nonlinearly.

• Treating b̂i as fixed, we can view model (♣) as a nonlinear model with response yi + Ẑi b̂i that is multivariate normal, with mean

E(fi(β, b̂i) + Ẑi bi + ei) = fi(β, b̂i)

and variance

var(fi(β, b̂i) + Ẑi bi + ei) = var(Ẑi bi + ei)
                            = Ẑi Ψ Ẑiᵀ + σ²I
                            = σ²(Ẑi ∆⁻¹∆⁻ᵀ Ẑiᵀ + I) ≡ σ²Σi(∆).

247


Therefore, the likelihood of model (♣), which we take as the LME approximate likelihood of model (**), is

∏_{i=1}^M (2πσ²)^{−ni/2} |Σi(∆)|^{−1/2} exp[ −(1/(2σ²)) {yi + Ẑi b̂i − fi(β, b̂i)}ᵀ Σi(∆)⁻¹ {yi + Ẑi b̂i − fi(β, b̂i)} ].

• Note that the above approximate likelihood can also be obtained directly from L(β, σ², Ψ; y) on p.245 by applying a quadratic Taylor approximation to the integrand around b̂i.

The LME approximate loglikelihood is

−(1/2) ∑_{i=1}^M [ ni log(2πσ²) + log|Σi(∆)| + {yi + Ẑi b̂i − fi(β, b̂i)}ᵀ Σi(∆)⁻¹ {yi + Ẑi b̂i − fi(β, b̂i)} ].   (♡)

This loglikelihood is maximized iteratively, alternating between a step to estimate β and obtain the predictors b̂i for fixed ∆, and a step to estimate ∆ for fixed values of β and b̂i.

Step 1 — Estimating β and updating the predictors b̂i:

We obtain these quantities by maximizing (♡) with respect to β and the b̂i, i = 1, . . . , M, with one simplification: we ignore the dependence of Σi(∆) on β.

• It can be argued that Σi(∆) will, in general, vary slowly with β, and therefore the term ∂Σi(∆)/∂β will be negligible (see Wolfinger & Lin, 1997).

248


With this simplification, we see that maximizing (♡) with respect to β and the b̂i amounts to minimizing the quantity

∑_{i=1}^M {yi + Ẑi b̂i − fi(β, b̂i)}ᵀ Σi(∆)⁻¹ {yi + Ẑi b̂i − fi(β, b̂i)} = ∑_{i=1}^M [ ∥yi − fi(β, b̂i)∥² + ∥∆b̂i∥² ].

• Notice from the second form given above that it's appropriate to term this objective function a penalized nonlinear least squares (PNLS) criterion.

Step 2 — Estimation of ∆ and σ²:

Maximization of (♡) with respect to ∆ and σ² is done by first profiling this approximate loglikelihood with respect to these parameters. That is, we substitute β̂(∆, σ²), the estimator of β based on the current ∆ and σ², into (♡). This profiling yields the objective function

−(1/2) ∑_{i=1}^M [ ni log(2πσ²) + log|Σi(∆)| + {ŵi − X̂i β}ᵀ Σi(∆)⁻¹ {ŵi − X̂i β} ]  evaluated at β = β̂(∆, σ²),

where

X̂i = ∂fi(β, b̂i)/∂βᵀ  and  ŵi = yi − fi(β̂, b̂i) + X̂i β̂ + Ẑi b̂i.

This objective function is the loglikelihood of a linear mixed-effects model in which the response vector is ŵ = (ŵ1ᵀ, . . . , ŵMᵀ)ᵀ, and the fixed- and random-effects design matrices are X̂ = (X̂1ᵀ, . . . , X̂Mᵀ)ᵀ and Ẑ = (Ẑ1ᵀ, . . . , ẐMᵀ)ᵀ, respectively.

This means that step 2 can be accomplished by fitting a linear mixed-effects model using standard techniques for that kind of problem.

Thus, the whole LME approximation method iterates between a PNLS step (step 1) and a linear mixed-effects (LME) step (step 2). These two steps are repeated until convergence.

249


• For the extension to the multilevel model, see Pinheiro and Bates (2000, ch. 7).

Adaptive Gaussian Quadrature:

A computationally more intensive, but generally more accurate, method to evaluate the likelihood function of an NLMM is adaptive Gaussian quadrature.

The basic idea of (ordinary) Gaussian quadrature is to approximate an integral of the form ∫ₐᵇ g(x)w(x)dx by a finite weighted sum of the function g(·) evaluated at a set of x-values chosen to be optimal for the particular form of the weight function w(·) in the integral.

That is, since an integral of the form ∫ₐᵇ g(x)w(x)dx has an interpretation as an infinite weighted sum, approximate it by a finite weighted sum:

∫ₐᵇ g(x)w(x)dx ≈ ∑_{k=1}^m w_k g(x_k),   (♠)

where w1, . . . , wm are weights and x1, . . . , xm are abscissas chosen based on the form of w(·).

• In (♠), we use an m-term sum to approximate the integral, so this is called m-point Gaussian quadrature. The accuracy of the approximation increases with m, although not linearly.
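• A tiny numerical illustration of (♠) for the weight function exp(−z²/2) that arises below; this sketch uses Gauss-Hermite nodes and weights from the contributed statmod package (an assumption, not part of these notes), rescaled by the change of variable z = √2·x:

  library(statmod)
  gh <- gauss.quad(10, kind = "hermite")  # nodes/weights for weight exp(-x^2)
  g  <- function(z) cos(z)                # any smooth integrand
  approx_val <- sqrt(2) * sum(gh$weights * g(sqrt(2) * gh$nodes))
  exact_val  <- sqrt(2 * pi) * exp(-1/2)  # since E[cos(Z)] = exp(-1/2) for Z ~ N(0,1)
  c(approx = approx_val, exact = exact_val)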

In an NLMM context, the integral we require is

$$\int \exp\left[-\frac{\|y_i - f_i(\beta, b_i)\|^2 + \|\Delta b_i\|^2}{2\sigma^2}\right] db_i \qquad (\diamondsuit)$$

(cf. bottom of p.245).

With the change of variable z = σ⁻¹∆b_i, this integral can be written as

$$\int \sigma^q|\Delta|^{-1}\exp\left[-\frac{1}{2\sigma^2}\|y_i - f_i(\beta, \sigma\Delta^{-1}z)\|^2\right]\exp\left(-\tfrac{1}{2}\|z\|^2\right)dz,$$

and we have exp(−½‖z‖²) playing the role of the weight function w(x) in (♠).


Suppose q = 1. Then the integral that we wish to approximate is one-dimensional, and we can apply the Gaussian quadrature approximation given by (♠). In this case, we have

$$(\diamondsuit) \approx \sigma|\Delta|^{-1}\sum_{j=1}^{m}\exp\left[-\frac{1}{2\sigma^2}\|y_i - f_i(\beta, \sigma\Delta^{-1}z_j)\|^2\right]w_j,$$

where zj, wj, j = 1, . . . , m, are abscissas and weights, respectively, for the exp(−½z²) weight function. These are known quantities that can be looked up in tables or calculated with computer programs.

If q > 1, then the integral we wish to approximate is multidimensional. In this case, the (ordinary) Gaussian quadrature approximation becomes a q-fold nested sum:

$$(\diamondsuit) \approx \sigma^q|\Delta|^{-1}\sum_{j_1=1}^{m}\cdots\sum_{j_q=1}^{m}\exp\left[-\frac{1}{2\sigma^2}\|y_i - f_i(\beta, \sigma\Delta^{-1}z_j)\|^2\right]\prod_{k=1}^{q} w_{j_k},$$

where now z_j is a vector of abscissa values: z_j = (z_{j_1}, . . . , z_{j_q})^T.

The adaptive Gaussian quadrature approximation is much the same, except that the weighted sum approximation is evaluated at abscissa values centered around b̂i, our predictor of the random effects vector bi, rather than around E(bi) = 0 as in ordinary Gaussian quadrature. In addition, the abscissas are rescaled.

This recentering and rescaling is accomplished by the change of variable z = σ⁻¹(Z_i^T Z_i + ∆^T∆)^{1/2}(b_i − b̂_i) (rather than the one given at the bottom of the previous page). This leads to the adaptive Gaussian quadrature approximation

$$(\diamondsuit) \approx \sigma^q|Z_i^T Z_i + \Delta^T\Delta|^{-1/2}\sum_{j_1=1}^{m}\cdots\sum_{j_q=1}^{m}\exp\Big\{-\frac{1}{2\sigma^2}\Big[\|y_i - f_i\big(\beta,\,\hat{b}_i + \sigma(Z_i^T Z_i + \Delta^T\Delta)^{-1/2}z_j\big)\|^2 + \|\Delta\{\hat{b}_i + \sigma(Z_i^T Z_i + \Delta^T\Delta)^{-1/2}z_j\}\|^2\Big] + \tfrac{1}{2}\|z_j\|^2\Big\}\prod_{k=1}^{q} w_{j_k}.$$


• Although the adaptive Gaussian quadrature approach can, in principle, be extended to the multilevel case, it is much more computationally demanding than the LME approximation, and only the one-level case is practical.

• In fact, even in the one-level case, models with q > 2 random effects (e.g., our indomethacin example, where the final model had q = 3) are very difficult to fit in PROC NLMIXED. Only one-level models with q ≤ 2 are really feasible currently with this methodology.

• It is true, though, that the adaptive Gaussian quadrature approach is more accurate than the LME approximation, and is the preferred method for feasible problems.
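To illustrate why the recentering and rescaling matter, the following R sketch compares ordinary and adaptive Gauss-Hermite quadrature on a one-dimensional integral whose integrand is sharply peaked away from zero. It is a toy illustration of the idea, not the NLMM computation itself, and it again assumes the contributed statmod package for the nodes and weights.

    ## g(b) is sharply peaked at b = 3; its true integral is 0.1*sqrt(2*pi) = 0.2507
    library(statmod)
    g  <- function(b) exp(-(b - 3)^2 / (2 * 0.1^2))
    gh <- gauss.quad(5, kind = "hermite")            # weight function exp(-x^2)
    x  <- sqrt(2) * gh$nodes                         # nodes for weight exp(-z^2/2)
    w  <- sqrt(2) * gh$weights * exp(gh$nodes^2)     # fold exp(z^2/2) into the weights
    ## Ordinary quadrature expands around 0 and badly misses the peak at b = 3:
    sum(w * g(x))
    ## Adaptive quadrature recenters at the mode and rescales by the curvature there:
    bhat <- optimize(function(b) -log(g(b)), c(0, 6))$minimum
    h2   <- (log(g(bhat + 1e-4)) - 2 * log(g(bhat)) + log(g(bhat - 1e-4))) / 1e-8
    scl  <- 1 / sqrt(-h2)                            # equals 0.1 here
    scl * sum(w * g(bhat + scl * x))                 # close to 0.2507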

Example — CO2 Uptake in Grass

• See §8.2.2 in Pinheiro and Bates (2000) for more details on this example.

The groupedData object CO2 in the nlme library in R contains data from an experiment to investigate the effects of cold temperatures on the CO2 uptake of the grass species Echinochloa crus-galli. A total of 12 four-week-old plants, 6 from Quebec and 6 from Mississippi, were divided into two groups: control plants that were kept at 26°C and chilled plants that were subject to 14 h of chilling at 7°C. After 10 h of recovery at 20°C, CO2 uptake was measured for each plant at seven concentrations of ambient CO2.

• See handout CO2. A plot of the data appears on p.5 of the handout. This plot suggests an asymptotic relationship between uptake and concentration.
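For readers following along in R, a plot similar to the one in the handout can be produced directly from the groupedData object (in current R the CO2 data frame ships with the base datasets package but retains its groupedData structure from nlme; treat the exact provenance as an assumption).

    library(nlme)
    head(CO2)    # columns Plant, Type, Treatment, conc, uptake
    plot(CO2)    # trellis display of uptake vs. ambient CO2, one panel per plant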


We follow Potvin et al. (1990), who originally presented these data, in considering models based on the asymptotic regression model SSasympOff. In its fixed-effects incarnation, this model has expectation function

$$f(\theta, x_{ij}) = \theta_1\big[1 - \exp\{-e^{\theta_2}(x_{ij} - \theta_3)\}\big], \qquad (*)$$

where yij and xij are the CO2 uptake and ambient CO2 concentration, respectively, for the jth observation made on the ith plant.

In building a mixed-effects version of (*), we must account for plant-to-plant heterogeneity and also differences in the mean response from one experimental group to the next. Here the experimental groups are the crossing of Type of plant (Quebec vs. Mississippi) with Treatment (non-chilled vs. chilled).

It is natural to include random effects in one or more of the model parameters θ1, θ2, θ3 in (*) to account for plant-to-plant heterogeneity. But we will also want to include fixed effects in our model to account for group-to-group differences.

However, there are several decisions to make: Which parameters should be modelled with random effects? In which parameters should we include fixed effects to account for group differences? How should we test for significance of fixed effects? How should we test for significance of random effects? And in what order?

There is not one uniquely correct strategy of model building for NLMMs. We will take the strategy advocated by Pinheiro and Bates (2000, §8.2):

1. Start with the basic NLMM involving no covariates (no fixed effects for group differences) with all parameters mixed.


2. Drop any obviously non-significant random effects from the model. (By non-significant we mean random effects whose variance components are non-significant.)

   – Testing for significance of random effects is complicated by the non-chi-square limiting distribution of the LRT. Using information criteria for nested models is more appropriate, although LRTs are informative if we keep in mind that chi-square-based p-values (e.g., as given by the anova() function in R) are overestimated.

3. Plot predicted random effects versus covariate values (potential explanatory variable values) to determine which parameters' variability is accounted for by the covariate(s) and thus which parameters should be modelled with covariates and additional fixed effects.

   – Significance of new fixed effects is assessed with Wald-type tests.

4. After covariates are added, we reconsider the random-effects structure. It may be possible to remove random effects (or, possibly but unlikely, necessary to add random effects) once covariates have been added to the model.

   – Again, significance of random effects is assessed with information criteria and LRTs for nested models.

• In the CO2 handout, we first fit the SSasympOff model (*) to the data from each plant separately using the nlsList() function. An intervals() plot of the plant-specific parameters would be helpful in deciding which parameters should be mixed, but instead we go ahead and fit the model with all parameters mixed:

$$y_{ij} = \theta_{1i}\big[1 - \exp\{-e^{\theta_{2i}}(x_{ij} - \theta_{3i})\}\big] + e_{ij}$$

$$\theta_i = \begin{pmatrix}\theta_{1i}\\ \theta_{2i}\\ \theta_{3i}\end{pmatrix} = \begin{pmatrix}\beta_1\\ \beta_2\\ \beta_3\end{pmatrix} + \begin{pmatrix}b_{1i}\\ b_{2i}\\ b_{3i}\end{pmatrix} = \beta + b_i$$

$$\{b_i\} \stackrel{\text{iid}}{\sim} N(0, \Psi), \qquad \{e_{ij}\} \stackrel{\text{iid}}{\sim} N(0, \sigma^2)$$


• The above model is fit as m1CO2.nlme and can be fit directly from the nlsList() model m1CO2.lis. A summary of m1CO2.nlme reveals that b1i and b3i are almost perfectly correlated. This suggests that the model is overparameterized, and we try dropping b3i from the model.

• The resulting model, m2CO2.nlme, fits better than m1CO2.nlme according to both information criteria. Model m2CO2.nlme has a reasonable-looking residual plot. We now go on and consider the explanatory factors Type and Treatment.
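For reference, the calls behind these fits look roughly as follows (a sketch in the spirit of Pinheiro and Bates, 2000, §8.2.2, using the object names from the handout; the exact arguments in the handout may differ).

    library(nlme)
    ## Separate SSasympOff fits for each plant
    m1CO2.lis  <- nlsList(uptake ~ SSasympOff(conc, Asym, lrc, c0), data = CO2)
    ## NLMM with random effects on all three parameters
    m1CO2.nlme <- nlme(m1CO2.lis)
    ## b1i and b3i nearly perfectly correlated, so drop the random effect on c0
    m2CO2.nlme <- update(m1CO2.nlme, random = Asym + lrc ~ 1)
    anova(m1CO2.nlme, m2CO2.nlme)   # compare via AIC/BIC (and an LRT)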

• We plot predicted random effects from m2CO2.nlme against the experimental groups given by the crossing of Type and Treatment on p.7. It appears that the random effects corresponding to θ1 and those corresponding to θ2 both change across experimental groups in a way that suggests an interaction between Type and Treatment. Therefore, next we consider model m3CO2.nlme, in which

$$\theta_{1i} = \beta_{10} + \beta_{11}x_{1i} + \beta_{12}x_{2i} + \beta_{13}x_{1i}x_{2i} + b_{1i}$$
$$\theta_{2i} = \beta_{20} + \beta_{21}x_{1i} + \beta_{22}x_{2i} + \beta_{23}x_{1i}x_{2i} + b_{2i}$$
$$\theta_{3i} = \beta_3,$$

where

$$x_{1i} = \begin{cases}0, & \text{if Type = Quebec}\\ 1, & \text{if Type = Mississippi}\end{cases} \qquad x_{2i} = \begin{cases}0, & \text{if Treatment = non-chilled}\\ 1, & \text{if Treatment = chilled}\end{cases}$$

• This model is fit as m3CO2.nlme. Significance of β11, β12, β13 (the main effect of Type, main effect of Treatment, and interaction in the asymptote parameter) is assessed with a Wald test via anova(m3CO2.nlme, Terms=2:4). Similarly, we test the significance of β21, β22, β23 (the main effects and interaction in the rate parameter) via anova(m3CO2.nlme, Terms=6:8). In both cases, these terms are significant and we retain them in the model.
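A sketch of how this model can be specified from m2CO2.nlme follows. The zero-padding of the starting values for the new coefficients (ordered to match the expanded fixed-effects specification) is the usual device; the exact starting values used in the handout may differ.

    ## Add Type*Treatment fixed effects to the asymptote and rate parameters
    fe <- fixef(m2CO2.nlme)
    m3CO2.nlme <- update(m2CO2.nlme,
                         fixed = list(Asym + lrc ~ Type * Treatment, c0 ~ 1),
                         start = c(fe[1], 0, 0, 0, fe[2], 0, 0, 0, fe[3]))
    anova(m3CO2.nlme, Terms = 2:4)   # Type, Treatment, Type:Treatment in the asymptote
    anova(m3CO2.nlme, Terms = 6:8)   # the same terms in the rate parameter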


• Finally, we consider removing one or more of the random effects b1i, b2i now that explanatory variables have been incorporated into θ1i, θ2i. A comparison of the estimated standard deviation of b1i (√ψ̂11 = 2.35) with |β̂10| = 32.34, and of the estimated standard deviation of b2i (√ψ̂22 = .080) with |β̂20| = 4.51, reveals that the variance component associated with b2i is relatively smaller than that associated with b1i. Therefore, we consider dropping b2i from the model. This yields model m4CO2.nlme, which fits better than m3CO2.nlme according to the information criteria.

• We also consider dropping b1i from the model (see m5CO2.gnls), but this change results in a significant reduction in the quality of the fit.
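In code, this last round of simplification could be carried out along the following lines (a sketch; m5CO2.gnls in the handout is fit with gnls(), whose call is not reproduced here).

    ## Drop b2i, keeping a random effect only on the asymptote
    m4CO2.nlme <- update(m3CO2.nlme, random = Asym ~ 1)
    anova(m3CO2.nlme, m4CO2.nlme)   # m4CO2.nlme preferred by AIC/BIC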

• So, the final model for these data is

$$y_{ij} = \theta_{1i}\big[1 - \exp\{-e^{\theta_{2i}}(x_{ij} - \theta_{3i})\}\big] + e_{ij}$$

$$\theta_i = \begin{pmatrix}\theta_{1i}\\ \theta_{2i}\\ \theta_{3i}\end{pmatrix} = \begin{pmatrix}\beta_{10} + \beta_{11}x_{1i} + \beta_{12}x_{2i} + \beta_{13}x_{1i}x_{2i} + b_{1i}\\ \beta_{20} + \beta_{21}x_{1i} + \beta_{22}x_{2i} + \beta_{23}x_{1i}x_{2i}\\ \beta_3\end{pmatrix}$$

$$\{b_i\} \stackrel{\text{iid}}{\sim} N(0, \Psi), \qquad \{e_{ij}\} \stackrel{\text{iid}}{\sim} N(0, \sigma^2)$$

• A plot of the observed and predicted responses for each plant is included on the final page of the handout. As you can see, the model appears to fit the data well.

• To illustrate the use of PROC NLMIXED and its implementation of adaptive Gaussian quadrature, see CO2.sas and CO2.lst. In these programs we refit the final model above. As you'll see, the results are quite similar but not identical to those from nlme() using the LME approximation.

An example of a multilevel NLMM fit in nlme is given by Pinheiro and Bates (2000, §8.2.3).


Restricted Maximum Likelihood

In the linear mixed effects model, ML estimators of variance components are generally not preferred because they are biased. To take a simple example, consider the linear model

$$y = X\beta + e, \qquad e \sim N(0, \sigma^2 I),$$

where X is an n × p full-rank model matrix.

The ML estimator of σ² is

$$\hat{\sigma}^2 = \frac{1}{n}\|y - X\hat{\beta}\|^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i^T\hat{\beta})^2,$$

where β̂ = (X^TX)^{-1}X^Ty is the MLE/LSE of β and x_i^T is the ith row of X.

It is well known and easy to show that σ̂² is biased. A bias-corrected and generally preferred estimator is the mean squared error:

$$s^2 = \frac{1}{n-p}\|y - X\hat{\beta}\|^2 = \frac{1}{n-p}\sum_{i=1}^{n}(y_i - x_i^T\hat{\beta})^2 = \frac{n}{n-p}\hat{\sigma}^2.$$

• s² is unbiased.

• The divisor n − p in s² is dfE, the degrees of freedom due to error in the model, which is also equal to dfT − dfReg = n − p, the total degrees of freedom (n) minus the degrees of freedom due to estimating β (p). Since the divisor in s² is dfT − dfReg rather than just dfT as used in σ̂², it is often said that s² accounts for degrees of freedom lost from having to estimate the regression parameter β.
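A quick numerical check in R, with simulated data (the variable names are arbitrary):

    ## ML vs. bias-corrected variance estimates in an ordinary linear model
    set.seed(1)
    n <- 30
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5)
    fit <- lm(y ~ x)
    rss <- sum(residuals(fit)^2)
    rss / n                   # ML estimate of sigma^2 (divisor n)
    rss / df.residual(fit)    # s^2 = MSE (divisor n - p); same as sigma(fit)^2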


As another example, consider a simple linear mixed model: the one-way ANOVA model with random effects,

$$y_{ij} = \mu + b_i + e_{ij}, \qquad i = 1, \ldots, a, \quad j = 1, \ldots, m.$$

Here, µ is a (fixed) mean, $b_1, \ldots, b_a \stackrel{\text{iid}}{\sim} N(0, \sigma_b^2)$ are random effects, and the eij's are i.i.d. N(0, σ²) error terms assumed independent of the bi's.

• Such a model might be appropriate for an example in which we are trying to estimate the calcium content of the leaves of a certain plant and we have calcium measurements on several (m) leaves from each of several (a) plants. In such a situation, we're interested in µ, the mean calcium concentration, but the data are grouped by plant, so we want plant-specific random effects bi in the model to account for heterogeneity from plant to plant and correlation in the observations taken from the same plant.

In this model, the total variance of a response yij is

$$\text{var}(y_{ij}) = \sigma_b^2 + \sigma^2.$$

Therefore, σb² and σ² are called variance components.


The ANOVA table for this model is as follows:

    Source of Variation   Sum of Squares   df      Mean Square               F
    Groups                SSGrps           a - 1   MSGrps = SSGrps/dfGrps    MSGrps/MSE
    Error                 SSE              n - a   MSE = SSE/dfE
    Total                 SST              n - 1

The MLEs of the variance components are given by

$$\hat{\sigma}_b^2 = \frac{1}{m}\left(\frac{a-1}{a}MS_{Grps} - MS_E\right) \quad\text{and}\quad \hat{\sigma}^2 = MS_E, \qquad \text{if } \hat{\sigma}_b^2 \ge 0,$$

$$\hat{\sigma}_b^2 = 0 \quad\text{and}\quad \hat{\sigma}^2 = \frac{SS_T}{n} \qquad \text{otherwise.}$$

These are not the usual variance component estimators taught in courses like STAT 8200. Instead, the usual estimators are derived from the ANOVA table by equating mean squares to their expected values and solving for σ² and σb². This leads to the "ANOVA" estimators

$$\hat{\sigma}_b^2 = \frac{1}{m}\left(MS_{Grps} - MS_E\right) \quad\text{and}\quad \hat{\sigma}^2 = MS_E, \qquad \text{if } \hat{\sigma}_b^2 \ge 0,$$

$$\hat{\sigma}_b^2 = 0 \quad\text{and}\quad \hat{\sigma}^2 = \frac{SS_T}{n-1} \qquad \text{otherwise.}$$

• The ANOVA estimators, like s², the MSE in the previous example, are less biased than the corresponding ML estimators because they account for the degrees of freedom lost in having to estimate the fixed effects.
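The same contrast can be seen by fitting the one-way random-effects model by ML and by REML (introduced below) in R; the simulation settings here are arbitrary.

    ## ML vs. REML variance-component estimates in the one-way random-effects model
    library(nlme)
    set.seed(2)
    a <- 8; m <- 5                                    # a plants, m leaves per plant
    plant <- factor(rep(1:a, each = m))
    y <- 10 + rep(rnorm(a, sd = 2), each = m) + rnorm(a * m, sd = 1)
    d <- data.frame(y, plant)
    fitML   <- lme(y ~ 1, random = ~ 1 | plant, data = d, method = "ML")
    fitREML <- lme(y ~ 1, random = ~ 1 | plant, data = d, method = "REML")
    VarCorr(fitML)     # ML estimates of sigma_b^2 and sigma^2
    VarCorr(fitREML)   # REML estimates, typically somewhat larger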

Both of these examples of "bias-corrected" alternatives to ML estimators of variance components are special cases of what are known as restricted maximum likelihood (REML) estimators. In REML, the goal is to produce better estimators of variance components by constructing an objective function that does not involve the fixed effects. That is, REML estimators maximize a likelihood-like function in which the nuisance parameter β has been eliminated, or concentrated out of the likelihood.


The linear mixed model for grouped (two-level) data with spherical errors takes the form

$$y_i = X_i\beta + Z_ib_i + e_i, \qquad i = 1, \ldots, M,$$

where $e_1, \ldots, e_M \stackrel{\text{iid}}{\sim} N_{n_i}(0, \sigma^2 I)$ and $b_1, \ldots, b_M \stackrel{\text{iid}}{\sim} N(0, \Psi)$, and the ei's and bi's are independent of one another.

This model can be written more succinctly as

$$y = X\beta + Zb + e,$$

where

$$y = \begin{pmatrix}y_1\\ \vdots\\ y_M\end{pmatrix}, \quad X = \begin{pmatrix}X_1\\ \vdots\\ X_M\end{pmatrix}, \quad b = \begin{pmatrix}b_1\\ \vdots\\ b_M\end{pmatrix}, \quad e = \begin{pmatrix}e_1\\ \vdots\\ e_M\end{pmatrix},$$

and Z = blkdiag(Z_1, . . . , Z_M). Now b ∼ N(0, blkdiag(Ψ, . . . , Ψ)) and e ∼ N_n(0, σ²I). For simplicity, assume that X is n × p of rank p.

Note that this model implies

$$y \sim N_n\big(X\beta, V(\Psi, \sigma^2)\big), \qquad V(\Psi, \sigma^2) = \text{blkdiag}\big(V_1(\Psi, \sigma^2), \ldots, V_M(\Psi, \sigma^2)\big),$$

where $V_i(\Psi, \sigma^2) = Z_i\Psi Z_i^T + \sigma^2 I$.

• Note that the distribution of y depends upon the fixed effects β.

To eliminate β from the objective function, in REML we work not with the likelihood function corresponding to the joint density of the y's, but instead with the joint density of a set of linearly independent error contrasts of the y's.

• A linear combination a^Ty is called an error contrast if E(a^Ty) = 0 for all β.


Suppose we can find n − p linearly independent error contrasts

$$w_1 = a_1^Ty, \quad w_2 = a_2^Ty, \quad \ldots, \quad w_{n-p} = a_{n-p}^Ty,$$

or

$$w = A^Ty,$$

where A has columns a_1, . . . , a_{n-p}. Then

$$y \sim N\big(X\beta, V(\Psi, \sigma^2)\big) \quad\text{implies}\quad w \sim N_{n-p}\big(0, A^TV(\Psi, \sigma^2)A\big).$$

• It is clear now that the loglikelihood of w does not depend upon β.

The log density of w is taken as ℓR(Ψ, σ²; y), the restricted loglikelihood of the variance parameters Ψ and σ², based on y:

$$\ell_R(\Psi, \sigma^2; y) = -\frac{1}{2}\log\big|A^TV(\Psi, \sigma^2)A\big| - \frac{1}{2}w^T\big\{A^TV(\Psi, \sigma^2)A\big\}^{-1}w + \text{constant}.$$


In the REML approach, the loglikelihood is replaced by the "restricted loglikelihood," which is defined as the loglikelihood based upon a set of linearly independent error contrasts in the original response. It can be shown that this restricted loglikelihood does not depend upon which set of error contrasts is chosen. All choices lead to the same restricted loglikelihood, which, ignoring constant terms, is given by

$$\ell_R(\Psi, \sigma^2; y) = -\frac{1}{2}\sum_{i=1}^{M}\Big\{\log|V_i(\Psi, \sigma^2)| + \log\big|X_i^TV_i^{-1}(\Psi, \sigma^2)X_i\big| + [y_i - X_i\hat{\beta}(\Psi, \sigma^2)]^TV_i^{-1}(\Psi, \sigma^2)[y_i - X_i\hat{\beta}(\Psi, \sigma^2)]\Big\},$$

where $\hat{\beta}(\Psi, \sigma^2) = \{X^TV^{-1}(\Psi, \sigma^2)X\}^{-1}X^TV^{-1}(\Psi, \sigma^2)y$.

In contrast, the profile loglikelihood for Ψ, σ² corresponding to ML estimation is

$$p\ell(\Psi, \sigma^2; y) = -\frac{1}{2}\sum_{i=1}^{M}\Big\{\log|V_i(\Psi, \sigma^2)| + [y_i - X_i\hat{\beta}(\Psi, \sigma^2)]^TV_i^{-1}(\Psi, \sigma^2)[y_i - X_i\hat{\beta}(\Psi, \sigma^2)]\Big\}.$$

That is, with respect to the variance parameters Ψ and σ², the REML objective function differs from that of ML as follows:

$$\ell_R(\Psi, \sigma^2; y) = p\ell(\Psi, \sigma^2; y) - \frac{1}{2}\sum_{i=1}^{M}\log\big|X_i^TV_i^{-1}(\Psi, \sigma^2)X_i\big|.$$

• Consequently, REML is sometimes called a penalized likelihood method.


REML estimates of Ψ, σ² are obtained by solving the equations obtained by differentiating ℓR(Ψ, σ²; y) with respect to these parameters and setting the derivatives equal to zero.

• REML estimators are not, in general, unbiased, but they are typically less biased than the ML estimators and are preferred in most contexts.

• An important caveat concerning REML estimation is that the objective function is not a true loglikelihood, and cannot be treated as such for all aspects of statistical inference. In particular, models with different fixed-effects specifications cannot be compared via restricted likelihood ratio type tests or via model selection criteria such as AIC and BIC in which the loglikelihood has been replaced by the restricted loglikelihood.

REML estimation doesn't easily generalize to a nonlinear model context, because the idea of using an error contrast to eliminate the fixed effects only works when the fixed effects enter into the model in a linear way.

However, the LME approach to fitting the NLMM does lend itself nicely to an approximate REML-type procedure. Recall that the LME approach consisted of two steps:

1. A PNLS step in which we minimize the penalized nonlinear least squares criterion given on the top of p. 249 with respect to β and bi, and

2. An LME step in which we maximize the function

$$\ell_{LME}\big(\hat{\beta}(\Delta), \sigma^2, \Delta; y\big) = -\frac{1}{2}\sum_{i=1}^{M}\Big[\,n_i\log(2\pi\sigma^2) + \log|\Sigma_i(\Delta)| + \sigma^{-2}\{w_i - X_i\beta\}^T\Sigma_i(\Delta)^{-1}\{w_i - X_i\beta\}\,\Big]\Big|_{\beta = \hat{\beta}(\Delta, \sigma^2)}, \qquad (*)$$

where

$$X_i = \frac{\partial f_i(\beta, b_i)}{\partial\beta^T}, \qquad w_i = y_i - f_i(\hat{\beta}, \hat{b}_i) + X_i\hat{\beta} + Z_i\hat{b}_i.$$


• Recall that step 2 corresponds to fitting a linear mixed effects model with ML estimation.

The approximate REML version of the LME approach to fitting an NLMM just fits the linear mixed effects model in step two with REML rather than ML. That is, the objective function (*) is replaced by

$$\ell_{RLME}(\sigma^2, \Delta; y) = \ell_{LME}\big(\hat{\beta}(\Delta), \sigma^2, \Delta; y\big) - \underbrace{\frac{1}{2}\sum_{i=1}^{M}\log\Big|\sigma^{-2}X_i^T\Sigma_i(\Delta)^{-1}X_i\Big|}_{\text{REML penalty term}}.$$

• The idea here is that the LME approximate ML procedure consists of iteratively fitting an LMM with ML, so the LME approximate REML procedure is done by iteratively fitting the LMM with REML.
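In nlme(), the choice between the approximate ML and approximate REML procedures is made through the method argument; for instance, the final CO2 model could be refit as follows (the object name on the left is hypothetical).

    ## Approximate REML fit of the final CO2 model
    m4CO2.reml <- update(m4CO2.nlme, method = "REML")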

Inference on Fixed Effects Using the LME Approximation:

• We present these results for the two-level model only.

Under the LME approximation, the distribution of the approximate (restricted) maximum likelihood estimator β̂ of the fixed effects is

$$\hat{\beta} \stackrel{\cdot}{\sim} N\left(\beta,\; \sigma^2\Big[\sum_{i=1}^{M} X_i^T\Sigma_i^{-1}X_i\Big]^{-1}\right), \qquad (\clubsuit)$$

where $\Sigma_i = Z_i\Delta^{-1}\Delta^{-T}Z_i^T + I$.

• In practice var(β̂) is estimated with

$$\widehat{\text{var}}(\hat{\beta}) = \sigma^2\Big[\sum_{i=1}^{M} X_i^T\Sigma_i^{-1}X_i\Big]^{-1}\bigg|_{\Delta = \hat{\Delta},\,\sigma^2 = \hat{\sigma}^2},$$

where ∆̂ and σ̂² are the ML or REML estimates of these parameters.

Standard errors of the components β̂j of β̂ are obtained as the square roots of the corresponding diagonal elements of $\widehat{\text{var}}(\hat{\beta})$.


The distributional result (♣) suggests that approximate Wald tests can be used for inference on β. In particular, an approximate α-level test of a hypothesis of the form H0 : Aβ = c, where A is k × dim(β), rejects H0 if

$$(A\hat{\beta} - c)^T\big\{A\,\widehat{\text{var}}(\hat{\beta})\,A^T\big\}^{-1}(A\hat{\beta} - c) > \chi^2_{1-\alpha}(k),$$

where χ²_{1−α}(k) is the upper αth critical value of a chi-square distribution on k df.

As a special case of this result, an approximate z test of H0 : βj = 0 versus H1 : βj ≠ 0 rejects H0 if

$$\frac{|\hat{\beta}_j|}{s.e.(\hat{\beta}_j)} > z_{1-\alpha/2},$$

where z_{1−α/2} is the (1 − α/2) quantile of a standard normal distribution. In addition, an approximate 100(1 − α)% CI for βj is given by

$$\hat{\beta}_j \pm z_{1-\alpha/2}\, s.e.(\hat{\beta}_j).$$
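These quantities are easy to assemble from a fitted nlme object; a sketch using the final CO2 model is given below (summary() and intervals() report the analogous t-based versions).

    ## Wald-type z statistics and 95% confidence intervals for the fixed effects
    est <- fixef(m4CO2.nlme)
    se  <- sqrt(diag(vcov(m4CO2.nlme)))   # square roots of diag of est. var(beta-hat)
    z   <- est / se                       # approximate z statistics for H0: beta_j = 0
    ci  <- cbind(lower = est - qnorm(0.975) * se,
                 upper = est + qnorm(0.975) * se)
    cbind(estimate = est, se = se, z = z, ci)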

• These Wald-based inferences are "approximately asymptotic". That is, their validity depends upon the accuracy of the LME approximation as an approximate version of ML (or REML) estimation, and on the usual asymptotic arguments for ML estimation that justify Wald-based inference as approximately valid in finite samples.

• Vonesh and Carter (1992) and Pinheiro and Bates (2000) have suggested that more accurate inferences can be accomplished by using F and t reference distributions in place of the χ² and z distributions given above.

• The idea here is to account for the fact that we're using the estimate of var(β̂) instead of var(β̂) itself to form our test statistics, and this substitution should introduce additional error into the sampling distributions of the test statistics.


An approximate F test of H0 : Aβ = c is based on the test statistic

$$\frac{(A\hat{\beta} - c)^T\big\{A\,\widehat{\text{var}}(\hat{\beta})\,A^T\big\}^{-1}(A\hat{\beta} - c)}{k} \;\stackrel{\cdot}{\sim}\; F(k, \nu).$$

In addition, H0 : βj = 0 can be tested via the test statistic

$$\frac{\hat{\beta}_j}{s.e.(\hat{\beta}_j)} \;\stackrel{\cdot}{\sim}\; t(\nu).$$

What is the appropriate choice for the denominator d.f. ν in these tests?

Unfortunately, there is not a definitive answer to that question yet.

• For the two-level model (i.e., a single level of clustering), Vonesh and Carter (1992) suggested ν = M − dim(β).

• For the two-level model, SAS' PROC NLMIXED uses M − q, where q is the dimension of the random effects vector bi.

• Pinheiro and Bates (2000, p.322) suggest that the same procedure as used to compute denominator d.f. for tests in the LMM be used in the NLMM. That procedure is described on p.91 of their book. However, in the nlme() function, the procedure from the LMM is not followed exactly. In the two-level model, the denominator d.f. computed by nlme() will always be n − M − (p − 1).

• None of these choices for ν can be justified rigorously, and it is not at all clear which is the best approach to use in practice. The Wald-based inferences are perhaps easiest to justify theoretically, but they will tend to be liberal (rejecting more often and giving tighter intervals) relative to the other methods.
