
Chapter 4

Generalized Least Squares Theory

In Section 3.6 we have seen that the classical conditions need not hold in practice. Although these conditions have no effect on the OLS method per se, they do affect the properties of the OLS estimators and resulting test statistics. In particular, when the elements of $y$ have unequal variances and/or are correlated, $\mathrm{var}(y)$ is no longer a scalar variance-covariance matrix, and hence there is no guarantee that the OLS estimator is the most efficient within the class of linear unbiased (or the class of unbiased) estimators. Moreover, hypothesis testing based on the standard OLS estimator of the variance-covariance matrix becomes invalid. In practice, we hardly know the true properties of $y$. It is therefore important to consider estimation that is valid when $\mathrm{var}(y)$ has a more general form.

In this chapter, the method of generalized least squares (GLS) is introduced to improve upon estimation efficiency when $\mathrm{var}(y)$ is not a scalar variance-covariance matrix. A drawback of the GLS method is that it is difficult to implement. In practice, certain structures (assumptions) must be imposed on $\mathrm{var}(y)$ so that a feasible GLS estimator can be computed. This approach results in two further difficulties, however. First, the postulated structures on $\mathrm{var}(y)$ need not be correctly specified. Consequently, the resulting feasible GLS estimator may not be as efficient as one would like. Second, the finite-sample properties of feasible GLS estimators are not easy to establish. Consequently, exact tests based on the feasible GLS estimation results are not readily available. More detailed discussion of the GLS theory can also be found in, e.g., Amemiya (1985) and Greene (2000).


    4.1 The Method of Generalized Least Squares

4.1.1 When y Does Not Have a Scalar Covariance Matrix

Given the linear specification (3.1):

$$ y = X\beta + e, $$

suppose that, in addition to the conditions [A1] and [A2](i),

$$ \mathrm{var}(y) = \Sigma_o, $$

where $\Sigma_o$ is a positive definite matrix but cannot be written as $\sigma^2_o I_T$ for any positive number $\sigma^2_o$. That is, the elements of $y$ may not have a constant variance, nor are they required to be uncorrelated. As [A1] and [A2](i) remain valid, the OLS estimator $\hat{\beta}_T$ is still unbiased by Theorem 3.4(a). It is also straightforward to verify that, in contrast with Theorem 3.4(c), the variance-covariance matrix of the OLS estimator is

$$ \mathrm{var}(\hat{\beta}_T) = (X'X)^{-1} X'\Sigma_o X\,(X'X)^{-1}. \tag{4.1} $$

In view of Theorem 3.5, there is no guarantee that the OLS estimator is the BLUE for $\beta_o$. Similarly, when [A3] fails such that

$$ y \sim N(X\beta_o, \Sigma_o), $$

we have

$$ \hat{\beta}_T \sim N\big(\beta_o,\; (X'X)^{-1}X'\Sigma_o X (X'X)^{-1}\big); $$

cf. Theorem 3.7(a). In this case, the OLS estimator $\hat{\beta}_T$ need not be the BUE for $\beta_o$.

Apart from efficiency, a more serious consequence of the failure of [A3] is that the statistical tests based on the standard OLS estimation results become invalid. Recall that the OLS estimator for $\mathrm{var}(\hat{\beta}_T)$ is

$$ \widehat{\mathrm{var}}(\hat{\beta}_T) = \hat{\sigma}^2_T (X'X)^{-1}, $$

which is, in general, a biased estimator for (4.1). As the t and F statistics depend on the elements of $\widehat{\mathrm{var}}(\hat{\beta}_T)$, they no longer have the desired t and F distributions under the null hypothesis. Consequently, the inferences based on these tests become invalid.


    4.1.2 The GLS Estimator

The GLS method focuses on the efficiency issue resulting from the failure of the classical condition [A2](ii). Let $G$ be a $T\times T$ non-stochastic matrix. Consider the transformed specification

$$ Gy = GX\beta + Ge, $$

where $Gy$ denotes the transformed dependent variable and $GX$ is the matrix of transformed explanatory variables. It can be seen that $GX$ also has full column rank $k$ provided that $G$ is nonsingular. Thus, the identification requirement for the specification (3.1) carries over under nonsingular transformations. It follows that $\beta$ can still be estimated by the OLS method using these transformed variables. The resulting OLS estimator is

$$ \hat{b}(G) = (X'G'GX)^{-1}X'G'Gy, \tag{4.2} $$

where the notation $\hat{b}(G)$ indicates that this estimator is a function of $G$.

Given that the original variables $y$ and $X$ satisfy [A1] and [A2](i), it is easily seen that the transformed variables $Gy$ and $GX$ also satisfy these two conditions because $GX$ is still non-stochastic and $\mathrm{IE}(Gy) = GX\beta_o$. Thus, the estimator $\hat{b}(G)$ must be unbiased for any nonstochastic and nonsingular $G$. A natural question then is: Can we find a transformation matrix that yields the most efficient estimator among all linear unbiased estimators? Observe that when $\mathrm{var}(y) = \Sigma_o$,

$$ \mathrm{var}(Gy) = G\Sigma_o G'. $$

If $G$ is such that $G\Sigma_o G' = \sigma^2_o I_T$ for some positive number $\sigma^2_o$, the condition [A2](ii) would also hold for the transformed specification. Given this $G$, it is now readily seen that the OLS estimator (4.2) is the BLUE for $\beta_o$ by Theorem 3.5. This shows that, as far as efficiency is concerned, one should choose $G$ as a nonstochastic and nonsingular matrix such that $G\Sigma_o G' = \sigma^2_o I_T$.

To find the desired transformation matrix $G$, note that $\Sigma_o$ is symmetric and positive definite so that it can be orthogonally diagonalized as $C'\Sigma_o C = \Lambda$, where $C$ is the matrix of eigenvectors corresponding to the matrix of eigenvalues $\Lambda$. For $\Sigma_o^{-1/2} = C\Lambda^{-1/2}C'$ (or $\Sigma_o^{-1/2} = \Lambda^{-1/2}C'$), we have

$$ \Sigma_o^{-1/2}\,\Sigma_o\,\Sigma_o^{-1/2\prime} = I_T. $$


This result immediately suggests that $G$ should be proportional to $\Sigma_o^{-1/2}$, i.e., $G = c\,\Sigma_o^{-1/2}$ for some constant $c$. Given this choice of $G$, we have

$$ \mathrm{var}(Gy) = G\Sigma_o G' = c^2 I_T, $$

a scalar covariance matrix, so that [A2](ii) holds. The estimator (4.2) with $G = c\,\Sigma_o^{-1/2}$ is known as the GLS estimator and can be expressed as

$$ \hat{\beta}_{GLS} = (c^2 X'\Sigma_o^{-1}X)^{-1}(c^2 X'\Sigma_o^{-1}y) = (X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y. \tag{4.3} $$

This estimator is, by construction, the BLUE for $\beta_o$ under [A1] and [A2](i). The GLS and OLS estimators are not equivalent in general, except in some exceptional cases; see, e.g., Exercise 4.1.
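As a concrete illustration, here is a minimal numpy sketch of (4.3); the function name, the simulated data, and the diagonal $\Sigma_o$ are our own choices, not part of the text:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimator (4.3): (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y."""
    Si = np.linalg.inv(Sigma)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# Hypothetical example: uncorrelated y_t with unequal, known variances.
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_o = np.array([1.0, 2.0])
sig = np.linspace(0.5, 3.0, T)              # std. dev. of each y_t
Sigma_o = np.diag(sig**2)
y = X @ beta_o + sig * rng.normal(size=T)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # unbiased but inefficient here
b_gls = gls(X, y, Sigma_o)                  # BLUE under [A1] and [A2](i)
```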

As the GLS estimator does not depend on $c$, it is without loss of generality to set $G = \Sigma_o^{-1/2}$. Given this choice of $G$, let $y^* = Gy$, $X^* = GX$, and $e^* = Ge$. The transformed specification is

$$ y^* = X^*\beta + e^*, \tag{4.4} $$

and the OLS estimator for this specification is the GLS estimator (4.3). Clearly, the GLS estimator is a minimizer of the following GLS criterion function:

$$ Q(\beta; \Sigma_o) = \frac{1}{T}(y^* - X^*\beta)'(y^* - X^*\beta) = \frac{1}{T}(y - X\beta)'\Sigma_o^{-1}(y - X\beta). \tag{4.5} $$

This criterion function is the average of a weighted sum of squared errors and hence a generalized version of the standard OLS criterion function (3.2).

Similar to the OLS method, define the vector of GLS fitted values as

$$ \hat{y}_{GLS} = X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y. $$

The vector of GLS residuals is

$$ \hat{e}_{GLS} = y - \hat{y}_{GLS}. $$

The fact that $X(X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}$ is idempotent but not symmetric immediately implies that $\hat{y}_{GLS}$ is an oblique (but not orthogonal) projection of $y$ onto $\mathrm{span}(X)$. It can also be verified that the vector of GLS residuals is not orthogonal to $X$ or any linear combination of the column vectors of $X$; i.e.,

$$ \hat{e}_{GLS}'X = y'\big[I_T - \Sigma_o^{-1}X(X'\Sigma_o^{-1}X)^{-1}X'\big]X \neq 0. $$

In fact, $\hat{e}_{GLS}$ is orthogonal to $\mathrm{span}(\Sigma_o^{-1}X)$. It follows that

$$ \hat{e}'\hat{e} \leq \hat{e}_{GLS}'\hat{e}_{GLS}. $$

That is, the OLS method still yields a better fit of the original data.


    4.1.3 Properties of the GLS Estimator

We have seen that the GLS estimator is, by construction, the BLUE for $\beta_o$ under [A1] and [A2](i). Its variance-covariance matrix is

$$ \mathrm{var}(\hat{\beta}_{GLS}) = \mathrm{var}\big((X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y\big) = (X'\Sigma_o^{-1}X)^{-1}. \tag{4.6} $$

These results are summarized below.

Theorem 4.1 (Aitken) Given the specification (3.1), suppose that [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$ is a positive definite matrix. Then $\hat{\beta}_{GLS}$ is the BLUE for $\beta_o$ with the variance-covariance matrix $(X'\Sigma_o^{-1}X)^{-1}$.

It follows from Theorem 4.1 that $\mathrm{var}(\hat{\beta}_T) - \mathrm{var}(\hat{\beta}_{GLS})$ must be a positive semi-definite matrix. This result can also be verified directly; see Exercise 4.3. For convenience, we introduce the following condition.

[A3'] $y \sim N(X\beta_o, \Sigma_o)$, where $\Sigma_o$ is a positive definite matrix.

The following result is an immediate consequence of Theorem 3.7(a).

Theorem 4.2 Given the specification (3.1), suppose that [A1] and [A3'] hold. Then

$$ \hat{\beta}_{GLS} \sim N\big(\beta_o,\; (X'\Sigma_o^{-1}X)^{-1}\big). $$

Moreover, if we believe that [A3'] is true, the log-likelihood function is

$$ \log L(\beta; \Sigma_o) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log(\det(\Sigma_o)) - \frac{1}{2}(y - X\beta)'\Sigma_o^{-1}(y - X\beta). \tag{4.7} $$

The first order condition of maximizing this log-likelihood function with respect to $\beta$ is

$$ X'\Sigma_o^{-1}(y - X\beta) = 0, $$

so that the MLE is

$$ \tilde{\beta}_T = (X'\Sigma_o^{-1}X)^{-1}X'\Sigma_o^{-1}y, $$

which is exactly the same as the GLS estimator. The information matrix is

$$ \mathrm{IE}\big[X'\Sigma_o^{-1}(y - X\beta)(y - X\beta)'\Sigma_o^{-1}X\big]\Big|_{\beta=\beta_o} = X'\Sigma_o^{-1}X, $$

and its inverse (the Cramér-Rao lower bound) is the variance-covariance matrix of the GLS estimator. We have established the following result.


Theorem 4.3 Given the specification (3.1), suppose that [A1] and [A3'] hold. Then $\hat{\beta}_{GLS}$ is the BUE for $\beta_o$.

Under the null hypothesis $R\beta_o = r$, it is readily seen from Theorem 4.2 that

$$ (R\hat{\beta}_{GLS} - r)'\big[R(X'\Sigma_o^{-1}X)^{-1}R'\big]^{-1}(R\hat{\beta}_{GLS} - r) \sim \chi^2(q). $$

The left-hand side above can serve as a test statistic for the linear hypothesis $R\beta_o = r$.

    4.1.4 FGLS Estimator

In practice, $\Sigma_o$ is typically unknown so that the GLS estimator is not available. Substituting an estimator $\hat{\Sigma}_T$ for $\Sigma_o$ in (4.3) yields the feasible generalized least squares (FGLS) estimator

$$ \hat{\beta}_{FGLS} = (X'\hat{\Sigma}_T^{-1}X)^{-1}X'\hat{\Sigma}_T^{-1}y, $$

which is readily computed from data. Note, however, that $\Sigma_o$ contains too many ($T(T+1)/2$) parameters. Proper estimation of $\Sigma_o$ would not be possible unless further restrictions on the elements of $\Sigma_o$ are imposed. Under different assumptions on $\mathrm{var}(y)$, $\Sigma_o$ has a simpler structure with much fewer (say, $p \ll T$) unknown parameters and may be properly estimated; see Sections 4.2 and 4.3. The properties of FGLS estimation crucially depend on these assumptions.

A clear disadvantage of the FGLS estimator is that its finite-sample properties are usually unknown. Note that $\hat{\Sigma}_T$ is, in general, a function of $y$, so that $\hat{\beta}_{FGLS}$ is a complex function of the elements of $y$. It is therefore difficult, if not impossible, to derive the finite-sample properties, such as expectation and variance, of $\hat{\beta}_{FGLS}$. Consequently, the efficiency gain of an FGLS estimator is not at all clear. Deriving exact tests based on the FGLS estimator is also a formidable job. One must rely on the asymptotic properties of $\hat{\beta}_{FGLS}$ to draw statistical inferences.

4.2 Heteroskedasticity

In this section, we consider a simpler structure of $\Sigma_o$ such that $\Sigma_o$ is diagonal with possibly different diagonal elements:

$$ \Sigma_o = \mathrm{diag}[\sigma^2_1, \ldots, \sigma^2_T] = \begin{bmatrix} \sigma^2_1 & 0 & \cdots & 0 \\ 0 & \sigma^2_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_T \end{bmatrix}, \tag{4.8} $$


where diag is the operator that puts its arguments on the main diagonal of a matrix. Given this $\Sigma_o$, the elements of $y$ are uncorrelated but may have different variances. When $y_t$, $t = 1, \ldots, T$, have a constant variance, they are said to be homoskedastic; otherwise, they are heteroskedastic.

To compute the GLS estimator, the desired transformation matrix is

$$ \Sigma_o^{-1/2} = \begin{bmatrix} \sigma_1^{-1} & 0 & \cdots & 0 \\ 0 & \sigma_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_T^{-1} \end{bmatrix}. $$

As $\Sigma_o$ still contains $T$ unknown parameters, an even simpler structure of $\Sigma_o$ is needed to ensure proper FGLS estimation.

    4.2.1 Tests for Heteroskedasticity

It is clear that the OLS method would prevail unless there is evidence that $\Sigma_o \neq \sigma^2_o I_T$. We therefore first study tests of the null hypothesis of homoskedasticity against some form of heteroskedasticity. Such tests are usually based on simplified parametric specifications of $\mathrm{var}(y_t)$.

The simplest possible form of heteroskedastic $y_t$ is groupwise heteroskedasticity. Suppose that data can be classified into two groups: group one contains $T_1$ observations with the constant variance $\sigma^2_1$, and group two contains $T_2$ observations with the constant variance $\sigma^2_2$. This assumption simplifies $\Sigma_o$ in (4.8) to a matrix of only two unknown parameters:

$$ \Sigma_o = \begin{bmatrix} \sigma^2_1 I_{T_1} & 0 \\ 0 & \sigma^2_2 I_{T_2} \end{bmatrix}. \tag{4.9} $$

The null hypothesis of homoskedasticity is $\sigma^2_1 = \sigma^2_2 = \sigma^2_o$; the alternative hypothesis is, without loss of generality, $\sigma^2_1 > \sigma^2_2$.

Consider now two regressions based on the observations of group one and group two, respectively. Let $\hat{\sigma}^2_{T_1}$ and $\hat{\sigma}^2_{T_2}$ denote the resulting OLS variance estimates. Intuitively, whether $\hat{\sigma}^2_{T_1}$ is close to $\hat{\sigma}^2_{T_2}$ constitutes evidence for or against the null hypothesis. Under [A1] and [A3'] with (4.9),

$$ (T_1-k)\hat{\sigma}^2_{T_1}/\sigma^2_1 \sim \chi^2(T_1-k), \qquad (T_2-k)\hat{\sigma}^2_{T_2}/\sigma^2_2 \sim \chi^2(T_2-k), $$


by Theorem 3.7(b). When $y_t$ are independent, these two $\chi^2$ random variables are also mutually independent. Note that $\hat{\sigma}^2_{T_1}$ and $\hat{\sigma}^2_{T_2}$ must be computed from separate regressions so as to ensure independence. Under the null hypothesis,

$$ \varphi := \frac{\hat{\sigma}^2_{T_1}}{\hat{\sigma}^2_{T_2}} = \frac{\big[(T_1-k)\hat{\sigma}^2_{T_1}/\sigma^2_o\big]/(T_1-k)}{\big[(T_2-k)\hat{\sigma}^2_{T_2}/\sigma^2_o\big]/(T_2-k)} \sim F(T_1-k,\, T_2-k). $$

Thus, $\varphi = \hat{\sigma}^2_{T_1}/\hat{\sigma}^2_{T_2}$ is the F test for groupwise heteroskedasticity.
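As a sketch (all names ours), the statistic can be computed from two separate OLS regressions, assuming the data have already been split into the two groups:

```python
import numpy as np
from scipy.stats import f as f_dist

def ols_var(X, y):
    """OLS variance estimate e'e / (T - k) from one regression."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return (e @ e) / (len(y) - X.shape[1])

def groupwise_f(X1, y1, X2, y2):
    """F statistic and p-value for H0: sigma_1^2 = sigma_2^2
    against the one-sided alternative sigma_1^2 > sigma_2^2."""
    k = X1.shape[1]
    stat = ols_var(X1, y1) / ols_var(X2, y2)
    return stat, f_dist.sf(stat, len(y1) - k, len(y2) - k)
```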

More generally, the variances of $y_t$ may be changing with the values of a particular explanatory variable, say $x_j$. That is, for some constant $c > 0$,

$$ \sigma^2_t = c\,x_{tj}^2; $$

the larger the magnitude of $x_{tj}$, the greater is $\sigma^2_t$. An interesting feature of this specification is that $\sigma^2_t$ may take distinct values for every $t$, yet $\Sigma_o$ contains only one unknown parameter $c$. The null hypothesis is then $\sigma^2_t = \sigma^2_o$ for all $t$, and the alternative hypothesis is, without loss of generality,

$$ \sigma^2_{(1)} \geq \sigma^2_{(2)} \geq \cdots \geq \sigma^2_{(T)}, $$

where $\sigma^2_{(i)}$ denotes the $i$th largest variance. The so-called Goldfeld-Quandt test is of the same form as the F test for groupwise heteroskedasticity but with the following data grouping procedure.

(1) Rearrange observations according to the values of some explanatory variable $x_j$ in descending order.

(2) Divide the rearranged data set into three groups with $T_1$, $T_m$, and $T_2$ observations, respectively.

(3) Drop the $T_m$ observations in the middle group and perform separate OLS regressions using the data in the first and third groups.

(4) The statistic is the ratio of the variance estimates:

$$ \hat{\sigma}^2_{T_1}/\hat{\sigma}^2_{T_2} \sim F(T_1-k,\, T_2-k). $$

If the data are rearranged according to the values of $x_j$ in ascending order, the resulting statistic should be computed as

$$ \hat{\sigma}^2_{T_2}/\hat{\sigma}^2_{T_1} \sim F(T_2-k,\, T_1-k). $$
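Continuing the previous sketch (reusing `ols_var` and `groupwise_f`; the drop fraction and the equal-tail split are our own choices), the Goldfeld-Quandt procedure may be coded as:

```python
def goldfeld_quandt(X, y, j, drop_frac=1/3):
    """Sort by the j-th regressor in descending order, drop the middle
    block, and compare the OLS variances of the two remaining tails."""
    order = np.argsort(-X[:, j])            # step (1): descending in x_j
    Xs, ys = X[order], y[order]
    T = len(y)
    T1 = (T - int(drop_frac * T)) // 2      # steps (2)-(3): T1 = T2 tails
    return groupwise_f(Xs[:T1], ys[:T1],    # step (4): variance ratio
                       Xs[T - T1:], ys[T - T1:])
```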


In a time-series study, if the variances are conjectured to be decreasing (increasing) over time, data rearrangement would not be needed. In the procedure for computing the Goldfeld-Quandt test, dropping observations in the middle group is meant to enhance the test's ability to discriminate between the variances of the first and third groups. A rule of thumb suggests that no more than one third of the observations should be dropped; it is also typical to set $T_1 \approx T_2$. Clearly, this test would be powerful provided that one can correctly identify the source of heteroskedasticity (i.e., the explanatory variable that determines the variances). On the other hand, finding such a variable may not be an easy job.

An even more general form of heteroskedastic covariance matrix is such that the diagonal elements are

$$ \sigma^2_t = h(\alpha_0 + z_t'\alpha_1), $$

where $h$ is some function and $z_t$ is a $p\times 1$ vector of exogenous variables affecting the variances of $y_t$. This assumption simplifies $\Sigma_o$ to a matrix of $p+1$ unknown parameters. Tests against this class of alternatives can be derived under the likelihood framework, and their distributions can only be analyzed asymptotically. This will not be discussed until Chapter 10.

    4.2.2 GLS Estimation

If the test for groupwise heteroskedasticity rejects the null hypothesis, one might believe that $\Sigma_o$ is given by (4.9). Accordingly, the specified linear specification may be written as

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}, $$

where $y_1$ is $T_1\times 1$, $y_2$ is $T_2\times 1$, $X_1$ is $T_1\times k$, and $X_2$ is $T_2\times k$. A transformed specification is

$$ \begin{bmatrix} y_1/\sigma_1 \\ y_2/\sigma_2 \end{bmatrix} = \begin{bmatrix} X_1/\sigma_1 \\ X_2/\sigma_2 \end{bmatrix}\beta + \begin{bmatrix} e_1/\sigma_1 \\ e_2/\sigma_2 \end{bmatrix}, $$

where the transformed $y_t$, $t = 1, \ldots, T$, have constant variance one. It follows that the GLS and FGLS estimators are, respectively,

$$ \hat{\beta}_{GLS} = \left(\frac{X_1'X_1}{\sigma^2_1} + \frac{X_2'X_2}{\sigma^2_2}\right)^{-1}\left(\frac{X_1'y_1}{\sigma^2_1} + \frac{X_2'y_2}{\sigma^2_2}\right), $$

$$ \hat{\beta}_{FGLS} = \left(\frac{X_1'X_1}{\hat{\sigma}^2_{T_1}} + \frac{X_2'X_2}{\hat{\sigma}^2_{T_2}}\right)^{-1}\left(\frac{X_1'y_1}{\hat{\sigma}^2_{T_1}} + \frac{X_2'y_2}{\hat{\sigma}^2_{T_2}}\right), $$


where $\hat{\sigma}^2_{T_1}$ and $\hat{\sigma}^2_{T_2}$ are, again, the OLS variance estimates obtained from separate regressions using the $T_1$ and $T_2$ observations, respectively. Observe that $\hat{\beta}_{FGLS}$ is not a linear estimator in $y$, so that its finite-sample properties are not known.

If the Goldfeld-Quandt test rejects the null hypothesis, one might believe that $\sigma^2_t = c\,x_{tj}^2$. A transformed specification is then

$$ \frac{y_t}{x_{tj}} = \beta_j + \beta_1\frac{1}{x_{tj}} + \cdots + \beta_{j-1}\frac{x_{t,j-1}}{x_{tj}} + \beta_{j+1}\frac{x_{t,j+1}}{x_{tj}} + \cdots + \beta_k\frac{x_{tk}}{x_{tj}} + \frac{e_t}{x_{tj}}, $$

where $\mathrm{var}(y_t/x_{tj}) = c := \sigma^2_o$. This is a very special case where the GLS estimator is readily computed as the OLS estimator for the transformed specification. Clearly, the validity of the GLS method crucially depends on whether the explanatory variable $x_j$ can be correctly identified.

    can be correctly identied.When 2t = h( 0 + z t 1), it is typically difficult to implement an FGLS estimator,

    especially when h is nonlinear. If h is the identity function, one may regress the squaredOLS residuals e2t on z t to obtain estimates for 0 and 1. Of course, certain con-straint must be imposed to ensure the tted values are non-negative. The nite-sampleproperties of this estimator are, again, difficult to analyze.
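When $h$ is the identity function, a two-step sketch of such an FGLS estimator (ours; the positivity floor is one crude way of imposing the non-negativity constraint mentioned above):

```python
import numpy as np

def fgls_linear_variance(X, y, Z, floor=1e-8):
    """Step 1: regress squared OLS residuals on [1, z_t] to estimate
    alpha_0 and alpha_1.  Step 2: weighted LS with fitted variances."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ b_ols) ** 2
    Z1 = np.column_stack([np.ones(len(y)), Z])
    alpha, *_ = np.linalg.lstsq(Z1, e2, rcond=None)
    s2 = np.maximum(Z1 @ alpha, floor)      # fitted variances, floored
    w = 1.0 / np.sqrt(s2)
    b, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return b
```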

    Remarks:

1. When a test for heteroskedasticity rejects the null hypothesis, there is no guarantee that the alternative hypothesis (say, groupwise heteroskedasticity) must be a correct description of $\mathrm{var}(y_t)$.

2. When a form of heteroskedasticity is incorrectly specified, it is also possible that the resulting FGLS estimator is less efficient than the OLS estimator.

3. As discussed in Section 4.1.3, the finite-sample properties of FGLS estimators, and hence the exact tests, are usually not available. One may appeal to asymptotic theory to construct proper tests.

    4.3 Serial Correlation

Another leading example in which $\mathrm{var}(y) \neq \sigma^2_o I_T$ is when the elements of $y$ are correlated so that the off-diagonal elements of $\Sigma_o$ are non-zero. This phenomenon is more common in time series data, though it is not necessarily so. When time series data $y_t$ are correlated over time, they are said to exhibit serial correlation. For cross-section data, the correlations of $y_t$ are usually referred to as spatial correlation. Our discussion in this section will concentrate on serial correlation only.


    4.3.1 A Simple Model of Serial Correlation

Consider time series $y_t$, $t = 1, \ldots, T$, with the constant variance $\sigma^2_o$. The correlation coefficient between $y_t$ and $y_{t-i}$ is

$$ \mathrm{corr}(y_t, y_{t-i}) = \frac{\mathrm{cov}(y_t, y_{t-i})}{\sqrt{\mathrm{var}(y_t)}\,\sqrt{\mathrm{var}(y_{t-i})}} = \frac{\mathrm{cov}(y_t, y_{t-i})}{\sigma^2_o}, $$

for $i = 0, 1, 2, \ldots, t-1$; in particular, $\mathrm{corr}(y_t, y_t) = 1$. Such correlations are also known as the autocorrelations of $y_t$. Similarly, $\mathrm{cov}(y_t, y_{t-i})$, $i = 0, 1, 2, \ldots, t-1$, are known as the autocovariances of $y_t$.

A very simple specification of autocovariances is

$$ \mathrm{cov}(y_t, y_{t-i}) = \mathrm{cov}(y_t, y_{t+i}) = c^i\sigma^2_o, $$

where $c$ is a constant such that $|c| < 1$. Hence, $\mathrm{corr}(y_t, y_{t-i}) = c^i$. In this specification, the autocovariances and autocorrelations depend only on $i$, the number of time periods between two observations, but not on $t$. Moreover, the correlations between two observations decay exponentially fast as $i$ increases. Hence, closer observations have higher correlations; two observations would have little correlation if they are separated by a sufficiently long period. Equivalently, we may write

$$ \mathrm{cov}(y_t, y_{t-i}) = c\,\mathrm{cov}(y_t, y_{t-i+1}). $$

Letting $\mathrm{corr}(y_t, y_{t-i}) = \rho_i$, we have

$$ \rho_i = c\,\rho_{i-1}. \tag{4.10} $$

From this recursion we immediately see that $c = \rho_1$, which must be bounded between $-1$ and $1$. It follows that $\mathrm{var}(y)$ is

$$ \Sigma_o = \sigma^2_o\begin{bmatrix} 1 & \rho_1 & \rho_1^2 & \cdots & \rho_1^{T-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_1^{T-2} \\ \rho_1^2 & \rho_1 & 1 & \cdots & \rho_1^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_1^{T-1} & \rho_1^{T-2} & \rho_1^{T-3} & \cdots & 1 \end{bmatrix}. \tag{4.11} $$

To avoid singularity, $\rho_1$ cannot be $\pm 1$.

A novel feature of this specification is that, while permitting non-zero off-diagonal elements of $\Sigma_o$, it involves only two unknown parameters: $\sigma^2_o$ and $\rho_1$. The transformation matrix is then


$$ \Sigma_o^{-1/2} = \frac{1}{\sigma_o}\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} & 0 & \cdots & 0 & 0 \\ 0 & \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \frac{-\rho_1}{\sqrt{1-\rho_1^2}} & \frac{1}{\sqrt{1-\rho_1^2}} \end{bmatrix}. $$

Note that this choice of $\Sigma_o^{-1/2}$ is not symmetric. As any matrix that is a constant proportion of $\Sigma_o^{-1/2}$ can also serve as a transformation matrix for GLS estimation, the so-called Cochrane-Orcutt transformation is based on

$$ V_o^{-1/2} = \sigma_o(1-\rho_1^2)^{1/2}\,\Sigma_o^{-1/2} = \begin{bmatrix} \sqrt{1-\rho_1^2} & 0 & 0 & \cdots & 0 & 0 \\ -\rho_1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho_1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & 0 & \cdots & -\rho_1 & 1 \end{bmatrix}, $$

which depends only on $\rho_1$.

The data from the Cochrane-Orcutt transformation are

$$ y_1^* = (1-\rho_1^2)^{1/2}y_1, \qquad x_1^* = (1-\rho_1^2)^{1/2}x_1, $$
$$ y_t^* = y_t - \rho_1 y_{t-1}, \qquad x_t^* = x_t - \rho_1 x_{t-1}, \qquad t = 2, \ldots, T, $$

where $x_t$ is the $t$th column of $X'$. It is then clear that

$$ \mathrm{var}(y_1^*) = (1-\rho_1^2)\sigma^2_o, $$
$$ \mathrm{var}(y_t^*) = \sigma^2_o + \rho_1^2\sigma^2_o - 2\rho_1^2\sigma^2_o = (1-\rho_1^2)\sigma^2_o, \qquad t = 2, \ldots, T. $$

Moreover, for each $i$,

$$ \mathrm{cov}(y_t^*, y_{t-i}^*) = \mathrm{cov}(y_t, y_{t-i}) - \rho_1\,\mathrm{cov}(y_{t-1}, y_{t-i}) - \rho_1\,\mathrm{cov}(y_t, y_{t-i-1}) + \rho_1^2\,\mathrm{cov}(y_{t-1}, y_{t-i-1}) = 0. $$

Hence, the transformed variables $y_t^*$ satisfy the classical conditions, as they ought to. Then, provided that $\rho_1$ is known, regressing $y_t^*$ on $x_t^*$ yields the GLS estimator for $\beta_o$.
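A sketch of this GLS estimation for known $\rho_1$ (ours; it keeps the first observation, i.e., the full transformation above rather than the dropped-first-observation variant):

```python
import numpy as np

def ar1_transform(X, y, rho):
    """Apply V_o^{-1/2}: scale the first observation by sqrt(1 - rho^2)
    and quasi-difference the rest (y_t - rho*y_{t-1}, x_t - rho*x_{t-1})."""
    Xs = np.vstack([np.sqrt(1 - rho**2) * X[:1], X[1:] - rho * X[:-1]])
    ys = np.concatenate([np.sqrt(1 - rho**2) * y[:1], y[1:] - rho * y[:-1]])
    return Xs, ys

def gls_ar1(X, y, rho):
    """GLS with AR(1) disturbances and known rho: OLS on transformed data."""
    Xs, ys = ar1_transform(X, y, rho)
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return b
```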


    4.3.2 An Alternative View

There is an alternative approach to generating the variance-covariance matrix (4.11). Under [A2](i), let

$$ \epsilon := y - X\beta_o $$

denote the deviations of $y$ from its mean vector. This vector is usually referred to as the vector of disturbances. Note that $\epsilon$ is not the same as the residual vector $\hat{e}$. While the former is not observable because $\beta_o$ is unknown, the latter is obtained from OLS estimation and hence observable. Under [A2], $\mathrm{IE}(\epsilon) = 0$ and

$$ \mathrm{var}(y) = \mathrm{var}(\epsilon) = \mathrm{IE}(\epsilon\epsilon'). $$

The variance and covariance structure of $y$ is thus the same as that of $\epsilon$.

A time series is said to be weakly stationary if its mean, variance, and autocovariances are all independent of the time index $t$. Thus, a weakly stationary series cannot exhibit trending behavior, nor can it generate very large fluctuations. In particular, a time series with zero mean, a constant variance, and zero autocovariances is weakly stationary; such a process is also known as a white noise. Let $\{u_t\}$ be a white noise with $\mathrm{IE}(u_t) = 0$, $\mathrm{IE}(u_t^2) = \sigma^2_u$, and $\mathrm{IE}(u_t u_\tau) = 0$ for $t \neq \tau$. Now suppose that the elements of $\epsilon$ are generated as a weakly stationary AR(1) process (autoregressive process of order 1):

$$ \epsilon_t = \rho_1\epsilon_{t-1} + u_t, \tag{4.12} $$

with $\epsilon_0 = 0$. By recursive substitution, (4.12) can be expressed as

$$ \epsilon_t = \sum_{i=0}^{t-1}\rho_1^i u_{t-i}, \tag{4.13} $$

a weighted sum of current and previous random innovations (shocks).

It follows from (4.13) that $\mathrm{IE}(\epsilon_t) = 0$ and $\mathrm{IE}(u_t\epsilon_{t-s}) = 0$ for all $t$ and $s \geq 1$. By weak stationarity, $\mathrm{var}(\epsilon_t)$ is a constant, so that for all $t$,

$$ \mathrm{var}(\epsilon_t) = \rho_1^2\,\mathrm{var}(\epsilon_{t-1}) + \sigma^2_u = \sigma^2_u/(1-\rho_1^2). $$

Clearly, the right-hand side would not be meaningful unless $|\rho_1| < 1$. The autocovariance of $\epsilon_t$ and $\epsilon_{t-1}$ is, by weak stationarity,

$$ \mathrm{IE}(\epsilon_t\epsilon_{t-1}) = \rho_1\,\mathrm{IE}(\epsilon_{t-1}^2) = \rho_1\frac{\sigma^2_u}{1-\rho_1^2}. $$


This shows that

$$ \mathrm{corr}(\epsilon_t, \epsilon_{t-1}) = \mathrm{corr}(y_t, y_{t-1}) = \rho_1. $$

Similarly,

$$ \mathrm{IE}(\epsilon_t\epsilon_{t-2}) = \rho_1\,\mathrm{IE}(\epsilon_{t-1}\epsilon_{t-2}) = \rho_1^2\frac{\sigma^2_u}{1-\rho_1^2}, $$

so that

$$ \mathrm{corr}(\epsilon_t, \epsilon_{t-2}) = \rho_1\,\mathrm{corr}(\epsilon_t, \epsilon_{t-1}) = \rho_1^2. $$

More generally, we can write, for $i = 1, 2, \ldots,$

$$ \mathrm{corr}(\epsilon_t, \epsilon_{t-i}) = \rho_1\,\mathrm{corr}(\epsilon_t, \epsilon_{t-i+1}) = \rho_1^i, $$

which depends only on $i$, the time difference between two $\epsilon$'s, but not on $t$. This is precisely what we postulated in (4.10). The variance-covariance matrix $\Sigma_o$ under this structure is also (4.11) with $\sigma^2_o = \sigma^2_u/(1-\rho_1^2)$.

The AR(1) structure of disturbances also permits a straightforward extension. Consider disturbances that are generated as an AR(p) process (autoregressive process of order p):

$$ \epsilon_t = \rho_1\epsilon_{t-1} + \cdots + \rho_p\epsilon_{t-p} + u_t, \tag{4.14} $$

where the coefficients $\rho_1, \ldots, \rho_p$ should also be restricted to ensure weak stationarity; we omit the details. Of course, $\epsilon_t$ may follow different structures and still be serially correlated. For example, $\epsilon_t$ may be generated as an MA(1) process (moving average process of order 1):

$$ \epsilon_t = u_t + \alpha_1 u_{t-1}, \qquad |\alpha_1| < 1, $$

where $\{u_t\}$ is a white noise; see, e.g., Exercise 4.6.

    4.3.3 Tests for AR(1) Disturbances

As the AR(1) structure of disturbances is one of the most commonly used specifications of serial correlation, we now consider tests of the null hypothesis of no serial correlation ($\rho_1 = 0$) against AR(1) disturbances. We discuss only the celebrated Durbin-Watson test and Durbin's h test; the discussion of other large-sample tests will be deferred to Chapter 6.


In view of the AR(1) structure, a natural estimator of $\rho_1$ is the OLS estimator obtained from regressing the OLS residual $\hat{e}_t$ on its immediate lag $\hat{e}_{t-1}$:

$$ \hat{\rho}_T = \frac{\sum_{t=2}^T \hat{e}_t\hat{e}_{t-1}}{\sum_{t=2}^T \hat{e}_{t-1}^2}. \tag{4.15} $$

The Durbin-Watson statistic is

$$ d = \frac{\sum_{t=2}^T(\hat{e}_t - \hat{e}_{t-1})^2}{\sum_{t=1}^T \hat{e}_t^2}. $$

When the sample size $T$ is large, it can be seen that

$$ d = 2 - 2\hat{\rho}_T\,\frac{\sum_{t=2}^T \hat{e}_{t-1}^2}{\sum_{t=1}^T \hat{e}_t^2} - \frac{\hat{e}_1^2 + \hat{e}_T^2}{\sum_{t=1}^T \hat{e}_t^2} \approx 2(1-\hat{\rho}_T). $$

For $0 < \hat{\rho}_T \leq 1$ ($-1 \leq \hat{\rho}_T < 0$), the Durbin-Watson statistic is such that $0 \leq d < 2$ ($2 < d \leq 4$), which suggests that there is some positive (negative) serial correlation. Hence, this test essentially checks whether $\hat{\rho}_T$ is sufficiently close to zero (i.e., whether $d$ is close to 2).

A major difficulty of the Durbin-Watson test is that the exact null distribution of $d$ depends on the matrix $X$ and therefore varies with the data. (Recall that the t and F tests discussed in Section 3.3.1 have, respectively, t and F distributions regardless of the data $X$.) This drawback prevents us from tabulating the critical values of $d$. Nevertheless, it has been shown that the null distribution of $d$ lies between the distributions of a lower bound ($d_L$) and an upper bound ($d_U$) in the sense that, for any significance level $\alpha$,

$$ d_{L,\alpha}^* < d_\alpha^* < d_{U,\alpha}^*, $$

where $d_\alpha^*$, $d_{L,\alpha}^*$, and $d_{U,\alpha}^*$ denote, respectively, the critical values of $d$, $d_L$, and $d_U$. Although the distribution of $d$ is data dependent, the distributions of $d_L$ and $d_U$ are not. Thus, the critical values $d_{L,\alpha}^*$ and $d_{U,\alpha}^*$ can be tabulated. One may then rely on these critical values to construct a conservative decision rule.

Specifically, when the alternative hypothesis is $\rho_1 > 0$ ($\rho_1 < 0$), the decision rule of the Durbin-Watson test is:

(1) Reject the null if $d < d_{L,\alpha}^*$ ($d > 4 - d_{L,\alpha}^*$).

(2) Do not reject the null if $d > d_{U,\alpha}^*$ ($d < 4 - d_{U,\alpha}^*$).


(3) The test is inconclusive if $d_{L,\alpha}^* < d < d_{U,\alpha}^*$ ($4 - d_{L,\alpha}^* > d > 4 - d_{U,\alpha}^*$).

This is not completely satisfactory because the test may yield no conclusion. Some econometric packages, such as SHAZAM, now compute the exact Durbin-Watson distribution for each regression and report the exact p-values. When such a program is available, this test does not have to rely on the critical values of $d_L$ and $d_U$ and hence must be conclusive. Note that the tabulated critical values of the Durbin-Watson statistic are for specifications with a constant term; the critical values for specifications without a constant term can be found in Farebrother (1980).

Another problem with the Durbin-Watson statistic is that its null distribution holds only under the classical conditions [A1] and [A3]. In the time series context, it is quite common to include a lagged dependent variable as a regressor so that [A1] is violated. A leading example is the specification

$$ y_t = \beta_1 + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + \gamma y_{t-1} + e_t. $$

This model can also be derived from certain behavioral assumptions; see Exercise 4.7. It has been shown that the Durbin-Watson statistic under this specification is biased toward 2. That is, this test would not reject the null hypothesis even when serial correlation is present. On the other hand, Durbin's h test is designed specifically for specifications that contain a lagged dependent variable. Let $\hat{\gamma}_T$ be the OLS estimate of $\gamma$ and $\widehat{\mathrm{var}}(\hat{\gamma}_T)$ be the OLS estimate of $\mathrm{var}(\hat{\gamma}_T)$. The h statistic is

$$ h = \hat{\rho}_T\sqrt{\frac{T}{1 - T\,\widehat{\mathrm{var}}(\hat{\gamma}_T)}}, $$

and its asymptotic null distribution is $N(0, 1)$. A clear disadvantage of Durbin's h test is that it cannot be calculated when $\widehat{\mathrm{var}}(\hat{\gamma}_T) \geq 1/T$. This test can also be derived as a Lagrange Multiplier test under the likelihood framework; see Chapter 10.
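A sketch of the h statistic (ours), taking $\hat{\rho}_T$, the sample size, and the estimated variance of $\hat{\gamma}_T$ as inputs:

```python
import numpy as np

def durbin_h(rho_hat, T, var_gamma_hat):
    """Durbin's h; undefined when T * var(gamma_hat) >= 1."""
    if T * var_gamma_hat >= 1:
        raise ValueError("h cannot be computed: T * var(gamma_hat) >= 1")
    return rho_hat * np.sqrt(T / (1 - T * var_gamma_hat))
```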

If we have quarterly data and want to test for fourth-order serial correlation, the statistic analogous to the Durbin-Watson statistic is

$$ d_4 = \frac{\sum_{t=5}^T(\hat{e}_t - \hat{e}_{t-4})^2}{\sum_{t=1}^T \hat{e}_t^2}; $$

see Wallis (1972) for the corresponding critical values.

    4.3.4 FGLS Estimation

Recall that $\Sigma_o$ depends on two parameters, $\sigma^2_o$ and $\rho_1$. We may use a generic notation $\Sigma(\sigma^2, \rho)$ to denote this function of $\sigma^2$ and $\rho$; in particular, $\Sigma_o = \Sigma(\sigma^2_o, \rho_1)$. Similarly, we may also write $V(\rho)$ such that $V_o = V(\rho_1)$. The transformed data based on $V(\rho)^{-1/2}$ are

$$ y_1(\rho) = (1-\rho^2)^{1/2}y_1, \qquad x_1(\rho) = (1-\rho^2)^{1/2}x_1, $$
$$ y_t(\rho) = y_t - \rho y_{t-1}, \qquad x_t(\rho) = x_t - \rho x_{t-1}, \qquad t = 2, \ldots, T. $$

Hence, $y_t^* = y_t(\rho_1)$ and $x_t^* = x_t(\rho_1)$.

To obtain an FGLS estimator, we must first estimate $\rho_1$ by some estimator $\hat{\rho}_T$ and then construct the transformation matrix as $\hat{V}_T^{-1/2} = V(\hat{\rho}_T)^{-1/2}$. Here, $\hat{\rho}_T$ may be computed as in (4.15); other estimators for $\rho_1$ may also be used, e.g., $\check{\rho}_T = \hat{\rho}_T(T-k)/(T-1)$. The transformed data are then $y_t(\hat{\rho}_T)$ and $x_t(\hat{\rho}_T)$. An FGLS estimator is obtained by regressing $y_t(\hat{\rho}_T)$ on $x_t(\hat{\rho}_T)$. Such an estimator is known as the Prais-Winsten estimator, or the Cochrane-Orcutt estimator when the first observation is dropped in computation.

    The following iterative procedure is also commonly employed in practice.

(1) Perform OLS estimation and compute $\hat{\rho}_T$ as in (4.15) using the OLS residuals $\hat{e}_t$.

(2) Perform the Cochrane-Orcutt transformation based on $\hat{\rho}_T$ and compute the resulting FGLS estimate $\hat{\beta}_{FGLS}$ by regressing $y_t(\hat{\rho}_T)$ on $x_t(\hat{\rho}_T)$.

(3) Compute a new $\hat{\rho}_T$ as in (4.15) with $\hat{e}_t$ replaced by the FGLS residuals

$$ \hat{e}_{t,FGLS} = y_t - x_t'\hat{\beta}_{FGLS}. $$

(4) Repeat steps (2) and (3) until $\hat{\rho}_T$ converges numerically, i.e., until the values of $\hat{\rho}_T$ from two consecutive iterations differ by less than a pre-determined convergence criterion.

Note that steps (1) and (2) above already generate an FGLS estimator. More iterations do not improve the asymptotic properties of the resulting estimator but may have a significant effect in finite samples. This procedure can be extended easily to estimate specifications with higher-order AR disturbances.
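A sketch of this iterative procedure (ours; it reuses `rho_hat` from the earlier sketch and drops the first observation in the transformation, i.e., the Cochrane-Orcutt variant):

```python
import numpy as np

def iterated_cochrane_orcutt(X, y, tol=1e-8, max_iter=100):
    """Iterate between estimating rho from residuals (4.15) and
    re-estimating beta from quasi-differenced data, steps (1)-(4)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)               # step (1): OLS
    rho = rho_hat(y - X @ b)
    for _ in range(max_iter):
        Xs, ys = X[1:] - rho * X[:-1], y[1:] - rho * y[:-1]  # step (2)
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        rho_new = rho_hat(y - X @ b)                         # step (3)
        if abs(rho_new - rho) < tol:                         # step (4)
            break
        rho = rho_new
    return b, rho
```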

Alternatively, the Hildreth-Lu procedure adopts a grid search to find the $\rho_1 \in (-1, 1)$ that minimizes the sum of squared errors of the model. This procedure is computationally intensive, and it is difficult to implement when the $\epsilon_t$ have an AR(p) structure with $p > 2$.

In view of the log-likelihood function (4.7), we must compute $\det(\Sigma_o)$. Clearly,

$$ \det(\Sigma_o) = \frac{1}{\det(\Sigma_o^{-1})} = \frac{1}{[\det(\Sigma_o^{-1/2})]^2}. $$


In terms of the notations in the AR(1) formulation, $\sigma^2_o = \sigma^2_u/(1-\rho_1^2)$, and

$$ \Sigma_o^{-1/2} = \frac{1}{\sigma_o\sqrt{1-\rho_1^2}}\,V_o^{-1/2} = \frac{1}{\sigma_u}\,V_o^{-1/2}. $$

As $\det(V_o^{-1/2}) = (1-\rho_1^2)^{1/2}$, we then have

$$ \det(\Sigma_o) = (\sigma^2_u)^T(1-\rho_1^2)^{-1}. $$

The log-likelihood function for given $\sigma^2_u$ and $\rho_1$ is

$$ \log L(\beta; \sigma^2_u, \rho_1) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2_u) + \frac{1}{2}\log(1-\rho_1^2) - \frac{1}{2\sigma^2_u}(y^* - X^*\beta)'(y^* - X^*\beta), $$

where $y^* = V_o^{-1/2}y$ and $X^* = V_o^{-1/2}X$. Clearly, when $\sigma^2_u$ and $\rho_1$ are known, the MLE of $\beta$ is just the GLS estimator.

    Clearly, when 2u and 1 are known, the MLE of is just the GLS estimator.

    If 2u and 1 are unknown, the log-likelihood function reads:

    log L( , 2, )

    = T 2

    log(2) T 2

    log(2) +12

    log(1 2) 1

    22(1 2)(y1 x 1 )2

    1

    22T

    t=2[(yt x t ) (yt 1 x t 1 )]2 ,

    which is a nonlinear function of the parameters. Nonlinear optimization methods (see,e.g., Section 8.2.2) are therefore needed to compute the MLEs of , 2 , and . Fora given , estimating by regressing et ( ) = yt x t on et 1( ) is equivalent tomaximizing the last term of the log-likelihood function above. This does not yield anMLE because the other terms involving , namely,

    12 log(1 2)

    122 (1 2)(y1 x 1 )2 ,

    have been ignored. This shows that the aforementioned iterative procedure does notresult in the MLEs.

Remark: Exact tests based on FGLS estimation results are not available because the finite-sample distribution of the FGLS estimator is, again, unknown. Asymptotic theory is needed to construct proper tests.


    4.4 Application: Linear Probability Model

In some applications researchers are interested in analyzing why consumers own a house or participate in a particular event. The ownership or the choice of participation is typically represented by a binary variable that takes the values one and zero. If the dependent variable in a linear regression is binary, we will see below that both the OLS and FGLS methods are not appropriate.

Let $x_t$ denote the $t$th column of $X'$. The $t$th observation of the linear specification $y = X\beta + e$ can be expressed as

$$ y_t = x_t'\beta + e_t. $$

For the binary dependent variable $y$ whose $t$th observation is $y_t = 1$ or $0$, we know that $\mathrm{IE}(y_t) = \mathrm{IP}(y_t = 1)$. Thus, $x_t'\beta$ is just a specification of the probability that $y_t = 1$. As such, the linear specification of binary dependent variables is usually referred to as the linear probability model.

When [A1] and [A2](i) hold for a linear probability model,

$$ \mathrm{IE}(y_t) = \mathrm{IP}(y_t = 1) = x_t'\beta_o, $$

and the OLS estimator is unbiased for $\beta_o$. Note, however, that the variance of $y_t$ is

$$ \mathrm{var}(y_t) = \mathrm{IP}(y_t = 1)\big[1 - \mathrm{IP}(y_t = 1)\big]. $$

Under [A1] and [A2](i),

$$ \mathrm{var}(y_t) = x_t'\beta_o(1 - x_t'\beta_o), $$

which varies with $x_t$. Thus, the linear probability model suffers from the problem of heteroskedasticity, and the OLS estimator is not the BLUE for $\beta_o$. Apart from the efficiency issue, the OLS method is still not appropriate for the linear probability model because the OLS fitted values need not be bounded between zero and one. When $x_t'\hat{\beta}_T$ is negative or greater than one, it cannot be interpreted as a probability.

Although the GLS estimator is the BLUE, it is not available because $\Sigma_o$, and hence $\mathrm{var}(y_t)$, is unknown. Nevertheless, if the $y_t$ are uncorrelated so that $\mathrm{var}(y)$ is diagonal, an FGLS estimator may be obtained using the transformation matrix

$$ \hat{\Sigma}_T^{-1/2} = \mathrm{diag}\Big[\big[x_1'\hat{\beta}_T(1-x_1'\hat{\beta}_T)\big]^{-1/2},\; \big[x_2'\hat{\beta}_T(1-x_2'\hat{\beta}_T)\big]^{-1/2},\; \ldots,\; \big[x_T'\hat{\beta}_T(1-x_T'\hat{\beta}_T)\big]^{-1/2}\Big], $$


where $\hat{\beta}_T$ is the OLS estimator of $\beta_o$. Such an estimator breaks down when $\hat{\Sigma}_T^{-1/2}$ cannot be computed (i.e., when some $x_t'\hat{\beta}_T$ is negative or greater than one). Even when $\hat{\Sigma}_T^{-1/2}$ is available, there is still no guarantee that the FGLS fitted values are bounded between zero and one. This shows that the FGLS method may not always be a solution when the OLS method fails.

This example also illustrates the importance of data characteristics in estimation and modeling. Without taking into account the binary nature of the dependent variable, even the FGLS method may be invalid. More appropriate methods for specifications with binary dependent variables will be discussed in Chapter 10.

4.5 Seemingly Unrelated Regressions

In many econometric practices, it is important to study the joint behavior of several dependent variables. For example, the input demands of a firm may be described using a system of linear regression functions in which each regression represents the demand function of a particular input. It is intuitively clear that these input demands ought to be analyzed jointly because they are related to each other.

Consider the specification of a system of $N$ equations, each with $k_i$ explanatory variables and $T$ observations. Specifically,

$$ y_i = X_i\beta_i + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.16} $$

where for each $i$, $y_i$ is $T\times 1$, $X_i$ is $T\times k_i$, and $\beta_i$ is $k_i\times 1$. The system (4.16) is also known as a specification of seemingly unrelated regressions (SUR). Stacking the equations of (4.16) yields

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_N \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{bmatrix}}_{\beta} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{e}. \tag{4.17} $$

This is a linear specification (3.1) with $k = \sum_{i=1}^N k_i$ explanatory variables and $TN$ observations. It is not too hard to see that the whole system (4.17) satisfies the identification requirement whenever every specification of (4.16) does.

Suppose that the classical conditions [A1] and [A2] hold for each specified linear regression in the system. Then under [A2](i), there exists a $\beta_o = (\beta_{o,1}', \ldots, \beta_{o,N}')'$ such


that $\mathrm{IE}(y) = X\beta_o$. The OLS estimator obtained from (4.17) is therefore unbiased. Note, however, that [A2](ii) for each linear regression ensures only that, for each $i$,

$$ \mathrm{var}(y_i) = \sigma^2_i I_T; $$

there is no restriction on the correlations between $y_i$ and $y_j$. The variance-covariance matrix of $y$ is then

$$ \mathrm{var}(y) = \Sigma_o = \begin{bmatrix} \sigma^2_1 I_T & \mathrm{cov}(y_1, y_2) & \cdots & \mathrm{cov}(y_1, y_N) \\ \mathrm{cov}(y_2, y_1) & \sigma^2_2 I_T & \cdots & \mathrm{cov}(y_2, y_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(y_N, y_1) & \mathrm{cov}(y_N, y_2) & \cdots & \sigma^2_N I_T \end{bmatrix}. \tag{4.18} $$

This shows that $y$, the vector of stacked dependent variables, violates [A2](ii), even when each individual dependent variable has a scalar variance-covariance matrix. Consequently, the OLS estimator of the whole system, $\hat{\beta}_{TN} = (X'X)^{-1}X'y$, is not the BLUE in general. In fact, owing to the block-diagonal structure of $X$, $\hat{\beta}_{TN}$ simply consists of $N$ equation-by-equation OLS estimators and hence ignores the correlations between equations and heteroskedasticity across equations.

In practice, it is also typical to postulate that for $i \neq j$,

$$ \mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T; $$

that is, $y_{it}$ and $y_{jt}$ are contemporaneously correlated, but $y_{it}$ and $y_{j\tau}$, $t \neq \tau$, are serially uncorrelated. Under this condition, (4.18) simplifies to $\Sigma_o = S_o\otimes I_T$ with

$$ S_o = \begin{bmatrix} \sigma^2_1 & \sigma_{12} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma^2_2 & \cdots & \sigma_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{N1} & \sigma_{N2} & \cdots & \sigma^2_N \end{bmatrix}. \tag{4.19} $$

As $\Sigma_o^{-1} = S_o^{-1}\otimes I_T$, the GLS estimator of (4.17) is

$$ \hat{\beta}_{GLS} = \big[X'(S_o^{-1}\otimes I_T)X\big]^{-1}X'(S_o^{-1}\otimes I_T)y, $$

and its covariance matrix is $[X'(S_o^{-1}\otimes I_T)X]^{-1}$.
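A sketch of this estimator for known $S_o$ (ours), exploiting $\Sigma_o^{-1} = S_o^{-1}\otimes I_T$:

```python
import numpy as np
from scipy.linalg import block_diag

def sur_gls(X_blocks, y_blocks, S):
    """SUR GLS with Sigma_o = S kron I_T; X_blocks and y_blocks are lists
    of the per-equation X_i (T x k_i) and y_i (length T)."""
    T = len(y_blocks[0])
    X = block_diag(*X_blocks)                  # block-diagonal stacked X
    y = np.concatenate(y_blocks)
    Oi = np.kron(np.linalg.inv(S), np.eye(T))  # Sigma_o^{-1}
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```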

It is readily verified that when $\sigma_{ij} = 0$ for all $i \neq j$, $S_o$ becomes a diagonal matrix, and so does $\Sigma_o$. Then, the resulting GLS estimator reduces to the OLS estimator. This should not be too surprising because estimating a SUR system would be unnecessary if the dependent variables are in fact uncorrelated. (Note that heteroskedasticity across equations does not affect this result.) If all equations in the system have the same regressors, i.e., $X_i = X_0$ (say), the GLS estimator is also the same as the OLS estimator; see, e.g., Exercise 4.8. More generally, it can be shown that there would not be much efficiency gain from GLS estimation if $y_i$ and $y_j$ are less correlated and/or $X_i$ and $X_j$ are highly correlated; see, e.g., Goldberger (1991, p. 328) for an illustrative example.

The FGLS estimator is now readily obtained by replacing $S_o^{-1}$ with $\hat{S}_{TN}^{-1}$, where $\hat{S}_{TN}$ is an $N\times N$ matrix computed as

$$ \hat{S}_{TN} = \frac{1}{T}\begin{bmatrix} \hat{e}_1' \\ \hat{e}_2' \\ \vdots \\ \hat{e}_N' \end{bmatrix}\begin{bmatrix} \hat{e}_1 & \hat{e}_2 & \cdots & \hat{e}_N \end{bmatrix}, $$

where $\hat{e}_i$ is the OLS residual vector of the $i$th equation. The elements of this matrix are

$$ \hat{\sigma}^2_i = \frac{\hat{e}_i'\hat{e}_i}{T}, \qquad i = 1, \ldots, N, $$
$$ \hat{\sigma}_{ij} = \frac{\hat{e}_i'\hat{e}_j}{T}, \qquad i \neq j, \quad i, j = 1, \ldots, N. $$

Note that $\hat{S}_{TN}$ is of an outer product form and hence a positive semi-definite matrix. One may also replace the denominator of $\hat{\sigma}^2_i$ with $T - k_i$ and the denominator of $\hat{\sigma}_{ij}$ with $T - \max(k_i, k_j)$. The resulting estimator $\hat{S}_{TN}$ need not be positive semi-definite, however.

Remark: The estimator $\hat{S}_{TN}$ mentioned above is valid provided that $\mathrm{var}(y_i) = \sigma^2_i I_T$ and $\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$. If these assumptions do not hold, FGLS estimation would be much more complicated. This may happen when heteroskedasticity and serial correlations are present in each equation, or when $\mathrm{cov}(y_{it}, y_{jt})$ changes over time.

    4.6 Models for Panel Data

A data set that contains a collection of cross-section units (individuals, families, firms, or countries), each with some time-series observations, is known as a panel data set. Well-known panel data sets in the U.S. include the National Longitudinal Survey of Labor Market Experience and the Michigan Panel Study of Income Dynamics. Building such data sets is very costly because they are obtained by tracking thousands of individuals through time. Some panel data may be easier to establish; for example, the GDP data for all G7 countries over 50 years also form a panel data set.

Panel data contain richer information than pure cross-section or time-series data. On one hand, panel data offer a description of the dynamics of each cross-section unit. On the other hand, panel data are able to reveal the variations of dynamic patterns across individual units. Thus, panel data may render more precise parameter estimates and permit analysis of topics that could not be studied using only cross-section or time-series data. For example, to study whether each individual unit $i$ has its own pattern, a specification with parameter(s) changing with $i$ is needed. Such specifications cannot be properly estimated using only cross-section data because the number of parameters must exceed the number of observations. The problem may be circumvented by using panel data. In what follows we will consider specifications for panel data that allow parameters to change across individual units.

    4.6.1 Fixed-Effects Model

Given a panel data set with $N$ cross-section units and $T$ observations, the linear specification allowing for individual effects is

$$ y_{it} = x_{it}'\beta_i + e_{it}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T, $$

where $x_{it}$ is $k\times 1$ and $\beta_i$ is the parameter vector depending only on $i$ but not on $t$. In this specification, individual effects are characterized by $\beta_i$, and there is no time-specific effect. This may be reasonable when a short time series is observed for each individual unit. Analogous to the notations in the SUR system (4.16), we can express the specification above as

$$ y_i = X_i\beta_i + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.20} $$

where $y_i$ is $T\times 1$, $X_i$ is $T\times k$, and $e_i$ is $T\times 1$. This is a system of equations with $kN$ parameters. Here, the dependent variable $y$ and explanatory variables $X$ are the same across individual units, such that $y_i$ and $X_i$ are simply their observations for each individual $i$. For example, $y$ may be the family consumption expenditure, and each $y_i$ contains family $i$'s annual consumption expenditures. By contrast, $y_i$ and $X_i$ may be different variables in a SUR system.

When $T$ is small (i.e., the observed time series are short), estimating (4.20) is not feasible. A simpler form of (4.20) is such that only the intercept changes with $i$ and the other parameters remain constant across $i$:

$$ y_i = \ell_T a_i + Z_i b + e_i, \qquad i = 1, 2, \ldots, N, \tag{4.21} $$

where $\ell_T$ is the $T$-dimensional vector of ones, $[\ell_T \;\; Z_i] = X_i$, and $[a_i \;\; b']' = \beta_i$. In (4.21), individual effects are completely captured by the intercept $a_i$. This specification simplifies (4.20) from $kN$ to $N + k - 1$ parameters and is known as the fixed-effects model. Stacking the $N$ equations in (4.21) together, we obtain

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} \ell_T & 0 & \cdots & 0 \\ 0 & \ell_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \ell_T \end{bmatrix}}_{D} \underbrace{\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix}}_{a} + \underbrace{\begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_N \end{bmatrix}}_{Z} b + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}}_{e}. \tag{4.22} $$

Clearly, this is still a linear specification with $N + k - 1$ explanatory variables and $TN$ observations. Note that each column of $D$ is in effect a dummy variable for the $i$th individual unit. In what follows, an individual unit will be referred to as a group.

The following notations will be used in the sequel. Let $Z_i'$ ($(k-1)\times T$) be the $i$th block of $Z'$ and $z_{it}$ be its $t$th column. For $z_{it}$, the $i$th group average over time is

$$ \bar{z}_i = \frac{1}{T}\sum_{t=1}^T z_{it} = \frac{1}{T}Z_i'\ell_T; $$

the $i$th group average of $y_{it}$ over time is

$$ \bar{y}_i = \frac{1}{T}\sum_{t=1}^T y_{it} = \frac{1}{T}\ell_T'y_i. $$

The overall sample average of $z_{it}$ (average over time and group) is

$$ \bar{z} = \frac{1}{TN}\sum_{i=1}^N\sum_{t=1}^T z_{it} = \frac{1}{TN}Z'\ell_{TN}, $$

and the overall sample average of $y_{it}$ is

$$ \bar{y} = \frac{1}{TN}\sum_{i=1}^N\sum_{t=1}^T y_{it} = \frac{1}{TN}\ell_{TN}'y. $$

Observe that the overall sample averages are

$$ \bar{z} = \frac{1}{N}\sum_{i=1}^N \bar{z}_i, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^N \bar{y}_i, $$


    which are the sample averages of group averages.

From (4.22) we can see that this specification satisfies the identification requirement [ID-1] provided that there is no time-invariant regressor (i.e., no column of $Z$ is a constant). Once this requirement is satisfied, the OLS estimator is readily computed. By Theorem 3.3, the OLS estimator for $b$ is

$$ \hat{b}_{TN} = \big[Z'(I_{TN} - P_D)Z\big]^{-1}Z'(I_{TN} - P_D)y, \tag{4.23} $$

where $P_D = D(D'D)^{-1}D'$ is a projection matrix. Thus, $\hat{b}_{TN}$ can be obtained by regressing $(I_{TN} - P_D)y$ on $(I_{TN} - P_D)Z$. Let $\hat{a}_{TN}$ denote the OLS estimator of the vector $a$ of individual effects. By the facts that

$$ D'y = D'D\hat{a}_{TN} + D'Z\hat{b}_{TN}, $$

and that the OLS residual vector is orthogonal to $D$, $\hat{a}_{TN}$ can be computed as

$$ \hat{a}_{TN} = (D'D)^{-1}D'(y - Z\hat{b}_{TN}). \tag{4.24} $$

We will present alternative expressions for these estimators which yield more intuitive interpretations.

Writing $D = I_N\otimes\ell_T$, we have

$$ P_D = (I_N\otimes\ell_T)\big[(I_N\otimes\ell_T')(I_N\otimes\ell_T)\big]^{-1}(I_N\otimes\ell_T') = (I_N\otimes\ell_T)\big[I_N\otimes(\ell_T'\ell_T)^{-1}\big](I_N\otimes\ell_T') = I_N\otimes\big[\ell_T(\ell_T'\ell_T)^{-1}\ell_T'\big] = I_N\otimes\ell_T\ell_T'/T, $$

where $\ell_T\ell_T'/T$ is also a projection matrix. Thus,

$$ I_{TN} - P_D = I_N\otimes(I_T - \ell_T\ell_T'/T), $$

and $(I_T - \ell_T\ell_T'/T)y_i = y_i - \ell_T\bar{y}_i$, with the $t$th element being $y_{it} - \bar{y}_i$. It follows that

$$ (I_{TN} - P_D)y = \begin{bmatrix} y_1 - \ell_T\bar{y}_1 \\ y_2 - \ell_T\bar{y}_2 \\ \vdots \\ y_N - \ell_T\bar{y}_N \end{bmatrix}, $$


which is the vector of all the deviations of $y_{it}$ from the group averages $\bar{y}_i$. Similarly,

$$ (I_{TN} - P_D)Z = \begin{bmatrix} Z_1 - \ell_T\bar{z}_1' \\ Z_2 - \ell_T\bar{z}_2' \\ \vdots \\ Z_N - \ell_T\bar{z}_N' \end{bmatrix}, $$

with the $t$th observation in the $i$th block being $(z_{it} - \bar{z}_i)'$, the deviation of $z_{it}$ from the group average $\bar{z}_i$. This shows that the OLS estimator (4.23) can be obtained by regressing $y_{it} - \bar{y}_i$ on $z_{it} - \bar{z}_i$ for $i = 1, \ldots, N$, and $t = 1, \ldots, T$. That is,

$$ \hat{b}_{TN} = \Big[\sum_{i=1}^N (Z_i - \ell_T\bar{z}_i')'(Z_i - \ell_T\bar{z}_i')\Big]^{-1}\Big[\sum_{i=1}^N (Z_i - \ell_T\bar{z}_i')'(y_i - \ell_T\bar{y}_i)\Big] = \Big[\sum_{i=1}^N\sum_{t=1}^T (z_{it} - \bar{z}_i)(z_{it} - \bar{z}_i)'\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T (z_{it} - \bar{z}_i)(y_{it} - \bar{y}_i)\Big]. \tag{4.25} $$

The estimator $\hat{b}_{TN}$ will be referred to as the within-groups estimator because it is based on the observations that are deviations from their own group averages, as shown in (4.25). It is also easily seen that the $i$th element of $\hat{a}_{TN}$ is

$$ \hat{a}_{TN,i} = \frac{1}{T}(\ell_T'y_i - \ell_T'Z_i\hat{b}_{TN}) = \bar{y}_i - \bar{z}_i'\hat{b}_{TN}, $$

which involves only group averages and the within-groups estimator. To distinguish $\hat{a}_{TN}$ and $\hat{b}_{TN}$ from other estimators, we will suppress their subscript $TN$ and denote them as $\hat{a}_w$ and $\hat{b}_w$.

Suppose that the classical conditions [A1] and [A2](i) hold for every equation $i$ in (4.21) so that

$$ \mathrm{IE}(y_i) = \ell_T a_{i,o} + Z_i b_o, \qquad i = 1, 2, \ldots, N. $$

Then, [A1] and [A2](i) also hold for the entire system (4.22) as

$$ \mathrm{IE}(y) = Da_o + Zb_o, $$

where the $i$th element of $a_o$ is $a_{i,o}$. Theorem 3.4(a) now ensures that $\hat{a}_w$ and $\hat{b}_w$ are unbiased for $a_o$ and $b_o$, respectively.

Suppose also that $\mathrm{var}(y_i) = \sigma^2_o I_T$ for every equation $i$ and that $\mathrm{cov}(y_i, y_j) = 0$ for every $i \neq j$. Under these assumptions, $\mathrm{var}(y)$ is the scalar variance-covariance matrix


$\sigma^2_o I_{TN}$. By the Gauss-Markov theorem, $\hat{a}_w$ and $\hat{b}_w$ are the BLUEs for $a_o$ and $b_o$, respectively. The variance-covariance matrix of the within-groups estimator is

$$ \mathrm{var}(\hat{b}_w) = \sigma^2_o\big[Z'(I_{TN} - P_D)Z\big]^{-1} = \sigma^2_o\Big[\sum_{i=1}^N\sum_{t=1}^T (z_{it} - \bar{z}_i)(z_{it} - \bar{z}_i)'\Big]^{-1}. $$

It is also easy to verify that the variance of the $i$th element of $\hat{a}_w$ is

$$ \mathrm{var}(\hat{a}_{w,i}) = \frac{1}{T}\sigma^2_o + \bar{z}_i'\big[\mathrm{var}(\hat{b}_w)\big]\bar{z}_i; \tag{4.26} $$

see Exercise 4.9. The OLS estimator for the regression variance $\sigma^2_o$ in this case is

$$ \hat{\sigma}^2_w = \frac{1}{TN - N - k + 1}\sum_{i=1}^N\sum_{t=1}^T (y_{it} - \hat{a}_{w,i} - z_{it}'\hat{b}_w)^2, $$

which can be used to compute the estimators of $\mathrm{var}(\hat{a}_{w,i})$ and $\mathrm{var}(\hat{b}_w)$.

It should be emphasized that the conditions $\mathrm{var}(y_i) = \sigma^2_o I_T$ for all $i$ and $\mathrm{cov}(y_i, y_j) = 0$ for every $i \neq j$ may be much too strong in applications. When any one of these conditions fails, $\mathrm{var}(y)$ cannot be written as $\sigma^2_o I_{TN}$, and $\hat{a}_w$ and $\hat{b}_w$ are no longer the BLUEs. Even though $\mathrm{var}(y)$ may not be a scalar variance-covariance matrix in practice, the fixed-effects model is typically estimated by the OLS method and hence is also known as the least squares dummy variable model. For GLS and FGLS estimation, see Exercise 4.10.

If [A3] holds for (4.22) such that

$$ y \sim N\big(Da_o + Zb_o,\; \sigma^2_o I_{TN}\big), $$

the t and F tests discussed in Section 3.3 remain applicable. An interesting hypothesis for the fixed-effects model is whether fixed (individual) effects indeed exist. This amounts to applying an F test to the hypothesis

$$ H_0:\; a_{1,o} = a_{2,o} = \cdots = a_{N,o}. $$

The null distribution of this F test is $F(N-1,\; TN - N - k + 1)$. In practice, it may be more convenient to estimate the following specification for the fixed-effects model:

$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \ell_T & 0 & \cdots & 0 \\ \ell_T & \ell_T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \ell_T & 0 & \cdots & \ell_T \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix} + \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_N \end{bmatrix} b + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}. \tag{4.27} $$


This specification is virtually the same as (4.22), yet the parameters $a_i$, $i = 2, \ldots, N$, now denote the differences between the $i$th and the first group effects. Testing the existence of fixed effects is then equivalent to testing

$$ H_0:\; a_{2,o} = \cdots = a_{N,o} = 0. $$

This can be easily done using an F test; see Exercise 4.11.

    4.6.2 Random-Effects Model

Given the specification (4.21) that allows for individual effects:

$$ y_i = \ell_T a_i + Z_i b + e_i, \qquad i = 1, 2, \ldots, N, $$

we now treat $a_i$ as random variables rather than parameters. Writing $a_i = a + u_i$ with $a = \mathrm{IE}(a_i)$, the specification above can be expressed as

$$ y_i = \ell_T a + Z_i b + (\ell_T u_i + e_i), \qquad i = 1, 2, \ldots, N, \tag{4.28} $$

where $\ell_T u_i$ and $e_i$ form the error term. This specification differs from the fixed-effects model in that the intercept does not vary across $i$. The presence of $u_i$ also makes (4.28) different from the specification that does not allow for individual effects. Here, the group heterogeneity due to individual effects is characterized by the random variables $u_i$ and absorbed into the error term. Thus, (4.28) is known as the random-effects model. If we apply the OLS method to (4.28), the OLS estimators of $b$ and $a$ are, respectively,

$$ \hat{b}_p = \Big[\sum_{i=1}^N\sum_{t=1}^T (z_{it} - \bar{z})(z_{it} - \bar{z})'\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T (z_{it} - \bar{z})(y_{it} - \bar{y})\Big], \tag{4.29} $$

and $\hat{a}_p = \bar{y} - \bar{z}'\hat{b}_p$. Compared with the within-groups estimator (4.25), $\hat{b}_p$ is based on the deviations of $y_{it}$ and $z_{it}$ from their respective overall averages $\bar{y}$ and $\bar{z}$, whereas $\hat{b}_w$ is based on the deviations from the group averages $\bar{y}_i$ and $\bar{z}_i$. Note that we have suppressed the subscript $TN$ for these two estimators. Alternatively, pre-multiplying $\ell_T'/T$ through equation (4.28) yields

$$ \bar{y}_i = a + \bar{z}_i'b + (u_i + \bar{e}_i), \qquad i = 1, 2, \ldots, N, \tag{4.30} $$

where $\bar{e}_i = \sum_{t=1}^T e_{it}/T$. By noting that the sample averages of the $\bar{y}_i$ and $\bar{z}_i$ are just $\bar{y}$ and $\bar{z}$, the OLS estimators for the specification (4.30) are

$$ \hat{b}_b = \Big[\sum_{i=1}^N (\bar{z}_i - \bar{z})(\bar{z}_i - \bar{z})'\Big]^{-1}\Big[\sum_{i=1}^N (\bar{z}_i - \bar{z})(\bar{y}_i - \bar{y})\Big], \tag{4.31} $$


and $\hat{a}_b = \bar{y} - \bar{z}'\hat{b}_b$. The estimator $\hat{b}_b$ is known as the between-groups estimator because it is based on the deviations of group averages from the overall averages. Here, we suppress the subscript $N$ for $\hat{a}_b$ and $\hat{b}_b$. It can also be shown that the estimator (4.29) is a weighted sum of the within-groups estimator (4.25) and the between-groups estimator (4.31); see Exercise 4.12. Thus, $\hat{b}_p$ is known as the pooled estimator.

Suppose that the classical conditions [A1] and [A2](i) hold for every equation $i$ such that

$$ \mathrm{IE}(y_i) = \ell_T a_o + Z_i b_o, \qquad i = 1, \ldots, N. $$

Then, [A1] and [A2](i) hold for the entire system of (4.28) as

$$ \mathrm{IE}(y) = \ell_{TN}a_o + Zb_o. $$

Moreover, they also hold for the specification (4.30) as

$$ \mathrm{IE}(\bar{y}_i) = a_o + \bar{z}_i'b_o, \qquad i = 1, \ldots, N. $$

It follows that $\hat{a}_b$ and $\hat{b}_b$, as well as $\hat{a}_p$ and $\hat{b}_p$, are unbiased for $a_o$ and $b_o$ in the random-effects model. Moreover, write

$$ y_i = \ell_T a_o + Z_i b_o + \epsilon_i^*, \tag{4.32} $$

where $\epsilon_i^*$ is the sum of two components: the random effects $\ell_T u_i$ and the disturbance $\epsilon_i$ for equation $i$. Then,

$$ \mathrm{var}(y_i) = \sigma^2_u\,\ell_T\ell_T' + \mathrm{var}(\epsilon_i) + 2\,\mathrm{cov}(\ell_T u_i, \epsilon_i), $$

where $\sigma^2_u$ is $\mathrm{var}(u_i)$. As the first term on the right-hand side above is a full matrix, $\mathrm{var}(y_i)$ is not a scalar variance-covariance matrix in general. Consequently, $\hat{a}_p$ and $\hat{b}_p$ are not the BLUEs. For the specification (4.30), $\hat{a}_b$ and $\hat{b}_b$ are not the BLUEs either, unless more stringent conditions are imposed.

Remark: If the fixed-effects model (4.21) is correct, the random-effects model (4.28) can be viewed as a specification that omits $N - 1$ relevant dummy variables. This implies that the pooled estimator $\hat{b}_p$ and the between-groups estimator $\hat{b}_b$ are biased for $b_o$ in the fixed-effects model. This should not be too surprising because, while there are $N + k - 1$ parameters in the fixed-effects model, the specification (4.30) only permits estimation of $k$ parameters. We therefore conclude that neither the between-groups estimator nor the pooled estimator is a proper choice for the fixed-effects model.


To perform FGLS estimation for the random-effects model, more conditions on $\mathrm{var}(y_i)$ are needed. If $\mathrm{var}(\epsilon_i) = \sigma_o^2 I_T$ and $\mathrm{IE}(u_i \epsilon_i) = 0$, $\mathrm{var}(y_i)$ has a simpler form:

$$S_o := \mathrm{var}(y_i) = \sigma_u^2\, \ell_T \ell_T' + \sigma_o^2 I_T.$$

Under the following additional conditions: $\mathrm{IE}(u_i u_j) = 0$, $\mathrm{IE}(u_i \epsilon_j) = 0$ and $\mathrm{IE}(\epsilon_i \epsilon_j') = 0$ for all $i \neq j$, we have $\mathrm{cov}(y_i, y_j) = 0$. Hence, $\mathrm{var}(y)$ simplifies to a block diagonal matrix:

$$\Sigma_o := \mathrm{var}(y) = I_N \otimes S_o,$$

which is not a scalar variance-covariance matrix unless $\sigma_u^2 = 0$. It can be verified that the desired transformation matrix for GLS estimation is $\Sigma_o^{-1/2} = I_N \otimes S_o^{-1/2}$, where

$$S_o^{-1/2} = I_T - \frac{c}{T}\, \ell_T \ell_T', \qquad c = 1 - \biggl(\frac{\sigma_o^2}{T \sigma_u^2 + \sigma_o^2}\biggr)^{1/2}.$$

The transformed data are then $S_o^{-1/2} y_i$ and $S_o^{-1/2} Z_i$, $i = 1, \ldots, N$, and their $t$-th elements are, respectively, $y_{it} - c\,\bar y_i$ and $z_{it} - c\,\bar z_i$. Regressing $y_{it} - c\,\bar y_i$ on $z_{it} - c\,\bar z_i$ gives the desired GLS estimator.
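As a quick check (a verification sketch, not part of the original text), let $P = \ell_T \ell_T'/T$ denote the orthogonal projection onto $\ell_T$, so that $S_o = (T\sigma_u^2 + \sigma_o^2) P + \sigma_o^2 (I_T - P)$ and $S_o^{-1/2} = I_T - cP$. Because $P$ is symmetric and idempotent,

$$(I_T - cP)\, S_o\, (I_T - cP)' = (1 - c)^2 (T\sigma_u^2 + \sigma_o^2)\, P + \sigma_o^2 (I_T - P) = \sigma_o^2 I_T,$$

where the last equality uses $(1 - c)^2 = \sigma_o^2/(T\sigma_u^2 + \sigma_o^2)$. The transformed data thus have a scalar variance-covariance matrix, so OLS on the transformed specification yields the GLS estimator.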

It can be shown that the GLS estimator is also a weighted average of the within-groups and between-groups estimators. For the special case that $\sigma_o^2 = 0$, we have $c = 1$ and

$$\Sigma_o^{-1/2} = I_N \otimes (I_T - \ell_T \ell_T'/T) = I_{TN} - P_D,$$

as in the fixed-effects model. In this case, the GLS estimator of $b$ is nothing but the within-groups estimator $\hat b_w$. When $c = 0$, the GLS estimator of $b$ reduces to the pooled estimator $\hat b_p$.

To compute the FGLS estimator, we must estimate the parameters $\sigma_u^2$ and $\sigma_o^2$ in $S_o$. We first eliminate the random effects $u_i$ by taking the difference of $y_i$ and $\ell_T \bar y_i$:

$$y_i - \ell_T \bar y_i = (Z_i - \ell_T \bar z_i')\, b_o + (\epsilon_i - \ell_T \bar\epsilon_i).$$

For this specification, the OLS estimator of $b_o$ is just the within-groups estimator $\hat b_w$. As the $u_i$ have been eliminated, we can estimate $\sigma_o^2$, the variance of $\epsilon_{it}$, by

$$\hat\sigma^2 = \frac{1}{TN - N - k + 1} \sum_{i=1}^{N} \sum_{t=1}^{T} \bigl[(y_{it} - \bar y_i) - (z_{it} - \bar z_i)'\hat b_w\bigr]^2,$$

which is also the variance estimator $\hat\sigma_w^2$ in the fixed-effects model. To estimate $\sigma_u^2$, note that under [A1] and [A2](i) we have

$$\bar y_i = a_o + \bar z_i' b_o + (u_i + \bar\epsilon_i), \qquad i = 1, 2, \ldots, N,$$


which corresponds to the specification (4.30) for computing the between-groups estimator. When [A2](ii) also holds for every $i$,

$$\mathrm{var}(u_i + \bar\epsilon_i) = \sigma_u^2 + \sigma_o^2 / T.$$

This variance may be estimated by $\sum_{i=1}^{N} \hat e_{b,i}^2/(N - k)$, where

$$\hat e_{b,i} = (\bar y_i - \bar y) - (\bar z_i - \bar z)'\hat b_b, \qquad i = 1, \ldots, N.$$

Consequently, the estimator for $\sigma_u^2$ is

$$\hat\sigma_u^2 = \frac{1}{N - k} \sum_{i=1}^{N} \hat e_{b,i}^2 - \frac{\hat\sigma^2}{T}.$$

The estimators $\hat\sigma_u^2$ and $\hat\sigma^2$ can now be used to construct the estimated transformation matrix $\hat S^{-1/2}$. It is clear that the FGLS estimator is, again, a very complex function of $y$.
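Putting the pieces together, the following minimal sketch (an illustration under the assumptions of this section, not code from the original text; all names are ours) estimates $\hat\sigma^2$ from the within-groups residuals and $\hat\sigma_u^2$ from the between-groups residuals, forms $\hat c$, and then runs OLS on the quasi-demeaned data $y_{it} - \hat c\,\bar y_i$ and $z_{it} - \hat c\,\bar z_i$:

```python
import numpy as np

def fgls_random_effects(y, z):
    """FGLS for a balanced random-effects panel (a sketch, not the text's code).
    y: (N, T) responses; z: (N, T, k) regressors, excluding the constant."""
    N, T, k = z.shape
    ybar_i, zbar_i = y.mean(axis=1), z.mean(axis=1)      # group averages
    ols = lambda Z, Y: np.linalg.lstsq(Z, Y, rcond=None)[0]

    # sigma^2 from within-groups residuals; the df matches the text's
    # TN - N - k + 1 once its k (which counts the intercept) is replaced
    # by our k + 1 count of intercept plus slopes.
    zw = (z - zbar_i[:, None, :]).reshape(-1, k)
    yw = (y - ybar_i[:, None]).reshape(-1)
    b_w = ols(zw, yw)
    sigma2 = ((yw - zw @ b_w) ** 2).sum() / (N * T - N - k)

    # sigma_u^2 from the between-groups regression of group means on a
    # constant and zbar_i; truncation at zero is a practical guard that
    # the text does not discuss.
    Zb = np.column_stack([np.ones(N), zbar_i])
    e_b = ybar_i - Zb @ ols(Zb, ybar_i)
    sigma2_u = max((e_b ** 2).sum() / (N - k - 1) - sigma2 / T, 0.0)

    # Quasi-demeaning with the estimated c, then OLS on transformed data;
    # note the constant regressor transforms to (1 - c).
    c = 1.0 - np.sqrt(sigma2 / (T * sigma2_u + sigma2))
    Zt = np.column_stack([np.full(N * T, 1.0 - c),
                          (z - c * zbar_i[:, None, :]).reshape(-1, k)])
    Yt = (y - c * ybar_i[:, None]).reshape(-1)
    return ols(Zt, Yt)                                   # [a_hat, b_hat]
```

Because $\hat\sigma_u^2$ is truncated at zero in this sketch, $\hat c = 0$ whenever the between-groups variation is no larger than $\hat\sigma^2/T$, and the procedure collapses to the pooled regression; conversely, as $T\hat\sigma_u^2$ grows relative to $\hat\sigma^2$, $\hat c$ approaches one and the FGLS estimator approaches the within-groups estimator, consistent with the special cases noted above.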

    4.7 Limitations of the FGLS Method

In this chapter we relax only the classical condition [A2](ii) while maintaining [A1] and [A2](i). The limitations of [A1] and [A2](i) discussed in Section 3.6 therefore still exist. In particular, stochastic regressors and nonlinear specifications are excluded in the present context.

Although the GLS and FGLS methods are designed to improve estimation efficiency when there is a non-scalar covariance matrix $\Sigma_o$, they also create further difficulties. First, the GLS estimator is usually not available because $\Sigma_o$ is unknown, except in some exceptional cases. Second, a convenient FGLS estimator is available only at the expense of imposing more conditions on $\Sigma_o$. If these simplifying conditions are incorrectly imposed, the resulting FGLS estimator may perform poorly. Third, the finite-sample properties of the FGLS estimator are typically unknown. In general, we do not know whether an FGLS estimator is unbiased, nor do we know its efficiency relative to the OLS estimator or its exact distribution. It is therefore difficult to draw statistical inferences from FGLS estimation results.

    Exercises

4.1 Given the linear specification $y = X\beta + e$, suppose that the conditions [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$. Suppose further that the matrix $X$ contains $k$ eigenvectors of $\Sigma_o$ which are normalized to unit length. What are the resulting $\hat\beta_T$ and $\hat\beta_{GLS}$? Explain your result.


4.2 For the specification (3.1) estimated by the GLS method, a natural goodness-of-fit measure is

$$\text{Centered } R^2_{GLS} = 1 - \frac{\hat e_{GLS}'\, \hat e_{GLS}}{\text{Centered TSS of } y},$$

where the denominator is the centered TSS of the original dependent variable $y$. Show that $R^2_{GLS}$ need not be bounded between zero and one.

4.3 Given the linear specification $y = X\beta + e$, suppose that the conditions [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$. Show directly that

$$\mathrm{var}(\hat\beta_T) - \mathrm{var}(\hat\beta_{GLS})$$

is a positive semi-definite matrix.

4.4 Given the linear specification $y = X\beta + e$, suppose that the conditions [A1] and [A2](i) hold and that $\mathrm{var}(y) = \Sigma_o$. Show that

$$\mathrm{cov}(\hat\beta_T, \hat\beta_{GLS}) = \mathrm{var}(\hat\beta_{GLS}).$$

Also find $\mathrm{cov}(\hat\beta_{GLS}, \hat\beta_{GLS} - \hat\beta_T)$.

4.5 Suppose that $y = X\beta_o + \epsilon$ and the elements of $\epsilon$ are $\epsilon_t = \alpha_1 \epsilon_{t-1} + u_t$, where $\alpha_1 = 1$ and $\{u_t\}$ is a white noise with mean zero and variance $\sigma_u^2$. What are the properties of $\epsilon_t$? Is $\{\epsilon_t\}$ still weakly stationary?

4.6 Suppose that $y = X\beta_o + \epsilon$ and the elements of $\epsilon$ are $\epsilon_t = u_t + \pi_1 u_{t-1}$, where $|\pi_1| < 1$ and $\{u_t\}$ is a white noise with mean zero and variance $\sigma_u^2$. Calculate the variance, autocovariances, and autocorrelations of $\epsilon_t$ and compare them with those of AR(1) disturbances.

4.7 Let $y_t$ denote investment expenditure that is determined by expected earning $x_t^*$:

$$y_t = a_o + b_o x_t^* + u_t.$$

When $x_t^*$ is adjusted adaptively:

$$x_t^* = x_{t-1}^* + (1 - \lambda_o)(x_t - x_{t-1}^*), \qquad 0 < \lambda_o < 1,$$

show that $y_t$ can be represented by a model with a lagged dependent variable and moving average disturbances.


4.8 Given the SUR specification (4.17), show that the GLS estimator is the same as the OLS estimator when $X_i = X_0$ for all $i$. Give an intuitive explanation of this result.

4.9 Given the fixed-effects model (4.21) for panel data, suppose that [A1] and [A2](i) hold for each group equation, $\mathrm{var}(y_i) = \sigma_o^2 I_T$, and $\mathrm{cov}(y_i, y_j) = 0$ for $i \neq j$. Prove (4.26).

4.10 Given the fixed-effects model (4.21) for panel data, suppose that [A1] and [A2](i) hold for each group equation, $\mathrm{var}(y_i) = \sigma_i^2 I_T$, and $\mathrm{cov}(y_i, y_j) = \sigma_{ij} I_T$ for $i \neq j$. What is $\mathrm{var}(y)$? Find the GLS estimator and propose an FGLS estimator.

4.11 Given the fixed-effects model (4.27) for panel data, consider testing the null hypothesis of no fixed effects. Write down the $F$ statistic that is based on the constrained and unconstrained $R^2$ and explain clearly how this test should be implemented.

4.12 Consider the pooled estimator (4.29), the within-groups estimator (4.25) and the between-groups estimator (4.31). Define the total sum of squares (TSS), total sum of cross products (TSC), within-groups sum of squares (WSS), within-groups cross products (WSC), between-groups sum of squares (BSS) and between-groups sum of cross products (BSC) as, respectively,

$$\text{TSS} = \sum_{i=1}^{N} \sum_{t=1}^{T} (z_{it} - \bar z)(z_{it} - \bar z)', \qquad \text{TSC} = \sum_{i=1}^{N} \sum_{t=1}^{T} (z_{it} - \bar z)(y_{it} - \bar y),$$

$$\text{WSS} = \sum_{i=1}^{N} \sum_{t=1}^{T} (z_{it} - \bar z_i)(z_{it} - \bar z_i)', \qquad \text{WSC} = \sum_{i=1}^{N} \sum_{t=1}^{T} (z_{it} - \bar z_i)(y_{it} - \bar y_i),$$

$$\text{BSS} = \sum_{i=1}^{N} T (\bar z_i - \bar z)(\bar z_i - \bar z)', \qquad \text{BSC} = \sum_{i=1}^{N} T (\bar z_i - \bar z)(\bar y_i - \bar y).$$

Prove that TSS = WSS + BSS and TSC = WSC + BSC. Based on these results, show that the pooled estimator (4.29) is a weighted sum of the within-groups estimator (4.25) and the between-groups estimator (4.31).

    References

Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.


Farebrother, R. W. (1980). The Durbin-Watson test for serial correlation when there is no intercept in the regression, Econometrica, 48, 1553-1563.

Goldberger, Arthur S. (1991). A Course in Econometrics, Cambridge, MA: Harvard University Press.

Greene, William H. (2000). Econometric Analysis, Fourth ed., New York, NY: Macmillan.

Wallis, K. F. (1972). Testing for fourth order autocorrelation in quarterly regression equations, Econometrica, 40, 617-636.
