
STATISTICS, 2017
https://doi.org/10.1080/02331888.2017.1367794

Model averaging for multivariate multiple regression models

Rong Zhu (a), Guohua Zou (b) and Xinyu Zhang (a)

(a) Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People's Republic of China; (b) School of Mathematical Science, Capital Normal University, Beijing, People's Republic of China

ABSTRACT
The theories and applications of model averaging have been developed comprehensively in the past two decades. In this paper, we consider model averaging for multivariate multiple regression models. In order to make sufficient use of the correlation information of the dependent variables, we propose a model averaging method based on the Mahalanobis distance, which is related to the correlation of the dependent variables. We prove the asymptotic optimality of the resulting Mahalanobis Mallows model averaging (MMMA) estimators under certain assumptions. In the simulation study, we show that the proposed MMMA estimators compare favourably with model averaging estimators based on AIC and BIC weights and with the Mallows model averaging estimators from the single dependent variable regression models. We further apply our method to real data on the urbanization rate and the proportion of non-agricultural population in ethnic minority areas of China.

ARTICLE HISTORY
Received 23 March 2017
Accepted 9 August 2017

KEYWORDS
Asymptotic optimality; Mahalanobis distance; Mallows criterion; model averaging; multivariate multiple regression model

1. Introduction

The model averaging method, which considers a series of candidate models and weights the results from these models, has attracted more and more attention in the past two decades. From the perspective of estimation and prediction, model averaging can be regarded as an extension of model selection: it avoids the instability of the model selection process and the risk of selecting a single poor model. There are two main research directions in model averaging: one from the Bayesian point of view and the other from the frequentist point of view. See [1] for a comprehensive survey of Bayesian model averaging. In this paper, we focus on frequentist model averaging. Buckland et al. [2] suggested the smoothed-AIC and smoothed-BIC criteria, which weight all candidate model estimators using the negative exponents of AIC and BIC as the weights. Hansen [3] proposed the Mallows criterion for model averaging and selected the weights by minimizing this criterion, which is an estimator of the expected in-sample squared error. He constrained the model weights to a special discrete set and showed that the Mallows model averaging (MMA) estimator is asymptotically optimal in the sense of achieving the lowest possible squared error for a class of nested models. Wan et al. [4] extended Hansen's [3] study and showed that the asymptotic optimality of the Mallows criterion for model averaging holds for a continuous model weight set and non-nested models. Subsequently, other model averaging criteria were proposed. Examples include optimal mean squared error averaging by Liang et al. [5], jackknife model averaging by Hansen and Racine [6] and Zhang et al. [7], heteroscedasticity-robust Cp model averaging by Liu and Okui [8], and leave-subject-out cross-validation by Gao et al. [9], among others. Another strand of the literature focuses on the asymptotic distribution theory of model averaging estimators, for example, [10-13].

CONTACT Guohua Zou [email protected]

© 2017 Informa UK Limited, trading as Taylor & Francis Group


All the model averaging criteria mentioned above are for models with a single dependent variable. To the best of our knowledge, there are no model averaging criteria for models with multiple dependent variables. The main purpose of this paper is to fill this gap. We consider the multivariate multiple regression model (MMRM), which allows the dependent variable to be multidimensional and is a powerful tool for studying the relationship between multiple independent variables and multiple dependent variables in practical problems. Such multiple-to-multiple problems arise widely in economics, agriculture, the environment, industry, biology and other fields. Here we provide a common example. To monitor atmospheric pollution in environmental science, we take the concentrations of CO, SO, SO2, PM2.5, PM10, dust, smoke and methane as the multidimensional dependent variable. The concentrations of atmospheric pollutants are closely related to the distribution of pollutant sources, discharge values, topography, landforms, weather and other conditions. For this multiple-to-multiple problem, the MMRM is needed to analyse the relationship between the variables. Other examples include Izady et al. [14], who used the MMRM to predict groundwater fluctuations; Yanagihara and Satoh [15], who applied the MMRM to Scottish election data; Jeong et al. [16], who studied daily precipitation with the MMRM; Dan et al. [17], who employed the MMRM to investigate the relationship between vital signs and the social characteristics of patients; and Shin et al. [18], who analysed mobile wireless networks with the MMRM. There is also a substantial literature on the theoretical study of the MMRM, for example, [15,19-22].

In order to make sufficient use of the correlation information of the dependent variables, we propose a Mallows-type model averaging criterion based on the Mahalanobis distance [23], which we call the Mahalanobis Mallows criterion for model averaging. The weights are selected by minimizing the Mahalanobis Mallows criterion, which is an unbiased estimator of the expected in-sample Mahalanobis quadratic error plus a constant. We show that our Mahalanobis Mallows model averaging (MMMA) estimator is asymptotically optimal in the sense of achieving the lowest possible Mahalanobis quadratic error over a continuous model weight set and non-nested models. The proof of the asymptotic optimality is more involved than in the earlier work of Hansen [3] and Wan et al. [4], because we need to deal with the correlation between the dependent variables, and so we need to provide the corresponding technical results.

The remainder of this paper begins with a description of the MMRM set-up and its parametric estimation in Section 2. Section 3 introduces the Mahalanobis Mallows criterion and the MMMA estimator. Section 4 discusses the asymptotic optimality of the MMMA estimator. Section 5 presents simulation evidence in support of the MMMA estimator. A real data example is studied in Section 6, and Section 7 provides some concluding remarks. The technical proofs are relegated to the Appendix.

2. Model set-up and parametric estimation

Our description of the model set-up follows Hansen's [3] notation. We consider the MMRM
\[
y_{ij} = \mu_{ij} + \varepsilon_{ij} = \sum_{k=1}^{\infty} x_{ik}\theta_{kj} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, p, \tag{1}
\]
where y_{ij} is the jth component of the dependent variable y_{(i)} = (y_{i1}, y_{i2}, ..., y_{ip})′, which is a p × 1 random vector; x_{ik} is the kth component of the independent variable x_{(i)} = (x_{i1}, x_{i2}, ...), which is a countably infinite random vector; θ_{kj} is the unknown coefficient of x_{ik} corresponding to the jth dependent variable; and ε_{ij} is the jth component of the disturbance ε_{(i)} = (ε_{i1}, ε_{i2}, ..., ε_{ip})′, which satisfies E(ε_{(i)} | x_{(i)}) = 0 and E(ε_{(i)}ε_{(i)}′ | x_{(i)}) = Σ = (σ_{ij})_{p×p} > 0, that is, Σ is a positive definite matrix. We assume that p is fixed, E(μ_{ij}²) < ∞ and μ_{ij} converges in mean square.

Equation (1) can also be written in the matrix form
\[
Y = \mu + \mathcal{E} = X\theta + \mathcal{E}, \tag{2}
\]


where Y = (y_{(1)}, y_{(2)}, ..., y_{(n)})′ is an n × p dependent variable matrix, X = (x_{(1)}, x_{(2)}, ..., x_{(n)})′ is an n × ∞ independent variable matrix, θ is an ∞ × p parameter matrix, μ = Xθ is an n × p matrix function of X, and ℰ = (ε_{(1)}, ε_{(2)}, ..., ε_{(n)})′ is an n × p random error matrix. We denote the p single dependent variable regression models as
\[
y_j = \mu_j + \varepsilon_j = X\theta_j + \varepsilon_j, \quad j = 1, 2, \ldots, p, \tag{3}
\]
where y_j = (y_{1j}, y_{2j}, ..., y_{nj})′ is the jth column of the dependent variable matrix Y, θ_j = (θ_{1j}, θ_{2j}, ...)′ is the jth column of the parameter matrix θ, and ε_j = (ε_{1j}, ε_{2j}, ..., ε_{nj})′ is the jth column of the random error matrix ℰ.

We use M candidate MMRMs to approximate (2), where M is allowed to diverge to infinity as n → ∞. The mth approximating (or candidate) MMRM can use any k_m regressors belonging to x_{(i)}. So the mth approximating model is
\[
Y = \mu_{(m)} + \mathcal{E} = X_{(m)}\theta_{(m)} + \mathcal{E}, \quad m = 1, 2, \ldots, M,
\]
where X_{(m)} is an n × k_m matrix of full column rank whose columns are constructed from the columns of X, that is, we choose only k_m independent variables in the mth candidate model, and θ_{(m)} is a k_m × p parameter matrix.

The mth candidate model can be written as Equation (4), which is equivalent to Equation (5): we put the model in vector form by using the vectorizing operation Vec(·), which creates a column vector by stacking the columns of a matrix below one another, that is, we write
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
= (I_p \otimes X_{(m)})
\begin{pmatrix} \theta_{1(m)} \\ \theta_{2(m)} \\ \vdots \\ \theta_{p(m)} \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_p \end{pmatrix}, \tag{4}
\]
where θ_{i(m)} = (θ_{1i(m)}, θ_{2i(m)}, ..., θ_{k_m i(m)})′ is the ith column of the parameter matrix θ_{(m)} in the mth candidate model. Denote
\[
\mathrm{Vec}(Y) = D_{(m)}\mathrm{Vec}(\theta_{(m)}) + \mathrm{Vec}(\mathcal{E}), \tag{5}
\]

where D_{(m)} = I_p ⊗ X_{(m)} is a pn × pk_m matrix. We minimize the sum of squared errors and obtain the parameter estimators
\[
\mathrm{Vec}(\hat\theta_{(m)}) = (D_{(m)}'D_{(m)})^{-1}D_{(m)}'\mathrm{Vec}(Y)
= \begin{pmatrix}
(X_{(m)}'X_{(m)})^{-1}X_{(m)}'y_1 \\
(X_{(m)}'X_{(m)})^{-1}X_{(m)}'y_2 \\
\vdots \\
(X_{(m)}'X_{(m)})^{-1}X_{(m)}'y_p
\end{pmatrix}
= \begin{pmatrix}
\hat\theta_{1(m)} \\ \hat\theta_{2(m)} \\ \vdots \\ \hat\theta_{p(m)}
\end{pmatrix},
\]
where θ̂_{j(m)} = (X_{(m)}′X_{(m)})^{-1}X_{(m)}′y_j. Denote P_{(m)} = X_{(m)}(X_{(m)}′X_{(m)})^{-1}X_{(m)}′, which is an n × n projection matrix. Then, under the mth candidate model, the estimator of μ is given by
\[
\mathrm{Vec}(\hat\mu_{(m)}) = \mathrm{Vec}(X_{(m)}\hat\theta_{(m)}) = D_{(m)}\mathrm{Vec}(\hat\theta_{(m)}) = H_{(m)}\mathrm{Vec}(Y), \tag{6}
\]
where H_{(m)} = I_p ⊗ P_{(m)}. Equation (6) shows that Vec(μ̂_{(m)}) is a linear function of Vec(Y).
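A minimal sketch of Equation (6) (not from the paper; the array names Y and X_m are assumptions): because H_{(m)} = I_p ⊗ P_{(m)}, the candidate fit can be computed column by column with the same projection matrix, without explicitly vectorizing.

```python
# Sketch: fitted values of the m-th candidate model, mu_hat_(m) = P_(m) Y,
# assuming Y is an n x p array and X_m holds the k_m regressors of that model.
import numpy as np

def candidate_fit(Y, X_m):
    # P_(m) = X_m (X_m' X_m)^{-1} X_m', the n x n projection matrix
    P_m = X_m @ np.linalg.solve(X_m.T @ X_m, X_m.T)
    # H_(m) = I_p (x) P_(m): each column of Y is projected by the same P_(m)
    return P_m @ Y
```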


3. Model averaging estimation and weight choice criterion

Let w = (w_1, w_2, ..., w_M)′ be a weight vector in the unit simplex of R^M:
\[
H_n = \Big\{ w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1 \Big\}.
\]

The model averaging estimator of Vec(μ) is
\[
\mathrm{Vec}(\hat\mu(w)) = \sum_{m=1}^{M} w_m \mathrm{Vec}(\hat\mu_{(m)}) = \sum_{m=1}^{M} w_m H_{(m)}\mathrm{Vec}(Y) = H(w)\mathrm{Vec}(Y),
\]
where P(w) = Σ_{m=1}^M w_m P_{(m)} and H(w) = Σ_{m=1}^M w_m H_{(m)} = I_p ⊗ P(w).

In order to make sufficient use of the correlation information of the dependent variables, we adopt the Mahalanobis distance [23] to define the quadratic loss function
\[
L_n(w) = \{\mathrm{Vec}(\hat\mu(w)) - \mathrm{Vec}(\mu)\}'\Sigma_0^{-1}\{\mathrm{Vec}(\hat\mu(w)) - \mathrm{Vec}(\mu)\},
\]
where Σ_0 = Σ ⊗ I_n and ⊗ denotes the Kronecker product. The corresponding risk function is
\[
R_n(w) = E(L_n(w) \mid X)
= \mathrm{Vec}(\mu)'(I - H(w))'\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu) + p\,\mathrm{tr}\{P'(w)P(w)\}. \tag{7}
\]
Let
\[
C_n(w) = \{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'\Sigma_0^{-1}\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\} + 2p\,\mathrm{tr}(P(w)). \tag{8}
\]
By a simple calculation, we obtain
\[
E(C_n(w)) = E(L_n(w)) + np,
\]
which means that C_n(w) is an unbiased estimator of the expected in-sample Mahalanobis quadratic error plus a constant. Therefore, we use C_n(w) as the weight choice criterion and call it the Mahalanobis Mallows criterion for model averaging. The optimal weight vector is obtained by minimizing the Mahalanobis Mallows criterion over the weight set H_n, that is,
\[
\hat w = \arg\min_{w \in H_n} C_n(w). \tag{9}
\]

The corresponding estimator μ̂(ŵ) is referred to as the MMMA estimator.

Consider the special case in which the covariance matrix Σ is diagonal. Then the Mahalanobis Mallows criterion defined in Equation (8) can be written as
\[
C_n(w) = \sum_{i=1}^{p} C_{ni}(w)/\sigma_{ii}, \tag{10}
\]
where C_{ni}(w) = y_i′(I − P(w))′(I − P(w))y_i + 2σ_{ii} tr(P(w)) is the Mallows criterion of the ith single dependent variable regression model introduced by Hansen [3]. Equation (10) shows that when the covariance matrix is diagonal, the weight choice criterion is equivalent to the sum of the Mallows criteria of all single dependent variable regression models, each scaled by σ_{ii}.


In addition, if instead of L_n(w) we use the sum-of-squares loss function
\[
L_{0,n}(w) = \{\mathrm{Vec}(\hat\mu(w)) - \mathrm{Vec}(\mu)\}'\{\mathrm{Vec}(\hat\mu(w)) - \mathrm{Vec}(\mu)\},
\]
which does not contain Σ_0^{-1}, then we obtain the weight choice criterion
\[
C_{0,n}(w) = \{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\} + 2\,\mathrm{tr}(\Sigma)\,\mathrm{tr}(P(w)).
\]
It is easy to show that E(C_{0,n}(w)) = E(L_{0,n}(w)) + n tr(Σ), where the second term does not depend on w. By some calculation (the first term of C_{0,n}(w) is the sum of the residual sums of squares of the p single regressions and Σ_{i=1}^p σ_{ii} = tr(Σ)), we have
\[
C_{0,n}(w) = \sum_{i=1}^{p} C_{ni}(w),
\]
which means that when the sum-of-squares loss function L_{0,n}(w) is used, the weight choice criterion for MMRMs is equivalent to the sum of the Mallows criteria of all single dependent variable regression models.

4. Asymptotic optimality of MMMA estimator

Let ξ_n = inf_{w∈H_n} R_n(w) and let w_m^0 be the M × 1 vector whose mth element is one and whose other elements are zero. The asymptotic optimality of the proposed MMMA estimator is given in the next theorem.

Theorem 4.1: As n → ∞, if, for some fixed integer G with 1 ≤ G < ∞ and some constant κ < ∞,
\[
E(\varepsilon_{ij}^{4G} \mid x_{(i)}) \le \kappa < \infty, \quad \text{for all } i = 1, \ldots, n,\ j = 1, 2, \ldots, p, \tag{11}
\]
and
\[
M \xi_n^{-2G} \sum_{m=1}^{M} \{R_n(w_m^0)\}^G \to 0, \tag{12}
\]
then
\[
\frac{L_n(\hat w)}{\inf_{w\in H_n} L_n(w)} \xrightarrow{p} 1. \tag{13}
\]

Theorem 4.1 shows that the model averaging procedure using ŵ is asymptotically optimal in the sense that the resulting Mahalanobis quadratic loss is asymptotically identical to that of the infeasible best possible model averaging estimator. The proof of Theorem 4.1 is provided in Appendix A.2.

Remark 4.1: Condition (11) is a counterpart of condition (7) of Wan et al. [4] and places a bound on the conditional moments of the random errors. The convergence condition in Equation (12) is similar to condition (8) of Wan et al. [4]. ξ_n → ∞ is obviously a necessary condition for (12) to hold. As Hansen [3] remarked, it simply means that there is no finite approximating model whose bias is zero. Condition (12) also requires that M Σ_{m=1}^M {R_n(w_m^0)}^G grow more slowly than ξ_n^{2G}.

So far we have assumed that the covariance matrix Σ is known. This is, of course, not the case in practice. We can use the largest approximating model to estimate Σ, which is also suggested by


Hansen [3]. The largest approximating model M* is the model with k_{M*} = max{k_1, k_2, ..., k_M}. Then the estimator of Σ based on the largest approximating model M* is given by
\[
\hat\Sigma = \frac{1}{n - k_{M^*}}\,(Y - \hat\mu_{(M^*)})'(Y - \hat\mu_{(M^*)}). \tag{14}
\]
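A small sketch of Equation (14) (ours, with assumed variable names): the covariance estimate is the residual cross-product of the largest candidate model divided by its residual degrees of freedom.

```python
# Sketch of Equation (14): Sigma_hat from the residuals of the largest model,
# where mu_hat_max = P_(M*) Y and k_max is its number of regressors.
def estimate_sigma(Y, mu_hat_max, k_max):
    R = Y - mu_hat_max                      # n x p residual matrix
    return (R.T @ R) / (Y.shape[0] - k_max)
```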

Let λ(·) denote the maximum singular value of a matrix. The following theorem shows that Theorem 4.1 remains valid if Σ is unknown and replaced by Σ̂, provided that some mild conditions are satisfied.

Theorem 4.2: When Σ is unknown and replaced by Σ̂, the asymptotic optimality of the MMMA estimator in Equation (13) remains valid if
\[
\lambda(\hat\Sigma - \Sigma)\,\xi_n^{-1} n \xrightarrow{p} 0, \tag{15}
\]
\[
\frac{k_{M^*}}{n} \to 0, \tag{16}
\]
\[
k_{M^*} \to \infty, \tag{17}
\]
and
\[
\mathrm{Vec}(\mu)'\mathrm{Vec}(\mu)/n = O(1), \tag{18}
\]
as n → ∞.

Theorem 4.2 establishes the asymptotic optimality of μ̂(ŵ) when the covariance matrix Σ is unknown. The proof of Theorem 4.2 is given in Appendix A.3.

Remark 4.2: Condition (15) requires that Σ̂ − Σ converge to zero faster than the rate ξ_n/n. When all the X_{(m)} are of full column rank, Conditions (16) and (17) place constraints on the number of regressors in the largest approximating model; they are similar to the conditions of Theorem 2 in Hansen [3]. Condition (18) concerns the average of the squares of the conditional means μ_{ij}; it is similar to Condition (11) in Wan et al. [4].

5. Simulation study

5.1. Simulation setup

Our setting is similar to the infinite-order regression of Hansen [3]. We generate the data by Y = μ + ℰ = Xθ + ℰ, where Y is an n × 2 dependent variable matrix and X is an n × ∞ independent variable matrix. For practical calculation, instead of ∞ we take X to be an n × M0 independent variable matrix, where M0 is a large number, set to 200 in this section. The first column of X is 1 for the intercept. The remaining x_{ij} are i.i.d. N(0, 1). Each row of the random error matrix ℰ is ε_{(i)}′, with ε_{(i)} i.i.d. N(0, Σ), independent of x_{(i)}, where
\[
\Sigma = \begin{pmatrix} 1 & r_0 \\ r_0 & 1 \end{pmatrix}.
\]
We consider r_0 varying over 0, 0.3, 0.6 and 0.9; the greater r_0 is, the greater the correlation of the random errors. The coefficient matrix θ is an M0 × 2 matrix whose jth row is (θ_{j1}, θ_{j2}). We set θ_{j1} = c√(2α) j^{-(1/2)α-1/2} and θ_{j2} = c√α j^{-α-1/10}, which are similar to the setting of Hansen [3]. The coefficient c is selected to control the population R², defined as (var(μ_{i1}) + var(μ_{i2}))/(var(y_{i1}) + var(y_{i2})) ≈ (var(μ_{i1}) + var(μ_{i2}))/(var(μ_{i1}) + var(μ_{i2}) + var(ε_{i1}) + var(ε_{i2})), which is varied on a grid between 0.1 and 0.9. The parameter α is varied over 0.5, 1.0 and 1.5. The sample size is set to n = 50, 100 and


200. The number of candidate models M is determined by the rule M = round(3n^{1/3}), which is used by Hansen [3]; that is, M = 11, 14 and 18, where round(·) denotes rounding to the nearest integer. The M candidate models are nested, and the mth candidate model contains the first m independent variables.
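A compact sketch of this data-generating process is given below (ours, not part of the paper); the exact decay exponents of θ_{j1} and θ_{j2} are our reading of the formulas above and should be treated as assumptions.

```python
# Sketch of the Section 5.1 design: n x 2 responses with correlated errors.
import numpy as np

def generate_data(n, r0, c, alpha, M0=200, seed=None):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.standard_normal((n, M0 - 1))])  # intercept + N(0,1)
    j = np.arange(1, M0 + 1)
    theta = np.column_stack([c * np.sqrt(2 * alpha) * j ** (-0.5 * alpha - 0.5),
                             c * np.sqrt(alpha) * j ** (-alpha - 0.1)])  # assumed decay
    Sigma = np.array([[1.0, r0], [r0, 1.0]])
    E = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    mu = X @ theta
    return X, mu + E, mu, Sigma          # regressors, Y, mu, Sigma
```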

We consider eight kinds of model selection and model averaging methods:

(1) AIC (Akaike information criterion) model selection for MMRMs [24]: AIC_m = n(ln|Σ̂_m| + p) + 2{pk_m + p(p + 1)/2}, m = 1, 2, ..., M, where Σ̂_m = (Y − μ̂_{(m)})′(Y − μ̂_{(m)})/n and p = 2. We select the model that minimizes AIC;
(2) AICc model selection for MMRMs [24]: AICc_m = n(ln|Σ̂_m| + p) + 2{pk_m + p(p + 1)/2}{n/(n − k_m − p − 1)}, where Σ̂_m and p are the same as in AIC. We select the model that minimizes AICc;
(3) BIC (Bayesian information criterion) model selection for MMRMs: BIC_m = n(ln|Σ̂_m| + p) + ln(n){pk_m + p(p + 1)/2}, where Σ̂_m and p are the same as in AIC. We select the model that minimizes BIC;
(4) S-AIC, the smoothed AIC model averaging method, which uses the negative exponent of AIC as the weight: w_m = exp(−AIC_m/2)/Σ_{l=1}^M exp(−AIC_l/2) is the weight of the mth candidate model (a small sketch of these smoothed weights is given after this list);
(5) S-AICc, the smoothed AICc model averaging method: w_m = exp(−AICc_m/2)/Σ_{l=1}^M exp(−AICc_l/2) is the weight of the mth candidate model;
(6) S-BIC, the smoothed BIC model averaging method: w_m = exp(−BIC_m/2)/Σ_{l=1}^M exp(−BIC_l/2) is the weight of the mth candidate model;
(7) MMA, which treats the p-dimensional MMRM as p single dependent variable regression models and obtains the weights separately for each single dependent variable regression model;
(8) MMMA, the Mahalanobis Mallows model averaging estimator proposed in this paper.
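For methods (4)-(6), the weights are simple transformations of the information criteria; the sketch below (ours) computes them with the minimum subtracted first for numerical stability, which leaves the ratio unchanged.

```python
# Sketch of the smoothed weights w_m = exp(-IC_m/2) / sum_l exp(-IC_l/2),
# where `ic` is an array of AIC, AICc or BIC values over the M candidate models.
import numpy as np

def smoothed_weights(ic):
    z = -0.5 * (np.asarray(ic, dtype=float) - np.min(ic))
    w = np.exp(z)
    return w / w.sum()
```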

To evaluate the estimators, we compute the averaged Mahalanobis quadratic loss
\[
\mathrm{Risk} = \frac{1}{D}\sum_{d=1}^{D}\{\mathrm{Vec}(\hat\mu_{(d)}(\hat w)) - \mathrm{Vec}(\mu_{(d)})\}'\Sigma_0^{-1}\{\mathrm{Vec}(\hat\mu_{(d)}(\hat w)) - \mathrm{Vec}(\mu_{(d)})\},
\]
where the subscript d denotes the dth simulation run. We take D = 500. For each parameterization, the risk is normalized by dividing it by the risk of the infeasible optimal least squares estimator.
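The per-replication loss above is again a Mahalanobis quadratic form; a one-function sketch (ours, with assumed array names):

```python
# Sketch of the loss for one simulation run: tr{(mu_hat - mu) Sigma^{-1} (mu_hat - mu)'},
# which equals the Vec-form quadratic loss with Sigma_0 = Sigma (x) I_n.
import numpy as np

def mahalanobis_loss(mu_hat, mu, Sigma):
    D = mu_hat - mu
    return np.trace(D @ np.linalg.inv(Sigma) @ D.T)
```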

5.2. Simulation results

The risk calculations are presented in Figures 1-3 for α = 1.5 and n = 50, 100 and 200, respectively. To save space, we report only the results with α = 1.5; the simulation results with α = 0.5 and 1 are similar and available from the authors upon request. In each figure, four different correlation coefficients r_0 = 0, 0.3, 0.6 and 0.9 are considered.

We find that all the model selection methods perform worse than the corresponding model averaging methods; that is, the S-AIC method achieves a lower risk than the AIC method, the S-AICc method achieves a lower risk than the AICc method, and the S-BIC method achieves a lower risk than the BIC method, implying that the model averaging methods are superior to the model selection methods. For simplicity, we do not plot the AIC, AICc and BIC model selection methods in the figures. (The figures containing the model selection methods are available on request from the authors.)


[Figure 1 near here: four panels (r0 = 0, 0.3, 0.6, 0.9) plotting the normalized risk against R² for S-AIC, S-AICc, S-BIC, MMA and MMMA.]
Figure 1. Risk comparison of five methods (α = 1.5, n = 50).

[Figure 2 near here: four panels (r0 = 0, 0.3, 0.6, 0.9) plotting the normalized risk against R² for S-AIC, S-AICc, S-BIC, MMA and MMMA.]
Figure 2. Risk comparison of five methods (α = 1.5, n = 100).

It is seen from all the figures that, as r_0 increases, the risk gap between MMMA and the other methods becomes larger, which shows the clear advantage of our Mahalanobis Mallows criterion. When r_0 is high, such as r_0 = 0.9, the MMMA method is uniformly superior to the other methods. In many situations the MMMA normalized risk is less than 1, which means that it is often smaller than that of the infeasible optimal model selection. The good performance of the MMMA method owes to its sufficient use of the correlation information of the dependent variables.

It is clear from the last two figures that when r_0 is low or moderate, the performance of the MMMA method depends on R². When R² is small, the MMMA method performs best in most situations. When R² is large, the MMA method can perform better than the MMMA method.


[Figure 3 near here: four panels (r0 = 0, 0.3, 0.6, 0.9) plotting the normalized risk against R² for S-AIC, S-AICc, S-BIC, MMA and MMMA.]
Figure 3. Risk comparison of five methods (α = 1.5, n = 200).

This is not surprising, because when r_0 = 0 the single dependent variable regression models are uncorrelated, and in this situation there is no correlation information that can be exploited.

We also simulated other coefficient matrices θ and other distributions of the independent variables x_{ij}, and the conclusions are similar.

6. Empirical application

We apply our method to data on 77 ethnic minority areas in China, collected from the sixth national population census of China in 2010. We treat the urbanization rate and the proportion of non-agricultural population as the two-dimensional dependent variable, and we consider six independent variables that may be related to the dependent variables: the proportion of the population belonging to ethnic minorities, the urban and rural income ratio, per capita investment in fixed assets, per capita GDP, the secondary industry output value ratio and the service output value ratio.

We consider all possible candidate models, 64 in total. To compare the performance of the eight methods described in Section 5, we use any T ethnic minority areas to fit the regressions and obtain the coefficient estimates of the eight methods, where T = 50 and 60. We then forecast the risk by calculating the averaged Mahalanobis quadratic loss on the remaining 77 − T ethnic minority areas. This process is repeated N = 500 times, and we calculate the mean and median risks of the eight methods, using the same formula as in Section 5. We also calculate the optimal rate of each method, where the optimal rate is the number of times the risk of the method is the smallest among the eight methods divided by the total number N = 500. The results are displayed in Table 1, where the optimal values are highlighted in bold.

It is observed from Table 1 that both the mean risk and the median risk of MMMA are lower than those of the other methods, and the optimal rate of MMMA is the highest among all the methods, showing that our proposed method has obvious advantages over the other methods. From the perspective of the mean and the median, the best method is MMMA, followed by some model averaging methods, such as MMA, S-BIC, S-AICc and S-AIC, while the worst methods are some model selection methods, such as AIC and BIC. From the perspective of the optimal rate, the best is MMMA, followed by AIC, then MMA, BIC and S-BIC, and the worst are S-AIC, AICc and S-AICc. In most situations, the model selection methods perform worse than the corresponding model averaging methods.


Table 1. Comparison of the risk results of eight methods.

Method                 AIC     AICc    BIC     S-AIC   S-AICc  S-BIC   MMA     MMMA
T = 50  Mean           4.032   4.022   3.999   4.006   3.962   3.935   3.929   3.910
        Median         3.618   3.610   3.645   3.604   3.593   3.570   3.599   3.569
        Optimal rate   0.182   0.076   0.094   0.064   0.022   0.132   0.170   0.260
T = 60  Mean           6.222   6.191   6.233   6.193   6.142   6.140   6.132   6.056
        Median         5.385   5.300   5.310   5.307   5.269   5.336   5.251   5.148
        Optimal rate   0.182   0.038   0.122   0.062   0.022   0.106   0.154   0.314

Note: Bold values indicate the best results.

7. Concluding remarks

The MMRM has become popular in many fields because it captures the multiple-to-multiple relationship between the independent variables and the dependent variables. In this paper, we have investigated the model averaging approach for such models. In order to make sufficient use of the correlation information of the dependent variables, we have derived a weight choice criterion for model averaging estimators based on the Mahalanobis distance. The asymptotic optimality of the resulting estimator has been proved, and both the simulation study and the real data analysis have shown its validity.

This is the first paper to consider the MMRM in the model averaging literature, so many problems remain for further study. At least three issues deserve future research. First, we have not discussed the asymptotic distribution of the proposed MMMA estimator in this paper. Since the optimal weight vector depends on the observations, more effort is needed to derive the asymptotic distribution of the MMMA estimator. Second, this paper has extended the MMA criterion from models with a single dependent variable to models with multiple dependent variables. It would also be interesting to study other model averaging methods for MMRMs, such as the jackknife model averaging of Hansen and Racine [6]. Third, we have considered only simple data for multivariate multiple regression models. How to develop optimal model averaging methods for complex data, such as longitudinal data, high-dimensional data, time series data, missing data and measurement error data, for MMRMs is also an interesting problem.

Acknowledgments
The authors thank the editor, the associate editor and the referee for helpful comments and suggestions.

Funding
Zou's work was partially supported by the National Natural Science Foundation of China [Grant nos. 11331011 and 11529101] and a grant from the Ministry of Science and Technology of China [Grant no. 2016YFB0502301]. Zhang's work was partially supported by the National Natural Science Foundation of China [Grant nos. 71522004 and 11471324] and a special talent grant from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences.

Disclosure statement
No potential conflict of interest was reported by the authors.

References
[1] Hoeting JA, Madigan D, Raftery AE, et al. Bayesian model averaging: a tutorial. Statist Sci. 1999;14:382–417.
[2] Buckland ST, Burnham KP, Augustin NH. Model selection: an integral part of inference. Biometrics. 1997;53:603–618.
[3] Hansen BE. Least squares model averaging. Econometrica. 2007;75:1175–1189.
[4] Wan ATK, Zhang X, Zou G. Least squares model averaging by Mallows criterion. J Econometrics. 2010;156:277–283.


[5] Liang H, Zou G, Wan ATK, et al. Optimal weight choice for frequentist model average estimators. J Amer Statist Assoc. 2011;106:1053–1066.
[6] Hansen BE, Racine JS. Jackknife model averaging. J Econometrics. 2012;167:38–46.
[7] Zhang X, Wan ATK, Zou G. Model averaging by jackknife criterion in models with dependent data. J Econometrics. 2013;174:82–94.
[8] Liu Q, Okui R. Heteroscedasticity-robust Cp model averaging. Econom J. 2013;16:463–472.
[9] Gao Y, Zhang X, Wang S, et al. Model averaging based on leave-subject-out cross-validation. J Econometrics. 2016;192:139–151.
[10] Claeskens G, Carroll RJ. An asymptotic theory for model selection inference in general semiparametric problems. Biometrika. 2007;94:249–265.
[11] Hjort NL, Claeskens G. Frequentist model average estimators. J Amer Statist Assoc. 2003;98:879–899.
[12] Liu C. Distribution theory of the least squares averaging estimator. J Econometrics. 2015;186:142–159.
[13] Wang H, Zou G, Wan ATK. Model averaging for varying-coefficient partially linear measurement error models. Electron J Stat. 2012;6:1017–1039.
[14] Izady A, Soheili M, Davari K, et al. Groundwater level forecasting using multivariate-multiple regression model. ICWR 2009 international conference on water resources; 2009.
[15] Yanagihara H, Satoh K. An unbiased Cp criterion for multivariate ridge regression. J Multivariate Anal. 2010;101:1226–1238.
[16] Jeong DI, St-Hilaire A, Ouarda TBMJ, et al. Multisite statistical downscaling model for daily precipitation combined by multivariate multiple linear regression and stochastic weather generator. Clim Change. 2012;114:567–591.
[17] Dan ED, Jude O, Chukwukailo NA. Application of multivariate multiple linear regression model on vital signs and social characteristics of patients. Int J Eng Sci. 2013;2:40–56.
[18] Shin Y, Chae CB, Kim S. Multivariate multiple regression models for a big data-empowered SON framework in mobile wireless networks. Mobile Inf Syst. 2016;2016:1–10.
[19] Fujikoshi Y, Satoh K. Modified AIC and Cp in multivariate linear regression. Biometrika. 1997;84:707–716.
[20] Fujikoshi Y, Sakurai T, Yanagihara H. Consistency of high-dimensional AIC-type and Cp-type criteria in multivariate linear regression. J Multivariate Anal. 2014;123:184–200.
[21] Sparks RS, Coutsourides D, Troskie L. The multivariate Cp. Comm Statist Theory Methods. 1983;12:1775–1793.
[22] Yanagihara H, Wakaki H, Fujikoshi Y. A consistency property of AIC for multivariate linear model when the dimension and the sample size are large. Electron J Stat. 2015;43:231–245.
[23] Mahalanobis PC. On the generalised distance in statistics. Proc Natl Inst Sci India. 1936;2:49–55.
[24] Bedrick EJ, Tsai C-L. Model selection for multivariate regression in small samples. Biometrics. 1994;50:226–231.
[25] Whittle P. Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab Appl. 1960;5:302–305.

Appendix
To prove Theorems 4.1 and 4.2, we first give some lemmas.

A.1 Useful preliminary results
Lemma A.1: Assume that Condition (11) holds. Then, for some fixed integer 1 ≤ G < ∞, the 4Gth conditional moment of the absolute value of each element of Σ_0^{-1/2}Vec(ℰ) is O(1).

Proof: Note that Σ_0^{-1/2} = Σ^{-1/2} ⊗ I_n. Denote
\[
\Sigma^{-1/2} =
\begin{pmatrix}
a_{11} & \cdots & a_{1p} \\
\vdots & \ddots & \vdots \\
a_{p1} & \cdots & a_{pp}
\end{pmatrix}_{p\times p}.
\]
Then, for all j = 1, 2, ..., p and l = 1, 2, ..., n, the {(j − 1)n + l}th element of the vector Σ_0^{-1/2}Vec(ℰ) can be expressed as Σ_{i=1}^p a_{ji}ε_{li}. Using Minkowski's inequality, we obtain
\[
E\Bigg(\Big|\sum_{i=1}^{p} a_{ji}\varepsilon_{li}\Big|^{4G}\,\Big|\,x_{(l)}\Bigg)
\le \Bigg[\sum_{i=1}^{p}\{E(|a_{ji}\varepsilon_{li}|^{4G}\mid x_{(l)})\}^{1/4G}\Bigg]^{4G}
= \Bigg[\sum_{i=1}^{p}|a_{ji}|\{E(\varepsilon_{li}^{4G}\mid x_{(l)})\}^{1/4G}\Bigg]^{4G}.
\]


From Condition (11), we obtain E(ε_{li}^{4G} | x_{(l)}) = O(1). Since Σ is positive definite and p is fixed, Σ^{-1} is also positive definite and a_{ji} is finite for all i, j = 1, 2, ..., p. Hence, for some fixed integer 1 ≤ G < ∞, E(|Σ_{i=1}^p a_{ji}ε_{li}|^{4G} | x_{(l)}) = O(1). □

Lemma A.2: Suppose that Conditions (11), (16) and (17) hold. Then
\[
\hat\Sigma - \Sigma = o_p(1), \tag{A1}
\]
where Σ̂ = (1/(n − k_{M*}))(Y − μ̂_{(M*)})′(Y − μ̂_{(M*)}), defined in Equation (14), is the estimator of Σ based on the largest approximating model.

Proof: We denote by Ω_m ⊆ {1, 2, ...} the index set of the independent variables in the mth candidate model and by Ω_m^c its complement; the size of Ω_m is k_m. From Equation (1), we have
\[
\mu_{ij} = \sum_{k=1}^{\infty} x_{ik}\theta_{kj}
= \sum_{k\in\Omega_{M^*}} x_{ik}\theta_{kj} + \sum_{k\in\Omega^c_{M^*}} x_{ik}\theta_{kj},
\quad i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, p. \tag{A2}
\]
Let μ_{(M*)ij} = Σ_{k∈Ω_{M*}} x_{ik}θ_{kj} be the ijth component of μ_{(M*)} and b_{(M*)ij} = Σ_{k∈Ω^c_{M*}} x_{ik}θ_{kj} be the ijth component of b_{(M*)}, where b_{(M*)} = μ − μ_{(M*)} represents the approximation error in the M*th model.

From Equation (14), it is easily seen that
\begin{align*}
\hat\Sigma
&= \frac{1}{n - k_{M^*}}(Y - \hat\mu_{(M^*)})'(Y - \hat\mu_{(M^*)})
= \frac{Y'(I_n - P_{(M^*)})Y}{n - k_{M^*}}
= \frac{(b_{(M^*)} + \mathcal{E})'(I_n - P_{(M^*)})(b_{(M^*)} + \mathcal{E})}{n - k_{M^*}} \\
&= \frac{b_{(M^*)}'(I_n - P_{(M^*)})b_{(M^*)}}{n - k_{M^*}}
+ 2\,\frac{b_{(M^*)}'(I_n - P_{(M^*)})\mathcal{E}}{n - k_{M^*}}
+ \frac{\mathcal{E}'(I_n - P_{(M^*)})\mathcal{E}}{n - k_{M^*}}.
\end{align*}
We need only verify that
\[
\frac{b_{(M^*)}'(I_n - P_{(M^*)})b_{(M^*)}}{n - k_{M^*}} = o_p(1), \tag{A3}
\]
\[
\frac{b_{(M^*)}'(I_n - P_{(M^*)})\mathcal{E}}{n - k_{M^*}} = o_p(1), \tag{A4}
\]
and
\[
\frac{\mathcal{E}'(I_n - P_{(M^*)})\mathcal{E}}{n - k_{M^*}} - \Sigma = o_p(1). \tag{A5}
\]
First, we consider (A3). Since P_{(M*)} is an idempotent matrix, we obtain
\[
\frac{b_{(M^*)}'(I_n - P_{(M^*)})b_{(M^*)}}{n - k_{M^*}} \le \frac{b_{(M^*)}'b_{(M^*)}}{n - k_{M^*}}. \tag{A6}
\]

Denote by (b_{(M*)}′b_{(M*)}/(n − k_{M*}))_{ij} the ijth component of b_{(M*)}′b_{(M*)}/(n − k_{M*}) and by e_i the p × 1 vector whose ith element is one and whose other elements are zero. By Condition (17) and the model set-up assumption that E(μ_{ij}²) < ∞ and μ_{ij} converges in mean square, we have E(b_{(M*)ij}²) → 0. Further, it is seen that
\[
\sum_{i=1}^{n} E(b_{(M^*)ij}^2) = o(n). \tag{A7}
\]
Since b_{(M*)ij}² is non-negative, we obtain b_{(M*)ij}² = o_p(1). This implies Σ_{i=1}^n (Σ_{k∈Ω^c_{M*}} x_{ik}θ_{kj})² = o_p(n). By the Cauchy–Schwarz inequality and Condition (16), we have
\begin{align*}
\Bigg|\bigg(\frac{b_{(M^*)}'b_{(M^*)}}{n - k_{M^*}}\bigg)_{ij}\Bigg|
&= \Bigg|\frac{e_i' b_{(M^*)}' b_{(M^*)} e_j}{n - k_{M^*}}\Bigg|
= \Bigg|\frac{\sum_{l=1}^{n}\big(\sum_{k\in\Omega^c_{M^*}} x_{lk}\theta_{ki}\big)\big(\sum_{k\in\Omega^c_{M^*}} x_{lk}\theta_{kj}\big)}{n - k_{M^*}}\Bigg| \\
&\le \frac{\big\{\sum_{l=1}^{n}\big(\sum_{k\in\Omega^c_{M^*}} x_{lk}\theta_{ki}\big)^2\big\}^{1/2}\big\{\sum_{l=1}^{n}\big(\sum_{k\in\Omega^c_{M^*}} x_{lk}\theta_{kj}\big)^2\big\}^{1/2}}{n - k_{M^*}}
= \frac{o_p(n)}{n - k_{M^*}} = o_p(1), \tag{A8}
\end{align*}

where
\[
b_{(M^*)} =
\begin{pmatrix}
\sum_{k\in\Omega^c_{M^*}} x_{1k}\theta_{k1} & \sum_{k\in\Omega^c_{M^*}} x_{1k}\theta_{k2} & \cdots & \sum_{k\in\Omega^c_{M^*}} x_{1k}\theta_{kp} \\
\sum_{k\in\Omega^c_{M^*}} x_{2k}\theta_{k1} & \sum_{k\in\Omega^c_{M^*}} x_{2k}\theta_{k2} & \cdots & \sum_{k\in\Omega^c_{M^*}} x_{2k}\theta_{kp} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{k\in\Omega^c_{M^*}} x_{nk}\theta_{k1} & \sum_{k\in\Omega^c_{M^*}} x_{nk}\theta_{k2} & \cdots & \sum_{k\in\Omega^c_{M^*}} x_{nk}\theta_{kp}
\end{pmatrix}.
\]
From (A8), we see that each element of b_{(M*)}′b_{(M*)}/(n − k_{M*}) is o_p(1). Combining this with (A6), we obtain (A3).

Next, we prove (A4). We first consider the ijth element of b_{(M*)}′(I_n − P_{(M*)})ℰ/(n − k_{M*}). Let
\[
D = (I_n - P_{(M^*)})\,b_{(M^*)}\,e_i e_i'\,b_{(M^*)}'\,(I_n - P_{(M^*)}) = (d_{ij})_{n\times n},
\]
where d_{ij} is the ijth element of D. By (A7), we have
\begin{align*}
E\{\mathrm{tr}(D)\}
&= E[\mathrm{tr}\{e_i' b_{(M^*)}'(I_n - P_{(M^*)})(I_n - P_{(M^*)})b_{(M^*)}e_i\}] \\
&\le E\{\lambda(I_n - P_{(M^*)})\lambda(I_n - P_{(M^*)})\,e_i' b_{(M^*)}' b_{(M^*)} e_i\}
\le E\{e_i' b_{(M^*)}' b_{(M^*)} e_i\}
= \sum_{l=1}^{n} E\Bigg\{\bigg(\sum_{k\in\Omega^c_{M^*}} x_{lk}\theta_{ki}\bigg)^2\Bigg\} = o(n),
\end{align*}
and so
\begin{align*}
E\bigg(\frac{e_i' b_{(M^*)}'(I_n - P_{(M^*)})\mathcal{E} e_j}{n - k_{M^*}}\bigg)^2
&= E\bigg(\frac{e_i' b_{(M^*)}'(I_n - P_{(M^*)})\mathcal{E} e_j e_j' \mathcal{E}'(I_n - P_{(M^*)}) b_{(M^*)} e_i}{(n - k_{M^*})^2}\bigg) \\
&= E\bigg\{\mathrm{tr}\bigg(\frac{(I_n - P_{(M^*)}) b_{(M^*)} e_i e_i' b_{(M^*)}'(I_n - P_{(M^*)})\,\mathcal{E} e_j e_j' \mathcal{E}'}{(n - k_{M^*})^2}\bigg)\bigg\}
= E\bigg\{\mathrm{tr}\bigg(\frac{D\,\mathcal{E} e_j e_j' \mathcal{E}'}{(n - k_{M^*})^2}\bigg)\bigg\} \\
&= E\bigg(\frac{\sum_{l=1}^{n}\sum_{i=1}^{n} d_{li}\varepsilon_{ij}\varepsilon_{lj}}{(n - k_{M^*})^2}\bigg)
= \frac{E\{\sum_{i=1}^{n} d_{ii}\varepsilon_{ij}^2\}}{(n - k_{M^*})^2}
= \frac{E(\sum_{i=1}^{n} d_{ii})\,\sigma_{jj}}{(n - k_{M^*})^2}
= \frac{E\{\mathrm{tr}(D)\}\,\sigma_{jj}}{(n - k_{M^*})^2}
= \frac{o(n)}{(n - k_{M^*})^2} = o\Big(\frac{1}{n}\Big),
\end{align*}
so the ijth element of b_{(M*)}′(I_n − P_{(M*)})ℰ/(n − k_{M*}) is o_p(1/√n), which implies (A4).

In the following, we prove (A5). From the model set-up in Section 2, the ith row of ℰ is ε_{(i)}′, which satisfies E(ε_{(i)} | x_{(i)}) = 0 and E(ε_{(i)}ε_{(i)}′ | x_{(i)}) = Σ = (σ_{ij})_{p×p}; moreover, {ε_{(i)}, i = 1, 2, ..., n} are independent and identically distributed. Let
\[
\hat\sigma_{ij} = \frac{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j}{n - k_{M^*}}.
\]
Then we have
\[
E(\hat\sigma_{ij})
= E\bigg(\frac{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j}{n - k_{M^*}}\bigg)
= \frac{\mathrm{tr}\{(I_n - P_{(M^*)})E(\varepsilon_j\varepsilon_i')\}}{n - k_{M^*}}
= \frac{\mathrm{tr}\{(I_n - P_{(M^*)})\sigma_{ij}I_n\}}{n - k_{M^*}}
= \sigma_{ij}. \tag{A9}
\]

We next bound the second moment of σ̂_{ij}. Denote
\[
A = I_n - P_{(M^*)} =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix},
\]
so that ε_i′(I_n − P_{(M*)})ε_j can be written as Σ_{l=1}^n Σ_{k=1}^n a_{lk}ε_{li}ε_{kj}. Because A is an idempotent matrix, we have
\[
\mathrm{tr}(A'A) = n - k_{M^*}. \tag{A10}
\]
By the properties of the trace of a matrix, we obtain
\[
\mathrm{tr}^2(A) = \bigg(\sum_{l=1}^{n} a_{ll}\bigg)^2
= \sum_{l=1}^{n} a_{ll}^2 + \sum_{l\neq k} a_{ll}a_{kk}
\ \ge\ \sum_{l=1}^{n} a_{ll}^2 \ \text{ and } \ \sum_{l\neq k} a_{ll}a_{kk}, \tag{A11}
\]
and
\[
\mathrm{tr}(A'A) = \sum_{l=1}^{n}\sum_{k=1}^{n} a_{lk}^2
= \sum_{l=1}^{n} a_{ll}^2 + \sum_{l\neq k} a_{lk}^2
\ \ge\ \sum_{l=1}^{n} a_{ll}^2 \ \text{ and } \ \sum_{l\neq k} a_{lk}^2. \tag{A12}
\]

It is easily seen that
\begin{align*}
E[\{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j\}^2]
&= E\Bigg\{\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n} a_{lk}\varepsilon_{li}\varepsilon_{kj}\bigg)^2\Bigg\} \\
&= E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n} a_{lk}^2\varepsilon_{li}^2\varepsilon_{kj}^2\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{d=1, d\neq k}^{n} a_{lk}a_{ld}\varepsilon_{li}^2\varepsilon_{kj}\varepsilon_{dj}\bigg) \\
&\quad + E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{c=1, c\neq l}^{n} a_{lk}a_{ck}\varepsilon_{li}\varepsilon_{ci}\varepsilon_{kj}^2\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{c=1, c\neq l}^{n}\sum_{d=1, d\neq k}^{n} a_{lk}a_{cd}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ci}\varepsilon_{dj}\bigg). \tag{A13}
\end{align*}
We compute the four parts of the above equation. Denote κ′ = κ^{1/G}. By the Cauchy–Schwarz inequality and Condition (11), we obtain E(ε_{ij}⁴) ≤ {E(ε_{ij}^{4G})}^{1/G} ≤ κ^{1/G} = κ′ for all i = 1, ..., n and j = 1, 2, ..., p. Further, by (A10)–(A12),


we have
\begin{align*}
E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n} a_{lk}^2\varepsilon_{li}^2\varepsilon_{kj}^2\bigg)
&= E\bigg(\sum_{l=1}^{n} a_{ll}^2\varepsilon_{li}^2\varepsilon_{lj}^2\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}^2\varepsilon_{li}^2\varepsilon_{kj}^2\bigg) \\
&= \sum_{l=1}^{n} a_{ll}^2 E(\varepsilon_{li}^2\varepsilon_{lj}^2)
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}^2 E(\varepsilon_{li}^2)E(\varepsilon_{kj}^2) \\
&\le \sum_{l=1}^{n} a_{ll}^2\{E(\varepsilon_{li}^4)\}^{1/2}\{E(\varepsilon_{lj}^4)\}^{1/2}
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}^2\sigma_{ii}\sigma_{jj} \\
&\le \kappa'\sum_{l=1}^{n} a_{ll}^2 + \sigma_{ii}\sigma_{jj}\sum_{k=1}^{n}\sum_{l=1,l\neq k}^{n} a_{lk}^2
\le \kappa'\,\mathrm{tr}(A'A) + \sigma_{ii}\sigma_{jj}\,\mathrm{tr}(A'A)
= (\kappa' + \sigma_{ii}\sigma_{jj})(n - k_{M^*}),
\end{align*}
\begin{align*}
E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{d=1,d\neq k}^{n} a_{lk}a_{ld}\varepsilon_{li}^2\varepsilon_{kj}\varepsilon_{dj}\bigg)
&= E\bigg(\sum_{l=1}^{n}\sum_{d=1,d\neq l}^{n} a_{ll}a_{ld}\varepsilon_{li}^2\varepsilon_{lj}\varepsilon_{dj}\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{ll}\varepsilon_{li}^2\varepsilon_{kj}\varepsilon_{lj}\bigg) \\
&\quad + E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{ld}\varepsilon_{li}^2\varepsilon_{kj}\varepsilon_{dj}\bigg) \\
&= \sum_{l=1}^{n}\sum_{d=1,d\neq l}^{n} a_{ll}a_{ld}E(\varepsilon_{li}^2\varepsilon_{lj})E(\varepsilon_{dj})
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{ll}E(\varepsilon_{li}^2\varepsilon_{lj})E(\varepsilon_{kj}) \\
&\quad + \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{ld}E(\varepsilon_{li}^2)E(\varepsilon_{kj})E(\varepsilon_{dj}) = 0,
\end{align*}
\begin{align*}
E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{c=1,c\neq l}^{n} a_{lk}a_{ck}\varepsilon_{li}\varepsilon_{ci}\varepsilon_{kj}^2\bigg)
&= E\bigg(\sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cl}\varepsilon_{li}\varepsilon_{ci}\varepsilon_{lj}^2\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{kk}\varepsilon_{li}\varepsilon_{ki}\varepsilon_{kj}^2\bigg) \\
&\quad + E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n} a_{lk}a_{ck}\varepsilon_{li}\varepsilon_{ci}\varepsilon_{kj}^2\bigg) \\
&= \sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cl}E(\varepsilon_{li}\varepsilon_{lj}^2)E(\varepsilon_{ci})
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{kk}E(\varepsilon_{li})E(\varepsilon_{ki}\varepsilon_{kj}^2) \\
&\quad + \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n} a_{lk}a_{ck}E(\varepsilon_{li})E(\varepsilon_{ci})E(\varepsilon_{kj}^2) = 0,
\end{align*}
and
\begin{align*}
&E\bigg(\sum_{l=1}^{n}\sum_{k=1}^{n}\sum_{c=1,c\neq l}^{n}\sum_{d=1,d\neq k}^{n} a_{lk}a_{cd}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ci}\varepsilon_{dj}\bigg) \\
&\quad= E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{kl}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ki}\varepsilon_{lj}\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cc}\varepsilon_{li}\varepsilon_{lj}\varepsilon_{ci}\varepsilon_{cj}\bigg) \\
&\qquad + E\bigg(\sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n}\sum_{d=1,d\neq c\neq l}^{n} a_{ll}a_{cd}\varepsilon_{li}\varepsilon_{lj}\varepsilon_{ci}\varepsilon_{dj}\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n} a_{lk}a_{cl}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ci}\varepsilon_{lj}\bigg) \\
&\qquad + E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{kd}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ki}\varepsilon_{dj}\bigg)
+ E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{dd}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{di}\varepsilon_{dj}\bigg) \\
&\qquad + E\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n}\sum_{d=1,d\neq c\neq k\neq l}^{n} a_{lk}a_{cd}\varepsilon_{li}\varepsilon_{kj}\varepsilon_{ci}\varepsilon_{dj}\bigg) \\
&\quad= \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}a_{kl}E(\varepsilon_{li}\varepsilon_{lj})E(\varepsilon_{kj}\varepsilon_{ki})
+ \sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cc}E(\varepsilon_{li}\varepsilon_{lj})E(\varepsilon_{ci}\varepsilon_{cj}) \\
&\qquad + \sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n}\sum_{d=1,d\neq c\neq l}^{n} a_{ll}a_{cd}E(\varepsilon_{li}\varepsilon_{lj})E(\varepsilon_{ci})E(\varepsilon_{dj})
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n} a_{lk}a_{cl}E(\varepsilon_{li}\varepsilon_{lj})E(\varepsilon_{kj})E(\varepsilon_{ci}) \\
&\qquad + \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{kd}E(\varepsilon_{li})E(\varepsilon_{kj}\varepsilon_{ki})E(\varepsilon_{dj})
+ \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{d=1,d\neq k\neq l}^{n} a_{lk}a_{dd}E(\varepsilon_{li})E(\varepsilon_{kj})E(\varepsilon_{di}\varepsilon_{dj}) \\
&\qquad + \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n}\sum_{c=1,c\neq k\neq l}^{n}\sum_{d=1,d\neq c\neq k\neq l}^{n} a_{lk}a_{cd}E(\varepsilon_{li})E(\varepsilon_{kj})E(\varepsilon_{ci})E(\varepsilon_{dj}) \\
&\quad= \sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}^2\sigma_{ij}^2 + \sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cc}\sigma_{ij}^2
= \sigma_{ij}^2\bigg(\sum_{l=1}^{n}\sum_{k=1,k\neq l}^{n} a_{lk}^2 + \sum_{l=1}^{n}\sum_{c=1,c\neq l}^{n} a_{ll}a_{cc}\bigg) \\
&\quad\le \sigma_{ij}^2\{\mathrm{tr}(A'A) + \mathrm{tr}^2(A)\}
= \sigma_{ij}^2\{(n - k_{M^*}) + (n - k_{M^*})^2\}.
\end{align*}
Then (A13) can be written as
\[
E[\{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j\}^2]
\le (\kappa' + \sigma_{ii}\sigma_{jj})(n - k_{M^*}) + \sigma_{ij}^2\{(n - k_{M^*}) + (n - k_{M^*})^2\}
= (\kappa' + \sigma_{ii}\sigma_{jj} + \sigma_{ij}^2)(n - k_{M^*}) + \sigma_{ij}^2(n - k_{M^*})^2. \tag{A14}
\]

By (A9), (A14) and Condition (16), the variance of σ̂_{ij} satisfies
\begin{align*}
\mathrm{Var}(\hat\sigma_{ij})
&= \mathrm{Var}\bigg(\frac{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j}{n - k_{M^*}}\bigg)
= E\Bigg\{\bigg(\frac{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j}{n - k_{M^*}}\bigg)^2\Bigg\}
- \Bigg\{E\bigg(\frac{\varepsilon_i'(I_n - P_{(M^*)})\varepsilon_j}{n - k_{M^*}}\bigg)\Bigg\}^2 \\
&\le \frac{(\kappa' + \sigma_{ii}\sigma_{jj} + \sigma_{ij}^2)(n - k_{M^*}) + \sigma_{ij}^2(n - k_{M^*})^2}{(n - k_{M^*})^2} - \sigma_{ij}^2
= \frac{\kappa' + \sigma_{ii}\sigma_{jj} + \sigma_{ij}^2}{n - k_{M^*}}
= O\Big(\frac{1}{n}\Big).
\end{align*}

That is to say,
\[
\hat\sigma_{ij} - \sigma_{ij} = O_p\Big(\frac{1}{\sqrt{n}}\Big), \quad \text{for all } i, j = 1, 2, \ldots, p,
\]
so
\[
\frac{\mathcal{E}'(I_n - P_{(M^*)})\mathcal{E}}{n - k_{M^*}} - \Sigma = o_p(1),
\]
which implies (A5). □

Lemma A.3: Assume that Σ > 0 and that Conditions (11), (16) and (17) hold. Then
\[
\lambda(\hat\Sigma^{-1}) = O_p(1). \tag{A15}
\]
Proof: Since Σ is positive definite and p is fixed, Σ^{-1} is also positive definite and λ(Σ^{-1}) is finite. From Lemma A.2, Σ̂ − Σ = o_p(1), which implies Σ̂^{-1} − Σ^{-1} = o_p(1); therefore,
\[
\lambda(\hat\Sigma^{-1} - \Sigma^{-1}) = o_p(1), \tag{A16}
\]
and then
\[
\lambda(\hat\Sigma^{-1}) \le \lambda(\hat\Sigma^{-1} - \Sigma^{-1}) + \lambda(\Sigma^{-1}) = o_p(1) + O(1) = O_p(1). \qquad \square
\]

A.2 Proof of Theorem 4.1
The Mahalanobis Mallows criterion for model averaging is
\begin{align*}
C_n(w)
&= \{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'\Sigma_0^{-1}\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\} + 2p\,\mathrm{tr}(P(w)) \\
&= L_n(w) + 2\,\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)
- 2\,\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E})
+ \mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}\mathrm{Vec}(\mathcal{E}) + 2p\,\mathrm{tr}(P(w)),
\end{align*}
where Vec(ℰ)′Σ_0^{-1}Vec(ℰ) does not depend on w, so we need only verify that
\[
\sup_{w\in H_n}\Bigg|\frac{\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)}{R_n(w)}\Bigg| \xrightarrow{p} 0, \tag{A17}
\]
\[
\sup_{w\in H_n}\Bigg|\frac{\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}(P(w))}{R_n(w)}\Bigg| \xrightarrow{p} 0, \tag{A18}
\]
and
\[
\sup_{w\in H_n}\bigg|\frac{L_n(w)}{R_n(w)} - 1\bigg| \xrightarrow{p} 0. \tag{A19}
\]
We observe, for any δ > 0, that
\begin{align*}
&P\Bigg(\sup_{w\in H_n}\Bigg|\frac{\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)}{R_n(w)}\Bigg| > \delta\Bigg) \\
&\quad\le P\Big(\max_{1\le m\le M}|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)| > \delta\xi_n\Big) \\
&\quad= P\big(\{|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w_1^0))\mathrm{Vec}(\mu)| > \delta\xi_n\}
\cup \cdots \cup \{|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w_M^0))\mathrm{Vec}(\mu)| > \delta\xi_n\}\big) \\
&\quad\le \sum_{m=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)| > \delta\xi_n\big)
\le \sum_{m=1}^{M}\frac{E|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{c_1\|\Sigma_0^{-1/2}(I - H(w_m^0))\mathrm{Vec}(\mu)\|^{2G}}{\delta^{2G}\xi_n^{2G}}
\le \sum_{m=1}^{M}\frac{c_1\{R_n(w_m^0)\}^G}{\delta^{2G}\xi_n^{2G}}
= \frac{c_1}{\delta^{2G}M}\cdot\frac{M\sum_{m=1}^{M}\{R_n(w_m^0)\}^G}{\xi_n^{2G}} \to 0,
\end{align*}
where c_1 is a positive constant, and c_2 to c_{13} in the following proofs denote different positive constants. Here the first inequality holds because the numerator is linear in w and R_n(w) ≥ ξ_n, the next inequality follows from Bonferroni's inequality, the subsequent one is obtained from an extension of Chebyshev's inequality, the next is ensured by Theorem 2 of Whittle [25] together with Lemma A.1, and the last one follows from Equation (7). Together with Condition (12), this shows that (A17) is true.

Similarly to the proof of (A17), we have
\begin{align*}
&P\Bigg(\sup_{w\in H_n}\Bigg|\frac{\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}(P(w))}{R_n(w)}\Bigg| > \delta\Bigg) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{m=1}^{M} w_m|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H_{(m)}\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}(P_{(m)})| > \delta\xi_n\Bigg) \\
&\quad= P\Big(\max_{1\le m\le M}|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H_{(m)}\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}(P_{(m)})| > \delta\xi_n\Big) \\
&\quad\le \sum_{m=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'\Sigma_0^{-1}H(w_m^0)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}(P(w_m^0))| > \delta\xi_n\big) \\
&\quad\le \sum_{m=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{-1/2}H(w_m^0)\Sigma_0^{1/2}\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\} - p\,\mathrm{tr}(P(w_m^0))|^{2G}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{c_2\,\mathrm{tr}^G\{\Sigma_0^{-1/2}H(w_m^0)\Sigma_0 H'(w_m^0)\Sigma_0^{-1/2}\}}{(\delta\xi_n)^{2G}}
\le \sum_{m=1}^{M}\frac{c_2[\lambda(\Sigma)\lambda(\Sigma^{-1})(p\,\mathrm{tr}\{P(w_m^0)P'(w_m^0)\})]^G}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{c_2\{\lambda(\Sigma)\lambda(\Sigma^{-1})R_n(w_m^0)\}^G}{(\delta\xi_n)^{2G}}
= \frac{c_2\{\lambda(\Sigma)\lambda(\Sigma^{-1})\}^G}{\delta^{2G}M}\cdot\frac{M\sum_{m=1}^{M}\{R_n(w_m^0)\}^G}{\xi_n^{2G}} \to 0,
\end{align*}
which, together with Condition (12), implies (A18). Note that
\[
\sup_{w\in H_n}\bigg|\frac{L_n(w)}{R_n(w)} - 1\bigg|
= \sup_{w\in H_n}\Bigg|\frac{-2\,\mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)
+ \mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E})
- p\,\mathrm{tr}\{P(w)P'(w)\}}{R_n(w)}\Bigg|.
\]


In order to prove (A19), it suffices to verify that
\[
\sup_{w\in H_n}\frac{|\mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)|}{R_n(w)} \xrightarrow{p} 0, \tag{A20}
\]
and
\[
\sup_{w\in H_n}\frac{|\mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}\{P(w)P'(w)\}|}{R_n(w)} \xrightarrow{p} 0. \tag{A21}
\]
Recognizing that, for any weight vectors w and w* ∈ H_n,
\[
\mathrm{tr}\{P^2(w^*)P^2(w)\} \le \lambda^2\{P(w^*)\}\,\mathrm{tr}\{P^2(w)\} \le \mathrm{tr}\{P^2(w)\},
\]
we obtain

\begin{align*}
&P\Bigg(\sup_{w\in H_n}\frac{|\mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}(I - H(w))\mathrm{Vec}(\mu)|}{R_n(w)} > \delta\Bigg) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{t=1}^{M}\sum_{m=1}^{M} w_t w_m|\mathrm{Vec}(\mathcal{E})'H_{(t)}'\Sigma_0^{-1}(I - H_{(m)})\mathrm{Vec}(\mu)| > \delta\xi_n\Bigg) \\
&\quad\le P\Big(\max_{1\le t\le M}\max_{1\le m\le M}|\mathrm{Vec}(\mathcal{E})'H_{(t)}'\Sigma_0^{-1}(I - H_{(m)})\mathrm{Vec}(\mu)| > \delta\xi_n\Big) \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M} P\big(|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{1/2}H'(w_t^0)\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)| > \delta\xi_n\big) \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{1/2}H'(w_t^0)\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)|^{2G}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_3\|\Sigma_0^{1/2}H'(w_t^0)\Sigma_0^{-1}(I - H(w_m^0))\mathrm{Vec}(\mu)\|^{2G}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_3[\lambda(\Sigma_0)\lambda(\Sigma_0^{-1})\lambda\{H'(w_t^0)H(w_t^0)\}]^G\|\Sigma_0^{-1/2}(I - H(w_m^0))\mathrm{Vec}(\mu)\|^{2G}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_3\{\lambda(\Sigma)\lambda(\Sigma^{-1})R_n(w_m^0)\}^G}{(\delta\xi_n)^{2G}}
= \frac{c_3\{\lambda(\Sigma)\lambda(\Sigma^{-1})\}^G}{\delta^{2G}}\cdot\frac{M\sum_{m=1}^{M}\{R_n(w_m^0)\}^G}{\xi_n^{2G}} \to 0,
\end{align*}
which implies (A20). Furthermore, we have

\begin{align*}
&P\Bigg(\sup_{w\in H_n}\frac{|\mathrm{Vec}(\mathcal{E})'H'(w)\Sigma_0^{-1}H(w)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}\{P(w)P'(w)\}|}{R_n(w)} > \delta\Bigg) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{t=1}^{M}\sum_{m=1}^{M} w_t w_m|\mathrm{Vec}(\mathcal{E})'H_{(t)}'\Sigma_0^{-1}H_{(m)}\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}\{P_{(t)}P_{(m)}'\}| > \delta\xi_n\Bigg) \\
&\quad\le P\Big(\max_{1\le t\le M}\max_{1\le m\le M}|\mathrm{Vec}(\mathcal{E})'H_{(t)}'\Sigma_0^{-1}H_{(m)}\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}\{P_{(t)}P_{(m)}'\}| > \delta\xi_n\Big) \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'H'(w_t^0)\Sigma_0^{-1}H(w_m^0)\mathrm{Vec}(\mathcal{E}) - p\,\mathrm{tr}\{P(w_t^0)P'(w_m^0)\}| > \delta\xi_n\big) \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{1/2}H'(w_t^0)\Sigma_0^{-1}H(w_m^0)\Sigma_0^{1/2}\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\} - p\,\mathrm{tr}\{P(w_t^0)P'(w_m^0)\}|^{2G}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_4\,\mathrm{tr}^G\{\Sigma_0^{1/2}H'(w_t^0)\Sigma_0^{-1}H(w_m^0)\Sigma_0 H'(w_m^0)\Sigma_0^{-1}H(w_t^0)\Sigma_0^{1/2}\}}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_4[\lambda^2(\Sigma_0)\lambda^2(\Sigma_0^{-1})\lambda\{H(w_m^0)H'(w_m^0)\}\,\mathrm{tr}\{H'(w_t^0)H(w_t^0)\}]^G}{(\delta\xi_n)^{2G}} \\
&\quad\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_4[\{\lambda(\Sigma)\lambda(\Sigma^{-1})\}^2 p\,\mathrm{tr}\{P(w_m^0)P'(w_m^0)\}]^G}{(\delta\xi_n)^{2G}}
\le \sum_{t=1}^{M}\sum_{m=1}^{M}\frac{c_4[\{\lambda(\Sigma)\lambda(\Sigma^{-1})\}^2 R_n(w_m^0)]^G}{(\delta\xi_n)^{2G}} \\
&\quad= \frac{c_4\{\lambda(\Sigma)\lambda(\Sigma^{-1})\}^{2G}}{\delta^{2G}}\cdot\frac{M\sum_{m=1}^{M}\{R_n(w_m^0)\}^G}{\xi_n^{2G}} \to 0,
\end{align*}
which implies (A21).

A.3 Proof of Theorem 4.2
When Σ is replaced by Σ̂, the Mahalanobis Mallows criterion can be written as
\begin{align*}
\hat C_n(w)
&= \{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'\hat\Sigma_0^{-1}\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\} + 2p\,\mathrm{tr}(P(w)) \\
&= C_n(w) + \{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'(\hat\Sigma_0^{-1} - \Sigma_0^{-1})\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\},
\end{align*}
where Σ̂_0^{-1} = Σ̂^{-1} ⊗ I_n. Hence, in view of Theorem 4.1, it suffices to prove that, as n → ∞,
\begin{align*}
\sup_{w\in H_n}|\hat C_n(w) - C_n(w)|/R_n(w)
&= \sup_{w\in H_n}|\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'(\hat\Sigma_0^{-1} - \Sigma_0^{-1})\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}|/R_n(w) \\
&\le \lambda(\hat\Sigma_0^{-1} - \Sigma_0^{-1})\sup_{w\in H_n}|\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}'\{\mathrm{Vec}(Y) - \mathrm{Vec}(\hat\mu(w))\}|/R_n(w)
\xrightarrow{p} 0,
\end{align*}
where
\[
\lambda(\hat\Sigma_0^{-1} - \Sigma_0^{-1})
= \lambda\{(\hat\Sigma^{-1} - \Sigma^{-1}) \otimes I_n\}
= \lambda(\hat\Sigma^{-1} - \Sigma^{-1})
= \lambda\{\hat\Sigma^{-1}(\Sigma - \hat\Sigma)\Sigma^{-1}\}
\le \lambda(\hat\Sigma^{-1})\lambda(\hat\Sigma - \Sigma)\lambda(\Sigma^{-1})
\]

supw∈Hn

|{Vec(Y) − Vec(μ(w))}′{Vec(Y) − Vec(μ(w))}|/Rn(w)

≤ supw∈Hn

|{Vec(Y) − Vec(μ(w))}′{Vec(Y) − Vec(μ(w))}|/ξn

= supw∈Hn

|Vec(μ)′(In − H(w))′(In − H(w))Vec(μ)|/ξn

+ supw∈Hn

|Vec(�)′(In − H(w))′(In − H(w))Vec(�) − tr{(In − H(w))′(In − H(w))�0}|/ξn

+ 2 supw∈Hn

|Vec(�)′(In − H(w))′(In − H(w))Vec(μ)|/ξn

+ supw∈Hn

|tr{(In − H(w))′(In − H(w))�0}|/ξn

= supw∈Hn

|Vec(μ)′(In − H(w))′(In − H(w))Vec(μ)|/ξn

+ supw∈Hn

|Vec(�)′Vec(�) − tr{�0}|/ξn

Dow

nloa

ded

by [

Aca

dem

y of

Mat

hem

atic

s an

d Sy

stem

Sci

ence

s] a

t 05:

39 1

2 Se

ptem

ber

2017

Page 22: Model averaging for multivariate multiple regression models‡º版mod… · STATISTICS 5 Inaddition,ifinsteadof Ln(w),weusethesumofsquareslossfunctions L0,n(w) ={Vec(μ(ˆ w))−Vec(μ)}

STATISTICS 21

− 2 supw∈Hn

|Vec(�)′H(w)Vec(�) − tr{H(w)�0}|/ξn

+ supw∈Hn

|Vec(�)′H′(w)H(w)Vec(�) − tr{H′(w)H(w)�0}|/ξn

+ 2 supw∈Hn

|Vec(�)′(In − H(w))′(In − H(w))Vec(μ)|/ξn

+ supw∈Hn

|tr{(In − H(w))′(In − H(w))�0}|/ξn.

Since λ(Σ^{-1}) is finite and λ(Σ̂^{-1}) = O_p(1), we need to verify that
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mu)'(I - H(w))'(I - H(w))\mathrm{Vec}(\mu)|/\xi_n \xrightarrow{p} 0, \tag{A22}
\]
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}(\Sigma_0)|/\xi_n \xrightarrow{p} 0, \tag{A23}
\]
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'H(w)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H(w)\Sigma_0\}|/\xi_n \xrightarrow{p} 0, \tag{A24}
\]
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'H'(w)H(w)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H'(w)H(w)\Sigma_0\}|/\xi_n \xrightarrow{p} 0, \tag{A25}
\]
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'(I - H(w))'(I - H(w))\mathrm{Vec}(\mu)|/\xi_n \xrightarrow{p} 0, \tag{A26}
\]
and
\[
\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{tr}\{(I - H(w))'(I - H(w))\Sigma_0\}|/\xi_n \xrightarrow{p} 0. \tag{A27}
\]

By Conditions (15) and (18), we have
\begin{align*}
&\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{Vec}(\mu)'(I - H(w))'(I - H(w))\mathrm{Vec}(\mu)|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}\sum_{m=1}^{M}\sum_{t=1}^{M} w_m w_t|\mathrm{Vec}(\mu)'(I - H_{(m)})'(I - H_{(t)})\mathrm{Vec}(\mu)|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\max_{1\le m\le M}\max_{1\le t\le M}|\mathrm{Vec}(\mu)'(I - H(w_m^0))'(I - H(w_t^0))\mathrm{Vec}(\mu)|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\max_{1\le m\le M}\max_{1\le t\le M}\lambda(I - H(w_m^0))\lambda(I - H(w_t^0))|\mathrm{Vec}(\mu)'\mathrm{Vec}(\mu)|/\xi_n \\
&\quad= \lambda(\hat\Sigma - \Sigma)\,O(\xi_n^{-1}n) \xrightarrow{p} 0,
\end{align*}
which implies (A22). Write (A23) as
\[
\lambda(\hat\Sigma - \Sigma)\,\xi_n^{-1}n\,\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}(\Sigma_0)|/n
= \lambda(\hat\Sigma - \Sigma)\,\xi_n^{-1}n\cdot|\mathrm{Vec}(\mathcal{E})'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}(\Sigma_0)|/n. \tag{A28}
\]

For any δ > 0,
\begin{align*}
P(|\mathrm{Vec}(\mathcal{E})'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}(\Sigma_0)|/n > \delta)
&= P(|\mathrm{Vec}(\mathcal{E})'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}(\Sigma_0)| > \delta n) \\
&\le \frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\} - \mathrm{tr}(\Sigma_0)|^2}{\delta^2 n^2} \\
&\le \frac{c_5\,\mathrm{tr}(\Sigma_0^2)}{\delta^2 n^2}
\le \frac{c_5\lambda^2(\Sigma_0)\,\mathrm{tr}(I_{np})}{\delta^2 n^2}
\le \frac{c_6\, p}{\delta^2 n} \to 0,
\end{align*}


which, together with Condition (15) and (A28), implies (A23). To prove (A24), by λ(Σ̂ − Σ) = o_p(1), one has
\begin{align*}
&P\Big(\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'H'(w)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H(w)\Sigma_0\}|/\xi_n > \delta\Big) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{m=1}^{M} w_m|\mathrm{Vec}(\mathcal{E})'H_{(m)}'\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H_{(m)}\Sigma_0\}| > \delta\xi_n\Bigg) \\
&\quad= P\Big(\max_{1\le m\le M}|\mathrm{Vec}(\mathcal{E})'H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H(w_m^0)\Sigma_0\}| > \delta\xi_n\Big) \\
&\quad\le \sum_{m=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H(w_m^0)\Sigma_0\}| > \delta\xi_n\big)
\le \sum_{m=1}^{M}\frac{E|\mathrm{Vec}(\mathcal{E})'H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H(w_m^0)\Sigma_0\}|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\{\Sigma_0^{1/2}H(w_m^0)\Sigma_0^{1/2}\}\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\} - \mathrm{tr}\{H(w_m^0)\Sigma_0\}|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{c_7\,\mathrm{tr}^G\{\Sigma_0^{1/2}H(w_m^0)\Sigma_0 H'(w_m^0)\Sigma_0^{1/2}\}}{\delta^{2G}\xi_n^{2G}}
\le \sum_{m=1}^{M}\frac{c_7\{\lambda^2(\Sigma_0)\}^G\,\mathrm{tr}^G\{H'(w_m^0)\Sigma_0^{-1}H(w_m^0)\Sigma_0\}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\frac{c_8\{R_n(w_m^0)\}^G}{\delta^{2G}\xi_n^{2G}} \to 0.
\end{align*}

Now we prove (A25). For any δ > 0,
\begin{align*}
&P\Big(\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'H'(w)H(w)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H'(w)H(w)\Sigma_0\}|/\xi_n > \delta\Big) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{m=1}^{M}\sum_{t=1}^{M} w_m w_t|\mathrm{Vec}(\mathcal{E})'H_{(t)}'H_{(m)}\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H_{(t)}'H_{(m)}\Sigma_0\}| > \delta\xi_n\Bigg) \\
&\quad\le P\Big(\max_{1\le m\le M}\max_{1\le t\le M}|\mathrm{Vec}(\mathcal{E})'H'(w_t^0)H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H'(w_t^0)H(w_m^0)\Sigma_0\}| > \delta\xi_n\Big) \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'H'(w_t^0)H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H'(w_t^0)H(w_m^0)\Sigma_0\}| > \delta\xi_n\big) \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{E|\mathrm{Vec}(\mathcal{E})'H'(w_t^0)H(w_m^0)\mathrm{Vec}(\mathcal{E}) - \mathrm{tr}\{H'(w_t^0)H(w_m^0)\Sigma_0\}|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{1/2}H'(w_t^0)H(w_m^0)\Sigma_0^{1/2}\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\} - \mathrm{tr}\{H'(w_t^0)H(w_m^0)\Sigma_0\}|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_9\,\mathrm{tr}^G\{\Sigma_0^{1/2}H'(w_t^0)H(w_m^0)\Sigma_0 H'(w_m^0)H(w_t^0)\Sigma_0^{1/2}\}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_9\{\lambda^2(\Sigma_0)\lambda^2(H(w_m^0))\}^G\,\mathrm{tr}^G\{H'(w_t^0)\Sigma_0^{-1}H(w_t^0)\Sigma_0\}}{\delta^{2G}\xi_n^{2G}}
\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_{10}\{R_n(w_t^0)\}^G}{\delta^{2G}\xi_n^{2G}} \\
&\quad= \frac{c_{10}M\sum_{t=1}^{M}\{R_n(w_t^0)\}^G}{\delta^{2G}\xi_n^{2G}} \to 0,
\end{align*}
and then (A25) follows from λ(Σ̂ − Σ) = o_p(1). Similarly,

\begin{align*}
&P\Big(\sup_{w\in H_n}|\mathrm{Vec}(\mathcal{E})'(I - H(w))'(I - H(w))\mathrm{Vec}(\mu)|/\xi_n > \delta\Big) \\
&\quad\le P\Bigg(\sup_{w\in H_n}\sum_{m=1}^{M}\sum_{t=1}^{M} w_m w_t|\mathrm{Vec}(\mathcal{E})'(I - H_{(t)})'(I - H_{(m)})\mathrm{Vec}(\mu)| > \delta\xi_n\Bigg) \\
&\quad\le P\Big(\max_{1\le m\le M}\max_{1\le t\le M}|\mathrm{Vec}(\mathcal{E})'(I - H(w_t^0))'(I - H(w_m^0))\mathrm{Vec}(\mu)| > \delta\xi_n\Big) \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M} P\big(|\mathrm{Vec}(\mathcal{E})'(I - H(w_t^0))'(I - H(w_m^0))\mathrm{Vec}(\mu)| > \delta\xi_n\big) \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{E|\mathrm{Vec}(\mathcal{E})'(I - H(w_t^0))'(I - H(w_m^0))\mathrm{Vec}(\mu)|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{E|\{\Sigma_0^{-1/2}\mathrm{Vec}(\mathcal{E})\}'\Sigma_0^{1/2}(I - H(w_t^0))'(I - H(w_m^0))\mathrm{Vec}(\mu)|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_{11}\|\Sigma_0^{1/2}(I - H(w_t^0))'(I - H(w_m^0))\mathrm{Vec}(\mu)\|^{2G}}{\delta^{2G}\xi_n^{2G}} \\
&\quad\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_{11}\{\lambda^2(\Sigma_0)\lambda^2(I - H(w_t^0))\}^G\|\Sigma_0^{-1/2}(I - H(w_m^0))\mathrm{Vec}(\mu)\|^{2G}}{\delta^{2G}\xi_n^{2G}}
\le \sum_{m=1}^{M}\sum_{t=1}^{M}\frac{c_{12}\{R_n(w_m^0)\}^G}{\delta^{2G}\xi_n^{2G}} \\
&\quad= \frac{c_{12}M\sum_{m=1}^{M}\{R_n(w_m^0)\}^G}{\delta^{2G}\xi_n^{2G}} \to 0,
\end{align*}
which, together with λ(Σ̂ − Σ) = o_p(1), implies (A26). Finally, we prove (A27).

\begin{align*}
&\lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}|\mathrm{tr}\{(I - H(w))'(I - H(w))\Sigma_0\}|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\sup_{w\in H_n}\sum_{m=1}^{M}\sum_{t=1}^{M} w_m w_t|\mathrm{tr}\{(I - H(w_t^0))'(I - H(w_m^0))\Sigma_0\}|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\max_{1\le m\le M}\max_{1\le t\le M}|\mathrm{tr}\{(I - H(w_t^0))'(I - H(w_m^0))\Sigma_0\}|/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\max_{1\le m\le M}\max_{1\le t\le M}\lambda(I - H(w_t^0))\lambda(I - H(w_m^0))\lambda(\Sigma_0)\,\mathrm{tr}(I_{np})/\xi_n \\
&\quad\le \lambda(\hat\Sigma - \Sigma)\,c_{13}\,np/\xi_n,
\end{align*}
which implies (A27) by Condition (15). This completes the proof of Theorem 4.2.
