Post on 07-Aug-2021
ESTIMATION OF SOMETIMES-POOL PREDICTOR IN MULTIPLE REGRESSION ANALYSIS
A. M. KANDIL
ABSTRACT
An objective statistical procedure is used as an aid in deciding whether or not to pool two or more regression equations. The sometimes-pool predictor is derived. Relative efficiencies of the sometimes-pool predictor with respect to the never-pool predictor are obtained. Biases of estimation of the sometimes-pool (S.P.) predictor are derived, and computations for two numerical examples are presented.
Keywords: sometimes-pool predictor, never-pool predictor, M.S.E (S.P.), M.S.E (N.P.), relative efficiency (R.E).
1. INTRODUCTION
This paper is concerned with inference procedures involving the use of preliminary tests of significance to determine whether or not to pool two or more linear regression lines, or multiple regressions, with each other.
* Zagazig University, Banha Branch.
Since an investigator is usually interested in making inferences from a sample about the population from which it was generated, we will be concerned primarily with the effects that pooling, or not pooling, regression estimates has on subsequent inferences.
Bancroft (1944), Mosteller (1948), Bancroft (1964), Kale and Bancroft (1967), Han and Bancroft (1968), and Larson and Barr (1972) studied the problem of pooling data.
Suppose the investigator suspects, but is not certain, that the conditions causing the underlying linear relationship between Y and X are the same for 1969 as for 1970. Johnson, Bancroft and Han (1977) considered the possibility of pooling the 1969 sample regression with the 1970 sample regression.
An important statistical problem in applied substantive research is considered in this paper. Suppose we have many explanatory variables and we wish to investigate the effects of these variables on a response variable. In this case an ordinary multiple regression analysis requires a very large sample to make inferences about the population regression model of interest. Instead, we wish to use an objective statistical procedure as an aid in deciding whether or not to pool two or more multiple regression equations, which are defined from one suitable sample, to make an inference about a common regression model of interest.
To illustrate the idea of this paper, let us consider the case where we wish to investigate the effects of two variables on a third. Thus, the ordinary multiple regression model is defined as
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + e_i; \quad i = 1, 2, \ldots, n \tag{1}$$
The least squares method can be used to estimate $\beta_0$, $\beta_1$ and $\beta_2$ and consequently the estimated value of $Y_i$.
Now, divide the set of explanatory variables up into two parts with, say, the first variable X in the first part and the remaining variable Z in the second part.
Then fit the first linear relationship to the n observations $(X_1, Y_{11}), (X_2, Y_{21}), \ldots, (X_n, Y_{n1})$, denoting this fitted line by
$$Y_{i1} = \beta_{01} + \beta_{11} X_i + u_i; \quad i = 1, 2, \ldots, n \tag{2}$$
Similarly, fit the second linear relationship to the same sample $(Z_1, Y_{12}), (Z_2, Y_{22}), \ldots, (Z_n, Y_{n2})$, denoting this fitted line by
$$Y_{i2} = \beta_{02} + \beta_{12} Z_i + \varepsilon_i; \quad i = 1, 2, \ldots, n \tag{3}$$
Let us make the ordinary assumptions about the true deviation terms $u_i$, $\varepsilon_i$ and $e_i$.
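The two separate fits in equations (2) and (3) can be sketched in a few lines of plain Python. This is only an illustration of ordinary least squares for a single regressor. The function name `fit_line` and the data are hypothetical and are not taken from the paper.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r2), where r2 is the coefficient of determination.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r2 = sxy ** 2 / (sxx * syy)
    return b0, b1, r2


# Hypothetical data, for illustration only (not from the paper).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

line_x = fit_line(x, y)  # plays the role of equation (2): Y on X
line_z = fit_line(z, y)  # plays the role of equation (3): Y on Z
```

Each fit returns its own coefficient of determination, which is the quantity the pooling procedure later uses as a weight.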
In this paper we wish to use an objective statistical procedure as an aid in deciding whether or not to pool the estimated value $\hat{Y}_{i1}$ from the first regression equation (2) with the estimated value $\hat{Y}_{i2}$ from the second regression equation (3) to make an inference about a common regression model of interest as in equation (1).
Such a procedure leads to a "sometimes-pool" predictor, which is defined and studied in section (2). Relative efficiency (R.E), mean square errors (M.S.E), and biases of estimation of the sometimes-pool predictor are derived in section (3). Two numerical examples are illustrated in section (4).
2. POOLING PROCEDURE
(2.1) THE CASE OF TWO REGRESSION EQUATIONS
Consider the two regression models (2) and (3) as
$$Y_{i1} = \beta_{01} + \beta_{11} X_i + u_i; \quad i = 1, 2, \ldots, n$$
and
$$Y_{i2} = \beta_{02} + \beta_{12} Z_i + \varepsilon_i; \quad i = 1, 2, \ldots, n$$
where $X_i$ and $Z_i$ are known, $u_i$ and $\varepsilon_i$ are normal random variables, $\beta_{01}$ and $\beta_{11}$ are the parameters of the first model, $\beta_{02}$ and $\beta_{12}$ are the parameters of the second model, $Y_{i1}$ is the i-th observation of a phenomenon for the first model and $Y_{i2}$ is the i-th observation of the same phenomenon for the second model.
Since the determination coefficient $R^2$ is considered as a measure of the goodness of fit of the fitted line to the observations, we can use it as a weight for the estimated values $\hat{Y}_{i1}$ and $\hat{Y}_{i2}$ ($i = 1, 2, \ldots, n$).
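Combining the two weighted regression models, the linear estimator for $Y_i$ is defined as:
$$\hat{Y}_{ic} = W_1 \hat{Y}_{i1} + W_2 \hat{Y}_{i2}; \quad i = 1, 2, \ldots, n \tag{4}$$
where
$$W_t = \frac{R_t^2}{R_1^2 + R_2^2}; \quad t = 1, 2 \tag{5}$$
Thus, we have
$$\hat{\beta}_{0c} = W_1 \hat{\beta}_{01} + W_2 \hat{\beta}_{02} = \frac{R_1^2 \hat{\beta}_{01} + R_2^2 \hat{\beta}_{02}}{R_1^2 + R_2^2} \tag{6}$$
$$\hat{\beta}_{1c} = W_1 \hat{\beta}_{11} = \frac{R_1^2 \hat{\beta}_{11}}{R_1^2 + R_2^2} \tag{7}$$
$$\hat{\beta}_{2c} = W_2 \hat{\beta}_{12} = \frac{R_2^2 \hat{\beta}_{12}}{R_1^2 + R_2^2} \tag{8}$$
and hence
$$\hat{Y}_{ic} = \hat{\beta}_{0c} + \hat{\beta}_{1c} X_i + \hat{\beta}_{2c} Z_i; \quad i = 1, 2, \ldots, n \tag{9}$$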
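The weighting in equations (4) to (9) can be checked numerically. The sketch below uses a hypothetical helper, `pool_two_lines`, and applies it to the values reported in the first numerical example of section (4). It is an illustration of the formulas, not the author's own computation.

```python
def pool_two_lines(r2_1, b01, b11, r2_2, b02, b12):
    """Pool two fitted lines with R^2-based weights.

    Weights:       W_t = R_t^2 / (R_1^2 + R_2^2)
    Pooled model:  b0c + b1c * X + b2c * Z
    """
    total = r2_1 + r2_2
    w1 = r2_1 / total
    w2 = r2_2 / total
    b0c = w1 * b01 + w2 * b02  # pooled intercept
    b1c = w1 * b11             # pooled coefficient of X
    b2c = w2 * b12             # pooled coefficient of Z
    return (w1, w2), (b0c, b1c, b2c)


# Inputs as reported in the first numerical example of the paper.
weights, coeffs = pool_two_lines(
    r2_1=0.9887, b01=8.76268, b11=0.5763,
    r2_2=0.7914, b02=-24.8362, b12=22.6580,
)
```

With these inputs the weights and pooled coefficients come out close to the reported 0.5554, 0.4446, -6.1763, 0.3201 and 10.0733. The small differences are at the level of rounding in the published inputs.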
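Now, we are interested in estimating the equation
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\beta}_2 Z_i; \quad i = 1, 2, \ldots, n \tag{10}$$
when it is suspected that $\hat{\beta}_0 = \hat{\beta}_{0c}$, $\hat{\beta}_1 = \hat{\beta}_{1c}$, $\hat{\beta}_2 = \hat{\beta}_{2c}$ and consequently $\hat{Y}_i = \hat{Y}_{ic}$. We then test whether the two equations $\hat{Y}_i$ and $\hat{Y}_{ic}$ are significantly different by testing whether $\hat{\beta}_0$ and $\hat{\beta}_{0c}$ are significantly different, $\hat{\beta}_1$ and $\hat{\beta}_{1c}$ are significantly different, $\hat{\beta}_2$ and $\hat{\beta}_{2c}$ are significantly different, and consequently whether $\hat{Y}_i$ and $\hat{Y}_{ic}$ are significantly different in the usual way.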
Consider the null hypothesis that the two lines (9) and (10) are both estimates of the true line. The estimator of $Y_i$ is therefore defined as
$$Y_i^* = \begin{cases} \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\beta}_2 Z_i & \text{if } \chi^2 > \chi^2_\alpha \\ \hat{Y}_{ic} = \hat{\beta}_{0c} + \hat{\beta}_{1c} X_i + \hat{\beta}_{2c} Z_i & \text{if } \chi^2 \le \chi^2_\alpha \end{cases} \tag{11}$$
where $\chi^2$ is the test statistic for testing $H_0: \hat{Y}_i = \hat{Y}_{ic}$ and
$$\chi^2 = \sum_{i=1}^{n} \frac{(\hat{Y}_i - \hat{Y}_{ic})^2}{\hat{Y}_{ic}} \tag{12}$$
The $\chi^2_\alpha$ value is the $100(1-\alpha)$ percentage point of the $\chi^2$ distribution with $(n-3)$ degrees of freedom. It should be noted that, since $\hat{Y}_i$ is the best unbiased estimate of $Y_i$, the observed $Y_i$ can be used to compute the value of $\chi^2$ if $\hat{Y}_i$ is not defined. The estimator $Y_i^*$ is referred to as the sometimes-pool predictor.
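The decision rule can be written as a short function. This is a minimal sketch, assuming the statistic is the sum of squared differences between the never-pool and pooled predictions, each divided by the pooled prediction, and that pooling is chosen when the statistic does not exceed the tabulated critical value. The names `chi_square_stat` and `sometimes_pool` and the sample numbers are hypothetical. The critical value is passed in (for example, read from a chi-square table) rather than computed.

```python
def chi_square_stat(y_reference, y_pooled):
    """Sum of (reference - pooled)^2 / pooled over all observations.

    Assumes every pooled prediction is positive, since it is the divisor.
    """
    return sum((a - b) ** 2 / b for a, b in zip(y_reference, y_pooled))


def sometimes_pool(y_never_pool, y_pooled, critical_value):
    """Return (predictions, statistic, pooled_flag).

    The pooled predictions are used when the statistic does not exceed
    the critical value; otherwise the never-pool predictions are kept.
    """
    stat = chi_square_stat(y_never_pool, y_pooled)
    if stat <= critical_value:
        return list(y_pooled), stat, True
    return list(y_never_pool), stat, False


# Hypothetical predictions, for illustration only.
never_pool = [10.0, 12.0, 14.0]
pooled = [10.5, 11.5, 14.2]
prediction, statistic, used_pooled = sometimes_pool(never_pool, pooled, 16.92)
```

Passing the observed responses instead of the never-pool predictions as the first argument gives the alternative form of the statistic mentioned in the text.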
(2.2) THE CASE OF MORE THAN TWO MULTIPLE REGRESSION EQUATIONS
Let us consider the case where we have $m > 2$ independent explanatory variables. The multiple regression model is defined as
$$Y = X\beta + \varepsilon \tag{13}$$
where $Y$ is an $n \times 1$ random vector having the expected value $X\beta$ and covariance matrix $\sigma^2 I$, $\beta' = [\beta_0, \beta_1, \ldots, \beta_m]$ is the vector of parameters, $X$ is an $n \times (m+1)$ matrix with the vector $x_0 = 1$ and $m$ vectors $x_j$ ($j = 1, 2, \ldots, m$) which are the values of the $m$ explanatory variables, and $\varepsilon$ is the vector of random variables, with the ordinary assumptions of the least squares method.
Divide the set of explanatory variables up into $p > 2$ parts with, say,
$X_1 = (X_{11}, X_{12}, \ldots, X_{1r_1})$ for the first part,
$X_2 = (X_{21}, X_{22}, \ldots, X_{2r_2})$ for the second part,
...
$X_p = (X_{p1}, X_{p2}, \ldots, X_{pr_p})$ for the p-th part,
where $\sum_{t=1}^{p} r_t = m$.
Then fit the $p$ relationships by
$$Y_t = X_t \beta_t + e_t; \quad t = 1, 2, \ldots, p \tag{14}$$
where $Y_t$ is an $n \times 1$ random vector of observations, $X_t$ is an $n \times (r_t + 1)$ matrix, and $\beta_t$ is an $(r_t + 1) \times 1$ vector of parameters for the t-th model.
Therefore, the estimator of $Y$ is defined as
$$Y^{**} = \begin{cases} \hat{Y}_c = X\hat{\beta}_c & \text{if } \chi^2 \le \chi^2_\alpha \\ \hat{Y} = X\hat{\beta} & \text{if } \chi^2 > \chi^2_\alpha \end{cases} \tag{15}$$
where $\hat{\beta}_c' = [\hat{\beta}_{0c}, \hat{\beta}_{1c}, \ldots, \hat{\beta}_{mc}]$ with
$$\hat{\beta}_{0c} = \sum_{t=1}^{p} W_t \hat{\beta}_{0t} = \frac{\sum_{t=1}^{p} R_t^2 \hat{\beta}_{0t}}{\sum_{t=1}^{p} R_t^2} \tag{16}$$
$$\hat{\beta}_{jc} = W_t \hat{\beta}_{jt} = \frac{R_t^2 \hat{\beta}_{jt}}{\sum_{t=1}^{p} R_t^2}; \quad j = 1, 2, \ldots, m; \; t = 1, 2, \ldots, p \tag{17}$$
$\chi^2$ is the test statistic for testing $H_0: \hat{Y} = \hat{Y}_c$ and $\chi^2_\alpha$ is the $100(1-\alpha)$ percentage point of the $\chi^2$ distribution with $(n - m - 1)$ degrees of freedom.
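The same weighting extends to any number of parts. The sketch below is an illustration under stated assumptions. The helper `pool_parts` and its dictionary layout are hypothetical. The inputs are the values reported in the second numerical example of section (4), with the third coefficient of determination taken as 0.00049, the value consistent with the reported weights.

```python
def pool_parts(parts):
    """Pool several fitted sub-models with R^2-based weights.

    Each part is a dict with keys:
      "r2"        coefficient of determination of that sub-model
      "intercept" fitted intercept
      "slopes"    dict mapping variable name -> fitted slope
    Each explanatory variable is assumed to appear in exactly one part.
    """
    total_r2 = sum(part["r2"] for part in parts)
    weights = [part["r2"] / total_r2 for part in parts]
    intercept = sum(w * part["intercept"] for w, part in zip(weights, parts))
    slopes = {}
    for w, part in zip(weights, parts):
        for name, coef in part["slopes"].items():
            slopes[name] = w * coef
    return weights, intercept, slopes


# Values as reported in the second numerical example of the paper.
example_parts = [
    {"r2": 0.0096, "intercept": 94.6788, "slopes": {"x1": 0.062182}},
    {"r2": 0.5947, "intercept": 77.0146, "slopes": {"x2": 0.24487}},
    {"r2": 0.00049, "intercept": 98.7258,
     "slopes": {"x3": 0.0049, "x4": 0.1383}},
]
w, b0c, pooled_slopes = pool_parts(example_parts)
```

The resulting weights, intercept and slopes agree with the reported pooled values (for instance an intercept near 77.3125) up to rounding of the inputs.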
3. ESTIMATION OF RELATIVE EFFICIENCIES AND BIASES OF SOMETIMES-POOL ESTIMATORS
(3.1) THE RELATIVE EFFICIENCY OF $Y_i^*$
In this study the estimated relative efficiency (R.E) of the sometimes-pool estimator $Y_i^*$ to the never-pool estimator $\hat{Y}_i$ is defined to be the inverse ratio of the arithmetic averages of the mean square errors (M.S.E) for $Y_i^*$ and the never-pool estimator $\hat{Y}_i$, taken over the observed values. In other words, R.E $(Y_i^*)$ is defined as:
$$\text{R.E}(Y_i^*) = \frac{\text{M.S.E}(\hat{Y}_i)}{\text{M.S.E}(Y_i^*)} \tag{18}$$
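It is easy to show that
$$\text{M.S.E}(Y_i^*) = \frac{1}{n} \sum_{i=1}^{n} \left[ W_1^2 \sigma_u^2 \left( \frac{1}{n} + \frac{x_i^2}{C_1} \right) + W_2^2 \sigma_\varepsilon^2 \left( \frac{1}{n} + \frac{z_i^2}{C_2} \right) \right] \tag{19}$$
where
$$C_1 = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} x_i^2, \quad C_2 = \sum_{i=1}^{n} (Z_i - \bar{Z})^2 = \sum_{i=1}^{n} z_i^2 \tag{20}$$
Writing $\sigma_c^2 = W_1^2 \sigma_u^2 + W_2^2 \sigma_\varepsilon^2$, then
$$\text{M.S.E}(Y_i^*) = \frac{2\sigma_c^2}{n} \tag{21}$$
Also, it is easy to show that
$$\text{M.S.E}(\hat{Y}_i) = \frac{\sigma^2}{n} \sum_{i=1}^{n} \left[ \frac{1}{n} + \frac{1}{1 - R^2} \left( \frac{x_i^2}{C_1} + \frac{z_i^2}{C_2} - \frac{2 x_i z_i \sum x_i z_i}{C_1 C_2} \right) \right] = \frac{3\sigma^2}{n} \tag{22}$$
where
$$R^2 = \frac{\left( \sum_{i=1}^{n} x_i z_i \right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} z_i^2}$$
From equations (21) and (22), we have
$$\text{R.E}(Y_i^*) = \frac{3\sigma^2}{2\sigma_c^2} \tag{23}$$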
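The constants 2 and 3 in this ratio can be checked numerically. The sketch below assumes the standard prediction-variance factors for a one-regressor line and for a two-regressor model written in deviations from the means. It assumes that X and Z are not perfectly correlated. The function names and the data are hypothetical.

```python
def average_variance_factors(x, z):
    """Average prediction-variance factors over the sample.

    Returns (simple, full):
      simple: mean of 1/n + x_i^2 / C1, expected to equal 2/n
      full:   mean of the two-regressor factor, expected to equal 3/n
    Lower-case x_i, z_i denote deviations from the sample means.
    """
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    xd = [v - x_bar for v in x]
    zd = [v - z_bar for v in z]
    c1 = sum(v * v for v in xd)
    c2 = sum(v * v for v in zd)
    sxz = sum(a * b for a, b in zip(xd, zd))
    r2 = sxz ** 2 / (c1 * c2)  # squared correlation between X and Z
    simple = sum(1 / n + a * a / c1 for a in xd) / n
    full = sum(
        1 / n
        + (a * a / c1 + b * b / c2 - 2 * a * b * sxz / (c1 * c2)) / (1 - r2)
        for a, b in zip(xd, zd)
    ) / n
    return simple, full


def relative_efficiency(sigma2, sigma2_u, sigma2_eps, w1, w2):
    """R.E = 3 * sigma^2 / (2 * sigma_c^2), sigma_c^2 = w1^2 s_u^2 + w2^2 s_eps^2."""
    sigma2_c = w1 ** 2 * sigma2_u + w2 ** 2 * sigma2_eps
    return 3 * sigma2 / (2 * sigma2_c)


# Hypothetical data, for illustration only.
simple, full = average_variance_factors(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]
)
```

For any sample the first factor averages to 2/n and the second to 3/n, which is where the ratio 3/2 in the relative efficiency comes from.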
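(3.2) THE BIAS OF ESTIMATOR $Y_i^*$
Since
$$Y_i^* = W_1 \hat{Y}_{i1} + W_2 \hat{Y}_{i2} \tag{24}$$
it is easy to show that
$$E(Y_i^*) = (B_0 + B_1 X_i + B_2 Z_i) + Y_i \tag{25}$$
Therefore
$$\text{Bias}(Y_i^*) = \hat{B}_0 + \hat{B}_1 X_i + \hat{B}_2 Z_i; \quad i = 1, 2, \ldots, n \tag{26}$$
where
$$\hat{B}_0 = W_1 \hat{\beta}_{01} + W_2 \hat{\beta}_{02} - \hat{\beta}_0, \quad \hat{B}_1 = (W_1 - 1)\hat{\beta}_1, \quad \hat{B}_2 = (W_2 - 1)\hat{\beta}_2 \tag{27}$$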
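As a small illustration, the estimated bias at a given point can be computed directly from the three bias terms. The sketch assumes the terms are (W1 b01 + W2 b02 - b0), (W1 - 1) b1 and (W2 - 1) b2, as read from the text above. The function name and all numbers are hypothetical.

```python
def pooled_bias(b0, b1, b2, b01, b02, w1, w2, x_i, z_i):
    """Estimated bias of the pooled predictor at the point (x_i, z_i).

    b0, b1, b2: coefficients of the ordinary multiple regression
    b01, b02:   intercepts of the two separate lines
    w1, w2:     pooling weights
    """
    bias_0 = w1 * b01 + w2 * b02 - b0  # intercept term
    bias_1 = (w1 - 1) * b1             # term multiplying X
    bias_2 = (w2 - 1) * b2             # term multiplying Z
    return bias_0 + bias_1 * x_i + bias_2 * z_i


# Hypothetical coefficients, for illustration only.
bias_at_point = pooled_bias(
    b0=1.0, b1=2.0, b2=3.0, b01=1.5, b02=0.5,
    w1=0.5, w2=0.5, x_i=2.0, z_i=1.0,
)
```

The bias vanishes in the degenerate case where one weight equals one, the other variable has a zero coefficient, and the corresponding separate intercept equals the full-model intercept.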
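(3.3) THE RELATIVE EFFICIENCY OF $Y^{**}$
The relative efficiency of the sometimes-pool predictor $Y^{**}$ to the never-pool predictor $\hat{Y}$ is defined in the same fashion as in section (3.1).
$$\text{M.S.E}(\hat{Y}) = \frac{\sigma^2}{n} \sum_{i=1}^{n} \left[ (x_i - \bar{x})' (x'x)^{-1} (x_i - \bar{x}) + \frac{1}{n} \right] \tag{28}$$
where $(x_i - \bar{x})$ is the vector of deviations of the i-th observation from the means of the explanatory variables, $i = 1, 2, \ldots, n$,
and
$$\text{M.S.E}(Y^{**}) = \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{t=1}^{p} W_t^2 \sigma_t^2 \left( (x_{it} - \bar{x}_t)' (x_t' x_t)^{-1} (x_{it} - \bar{x}_t) + \frac{1}{n} \right) + 2 \sum_{t \ne s} W_t W_s \, \text{Cov}(\hat{Y}_{it}, \hat{Y}_{is}) \right] \tag{29}$$
If $\hat{Y}_t$ ($t = 1, 2, \ldots, p$) are independent linear relationships, then
$$\text{M.S.E}(Y^{**}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{p} W_t^2 \sigma_t^2 \left[ (x_{it} - \bar{x}_t)' (x_t' x_t)^{-1} (x_{it} - \bar{x}_t) + \frac{1}{n} \right] \tag{30}$$
and hence
$$\text{R.E}(Y^{**}) = \frac{\sigma^2 \sum_{i=1}^{n} \left[ (x_i - \bar{x})' (x'x)^{-1} (x_i - \bar{x}) + \frac{1}{n} \right]}{\sum_{i=1}^{n} \sum_{t=1}^{p} W_t^2 \sigma_t^2 \left[ (x_{it} - \bar{x}_t)' (x_t' x_t)^{-1} (x_{it} - \bar{x}_t) + \frac{1}{n} \right]} \tag{31}$$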
(3.4) THE BIAS OF ESTIMATOR $Y^{**}$
Since
$$\hat{\beta}_c = W (x'x)^{-1} x' Y = HY \tag{32}$$
then
$$\hat{\beta}_c - \hat{\beta} = (H - (x'x)^{-1} x') Y = AY \tag{33}$$
where
$$H = W (x'x)^{-1} x', \quad A = H - (x'x)^{-1} x' \tag{34}$$
and thus
$$E(\hat{\beta}_c) = E[(A + (x'x)^{-1} x') Y] = AX\beta + \beta \tag{35}$$
$$\text{Bias}(\hat{\beta}_c) = AX\beta \tag{36}$$
and therefore
$$\text{Bias}(Y^{**}) = XAX\beta = XA\,E(Y) \tag{37}$$
4. NUMERICAL EXAMPLES
In this section, two numerical examples are given. The first illustrates the case of two simple linear regressions; the second illustrates the case of more than two multiple regression equations.
Example (1)
This example is based on the data presented by John Hey (1973). Suppose economic theory leads us to expect that the variable Y (household consumption expenditure) depends on the variables X (household income) and Z (household size). Assume that we have two linear relationships as follows:
$$Y_{i1} = \beta_{01} + \beta_{11} X_i + u_i$$
and
$$Y_{i2} = \beta_{02} + \beta_{12} Z_i + \varepsilon_i; \quad i = 1, 2, \ldots, 20$$
Our computations are:
$R_1^2 = 0.9887$; $W_1 = 0.5554$; $\hat{\beta}_{01} = 8.76268$; $\hat{\beta}_{11} = 0.5763$
$R_2^2 = 0.7914$; $W_2 = 0.4446$; $\hat{\beta}_{02} = -24.8362$; $\hat{\beta}_{12} = 22.6580$
$\hat{\beta}_{0c} = -6.1763$; $\hat{\beta}_{1c} = 0.3201$; $\hat{\beta}_{2c} = 10.0733$
and therefore
$$\hat{Y}_{ic} = -6.1763 + 0.3201 X_i + 10.0733 Z_i$$
with $S^2 = 8.5943$; $R^2 = 0.9949$.
Also, the ordinary multiple regression equation is defined by
$\hat{\beta}_0 = 0.5072$; $\hat{\beta}_1 = 0.4857$; $\hat{\beta}_2 = 4.7468$
and
$$\hat{Y}_i = 0.5072 + 0.4857 X_i + 4.7468 Z_i$$
with $S^2 = 1.2525$; $R^2 = 0.9993$ and $\bar{R}^2 = 0.9991$, where $\bar{R}^2$ is the adjusted $R^2$.
Now
$$\chi_1^2 = \sum_{i=1}^{n} \frac{(\hat{Y}_i - \hat{Y}_{ic})^2}{\hat{Y}_{ic}} = 2.4853 \text{ with 9 d.f.}$$
$$\chi_2^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat{Y}_{ic})^2}{\hat{Y}_{ic}} = 2.19978 \text{ with 9 d.f.}$$
Since $\chi^2$ is smaller than $\chi^2_{(0.05,\,9)} = 16.92$, therefore $Y_i^* = \hat{Y}_{ic}$ and R.E $(Y_i^*)$ = 40 %.
Example (2)
This example is based on the data presented by Jan Kmenta (1971), where the variable Y (food consumption per head) is an endogenous variable and the explanatory variables are $x_1$ (ratio of food prices to general consumer prices), $x_2$ (disposable income in constant prices), $x_3$ (ratio of preceding year's prices received by farmers for products to general consumer prices), and $x_4$ (time in years). Assume that we have three linear relationships as follows:
$$Y_{i1} = \beta_{01} + \beta_{11} X_{i1} + u_i$$
$$Y_{i2} = \beta_{02} + \beta_{22} X_{i2} + \varepsilon_i$$
$$Y_{i3} = \beta_{03} + \beta_{33} X_{i3} + \beta_{43} X_{i4} + e_i$$
Our computations are:
$R_1^2 = 0.0096$; $W_1 = 0.0159$; $\hat{\beta}_{01} = 94.6788$; $\hat{\beta}_{11} = 0.062182$
$R_2^2 = 0.5947$; $W_2 = 0.9833$; $\hat{\beta}_{02} = 77.0146$; $\hat{\beta}_{22} = 0.24487$
$R_3^2 = 0.00049$; $W_3 = 0.00081$; $\hat{\beta}_{03} = 98.7258$; $\hat{\beta}_{33} = 0.0049$; $\hat{\beta}_{43} = 0.1383$
[a further value, 0.1057, appears in the source with an illegible label]
and hence,
$$\hat{Y}_{i1} = 94.6788 + 0.0622 X_{i1}$$
$$\hat{Y}_{i2} = 77.0146 + 0.2449 X_{i2}$$
$$\hat{Y}_{i3} = 98.7258 + 0.0049 X_{i3} + 0.1383 X_{i4}$$
Also,
$\hat{\beta}_{0c} = 77.3125$; $\hat{\beta}_{1c} = 0.000987$; $\hat{\beta}_{2c} = 0.240787$; $\hat{\beta}_{3c} = 0.0000039$; $\hat{\beta}_{4c} = 0.0001116$
$R_c^2 = 0.97069$; $\bar{R}_c^2 = 0.96287$; $S^2 = 7.2614$
and hence
$$\hat{Y}_{ic} = 77.3125 + 0.000987 X_{i1} + 0.240787 X_{i2} + 0.0000039 X_{i3} + 0.0001116 X_{i4}$$
Also, the ordinary multiple regression equation is defined as:
$$\hat{Y}_i = 101.2215 - 0.34497 X_{i1} + 0.3632 X_{i2} + 0.00109 X_{i3} - 0.13379 X_{i4}$$
with $S^2 = 3.3354$; $R^2 = 0.8134$; $\bar{R}^2 = 0.1636$.
$$\chi_1^2 = \sum_{i=1}^{n} \frac{(\hat{Y}_i - \hat{Y}_{ic})^2}{\hat{Y}_{ic}} = 0.5853137$$
$$\chi_2^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat{Y}_{ic})^2}{\hat{Y}_{ic}} = 1.082596$$
Since $\chi^2$ is smaller than the tabulated value $\chi^2_\alpha = 4.601$, therefore $Y^{**} = \hat{Y}_{ic}$ and R.E $(Y^{**})$ = 0.95 %.
ACKNOWLEDGEMENT
I am grateful to Dr. M.A. ANWAR for his helpful comments. I am also grateful to Mr. A.W. AWAD for revising the paper linguistically, and I should like to thank Mr. MEDHAT for his help in calculating and typing the paper.
REFERENCES
(1) Bancroft, T.A. (1944) "On biases in estimation due to the use of preliminary tests of significance". Annals of Mathematical Statistics 15, 190-204.
(2) Bancroft, T.A. (1964) "Analysis and inference for incompletely specified models involving the use of preliminary test(s) of significance". Biometrics 20, 427-442.
(3) Han, C.P. and Bancroft, T.A. (1968) "On pooling means when variance is unknown". J.A.S.A. 63, 1333-1342.
(4) Kmenta, J. (1971) "Elements of Econometrics". Macmillan Co., N.Y., p. 565.
(5) Hey, J.D. (1973) "Statistics in Economics". Praeger Pub., p. 316.
(6) Johnson, J.P., Bancroft, T.A. and Han, C.P. (1977) "A pooling methodology for regressions in prediction". Biometrics 33, 57-67.
(7) Larson, H.J. and Barr, D.B. (1972) "A note on pooling regression estimates". American Statistician 26, No. 5, 35.
(8) Mosteller, F. (1948) "On pooling data". J.A.S.A. 43, 231-242.