


Partitioned Covariance Matrices and Partial Correlations

Proposition 1 Let the $(p+q)\times(p+q)$ covariance matrix $C > 0$ be partitioned as
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
Then the symmetric matrix $C^{-1} > 0$ has the following partitioned form:
$$C^{-1} = \begin{pmatrix} C_{1|2}^{-1} & -C_{1|2}^{-1}C_{12}C_{22}^{-1} \\ -C_{22}^{-1}C_{21}C_{1|2}^{-1} & C_{22}^{-1}+C_{22}^{-1}C_{21}C_{1|2}^{-1}C_{12}C_{22}^{-1} \end{pmatrix}
= \begin{pmatrix} C_{1|2}^{-1} & -C_{1|2}^{-1}C_{12}C_{22}^{-1} \\ -C_{22}^{-1}C_{21}C_{1|2}^{-1} & (C_{22}-C_{21}C_{11}^{-1}C_{12})^{-1} \end{pmatrix},$$
where $C_{1|2} = C_{11} - C_{12}C_{22}^{-1}C_{21}$.

Further, $C > 0$ is equivalent to the fact that both $C_{22}$ and $C_{1|2}$ are regular (invertible) matrices.

Proof. Introduce the auxiliary matrix
$$D = \begin{pmatrix} I_p & -C_{12}C_{22}^{-1} \\ O & I_q \end{pmatrix}.$$
Note that $|D| = 1$, so $D$ is regular. With it,
$$DCD^T = \begin{pmatrix} C_{1|2} & O \\ O & C_{22} \end{pmatrix},$$
from which
$$|C| = |C_{1|2}| \cdot |C_{22}|$$
and
$$C^{-1} = D^T \cdot \begin{pmatrix} C_{1|2}^{-1} & O \\ O & C_{22}^{-1} \end{pmatrix} \cdot D,$$
which implies the required result. □
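The proposition is easy to check numerically. The following minimal sketch (with arbitrary illustrative block sizes and a randomly generated positive definite $C$; these choices are not part of the notes) assembles $C^{-1}$ block by block and compares it with a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 3                                    # illustrative block sizes
A = rng.standard_normal((p + q, p + q))
C = A @ A.T + (p + q) * np.eye(p + q)          # a positive definite "covariance" matrix

C11, C12 = C[:p, :p], C[:p, p:]
C21, C22 = C[p:, :p], C[p:, p:]

C1_2 = C11 - C12 @ np.linalg.inv(C22) @ C21    # C_{1|2} = C_11 - C_12 C_22^{-1} C_21
iC1_2, iC22 = np.linalg.inv(C1_2), np.linalg.inv(C22)

# C^{-1} assembled block by block, as stated in Proposition 1
top    = np.hstack([iC1_2, -iC1_2 @ C12 @ iC22])
bottom = np.hstack([-iC22 @ C21 @ iC1_2,
                    iC22 + iC22 @ C21 @ iC1_2 @ C12 @ iC22])
K_blocks = np.vstack([top, bottom])

print(np.allclose(K_blocks, np.linalg.inv(C)))                # True
print(np.isclose(np.linalg.det(C),
                 np.linalg.det(C1_2) * np.linalg.det(C22)))   # |C| = |C_{1|2}| |C_22|
```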

Theorem 1 Let $(X_1^T, X_2^T)^T \sim N_{p+q}(\mu, C)$ be a random vector, where the expectation $\mu$ and the covariance matrix $C$ are partitioned (with block sizes $p$ and $q$) in the following way:
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
Here $C_{11}$, $C_{22}$ are the covariance matrices of $X_1$ and $X_2$, whereas $C_{12} = C_{21}^T$ is the cross-covariance matrix. Then the conditional distribution of the random vector $X_1$ conditioned on $X_2 = x_2$ is the $N_p(C_{12}C_{22}^{-1}(x_2 - \mu_2) + \mu_1,\, C_{1|2})$ distribution.

Proof. Observe that it suffices to prove the statement for the $(X_1^T, X_2^T)^T \sim N_{p+q}(0, C)$ case. The conditional distribution of $X_1$ conditioned on $X_2 = x_2$ is determined by the conditional pdf as follows:
$$f(x_1|x_2) = \frac{(2\pi)^{-\frac{p+q}{2}}\,|C|^{-\frac12}\exp\big[-\tfrac12(x_1^T, x_2^T)\,C^{-1}(x_1^T, x_2^T)^T\big]}{(2\pi)^{-\frac{q}{2}}\,|C_{22}|^{-\frac12}\exp\big[-\tfrac12 x_2^T C_{22}^{-1}x_2\big]}.$$


By Proposition 1, after writing $C^{-1}$ in the block-matrix form and writing the quadratic form in the numerator as the sum of four terms, this simplifies to
$$f(x_1|x_2) = (2\pi)^{-\frac{p}{2}}\,|C_{1|2}|^{-\frac12}\exp\Big[-\tfrac12(x_1 - C_{12}C_{22}^{-1}x_2)^T C_{1|2}^{-1}(x_1 - C_{12}C_{22}^{-1}x_2)\Big],$$

which is a $p$-dimensional Gaussian density with the stated parameters. □

Note that the conditional covariance matrix $C_{1|2}$ does not depend on $x_2$ of the condition. Further, for the conditional expectation, which is the expectation of the conditional distribution, we get that
$$E(X_1|X_2 = x_2) = C_{12}C_{22}^{-1}(x_2 - \mu_2) + \mu_1.$$
Therefore,
$$E(X_1|X_2) = C_{12}C_{22}^{-1}(X_2 - \mu_2) + \mu_1,$$
which is a linear function of the coordinates of $X_2$. In the $p = q = 1$ case it is called the regression line, while in the $p = 1$, $q > 1$ case, the regression plane. We will not deal with the $p > 1$, $q > 1$ case, which is the topic of Canonical Correlation Analysis (CCA), Partial Least Squares regression (PLS), and Structural Equation Modeling (SEM). Summarizing, in the case of the multidimensional Gaussian distribution, the regression functions are linear functions of the variables in the condition, a fact that has important consequences in multivariate statistical analysis.
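In practice, the conditional (regression) parameters of Theorem 1 are obtained directly from the blocks of $\mu$ and $C$. A minimal sketch, with made-up illustrative numbers ($p = 1$, $q = 2$):

```python
import numpy as np

# Illustrative block parameters with p = 1, q = 2 (all numbers are made up)
mu1, mu2 = np.array([1.0]), np.array([0.0, 2.0])
C11 = np.array([[4.0]])
C12 = np.array([[1.0, 0.5]])
C22 = np.array([[2.0, 0.3],
                [0.3, 1.0]])

def conditional_params(x2):
    """Mean and covariance of X_1 | X_2 = x_2 for the jointly Gaussian (X_1, X_2)."""
    W = C12 @ np.linalg.inv(C22)          # regression coefficient matrix C_12 C_22^{-1}
    mean = W @ (x2 - mu2) + mu1           # E(X_1 | X_2 = x_2), linear in x_2
    cov = C11 - W @ C12.T                 # C_{1|2}; note it does not depend on x_2
    return mean, cov

mean, cov = conditional_params(np.array([0.5, 1.5]))
print(mean, cov)
```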

Remark 1 Assume that $E(X) = 0$. Then, with the notation $\varepsilon = X_1 - E(X_1|X_2) = X_1 - C_{12}C_{22}^{-1}X_2$, we can decompose $X_1$ as the sum of $E(X_1|X_2)$ and the error term $\varepsilon$. By construction, $E(\varepsilon) = 0$ and $E(C_{12}C_{22}^{-1}X_2\varepsilon^T) = O$. Therefore, the covariance matrix of $X_1$ is decomposed as
$$C_{11} = C_{12}C_{22}^{-1}C_{21} + C_{1|2},$$
in other words, the law of total variance is applicable:
$$\mathrm{Var}(X_1) = \mathrm{Var}(E(X_1|X_2)) + E(\mathrm{Var}(X_1|X_2)).$$

Theorem 2 Let $X = (X_1, \dots, X_m)^T \sim N_m(0, C)$ be a random vector, and let $\Gamma := \{1, \dots, m\}$ denote the index set of the variables, $m \ge 3$. Assume that $K = (k_{ij}) = C^{-1} > 0$. Then
$$r_{X_iX_j|X_{\Gamma\setminus\{i,j\}}} = \frac{-k_{ij}}{\sqrt{k_{ii}k_{jj}}}, \quad i \neq j,$$
where $r_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}$ denotes the partial correlation coefficient between $X_i$ and $X_j$ after eliminating the effect of the remaining variables $X_{\Gamma\setminus\{i,j\}}$. Further,
$$k_{ii} = \big(\mathrm{Var}(X_i|X_{\Gamma\setminus\{i\}})\big)^{-1}, \quad i = 1, \dots, m,$$
is the reciprocal of the conditional variance of $X_i$ conditioned on the other variables $X_{\Gamma\setminus\{i\}}$.

Proof. Without loss of generality we prove that
$$r_{X_1X_2|X_{\Gamma\setminus\{1,2\}}} = \frac{-k_{12}}{\sqrt{k_{11}k_{22}}}.$$


We use the result of Proposition 1 with $p = 2$, $q = m-2$, and partition $C$ and $K$ as in Figures (a) and (b).

Then, regressing $X_1$ on $X_{\Gamma\setminus\{1,2\}}$, we get that
$$E(X_1|X_{\Gamma\setminus\{1,2\}}) = c_1^T C_{22}^{-1} X_{\Gamma\setminus\{1,2\}}$$
and the error term is
$$\varepsilon_1 = X_1 - c_1^T C_{22}^{-1} X_{\Gamma\setminus\{1,2\}}.$$
Likewise, regressing $X_2$ on $X_{\Gamma\setminus\{1,2\}}$, we get that
$$E(X_2|X_{\Gamma\setminus\{1,2\}}) = c_2^T C_{22}^{-1} X_{\Gamma\setminus\{1,2\}}$$
and the error term is
$$\varepsilon_2 = X_2 - c_2^T C_{22}^{-1} X_{\Gamma\setminus\{1,2\}}.$$
By definition,
$$r_{X_1X_2|X_{\Gamma\setminus\{1,2\}}} = \mathrm{Corr}(\varepsilon_1, \varepsilon_2).$$

To find this correlation, we find the covariance and the variances of $\varepsilon_1$ and $\varepsilon_2$. As $E(\varepsilon_1) = 0$ and $E(\varepsilon_2) = 0$, with the notation of Figures (a) and (b),
$$\begin{aligned}
\mathrm{Cov}(\varepsilon_1, \varepsilon_2) = E(\varepsilon_1\varepsilon_2^T) &= E(X_1X_2^T) - c_1^T C_{22}^{-1} E(X_{\Gamma\setminus\{1,2\}}X_2^T) - E(X_1X_{\Gamma\setminus\{1,2\}}^T)C_{22}^{-1}c_2 \\
&\quad + c_1^T C_{22}^{-1} E(X_{\Gamma\setminus\{1,2\}}X_{\Gamma\setminus\{1,2\}}^T)C_{22}^{-1}c_2 = c_{12} - c_1^T C_{22}^{-1} c_2,
\end{aligned}$$
where we used that $E(X_{\Gamma\setminus\{1,2\}}X_{\Gamma\setminus\{1,2\}}^T) = C_{22}$. Likewise,
$$\mathrm{Var}(\varepsilon_1) = E(\varepsilon_1\varepsilon_1^T) = c_{11} - c_1^T C_{22}^{-1} c_1$$
and
$$\mathrm{Var}(\varepsilon_2) = E(\varepsilon_2\varepsilon_2^T) = c_{22} - c_2^T C_{22}^{-1} c_2.$$

Putting these together, and using that, by Proposition 1, the upper left $2\times 2$ block of $K$ is $K_{11} = C_{1|2}^{-1}$, with the notation of Figure (c) we get that
$$r_{X_1X_2|X_{\Gamma\setminus\{1,2\}}} = \frac{b}{\sqrt{ad}} = -\frac{-\frac{b}{ad-b^2}}{\sqrt{\frac{a}{ad-b^2}\cdot\frac{d}{ad-b^2}}} = -\frac{k_{12}}{\sqrt{k_{11}k_{22}}}.$$
In the last equality we utilized that $K_{11} = \begin{pmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{pmatrix}$ and $k_{12} = k_{21}$.

To prove the statement for $k_{ii}$, it suffices to prove it for $k_{11}$. By Theorem 1, the conditional distribution of $X_1$ conditioned on $X_{\Gamma\setminus\{1\}}$ is the univariate $N_1(\tilde{c}_{12}^T\tilde{C}_{22}^{-1}X_{\Gamma\setminus\{1\}},\, \tilde{C}_{1|2})$ distribution, where
$$\mathrm{Var}(X_1|X_{\Gamma\setminus\{1\}}) = \tilde{C}_{1|2} = c_{11} - \tilde{c}_{12}^T\tilde{C}_{22}^{-1}\tilde{c}_{12} = k_{11}^{-1},$$
if we apply the result of Proposition 1 with $p = 1$, $q = m-1$; see Figure (d). From Proposition 1 it is also clear that $C > 0$ implies $k_{11} \neq 0$. This completes the proof. □
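The statement of Theorem 2 can be checked numerically at the population level. The sketch below (with an arbitrary $4\times 4$ positive definite $C$ chosen for illustration) compares $-k_{ij}/\sqrt{k_{ii}k_{jj}}$ with the correlation of the two regression residuals, and also checks that $k_{ii}$ is the reciprocal conditional variance:

```python
import numpy as np

# An arbitrary 4x4 positive definite covariance matrix (strictly diagonally dominant)
C = np.array([[2.0, 0.8, 0.4, 0.2],
              [0.8, 1.6, 0.6, 0.1],
              [0.4, 0.6, 1.4, 0.3],
              [0.2, 0.1, 0.3, 1.0]])
K = np.linalg.inv(C)
i, j, rest = 0, 1, [2, 3]                       # rest = Gamma \ {i, j}

# Theorem 2: partial correlation from the concentration matrix
r_from_K = -K[i, j] / np.sqrt(K[i, i] * K[j, j])

# Definition: correlation of the residuals of X_i and X_j regressed on the rest
iC22 = np.linalg.inv(C[np.ix_(rest, rest)])
c1 = C[np.ix_([i], rest)]                       # Cov(X_i, X_rest), shape (1, 2)
c2 = C[np.ix_([j], rest)]
cov_e  = C[i, j] - (c1 @ iC22 @ c2.T).item()
var_e1 = C[i, i] - (c1 @ iC22 @ c1.T).item()
var_e2 = C[j, j] - (c2 @ iC22 @ c2.T).item()
print(np.isclose(r_from_K, cov_e / np.sqrt(var_e1 * var_e2)))      # True

# k_ii = 1 / Var(X_i | X_{Gamma \ {i}})
others = [1, 2, 3]
ci = C[np.ix_([i], others)]
var_cond = C[i, i] - (ci @ np.linalg.inv(C[np.ix_(others, others)]) @ ci.T).item()
print(np.isclose(K[i, i], 1.0 / var_cond))                          # True
```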


Definition 1 Assume $X \sim N_m(0, C)$ with $C > 0$. Consider the regression plane
$$E(X_1|X_{\Gamma\setminus\{1\}} = x_{\Gamma\setminus\{1\}}) = \sum_{j\in\Gamma\setminus\{1\}}\beta_{1j|\Gamma\setminus\{1\}}\,x_j,$$
where the $x_j$'s are the coordinates of $x_{\Gamma\setminus\{1\}}$. Then we call the coefficient $\beta_{1j|\Gamma\setminus\{1\}}$ the $j$th partial regression coefficient, for $j \in \Gamma\setminus\{1\}$.

The definition naturally extends to $\beta_{ij|\Gamma\setminus\{i\}}$ for $j \in \Gamma\setminus\{i\}$.

Theorem 3 $\beta_{1j|\Gamma\setminus\{1\}} = -\dfrac{k_{1j}}{k_{11}}$ for $j \in \Gamma\setminus\{1\}$.

To prove the theorem, first we need a lemma.

Lemma 1 $C_{12} = O$ is equivalent to $K_{12} = O$, and
$$K_{11}^{-1}K_{12} = -C_{12}C_{22}^{-1}.$$

Proof of the Lemma. By Proposition 1, an easy calculation shows that
$$K_{11}^{-1}K_{12} = C_{1|2}K_{12} = C_{1|2}(-C_{1|2}^{-1}C_{12}C_{22}^{-1}) = -C_{12}C_{22}^{-1},$$
which finishes the proof. □

Proof of Theorem 3. With the help of Figure (d), the equation of the regression plane is
$$E(X_1|X_{\Gamma\setminus\{1\}} = x_{\Gamma\setminus\{1\}}) = \tilde{c}_1^T\tilde{C}_{22}^{-1}x_{\Gamma\setminus\{1\}} = \sum_{j\in\Gamma\setminus\{1\}}(\tilde{c}_1^T\tilde{C}_{22}^{-1})_j\,x_j = \sum_{j\in\Gamma\setminus\{1\}}(-k_{11}^{-1}K_{12})_j\,x_j = \sum_{j\in\Gamma\setminus\{1\}}-\frac{k_{1j}}{k_{11}}\,x_j,$$
where in the last equality we used the Lemma. This finishes the proof. □

Corollary 1 An important consequence of Theorem 3 is that
$$\beta_{ij|\Gamma\setminus\{i\}} = -\frac{k_{ij}}{k_{ii}} = r_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}\sqrt{\frac{k_{jj}}{k_{ii}}}, \quad \text{for } j \neq i.$$

So only those variables $X_j$ whose partial correlation with $X_i$ (after eliminating the effect of the remaining variables) is not 0 enter into the regression of $X_i$ on the other variables. We sometimes say that the partial regression coefficient $\beta_{ij|\Gamma\setminus\{i\}}$ shows the change of $X_i$ under a unit change of $X_j$, keeping the other variables fixed; in Latin words, 'ceteris paribus'.
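A numerical sanity check of Theorem 3 and Corollary 1 at the population level (a minimal sketch; the $3\times 3$ covariance matrix below is an arbitrary positive definite example, not taken from the notes):

```python
import numpy as np

C = np.array([[2.0, 0.8, 0.4],
              [0.8, 1.6, 0.6],
              [0.4, 0.6, 1.4]])                 # illustrative covariance matrix, C > 0
K = np.linalg.inv(C)

# Regression coefficients of X_1 on (X_2, X_3): the row vector c_1^T C_22^{-1}
beta_reg = C[0, 1:] @ np.linalg.inv(C[1:, 1:])

# Theorem 3: beta_{1j|Gamma\{1}} = -k_{1j} / k_{11}
beta_K = -K[0, 1:] / K[0, 0]
print(np.allclose(beta_reg, beta_K))            # True

# Corollary 1: beta_{12|Gamma\{1}} = r_{X1 X2 | X3} * sqrt(k_22 / k_11)
r_12_3 = -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
print(np.isclose(beta_K[0], r_12_3 * np.sqrt(K[1, 1] / K[0, 0])))   # True
```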

If we form a graph $G$ on the vertex set $\Gamma$ with
$$i \sim j \iff k_{ij} \neq 0, \quad i \neq j,$$
then only the variables corresponding to vertices connected to vertex $i$ enter into the regression of $X_i$ on the other variables. This is called a Gaussian graphical model, and it is the basis of Path Analysis.

If the Gaussian graphical model is decomposable (see Graphical models in the Information Theory and Statistics course), then the maximal cliques, together with their separators (with multiplicities), form a so-called junction tree structure. Denote by $\mathcal{C}$ the set of maximal cliques and by $\mathcal{S}$ the set of their separators in $G$. Then the ML estimator of $K$ can be calculated based on the corresponding $S$ matrices (see the ML estimators of the multivariate normal parameters).

Let $n$ be the sample size for the underlying $m$-variate normal distribution, $\Gamma = \{1, \dots, m\}$, and assume that $m < n$. For the maximal clique $C \in \mathcal{C}$, let $[S_C]_\Gamma$ denote $n$ times the empirical covariance matrix corresponding to the variables $\{X_i : i \in C\}$, complemented with zero entries to have an $m\times m$ (symmetric, positive semidefinite) matrix. Likewise, for the separator $S \in \mathcal{S}$, let $[S_S]_\Gamma$ denote $n$ times the empirical covariance matrix corresponding to the variables $\{X_i : i \in S\}$, complemented with zero entries to have an $m\times m$ (symmetric, positive semidefinite) matrix. Then the ML estimator of the mean vector is the sample average (as usual), while the ML estimator of the concentration matrix is

$$\hat{K} = n\Big\{\sum_{C\in\mathcal{C}}[S_C^{-1}]_\Gamma - \sum_{S\in\mathcal{S}}[S_S^{-1}]_\Gamma\Big\};$$
further,
$$|\hat{K}| = n^m\cdot\frac{\prod_{S\in\mathcal{S}}|S_S|}{\prod_{C\in\mathcal{C}}|S_C|}.$$
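As an illustration (not the course's reference implementation), the following sketch evaluates $\hat{K}$ and the determinant formula for a hypothetical decomposable model on three variables with maximal cliques $\{1,2\}$, $\{2,3\}$ and separator $\{2\}$; the simulated data and the 0-based indexing are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                                  # n times the empirical covariance matrix

cliques = [[0, 1], [1, 2]]                     # maximal cliques of the chain graph 1 - 2 - 3
separators = [[1]]                             # their separator

def padded_inverse(S, idx, m):
    """[S_A^{-1}]_Gamma: invert the submatrix on idx and pad with zeros to m x m."""
    M = np.zeros((m, m))
    M[np.ix_(idx, idx)] = np.linalg.inv(S[np.ix_(idx, idx)])
    return M

K_hat = n * (sum(padded_inverse(S, c, m) for c in cliques)
             - sum(padded_inverse(S, s, m) for s in separators))

det_formula = n**m * (np.prod([np.linalg.det(S[np.ix_(s, s)]) for s in separators])
                      / np.prod([np.linalg.det(S[np.ix_(c, c)]) for c in cliques]))
print(np.isclose(np.linalg.det(K_hat), det_formula))   # True
```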

Testing hypotheses about partial correlations. For $i \neq j$ we want to test
$$H_0: r_{X_iX_j|X_{\Gamma\setminus\{i,j\}}} = 0,$$
i.e., that $X_i$ and $X_j$ are conditionally independent conditioned on the remaining variables. Equivalently, $H_0$ means that there is no edge in $G$ between vertices $i$ and $j$, or equivalently, $\beta_{ij|\Gamma\setminus\{i\}} = 0$, $\beta_{ji|\Gamma\setminus\{j\}} = 0$, or simply, $k_{ij} = k_{ji} = 0$ ($C > 0$ should be assumed).

To test $H_0$ in some form, several exact tests are known that are usually based on likelihood ratio tests or on Basu's theorem (see below). The following test uses the empirical partial correlation coefficient, denoted by $\hat{r}_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}$, and the following statistic based on it:
$$B = 1 - \big(\hat{r}_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}\big)^2 = \frac{|S_{\Gamma\setminus\{i,j\}}|\cdot|S_\Gamma|}{|S_{\Gamma\setminus\{i\}}|\cdot|S_{\Gamma\setminus\{j\}}|}.$$

It can be proven that under $H_0$ the test statistic
$$t = \sqrt{n-m}\cdot\sqrt{\frac{1}{B}-1} = \sqrt{n-m}\cdot\frac{\hat{r}_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}}{\sqrt{1-\big(\hat{r}_{X_iX_j|X_{\Gamma\setminus\{i,j\}}}\big)^2}}$$
is distributed as Student's $t$ with $n-m$ degrees of freedom. Therefore, we reject $H_0$ for large values of $|t|$.
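A sketch of this test, assuming an i.i.d. sample stored in the rows of a data matrix X; the helper partial_corr_test and its internals are illustrative names only, and scipy.stats.t supplies the Student $t$ tail probability:

```python
import numpy as np
from scipy import stats

def partial_corr_test(X, i, j):
    """Test H0: r_{X_i X_j | rest} = 0 from an n x m data matrix X (rows = observations)."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc                              # n times the empirical covariance matrix
    def det_without(drop):
        keep = [k for k in range(m) if k not in drop]
        return np.linalg.det(S[np.ix_(keep, keep)])
    # B = 1 - r_hat^2, written with the determinants as in the text
    B = det_without([i, j]) * np.linalg.det(S) / (det_without([i]) * det_without([j]))
    # recover the sign of the empirical partial correlation from the concentration matrix
    K_hat = np.linalg.inv(S)
    r_hat = np.sign(-K_hat[i, j]) * np.sqrt(max(1.0 - B, 0.0))
    t = np.sqrt(n - m) * r_hat / np.sqrt(B)
    p_value = 2 * stats.t.sf(abs(t), df=n - m)
    return r_hat, t, p_value

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(3),
                            [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]],
                            size=100)
print(partial_corr_test(X, 0, 2))   # here X_0 and X_2 are conditionally independent given X_1
```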

Theorem 4 (Basu) If $T = T(X)$ is a sufficient and complete statistic with respect to the family $\mathcal{P}$ of distributions, and the distribution of $Y = s(X)$ does not depend on $P \in \mathcal{P}$, then $T$ and $Y$ are independent.


Figure 1: Convenient block partitions of $C$ and $K$ (panels (a)–(d), referred to in the proofs).
