samsitutorial

8/8/2019 SAMSItutorial

1/46

A Tutorial on

MultivariateStatistical Analysis

Craig A. TracyUC Davis

SAMSISeptember 2006

1


2/46

ELEMENTARY STATISTICS

Collection of (real-valued) data from a sequence of experiments

X 1 , X 2 , . . . , X n

Might make assumption underlying law is N (, 2) with unknownmean and variance 2 . Want to estimate and 2 from the data.

Sample Mean & Sample Variance :

X = 1n

j

X j , S = 1n 1 jX j X 2

Estimators are unbiased

E (X ) = , E (S ) = 2

2


3/46

Theorem: If X 1 , X 2 , . . . are independent N (, 2) variables thenX and S are independent. We have that X is N (, 2 /n ) and(n 1)S/ 2 is 2(n 1).

Recall 2(d) denotes the chi-squared distribution with d degrees of freedom. Its density is

f 2 (x) =1

2d/ 2 (d/ 2)xd/ 2 1 e x/ 2 , x 0,

where

(z) =

0tz 1 e t dt, (z) > 0.

3


4/46

MULTIVARIATE GENERALIZATIONS

From the classic textbook of Anderson[1]:

Multivariate statistical analysis is concerned with data thatconsists of sets of measurements on a number of individualsor objects. The sample data may be heights and weights of some individuals drawn randomly from a population of school children in a given city, or the statistical treatment

may be made on a collection of measurements, such aslengths and widths of petals and lengths and widths of sepals of iris plants taken from two species, or one maystudy the scores on batteries of mental tests administered

to a number of students.

p = # of sets of measurements on a given individual,

n = # of observations = sample size

4


5/46

Remarks:

In above examples, one can assume that pn since typicallymany measurements will be taken.

Today it is common for p

1, so n/p is no longer necessarilylarge.

Vehicle Sound Signature Recognition: Vehicle noise is astochastic signal. The power spectrum is discretized to a

vector of length p = 1200 with n 1200 samples from thesame kind of vehicle.Astrophysics: Sloan Digital Sky Survey typically has many

observations (say of quasar spectrum) with the spectra of each quasar binned resulting in a large p.

Financial data: S&P 500 stocks observed over monthlyintervals for twenty years.

5


6/46

GAUSSIAN DATA MATRICES

The data are now n independent column vectors of length p

x1 , x2 , . . . , xn

from which we construct the n p data matrix

X =

xT 1 xT 2

...

xT n The Gaussian assumption is that

xj N p(, )

Many applications assume the mean has been already substractedout of the data, i.e. = 0.

6


7/46

Multivariate Gaussian Distribution

If x and y are vectors, the matrix xy is dened by

(xy)jk = x j yk

If = E (x) is the mean of the random vector x, then thecovariance matrix of x is the p p matrix

= E [(x )(x )]] is a symmetric, non-negative denite matrix. If > 0 (positivedenite) and X N p(, ), then the density function of X is

f X (x) = (2 ) p/ 2(det ) 1/ 2 exp

1

2x

, 1(x

, > x

R p

Sample mean:

x =1n

j

xj , E (x) =

7


8/46

Sample covariance matrix:

S =1

n 1n

j =1

( xj x)( xj x)

For = 0 the sample covariance matrix can be written simply as1

n 1X T X

Some Notation: If X is a n p data matrix formed from the nindependent column vectors x j , cov(x j ) = , we can form onecolumn vector vec( X ) of length pn

vec(X ) =

x1

x2...

xn

8


9/46

The covariance of vec( X ) is the np

np matrix

I n

=

0 0 00 0 0...

.

.....

.

..0 0 0

In this case we say the data matrix X constructed from n

independent x j N p(, ) has distribution

N p(M, I n )

where M = E (X ) = 1

, 1 is the column vector of all 1s.

9


10/46

WISHART DISTRIBUTION

Denition: If A = X T X where the n p matrix X isN p(0, I n ), > 0, then A is said to have Wishart distribution with n degrees of freedom and covariance matrix . We will say A

is W p(n, ).Remarks:

The Wishart distribution is the multivariate generalization of the chi-squared distribution.

AW p(n, ) is positive denite with probability one if andonly if n p.

The sample covariance matrix,

S =1

n 1A

is W p(n

1, 1n 1 ).

10


11/46

WISHART DENSITY FUNCTION , n

p

Let S p denote the space of p p positive denite (symmetric)matrices. If A = ( a jk )S p , let

(dA) = volume element of A =j k

da jk

The multivariate gamma function is

p(a) = S p e tr( A )

(det A)a ( p+1) / 2

(dA),(a) > ( p 1)/ 2.Theorem: If A is W p(n, ) with n p, then the density functionof A is

12np p(n/ 2)(det) n/ 2

e12 tr(

1 A ) (det A)(n p 1) / 2

11


12/46

Sketch of Proof:

The density function for X is the multivariate Gaussian(including volume element ( dX ))

(2) np/ 2 (det ) n/ 2 e12 tr(

1 X T X ) (dX )

Recall the QR factorization [8]: Let X denote an n p matrixwith n p with full column rank. Then there exists an uniquen p matrix Q, QT Q = I p, and an unique n p uppertriangular matrix R with positive diagonal elements so thatX = QR . Note A = X T X = RT R.

A Jacobian calculation [1, 13]: If A = X T X , then(dX ) = 2 p (det A)(n p 1) / 2 (dA)(QT dQ)

12


13/46

where

(QT dQ) = p

j =1

n

k = j +1

qT k dqj

and Q = ( q1 , . . . , q p) is the column representation of Q.

Thus the joint distribution of A and Q is(2) np/ 2 (det ) n/ 2 e

12 tr(

1 A ) 2

p

(det A)(n p 1) / 2

(dA)(QT

dQ)

Now integrate over all Q. Use fact that

V n,p

(QT dQ) =2 pnp/ 2

p(n/ 2)and V n,p is the set of real n p matrices Q satisfyingQT Q = I p. (When n = p this is the orthogonal group.)

13


14/46

Remarks regarding the Wishart density function

Case p = 2 obtain by R. A. Fisher in 1915. General p by J. Wishart in 1928 by geometrical arguments.

Proof outlined above came later. (See [1, 13] for completeproof.)

When Q is a p p orthogonal matrix p( p/ 2)2 p p2 / 2 Q

T

dQis normalized Haar measure for the orthogonal group O( p). Wedenote this Haar measure by ( dQ).

Siegel proved (see, e.g. [13]) p(a) = p( p 1) / 4

p

j =1

a 12

( j 1)

14


15/46

EIGENVALUES OF A WISHART MATRIX

Theorem: If A is W p(n, ) with n p the joint density functionfor the eigenvalues 1 , . . . , p of A is

p2 / 2 2 np/ 2 (det ) n/ 2

p( p/ 2) p(n/ 2)

p

j =1

(n p 1) / 2j

j > p)where L = diag( 1 , . . . , p) and ( dQ) is normalized Haar measure.Note that ( ) := j


16/46

Proof: Recall that the Wishart density function (times the volumeelement) is

12np p(n/ 2)(det) n/ 2

e12 tr(

1 A ) (det A)(n p 1) / 2 (dA)

The idea is to diagonalize A by an orthogonal transformation andthen integrate over the orthogonal group thereby giving the densityfunction for the eigenvalues of A.

Let 1 > > p be the ordered eigenvalues of A.A = QLQ T , L = diag( 1 , . . . , p), QO( p)

The j th column of Q is a normalized eigenvector of A. The

transformation is not 11 since Q = [q1 , . . . , q p] works for eachxed A. The transformation is made 11 by requiring that the 1 stelement of each qj is nonnegative. This restricts Q (as A varies) toa 2 p part of

O( p). We compensate for this at the end.

16


17/46

We need an expression for the volume element ( dA) in terms of Qand L. First we compute the differential of A

dA = dQLQ T + QdLQ T + QLdQ T

QT dA Q = QT dQ L + dL + L dQ T Q

= dQ t Q dL + L dQ T Q + dL= L,dQ T Q + dL

(We used QT Q = I implies QT dQ =

dQT Q.)

We now use the following fact (see, e.g., page 58 in [13]): If X = BY BT where X and Y are p p symmetric matrices, B is anonsingular p p matrix, then ( dX ) = (det B ) p+1 (dY ). In our caseQ is orthogonal so the volume element ( dA) equals the volumeelement ( QT dAQ). The volume element is the exterior product of the diagonal elements of QT dA Q times the exterior product of theelements above the diagonal.

17


18/46

Since L is diagonal, the commutator

L,dQ T Q

has zero diagonal elements. Thus the exterior product of thediagonal elements of QT dA Q is j dj .

The exterior product of the elements coming from the commutatoris

j


19/46

One is interested in limit laws as n, p

. For = I p,

Johnstone [11] proved, using RMT methods, for centering andscaling constants

np = n 1 + p2

,

np = n 1 + p 1n 1+ 1 p

1/ 3

that1

np

npconverges in distribution as n, p , n/p < , to theGOE largest eigenvalue distribution [15].

El Karoui [6] has extended the result to . The case pn appears, for example, in microarray data. Soshnikov [14] has lifted Gaussian assumption under the

additional restriction n

p = O( p1/ 3).

19


20/46

For = I p, the difficulty in establishing limit theorms comes

from the integral

O ( p) e 12 tr( 1 Q Q T ) (dQ)Using zonal polynomials innite series expansions have beenderived for this integral, but these expansions are difficult toanalyze. See Muirhead [13].

For complex Gaussian data matrices X similar density formulasare known for the eigenvalues of X X . Limit theorems for = I p are known since the analogous group integral, now overthe unitary group, is known explicitlythe Harish

ChandraItzyksonZuber (HCIZ) integral (see, e.g. [17]). Seethe work of Baik, Ben Arous and Peche [2, 3] and El Karoui [7].

20


21/46

PRINCIPAL COMPONENT ANALYSIS (PCA) ,

H. Hotelling, 1933Population Principal Components: Let x be a p 1 randomvector with E (x) = and cov( x) = > 0. Let 1 2 pdenote the eigenvalues of and H an orthogonal matrixdiagonalizing : H T H = = diag( 1 , . . . , p). We write H incolumn vector form

H = [h1 , . . . , h p]

so that h j is the p 1 eigenvector of corresponding to eigenvalue j . Dene the p 1 vectoru = H T x = ( u1 , . . . , u p)T

then cov(u) = E (H T x H T )(H T x H T )= H T E ((x )(x )) H = H T H =

u j uncorrelated .

21


22/46

Denition: u j is called the j th principal component of x. Notevar( u j ) = j .Statistical interpretations: The claim is that u1 is that linearcombination of components of x that has maximum variance .

Proof: For simplicity of notation, set = 0. Let b denote any p 1vector, bT b = 1, and form bT x.var( bT x) = E bT x bT x = E bT x (bT x)T = bT E (xx T )b = bT b.We want to maximize the right hand side subject to the constraintbT b = 1. By the method of Lagrange multipliers we maximize

bT b(bT b1)Since is symmetric the vector of partial derivatives is

2 b2bThus b must be an eigenvector with eigenvalue .

22


23/46

The largest variance corresponds to choosing the largest eigenvalue.

The general result is that u r has maximum variance of allnormalized combinations uncorrelated with u1 , . . . , u r 1 .

Sample principal components: Let S denote the sample

covariance matrix of the data matrix X and let Q = [q1 , . . . , q p] a p p orthogonal matrix diagonalizing S :

QT S Q = diag( 1 , . . . , p)

The j are the sample variances that are estimates for j . Thevectors qj are sample estimates for the vectors h j .

If x is the random vector and u = H T x is the vector of principal

components, then u = QT x is the vector of sample principal components .

23


24/46

SCREE PLOTS

In applications: How many of the j s are signicant?

24


25/46

CANONICAL CORRELATION ANALYSIS (CCA)

H. Hotelling, 1936Suppose a large data set is naturally decomposed into two groups.For example, p 1 random vectors x1 , . . . , xn make up one set andq

1 random vectors y1 , . . . , ym the other. We are interested in the

correlations between these two data sets. For example, in medicinewe might have n measurements of age, height, and weight ( p = 3)and m measurements of systolic and diastolic blood pressures

(q = 2). We are interested in what combination of the componentsof x is most correlated with a combination of the components of y.

Population Canonical Correlations: Let x and y be tworandom vectors of size p

1 and q

1, respectively. We assume

E (x) = E (y) = 0 and p q. Form the ( p + q) 1 vectorx

y

25


26/46

and its ( p + q)

( p + q) covariance matrix

= 11 12 21 22

.

Let u := T xR , v := T y

R

where and are vectors to be determined. We want to maximizethe correlation

corr( u, v) = cov(u, v)

var( u)var( v)The correlation does not change under scale transformationsu

cu, etc. so we can maximize this correlation subject to the

constraints

E (u2) = E ( T x T x) = T 11 = 1 (1)E (v2) = T 22 = 1 (2)

26


27/46

Under these constraints

corr( u, v ) = E ( T x T y) = T 12 .Let

= T 12

1

2 ( T 11

1)

1

2 ( t 22

1)

where and are Lagrange multipliers. Set the vector of partialderivatives to zero:

= 12

11 = 0 (3)

= T 12 22 = 0 (4)If we left multiply (3) by T and (4) by T , use the normalization

conditions (1) and (2) we conclude = . Thus (3) and (4) become

11 12 21

22

= 0 (5)

27


28/46

with corr( u, v ) =

and 0 is a solution to

det 11 12 21

22

= 0

This is a polynomial in of degree ( p + q). Let 1 denote themaximum root and 1 and 1 corresponding solutions to (5).

Denition: u1 and v1 are called the rst canonical variables andtheir correlation 1 = corr( u1 , v1) is called the rst canonical correlation coefficient .

More generally, the rth pair of canonical variables is the pair of

linear combinations u r = ( ( r ) )T x and vr = ( ( r ) )T y, each of unitvariance and uncorrelated with the rst r 1 pairs of canonicalvariables and having maximum correlation. The correlationcorr( ur , vr ) is the r th canonical correlation coefficient .

28


29/46

REDUCTION TO AN EIGENVALUE PROBLEM

Since

11 12 21 22

=

11 0

0 1I 111 12 122 21 I

1 0

0 22

Thus the determinantal equation becomesdet 2I 21 111 12 122 = 0

The nonzero roots 1 , . . . , k are called the population canonical

correlation coefficients . k = rank( 12 ).In applications is not known. One uses the sample covariancematrix S to obtain sample canonical correlation coefficients .

29


30/46

AN EXAMPLE

The rst application of Hotellings canonical correlations is byF. Waugh [16] in 1942. He begins his paper with

Professor Hotellings paper, Relations between Two Sets

of Variates, should be widely known and his method usedby practical statisticians. Yet, few practical statisticiansseem to know of the paper, and perhaps those few areinclined to regard it as a mathematical curiosity ratherthan an important and useful method for analyzingconcrete problems. This may be due to . . .

The Practical Problem: Relation of wheat characteristics toour characteristics. Guiding principle

The grade of the raw material should give a goodindication of the probable grade of the nished product.

30


31/46

The data: 138 samples of Canadian Hard Red Spring wheat and

the our made from each these samples. a

wheat quality our quality

x1 = kernel texture y1 = wheat per barrels of our

x2 = test weight y2 = ash in our

x3 = damaged kernels y3 = crude protein in our

x4 = crude protein in wheat y4 = gluten quality index

x5 = foreign material

u1 = T 1 x = index of wheat quality , v1 = T 1 y = index of our quality

corr( u1, v

1) = 0 .909

a In this example p > q . The data are normalized to mean 0 and variance 1.

31


32/46


33/46

c p,q,n is a normalization constant. ( ) = Vandermonde determinant = pj


34/46

ZONAL POLYNOMIALS

Zonal polynomials naturally arise in multivariate analysis whenconsidering group integrals such as

O (m )

e tr( X H Y H T ) (dH )

where X and Y are m m symmetric, positive denite matricesand ( dH ) is normalized Haar measure. Zonal polynomials can be

dened either through the representation theory of GL(m,R

) [12]or as eigenfunctions of certain Laplacians [5, 13].

Let S m denote the space of m m symmetric, positive denitematrices. Zonal polynomials, C (X ), X

S m are certain

homogeneous polynomials in the eigenvalues of X that are indexedby partitions .

To give the precise denition we need some preliminary denitions.

34


35/46

Denition: A partition of n is a sequence = ( 1 , 2 , . . . , )

where the j 0 are weakly decreasing and j j = n. We denotethis by n.For example, = (5 , 3, 3, 1) is a partition of 12. The number of

nonzero parts of is called the length of , denoted ().If and are two partitions of n, we say < (lexicographicorder ) if, for some index i, j = j for j < i and j < j . Forexample

(1, 1, 1, 1) < (2, 1, 1) < (2, 2) < (3, 1) < (4)

If n with () = m, we dene the monomial

x

= x 11 x

22 x

mm

and say x is of higher weight than x if > .

35


36/46

Metrics and Laplacians

Let X S m and dX = ( dx jk ) the matrix of differentials of X . Wedene a metric on S m by

(ds)2 = tr X 1dX X 1dX A simple computation shows this metric is invariant under

X LXL T , LGL(m, R ).

Let n = m(m + 1) / 2 = # of independent elements of X . We denoteby vec(X )

R n the column vector representation of X , e.g.

X =x11 x12x12 x22

, x := vec( X ) =

x11

x12x22

The metric ( ds)2 is a quadratic differential in dx.

36


37/46

For example,

(ds)2 = dx11 dx12 dx22 G(x) dx11dx12dx22

whereG(x) = ( gij ) =

1(x11 x22 x212 )2

x222 2x12 x22 x

212

2x12 x22 2x11 x22 + 2 x212 2x11 x12x212 2x11 x12 x211

Labeling x = vec( X ) with a single index the differential is in thestandard form

(ds)2 =i


38/46

and the Laplacian associated to this metric is

X = (det G) 1/ 2n

j =1

x j

(det G)1/ 2n

i =1

gij

x i

whereG 1 = ( gij )

More succinctly, if

X =

x 1

...

x n

,

X = (det G) 1/ 2

X , (det G)

1/ 2G

1

X

where (, ) is the standard inner product on R n .

38


39/46

One can show that X is an invariant differential operator :

LXL T = X , LGL(m,R )

We now diagonalize X

X = HY H T

, Y = diag( y1 , . . . , ym ), H O(m).The Laplacian X is now expressed in terms of a radial part andan angular part . The radial part of X is the differential operator

m

j =1

y2j 2

y2j+

m

j =1

m

k =1k = j

y2jyj yk

yj

+j

yj

yj

We now let X denote only the radial part.

We can now dene zonal polynomials!

39


40/46

Denition: Let X S m with eigenvalues x1 , . . . , xm and let = ( 1 , . . . , m)k into not more than m parts. Then C (X ) isthe symmetric, homogeneous polynomial of degree k in x j such that

1. The term of highest weight in C (X ) is x

2. C (X ) is an eigenfunction of the Laplacian X .

3. (tr( X )) k = ( x1 + + xm )k = k

( ) m

C (X ).

Remarks:

Must show there is an unique polynomial satisfying theserequirements.

Eigenvalue in (2) equals := j j ( j j ) + k(m + 1) / 2. Program MOPS [5] computes zonal polynomials.

40


41/46

The following theorem is at the core of why zonal polynomialsappear in multivariate statistical analysis.

Theorem: If X, Y S p, then

O ( p) C (XHYH T ) (dH ) = C (X )C (Y )C (I p) (6)where (dH ) is normalized Haar measure.Proof: Let f (Y ) denote the left-hand side of (6) and QO( p).f (QY QT ) = f (Y ) (let H HQ in integral and use invariance of the measure). Thus f is a symmetric function of the eigenvalues of Y . Since C is homogeneous of degree ||, so is f . Now apply the

41


42/46

Laplacian to f .

Y f (Y ) = O ( p) Y C (XHYH T ) (dH )=

Y C (X 1/ 2HY H T X 1/ 2) (dH )

= Y C (LY LT ) (dH ) (L = X 1/ 2H )=

LY L T C (LY LT ) (dH ) invariance of Y

= C (LY LT ) (dH )= f (Y )

By denition of C (Y ) we have f (Y ) = d C (Y ). Sincef (I p) = C (X ), we nd d and the theorem follows.

42


43/46

Using Zonal Polynomials to Evaluate Group Integrals

O ( p) e tr( X H Y H T ) (dH )

=

k =0

k

k! O ( p)

tr( XHYH t ) k (dH )

=

k =0

k

k! k

( ) m O ( p) C (XHYH T ) (dH )

=

k =0

k

k! k

( ) m

C (X )C (Y )C (I p)

43


44/46

References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis , third edition, John Wiley & Sons, Inc., 2003.

[2] J. Baik, G. Ben Arous, and S. Peche, Phase transition of the largest

eigenvalue for nonnull complex sample covariance matrices, Ann.Probab. 33 (2005), 16431697.

[3] J. Baik, Painleve formulas of the limiting distributions for nonnullcomplex sample covariance matrices, Duke Math. J. 133 (2006),205235.

[4] M. Dieng and C. A. Tracy, Application of random matrix theory tomultivariate statistics, arXiv: math.PR/0603543.

[5] I. Dumitriu, A. Edelman and G. Shuman, MOPS: MultivariateOrthogonal Polynomials (symbolically), arXiv:math-ph/0409066.

[6] N. El Karoui, On the largest eigenvalue of Wishart matrices withidentity covariance when n , p and p/n tend to innity, arXiv:

44


45/46

math.ST/0309355.

[7] N. El Karoui, Tracy-Widom limit for the largest eigenvalue of alarge class of complex Wishart matrices, arXiv: math.PR/0503109.

[8] G. H. Golub and C. F. Van Loan, Matrix Computations , secondedition, Johns Hopkins University Press, 1989.

[9] H. Hotelling, Analysis of a complex of statistical variables intoprincipal components, J. of Educational Psychology 24 (1933),417441, 498520.

[10] H. Hotelling, Relations between two sets of variates, Biometrika 28(1936), 321377.

[11] I. M. Johnstone, On the distribution of the largest eigenvalue inprincipal component analysis, Ann. Stat. 29 (2001), 295327.

[12] I. G. Macdonald, Symmetric Functions and Hall Polynomials , 2nded., Oxford University Press, 1995.

[13] R. J. Muirhead, Aspects of Multivariate Statistical Theory , John

45


46/46

Wiley & Sons Inc., 1982.

[14] A. Soshnikov, A note on universality of the distribution of thelargest eigenvalue in certain sample covariance matrices, J.Statistical Physics 108 (2002), 10331056.

[15] C. A. Tracy and H. Widom, On orthogonal and symplectic matrixensembles, Commun. Math. Phys. 177 (1996), 727754.

[16] F. V. Waugh, Regression between sets of variables, Econometrica 10 (1942), 290310.

[17] P. Zinn-Justin and J.-B. Zuber, On some integrals over the U (N )unitary group and their large N limit, J. Phys. A: Math. Gen. 36

(2003), 31733193.

46

samsitutorial

Documents