
Post on 04-Jan-2016


Correspondence Analysis

Ahmed Rebai

Center of Biotechnology of Sfax

Correspondence analysis

Introduced by Benzecri (1973) for uncovering and understanding the structure and pattern of data in contingency tables.

Involves finding coordinate values that represent the row and column categories in some optimal way.

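Contingency tables

Table with r rows (categories of X2) and c columns (categories of X1)

X2 \ X1   1     ...   j     ...   c     Total
1         N11   ...   N1j   ...   N1c   N1.
2         N21   ...   ...   ...   ...   ...
...       ...   ...   ...   ...   ...   ...
i         ...   ...   Nij   ...   ...   ...
...       ...   ...   ...   ...   ...   ...
r         Nr1   ...   ...   ...   Nrc   Nr.
Total     N.1   ...   N.j   ...   N.c   N..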

Main idea

Develop simple indices that show the relation between rows and columns.

Indices that tell us simultaneously which columns have more weight in a row category, and vice versa.

Reduce dimensionality, as in PCA.

Indices are extracted in decreasing order of importance.

Which criteria?

In a contingency table, global independence between the two variables is generally measured by a chi-square statistic ($\chi^2$), calculated as:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$$

where the $E_{ij}$ are the expected counts under independence:

$$E_{ij} = \frac{N_{i.}\, N_{.j}}{N_{..}}$$
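The two formulas above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the 2 x 2 table and the function names `expected_counts` and `chi_square` are hypothetical and chosen only for illustration.

```python
import numpy as np

def expected_counts(N):
    """Expected counts under independence: E_ij = N_i. * N_.j / N.."""
    N = np.asarray(N, dtype=float)
    row_totals = N.sum(axis=1)   # N_i.
    col_totals = N.sum(axis=0)   # N_.j
    return np.outer(row_totals, col_totals) / N.sum()

def chi_square(N):
    """Chi-square statistic: sum over all cells of (N_ij - E_ij)^2 / E_ij."""
    N = np.asarray(N, dtype=float)
    E = expected_counts(N)
    return ((N - E) ** 2 / E).sum()

# Hypothetical 2 x 2 table of counts (illustration only).
table = np.array([[20, 10],
                  [10, 20]])

E = expected_counts(table)
chi2 = chi_square(table)
print(E)
print(chi2)
```

In this toy table every margin equals 30 out of 60, so every expected count is 15 and each cell contributes 25/15 to the statistic.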

Decomposition of $\chi^2$

We have a departure from independence and we want to know why.

To find the factors we use the matrix C of dimension (r x c) with elements

$$c_{ij} = \frac{N_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
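As a sketch of how the matrix C might be built, assuming NumPy and a hypothetical 3 x 3 table of counts (the function name `standardized_residuals` is also an assumption, not taken from the slides):

```python
import numpy as np

def standardized_residuals(N):
    """Matrix C with elements c_ij = (N_ij - E_ij) / sqrt(E_ij)."""
    N = np.asarray(N, dtype=float)
    # Expected counts under independence.
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    return (N - E) / np.sqrt(E)

# Hypothetical 3 x 3 table of counts (illustration only).
table = np.array([[20, 10, 5],
                  [10, 20, 10],
                  [5, 10, 30]])

C = standardized_residuals(table)

# The squared elements of C add up to the chi-square statistic.
chi2_from_C = (C ** 2).sum()
print(C)
print(chi2_from_C)
```

Each element of C is the signed square root of one cell's contribution to the chi-square, so large positive or negative entries point to the cells responsible for the departure from independence.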

How to find factors?

Singular value decomposition (SVD) of matrix C, that is, find matrices U, D and V such that

$$C = U D V^T$$

U: eigenvectors of $CC^T$

V: eigenvectors of $C^TC$

D: a diagonal matrix of $\sqrt{\lambda_k}$, where the $\lambda_k$ are the eigenvalues of $CC^T$

k = 1, ..., K with K = Rank(C) ≤ Min(r-1, c-1)

$$\mathrm{Tr}(CC^T) = \sum_k \lambda_k = \chi^2 = \sum_{i,j} c_{ij}^2$$

The projections of the rows and the columns are given by the eigenvectors $U_k$ and $V_k$, which are linked by

$$C V_k = \sqrt{\lambda_k}\, U_k$$

$$C^T U_k = \sqrt{\lambda_k}\, V_k$$
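The decomposition can be sketched with `numpy.linalg.svd`. The table below is hypothetical, and using the first two columns of U and V directly as coordinates follows the slides as written; CA software may rescale these vectors, which is not covered here.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (illustration only).
table = np.array([[20, 10, 5],
                  [10, 20, 10],
                  [5, 10, 30]], dtype=float)

# Expected counts under independence and matrix C.
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
C = (table - E) / np.sqrt(E)

# Thin SVD: C = U @ diag(d) @ Vt, singular values in decreasing order.
U, d, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T
lam = d ** 2  # eigenvalues of C C^T (and of C^T C)

chi2 = ((table - E) ** 2 / E).sum()

# First two factors for rows and columns, as described in the slides.
row_coords = U[:, :2]
col_coords = V[:, :2]

print(lam)                  # last eigenvalue is numerically zero
print(lam.sum(), chi2)      # the eigenvalues add up to the chi-square
```

For a 3 x 3 table the rank of C is at most 2, which is why the last eigenvalue vanishes up to rounding error.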

How many factors?

The adequacy of the representation by the first two coordinates is measured by the % of explained inertia

$$(\lambda_1 + \lambda_2) / \sum_k \lambda_k$$

In general, rows are displayed on $(U_1, U_2)$ and columns on $(V_1, V_2)$.

The proximity between row points and column points is to be interpreted.
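A minimal sketch of the explained-inertia ratio, assuming NumPy; the function name `explained_inertia` and the 4 x 3 table are hypothetical. The function assumes the table is not exactly independent, since the total inertia would then be zero.

```python
import numpy as np

def explained_inertia(N, n_axes=2):
    """Share of total inertia carried by the first n_axes factors."""
    N = np.asarray(N, dtype=float)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    C = (N - E) / np.sqrt(E)
    # Singular values only, returned in decreasing order.
    d = np.linalg.svd(C, compute_uv=False)
    lam = d ** 2
    return lam[:n_axes].sum() / lam.sum()

# Hypothetical 4 x 3 table of counts (illustration only).
table = np.array([[20, 10, 5],
                  [10, 20, 10],
                  [5, 10, 30],
                  [12, 8, 9]])

share_one = explained_inertia(table, n_axes=1)
share_two = explained_inertia(table, n_axes=2)
print(share_one, share_two)
```

With only three columns there are at most two non-zero eigenvalues, so two axes carry all the inertia here; larger tables generally leave part of it unexplained.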

CA in practice

Proximity of two rows (columns) indicates a similar profile, that is, a similar conditional frequency distribution: the two rows (columns) are proportional.

The origin is the average of each factor, so a point (row or column) close to the origin indicates an average profile.

Proximity of a row to a column indicates that this row has a particularly important weight in this column (if far from the origin).
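To make the notion of profile concrete, here is a small sketch assuming NumPy; the table and the function name `row_profiles` are hypothetical. The first two rows are proportional on purpose, so they share the same profile.

```python
import numpy as np

def row_profiles(N):
    """Conditional frequency distribution of the columns within each row."""
    N = np.asarray(N, dtype=float)
    return N / N.sum(axis=1, keepdims=True)

# Hypothetical table: row 2 is exactly twice row 1 (illustration only).
table = np.array([[10, 20, 30],
                  [20, 40, 60],
                  [30, 10, 5]])

profiles = row_profiles(table)

# Average profile: the column margins divided by the grand total.
average_profile = table.sum(axis=0) / table.sum()

print(profiles)
print(average_profile)
```

Column profiles are obtained the same way by applying the function to the transposed table.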

A first example: French Bac

[Figure slides, images not reproduced. Titles and labels: Eigenvalues; With Corsica; Without Corsica; Classical bac; Technical bac; Coefficients for regions; Coefficients for Bac type]

Properties of CA

Allows consideration of dummy variables (called ‘illustrative variables’) as additional variables which do not contribute to the construction of the factorial space, but can be displayed on it.

With such a representation it is possible to determine the proximity between observations and variables, and between illustrative variables and observations.

Tekaia and Yeramian (2006)

208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes)

Variables: amino-acid composition of proteomes

Illustrative variables: groups of amino acids (charged, polar, hydrophobic)

Why CA?

To analyze the distribution of species in terms of global properties and discriminated groups

Search for amino-acid signatures in groups of species

Try to understand potential evolutionary trends

Results

First axis (63%) corresponds to GC content (from Mycoplasma (23%) to Streptomyces (72%))

Second axis (14%) corresponds to optimal growth temperature
