principal component analysis. 20 food products 16 european countries country gr_coffee inst_coffee...

52
Principal Component Analysis

Upload: shon-watson

Post on 18-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Principal Component Analysis

Page 2: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

• 20 food products16 European Countries

Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Germany 90 49 88 19 57 51 19 21 27ItalyFrance

PCA Example: FOODS

Page 3: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA Example: FOODS

Page 4: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA Example: FOODS

Page 5: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA Example: Red Sox Dataset: 110 Years of Redsox Performance Data Question: Pitchers and Batters Ages Matter for

Performance

Page 6: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 7: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 8: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Redundancy

• Arbitrary observations by r1 and r2

• Low to high redundancies from (a) to (c)• (c) can be represented by a single variable

• Spread across the best-fit line – covariance between two variables

Page 9: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Transform

• Linear Transformation• (x,y) in Cartesian coordinate• The same point becomes in (a,b) in another

coordinate system• Assuming linear transformation

• a = f(x, y) = x*c11 + y*c12

• b = g(x,y) = x*c21 + y*c22

=

For review of matrix, www.cs.uml.edu/~kim/580/review_matrix.pdf

Page 10: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Eigenvector

=

= = 4*

• Eigenvector – projection to the same coordinate

• Eigenvectors of a square matrix are orthogonal

• Unique eigenvalues are associated with eigenvectors

Page 11: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Transform

=

• http://www.ams.org/samplings/feature-column/fcarc-svd

Page 12: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Transform

(Symmetric)

• http://www.ams.org/samplings/feature-column/fcarc-svd

Page 13: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 14: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Eigenvectors

• Mv = λiv

λi is scalarv is orthogonal vectors

• Non-symmetric

Page 15: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 16: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

• For unit vectors u1 and u2

• A general vector x has coefficients projected by unit vectors

Page 17: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

• Vector product is of the form:

• =>

• SVD (Singular Vector Decomposition)

Page 18: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

M: mxn; U: mxm; Σ: mxn; V: nxn

• U is eigenvector of MMT

• V is eigenvector of MTM

Page 19: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Transform = New Coordinate• What should be good for transform matrix for

PCA ?

• Covariance

Page 20: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Mean, Variance, Covariance• X = (x1, x2, ….. xn) Y = (y1, y2, …..yn)

• E[X] = ∑i xi /n E[Y] = ∑I yi /n

• Variance = (st. dev.)2:• V[X] = ∑I (xi – E[X])2 / (n-1)

• Covariance -- • cov[X,Y] = ∑I (xi – E[X]) (yi – E[Y]) / (n-1)

Page 21: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Covariance Matrix

• Three variables X,Y,Z

cov[x,x] cov[x,y] cov[x,z]cov[y,x] cov[y,y] cov[y,z]cov[z,x] cov[z,y] cov[z,z]

• cov[X,X] = V[X]• cov[X,Y] = cov{Y,X]

Page 22: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA

Page 23: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 24: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 25: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

X Y

2.5 2.4

0.5 0.7

2.2 2.9

1.9 2.2

3.1 3.0

2.3 2.7

2.0 1.6

1.0 1.1

1.5 1.6

1.1 0.9

167 749

13l92 62.42

Numerical Example

Page 26: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

X (adj) Y (adj)

0.69 0.49

-1.31 -1.21

0.39 0.99

0.09 0.29

1.29 1.09

0.49 0.79

0.19 -0.31

-0.81 -0.81

-0.31 -0.31

-0.71 -1.01

• After adjustments

Page 27: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

• Covariance matrix

cov = (.6165 .6154).6154 .7166

• Eigenvalues

|.6165-λ .6154 ||.6154 .7166-λ| =0

1.2840, 0.0491• Eigenvectors

Page 28: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 29: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Example: Amino Acid (AA) - Basic

Page 30: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Clustering of AAs How many clusters ?

Use 4 AA groupsGood for acidic and basicP in polar groupNonpolar group is wide spread

Similarities of AA’s determine the ease of substitutions

Some alignment tools show similar AA’s in colors Needs a more systematic approach

Page 31: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Physico-Chemical Properties Physico-chemical properties of AA

determine protein structures(1) Size in volume(2) Partial Vol.

Measure expanded volume in solution when dissolved

(3) Bulkiness The ratio of side chain volume to its length: average

cross-sectional area of the side chain(4) pH of isoelectric point of AA (pI)(5) Hydrophobicity(6) Polarity index(7) Surface area(8) Fraction of area

Fraction of the accessible surface area that is buried in the interior in a set of known crystal structures

Page 32: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Vol. Bulk Pol. pI Hydro

Surf2

Frac

Alanine Ala A 67 11.5 0.0 6.0 1.8 113 0.74

Arginine Arg R 148 14.3 52.0 10.8 -4.5 241 0.64

Asparagine Asn N 96 12.3 3.4 5.4 -3.5 158 0.63

Aspartic Asp D 91 11.7 49.7 2.8 -3.5 151 0.62

Cysteine Cys C 86 13.5 1.5 5.1 2.5 140 0.91

Glutamine Gln Q 114 14.5 3.5 5.7 -3.5 189 0.62

Glu. Acid Glu E 109 13.6 49.9 3.2 -3.5 183 0.62

Glycine Gly G 48 3.4 0.0 6.0 -0.4 85 0.72

Histidine His H 118 13.7 51.6 7.6 -3.2 194 0.78

Isoleucine Ile I 124 21.4 0.1 6.0 4.5 182 0.88

Leucine Leu L 124 21.4 0.1 6.0 3.8 180 0.85

Lysine Lys K 135 13.7 49.5 9.7 -3.9 211 0.52

Methionine Met M 124 16.3 1.4 5.7 1.9 204 0.85

Phenyl. Phe F 135 10.8 0.4 5.5 2.9 218 0.88

Proline Prot P 90 17.4 1.6 6.3 -1.6 143 0.64

Serine Ser S 73 9.5 1.7 5.7 -0.8 122 0.66

Threonine Thr T 93 15.8 1.7 5.7 -0.7 146 0.70

Tryptophan Trp W 163 21.7 2.1 5.9 -0.9 259 0.85

Tyrosine Thr Y 141 18.0 1.6 5.7 -1.3 229 0.76

Valine Val V 105 21.6 0.1 6.0 4.2 160 0.86

Mean 109 15.4 13.6 6.0 -0.5 175 0.74

Red: acidicOrange: basicGreen: polar

(hydrophillic)

Yellow: non-polar

(hydrophobic

)

Page 33: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA of AAs How to incorporate different properties

In order to group similar AA’sVisual clustering with Volume and pI

Page 34: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA Given NxP matrix (e.g., 20x7),

Each row represents a p-dimensional data pointEach data point is

Scaled and shifted to the origin Rotated to spread out points as much as

possible

Scaling For property j, compute the average and the s.d.

μj = ∑i xij /N, σj2 = ∑i (xij - μj)2 /N

Since each property has a different scales and means, define normalized variables,

zij = (xij - μj) /σj

zij measures the deviation from the mean for each property with the mean of 0 and s.d. of 1

Page 35: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA New orthogonal coordinate system

Find vj = (vj1, vj2 ,…, vjP ) such that

∑k vik vjk = 0 for i ≠ j (orthogonal) and ∑k vjk2 = 0 (unit length)

vj represents new coordinate vectorData points in z-coordinate becomes

yij = ∑k zjk vik

New y coordinate systems is a rotation of the z coordinate system

vjk turns out to be related to the correlation coefficient

Page 36: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

PCA Correlation coefficient, Cij

Cij = ∑k(zik - mi)(zjk - mj) /Psi sj (mi, si mean and s.d. of the i-th row)

-1 ≤ Cij ≤ 1

Results in NxN simiarlity matrix, Sij

Vol Bulk Polar pI Hyd SA FrA

Vol 1.00 9.73 0.24 0.37 -0.08 0.99 0.18

Bulk 1.00 -0.20 0.08 0.44 0.64 0.49

Polar 1.00 0.27 -0.69 0.29 -0.53

pI 1.00 -0.20 0.36 -0.18

Hyd 1.00 -0.18 0.84

SA 1.00 0.12

FrA 1.00

Page 37: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 38: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Clustering Family of related sequences evolved from a common

ancestor is studied with phylogenetic trees showing the order of evolution

Criteria neededCloseness between sequencesThe number of clusters

Hierarchical Clustering Algorithm – connectivity-based K-mean -- centroid

Page 39: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Hierarchical Clustering Hierarchical Clustering Algorithm

Each point forms its own cluster, initiallyJoin two clusters with the highest similarity to form a single larger clusterRecompute similarities between all clusterRepeat two steps above until all points are connected to clusters

Criteria of similarities ?Use scaled coordinates z

Vector zi from origin to each data point i with length |zi|2 = ∑k zik

2

Use cosine angle between two points for similarity

cos θij = ∑k zikzjk / |zi||zj|

N elements, nxn distrance matrix d

Page 40: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Hierarchical_Clustering (d, n) Form n clusters, each with 1 element Construct a graph T by assigning an isolated vertex to each cluster while there is more than 1 cluster Find the two closest clusters C1 and C2

Merge C1 and C2 into new cluster C with | C1 | + | C2| elements Compute distance from C to all other clusters Add a new vertex C to T Remove rows and columns of d for C1 and C2, and add for C return T

Page 41: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 42: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 43: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

k-mean Clustering The number of clusters, k, is known ahead of the time Minimize the squared errors between data points and k

cluster centers

No known polynomial algorithmHeuristics – Lloyd algorithm

initially partition n points arbitrarily to k centers, then move some points between clusters

Converge to a local minimum, may move many points in each iteration

k-means Clustering Problem Given n data points, find k center points minimizing the squared error distortion,

d(V, X) = ∑id(vi,X)2/n

input: A set V of n data points and a parameter k output: A set X consisting of k center points minimizing d(V,X)

over all possible choices of X

Page 44: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

k-mean Clustering Assume every possible partition of n elements to k

clusters And each partition has cost(P)

Move one point in each iteration

Progressive_Greedy_k-means(n) Select an arbitray partition P into k clusters while forever bestChange = 0 for every cluster C for every element i not in C if moving i to C reduces Cost(P) if Δ(i → C) > bestChange bestChange ← Δ(i → C) i* = i C* = C if bestChange >0 change partition P by moving i* to C* else return P

Page 45: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish
Page 46: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Dynamic Modeling in Chameleon Similarity between clusters is determined by

Relative interconnectivity (RI)Relative closeness (RC)

Select pairs with high RI and RC to merge

Page 47: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

Hierarchical Clustering - Cluto Generates a set of clusters within

clusters Algorithm can be arranged as a tree

Each node becomes where two smaller clusters join

CLUTO package with cosine and group-average rulesRed/green indicates values significantly higher/lower than the averageDark colors close to the average

Page 48: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

1. red on both pI and polarity scale

2. green on hydrophobicity and pI (can be separated into two smaller clusters)

3. green on volume and surface area

4. C is unusual in protein structure due to its potential to form disulfide bonds between pairs of cysteine residues (thus, difficult to interchange for other residues)

5. Hydrophobic6. Two largest AA’s

Clustering of properties: properties can be ordered illustrating groups of properties that are correlated

Page 49: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

6 clusters

cluster

Property AA

1 Basic K, R, H

2 Acid and amide

E, D, Q, N

3 Small P, T, S, G, A

4 Cysteine C

5 Hydrophobic V, L, I, M, F

6 Large, aromatic

W, Y

Page 50: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

In PAM matrix, considered probabilities of pairs of amino acids appearing together

Pairs of amino acids that tend to appear together are grouped into a cluster

six clusters (KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

Contrast to clusters via hierarchical clustering(KRH) (EDQN) (PTSGA) (C) (VLIMF) (WY)

Dayhoff Clustering - 1978

Page 51: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

(KRH) (EDQN) (PTSGA) (C) (VLIM) (FWY)

Page 52: Principal Component Analysis. 20 food products 16 European Countries Country Gr_coffee Inst_coffee Tea Sweetener Biscuit Pe_Soup Ti_soup In_Portat Fro_Fish

To study protein folding Used BLOSUM50 similarity matrix

Determine correlation coefficients between similarity matrix elements for all pairs of AA’s

e.g., CAV = (∑i MA,i MV,i )/[(∑i MA,i MA,i q)*(∑i MV,i MV,i )] with summation over i is taken for 20 AA’s

Group two AA’s with highest CC’s, and either add the next AA with the highest CC to a group or a new group

Murphy, Wallqvist, Levy, 2000