Statistical analysis of array data: Dimensionality reduction, Clustering

Katja Astikainen, Riikka Kaven, 25.2.2005


TRANSCRIPT

Page 1: Statistical analysis of array data: Dimensionality reduction, Clustering

Katja Astikainen, Riikka Kaven, 25.2.2005

Page 2: Contents

• Problems and approaches
• Dimensionality reduction by PCA
• Clustering overview
• Hierarchical clustering
• K-means
• Mixture models and EM

Page 3: Problems and approaches

• The basic idea is to find patterns of expression across multiple genes and experiments
– Models of expression are utilized, e.g., in classifying diseases more precisely (disease classification, degree of the disease)
– Expression patterns can be utilized to explore cellular pathways
– By clustering gene expression profiles, and also conditions (experiments), one can find genes that are co-regulated
– Clustering methods can also be used for sequence alignments
• There are several methods for this, but we are going to introduce:
– Principal Component Analysis (PCA)
– Clustering (hierarchical, K-means, EM)

Page 4: Dimensionality reduction by PCA

PCA is a statistical data analysis technique:
– a method to reduce dimensionality
– a method to identify new, meaningful underlying variables
– a method to compress the data
– a method to visualize the data

Page 5: Dimensionality reduction by PCA

• We have N data points $x_1, \dots, x_N$ in an M-dimensional space, where the values $x_i$ are gene expression vectors.
• With PCA we can reduce the dimension to K, which is usually much lower than M.
• Imagine taking a three-dimensional cloud of data points and rotating it so you can view it from different perspectives. You might imagine that certain views would allow you to separate the data into groups better than others.
• With PCA we can ignore some of the redundant experiments (low variance), or use an average of the information, with little loss of information.

Page 6: Dimensionality reduction by PCA

• We are looking for the unit vector $u_1$ such that, on average, the squared length of the projection of the $x$s along $u_1$ is maximal (vectors are column vectors):

$$u_1 = \arg\max_{\|u\|=1} E\left[(u^T x)^2\right]$$

• Generally, if the first components $u_1, \dots, u_{k-1}$ have been determined, the next component is the one that maximizes the residual variance:

$$u_k = \arg\max_{\|u\|=1} E\left[\Big(u^T\big(x - \sum_{i=1}^{k-1} (u_i^T x)\, u_i\big)\Big)^2\right]$$

• The principal components of the expression vectors are given by $c_i = u_i^T x$.

Page 7: Dimensionality reduction by PCA

• How can we find the eigenvectors $u_i$?
– Find the eigenvectors which capture the most informative part of the data: vectors that point in the directions of maximal variance of the data.
• First we calculate the covariance matrix:

$$C = E\left[x x^T\right]$$

• Then find the eigenvalues $\lambda_k$ and eigenvectors $u_k$ of the covariance matrix:

$$C u_k = \lambda_k u_k$$

• An eigenvalue is a measure of the proportion of the variance explained by the corresponding eigenvector.
• Select the $u_i$ which are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues:
– the eigenvectors which explain most of the variance in the data
– they reveal the important features and patterns in the data
– for data visualization, use two- or three-dimensional spaces
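A minimal NumPy sketch of the procedure above, assuming the expression data sits in an array X of shape (N, M); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def pca_reduce(X, K):
    """Project the N x M data matrix X onto its K leading principal components."""
    # Center the data so the covariance is computed around the mean
    Xc = X - X.mean(axis=0)
    # Sample covariance matrix C = E[x x^T] (M x M)
    C = np.cov(Xc, rowvar=False)
    # Eigendecomposition; eigh is appropriate for the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort by decreasing eigenvalue and keep the K largest
    order = np.argsort(eigvals)[::-1][:K]
    U = eigvecs[:, order]          # M x K matrix of principal directions u_k
    # Principal components c_i = u_i^T x for every data point
    return Xc @ U                  # N x K reduced representation

# Example: 100 expression vectors in 20 dimensions, reduced to 2 for plotting
X = np.random.rand(100, 20)
Z = pca_reduce(X, K=2)
print(Z.shape)  # (100, 2)
```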

Page 8: Clustering overview

• Data analysis methods for discovering patterns and underlying cluster structures
• There are different kinds of methods, such as hierarchical clustering, partitioning-based k-means, and the Self-Organizing Map (SOM)
• There is no single method that is best for every data set
• Clustering methods are unsupervised methods (like k-means):
– there is no information about the true clusters or their number
– clustering algorithms are used for analysing the data
– discovered clusters are just estimates of the truth (often the result is a local optimum)

Page 9: Clustering overview

• Data types
– Typically the clustered data is numerical vector data, like gene expression data (expression vectors)
– Numerical data can also be represented in relative coordinates
– Data might also be qualitative (nominal), which makes comparing the data elements more challenging
• The number of clusters is often unknown
• One way to estimate the number of clusters is to analyse the data by PCA:
– you might use the eigenvectors to estimate the number of clusters
• Another way is to make guesses and justify the number of clusters by good results (whatever those are)

Page 10: Clustering overview

• Similarity measures
– Pearson correlation (dot product of the normalized vectors):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

• Distance measures
– Euclidean (the natural distance between two vectors):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

• It is important to use an appropriate distance/similarity measure:
– in Euclidean space two vectors, e.g. (1,0,0,0,0,0,0,0,0,0) and (0,0,0,0,0,0,0,0,0,1), might be close to each other even though their correlation could be 0
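A small sketch illustrating that last point, using the two vectors from the slide:

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

# Euclidean distance: small, the vectors look "close"
d = np.sqrt(np.sum((x - y) ** 2))

# Pearson correlation: near zero, the vectors are essentially uncorrelated
r = np.corrcoef(x, y)[0, 1]

print(d)  # 1.414...
print(r)  # -0.111..., close to 0
```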

Page 11: Clustering overview

Cost function and probabilistic interpretation:

• For comparing different ways of clustering the same data, we need some kind of cost function for the clustering algorithm
• The goal of clustering is to try to minimize such a cost function
• Generally the cost function depends on quantities such as:
– the centers of the clusters
– the distance of each point in a cluster to the cluster center
– the average degree of similarity of the points in a cluster
• Cost functions are algorithm-specific, so comparing the results of different clustering algorithms might be almost impossible

Page 12: Clustering overview

Cost function and probabilistic interpretation:

• There are some advantages associated with probabilistic models, so they are often utilized in cost functions
• A popular choice is to use the negative log-likelihood of an underlying probabilistic model as the clustering cost function
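As a concrete (standard, not slide-specific) form of this: for independent data points $x_1, \dots, x_N$ and model parameters $\theta$, the cost is

$$\mathcal{E}(\theta) = -\log P(x_1, \dots, x_N \mid \theta) = -\sum_{i=1}^{N} \log P(x_i \mid \theta)$$

so minimizing the cost is the same as maximizing the likelihood.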

Page 13: Hierarchical clustering

• The basic idea is to construct a hierarchical tree which consists of nested clusters
• The algorithm is a bottom-up method where clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree)
• Clustering begins with computing pairwise similarities between the data points; once clusters are formed, the similarity comparison is made between clusters
• The merging process is repeated at most N−1 times, which means that the leaf nodes (genes) form the first pairs and the tree becomes a binary tree

Page 14: Hierarchical clustering: phases

• Calculate the pairwise similarities between the data points into a matrix.
• Find the two data points (nodes in the tree) which are closest to each other, or most similar.
• Group them together to make a new cluster.
• Calculate the average vector of the grouped data points; this is the expression profile of the cluster (an inner node in the tree that joins the leaf nodes, i.e. the data point vectors).
• Calculate a new correlation matrix:
– calculate the pairwise similarity between the new cluster and the other clusters (a code sketch of these phases follows below).
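A minimal sketch of these phases with SciPy, assuming the expression data is a NumPy array X (genes × experiments); the toy data and names are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy data: 6 genes measured over 4 experiments (illustrative only)
X = np.random.rand(6, 4)

# Agglomerative (bottom-up) clustering; 'correlation' turns Pearson
# correlation r into the distance 1 - r, and 'average' linkage averages
# pairwise distances between clusters (a close relative of the slides'
# mean-expression-profile approach)
Z = linkage(X, method='average', metric='correlation')

# Z has N-1 rows, one per merge, matching the at-most-N-1 merges above
print(Z)
print(leaves_list(Z))   # left-to-right leaf order of the resulting tree
```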

Page 15: Tree visualization

• With hierarchical clustering we can find the clusters of data points, but the constructed tree is not yet in an optimal order
• After finding the dendrogram, which tells us the similarity between nodes and genes, the final, optimal linear order for the nodes can be discovered with the help of dynamic programming

Page 16: Tree visualization with dynamic programming [2]

Goal: Quickly and easily arrange the data for further inspection

[Figure: expression matrix heat map, genes A–E (rows) × experiments (columns)]

Page 17: Tree visualization with dynamic programming [2]

• Greedily join the nearest cluster pair [3]
– "nearest": we use the correlation coefficient (normalized dot product); other measures can be used as well

[Figure: genes A–E joined step by step into a dendrogram]

Page 18: Tree visualization with dynamic programming [2]

• Greedily join the nearest cluster pair [3]
• Optimal ordering: minimize the summed distance between consecutive genes
– criterion suggested by Eisen

[Figure: dendrogram over genes A, B, C, D, E before reordering]

Page 19: Tree visualization with dynamic programming [2]

• Greedily join the nearest cluster pair [3]
• Optimal ordering: minimize the summed distance between consecutive genes
– criterion suggested by Eisen

[Figure: the same dendrogram with leaves reordered to B, A, C, E, D]

Page 20: Hierarchical clustering: dynamic programming

• The optimal linear ordering of the gene expression vectors can be computed in $O(N^4)$ steps
• We would like to maximize the similarity between neighbouring nodes:

$$\max_{\phi} \sum_{i=1}^{N-1} C\!\left(G_{\phi(i)}, G_{\phi(i+1)}\right) \qquad [1]$$

where $G_{\phi(i)}$ is the i-th leaf when the tree is ordered according to $\phi$. The algorithm works bottom-up towards the root by recursively computing the cost $M(V, U, W)$ of the optimal ordering of the subtree rooted at $V$ whose leftmost leaf is $U$ and rightmost leaf is $W$.

Page 21: Hierarchical clustering: dynamic programming

• The dynamic programming recurrence is given by:

$$M(V, U, W) = \max_{R \in V_l,\; S \in V_r} \left[\, M(V_l, U, R) + M(V_r, S, W) + C(R, S) \,\right] \qquad [1]$$

where $V_l$ and $V_r$ are the left and right subtrees of $V$, with $R$ a leaf of $V_l$ and $S$ a leaf of $V_r$.

• The optimal cost M(V) for V is obtained by maximizing over all pairs U, W.
• The globally optimal cost is obtained recursively when V is the root of the tree, and the optimal ordering can be found by standard backtracking.
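SciPy ships an optimal leaf ordering routine based on a faster variant of this idea; a small usage sketch with toy data (names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list

X = np.random.rand(10, 5)                  # 10 genes x 5 experiments (toy data)
Z = linkage(X, method='average', metric='correlation')

# Reorder the dendrogram so that the summed distance between consecutive
# leaves is minimized, without changing the tree structure itself
Z_opt = optimal_leaf_ordering(Z, X, metric='correlation')

print(leaves_list(Z))      # leaf order produced by greedy joining
print(leaves_list(Z_opt))  # optimal linear order of the genes
```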

Page 22: k-means algorithm

• The data points are divided into k clusters
• Find, by iterating, the set of centroids $C = \{v_1, \dots, v_K\}$ which minimizes the squared distances ($d^2$) between the expression vectors $x_1, \dots, x_n$ and the centroid of the cluster to which each belongs, $\mathrm{REP}[x_j, C]$:

$$L(C) = \sum_{j=1}^{n} d^2\!\left(x_j, \mathrm{REP}[x_j, C]\right)$$

where the distance measure d is Euclidean. In practice the result is an approximation (a local optimum).
• Each expression vector belongs to exactly one cluster.

Page 23: k-means algorithm: phases

1. Initially put the expression vectors randomly into k clusters.
2. Define the cluster centroids by calculating the average vector of the expression vectors which belong to each cluster.
3. Compute the distances between the expression vectors and the centroids.
4. Move every expression vector into the cluster with the closest centroid.
5. Define new centroids for the clusters. If the cluster centroids are stable, or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5. (A minimal sketch of these phases follows below.)
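A minimal NumPy sketch of these phases, assuming the expression vectors are rows of an array X; the function is illustrative, not the slides' implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Phases 1-5 from the slide; empty clusters are not handled, for brevity."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, k, size=n)            # phase 1: random assignment
    centroids = None
    for _ in range(max_iter):
        # phases 2 and 5: centroid = average vector of each cluster's members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # phase 3: Euclidean distance from every vector to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # phase 4: move each vector to the cluster with the closest centroid
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # stopping criterion: stable
            break
        labels = new_labels
    return labels, centroids

labels, centroids = kmeans(np.random.rand(50, 4), k=3)
```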

Page 24: k-means clustering

Figure 4 [4]: K-means example: 1) The expression vectors are randomly divided into three clusters. 2) Define the centroids. 3) Compute the expression vectors' distances to the centroids. 4) Compute the centroids' new locations. 5) Compute the expression vectors' distances to the centroids. 6) Compute the centroids' new locations and finish the clustering, because the centroids have stabilized. The clusters formed are circled.

Page 25: Mixture models and EM

• The EM algorithm is based on modelling complex distributions by combining together simple Gaussian distributions, one per cluster
• The k-means algorithm is an online approximation of the EM algorithm:
– it maximizes the quadratic log-likelihood (minimizes the quadratic distances of the data points to their cluster centroids)
• The EM algorithm is used to optimize the centers of the clusters, which means that we find the maximum likelihood estimate for the center of each cluster's Gaussian distribution
• Some initial guesses have to be made before starting:
– the number of clusters (k)
– the initial centers of the clusters

Page 26: Mixture models and EM

The algorithm is an iterative process with two optimization tasks:

• E-step: the membership probabilities (hidden variables) of each data point for each mixture component (cluster) are Estimated:

$$P(M_k \mid d_i) = \frac{P(d_i \mid M_k)\, P(M_k)}{P(d_i)}$$

The maximum likelihood estimate of the mixing coefficient is the sample mean of the conditional probabilities that $d_i$ comes from model k:

$$\lambda_k^{*} = \frac{1}{N} \sum_{i=1}^{N} P(M_k \mid d_i)$$

Page 27: Mixture models and EM

• M-step: K separate estimation problems, each Maximizing the log-likelihood of the k-th component, with each data point weighted by its estimated membership probability $w_{ik} = P(M_k \mid d_i)$:

$$\sum_{i=1}^{N} w_{ik}\, \frac{\partial \log P(d_i \mid M_k)}{\partial \mu_{kj}} = 0$$

• In the M-step the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models
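A compact sketch of these two steps for a mixture of spherical, unit-variance Gaussians (so only the means and mixing coefficients are updated); everything here is an illustrative assumption, not the slides' exact implementation:

```python
import numpy as np

def em_gaussian_mixture(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # initial guesses for centers
    lam = np.full(k, 1.0 / k)                      # initial mixing coefficients
    for _ in range(n_iter):
        # E-step: membership probabilities P(M_k | d_i) via Bayes' rule,
        # with P(d_i | M_k) a unit-variance Gaussian around mu_k
        log_lik = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = lam * np.exp(log_lik)                  # unnormalized posteriors
        w /= w.sum(axis=1, keepdims=True)          # normalize by P(d_i)
        # M-step: weighted maximum likelihood estimates of the means,
        # and mixing coefficients as sample means of the posteriors
        Nk = w.sum(axis=0)
        mu = (w.T @ X) / Nk[:, None]
        lam = Nk / n
    return mu, lam, w

mu, lam, w = em_gaussian_mixture(np.random.rand(60, 2), k=3)
```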

Page 28: References

[1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression. Cambridge University Press, 2002, pp. 73-96.
[2] URL http://www-2.cs.cmu.edu/~zivbj/class04/lecture11.ppt
[3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
[4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3, 11 (2002), 1-22. URL http://citeseer.ist.psu.edu/gasch02exploring.html.