Introduction to Bioinformatics - Tutorial no. 12 Expression Data Analysis: - Clustering - GEO - EPClust

Post on 21-Dec-2015

Introduction to Bioinformatics - Tutorial no. 12

Expression Data Analysis:
- Clustering
- GEO
- EPClust

Application of Microarrays

We know the function of only about 20% of the 30,000 genes in the human genome.
- Gene exploration
- Faster and better

Applications:
- Evolution
- Behavior
- Cancer research

Microarray Analysis

Unsupervised Grouping: Clustering

Pattern discovery via grouping similarly expressed genes together

Three techniques are most often used:
- k-Means Clustering
- Hierarchical Clustering
- Kohonen Self-Organizing Feature Maps

Hierarchical Agglomerative Clustering (Michael Eisen, 1998)

- Cluster (algorithm)
- TreeView (visualization)

Step 1: Compute a similarity score between all pairs of genes
- Pearson correlation
- Euclidean distance

Step 2: Find the two most similar genes and replace them with a node that contains their average
- This builds a tree of genes

Step 3: Repeat until a single node remains
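The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Cluster program itself: the function names, the size-weighted averaging of merged profiles, and the example expression values are all assumptions made for the sketch.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A flat profile has no defined correlation; treat it as uncorrelated.
    return cov / denom if denom else 0.0


def euclidean(x, y):
    """Euclidean distance, the alternative score named in Step 1."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerate(profiles):
    """Build a tree (nested tuples) from a dict of gene name -> profile."""
    # Each node is (subtree, averaged profile, number of genes below it).
    nodes = [(name, list(values), 1) for name, values in profiles.items()]
    while len(nodes) > 1:
        # Steps 1 and 2: score all pairs and pick the most similar one.
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda ij: pearson(nodes[ij[0]][1], nodes[ij[1]][1]),
        )
        (tree_i, prof_i, n_i), (tree_j, prof_j, n_j) = nodes[i], nodes[j]
        total = n_i + n_j
        # Assumption: the new node stores the size-weighted average profile.
        average = [(n_i * a + n_j * b) / total for a, b in zip(prof_i, prof_j)]
        merged = ((tree_i, tree_j), average, total)
        # Step 3: replace the pair with the merged node and repeat.
        nodes = [node for k, node in enumerate(nodes) if k not in (i, j)]
        nodes.append(merged)
    return nodes[0][0]


# Hypothetical expression values over four conditions.
example_genes = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.5, 4.0, 2.0],
}
print(agglomerate(example_genes))
# (('geneA', 'geneB'), ('geneC', 'geneD'))
```

Swapping `pearson` for a negated `euclidean` in the `key` function would give the distance-based variant of Step 1.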

Agglomerative Hierarchical Clustering

[Figure: five data points, labeled 1 to 5, and the dendrogram built from them; leaf labels as extracted: 5 2 4 1 3]

Distance between joined clusters

We need to define the distance between the new cluster and the other clusters.

- Single linkage: distance between the closest pair.
- Complete linkage: distance between the farthest pair.
- Average linkage: average distance between all pairs, or the distance between cluster centers.
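The linkage rules can be written directly from their definitions. The sketch below assumes Euclidean distance between points; the function names and the two example clusters are made up for illustration.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(a, b) for a, b in product(c1, c2))


def complete_linkage(c1, c2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(a, b) for a, b in product(c1, c2))


def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(euclidean(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))


def centroid_distance(c1, c2):
    """Distance between the cluster centers (coordinate-wise means)."""
    center1 = [sum(col) / len(c1) for col in zip(*c1)]
    center2 = [sum(col) / len(c2) for col in zip(*c2)]
    return euclidean(center1, center2)


# Hypothetical clusters on a line, so the distances are easy to check by hand.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))     # 2.0
print(complete_linkage(cluster_a, cluster_b))   # 5.0
print(average_linkage(cluster_a, cluster_b))    # 3.5
print(centroid_distance(cluster_a, cluster_b))  # 3.5
```

The average and centroid values coincide here only because the points lie on one line; in general they differ.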

Dendrogram

The dendrogram induces a linear ordering of the data points
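One way to see the ordering: if the tree is stored as nested pairs, reading its leaves from left to right gives the linear order. The nested-tuple representation and the gene names below are assumptions chosen for this sketch.

```python
def leaf_order(tree):
    """Return the leaves of a nested-tuple tree from left to right."""
    if isinstance(tree, tuple):
        left, right = tree
        return leaf_order(left) + leaf_order(right)
    return [tree]


# Hypothetical dendrogram: (A, B) and (C, D) were merged first, then joined.
dendrogram = (("geneA", "geneB"), ("geneC", "geneD"))
print(leaf_order(dendrogram))
# ['geneA', 'geneB', 'geneC', 'geneD']
```

Flipping the two children of any internal node yields another valid ordering of the same tree, so the order is not unique.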

Results of Clustering Gene Expression

- CLUSTER is simple and easy to use
- De facto standard for microarray analysis

Limitations:
- Hierarchical clustering in general is not robust
- Genes may belong to more than one cluster

K-Means Clustering

Algorithm:
- Randomly initialize k cluster means
- Iterate:
  - Assign each gene to the nearest cluster mean
  - Recompute the cluster means
- Stop when the clustering converges

Notes:
- Really fast
- Genes are partitioned into clusters
- How do we select k?
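The loop above can be sketched as follows. This is a minimal version under a few assumptions not stated on the slide: the initial means are k randomly chosen data points, distance is Euclidean, convergence means the assignments no longer change, and an empty cluster keeps its previous mean. The data points are invented.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Partition points into k clusters; returns (labels, means)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting means.
    means = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        # Stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means


# Hypothetical two-condition expression profiles forming two obvious groups.
example_points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
labels, means = kmeans(example_points, k=2)
print(labels)
print(means)
```

The cluster numbering depends on the random start, so only the grouping is meaningful, not which group is called 0 or 1. Choosing k remains an open question here, as the slide notes.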

K-Means Algorithm

Step 1: Randomly initialize clusters
Step 2: Assign data points to the nearest clusters
Step 3: Recalculate clusters
Step 4: Repeat steps 2 and 3 until convergence


EPClust Input (1)

- Expression data matrix
- Extra annotation for gene rows
- Method of tabulation
- Name for further analysis

EPClust Input (2)

- Method of measuring distance between gene rows
- Cluster hierarchically
- Number k of means
- Cluster into k means

GEO: Gene Expression Omnibus

- NCBI database for gene expression data
- Founded at the end of 2000

Querying GEO

- Browse records
- Search for entries containing a gene
- Search for experiments
- Search with Entrez
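Entrez searches can also be issued programmatically through NCBI's E-utilities. The sketch below only builds an ESearch URL and sends no request. The helper name `geo_search_url` and the example search term are hypothetical; `gds` (GEO DataSets) and `geoprofiles` (GEO Profiles) are the Entrez database names for GEO.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, database="gds", max_results=20):
    """Build an ESearch URL for a GEO query (hypothetical helper)."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Search for experiments (GEO DataSets).
print(geo_search_url("yeast cell cycle"))
# Search for entries containing a gene (GEO Profiles).
print(geo_search_url("APO1", database="geoprofiles"))
```

Opening either URL in a browser should return an XML list of matching record identifiers, which can then be looked up in GEO.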

SGD – Expression database

http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl

Two labs are running experiments on the APO1 gene. Suggest a method that would allow them to compare their results.

- Gene grouping
- Relative values
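One possible reading of "relative values" is that each lab reports expression relative to its own control, for example as a log2 ratio, so that differences in absolute intensity scale cancel out. The sketch below illustrates that reading; the intensities and the function name are invented for the example.

```python
import math


def log_ratios(sample, reference):
    """log2(sample / reference) per gene; 1.0 means a two-fold increase."""
    return {gene: math.log2(sample[gene] / reference[gene]) for gene in sample}


# Hypothetical raw intensities: lab B's scanner reads five times brighter.
lab_a = log_ratios({"APO1": 800.0}, {"APO1": 200.0})
lab_b = log_ratios({"APO1": 4000.0}, {"APO1": 1000.0})

print(lab_a)  # {'APO1': 2.0}
print(lab_b)  # {'APO1': 2.0}
```

Although the raw numbers differ by a factor of five, both labs report the same four-fold induction once the values are made relative.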

Explain how microarrays can be used as a basis for diagnostics

        Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
Gen1       +          -          -          +          +
Gen2       +          +          -          +          -
Gen3       -          +          +          +          -
Gen4       +          +          +          -          -
Gen5       -          -          +          -          +

The same table with the Sample 3 and Sample 4 columns swapped:

        Sample 1   Sample 2   Sample 4   Sample 3   Sample 5
Gen1       +          -          +          -          +
Gen2       +          +          +          -          -
Gen3       -          +          +          +          -
Gen4       +          +          -          +          -
Gen5       -          -          -          +          +
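The slides leave the answer open. One possible approach, sketched here as an assumption, is to compare a new sample's expression pattern with samples whose diagnosis is already known and pick the most similar one. The five profiles below are the columns of the table (Gen1 to Gen5, top to bottom); the new sample and the use of Hamming distance are hypothetical.

```python
# Columns of the table, read top to bottom (Gen1..Gen5).
known_profiles = {
    "Sample 1": "++-+-",
    "Sample 2": "-+++-",
    "Sample 3": "--+++",
    "Sample 4": "+++--",
    "Sample 5": "+---+",
}


def hamming(a, b):
    """Number of genes whose expression state differs between two samples."""
    return sum(x != y for x, y in zip(a, b))


def nearest_sample(new_profile, known):
    """Name of the known sample whose profile differs in the fewest genes."""
    return min(known, key=lambda name: hamming(new_profile, known[name]))


# Hypothetical new sample to be diagnosed.
new_sample = "++-++"
closest = nearest_sample(new_sample, known_profiles)
print(closest, hamming(new_sample, known_profiles[closest]))
# Sample 1 1
```

If each known sample carried a label such as a disease subtype, the new sample would be assigned the label of its closest match. Reordering the columns so that similar samples sit next to each other, as in the second table, makes such groups visible by eye.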