Introduction to Time-Course Gene Expression Data

STAT 675

R Guerra

April 21, 2008

Outline

• The Data

• Clustering – nonparametric, model based

• A case study

• A new model

The Data

• DNA Microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface

The Data

• Gene expression changes over time due to environmental stimuli or changing needs of the cell

• Measuring gene expression against time leads to time-course data sets

Time-Course Gene Expression

• Each row represents a single gene

• Each column represents a single time point

• These data sets can be massive, analyzing many genes simultaneously

Time-Course Gene Expression

• k-means to clustering

• “In the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…” Eisen et al. (1998)

Clustering Expression Data

• When these data sets first became available, it was common to cluster using non-parametric clustering techniques like K-Means and hierarchical clustering

Yeast Data Set

• Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae)

– 18 equally spaced time points

– Of 6300 genes, nearly 800 were categorized as cell-cycle regulated

– A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods

– Original and follow-up papers clustered genes using K-means and hierarchical clustering

Spellman et al. (1998): Yeast cell cycle

[Figure: Row labels = cell cycle; Rows = genes; Col labels = expts; Cols = time points]

Yeast Data Set (Spellman et al.)

[Figure panels: K-means; Hierarchical]

Which method gives the “right” result???

Non-Parametric Clustering

1. Data curves

2. Apply distance metric to get distance matrix

3. Cluster
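The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the lecture: the curves are made-up numbers, and the clustering step uses a simple distance cutoff (an assumption chosen for brevity) rather than K-means or a full hierarchical method.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(curves, metric=euclidean):
    """Step 2: symmetric matrix of pairwise distances between curves."""
    n = len(curves)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = metric(curves[i], curves[j])
    return dist

def threshold_clusters(dist, cutoff):
    """Step 3 (toy rule): curves linked by distances below `cutoff` share a label."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] < cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Step 1: hypothetical data curves (rows = genes, columns = time points)
curves = [[0, 1, 2, 1], [0, 1.1, 2.1, 1], [5, 4, 3, 4], [5, 4.2, 3, 4.1]]
dist = distance_matrix(curves)
labels = threshold_clusters(dist, cutoff=1.0)
print(labels)
```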

Issues with Non-Parametric Clustering

• Technical

– Require the number of clusters to be chosen a priori

– Do not take into account the time-ordering of the data

– Hard to incorporate covariate data, e.g., gene ontology

• Yeast analysis had number of clusters chosen based on number of cell cycle groups … no statistical validation showing that these were the best clustering assignments
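The point about time-ordering can be checked directly: if the time points of every curve are shuffled in the same way, the Euclidean distance between two curves does not change, so any clustering built on that distance is blind to the order. The two curves below are made-up values used only for illustration.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length expression curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression curves for two genes over six time points
gene_a = [0.1, 0.9, 1.8, 0.7, -0.2, -1.1]
gene_b = [0.0, 0.5, 1.5, 1.2, 0.1, -0.8]

# Apply the same random reordering of time points to both curves
rng = random.Random(0)
order = list(range(len(gene_a)))
rng.shuffle(order)
shuffled_a = [gene_a[t] for t in order]
shuffled_b = [gene_b[t] for t in order]

d_original = euclidean(gene_a, gene_b)
d_shuffled = euclidean(shuffled_a, shuffled_b)
print(d_original, d_shuffled)
```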

Model-Based Clustering

• In response to limitations of nonparametric methods, model-based methods were proposed

– Time series

– Spline Methods

– Hidden Markov Model

– Bayesian Clustering Models

• Little consensus over which method is “best” to cluster time course data

K-Means Clustering

Relocation method: Number of clusters pre-determined and curves can change clusters at each iteration

– Initially, data assigned at random to k clusters

– Centroid is computed for each cluster

– Data reassigned to cluster whose centroid is closest to it

– Algorithm repeats until no further change in assignment of data to clusters

– Hartigan rule used to select “optimal” #clusters
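The relocation steps above can be sketched as follows. This is a minimal pure-Python illustration with made-up curves, not the R implementation referred to in the lecture; the handling of an emptied cluster (re-seeding it with a random curve) is an assumption added to keep the sketch robust.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(curves):
    """Point-wise mean of a non-empty list of curves."""
    n = len(curves)
    return [sum(col) / n for col in zip(*curves)]

def kmeans(curves, k, seed=0, max_iter=100):
    """Return (labels, within-cluster sum of squares) from one random start."""
    rng = random.Random(seed)
    # Random initial assignment in which every cluster starts non-empty
    labels = [i % k for i in range(len(curves))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [x for x, lab in zip(curves, labels) if lab == c]
            # Assumption: an emptied cluster is re-seeded with a random curve
            centers.append(centroid(members) if members else list(rng.choice(curves)))
        # Reassign each curve to the cluster with the closest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in curves]
        if new_labels == labels:  # no further change in assignment
            break
        labels = new_labels
    wss = 0.0
    for c in range(k):
        members = [x for x, lab in zip(curves, labels) if lab == c]
        if members:
            center = centroid(members)
            wss += sum(sq_dist(x, center) for x in members)
    return labels, wss

# Hypothetical curves forming two well-separated groups
curves = [[0, 0, 0], [0.1, 0, 0.1], [0, 0.1, 0], [5, 5, 5], [5.1, 5, 4.9], [5, 4.9, 5]]
labels, wss = kmeans(curves, k=2)
print(labels, wss)
```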

K-means: Hartigan Rule

• n curves, let k1 = k groups and k2 = k+1 groups.

• If E1 and E2 are the sums of the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if:

$$\left(\frac{E_1}{E_2} - 1\right)(n - k - 1) > 10$$
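As a small worked example, the rule can be written as a function, assuming the usual form of the Hartigan criterion: add the (k+1)-th group when (E1/E2 - 1)(n - k - 1) exceeds 10. The numbers in the example calls are invented for illustration.

```python
def hartigan_statistic(e1, e2, n, k):
    """(E1/E2 - 1) * (n - k - 1) for moving from k to k+1 clusters."""
    return (e1 / e2 - 1.0) * (n - k - 1)

def add_extra_cluster(e1, e2, n, k, threshold=10.0):
    """True if the rule favours k+1 clusters over k."""
    return hartigan_statistic(e1, e2, n, k) > threshold

# Hypothetical values: n = 100 curves, comparing k = 3 with k + 1 = 4 clusters
big_drop = add_extra_cluster(e1=50.0, e2=40.0, n=100, k=3)    # statistic is 24
small_drop = add_extra_cluster(e1=50.0, e2=48.0, n=100, k=3)  # statistic is 4
print(big_drop, small_drop)
```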

K-means: Distance Metric

• Euclidean Distance

• Pearson Correlation
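The two metrics can disagree sharply. In the sketch below, two made-up curves have the same shape but very different scale: the Euclidean distance is large, while a correlation-based distance is essentially zero. Turning the Pearson correlation r into a distance as 1 - r is an assumption made here; other conversions are possible.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation (assumed conversion from correlation to distance)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical curves: `high` is `low` multiplied by 10 and shifted by 5
low = [0.0, 1.0, 0.0, -1.0, 0.0]
high = [10 * v + 5 for v in low]

print(euclidean(low, high))         # large: the scales differ
print(pearson_distance(low, high))  # near zero: the shapes agree
```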

K-means: Starting Chains

• Initially, data are randomly assigned to k clusters but this choice of k cluster centers can have an effect on the final clustering

• The R implementation of K-means lets the user choose the “number of initial starting chains”; the run with the smallest sum of within-cluster sums of squares is the one returned as output

K-Means: Starting Chains

For j = 1 to B

Random assignment j to k clusters; run K-means; wj = within-cluster sum of squares

End j

Pick clustering with min(wj)
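The loop above can be sketched as follows. To keep the example short, it uses a tiny one-dimensional K-means with randomly chosen initial centers (a simplification assumed here, not the R implementation) and made-up values; the part that matters is `best_of_starts`, which keeps the run with the smallest w_j.

```python
import random

def kmeans_1d(values, k, seed, n_iter=20):
    """Tiny 1-D K-means from one random start; returns (labels, wss)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # random start j
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2) for v in values]
    wss = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, wss

def best_of_starts(values, k, n_starts):
    """Run B = n_starts random starts and keep the run with the smallest w_j."""
    runs = [kmeans_1d(values, k, seed=j) for j in range(n_starts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical data with three obvious groups
values = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]
labels, wss = best_of_starts(values, k=3, n_starts=25)
print(labels, wss)
```

Different seeds can end in different local optima, which is why the smallest within-cluster sum of squares over all starts is reported.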

Insert Initial starting chains

Hierarchical Clustering

• Hierarchical clustering is an addition (agglomerative) or subtraction (divisive) method.

• Initially each curve is assigned its own cluster

– The two closest clusters are joined into one branch to create a clustering tree

– The clustering tree stops when the algorithm terminates via a stopping rule


• Nearest neighbor: Distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster

• Furthest neighbor: Distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster

• Average linkage: Distance between two clusters is the average of all distances between all pairs of elements, one from each cluster


• Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold

– No universal stopping rule of thumb to find an optimal number of clusters using this algorithm.
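The addition (agglomerative) version with the three linkage rules can be sketched as below. This is a minimal illustration on made-up curves; it stops at a pre-determined number of clusters, one of the two stopping rules mentioned above, and uses Euclidean distance as an assumed metric.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each linkage turns the list of all between-cluster pair distances into one number
LINKAGES = {
    "nearest": min,                               # nearest neighbor
    "furthest": max,                              # furthest neighbor
    "average": lambda ds: sum(ds) / len(ds),      # average linkage
}

def cluster_distance(c1, c2, curves, linkage):
    """Distance between two clusters given as lists of curve indices."""
    pair_dists = [euclidean(curves[i], curves[j]) for i in c1 for j in c2]
    return LINKAGES[linkage](pair_dists)

def agglomerate(curves, n_clusters, linkage="average"):
    """Join the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(curves))]  # each curve starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], curves, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical curves with two time points each
curves = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]]
result = agglomerate(curves, n_clusters=3, linkage="nearest")
print(result)
```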

Model-Based Clustering

Many use mixture models, with splines or piecewise polynomial functions used to approximate curves

Can better incorporate covariate information

Models using Splines

• Time-course profiles are assumed to be observations from some underlying smooth expression curve

• Each data curve is represented as the sum of:

– Smooth population mean spline (dependent on time and cluster assignment)

– Spline function representing individual (gene) effects

– Gaussian measurement noise
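In symbols, the three-part decomposition above can be sketched as follows. The notation is assumed here for illustration and is not taken from the slides: $y_{ij}$ is the expression of gene $i$ at its $j$-th time point $t_{ij}$, $\mu_k$ is the mean spline of the cluster $k$ containing gene $i$, $b_i$ is the gene-specific spline, and $\varepsilon_{ij}$ is the measurement noise with an assumed common variance $\sigma^2$.

```latex
y_{ij} = \mu_{k}(t_{ij}) + b_{i}(t_{ij}) + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^{2})
```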

SSCLUST software

Model-based clustering and data transformations for gene expression data (2001)

Yeung et al., Bioinformatics, 17:977-987.

MCLUST software

Validation Methods

• L(C) is maximized log-likelihood for model with C clusters, m is the number of independent parameters to be estimated and n is the number of genes

• Strikes a balance between goodness-of-fit and model complexity

• The non-model-based methods have no such validation method

$$\mathrm{BIC}(C) = 2\,L(C) - m_C \log(n)$$
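A short sketch of how such a criterion selects the number of clusters, assuming the sign convention BIC(C) = 2 L(C) - m log(n), under which larger values are better. The log-likelihoods and parameter counts below are invented numbers, not results from any data set in this lecture.

```python
import math

def bic(max_log_likelihood, n_params, n_genes):
    """BIC(C) = 2 L(C) - m log(n); larger is better under this convention."""
    return 2.0 * max_log_likelihood - n_params * math.log(n_genes)

def select_n_clusters(candidates, n_genes):
    """candidates maps C -> (L(C), m); return the C with the largest BIC."""
    return max(candidates, key=lambda c: bic(*candidates[c], n_genes))

# Hypothetical fits for C = 2, 3, 4 clusters on n = 433 genes
candidates = {2: (-1500.0, 10), 3: (-1400.0, 15), 4: (-1395.0, 20)}
best_c = select_n_clusters(candidates, n_genes=433)
print(best_c)
```

Here the fourth cluster raises the log-likelihood only slightly, so the penalty on its extra parameters outweighs the gain in fit.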

Clustering Yeast Data using SSClust

Clustering Yeast Data in MCLUST

Comparison of Methods

• Ma et al (2006)

• Smoothing Spline Clustering (SSClust)

• Simulation study

• SSClust better than MCLUST & nonparametric

• Comparison: misclassification rates

Functional Form of Ma et al (2006) Simulation Cluster Centers

MR and OSR

• Misclassification Rate

• Overall Success Rate

– To calculate OSR the MR is only for the cases when the correct number of clusters is found

$$\mathrm{MR} = \frac{\#\text{ of Misclassified Curves}}{\text{Total }\#\text{ of Curves}}$$

$$\mathrm{OSR} = (\%\text{ correct }\#\text{ clusters found}) \times (1 - \mathrm{MR})$$
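The two rates can be computed as in the sketch below. Because cluster names are arbitrary, the misclassification rate here is taken after the best one-to-one renaming of the found clusters; that matching step is an assumption of this sketch, and it presumes the found clustering has no more clusters than the truth. The labels are made up.

```python
from itertools import permutations

def misclassification_rate(true_labels, found_labels):
    """Share of curves misclassified under the best relabeling of found clusters."""
    true_ids = sorted(set(true_labels))
    found_ids = sorted(set(found_labels))
    best_correct = 0
    for perm in permutations(true_ids, len(found_ids)):
        mapping = dict(zip(found_ids, perm))
        correct = sum(mapping[f] == t for t, f in zip(true_labels, found_labels))
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(true_labels)

def overall_success_rate(fraction_correct_k, mr_when_correct_k):
    """OSR = (share of runs finding the correct # of clusters) * (1 - MR)."""
    return fraction_correct_k * (1.0 - mr_when_correct_k)

# Hypothetical example: ten curves, three true clusters, one curve misplaced
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
found_labels = [2, 2, 2, 0, 0, 1, 1, 1, 1, 1]
mr = misclassification_rate(true_labels, found_labels)
osr = overall_success_rate(0.8, mr)
print(mr, osr)
```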

Comparison of Methods

• From Ma et al. (2006) paper.

Clustering Method   Distance Metric   MR (%)   Correct # of Clusters (%)   OSR (%)
K-means             Euclidean         9.73     N/A                         N/A
K-means             Pearson           2.64     N/A                         N/A
MCLUST              N/A               0.38     77                          69.5
SSClust             N/A               0.13     100                         98.7

SSClust Methods Paper

• Concluded that SSClust was the superior clustering method

• Looking at the data, the differences in scale between the four true curves are large

– Typical time-course clusters differ in location and spread but not in scale to this extreme

– Their conclusions are based on a data set which is not representative of the type of data this clustering method would be used for

Alternative Simulation

Functional Form for five cluster centers

Example of SSClust Breaking Down

Linear curves joined while sine curves arbitrarily split into 2 clusters

Simulation Configuration

• Distance Metric

– Euclidean or Pearson

• # of Curves

– Small (100), Large (3000)

• # Resolution of Time Points

– 13 or 25 time points

– evenly spaced or unevenly spaced

• Types of underlying Curves

– Small (4)

– Large (8)

Simulation Configuration

• Distribution of curves across clusters

– Equally distributed versus unequally distributed

• Noise Level

– Small (< 0.5*SD of the data set)

– Large (> 0.5*SD of the data set)

• For these cases, found the misclassification rates and the percent of times that the correct number of clusters was found

Function Forms of 7 Cluster Centers

Simulation Analysis

Conclusions from Simulations

• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters

• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters

Effect of Number of Profiles on OSR

Comparison based on Real Data

• Applied these same clustering techniques to real data

• Different numbers of clusters found for different methods for each real data set

Yeast Data

Human Fibroblast Data

Simulations Based on Real Data

– Start with real data, like the yeast data set

– Cluster the data using a given clustering method

– Perturb the original data (add noise at each point)

– Evaluate how different the new clustering is in comparison to the original clustering

• Use MR and OSR
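The perturbation scheme above can be sketched as a small harness. Everything specific here is an assumption for illustration: the noise is Gaussian, the clustering method is a toy stand-in (`sign_of_mean`), and the comparison simply counts curves whose label changed, which presumes both clusterings use the same cluster names.

```python
import random

def perturb(curves, noise_sd, rng):
    """Add independent Gaussian noise to every time point of every curve."""
    return [[v + rng.gauss(0.0, noise_sd) for v in curve] for curve in curves]

def fraction_changed(labels_a, labels_b):
    """Share of curves whose cluster label differs between two clusterings."""
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def stability(curves, cluster_fn, noise_sd, n_reps, seed=0):
    """Average share of curves that change cluster after perturbation."""
    rng = random.Random(seed)
    original = cluster_fn(curves)
    rates = [
        fraction_changed(original, cluster_fn(perturb(curves, noise_sd, rng)))
        for _ in range(n_reps)
    ]
    return sum(rates) / n_reps

def sign_of_mean(curves):
    """Toy stand-in for a real clustering method: label by sign of the curve mean."""
    return [int(sum(c) / len(c) > 0) for c in curves]

# Hypothetical curves
curves = [[1.0, 1.2, 0.9], [0.8, 1.1, 1.0], [-1.0, -0.9, -1.2], [-1.1, -1.0, -0.8]]
rate_no_noise = stability(curves, sign_of_mean, noise_sd=0.0, n_reps=10)
rate_noisy = stability(curves, sign_of_mean, noise_sd=5.0, n_reps=20)
print(rate_no_noise, rate_noisy)
```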

Simulations Based on Yeast Data

Simulations Based on HF Data

Conclusions from these Simulations

• SSClust better than MCLUST and K-means

– This was in contrast to the prior simulations where MCLUST was best

Gene Ontology

• So far I’ve described my work analyzing and comparing clustering results on gene expression data

• Some, like Pan (2006), have argued that clustering methods, even newer model-based clustering methods, are incomplete because they ignore gene function and other biological aspects in the clustering

Gene Ontology

• Expectation is that incorporating biological data with the expression data will lead to better clustering

Gene Ontology

• Gene Ontology project (Ashburner et al. 2000) provides a structured vocabulary to describe genes and gene products in organisms

• Three ontologies developed

– Biological Processes (e.g……)

– Molecular Function (e.g……)

– Cellular Component (e.g……)

Annotations

• Gene Ontology annotations are associations made between gene products and the GO terms describing them

• A directed acyclic graph for a gene from the HF data set using GO molecular function annotation is to the right

Clustering using GO Data

• First, need a distance metric

• Two metrics used are the Union-Intersection distance and the longest path distance, both developed in Gentleman (2005) and extended by Christian (2007)

• I used the Union-Intersection distance in my clustering

GO Distances

• The union-intersection distance is defined as

• Show example using two DAGs

– Min = 0 when two DAGs are identical

– Max = 1 when two DAGs have nothing in common
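A sketch of one way to compute such a distance, assuming the form 1 - |intersection| / |union| taken over the node sets of the two genes' induced GO graphs. That form is an assumption here, chosen because it gives 0 for identical DAGs and 1 for DAGs with nothing in common, as stated above. The toy ontology and term names are hypothetical.

```python
def induced_nodes(terms, parents):
    """All terms reachable from a gene's annotated terms by following parent links."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def ui_distance(terms_a, terms_b, parents):
    """1 - |A & B| / |A | B| over the node sets of the two induced DAGs."""
    nodes_a = induced_nodes(terms_a, parents)
    nodes_b = induced_nodes(terms_b, parents)
    union = nodes_a | nodes_b
    if not union:
        return 0.0
    return 1.0 - len(nodes_a & nodes_b) / len(union)

# Hypothetical mini-ontology: child term -> list of parent terms
parents = {
    "dna_binding": ["binding"],
    "binding": ["root"],
    "kinase": ["catalytic"],
    "catalytic": ["root"],
}
d_same = ui_distance(["dna_binding"], ["dna_binding"], parents)
d_partial = ui_distance(["dna_binding"], ["kinase"], parents)  # share only "root"
d_disjoint = ui_distance(["term_x"], ["term_y"], parents)      # nothing in common
print(d_same, d_partial, d_disjoint)
```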

Showing UI Distance

Clustering Using All Data

• Open question in how to cluster genes using both time-course expression data and gene ontology data together

• Two of the methods I used are from Boratyn et al (2007) and from Fang et al (2006)

Boratyn et al (2007) Method

• Clusters are based on adding individually scaled distance matrices

– Take the distance matrix from expression clustering and the distance matrix from gene ontology clustering

– Put them on the same scale [0,1]

– Add the scaled distance matrices together

– Cluster using this new distance matrix which captures differences in expression profiles and gene function
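The scale-and-add steps can be sketched as follows. Min-max scaling is assumed here as the way to put each matrix on [0,1]; the slides do not say which scaling the method uses. Both matrices are made-up three-gene examples.

```python
def rescale_to_unit(matrix):
    """Min-max scale all entries of a distance matrix to [0, 1] (assumed scaling)."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

def combined_distance(expr_dist, go_dist):
    """Add the two individually scaled distance matrices entry by entry."""
    a = rescale_to_unit(expr_dist)
    b = rescale_to_unit(go_dist)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Hypothetical distance matrices for three genes
expr_dist = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]
go_dist = [[0, 0.5, 0.25], [0.5, 0, 1.0], [0.25, 1.0, 0]]
combined = combined_distance(expr_dist, go_dist)
print(combined)
```

The combined matrix can then be passed to any distance-based method such as hierarchical clustering.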

Yeast: 12 Clusters on Combined Distance Metric

Fang et al (2006) Method

• In this method,

– Gene ontology is a guide for clustering the expression profiles

– Biological process is the GO annotation used

– Uses the mean squared residual score to assess the expression correlation of genes within a cluster from the clustering by GO data.
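A sketch of a mean squared residual score, assuming the definition commonly used in biclustering (each entry minus its row mean, minus its column mean, plus the overall mean, squared and averaged). Whether this matches the exact score in Fang et al (2006) is an assumption. Lower values mean the genes in the cluster rise and fall together; the matrices are made up.

```python
def mean_squared_residue(rows):
    """Mean of (a_ij - rowmean_i - colmean_j + overall mean)^2 over a gene-by-time block."""
    n_genes = len(rows)
    n_times = len(rows[0])
    row_means = [sum(r) / n_times for r in rows]
    col_means = [sum(c) / n_genes for c in zip(*rows)]
    overall = sum(row_means) / n_genes
    total = 0.0
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            total += (value - row_means[i] - col_means[j] + overall) ** 2
    return total / (n_genes * n_times)

# Hypothetical cluster whose genes are shifted copies of one profile
coherent = [[1, 2, 3], [2, 3, 4], [0, 1, 2]]
# Hypothetical cluster whose two genes move in opposite directions
incoherent = [[1, 2, 3], [3, 2, 1]]
score_coherent = mean_squared_residue(coherent)
score_incoherent = mean_squared_residue(incoherent)
print(score_coherent, score_incoherent)
```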

Effect of the Choice of Ontologies

• Examined effect of the choice of which ontology to use in my clustering between BP, CC, and MF.

• Fang et al (2006) uses BP in their method as it has tended to be most closely correlated with gene function among the 3 ontologies

Effect of Choice of Ontology

Conclusions from GO Chapter

• Clustering using expression and ontology data together gave expression clusters as good as or better than those obtained when expression data are clustered alone, with the added bonus of a biological basis for filtering out potentially nonsensical clusterings

Conclusions from Paper as a Whole

• Among model-based and non-model-based expression clustering methods, there is no uniform “best clustering method” for all cases

– But methods are robust in terms of data apportionment per cluster and the number of curves per dataset (important for massive gene data banks)

• Clustering using expression and GO data together improves upon expression clustering and again methods vary in complexity, performance, and ease of use

Further Extensions

• GO analysis was all using K-means and hierarchical clustering

– Extend GO clustering to model-based clustering techniques like MCLUST and SSClust (currently, GO data can be used as initial conditions in these models but not as some notion of prior model parameters)

P. falciparum: Examination of Correlation Between Spatial Location and Temporal Expression of Genes

Motivations:

• Evidence for correlation in literature

– Printing artifact

– Biological

• Develop a visualization and statistical testing methodology

Biological Motivations

[Figure panels: Operon control (bacteria); Upstream Activating Sequences (yeast); Locus Control Region (mammalian globin cluster). Labels: promoter, ORF1, ORF2, UAS1, UAS2, LCR1, mRNA(s)]

Hypothesis and Statistic

• Statistical: Correlation between chromosomal location and gene expression?

• Biological: Gene order random?

• H0: no correlation between location on chromosome and expression

• Consider correlations in partitions

Approach

Covariogram: General Tool

Partition Chromosome, Develop Statistic

Permutation Testing Framework

Check for Confounding Factors

Biological Significance

Issues

• Confounding (printing) or other artifacts

• Account for inter-gene distances (as opposed to adjacent pairwise correlation)

• Significance of correlation

operon

Methods: Data

• Need gene information (plasmodb.org has annotated fastA files):

TCAAGCAATTGTTAGATGAGAACAATAGGAAGAATTTAAATTTTAATGATCTGGTTATACACCCTTGGTGGTCTTATAAGAATTAA

>Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) hypothetical protein Location=join(124752..124823,124961..125719)

ATGATATTTCATAAATGCTTTAAAATTTGTTCGCTCTCTTGTACTGTTTTATGGGTTACCGCCATATCATCGATCATTCAACCAGACAAACAACAAGAAA

• Normalized gpr files (2-D loess, centered and scaled)
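A minimal sketch of pulling the gene name and chromosomal span out of a header in the format shown above. The field positions (gene ID as the third `|`-separated field, coordinates after `Location=`) are assumptions read off this single example and may not hold for every record.

```python
import re

def parse_header(header):
    """Return (gene_id, start, end) from a header like the example above."""
    fields = header.lstrip(">").split("|")
    gene_id = fields[2]
    # Collect every coordinate after "Location=" and take the overall span
    coords = [int(x) for x in re.findall(r"\d+", header.split("Location=")[1])]
    return gene_id, min(coords), max(coords)

header = (
    ">Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) "
    "hypothetical protein Location=join(124752..124823,124961..125719)"
)
gene_id, start, end = parse_header(header)
midpoint = (start + end) / 2  # midpoint of the gene's chromosomal span
print(gene_id, start, end, midpoint)
```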

Methods: Data

• FastA sequence: 5400 predicted genes (e.g., PFA0135w, 124752:125719 bp)

• QC Microarray: 3800 genes, 5100 probes (e.g., PFA0135w, probe a16122_1, t1, t2, …, t48)

• Intersection: 3500 genes with common gene name (e.g., PFA0135w, 124752:125719 bp, probe a16122_1, t1, t2, …, t48)

Methods: Covariograms

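$$\gamma(x, y; d_a, d_b) = \mathrm{Ave}\left[\rho(x, y) \mid d_a \le \mathrm{dist}(x, y) < d_b\right]$$

• Covariogram 1: distance is chromosomal location:

$$d(g_i, g_j) = \left| g_{i,\,\mathrm{midpt(chr\ loc)}} - g_{j,\,\mathrm{midpt(chr\ loc)}} \right|$$

• Covariogram 2: distance is printed microarray location:

$$d(g_i, g_j) = \sqrt{(g_{i,x} - g_{j,x})^2 + (g_{i,y} - g_{j,y})^2}$$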
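A sketch of a covariogram in the sense used here: the average pairwise Pearson correlation among gene pairs whose separation falls in a given distance bin. Half-open bins and one-dimensional positions (gene midpoints on the chromosome) are assumptions of this sketch; for printed array location the absolute difference would be replaced by the Euclidean distance between (x, y) spot coordinates. All data are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def covariogram(profiles, positions, bin_edges):
    """Average correlation of gene pairs with distance in each [d_a, d_b) bin."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            d = abs(positions[i] - positions[j])
            for b in range(n_bins):
                if bin_edges[b] <= d < bin_edges[b + 1]:
                    sums[b] += pearson(profiles[i], profiles[j])
                    counts[b] += 1
    # Bins containing no gene pairs are reported as None
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical profiles and gene midpoints (bp)
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
positions = [1000, 3000, 50000]
result = covariogram(profiles, positions, bin_edges=[0, 10000, 100000])
print(result)
```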

Chr 10: Covariogram 2




Methods: Partitioning

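• Partition

• Avg of all pairwise Pearson correlations

[Figure: chromosome partitioned at 0 kb, 60 kb, 120 kb]

3 genes, $\binom{3}{2} = 3$ pairwise correlations:

$$\bar{r}_2 = \frac{1}{3} \sum_{i=1}^{3} r_i$$

7 genes, $\binom{7}{2} = 21$ pairwise correlations:

$$\bar{r}_1 = \frac{1}{21} \sum_{i=1}^{21} r_i$$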
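The partition statistic can be sketched as below: genes are binned by chromosomal position into fixed-width partitions, and the statistic for a partition is the average of all pairwise Pearson correlations among its genes. Partitions starting at position 0 and the use of gene midpoints are assumptions of this sketch, and the profiles are made up.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mean_pairwise_correlation(profiles):
    """Average of all pairwise correlations among the given profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def partition_genes(midpoints, width):
    """Map partition index -> gene indices, for partitions of `width` bp from 0."""
    bins = {}
    for gene, pos in enumerate(midpoints):
        bins.setdefault(int(pos // width), []).append(gene)
    return bins

# Hypothetical example: three genes in one partition give three pairs
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
r_bar = mean_pairwise_correlation(profiles)
bins = partition_genes([10000, 50000, 70000], width=60000)
print(r_bar, bins)
```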

Methods: Partitioning

• Chr 6, 40 kb partition

• Significant?

Methods: Permutation Test

• $\bar{r} = 0.50$ in a 40kb interval on chr 6

• Permutation test

• Null distribution

• Estimated p-values

[Figure: observed assignment of expression profiles e1, …, e4 to genes g1, …, g4, followed by permuted assignments Perm(1), Perm(2), …, Perm(n)]

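• Distribution of $\bar{r}$ in 40 kb interval: $\bar{r}_{\mathrm{obs}} = 0.57$, $n_{\mathrm{genes}} = 2$, p-val = 0.22

• Distribution of $\bar{r}$ in 40 kb interval: $\bar{r}_{\mathrm{obs}} = 0.72$, $n_{\mathrm{genes}} = 6$, p-val = 0.001

• Distribution of $\bar{r}$ in 40 kb interval: $\bar{r}_{\mathrm{obs}} = 0.49$, $n_{\mathrm{genes}} = 9$, p-val = 0.002

• Distribution of $\bar{r}$ in 40 kb interval: $\bar{r}_{\mathrm{obs}} = 0.018$, $n_{\mathrm{genes}} = 12$, p-val = 0.475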
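A sketch of a permutation test of this kind: expression profiles are reassigned to gene positions at random, the average pairwise correlation in the interval is recomputed each time, and the p-value is estimated as the share of permutations that reach the observed value. The one-sided comparison, the number of permutations, and the simulated profiles are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mean_pairwise_correlation(profiles):
    """Average of all pairwise correlations among the given profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def permutation_p_value(all_profiles, interval_genes, n_perm=1000, seed=0):
    """Return (observed r-bar, estimated one-sided p-value) for one interval."""
    rng = random.Random(seed)
    observed = mean_pairwise_correlation([all_profiles[g] for g in interval_genes])
    hits = 0
    for _ in range(n_perm):
        shuffled = all_profiles[:]
        rng.shuffle(shuffled)  # reassign profiles to gene positions at random
        permuted = mean_pairwise_correlation([shuffled[g] for g in interval_genes])
        if permuted >= observed:
            hits += 1
    return observed, hits / n_perm

# Simulated chromosome: genes 0-2 are co-expressed, the other 17 are random
rng = random.Random(1)
base = [rng.gauss(0, 1) for _ in range(8)]
all_profiles = [[(s + 1) * v for v in base] for s in range(3)]
all_profiles += [[rng.gauss(0, 1) for _ in range(8)] for _ in range(17)]
observed, p_value = permutation_p_value(all_profiles, interval_genes=[0, 1, 2])
print(observed, p_value)
```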

Significant Intervals (Chr 7)

[Figure: significant intervals at partition sizes 10kb, 20kb, 40kb, 60kb, 80kb, 100kb]

MAL6P1.257: hypothetical protein

MAL6P1.258: malate:quinone oxidoreductase




MAL6P1.265: pyridoxine kinase




MAL6P1.271: cdc2-like protein kinase

MAL6P1.272: ribonuclease


Results: Summary Table

          10kb     60kb     100kb    10kb in 60kb
Chr 3     3/400    0/68     0/40     0
Chr 4     10/476   5/80     2/48     4
Chr 5     6/528    1/88     3/56     0
Chr 14    4/1304   2/220    1/132    0

Conclusions

• Statistical: Significance for both small regions of strong correlation and large regions of weak correlation

• Biological: Evidence for regulation at multiple levels
