Introduction to Time-Course Gene Expression Data. STAT 675, R Guerra, April 21, 2008

Page 1: Introduction to Time-Course Gene Expression Data

Introduction to Time-Course Gene Expression Data

STAT 675

R Guerra

April 21, 2008

Page 2: Introduction to Time-Course Gene Expression Data

Outline

• The Data

• Clustering – nonparametric, model based

• A case study

• A new model

Page 3: Introduction to Time-Course Gene Expression Data

The Data

• DNA Microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface

Page 4: Introduction to Time-Course Gene Expression Data

The Data

• Gene expression changes over time due to environmental stimuli or changing needs of the cell

• Measuring gene expression against time leads to time-course data sets

Page 5: Introduction to Time-Course Gene Expression Data

Time-Course Gene Expression

• Each row represents a single gene

• Each column represents a single time point

• These data sets can be massive, analyzing many genes simultaneously

Page 6: Introduction to Time-Course Gene Expression Data

Time-Course Gene Expression

• k-means to clustering

• “in the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…” Eisen et al. (1998)

Page 7: Introduction to Time-Course Gene Expression Data

Clustering Expression Data

• When these data sets first became available, it was common to cluster using non-parametric clustering techniques like K-Means and hierarchical clustering

Page 8: Introduction to Time-Course Gene Expression Data

Yeast Data Set

• Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae)

– 18 equally spaced time points

– Of 6300 genes, nearly 800 were categorized as cell-cycle regulated

– A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods

– Original and follow-up papers clustered genes using K-means and hierarchical clustering

Page 9: Introduction to Time-Course Gene Expression Data

Spellman et al. (1998): Yeast cell cycle

[Figure: Row labels = cell cycle; Rows = genes; Col labels = expts; Cols = time points]

Page 10: Introduction to Time-Course Gene Expression Data

Yeast Data Set (Spellman et al.)

[Figure panels: K-means, Hierarchical]

Which method gives the “right” result???

Page 11: Introduction to Time-Course Gene Expression Data

Non-Parametric Clustering

1. Data curves

2. Apply distance metric to get distance matrix

3. Cluster

Page 12: Introduction to Time-Course Gene Expression Data

Issues with Non-Parametric Clustering

• Technical

– Require the number of clusters to be chosen a priori

– Do not take into account the time-ordering of the data

– Hard to incorporate covariate data, e.g., gene ontology

• Yeast analysis had the number of clusters chosen based on the number of cell cycle groups, with no statistical validation showing that these were the best clustering assignments

Page 13: Introduction to Time-Course Gene Expression Data

Model-Based Clustering

• In response to limitations of nonparametric methods, model-based methods were proposed

– Time series

– Spline Methods

– Hidden Markov Model

– Bayesian Clustering Models

• Little consensus over which method is “best” to cluster time course data

Page 14: Introduction to Time-Course Gene Expression Data

K-Means Clustering

Relocation method: Number of clusters pre-determined and curves can change clusters at each iteration

– Initially, data assigned at random to k clusters

– Centroid is computed for each cluster

– Data reassigned to cluster whose centroid is closest to it

– Algorithm repeats until no further change in assignment of data to clusters

– Hartigan rule used to select “optimal” #clusters
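The relocation steps above can be sketched in plain Python. This is a minimal illustration, not the software used in the talk: the function names, the balanced random initial assignment, and the reseeding of an emptied cluster are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def compute_centers(curves, labels, k, rng):
    """Centroid (pointwise mean) of each cluster."""
    centers = []
    for c in range(k):
        members = [x for x, lab in zip(curves, labels) if lab == c]
        if not members:  # assumption: reseed an emptied cluster with a random curve
            members = [rng.choice(curves)]
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers

def kmeans(curves, k, seed=0, max_iter=100):
    """Random assignment, then alternate centroid and reassignment steps."""
    rng = random.Random(seed)
    # Random initial assignment with every label used at least once.
    labels = [i % k for i in range(len(curves))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centers = compute_centers(curves, labels, k, rng)
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                      for x in curves]
        if new_labels == labels:  # no curve changed cluster: stop
            break
        labels = new_labels
    centers = compute_centers(curves, labels, k, rng)
    wss = sum(sq_dist(x, centers[lab]) for x, lab in zip(curves, labels))
    return labels, wss

# Toy data: two low curves and two high curves.
curves = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 4.9, 5.0]]
labels, wss = kmeans(curves, k=2)
print(labels, wss)
```

The returned `wss` is the total within-cluster sum of squares, the quantity that the Hartigan rule and the multiple-start comparison both rely on.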

Page 15: Introduction to Time-Course Gene Expression Data

K-means: Hartigan Rule

• n curves, let k1 = k groups and k2 = k+1 groups.

• If E1 and E2 are the sums of the within cluster sums of squares for k1 and k2 respectively, then add the extra group if:

$\left(\frac{E_1}{E_2} - 1\right)(n - k - 1) > 10$
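As a small worked check, the rule can be written as a function, assuming it reads (E1/E2 − 1)(n − k − 1) > 10 with E1 and E2 the within-cluster sums of squares for k and k+1 groups. The function names and the numbers in the example are illustrative only.

```python
def hartigan_statistic(e1, e2, n, k):
    """(E1/E2 - 1) * (n - k - 1) for n curves, comparing k and k+1 groups."""
    return (e1 / e2 - 1.0) * (n - k - 1)

def add_extra_cluster(e1, e2, n, k, threshold=10.0):
    """True if the (k+1)-group solution is judged worth the extra group."""
    return hartigan_statistic(e1, e2, n, k) > threshold

# Illustrative values: 100 curves, going from 3 to 4 groups.
print(hartigan_statistic(500.0, 300.0, n=100, k=3))  # large drop in E
print(add_extra_cluster(500.0, 300.0, n=100, k=3))
print(add_extra_cluster(500.0, 480.0, n=100, k=3))   # small drop in E
```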

Page 16: Introduction to Time-Course Gene Expression Data

K-means: Distance Metric

• Euclidean Distance

• Pearson Correlation
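A minimal sketch of the two metrics follows. Turning the Pearson correlation r into a distance as 1 − r is an assumption (a common convention, not stated on the slide), and the function names are made up for the example. Curves with zero variance would cause a division by zero in the Pearson version.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two curves measured at the same time points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for mirrored shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def distance_matrix(curves, metric):
    """Full symmetric matrix of pairwise distances."""
    return [[metric(a, b) for b in curves] for a in curves]

# Same shape on very different scales: far apart in Euclidean terms,
# identical under the correlation-based distance.
low, high, mirrored = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [3.0, 2.0, 1.0]
print(euclidean(low, high), pearson_distance(low, high))
print(pearson_distance(low, mirrored))
```

The example shows why the choice matters: Euclidean distance separates curves by magnitude, while the correlation-based distance groups curves by shape regardless of scale.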

Page 17: Introduction to Time-Course Gene Expression Data

K-means: Starting Chains

• Initially, data are randomly assigned to k clusters but this choice of k cluster centers can have an effect on the final clustering

• The R implementation of K-means lets the user choose the “number of initial starting chains”; the run with the smallest total within-cluster sum of squares is the one given as output

Page 18: Introduction to Time-Course Gene Expression Data

K-Means: Starting Chains

For j = 1 to B

    Random assignment j → k clusters → wj = within-cluster sum of squares

End j

Pick clustering with min(wj)
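The loop over B starting chains can be sketched as below. This is an illustration under assumptions, not the R implementation: the helper names, the balanced random initial assignment, and the handling of emptied clusters are choices made for the example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centers_of(curves, labels, k, rng):
    centers = []
    for c in range(k):
        members = [x for x, lab in zip(curves, labels) if lab == c]
        if not members:  # assumption: reseed an emptied cluster
            members = [rng.choice(curves)]
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers

def one_run(curves, k, rng, max_iter=100):
    """One k-means run from a random assignment; returns (labels, w)."""
    labels = [i % k for i in range(len(curves))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centers = centers_of(curves, labels, k, rng)
        new = [min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in curves]
        if new == labels:
            break
        labels = new
    centers = centers_of(curves, labels, k, rng)
    w = sum(sq_dist(x, centers[lab]) for x, lab in zip(curves, labels))
    return labels, w

def kmeans_restarts(curves, k, B, seed=0):
    """Run B random starts and keep the clustering with the smallest w_j."""
    rng = random.Random(seed)
    runs = [one_run(curves, k, rng) for _ in range(B)]
    best_labels, best_w = min(runs, key=lambda run: run[1])
    return best_labels, best_w, [w for _, w in runs]

points = [[0.0], [0.2], [5.0], [5.2], [10.0], [10.2]]
best_labels, best_w, all_w = kmeans_restarts(points, k=3, B=20)
print(best_labels, best_w)
```

Individual starts can stop in different local optima, which is why the spread of `all_w` is worth inspecting before trusting any single run.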

Page 19: Introduction to Time-Course Gene Expression Data

Insert Initial starting chains

Page 20: Introduction to Time-Course Gene Expression Data

Hierarchical Clustering

• Hierarchical clustering is an addition or subtraction method.

• Initially each curve is assigned its own cluster

– The two closest clusters are joined into one branch to create a clustering tree

– The clustering tree stops when the algorithm terminates via a stopping rule

Page 21: Introduction to Time-Course Gene Expression Data

Hierarchical Clustering

• Nearest neighbor: Distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster

• Furthest neighbor: Distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster

• Average linkage: Distance between two clusters is the average of all distances between all pairs of elements, one from each cluster

Page 22: Introduction to Time-Course Gene Expression Data

Hierarchical Clustering

• Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold

– No universal stopping rule of thumb to find an optimal number of clusters using this algorithm.
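A minimal agglomerative sketch with the three linkage rules and a pre-determined number of clusters as the stopping rule. Euclidean distance between curves and all function names are assumptions for the illustration; the brute-force search is only suitable for small examples.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# How pairwise curve distances are combined into a cluster-to-cluster distance.
LINKAGES = {
    "single": min,                               # nearest neighbor
    "complete": max,                             # furthest neighbor
    "average": lambda ds: sum(ds) / len(ds),     # average linkage
}

def agglomerative(curves, n_clusters, linkage="average"):
    """Start with one cluster per curve; merge the two closest until n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(curves))]

    def cluster_dist(pair):
        a, b = pair
        return combine([euclidean(curves[i], curves[j])
                        for i in clusters[a] for j in clusters[b]])

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2), key=cluster_dist)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters

curves = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
for name in LINKAGES:
    print(name, agglomerative(curves, n_clusters=3, linkage=name))
```

On well-separated toy data the three linkages agree; on noisy expression curves they can produce quite different trees.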

Page 23: Introduction to Time-Course Gene Expression Data

Model-Based Clustering

Many use mixture models; splines or piecewise polynomial functions are used to approximate curves

Can better incorporate covariate information

Page 24: Introduction to Time-Course Gene Expression Data

Models using Splines

• Time course profiles are assumed to be observations from some underlying smooth expression curve

• Each data curve is represented as the sum of:

– Smooth population mean spline (dependent on time and cluster assignment)

– Spline function representing individual (gene) effects

– Gaussian measurement noise
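One way to write the three-part decomposition above as a formula. The notation is an assumption for illustration (gene i in cluster k, observed at time t_ij), not taken from the slides:

```latex
y_{ij} = \mu_{k}(t_{ij}) + b_{i}(t_{ij}) + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^{2})
```

Here \mu_k is the smooth mean spline of cluster k, b_i is the gene-specific spline effect, and \varepsilon_{ij} is the Gaussian measurement noise.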

Page 25: Introduction to Time-Course Gene Expression Data
Page 26: Introduction to Time-Course Gene Expression Data

SSCLUST software

Page 27: Introduction to Time-Course Gene Expression Data

Pan

Page 28: Introduction to Time-Course Gene Expression Data

Model based clustering and data transformations for gene expression data (2001)

Yeung et al., Bioinformatics, 17:977-987.

MCLUST software

Page 29: Introduction to Time-Course Gene Expression Data

Validation Methods

• L(C) is maximized log-likelihood for model with C clusters, m is the number of independent parameters to be estimated and n is the number of genes

• Strikes a balance between goodness-of-fit and model complexity

• The non-model-based methods have no such validation method

$\mathrm{BIC}(C) = 2L(C) - m_C \log(n)$
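A short sketch of model selection by BIC, assuming the criterion has the form BIC(C) = 2 L(C) − m log(n), so that larger values are preferred. The log-likelihoods and parameter counts below are made-up illustrative numbers, not results from any fitted model; only n = 433 echoes the yeast subset mentioned earlier.

```python
import math

def bic(max_loglik, m, n):
    """2 * maximized log-likelihood minus m * log(n); larger is better in this form."""
    return 2.0 * max_loglik - m * math.log(n)

def best_number_of_clusters(fits, n):
    """fits maps C -> (maximized log-likelihood, number of free parameters)."""
    return max(fits, key=lambda c: bic(fits[c][0], fits[c][1], n))

# Hypothetical fits for C = 2, 3, 4 clusters on n = 433 genes.
fits = {2: (-1200.0, 10), 3: (-1100.0, 15), 4: (-1095.0, 20)}
for c, (loglik, m) in fits.items():
    print(c, round(bic(loglik, m, 433), 1))
print("chosen C:", best_number_of_clusters(fits, 433))
```

Going from 3 to 4 clusters improves the log-likelihood only slightly, so the penalty term dominates and the 3-cluster model is chosen.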

Page 30: Introduction to Time-Course Gene Expression Data

Clustering Yeast Data using SSClust

Page 31: Introduction to Time-Course Gene Expression Data

Clustering Yeast Data in MCLUST

Page 32: Introduction to Time-Course Gene Expression Data

Comparison of Methods

• Ma et al (2006)

• Smoothing Spline Clustering (SSClust)

• Simulation study

• SSClust better than MCLUST & nonparametric

• Comparison: misclassification rates

Page 33: Introduction to Time-Course Gene Expression Data

Functional Form of Ma et al (2006) Simulation Cluster Centers

Page 34: Introduction to Time-Course Gene Expression Data

MR and OSR

• Misclassification Rate

• Overall Success Rate

– To calculate OSR the MR is only for the cases when the correct number of clusters is found

$\mathrm{MR} = \frac{\#\text{ of misclassified curves}}{\text{total } \#\text{ of curves}}$

$\mathrm{OSR} = (\%\ \text{correct } \#\text{ clusters found}) \times (1 - \mathrm{MR})$
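A minimal sketch of the two measures, assuming MR = (misclassified curves) / (total curves) and OSR = (% correct number of clusters found) × (1 − MR). Because cluster labels are arbitrary, the example matches found labels to true labels by trying every relabeling; that matching step and the function names are assumptions, and the brute-force search is only practical for a handful of clusters.

```python
from itertools import permutations

def misclassification_rate(true_labels, found_labels):
    """Smallest fraction of misclassified curves over all label matchings."""
    true_ids = sorted(set(true_labels))
    found_ids = sorted(set(found_labels))
    if len(true_ids) != len(found_ids):
        raise ValueError("MR is only computed when the number of clusters is correct")
    best = len(true_labels)
    for perm in permutations(true_ids):
        mapping = dict(zip(found_ids, perm))
        errors = sum(mapping[f] != t for t, f in zip(true_labels, found_labels))
        best = min(best, errors)
    return best / len(true_labels)

def overall_success_rate(pct_correct_k, mr):
    """OSR in percent: share of runs with the right k, discounted by MR."""
    return pct_correct_k * (1.0 - mr)

true_labels = [0, 0, 0, 1, 1, 1]
found_labels = [1, 1, 0, 0, 0, 0]   # same grouping up to relabeling, one curve misplaced
mr = misclassification_rate(true_labels, found_labels)
print(mr, overall_success_rate(80.0, mr))
```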

Page 35: Introduction to Time-Course Gene Expression Data

Comparison of Methods

• From Ma et al. (2006) paper.

Clustering Method   Distance Metric   MR (%)   Correct # of Clusters (%)   OSR (%)
K-means             Euclidean         9.73     N/A                         N/A
K-means             Pearson           2.64     N/A                         N/A
MCLUST              N/A               0.38     77                          69.5
SSClust             N/A               0.13     100                         98.7

Page 36: Introduction to Time-Course Gene Expression Data

SSClust Methods Paper

• Concluded that SSClust was the superior clustering method

• Looking at the data, the differences in scale between the four true curves are large

– Typical time course clusters differ in location and spread but not in scale to this extreme

– Their conclusions are based on a data set which is not representative of the type of data this clustering method would be used for

Page 37: Introduction to Time-Course Gene Expression Data

Alternative Simulation

Functional Form for Five Cluster Centers

Page 38: Introduction to Time-Course Gene Expression Data

Example of SSClust Breaking Down

Linear curves joined while sine curves arbitrarily split into 2 clusters

Page 39: Introduction to Time-Course Gene Expression Data

Simulation Configuration

• Distance Metric

– Euclidean or Pearson

• # of Curves

– Small (100), Large (3000)

• Resolution of Time Points

– 13 or 25 time points

– evenly spaced or unevenly spaced

• Types of Underlying Curves

– Small (4)

– Large (8)

Page 40: Introduction to Time-Course Gene Expression Data

Simulation Configuration

• Distribution of curves across clusters

– Equally distributed versus unequally distributed

• Noise Level

– Small (< 0.5*SD of the data set)

– Large (> 0.5*SD of the data set)

• For these cases, found the misclassification rates and the percent of times that the correct number of clusters was found

Page 41: Introduction to Time-Course Gene Expression Data

Function Forms of 7 Cluster Centers

Page 42: Introduction to Time-Course Gene Expression Data

Simulation Analysis

Page 43: Introduction to Time-Course Gene Expression Data

Conclusions from Simulations

• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters

• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters

Page 44: Introduction to Time-Course Gene Expression Data

Effect of Number of Profiles on OSR

Page 45: Introduction to Time-Course Gene Expression Data

Comparison based on Real Data

• Applied these same clustering techniques to real data

• Different numbers of clusters found for different methods for each real data set

Page 46: Introduction to Time-Course Gene Expression Data

Yeast Data

Page 47: Introduction to Time-Course Gene Expression Data

Human Fibroblast Data

Page 48: Introduction to Time-Course Gene Expression Data

Simulations Based on Real Data

– Start with real data, like the yeast data set

– Cluster the results using a given clustering method

– Perturb the original data (add noise at each point)

– Evaluate how different the new clustering is in comparison to the original clustering

• Use MR and OSR
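The perturbation scheme can be sketched as follows. Everything here is an illustrative assumption: Gaussian noise with a fixed standard deviation, a toy threshold clusterer standing in for the real methods, and label matching by brute-force relabeling.

```python
import random
from itertools import permutations

def disagreement(labels_a, labels_b):
    """Fraction of curves clustered differently, or None if cluster counts differ."""
    ids_a, ids_b = sorted(set(labels_a)), sorted(set(labels_b))
    if len(ids_a) != len(ids_b):
        return None
    best = len(labels_a)
    for perm in permutations(ids_a):
        mapping = dict(zip(ids_b, perm))
        best = min(best, sum(mapping[b] != a for a, b in zip(labels_a, labels_b)))
    return best / len(labels_a)

def perturb(curves, sd, rng):
    """Add independent Gaussian noise at each time point of each curve."""
    return [[x + rng.gauss(0.0, sd) for x in curve] for curve in curves]

def stability(curves, cluster_fn, sd, n_rep=20, seed=0):
    """Returns (mean MR over runs with the same cluster count, % of such runs)."""
    rng = random.Random(seed)
    original = cluster_fn(curves)
    rates = []
    for _ in range(n_rep):
        rate = disagreement(original, cluster_fn(perturb(curves, sd, rng)))
        if rate is not None:
            rates.append(rate)
    pct_same_k = 100.0 * len(rates) / n_rep
    mean_mr = sum(rates) / len(rates) if rates else float("nan")
    return mean_mr, pct_same_k

def toy_cluster(curves):
    """Hypothetical stand-in clusterer: split curves by their mean level."""
    return [int(sum(c) / len(c) > 2.5) for c in curves]

curves = [[0.0, 0.0, 0.0], [0.2, 0.1, 0.0], [5.0, 5.0, 5.0], [5.1, 4.9, 5.0]]
print(stability(curves, toy_cluster, sd=0.0))
print(stability(curves, toy_cluster, sd=3.0))
```

Any clustering routine that maps a list of curves to a list of labels can be passed in place of `toy_cluster`.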

Page 49: Introduction to Time-Course Gene Expression Data

Simulations Based on Yeast Data

Page 50: Introduction to Time-Course Gene Expression Data

Simulations Based on HF Data

Page 51: Introduction to Time-Course Gene Expression Data

Conclusions from these Simulations

• SSClust better than MCLUST and K-means

– This was in contrast to the prior simulations where MCLUST was best

Page 52: Introduction to Time-Course Gene Expression Data

Gene Ontology

• So far I’ve described my work analyzing and comparing clustering results on gene expression data

• Some, like Pan (2006), have argued that clustering methods, even newer model-based clustering methods, are incomplete because they ignore gene function and other biological aspects in the clustering

Page 53: Introduction to Time-Course Gene Expression Data

Gene Ontology

• Expectation is that incorporating biological data with the expression data will lead to better clustering

Page 54: Introduction to Time-Course Gene Expression Data

Gene Ontology

• Gene Ontology project (Ashburner et al. 2000) provides a structured vocabulary to describe genes and gene products in organisms

• Three ontologies developed

– Biological Processes (e.g……)

– Molecular Function (e.g……)

– Cellular Component (e.g……)

Page 55: Introduction to Time-Course Gene Expression Data

Annotations

• Gene Ontology annotations are associations made between gene products and the GO terms describing them

• A directed acyclic graph for a gene from the HF data set using GO molecular function annotation is to the right

Page 56: Introduction to Time-Course Gene Expression Data

Clustering using GO Data

• First, need a distance metric

• Two metrics used are the Union-Intersection distance and the longest path distance, both developed in Gentleman (2005) and extended by Christian (2007)

• I used the Union-Intersection distance in my clustering

Page 57: Introduction to Time-Course Gene Expression Data

GO Distances

• The union-intersection distance is defined as

• Show example using two DAGs

– Min = 0 when two DAGs are identical,

– Max = 1 when two DAGs have nothing in common
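The slide's formula is not reproduced in this transcript. One form consistent with the stated bounds is 1 − |A ∩ B| / |A ∪ B|, where A and B are the node sets of the two genes' induced GO graphs; the sketch below assumes that form. The toy DAG, term names, and function names are hypothetical.

```python
def induced_nodes(terms, parents):
    """The annotated terms plus all of their ancestors in the DAG."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen

def ui_distance(terms_a, terms_b, parents):
    """Assumed union-intersection distance: 1 - |A & B| / |A | B| on induced node sets."""
    a = induced_nodes(terms_a, parents)
    b = induced_nodes(terms_b, parents)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical toy ontology given as child -> list of parents.
parents = {"B": ["root"], "C": ["root"], "D": ["B"], "E": ["B", "C"]}

print(ui_distance({"D"}, {"D"}, parents))   # identical DAGs
print(ui_distance({"D"}, {"E"}, parents))   # share B and root out of 5 nodes
print(ui_distance({"x"}, {"y"}, {}))        # nothing in common
```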

Page 58: Introduction to Time-Course Gene Expression Data

Showing UI Distance

Page 59: Introduction to Time-Course Gene Expression Data

Clustering Using All Data

• It is an open question how to cluster genes using both time-course expression data and gene ontology data together

• Two of the methods I used are from Boratyn et al (2007) and from Fang et al (2006)

Page 60: Introduction to Time-Course Gene Expression Data

Boratyn et al (2007) Method

• Clusters are based on adding individually scaled distance matrices

– Take the distance matrix from expression clustering and the distance matrix from gene ontology clustering

– Put them on the same scale [0,1]

– Add the scaled distance matrices together

– Cluster using this new distance matrix which captures differences in expression profiles and gene function
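The scale-and-add steps can be sketched as below. How each matrix is put on [0, 1] is not specified on the slide; dividing by the largest entry is an assumption made here, and the matrices are made-up toy values.

```python
def scale01(matrix):
    """Rescale a distance matrix to [0, 1] by dividing by its largest entry (assumption)."""
    top = max(max(row) for row in matrix)
    if top == 0:
        return [list(row) for row in matrix]
    return [[value / top for value in row] for row in matrix]

def combined_distance(expr_dist, go_dist):
    """Sum of the two individually scaled distance matrices (entries in [0, 2])."""
    a, b = scale01(expr_dist), scale01(go_dist)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Toy 3-gene example: expression distances and GO distances on different scales.
expr_dist = [[0.0, 2.0, 4.0], [2.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
go_dist = [[0.0, 0.5, 0.25], [0.5, 0.0, 0.5], [0.25, 0.5, 0.0]]
for row in combined_distance(expr_dist, go_dist):
    print(row)
```

The combined matrix can then be handed to any distance-based clustering routine, such as hierarchical clustering.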

Page 61: Introduction to Time-Course Gene Expression Data

Yeast: 12 Clusters on Combined Distance Metric

Page 62: Introduction to Time-Course Gene Expression Data

Fang et al (2006) Method

• In this method,

– Gene ontology is a guide for clustering the expression profiles

– Biological process is the GO annotation used

– Uses the mean squared residual score to assess the expression correlation of genes within a cluster from the clustering by GO data.

Page 63: Introduction to Time-Course Gene Expression Data
Page 64: Introduction to Time-Course Gene Expression Data

Effect of the Choice of Ontologies

• Examined effect of the choice of which ontology to use in my clustering between BP, CC, and MF.

• Fang et al (2006) uses BP in their method as it has tended to be most closely correlated with gene function among the 3 ontologies

Page 65: Introduction to Time-Course Gene Expression Data

Effect of Choice of Ontology

Page 66: Introduction to Time-Course Gene Expression Data

Conclusions from GO Chapter

• Clustering using expression and ontology data together produced expression clusters as good as or better than clustering expression data alone, with the added bonus of a biological basis that filters out potentially nonsensical clusterings

Page 67: Introduction to Time-Course Gene Expression Data

Conclusions from Paper as a Whole

• No single model-based or non-model-based expression clustering method is uniformly the “best clustering method” in all cases

– But, methods are robust in terms of data apportionment per cluster and the number of curves per data set (important for massive gene data banks)

• Clustering using expression and GO data together improves upon expression clustering and again methods vary in complexity, performance, and ease of use

Page 68: Introduction to Time-Course Gene Expression Data

Further Extensions

• GO analysis was all done using K-means and hierarchical clustering

– Extend GO clustering to model-based clustering techniques like MCLUST and SSClust (currently, GO data can be used as initial conditions in these models but not as some notion of prior model parameters)

Page 69: Introduction to Time-Course Gene Expression Data

P. falciparum: Examination of Correlation Between Spatial Location and Temporal Expression of Genes

Page 70: Introduction to Time-Course Gene Expression Data

Motivations:

• Evidence for correlation in literature

– Printing artifact

– Biological

• Develop a visualization and statistical testing methodology

Page 71: Introduction to Time-Course Gene Expression Data

Biological Motivations

[Diagrams of three regulatory arrangements]

• Operon control (bacteria): promoter, ORF1, ORF2; mRNA

• Upstream Activating Sequences (yeast): UAS1, UAS2; ORF1, ORF2; mRNAs

• Locus Control Region (mammalian globin cluster): LCR1; ORF1, ORF2; mRNAs

Page 72: Introduction to Time-Course Gene Expression Data

Hypothesis and Statistic

• Statistical: Correlation between chromosomal location and gene expression?

• Biological: Gene order random?

• H0: no correlation between location on chromosome and expression

• Consider correlations in partitions

Page 73: Introduction to Time-Course Gene Expression Data

Approach

Covariogram: General Tool

Partition Chromosome, Develop Statistic

Permutation Testing Framework

Check for Confounding Factors

Biological Significance

Page 74: Introduction to Time-Course Gene Expression Data

Issues

• Confounding (printing) or other artifacts

• Account for inter-gene distances (as opposed to adjacent pairwise correlation)

• Significance of correlation

operon

Page 75: Introduction to Time-Course Gene Expression Data

Methods: Data

• Need gene information (plasmodb.org has annotated FASTA files):

TCAAGCAATTGTTAGATGAGAACAATAGGAAGAATTTAAATTTTAATGATCTGGTTATACACCCTTGGTGGTCTTATAAGAATTAA

>Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) hypothetical protein Location=join(124752..124823,124961..125719)

ATGATATTTCATAAATGCTTTAAAATTTGTTCGCTCTCTTGTACTGTTTTATGGGTTACCGCCATATCATCGATCATTCAACCAGACAAACAACAAGAAA

• Normalized gpr files (2-D loess, centered and scaled)
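Gene name and chromosomal position can be pulled out of a FASTA header like the one shown above. This is a sketch that assumes the pipe-separated layout and the `Location=` field seen in that single example; other header variants may need different handling, and the function name is made up.

```python
import re

def parse_header(header):
    """Extract chromosome, gene name, and overall start/end from one FASTA header."""
    fields = header.lstrip(">").split("|")
    location = header.split("Location=", 1)[1]
    coords = [int(x) for x in re.findall(r"\d+", location)]
    start, end = min(coords), max(coords)
    return {
        "chromosome": fields[1],
        "gene": fields[2],
        "start": start,
        "end": end,
        "midpoint": (start + end) / 2,  # one possible summary of chromosomal location
    }

header = (">Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) "
          "hypothetical protein Location=join(124752..124823,124961..125719)")
info = parse_header(header)
print(info)
```

The overall span recovered here, 124752 to 125719, covers both exons listed in the `join(...)` field.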

Page 76: Introduction to Time-Course Gene Expression Data

Methods: Data

• FASTA sequence: 5400 predicted genes (e.g., PFA0135w, 124752:125719 bp)

• QC microarray: 3800 genes, 5100 probes (e.g., PFA0135w, probe a16122_1, t1, t2, …, t48)

• Intersection: 3500 genes with common gene name (e.g., PFA0135w, 124752:125719 bp, probe a16122_1, t1, t2, …, t48)

Page 77: Introduction to Time-Course Gene Expression Data

Methods: Covariograms

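$\rho(x, y; d_a, d_b) = \mathrm{Ave}[\rho(x, y) \mid d_a \le \mathrm{dist}(x, y) < d_b]$

• Covariogram 1: distance is chromosomal location:

$d(g_i, g_j) = |g_{i,\mathrm{midpt(chr\ loc)}} - g_{j,\mathrm{midpt(chr\ loc)}}|$

• Covariogram 2: distance is printed microarray location:

$d(g_i, g_j) = \sqrt{(g_{i,x} - g_{j,x})^2 + (g_{i,y} - g_{j,y})^2}$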
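A covariogram of this kind can be sketched as the average pairwise correlation of expression profiles, binned by the distance between genes. The half-open bins, the use of Pearson correlation, and all names are assumptions for the illustration. The `dist` argument defaults to the one-dimensional chromosomal case and can be swapped for a two-dimensional Euclidean distance for the printed-array case.

```python
import math
from itertools import combinations

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def covariogram(profiles, positions, bin_edges, dist=lambda p, q: abs(p - q)):
    """Average pairwise correlation for gene pairs whose distance falls in each bin.

    Returns one (lower, upper, average correlation or None, pair count) per bin.
    """
    n_bins = len(bin_edges) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i, j in combinations(range(len(profiles)), 2):
        d = dist(positions[i], positions[j])
        for b in range(n_bins):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                sums[b] += pearson(profiles[i], profiles[j])
                counts[b] += 1
                break
    return [(bin_edges[b], bin_edges[b + 1],
             sums[b] / counts[b] if counts[b] else None, counts[b])
            for b in range(n_bins)]

# Toy example: two neighboring genes move together, a distant gene moves oppositely.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
positions = [0, 10, 100]          # e.g. gene midpoints in kb
result = covariogram(profiles, positions, bin_edges=[0, 50, 150])
print(result)
```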

Page 78: Introduction to Time-Course Gene Expression Data

Chr 10: Covariogram 2

Chr 10: Covariogram 1

Chr 6: Covariogram 1

Chr 6: Covariogram 2

Page 79: Introduction to Time-Course Gene Expression Data

Methods: Partitioning

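• Partition

• Avg of all pairwise Pearson correlations

3 genes, $\binom{3}{2} = 3$ pairwise correlations: $\bar{r} = \frac{1}{3}\sum_{i=1}^{3} r_i$

7 genes, $\binom{7}{2} = 21$ pairwise correlations: $\bar{r} = \frac{1}{21}\sum_{i=1}^{21} r_i$

[Figure: chromosome segment with axis marks at 0 kb, 60 kb, and 120 kb]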

Page 80: Introduction to Time-Course Gene Expression Data

Methods: Partitioning

• Chr 6, 40 kb partition

• Significant?

Page 81: Introduction to Time-Course Gene Expression Data

Methods: Permutation Test

• $\bar{r} = 0.50$ in a 40 kb interval on chr 6

• Permutation test

• Null distribution

• Estimated p-values

[Diagram: observed genes g1–g4 with expression profiles e1–e4; in Perm(1), Perm(2), …, Perm(n) the profiles e1–e4 are reassigned among the genes]
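A permutation test for the average pairwise correlation in one interval can be sketched as follows. The details are assumptions for illustration: expression profiles are shuffled across all gene positions on the chromosome, the test is one-sided, and the p-value uses the common (hits + 1) / (n_perm + 1) estimate.

```python
import math
import random
from itertools import combinations

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mean_pairwise_corr(profiles):
    """Average of all pairwise Pearson correlations among the given profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two genes in the interval")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def permutation_pvalue(all_profiles, interval_idx, n_perm=1000, seed=0):
    """Observed r-bar for the interval and its estimated one-sided p-value."""
    rng = random.Random(seed)
    observed = mean_pairwise_corr([all_profiles[i] for i in interval_idx])
    pool = list(all_profiles)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)  # reassign expression profiles to gene positions
        null_value = mean_pairwise_corr([pool[i] for i in interval_idx])
        if null_value >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy chromosome: the first three genes (one interval) are perfectly co-expressed.
all_profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 2, 3],
                [4, 3, 2, 1], [1, 3, 2, 4], [2, 1, 4, 3],
                [3, 1, 4, 2], [4, 2, 3, 1]]
observed, p_value = permutation_pvalue(all_profiles, interval_idx=[0, 1, 2])
print(observed, p_value)
```

Shuffling only within the interval would leave the average pairwise correlation unchanged, which is why the sketch permutes across the whole set of genes.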

Page 82: Introduction to Time-Course Gene Expression Data

Methods: Permutation Test

• Distribution of $\bar{r}$ in 40 kb interval

[Figure: permutation distribution] $\bar{r}_{obs} = 0.57$, $n_{genes} = 2$, p-val = 0.22

Page 83: Introduction to Time-Course Gene Expression Data

Methods: Permutation Test

• Distribution of $\bar{r}$ in 40 kb interval

[Figure: permutation distribution] $\bar{r}_{obs} = 0.72$, $n_{genes} = 6$, p-val = 0.001

Page 84: Introduction to Time-Course Gene Expression Data

Methods: Permutation Test

• Distribution of $\bar{r}$ in 40 kb interval

[Figure: permutation distribution] $\bar{r}_{obs} = 0.49$, $n_{genes} = 9$, p-val = 0.002

Page 85: Introduction to Time-Course Gene Expression Data

Methods: Permutation Test

• Distribution of $\bar{r}$ in 40 kb interval

[Figure: permutation distribution] $\bar{r}_{obs} = 0.018$, $n_{genes} = 12$, p-val = 0.475

Page 86: Introduction to Time-Course Gene Expression Data

Significant Intervals (Chr 7)

[Figure panels: 10kb, 20kb, 40kb, 60kb, 80kb, 100kb]

Page 87: Introduction to Time-Course Gene Expression Data

Significant Intervals (Chr 7)

[Figure panels: 10kb, 20kb, 40kb, 60kb, 80kb, 100kb]

Page 88: Introduction to Time-Course Gene Expression Data

Significant Intervals (Chr 7)

[Figure panels: 10kb, 20kb, 40kb, 60kb, 80kb, 100kb]

Page 89: Introduction to Time-Course Gene Expression Data

[Figure panels: 10kb, 20kb, 40kb, 60kb, 80kb, 100kb]

Page 90: Introduction to Time-Course Gene Expression Data

MAL6P1.257: hypothetical protein

MAL6P1.258: malate:quinone oxidoreductase

MAL6P1.259: hypothetical protein

MAL6P1.260: hypothetical protein

MAL6P1.263: hypothetical protein

MAL6P1.265: pyridoxine kinase

MAL6P1.266: hypothetical protein

MAL6P1.267: hypothetical protein

MAL6P1.268: hypothetical protein

MAL6P1.271: cdc2-like protein kinase

MAL6P1.272: ribonuclease

MAL6P1.273: hypothetical protein

Page 91: Introduction to Time-Course Gene Expression Data

Results: Summary Table

Chromosome   10kb     60kb    100kb   10kb in 60kb
Chr 3        3/400    0/68    0/40    0
Chr 4        10/476   5/80    2/48    4
Chr 5        6/528    1/88    3/56    0
Chr 14       4/1304   2/220   1/132   0

Page 92: Introduction to Time-Course Gene Expression Data

Conclusions

• Statistical: Significance for both small regions of strong correlation and large regions of weak correlation

• Biological: Evidence for regulation at multiple levels