The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Post on 19-Dec-2015


Page 1: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

The Statistical Significance of Max-gap Clusters

Rose Hoberman, David Sankoff, Dannie Durand

Page 3: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Gene Clustering for Functional Inference in Bacterial Genomes

The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

Page 4: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

[Figure labels: original genome; large scale duplication or speciation event; gene content and order are preserved; rearrangement, mutation; similarity in gene content; neither content nor order is strictly preserved]

Page 7: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

“Evolution of gene order conservation in prokaryotes”

Tamames, Genome Biology 2, 2001

Gene insertion/loss

Local rearrangement

Page 9: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Two Possible Questions

1. Reference set scenario: given a set of genes that we believe are functionally related, determine whether they cluster together spatially more than we would expect by chance

2. Whole genome comparison: identify all significantly conserved gene clusters as a starting point for making functional inferences

Page 12: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Reference Set Scenario

• Model of a genome
– G = 1, …, n; an ordered set of n unique genes
– assume genes do not overlap
– chromosome breaks ignored

• Reference gene scenario
– m genes of interest (in red) are pre-specified
– want to find clusters of (a subset of) these genes

Page 13: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Whole Genome Scenario

Given: two genomes, G = 1, …, n and H = 1, …, n

Find: all significant clusters of at least k homologs in close proximity in both genomes

Page 14: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Outline

• What formalisms do we need to address these questions?
– Definitions: formulate a cluster definition
– Algorithms: identify clusters in real data
– Statistics: assess the significance of one or more clusters
• Reference set scenario
• Whole genome comparison
• Conclusion

Page 15: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Why develop a formal statistical model?

• Understand trends and verify that they match our expectations

• Choose parameters effectively

• Statistical tests for data analysis

Typically researchers use randomization tests to estimate statistical significance

Page 16: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Cluster Definitions

• An intuitive notion of a cluster is a group of genes
– occurring in close proximity
– neither gene content nor order is strictly conserved

• Algorithms and statistics require a formal definition.
– What properties are desirable?
– Do existing definitions have these properties?

Page 17: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Possible Cluster Parameters

– size: number of red genes in the cluster
• Example: cluster size ≥ 3 (figure: size = 3 genes)

– length: number of genes between first and last red genes
• Example: cluster length ≤ 6 (figure: length = 6)

– density: proportion of red genes (size/length)
• Example: density ≥ 0.5 (figure: density = 6/11)

– compactness: maximum gap between adjacent red genes
• Example: gap ≤ 4 genes
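The four parameters above can be computed directly from the positions of the red genes. The following is a minimal sketch, not taken from the slides: the function name and the example positions are hypothetical, and it assumes that length is the inclusive span from the first to the last red gene and that a gap is the number of non-red genes between two adjacent red genes.

```python
def cluster_parameters(red_positions):
    """Return size, length, density and maximum gap for one candidate cluster.

    red_positions: 1-based genome positions of the red (reference) genes.
    Assumptions (not fixed by the slides): length is the inclusive span from
    the first to the last red gene; a gap counts the non-red genes between
    two adjacent red genes.
    """
    pos = sorted(red_positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    return {
        "size": size,
        "length": length,
        "density": size / length,
        "max_gap": max(gaps, default=0),
    }


# Hypothetical layout: six red genes spread over eleven positions.
params = cluster_parameters([3, 4, 6, 8, 9, 13])
print(params)
```

With these made-up positions the density is 6/11, the same ratio as in the slide's figure label, and the largest gap is three genes.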

Page 23: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Max-Gap Cluster

• Commonly used in analysis of genomic data

• Desirable properties
– Ensures minimum local density
– Extensible: doesn’t artificially limit cluster length
– Disjoint: clusters will not overlap

gap g
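For a single genome and a fixed reference set, the max-gap definition can be sketched as a single pass over the sorted positions. This is an illustrative sketch only: the function name is hypothetical, and it assumes a gap is the number of non-reference genes between two adjacent reference genes.

```python
def max_gap_clusters(red_positions, g):
    """Partition reference-gene positions into max-gap clusters.

    Two adjacent reference genes stay in the same cluster when at most g
    other genes lie between them (assumed gap convention).
    """
    pos = sorted(red_positions)
    if not pos:
        return []
    clusters = [[pos[0]]]
    for p in pos[1:]:
        if p - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(p)   # gap small enough: extend the cluster
        else:
            clusters.append([p])     # gap too large: start a new cluster
    return clusters


clusters = max_gap_clusters([2, 3, 6, 12, 13, 15], g=2)
print(clusters)
```

Because each position is assigned to exactly one run, the resulting clusters are disjoint, and a run can grow to any length as long as every gap stays within g.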

Page 24: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Outline

• Formalisms

• Reference set scenario

• Whole genome comparison

• Conclusion

Page 25: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Formalisms

• Definitions: formulate a cluster definition

• Algorithms: identify clusters in real data

• Statistics: assess the significance of a cluster

Page 26: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

A Statistical Model

• Given
– a genome: G = 1, …, n unique genes
– a set of m reference genes
– a maximum-gap size g

• Null hypothesis
– Random gene order

• Alternate hypotheses
– Evolutionary history
– Functional selection

Page 27: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Statistics of Max-Gap Gene Clusters

• We provide
– analytical and dynamic programming solutions
– to determine cluster significance exactly
– for the reference set scenario

Hoberman, Sankoff and Durand. In "Proceedings of the RECOMB Satellite Workshop on Comparative Genomics", J. Lagergren, ed., Lecture Notes in Bioinformatics, Springer Verlag, in press.

Hoberman, Sankoff and Durand. Submitted to RECOMB 2005.

Page 28: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Test Statistic: Complete Clusters

The probability of observing all m reference genes in a max-gap cluster in G
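One way to make this test statistic concrete is to count placements directly. The sketch below is not the authors' published solution. It is a small counting routine written for illustration, under the assumption that the null hypothesis places the m reference genes on a uniformly random set of m positions among n, and that a gap is the number of genes between two adjacent reference genes. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_complete_cluster(n, m, g):
    """Exact P(all m reference genes form one max-gap cluster) by counting."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    if m == 1:
        return Fraction(1)
    max_span = n - 1
    # ways[s]: number of ways to choose the steps between adjacent reference
    # genes (each step is 1..g+1 positions) so that the steps sum to s.
    ways = [0] * (max_span + 1)
    ways[0] = 1
    for _ in range(m - 1):
        new = [0] * (max_span + 1)
        for s, count in enumerate(ways):
            if count:
                for step in range(1, g + 2):
                    if s + step > max_span:
                        break
                    new[s + step] += count
        ways = new
    # A cluster spanning s steps can start at any of n - s positions.
    favourable = sum(count * (n - s) for s, count in enumerate(ways))
    return Fraction(favourable, comb(n, m))


print(prob_complete_cluster(5, 3, 0))
```

Exact fractions are used so that very small probabilities are not lost to floating-point rounding.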

Page 29: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Test Statistic: Incomplete Clusters

The probability of observing at least h of the m reference genes in a max-gap cluster in G
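For very small genomes, this statistic can be checked by brute force. The sketch below enumerates every placement of the m reference genes and is only practical for toy values of n. It is an illustration of the definition, not the dynamic programming solution referred to in the talk. The function names are hypothetical, and the same gap convention is assumed (genes between adjacent reference genes).

```python
from fractions import Fraction
from itertools import combinations


def largest_max_gap_cluster(positions, g):
    """Size of the largest max-gap cluster among the given positions."""
    pos = sorted(positions)
    best = run = 1
    for a, b in zip(pos, pos[1:]):
        run = run + 1 if b - a - 1 <= g else 1
        best = max(best, run)
    return best


def prob_cluster_at_least_h(n, m, g, h):
    """Exact P(some max-gap cluster holds >= h of the m reference genes).

    Enumerates all C(n, m) placements, so use only for small n.
    """
    hits = total = 0
    for placement in combinations(range(1, n + 1), m):
        total += 1
        hits += largest_max_gap_cluster(placement, g) >= h
    return Fraction(hits, total)


print(prob_cluster_at_least_h(5, 3, 0, 2))
```

Setting h = m recovers the complete-cluster statistic of the previous slide.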

Page 30: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Cluster significance

• n = number of genes in each genome
• m = number of genes shared between the two genomes
• g = maximum allowed gap size
• h = size of cluster (e.g. number of red genes)

[Plot panels: n = 1000, m = 50; n = 500, h = m/2]

Page 31: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Significant Parameter Values (α = 0.0001)

n = 500

Page 33: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Outline

• Formalisms

• Reference set scenario

• Whole genome comparison

• Conclusion

Page 34: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Formalisms

• Definitions: formulate a cluster definition

• Algorithms: identify clusters in real data

• Statistics: assess the significance of one or more clusters

Page 35: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Whole genome comparison

Find all sets of genes that form max-gap clusters in both genomes.

g 10

g 10

Page 36: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Properties of Max-Gap Clusters for Whole Genome Comparison

• Clusters are locally dense in both genomes

• Clusters are still guaranteed to be disjoint.

• The definition is symmetric with respect to genome

Most existing cluster algorithms are not symmetric!

Page 37: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Algorithms: Finding Max-Gap Clusters

If g = 2
• There is no valid max-gap cluster of size two or three
• There is a valid max-gap cluster of size four

Page 38: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Algorithms: Finding Max-Gap Clusters

• A consequence of this is that a greedy iterative approach will not find all max-gap clusters
– Specifically, larger clusters that don’t contain smaller ones will not be found

Page 39: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Algorithms: Finding Max-Gap Clusters

There is an efficient divide-and-conquer algorithm to find all max-gap clusters (Bergeron et al., 2002)

Since algorithms are generally not stated formally in application papers, we don’t know whether people are actually getting what they think they’re getting
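To make the search problem concrete, here is a naive top-down refinement sketch. It is not the efficient algorithm cited above and makes no claim to match its running time. It starts from the set of shared genes and keeps splitting at any gap larger than g, in either genome, until every remaining set is a single run in both. The function names are hypothetical, and the sketch assumes each gene occurs once per genome and that a gap counts the genes lying between two members of the set.

```python
def split_runs(genes, pos, g):
    """Split genes into runs whose adjacent members are at most g genes apart."""
    ordered = sorted(genes, key=pos.__getitem__)
    runs = [[ordered[0]]]
    for gene in ordered[1:]:
        if pos[gene] - pos[runs[-1][-1]] - 1 <= g:
            runs[-1].append(gene)
        else:
            runs.append([gene])
    return runs


def max_gap_clusters_two_genomes(G, H, g):
    """Maximal gene sets that form a max-gap cluster in both G and H."""
    pos_g = {gene: i for i, gene in enumerate(G)}
    pos_h = {gene: i for i, gene in enumerate(H)}
    shared = [gene for gene in G if gene in pos_h]
    stack = [shared] if shared else []
    result = []
    while stack:
        candidate = stack.pop()
        for pos in (pos_g, pos_h):
            runs = split_runs(candidate, pos, g)
            if len(runs) > 1:        # a gap > g in this genome: refine further
                stack.extend(runs)
                break
        else:                        # single run in both genomes: keep it
            result.append(frozenset(candidate))
    return result


teams = max_gap_clusters_two_genomes([1, 2, 3, 4, 5, 6], [2, 1, 7, 8, 5, 6], g=0)
print(teams)
```

Splitting from the top down, rather than growing clusters greedily from small seeds, avoids missing a large cluster that contains no smaller one.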

Page 40: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Formalisms

• Definitions: formulate a cluster definition

• Algorithms: identify clusters in real data

• Statistics: assess the significance of one or more clusters

Work in Progress…

Page 42: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Statistics: Whole genome comparison

What is the probability that at least k genes form a max-gap cluster in both genomes?

Assuming identical gene content, the probability of finding a max-gap cluster of size at least k is always one!

g 10

g 10

Page 45: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

An Example

Example: g = 1

A cluster of size k does not necessarily contain a cluster of size k-1
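This property can be checked by brute force on a toy pair of genomes. The sketch below does not reproduce the slide's figure. It uses a made-up four-gene example with g = 0, chosen because it is small enough to verify by hand, and hypothetical function names. Here a set of genes counts as a cluster when its members are at most g genes apart in both genomes.

```python
from itertools import combinations


def is_max_gap_cluster(genes, pos, g):
    """True if adjacent members of the set are separated by at most g genes."""
    p = sorted(pos[x] for x in genes)
    return all(b - a - 1 <= g for a, b in zip(p, p[1:]))


def clusters_by_size(G, H, g):
    """For each size k, list the k-gene sets that are clusters in both genomes."""
    pos_g = {gene: i for i, gene in enumerate(G)}
    pos_h = {gene: i for i, gene in enumerate(H)}
    shared = sorted(set(G) & set(H))
    return {
        k: [set(c) for c in combinations(shared, k)
            if is_max_gap_cluster(c, pos_g, g) and is_max_gap_cluster(c, pos_h, g)]
        for k in range(1, len(shared) + 1)
    }


# Made-up example: no pair or triple is adjacent in both orders.
found = clusters_by_size([1, 2, 3, 4], [2, 4, 1, 3], g=0)
sizes = {k: len(v) for k, v in found.items()}
print(sizes)
```

In this example the only clusters are the four single genes and the full set of four, so the cluster of size four contains no cluster of size two or three.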

Page 49: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

An Example

Example: g = 1

• When gene content is identical, there will always be a cluster of size n

• Therefore, for all k, there will always be a cluster of size at least k

• Therefore, the probability of finding a cluster of size at least k is always one!

Page 50: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Relaxing the Assumption of Identical Gene Content

• Assume only m of the n genes in each genome are shared

• If the longest run of “non-shared” genes is less than g then we are still guaranteed to find a complete cluster
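The guarantee can be illustrated with a direct check. In the sketch below the function names and the toy genomes are hypothetical. It tests whether every run of non-shared genes lying between two shared genes has length at most g in both genomes, which is the condition under which all shared genes form one max-gap cluster. The bullet's condition (runs shorter than g) is a sufficient special case of this test.

```python
def longest_internal_unshared_run(genome, shared):
    """Longest run of non-shared genes lying between two shared genes."""
    positions = [i for i, gene in enumerate(genome) if gene in shared]
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def shared_genes_form_complete_cluster(G, H, g):
    """True if all shared genes form a single max-gap cluster in both genomes."""
    shared = set(G) & set(H)
    return all(longest_internal_unshared_run(x, shared) <= g for x in (G, H))


# Toy genomes: a, b, c are shared; x, y, z are not.
G = ["a", "x", "b", "c"]
H = ["c", "a", "y", "z", "b"]
print(shared_genes_form_complete_cluster(G, H, g=1))
print(shared_genes_form_complete_cluster(G, H, g=2))
```

Runs of non-shared genes before the first or after the last shared gene are ignored, since they do not separate two shared genes.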

Page 51: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

More generally…

Simulations of randomly ordered genomes show that large clusters may be very likely to occur merely by chance
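A simulation of this kind might be sketched as follows. This is not the authors' simulation code. It assumes both genomes are ordered uniformly at random, so that the m shared genes occupy an independent random set of m positions in each genome, and it estimates only the probability that all m shared genes form one max-gap cluster in both. The function names, the trial count and the seed are arbitrary choices.

```python
import random


def all_gaps_within(positions, g):
    """True if adjacent positions are separated by at most g other genes."""
    pos = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def estimate_complete_cluster_prob(n, m, g, trials=1000, seed=0):
    """Monte Carlo estimate of P(all m shared genes cluster in both genomes)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        in_g = rng.sample(range(n), m)   # positions of shared genes in G
        in_h = rng.sample(range(n), m)   # positions of shared genes in H
        if all_gaps_within(in_g, g) and all_gaps_within(in_h, g):
            hits += 1
    return hits / trials


print(estimate_complete_cluster_prob(n=4, m=2, g=0, trials=20000, seed=1))
```

For the toy case n = 4, m = 2, g = 0 the exact value is 1/4 (one half per genome, squared), which gives a quick sanity check on the estimator. Larger settings can be passed in the same way, with accuracy limited by the number of trials.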

Page 52: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Unexpected Statistical Trends

• There can be a significant probability of finding a cluster that includes all homologous gene pairs

• The significance of a cluster of size k can be less than that of a cluster of size k-1

• Probabilities are not monotonic

• Large clusters may not be significant

Example: n = 1000, m = 250, g = 20; probability of a cluster of size 250 ~ 50%

Page 53: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Outline

• Formalisms

• Reference set scenario

• Whole genome comparison

• Conclusion

Page 54: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Clusters Are Used in Many Other Applications

Inferring functional coupling of genes in bacteria (Overbeek et al 1999)

Recent polyploidy in Arabidopsis (Blanc et al 2003)

Sequence of the human genome (Venter et al 2001)

Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002)

Duplications in Eukaryotes (Vision et al 2000)

Identification of horizontal transfers (Lawrence and Roth 1996)

Evolution of gene order conservation in prokaryotes (Tamames 2001)

Ancient yeast duplication (Wolfe and Shields 1997)

Genomic duplication during early chordate evolution (McLysaght et al 2002)

Comparing rates of rearrangements (Coghlan and Wolfe 2002)

Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998)

Operon prediction in newly sequenced bacteria (Chen et al 2004)

Breakpoints as phylogenetic features (Blanchette et al 1999)...

Page 55: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Max-Gap Clusters are Especially Common

[Same list of applications as on the preceding slide]

Page 56: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Formal statistical models allow us to
– understand trends and verify that they match our expectations
– choose parameters effectively
– conduct statistical tests for data analysis

Formal statistical models require
– a formal cluster definition
– a search procedure to find clusters

These issues are more complicated than they might seem!

Page 57: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Summary

Results: statistical tests of significance for max-gap clusters
• Reference set scenario
• Genome comparison (work in progress)

We need to
• explicitly consider the cluster properties we would like our definitions to satisfy
• rigorously evaluate whether our definition meets these requirements
• carefully prove that our search procedures match our stated definitions

Page 58: The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Thank You