Time dependent gene analysis

Upload: janna

Post on 12-Jan-2016

Page 1: Time dependent gene analysis

OVERVIEW

Omer Berkman

Page 2: Time dependent gene analysis

Contents

Biological background

Using Gene-arrays to decipher gene-regulatory interactions

Applications…

Page 3: Time dependent gene analysis

Hybridization

A DNA double strand forms by the “gluing” of complementary single strands.

Complementarity rule: A pairs with T (U in RNA), and G pairs with C.
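The pairing rule can be sketched in a few lines of Python. This is only an illustration: the names `DNA_PAIRS`, `reverse_complement`, and `hybridizes` are hypothetical, and the check covers perfect DNA-DNA matches only, whereas real hybridization tolerates some mismatches.

```python
# Watson-Crick pairing rule from the slide: A pairs with T, G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the strand that would pair with `strand`, read in the opposite direction."""
    return "".join(DNA_PAIRS[base] for base in reversed(strand.upper()))

def hybridizes(probe: str, target: str) -> bool:
    """True only if `target` is the exact reverse complement of `probe` (perfect match)."""
    return reverse_complement(probe) == target.upper()
```

A probe on an array would, under this simplified model, capture only the target for which `hybridizes(probe, target)` is true.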

Page 4: Time dependent gene analysis

Protein production

Page 5: Time dependent gene analysis

From DNA to Protein

Gene → (transcription) → mRNA → (translation) → Protein

Cells express different subsets of their genes in different tissues and under different conditions.

Page 6: Time dependent gene analysis

Functional genomics

The complete sequences of many microbial genomes are already known: the inventory of the building blocks of life has been collected.

The next stage is “re-assembling the pieces”:

- Defining the role of each gene in these genomes.
- Understanding how the genome functions as a whole in the complex natural history of a living organism.

Knowing when and where a gene is expressed often provides a strong clue as to its biological role

Page 7: Time dependent gene analysis

Transcriptional process

This process is highly regulated.

One of the most important ways in which the cell regulates gene expression is by using a feedback loop.

Some of the proteins are transcription factors.

These proteins regulate the expression of other genes (and possibly their own expression) by either initiating or repressing transcription.

Page 8: Time dependent gene analysis

Transcriptional networks

One gene can be a regulator of another gene.

Transcriptional networks are the biochemical networks responsible for regulating the expression of genes in cells.

In these transcription networks, the nodes represent transcription factors (genes) and the edges represent direct transcriptional regulation.

[Shen-Orr 2002, Thieffry 1998]

Page 9: Time dependent gene analysis

Transcriptional networks example

Page 10: Time dependent gene analysis

Gene-arrays for mRNA analysis

Differences in cell type or state are correlated with changes in the mRNA levels of its genes.

The only specific reagent required to measure the abundance of the mRNA for a specific gene is a cDNA sequence.

DNA microarrays provide a practical and economical tool for studying gene expression on a very large scale.

Page 11: Time dependent gene analysis

Affymetrix model for DNA chip

Now we can infer which genes were expressed and at what intensity.

Owing to some biological processes, the correct sequence will not always hybridize to the oligo.

Page 12: Time dependent gene analysis

Gene Arrays / DNA chips

From “one gene in one experiment” to “massively parallel biological data acquisition”.

Simultaneously analyzing the expression levels of large numbers of genes provides the opportunity to study the activity of whole genomes.

Large-scale gene expression analysis reveals the behavior of co-regulated gene networks.

Page 13: Time dependent gene analysis

Raw Data

The curse of dimensionality: thousands of genes versus only a few observations.

Page 14: Time dependent gene analysis

Static versus dynamic

We distinguish between static experiments and time series experiments:

- Static: a snapshot is measured in different samples. Data are assumed to be independent and identically distributed.
- Dynamic: a temporal process is measured. Data have strong autocorrelation between successive points.
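The difference between the two kinds of data can be made concrete with a lag-1 autocorrelation. The sketch below uses synthetic series, not array data: independent draws stand in for a static experiment and a random walk stands in for a temporal process. The function name `lag1_autocorrelation` and both series are assumptions made for illustration.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Pearson correlation between a series and the same series shifted by one step."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(0)
static_like = rng.normal(size=200)               # i.i.d. draws: no temporal structure
dynamic_like = np.cumsum(rng.normal(size=200))   # random walk: each point builds on the previous one

r_static = lag1_autocorrelation(static_like)     # expected to be close to 0
r_dynamic = lag1_autocorrelation(dynamic_like)   # expected to be close to 1
```

Methods that assume independent samples are therefore not directly applicable to the dynamic case.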

Page 15: Time dependent gene analysis

Temporal observations

It’s possible to produce time-dependent measurements, termed expression matrices.

These expression matrices are the result of the underlying regulatory network.

Reverse engineering seeks to extract information from time-series measurements in order to identify regulatory interactions in these genetic networks.

Page 16: Time dependent gene analysis

Complications

- The curse of dimensionality
- Extremely noisy observations
- Expensive experiments
- Stochastic nature
- Population averaged
- Feasible time scale
- Partial information

We are facing a hard problem…

Page 17: Time dependent gene analysis

1. The curse of dimensionality (Bellman, 1961)

The number of genes typically far exceeds the number of time points for which data are available, making the problem an ill-posed one.

“Traditional statistics” won’t help here: the number of samples, relative to the number of genes, does not provide enough information to construct a fully detailed model with high statistical significance.

New statistical methods and approaches were developed (bootstrap, interpolation, clustering, FDR…).

Page 18: Time dependent gene analysis

2. Stochastic nature

[Figure panels: Deterministic, Stochastic]

Biology has no deterministic processes…

Page 19: Time dependent gene analysis

3. Population averaged

Measurements are obtained as population-averaged data.

The measurement itself kills or alters the organism.

This masks the real regulatory interactions (quantization problem).

Page 20: Time dependent gene analysis

4. Feasible time scale

Empirical limit on the number of time points:

The average speed of the biological process determines the number of informative points.

The error of the method applied has to be smaller than the expression level difference.

COST and ERROR → MISSING REGULATORY INTERACTIONS

Page 21: Time dependent gene analysis

5. Partial information

Biological systems are robust, adaptable, and redundant.

Genes are not the only actors in the game: transcription factors can be of many kinds.

The regulatory interactions between genes are not deterministic at the mRNA level: a gene has a few independently regulated derivatives.

mRNA expression data alone give only a partial picture that does not reflect key events such as translation and protein (in)activation.

Page 22: Time dependent gene analysis

Fundamental question

How much information is needed to map the gene-regulatory interactions of a biological system?

Hertz’s estimate [1998] of the number of gene states that must be measured for successful reverse engineering:

P = K log(N/K)

N: the size of the network (e.g. the number of genes).

K: the average number of interactions per gene.
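To get a feel for the scaling, the estimate can be evaluated for a genome-sized network. The sketch below assumes a natural logarithm and ignores constant factors, since the slide gives neither. The function name `hertz_estimate` and the example values of N and K are hypothetical.

```python
import math

def hertz_estimate(n_genes: int, k_inputs: float) -> float:
    """P = K * log(N / K). The log base is an assumption; the slide does not state it."""
    return k_inputs * math.log(n_genes / k_inputs)

# Hypothetical example: a yeast-sized network with about 3 regulators per gene.
p = hertz_estimate(6000, 3)   # roughly 23 under these assumptions
```

The point of the formula is the logarithmic dependence on N: for sparse networks the required number of measurements grows far more slowly than the number of genes.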

Page 23: Time dependent gene analysis

Application 1 [DeRisi 1997]

Exploring the metabolic and genetic control of gene expression

Investigation of gene expression accompanying the metabolic shift from fermentation to respiration in yeast.

Identify genes whose expression was affected by deletion of TUP1 or over-expression of YAP1.

Page 24: Time dependent gene analysis

Yeast genome micro-array

Genes induced or repressed appear in this image as red and green spots, respectively.

Page 25: Time dependent gene analysis

Temporal samples

Page 26: Time dependent gene analysis

Analysis

Gene expression was stable during exponential growth.

A marked change was seen as glucose was progressively depleted from the growth media.

- mRNA levels for 710 genes were induced.
- mRNA levels for 1030 genes declined.

The expression patterns observed for previously characterized genes showed concordance with previously published results.

About half of these differentially expressed genes have no apparent homology to any gene whose function is known. This provides the first small clue to their possible roles.

Page 27: Time dependent gene analysis

Coordinated regulation of functionally related genes

Genes can be grouped on the basis of the similarities in their expression patterns

Page 28: Time dependent gene analysis

Distinct temporal patterns

Page 29: Time dependent gene analysis

Metabolic Diagram

Red boxes identify genes whose expression increases in the diauxic shift. Green boxes identify genes whose expression diminishes in the diauxic shift.

Page 30: Time dependent gene analysis

Defining the contributions of individual regulatory genes

Using a DNA micro-array to identify genes whose expression is affected by mutations in each putative regulatory gene.

Performing:

- Deletion of the transcriptional repressor TUP1.
- Overexpression of the transcriptional activator YAP1.

Page 31: Time dependent gene analysis

Deleting the TUP1 gene

Wild-type yeast cells and cells bearing a deletion of the TUP1 gene were grown.

mRNA was isolated from the two populations and used to prepare cDNA labeled with green and red, respectively.

The labeled probes were mixed and simultaneously hybridized to the micro-array.

Red spots on the array represent genes that were induced in the TUP1 deletion strain, and thus presumably repressed by TUP1.

Page 32: Time dependent gene analysis

Overexpressing the YAP1 gene

Complementary DNA from the control and YAP1 over-expressing strains, labeled with Cy3 and Cy5, respectively, was prepared from mRNA isolated from the two strains and hybridized to the micro-array.

Red spots on the array represent genes that were induced in the strain over-expressing YAP1.
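Two-color read-outs like these are commonly summarized as a log ratio of the red (Cy5) to the green (Cy3) intensity per spot. The sketch below is a minimal illustration under that convention. The function names and the two-fold cutoff are assumptions and are not taken from the slides.

```python
import numpy as np

def log2_ratio(cy5_red, cy3_green):
    """log2(red / green) per spot; values above 0 mean more signal in the Cy5-labeled sample."""
    red = np.asarray(cy5_red, dtype=float)
    green = np.asarray(cy3_green, dtype=float)
    return np.log2(red / green)

def call_spots(ratios, fold=2.0):
    """Label each spot using a symmetric fold-change cutoff (the cutoff is a hypothetical choice)."""
    cutoff = np.log2(fold)
    return np.where(ratios >= cutoff, "induced",
                    np.where(ratios <= -cutoff, "repressed", "unchanged"))

ratios = log2_ratio([400, 100, 50], [100, 100, 200])
calls = call_spots(ratios)
```

With the control labeled Cy3 and the YAP1 over-expressing strain labeled Cy5, an "induced" call corresponds to a red spot.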

Page 33: Time dependent gene analysis

Characterization of regulatory pathways and networks

Use of a micro-array to characterize the transcriptional consequences of mutations provides a simple and powerful approach.

This strategy also has an important practical application in drug screening.

However, one should keep in mind that transcriptional regulation can be complicated.

Page 34: Time dependent gene analysis

Application 1 summary

DNA micro-arrays provide a simple and economical way to explore gene expression patterns on a genomic scale.

“The greatest challenge now is to develop efficient methods for organizing, interpreting, and extracting insights from the large volumes of data these experiments provide.”

Technical advances have made array experiments fairly easy to do, but tools for analysis of data produced have lagged behind.

Page 35: Time dependent gene analysis

Application 2 [Friedman 2000]

Using Bayesian Networks to Analyze Expression Data

Probabilistic approach.

Bayesian network as a model for genetic networks.

Page 36: Time dependent gene analysis

Bayesian networks – definitions

A Bayesian network is a representation of a joint probability distribution. The representation consists of two components:

- G is a directed acyclic graph (DAG) whose vertices correspond to the random variables.
- θ describes a conditional distribution for each variable, given its parents in G.
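Together, the two components define the joint distribution through the standard factorization for Bayesian networks. The notation below, with Pa^G(X_i) for the parents of X_i in G, is introduced here for illustration and does not appear on the slide.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}^{G}(X_i)\bigr)
```

Each factor on the right is one of the conditional distributions described by θ.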

Page 37: Time dependent gene analysis

Simple example

Page 38: Time dependent gene analysis

Bayesian networks – properties

Encodes the Markov assumption: each variable is independent of its non-descendants, given its parents in the graph.

A graph-based model that captures properties of conditional independence between variables.

Useful for describing processes composed of locally interacting components.

Provide models of causal influence.

Page 39: Time dependent gene analysis

Equivalence classes

Let Ind(G) be the set of independence statements (of the form X is independent of Y given Z) implied by G.

More than one graph can imply exactly the same set of independencies.

Two graphs G’ and G’’ are equivalent if Ind(G’) = Ind(G’’), that is, both graphs are alternative ways of describing the same set of independencies.

Equivalent graphs have the same underlying undirected graph but might disagree on the direction of some of the arcs. An equivalence class is therefore represented by a partially directed acyclic graph (PDAG).
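Equivalence can be checked without enumerating independence statements. The sketch below relies on a criterion that is not stated on the slide (due to Pearl and Verma): two DAGs are equivalent exactly when they share the same undirected skeleton and the same v-structures, that is, colliders a → c ← b with a and b not adjacent. All function names and the three example graphs are hypothetical.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """All colliders a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def equivalent(edges1, edges2):
    """Same skeleton and same v-structures (criterion assumed, see lead-in)."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = {("X", "Y"), ("Y", "Z")}            # X -> Y -> Z
reversed_chain = {("Z", "Y"), ("Y", "X")}   # X <- Y <- Z
collider = {("X", "Y"), ("Z", "Y")}         # X -> Y <- Z
```

Here `chain` and `reversed_chain` fall in the same class, so data alone cannot orient their arcs, while `collider` has the same skeleton but belongs to a different class.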

Page 40: Time dependent gene analysis

Learning Bayesian Networks

Given a training set D of independent instances of X, find a network B = {G, θ} that best matches D.

Several scoring functions are available.

Finding the structure G that maximizes the score is known to be NP-hard.

For heuristic search we need:

- A decomposable score function. For example: S(G:D) = log P(D|G) + log P(G) + C
- An iterative search method. For example: greedy or stochastic hill climbing, simulated annealing…
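Decomposability means that the score can be written as a sum of local terms, one per variable, each depending only on that variable and its parents. A sketch of the form, with the name ScoreContribution used as a placeholder for the local term:

```latex
S(G : D) = \sum_{i=1}^{n} \mathrm{ScoreContribution}\bigl(X_i, \mathrm{Pa}^{G}(X_i) : D\bigr)
```

This is what makes local search practical: adding, removing, or reversing a single arc changes the parent set of at most two variables, so only those terms need to be re-evaluated at each step.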

Page 41: Time dependent gene analysis

Biological (causal) interpretation

Edges: the parents of a variable are its immediate causes (the parent of a node is a transcription factor for this gene).

A causal network models the effects of interventions: if X causes Y, then manipulating the value of X affects the value of Y, but not the other way around (if we knock out gene X, this will affect the expression of gene Y, but a knockout of gene Y has no effect on the expression of gene X).

Page 42: Time dependent gene analysis

Analyzing Expression Data

Random variables denote the expression levels of individual genes.

In addition, we can include random variables that denote other attributes that affect the system (experimental conditions, temporal indicators…).

We want to learn a network from the available data and use it to answer questions about the system.

Page 43: Time dependent gene analysis

Find high-scoring networks

The data are not informative enough to determine which single model is the right one.

We focus on features that are common to most of the possible models:

- Markov relation: indicates that two genes are related in some joint biological interaction or process (there is either an edge between them, or both are parents of another variable (Pearl 1988)).
- Order relation: X is an ancestor of Y in all the networks of a given equivalence class (the given PDAG contains a directed path from X to Y).

Page 44: Time dependent gene analysis

How can we estimate a measure of confidence in the features?

The bootstrap method (Efron & Tibshirani 1993): a method to enlarge our data set by generating “perturbed” versions of our original data set. In this way we collect many networks, all of which are fairly reasonable models of the data.

For each feature f of interest, calculate:

$$\mathrm{conf}(f) = \frac{1}{m}\sum_{i=1}^{m} f(G_i)$$

where G_i is the network learned from the i-th perturbed data set, and f(G) is 1 if f is a feature in G, and 0 otherwise.
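The confidence estimate can be sketched as follows. Note the loud assumption: `learn_network` here is a stand-in that links genes by thresholded correlation, chosen only to keep the example short. It is not the Bayesian network search described in the slides. The data are synthetic, and all names and parameter values are hypothetical.

```python
import numpy as np

def learn_network(data, threshold=0.8):
    """Stand-in learner (NOT a Bayesian network search): link genes i and j
    when the absolute Pearson correlation across samples exceeds `threshold`."""
    corr = np.corrcoef(data, rowvar=False)       # data has shape (samples, genes)
    n = corr.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold}

def bootstrap_confidence(data, feature, m=200, seed=0):
    """conf(f) = (1/m) * sum over i of f(G_i), with G_i learned from the i-th resampled data set."""
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    hits = 0
    for _ in range(m):
        rows = rng.integers(0, n_samples, size=n_samples)   # resample rows with replacement
        hits += int(feature(learn_network(data[rows])))
    return hits / m

# Synthetic data: gene 1 closely tracks gene 0, gene 2 is independent of both.
rng = np.random.default_rng(1)
g0 = rng.normal(size=40)
data = np.column_stack([g0, g0 + 0.05 * rng.normal(size=40), rng.normal(size=40)])

conf_linked = bootstrap_confidence(data, lambda g: (0, 1) in g)
conf_unlinked = bootstrap_confidence(data, lambda g: (0, 2) in g)
```

A feature that survives most resamples, such as the link between genes 0 and 1 here, receives a confidence close to 1, while a spurious one receives a confidence close to 0.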

Page 45: Time dependent gene analysis

Local Probability Models

In order to specify a Bayesian network model, we still need to choose the type of the local probability models we learn. In the current work, we consider two approaches:

- Multinomial model (expression levels discretized to -1, 0, 1).
- Linear Gaussian model.
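For the multinomial model, each expression value is mapped to one of three categories. A minimal sketch, assuming the input is a log expression ratio and using a symmetric cutoff. The function name `discretize` and the threshold value are hypothetical and not taken from the slides.

```python
import numpy as np

def discretize(log_ratios, threshold=0.5):
    """Map log expression ratios to -1 (under-expressed), 0 (baseline), or 1 (over-expressed)."""
    x = np.asarray(log_ratios, dtype=float)
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))

levels = discretize([-1.2, 0.1, 0.7])
```

Discretization discards magnitude information but makes no assumption about the form of the dependence between genes, whereas the linear Gaussian model keeps the continuous values and assumes linear dependencies.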

Page 46: Time dependent gene analysis

Robustness analysis

Page 47: Time dependent gene analysis

Multinomial versus Gaussian

The two methods highlight different types of connections between genes.

Page 48: Time dependent gene analysis

Biological Analysis

Order relations reveal the existence of dominant genes. Out of all 800 genes, only a few seem to dominate the order (i.e., appear before many genes).

The top Markov relations reveal genes that are, for the most part, functionally related.

Nice presentation: http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/

Page 49: Time dependent gene analysis

An example of the graphical display of Markov features

This suits biological knowledge!

Page 50: Time dependent gene analysis

Application 2 summary

Using Bayesian networks to model genetic networks involves thousands of genes, while current data sets contain only a few dozen samples. This raises problems in computational complexity and in the statistical significance of the results.

Genetic regulation networks are sparse (each gene is assumed to have no more than a few dozen genes that directly affect its transcription). Bayesian networks are especially suited for learning in such sparse domains.

The analysis did not use any (biological) prior knowledge.

This theory can provide tools for experimental design.

Page 51: Time dependent gene analysis

Dynamic Bayesian Networks

DBNs are an extension of Bayesian networks, which have been successfully applied to model expression data (Pe’er et al., 2001).

The main advantage is that, unlike BNs, DBNs allow for cycles, which are common in biological systems.

In addition, DBNs can also improve our ability to learn causal relationships by relying on the temporal nature of the data.

DBNs seem like a promising direction for modeling temporal systems, and a number of recent papers discuss this model.

Page 52: Time dependent gene analysis

Application 3 [Holter 2000]

Fundamental patterns underlying gene expression

Algebraic approach.

Using SVD to model gene expression.

Page 53: Time dependent gene analysis

Singular Value Decomposition

A standard and straightforward analytic procedure that finds eigenvectors, or fundamental patterns of expression over time, of the array matrices.

The SVD theorem states that the matrix A can be written as:

$$A = U S V^{T}$$

Page 54: Time dependent gene analysis

SVD theorem

- U and V are orthogonal.
- The elements of S are all zero except for the diagonal entries S_{i,i}, which are the singular values (square roots of the eigenvalues of A^T A).

Page 55: Time dependent gene analysis

Characteristic modes

We define the vectors X_i to be the first r rows of the matrix S V^T.

Those r vectors are the characteristic modes associated with the matrix A.

The temporal variation of any gene j can be written as a linear combination of these vectors: $A_j = \sum_{i=1}^{r} U_{ji} X_i$, where A_j is row j of A.
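The decomposition into modes and the reconstruction of gene profiles can be sketched with NumPy. The matrix below is random and merely stands in for a genes-by-time expression matrix. The variable names and the choice k = 2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_times = 50, 8
A = rng.normal(size=(n_genes, n_times))    # synthetic stand-in for an expression matrix

# Thin SVD: A = U @ diag(s) @ Vt, singular values in s sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
modes = s[:, None] * Vt                    # rows X_i of S V^T: the characteristic modes

# Every gene profile is a linear combination of the modes, with coefficients from U.
A_exact = U @ modes

# Keeping only the first k modes gives a rank-k approximation of all profiles.
k = 2
A_approx = U[:, :k] @ modes[:k]
```

Row j of `U` holds the gene-specific coefficients of gene j, so plotting the first two columns of `U` against each other gives a two-dimensional summary of every gene.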

Page 56: Time dependent gene analysis

Results

The first two singular values were significantly greater than the others for all three data sets, but the same is not true in a control calculation on random data.

Only the first few modes are required to capture the essential features of the expression data in most cases (the modes reflect the genome-wide expression pattern and are not gene-specific).

Page 57: Time dependent gene analysis

Singular values extracted from gene expression and random data sets

Page 58: Time dependent gene analysis

Characteristic modes for the gene expression and random data sets

The magnitude of the singular value is reflected in the amplitude of each mode.

Page 59: Time dependent gene analysis

A reconstruction of the expression profiles

Page 60: Time dependent gene analysis

Analysis 1

A type of “spectral” analysis: a gene expression profile can be precisely represented by specifying the magnitude and sign of the contribution of each of its characteristic modes.

This suggests that at a gross level, most time-dependent expression patterns are very simple.

Data from SVD agree with previous knowledge of expression patterns.

Page 61: Time dependent gene analysis

Plot of the coefficients

Symbols of different colors and shapes are used for genes that belong to the different clusters.

Page 62: Time dependent gene analysis

Analysis 2

The data points (which are not random) are concentrated near the perimeter of a circle or an ellipse, with the interior rather sparsely populated.

Expression profiles clustered by more conventional methods correspond well to groups of genes with similar coefficients.

Despite the evolutionary distance between yeast and humans, the observed behavior is both simple and similar.

Page 63: Time dependent gene analysis

Application 3 summary

SVD has uncovered underlying patterns, or “characteristic modes”, in gene temporal profiles.

The expression pattern of any particular gene can be represented precisely by a linear combination of the modes with gene-specific coefficients.

A good approximation of the exact pattern can be obtained by using just a few of the modes, underscoring the simplicity of the gene expression patterns.

This paradigm may find expression patterns that would not be detected using other methods.

Page 64: Time dependent gene analysis

Application 3b [Holter et al. 2001]

Dynamic modeling

In the previous application we treated the gene expression pattern as a “static” image and derived the underlying genome-wide characteristic modes of which it is composed.

Now we carry out a dynamical analysis, exploring the possible causal relationships among the genes by deducing a time translation matrix for the characteristic modes defined by SVD.

This matrix predicts future expression levels of genes based on their expression levels at some initial time.

Page 65: Time dependent gene analysis

How to deduce a time translation matrix?

To uniquely and unambiguously determine the g² elements of the matrix (g being the number of genes), one needs a set of g² linearly independent equations.

D’haeseleer [1999] used a nonlinear interpolation scheme to guess the shapes of gene expression profiles between the measured time points (speculative).

Van Someren [2000] chose to cluster the genes and study the interrelationships between the clusters (based on profile similarity).

Page 66: Time dependent gene analysis

Deduce a time translation matrix by using SVD

The SVD construction gives a linear combination of modes that exactly describes the expression pattern of each gene.

The modes form a linearly independent basis set.

The problem is mathematically well defined and tractable if one considers the causal relationships among the modes.
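Working with r modes instead of g genes reduces the unknowns from g² to r², which a short time series can determine. The sketch below fits a time translation matrix M to a set of modes by ordinary least squares, assuming evenly spaced time points. The least-squares fit is a swapped-in technique for illustration and is not necessarily the procedure used by Holter et al. The example trajectory is synthetic and all names are hypothetical.

```python
import numpy as np

def fit_time_translation(modes):
    """Least-squares M such that modes[:, t + 1] is approximately M @ modes[:, t] for every t.
    `modes` has shape (r, n_times); time points are assumed to be evenly spaced."""
    past, future = modes[:, :-1], modes[:, 1:]
    # Solve past.T @ M.T = future.T for M.T in the least-squares sense.
    m_transposed, *_ = np.linalg.lstsq(past.T, future.T, rcond=None)
    return m_transposed.T

# Synthetic check: two modes evolving as a damped rotation.
theta, scale = 0.5, 0.95
M_true = scale * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(9):
    states.append(M_true @ states[-1])
modes = np.column_stack(states)            # shape (2 modes, 10 time points)

M_fit = fit_time_translation(modes)
```

With two modes there are only four unknowns, and nine transitions are more than enough to recover them. The same ten time points could not determine a gene-level matrix for thousands of genes.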

Page 67: Time dependent gene analysis

Analysis

The causal links between the modes, and thence the genes, involve just a few essential connections. Any additional connections among the genes must therefore provide redundancy in the network.

An important corollary is that it may be impossible to determine detailed connectivities among genes with just the micro-array data, because the number of genes greatly exceeds the number of contributing modes.

Page 68: Time dependent gene analysis

Application 3b summary

A model in which the expression levels of the genes at a given time are linear combinations of their levels at a previous time.

Temporal evolution of the gene expression profiles can be described by using a ‘‘time translation’’ matrix, which reflects the magnitude of the connectivities between genes.

Because there are only a few essential connections among modes and therefore among genes, additional links provide redundancy in the network.

Page 69: Time dependent gene analysis

References

Yaacov Lindzen’s presentation “Introduction to Micro-arrays”

“Genetic network analysis in light of massively parallel biological data acquisition”. Szallasi, 1999, PSB

“Exploring the metabolic and genetic control of gene expression on a genomic scale”. DeRisi et al, 1997, Science

“Using Bayesian networks to analyze expression data”. Friedman et al, 2000

“Fundamental patterns underlying gene expression profiles: Simplicity from complexity”. Holter et al, 2000, Genetics

“Dynamic modeling of gene expression data”. Holter et al, 2001, Genetics

“Analyzing time series gene expression data”. Bar Joseph, 2004, Bioinformatics