Post on 21-Dec-2015
Genetic Networks
Cellular Networks
Most processes in the cell are controlled by “networks” of interacting molecules:
  Metabolic Networks
  Signal Transduction Networks
  Regulatory Networks
Unifying View
The cell as a “state machine”
Cell state S = (P1, P2, …, R1, R2, …, m1, m2, …)
  P: proteins, R: mRNA molecules, m: metabolites
Each cell, at any given time, can be characterized by its state S
Dynamics: Input(t), S(t) => S(t+Δt)
What does it mean?
Steady cell state – cell type: neuron, RBC, muscle cell, tumor cell
Dynamics – cellular process: differentiation, apoptosis, cell cycle
Gene Regulation Networks
Regulation of expression of genes is crucial
Regulation occurs at many stages:
  pre-transcriptional (chromatin structure)
  transcription initiation
  RNA editing (splicing) and transport
  translation initiation
  post-translational modification
  RNA and protein degradation
Understanding regulatory processes is a central problem of biological research
Genetic Network Models: Goals
Incorporate rule-based dependencies between genes
  Rule-based dependencies may constitute important biological information.
Allow systematic study of global network dynamics
  In particular, individual gene effects on long-run network behavior.
Must be able to cope with uncertainty
  Small sample size, noisy measurements, biological “noise”
Quantify the relative influence and sensitivity of genes in their interactions with other genes
  This allows us to focus on individual (groups of) genes.
What model should we use?
Level of Biochemical Detail
Detailed models require lots of data!
Highly detailed biochemical models are only feasible for very small systems which are extensively studied
Example: Arkin et al. (1998), Genetics 149(4):1633-48
  lysis-lysogeny switch in lambda phage:
  5 genes, 67 parameters based on 50 years of research
  stochastic simulation required a supercomputer!
Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4):1633-48
Level of Biochemical Detail
In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)
Less detailed network models are useful when data is scarce and/or network structure is unknown
Once network structure has been determined, we can refine the model
Boolean or Continuous?
Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.
  Allow analysis at the network level
  Provide useful insights into network dynamics
  Algorithms exist for network inference from binary data
[Figure: genes A and B regulating gene C, with C = A AND B; every gene takes the value 0 or 1]
Boolean Formalism: Cons
Boolean abstraction is a poor fit to real data
Cannot model important concepts:
  amplification of a signal
  subtraction and addition of signals
  compensating for a smoothly varying environmental parameter (e.g. temperature, nutrients)
  varying dynamical behavior (e.g. cell cycle period)
Feedback control: negative feedback is used to stabilize expression, but it causes oscillation in a Boolean model
Boolean Formalism: Pros
Studies of Boolean models reproduce qualitative phenomena observed by experimentalists.
Some studied systems exhibit multiple steady states and “switchlike” transitions between them.
It is experimentally shown that such systems are “robust” to exact values of kinetic parameters of individual reactions.
Concentrations or Molecules?
Use of concentrations assumes individual molecules can be ignored
Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)
Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
Significantly increases model complexity
Concentrations or Molecules?
Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.
Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell (Holstege et al. (1998), Cell 95:717-728)
Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88:243-251)
Spatial or Non-Spatial
Spatiality introduces additional complexity:
  intercellular interactions
  spatial differentiation
  cell compartments
  cell types
Spatial patterns also provide more data
e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454.
Few (no?) large-scale spatial gene expression data sets available so far.
Example: Drosophila Segmentation
[Figure: expression of transcription factors (bcd, hb, gt, Kr) along the anterior-posterior axis of the embryo, on a scale from low to high, together with the resulting eve (even-skipped) stripe 2 expression]
Deterministic or Stochastic?
Many sources of stochasticity
  Biological stochasticity
  Experimental noise
Stochastic models can account for those
Deterministic models are usually simpler to analyze (dynamics, steady states) and interpret
Modeling Approaches
Boolean Networks
Linear Models
Bayesian Networks
Boolean Network
What is a Boolean Network?
A Boolean network is a kind of graph G(V, F)
  V is a set of nodes (genes)
  F is a list of Boolean functions
Every node has only two values: ON (1) and OFF (0)
Every function gives the next value of its node:
  x_i = f_i(x_1, x_2, \ldots, x_n)
Representation: standard, wiring, automaton
What is a Boolean Network?
Attractor: a state (or cycle of states) that is revisited infinitely often once the network reaches it. Which attractor is reached depends on the initial state.
Basin of attraction
Limit-cycle attractor
Boolean Network Example
Nodes (genes): V = {x1, x2, x3}, each updated by a Boolean function f_i(x1, x2, x3)
[Figure: wiring diagram G’(V’, F’) connecting x1, x2, x3 at time t to x1, x2, x3 at time t+1; arrows either activate or inactivate a gene]
Trajectory example:
Iteration  1 2 3 4 5 6
x1         1 1 0 0 0 0
x2         1 1 1 0 0 0
x3         0 1 1 1 0 0
Boolean Network Example
[Figure: the same three-gene network and trajectory table as in the previous example, with the state transition diagram]
Update rules:
  f1(x1, x2, x3) = x2 AND NOT x3
  f2(x1, x2, x3) = x1
  f3(x1, x2, x3) = x2
State transitions (states written as x1 x2 x3):
  trajectory 1: 110 → 111 → 011 → 001 → 000
  trajectory 2: 100 → 010 → 101 → 010 → …
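The three-gene example can be simulated in a few lines. The following is a minimal sketch, not the lecture's own code. It assumes synchronous updating and the rules x1 ← x2 AND NOT x3, x2 ← x1, x3 ← x2, which are the rules that reproduce the trajectory table of the example; the function names `step` and `trajectory` are illustrative.

```python
def step(state):
    """Apply the assumed update rules once, to all genes at the same time."""
    x1, x2, x3 = state
    return (int(x2 and not x3), x1, x2)


def trajectory(start, n_steps):
    """Return the list of states visited from `start`, including `start` itself."""
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


# Start from x1=1, x2=1, x3=0, the first column of the trajectory table.
traj = trajectory((1, 1, 0), 5)
for iteration, state in enumerate(traj, start=1):
    print(iteration, state)
```

Under these assumptions the printed states match the six columns of the trajectory table, ending in the fixed point 000.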
Basic Structure of Boolean Networks
[Figure: genes A and B regulating gene X]
Boolean function:
A B X
0 0 1
0 1 1
1 0 0
1 1 1
• Each node is a gene
• 1 means active/expressed
• 0 means inactive/unexpressed
In this example, two genes (A and B) regulate gene X. In principle, any number of “input” genes are possible. Positive/negative feedback is also common (and necessary for homeostasis).
Dynamics of Boolean Networks
[Figure: six genes A to F, each with a binary activity value that changes over time, e.g. the patterns 0 1 1 0 0 1 and 1 1 0 1 1 0]
At a given time point, all the genes form a genome-wide gene activity pattern (GAP) (a binary string of length n). Consider the state space formed by all possible GAPs.
State Space of Boolean Networks
Similar GAPs lie close together.
There is an inherent directionality in the state space.
Some states are attractors (or limit-cycle attractors). The system may alternate between several attractors.
Other states are transient.
Picture generated using the program DDLab.
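For a small network the whole state space can be enumerated, and every state assigned to the attractor it eventually reaches. The sketch below is illustrative only and is unrelated to DDLab. It assumes a three-gene network with the synchronous rules x1 ← x2 AND NOT x3, x2 ← x1, x3 ← x2; the names `find_attractor` and `basins` are hypothetical.

```python
from itertools import product


def step(state):
    """One synchronous update of the assumed three-gene network."""
    x1, x2, x3 = state
    return (int(x2 and not x3), x1, x2)


def find_attractor(start):
    """Follow the dynamics from `start` until a state repeats; return the cycle."""
    seen = {}   # state -> position in path
    path = []
    state = start
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state)
    cycle = path[seen[state]:]
    # Rotate the cycle so it starts at its smallest state (canonical form).
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


# Group all 2^3 states by the attractor they flow into (basins of attraction).
basins = {}
for s in product((0, 1), repeat=3):
    basins.setdefault(find_attractor(s), []).append(s)

for attractor, members in basins.items():
    kind = "fixed point" if len(attractor) == 1 else "limit cycle"
    print(kind, attractor, "basin:", members)
```

With these rules there is one fixed point (000) and one limit cycle of length two (010 and 101); all other states are transient. Exhaustive enumeration only works for small n, since the state space has 2^n states.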
Reverse Engineering Problem
Can we infer the structure and rules of a genetic network from gene expression measurements?
Reverse Engineering Problem
Input: Gene expression data
Output: Network structure and parameters (or regulation rules)
Gene Expression Time Series Data
[Figure: expression levels of gene 1, gene 2 and gene 3 plotted against time (min), 0 to 60]
Problem: how can these data be used to infer how these three genes influence each other?
Modelling Gene Expression Data
[Figure: expression levels of gene 1, gene 2 and gene 3 plotted against time (min), 0 to 60]
assume that genes exist in two states: on and off
if the expression of gene i is above a threshold level chosen for gene i, consider it on; otherwise, consider it off
[Figure: the same time series with a threshold level marked for each of genes 1, 2 and 3, and every time point of every gene labeled on or off]
Modelling Gene Expression Data
we obtain the following discretized gene expression data:
time 0 5 10 15 20 25 30 35 40 45 50 55
gene 1 0 0 0 0 0 0 1 1 1 1 1 1
gene 2 0 0 0 0 0 0 0 1 1 0 0 0
gene 3 1 1 1 1 1 1 1 0 0 0 0 0
the gene expression data is now in the form of bit streams
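The thresholding step can be sketched as follows. The expression values and thresholds here are hypothetical numbers chosen for illustration; they are not the data behind the table above, and `discretize` is an illustrative name.

```python
def discretize(series, threshold):
    """Map each expression value to 1 (on) if it exceeds the threshold, else 0 (off)."""
    return [1 if value > threshold else 0 for value in series]


# Hypothetical expression levels and per-gene thresholds, for illustration only.
expression = {
    "gene 1": [0.2, 0.3, 1.4, 1.6],
    "gene 2": [0.1, 0.9, 0.8, 0.2],
}
thresholds = {"gene 1": 1.0, "gene 2": 0.5}

bits = {gene: discretize(values, thresholds[gene])
        for gene, values in expression.items()}
print(bits)
```

The result depends entirely on the chosen thresholds, which is one of the weaknesses of this approach.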
Information Theoretic Tools
we define some necessary information theoretic tools:
Shannon entropy of data stream
H(X) = - ∑ pi log(pi)
where pi is the probability that a random element of data stream X is i
(the base of the logarithm can be anything, but must be consistent throughout; usually we use base 2)
Information Theoretic Tools
e.g. Shannon entropy of data streams X and Y
X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
H(X) = - ∑ pi log2(pi)
= -(p(X=0) log2(p(X=0)) + p(X=1) log2(p(X=1)))
= -(0.4 log2(0.4) + 0.6 log2(0.6))
= 0.971
H(Y) = - ∑ pi log2(pi)
= -(0.5 log2(0.5) + 0.5 log2(0.5))
= 1.0
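The two entropies can be checked numerically. This is a minimal sketch using base-2 logarithms and relative frequencies as probability estimates; the helper name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(stream):
    """Shannon entropy (in bits) of a data stream, using relative frequencies."""
    n = len(stream)
    return -sum((count / n) * log2(count / n)
                for count in Counter(stream).values())


X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

print(round(entropy(X), 3))  # 0.971
print(round(entropy(Y), 3))  # 1.0
```

Symbols that never occur are simply absent from the counter, which matches the convention (0)log(0) = 0.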
Information Theoretic Tools
e.g. Shannon joint entropy of data streams X and Y
X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
H(X, Y) = - ∑ pij log2(pij)
= -(p(X=0,Y=0) log2(p(X=0,Y=0)) + p(X=1,Y=0) log2(p(X=1,Y=0))
+ p(X=0,Y=1) log2(p(X=0,Y=1)) + p(X=1,Y=1) log2(p(X=1,Y=1)))
= -(0.1 log2(0.1) + 0.4 log2(0.4)
+ 0.3 log2(0.3) + 0.2 log2(0.2))
= 1.85
Information Theoretic Tools
Define:
Conditional Entropy
H(X|Y) = H(X, Y) – H(Y)
H(Y|X) = H(X, Y) – H(X)
Mutual Information
M(X, Y) = H(Y) - H(Y|X)
= H(X) - H(X|Y)
= H(X) + H(Y) - H(X,Y)
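As a worked check, these quantities can be computed for the example streams X and Y used earlier. The sketch treats the joint stream as a stream of (x, y) pairs and uses the standard identities H(X|Y) = H(X, Y) - H(Y) and M(X, Y) = H(X) + H(Y) - H(X, Y); variable names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(stream):
    """Shannon entropy (in bits) of a stream of hashable symbols."""
    n = len(stream)
    return -sum((count / n) * log2(count / n)
                for count in Counter(stream).values())


X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

h_x = entropy(X)
h_y = entropy(Y)
h_xy = entropy(list(zip(X, Y)))   # joint entropy over (x, y) pairs

h_x_given_y = h_xy - h_y          # H(X|Y)
h_y_given_x = h_xy - h_x          # H(Y|X)
mutual_info = h_x + h_y - h_xy    # M(X, Y)

print(round(h_xy, 3), round(h_x_given_y, 3),
      round(h_y_given_x, 3), round(mutual_info, 3))
```

The mutual information here is small compared with H(Y), so X alone does not determine Y in this example.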
Information Theoretic Tools
It is easy to show that:
Let X be an input data stream
and Y be an output data stream
If M(Y, X) = H(Y)
then X exactly determines Y
Look for pairs (X, Y) where M(Y(t+1), X(t)) = H(Y(t+1))
Identification of the Network Graph
back to the data:
step 1: put data in “state transition table” form
time 1 2 3 4 5 6 1 2 3 1 2 3 1 2
gene A 0 0 1 1 1 1 0 1 1 0 1 1 1 1
gene B 0 0 0 1 0 0 1 0 1 1 0 1 1 1
gene C 0 1 1 0 0 0 0 1 0 1 0 0 1 0
Identification of the Network Graph
state transition table:
step 1: put data in “state transition table” form
Input stream value Output stream value
Ai-1 Bi-1 Ci-1 Ai Bi Ci
0 0 0 0 0 1
0 0 1 1 0 1
0 1 0 0 0 1
0 1 1 1 0 1
1 0 0 1 0 0
1 0 1 1 1 0
1 1 0 1 0 0
1 1 1 1 1 0
Identification of the Network Graph
the state transition table tells us how to get from state i – 1 to state i, as a lookup table
however, it is difficult to discern functional relationships, so…
step 2: use information theoretic tools to discover which inputs determine the outputs
Identification of the Network Graph
step 2a: calculate entropies
note: x^x → 1 as x → 0 from the right, so x log(x) → 0; therefore we take (0)log(0) = 0.
H(Ai) = -((0.25)log(0.25) + (0.75)log(0.75)) = 0.81
H(Bi) = -((0.75)log(0.75) + (0.25)log(0.25)) = 0.81
H(Ci) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
H(Ai-1) = H(Bi-1) = H(Ci-1) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
H(Ai-1, Ci-1) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
Identification of the Network Graph
step 2a: calculate entropies
H(Ai, Ai-1, Ci-1) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
H(Bi, Ai-1, Ci-1) = -((0.25)log(0.25) + (0.25)log(0.25)
+ (0.25)log(0.25) + (0.25)log(0.25)) = 2
H(Ci, Ai-1) = -((0.5)log(0.5) + (0.5)log(0.5)) = 1
Identification of the Network Graph
step 2b: calculate mutual information
M(Ai, [Ai-1, Ci-1]) = H(Ai) + H(Ai-1, Ci-1) - H(Ai, Ai-1, Ci-1)
= 0.81 + 2 – 2
= 0.81
= H(Ai), therefore Ai-1 and Ci-1 determine Ai
M(Bi, [Ai-1, Ci-1]) = H(Bi) + H(Ai-1, Ci-1) - H(Bi, Ai-1, Ci-1)
= 0.81 + 2 – 2
= 0.81
= H(Bi), therefore Ai-1 and Ci-1 determine Bi
M(Ci, Ai-1) = H(Ci) + H(Ai-1) - H(Ci, Ai-1)
= 1 + 1 – 1
= 1
= H(Ci), therefore Ai-1 determines Ci
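Steps 2a and 2b can be automated by searching, for each output gene, for the smallest set of inputs whose mutual information with the output equals the output's entropy. The sketch below uses the eight rows of the state transition table; the function names and the exhaustive search over input subsets are illustrative choices, not necessarily the procedure used in the lecture.

```python
from collections import Counter
from itertools import combinations
from math import log2

NAMES = ("A", "B", "C")
# Rows of the state transition table: (A, B, C) at step i-1, then (A, B, C) at step i.
TABLE = [
    (0, 0, 0, 0, 0, 1),
    (0, 0, 1, 1, 0, 1),
    (0, 1, 0, 0, 0, 1),
    (0, 1, 1, 1, 0, 1),
    (1, 0, 0, 1, 0, 0),
    (1, 0, 1, 1, 1, 0),
    (1, 1, 0, 1, 0, 0),
    (1, 1, 1, 1, 1, 0),
]


def entropy(symbols):
    """Shannon entropy (in bits) of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(out_idx, in_idxs):
    """M(output, inputs) = H(output) + H(inputs) - H(output, inputs)."""
    outputs = [row[3 + out_idx] for row in TABLE]
    inputs = [tuple(row[i] for i in in_idxs) for row in TABLE]
    return entropy(outputs) + entropy(inputs) - entropy(list(zip(outputs, inputs)))


def smallest_determining_inputs(out_idx, tol=1e-9):
    """Smallest input set whose mutual information equals the output entropy."""
    h_out = entropy([row[3 + out_idx] for row in TABLE])
    for k in range(1, 4):
        for in_idxs in combinations(range(3), k):
            if abs(mutual_information(out_idx, in_idxs) - h_out) < tol:
                return tuple(NAMES[i] for i in in_idxs)
    return None


regulators = {NAMES[i]: smallest_determining_inputs(i) for i in range(3)}
print(regulators)
```

The search recovers the same dependencies as the hand calculation: A and B are each determined by the previous values of A and C, and C by the previous value of A alone. The number of candidate input sets grows quickly with the number of genes, so in practice the search is limited to small input sets.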
Identification of the Boolean Circuits
step 3: determine functional relationship between variables (this is simply the truth table)
Ai-1 Ci-1 Ai
0 0 0
0 1 1
1 0 1
1 1 1
Ai = Ai-1 OR Ci-1
Identification of the Boolean Circuits
step 3: determine functional relationship between variables
Ai-1 Ci-1 Bi
0 0 0
0 1 0
1 0 0
1 1 1
Bi = Ai-1 AND Ci-1
Identification of the Boolean Circuits
step 3: determine functional relationship between variables
Ai-1 Ci
0 1
1 0
Ci = NOT Ai-1
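Once the determining inputs are known, the truth tables of step 3 can be read directly off the state transition table and compared with candidate logic functions. This is a minimal sketch; `truth_table` and the comparison against OR, AND and NOT are illustrative.

```python
# Rows of the state transition table: (A, B, C) at step i-1, then (A, B, C) at step i.
TABLE = [
    (0, 0, 0, 0, 0, 1),
    (0, 0, 1, 1, 0, 1),
    (0, 1, 0, 0, 0, 1),
    (0, 1, 1, 1, 0, 1),
    (1, 0, 0, 1, 0, 0),
    (1, 0, 1, 1, 1, 0),
    (1, 1, 0, 1, 0, 0),
    (1, 1, 1, 1, 1, 0),
]


def truth_table(out_col, in_cols):
    """Collapse the transition table onto the chosen input columns."""
    table = {}
    for row in TABLE:
        key = tuple(row[c] for c in in_cols)
        value = row[out_col]
        # setdefault returns the stored value; a mismatch means the inputs
        # do not determine the output.
        if table.setdefault(key, value) != value:
            raise ValueError("inputs do not determine the output")
    return table


# Columns 0..2 are A, B, C at step i-1; columns 3..5 are A, B, C at step i.
table_a = truth_table(3, (0, 2))
table_b = truth_table(4, (0, 2))
table_c = truth_table(5, (0,))

print(all(out == int(a or c) for (a, c), out in table_a.items()))   # A = A OR C
print(all(out == int(a and c) for (a, c), out in table_b.items()))  # B = A AND C
print(all(out == int(not a) for (a,), out in table_c.items()))      # C = NOT A
```

If the chosen inputs did not determine the output, `truth_table` would raise an error instead of silently returning an inconsistent table.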
Problems With This Approach
no theory exists for determining the discretization (threshold) level of gene i
the assumption that genes can be modeled as either ‘on’ or ‘off’ may be sufficient for some genes, but will certainly not be sufficient for all genes
Ignores noise of all kinds (experimental, biological)
Boolean networks are inherently deterministic
Conceptually, the regularity of genetic function and interaction is not due to “hard-wired” logical rules, but rather to the intrinsic self-organizing stability of the dynamical system.
Additionally, we may want to model an open system with inputs (stimuli) that affect the dynamics of the network.
From an empirical viewpoint, the assumption of only one logical rule per gene may lead to incorrect conclusions when inferring these rules from gene expression measurements, as the latter are typically noisy and the number of samples is small relative to the number of parameters to be inferred.
Linear Models
Basic model: weighted sum of inputs
  y_i(t + \Delta t) = \sum_j w_{ji} \, y_j(t) + b_i
  or
  \frac{dy_i}{dt} = \sum_j w_{ji} \, y_j + b_i
Simple network representation:
[Figure: genes g1 to g5 drawn as nodes, connected by weighted edges such as w12, w23 and the self-loop w55]
Only first-order approximation
Parameters of the model: weight matrix containing NxN interaction weights
“Fitting” the model: find the parameters wji, bi such that the model best fits the available data
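One discrete-time update of the linear model, y_i(t+Δt) = Σ_j w_ji y_j(t) + b_i, can be written directly. In this sketch `W[j][i]` holds w_ji, the weight of gene j on gene i; the two-gene weights, biases and expression levels are made-up numbers for illustration.

```python
def linear_step(y, W, b):
    """Return y(t + dt) where y_i(t + dt) = sum_j W[j][i] * y[j] + b[i]."""
    n = len(y)
    return [sum(W[j][i] * y[j] for j in range(n)) + b[i] for i in range(n)]


# Hypothetical two-gene system (all values chosen for illustration only).
y = [1.0, 2.0]            # current expression levels
W = [[0.5, 0.0],          # W[0][i]: influence of gene 0 on gene i
     [0.25, 1.0]]         # W[1][i]: influence of gene 1 on gene i
b = [0.0, 1.0]            # bias terms

y_next = linear_step(y, W, b)
print(y_next)  # [1.0, 3.0]
```

A model with N genes has N×N weights plus N biases, which is why fitting it needs many measurements.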
Underdetermined problem!
Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)!
Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit
No single solution, rather infinite number of parameter settings that will all fit the data equally well
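A tiny example shows why the fit is not unique. With two genes and a single observed transition, two different weight matrices reproduce the data exactly. The numbers are hypothetical, biases are set to zero for brevity, and `W[j][i]` again denotes the weight of gene j on gene i.

```python
def predict(y, W):
    """Linear model without bias: y_i(t + dt) = sum_j W[j][i] * y[j]."""
    n = len(y)
    return [sum(W[j][i] * y[j] for j in range(n)) for i in range(n)]


# One observed transition (hypothetical data).
y_t = [1.0, 1.0]
y_next = [2.0, 0.0]

# Two different explanations of the same data:
W_a = [[2.0, 0.0], [0.0, 0.0]]   # gene 0 activates itself
W_b = [[0.0, 0.0], [2.0, 0.0]]   # gene 1 activates gene 0

print(predict(y_t, W_a) == y_next, predict(y_t, W_b) == y_next)  # True True
```

Any weighted mixture of the two matrices fits equally well, so the data alone cannot tell which gene regulates gene 0.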
Solution 1: reduce N
Rather than trying to model all genes, we can reduce the dimensionality of the problem:
Network of clusters: construct a linear model based on the cluster centroids
  rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136.
  yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), NIPS 12; van Someren et al. (2000), ISMB2000, 355-366.
Network of Principal Components: linear model between “characteristic modes” of the data
  Holter et al. (2001), PNAS 98(4):1693-1698.
Solution 2:
Take advantage of additional information:
  replicates
  accuracy of measurements
  smoothness of time series
  …
Most likely, the network will still be poorly constrained.
Need a method to identify and extract those parts of the model that are well-determined and robust
Danger of Overfitting
The linear model assumes every gene is regulated by all other genes (i.e. full connectivity)
This is the richest model of its kind
  Danger of overfitting the training data
  Will result in poor prediction on new data
  Far from reality: only a few regulators for each gene