Genetic network inference: from co-expression clustering to reverse engineering

Patrik D’haeseleer, Shoudan Liang and Roland Somogyi

The goal of this review

Principles of genetic network organization

Computational methods for extracting network architectures from experimental data

Outline

Introduction

A conceptual approach to complex network dynamics

Inference of regulation through clustering of gene expression data

Modeling methodologies

Gene network inference: reverse engineering

Conclusions and outlook

Genes encode proteins, some of which in turn regulate other genes.

Goal: determine the structure of this intricate network of genetic regulatory interactions.

Traditional approach: local

Examining and collecting data on a single gene, a single protein or a single reaction at a time

functional genomics

Functional Genomics

Specifically, functional genomics refers to the development and application of global experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics.

High-throughput, large-scale experimental methodologies combined with statistical and computational analysis of the results.

Functional Genomics (Cont.)

We need to define the mapping from sequence space to functional space.

Intermediate representation

Focus at the level of single cells

A biological system can be considered to be a state machine, where the change in internal state of the system depends on both its current internal state and any external inputs.

The goal

Observe the state of a cell and how it changes under different circumstances, and from this derive a model of how these state changes are generated

The state of a cell

All those variables determining its behavior

Example

A simple, 6-node regulatory network (figure not reproduced)


The global gene expression pattern is the result of the collective behavior of individual regulatory pathways

Gene function depends on its cellular context; thus understanding the network as a whole is essential.

Boolean Networks

Each gene is considered as a binary variable, either ON or OFF, regulated by other genes through logical or Boolean functions.

Even with this simplification, the network behavior is already extremely rich.
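A minimal sketch of a Boolean network with synchronous updates may make the idea concrete. The three genes A, B, C and their logic rules are hypothetical, chosen only for illustration; they are not taken from the review.

```python
# Hypothetical 3-gene wiring, chosen only for illustration.
RULES = {
    "A": lambda s: s["C"],                 # A copies C
    "B": lambda s: s["A"] and not s["C"],  # B needs A on and C off
    "C": lambda s: not s["B"],             # C is repressed by B
}

def step(state):
    """Synchronously update every gene from the current global state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def trajectory(state):
    """Iterate until a state repeats.

    Returns the visited states in order and the first repeated state.
    The state space is finite, so the loop always terminates.
    """
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen, state

if __name__ == "__main__":
    start = {"A": False, "B": False, "C": False}
    visited, repeated = trajectory(start)
    for s in visited:
        print(s)
    print("repeats at:", repeated)
```

Even this toy wiring has more than one attractor (a fixed point and a two-state cycle, depending on the starting state), which hints at how rich the behavior of larger networks can be.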

Boolean Networks (Cont.)

Cell differentiation corresponds to transitions from one global gene expression pattern to another.

Scoring methods

Whether there has been a significant change at any one condition

Whether there has been a significant aggregate change over all conditions

Whether the fluctuation pattern shows high diversity according to Shannon entropy
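As a rough sketch of the entropy-based score, the snippet below bins a gene's expression values into a few discrete levels and computes the Shannon entropy of the resulting distribution. The equal-width binning, the default of three bins, and the function name `shannon_entropy` are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(values, n_bins=3):
    """Entropy (in bits) of an expression series after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a flat profile has no diversity
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())
```

A gene whose values spread evenly over the bins scores high, while a gene that stays in one bin for almost all conditions scores close to zero.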

Guilt By Association

Select a gene

Determine its nearest neighbors in expression space within a certain user-defined distance cut-off

Clustering

Extract groups of genes that are tightly co-expressed over a range of different experiments.

Caution

Different clustering methods can have very different results

It’s not yet clear which clustering methods are most useful for gene expression analysis.

Definition: Gene Expression Profile

An expression profile $e_j$ of an ordered list of $N$ samples ($k = 1$ to $N$) for a particular gene $j$ is a vector of scaled expression values $v_{jk}$.

The expression profile is: $e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$

Definition: Gene Expression Profile (Cont.)

A difference between two genes $p$ and $q$ may be estimated as the $N$-dimensional metric “distance” between $e_p$ and $e_q$.

Euclidean distance: $d_{pq} = \sqrt{\sum_{k=1}^{N} (v_{pk} - v_{qk})^2}$
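The Euclidean distance between two profiles, and the guilt-by-association search described earlier (nearest neighbors of a selected gene within a user-defined cut-off), can be sketched as follows. The gene names and expression values are made up for illustration, and `neighbors` is a hypothetical helper name.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def neighbors(profiles, query, cutoff):
    """Genes whose profile lies within `cutoff` of the query gene, nearest first."""
    target = profiles[query]
    hits = [(euclidean(target, prof), gene)
            for gene, prof in profiles.items() if gene != query]
    return [gene for dist, gene in sorted(hits) if dist <= cutoff]

# Made-up scaled expression values over three samples.
profiles = {
    "g1": [0.1, 0.9, 0.8],
    "g2": [0.2, 0.8, 0.9],
    "g3": [0.9, 0.1, 0.2],
}
```

With these values, `neighbors(profiles, "g1", 0.5)` keeps only `g2`; widening the cut-off eventually admits `g3` as well, which shows how strongly the result depends on the user-chosen threshold.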





Clustering algorithms

Non-hierarchical methods

Cluster N objects into K groups in an iterative process until certain goodness criteria are optimized

E.g. K-means
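A minimal K-means sketch (Lloyd's algorithm) shows the iterative assign-then-update loop. Seeding the centroids with the first K points is an arbitrary choice made here so the example is deterministic; practical implementations usually use random or smarter initialization.

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The goodness criterion being reduced here is the within-cluster sum of squared distances; the loop stops at a local optimum, so different seeds can give different partitions.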

Clustering algorithms

Hierarchical methods

Return a hierarchy of nested clusters, where each cluster typically consists of the union of two or more smaller clusters.

Agglomerative methods: start with single-object clusters and recursively merge them into larger clusters

Divisive methods: start with the cluster containing all objects and recursively divide it into smaller clusters
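The agglomerative variant can be sketched in a few lines. This version uses single linkage (distance between the closest members of two clusters), which is one of several possible linkage rules and is assumed here only for illustration.

```python
import math

def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history.

    Each entry is (cluster_a, cluster_b, distance), with clusters given as
    tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges
```

The merge history is the nested hierarchy: reading it from first to last entry reproduces the tree from the leaves up to the single cluster containing all objects.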

Other applications of co-expression clusters

Extraction of regulatory motifs: genes in the same expression cluster share biological functions

Inference of functional annotation: functions of unknown genes may be hypothesized from genes with known function within the same cluster

As a molecular signature in distinguishing cell or tissue types: mRNA expression

Which clustering method to use?

There is no single best criterion for obtaining a partition because no precise and workable definition of ‘cluster’ exists.

Clusters can be of arbitrary shapes and sizes in a multidimensional pattern space.

Challenge in cluster analysis

A gene could be a member of several clusters, each reflecting a particular aspect of its function and control

Solutions

Clustering methods that partition genes into non-exclusive clusters

Several clustering methods could be used simultaneously

Level of biochemical detail

Abstract: Boolean networks

Concrete: full biochemical interaction models with stochastic kinetics, as in Arkin et al. (1998)

Forward and inverse modeling

Forward modeling approach

Inverse modeling, or reverse engineering

Given an amount of data, what can we deduce about the unknown underlying regulatory network?

Requires the use of a parametric model, the parameters of which are then fit to the real-world data.

Goal of network inference

Construct a coarse-scale model of the network of regulatory interactions between the genes

It’s possible to reverse engineer a network from its activity profiles

Data requirements

To infer how a gene is regulated, we need to observe the expression of that gene under many different combinations of expression levels of its regulatory inputs

Use data from different sources

Deal with different data types

Estimates for network models

A sparse network model of N genes, where each gene is only affected by K other genes on average.

A sparsely connected, directed graph with N nodes and NK edges.

Estimates for network models (Cont.)

To specify the correct model, we need $NK \log(N/K)$ bits of information.
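One way to get a feel for this estimate is to compare it with a direct count, under the assumption that every gene has exactly K regulators chosen from the N genes. Picking K inputs out of N for each of N genes takes $N \log_2 \binom{N}{K}$ bits, which grows like $NK \log(N/K)$ when K is much smaller than N. The function names and the choice of base-2 logarithms are illustrative assumptions.

```python
import math

def bits_exact(n, k):
    """Bits to choose k regulators out of n for each of n genes: n * log2(C(n, k))."""
    return n * math.log2(math.comb(n, k))

def bits_approx(n, k):
    """The NK log(N/K) scaling, using base-2 logarithms."""
    return n * k * math.log2(n / k)

if __name__ == "__main__":
    for n, k in [(100, 2), (1000, 3), (10000, 5)]:
        print(n, k, round(bits_exact(n, k)), round(bits_approx(n, k)))
```

The two quantities differ by a term proportional to NK, so they agree on how the requirement scales rather than on the exact number of bits.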

Correlation Metric Construction

Adam Arkin and John Ross

A method to reconstruct reaction networks from measured time series of the component chemical species.

The system is driven using inputs for some of the chemical species, and the concentration of all the species is monitored over time.

Correlation Metric Construction (Cont.)

The time-lagged correlation matrix is calculated

From this a distance matrix is constructed based on the maximum correlation between any two chemical species

This distance matrix is then fed into a simple clustering algorithm to generate a tree of connections between the species

The results are mapped into a two-dimensional graph for visualization
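The first two steps (time-lagged correlation, then a distance based on the maximum correlation) can be sketched as below. Using Pearson correlation over a symmetric window of lags, and the transform $d = \sqrt{2(1 - c)}$ from maximum absolute correlation $c$ to distance, are assumptions made for illustration; the slides do not give the exact formula.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def max_lagged_corr(x, y, max_lag):
    """Largest |correlation| between x(t) and y(t + lag) for lag in -max_lag..max_lag."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        if len(a) >= 3:  # skip overlaps too short to correlate
            best = max(best, abs(pearson(a, b)))
    return best

def cmc_distance(x, y, max_lag):
    """Assumed transform: strongly correlated species end up close together."""
    c = min(max_lagged_corr(x, y, max_lag), 1.0)  # guard against rounding above 1
    return math.sqrt(2.0 * (1.0 - c))
```

Computing `cmc_distance` for every pair of species gives the distance matrix that a hierarchical clustering routine would then turn into a tree of connections.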

Additive regulation models

Property: the regulatory inputs are combined using a weighted sum

Can be used as a first-order approximation to the gene network

Additive regulation models

The change in each variable over time is given by a weighted sum of all other variables:

$\frac{dy_i}{dt} = \sum_j w_{ij} y_j + b_i$

$y_i$ is the level of the i-th variable

$b_i$ is a bias term indicating whether i is expressed or not in the absence of regulatory inputs

$w_{ij}$ represents the influence of j on the regulation of i
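A short forward simulation shows how such a model behaves, assuming it has the form dy_i/dt = sum_j w_ij y_j + b_i. The forward-Euler integrator, the step size, and the function name `simulate` are illustrative choices, not part of the original material.

```python
def simulate(w, b, y0, dt=0.01, n_steps=1000):
    """Forward-Euler integration of dy_i/dt = sum_j w[i][j] * y[j] + b[i].

    Returns the list of state vectors, starting with y0.
    """
    y = list(y0)
    history = [list(y)]
    for _ in range(n_steps):
        dy = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) + b_i
              for row, b_i in zip(w, b)]
        y = [y_i + dt * d for y_i, d in zip(y, dy)]
        history.append(list(y))
    return history
```

For example, with `w = [[-1.0, 0.0], [1.0, -1.0]]` and `b = [1.0, 0.0]` (a made-up two-gene system in which gene 0 is constitutively expressed and activates gene 1, and both decay), the trajectory from `[0.0, 0.0]` settles near the steady state where both levels equal 1.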




Use of such models

We can infer regulatory interactions directly from the data, by fitting these simple network models to large-scale gene expression data.
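As a sketch of what such a fit could look like for an additive model of the form dy_i/dt = sum_j w_ij y_j + b_i, the code below estimates the weights and biases by ordinary least squares from an evenly sampled time series. Least squares, the finite-difference estimate of the derivative, and the helper names `solve` and `fit_additive` are assumptions for illustration; the slides do not prescribe a fitting procedure.

```python
def solve(a, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_additive(history, dt):
    """Least-squares estimate of (w, b) from state vectors sampled every dt."""
    n = len(history[0])
    # Regressors: the state at time t plus a constant 1 for the bias term.
    xs = [list(state) + [1.0] for state in history[:-1]]
    # Targets: finite-difference estimate of dy/dt at each time point.
    ys = [[(nxt[i] - cur[i]) / dt for i in range(n)]
          for cur, nxt in zip(history[:-1], history[1:])]
    # Normal equations (X^T X) theta = X^T y, solved once per target gene.
    xtx = [[sum(x[p] * x[q] for x in xs) for q in range(n + 1)]
           for p in range(n + 1)]
    w, b = [], []
    for i in range(n):
        xty = [sum(x[p] * y[i] for x, y in zip(xs, ys)) for p in range(n + 1)]
        theta = solve(xtx, xty)
        w.append(theta[:n])
        b.append(theta[n])
    return w, b
```

The fit is only well determined when the data contain more informative time points than unknowns per gene (N weights plus one bias), which is one reason sparse models and data from many different conditions matter.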

Conclusion

Conceptual foundations for understanding complex biological networks

Several practical methods for data analysis