discovering larger network motifs li chen 4/16/2009 csc 8910 analysis of biological network, spring...

DISCOVERING LARGER NETWORK MOTIFS

Li Chen, 4/16/2009

CSC 8910 Analysis of Biological Network, Spring 2009

Dr. Yi Pan

THE REVIEW ON MODELS AND ALGORITHMS FOR MOTIF DISCOVERY IN PROTEIN-PROTEIN INTERACTION NETWORKS


Two distinct definitions of a motif based on frequency and statistical significance

Definition 1: a motif is a sub-graph that appears more than a threshold number of times.

Definition 2: a motif is a sub-graph that appears more often than expected by chance (over-represented motif).


Two characteristics used to evaluate a motif

Frequency:

1. Arbitrary overlaps of nodes and edges (non-identical case)

2. Only overlaps of nodes (edge-disjoint case)

3. No overlaps (edge and vertex-disjoint case)


Statistical Significance: compares the obtained values of the frequencies for the observed and random networks.

1. Z-score

2. Abundance
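The two significance measures can be sketched as follows. This is a minimal illustration, assuming the commonly used definitions: the Z-score is the observed frequency minus the mean frequency in random networks, divided by the standard deviation of the random frequencies, and the abundance is the normalized difference (f_obs - mean_rand) / (f_obs + mean_rand + eps). The function names, the use of the population standard deviation, the default eps = 4 and the sample counts are assumptions made for this example, not values taken from the slides.

```python
from statistics import mean, pstdev


def z_score(observed, random_counts):
    """Z-score of an observed sub-graph frequency against random-network frequencies."""
    mu = mean(random_counts)
    sigma = pstdev(random_counts)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("random counts have zero spread; the Z-score is undefined")
    return (observed - mu) / sigma


def abundance(observed, random_counts, eps=4.0):
    """Normalized difference between observed and mean random frequency.

    eps keeps the value small when both frequencies are small; 4.0 is an assumed default.
    """
    mu = mean(random_counts)
    return (observed - mu) / (observed + mu + eps)


# Hypothetical frequencies of one sub-graph in three randomized networks.
random_counts = [8, 10, 12]
print(z_score(20, random_counts))
print(abundance(20, random_counts))
```

A large positive Z-score or an abundance close to 1 suggests the sub-graph is over-represented in the observed network.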

Models of Random Graphs

Preserves the same degree distribution as the biological network

Preserves the degree sequence (search for n-node motifs)

Based on geometric random networks and a Poisson distribution of the degree

Incorporate node clustering into model
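One common way to obtain a random network with the same degree sequence is repeated double-edge swapping. The sketch below is an illustration of that general technique, not necessarily the procedure used in the reviewed paper; the function names and the retry limit are assumptions. It expects a simple undirected graph given as a list of at least two distinct edges without self-loops.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomize a simple undirected graph by double-edge swaps.

    Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    so every node keeps its degree.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # assumed cap so the loop always terminates
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1 = tuple(sorted((a, d)))
        new2 = tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Hypothetical small network.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
shuffled = degree_preserving_shuffle(edges, n_swaps=20, seed=1)
print(degrees(edges) == degrees(shuffled))
```

Counting a candidate sub-graph in many such shuffled networks gives the random frequencies needed for the significance measures.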

3. Compact Topological Motifs: introduces a compact graph representation obtained by grouping together maximal sets of nodes that are ‘indistinguishable’.

The graph on the left shows the sets U1 and U2 as compact nodes and U1U2 as a compact edge.


Motif Discovery Algorithm

Exact algorithms for motifs with a small number of nodes

1. Exhaustive Recursive Search (ERS): the input network is represented by an adjacency matrix M. (motif size <= 4)

2. ESU: starts with individual nodes and adds one node at a time until the required size k is reached. (motif size <= 14)
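The node-by-node growth used by ESU can be sketched as below. This is a minimal illustration under stated assumptions: the graph is an undirected adjacency dictionary with comparable node labels, and the function name esu is hypothetical. A sub-graph rooted at node v is only extended with nodes labelled above v that are not already adjacent to the current sub-graph, which is what lets each connected k-node set be reported exactly once.

```python
def esu(adj, k):
    """Enumerate every connected k-node vertex set of an undirected graph exactly once."""
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            # Nodes already in the sub-graph or adjacent to it.
            closed = set(sub)
            for u in sub:
                closed |= adj[u]
            # Exclusive neighbours of w: labelled above the root and not in `closed`.
            new_ext = ext | {u for u in adj[w] if u > root and u not in closed}
            extend(sub | {w}, new_ext, root)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found


# Hypothetical network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for nodes in esu(adj, 3):
    print(sorted(nodes))
```

Grouping the reported vertex sets into isomorphism classes, which the sketch leaves out, would then give the frequency of each candidate motif.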


Approximate Algorithms

1. Search Algorithm Based on Sampling (MFINDER): it picks edges of the input graph at random until a set of k nodes is obtained, which gives a sample sub-graph, and assigns weights to the samples to correct for the non-uniform sampling. It scales well with large networks, but does not scale well with large motifs.


2. Rand-ESU: unlike MFINDER, it does not need to compute the weights of all samples. ESU builds a tree whose leaves correspond to sub-graphs of size k, while internal nodes correspond to sub-graphs of size 1 up to k-1, depending on the tree level. Rand-ESU assigns to each level in the tree a probability that its nodes are explored further, so as to guarantee that all leaves are visited with uniform probability.
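The level-wise sampling can be sketched as a small variation of the ESU recursion. In this illustrative version, probs[d-1] is the assumed probability of exploring a tree node that holds a sub-graph of size d, so every leaf is reached with the same probability, the product of all entries. The function name rand_esu, the parameter layout and the example probabilities are assumptions made for this sketch.

```python
import random


def rand_esu(adj, k, probs, seed=None):
    """Sample connected k-node vertex sets; every leaf is kept with probability prod(probs).

    Returns the sampled vertex sets and an estimate of the total number of such sets.
    """
    if len(probs) != k:
        raise ValueError("need one probability per tree level (k values)")
    rng = random.Random(seed)
    samples = []

    def extend(sub, ext, root):
        if len(sub) == k:
            samples.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            closed = set(sub)
            for u in sub:
                closed |= adj[u]
            new_ext = ext | {u for u in adj[w] if u > root and u not in closed}
            # The child holds len(sub) + 1 nodes, so its probability is probs[len(sub)].
            if rng.random() < probs[len(sub)]:
                extend(sub | {w}, new_ext, root)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)

    scale = 1.0
    for p in probs:
        scale *= p
    estimate = len(samples) / scale if scale > 0 else 0.0
    return samples, estimate


# Hypothetical network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
samples, estimate = rand_esu(adj, 3, probs=[1.0, 1.0, 0.5], seed=7)
print(len(samples), estimate)
```

With all probabilities set to 1.0 the sketch reduces to full enumeration, which is a convenient sanity check.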


3. NeMoFINDER: combines approaches from the data mining and computational biology communities. It searches for repeated trees and extends them to sub-graphs. This reduces the computation time for the discovery of larger motifs, but at the cost of missing some potentially interesting sub-graphs.


4. Sub-graph Counting by Scalar Computation: it characterizes a biological network by a set of measures based on scalars and functionals of the adjacency matrix associated with the network. Its advantages are mathematical elegance and computational efficiency.
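A standard example of a sub-graph count expressed as a scalar of the adjacency matrix is the number of triangles in a simple undirected graph, which equals trace(A^3) / 6. The sketch below shows only this one well-known identity; the slides do not say which measures the method actually uses, and the helper names are assumptions.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)] for i in range(n)]


def count_triangles(A):
    """Number of triangles in a simple undirected graph with 0/1 adjacency matrix A.

    Each triangle contributes 6 closed walks of length 3 (3 start nodes, 2 directions).
    """
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 6


# Hypothetical network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(count_triangles(A))
```

No sub-graph is enumerated explicitly here, which is where the computational efficiency of this family of methods comes from.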


5. A-priori-based Motif Detection: the basic idea is that if a sub-graph is frequent, so are all its sub-graphs. It builds candidate motifs of size k by joining motifs of size k-1 and then evaluates their frequency.

A ROADMAP OF CLUSTERING ALGORITHMS IN BIOINFORMATICS APPLICATIONS


Desirable features of clustering algorithms to evaluate

Scalability

Robustness

Order insensitivity

Minimum user-specified input

Mixed data types

Arbitrary-shaped clusters

Point proportion admissibility: Duplicating data and re-clustering should not alter the results.


Six categories of clustering algorithms

Partitioning Clustering Algorithm

Hierarchical Clustering Algorithm

Grid-based Clustering Algorithm

Density-based Clustering Algorithm

Model-based Clustering Algorithm

Graph-based Clustering Algorithm


Partition Clustering Algorithm

Numerical Methods

1. K-means algorithm and Farthest First Traversal k-center (FFT) algorithm

2. K-medoids or PAM (Partitioning Around Medoids)

3. CLARA (Clustering Large Applications)

4. CLARANS (Clustering Large Applications Based upon Randomized Search) and Fuzzy K-means


Discrete Methods

1. K-modes

2. Fuzzy K-modes

3. Squeezer and COOLCAT.

Mixed Discrete and Numerical Clustering Methods

1. K-prototypes


Hierarchical Clustering Algorithm

Divide the data into a tree of nodes, where each node represents a cluster.

Two categories based on methods or purposes

1. Agglomerative vs. Divisive

2. Single vs. Complete vs. Average linkage
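The agglomerative strategy and the linkage choices can be sketched with a naive implementation. This is an illustrative example for one-dimensional points only, assuming absolute difference as the distance; the function name and the data are hypothetical. Passing min as the linkage gives single linkage and max gives complete linkage.

```python
def agglomerative(points, n_clusters, linkage=min):
    """Naive agglomerative clustering of 1-D points.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters. linkage=min is single linkage, linkage=max is complete linkage.
    Returns the final clusters and the list of merges (left, right, distance).
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # new list, earlier merge records stay intact
        del clusters[j]
    return clusters, merges


# Hypothetical measurements.
clusters, merges = agglomerative([1, 2, 10, 11, 30], n_clusters=3)
print(clusters)
```

The recorded merges, read from first to last, form the tree from the leaves up; a divisive method would build the same kind of tree from the root down.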


Popular: data in nature often have several levels of subsets

Drawbacks:

1. Slow

2. Not tolerant of errors

3. Information losses when moving the levels

Two kinds of methods

1. Numerical Methods: BIRCH, CURE, Spectral clustering

2. Discrete Methods: ROCK, Chameleon, LIMBO


Grid-based Clustering Algorithm

Form a grid structure of cells from the input data. Each data point is then assigned to a cell of the grid.

STING combines a numerical grid-based clustering method with a hierarchical method
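The cell-assignment step can be sketched as follows. This is a minimal illustration for two-dimensional points with square cells; it is not STING, and the function names, the cell size and the density threshold are assumptions made for the example.

```python
from collections import defaultdict


def grid_cells(points, cell_size):
    """Assign 2-D points to square grid cells; returns {cell index: [points]}."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return dict(cells)


def dense_cells(cells, min_points):
    """Indices of cells that hold at least min_points points."""
    return {cell for cell, pts in cells.items() if len(pts) >= min_points}


# Hypothetical data.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (2.5, 2.5), (5.1, 0.2)]
cells = grid_cells(points, cell_size=1.0)
print(dense_cells(cells, min_points=2))
```

Once points are binned, later steps work on cell summaries rather than individual points, so their cost depends on the number of cells rather than the number of data points.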


Density-based Clustering Algorithm

Use a local density standard

Clusters are dense subspaces separated by low density spaces

Example of a bioinformatics application: finding the densest subspaces in interactome (protein-protein interaction) networks


DBSCAN, OPTICS, DENCLUE, WaveCluster, CLIQUE use numerical values for clustering

SEQOPTICS is used for sequence clustering

HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), Projected (subspace) clustering, CACTUS, STIRR, CLICK, CLOPE use discrete values for clustering


Model-based Clustering Algorithm

Uses a model, often derived from a statistical distribution

Bioinformatics applications

1. gene expression

2. interactomes

3. sequences


Numerical model-based methods

1. Self-Organizing Maps

Discrete model-based clustering algorithm

1. COBWEB

Numerical and discrete model-based clustering methods

1. BILCOM (Bi-level clustering of Mixed Discrete and Numerical Biomedical Data), using an empirical Bayesian approach


Examples

1. Gene expression clustering

2. Protein sequence clustering

3. AutoClass

4. SVM Clustering methods

Graph-based Clustering Algorithm

Applied to interactomes for complex prediction and to sequence networks


Examples:

1. MCODE (Molecular Complex Detection)

2. SPC (Super Paramagnetic Clustering)

3. RNSC (Restricted Neighborhood Search Clustering)

4. MCL (Markov Clustering)

5. TribeMCL

6. CD-HIT

7. ProClust

8. BAG algorithms

Usage in Bioinformatics Applications

Gene expression clustering

1. K-means algorithm

2. Hierarchical algorithm

3. SOMs

Interactomes

1. AutoClass

2. SVM clustering

3. COBWEB

4. MULIC

Sequence clustering

1. Hierarchical clustering algorithm

REFERENCES

[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058+, February 2009.

[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy and Lossless Compression by Motif Pattern Discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.

[3] Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.

[4] Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International Conference on, 0:549, 2003.

[5] Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.

[6] Laxmi Parida. Discovering Topological Motifs Using a Compact Notation. Journal of Computational Biology, 14(3):300-323, 2007.

Thank you so much!
