DISCOVERING LARGER NETWORK MOTIFS

Li Chen

4/16/2009

CSC 8910 Analysis of Biological Network, Spring 2009

Dr. Yi Pan

THE REVIEW ON MODELS AND ALGORITHMS FOR MOTIF DISCOVERY IN PROTEIN-PROTEIN INTERACTION NETWORKS

Two distinct definitions of a motif based on frequency and statistical significance

Definition 1: a motif is a sub-graph that appears more than a threshold number of times.

Definition 2: a motif is a sub-graph that appears more often than expected by chance. (over-represented motif)

Two characteristics used to evaluate a motif

Frequency:

1. Arbitrary overlaps of nodes and edges (non-identical case)

2. Only overlaps of nodes (edge-disjoint case)

3. No overlaps (edge and vertex-disjoint case)
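
The three counting conventions can give different frequencies for the same motif. The sketch below is only an illustration, assuming an undirected graph given as an edge list and using the triangle as the motif. The function names are hypothetical, and the greedy selection gives a lower bound on the two disjoint counts rather than the true maximum, which would require solving a maximum independent set problem on the overlapping occurrences.

```python
from itertools import combinations


def triangles(edges):
    """Return every triangle of an undirected graph as a frozenset of 3 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:  # common neighbours close a triangle
            found.add(frozenset((u, v, w)))
    return sorted(found, key=sorted)


def greedy_disjoint(occurrences, share_nodes):
    """Greedily keep occurrences that do not overlap the ones already kept.

    share_nodes=True  -> edge-disjoint case (nodes may be shared)
    share_nodes=False -> edge- and vertex-disjoint case
    """
    used = set()
    kept = []
    for occ in occurrences:
        if share_nodes:
            # A triangle uses every pair of its nodes as an edge.
            items = {frozenset(p) for p in combinations(sorted(occ), 2)}
        else:
            items = set(occ)
        if not (items & used):
            used |= items
            kept.append(occ)
    return kept


# Bow-tie graph: two triangles that share node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
occ = triangles(edges)
f1 = len(occ)                                      # arbitrary overlaps
f2 = len(greedy_disjoint(occ, share_nodes=True))   # edge-disjoint
f3 = len(greedy_disjoint(occ, share_nodes=False))  # edge- and vertex-disjoint
print(f1, f2, f3)
```

On the bow-tie graph the two triangles share a node but no edge, so the first two counts agree and only the vertex-disjoint count drops.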

Statistical Significance: compares the frequency of a sub-graph in the observed network with its frequency in random networks.

1. Z-score

2. Abundance
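
Both measures can be computed from the observed frequency and a list of frequencies measured on randomized networks. The following is a minimal sketch under assumptions: the function names are hypothetical, the population standard deviation is used for the Z-score, and the abundance follows the commonly used form (f_obs - mean) / (f_obs + mean + eps), where `eps` is an assumed small constant that keeps the value stable when both frequencies are small.

```python
from statistics import mean, pstdev


def z_score(f_observed, f_random):
    """Z-score of the observed frequency against frequencies in random networks."""
    mu = mean(f_random)
    sigma = pstdev(f_random)
    if sigma == 0:
        raise ValueError("random frequencies have zero variance")
    return (f_observed - mu) / sigma


def abundance(f_observed, f_random, eps=4.0):
    """Normalized difference between observed and mean random frequency.

    eps is an assumption of this sketch, not a value taken from the slides.
    """
    mu = mean(f_random)
    return (f_observed - mu) / (f_observed + mu + eps)


z = z_score(20, [8, 12])
a = abundance(20, [8, 12], eps=0.0)
print(z, a)
```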

Models of Random Graphs

Preserve the same degree distribution as the biological network

Preserve the degree sequence (search for n-node motifs)

Based on geometric random networks and a Poisson distribution of the degree

Incorporate node clustering into the model
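
A common way to build a random network that preserves the degree sequence is repeated edge swapping. The sketch below assumes an undirected simple graph given as an edge list; the function names and the retry limit are illustrative choices, not taken from the reviewed paper.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)


def rewire(edges, n_swaps, rng):
    """Randomize a simple undirected graph while keeping every node's degree.

    Each swap replaces edges (a, b) and (c, d) by (a, d) and (c, b),
    rejecting swaps that would create a self-loop or a duplicate edge.
    """
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible pairings at random
        if a == d or c == b:
            continue  # would create a self-loop
        e1 = tuple(sorted((a, d)))
        e2 = tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_list


original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
randomized = rewire(original, n_swaps=20, rng=random.Random(0))
print(degrees(original) == degrees(randomized))
```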

3. Compact Topological Motifs: introduces a compact graph representation obtained by grouping together maximal sets of nodes that are ‘indistinguishable’.

The graph on the left shows the sets U1 and U2 as compact nodes and U1U2 as a compact edge.

Motif Discovery Algorithm

Exact algorithms for motifs with a small number of nodes

1. Exhaustive Recursive Search (ERS): the input network is represented by an adjacency matrix M. (motif size <= 4)

2. ESU: starts with individual nodes and adds one node at a time until the required size k is reached. (motif size <= 14)
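
The node-by-node growth used by ESU can be sketched as follows. This is a minimal version, assuming an undirected graph stored as a dictionary of neighbor sets with comparable (here integer) node labels; the function name `esu` is hypothetical. A node is only added if its label is larger than the start node and it is not already adjacent to the current sub-graph, which is what makes every connected node set appear exactly once.

```python
def esu(adj, k):
    """Enumerate every connected k-node set of an undirected graph once."""
    results = []

    def extend(sub, ext, v):
        if len(sub) == k:
            results.append(frozenset(sub))
            return
        ext = set(ext)
        # Nodes in the sub-graph plus all of their neighbours.
        closed = sub | set().union(*(adj[s] for s in sub))
        while ext:
            w = ext.pop()
            # Exclusive neighbours of w: larger than v, not yet reachable.
            new_ext = ext | {u for u in adj[w] if u > v and u not in closed}
            extend(sub | {w}, new_ext, v)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return results


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
subgraphs = esu(adj, 3)
print(len(subgraphs))
```

Classifying each enumerated node set by isomorphism type, which is needed to turn the enumeration into motif frequencies, is left out of this sketch.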

Approximate Algorithms

1. Search Algorithm Based on Sampling (MFINDER): it picks edges of the input graph at random until a set of k nodes is obtained, which gives a sample sub-graph, and assigns weights to the samples to correct for the non-uniform sampling. It scales well with large networks, but does not scale well with large motifs.
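
The edge-sampling step alone can be sketched as below. This is an assumption-laden illustration, not the MFINDER implementation: the function name is hypothetical, the graph is an undirected edge list, and the weighting step that corrects for non-uniform sampling is deliberately omitted, so raw counts from this sampler are biased.

```python
import random


def sample_subgraph(edges, k, rng):
    """Grow a connected k-node set by repeatedly picking a random edge.

    Returns None if the component containing the first edge has fewer
    than k nodes. The correction weights are NOT computed here.
    """
    nodes = set(rng.choice(edges))
    while len(nodes) < k:
        # Edges with exactly one endpoint inside the current node set.
        frontier = [e for e in edges if (e[0] in nodes) != (e[1] in nodes)]
        if not frontier:
            return None
        nodes.update(rng.choice(frontier))
    return frozenset(nodes)


path_edges = [(0, 1), (1, 2), (2, 3)]
sample = sample_subgraph(path_edges, 3, random.Random(1))
print(sorted(sample))
```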

2. Rand-ESU: unlike MFINDER, it does not need to compute weights for the samples. ESU builds a tree whose leaves correspond to sub-graphs of size k, while internal nodes correspond to sub-graphs of size 1 up to k-1, depending on the tree level. Rand-ESU assigns to each level of the tree a probability that its nodes are explored further, so as to guarantee that all leaves are visited with uniform probability.
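
The per-level probabilities can be added to an ESU-style enumeration as in the sketch below. It assumes an undirected graph stored as a dictionary of neighbor sets with integer labels; the function name and the `probs` parameter (one probability per tree level, from level 1 to level k) are hypothetical. Because every leaf is reached with the same probability, the product of `probs`, dividing the number of sampled leaves by that product gives an estimate of the total count.

```python
import random


def rand_esu(adj, k, probs, rng):
    """Sample connected k-node sets; return (samples, estimated total count)."""
    samples = []

    def extend(sub, ext, v):
        if len(sub) == k:
            samples.append(frozenset(sub))
            return
        ext = set(ext)
        closed = sub | set().union(*(adj[s] for s in sub))
        while ext:
            w = ext.pop()
            # Descend to the next level only with that level's probability.
            if rng.random() >= probs[len(sub)]:
                continue
            new_ext = ext | {u for u in adj[w] if u > v and u not in closed}
            extend(sub | {w}, new_ext, v)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)

    reach = 1.0
    for p in probs:
        reach *= p  # probability that any given leaf is visited
    return samples, len(samples) / reach


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
samples, estimate = rand_esu(adj, 3, [1.0, 1.0, 1.0], random.Random(0))
print(len(samples), estimate)
```

With all probabilities set to 1.0 the sampler reduces to the exact enumeration, which is a convenient sanity check.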

3. NeMoFINDER: combines approaches from the data mining and computational biology communities. It searches for repeated trees and extends them to sub-graphs. This reduces the computation time for discovering larger motifs, but at the cost of missing some potentially interesting sub-graphs.

4. Sub-graph Counting by Scalar Computation: it characterizes a biological network by a set of measures based on scalars and functionals of the adjacency matrix associated with the network. Its advantages are mathematical elegance and computational efficiency.
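
A standard example of a scalar computed from the adjacency matrix is the triangle count, which equals trace(A^3) / 6 for an undirected simple graph. The sketch below illustrates the idea only; it is not claimed to be one of the specific measures used in the reviewed work, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(adj_matrix):
    """Number of triangles in an undirected simple graph: trace(A^3) / 6."""
    a3 = matmul(matmul(adj_matrix, adj_matrix), adj_matrix)
    # Each triangle contributes 6 closed walks of length 3 (3 starts x 2 directions).
    return sum(a3[i][i] for i in range(len(a3))) // 6


# Complete graph on 4 nodes.
k4 = [[0, 1, 1, 1],
      [1, 0, 1, 1],
      [1, 1, 0, 1],
      [1, 1, 1, 0]]
print(count_triangles(k4))
```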

5. A-priori-based Motif Detection: the basic idea is that if a sub-graph is frequent, so are all its sub-graphs. It builds candidate motifs of size k by joining motifs of size k-1 and then evaluates their frequency.

A ROADMAP OF CLUSTERING ALGORITHM IN BIOINFORMATICS APPLICATIONS

Desirable features of clustering algorithms to evaluate

Scalability

Robustness

Order insensitivity

Minimum user-specified input

Mixed data types

Arbitrary-shaped clusters

Point proportion admissibility: Duplicating data and re-clustering should not alter the results.

Six categories of clustering algorithms

Partitioning Clustering Algorithm

Hierarchical Clustering Algorithm

Grid-based Clustering Algorithm

Density-based Clustering Algorithm

Model-based Clustering Algorithm

Graph-based Clustering Algorithm

Partitioning Clustering Algorithm

Numerical Methods

1. K-means algorithm and Farthest First Traversal k-center (FFT) algorithm

2. K-medoids or PAM (Partitioning Around Medoids)

3. CLARA (Clustering Large Applications)

4. CLARANS (Clustering Large Applications Based upon Randomized Search) and Fuzzy K-means
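
As an illustration of the first method in the list, the sketch below runs the usual K-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It assumes numeric points given as tuples and caller-supplied initial centers; the function name and the fixed iteration count are choices made for this example.

```python
def kmeans(points, centers, n_iter=10):
    """Plain K-means with squared Euclidean distance; returns (centers, clusters)."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```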

Discrete Methods

1. K-modes

2. Fuzzy K-modes

3. Squeezer and COOLCAT.

Mixed Discrete and Numerical Clustering Methods

1. K-prototypes

Hierarchical Clustering Algorithm

Divide the data into a tree of nodes, where each node represents a cluster.

Two categories based on methods or purposes

1. Agglomerative vs. Divisive

2. Single vs. Complete vs. Average linkage
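
The agglomerative variant and the three linkage rules can be sketched together. The example below assumes one-dimensional numeric data and a hypothetical function name; it merges the two closest clusters until the requested number remains, where "closest" is the minimum (single), maximum (complete), or mean (average) pairwise distance.

```python
def agglomerative(values, n_clusters, linkage="single"):
    """Bottom-up clustering of 1-D values with a selectable linkage rule."""
    clusters = [[v] for v in values]

    def dist(c1, c2):
        d = [abs(a - b) for a in c1 for b in c2]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        if linkage == "average":
            return sum(d) / len(d)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


result = agglomerative([1, 2, 10, 11, 30], n_clusters=2, linkage="single")
print(result)
```

A divisive method works in the opposite direction, starting from one cluster and splitting it recursively.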

Popular because nature can have various levels of subsets

Drawbacks:

1. Slow

2. Errors are not tolerated: an incorrect merge or split cannot be undone

3. Information is lost when moving between levels

Two kinds of methods

1. Numerical Methods: BIRCH, CURE, Spectral clustering

2. Discrete Methods: ROCK, Chameleon, LIMBO

Grid-based Clustering Algorithm

Forms a grid structure of cells from the input data; each data point is then assigned to a cell of the grid.

STING combines a numerical grid-based clustering method with a hierarchical method
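
The cell-assignment step can be sketched in a few lines. This is a generic illustration with hypothetical function names and a uniform cell size, not STING itself, which additionally keeps statistics per cell in a hierarchy of grids.

```python
from collections import Counter
from math import floor


def grid_cells(points, cell_size):
    """Count how many points fall into each cell of a uniform grid."""
    return Counter(tuple(floor(x / cell_size) for x in p) for p in points)


def dense_cells(points, cell_size, min_points):
    """Cells holding at least min_points points; candidate cluster regions."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_points}


points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.5, 5.5), (9.0, 1.0)]
dense = dense_cells(points, cell_size=1.0, min_points=2)
print(dense)
```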

Density-based Clustering Algorithm

Uses a local density criterion

Clusters are dense subspaces separated by low-density spaces

Example of a bioinformatics application: finding the densest subspaces in interactome (protein-protein interaction) networks

DBSCAN, OPTICS, DENCLUE, WaveCluster, CLIQUE use numerical values for clustering

SEQOPTICS is used for sequence clustering

HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), Projected (subspace) clustering, CACTUS, STIRR, CLICK, CLOPE use discrete values for clustering
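
To make the density idea concrete, the sketch below implements the core of DBSCAN for numeric points. It is a simplified version under assumptions: neighborhoods are found by brute force, the parameter names `eps` and `min_pts` follow common usage, and noise points are labeled -1.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:
                queue.extend(nbj)  # j is a core point, keep growing the cluster
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```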

Model-based Clustering Algorithm

Uses a model, often derived from a statistical distribution

Bioinformatics applications

1. gene expression

2. interactomes

3. sequences

Numerical model-based methods

1. Self-Organizing Maps

Discrete model-based clustering algorithm

1. COBWEB

Numerical and discrete model-based clustering methods

1. BILCOM (Bi-level Clustering of Mixed Discrete and Numerical Biomedical Data), using an empirical Bayesian approach

Examples

1. Gene expression clustering

2. Protein sequence clustering

3. AutoClass

4. SVM Clustering methods

Graph-based Clustering Algorithm

Applied to interactomes for complex prediction and to sequence networks

Examples:

1. MCODE (Molecular Complex Detection)

2. SPC (Super Paramagnetic Clustering)

3. RNSC (Restricted Neighborhood Search Clustering)

4. MCL (Markov Clustering)

5. TribeMCL

6. SPC

7. CD-HIT

8. ProClust

9. BAG algorithms

Usage in Bioinformatics Applications

Gene expression clustering

1. K-means algorithm

2. Hierarchical algorithm

3. SOMs

Interactomes

1. AutoClass

2. SVM clustering

3. COBWEB

4. MULIC

Sequence clustering

1. Hierarchical clustering algorithm

REFERENCES

[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058+, February 2009.

[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy and Lossless Compression by Motif Pattern Discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.

[3] Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.

[4] Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International Conference on, 0:549, 2003.

[5] Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.

[6] Laxmi Parida. Discovering Topological Motifs Using a Compact Notation. Journal of Computational Biology, 14(3):300-323, 2007.

Thank you so much!