DISCOVERING LARGER NETWORK MOTIFS
Li Chen, 4/16/2009
CSC 8910 Analysis of Biological Network, Spring 2009
Dr. Yi Pan
THE REVIEW ON MODELS AND ALGORITHMS FOR MOTIF DISCOVERY IN PROTEIN-PROTEIN INTERACTION NETWORKS
Two distinct definitions of a motif based on frequency and statistical significance
Definition 1: a motif is a sub-graph that appears more than a threshold number of times.
Definition 2: a motif is a sub-graph that appears more often than expected by chance (an over-represented motif).
Two characteristics used to evaluate a motif
Frequency:
1. Arbitrary overlaps of nodes and edges (non-identical case)
2. Only overlaps of nodes (edge-disjoint case)
3. No overlaps (edge and vertex-disjoint case)
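The three frequency concepts can be illustrated with a minimal Python sketch. It assumes the occurrences of a sub-graph have already been found and are given as edge lists. The function names are hypothetical, and the greedy selection gives only a lower bound for the two disjoint cases, since picking the largest set of disjoint occurrences is a hard problem in general.

```python
def nodes_of(occurrence):
    """Return the set of nodes touched by an occurrence (a list of edges)."""
    return {v for edge in occurrence for v in edge}


def edges_of(occurrence):
    """Return the edges of an occurrence as a set of unordered pairs."""
    return {frozenset(edge) for edge in occurrence}


def count_frequencies(occurrences):
    """Count occurrences under the three overlap rules.

    Returns (arbitrary overlaps, greedy edge-disjoint, greedy vertex-disjoint).
    """
    f_all = len(occurrences)

    # Case 2: occurrences may share nodes but not edges (greedy choice).
    used_edges, f_edge_disjoint = set(), 0
    for occ in occurrences:
        edges = edges_of(occ)
        if not edges & used_edges:
            used_edges |= edges
            f_edge_disjoint += 1

    # Case 3: occurrences share neither nodes nor edges (greedy choice).
    used_nodes, f_vertex_disjoint = set(), 0
    for occ in occurrences:
        nodes = nodes_of(occ)
        if not nodes & used_nodes:
            used_nodes |= nodes
            f_vertex_disjoint += 1

    return f_all, f_edge_disjoint, f_vertex_disjoint


# Three triangles: the first two share edge b-c, the third shares only node c.
triangles = [
    [("a", "b"), ("b", "c"), ("a", "c")],
    [("b", "c"), ("c", "d"), ("b", "d")],
    [("c", "e"), ("e", "f"), ("c", "f")],
]
print(count_frequencies(triangles))  # (3, 2, 1)
```

The same three occurrences give a different frequency under each rule, which is why the chosen overlap rule has to be stated together with any threshold.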
Statistical Significance: compares the frequency of a sub-graph in the observed network with its frequency in random networks.
1. Z-score
2. Abundance
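Both measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual forms: the Z-score as the difference between the observed frequency and the random mean, divided by the random standard deviation, and the abundance as a normalized difference with a small constant eps in the denominator. The function names, the use of the population standard deviation, and the default eps=4.0 are assumptions, not values taken from the slides.

```python
from statistics import mean, pstdev


def z_score(f_observed, f_random):
    """Z-score of an observed frequency against frequencies in random networks."""
    mu = mean(f_random)
    sigma = pstdev(f_random)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("random frequencies have zero variance")
    return (f_observed - mu) / sigma


def abundance(f_observed, f_random, eps=4.0):
    """Normalized difference between observed and mean random frequency.

    eps keeps the value small when both frequencies are small.
    """
    mu = mean(f_random)
    return (f_observed - mu) / (f_observed + mu + eps)


# Hypothetical counts of one sub-graph in eight randomized networks.
random_counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, random_counts))    # 2.5
print(abundance(10, random_counts))
```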
Models of Random Graphs
Preserve the same degree distribution as the biological network
Preserve the degree sequence (search of n-node motifs)
Based on geometric random networks and a Poisson distribution of the degree
Incorporate node clustering into the model
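One common way to build a random network that preserves the degree sequence is repeated edge switching. The sketch below is an illustration under assumptions, not the procedure of any specific tool: it takes an undirected simple graph as an edge list, and the function names and the cap on attempts are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Return a Counter mapping each node to its degree."""
    return Counter(v for edge in edges for v in edge)


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomize an undirected simple graph while keeping every node's degree.

    Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    rejecting swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # hypothetical cap so the loop always ends
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


graph = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
shuffled = degree_preserving_shuffle(graph, n_swaps=20, seed=0)
print(degrees(shuffled) == degrees(graph))  # True
```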
3. Compact Topological Motifs: introduces a compact graph
representation obtained by grouping together maximal
sets of nodes that are ‘indistinguishable’.
The graph on the left shows the
sets U1 and U2 as compact nodes
and U1U2 as a compact edge.
Motif Discovery Algorithm
Exact algorithms for motifs with a small number of nodes
1. Exhaustive Recursive Search (ERS): the input
network is represented by an adjacency matrix M.
(motif size <= 4)
2. ESU: starts with individual nodes and adds
one node at a time until the required size k is
reached. (motif size <= 14)
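The node-by-node growth used by ESU can be sketched as follows. This is a minimal illustration that assumes an undirected graph stored as a dictionary of neighbour sets with comparable node labels and no self-loops. The function name is hypothetical, and the sketch only enumerates connected node sets; it does not classify them into isomorphism classes.

```python
def esu(adj, k):
    """Enumerate every connected set of k nodes exactly once.

    adj maps each node to the set of its neighbours.
    """
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        # Nodes already adjacent to the current sub-graph.
        sub_nbrs = set().union(*(adj[u] for u in sub))
        while ext:
            w = ext.pop()
            # Exclusive neighbours of w: larger than the root, not in the
            # sub-graph and not adjacent to it.
            excl = {u for u in adj[w]
                    if u > root and u not in sub and u not in sub_nbrs}
            extend(sub | {w}, ext | excl, root)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found


# A triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(len(esu(adj, 3)))  # 3
```

Only extending with nodes larger than the root, and only with exclusive neighbours, is what keeps each node set from being reported more than once.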
Approximate Algorithms
1. Search Algorithm Based on Sampling (MFINDER): it
picks edges of the input graph at random until a set of
k nodes is obtained, which gives a sample sub-graph, and
assigns weights to the samples to correct for the non-uniform
sampling. It scales well with large networks, but does not
scale well with large motifs.
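The edge-picking step can be sketched as below. This is an illustration of the sampling step only, under assumptions: the graph is an undirected edge list, the function name is hypothetical, and the weight that corrects the non-uniform sampling is deliberately left out, so raw counts from this sampler are biased.

```python
import random


def sample_subgraph_nodes(edges, k, rng=random):
    """Grow a connected set of k nodes by picking random edges.

    Starts from one random edge, then repeatedly picks a random edge that
    joins the current node set to a new node. Returns None if the set
    cannot grow to k nodes. Assumes k >= 2.
    """
    edges = [tuple(e) for e in edges]
    nodes = set(rng.choice(edges))
    while len(nodes) < k:
        # Edges with exactly one endpoint inside the current node set.
        frontier = [(u, v) for (u, v) in edges if (u in nodes) != (v in nodes)]
        if not frontier:
            return None
        nodes.update(rng.choice(frontier))
    return nodes


path = [(1, 2), (2, 3), (3, 4)]
sample = sample_subgraph_nodes(path, 3, rng=random.Random(1))
print(sample)  # either {1, 2, 3} or {2, 3, 4}
```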
2. Rand-ESU: unlike MFINDER, it does not need to compute
weights for the samples. ESU builds a tree
whose leaves correspond to sub-graphs of size k, while
internal nodes correspond to sub-graphs of size 1 up to
k-1, depending on the tree level. Rand-ESU assigns to each level
of the tree a probability that its nodes are explored
further, so as to guarantee that all leaves are visited with
uniform probability.
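The per-level probabilities can be added to the ESU tree search as in the sketch below. It is a minimal illustration under assumptions: the graph is a dictionary of neighbour sets with comparable labels, the names rand_esu and probs are hypothetical, and probs[d-1] is the probability of exploring a tree node at level d. Every leaf is then reached with probability equal to the product of probs, so dividing the number of sampled leaves by that product gives an estimate of the total.

```python
import random
from math import prod


def rand_esu(adj, k, probs, seed=None):
    """Sample connected k-node sets; returns (samples, estimated total)."""
    rng = random.Random(seed)
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        sub_nbrs = set().union(*(adj[u] for u in sub))
        while ext:
            w = ext.pop()
            excl = {u for u in adj[w]
                    if u > root and u not in sub and u not in sub_nbrs}
            # The child has len(sub) + 1 nodes, so it sits at that tree level.
            if rng.random() < probs[len(sub)]:
                extend(sub | {w}, ext | excl, root)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)
    return found, len(found) / prod(probs)


adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
samples, estimate = rand_esu(adj, 3, probs=[1.0, 1.0, 1.0])
print(len(samples), estimate)  # 3 3.0
```

With all probabilities set to 1.0 the search visits the whole tree and behaves like the exact ESU enumeration; lowering them trades accuracy for speed.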
3. NeMoFINDER: combines approaches from the data mining and
computational biology communities. It searches for repeated
trees and extends them to sub-graphs. This reduces
the computation time for the discovery of larger
motifs, but at the cost of missing some potentially
interesting sub-graphs.
4. Sub-graph Counting by Scalar Computation: it
characterizes a biological network by a set of measures
based on scalars and functionals of the adjacency matrix
associated with the network. Its advantages are
mathematical elegance and computational efficiency.
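A simple example of a count obtained from a scalar function of the adjacency matrix is the number of triangles, which equals trace(A^3) / 6 for an undirected simple graph. The sketch below only illustrates that identity; it is not the set of measures used by the method itself, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(adj_matrix):
    """Count triangles in an undirected simple graph as trace(A^3) / 6.

    Each triangle yields six closed walks of length 3: three starting
    nodes times two directions.
    """
    a3 = mat_mul(mat_mul(adj_matrix, adj_matrix), adj_matrix)
    trace = sum(a3[i][i] for i in range(len(adj_matrix)))
    return trace // 6


# Complete graph on four nodes, which contains four triangles.
k4 = [[0, 1, 1, 1],
      [1, 0, 1, 1],
      [1, 1, 0, 1],
      [1, 1, 1, 0]]
print(count_triangles(k4))  # 4
```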
5. A-priori-based Motif Detection: the basic idea is that if a
sub-graph is frequent, so are all its sub-graphs. It builds
candidate motifs of size k by joining motifs of size k-1 and
then evaluates their frequency.
A ROADMAP OF CLUSTERING ALGORITHM IN BIOINFORMATICS APPLICATIONS
Desirable features of clustering algorithms to evaluate
Scalability
Robustness
Order insensitivity
Minimum user-specified input
Mixed data types
Arbitrary-shaped clusters
Point proportion admissibility: Duplicating data and re-clustering should not alter the results.
Six categories of clustering algorithms
Partitioning Clustering Algorithm
Hierarchical Clustering Algorithm
Grid-based Clustering Algorithm
Density-based Clustering Algorithm
Model-based Clustering Algorithm
Graph-based Clustering Algorithm
Partitioning Clustering Algorithm
Numerical Methods
1. K-means algorithm and Farthest First Traversal k-center (FFT) algorithm
2. K-medoids or PAM (Partitioning Around Medoids)
3. CLARA (Clustering Large Applications)
4. CLARANS (Clustering Large Applications Based upon Randomized Search) and Fuzzy K-means
Discrete Methods
1. K-modes
2. Fuzzy K-modes
3. Squeezer and COOLCAT.
Mixed Discrete and Numerical Clustering Methods
1. K-prototypes
Hierarchical Clustering Algorithm
Divide the data into a tree of nodes, where each node represents a cluster.
Two categories based on methods or purposes
1. Agglomerative vs. Divisive
2. Single vs. Complete vs. Average linkage
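The three linkage rules differ only in how the distance between two clusters is derived from the pairwise point distances. A minimal sketch, assuming numerical points compared with Euclidean distance; the function name is hypothetical.

```python
from itertools import product
from math import dist


def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters of points under a linkage rule."""
    pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":    # closest pair of points
        return min(pairwise)
    if method == "complete":  # farthest pair of points
        return max(pairwise)
    if method == "average":   # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


a = [(0.0,), (1.0,)]
b = [(3.0,), (5.0,)]
print(linkage_distance(a, b, "single"))    # 2.0
print(linkage_distance(a, b, "complete"))  # 5.0
print(linkage_distance(a, b, "average"))   # 3.5
```

An agglomerative method merges, at each step, the two clusters with the smallest such distance, so the linkage rule directly shapes the resulting tree.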
Popular because nature can have various levels of subsets
Drawbacks:
1. Slow
2. Errors are not tolerated: a wrong merge or split cannot be undone
3. Information is lost when moving between levels
Two kinds of methods
1. Numerical Methods: BIRCH, CURE, Spectral clustering
2. Discrete Methods: ROCK, Chameleon, LIMBO
Grid-based Clustering Algorithm
Form a grid structure of cells from the input data. Each data point is then assigned to a cell of the grid.
STING combines a numerical grid-based clustering method with a hierarchical method
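The cell-assignment step can be sketched as follows. This is a minimal illustration assuming numerical points and a uniform cell size; the function name is hypothetical and the sketch is not the STING procedure itself.

```python
from collections import defaultdict
from math import floor


def assign_to_grid(points, cell_size):
    """Map each point to the grid cell that contains it.

    A cell is identified by the integer index of the point along each axis.
    """
    cells = defaultdict(list)
    for point in points:
        key = tuple(floor(x / cell_size) for x in point)
        cells[key].append(point)
    return dict(cells)


points = [(0.2, 0.3), (0.8, 0.9), (1.5, 0.1)]
grid = assign_to_grid(points, cell_size=1.0)
print(grid)  # {(0, 0): [(0.2, 0.3), (0.8, 0.9)], (1, 0): [(1.5, 0.1)]}
```

Clustering then works on per-cell summaries such as the number of points in a cell rather than on the individual points.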
Density-based Clustering Algorithm
Use a local density standard
Clusters are dense subspaces separated by low density spaces
Example of a bioinformatics application: finding the densest subspaces in interactome (protein-protein interaction) networks
DBSCAN, OPTICS, DENCLUE, WaveCluster, CLIQUE use numerical values for clustering
SEQOPTICS is used for sequence clustering
HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), Projected (subspace) clustering, CACTUS, STIRR, CLICK, CLOPE use discrete values for clustering
Model-based Clustering Algorithm
Uses a model, often derived from a statistical distribution
Bioinformatics applications
1. gene expression
2. interactomes
3. sequences
Numerical model-based methods
1. Self-Organizing Maps
Discrete model-based clustering algorithm
1. COBWEB
Numerical and discrete model-based clustering methods
1. BILCOM (Bi-level Clustering of Mixed Discrete and Numerical Biomedical Data), which uses an empirical Bayesian approach
Examples
1. Gene expression clustering
2. Protein sequence clustering
3. AutoClass
4. SVM Clustering methods
Graph-based Clustering Algorithm
Applied to interactomes for complex prediction and to sequence networks
Examples:
1. MCODE (Molecular Complex Detection)
2. SPC (Super Paramagnetic Clustering)
3. RNSC (Restricted Neighborhood Search Clustering)
4. MCL (Markov Clustering)
5. TribeMCL
6. CD-HIT
7. ProClust
8. BAG algorithms
Usage in Bioinformatics Applications
Gene expression clustering
1. K-means algorithm
2. Hierarchical algorithm
3. SOMs
Interactomes
1. AutoClass
2. SVM clustering
3. COBWEB
4. MULIC
Sequence clustering
1. Hierarchical clustering algorithm
REFERENCES
[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058+, February 2009.
[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy and Lossless Compression by Motif Pattern Discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.
[3] Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.
[4] Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International Conference on, 0:549, 2003.
[5] Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.
[6] Laxmi Parida. Discovering Topological Motifs Using a Compact Notation. Journal of Computational Biology, 14(3):300-323, 2007.
Thank you so much!