4. Gene Expression Data Analysis
EECS 600: Systems Biology & Bioinformatics
Instructor: Mehmet Koyuturk
Analyzing Gene Expression Data
Clustering
  - How are genes related in terms of their expression under different conditions?
Differential gene expression
  - Which genes are affected by a change in condition, tissue, or disease?
Classification (supervised analysis)
  - Given the expression profile of a gene, can we assign a function?
  - Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?
Regulatory network inference
  - How do genes regulate each other’s expression to orchestrate cellular function?
Clustering
Group similar items together
Clustering genes based on their expression profiles
  - We can measure the expression of multiple genes in multiple samples
  - Genes that are functionally related should have similar expression profiles
Gene expression profile
  - A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample
Clustering of multi-dimensional real-valued data is a well-studied problem
Motivating Example
Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)
Applications of Clustering
Functional annotation
  - If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function
Identification of regulatory motifs
  - If a group of genes is co-regulated, then it is likely that their regulation is modulated by similar transcription factors; by looking for common elements in the neighborhood of the coding sequences of the genes in a cluster, we can identify regulatory motifs and their locations (promoters)
Modular analysis
Gene Expression Matrix
m genes (rows), n samples (columns)
Generally, m >> n: $m = O(10^3)$, $n = O(10^1)$
Each row is an n-dimensional vector: the expression profile of a gene

$$E = [e_{ij}], \quad 1 \le i \le m, \ 1 \le j \le n$$
$$e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$$
Proximity Measures
How do we decide which genes are similar to each other?
Euclidean distance
$$\mathrm{Euclidean}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$$
Manhattan distance
$$\mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$$
Distance
Minkowski distance
  - General version of Euclidean, Manhattan, etc.
  - p is a parameter
$$\mathrm{Minkowski}(e_i, e_j) = \|e_i - e_j\|_p = \left( \sum_{k=1}^{n} |e_{ik} - e_{jk}|^p \right)^{1/p}$$
$$\|e_i - e_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}|$$
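The distance measures above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function names and the two example profiles are made up for the purpose.

```python
def minkowski(x, y, p):
    """Minkowski distance with parameter p >= 1 between equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)          # p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)          # p = 1

def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

g1 = [1.0, 2.0, 3.0]   # made-up expression profile of one gene over 3 samples
g2 = [4.0, 6.0, 3.0]   # made-up expression profile of another gene
print(euclidean(g1, g2), manhattan(g1, g2), chebyshev(g1, g2))
```

Larger p gives more weight to the sample with the largest difference, which is why the maximum appears in the limit.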
Normalization
If we want to measure the distance between directions rather than absolute magnitudes, it may be necessary to standardize the mean and variation of the expression levels for each gene

$$\mu_i = \frac{1}{n} \sum_{k=1}^{n} e_{ik}, \qquad \sigma_i = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)^2}$$
$$e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}, \qquad e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T$$
Correlation
The similarity between the variation of two random variables
A vector is treated as a sampling of a random variable
Covariance
$$\mathrm{Cov}[e_i, e_j] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$$
$$\mathrm{Var}[e_i] = \mathrm{Cov}[e_i, e_i] = \sigma_i^2$$
Pearson Correlation Coefficient
Pearson correlation coefficient
$$\mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}} = \frac{\frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{\sigma_i \sigma_j}$$
Pearson correlation is equal to the cosine of the angle between (or the normalized inner product of) the normalized expression profiles
$$\mathrm{Pearson}(e_i, e_j) = \mathrm{Pearson}(e'_i, e'_j)$$
Pearson correlation is normalized
$$-1 \le \mathrm{Pearson}(e_i, e_j) \le 1$$
Euclidean Distance & Correlation
Euclidean distance (normalized) and Pearson correlation coefficient are closely related
$$\mathrm{Euclidean}(e'_i, e'_j) = \sqrt{2n\,(1 - \mathrm{Pearson}(e_i, e_j))}$$
These are the two most commonly used proximity measures in gene expression data analysis
Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles
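The relation between normalized Euclidean distance and Pearson correlation can be checked numerically. The sketch below assumes the population (1/n) form of the standard deviation; the helper names and the two profiles are made up for illustration.

```python
import math

def standardize(e):
    """Shift a profile to zero mean and scale it to unit (population) variance."""
    n = len(e)
    mu = sum(e) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in e) / n)
    return [(x - mu) / sigma for x in e]

def pearson(x, y):
    """Pearson correlation as the mean product of the standardized profiles."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

gi = [2.0, 4.0, 6.0, 9.0]   # made-up profiles over n = 4 samples
gj = [1.0, 3.0, 2.0, 7.0]
n = len(gi)
lhs = euclidean(standardize(gi), standardize(gj))
rhs = math.sqrt(2 * n * (1 - pearson(gi, gj)))
print(lhs, rhs)             # the two values should agree up to rounding
```

Because the two measures are monotonically related after standardization, ranking gene pairs by either one gives the same order.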
Other Measures of Correlation
Pearson is vulnerable to outliers
  - If two genes have very high expression in a single sample, that sample might dominate and make the two expression profiles appear highly correlated
  - Jackknife correlation: Estimate n correlations by leaving each dimension (sample) out in turn, and take the minimum among them
Pearson is not robust for non-Gaussian distributions
  - Spearman’s rank-order correlation coefficient: Rank the expression levels, and replace each expression level with its rank
  - More robust against outliers
  - A lot of loss of information
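A small sketch of the two robust alternatives, under the assumption that there are no tied expression values (ties would need average ranks). The profiles are invented so that one extreme sample dominates the plain Pearson correlation.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def jackknife(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
               for k in range(len(x)))

def ranks(x):
    """Rank of each value (1 = smallest); ties are not handled in this sketch."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    r = [0] * len(x)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Two made-up profiles that share one extreme sample (the last one)
a = [1.0, 2.0, 3.0, 2.5, 50.0]
b = [3.0, 1.0, 2.0, 1.5, 60.0]
print(pearson(a, b), jackknife(a, b), spearman(a, b))
```

Here the plain Pearson value is close to 1, while the jackknife value turns negative once the shared outlier is left out.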
Clustering Methods
Hierarchical clustering
  - Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster
  - Higher branches correspond to coarser clusters
Partitioning
  - Partition genes into several groups so that similar genes will be in the same partition
Hierarchical clustering
Direction of clustering
  - Bottom-up (agglomerative): Start from individual genes, join them into groups until only one group is left
  - Top-down (divisive): Start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene
  - Agglomerative clustering is computationally less expensive. Why?
Hierarchical clustering methods are greedy
  - Once a decision is made, it cannot be undone
Agglomerative clustering
Start with m clusters: Each cluster contains one gene
At each step, choose the two clusters that are closest (or most correlated), merge them
How do we evaluate the distance between two clusters?
  - Single linkage: If clusters contain two very close genes, then the clusters are close to each other
$$\delta(C_k, C_l) = \min_{i \in C_k,\ j \in C_l} \delta_{ij}$$
where $\delta_{ij}$ is the distance between the expression profiles of genes i and j
Agglomerative Clustering
Complete linkage: Two clusters are close to each other only if all genes inside them are close to each other
$$\delta(C_k, C_l) = \max_{i \in C_k,\ j \in C_l} \delta_{ij}$$
Group average: Two clusters are close to each other if their centers are close to each other
$$\delta(C_k, C_l) = \frac{1}{|C_k||C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
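The agglomerative procedure with the three linkage rules can be sketched as follows. This is a naive illustration that recomputes all cluster distances at every step; the function names, the `stop_at` parameter, and the data are assumptions made for the example.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage(ck, cl, profiles, mode):
    """Distance between clusters ck and cl (lists of gene indices)."""
    d = [dist(profiles[i], profiles[j]) for i in ck for j in cl]
    if mode == "single":
        return min(d)
    if mode == "complete":
        return max(d)
    return sum(d) / len(d)          # group average

def agglomerate(profiles, mode="single", stop_at=1):
    """Merge the two closest clusters until only stop_at clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > stop_at:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]],
                                                profiles, mode))
        clusters[a] = clusters[a] + clusters[b]   # greedy merge, never undone
        del clusters[b]
    return clusters

# Made-up profiles: genes 0-1 and genes 2-3 form two tight groups
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
print(agglomerate(data, mode="complete", stop_at=2))
```

Recording the sequence of merges, rather than stopping at a fixed number of clusters, would give the dendrogram.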
Divisive Clustering
Recursive bipartitioning
  - Find an “optimal” partitioning of the genes into two clusters
  - Recursively work on each partition
  - Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems
May be computationally expensive
  - The problem is “global”
  - At every level of the tree, we have to work on all of the genes
  - If the tree is imbalanced, there might be as many as m levels
With a reasonable stopping criterion, may be considered a partition-based clustering as well
Partition Based Clustering
Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters
Easily interpretable
  - Especially for large datasets (as compared to hierarchical)
Number of Clusters
Clustering is “unsupervised”, so generally we do not have prior knowledge of how many clusters underlie the data
It is very difficult to partition data into an “unknown” number of clusters
  - Most algorithms assume that K (number of clusters) is known
Try different values of K, find the one that results in the best clustering
  - Very expensive
Overlapping vs. Disjoint Clusters
Genes do not have a single function
  - Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts
Can we allow a gene to be included in more than one cluster?
Allowing overlaps between clusters poses additional challenges
  - To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters)
Fuzzy Clustering
Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster
  - Difficult interpretation
Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values
Hierarchical clustering is also “fuzzy” in some sense
Continuous relaxation might alleviate computational complexity as well
K-Means Clustering
The most famous clustering algorithm
Given K, find K disjoint clusters such that the total intracluster variation is minimized
Cluster mean:
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} e_i$$
Intracluster variation:
$$\sum_{i \in C_k} \delta(e_i, \mu_k)$$
Total intracluster variation:
$$\sum_{k=1}^{K} \sum_{i \in C_k} \delta(e_i, \mu_k)$$
K-Means Algorithm
K-Means is an iterative algorithm that alters parameters based on each other’s values until no improvement is possible
1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters
2. Assign each gene to a cluster
2.1. Each gene is assigned to the cluster whose center is closest to its profile
3. Redetermine cluster centers
4. If any gene was moved, go back to Step 2, else stop
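The four steps above can be sketched as follows, assuming squared Euclidean distance as the proximity measure. The `seed` parameter and the rule of keeping the old center for an empty cluster are choices made for this illustration only.

```python
import random

def sqdist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans(profiles, K, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(profiles, K)            # Step 1: random profiles as centers
    labels = [None] * len(profiles)
    while True:
        new = [min(range(K), key=lambda k: sqdist(p, centers[k]))
               for p in profiles]                # Step 2: closest center wins
        if new == labels:                        # Step 4: no gene moved, stop
            return labels, centers
        labels = new
        for k in range(K):                       # Step 3: recompute centers
            members = [p for p, l in zip(profiles, labels) if l == k]
            if members:                          # keep the old center if a cluster is empty
                centers[k] = mean(members)

# Made-up profiles: genes 0-1 and genes 2-3 form two tight groups
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels, centers = kmeans(data, K=2)
print(labels, centers)
```

The result depends on the random initial centers in general, so it is common to run the algorithm several times and keep the solution with the lowest total intracluster variation.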
Sample Run of K-Means
Self Organizing Maps
Just like K-means, we have K clusters, but this time they are organized into a map
  - Often a 2D grid
  - We want to organize clusters so that similar clusters will be in proximity in the map
  - A way of visualizing in low-dimensional (2D) space
Just like K-means, each cluster is associated with a weight vector
  - It was the cluster center in K-means
Each weight vector is first initialized randomly to some gene’s expression profile
SOM Algorithm
At each step, a gene is selected at random
The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with the closest weight vector becomes the winner
The weight vectors of the winner and its neighbors (according to the 2D mapping) are adjusted to represent the gene’s expression profile better
$$w_k(t+1) = w_k(t) + \alpha(t)\,\theta(C_j, C_k)\,(e_i - w_k(t))$$
  - $C_j$ is the winner cluster for gene i at time t
  - α is a decreasing function of time, θ is the neighborhood function
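The SOM update can be sketched as below. The linear decay of the learning rate, the Gaussian neighborhood function, and the shrinking radius are assumptions chosen for illustration; the slides do not specify them.

```python
import math
import random

def som(profiles, rows, cols, steps=500, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialize each weight vector to some gene's expression profile
    w = {g: list(rng.choice(profiles)) for g in grid}
    for t in range(steps):
        alpha = 0.5 * (1 - t / steps)                      # decreasing learning rate
        radius = 1.0 + max(rows, cols) * (1 - t / steps)   # shrinking neighborhood
        e = rng.choice(profiles)                           # select a gene at random
        win = min(grid, key=lambda g: sum((a - b) ** 2 for a, b in zip(w[g], e)))
        for g in grid:
            d2 = (g[0] - win[0]) ** 2 + (g[1] - win[1]) ** 2
            theta = math.exp(-d2 / (2 * radius ** 2))      # neighborhood function
            w[g] = [wk + alpha * theta * (ek - wk) for wk, ek in zip(w[g], e)]
    return w

# Made-up profiles: two tight groups of genes
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
weights = som(data, rows=2, cols=2)
print(weights)
```

Each gene is finally assigned to the grid cell whose weight vector is closest to its profile.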
Gene Co-expression Network
Nodes represent genes
Weighted edges between nodes represent proximity (correlation) between genes’ expression profiles
This is indeed a way of predicting interactions between genes
Graph Theoretical Clustering
Partition the graph into heavy subgraphs
  - Maximize total weight (number of edges) inside a cluster
  - Minimize total weight (number of edges) between clusters
Heuristic algorithms
  - CLICK: Recursive min-cut
  - CAST: Iterative improvement one by one for each cluster
Loss of information?
Model Based Clustering
Generating model
  - Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters
  - The probability that a gene belongs to a cluster is specified by hidden parameters
Expectation Maximization (EM) algorithm
  - Start with a guess of model parameters
  - E-step: Compute expected values of hidden parameters based on model parameters
  - M-step: Based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand
  - Iterate
K-means is a special case
Evaluation of Clusters
In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity
Homogeneity, separation
  - Based on the proximity metric
Reference partition
  - Information on “true clusters” that comes from a different source (apart from expression data)
  - Molecular annotation (e.g., Gene Ontology)
  - Jaccard coefficient, sensitivity, specificity
Cluster annotation
  - Processes that are significantly enriched in a cluster
Homogeneity & Separation
Heterogeneity (or homogeneity in reverse direction)
  - How similar are the genes in one cluster?
$$H(C_k) = \frac{2}{|C_k|(|C_k| - 1)} \sum_{i, j \in C_k,\ i < j} \delta_{ij}$$
Separation
  - How dissimilar are different clusters?
$$S(C_k, C_l) = \frac{1}{|C_k||C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
Good clustering: low heterogeneity (i.e., high homogeneity) and high separation, when $\delta_{ij}$ is a distance
Overall Quality
Overall heterogeneity
$$H = \frac{1}{m} \sum_{C_k} |C_k|\, H(C_k)$$
Overall separation
$$S = \frac{1}{\sum_{C_k \ne C_l} |C_k||C_l|} \sum_{C_k \ne C_l} |C_k||C_l|\, S(C_k, C_l)$$
How do these change with respect to the number of clusters?
  - Can we optimize these values to choose the best number of clusters?
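A sketch of the overall heterogeneity and separation of a clustering, assuming Euclidean distance as the proximity measure and defining the heterogeneity of a single-gene cluster as 0. The function names and data are made up for the example.

```python
import math
from itertools import combinations

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def heterogeneity(c, profiles):
    """Average pairwise distance inside cluster c (0 for a singleton)."""
    pairs = list(combinations(c, 2))
    if not pairs:
        return 0.0
    return sum(dist(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def separation(ck, cl, profiles):
    """Average distance between the genes of two different clusters."""
    total = sum(dist(profiles[i], profiles[j]) for i in ck for j in cl)
    return total / (len(ck) * len(cl))

def overall(clusters, profiles):
    """Size-weighted overall heterogeneity H and overall separation S."""
    m = sum(len(c) for c in clusters)
    H = sum(len(c) * heterogeneity(c, profiles) for c in clusters) / m
    pairs = list(combinations(clusters, 2))
    weight = sum(len(a) * len(b) for a, b in pairs)
    S = sum(len(a) * len(b) * separation(a, b, profiles) for a, b in pairs) / weight
    return H, S

data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]   # made-up profiles
print(overall([[0, 1], [2, 3]], data))   # tight, well-separated clusters
print(overall([[0, 2], [1, 3]], data))   # a poor clustering of the same genes
```

With a distance as the proximity measure, the better clustering is the one with the smaller H and the larger S.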
Bayesian Information Criterion
A statistical criterion for evaluating a model
  - Penalizes model complexity (number of free parameters to be estimated)
  - k is the number of free parameters in the model, which increases with the number of clusters
  - RSS is the “total error” in the model
Trade off the number of clusters against the optimization function to choose the best number of clusters
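The formula itself did not survive in this transcript. One common form of the criterion that uses exactly these two quantities, valid under the assumption of independent Gaussian errors, is given here for orientation; N (the number of data points) is a symbol introduced for this note and may differ from the notation on the slide.

```latex
\mathrm{BIC} = N \ln\!\left(\frac{\mathrm{RSS}}{N}\right) + k \ln N
```

The first term decreases as more clusters reduce the total error, the second grows with the number of free parameters; the preferred model is the one with the lowest BIC.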
Reference Partitioning
If there is information about “ground truth” from an independent source, we can compare our clustering to such a reference partitioning
Pairwise assessment
  - Let $C_{ij} = 1$ if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise
  - Let $R_{ij} = 1$ if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise
$$n_{11} = |\{(i, j) : C_{ij} = 1,\ R_{ij} = 1\}|, \qquad n_{00} = |\{(i, j) : C_{ij} = 0,\ R_{ij} = 0\}|$$
$$n_{10} = |\{(i, j) : C_{ij} = 1,\ R_{ij} = 0\}|, \qquad n_{01} = |\{(i, j) : C_{ij} = 0,\ R_{ij} = 1\}|$$
Comparing Partitions
Rand index (symmetric)
$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}}$$
Jaccard coefficient (sparse)
$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$
Minkowski measure (sparse)
$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$
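The pairwise counts and the three measures can be computed as in the sketch below. Clusterings are represented as lists of cluster labels, one per gene; the labels and function names are made up, and the Minkowski measure is taken in its usual square-root form.

```python
import math
from itertools import combinations

def pair_counts(clustering, reference):
    """clustering[i] and reference[i] are the cluster labels of gene i."""
    n = {"11": 0, "10": 0, "01": 0, "00": 0}
    for i, j in combinations(range(len(clustering)), 2):
        c = int(clustering[i] == clustering[j])   # C_ij
        r = int(reference[i] == reference[j])     # R_ij
        n[f"{c}{r}"] += 1
    return n

def rand_index(n):
    return (n["11"] + n["00"]) / sum(n.values())

def jaccard(n):
    return n["11"] / (n["11"] + n["10"] + n["01"])

def minkowski_measure(n):
    return math.sqrt((n["10"] + n["01"]) / (n["11"] + n["01"]))

found = [0, 0, 0, 1, 1, 1]      # made-up clustering of six genes
truth = [0, 0, 1, 1, 2, 2]      # made-up reference partition
counts = pair_counts(found, truth)
print(counts, rand_index(counts), jaccard(counts), minkowski_measure(counts))
```

The Jaccard coefficient and the Minkowski measure ignore the pairs that are separated in both partitions, which usually form the vast majority; this is why they are better suited to sparse settings than the Rand index.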
Cluster Annotation
Clustering results in groups of genes that are co-expressed (or co-regulated)
  - For each group, can we tell something about the biological phenomena that underlie our observation (their co-expression)?
We have partial knowledge of the function of many individual genes
  - Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families)
Taking a statistical approach, we can assign a function to each group of genes
  - A function popular in a cluster is associated with that cluster
Gene Ontology
Ontology: Study of being (e.g., conceptualization)
Gene Ontology is an attempt to develop a standardized library of cellular function
  - Unified view of life: Processes, structures, and functions recur in diverse organisms
Three concepts of Gene Ontology
  - Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)
  - Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)
  - Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)
Hierarchy in Gene Ontology
Gene Ontology is hierarchical
  - A process might have subprocesses: Seed maturation is part of seed development
  - A process might be described at different levels of detail: Seed dormancy is a(n example of) seed maturation
  - Same for function and component
Gene Ontology terms are related to each other via “is a” and “part of” relationships
  - If process A is part of process B, then A is B’s child (B is A’s parent); B involves A
  - If function C is a function D, then C is D’s child; C is a more detailed specification of D
GO Hierarchy is a DAG
Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG)
  - A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)
Annotation
GO-based annotation assigns GO terms to a gene
  - A gene might have multiple functions and can be involved in multiple processes
  - Multiple genes might be associated with the same function; multiple genes take part in a process
True-path rule
  - If a gene is annotated with a term, then it is also annotated with the term’s parents (and, consequently, all of its ancestors)
How does the number of genes associated with each term change as we go down the GO DAG?
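The true-path rule amounts to propagating each direct annotation to all ancestors in the DAG. The sketch below uses a hypothetical toy hierarchy stored as a child-to-parents dictionary; the term names and gene names are for illustration only and are not taken from the real Gene Ontology files.

```python
def ancestors(term, parents):
    """All terms reachable from term by following parent links in the DAG."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def propagate(direct, parents):
    """Apply the true-path rule: a gene inherits every ancestor of its terms."""
    full = {}
    for gene, terms in direct.items():
        full[gene] = set(terms)
        for t in terms:
            full[gene] |= ancestors(t, parents)
    return full

# Hypothetical toy DAG (child -> list of parents)
parents = {
    "seed dormancy": ["seed maturation"],
    "seed maturation": ["seed development"],
    "seed development": ["biological process"],
}
direct = {"geneA": ["seed dormancy"], "geneB": ["seed development"]}
print(propagate(direct, parents))
```

After propagation, a term is associated with at least as many genes as any of its descendants, so the gene sets shrink as we move down the DAG.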
GO Annotation of Gene Clusters
There are |C| genes in a cluster C
|T| genes are associated with GO term t
|C ∩ T| genes are in C and are associated with t
What is the association between cluster C and term t?
  - If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t?
  - What is the probability of this observation?
Statistical significance based on the hypergeometric distribution
Hypergeometric Distribution
We have n items, m of which are good
If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?
$$p = P[K \ge k] = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i} \binom{n-m}{r-i}}{\binom{n}{r}}$$
  - n is the number of genes in the organism
  - m = |T|, r = |C|, k = |C ∩ T|
The lower p is, the more likely that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)
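The hypergeometric tail probability can be computed exactly with integer binomial coefficients. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the function name and the example numbers are made up.

```python
from math import comb

def enrichment_pvalue(n, m, r, k):
    """P[K >= k] when r of n items are drawn at random and m of the n are good."""
    total = comb(n, r)
    return sum(comb(m, i) * comb(n - m, r - i)
               for i in range(k, min(m, r) + 1)) / total

# Made-up numbers: 6000 genes in the organism, 40 annotated with the term,
# a cluster of 25 genes, 5 of which carry the term
p = enrichment_pvalue(n=6000, m=40, r=25, k=5)
print(p)
```

With these numbers we would expect well under one annotated gene in a random cluster of 25, so observing five gives a very small p.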
GO Hierarchy & Cluster Annotation
How specific (general) is the annotation we attach to a cluster?
  - If a cluster is larger, then it might correspond to a more general process
  - Some processes might be over-represented in the study set
  - How do we find the best location of a cluster in the GO hierarchy?
Parent-child annotation
  - Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster
  - The gene space is defined as the set of genes that are associated with t’s parents
Parent-Child Annotation
Multiple Hypotheses Testing
The p-value for a single term provides an estimate of the probability of having the observed number of genes attached to that particular term
  - We have many terms; even if the likelihood of enrichment is small for a particular term, it might be very probable that some term will be enriched as much as observed in the cluster
We have to account for all hypotheses being tested simultaneously
Bonferroni correction: Apply the union bound; the probability that at least one of N tested terms looks this enriched by chance is at most the sum of the individual probabilities, so each p-value is multiplied by N
Which terms should we consider while correcting for multiple hypotheses for a single term?
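A minimal sketch of the Bonferroni correction, assuming that every term in the dictionary counts as one tested hypothesis (which terms to count is exactly the open question raised above). The term names and p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return terms whose corrected p-value (p times number of tests) is below alpha."""
    tests = len(pvalues)
    corrected = {t: min(1.0, p * tests) for t, p in pvalues.items()}
    return {t: q for t, q in corrected.items() if q < alpha}

# Made-up raw p-values for four GO terms tested against one cluster
raw = {"cell cycle": 0.0004, "metabolism": 0.03, "binding": 0.2, "nucleus": 0.012}
print(bonferroni(raw))
```

Here "metabolism" would pass an uncorrected threshold of 0.05 but does not survive the correction.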
Representativity of Terms
How well does a significantly enriched term represent a cluster?
  - How many of the genes in the cluster are attached to the term?
  - How many of the genes attached to the term are in the cluster?
For term t that is significantly enriched in cluster C
  - Specificity: |C ∩ T|/|C|, a.k.a. precision
  - Sensitivity: |C ∩ T|/|T|, a.k.a. recall
Biclustering
A particular process might be active in certain conditions
A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples
They might behave almost independently under other conditions
Clustering vs. Biclustering
Clustering is a global approach
  - Each gene is a point in the space defined by all samples
  - How about points that are clustered in a subspace?
Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides the best clustering, and vice versa
  - a.k.a. co-clustering, subspace clustering…
  - This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which groups are more evident
Biclustering Applications
Sample/tissue classification for diagnosis
  - The samples with leukemia show specific characteristics for a subset of genes
Identification of co-regulated genes
  - Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)
Functional annotation
  - Biological processes, functional classes are overlapping
  - Different sets of samples reveal different functional relationships
Biclustering Principles
A cluster of genes is defined with respect to a cluster of samples and vice versa
The clusters are not necessarily exclusive or exhaustive
  - A gene/condition may belong to more than one cluster
  - A gene/condition may not belong to any cluster at all
Biclusters are not “perfect”
  - Noise
  - Statistical inference becomes particularly important
Biclustering Formulation
Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J
General idea: A bicluster is a “good” one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)
The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)
Coherence of a Submatrix
Distribution of Biclusters
Bipartite Graph Model
Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs
With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs
Row, Column, Matrix Means
Objective Function
Low-variance (constant) bicluster
  - Ideal bicluster:
  - Minimize bicluster variance
Low-rank (constant row, constant column, coherent values) bicluster
  - Ideal constant row:
  - Ideal constant column:
  - General rank-one bicluster:
  - Define residue for each value:
  - Minimize mean squared residue
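The formulas for this slide are missing from the transcript. The sketch below assumes the mean squared residue as defined by Cheng and Church, where the residue of an entry is the value minus its row mean minus its column mean plus the mean of the whole submatrix; this may differ in notation from what the slide showed. The matrix is made up.

```python
def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix of A given by rows and cols."""
    sub = [[A[i][j] for j in cols] for i in rows]
    row_mean = [sum(r) / len(r) for r in sub]
    col_mean = [sum(c) / len(c) for c in zip(*sub)]
    all_mean = sum(row_mean) / len(row_mean)
    total = 0.0
    for i, r in enumerate(sub):
        for j, a in enumerate(r):
            residue = a - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (len(rows) * len(cols))

# Made-up matrix: rows 0-2 follow an additive pattern on columns 0-2
A = [[1.0, 2.0, 4.0, 9.0],
     [3.0, 4.0, 6.0, 0.0],
     [0.0, 1.0, 3.0, 5.0],
     [7.0, 1.0, 8.0, 2.0]]
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(A, [0, 1, 2, 3], [0, 1, 2, 3]))
```

A submatrix whose rows differ only by an additive shift has a residue of zero everywhere, so constant, constant-row, and constant-column biclusters all score (numerically) zero.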
Missing Values
Not all expression levels are available for each gene/sample pair
  - A solution is to replace missing values (random values, gene mean, sample mean, regression)
  - Generalize the definitions of row, column, and bicluster means to handle missing values implicitly
Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column
Overlapping Biclusters
The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters
Plaid model: The expression value of the ith gene in the jth sample is modeled as a sum over biclusters
  - Each term is the contribution of bicluster k to the expression value of the ith gene in the jth sample
  - Two indicators (generally binary) specify the membership of row i and column j in the kth bicluster, respectively
  - Minimize the squared difference between the observed and the modeled expression values
  - The contribution term is defined to reflect the “bicluster type”
Discrete Coherence
A bicluster is defined to be one with coherent ordering of the values on rows and/or columns (as compared to the values themselves)
Order-preserving submatrix (OPSM)
  - A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing
Gene expression motifs (xMOTIFs)
  - The expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions
  - An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples
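The order-preserving property is easy to test for a given column ordering; the hard part is searching for the ordering and the supporting rows. The sketch below only performs the test, with made-up function names and data.

```python
def is_order_preserving(A, rows, col_order):
    """True if every selected row is strictly increasing along col_order."""
    return all(A[i][a] < A[i][b]
               for i in rows
               for a, b in zip(col_order, col_order[1:]))

def supporting_rows(A, col_order):
    """Rows of A whose values increase along the given column ordering."""
    return [i for i in range(len(A)) if is_order_preserving(A, [i], col_order)]

# Made-up matrix: rows 0 and 1 share the ordering col 0 < col 2 < col 1
A = [[1.0, 5.0, 3.0],
     [2.0, 9.0, 4.0],
     [7.0, 1.0, 3.0]]
print(supporting_rows(A, [0, 2, 1]))
```

Note that rows 0 and 1 have quite different values but the same ordering, which is what distinguishes this notion of coherence from value-based ones.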
Binary Biclusters
Quantize gene expression matrix to binary values
  - SAMBA: A 1 corresponds to a significant change in the expression value
  - PROXIMUS: A 1 means that the gene is “expressed” in the corresponding sample
A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect
  - Bipartite graph model: Bicliques, heavy subgraphs
It is possible to statistically quantify the density of a submatrix
  - Log-likelihood:
  - p-value:
Biclustering Algorithms
Enumeration
  - Go for it!
Greedy algorithms
  - Make a locally optimal choice at every step
Divide and conquer
  - Solve problem recursively
Alternating iterative heuristics
  - Fix one dimension, solve for other, alternate iteratively
Model-based
  - Parameter estimation, e.g., EM algorithm
Enumerating Biclusters
m rows, n columns in the matrix
  - $2^m \times 2^n$ possible biclusters in total
  - Not doable in realistic amounts of time
  - Is it really necessary?
Put some restriction on the size of biclusters
SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph
  - Key assumption is sparsity: Nodes of the bipartite graph have bounded degree
  - Find K heavy bipartite subgraphs (biclusters) with bounded-degree enumeration
  - Refine them to optimize overlap, and add/remove nodes that improve bicluster quality
Greedy Algorithms
Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function
  - Generally, quite fast
  - How to choose initial biclusters?
  - How to jump over bad local optima? (Global awareness, hill-climbing)
Optimization function: mean squared residue
  - Node deletion: Start with a large bicluster, keep removing genes/samples that contribute most to total residue
  - Node addition: Start with a small bicluster, keep adding genes/samples that contribute least to total residue
  - Repeat these alternately to improve global awareness
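Node deletion can be sketched as follows, assuming the Cheng and Church form of the residue (value minus row mean minus column mean plus submatrix mean). The `threshold` and `min_size` parameters and the tie-breaking rule are choices made for this illustration; node addition is omitted.

```python
def squared_residues(A, rows, cols):
    """Squared residue of every entry of the submatrix given by rows and cols."""
    sub = [[A[i][j] for j in cols] for i in rows]
    rm = [sum(r) / len(r) for r in sub]
    cm = [sum(c) / len(c) for c in zip(*sub)]
    mu = sum(rm) / len(rm)
    return [[(sub[i][j] - rm[i] - cm[j] + mu) ** 2 for j in range(len(cols))]
            for i in range(len(rows))]

def node_deletion(A, threshold, min_size=2):
    """Start from the whole matrix; drop the worst row or column until MSR <= threshold."""
    rows, cols = list(range(len(A))), list(range(len(A[0])))
    while True:
        sq = squared_residues(A, rows, cols)
        score = sum(map(sum, sq)) / (len(rows) * len(cols))
        if score <= threshold or (len(rows) <= min_size and len(cols) <= min_size):
            return rows, cols, score
        row_score = [sum(r) / len(cols) for r in sq]
        col_score = [sum(c) / len(rows) for c in zip(*sq)]
        worst_row = max(range(len(rows)), key=lambda i: row_score[i])
        worst_col = max(range(len(cols)), key=lambda j: col_score[j])
        if len(rows) > min_size and (len(cols) <= min_size
                                     or row_score[worst_row] >= col_score[worst_col]):
            del rows[worst_row]      # this gene contributes most to the residue
        else:
            del cols[worst_col]      # this sample contributes most to the residue

# Made-up matrix: rows 0-2 follow an additive pattern on columns 0-2
A = [[1.0, 2.0, 4.0, 9.0],
     [3.0, 4.0, 6.0, 0.0],
     [0.0, 1.0, 3.0, 5.0],
     [7.0, 1.0, 8.0, 2.0]]
rows, cols, score = node_deletion(A, threshold=1e-9)
print(rows, cols, score)
```

Being greedy, the procedure can discard a row or column that belongs to a good bicluster; alternating with node addition is one way to recover from such choices.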
Finding All Biclusters
If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again
  - Masking discovered biclusters: Fill bicluster with random values
  - First identify disjoint biclusters, then grow them to capture overlaps
Flexible Overlapped Biclustering (FLOC)
  - Generate K initial biclusters
  - Make decisions from the gene/sample perspective (as compared to the bicluster perspective): Choose the best (maximum gain) action for each gene
Generalizing K-Means to Biclustering
Assume K gene clusters, L sample clusters
  - Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and clusters of samples, and each pair of gene and sample clusters defines a bicluster
R: m×K gene clustering matrix, C: n×L sample clustering matrix
  - R(i,k) = 1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)
Minimize total residue:
KL-Means Algorithm
We can show that
Batch iteration
  - Given R, compute an m×L matrix that serves as a prototype for column clusters
  - For each column, find the column of the prototype matrix that is closest to that column, and update the corresponding entry of C accordingly
  - Once C is fixed, repeat the same for rows to recompute R
Converges to a local minimum of the objective function
OPSM Algorithm
Recall that an order-preserving submatrix (OPSM) is one such that all rows have their entries in the same order
Growing partial models
  - Fix the extremes first
  - The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order
  - Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements; keep the best ones
  - Expand these to obtain (2,1) models, then (2,2), and so on until we have (s/2, s/2) models, s being the number of columns in the target bicluster
Divide and Conquer Algorithms
Block clustering (a.k.a. direct clustering)
  - Recursive bipartitioning
  - Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized
  - Do the same for columns
  - Pick the row or column that results in minimum intra-cluster variances, split the matrix into two based on that row or column
  - Continue splitting recursively
One problem is that once two rows/columns go to different biclusters, they can never come together
  - Gap statistics: Find a large number of biclusters, then recombine
Binormalization
Normalize matrix on both dimensions
Independent scaling of rows and columns
  - Here, R and C are diagonal matrices that contain row and column means, respectively
Bistochastization
  - Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant
  - Repeat independent scaling of rows and columns until stability is reached
The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean
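Bistochastization can be sketched as alternating row and column scaling. The sketch assumes a matrix with strictly positive entries (raw intensities rather than log-ratios) and picks the two target constants as 1 for rows and m/n for columns so that they are mutually consistent; both choices are assumptions of this illustration.

```python
def bistochastize(A, iterations=200):
    """Alternately rescale rows and columns of a positive matrix.

    Rows are driven toward sum 1, columns toward sum m/n (m rows, n columns).
    """
    m, n = len(A), len(A[0])
    B = [row[:] for row in A]
    for _ in range(iterations):
        for row in B:                                   # scale each row to sum 1
            s = sum(row)
            row[:] = [x / s for x in row]
        col_sum = [sum(col) for col in zip(*B)]
        for row in B:                                   # scale each column to sum m/n
            row[:] = [x * (m / n) / c for x, c in zip(row, col_sum)]
    return B

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]]   # made-up positive matrix
B = bistochastize(A)
print([sum(r) for r in B], [sum(c) for c in zip(*B)])
```

In practice one would stop when the row and column sums change by less than a tolerance instead of running a fixed number of iterations.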
Spectral Biclustering
Singular value decomposition
  - The non-zero eigenvalues of the matrices $A^T A$ and $A A^T$ (say, $\sigma^2$) are the same
  - Each σ is called a singular value of A, and the corresponding eigenvectors of $A A^T$ and $A^T A$ are called the left and right singular vectors
  - If $\sigma_1$ is the largest singular value of A, with $A^T A v_1 = \sigma_1^2 v_1$ and $A A^T u_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma u v^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all pairs of unit-norm vectors)
Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, and similar columns have similar values on v
  - Split the matrix based on u and v
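The leading singular triplet can be approximated with power iteration, without any linear algebra library. The sketch assumes that the uniform starting vector is not orthogonal to the leading right singular vector; the function name and the rank-one test matrix are made up.

```python
import math

def top_singular_triplet(A, iterations=200):
    """Power iteration on A^T A: returns (sigma1, u1, v1) for the matrix A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]   # u = A v
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]   # v = A^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Made-up rank-one matrix: outer product of [1, 2, 3] and [2, 1]
A = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
sigma, u, v = top_singular_triplet(A)
row_order = sorted(range(len(u)), key=lambda i: u[i])   # rows sorted by their value on u
print(sigma, u, v, row_order)
```

Sorting rows by u and columns by v places similar rows and similar columns next to each other, after which the matrix can be split at a large gap in the sorted values.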