DB Seminar Series: Biclustering Methods for Microarray Data Analysis
By: Kevin Yip, 10 Sep 2003

Page 1: Title slide

Page 2: Outline

• Introduction
• Overview of the algorithms
• Some details of each algorithm
• Summary
• Research opportunities

Page 3: Introduction

• Microarray data can be viewed as an N×M matrix:
  - Each of the N rows represents a gene (or a clone, ORF, etc.).
  - Each of the M columns represents a condition (a sample, a time point, etc.).
  - Each entry represents the expression level of a gene under a condition. It can be either an absolute value (e.g. Affymetrix GeneChip) or a relative expression ratio (e.g. cDNA microarrays).
• A row/column is sometimes referred to as the "expression profile" of the gene/condition.

Page 4: Introduction

• It is common to visualize a gene expression dataset by a color plot:
  - Red spots: high expression values (the gene has produced many copies of the mRNA).
  - Green spots: low expression values.
  - Gray spots: missing values.
• [Figure: a color plot of N genes by M conditions.]

Page 5: Introduction

• If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).
• However, they may have similar expression patterns only under some conditions (e.g. they respond similarly to a certain external stimulus, but each has distinct functions at other times).
• Similarly, for two related conditions, some genes may exhibit different expression patterns (e.g. two tumor samples of different sub-types).

Page 6: Introduction

• As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a "checkerboard" structure.
• In reality, each gene/condition may participate in multiple clusters.

Page 7: Introduction

• To discover such data patterns, "biclustering" methods have been proposed to cluster both genes and conditions simultaneously.
• Differences from projected clustering (by observation, not by definition):
  - Projected clustering has a primary clustering target; biclustering usually treats rows and columns equally.
  - Most projected clustering methods define attribute relevance based on value distances; most biclustering methods define biclusters based on other measures.
  - Some biclustering methods do not have the concept of irrelevant attributes.

Page 8: Overview of the Biclustering Methods

• Cheng & Church (ISMB 2000). Cluster model: background + row effect + column effect. Goal: minimize the mean squared residue of biclusters.
• Getz et al., CTWC (PNAS 2000). Cluster model and goal: depend on the plug-in clustering algorithm.
• Lazzeroni & Owen, Plaid Models (Statistica Sinica 2002). Cluster model: background + row effect + column effect. Goal: minimize the modeling error.
• Ben-Dor et al., OPSM (RECOMB 2002). Cluster model: all genes have the same order of expression values. Goal: minimize the p-values of biclusters.
• Tanay et al., SAMBA (Bioinformatics 2002). Cluster model: maximum bounded bipartite subgraph. Goal: minimize the p-values of biclusters.
• Yang et al., FLOC (BIBE 2003). Cluster model: background + row effect + column effect. Goal: minimize the mean squared residue of biclusters.
• Kluger et al., Spectral (Genome Res. 2003). Cluster model: background × row effect × column effect. Goal: find checkerboard structures.

Page 9: Overview of the Biclustering Methods

• Cheng & Church. Overlap: yes (rare in reality). Discovery: one bicluster at a time. Complexity: O(MN) or O(M log N). Testing data: yeast (2884×17), lymphoma (4026×96).
• Getz et al. (CTWC). Overlap: yes. Discovery: one set at a time. Complexity: exponential. Testing data: leukemia (1753×72), colon cancer (2000×62).
• Lazzeroni & Owen (Plaid Models). Overlap: yes. Discovery: one at a time. Complexity: polynomial. Testing data: food (961×6), forex (2761×8), yeast (2467×79).
• Ben-Dor et al. (OPSM). Overlap: yes. Discovery: all at the same time. Complexity: O(NM³l). Testing data: breast tumor (3226×22).
• Tanay et al. (SAMBA). Overlap: yes. Discovery: all at the same time. Complexity: O((N2d+1)log(r+1)/r(rd)). Testing data: lymphoma (4026×96), yeast (6200×515).
• Yang et al. (FLOC). Overlap: yes. Discovery: all at the same time. Complexity: O((N+M)²kp). Testing data: yeast (2884×17).
• Kluger et al. (Spectral). Overlap: no. Discovery: all at the same time. Complexity: polynomial. Testing data: lymphoma (1 relative, 1 absolute), leukemia, breast cell line, CNS embryonal tumor.

Page 10: Cheng and Church

• Model:
  - A bicluster is represented by a submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix).
  - Each entry Aij in the bicluster is the superposition (summation) of:
    1. the background level,
    2. the row (gene) effect,
    3. the column (condition) effect.
  - A dataset contains a number of biclusters, which are not necessarily disjoint.

Page 11: Cheng and Church

• Example (background 5; column effects 1, 3, 2; row effects 2, 4, 1):

            Col 0 (+1)  Col 1 (+3)  Col 2 (+2)
  Row 0 (+2):    8          10           9
  Row 1 (+4):   10          12          11
  Row 2 (+1):    7           9           8

• Correlation between any two columns = correlation between any two rows = 1.
• aij = aiJ + aIj − aIJ, where aiJ = mean of row i, aIj = mean of column j, aIJ = mean of A.
• Biological meaning: the genes have the same (amount of) response to the conditions.
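To make the additive model concrete, here is a small NumPy sketch (mine, not from the original slides) that rebuilds the example above from its background, row and column effects and checks the identity aij = aiJ + aIj − aIJ:

```python
import numpy as np

# Background, row (gene) effects and column (condition) effects of the example bicluster.
background = 5.0
row_effect = np.array([2.0, 4.0, 1.0])
col_effect = np.array([1.0, 3.0, 2.0])

# Each entry is the superposition: background + row effect + column effect.
A = background + row_effect[:, None] + col_effect[None, :]
print(A)            # [[ 8. 10.  9.] [10. 12. 11.] [ 7.  9.  8.]]

# For an ideal (additive) bicluster, a_ij = a_iJ + a_Ij - a_IJ holds exactly.
row_means = A.mean(axis=1, keepdims=True)   # a_iJ
col_means = A.mean(axis=0, keepdims=True)   # a_Ij
overall   = A.mean()                        # a_IJ
assert np.allclose(A, row_means + col_means - overall)

# Any two rows (and any two columns) are perfectly correlated.
print(np.corrcoef(A)[0, 1])                 # 1.0
```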

Page 12: Cheng and Church

• Goal: to find biclusters with minimum mean squared residue:

  H(I, J) = (1 / (|I| |J|)) Σ_{i∈I, j∈J} (aij − aiJ − aIj + aIJ)²

• For an ideal bicluster, H(I, J) = 0:
  - Adding a constant to all entries of a row or column yields an ideal bicluster.
  - Multiplying all entries in the bicluster by a constant yields an ideal bicluster.
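A direct NumPy translation of H(I, J) (my own sketch, not code from the paper); it returns 0 for the additive example above and grows as the submatrix departs from the additive model:

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """H(I, J) of the submatrix of A given by the index lists `rows` and `cols`."""
    S = A[np.ix_(rows, cols)]
    row_means = S.mean(axis=1, keepdims=True)   # a_iJ
    col_means = S.mean(axis=0, keepdims=True)   # a_Ij
    overall   = S.mean()                        # a_IJ
    residue   = S - row_means - col_means + overall
    return float((residue ** 2).mean())

A = 5 + np.array([2., 4., 1.])[:, None] + np.array([1., 3., 2.])[None, :]
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))   # 0.0
A[0, 0] += 3                                           # perturb one entry
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))   # > 0
```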

Page 13: Cheng and Church

• Constraints:
  - 1×M and N×1 matrices always give zero residue.
    => Find biclusters of maximum size whose residue is not more than a threshold δ (the largest δ-biclusters).
  - Constant matrices always give zero residue.
    => Use the average row variance to evaluate the "interestingness" of a bicluster. Biologically, it represents genes whose expression values change a lot over the different conditions.

Page 14: Cheng and Church

• Finding the largest δ-bicluster:
  - The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard.
  - Objective function for heuristic methods (to minimize):

    H(I, J) = (1 / (|I| |J|)) Σ_{i∈I, j∈J} (aij − aiJ − aIj + aIJ)²

    => a sum of components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.

Page 15: Cheng and Church

• Greedy methods:
  - Algorithm 0: brute-force deletion (skipped).
  - Algorithm 1: single node deletion
    • Parameter(s): δ (maximum mean squared residue).
    • Initialization: the bicluster contains all rows and columns.
    • Iteration:
      1. Compute all aIj, aiJ, aIJ and H(I, J) for reuse.
      2. Remove the row or column that gives the maximum decrease of H.
    • Termination: when no action will decrease H, or H <= δ.
    • Time complexity: O(MN).
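A compact sketch of single node deletion (my own simplified rendering of Algorithm 1, reusing mean_squared_residue from the earlier block; the real implementation reuses the precomputed means instead of recomputing H from scratch):

```python
def single_node_deletion(A, delta):
    """Greedy Algorithm 1: repeatedly drop the row or column whose removal
    lowers H(I, J) the most, until H <= delta or no deletion helps."""
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while True:
        h_now = mean_squared_residue(A, rows, cols)
        if h_now <= delta:
            break
        best = None   # (H after deletion, 'row' or 'col', index)
        for r in (rows if len(rows) > 1 else []):
            h = mean_squared_residue(A, [x for x in rows if x != r], cols)
            if best is None or h < best[0]:
                best = (h, 'row', r)
        for c in (cols if len(cols) > 1 else []):
            h = mean_squared_residue(A, rows, [x for x in cols if x != c])
            if best is None or h < best[0]:
                best = (h, 'col', c)
        if best is None or best[0] >= h_now:   # no deletion decreases H
            break
        _, kind, idx = best
        (rows if kind == 'row' else cols).remove(idx)
    return rows, cols
```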

Page 16: Cheng and Church

• Greedy methods (continued):
  - Algorithm 2: multiple node deletion (takes one more parameter α; in iteration step 2, delete all rows and columns with row/column residue > α·H(I, J)).
  - Algorithm 3: node addition (allows both additions and deletions of rows/columns).

Page 17: Cheng and Church

• Handling missing values and masking discovered biclusters: replace them by random numbers so that no recognizable structures are introduced.
• Data preprocessing:
  - Yeast: x → 100·log(10⁵x)
  - Lymphoma: x → 100·x (the original data is already log-transformed)

Page 18: Cheng and Church

• Some results on the yeast cell cycle data (2884×17): [figure]

Page 19: Cheng and Church

• Some results on the lymphoma data (4026×96). Sizes of discovered biclusters (no. of genes, no. of conditions):

    4, 96    10, 29   11, 25
  103, 25   127, 13   13, 21
   10, 57     2, 96   25, 12
    9, 51     3, 96    2, 96

Page 20: Cheng and Church

• Discussion:
  - Biological validation: comparison with the clusters in previously published results.
  - No evaluation of the statistical significance of the clusters.
  - Neither the model nor the algorithm is tailored for discovering multiple non-disjoint clusters.
  - Normalization is of utmost importance for the model, but this issue is not well discussed.

Page 21: Yang et al. (FLOC)

• FLOC: FLexible Overlapped biClustering.
• Model: based on Cheng and Church, but allows missing values.
  - Volume of a bicluster: the number of non-missing entries in the submatrix.
• Goals:
  - Do not introduce random interference.
  - Discover k possibly overlapping clusters simultaneously.
  - Support additional features (e.g. limiting the maximum amount of overlap) at virtually zero additional cost.

Page 22: Yang et al. (FLOC)

• Handling missing values:
  - Introduce a parameter α (a fraction): in a bicluster, no row and no column may contain more than a fraction α of missing entries (e.g. α = 0.6).
  - When calculating the row/column/matrix averages, missing values are not counted.
• Example:
  - Invalid bicluster (some row or column exceeds the missing-value threshold); the non-missing values shown were: 1 3 / 4 5 / 3 4.
  - Valid bicluster (no missing entries): 1 3 3 / 3 4 5 / 3 4 4.
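A small sketch of the missing-value rule (my own reading of the slide; `alpha` is the fraction threshold above, and NaN marks a missing entry):

```python
import numpy as np

def is_valid_bicluster(S, alpha):
    """True if no row and no column of submatrix S has more than a fraction
    `alpha` of missing (NaN) entries."""
    missing = np.isnan(S)
    return bool(missing.mean(axis=1).max() <= alpha and
                missing.mean(axis=0).max() <= alpha)

def volume(S):
    """FLOC volume: the number of non-missing entries."""
    return int(np.count_nonzero(~np.isnan(S)))

S = np.array([[1.,     3., np.nan],
              [4.,     5., np.nan],
              [np.nan, 3., 4.]])
print(is_valid_bicluster(S, alpha=0.6), volume(S))   # False (last column is 2/3 missing), 6
```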

Page 23: Yang et al. (FLOC)

• Algorithm:
  - Parameters: k (number of clusters), a cluster-size parameter (the inclusion probability used below), α (missing-value threshold), r (residue threshold, i.e. δ in Cheng and Church's notation).
  - Phase 1: create k random biclusters (for each bicluster, each row/column is added independently with the given probability).
  - Phase 2: repeatedly
    • for each row/column, determine the change of the mean squared residue if it were added to / removed from each of the k biclusters;
    • perform the best actions for the M+N rows and columns (see the sketch below).
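The following sketch shows the overall two-phase structure in Python (my own simplified rendering: it greedily applies each row/column's single best add/remove action per round, using the mean squared residue on the non-missing entries; the published FLOC additionally ranks actions by a gain that also rewards volume and keeps only the best of the M+N intermediate cluster sets):

```python
import numpy as np
rng = np.random.default_rng(0)

def residue_nan(A, rows, cols):
    """Mean squared residue over the non-missing entries of a submatrix."""
    S = A[np.ix_(sorted(rows), sorted(cols))]
    r = S - np.nanmean(S, axis=1, keepdims=True) - np.nanmean(S, axis=0, keepdims=True) + np.nanmean(S)
    return float(np.nanmean(r ** 2))

def floc_sketch(A, k=3, p=0.5, n_rounds=10):
    N, M = A.shape
    # Phase 1: k random biclusters; every row and column joins each with probability p.
    clusters = [[set(np.flatnonzero(rng.random(N) < p)) | {0},
                 set(np.flatnonzero(rng.random(M) < p)) | {0}] for _ in range(k)]
    # Phase 2: for every row and column, try toggling its membership in each cluster
    # and keep the single toggle that gives the lowest residue.
    for _ in range(n_rounds):
        for axis in (0, 1):                      # 0: rows, 1: columns
            for idx in range(A.shape[axis]):
                best = None                      # (residue after toggle, cluster index)
                for c, (rows, cols) in enumerate(clusters):
                    trial_rows = rows ^ {idx} if axis == 0 else rows
                    trial_cols = cols ^ {idx} if axis == 1 else cols
                    if not trial_rows or not trial_cols:
                        continue
                    h = residue_nan(A, trial_rows, trial_cols)
                    if best is None or h < best[0]:
                        best = (h, c)
                if best is not None:             # apply the chosen toggle
                    clusters[best[1]][axis] ^= {idx}
    return [(sorted(rows), sorted(cols)) for rows, cols in clusters]
```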

Page 24: Yang et al. (FLOC)

• Example (before): [figure of a small matrix with two overlapping biclusters, red and green; each row and column is annotated with its candidate actions, e.g. "remove from red", "add to green".]

Page 25: Yang et al. (FLOC)

• Example (decisions): [the same figure, with the single best action chosen for each row and column.]

Page 26: Yang et al. (FLOC)

• Example (after, if all actions are performed): [figure]
• Actual algorithm: execute the actions sequentially and keep only the best cluster set out of the M+N potential sets.

Page 27: Yang et al. (FLOC)

• How are different actions compared? Suppose an action on row/column x turns cluster c into cluster c′. The gain of the action is defined as

  Gain(x, c) = (rc − rc′) / rc + (r / rc′)² · (vc′ − vc) / vc

• A positive gain indicates an improvement in bicluster quality:
  - If rc′ > rc, the first term is negative: smaller residue is favored.
  - If vc′ > vc, the second term is positive: larger volume is favored.
  - If rc << r, the second term dominates: when the residue is small, the major goal is to increase volume.

Page 28: Yang et al. (FLOC)

• In what order are the M+N actions executed? Based on the gain values, with some probability of swapping the order so as to escape local optima.
• Termination criterion: stop when none of the M+N new bicluster sets both contains only r-biclusters and has a larger aggregate volume than the previous best set.
• Time complexity of FLOC: O((N+M)²kp).

Page 29: Yang et al. (FLOC)

• Additional features:
  - Limit the maximum amount of bicluster overlap.
  - Require a minimum coverage (fraction of entries covered by at least one bicluster).
  - Limit the ratio between the number of genes and conditions in each bicluster.
  - Require a minimum bicluster volume.
• How? Simply do not perform any action that would violate the constraints.

Page 30: Yang et al. (FLOC)

• Some results on the yeast cell cycle data:

  Algorithm         Avg. residue  Avg. volume  Avg. gene num.  Avg. cond. num.  Time
  Cheng & Church    204.293       1576.98      167             12               12 min
  FLOC              187.543       1825.78      195             12.8             6.7 min

Page 31: Yang et al. (FLOC)

• Some results on the yeast cell cycle data: [figure comparing biclusters found by Cheng & Church and by FLOC; the FLOC versions contain 1 more gene, and 2 more conditions plus 6 more genes, respectively.]

Page 32: Yang et al. (FLOC)

• Discussion:
  - The model is still not well suited to non-disjoint clusters.
  - There are more user parameters, including the number of biclusters.
  - There is no justification for performing one action per row/column in each iteration.
  - Gain values are computed from the biclusters as they were before any of the M+N actions.
  - The additional features can have negative impacts on the clustering process.

Page 33: Lazzeroni and Owen (Plaid Models)

• Model: each entry Yij of the data matrix is the superposition of
  1. the global background level,
  2. the background level of each layer (bicluster),
  3. the row (gene) effect of each layer,
  4. the column (condition) effect of each layer:

  Yij = μ0 + Σ_{k=1..K} (μk + αik + βjk) ρik κjk

  where ρik = 1 if bicluster k contains row i (0 otherwise), and κjk = 1 if bicluster k contains column j (0 otherwise).

Page 34: Lazzeroni and Owen (Plaid Models)

• Example:
  - Layer 0: μ0 = 10.
  - Layer 1: μ1 = 5, α1 = {2, 3, 4}, β1 = {1, 2, 3}, ρ1 = {1, 1, 0}, κ1 = {1, 1, 0}.
  - Layer 2: μ2 = 2, α2 = {3, 3, 5}, β2 = {4, 2, 1}, ρ2 = {1, 0, 0}, κ2 = {1, 1, 1}.

    10 10 10     8  9  0     9  7  6     27 26 16
    10 10 10  +  9 10  0  +  0  0  0  =  19 20 10
    10 10 10     0  0  0     0  0  0     10 10 10
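A NumPy sketch (mine, not from the paper) that assembles the example above from its layers and checks the sum:

```python
import numpy as np

def plaid_matrix(mu0, layers):
    """Y_ij = mu0 + sum_k (mu_k + alpha_ik + beta_jk) * rho_ik * kappa_jk."""
    n, m = len(layers[0]["rho"]), len(layers[0]["kappa"])
    Y = np.full((n, m), float(mu0))
    for L in layers:
        theta = L["mu"] + np.add.outer(L["alpha"], L["beta"])   # mu_k + alpha_ik + beta_jk
        Y += theta * np.outer(L["rho"], L["kappa"])              # masked by the memberships
    return Y

layers = [
    dict(mu=5, alpha=[2, 3, 4], beta=[1, 2, 3], rho=[1, 1, 0], kappa=[1, 1, 0]),
    dict(mu=2, alpha=[3, 3, 5], beta=[4, 2, 1], rho=[1, 0, 0], kappa=[1, 1, 1]),
]
Y = plaid_matrix(10, layers)
print(Y)
assert np.array_equal(Y, [[27, 26, 16], [19, 20, 10], [10, 10, 10]])
```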

Page 35: Lazzeroni and Owen (Plaid Models)

• The model is more suitable for overlapping biclusters.
• Goal: find the model parameters (K, μ0, μk, αik, βjk, ρik and κjk) such that the squared error is minimized.
• For simplicity, write θijk for the parameters of cluster k (μk + αik + βjk), and θij0 for μ0.
• Objective function (to minimize):

  Q = (1/2) Σ_{i=1..N} Σ_{j=1..M} (Yij − θij0 − Σ_{k=1..K} θijk ρik κjk)²

Page 36: Lazzeroni and Owen (Plaid Models)

• Algorithm to find one layer:
  1. Determine initial memberships ρ(0) and κ(0).
  2. For i = 0, …, s−1:
     a. Determine the cluster parameters θ(i+1) from ρ(i) and κ(i).
     b. Determine the row memberships ρ(i+1) from θ(i+1) and κ(i).
     c. Determine the column memberships κ(i+1) from θ(i+1) and ρ(i).

37

Lazzeroni and Owen (Plaid Models)

Determining initial memberships (0) and (0) (some attempts): All parameters set to 0.5 All parameters set to random values near 0.5 More complicated heuristics:

• Fix all ijk to 1.

• Perform several iterations that update and only.

• Scale and so that they sum to N/2 and M/2 respectively.

Page 38: Lazzeroni and Owen (Plaid Models)

• Determining the updated θ from the current ρ and κ: deduce the best fit of the model, subject to the constraint that the row and column effects each have zero mean.
• Solutions (using Lagrange multipliers):

  μk  = Σ_{i,j} ρik κjk Zij / ((Σ_i ρik²)(Σ_j κjk²))
  αik = Σ_j κjk (Zij − μk ρik κjk) / (ρik Σ_j κjk²)
  βjk = Σ_i ρik (Zij − μk ρik κjk) / (κjk Σ_i ρik²)

  where Zij = Yij − θij0 − Σ_{c=1..k−1} θijc ρic κjc is the residual left by the previously fitted layers.

Page 39: Lazzeroni and Owen (Plaid Models)

• Similarly, the membership parameters can be determined by:

  ρik = Σ_j θijk κjk Zij / Σ_j θijk² κjk²
  κjk = Σ_i θijk ρik Zij / Σ_i θijk² ρik²

• Stopping rule: stop if a layer has a smaller size than expected by chance (estimated by random permutation of the data), or when Kmax layers (a user parameter) have been found.

Page 40: Lazzeroni and Owen (Plaid Models)

• Some results on the yeast stress data (2467×79): 34 layers, 5568 parameters (<3% of all observations).

  No. of layers   Genes   Conditions   Observations
  0                 703      22          170703
  1                1031       5           22872
  2                 579       2            1307
  3                 142      11              11
  4-18               12      39               0
  Total            2467      79          194893

Page 41: Lazzeroni and Owen (Plaid Models)

• Some results on the yeast stress data: [figures]
  - Layer 1 includes many genes involved in the cell cycle.
  - Layer 3 includes many genes involved in glycolysis.

Page 42: Lazzeroni and Owen (Plaid Models)

• Discussion:
  - The model may still be too restrictive for gene expression data, in which co-regulated genes may have different magnitudes of response to a stimulus.
  - Again, normalization issues are critical but not addressed.

Page 43: Kluger et al. (Spectral)

• All the previous approaches define NP-hard problems and provide heuristic solutions. This study adopts a model for which the optimal solution can be found in polynomial time.

Page 44: Kluger et al. (Spectral)

• Model: each entry in the dataset is the product of
  1. a hidden base expression level,
  2. the tendency of gene i to be expressed in all conditions,
  3. the tendency of all genes to be expressed in condition j.
• A normalized dataset should contain a checkerboard structure. Within each block, all row tendencies are equal and all column tendencies are equal.

Page 45: Kluger et al. (Spectral)

• Illustration of the model: [figure]
• Suppose x′ = λ²x, where λ² is a scalar; then AᵀAx = λ²x, an eigenproblem.

Page 46: Kluger et al. (Spectral)

• Idea of the method:
  - The input gene expression profiles form a non-normalized, non-ordered matrix.
  - Suppose there is a way to normalize the data (discussed later); call the resulting matrix A.
  - Solve the eigenproblem AᵀAx = λ²x and examine the eigenvectors x. If the entries of an eigenvector can be sorted to produce a step-like structure, the condition clusters can be identified accordingly. The gene clusters are found similarly from the corresponding vectors y.

Page 47: Kluger et al. (Spectral)

• Illustration of the idea:

  A = 3 1 1 3 1
      2 6 6 2 6
      3 1 1 3 1

  The first eigenvector x of AᵀA: (0.30, 0.52, 0.52, 0.30, 0.52)
  The corresponding y = Ax: (3.4, 10.6, 3.4)

• By sorting the entries, it can be seen that there are two row clusters and two column clusters.
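This can be reproduced with a few lines of NumPy (my own check of the example; eigenvector signs are arbitrary, so absolute values are taken):

```python
import numpy as np

A = np.array([[3., 1., 1., 3., 1.],
              [2., 6., 6., 2., 6.],
              [3., 1., 1., 3., 1.]])

# Leading eigenvector of A^T A = leading right singular vector of A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
x = np.abs(eigvecs[:, -1])        # eigh sorts eigenvalues in ascending order
print(np.round(x, 2))             # [0.31 0.52 0.52 0.31 0.52] -> columns {1,4} vs {2,3,5}
print(np.round(A @ x, 1))         # [ 3.4 10.6  3.4]           -> rows {1,3} vs {2}
```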

Page 48: Kluger et al. (Spectral)

• Problem 1: non-normalized data.
  - E.g. if some rows are multiplied by a scalar, the eigenproblem cannot be formulated directly.

Page 49: Kluger et al. (Spectral)

• Normalization method 1: independent rescaling of genes and conditions.
  - Assume the non-normalized matrix is obtained by multiplying each row i by a scalar ri and each column j by a scalar cj; then ri1/ri2 = (mean of row i1) / (mean of row i2).
  - Let R be the diagonal matrix with the ri on its diagonal, and define C similarly from the cj. The eigenproblem can then be formulated on the rescaled matrix:

    Â = R^(−1/2) A C^(−1/2)

Page 50: Kluger et al. (Spectral)

• Method 2: bi-stochastization.
  - Repeat the independent scaling of genes and conditions until it stabilizes; the final matrix has all rows summing to one constant and all columns summing to a (different) constant.
• Method 3: log-interactions.
  - If the original rows/columns differ by multiplicative constants, then after taking logs they differ by additive constants.
  - Further, we want each row and each column to have zero mean. This can be achieved by transforming each entry as A′ij = Aij − AIj − AiJ + AIJ.

Page 51: Kluger et al. (Spectral)

• Problem 2: when the number of genes/conditions is large and the input data does not fit the model exactly, it is not easy to read off the clusters.
  - Our previous example (0.54, 0.24, 0.54, 0.54, 0.24) obviously contains 2 clusters. But what about (0.07, 0.09, 0.11, 0.11, 0.16, 0.24, 0.31, 0.36, 0.43, 0.45, 0.48, 0.5, 0.53, 0.56, 0.59, 0.65, 0.73, 0.81, 0.83, 0.97)?
• In such cases, standard one-way clustering techniques (e.g. k-means) can be used to cluster the entries of the eigenvectors.
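Putting slides 50-51 together, a minimal end-to-end sketch (my own, using scikit-learn's KMeans as the one-way clustering step; the normalization is method 3, log-interactions):

```python
import numpy as np
from sklearn.cluster import KMeans

def log_interactions(A):
    """Method 3: log-transform, then remove row and column means and add back the grand mean."""
    L = np.log(A)
    return L - L.mean(axis=1, keepdims=True) - L.mean(axis=0, keepdims=True) + L.mean()

def spectral_bicluster(A, n_row_clusters=2, n_col_clusters=2):
    """Cluster the entries of the leading singular vectors to label rows and columns."""
    Ahat = log_interactions(A)
    U, s, Vt = np.linalg.svd(Ahat, full_matrices=False)
    # After double-centring, the leading singular vectors carry the checkerboard structure.
    row_labels = KMeans(n_clusters=n_row_clusters, n_init=10).fit_predict(U[:, :1])
    col_labels = KMeans(n_clusters=n_col_clusters, n_init=10).fit_predict(Vt[:1, :].T)
    return row_labels, col_labels

# Toy data: a noisy 2x2 checkerboard of multiplicative blocks.
rng = np.random.default_rng(1)
base = np.array([[1.0, 4.0], [5.0, 2.0]])
A = np.kron(base, np.ones((10, 15))) * rng.lognormal(sigma=0.05, size=(20, 30))
rows, cols = spectral_bicluster(A)
print(rows, cols)   # two blocks of 10 row labels and two blocks of 15 column labels
```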

Page 52: Kluger et al. (Spectral)

• Results on the lymphoma Affymetrix data: [figure]

Page 53: Kluger et al. (Spectral)

• Results on the leukemia data: [figure]

Page 54: Kluger et al. (Spectral)

• Discussion:
  - Real datasets may deviate from the ideal checkerboard structure.
  - The model does not assume any irrelevant rows/columns, which differs from most biclustering, subspace clustering and projected clustering approaches.
  - The "clusters" are disjoint.

Page 55: 3 more approaches to go…

• In the previous models, every gene in a bicluster has the same amount of response to the conditions.
• The following three approaches define biclusters in less stringent ways.

Page 56: Ben-Dor et al. (OPSM)

• OPSM: Order-Preserving SubMatrix.
• Model:
  - For a condition set T and a gene g, the conditions in T can be ordered so that the expression values of g are sorted in ascending order (assume the values are all unique).
  - Suppose a submatrix A contains genes G and conditions T. A is a bicluster if there is an ordering (permutation) of T under which the expression values of every gene in G are sorted in ascending order.

Page 57: Ben-Dor et al. (OPSM)

• Example. Valid bicluster (induced permutation 2 3 4 1 5, i.e. t4 < t1 < t2 < t3 < t5 for every gene):

       t1  t2  t3  t4  t5
  g1    7  13  19   2  50
  g2   19  23  39   6  42
  g3    4   6   8   2  10

• Invalid bicluster (g3 breaks the common ordering):

       t1  t2  t3  t4  t5
  g1    7  13  19   2  50
  g2   19  23  39   6  42
  g3    4   6   8   2   7

Page 58: Ben-Dor et al. (OPSM)

• Goal: to find OPSMs of maximum statistical significance (stochastic null model: each row has an independent random permutation).
• Fact: given an N×M matrix, the problem of finding a k×s OPSM is NP-complete.

Page 59: Ben-Dor et al. (OPSM)

• Some terms:
  - Complete model (T, π): T is a set of conditions (columns), π is an ordering of the conditions in T.
  - Partial model (<t1, t2, …, ta>, <t(s−b+1), …, ts>, s): the first a and the last b conditions of the ordering are specified, but not the remaining s − a − b conditions.
  - A row "supports" a model if applying the permutation to the row results in a monotonically increasing sequence of values.
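A tiny sketch of the support test for a complete model (mine; a row supports (T, π) when its values increase along the ordering):

```python
def supports(row, ordering):
    """True if the row's expression values are strictly increasing along `ordering`
    (a list of column indices, i.e. a complete model)."""
    values = [row[t] for t in ordering]
    return all(a < b for a, b in zip(values, values[1:]))

# The valid example above: every gene increases along t4, t1, t2, t3, t5 (0-based: 3, 0, 1, 2, 4).
rows = [[7, 13, 19, 2, 50],
        [19, 23, 39, 6, 42],
        [4, 6, 8, 2, 10]]
print([supports(r, [3, 0, 1, 2, 4]) for r in rows])   # [True, True, True]
print(supports([4, 6, 8, 2, 7], [3, 0, 1, 2, 4]))     # False: 8 > 7 breaks the order
```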

Page 60: Ben-Dor et al. (OPSM)

• Idea of the algorithm: grow partial models until they become complete models.
• Algorithm:
  - Evaluate all (1, 1) partial models (there are O(m²) of them); keep the best l.
  - Expand them to (2, 1) models (there are O(ml) candidates); keep the best l.
  - Expand them to (2, 2) models; keep the best l.
  - Expand them to (3, 2) models; keep the best l.
  - … until l complete (s/2, s/2) models are obtained. Output the best one.

Page 61: Ben-Dor et al. (OPSM)

• Assume evaluating one model takes O(ns) time; then the whole algorithm requires O(nm³l) time.
• Evaluating a partial model (idea):
  - A model is more favorable if more rows support it.
  - A row is more likely to support a partial model if there is a large "gap":

       t1  t2  t3  t4  t5
  g1    1   2   3   4   5
  g2    2   3   4   1   5

  (Annotated in the original figure: the values 1 and 4, "a larger gap" vs. "a smaller gap".)

Page 62: Ben-Dor et al. (OPSM)

• Some results on the breast tumor data (3226×22: 8 tumors with BRCA1 mutations, 8 with BRCA2 mutations and 6 sporadic breast tumors):
  - A 347×4 bicluster whose first three tissues carry BRCA2 mutations and whose last one is sporadic.
  - A 42×6 bicluster with five BRCA2-mutation tissues followed by one BRCA1-mutation tissue.
  - A 7×8 bicluster with four BRCA2-mutation tissues followed by three BRCA1-mutation tissues, followed by a sporadic cancer sample.

Page 63: Ben-Dor et al. (OPSM)

• The 347×4 bicluster: [figure]

Page 64: Ben-Dor et al. (OPSM)

• Discussion:
  - Although the model concerns only the order of values rather than value distances or correlations, the use of a total ordering still makes the model quite restrictive (the paper suggests some possible model extensions, but without corresponding algorithms).
  - Compared with the previous models, OPSM seems less biologically intuitive.
  - The algorithm does not prevent the final models from being highly similar to one another.

Page 65: Tanay et al. (SAMBA)

• SAMBA: Statistical-Algorithmic Method for Bicluster Analysis.
• Model: the whole dataset forms a bipartite graph G = (U, V, E):
  - U is the set of conditions.
  - V is the set of genes.
  - (u, v) ∈ E iff gene v responds in condition u (i.e., the expression level of v changes significantly in u).
• A bicluster is a subgraph of the bipartite graph.

Page 66: Tanay et al. (SAMBA)

• Example:

       t1   t2   t3
  g1   0.8  1.5  2.6
  g2   0.4  0.7  3.2

  [Figure: the corresponding bipartite graph, with condition vertices t1, t2, t3 on one side, gene vertices g1, g2 on the other, and an edge wherever the gene responds in the condition.]

Page 67: Tanay et al. (SAMBA)

• Goal: to find the maximum-weight subgraph.
  - Assume edges occur independently and equiprobably with density p = |E| / (|U||V|).
  - Denote by BP(k, p, n) the binomial tail, i.e. the probability of observing k or more successes in n trials; then the probability of obtaining a bicluster H = (U′, V′, E′) is p(H) = BP(|E′|, p, |U′||V′|).
  - Assume p < ½; then the problem can be transformed into finding a maximum-weight subgraph of G in which each edge has positive weight (−1 − log₂ p) and each non-edge has negative weight (−1 − log₂(1 − p)) (details skipped).
  - A refined model that does not assume independent edges can also be defined.
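A short sketch of this weighting (my own illustration of the scheme just described, with base-2 logs; the score of a candidate bicluster is the sum of the edge and non-edge weights inside it):

```python
import math

def subgraph_score(U_sub, V_sub, edges, p):
    """Sum of weights inside the bicluster U_sub x V_sub:
    (-1 - log2 p) per edge, (-1 - log2(1 - p)) per non-edge.  Assumes p < 1/2."""
    w_edge, w_nonedge = -1 - math.log2(p), -1 - math.log2(1 - p)
    score = 0.0
    for u in U_sub:
        for v in V_sub:
            score += w_edge if (u, v) in edges else w_nonedge
    return score

# Toy example: 3 conditions x 4 genes, 5 edges -> p = 5/12.
edges = {("t1", "g1"), ("t1", "g2"), ("t2", "g1"), ("t2", "g2"), ("t3", "g4")}
p = len(edges) / (3 * 4)
print(subgraph_score({"t1", "t2"}, {"g1", "g2"}, edges, p))        # dense block -> positive score
print(subgraph_score({"t1", "t2", "t3"}, {"g3", "g4"}, edges, p))  # sparse block -> negative score
```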

Page 68: Tanay et al. (SAMBA)

• Assume the gene vertices have d-bounded degree (no more than d edges incident on each gene vertex).
  - Rationale: genes that constantly show abnormal expression are not interesting.
• Define the neighborhood N(v) of a vertex v as the set of vertices adjacent to v in G.
• An O(|V|·2^d)-time algorithm finds the maximum-weight biclique: [pseudocode figure]

Page 69: Tanay et al. (SAMBA)

• Based on this algorithm, the maximum-weight subgraph can be found in O(n·2^d·log(2^d)) time.
• The model can also be extended to take the sign of the expression changes into account ("over-expressed" or "under-expressed").

Page 70: Tanay et al. (SAMBA)

• The SAMBA algorithm:
  1. Form the bipartite graph and calculate the vertex-pair weights. A gene is defined as up-regulated (or down-regulated) in a condition if its standardized expression level (mean 0, variance 1) is above 1 (or below −1).
  2. Apply a hashing technique to find the k heaviest bicliques in the graph.
  3. Perform greedy addition/removal of vertices, and filter out biclusters that are too similar to each other.
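Step 1 can be sketched in a few lines (my own illustration of the thresholding rule in the slide; rows are genes, columns are conditions):

```python
import numpy as np

def build_bipartite_graph(X, threshold=1.0):
    """Standardize each gene (row) to mean 0, variance 1, and add an edge (condition, gene)
    wherever the standardized level exceeds +threshold or falls below -threshold."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    edges = {(j, i) for i, j in zip(*np.where(np.abs(Z) > threshold))}
    return edges, Z

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))            # 5 genes x 8 conditions
edges, Z = build_bipartite_graph(X)
print(len(edges), sorted(edges)[:3])   # (u = condition index, v = gene index) pairs
```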

Page 71: Tanay et al. (SAMBA)

• Experiments on yeast data (6200×515):
  - Use the fourth-level GO annotations as class labels; hide the labels of 30% of the genes.
  - Form biclusters. For each bicluster in which 60% of the labeled genes belong to the same class, all genes with hidden labels are assumed to belong to that class.
  - Compare the assumed and actual class labels to obtain the accuracy.
  - Repeat 100 times.

Page 72: Tanay et al. (SAMBA)

• Some results on the yeast data: [figure comparing SAMBA class assignments against the actual classes. How to read it: 15% of the genes classified as "AA Met" by SAMBA actually belong to the class "Pro Met".]

Page 73: Tanay et al. (SAMBA)

• Discussion:
  - Although the paper reports reasonable running times (a few minutes for a 15000×500 dataset, with d set to 40), the exponential time complexity of SAMBA is daunting.
  - It is not easy to define abnormal expression.
  - Performing row standardization is not always appropriate.

Page 74: Getz et al. (CTWC)

• CTWC: Coupled Two-Way Clustering.
• Goal: to find subsets of genes and conditions such that a single process is the main contributor to the expression of the gene subset over the condition subset.
• Idea: repeatedly perform one-way clustering on genes/conditions. Stable clusters of genes are used as the attributes for condition clustering, and vice versa (see the sketch below).
• Domain knowledge can be supplied as additional initial gene/condition clusters.
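A high-level sketch of the coupled iteration (my own rendering; the published CTWC uses the SPC algorithm and its stability criterion as the plug-in, here replaced by a placeholder `one_way_cluster` that must return only the clusters it deems stable):

```python
def ctwc(X, one_way_cluster, max_rounds=5):
    """Coupled two-way clustering skeleton.
    X: genes x conditions NumPy matrix.  one_way_cluster(submatrix) -> list of
    row-index clusters of that submatrix (only the stable ones)."""
    n_genes, n_conds = X.shape
    gene_pool = [tuple(range(n_genes))]          # start from "all genes"
    cond_pool = [tuple(range(n_conds))]          # and "all conditions"
    done = set()                                 # (gene cluster, condition cluster) pairs already coupled
    for _ in range(max_rounds):
        new_jobs = [(g, c) for g in gene_pool for c in cond_pool if (g, c) not in done]
        if not new_jobs:
            break                                # termination: no new stable clusters to couple
        for g, c in new_jobs:
            done.add((g, c))
            # Cluster the genes of g using only the conditions in c as attributes...
            for rows in one_way_cluster(X[list(g)][:, list(c)]):
                cluster = tuple(sorted(g[i] for i in rows))
                if len(cluster) > 1 and cluster not in gene_pool:
                    gene_pool.append(cluster)
            # ...and cluster the conditions of c using only the genes in g as attributes.
            for cols in one_way_cluster(X[list(g)][:, list(c)].T):
                cluster = tuple(sorted(c[j] for j in cols))
                if len(cluster) > 1 and cluster not in cond_pool:
                    cond_pool.append(cluster)
    return gene_pool, cond_pool
```

Any plug-in that can decide the number of clusters and judge their stability (the requirements noted in the discussion slide below) fits the `one_way_cluster` slot.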

Page 75: Getz et al. (CTWC)

• Illustration of the idea (assume a 4×3 dataset). Initially the gene-cluster pool contains {g1, g2, g3, g4} (all genes) and {g1, g3} (domain knowledge); the condition-cluster pool contains {t1, t2, t3}.

Page 76: Getz et al. (CTWC)

• [The same pools, shown as the starting state of the iteration.]

Page 77: Getz et al. (CTWC)

• Every gene cluster is coupled with every condition cluster, giving four clustering jobs:
  - cluster rows g1, g2, g3, g4 using columns t1, t2, t3;
  - cluster rows t1, t2, t3 using columns g1, g2, g3, g4;
  - cluster rows g1, g3 using columns t1, t2, t3;
  - cluster rows t1, t2, t3 using columns g1, g3.

Page 78: Getz et al. (CTWC)

• The first job (rows g1, g2, g3, g4; columns t1, t2, t3) yields cluster 1: {g1, g3, g4} and cluster 2: {g2}.

Page 79: Getz et al. (CTWC)

• The newly found stable clusters {g1, g3, g4} and {g2} are added to the gene-cluster pool.

Page 80: Getz et al. (CTWC)

• Termination: all stable clusters have already been added to the pools.
• One-way clustering algorithm used in the experiments: super-paramagnetic clustering (SPC).
  - A hierarchical clustering method.
  - Based on an analogy to the physics of inhomogeneous ferromagnets: clusters break up as the temperature increases.
• Normalization: (1) divide by the column mean; (2) standardize each row.
• Distance function: Euclidean distance.

Page 81: Getz et al. (CTWC)

• Some results on the leukemia data (1753×72: 47 ALL, 25 AML):
  - After two iterations, the algorithm formed 49 stable gene clusters and 35 stable sample clusters.
  - One sample cluster contains 37 samples and is stable when either a cluster of 27 genes or another, unrelated cluster of 36 genes is used as the attributes. The latter contains many genes that participate in the glycolysis pathway.

Page 82: Getz et al. (CTWC)

• Some results on the leukemia data (1753×72: 47 ALL, 25 AML):
  - When the AML samples were clustered using a 28-gene cluster as attributes, a stable cluster was found that contains most of the samples (14/15) taken from patients who underwent treatment and whose results were known.

Page 83: Getz et al. (CTWC)

• Discussion:
  - The number of clusters in the pools can be very large.
  - Only a specific class of one-way clustering algorithms can be used as the plug-in:
    • they must be able to determine the number of clusters;
    • there must be a way to evaluate the stability of clusters.
  - The meaning of the biclusters is not very intuitive.

Page 84: Summary

• Definitions of (bi-)clusters used by the various methods:
  - Same trend (background +/× row effect +/× column effect).
  - Same ordering of values.
  - Simultaneous abnormal expression (no direction, same direction, or same/opposite direction).
  - Depending on the plug-in algorithm.
  - (Projected clustering): similar values.
  - (Other works): similar shape (e.g. considering only the trend across adjacent time points).

Page 85: Summary

• The general research approach:
  1. Define the bicluster model and the clustering goal.
  2. Determine whether the problem is NP-hard (usually it is).
  3. Construct a statistical test for evaluating the significance/goodness of a bicluster or a set of biclusters.
  4. Sketch the algorithm (usually greedy).
  5. If the algorithm has a high complexity, speed it up with reasonable heuristics.

Page 86: Summary

• The general research approach (continued):
  6. Test on synthetic data; validate with statistical tests and known bicluster structures.
  7. Test on real data; validate by
     • statistical tests,
     • comparison with previously published results,
     • condition types,
     • gene annotations,
     • visualization.

Page 87: Research Opportunities

• Propose other bicluster models.
• Based on the current models, propose new algorithms that improve bicluster quality (validated statistically or biologically) and/or time complexity.
• Combine the strengths of multiple studies (e.g. plaid models + graph theory + statistical testing).
• Investigate the effects of normalization on the models/algorithms.
• Compare the different methods on other real datasets.
• Make better use of domain knowledge.

Page 88: References

• Yizong Cheng and George M. Church. Biclustering of Expression Data. ISMB 2000.
• G. Getz, E. Levine and E. Domany. Coupled Two-Way Clustering Analysis of Gene Microarray Data. Proc. Natl. Acad. Sci. USA, 2000.
• Laura Lazzeroni and Art Owen. Plaid Models for Gene Expression Data. Statistica Sinica, 2002.
• Amir Ben-Dor, Benny Chor, Richard Karp and Zohar Yakhini. Discovering Local Structure in Gene Expression Data: The Order-Preserving Submatrix Problem. RECOMB 2002.

Page 89: References

• Amos Tanay, Roded Sharan and Ron Shamir. Discovering Statistically Significant Biclusters in Gene Expression Data. Bioinformatics, 2002.
• Jiong Yang, Haixun Wang, Wei Wang and Philip Yu. Enhanced Biclustering on Expression Data. BIBE 2003.
• Yuval Kluger, Ronen Basri, Joseph T. Chang and Mark Gerstein. Spectral Biclustering of Microarray Cancer Data: Co-clustering Genes and Conditions. Genome Research, 2003.