pattern-based clustering

University at Buffalo The State University of New York

Pattern-based Clustering

How to cluster the five objects? It is hard to define a global similarity measure.


What Is Pattern-based Clustering?

A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al., 2002)


Challenges

• Most clustering approaches do not address the temporal variations in time-series gene expression data, which are an important feature and affect clustering performance.
• Previous approaches try to find coherent patterns and clusters with respect to the entire set of attributes.
• Patterns may be embedded in subspaces of the attributes:
  - Only a subset of genes participates in any cellular process of interest.
  - Any cellular process may take place only in a subset of the experimental conditions.

Figure: (a) raw data, (b) shifting patterns, (c) scaling patterns.


Gene-Sample-Time (GST) Microarray Data

2D time-series data

3D gene-sample-time data

• The GST microarray data consist of three dimensions.

• The samples often exhibit various phenotypes, e.g., cancer vs. control.

A collection of samples


Challenges of Mining GST Data

Challenges        2D data                   3D data
Mining process    Partition genes           Partition genes and samples simultaneously
Cluster model     Two types of variables    Three types of variables

Most clustering algorithms were designed for 2D data and cannot be directly extended to 3D data.


Coherent Gene Cluster

• The group of samples (sj1, sj2, sj3) may exhibit the same phenotype.
• The group of genes (gi1, gi2, gi3) may be strongly correlated with the phenotype shared by (sj1, sj2, sj3).

Figure: a 3D GST data set, its 2D representation, and a coherent gene cluster.


Results from a Real Data Set

The Multiple Sclerosis (MS) data consist of:
• 4324 genes
• 13 MS patients
• 10 time points before and after IFN-β treatment

25 coherent gene clusters were reported.

Figure: an example of a coherent gene cluster (107 genes, 8 samples), with one panel per sample (Sample A to Sample H).


Other Types of Coherent Clusters


Problem Definition

Given a GST microarray data matrix M, a maximal coherent gene cluster C = (G × S) is a combination of a subset of genes G and a subset of samples S such that:
• Coherent: the genes in G are coherent across the samples in S;
• Significant: |G| ≥ ming and |S| ≥ mins, where ming and mins are user-specified parameters;
• Maximal: any insertion of a gene g ∉ G or a sample s ∉ S makes C no longer coherent.

The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.


Coherence Measure

• Various coherence measures exist; the choice of measure is application dependent.
• A general coherence model: given a coherence measure sim(•) and a user-specified threshold δ,
  - a gene ga is coherent on samples si and sj if sim(pai, paj) ≥ δ;
  - coherent gene matrix (G1, S1): every gene gi ∈ G1 is coherent across the samples in S1;
  - trivial coherent gene matrices: ({gi}, {sj}) and (G, {sj}).
• We choose Pearson's correlation coefficient; other coherence measures are also applicable.
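A minimal sketch of the pairwise coherence test with Pearson's correlation coefficient. It assumes that pai and paj are the expression profiles of one gene over the time points in two samples; the function names, the profile values, and the handling of flat profiles are illustrative assumptions, not part of the original slides.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a flat profile is treated as not coherent.
        return 0.0
    return cov / sqrt(var_x * var_y)


def is_coherent(p_ai: Sequence[float], p_aj: Sequence[float], delta: float) -> bool:
    """True if the two profiles of the same gene satisfy sim >= delta."""
    return pearson(p_ai, p_aj) >= delta


# Hypothetical profiles of one gene in different samples.
base = [1.0, 3.0, 2.0, 5.0]
shifted = [v + 10.0 for v in base]   # shifting pattern
scaled = [v * 2.5 for v in base]     # scaling pattern
opposite = [-v for v in base]        # anti-correlated profile

print(is_coherent(base, shifted, 0.8))
print(is_coherent(base, scaled, 0.8))
print(is_coherent(base, opposite, 0.8))
```

Because the coefficient is invariant to shifting and positive scaling, both kinds of patterns pass the threshold, while the anti-correlated profile does not.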


Related Work

• Clustering algorithms on Gene-Sample or Gene-Time microarray data
  - The cluster model is completely different.
• Subspace clustering
  - Find subsets of objects coherent with subsets of attributes.
• Frequent pattern mining
  - Find subsets of items frequently appearing in transaction databases.


Algorithm Outline

Phase 1 (Pre-processing) : For each gene g, find the complete set of maximal coherent sample sets of gene g.

Phase 2: Compute the complete set of maximal coherent gene clusters based on pre-processing results.


Coherent Sample Sets

Given a gene g, a maximal coherent sample set of g is a subset of samples Si such that:
• coherent: g is coherent across Si;
• significant: |Si| ≥ mins;
• maximal: there exists no superset S’ ⊃ Si such that g is also coherent across S’.

(g × Si) is a building block for coherent gene clusters that include g.


Preprocessing Phase

The coherence matrix of gene g:

      s1  s2  s3  s4  s5  s6
s1     1   1   0   1   0   0
s2     1   1   0   0   0   0
s3     0   0   1   1   1   1
s4     1   0   1   1   1   1
s5     0   0   1   1   1   1
s6     0   0   1   1   1   1

The coherence graph of gene g has one node per sample and an edge between every pair of samples on which g is coherent (a 1 in the matrix).

Suppose mins = 3. Then {s3, s4, s5, s6} is a coherent sample set of gene g.
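A sketch of the pre-processing step for one gene, under the assumption that "coherent across a sample set" means pairwise coherent, so that maximal coherent sample sets correspond to maximal cliques of the coherence graph. The slides do not name a clique algorithm; the basic Bron-Kerbosch procedure is used here as one possible choice, and all function names are illustrative.

```python
from typing import Dict, FrozenSet, List

SAMPLES = ["s1", "s2", "s3", "s4", "s5", "s6"]
# Coherence matrix of gene g from the example above.
MATRIX = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]


def coherence_graph(samples: List[str], matrix: List[List[int]]) -> Dict[str, FrozenSet[str]]:
    """Adjacency sets of the coherence graph (the diagonal is ignored)."""
    return {
        a: frozenset(b for j, b in enumerate(samples) if j != i and matrix[i][j] == 1)
        for i, a in enumerate(samples)
    }


def maximal_cliques(adj: Dict[str, FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques: List[FrozenSet[str]] = []

    def expand(r: FrozenSet[str], p: FrozenSet[str], x: FrozenSet[str]) -> None:
        if not p and not x:
            cliques.append(r)  # r cannot be extended any further
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}   # v has been fully explored
            x = x | {v}   # exclude v from later cliques at this level

    expand(frozenset(), frozenset(adj), frozenset())
    return cliques


def maximal_coherent_sample_sets(samples, matrix, min_s: int) -> List[FrozenSet[str]]:
    """Maximal cliques of the coherence graph with at least min_s samples."""
    adj = coherence_graph(samples, matrix)
    return [c for c in maximal_cliques(adj) if len(c) >= min_s]


print(maximal_coherent_sample_sets(SAMPLES, MATRIX, min_s=3))
```

With mins = 3 the only set returned is {s3, s4, s5, s6}; the smaller cliques {s1, s2} and {s1, s4} are filtered out by the significance constraint.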


Sample-gene Search

• Set enumeration tree
  - Enumerate all subsets of samples systematically.
  - Each node of the tree corresponds to a subset of samples.
• For each node S
  - Find the maximal set of genes G_S that is coherent with S.


Set Enumeration Tree

The set enumeration tree for {a,b,c,d}, level by level:

{}
{a} {b} {c} {d}
{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}
{a,b,c} {a,b,d} {a,c,d} {b,c,d}
{a,b,c,d}
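The tree can be generated by letting each node extend itself only with items that come after its last item in a fixed order, so every subset appears exactly once. A short sketch of a depth-first traversal (the function name is illustrative):

```python
from typing import Iterator, Sequence, Tuple


def enumerate_subsets(items: Sequence[str], prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the nodes of the set enumeration tree in depth-first order."""
    yield prefix
    for i, item in enumerate(items):
        # A child may only use items that come after `item` in the order.
        yield from enumerate_subsets(items[i + 1:], prefix + (item,))


nodes = list(enumerate_subsets(["a", "b", "c", "d"]))
print(len(nodes))
print(nodes[:6])
```

For four items the traversal visits 16 nodes, starting with {}, {a}, {a,b}, {a,b,c}, {a,b,c,d}, {a,b,d}.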


Find the Maximal Coherent Subset of Genes

After the pre-processing phase, given a subset of samples S, how do we find the maximal coherent set of genes G_S?

• Expensive approach: scan the table. For each S, G_S can be derived by a single scan of the maximal coherent sample sets of all genes: if S ⊆ Sj for some maximal coherent sample set Sj of a gene g, then g ∈ G_S.
• Efficient approach: use the inverted list.

g1 {s1, s2, s3, s4, s5}

g2 {s1,s2,s4}, {s1,s5}

g3 {s1,s2,s3,s4,s5}

g4 {s1,s2,s3},{s5,s6}

g5 {s1,s5,s6}
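The scan-based approach can be sketched on the table above as follows. The dictionary layout and the function name are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set

# Maximal coherent sample sets per gene (pre-processing results from the table).
MAX_SETS: Dict[str, List[Set[str]]] = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}


def genes_by_scan(samples: Iterable[str], max_sets: Dict[str, List[Set[str]]]) -> Set[str]:
    """Scan every gene; keep it if one of its maximal sets contains all samples."""
    wanted = set(samples)
    return {
        gene
        for gene, blocks in max_sets.items()
        if any(wanted <= block for block in blocks)
    }


print(sorted(genes_by_scan({"s1", "s2", "s3"}, MAX_SETS)))
```

Every call touches all genes and all of their sample sets, which is what makes this approach expensive when it is repeated for each node of the tree.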


The Inverted List

The table of maximal coherent sample sets for genes:

Gene    Maximal coherent sample sets
g1      {s1, s2, s3, s4, s5}
g2      {s1, s2, s4}, {s1, s5}
g3      {s1, s2, s3, s4, s5}
g4      {s1, s2, s3}, {s5, s6}
g5      {s1, s5, s6}

The table of inverted lists for samples:

Sample  The inverted list
s1      {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2      {g1.b1, g2.b1, g3.b1, g4.b1}
s3      {g1.b1, g3.b1, g4.b1}
s4      {g1.b1, g2.b1, g3.b1}
s5      {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6      {g4.b2, g5.b1}

Here gi.bk denotes the k-th maximal coherent sample set of gene gi; for example, g2.b1 = {s1, s2, s4} and g2.b2 = {s1, s5}.


Intersection Instead of Scanning

Given a subset of samples S = {si1, …, sik}, intersect the inverted lists of si1, …, sik.

For example, given S = {s1, s2, s3}, L_s1 ∩ L_s2 ∩ L_s3 = {g1.b1, g3.b1, g4.b1}, so G_S = {g1, g3, g4}.

Suppose the parent of S is S’ = {si1, …, sik-1}; then L_S = L_S’ ∩ L_sik.
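A sketch of how the inverted lists could be built from the pre-processing results and then intersected. The entry format "g.bk" follows the example table; the function names and data layout are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set

MAX_SETS: Dict[str, List[Set[str]]] = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}


def build_inverted_lists(max_sets: Dict[str, List[Set[str]]]) -> Dict[str, Set[str]]:
    """Map each sample to the entries 'gene.bk' whose k-th set contains it."""
    inverted: Dict[str, Set[str]] = {}
    for gene, blocks in max_sets.items():
        for k, block in enumerate(blocks, start=1):
            for sample in block:
                inverted.setdefault(sample, set()).add(f"{gene}.b{k}")
    return inverted


def intersect(samples: Iterable[str], inverted: Dict[str, Set[str]]) -> Set[str]:
    """L_S: the intersection of the inverted lists of all samples in S."""
    lists = [inverted.get(s, set()) for s in samples]
    return set.intersection(*lists) if lists else set()


def genes_of(entries: Iterable[str]) -> Set[str]:
    """G_S: the genes that own at least one entry of L_S."""
    return {entry.split(".")[0] for entry in entries}


INVERTED = build_inverted_lists(MAX_SETS)
L_parent = intersect(["s1", "s2"], INVERTED)   # list of the parent node {s1, s2}
L_child = L_parent & INVERTED["s3"]            # one more intersection for {s1, s2, s3}
print(sorted(L_child), sorted(genes_of(L_child)))
```

The last two assignments illustrate the incremental rule: a child node needs only one extra intersection on top of its parent's list.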


Anti-monotonic Property

Given a combination (G × S), if G is not coherent on S, then for any superset S’ ⊃ S, G cannot be coherent on S’.

For any descendant S’ of S on the tree, let G_S be the maximal coherent gene set of S and G_S’ the maximal coherent gene set of S’. Since S’ ⊃ S, we have G_S’ ⊆ G_S.


Pruning Irrelevant Samples

Given a subset of samples S = {si1, …, sik}, a sample sj ∈ tails if:
• j > ik, and
• there exist at least ming genes g such that g is coherent with S ∪ {sj}.

Samples sl ∉ tails (irrelevant samples) cannot be used to extend S.
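A sketch of the tails computation on the inverted lists of the running example. The value ming = 3 used in the calls is an assumption chosen for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Inverted lists of the running example, in the sample order s1 < s2 < ... < s6.
INVERTED: Dict[str, Set[str]] = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}
ORDER: List[str] = ["s1", "s2", "s3", "s4", "s5", "s6"]


def gene_count(entries: Set[str]) -> int:
    """Number of distinct genes among entries of the form 'gene.bk'."""
    return len({entry.split(".")[0] for entry in entries})


def tails(S: List[str], inverted: Dict[str, Set[str]], order: List[str], min_g: int) -> List[str]:
    """Samples after the last sample of S that keep at least min_g genes coherent.

    S must be non-empty and listed in enumeration order.
    """
    L_S = set.intersection(*(inverted[s] for s in S))
    last = order.index(S[-1])
    return [s for s in order[last + 1:] if gene_count(L_S & inverted[s]) >= min_g]


print(tails(["s1"], INVERTED, ORDER, min_g=3))
print(tails(["s1", "s2"], INVERTED, ORDER, min_g=3))
```

For S = {s1}, sample s6 is irrelevant because only g5 stays coherent with {s1, s6}, so it is never used to extend S.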


Pruning Unpromising Nodes

Given a subset of samples S = {si1, …, sik}:
• If |S| + |tails| < mins, then prune the subtree of S.
• Let G_S be the maximal coherent subset of genes of S. If there exists (G’ × S’) such that (S ∪ tails) ⊆ S’ and G_S ⊆ G’, then prune the subtree of S.
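The two pruning tests can be sketched as a single predicate. It assumes that (G’ × S’) ranges over clusters that have already been reported during the search; the function name, argument layout, and the cluster used in the last call are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cluster = Tuple[Set[str], Set[str]]  # (genes G', samples S')


def should_prune(
    S: Iterable[str],
    tails: Iterable[str],
    G_S: Iterable[str],
    found: List[Cluster],
    min_s: int,
) -> bool:
    """Return True if the subtree rooted at S cannot yield a new maximal cluster."""
    S, tails, G_S = set(S), set(tails), set(G_S)
    # Rule 1: even using every tail sample, the subtree cannot reach min_s samples.
    if len(S) + len(tails) < min_s:
        return True
    # Rule 2: every cluster in the subtree has samples within S ∪ tails and genes
    # within G_S, so it is subsumed if a known cluster covers both.
    reach = S | tails
    return any(reach <= s_found and G_S <= g_found for g_found, s_found in found)


print(should_prune({"s1", "s3"}, [], {"g1", "g3", "g4"}, [], min_s=3))
print(should_prune({"s1", "s2"}, ["s3", "s4"], {"g1", "g2", "g3", "g4"}, [], min_s=3))

# Hypothetical previously reported cluster, used only to exercise rule 2.
known = [({"g1", "g3", "g4"}, {"s1", "s2", "s3"})]
print(should_prune({"s2", "s3"}, ["s1"], {"g1", "g3"}, known, min_s=3))
```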


Determination of Maximal Coherent Gene Clusters

The depth-first search strategy: any superset S’ of S is either visited before S or is a descendant of S.

To determine whether a coherent gene cluster (G_S × S) is maximal, check (G_S × S) after visiting all the children of S, and report it if it is not subsumed by a cluster that has already been reported.
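Putting the pieces together, here is a compact sketch of a sample-gene search: depth-first traversal, incremental intersection of inverted lists, tails-based pruning, and the subsumption check after the children have been visited. It is a simplified illustration under the assumptions ming = 3 and mins = 3, not the authors' implementation, and it omits the second pruning rule for brevity.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

MAX_SETS: Dict[str, List[Set[str]]] = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}
Entry = Tuple[str, int]  # (gene, index of one of its maximal sample sets)
Cluster = Tuple[FrozenSet[str], FrozenSet[str]]  # (genes, samples)


def mine(max_sets: Dict[str, List[Set[str]]], min_g: int, min_s: int) -> List[Cluster]:
    # Build the inverted lists: sample -> entries whose sample set contains it.
    inverted: Dict[str, Set[Entry]] = {}
    for gene, blocks in max_sets.items():
        for k, block in enumerate(blocks, start=1):
            for sample in block:
                inverted.setdefault(sample, set()).add((gene, k))
    order = sorted(inverted)  # assumption: lexicographic sample order
    results: List[Cluster] = []

    def genes_of(entries: Set[Entry]) -> Set[str]:
        return {gene for gene, _ in entries}

    def visit(S: List[str], L_S: Set[Entry], candidates: List[str]) -> None:
        # tails: later samples that keep at least min_g genes coherent.
        tails = []
        for s in candidates:
            L_child = L_S & inverted[s]
            if len(genes_of(L_child)) >= min_g:
                tails.append((s, L_child))
        if len(S) + len(tails) < min_s:
            return  # the subtree cannot reach min_s samples
        for i, (s, L_child) in enumerate(tails):
            visit(S + [s], L_child, [t for t, _ in tails[i + 1:]])
        # All children visited: report (G_S x S) unless it is subsumed.
        G, S_set = frozenset(genes_of(L_S)), frozenset(S)
        if len(S_set) >= min_s and len(G) >= min_g:
            if not any(S_set <= s2 and G <= g2 for g2, s2 in results):
                results.append((G, S_set))

    for i, s in enumerate(order):
        visit([s], inverted[s], order[i + 1:])
    return results


clusters = mine(MAX_SETS, min_g=3, min_s=3)
for genes, samples in clusters:
    print(sorted(genes), sorted(samples))
```

On the running example this reports two maximal coherent gene clusters: {g1, g3, g4} on {s1, s2, s3} and {g1, g2, g3} on {s1, s2, s4}.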


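Example search tree. Each node is a subset of samples S, shown with its tails and, below the first level, its intersected inverted list L_S:

Node S        tails               L_S
{}
{s1}          {s2, s3, s4, s5}
{s2}          {s3, s4}
{s3}          {}
{s4}          {}
{s1,s2}       {s3, s4}            {g1.b1, g2.b1, g3.b1, g4.b1}
{s1,s3}       {}                  {g1.b1, g3.b1, g4.b1}
{s1,s4}       {}                  {g1.b1, g2.b1, g3.b1}
{s2,s3}       {}                  {g1.b1, g3.b1, g4.b1}
{s2,s4}       {}                  {g1.b1, g2.b1, g3.b1}
{s1,s2,s3}    {}                  {g1.b1, g3.b1, g4.b1}
{s1,s2,s4}    {}                  {g1.b1, g2.b1, g3.b1}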



Mining Coherent Gene Clusters

• Systematic enumeration of genes and samples
  - Sample-Gene Search
  - Gene-Sample Search
• Pruning rules
• Determination of whether a coherent gene cluster (G × S) is maximal


Gene-sample Search

                                   Sample-Gene Search                      Gene-Sample Search
Subjects to enumerate              samples                                 genes
Number of subjects to enumerate    10^1 ~ 10^2                             10^3 ~ 10^4
Coherent objects                   Single set of maximal coherent genes    Single or multiple sets of maximal coherent samples
Efficiency on GST data             High                                    Low


Experiment Data Sets

• Real-world gene expression data
  - 4324 genes
  - 13 multiple sclerosis (MS) patients
  - Measurements before and at 1, 2, 4, 8, 24, 48, 120, and 168 hours after IFN-β treatment
• Synthetic data
  - Given the number of genes NG, samples NS, and coherent gene clusters NC
  - Simulate the pre-processing results
  - Embed NC maximal coherent gene clusters (G × S)


A Coherent Gene Cluster from Real Data


Effect of Parameters

Figure panels:
• Number of clusters vs. ming (mins = 3, δ = 0.8)
• Number of clusters vs. mins (ming = 10, δ = 0.8)
• Number of clusters vs. δ (ming = 10, mins = 3)


Scalability

Scalability of phase 1

Scalability w.r.t. number of genes (number of samples: 30)

Scalability w.r.t. number of samples (number of genes: 3,000)


Conclusion

We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data.

We propose two approaches: the sample-gene search and the gene-sample search.

We conduct an extensive empirical evaluation on both real and synthetic data sets.


Future Work

New problems from the gene-sample-time microarray data:
• Coherent sample clusters (G × S): for each s ∈ S, any pair of genes gi, gj ∈ G has coherent patterns.
• Coherent gene-sample clusters (G × S): both a coherent gene cluster and a coherent sample cluster.
