pattern-based clustering
University at Buffalo The State University of New York
Pattern-based Clustering
How to cluster the five objects?
Hard to define a global similarity measure.
What Is Pattern-based Clustering?
A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al., 2002)
Challenges
• Most clustering approaches do not address the temporal variations in time-series gene expression data, an important feature that affects clustering performance.
• Previous approaches try to find coherent patterns and clusters w.r.t. the entire set of attributes.
• Patterns may be embedded in sub-attribute spaces:
  - Only a subset of genes participates in any cellular process of interest.
  - Any cellular process may take place in only a subset of the experimental conditions.
[Figure: (a) raw data, (b) shifting patterns, (c) scaling patterns]
Gene-Sample-Time (GST) Microarray Data
[Figure: 2D time-series data and 3D gene-sample-time data (a collection of samples)]
• The GST microarray data consist of three dimensions.
• The samples often exhibit various phenotypes, e.g., cancer vs. control.
Challenges of Mining GST Data
Challenges       2D data                  3D data
Mining process   Partition genes          Partition genes and samples simultaneously
Cluster model    Two types of variables   Three types of variables
Most clustering algorithms were designed for 2D data and cannot be directly extended to 3D data.
Coherent Gene Cluster
• The group of samples (s_j1, s_j2, s_j3) may exhibit the same phenotype.
• The group of genes (g_i1, g_i2, g_i3) may be strongly correlated with the phenotype shared by (s_j1, s_j2, s_j3).
[Figure: a 3D GST data set, its 2D representation, and a coherent gene cluster]
Results from a Real Data Set
The Multiple Sclerosis (MS) data consist of:
• 4324 genes
• 13 MS patients
• 10 time points before and after IFN-β treatment
25 coherent gene clusters were reported.
[Figure: an example of a coherent gene cluster (107 genes, 8 samples), one panel per sample, Sample A to Sample H]
Other Types of Coherent Clusters
Problem Definition
Given a GST microarray data matrix M, a maximal coherent gene cluster C = (G × S) is a combination of a subset of genes G and a subset of samples S such that:
• Coherent: the genes in G are coherent across the samples in S;
• Significant: |G| ≥ min_g and |S| ≥ min_s, where min_g and min_s are user-specified parameters;
• Maximal: inserting any gene g ∉ G or any sample s ∉ S makes C no longer coherent.
The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.
Coherence Measure
• Various coherence measures exist; the choice of measure is application dependent.
• A general coherence model: given a coherence measure sim(•) and a user-specified threshold δ,
  - a gene g_a is coherent on samples s_i and s_j if sim(p_ai, p_aj) ≥ δ;
  - coherent gene matrix (G1, S1): every gene g_i ∈ G1 is coherent across the samples in S1;
  - trivial coherent gene matrices: ({g_i}, {s_j}) and (G, {s_j}).
• We choose Pearson's correlation coefficient. Other coherence measures are also applicable.
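As a rough illustration of the general coherence model, the Python sketch below tests whether one gene is coherent on two samples using Pearson's correlation coefficient. It assumes that p_ai is the time-series profile of gene g_a in sample s_i, which the slide does not spell out; the function names, the toy profiles, and the threshold value 0.8 are illustrative and not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def is_coherent(profile_i, profile_j, delta):
    """True if the gene's profiles on two samples satisfy sim(p_ai, p_aj) >= delta."""
    return pearson(profile_i, profile_j) >= delta

# Toy profiles of one gene on different samples (hypothetical values).
base = [1.0, 3.0, 2.0, 5.0]
shifted = [v + 10.0 for v in base]   # shifting pattern
scaled = [v * 3.0 for v in base]     # scaling pattern
inverted_profile = [-v for v in base]

shift_ok = is_coherent(base, shifted, 0.8)
scale_ok = is_coherent(base, scaled, 0.8)
inverse_ok = is_coherent(base, inverted_profile, 0.8)
```

Because Pearson's correlation is unchanged by adding a constant or multiplying by a positive constant, both the shifted and the scaled profiles count as coherent with the base profile, while the sign-flipped one does not.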
Related Work
• Clustering algorithms on gene-sample or gene-time microarray data
  - The cluster model is completely different.
• Subspace clustering
  - Finds subsets of objects coherent with subsets of attributes.
• Frequent pattern mining
  - Finds subsets of items that frequently appear in transaction databases.
Algorithm Outline
Phase 1 (pre-processing): For each gene g, find the complete set of maximal coherent sample sets of g.
Phase 2: Compute the complete set of maximal coherent gene clusters based on pre-processing results.
Coherent Sample Sets
Given a gene g, a maximal coherent sample set of g is a subset of samples S_i such that:
• Coherent: g is coherent across S_i;
• Significant: |S_i| ≥ min_s;
• Maximal: there exists no superset S' ⊃ S_i such that g is also coherent across S'.
(g × S_i) is a building block for the coherent gene clusters that include g.
Preprocessing Phase
     s1  s2  s3  s4  s5  s6
s1    1   1   0   1   0   0
s2    1   1   0   0   0   0
s3    0   0   1   1   1   1
s4    1   0   1   1   1   1
s5    0   0   1   1   1   1
s6    0   0   1   1   1   1
The coherence matrix of gene g
[Figure: the coherence graph of gene g over samples s1 to s6]
Suppose min_s = 3. Then {s3, s4, s5, s6} is a coherent sample set of gene g: the four samples are pairwise coherent, and 4 ≥ min_s.
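One possible way to carry out this pre-processing step is to enumerate the maximal cliques of the coherence graph and keep those with at least min_s samples. The sketch below uses the basic Bron-Kerbosch recursion on the coherence matrix shown on this slide. The slides do not say which clique-finding method is used, so the algorithm choice and all function and variable names are assumptions.

```python
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
coherence_matrix = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

def maximal_coherent_sample_sets(matrix, names, min_s):
    """Return the maximal cliques of the coherence graph with >= min_s samples."""
    n = len(names)
    # Neighbours of each sample in the coherence graph (diagonal ignored).
    neighbours = {
        i: {j for j in range(n) if j != i and matrix[i][j] == 1}
        for i in range(n)
    }
    found = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when nothing can extend it (no candidates)
        # and it is not contained in a clique explored earlier (no excluded).
        if not candidates and not excluded:
            if len(clique) >= min_s:
                found.append([names[i] for i in sorted(clique)])
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & neighbours[v], excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return found

coherent_sets = maximal_coherent_sample_sets(coherence_matrix, samples, min_s=3)
```

With min_s = 3 the only set that survives is {s3, s4, s5, s6}; the smaller cliques {s1, s2} and {s1, s4} are maximal but not significant.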
Sample-gene Search
• Set enumeration tree
  - Enumerate all subsets of samples systematically.
  - Each node of the tree corresponds to a subset of samples.
• For each node S
  - Find the maximal set of genes G_S that is coherent with S.
Set Enumeration Tree
The set enumeration tree for {a,b,c,d}
{}
{a} {b} {c} {d}
{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}
{a,b,c} {a,b,d} {a,c,d} {b,c,d}
{a,b,c,d}
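A set enumeration tree can be walked depth-first by extending each node only with items that come after its last item in a fixed order. The following Python sketch shows this for {a,b,c,d}; the function name `enumerate_subsets` is illustrative and not part of the slides.

```python
def enumerate_subsets(items):
    """Yield every subset of items in depth-first set-enumeration-tree order."""
    def visit(prefix, start):
        yield prefix
        # Children of a node add one item that comes later in the order,
        # so every subset is generated exactly once.
        for k in range(start, len(items)):
            yield from visit(prefix + [items[k]], k + 1)
    yield from visit([], 0)

order = ["".join(subset) for subset in enumerate_subsets(["a", "b", "c", "d"])]
```

The walk starts at the empty set, descends through {a}, {a,b}, {a,b,c}, {a,b,c,d} before backtracking, and visits all 16 subsets.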
Find the Maximal Coherent Subset of Genes
After the pre-processing phase: given a subset of samples S, how do we find the maximal coherent set of genes G_S?
• Expensive approach: scan the table once for each S. G_S can be derived by a single scan of the maximal coherent sample sets of all genes: if S ⊆ S_j for a maximal coherent sample set S_j of gene g, then g ∈ G_S.
• Efficient approach: use the inverted list.
g1 {s1, s2, s3, s4, s5}
g2 {s1,s2,s4}, {s1,s5}
g3 {s1,s2,s3,s4,s5}
g4 {s1,s2,s3},{s5,s6}
g5 {s1,s5,s6}
The Inverted List
The table of maximal coherent sample sets for genes:
Gene   Maximal coherent sample sets
g1     {s1, s2, s3, s4, s5}
g2     {s1, s2, s4}, {s1, s5}
g3     {s1, s2, s3, s4, s5}
g4     {s1, s2, s3}, {s5, s6}
g5     {s1, s5, s6}
The table of inverted lists for samples:
Sample   The inverted list
s1       {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2       {g1.b1, g2.b1, g3.b1, g4.b1}
s3       {g1.b1, g3.b1, g4.b1}
s4       {g1.b1, g2.b1, g3.b1}
s5       {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6       {g4.b2, g5.b1}
Here gi.bj denotes the j-th maximal coherent sample set of gene gi, e.g., g2.b1 = {s1, s2, s4} and g2.b2 = {s1, s5}.
Intersection Instead of Scanning
• Given a subset of samples S = {s_i1, …, s_ik}, intersect the inverted lists of s_i1, …, s_ik.
• For example, given S = {s1, s2, s3}, L_s1 ∩ L_s2 ∩ L_s3 = {g1.b1, g3.b1, g4.b1}, so G_S = {g1, g3, g4}.
• Suppose the parent of S is S' = {s_i1, …, s_ik-1}; then L_S = L_S' ∩ L_s_ik.
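The intersection step can be sketched in Python as follows, using the table of maximal coherent sample sets from the slides. Each entry gi.bj is represented as a `(gene, block_index)` tuple; that representation and the names `build_inverted_lists` and `genes_coherent_with` are assumptions made for illustration.

```python
# Maximal coherent sample sets per gene, copied from the example table.
max_coherent_sample_sets = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}

def build_inverted_lists(table):
    """Map each sample to the set of (gene, block index) entries containing it."""
    inverted = {}
    for gene, blocks in table.items():
        for index, block in enumerate(blocks, start=1):
            for sample in block:
                inverted.setdefault(sample, set()).add((gene, index))
    return inverted

def genes_coherent_with(sample_subset, inverted):
    """Intersect the inverted lists of a non-empty sample subset S to get G_S."""
    lists = [inverted.get(sample, set()) for sample in sample_subset]
    common_blocks = set.intersection(*lists)
    return {gene for gene, _ in common_blocks}

inverted_lists = build_inverted_lists(max_coherent_sample_sets)
g_s = genes_coherent_with(["s1", "s2", "s3"], inverted_lists)
```

Intersecting per block rather than per gene matters for genes such as g2 and g4, whose two maximal coherent sample sets must not be mixed when testing a subset S.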
Anti-monotonic Property
• Given a combination (G × S), if G is not coherent on S, then for any superset S' ⊃ S, G cannot be coherent on S'.
• For any descendant S' of S on the tree, let G_S be the maximal coherent gene set of S and let G_S' be the maximal coherent gene set of S'. Since S' ⊃ S, we have G_S' ⊆ G_S.
Pruning Irrelevant Samples
Given a subset of samples S = {s_i1, …, s_ik}, a sample s_j ∈ tails if:
• j > i_k, and
• there exist at least min_g genes g such that g is coherent with S ∪ {s_j}.
Samples s_l ∉ tails (irrelevant samples) cannot be used to extend S.
Pruning Unpromising Nodes
Given a subset of samples S = {s_i1, …, s_ik}:
• if |S| + |tails| < min_s, then prune the subtree of S;
• let the maximal coherent subset of genes of S be G_S; if there exists (G' × S') such that (S ∪ tails) ⊆ S' and G_S ⊆ G', then prune the subtree of S.
Determination of Maximal Coherent Gene Clusters
• The depth-first search strategy: for any superset S' of S, either S' is visited before S, or S' is a descendant of S.
• To determine whether a coherent gene cluster (G_S × S) is maximal, check (G_S × S) after visiting all its children, and report (G_S × S) if it is not subsumed.
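Putting the pieces together, the depth-first sample-gene search with tails pruning and the post-order maximality check might look like the Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the inverted lists are copied from the example tables, gi.bj is encoded as a `(gene, block_index)` tuple, the thresholds min_g = 3 and min_s = 3 are chosen only for the demonstration, and the second rule for pruning unpromising nodes is left out for brevity.

```python
inverted = {
    "s1": {("g1", 1), ("g2", 1), ("g2", 2), ("g3", 1), ("g4", 1), ("g5", 1)},
    "s2": {("g1", 1), ("g2", 1), ("g3", 1), ("g4", 1)},
    "s3": {("g1", 1), ("g3", 1), ("g4", 1)},
    "s4": {("g1", 1), ("g2", 1), ("g3", 1)},
    "s5": {("g1", 1), ("g2", 2), ("g3", 1), ("g4", 2), ("g5", 1)},
    "s6": {("g4", 2), ("g5", 1)},
}

def genes_of(blocks):
    """Genes that own at least one block in an inverted-list intersection."""
    return {gene for gene, _ in blocks}

def sample_gene_search(inverted, samples, min_g, min_s):
    """Depth-first sample-gene search; returns a list of (genes, samples) pairs."""
    clusters = []

    def visit(S, L_S, start):
        # tails: later samples whose addition keeps at least min_g coherent genes.
        tails = []
        for k in range(start, len(samples)):
            L_k = inverted.get(samples[k], set())
            L_child = L_k if L_S is None else L_S & L_k
            if len(genes_of(L_child)) >= min_g:
                tails.append((k, L_child))
        # Prune: the subtree can never reach min_s samples.
        if len(S) + len(tails) < min_s:
            return
        for k, L_child in tails:
            visit(S + [samples[k]], L_child, k + 1)
        # Post-order check: every superset of S has been examined by now.
        if S and len(S) >= min_s:
            G_S = genes_of(L_S)
            subsumed = any(set(S) <= S2 and G_S <= G2 for G2, S2 in clusters)
            if len(G_S) >= min_g and not subsumed:
                clusters.append((G_S, set(S)))

    visit([], None, 0)  # None stands for "no samples intersected yet"
    return clusters

sample_order = ["s1", "s2", "s3", "s4", "s5", "s6"]
found = sample_gene_search(inverted, sample_order, min_g=3, min_s=3)
readable = [(sorted(G), sorted(S)) for G, S in found]
```

Because children are visited before their parent is checked, a cluster is compared only against clusters over supersets of its samples, which is what the maximality test requires.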
[Figure: part of the sample-gene search tree. Each node is shown as {S}{tails}; where given, the set under a node is the intersection of the inverted lists of its samples.]
Node S        tails            Inverted-list intersection
{ }
{s1}          {s2,s3,s4,s5}
{s2}          {s3,s4}
{s3}          {}
{s4}          {}
{s1,s2}       {s3,s4}          {g1.b1, g2.b1, g3.b1, g4.b1}
{s1,s3}       {}               {g1.b1, g3.b1, g4.b1}
{s1,s4}       {}               {g1.b1, g2.b1, g3.b1}
{s2,s3}       {}               {g1.b1, g3.b1, g4.b1}
{s2,s4}       {}               {g1.b1, g2.b1, g3.b1}
{s1,s2,s3}    {}               {g1.b1, g3.b1, g4.b1}
{s1,s2,s4}    {}               {g1.b1, g2.b1, g3.b1}
Sample   The inverted list
s1       {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2       {g1.b1, g2.b1, g3.b1, g4.b1}
s3       {g1.b1, g3.b1, g4.b1}
s4       {g1.b1, g2.b1, g3.b1}
s5       {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6       {g4.b2, g5.b1}
Mining Coherent Gene Clusters
• Systematic enumeration of genes and samples
  - Sample-gene search
  - Gene-sample search
• Pruning rules
• Determination of whether a coherent gene cluster (G × S) is maximal
Gene-sample Search
                                   Sample-gene search                     Gene-sample search
Subjects to enumerate              Samples                                Genes
Number of subjects to enumerate    10^1 to 10^2                           10^3 to 10^4
Coherent objects                   Single set of maximal coherent genes   Single or multiple sets of maximal coherent samples
Efficiency on GST data             High                                   Low
Experiment Data Sets
• Real-world gene expression data
  - 4324 genes
  - 13 multiple sclerosis (MS) patients, before and at 1, 2, 4, 8, 24, 48, 120 and 168 hours after IFN-β treatment
• Synthetic data
  - Given the number of genes NG, samples NS and coherent gene clusters NC
  - Simulate the pre-processing results
  - Embed NC maximal coherent gene clusters (G × S)
A Coherent Gene Cluster from Real Data
Effect of Parameters
[Figure: number of clusters vs. min_g (min_s = 3, δ = 0.8)]
[Figure: number of clusters vs. min_s (min_g = 10, δ = 0.8)]
[Figure: number of clusters vs. δ (min_g = 10, min_s = 3)]
Scalability
Scalability of phase 1
Scalability w.r.t. number of genes (number of samples: 30)
Scalability w.r.t. number of samples (number of genes: 3,000)
Conclusion
We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data.
We propose two approaches: the sample-gene search and the gene-sample search.
We conduct an extensive empirical evaluation on both real and synthetic data sets.
Future Work
New problems from the gene-sample-time microarray data:
• Coherent sample clusters (G × S): for each s ∈ S, any pair of genes g_i, g_j ∈ G has coherent patterns.
• Coherent gene-sample clusters (G × S): both a coherent gene cluster and a coherent sample cluster.