FUNCTIONAL GENOMICS COURSE 26.5.2006
Petri Pehkonen
Laboratory of Functional Genomics and Bioinformatics
Department of Neurobiology
A.I.Virtanen Institute for Molecular Sciences
University of Kuopio
[ I N T R O D U C T I O N ] Gene lists from microarrays
2 X UP
AA407331 BG062929 BG062930 BG062931 BG062932 AA407367 BG062933 BG062934 AA407377 BG062935 BG062936 BG062937 BG063015 BG063016 BG063017 BG063018
2 X DOWN
BG069315 BG082333 BG082334 BG069318 BG069319 BG069320 BG069321 BG069322 AU018797 BG069323 BG069324 BG069325 BG069326 BG069327 BG069328 BG069406 BG069407 BG082348 AU018835 BG069409 BG082350 BG082351 BG069412 BG069413
NON REGULATED
BG068248 AU022405 BG068249 BG068250 BG068251 BG068252 BG068253 BG068254 BG081291 BG081292 BG068257 BG068258 BG068259 BG068260 BG068261 BG068262 BG068263 BG068264 BG068265 AU022448 BG081301 BG068267 AU022455 BG068268 BG068269 BG068270 BG081306 BG068272 BG068273 BG068274 BG068275 BG068276 AU022477 BG068277 BG068278 BG068279 BG068280 BG068281 BG081317 BG068283 BG068284
Genes are compared between two or more samples
A t-test is used to detect dissimilar expression
Genes are ranked according to p-values
A p-value cut-off is set to select differentially expressed genes (up/down regulated)
Output: up, down and non-regulated gene lists
[Figure labels: mRNA, Hybridization, Dataset, T]
[ I N T R O D U C T I O N ] Gene associated data
Genes (G1, G2, G3, G4) can be associated with several kinds of elements:
Chromosomal regions (6p12, 6p13, 6p14, 6p15)
Functional classes (Cell cycle, Apoptosis, Neurogenesis, Cell death, ATPase activity)
Words from scientific literature ("is", "segregation", "protein", "antioxidant", "dopamine")
Transcription factor binding sites
[Figure legend: G = a gene; other symbol = element]
[ C H R O M O S O M A L A N A L Y S I S ] Motivation
Aim
Find the chromosomal regions where our genes are over-represented
Biological basis
Gene duplication during evolution
Genes located nearby and/or having similar regulatory elements may be regulated by the same factors and participate in the same biological process
[ C H R O M O S O M A L A N A L Y S I S ] Data representation
We can have gene sets obtained from any kind of laboratory technology or in silico technique
Gene locations in a chromosome can be represented as ordered binomial data vectors
1 indicates that the gene is from the row's class, 0 indicates the opposite
Over-expressed 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0
Under-expressed 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0
Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82
Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[ C H R O M O S O M A L A N A L Y S I S ] First analysis
We can select a region and find the number of genes there
Total: 6 genes, Over-expressed: 4 genes, Under-expressed: 1 gene
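As a minimal sketch, the toy table can be held as plain Python lists and a region summarised by slicing. The window of ranks 5 to 10 is an assumption: it is the only 6-gene window in this toy data that reproduces the counts quoted here, and the names `over`, `under` and `count_region` are illustrative, not from the slides.

```python
# Toy data from the slide: one entry per gene, ordered by chromosomal location.
over = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
under = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0]
location_kbp = [11, 16, 20, 30, 34, 39, 45, 51, 53, 54, 56, 60, 65, 69, 72, 75, 80, 82]

def count_region(first_rank, last_rank):
    """Return (total, over-expressed, under-expressed) for a 1-based inclusive rank range."""
    lo, hi = first_rank - 1, last_rank
    return hi - lo, sum(over[lo:hi]), sum(under[lo:hi])

# Assumed window: ranks 5..10, i.e. genes located at 34-54 kbp.
total, n_over, n_under = count_region(5, 10)
print(total, n_over, n_under)  # 6 4 1
```

Genes that are neither over- nor under-expressed (0 in both rows) explain why the two counts do not add up to the total.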
[ C H R O M O S O M A L A N A L Y S I S ] Our first analysis
We can calculate the statistical significance of the region
Contingency table tests: Fisher's exact test, chi-squared test etc.
Contingency table test: determining if there is a difference between two proportions
We can now test whether there is a difference between the selected chromosomal region and the whole genome IN PROPORTIONS OF OVER-EXPRESSED GENES
2x2 contingency table for over-expressed vs. other genes:
            Over-expr.   Other   Sums
Selected    4            2       6
Other       4            8       12
Sums        8            10      18
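The 2x2 table can be tested with a one-sided Fisher's exact test. The sketch below uses only the standard library; the function name `fisher_upper_tail` is illustrative, and the choice of the upper tail (over-representation in the selected region) is an assumption about which alternative is of interest.

```python
from math import comb

def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of seeing a or more in the top-left cell with fixed margins."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    return sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1)) / comb(total, row1)

# Selected region: 4 over-expressed, 2 other; rest of the chromosome: 4 and 8.
p = fisher_upper_tail(4, 2, 4, 8)
print(round(p, 4))  # 0.2014
```

With only 18 genes the enrichment of the selected region is not significant at the 0.05 level.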
[ C H R O M O S O M A L A N A L Y S I S ] Hypergeometric distribution
Regulated genes can be seen as a sample taken one by one without replacement from a population of all genes tested in the experiment
The hypergeometric distribution describes how random this kind of sample is for one associated class
[Figure: genes (regulated genes and other genes) against classes C1, C2, C3]
[Figure: Hypergeometric Probability Density Function (HygePDF) and Hypergeometric Distribution Function (HygeCDF); x-axis: number of genes that were associated to class C in the regulated list; y-axis: hypergeometric probability]
[ C H R O M O S O M A L A N A L Y S I S ] Hypergeometric probability
The hypergeometric probability f answers whether class C is randomly distributed between regulated and other genes
For calculating f for class C, we need the sizes of the regulated gene list (N) and the array (M), and the numbers of class C associated genes in the gene list (x) and in the array (n)
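With these symbols (M = array size, N = regulated list size, n = class C genes on the array, x = class C genes in the list), the hypergeometric probability can be written in its textbook form. The formula is the standard one and is given here as an aid; it is not copied from the slide's figure.

```latex
f(x; M, N, n) = \frac{\binom{n}{x}\,\binom{M-n}{N-x}}{\binom{M}{N}}
```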
[Figure: hypergeometric probability f against x, the number of genes that were associated to class C in the regulated list]
Low probability means non-random distribution => the class is either over- or under-represented
Figures: class C1 is over-represented and C3 under-represented in the gene list. How about C2?
[Figure: hypergeometric probability for classes C1, C2 and C3; x-axis: number of genes that were associated to class C in the regulated list]
[ C H R O M O S O M A L A N A L Y S I S ] Fisher's exact test
The hypergeometric probability f(x,M,N,n) measures the probability of detecting exactly x genes that are associated to the class C in a random sample
It's more natural to ask: what is the probability of detecting x or more class associated genes in a random sample?
Fisher's exact test F answers this by summing the tail of HygePDF
FISHER'S TEST FOR CLASS C2:
F(x=36, M=1300, N=400, n=95) = 0.1916
C2 DOES NOT SHOW STATISTICALLY SIGNIFICANT OVER-REPRESENTATION (WITH A SIGNIFICANCE LEVEL α=0.05)
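Summing the tail can be written out explicitly. The expression below assumes the upper tail ("x or more") and the same symbols as the slide: M = array size, N = regulated list size, n = class C genes on the array, x = class C genes in the list.

```latex
F(x; M, N, n) = \sum_{i=x}^{\min(N,\,n)} \frac{\binom{n}{i}\,\binom{M-n}{N-i}}{\binom{M}{N}}
```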
[Figure: hypergeometric probability for class C2 with x marked; x-axis: number of genes that were associated to class C in the regulated list]
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
We want to analyse the chromosome along its whole length
We can discretize the chromosome into equal sized regions
based on number of genes or
based on physical locations of genes
Then we can calculate the significance of each region separately
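The discretization step can be sketched on the toy over-expression vector. This version bins by number of genes (6 per bin, an arbitrary choice) and applies an upper-tail hypergeometric test to each bin; the function names are illustrative, not from the slides.

```python
from math import comb

# Over-expression flags of the 18 toy genes, ordered by chromosomal location.
over = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

def hyge_upper_tail(x, M, N, n):
    """P(X >= x) when n genes are drawn from M genes of which N are over-expressed."""
    return sum(comb(N, i) * comb(M - N, n - i)
               for i in range(x, min(N, n) + 1)) / comb(M, n)

def binned_pvalues(flags, bin_size):
    """Split the ordered gene vector into consecutive bins and test each bin."""
    M, N = len(flags), sum(flags)
    results = []
    for start in range(0, M, bin_size):
        region = flags[start:start + bin_size]
        x = sum(region)
        results.append((start + 1, start + len(region), x,
                        hyge_upper_tail(x, M, N, len(region))))
    return results

for first, last, x, p in binned_pvalues(over, 6):
    print(f"ranks {first}-{last}: {x} over-expressed, p = {p:.4f}")
```

Binning by physical location would only change how `region` is selected; the test itself stays the same.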
[ C H R O M O S O M A L A N A L Y S I S ] Statistics applied
We wanted to see if there exist over-represented chromosome regions in the down-regulated gene list obtained from a C. elegans strain comparison
Chromosomes were split into 250 kbp segments (themes)
Fisher's test from the hypergeometric distribution was used to calculate over-representation
[Figure: genes G (G1, G2, G3, G4) assigned to regions T, chromosomal regions as themes: 250-500 kbp, 500-750 kbp, 750-1000 kbp, 1000-1250 kbp]
FISHER'S EXACT TEST
F(x, M, N, n) where:
x = Down-regulated genes in region T
M = All chip genes in chromosome
N = All down-regulated genes
n = All genes in region T
GENE LIST: DOWN REGULATED BETWEEN C. ELEGANS HAWAIIAN VS. N2 STRAINS
[Figure: significance along chromosomes CHR I, CHR II, CHR III, CHR IV, CHR V and CHR X; x-axis: position in Mbp; y-axis: significance]
Resulting p-values were transformed into negative base-10 logarithmic form, where p<0.05 corresponds to -log p > 1.3
Some regions were found where genes were highly over-represented
Kaja Reisner, Petri Pehkonen, Garry Wong
FISHER'S EXACT TEST EXAMPLE
For region T(12750...13000 kbp):
x = Down-regulated genes in region T = 14
M = All chip genes in chromosome = 3407
N = All down-regulated genes = 92
n = All genes in region T = 59
P(x,M,N,n) ~ 0.0000000001984
-log P ~ 9.73
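The example can be recomputed with exact integer arithmetic. This is a sketch assuming the upper-tail convention ("x or more"); the function name is illustrative, and the printed value is meant for comparison with the figures quoted above rather than as a replacement for them.

```python
from math import comb, log10

def fisher_upper_tail(x, M, N, n):
    """P(X >= x): x down-regulated genes among the n genes of a region, when the
    chromosome has M chip genes of which N are down-regulated."""
    return sum(comb(N, i) * comb(M - N, n - i)
               for i in range(x, min(N, n) + 1)) / comb(M, n)

p = fisher_upper_tail(x=14, M=3407, N=92, n=59)
print(p, -log10(p))
```

The expected number of down-regulated genes in this region under random placement is 59 * 92 / 3407, roughly 1.6, so observing 14 gives a very small p-value.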
[ C H R O M O S O M A L A N A L Y S I S ] Analysis of a chromosome
What is the shortcoming of the previous approach?
By discretization we can split 'good' regions, so that we may not find them
Total: 4 genes, Over-expressed: 3 genes, Under-expressed: 1 gene
Our bin size can also be too large to detect some regions and too small to detect others
Total: 8 genes, Over-expressed: 4 genes, Under-expressed: 4 genes
Solution (partial):
Sliding window technique
Simply sliding a bin through the chromosome
Calculating significance in each position
Window size can be defined as a number of genes or as a physical region size
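A sliding window over the toy data can be sketched as follows. The window is defined as a number of genes (6, an arbitrary choice) and moved one gene at a time; the names `sliding_window` and `hyge_upper_tail` are illustrative.

```python
from math import comb

# Over-expression flags of the 18 toy genes, ordered by chromosomal location.
over = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

def hyge_upper_tail(x, M, N, n):
    """P(X >= x) when n genes are drawn from M genes of which N are over-expressed."""
    return sum(comb(N, i) * comb(M - N, n - i)
               for i in range(x, min(N, n) + 1)) / comb(M, n)

def sliding_window(flags, window):
    """Return (first rank, over-expressed count, p-value) for every window position."""
    M, N = len(flags), sum(flags)
    out = []
    for start in range(M - window + 1):
        x = sum(flags[start:start + window])
        out.append((start + 1, x, hyge_upper_tail(x, M, N, window)))
    return out

scan = sliding_window(over, 6)
best = min(scan, key=lambda row: row[2])  # first window with the smallest p-value
print(best)
```

Neighbouring windows overlap, so their p-values are not independent tests; the scan is best read as a profile along the chromosome.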
Good sides
Now we do not split any region into halves
Rather we gather information in every possible position
Shortcomings
The user has to select a constant window size
We can still miss 'good' regions that are a lot smaller or larger than the window
Our solution (Pehkonen, Törönen and Wong)
Hierarchical segmentation analysis of chromosome
Possibility to find different sizes of 'significant' regions in different locations of the chromosome
Ordered set of data includes signal rich regions among noise
Data can be partitioned into segments in order to separate signal from noise, and patterns from other patterns
Existing method: recursive segmentation
Existing applications: DNA segmentation, image analysis etc.
[Figure: ordered set of data points as a binary image (rows: classes of data), showing a signal rich region, a noisy region and a segmentation split]
[ S E G M E N T A T I O N ] Background
Recursive segmentation
Used with e.g. a maximum likelihood ratio test to decide whether to proceed in splitting
Shortcomings:
Unsatisfactory definition of the stopping criterion
The algorithm is often unable to detect the global optimum
The ML model is based on classical probability theory and does not take uncertainty into account
[ S E G M E N T A T I O N ] Existing methods
Iterative algorithm instead of recursive
Proceeds into the split that most increases the global segmentation score
Reasonable visualization of the result with a dendrogram
Facilitates observation of changes in segmentation score between local maxima and minima, and global maxima
Bayesian segmentation score
Dirichlet multinomial model
Takes uncertainty into account
Improved model selection criterion for finding the best segmentation level:
segmentation_score = F(p; x) - \log \binom{N-1}{k-1}
where
N = data size
k = cluster amount
C = class amount
p_{ij} = probability of class i in cluster j
c_j = size of cluster j
prior of class i, sum of priors (symbols lost in extraction)
[Equation: F(p; x), a double sum over clusters j = 1..k and classes i = 1..C of log terms in x_{ij}, c_j and the priors; the exact form is not recoverable from this transcript]
[ S E G M E N T A T I O N ] Our improvements
Improved model selection criterion for finding the best segmentation level
There are N data points in the data
=> There are N-1 positions between them
There are k clusters in the model
=> There are k-1 edges (splits)
In how many ways can k-1 edges be positioned into the N-1 positions between data points?
Corresponds to an MDL-based model selection criterion for detecting the clustering level
segmentation_score = F(p; x) - \log \binom{N-1}{k-1}
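The iterative procedure and the penalty term can be sketched together. This is an illustration under stated assumptions, not the authors' implementation: the fit term is taken to be the standard Dirichlet-multinomial log marginal likelihood with a symmetric prior, the search is a greedy top-down one that always takes the split with the largest gain, and all function names are hypothetical. Only the penalty log C(N-1, k-1) comes directly from the slides.

```python
from math import comb, lgamma, log

def dm_log_marginal(counts, alpha):
    """Log marginal likelihood of class counts under a Dirichlet-multinomial model."""
    score = lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(counts))
    for c, a in zip(counts, alpha):
        score += lgamma(c + a) - lgamma(a)
    return score

def segment_score(labels, start, end, alpha):
    """Fit of one segment labels[start:end]."""
    counts = [0] * len(alpha)
    for lab in labels[start:end]:
        counts[lab] += 1
    return dm_log_marginal(counts, alpha)

def greedy_segmentation(labels, n_classes, max_segments, prior=1.0):
    """Split greedily up to max_segments and return the segment boundaries of the
    level with the highest penalised score: fit - log C(N-1, k-1)."""
    alpha = [prior] * n_classes
    n = len(labels)
    bounds = [0, n]
    fit = segment_score(labels, 0, n, alpha)
    history = [(list(bounds), fit)]  # k = 1: penalty is log C(N-1, 0) = 0
    while len(bounds) - 1 < max_segments:
        best = None
        for left, right in zip(bounds, bounds[1:]):
            old = segment_score(labels, left, right, alpha)
            for cut in range(left + 1, right):
                gain = (segment_score(labels, left, cut, alpha)
                        + segment_score(labels, cut, right, alpha) - old)
                if best is None or gain > best[0]:
                    best = (gain, cut)
        if best is None:  # every segment has length 1, nothing left to split
            break
        fit += best[0]
        bounds = sorted(bounds + [best[1]])
        k = len(bounds) - 1
        history.append((list(bounds), fit - log(comb(n - 1, k - 1))))
    return max(history, key=lambda item: item[1])[0]

# Hypothetical ordered data: a block of class 1 between two blocks of class 0.
labels = [0] * 10 + [1] * 10 + [0] * 10
print(greedy_segmentation(labels, n_classes=2, max_segments=5))  # [0, 10, 20, 30]
```

The search runs past levels where the penalised score drops and picks the global maximum afterwards, which mirrors the point about local and global maxima made on the earlier slide.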
Simulated data creator
Creates artificial data with random clusters and noise from a given model
Evaluation of segmentation methods
How closely the clustering result corresponds to the given original model
Kullback-Leibler distance, Jensen-Shannon divergence, Mutual Information
[ S E G M E N T A T I O N ] Evaluation of our method
Comparison of given model vs. created clustering result with KL-divergence
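The divergences used for the comparison can be sketched on a pair of class distributions. The two example distributions are hypothetical, and how the slides aggregate the divergence over segments and datasets is not stated, so this only shows the basic quantities.

```python
from math import log

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.
    Terms with p_i = 0 contribute 0; q_i must be positive wherever p_i is."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

model = [0.7, 0.2, 0.1]      # hypothetical class distribution of a generating segment
estimate = [0.6, 0.3, 0.1]   # hypothetical distribution recovered by a segmentation
print(kl_divergence(model, estimate), js_divergence(model, estimate))
```

KL divergence is not symmetric, so the order of the arguments (model first or clustering first) has to be fixed when methods are compared.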
Multinomial data with 5 classes:
[Figure: clustering vs. model distance (KL divergence) on simulated artificial datasets for AIC, BIC, BIC2 and OWN; y-axis from 0 to 600]
Multinomial data with 30 classes:
[Figure: clustering vs. model distance (KL divergence) on simulated artificial datasets for AIC, BIC, BIC2 and OWN; y-axis from 0 to 3500]
CATALIST SOFTWARE
Two main parts
Method testing: creation of artificial datasets and evaluation of segmentation methods
Analysis of biological datasets: import of microarray data or gene lists from files
[ C A T A L I S T ] Results visualization
Draws a dendrogram that shows the progress of the segmentation
The dendrogram reveals hierarchical localization of significant regions
Detailed information on genes by clicking found regions
C. elegans strain comparison
See three clusters found in chromosome 4
[Figure: CHROMOSOME IV]
[ L O C A T I O N A N A L Y S I S ] Results
Yeast cell cycle genes were clustered according to gene expression data
K-means clustering was performed with 2..10 clusters
Each clustering into k groups was encoded as a multinomial data vector with k classes
Clustering to 2 1 0 2 2 2 2 0 0 0 0 0 1 1 1 0 0 2 0
Clustering to 3 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0
Clustering to 4 4 4 3 3 3 3 4 4 4 4 4 3 1 1 2 2 4 4
Clustering to 5 5 5 4 5
Clustering to 6 1 0 4 4 4 4 1 1 1 1 1 0 0 0 1 0 1 0
...... 1 1 7 7 7 7 1 2 2 1 2 1 1 1 0 6 0 0
Chr location (kbp) 11 16 20 30 34 39 45 51 53 54 56 60 65 69 72 75 80 82
Rank order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Yeast cell cycle gene expression clusters, chromosome 4: [Figure]
Result of hierarchical segmentation: [Figure]
Real results (red) compared to segmentation of several randomized data sets (blue): [Figure]
Nuclear receptors gene list: [Figure]