Clustering made human: US UGM 2008
DESCRIPTION
Clustering chemical structures alleviates the tedious task of browsing a large set of compounds by grouping individual structures into generic categories. ChemAxon's JKlustor product offers clustering solutions ranging from a similarity-based non-hierarchical method to a purely graph-based technique. The latter has some clear advantages over the more conventional approaches: clusters are more likely to meet human expectations, and a tangible explanation of why certain compounds are grouped together is also produced. It is even faster. If you 'farm your classes' then it's time to 'MCS your library'! Latest developments are here: http://www.chemaxon.com/product/jklustor.html
TRANSCRIPT
Solutions for Cheminformatics
Clustering made human
Miklos Vargyas
3
Cluster in computing
Computer cluster
4
Cluster in Chemistry
Transition metal carbonyl clusters
Transition metal halide clusters
Boron hydrides
Gas-phase clusters and fullerenes
Dimanganese-decacarbonyl
Di-tungsten tetra(hpp)
5
Cluster in Chemistry/Physics
Nanoscale particles
• Fullerenes
• Nano machines
Images produced by MarvinSpace
6
Star cluster
gravitationally bound groups of stars
Image from Wikipedia, the free encyclopedia
7
Clustering cars
Live demonstration
Group by property
• Shape, size, type, brand, colour
• Many possible arrangements, multiple aspects
Group by similarity
• Categorical perception
8
Why is clustering stars easy?
God did the job for us!
• Stars have an apparent spatial arrangement
• Distance between stars defines clusters
9
Why is clustering cars hard?
Lack of innate spatial arrangement
• Artificial arrangement
• Various approaches, no superior one
• “Cars come in all shapes and sizes”
Problem of dimensionality
• Why 2?!
10
So what about molecules?
Are they like stars or rather like cars?
• They come in all shapes and sizes
• Vast number of properties
Chemical spaces
• Select molecular properties
• Estimate or measure them
• Use them as coordinates
• Place your molecules as points in this abstract space
• Group those that are close to each other to form clusters
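The chemical-space recipe above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not JKlustor's method: the property values are made-up numbers, and `cluster_by_distance` and its `cutoff` parameter are hypothetical names. Points closer than the cutoff are linked, and each connected group becomes a cluster.

```python
from math import dist

# Hypothetical 2D coordinates (two scaled properties per molecule).
# The values are made up purely for illustration.
points = {
    "mol_a": (1.0, 0.2),
    "mol_b": (1.1, 0.3),
    "mol_c": (4.0, 1.5),
    "mol_d": (4.2, 1.4),
    "mol_e": (8.0, 0.1),
}

def cluster_by_distance(points, cutoff):
    """Link points closer than `cutoff`; return connected groups as sorted lists."""
    names = sorted(points)
    unvisited = set(names)
    clusters = []
    for seed in names:
        if seed not in unvisited:
            continue
        unvisited.discard(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect still-unassigned points within the cutoff of the current point.
            near = [n for n in unvisited
                    if dist(points[current], points[n]) < cutoff]
            for n in near:
                unvisited.discard(n)
                group.append(n)
                frontier.append(n)
        clusters.append(sorted(group))
    return clusters

clusters = cluster_by_distance(points, cutoff=0.5)
print(clusters)
```

The outcome depends entirely on which properties are chosen as coordinates, how they are scaled, and the cutoff, which is the artificial-arrangement problem raised for cars.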
11
Example in 2D
12
Further attempts in 2D
Figure: two scatter plots of TPSA (0 to 300), one against mass (0 to 1000) and one against logP (-2 to 12)
13
Molecule clusters by similarity
Jarvis-Patrick clustering
• Fast
• Tanimoto similarity
• Globular clusters
• Tendency to create a large number of singletons
• Molecular properties & fingerprint
jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -y -z -o SC1000.jarp.t0.6.c0.1 -g
Number of objects = 999
Number of clusters (without singletons) = 2
Number of singletons = 8
Average dissimilarity = 0.66208726
Minimum dissimilarity = 0.0
Maximum dissimilarity = 0.9411765
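To make the ingredients of this slide concrete, here is a minimal Python sketch of Tanimoto similarity and of the classic Jarvis-Patrick rule. It is an illustration under assumptions, not the `jarp` implementation, and it does not model the `jarp` options shown above: fingerprints are represented as sets of on-bit indices, the toy fingerprints are invented, and `k` (neighbour list length) and `k_min` (required shared neighbours) are the textbook parameters.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jarvis_patrick(fps, k, k_min):
    """Classic Jarvis-Patrick: join i and j when each is among the other's
    k nearest neighbours and they share at least k_min of those neighbours."""
    n = len(fps)
    neighbours = []
    for i in range(n):
        # Most similar first; ties are broken by index to stay deterministic.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (-tanimoto(fps[i], fps[j]), j))
        neighbours.append(set(others[:k]))

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = i in neighbours[j] and j in neighbours[i]
        if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Invented toy fingerprints: two families plus one outlier.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},
    {10, 11, 12}, {10, 11, 13}, {10, 12, 13},
    {20, 21},
]
clusters = jarvis_patrick(fps, k=2, k_min=1)
print(clusters)
```

The outlier fingerprint has nearest neighbours of its own, but none of them lists it in return, so it stays a singleton. That is the tendency noted in the bullets above.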
14
Parameter tuning
t    c    Clusters  Singletons
0.6  0.1  2         8
0.3  0.1  179       248
0.5  0.1  7         36
15
The most populated cluster
16
Parameter tuning
t    c    Clusters  Singletons
0.6  0.1  2         8
0.3  0.1  179       248
0.5  0.1  7         36
0.5  0.5  10        37
0.5  0.8  81        115
17
Another cluster
18
So what’s wrong with that?
Problems:
• Manual tuning
• Lack of interpretability
Need:
• Automated (unsupervised) techniques
• Easy-to-grasp, simple-to-understand “explanations”
One possible solution: MCS-based clustering
19
Maximum Common Substructure
Largest substructure shared by two molecules
MCS
Simple concept! More human, visual.
Yet hard (= expensive (= slow)) to compute.
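A brute-force Python sketch shows both why the concept is simple and why it is expensive. It is a toy under stated assumptions, not the algorithm used in JKlustor: molecules are reduced to heavy-atom graphs (element labels plus bonds), bond orders and hydrogens are ignored, and `mcs_mapping` is a hypothetical helper name. It searches for the largest connected set of atoms that can be matched between two molecules with identical elements and bonding pattern.

```python
def _adjacency(n, bonds):
    """Build neighbour sets for n atoms from a list of (i, j) bonds."""
    adj = [set() for _ in range(n)]
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj

def mcs_mapping(atoms_a, bonds_a, atoms_b, bonds_b):
    """Exhaustively search for the largest connected common substructure.
    Returns a dict mapping atom indices of A to atom indices of B."""
    adj_a = _adjacency(len(atoms_a), bonds_a)
    adj_b = _adjacency(len(atoms_b), bonds_b)
    best = {}

    def extend(mapping):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        used_b = set(mapping.values())
        # Only grow through atoms bonded to the mapped part (keeps it connected).
        frontier = {n for a in mapping for n in adj_a[a]} - set(mapping)
        for a in frontier:
            for b in range(len(atoms_b)):
                if b in used_b or atoms_a[a] != atoms_b[b]:
                    continue
                # Bonds to already-mapped atoms must agree in both molecules.
                if all((m in adj_a[a]) == (mapping[m] in adj_b[b])
                       for m in mapping):
                    mapping[a] = b
                    extend(mapping)
                    del mapping[a]

    for a in range(len(atoms_a)):
        for b in range(len(atoms_b)):
            if atoms_a[a] == atoms_b[b]:
                extend({a: b})
    return best

# Heavy-atom skeletons (bond orders ignored): C-C-O, C-C-N, C-C(-O)-O.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ethylamine = (["C", "C", "N"], [(0, 1), (1, 2)])
acetic_acid = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])

size_amine = len(mcs_mapping(*ethanol, *ethylamine))
size_acid = len(mcs_mapping(*ethanol, *acetic_acid))
print(size_amine, size_acid)
```

The search tries every consistent way of growing an atom mapping, so its cost rises steeply with molecule size. This is the "hard to compute" part of the slide, and practical tools rely on much smarter search strategies.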
20
MCS of a structure set
21
Hierarchical star clusters
star
22
Hierarchical star clusters
star cluster
• star
23
Hierarchical star clusters
galaxy
• star cluster
– star
24
Hierarchical star clusters
local group
• galaxy
  – star cluster
    star
25
Hierarchical star clusters
supercluster
• cluster
  – local group
    galaxy
      » star cluster
26
Visualisation of hierarchy
Dendrogram
27
Hierarchical MCS
28
Intuitive visualisation
29
SAR table view
30
R-group deconvolution
31
Speed-up achieved last year
Chart: running time (sec) versus structure count (0 to 35,000); series: 2006, 2007, and a linear trend line for 2007
Presented at UGM’07
32
Speed-up achieved this year
Chart: running time (sec) versus structure count (0 to 35,000); series: 2006, 2007, 2008
33
Speed-up this year
Chart: running time (sec, logarithmic scale from 0.1 to 10,000) versus structure count (0 to 35,000); series: 2006, 2007, 2008
34
Clustering performance comparison
Chart: running time (min) versus structure count (0 to 120,000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh
35
Find out more
Product descriptions & links
www.chemaxon.com/products.html
Forum
www.chemaxon.com/forum
Presentations and posters
www.chemaxon.com/conf
Download
www.chemaxon.com/download.html