![Page 1: The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics](https://reader035.vdocuments.us/reader035/viewer/2022070305/55146505550346b0158b4add/html5/thumbnails/1.jpg)
The new JKlustor suite
Miklós Vargyas
Solutions for Cheminformatics
Why do we cluster?
• to reduce the number of objects to deal with
  – group subsets together
  – represent each group by one member of it
What’s the matter with that?
– tedious
  • parameter tuning in a trial-and-error fashion
– lack of interpretability
  • the algorithm does not provide an explanation
– results often do not meet chemists’ expectations
Why is clustering molecules hard?
• lack of innate spatial arrangement
  – artificial arrangement
    • infinite types of chemical spaces
    • various ‘distance metrics’
    • usually high dimensionality (hard to visualize)
  – various approaches, no superior one
    • “best method” depends on application area, and on actual data
What do we need?
• little or no tuning
• simple, easy-to-understand “explanation”
• novel approach
  – structure-based clustering
  – Maximum Common Substructure
  – Molecular frameworks
Maximum Common Substructure
• largest substructure shared by two molecules
• Simple concept! More “human”, visual.
• Yet hard (= expensive (= slow)) to compute.
MCS complexity
• Substructure searching
  – query structure is known, it “only” has to be found as part of the target structure (subgraph isomorphism)
  – graph isomorphism is even “simpler”, yet no polynomial-time algorithm is known for it; subgraph isomorphism is NP-complete
    • finding the answer can take long in the worst case (scales exponentially with the number of graph vertices)
    • validating an answer is fast
• MCS
  – “query” structure is not known
  – all possible substructures need to be checked
    • even the number of substructures is exponential!
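To make the exponential cost concrete, here is a minimal brute-force sketch in Python. It is not the LibraryMCS algorithm; it simply enumerates atom subsets of one molecule and tries every assignment onto the other. The graph encoding (a list of element labels plus a set of `frozenset` bonds) and the names `bonds`, `is_connected` and `brute_force_mcs` are assumptions made for this illustration, and bond orders are ignored.

```python
from itertools import combinations, permutations


def bonds(*pairs):
    """Hypothetical helper: build an undirected bond set from index pairs."""
    return {frozenset(p) for p in pairs}


def is_connected(atoms, bond_set):
    """True if the given atoms form a single connected piece."""
    atoms = list(atoms)
    seen, stack = {atoms[0]}, [atoms[0]]
    while stack:
        a = stack.pop()
        for b in atoms:
            if b not in seen and frozenset((a, b)) in bond_set:
                seen.add(b)
                stack.append(b)
    return len(seen) == len(atoms)


def brute_force_mcs(labels_a, bonds_a, labels_b, bonds_b):
    """Return (size, mapping) of a largest connected common induced subgraph.

    Tries subsets of A from largest to smallest, so the first hit is maximal.
    Both the subset enumeration and the permutations are exponential.
    """
    for size in range(min(len(labels_a), len(labels_b)), 0, -1):
        for sub_a in combinations(range(len(labels_a)), size):
            if not is_connected(sub_a, bonds_a):
                continue
            for sub_b in permutations(range(len(labels_b)), size):
                same_atoms = all(
                    labels_a[i] == labels_b[j] for i, j in zip(sub_a, sub_b)
                )
                same_bonds = same_atoms and all(
                    (frozenset((sub_a[p], sub_a[q])) in bonds_a)
                    == (frozenset((sub_b[p], sub_b[q])) in bonds_b)
                    for p in range(size)
                    for q in range(p + 1, size)
                )
                if same_bonds:
                    return size, dict(zip(sub_a, sub_b))
    return 0, {}


# Heavy-atom graphs: ethanol (C-C-O) against 1-propanol (C-C-C-O).
ethanol = (["C", "C", "O"], bonds((0, 1), (1, 2)))
propanol = (["C", "C", "C", "O"], bonds((0, 1), (1, 2), (2, 3)))
size, mapping = brute_force_mcs(*ethanol, *propanol)
print(size, mapping)
```

This is only usable for a handful of atoms, which is the point of the slide: practical MCS tools rely on pruning and heuristics rather than full enumeration.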
MCS algorithms
• two camps

| backtracking | clique detection |
| --- | --- |
| ad hoc | high mathematical elegance |
| average complexity is better than worst case | average complexity is same as worst case |
| dynamic heuristics | static (initial) heuristics |
| coloring is easy | coloring is hard |
| fuzzy matching | fussy matching |
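The clique-detection camp can be illustrated with a short sketch: build the modular product of the two molecular graphs and look for a maximum clique in it. The code below is a textbook version (plain Bron–Kerbosch without pivoting), not the implementation used in JKlustor; the graph encoding and the function names `modular_product` and `max_clique` are assumptions made for the example. Note that a clique here corresponds to a common induced subgraph that may be disconnected.

```python
def modular_product(labels_a, bonds_a, labels_b, bonds_b):
    """Adjacency of the modular product graph.

    Nodes are atom pairs (i, j) with equal labels. Two nodes are adjacent
    when they use different atoms in both molecules and the bond/no-bond
    relation between the atoms agrees in A and in B.
    """
    nodes = [
        (i, j)
        for i in range(len(labels_a))
        for j in range(len(labels_b))
        if labels_a[i] == labels_b[j]
    ]
    adj = {n: set() for n in nodes}
    for i, j in nodes:
        for k, l in nodes:
            if i != k and j != l:
                bonded_a = frozenset((i, k)) in bonds_a
                bonded_b = frozenset((j, l)) in bonds_b
                if bonded_a == bonded_b:
                    adj[(i, j)].add((k, l))
    return adj


def max_clique(adj):
    """Largest clique found by basic Bron-Kerbosch enumeration."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best


# Heavy-atom graphs: ethanol (C-C-O) against 1-propanol (C-C-C-O).
ethanol_labels = ["C", "C", "O"]
ethanol_bonds = {frozenset((0, 1)), frozenset((1, 2))}
propanol_labels = ["C", "C", "C", "O"]
propanol_bonds = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}

product = modular_product(
    ethanol_labels, ethanol_bonds, propanol_labels, propanol_bonds
)
print(sorted(max_clique(product)))  # atom pairs (index in A, index in B)
```

The product graph is built once up front, which matches the "static (initial) heuristics" entry in the table: any pruning has to be encoded in how nodes and edges of the product are defined.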
MCS of a structure set
LibraryMCS: Hierarchical MCS
Intuitive visualization
SAR table view
R-group decomposition
LibraryMCS scales linearly
[Figure: running time (sec) versus structure count (0 to 35,000); series: 2006 and 2007, with a linear trend line fitted to the 2007 series]
Clustering performance comparison
[Figure: running time (min) versus structure count (0 to 120,000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]
Behind performance
• MCS search
  – exhaustive
  – heuristics
    • exact
    • inexact
• Predictive MCS coupling in clustering
  – computing the MCS for all pairs is not feasible
  – rich fingerprinting
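The slide does not say how the fingerprints are used to predict which pairs are worth an MCS search, so the following is only one plausible reading: a cheap similarity screen that forwards promising pairs to the expensive MCS step. Fingerprints are modelled as plain Python sets of feature identifiers, and the names `tanimoto`, `candidate_pairs` and the `threshold` value are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def candidate_pairs(fingerprints, threshold=0.5):
    """Pairs whose fingerprint similarity suggests a sizeable shared core.

    Assumption for this sketch: only these pairs would be handed to the
    MCS search; everything below the threshold is skipped.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(fingerprints), 2)
        if tanimoto(fingerprints[a], fingerprints[b]) >= threshold
    ]


# Toy fingerprints: integers stand in for structural features.
library = {
    "mol1": {1, 2, 3, 4},
    "mol2": {1, 2, 3, 5},
    "mol3": {7, 8, 9},
}
print(candidate_pairs(library))
```

The screen still touches every pair, but a set intersection is far cheaper than an MCS search, so the expensive step runs only where it is likely to pay off.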
Live demonstration
• Effect of using heuristics
  – on average < 10% misclassifications
  – useful for obtaining a bird’s-eye view of larger/diverse sets
Libraries of more than 1M compounds
• Molecular scaffolds
  – Rings, ring systems
  – Bemis-Murcko frameworks
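The core of a Bemis-Murcko framework (ring systems plus the linkers between them) can be obtained by repeatedly stripping terminal atoms. The sketch below works on a bare atom-index graph and ignores elements and bond orders, so it is a simplified illustration and not JKlustor's implementation; `murcko_framework` is a name chosen for this example.

```python
def murcko_framework(n_atoms, bond_pairs):
    """Return the set of atom indices left after pruning all side chains.

    Atoms with at most one remaining neighbour are removed until none are
    left, which keeps rings and the chains linking them. An acyclic
    molecule ends up with an empty framework.
    """
    neighbours = {i: set() for i in range(n_atoms)}
    for a, b in bond_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    changed = True
    while changed:
        changed = False
        for atom in list(neighbours):
            if len(neighbours[atom]) <= 1:
                # Detach the terminal atom from its (single) neighbour.
                for nb in neighbours.pop(atom):
                    neighbours[nb].discard(atom)
                changed = True
    return set(neighbours)


# Toluene heavy atoms: ring atoms 0-5, methyl carbon 6 attached to atom 0.
toluene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
print(sorted(murcko_framework(7, toluene)))
```

Each pruning pass is linear in the number of atoms, so grouping compounds by framework needs no pairwise comparisons, which is what makes this kind of scaffold-based grouping attractive for very large libraries.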
Fast clustering methods
• Sphere exclusion
  – Variants… linear scaling
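As a rough illustration of sphere exclusion, the sketch below uses the basic textbook form: pick an unassigned compound as a centre, put every unassigned compound within a fixed radius into its cluster, and repeat. Which variants JKlustor offers is not stated on the slide. Fingerprints are modelled as sets, distance is one minus the Tanimoto similarity, and the names and the `radius` value are assumptions for the example.

```python
def tanimoto_distance(fp_a, fp_b):
    """1 - Tanimoto similarity of two feature sets."""
    union = len(fp_a | fp_b)
    return 1.0 - (len(fp_a & fp_b) / union if union else 0.0)


def sphere_exclusion(fingerprints, radius=0.6):
    """Cluster by repeatedly excluding a sphere around a chosen centre.

    The first unassigned compound (in insertion order) becomes the next
    centre; every unassigned compound within `radius` joins its cluster.
    Returns {centre: [members]} with the centre listed among its members.
    """
    unassigned = list(fingerprints)
    clusters = {}
    while unassigned:
        centre = unassigned[0]
        members = [
            m
            for m in unassigned
            if tanimoto_distance(fingerprints[centre], fingerprints[m]) <= radius
        ]
        clusters[centre] = members
        taken = set(members)
        unassigned = [m for m in unassigned if m not in taken]
    return clusters


# Toy fingerprints: integers stand in for structural features.
library = {
    "mol1": {1, 2, 3, 4},
    "mol2": {1, 2, 3, 5},
    "mol3": {7, 8, 9},
    "mol4": {7, 8, 10},
}
print(sphere_exclusion(library))
```

Each compound is compared only against cluster centres, never against every other compound, so the work grows with the number of compounds times the number of clusters rather than with the number of pairs.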
JKlustor roadmap
• In the dev pipeline
  – IJC integration
  – Spotfire integration
  – new dynamic viewer
• Planned
  – disconnected MCS
  – multiple class members
Acknowledgements
• Gábor Imre
• Judit Vaskó-Szedlár
• Péter Vadász