Clustering Made Human: US UGM 2008

Solutions for Cheminformatics Clustering made human Miklos Vargyas

Upload: chemaxon

Post on 11-Jun-2015



DESCRIPTION

Clustering chemical structures alleviates the tedious task of browsing a large set of compounds by grouping individual structures into generic categories. ChemAxon's JKlustor product offers clustering solutions ranging from a similarity-based non-hierarchical method to a purely graph-based technique. The latter has some clear advantages over more conventional approaches: its clusters are more likely to meet human expectations, and it produces a tangible explanation of why certain compounds are grouped together. It is also faster. If you 'farm your classes' then it's time to 'MCS your library'! Latest developments are here: http://www.chemaxon.com/product/jklustor.html

TRANSCRIPT

Page 1: Clustering Made Human: US UGM 2008

Solutions for Cheminformatics

Clustering made human

Miklos Vargyas

Page 2: Clustering Made Human: US UGM 2008


Cluster in computing

Computer cluster

Page 3: Clustering Made Human: US UGM 2008


Cluster in Chemistry

Transition metal carbonyl clusters

Transition metal halide clusters

Boron hydrides

Gas-phase clusters and fullerenes

Dimanganese-decacarbonyl, di-tungsten tetra(hpp)

Page 4: Clustering Made Human: US UGM 2008


Cluster in Chemistry/Physics

Nanoscale particles

• Fullerenes

• Nano machines

Images produced by MarvinSpace

Page 5: Clustering Made Human: US UGM 2008


Star cluster

gravitationally bound groups of stars

Image from Wikipedia, the free encyclopedia

Page 6: Clustering Made Human: US UGM 2008


Clustering cars

Live demonstration

Group by property

• Shape, size, type, brand, colour

• Many possible arrangements, multiple aspects

Group by similarity

• Categorical perception

Page 7: Clustering Made Human: US UGM 2008


Why is clustering stars easy?

God did the job for us!

• Stars have an apparent spatial arrangement

• Distance between stars defines clusters

Page 8: Clustering Made Human: US UGM 2008


Why is clustering cars hard?

Lack of innate spatial arrangement

• Artificial arrangement

• Various approaches, no superior one

• “Cars come in all shapes and sizes”

Problem of dimensionality

• Why 2?!

Page 9: Clustering Made Human: US UGM 2008


So what about molecules?

Are they like stars or rather like cars?

• They come in all shapes and sizes

• Vast number of properties

Chemical spaces

• Select molecular properties

• Estimate or measure them

• Use them as coordinates

• Place your molecules as points in this abstract space

• Group those that are close to each other to form clusters
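The chemical-space recipe above can be sketched in a few lines of Python. This is only an illustration, not how any ChemAxon tool works: the function name `cluster_by_distance`, the cutoff, and the five coordinate pairs are hypothetical, and the two properties are assumed to have been rescaled to the 0 to 1 range beforehand so that neither dominates the distance.

```python
from math import dist


def cluster_by_distance(points, cutoff):
    """Group points that are linked by a chain of neighbours within `cutoff`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of points closer than the cutoff.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical molecules as points in a 2D property space
# (two made-up, already rescaled properties per molecule).
points = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.85), (0.82, 0.80), (0.50, 0.10)]
clusters = cluster_by_distance(points, cutoff=0.2)
print(clusters)
```

The result depends entirely on which properties are chosen and on the cutoff, which is the difficulty the next slides illustrate.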

Page 10: Clustering Made Human: US UGM 2008


Example in 2D

Page 11: Clustering Made Human: US UGM 2008


Further attempts in 2D

[Figure: two 2D scatter plots, tpsa vs. mass and tpsa vs. logP]

Page 12: Clustering Made Human: US UGM 2008


Molecule clusters by similarity

Jarvis-Patrick clustering

• Fast

• Tanimoto similarity

• Globular clusters

• Tendency to create a large number of singletons

• Molecular properties & fingerprint

jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -y -z -o SC1000.jarp.t0.6.c0.1 -g

Number of objects = 999

Number of clusters (without singletons) = 2

Number of singletons = 8

Average dissimilarity = 0.66208726

Minimum dissimilarity = 0.0

Maximum dissimilarity = 0.9411765
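To make the idea behind similarity-based clustering concrete, here is a minimal Python sketch of Tanimoto similarity on fingerprints and a Jarvis-Patrick style grouping. It is not the `jarp` implementation: the parameter names `threshold` and `min_common_ratio` are hypothetical and are not claimed to match the meaning of `-t` and `-c`, each molecule is assumed to count as its own neighbour, and the six fingerprints are made up.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def jarvis_patrick(fingerprints, threshold, min_common_ratio):
    """Return (clusters, singletons) as lists of indices into `fingerprints`."""
    n = len(fingerprints)
    # Neighbour list of i: i itself plus every j whose dissimilarity
    # (1 - Tanimoto) is below the threshold. Assumption of this sketch.
    neighbours = [
        {j for j in range(n)
         if j == i or 1.0 - tanimoto(fingerprints[i], fingerprints[j]) < threshold}
        for i in range(n)
    ]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two neighbours join the same cluster when they share a large enough
    # fraction of the shorter of their two neighbour lists.
    for i in range(n):
        for j in neighbours[i]:
            if j <= i:
                continue
            shared = len(neighbours[i] & neighbours[j])
            shorter = min(len(neighbours[i]), len(neighbours[j]))
            if shared / shorter >= min_common_ratio:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(g for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


# Hypothetical fingerprints: sets of bit positions that are switched on.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters, singletons = jarvis_patrick(fps, threshold=0.45, min_common_ratio=0.5)
print("clusters:", clusters, "singletons:", singletons)
```

Raising `threshold` to 0.6 in this toy data pulls molecules 3 and 4 into a cluster of their own, which shows how sensitive the cluster and singleton counts are to the parameters.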

Page 13: Clustering Made Human: US UGM 2008


Parameter tuning

t     c     Clusters   Singletons
0.6   0.1   2          8
0.3   0.1   179        248
0.5   0.1   7          36

Page 14: Clustering Made Human: US UGM 2008


The most populated cluster

Page 15: Clustering Made Human: US UGM 2008


Parameter tuning

t     c     Clusters   Singletons
0.6   0.1   2          8
0.3   0.1   179        248
0.5   0.1   7          36
0.5   0.5   10         37
0.5   0.8   81         115

Page 16: Clustering Made Human: US UGM 2008


Another cluster

Page 17: Clustering Made Human: US UGM 2008


So what’s wrong with that?

Problems:

• manual tuning

• lack of interpretability

Need:

• automated (unsupervised) techniques

• easy-to-grasp, simple-to-understand “explanations”

One possible solution: MCS-based clustering

Page 18: Clustering Made Human: US UGM 2008


Maximum Common Substructure

Largest substructure shared by two molecules

MCS

Simple concept! More human, visual.

Yet hard (= expensive (= slow)) to compute.
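A brute-force Python sketch shows both why the concept is simple and why it is expensive. It is not ChemAxon's algorithm: the function name and the molecule encoding (a dict of atom labels plus a list of bonds) are hypothetical, hydrogens and bond orders are ignored, and the match is assumed to be a connected, induced subgraph. The search tries every consistent way of growing an atom-to-atom mapping, so its cost explodes with molecule size.

```python
def max_common_substructure(labels1, bonds1, labels2, bonds2):
    """Return the largest connected mapping {atom in molecule 1: atom in molecule 2}."""
    adj1 = {frozenset(b) for b in bonds1}
    adj2 = {frozenset(b) for b in bonds2}
    best = {}

    def compatible(u, v, mapping):
        # Same element, and each bond to an already mapped atom must be
        # present in both molecules or absent in both.
        if labels1[u] != labels2[v]:
            return False
        return all(
            (frozenset((u, x)) in adj1) == (frozenset((v, y)) in adj2)
            for x, y in mapping.items()
        )

    def extend(mapping):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        used2 = set(mapping.values())
        # Grow only through atoms bonded to the current match, so it stays connected.
        frontier = [u for u in labels1 if u not in mapping
                    and any(frozenset((u, x)) in adj1 for x in mapping)]
        for u in frontier:
            for v in labels2:
                if v not in used2 and compatible(u, v, mapping):
                    mapping[u] = v
                    extend(mapping)
                    del mapping[u]

    # Try every pair of equally labelled atoms as a starting point.
    for u in labels1:
        for v in labels2:
            if labels1[u] == labels2[v]:
                extend({u: v})
    return best


# Heavy-atom graphs of ethanol (C-C-O) and propan-1-ol (C-C-C-O).
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
propanol = ({0: "C", 1: "C", 2: "C", 3: "O"}, [(0, 1), (1, 2), (2, 3)])
mcs = max_common_substructure(*ethanol, *propanol)
print(len(mcs), "atoms in common:", mcs)
```

Here the whole C-C-O chain of ethanol is found inside propan-1-ol. The shared fragment itself is the "explanation" of why the two molecules belong together.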

Page 19: Clustering Made Human: US UGM 2008


MCS of a structure set

Page 20: Clustering Made Human: US UGM 2008


Hierarchical star clusters

star

Page 21: Clustering Made Human: US UGM 2008


Hierarchical star clusters

star cluster

• star

Page 22: Clustering Made Human: US UGM 2008


Hierarchical star clusters

galaxy

• star cluster

– star

Page 23: Clustering Made Human: US UGM 2008


Hierarchical star clusters

local group

• galaxy

– star cluster

star

Page 24: Clustering Made Human: US UGM 2008


Hierarchical star clusters

supercluster

• cluster

– local group

galaxy

» star cluster

Page 25: Clustering Made Human: US UGM 2008


Visualisation of hierarchy

Dendrogram
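A dendrogram is the tree that results when the closest groups are merged step by step, much like the star, star cluster, galaxy nesting on the preceding slides. The Python sketch below builds such a tree as nested tuples. It is a generic single-linkage illustration on made-up one-dimensional values, not the method used for the hierarchical MCS view; the function name is hypothetical and the input is assumed to be non-empty.

```python
def build_dendrogram(values):
    """Merge the two closest clusters repeatedly; return the tree as nested tuples."""
    # Each cluster is (tree, members): tree is a leaf value or a (left, right) pair.
    clusters = [(v, [v]) for v in values]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


tree = build_dendrogram([1, 2, 10, 11, 30])
print(tree)
```

Each pair of parentheses is one level of the hierarchy: the innermost pairs are the tightest groups, and the outermost pair is the root that a dendrogram draws at the top.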

Page 26: Clustering Made Human: US UGM 2008


Hierarchical MCS

Page 27: Clustering Made Human: US UGM 2008


Intuitive visualisation

Page 28: Clustering Made Human: US UGM 2008


SAR table view

Page 29: Clustering Made Human: US UGM 2008


R-group deconvolution

Page 30: Clustering Made Human: US UGM 2008


Speed-up achieved last year

[Chart: running time (sec) against structure count (0 to 35000); series: 2006, 2007, and a linear trend line for 2007]

Presented at UGM’07

Page 31: Clustering Made Human: US UGM 2008


Speed-up achieved this year

[Chart: running time (sec, 0 to 4000) against structure count (0 to 35000); series: 2006, 2007, 2008]

Page 32: Clustering Made Human: US UGM 2008


Speed-up this year

[Chart: running time (sec, logarithmic axis from 0.1 to 10000) against structure count (0 to 35000); series: 2006, 2007, 2008]

Page 33: Clustering Made Human: US UGM 2008


Clustering performance comparison

[Chart: running time (min, 0 to 90) against structure count (0 to 120000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]

Page 34: Clustering Made Human: US UGM 2008


Find out more

Product descriptions & links

www.chemaxon.com/products.html

Forum

www.chemaxon.com/forum

Presentations and posters

www.chemaxon.com/conf

Download

www.chemaxon.com/download.html