clustering made human: us ugm 2008

Solutions for Cheminformatics

Clustering made human

Miklos Vargyas


Cluster in computing

Computer cluster


Cluster in Chemistry

Transition metal carbonyl clusters

Transition metal halide clusters

Boron hydrides

Gas-phase clusters and fullerenes

Dimanganese decacarbonyl

Ditungsten tetra(hpp)


Cluster in Chemistry/Physics

Nanoscale particles

• Fullerenes

• Nano machines

Images produced by MarvinSpace


Star cluster

gravitationally bound groups of stars

Image from Wikipedia, the free encyclopedia


Clustering cars

Live demonstration

Group by property

• Shape, size, type, brand, colour

• Many possible arrangements, multiple aspects

Group by similarity

• Categorical perception


Why is clustering stars easy?

God did the job for us!

• Stars have an apparent spatial arrangement

• Distance between stars defines clusters


Why is clustering cars hard?

Lack of innate spatial arrangement

• Artificial arrangement

• Various approaches, no superior one

• “Cars come in all shapes and sizes”

Problem of dimensionality

• Why 2?!


So what about molecules?

Are they like stars or rather like cars?

• They come in all shapes and sizes

• Vast number of properties

Chemical spaces

• Select molecular properties

• Estimate or measure them

• Use them as coordinates

• Place your molecules as points in this abstract space

• Group those that are close to each other to form clusters
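The chemical-space recipe above can be sketched in a few lines of Python. This is only an illustration: the molecule names and descriptor values are made up, the scaling factors are arbitrary, and the grouping rule (link any two points closer than a cutoff, then take connected groups) is one simple choice among many.

```python
from math import dist

def cluster_by_distance(points, cutoff):
    """Group points that are linked by a chain of neighbours within `cutoff`."""
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Every still-unassigned point close to `current` joins the cluster.
            near = {j for j in unassigned
                    if dist(points[current], points[j]) <= cutoff}
            unassigned -= near
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

# Hypothetical molecules with invented (logP, TPSA) values.
descriptors = [
    ("mol_1", 1.2, 40.0),
    ("mol_2", 1.4, 45.0),
    ("mol_3", 4.8, 120.0),
    ("mol_4", 5.0, 115.0),
    ("mol_5", 9.0, 250.0),
]

# Use the selected properties as coordinates; rescale so that neither
# property dominates the distance (the divisors are an assumption).
points = [(logp / 10.0, tpsa / 300.0) for _, logp, tpsa in descriptors]

clusters = cluster_by_distance(points, cutoff=0.1)
for members in clusters:
    print([descriptors[i][0] for i in members])
```

The result depends heavily on which properties are selected and how they are scaled, which is the point of the slide: the spatial arrangement is artificial.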


Example in 2D


Further attempts in 2D

[Figure: two scatter plots of TPSA (0 to 300) against mass and against logP]


Molecule clusters by similarity

Jarvis-Patrick clustering

• Fast

• Tanimoto similarity

• Globular clusters

• Tendency to create a large number of singletons

• Molecular properties & fingerprint

jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -y -z -o SC1000.jarp.t0.6.c0.1 -g

Number of objects = 999

Number of clusters (without singletons) = 2

Number of singletons = 8

Average dissimilarity = 0.66208726

Minimum dissimilarity = 0.0

Maximum dissimilarity = 0.9411765
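To make the roles of the two tuning parameters concrete, here is a minimal sketch of a threshold-based Jarvis-Patrick variant on toy fingerprints. It is not the jarp implementation: the names `t` and `c` only mirror the slide, and their meaning here (a dissimilarity cutoff for the neighbour list and a minimum ratio of shared neighbours) is an assumption. The fingerprints are invented sets of on-bit positions.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def jarvis_patrick(fps, t, c):
    """Threshold-variant Jarvis-Patrick sketch (parameter semantics assumed).

    t: maximum Tanimoto dissimilarity for two objects to be neighbours
    c: minimum ratio of shared neighbours for two neighbours to be merged
    """
    n = len(fps)
    # Neighbour list of each object, including the object itself.
    neighbours = [
        {j for j in range(n)
         if j == i or 1.0 - tanimoto(fps[i], fps[j]) <= t}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbours[i]:
            if j > i:
                shared = len(neighbours[i] & neighbours[j])
                smaller = min(len(neighbours[i]), len(neighbours[j]))
                if shared / smaller >= c:
                    parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Invented fingerprints: two families and one outlier.
fingerprints = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},
    {10, 11, 12}, {10, 11, 13},
    {20, 21},
]

clusters = jarvis_patrick(fingerprints, t=0.5, c=0.5)
singletons = [group for group in clusters if len(group) == 1]
print(clusters, len(singletons))
```

Tightening `t` on the same toy data (for example to 0.3) breaks every group apart, a small-scale version of the sensitivity to parameters that the reported cluster and singleton counts reflect.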


Parameter tuning

t     c     Clusters   Singletons
0.6   0.1   2          8
0.3   0.1   179        248
0.5   0.1   7          36


The most populated cluster


Parameter tuning

t     c     Clusters   Singletons
0.6   0.1   2          8
0.3   0.1   179        248
0.5   0.1   7          36
0.5   0.5   10         37
0.5   0.8   81         115


Another cluster


So what’s wrong with that?

1. Manual tuning

2. Lack of interpretability

Need:

• Automated (unsupervised) techniques

• Easy-to-grasp, simple-to-understand “explanations”

One possible solution: MCS-based clustering


Maximum Common Substructure

Largest substructure shared by two molecules

MCS

Simple concept! More human, visual.

Yet hard (= expensive (= slow)) to compute.
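A brute-force sketch shows why the computation is expensive. Molecules are reduced here to atom labels plus bonds (hydrogens and bond orders ignored), and the search tries every consistent atom-to-atom mapping. This is an illustration only, not the algorithm used in any ChemAxon tool: it finds the largest common induced subgraph without requiring it to be connected, and its running time grows exponentially with molecule size.

```python
def max_common_subgraph(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return the largest atom mapping {index_in_a: index_in_b} such that
    labels match and bonded/non-bonded relations agree on all mapped pairs."""
    adj_a = {frozenset(bond) for bond in bonds_a}
    adj_b = {frozenset(bond) for bond in bonds_b}
    best = {}

    def compatible(a, b, mapping):
        if atoms_a[a] != atoms_b[b]:
            return False
        # The new pair must agree with every pair mapped so far.
        for a2, b2 in mapping.items():
            if (frozenset((a, a2)) in adj_a) != (frozenset((b, b2)) in adj_b):
                return False
        return True

    def extend(a, mapping):
        nonlocal best
        # Bound: even mapping every remaining atom cannot beat the best so far.
        if len(mapping) + (len(atoms_a) - a) <= len(best):
            return
        if a == len(atoms_a):
            best = dict(mapping)
            return
        for b in range(len(atoms_b)):
            if b not in mapping.values() and compatible(a, b, mapping):
                mapping[a] = b
                extend(a + 1, mapping)
                del mapping[a]
        extend(a + 1, mapping)  # alternative: leave atom `a` unmapped

    extend(0, {})
    return best

# Heavy-atom skeletons: ethanol C-C-O, 1-propanol C-C-C-O, dimethyl ether C-O-C.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

mcs_alcohols = max_common_subgraph(*ethanol, *propanol)
mcs_mixed = max_common_subgraph(*ethanol, *ether)
print(len(mcs_alcohols), len(mcs_mixed))
```

The two alcohols share the whole C-C-O skeleton, while ethanol and dimethyl ether share only a C-O fragment, which matches the visual intuition the slide appeals to.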


MCS of a structure set


Hierarchical star clusters

star



star cluster

• star



galaxy

• star cluster

– star



local group

• galaxy

– star cluster

star



supercluster

• cluster

– local group

galaxy

» star cluster


Visualisation of hierarchy

Dendrogram
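The hierarchy that a dendrogram draws can be built by repeatedly merging the two closest groups. The sketch below does this with single linkage on invented one-dimensional positions and returns the tree as nested tuples; the labels, positions and linkage rule are assumptions made for illustration, not the method behind the figures in this talk.

```python
def build_dendrogram(labels, distance):
    """Agglomerative single-linkage clustering; returns nested tuples."""
    # Each entry is (subtree, list of member labels).
    clusters = [(label, [label]) for label in labels]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(a, b)
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Invented positions on a line; closer objects merge first.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 6.5, "e": 20.0}

tree = build_dendrogram(
    list(positions),
    lambda x, y: abs(positions[x] - positions[y]),
)
print(tree)
```

Each nesting level of the returned tuple corresponds to one merge, so cutting the tree at a given depth yields a flat clustering at that level of the hierarchy.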


Hierarchical MCS


Intuitive visualisation


SAR table view


R-group deconvolution


Speed-up achieved last year

[Figure: running time (sec) against structure count (0 to 35,000); series: 2006, 2007, Linear (2007)]

Presented at UGM’07


Speed-up achieved this year

[Figure: running time (sec) against structure count (0 to 35,000); series: 2006, 2007, 2008]


Speed-up this year

[Figure: running time (sec, logarithmic scale from 0.1 to 10,000) against structure count (0 to 35,000); series: 2006, 2007, 2008]


Clustering performance comparison

[Figure: running time (min) against structure count (0 to 120,000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]


Find out more

Product descriptions & links

www.chemaxon.com/products.html

Forum

www.chemaxon.com/forum

Presentations and posters

www.chemaxon.com/conf

Download

www.chemaxon.com/download.html
