Fusing database rankings in similarity-based virtual screening Peter Willett, University of Sheffield


Page 1:

Fusing database rankings in similarity-based virtual screening

Peter Willett, University of Sheffield

Page 2:

Overview

• Similarity-based virtual screening

• Combination of similarity rankings
  • Similarity fusion

• Group fusion

• Comparison of fusion rules

Page 3:

Drug discovery

• The pharmaceutical industry has been one of the great success stories of scientific research, discovering a range of novel drugs for important therapeutic areas

• The computer has revolutionised how the industry uses chemical (and increasingly biological) information

• Many of these developments are within the discipline we now know as chemoinformatics
  • “Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information” (G. Paris at a 1999 ACS meeting, quoted at http://www.warr.com/warrzone.htm)

• Focus on structural information (2D or 3D), cf. bioinformatics

Page 4:

Virtual screening

• Chemoinformatics covers a wide range of techniques

• Here, focus on virtual screening of existing public and in-house databases
  • Tools to rank compounds in order of decreasing probability of activity

• The top-ranked molecules are then prioritised for biological screening

• A range of virtual screening methods available, with similarity searching being one of the best established and most widely used

Page 5:

Similarity searching

• Use of a similarity measure to quantify the resemblance between an active reference (or target) structure and each database structure

• Given a reference structure, find molecules in a database that are most similar to it (“give me ten more like this”)
  • Compare the reference structure with each database structure and measure the similarity

• Sort the database in order of decreasing similarity

• Display the top-ranked structures (“nearest neighbours”) to the searcher
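The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: fingerprints are modelled as sets of fragment identifiers, the similarity measure is the Tanimoto coefficient (defined on a later slide), and the molecule names and data are hypothetical.

```python
def tanimoto(ref_bits, db_bits):
    # Tanimoto coefficient: bits set in common / total distinct bits set
    common = len(ref_bits & db_bits)
    return common / (len(ref_bits) + len(db_bits) - common)

def similarity_search(reference_fp, database, top_n=10):
    # Score every database structure against the reference, sort the database
    # in order of decreasing similarity, and return the "nearest neighbours"
    scored = [(name, tanimoto(reference_fp, fp)) for name, fp in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical database: fingerprints as sets of fragment identifiers
database = {"mol_a": {1, 2, 3}, "mol_b": {1, 2, 3, 4}, "mol_c": {9}}
print(similarity_search({1, 2, 3, 4}, database, top_n=2))
```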

Page 6:

2D similarity searching

[2D structure diagrams: a reference structure and the database structures retrieved as its nearest neighbours]

Page 7:

Rationale for similarity searching

• The similar property principle states that structurally similar molecules tend to have similar properties

• Given a known active reference structure, a similarity search of a database can be used to identify further molecules for testing

• NB many exceptions to the similar property principle

[2D structure diagrams of three structurally similar molecules: Morphine, Codeine and Heroin]

Page 8:

Similarity measures

• A similarity measure has two principal components
  • A structure representation: characterise reference and database structures to enable rapid comparison

• A similarity coefficient to compare two representations

Quantitative measure of the resemblance of these characterisations

• The most common measure is based on the use of 2D fingerprints and the Tanimoto coefficient (as in previous example)

Page 9:

Fingerprints

[2D structure diagram of a molecule illustrating the fragment substructures encoded in a fingerprint]

• A simple, but approximate, representation that encodes the presence of fragment substructures in a bit-string or fingerprint

• Cf. keywords indexing textual documents
• Each bit in the bit-string (binary vector) records the presence (“1”) or absence (“0”) of a particular fragment in the molecule
• Typical length is a few hundred or a few thousand bits
• Two fingerprints are regarded as similar if they have many common bits set
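A minimal sketch of such a fragment bit-string, assuming a small fixed fragment dictionary (the fragment names and dictionary here are hypothetical; real schemes use hundreds to thousands of bits):

```python
# Hypothetical fragment dictionary: one bit position per fragment type
FRAGMENTS = ["C-C", "C=C", "C-O", "C=O", "C-N", "c:c", "O-H", "N-H"]

def fingerprint(molecule_fragments):
    # Set bit i to 1 if the i-th dictionary fragment occurs in the molecule
    return [1 if frag in molecule_fragments else 0 for frag in FRAGMENTS]

# A molecule containing three of the dictionary fragments
print(fingerprint({"C-C", "C-O", "O-H"}))
```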

Page 10:

Tanimoto coefficient

• Tanimoto coefficient for binary bit strings:

  T = C / (R + D − C)

  • C bits set in common between the Reference and Database structures
  • R bits set in the Reference structure
  • D bits set in the Database structure

• More complex form for use with non-binary data, e.g., physicochemical property vectors

• Many other similarity coefficients exist
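As a minimal sketch, the binary Tanimoto coefficient follows directly from the definitions of C, R and D above; fingerprints are modelled here as sets of set-bit positions.

```python
def tanimoto(reference_fp, database_fp):
    # C = bits set in common; R, D = bits set in each fingerprint
    c = len(reference_fp & database_fp)
    r, d = len(reference_fp), len(database_fp)
    return c / (r + d - c)

# Two fingerprints sharing 3 of their 5 distinct set bits:
print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # 3 / (4 + 4 - 3) = 0.6
```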

Page 11:

Data fusion: I

• Many comparisons of effectiveness using different screening methods (e.g., different coefficients, different fingerprints, 2D or 3D methods)

• Sheridan and Kearsley, Drug Discov. Today, 7, 2002, 903
  • “We have come to regard looking for ‘the best’ way of searching chemical databases as a futile exercise. In both retrospective and prospective studies, different methods select different subsets of actives for the same biological activity and the same method might work better on some activities than others”

• Different types of coefficient and different types of representation reflect different molecular characteristics, so search performance may be enhanced by using more than one similarity measure

Page 12:

Data fusion: II

• Use of ideas from textual information retrieval (IR), given analogies between the two domains
  • Documents, keywords with highly skewed frequency distributions, and relevance to a query
  • Molecules, fragments with highly skewed frequency distributions, and activity against a specific biological target

• IR-like fusion first studied in the late Nineties
  • Generate multiple rankings from the same reference structure using different similarity measures (similarity fusion)
  • Found to give improved performance over use of a single similarity measure (more consistent, or even better than the best individual measure)

• Later work in chemoinformatics
  • Generate multiple rankings from different reference structures using the same similarity measure (group fusion)

Page 13:

Similarity fusion

• Conventional similarity searching yields a single database ranking

• Work in IR on the “Authority Effect”
  • Experiments in TREC show that documents retrieved by multiple search engines are more likely to be relevant to a query than those retrieved by a single search engine

• Does the Effect also apply in chemoinformatics?
  • Extensive virtual screening experiments to investigate whether structures retrieved by multiple virtual screening methods are more likely to be active than those retrieved by a single method

Page 14:

Experimental details: I

• Test collection methodology analogous to that used in IR
  • Use of MDDR (ca. 102K structures) and WOMBAT (ca. 130K structures) databases

• Sets of molecules with known biological activities (several hundred known actives in each class)

• Simulated virtual screening using an active as the reference structure

• How many of the top-ranked molecules from a search are also active?

Page 15:

Experimental details: II

• Sets of 25 searches for a reference structure:
  • 5 different similarity coefficients (Tanimoto, cosine, Euclidean distance, Forbes, Russell-Rao)
  • 5 different fingerprints (MDL, BCI, Daylight, Unity and ECFP_4)

• Apply cut-off to take, e.g., top-1% of a ranking

• Numbers of molecules, and numbers of active molecules, retrieved by 1, 2, …, 24, 25 searches

• Average over different reference structures for each activity class, and over different activity classes

Page 16:

Retrieval of molecules: WOMBAT top-1% searches

[Chart: average number of molecules retrieved (y-axis, 0-4500) against the number of similarity searches retrieving them (x-axis, 1-25), plotted for 14 activity classes (5HT1A, 5HT3, AChE, AT1, COX, D2, fXa, HIVP, MMP1, PDE4, PKC, Renin, SubP, Thrombin) and for the average over all classes]

Page 17:

Retrieval of molecules: WOMBAT top-1% searches (average over classes)

[Log-log plot: average number of molecules retrieved (log) against number of similarity searches (log), averaged over classes; fitted line y = -1.9788x + 3.9568, R² = 0.9666: a Zipf-like distribution]

Page 18:

Retrieval of active molecules: WOMBAT top-1% searches

[Chart: average percentage of retrieved molecules that are active (y-axis, 0-100%) against number of similarity searches (x-axis, 1-25), plotted for the 14 activity classes and for the average over all classes]

Page 19:

Retrieval of active molecules: WOMBAT top-1% searches (average over classes)

[Plot: average percentage of actives against number of similarity searches, averaged over classes; fitted exponential y = 1.1104e^(0.1736x), R² = 0.9948]

Page 20:

Similarity fusion: conclusions

• Using multiple searches results in:
  • Rapid decrease in the numbers of molecules retrieved
  • Rapid increase in the percentage of those retrieved molecules that are active

• Multiple searches could hence increase the effectiveness of similarity-based virtual screening

• Provides empirical basis for similarity fusion (but very simple fusion rule). What about group fusion?

Page 21:

Use of group fusion: I

[Diagram: three separate database rankings, one generated from each of Reference structures 1, 2 and 3]

Page 22:

After truncation to required rank

[Diagram: the rankings from References 1, 2 and 3 truncated at the required rank before fusion]

Page 23:

Group fusion

• Use of MDDR database (ca. 102K structures)
  • Measured numbers of actives retrieved in top-5% of ranking
• Group fusion searches where ten actives are picked at random
  • Comparison with the average of all the individual actives for each activity class
  • Comparison with the best single active for each activity class
  • Use of Unity and ECFP_4 fingerprints

• Group fusion markedly out-performs the use of individual reference structures

• Best results obtained using combination of scores and the MAX rule (see later)

• Hert et al., J. Chem. Inf. Comput. Sci., 44, 2004, 1177
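A sketch of the group fusion workflow: each of several active reference structures scores every database structure, and the scores are combined with the MAX rule before ranking. The Tanimoto helper and the data here are illustrative assumptions, not the published experimental setup.

```python
def tanimoto(a, b):
    # Binary Tanimoto coefficient on fingerprints modelled as sets of set bits
    common = len(a & b)
    return common / (len(a) + len(b) - common)

def group_fusion(reference_fps, database, rule=max):
    # Score each database structure against every reference structure,
    # combine the scores with the fusion rule (MAX here), then rank
    fused = {name: rule(tanimoto(ref, fp) for ref in reference_fps)
             for name, fp in database.items()}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

references = [{1, 2, 3}, {7, 8}]   # e.g. actives picked at random
database = {"a": {1, 2, 3}, "b": {7, 8}, "c": {4}}
print(group_fusion(references, database))
```

A structure similar to any one of the references gets a high fused score, which is why group fusion can retrieve actives that no single reference structure would find.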

Page 24:

Group fusion: average over 11 activity classes

[Bar chart: recall at 5% (%) for Single Similarity - Average, Single Similarity - Maximum, and Data Fusion (Scores - Max), using Unity and ECFP_4 fingerprints]

Page 25:

Fusion rules

• Given multiple input rankings, a fusion rule outputs a single, combined ranking
  • The rankings can be either the computed similarity values or the resulting rank positions

• Work in IR and chemoinformatics has used simple arithmetical operations to combine rankings (though many other, more complex types of rule are available):
  • CombMAX for similarity data
  • CombSUM for rank data

• Detailed comparison of a range of rules

Page 26:

Fusion rules for the x-th database structure

• CombMax = max{S1(x), S2(x)..Si(x)..Sn(x)}• Also CombMIN

• CombSum = ΣSi(x)

• Also CombMED and other averages, using all or just some of the rankings

• CombRKP = Σ(1/Ri(x))

• Used only with rank data
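These rules are simple enough to state directly in code. A sketch, where each database structure x has a list of similarity scores S_i(x), or rank positions R_i(x), one per search:

```python
def comb_max(scores):
    # CombMAX: best similarity score across the n searches (CombMIN: worst)
    return max(scores)

def comb_sum(scores):
    # CombSUM: sum of the scores (CombMED and other averages are variants)
    return sum(scores)

def comb_rkp(ranks):
    # CombRKP: sum of reciprocal rank positions; used only with rank data
    return sum(1.0 / r for r in ranks)

# A structure ranked 2nd, 5th and 1st in three searches:
print(comb_rkp([2, 5, 1]))  # 1/2 + 1/5 + 1/1 = 1.7
```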

Page 27:

Very simple rules!

• Other studies use supervised rules (logistic regression, belief theory, etc.)

• But normally very limited training data (i.e., structures and bioactivity information) at the stage you want to use data fusion

• If such data are available, other chemoinformatics approaches preferable

Page 28:

Experimental details

• Searches carried out using:
  • Similarity fusion and group fusion
  • Various percentages of the ranked database
  • 15 different fusion rules

• Results show conclusively that the best results (for both similarity fusion and group fusion) are obtained when:
  • Using just the top 1-5% of each ranked list in the fusion
  • Using the CombRKP fusion rule on the ranked lists

Page 29:

Use of CombRKP: I

Virtual screening seeks to rank molecules in decreasing order of probability of activity: MDDR searches (J. Med. Chem., 48, 2005, 7049) show a hyperbola-like plot.

Page 30:

Use of CombRKP: II

Fusion scores for CombRKP best approximate the probability of activity, and hence CombRKP is likely to perform well. Results averaged over 200 MDDR searches.

Page 31:

Conclusions

• Similarity-based virtual screening using fingerprints well-established

• Can enhance screening effectiveness by use of data fusion:
  • Combining the rankings from different similarity measures

• Combining the rankings from different reference structures

• Range of simple fusion rules available for this purpose

Page 32:

Acknowledgments

• Organisations
  • Accelrys, Daylight Chemical Information Systems, Digital Chemistry, EPSRC, Government of Malaysia, Sunset Molecular, Royal Society, Tripos, Wolfson Foundation

• People
  • Claire Ginn, Jerome Hert, John Holliday, Evangelos Kanoulas, Nurul Malim, Christoph Mueller, Naomie Salim