Bioreason, Inc
Molecular Similarity and Chemical Families: The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK
Presentation Outline
• Introduction
    • Molecular similarity
    • Observations on chemical data
• Analyzing screening data
    • Using a traditional approach
• The Homogeneity Approach
    • Definitions
    • Implementation and experimental results
• Conclusions
Molecular Similarity
• Widely used throughout the drug discovery process
• Sample applications:
    • Assessing the diversity of a chemical dataset
    • Picking a representative dataset from a compound library
    • Given a compound and a compound library, identifying the subset of similar compounds
    • Analyzing screening data
        • Major step: organizing screening data into chemical families
Typical Drug Discovery Process
[Figure: flow diagram of the drug discovery process. Labels: Library, Assay, Data, Drug Candidates, Screening, Further exploration, Data Analysis, Start Chemistry]
Technology Employed
• Compound representation methods
    • Fingerprints/bit vectors, graph-based, ...
    • 2D keys vs. 3D keys, fragment vs. distance based, ...
• Similarity and distance measures
    • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...
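As an illustration of the two named measures, here is a minimal sketch of Tanimoto similarity and Euclidean distance on binary fingerprints. The two toy vectors and the function names are assumptions made for this example; they do not come from the presentation or from any particular toolkit.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention assumed here: two all-zero vectors count as identical.
    return both / either if either else 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy fingerprints (hypothetical): they share four set bits and differ in two positions.
fp_a = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]
fp_b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]

similarity = tanimoto(fp_a, fp_b)   # 4 shared bits / 6 bits set in either
distance = euclidean(fp_a, fp_b)    # square root of the number of differing bits
```

For 0/1 vectors the Euclidean distance is simply the square root of the number of differing bits, so both measures see only the key bits and nothing of the underlying structure.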
Analyzing Chemical Compounds (1)
[Figure: a molecule drawing is matched against a Dictionary of Keys (N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...) to produce a bit vector: 10111000001...]
Analyzing Chemical Compounds (2)
• Compounds are multi-domain:
    • multiple occurrences of a key/substructure
    • members of more than one chemical family
Analyzing Chemical Compounds (3)
• Information loss!
    • E.g. “how” does a key hit?
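The loss can be made concrete with a small sketch. It assumes that each key is recorded only as hit or not hit; the per-key hit counts and compound names below are hypothetical. Two compounds that hit the same keys a different number of times collapse to the same bit vector.

```python
def to_bits(hit_counts):
    """Reduce per-key hit counts to a binary fingerprint (1 = key hits at least once)."""
    return [1 if count > 0 else 0 for count in hit_counts]


# Hypothetical hit counts for four keys in two different compounds.
compound_x = [1, 0, 2, 1]
compound_y = [3, 0, 1, 4]

bits_x = to_bits(compound_x)
bits_y = to_bits(compound_y)

# The count vectors differ, but the bit vectors are identical:
# how often (and where) each key hits is no longer recoverable.
same_fingerprint = bits_x == bits_y
```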
Dataset Used
• Derived from the NCI anti-HIV program
    • Latest release, Oct. 99, 43,382 compounds
    • Cell based, EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
    • Molecular weight <= 500
    • Multiple EC50 values for compounds; kept highest concentration
    • 33,245 compounds left
• Activities: converted from molar concentrations to -log
• Activity threshold used: 5.5
• Training set size (actives): 503
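The pre-processing steps above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the record layout `(compound_id, mol_weight, ec50_molar)`, the function names, and the toy records are assumptions, and treating the 5.5 threshold as inclusive is also an assumption.

```python
import math

MW_LIMIT = 500.0           # molecular weight cut-off from the slide
ACTIVITY_THRESHOLD = 5.5   # threshold on -log10(EC50) from the slide


def preprocess(records):
    """Return {compound_id: -log10(EC50)} after the weight filter and de-duplication.

    records: iterable of (compound_id, mol_weight, ec50_molar) tuples (assumed layout).
    """
    kept = {}
    for compound_id, mol_weight, ec50_molar in records:
        if mol_weight > MW_LIMIT:
            continue  # drop compounds heavier than the limit
        if ec50_molar <= 0:
            raise ValueError("EC50 must be a positive molar concentration")
        # Several EC50 values per compound: keep the highest concentration.
        if compound_id not in kept or ec50_molar > kept[compound_id]:
            kept[compound_id] = ec50_molar
    return {cid: -math.log10(ec50) for cid, ec50 in kept.items()}


def actives(activities, threshold=ACTIVITY_THRESHOLD):
    """Compound ids whose activity reaches the threshold (inclusive by assumption)."""
    return sorted(cid for cid, value in activities.items() if value >= threshold)


# Toy records (hypothetical values).
toy_records = [
    ("A", 320.0, 1e-6),
    ("A", 320.0, 1e-7),   # duplicate measurement; the higher concentration (1e-6) is kept
    ("B", 480.0, 1e-5),
    ("C", 612.5, 1e-8),   # removed by the molecular weight filter
]

activities = preprocess(toy_records)
active_ids = actives(activities)
```

Keeping the highest concentration is the conservative choice: it is the least potent measurement, so a compound counts as active only if its weakest reading still passes the threshold.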
Analyzing Screening Data: Typical Approach
• Goal: Data reduction
    • To manageable size
    • Organized fashion
    • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)
Hierarchical Agglomerative Clustering Method
• NCI HIV dataset
    • 503-compound subset based on activity
• Clustered using Ward's method, Euclidean distance, bit vectors obtained via application of MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
    • 70 (meta)clusters
    • Complete coverage of the dataset, no singletons!
    • Average metacluster size: 7.2 compounds
Method Evaluation - Chemists
• Results validation by comparing to known truth:
    • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
    • Smaller, less well-represented families not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality
    • On average chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
    • The remaining clusters (~2/3) were difficult to interpret
        • Compounds that shouldn’t be in some clusters
        • Compounds that should have been in some clusters (misclassified or not)
        • Clusters that were made of dissimilar/diverse compounds
• Experts were puzzled by the absence of singletons
Method Evaluation - Computational
• Analyzed 70 groups of compounds
• Simple method:
    • average nearest neighbor distance within a set of compounds
    • distance computed using the bit-vectors of the compounds
• 43/70: pretty low average nearest neighbor distance
• 22/70: moderate average nearest neighbor distance
• 5/70: quite high average nearest neighbor distance
• Overall most of the groups had a low diversity; expected since the metaclusters were built using bit-vectors
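The measure described above can be sketched in a few lines. The slide does not say which distance was used for this evaluation; Euclidean distance is assumed here because it was the distance used for the clustering, and the function names and toy clusters are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_nearest_neighbor_distance(vectors, distance=euclidean):
    """Mean, over all compounds in a set, of the distance to the closest other compound."""
    if len(vectors) < 2:
        raise ValueError("need at least two compounds in the set")
    total = 0.0
    for i, current in enumerate(vectors):
        total += min(
            distance(current, other)
            for j, other in enumerate(vectors)
            if j != i
        )
    return total / len(vectors)


# Toy clusters (hypothetical bit vectors).
tight_cluster = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]]
loose_cluster = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]

tight_score = average_nearest_neighbor_distance(tight_cluster)
loose_score = average_nearest_neighbor_distance(loose_cluster)
```

A low score only says that every compound has a close neighbor in fingerprint space. It says nothing about whether the set shares a scaffold.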
The problem
• Confusing?
    • Method functioned just right from a computational perspective
    • But the results were not as satisfying to the human expert
• Clustering results often don’t:
    • match expectations
    • make chemical sense
• Why?
    • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
    • No chemical “common sense” influence on the clustering process
The road ahead… (1)
• What is the end goal of screening data analysis?
    • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
    • Clustering and classification methods using vector encoding representations of molecules
• But:
    • clustering only gives groups of compounds that have similar vector representations, and
    • a successful classification session requires that one knows the chemical families of interest a priori
The road ahead… (2)
• So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
    • Discover what the experts want
    • Adapt our process to match results and expectations
Definitions
• Chemical family:
    • A set of highly similar compounds sharing a common scaffold; alternatively, a set of compounds with high homogeneity
• Homogeneity:
    • High structural similarity
    • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold:
    • A substructure defined as a specific configuration of atom types and bond types
Processing traditional method results
• Processing the results of traditional methods:
    • Easier to do than a complete re-design/re-implementation
    • Will “remove” results not chemically sensible
    • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach:
    • Compute and use structural homogeneity on results of traditional methods
    • Basically construct “chemically sensible” methods for selecting the important compound groups
Identifying Scaffolds
• Maximum Common Substructure (MCS) extraction:
    • Using our own extremely fast and efficient implementations
• Highlights of analysis:
    • 7 out of 70 compound sets: common scaffold size < 2!
    • 5 MCSs appeared multiple times
        • Range: 2-6, mostly benzene rings
    • A total of 53 different scaffolds
    • MCS size:
        • Ranged from less than 2 atoms to greater than 14 atoms
Introducing Homogeneity
• Cluster homogeneity:
    • Fingerprint homogeneity:
        • Overall quite good average nearest neighbor distance
    • Structural homogeneity:
        • Used: # of atoms in MCS / avg. # of atoms in set molecules
        • Structural homogeneity threshold: 1/3
            • MCS covering at least a third of the average molecule size
        • Results:
            • 23/70 clusters below threshold
            • 47 above threshold
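The ratio and threshold above can be written out as a short sketch. It assumes the MCS size has already been computed by some MCS routine, which is not shown, and is passed in as an atom count. The function names and example atom counts are hypothetical, and exact fractions are used so that a cluster sitting exactly at 1/3 is handled without rounding error.

```python
from fractions import Fraction

HOMOGENEITY_THRESHOLD = Fraction(1, 3)  # threshold stated on the slide


def structural_homogeneity(mcs_atoms, atom_counts):
    """Atoms in the MCS divided by the average atom count of the set's molecules."""
    if not atom_counts:
        raise ValueError("the compound set is empty")
    if mcs_atoms < 0 or any(count <= 0 for count in atom_counts):
        raise ValueError("atom counts must be positive and MCS size non-negative")
    average_size = Fraction(sum(atom_counts), len(atom_counts))
    return Fraction(mcs_atoms) / average_size


def is_structurally_homogeneous(mcs_atoms, atom_counts, threshold=HOMOGENEITY_THRESHOLD):
    """True when the MCS covers at least the threshold share of the average molecule."""
    return structural_homogeneity(mcs_atoms, atom_counts) >= threshold


# Hypothetical cluster: three molecules with 18, 20 and 22 atoms (average 20).
cluster_sizes = [18, 20, 22]

ring_only = structural_homogeneity(6, cluster_sizes)        # a 6-atom MCS, e.g. a benzene ring
large_scaffold = structural_homogeneity(14, cluster_sizes)  # a 14-atom MCS
```

With these numbers a bare six-membered ring gives 6/20 and falls below the threshold, while a 14-atom scaffold gives 14/20 and passes.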
Method Assessment (1)
• Results were used to assign priority to clusters:
    • Low priority - low likelihood of chemical sense:
        • clusters with small scaffolds, low structural homogeneity
        • clusters with insignificant scaffolds, low-to-moderate structural homogeneity
    • High priority - high likelihood of chemical sense:
        • well defined clusters, with high structural homogeneity and big, significant scaffolds
• Approach did make life easier for human analysts
    • Ability to find important information faster
Method Assessment (2)
• Prioritization assessment:
    • the 23 non-structurally homogeneous clusters were uninteresting to chemists
    • the 47 structurally homogeneous clusters included all those (20-30) approved before by chemists as chemical families
• However, experts complained about:
    • low information content of the clustering process results
        • Too many clusters, too little knowledge
    • the amount of information never found!
        • High priority clusters contained only 2/3 of compounds analyzed!
        • Clusters approved as chemical families from which knowledge could be derived easily contained only 1/3 of the compounds!
        • Known knowledge never found
The road ahead… (3)
• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
    • Introduce chemically “aware” methods
    • No simple clustering methods
    • Take into account structural homogeneity
    • Accommodate multi-domain nature of molecules
    • Present results in a format that facilitates interpretation and knowledge discovery by chemists
A different approach: Can it work?
• Have been working on “chemically aware” screening data analysis methods
• Same dataset results with a typical Bioreason analysis:
    • 102 classes, all with high structural homogeneity
        • All classes were easy to interpret
        • Only 10% of classes not interesting to chemists (~50 compounds)
    • 47 singletons (~10% of dataset)
    • Information content much higher than traditional approach
        • 90% of compounds placed in homogeneous clusters (vs. 66% in traditional method)
        • 80% of compounds placed in clusters approved as structural families (vs. 34% in traditional method)
    • Multi-domain nature is accommodated
Conclusions
• Molecular fingerprint similarity is not a reliable indication of high structural molecular similarity
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense
• Structural homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations
Acknowledgements
Patricia Bacha
Bobi Den Hartog
Info: nicolaou@bioreason.com www.bioreason.com