faculty of computer science © 2006 cmput 605february 04, 2008 novel approaches for small...
Post on 19-Dec-2015
215 views
TRANSCRIPT
Faculty of Computer Science
CMPUT 605 December 06, 2007February 04,
2008© 2006
Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search
Karakoc E, Cherkasov A., and Sahinalp S.C.
Amit [email protected]
© 2006
Department of Computing Science
CMPUT 605
Background and Focus
Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin)
Structural similarity Similar biological and/or physico-chemical properties (Maggiora et al.)
Classification of probe compound (unknown bioactivity)
Similarity search amongst compounds with known bioactivity
© 2006
Department of Computing Science
CMPUT 605
Background and Focus
Determining similarity distance measures (SDM)
Using SDM for classification of compounds—k-NN
classification
Efficient data structures for fast similarity search—
DMVP trees (an improvement over SCVP trees used
previously)
© 2006
Department of Computing Science
CMPUT 605
Outline
Similarity measures
Classification techniques
k-NN classifier
DMVP tree
Results, Observations and Conclusion
© 2006
Department of Computing Science
CMPUT 605
Similarity between Molecules
Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines)
Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools)
*Quantitative Structure-Activity Relationship
© 2006
Department of Computing Science
CMPUT 605
Similarity Measures
Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y:
X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases)
Range of Tanimoto coefficient: [0, 1]
© 2006
Department of Computing Science
CMPUT 605
Similarity measures
Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y)
Minkowski distance (LP):
Real valued data possible
© 2006
Department of Computing Science
CMPUT 605
Classification Techniques
Multiple Linear Regression (MLR)
Linear Discriminant Analysis (LDA)
Artifical Neural Networks (ANN)
Support Vector Machines (SVM)
k-nearest Neighbor (k-NN) classification not used
previously.
© 2006
Department of Computing Science
CMPUT 605
Distance-based Classification
Compounds—s & r
S & R respective descriptor arrays
If D(S,R) is small then bioactivity levels of s & r are
similar
Notion of distance classification of new compounds
Distance measure == metric (conditions) e.g. Hamming
Distance, Tanimoto distance etc.
© 2006
Department of Computing Science
CMPUT 605
k-nn Classification
Given Bioactivity
To Find Distance measure that separates active
and inactive compounds for the training set N-
dimensional plane
Problem Easy
© 2006
Department of Computing Science
CMPUT 605
k-nn Classification
Given Bioactivity
To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane
Problem NP-hard
Solution Use Genetic Algorithms, heuristic linear search to find the plane
© 2006
Department of Computing Science
CMPUT 605
QSAR approach
• Uses a linear combination of descriptors
• Assigns a weight to each dimension
, W [0,1]
• Weighted Minkowski distance of order 1
• Only binary classification considered (A/I)
• Methods are general
© 2006
Department of Computing Science
CMPUT 605
k-NN Classifier
Set of data elements: {X1, … Xn}
Query element: Y
Range query Find Xi such that D(Y,Xi) < R1 (user
defined)
k-nn query Find k items such that their distance
to Y is as small as possible
© 2006
Department of Computing Science
CMPUT 605
Data structures: VP-Trees
Vantage Point (VP) tree
Choose an arbitrary data point (called Vantage
Point)
Binary tree—recursively partitions the dataset into
two equal sized subsets
Zero in on the nearest neighbor
© 2006
Department of Computing Science
CMPUT 605
Efficient data structures: SCVP Trees
Space Covering Vantage Point tree
Multiple vantage points chosen at each level
No more a binary tree—multiple branches at each
internal node
Multiple inner partitions—hope is that each data
point lies in atleast one inner partition
© 2006
Department of Computing Science
CMPUT 605
DMVP Tree
Memory requirements of SCVP tree can be large—redundancy
of data elements
Deterministic selection of Vantage points
VP minimization—NP-Hard
Minimization == Weighted set cover problem
Use of greedy Algorithm: O(log l); l<n
Approximates the min number of VP’s
© 2006
Department of Computing Science
CMPUT 605
Experiments
Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug-like(1202)
62 dimensional descriptor array (30 QSAR & 32 physico-chemical properties)
k=1 i.e. one NN
Comparison with LDA, MLR, ANN
70% data used for training
wL1 distance is calculated in all cases
© 2006
Department of Computing Science
CMPUT 605
Experimental Results
Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR
ANN beats k-NN on almost all counts
Pruning—more than 80% in each kind of bioactivity (over brute-force search)
Key point – k-NN classifier is faster
More than 100 times faster than ANN
© 2006
Department of Computing Science
CMPUT 605
Experimental Results
Can calculate the level of bioactivity instead of a
YES/NO
The value of the weights provides insights into the
importance of descriptors for each bioactivity
© 2006
Department of Computing Science
CMPUT 605
Observations & Conclusion
Bacterial metabolites & antimicrobial drugs overlap (confirmation)
Human metabolites display distinctive properties
QSAR models for drugs + human metabolites dominated by few descriptors
These descriptors favored by drug developers and natural evolution
© 2006
Department of Computing Science
CMPUT 605
Observations & Conclusion
Classification results from k-NN can help rationalize the
design and discovery of drugs
DMVP tree improves the space utilization of the program
Provides a means for fast similarity search
Data structure can be applied to any metric distance like wLp
and Tanimoto distance