faculty of computer science © 2006 cmput 605february 04, 2008 novel approaches for small...

Faculty of Computer Science

CMPUT 605 December 06, 2007February 04,

2008© 2006

Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search

Karakoc E, Cherkasov A., and Sahinalp S.C.

Amit [email protected]

© 2006

Department of Computing Science

CMPUT 605

Background and Focus

Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin)

Structural similarity Similar biological and/or physico-chemical properties (Maggiora et al.)

Classification of probe compound (unknown bioactivity)

Similarity search amongst compounds with known bioactivity

© 2006


CMPUT 605

Background and Focus

Determining similarity distance measures (SDM)

Using SDM for classification of compounds—k-NN

classification

Efficient data structures for fast similarity search—

DMVP trees (an improvement over SCVP trees used

previously)

© 2006


CMPUT 605

Outline

Similarity measures

Classification techniques

k-NN classifier

DMVP tree

Results, Observations and Conclusion

© 2006


CMPUT 605

Similarity between Molecules

Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines)

Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools)

*Quantitative Structure-Activity Relationship

© 2006


CMPUT 605

Similarity Measures

Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y:

X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases)

Range of Tanimoto coefficient: [0, 1]

© 2006


CMPUT 605

Similarity measures

Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y)

Minkowski distance (LP):

Real valued data possible

© 2006


CMPUT 605

Classification Techniques

Multiple Linear Regression (MLR)

Linear Discriminant Analysis (LDA)

Artifical Neural Networks (ANN)

Support Vector Machines (SVM)

k-nearest Neighbor (k-NN) classification not used

previously.

© 2006


CMPUT 605

Distance-based Classification

Compounds—s & r

S & R respective descriptor arrays

If D(S,R) is small then bioactivity levels of s & r are

similar

Notion of distance classification of new compounds

Distance measure == metric (conditions) e.g. Hamming

Distance, Tanimoto distance etc.

© 2006


CMPUT 605

k-nn Classification

Given Bioactivity

To Find Distance measure that separates active

and inactive compounds for the training set N-

dimensional plane

Problem Easy

© 2006


CMPUT 605

k-nn Classification

Given Bioactivity

To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane

Problem NP-hard

Solution Use Genetic Algorithms, heuristic linear search to find the plane

© 2006


CMPUT 605

QSAR approach

• Uses a linear combination of descriptors

• Assigns a weight to each dimension

, W [0,1]

• Weighted Minkowski distance of order 1

• Only binary classification considered (A/I)

• Methods are general

© 2006


CMPUT 605

Parameter Optimization

© 2006


CMPUT 605

k-NN Classifier

Set of data elements: {X1, … Xn}

Query element: Y

Range query Find Xi such that D(Y,Xi) < R1 (user

defined)

k-nn query Find k items such that their distance

to Y is as small as possible

© 2006


CMPUT 605

Data structures: VP-Trees

Vantage Point (VP) tree

Choose an arbitrary data point (called Vantage

Point)

Binary tree—recursively partitions the dataset into

two equal sized subsets

Zero in on the nearest neighbor

© 2006


CMPUT 605

Efficient data structures: SCVP Trees

Space Covering Vantage Point tree

Multiple vantage points chosen at each level

No more a binary tree—multiple branches at each

internal node

Multiple inner partitions—hope is that each data

point lies in atleast one inner partition

© 2006


CMPUT 605

DMVP Tree

Memory requirements of SCVP tree can be large—redundancy

of data elements

Deterministic selection of Vantage points

VP minimization—NP-Hard

Minimization == Weighted set cover problem

Use of greedy Algorithm: O(log l); l<n

Approximates the min number of VP’s

© 2006


CMPUT 605

Experiments

Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug-like(1202)

62 dimensional descriptor array (30 QSAR & 32 physico-chemical properties)

k=1 i.e. one NN

Comparison with LDA, MLR, ANN

70% data used for training

wL1 distance is calculated in all cases

© 2006


CMPUT 605

Experimental Results

Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR

ANN beats k-NN on almost all counts

Pruning—more than 80% in each kind of bioactivity (over brute-force search)

Key point – k-NN classifier is faster

More than 100 times faster than ANN

© 2006


CMPUT 605

Experimental Results

Can calculate the level of bioactivity instead of a

YES/NO

The value of the weights provides insights into the

importance of descriptors for each bioactivity

© 2006


CMPUT 605

Observations & Conclusion

Bacterial metabolites & antimicrobial drugs overlap (confirmation)

Human metabolites display distinctive properties

QSAR models for drugs + human metabolites dominated by few descriptors

These descriptors favored by drug developers and natural evolution

© 2006


CMPUT 605

Observations & Conclusion

Classification results from k-NN can help rationalize the

design and discovery of drugs

DMVP tree improves the space utilization of the program

Provides a means for fast similarity search

Data structure can be applied to any metric distance like wLp

and Tanimoto distance

faculty of computer science © 2006 cmput 605february 04, 2008 novel approaches for small...

Documents