scalable partitioning & exploration of chemical spaces ... · very fast, approximate nearest...

32
Rapid Partitioning of Chemical Spaces Scalable Partitioning & Exploration of Chemical Spaces Using Geometric Hashing Rajarshi Guha, Debojyoti Dutta, Peter C. Jurs, Ting Chen Department of Chemistry Pennsylvania State University and Department of Computational Biology University of Southern California 28 th March, 2006

Upload: others

Post on 17-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical Spaces

Scalable Partitioning & Exploration ofChemical Spaces Using Geometric Hashing

Rajarshi Guha, Debojyoti Dutta, Peter C. Jurs, Ting Chen

Department of ChemistryPennsylvania State University

andDepartment of Computational Biology

University of Southern California

28th March, 2006

Page 2: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical Spaces

Outline

1 Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Page 3: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Chemical Data Mining

Common tasks

Clustering

library design andselectiondata partitioningclassification

Virtual screening

Substructure searching

Problems

Datasets are large and aregetting larger every day

Feature space has highdimensionality and is usuallycorrelated

Many common algorithmsare of high computationalcomplexity

Page 4: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Chemical Data Mining

Common tasks

Clustering

library design andselectiondata partitioningclassification

Virtual screening

Substructure searching

Problems

Datasets are large and aregetting larger every day

Feature space has highdimensionality and is usuallycorrelated

Many common algorithmsare of high computationalcomplexity

Page 5: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Can We Efficiently Analyze Chemical Data?

Parallelize

Throw more CPU’s at the problemGets the answer but doesn’t address the underlying problemsLacks elegance

Approximation algorithms

Transforms the problem to a space of lower dimensionsParametric in natureError bounds of the approximation are desirable

Avoid the distance matrix

Many utilize a pairwise distance matrixThe distance matrix for 50,000 structures will need 4GB ofmemoryMost distance matrix based algorithms are O(N2)

Guha, R; Jurs, P.C.; “Applications of Spectral Clustering to Chemical Datasets”, Gordon Conference on Computer

Aided Drug Design, Tilton, NH, August 2005

Page 6: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Can We Efficiently Analyze Chemical Data?

Parallelize

Throw more CPU’s at the problemGets the answer but doesn’t address the underlying problemsLacks elegance

Approximation algorithms

Transforms the problem to a space of lower dimensionsParametric in natureError bounds of the approximation are desirable

Avoid the distance matrix

Many utilize a pairwise distance matrixThe distance matrix for 50,000 structures will need 4GB ofmemoryMost distance matrix based algorithms are O(N2)

Guha, R; Jurs, P.C.; “Applications of Spectral Clustering to Chemical Datasets”, Gordon Conference on Computer

Aided Drug Design, Tilton, NH, August 2005

Page 7: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Can We Efficiently Analyze Chemical Data?

Parallelize

Throw more CPU’s at the problemGets the answer but doesn’t address the underlying problemsLacks elegance

Approximation algorithms

Transforms the problem to a space of lower dimensionsParametric in natureError bounds of the approximation are desirable

Avoid the distance matrix

Many utilize a pairwise distance matrixThe distance matrix for 50,000 structures will need 4GB ofmemoryMost distance matrix based algorithms are O(N2)

Guha, R; Jurs, P.C.; “Applications of Spectral Clustering to Chemical Datasets”, Gordon Conference on Computer

Aided Drug Design, Tilton, NH, August 2005

Page 8: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Can We Efficiently Analyze Chemical Data?

Parallelize

Throw more CPU’s at the problemGets the answer but doesn’t answer the underlying problemsLacks elegance

Approximation algorithms

Transforms the problem to a space of lower dimensionsParametric in natureError bounds of the approximation are desirable

Avoid the distance matrix

Many algorithms utilize a pairwise distance matrixThe distance matrix for 50,000 structures will need 4GB ofmemoryMost distance matrix based algorithms are O(N2)

Guha, R; Jurs, P.C.; “Applications of Spectral Clustering to Chemical Datasets”, Gordon Conference on Computer

Aided Drug Design, Tilton, NH, August 2005

Page 9: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Locality Sensitive Hashing

What is it?

Very fast, approximate nearest neighbor detection algorithm

Why is this significant?

Fast

Theoretically sublinearAvoids the distance matrix computationNot hampered by high dimensional data

Approximate

Based on a radius valueThe nearest neighbors in the radius are reported in aprobabilistic manner

Datar, M. et al.; “Locality Sensitive Hashing Scheme Based on p-Stable Distributions”, in Proceedings of the 20th

Annual Symposium on Computational Geometry, 2004

Page 10: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Locality Sensitive Hashing

What is it?

Very fast, approximate nearest neighbor detection algorithm

Why is this significant?

Fast

Theoretically sublinearAvoids the distance matrix computationNot hampered by high dimensional data

Approximate

Based on a radius valueThe nearest neighbors in the radius are reported in aprobabilistic manner

Datar, M. et al.; “Locality Sensitive Hashing Scheme Based on p-Stable Distributions”, in Proceedings of the 20th

Annual Symposium on Computational Geometry, 2004

Page 11: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Does It Work?

O=C(CCO)CC f57021b43e1d86a17363dabcdbe2b744

CCC ba2a9960031f1cd180f7619a15e77fc4

b50ca9d003ff7611453a2f066eb1b949

CC

Hash Functions

A hash function, H, takes anobject, O, and tries to convert it toa unique value

O can be a string or numeric

A good hash function will avoidtoo many collisions

H(O1) = H(O2) does not alwaysimply O1 = O2

Knuth, D. .E., The Art of Computer Programming, Vol. 3, Addison Wesley, 1973

Page 12: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Does It Work?

CCC

CC

O=C(CCO)CC

H

H

H

Hash Tables

A hash table uses a hash functionto convert an object O to alocation in the table

A good hash function should beused so that similar objects do notland in the same location

Knuth, D. .E., The Art of Computer Programming, Vol. 3, Addison Wesley, 1973

Page 13: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Does It Work?

Overview

High dimensional observations are projected onto random lines

The lines are chopped into equal width segments, hi

Observations are assigned a hash value depending on whichsegment it is projected onto

Locality Sensitive Hash Functions

If two points q and v are close they will hash to the samevalue with high probability

If they are distant they should collide with low probability

Families of hash functions can be defined using s-stabledistributions which ensures the locality property

Page 14: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Does It Work?

Overview

High dimensional observations are projected onto random lines

The lines are chopped into equal width segments, hi

Observations are assigned a hash value depending on whichsegment it is projected onto

Locality Sensitive Hash Functions

If two points q and v are close they will hash to the samevalue with high probability

If they are distant they should collide with low probability

Families of hash functions can be defined using s-stabledistributions which ensures the locality property

Page 15: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

How Does It Work?

User Parameters

R - the radius of interest

k, L - hashing parameters

p - the probability that g(q) = g(v), where

g(q) = (h1(q), . . . , hk(q))

In practice the parameters k and L are estimated empirically

R and p are specified by the user

What Does All This Mean?

We can use the LSH algorithm to determine the neighbors of apoint q that are within a radius R from q, with a probability p

Page 16: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

What Is It Good For?

Rapidly partition a dataset into more manageable chunks

kNN regression and classification

Clustering

Intuitive diversity analysis of chemical spaces

LSH gives us a framework which can be applied to nearestneighbor type problems for very large datasets

It can be used as an analysis method in itself as well as performthe role of a preprocessor for subsequent methods

Page 17: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

What Is It Good For?

Rapidly partition a dataset into more manageable chunks

kNN regression and classification

Clustering

Intuitive diversity analysis of chemical spaces

LSH gives us a framework which can be applied to nearestneighbor type problems for very large datasets

It can be used as an analysis method in itself as well as performthe role of a preprocessor for subsequent methods

Page 18: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Datasets & Descriptors

Datasets

No. of Descriptors

Dataset No. of Molecules Full Reduced

Kazius 4337 142 20NCI-AIDS 42613 144 55NCI-3D 249071 163 53

Kazius, J. et al.; J. Med. Chem., 2005, 48, 312-320

http://cactus.nci.nih.gov/DownLoad/AID2DA99.sdz

Voigt, J. et al.; J. Chem. Inf. Comput. Sci., 2001, 41, 702-712

Page 19: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Datasets & Descriptors

Descriptors

Topological and geometric descriptors were evaluated

A reduced descriptor pool was also investigated

No scaling was performed

Maximum Distance Mean distance

Dataset Full Reduced Full Reduced

Kazius 507252 507144 1793 1656NCI-AIDS 626517 626438 3712 3490NCI-3D 14285759 3013170 10810 11784

Page 20: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Timing Results - Effect of Dataset Size

3.5 4.0 4.5 5.0 5.5

−3.

0−

2.5

−2.

0−

1.5

−1.

0−

0.5

log[Size of Training Set]

log[

Mea

n Q

uery

Tim

e (s

ec)]

● LSH (full descriptor pool)LSH (reduced descriptor pool)kNN (full descriptor pool)kNN (reduced descriptor pool)

Mean Query Times

For each dataset the first200 points were used as thequery set

For the LSH runs,R = 0.01× Dmax

kNN runs were performedwith k = 200

Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T., J. Chem. Inf. Model., 2006, 46, 321-333

Page 21: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Timing Results - Effect of NN Count

−5 −4 −3 −2 −1

−3.5

−3.0

−2.5

−2.0

−1.5

log(CNN)

log

[Mea

n Q

uery

Tim

e (s

ec)]

KaziusNCI AIDSNCI 3D

Normalized Query Counts

C̄NN = 1Nt

1Nq

∑Nq

i=1 NNN,i

Memory access is significantfor large datasets

CNN takes into account thenumber of neighbors withrespect to the size of thedataset

Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T., J. Chem. Inf. Model., 2006, 46, 321–333

Page 22: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Timing Results (NCI-AIDS Dataset)

0.02 0.03 0.04 0.05 0.06 0.07 0.08

−3.

5−

3.0

−2.

5−

2.0

−1.

5−

1.0

Radius

log

[Mea

n Q

uery

Tim

e (s

ec)]

LSH (144 descriptors)LSH (55 descriptors)kNN (144 descriptors)kNN (55 descriptors)

Influence of Radius

kNN results were obtainedwith k = 200

For the highest radius a fewthousand neighbors weredetected

Query times for LSH are atleast an order of magnitudelower than kNN

Highlights the use of LSH asa partitioning tool

Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T., J. Chem. Inf. Model., 2006, 46, 321–333

Page 23: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Accuracy Results (Kazius Dataset)

0.02 0.04 0.06 0.08 0.10

0.97

0.98

0.99

1.00

Radius

Frac

tion

corre

ct n

eare

st n

eigh

bors

142 descriptors20 descriptors

Number of correct R-NN’s detected

versus radius, for the Kazius dataset

Accuracy was obtained withrespect to a linear scan

Over all datasets studied,accuracy was never lowerthan 94%

By increasing the probabilityparameter we can ensurehigher accuracy (at the costof time efficiency)

Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T., J. Chem. Inf. Model., 2006, 46, 321–333

Page 24: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis - Dataset Sparsity

The R-NN Curve Algorithm

Dmax ← max. pairwise distanceR ← 0.01× Dmax

while R ≤ Dmax doFind NN’s within radius Rincrement R

end while

Depends on the descriptor space

NN counts vs. radius plots arediscriminatory

0 20 40 60 80 100

020

040

060

080

0

Percentage of Maximum Pairwise Distance

Num

ber o

f R−N

eare

st N

eigh

bors

The R-NN curve for anobservation from a dense region

of the descriptor space

Guha, R., Dutta, D., Jurs, P.C. and Chen, T., submitted

Page 25: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis - Dataset Sparsity

The R-NN Curve Algorithm

Dmax ← max. pairwise distanceR ← 0.01× Dmax

while R ≤ Dmax doFind NN’s within radius Rincrement R

end while

Depends on the descriptor space

NN counts vs. radius plots arediscriminatory

0 20 40 60 80 100

020

040

060

080

0

Percentage of Maximum Pairwise Distance

Num

ber o

f R−N

eare

st N

eigh

bors

The R-NN curve for anobservation from a dense region

of the descriptor space

Guha, R., Dutta, D., Jurs, P.C. and Chen, T., submitted

Page 26: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis - Dataset Sparsity

The R-NN Curve Algorithm

Dmax ← max. pairwise distanceR ← 0.01× Dmax

while R ≤ Dmax doFind NN’s within radius Rincrement R

end while

Depends on the descriptor space

NN counts vs. radius plots arediscriminatory

0 20 40 60 80 100

020

040

060

080

0

Percentage of Maximum Pairwise Distance

Num

ber o

f R−N

eare

st N

eigh

bors

The R-NN curve for anobservation from a sparse region

of the descriptor space

Guha, R., Dutta, D., Jurs, P.C. and Chen, T., submitted

Page 27: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis

2170 (4281) 98 (4281)

4224 (4279) 3679 (4279)

Kazius dataset

Representativecompounds from thedense region

R = 0.1% of Dmax

Page 28: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis - Outliers

3904 1861

Page 29: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Using LSH for Diversity Analysis - Comparison

−5e+05 −4e+05 −3e+05 −2e+05 −1e+05 0e+00

−600

0−4

000

−200

00

2000

Principal Component 1

Prin

cipal

Com

pone

nt 2

163847

1861

2597

2954

3224

3904

Whole descriptor pool

−5e+05 −4e+05 −3e+05 −2e+05 −1e+05 0e+00

−100

010

020

0

Principal Component 1

Prin

cipal

Com

pone

nt 2

163

1634

1861

3224

3904

4049

4097

4139

Reduced descriptor pool

Page 30: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Future Work

Is data reduction really necessary?

How will scaling affect the performance and accuracy of theLSH algorithm?

Quantification of sparsity for descriptor spaces as a whole

Use nearest neighbor modeling as a component in a screeningprotocol

Approximate substructure searches

Page 31: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Summary

Conclusions

In practice LSH outperforms traditional kNN by up to 3orders of magnitude

Though the LSH is approximate by design, it exhibits highaccuracy

Can be used for a variety of preprocessing and analytical tasks

Acknowledgements

Prof. Pyotr Indyk

A. Andoni

Page 32: Scalable Partitioning & Exploration of Chemical Spaces ... · Very fast, approximate nearest neighbor detection algorithm Why is this significant? Fast Theoretically sublinear Avoids

Rapid Partitioning of Chemical SpacesChemical Data MiningLocality Sensitive HashingResults

Summary

Conclusions

In practice LSH outperforms traditional kNN by up to 3orders of magnitude

Though the LSH is approximate by design, it exhibits highaccuracy

Can be used for a variety of preprocessing and analytical tasks

Acknowledgements

Prof. Pyotr Indyk

A. Andoni