Troubles with Chemoinformatics and Associative Neural Networks
Igor V. Tetko
GSF -- Institute for Bioinformatics (MIPS), Neuherberg, Germany and Institute of Bioorganic & Petrochemistry, Kyiv, Ukraine
Fraunhofer FIRST, Berlin, 15 November 2006


Page 1

Troubles with Chemoinformatics and Associative Neural Networks

Igor V. Tetko
GSF -- Institute for Bioinformatics (MIPS), Neuherberg, Germany and Institute of Bioorganic & Petrochemistry, Kyiv, Ukraine

Fraunhofer FIRST, Berlin

15 November 2006

Page 2

Layout of presentation

Problems with chemoinformatics models developed by Academia

• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data

Page 3

Challenges for development of Chemoinformatics models

Public datasets
• Limited data availability; data are expensive
• Data are noisy, inconsistent between laboratories (not all conditions controlled/specified)
• No new measurements can be easily made
• Data are heterogeneous

In-house datasets
• Data are consistent (at least within each company site)
• Data are homogeneous and may correspond to just a few series of interest
• New data can be easily and incrementally measured
• Accuracy of data may vary depending on the experimental protocol used
• Not available for public development

Page 4

A "public" model can fail due to different chemical diversity in the training and test sets

[Figure legend: new data (in-house compounds) to be estimated; training set data used to develop a model; our model given the training set; the correct model.]

Page 5

"One can not embrace the unembraceable.”

Possible: 1060 - 10100 molecules theoretically exist( > 1080 atoms in the Universe)

Achievable: 1020 - 1024 can be synthesized now(weight of the Moon is ca 1023 kg)

Available: 2*107 molecules are on the market

Measured: 102 - 104 molecules with ADME/T data

Problem: To predict ADME/T properties of justmolecules on the market we must extrapolate datafrom one to 1,000 - 100,000 molecules!

Kozma Prutkov

N O

O OH

molecules on the market

We need methods whichWe need methods whichcan estimate the accuracy can estimate the accuracy of predictionsof predictions!!

ADME/T data

Page 6

Troubles with models

• Model developed using "public data"
– May have low prediction accuracy due to limited chemical diversity of both sets (problem of accuracy)
• Model developed using "public" + "in house" data
– Difficult to receive "in house" data
– "Public data" may not be really public (problem of IP)
– May require significant computing time/expertise of the end user
– Public/in-house data can be incompatible
• Model developed using "public" data and then "adjusted" for molecules from "in house" data
– Associative Neural Network

Page 7

Layout of presentation

• Problems with chemoinformatics models developed by Academia

Introduction to the Associative Neural Network

• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data

Page 8

Associative Neural Network (ASNN)

Some software tools rely on just one "best" model. Other software tools rely on the ensemble average (a "panel of experts"). The ASNN exploits the disagreement of the individual models in the ensemble to improve its accuracy and to derive a confidence score.

Page 9

Bias-variance decomposition

Variance: can be decreased using a large ensemble of models.

Bias: can be partially decreased using more flexible models (a large number of hidden neurons), early stopping, the AdaBoost algorithm, etc. (indirect methods).

Question: can we directly estimate (and correct) the bias of the ensemble?

Answer: yes, if we can correctly detect the nearest neighbors of the target data case in "functional space"!

PE(f_i) = (f_i - Y)^2 = (f* - Y)^2 + (f* - f̄)^2 + (f_i - f̄)^2

where f_i is the prediction of the i-th model, Y the experimental value, f* the ideal function, and f̄ the ensemble average.
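The exact part of this decomposition is easy to verify numerically: for a single data case, the error of the individual models, averaged over the ensemble, splits exactly into the squared bias of the ensemble average plus the variance across the models (a minimal sketch with made-up numbers, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.0                       # experimental value for one data case
f = rng.normal(0.8, 0.3, 64)  # predictions of a 64-model ensemble
f_bar = f.mean()              # ensemble average

mse = np.mean((f - y) ** 2)      # average error of the individual models
bias2 = (f_bar - y) ** 2         # squared bias of the ensemble average
var = np.mean((f - f_bar) ** 2)  # variance across the ensemble

# averaging over models, the error splits exactly into bias^2 + variance
assert np.isclose(mse, bias2 + var)
```

A large ensemble shrinks only the variance term; the bias term is what the ASNN correction targets.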

Page 10

Sources of bias: underfitting

Sine function approximation by neural networks with one and two hidden neurons (x = x1 + x2)

Page 11

Sources of bias: incomplete data (extrapolation, applicability domain)

Gauss function extrapolation (x = x1 + x2)

Page 12

Correction of model bias by the error of the nearest neighbors of the target point

[Figure: the ensemble average f̄ is shifted toward the ideal function f* using the errors of the nearest neighbors.]

Page 13

Representation of molecules (data cases) for machine learning

• Can be defined with calculated properties (logP, quantum-chemical parameters, etc.)
• Can be defined with a set of structural descriptors (topological 2D, 3D, etc.)
• In general: any set of descriptors

[Figure: two molecules, each encoded as a vector of descriptor values, e.g. (12.3, 4.6, ..., 13.2, 10.1) and (13.7, 4.8, ..., 15.8, 12.0).]

Page 14

Nearest neighbors in the input space (space of descriptors)

[Figure: data points plotted in the (x1, x2) descriptor space; x = x1 + x2.]

Page 15

Nearest neighbors and activity

[Figure: y = exp(-(x1+x2)^2) plotted against x = x1 + x2 (left), and the same points in the (x1, x2) descriptor space (right).]

The nearest neighbors in the descriptor space are not necessarily neighbors in the property space!
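A one-descriptor sketch of this effect (illustrative numbers only): for the Gauss function, the point closest to a query in descriptor space need not be the one closest in property value.

```python
import numpy as np

# toy "molecules" with a single descriptor x; property y = exp(-x^2)
x = np.array([0.0, 0.5, 1.0, 1.5, 3.0])
y = np.exp(-x ** 2)

q = 1.5                               # query point (present in the set)
d_x = np.abs(x - q)                   # distances in descriptor space
d_y = np.abs(y - np.exp(-q ** 2))     # distances in property space

nn_x = x[np.argsort(d_x)[1]]          # nearest non-identical neighbor in x: 1.0
nn_y = x[np.argsort(d_y)[1]]          # nearest non-identical neighbor in y: 3.0
print(nn_x, nn_y)                     # the two notions of "nearest" disagree
```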

Page 16

Nearest neighbors and activity

[Figure: panels A and B show y against x = x1 + x2; panel C shows the same points in the (x1, x2) descriptor space.]

The nearest neighbors in property are not neighbors in descriptor space!

Page 17

A property-based similarity

[Figure: each molecule's descriptor vector, e.g. (12.3, 4.6, ..., 13.2, 10.1), is passed through the ensemble (net 1, net 2, ..., net 63, net 64) to give a vector of 64 predictions.]

Morphinan-3-ol, 17-methyl- (logP = 3.11) and Levallorphan (logP = 3.48): both molecules are the nearest neighbors, r2 = 0.47, in the space of residuals amid > 12,000 molecules!

The property-based similarity* is defined as the correlation of ensemble residuals.

*Tetko, I.V.; Villa, A.E.P. Neural Networks, 1997, 10, 1361-1374.
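The property-based similarity can be sketched as the Pearson correlation of the two molecules' ensemble residuals; the eight-network vectors below are invented purely to show the computation (a real ALOGPS ensemble has 64 networks):

```python
import numpy as np

# ensemble predictions for two molecules and their experimental values;
# all numbers are made up for illustration
pred_a = np.array([3.0, 3.3, 2.9, 3.5, 3.1, 3.2, 2.8, 3.4])
pred_b = np.array([3.5, 3.6, 3.2, 3.9, 3.4, 3.7, 3.0, 3.8])
y_a, y_b = 3.11, 3.48

# residuals of each network on each molecule
res_a = pred_a - y_a
res_b = pred_b - y_b

# property-based similarity = Pearson correlation of ensemble residuals
r = np.corrcoef(res_a, res_b)[0, 1]
print(round(r, 2))
```

Note that subtracting the (constant) experimental value does not change the correlation, so two molecules are similar in this sense when the networks of the ensemble over- and under-shoot them in the same pattern.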

Page 18

Nearest neighbors for the Gauss function

[Figure: panels A and B show y against x = x1 + x2; panel C shows the same points in the (x1, x2) descriptor space.]

Detection of nearest neighbors in the space of models uses invariants in "structure-property" space. All nearest neighbors are detected correctly using similarity in the property-based space!

Page 19

Associative Neural Network (ASNN)

A prediction of case i: z_i = ANNE(x_i) = [z_i^1, ..., z_i^M], the vector of the M model predictions.

Traditional ensemble: z̄_i = (1/M) Σ_{k=1..M} z_i^k

ASNN bias correction: z'_i = z̄_i + (1/k) Σ_{j in N_k(i)} (y_j - z̄_j)

The neighbors N_k(i) are detected using Pearson's (or Spearman's) correlation coefficient r_ij = R(z_i, z_j) > 0.

The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbor cases of the analyzed case x_i, detected in the space of neural network models (the neural network associations of the given model).
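The ensemble average and the kNN bias correction can be sketched in a few lines of NumPy (a hypothetical `asnn_predict` helper, not the author's implementation; the r_ij > 0 neighbor criterion is simplified to taking the k most correlated training cases):

```python
import numpy as np

def asnn_predict(z_train, y_train, z_query, k=2):
    """Sketch of the ASNN correction (assumed interface, not the original code).

    z_train : (n_cases, n_models) ensemble predictions on the training set
    y_train : (n_cases,) experimental values
    z_query : (n_models,) ensemble predictions for the query case
    """
    z_bar = z_train.mean(axis=1)             # traditional ensemble averages
    # property-based similarity: correlation of the query's model outputs
    # with each training case's model outputs
    r = np.array([np.corrcoef(z_query, z)[0, 1] for z in z_train])
    nn = np.argsort(-r)[:k]                  # k most correlated neighbors
    bias = np.mean(y_train[nn] - z_bar[nn])  # average neighbor error (bias)
    return z_query.mean() + bias             # corrected ensemble prediction
```

In LIBRARY mode the same function applies unchanged: appending freshly measured cases to `z_train`/`y_train` enlarges the memory without retraining any network.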

Page 20

Illustrative example

M = 10 (ten models in the ensemble); N = 3, three cases {(x1,y1), (x2,y2), (x3,y3)} in the training set.

Ensemble predictions z_i = ANNE(x_i) for the training cases (ten values each):
z1 = (0.6, 0.5, 0.4, 0.5, 0.7, 0.6, 0.5, 0.4, 0.3, 0.5), ensemble average z̄1 = 0.50
z2 = (0.4, 0.5, 0.7, 0.8, 0.9, 0.8, 0.9, 0.7, 0.8, 0.9), ensemble average z̄2 = 0.74
z3 = (0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.1, 0.3, 0.1, 0.2), ensemble average z̄3 = 0.19
Experimental values: y1 = 0.4, y2 = 0.6, y3 = 0.17.

Test case: z_t = ANNE(x_t), ensemble prediction z̄_t = 0.62.

Correlations to the training cases: r(x_t, x1) = 0.55, r(x_t, x2) = 0.42, r(x_t, x3) = 0.16.

ASNN result (k = 2 nearest neighbors, cases 1 and 2):
z'_t = z̄_t + (1/2)[(y1 - z̄1) + (y2 - z̄2)] = 0.62 + (1/2)[(0.4 - 0.5) + (0.6 - 0.74)] = 0.62 - 0.12 = 0.50
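The ASNN correction for the test case reduces to simple arithmetic and can be re-checked directly (two nearest neighbors, ensemble average 0.62):

```python
z_bar_t = 0.62                       # ensemble average for the test case
errors = [0.4 - 0.5, 0.6 - 0.74]     # y_j - z_bar_j for the two neighbors
z_t = z_bar_t + sum(errors) / len(errors)
print(round(z_t, 2))                 # 0.5
```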

Page 21

Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network

Properties of the ASNN: bias estimation and correction

• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data

Page 22

ASNN sine function approximation (correction of the underfitting bias)

[Figure: two panels over x from 0 to 3π/4. A) ANN trained with two (black) and one (grey) hidden neurons. B) ASNN results using one hidden neuron and the same networks as in A).]

Page 23

Classification of UCI data sets (test-set error rates)

dataset     training set   test set   MLP architecture(1)   Boosted results(1)   ANN 50 CC(2)   ASNN(2)
Letter      16000          4000       16-70-50-26           1.5%                 4.1%           1.8%
Satellite   4435           2000       35-30-15-6            8.1%                 8.3%           7.8%

1 - Schwenk and Bengio, Neural Computation, 12, 2000, 1867-1887.
2 - Tetko, Neural Processing Letters, 2002, 16, 187-199.

Page 24

Gauss function extrapolation (correction of extrapolation bias with "fresh data")

[Figure: four panels (A-D) showing the Gauss function, y against x = x1 + x2 over the range -3..3, before and after the LIBRARY correction. Notice: x = x1 + x2.]

LIBRARY mode: new data are used for the kNN correction (enlargement of the ASNN memory) without rebuilding the neural network model!

Advantages: fast, no weight retraining; the correction is not limited by the range of values in the training set.
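The last point can be illustrated with one freshly measured "library" case (made-up numbers): the ensemble never produced values above 1.0, yet the kNN correction moves the prediction past that range without touching any weights.

```python
import numpy as np

# ensemble outputs for a query and for one freshly measured library case
z_query = np.array([0.95, 0.90, 0.97, 0.94])  # ensemble saturates near 1.0
z_new   = np.array([0.96, 0.91, 0.98, 0.95])  # highly correlated neighbor
y_new   = 1.35                                # measured value outside the
                                              # old training range

bias = y_new - z_new.mean()                   # error of the library neighbor
corrected = z_query.mean() + bias             # kNN correction, no retraining
print(round(corrected, 2))                    # beyond the old 0..1 range
```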

Page 25

ALOGPS 2.1

• LogP: 75 input variables corresponding to electronic and topological properties of atoms (E-state indices), 12908 molecules in the database (PHYSPROP), 64 neural networks in the ensemble. Calculated results: RMSE = 0.35, MAE = 0.26, n = 76 outliers (> 1.5 log units).
• LogS: 33 input E-state indices, 1291 molecules in the database, 64 neural networks in the ensemble. Calculated results: RMSE = 0.49, MAE = 0.35, n = 18 outliers (> 1.5 log units).

• Tetko, Tanchuk & Villa, JCICS, 2001, 41, 1407-1421.
• Tetko, Tanchuk, Kasheva & Villa, JCICS, 2001, 41, 1488-1493.
• Tetko & Tanchuk, JCICS, 2002, 42, 1136-1145.

Available free at the http://www.vcclab.org site.

Page 26

Nearest neighbors in different spaces

The same 74 E-state descriptors were used.

GSE of S. Yalkowsky: logS = 0.5 - 0.01(MP - 25) - logP
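The general solubility equation quoted above is directly computable (MP is the melting point in °C; the example compound is hypothetical):

```python
def gse_logs(mp_celsius, logp):
    """Yalkowsky's General Solubility Equation, as quoted on the slide."""
    return 0.5 - 0.01 * (mp_celsius - 25) - logp

# e.g. a hypothetical compound melting at 125 C with logP = 2.0
print(gse_logs(125, 2.0))  # -2.5
```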

Page 27

Accuracy of predictors developed using the whole set (ALOGPS) and the "star" set only

[Data sets: PHYSPROP, 12 908 molecules = "star" set (CLOGP, 9429; of which XLOGP, 1873) + "nova" set.]

                            training        ANN, "nova set" prediction    ASNN, "nova set" used as LIBRARY, LOO
network                     set, N          RMSE    MAE    outliers       RMSE    MAE    outliers
ALOGPS                      12908           0.49    0.38   68             0.43    0.32   50
ANN trained on XLOGP set    1853            0.65    0.52   647            0.47    0.36   141
ANN trained on "star" set   9429            0.59    0.47   480            0.46    0.35   98

Tetko & Tanchuk, JCICS, 2002, 42, 1136-1145.

Notice: in "LIBRARY" mode the new data (nova set) were used for the kNN corrections without rebuilding the neural network models.

Page 28

Prediction of the AstraZeneca logP set (n = 2569)

ACDlogP (v. 7.0): MAE = 0.86, RMSE = 1.20
CLOGP (v. 4.71): MAE = 0.71, RMSE = 1.07
ALOGPS BLIND: MAE = 0.60, RMSE = 0.84
ALOGPS LIBRARY: MAE = 0.45, RMSE = 0.68

Tetko & Bruneau, J. Pharm. Sci., 2004, 93, 3103-3110.

Page 29

Can we predict some other similar properties with ALOGPS, e.g. logD?

• logP is defined for neutral compounds
• logD is defined for charged compounds

logD(pH) = logP - log(1 + 10^((pH - pKa)·δi)), where δi = {1, -1} for acids and bases, respectively.

logD is more difficult to predict:
a) it can accumulate errors from both the logP and pKa predictions;
b) there is no good compilation of logD data measured for a diverse collection of compounds (however, a lot of data is available commercially).
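The logD relation above is a one-liner; the compound values in the example are hypothetical:

```python
import math

def log_d(log_p, pka, ph, acid=True):
    """logD from logP and pKa as on the slide; delta = +1 for acids, -1 for bases."""
    delta = 1 if acid else -1
    return log_p - math.log10(1 + 10 ** ((ph - pka) * delta))

# hypothetical acid: logP = 2.0, pKa = 4.0; at pH 7.4 it is mostly ionized,
# so logD drops far below logP
print(round(log_d(2.0, 4.0, 7.4, acid=True), 2))
```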

Page 30

ASNN: prediction of data that are inconsistent with the training set

[Figure: training "LogP" data and "LogD" data plotted over x from 0 to 3.]

"logP" set -- 100 points; "logD" set -- 8 points; x = x1 + x2

Page 31

LIBRARY mode (no retraining)

[Figure: training "LogP" data, "LogD" data, and the Library "LogD" correction plotted over x from 0 to 3.]

N.B.! Training using both "old" and "new" data will not work!!!

Page 32

Library mode vs training using "logD" data

Page 33

"Self-learning" of the Pfizer logD data using the logP model (LIBRARY)

Pallas PrologD: MAE = 1.06, RMSE = 1.41
ACDlogD (v. 7.19): MAE = 0.97, RMSE = 1.32
ALOGPS: MAE = 0.92, RMSE = 1.17
ALOGPS LIBRARY: MAE = 0.43, RMSE = 0.64

Tetko & Poda, J. Med. Chem., 2004, 47, 5601-5604.

Page 34

Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction

Properties of the ASNN: applicability domain of models

• Properties of the ASNN: secure sharing of data

Page 35

PHYSPROP data set

[Figure: nested sets: PHYSPROP, 12 908 molecules, containing the "star" set (CLOGP, 9429; of which XLOGP, 1873) and the "nova" set.]

Total: 12908. Training: the "star" set; prediction: the 3479 "nova" molecules.

Page 36

Mean Average Error (MAE) as a function of the number of non-hydrogen atoms

Methods trained using the "star" set provided a low prediction ability for the "nova" set because of the different chemical diversity of the two sets.

x - training set performance
O - test set performance

Page 37

Estimation of the model accuracy by the nearest neighbors

Calibration: for each query molecule we find the most similar molecule (max correlation in the property-based space: "property-based similarity") in the training set. We plot the prediction accuracy of the query molecules as a function of their property-based similarities (the calibration curve).

Usage: for each test molecule we again find the most similar molecule and estimate its prediction accuracy from the calibrated value at the actual value of the "property-based similarity".
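The calibration/usage procedure can be sketched with a simple binning helper (hypothetical function names; the slide specifies the procedure, not an implementation):

```python
import numpy as np

def calibrate(similarity, abs_error, bins=5):
    """Bin prediction errors by property-based similarity (calibration curve)."""
    edges = np.linspace(similarity.min(), similarity.max(), bins + 1)
    idx = np.clip(np.digitize(similarity, edges) - 1, 0, bins - 1)
    mae_per_bin = np.array([abs_error[idx == b].mean() for b in range(bins)])
    return edges, mae_per_bin

def estimate_accuracy(sim, edges, mae_per_bin):
    """Look up the expected error of a test molecule from its similarity."""
    b = np.clip(np.digitize(sim, edges) - 1, 0, len(mae_per_bin) - 1)
    return mae_per_bin[b]
```

`estimate_accuracy` then returns, for any test molecule, the mean absolute error observed on calibration molecules at the same level of property-based similarity.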

Page 38

Prediction performance as a function of the "property-based" similarity

Blind prediction: "property-based similarity" = max correlation coefficient of a test molecule to the "star" set. MAE = 0.43.

LIBRARY mode: "property-based similarity" = max correlation coefficient of a test molecule to the "star" + "nova" sets. MAE = 0.28 (0.26).

Page 39

Lipophilicity (logP) prediction of pharma companies

[Figure: logP (calc. - experimental) values and the theoretically estimated accuracy plotted against the Associative Neural Network similarity (0 to 1), with the accuracy for 7,498 Pfizer molecules and 8,750 AstraZeneca molecules, the experimental errors, and the distribution of molecules; 50% of the data lie in the high-similarity region.]

Tetko, I.V. et al., Drug Discovery Today, 2006.

• We can reliably estimate which compounds can/can't be reliably predicted. The ASNN can save the costs of measurements for up to 50% of molecules.

Page 40

Estimated and calculated error (MAE) for the AstraZeneca (AZ), Pfizer (PFE) and iResearch Library sets (> 1.5x10^7 molecules)

Page 41

Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models

Properties of the ASNN: secure sharing of data

Page 42

Secure sharing of information but not molecules

• Symposium organized by T. Oprea at the 229th ACS meeting, San Diego
• Two dedicated sessions (CINF, COMP), ca. 20 participants
• Too-secure sharing makes model development impossible (the relevant information is lost)
• Less than 1 bit/atom is required to store molecules in a "zip" file (1 float value for a molecule with 35 atoms)
• Thus, any proposed method can be secure ... but only until it is "hacked"
• Sharing molecular descriptors of a target molecule is a difficult business
• But ... let us share reliably predicted molecules!
• These are the molecules with a significantly high R in property space to the target molecule

Page 43

Number of molecules as a function of the property-based similarity

The average number of molecules in the respective databases with a higher or equal correlation coefficient to a molecule in the PHYSPROP database.

Page 44

Real and surrogate molecules

Tetko, Abagyan, Oprea, J. Comp. Aid. Mol. Des., 2005, 19, 749.

[Figure: chemical structures arranged in three columns: PHYSPROP molecule, 100th similar molecule, 1000th similar molecule.]

Page 45

Models for logP prediction developed with real & surrogate data

Dataset sizes: real = surrogate = 1949 molecules

• Take a "real" molecule from the PHYSPROP logP dataset
• Find for it the 100th (1000th) significantly correlated molecule, r2 > 0.3, in the iResearch Library (using additional filters to remove structurally similar ones)
• Name it a "surrogate" molecule and calculate for it a logP value --> "surrogate data"
• Use "real" molecules with real logP values and the "surrogate data" to develop models
• Predict all 12908 PHYSPROP molecules using both models

N.B.! This is property-specific data sharing!!!

Page 46

Conclusions

• Use of the ASNN approach allows one to:
– Estimate the bias of the model
– Correct the bias of the model
– Improve the accuracy of predictions
– Instantly update models with new data without retraining the global (neural network) model
• Drug discovery
• Control systems (movement)
– Estimate the applicability domain and the accuracy of prediction of models
• novelty detection
• detection of outlying data points
• detection of non-stationary measurements
• experimental design
• secure data sharing

Page 47

Neuro-physiological roots

• Spatio-temporal coding of information processing
• A lot of correlations in the brain; fast reactions (100s of ms for a visual stimulus)
• A signal with maximum correlation to the stimulus from the long-term memory will be reinforced
• Some part of the brain provides modulation (an increase of excitability) of the regions that will be used to search for the prototypes, thus switching the "property-based similarity" depending on the context of the query and/or the desired action
• Explains the presence of two levels of behavior:
– Long-term skills (fast reaction, parallel processing of information -- procedural memory)
• Genetically programmed skills
• Skills developed on the level of the vegetative nervous system
– Short-term skills (sequential processing of information -- declarative memory)
• Skills developed by learning from a few examples, cognition, observation of similar situations
• Used to correct the long-term skills
• Following training, declarative memory changes into procedural memory

Page 48

Literature

1. Tetko, I. V.; Villa, A. E. P. In Unsupervised and Supervised Learning: Cooperation toward a Common Goal, ICANN'95, International Conference on Artificial Neural Networks NEURONIMES'95, Paris, France, 1995; F., F.-S., Ed.; EC2 & Cie: Paris, France, 1995; pp 105-110.
2. Tetko, I. V.; Villa, A. E. P. Efficient partition of learning data sets for neural network training. Neural Networks 1997, 10 (8), 1361-1374.
3. Tetko, I. V. Associative Neural Network, CogPrints Archive, cog00001441, 2001.
4. Tetko, I. V. Associative neural network. Neural Processing Letters 2002, 16 (2), 187-199.
5. Tetko, I. V. Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci 2002, 42 (3), 717-728.
6. Tetko, I. V.; Tanchuk, V. Y. Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. J Chem Inf Comput Sci 2002, 42 (5), 1136-1145.
7. Tetko, I. V.; Poda, G. I. Application of ALOGPS 2.1 to predict log D distribution coefficient for Pfizer proprietary compounds. J Med Chem 2004, 47 (23), 5601-5604.
8. Tetko, I. V.; Bruneau, P. Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca in-house database. J Pharm Sci 2004, 93 (12), 3103-3110.
9. Tetko, I. V.; Bruneau, P.; Mewes, H. W.; Rohrer, D. C.; Poda, G. I. Can we estimate the accuracy of ADME-Tox predictions? Drug Discov Today 2006, 11 (15-16), 700-707.

Page 49

Acknowledgement

Part of this work was done thanks to the Virtual Computational Chemistry Laboratory, INTAS-INFO 00-0363 project.

I thank Pierre Bruneau (AstraZeneca), Gennadiy Poda (Pfizer), Douglas Rohrer (Pfizer), Hans-Werner Mewes (IBI, GSF), Ruben Abagyan (Scripps Inst., USA) and Tudor Oprea (New Mexico, USA) for collaboration in this work, and Dr. Scott Hutton (USA) for providing compounds from the iResearch Library (ChemNavigator).

Thank you for your attention!