2/44

OUTLINE

The need for developing validated models: OECD programme

NIH's Molecular Libraries Initiative and PubChem

Possible insufficiency of HTS and -omics data for predictive toxicology

Predictive QSAR Modeling Workflow

Examples of the Workflow applications:

- Ames genotoxicity
- HTS/NTP dataset
- Carcinogenicity modeling (HTS/NTP/CPDB and CPDB)

Discussion


3/44

EU WHITE PAPER on the Strategy for a Future Chemicals Policy (2001)*

Art. 3.2: ...to keep animal testing to a minimum

in the interest of time- and cost-effectiveness

particular research efforts are needed for development and validation of modelling (e.g. QSAR) and screening methods for assessing the potential adverse effects of chemicals

Regulatory acceptance of QSAR models: Setubal Principles

Workshop ICCA/CEFIC (2002)

*Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)


4/44

From Setubal to OECD Principles

To facilitate the consideration of a QSAR model for regulatory purposes, it should be associated with the following information:

- be associated with a defined endpoint;
- take the form of an unambiguous and easily applicable algorithm;
- ideally, have a mechanistic basis;
- be accompanied by a definition of domain of applicability;
- be associated with a measure of goodness-of-fit (internal validation);
- be assessed in terms of its predictive power by using data not used in the development of the model (external validation).

*Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)


5/44


6/44

NIH's Molecular Libraries Initiative in numbers

NIH Roadmap Initiative: Molecular Libraries Initiative

- MLSCN (9+1): 9 centers, 1 NIH intramural; 20 x 10 = 200 assays
- 4 Chemical Synthesis Centers
- ECCR (6): Exploratory Centers
- CombiChem, Parallel synthesis, DOS: 4 centers + DPI; 100K-1M compounds
- Predictive ADMET (10)
- PubChem (NLM): 1M compounds, 200 assays, SAR matrix


7/44

Recent MLSCN Screening Results in PubChem

Screen                         Center   # of Actives   # Screened   Fraction Active
IkB Signalling                 NCGC     37             69826        0.00053
Cell Viability (BJ)            NCGC     52             1408         0.036932
Cell Viability (Jurkat)        NCGC     142            1408         0.100852
(Hek293)                       -        -              -            0.056818
(HepG2)                        -        -              -            0.037642
(MRC5)                         -        -              -            0.036222
(SK-N-SH)                      -        -              -            0.065341
sOGT                           NCGC     0              70158        0
Prx2                           NCGC     0              65535        0
M4                             VUMLSC   72             12369        0.005821
Thallium flux through GIRK     VUMLSC   49             8536         0.00574
Cell Viability (HPDE-C7)       SDCCG    194            9984         0.019431
Cell Viability (HPDE-C7K)      SDCCG    215            9984         0.021534
A549 Cell Growth               SRMLSC   278            3317         0.083811
Pantothenate Synthetase        SRMLSC   2              10011        0.0002
FPR                            NMMLSC   51             9993         0.005104
FPRL1                          NMMLSC   23             9993         0.002302
MKP-1                          PMLSC    100            65239        0.001533

The last column is headed "% Active" on the slide, but the values are fractions (# of Actives / # Screened). For the rows marked "-", only the cell line and the active fraction survive in this copy.


8/44

Subset of PubChem relevant to this presentation:

NTP-HTS Content Summary of 1408 Compounds

Chemical Structure Types:
- Organic: 1,348
- Inorganic: 27
- Organometallic: 19
- No structure: 14

1,348 organic compounds contain:
- Unique: 1,279
- Complex: 51
- Salt: 20
- Duplicates: 53

Curated subset: 1,289 unique organic compounds


9/44

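HTS Screening Data (NCGC) for 1,289 NTP Compounds

                BJ      Jurkat  Hek293  HepG2   SK-N-SH
Actives         42      121     63      41      74
Inactives       1,203   1,079   1,147   1,201   1,161
Inconclusives   44      89      79      47      54

MRC5: 1,208 inactives; the two remaining counts, 44 and 37, are the actives and inconclusives, but this copy does not preserve which is which.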


10/44

Data interpretation: How was the Activity Classified?

[Figure: classification (active, inactive, inconclusive) plotted against Log(EC50), from -9 to -1.]

The relationship between IC50 and classification of compounds in Jurkat cell line test.


11/44

Summary of the experimental data for 1,289 compounds

- 141 compounds are active in at least one test.
- 230 compounds are active or inconclusive in at least one test.
- 1,059 compounds are inactive in all 6 tests.


12/44

Additional biological data on 1,289 NTP/HTS compounds*

Database    Compounds
NTP-HTS     1,289
NTPBSI      1,153
NTPGTZ      1,053
HPVCSI      423
CPDB        383*
IRISSI      181

NTPBSI: National Toxicology Program Chemical Structure Index file

NTPGTZ: National Toxicology Program genotoxicity

HPVCSI: High Production Volume Chemicals

CPDB: Carcinogenic Potency Data Base All Species

IRISSI: EPA Integrated Risk Information System

*15 of 383 compounds in CPDB database are technique class.

*Based on the DSSTox project of Dr. Ann Richard at EPA.


13/44

Overview of carcinogenic responses of the 383 compounds in rats and mice

- 229 compounds show positive response in at least one organ of one or more species.
- 92 compounds show negative results in all tests.
- 62 compounds show negative response in all tests, but the tests are not complete.


14/44

Are HTS results indicative of carcinogenicity?

Results based on the IRIS database (EPA 1986, 1996, 1999 carcinogen risk assessment)

                          HTS-Actives   HTS-Inactives   HTS-Inconclusives
Human Carcinogens         5             47              5
Non Human Carcinogens     1             33              2

93 compounds were tested in HTS: 57 of them are, or are likely to be, human carcinogens, and 36 of them are not human carcinogens.


15/44

Can the explicit use of chemical structure help

with the end point prediction: QSPR Modeling

Chemistry Biology

(IC50, Kd...)

Cheminformatics(Molecular Descriptors)

Comp.1 Value1 D1 D2 D3 D4

Comp.2 Value2 " " " "

Comp.3 Value3 " " " "

Comp.N ValueN " " " "

- - - - - - - - - - - - - -

Goal: Establish correlations between descriptors and the target

property capable of predicting activities of novel compounds

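BA (e.g., IC50) = F(D)

$$q^2 = 1 - \frac{\sum_i (y_i - \tilde{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the actual activity, $\tilde{y}_i$ the activity predicted in cross-validation, and $\bar{y}$ the mean actual activity.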
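As an illustration of the cross-validated q2 used on this slide, the following is a minimal sketch in Python. It assumes a leave-one-out scheme and a one-descriptor straight-line model; the function names (`q2`, `fit_line`, `loo_predictions`) and the toy data are hypothetical and are not taken from the slides.

```python
def q2(observed, predicted_cv):
    """q2 = 1 - PRESS / SS_tot, using cross-validated predictions."""
    mean_obs = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, predicted_cv))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / ss_tot


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_predictions(xs, ys):
    """Predict each compound from a model fitted without that compound."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Toy data (made up): one descriptor D and an activity value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
print(round(q2(activity, loo_predictions(descriptor, activity)), 3))
```

A q2 of 1 means the left-out compounds are predicted perfectly; a q2 of 0 means the model does no better than predicting the mean activity.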


16/44

Typical QSPR modeling result: Comparison between Actual and Predicted Activity

[Figure: predicted LogED50 vs. actual LogED50 (ED50 = mM/kg) for the training set, with linear fit.]

q2 = 0.79 makes everyone happy


17/44

An example of mechanistic model

of mutagenicity

Possible remedies (per authors):

- retesting some of the compounds;
- testing further new compounds;
- checking (if necessary) the use of additional chemical descriptors.

Benigni et al., Env. Mol. Mutagenesis 46:268-280 (2005)


18/44

The unbearable lightness of modeling (in this case, CoMFA) leads to unacceptable prediction accuracy.

[Figure: predicted LogED50 vs. actual LogED50 (ED50 = mM/kg) for the training set, with linear fit.]

EXTERNAL TEST SET PREDICTIONS

[Figure: two plots of predicted vs. observed values for the external test set. Linear fits: y = 0.5958x + 2.3074, R2 = 0.2135; y = 0.4694x + 2.9313, R2 = 0.1181.]


19/44

BEWARE OF q2!!! (Golbraikh & Tropsha, J. Mol. Graphics Mod. 2002, 20, 269-276.)

[Figure: R2 plotted against q2 (q2 from 0.5 to 1), with regions labeled "Poor Models" and "Good Models".]

[Figure: histogram of R2 values (frequency vs. R2, with R2 from -0.7 to 1).]

20/44

COMPONENTS OF THE PREDICTIVE QSPR MODELING WORKFLOW*

- Model Building: Combination of various descriptor sets and variable selection data modeling methods (Combi-QSAR)
- Model Validation
- Y-randomization
- Training and test set selection
- Applicability domain
- Evaluation of external predictive power
- Virtual screening

*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest: Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.


21/44

COMBINATORIAL QSAR

SAR Dataset

Compound representation:
- CoMFA descriptors
- Molconn Z descriptors
- Chirality descriptors
- Volsurf descriptors
- Comma descriptors
- MOE descriptors
- Dragon descriptors

QSAR modeling:
- kNN
- Binary QSAR
- SVM
- Decision tree

Selection of best models

Model validation using external test set and Y-Randomization

Lima, P., Golbraikh, A., Oloff, S., Xiao, Y., Tropsha, A. Combinatorial QSAR Modeling of P-Glycoprotein Substrates. J. Chem. Info. Model., 2006, 46, 1245-1254

Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y., Zheng, W., Wolschann, P., Buchbauer, G., Tropsha, A. Combinatorial QSAR of Ambergris Fragrance Compounds. J Chem. Inf. Comput. Sci. 2004, 44, 582-95


22/44

Example of application in a Combi-QSAR Study

Percent Classification Accuracy for the PGP Dataset*

               kNN               DECISION TREE     SVM
Descriptors    Training  Test    Training  Test    Training  Test
MOLCONNZ       92        78      88        67      90        67
ATOM PAIR      87        80      83        76      94        80
MOE            89        53      88        69      84        62
VOLSURF        83        76      86        78      88        80

Percent Classification Accuracy for the Fragrance Dataset**

                 kNN               BINARY QSAR       DECISION TREE     SVM
Descriptors      Training  Test    Training  Test    Training  Test    Training  Test
DRAGON           70        86      72        76      70        78      83        68
CMTD             72        65      76        50      67        74      81        58
CMTD/MOLCONNZ    67        60      85        47      62        53      87        53
VOLSURF          78        85      74        70      77        60      94        53
MOE              77        65      74        86      74        71      77        65
COMMA/MOE        77        75      73        70      74        72      73        69
CoMFA            76        89      71        65      75        62      83        75

*Lima, et al. JCIM, 2006, in press.

**Kovatcheva, Golbraikh, Oloff, et al. JCICS, 44: 582-595, 2004.


23/44

Activity randomization: model robustness

[Diagram: two tables pairing structures Struc.1, Struc.2, Struc.3, ..., Struc.n with properties Pro.1, Pro.2, Pro.3, ..., Pro.n.]

[Figure: q2 (from -0.1 to 0.7) vs. Number of Variables (0 to 70).]

The lowest q2 = 0.51 in the top 10 models

The highest q2 = 0.14 for randomized datasets
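The activity randomization test can be sketched as follows. This is an illustrative Python example, not the implementation behind the slide: it assumes a single descriptor, a one-nearest-neighbor model scored by leave-one-out q2, and 20 shuffles with a fixed seed. The function names and the toy data are hypothetical.

```python
import random


def loo_knn_q2(xs, ys):
    """Leave-one-out q2 of a 1-nearest-neighbor model on one descriptor."""
    predictions = []
    for i, x in enumerate(xs):
        others = [j for j in range(len(xs)) if j != i]
        nearest = min(others, key=lambda j: abs(xs[j] - x))
        predictions.append(ys[nearest])
    mean_y = sum(ys) / len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


def y_randomization(xs, ys, n_rounds=20, seed=0):
    """Rebuild the model on shuffled activities and collect the q2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break the structure-activity pairing
        scores.append(loo_knn_q2(xs, shuffled))
    return scores


# Toy data (made up): activity is a smooth function of the descriptor.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]

real_q2 = loo_knn_q2(xs, ys)
random_q2 = y_randomization(xs, ys)
print("real q2:", round(real_q2, 3))
print("highest q2 on randomized data:", round(max(random_q2), 3))
```

A model is considered robust when the q2 obtained with the real activities is clearly higher than any q2 obtained with shuffled activities, as in the two figures quoted on the slide.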


24/44

RATIONAL SELECTION OF TRAINING AND TEST SETS BASED ON DIVERSITY SAMPLING

[Figure: training set and test set points in a two-dimensional descriptor space (both axes from 0 to 1), with a sphere of radius R.]

$$V_p = V/N, \qquad R = c\,V_p^{1/K}$$

V - the occupied volume in the descriptor space
Vp - volume corresponding to one point
c - dissimilarity level
K - dimensionality of the descriptor space
N - number of points

ALGORITHMS 1 to 3

1. Volume corresponding to one point is 1/N.
2. Select a compound with the highest activity.
3. Include this compound into the training set.
4. Construct a sphere with the center in the representative point of this compound with radius R = c(V/N)^(1/K).
5. Include compounds within this sphere, except for the center, in the test set.
6. Exclude all points within this sphere. For algorithm 1, select randomly a compound and go to 3. If no compounds left, go to 10.
7. n - the number of remaining compounds. m - the number of spheres already constructed. dij, i=1,...,n, j=1,...,m - distances of compounds left to the sphere surfaces.
8. Select a compound with the smallest (algorithm 2) or largest (algorithm 3) dij.
9. Go to step 3.
10. Stop.

Golbraikh et al., J. Comp. Aid. Mol. Design 2003, 17, 241-253
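The ten steps above can be sketched in Python as follows. This is a simplified reading, not the published code: it assumes Euclidean distances, descriptors scaled so that the occupied volume V is 1, and distances measured to sphere centers (which rank compounds the same way as distances to the surfaces, since all spheres share one radius). For algorithm 3 it takes the largest nearest-sphere distance, which is one possible reading of step 8. The function name and toy data are hypothetical.

```python
import math
import random


def sphere_exclusion(points, activities, c=1.0, volume=1.0, variant=2, seed=0):
    """Split compound indices into training (sphere centers) and test sets."""
    n, k = len(points), len(points[0])
    radius = c * (volume / n) ** (1.0 / k)           # step 4: R = c (V/N)^(1/K)
    rng = random.Random(seed)
    remaining = set(range(n))
    training, test = [], []
    current = max(remaining, key=lambda i: activities[i])   # step 2
    while True:
        training.append(current)                     # step 3
        remaining.discard(current)
        inside = [i for i in remaining
                  if math.dist(points[i], points[current]) <= radius]
        test.extend(sorted(inside))                  # step 5
        remaining.difference_update(inside)          # step 6
        if not remaining:
            break                                    # step 10
        if variant == 1:
            current = rng.choice(sorted(remaining))  # algorithm 1: random
        else:
            def nearest_center(i):
                return min(math.dist(points[i], points[j]) for j in training)
            pick = min if variant == 2 else max      # algorithms 2 and 3
            current = pick(sorted(remaining), key=nearest_center)
    return training, test


# Toy data (made up): five compounds on a single descriptor axis.
points = [(0.0,), (0.05,), (0.5,), (0.55,), (1.0,)]
activities = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sphere_exclusion(points, activities, variant=2))
print(sphere_exclusion(points, activities, variant=3))
```

Larger values of the dissimilarity level c give larger spheres, so fewer compounds end up as sphere centers (training set) and more are assigned to the test set.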


25/44

DEFINING THE APPLICABILITY DOMAIN

Distribution of distances between points and their nearest neighbors in the training set

[Figure: histogram of Ni/N (0 to 0.4) over distance bins 0-0.1 through 0.9-1.0, for the training set and the test set.]

Training set: 60 compounds
Test set: 35 compounds

MODEL: Two nearest neighbors
The number of descriptors: 8
Q2(CV) = 0.57, R2 = 0.67

DISTANCES:
<D>train = 0.287
StDev(D)train = σ = 0.149

Closest nearest neighbors of test set compounds:
Dtest = <D>train + σ ZCutOff (ZCutOff = 0.5)

N is the total number of distances (Ntrain = 60 x 2 = 120; Ntest = 70)

Ni is the number of distances in each category (bin)
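A distance-based applicability domain of this kind can be sketched in Python as below. The sketch assumes Euclidean distances, two nearest neighbors per training compound (matching the 60 x 2 = 120 distances on the slide), the population standard deviation, and a threshold of mean distance plus ZCutOff times the standard deviation. The function names and toy data are hypothetical.

```python
import math
import statistics


def domain_threshold(training, k=2, z_cutoff=0.5):
    """Mean + z_cutoff * stdev of each training point's k nearest-neighbor distances."""
    distances = []
    for i, point in enumerate(training):
        others = training[:i] + training[i + 1:]
        nearest = sorted(math.dist(point, other) for other in others)[:k]
        distances.extend(nearest)
    return statistics.mean(distances) + z_cutoff * statistics.pstdev(distances)


def in_domain(query, training, threshold):
    """A query compound is inside the domain if its closest training
    neighbor lies within the distance threshold."""
    closest = min(math.dist(query, point) for point in training)
    return closest <= threshold


# Toy data (made up): four training compounds on the corners of a unit square.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
threshold = domain_threshold(training, k=2, z_cutoff=0.5)
print(threshold)
print(in_domain((0.5, 0.5), training, threshold))   # close to the training set
print(in_domain((3.0, 3.0), training, threshold))   # far from the training set
```

Predictions for compounds that fall outside the domain would be withheld, which trades coverage for accuracy.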



26/44

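Criteria for Predictive QSAR Model.

Correlation coefficient:

$$R = \frac{\sum (y_i - \bar{y})(\tilde{y}_i - \bar{\tilde{y}})}{\sqrt{\sum (y_i - \bar{y})^2 \sum (\tilde{y}_i - \bar{\tilde{y}})^2}}$$

Regression and regression through the origin:

$$\tilde{y}^{r} = a'y + b', \qquad \tilde{y}^{r_0} = k'y, \qquad k' = \frac{\sum y_i \tilde{y}_i}{\sum y_i^2}$$

Coefficients of determination:

$$R_0^2 = 1 - \frac{\sum (\tilde{y}_i - y_i^{r_0})^2}{\sum (\tilde{y}_i - \bar{\tilde{y}})^2}, \qquad R_0'^2 = 1 - \frac{\sum (y_i - \tilde{y}_i^{r_0})^2}{\sum (y_i - \bar{y})^2}$$

Here $y_i$ are observed and $\tilde{y}_i$ predicted activities; $y^{r_0}$ and $k$ are defined like $\tilde{y}^{r_0}$ and $k'$, with $y$ and $\tilde{y}$ interchanged.

CRITERIA

$q^2 > 0.5$; $R^2 > 0.6$; $R_0^2$ or $R_0'^2$ close to $R^2$; $k$ or $k'$ close to 1.0

[Figure: three scatter plots with axes Predicted and Observed, each showing a regression line and a regression line through the origin:
- y = 0.3154x + 3.4908, R2 = 0.9778; y = 0.9383x, R02 = -3.3825
- y = 3.1007x - 10.715, R2 = 0.9778; y = 1.0023x, R02 = 0.5238
- y = 1.2458x - 1.8812, R2 = 0.8604; y = 0.9796x, R02 = 0.8209]

Golbraikh et al., J. Comp. Aid. Mol. Design 2003, 17, 241-253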
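Part of these external-validation checks can be sketched in Python: the squared correlation coefficient R2 between observed and predicted activities, and the slopes k and k' of the regressions through the origin. The 0.15 tolerance used for "close to 1.0" is an assumption made for this illustration, as are the function names and the toy data; the R02 terms are left out of the sketch.

```python
import math


def squared_correlation(observed, predicted):
    """Square of the Pearson correlation coefficient R."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov / math.sqrt(var_o * var_p)) ** 2


def slope_through_origin(xs, ys):
    """Least-squares slope of ys = slope * xs (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def passes_slope_check(observed, predicted, tolerance=0.15):
    """k regresses observed on predicted; k' regresses predicted on observed."""
    k = slope_through_origin(predicted, observed)
    k_prime = slope_through_origin(observed, predicted)
    return abs(k - 1.0) <= tolerance or abs(k_prime - 1.0) <= tolerance


# Toy external test set (made up).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = squared_correlation(observed, predicted)
k_prime = slope_through_origin(observed, predicted)
print(round(r2, 4), round(k_prime, 4))
print(r2 > 0.6 and passes_slope_check(observed, predicted))
```

The slope check matters because a high R2 alone does not guarantee agreement: predictions can be perfectly correlated with the observations and still be systematically too high or too low.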


27/44

Why can't we get it right? Have we not tried enough?

- Descriptors? No, we have plenty (e.g., Dragon)
- Methods? No, we also have plenty, and are still searching (e.g., adapting data mining techniques)
- Training set statistics? NO, it does not work
- Test set statistics? Maybe, but it is still insufficient

So what else can we do?

Change the success criteria!!!

QSAR is an empirical data modeling exercise: just do it any way you like, but VALIDATE on independent datasets!


28/44

Predictive QSPR Workflow*

[Workflow diagram; boxes listed in approximate flow order:]

- Original Dataset
- Split into Training, Test, and External Validation Sets
- Multiple Training Sets; Multiple Test Sets
- Combi-QSAR Modeling
- Activity Prediction
- Only accept models that have q2 > 0.6, R2 > 0.6, etc.
- Y-Randomization
- Validated Predictive Models with High Internal & External Accuracy
- External validation Using Applicability Domain
- Database Screening



29/44

Example. Consensus QSPR models for the prediction of Ames genotoxicity*

- 3,363 diverse compounds (including >300 drugs) tested for their Ames genotoxicity
- 60% mutagens, 40% non mutagens
- 148 initial topological descriptors
- ANN, kNN, Decision Forest (DF) methods
- 2,963 compounds in the training set, 400 compounds (39 drugs) in randomly selected validation set

*Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie Q, Tong W. Mutagenesis, 2004, 19, 365-77.


30/44

Comparison of GenTox prediction

for 30 drugs in the external test set


31/44

Frequent MI descriptors map onto (some known) structural alerts

Bold, wide bonds show positions within structures where descriptors indicate a structural alert for Ames mutagenicity, as found among most important E-State indices


32/44

Applicability domain vs. prediction accuracy (Ames Genotoxicity dataset)

[Figure: Test Set % Accuracy (axis from 80 to 92) vs. kNN Z-Score Used for Prediction Cutoff (0.5, 1, 2, 3, 5, 10).]


33/44

QSAR modeling of the NTP/NCGC/HTS data only

                Modeling set   Validation set
Actives         103            37
Inconclusives   67             23
Inactives       230*           97*
Total           400            157

*Inactives most similar to actives are selected

The best k-NN models based on the modeling set:

Nm   Pred. Train.   NNN
1    78.8%          2
2    78.8%          2
3    78.1%          2

Pred. Test: 74.1%, 79.4%, 72.8% (listed as they appear in this copy; the alignment with the model rows is not preserved)


34/44

Prediction of the External Set

No applicability domain. Accuracy 75.8%

                  Actives   Inactives
Pred. Actives     23        11
Pred. Inactives   13        86
Pred. Accuracy    63.9%     88.7%

Applicability domain filter applied. Accuracy 83.6%, Coverage 82.8%

                  Actives   Inactives
Pred. Actives     16        7
Pred. Inactives   5         82
Pred. Accuracy    76.2%     92.1%


35/44

Carcinogenicity Model Based on the 187 Compounds

- Modeling set: 167 compounds
- External validation set: 20 compounds

The number of kNN QSAR models based on modeling set for different cutoff values:

Training/test set      Chemical descriptors   Combined HTS+chemical
predictivity cutoff    only                   descriptors
0.7/0.7                315                    919
0.75/0.75              29                     86
0.8/0.8                1                      4


36/44

Prediction of the 20 External Compounds

                   Chemical descriptors only       Combined HTS+chemical descriptors
                   Exp. Actives   Exp. Inactives   Exp. Actives   Exp. Inactives
Pred. actives      5              1                8              0
Pred. inactives    5              4                3              5
Accuracy           50.0%          80.0%            72.7%          100%
Overall Accuracy   65.0%                           86.4%
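The accuracy figures in this table can be reproduced from the four counts of each confusion matrix. The short Python sketch below does so; the function name is hypothetical. The "Overall Accuracy" values on the slide agree with the mean of the two per-class accuracies, so that is what the sketch computes.

```python
def class_accuracies(tp, fn, fp, tn):
    """Return (accuracy on actives, accuracy on inactives, mean of the two).

    tp: experimental actives predicted active
    fn: experimental actives predicted inactive
    fp: experimental inactives predicted active
    tn: experimental inactives predicted inactive
    """
    acc_actives = tp / (tp + fn)
    acc_inactives = tn / (tn + fp)
    return acc_actives, acc_inactives, (acc_actives + acc_inactives) / 2


chemical_only = class_accuracies(tp=5, fn=5, fp=1, tn=4)
combined = class_accuracies(tp=8, fn=3, fp=0, tn=5)

for name, result in (("chemical only", chemical_only), ("combined", combined)):
    print(name, [round(100 * value, 1) for value in result])
```

Averaging the per-class accuracies, rather than counting all correct predictions together, keeps the larger class from dominating the overall figure.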



37/44

Modeling of the complete carcinogenicity dataset: The Carcinogenic Potency Database (CPDB)

Lois Swirsky Gold, Ph.D., Director

Unique and widely used international repository
http://potency.berkeley.edu/

- 1485 chemicals
- Species, strain, and sex of test animals
- Target organ, tumor types, and tumor incidence
- Carcinogenic potency (TD50)
- Shape of the dose-response
- Experts' conclusion on carcinogenicity
- Literature citation through 1997

Incorporated in The Distributed Structure-Searchable Toxicity (DSSTox) Database Network.
http://www.epa.gov/nheerl/dsstox/sdf_cpdbas.html


38/44

Database Curation

- Total entries: (1481)
- Delete entries with no structure (1444 left)
- Delete entries containing inorganic elements (1244 left)
- Clean duplicates / triplicates and keep one copy (1216 left)
- Delete chiral compounds (1214 left)
- Delete all entries missing mutagenicity data (693 left)
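The curation steps above amount to a chain of filters, and the number of entries removed at each step follows from the counts on the slide. A small Python sketch of that arithmetic is given below; the step labels are paraphrased and the function name is hypothetical.

```python
# (step label, entries left after the step), as listed on the slide
steps = [
    ("total entries", 1481),
    ("no structure", 1444),
    ("inorganic elements", 1244),
    ("duplicates / triplicates", 1216),
    ("chiral compounds", 1214),
    ("missing mutagenicity data", 693),
]


def removed_per_step(steps):
    """Pair each filter with the number of entries it removed."""
    return [(name, before - after)
            for (_, before), (name, after) in zip(steps, steps[1:])]


for name, removed in removed_per_step(steps):
    print(f"{name}: {removed} removed")

retained = steps[-1][1] / steps[0][1]
print(f"retained: {retained:.1%}")
```

The last filter removes the most entries by far, which is worth keeping in mind when comparing the 693-compound working subset with the full database.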


39/44

Statistics of a Working Subset for the Animal Carcinogenicity Modeling

            Train/Test Set   Val. Set   Total
Inactive    210              59         269
Active      343              81         424
Total       553              140        693


40/44

Accuracy of kNN QSAR Models of Animal Carcinogenicity

[Figure: bar chart of minimum and maximum prediction accuracy (Min Pred. Acc., Max Pred. Acc.) for the Training Set, Test Set and Val. Set; accuracy axis from 0 to 0.9.]


41/44

QSPR Workflow: Emphasis on Successful Predictions, not statistics or interpretations

- Input: SAR dataset; External database/library
- QSPR Magic
- Output: Small number of computational hits
- Real Test: Large fraction is confirmed active


42/44



43/44

Summary and thoughts

- HTS and -omics data may be insufficient to achieve the desired accuracy of the end point property prediction. They should be explored as biodescriptors in conjunction with chemical descriptors.

- Predictive QSPR workflow with extensive validation affords statistically significant models that can serve as reliable property predictors.

- Mechanistic model interpretation should only be attempted IFF models have been externally validated.

"The public has an insatiable curiosity to know everything, except what is worth knowing." Oscar Wilde


44/44

ACKNOWLEDGMENTS

UNC ASSOCIATES

Former:
- Stephen CAMMER
- Sung Jin CHO
- Weifan ZHENG
- Min SHEN
- Bala KRISHNAMOORTHY
- Shuxing ZHANG
- Peter ITSKOWITZ
- Scott OLOFF
- Shuquan ZONG

Protein structure group:
- Yetian CHEN
- Tanarat KIETSAKORN
- Berk ZAFER

Funding:
- NIH RoadMap
- EPA (STAR award)

Collaborators:
- Ann Richard (EPA)
- Ivan Rusyn (UNC)
- Tudor Oprea (UNM)

Cheminformatics group:
- Kun WANG
- Rima HAJJO
- Alex GOLBRAIKH
- Mei WANG
- Raed KHASHAN
- Julia GRACE
- Chris GRULKE
- Hao HU
- Hao TANG
- Mihir SHAH
- Simon WANG
- Jui-Hua Hsieh
- Jun FENG
- Yun-De XIAO
- Yuanyuan QIAO
- Patricia LIMA
- Assia KOVATCHEVA
- M. KARTHIKEYAN

Current

Current
