Download - Predictive Qspr Epa Star
-
8/3/2019 Predictive Qspr Epa Star
1/44
The Challenges in Predictive QSPR
Modeling
Alexander Tropsha, Hao Zhu, Kun Wang
Laboratory for Molecular Modeling
School of Pharmacy
and
Carolina Center for Exploratory
Cheminformatics Research
UNC-Chapel Hill
-
8/3/2019 Predictive Qspr Epa Star
2/44
OUTLINE
The need for developing validated models OECD programme
NIHs Molecular Library Initiative and PubChem
Possible insufficiency of HTS and -omics data for predictivetoxicology
Predictive QSAR Modeling Workflow
Examples of the Workflow applications Ames genotoxicity
HTS/NTP dataset
Carcinogenicity modeling (HTS/NTP/CPDB and CPDB)
Discussion
-
8/3/2019 Predictive Qspr Epa Star
3/44
Art. 3.2Art. 3.2 ..to keep animal testing to a minimumto keep animal testing to a minimumin the interest of timein the interest of time-- and costand cost--effectivenesseffectiveness
particular research efforts are needed for development andparticular research efforts are needed for development and
validation of modelling (e.g. QSAR) and screening methodsvalidation of modelling (e.g. QSAR) and screening methods
for assessing the potential adverse effects of chemicalsfor assessing the potential adverse effects of chemicals
Regulatory acceptance of QSAR modelsRegulatory acceptance of QSAR models::
Setubal PrinciplesSetubal PrinciplesSetubal Principles
Workshop ICCA/CEFIC (2002):Workshop ICCA/CEFIC (2002):
*Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)
EU- WHITE PAPER on the Strategy for aFuture Chemicals Policy (2001)*
EUEU-- WHITE PAPER on theWHITE PAPER on the StrategyStrategy forfor aaFutureFuture ChemicalsChemicals PolicyPolicy ((2001)*2001)*
-
8/3/2019 Predictive Qspr Epa Star
4/44
From Setubal to OECD PrinciplesFrom Setubal to OECD PrinciplesFrom Setubal to OECD PrinciplesTo facilitate the consideration of a QSAR modelTo facilitate the consideration of a QSAR model
for regulatory purposes, it should be associated with thefor regulatory purposes, it should be associated with the
following information:following information:
--
be associated with abe associated with a defined endpointdefined endpoint
take the form of an unambiguous and easily applicabletake the form of an unambiguous and easily applicable
algorithmalgorithm;;
ideally, have aideally, have a mechanistic basis;mechanistic basis;
be accompanied by a definition ofbe accompanied by a definition of domain of applicabilitydomain of applicability
be associated with a measure of goodnessbe associated with a measure of goodness--of fit (of fit (internalinternal
validationvalidation););
be assessed in terms of its predictive power by using databe assessed in terms of its predictive power by using datanot used in the development of the model (not used in the development of the model (externalexternal
validationvalidation).).*Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)
-
8/3/2019 Predictive Qspr Epa Star
5/44
-
8/3/2019 Predictive Qspr Epa Star
6/44
4 Chemical SynthesisCenters
4 Chemical SynthesisCenters
Molecular Libraries Initiative
Molecular Libraries Initiative
NIH Roadmap Initiative
NIH Roadmap Initiative
MLSCN (9+1)9 centers
1 NIH intramural20 x 10 = 200 assays
MLSCN (9+1)9 centers
1 NIH intramural20 x 10 = 200 assays
PubChem(NLM)
PubChem(NLM)
ECCR (6)Exploratory
Centers
ECCR (6)ExploratoryCenters
CombiChem
Parallel synthesis
DOS
4 centers + DPI
100K 1M compounds
CombiChem
Parallel synthesis
DOS
4 centers + DPI
100K 1M compounds
1M compounds
200 assays
SAR matrix
NIHs Molecular Libraries Initiative in numbers
PredictiveADMET
(10)
PredictiveADMET(10)
-
8/3/2019 Predictive Qspr Epa Star
7/44
Recent MLSCN Screening
Results in PubChem
0.000536982637NCGCIkB Signalling
0.036932140852NCGCCell Viability
(BJ)
0.100852
1408142NCGCCell Viability
(Jurkat)
0.056818
140880NCGCCell Viability
(Hek293
)
0.037642
140853NCGCCell Viability
(HepG2)
0.036222
140851NCGCCell Viability
(MRC5)
0.065341
140892NCGCCell Viability
(SK-N-SH)
0701580NCGCsOGT0655350NCGCPrx2
% Active# Screened# ofActives
CenterScreen
0.0058211236972VUMLSCM40.00574
853649VUMLSCThallium
flux through
GIRK
0.019431
9984194SDCCGCell
Viability
(HPDE-C7)
0.021534
9984215SDCCGCell
Viability
(HPDE-
C7K)
0.083811
3317278SRMLSCA549 Cell
Growth
0.0002
100112SRMLSCPantothenate
Synthetase
0.005104999351NMMLSCFPR0.002302
999323NMMLSCFPRL1
0.00153365239100PMLSCMKP-1
-
8/3/2019 Predictive Qspr Epa Star
8/44
Subset of PubChem relevant to this presentation:
NTP-HTS Content Summary of 1408 Compounds
Chemical Structure Types:- Organic: 1,348
- Inorganic: 27
- Organometallic: 19- No structure: 14
1348 Organic compounds contain:- Unique: 1,279
- Complex: 51
- Salt: 20
- Duplicates: 53
Curated subset: 1,289 unique organic compounds
-
8/3/2019 Predictive Qspr Epa Star
9/44
HTS Screening Data (NCGC) for
1,289 NTP Compounds
1,208
44
37
MRC5
74416312142Actives
1,1611,2011,1471,0791,203Inactives
5447798944Inconclusives
SK-N-
SH
HepG2Hek293JurkatBJ
-
8/3/2019 Predictive Qspr Epa Star
10/44
Data interpretation: How was the Activity
Classified?
-0.5
0
0.5
1
1.5
2
2.5
-9 -8 -7 -6 -5 -4 -3 -2 -1
Log(EC50)
Classifications
The relationship between IC50 and classification
of compounds in Jurkat cell line test.
active
inactive
inconclusive
-
8/3/2019 Predictive Qspr Epa Star
11/44
Summary of the experimental data
for 1,289 compounds 141 compounds are active in at least one test.
230 compounds are at least active orinconclusive in at least one test.
1,059 compounds are inactive in all 6 tests
-
8/3/2019 Predictive Qspr Epa Star
12/44
Additional biological data on
1,289 NTP/HTS compounds*
181383*4231,0531,1531,289
IRISSICPDBHPVCSINTPGTZNTPBSINTP-
HTS
NTPBSI: National Toxicology Program Chemical Structure Index file
NTPGTZ: National Toxicology Program genotoxicity
HPVCSI: High Production Volume Chemicals
CPDB: Carcinogenic Potency Data Base All SpeciesIRISSI: EPA Integrated Risk Information System
*15 of 383 compounds in CPDB database are technique class.
*Based on the DSSTox project of Dr. Ann Richard at EPA.
-
8/3/2019 Predictive Qspr Epa Star
13/44
Overview of carcinogenic responses of
the 383 compounds in rats and mice 229 compounds show positive response in at least
one organ of one or more species.
92 compounds show negative results in all tests.
62 compounds show negative response in all testsbut the tests are not complete.
-
8/3/2019 Predictive Qspr Epa Star
14/44
Are HTS results indicative of
carcinogenicity?
Results based on the IRIS database (EPA 1986, 1996, 1999 carcinogen risk assessment)
1
5
HTS-Actives
Non HumanCarcinogens
Human
Carcinogens
332
475
HTS-InactivesHTS-
Inconclusives
93 compounds were tested in HTS, 57 of them are or likely to be human carcinogens, 36
of them are not human carcinogens.
-
8/3/2019 Predictive Qspr Epa Star
15/44
Can the explicit use of chemical structure help
with the end point prediction: QSPR Modeling
Chemistry Biology
(IC50, Kd...)
Cheminformatics(Molecular Descriptors)
Comp.1 Value1 D1 D2 D3 D4
Comp.2 Value2 " " " "
Comp.3 Value3 " " " "
Comp.N ValueN " " " "
- - - - - - - - - - - - - -
Goal: Establish correlations between descriptors and the target
property capable of predicting activities of novel compounds
BA (e.g., IC50) = F(D)
=
2
2
2
)(
)(1
i
ii
yy
yyq
-
8/3/2019 Predictive Qspr Epa Star
16/44
Typical QSPR modeling result: Comparison
between Actual and Predicted Activity
0
0.5
1
1.5
2
2.5
3
0 1 2 3 4Actual LogED50 (ED50 = mM/kg)
PredictedLogED50
Training
Linear (Training)
q2 = 0.79
makes everyone happy
-
8/3/2019 Predictive Qspr Epa Star
17/44
An example of mechanistic model
of mutagenicity
Possible remedies (per authors)
retesting some of the compounds;
testing further new compounds; checking (if necessary) the use of additional
chemical descriptors.
Beningni et al.Env. Mol. Mutagenesis 46:268-280 (2005)
-
8/3/2019 Predictive Qspr Epa Star
18/44
y = 0.5958x + 2.3074
R2 = 0.2135
3
5
7
9
3 5 7 9
Predicted
Observed
EXTERNAL TEST SET PREDICTIONS
y = 0.4694x + 2.9313
R2 = 0.1181
3
5
7
9
3 5 7 9
Predicted
Observed
The unbearable lightness of modeling
(in this case, CoMFA)
leads to unacceptable prediction accuracy.
0
0.5
1
1.5
2
2.5
3
0 1 2 3 4Actual LogED50 (ED50 = mM/kg)
PredictedLogED50
Training
Linear (Training)
2!!!
-
8/3/2019 Predictive Qspr Epa Star
19/44
BEWARE OF q2!!!(Golbraikh & Tropsha,J. Mol. Graphics Mod. 2002, 20, 269-276. )
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0.5 0.6 0.7 0.8 0.9 1
q2
R2
Poor Models
Good Models
0
20
40
60
80
100
120
-0.7 -0.6 -0.5 -0.4 - 0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
R2
F
re
q
ue
n
c
-
8/3/2019 Predictive Qspr Epa Star
20/44
COMPONENTS OF THE PREDICTIVE
QSPR MODELING WORKFLOW*
Model Building: Combination of various descriptorsets and variable selection data modeling methods
(Combi-QSAR) Model Validation
Y-randomization
Training and test set selection
Applicability domain Evaluation of external predictive power
Virtual screening
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
-
8/3/2019 Predictive Qspr Epa Star
21/44
COMBINATORIAL QSAR
C-Qics
KNNKNN
BINARY QSAR,BINARY QSAR,
COMFA descriptorsCOMFA descriptors
MolconnMolconn ZZ
descriptorsdescriptors
Chirality descriptorsChirality descriptors
VolsurfVolsurf descriptorsdescriptors
Comma descriptorsComma descriptors
MOE descriptorsMOE descriptors
Dragon descriptorsDragon descriptors
SAR DatasetSAR Dataset
Compound representationCompound representation
Selection of best modelsSelection of best models
Model validationModel validation
usingusing external testexternal test setset
andand YY--RandomizationRandomization
QSAR modelQSAR modeliningg SVMSVM
DECISION TREEDECISION TREE
Lima, P., Golbraikh, A., Oloff, S., Xiao, Y., Tropsha, A. Combinatorial QSAR Modeling of P-Glycoprotein
Substrates. J. Chem. Info. Model., 2006, 46, 1245-1254
Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y., Zheng, W., Wolschann, P., Buchbauer, G., Tropsha, A.
Combinatorial QSAR of Ambergris Fragrance Compounds. J Chem. Inf. Comput. Sci. 2004, 44, 582-95
-
8/3/2019 Predictive Qspr Epa Star
22/44
Example of application in a Combi-QSAR Study
808088887878868676768383VOLSURFVOLSURF
626284846969888853538989MOEMO E
808094947676838380808787ATOM PAIRATOM PAIR
676790906767888878789292MOLCONNZMOLCONNZ
TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining
SVMSVMDECISION TREEDECISION TREEkN NkN NMethodMethod
DescriptorsDescriptors
808088887878868676768383VOLSURFVOLSURF
626284846969888853538989MOEMO E
808094947676838380808787ATOM PAIRATOM PAIR
676790906767888878789292MOLCONNZMOLCONNZ
TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining
SVMSVMDECISION TREEDECISION TREEkN NkN NMethodMethod
DescriptorsDescriptors
Percent Classification Accuracy for the PGP Dataset*
75758383626275756565717189897676CoMFACoMFA
69697373727274747070737375757777COMMA/MOECOMMA/MOE
69697373727274747070737375757777COMMA/MOECOMMA/MOE
65657777717174748686747465657777MOEMOE
53539494606077777070747485857878VOLSURFVOLSURF
53538787535362624747858560606767CMTD/MOLCONNZCMTD/MOLCONNZ
58588181747467675050767665657272CMTDCMTD
68688383787870707676727286867070DRAGONDRAGON
TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining
SVMSVMDECISION TREEDECISION TREEBINARYBINARY
QSARQSAR
kNNkNNMethodMethod
DescriptorsDescriptors
75758383626275756565717189897676CoMFACoMFA
69697373727274747070737375757777COMMA/MOECOMMA/MOE
69697373727274747070737375757777COMMA/MOECOMMA/MOE
65657777717174748686747465657777MOEMOE
53539494606077777070747485857878VOLSURFVOLSURF
53538787535362624747858560606767CMTD/MOLCONNZCMTD/MOLCONNZ
58588181747467675050767665657272CMTDCMTD
68688383787870707676727286867070DRAGONDRAGON
TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining
SVMSVMDECISION TREEDECISION TREEBINARYBINARY
QSARQSAR
kNNkNNMethodMethod
DescriptorsDescriptors
Percent Classification Accuracy for the Fragrance Dataset**
*Lima, et al. JCIM, 2006, in press.**Kovatcheva, Golbraikh, Oloff, et al. JCICS, 44: 582-595, 2004.
-
8/3/2019 Predictive Qspr Epa Star
23/44
Activity randomization: model robustnessStruc.1
Struc.2
Struc.n
..
..
Pro.1
Struc.3
..
..
Pro.2Pro.3
Pro.n
Struc.1
Struc.2
Struc.n
..
..
Pro.1
Struc.3
..
..
Pro.2
Pro.3
Pro.n
-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 10 20 30 40 50 60 70
Number of Variables
q2
The lowest q2 = 0.51 in the top 10 models
The highest q2 =0.14 for randomized datasets
-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 10 20 30 40 50 60 70
Number of Variables
q2
The lowest q2 = 0.51 in the top 10 models
The highest q2 =0.14 for randomized datasets
-
8/3/2019 Predictive Qspr Epa Star
24/44
RATIONAL SELECTION OF TRAINING AND TEST SETS
BASED ON DIVERSITY SAMPLING
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
R
TRAINING SET TEST SET
NVVp /=
K
p
cVR /1=
V - the occupied volume in the descriptor space
Vp- volume corresponding to one point
c - dissimilarity level
K - dimensionality of the descriptor space
ALGORITHMS 1 to 3
1. Volume corresponding to one point is 1/N.
2. Select a compound with the highest activity.3. Include this compound into the training set.
4. Construct a sphere with the center in the
representative point of this compound with
radius R = c(V/N)1/K.
5. Include compounds within this sphere except
for the center in the test set.6. Exclude all points within this sphere. For
algorithm 1, select randomly a compound
and go to 3. If no compounds left, go to 10.
7. n - the number of remaining compounds. m -
the number of spheres already constructed.
dij, i=1,,n, j=1,,m distances of
compounds left to the sphere surfaces.8. Select a compound with the smallest
(algorithm 2) or largest (algorithm 3) dij.
9. Go to step 3.
10. Stop.
N number of points
Golbraikh et al.,J. Comp. Aid. Mol. Design 2003, 17, 241253
-
8/3/2019 Predictive Qspr Epa Star
25/44
DEFINING THE APPLICABILITY DOMAIN
Distribution of distances between points and
their nearest neighbors in the training set
0
0.1
0.2
0.3
0.4
0-0.1
0.1-0.2
0.2-0.3
0.3-0.4
0.4-0.5
0.5-0.6
0.6-0.7
0.7-0.8
0.8-0.9
0.9-1.0
Distances
Ni/N
Training set
Test set
Training set: 60 compoundsTest set: 35 compounds
MODEL:Two nearest neighbors
The number of descriptors: 8Q2(CV)=0.57 R2 =0.67
DISTANCES:train=0.287
StDev(D)train= =0.149
Closest nearest neighbors oftest set compounds:
Dtest = train+ ZCutOff(ZCutOff=0.5)
N is the total number of distances
(Ntrain=60 2=120; Ntest=70 )
Ni is the number of distances in each
category (bin)
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
-
8/3/2019 Predictive Qspr Epa Star
26/44
y = 0.9383x
R02= -3.3825
y = 0.3154x + 3.4908
R2
= 0.9778
0
2
4
6
8
10
0 2 4 6 8 10
Predicted
Observed
y = 1.0023x
R02 = 0.5238
y = 3.1007x - 10.715
R2 = 0.9778
0
2
4
6
8
10
0 2 4 6 8 10
Observed
Predic
ted
=
22 )~~()(
)~~)((
yyyy
yyyyR
ii
ii
2
2
2
0)~~(
)~(1
0
yy
yyR
i
r
ii
=
2
2
2
0 )(
)~(
1'
0
yy
yy
R i
r
ii
=
''~ byay r += ykyr '~ 0 =
Criteria for Predictive QSAR Model.
Correlation coefficient
y = 1.2458x - 1.8812
R2
= 0.8604
y = 0.9796x
R02
= 0.8209
3
5
7
9
3 5 7 9
Predicted
Observed
2
~
'i
ii
y
yyk
=
Coefficients of determinationRegression Regression throughthe origin
CRITERIA
22'
0
2
0
'
22
or;0.1or
;6.0;5.0
RRRkk
Rq
>>
Golbraikh et al.,J. Comp. Aid. Mol. Design 2003, 17, 241253
-
8/3/2019 Predictive Qspr Epa Star
27/44
Why cant we get it right? Have not
we tried enough?
Descriptors? No, we have plenty (e.g., Dragon)
Methods? No, we also have plenty, and still
searching (e.g., adapting datamining techniques). Training set statistics? NO, it does not work
Test set statistics? Maybe, but it is still insufficient
Sowhat else can we do?????
Change the success criteria!!!
QSAR is an empirical data modeling exercise: justdo it any way you like but VALIDATE onindependent datasets!
-
8/3/2019 Predictive Qspr Epa Star
28/44
Only accept models
that have a
q2 > 0.6
R2 > 0.6, etc.
MultipleTraining Sets
Validated Predictive
Models with High Internal
& External Accuracy
Predictive QSPR Workflow*
Original
Dataset
Multiple
Test Sets
Combi-QSAR
ModelingSplit into
Training, Test,and External
validation Sets
Activity
Prediction
Y-Randomization
External validation
Using Applicability
Domain
Database
Screening
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
-
8/3/2019 Predictive Qspr Epa Star
29/44
3,363 diverse compounds (including >300drugs) tested for their Ames genotoxicity
60% mutagens, 40% non mutagens 148 initial topological descriptors
ANN, kNN, Decision Forest (DF) methods
2963 compounds in the training set, 400compounds (39 drugs) in randomly selectedvalidation set
Example. Consensus QSPR models for
the prediction of Ames genotoxicity*
*Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie Q, Tong W.Mutagenesis, 2004, 19, 365-77.
Comparison of GenTox prediction
-
8/3/2019 Predictive Qspr Epa Star
30/44
Comparison of GenTox prediction
for 30 drugs in the external test set
-
8/3/2019 Predictive Qspr Epa Star
31/44
Frequent MI
descriptors map
onto (someknown) structural
alerts
Bold, wide bonds show
positions within structures
where descriptors indicate
a structural alert for Amesmutagenicity as found
among most important E-
State indices
-
8/3/2019 Predictive Qspr Epa Star
32/44
Applicability domain vs. prediction
accuracy (Ames Genotoxicity dataset)
80
82
84
86
88
90
92
0.5 1 2 3 5 10
kNN Z-Score Used for Prediction Cutoff
TestSet%Accurac
y
-
8/3/2019 Predictive Qspr Epa Star
33/44
QSAR modeling of the NTP/NCGC/HTS data only
37103Actives
2367Inconclusives
157400Total97*230*Inactives
Validation setModeling set
The best k-NN models based on the modeling set:
74.1%
79.4%
72.8%
Pred. Test
278.1%3
278.8%2
278.8%1
NNNPred. Train.Nm
*Inactives most similar to actives are selected
-
8/3/2019 Predictive Qspr Epa Star
34/44
Prediction of the External Set
88.7%63.9%Pred.
Accuracy
8613Pred.
Inactives
1123Pred.
Actives
InactivesActives
92.1%76.2%Pred.
Accuracy
825Pred.
Inactives
716Pred.
Actives
InactivesActives
No applicability domain.
Accuracy 75.8
Applicability domain filter applied.
Accuracy 83.6%, Coverage 82.8%
-
8/3/2019 Predictive Qspr Epa Star
35/44
Carcinogenicity Model Based on the 187
Compounds Modeling set: 167 compounds
External validation set: 20 compounds
The number of kNN QSAR models based on modeling set fordifferent cutoff values:
Training/test set predictivity
cutoff
Chemical descriptors only Combinational descriptors
0.7/0.7 315 919
0.75/0.75 29 86
0.8/0.8 1 4
Combined HTS+chemical descriptors
-
8/3/2019 Predictive Qspr Epa Star
36/44
Prediction of the 20 External Compounds
Chemial descriptors
only
Combined HTS+chemical
descriptors
Exp.
Actives
Exp.
Inactives
Exp.
Actives
Exp. Inactives
Pred. actives 5 1 8 0
Pred. inactives 5 4 3 5
Accuracy 50.0% 80.0% 72.7% 100%
Overall
Accuracy
65.0% 86.4%
Modeling of the completeModeling of the complete
-
8/3/2019 Predictive Qspr Epa Star
37/44
Modeling of the complete
carcinogenicity dataset: The
Carcinogenic Potency Database (CPDB)
Modeling of the completecarcinogenicity dataset: The
Carcinogenic Potency Database (CPDB) Lois Swirsky Gold, Ph.D., Director
Unique and widely used international repositoryhttp://potency.berkeley.edu/
1485 chemicals Species, strain, and sex of test animals
Target organ, tumor types, and tumor incidence
Carcinogenic potency (TD50)
Shape of the dose-response
Expertss conclusion on carcinogenicity
Literature citation through1997
Incorporated in The Distributed Structure-Searchable Toxicity(DSSTox) Database Network.http://www.epa.gov/nheerl/dsstox/sdf_cpdbas.html
-
8/3/2019 Predictive Qspr Epa Star
38/44
Database CurationDatabase Curation
Total entries: (1481)
Delete entries with no structure (1444 left)
Delete entries containing inorganicelements (1244 left)
Clean duplicates / triplicates and keep one
copy (1216 left) Delete chiral compounds (1214 left)
Delete all entries missing mutagenicity data(693 left)
-
8/3/2019 Predictive Qspr Epa Star
39/44
Statistics of a Working Subset for the
Animal Carcinogenicity Modeling
Statistics of a Working Subset for the
Animal Carcinogenicity Modeling
693140553Total
42481343Active
26959210Inactive
TotalVal. SetTrain/Test SetT
-
8/3/2019 Predictive Qspr Epa Star
40/44
Accuracy of kNN QSAR Models of
Animal Carcinogenicity
Accuracy of kNN QSAR Models of
Animal Carcinogenicity
Training SetTest Set
Val. Set
Min Pred. Acc.
Max Pred. Acc.0
0.1
0.2
0.3
0.4
0.5
0.60.7
0.8
0.9
Training SetTest Set
Val. Set
Min Pred. Acc.
Max Pred. Acc.0
0.1
0.2
0.3
0.4
0.5
0.60.7
0.8
0.9
QSPR Workflow: Emphasis on Successful
-
8/3/2019 Predictive Qspr Epa Star
41/44
Large fraction isconfirmed active
Small number ofcomputational hits
SAR datasetExternal
database/library
QSPR Workflow: Emphasis on Successful
Predictions, not statistics or interpretations
QSPR
Magic
Input
Output
Real Test
-
8/3/2019 Predictive Qspr Epa Star
42/44
C-
-
8/3/2019 Predictive Qspr Epa Star
43/44
Summary and thoughts
HTS and omics data may be insufficient toachieve the desired accuracy of the end point
property prediction. Should be explored asbiodescriptors in conjunction with chemicaldescriptors
Predictive QSPR workflow with extensivevalidation affords statistically significant modelsthat can serve as reliable property predictors
Mechanistic model interpretation should only beattempted IFF models have been externallyvalidated
The public has an insatiable curiosity to know everything,
except what is worth knowing.Oscar Wilde
-
8/3/2019 Predictive Qspr Epa Star
44/44
ACKNOWLEDGMENTSUNC ASSOCIATES
Former:
-Stephen CAMMER-Sung Jin CHO
-Weifan ZHENG
- Min SHEN
-Bala KRISHNAMOORTHY
-Shuxing ZHANG
Peter ITSKOWITZ
Scott OLOFF- Shuquan ZONG
Protein structure group:
Yetian CHEN
Tanarat KIETSAKORN
Berk ZAFER
Funding NIH RoadMap
EPA (STAR award)
Collaborators
Ann Richard (EPA)
Ivan Rusyn (UNC)
Tudor Oprea (UNM)
Cheminformatics group:- Kun WANG - Rima HAJJO
Alex GOLBRAIKH - Mei WANG Raed KHASHAN - Julia GRACE
Chris GRULKE - Hao HU
- Hao TANG - Mihir SHAH
- Simon WANG - Jui-Hua Hsieh
Jun FENG
Yun-De XIAO
Yuanyuan QIAO
Patricia LIMA
Assia KOVACHEVA
M. KARTHIKEYAN
Current