QSAR modeling
Pavel Polishchuk
Institute of Molecular and Translational MedicineFaculty of Medicine and Dentistry
Palacky University
Activity = F(structure)
M – mapping functionE – encoding function
Modeling of compounds properties
Activity = M(E(structure))
D1 D2 D3 D4 D5 D6 … DN
1 0 9 0 11 1 … 1
4 0 1 0 0 0 … 1
0 0 0 0 0 4 … 6
0 2 3 6 0 0 … 3
… … … … … … … …
4 0 0 0 1 2 … 1
QSAR modeling workflow
D1 D2 D3 D4 D5 D6 … DN
1 0 9 0 11 1 … 1
4 0 1 0 0 0 … 1
0 0 0 0 0 4 … 6
0 2 3 6 0 0 … 3
… … … … … … … …
4 0 0 0 1 2 … 1
Structure Descriptors (features) Model
Encoding(represent structure with
numerical features)
Mapping(machine learning)
Overall QSAR workflow
Input data
Bioassays
Databases
PreprocessingFeature
engineeringModel
trainingModel
validation
Classification
Regression
Clustering
Cross-validation
Bootstrap
Test set
Applicability Domain
Feature selection
Feature combination
Data normalization & curation
Feature extraction
Interpretation
j
j
ii
z
xxx '
4
1) a defined endpoint2) an unambiguous algorithm3) a defined domain of applicability4) appropriate measures of goodness-of–fit, robustness and predictivity5) a mechanistic interpretation, if possible
OECD principles for the validation, for regulatory purposes, of (Q)SAR models
Step 1. Data collection
Scientific literature and patentsDatabases (ChEMBL, PubChem, BindingDB, etc)
Traditionally modeled compounds should have the same mechanism of action, however using of complex non-linear machine learning method allows to model data sets with mixed or even unknown mechanism of action with reasonable accuracy.
Conditions may substantially influence the results of bioassays (change in temperature, activators, detectors, etc)
Units checking
Step 2. Data curation (normalization)
Step 2. Data curation (normalization)
Step 2. Data curation (normalization)
Removal of mixtures, inorganics, metalorganics, etc
Strip of salts, counterions, etc
Ionization, if necessary (at the particular pH level)
Chemotype normalization, resonance structure and tautomers
Duplicates removal
Manual checking
Step 3. Descriptors: classificationObject type:
molecular descriptors (single molecules)descriptors of molecular ensemble (mixtures, materials)reaction descriptors (reactions)
Descriptor origin:calculated from the structureempirical (Hammet constants, lipophilicity chemical shifts in NMR, etc)
Locality:local (atom charge)global (molecular weight, molecular volume, lipophilicity, etc)
Dimensionality:1D (number of methyl groups, molecular weight, etc)2D (topological indices, fragmental descriptors)3D (molecular volume, quantum chemical descriptors)4D (based on a set of conformers)
Calculation method:physico-chemical (lipophilicity, etc)topological (invariants of molecular graph, Randic index, Wiener index, etc)fragmental (fingerprints, etc)pharmacophorespatial (moment of inertia, etc)quantum-chemical (energy of HOMO/LUMO, etc)etc. R. Todeschini and V. Consonni Handbook of Molecular Descriptors, 2008
Atom-centric (augmented atoms) fingerprints
Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter).
Morgan fingerprints, Extended-connectivity fingerprints (ECFP). Functional-class fingerprints (FCFP) , etc.
radius=2 (diameter=4)
Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)
Atom-centric (augmented atoms) fingerprints
Fingerprints
Each molecule has variable length set of substructures – variable length fingerprints
N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C
1 1 1 0 0 0 0
1 1 1 1 0 0 0
0 0 0 0 1 1 1
2-bond sequences
Hashed fingerprints
Have fixed length (usually 512, 1024 or 2048 bits)
hash code
N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N
13823 9740 37278 28478 874 283764
0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1
pseudo-randomnumber generator
fixed-length bit string
Each substructure activates several bits (usually 4-5) to avoid collisions and produce bit string of enough density
Missing bits mean that certain substructures are not presented, Active bits mean that certain substructure may be present (but due to possible collisions one cannot be sure)
Folding of fingerprints and keys bit strings
To increase density of very sparse bit vectors
0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1
0 0 0 1 0 0 1 1 0 1
0 0 1 0 1 0 0 0 1 1
OR
n bit fingerprints
n/2 bits
n/2 bits
0 0 1 1 1 0 1 1 1 1 n/2 bit fingerprint
Usually optimal density is 0.2-0.3
Step 4. Feature processing
Feature transformations:linear and non-linear scaling
Feature combinations:
Feature selection:removal of correlated descriptors (important e.g. for linear regression models)removal of descriptors with missing valuesremoval constant and near constant featuresremoval of descriptors with low correlation with the target propertystep-wise feature selectionevolutionary feature selection (genetic algorithm, etc)
sd
xxz i
i
n
i
i xxn
nsd
1
2)(1
ixie
z
1
1range (0; 1)
jiij xxz
2
ii xz add quadratic term
Step 5. Model building
Unsupervisedclustering
Supervised
Regression Classification
Multiple linear regression (MLR)
Partial linear regression (PLS) Logistic regression
Gaussian Process (GP) Naïve Bayes (NB)
Decision trees (DT)
Support vector machine (SVM)
Neural nets (NN)
Random forest (RF)
k-Nearest neighbors (kNN)
1/C = 4.08π – 2.14π2 + 2.78σ + 3.38
π = logPX – logPH
σ - Hammet constant
plant growth inhibition activity of phenoxyacetic acids
Hansch equation
R is H or CH3;
X is Br, Cl, NO2 and
Y is NO2, NH2, NHC(=O)CH3
Inhibition activity of compounds against Staphylococcus aureus
Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2
Free-Wilson models
QSAR model building
18
Simulated data set of actives and inactives with two descriptors – MW and logP
logP
MW
Decision tree
IF logP >= 3.2
AND MW >= 255
THENcompound is
19
logP < 3.2YES NO
MW < 358
YES NO
MW < 255
YES NO
Decision tree
20
LogP
MW
Decision tree
21
Initial dataset
Bootstrapsample
Bootstrapsample
Bootstrapsample
CART tree1 CART tree2 CART tree3
Combined prediction
…
Random feature subspace in each node
Random Forest
Consensus (ensemble) modeling
Consensus (ensemble) modeling
Step 6. Validation
Test set (usually 20-25% of the work set)
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set
random test set 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
stratified test set
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
Cross-validation
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2
fold 1
fold 2
fold 3
predictions of different folds are combined to calculate the final predictive measure
Step 6. Measures of predictive ability of models
Classification
FPTN
TNySpecificit
FNTP
TPySensitivit
2
ySensitivitySpecificitaccuracy Balanced
baseline-1
baseline -accuracy Kappa
N
TNTPaccuracy
2N
FP)FN)(TP(TPFN)FP)(TN(TNbaseline
Confusion matrix
Predicted
positive class (1)
negative class (0)
observed
positive class (1)
truepositive
(TP)
false negative
(FN)
negative class (0)
false positive
(FP)
true negative
(TN)
))()()(( FNTNFPTNFNTPFPTP
FNFPTNTPMCC
[-1; 1]
[0; 1]
[0; 1]
[0; 1]
[0; 1]
[0; 1]
Step 6. Measures of predictive ability of models
Regression
i
2
obspredi,
i
2
obsi,predi,2
)y(y
)y(y
1Q
1N
)y(y
RMSE i
2
obsi,predi,
N
1i
obsi,predi, yyN
1MAE
Determination coefficient
Root mean squared error
Mean absolute error
Step 7. Applicability domain (AD)Extrapolation to very distant objects is dangerous
There is a need to define the domain where our model is reliable (models are not universal!)
Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.
Step 7. Applicability domain (AD) measures
Bounding box - based on descriptor range
Desc_1
Desc_2
AD• internal regions are usually empty, especially if the number of descriptors is big
• it doesn’t take into account descriptor correlation
Distance from training set compounds in descriptor space
Desc_1
Desc_2Distance from training set compounds in model space
Model_1
Model_2
Requires several models (e.g. consensus model, bootstrap models)
One-class SVM
Conformal predictions
Step 7. Applicability domain (AD) example
J. Chem. Inf. Model., Vol. 50, No. 12, 2010
Step 8. Interpretation of QSAR models
Why interpretation is important?
Found active/inactive patterns which can be used for optimization of compound properties
Retrieve trends of stricture-activity relationships which can be used for knowledge-base model validation
Regulatory purposes
Step 8. Interpretation of QSAR models
Principles and issues
Model should be predictive
Interpretation is valid within the applicability domain of the model
Interpretation results are data set dependent
Step 8. Interpretation of QSAR models
1/C = 4.08π – 2.14π2 + 2.78σ + 3.38
π = logPX – logPH
σ - Hammet constant
plant growth inhibition activity of phenoxyacetic acids
electronic factorsrate of penetration of membranes in the plant cell
Hansch equation
R is H or CH3;
X is Br, Cl, NO2 and
Y is NO2, NH2, NHC(=O)CH3
Inhibition activity of compounds against Staphylococcus aureus
Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2
Free-Wilson models
Step 8. Interpretation of QSAR models
347 agonists of 5-HT1A receptor
Ar - substituted (hetero)arylsL - polymethylene chainR - various (poly)cyclic residues
O O OMe Cl Cl CF3
N
N
CH3
F
O2N
Cl
PLS 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96RF 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66
Ar
L -(CH2)6- -(CH2)5- -(CH2)4- -(CH2)3- -(CH2)2- -CH2-
PLS 0.8 0.71 0.81 0.08 -0.04 0.06
RF 0.14 0.19 0.14 -0.01 -0.03 0.05
Kuz’min, V.E., et al. Molecular Informatics, 2011, 30, 593-603.
Step 8. Interpretation of QSAR models
Step 8. Interpretation of QSAR models
=–
A B C
Activitypred(A) Activitypred(B) Contribution(C)
f(A) = x f(B) = y W(C) = x – y
Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853
Universal approach
Step 8. Interpretation of QSAR models
Step 8. Interpretation of QSAR models
Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.
Acute oral toxicity on rats
?
?
Step 8. Interpretation of QSAR models
Step 8. Interpretation of QSAR models
Step 8. Interpretation of QSAR models
Interpretation results of valid predictive models should converge independent of:
interpretation approach
descriptors
machine learning method
All models are interpretable but not all end-points
Step 8. Interpretation of QSAR models
Overall QSAR workflow
Input data
Bioassays
Databases
PreprocessingFeature
engineeringModel
learningModel
validation
Classification
Regression
Clustering
Cross-validation
Bootstrap
Test set
Applicability Domain
Feature selection
Feature combination
Data normalization
Feature extraction
Interpretation
j
j
ii
z
xxx '
Step 8. Interpretation of QSAR models
2δ
δ)f(xδ)f(xC
f(x)δ)f(xC
δ)f(xf(x)C
forward difference
backward difference
central difference
Partial derivatives is used to calculate contributions of single features.It is applicable to any model built on meaningful (interpretable) descriptors.May be calculated analytically or numerically.
Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.
Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642