

Evaluation of a Targeted-QSPR Based Pure Compound Property Prediction System

Abstract

The use of the DD-TQSPR (Dominant-Descriptor Targeted QSPR) method for predicting a wide variety of constant properties is considered. Prediction of a property (the target property) for a particular compound (the target compound) is carried out in two stages. The first stage identifies a training set, typically of around 10 compounds for which target property data are available, whose members are structurally related to the target compound. The training set is selected from the target compound's similarity group, which is identified using a large database of molecular descriptors. The similarity between a potential predictive compound and the target compound is measured by the correlation coefficient between the vectors of their molecular descriptors. In the second stage, a Dominant Descriptor (DD) that is collinear with the target property values of the training set members is identified, and a linear relationship (the DD-TQSPR) between the DD and the target property values is derived. Finally, the target compound's DD value is introduced into the linear equation to predict its target property. The use of the proposed technique is demonstrated by predicting 34 constant properties (available in the DIPPR database) for a target compound.
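The two stages described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`pearson`, `select_training_set`, `fit_dominant_descriptor`) and all numbers are hypothetical, and the toy data are synthetic rather than taken from the DIPPR or Dragon databases.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (su * sv)

def select_training_set(target_desc, candidates, size):
    """Stage 1: rank candidate compounds by the correlation between their
    descriptor vectors and the target's, and keep the `size` most similar."""
    ranked = sorted(candidates,
                    key=lambda name: pearson(candidates[name], target_desc),
                    reverse=True)
    return ranked[:size]

def fit_dominant_descriptor(desc_rows, y):
    """Stage 2: pick the descriptor (column index) most collinear with the
    property values y, then fit y = b0 + b1 * zeta by least squares."""
    n_desc = len(desc_rows[0])
    best = max(range(n_desc),
               key=lambda j: abs(pearson([row[j] for row in desc_rows], y)))
    zeta = [row[best] for row in desc_rows]
    mz, my = sum(zeta) / len(zeta), sum(y) / len(y)
    b1 = (sum((z - mz) * (v - my) for z, v in zip(zeta, y))
          / sum((z - mz) ** 2 for z in zeta))
    b0 = my - b1 * mz
    return best, b0, b1

# Synthetic stage-1 data: descriptor vectors of a target and four candidates.
target = [1.0, 2.0, 3.0, 4.0]
pool = {"A": [1.1, 2.1, 3.0, 4.2], "B": [2.0, 4.0, 6.0, 8.0],
        "C": [4.0, 3.0, 2.0, 1.0], "D": [1.0, 3.0, 2.0, 4.0]}
training = select_training_set(target, pool, size=2)

# Synthetic stage-2 data: 3 training compounds x 2 descriptors, one property.
rows = [[1.0, 1.0], [5.0, 2.0], [2.0, 3.0]]
prop = [12.0, 14.0, 16.0]
dd_index, b0, b1 = fit_dominant_descriptor(rows, prop)
prediction = b0 + b1 * 2.5  # 2.5 is a made-up DD value for the target
print(training, dd_index, b0, b1, prediction)
```

In practice the descriptor database has thousands of columns, so the search in stage 2 is a scan over all of them for each target property; the sketch keeps that structure but on a toy scale.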

Mordechai Shacham and Inga Paster, Dept. of Chem. Engng., Ben Gurion University of the Negev, Beer-Sheva, Israel

Richard L. Rowley, Chem. Eng. Dept., Brigham Young University, Provo, UT

Neima Brauner and Gretah Tovarovski, School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The DD-TQSPR method was able to predict all 34 properties of the target compound within the experimental error level.

The appropriate training set (similarity group) is dependent on the target property.

The TSAE (Training Set Average Error) has proven to be a good indicator for the appropriateness of the training set and the prediction accuracy. This criterion is independent of the target-compound properties.

Similarity Group of n-hexyl mercaptan

Prediction of the NBT of n-hexyl mercaptan

Prediction of Properties of n-hexyl mercaptan - Summary of Results for 34 Properties

Conclusions

Property and Descriptor Databases

The property and molecular descriptor database contains 1798 compounds, for which 34 constant properties (source: DIPPR database http://dippr.byu.edu ) and 3224 descriptors (source: Dragon 5.5, http://www.talete.mi.it ) are available.

Most of the 3-D molecular structures were optimized in Gaussian 03 using B3LYP/6-311+G(3df,2p), a density functional method with a large basis set. The rest were optimized using HF/6-31G*, a Hartree-Fock ab initio method with a medium-sized basis set.

| No. | Symbol | Property description | Type | Units |
|---|---|---|---|---|
| 1 | ACEN | Acentric Factor | Defined | -- |
| 2 | AIT | Auto Ignition Temperature | | K |
| 3 | DC | Dielectric Constant | | -- |
| 4 | DM | Dipole Moment | | C*m |
| 5 | ENT | Absolute Entropy of Ideal Gas at 298.15 K and 100000 Pa | | J/kmol*K |
| 6 | FLTL | Lower Flammability Limit Temperature | | K |
| 7 | FLTU | Upper Flammability Limit Temperature | | K |
| 8 | FLVL | Lower Flammability Limit | | vol % in air |
| 9 | FLVU | Upper Flammability Limit | | vol % in air |
| 10 | FP | Flash Point | | K |
| 11 | GFOR | Gibbs Energy of Formation of Ideal Gas at 298.15 K and 100000 Pa | Defined | J/kmol |
| 12 | GSTD | Gibbs Energy of Formation in Standard State at 298.15 K and 100000 Pa | Defined | J/kmol |
| 13 | HCOM | Net Enthalpy of Combustion, Standard State (298.15 K) | | J/kmol |
| 14 | HFOR | Enthalpy of Formation of Ideal Gas at 298.15 K and 100000 Pa | | J/kmol |
| 15 | HFUS | Enthalpy of Fusion at Melting Point | | J/kmol |
| 16 | HSTD | Enthalpy of Formation in Standard State at 298.15 K and 100000 Pa | | J/kmol |
| 17 | HSUB | Heat of Sublimation at the Triple Point | | J/kmol |
| 18 | LVOL | Liquid Molar Volume at 298.15 K | | m^3/kmol |
| 19 | MP | Melting Point (1 atm) | | K |
| 20 | MW | Molecular Weight | | kg/kmol |
| 21 | NBP | Normal Boiling Point (1 atm) | | K |
| 22 | PAR | Parachor | | -- |
| 23 | PC | Critical Pressure | | Pa |
| 24 | RG | Radius of Gyration | Defined | m |
| 25 | RI | Refractive Index at 298.15 K | | -- |
| 26 | SOLP | Solubility Parameter at 298.15 K | Defined | (J/m^3)^(1/2) |
| 27 | SSTD | Absolute Entropy in Standard State at 298.15 K and 100000 Pa | | J/kmol*K |
| 28 | TC | Critical Temperature | | K |
| 29 | TPP | Triple Point Pressure | | Pa |
| 30 | TPT | Triple Point Temperature | | K |
| 31 | VC | Critical Volume | | m^3/kmol |
| 32 | VDWA | van der Waals Area | Defined | m^2/kmol |
| 33 | VDWV | van der Waals Reduced Volume | Defined | m^3/kmol |
| 34 | ZC | Critical Compressibility Factor | Defined | -- |

Constant Properties Included in the DIPPR Database

| Comp. No. | Name | Structural Formula | No. of C atoms | Corr. Coeff. |
|---|---|---|---|---|
| 1 | n-heptyl mercaptan | CH3(CH2)6SH | 7 | 0.977 |
| 2 | n-pentyl mercaptan | CH3(CH2)4SH | 5 | 0.975 |
| 3 | methyl n-butyl sulfide | CH3(CH2)3SCH3 | 5 | 0.962 |
| 4 | methyl pentyl sulfide | CH3S(CH2)4CH3 | 6 | 0.957 |
| 5 | ethyl propyl sulfide | CH3CH2S(CH2)2CH3 | 5 | 0.952 |
| 6 | n-nonyl mercaptan | CH3(CH2)8SH | 9 | 0.947 |
| 7 | n-butyl mercaptan | CH3(CH2)3SH | 4 | 0.946 |
| 8 | n-octyl mercaptan | CH3(CH2)7SH | 8 | 0.944 |
| 9 | 1-hexanol | CH3(CH2)5OH | 6 | 0.942 |
| 10 | di-n-propyl sulfide | CH3(CH2)2S(CH2)2CH3 | 6 | 0.942 |
| Target | n-hexyl mercaptan | CH3(CH2)5SH | 6 | |

Notes on the table:
- Compounds 1 and 2: immediate neighbors of the target in the homologous series.
- No. of C atoms column: range of the number of carbon atoms (4 to 9).
- Compound 9 (1-hexanol): oxygen atom instead of sulfur.

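$\tilde{y}_i$ - Property value (from DIPPR)
$p$ - No. of compounds in the training set
$\zeta_i$ - Descriptor

$$\mathrm{TSAE} = \frac{100}{p} \sum_{i=1}^{p} \frac{\left| \tilde{y}_i - \beta_0 - \beta_1 \zeta_i \right|}{\tilde{y}_i} \quad (\%)$$

Here $\beta_0$ and $\beta_1$ are the intercept and slope of the linear DD-TQSPR.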
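As a small worked illustration of the training set average error, the following Python sketch evaluates the mean absolute relative deviation (in percent) between the tabulated property values and a linear fit in one descriptor. The function name `tsae`, the parameter names `b0` and `b1`, and the numbers are hypothetical; taking the absolute value of the denominator is an assumption made here so that properties with negative values (for example enthalpies of formation) still give a positive error.

```python
def tsae(y_ref, zeta, b0, b1):
    """Training set average error in percent.

    y_ref : reference property values of the training set members
    zeta  : dominant-descriptor values of the same compounds
    b0,b1 : intercept and slope of the linear property-descriptor relation
    """
    p = len(y_ref)
    total = 0.0
    for y, z in zip(y_ref, zeta):
        y_calc = b0 + b1 * z
        # abs() in the denominator is an assumption for negative-valued properties
        total += abs(y - y_calc) / abs(y)
    return 100.0 * total / p

# Synthetic example: two compounds, each fitted with a 1 % relative deviation.
print(tsae([100.0, 200.0], [1.0, 2.0], b0=0.0, b1=101.0))
```

Because the TSAE uses only training set data, it can be evaluated before any target property value is known, which is why it serves as an a priori accuracy indicator.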

Attainable accuracy measures (independent of the target compound's property value):
1. DIPPR uncertainty values for the properties of the training set members;
2. Average (Uavg) and maximal (Umax) DIPPR uncertainty values;
3. Training Set Average Error (TSAE) = 0.51%

ESpm01r is a 2D descriptor belonging to the "edge adjacency indices" group, whose definition is: "Spectral moment 01 from edge adjacency matrix weighted by resonance integral".

Prediction error for the target = 0.55%

| No. | Symbol | GACC | Uavg (%) | Umax (%) | Descriptor | ρDP | TSAE | Property value (DIPPR) | Property value (Predicted) | Prediction uncertainty (%) | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ACEN | 0.959 | - | 25 | Mor03p | -0.998 | 1.1359 | 0.3681 | 0.3817 | 3.68 | |
| 2 | AIT | 0.956 | 25 | 25 | R1e+ | 0.85 | 1.702 | 520 | 513.79 | 1.195 | |
| 3 | DC | 0.953 | - | 100 | nO | 0.987 | 7.9 | 4.436 | 4.42 | 0.36 | 1 |
| 4 | DM | 0.959 | - | 10 | Mor12e | 0.833 | 1.51 | 5.10E-30 | 5.26E-30 | 3.07 | |
| 5 | ENT | 0.959 | 2.32 | 3 | Sv | 0.999 | 0.365 | 454600 | 452803 | 0.395 | |
| 6 | FLTL | 0.959 | - | 25 | Ss | 0.991 | 0.861 | 307 | 308.7 | 0.546 | |
| 7 | FLTU | 0.959 | - | 25 | Rte | 0.995 | 0.55 | 351 | 351.15 | 0.043 | |
| 8 | FLVL | 0.959 | 22 | 25 | ATS3p | -0.996 | 1.2 | 1 | 1.04 | 4.45 | |
| 9 | FLVU | 0.959 | 26 | 50 | ATS4e | -0.999 | 0.577 | 8.4 | 8.47 | 0.88 | |
| 10 | FP | 0.959 | - | 10 | Ss | 0.99 | 0.966 | 293.15 | 308.4 | 5.21 | |
| 11 | GFOR | 0.959 | 4.6 | 25 | Mv | 0.976 | 37.17 | 2.759E+07 | 3.220E+07 | 16.7 | 1 |
| 12 | GSTD | 0.959 | 4.4 | 25 | Mv | 0.9889 | 60.55 | 1.431E+07 | 1.959E+07 | 36.9 | 1 |
| 13 | HCOM | 0.959 | 1.24 | 3 | PHI | -0.99969 | 0.48246 | -4.176E+09 | -4.167E+09 | 0.21 | |
| 14 | HFOR | 0.959 | 2.2 | 3 | X0Av | 0.983 | 7.53 | -1.292E+08 | -1.486E+08 | 15 | 1 |
| 15 | HFUS | 0.959 | 3.6 | 25 | Mor23v | -0.987 | 7.34 | 1.801E+07 | 2.116E+07 | 17.5 | 2 |
| 16 | HSTD | 0.959 | 1.72 | 3 | X0av | 0.987 | 5.396 | -1.757E+08 | -1.961E+08 | 11.6 | 2 |
| 17 | HSUB | 0.959 | 11.5 | 25 | Ss | 0.98137 | 2.96 | 6.68E+07 | 6.92E+07 | 3.52 | 2 |
| 18 | LVOL | 0.959 | 1.4 | 3 | MW | 0.9991 | 0.653 | 0.141 | 0.1416 | 0.45 | |
| 19 | MP | 0.959 | 0.96 | 3 | Mor29m | -0.974 | 3.39 | 1.926E+02 | 2.115E+02 | 9.78 | 2 |
| 20 | MW | 0.959 | | | | | | | | | Molecular weight is included in both the property and the descriptor databases |
| 21 | NBP | 0.959 | 0.76 | 1 | ESpm01r | 0.997 | 0.509 | 450.094 | 451.2819 | 0.26 | |
| 22 | PAR | 0.936 | 4.11 | 10 | Sp | 0.997 | 1.6357 | - | 317.92 | - | 3 |
| 23 | PC | 0.959 | 8.1 | 10 | BEHP1 | -0.997 | 0.968 | 3080000 | 3058107 | 0.71 | |
| 24 | RG | 0.959 | 3 | 3 | H3D | 0.9947 | 1.225 | 4.25E-10 | 4.15E-10 | 2.4288 | |
| 25 | RI | 0.959 | 0.64 | 3 | Mv | 0.92 | 0.233 | 1.4473 | 1.4473 | 1.50E-04 | 1 |
| 26 | SOLP | 0.959 | 5 | 10 | SEigZ | -0.988 | 1.054 | 17450 | 17452 | 0.0127 | 1 |
| 27 | SSTD | 0.959 | 2.32 | 5 | AMR | 0.997 | 1.029 | 343210 | 340640 | 0.749 | |
| 28 | TC | 0.959 | 3.12 | 5 | Mor29v | -0.991 | 0.672 | 623 | 624.61 | 0.26 | |
| 29 | TPP | 0.959 | 42.5 | 100 | RDF100p | 0.8 | 2147 | 0.013096 | 0.02596 | 98 | 2 |
| 30 | TPT | 0.959 | | | | | | | | | Triple point temperature assumed to be the same as the normal melting temperature (MP) |
| 31 | VC | 0.959 | 19.8 | 25 | Sp | 0.997 | 1.1365 | 0.412 | 0.416 | 0.92 | |
| 32 | VDWA | 0.959 | 4 | 5 | X0sol | 0.9999 | 0.25 | 1.12E+09 | 1.12E+09 | 0.09 | |
| 33 | VDWV | 0.959 | 1.5 | 3 | Har | 0.9999 | 0.267 | 0.07963 | 0.07925 | 0.48 | |
| 34 | ZC | 0.959 | - | 25 | Mor08m | 0.98 | 0.731 | 0.245 | 0.243 | 0.874 | |

Comments: 1. n-hexanol outlier; 2. Different odd-even populations; 3. No data for target
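The prediction uncertainty reported for each property appears to be the relative deviation of the predicted value from the DIPPR value, in percent. The short Python check below recomputes it for two rows of the results (NBP and TC); the helper name `percent_error` is hypothetical, and the interpretation of the column as a relative deviation is an assumption that these two rows happen to support.

```python
def percent_error(reference, predicted):
    """Absolute relative deviation of `predicted` from `reference`, in percent."""
    return 100.0 * abs(predicted - reference) / abs(reference)

# DIPPR and predicted values copied from the NBP and TC rows of the results.
nbp_error = percent_error(450.094, 451.2819)
tc_error = percent_error(623.0, 624.61)
print(round(nbp_error, 2), round(tc_error, 2))
```

Comparing this number with the DIPPR uncertainty (Uavg, Umax) and the TSAE of the same row shows whether a prediction falls inside the attainable accuracy for that property.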

[Figure: Melting point (K) versus descriptor Mor29m for the training set and the target, with a linear fit to the training set: y = -575.38x + 158.53, R^2 = 0.9478.]

1-hexanol is a leverage point and an outlier
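A leverage point is a training compound whose descriptor value lies far from the rest, so it pulls the fitted line toward itself; an outlier is a compound whose property value deviates strongly from the fitted line. The Python sketch below shows one common way to flag leverage points in a one-descriptor linear fit. The descriptor values are synthetic (not the Mor29m values of the training set), the function name `leverages` is hypothetical, and the cutoff of 2(k+1)/n is a widely used rule of thumb rather than a criterion stated by the authors.

```python
def leverages(x):
    """Leverage h_ii of each point in a straight-line fit y = b0 + b1*x."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]

# Synthetic descriptor values: the last point sits far from the others.
x = [0.0, 1.0, 2.0, 3.0, 10.0]
h = leverages(x)

# Rule-of-thumb cutoff 2*(k+1)/n with k = 1 descriptor (an assumption here).
cutoff = 2.0 * 2.0 / len(x)
flagged = [i for i, h_i in enumerate(h) if h_i > cutoff]
print(h, flagged)
```

A compound that is both a leverage point and an outlier is the most damaging case for a small training set, since it distorts the slope while also inflating the TSAE.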

| No. | No. (3D) | Name | Structural Formula | No. of C atoms | Corr. Coeff. | Corr. Coeff. (3D) |
|---|---|---|---|---|---|---|
| 1 | 1 | n-heptyl mercaptan | CH3(CH2)6SH | 7 | 0.971 | 0.977 |
| 2 | 2 | n-pentyl mercaptan | CH3(CH2)4SH | 5 | 0.968 | 0.975 |
| 3 | 4 | methyl pentyl sulfide | CH3S(CH2)4CH3 | 6 | 0.965 | 0.957 |
| 4 | 3 | methyl n-butyl sulfide | CH3(CH2)3SCH3 | 5 | 0.951 | 0.962 |
| 5 | 8 | n-octyl mercaptan | CH3(CH2)7SH | 8 | 0.951 | 0.944 |
| 6 | | 2-pentanethiol | CH3CH(SH)CH2CH2CH3 | 5 | 0.943 | |
| 7 | 10 | di-n-propyl sulfide | CH3(CH2)2S(CH2)2CH3 | 6 | 0.942 | 0.942 |
| 8 | | tert-nonyl mercaptan | CH3(CH2)5C(CH3)2SH | 9 | 0.941 | |
| 9 | 5 | ethyl propyl sulfide | CH3CH2S(CH2)2CH3 | 5 | 0.940 | 0.952 |
| 10 | 7 | n-butyl mercaptan | CH3(CH2)3SH | 4 | 0.936 | 0.946 |

Improved training set, obtained by using only stable (non-3D) descriptors. It contains no oxygen-containing compounds.

The prediction accuracy can be enhanced by refinement of the training set and not by increasing the number of the descriptors in the TQSPR.

Further research is required for deriving training set refinement algorithms for various properties and various groups of compounds.