

Fluid Phase Equilibria 383 (2014) 108–114

Solubility modeling in three supercritical carbon dioxide, ethane and trifluoromethane fluids by one set of molecular descriptors

Reza Tabaraki *, Aref Toulabi
Department of Chemistry, Faculty of Science, Ilam University, Ilam, Iran

A R T I C L E I N F O

Article history:
Received 12 February 2014
Received in revised form 22 August 2014
Accepted 5 October 2014
Available online 8 October 2014

Keywords:
Solubility
Supercritical fluid
Carbon dioxide
Ethane
Trifluoromethane
ANN
MLR

A B S T R A C T

Quantitative structure property relationships (QSPR) were developed for the first time for predicting solubility in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures (5.1–36.2 MPa) and temperatures (308–343 K). A large number of descriptors were calculated and a subset of them was selected by genetic algorithm–multiple linear regression (GA–MLR). Four molecular descriptors and three experimental descriptors (pressure, temperature and melting point) were selected as the most feasible descriptors for prediction of solubility in the three supercritical fluids. The data set consisted of 14 molecules at various temperatures and pressures, giving 586 solubility data. The relationship between the selected descriptors and the solubility data was modeled by linear (multiple linear regression; MLR) and nonlinear (artificial neural network; ANN) methods. The artificial neural network architectures and their parameters were optimized simultaneously. The root mean square errors (RMSE) for supercritical carbon dioxide, ethane and trifluoromethane were 0.56, 0.68 and 0.72, respectively. The performance of the ANN models was also compared with that of the multiple linear regression models, and the results showed the superiority of the ANN over the MLR models.

© 2014 Elsevier B.V. All rights reserved.


1. Introduction

Supercritical fluid technology (SFT) finds applications in the chemical, biochemical, pharmaceutical and food processing industries. Supercritical fluids (SCFs) have diffusivities between those of gases and liquids, compressibilities comparable to those of gases, densities comparable to those of liquids and negligible surface tension. These properties make them attractive solvents for many industrial applications [1]. Solubility data in SCFs are important for the successful implementation of SFT, but the experimental determination of the solubility of organic solids in SCFs at various temperatures and pressures is expensive. Given the difficulties of solubility measurement in SCFs, the development of mathematical models to predict the solubility of new or even not-yet-synthesized compounds is essential for saving both time and money.

In the mathematical modeling of solubility data in supercritical fluids, the solubility systems can be categorized into three groups: a single solute in a supercritical fluid, mixed solutes in a supercritical fluid, and a single solute in mixed supercritical fluids or in a supercritical fluid plus an organic solvent. Different models have been presented for solubility in supercritical fluids and can be categorized into two groups: theoretical models (such as equations of state and semi-empirical equations) and empirical equations (such as density-based equations). Numerous theoretical models have been applied to solubility in binary solid–SC fluid systems, such as cubic equations of state, perturbed hard-sphere equations of state, lattice models, Kirkwood–Buff solution theory, Monte Carlo simulation and mean field theory [2]. Equations of state often require properties such as critical temperature, critical pressure and acentric factor that are not available for the solid solutes. Also, the models require one or more temperature-dependent parameters, which must be obtained from binary solid solubility data [3,4]. The empirical models are based on simple error minimization using the least squares method and, for most of them, there is no need to use physicochemical properties [5].

* Corresponding author. Tel.: +98 841 2227022; fax: +98 841 2227022.
E-mail addresses: [email protected], [email protected] (R. Tabaraki).

0378-3812/$ – see front matter © 2014 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.fluid.2014.10.008

One of the most successful approaches to the prediction of chemical properties from molecular structural information is the quantitative structure property/activity relationship. In QSPR, the molecular structure is translated into so-called molecular descriptors using chemical graph theory, information theory, quantum mechanics, etc., and mathematical equations relate the chemical structure to a wide variety of physical, chemical and biological properties [6,7]. QSPR models can be used to predict properties of new compounds. The major steps in constructing QSPR models are (i) the proper calculation of molecular descriptors, which satisfactorily describe the properties of a set of chemical substances; (ii) selection of the best descriptors; (iii) construction of a mathematical model having the best prediction of the property data; and (iv) validation of the quality and predictivity of the model [8].

Table 1
Molecular structure of organic solids.

No.  Molecule                    Molecular structure
1    1,4-Naphthoquinone          [structure drawing not reproduced]
2    2-Aminofluorene             [structure drawing not reproduced]
3    5-Aminoindole               [structure drawing not reproduced]
4    5-Hydroxyindole             [structure drawing not reproduced]
5    Indole 3-carboxylic acid    [structure drawing not reproduced]
6    Naphthalene                 [structure drawing not reproduced]
7    Phenanthrene                [structure drawing not reproduced]
8    Oxindole                    [structure drawing not reproduced]
9    Skatole                     [structure drawing not reproduced]
10   2-Naphthol                  [structure drawing not reproduced]
11   5-Methoxyindole             [structure drawing not reproduced]
12   Acridine                    [structure drawing not reproduced]
13   Benzoic acid                [structure drawing not reproduced]
14   Indole-3-aldehyde           [structure drawing not reproduced]

Genetic algorithms (GA) are optimization tools and randomized search techniques guided by the principles of evolution and natural genetics [9]. They have proved to be very efficient for the feature selection problem. Variables are represented as genes on a chromosome and are generally coded as binary strings. A population of strings is randomly created. In variable selection, each string is a row vector containing as many elements as there are variables. Each element is coded as 1 if the corresponding variable is selected and 0 if it is not. The fitness of a string is equal to the evaluation response, which is based on the predictive ability obtained with the given subset of selected variables. The method used in GA variable selection is designed to select the variables with the lowest prediction error. Through natural selection and the genetic operators, mutation and recombination, chromosomes with better fitness are found. Natural selection guarantees that chromosomes with the best fitness will propagate in future populations. Mutation allows new areas of the response surface to be explored. GA offers a generational improvement in the fitness of the chromosomes and, after many generations, will create chromosomes containing the optimized variable settings. GA has several advantages: it can escape from local optima on the response surface, it requires no prior knowledge or gradient information about the response surface, and it can be employed for a wide variety of optimization problems.
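The mechanics described in this paragraph (binary chromosomes, recombination, mutation and propagation of the fittest strings) can be illustrated with a minimal sketch. It is not the authors' MATLAB program: the two-point crossover, the parent selection from the fitter half of the population and the toy fitness function are assumptions made only for illustration, and in a real GA–MLR run the fitness would be derived from the prediction error of a regression model built on the selected variables.

```python
import random

def random_chromosome(n_vars, rng, min_on=2, max_on=9):
    """Binary string with between min_on and max_on genes set to 1."""
    k = rng.randint(min_on, min(max_on, n_vars))
    selected = set(rng.sample(range(n_vars), k))
    return [1 if i in selected else 0 for i in range(n_vars)]

def crossover(a, b, rng):
    """Two-point crossover: the parents exchange the segment between two cuts."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(chrom, rng, p_mut=0.01):
    """Flip each gene independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in chrom]

def run_ga(fitness, n_vars, pop_size=30, generations=40, p_cross=0.5, seed=0):
    """Return the best chromosome and the best fitness recorded per generation."""
    rng = random.Random(seed)
    pop = [random_chromosome(n_vars, rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_pop = ranked[:2]  # elitism: the two best strings survive unchanged
        while len(next_pop) < pop_size:
            a, b = rng.sample(ranked[:pop_size // 2], 2)  # parents from the fitter half
            if rng.random() < p_cross:
                a, b = crossover(a, b, rng)
            next_pop += [mutate(a, rng), mutate(b, rng)]
        pop = next_pop[:pop_size]
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Hypothetical toy fitness: reward selecting variables 1, 4 and 7, penalize extras.
INFORMATIVE = {1, 4, 7}

def toy_fitness(chrom):
    chosen = {i for i, g in enumerate(chrom) if g}
    return len(chosen & INFORMATIVE) - 0.5 * len(chosen - INFORMATIVE)

best, history = run_ga(toy_fitness, n_vars=12)
print(best, history[-1])
```

Because the two best strings are copied unchanged into each new generation, the best fitness recorded in `history` can never decrease, which is the generational improvement mentioned above.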

Table 2
Parameters of the genetic algorithm.

Parameter                                                         Value
Population size                                                   30 chromosomes
Number of variables per chromosome in the original population     2–9
Regression method                                                 MLR
Elitism                                                           True
Crossover                                                         Multipoint
Probability of crossover                                          50%
Mutation                                                          Multipoint
Probability of mutation                                           1%
Number of runs                                                    2000

Table 3
Calculated values of selected descriptors.

No.  Molecule                    Tm (K)  SIC3   Mor03m  Mor10m  C-026
1    1,4-Naphthoquinone          399     0.79   −1.37   0.39    0
2    2-Aminofluorene             404     0.91   −1.65   0.44    1
3    5-Aminoindole               405     0.91   −0.96   0.49    1
4    5-Hydroxyindole             381     0.97   −1.00   0.52    1
5    Indole 3-carboxylic acid    475     0.98   −1.08   0.36    0
6    Naphthalene                 353     0.67   −1.16   0.24    0
7    Phenanthrene                373     0.65   −1.76   0.70    0
8    Oxindole                    399     0.94   −1.23   0.37    0
9    Skatole                     368     0.92   −0.88   0.36    0
10   2-Naphthol                  395     0.92   −1.24   0.34    1
11   5-Methoxyindole             327     0.92   −1.15   0.59    1
12   Acridine                    381     0.89   −1.47   0.50    0
13   Benzoic acid                395     0.85   −1.11   0.24    0
14   Indole-3-aldehyde           469     0.97   −0.85   0.31    0

An artificial neural network consists of a large number of processing elements (neurons) and connections between them. A function f(x) maps a set of given input values x to some output values y = f(x), and a neural network tries to find the best possible approximation of f(x). This approximation is coded in the neurons of the network using weights that are associated with each neuron. The weights of a neural network are learned by an iterative procedure during which examples of correct input–output associations are shown to the network and the weights are modified so that the network starts to mimic this desirable input–output behavior. Learning in a neural network therefore means finding an appropriate set of weights. The ability to learn from examples and, based on this learning, to generalize to new situations is the most attractive feature of neural networks. One of the most popular learning algorithms is the back propagation algorithm, which is used with the feed forward layered network architecture. In a feed forward layered network, the processing elements are divided into disjoint subsets called layers (input, hidden and output layers). The input data flow through the network from the input layer through the hidden layer towards the output layer. The number of hidden layers in a feed forward network must be optimized [10]. Artificial neural networks have been used for representing non-linear functional relationships between variables. The ability of an ANN to learn and generalize the behavior of a complex, non-linear process makes it a powerful modeling tool.
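As a rough illustration of how such a network maps inputs to an output and how back propagation adjusts the weights, the following sketch implements one hidden layer with sigmoid neurons and a single linear output neuron. The function names, the squared-error loss and the plain gradient-descent update (without momentum) are assumptions chosen for brevity; they are not taken from the authors' program.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Return (output, hidden activations) for one input vector x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sum(w * h for w, h in zip(w_out, hidden)) + b_out  # linear output
    return output, hidden

def train_step(x, target, params, lr=0.1):
    """One back propagation update for the loss E = 0.5 * (output - target)**2."""
    w_hidden, b_hidden, w_out, b_out = params
    output, hidden = forward(x, w_hidden, b_hidden, w_out, b_out)
    err = output - target  # dE/d(output)
    # Error signal at each hidden neuron; the sigmoid derivative is h * (1 - h).
    delta = [err * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w - lr * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta)]
    new_b_hidden = [b - lr * d for b, d in zip(b_hidden, delta)]
    new_w_out = [w - lr * err * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out - lr * err
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out
```

Calling `train_step` repeatedly on input–output examples moves the weights so that the network output approaches the targets, which is the learning-by-example behavior described above.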

A wide variety of descriptors have been reported for use in QSPR analysis [11]. In most cases, it is more convenient to consider a linear relationship between solubility and descriptors by multiple linear regression (MLR) [12–15]. Artificial neural networks (ANN) [13–15], wavelet neural networks (WNN) [12–14] and Bayesian methods [16] are the most commonly used nonlinear models for predicting solubility in supercritical carbon dioxide. Bayesian neural networks can perform the descriptor selection and network architecture optimization automatically [17].

Based on our literature search, there is no report on the application of QSPR models to the prediction of solubility in different supercritical fluids by one set of descriptors. In this work, MLR (linear model) and ANN (nonlinear model) were used for predicting the solubility in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures and temperatures by one set of descriptors.

2. Data and methodology

2.1. Data set

The data set consists of 14 organic solids whose structures are given in Table 1. The experimental values of solubility at different temperatures and pressures were collected from the following references: 1,4-naphthoquinone, 2-aminofluorene, naphthalene, phenanthrene, 2-naphthol, acridine and benzoic acid [18,19]; 5-aminoindole, 5-hydroxyindole, indole 3-carboxylic acid, oxindole and indole-3-aldehyde [20]; skatole and 5-methoxyindole [21]. Based on our literature search, these compounds had solubility data in all three supercritical fluids: supercritical carbon dioxide (SC-CO2), supercritical ethane (SC-C2H6) and supercritical trifluoromethane (SC-CHF3). The data set consisted of 586 experimental solubility values at different temperatures and pressures. It was divided into a training set (SC-CO2: 80; SC-C2H6: 81; SC-CHF3: 44; total = 205 data) and a test set (SC-CO2: 155; SC-C2H6: 148; SC-CHF3: 78; total = 381 data). The test data consisted of two sets, so that the adequacy of the models could be evaluated by both interpolation and extrapolation (molecules within and beyond the samples used for constructing the model) [22]. The internal test set consisted of the same nine molecules as the training set, whose data were randomly divided between the two sets, so that the internal test data were not used in constructing the model (SC-CO2: 50; SC-C2H6: 52; SC-CHF3: 34; total = 136 data). The external test set comprised five organic solids whose molecular structures were new to the model (SC-CO2: 105; SC-C2H6: 96; SC-CHF3: 44; total = 245 data). The molecular structures of those molecules are given in Table 1 (the last five molecules).
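The subset sizes quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates the counts given in the text (with ethane written as C2H6); the dictionary layout and key names are illustrative.

```python
# Number of solubility data per fluid and subset, as quoted in Section 2.1.
counts = {
    "SC-CO2":  {"training": 80, "internal_test": 50, "external_test": 105},
    "SC-C2H6": {"training": 81, "internal_test": 52, "external_test": 96},
    "SC-CHF3": {"training": 44, "internal_test": 34, "external_test": 44},
}

subsets = ("training", "internal_test", "external_test")
totals = {s: sum(fluid[s] for fluid in counts.values()) for s in subsets}
test_per_fluid = {name: fluid["internal_test"] + fluid["external_test"]
                  for name, fluid in counts.items()}
test_total = totals["internal_test"] + totals["external_test"]
grand_total = totals["training"] + test_total

print(totals)          # {'training': 205, 'internal_test': 136, 'external_test': 245}
print(test_per_fluid)  # {'SC-CO2': 155, 'SC-C2H6': 148, 'SC-CHF3': 78}
print(test_total, grand_total)  # 381 586
```

The internal and external test sets add up to the 381 test data, and training plus test data give the 586 values of the full data set.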

2.2. Descriptor calculation

All molecules were drawn in the HyperChem 6 software (version 7.0, Hypercube, Inc.). The molecular structures were optimized by the semi-empirical AM1 method using the Fletcher–Reeves algorithm until the root mean square gradient was 0.01. The resulting geometries were loaded into the Dragon software [23] to calculate 1497 descriptors in 18 different classes.

2.3. Descriptor selection

The selection of relevant descriptors, which relate the solubility to the molecular structure, is an important step in constructing a predictive model. In this work, the following method was used to select the best descriptors among the 1497 calculated theoretical descriptors using the training sets:

(1) All descriptors with zero or identical values for all the molecules in the training set were eliminated.

(2) The collinearity of the descriptors was calculated and, for each pair of descriptors with a pairwise correlation coefficient above 0.9 (R > 0.9), the one with the larger correlation coefficient with the other descriptors was eliminated. Finally, 187 descriptors remained for the next step.

(3) Genetic algorithm–multiple linear regression (GA–MLR) was used to select the most relevant descriptors. To select the most relevant descriptors with GA, the evolution of the population was simulated. Each individual of the population, defined by a chromosome of binary values as the coding technique, represented a subset of descriptors. The number of genes on each chromosome was equal to the number of descriptors. The population of the first generation was selected randomly. A gene was given the value of one if its corresponding descriptor was included in the subset; otherwise, it was given the value of zero. The GA performs its optimization by variation and selection via the evaluation of the fitness function (h). The fitness function was used to evaluate alternative descriptor subsets, which were finally ordered according to the predictive performance of the related model. The fitness function used was the one proposed by Depczynski et al. [24]. The parameters of GA–MLR are shown in Table 2.

Table 4
Correlation coefficients between various descriptors.

Descriptor   Tm      SIC3    Mor03m   Mor10m   C-026
Tm           1
SIC3         0.44    1
Mor03m       0.23    0.53    1
Mor10m       −0.35   0.13    −0.48    1
C-026        −0.24   0.36    0.02     0.34     1

Table 5
RMSE and mean relative error of MLR and ANN models.

                     SC CO2          SC CHF3         SC C2H6
                     MLR    ANN      MLR    ANN      MLR    ANN
RMSE   Training      0.6    0.4      0.7    0.4      1.0    0.5
%RE    Training      10.2   6.6      8.0    4.7      13.0   5.8
RMSE   Int. test     0.6    0.4      1.0    0.9      0.8    0.6
%RE    Int. test     10.1   7.6      9.3    5.6      12.1   6.4
RMSE   Ext. test     0.8    0.7      1.1    0.8      1.0    0.8
%RE    Ext. test     9.2    8.7      11.9   8.4      10.8   8.6

Int. test: internal test set; Ext. test: external test set.

Fig. 1. Scatter plot of MLR and ANN models in supercritical ethane.

Fig. 2. Scatter plot of MLR and ANN models in supercritical trifluoromethane.
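Steps (1) and (2) of the descriptor selection procedure in Section 2.3 can be sketched as follows. This is a simplified illustration rather than the authors' implementation: descriptors are held in a hypothetical dictionary mapping a descriptor name to its values over the training molecules, and "the larger correlation with the other descriptors" is interpreted here as the larger mean absolute correlation, which is an assumption.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mean_abs_r(name, table, exclude):
    """Mean |r| between one descriptor and all descriptors outside `exclude`."""
    others = [o for o in table if o != name and o not in exclude]
    if not others:
        return 0.0
    return sum(abs(pearson(table[name], table[o])) for o in others) / len(others)

def find_correlated_pair(table, r_max):
    names = list(table)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(table[a], table[b])) > r_max:
                return a, b
    return None

def prefilter(table, r_max=0.9):
    # Step (1): drop descriptors that are constant over the training molecules.
    kept = {k: v for k, v in table.items() if len(set(v)) > 1}
    # Step (2): for each highly correlated pair, drop the member that is more
    # strongly correlated with the remaining descriptors.
    while (pair := find_correlated_pair(kept, r_max)) is not None:
        a, b = pair
        drop = a if mean_abs_r(a, kept, pair) >= mean_abs_r(b, kept, pair) else b
        del kept[drop]
    return kept

# Hypothetical toy descriptors for five molecules.
toy = {
    "d1": [1, 2, 3, 4, 5],
    "d2": [2, 4, 6, 8, 10.5],   # nearly collinear with d1
    "d3": [5, 1, 4, 2, 3],
    "d4": [0, 0, 0, 0, 0],      # constant
}
print(sorted(prefilter(toy)))
```

In this toy example the constant descriptor and one member of the collinear pair are removed, leaving two descriptors to pass on to the GA–MLR step.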

2.4. Modeling

The negative logarithm of the solubility (−ln y) was modeled from the selected descriptors for the three supercritical fluids by linear (multiple linear regression; MLR) and nonlinear (artificial neural network; ANN) methods. The neural networks had sigmoid functions as the hidden-layer transfer function and linear functions as the output transfer function. A back propagation learning algorithm was employed to adjust the weights. ANN models were developed using seven neurons in the input layer, corresponding to the seven selected descriptors. The output layer had one node, which predicts the solubility. In this work, the number of nodes in the hidden layer and the other parameters, except the number of iterations, were optimized simultaneously. In other words, the best value for each variable was not obtained by a “one at a time” optimization method. The number of nodes in the hidden layer was varied from 2 to 11, the learning rate from 0.001 to 0.1 with a step of 0.001 and the momentum from 0.1 to 0.99 with a step of 0.01. ANN models were constructed with all possible combinations of those three variables, and the root mean square error (RMSE) for the training and test sets was calculated for each ANN model. The training was stopped when the RMSE for the training set was low and the RMSE for the test set reached a minimum. Finally, the number of iterations was optimized with the optimized variable values.

Fig. 3. Scatter plot of MLR and ANN models in supercritical carbon dioxide.

Table 6
Optimized parameters of ANN models.

                   SC-CO2   SC-CHF3   SC-C2H6
Input neurons      7        7         7
Hidden neurons     3        3         3
Output neurons     1        1         1
Learning rate      0.088    0.077     0.006
Momentum           0.25     0.75      0.95
Iterations         5000     7000      14000
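To give a sense of the size of the simultaneous search described in Section 2.4, the grid of hidden-node counts, learning rates and momentum values can be enumerated directly. The sketch below only builds the grid from the ranges and step sizes stated in the text; training a network at each grid point is omitted, and the variable names are illustrative.

```python
from itertools import product

hidden_nodes = list(range(2, 12))                              # 2, 3, ..., 11
learning_rates = [round(0.001 * k, 3) for k in range(1, 101)]  # 0.001 ... 0.100
momenta = [round(0.01 * k, 2) for k in range(10, 100)]         # 0.10 ... 0.99

# Lazily enumerate every (hidden nodes, learning rate, momentum) combination.
n_combinations = sum(1 for _ in product(hidden_nodes, learning_rates, momenta))
print(n_combinations)  # 90000

# The optimized settings reported in Table 6 all lie on this grid.
reported = {
    "SC-CO2": (3, 0.088, 0.25),
    "SC-CHF3": (3, 0.077, 0.75),
    "SC-C2H6": (3, 0.006, 0.95),
}
grid = set(product(hidden_nodes, learning_rates, momenta))
print(all(setting in grid for setting in reported.values()))  # True
```

With 10 × 100 × 90 = 90,000 candidate networks per fluid, an exhaustive search of this kind is feasible only because each network is small (seven inputs and at most 11 hidden nodes).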

2.5. Software

All molecules were drawn in the HyperChem 6 software (version 7.0, Hypercube, Inc.). Dragon software was used to calculate molecular descriptors in 18 different classes. The multiple linear regression, genetic algorithm and artificial neural network programs were written by the authors in the MATLAB environment (version 6.1, Mathworks, Inc.).

3. Results and discussion

3.1. Descriptor selection and MLR model

The GA–MLR technique was used to select the best general molecular descriptors for the three supercritical fluids. For this purpose, the three training sets were combined into one new training set. The best equation was as follows:

$$-\ln y = -8.224(\pm 2.367) - 0.019(\pm 0.007)\,T - 0.096(\pm 0.007)\,P + 0.038(\pm 0.004)\,T_m + 5.727(\pm 1.179)\,\mathrm{SIC3} - 1.496(\pm 0.422)\,\mathrm{Mor03m} + 4.427(\pm 0.697)\,\mathrm{Mor10m} + 1.033(\pm 0.220)\,\mathrm{C026} \tag{1}$$

n = 205; R2 = 0.892; std. error = 0.918; F = 231.777

The values of the selected descriptors are shown in Table 3. From Table 4, it can be seen that the absolute value of the correlation coefficient for each pair of descriptors was less than 0.55, which means that the selected descriptors were essentially independent.

Based on the work of Topliss [25], the relation between the number of observations and the number of variables for a chance correlation level Pc < 0.01 and r2 ≥ 0.9 can be evaluated by extrapolation. For 187 variables (this work), 208 observations are needed at a chance correlation level of 1%; in this work, 205 observations were used.

Although GA–MLR was used as the feature selection method, the selected general descriptors were also used to develop an MLR model for each supercritical fluid. The best equations were:

$$-\ln y\,(\text{SC-CO}_2) = -6.824(\pm 2.862) - 0.025(\pm 0.008)\,T - 0.128(\pm 0.011)\,P + 0.042(\pm 0.005)\,T_m + 4.742(\pm 1.439)\,\mathrm{SIC3} - 1.944(\pm 0.510)\,\mathrm{Mor03m} + 2.166(\pm 0.884)\,\mathrm{Mor10m} + 1.378(\pm 0.268)\,\mathrm{C026} \tag{2}$$

n = 80; R2 = 0.949; std. error = 0.665; F = 189.908

$$-\ln y\,(\text{SC-C}_2\text{H}_6) = -12.709(\pm 3.887) - 0.011(\pm 0.012)\,T - 0.073(\pm 0.012)\,P + 0.035(\pm 0.007)\,T_m + 8.739(\pm 2.187)\,\mathrm{SIC3} - 0.926(\pm 0.749)\,\mathrm{Mor03m} + 6.674(\pm 1.154)\,\mathrm{Mor10m} + 0.508(\pm 0.398)\,\mathrm{C026} \tag{3}$$

n = 81; R2 = 0.887; std. error = 1.009; F = 82.243

$$-\ln y\,(\text{SC-CHF}_3) = -6.991(\pm 5.893) - 0.005(\pm 0.019)\,T - 0.096(\pm 0.013)\,P + 0.034(\pm 0.006)\,T_m + 1.635(\pm 1.803)\,\mathrm{SIC3} - 0.833(\pm 0.896)\,\mathrm{Mor03m} + 3.810(\pm 1.303)\,\mathrm{Mor10m} + 1.345(\pm 0.358)\,\mathrm{C026} \tag{4}$$

n = 44; R2 = 0.890; std. error = 0.731; F = 41.505

The predictive ability of the MLR models was evaluated by calculating the RMSE and mean relative error for the training and test sets (Table 5). The training set was used for model generation and the test sets were used to evaluate the predictive ability of the models. The performance of the MLR models in the three supercritical fluids was also evaluated by plotting the calculated versus experimental values of the solubility for the training set, internal test set and external test set (Figs. 1–3).
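As a worked example of how the fluid-specific MLR equations are applied, the sketch below evaluates Eq. (2) for acridine in SC-CO2 at 318 K and 20 MPa, using the descriptor values from Table 3. The dictionary layout and function name are illustrative, the coefficients are the rounded published values with negative signs for the intercept, T, P and Mor03m terms, and the uncertainties are ignored.

```python
# Coefficients of Eqs. (2)-(4); "const" is the intercept.
COEFFS = {
    "SC-CO2":  {"const": -6.824, "T": -0.025, "P": -0.128, "Tm": 0.042,
                "SIC3": 4.742, "Mor03m": -1.944, "Mor10m": 2.166, "C026": 1.378},
    "SC-C2H6": {"const": -12.709, "T": -0.011, "P": -0.073, "Tm": 0.035,
                "SIC3": 8.739, "Mor03m": -0.926, "Mor10m": 6.674, "C026": 0.508},
    "SC-CHF3": {"const": -6.991, "T": -0.005, "P": -0.096, "Tm": 0.034,
                "SIC3": 1.635, "Mor03m": -0.833, "Mor10m": 3.810, "C026": 1.345},
}

# Descriptor values for acridine (molecule 12 in Table 3).
ACRIDINE = {"Tm": 381, "SIC3": 0.89, "Mor03m": -1.47, "Mor10m": 0.50, "C026": 0}

def neg_ln_y(fluid, T, P, solute):
    """Predicted -ln(y) for a solute at temperature T (K) and pressure P (MPa)."""
    c = COEFFS[fluid]
    value = c["const"] + c["T"] * T + c["P"] * P
    for name, x in solute.items():
        value += c[name] * x
    return value

pred = neg_ln_y("SC-CO2", T=318, P=20, solute=ACRIDINE)
print(round(pred, 2))  # 6.83
```

The result, about 6.83, is close to the MLR value of 6.96 listed for this point in Table 7 (experimental value 6.79); the small gap is presumably due to rounding of the published coefficients.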


Table 7
Some of the experimental and calculated values of −ln(y) using MLR and ANN models for the external test set.

Compounds           T(K)  P(MPa)  −ln(y)  Calc. values (MLR)  Absolute error (MLR)  Calc. values (ANN)  Absolute error (ANN)

SC-C2H6
Acridine            308   6.17    9.01    9.56                0.55                  8.83                −0.18
                    328   8.23    8.25    9.20                0.95                  8.11                −0.14
                    343   8.22    8.63    9.04                0.41                  8.61                −0.02
Benzoic acid        308   11.10   6.94    7.28                0.34                  6.93                −0.01
                    318   20.15   6.17    6.51                0.34                  6.20                0.02
                    343   36.35   4.64    5.06                0.42                  4.44                −0.20
2-Naphthol          308   7.35    9.43    9.46                0.03                  9.66                0.23
                    328   6.60    10.56   9.30                1.26                  10.83               0.27
                    343   9.30    8.54    8.95                0.41                  9.09                0.55
5-Methoxyindole     308   5.85    8.29    8.74                0.45                  7.80                −0.49
                    308   16.30   7.25    7.98                0.73                  7.25                −0.00
                    308   19.2    7.16    7.76                0.60                  7.24                0.08
Indole-3-aldehyde   308   6.59    11.41   11.51               0.10                  11.35               −0.06
                    308   8.35    11.30   11.38               0.08                  10.43               −0.87
                    308   13.9    11.11   10.97               0.14                  10.05               −1.06

SC-CHF3
Acridine            318   13.45   7.48    7.67                0.19                  7.48                0.00
                    318   21.7    7.23    6.88                0.35                  7.23                0.00
                    328   21.7    6.99    6.83                0.16                  7.14                0.14
Benzoic acid        318   8.32    8.11    7.28                0.83                  8.51                0.40
                    318   13.1    7.22    6.82                0.40                  6.93                −0.29
                    328   13.1    6.98    6.77                0.21                  7.08                0.10
2-Naphthol          328   13.55   8.03    8.68                0.65                  7.58                −0.45
                    328   24.1    7.47    7.67                0.20                  7.73                0.26
                    343   24.1    6.87    7.59                0.72                  7.08                0.21
5-Methoxyindole     308   6.69    7.02    8.00                0.98                  6.68                −0.34
                    308   7.8     6.56    7.89                1.33                  6.69                0.13
                    308   9.65    6.16    7.72                1.56                  6.72                0.56
Indole-3-aldehyde   308   10.66   10.55   9.86                0.69                  10.10               −0.45
                    308   12.7    10.41   9.67                0.74                  9.95                −0.46
                    308   15.01   10.36   9.45                0.91                  9.58                −0.78

SC-CO2
Acridine            318   20      6.79    6.96                0.17                  6.76                −0.03
                    328   12.25   8.55    7.70                0.85                  8.41                −0.14
                    343   19.97   6.73    6.34                0.39                  6.70                −0.03
Benzoic acid        318   12      6.77    7.13                0.36                  6.73                −0.04
                    343   15.12   6.48    6.10                0.38                  6.52                0.04
                    308   15.1    6.35    6.98                0.63                  6.46                0.10
2-Naphthol          328   12      8.57    9.05                0.48                  8.60                0.03
                    343   11.15   9.25    8.79                0.46                  9.42                0.17
                    343   20.1    6.95    7.65                0.70                  6.99                0.04
5-Methoxyindole     308   10.9    6.73    7.18                0.45                  7.15                0.42
                    308   13.7    6.50    6.82                0.32                  6.53                0.03
                    308   15.9    6.47    6.54                0.07                  6.15                −0.32
Indole-3-aldehyde   308   8.72    11.70   11.14               0.56                  10.50               −1.20
                    308   10.76   11.55   10.88               0.67                  10.33               −1.22
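The two error measures reported in Table 5 can be illustrated on a few rows of Table 7. The sketch below uses the three acridine points in SC-CO2; the paper does not spell out its formula for the mean relative error, so the definition used here (mean of |calculated − experimental| / |experimental|, in percent) is an assumption.

```python
import math

def rmse(experimental, calculated):
    """Root mean square error between two equally long lists."""
    n = len(experimental)
    return math.sqrt(sum((c - e) ** 2 for e, c in zip(experimental, calculated)) / n)

def mean_relative_error(experimental, calculated):
    """Mean of |calc - exp| / |exp|, expressed in percent (assumed definition)."""
    n = len(experimental)
    return 100.0 * sum(abs(c - e) / abs(e) for e, c in zip(experimental, calculated)) / n

# -ln(y) for acridine in SC-CO2 (Table 7): experimental, MLR and ANN values.
exp_vals = [6.79, 8.55, 6.73]
mlr_vals = [6.96, 7.70, 6.34]
ann_vals = [6.76, 8.41, 6.70]

print(rmse(exp_vals, mlr_vals), rmse(exp_vals, ann_vals))
print(mean_relative_error(exp_vals, mlr_vals), mean_relative_error(exp_vals, ann_vals))
```

For these three points the ANN errors are several times smaller than the MLR errors, in line with the trend in Table 5, although three rows are far too few to reproduce the tabulated values.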


3.2. Interpretation of the selected descriptors

By interpreting the descriptors contained in the model, it is possible to gain useful chemical insights into the modeled property. The results show that four calculated and three experimental descriptors are the most feasible ones. The calculated descriptors were one topological descriptor (SIC3: structural information content, neighborhood symmetry of 3-order), one atom-centered fragment descriptor (C026: R-CX-R) and two 3D-MoRSE descriptors (Mor03m: 3D-MoRSE signal 3 weighted by atomic masses; Mor10m: 3D-MoRSE signal 10 weighted by atomic masses). The selection of pressure, temperature and melting temperature of the solids as the experimental descriptors is important. The density of the supercritical fluid, which is the key parameter for the solubility of different compounds, is related to both the temperature and the pressure of the supercritical fluid.

Topological descriptors (SIC3: structural information content, neighborhood symmetry of 3-order) are based on a graph representation of the molecule. They are numerical quantifiers of molecular topology obtained by applying algebraic operators to matrices representing molecular graphs, and their values are independent of vertex numbering or labeling. They can be sensitive to one or more structural features of the molecule such as size, shape, symmetry, branching and cyclicity, and can also encode chemical information concerning atom type and bond multiplicity [11].

The next descriptor is C026, one of the atom-centered fragment descriptors. C-026 corresponds to R-CX-R and represents the number of substituent groups bonded to the benzene ring, excluding those that are bonded through a carbon atom, i.e., C(benzene ring)–C(substituent group) [26].

The third and fourth descriptors are Mor03m (3D-MoRSE signal 3 weighted by atomic masses) and Mor10m (3D-MoRSE signal 10 weighted by atomic masses). These 3D-MoRSE descriptors (3D molecule representation of structures based on electron diffraction) are derived from infrared spectra simulation using a generalized scattering function [11].
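The generalized scattering function behind the 3D-MoRSE descriptors is commonly written as $I(s) = \sum_{i=2}^{N}\sum_{j=1}^{i-1} w_i w_j \sin(s\,r_{ij})/(s\,r_{ij})$, where $r_{ij}$ is the interatomic distance and $w_i$ an atomic weighting property. The sketch below evaluates this sum for a set of 3D coordinates. Two conventions are assumed here and should be checked against the Dragon documentation before use: that signal k corresponds to s = k − 1 Å⁻¹ (so Mor03m uses s = 2 and Mor10m uses s = 9), and that the mass weights are atomic masses scaled by the mass of carbon. The three-atom geometry is a hypothetical example, not one of the studied molecules.

```python
import math
from itertools import combinations

CARBON_MASS = 12.011  # assumed scaling reference for the mass weights

def morse_signal(coords, masses, s):
    """Sum of w_i * w_j * sin(s*r_ij)/(s*r_ij) over all atom pairs."""
    weights = [m / CARBON_MASS for m in masses]
    total = 0.0
    for (ri, wi), (rj, wj) in combinations(zip(coords, weights), 2):
        x = s * math.dist(ri, rj)
        total += wi * wj * (1.0 if x == 0 else math.sin(x) / x)  # sin(x)/x -> 1 as x -> 0
    return total

# Hypothetical three-atom fragment (coordinates in angstroms): C, C, O.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0)]
masses = [12.011, 12.011, 15.999]

mor03m_like = morse_signal(coords, masses, s=2.0)  # signal 3 under the assumed mapping
mor10m_like = morse_signal(coords, masses, s=9.0)  # signal 10 under the assumed mapping
print(mor03m_like, mor10m_like)
```

Because each pair contributes a damped oscillation in s·r, different signal indices emphasize different interatomic distance ranges, which is how two signals of the same family can carry distinct structural information.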

The selected descriptors and models in this work were compared with those of previous works [12–16]. In most of these works, temperature and pressure were used because the density of the supercritical fluid is a function of these parameters and has a significant effect on solubility. Only in this work and in the work of Hemmateenejad [15] was a large number of descriptors calculated. Different feature selection methods, such as stepwise MLR [12–14], the combined data splitting feature selection strategy (CDFS) [15], MLR with expectation maximization (MLREM) [16] and GA–MLR (this study), were used to select the best descriptors. Both linear and nonlinear models were used in all of these studies. In this work and the previous works [12–16], topological, geometrical, charge, atom-centered fragment, functional group, WHIM and 3D-MoRSE descriptors were the most frequently selected descriptors.

3.3. ANN model

In order to improve the predictive ability, nonlinear model suchas ANN was constructed. The best MLR model had sevendescriptors. ANN models were developed using seven neuronsin input layer corresponding to the mentioned seven descriptors.Mean relative error and RMSE of ANN models and optimizedparameters of ANN models are shown in Tables 5 and 6,respectively. The capability of the models were also evaluatedfor prediction of the solubility of five organic solids, their data werenot used in any of the previous data sets. These organic solids were2-naphthol, acridine, benzoic acid, indole-3-aldehyde and 5-methoxyindole. The structures of these organic solids are givenin Table 1. The calculated descriptors for these organic solids arepresented in Table 3. Some of the results of the prediction ability ofthe models for external validation set are given in Table 7.

The performance of the ANN models in the three supercritical fluids was evaluated by plotting the calculated versus experimental values of the solubility for the training and test sets (Figs. 1–3).
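The agreement between calculated and experimental values can also be summarized numerically. The following is a minimal sketch of the two error measures named above, assuming the usual definitions (root of the mean squared residual, and mean absolute residual relative to the experimental value, in percent). The exact definitions used in this work may differ, and the numbers in the example are made up.

```python
import math

def rmse(y_exp, y_calc):
    """Root mean square error between experimental and calculated values."""
    n = len(y_exp)
    return math.sqrt(sum((c - e) ** 2 for e, c in zip(y_exp, y_calc)) / n)

def mean_relative_error(y_exp, y_calc):
    """Mean of |calc - exp| / |exp|, in percent (assumed definition)."""
    n = len(y_exp)
    return 100.0 * sum(abs(c - e) / abs(e) for e, c in zip(y_exp, y_calc)) / n

# Made-up values, for illustration only.
y_experimental = [2.0, 4.0]
y_calculated = [1.0, 5.0]
print(rmse(y_experimental, y_calculated))                 # 1.0
print(mean_relative_error(y_experimental, y_calculated))  # 37.5
```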

4. Conclusions

MLR and ANN models were developed for predicting the solubility of 14 organic solids in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures and temperatures. Some crucial implications of this study are listed below:

a This study is the first report on the application of QSPR models for the prediction of solubility in different supercritical fluids by one set of descriptors.

b The capabilities of the models were also evaluated for prediction of the solubility of five organic solids with molecular structures that were new to the models.

c The performance of the ANN model was compared with that of the MLR model. The results in Table 5 and Figs. 1–3 indicate the superiority of the ANN models over the MLR models. It can be concluded that the relationship between the molecular descriptors and the solubility values in supercritical fluids is nonlinear, as mentioned in previous works [12–16]. Similar results were obtained for supercritical C2H6 and CHF3.

d The root mean square errors (RMSE) for supercritical carbon dioxide, ethane and trifluoromethane were 0.56, 0.68 and 0.72, respectively.

References

[1] A. Akgerman, G. Madras, Fundamentals of solids extraction by supercritical fluids, in: E. Kiran, J.M.H.L. Sengers (Eds.), Supercritical Fluids: Fundamentals and Applications, Kluwer Academic Publishers, Dordrecht, 1994.

[2] K.P. Johnston, D.G. Peck, S. Kim, Ind. Eng. Chem. Res. 29 (1989) 1115–1125.

[3] G.S. Foster, J.S.L. Gurdial, K.D. Liong, S.S.T. Tilly, H. Ting, Singh, J.H. Lee, Ind. Eng. Chem. Res. 30 (1991) 1955–1964.

[4] J. Mendez-Santiago, A.S. Teja, Fluid Phase Equilibria 158–160 (1999) 501–510.

[5] A. Jouyban, H.K. Chan, N.R. Foster, J. Supercrit. Fluids 24 (2002) 19–35.

[6] M. Karelson, V.S. Lobanov, Chem. Rev. 96 (1996) 1027–1044.

[7] T. Le, V.C. Epa, F.R. Burden, D.A. Winkler, Chem. Rev. 112 (2012) 2889–2919.

[8] H.L. Engelhardt, P.C. Jurs, J. Chem. Inf. Comput. Sci. 37 (1997) 478–484.

[9] D.E. Goldberg, Genetic Algorithms: Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.

[10] J. Zupan, J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.

[11] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, John Wiley & Sons, 2008.

[12] T. Khayamian, M. Esteki, J. Supercrit. Fluids 32 (2004) 73.

[13] R. Tabaraki, T. Khayamian, A.A. Ensafi, Dyes Pigments 73 (2007) 230.

[14] R. Tabaraki, T. Khayamian, A.A. Ensafi, J. Mol. Graph. Model. 25 (2006) 46.

[15] B. Hemmateenejad, M. Shamsipur, R. Miri, M. Elyasi, F. Foroghinia, H. Sharghi, Anal. Chim. Acta 610 (2008) 25.

[16] A. Tarasova, F. Burden, J. Gasteiger, D.A. Winkler, J. Mol. Graph. Model. 28 (2010) 593–597.

[17] F.R. Burden, D.A. Winkler, QSAR Comb. Sci. 28 (2009) 645–653.

[18] W.J. Schmitt, R.C. Reid, J. Chem. Eng. Data 31 (1986) 204–212.

[19] M. McHugh, M.E. Paulaitis, J. Chem. Eng. Data 25 (1980) 326–329.

[20] K. Nakatani, J. Supercrit. Fluids 2 (1989) 9–14.

[21] S. Sako, K. Shibata, K. Ohgaki, T. Katayama, J. Supercrit. Fluids 2 (1989) 3–8.

[22] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.

[23] R. Todeschini, Milano Chemometrics and QSPR Group. http://www.disatuni-mib.it/vhml/.

[24] U. Depczynski, V.J. Frost, K. Molt, Anal. Chim. Acta 420 (2000) 217–227.

[25] J.G. Topliss, R.P. Edwards, J. Med. Chem. 22 (1979) 1238.

[26] B. Rasulev, H. Kusic, D. Leszczynska, J. Leszczynski, N. Koprivana, J. Environ. Monit. 12 (2010) 1037–1044.