comp mat science baumes collet


used genetic programming tree-based form is reported. From both catalysis and GP knowledge, an adapted representation for the optimization and discovery of solid catalysts is derived. It is demonstrated how such a personalized GP representation allows searching in open-ended spaces of catalytic structures. The method is shown to be efficient for the discovery of new structural forms of catalysts, while the parametric optimization can be partially transferred to common fast local approaches. Among the different tests, an example is given through a multi-objective mathematical benchmark. This choice is due to the lack of studies tackling such difficulty, which reflects most real-world optimization problems. Finally, the genetic programming paradigm is compared to the commonly used genetic algorithms (GAs).

2. The problematic of data representation

The way the experiments are encoded, i.e., the possible representation spaces according to the features investigated, determines which algorithms can take advantage of such representations to provide the expected form of results. For example, in pharmaceutical studies, molecular descriptors are the final result of logical and mathematical procedures which transform chemical information encoded within a symbolic representation of a molecule into useful numbers. However, in contrast to molecules, a solid cannot easily be represented in a meaningful way in a computer [2]. If only the composition of a solid is encoded, many important factors are lost. Parameters such as preparation modes and heat treatments greatly influence catalytic structure and properties. Considering a hydrodesulphurization (HDS) study, one must take into account the order in the synthesis sequence, since results are different if Ni is impregnated first and then Mo, or vice versa, or even simultaneously [3]. Another example dealing with zeolite activation [4] demonstrates that a fast calcination procedure without careful ion-exchange and drying produces steaming with the corresponding dealumination. On the other hand, when calcinations are carried out with different progressive steps, a better control of the final product properties is achieved. Consequently, the necessity of an efficient representation in heterogeneous catalysis that can fully handle such parameters, i.e., the order of element additions or solid modifications in the synthesis sequence, the detailed description of linked thermal programs, the parameters related to the different synthesis methods used during the preparation phase, and the precursors, should be emphasized. Every time there is communication between chemists, HT apparatus, databases, and algorithms (see Fig. 1), an adequate representation of the information to be transferred is unavoidable. Each step in the HT process will be detailed, focusing on data representation.

2.1. Databases connection

The management of data, from storage to retrieval, is of great importance in combinatorial studies. Saupe et al. [5] stress the crucial role of informatics in HT experimentation applied to material science, specifying that "Every parameter during preparation and testing may be a factor crucial for the performance of the material. As a consequence, all experimental parameters should be controlled or at least recorded to be able to identify important correlations." All companies, Symyx, hte AG, Avantium, and recently, Bayer, DPI, and Dow, claim to have well-developed systems of this kind [6]. On the other hand, only a few academic groups have tackled the problem of experimental data storage and management considering a broad range of catalysts [7]. In StoCat [8], an underlying structure supporting and organizing experimental data in an interconnected way successfully minimizes the loss of scientific information which might be used for further data treatments. The general DB challenge encountered when dealing with solids is the antagonistic combination of accuracy and flexibility needed to accommodate most reactions using diverse materials. Although this challenging task is effectively supported through the StoCat scheme, after the software was used as a central DB by a consortium of 10 European organizations, including academia and industries (fifth PCRD Combicat), it appeared that the graphical user interface (GUI) is split into too many steps (i.e., windows), making the input of data complicated. This problem is due to the

Fig. 1. The iterative procedure for heterogeneous catalyst discovery and optimization involving high-throughput technology, data storage in databases, data mining and statistical algorithms, and chemical knowledge: interpretation, intuition. hITeQ is the new workflow platform built in ITQ supporting the communication between apparatus and databases via various formats, among them AniML.

2 L.A. Baumes, P. Collet / Computational Materials Science xxx (2008) xxxxxx

COMMAT 2788 No. of Pages 14, Model 5G

10 July 2008 Disk UsedARTICLE IN PRESS

Please cite this article in press as: L.A. Baumes, P. Collet, Comput. Mater. Sci. (2008), doi:10.1016/j.commatsci.2008.03.051



difficulty to support the architecture of complex relational schemes through simple grids in interfaces. Such a dilemma is obviously inherent to every database: the DB structure would not be necessary if a simple grid or Excel(TM) spreadsheet were sufficient. Even if the input of data can be facilitated by the use of norms such as the definition of an XML format, difficulties are more troubling considering the retrieval of data. Let us consider a query which aims at retrieving information about the object O1, for example "catalyst". The information to be retrieved may belong to another object O2, for example "element". One instance of O1 can be linked to more than one instance of O2, and vice versa; for example, a given catalyst may contain various elements, and a given element may belong to various catalysts. The corresponding relational DB scheme is made of a so-called "intermediary" table between O1 and O2 (see Supplementary Material). In such a case, the query returns multiple lines for each instance of O1. A given catalyst made of Cu-Fe-Mg is described with three consecutive lines (Table 1). Whatever the DB type and the query language, the retrieval of data usually does not provide a usable file for algorithms. A more disturbing case is faced when the creation of the file is not possible, for example if the desired fields, involved in the description of the experiments, do not appear for every experiment. Table 1c stresses such a problem.

As a result, even if the build-up of powerful DBs does permit the handling of complex catalytic data structures, the input of data by the user may be complicated and time-consuming, and the extraction of the contained information is limited and does not provide a workable file for algorithms.
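The multiple-line effect described in this section can be reproduced with a small sketch. The Python example below uses an in-memory SQLite database with a hypothetical three-table schema (the table and column names are illustrative assumptions, not the StoCat schema), loads the first array of Table 1, and then collapses the query result into one record per catalyst, which is closer to the form an algorithm expects.

```python
import sqlite3

# Hypothetical schema: two objects (catalyst, element) and the intermediary
# table linking them. Names are illustrative, not taken from StoCat.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalyst (id INTEGER PRIMARY KEY);
    CREATE TABLE element  (symbol TEXT PRIMARY KEY);
    CREATE TABLE catalyst_element (
        catalyst_id INTEGER REFERENCES catalyst(id),
        symbol      TEXT    REFERENCES element(symbol),
        add_order   INTEGER
    );
    INSERT INTO catalyst VALUES (1), (2);
    INSERT INTO element  VALUES ('Cu'), ('Fe'), ('Mg'), ('V');
    INSERT INTO catalyst_element VALUES
        (1, 'Cu', 1), (1, 'Fe', 1), (1, 'Mg', 2),
        (2, 'Mg', 1), (2, 'V', 2),  (2, 'Fe', 3);
""")

# The join returns several lines per catalyst, as in Table 1.
rows = conn.execute("""
    SELECT c.id, ce.symbol, ce.add_order
    FROM catalyst AS c
    JOIN catalyst_element AS ce ON ce.catalyst_id = c.id
    ORDER BY c.id, ce.add_order, ce.symbol
""").fetchall()


def pivot(rows):
    """Collapse (catalyst, element, order) rows into one record per catalyst."""
    elements = sorted({symbol for _, symbol, _ in rows})
    records = {}
    for catalyst_id, symbol, add_order in rows:
        # Every record gets the same columns; absent elements stay None.
        record = records.setdefault(catalyst_id, dict.fromkeys(elements))
        record[symbol] = add_order
    return records


records = pivot(rows)
for catalyst_id, record in records.items():
    print(catalyst_id, record)
```

Elements that a catalyst does not contain appear as None in its record. This is the missing-field situation mentioned in the text: a flat file can be produced only by deciding how such gaps are to be encoded.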

2.2. Algorithms inputs and outputs

Another central and decisive component in combinatorial material research (CMR) is the automatic treatment of experimental data. The different approaches employed for iteratively proposing the new library of experiments to be conducted may be categorized into either "modeling" or "optimization" techniques. Modeling aims at obtaining an estimation of the figure of merit for the search space investigated. Based on the expected criterion(s) and/or diversity measures [9], future experiments are selected. Depending on the study, the selection of the approach may also consider the trade-off between accuracy and understandability/interpretability, or the ability of the model to correctly extrapolate in view of new conditions or of search space zones that are poorly explored, technically difficult, or that require higher investigation cost. Numerous techniques are employed, such as neural networks [10] and hybrid solutions [11], support vector machines [12], and regression [13] and classification [14] trees; long-established statistics is mostly ignored [13], as is traditional DoE [15]. Considering optimization methods, the new generation to be conducted is defined with regard to one criterion to be optimized, or more if multi-objective optimization is handled, which is rarely observed. Genetic algorithms (GAs) and evolutionary strategies (ESs) [16] are principally exploited due to both the numerous past proofs of efficiency in diverse domains and their iterative population mechanism, which fits the combinatorial loop process well. Apart from modeling and optimization, very few new algorithms have been proposed; for example, in Ref. [17] a new active sampling methodology aims at obtaining a

Table 1
Array resulting from a SQL query

Catalysts  Element  Order
1          Cu       1
1          Fe       1
1          Mg       2
2          Mg       1
2          V        2
2          Fe       3

Catalysts  Element  Precursor  Precursor type
1          Cu       A          1
1          Mg       B          2
2          Mg       C          1
2          Cu       A          2
2          Fe       D          3


2.4. Representation and research space

When starting a discovery program in material science employing HT tools, the initial conception of the research space should leave room for unexpected results. A surprising breakthrough is only possible if the designer of the search space integrates diversity among possible experiments. However, diversity is not only integrated through the number of modalities of the parameters involved in the study (i.e., size of the pool of elements or supports, levels of temperature or concentration, ...), but also through the selection of variables to be explored. Such diversity is always restricted a priori in all studies due to the impossibility of handling further variables through the representation mode. This obviously reduces tremendously the chance of discovery, the power of the gained knowledge, and finally the interest of using a combinatorial approach. For example, despite the body of theory [26] providing theoretical evidence to complement the empirical proof of the robustness of GAs, the alphabet string representation is not that flexible, as emphasized earlier. Moreover, the capability of GAs and ESs to discover new materials may be discussed. On one hand, Schrage [27] states that "In many respects, evolution is the ultimate prototyping and simulation methodology. Evolution's power and versatility are inarguable; its ability to innovate and surprise is overwhelming." On the other hand, one can argue that there is nothing creative involved in the solutions generated in this way, since technically speaking one is just finding solutions that are already out there waiting to be found. Here, for the first time, a method is proposed which enables working on an open-ended search space while taking into account the complex structure of data.

3. The genetic programming paradigm

Evolutionary algorithm (EA) methodologies are now consistently used not only in research but also for industrial and commercial problem-solving activities, demonstrating that the approach is sound and competitive. EA is an umbrella term used to describe computer-based problem-solving systems which use computational models of some of the known mechanisms of evolution as key elements in their design and implementation (Fig. 2).

Over the past thirty years, strong biological metaphors led to the development of several schools of EA, including genetic programming (GP) [28,29]. A simple EA can first be summarized by Fig. 3, where different representation or encoding schemes and operators define different EAs. Here, we concentrate on GP, which uses genetic tree-like computer programs. The aim of GP is ambitious, and this approach has already attracted a great deal of attention from researchers in the field of machine learning (ML). It is Koza who demonstrated the power and generality of GP through the potential of performing genetic-style recombination upon function-tree specifications of algorithms. The tree-based GP (TGP) is also named traditional GP. GP differs from other soft-computing techniques, which often optimize real numbers or vectors. GP produces and processes symbolic information very efficiently. Despite this unique strength, GP has so far been applied mostly in numerical and Boolean problem domains. See Ref. [30] for non-Boolean related materials applications of GP.

3.1. The GP mechanism

The primary difference between GA and GP is that GP genomes are allowed to vary in depth and size, which can change dynamically during the evolutionary process to give a more flexible representation. Such GP open-endedness is a good way of getting around the inherent limitations of fixed coding. The tree-like representation is a hierarchical representation where the arguments of a function are represented as its descendant nodes. A function of arity n in a parse tree will have n child nodes. These arguments may be constants, variables or other functions. Two different parse trees are presented in Fig. 4. Six preliminary steps can be defined when using GP: (i) the terminals, (ii) the functions, (iii) the fitness function, (iv) the control parameters, (v) the termination criterion, and (vi) the program architecture. Here, it is considered that the reader is already familiar with GAs or ESs, and consequently only GP-specific features are discussed.
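The tree-like representation can be illustrated with a minimal Python sketch. The `Node` class, the `FUNCTIONS` table and the expression used here are illustrative assumptions (the expression is only in the spirit of the arithmetic tree of Fig. 4, not a transcription of it). Each function node has as many children as its arity, and terminals are leaves.

```python
import operator
from dataclasses import dataclass, field

# Function set: symbol -> (implementation, arity). Terminals are variable
# names (strings) or numeric constants. All names here are illustrative.
FUNCTIONS = {
    "+": (operator.add, 2),
    "*": (operator.mul, 2),
    "^": (operator.pow, 2),
}


@dataclass
class Node:
    symbol: object
    children: list = field(default_factory=list)

    def evaluate(self, env):
        """Evaluate the subtree, looking variables up in `env`."""
        if not self.children:  # terminal (leaf node)
            return env[self.symbol] if isinstance(self.symbol, str) else self.symbol
        func, arity = FUNCTIONS[self.symbol]
        if len(self.children) != arity:
            raise ValueError(f"{self.symbol!r} expects {arity} arguments")
        return func(*(child.evaluate(env) for child in self.children))

    def size(self):
        """Number of nodes; free to vary between individuals in GP."""
        return 1 + sum(child.size() for child in self.children)

    def depth(self):
        return 1 + max((child.depth() for child in self.children), default=0)


# (x ^ 2) + ((y * 3) + z): an assumed example expression.
tree = Node("+", [
    Node("^", [Node("x"), Node(2)]),
    Node("+", [Node("*", [Node("y"), Node(3)]), Node("z")]),
])
value = tree.evaluate({"x": 2, "y": 1, "z": 4})
print(value, tree.size(), tree.depth())
```

Unlike a fixed-length GA string, two individuals built from this class can differ in both `size()` and `depth()`, which is the open-endedness the paragraph refers to.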

Before a problem may be solved, the alphabet from which the program trees are composed must be specified. The alphabet can be split into the function set, i.e., the internal nodes of a tree, and the terminal set, which is made up of all the constants and variables for a particular problem, i.e., leaf nodes with no descendants in a tree. The first

Fig. 2. Classification of evolutionary algorithms.

Generate the initial population P(0), usually at random, and set i = 0
Repeat
    Evaluate the fitness of each individual in P(i)
    Select parents from P(i) based on their fitness in P(i)
    Apply search operators to the parents and get generation P(i+1)
Until the population converges or the time is up

Fig. 3. A simple evolutionary algorithm.
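The loop of Fig. 3 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy problem (maximizing the number of ones in a bit string), fitness-proportional parent selection, one-point crossover, bit-flip mutation, and all parameter values are choices made for the example and do not come from the article.

```python
import random


def evolve(fitness, init, vary, pop_size=20, generations=30, seed=0):
    """Simple EA following Fig. 3; stops when the time (generations) is up."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]          # P(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]          # evaluate P(i)
        # Select parents based on fitness (assumes non-negative fitness;
        # the +1 avoids an all-zero weight vector).
        parents = rng.choices(population,
                              weights=[s + 1 for s in scores],
                              k=pop_size)
        # Apply search operators to get P(i+1).
        population = [vary(rng.choice(parents), rng.choice(parents), rng)
                      for _ in range(pop_size)]
        best = max(population + [best], key=fitness)           # best seen so far
    return population, best


def init(rng, length=16):
    return [rng.randint(0, 1) for _ in range(length)]


def vary(a, b, rng, rate=0.05):
    """One-point crossover followed by bit-flip mutation."""
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]
    return [bit ^ 1 if rng.random() < rate else bit for bit in child]


population, best = evolve(fitness=sum, init=init, vary=vary)
print(sum(best), best)
```

Swapping `init` and `vary` for tree-building and tree-editing operators turns the same loop into a GP run, which is the sense in which different encodings and operators define different EAs.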







requirement of great importance is to ensure that they are capable of expressing the solution to the problem. Both functions and terminals must return a value, and this value should provide a valid argument to each of the functions in the function set. This important constraint is known as closure. However, there are different ways of relaxing this constraint, such as the use of dynamic typing, which enforces closure while allowing multiple data types. A better way to enforce data type constraints is to use strong typing, and hence to generate only trees which satisfy these constraints. Closure influences the choice of terminals and functions, and so the problem representation and thus problem difficulty. While both GAs and GP employ operators, their implementation must be tailored to the representation. Fig. 5 (right hand side) shows how crossover creates unviable programs because of the incompatibility of data types. In order to overcome closure constraint limitations, strong typing can be introduced into the algorithm. The type system indirectly and a priori specifies constraints through the types of the arguments of the functions and the return types of both the functions and terminals. The constraints can be implicit and come with the type system, or they can be explicit and hand-crafted by the user. The type system therefore constrains the search space by allowing only a subset of the combinations of symbols from the alphabet. In addition to the type constraints, strongly typed GP (STGP) allows the use of generic functions and generic data types. STGP lifts the closure requirement by implementing mechanisms that allow only type-correct programs to be considered. Generic functions will accept and return a variety of different types so that

Fig. 4. (a) A simple program x2 + ((y3) + z) in tree-like form and (b) another example of genetic programming syntax tree (OR (AND (OR (x1 x2)) (NAND (x2 x3))) (NOR (x1 x3))). Note how the brackets which denote the order of evaluation correspond to the structure of the tree. The functions are, respectively, arithmetic functions such as (+, ^ and ) and Boolean functions (AND, OR, NAND and NOR), whereas the terminal sets comprise, respectively, real values (x, y, z) and Boolean input variables (x1, x2, x3).

Fig. 5. Standard GP crossover (left) and unviable programs (right). The function "if ... then ... else" is 3-arity, where the first argument is Boolean and the two others are real. ">" returns true if the first element is greater than the second one. "+" and "^" are the usual arithmetic functions. "Max" is a 1-arity function returning the maximum real value among the connected list. On the right hand side, parents 1 and 2 are, respectively, equal to 1 and 8. However, on the left hand side, after crossing over the two viable parents 1 and 2, an unfeasible offspring is obtained as the type of arguments for each function is not respected.







an individual function does not have to be defined for each type. For example, the addition function is able to accept and return both floating point numbers and integers. Haynes and Schoenefeld [31] extend STGP to allow a hierarchy of types in an object-oriented manner.
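The effect of strong typing on crossover can be sketched as follows. This is a minimal illustration, not the STGP implementation of any particular library: the `SIGNATURES` table, the `Node` class and the helper names are assumptions, and the primitives are loosely inspired by the "if ... then ... else" and ">" functions of Fig. 5. A tree is type-correct when every argument has the type its parent expects, and a typed crossover only exchanges subtrees with the same return type.

```python
from dataclasses import dataclass, field

# Hypothetical typed primitive set: name -> (return type, argument types).
SIGNATURES = {
    "if": ("real", ("bool", "real", "real")),
    ">":  ("bool", ("real", "real")),
    "+":  ("real", ("real", "real")),
    "x":  ("real", ()),
    "y":  ("real", ()),
}


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

    @property
    def rtype(self):
        return SIGNATURES[self.name][0]


def well_typed(node):
    """True if every argument matches the type expected by its parent."""
    arg_types = SIGNATURES[node.name][1]
    return (len(node.children) == len(arg_types)
            and all(child.rtype == expected and well_typed(child)
                    for child, expected in zip(node.children, arg_types)))


def subtrees(node):
    yield node
    for child in node.children:
        yield from subtrees(child)


def compatible_points(a, b):
    """Pairs of subtrees that a strongly typed crossover may exchange."""
    return [(s, t) for s in subtrees(a) for t in subtrees(b)
            if s.rtype == t.rtype]


x, y = Node("x"), Node("y")
parent1 = Node("if", [Node(">", [x, y]), x, Node("+", [x, y])])
parent2 = Node("+", [x, y])

# An untyped swap can put the Boolean subtree where a real is expected.
unviable = Node("+", [Node(">", [x, y]), y])

points = compatible_points(parent1, parent2)
print(well_typed(parent1), well_typed(unviable), len(points))
```

Restricting crossover to `compatible_points` guarantees that offspring of two type-correct parents are themselves type-correct, so unviable programs are never generated rather than being repaired or discarded afterwards.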

3.2. Genetic programming for catalysts design

GP is used to evolve computer programs, while heterogeneous catalysts have nothing to do with programming code or mathematical functions. However, the design of catalyst libraries using the HT approach can be practically depicted by sequences of actions which are analogous to program instructions. Therefore, the general idea behind the use of GP for heterogeneous catalyst design is to describe experimental information by trees. For example, this can be done through the creation of functions related to synthesis activities. Mathematical functions in Fig. 4 will be replaced by synthesis actions such as "impregnate". The programs can be seen as the catalyst design, the compilation is similar to the synthesis, the execution is the catalytic test, and the set of fixed reaction conditions are the fitness cases. Considering heterogeneous catalyst representation through trees, the narrow view of types restricted to mathematics must be enlarged to other types with symbolic notions such as "liquid" or "gas". For example, in Ref. [32] a functional logic language is used and combined with GP in order to predict the carcinogenic activity of chemicals.

Before giving a concrete example of the possible application of GP to heterogeneous catalysis, it must be pointed out that large candidate solutions, i.e., large trees, are undesirable for numerous reasons, among them the fact that comprehensibility becomes more difficult. From the catalysis point of view, one can think that catalysts containing many elements or synthesis steps can be hard to reproduce and optimize. Therefore, for two different trees, i.e., catalysts, with the same fitness values, the smallest one is preferred. However, trees show a tendency to increase in size with the evolution. This phenomenon, known as bloat, appears to be an inherent feature of search algorithms using variable-length representations, and is common for TGP as two different genomes can produce exactly the same program, i.e., two different trees code the same catalyst. Koza [28] examines the effect of adding more Boolean functions to the function set on a single Boolean induction problem (the 6-multiplexer). He found that performance progressively deteriorated as the size of the function set increased, and observed that the best performance was reached with the logically complete function set. Therefore, functions must be general enough to handle diverse catalysts but also adequately specific not to enlarge the search space to undesirable zones, not to create unrealistic catalysts, and not to increase bloat potential. The GP2HC (genetic programming to heterogeneous catalysis) concept tackles this problem. Different examples defining the GP constituents are examined and discussed. The following section stresses the structural organization of GP2HC by assuming a domain-dependent architecture, defining functions for heterogeneous catalysis, and adapting and refining operators.
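The preference for the smallest of two equally fit trees can be written as a simple tie-breaking rule. The sketch below is one possible way to do it, under stated assumptions: trees are encoded as nested tuples `(symbol, child, ...)`, and the two example trees and the constant fitness function are invented for the illustration.

```python
def tree_size(tree):
    """Number of nodes in a tree given as nested tuples (symbol, child, ...)."""
    if not isinstance(tree, tuple):
        return 1  # a terminal
    return 1 + sum(tree_size(child) for child in tree[1:])


def prefer(a, b, fitness):
    """Return the fitter tree; on equal fitness, the smaller tree wins."""
    return max(a, b, key=lambda t: (fitness(t), -tree_size(t)))


# Two different genomes coding the same program (y * 1 is just y).
bloated = ("+", "x", ("*", "y", 1))
compact = ("+", "x", "y")

chosen = prefer(bloated, compact, fitness=lambda t: 1.0)
print(tree_size(bloated), tree_size(compact), chosen)
```

Because size only enters the comparison after fitness, this rule never trades performance for compactness; it only removes redundant structure when nothing is lost.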

4. GP2HC: an advanced and flexible concept

The choice of components of the program, i.e., terminals and functions, and the fitness function largely determine the space which GP searches, and consequently how difficult that search is and ultimately how successful it will be. Here, the chemist's knowledge has an important role to play since the computer scientist responsible for the coding of the strategy, i.e., the manipulation of

Fig. 6. Expended GA representation.







the data, is not able to interpret the chemical meaning of the trees. Due to the high complexity of catalyst design when considering various types of materials, an architecture-defining preparatory step will be performed. The proposed structure is not totally fixed and can be easily prototyped considering the special requirements of a given research program.

4.1. Architecture definition

4.1.1. A multi-tree architecture

Each catalyst within the population is composed of n trees. This multiple-tree architecture is selected and adapted for heterogeneous catalysis so that each tree contains code which evolves for a single purpose (a part of the catalyst) while optimizing the whole structure, i.e., the entire catalyst. The "divide and conquer" technique is often used in problem solving to decompose a difficult problem into more manageable sub-problems. The most general and simplest structure considers a catalyst in different parts. As there are options (noted with "Lozenges"), this architecture is considered as fixed within a certain degree of freedom. Fig. 7 shows the scheme corresponding to the structure, also called architecture. "Rectangles with round borders" are not functions but conceptual objects which will be defined by their special function and terminal sets. Functions are symbolized with s (none in Fig. 7) and terminals use different symbols depending on the corresponding type (h, , ...). h is a numerical value and corresponds to a list of values. d is a conceptual and on which the argument list and functions of the concept are plugged for clarity of the scheme. The solid roughly follows the StoCat DB scheme conception (Fig. 7 on the right hand side; the entire DB scheme is given in

Fig. 7. The catalyst architecture in GP2HC (left) following the concept of core and shell previously defined in StoCat.

Fig. 8. GP2HC architecture design flexibility.







Supplementary Material). It is composed of a core (in green), the weight percent of this core (h), some optional combinations (in blue and red) of layer additions with their respective weight percent ( ) and heat treatments, and a final heat treatment. An intermediary heat treatment is possible (in red, second one from left hand side) in case of element addition onto the core (i.e., layers). The architecture respects an order from top to bottom and implicitly from left to right. The regimented syntax used from left to right is implicit since any operator can alter this order. The architecture definition permits both to introduce constraints and to underline important factors a given research program may investigate. For example, one can separate the main elements from promoters (



in order to avoid impossible cases such as the use of impregnation, since there is nothing yet to impregnate on. Considering the core synthesis methods, only methods producing a dispersing agent through an intimate mix of elements are taken into account, according to the core and layers concept. Simultaneous use of multiple elements in a single method is necessary if one wants to produce a multi-component core such as co-precipitated CuZnAl. Thus, core synthesis "M s" calls "+ n", as shown in Fig. 10.

The thermal treatment or heat treatment (noted HT) is defined as follows: the argument list is composed of a final temperature, ramp, dwell, total flow rate of gases, and an initial temperature automatically set as the previous one, or ambient ( ) if none is specified (see Fig. 8, right hand side). Functions such as HT and M l can call each other, enforcing a hierarchical arrangement of function calls. Thus, HT requires gas (Fig. 10), and either another HT or a terminal for stopping, as depicted in Fig. 10 (left branch). Therefore, one thermal treatment can be composed of different cycles. If a given research program focuses on thermal treatment, flexibility can be integrated by defining gas. The gas function could be either air or a coupled list of gases and relative flow percents. This list is of a special type in order to avoid bloat, since it is ordered ( +; ). We also forbid using the same gas twice, to limit bloat; the related symbol is the DB+ (see the DB symbol in Fig. 10 with a red arrow).
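The recursive definition of HT can be sketched with a small data structure. Everything below is an illustrative assumption rather than the GP2HC implementation: the class and field names, the units, the value chosen for ambient temperature, and the use of None as the stopping terminal. The sketch shows how chained HT calls unroll into successive cycles, each starting at the final temperature of the previous one, and how a gas list with a repeated gas can be rejected.

```python
from dataclasses import dataclass
from typing import Optional

AMBIENT = 25.0  # assumed ambient temperature, in degrees Celsius


@dataclass
class HT:
    final_temp: float              # degrees Celsius
    ramp: float                    # degrees Celsius per minute
    dwell: float                   # minutes
    flow: float                    # total gas flow rate (arbitrary unit)
    gas: tuple                     # ((gas name, flow percent), ...)
    next: Optional["HT"] = None    # another HT, or None to stop


def valid_gas(gas):
    """Reject lists that use the same gas twice or do not sum to 100%."""
    names = [name for name, _ in gas]
    total = sum(percent for _, percent in gas)
    return len(names) == len(set(names)) and abs(total - 100) < 1e-9


def program(ht, start=AMBIENT):
    """Unroll chained HT calls into (initial, final, minutes) cycles."""
    cycles = []
    while ht is not None:
        minutes = abs(ht.final_temp - start) / ht.ramp + ht.dwell
        cycles.append((start, ht.final_temp, minutes))
        # The next cycle starts where this one ended.
        start, ht = ht.final_temp, ht.next
    return cycles


drying_then_calcination = HT(
    120, 5, 60, 50, (("air", 100),),
    next=HT(500, 2, 240, 50, (("N2", 80), ("O2", 20))),
)
cycles = program(drying_then_calcination)
print(cycles)
```

Keeping the initial temperature out of the argument list, as the text describes, means a crossover that moves an HT subtree elsewhere cannot create an inconsistent program: the starting point is always derived from the new context.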

The optional layer concept necessitates HT or M l functions. M l needs a stirring type argument; Param, which is the list of arguments linked to the chosen method; and HTm, the thermal treatment associated with the method (for example, the thermal program during a precipitation). HT and HTm are two different functions since they could be designed differently by the user. Add is the function that defines which element is added to the catalyst, but also the precursor type and concentration associated with the element.

Such a design allows only a single element to be added per layer. Of course, depending on the catalysts to be searched, an improved version can be adopted, as shown for core synthesis. Each synthesis method can be fully described (for example, the number of impregnations will only belong to the impregnation list, and the addition rate of the precipitating agent to co-precipitation). Note that the designer could apply the filter on elements in the M l function so as not to add the same element twice in two different layers. However, StoCat assumes that achieving two separate additions of the same element should produce two different catalysts (even if the final and total amount of the given element is equal in both cases) (see Fig. 11).

This simple and quick description can be easily modified by the user. The flexibility of such a structure means that it can accommodate nearly all catalysts prepared at lab scale. In the GP2HC concept, the constraints are explicit and hand-crafted by the user. The terminal sets are dependent on the tree branch and functions in which they are employed. In this example of GP2HC conception we have defined: lists (elements, gases, ...), ordered lists, real values (flows, temperatures, pH, ...), symbols or pre-defined groups (air, ), and qualitative values (stirring types, precursor types, commercial supports, ...). The creativity of the designer is important as it is directly correlated with GAP performance and research problem definition or boundaries. If one wants to take into account mixtures of commercial supports (see Ref. [22] for an example of a real study where the resulting catalyst outperforms the current industrial performance), a function "&" may be defined (see Fig. 12 on left hand side). However, bloat may occur since the order when mixing n solids does not have any influence on the final mixture of solids, and consequently & (NaZSM-5, 30, & (NaX, 70, )) is equivalent to & (NaX, 70, & (NaZSM-5, 30, )).
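One possible way to keep order-equivalent mixtures from counting as distinct individuals is to compare them in a canonical form. The sketch below is an assumption made for illustration, not a mechanism described in the article: nested "&" calls are encoded as tuples `(solid, percent, rest)`, with None standing in for the stopping terminal, and the canonical form is simply the sorted list of components.

```python
def flatten(mix):
    """Collect (solid, percent) pairs from nested &(solid, percent, rest) calls."""
    parts = []
    while mix is not None:
        solid, percent, mix = mix
        parts.append((solid, percent))
    return parts


def canonical(mix):
    """Order-independent form: two trees coding the same mixture compare equal."""
    return tuple(sorted(flatten(mix)))


a = ("NaZSM-5", 30, ("NaX", 70, None))
b = ("NaX", 70, ("NaZSM-5", 30, None))

print(canonical(a))
print(canonical(a) == canonical(b))
```

Using the canonical form as a key, for example when checking whether a catalyst has already been evaluated, avoids spending experiments on trees that differ only in the order of mixing.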

5. Experimental section

Materials science, like numerous design problems from various domains, deals with multiple objectives (MOs) at the same time. However, even if the presence of several conflicting objectives is typical for heterogeneous catalysis research, no paper on the subject is available. Therefore, it is decided to survey the subject quickly (see Supplementary Material), and to apply the concept of GP to a MO benchmark using Pareto optimality.

5.1. Benchmark and algorithm settings

The benchmark is designed to stress the order within a sequence of selected elements noted x. The same element can be selected twice. This emphasizes the fact that solids containing equal amounts of elements may perform differently depending on the way they are added. The aim is to optimize both C and S, maximized and minimized, respectively.

Fig. 11. From left to right: heat treatment (HT), gas, and layer ADFs.

Fig. 12. & Function for solid mixture.





Multi-objective optimization is handled by using MOGA [33], which ranks each individual according to its degree of dominance. An individual's ranking equals the number of individuals that it is dominated by, plus one. Individuals on the Pareto front have a ranking of one, as they are non-dominated. The rankings are then scaled to score individuals in the population (see Fig. 14). The main architecture is defined by a root and two leaves. The left hand side leaf receives a single value for S from the terminal set {S1, ..., S10}, and the second leaf receives either the "+" or the "&" function. The "+" function has three leaves with two terminals: one waiting for real values for x, another that gives i (i.e., which xi is selected). The selection of i is linked to a temporary list that removes every i that has been selected with "+" and "&". In the last leaf, it makes a call to the "+" or "&" functions. "&" is defined exactly like "+" since the selection of the same x twice is restricted on the same i. The difference between "&" and "+" lies in the influence they have on the fitness value.

A rank-based selection with tournament and a generational technique using elitism are employed. In order to enhance exploitation, the population is separated into two sets: (i) P b is the part (b%) of the population that stores the best individuals from the beginning and

Fig. 14. GAP population (left) and population ranking in MOGA (right).

Fig. 13. View of a single run of GP for the benchmark. R is on the vertical axis, while S is on the horizontal axis. Since the scale on the vertical axis changes from one chart to another, R = 100 is shown by the blue dotted line ( ), and R = 300 by ( ).

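[Equation: definition of the benchmark objective C from S, a factor a, and a sum over i = 2, ..., n of a function f(x_{i-1}, x_i) of consecutive selected elements, with x_i in [0, 1], nb{x_i} the number of non-zero x, and S in {1, ..., 10}. The factor a is 2 or 1 depending on how nb{x_i} compares with 3.]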







(ii) P r is the rest (r = 1 - b) of the population. Compared to the common elitist method, the elitist selection is done over the entire set of individuals previously explored. The individuals belonging to P b are not re-evaluated. P b is set to 33% in the experiment. Micro-mutation consists in replacing a terminal by a new random one. Every mM generations, micro-mutation is applied in order to optimize both the structure and the parameters. Given a pre-defined maximum depth, GP trees are initialised using half-half initialisation: the population trees are created half with the full method and half with the grow method. This ensures initial tree diversity in terms of both size and structure. All the parameters are listed in Table 2. The population size and tournament size have been set, respectively, to 150 and 6, based on previous tests and information from the literature. The stopping criterion is set to 25 generations since we must restrict the total number of tests, as we would do for real experiments. The frequencies of the crossover and mutation operators have been set after numerous tests with various benchmarks (not shown here). Tournament selection has become increasingly popular as it performs rank-based selection using only local information. As it does not use the whole population, tournament selection does not require global population statistics. In tournament selection, a number of individuals, the tournament size, are chosen at random with reselection from the breeding population. These are compared with each other and the best of them is chosen. As the number of candidates in the tournament is small, the comparisons are not expensive. An element of noise is inherent in tournament selection due to the random selection of candidates.
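Tournament selection as described above fits in a few lines. The sketch below is a minimal illustration: the function name, the made-up ranks, and the fixed random seed are assumptions, while the population size (150) and tournament size (6) are the values quoted in the text. Candidates are drawn with reselection, and the candidate with the lowest rank wins, since in MOGA a rank of one is best.

```python
import random


def tournament(population, rank, size, rng):
    """Draw `size` candidates at random with reselection; return the best-ranked."""
    candidates = [rng.choice(population) for _ in range(size)]
    return min(candidates, key=rank)  # lower rank is better


rng = random.Random(0)
population = list(range(150))                       # individuals are just ids here
ranks = {ind: 1 + ind % 10 for ind in population}   # made-up ranks for the demo

parents = [tournament(population, ranks.get, 6, rng) for _ in range(150)]
print(sum(ranks[p] for p in parents) / len(parents))
```

Only the six drawn candidates are compared in each call, so no sorting or statistics over the whole population is needed, and the random draw is the source of the selection noise mentioned in the text.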

A solution is said to be Pareto optimal if there exists no other solution that is better in all attributes. This implies that, if the solution is Pareto optimal, achieving a better value in one objective causes at least one of the other objectives to deteriorate. Thus, the outcome of a Pareto optimization is not one optimal point, but a set of Pareto optimal solutions that visualize the trade-off between the objectives. Considering a minimization problem and two solution vectors $x_1$ and $x_2$, $x_1$ is said to dominate $x_2$ if $\forall i \in \{1, \ldots, k\} \setminus \{j\}: f_i(x_1) \le f_i(x_2)$ and $\exists j: f_j(x_1) < f_j(x_2)$. The set of Pareto optimal solutions is known as the Pareto optimal front. If the final solution is selected from the set of Pareto optimal solutions, there would not exist any solution that is better in all attributes. If the solution is not in the Pareto optimal set, it could be improved without degradation in any of the objectives, and thus it is not a rational choice. This is true as long as the selection is based on the objectives only. Pareto optimal solutions are also known as non-dominated or efficient solutions.
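The dominance test and the extraction of the non-dominated set can be written directly from this definition. The following Python sketch assumes that every objective is minimized; the function names and the sample points are hypothetical. Each point is a pair (S, −R), so that the maximized figure of merit R is turned into a minimized objective.

```python
def dominates(f1, f2):
    """True if objective vector f1 dominates f2 (all objectives minimized):
    f1 is no worse in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated points, keeping their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (S, -R) pairs: low S and high R are both desirable.
points = [(1, -50), (2, -120), (3, -100), (5, -300), (7, -300)]
front = pareto_front(points)
```

In this toy set, (3, −100) is dominated by (2, −120) and (7, −300) by (5, −300), so three points remain on the front. The quadratic scan is adequate for populations of the size considered here.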

5.2. Results

The optimization of the benchmark using GAP is done 20 times, each run being made of 25 generations. The distribution of performance values is reproducible over the 20 runs, as statistically tested with a chi-square goodness-of-fit (GOF) test. Fig. 13 shows the result of a single run.

The first generation is quite poor: on average, only two individuals exhibit R > 100, as shown by the blue dotted line. It can be noted that no relatively large difference in R is visible between the different S, except for S_5, S_8 and S_9. In the third generation, the R figure of merit exceeds 300. One can note the increase of R at the eighth generation. However, while searching to increase R, the GP generates individuals that increase S values. The GP follows the MOGA objective, which consists in providing the best Pareto frontier and maintaining it. The trade-off between R and S is correctly controlled, as the best individuals (considering R) at the lowest S are rapidly found, around the eighth generation, and GAP then focuses on higher values of S. From the eighth generation onwards, one can note that small values of R for S > 5 are partially removed (see circle in Fig. 13), moving towards the Pareto front.

The outcome of this optimization is a set of Pareto optimal solutions that visualize the trade-off between two figures of merit, noted C and S. The advantage of such an approach is that the solutions are independent of the decision-maker's preferences. The analysis has to be performed only once, as the Pareto set does not change as long as the problem description remains unchanged. A disadvantage might be that the decision-maker has too many solutions to choose from. Here, a typical decision-maker would balance the pros and cons of a few results picked from the Pareto front. For example, catalysts made of very few elements (S from 1 to 5 in the benchmark) receive very low C; on the other hand, reaching the highest values of C requires S equal to 10. The solution for S = 7 seems a good compromise between the complexity of the catalyst, i.e., the number of element additions, and its performance. Moreover, there is a relatively large increase from S = 6 to 7, while no better catalyst has been found for S = 8. This evaluation on a benchmark shows that managing ordered problems is easily feasible with a GP. However, such a strategy still has to be applied to real cases.

6. Discussion

Possibly the greatest distinction between GAs and GP is that of fixed or variable length. In some cases, the size of the required solution may be known beforehand. However, there are many problems where it is difficult to pre-specify the size of the solution. Clearly, if we know the size of the solution, we do not need to use a variable-length representation, as this would make the search space larger. The quest for more efficient GP is an important research problem, because high complexity is among the distinctive features of GP. Evolution in GP is both parametric and structural in nature. Two important features are specific to GP: (i) the fitness of the functional structure depends on the values of local parameters, so even very fit structures may perform poorly due to inappropriate numeric coefficients; and (ii) the fitness of the individual is highly context sensitive, so slight changes in structure dramatically influence fitness and may require completely new parameters. Accordingly, there are many ways to introduce local learning into GP. The presence of stochasticity in local learning makes it relatively slow, even though some hybrid algorithms yield overall improvement. Apparently, the full potential of local search optimization is yet to be realized. Moreover, since local learning comes at a price, it must be wisely traded off against genetic search costs.

Table 2
GP parameters

Parameters                              Values
Evolutionary model                      Generational
Population size                         150
Stop criterion                          Stop at 25 generations
Function set                            {+, &}
Terminals                               {S_1, ..., S_10}, real, and integer
Tree generation                         Half and half
Initial depth                           4
Maximum depth                           12
Sub-tree crossover probability          1
Macro-mutation                          0.01
Micro-mutation                          0.1 (applied every mM generations)
Frequency of local optimization (mM)    3
Parent selection                        Tournament (size 6)

Storing, organizing and using most of the information in chemistry research is one of the main concerns in developing efficient experimental strategies. Up to now, there is no integral approach able to exactly reproduce each of the decisions around one particular experimental procedure. In catalysis research, the synthesis of materials, their characterization, and the corresponding reactivity tests involve a large number of individual steps regarding the selection of substances, compositions, heat treatments, characterized chemical properties and reactivity parameters. However, it is significant, though usually taken for granted, that many more decisions are always made in the course of the research. In particular, decisions about the temporal order of each step, i.e., the order in which several metals are deposited on a support or the design of heat treatments, cannot be included in a conventional strategy. Genetic programming represents a flexible architecture to store all the information around experimental procedures, opening the possibility of finding unexpected new formulations that could not be considered with other algorithms. Genetic programming allows defining particular blocks for particular experimental operations: core blocks, such as the way to prepare the main part of a catalyst; heat treatment blocks, in order to modify the chemical properties of raw materials; or shell blocks, as a way to add promoters and other additional elements, each one with its own rules. This kind of architecture, together with the great flexibility to define and adapt the rules for the different blocks, makes genetic programming a powerful tool to integrate and extract scientific knowledge, notably improving the research quality.

GP has been shown to be the first tool that can handle the complexity inherent to catalyst structure. Moreover, it becomes very easy to connect such a GP with a database in order to obtain a fully automated workspace with an increased speed of data flow: because each part of the catalyst is well defined (i.e., tagged), as in the AnIML format [34], the use of data stored in databases becomes possible.
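To make the idea of tagged blocks concrete, the sketch below encodes a synthesis as an ordered sequence of core, heat treatment and shell blocks. It is a deliberately simplified illustration (a flat sequence rather than a full GP tree), and every class name, field name and numerical value in it is hypothetical rather than taken from the platform described in this work.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Core:
    """Main part of the catalyst (here simply the support material)."""
    support: str


@dataclass(frozen=True)
class HeatTreatment:
    """A heat treatment applied to whatever has been prepared so far."""
    temperature_c: float
    hours: float


@dataclass(frozen=True)
class Shell:
    """Addition of a promoter or other additional element."""
    element: str


Step = Union[Core, HeatTreatment, Shell]


@dataclass(frozen=True)
class Synthesis:
    """An ordered sequence of tagged blocks; the order is part of the recipe."""
    steps: Tuple[Step, ...]

    def elements_in_order(self):
        return [s.element for s in self.steps if isinstance(s, Shell)]


# Two hypothetical recipes: same ingredients, different deposition order.
calcination = HeatTreatment(temperature_c=500.0, hours=4.0)  # illustrative values only
ni_then_mo = Synthesis((Core("Al2O3"), Shell("Ni"), calcination, Shell("Mo"), calcination))
mo_then_ni = Synthesis((Core("Al2O3"), Shell("Mo"), calcination, Shell("Ni"), calcination))
```

A composition-only encoding would treat the two recipes as identical, whereas the tagged, ordered encoding keeps them distinct. Because each block carries its type, the records could also be written to or read from a database field by field.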

6.1. GAP: GP or GA?

It is clear that GAs and GP are related, as they are both inspired by Darwinian evolution. The main differences between GP and GA are listed below.

6.1.1. Fixed or variable length

Possibly the greatest distinction between GAs and GP is that of fixed or variable length. In some cases, the size of the required solution may be known beforehand. However, there are many problems where it is difficult to pre-specify the size of the solution. There is no reason why the size of a bit string in GAs cannot vary during the evolution. Both crossover and mutation operators, which operate on fixed-length structures, can be engineered into operators which produce variable-length bit strings. Conversely, there is no reason why fixed-size GP cannot be implemented.

6.1.2. Representation

Ref. [35] gives an example comparing two uses of the chromosome bit string in fitness evaluation. In the first case, the fitness of an individual bit string in the population is given by some cost function which, given a bit string, returns a real value; we are therefore facing a GA. In the second case, the cost function interprets the bit string as a computer program, and the value returned reflects the performance of the program at a particular task. One would be inclined to say that this describes a GP system. Woodward concludes that the question is not what the representation is, but rather how the representation is interpreted.

6.1.3. Operators

One important potential difference between GA and GP is the effect of crossover. In GA, the crossover operator takes genetic material from either of the parents and places it in the same location in the child (i.e., the position of the gene in the genotype is not altered by crossover). Thus, crossover does not move the location of a bit within a bit string. The crossover operator in GP typically moves a sub-tree from one parent to a different location in the child. In GP, sub-trees can be interpreted anywhere in the overall tree.
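The contrast between the two crossover operators can be illustrated with a short Python sketch. It is not the operator set used in GAP: trees are represented as nested tuples `(function, child, child)` with string terminals, and all function names are assumptions made for the example. The symbols `+`, `&` and `S1` to `S6` are borrowed from Table 2 purely as labels.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """GA crossover on fixed-length strings: every gene of the child keeps
    the position it had in the parent it came from."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every node of a tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def subtree_crossover(parent_a, parent_b, rng):
    """GP crossover: a sub-tree of parent_b is grafted at an independently
    chosen node of parent_a, so material may change location and depth."""
    target = rng.choice(list(subtree_paths(parent_a)))
    donor = get_subtree(parent_b, rng.choice(list(subtree_paths(parent_b))))
    return replace_subtree(parent_a, target, donor)


def leaves(tree):
    """List the terminals of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


rng = random.Random(0)
bit_child = one_point_crossover([0] * 8, [1] * 8, rng)
tree_a = ("+", "S1", ("&", "S2", "S3"))
tree_b = ("&", ("+", "S4", "S5"), "S6")
tree_child = subtree_crossover(tree_a, tree_b, rng)
```

The bit-string child always has the length of its parents, whereas the tree child may be larger or smaller than either parent, which ties this operator difference to the variable-length property discussed in Section 6.1.1.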

6.1.4. Spaces relationship

There are many ways to represent GA features; however, typically there is a one-to-one mapping between the search space and the chromosome space. With programs there is a many-to-one mapping between the representation and the program being represented. This difference has consequences when searching the space. If the mapping between the representation and the object being represented is one-to-one, a uniform sampling of the representation will lead to a uniform sampling of the objects being represented. If the mapping between the representation and the object being represented is many-to-one, a uniform sampling of the representation will lead to a nonuniform sampling of the objects being represented. This is the reason why Langdon pointed to bloat as a consequence of the many-to-one relationship between spaces. The No Free Lunch (NFL) theorem is valid in the case of a one-to-one mapping. However, it has been suggested that NFL is not valid when there is a nonuniform many-to-one mapping between the representation and the objects being represented. With GA, either situation can occur; however, the second situation is always the case with the representations used in GP.
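The effect of a many-to-one mapping on sampling can be checked by exhaustive enumeration on a toy example. The grammar below (functions `+` and `*`, terminals `0`, `1` and `x`, trees of depth at most one) is an assumption chosen for illustration and is unrelated to the catalyst representation. Two trees are considered to represent the same object when they return the same values on a few sample inputs.

```python
from collections import Counter
from itertools import product
import operator

TERMINALS = ("0", "1", "x")
FUNCTIONS = {"+": operator.add, "*": operator.mul}


def evaluate(tree, x):
    """Evaluate a terminal (string) or a (function, left, right) tuple at x."""
    if isinstance(tree, str):
        return x if tree == "x" else int(tree)
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))


def behaviour(tree, samples=(0, 1, 2, 3)):
    """The object being represented: the outputs on the sample inputs."""
    return tuple(evaluate(tree, x) for x in samples)


# Representation space: the 3 terminals plus every depth-one tree (2 * 3 * 3).
trees = list(TERMINALS) + [(op, left, right)
                           for op, left, right in product(FUNCTIONS, TERMINALS, TERMINALS)]

# Number of distinct trees mapping onto each distinct behaviour.
counts = Counter(behaviour(t) for t in trees)
```

The 21 trees collapse onto 7 distinct functions: the constant zero is represented by 7 trees, while x·x is represented by a single one. Drawing trees uniformly therefore samples the constant zero seven times more often than x·x, which is the nonuniformity described above.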

7. Conclusion

In view of the complexity of catalysis, different search frameworks, with a structure commonly limited in the type of feature dependence, have been proposed in the literature. Bearing in mind CHC objectives and priorities, an adapted data architecture is suggested. With genetic programming, it is possible to increase the number of variables to study, which would potentially result in a rather more powerful final catalyst. Indeed, if this methodology is properly followed, it can be very helpful for the scientific understanding of catalysis. GP may be used to enlarge the search space with deeper details on the synthesis description that cannot be handled by other methods based on linear representation. A second motivation was based on the statement that high-throughput and related data treatment in the domain of heterogeneous catalysis was relatively delayed compared to other chemistry, material science and pharmaceutical domains. By considering as many different paradigms as possible without any a priori, and determining which one was best adapted to the specificity and the numerous issues of heterogeneous catalysis, while being positioned at a high level of research from a computer science point of view, it seems that much improvement has been made in putting combinatorial catalysis in a competitive position compared to the other leading domains. The third motivation, linked to the previous ones, was to promote innovative ideas from a computer science point of view.

Acknowledgements

Ferdi Schueth from Max-Planck-Institut für Kohlenforschung, Mülheim, Germany, is gratefully acknowledged for the discussions on catalyst discovery, which made it possible to elaborate such a conceptual approach. Claude Mirodatos and David Farrusseng are also acknowledged. Avelino Corma is also acknowledged for the suggested examples and references. The support of the EU Commission (TOPCOMBI Project) for this research is gratefully acknowledged. We thank Santiago Jimenez for his technical support on the hITeQ platform.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.commatsci.2008.03.051.





References

[1] (a) B. Jandeleit, D.J. Schaefer, T.S. Powers, H.W. Turner, W.H. Weinberg, Angew. Chem. Int. Ed. 38 (17) (1999) 2494–2532;
    (b) S.M. Senkan, Angew. Chem. Int. Ed. 40 (2) (2001) 312–329;
    (c) M.T. Reetz, Angew. Chem. Int. Ed. 40 (2) (2001) 284–310;
    (d) J.M. Newsam, F. Schuth, Biotechnol. Bioeng. 61 (4) (1999) 203–216;
    (e) F. Gennari, P. Seneci, S. Miertus, Catal. Rev. Sci. Eng. 42 (3) (2000) 385–402;
    (f) M. Moliner, J.M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, Micropor. Mesopor. Mater. 78 (2005) 73–81;
    (g) O.B. Vistad, D.E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen, K.P. Lillerud, Stud. Surf. Sci. Catal. 154 (2004) 731–738;
    (h) A. Cantín, A. Corma, M.J. Diaz-Cabanas, J.L. Jordá, M. Moliner, J. Am. Chem. Soc. 128 (2006) 4216–4217;
    (i) J.R. Hendershot, C.M. Snively, J. Lauterbach, Chem. Eur. J. 11 (2005) 806–814.
[2] (a) L.A. Harmon, A.J. Vayda, S.G. Schlosser, Abstr. Pap. Am. Chem. Soc. 221 (2001) BTEC-067;
    (b) C. Klanner, D. Farrusseng, L.A. Baumes, C. Mirodatos, F. Schueth, QSAR Comb. Sci. 22 (2003) 729–736.
[3] M.A. Camblor, A. Corma, A. Martinez, V. Martinez-Soria, S.J. Valencia, J. Catal. 179 (2) (1998) 537–547.
[4] G. Garralon, A. Corma, A. Fornes, Zeolites 9 (1) (1989) 84–86.
[5] M. Saupe, R. Fodisch, A. Sundermann, S.A. Schunk, K.E. Finger, QSAR Comb. Sci. 24 (1) (2005) 66–77.
[6] (a) H. Zhang, R. Hoogenboom, M.A.R. Meier, U.S. Schubert, Meas. Sci. Technol. 16 (2005) 203–211;
    (b) M.A.R. Meier, U.S. Schubert, Soft Matter 2 (2006) 371–376;
    (c) Special issue Materials Informatics: From Data to Knowledge, QSAR Comb. Sci. 24 (1) (2005) 1–196.
[7] (a) W.F. Maier, K. Stöwe, S. Sieg, Angew. Chem. Int. Ed. (2007);
    (b) W.F. Maier, J. Saalfrank, Chem. Eng. Sci. 59 (2004) 4673–4678.
[8] (a) L.A. Baumes, Ph.D. Thesis in Comput. Sci., Univ. Lyon 1, La Doua, France, 2004;
    (b) D. Farrusseng, L.A. Baumes, C. Hayaud, I. Vauthey, P. Denton, C. Mirodatos, in: E. Derouane (Ed.), Proceedings of NATO Advanced Study Institute on Principles and Methods for Accelerated Catalyst Design, Preparation, Testing and Development, Vilamoura, Portugal, 15–28 July, 2001. E. Derouane, V. Parmon, F. Lemos, F. Ribeiro (Eds.), Book Series: NATO Science Series: II: Mathematics, Physics and Chemistry, vol. 69, pp. 101–124, Kluwer Academic Publishers, Dordrecht, Hardbound, ISBN 1-4020-0720-5, July 2002.
[9] L.A. Baumes, A. Corma, ISHHC 12, 18–22 July, 2005, Firenze, Italy.
[10] (a) J.M. Serra, A. Corma, A. Chica, E. Argente, V. Botti, Catal. Today 81 (3) (2003) 393–403;
    (b) Y. Watanabe, T. Umegaki, M. Hashimoto, K. Omata, M. Yamada, Catal. Today 89 (4) (2004) 455–464;
    (c) K. Omata, Y. Watanabe, M. Hashimoto, T. Umegaki, M. Yamada, Ind. Eng. Chem. Res. 43 (13) (2004) 3282–3288.
[11] (a) A. Corma, J.M. Serra, P. Serna, S. Valero, E. Argente, V. Botti, J. Catal. 225 (2005) 513–524;
    (b) L.A. Baumes, D. Farrusseng, M. Lengliz, C. Mirodatos, QSAR Comb. Sci. Nov. 29 (9) (2004) 767–778.
[12] (a) L.A. Baumes, J.M. Serra, P. Serna, A. Corma, J. Comb. Chem. 8 (2006) 583–596;
    (b) J.M. Serra, L.A. Baumes, M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen. 10 (January 1) (2007) 13–24.
[13] L.A. Baumes, M. Moliner, A. Corma, QSAR Comb. Sci. 26 (2) (2007) 255–272.
[14] A. Corma, M. Moliner, J.M. Serra, P. Serna, M.J. Díaz-Cabañas, L.A. Baumes, Chem. Mater. 18 (2006) 3287–3296.
[15] J.N. Cawse (Ed.), Experimental Design for Combinatorial and High Throughput Materials Development, John Wiley & Sons, 2003, ISBN-10: 0-471-20343-2, ISBN-13: 978-0-471-20343-8.
[16] (a) J.M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, ICEE (2003) 21–25;
    (b) D. Wolf, O.V. Buyevskaya, M. Baerns, Appl. Catal. A 200 (2000) 63;
    (c) G. Grubert, E.V. Kondratenko, S. Kolf, M. Baerns, P. van Geem, R. Parton, Catal. Today 81 (2003) 337–345;
    (d) M. Holena, High-Throughput Screening in Chemical Catalysis, in: A. Hagemayer, P. Strasser, A.F. Volpe (Eds.), Wiley VCH, 2004, pp. 153–172.
[17] L.A. Baumes, J. Comb. Chem. 8 (2006) 304–314.
[18] L.A. Baumes, P. Jouve, D. Farrusseng, M. Lengliz, N. Nicoloyannis, C. Mirodatos, Seventh International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES2003), September 3–5, 2003, University of Oxford, UK. Springer-Verlag in Lecture Notes in AI (LNCS/LNAI series), vol. 2773, pp. 265–270, V. Palade, R.J. Howlett, L.C. Jain (Eds.).
[19] (a) R. Bañares-Alcántara, E.I. Ko, A.W. Westerberg, M.D. Rychener, Comput. Chem. Eng. 12 (9/10) (1988) 923–938;
    (b) R. Bañares-Alcántara, A.W. Westerberg, E.I. Ko, M.D. Rychener, Comput. Chem. Eng. 11 (3) (1987) 265–277.
[20] S. Kito, T. Hattori, Y. Murakami, Chem. Eng. Sci. 45 (1990) 2661.
[21] H. Speck, W. Hoelderich, W. Himmel, M. Irgang, G. Koppenhoefer, W.D. Mross, DECHEMA Monogr. Computer Application in the Chemical Industry. Papers of European Symposium, Weinheim, VCH, Erlangen, April 23–26, 1989, p. 43.
[22] J.M. Serra, A. Corma, D. Farrusseng, L.A. Baumes, C. Mirodatos, C. Flego, C. Perego, Catal. Today 81 (3/30) (2003) 425–436.
[23] F. Clerc, S.R.M. Pereira, M. Lengliz, D. Farrusseng, R. Rakotomalala, C. Mirodatos, Rev. Sci. Instrum. 76 (2005) 062208.
[24] (a) C. Klanner, D. Farrusseng, L.A. Baumes, C. Mirodatos, F. Schüth, QSAR Comb. Sci. 22 (2003) 729–736;
    (b) C. Klanner, D. Farrusseng, L.A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 43 (40) (2004) 5347–5349;
    (c) D. Farrusseng, C. Klanner, L.A. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, QSAR Comb. Sci. 24 (2005) 78–93;
    (d) F. Schüth, L.A. Baumes, F. Clerc, D. Demuth, D. Farrusseng, J. Llamas-Galilea, C. Klanner, J. Klein, A. Martinez-Joaristi, J. Procelewska, M. Saupe, S. Schunk, M. Schwickardi, W. Strehlau, T. Zech, Catal. Today 117 (2006) 284–290.
[25] L.A. Baumes, R. Gaudin, P. Serna, N. Nicoloyannis, A. Corma, Comb. Chem. High Throughput Screen. (in press).
[26] D.E. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, Springer, Reading, MA, 1989.
[27] M. Schrage. ISBN-13: 9780875848143. Harvard Business School Press, 2000.
[28] J.R. Koza. ISBN: 0-262-11170-5. MIT Press, 1992.
[29] (a) J.R. Koza. ISBN: 0262111896. MIT Press, 1994;
    (b) J.R. Koza, F.H. Bennett, D. Andre, M.A. Keane. ISBN: 1-55860-543-6. Morgan Kaufmann, 1999;
    (c) J.R. Koza, M.A. Keane, M.J. Streeter, W. Mydlowec, J. Yu, G. Lanza. ISBN: 1-4020-7446-8. Kluwer Academic, 2003.
[30] (a) M. Kovacic, P. Uranick, M. Brezocnik, R. Turk, Mater. Manuf. Process 22 (5–6) (2007) 634–640;
    (b) M. Brezocnik, M. Kovacic, L. Gusel, Mater. Manuf. Process 20 (3) (2005) 497–508.
[31] T. Haynes, D. Schoenefeld, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Genetic Programming 1996: Proceedings of the First Annual Conference, The MIT Press, Cambridge, MA, 1996, p. 426.
[32] C.J. Kennedy, Ph.D. Thesis, University of Bristol, 2000. http://citeseer.ist.psu.edu/kennedy99strongly.html.
[33] C.M. Fonseca, P.J. Fleming, Evol. Comput. 3 (1) (1995) 1–16.
[34] B. Schaefer, L.A. Baumes, A. Corma, LabAutomation 2008, Palm Springs, CA, Documenting Catalytic Test Reactions Using the Analytical Information Markup Language (AnIML), 2633, Monday, 01/28/2008, 1:00 PM–3:00 PM, Room MP94.
[35] J.R. Woodward, J.R. Neil, in: Genetic Programming, Proc. EuroGP 2003, Springer-Verlag, Essex, UK, 14–16 April, 2003.
