

Computers and Electronics in Agriculture 103 (2014) 82–90

Contents lists available at ScienceDirect

Computers and Electronics in Agriculture

journal homepage: www.elsevier.com/locate/compag

Regression model for sediment transport problems using multi-gene symbolic genetic programming

Bimlesh Kumar a,⇑, Anjaneya Jha a, Vishal Deshpande a, Gopu Sreenivasulu b

a Indian Institute of Technology Guwahati, Guwahati, India
b Department of Civil Engineering, Rajeev Gandhi Memorial College of Engineering and Technology, Nandyal 518501, India

⇑ Corresponding author. Tel.: +91 361 2582420. E-mail address: [email protected] (B. Kumar).

http://dx.doi.org/10.1016/j.compag.2014.02.010
0168-1699/© 2014 Elsevier B.V. All rights reserved.

Article info

Article history:
Received 14 August 2013
Received in revised form 12 January 2014
Accepted 17 February 2014

Keywords:
Genetic programming
Incipient motion
Sediment transport
Total bed load
Vegetated flow

Abstract

Sediment transport modeling problems are complex because of their multi-dimensionality and the nonlinear interdependence of their variables. In river hydraulics, moreover, phenomena are stochastic and variables are measured with unavoidable uncertainty. Dimensional and regression analyses have been employed in the past but have associated limitations. As a robust modeling tool, genetic programming was used to develop predictor models for three different but related sediment transport problems: vegetated flow, incipient motion and total bed load prediction. A relatively new development over conventional genetic programming, multi-gene symbolic regression, was used to model functional relationships that were able to generalize highly nonlinear variations in the data and to predict system behavior from independent input data in all three cases. The algorithmic parameters of the genetic programming technique were resolved iteratively and varied with the problem at hand. For all three models, model efficiency criteria were computed and presented, and the performance of each model was compared with several past models on the same data points. The models developed herein generalized the underlying relationships in the presented data and predicted values for unseen data with high accuracy.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Sediment transport problems form an essential part of civil engineering practice with regard to the river hydraulics challenges faced in the field. Solving sediment transport problems is indispensable in planning and managing water resources. However, the system parameters in these problems are numerous and exhibit complex, nonlinear interdependence. This complexity, combined with large spatiotemporal variations and inherent nonlinearity, makes the system difficult to treat analytically. Besides, the variables seem to assume values specific to geographies and climates. This compels one to make assumptions in an analysis that no longer hold when the model is used for disparate regions. It has also prevented the development of universal models that offer satisfactory prediction capabilities irrespective of the environment of application. As an example of interdependence, sediment and vegetation complicate the flow and affect the velocity profile. This in turn affects the bed and wall shear stresses and vegetation shapes, causing further changes in sediment loads and velocity profiles. Some or all of the model variables are subject to sources of uncertainty, such as errors of measurement, absence of information and poor or partial understanding of the driving forces and mechanisms. This imposes a limit on our confidence in the response of the model. Also, models may have to cope with the natural intrinsic variability of the system, such as the occurrence of stochastic events.

Almost all of the existing equations for sediment transport problems are empirical in nature because of such limitations. Regression and dimensional analyses have been used extensively in the past, but these approaches have certain limitations that keep them from being used widely for field applications. Regression requires the functional form to be determined beforehand and is sensitive to the clustering effect of influential points and groups of points. Dimensional analysis is also inadequate because of the high number of variables and the problem of multiple forms of the same equation. In problems of river hydrology, the system often reflects a stochastic nature and the variables cannot be measured without uncertainty. It has therefore been realized that there is a need for developing new and robust models that can overcome the restrictions posed by the conventional techniques.

Notation

$y$  channel height
$S_f$  friction slope
$y$  sediment density
$G$  specific gravity
$\gamma$  unit weight of water
$\gamma_s$  unit weight of sediment
$d$  sediment particle diameter
$\nu$  kinematic viscosity
$u$  mean flow velocity
$C$  sediment concentration
$Q$  water discharge
$B$  channel width
$\tau_b$  bed shear stress
$\tau_c$  critical bed shear stress
$h$  flow depth
$D$  diameter of cylindrical vegetation
$k$  height of vegetation
$i$  channel slope
$m$  number of vegetation cylinders per unit horizontal area
$C_d$  drag coefficient
$g$  acceleration due to gravity
$Gr$  gradation
$R^2$  coefficient of correlation
$I_d$  index of agreement
$E$  Nash–Sutcliffe efficiency

Soft computing is an emerging paradigm built on artificial intelligence, evolutionary/bio-inspired computing and probabilistic computing. These allow the development of statistical black-box models based entirely on historical data. Soft computing has been employed extensively in hydrology and hydraulics with varying applications. Its suitability comes from the fact that it allows for uncertainties in measured values. This is critical in river hydraulics because of the unavoidable uncertainties in measuring data in the field and in experiments. The models developed are not expected to give 100% accurate results but rather to be tolerant of errors in measurement and to offer better overall predictability. These "black-box" models are purely statistical, and their parameters are adjusted using training data so that they give predictions for new, independent inputs. Soft computing techniques primarily include artificial neural networks (ANN), fuzzy logic, genetic algorithms (GA), particle swarm optimization (PSO), etc.

Several soft computing models have been developed in the past. Adib (2008) used ANN for determining water surface elevation in tidal rivers. Sediment load prediction was carried out by Altunkaynak (2009) using genetic algorithms. Goel and Pal (2009) used a support vector machine in scour prediction. GA was also used for parameter identification in river network modeling by Tang et al. (2010). Kumar et al. (2010) used a Radial Basis Function model to design an incipient channel with bed suction. Kumar and Rao (2010) used a metamodel to predict the friction factor in an alluvial channel. Neural network and fuzzy logic models were applied to long-shore sediment transport by Samani et al. (2011). Amirabdollahian et al. (2011) used a fuzzy genetic algorithm for the optimal design of water networks. Kumar (2011) used an ANN model for friction factor prediction in an alluvial channel. Krishna et al. (2012) used a wavelet neural network model for river flow time series. Kumar (2012) applied a soft computing technique to bed material load prediction. Ismail et al. (2013) applied a feed-forward neural network to predict bridge scour. Other recent relevant work in river hydraulics employing soft computing techniques includes that of Kisi and Hosseinzadeh (2012) for modeling the rainfall–runoff process, Kisi and Hosseinzadeh (2012) for suspended sediment modeling, and Shiri et al. (2012) for forecasting daily stream flow. Shiri and Kisi (2012) also estimated daily suspended sediment load using wavelet conjunction models. Kisi and Shiri (2012) completed a comparative study of river suspended sediment estimation from climatic variables, in which various soft computing techniques were compared.

Genetic programming (GP), proposed by Koza (1992), views the modeling problem as one of program discovery. It is a relatively new domain in soft computing and has gained popularity in a variety of applications, including those in river hydraulics and sediment dynamics in fluvial systems. Singh et al. (2007) applied neural network–genetic programming for sediment transport. Azamathulla et al. (2008) used genetic programming to predict ski-jump bucket spillway scour. Aytek and Kişi (2008) attempted sediment modeling using a genetic programming approach. Kisi and Guven (2010) estimated suspended sediment concentration using machine code-based genetic programming. Chang et al. (2012) used linear genetic programming for discharge prediction in compound channels. Kisi and Hosseinzadeh (2012) developed suspended sediment models using genetic programming.

The paradigm of genetic programming searches for the best program in a search space of programs by evolving generations of genetically bred and mutated populations of programs (mathematical expressions). The modeling problem indeed calls for models that are explicit functions of the independent variables. In this, genetic programming differs from artificial neural models, which do not present an explicit expression and instead use a number of network parameters to transform inputs to outputs. However, both ANN and GP yield black-box models that are not based on the underlying physics of the system but are purely statistical. Genetic programming also differs from conventional regression. Rather than finding the numeric coefficients of a predetermined functional form, as regression does, symbolic regression attempts to find a symbolic expression containing functions, independent variables and numeric coefficients. The method is also referred to as symbolic function identification. Unlike conventional regression, GP does not require a predetermined functional form. Instead, it accepts a library of operators (functions and variables) and evolves generations of expressions to ultimately reach the best expression. The term symbolic regression is used for any technique that fits the measured data with a suitable mathematical formula. GP employs a search heuristic in which the algorithm begins with randomized sets of expressions and creates new expressions in each generation (iteration) that tend to perform better than those of the previous generation. Hence, the expressions are not calculated but generated from parent expressions using the genetic operators (mutation, crossover, etc.). The only calculations that take place are evaluations of expressions to assess their performance. This is done using model performance indicators (correlation coefficient, etc.) on the training data. The indicator shows to what degree the model has been able to generalize the training dataset statistically; a good correlation coefficient, for example, would indicate a good generalization. River hydraulics models are highly complex, and their underlying relationships may therefore be poorly understood. In such cases, the model can be viewed as a black box, i.e. the output is an opaque function of its inputs.


The present work proposes a new and improved regression model for sediment transport problems, namely a multi-gene symbolic regression model, for three different but related phenomena in sediment transport: vegetated flow, incipient motion and total bed load prediction. Multi-gene symbolic regression uses GP to find (and not calculate) multiple sub-programs (individual genes) and finally regresses the coefficients of these sub-programs to reach the final expression. The models developed herein for all three cases were found to be better than existing models in terms of model performance criteria.

2. Sediment transport problems

Flow velocity prediction in vegetated channel flow, total bed load prediction and incipient motion prediction have been taken up in this study. The state of the art for each is discussed briefly in the subsections that follow.

2.1. Flow prediction in vegetated channel

Channel vegetation typically consists of emergent aquatic plants, i.e. vegetation that exists inside and in proximity to flowing water. It is characterized by whether it is completely submerged or non-submerged (emergent), and by vegetation density, height of submergence, stem flexibility, stem geometry, surface characteristics and spacing. These characteristics influence the flow and morphology of a channel (Erskine et al., 2012). Several empirical (Kouwen and Fathi-Moghadam, 2000) and theoretical relations (Stone and Shen, 2002; Poggi et al., 2009) have been proposed to describe flow–vegetation interactions. Galema (2009) compared different predictors of flow characteristics on data available in the literature and concluded that no simple predictor exists for vegetated flow.

2.2. Total bed load prediction

Total bed load, or bed material load, comprises all of the bed load and the part of the suspended load that is represented in the bed sediments. Bed material load controls river morphology because a part of it can be actively interchanged with the actual bed load. Determination of bed load is crucial for water resources planning and management, including the design and maintenance of canals, dams and reservoirs. The earliest models for bed load transport estimation were proposed by Meyer-Peter and Müller (1948) and Einstein (1950). These were based on the dependence of sediment flux on bed shear stress. However, it was later realized that a complex dependence exists on various other parameters, including particle size, gradation, specific gravity and critical bed shear stress. The numerical equations for sediment transport were developed more than a decade ago (Bagnold, 1966; Parker, 1990; Sinnakaudan et al., 2006). These equations differ in effectiveness owing to the parameters used, the theoretical approach, the sampling techniques and the mathematical approaches (Yang, 1996). A discrepancy ratio analysis (Sinnakaudan et al., 2010) showed that existing equations had predictive power below 50% when applied to high-gradient rivers.

2.3. Incipient motion bed shear predictor

A channel may or may not experience movement of the bed, depending on the prevailing conditions. The incipient threshold of bed shear delineates a moving bed from a nonmoving one. The estimation of this critical value of bed shear forms an important part of sediment transport studies. The Shields diagram has been used extensively to date to determine the bed shear condition (Shields, 1936). However, dissatisfaction with this diagram has been reported (Yalin and Karahan, 1979; Smith and Cheung, 2004). The original Shields data showed considerable scatter and could be interpreted as representing a band rather than a well-defined curve (Buffington, 1999). Consequently, many empirical curves have since been reported (Chien and Wan, 1983; Hager and Oliveto, 2002; Cao et al., 2006).

3. Methodology

A similar methodology has been adopted in all three cases to demonstrate the applicability of the genetic programming approach across sediment transport problems. The source of the data used for modeling in each case is stated first. This is followed by a description of the adopted technique, namely multi-gene symbolic regression. Finally, model performance is assessed through criteria such as the correlation coefficient and the index of agreement, and by comparison with past empirical models. Statistical fits have also been compared for the three models. The modeling procedure requires the data to be split into two parts: training and testing. Training data typically account for around 75% of the total data available. They are used to develop the model, i.e. to decide the final function and its parameters, such as the coefficient multipliers in this case. Once the function has been obtained from training, the test data are presented to it and the function outputs are obtained. These outputs are then compared with the known outputs from the dataset. The prime difference between the training and testing data is that the training data are presented completely (inputs and outputs) to the model development process, which uses them to prepare the final function. For the testing data, only the inputs are presented to the final function, to check whether it predicts accurately. From the model's perspective, training data are used to generalize the data by internal parameter adjustment, and testing data serve as a test of the predictive accuracy of the model.
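The training/testing procedure can be sketched in a few lines. This is an illustrative Python sketch, not the authors' MATLAB code: the function name `split_train_test`, the seed, the synthetic records and the stand-in model are all assumptions made for the example.

```python
import random

def split_train_test(records, train_fraction=0.75, seed=0):
    """Randomly split records into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(records)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

def mean_abs_error(model, data):
    """Feed only the inputs to the model and compare with known outputs."""
    return sum(abs(model(inputs) - observed) for inputs, observed in data) / len(data)

# Hypothetical records: (inputs, observed_output) pairs.
records = [((x, 2.0 * x), 3.0 * x + 1.0) for x in range(20)]
train, test = split_train_test(records)

# Stand-in for a trained model; a real one would be fitted on `train` only.
model = lambda inputs: 3.0 * inputs[0] + 1.0
test_error = mean_abs_error(model, test)
```

The training subset would be handed to the model-building step with both inputs and outputs, whereas `mean_abs_error` passes only the inputs of the testing subset to the finished model.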

3.1. Data source and functional forms

Data measurement is critical in developing statistical models. In each of the three problems taken up, the data set was sampled into training and testing subsets. The data used for modeling were taken from the literature; the following subsections mention the sources.

3.2. Vegetated flow prediction

Galema (2009) documented a comprehensive database of different types of vegetation with hydraulic properties from different sources. For model development, around 75% of the observations were assigned to the training set and 25% to the testing set: 345 training points and 100 testing points, 445 data points in all. It should be noted that, like all empirical models, soft computing models perform better in interpolation than in extrapolation (Masters, 1993); consequently, the extreme values of the available data were included in the training set. The individual data sources are not cited in the references of this paper, as they can be found in Galema (2009).

Flow-vegetation interactions in steady and uniform conditioncan be described by the following function:

$$ u = f(h, k, i, m, C_d, D) \qquad (1) $$

where $u$ is the mean velocity, $h$ is the flow depth, $k$ is the height of the vegetation, $i$ is the channel slope, $D$ is the diameter of cylindrical vegetation and $m$ is the number of cylinders per m² of horizontal area, indicating vegetation density. $C_d$ is the non-dimensional drag coefficient. Eq. (1) applies to homogeneous vegetation of fixed height and diameter.


3.3. Total bed load transport prediction

The Brownlie (1981) database, which consists of both flume and field data, was used for total bed load transport. A set of 1800 data points was used for modeling: 1200 randomly sampled points were used for training, and the other 600 served as an independent testing set. Data points comprise input values of flow discharge, channel width, flow depth, slope, sediment size, gradation and specific gravity. Parameter values of −1 were discarded, as suggested. The references for individual data sources are compiled in Brownlie (1981) and have therefore been omitted from the present work. The functional form of the sediment load concentration may be expressed as:

$$ C = f(Q, b, y, \tau_b, S_f, d, Gr, G, \nu) \qquad (2) $$

where $C$ is the sediment concentration, $Q$ the water discharge, $b$ the channel width, $\tau_b$ the bed shear stress, $S_f$ the friction slope of the channel, $d$ the sediment particle size, $G$ the specific gravity, $\nu$ the kinematic viscosity and $Gr$ the gradation of the sediment particles.

3.4. Incipient motion threshold prediction

A total of 105 data points were used for incipient shear modeling, split 80–25 between training and testing. Several data sources from the literature were used for modeling the critical shear stress, $\tau_c$, for a channel (Mantz, 1977; Yalin and Karahan, 1979; Shen and Wang, 1983; Pilloti, 2001).

The functional form may be represented as:

$$ \tau_c = f(G, S_f, d, g, \gamma, \gamma_s, \nu, y) \qquad (3) $$

where $G$ is the specific gravity, $S_f$ the friction slope, $d$ the sediment particle diameter, $g$ the acceleration due to gravity, $y$ the channel depth, $\gamma$ the unit weight of water, $\gamma_s$ the unit weight of sediment and $\nu$ the kinematic viscosity.

3.5. Symbolic regression

Symbolic regression aims at arriving at a mathematical expression: an explicit relation between one or more inputs and an output, written using mathematical 'symbols', i.e. functions and variables. The process of symbolic regression is a subset of symbolic function identification, and it differs from conventional regression in that it does not 'calculate' the coefficients or functions. It finds equations by carrying out an extensive, continuously improving guided search in an evolving search space.

Fig. 1. Expression tree structure illustration.

Genetic programming provides a platform with genetic operators (random generation, mutation, crossover, etc.) that produce, alter and select individuals in a population. This is facilitated by storing expressions as tree data structures in computer memory. Tree structures make it easy to swap, append or remove parts of a program, which are the operations carried out by the genetic operators. Fig. 1 shows a tree structure with its evaluated expression.
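To make the tree representation concrete, the following Python sketch stores an expression as nested tuples, evaluates it recursively and replaces one subtree, which is the basic move behind mutation and crossover. The encoding, the function table and the variable names `h` and `i` are assumptions chosen for illustration; the paper does not specify how its MATLAB implementation stores trees.

```python
import math
import operator

# Hypothetical function set: name -> callable.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "sqrt": lambda a: math.sqrt(abs(a)),   # protected square root
}

def evaluate(node, variables):
    """Evaluate a tree: tuples are function nodes, strings are variables,
    anything else is a numeric constant."""
    if isinstance(node, tuple):
        name, *children = node
        args = [evaluate(child, variables) for child in children]
        return FUNCTIONS[name](*args)
    if isinstance(node, str):
        return variables[node]
    return float(node)

def replace_subtree(node, path, new_subtree):
    """Return a copy of `node` with the subtree at `path` replaced.

    `path` is a sequence of tuple indices (1 = first child, 2 = second child);
    index 0 holds the function name and should not be used.
    """
    if not path:
        return new_subtree
    children = list(node)
    children[path[0]] = replace_subtree(node[path[0]], path[1:], new_subtree)
    return tuple(children)

# Tree for the expression 2*h + sqrt(i).
tree = ("+", ("*", 2.0, "h"), ("sqrt", "i"))
value = evaluate(tree, {"h": 1.5, "i": 0.04})

# Swap the second branch for the variable h, giving 2*h + h.
mutated = replace_subtree(tree, (2,), "h")
mutated_value = evaluate(mutated, {"h": 1.5, "i": 0.04})
```

Because tuples are immutable, `replace_subtree` leaves the parent tree intact and returns a new individual, which suits an algorithm that keeps parents and offspring side by side.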

All programming was carried out in the MATLAB environment. The fitness criterion used for symbolic function identification is the sum of squared errors, i.e. the deviation of the model output from the actual output, which must be minimized. Minimizing fitness results in 'better' expressions over the generations, and the best expression is the one with the least fitness. An initial population of expressions is generated with a randomized selection procedure, and the fitness function is used to assess the individual expressions. A genetic algorithm, based on the Darwinian model of reproduction, survival of the fittest and genetic recombination, is then used to create a new population of individuals from the current one. A part of the parent population is selected on the basis of fitness values and used to create an offspring population, which replaces the old generation. Each individual is then assessed for fitness and the process is repeated. Over the generations, this algorithm produces populations that tend to exhibit improving average fitness and adapt themselves to the changes, because newer expressions are developed from better parent individuals and hence contain components of the parent expressions. The structures that undergo adaptation are expressions containing functions, operators, variables and numeric coefficients, whose form, size and complexity can change dynamically during the process. The search space is the set of all possible expressions that can be composed recursively from the available set of $n$ functions $F = \{f_1, f_2, f_3, \ldots, f_n\}$ and the available set of $m$ terminals $T = \{a_1, a_2, a_3, \ldots, a_m\}$. The terminals are variables, numeric constants and universal constants (e.g. pi), while the functions comprise mathematical operators and standard mathematical functions. The functions and variables, along with the numeric constants, are referred to as symbols. The set of functions and terminals used in a particular problem should be selected so as to be capable of capturing the nonlinearity.
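The evolutionary loop described above can be sketched as follows. This is a deliberately small Python illustration under stated assumptions, not the authors' MATLAB procedure: the function and terminal sets, the population size, the 0.7 crossover rate, the rule of keeping the fitter half as parents and the `MAX_DEPTH` cap are all hypothetical choices made to keep the example short.

```python
import random

FUNCTIONS = {"+": lambda a, b: a + b,       # hypothetical function set F
             "-": lambda a, b: a - b,
             "*": lambda a, b: a * b}
TERMINALS = ["x", 1.0, 2.0]                 # hypothetical terminal set T
MAX_DEPTH = 6                               # assumed cap to stop trees bloating

def random_tree(rng, depth):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def depth(tree):
    return 1 + max(depth(tree[1]), depth(tree[2])) if isinstance(tree, tuple) else 0

def evaluate(tree, x):
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def fitness(tree, data):
    """Sum of squared errors; smaller is better."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data)

def mutate(tree, rng):
    """Replace a randomly chosen subtree by a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, 2)
    name, left, right = tree
    if rng.random() < 0.5:
        return (name, mutate(left, rng), right)
    return (name, left, mutate(right, rng))

def crossover(a, b, rng):
    """Graft a random subtree of b into a random position of a."""
    donor = b
    while isinstance(donor, tuple) and rng.random() < 0.5:
        donor = rng.choice(donor[1:])
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return donor
    name, left, right = a
    if rng.random() < 0.5:
        return (name, crossover(left, b, rng), right)
    return (name, left, crossover(right, b, rng))

def evolve(data, pop_size=60, generations=30, seed=1):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []                            # best fitness per generation
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, data))
        history.append(fitness(population[0], data))
        parents = population[: pop_size // 2]       # fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = crossover(p, q, rng) if rng.random() < 0.7 else mutate(p, rng)
            children.append(child if depth(child) <= MAX_DEPTH else p)
        population = parents + children
    population.sort(key=lambda t: fitness(t, data))
    history.append(fitness(population[0], data))
    return population[0], history

# Synthetic target y = x^2 + 1, used only to exercise the loop.
data = [(x, x * x + 1.0) for x in range(-3, 4)]
best, history = evolve(data)
```

Because the fitter half of each generation is carried over unchanged, the best fitness recorded in `history` can never get worse from one generation to the next; how close it gets to zero depends on the seed and the settings.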

3.6. Multi-gene symbolic regression

Traditional genetic programming looks for a single expression that represents the data distribution with minimum error. A relatively new modification of basic genetic programming uses several of the 'best' individuals from the final population and regresses numeric constants to best fit the data. A gene is stored in computer memory as a tree data structure, and evaluating this tree structure yields a sub-program (expression). Since the term 'gene' is used for an individual expression, 'multi-gene' implies the use of multiple expressions that are ultimately regressed.
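One common way to regress several genes is to treat the final model as a weighted sum of gene outputs plus a bias term and to obtain the weights by ordinary least squares. The Python sketch below assumes that form; the text does not give the exact regression used, and the two genes and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("genes are linearly dependent on this data")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_gene_weights(genes, inputs, targets):
    """Least-squares weights [b0, b1, ...] for y ~ b0 + sum(b_k * gene_k(x)),
    obtained from the normal equations."""
    rows = [[1.0] + [gene(x) for gene in genes] for x in inputs]
    k = len(genes) + 1
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(gram, rhs)

def predict(genes, weights, x):
    return weights[0] + sum(w * gene(x) for w, gene in zip(weights[1:], genes))

# Two hypothetical genes, standing in for expressions evolved by GP.
genes = [lambda x: x * x, lambda x: x]
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [1.0 + 0.5 * x * x - 2.0 * x for x in inputs]
weights = fit_gene_weights(genes, inputs, targets)   # close to [1.0, 0.5, -2.0]
```

Splitting the work this way lets the evolutionary search concentrate on the structure of each gene, while the numeric weights follow from a direct linear solve.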

3.7. Optimum parameters of algorithm

The algorithm offers flexibility in the values of various parameters that are to be decided beforehand. These include the total population, the number of generations, the number of genes (expressions), the gene depth (the depth of the tree structure, where the depth of a node is the number of edges that must be traversed to reach it from the tree's root node), the elite fraction and the mutate–crossover–direct (M–C–D) fractions. The parameters of the algorithm depend on the problem at hand, as is generally the case in modeling using soft computing. Each problem has its own curvature of the solution space, and the parameters must be well adjusted to search that space thoroughly. For example, suppose the solution space is very small (in terms of Euclidean volume) and does not require traversing long (Euclidean) distances in the search space. In such a case, mutation would not help, because mutation may cause significant differences in offspring expressions. There are as yet no concrete guidelines for estimating the parameters of this algorithm. Hence a repeated trial-and-error procedure was used to arrive at the best set of parameters. Since the parameters are many, they were selected stepwise: the other parameters were held constant at nominal values while the parameter being optimized was scanned thoroughly across its values. The results were compared to assess the effect of this one parameter only, on the assumption that the effect of varying one parameter does not depend strongly on the fixed values of the others. For each parameter, the run for a particular combination of values was repeated three times to remove random correlations. It was observed that model performance changed considerably as the algorithm parameters were varied and reached a peak at a certain value of each parameter, except for the number of generations, for which performance kept increasing with very fine improvements. Fig. 2 shows a typical plot of performance against a changing parameter, where performance is plotted vs. population. Fig. 3 shows performance changing with generations. The values close to the peak values were selected as optimum parameters. The number of generations after which improvements became negligible was adopted. It is to be noted that some parameters, when increased to very large values, cause the program to incur more computational cost. A compromise was therefore made on the number of generations.

Fig. 3. Model performance vs. population.
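The stepwise selection can be sketched as a one-parameter-at-a-time sweep with three repeats per setting. In this Python sketch, `run_model` is a hypothetical stand-in that returns a noisy score peaking at a population of 300; in practice it would be replaced by one full GP run, and the parameter names and candidate values are likewise assumptions.

```python
import random
from statistics import mean

def run_model(params, seed):
    """Hypothetical stand-in for one GP run; returns a score (higher is better)."""
    rng = random.Random(seed)
    peak = -((params["population"] - 300) / 100.0) ** 2
    gain = 1.0 - 1.0 / params["generations"]     # slowly saturating with generations
    return peak + gain + rng.uniform(-0.01, 0.01)

def sweep(base_params, name, candidates, repeats=3):
    """Vary one parameter, hold the others at their nominal values,
    and average the score over several repeated runs."""
    scores = {}
    for value in candidates:
        params = dict(base_params, **{name: value})
        scores[value] = mean(run_model(params, seed) for seed in range(repeats))
    best = max(scores, key=scores.get)
    return best, scores

# Nominal values for the parameters that are not being scanned.
base = {"population": 100, "generations": 50}
best_pop, pop_scores = sweep(base, "population", [100, 200, 300, 400, 500])
```

Averaging over repeats reduces the chance that a single lucky or unlucky random initialization decides which setting looks best.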

3.8. Model evaluation and comparison

The entire dataset in each of the three problems was randomly sampled into training and testing subsets. Because the sampling is random, the testing set is independent of the training set. The efficiency criteria chosen in the present work to assess model prediction are the coefficient of determination (R²), the Nash–Sutcliffe efficiency (E) and the index of agreement (Id). R² can also be expressed as the squared ratio between the covariance and the multiplied standard deviations of the observed and predicted values. Therefore it estimates the combined dispersion against the single dispersion of the observed and predicted series. The value of R² varies from 0 (no correlation) to 1.0 (100% correlation). The major drawback of R² is that it quantifies the dispersion only. The efficiency E proposed by Nash and Sutcliffe (1970) is defined as one minus the sum of the squared differences between the predicted and observed values, normalized by the variance of the observed values. It varies from negative infinity to 1.0. An efficiency of 1 (E = 1) corresponds to a perfect match of the model predictions to the observed data. An efficiency of 0 (E = 0) indicates that the model predictions are only as accurate as the mean of the observed data, whereas an efficiency less than zero (E < 0) occurs when the observed mean is a better predictor than the model. The most significant disadvantage of the Nash–Sutcliffe efficiency is that the differences between the observed and predicted values are calculated as squared values. As a result, larger values in a series are strongly overestimated whereas lower values are neglected (Legates and McCabe, 1999). The index of agreement Id was proposed to overcome the insensitivity of E and R² to differences in the observed and predicted means and variances (Legates and McCabe, 1999). The index of agreement represents the ratio of the mean square error and the potential error. The range of Id is similar to that of R² and lies between 0 (no correlation) and 1.0 (perfect fit). The models developed herein were compared, in terms of these efficiency criteria, with past empirical predictors on the same datasets in each of the three cases.

Fig. 2. Model performance vs. generations.
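The three criteria can be written out as a short Python sketch. The forms of R² and E follow the verbal definitions in the text; for Id the standard form (one minus the ratio of the sum of squared errors to the potential error) is assumed, since the text does not state the formula explicitly. The function names and the sample data are illustrative assumptions.

```python
from statistics import mean


def r_squared(obs, pred):
    """Squared ratio of the covariance to the product of standard deviations."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def nash_sutcliffe(obs, pred):
    """One minus squared error normalized by the spread of the observations."""
    mo = mean(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)


def index_of_agreement(obs, pred):
    """One minus squared error over the potential error (assumed standard form)."""
    mo = mean(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    potential = sum((abs(p - mo) + abs(o - mo)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / potential


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
biased = [o + 2.0 for o in observed]  # perfectly correlated, constant offset

print(r_squared(observed, biased))
print(nash_sutcliffe(observed, biased))
print(index_of_agreement(observed, biased))
```

For this deliberately biased prediction, R² is 1.0 while E is −1.0 and Id is 0.6875, which illustrates why R² alone, quantifying dispersion only, can overstate model quality.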

4. Results and discussion

The parameters were varied through a range of values to detect the appropriate combination of parameters. Build method, which represents the way tree structures are initialized in the first generation, was varied through three possible configurations: ‘full’, ‘grow’ and ‘ramped half-and-half’. The maximum depth of an individual gene was varied between 2 and 8. The number of genes was varied between 1 and 50. The fractions of mutation, crossover and direct copy were varied in steps of 0.1 within the range 0–1. Tournament size for best population selection was varied between 10% and 80%. Elite fraction, which represents the fraction of the entire population that is simply copied without participating in genetic operations, was varied between 0.05 and 3.00 in steps of 0.05. Finally, population and generations were varied from 5 to 30 and from 10 to 50 respectively for the incipient bed shear predictor, while for the other two cases they were varied from 25 to 5000 and from 50 to 100 respectively. The algorithm parameters adopted for modeling in the three cases are summarized in Table 1. In identifying the best value of each parameter, one parameter was varied while the others were kept constant, and the value yielding the best model was adopted. Three runs were carried out for each parameter combination to obviate random correlations. The average correlation coefficient of the three runs was used as the indicator of model performance. Optimum values identified at each stage were incorporated in the process, with critical parameters such as population and generation reserved for the final stages of parameter optimization. Fig. 4 shows the selection of optimum parameters for the case of incipient motion critical shear.

Table 1. Optimum algorithm parameters.

| Sl. no. | Parameter | Incipient bed shear | Total bed load | Vegetative flow |
|---|---|---|---|---|
| 1 | Build method | Grow | Grow | Ramped half-and-half |
| 2 | Max depth | 4 | 71 | 5 |
| 3 | Max genes | 35 | 6 | 7 |
| 4 | M–C–D probabilities | 0.4–0.4–0.2 | 0.6–0.3–0.1 | 0.2–0.6–0.2 |
| 5 | Tournament size (%) | 70 | 20 | 40 |
| 6 | Elite fraction | 0.15 | 0.1 | 0.2 |
| 7 | Population | 25 | 750 | 500 |
| 8 | Generation | 50 | 500 | 500 |

Table 2. Model performance.

| Predictor | E | R² | Id |
|---|---|---|---|
| (a) Vegetative flow predictor | | | |
| Present model (GP) | 0.94 | 0.97 | 0.98 |
| Stone and Shen (2002) | −0.83 | 0.59 | 0.71 |
| Van Velzen et al. (2003) | −0.64 | 0.54 | 0.65 |
| Baptist et al. (2006) | 0.069 | 0.61 | 0.67 |
| (b) Incipient shear predictor | | | |
| Present model (GP) | 0.99 | 0.99 | 0.99 |
| Paphitis (2001) | 0.64 | 0.88 | 0.92 |
| Sheppard and Renna (2005) | −6.61 | 0.68 | 0.54 |
| Hager and Oliveto (2002) | 0.35 | 0.75 | 0.85 |
| (c) Total bed load predictor | | | |
| Present model (GP) | 0.96 | 0.97 | 0.98 |
| Acaroglu (1968) | 0.29 | 0.78 | 0.584 |
| Brownlie (1981) | 0.69 | 0.78 | 0.44 |
| Yang (2005) | 0.24 | 0.62 | 0.054 |

The model evaluation was carried out on the entire data set for the best model (Appendix 1) obtained using the optimum parameters found for each case. The values of R², Id and E are tabulated in Table 2 for the present models as well as for past predictors in each of the three cases. These values were calculated for the entire dataset (training + testing). Past predictors include explicit empirical and semi-analytical formulations of the variable modeled. For vegetative flow prediction, the predictors presented by Stone and Shen (2002), Van Velzen et al. (2003) and Baptist et al. (2006) were used for comparison with the model developed in the present study by genetic programming. Total bed load predictors include those of Acaroglu (1968), Engelund-Hansen (1967) and Yang (2005). Finally, the incipient motion empirical equations of Paphitis (2001), Hager and Oliveto (2002) and Sheppard and Renna (2005) were used as comparators. It is evident that the present symbolic regression models perform better on all three efficiency criteria in each of the three cases undertaken in this study.

Fig. 4. Optimum parameters identification.

Figs. 5–7 show the model performance, i.e. the agreement of model outputs with actual outputs. The high values of R² reflect strong correlations between model output and actual observed outputs. While the training set performance gives an idea of the generalizing capability of the model, the test sets form an independent set of data with which to determine its predictability. It is observed from the figures that both the generalizing and the predicting power are satisfactory for all three models.

Fig. 5. Total bed load prediction model.

Fig. 6. Vegetative flow predictor.

Fig. 7. Incipient motion threshold shear predictor.

Fig. 8. Observation statistical fits.


Fig. 9. Prediction statistical fits.


Figs. 8 and 9 show the statistical fits obtained for the observed and predicted outputs respectively. It can be seen that the parameters of the fits are very close to each other for observed and predicted values. This reinforces the fact that the observed and predicted values follow the same distribution, with parameters close enough that the distribution of the data remains similar. Observed and predicted values for the incipient shear and vegetative flow cases were found to follow a lognormal distribution, while those for total bed load were found to follow an exponential distribution.
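A comparison of fitted distribution parameters of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the fitting procedure used in the study: maximum-likelihood estimates are assumed for both distributions, the data are synthetic, and the names `fit_lognormal` and `fit_exponential` are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def fit_lognormal(values):
    """Maximum-likelihood lognormal fit: mean and std of the log-values."""
    logs = [math.log(v) for v in values]
    return fmean(logs), pstdev(logs)


def fit_exponential(values):
    """Maximum-likelihood exponential fit: rate = 1 / sample mean."""
    return 1.0 / fmean(values)


# Synthetic stand-ins for observed and predicted outputs (assumed values).
rng = random.Random(42)
observed = [rng.lognormvariate(0.5, 0.4) for _ in range(2000)]
# Hypothetical predictions: observations with a small multiplicative error.
predicted = [v * math.exp(rng.gauss(0.0, 0.05)) for v in observed]

mu_o, sigma_o = fit_lognormal(observed)
mu_p, sigma_p = fit_lognormal(predicted)
print(mu_o, sigma_o)
print(mu_p, sigma_p)
```

When the two parameter pairs are close, the observed and predicted series can be said to share essentially the same fitted distribution, which is the kind of agreement reported for Figs. 8 and 9.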

5. Conclusion

Sediment transport problems are difficult to model analytically because the system variables are multidimensional and nonlinearly interdependent. Moreover, graphical techniques and empirical methods proposed in the past for several sediment transport problems are inadequate, showing low predictability and poor agreement with actual data. Soft computing offers several computational techniques for developing efficient and robust models. Multi-gene symbolic regression was used to develop models for three sediment transport problems: vegetative flow, total bed load transport and incipient shear for sediment threshold prediction. It was observed that highly nonlinear data could be effectively generalized by symbolic regression models. The models developed also showed excellent prediction capabilities on new data not used to train them. Models were assessed using the efficiency criteria of correlation coefficient, index of agreement and Nash–Sutcliffe efficiency, by comparing model outputs with the original observed data for the entire available dataset in each case. It was observed that the models developed herein showed satisfactory efficiency according to all three criteria and also showed better efficiency than past predictors in each of the three cases. This was irrespective of dataset size. It can be concluded that genetic programming can be used to develop satisfactory models in the form of symbolic functions, which can be used to predict unknown information with very good accuracy. Also, robust models can be developed which accommodate noise and inaccuracy in the input data.

Acknowledgements

The authors gratefully acknowledge the financial support received from the Department of Science and Technology, Govt. of India (SERC-DST: SR/S3/MERC/005/2010) to carry out the research work presented in this paper.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.compag.2014.02.010.

References

Acaroglu, E.R., 1968. Sediment transport in conveyance system. PhD Thesis, Cornell University, Ithaca, New York, USA.

Adib, A., 2008. Determining water surface elevation in tidal rivers by ANN. P. I. Civil Eng. – Wat. M. 161 (2), 83–88.

Altunkaynak, A., 2009. Sediment load prediction by genetic algorithms. Adv. Eng. Softw. 40 (9), 928–934.

Amirabdollahian, M., Chamani, M.R., Asghari, K., 2011. Optimal design of water networks using fuzzy genetic algorithm. P. I. Civil Eng. – Wat. M. 164 (7), 335–346.

Aytek, A., Kişi, Ö., 2008. A genetic programming approach to suspended sediment modelling. J. Hydrol. 351 (3–4), 288–298.

Azamathulla, H., AbGhani, A., Zakaria, N.A., Lai, S.H., Chang, C.K., Leow, C.S., Abuhasan, A., 2008. Genetic programming to predict ski-jump bucket spill-way scour. J. Hydrodyn. 20 (4), 477–484.

Bagnold, R.A., 1966. An Approach to the Sediment Transport Problem From General Physics, Professional Paper 422-I, U.S. Geological Survey, Reston, Va.

Baptist, M.J., Babovic, V., Uthurburu, J.R., Keijzer, M., Uittenbogaard, R.E., Verway, A., Mynett, A.E., 2006. On inducing equations for vegetation resistance. J. Hydraulic Res. 45, 435–450.

Brownlie, W.R., 1981. Compilation of Alluvial Channel Data: Laboratory and field, Rep. No. KH-R-43B, California Institute of Technology, Calif.

Buffington, J., 1999. The legend of A. F. Shields. J. Hydraul. Eng. 125 (4), 376–387.

Cao, Z., Pender, G., Meng, J., 2006. Explicit formulation of the Shields diagram for incipient motion of sediment. J. Hydraul. Eng. 132 (10), 1097–1099.

Chang, C.K., Azamathulla, H., Zakaria, N.A., AbGhani, A., 2012. Appraisal of soft computing techniques in prediction of total bed material load in tropical rivers. J. Earth Syst. Sci. 121 (1), 125–130.

Chien, N., Wan, Z.H., 1983. Mechanics of Sediment Movement. Science Publications, Beijing.

Einstein, H.A., 1950. The Bed-load Function for Sediment Transportation in Open Channel Flows, Technical Bulletin 1026, U.S. Department of Agriculture, Soil Conservation Service.

Engelund, F., Hansen, E., 1967. A monograph for sediment transport in alluvial channel, report, Copenhagen, Denmark.

Erskine, W., Keene, A., Bush, R., Cheetham, M., Chalmers, A., 2012. Influence of riparian vegetation on channel widening and subsequent contraction on a sand-bed stream since European settlement: Widden Brook, Australia. Geomorphology 147–148, 102–114.

Galema, A., 2009. Evaluation of Vegetation Resistance Descriptors for Flood Management, Master Thesis, University of Twente.

Goel, A., Pal, M., 2009. Application of support vector machines in scour prediction on grade-control structures. Eng. Appl. Artif. Intel. 22 (2), 216–223.

Hager, W.H., Oliveto, G., 2002. Shields’ entrainment criterion in bridge hydraulics. J. Hydraul. Eng. 128 (5), 538–542.

Ismail, A., Jeng, D.-S., Zhang, L.L., Zhang, J.-S., 2013. Predictions of bridge scour: application of a feed-forward neural network with an adaptive activation function. Eng. Appl. Artif. Intel. 26 (5–6), 1540–1549.

Kisi, Ö., Guven, A., 2010. A machine code-based genetic programming for suspended sediment concentration estimation. Adv. Eng. Softw. 41 (7–8), 939–945.

Kisi, O., Hosseinzadeh Dalir, A., Cimen, M., Shiri, J., 2012. Suspended sediment modeling using genetic programming and soft computing techniques. J. Hydrol. 450–451, 48–58.

Kisi, O., Shiri, J., 2012. River suspended sediment estimation by climatic variables implication: Comparative study among soft computing techniques. Comp. Geosci. 43, 73–82.

Kouwen, N., Fathi Moghadam, M., 2000. Friction factors for coniferous trees along rivers. J. Hydraul. Eng. 126, 732–740.

Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.

Krishna, B., Rao, Y., Nayak, P.C., 2012. Wavelet neural network model for river flow time series. P. I. Civil Eng. – Wat. M. 165 (8), 425–439.

Kumar, B., 2011. Data mining approach for Friction factor in Mobile bed channel. P. I. Civil Eng. – Wat. M. 164 (1), 15–25.


Kumar, B., 2012. Neural network prediction of bed material load transport. Hydrol. Sci. J. 57 (5), 956–966.

Kumar, B., Rao, A.R.K., 2010. Metamodeling approach to predict friction factor in alluvial channel. Comput. Electron. Agr. 70, 144–150.

Kumar, B., Sreenivasulu, G., Rao, A.R.K., 2010. Incipient motion design of sand bed channels affected by bed suction. Comput. Electron. Agr. 74 (2), 321–328.

Legates, D.R., McCabe Jr., G.J., 1999. Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resour. Res. 35 (1), 233–241.

Mantz, P.A., 1977. Incipient transport of fine grains and flakes by fluids – extended Shields diagram. J. Hydraul. Div. Am. Soc. Civ. Eng. 103, 601–615.

Masters, T., 1993. Practical Neural Network Recipes in C++. Academic Press, San Diego, CA.

Meyer-Peter, E., Müller, R., 1948. Formula for Bed Load Transport. In: Proceedings of the 2nd Meeting of the IAHR, Stockholm, pp. 39–64.

Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models. 1: Discussion of principles. J. Hydrol. 10, 282–290.

Paphitis, D., 2001. Sediment movement under unidirectional flows: an assessment of empirical threshold curves. Coast. Eng. 43 (5), 227–245.

Parker, G., 1990. Surface-based bedload transport relation for gravel rivers. J. Hydr. Res. 28 (4), 417–436.

Pilloti, M., 2001. Beginning of sediment transport of incoherent grains in shallow shear flows. J. Hydr. Res. 39 (2), 115–123.

Poggi, D., Krug, C., Katul, G.G., 2009. Hydraulic resistance of submerged rigid vegetation derived from first-order closure models. Water Resour. Res. 45, W10442.

Samani, A.R., Tarazjani, J.A., Borghei, S.M., Jeng, D.S., 2011. Application of neural networks and fuzzy logic models to long-shore sediment transport. App. Soft Comp. 11 (2), 2880–2887.

Shen, H.W., Wang, S., 1983. Incipient motion and Riprap design. J. Hydraul. Eng. 111 (3), 520–537.

Sheppard, D.M., Renna, R., 2005. Bridge Scour Manual. Florida Dept. of Transportation, Tallahassee, FL.

Shields, A., 1936. Anwendung der Aehnlichkeitsmechanik und der Turbulenzforschung auf die Geschiebebewegung, Mitt. Preuss. Versuchsanst. Wasserbau Schiffbau, 26, 26. (English translation by W. P. Ott and J. C. van Uchelen, 36 pp., U.S. Dep. of Agric. Soil Conser. Serv. Coop. Lab., Calif., Inst. of Technol., Pasadena).

Shiri, J., Kisi, O., 2012. Estimation of daily suspended sediment load by using wavelet conjunction models. J. Hydrol. Eng. 17 (9), 986–1000.

Shiri, J., Kisi, O., Makarynskyy, O., Shiri, A.A., Nikoofar, B., 2012. Forecasting daily stream flow using artificial intelligence approaches. ISH J. Hydraul. Eng. 18 (3), 204–214.

Singh, K., Deo, M.C., Sanil Kumar, V., 2007. Neural network–genetic programming for sediment transport. P. I. Civil Eng. – Wat. M. 160 (3), 113–119.

Sinnakaudan, S., Ab Ghani, A., Ahmad, M.S.S., Zakaria, N.A., 2006. Multiple linear regression model for total bed material load prediction. J. Hydraul. Eng. 132 (5), 521–528.

Sinnakaudan, S., Sulaiman, M.S., Teoh, S.H., 2010. Total Bed Material Load Equation for High Gradient Rivers, Water Resources Engineering and Management Research Centre, Universiti Teknologi MARA, Penang, Malaysia.

Smith, D.A., Cheung, K.F., 2004. Initiation of motion of calcareous sand. J. Hydraul. Eng. 130 (5), 467–472.

Stone, B., Shen, H., 2002. Hydraulic resistance of flow in channels with cylindrical roughness. J. Hydraul. Eng. 128, 500–506.

Tang, H., Xin, X., Dai, W., Xiao, Y., 2010. Parameter identification for modeling river network using a genetic algorithm. J. Hydrodyn. 22 (2), 246–253.

Van Velzen, E., Jesse, P., Cornelissen, P., Coops, H., 2003. Stromingsweerstand vegetatie in uiterwaarden, Handboek report 2003.028, RIZA, Arnhem, Netherlands.

Yalin, M.S., Karahan, E., 1979. Inception of sediment transport. J. Hydraul. Div. Am. Soc. Civ. Eng. 105, 1433–1443.

Yang, C.T., 1996. Sediment transport: Theory and Practice, McGraw-Hill, New York, USA.

Yang, S., 2005. Formula for Sediment Transport in Rivers, Estuaries, and Coastal Waters. J. Hydraul. Eng. 131 (11), 968–979.