

Data Mining as a Tool for Environmental Scientists

Jessica Spate [a], Karina Gibert [b], Miquel Sànchez-Marrè [c], Eibe Frank [d], Joaquim Comas [e], Ioannis Athanasiadis [f], Rebecca Letcher [g]

[a] Mathematical Sciences Institute, Australian National University, Canberra, Australia

[b] Department of Statistics and Operations Research, Technical University of Catalonia, Barcelona, Catalonia

[c] Knowledge Engineering and Machine Learning Group, Technical University of Catalonia, Barcelona, Catalonia

[d] Department of Computer Science, University of Waikato, Waikato, New Zealand

[e] Laboratory of Chemical and Environmental Engineering (LEQUIA), University of Girona, Girona, Catalonia

[f] Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Lugano, Switzerland

[g] Integrated Catchment Assessment and Management Centre, Australian National University, Canberra, Australia

Abstract:

Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling, while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.

Keywords: Data mining; environmental data.

1 INTRODUCTION

In 1989, the first Workshop on Knowledge Discovery from Data (KDD) was held. Seven years later, in the proceedings of the first International Conference on KDD, Fayyad gave one of the most well known definitions of what is termed Knowledge Discovery from Data:

"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" Fayyad et al. [1996b]

KDD quickly gained strength as an interdisciplinary research field in which a combination of advanced techniques from Statistics, Artificial Intelligence, Information Systems, Visualization and new algorithms is used to tackle knowledge acquisition from huge databases. The term Knowledge Discovery from Data appeared in 1989, referring to high-level applications which include particular methods of Data Mining:

"[...] overall process of finding and interpreting patterns from data, typically interactive and iterative, involving repeated application of specific data mining methods or algorithms and the interpretation of the patterns generated by these algorithms" Fayyad et al. [1996a]

Thus, KDD is the high level process combining Data Mining methods with different tools for extracting knowledge from data. The basic steps established by Fayyad are briefly described below. In Fayyad et al. [1996a], details on the different techniques involved in this process are provided:

• Developing and understanding the domain, capturing relevant prior knowledge and the goals of the end-user

• Creating the target data set by selecting a proper set of variables or data samples (including generation of proper queries to a central data warehouse if needed)

• Data cleaning and preprocessing. The quality of results depends on the quality of input data, and therefore the preprocessing step is crucial. See Section 3.1 for discussion of this point

• Data reduction and projection: Depending on the problem, it may be convenient to simplify the set of variables in question. The aim here is to keep a relevant set of variables describing the system adequately and efficiently

• Choosing the data mining task, with reference to the goal of the KDD process. From clustering to time series forecasting, many different techniques exist for different purposes, or with different requirements. See Section 3 for a survey of the most common data mining techniques. Depending on the choice of methods, various parameters may or may not need to be set, with or without optimization

• Selecting the data mining algorithm/s: once the task and goals have been decided and codified, a concrete method (or set of methods) needs to be chosen for searching for patterns in the data. Depending on the choice of techniques, parameter optimization may or may not be required

• Data mining: Searching for patterns in data. Results from this stage will be significantly improved if previous steps were performed carefully

• Interpreting mined patterns, possibly followed by further iteration of previous steps

• Consolidating discovered knowledge: documenting and reporting results, or using them inside the target system.

The steps outlined above are illustrated in Figure 1 (from Fayyad et al. [1996a]). Fayyad's proposal marked the beginning of a new paradigm in KDD research:

"Most previous work on KDD has focussed on [...] data mining step. However, the other steps are of considerable importance for the successful application of KDD in practice" Fayyad et al. [1996a]

Fayyad's proposal included prior and posterior analysis tasks as well as the application of data mining algorithms. These may in fact require great effort when dealing with real applications. Data cleaning, transformation, selection of data mining techniques and optimization of parameters (if required) are often time consuming and difficult, mainly because the approaches taken should be tailored to each specific application, and human interaction is required. Once those tasks have been accomplished, the application of data mining algorithms becomes trivial and can be automated, requiring only a small proportion of the time devoted to the whole KDD process. Interpretation of results is also often time consuming and requires much human guidance.

However, it is common in some scientific contexts to use the term Data Mining to refer to the whole KDD process (Siebes [1996]) instead of the application of algorithms to a cleaned dataset only. In those contexts the process may be thought of, in brief, as a sequence of four main steps: data cleaning and variable selection, algorithm and parameter selection, application of said algorithm, and interpretation of results. Some research attention has recently been given to the data mining 'process model' (Shearer [2000]), where an additional phase of deployment is also discussed.

Figure 1: Outline of the Knowledge Discovery Process

It is clear that, whether one refers to the knowledge discovery process as KDD or simply as Data Mining, tasks like data cleaning, variable selection, interpretation of results, and even the reporting phase are of as much importance as the data analysis stage itself. This is particularly true when dealing with environmental data, which is often highly uncertain, and may contain outliers and missing values that necessitate special treatment from any modelling scheme.

Selected algorithms are discussed in Section 3, along with preprocessing methods, which will be explored in Sections 3.1 to 3.3. Later in the paper, concerns such as performance evaluation, model optimization and validation, and dealing with disparate data sources are dealt with. Section 4 contains a brief review of previous environmental data mining work and Section 6 a discussion of some existing environmental problems that may particularly benefit from data mining. Available software is discussed in Section 7, with particular reference to the Weka (Witten and Frank [1991]) and GESCONDA (Sanchez-Marre et al. [2004]) packages. The role of iEMSs in encouraging data mining is raised in Section 8.

2 SOME NOTES ON ENVIRONMENTAL AND NATURAL SYSTEMS AND MODELLING

Environmental systems typically contain many interrelated components and processes, which may be biological, physical, geological, climatic, chemical, or social. Whenever we attempt to analyse environmental systems and associated problems, we are immediately confronted with complexity stemming from various sources:

Multidisciplinarity: a variety of technical, economic, ecological and social factors are at play. Integration of different social and scientific disciplines is necessary for proper treatment, as well as use of analysis techniques from different scientific fields

Ill-structured and nonlinear domain: environmental systems are poorly or ill-structured domains. That is, they are difficult to clearly formulate with a mathematical theory or deterministic model due to high complexity. Many interactions between animal, plant, human and climatic system components are highly nonlinear

High dimensionality and multiscalarity: most environmental processes take place in two or three spatial dimensions, and may also involve a time component. Within this frame, multiple factors are acting at many different spatial and temporal scales (see Section 3.2)

Heterogeneity of data: for environmental real-world problems, data come from numerous sources, with different formats, resolutions and qualities. Qualitative and subjective information is often integral. This topic is discussed in more detail in Section 6.1

Intrinsic non-stationarity: Environmental processes are in general not static, but evolve over time. The assumption of stationarity cannot be justified in many physical, chemical, and biological processes Guariso and Werthner [1989].

Controllability: Controllability of environmental systems is poor, due to the unavailability of actuators Olsson [2005]

Uncertainty and imprecise information: because environmental data collection is often expensive and difficult, measurement error is often large, and spatial and temporal sampling may not fully capture system behaviour. Records may also contain missing values and highly uncertain information. See Section 3.1

Many environmental systems involve processes which are not yet well known, and for which no formal models are established at present. Because the consequences of an environmental system changing behavior or operating under abnormal conditions may be severe, there is a great need for Knowledge Discovery in the area.

Great quantities of data are available, but as the effort required to analyse the large masses of data generated by environmental systems is large, much of it is not examined in depth and the information content remains unexploited. The special features of environmental processes demand a new paradigm to improve analysis and consequently management. Approaches beyond straightforward application of conventional classical techniques are needed to meet the challenge of environmental system investigation. Data mining techniques provide efficient tools to extract useful information from large databases, and are equipped to identify and capture the key parameters controlling these complex systems.

3 DATA MINING TECHNIQUES

Here we shall introduce a variety of data mining techniques: classification (Section 3.5), clustering (Section 3.4), and association rule extraction (Section 3.6), as well as preprocessing and other data issues. Of course, we cannot hope to detail all data mining tools in a short paper. An extensive review of data mining tools for environmental science is given in Spate and Jakeman [Under review 2006], and references to specific papers are given throughout the text. Key reading materials introducing the reader to essential points of KDD are Han and Kamber [2001], Witten and Frank [1991], Hastie et al. [2001], Larose [2004] and Parr Rud [2001]. The techniques listed below are some of the most common and useful in the data mining toolbox, and preprocessing and visualisation are also included in this section, as they are essential components of the knowledge discovery process.

3.1 Preprocessing: Data Cleaning, Outlier Detection, Missing Value Treatment, Transformation and Creation of Variables

Sometimes, a number of cells are missing from the data matrix. These cells may be marked as a *, ?, NaN (Not a Number), blank space or other special character, or by a special numeric code such as 99999. The latter can produce grave mistakes in calculations if not properly treated. It is also important to distinguish between random and non-random missing values (Allison [2002], Little and Rubin [1987]). Non-random missing values are produced by identifiable causes that will determine the proper treatment, also influenced by the goals of the task. Imputation (see Rubin [1987]) is a complex process for converting missing data into useful data using estimation techniques. It is important to avoid false assumptions when considering imputation methods, because this choice may have a significant effect on the results extracted.

Options include removing the object altogether (although useful data may be thrown away), replacing it with a mean or otherwise estimated value, duplicating the row once for each possible value if the variable is discrete, or excluding it from the analysis by modification of the data mining algorithm. None of these are without pros and cons, and the choice of method must be made with care. In particular, removing rows with missing cells from a dataset may cause serious problems if the missing values are not randomly distributed. Caution should be exercised when adopting this approach, and it is of utmost importance to report any elimination performed.
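
As a concrete illustration of two of these options, the sketch below converts a special numeric missing-value code into proper missing values and then either drops the affected rows or replaces the gaps with column means. It assumes Python with pandas and scikit-learn, neither of which is discussed in this paper, and the column names and values are invented.

```python
# Minimal sketch of missing value handling, assuming pandas and scikit-learn.
# Column names and the 99999 missing-value code are invented for the example.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    "rainfall_mm": [12.0, np.nan, 3.5, 0.0, 99999.0],   # 99999 used as a missing-value code
    "temp_max_c":  [31.2, 29.8, np.nan, 27.5, 30.1],
})

# Convert the special code to a proper missing value first, otherwise it
# silently distorts every calculation.
data = data.replace(99999.0, np.nan)

# Option 1: drop rows with missing cells (useful data may be thrown away).
dropped = data.dropna()

# Option 2: replace missing cells with the column mean.
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(data),
                       columns=data.columns)

print(dropped)
print(imputed)
```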

Outliers are objects with very extreme values in one or more variables (Barnett and Lewis [1978]). Graphical techniques were once the most common method for identifying them, but increases in database sizes and dimensions have led to a variety of automated techniques. The use of standard deviations is possible when and only when considering a single variable that has a symmetric distribution, but outliers may also take the form of unusual combinations of two or more variables. The data point should be analysed as a whole to understand the nature of the outlier.
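
A minimal sketch of the single-variable, standard-deviation approach follows, assuming Python with numpy and pandas (an illustrative choice, not a tool named in the paper) and a synthetic streamflow-like series; multivariate outliers would require a different treatment.

```python
# Minimal sketch of flagging outliers in a single, roughly symmetric variable
# using the conventional 3-sigma rule. Data are synthetic; flagged points
# should be inspected, not silently deleted (see the quote below).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
flow = pd.Series(np.concatenate([rng.normal(2.0, 0.3, 200), [35.0]]))  # one suspect reading

z_scores = (flow - flow.mean()) / flow.std()
outliers = flow[z_scores.abs() > 3]

print(outliers)
```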

Outliers can then be set apart and treated as missing, corrected (this may be possible where decimal points have been mistakenly entered, for example), or included in the study as regular points. Whichever course of action is taken, the presence of an outlier should be noted for future reference. Certain modelling methods and data mining algorithms may be affected to a greater or lesser degree by the presence of outliers, a concern which should inform the choice of tools used throughout the rest of the process. See Moore and McCabe [1993] for an interesting discussion on the dangers of eliminating rows with outliers:


"In 1985 British scientists reported a hole in the ozone layer of the Earth's atmosphere over the South Pole. [...] The British report was at first disregarded, since it was based on ground instruments looking up. More comprehensive observations from satellite instruments looking down had shown nothing unusual. Then, examination of the satellite data revealed that the South Pole ozone readings were so low that the computer software [...] had automatically suppressed these values as erroneous outliers! Readings dating back to 1979 were reanalyzed and showed a large and growing hole in the ozone layer [...] suppressing an outlier without investigating it can keep valuable information out of sight."

Moore and McCabe [1993]

Sometimes, transformation of variables may assist analysis. For example, normality may be forced when using ANOVA, or, for ease of interpretation, variables with a large number of categorical labels can be grouped according to expert knowledge. Under some circumstances, discretisation of continuous variables is appropriate (e.g. Age into Child under 18 years, Adult between 18 and 65 years, Elderly over 65 years). Noise is often a critical issue, and especially with environmental data some bias may exist that can be removed with a filter. Transformations should always be justified and documented, and the biases that may be introduced noted (Gibert and Sonicki [1999]).
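
The sketch below applies two of the transformations just mentioned: discretising Age into the Child/Adult/Elderly groups from the text, and log-transforming a skewed variable. Python with pandas and numpy is an illustrative choice; the bin edges follow the text and all data values are invented.

```python
# Minimal sketch of discretisation and a log transform, assuming pandas/numpy.
import numpy as np
import pandas as pd

age = pd.Series([4, 17, 18, 42, 64, 65, 81])
age_group = pd.cut(age,
                   bins=[0, 18, 65, 120],               # under 18, 18-65, over 65
                   labels=["Child", "Adult", "Elderly"],
                   right=False)                          # intervals are [low, high)

# A log transform can tame a heavily skewed variable such as streamflow,
# at some cost in interpretability (see Section 5).
streamflow = pd.Series([0.2, 1.5, 12.0, 350.0])
log_flow = np.log10(streamflow)

print(pd.DataFrame({"age": age, "group": age_group}))
print(log_flow)
```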

Creation of additional variables is also used to facilitate the knowledge discovery process under some circumstances. For example, see the decision tree of Figure 2. The object here is to determine whether or not a given year is a leap year, and to do this, an efficient tree is built using the variables YearModulo4 and YearModulo100. Where extra features are constructed in this way, expert knowledge is usually the guide. Exploratory variable creation without such assistance is almost always prohibitively time consuming, and as noted in Section 5, may obfuscate physical interpretation and exacerbate noise. Efficient techniques for data reduction, however, do exist and are well used.

3.2 Data Reduction and Projection

When the number of variables is too high to deal with in a reasonable way, which is not unusual in a data mining context, it may be convenient to apply a data reduction method. This kind of technique consists of finding some set with the minimum number of variables that captures the information contained in the original data set.

This may be accomplished by eliminating some variables wholesale, or by projecting the feature space of the original problem into a reduced fictitious space with fewer dimensions. Principal Components Analysis (PCA) (see for example Dillon and Goldstein [1984]) is one of the best known techniques used for the latter purpose. Each principal component is a linear combination of the original variables, and the aim is to work with a reduced set of these, such that the loss of information is not great. It is important to note that interpretation of the new variables may be lost.
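
The following sketch projects a small synthetic dataset onto its first two principal components using scikit-learn, an illustrative choice not made in the paper; the "water quality" variables are invented, and standardisation is applied first so that no variable dominates merely through its scale.

```python
# Minimal sketch of PCA-based projection, assuming numpy and scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
temp = rng.normal(15, 4, n)
oxygen = 12 - 0.3 * temp + rng.normal(0, 0.5, n)      # correlated with temperature
turbidity = rng.exponential(5, n)
X = np.column_stack([temp, oxygen, turbidity])

X_std = StandardScaler().fit_transform(X)             # standardise before PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("loadings (rows = components):")
print(pca.components_)
```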

Regarding the former method, datasets may contain irrelevant or redundant variables. For example, in the daily weather dataset discussed in Spate et al. [2003], four temperature variables were recorded: maximum, minimum, mean, and grass. In the context of that study, all four contained much of the same information and any three out of the four could be eliminated without significant loss of predictive capacity. A Boolean (presence/absence) marker for frost was also included in the original database, but this was only relevant for three out of six sites, as the other measurement stations were located in a subtropical maritime environment (the Brisbane region of Queensland, Australia) where frosts do not occur. No useful information was encoded in the Brisbane frost variables, which were all vectors of zeros and could thus be deleted from the Brisbane datasets.

Automated techniques for identifying and removing unhelpful, redundant or even contradictory variables usually take one of two forms: statistical examination of the relevance properties of candidate variables and combinations of the same, or searching through the space of possible combinations of attributes and evaluating the performance of some model building algorithm for each combination. The former are called filters and the latter wrappers (see Hall [1999] for details). For a survey of common attribute selection techniques, see Molina et al. [2002].
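
As an illustration of the two forms, the sketch below contrasts a filter (univariate statistical scoring of attributes) with a simple wrapper (a greedy search whose candidate subsets are evaluated by a classifier's cross-validated performance). It assumes scikit-learn, which is not one of the packages discussed in this paper, and uses a synthetic dataset.

```python
# Minimal sketch of filter versus wrapper attribute selection, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Filter: score each attribute independently of any particular model.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper: greedily add attributes, evaluating a classifier at each step.
wrapper_selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3, cv=5).fit(X, y)
print("wrapper keeps features:", wrapper_selector.get_support(indices=True))
```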

Other techniques are based on feature weighting (Aha [1998], Nunez et al. [2003]), which is a more general and flexible approach than feature selection. The aim of feature weighting is to assign a degree of relevance, commonly known as a weight, to each attribute. In this way, similarity computations for tasks like clustering and rule induction can be improved. Similarities (or dissimilarities) become emphasized according to the relevance of the attribute, and irrelevant attributes will not influence the results, so the quality of inductive learning improves.

3.3 Visualisation

While automation is a key goal for knowledge discovery routines, some human interaction is still beneficial and indeed necessary. One of the key points at which human interaction is often most fruitful is the visualisation stages, during pre- and postprocessing. Graphical methods should be the first stage of investigation for all datasets, even those whose dimension is too great to allow a comprehensive survey in this way. The presence of outliers, missing values, errors, and unusual behaviour are often first noted visually, enabling more detailed investigation later. Redundant and useless variables may also become clear at this stage, although by no means is visualisation a complete substitute for quantitative exploratory analysis.

Graphs commonly used for classical exploratory visualisation like boxplots, histograms, time series plots, and two dimensional scatter plots may be useful for examining individual variables or pairs of variables, but when considering a great number of variables with complex interrelations other devices may have greater utility, as scatter plots can usefully contain only a small number of dimensions and a limited number of points. A variety of more sophisticated visualisation methods appear in the context of data mining, for example:

• Distributional plots

• Three, four, and five dimensional plots (colour and symbols may be used to represent the higher dimensions)

• Using transformed variables, for example log scales

• Rotatable frames

• Animation with time

Many data mining packages (for example Weka, Waikato [2005]) include visualisation facilities, and the more complex devices mentioned above, such as rotatable reference frames for three dimensional plots and animations, can be generated with common packages such as Matlab, a dedicated data language such as IDL, or the CommonGIS tool (Andrienko and Andrienko [2004]). There are also dedicated visualisation tools such as XGobi (Swayne et al. [1998]).

Visual representations are extremely effective, and may convey knowledge far better than numerical information or equations. It is well accepted that presentation of the results from almost all modelling processes should include graphical illustrations, and we argue that the same approach is equally essential to the knowledge discovery process.

3.4 Clustering and Density Estimation

Clustering techniques are used to divide a data set into groups. They are suitable for discovering the underlying structure of the target domain, if this is unknown. For this reason, they belong to the group of techniques known as unsupervised learners, along with association rule extraction, which will be discussed in Section 3.6. Clustering techniques serve an exploratory goal, rather than a predictive one. They identify distinct groups of similar objects (according to some criteria) that can be considered together, which is very useful in the Data Mining context, since the number of cases to be analysed can be huge. Ideally, objects within a cluster should be homogeneous compared to the differences between cluster representatives.

The measure of distance or dissimilarity between data objects can be based either on a quantitative metric or dissimilarity measure, or on some logical criteria derived from analogy or concept generalization, depending on the research field where the clustering algorithm was conceived (usually either Statistics (Sokal and Sneath [1963]) or Artificial Intelligence (Michalski and Stepp [1983])). Sometimes it is convenient to mix algebraic and logical criteria for better capturing difficult domain structures (Gibert et al. [2005a]). Note that where data is continuous and the scale changes between variables, normalisation may be necessary to avoid weighting variables unevenly. Appropriate choice of criteria for comparing objects (distance measure) is essential, and different measures will result in different clustering schemes, a point which is discussed in detail in Spate [In preparation], Nunez et al. [2004], and Gibert et al. [2005b].
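
The sketch below illustrates the normalisation point: two synthetic water-chemistry variables on very different scales are standardised before a simple distance-based (k-means) clustering. scikit-learn is an illustrative choice; the mixed algebraic/logical criteria mentioned above are not shown.

```python
# Minimal sketch of distance-based clustering after normalisation,
# assuming numpy and scikit-learn. Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
conductivity = np.concatenate([rng.normal(200, 20, 50), rng.normal(800, 40, 50)])  # uS/cm
ph = np.concatenate([rng.normal(6.8, 0.2, 50), rng.normal(7.4, 0.2, 50)])
X = np.column_stack([conductivity, ph])

# Without scaling, conductivity would dominate the Euclidean distance.
X_std = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print("objects per cluster:", np.bincount(labels))
```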


There are different families of techniques, to be used depending on the desired form of the clusters. The simplest methods divide the feature space into a set number of (hyper)polygonal partitions, but other methods can construct overlapping or fuzzy classes or a hierarchy of clusters. For a survey, see Dubes and Jain [1988].

Clustering can also be viewed as a density estimation problem by assuming that the data was generated by a mixture of probability distributions, one for each cluster (see, e.g., Witten and Frank, 2005). A standard approach is to assume that the data within each cluster are normally distributed. In that case a mixture of normal distributions is used. To get an improved and more concise description of the data, other distributions can be substituted when the assumption of normality is incorrect. The overall density function for the data is given by the sum of the density functions for the mixture components, weighted by the size of each component. The beauty of this approach is that one can apply the standard maximum likelihood method to find the most likely mixture model based on the data. The parameters of the maximum likelihood model (for example, means and variances of the normal densities) can be found using the expectation maximization algorithm. Treating clustering as a density estimation problem makes it possible to objectively evaluate the model's goodness of fit, for example, by computing the probability of a separate test set based on the mixture model inferred from the training data. This approach also makes it possible to automatically select the most appropriate number of clusters.
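
A minimal sketch of this mixture-model view follows, assuming scikit-learn (an illustrative choice). A Gaussian mixture is fitted by expectation maximization for several candidate numbers of components; here the Bayesian information criterion is used to pick the number of clusters, in place of the separate-test-set likelihood mentioned above, and the data are synthetic.

```python
# Minimal sketch of clustering as density estimation with a Gaussian mixture
# fitted by EM, choosing the number of components by BIC. Assumes scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(0.0, 1.0, (100, 2)),
                    rng.normal(5.0, 0.7, (80, 2))])

best_k, best_bic = None, np.inf
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gmm.bic(X)
    if bic < best_bic:
        best_k, best_bic = k, bic

print("selected number of mixture components:", best_k)
```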

3.5 Classification and Regression Methods

In classification and regression, the identity of the target class is known a priori and the goal is to find those variables that best explain the value of this target, either for descriptive purposes (better understanding the nature of the system) or for prediction of the class value of a new datapoint. They are an example of supervised learning methods. A popular and accessible classification model is the decision tree, of which an example is given in Figure 2. This simple model, built on two variables, tells us if a given year is a leap year or not. The information encoded is thus: if the year is divisible by four with no remainder (YearModulo4 = 0), that year is a leap year unless it is also exactly divisible by 100 (YearModulo100 = 0). The decision is given at the internal nodes of the tree, and if the condition holds, we follow the right-hand branch. If it fails, we follow the left-hand branch. In this way, the tree splits the dataset into smaller and smaller pieces until each can be assigned a label, which is the value of the target variable. Decisions can take the form of greater than/less than criteria, or equality with a specific value/s or category/ies. Perhaps the most common decision tree extraction algorithm is C4.5 (Quinlan [1993]).

Figure 2: Example Decision Tree
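
The sketch below learns a tree equivalent to the simplified leap-year example of Figure 2 from the two modulo variables. It uses scikit-learn, which grows CART-style trees rather than C4.5 (C4.5 is available in Weka as J48); the choice of library and the generated data are illustrative only.

```python
# Minimal sketch of learning the leap-year tree of Figure 2, assuming
# numpy and scikit-learn (CART-style trees, not C4.5).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

years = np.arange(1800, 2000)
X = np.column_stack([years % 4, years % 100])              # YearModulo4, YearModulo100
y = ((years % 4 == 0) & (years % 100 != 0)).astype(int)    # simplified leap-year label

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["YearModulo4", "YearModulo100"]))
```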

Classical linear regression is a technique for finding the best linear equation defining the relationship between a numerical response variable and the independent variables, all of which should also be numerical (Draper and Smith [1998]). It is mainly used for prediction of the target variable, but also for identifying which variables have the strongest influence on the behavior of the response variable. In that sense it is useful for descriptive purposes. However, there are many conditions to be met in order for regression to be a suitable technique, such as normality, homoscedasticity or independence of the regressors. When all the independent variables are qualitative, ANOVA should be used, and where both qualitative and numerical data are involved, ANCOVA is the proper model, provided the required technical hypotheses hold and the qualitative variables are properly transformed into dummy variables. For other situations, nonlinear regression may be useful (if the nonlinearity is not polynomial, neural networks may be a better approach). For detailed discussion of multivariate regression, see for example Lebart et al. [1984] and Dillon and Goldstein [1984]. When modelling non-numerical responses, it is possible to use logistic regression (when the response is binary or can be transformed to binary) or even polytomous regression (for qualitative responses of more than 2 categories). In that case, interpretation of results requires significant care.

A middle way between the two approaches of classification and regression exists in the family of methods known as regression trees. Here, the dataset is split up into blocks by a tree like the one shown in Figure 2, but instead of a class label on each leaf, there is a model obtained by regression. The M5' ('em five prime') algorithm mentioned in Frank et al. [2000] is an example of this concept.

Case-Based Reasoning (CBR) is a general problem-solving, reasoning and learning paradigm (see Kolodner [1993]) within the artificial intelligence field. CBR relies on the hypothesis that similar problems have similar solutions, and new problems can be solved from past solutions stored in an experience memory, usually called the case library or case base. If the structure of the cases or experiences is 'flat', where a case can be described with a set of pairs formed by one attribute and its value, the application of CBR principles is generally known as the nearest neighbour technique, memory-based reasoning or instance-based reasoning. CBR can be used as a classification method. In this case, the assumption is that similar cases should have similar classifications: given a new case, one or more similar cases are selected from the case library, and the new case is classified according to the class values of those neighbours. Quality of the case library is critical for good classification, as is the choice of an appropriate measure of similarity between cases (see Nunez et al. [2004]). In CBR systems, the retrieval of similar instances from memory is a crucial point. If it is not carried out with accuracy and reliability, the system will fail. The retrieval task is strongly dependent on the case library organization (Sanchez Marre et al. [2000]). It must also be noted that CBR does not produce an explicit model describing system behaviour in the way that most models do. Rather, the model is implicit in the database itself.
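
For flat cases, the nearest-neighbour reading of CBR can be sketched as below, assuming scikit-learn; the "case library" of past situations, the attributes and the labels are all invented, and in practice attributes would be scaled or weighted before computing similarity (Section 3.2).

```python
# Minimal sketch of nearest-neighbour retrieval over a flat case library,
# assuming numpy and scikit-learn. Attributes and labels are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Case library: [slope_percent, annual_rainfall_mm]; label 1 = erosion observed.
case_library = np.array([[2, 600], [15, 900], [12, 1100], [3, 450], [18, 1300], [5, 700]])
labels = np.array([0, 1, 1, 0, 1, 0])

cbr = KNeighborsClassifier(n_neighbors=3).fit(case_library, labels)

new_case = np.array([[14, 1000]])
print("class of most similar past cases:", cbr.predict(new_case))
```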

Standard feed-forward neural nets are supervised learners, although another form of neural net (the self-organising map) exists as an unsupervised method. They can be used for classification and for regression. Most simply, neural nets consist of a series of nodes arranged in a number of layers, commonly three: an input layer, a single hidden layer where nonlinear transformations of the input variables are performed, and an output layer. Additional layers of hidden nodes and feedback can be introduced into the system, providing greater flexibility at the cost of increased complexity. This complexity is the chief reason for the idea that neural nets are 'black boxes', which is not entirely warranted. With good visualisation, process information can be extracted from them.
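
A minimal sketch of such a network, with a single hidden layer of nonlinear nodes, is given below using scikit-learn's multi-layer perceptron as an illustrative implementation; the target function, the network settings and the data are all invented for the example.

```python
# Minimal sketch of a feed-forward net with one hidden layer, assuming
# numpy and scikit-learn. Data and settings are invented.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, (400, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2 + rng.normal(0, 0.05, 400)

X_std = StandardScaler().fit_transform(X)

net = MLPRegressor(hidden_layer_sizes=(20,),   # one hidden layer of 20 nodes
                   activation="tanh",
                   max_iter=5000,
                   random_state=0).fit(X_std, y)

print("training R^2:", round(net.score(X_std, y), 3))
```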

As noted in the introduction, artificial neural networks (also called neural nets) have been extensively applied in the environmental sciences, and for that reason we will not examine them in detail here. Numerous references can be found in Section 4.

Rule induction or rule extraction is the process of discovering rules summarising common or frequent trends within a dataset (i.e. which variables and values are frequently associated). Classification rules are rules induced from labelled examples; the examples should be marked with the cluster or class label. Thus, it is a supervised machine learning technique, which can be used to predict the cluster to which a new example or instance belongs. The induced rules are not usually guaranteed to cover the entire dataset, and focus more on representing trends in the data. Classification rule extraction algorithms are similar to the association rule discovery techniques discussed in Section 3.6. In an environmental context, supervised classification rules could be used, for example, to identify from a grid of spatial points those locations that may be prone to gully erosion. A rule for erosion vulnerability might look like:

IF slope > 10% AND soiltype = yellow podzolic AND landuse = grazing THEN risk = high

Note that some rule extraction routines can combine numerical and categorical data. A time component can also be introduced into the rule format. Consider the case where, for a catchment of a given area A, the time delay between rainfall P and streamflow Q events is T, or in other words, IF P AND A THEN Q WITHIN TIME T. As a formulaic expression, this rule might be compactly expressed as:

P, A ⇒ Q_T
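
To make the rule formats above concrete, the sketch below applies the gully-erosion rule to a small table of spatial points with pandas (an illustrative choice); the attribute names follow the rule and the data values are invented.

```python
# Minimal sketch of evaluating the erosion-vulnerability rule from the text
# over a table of points, assuming pandas. Data values are invented.
import pandas as pd

points = pd.DataFrame({
    "slope":    [12.0, 4.0, 15.0, 11.0],   # percent
    "soiltype": ["yellow podzolic", "clay", "yellow podzolic", "yellow podzolic"],
    "landuse":  ["grazing", "grazing", "cropping", "grazing"],
})

rule = ((points["slope"] > 10)
        & (points["soiltype"] == "yellow podzolic")
        & (points["landuse"] == "grazing"))

points["risk"] = rule.map({True: "high", False: "not flagged"})
print(points)
```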

3.6 Association Analysis

Association analysis is the process of discovering and processing interesting relations from a dataset. The concept was originally developed for supermarket analysis, where the aim is to uncover which items are frequently bought together. Association rules have the advantage that, as no initial structure is specified, the results may contain rules that are highly unexpected and which would never have been specifically searched for because they are inherently surprising. The format, summarising only frequently occurring patterns, can also be useful for anomaly detection, because those datapoints violating rules that usually hold are easy to identify and may be examples of interesting behaviour.
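
The support/confidence bookkeeping behind this idea can be sketched in a few lines, as below; the "baskets" here are invented sets of species observed together at monitoring sites, Python's standard library is used for brevity, and a real association miner such as Apriori would search the itemset lattice far more efficiently.

```python
# Minimal sketch of support and confidence for co-occurring item pairs.
# Baskets are invented; real association rule miners are far more efficient.
from collections import Counter
from itertools import combinations

baskets = [
    {"frog", "reed", "dragonfly"},
    {"frog", "reed"},
    {"reed", "dragonfly"},
    {"frog", "reed", "heron"},
    {"heron", "dragonfly"},
]

pair_counts, item_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[a]          # confidence of the rule a -> b
    if support >= 0.4:                           # keep only frequent pairs
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```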


Rule extraction algorithms, for both association and classification, tend to fall into two broad categories: those built by generalising very specific rules until they cover a certain number of instances (for example the AQ family of algorithms described in Wnek and Michalski [1991]), and those that begin with a broad rule covering all or a large fraction of the data and refine that rule until a sufficient level of precision is achieved (such as the PRISM (Cendrowska [1998]) and RIPPER (Cohen [1995]) algorithms). For obvious reasons, the specific-to-general variety are for the most part classification rule learners. Many rule extraction algorithms are extremely fast, and can thus be applied to very large databases in their entirety. They may be used either for predictive purposes or for system investigation.

3.7 Spatial and Temporal Aspects of Environmental Data Mining

The dominant field of application for the knowledge discovery process is business. Only recently has data mining expanded into other fields, as the utility of knowledge extraction methods has been noted by researchers outside the data mining community. In business, knowledge extraction techniques are used for identifying consumer patterns or maximizing profit-related functions. Data is typically sourced from business transaction databases or Enterprise Resource Planning (ERP) systems. Spatial and temporal considerations are usually not significantly related to business goals, although they may be included during the data filtering/cleaning preparation phases (e.g. with a marker for transactions recorded in July, or in Vermont). This kind of spatial data can be easily integrated into a dataset for knowledge discovery, as can time or distance to a specific point. However, environmental science often requires deep consideration of spatial and temporal variables, and indeed this is often the primary focus. Therefore, more sophisticated spatiotemporal techniques are valuable.

Temporal relationships include "before-during-after" relations, while spatial relationships deal either with metric (distance) relations, or with non-metric topological relations. Classical data mining methods do not accommodate this kind of need, but many can be modified to accept it (and have been, see for example Antunes and Oliveira [2001], Spate [2005]) with minimal effort and good results. Spatial data mining is often treated as an extension of temporal data mining techniques (including time series analysis and sequence mining) into a multidimensional space.

Recently, data mining techniques have been designed with spatial and temporal data in mind, and a significant body of spatial and temporal data mining research does exist. Time series data mining techniques have been built for stock market data extraction and industrial purposes (for example see Antunes and Oliveira [2001], Chen et al. [2004]), and even environmental science (see Sanchez Marre et al. [2005]). It has been stated in the literature (Keogh and Kasetty [2002]) that temporal data mining is an area where much effort should be focussed. An example of an explicitly spatial machine learning technique is Cellular Automata, with their interacting grid points where spatial and temporal dynamics can be modelled very naturally. For an example application, see the bird nest site work of Campbell et al. [2004]. Here, consideration of neighbouring nest sites was an integral part of the model. Some other spatial and temporal data mining examples are listed in Section 4.

Spatiotemporal data mining is potentially useful for a variety of tasks, including: (a) spatiotemporal pattern identification (as in pattern analysis and neighbourhood analysis); (b) data segmentation and clustering (spatiotemporal classification); (c) dependency analysis, correlation analysis and fault detection in data (outlier detection, surprising pattern identification); and (d) trend discovery and sequence mining (as in regression analysis and time series prediction). It should be noted that spatial resolution and time granularity affect the nature of the extracted patterns and must be chosen with appropriate care. In this respect, visualizing data and extracted patterns, employing maps and GIS technology (for example see Andrienko and Andrienko [2004]), could be valuable.

4 PREVIOUS WORK IN THE AREA

As mentioned in the introduction, most data mining techniques have not found wide scale application in the environmental sciences. In this section we mention a few projects that have utilised this technology, although as with all literature reviews, we do not claim to make an exhaustive list. A few papers have been published advocating learning methods for environmental applications, for example Babovic [2005]. Comas et al. [2001] discuss the performance of several data mining techniques (decision tree creation, two types of rule induction, and instance-based learning) used to identify patterns from environmental data. A small number of research groups also exist with the specific aim of using artificial intelligence or data mining in the environmental sciences.


The BESAI (Binding Environmental Science and Artificial Intelligence) working group has organised four international workshops within the ECAI conferences during 1998-2004, one international workshop at the IJCAI'2003 conference, and one international workshop at AAAI'99, with contributions addressing data mining techniques. They have also organised two special sessions devoted to Environmental Sciences and Artificial Intelligence during the iEMSs 2002 and iEMSs 2004 international conferences. See the BESAI website for more details (BESAI, http://www.lsi.upc.edu/webia/besai/besai.html).

The European Network of Excellence on Knowledge Discovery (KDnet) organised a workshop on Knowledge Discovery for Environmental Management (Voss et al. [2004]) in an effort to promote KDD in the public sector. Four International Workshops on Environmental Applications have been held over the period 1997 - 2004, producing some of the papers discussed below. The output of the latest workshop, which focussed on genetic algorithms and neural networks, is discussed in Recknagel [2001].

One exception to the statement that data mining techniques are not widely used in the area is the use of Artificial Neural Networks, which have become an accepted part of the environmental modelling toolbox. For examples see Kralisch et al. [2001] and Almasri and Kaluarachchi [2005] on nitrogen loading, Mas et al. [2004] on deforestation, or Belanche et al. [2001], Gibbs et al. [2003] and Gatts et al. [2005] on water quality, or the discussion in Recknagel [2001]. Numerous other examples can be found in most journals in the area.

Examples of the use of clustering algorithms include Sanchez-Marre et al. [1997], where different techniques for clustering wastewater treatment data were compared, and Zoppou et al. [2002], where 286 Australian streamflow series were clustered according to a dimensionally reduced representation. The aim was to identify groups of catchments with similar physical characteristics. Clustering or similar methods have also been used to a similar end in [????] and Sanborn and Bledsoe [2005] for study areas in the United Kingdom and Northwestern United States respectively. Clustering was also applied to cyclone paths in Camargo et al. [2004], and in Ter Braak et al. [2003] to cluster water samples according to chemical composition.

Various classification algorithms have been applied to a wide variety of environmental problems as well. Rainfall intensity information was extracted from daily climate data in Spate [2002] and Spate et al. [2003] using a number of classification methods. A decision tree like that in Figure 2 was used in Ekasingh et al. [2003], where the cropping choices of Thai farmers were modelled as a classification problem with considerable success. Mosquito population sites were classified in Sweeney et al. [2004, submitted], with a view to controlling the spread of malaria. Agriculturally, classification has been applied to apple bruising (Holmes et al. [1998]), mushroom grading (Cunningham and Holmes [1999]), bull castration and venison carcase analysis in Yeates and Thomson [1996], and perhaps most famously in Michalski and Chilausky's classic soybean disease diagnosis work (Michalski and Chilausky [1980]). Regression trees were applied (for example) to sea cucumber habitat preference modelling in Dzeroski and Drumm [2003].

Much effort has been made by international scientists in the water quality and wastewater quality control domains. Some approaches used rule-based reasoning (Zhu and Simpson [1996]), case-based reasoning (Rodríguez-Roda et al. [1999]), fuzzy logic (Wang et al. [1997]), and artificial neural networks (Syu and Chen [1998]), and integrated approaches were also developed, such as in Rodríguez-Roda et al. [2002] and Cortes et al. [2002]. Many of these approaches utilised several data mining techniques.

In the study of urban air quality, fuzzy lattice classifiers have been applied for estimating ambient ozone concentrations in an operational context, with very good results (Athanasiadis et al. [2003]). Uncertainty and other data quality issues such as measurement validation and estimation of missing values in the same field were addressed in Athanasiadis and Mitkas [2004]. A comparison between statistical and classification algorithms applied in air quality forecasting (Athanasiadis et al. [2005]) demonstrated that the potential of data mining techniques is high.

Classification has also found spatial applications. For example, fish distribution (Su et al. [2004]) and soil erosion patterns (Ellis [1996]) have both been modelled with classification methods, as was soil erosion in Smith and Spate [2005], and other soil properties in McKenzie and Ryan [1999], which also used regression trees and other techniques with a view to obtaining system information.

The use of rule learning for the environmental sciences is discussed in Riano [1998]. The example discussed in this paper is the state identification of a wastewater treatment plant. Rule learning was also used to investigate a streamflow/electrical conductivity system in Spate [2005]. In Dzeroski et al. [1997], the process of rule learning is illustrated with examples from water quality databases and the CN2 algorithm. Rodríguez-Roda et al. [2001] presented the induction of rules in order to acquire specific knowledge from (bio)chemical processes.

Data mining and machine learning are of course not restricted to the methods discussed here, and some less common techniques have been applied to environmental problems. In Robertson et al. [2003], Hidden Markov models were used to model rainfall patterns over Brazil with interesting results, and Mora-Lopez and Conejo [1998] applied qualitative reasoning to meteorological problems. Cloud screening for meteorological purposes was also investigated with Markov Random Fields in Cadez and Smyth [1999]. Sudden death of oak trees was modelled with support vector machines in Guo et al. [2005]. Generative topographic mapping was used to investigate riverine ecology in Vellido et al. [submitted 2005]. Genetic programming was used to model glider possum distributions (Whigham [2000]), and D'heygere et al. [2003] used genetic algorithms for attribute selection in benthic macroinvertebrate modelling. Decision trees were then built from the reduced dataset. Several inductive methods have been applied to discover knowledge of the behaviour of wastewater treatment plants, such as in Comas et al. [2001].

5 GOOD DATA MINING PRACTICE

As with all modelling paradigms, good practice modelling involves far more than applying a single algorithm or technique. Each of the steps detailed in Section 1 must be followed with due attention. In this section, we record a few notes and considerations that may be of use to those contemplating the use of data mining in an environmental area.

Data Cleaning Data cleaning is a fundamental aspect of the analysis of a dataset, and one which is often neglected. When working with real data, the process is often very time consuming, but it is essential for obtaining good quality results, and from there useful new knowledge. The quality of the results directly depends on the quality of the data, and in consequence, on the correct missing data treatment, outlier identification, etc. Data miners should become conscious of the importance of performing very careful and rigorous data cleaning, and allocate sufficient time to this activity accordingly.

Transformations Beginning with preprocessing, avoidance of unnecessary transformations is recommended, especially if the transformation decreases interpretability (for example Y = log(streamflow), even if the transformed Y is normal). If transformations are definitely required, some bias may be introduced into the results; thus, it is convenient to minimize the arbitrariness of the transformation as much as possible (in recoding Age, Adult may be defined from 18 to 65 or from 15 to 70), and this implies that the goals of the analysis must also be taken into account. For arithmetic transformations, imputation of missing data before the transformation is thought to be better, especially if several variables are combined into one new feature (a mean or ratio, for example).

The same caveat applies to interpolated or extrapolated data, for example to catchment-wide rainfall estimates obtained by thin plate spline interpolation. It is not uncommon for no mention at all to be made of the fact that the rainfall values used are not directly obtained, despite the possibility of bias and error being introduced by the interpolation process.

Input Data Uncertainty All data is subject to uncertainty, and environmental data such as rainfall or streamflow are often subject to uncertainties of ±10% or more. Tracking and reporting of uncertainties related to measurement and other sources of noise is another area that is sometimes not treated rigorously, despite the implications. Consider the following: there is a ±10% error in all input data. Therefore, the minimum theoretically achievable error of any model built on this data cannot be less than ±10%, and is likely to be much higher depending on the structure of the model. Models whose reported fit implies a smaller error than this are overfitted, and their performance measures do not reflect true predictive capacity.

Quantity of data is also a concern. In general, where there is more data, there is less uncertainty, or at least that uncertainty can be better quantified. Choice of data mining method should also be influenced by dataset size. Where datasets are small, choose simpler methods and be mindful of the maximum theoretical certainty that can be obtained. It is also important to remark that as the number of data increases, the variance of classical estimators tends to zero, which usually implies that very small sample differences may appear statistically significant. This phenomenon requires serious attention and great care must be exercised in the interpretation of some statistical results; the user must consider whether statistical significance properly reflects the nature of the data. In fact, serious revision of classical statistical inference is necessary to enable suitable use in the context of data mining.

Data Reduction by Principal Components and Similar Techniques Principal component analysis is only recommended when all original variables are numerical. For qualitative data, multiple correspondence analysis should be used in its place. See for example Lebart et al. [1984] or Dillon and Goldstein [1984] for methodological details. This process reduces the original variables to a set of fictitious ones (or factors). With principal component analysis, conceptual interpretation of a factor may not be clear, and if this is the case, there will be implications for the understandability of the final results. Numerous other techniques also exist for feature weighting and selection.

Clustering Most clustering methods generate a set of clusters even where no set of distinguishable groups really exists. This is why it is very important to carefully validate the correctness of the discovered clusters. Meaning and usefulness of discovered classes are one validation criterion, although this is largely subjective. A more quantitative approach is to perform multiple runs of the algorithm, or different algorithms, with slightly different parameters or initial values, which will give a good indication of the stability of the cluster scheme. Some software packages also contain tools to help assess such properties.

As a measure of cluster 'goodness', the ratio of average distance within to average distance between clusters may be useful where a numerical distance measure exists, although it is redundant if that criterion was used to build the clusters themselves (as is the case with Ward's method, Ward [1963]). Cluster validation where no reference partition exists (and in real-world applications none is present, or the clustering would be unnecessary) is an open problem, but some investigation into stability should be performed as a minimum treatment. See for example Gibert et al. [2005c]. Some methods based on the principles of cross validation can also be used to analyze how the representatives of the classes move from one iteration to another. See Spate [In preparation] for an example of this procedure.
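
The stability check described above can be sketched as follows, assuming scikit-learn: the same algorithm is run several times from different initialisations and the resulting partitions are compared with the adjusted Rand index; the data are synthetic.

```python
# Minimal sketch of cluster stability assessment via repeated runs and the
# adjusted Rand index, assuming numpy and scikit-learn. Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

partitions = [KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(X)
              for seed in range(5)]

scores = [adjusted_rand_score(partitions[0], p) for p in partitions[1:]]
print("agreement with first run (1.0 = identical partition):", np.round(scores, 2))
```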

Statistical Modelling and Regression Scalar real-valued performance criteria such as the determination coefficient (R², also known as efficiency), used together with residual plots (Moore and McCabe [1993]), constitute a very useful tool for validation of the model, far more powerful than numerical indicators by themselves. Outliers, influential values, nonlinearities and other anomalies can be investigated in this way. Note however that R² can be applied only to real numerical data.
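
The sketch below pairs R² with a residual plot for a deliberately misspecified linear model, using scikit-learn and matplotlib as illustrative choices on synthetic data; the curvature visible in the residuals reveals the nonlinearity that the single R² value hides.

```python
# Minimal sketch of R^2 plus a residual plot for model validation,
# assuming numpy, scikit-learn and matplotlib. Data are synthetic.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 150).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.3 * x.ravel() ** 2 + rng.normal(0, 1, 150)  # mildly nonlinear truth

model = LinearRegression().fit(x, y)
predicted = model.predict(x)
residuals = y - predicted

print("R^2:", round(model.score(x, y), 3))

plt.scatter(predicted, residuals, s=10)
plt.axhline(0, color="grey")
plt.xlabel("predicted")
plt.ylabel("residual")
plt.show()
```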

Classification When classifying real data, it is often useful to consider accuracy on a class-by-class basis. In this way, the modeller can keep track of where errors are occurring. These errors may be given unequal weighting, if the consequences are not equal. The most common device for this is the confusion matrix. If the problem contains only two classes (say true/false), the matrix is filled with the following entries:

Top Left ’true’ values correctly labelled ’true’

Top Right ’true’ values incorrectly labelled ’false’

Bottom Left 'false' values incorrectly labelled 'true'

Bottom Right 'false' values correctly labelled 'false'

The distribution of input data should also receive consideration, as many classification algorithms tend towards predicting the majority class. An in-depth discussion of this topic can be found in Weiss and Provost [2001]. Tree (and other classifier) stability can be assessed in the same ways as cluster stability (see above).
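
A minimal sketch of the two-class confusion matrix laid out as in the list above (rows are actual classes, columns are predicted classes) is given below, assuming scikit-learn; the labels are invented.

```python
# Minimal sketch of a two-class confusion matrix, assuming scikit-learn.
from sklearn.metrics import confusion_matrix

actual    = ["true", "true", "false", "false", "true", "false", "true", "false"]
predicted = ["true", "false", "false", "true", "true", "false", "true", "false"]

matrix = confusion_matrix(actual, predicted, labels=["true", "false"])
print(matrix)
# [[3 1]   3 'true' values correctly labelled, 1 'true' labelled 'false'
#  [1 3]]  1 'false' labelled 'true', 3 'false' values correctly labelled
```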

Uncertainty Quantification and Model Validation As mentioned in the note regarding input data above, proper consideration of uncertainty is essential for meaningful modelling. One must also give thought to how best to quantify and represent the performance of the final model. For some purposes, a single-valued measure such as R² may be sufficient provided that the model has been properly validated as unbiased, but for most applications more information is useful. It is seldom possible to represent model performance against all goals of the investigation with one number. As an example, a rainfall-runoff model may fit well for low and medium flows, but underestimate large peaks. It may also have a systematic tendency to slightly overpredict lower flows to compensate for missing extreme events. All of this cannot be expressed as a single number, but a comparison of distributions will reveal the necessary information.

Model validation is as important for automatically extracted models as it is for those constructed with more human interaction, or more so. To this end we recommend the usual best practice procedures such as holding back a portion of the dataset for independent validation (if the size of the database allows) and n-fold cross validation.

Parameter Selection and Model Fitting While parameter-free data mining algorithms do exist, most require some a priori set-up. Parameters for data mining algorithms are decided by the same methods as for more conventional models: expert knowledge, guessing, trial and error, and automated and manual experimentation. In addition, it is often helpful to learn a little about the role of the parameter within the algorithm, as appropriate values for the problem at hand can often be set or estimated this way. Some experimentation may improve the output model, and reporting the process of parameter fitting in detail adds credibility to any modelling project. It is important that parameter values are not chosen based on the final test data, otherwise optimistic performance estimates will be obtained.
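
The sketch below combines the two recommendations above: parameters are chosen by cross-validated grid search on the training portion only, and a held-back test set gives the final performance estimate. scikit-learn, the estimator and the parameter grid are illustrative choices on synthetic data.

```python
# Minimal sketch of parameter selection by cross-validated grid search with a
# held-out test set, assuming scikit-learn. Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
)
search.fit(X_train, y_train)            # parameters chosen on training data only

print("chosen parameters:", search.best_params_)
print("held-out test accuracy:", round(search.score(X_test, y_test), 3))
```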

6 CHALLENGES FOR DATA MINING IN THE ENVIRONMENTAL SCIENCES

Finally, in this section, we comment on the hot issues and challenges facing the interdisciplinary field of environmental data mining in coming years. Achievement of the following aims would increase the utility and applicability of data mining methods.

• Improvement of automated preprocessing techniques

• Elaboration of protocols to facilitate sharing and reuse of data

• Development of standard procedures (benchmarks) for experimental testing and validation of data mining tools

• Involvement of end-user (domain expert) criteria in algorithm design and result interpretation

• Development and implementation of mixed data mining methods, combining different techniques for better knowledge discovery

• Formulation of tools for explicit representation and handling of discovered knowledge for greater understandability

• Improvement of data mining techniques for on-line and heterogenous databases

• Guideline or recommendation development, to assist with method and algorithm selection.

Another factor that is often of great importance is the (conceptual) interpretability of output models. Cutting-edge knowledge acquisition techniques such as random forests have advantages over simpler methods, but are difficult to interpret and understand. Tools that clearly and usefully summarise extracted knowledge are of great value to environmental scientists, as are those that assist in the quantification of uncertainties.

6.1 Integrated Approaches

The main goal of many environmental system analyses is to support posterior decision making to improve either management or control of the system. Intelligent Environmental Decision Support Systems (IEDSSs) are among the most promising approaches in this field. IEDSSs are integrated models that provide domain information by means of analytical decision models, and give the decision maker access to databases and knowledge bases. They aim to reduce the time in which decisions can be made and to improve the repeatability and quality of those decisions by offering criteria for the evaluation of alternatives or for justifying decisions (Poch et al. [2004a], Poch et al. [2004b]). Often, multiple scenarios are modelled and evaluated according to environmental, social, and economic criteria.

There are six primary approaches to the problem of building an integrated model: expert systems, agent-based modelling, system dynamics, Bayesian networks, coupled complex models, and meta-modelling. Of these, the last three are most relevant to the field of data mining. Opportunities exist for automation of Bayesian network and meta-model construction and parametrisation, simplification and summarisation of complex submodels, and also interpretation of results. Data mining techniques are important tools for the knowledge acquisition phase of integrated model building, and because integrated models are very high in complexity, results are often correspondingly difficult to interpret, so the decision maker may benefit from a postprocessing data mining step. Of course, data mined models may also form part of the integrated model, as in Ekasingh et al. [2005].


7 SOFTWARE - EXISTING AND UNDER DEVELOPMENT

In this section, two software packages are discussed in detail. One of these, Weka, is well established in the data mining community, and the other, GESCONDA, is currently under development. Weka is a general purpose package, and GESCONDA is designed specifically for environmental science. Both contain a wide variety of tools and techniques.

7.1 GESCONDA

GESCONDA (Gibert et al. [2004], Sanchez-Marre et al. [2004]) is the name given to an Intelligent Data Analysis System developed with the aim of facilitating Knowledge Discovery (KD), and especially oriented towards environmental databases. On the basis of previous experience, it was designed with a four-level architecture connecting the user with the environmental system or process. These four levels are the following:

• Data Filtering
– data cleaning
– missing data management
– outlier analysis and treatment
– statistical univariate analysis
– statistical bivariate analysis
– visualization tools
– attribute or variable transformation facility

• Recommendation and Meta-Knowledge Management
– overall goal definition
– method suggestion
– parameter setting
– integration of attribute and variable metadata
– domain theory and domain knowledge elicitation

• Knowledge Discovery
– clustering (by machine learning and statistical means)
– decision tree induction
– classification rule induction
– case-based reasoning
– support vector machines
– statistical modelling
– dynamic systems analysis

• Knowledge Management
– validation of the results
– integration of different knowledge patterns for a predictive task, planning, or system supervision, together with mixed AI and statistical techniques
– consideration of knowledge use by end-users

Figure 3: GESCONDA two-dimensional cluster plot

Central characteristics of GESCONDA are the integration of statistical and AI methods into a single tool, together with mixed techniques, for extracting knowledge contained in data, as well as tools for qualitative analysis of complex relationships along the time axis (Sanchez-Marre et al. [2004]). All techniques implemented in GESCONDA can share information among themselves to best co-operate in extracting knowledge. It also includes capability for explicit management of the results produced by the different methods.

Figure 3 is a two-dimensional GESCONDA visualisation of a multidimensional clustering scheme, Figure 4 is a screen capture from the clustering GUI, and Figure 5 shows a rule induction screen. Portability of the software between platforms is provided by its common Java implementation. The GESCONDA design document can be viewed at http://www.eu-lat.org/eenviron/Marre.pdf.


Figure 4: GESCONDA clustering interface

Figure 5: GESCONDA rule induction interface

7.2 Weka

The Weka workbench (Witten and Frank [2005]) contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality. A command line interface is also included, for mass processing. It was originally designed as a tool for analyzing data from agricultural domains but is now used in many different application areas, largely for educational purposes and research. The main strengths of Weka are that it is (a) freely available under the GNU General Public License, (b) very portable because it is fully implemented in the Java programming language and thus runs on almost any computing platform, (c) contains a comprehensive collection of data preprocessing and modelling techniques, and (d) is easy to use by a novice due to the graphical user interfaces it contains.

Weka supports several standard data mining tasks. More specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection are included. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data object is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query. It is not capable of multi-relational data mining, but there is a separate piece of software for converting a collection of linked database tables into a single table that is suitable for processing using Weka (Reutemann et al. [2004]). Another important area that is currently not covered by the algorithms included in the Weka distribution is time series modelling.

Weka's main user interface is the Explorer, shown in Figure 6, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface, shown in Figure 7, and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of datasets rather than a single one.

Figure 6: The Weka Explorer user interface

Figure 7: The Weka Knowledge Flow user interface

The Explorer interface has several panels that give access to the main components of the workbench. The Preprocess panel has facilities for importing data from a database, a CSV file, or another file format, and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g. turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria. The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC curves, etc., or the model itself (if the model is amenable to visualization as, for example, decision trees are). The Associate panel provides access to association rule learners, which attempt to identify all important interrelationships between attributes in the data. The Cluster panel gives access to the clustering techniques in Weka, for example the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions (see Section 3.4). The next panel, Select attributes, houses algorithms for identifying the attributes in a dataset with most predictive capacity. The last panel, Visualize, shows a scatter plot matrix, where individual scatter plots can be selected and enlarged, and analyzed further using various selection operators.

7.3 Other

Other proprietary data mining packages exist. SAS's Enterprise Miner GUI was designed for business users, includes decision trees and neural nets within the wider SAS statistical framework, and provides facilities for direct connection to data warehouses. IBM has released Intelligent Miner. Clementine from SPSS includes facilities for neural networks, rule induction, and data visualization in tables, histograms, plots and webs. Salford Systems' CART package builds classification and regression trees. Data mining libraries for general computation and statistics environments such as Matlab and R have also been built, covering a range of techniques. These, together with Weka and GESCONDA, are some of the data mining software options available.

8 AIMS OF iEMSs

iEMSs is the International Environmental Modelling and Software Society (www.iemss.org), founded in 2000 by interested scientists with the following aims:

• developing and using environmental modelling and software tools to advance the science and improve decision making with respect to resource and environmental issues;

• promoting contacts and interdisciplinary activities among physical, social and natural scientists, economists and software developers from different countries;

• improving the cooperation between scientists and decision makers/advisors on environmental matters;

• exchanging relevant information among scientific and educational organizations and private enterprises, as well as non-governmental organizations and governmental bodies.

To achieve these aims, the iEMSs:

• organizes international conferences, meetings and educational courses in environmental modelling and software;

• publishes scientific studies and popular scientific materials in the Environmental Modelling and Software journal (Elsevier);

• hosts a website (www.iemss.org) which allows members to communicate research and other information relevant to the Society's aims with one another and the broader community;

• delivers a regular newsletter to members.

This paper proposes that data mining techniques are valuable tools that could be used to good effect in the environmental and natural resource science field, and are thus of interest to iEMSs and its members. We aim to introduce the main concepts of data mining and foster discussion of the ways in which it could be used and encouraged within and outside the iEMSs organisation.

ACKNOWLEDGMENTS

The project TIN2004-01368 has partially financed the development of GESCONDA. Portions of Section 4 also appear in Spate and Jakeman [Under review 2006]. The aims and activities of iEMSs are taken with minimal alteration from the iEMSs website http://www.iemss.org/.

REFERENCES


Aha, D. Feature weighting for lazy learning algorithms. In Liu, H. and Motoda, H., editors, Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer, 1998.

Allison, P. Missing Data. Sage, Thousand Oaks, CA, USA, 2002.

Almasri, M. and J. Kaluarachchi. Modular neural networks to predict the nitrate distribution in ground water using the on-ground nitrogen loading and recharge data. Environmental Modelling and Software, 20(7):851–871, July 2005.

Andrienko, G. and A. Andrienko. Research on visual analysis of spatio-temporal data at Fraunhofer AIS: an overview of history and functionality of CommonGIS. In Proceedings of the Knowledge-Based Services for the Public Services Symposium, Workshop III: Knowledge Discovery for Environmental Management, pages 26–31. KDnet, June 2004.

Antunes, C. and A. Oliveira. Temporal data mining: An overview. In KDD 2001: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001. Workshop.

Athanasiadis, I., V. Kaburlasos, P. Mitkas, and V. Petridis. Applying machine learning techniques on air quality for real-time decision support. In Information Technologies in Environmental Engineering, June 2003.

Athanasiadis, I., K. Karatzas, and P. Mitkas. Contemporary air quality forecasting methods: A comparative analysis between statistical methods and classification algorithms. In Proceedings of the 5th International Conference on Urban Air Quality, March 2005.

Athanasiadis, I. and P. Mitkas. Supporting the decision-making process in environmental monitoring systems with knowledge discovery techniques. In Proceedings of the Knowledge-Based Services for the Public Services Symposium, Workshop III: Knowledge Discovery for Environmental Management, pages 1–12. KDnet, June 2004.

Babovic, V. Data mining in hydrology. Hydrological Processes, 19:1511–1515, 2005.

Barnett, V. and T. Lewis. Outliers in Statistical Data. Wiley, 1978.

Belanche, L., J. Valdes, J. Comas, I. Rodríguez-Roda, and M. Poch. Towards a model of input-output behaviour of wastewater treatment plants using soft computing techniques. Environmental Modelling and Software, 5(14):409–419, 2001.

Cadez, I. and P. Smyth. Modelling of inhomogeneous Markov random fields with applications to cloud screening. Technical Report UCI-ICS 98-21, 1999.

Camargo, S., A. Robertson, S. Gaffney, and P. Smyth. Cluster analysis of western North Pacific tropical cyclone tracks. In Proceedings of the 26th Conference on Hurricanes and Tropical Meteorology, pages 250–251, May 2004.

Campbell, A., B. Pham, and Y.-C. Tian. Mining ecological data with cellular automata. In Proceedings of the International Conference on Cellular Automata for Research and Industry (ACRI 2004), pages 474–483. Springer, October 2004.

Cendrowska, J. Prism: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4):349–370, 1998.

Chen, J., H. He, G. Williams, and H. Jin. Temporal sequence associations for rare events. In Dai, H., Srikant, R., and Zhang, C., editors, Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings, volume 3056 of Lecture Notes in Computer Science, pages 239–239. Springer, 2004.

Cohen, W. Fast effective rule induction. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123. Morgan Kaufmann, 1995.

Comas, J., S. Dzeroski, and K. Gibert. Knowledge discovery by means of inductive methods in wastewater treatment plant data. AI Communications, 14(1):45–62, 2001.

Cortes, U., I. Rodríguez-Roda, M. Sanchez-Marre, J. Comas, C. Cortes, and M. Poch. DAI-DEPUR: An environmental decision support system for supervision of municipal waste water treatment plants. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002), pages 603–607, 2002.

Cunningham, S. and G. Holmes. Developing innovative applications in agriculture using data mining. In Proceedings of the Southeast Asia Regional Computer Confederation Conference, 1999.

D'heygere, T., P. Goethals, and N. De Pauw. Use of genetic algorithms to select input variables in decision tree models for the prediction of benthic macroinvertebrates. Ecological Modelling, 160(3):291–300, 2003.

Dillon, W. and M. Goldstein. Multivariate Analysis. Wiley, USA, 1984.

Draper, N. and H. Smith. Applied Regression Analysis. Wiley, 1998.

Dubes, R. and A. Jain. Algorithms for Clustering Data. Prentice Hall, 1988.

Dzeroski, S. and D. Drumm. Using regression trees to identify the habitat preference of the sea cucumber (Holothuria leucospilota) on Rarotonga, Cook Islands. Ecological Modelling, 170(2–3):219–226, 2003.

Dzeroski, S., J. Grbovic, W. Walley, and B. Kompare. Using machine learning techniques in the construction of models. II. Data analysis with rule induction. Ecological Modelling, 95(1):95–111, 1997.

Ekasingh, B., K. Ngamsomsuke, R. Letcher, and J. Spate. A data mining approach to simulating land use decisions: Modelling farmer's crop choice from farm level data for integrated water resource management. In Singh, V. and Yadava, R., editors, Advances in Hydrology: Proceedings of the International Conference on Water and Environment, pages 175–188, 2003.

Ekasingh, B., K. Ngamsomsuke, R. Letcher, and J. Spate. A data mining approach to simulating land use decisions: Modelling farmer's crop choice from farm level data for integrated water resource management. Journal of Environmental Management, 2005.

Ellis, F. The application of machine learning techniques to erosion modelling. In Proceedings of the Third International Conference on Integrating GIS and Environmental Modelling. National Center for Geographic Information and Analysis, January 1996.

Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. American Association for Artificial Intelligence, 1996a.

Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases (a survey). AI Magazine, 3(17):37–54, 1996b.

Frank, E., L. Trigg, G. Holmes, and I. Witten. Naive Bayes for regression. Machine Learning, 1(41):5–26, October 2000.

Gatts, C., A. Ovalle, and C. Silva. Neural pattern recognition and multivariate data: Water typology of the Paraiba do Sul River, Brazil. Environmental Modelling and Software, 20(7):883–889, July 2005.

Gibbs, M., N. Morgan, H. Maier, G. Dandy, and J. Nixon. Use of artificial neural networks for modelling chlorine residuals in water distribution systems. In MODSIM 2003: Proceedings of the 2003 International Congress on Modelling and Simulation, pages 789–794, 2003.

Gibert, K., R. Annicchiarico, U. Cortes, and C. Caltagirone. Knowledge Discovery on Functional Disabilities: Clustering Based on Rules Versus Other Approaches. IOS Press, 2005a.

Gibert, K., X. Flores, I. Rodríguez-Roda, and M. Sanchez-Marre. Knowledge discovery in environmental data bases using GESCONDA. In Proceedings of iEMSs 2004: International Environmental Modelling and Software Society Conference, Osnabrück, Germany, 2004.

Gibert, K., R. Nonell, V. JM, and C. MM. Knowledge discovery with clustering: Impact of metrics and reporting phase by using KLASS. Neural Network World, pages 319–326, 2005b.

Gibert, K., M. Sanchez-Marre, and X. Flores. Cluster discovery in environmental databases using GESCONDA: The added value of comparisons. AI Communications, 4(18):319–331, 2005c.

Gibert, K. and Z. Sonicki. Clustering based on rules and medical research. Journal on Applied Stochastic Models in Business and Industry, formerly JASMDA, 15(4):319–324, Oct–Dec 1999.

Guariso, G. and H. Werthner. Environmental Decision Support Systems. Ellis Horwood-Wiley, New York, 1989.

Guo, Q., M. Kelly, and C. Graham. Support vector machines for predicting distribution of sudden oak death in California. Ecological Modelling, 182(1):75–90, 2005.

Hall, M. Feature selection for discrete and numeric class machine learning. Technical report, Department of Computer Science, University of Waikato, April 1999. Working Paper 99/4.

Han, J. and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.

Holmes, G., S. Cunningham, B. Dela Rue, and A. Bollen. Predicting apple bruising using machine learning. In Proceedings of the Model-IT Conference, Acta Horticulturae, number 476, pages 289–296, 1998.

Keogh, E. and S. Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 102–111. ACM Press, 2002.

Kolodner, J. Case-Based Reasoning. Morgan Kaufmann, 1993.

Kralisch, S., M. Fink, W.-A. Flugel, and C. Beckstein. Using neural network techniques to optimize agricultural land management for minimisation of nitrogen loading. In MODSIM 2001: Proceedings of the 2001 International Congress on Modelling and Simulation, pages 203–208, 2001.

Larose, D. Discovering Knowledge in Data: An Introduction to Data Mining. John Wiley, 2004.

Lebart, L., A. Morineau, and K. Warwick. Multivariate Descriptive Statistical Analysis. Wiley, New York, USA, 1984.

Little, R. and D. Rubin. Statistical Analysis with Missing Data. Wiley, 1987.

Mas, J., H. Puig, J. Palacio, and A. Sosa-Lopez. Modelling deforestation using GIS and artificial neural networks. Environmental Modelling and Software, 19(5):461–471, May 2004.

McKenzie, N. and P. Ryan. Spatial prediction of soil properties using environmental correlation. Geoderma, (89):67–94, 1999.

Michalski, R. and R. Chilausky. Learning by being told and learning by examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2):125–161, 1980.

Michalski, R. and R. Stepp. Learning from observation: Conceptual clustering. In Michalski, R., Carbonell, J., and Mitchell, T., editors, Machine Learning, an Artificial Intelligence Approach, pages 331–363. Morgan Kaufmann, 1983.

Molina, L., L. Belanche, and A. Nebot. Feature selection algorithms: A survey and experimental evaluation. In ICDM 2002: Proceedings of the IEEE International Conference on Data Mining, pages 306–313, 2002.

Moore, D. and G. McCabe. Introduction to the Practice of Statistics. WH Freeman, New York, second edition, 1993.

Mora-Lopez, L. and R. Conejo. Qualitative reasoning model for the prediction of climatic data. In ECAI 1998: Proceedings of the 13th European Conference on Artificial Intelligence, pages 61–75, 1998.

Nunez, H., M. Sanchez-Marre, and U. Cortes. Improving similarity assessment with entropy-based local weighting. In Lecture Notes in Artificial Intelligence (LNAI-2689): Proceedings of the 5th International Conference on Case-Based Reasoning (ICCBR 2003), pages 377–391. Springer-Verlag, June 2003.

Nunez, H., M. Sanchez-Marre, U. Cortes, J. Comas, M. Martinez, I. Rodríguez-Roda, and M. Poch. A comparative study on the use of similarity measures in case-based reasoning to improve the classification of environmental system situations. Environmental Modelling and Software, 19(9):809–819, 2004.

University of Waikato. Weka machine learning project, 2005. Accessed 12/08/2005.

Olsson, G. Instrumentation, control and automation in the water industry: State-of-the-art and new challenges. In Proceedings of the 2nd IWA Conference on Instrumentation, Control and Automation (ICA 2005), volume 1, pages 19–31, Busan, Korea, May 29–June 2 2005.

Parr Rud, O. Data Mining Cookbook: Modelling Data for Marketing, Risk, and CRM. Wiley, 2001.

Poch, M., J. Comas, I. Rodríguez-Roda, M. Sanchez-Marre, and U. Cortes. Designing and building real environmental decision support systems. Environmental Modelling and Software, (19):857–873, 2004a.

Poch, M., J. Comas, I. Rodríguez-Roda, M. Sanchez-Marre, and U. Cortes. Ten years of experience in designing and building real environmental decision support systems. What have we learnt? Environmental Modelling and Software, 9(19):857–873, 2004b.

Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Recknagel, F. Applications of machine learning to ecological modelling. Ecological Modelling, 146(1–3):303–310, 2001.

Reutemann, P., B. Pfahringer, and E. Frank. A toolbox for learning from relational data with propositional and multi-instance learners. In Proceedings of the Australian Joint Conference on Artificial Intelligence, pages 1017–1023. Springer-Verlag, 2004.

Riano, D. Learning rules within the framework of environmental sciences. In ECAI 1998: Proceedings of the 13th European Conference on Artificial Intelligence, pages 151–165, 1998.

Robertson, A., S. Kirshner, and P. Smyth. Hidden Markov models for modelling daily rainfall occurrence over Brazil. Technical Report UCI-ICS 03-27, 2003.

Rodríguez-Roda, I., J. Comas, J. Colprim, M. Poch, M. Sanchez-Marre, U. Cortes, J. Baeza, and J. Lafuente. A hybrid supervisory system to support wastewater treatment plant operation: Implementation and validation. Water Science and Technology, 4–5(45):289–297, 2002.

Rodríguez-Roda, I., J. Comas, M. Poch, M. Sanchez-Marre, and U. Cortes. Automatic knowledge acquisition from complex processes for the development of knowledge based systems. Industrial and Engineering Chemistry Research, 15(40):3353–3360, 2001.

Rodríguez-Roda, I., M. Poch, M. Sanchez-Marre, U. Cortes, and J. Lafuente. Consider a case-based system for control of complex processes. Chemical Engineering Progress, 6(95):39–48, 1999.

Rubin, D. Multiple Imputation for Nonresponse in Surveys. Wiley, 1987.

Sanborn, S. and B. Bledsoe. Predicting streamflow regime metrics for ungauged streams in Colorado, Washington, and Oregon. Journal of Hydrology, 2005. Under review.

Shearer, C. The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4):13–22, 2000.

Siebes, A. Data mining: What it is and how it is done. In SEBD, page 329, 1996.

Smith, C. and J. Spate. Gully erosion in the Ben Chifley catchment. 2005. In preparation.

Sanchez-Marre, M., U. Cortes, J. Bejar, J. de Gracia, J. Lafuente, and M. Poch. Concept formation in wastewater treatment plants by means of classification techniques: an applied study. Applied Intelligence, 147(7), 1997.

Sanchez-Marre, M., U. Cortes, M. Martinez, J. Comas, and I. Rodríguez-Roda. An approach for temporal case-based reasoning: Episode-based reasoning. In Lecture Notes in Artificial Intelligence (LNAI-3620): Proceedings of the 6th International Conference on Case-Based Reasoning (ICCBR 2005), pages 465–476. Springer-Verlag, August 2005.

Sanchez-Marre, M., U. Cortes, I. Rodríguez-Roda, and M. Poch. Using meta-cases to improve accuracy in hierarchical case retrieval. Computacion y Sistemas, 1(4):53–63, 2000.

Sanchez-Marre, M., K. Gibert, and I. Rodríguez-Roda. GESCONDA: A Tool for Knowledge Discovery and Data Mining in Environmental Databases, volume 11 of Research on Computing Science, pages 348–364. Centro de Investigacion en Computacion, Instituto Politecnico Nacional, Mexico DF, Mexico, 2004.

Sokal, R. and P. Sneath. Principles of Numerical Taxonomy. Freeman, San Francisco, 1963.

Spate, J. Data in hydrology: Existing uses and new approaches. Honours thesis, Australian National University, 2002.

Spate, J. Modelling the relationship between streamflow and electrical conductivity in Hollin Creek, southeastern Australia. In Fazel Famili, A., Kok, J., and Pena, J., editors, Proceedings of the 6th International Symposium on Intelligent Data Analysis, pages 419–440, August 2005.

Spate, J. Machine learning as a tool for investigating environmental systems. PhD thesis, in preparation.

Spate, J., B. Croke, and A. Jakeman. Data mining in hydrology. In MODSIM 2003: Proceedings of the 2003 International Congress on Modelling and Simulation, pages 422–427, 2003.

Spate, J. and A. Jakeman. Review of data mining techniques and their application to environmental problems. Environmental Modelling and Software, under review, 2006.

Su, F., C. Zhou, V. Lyne, Y. Du, and W. Shi. A data-mining approach to determine the spatio-temporal relationship between environmental factors and fish distribution. Ecological Modelling, 174(4):421–431, June 2004.

Swayne, D., D. Cook, and A. Buja. XGobi: Interactive dynamic data visualization in the X Window System. Journal of Computational and Graphical Statistics, 7(1), 1998.

Sweeney, A., N. Beebe, and R. Cooper. Analysis of environmental factors influencing the range of anopheline mosquitoes in northern Australia using a genetic algorithm and data mining methods. Ecological Modelling, 2004. Submitted.

Syu, M. and B. Chen. Back-propagation neural network adaptive control of a continuous wastewater treatment process. Industrial and Engineering Chemistry Research, 9(37), 1998.

Ter Braak, C., H. Hoijtink, W. Akkermans, and P. Verdonschot. Bayesian model-based cluster analysis for predicting macrofaunal communities. Ecological Modelling, 160(3):235–248, February 2003.

Vellido, A., J. Marti, I. Comas, I. Rodríguez-Roda, and F. Sabater. Exploring the ecological status of human altered streams through generative topographic mapping. Environmental Modelling and Software, submitted 2005.

Voss, H., M. Wachowicz, S. Dzeroski, and A. Lanza, editors. Knowledge Discovery for Environmental Management, Knowledge-Based Services for the Public Sector Conference. KDnet, June 2004.

Wang, X., B. Chen, S. Yang, C. McGreavy, and M. Lu. Fuzzy rule generation from data for process operational decision support. Computers and Chemical Engineering, 21:S661–S666, 1997.

Ward, J. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.

Weiss, G. and F. Provost. The effect of class distribution on classifier learning: An empirical study. Technical Report ML-TR-44, Department of Computer Science, Rutgers University, August 2001.

Witten, I. and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

Witten, I. and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, second edition, 2005.

Wnek, J. and R. Michalski. Hypothesis-driven constructive induction in AQ17: A method and experiments. In Proceedings of the IJCAI-91 Workshop on Evaluating and Changing Representation in Machine Learning, pages 13–22, 1991.

Yeates, S. and K. Thomson. Applications of machine learning on two agricultural datasets. In Proceedings of the New Zealand Conference of Postgraduate Students in Engineering and Technology, pages 495–496, 1996.

Zhu, X. and A. Simpson. Expert system for water treatment plant operation. Journal of Environmental Engineering, pages 822–829, 1996.

Zoppou, C., O. Neilsen, and L. Zhang. Regionalization of daily stream flow in Australia using wavelets and k-means. Technical report, Australian National University, 2002. Accessed 15/10/2002.