Forecasting of cyanobacterial density in Torrão reservoir using artificial neural networks


Journal of Environmental Monitoring

Cite this: J. Environ. Monit., 2011, 13, 1761

www.rsc.org/jem PAPER

Published on 06 May 2011.


Rita Torres,a Elisa Pereira,b Vítor Vasconcelos*ab and Luís Oliva Teles ab

Received 7th February 2011, Accepted 8th April 2011

DOI: 10.1039/c1em10127g

The ability of general regression neural networks (GRNN) to forecast the density of cyanobacteria in the Torrão reservoir (Tâmega river, Portugal) over a period of 15 days, based on three years of collected physical and chemical data, was assessed. Several models were developed and 176 were selected based on their correlation values for the verification series. A time lag of 11 was used, equivalent to one sample (periods of 15 days in the summer and 30 days in the winter). Several combinations of the series were used. Input and output data collected from three depths of the reservoir were applied (surface, euphotic zone limit and bottom). The model with the highest average correlation value had correlations of 0.991, 0.843 and 0.978 for the training, verification and test series, respectively. This model had the three series independent in time: first the test series, then the verification series and, finally, the training series. Only six input variables were considered significant to the performance of this model: ammonia, phosphates, dissolved oxygen, water temperature, pH and water evaporation, with the physical and chemical parameters referring to the three depths of the reservoir. These variables are common to the next four best models produced and, although these models included other input variables, their performance was not better than that of the selected best model.

Introduction

Eutrophication, although a natural process, can be accelerated by human activities through an increased load of nutrients and organic substances. These substances can cause the excessive growth of algae and cyanobacteria, which may interfere with the uses of the water and with the health and diversity of autochthonous organisms (EPA 2003). Most of these negative effects might be prevented or minimized if phytoplankton blooms are predicted at an early stage.1

Over the last decade, there has been a growing interest in using artificial neural networks (ANNs) for modelling ecosystems. This is mainly because, unlike other ecological models based on linear regression, ANNs are able to map the non-linear relationships between characteristic variables of the ecosystems.2 The great advantage of ANNs is their ability to work with noisy or incomplete input data and their capability of learning and generalizing from experience. They are often good at solving problems that are too complex for conventional technologies. Since the use of ANNs to model cyanobacteria blooms is recent, most of the ANN models known at present do not have the ability to predict future values; most of the ANNs developed only predict and do not forecast results.2–6

This work studies the ability of a specific type of ANN, the general regression neural network (GRNN), to forecast the density of cyanobacteria in a temperate reservoir (Torrão, Tâmega river, Portugal) based on three years of monitoring data. This will be a valuable tool for the management of this type of ecosystem and will help environmental authorities to better manage eutrophication and its consequences.

a Departamento de Biologia - Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4069-007 Porto, Portugal. E-mail: [email protected]; Fax: +351 223380609; Tel: +315 223401814

b CIIMAR/CIMAR - Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas 289, 4050-123 Porto, Portugal

Environmental impact

Artificial Neural Networks may be used to forecast the occurrence of cyanobacteria blooms in reservoirs. We report the use of this technique to predict the occurrence of blooms several weeks ahead. This might enable water managers to adapt monitoring strategies that will prevent human health risks. This might also be used to apply mitigation measures in order to decrease the potential hazardous effects of blooms.

This journal is © The Royal Society of Chemistry 2011. J. Environ. Monit., 2011, 13, 1761–1767


Materials and methods

Study site

The Torrão reservoir was created by the first hydroelectric dam constructed on the Tâmega River, the largest tributary of the Douro River (North Portugal). The dam was completed in 1988 and is located about 30 km from Porto. The reservoir formed by this dam has submerged 31 km of the river basin and has 77 hm³ of maximum volume.7 It is used to produce energy (about 233 million kilowatts per hour), for recreational activities (swimming, boating and fishing) and as a source of drinking water for the inhabitants of Amarante and Marco de Canavezes.

The phytoplankton community in the Torrão reservoir includes toxin-producing cyanobacteria, mainly Microcystis aeruginosa and Aphanizomenon flos-aquae.8 The occurrence of toxic cyanobacteria in the reservoir may be harmful to the population that uses its water and to wild and domestic animals, due to the microcystins produced.

Collected data

The monitoring data used in this study were collected between September 1999 and December 2002 (Table 1). Sampling was carried out monthly, except between June and October, when increased abundance of cyanobacteria was detected. During this period, sampling was biweekly (every 15 days).

Artificial neural networks do not require transformation of the input data, because the probability distribution of the data does not affect the model input.5 However, in order to reduce the scale range, the phytoplankton data underwent a log2 transformation. This transformation was found to give better results than the model produced using untransformed data.6
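The effect of the log2 transformation on the scale range can be illustrated with a minimal Python sketch. The cell densities below are hypothetical values chosen for illustration, not data from the study, and the sketch assumes strictly positive densities; how zero counts were handled is not stated in the text.

```python
import math

# Hypothetical cell densities (cells per mL); NOT data from the study.
densities = [250.0, 4_000.0, 1_024_000.0]


def log2_transform(values):
    """Return the base-2 logarithm of each strictly positive value."""
    return [math.log2(v) for v in values]


def inverse_log2(values):
    """Map log2-scale values back to the original scale."""
    return [2.0 ** v for v in values]


transformed = log2_transform(densities)

# On the raw scale the largest value is 4096 times the smallest;
# on the log2 scale the same spread is a difference of 12 units.
raw_span = max(densities) / min(densities)
log_span = max(transformed) - min(transformed)
restored = inverse_log2(transformed)
```

A forecast produced on the log2 scale would be mapped back to cells per mL with `inverse_log2`.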

Models development

The software used in this study was Statistica Neural Networks, Version 4.0 F. A general regression neural network (GRNN) was used in the development of the forecasting models. The input patterns were formed by thirty distinct parameters, all physical and chemical variables. Phytoplankton and other associated variables (chlorophyll and phaeophytin) were excluded. In a study performed at Crestuma Reservoir, Douro River, Portugal,6 the inclusion of data directly correlated with the output variable did not lead to a better performance of the predictive models, while physical and chemical information about the environment provided the best results in the prediction of the cyanobacterial density.

Table 1 Collected data used in the development of the models, their notation and units

Parameter                 Notation   Units
Lunar day length          moon       Days
Nitrite                   NO2        mg L⁻¹
Nitrate                   NO3        mg L⁻¹
Ammonia                   NH3        mg L⁻¹
Phosphate                 PO4        mg L⁻¹
Dissolved oxygen          DO         mg L⁻¹
Water temperature         wT         °C
pH                        pH         Sørensen scale
Electrical conductivity   cond       mS cm⁻¹
Oxygen stratification     strat      Present: 1, not present: 0
Precipitation             PCP        mm day⁻¹
Water evaporation         EVP        mm day⁻¹
Solar radiation           RAD        kJ m⁻²
Discharge                 disch      m³ s⁻¹
Cyanobacteria density     cyan       cells mL⁻¹

The thirty variables were divided into three groups: variables provided from samples collected from the surface, from the euphotic zone limit and close to the bottom of the reservoir (concentration of nitrites, nitrates, ammonia and phosphates, dissolved oxygen, water temperature, pH and water conductivity, that is, eight variables at each of the three depths), meteorological variables (precipitation, water evaporation, solar radiation and reservoir discharge) and general variables (lunar day length and oxygen stratification).

The outputs of the GRNN provide an estimate of the cyanobacterial density in the Torrão reservoir for each of the three depths (surface, euphotic zone limit, close to bottom) and the total concentration of cyanobacteria in the reservoir. Since these organisms can migrate vertically in the water column,9 it was considered important to include information about the density of cyanobacteria at all the depths of the reservoir.

It is known that models with an excess of input data may produce overfitting,10 so to avoid this problem, some variables were excluded: percentage of oxygen saturation (because dissolved oxygen was included), and minimum, maximum and average air temperature (due to the inclusion of water temperature).

Meteorological data (solar radiation, precipitation and water evaporation) and reservoir discharge were measured daily throughout the sampling period. To avoid using all this detailed information, these data were summarized as averages over seven-day periods. This period was chosen because the sampling periodicity of the other variables was biweekly during the time intervals of higher cyanobacteria concentrations.
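The summarizing of daily records into seven-day averages can be sketched as follows. This is an illustration only: the text does not say whether the averages were taken over consecutive non-overlapping weeks or over the seven days preceding each sampling, so the non-overlapping blocks used here are an assumption, and the precipitation values are hypothetical.

```python
from statistics import mean


def seven_day_averages(daily_values, window=7):
    """Average consecutive, non-overlapping blocks of `window` daily values.

    A trailing block with fewer than `window` values is dropped.
    """
    n_blocks = len(daily_values) // window
    return [
        mean(daily_values[i * window:(i + 1) * window])
        for i in range(n_blocks)
    ]


# Hypothetical daily precipitation (mm per day) for 15 days; NOT study data.
daily_precipitation = [0, 0, 7, 14, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 5]

# Two complete weeks are averaged; the fifteenth day is left out.
weekly = seven_day_averages(daily_precipitation)
```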

For each model, input data were divided into three series: training series, verification series and test series. The distribution of the three series included eleven possibilities (Table 2). The collected data available referred to a period of three years. Dividing this period among the three series, each series would include information about only one year, which, ecologically, is a very short period of time to train a neural network. To avoid this problem, it was decided to alternate the three series along the three years of available data, so that all the series covered the longest time interval possible (network type 1). However, by doing this, none of the series was time independent from the others, which might produce inconsistent results (such as overtraining problems). To avoid this possibility, it was decided to isolate the test series, placing it after the training and verification series, which were still alternated (network type 2). It was also decided to separate the three series in time and alter their sequential position (network types 3 to 8), in order to find the best combination.

During the three years of data collection, a yearly peak of cyanobacteria density was found in the reservoir. For this reason, it was also considered important that the test series included a time interval where at least one maximum of cyanobacteria density occurred, so three other combinations were considered (network types 9 to 11). In these combinations, the test series includes one cyanobacteria peak and the training and verification series alternate during the remaining period of time. All these series combinations were applied for each of the depths of the input data (surface, euphotic zone limit, close to bottom and the three depths simultaneously).

Table 2 Network types used for the development of the models. Each network type used a different combination of the three series (training, verification and test series) throughout the available time period (tr: training series; ve: verification series; te: test series)

Network type   Series combination
1              Three series successively alternated, that is, overlapped throughout the whole time period. Sequence: tr-ve-te-tr-ve-te-...
2              Training and verification series alternated and overlapped, and test series separated at the end of the time interval. Sequence: tr-ve-tr-ve-...-te
3              Three series separated in time (first training series, then verification series and finally test series). Sequence: tr-ve-te
4              Three series separated in time (first training series, then test series and finally verification series). Sequence: tr-te-ve
5              Three series separated in time (first verification series, then training series and finally test series). Sequence: ve-tr-te
6              Three series separated in time (first verification series, then test series and finally training series). Sequence: ve-te-tr
7              Three series separated in time (first test series, then training series and finally verification series). Sequence: te-tr-ve
8              Three series separated in time (first test series, then verification series and finally training series). Sequence: te-ve-tr
9              Test series including the first peak of cyanobacteria density during the time interval, and training and verification series alternated. Sequence: tr-ve-tr-ve-te (peak 1)-tr-ve-tr-ve...
10             Test series including the second peak of cyanobacteria density during the time interval, and training and verification series alternated. Sequence: tr-ve-tr-ve-te (peak 2)-tr-ve-tr-ve...
11             Test series including the third peak of cyanobacteria density during the time interval, and training and verification series alternated. Sequence: tr-ve-tr-ve-te (peak 3)-tr-ve-tr-ve...
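The series combinations for network types 1 to 8 can be sketched in Python as a labelling of the samples in time order. This is a hypothetical illustration, not the software's procedure: the function names are invented here, and the share of samples given to each series (equal thirds, or one third for the isolated test series in type 2) is an assumption, since the text does not report the proportions. Types 9 to 11 are omitted because they depend on the dates of the density peaks.

```python
# Orders of the three contiguous blocks for network types 3 to 8 (Table 2).
BLOCK_ORDERS = {
    3: ("tr", "ve", "te"),
    4: ("tr", "te", "ve"),
    5: ("ve", "tr", "te"),
    6: ("ve", "te", "tr"),
    7: ("te", "tr", "ve"),
    8: ("te", "ve", "tr"),
}


def alternate(n, labels):
    """Cycle through `labels` for n samples, e.g. tr-ve-te-tr-ve-te-..."""
    return [labels[i % len(labels)] for i in range(n)]


def blocks(n, order):
    """Split n samples into contiguous blocks, one per label in `order`."""
    k = len(order)
    bounds = [round(i * n / k) for i in range(k + 1)]
    labels = []
    for label, start, stop in zip(order, bounds, bounds[1:]):
        labels.extend([label] * (stop - start))
    return labels


def assign_series(n, net_type, n_test=None):
    """Label n time-ordered samples as 'tr', 've' or 'te' for types 1 to 8."""
    if net_type == 1:
        return alternate(n, ("tr", "ve", "te"))
    if net_type == 2:
        # Assumption: one third of the samples form the final test block.
        n_test = n // 3 if n_test is None else n_test
        return alternate(n - n_test, ("tr", "ve")) + ["te"] * n_test
    if net_type in BLOCK_ORDERS:
        return blocks(n, BLOCK_ORDERS[net_type])
    raise ValueError("types 9 to 11 need the peak dates and are not sketched")


type_1 = assign_series(6, 1)
type_2 = assign_series(6, 2)
type_8 = assign_series(6, 8)
```

With type 1 every series spans the whole period but no series is independent in time; with types 3 to 8 the series are independent but each covers only part of the period.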

The default selections, proposed by the software, were used for

the construction of the networks: number of regression layer

nodes, network parameters, pre- and post-data processing and

error function. Prediction time was set to one, which corresponds

to a period of one sampling: 30 days in the winter and 15 days in

the summer.

The statistical test used to evaluate the forecasting ability of the developed models was the standard Pearson R correlation coefficient between the actual and the predicted outputs. A perfect prediction has a correlation coefficient of 1.0, although a correlation of 1.0 does not necessarily indicate a perfect prediction, but only a prediction that is perfectly linearly correlated with the actual outputs. Nevertheless, in practice, the correlation coefficient is a good indicator of performance.
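The caveat about the correlation coefficient can be made concrete with a short sketch. The function below is a plain implementation of the Pearson R formula; the values are hypothetical. The predictions are deliberately wrong by a constant scale and offset, yet they are perfectly linearly related to the actual values.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical log2-scale densities; NOT data from the study.
actual = [8.0, 10.0, 12.0, 15.0]

# Every prediction is off (doubled, plus 3), but the relation is linear.
biased = [2.0 * a + 3.0 for a in actual]

r = pearson_r(actual, biased)
largest_error = max(abs(a - p) for a, p in zip(actual, biased))
```

Here `r` is 1 up to rounding although no prediction matches its actual value, which is why R is best read together with a measure of the prediction error.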

The first step was to investigate different time lags for the different depths of input variables, in order to find the one that produced the best correlations between the observed and the expected values. This search for the best time lag was carried out using the NN-based approach.8 The best time lag found was 11. Then, smoothing coefficients for that time lag were explored. The smoothing coefficient that gave the best correlations between the observed and the expected values was 1.7. A large smoothing constant removes noise in the training data, but may fail to take into account genuine detail in the error surface. It is advised to experiment with different values of the smoothing constant for best performance, and values between 0.1 and 100 are usually acceptable (Statsoft 2000). Using a time lag of 11 and a smoothing coefficient of 1.7, different models were run.
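The role of the smoothing coefficient can be illustrated with a minimal sketch of a GRNN in its textbook form, where the output is a Gaussian-kernel weighted average of the training targets. This is not the Statistica Neural Networks implementation: the software's pre-processing and its exact scaling of the smoothing factor are not described in the text, and the one-dimensional training data are hypothetical.

```python
import math


def grnn_predict(x, train_inputs, train_targets, smoothing):
    """Textbook GRNN output: kernel-weighted mean of the training targets.

    Each training pattern gets the weight exp(-d^2 / (2 * smoothing^2)),
    where d is its Euclidean distance to the query point x.
    """
    weights = []
    for pattern in train_inputs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, pattern))
        weights.append(math.exp(-sq_dist / (2.0 * smoothing ** 2)))
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_targets)) / total


# Hypothetical one-variable training set; NOT data from the study.
train_inputs = [(0.0,), (1.0,), (2.0,)]
train_targets = [0.0, 10.0, 20.0]

# Small smoothing: the nearest training pattern dominates the output.
sharp = grnn_predict((0.0,), train_inputs, train_targets, smoothing=0.1)

# Large smoothing: the output approaches the mean of all targets.
flat = grnn_predict((0.0,), train_inputs, train_targets, smoothing=100.0)

# The value reported in the text, applied to this toy data.
middle = grnn_predict((1.0,), train_inputs, train_targets, smoothing=1.7)
```

The two extremes show the trade-off described above: a small coefficient follows the training data closely, noise included, while a large one averages out noise together with genuine detail.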

For each of the input data depths, the eleven network types were applied, as described above. Also, four different outputs were included for each of these models: cyanobacteria density at the surface, cyanobacteria density at the euphotic zone limit, cyanobacteria density close to the bottom and the total cyanobacteria density at the three depths, that is, along the entire water column of the reservoir. Several models were produced for each of those situations and a total of 176 models were selected, based on the correlation in the verification series: the models with the highest correlation values were selected.
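The total of 176 is consistent with one selected model per situation, a situation being one combination of input depth, network type and output. This reading is an inference from the counts given in the text rather than something stated explicitly, and the labels in the sketch below are illustrative.

```python
from itertools import product

# Four input configurations: each depth alone, or the three together.
input_depths = ["surface", "euphotic zone limit", "close to bottom", "three depths"]

# Eleven network types (Table 2).
network_types = range(1, 12)

# Four outputs: density at each depth, plus the total.
outputs = ["surface", "euphotic zone limit", "close to bottom", "total"]

# One situation per combination: 4 x 11 x 4 = 176.
situations = list(product(input_depths, network_types, outputs))
n_situations = len(situations)
```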

Variables selection

The different models were improved by training all the networks and searching for the best combination of input data, that is, the input variables that obtained the best regression in the verification series. The criterion to exclude or include a variable was based on its sensitivity in the series. Sensitivity analysis can give information about the importance of each of the variables. It usually provides useful information about variables that can be safely ignored in the model and significant variables that must always be maintained. However, input variables may not be independent of each other, because there may be interdependencies between variables. Therefore, special care is advised when excluding a variable from a model.

The basic sensitivity measure is the Error, which indicates the performance of the network if a given variable is not included as an input. Important variables produce a high error, indicating that the network performance drops if they are not present. The Ratio is the ratio between the Error and the Baseline Error (the error of the network if all variables are available). If the Ratio is equal to or lower than one, then the exclusion of the variable from the model has no effect on the performance of the network or may even enhance it. Thus, the variables with Ratio values under 1 were excluded from the model. However, a search for the best combination of variables was then performed, re-including the previously excluded variables. This procedure was repeated for each excluded variable, including, one by one, all the variables previously eliminated, in order to verify again their importance to the model in the absence of other variables.
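The Ratio criterion can be sketched as follows. The error values are hypothetical and the function names are invented for this illustration; only the rule itself (Ratio equals Error divided by Baseline Error, and variables with a Ratio under 1 are excluded) comes from the text.

```python
def sensitivity_ratios(baseline_error, errors_without):
    """Ratio = error with the variable left out / error with all variables."""
    return {name: err / baseline_error for name, err in errors_without.items()}


def split_by_ratio(ratios, threshold=1.0):
    """Return (kept, excluded) variable names; Ratio under threshold is excluded."""
    kept = sorted(name for name, r in ratios.items() if r >= threshold)
    excluded = sorted(name for name, r in ratios.items() if r < threshold)
    return kept, excluded


# Hypothetical errors; NOT values from the study.
baseline_error = 0.20
errors_without = {
    "PO4": 0.35,   # error rises sharply without phosphate
    "wT": 0.28,    # error rises without water temperature
    "cond": 0.20,  # no change without conductivity
    "moon": 0.18,  # error drops without lunar day length
}

ratios = sensitivity_ratios(baseline_error, errors_without)
kept, excluded = split_by_ratio(ratios)
```

In this toy case only `moon` is excluded on the first pass; following the procedure above, it would then be re-included on its own to check whether it becomes useful once other variables are absent.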

Fig. 1 Number of selected models for which each input variable was

found to be significant to the prediction ability of the model.

Results and discussion

The relationship between phytoplankton and environmental variables has been extensively studied. Nevertheless, the causality and dynamics of algal blooms are very complex and not yet entirely understood.11 Since the mechanisms responsible for cyanobacteria blooms are not well understood, various input variables were used to develop a variety of artificial neural networks. From this work, 176 models were selected. This selection was based on the correlation values in the verification series: from the several models developed (where the input, the output and the network type varied), the model that presented the highest correlation value in the verification series was selected. The 176 selected models were then tested with a test series, in order to determine the best models, based on the average correlation in the three series (training, verification and test series). The models that had the highest average correlation between the real and the estimated values in the three series are listed in Table 3. After calculating the average correlation in the three series for each of the selected models, one model (161) stood out because it presented the highest average value. A group of four other models (4, 1, 129 and 37), with similar average correlation values, was also distinguished from the others. The remaining selected models showed a loss of quality, with lower average correlation values.
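The ranking criterion is a simple mean of the three correlation values. A short sketch using the correlations reported for the five best models reproduces the averages and the choice of model 161; the dictionary layout and function name are illustrative.

```python
def average_correlation(training, verification, test):
    """Mean of the correlations in the three series, rounded to 3 decimals."""
    return round((training + verification + test) / 3.0, 3)


# Correlations (training, verification, test) reported for the five best models.
correlations = {
    161: (0.991, 0.843, 0.978),
    4: (0.999, 0.931, 0.850),
    1: (0.999, 0.958, 0.816),
    129: (0.999, 0.958, 0.816),
    37: (0.998, 0.889, 0.875),
}

averages = {model: average_correlation(*r) for model, r in correlations.items()}
best_model = max(averages, key=averages.get)
```

Model 161 has the lowest verification correlation of the five but by far the highest test correlation, which is what lifts its average above the others.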

A sensitivity analysis was carried out for all 176 selected models, in order to determine the relative significance of each of the inputs to the prediction ability of the models. Three input variables were found to be the most significant to the performance of most of the models: phosphates (in 152 models), oxygen stratification (in 142 models) and water temperature (in 140 models) (Fig. 1). In contrast, lunar day length and electrical conductivity were chosen as important variables for only 66 and 71 models, respectively.

Table 3 Models with highest correlation average between the real value and the estimated value in the three series (training series, verification series and test series)

                                                                                 Correlation (R)
Model   Network type   Input (physical and chemical data)   Output (cyanobacteria density)   Training   Verification   Test    Average
161     8              Three depths                         Surface                          0.991      0.843          0.978   0.937
4       1              Surface                              Total                            0.999      0.931          0.850   0.927
1       1              Surface                              Surface                          0.999      0.958          0.816   0.924
129     11             Deepness                             Surface                          0.999      0.958          0.816   0.924
37      10             Surface                              Surface                          0.998      0.889          0.875   0.921

For the group of five models that stood out from the others by presenting higher average correlation values (models 161, 4, 1, 129 and 37), lunar day length was selected for only one of the models, while ammonia, phosphates, dissolved oxygen, water temperature, pH and water evaporation were considered important parameters for the performance of all of these networks. Nitrates, oxygen stratification, precipitation and discharge were selected for four of these models. Nitrites, electrical conductivity and solar radiation were considered important for the performance of only three of these five models (Table 4). The model that had the best ability to predict the occurrence of cyanobacterial blooms in the studied reservoir was model 161. This model produced very good results, considering that it used, as input data, information collected in the reservoir during a period of three years.

For model 161, only six input variables were found to be significant to the performance of the model: ammonia, phosphates, dissolved oxygen, water temperature, pH and water evaporation (see Table 4). It is important to mention that these variables correspond precisely to the parameters that were considered significant to the prediction ability of all of the other four models mentioned. These other models included other input variables besides these six, but their performance did not benefit from the inclusion of those data, so they present lower average correlation values than model 161, which used only these six input variables.

Table 4 Number of models for which each of the input parameters was considered significant to the prediction of cyanobacterial density. Only five models were considered: the model with the best correlation average in the three series (model 161) and the four models (models 1, 4, 37 and 129) that presented the next highest correlation average values

                        Input parameters
Models                  moon  NO2  NO3  NH3  PO4  DO  wT  pH  cond  strat  EVP  PCP  RAD  disch
161                     0     0    0    1    1    1   1   1   0     0      1    0    0    0
1, 4, 37, 129 and 161   1     3    4    5    5    5   5   5   3     4      5    4    3    4

Enrichment of water bodies by nutrients is one of the causes of cyanobacterial blooms. Nitrogen and phosphorus, usually in the forms of ammonia, nitrate and phosphate, are the principal nutrients affecting the growth of these organisms. In some lake-based models, nutrient concentrations were found to be one of the most significant parameters for the prediction of cyanobacteria blooms.12–15 The ratio between nitrogen and phosphorus concentrations in the aquatic system can often affect cyanobacteria species succession and dominance, which can influence total phytoplankton productivity.16 Of the nutrients considered in this study, only two were considered significant to the performance of model 161: phosphates and ammonia. These nutrients were considered important variables to the prediction accuracy of 152 and 131 models, respectively, and are two of the parameters common to the five best selected models. Phosphates were, in fact, the most important variable influencing the prediction ability of the models: this was the variable selected for the largest number of models.

Algae and other aquatic microorganisms prefer ammonium to nitrate.17 The uptake of ammonium by river plankton is higher than the nitrate uptake,18 although the ammonium concentration was lower than the nitrate concentration, as was also observed in the Torrão reservoir during the three years of sampling. This may explain the importance of ammonia to the prediction ability of the model. In this study, ammonia was important for 131 of the 176 selected models, compared with 117 models for nitrates. Nitrate, in aerobic conditions, is the most stable, abundant and oxidised form of nitrogen in water. This may be the reason why nitrogen, in the form of nitrate, was significant to 117 of the selected models and to four of the five best models obtained, although it was not considered significant to the performance of model 161. On the other hand, nitrogen in the form of nitrite is easily oxidised, so it rarely accumulates in water, unless there is organic pollution in the system. This may explain why nitrite concentration in the reservoir was not one of the most significant variables for the selected models, being chosen for only 80 of the 176 models and three of the best five models, and not being selected for model 161.

In aquatic systems, the sources of oxygen are the atmosphere (essentially at the surface of the system) and photosynthesis performed by aquatic plants and algae. One of the processes responsible for oxygen consumption is decomposition, since bacteria that break down organic matter consume the available oxygen in water. However, this happens mainly at the bottom of the aquatic system, essentially in slow-flowing waters.17 This may explain why dissolved oxygen from the surface of the reservoir (chosen for only 37 of the 176 models) was not as significant to the models as the dissolved oxygen from the other depths of the reservoir, namely from the euphotic zone limit (46 of the selected models). Dissolved oxygen was one of the six variables common to the five best models and is, therefore, present in the best model obtained (model 161). Oxygen diffusion in the water is slow and facilitated by water turbulence; at the same time, light penetration in the water column is important to the occurrence of photosynthesis. Hence, it is expected that the concentration of dissolved oxygen varies during the day, with the seasons and on a longitudinal scale.19 An uneven distribution of oxygen in the water column gives rise to oxygen stratification, as was registered in the Torrão reservoir. Normally, deep water is poor in oxygen, because it hardly reaches the upper layer of the system, which is exposed to the atmosphere. Oxygen stratification was considered a significant parameter for 142 of the 176 selected models, but appeared as important in only four of the best five models and was not chosen for model 161. The inclusion of this variable in the other four best predicting models did not lead to an improvement of the prediction ability of those models.

It was reported that cyanobacterial blooms occur essentially

when there is physical stability and that changes in temperature,

wind speed, and wind direction may lead to the disappearance of

the blooms.20 Maier et al.3 identified temperature as one of the

most important input variables driving growth of Anabaena in

the River Murray. In this work, water temperature was one of the

most selected input parameters in the 176 selected models, being

present in 140 of those models. This variable was also considered

important for the best five models selected, including model 161,

confirming its importance for predicting the density of cyano-

bacteria, which is consistent with the conclusions of the study

conducted on the River Murray. Temperature variations influence

both abundance and species composition of the cyanobacterial

blooms.21 Temperature also appears to define a window in time,

during which large incidences of cyanobacteria blooms may

occur.3 Cyanobacteria blooms often occur in summer and

autumn, when temperatures are higher than 20 °C. According to

Tilman and Kiesling22 and Tilman et al.,23 cyanobacteria are

rarely dominant when water temperature is below 17 °C.

High abundances of cyanobacteria in lakes were also found to

coincide with alkaline conditions, generally within a pH range of

7.5 to 9.0.24,25 Acidification of the water may lead to

a decrease in cyanobacteria density.26,27 pH has been found to be

one of the best parameters for predicting cyanobacterial

blooms.1,12,15 In this study, pH was found to be significant for

predicting cyanobacterial blooms in 113 of the 176 models, which

corresponds to about 64% of the selected models. It was also one

of the six variables selected for model 161 and the other four best

models considered.

Physical conditions in surface waters determine the

composition of the existing community.28,29 Jeong and others,30

in a study conducted in a reservoir in South Korea, mentioned

that precipitation, along with wind velocity and water tempera-

ture, were the most significant variables responsible for the

occurrence of cyanobacterial blooms. In this work, precipitation

was considered a significant meteorological variable for 126 of

the selected 176 models. However, this parameter was not

considered important to the performance of model 161. In this

study, low precipitation values, warm water temperatures,

stratification, high concentrations of phosphorus and low

concentration of dissolved inorganic nitrogen in the water were

related to high cyanobacteria densities. Other meteorological

variables were also included in this study: water evaporation and

solar radiation. According to Jeong et al.30 and Oliva Teles et al.,8

water evaporation was shown to have a larger impact on chlorophyll

a concentration than solar radiation and water temperature. In

this study, water evaporation, together with water temperature,

seemed to be significant to the performance of the best five

models selected and, therefore, to model 161, unlike solar radiation,

which was only selected for three of those five models. If the

176 models are considered, water evaporation was considered

important to the performance of only 99 models, while solar

radiation and water temperature were chosen as important

parameters in 109 and 140 of the 176 models, respectively.

Cyanobacterial blooms tend to occur when temperatures are

high, which coincides with increased light intensity.3 Solar radi-

ation, together with nutrient concentrations and water temper-

ature, is one of the most significant parameters controlling the

growth of cyanobacteria.14 However, other studies provided

different results. No explanation was found for the low

weight of solar radiation at the water surface in summer,1 or for

why the concentration of chlorophyll a increases with solar

radiation at low radiation levels, but not at high

levels of radiation.30

Water flow was one of the predominant variables for the

determination of the occurrence of Anabaena in the River Murray.3

The construction of dams in a river considerably reduces flow

discharge. This condition increases the retention time of the

water in the reservoir and reduces the turbulence and the turbidity of

the water, which makes nutrients available to cyanobacteria,

therefore stimulating their growth.3,5 However, large cyanobac-

teria densities seem to occur after a flood, so flow seems to be

responsible for the timing of the incidence of cyanobacterial

blooms.3 The period before the flood seems to be important to

the cyanobacteria, since it transports nutrients to the system and

re-suspends cyanobacteria akinetes that may exist in the sediment

of the reservoir.31 So, variations in discharge influence the

development of cyanobacterial blooms. This may explain why

this parameter was found to be significant to the prediction

accuracy of 128 of the 176 selected models (almost 73% of the

models). Nevertheless, this variable was not selected for model

161, although it was chosen for the other four best models

mentioned.

Electrical conductivity of water provides information about

the salinity of the aquatic system. An increase in salinity was

related to changes in aquatic ecosystems, such as an increase in

cyanobacterial density,32,33 because most cyanobacteria were

found to be very tolerant to high salinities.34 However, salinity

and the growth of Microcystis aeruginosa exhibited an inverse

relationship35–37 because increased salinity inhibited the growth of this

cyanobacterium. In the present study, electrical conductivity

was only considered as an important variable to the performance

of 71 of the 176 models, being the least significant parameter for the

prediction of cyanobacterial blooms after lunar day length. This

variable was only selected for three of the five models with

highest average correlation values, not being chosen for model

161. The lunar day length variable was included in this study, since

lunar cycles are responsible for the formation of tides and

meteorological changes, namely in the winds. There is evidence

that it may influence biological processes and represent

a powerful clock for synchronising biological events, namely in

cyanobacteria. Although this parameter was not decisive for

the majority of the models selected, it was chosen as an

important variable for network performance in 66 of the 176

selected models, which corresponds to only 37.5% of those models,

and it was selected for only one of the best five models considered.
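The selection counts quoted in this part of the discussion can be converted to shares of the 176 selected models with simple arithmetic. The short Python sketch below does this for the variables whose counts are stated above; the dictionary keys and the helper name `selection_share` are illustrative labels and do not come from the original study.

```python
# Counts of models (out of 176) in which each input variable was selected,
# as quoted in this part of the discussion.
TOTAL_MODELS = 176

selection_counts = {
    "dissolved oxygen (surface)": 37,
    "dissolved oxygen (euphotic zone limit)": 46,
    "oxygen stratification": 142,
    "water temperature": 140,
    "pH": 113,
    "precipitation": 126,
    "water evaporation": 99,
    "solar radiation": 109,
    "water flow": 128,
    "electrical conductivity": 71,
    "lunar day length": 66,
}


def selection_share(count, total=TOTAL_MODELS):
    """Return the percentage of models that selected a variable."""
    return 100.0 * count / total


shares = {name: selection_share(n) for name, n in selection_counts.items()}

# List the variables from most to least frequently selected.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {selection_counts[name]:3d}/{TOTAL_MODELS}  {share:5.1f}%")
```

For instance, 113/176 gives about 64% for pH and 128/176 gives almost 73% for water flow, in agreement with the percentages stated in the text.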

Sensitivity analysis of variables for each of the 176 models

developed and selected also showed that there were no dominant

variables, since their average root mean square error (RMSE)

ratios ranged between 1.003 and 1.128. Some of the 176 selected

models developed gave poor results for the test series, in spite of

presenting good results for the training and verification sets.

These results may be due to overfitting or because the different

datasets were not representative of the same population.38 This

poor ability to predict future values from the test series may also

occur as a result of inappropriate network architecture in those

cases. From the analysis of the five models that produced higher

average correlation values (model 1, 4, 37, 129 and 161), we may

also conclude that four of those models used as input physical

and chemical data collected from the surface of the reservoir

(models 1, 4, 37 and 161), while model 129 used data collected

from the bottom of the reservoir (see Table 3). None of the best

models selected used input data collected only from the euphotic

zone limit of the Torrão reservoir. Also, four of the five selected

models (models 1, 37, 129 and 161) predicted cyanobacteria

density at the surface of the reservoir, while only model 4

predicted the density in the entire column of water. It is interesting

to note that model 129 used physical and chemical

information from the bottom of the reservoir to predict the

concentration of cyanobacteria at the surface. Model 161 used,

as input data, physical and chemical parameters from the three

depths of the reservoir. More complete information about the

aquatic system was provided when using data from the entire

column of the reservoir, so it is understandable that these data

produced better results in the models they were applied to.
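The text does not state how the RMSE ratios of the sensitivity analysis were computed. One common approach, assumed here, is to replace one input at a time by its training mean (treating it as unavailable) and to divide the resulting RMSE by the RMSE obtained with all inputs; a ratio close to 1 then indicates a variable with little individual weight. The Python sketch below illustrates this idea with a minimal GRNN (a Gaussian-kernel weighted mean of the training targets). The data are synthetic, and the function names and the smoothing value `sigma=0.5` are illustrative assumptions, not the settings or software of the study.

```python
import numpy as np


def grnn_predict(x_train, y_train, x_query, sigma):
    """GRNN prediction: Gaussian-kernel weighted mean of training targets."""
    # Squared Euclidean distance between every query and training pattern.
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def sensitivity_ratios(x_train, y_train, x_eval, y_eval, sigma):
    """RMSE with one input replaced by its training mean, over baseline RMSE."""
    baseline = rmse(y_eval, grnn_predict(x_train, y_train, x_eval, sigma))
    ratios = []
    for j in range(x_train.shape[1]):
        x_masked = x_eval.copy()
        x_masked[:, j] = x_train[:, j].mean()  # variable treated as unavailable
        masked = rmse(y_eval, grnn_predict(x_train, y_train, x_masked, sigma))
        ratios.append(masked / baseline)
    return np.array(ratios)


# Synthetic, standardized data (NOT the Torrão measurements): the target
# depends strongly on column 0, weakly on column 1 and not on column 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = 2.0 * x[:, 0] + 0.3 * x[:, 1] + rng.normal(scale=0.1, size=200)

x_train, y_train = x[:150], y[:150]
x_eval, y_eval = x[150:], y[150:]

ratios = sensitivity_ratios(x_train, y_train, x_eval, y_eval, sigma=0.5)
print(np.round(ratios, 3))
```

In this toy case the first column should yield the largest ratio, since removing it degrades the prediction most, whereas a set of ratios all close to 1 would correspond to the absence of dominant variables reported above.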

Network types used in the best five models were type 1 (models

1 and 4), type 8 (model 161), type 10 (model 37) and type 11

(model 129). In the type 1 network, as mentioned before, the

three series were distributed, alternated and overlapped along the

entire time period. Although, in this case, the three series were

not totally independent in time, there was no overtraining of the

model, since correlation results for the three series were similar.

Model 161 was a type 8 network, in which the three series are

independent in time: first the test series, then the verification

series and, finally, the training series. In network types 10 and 11,

test series included one of the three peaks in cyanobacteria

density that occurred during the three years of data

collection (second peak in type 10 and third peak in type 11). As

training and verification series were alternated during the

remaining time period, it is possible that this interval of

time (of approximately two years) was enough for

these models to capture the functioning of the aquatic system

and, therefore, predict the oscillations in cyanobacteria densities.
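The type 8 arrangement described above, in which the three series are independent in time (first the test series, then the verification series and, finally, the training series), can be sketched as a split of chronologically ordered samples into three contiguous blocks. The function name `split_type8` and the block sizes used in the example are hypothetical; the actual number of samples assigned to each series is not given in this section.

```python
def split_type8(samples, n_test, n_verification):
    """Split chronologically ordered samples into three blocks that do not
    overlap in time: test first, then verification, then training."""
    if n_test < 0 or n_verification < 0:
        raise ValueError("block sizes must be non-negative")
    if n_test + n_verification >= len(samples):
        raise ValueError("no samples left for the training series")
    test = samples[:n_test]
    verification = samples[n_test:n_test + n_verification]
    training = samples[n_test + n_verification:]
    return training, verification, test


# Hypothetical example: 60 sampling dates indexed 0..59 in time order.
samples = list(range(60))
training, verification, test = split_type8(samples, n_test=10, n_verification=10)

print(len(training), len(verification), len(test))
```

Because the blocks are contiguous, no test sample lies between two training samples, which is the sense in which the series are independent in time; the alternated arrangements of types 1, 10 and 11 would need a different assignment rule.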

Conclusions

Several models were created in order to forecast the cyanobacterial

density 15 days in advance, using information

collected in the Torrão reservoir during a period of three years. A

time lag of 11 was used, equivalent to one sample (periods of 15

days in the summer and 30 days in the winter), and the selected

smoothing value was 1.7. The model with the highest

average correlation value across the three series had correlations

of 0.991, 0.843 and 0.978 for the training, verification and test series,

respectively. These are high correlation values,

considering that the sampling period was only three years.

The prediction accuracy of cyanobacterial density could have

been improved if data covering a longer period

of time had been included. This model corresponds to a type 8 network, in which the

three series are independent in time: first the test series, then the

verification series and, finally, the training series. Only six input

variables were considered to be significant to the performance of

this model: ammonia, phosphates, dissolved oxygen, water

temperature, pH and water evaporation, physical and chemical

data collected from the three depths of the reservoir. These variables

are common to the other four best models produced and,

although these models included other input variables, their

performance did not improve on that of model 161.
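Two steps summarized in this paragraph lend themselves to a short illustration: pairing the inputs of one sample with the density of the following sample (a lag of one sample), and averaging the correlations of the three series to rank a model. The Python sketch below assumes that the average is an unweighted mean, which the text does not state explicitly; the function names and the toy input values are hypothetical, while the three correlations are those reported for model 161.

```python
import numpy as np


def make_lagged_pairs(inputs, density, lag=1):
    """Pair the inputs of sample t with the density of sample t + lag."""
    inputs = np.asarray(inputs, dtype=float)
    density = np.asarray(density, dtype=float)
    if lag < 1 or lag >= len(density):
        raise ValueError("lag must be between 1 and len(density) - 1")
    return inputs[:-lag], density[lag:]


def average_correlation(r_training, r_verification, r_test):
    """Unweighted mean of the three series correlations (an assumption)."""
    return (r_training + r_verification + r_test) / 3.0


# Correlations reported for model 161 (training, verification, test).
avg_161 = average_correlation(0.991, 0.843, 0.978)
print(round(avg_161, 3))

# Hypothetical toy series: two inputs per sample and a density value.
inputs = [[20.1, 7.9], [21.4, 8.2], [22.0, 8.5], [19.5, 8.0]]
density = [100.0, 150.0, 400.0, 250.0]
x, y = make_lagged_pairs(inputs, density, lag=1)
print(x.shape, y.tolist())
```

Under this assumption the three reported correlations average to about 0.937. With a lag of one sample, the last input row has no target and the first density value has no predictor, so four samples yield three input and output pairs.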

It is important to note that the variables selected in model

161 were those for which the combination led to a better

prediction of the evolution of the density of cyanobacteria

through time, not necessarily those that explain that evolution.

The exclusion of certain variables does not mean that those

parameters do not determine the occurrence of cyanobacteria

blooms or that their information was not considered. We cannot

establish a relation of cause and effect between the input and the

output, and the useful information contained in those variables

may be present, in a secondary form, in other considered

parameters.

The presented models may predict cyanobacteria blooms far

enough in advance to give a water treatment plant enough

time to prepare a more efficient treatment system, or to allow a local

public health officer to organize a monitoring program to

prevent human health risks due to direct exposure to cyano-

bacteria and their toxins. Nevertheless, these models should be

continuously fed with new data every year so as to better respond

to unpredictable changes in the environment.

Acknowledgements

The authors acknowledge the Municipality of Marco de Can-

aveses for the help during sampling and the Portuguese Science

Foundation (FCT) for partially funding the project.

References

1 F. Recknagel, M. French, P. Harkonen and K.-I. Yabunaka, Ecol. Modell., 1997, 96, 11–28.

2 H. Wilson and F. Recknagel, Ecol. Modell., 2001, 146, 69–84.

3 H. Maier, G. Dandy and M. Burch, Ecol. Modell., 1998, 105, 257–272.

4 H. R. Maier, T. Sayed and B. J. Lence, Ecol. Modell., 2001, 146, 65–96.

5 H. R. Maier and G. Dandy, Math. Comput. Modell., 2001, 33, 669–682.

6 L. Oliva Teles, E. Pereira, M. Saker and V. Vasconcelos, Environ. Manage., 2006, 38, 227–237.

7 R. Leitão and A. I. Lopes, A utilização dos recursos da parte portuguesa da bacia hidrográfica do Rio Douro para produção de energia eléctrica, Hidrorumo, Projecto e Gestão, Porto, 2000.

8 L. Oliva Teles, E. Pereira, M. Saker and V. Vasconcelos, Lakes and Reservoirs: Research and Management, 2008, 13, 135–143.

9 C. S. Reynolds, The Ecology of Freshwater Phytoplankton, Cambridge University Press, 1984, p. 384.

10 M. Qi and G. P. Zhang, Eur. J. Oper. Res., 2001, 132, 666–680.

11 J. H. W. Lee, Y. Huang, M. Dickman and A. W. Jayawardena, Ecol. Modell., 2003, 159, 179–201.

12 J. Bobbin and F. Recknagel, Ecol. Modell., 2001, 146, 253–262.

13 J. Bobbin and F. Recknagel, Environ. Int., 2001, 27, 237–242.

14 I. G. Prokopkin, V. G. Gubanov and M. I. Gladyshev, Ecol. Modell., 2006, 190, 419–431.

15 B. Wei, N. Sugiura and T. Maekawa, Water Res., 2001, 35(8), 2022–2028.

16 N. Takamura, A. Otsuki, M. Aizaki and Y. Nojiri, Archives fur Hydrobiologie, 1992, 24(2), 129–148.

17 R. C. Nijboer and P. F. M. Verdonschot, Ecol. Modell., 2004, 177, 17–39.

18 D. W. Stanley and J. E. Hobbie, Limnol. Oceanogr., 1981, 26, 30–42.

19 E. P. Odum, Fundamentos de Ecologia, 4ª Edição, Fundação Calouste Gulbenkian, Lisboa, 1971.

20 H. W. Paerl, Growth and Reproductive Strategies of Freshwater Blue-Green Algae (Cyanobacteria), in Growth and Reproductive Strategies of Freshwater Phytoplankton, ed. C. D. Sandgren, Cambridge University Press, New York, 1988, pp. 261–315.

21 B. Guven and A. Howard, Sci. Total Environ., 2006, 368, 898–908.

22 D. Tilman and R. Kiesling, Freshwater Algal Ecology: Taxonomic Tradeoffs in the Temperature Dependence of Nutrient Competitive Abilities, in Current Perspectives in Microbial Ecology, American Society Microbiology, 1984, pp. 314–319.

23 D. Tilman, R. Kiesling and R. Sterner, Archives fur Hydrobiologie, 1986, 106, 473–485.

24 T. D. Brock, Evolutionary and Ecological Aspects of the Cyanophytes, in The Biology of the Blue-Green Algae, ed. N. G. Carr and B. A. Whitton, Blackwell Scientific Publications, Oxford, 1973, pp. 487–500.

25 W. A. Kratz and J. Myers, Am. J. Bot., 1955, 42, 282–287.

26 A. M. Turner, J. C. Trexler, C. F. Jordan, S. J. Slack, P. Geddes, J. H. Chick and W. F. Loftus, Conserv. Biol., 1999, 13, 898–911.

27 S. S. Dixit, A. S. Dixit and J. P. Smol, Freshwater Biol., 1991, 26, 251–265.

28 R. Margalef, Oecologia Aquatica, 1978, vol. 3, pp. 97–132.

29 C. S. Reynolds, Holarctic Ecol., 1980, 3, 141–159.

30 K.-S. Jeong, D.-K. Kim, P. Whigham and G.-J. Joo, Ecol. Modell., 2003, 161, 67–78.

31 C. Sullivan, J. Saunders and D. Welsh, Phytoplankton of the River Murray, 1980–1985. Water Quality Report No. 2, Murray–Darling Basin Commission, Canberra, 1988.

32 K. G. Sellner, R. V. Lacouture and C. R. Parrish, J. Plankton Res., 1988, 10, 49–61.

33 H. W. Paerl, J. L. Pinckney and T. F. Steppe, Environ. Microbiol., 2000, 2(1), 11–26.

34 B. J. Robson and D. P. Hamilton, Mar. Freshwater Res., 2003, 54, 139–151.

35 P. T. Orr, G. J. Jones and G. B. Douglas, Mar. Freshwater Res., 2004, 55, 277–283.

36 Y. Liu, Effects of Salinity on the Growth and Toxin Production of a Harmful Algal Species, Microcystis aeruginosa, Water Environment Federation, 2006, vol. 1, J.U.S. SJWP.

37 T. Masters, Practical Neural Network Recipes in C++, Academic Press, San Diego, CA, 1993.
