forecasting of cyanobacterial density in torrão reservoir using artificial neural networks
Journal of Environmental Monitoring
Cite this: J. Environ. Monit., 2011, 13, 1761
www.rsc.org/jem PAPER
Published on 06 May 2011. Downloaded by University of Pittsburgh on 28/10/2014 10:28:23.
Forecasting of cyanobacterial density in Torrão reservoir using artificial neural networks
Rita Torres,a Elisa Pereira,b Vítor Vasconcelos*ab and Luís Oliva Telesab
Received 7th February 2011, Accepted 8th April 2011
DOI: 10.1039/c1em10127g
The ability of general regression neural networks (GRNN) to forecast the density of cyanobacteria in
the Torrão reservoir (Tâmega river, Portugal), in a period of 15 days, based on three years of collected
physical and chemical data, was assessed. Several models were developed and 176 were selected based
on their correlation values for the verification series. A time lag of 11 was used, equivalent to one
sample (periods of 15 days in the summer and 30 days in the winter). Several combinations of the series
were used. Input and output data collected from three depths of the reservoir were applied (surface,
euphotic zone limit and bottom). The model with the highest average correlation value
had correlations of 0.991, 0.843 and 0.978 for the training, verification and test series, respectively. In this
model the three series were independent in time: first the test series, then the verification series and, finally, the training series.
Only six input variables were considered significant to the performance of this model: ammonia,
phosphates, dissolved oxygen, water temperature, pH and water evaporation, with the physical and chemical
parameters referring to the three depths of the reservoir. These variables are common to the next four
best models produced and, although those models included other input variables, they did not perform
better than the selected best model.
Introduction
Eutrophication, although a natural process, can be accelerated by
human activities through increased loads of nutrients and
organic substances. These substances can cause the excessive
growth of algae and cyanobacteria, which may interfere with
the uses of the water and with the health and diversity of
autochthonous organisms (EPA 2003). Most of these negative
effects might be prevented or minimized if phytoplankton
blooms are predicted at an early stage.1
Over the last decade, there has been a growing interest in
using artificial neural networks (ANNs) for modelling ecosys-
tems. This is mainly because, unlike other ecological models
based on linear regression, ANNs are able to map the non-linear
relationships between characteristic variables of the
ecosystems.2 The great advantage of ANNs is their ability to
work with noisy or incomplete input data and their capability
of learning and generalizing from experience. They are often
good at solving problems that are too complex for conventional
technologies. Since the use of ANNs to model cyanobacteria
blooms is recent, most existing ANN models only predict
results and do not forecast future values.2–6
This work studies the ability of a specific
type of ANN, the general regression neural network (GRNN), to
forecast the density of cyanobacteria in a temperate reservoir
(Torrão, Tâmega river, Portugal) based on three years
of monitoring data. Such a model would be a valuable tool for the
management of this type of ecosystem and would help environmental
authorities to better manage eutrophication and its
consequences.
aDepartamento de Biologia - Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4069-007 Porto, Portugal. E-mail: [email protected]; Fax: +351 223380609; Tel: +315 223401814
bCIIMAR/CIMAR - Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas 289, 4050-123 Porto, Portugal
Environmental impact
Artificial Neural Networks may be used to forecast the occurrence
of cyanobacteria blooms in reservoirs. We report the use of this
technique to predict the occurrence of blooms several weeks ahead.
This might enable water managers to adapt monitoring strategies
that will prevent human health risks. This might also be used to
apply mitigation measures in order to decrease the potential
hazardous effects of blooms.
This journal is © The Royal Society of Chemistry 2011
J. Environ. Monit., 2011, 13, 1761–1767
Materials and methods
Study site
The Torrão dam is the first hydroelectric dam constructed on the
Tâmega River, the largest tributary of the Douro River (North
Portugal). The dam was completed in 1988 and is located about
30 km from Porto. The reservoir formed by this dam has
submerged 31 km of the river basin and has a maximum
volume of 77 hm3.7 It is used to produce energy (about 233 million
kilowatts per hour), for recreational activities (swimming, boating
and fishing) and as a source of drinking water for the inhabitants of Amarante and
Marco de Canavezes.
The phytoplankton community in Torrão reservoir includes
toxin-producing cyanobacteria, mainly Microcystis aeruginosa
and Aphanizomenon flos-aquae.8 The occurrence of toxic cyanobacteria
in the reservoir may be harmful to the population that
uses its water and to wild and domestic animals, due to the
microcystins produced.
Collected data
The monitoring data used in this study were collected between
September 1999 and December 2002 (Table 1). Sampling was
carried out monthly, except between June and October, when
increased abundance of cyanobacteria was detected. During this
period, sampling was biweekly.
Artificial neural networks do not require transformation of the
input data, because the probability distribution of the data
does not affect the model input.5 However, in order to reduce the
scale range, phytoplankton data underwent a log2 transformation.
This transformation was found to give better results than
models produced using untransformed data.6
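To illustrate how a log2 transformation compresses the range of cell counts, here is a minimal Python sketch. The densities are invented for the example, and the +1 offset (which keeps zero counts defined) is an assumption that the source does not mention.

```python
import math

def log2_transform(densities, offset=1.0):
    """Return log2(density + offset) for each density (cells per mL).

    The offset is an assumption made here so that zero counts stay defined.
    """
    return [math.log2(d + offset) for d in densities]

# Hypothetical cyanobacteria densities (cells per mL), not data from the study.
raw = [0, 1, 1023, 1048575]
transformed = log2_transform(raw)
print(transformed)  # [0.0, 1.0, 10.0, 20.0]
```

A range spanning six orders of magnitude is reduced to values between 0 and 20, which is the scale reduction the text refers to.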
Models development
The software used in this study was Statistica Neural Networks,
Version 4.0 F. A general regression neural network (GRNN)
was used in the development of the forecasting models. The input
patterns were formed by thirty distinct parameters, all physical
and chemical variables. Phytoplankton and other associated
variables (chlorophyll and phaeophytin) were excluded. In
a study performed at Crestuma Reservoir, Douro River, Portugal,6
the inclusion of data directly correlated with the output
variable did not lead to better performance of the predictive
models, while physical and chemical information about the
environment provided the best results in the prediction of
cyanobacterial density.
Table 1 Collected data used in the development of the models, their notation and units
Parameter                 Notation   Units
Lunar day length          moon       Days
Nitrite                   NO2        mg L−1
Nitrate                   NO3        mg L−1
Ammonia                   NH3        mg L−1
Phosphate                 PO4        mg L−1
Dissolved oxygen          DO         mg L−1
Water temperature         wT         °C
pH                        pH         Sorenson scale
Electrical conductivity   cond       mS cm−1
Oxygen stratification     strat      Present = 1, not present = 0
Precipitation             PCP        mm day−1
Water evaporation         EVP        mm day−1
Solar radiation           RAD        kJ m−2
Discharge                 disch      m3 s−1
Cyanobacteria density     cyan       cells mL−1
The thirty variables were divided into three groups: variables
measured in samples collected from the surface, from the
euphotic zone limit and close to the bottom of the reservoir
(concentration of nitrites, nitrates, ammonia and phosphates,
dissolved oxygen, water temperature, pH and water conductivity;
eight variables at each of the three depths, 24 in total),
meteorological variables (precipitation, water evaporation,
solar radiation and reservoir discharge) and general
variables (lunar day length and oxygen stratification).
The outputs of the GRNN provide an estimate of the cyano-
bacterial density in Torrao reservoir for each of the three depths
(surface, euphotic zone limit, close to bottom) and the total
concentration of cyanobacteria in the reservoir. Since these
organisms can migrate vertically in the water column9 it was
considered important to include information about the density of
cyanobacteria in all the depths of the reservoir.
It is known that models with an excess of input data may
overfit,10 so to avoid this problem, some variables
were excluded: percentage of oxygen saturation (because dissolved
oxygen was included), and minimum, maximum and
average air temperature (due to the inclusion of water temperature).
Meteorological data (solar radiation, precipitation and water
evaporation) and reservoir discharge were measured daily
throughout the sampling period. To avoid using all this detailed
information, these data were summarized as averages over
seven-day periods. This period was chosen because the
sampling periodicity of the other variables was biweekly during
time intervals of higher cyanobacteria concentrations.
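The seven-day summarization of daily measurements can be sketched as follows. This is only an illustration: the source does not say how each seven-day window was aligned with the sampling dates, so ending the window on the sampling day is an assumption, and the precipitation values are invented.

```python
def window_mean(daily_values, end_index, window=7):
    """Mean of the `window` daily values ending at end_index (inclusive)."""
    start = max(0, end_index - window + 1)
    chunk = daily_values[start:end_index + 1]
    return sum(chunk) / len(chunk)

# Hypothetical daily precipitation (mm per day) over 14 days.
daily_pcp = [0, 0, 2, 4, 0, 1, 0, 7, 7, 7, 7, 7, 7, 7]

# Assume water samples were taken on day index 6 and day index 13.
weekly = [window_mean(daily_pcp, i) for i in (6, 13)]
print(weekly)  # [1.0, 7.0]
```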
For each model, input data were divided into three series:
training series, verification series and test series. Eleven
arrangements of the three series were considered (Table 2). The
collected data available referred to a period of three years.
If this period were simply divided among the three series, each series would
include information about only one year, which, ecologically, is
a very short period of time to train a neural network. To avoid
this problem, it was decided to alternate the three series along the
three years of available data, so that all the series covered the
longest time interval possible (network type 1). However, by
doing this, none of the series was time independent from the
others, which might produce inconsistent results (such as overtraining
problems). To overcome this, it was decided to
isolate the test series, placing it after the training and verification
series, which remained alternated (network type 2). It was also decided to
separate the three series in time and vary their sequential position
(network types 3 to 8), in order to find the best combination.
During the three years of data collection, a yearly
peak of cyanobacteria density was found in the reservoir. For this reason, it
was also considered important that the test series included a
time interval in which at least one maximum of cyanobacteria
density occurred, so three other combinations were considered
(network types 9 to 11). In these combinations, the test series
includes one cyanobacteria peak and the training and verification
series alternate during the remaining period of time. All these
series combinations were applied for each of the depths of the
input data (surface, euphotic zone limit, close to bottom and the
three depths simultaneously).
Table 2 Network types used for the development of the models. Each network type used a different combination of the three series (training, verification and test series) throughout the available time period (tr: training series; ve: verification series; te: test series)
Network type   Series combination
1    Three series successively alternated, that is, overlapped throughout the whole time period
     tr-ve-te-tr-ve-te-...
2    Training and verification series alternated and overlapped and test series separated at the end of the time interval
     tr-ve-tr-ve-...-te
3    Three series separated in time (first training series, then verification series and finally test series)
     tr-ve-te
4    Three series separated in time (first training series, then test series and finally verification series)
     tr-te-ve
5    Three series separated in time (first verification series, then training series and finally test series)
     ve-tr-te
6    Three series separated in time (first verification series, then test series and finally training series)
     ve-te-tr
7    Three series separated in time (first test series, then training series and finally verification series)
     te-tr-ve
8    Three series separated in time (first test series, then verification series and finally training series)
     te-ve-tr
9    Test series including the first peak of cyanobacteria density during the time interval and training and verification series alternated
     tr-ve-tr-ve-te (peak 1)-tr-ve-tr-ve...
10   Test series including the second peak of cyanobacteria density during the time interval and training and verification series alternated
     tr-ve-tr-ve-te (peak 2)-tr-ve-tr-ve...
11   Test series including the third peak of cyanobacteria density during the time interval and training and verification series alternated
     tr-ve-tr-ve-te (peak 3)-tr-ve-tr-ve...
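The arrangements of the three series can be made concrete with a small labelling function. The sketch below covers only network types 1 to 8 of Table 2; the function name, the equal-sized blocks and the one-third share given to the test series in type 2 are assumptions, since the source does not report the sizes of the series.

```python
def assign_series(n_samples, network_type):
    """Label each time-ordered sample as 'tr', 've' or 'te'.

    Sketch for network types 1 to 8 only; block sizes are assumed equal.
    """
    blocks = {3: ("tr", "ve", "te"), 4: ("tr", "te", "ve"),
              5: ("ve", "tr", "te"), 6: ("ve", "te", "tr"),
              7: ("te", "tr", "ve"), 8: ("te", "ve", "tr")}
    if network_type == 1:
        # Three series alternated along the whole period.
        cycle = ("tr", "ve", "te")
        return [cycle[i % 3] for i in range(n_samples)]
    if network_type == 2:
        # Training and verification alternated, test series isolated at the end.
        n_test = n_samples // 3
        head = n_samples - n_test
        return [("tr", "ve")[i % 2] for i in range(head)] + ["te"] * n_test
    if network_type in blocks:
        # Three contiguous blocks in the order given by Table 2.
        size = n_samples // 3
        labels = []
        for k, name in enumerate(blocks[network_type]):
            count = size if k < 2 else n_samples - 2 * size  # last block takes the remainder
            labels += [name] * count
        return labels
    raise ValueError("only network types 1 to 8 are sketched here")

type1 = assign_series(9, 1)
type2 = assign_series(9, 2)
type8 = assign_series(9, 8)
print(type8)  # ['te', 'te', 'te', 've', 've', 've', 'tr', 'tr', 'tr']
```

Network type 8, the arrangement of the best model reported in the abstract, places the test series first and the training series last.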
The default selections, proposed by the software, were used for
the construction of the networks: number of regression layer
nodes, network parameters, pre- and post-data processing and
error function. Prediction time was set to one, which corresponds
to a period of one sampling: 30 days in the winter and 15 days in
the summer.
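Forecasting one sampling ahead means pairing the inputs available up to a given sample with the cyanobacteria density of the following sample. The sketch below shows one way to build such patterns; the names `steps` and `lookahead` are hypothetical, and how Statistica Neural Networks handles its time lag internally is not described in the source.

```python
def make_patterns(inputs, target, steps, lookahead=1):
    """Pair `steps` consecutive input rows with the target `lookahead` samples later."""
    patterns = []
    last_start = len(inputs) - steps - lookahead
    for start in range(last_start + 1):
        window = inputs[start:start + steps]
        flat = [value for row in window for value in row]  # concatenate the lagged rows
        patterns.append((flat, target[start + steps - 1 + lookahead]))
    return patterns

# Toy series: one input variable and one target value per sampling date.
inputs = [[1], [2], [3], [4], [5]]
target = [10, 20, 30, 40, 50]
pairs = make_patterns(inputs, target, steps=2, lookahead=1)
print(pairs)  # [([1, 2], 30), ([2, 3], 40), ([3, 4], 50)]
```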
The statistic used to evaluate the forecasting ability of
the developed models was the standard Pearson R correlation
coefficient between the actual and the predicted outputs. A
perfect prediction has a correlation coefficient of 1.0, but
a coefficient of 1.0 does not necessarily indicate a perfect prediction,
only a prediction that is perfectly linearly correlated with the
actual outputs. Nevertheless, in practice, the correlation
coefficient is a good indicator of performance.
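The caveat about the correlation coefficient can be checked numerically. In this minimal sketch (toy numbers, not data from the study), a prediction that is systematically wrong but linearly related to the observations still reaches R = 1.0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

observed = [1.0, 2.0, 3.0, 4.0]
biased = [2 * v + 3 for v in observed]  # wrong everywhere, but perfectly linear

r = pearson_r(observed, biased)
mean_abs_error = sum(abs(p - o) for p, o in zip(biased, observed)) / len(observed)
print(r, mean_abs_error)  # 1.0 5.5
```

A correlation of 1.0 coexists here with a mean absolute error of 5.5, so an error measure alongside R gives a fuller picture of forecast quality.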
The first step was to investigate different time lags for the
different depths of input variables, in order to find the one that
produced the best correlations between the observed values and the
expected ones. This search for the best time lag was carried out
using the NN-based approach.8 The best time lag found was 11.
Then, smoothing coefficients for that time lag were explored. The
smoothing coefficient that gave the best correlations between
the observed and the expected values was 1.7. A large smoothing
constant removes noise in the training data, but may fail to take
into account genuine detail in the error surface. It is advised to
experiment with different values of the smoothing constant for best
performance; values between 0.1 and 100 are usually
acceptable (Statsoft 2000). Using a time lag of 11 and
a smoothing coefficient of 1.7, different models were built.
For each of the four input data sets (the three individual depths and the
three depths simultaneously), the eleven network types were
applied, as described above. Also, four different outputs were
considered for each of these models: cyanobacteria density at the
surface, cyanobacteria density at the euphotic zone limit, cyanobacteria
density close to the bottom and the total cyanobacteria
density at the three depths, that is, along the entire water column
of the reservoir. Several models were produced for each of these
situations (4 input sets × 11 network types × 4 outputs = 176 situations)
and, for each one, the model with the highest correlation in the
verification series was selected, giving a total of 176 selected models.
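A general regression neural network can be read as a kernel-weighted average of the training targets, with the smoothing coefficient setting the kernel width. The sketch below is a minimal pure-Python version of that standard formulation, not the Statistica Neural Networks implementation; the function name, the toy data and the use of unscaled inputs are assumptions.

```python
import math

def grnn_predict(train_x, train_y, query, sigma):
    """Kernel-weighted average of training targets (Gaussian kernel of width sigma)."""
    weights = []
    for row in train_x:
        dist2 = sum((a - b) ** 2 for a, b in zip(row, query))
        weights.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Toy training set: one input variable, three patterns.
train_x = [[0.0], [1.0], [2.0]]
train_y = [0.0, 10.0, 20.0]

# A small smoothing coefficient reproduces the nearest training target;
# a very large one pulls every prediction towards the mean of the targets.
sharp = grnn_predict(train_x, train_y, [0.0], sigma=0.1)
smooth = grnn_predict(train_x, train_y, [0.0], sigma=100.0)
print(round(sharp, 3), round(smooth, 3))
```

This is the trade-off described in the text: large smoothing values remove noise but also flatten genuine detail, which is why several values are usually tried.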
Variables selection
The different models were improved by
training all the networks and searching for the best input data
combination, that is, the input variables that gave the best
regression in the verification series. The criterion to exclude or
include a variable was based on its sensitivity in the series.
Sensitivity analysis can give information about the importance of
each of the variables. It usually provides useful information
about variables that can be safely ignored in the model and
significant variables that must always be retained. However,
input variables may not be independent of each other, because
there may be interdependencies between variables. Therefore,
special care is advised when excluding a variable from
a model.
The basic sensitivity measure is the Error, which indicates
the performance of the network if a given variable is not
included as an input. Important variables produce a high error,
indicating that the network performance drops if they are not
present. The Ratio reports the ratio between the Error and the
Baseline Error (the error of the network if all variables are
available). If the Ratio is equal to one or lower, then the exclusion
of the variable from the model has no effect on the performance
of the network or may even enhance it. Thus, the variables
with Ratio values under 1 were excluded from the model.
However, a search for the best combination of variables was then
performed by re-including the previously excluded variables. This
procedure was repeated for each excluded variable, re-including,
one by one, all the variables previously eliminated, in order to
verify their importance to the model again in the absence
of other variables.
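The Error and Ratio quantities described above can be sketched as follows. This is an illustration under stated assumptions: a variable is made "unavailable" by replacing it with its mean (the source does not say how the software does this), RMSE is used as the error, and `toy_model` is a hypothetical stand-in for a trained network.

```python
import math

def rmse(predict, rows, targets):
    """Root mean square error of `predict` over the given patterns."""
    squared = [(predict(r) - t) ** 2 for r, t in zip(rows, targets)]
    return math.sqrt(sum(squared) / len(squared))

def sensitivity_ratios(predict, rows, targets):
    """Ratio = error with one variable masked / baseline error, per variable."""
    baseline = rmse(predict, rows, targets)
    ratios = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        masked = [r[:j] + [mean_j] + r[j + 1:] for r in rows]  # assumption: mask by the mean
        ratios.append(rmse(predict, masked, targets) / baseline)
    return ratios

def toy_model(row):
    # Hypothetical trained model that only uses the first variable.
    return 3.0 * row[0]

rows = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 2.0]]
targets = [0.5, 2.5, 6.5, 8.5]

ratios = sensitivity_ratios(toy_model, rows, targets)
kept = [j for j, ratio in enumerate(ratios) if ratio > 1.0]
print(ratios, kept)
```

The first variable has a ratio well above 1 and is kept, while the second has a ratio of exactly 1, so removing it would not change performance.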
Fig. 1 Number of selected models for which each input variable was found to be significant to the prediction ability of the model.
Results and discussion
The relationship between phytoplankton and environmental
variables has been extensively studied. Nevertheless, the causality
and dynamics of algal blooms are very complex and not yet
entirely understood.11 For this reason, various input
variables were used to develop a variety of artificial neural
networks, from which 176 models were selected. This model selection
was based on correlation values in the verification series: among the
several models developed for each combination of input, output and
network type, the model that presented the highest correlation value in the
verification series was selected. The 176 selected models were
then evaluated with the test series, in order to determine the best
models, based on the average correlation in the three series
(training, verification and test series). The models that had the
highest average correlation between the real value and the estimated
value in the three series are listed in Table 3. After calculating the
average correlation in the three series for each of the selected
models, one model (161) stood out because it presented
the highest average value. A group of four other models (4, 1,
129 and 37) was also distinguished from the remaining ones by
similar average correlation values. The other selected models
showed a loss of quality, with lower average correlation values.
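The ranking criterion is the plain mean of the three correlations. The following sketch recomputes it from the training, verification and test correlations reported in Table 3; only the variable names are introduced here.

```python
# (training, verification, test) correlations as reported in Table 3.
correlations = {
    161: (0.991, 0.843, 0.978),
    4: (0.999, 0.931, 0.850),
    1: (0.999, 0.958, 0.816),
    129: (0.999, 0.958, 0.816),
    37: (0.998, 0.889, 0.875),
}

averages = {model: round(sum(r) / 3, 3) for model, r in correlations.items()}
best = max(averages, key=averages.get)
print(averages)
print(best)  # 161
```

Model 161 ranks first on the average even though its verification correlation (0.843) is the lowest of the five, because its test correlation (0.978) is the highest.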
A sensitivity analysis was carried out for all the 176 selected
models, in order to determine the relative significance of each of
the inputs to the prediction ability of the models. Three input
variables were the most significant to the
performance of most of the models: phosphates (in 152 models),
oxygen stratification (in 142 models) and water temperature (in
140 models) (Fig. 1). At the other extreme, lunar day length and
electrical conductivity were chosen as important variables for
only 66 and 71 models, respectively.
Table 3 Models with highest correlation average between the real value and the estimated value in the three series (training series, verification series and test series)
Model   Network type   Input (physical and chemical data)   Output (cyanobacteria density)   R (training)   R (verification)   R (test)   R (average)
161     8              Three depths                         Surface                          0.991          0.843              0.978      0.937
4       1              Surface                              Total                            0.999          0.931              0.850      0.927
1       1              Surface                              Surface                          0.999          0.958              0.816      0.924
129     11             Bottom                               Surface                          0.999          0.958              0.816      0.924
37      10             Surface                              Surface                          0.998          0.889              0.875      0.921
For the group of five models that stood out from the others by
presenting higher average correlation values (models 161, 4, 1,
129 and 37), lunar day length was selected for only one of the models,
while ammonia, phosphates, dissolved oxygen, water temperature,
pH and water evaporation were considered important
parameters for the performance of all of these networks.
Nitrates, oxygen stratification, precipitation and discharge were
selected for four of these models. Nitrites, electrical conductivity
and solar radiation were considered important for the
performance of only three of these five models (Table 4). The model
that had the best ability to predict the occurrence of cyanobacterial
blooms in the studied reservoir was model 161. This
model produced very good results, considering that its
input data were collected in the reservoir during a period
of three years.
For model 161, only six input variables were found to be
significant to the performance of the model: ammonia, phosphates,
dissolved oxygen, water temperature, pH and water
evaporation (see Table 4). These
variables correspond precisely to the parameters that were
also considered significant to the prediction ability of all the
other four models mentioned. Those models included other
input variables besides these six, but their performance did not
benefit from the inclusion of those data: they present lower
average correlation values than model 161, which used as input
variables only the six referred to above.
Table 4 Number of models for which each input parameter was considered significant to the prediction of cyanobacterial density. Only five models were considered: the model with the best correlation average in the three series (model 161) and the four models (models 1, 4, 37 and 129) that presented the next highest correlation average values
Models                  moon  NO2  NO3  NH3  PO4  DO  wT  pH  cond  strat  EVP  PCP  RAD  disch
161                     0     0    0    1    1    1   1   1   0     0      1    0    0    0
1, 4, 37, 129 and 161   1     3    4    5    5    5   5   5   3     4      5    4    3    4
Enrichment of water bodies by nutrients is one of the causes of
cyanobacterial blooms. Nitrogen and phosphorus, usually in the
forms of ammonia, nitrate and phosphate, are the principal
nutrients affecting the growth of these organisms. In some lake-based
models, nutrient concentrations were found to be one of
the most significant parameters for the prediction of
cyanobacteria blooms.12–15 The ratio between nitrogen and
phosphorus concentrations in the aquatic system can often
affect cyanobacteria species succession and dominance,
which in turn can influence total phytoplankton productivity.16 Of
the nutrients considered in this study, only two were considered
significant to the performance of model 161: phosphates and
ammonia. These nutrients were considered important variables
for the prediction accuracy of 152 and 131 models, respectively,
and are two of the parameters common to the five best selected
models. Phosphate was, in fact, the most important
variable influencing the prediction ability of the models:
it was the variable selected for the highest number of
models.
Algae and other aquatic microorganisms prefer ammonium
to nitrate.17 The uptake of ammonium by river plankton is
higher than the nitrate uptake,18 even when the ammonium
concentration is lower than the nitrate concentration, a condition that
was also observed in the Torrão reservoir during the three years of
sampling. This may explain the importance of ammonia to the
prediction ability of the model. In this study, ammonia was
important for 131 of the 176 selected models, compared to 117
models for nitrates. Nitrate, in aerobic conditions, is the most
stable, abundant and oxidised form of nitrogen in water. This
may be the reason why nitrogen, in the form of nitrate, was
significant to 117 of the total selected models and also to four of
the five best models obtained (Table 4), although it was not considered
significant to the performance of model 161. On the other hand,
nitrogen in the form of nitrite is easily oxidised, so it rarely
accumulates in water, unless there is organic pollution in the
system. This may explain why nitrite concentration in the
reservoir was not one of the most significant variables for the
selected models, being chosen for only 80 of the 176 models and
three of the best five models, and not being selected for model 161.
In aquatic systems, the sources of oxygen are the atmosphere
(essentially at the surface of the system) and photosynthesis
performed by aquatic plants and algae. One of the processes
responsible for oxygen consumption is decomposition, since
bacteria that break down organic matter consume the available
oxygen in water. However, this happens mainly at the bottom of
the aquatic system, essentially in slow flowing waters.17 This may
explain why dissolved oxygen from the surface of the reservoir
(chosen for only 37 of the 176 models) was not as significant to
the models as the dissolved oxygen from the other depths of the
reservoir, namely from the euphotic zone limit (46 of the selected
models). Dissolved oxygen was one of the six variables common
to the five best models and, therefore, present in the best
model obtained (model 161). Oxygen diffusion in water is
slow and is facilitated by water turbulence; at the same time, light
penetration in the water column is important for the occurrence of
photosynthesis. Hence, the concentration of
dissolved oxygen is expected to vary during the day, with the seasons and
on a longitudinal scale.19 An uneven distribution of oxygen in the
water column gives rise to oxygen stratification, as was
recorded in the Torrão reservoir. Normally, deep water is poor
in oxygen, because it hardly reaches the upper layer of the
system, which is exposed to the atmosphere. Oxygen stratification
was considered a significant parameter for 142 of the 176
selected models, but it appeared as important in only four of the best
five models and was not chosen for model 161. The inclusion of this
variable in the other four best models did not lead
to an improvement in their prediction ability.
It was reported that cyanobacterial blooms occur essentially
when there is physical stability and that changes in temperature,
wind speed, and wind direction may lead to the disappearance of
the blooms.20 Maier et al.3 identified temperature as one of the
most important input variables driving growth of Anabaena in
the River Murray. In this work, water temperature was one of the
most frequently selected input parameters in the 176 selected models, being
present in 140 of those models. This variable was also considered
important for the best five models selected, including model 161,
confirming its importance for predicting the density of cyanobacteria,
which is consistent with the conclusions of the study
carried out in the River Murray. Temperature variations influence
both abundance and species composition of the cyanobacterial
blooms.21 Temperature also appears to define a window in time,
during which large incidences of cyanobacteria blooms may
occur.3 Cyanobacteria blooms often occur in summer and
autumn, when temperatures are higher than 20 °C. According to
Tilman and Kiesling22 and Tilman et al.23 cyanobacteria are
rarely dominant when water temperature is below 17 °C.
High abundances of cyanobacteria in lakes were also found to
coincide with alkaline conditions, generally within a pH range of
7.5 to 9.0.24,25 Acidification of the water may lead to
a decrease in cyanobacteria density.26,27 pH has been found to be
one of the best parameters for predicting cyanobacterial
blooms.1,12,15 In this study, pH was found to be significant for
predicting cyanobacterial blooms in 113 of the 176 models, which
corresponds to about 64% of the selected models. It was also one
of the six variables selected for model 161 and the other four best
models considered.
Physical conditions in surface waters determine the
composition of the existing community.28,29 Jeong and others,30
in a study conducted in a reservoir in South Korea, mentioned
that precipitation, along with wind velocity and water temperature,
were the most significant variables responsible for the
occurrence of cyanobacterial blooms. In this work, precipitation
was considered a significant meteorological variable for 126 of
the selected 176 models. However, this parameter was not
considered important to the performance of model 161. In this
study, low precipitation values, warm water temperatures,
stratification, high concentrations of phosphorus and low
concentration of dissolved inorganic nitrogen in the water were
related to high cyanobacteria densities. Other meteorological
variables were also included in this study: water evaporation and
solar radiation. According to Jeong et al.30 and Oliva Teles et al.8
water evaporation was shown to have a larger impact on chlorophyll
a concentration than solar radiation and water temperature. In
this study, water evaporation, together with water temperature,
was significant to the performance of the best five
models selected and, therefore, to model 161, unlike solar radiation,
which was selected for only three of those five models. If the
176 models are considered, water evaporation was considered
important to the performance of only 99 models, while solar
radiation and water temperature were chosen as important
parameters in 109 and 140 of the 176 models, respectively.
Cyanobacterial blooms tend to occur when temperatures are
high, which coincides with increased light intensity.3 Solar radiation,
together with nutrient concentrations and water temperature,
is one of the most significant parameters controlling the
growth of cyanobacteria.14 However, other studies provided
different results. No explanation was found for the low
weight of solar radiation at the water surface in summer,1 or for
why the concentration of chlorophyll a increases with
solar radiation at low radiation levels, but not at high
levels of radiation.30
Water flow was one of the predominant variables for
determining the occurrence of Anabaena in the River Murray.3
The construction of dams in a river considerably reduces flow
discharge. This increases the retention time of the
water in the reservoir and reduces the turbulence and the turbidity of
the water, which makes nutrients available to cyanobacteria,
therefore stimulating their growth.3,5 However, large cyanobacteria
densities seem to occur after a flood, so flow seems to be
responsible for the timing of the incidence of cyanobacterial
blooms.3 The period before the flood seems to be important to
the cyanobacteria, since it transports nutrients to the system and
re-suspends cyanobacteria akinetes that may exist in the sediment
of the reservoir.31 So, variations in discharge influence the
development of cyanobacterial blooms. This may explain why
this parameter was found to be significant to the prediction
accuracy of 128 of the 176 selected models (almost 73% of the
models). Nevertheless, this variable was not selected for model
161, although it was chosen for the other four best models
mentioned.
Electrical conductivity of water provides information about
the salinity of the aquatic system. An increase in salinity has been
related to changes in aquatic ecosystems, such as an increase in
cyanobacterial density,32,33 because most cyanobacteria were
found to be very tolerant to high salinities.34 However, salinity
and the growth of Microcystis aeruginosa exhibited an inverse
relationship35–37 because increased salinity inhibited the growth of this
cyanobacterium. In this study, electrical conductivity
was considered an important variable to the performance
of only 71 of the 176 models, making it the least significant parameter for the
prediction of cyanobacterial blooms after lunar day length. This
variable was selected for only three of the five models with the
highest average correlation values, and was not chosen for model
161. The lunar day length variable was included in this study, since
lunar cycles are responsible for the formation of tides and
meteorological changes, namely in the winds. There is evidence
that it may influence biological processes and represent
a powerful clock for synchronising biological events, namely in
cyanobacteria. Although this parameter was not determinant for
the majority of the models selected, it was chosen as an
important variable to the network performance for 66 of the 176
selected models, which corresponds to only 37.5% of those models,
and was selected for only one of the best five models considered.
Sensitivity analysis of variables for each of the 176 models
developed and selected also showed that there were no dominant
variables, since their average root mean square error (RMSE)
ratios oscillate between 1003 and 1128. Some of the 176 selected
models developed gave poor results for the test series, in spite of
presenting good results for the training and verification sets.
These results may be due to overfitting or because the different
datasets were not representative of the same population.38 This
poor ability to predict future values from the test series may also
occur as a result of inappropriate network architecture in those
cases. From the analysis of the five models that produced the highest
average correlation values (models 1, 4, 37, 129 and 161), we may
also conclude that four of those models used as input physical
and chemical data collected from the surface of the reservoir
(models 1, 4, 37 and 161), while model 129 used data collected
from the bottom of the reservoir (see Table 3). None of the best
models selected used input data collected only from the euphotic
zone limit of the Torrao reservoir. Also, four of the five selected
models (models 1, 37, 129 and 161) predicted cyanobacteria
density from the surface of the reservoir, while only model 4
predicted the density in the entire water column. It is interesting
to note that model 129 used physical and chemical
information from the bottom of the reservoir to predict the
concentration of cyanobacteria at the surface. Model 161 used,
as input data, physical and chemical parameters from the three
depths of the reservoir. More complete information about the
aquatic system was provided when using data from the entire
column of the reservoir, so it is understandable that these data
produced better results in the models they were applied to.
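The RMSE ratios used in the sensitivity analysis above can be illustrated with a short sketch. It assumes one common definition, which is not confirmed by the text: the RMSE obtained when an input is replaced by its mean value, divided by the RMSE of the unmodified model. A ratio close to 1 then indicates an input that contributes little. The function names, the toy model and the data are hypothetical and serve only to show the calculation.

```python
import math

def rmse(predict, rows, targets):
    """Root mean square error of predict() over the given rows."""
    squared = [(predict(r) - t) ** 2 for r, t in zip(rows, targets)]
    return math.sqrt(sum(squared) / len(squared))

def sensitivity_ratios(predict, rows, targets):
    """RMSE with each input replaced by its mean, divided by baseline RMSE.

    Assumed definition (not taken from the paper): ratios near 1 mean the
    input barely matters, larger ratios mean the model relies on it.
    """
    baseline = rmse(predict, rows, targets)
    ratios = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        # Copy every row with input j replaced by its mean value.
        masked = [r[:j] + [mean_j] + r[j + 1:] for r in rows]
        ratios.append(rmse(predict, masked, targets) / baseline)
    return ratios

# Hypothetical toy model: depends on input 0 only, ignores input 1.
def toy_model(row):
    return 2.0 * row[0]

rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
targets = [2.1, 3.9, 6.2, 7.8]
ratios = sensitivity_ratios(toy_model, rows, targets)
```

In this toy case the ratio for the ignored input is 1 and the ratio for the input the model depends on is well above 1, which is the contrast a sensitivity ranking relies on.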
Network types used in the best five models were type 1 (models
1 and 4), type 8 (model 161), type 10 (model 37) and type 11
(model 129). In the type 1 network, as referred to before, the
three series were distributed, alternated and overlapped along the
entire time period. Although, in this case, the three series were
not totally independent in time, there was no overtraining of the
model, since correlation results for the three series were similar.
Model 161 was a type 8 network, in which the three series are
independent in time: first the test series, then the verification
series and, finally, the training series. In network types 10 and 11,
the test series included one of the three peaks in cyanobacteria
density that occurred during the three years of data
collection (second peak in type 10 and third peak in type 11). As
training and verification series were alternated during the
remaining time period, it is possible that this interval
(of approximately two years) was long enough for
these models to capture the functioning of the aquatic system
and, therefore, predict the oscillations in cyanobacteria densities.
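The type 8 arrangement described above, with three series that are independent in time, can be sketched as a split of time-ordered samples into consecutive blocks. The function name and the block sizes below are hypothetical; the text does not state how many samples went into each series.

```python
def split_type8(samples, n_test, n_verification):
    """Split time-ordered samples into consecutive, non-overlapping blocks.

    Order follows the type 8 description: first the test series, then the
    verification series and, finally, the training series.
    """
    if n_test + n_verification >= len(samples):
        raise ValueError("no samples left for the training series")
    test = samples[:n_test]
    verification = samples[n_test:n_test + n_verification]
    training = samples[n_test + n_verification:]
    return test, verification, training

# Hypothetical example: 20 sampling dates, indexed in time order.
samples = list(range(20))
test, verification, training = split_type8(samples, 4, 4)
```

Because the blocks do not overlap, a good test correlation cannot come from test samples sitting between training samples in time, which is the concern raised for the type 1 arrangement.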
Conclusions
Several models were created to forecast cyanobacterial
density 15 days in advance, using information
collected in the Torrao reservoir over a period of three years. A
time lag of 11 was used, equivalent to one sample (periods of 15
days in the summer and 30 days in the winter), and the selected
smoothing value was 1.7. The model with the highest
average correlation value across the three series had
correlations of 0.991, 0.843 and 0.978 for the training, verification and test series,
respectively. These are high correlation values,
considering that the sampling period was only three years.
The prediction accuracy of cyanobacterial density could have
been improved if data covering a longer period
had been included. This model corresponds to a type 8 network, in which the
three series are independent in time: first the test series, then the
verification series and, finally, the training series. Only six input
variables were considered significant to the performance of
this model: ammonia, phosphates, dissolved oxygen, water
temperature, pH and water evaporation, with the physical and chemical
data collected from the three depths of the reservoir. These variables
are common to the other four best models produced and,
although those models included other input variables, their
performance was no better than that of model 161.
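As a reminder of how a general regression neural network produces a forecast, the sketch below uses the standard GRNN form: a Gaussian-kernel weighted average of the training targets, with the smoothing value as the kernel width. It is a minimal illustration under stated assumptions, not the software used in this study. The smoothing value of 1.7 comes from the text; the function names, the pairing of each sample with the density of the following sample, and the toy data are hypothetical.

```python
import math

def grnn_predict(x, train_x, train_y, sigma=1.7):
    """Gaussian-kernel weighted average of training targets (GRNN form)."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All kernels underflowed: fall back to the mean training target.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

def make_lagged_pairs(inputs, density):
    """Pair the inputs of sample t with the density of sample t + 1."""
    return inputs[:-1], density[1:]

# Hypothetical toy data: (water temperature, pH) and cyanobacterial density.
inputs = [[15.0, 7.2], [18.0, 7.5], [22.0, 8.1], [25.0, 8.6], [21.0, 8.0]]
density = [1.0, 1.5, 3.0, 6.0, 4.0]

train_x, train_y = make_lagged_pairs(inputs, density)
forecast = grnn_predict(inputs[-1], train_x, train_y)
```

A forecast of this kind always lies between the smallest and largest training targets, so the network cannot extrapolate beyond the densities it has already seen; this is one reason a longer sampling period would be expected to help. In practice the inputs would normally be scaled before distances are computed, a step omitted here.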
It is important to note that the variables selected in
model 161 were those whose combination led to the best
prediction of the evolution of the density of cyanobacteria
through time, not necessarily those that explain that evolution.
The exclusion of certain variables does not mean that those
parameters do not determine the occurrence of cyanobacteria
blooms or that their information was not considered. We cannot
establish a relation of cause and effect between the input and the
output, and the useful information contained in those variables
may be present, in a secondary form, in other considered
parameters.
The presented models may predict cyanobacteria blooms
far enough in advance to give a water treatment plant
time to prepare a more efficient treatment system, or to allow a local
public health officer to organize a monitoring program to
prevent human health risks due to direct exposure to cyanobacteria
and their toxins. Nevertheless, these models should be
continuously fed with new data every year so as to better respond
to unpredictable changes in the environment.
Acknowledgements
The authors acknowledge the Municipality of Marco de Can-
aveses for the help during sampling and the Portuguese Science
Foundation (FCT) for partially funding the project.
References
1 F. Recknagel, M. French, P. Harkonen and K.-I. Yabunaka, Ecol. Modell., 1997, 96, 11–28.
2 H. Wilson and F. Recknagel, Ecol. Modell., 2001, 146, 69–84.
3 H. Maier, G. Dandy and M. Burch, Ecol. Modell., 1998, 105, 257–272.
4 H. R. Maier, T. Sayed and B. J. Lence, Ecol. Modell., 2001, 146, 65–96.
5 H. R. Maier and G. Dandy, Math. Comput. Modell., 2001, 33, 669–682.
6 L. Oliva Teles, E. Pereira, M. Saker and V. Vasconcelos, Environ. Manage., 2006, 38, 227–237.
7 R. Leitao and A. I. Lopes, A utilizacao dos recursos da parte portuguesa da bacia hidrográfica do Rio Douro para producao de energia eléctrica, Hidrorumo, Projecto e Gestao, Porto, 2000.
8 L. Oliva Teles, E. Pereira, M. Saker and V. Vasconcelos, Lakes and Reservoirs: Research and Management, 2008, 13, 135–143.
9 C. S. Reynolds, The Ecology of Freshwater Phytoplankton, Cambridge University Press, 1984, p. 384.
10 M. Qi and G. P. Zhang, Eur. J. Oper. Res., 2001, 132, 666–680.
11 J. H. W. Lee, Y. Huang, M. Dickman and A. W. Jayawardena, Ecol. Modell., 2003, 159, 179–201.
12 J. Bobbin and F. Recknagel, Ecol. Modell., 2001, 146, 253–262.
13 J. Bobbin and F. Recknagel, Environ. Int., 2001, 27, 237–242.
14 I. G. Prokopkin, V. G. Gubanov and M. I. Gladyshev, Ecol. Modell., 2006, 190, 419–431.
15 B. Wei, N. Sugiura and T. Maekawa, Water Res., 2001, 35(8), 2022–2028.
16 N. Takamura, A. Otsuki, M. Aizaki and Y. Nojiri, Archives fur Hydrobiologie, 1992, 24(2), 129–148.
17 R. C. Nijboer and P. F. M. Verdonschot, Ecol. Modell., 2004, 177, 17–39.
18 D. W. Stanley and J. E. Hobbie, Limnol. Oceanogr., 1981, 26, 30–42.
19 E. P. Odum, Fundamentos de Ecologia, 4a Edicao, Fundacao Calouste Gulbenkian, Lisboa, 1971.
20 H. W. Paerl, Growth and Reproductive Strategies of Freshwater Blue-Green Algae (Cyanobacteria), in Growth and Reproductive Strategies of Freshwater Phytoplankton, ed. C. D. Sandgren, Cambridge University Press, New York, 1988, pp. 261–315.
21 B. Guven and A. Howard, Sci. Total Environ., 2006, 368, 898–908.
22 D. Tilman and R. Kiesling, Freshwater Algal Ecology: Taxonomic Tradeoffs in the Temperature Dependence of Nutrient Competitive Abilities, in Current Perspectives in Microbial Ecology, American Society Microbiology, 1984, pp. 314–319.
23 D. Tilman, R. Kiesling and R. Sterner, Archives fur Hydrobiologie, 1986, 106, 473–485.
24 T. D. Brock, Evolutionary and Ecological Aspects of the Cyanophytes, in The Biology of the Blue-Green Algae, ed. N. G. Carr and B. A. Whitton, Blackwell Scientific Publications, Oxford, 1973, pp. 487–500.
25 W. A. Kratz and J. Myers, Am. J. Bot., 1955, 42, 282–287.
26 A. M. Turner, J. C. Trexler, C. F. Jordan, S. J. Slack, P. Geddes, J. H. Chick and W. F. Loftus, Conserv. Biol., 1999, 13, 898–911.
27 S. S. Dixit, A. S. Dixit and J. P. Smol, Freshwater Biol., 1991, 26, 251–265.
28 R. Margalef, Oecologia Aquatica, 1978, vol. 3, pp. 97–132.
29 C. S. Reynolds, Holarctic Ecol., 1980, 3, 141–159.
30 K.-S. Jeong, D.-K. Kim, P. Whigham and G.-J. Joo, Ecol. Modell., 2003, 161, 67–78.
31 C. Sullivan, J. Saunders and D. Welsh, Phytoplankton of the River Murray, 1980–1985. Water Quality Report No. 2, Murray–Darling Basin Commission, Canberra, 1988.
32 K. G. Sellner, R. V. Lacouture and C. R. Parrish, J. Plankton Res., 1988, 10, 49–61.
33 H. W. Paerl, J. L. Pinckney and T. F. Steppe, Environ. Microbiol., 2000, 2(1), 11–26.
34 B. J. Robson and D. P. Hamilton, Mar. Freshwater Res., 2003, 54, 139–151.
35 P. T. Orr, G. J. Jones and G. B. Douglas, Mar. Freshwater Res., 2004, 55, 277–283.
36 Y. Liu, Effects of Salinity on the Growth and Toxin Production of a Harmful Algal Species, Microcystis aeruginosa, Water Environment Federation, 2006, vol. 1, J.U.S. SJWP.
37 T. Masters, Practical Neural Network Recipes in C++, Academic Press, San Diego, CA, 1993.