Heterogeneity-constrained random resampling of phytosociological databases
Attila Lengyel, Milan Chytrý & Lubomír Tichý
Keywords
Data representativeness; Point pattern;
Relevé; Ripley’s K function; Sample plot;
Selection; Stratification; Vegetation survey
Received 12 October 2009
Accepted 28 September 2010
Co-ordinating Editor: Valerio Pillar
Lengyel, A. (corresponding author, e-mail:
[email protected]): Department of Plant
Taxonomy and Ecology, Eotvos Lorand
University, Pazmany Peter s. 1/C, H-1117
Budapest, Hungary
Chytry, M. ([email protected]) & Tichy, L.
([email protected]): Department of Botany and
Zoology, Masaryk University, Kotlarska 2,
CZ-61137 Brno, Czech Republic
Abstract
Aim: Phytosociological databases often contain unbalanced samples of real vege-
tation, which should be carefully resampled before any analyses. We propose a
new resampling method based on species composition, called heterogeneity-
constrained random (HCR) resampling.
Method: Many subsets of the source vegetation database are selected ran-
domly. These subsets are sorted by decreasing mean dissimilarity between pairs
of the vegetation plots, and then sorted again by increasing variance of these
dissimilarities. Ranks from both sortings are summed for each subset, and the
subset with the lowest summed rank is considered as the most representative.
The performance of this method was tested using simulated point patterns that
represented different levels of aggregation of vegetation plots within a data-
base. The distributions of points in the subsets resulting from different
resampling methods, both with and without database stratification, were
compared using Ripley’s K function. The mean of random selections from an
unbiased sample was used as a reference in these comparisons. The efficiency of
the method was also demonstrated with real phytosociological data.
Results: Both stratified and HCR resampling yielded selection patterns more
similar to the reference than resampling without these tools. Outcomes from
the resampling that combined these two methods were the most similar to the
reference. The efficiency of the HCR resampling method varied with different
levels of aggregation in the database.
Conclusions: This new method is efficient for resampling phytosociological
databases. As it only uses information on species occurrences/abundances, it
does not require the definition of strata, thereby avoiding the effect of
subjective decisions on the selection outcome. Nevertheless, this method can
also be applied to stratified databases.
Introduction
Large phytosociological databases, i.e. databases of vege-
tation survey plots (Schaminee et al. 2009), usually
contain data from many different sources. Each data set
included in such databases had been collected for a certain
purpose, was focused on a specific vegetation type or area
and, accordingly, reflected the specific requirements de-
fined by the original researchers. Consequently, some
areas or habitats are over-represented, and others are
under-represented in large databases. Analyses based on
such heterogeneous data sets face a real danger that the
results will mainly reflect patterns existing in the areas or
habitats represented by disproportionately many plots
(Rolecek et al. 2007). To reduce such bias, Knollova et al.
(2005) proposed several methods for the stratified resam-
pling of phytosociological databases. In principle, these
methods divide the database into plot groups (called
strata) according to the variables that influence (or are
supposed to influence) the between-plot variation in
species composition. Each stratum contains plots that
have similar values of such variables. Assuming that plots
from close sites are more likely to have a similar species
composition than those from distant sites (Legendre 1993;
Nekola & White 1999), the strata can also be defined
based on the geographical position of individual plots.
After stratification, the database is resampled by
randomly selecting a predefined number of vegetation
plots from each stratum, thus creating a subset of the
original database in which over-sampling of some areas or
habitats is reduced.
Journal of Vegetation Science 22 (2011) 175–183. Doi: 10.1111/j.1654-1103.2010.01225.x © 2010 International Association for Vegetation Science
This way of stratified resampling creates smaller but
more balanced data sets (Knollova et al. 2005; Haveman
& Janssen 2008) because the number of plots that repre-
sent common vegetation types (or over-sampled areas) is
reduced, so as to be similar to the number of plots that
represent rare types (or under-sampled areas). However,
the success of the improvement strongly depends on the
way the database is stratified. The choice of stratifying
variables may limit the researchers’ ability to explore the
part of the total variation that is unrelated to these
variables. Certainly, previous knowledge of the relation-
ship between vegetation and the environment can help
select environmental variables that most probably ac-
count for the major vegetation patterns (Smartt & Grain-
ger 1974; Gillison & Brewer 1985; Bunce et al. 1996;
Goedickemeier et al. 1997), and these variables can be
used for stratification. However, Knollova et al. (2005)
demonstrated that vegetation classification results depend
on the variables used for stratification. For example, if
altitude is used as one of the variables to define strata, the
resampled data set is likely to reflect altitudinal vegetation
differentiation more strongly than another data set re-
sampled within strata based on variables other than
altitude. The resolution of the stratifying variables is also
important, e.g. the number of categories in categorical
variables (Holt et al. 2008). In addition, the variables with
the most direct effect on species composition are not often
available with a reasonable resolution. Therefore in prac-
tical applications, some of the available variables are
usually subjectively selected based on the researcher’s
hypotheses about the relationships of the studied vegeta-
tion to the environment, and this subjectivity can affect
the results of exploratory analyses (Knollova et al. 2005).
Hence, a more straightforward and less subjective way
of database stratification in exploratory studies of ecologi-
cal communities might be based directly on vegetation
properties rather than environmental variables. In studies
of community diversity, such a property can be species
composition. Knollova et al. (2005) proposed two meth-
ods to define strata based on species composition: indica-
tor values (Ellenberg et al. 1991) and clusters from
numerical classifications of the original database. They
resampled a database by the random selection of a pre-
defined number of plots from each stratum. Another
stratification method based on a measure of between-plot
dissimilarity in species composition was proposed by De
Caceres et al. (2008). However, the resampled data sets
and the vegetation classifications based on them differ
due to arbitrary decisions regarding the chosen indicator
systems, numerical classification methods or cover-abun-
dance transformations (Knollova et al. 2005).
Another problem with random resampling within stra-
ta is that although strata are defined to maximize be-
tween-stratum heterogeneity among plots and to
minimize within-stratum heterogeneity, considerable
heterogeneity can exist even within strata. Random
selection of plots from internally heterogeneous strata
can yield subsets of vegetation plots in which initially rare
vegetation types are still under-represented or missing,
and initially common types remain over-represented.
Therefore, there is a need for a resampling method that
does not require a priori stratification of the database, and
which performs a non-random plot selection that pre-
serves the between-plot heterogeneity of the initial data
set while reducing the number of plots from vegetation
types that are over-represented in the initial database.
Such a method was proposed by Feoli & Feoli Chiapella
(1979), based on contributions of single plots to the total
sum-of-squares of all plots in a database. This method
iteratively ranks plots by decreasing contributions, and
selects those with the highest contributions for the analy-
sis. This method was applied to vegetation classification
by Feoli & Lagonegro (1982). However, there are some
disadvantages with this method: (1) the sum-of-squares
criterion does not allow the use of any other measures to
express within-stratum heterogeneity; (2) outliers are
preferred in the selections; (3) this method can only
handle data matrices with a higher number of species
than plots.
Methods of selection of a heterogeneous subset from a
larger, unbalanced database have also been sought in the
field of conservation biology, where the maintenance of a
high genetic diversity is a central goal; therefore, a higher
priority should be given to a set of genetically unique
species or populations. Pavoine et al. (2005) compared
several methods aimed at the ranking of species according
to genetic isolation based on cladogram topologies, and
they also proposed a new measure called ‘‘originality of a
species’’. However, similar to the method of Feoli & Feoli
Chiapella (1979), these measures suffer from the draw-
back of subset heterogeneity maximization; therefore
they cannot be simply adapted to vegetation-plot data-
bases, since their preference toward outliers may pose a
serious problem in studies of vegetation diversity.
In this paper we introduce a new, more flexible heur-
istic method called ‘‘heterogeneity-constrained random
resampling’’ (HCR resampling) for the selection of a
representative subsample from a database based on species
composition. The name reflects the fact that, of many random
resampling trials, only one is accepted: the trial that fulfils the
requirement of high heterogeneity not caused by
the presence of outliers; in contrast, many other possible
random resampling trials are rejected. We tested and
demonstrated the performance of this new method using
simulated and real data sets.
Methods
The new method
The new method is a heuristic alternative to the random
resampling of vegetation plots from a large database that
involves heterogeneity constraints. As the initial step, the
number of plots to be selected from the database is defined.
Theoretically, there are S!/[s!(S − s)!] possible selections of s
plots from a database of S plots. Some of these selections
represent only a small part of the total variation in the
database, thus they do not fulfil the criterion of maximum
heterogeneity, which is usually desirable (Kenkel et al.
1989; Rolecek et al. 2007). This method performs N selec-
tion trials, where N is set a priori. Each trial randomly
selects s plots from the database. Then, since we are looking
for a heterogeneous subset of plots, the trials are ranked by
decreasing heterogeneity. Heterogeneity is measured as the
mean of all pair-wise dissimilarities in species composition
between selected plots in a trial. Any measure of pair-wise
dissimilarity appropriate for the given data type can be
used. The most heterogeneous selection trials tend to be
those that contain plots from different extremes of variation
in species composition, especially outliers. However, such
selections are not the most appropriate ones, since they
tend to contain few plots with the most typical species
composition. Therefore, the selection trials are ranked
again, but now by increasing variance within the matrix of
between-plot dissimilarities. In this rank order the trials
that contain outlying plots are rarely ranked near the top
because the inclusion of outliers tends to increase the
variance in dissimilarities. In contrast, low variance is often
found in the trials with low heterogeneity, which only
represent the variation in species composition for parts of
the original database. At the same time, trials with, on
average, high but roughly equal between-plot distances
also meet this criterion. Finally, ranks for decreasing hetero-
geneity and increasing variance are summed and trials are
ranked again by their summed ranks. In the new order of
the summed ranks, the first trial (the one with the lowest
summed rank) is expected to include a selection of plots
that are representatively distributed across the total range of
species composition variation. This method can be applied
either directly to an unstratified data set, or it can be also
applied to a stratified data set and used for resampling
within strata. It can be computed in the JUICE program for
the classification and analysis of ecological data (version
7.0; L. Tichy; freely available at http://www.sci.muni.cz/
botany/juice). A simple calculation example is provided in
Appendix S1.
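The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the JUICE implementation: the function names `rank_positions` and `hcr_resample` are hypothetical, ties are broken by trial order (the text does not say how ties are handled), and the population variance is used (with a fixed number of plots per trial, the sample variance would give the same ranking).

```python
import itertools
import math
import random
import statistics


def rank_positions(values, reverse=False):
    """Return the rank (1 = first) of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def hcr_resample(plots, s, n_trials, dissimilarity, rng=None):
    """Sketch of HCR resampling: return the plot indices of the best-ranked trial."""
    rng = rng or random.Random()
    trials, means, variances = [], [], []
    for _ in range(n_trials):
        subset = rng.sample(range(len(plots)), s)  # one random selection trial
        d = [dissimilarity(plots[i], plots[j])
             for i, j in itertools.combinations(subset, 2)]
        trials.append(subset)
        means.append(statistics.fmean(d))          # heterogeneity of the trial
        variances.append(statistics.pvariance(d))  # variance of the dissimilarities
    rank_mean = rank_positions(means, reverse=True)  # decreasing mean dissimilarity
    rank_var = rank_positions(variances)             # increasing variance
    summed = [a + b for a, b in zip(rank_mean, rank_var)]
    best = min(range(n_trials), key=lambda t: summed[t])  # lowest summed rank
    return sorted(trials[best])
```

With plots given as co-ordinate tuples, a call such as `hcr_resample(points, 5, 1000, math.dist)` would use Euclidean distance; any other pair-wise dissimilarity function can be passed in its place.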
Test concepts
Since there is no objective standard for ‘‘representative-
ness’’ or ‘‘ecological meaningfulness’’ of ecological data sets,
the degree of improvement of a data set after resampling is
difficult to measure. Therefore, we tested the performance
of the HCR resampling method using simulated data sets
with a known structure. Let us consider a metric ordina-
tion diagram of vegetation plots in which, for simplicity,
the total variation in species composition is fully explained
by the first two axes, and the between-point distances are
proportional to the dissimilarities between plots. Supposing
continuous variation in the species composition of vegeta-
tion, theoretically there is a very high number of possible
species compositions that correspond to certain locations
within the ordination diagram. The set of all possible
species compositions is the statistical population in the
ordination space, of which subsets of plots (or points) can
be taken as its samples. Thus, database structures originat-
ing from different plot location schemes can be simulated
by generating artificial point patterns on the ordination
diagram. For example, dense groups of points on the
ordination diagram would represent a situation in which
plots are located in very similar (or spatially very close)
vegetation, whereas evenly distributed points on the ordi-
nation diagram would represent a stratified or systematic
vegetation sampling.
Following this logic, we simulated different database
structures using artificial point scatters in a finite region of
a plane, hereafter called a ‘‘window’’. We resampled these
point scatters both randomly and using the new method,
in each case with and without stratification (Table 1). To
evaluate the outcomes of the different resampling meth-
ods, we used a spatial statistical tool, Ripley’s K function
(Ripley 1977; Fortin & Dale 2005). This function quanti-
fies the spatial randomness of a point pattern by estimat-
ing the number of points within a range of distances
(denoted as r) from any point within the area of the
pattern. In the case of randomly distributed points, the K
function grows quadratically with increasing distance, K(r) = πr².
The K function for aggregated point patterns yields higher
values than those for randomly distributed point patterns,
while those for regular point patterns are lower than for
the random case. We assume that the statistical
Table 1. Resampling schemes used for the tests with artificial point patterns. Point aggregations refer to initial samples of points in the window.

Point aggregation    Random       Evenly aggregated   Unevenly aggregated
Stratification       Yes   No     Yes   No            Yes   No
Random resampling    RSR   RNR    ESR   ENR           USR   UNR
HCR resampling       RSH   RNH    ESH   ENH           USH   UNH
population in this simulation includes all possible points
in the window, while scatters of selected points in the
window are its more or less biased samples. Our aim is to
obtain a subsample from the sample of points that is
representative of the whole window, i.e. of the statistical
population. Obviously, random sampling is an appropri-
ate procedure for obtaining a representative sample of the
window, and its average random subsample should also
be representative of the region. In contrast, a non-random
subsample is a biased representation of the region, and
the aim of resampling is to reduce this bias. As the aim of
resampling is to obtain a subsample in which the sampling
units (e.g. points representing vegetation plots in this
example) are more evenly distributed than in the original
sample, and random sampling offers an acceptable repre-
sentation of the continuous statistical population, it seems
a straightforward solution to test whether or not a sub-
sample of a sample population shows random pattern
characteristics. Since Ripley’s K function is an appropriate
tool for testing the randomness of point patterns, mean K
values for random subsamples of random samples of the
points from the window can be used as references for
other subsamples of non-random samples of the points
from the same window.
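To make the role of the K function concrete, the following sketch estimates K(r) for a random point pattern and compares it with the expectation under complete spatial randomness, πr². It is a simplified illustration only: the function name `ripley_k` is hypothetical, the window orientation (1000 wide, 500 high) is an assumption, and no edge correction is applied, so values near the window boundary are biased downwards.

```python
import math
import random


def ripley_k(points, r_values, area):
    """Naive estimate of Ripley's K at each distance r (no edge correction)."""
    n = len(points)
    # Distances between all ordered pairs i != j (each unordered pair counted twice).
    distances = [math.dist(points[i], points[j])
                 for i in range(n) for j in range(n) if i != j]
    return [area * sum(d <= r for d in distances) / (n * (n - 1))
            for r in r_values]


rng = random.Random(0)
width, height = 1000, 500  # assumed orientation of the window
pts = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(180)]
radii = [20, 40, 60]
k_hat = ripley_k(pts, radii, width * height)
k_csr = [math.pi * r ** 2 for r in radii]  # expectation for randomly distributed points
```

An aggregated pattern would give `k_hat` values above `k_csr`, and a regular pattern would give values below it.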
Artificial data sets
The performance of HCR resampling was demonstrated
using simple point scatters consisting of five aggregations:
350 points (representing a biased sample of 350 vegeta-
tion plots) were assigned to the aggregations by random
sampling with replacement, thus on average 70 points
belonged to each aggregation. Selections of point subsets
from different positions of the final rank order of 1000
trials of HCR resampling were compared.
Then, three point scatters were created, each of them
representing a different degree of aggregation. The first
scatter was simply a random draw of co-ordinates for 180
points in a plane region with dimensions of 500 × 1000
points. The second scatter was created by the assignment
of 180 points to the co-ordinates of the first scatter by
random sampling with replacement. Thus, some of the
co-ordinates were selected more than once, whereas
others were not selected. However, overlapping points
would be unrealistic because vegetation plots with iden-
tical species compositions are extremely rare in species-
rich vegetation. Therefore, points were moved to a new
random position based on the two-dimensional normal
distribution with the mean equal to the originally as-
signed point and SD = 5. As a result, a point scatter with
more or less even-sized aggregations was created. The
third scatter differed from the second one in the size of the
aggregations. Points were assigned to co-ordinates using a
selection weighted by a probability vector following the
broken-stick distribution (Legendre & Legendre 1998).
Thus, in the third scatter there were point aggregations
with larger differences in size, i.e. a more realistic model of
a real community sample than the previous two scatters.
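The construction of the three scatters can be sketched as follows. The names `broken_stick` and `make_scatters` are hypothetical, and several details are assumptions: the broken-stick vector is taken as the expected relative piece sizes, p_i = (1/n) Σ_{k=i..n} 1/k, the window is taken as 1000 wide and 500 high, and displaced points are not forced back into the window.

```python
import random


def broken_stick(n):
    """Expected relative sizes of n pieces of a randomly broken stick (largest first)."""
    return [sum(1 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


def make_scatters(n=180, width=1000, height=500, sd=5, rng=None):
    """Return the random, evenly aggregated and unevenly aggregated scatters."""
    rng = rng or random.Random()
    # Scatter 1: random co-ordinates in the window.
    random_pts = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

    def jitter(centres):
        # Move each point to a normally distributed position around its centre.
        return [(rng.gauss(x, sd), rng.gauss(y, sd)) for x, y in centres]

    # Scatter 2: co-ordinates of scatter 1 drawn with replacement, equal weights.
    even = jitter(rng.choices(random_pts, k=n))
    # Scatter 3: the same draw, weighted by a broken-stick probability vector.
    uneven = jitter(rng.choices(random_pts, weights=broken_stick(n), k=n))
    return random_pts, even, uneven
```

Because the weights in the third scatter decline steeply, a few co-ordinates attract many points while most attract none, which produces aggregations of very different sizes.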
Comparison of the resampling methods
The three point scatters were resampled, each by four
methods, resulting in a total of 12 outcomes (Table 1). For
cases with stratification, the scatters were divided into
nine strata using sections of equal length along the longer
side of the window. The number of points to select was set
to 90, and in the case of stratified resampling, 10 points
were selected from each stratum. In HCR resampling,
Euclidean distances of points were used for heterogeneity
calculations. When the strata were defined, 100 trials
were run for sorting out the best point combination from
each stratum separately. Without stratification, the num-
ber of trials was 10 000.
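A sketch of the stratified random case may help to fix the set-up: nine sections of equal length along the longer side, with 10 points drawn from each. The function names are hypothetical, the longer side is assumed to be the x axis with length 1000, and the handling of strata with fewer than 10 points is an assumption, since the text does not describe that case.

```python
import random


def stratify(points, n_strata=9, length=1000):
    """Group point indices into equal-length sections along the x axis."""
    strata = [[] for _ in range(n_strata)]
    for index, (x, _) in enumerate(points):
        # Points on or beyond the window edge are assigned to the outermost section.
        section = min(max(int(x / length * n_strata), 0), n_strata - 1)
        strata[section].append(index)
    return strata


def stratified_random(points, per_stratum=10, rng=None, **kwargs):
    """Randomly select up to `per_stratum` point indices from each stratum."""
    rng = rng or random.Random()
    selected = []
    for members in stratify(points, **kwargs):
        # A stratum with fewer points than requested contributes all of them
        # (an assumption; the paper does not describe this case).
        selected.extend(rng.sample(members, min(per_stratum, len(members))))
    return selected
```

For the combined scheme, the single `rng.sample` call per stratum would be replaced by many random draws from that stratum, of which only the best-ranked one is kept.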
After selection by the four methods, the points were re-
allocated to the scatter. Ripley’s K function was calculated
using isotropic edge-correction (Ripley 1988) for the selec-
tion patterns from the four methods. Resampling and
calculation of the K function were repeated 100 times.
Mean K function values were compared graphically with
reference lines and 2.5–97.5% quantiles derived from 100
RNR cases (Table 1) instead of using a function derived
from the theoretical Poisson point process.
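The reference line and its 2.5–97.5% band can be derived pointwise from the repeated K curves. The sketch below assumes that each repetition yields a list of K values at the same distances; the names `envelope` and `within` are hypothetical, and the quantile interpolation is whatever `statistics.quantiles` uses by default, which may differ slightly from the method used by the authors.

```python
import statistics


def envelope(curves):
    """Pointwise mean and 2.5-97.5% quantiles over repeated K curves.

    `curves` is a list of equally long lists, one per repetition.
    """
    mean, lower, upper = [], [], []
    for values in zip(*curves):                    # all repetitions at one distance r
        cuts = statistics.quantiles(values, n=40)  # cut points every 2.5%
        mean.append(statistics.fmean(values))
        lower.append(cuts[0])                      # 2.5% quantile
        upper.append(cuts[-1])                     # 97.5% quantile
    return mean, lower, upper


def within(curve, lower, upper):
    """True at each distance where a curve lies inside the reference band."""
    return [lo <= k <= hi for k, lo, hi in zip(curve, lower, upper)]
```

Applying `within` to the mean curve of another resampling scheme shows at which distances that scheme is indistinguishable from the reference.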
Real data example
A data set containing 35 vegetation plots from southern
Hungarian mesic meadows (Lengyel 2010), each of 25 m²
in size, was used to demonstrate the performance of the
HCR method with real data. The plots were from three
meadow types that were represented disproportionately
in the data set:
(1) moderately grazed pastures and hayfields on nutri-
ent-rich soil (25 plots);
(2) intensively grazed colline pastures on nutrient-poor
and drier soil (five plots);
(3) intensively grazed lowland pastures on nutrient-rich
and wetter soil (five plots).
The data set was analysed by principal coordinates
analysis (PCoA) using the Jaccard index as the distance
measure (Legendre & Legendre 1998). Then the data set
was resampled both randomly and using the HCR meth-
od. Subsamples of 10 plots were partitioned into three
clusters around medoids (PAM, Kaufman & Rousseeuw
1990) with Jaccard index as distance measure. PAM is a
non-hierarchical classification method similar to K-means
clustering, but it is generalized so that any distance
measure can be used. Points representing vegetation
plots, with their cluster assignment, were visualized on a
PCoA ordination diagram in order to assess how general
patterns remained identifiable from the subsamples.
Computations were performed in the R environment
(R Development Core Team, http://www.r-project.org),
using the vegan (http://cc.oulu.fi/~jarioksa/softhelp/vegan.html),
spatstat (http://www.jstatsoft.org) and cluster (by
M. Maechler) packages.
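The classification step can be illustrated in Python for a small subsample. This sketch is not the PAM algorithm from the R cluster package: for a subsample of 10 plots it is feasible to try every set of three medoids, which optimizes the same criterion (total dissimilarity of plots to their nearest medoid) by exhaustive search. Plots are assumed to be presence/absence data given as sets of species names, and both function names are hypothetical.

```python
import itertools


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two plots given as sets of species."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def k_medoids_exhaustive(plots, k=3, dissimilarity=jaccard_dissimilarity):
    """Try every set of k medoids and keep the one with the lowest total
    dissimilarity between plots and their nearest medoid."""
    n = len(plots)
    d = [[dissimilarity(plots[i], plots[j]) for j in range(n)] for i in range(n)]
    best_cost, best_medoids = None, None
    for medoids in itertools.combinations(range(n), k):
        cost = sum(min(d[i][m] for m in medoids) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Label each plot with the index of its nearest medoid.
    labels = [min(best_medoids, key=lambda m: d[i][m]) for i in range(n)]
    return best_medoids, labels
```

The exhaustive search is practical only for small subsamples; for larger ones a heuristic such as PAM is needed.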
Results
The differences among the random plot selections sorted
by summed heterogeneity and variance ranks are shown
in Fig. 1. The best selection included points from each of
the five point aggregations, while other selections tended
to over-represent certain point aggregations. The values
of Ripley’s K function resulting from repeated resampling
of the point scatters by random and HCR resampling with
and without stratification are compared in Fig. 2. If the
scatter was random (Fig. 2a), the mean curves of all of the
resampling cases (RSR, RNR, RSH, RNH) were close to
each other. Some tendency towards regularity in the
point distribution can be seen in the subsets resulting
from HCR resampling (lower average K values in RNH
and RSH cases); however, these differences were within
the confidence interval. Resampled point scatters from the
evenly aggregated point pattern (Fig. 2b) were character-
ized by higher K values, but when stratification or HCR
resampling was applied they were closer to the reference
lines. Point patterns resulting from HCR resampling com-
bined with stratification (ESH) fell within the confidence
interval for resampling of the random scatter, particularly
at the larger scale (large values of r). In general, ENR
provided the worst results of the four resampling methods.
However, for the unevenly aggregated scatter (Fig. 2c), the curves
of random and HCR resampling showed little, and possibly
no significant, difference; stratification still remained a
moderately useful tool in this situation.
The example with a real unbalanced data set (Fig. 3)
demonstrated that after random resampling the classifica-
tion failed to recognize clear vegetation types in the data
[Figure 1: four scatter plots (Axis 1 vs. Axis 2) showing the selections at positions 1, 300, 600 and 1000 in the final ranking.]
Fig. 1. Selections of points in a subsample corresponding to different rank positions according to the HCR resampling method. Grey points represent a
sample; they were drawn from a two-dimensional normal distribution around five randomly chosen cores, with 70 points per core on average. The
five black points in each graph represent the subsample selected in the corresponding random resampling trial.
set because it divided the over-represented Type 1 into
two clusters and merged plots from the two rarer types
into one cluster. In contrast, all three vegetation types
were correctly distinguished in the subsample obtained
with the HCR resampling method.
Discussion
The new method for resampling phytosociological databases
presented in this paper has multiple advantages
over random subsample selection within strata,
which was used in previous studies (Knollova et al.
2005; Haveman & Janssen 2008). Although there is no
objective measure of the quality of subsample selection,
redundancies within subunits of a data set should be
removed by resampling because they may negatively
affect the results of the analyses. Analyses of non-re-
sampled data sets can reveal patterns that are actually
mainly typical of the over-sampled subunit(s) of the data
set, but not of the whole sampling universe. This issue is
conceptually similar to the problem of pseudo-replica-
tions in experimental studies (Hurlbert 1984). Figure 1
demonstrates that some randomly selected subsamples
were poorly representative of the whole data set and that
HCR resampling was able to identify them and accept
more representative subsamples instead. The subsamples
based on HCR resampling were also markedly more
representative than those based on the relevé ranking
method of Feoli & Feoli Chiapella (1979), which is an
early predecessor of HCR resampling (Appendix S2).
For a more exact evaluation of the resampling out-
comes with reference to a representative sample, we used
Ripley’s K function. Application of stratification and
heterogeneity constraints proved to be appropriate for
reducing bias in the distribution of sampling units. However,
their effect varied among the different cases. If the
sample population was originally randomly distributed,
then neither of them had a notable effect (Fig. 2a). If the
[Figure 2: Ripley's K plotted against distance for (a) Random, (b) Evenly aggregated and (c) Unevenly aggregated scatters. Legend: Not stratified, random; Not stratified, HCR; Stratified, random; Stratified, HCR; Mean of random selections; 2.5–97.5% conf. int.]
Fig. 2. Values of Ripley’s K function for three scatters with (a) random points, (b) evenly aggregated points, and (c) unevenly aggregated points, which
were resampled by random and HCR resampling, each with and without stratification. Higher values indicate a more aggregated selection pattern in the
resampled subset of points. The resampling result is better the closer the values of Ripley’s K function are to the reference line of the mean of random
selections. Values on the horizontal axis are the distances in the window used for the K function estimation.
[Figure 3: PCoA Axis 1 vs. PCoA Axis 2 for (a) Random resampling and (b) HCR resampling. Legend: Type 1, Type 2, Type 3.]
Fig. 3. Principal coordinates analysis diagrams of 35 vegetation plots representing three types of mesic meadow. This data set was resampled using (a)
random resampling and (b) HCR resampling. Ten plots selected by these resampling methods were subsequently classified into three clusters, and
cluster membership is indicated in the ordination diagrams.
points were aggregated, both methods worked well,
particularly if they were used in combination. In our
simulation, HCR resampling within strata was able to
provide selection patterns that reached confidence inter-
vals of reference cases when the points were evenly
aggregated. For the unevenly aggregated scatter, HCR
resampling caused no significant difference in the subsam-
ple selection but stratification proved to be more efficient
alone. At the same time, combined stratification and con-
strained resampling again provided the best subsamples.
Therefore, we suggest here the replacement of the
usual procedure of single random selection of sampling
units within strata by generating many random selections
from the same stratum, followed by identification of the
most representative of these selections using the new
method. Only the most representative selection should
be used for data analysis. A great advantage of the HCR
resampling method is that, in some situations, it selects
subsamples from a biased sample that tend to have a
similar distribution to the random selection from an
unbiased sample (Fig. 2b). However, strong biases in the
initial data sets (e.g. large differences among sizes of
aggregations) cannot be removed by the HCR resampling
method (Fig. 2c). In addition, no resampling can correct
bias due to completely missing vegetation types, which
can only be remedied by obtaining new data.
An important advantage of the HCR resampling meth-
od is that it does not require any a priori stratification of
the initial database. The resampling of a database within
strata proved to be helpful, but stratification may intro-
duce subjective bias (Knollova et al. 2005), and suitable
data for defining the strata may not always be available.
Still, the HCR resampling method can provide the best
results if applied to carefully selected strata. In geographi-
cally extensive broad-scale investigations, spatial autocor-
relation usually affects patterns of species composition
(Legendre 1993; Nekola & White 1999). Since in most
cases there is some spatial information on the plot loca-
tions in the databases, geographical stratification can be a
useful starting stratification for constrained resampling.
Another possibility for strata definition is the numerical
classification of vegetation plots. This can be particularly
useful if the resemblance measure selected for the
classification process is the same as the dissimilarity
measure expressing within-stratum heterogeneity in the
subsequent HCR resampling within clusters defined by
this classification.
The HCR resampling method can also be applied if
between-plot distances are not expressed by dissimilari-
ties in species composition but rather by measured envir-
onmental variables or by spatial distances. This can be
particularly useful for data sets for which we assume a
strong relationship between species composition and a
particular environmental variable, or a strong spatial
effect on species composition. For example, if species
composition is assumed to be highly spatially structured,
spatial distances between the plots can be used instead of
between-plot dissimilarities, from which HCR resampling
will probably select a subset of plots evenly distributed
across the landscape.
Certainly, the choice of the dissimilarity measure deter-
mines the selection pattern from a database. One possible
solution for reducing the effect of subjectivity is to perform
an ‘‘ensemble’’ analysis of distance indices, i.e. to sum rank
scores of trials obtained by several distance measures and
rank trials according to the summed rank scores. The final
set of plots would be the trial that was ranked near the best
trials by most of the distance measures.
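One possible reading of this ensemble idea is sketched below: each distance measure produces a final HCR rank for every trial, the ranks are summed across measures, and the trial with the lowest total is kept. All function names are hypothetical, and whether the authors intend to sum final ranks or the underlying summed ranks is not specified, so this is an assumption.

```python
import itertools
import math
import statistics


def ranks(values, reverse=False):
    """Rank (1 = first) of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out


def trial_scores(plots, trials, dissimilarity):
    """Summed HCR rank (mean rank + variance rank) of each trial for one measure."""
    means, variances = [], []
    for subset in trials:
        d = [dissimilarity(plots[i], plots[j])
             for i, j in itertools.combinations(subset, 2)]
        means.append(statistics.fmean(d))
        variances.append(statistics.pvariance(d))
    return [a + b for a, b in zip(ranks(means, reverse=True), ranks(variances))]


def ensemble_best(plots, trials, measures):
    """Index of the trial with the lowest rank score summed over all measures."""
    per_measure = [ranks(trial_scores(plots, trials, m)) for m in measures]
    totals = [sum(column) for column in zip(*per_measure)]
    return min(range(len(trials)), key=lambda t: totals[t])
```

The same set of trials must be scored by every measure, so the trials are drawn once and passed in as lists of plot indices.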
Obviously, the success of HCR resampling depends
on the number of trials, and this relationship also clarifies
the role of stratification. Consider a database containing S
plots, of which s plots are to be selected by the HCR
resampling method. S!/[s!(S − s)!] distinct trials, i.e. every
possible combination, would have to be run to obtain 100%
probability of finding the best combination of plots.
If S = 100 and s = 60, then 1.374623 × 10^28
trials would be needed. In contrast, if the plots are divided
into C groups (for the sake of simplicity, of equal size) by a
stratification that maximizes between-stratum heterogeneity,
the number of necessary trials would be
(S/C)!/[(s/C)!((S/C) − (s/C))!] for each stratum. If C = 10, then this
number is 210 per stratum, i.e. 210 × 10 = 2100 in total. This huge
difference explains the high efficiency of the combined
application of stratification and the HCR resampling method.
The chance of finding a representative set of plots from a
database can be increased by an appropriate stratification or
by an increased number of trials. Although in our examples
stratification was highly successful, in reality its success is
rarely verifiable, and an ambiguous stratification may result in
non-representative selections. The moderate performance of
HCR resampling with unevenly aggregated scatters suggests
that for aggregations of different sizes, more trials
should be run to find an acceptable selection of plots.
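The two counts given above can be checked with a few lines of Python. The function names are hypothetical, and the sketch assumes, as in the text, that both S and s are divisible by the number of strata C.

```python
from math import comb

def trials_without_strata(n_plots, n_select):
    """Number of distinct subsets: S! / [s! (S - s)!]."""
    return comb(n_plots, n_select)

def trials_with_equal_strata(n_plots, n_select, n_strata):
    """Subsets per stratum and summed over strata, for equal-sized strata."""
    if n_plots % n_strata or n_select % n_strata:
        raise ValueError("S and s must be divisible by C in this sketch")
    per_stratum = comb(n_plots // n_strata, n_select // n_strata)
    return per_stratum, per_stratum * n_strata

unstratified = trials_without_strata(100, 60)
per_stratum, stratified_total = trials_with_equal_strata(100, 60, 10)

print(f"{unstratified:.6e}")          # 1.374623e+28
print(per_stratum, stratified_total)  # 210 2100
```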
Currently, it is not always technically feasible to find the
single best selection from large databases using this
computationally intensive procedure. Nevertheless, this
may not be important, since many of the selections that
were highly ranked by the HCR resampling method
satisfactorily represented the variation of the sample (or
strata). Therefore, a tendency towards particular selection
patterns can be recognized if several repeated runs are
compared, which implies that some combinations of plots
have a very low probability of being selected. These
combinations will almost never appear in the final result;
however, they can provide information about factors
independent of the heterogeneity measure. For example,
if strata are defined by environmental variables but not
geographical positions and HCR resampling is applied
within these strata, two plots with very similar species
composition will almost never be selected together, even
if they were sampled at distant sites. When such resampled
data sets are classified and sites representing
vegetation types are plotted on a geographical map, a type
will seem to be present in only one of the two locations
because only one plot was included in the analysis. To
avoid this danger in vegetation classification studies, we
recommend assigning unselected plots to the final clusters
using compositional dissimilarity measures (e.g. Tichy
2005; van Tongeren et al. 2008) or other supervised
classifiers (e.g. Bruelheide 2000; Cerna & Chytry 2005;
De Caceres et al. 2009). However, these formerly unselected
plots should be used only for mapping, and not for
statistical analyses such as classification, ordination or
determination of diagnostic species of vegetation units;
otherwise these analyses would be unduly affected by the
poorly representative non-resampled data.
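As a simple illustration of such an assignment step, the sketch below allocates each unselected plot to the cluster whose selected members it resembles most on average. This mean-dissimilarity rule is a deliberately simple stand-in and is not one of the indices or classifiers cited above; the function name and the toy data are hypothetical.

```python
import numpy as np

def assign_unselected(dist, selected, labels):
    """Map each unselected plot to the cluster with the lowest mean
    dissimilarity to that cluster's selected plots."""
    selected = np.asarray(selected)
    labels = np.asarray(labels)
    unselected = np.setdiff1d(np.arange(dist.shape[0]), selected)
    clusters = np.unique(labels)
    # Rows: unselected plots; columns: clusters.
    mean_to_cluster = np.column_stack([
        dist[np.ix_(unselected, selected[labels == c])].mean(axis=1)
        for c in clusters
    ])
    nearest = clusters[mean_to_cluster.argmin(axis=1)]
    return dict(zip(unselected.tolist(), nearest.tolist()))

# Toy example: six plots along one gradient, four of them resampled.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
assignment = assign_unselected(toy_dist, selected=[0, 1, 3, 4], labels=[0, 0, 1, 1])
print(assignment)  # {2: 0, 5: 1}
```

In line with the recommendation above, plots assigned in this way would serve only for mapping the distribution of the types.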
To conclude, in most cases the HCR resampling method
is superior to other methods for resampling phytosociological
databases, particularly to random resampling within
strata. The HCR resampling method was shown to perform
well in the tests with artificial point data sets and in
the tests with data sets of vegetation plots. It is a very
promising tool for increasing the representativeness of
vegetation databases or other unbalanced data sets prior
to analyses of vegetation patterns, such as classification or
ordination.
Acknowledgements
We thank David Zeleny for his help with the R program,
Zoltan Botta-Dukat, Enrico Feoli and an anonymous
referee for valuable suggestions on a previous version of
the manuscript, and Enrico Feoli and Paola Ganis for
providing the source code of the releve ranking program
for testing purposes. The study stay of A. L. at Masaryk
University was supported by the Erasmus programme,
and M. C. and L. T. were supported by the Ministry of
Education of the Czech Republic (MSM0021622416).
This study was supported by the Czech Science Foundation
(505/11/0732).
References
Bruelheide, H. 2000. A new measure of fidelity and its
application to defining species groups. Journal of Vegetation
Science 11: 167–178.
Bunce, R.G.H., Barr, C.J., Clarke, R.T., Howard, D.C. & Lane,
A.M.J. 1996. Land classification for strategic ecological
survey. Journal of Environmental Management 47: 37–60.
Cerna, L. & Chytry, M. 2005. Supervised classification of plant
communities with artificial neural networks. Journal of
Vegetation Science 16: 407–414.
De Caceres, M., Font, X. & Oliva, F. 2008. Assessing species
diagnostic value in large data sets: a comparison between
phi-coefficient and Ochiai index. Journal of Vegetation
Science 19: 779–788.
De Caceres, M., Font, X., Vicente, P. & Oliva, F. 2009.
Numerical reproduction of traditional classifications and
automatic vegetation identification. Journal of Vegetation
Science 20: 620–628.
Ellenberg, H., Weber, H.E., Dull, R., Wirth, V., Werner, W. &
Paulißen, D. 1991. Zeigerwerte von Pflanzen in
Mitteleuropa. Scripta Geobotanica 18: 9–160.
Feoli, E. & Feoli Chiapella, L. 1979. Releve ranking based on
sum of squares criterion. Vegetatio 39: 123–125.
Feoli, E. & Lagonegro, M. 1982. Syntaxonomical
analysis of beech woods in the Apennines (Italy) using the
program package IAHOPA. Vegetatio 50: 129–173.
Fortin, M.-J. & Dale, M.R.T. 2005. Spatial analysis – a guide for
ecologists. Cambridge University Press, Cambridge, UK.
Gillison, A.N. & Brewer, K.R.W. 1985. The use of gradient
directed transects or gradsects in natural resource surveys.
Journal of Environmental Management 20: 103–127.
Goedickemeier, I., Wildi, O. & Kienast, F. 1997. Sampling for
vegetation survey: some properties of a GIS-based
stratification compared to other statistical sampling
methods. Coenoses 12: 43–50.
Haveman, R. & Janssen, J.A.M. 2008. The analysis of long-
term changes in plant communities using large databases:
the effect of stratified resampling. Journal of Vegetation
Science 19: 355–362.
Holt, E.A., McCune, B. & Neitlich, P. 2008. Spatial scale of GIS-
derived categorical variables affects their ability to separate
sites by community composition. Applied Vegetation Science
11: 421–430.
Hurlbert, S.H. 1984. Pseudoreplication and the design of
ecological field experiments. Ecological Monographs 54:
187–211.
Kaufman, L. & Rousseeuw, P.J. 1990. Finding groups in data. An
introduction to cluster analysis. Wiley, New York, NY, US.
Kenkel, N.C., Juhasz-Nagy, P. & Podani, J. 1989. On sampling
procedures in population and community ecology. Vegetatio
83: 195–207.
Knollova, I., Chytry, M., Tichy, L. & Hajek, O. 2005. Stratified
resampling of phytosociological databases: some strategies
for obtaining more representative data sets for classification
studies. Journal of Vegetation Science 16: 479–486.
Legendre, P. 1993. Spatial autocorrelation: trouble or new
paradigm? Ecology 74: 1659–1673.
Legendre, P. & Legendre, L. 1998. Numerical ecology. 2nd ed.
Elsevier, Amsterdam, NL.
Lengyel, A. 2010. Phytosociological investigation of mesophilous
meadows of South Transdanubia (Hungary). MSc thesis,
University of Pecs, Pecs, HU (in Hungarian).
Nekola, J.C. & White, P.S. 1999. The distance decay of similarity
in biogeography and ecology. Journal of Biogeography
26: 867–878.
Pavoine, S., Ollier, S. & Dufour, A.-B. 2005. Is the originality of
a species measurable? Ecology Letters 8: 579–586.
Ripley, B.D. 1977. Modelling spatial patterns. Journal of the
Royal Statistical Society, Series B 39: 172–212.
Ripley, B.D. 1988. Statistical inference for spatial processes.
Cambridge University Press, Cambridge, UK.
Rolecek, J., Chytry, M., Hajek, M., Lvoncık, S. & Tichy, L. 2007.
Sampling design in large-scale vegetation studies: do not
sacrifice ecological thinking to statistical purism! Folia
Geobotanica 42: 199–208.
Schaminee, J.H.J., Hennekens, S.M., Chytry, M. & Rodwell,
J.S. 2009. Vegetation plot data and databases in Europe: an
overview. Preslia 81: 173–185.
Smartt, P.F.M. & Grainger, J.E.A. 1974. Sampling for vegetation
survey: some aspects of the behavior of unrestricted,
restricted and stratified techniques. Journal of Biogeography
1: 193–206.
Tichy, L. 2005. New similarity indices for the assignment of
releves to the vegetation units of an existing
phytosociological classification. Plant Ecology 179: 67–72.
van Tongeren, O., Gremmen, N. & Hennekens, S. 2008.
Assignment of releves to pre-defined classes by supervised
clustering of plant communities using a new composite
index. Journal of Vegetation Science 19: 525–536.
Supporting Information
Additional supporting information may be found in the
online version of this article:
Appendix S1. Calculation example of the
heterogeneity-constrained random resampling.
Appendix S2. Comparison of the heterogeneity-
constrained random resampling with the releve ranking
method (Feoli & Feoli Chiapella 1979).