Heterogeneity-constrained random resampling of phytosociological databases

Attila Lengyel, Milan Chytrý & Lubomír Tichý

Keywords

Data representativeness; Point pattern; Relevé; Ripley’s K function; Sample plot; Selection; Stratification; Vegetation survey

Received 12 October 2009

Accepted 28 September 2010

Co-ordinating Editor: Valerio Pillar

Lengyel, A. (corresponding author, e-mail: [email protected]): Department of Plant Taxonomy and Ecology, Eötvös Loránd University, Pázmány Péter s. 1/C, H-1117 Budapest, Hungary

Chytrý, M. ([email protected]) & Tichý, L. ([email protected]): Department of Botany and Zoology, Masaryk University, Kotlářská 2, CZ-61137 Brno, Czech Republic

Abstract

Aim: Phytosociological databases often contain unbalanced samples of real vegetation, which should be carefully resampled before any analyses. We propose a new resampling method based on species composition, called heterogeneity-constrained random (HCR) resampling.

Method: Many subsets of the source vegetation database are selected randomly. These subsets are sorted by decreasing mean dissimilarity between pairs of the vegetation plots, and then sorted again by increasing variance of these dissimilarities. Ranks from both sortings are summed for each subset, and the subset with the lowest summed rank is considered as the most representative. The performance of this method was tested using simulated point patterns that represented different levels of aggregation of vegetation plots within a database. The distributions of points in the subsets resulting from different resampling methods, both with and without database stratification, were compared using Ripley’s K function. The mean of random selections from an unbiased sample was used as a reference in these comparisons. The efficiency of the method was also demonstrated with real phytosociological data.

Results: Both stratified and HCR resampling yielded selection patterns more similar to the reference than resampling without these tools. Outcomes from the resampling that combined these two methods were the most similar to the reference. The efficiency of the HCR resampling method varied with different levels of aggregation in the database.

Conclusions: This new method is efficient for resampling phytosociological databases. As it only uses information on species occurrences/abundances, it does not require the definition of strata, thereby avoiding the effect of subjective decisions on the selection outcome. Nevertheless, this method can also be applied to stratified databases.

Introduction

Large phytosociological databases, i.e. databases of vegetation survey plots (Schaminée et al. 2009), usually contain data from many different sources. Each data set included in such databases had been collected for a certain purpose, was focused on a specific vegetation type or area and, accordingly, reflected the specific requirements defined by the original researchers. Consequently, some areas or habitats are over-represented, and others are under-represented in large databases. Analyses based on such heterogeneous data sets face a real danger that the results will mainly reflect patterns existing in the areas or habitats represented by disproportionately many plots (Roleček et al. 2007). To reduce such bias, Knollová et al. (2005) proposed several methods for the stratified resampling of phytosociological databases. In principle, these methods divide the database into plot groups (called strata) according to the variables that influence (or are supposed to influence) the between-plot variation in species composition. Each stratum contains plots that have similar values of such variables. Assuming that plots from close sites are more likely to have a similar species composition than those from distant sites (Legendre 1993; Nekola & White 1999), the strata can also be defined based on the geographical position of individual plots. After stratification, the database is resampled by randomly selecting a predefined number of vegetation plots from each stratum, thus creating a subset of the original database in which over-sampling of some areas or habitats is reduced.

Journal of Vegetation Science 22 (2011) 175–183. Doi: 10.1111/j.1654-1103.2010.01225.x. © 2010 International Association for Vegetation Science

This way of stratified resampling creates smaller but more balanced data sets (Knollová et al. 2005; Haveman & Janssen 2008) because the number of plots that represent common vegetation types (or over-sampled areas) is reduced, so as to be similar to the number of plots that represent rare types (or under-sampled areas). However, the success of the improvement strongly depends on the way the database is stratified. The choice of stratifying variables may limit the researchers’ ability to explore the part of the total variation that is unrelated to these variables. Certainly, previous knowledge of the relationship between vegetation and the environment can help select environmental variables that most probably account for the major vegetation patterns (Smartt & Grainger 1974; Gillison & Brewer 1985; Bunce et al. 1996; Goedickemeier et al. 1997), and these variables can be used for stratification. However, Knollová et al. (2005) demonstrated that vegetation classification results depend on the variables used for stratification. For example, if altitude is used as one of the variables to define strata, the resampled data set is likely to reflect altitudinal vegetation differentiation more strongly than another data set resampled within strata based on variables other than altitude. The resolution of the stratifying variables is also important, e.g. the number of categories in categorical variables (Holt et al. 2008). In addition, the variables with the most direct effect on species composition are not often available with a reasonable resolution. Therefore in practical applications, some of the available variables are usually subjectively selected based on the researcher’s hypotheses about the relationships of the studied vegetation to the environment, and this subjectivity can affect the results of exploratory analyses (Knollová et al. 2005).

Hence, a more straightforward and less subjective way of database stratification in exploratory studies of ecological communities might be based directly on vegetation properties rather than environmental variables. In studies of community diversity, such a property can be species composition. Knollová et al. (2005) proposed two methods to define strata based on species composition: indicator values (Ellenberg et al. 1991) and clusters from numerical classifications of the original database. They resampled a database by the random selection of a predefined number of plots from each stratum. Another stratification method based on a measure of between-plot dissimilarity in species composition was proposed by De Cáceres et al. (2008). However, the resampled data sets and the vegetation classifications based on them differ due to arbitrary decisions regarding the chosen indicator systems, numerical classification methods or cover-abundance transformations (Knollová et al. 2005).

Another problem with random resampling within strata is that although strata are defined to maximize between-stratum heterogeneity among plots and to minimize within-stratum heterogeneity, considerable heterogeneity can exist even within strata. Random selection of plots from internally heterogeneous strata can yield subsets of vegetation plots in which initially rare vegetation types are still under-represented or missing, and initially common types remain over-represented. Therefore, there is a need for a resampling method that does not require a priori stratification of the database, and which performs a non-random plot selection that preserves the between-plot heterogeneity of the initial data set while reducing the number of plots from vegetation types that are over-represented in the initial database. Such a method was proposed by Feoli & Feoli Chiapella (1979), based on contributions of single plots to the total sum-of-squares of all plots in a database. This method iteratively ranks plots by decreasing contributions, and selects those with the highest contributions for the analysis. This method was applied to vegetation classification by Feoli & Lagonegro (1982). However, there are some disadvantages with this method: (1) the sum-of-squares criterion does not allow the use of any other measures to express within-stratum heterogeneity; (2) outliers are preferred in the selections; (3) this method can only handle data matrices with a higher number of species than plots.

Methods of selection of a heterogeneous subset from a larger, unbalanced database have also been sought in the field of conservation biology, where the maintenance of a high genetic diversity is a central goal; therefore, a higher priority should be given to a set of genetically unique species or populations. Pavoine et al. (2005) compared several methods aimed at the ranking of species according to genetic isolation based on cladogram topologies, and they also proposed a new measure called “originality of a species”. However, similar to the method of Feoli & Feoli Chiapella (1979), these measures suffer from the drawback of subset heterogeneity maximization; therefore they cannot be simply adapted to vegetation-plot databases, since their preference toward outliers may pose a serious problem in studies of vegetation diversity.

In this paper we introduce a new, more flexible heuristic method called “heterogeneity-constrained random resampling” (HCR resampling) for the selection of a representative subsample from a database based on species composition. The name reflects the fact that, of many random resampling trials, only one is accepted: the trial that fulfils the requirement of high heterogeneity that is not caused by the presence of outliers. All other random resampling trials are rejected. We tested and demonstrated the performance of this new method using simulated and real data sets.

Methods

The new method

The new method is a heuristic alternative to the random resampling of vegetation plots from a large database that involves heterogeneity constraints. As the initial step, the number of plots to be selected from the database is defined. Theoretically, there are $S!/[s!\,(S-s)!]$ possible selections of $s$ plots from a database of $S$ plots. Some of these selections represent only a small part of the total variation in the database, thus they do not fulfil the criterion of maximum heterogeneity, which is usually desirable (Kenkel et al. 1989; Roleček et al. 2007). This method performs $N$ selection trials, where $N$ is set a priori. Each trial randomly selects $s$ plots from the database. Then, since we are looking for a heterogeneous subset of plots, the trials are ranked by decreasing heterogeneity. Heterogeneity is measured as the mean of all pair-wise dissimilarities in species composition between selected plots in a trial. Any measure of pair-wise dissimilarity appropriate for the given data type can be used. The most heterogeneous selection trials tend to be those that contain plots from different extremes of variation in species composition, especially outliers. However, such selections are not the most appropriate ones, since they tend to contain few plots with the most typical species composition. Therefore, the selection trials are ranked again, but now by increasing variance within the matrix of between-plot dissimilarities. In this rank order the trials that contain outlying plots are rarely ranked near the top because the inclusion of outliers tends to increase the variance in dissimilarities. In contrast, low variance is often found in the trials with low heterogeneity, which only represent the variation in species composition for parts of the original database. At the same time, trials with, on average, high but roughly equal between-plot distances also meet this criterion. Finally, ranks for decreasing heterogeneity and increasing variance are summed and trials are ranked again by their summed ranks. In the new order of the summed ranks, the first trial (the one with the lowest summed rank) is expected to include a selection of plots that are representatively distributed across the total range of species composition variation. This method can be applied either directly to an unstratified data set, or it can also be applied to a stratified data set and used for resampling within strata. It can be computed in the JUICE program for the classification and analysis of ecological data (version 7.0; L. Tichý; freely available at http://www.sci.muni.cz/botany/juice). A simple calculation example is provided in Appendix S1.
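The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the JUICE implementation: the function name `hcr_resample`, its arguments and the handling of tied ranks (ties keep the order in which the trials were drawn) are hypothetical choices made here. The population variance is used, which gives the same ranking as the sample variance because every trial has the same number of plot pairs.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance


def hcr_resample(plots, s, n_trials, dissimilarity, rng=None):
    """Return the plot indices of the trial with the lowest summed rank.

    Hypothetical helper: the name, arguments and tie handling are
    illustrative assumptions, not taken from the paper or from JUICE.
    """
    rng = rng or random.Random()
    trials = []
    for _ in range(n_trials):
        subset = rng.sample(range(len(plots)), s)  # one random selection of s plots
        d = [dissimilarity(plots[i], plots[j]) for i, j in combinations(subset, 2)]
        trials.append((subset, mean(d), pvariance(d)))

    # Rank 1 goes to the highest mean dissimilarity and to the lowest variance.
    # Python's sort is stable, so tied trials keep their drawing order (assumption).
    by_mean = sorted(range(n_trials), key=lambda t: trials[t][1], reverse=True)
    by_variance = sorted(range(n_trials), key=lambda t: trials[t][2])

    summed = [0] * n_trials
    for position, t in enumerate(by_mean, start=1):
        summed[t] += position
    for position, t in enumerate(by_variance, start=1):
        summed[t] += position

    best = min(range(n_trials), key=lambda t: summed[t])
    return sorted(trials[best][0])


if __name__ == "__main__":
    rng = random.Random(1)
    # Illustrative data: 350 points in a plane, Euclidean distance as dissimilarity.
    points = [(rng.uniform(0, 1000), rng.uniform(0, 500)) for _ in range(350)]
    print(math.comb(350, 5), "possible selections of 5 plots out of 350")
    print(hcr_resample(points, s=5, n_trials=1000, dissimilarity=math.dist, rng=rng))
```

Any dissimilarity function that accepts two plots can be passed in, which mirrors the statement that any pair-wise measure appropriate for the data type can be used. Only N trials out of the S!/[s!(S−s)!] possible selections are examined, so the result is a heuristic choice rather than a global optimum.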

Test concepts

Since there is no objective standard for “representativeness” or “ecological meaningfulness” of ecological data sets, the degree of improvement of a data set after resampling is difficult to measure. Therefore, we tested the performance of the HCR resampling method using simulated data sets with a known structure. Let us consider a metric ordination diagram of vegetation plots in which, for simplicity, the total variation in species composition is fully explained by the first two axes, and the between-point distances are proportional to the dissimilarities between plots. Supposing continuous variation in the species composition of vegetation, theoretically there is a very high number of possible species compositions that correspond to certain locations within the ordination diagram. The set of all possible species compositions is the statistical population in the ordination space, from which subsets of plots (or points) can be taken as samples. Thus, database structures originating from different plot location schemes can be simulated by generating artificial point patterns on the ordination diagram. For example, dense groups of points on the ordination diagram would represent a situation in which plots are located in very similar (or spatially very close) vegetation, whereas evenly distributed points on the ordination diagram would represent a stratified or systematic vegetation sampling.

Following this logic, we simulated different database structures using artificial point scatters in a finite region of a plane, hereafter called a “window”. We resampled these point scatters both randomly and using the new method, in each case with and without stratification (Table 1). To evaluate the outcomes of the different resampling methods, we used a spatial statistical tool, Ripley’s K function (Ripley 1977; Fortin & Dale 2005). This function quantifies the spatial randomness of a point pattern by estimating the number of points within a range of distances (denoted as r) from any point within the area of the pattern. In the case of randomly distributed points, the K function grows quadratically with increasing distance (K(r) = πr² under complete spatial randomness). The K function for aggregated point patterns yields higher values than those for randomly distributed point patterns, while those for regular point patterns are lower than for the random case.

Table 1. Resampling schemes used for the tests with artificial point patterns. Point aggregations refer to initial samples of points in the window. Each code combines the point aggregation (R, E, U), the stratification (S = yes, N = no) and the resampling method (R = random, H = HCR).

Point aggregation   Random              Evenly aggregated   Unevenly aggregated
Stratification      Yes       No        Yes       No        Yes       No
Random resampling   RSR       RNR       ESR       ENR       USR       UNR
HCR resampling      RSH       RNH       ESH       ENH       USH       UNH

We assume that the statistical population in this simulation includes all possible points in the window, while scatters of selected points in the window are its more or less biased samples. Our aim is to obtain a subsample from the sample of points that is representative of the whole window, i.e. of the statistical population. Obviously, random sampling is an appropriate procedure for obtaining a representative sample of the window, and its average random subsample should also be representative of the region. In contrast, a non-random subsample is a biased representation of the region, and the aim of resampling is to reduce this bias. As the aim of resampling is to obtain a subsample in which the sampling units (e.g. points representing vegetation plots in this example) are more evenly distributed than in the original sample, and random sampling offers an acceptable representation of the continuous statistical population, it seems a straightforward solution to test whether or not a subsample of a sample population shows random pattern characteristics. Since Ripley’s K function is an appropriate tool for testing the randomness of point patterns, mean K values for random subsamples of random samples of the points from the window can be used as references for other subsamples of non-random samples of the points from the same window.
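To make the role of the K function more concrete, the sketch below computes a naive estimate, K(r) ≈ A / [n(n − 1)] × (number of ordered point pairs no further apart than r), where A is the window area and n the number of points. This is one common textbook form and deliberately omits edge correction, so it is an assumption-laden illustration rather than the estimator used in the study; the function names `ripley_k` and `csr_k` are hypothetical.

```python
import math
import random


def ripley_k(points, r, area):
    """Naive estimate of Ripley's K at distance r, without edge correction."""
    n = len(points)
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r  # ordered pairs within distance r
    )
    return area * close_pairs / (n * (n - 1))


def csr_k(r):
    """Theoretical K under complete spatial randomness: pi * r**2."""
    return math.pi * r ** 2


if __name__ == "__main__":
    rng = random.Random(7)
    width, height = 1000, 500  # assumed orientation of the window
    pattern = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(180)]
    for r in (20, 60, 120):
        print(r, round(ripley_k(pattern, r, width * height)), round(csr_k(r)))
```

Without edge correction the estimate tends to fall below πr² at larger distances, because points near the border have part of their neighbourhood outside the window. Comparing resampled patterns against curves obtained from random patterns in the same window, as described above, sidesteps part of this problem.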

Artificial data sets

The performance of HCR resampling was demonstrated using simple point scatters consisting of five aggregations: 350 points (representing a biased sample of 350 vegetation plots) were assigned to the aggregations by random sampling with replacement, thus on average 70 points belonged to each aggregation. Selections of point subsets from different positions of the final rank order of 1000 trials of HCR resampling were compared.

Then, three point scatters were created, each of them representing a different degree of aggregation. The first scatter was simply a random draw of co-ordinates for 180 points in a plane region with dimensions of 500 × 1000 points. The second scatter was created by the assignment of 180 points to the co-ordinates of the first scatter by random sampling with replacement. Thus, some of the co-ordinates were selected more than once, whereas others were not selected. However, overlapping points would be unrealistic because vegetation plots with identical species compositions are extremely rare in species-rich vegetation. Therefore, points were moved to a new random position based on the two-dimensional normal distribution with the mean equal to the originally assigned point and SD = 5. As a result, a point scatter with more or less even-sized aggregations was created. The third scatter differed from the second one in the size of the aggregations. Points were assigned to co-ordinates using a selection weighted by a probability vector following the broken-stick distribution (Legendre & Legendre 1998). Thus, in the third scatter there were point aggregations with larger differences in size, i.e. a more realistic model of a real community sample than the previous two scatters.
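The three scatters can be generated roughly as follows. Several details are assumptions made for this sketch and are labelled in the comments: the window is taken as 1000 wide and 500 high, every assigned point is jittered (the text does not say whether only overlapping points were moved), and the broken-stick weights are the expected relative piece lengths of a stick broken at random, p_i = (1/n) Σ_{k=i..n} 1/k. All function names are hypothetical.

```python
import random

WIDTH, HEIGHT = 1000, 500  # assumed orientation of the 500 x 1000 window
N_POINTS = 180
SD = 5


def random_scatter(rng, n=N_POINTS):
    """First scatter: uniformly random co-ordinates in the window."""
    return [(rng.uniform(0, WIDTH), rng.uniform(0, HEIGHT)) for _ in range(n)]


def broken_stick(n):
    """Expected relative lengths of n pieces of a randomly broken stick, largest first."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


def aggregated_scatter(rng, cores, weights=None, n=N_POINTS, sd=SD):
    """Assign n points to cores with replacement, then jitter each point.

    weights=None gives equal selection probabilities (second scatter);
    broken-stick weights give unequal aggregation sizes (third scatter).
    Jittered points may fall slightly outside the window (not handled here).
    """
    centres = rng.choices(cores, weights=weights, k=n)
    return [(rng.gauss(x, sd), rng.gauss(y, sd)) for x, y in centres]


if __name__ == "__main__":
    rng = random.Random(42)
    first = random_scatter(rng)
    second = aggregated_scatter(rng, first)
    third = aggregated_scatter(rng, first, weights=broken_stick(len(first)))
    print(len(first), len(second), len(third))
```

With equal weights, most cores receive zero, one or a few points, which yields aggregations of similar size. The broken-stick weights concentrate many points on a few cores, giving the larger size differences described for the third scatter.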

Comparison of the resampling methods

The three point scatters were resampled, each by four methods, resulting in a total of 12 outcomes (Table 1). For cases with stratification, the scatters were divided into nine strata using sections of equal length along the longer side of the window. The number of points to select was set to 90, and in the case of stratified resampling, 10 points were selected from each stratum. In HCR resampling, Euclidean distances of points were used for heterogeneity calculations. When the strata were defined, 100 trials were run for sorting out the best point combination from each stratum separately. Without stratification, the number of trials was 10 000.
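The stratified random variant of this design can be sketched as below. The sketch assumes that the longer side of the window is the x axis with length 1000, and that a stratum holding fewer than 10 points contributes all of them; neither detail is stated in the text. The function names are hypothetical.

```python
import random


def stratum_index(point, n_strata=9, length=1000.0):
    """Index of the equal-length section along the longer (assumed x) side."""
    k = int(point[0] / (length / n_strata))
    return max(0, min(k, n_strata - 1))  # clamp points lying on or beyond the border


def stratified_random_resample(points, per_stratum=10, n_strata=9,
                               length=1000.0, rng=None):
    """Randomly select up to per_stratum point indices from each stratum."""
    rng = rng or random.Random()
    strata = [[] for _ in range(n_strata)]
    for idx, p in enumerate(points):
        strata[stratum_index(p, n_strata, length)].append(idx)
    selected = []
    for members in strata:
        k = min(per_stratum, len(members))  # assumption: take all if fewer than 10
        selected.extend(rng.sample(members, k))
    return sorted(selected)


if __name__ == "__main__":
    rng = random.Random(3)
    scatter = [(rng.uniform(0, 1000), rng.uniform(0, 500)) for _ in range(180)]
    print(len(stratified_random_resample(scatter, rng=rng)))
```

For HCR resampling within strata, the single random draw per stratum would be replaced by many draws (100 in the study), ranked by mean and variance of the Euclidean distances as described in the method section.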

After selection by the four methods, the points were re-allocated to the scatter. Ripley’s K function was calculated using isotropic edge-correction (Ripley 1988) for the selection patterns from the four methods. Resampling and calculation of the K function were repeated 100 times. Mean K function values were compared graphically with reference lines and 2.5–97.5% quantiles derived from 100 RNR cases (Table 1) instead of using a function derived from the theoretical Poisson point process.
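A reference line with a 2.5–97.5% band can be derived from repeated K curves along the following lines. The sketch assumes that every repetition yields K values at the same set of distances, and it uses the "inclusive" interpolation rule of Python's `statistics.quantiles`; the quantile rule used in the study is not stated. The function names `envelope` and `outside` are hypothetical.

```python
from statistics import mean, quantiles


def envelope(curves):
    """Pointwise mean and 2.5% / 97.5% quantiles of repeated curves.

    curves: list of equally long lists, one list of K values per repetition.
    """
    centre, low, high = [], [], []
    for values in zip(*curves):  # all repetitions at one distance
        cuts = quantiles(values, n=40, method="inclusive")  # steps of 2.5%
        centre.append(mean(values))
        low.append(cuts[0])    # 2.5% quantile
        high.append(cuts[-1])  # 97.5% quantile
    return centre, low, high


def outside(curve, low, high):
    """Flag the distances at which a curve leaves the reference band."""
    return [not (lo <= v <= hi) for v, lo, hi in zip(curve, low, high)]


if __name__ == "__main__":
    # Toy input: 101 made-up "curves" evaluated at two distances.
    toy = [[float(i), 2.0 * i] for i in range(101)]
    print(envelope(toy))
```

A mean curve from another resampling scheme can then be passed to `outside` to see at which distances it departs from the band obtained for the reference (RNR) case.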

Real data example

A data set containing 35 vegetation plots from southern Hungarian mesic meadows (Lengyel 2010), each of 25 m² in size, was used to demonstrate the performance of the HCR method with real data. The plots were from three meadow types that were represented disproportionately in the data set:

(1) moderately grazed pastures and hayfields on nutrient-rich soil (25 plots);

(2) intensively grazed colline pastures on nutrient-poor and drier soil (five plots);

(3) intensively grazed lowland pastures on nutrient-rich and wetter soil (five plots).

The data set was analysed by principal coordinates analysis (PCoA) using the Jaccard index as the distance measure (Legendre & Legendre 1998). Then the data set was resampled both randomly and using the HCR method. Subsamples of 10 plots were classified into three clusters by partitioning around medoids (PAM; Kaufman & Rousseeuw 1990), with the Jaccard index as the distance measure. PAM is a non-hierarchical classification method similar to K-means clustering, but it is generalized so that any distance measure can be used. Points representing vegetation plots, with their cluster assignment, were visualized on a PCoA ordination diagram in order to assess how well general patterns remained identifiable from the subsamples.
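For presence/absence data, the Jaccard dissimilarity between two plots is one minus the number of shared species divided by the number of species found in either plot. A minimal sketch follows; the function name and the species labels are made up for illustration, and returning 0 for two empty plots is a convention chosen here.

```python
def jaccard_dissimilarity(plot_a, plot_b):
    """1 - |A and B| / |A or B| for two collections of species names."""
    a, b = set(plot_a), set(plot_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty plots
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical species lists, for illustration only.
    plot_1 = {"species A", "species B", "species C"}
    plot_2 = {"species B", "species C", "species D"}
    print(jaccard_dissimilarity(plot_1, plot_2))
```

A function of this kind can serve as the pair-wise dissimilarity in HCR resampling, and the resulting dissimilarity matrix is the kind of input that ordination and medoid-based clustering operate on.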

Computations were performed in the R environment (R Development Core Team, http://www.r-project.org), using vegan (http://cc.oulu.fi/~jarioksa/softhelp/vegan.html), spatstat (http://www.jstatsoft.org) and cluster (by M. Maechler) packages.

Results

The differences among the random plot selections sorted by summed heterogeneity and variance ranks are shown in Fig. 1. The best selection included points from each of the five point aggregations, while other selections tended to over-represent certain point aggregations. The values of Ripley’s K function resulting from repeated resampling of the point scatters by random and HCR resampling with and without stratification are compared in Fig. 2. If the scatter was random (Fig. 2a), the mean curves of all of the resampling cases (RSR, RNR, RSH, RNH) were close to each other. Some tendency towards regularity in the point distribution can be seen in the subsets resulting from HCR resampling (lower average K values in RNH and RSH cases); however, these differences were within the confidence interval. Resampled point scatters from the evenly aggregated point pattern (Fig. 2b) were characterized by higher K values, but when stratification or HCR resampling was applied they were closer to the reference lines. Point patterns resulting from HCR resampling combined with stratification (ESH) fell within the confidence interval for resampling of the random scatter, particularly at the larger scale (large values of r). In general, ENR provided the worst results of the four resampling methods. However, for the unevenly aggregated scatter (Fig. 2c), the curves of random and HCR resampling showed little, and possibly no significant, difference; stratification still remained a moderately useful tool in this situation.

The example with a real unbalanced data set (Fig. 3) demonstrated that after random resampling the classification failed to recognize clear vegetation types in the data set because it divided the over-represented Type 1 into two clusters and merged plots from the two rarer types into one cluster. In contrast, all three vegetation types were correctly distinguished in the subsample obtained with the HCR resampling method.

Fig. 1. Selections of points in a subsample corresponding to different rank positions according to the HCR resampling method. Grey points represent a sample; they were drawn from a two-dimensional normal distribution around five randomly chosen cores, with 70 points per core on average. The five black points in each graph represent a subsample selected in the actual random resampling trial. [Four panels showing the selections at positions 1, 300, 600 and 1000 in the final ranking; horizontal axis: Axis 1, vertical axis: Axis 2.]

Discussion

The new method of resampling of phytosociological databases presented in this paper has multiple advantages over the random subsample selection within strata, which was used in previous studies (Knollová et al. 2005; Haveman & Janssen 2008). Although there is no objective measure of the quality of subsample selection, redundancies within subunits of a data set should be removed by resampling because they may negatively affect the results of the analyses. Analyses of non-resampled data sets can reveal patterns that are actually mainly typical of the over-sampled subunit(s) of the data set, but not of the whole sampling universe. This issue is conceptually similar to the problem of pseudo-replications in experimental studies (Hurlbert 1984). Figure 1 demonstrates that some randomly selected subsamples were poorly representative of the whole data set and that HCR resampling was able to identify them and accept more representative subsamples instead. The subsamples based on HCR resampling were also markedly more representative than those based on the relevé ranking method of Feoli & Feoli Chiapella (1979), which is an early predecessor of HCR resampling (Appendix S2).

For a more exact evaluation of the resampling out-

comes with reference to a representative sample, we used

Ripley’s K function. Application of stratification and

heterogeneity constraints proved to be appropriate for

reducing bias in the distribution of sampling units. How-

ever, their effect varied among the different cases. If the

sample population was originally randomly distributed,

then none of them had a remarkable effect (Fig. 2).

[Figure 2. Panels: (a) Random, (b) Evenly aggregated, (c) Unevenly aggregated. Horizontal axis: Distance (0 to 120); vertical axis: Ripley's K (0 to 10000). Legend: Not stratified, random; Not stratified, HCR; Stratified, random; Stratified, HCR; Mean of random selections; 2.5−97.5% conf. int.]

Fig. 2. Values of Ripley’s K function for three scatters with (a) random points, (b) evenly aggregated points, and (c) unevenly aggregated points, which were resampled by random and HCR resampling, each with and without stratification. Higher values indicate a more aggregated selection pattern in the resampled subset of points. The resampling result is better the closer the values of Ripley’s K function are to the reference line of the mean of random selections. Values on the horizontal axis are the distances in the window used for the K function estimation.

[Figure 3. Panels: (a) Random resampling, (b) HCR resampling. Axes: PCoA Axis 1 and PCoA Axis 2 (−0.4 to 0.4). Legend: Type 1, Type 2, Type 3.]

Fig. 3. Principal coordinates analysis diagrams of 35 vegetation plots representing three types of mesic meadow. This data set was resampled using (a) random resampling and (b) HCR resampling. Ten plots selected by these resampling methods were subsequently classified into three clusters, and cluster membership is indicated in the ordination diagrams.

If the points were aggregated, both methods worked well,

particularly if they were used in combination. In our simulation, HCR resampling within strata provided selection patterns that reached the confidence intervals of the reference cases when the points were evenly aggregated. For the unevenly aggregated scatter, HCR resampling alone made no significant difference to the subsample selection, whereas stratification alone proved more efficient. Even so, combined stratification and constrained resampling again provided the best subsamples.

Therefore, we suggest replacing the usual procedure of a single random selection of sampling units within strata with the generation of many random selections from the same stratum, followed by identification of the most representative of these selections using the new method. Only the most representative selection should be used for data analysis. A great advantage of the HCR resampling method is that, in some situations, it selects subsamples from a biased sample that tend to have a similar distribution to a random selection from an unbiased sample (Fig. 2b). However, strong biases in the initial data sets (e.g. large differences among sizes of aggregations) cannot be removed by the HCR resampling method (Fig. 2c). In addition, no resampling can correct bias due to completely missing vegetation types, which can only be remedied by obtaining new data.
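The selection procedure described in the Abstract (many random subsets, ranked by decreasing mean and by increasing variance of pairwise dissimilarity, with the lowest summed rank winning) can be sketched as follows. This is a minimal illustration and not the authors' R program: the function name `hcr_resample`, its parameters, the arbitrary tie-breaking in the ranks, and the use of Euclidean distances between simulated points as a stand-in for compositional dissimilarity are all assumptions made for the example.

```python
import numpy as np


def hcr_resample(dissim, n_select, n_trials=1000, seed=0):
    """Return the plot indices of the random trial with the lowest summed rank.

    dissim: square (n_plots x n_plots) matrix of pairwise dissimilarities.
    """
    rng = np.random.default_rng(seed)
    n_plots = dissim.shape[0]
    pair_rows, pair_cols = np.triu_indices(n_select, k=1)

    trials, means, variances = [], [], []
    for _ in range(n_trials):
        subset = rng.choice(n_plots, size=n_select, replace=False)
        # Pairwise dissimilarities among the selected plots only.
        pairs = dissim[np.ix_(subset, subset)][pair_rows, pair_cols]
        trials.append(np.sort(subset))
        means.append(pairs.mean())
        variances.append(pairs.var())

    means = np.asarray(means)
    variances = np.asarray(variances)
    # Rank 0 = highest mean dissimilarity; rank 0 = lowest variance.
    # Ties are broken arbitrarily (an assumption of this sketch).
    rank_mean = np.argsort(np.argsort(-means))
    rank_var = np.argsort(np.argsort(variances))
    best = int(np.argmin(rank_mean + rank_var))
    return trials[best]


# Hypothetical biased "database": 40 points in one aggregation, 10 in another.
demo_rng = np.random.default_rng(1)
points = np.vstack([demo_rng.normal(0.0, 0.1, (40, 2)),
                    demo_rng.normal(3.0, 0.1, (10, 2))])
dissim = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
selected = hcr_resample(dissim, n_select=10, n_trials=500)
print(selected)
```

To follow the suggestion of resampling within strata, the same function could be called once per stratum on the corresponding sub-matrix of dissimilarities.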

An important advantage of the HCR resampling method is that it does not require any a priori stratification of the initial database. The resampling of a database within strata proved to be helpful, but stratification may introduce subjective bias (Knollová et al. 2005), and suitable data for defining the strata may not always be available. Still, the HCR resampling method can provide the best results if applied to carefully selected strata. In geographically extensive broad-scale investigations, spatial autocorrelation usually affects patterns of species composition (Legendre 1993; Nekola & White 1999). Since in most cases there is some spatial information on the plot locations in the databases, geographical stratification can be a useful starting stratification for constrained resampling. Another possibility for strata definition is the numerical classification of vegetation plots. This can be particularly useful if the resemblance measure selected for the classification process is the same as the dissimilarity measure expressing within-stratum heterogeneity in the subsequent HCR resampling within clusters defined by this classification.

The HCR resampling method can also be applied if between-plot distances are not expressed by dissimilarities in species composition but rather by measured environmental variables or by spatial distances. This can be particularly useful for data sets for which we assume a strong relationship between species composition and a particular environmental variable, or a strong spatial effect on species composition. For example, if species composition is assumed to be highly spatially structured, spatial distances between the plots can be used instead of between-plot dissimilarities, from which HCR resampling will probably select a subset of plots evenly distributed across the landscape.

Certainly, the choice of the dissimilarity measure determines the selection pattern from a database. One possible solution for reducing the effect of this subjective choice is to perform an “ensemble” analysis of distance indices, i.e. to sum the rank scores of trials obtained by several distance measures and rank the trials according to the summed rank scores. The final set of plots would be the trial that was ranked near the best trials by most of the distance measures.
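The “ensemble” idea above can be illustrated with a short sketch. It assumes that the same candidate trials are scored under every distance measure and that the per-measure rank scores are simply added; the helper names `ordinal_ranks` and `ensemble_best_trial`, and the use of Euclidean and Manhattan distances between simulated points as the two measures, are hypothetical choices for the example rather than part of the published method.

```python
import numpy as np


def ordinal_ranks(values):
    """Rank 0 goes to the smallest value; ties are broken by position."""
    return np.argsort(np.argsort(values, kind="stable"), kind="stable")


def ensemble_best_trial(dissim_matrices, trials):
    """Return the index of the trial with the lowest rank score summed over measures.

    dissim_matrices: list of square (n_plots x n_plots) arrays, one per measure.
    trials: list of index arrays; the same candidate subsets are used for every measure.
    """
    total = np.zeros(len(trials), dtype=int)
    for dissim in dissim_matrices:
        means, variances = [], []
        for subset in trials:
            rows, cols = np.triu_indices(len(subset), k=1)
            pairs = dissim[np.ix_(subset, subset)][rows, cols]
            means.append(pairs.mean())
            variances.append(pairs.var())
        # Rank score under this measure: high mean and low variance are good.
        score = ordinal_ranks(-np.asarray(means)) + ordinal_ranks(np.asarray(variances))
        total += score
    return int(np.argmin(total))


# Hypothetical data: 30 points, two distance measures, 200 shared random trials.
rng = np.random.default_rng(0)
points = rng.random((30, 2))
diff = points[:, None, :] - points[None, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=-1))
manhattan = np.abs(diff).sum(axis=-1)
trials = [rng.choice(30, size=8, replace=False) for _ in range(200)]
best = ensemble_best_trial([euclidean, manhattan], trials)
print(best, np.sort(trials[best]))
```

Summing raw rank scores gives every measure equal weight; whether that weighting is appropriate for a given set of indices is a judgment the sketch does not address.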

Obviously, the success of HCR resampling depends on the number of trials, and this relationship also clarifies the role of stratification. Consider a database containing $S$ plots of which $s$ plots are to be selected by the HCR resampling method. All $S!/[s!(S-s)!]$ possible selections would have to be evaluated to be certain of finding the best combination of plots. If $S = 100$ and $s = 60$, then $1.374623 \times 10^{28}$ trials would be needed. In contrast, if the plots are divided into $C$ groups (for the sake of simplicity, of equal size) by a stratification that maximizes between-stratum heterogeneity, the number of necessary trials would be $(S/C)!/[(s/C)!\,((S/C)-(s/C))!]$ for each stratum. If $C = 10$, then this number is 210 per stratum, i.e. $210 \times 10 = 2100$ in total. This huge difference explains the high efficiency of the combined application of stratification and the HCR resampling method. The chance of finding a representative set of plots from a database can be increased by an appropriate stratification or by an increased number of trials. Although in our examples stratification was highly successful, in reality this is rarely verifiable, and an ambiguous stratification may result in non-representative selections. The moderate performance of HCR resampling with unevenly aggregated scatters suggests that for aggregations of different sizes, more trials should be run to find an acceptable selection of plots.

Currently, it is not always technically possible to find the single best selection from large databases using this computationally intensive procedure. Nevertheless, this may not be important, since many of the selections that were highly ranked by the HCR resampling method satisfactorily represented the variation of the sample (or strata). Therefore, a tendency towards particular selection patterns can be recognized if several repeated runs are compared, which implies that some combinations of plots have a very low probability of being selected. These combinations will almost never appear in the final result; however, they can provide information about factors independent of the heterogeneity measure. For example, if strata are defined by environmental variables but not


geographical positions and HCR resampling is applied within these strata, two plots with very similar species composition will almost never be selected together, even if they were sampled at distant sites. When such resampled data sets are classified and sites representing vegetation types are plotted on a geographical map, a type will seem to be present only in one of the two locations because only one plot was included in the analysis. To avoid this danger in vegetation classification studies, we recommend assigning unselected plots to the final clusters using compositional dissimilarity measures (e.g. Tichý 2005; van Tongeren et al. 2008) or other supervised classifiers (e.g. Bruelheide 2000; Černá & Chytrý 2005; De Cáceres et al. 2009). However, these formerly unselected plots can only be used for mapping, but not for statistical analyses such as classification, ordination or determination of diagnostic species of vegetation units; otherwise these analyses would be unduly affected by the poorly representative non-resampled data.
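The trial counts quoted earlier in this paragraph (S = 100 plots, s = 60 selected, C = 10 equal strata) can be checked with a few lines of arithmetic. The variable names below are chosen for the example only; the calculation simply evaluates the binomial coefficients given in the text.

```python
from math import comb

S, s, C = 100, 60, 10  # plots in the database, plots to select, number of strata

# Number of distinct selections of s plots out of S without stratification.
unstratified = comb(S, s)

# Number of distinct selections within one stratum of S/C plots, choosing s/C.
per_stratum = comb(S // C, s // C)

# Total over all strata when each stratum is searched separately.
stratified_total = C * per_stratum

print(f"{unstratified:.6e}")
print(per_stratum, stratified_total)
```

Note that the stratified count covers only selections that take exactly s/C plots from every stratum, which is why it is so much smaller than the unstratified count.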

To conclude, in most cases the HCR resampling method is superior to other methods for resampling phytosociological databases, particularly to random resampling within strata. The HCR resampling method was shown to perform well in the tests with artificial point data sets and in the tests with data sets of vegetation plots. It is a very promising tool for increasing the representativeness of vegetation databases or other unbalanced data sets prior to analyses of vegetation patterns, such as classification or ordination.

Acknowledgements

We thank David Zelený for his help with the R program, Zoltán Botta-Dukát, Enrico Feoli and an anonymous referee for valuable suggestions on a previous version of the manuscript, and Enrico Feoli and Paola Ganis for providing the source code of the relevé ranking program for testing purposes. The study stay of A. L. at Masaryk University was supported by the Erasmus programme, and M. C. and L. T. were supported by the Ministry of Education of the Czech Republic (MSM0021622416). This study was supported by the Czech Science Foundation (505/11/0732).

References

Bruelheide, H. 2000. A new measure of fidelity and its application to defining species groups. Journal of Vegetation Science 11: 167–178.

Bunce, R.G.H., Barr, C.J., Clarke, R.T., Howard, D.C. & Lane, A.M.J. 1996. Land classification for strategic ecological survey. Journal of Environmental Management 47: 37–60.

Černá, L. & Chytrý, M. 2005. Supervised classification of plant communities with artificial neural networks. Journal of Vegetation Science 16: 407–414.

De Cáceres, M., Font, X. & Oliva, F. 2008. Assessing species diagnostic value in large data sets: a comparison between phi-coefficient and Ochiai index. Journal of Vegetation Science 19: 779–788.

De Cáceres, M., Font, X., Vicente, P. & Oliva, F. 2009. Numerical reproduction of traditional classifications and automatic vegetation identification. Journal of Vegetation Science 20: 620–628.

Ellenberg, H., Weber, H.E., Düll, R., Wirth, V., Werner, W. & Paulißen, D. 1991. Zeigerwerte von Pflanzen in Mitteleuropa. Scripta Geobotanica 18: 9–160.

Feoli, E. & Feoli Chiapella, L. 1979. Relevé ranking based on sum of squares criterion. Vegetatio 39: 123–125.

Feoli, E. & Lagonegro, M. 1982. Syntaxonomical analysis of beech woods in the Apennines (Italy) using the program package IAHOPA. Vegetatio 50: 129–173.

Fortin, M.-J. & Dale, M.R.T. 2005. Spatial analysis – a guide for ecologists. Cambridge University Press, Cambridge, UK.

Gillison, A.N. & Brewer, K.R.W. 1985. The use of gradient directed transects or gradsects in natural resource surveys. Journal of Environmental Management 20: 103–127.

Goedickemeier, I., Wildi, O. & Kienast, F. 1997. Sampling for vegetation survey: some properties of a GIS-based stratification compared to other statistical sampling methods. Coenoses 12: 43–50.

Haveman, R. & Janssen, J.A.M. 2008. The analysis of long-term changes in plant communities using large databases: the effect of stratified resampling. Journal of Vegetation Science 19: 355–362.

Holt, E.A., McCune, B. & Neitlich, P. 2008. Spatial scale of GIS-derived categorical variables affects their ability to separate sites by community composition. Applied Vegetation Science 11: 421–430.

Hurlbert, S.H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187–211.

Kaufman, L. & Rousseeuw, P.J. 1990. Finding groups in data. An introduction to cluster analysis. Wiley, New York, NY, US.

Kenkel, N.C., Juhász-Nagy, P. & Podani, J. 1989. On sampling procedures in population and community ecology. Vegetatio 83: 195–207.

Knollová, I., Chytrý, M., Tichý, L. & Hájek, O. 2005. Stratified resampling of phytosociological databases: some strategies for obtaining more representative data sets for classification studies. Journal of Vegetation Science 16: 479–486.

Legendre, P. 1993. Spatial autocorrelation: trouble or new paradigm? Ecology 74: 1659–1673.

Legendre, P. & Legendre, L. 1998. Numerical ecology. 2nd ed. Elsevier, Amsterdam, NL.

Lengyel, A. 2010. Phytosociological investigation of mesophilous meadows of South Transdanubia (Hungary). MSc thesis, University of Pécs, Pécs, HU (in Hungarian).


Nekola, J.C. & White, P.S. 1999. The distance decay of similarity in biogeography and ecology. Journal of Biogeography 26: 867–878.

Pavoine, S., Ollier, S. & Dufour, A.-B. 2005. Is the originality of a species measurable? Ecology Letters 8: 579–586.

Ripley, B.D. 1977. Modelling spatial patterns. Journal of the Royal Statistical Society, Series B 39: 172–212.

Ripley, B.D. 1988. Statistical inference for spatial processes. Cambridge University Press, Cambridge, UK.

Roleček, J., Chytrý, M., Hájek, M., Lvončík, S. & Tichý, L. 2007. Sampling design in large-scale vegetation studies: do not sacrifice ecological thinking to statistical purism! Folia Geobotanica 42: 199–208.

Schaminée, J.H.J., Hennekens, S.M., Chytrý, M. & Rodwell, J.S. 2009. Vegetation plot data and databases in Europe: an overview. Preslia 81: 173–185.

Smartt, P.F.M. & Grainger, J.E.A. 1974. Sampling for vegetation survey: some aspects of the behavior of unrestricted, restricted and stratified techniques. Journal of Biogeography 1: 193–206.

Tichý, L. 2005. New similarity indices for the assignment of relevés to the vegetation units of an existing phytosociological classification. Plant Ecology 179: 67–72.

van Tongeren, O., Gremmen, N. & Hennekens, S. 2008. Assignment of relevés to pre-defined classes by supervised clustering of plant communities using a new composite index. Journal of Vegetation Science 19: 525–536.

Supporting Information

Additional supporting information may be found in the online version of this article:

Appendix S1. Calculation example of the heterogeneity-constrained random resampling.

Appendix S2. Comparison of the heterogeneity-constrained random resampling with the relevé ranking method (Feoli & Feoli Chiapella 1979).

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

A. Lengyel Heterogeneity-constrained random resampling

Journal of Vegetation Science

Doi: 10.1111/j.1654-1103.2010.01225.x © 2010 International Association for Vegetation Science 183