Diversity patterns of uncultured Haptophytes unravelled by pyrosequencing in Naples Bay

LUCIE BITTNER,* ANGELIQUE GOBET,*† STEPHANE AUDIC,* SARAH ROMAC,*

ELIANNE S. EGGE,‡ SEBASTIEN SANTINI,§ HIROYUKI OGATA,§ IAN PROBERT,*

BENTE EDVARDSEN‡ and COLOMBAN DE VARGAS*

*CNRS, UMR7144 & Universite Pierre et Marie Curie, Team EPPO, Station biologique de Roscoff, Place Georges Tessier,

Roscoff, France, †Genoscope (CEA), CNRS UMR 8030, Universite d’Evry, 2 rue Gaston Cremieux, BP5706, 91057 Evry,

France, ‡Department of Biology, Marine Biology, University of Oslo, NO-0316 Oslo, Norway, §CNRS, Aix-Marseille

Universite, IGS UMR7256, FR-13288 Marseille, France

Abstract

Haptophytes are a key phylum of marine protists, including ~300 described morphospe-

cies and 80 morphogenera. We used 454 pyrosequencing on large subunit ribosomal

DNA (LSU rDNA) fragments to assess the diversity from size-fractioned plankton sam-

ples collected in the Bay of Naples. One group-specific primer set targeting the LSU

rDNA D1/D2 region was designed to amplify Haptophyte sequences from nucleic acid

extracts (total DNA or RNA) of two size fractions (0.8–3 or 3–20 µm) and two sampling

depths [subsurface, at 1 m, or deep chlorophyll maximum (DCM) at 23 m]. 454 reads

were identified using a database covering the entire Haptophyta diversity currently

sequenced. Our data set revealed several hundreds of Haptophyte clusters. However,

most of these clusters could not be linked to taxonomically known sequences: considering

OTUs97% (clusters built at a sequence identity level of 97%) on our global data set,

less than 1% of the reads clustered with sequences from cultures, and less than 12% clus-

tered with reference sequences obtained previously from cloning and Sanger sequencing

of environmental samples. Thus, we highlighted a large uncharacterized environmental

genetic diversity, which clearly shows that currently cultivated species poorly reflect the

actual diversity present in the natural environment. Haptophyte community appeared to

be significantly structured according to the depth. The highest diversity and evenness

were obtained in samples from the DCM, and samples from the large size fraction (3–20 µm) taken at the DCM shared a lower proportion of common OTUs97% with the other

samples. Reads from the species Chrysoculter rhomboideus were notably found at the

DCM, while they could not be detected at the subsurface. The highest proportion of totally

unknown OTUs97% was collected at the DCM in the smallest size fraction (0.8–3 µm).

Overall, this study emphasized several technical and theoretical barriers inherent to the

exploration of the large and largely unknown diversity of unicellular eukaryotes.

Keywords: 454 pyrosequencing, DCM, environmental genetic diversity, environmental samples,

Haptophyta, LSU rDNA, LSU rRNA

Received 2 October 2011; revision received 13 September 2012; accepted 20 September 2012

Introduction

Massive parallel pyrosequencing has supplemented

Sanger sequencing in recent years, especially for

environmental exploration of the microbial world

(Margulies et al. 2005; Bik et al. 2012). The use of

high-throughput sequencing technology avoids the time-

consuming and potentially biasing ligation and transforma-

tion steps inherent to classical clone library approaches,

and enables relatively exhaustive study of environmental

biodiversity for a lower cost (Sogin et al. 2006).

Correspondence: Lucie Bittner, Fax: +33 (0)2 98 29 23 23;

E-mail: [email protected]

© 2012 Blackwell Publishing Ltd

Molecular Ecology (2013) 22, 87–101 doi: 10.1111/mec.12108

The targeted locus most commonly chosen to study

microbial community diversity is the small subunit of

the ribosomal DNA or RNA (SSU rDNA or rRNA). This

molecular marker has many advantages, notably being

present and fulfilling the same function in all organisms

(Olsen et al. 1986; Woese 1987). The singular, mosaic

evolutionary rate of this marker facilitates the design of

broad-taxonomic range primers and probes in highly

conserved regions, while comparison of variable

regions is used for diversity studies. Variable SSU

rDNA regions have already been used in many 454

pyrosequencing surveys of the environmental diversity

of Bacteria and Archaea (e.g. Kysela et al. 2005; Sogin

et al. 2006; Huber et al. 2007; Roesch et al. 2007; Huse

et al. 2008; Brown et al. 2009; Barberan et al. 2011;

Eiler et al. 2011) and protists (e.g. Amaral-Zettler et al.

2009; Brown et al. 2009; Stoeck et al. 2009, 2010; Behnke

et al. 2010; Nolte et al. 2010; Cheung et al. 2010; Pawlowski

et al. 2011; Edgcomb et al. 2011; Logares et al. 2012).

The choice of the genomic region to amplify can be

constrained by both analytical (informativeness of the

targeted region for diversity studies, relevance of com-

parative database) and technological steps (e.g. higher

error rates are found when sequences longer than

300 bp are amplified, or long fragments can statistically

involve more homopolymers) (Huse et al. 2007; Schloss

2010; Kunin et al. 2010; Behnke et al. 2010). In particular,

the interpretation of environmental sequences is highly

dependent on the use of a reference database including

as many sequences as possible from taxonomically

described organisms (Stoeck et al. 2010; Pawlowski et al.

2011). SSU rDNA is by far the most common genetic

marker used to identify the strains of bacterial, archaeal

or eukaryotic microbes. However, the heterogeneity

of substitution rates in SSU rDNA can be a weakness

when diversity studies are undertaken at a global

eukaryotic scale. For example, relatively high evolution-

ary rates are observed in SSU rDNA sequences of

Foraminifera, Acantharea and Acanthamoeba (Pawlowski

& Burki 2009; Caron et al. 2009), as compared to Prasin-

ophyceae (Piganeau et al. 2011) or Haptophyta (Liu

et al. 2009). Most pyrosequencing surveys focusing

on environmental unicellular eukaryotic (protistan)

diversity published to date (Stoeck et al. 2009, 2010;

Edgcomb et al. 2011; Pawlowski et al. 2011; Shalchian-

Tabrizi et al. 2011; Logares et al. 2012) have used ‘uni-

versal’ eukaryotic primers targeting the V4 and/or the

V9 variable regions of the SSU rDNA. In these studies,

pyrosequences assigned to the Haptophyta are rela-

tively rare. This trend might be partly explained by the

fact that some of the ecosystems investigated (e.g.

anoxic lakes, deep-sea habitats) are not expected to har-

bour significant Haptophyte populations. But important

bias could also result from lower affinity of the PCR

amplification step to Haptophytes, related to intrinsic

high GC content of Haptophyte DNA (Liu et al. 2009;

Stoeck et al. 2010) or simple mismatches in the priming

site, as observed in the V4 reverse ‘universal’ primer

(Stoeck et al. 2010). The low Haptophyte diversity

detected in these studies may also result from the rela-

tively slow rate of rDNA substitution in this lineage.

For example, V9 rDNA sequences are identical among

species within the genus Phaeocystis (Pawlowski et al.

2011). The amplification and sequencing of LSU rDNA

D1-D2 domain fragments of five clone libraries from

picoplankton size fraction (0.2–3 µm) samples taken in

subpolar and subtropical oceanic waters highlighted

hundreds of new Haptophyte ribotypes (Liu et al. 2009).

This unveiled diversity could explain the paradox of

the apparent dominance of Haptophytes in photosyn-

thetic pigment–based analyses from marine ecosystems

compared to the scarcity of Haptophyte sequences in

SSU rDNA studies when using universal primers.

In this study, we probed marine Haptophyte genetic

diversity using a primer set targeting specifically the

Haptophyte D1-D2 LSU rDNA region. This newly

designed primer set was used to amplify nucleic acid

extracts from water samples collected in the Bay of

Naples, a location a priori rich in Haptophyta (e.g.

McDonald et al. 2007). Eight samples were studied, corre-

sponding to a combination of the following parameters:

rDNA or rRNA/cDNA (reverse-transcribed from rRNA),

0.8 to 3 µm or 3 to 20 µm filtration size fractions, subsurface (1 m) or deep chlorophyll maximum (DCM, 23 m). We addressed three main questions: (i) can we reveal new diversity in environmental Haptophyta populations at a single geographical location, using a group-specific 454 pyrosequencing approach? (ii) which template (rDNA or rRNA), filtration size fraction and hydrographical conditions tested herein yield the highest proportion of unknown phylotypes and the highest Haptophyte genetic diversity? and (iii) can we detect significant differences in composition of the communities according to depth, size fraction or template?

Materials and methods

Sampling, rDNA and rRNA extraction, PCR amplification and 454 sequencing

Samples were collected in the Bay of Naples on 13th

October 2009 at the ‘Mare Chiara’ station (position 40°48.5′ N, 14° 15′ E) (Fig. S1, Supporting information) as

part of the BioMarKs project (http://www.biomarks.eu/).

Sea water was sampled with Niskin bottles at two

depths (1 and 23 m). Samples from 1 m are hereafter

referred to as ‘subsurface’ samples, whereas the deeper

samples correspond to the deep chlorophyll maximum



‘DCM’. After prefiltration through a 20-µm pore-size

plankton net, 30 L of sea water was successively filtered

through 3-µm and 0.8-µm pore-size polycarbonate filters

(142-mm polycarbonate filters). To limit RNA degrada-

tion, filtration time did not exceed 30 min. Filters

were then flash-frozen in liquid nitrogen and stored at

−80 °C. In the laboratory, filters with cells were cryo-crushed

(6 knocks/sec for 1 min; FreezerMill 6700).

Total DNA and RNA were extracted simultaneously

from the same crushed filter using the NucleoSpin®

RNA L kit and quantified using a Nanodrop ND-1000

Spectrophotometer. The quality of nucleic acid extracts

was checked on a 1.5% agarose gel. Total RNA extracts

were treated with the TurboDNA free kit in order to

remove any contaminating DNA. RT–PCR was then

performed with Superscript III according to the manu-

facturer’s instructions. Eight samples were finally

obtained, corresponding to a combination of the follow-

ing conditions: rDNA or rRNA (cDNA), filtration size

fractions of [0.8–3 µm] or [3–20 µm], subsurface or DCM.

Based on a reference alignment of LSU rDNA D1-D2

sequences from 172 cultured Haptophyte strains repre-

senting 75 species belonging to all known families,

a primer set, named LSU1 (Table S1, Supporting

information), was manually designed to specifically

amplify Haptophyta sequences in environmental sam-

ples (Fig. S2, Supporting information). A nucleic frag-

ment ranging between 350 and 400 bp was targeted,

and for the highest number of Haptophyta lineages, we

tried to minimize the number of mismatches appearing

in the 5′ region of the primer. At the same time, we

tried to maximize the number of mismatches with LSU

sequences from non-Haptophyta lineages, referring to

LSU sequences available on the SILVA database (~23 600

sequences) (Table S2, Supporting information). Specific-

ity of the primer set was then tested on DNA

extracts of several protistan cultures (Haptophyte and

non-Haptophyte) by PCR, cloning and Sanger sequenc-

ing. These preliminary steps allowed us to consider our

primer set as Haptophyta-biased.

PCRs were conducted with ‘fusion’ primers, which

include the primers designed in this study linked to

adaptor and key sequences required for 454 sequencing

on a FLX Titanium Sequencer. For each of the 8 ampli-

fied samples, a 7-bp multiplex identifier or MID

sequence was designed and included in one of the

fusion primers in order to identify the origin of every

single read from the pooled population generated on a

single run. Structures of the ‘fusion’ primers were as fol-

lows: Primer 1: (5′) Adaptor A + MID + Key + [Forward

primer] (3′); Primer 2: (5′) Adaptor B + Key + [Reverse

primer] (3′) (with Adaptor A: 5′-CCATCTCATCCCTGC

GTGTCTCCGAC-3′, Adaptor B: 5′-CCTATCCCCTGTGT

GCCTTGGCAGTC-3′, and Key: 5′-TCAG-3′). Three PCR

amplifications were conducted from each of the 8

extracts with Phusion® High-Fidelity DNA Polymerase

(Finnzymes) with an initial denaturation step at 98 °C for 30 s, followed by 25 cycles of 10 s at 98 °C, 30 s at

53 °C for annealing, 30 s at 72 °C, and a final elongation

step at 72 °C for 10 min. PCR products were run on a

1.5% agarose gel to check for successful amplification

products of the expected length. Replicated PCRs were

then pooled and purified using the NucleoSpin® Extract

II kit. The purified products were quantified using a

Nanodrop spectrophotometer and finally mixed in equal

concentrations. The final mix was delivered for sequenc-

ing at the Norwegian Sequencing Centre, University of

Oslo. Emulsion PCR and sequencing were performed

using a GS FLX emPCR amplicon kit using unidirec-

tional sequencing with Lib-L chemistry (Genome

Sequencer FLX Titanium, 454 Life Sciences from Roche,

Branford, CT, USA).
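The layout of the ‘fusion’ primers described above can be illustrated with a minimal Python sketch. The adaptor and key sequences are those given in the text; the MID and the forward and reverse primer sequences used below are hypothetical placeholders, because the LSU1 primer sequences are listed in Table S1 and are not reproduced here.

```python
# Adaptor and key sequences as given in the text (written 5'->3').
ADAPTOR_A = "CCATCTCATCCCTGCGTGTCTCCGAC"
ADAPTOR_B = "CCTATCCCCTGTGTGCCTTGGCAGTC"
KEY = "TCAG"


def fusion_primers(mid, forward, reverse):
    """Assemble both fusion primers (5'->3') following the layout in the text.

    Primer 1: Adaptor A + MID + Key + forward primer
    Primer 2: Adaptor B + Key + reverse primer
    """
    primer1 = ADAPTOR_A + mid + KEY + forward
    primer2 = ADAPTOR_B + KEY + reverse
    return primer1, primer2


# Hypothetical placeholders: a 7-bp MID and 20-nt primers written as N's.
example_mid = "ACGAGTG"
example_forward = "N" * 20
example_reverse = "N" * 20
primer1, primer2 = fusion_primers(example_mid, example_forward, example_reverse)
```

With one distinct MID per sample, the eight forward fusion primers differ only in the 7-bp segment between Adaptor A and the key.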

Sequence data cleaning and processing

454 GS FLX flowgrams (sff files) were deposited on

the Dryad database (see the Data Accessibility sec-

tion). From sff files, we extracted untrimmed sequence

and quality data using the sff2fastq software (http://

github.com/indraniel/sff2fastq), which converts files

to the easily parsable fastq format. In each sequence,

we searched for the MID, followed by the sequence

of the forward primer, the targeted genetic sequence

and the sequence of the reverse primer, in order to

assign sequences to one of the 8 initial samples. We

extracted the targeted part of each sequence, together

with its quality value. For each sequence, we computed

the expected number of errors in any 50-bp

window (EE) from the quality scores, using the formula

$EE = \sum_i 10^{-Q_i/10}$, where $Q_i$ is the quality value

of the flowgram at position i. Any sequence with a

50-bp window with > 1% error ($EE/50 \times 100 > 1\%$)

was discarded. Finally, we applied chimera detection

in each sample, using the uchime module from the

usearch v4.0 software (Edgar 2010; http://www.drive5.

com/usearch/), either using the external reference

database used later for sequence assignment (see

below) or using the experimental sequences (obtained

in this study) as references because chimeras in a

sample should be formed from sequences from the

same sample.
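The sliding-window quality filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes Phred-scaled quality values (error probability 10^(-Q/10) per base), and the function names and the handling of reads shorter than one window are assumptions made for the example.

```python
def max_window_error_rate(quals, window=50):
    """Return the highest expected error rate found in any `window`-bp window.

    Each Phred quality Q is converted to an error probability 10**(-Q/10);
    the expected number of errors in a window is the sum of these values.
    """
    if not quals:
        raise ValueError("empty quality list")
    probs = [10 ** (-q / 10) for q in quals]
    if len(probs) <= window:
        # Assumption: a read shorter than one window is treated as one window.
        return sum(probs) / len(probs)
    expected_errors = sum(probs[:window])
    worst = expected_errors
    for i in range(window, len(probs)):
        # Slide the window one base to the right.
        expected_errors += probs[i] - probs[i - window]
        worst = max(worst, expected_errors)
    return worst / window


def passes_filter(quals, window=50, max_rate=0.01):
    """Keep a read only if no window exceeds the 1% expected error rate."""
    return max_window_error_rate(quals, window) <= max_rate
```

A read of uniform quality Q30 has an expected error rate of 0.1% in every window and is kept, whereas a short run of Q10 bases is enough to push one window above the 1% threshold.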

Taxonomic assignment of 454 reads

Hierarchical clustering. Taxonomic assignment of reads

was performed using our pre-existing database of LSU

rDNA sequences from Haptophyta (see the Data Acces-

sibility section). This database includes 1462 reference



LSU rDNA sequences generated by Sanger sequencing:

172 sequences from Haptophyte strains in culture (75

species representing all current known families and

almost all cultivable species currently known) and 1290

sequences from environmental clone libraries. The LSU

rDNA reference database provided in this study covers

the entire Haptophyta diversity currently

sequenced. For each primer set, we detected the for-

ward and reverse primers within the reference database

using position weight matrices allowing up to 5 degen-

eracies, and extracted the amplified parts of the

sequences. Experimental sequences (reads), sorted by

abundance, were then aligned with the reference

extracted sequences sorted by decreasing length. All

sequences, experimental and referential, were then clus-

tered to 85% identity using the global alignment cluster-

ing option of the uclust module from the usearch v4.0

software (Edgar 2010). Each 85% cluster was then reclu-

stered at a higher stringency level (86%) and so on

(87%, 88%,…) in a hierarchical manner up to 100%

similarity. Each experimental sequence was then

identified by the list of clusters to which it belonged

at 85–100% levels. This information can be viewed as a

matrix with the rows corresponding to different

sequences and the columns corresponding to the cluster

membership at each clustering level. Taxonomic assign-

ment for a given read was performed by first looking

whether reference sequences clustered with the experi-

mental sequence at the 100% clustering level. If this

was the case, the last common taxonomic name of the

reference sequence(s) within the cluster was used to

assign the environmental read. If not, the same proce-

dure was applied to clusters from 99% to 85% similarity

if necessary, until a cluster was found containing both

the experimental read and reference sequence(s), in

which case sequences were taxonomically assigned as

described above.
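The assignment rule described above (search from the 100% level down to 85%, stop at the first cluster that contains reference sequences, and report their last common taxonomic name) can be sketched in Python. The data structures and names below are hypothetical and chosen for the example; they do not reflect the actual scripts used in the study.

```python
def last_common_taxon(lineages):
    """Return the longest shared prefix of several lineages.

    Each lineage is a list of names ordered from the broadest rank downwards.
    """
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return common


def assign_read(read_clusters, cluster_refs, ref_lineage):
    """Assign one read using its cluster membership at each identity level.

    read_clusters: {level: cluster_id} for the read (levels 85..100)
    cluster_refs:  {(level, cluster_id): [reference ids in that cluster]}
    ref_lineage:   {reference id: lineage as a list of names}
    Returns (level, lineage), or (None, []) if no level contains a reference.
    """
    for level in range(100, 84, -1):  # 100, 99, ..., 85
        refs = cluster_refs.get((level, read_clusters.get(level)), [])
        if refs:
            return level, last_common_taxon([ref_lineage[r] for r in refs])
    return None, []


# Toy example: the read first meets reference sequences at the 97% level.
ref_lineage = {
    "ref1": ["Haptophyta", "Prymnesiales", "Chrysochromulina"],
    "ref2": ["Haptophyta", "Prymnesiales", "Prymnesium"],
}
cluster_refs = {(97, "c7"): ["ref1", "ref2"]}
read_clusters = {level: "x%d" % level for level in range(85, 101)}
read_clusters[97] = "c7"
result = assign_read(read_clusters, cluster_refs, ref_lineage)
```

In this toy case the two references disagree at the genus level, so the read is assigned only down to the order.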

Phylogenetic mapping. Several dedicated programs for

phylogenetic mapping of anonymous sequences onto

reference trees, which can handle very large data sets,

are available (Matsen et al. 2010; Berger et al. 2011).

These methods require a reference alignment and a

corresponding reference phylogenetic tree, onto which

the position of the query sequences is examined using

phylogenetic tree reconstruction algorithms. Here, two

reference alignments of LSU rDNA were built: the first

included the 172 sequences of cultured strains from our

database (labelled subsequently data set 1), the second

including all 1462 sequences from our database (cul-

tures and environmental samples, labelled subsequently

data set 2). LSU rDNA sequences were aligned using

MAFFT v6.818 taking into account RNA secondary

structure (Q-INS-i option; Katoh & Toh 2008), with

subsequent de visu refinement in BioEdit v7.0.5.3 (Hall

1999). The general time-reversible (GTR) model was

selected as the best nucleotide substitution model

according to the corrected Akaike information criterion

and the Bayesian Information Criterion implemented

and calculated in jModeltest v0.1.1 (Posada 2008). Two

LSU rDNA trees (tree1 corresponding to data set 1 and

tree2 corresponding to data set 2) were built using

maximum-likelihood (ML) inference with a GTR model

and a gamma and invariant sites distributions as imple-

mented in PhyML v3.0 (Guindon & Gascuel 2003).

These alignment/tree couples were used as references

for phylogenetic mapping. Environmental 454 reads

obtained after our cleaning process and assigned to

Haptophyta (see section Sequence data cleaning and

processing above) were aligned to the hidden Markov

model (HMM) profiles built from the reference align-

ments using tools from the HMMER v3.0 suite (http://

hmmer.org/). The resulting alignments were curated,

that is, gapped columns in the reference alignment were

removed. Finally, the phylogenetic positions of the

reads were computed using Pplacer, which enables

efficient ML and posterior probability phylogenetic

mapping (Matsen et al. 2010). Each 454 read was thus

mapped to the reference alignment, and the most

probable location was reported on the reference trees.

Haptophyte community diversity

Differences between Haptophyte communities were

investigated by considering operational taxonomic units

(OTUs) as Haptophyte sequences clustering at 97%

identity. This level of clustering was chosen according

to cultured Haptophyta intra- and inter-rank genetic

diversity (Fig. S3, Supporting information) and accord-

ing to the shape of rarefaction curves built with our 454

data (Fig. S4, Supporting information). Abundance

tables from OTU97% were built. Alpha diversity was cal-

culated using Shannon’s diversity index (Shannon 1948)

and Simpson’s evenness (Simpson 1949). Mean values

of Shannon and Simpson diversity indexes were com-

pared by an overall Kruskal–Wallis test and subsequent

pairwise Wilcoxon–Mann–Whitney tests. The OTU97%

abundance tables were standardized using the Hellinger

transformation to lower the weight of rare ‘species’

(Legendre & Gallagher 2001), and pairwise distance

matrices were then calculated using the Bray–Curtis

dissimilarity index (Bray & Curtis 1957). Variation in

Haptophyte community structure was then determined

by applying nonmetric multidimensional scaling

(NMDS, Gower 1966) to the dissimilarity matrices. The

Haptophyte community composition recorded in differ-

ent conditions (i.e. depths, size fractions and template)

was compared and tested for significant differences



using the analysis of similarity (ANOSIM, Clarke 1993),

followed by 10 000 Monte Carlo permutation tests and

Bonferroni correction. Community turnover was deter-

mined by calculating the proportion of shared OTU97%

and the proportion of specific OTU97% between the 8

samples. All data and statistical analyses were carried

out using the vegan (Oksanen et al. 2007), the MASS and

the limma packages (from the bioconductor website:

http://www.bioconductor.org/biocLite.R), as well as

custom R scripts in the R statistical environment (R ver-

sion 2.10.0, R Development Core Team, 2009). ANOSIM

was calculated through the PAST software (Hammer

et al. 2001).
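The diversity and dissimilarity measures named above can be written out explicitly. The sketch below uses plain Python rather than the vegan R package used in the study. The text does not state which formulation of Simpson's evenness was applied, so the form used here (inverse Simpson index divided by the number of observed OTUs) is an assumption.

```python
import math


def shannon(counts):
    """Shannon diversity H' = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_evenness(counts):
    """Simpson's evenness, assumed here to be (1 / sum(p**2)) / S."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return (1.0 / sum(p * p for p in props)) / len(props)


def hellinger(counts):
    """Hellinger transformation: square root of relative abundances."""
    total = sum(counts)
    return [math.sqrt(c / total) for c in counts]


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


# Toy OTU abundance vectors for two samples (hypothetical values).
sample_a = [10, 5, 0, 1]
sample_b = [2, 8, 6, 0]
dissimilarity = bray_curtis(hellinger(sample_a), hellinger(sample_b))
```

Taking the square root of relative abundances compresses the range between dominant and rare OTUs before the Bray–Curtis dissimilarity is computed.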

Results

Haptophyte diversity revealed by 454 pyrosequencing of LSU rDNA sequences

Following our stringent strategy of data cleaning and

processing, 13 501 reads were kept and assigned as

Haptophyta sequences (Table 1, Table S3, Supporting

information). Our cleaning process involved the

removal of reads with errors in the adaptor and MID

sequences, removal of reads with one or more unre-

solved bases (Ns), a strict selection of sequences with

error score <1% and removal of presumed chimeras.

Consequently, the quantity of analysed reads was rather

low (~32%, Table S3), but of high quality. Only reads

that clustered with Haptophyta reference sequences to

an identity level >85% were retained. A large majority

of reads (~87%) were assigned to Haptophyta after

cleaning and clustering steps (Table S3, Fig. S5, Sup-

porting information). This result confirms the high

Haptophyte specificity of the primer set LSU1 here

designed. In a parallel ongoing study (Bittner et al. in

preparation), the same samples were indeed amplified

with universal eukaryotic primer sets targeting variable

regions of the SSU rRNA (the V4 and the V9 region),

but only a very low proportion of the pyrosequencing

reads were assigned to Haptophyta (in the best case

2.6% of the total reads; Fig. S6, Supporting information).

It demonstrates furthermore the advantage of using the

Haptophyta-biased primers set LSU1 to specifically

amplify Haptophyta rDNAs out of total DNA and RNA

samples.

Table 1 Number of reads and number of OTUs97% obtained after the cleaning process (see section Sequence data cleaning and processing). sr (%) indicates the proportion of single reads. uq (%) indicates the proportion of OTUs97% including only one read. aREF (%) indicates the proportion of OTUs97% assigned to a reference sequence (a reference sequence was produced with Sanger sequencing and corresponds to the amplification from Haptophyte strains in culture or from environmental clone libraries). aREFcult (%) indicates the proportion of OTUs97% assigned to a reference sequence obtained from cultured Haptophyta strains.

Depth        Size fraction   Template   Reads    sr (%)   OTUs97%   uq (%)   aREF (%)   aREFcult (%)
DCM          0.8–3 µm        rRNA        2032     4.1     254       32.7     16.5       1.4
DCM          0.8–3 µm        rDNA        2296     4.3     286       34.3     14         1.4
DCM          3–20 µm         rRNA         872    13.5     284       41.5     15.3       1.6
DCM          3–20 µm         rDNA        1517     9.3     361       39.3     17.9       1.7
Subsurface   0.8–3 µm        rRNA        2012     7.2     371       39.1     18.6       1.3
Subsurface   0.8–3 µm        rDNA        1745     7.4     312       41       19.6       1.3
Subsurface   3–20 µm         rRNA        1519     6.2     259       36.3     15.8       1.9
Subsurface   3–20 µm         rDNA        1508     4.3     212       30.7     15.1       1.9
LSU1 (8 samples pooled)                 13501     1.8     871       28       11.8       0.9

For each sample, the total number of cleaned Haptophyta 454 reads, the number of Haptophyta OTUs97%, the proportion of single reads (sr, in %), the percentage of OTUs97% including only one read (uq), the percentage of assigned OTUs97% clustering with a reference sequence previously obtained with Sanger sequencing (aREF) and the percentage of assigned OTUs97% clustering with a reference sequence from a cultured Haptophyte strain (aREFcult) were calculated (Table 1). On the global data set, the proportion of OTUs97% including only one read was ~28% (Table 1).

The most striking result was the very low percentage

of OTU97% clustering with reference sequences. On the

global data set, only 11.8% of the OTU97% clustered

with reference sequences. Additionally, 0.9% of the

OTU97% clustered with reference sequences from cul-

tured Haptophyte strains, representing a wide range of

the group’s natural diversity. Fig. 1A, B further explores

this significant proportion of Haptophyte unreferenced

OTUs at different levels of clustering. Fig. 1A shows

the extent of new clusters that are built when environ-

mental 454 reads are added to Sanger sequences.

Fig. 1B depicts the proportion of unknown Haptophyta

clusters obtained with 454 sequencing for each set of

primers. Most Haptophyte assigned reads clustered

with Sanger reference sequences from environmental

clone libraries instead of clustering with cultured reference

Haptophyta taxa (Fig. 1B, Table 1). After a hierarchical

process of assignment decreasing to a sequence

identity level of 85%, only 7.5% of reads could be

assigned to cultured, reference Haptophyta taxa (data

not shown).
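The sr and uq statistics of Table 1 can be related to each other with a short calculation. The sketch below assumes that "single reads" are the reads forming single-read OTUs97%; this interpretation is not stated explicitly in the text, but it is consistent with the pooled row, where 28% of 871 OTUs97% (about 244 OTUs) corresponds to about 1.8% of 13 501 reads. The OTU size distribution used here is synthetic and only chosen to match those totals.

```python
def singleton_stats(otu_sizes):
    """Return (sr, uq) in percent from a list of OTU sizes (reads per OTU).

    uq: percentage of OTUs that contain exactly one read.
    sr: percentage of reads that belong to such single-read OTUs.
    """
    n_reads = sum(otu_sizes)
    n_single = sum(1 for size in otu_sizes if size == 1)
    uq = 100.0 * n_single / len(otu_sizes)
    sr = 100.0 * n_single / n_reads
    return sr, uq


# Synthetic distribution (not the real data): 871 OTUs and 13 501 reads,
# of which 244 OTUs contain a single read.
sizes = [1] * 244 + [2] * 626 + [12005]
sr, uq = singleton_stats(sizes)
```

Under this reading, a large share of single-read OTUs can coexist with a small share of single reads, because a few abundant OTUs hold most of the reads.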

The proportion of totally new OTU97% (clusters that do not include sequences previously obtained by Sanger sequencing from culture or environmental samples, which can be deduced from aREF) was on average higher in sub-data sets from the large size fraction (3–20 µm) sampled at the DCM, and corresponding to the rRNA template (Table 1). However, the proportion of taxonomically unknown OTU97% (clusters that do not include any sequence from Haptophytes in culture, which can be deduced from aREFcult) was on average higher in the samples from the small size fraction (0.8–3 µm) and from the DCM (Table 1).

Haptophyte community diversity and structuring

Abundance tables of OTU97% were used to calculate the

Shannon’s diversity index and Simpson’s evenness. The

highest alpha diversity was found in the rRNA data

from the large size fraction samples collected at the

DCM (Fig. 2, Fig. S7, Supporting information). This dif-

ference in diversity was nevertheless only significant

when comparing samples from the subsurface and from

the DCM (Kruskal–Wallis and Wilcoxon tests, details of

the results not shown). The same trend was found

when OTUs were defined at lower levels of clustering

or when singletons were removed from the abundance

tables (data not shown).

Variations in the Haptophyte community structure

were determined by a two-dimensional representation

of NMDS (Fig. 3). The Haptophyte community seems to

be mostly structured according to the depth. Commu-

nity structure seems to be also relatively influenced by

the size fraction. In contrast, rDNA and rRNA commu-

nity structures reveal large overlap. Group separation of

the samples was further tested by analysis of similarity

(ANOSIM): the Haptophyte community inferred showed a

significant differentiation of community structuring

between the two sampling depths (R = 0.66, P < 0.05)

(Table S4, Supporting information). No other significant

dissimilarity was detected when comparing

the samples from the small and the large size fraction,

or when comparing rDNA and rRNA data (Table S4).

Fig. 1 Known versus unknown diversity. (A) Number of clusters as a function of clustering level. (B) Proportion of unassigned vs. assigned reads as a function of clustering level. Full lines indicate the proportion of reads clustering with at least one reference sequence obtained by Sanger sequencing of environmental or cultured samples. Dashed lines indicate the proportion of reads clustering with reference sequences from cultured Haptophyte strains.

OTU97% found in all conditions were rare: only 12 were found in our data set. Venn diagrams comparing the number of common OTU97% between samples showed that on the total data set, the proportion of clusters sharing both rDNA and rRNA reads was about 1/2 (Fig. S8, Supporting information). The proportion of common OTU97% between the small and the large size fractions and between the subsurface and the DCM samples reached, respectively, 43% and 39% (Fig. S8). Samples including the highest proportion of specific (or nonshared) OTU97% were taken at the DCM. A pairwise comparison of each of the eight samples clearly highlighted that the sample sharing the lowest proportion of common OTU97% with the others is the one taken at the DCM, corresponding to the large size fraction and to the analyses of the rRNA template (Fig. S9, Supporting information).
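The comparison of shared and specific OTU97% reported above amounts to set operations on the OTU lists of each sample. The sketch below is illustrative only: the text does not specify the denominator used for the proportion of common OTU97%, so the union of both samples is assumed here, and the sample contents are hypothetical.

```python
def shared_fraction(otus_a, otus_b):
    """Fraction of OTUs found in both samples, relative to their union (assumed)."""
    a, b = set(otus_a), set(otus_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def core_otus(samples):
    """Return the OTUs present in every sample.

    samples maps a sample name to an iterable of OTU identifiers.
    """
    sets = [set(otus) for otus in samples.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical OTU identifiers for three samples.
toy_samples = {
    "DCM_large_rRNA": ["otu1", "otu2", "otu5"],
    "DCM_small_rRNA": ["otu1", "otu2", "otu3"],
    "subsurface_small_rDNA": ["otu1", "otu3", "otu4"],
}
core = core_otus(toy_samples)
pairwise = shared_fraction(
    toy_samples["DCM_large_rRNA"], toy_samples["DCM_small_rRNA"]
)
```

Applied to all pairs of samples, the first function gives a pairwise sharing matrix, while the second returns the OTUs common to all conditions.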

[Figure 2 image not reproduced. Three boxplot panels of the Shannon diversity index (OTU97%; y-axis ticks from 4.6 to 5.2): samples grouped by a single factor (rDNA, rRNA, [0.8-3], [3-20], subsurface, DCM), by pairs of factors (e.g. [0.8-3] rDNA, DCM rRNA, [3-20] DCM), and as the eight individual samples.]

Fig. 2 Boxplots summarizing the range of α-diversity (Shannon’s index) calculated at 97% clustering level. Bottom, middle and top

lines of boxes represent the 25th (lower hinge), 50th (median) and 75th (upper hinge) percentiles; whiskers represent the nonextreme

sample minimum and maximum (i.e. within 1.5 × the interquartile range of the box).

[Figure 3 image not reproduced. Three NMDS panels (x-axis NMDS1, y-axis NMDS2; stress = 2.4% in each) showing the same ordination with samples grouped by depth (DCM vs. subsurface), by size fraction ([3-20] vs. [0.8-3]) and by template (DNA vs. cDNA).]

Fig. 3 Haptophyte community structure based on NMDS (nonmetric multidimensional scaling) ordination of the LSU rDNA data set distance matrices from OTU97%. The distance matrix was calculated beforehand using the Bray–Curtis dissimilarity index. Each object on the plot represents a sample for a given template, size range and depth. Samples with the lighter colour correspond to the samples taken at the subsurface, whereas samples with the darker colour correspond to samples taken at the DCM. Large circles and large diamonds correspond to samples from the 3–20 µm size fraction, whereas small circles and small diamonds correspond to samples from the 0.8–3 µm size fraction. Circles correspond to rDNA samples, whereas diamonds correspond to rRNA samples. Similarity in Haptophyte community structure is indicated by the distance between objects: a smaller distance indicates a higher resemblance in community structure. Samples are here grouped according to the size range. The goodness-of-fit of the NMDS representation is indicated by the low stress values.



Phylogenetic mapping

The 13 501 LSU environmental reads obtained after our cleaning process were mapped onto a reference tree built from an alignment of the 172 reference sequences from cultured Haptophytes (Fig. 4A, B). Considering all samples, reads belonged to all Haptophyta orders, except the Pavlovales and the Zygodiscales. The highest proportion of reads was mapped in the Prymnesiales from clade B2 (or Chrysochromulinaceae): considering the pooled data set, 68.6% of the reads were identified as Chrysochromulinaceae reads. Phaeocystales represented the second most abundant group in terms of number of reads (11.1%). A substantial proportion of the reads (8.6%) could not be assigned precisely to a Haptophyte species or even to an order. Prymnesiales from clade B1 (or Prymnesiaceae) represented 5.4% of the reads, Coccolithales 3%, Syracosphaerales and Isochrysidales each 1.2%. In addition, 0.02% of the reads were assigned to the species Chrysoculter rhomboideus. In each sample, the proportions indicated above are approximately the same (Fig. 4B, Fig. S10, Supporting information). The most important differences between communities from the subsurface and the DCM are as follows: (i) a higher proportion of reads was assigned to Phaeocystales in DCM samples; and (ii) reads assigned to the species Chrysoculter rhomboideus were only found in the DCM samples.

Discussion

Stringent primer design and cleaning process allow an accurate targeted metagenomics approach

The LSU rDNA region targeted herein is longer than previously pyrosequenced eukaryotic genomic regions (>350 bp in the current study, as compared to a maximum of 270 bp in previous studies using the V4 SSU rDNA region), increasing the likelihood of producing low-quality sequences towards the end of the reads (Huse et al. 2007; Gilles et al. 2011; Quince et al. 2011). Therefore, our cleaning process was stringent, with the following two steps: (i) a chimera detection step for all reads; and (ii) specific quality checking for sequences appearing only once in a sample. The proportion of low-quality reads detected and eliminated by our cleaning pipeline was at least twice as high as in previously published studies exploring protistan diversity (Amaral-Zettler et al. 2009; Stoeck et al. 2009, 2010; Nolte et al. 2010; Behnke et al. 2010; Cheung et al. 2010; Pawlowski et al. 2011). Considering that the estimated error rate of the 454 GS FLX Titanium is about 1% of the whole data output (Gilles et al. 2011), interpreting our data using a 97% sequence identity threshold seems to be a reasonable strategy to minimize inflation of the number of clusters (OTUs) and hence to limit the overestimation of diversity.
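The reasoning behind the 97% threshold can be illustrated with a short back-of-the-envelope calculation. The sketch below assumes, for illustration only, that the ~1% error rate applies uniformly per base and uses the 350 bp lower bound of the read length given above; the function names are hypothetical.

```python
import math


def expected_errors(read_length, per_base_error_rate):
    """Expected number of erroneous bases in a single read."""
    return read_length * per_base_error_rate


def tolerated_mismatches(read_length, identity_threshold):
    """Largest number of mismatches still compatible with the identity threshold."""
    # round() removes floating-point noise before flooring
    return math.floor(round(read_length * (1.0 - identity_threshold), 9))


read_length = 350   # bp, lower bound of the read length reported in the text
error_rate = 0.01   # assumption: the ~1% error rate applies per base

print(expected_errors(read_length, error_rate))   # expected errors per read
print(tolerated_mismatches(read_length, 0.97))    # mismatches allowed at 97% identity
```

Under these assumptions, the number of mismatches tolerated within an OTU97% exceeds the number of errors expected in a typical read, so most error-bearing reads would be expected to cluster with their true template rather than form spurious OTUs.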

The clustering level used to define OTUs is lineage specific and marker dependent (Caron et al. 2009; Nebel et al. 2011). In our data set, rarefaction curves based on OTUs100% did not reach saturation, whereas they did when using OTUs97%. This clustering level was also chosen to accommodate the relatively slow rate of rDNA substitution known from Haptophyta, as revealed by the presence of rDNA from two different reference cultured species in the same OTU100%. Using OTUs97%, the proportion of clusters including only one read is below one third. This proportion is similar to the one calculated in previous NGS diversity studies (Sogin et al. 2006; Roesch et al. 2007; Brown et al. 2009; Stoeck et al. 2009; Behnke et al. 2010; Huse et al. 2010). The structure of the Haptophyte diversity found here still supports the ‘rare biosphere’ model (Sogin et al. 2006; Dawson & Hagen 2009; Caron & Countway 2009): environmental microbial communities are dominated by a few relatively abundant populations, and hundreds of low-abundance populations account for most of the observed phylogenetic diversity.
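To make the notion of a saturating rarefaction curve concrete, the sketch below computes the expected number of OTUs in a random subsample of a given depth using the classical hypergeometric rarefaction formula. It is an illustration under stated assumptions, not the procedure used in this study: the function name and the read counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, depth):
    """Expected number of OTUs observed in a random subsample of `depth` reads.

    For each OTU with c reads out of N, the probability of being missed is
    C(N - c, depth) / C(N, depth); the expected richness sums the complements.
    """
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total read count")
    denominator = comb(total, depth)
    return sum(1 - comb(total - c, depth) / denominator for c in counts if c > 0)


# Hypothetical community: a few abundant OTUs and many rare ones.
counts = [500, 300, 100] + [2] * 20 + [1] * 10
curve = [(n, round(rarefied_richness(counts, n), 1)) for n in (10, 100, 500, sum(counts))]
print(curve)
```

A curve is said to reach saturation when increasing the subsample depth no longer adds a noticeable number of expected OTUs.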

For the first time, this study tested the accuracy of a set of LSU rDNA primers designed to specifically amplify a major group of unicellular eukaryotes, the Haptophytes, from environmental samples. Our set of primers gave excellent results concerning specificity (87% of postcleaning reads were assigned to Haptophyta). Monchy et al. (2011) published a comparative study of reads obtained with universal eukaryote primers and specific fungus-designed primers, targeting SSU hypervariable regions, in which enrichment in fungal sequences with specific primers reached only 3–10%. The fungus-specific primers decreased the proportion of Metazoa, Viridiplantae and Stramenopiles sequences, but largely favoured the amplification of Katablepharidophyta and Cryptophyta. Thus, designing primer sets allowing the extraction of a rather large monophyletic group out of the otherwise extremely rich and ancient protistan diversity is not trivial. Our study particularly highlighted the importance of designing both forward and reverse group-specific primers.

Fig. 4 Pplacer phylogenetic mapping of the 13 501 environmental reads (obtained after the cleaning process) onto a phylogenetic tree including all reference sequences from cultured Haptophytes (172 sequences of cultured strains). (A) Phylogenetic mapping on the phylogenetic tree. Numbers of reads assigned to a node or a branch are indicated in the green dots. Nodes labelled with a (*) correspond to ‘basal’ nodes that do not yet have an (uncontested) taxonomic denomination. (B) Details on the proportion of reads assigned by phylogenetic mapping for the main Haptophyte lineages.

Data of panel (B). Size fractions: Large (3-20), Small (0.8-3). Total: pooling of the 8 samples.

Depth                     Subsurface  Subsurface  Subsurface  Subsurface  DCM         DCM         DCM         DCM         Total
Size fraction             Large       Large       Small       Small       Large       Large       Small       Small
Template                  DNA         cDNA        DNA         cDNA        DNA         cDNA        DNA         cDNA
Prymnesiales B2           71.8%       59.2%       73.4%       74.6%       52.9%       83.7%       66.2%       69.1%       68.6%
Phaeocystales             4.4%        5.4%        8.6%        10.7%       16.6%       5.4%        17.4%       16.5%       11.1%
Basal nodes (*)           9.4%        10.9%       9.1%        8.8%        7%          2.2%        9.9%        8.1%        8.6%
Prymnesiales B1           9.5%        11.7%       5.6%        3.3%        6%          4.1%        4.5%        3.4%        5.9%
Coccolithales             0.6%        6.3%        2.3%        2.4%        4.5%        1.4%        1.2%        1.6%        2.5%
Syracosphaerales          0.5%        2.5%        0.9%        0.1%        8.8%        1.9%        0.8%        1.1%        1.9%
Isochrysidales            3.7%        4%          0.2%        0%          4.1%        1.1%        0%          0.2%        1.5%
Chrysoculter rhomboideus  0%          0%          0%          0%          0.2%        0.1%        0.1%        0%          0.04%



454 sequencing reveals significant new environmental Haptophyte diversity

Using Sanger sequencing of Haptophyte-specific clone libraries, Liu et al. (2009) published 674 environmental LSU rDNA D1-D2 sequences from the picoplankton size fraction. None of the environmental phylotypes was 100% identical to any of the taxonomically defined sequences obtained by sequencing Haptophytes in culture. In our study, 454 sequencing of LSU rDNA from a single geographical location (the Bay of Naples) unveiled hundreds of new, unassigned Haptophyta clusters, even at relatively low clustering levels. Among the few 454 reads that could be assigned to pre-existing Sanger sequences, most were related to environmental sequences, corroborating previous results based on SSU rDNA clone libraries (Shi et al. 2009; Cuvelier et al. 2010). At the 97% clustering level, we can suggest that at least 99% of the genetic LSU rDNA Haptophyte diversity is currently not available in culture, as predicted by previous clone library studies (Shi et al. 2009; Liu et al. 2009; Cuvelier et al. 2010; Shi et al. 2011). Strains that are currently in culture are therefore far from reflecting the natural environmental genetic diversity. Clearly, the extent of unknown protistan genetic diversity will soar with the continued use of next-generation sequencing (NGS) technologies.

The proportion of totally new OTU97% (OTUs that do not cluster with sequences from cultured Haptophyta strains or environmental samples) and the diversity indexes were higher in rRNA sub-data sets from the large size fraction (3–20 µm) taken at the DCM. The slightly higher diversity observed with rRNA sequences may be partly due to the fact that they have been transcribed and reverse-transcribed, and both processes are more error prone than replication (Pulsinelli & Temin 1994; Sydow & Cramer 2009). These additional steps might have introduced artefacts that inflate the diversity of rRNA samples. Nevertheless, we would have expected these artificial sequences to constitute rare phylotypes. We observed, however, that rDNA OTU97% were less evenly distributed (few abundant OTUs and many rare OTUs) than rRNA OTU97% (more even OTUs). As PCR amplification preferentially amplifies the more abundant reads, we thus concluded that a slightly higher, but ‘true’ diversity was found in the rRNA samples.
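The evenness contrast described above (few abundant and many rare OTUs versus more evenly distributed OTUs) can be quantified in several ways. As one possible illustration, the sketch below uses Pielou's evenness J = H' / ln(S); this choice of measure is an assumption made for the example, as are the function name and the read counts, none of which are taken from this study.

```python
import math


def pielou_evenness(counts):
    """Pielou's evenness J = H' / ln(S), with H' the Shannon index and S the OTU richness."""
    observed = [c for c in counts if c > 0]
    richness = len(observed)
    if richness < 2:
        raise ValueError("evenness is undefined for fewer than two OTUs")
    total = sum(observed)
    shannon = -sum((c / total) * math.log(c / total) for c in observed)
    return shannon / math.log(richness)


# Hypothetical OTU tables (illustration only).
skewed_counts = [900, 50, 20, 10, 10, 5, 5]        # few abundant, many rare OTUs
even_counts = [200, 180, 160, 150, 120, 100, 90]   # more even distribution
print(round(pielou_evenness(skewed_counts), 3), round(pielou_evenness(even_counts), 3))
```

J ranges from values close to 0 (one OTU dominates) to 1 (all OTUs equally abundant), so a lower value for one template would correspond to the less even distribution described in the text.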

The proportion of taxonomically unknown OTU97% (OTUs that do not cluster with any sequences from cultured Haptophyta strains) was higher in the samples from the small size fraction (0.8–3 µm). This confirms previously published results obtained with SSU rDNA sequences from environmental pico-prymnesiophytes (Haptophyta < 2–3 µm) sorted by flow cytometry: Cuvelier et al. (2010) showed that potential new Haptophyte phylogenetic lineages can be found when studying the smallest size fractions. Moreover, it should be borne in mind that only approximately ten Haptophyte species of this size have been described (Vaulot et al. 2008).

Taxonomic characterization of the novel environmental diversity

As the hierarchical clustering did not permit assignment of 99% of environmental LSU rDNA 454 reads to Sanger sequences from cultures, linking them to a reliable taxonomic framework was not possible. The phylogeny inferred from all 1462 previous LSU rDNA Sanger sequences from cultures and environmental samples showed that the vast majority of environmental sequences branch deep in the Haptophyte tree, in clades often characterized by weak bootstrap support. Therefore, only the reference tree based on cultured Haptophytes could be used to map the environmental 454 reads with a certain degree of accuracy. This analysis revealed that a substantial proportion of our reads, in each of our samples, mapped into the Prymnesiales from clade B2, a group recently designated as the family Chrysochromulinaceae (Edvardsen et al. 2011) and highlighted for its wide distribution in planktonic ecosystems (Liu et al. 2009). The Chrysochromulina species from this group are typically small (<5 µm), noncalcifying and saddle-shaped cells (Vaulot et al. 2008; Edvardsen et al. 2011). A substantial proportion of reads were also assigned to Phaeocystales. The microalga Phaeocystis, the only genus currently described in this order, is one of the most extensively studied taxa of marine phytoplankton, notably because of its major contribution to the global carbon budget (Arigo et al. 1999). Phaeocystis is described as a cosmopolitan bloom-forming microalga (Schoemann et al. 2005), although its individual cells are generally smaller than 6 µm (Long et al. 2007).

The 454 sequencing from a single geographical site uncovered reads belonging to all Haptophyta orders, except the Pavlovales and the Zygodiscales. Zygodiscales include coccolithophore genera such as Helicosphaera, Rhabdosphaera, Discosphaera or Scyphosphaera, which are nevertheless commonly observed in microscopy-based surveys. For the first time, environmental reads were assigned to the genus Chrysoculter, a noncalcifying Haptophyte that has only been reported thus far from coastal waters of Northern Japan (Nakayama et al. 2005).

Structure of environmental Haptophyte communities

In Naples Bay, we found communities that were principally structured according to sampling depth. This structural difference was significant. Studies comparing the structuring of genetic diversity in marine ecosystems between subsurface and DCM have so far been rare, and the dissimilarities observed were often attributed to undersampling issues (Massana et al. 2011). This does not seem to be the case here, because our rarefaction curves reached a plateau. Moreover, Haptophyta communities have already been shown to differ in composition according to depth in previous studies, in the South Pacific (Shi et al. 2009, 2011) and in the Red Sea (Man-Aharonovich et al. 2010). Little information has been published so far on the vertical distribution of Haptophytes in water columns, and it relies mostly on scanning electron microscopy (SEM) studies of coccolithophore lineages (Coccolithales, Isochrysidales, Zygodiscales, Syracosphaerales) (Winter et al. 2002). In SEM studies, taxonomic composition was indeed shown to change mainly according to depth. Furthermore, in the upper layers, taxonomic composition was also influenced by temperature and availability of phosphate, and in the deeper layers, it was mainly influenced by temperature and light availability (Cortes et al. 2001). Emiliania huxleyi (an Isochrysidale) is distributed in the whole water column, but other species and genera are clearly restricted to specific depths: for example, Oolithotus (a Coccolithale) and Algirosphaera (a Syracosphaerale) are typically found in the middle photic zone (Cortes et al. 2001; Malinverno et al. 2003; Frada et al. 2012). The few reads that we were able to assign to the species level in this study tend to confirm the results previously obtained with SEM studies. The significance of these patterns will nevertheless have to be tested with a broader range of samples, including periodic sampling and environmental parameters in order to take seasonal variations into account.

We did not find significant structuring differences in terms of size or template (rRNA or rDNA). Not et al. (2009) suggested that metagenomic approaches based on rRNA may significantly reduce the biases inherent in rDNA surveys (such as the group-specific variability in rDNA copy numbers, dormant cells and occurrence of extracellular, ancient DNA), depicting more accurately the active part of the communities. Considering the relatively low differences in community structuring based on rRNA and rDNA reads, it appears that the most abundant Haptophytes in our samples were also the most physiologically active taxa.

Suggestions for interpreting environmental Haptophyte diversity

In our study, a high OTU diversity has been revealed through high-throughput sequencing of an LSU rDNA fragment at a single location. However, we cannot conclude whether the extended genetic diversity shown here is the result of intraspecific or interspecific diversity and/or nonconcerted evolution of ribosomal operon copies and/or sequencing errors. Diversity created by sequencing errors certainly occurred, but the stringency of our cleaning process and our clustering strategy should have largely reduced these biases in our data set. This extended genetic diversity can also be observed if multiple copies of LSU rDNA are present in Haptophyte genomes, and if these copies have accumulated mutations (in a single genome) and were then amplified. Unfortunately, reliable information about the copy number and the variability of copies of LSU (or even SSU) rDNA in the Haptophytes is presently not available. Despite the sequencing of Haptophyte genomes/transcriptomes/ESTs (e.g. from Emiliania huxleyi), this information is not available because of the way data have been assembled (mainly through the consensus of short sequences). SSU and LSU rDNA sequences from E. huxleyi generally show no sequence variation (on public databases, only one version is given). rDNA copy numbers can have very different values in taxa from different eukaryotic domains (Zhu et al. 2005). Thus, rDNA and rRNA surveys are expected to give very different views when analysing together (in a universal study) groups with very high copy numbers (like Alveolates) and very low copy numbers (like Pelagophytes) (Zhu et al. 2005; Not et al. 2009; Logares et al. 2012). In general, when rDNA and rRNA results are compared for the same sample, lineages showing a higher proportion of rDNA reads than of rRNA reads would be pointed out as lineages with a high rDNA copy number. In this study, which targets only Haptophytes, the ratio of rDNA to rRNA is globally similar in the majority of the lineages, so the rDNA copy number issue and nonconcerted evolution probably play a minor role. We can thus expect to deal here with ‘real’ intraspecific or interspecific diversity. Only lineages showing less than 2.5% of the reads (Syracosphaerales, Isochrysidales and Chrysoculter rhomboideus) might be impacted by nonconcerted evolution.
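The comparison of rDNA and rRNA read proportions per lineage described above can be sketched as follows. The function name, the lineage labels and the read counts are hypothetical and serve only to illustrate the calculation; a ratio well above 1 would flag a lineage whose share of rDNA reads exceeds its share of rRNA reads.

```python
def rdna_to_rrna_ratio(rdna_counts, rrna_counts):
    """Per-lineage ratio of the rDNA read share to the rRNA read share."""
    rdna_total = sum(rdna_counts.values())
    rrna_total = sum(rrna_counts.values())
    if rdna_total == 0 or rrna_total == 0:
        raise ValueError("each template needs at least one read")
    ratios = {}
    for lineage, rdna_reads in rdna_counts.items():
        rrna_share = rrna_counts.get(lineage, 0) / rrna_total
        if rrna_share == 0:
            continue  # ratio undefined when the lineage has no rRNA reads
        ratios[lineage] = (rdna_reads / rdna_total) / rrna_share
    return ratios


# Hypothetical read counts per lineage for one sample (illustration only).
rdna = {"lineage_A": 700, "lineage_B": 200, "lineage_C": 100}
rrna = {"lineage_A": 650, "lineage_B": 250, "lineage_C": 100}
for lineage, ratio in rdna_to_rrna_ratio(rdna, rrna).items():
    print(lineage, round(ratio, 2))
```

Ratios close to 1 across lineages would be consistent with the interpretation given in the text that copy number differences play a minor role within the group.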

In our study, sub-data sets did not share more than one half of their clusters. There are multiple potential explanations for this relatively low overlap. Even if undersampling and limited sequencing depth are expected to be reduced with NGS methods, we unfortunately see here that they remain ongoing issues. What we can also learn from this relatively low overlap is that we are still far from describing the entire environmental Haptophyte diversity when focusing on a single location, even when using a Haptophyte-specific primer set and deep sequencing methods. We confirmed here that studying rDNA and rRNA from the same sample gives complementary information on diversity (Not et al. 2009), and for future exploration of environmental Haptophyte diversity, it seems necessary to us to include both templates, and if possible different sets of primers to target different genomic regions. Inferring diversity from one unique source of information, such as a single molecular marker (Bittner et al. 2010; Piganeau et al. 2011), or from a unique primer set may obviously bias the results or at least give a different view of diversity (Stoeck et al. 2010). Pluralistic alternatives should be used in future studies in order to build a truly exhaustive picture of Haptophyte diversity.
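The overlap between sub-data sets discussed above can be expressed in different ways. The sketch below shows one simple possibility, the fraction of OTUs shared between two sub-data sets relative to all OTUs found in either (the Jaccard index); the measure, the function name and the OTU identifiers are assumptions made for illustration and are not necessarily those used in this study.

```python
def shared_otu_fraction(otus_a, otus_b):
    """Jaccard index: OTUs present in both sub-data sets / OTUs present in either."""
    set_a, set_b = set(otus_a), set(otus_b)
    union = set_a | set_b
    if not union:
        raise ValueError("both sub-data sets are empty")
    return len(set_a & set_b) / len(union)


# Hypothetical OTU identifiers for two sub-data sets (illustration only).
rdna_subset = {"otu1", "otu2", "otu3"}
rrna_subset = {"otu2", "otu3", "otu4"}
print(shared_otu_fraction(rdna_subset, rrna_subset))
```

Because rare OTUs are easily missed in one of two samples, such an overlap measure is sensitive to sequencing depth, which is consistent with the undersampling caveat raised in the text.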

Conclusion

This work is the first NGS environmental study focusing on Haptophytes. Our Haptophyte-specific primer set, targeting the D1-D2 domain of the LSU rDNA gene, permitted the discovery of a significantly higher number of Haptophyta phylotypes than previous studies based on Sanger sequencing of clone libraries or pyrosequencing using universal eukaryote primers targeting SSU rDNA. The majority of the new environmental LSU 454 reads did not cluster with taxonomically known sequences. This result further highlights the major gap between the well-defined diversity and classification inferred from cultivated microorganisms and their significantly larger natural diversity, which is not yet well understood. We also pointed out that estimating the number and diversity of rDNA copies in Haptophyte genomes is an important element for interpreting environmental diversity studies. We detected a significant dissimilarity between communities from different depths; this trend will have to be further linked to physico-chemical parameters in Naples Bay and to the presence of other microorganisms (protists, bacteria, viruses) for a better understanding of the structuring. Our study highlighted once more that diversity inferred from environmental samples is partly dependent on the samples used; for future studies, we therefore recommend deep sequencing of each sample, combining results from the rDNA and rRNA templates and, if possible, using more than one primer set.

Acknowledgments

We thank the BioMarKs consortium, and especially Fabrice

Not, Adriana Zingone and the staff and crew of the Stazione

Zoologica Napoli, for organizing the sampling of the material

analysed herein. We thank Professor Jean-Michel Claverie for

providing free access to the large computer facility of the

PACA-Bioinfo IBISA platform. Thanks to Richard Christen and

Frederic Mahe for their support and advice about computer

analyses. This work is part of the EU EraNet BiodivErsA pro-

gram BioMarKs (CdV) and the Norwegian Research Council

project HAPTODIV (190307/S40) (BE, EE). We acknowledge

the following programmes for additional support: the French

ANR grant POSEIDON (ANR-09-BLAN-0348-01) for LB, the

ANR project PROMETHEUS (ANR-09-GENM-031) for AG, the

projects ANR-09-PCS-GENM-218 and ANR-08-BDVA-003 for

SS and HO, and the EU ASSEMBLE project (227799) for IP.

The authors are grateful to Michelle Gehringer and to Micah

Dunthorn for helpful rereadings. The authors gratefully thank

anonymous referees and the subject editor for their thorough

reviews and constructive criticism on previous versions of the

manuscript.

References

Amaral-Zettler LA, McCliment EA, Ducklow HW, Huse SM

(2009) A method for studying protistan diversity using mas-

sively parallel sequencing of V9 hypervariable regions of

small-subunit ribosomal RNA genes. PLoS ONE, 4, e6372.

Barberan A, Bates ST, Casamayor EO, Fierer N (2011) Using

network analysis to explore co-occurrence patterns in soil

microbial communities. ISME Journal, 6, 343–351.Behnke A, Engel M, Christen R, Nebel M, Klein RR, Stoeck T

(2010) Depicting more accurate pictures of protistan commu-

nity complexity using pyrosequencing of hypervariable SSU

rRNAgene regions. Environmental Microbiology, 13, 340–349.Berger SA, Krompass D, Stamatakis A (2011) Performance,

accuracy, and web server for evolutionary placement of

short sequence reads under maximum likelihood. Systematic

Biology, 60, 291–302.Bik HM, Porazinska DL, Creer S, Caporaso JG, Knigth R,

Thomas WK (2012) Sequencing our way towards under-

standing global eukaryotic biodiversity. Trends in Ecology and

Evolution, 27, 233–243.

Bittner L (*), Halary S (*), Payri C et al. (2010) Some consider-

ations for analyzing biodiversity using integrative metage-

nomics and gene networks. Biology Direct, 5, 47.

Bray JR, Curtis JT (1957) An ordination of the upland forest

communities of Southern Wisconsin. Ecological Monographs,

27, 326–349.

Brown MV, Philip GK, Bunge JA, Smith MC, Bisset A, Lauro

FM et al. (2009) Microbial community structure in the North

Pacific Ocean. ISME Journal, 3, 1374–1386.Caron DA, Countway PD (2009) Hypotheses on the role of the

protistan rare biosphere in a changing world. Aquatic Micro-

bial Ecology, 57, 227–238.

Caron DA, Countway PD, Savai P, Gast RJ et al. (2009) Defin-

ing DNA-based operational taxonomic units for microbial

eukaryote ecology. Applied Environmental Microbiology, 75,

5797–5808.

Cheung MK, Au CH, Chu KH, Kwan HS, Wong CK (2010)

Composition and genetic diversity of picoeukaryotes in sub-

tropical coastal waters as revealed by 454 pyrosequencing.

ISME Journal, 4, 1053–1059.

Clarke KR (1993) Non-parametric multivariate analyses of changes

in community structure. Australian Journal of Ecology, 18, 117–143.

Cortes MY, Bollmann J, Thierstein HR (2001) Coccolithophore

ecology at the HOT station ALOHA, Hawaii. Deep-Sea

Research II, 48, 1957–1981.



Cuvelier ML, Allen AE, Monier A et al. (2010) Targeted

metagenomics and ecology of globally important uncultured

eukaryotic phytoplankton. Proceedings of the National Academy

of Sciences of the United States of America, 17, 14679–14684.Dawson SC, Hagen KD (2009) Mapping the protistan ‘rare

biosphere’. Journal of Biology, 8, 105.

Edgar RC (2010) Search and clustering orders of magnitude

faster than BLAST. Bioinformatics, 26, 2460–2461.Edgcomb V, Orsi W, Bunge J, et al. (2011) Protistan microbial

observatory in the Cariaco Basin, Caribbean I. Pyrosequenc-

ing vs Sanger insights into species richness. ISME Journal, 5,

1344–1356.Edvardsen B, Eikrem W, Throndsen JAS, Probert I, Medlin L

(2011) Ribosomal DNA phylogenies and a morphological

revision provide the basis for a new taxonomy of Prymnesi-

ales (Haptophyta). European Journal of Phycology, 46, 202–228.Eiler A, Heinrich F, Bertilsson S (2011) Coherent dynamics and

association networks among lake bacterioplankton taxa.

ISME Journal, 6, 330–342.

Frada MJ, Bidle KD, Probert I, Vargas C. de (2012) In situ sur-

vey of life cycle phases of the coccolithophore Emiliania hux-

leyi (Haptophyta). Environmental microbiology, 14, 1558–1569.Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, Martin JF

(2011) Accuracy and quality assessment of 454 GS-FLX Tita-

nium pyrosequencing. BMC Genomics, 19, 245.

Gower JC (1966) Some distance properties of latent root and

vector methods used in multivariate analysis. Biometrika, 53,

325–338.

Guindon S, Gascuel O (2003) A simple, fast, and accurate algo-

rithm to estimate large phylogenies by maximum likelihood.

Systematic Biology, 52, 696–704.Hall TA (1999) BioEdit: a user-friendly biological sequence

alignment editor and analysis program for Windows 95/98/

NT. Nucleic Acids Symposium Series, 41, 95–98.

Hammer Ø, Harper DAT, Ryan PD (2001) PAST: paleontologi-

cal statistics software package for education and data analy-

sis. Palaeontologia Electronica, 4, 1–9.Huber JA, Mark WDB, Morrison HG et al. (2007) Microbial

population structures in the deep marine biosphere. Science,

318, 97–100.

Huse SM, Huber JA Morrison HG Sogin ML, Welch DM (2007)

Accuracy and quality of massively parallel DNA

pyrosequencing. Genome Biology, 8, R143.

Huse SM, Dethlefsen L, Huber JA, et al. (2008) Exploring micro-

bial diversity and taxonomy using SSU rRNA hypervariable

tag sequencing. PLoS Genetics , 4, e1000255.

Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Iron-

ing out the wrinkles in the rare biosphere through

improved OTU clustering. Environmental Microbiology, 12,

1889–1898.

Katoh K, Toh H (2008) Improved accuracy of multiple ncRNA

alignment by incorporating structural information into a

MAFFT-based framework. BMC Bioinformatics, 9, 212–224.Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010)

Wrinkles in the rare biosphere: pyrosequencing errors can

lead to artificial inflation of diversity estimates. Environmental

Microbiology, 12, 118–123.Kysela DT, Palacios C, Sogin ML (2005) Serial analysis of

V6-ribosomal sequence tags (SARST-V6): a method for effi-

cient, high-throughput analysis of microbial community

composition. Environmental Microbiology, 7, 356–364.

Legendre P, Gallagher ED (2001) Ecologically meaningful

transformations for ordination of species data. Oecologia, 129,

271–280.

Liu H, Probert I, Uitz J et al. (2009) Extreme diversity in non-

calcifying Haptophytes explains a major pigment paradox in

open oceans. Proceedings of the National Academy of Sciences of

the United States of America, 106, 12803–12808.

Logares R, Audic S, Santini S, Pernice MC, de Vargas C, Mas-

sana (2012) Diversity patterns and activity of uncultured

marine heterotrophic flagellates unveiled with pyrosequenc-

ing. ISME Journal, 6, 1823–1833.

Long JD, Smalley GW, Barsby T, Anderson JT, Hay ME (2007)

Chemical cues induce consumer-specific defenses in a

bloom-forming marine phytoplankton. Proceedings of the

National Academy of Sciences of the United States of America,

104, 10512–10517.Malinverno E, Ziveri P, Corselli C (2003) Coccolithophorid dis-

tribution in the Ionian Sea and its relationship to eastern

Mediterranean circulation during late fall to early winter

1997. Journal of Geophysical Research, 108, 8115.

Man-Aharonovich D, Philosof A, Kirkup BC et al. (2010) Diver-

sity of active marine picoeukaryotes in the Eastern Mediter-

ranean Sea unveiled using photosystem-II psbA transcripts.

ISME Journal, 4, 1044–1052.Margulies M, Egholm M, Altman WE et al. (2005) Genome

sequencing in microfabricated high-density picolitre reactors.

Nature, 437, 376–380.

Massana R, Pernice M, Bunge JA, del Campo J (2011) Sequence

diversity and novelty of natural assemblages of picoeukary-

otes from the Indian Ocean. ISME Journal, 5, 184–195.

Matsen FA, Kodner RB, Armbrust EV (2010) Pplacer: linear time

maximum-likelihood and Bayesian phylogenetic placement of

sequences onto a fixed reference tree. BMC Bioinformatics, 11, 538.

McDonald SM, Sarno D, Scanlan DJ, Zingone A (2007) Genetic

diversity of eukaryotic ultraphytoplankton in the Gulf of

Naples during an annual cycle. Aquatic Microbial Ecology, 50,

75–89.Medlin LK, Kooistra WHCF (2010) Methods to estimate the

diversity in the marine photosynthetic protist community

with illustrations from case studies: a review. Diversity, 2,

973–1014.Monchy S, Sanciu G, Jobard M et al. (2011) Exploring and

quantifying fungal diversity in freshwater lake ecosystems

using rDNA cloning/sequencing and SSU tag pyrosequenc-

ing. Environmental Microbiology, 13, 1433–1453.Nakayama T, Yoshida M, Noel M-H, Kawachi M, Inouye I

(2005) Ultrastructure and phylogenetic position of Chrysocul-

ter rhomboideus gen. et sp. nov (Prymnesiophyceae), a new

flagellate haptophyte from Japanese coastal waters. Phycolo-

gia, 44, 369–383.

Nebel M, Pfabel C, Stock A, Dunthorn M, Stoeck T (2011)

Delimiting operational taxonomic units for assessing cili-

ate environmental diversity using small-subunit rRNA

gene sequences. Environmental Microbiology Reports, 3,

154–158.Nolte V, Pandey RV, Jost S et al. (2010) Contrasting seasonal

niche separation between rare and abundant taxa conceals the

extent of protist diversity.Molecular Ecology, 19, 2908–2015.

Not F, del Campo J, Balague V, de Vargas C, Massana R (2009)

New insights into the diversity of marine picoeukaryotes.

PLoS ONE, 4, e7143.



Oksanen J, Kindt R, Legendre P, O’Hara RB (2007) vegan:

community ecology package version, 1, 8–5.Available from

http://r-forge.r-project.org/projects/vegan/.

Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stalh DA (1986)

Microbial ecology and evolution: a ribosomal RNA approach.

Annual Review of Microbiology, 40, 337–365.Pawlowski J, Burki F (2009) Untangling the Phylogeny of Amoe-

boid Protists. Journal of Eukaryotic Microbiology, 56, 16–25.Pawlowski J, Christen R, Lecroq B et al. (2011) Eukaryotic

richness in the abyss: insights from pyrotag sequencing.

PLoS ONE, 6, e18169.

Piganeau G, Eyre-Walker A, Grimsley N, Moreau H (2011)

How and why DNA barcodes underestimate the diversity of

microbial eukaryotes. PLoS ONE, 6, e16342.

Posada D (2008) jModelTest: phylogenetic model averaging.

Molecular Biology and Evolution, 25, 1253–1256.Pulsinelli GA, Temin HM (1994) High rate of mismatch exten-

sion during reverse transcription in a single round of retrovi-

rus replication. Proceedings of the National Academy of Sciences

of the United States of America, 91, 9490–9494.Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ (2011)

Removing noise from pyrosequenced amplicons. BMC Bioin-

formatics, 12, 38.

Roesch LF, Fulthorpe RR, Riva A et al. (2007) Pyrosequencing

enumerates and contrasts soil microbial diversity. ISME

Journal, 1, 283–290.Schloss PD (2010) The effects of alignment quality, distance cal-

culation method, sequence filtering, and region on the analy-

sis of 16S rRNA gene-based studies. PLoS Computational

Biology, 6, e1000844.

Schoemann V, Becquevort S, Stefels J, Rousseau V, Lancelot C (2005)

Phaeocystis blooms in the global ocean and their controlling mech-

anisms: a review. Journal of Sea Research, 53, 43–66.Shalchian-Tabrizi K, Reier-Røberg K, Ree DK, Klaveness D,

Brate J (2011) Marine-freshwater colonizations of Hapto-

phytes inferred from phylogeny of environmental 18S rDNA

sequences. Journal of Eukaryotic Microbiology, 58, 315–318.Shannon CE (1948) A mathematical theory of communication.

Bell System Technical Journal, 27, 379–423.

Shi XL, Marie D, Jardillier L, Scanlan DJ, Vaulot D (2009)

Groups without cultured representatives dominate eukary-

otic picophytoplankton in the oligotrophic South East Pacific

Ocean. PLoS ONE, 4, e7657.

Shi XL, Lepere C, Scanlan DJ, Vaulot D (2011) Plastid 16S rRNA gene

diversity among eukaryotic picophytoplankton sorted by flow

cytometry from the South Pacific Ocean. PLoS ONE, 6, e18979.

Simpson EH (1949) Measurement of diversity. Nature, 163, 688.

Sogin ML, Morrison HG, Huber J et al. (2006) Microbial diversity

in the deep sea and the underexplored “rare biosphere”.

Proceedings of the National Academy of Sciences of the United

States of America, 103, 12115–12120.

Stoeck T, Behnke A, Christen R et al. (2009) Massively parallel

tag sequencing reveals the complexity of anaerobic marine

protistan communities. BMC Biology, 7, 72.

Stoeck T, Bass D, Nebel M et al. (2010) Multiple marker parallel

tag environmental DNA sequencing reveals a highly com-

plex eukaryotic community in marine anoxic water. Molecu-

lar Ecology, 19, 21–31.

Sydow JF, Cramer P (2009) RNA polymerase fidelity and tran-

scriptional proofreading. Current Opinion in Structural Biol-

ogy, 19, 732–739.

Winter A, Rost B, Hilbrecht H, Elbrachter M (2002) Vertical

and horizontal distribution of coccolithophores in the Carib-

bean Sea. Geo-Marine Letters, 22, 150–161.

Woese CR (1987) Bacterial evolution. Microbiological Reviews, 51,

221–271.

L.B. and C.D.V. initiated the research; L.B. designed and

coordinated it. L.B. and S.R. performed the molecular

experiments. L.B., I.P., B.E., E.S.E. and C.D.V. contributed to

building the reference database. L.B., S.A., A.G., S.S. and H.O.

analyzed the data. L.B. wrote the manuscript. All authors

participated in revising the manuscript and read and approved

the final article.

Data accessibility

Dryad Digital Repository, Package Identifier doi:10.5061/dryad.tv5v1v26

and Sequence Read Archive (SRA),

http://www.ebi.ac.uk/ena/data/view/ERP001891

Supporting information

Additional supporting information may be found in the online ver-

sion of this article.

Table S1 Primer set used to amplify the D1-D2 region of the

Haptophyte LSU rDNA

Table S2 Specificity of our primer set highlighted by in silico

analysis.

Table S3 Effect of cleaning.

Table S4 Comparison of community similarities between tem-

plates, size fractions and depths as described by ANOSIM values.

Table S5 Number of reads and number of OTUs considering

LSU1 reads at 99%, 98% and 97% of sequence identity.

Fig. S1 Location of the ‘Mare Chiara’ (MC) station, Bay of

Naples, Mediterranean Sea.

Fig. S2 Relative position of the LSU1 Haptophyta-specific pri-

mer set compared to the LSU rDNA D1-D2 domain.

Fig. S3 Expected D1-D2 LSU rDNA genetic distances at vari-

ous Haptophyta taxonomic levels.

Fig. S4 Rarefaction analysis of OTU100%, OTU99%, OTU98% and

OTU97% pooling reads from the eight samples.

Fig. S5 Taxonomic composition of LSU rDNA reads not

assigned to Haptophyta at 85% homology, as inferred from

blast analyses in GenBank, using default parameters.

Fig. S6 In the framework of another study, SSU rDNA reads

from the V4 and the V9 regions were pyrosequenced for the

same eight samples from Naples using ‘universal’ eukaryotic

primers (details on the primers can be found in Stoeck et al.

2010 or in Logares et al. 2012).



Fig. S7 Boxplots summarizing the range of α-diversity (calculation

using Simpson’s index) for each condition calculated at

97% clustering level.

Fig. S8 Venn diagrams calculated with OTU97%.

Fig. S9 Heatmap summarizing the % of shared OTU97%

between the samples.

Fig. S10 More details on the proportion of reads assigned by

phylogenetic mapping for the main Haptophyte lineages.

Fig. S11 Haptophyte community structure based on NMDS

(nonmetric multidimensional scaling) ordination of the LSU

rDNA dataset distance matrices from OTU97%, followed by

comparison of community similarities between depths, size

fractions and templates as described by ANOSIM values.


