evolutionary history of the hymenoptera · report evolutionary history of the hymenoptera...

22
Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of Hymenoptera d A major radiation of primarily ectophytic sawflies (Eusymphyta) is hypothesized d A major radiation of parasitoid wasps (Parasitoida) is identified d The phylogenetic origins of wasp-waisted wasps, stinging wasps, and bees are resolved Authors Ralph S. Peters, Lars Krogmann, Christoph Mayer, ..., Jes Rust, Bernhard Misof, Oliver Niehuis Correspondence [email protected] (R.S.P.), [email protected] (O.N.) In Brief Peters et al. infer a time-calibrated and statistically solid phylogenetic tree of the mega-diverse insect order Hymenoptera (sawflies, wasps, ants, and bees) from the analysis of phylogenomic data. This sheds new light on the early history of this intriguing group, as well as on the origins and radiation of parasitoids, stinging wasps, and bees. Peters et al., 2017, Current Biology 27, 1013–1018 April 3, 2017 ª 2017 Elsevier Ltd. http://dx.doi.org/10.1016/j.cub.2017.01.027

Upload: others

Post on 04-Oct-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

Report

Evolutionary History of the

Hymenoptera

Highlights

d The most comprehensive dataset ever compiled for inferring

the phylogeny of Hymenoptera

d A major radiation of primarily ectophytic sawflies

(Eusymphyta) is hypothesized

d A major radiation of parasitoid wasps (Parasitoida) is

identified

d The phylogenetic origins of wasp-waisted wasps, stinging

wasps, and bees are resolved

Peters et al., 2017, Current Biology 27, 1013–1018April 3, 2017 ª 2017 Elsevier Ltd.http://dx.doi.org/10.1016/j.cub.2017.01.027

Authors

Ralph S. Peters, Lars Krogmann,

Christoph Mayer, ..., Jes Rust,

Bernhard Misof, Oliver Niehuis

[email protected] (R.S.P.),[email protected](O.N.)

In Brief

Peters et al. infer a time-calibrated and

statistically solid phylogenetic tree of the

mega-diverse insect order Hymenoptera

(sawflies, wasps, ants, and bees) from the

analysis of phylogenomic data. This

sheds new light on the early history of this

intriguing group, as well as on the origins

and radiation of parasitoids, stinging

wasps, and bees.

Page 2: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

Current Biology

Report

Evolutionary History of the HymenopteraRalph S. Peters,1,24,25,* Lars Krogmann,2 Christoph Mayer,3 Alexander Donath,3 Simon Gunkel,4 Karen Meusemann,3,5,6

Alexey Kozlov,7 Lars Podsiadlowski,8 Malte Petersen,3 Robert Lanfear,9,10 Patricia A. Diez,11 John Heraty,12

Karl M. Kjer,13 Seraina Klopfstein,14 Rudolf Meier,15 Carlo Polidori,16 Thomas Schmitt,17 Shanlin Liu,18,19,20 Xin Zhou,21,22

Torsten Wappler,4 Jes Rust,4 Bernhard Misof,3 and Oliver Niehuis3,5,23,24,*1Center of Taxonomy and Evolutionary Research, Arthropoda Department, Zoologisches Forschungsmuseum Alexander Koenig,

53113 Bonn, Germany2Entomologie, Staatliches Museum fur Naturkunde Stuttgart, 70191 Stuttgart, Germany3Center for Molecular Biodiversity Research, Zoologisches Forschungsmuseum Alexander Koenig, 53113 Bonn, Germany4Steinmann Institut fur Geologie, Mineralogie und Pal€aontologie, 53115 Bonn, Germany5Department of Evolutionary Biology and Ecology, Institute for Biology I (Zoology), University of Freiburg, 79104 Freiburg (Brsg.), Germany6Australian National Insect Collection, CSIRO National Research Collections Australia (NRCA), Acton, ACT 2601, Australia7Scientific Computing Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany8Institute of Evolutionary Biology and Ecology, University of Bonn, 53121 Bonn, Germany9Ecology, Evolution and Genetics, Research School of Biology, Australian National University, Canberra, ACT 2601, Australia10School of Biological Sciences, Macquarie University, Sydney, NSW 2109, Australia11Centro de Investigaciones y Transferencia de Catamarca, CITCA-CONICET/UNCA, 4700 Catamarca, Argentina12Department of Entomology, University of California, Riverside, Riverside, CA 92521, USA13Department of Biological Sciences, Rutgers University, Newark, NJ 07102, USA14Naturhistorisches Museum der Burgergemeinde Bern, 3005 Bern, Switzerland15Department of Biological Sciences and Lee Kong Chian Natural History Museum, National University of Singapore, Singapore 117543,

Singapore16Instituto de Ciencias Ambientales (ICAM), Universidad de Castilla-La Mancha, 45071 Toledo, Spain17Department of Animal Ecology and Tropical Biology, University of Wurzburg, 97074 Wurzburg, Germany18China National GeneBank-Shenzhen, BGI-Shenzhen, Shenzhen, Guangdong Province, 518083, People’s Republic of China19BGI-Shenzhen, Shenzhen, Guangdong Province, 518083, People’s Republic of China20Centre for GeoGenetics, Natural HistoryMuseumof Denmark, University of Copenhagen, Øster Voldgade 5–7, 1350Copenhagen, Denmark21Beijing Advanced Innovation Center for Food Nutrition and Human Health, China Agricultural University, Beijing 100193, People’s Republic

of China22Department of Entomology, China Agricultural University, Beijing 100193, People’s Republic of China23School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA24These authors contributed equally25Lead Contact

*Correspondence: [email protected] (R.S.P.), [email protected] (O.N.)

http://dx.doi.org/10.1016/j.cub.2017.01.027

SUMMARY

Hymenoptera (sawflies, wasps, ants, and bees) areone of four mega-diverse insect orders, comprisingmore than 153,000 described and possibly up toone million undescribed extant species [1, 2]. Asparasitoids, predators, and pollinators, Hymeno-ptera play a fundamental role in virtually all terrestrialecosystems and are of substantial economic impor-tance [1, 3]. To understand the diversification andkey evolutionary transitions of Hymenoptera, mostnotably from phytophagy to parasitoidism and pre-dation (and vice versa) and from solitary to eusociallife, we inferred the phylogeny and divergence timesof all major lineages of Hymenoptera by analyzing3,256 protein-coding genes in 173 insect species.Our analyses suggest that extant Hymenopterastarted to diversify around 281 million years ago(mya). The primarily ectophytophagous sawflies arefound to be monophyletic. The species-rich lineagesof parasitoid wasps constitute a monophyletic group

Curre

as well. The little-known, species-poor Trigonaloideaare identified as the sister group of the stingingwasps (Aculeata). Finally, we located the evolu-tionary root of bees within the apoid wasp family‘‘Crabronidae.’’ Our results reveal that the extantsawfly diversity is largely the result of a previouslyunrecognized major radiation of phytophagous Hy-menoptera that did not lead to wood-dwelling andparasitoidism. They also confirm that all primarilyparasitoid wasps are descendants of a single endo-phytic parasitoid ancestor that lived around 247mya. Our findings provide the basis for a naturalclassification of Hymenoptera and allow for futurecomparative analyses of Hymenoptera, includingtheir genomes, morphology, venoms, and parasitoidand eusocial life styles.

RESULTS AND DISCUSSION

We sequenced whole-body transcriptomes of 167 species of

Hymenoptera and selected outgroups and supplemented our

nt Biology 27, 1013–1018, April 3, 2017 ª 2017 Elsevier Ltd. 1013

Page 3: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

dataset with sequenced and annotated genomes of five hyme-

nopterans and a beetle (for details, see Supplemental Experi-

mental Procedures and Data S1A–S1D). Our study includes 54

families of Hymenoptera, representing all major superfamilies.

The phylogenetic inferences are based on the analysis of 1.5

million amino acid and 3.0 million nucleotide positions, respec-

tively, derived from 3,256 single-copy protein-coding genes

(Data S1E) and inferred by using a combination of domain-,

gene-, and codon position-based data partition schemes to

improve the fitting of the applied substitution models. Consid-

ering the taxonomic and molecular sampling, this is the most

comprehensive dataset ever generated for investigating phylo-

genetic relationships within Hymenoptera or any other insect

group. The dataset was furthermore used to estimate divergence

times with an independent-rates as well as with a correlated-

rates molecular clock approach (Data S1H) and a validated

set of 14 fossils (Data S1F).

The inferred phylogenetic relationships and divergence time

estimates were used to assess where in the phylogeny of Hyme-

noptera, when in their geological history, and how often major

evolutionary transitions took place. Specifically, we studied the

switch from feeding on plants to feeding on an insect host (para-

sitoidism), the formation of a wasp waist, the evolution of a

venomous stinger to subduemobile hosts, the evolution of euso-

ciality, and the switch from hunting prey to collecting pollen.

These evolutionary transitions are partially reflected by the his-

toric classification of Hymenoptera: sawflies (‘‘Symphyta’’) are

those Hymenoptera that lack the wasp waist that characterizes

all remaining Hymenoptera (Apocrita), ‘‘Parasitica’’ encom-

passes the primarily parasitoid Apocrita that lack a stinger, and

Aculeata comprises the stinging wasps, ants, and bees (Antho-

phila) [1]. Yet, how many major lineages each of these groups

encompasses has been controversial for decades [4–11].

The results of our phylogenomic study received strong sup-

port in all analyses, unless stated otherwise, and alter previous

ideas regarding the evolutionary history of Hymenoptera (Fig-

ure 1B; for full results and detailed experimental procedures,

see Figure S1, Supplemental Experimental Procedures, and

additional figures deposited at Mendeley Data, http://dx.doi.

org/10.17632/s5j2f62z3d.2). According to our analyses, extant

Hymenoptera started to diversify between the Carboniferous

and the Triassic (95% confidence interval [CI]: 329–239 million

years ago [mya]; mean: 281 mya; node 1 [n.1] in Figure 1B),

with the oldest currently known Hymenoptera fossils being

from the Triassic, �224 million years old [8]. Previous studies

suggested this divergence to have occurred between the sawfly

lineage Xyeloidea and the remaining Hymenoptera [5, 7–11],

whereas our analysis identified a much more inclusive clade of

sawflies (Eusymphyta; n.2) that also contains Pamphilioidea

and Tenthredinoidea as closest relatives of all remaining Hyme-

noptera (Unicalcarida). These superfamilies had been thought to

form a paraphyletic grade [5, 7, 9, 11]. Instead, they represent an

unexpected and previously unrecognized major radiation of pri-

marily ectophytophagous insects that comprises more than

7,000 described species [1]. We estimate the first diversification

of the extant eusymphytan lineages to have occurred 276–157

mya (mean 212 mya). Note that Eusymphyta were corroborated

as the sister group of all remaining Hymenoptera when addition-

ally scrutinizing the analyzed molecular data for conflicting

1014 Current Biology 27, 1013–1018, April 3, 2017

phylogenetic signal (Supplemental Experimental Procedures).

Given the novelty and importance of our finding, we anticipate

that it will significantly influence future research on Hymenoptera

relationships, and we encourage researchers to further assess

this particular phylogenetic hypothesis in future studies, for

example by extending the taxon sampling within Eusymphyta

and the outgroup.

A clade Eusymphyta representing the extant sister lineage of

all remaining Hymenoptera (Unicalcarida) has profound conse-

quences for inferring ground-plan characters of Hymenoptera.

For example, Hymenoptera were previously thought to have

been ancestrally ectophytophagous, based on the assumption

that eusymphytans form a paraphyletic assemblage. Consid-

ering that the sister group of Hymenoptera (Aparaglossata)

was ancestrally likely predacious [12], the inferred relationship

between Eusymphyta and Unicalcarida implies that the most

recent common ancestor of Hymenoptera could have been

ecto- or endophytophagous. A sister group relationship between

Eusymphyta and Unicalcarida furthermore implies that the

remarkable ability of male Hymenoptera to restore diploidy in

their muscle cells was already present in the last common

ancestor of all Hymenoptera (with a secondary loss in Xyelidae),

or that this feature evolved at least twice (in Unicalcarida and

Tenthredinoidea) [13]. Finally, the unexpected finding that the

turnip sawfly, Athalia rosae (Tenthredinoidea), whose genome

has recently been sequenced by the i5K initiative [14], is a repre-

sentative of the sister lineage of all remaining Hymenoptera will

improve our understanding of the genetic composition of the

most recent common ancestor of Hymenoptera: genomic fea-

tures shared between the turnip sawfly and species of Unicalcar-

ida with sequenced genomes (e.g., Nasonia parasitoid wasps,

ants, bees) were likely inherited from their common ancestor.

In agreement with earlier studies [9, 10], we found a single

origin of the endophytic sawfly lineages (i.e., Cephoidea, Orus-

soidea, Siricoidea, and Xiphydrioidea; n.3), which form a para-

phyletic grade, in which Orussoidea (parasitoid woodwasps)

represent the closest relatives of Apocrita (n.4). Morphological

data have suggested a sister group relationship of Orussoidea

and Apocrita (Vespina) [6, 15], but results from analyzing molec-

ular data have been inconsistent [7, 9]. Our analyses provide

strong support for the monophyly of Vespina and of Apocrita

(n.5) and imply that the bulk of primarily parasitoid wasps are de-

scendants of a single endophytic parasitoid ancestor that lived in

the Permian or in the Triassic (CI: 289–211mya; mean: 247mya).

Contrary to earlier hypotheses of sawfly relationships (see [10]),

we identified Cephoidea, and not Siricoidea and/or Xiphydrioi-

dea, as the closest extant relatives of Vespina (n.6), a result

only recently suggested [7].

The evolution of the wasp waist, a constriction between the

first and the second abdominal segment greatly improving

the maneuverability of the abdomen’s rear section, including

the ovipositor, was a major innovation in the evolution of Hyme-

noptera that undoubtedly contributed to the rapid diversification

of Apocrita (n.5) [6]. Our analysis is the first to persuasively

demonstrate that the most diverse parasitoid wasp lineages

(i.e., Ceraphronoidea, Ichneumonoidea, and Proctotrupomor-

pha) constitute a natural group (Parasitoida; n.7) whose aston-

ishing radiation was likely triggered by further optimization of

the parasitoid lifestyle and related traits (e.g., endoparasitoidism,

Page 4: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

PamphilioideaXyeloidea

Tenthredinoidea

XiphydrioideaSiricoideaCephoideaOrussoideaIchneumonoideaCeraphronoidea

Cynipoidea

Platygastroidea

ProctotrupoideaDiaprioidea

EvanioideaStephanoideaTrigonaloidea

Chrysidoidea

Tiphioidea

Scolioidea

Ampulicidae

"Crabronidae"

Sphecidae

Anthophilabees

digger wasps

Formicoideaants

Pompiloideavelvet ants,spider wasps,and relatives

Chalcidoideajewel wasps

Vespoideapotter, honey andsocial wasps

Apoidea

Bootstrap support100%

51–75%

91–99%76–90%

0–50%

Tiphia

Cephalonomia

Bembix

Monosapyga

Harpegnathos

Tetragonula

Dryudella

Sapgya

Dasymutilla

Gorytes

Diodontus

Philanthus

Sirex

outgroup (5)

Stephanus

Xyela

Pergagrapta

Harpactus

Vespa

Sphecius

Dolichurus

Urocerus

Auplopus

Acromyrmex

Smicromyrme

Tremex

Stizus

Ampulex c.

Melittidae (3)

Mellinus

Masaris

CrabroOxybelus

Alysson

various speciesDinetus

Bombus

Chrysis

Apidae (partim, 5)

PseudogonalosPlumarius

Acantholyda

Orussus a.

Astata

Chyphotes

AprocerosArge

Aulacus

Ceraphron

Stizoides

Lestica

Camponotus

Vespula

Dendrocerus

NitelaPalarus

Tenthredo

Gasteruption

Heterodontonyx

Ampulex f.

Orussus u.

Meria

Apis

Cephus

Xiphydria

Polistes

Methocha

Blasticotoma

Andrenidae (3)

Nysson

Nematus

Sapygina

PemphredonPsenulus

Euglossa

Parnopes

Brachygaster

Trypoxylon

Crossocerus

Cimbex

Pseudoscolia

Tachysphex

Megalodontes

Cleptes

Diprion

Megachilidae (9)

Pompilus

Scolia

Colletidae (2)

Colpa

Passaloecus

Eumenes

Parischnogaster

Spilomena

Elampus

Liris

Episyron

Cerceris

Pison

Halictidae

Leptopilina

Synergus

Ibalia

Pediaspis

Andricus

TelenomusInostemma

Trissolcus

PelecinusTrichopria

Brachymeria

Cosmocomoidea

Torymus

Orasema

Brachyserphus

Nasonia

Diglyphus

Braconidae (5)Ichneumonidae (8)

Apidae (partim, 8)

(8)

(7)

Eusymphyta

Parasitoida

Permian Triassic Jurassic Cretaceus Paleogene Neo. Q.

50 0100150200250300 mya

50 0100150 Mya

Hymenoptera

AculeataApocrita

Apidae

(partim)

Tenthredo livida L.(Tenthredinoidea)

Camponotus maculatus (Fabr.)(Formicoidea)

A

B

Ampulex compressa (Fabr.)(Apoidea)

Vespula germanica (Fabr.)(Vespoidea)

Torymus bedeguaris (L.)(Chalcidoidea)

Stephanus serrator (Fabr.)(Stephanoidea)

Orussus abietinus (Scop.)(Orussoidea)

Apis mellifera L. (Apoidea)

Proctotrupomorpha

Unicalcarida

1

2

4

5

7

8

9

16

3

6

1011

12

1314

15

eusociality

eusociality

pollencollecting

Vespinaparasitoidism

wasp-waist

stinger

eusociality

eusociality

pollen collecting

Thynnoidea

Figure 1. Evolutionary History of the Hymenoptera

(A) Representatives of sawflies, wasps, ants, and bees. Scale bars represent 5 mm.

(B) Phylogenetic relationships and divergence time estimates of Hymenoptera. Key evolutionary events are indicated at the respective clades (note that only the

major eusocial lineages are considered). The tree was inferred under the maximum-likelihood optimality criterion, analyzing 1,505,514 amino acid sites and

applying a combination of protein domain- and gene-specific substitution models. Divergence times were estimated with an independent-rates molecular clock

approach and considering 14 validated fossils. Triangular branches cover multiple species (number of species in parentheses) whose relationships are shown in

detail in Figure S1. Nodes with circled numbers are referred to in the main text.

Current Biology 27, 1013–1018, April 3, 2017 1015

Page 5: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

miniaturization), which allowed for successfully attacking a

variety of new hosts. We estimate the beginning of the group’s

radiation at 266–195 mya (mean: 228 mya), only a few million

years after Parasitoida separated from the remaining Apocrita

(CI: 276–203 mya; mean: 236 mya). The early radiation of Para-

sitoida thus falls within a time period when the parasitoids’ major

host lineages (e.g., Hemiptera, Holometabola) also started to

diversify [16].

We identified the enigmatic Trigonaloidea as the closest

extant relatives of Aculeata with strong node support (n.8), a hy-

pothesis only recently put forth [7, 9]. Evanioidea, which had also

been discussed as a possible sister group of Aculeata [5, 10, 17,

18], cluster with Stephanoidea (n.9). Node support for this rela-

tionship is low, however, and it needs to be investigated further

in future studies that include additional types of characters and

samples of Megalyroidea, a lineage that we were unable to

sequence. Note that in contrast to Aculeata, the Evanioidea, Ste-

phanoidea, and Trigonaloidea have all remained species-poor.

The identification of the closest relatives of Aculeata will be

important for better understanding which traits (e.g., venoms)

fostered the diversification of the stinging wasps.

Our analysis sheds new light on the phylogeny of Aculeata

(n.10), whose early diversification occurred 224–160 mya

(mean: 190 mya). Chrysidoids are confirmed as the sister group

of all remaining Aculeata [19]. We corroborate the artificial nature

of the former superfamily ‘‘Vespoidea’’ (i.e., all Aculeata except

Apoidea and Chrysidoidea) [5], which comprises four major

lineages that are paraphyletic with respect to Apoidea [20]. The

potter, honey, and social wasps (Vespoidea sensu Pilgrim et al.

[20]: Vespidae; n.11) were identified as the sister lineage of all

remaining non-chrysidoid Aculeata. However, the phylogenetic

position of the species-poor Rhopalosomatidae (Vespoidea

sensu Pilgrim et al. [20]), an aculeate wasp family that we were

unable to sequence and possible sister lineage of Vespidae, re-

mains controversial [9, 10, 20]. The inferred phylogenetic rela-

tionships within Vespidae suggest two independent origins of

eusociality, a previously fiercely contested hypothesis [21, 22].

In agreement with an earlier phylogenomic study [23], we in-

ferred ants (Formicoidea) as being the closest extant relatives

of Apoidea (n.12) in all of our analyses, except when applying a

Bayesian approach, which suggested ants plus scoliid wasps

(Scolioidea, possibly including also the family Bradynobaenidae

[20], which we were unable to sequence) as being sister to Apo-

idea (figure deposited at Mendeley Data, http://dx.doi.org/10.

17632/s5j2f62z3d.2). We estimate the last common ancestor

of ants and Apoidea to have lived in the Jurassic or the Creta-

ceous (CI: 192–136 mya; mean: 162 mya).

We located the phylogenetic origin of bees (Anthophila) within

the apoid wasp family ‘‘Crabronidae’’ (n.13), which our study

shows to be an artificial construct comprising five major line-

ages. The crabronid wasp lineage in our study most closely

related to bees is the species-poor tribe Psenini. This result sub-

stantiates the idea that the switch from a predatory to a herbi-

vorous lifestyle was a key to the tremendous diversification of

bees [24]. We estimate the origin of bees to have been in the

Cretaceous (CIs: 147–93mya; means: 124 and 111mya), a result

that is consistent with a close temporal link between the diversi-

fications of bees and angiosperms [24]. Melittid bees were iden-

tified as the sister lineage of all remaining Anthophila (n.14),

1016 Current Biology 27, 1013–1018, April 3, 2017

which implies that short-tongued bees do not represent a natural

group. In contrast, we confirmed long-tongued bees (i.e., Apidae

and Megachilidae) to constitute a natural entity (n.15) [24]. We

also found the eusocial apid bee lineages to be monophyletic,

corroborating the hypothesis that eusociality has evolved

once, not twice, in corbiculate (pollen basket) bees (n.16) [25].

Our study confirms the power of phylogenomic approaches

for deciphering difficult-to-resolve arthropod phylogenetic rela-

tionships [12, 16, 26, 27] by yielding well-supported answers to

some of the most pressing questions regarding the evolutionary

history of the sawflies, wasps, ants, and bees.We provide strong

evidence for understanding the phylogenetic relationships

among all major lineages of Hymenoptera, and we were able

to date the individual divergence events, both paramount for de-

ciphering the tempo and mode of diversification of ecologically,

economically, sociobiologically, and/or pharmaceutically rele-

vant traits of interest (e.g., gene repertoires, haplodiploidy and

sex determination, eusociality, chemosensation, and venoms).

Finally, our study offers the basis for establishing a natural clas-

sification of the insect order Hymenoptera.

EXPERIMENTAL PROCEDURES

We sequenced the transcriptomes of 134 species of Hymenoptera using Illu-

mina HiSeq 2000 sequencing technology (Data S1A–S1C). We complemented

our dataset by including previously published transcriptomes of 29 Hymeno-

ptera and four Neuropteroidea [16, 28]. Finally, we considered the official

gene sets of five Hymenoptera and the flour beetle Tribolium castaneum

(Data S1D). All paired-end reads were assembled with SOAPdenovo-Trans-

31kmer (version 1.01) [29], the assembled transcripts were filtered for possible

contaminants, and the raw reads and filtered assemblies were submitted to

the NCBI SRA and TSA archives. We searched the assemblies with the soft-

ware Orthograph (version beta4) [28] for transcripts of 3,260 protein-coding

genes that the OrthoDB v7 database [30] suggested to be single-copy in

Hymenoptera and Neuropteroidea (outgroup) by applying the best reciprocal

hit criterion. Orthologous transcripts were aligned with MAFFT (version

7.017) [31] at the translational (amino acid) level. All multiple sequence align-

ments (MSAs) were quality assessed and, if necessary, improved and masked

using the procedure outlined by Misof et al. [16]. The resulting MSAs were

concatenated to a supermatrix that we simultaneously partitioned based

on a combination of Pfam protein domains and genes [16]. The phylogenetic

information content of each partition was assessed with MARE (version

0.1.2-rc) [32], and all uninformative partitions were removed. We subsequently

used PartitionFinder (developer versions 2.0.0-pre2, 2.0.0-pre9, and 2.0.0-

pre10) [33] to simultaneously infer a partition scheme and proper amino acid

substitution models for analyzing each partition with the rcluster algorithm.

We applied the same partition scheme when analyzing the corresponding

supermatrix at the transcriptional (nucleotide) level, except that we modeled

the first and second codon position of each partition separately (note that

we excluded the hypervariable third codon position from our analyses). Phylo-

genetic trees were reconstructed with ExaML (versions 3.0.15 and 3.0.17) [34],

conducting 50 independent tree searches per supermatrix. Node support was

inferred with the bootstrap method [35]. Decisive datasets were used for

testing the possible impact of missing data at the partition level on the inferred

phylogenetic tree [36], and four-cluster likelihood mapping was used for

assessing the phylogenetic signal for alternative phylogenetic relationships

[37]. Permutation tests allowed assessing the impact of heterogeneous amino

acid sequence composition, non-stationarity of substitution processes, and

non-random distribution of missing data on the inferred phylogenetic tree

[16]. We additionally conducted phylogenetic inferences in a Bayesian frame-

work, using ExaBayes [38] with its default settings, enabling automatic substi-

tution model detection and applying the same data partitioning scheme that

we used in analyses under the maximum-likelihood optimality criterion. We

analyzed three independent runs with four coupled Markov chain Monte Carlo

Page 6: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

chains and 200,000 generations each. The consense tool (part of the

ExaBayes software package) was used to obtain a consensus tree based on

the extended majority rule method (MRE), discarding the first 25% of the

sampled topologies as burn-in. Divergence timeswere calibrated using 14 fos-

sils (Data S1F), selected following best-practice recommendations [39] and

representing extant lineages distributed across the entire Hymenoptera Tree

of Life. Divergence times were estimated with mcmctree in conjunction with

codeml (both part of the PAML software package, version 4.9) [40]. We

analyzed a subset of the amino acid and of the nucleotide supermatrix, both

comprising only sites that had amino acids or nucleotides present in at least

95% of the species, both with an independent-rates model and with a corre-

lated-rates model (Figure 1B; Data S1H) and sampling parameters previously

assessed for convergence of results.

Data Resources

Data reported in this paper have been published in Mendeley Data and

are available at http://dx.doi.org/10.17632/trbj94zm2n.2 (inferred matrices

and statistics) and http://dx.doi.org/10.17632/s5j2f62z3d.2 (figures). All

sequencing data are available at NCBI via the Umbrella BioProject accession

number NCBI: PRJNA183205 (‘‘The 1KITE project: evolution of insects’’).

SUPPLEMENTAL INFORMATION

Supplemental Information includes one figure, Supplemental Experimental

Procedures, and one dataset and can be found with this article online at

http://dx.doi.org/10.1016/j.cub.2017.01.027.

AUTHOR CONTRIBUTIONS

B.M., L.K., O.N., and R.S.P. conceived the study. C.P., J.H., K.M., K.M.K.,

L.K., O.N., P.D., R.M., R.S.P., S.K., and T.S. collected or provided samples.

A.D., K.M., L.P., O.N., R.S.P., S.L., and X.Z. sequenced, assembled, and pro-

cessed the transcriptomes. A.K., C.M., K.M., M.P., O.N., R.L., and R.S.P.

phylogenetically analyzed the transcriptomes. J.R., L.K., O.N., R.S.P., S.G.,

and T.W. are responsible for the dating of the inferred phylogeny. All authors

contributed to the writing of the manuscript, with L.K., O.N., and R.S.P. taking

the lead.

ACKNOWLEDGMENTS

The presented data are the result of the collaborative efforts of the 1KITE con-

sortium. The sequencing and assembly of the 1KITE transcriptomes were

funded by BGI through support to the China National GeneBank. We thank

S. Blank, A. Dorchin, J. Gusenleitner, V. Mauss, C. Schmid-Egger, and M.

Schwarz for help with identification of samples and R. Allemand, E. Altenhofer,

E. van den Berghe, A. Blanke, J. de Boer, J. Chille, A. Dorchin, T. Eltz, M.

Fierke, R. Glatz, K. Kantner, M. Kivan, K. Kraaijeveld, S. Leonhardt, M. Neu-

mann, M. Niehuis, G. Reder, K. Riede, M. Shaw, N. Schiff, K.-H. Schmalz, J.

Schmidt, P. Schule, K. Schutte, J. Steidle, N. Szucsich, D. Tagu, and D. Yeates

for providing valuable samples. We are grateful to S. Brown, D. Gilbert, J. Lie-

big, and R. Waterhouse for providing information required for the transcript

orthology prediction. We thank V. Achter, S. Bank, D. Bartel, A. Bohm,

H. Escalona, O. Hlinka, T. Pauli, S. Simon, A. Stamatakis, V. Winkelmann,

the Cologne High Efficient Operating Platform for Science (CHEOPS) at the

Regionales Rechenzentrum Koln (RRZ), and the CSIRO HPC Dell PowerEdge

M620 Linux Cluster Systems for computing time and/or bioinformatic support.

We furthermore acknowledge the Gauss Centre for Supercomputing e. V. for

funding computing time on the GCS Supercomputer SuperMUC at the Leibniz

Supercomputing Centre (LRZ). We thank A. Stamatakis for implementing the

four-cluster likelihood quartet mapping feature in ExaML. We acknowledge

the Amt fur Umwelt, Verbraucherschutz und Lokale Agenda of Bonn, Hessen

Forst, the Israeli Nature and National Parks Protection Authority, the Mercan-

tour National Park Service, and the Struktur- und Genehmigungsbehorde

Sud and the Struktur- und Genehmigungsdirektion Nord (both Rhineland

Palatinate) for granting permission to collect samples. K.M. thanks O. Hlinka

(IM&T), D. Yeates (CSIRO), and the Schlinger Endowment to the CSIRO

National Research Collections Australia for support. H. Goulet and the Natural

History Museum (London) kindly granted permission to use published draw-

ings as blueprints for illustrating our main figure. U. Vaartjes kindly helped

with illustrating the main figure. A.D., B.M., C.M., J.R., L.P., M.P., O.N.,

R.S.P., and S.G. were supported by the Leibniz Graduate School for Genomic

Biodiversity Research. C.P. was supported by the Universidad de Castilla-La

Mancha and the European Social Fund (ESF). O.N. and T.S. acknowledge

the German Research Foundation (DFG) for supporting parts of this study

(NI 1387/1-1; SCHM 2645/2-1).

Received: September 24, 2016

Revised: December 13, 2016

Accepted: January 16, 2017

Published: March 23, 2017

REFERENCES

1. Grimaldi, D.A., and Engel, M.S. (2005). Evolution of the Insects (Cambridge

University Press).

2. Aguiar, A.P., Deans, A.R., Engel, M.S., Forshage, M., Huber, J.T.,

Jennings, J.T., Johnson, N.F., Lelej, A.S., Longino, J.T., Lohrmann, V.,

et al. (2013). Order Hymenoptera. Zootaxa 3703, 51–62.

3. Quicke, D.L.J. (1997). Parasitic Wasps (Chapman & Hall).

4. Dowton, M., and Austin, A.D. (1994). Molecular phylogeny of the insect or-

der Hymenoptera: apocritan relationships. Proc. Natl. Acad. Sci. USA 91,

9911–9915.

5. Sharkey, M.J. (2007). Phylogeny and classification of Hymenoptera.

Zootaxa 1668, 521–548.

6. Vilhelmsen, L., Miko, I., and Krogmann, L. (2010). Beyond the wasp-waist:

structural diversity and phylogenetic significance of the mesosoma in

apocritan wasps (Insecta: Hymenoptera). Zool. J. Linn. Soc. 159, 22–194.

7. Heraty, J., Ronquist, F., Carpenter, J.M., Hawks, D., Schulmeister, S.,

Dowling, A.P., Murray, D., Munro, J., Wheeler, W.C., Schiff, N., and

Sharkey, M. (2011). Evolution of the hymenopteran megaradiation. Mol.

Phylogenet. Evol. 60, 73–88.

8. Ronquist, F., Klopfstein, S., Vilhelmsen, L., Schulmeister, S., Murray, D.L.,

and Rasnitsyn, A.P. (2012). A total-evidence approach to dating with fos-

sils, applied to the early radiation of the Hymenoptera. Syst. Biol. 61,

973–999.

9. Sharkey, M.J., Carpenter, J.M., Vilhelmsen, L., Heraty, J., Liljeblad, J.,

Dowling, A.P.G., Schulmeister, S., Murray, D., Deans, A.R., Ronquist, F.,

et al. (2012). Phylogenetic relationships among superfamilies of

Hymenoptera. Cladistics 28, 80–112.

10. Klopfstein, S., Vilhelmsen, L., Heraty, J.M., Sharkey, M., and Ronquist, F.

(2013). The hymenopteran tree of life: evidence from protein-coding genes

and objectively aligned ribosomal data. PLoS ONE 8, e69344.

11. Malm, T., and Nyman, T. (2015). Phylogeny of the symphytan grade of

Hymenoptera: new pieces into the old jigsaw (fly) puzzle. Cladistics 31,

1–17.

12. Peters, R.S., Meusemann, K., Petersen, M., Mayer, C., Wilbrandt, J.,

Ziesmann, T., Donath, A., Kjer, K.M., Aspock, U., Aspock, H., et al.

(2014). The evolutionary history of holometabolous insects inferred from

transcriptome-based phylogeny and comprehensive morphological

data. BMC Evol. Biol. 14, 52.

13. Aron, S., de Menten, L., Van Bockstaele, D.R., Blank, S.M., and Roisin, Y.

(2005). When hymenopteran males reinvented diploidy. Curr. Biol. 15,

824–827.

14. i5K Consortium (2013). The i5K Initiative: advancing arthropod genomics

for knowledge, human health, agriculture, and the environment.

J. Hered. 104, 595–600.

15. Rasnitsyn, A.P., and Quicke, D.L.J. (2002). History of Insects (Kluwer

Academic Publishers).

16. Misof, B., Liu, S., Meusemann, K., Peters, R.S., Donath, A., Mayer, C.,

Frandsen, P.B., Ware, J., Flouri, T., Beutel, R.G., et al. (2014).

Phylogenomics resolves the timing and pattern of insect evolution.

Science 346, 763–767.

Current Biology 27, 1013–1018, April 3, 2017 1017

Page 7: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

17. Peters, R.S., Meyer, B., Krogmann, L., Borner, J., Meusemann, K.,

Schutte, K., Niehuis, O., and Misof, B. (2011). The taming of an impossible

child: a standardized all-in approach to the phylogeny of Hymenoptera us-

ing public database sequences. BMC Biol. 9, 55.

18. Zimmermann, D., and Vilhelmsen, L. (2016). The sister group of Aculeata

(Hymenoptera) – evidence from internal head anatomy, with emphasis

on the tentorium. Arthropod Syst. Phylogeny 74, 195–218.

19. Brothers, D.J. (1999). Phylogeny and evolution of wasps, ants and bees

(Hymenoptera, Chrysidoidea, Vespoidea and Apoidea). Zool. Scr. 28,

233–249.

20. Pilgrim, E.F., Von Dohlen, C.D., and Pitts, J.P. (2008). Molecular phyloge-

netics of Vespoidea indicate paraphyly of the superfamily and novel

relationships of its component families and subfamilies. Zool. Scr. 37,

539–560.

21. Hines, H.M., Hunt, J.H., O’Connor, T.K., Gillespie, J.J., and Cameron, S.A.

(2007). Multigene phylogeny reveals eusociality evolved twice in vespid

wasps. Proc. Natl. Acad. Sci. USA 104, 3295–3299.

22. Pickett, K.M., and Carpenter, J.M. (2010). Simultaneous analysis and the

origin of eusociality in the Vespidae (Insecta: Hymenoptera). Arthropod

Syst. Phylogeny 68, 3–33.

23. Johnson, B.R., Borowiec,M.L., Chiu, J.C., Lee, E.K., Atallah, J., andWard,

P.S. (2013). Phylogenomics resolves evolutionary relationships among

ants, bees, and wasps. Curr. Biol. 23, 2058–2062.

24. Cardinal, S., and Danforth, B.N. (2013). Bees diversified in the age of eu-

dicots. Proc. Biol. Sci. 280, 20122686.

25. Romiguier, J., Cameron, S.A., Woodard, S.H., Fischman, B.J., Keller, L.,

and Praz, C.J. (2016). Phylogenomics controlling for base compositional

bias reveals a single origin of eusociality in corbiculate bees. Mol. Biol.

Evol. 33, 670–678.

26. Garrison, N.L., Rodriguez, J., Agnarsson, I., Coddington, J.A., Griswold,

C.E., Hamilton, C.A., Hedin, M., Kocot, K.M., Ledford, J.M., and Bond,

J.E. (2016). Spider phylogenomics: untangling the spider tree of life.

PeerJ 4, e1719.

27. Fernandez, R., Edgecombe, G.D., and Giribet, G. (2016). Exploring phylo-

genetic relationships within Myriapoda and the effects of matrix composi-

tion and occupancy on phylogenomic reconstruction. Syst. Biol. 65,

871–889.

28. Petersen, M., Meusemann, K., Donath, A., Dowling, D., Liu, S., Peters,

R.S., Podsiadlowski, L., Vasilikopoulos, A., Zhou, X., Misof, B., and

Niehuis, O. (2017). Orthograph: a versatile tool for mapping coding

nucleotide sequences to clusters of orthologous genes. BMC

Bioinformatics 18, 111.

1018 Current Biology 27, 1013–1018, April 3, 2017

29. Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G.,

Gu, S., Li, S., et al. (2014). SOAPdenovo-Trans: de novo transcriptome as-

sembly with short RNA-Seq reads. Bioinformatics 30, 1660–1666.

30. Waterhouse, R.M., Tegenfeldt, F., Li, J., Zdobnov, E.M., and Kriventseva,

E.V. (2013). OrthoDB: a hierarchical catalog of animal, fungal and bacterial

orthologs. Nucleic Acids Res. 41, D358–D365.

31. Katoh, K., and Standley, D.M. (2013). MAFFTmultiple sequence alignment

software version 7: improvements in performance and usability. Mol. Biol.

Evol. 30, 772–780.

32. Misof, B., Meyer, B., von Reumont, B.M., Kuck, P., Misof, K., and

Meusemann, K. (2013). Selecting informative subsets of sparse

supermatrices increases the chance to find correct trees. BMC

Bioinformatics 14, 348.

33. Lanfear, R., Calcott, B., Kainer, D., Mayer, C., and Stamatakis, A. (2014).

Selecting optimal partitioning schemes for phylogenomic datasets. BMC

Evol. Biol. 14, 82.

34. Kozlov, A.M., Aberer, A.J., and Stamatakis, A. (2015). ExaML version 3: a

tool for phylogenomic analyses on supercomputers. Bioinformatics 31,

2577–2579.

35. Pattengale, N.D., Alipour, M., Bininda-Emonds, O.R., Moret, B.M., and

Stamatakis, A. (2010). How many bootstrap replicates are necessary?

J. Comput. Biol. 17, 337–354.

36. Dell’Ampio, E., Meusemann, K., Szucsich, N.U., Peters, R.S., Meyer, B.,

Borner, J., Petersen, M., Aberer, A.J., Stamatakis, A., Walzl, M.G., et al.

(2014). Decisive data sets in phylogenomics: lessons from studies on

the phylogenetic relationships of primarily wingless insects. Mol. Biol.

Evol. 31, 239–249.

37. Strimmer, K., and von Haeseler, A. (1997). Likelihood-mapping: a simple

method to visualize phylogenetic content of a sequence alignment.

Proc. Natl. Acad. Sci. USA 94, 6815–6819.

38. Aberer, A.J., Kobert, K., and Stamatakis, A. (2014). ExaBayes: massively

parallel bayesian tree inference for the whole-genome era. Mol. Biol.

Evol. 31, 2553–2556.

39. Parham, J.F., Donoghue, P.C.J., Bell, C.J., Calway, T.D., Head, J.J.,

Holroyd, P.A., Inoue, J.G., Irmis, R.B., Joyce, W.G., Ksepka, D.T., et al.

(2012). Best practices for justifying fossil calibrations. Syst. Biol. 61,

346–359.

40. Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood.

Mol. Biol. Evol. 24, 1586–1591.

Page 8: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

Current Biology, Volume 27

Supplemental Information

Evolutionary History of the Hymenoptera

Ralph S. Peters, Lars Krogmann, Christoph Mayer, Alexander Donath, SimonGunkel, Karen Meusemann, Alexey Kozlov, Lars Podsiadlowski, Malte Petersen, RobertLanfear, Patricia A. Diez, John Heraty, Karl M. Kjer, Seraina Klopfstein, RudolfMeier, Carlo Polidori, Thomas Schmitt, Shanlin Liu, Xin Zhou, Torsten Wappler, JesRust, Bernhard Misof, and Oliver Niehuis

Page 9: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

Supplemental Figure

Figure S1. Inferred Ultrametric and Time-Calibrated Tree with Node Numbers, Related to Figure 1 The phylogenetic tree is identical to the one shown in Figure 1, but additionally shows the estimated divergence times in million years ago (mya; black numbers behind nodes) and 95% confidence intervals (green bars). The consecutively numbered nodes (blue numbers before nodes) refer to Data S1H, in which all estimated divergence times and 95% confidence intervals are listed.

Aproceros leucopoda

Nomia diversipes

Panurgus dentipes

Ceraphron sp.

Acromyrmex echinatior

Dryudella pinguis

Corydalus cornutus

Euglossa dilemma

Hylaeus variegatus

Trichopria cf. drosophilae

Crabro peltarius

Parischnogaster nigricans

Monosapyga clavicornis

Heteropelma cf. amictum

Pediaspis aceris

Cephalonomia tarsalis

Orussus unicolor

Auplopus ablifrons

Palarus histrio

Blasticotoma filiceti

Thyreus orbatusAnthophora plumipes

Nysson niger

Cosmocomoidea morrilli

Acantholyda hieroglyphica

Brachyserphus parvulus

Parnopes grandior

Chlorion hirtum

Orasema simulatrix

Urocerus augur

Tetragonula carbonaria

Tetraloniella sp.

Diglyphus isaea

Liris sp.

Lestica clypeata

Tetralonia malvae

Pompilus cinereus

Syrphophilus tricinctorius

Dioxys cincta

Passaloecus eremita

Dendrocerus carpenteri

Prionyx kirbii

Ampulex compressa

Isodontia mexicana

Pelecinus polyturator

Eucera syriaca

Eumenes papillarius

Melitta haemorrhoidalis

Xanthostigma xanthostigma

Sapgya quinquepunctata

Cerceris arenaria

Cleptes nitidulus

Lithurgus chrysurus

Plumarius gradifrons

Camptopoeum sacrum

Ampulex fasciata

Harpactus elegans

Ibalia leucospoides

Tenthredo koehleri

Spilomena beata

Harpegnathos saltator

Alysson spinosus

Trissolcus semistriatus

Arge berberidis

Methocha articulata

Orussus abietinus

Bombus rupestris

Camponotus floridanus

Synergus umbraculus

Colletes cunicularius

Gyrinus marinus

Macropis fulvipes

Nasonia vitripennis

Chrysis viridula

Aleiodes testaceus

Halictus quadricinctus

Nomioides sp.

Dufourea dentiventris

Smicromyrme rufipes

Macrocentrus marginator

Trypoxylon figulus

Diaeretus essigellae

Systropha curvicornis

Chyphotes sp.

Pergagrapta polita

Andricus quercuscalicis

Tribolium castaneum

Elampus spina

Astata minorMellinus arvensis

Nomada lathburiana

Chelostoma florisomne

Eucera plumigera

Dolichurus corniculus

Leptopilina clavipes

Diprion pini

Pemphredon lugens

Gorytes laticinctus

Tremex magus

Sphecodes albilabris

Pseudomallada prasinus

Megalodontes cephalotes

Telenomus sp.

Masaris aegyptiacus

Netelia testacea

Podalonia hirsuta

Dasymutilla gloriosa

Apis mellifera

Heterodontonyx sp.

Andrena vaga

Tiphia femorata

Stephanus serratorGasteruption tournieri

Pison atrum

Heriades truncorum

Chalybion californicum

Aphidius colemani

Philanthus triangulum

Ceratina chalybea

Stizoides tridentatus

Pimpla flavicoxis

Xyela alpigena

Nematus ribesii

Colpa sexmaculata

Sceliphron curvatum

Episyron rufipes

Nitela sp.

Polistes dominula

Vespula germanica

Sphecius convallis

Tachysphex fulvitarsis

Stizus continuus

Ichneumon cf. albiger

Osmia cornuta

Pseudogonalos hahnii

Inostemma sp.

Ammophila sabulosa

Epeolus variegatus

Psenulus fuscipennis

Sapygina decemguttata

Crossocerus quadrimaculatus

Torymus bedeguaris

Pseudoscolia sinaitica

Meria tripunctata

Cotesia vestalis

Tetraloniella nigriceps

Buathra laborator

Sirex noctilio

Aulacus burquei

Cephus spinipes

Megachile willughbiella

Xylocopa violacea

Vespa crabro

Xiphydria camelus

Lasioglossum xanthopus

Stelis punctulatissima

Hyposoter didymator

Bembix rostrata

Ammobates syriacus

Oxybelus bipunctatus

Dendrocerus carpenteri

Diodontus minutus

Coelioxys conoidea

Brachygaster minuta

Cimbex rubida

Eucera nigrescens

Dinetus pictus

Brachymeria minuta

Dacnusa sibirica

Anthidium manicatum

Scolia hirta

Sphex funerarius

Netelia sp.

Dasypoda hirtipes

176

159

68

109

4857

77

51

40

83

64

35

63

62

57

85

25

28

23

60

52

137

35

92

48

69

40

94

77

163

124

214

106

62

145

57

30

12

43

63

155

129

144

63

4470

15

72

64

210

55

72

236

56

119

77

143

107

120

163

187

143

90

19

121

48

50

179

20

140

72

247

82

67

43

153

133

162

24

130

121

55

78

130

190

48

53

80

91

67

37

47

128

174

118

228

138

93

94

54

100

76

115

43

80

200

24

225

64

92

62

66

76

38

83

13

65

217

74

94

120

7538

76

106

189

9161

158

150

10

257

137

386

247

218

281

55

144

304

88

182

29

62

212

77

90

99

78

6

126

51

144

133

94

81

142

84

59

44

43

95

211

78

178

50

38

mya

Hym

en

op

tera

Acu

leata

Ap

ocrita

PamphilioideaXyeloidea

Tenthredinoidea

Xiphydrioidea

Siricoidea

CephoideaOrussoidea

Ichneumonoidea

Ceraphronoidea

Cynipoidea

Platygastroidea

ProctotrupoideaDiaprioidea

EvanioideaStephanoideaTrigonaloidea

Chrysidoidea

Tiphioidea

Scolioidea

Ampulicidae

Sphecidae

Anthophilabees

"Crabronidae"digger wasps

Formicoideaants

Pompiloideavelvet ants,spider wasps,and relatives

Chalcidoideajewel wasps

Vespoideapotter, honey andsocial wasps

Ap

oid

ea

168

14

15

129

111

264

outgroup

7

31

100

126127

13

15

9

12

1819

14

8

2

4

20

1110

17

3

16

5

1

6

63

62

47

73

60

78

80

65

53

59

76

43

42

66

79

77

74

67

56

52

44

70

55

69

41

45

71

46

61

49

57

75

72

68

64

54

5150

48

58

40

35

3837

32

36

34

25

2324

21

27

22

28

30

26

29

39

33

88

9998

9495

96

87

8283

97

93

81

86

89

92

85

90

84

91

101

130

136

153152

168

157

162

155

171

159160

165

158

151

172

170

164

166

169

163

167

129

131

124

142

139

115

141

120

135

148

123

146

116

145

119

134

132

147

117

128

137

133

125

144

149

150

138

143

121

110109

111

102

113

105

114

103106

108

104

112

107

118

122

140

173

161

156

154

Thynnoidea

Page 10: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

Supplemental Experimental Procedures 1. Sampling of transcriptomes We sampled RNA of 134 species of Hymenoptera for transcriptome sequencing (Data S1A). Whenever possible, we deposited voucher specimens (paragenophore or syngenophore) [S1] at the Zoological Research Museum Alexander Koenig (Bonn, Germany). Collected samples were ground and preserved in RNAlater (Qiagen, Hilden, Germany) and stored at +4 °C or -80 °C until further processing. One sample (Ceraphron sp.) was first stored in 99% ethanol for < 10 min. The sample was subsequently air-dried and ground in RNAlater. In all species, we sampled the RNA from the entire body of adult specimens. Detailed information (e. g., number of specimens and their sex) are summarized in Data S1A. Of one species (Dendrocerus carpenteri), we sequenced two different samples to ensure good data representation for this important taxon. Please note that we included in our study the raw reads of 33 already sequenced whole body transcriptomes (29 Hymenoptera and four outgroup species; marked with asterisks in Data S1A) published by us in two preceding investigations [S2, S3]. The raw reads of the nine extra samples were processed exactly as the other samples. Our sampling of transcriptomes thus comprised 164 samples of Hymenoptera, referring to 163 different species, and four outgroup taxa. 2. Transcriptome sequencing RNA extraction, next generation sequencing (NGS) library preparation and sequencing of the prepared libraries on Illumina HiSeq 2000 sequencers (Illumina, San Diego, CA, USA) followed the protocol given by Misof et al. [S2]. However, in five species (i. e., Aleiodes testaceus, Ceraphron sp., Cosmocomoidea morrilli, Dolichurus corniculus, Inostemma sp.), from which a total RNA yield < 3 µg was obtained, we constructed the NGS library using the TruSeq mRNA Library Prep Kit (Illumina). We sheared the purified mRNA into fragments of 160–170 bp in length using divalent cations at 98 °C. Fragment sizes and concentrations were determined with the aid of an Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA) and a StepOnePlus Real-Time PCR thermocycler (Applied Biosystems, Waltham, MA, USA). All NGS libraries were paired-end sequenced on a HiSeq 2000 (Illumina) with 150 bp (all libraries except those prepared using the TruSeq kit) and 90 bp (libraries prepared with the TruSeq kit) read length. Per library, we collected approximately 2.5 Gb of raw data. 3. De novo assembly of transcriptomes Transcript raw reads were assembled using the assembler SOAPdenovo-Trans-31kmer (version 1.01) [S4]. All raw reads were quality checked and trimmed. Specifically, we discarded (i) reads with adapter contamination (minimum length of the alignment: 15 bp; at most 3 mismatches), (ii) reads that included > 10 Ns, and (iii) reads that included > 50 base pairs of low quality (i. e., Phred quality score: 2, ASCII 66 "B", Illumina 1.5+ Phred+64). All remaining reads were used for de novo assembly. For details on the assembly process, see Xie et al. [S4] and Misof et al. [S2]. In one step of the assembly, our settings differed from those of Misof et al. [S2]. In the contig forming step, linear k-mers (i. e., k-mers with a single out-degree) were merged to form edges and different edges were linked by arcs. Arcs with an abundance < 5% of the total out-degrees or < 2% of the total in-degrees were excluded. Subsequently, edges with an average abundance >= 3 were reported as contigs. 4. Identification and removal of contaminating sequences All transcriptome assemblies were first searched for vector and linker/adapter contamination using a local installation of VecScreen (http://www.ncbi.nlm.nih.gov/tools/vecscreen/) and the UniVec database build 7.1 (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/). We removed both terminal and internal contaminations. The removal of internal contaminations resulted in a split of contigs/scaffolds. We next searched the assembled transcriptomes for cross-library contamination, as it can occur when pooling single index-tagged NGS libraries on the same Illumina NGS sequencer lane, using the search strategy outlined by Mayer et al. [S5]. Specifically, we compared each transcriptome assembly with all other assemblies sequenced in context of the 1KITE project using BLASTN of the BLAST+ (version 2.2.29) program suite [S6]. We explicitly refrained from restricting the contamination search to only those transcriptomes, which were sequenced on the same lane, in order to be able to detect also contamination that may have occurred in pre-sequencing steps. In those instances, in which BLASTN identified transcripts that shared over a length of at least 180 bp a nucleotide sequence identity of at least 98 %, we used the coverage depth to

Page 11: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

determine, which transcript is likely the original sequence (by being more abundant) and which one likely represents the contamination (by being less abundant). We used as coverage depth of a given transcript the average k-mer coverage statistic provided by the assembly software SOAPdenovo-Trans-31kmer [S4]. Having identified transcripts sharing a high sequence identity, we applied the following procedure: (i) If two transcripts differed more than 2-fold in their relative coverage, we removed the transcript with the lower relative coverage from the corresponding assembly. (ii) If the coverage of the two transcripts in question differed 2-fold or less, we conservatively removed both of them from the two corresponding assemblies. This procedure helped with removing putative foreign contaminations (e. g., from third party libraries sequenced on the same lane, but not present in our analyses). It also meant that in case of multiple highly similar sequences, we retained only the single transcript with the highest relative coverage, given that its coverage was more than 2-fold higher than the coverage of the second-best matching transcript. We detected in samples sequenced on two specific Illumina lanes contaminations originating from the moth Plutella xylostella (Insecta: Lepidoptera), a species that was not included in our dataset, and thus we decided to specifically search the 1KITE transcript assemblies generated from sequencing these two lanes for sequences of this moth. We retrieved all protein-coding sequences of the P. xylostella official gene set (version 1.0) from the diamondback moth genome project database (http://iae.fafu.edu.cn/DBM/) and searched the coding sequences with BLASTN against the respective 1KITE transcriptome assemblies. Hits that exhibited 98% sequence identity over a length of at least 180 bp were removed. We additionally removed transcripts from the assembled transcriptomes that NCBI identified as possibly foreign contaminations when submitting the assemblies to the NCBI Transcriptome Shotgun Assembly (TSA) database (see below). Information on how many transcripts were removed from each transcript library is summarized in Data S1B.

We removed per assembled transcript library between 14,258 (0.05% of the unfiltered assembled transcriptome) and 9,925,188 (26.99% of the unfiltered assembled transcriptome) nucleotides, as they likely or possibly represented contamination (Data S1B). However, the comparatively high number of removed nucleotides in some samples (i. e., congeneric species in the genera Eucera and Tetraloniella) is more likely explained by the chosen conservative approach for identifying possible contaminants than by actual excessive instances of contamination. Please note that we found a relatively small number of single-copy (target) genes in the species of these two genera. Since the two genera are deeply nested within Anthophila (bees), the small number of identified genes (Data S1E) is, however, insignificant for inferring the phylogenetic relationships of the major lineages of Hymenoptera. 5. Identification of orthologous transcripts of single-copy protein-coding genes We used the program Orthograph (version beta4) [S3] to search the transcript libraries for transcripts that are orthologous to a set of target genes. Specifically, we selected target genes that the OrthoDB v7 database [S7] suggested to occur in single-copy across Hymenoptera and Coleoptera (beetles). Note that we included the latter to enable outgroup comparisons (i. e., rooting of the inferred tree). Specifically, target genes were selected by searching the OrthoDB v7 database [S7] for genes that are single copy (copy number = 1) in the sequenced and annotated genome of each of the following six reference taxa: Acromyrmex echinatior (official gene set, OGS, 3.8) [S8], Apis mellifera (OGS 3.2) [S9, S10], Camponotus floridanus (OGS 3.3) [S11] Harpegnathos saltator (OGS 3.3) [S11], Nasonia vitripennis (OGS 2) [S12], and Tribolium castaneum (OGS 3.0) [S13]. The hierarchical level for clustering orthologous genes in the OrthoDB query was set to the node “Holometabola” (= Endopterygota). The six reference taxa were selected, because (i) their genomes are well sequenced and annotated, (ii) their official genes set were publicly released and freely available (in respect of the Ft. Lauderdale agreement) when we started our analyses, and (iii) they covered all those major target lineages from which genomes of species had been sequenced. We additionally demanded that the target genes have not been recorded as being duplicated (but they can be missing) in other species of Hymenoptera. In addition to the six reference species, we exploited the publicly available gene orthology information stored in OrthoDB v7 [S7] to other species when inferring a set of target genes (ortholog set). Our custom query (kindly conducted and provided by Robert Waterhouse on August 11, 2014) explicitly demanded that single-copy genes of the above six reference taxa must not occur in multi-copy (i. e., their copy number must be <= 1) in any of the following additional species: Atta cephalotes (OGS 1.2), Apis florea (OGS 1.1), Bombus impatiens (OGS 1.2), Bombus terrestris (OGS 1.3), Cephus cinctus (OGS from April 2013), Dufourea novaeangliae (OGS 1.1), Eufriesea mexicana (OGS 1.1), Habropoda laboriosa (OGS 1.2), Lasioglossum albipes (OGS 5.4), Linepithema humile (OGS 1.2), Melipona quadrifasciata (OGS 1.1), Megachile rotundata (OGS 1.1), Pogonomyrmex barbatus (OGS 1.2),

Page 12: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

and Solenopsis invicta (OGS 2.2.3). Note that we intentionally did not restrict the copy number to 1 in order to cope with the problem that official gene sets of poorly sequenced and annotated genomes typically comprise a relatively low number of genes. We acknowledge that assembly and annotation artefacts can also result in artificially inflated copy numbers. We think, however, that our query represents a reasonable compromise between being able to search for as many promising target genes as possible in the 1KITE transcriptomes (even if these genes actually got secondarily lost in some lineages) and making sure that genes that indeed occur in multiple copies are omitted.

Orthograph requires the user to provide a tabulator-delimited file containing the information which genes in the reference species are orthologous. We obtained this information from Robert Waterhouse, who extracted it from OrthoDB v7 [S7]. Orthograph also requires the official gene sets (i. e., the predicted coding and the predicted amino acid sequences) to the sequenced genomes of these reference species if one intends to conduct analyses also on the nucleotide level. These were obtained from the sources specified in Data S1D.

Since the OrthoDB v7 file that stores information about which genes are orthologous in the six reference taxa also included information about all other taxa that were part of the OrthoDB query, we first removed with a custom Perl script from the tabulator-delimited OrthoDB v7 output file all entries that did not refer to the above six reference species. After this filtering, the file contained 3,275 entries per species, thus 19,650 entries in total. Given that the entries in the gene identifier column of this file need to perfectly correspond with the header information of the corresponding genes in the official gene sets, we re-formatted the headers in the OGS files with custom Perl scripts. In this context, we discovered that the OGS 2.0 of N. vitripennis is partially corrupt. Specifically, we found multiple entries for the following genes and proteins (the number of copies is given in parenthesis): Nasvi2EG037307t1 (2), Nasvi2EG037281t1 (4), Nasvi2EG037267t1 (3), Nasvi2EG037334t1 (2), Nasvi2EG037012t1 (2), Nasvi2EG037232t1 (2), Nasvi2EG037311t1 (3), Nasvi2EG037344t1 (2), Nasvi2EG037308t1 (3) Nasvi2EG037337t1 (4), Nasvi2EG037324t1 (2), Nasvi2EG037295t1 (2), Nasvi2EG037340t1 (2), Nasvi2EG037310t1 (3), Nasvi2EG037309t1 (2), Nasvi2EG037306t1 (3), Nasvi2EG037268t1 (2), Nasvi2EG037352t1 (3), Nasvi2EG037335t1 (3), Nasvi2EG037349t1 (2), Nasvi2EG037338t1 (2), Nasvi2EG037315t1 (3), Nasvi2EG037181t1 (2), Nasvi2EG037181t2 (2), Nasvi2EG037320t1 (5), Nasvi2EG037351t1 (2), Nasvi2EG037317t1 (2), Nasvi2EG037342t1 (3), Nasvi2EG037322t1 (4), Nasvi2EG037294t1 (3), Nasvi2EG037343t1 (5), Nasvi2EG037339t1 (2), Nasvi2EG037231t1 (3), Nasvi2EG037313t1 (2), Nasvi2EG026831t1 (2), Nasvi2EG037316t1 (4), Nasvi2EG037325t1 (2), Nasvi2EG037230t1 (3), Nasvi2EG037326t1 (2).

Since it remained unclear why different genes in the Nasonia OGS 2.0 had the same identifier (Don Gilbert, personal communication; August 11, 2014), we removed these genes from the official gene set. A total of 15 of them (EOG7JTJMP, EOG783ZJ4, EOG7W4C2X, EOG78Q60C, EOG74RCGQ, EOG7B3CCR, EOG7W1GSS, EOG7B0H4C, EOG700M13, EOG7K459C, EOG77MMBJ, EOG72GBX3, EOG7DG821, EOG78Q602, EOG764JRC) referred to genes in our OrthoDB v7 output files. We conservatively removed the corresponding entries to all six reference taxa from the OrthoDB v7 file. The edited OrthoDB v7 output file subsequently contained 3,260 entries per species, 19,560 entries in total.

The occurrence of isoforms in a dataset can compromise the search for orthologous transcripts, since the Orthograph search (see below) of a potential target transcript (at the transcriptional level) against all proteins in an official gene set can result in finding an isoform of a protein different from the isoform (of the same protein) that was initially used to search the transcript library (at the transcriptional level). The best reciprocal hit criterion, on which Orthograph relies on (see below), would then not be fulfilled. We therefore decided to screen the official gene sets for the occurrence of isoforms (and alternative transcripts on the coding sequence level) in the 3,260 target genes and retained only the isoform (and alternative transcript on the coding sequence level) that OrthoDB v7 determined to be orthologous. Finally, we removed terminal stop symbols (*) in the protein files and applied the script ‘make-ogs-corresponding.pl’ (part of the Orthograph software package; [S3]) to validate (and if necessary enforce) correspondence of the amino acid sequences of proteins and their coding sequence at the nucleotide level in the OGS files of a given species. The resulting ortholog set comprised 3,260 protein-coding genes.

These 3,260 single-copy protein-coding target genes were searched in the transcript libraries with Orthograph, which rests on a graph-based approach and uses the best-reciprocal hit (BRH) criterion to extend clusters of known orthologs with transcript sequences [S3]. Its algorithm maps transcript sequences

Page 13: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

to the globally best matching ortholog group (OG). For this purpose, it creates a profile hidden Markov model (pHMM) from the amino acid sequences of each target gene (i. e., ortholog group (OG), containing the orthologous amino acid sequences of the six reference species). This pHMM is used to search (at the translational level) for ortholog candidates in transcript libraries. Each candidate transcript sequence is extracted and used as query in a protein BLAST search against a database of all amino acid sequences from all the reference OGS. The results from both the pHMM search and the protein BLAST search are stored in a database for evaluation and orthology delineation. The orthology delineation procedure pulls all pHMM search results from the database and sorts them by descending bit score. For each pHMM hit transcript, the corresponding BLAST result is checked whether the best hit sequence belongs to the OG that the pHMM is based on. If this is the case, the BRH criterion is fulfilled and the OG is extended with the candidate transcript. Subsequently, Exonerate [S14] is employed to infer the open reading frame (ORF) in the transcript sequence by calculating a pairwise alignment with a reference amino acid sequence (i. e., the amino acid sequence of the reference species that was retrieved in the reciprocal search as best hit; see above). This step also produces corresponding sequences on the transcriptional level and corrects frame shift errors. If possible, the ORF is extended beyond the pHMM alignment region, retaining the orthologous region.

We used Orthograph (version beta4) [S3] to search for the 3,260 target genes. Orthograph depends on the following bioinformatics software packages, of which we used the hereafter specified version numbers: BLAST+ (version 2.2.26) [S6], Exonerate (version 2.2) [S14], and MAFFT (version 7.123) [S15]. All potential orthologous transcripts found by a given pHMM were sorted by their degree of similarity to the pHMM and then reciprocally searched against the proteins of the official gene sets (from six different species; Data S1D) combined. We applied a non-strict reciprocal search. Thus, the BRH criterion was fulfilled if the first reciprocal hit was a protein that was also part of the pHMM, irrespective of the species to which the protein belonged to. If the BRH criterion could not be established for a given transcript, all transcripts with a smaller degree of similarity to a given pHMM were discarded (soft-threshold = 0). The minimum length of transcripts (on the translational level) was set to 30 amino acids. We allowed frame shift-corrected transcriptional extension of each transcript beyond the region of the transcript for which the BRH criterion was established as long as this region was not longer than the region, for which the BRH was established (see above). All other Orthograph parameters were left at the default values. When summarizing the results, we removed terminal and masked internal stop codons with X and NNN, respectively, in all amino acid and corresponding nucleotide sequences.

Search for transcripts orthologous to the 3,260 target genes in all 168 analyzed transcript libraries revealed on average transcripts to 2,403 target genes (median: 2,450; minimum: 1,456; maximum 2,760; Data S1E). Species with particularly few target gene hits were: Eucera plumigera (1,456), Ampulex compressa (1,540), Ampulex fasciata (1,542), Eucera syriaca (1,560), Tetraloniella sp. (1,662), and Acantholyda hieroglyphica (1,668). Overall, 3,256 of the 3,260 target genes were represented in the 168 transcriptomes. 6. Inference of multiple sequence alignments We constructed multiple sequence alignments (MSA) with MAFFT (version 7.017) [S15], using the L-INS-i algorithm. We subsequently checked the resulting MSAs on the translational level and, if necessary, refined the MSAs as outlined by Misof et al. [S2]. We inferred the MSA of each OG also on the transcriptional (nucleotide) level using a modified version of the software Pal2Nal [S16] (version 14; see Misof et al. [S2] for details on the software modification). All MSAs have been deposited in and are available from Mendeley Data (doi: 10.17632/trbj94zm2n.2).

Assessing the quality of the inferred Multiple Sequence Alignments (MSAs) revealed putatively misaligned amino acid transcripts (hereafter referred to as outliers) in 256 of the 3,256 investigated ortholog groups (OGs). Attempts to refine the alignment of the 619 outlier sequences succeeded in 101 instances (i. e., sequences). The remaining 518 sequences, referring to a total of 230 OGs, remained outliers and were consequently removed from the MSAs. We also removed the corresponding nucleotide sequences from the respective nucleotide MSAs.

Page 14: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

7. Protein domain identification To facilitate a protein domain-based sequence data partitioning scheme, we applied the procedure described by Misof et al. [S2]. Protein domains and protein domain clans (i. e., evolutionary related protein domains) [S17] were identified (on the amino acid level) in each MSA by using pHMMs from the Pfam database (release 27.0) [S18] and the software pfam_scan.pl (version 1.5, release date: October 15, 2013) [S18] in conjunction with HMMER (version 3.1b2) [S19]. For additional details, see Misof et al. [S2].

Search for protein domains in the refined amino acid sequence alignments of the 3,256 single-copy genes assigned 32.0% of the alignment sites to domains of Pfam-A and 6.0% of the alignment sites to domains of Pfam-B. A total of 62.0% of the alignment sites remained unannotated. Based on the domain identification results, we split the 3,256 multiple sequence alignments and rearranged their sites into 5,922 different data blocks. Of these, 312 belong to Pfam-A domains and were pooled according to their clan annotation [S17]. After pooling domains without clan annotation but having the same name, we obtained 1,243 data blocks belonging to different Pfam-A domains and 1,111 belonging to different Pfam-B domains. Finally, we pooled unannotated regions (voids) according to their gene origin. This resulted in 3,256 data blocks corresponding to the void regions of the 3,256 analyzed genes. 8. Multiple sequence alignment masking We used a modified version of Aliscore (version 1.2) [S20–S22] to identify blocks of putative alignment ambiguities or randomized MSA sections in the MSA of each OG. Aliscore was executed with its default sliding window size, with requesting the maximum number of pairwise sequence comparisons, and with the option –e (i. e., indicating gappy amino acid sequence data). Blocks of putative alignment ambiguities or randomized MSA sections were subsequently removed from each domain data block at the amino acid and corresponding nucleotide level using custom scripts and concatenated to supermatrices containing protein domain- and gene-based data blocks (the latter comprising regions not annotated as protein domain). Gap symbols (-) at the beginning and at the end of the resulting MSAs were replaced with 'X' (translational level) and 'N' (transcriptional level), respectively. 9. Removal of phylogenetically non-informative data blocks We assessed the phylogenetic signal of each data block at the translational level with the software MARE (version 0.1.2-rc) [S23]. Data blocks whose phylogenetic information was zero were removed from the supermatrix on the translational and on the transcriptional level using custom Perl scripts. Note that the phylogenetic information content of all retained data blocks also differed from zero when being assessed on the transcriptional level.

After (i) removing ambiguous alignment sites in each data block resulting from steps outlined above, (ii) deleting data blocks that contained no phylogenetic information (792 in total), and (iii) concatenating all remaining data blocks, the resulting supermatrices consisted of 1,505,722 amino acid and 3,011,444 nucleotide sites (1st and 2nd codon position only), respectively. Each of the two supermatrices contained information from 3,256 different single-copy protein-coding genes and comprised 4,818 data blocks (301 clan data blocks from pooled Pfam-A protein domains with clan association, 1,103 data blocks from pooled Pfam-B protein domains without clan association, 778 data blocks from Pfam-B protein domains, 2,636 data blocks from void regions).

The amino acid supermatrix was further processed for testing specific phylogenetic hypotheses via analysis of decisive datasets (i. e., when demanding the presence of at least one species from a given taxonomic group in each data block of the supermatrix). The statistics of the three inferred datasets were as follows: ‘backbone dataset’ (800,393 amino acid sites, comprising 2,142 data blocks), ‘Aculeata dataset’ (709,055 amino acid sites, comprising 1,712 data blocks), ‘Apoidea dataset’ (527,548 amino acid sites, comprising 841 data blocks).

All above specified supermatrices as well as the data block information files have been deposited at Mendeley Data (doi: 10.17632/trbj94zm2n.2). 10. Partitioning and substitution model selection We used PartitionFinder (developer versions 2.0.0-pre2, 2.0.0-pre9, and 2.0.0-pre10) [S24] to identify combinations of data blocks that can be modeled with the same substitution model and which therefore might better be combined to partitions. The corrected Akaike information criterion (AICc) [S25] was used to assess whether or not to combine data blocks to partitions. PartitionFinder was run with the following

Page 15: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

settings: branchlengths [linked], models [LG+G, LG+G+F, WAG+G, WAG+G+F, BLOSUM62+G, BLOSUM62+G+F, DCMUT+G, DCMUT+G+F, JTT+G, JTT+G+F, LG4X], model_selection [aicc], search [rcluster], weights [rate = 1.0, base = 1.0, model = 0.0, alpha = 1.0], rcluster-percent [100.0], rcluster-max [10,000]. PartitionFinder was started with the following additional command line arguments: '--raxml --all-states --min-subset-size 50'. We applied the same partition scheme that we inferred with PartitionFinder from analyzing protein domain-based data blocks at the translational (amino acid) level when analyzing the data at the transcriptional (nucleotide) level. However, we discarded the 3rd codon position, as it was saturated and typically exhibits lineage specific compositional biases [S26, S27]. We furthermore modeled the nucleotide substitutions of each codon position within a given partition separately. The applied nucleotide substitution model was GTR+G.

PartitionFinder suggested merging data blocks in each of the supermatrices outlined above to partitions. Specifically, PartitionFinder suggested analyzing the primary supermatrix (i. e., the non-decisive dataset) at the amino acid level by considering 2,058 partitions. We applied the same partition scheme when analyzing the corresponding supermatrix containing nucleotide sequences, except that we modeled the 1st and 2nd

codon position of each partition separately (the 3rd codon position was not considered in our analyses). The resulting primary supermatrix comprised accordingly on the nucleotide level 4,116 partitions. PartitionFinder suggested considering the following number of partitions for testing specific phylogenetic hypotheses via analyses of decisive datasets: 1,025 (‘backbone dataset’), 853 (‘Aculeata dataset’), and 461 (‘Apoidea dataset’). The final data sets and files containing the inferred partition schemes and model selection results are deposited at Mendeley Data (doi: 10.17632/trbj94zm2n.2). 11. Phylogenetic tree inference and bootstrap analysis We applied the maximum likelihood optimality criterion as implemented in ExaML (versions 3.0.15 and 3.0.17) [S28] for the phylogenetic inferences. We selected the tree with the best log-likelihood score found in 50 independent tree searches (25 randomized stepwise addition parsimony starting trees, 25 completely random starting trees) per dataset. All starting trees were inferred with RAxML (version 8.0.26) [S29] We applied the GTR substitution model when analyzing the nucleotide sequence data and applied the partition-specific substitution models suggested by PartitionFinder when analyzing the amino acid sequences in ExaML. Rate heterogeneity was approximated with a gamma distribution, using the median and four discrete rate categories. Node support was estimated via non-parametric bootstrapping [S30]. Bootstrap replicate alignments and random start trees were generated with RAxML and a custom shell script. Subsequently, ExaML was used to infer one ML tree per bootstrap replicate, applying the original partitioning scheme suggested by PartitionFinder for the respective supermatrix. In order to determine the minimum number of replicates needed for reliable estimation of bootstrap support values, we used the ”autoMRE” bootstrap convergence criterion [S31], as implemented in RAxML, with the default threshold setting of 0.03. In brief, this method works as follows: first, all replicate trees are randomly divided into two subsets of equal size; then, a majority rule extended (MRE) consensus tree for each subset is built; finally, the relative weighted Robinson-Foulds distance (WRF) between the consensus trees is computed. Convergence is reached if WRF is below the threshold in at least 99% of 1,000 tree set permutations. Convergence was evaluated after analyzing every batch of 50 bootstrap replicates. All inferred trees were rooted at the branch that connects Hymenoptera with the outgroups.

The majority rule extended (MRE) bootstrap convergence criterion [S31] indicated that 50 bootstrap replicates are sufficient to accurately assess node support irrespective of whether we analyzed the primary supermatrix containing amino acid sequences or the primary supermatrix containing the corresponding nucleotide sequences. All node support values provided in the present investigation are consequently based on 50 non-parametric bootstrap replicates each.

Please note that RAxML and ExaML remove completely undetermined sites (i. e., sites containing only “-” and/or “X” and/or “N”) prior to the phylogenetic inference. The total size of the supermatrices in each of the phylogenetic inferences was accordingly 1,505,514 sites (primary supermatrix, amino acids), 3,011,099 sites (primary supermatrix, nucleotides), 800,252 sites (“backbone dataset”, amino acids), 708,924 sites (“Aculeata dataset”, amino acids), and 527,471 sites (“Apoidea dataset”, amino acids).

We used ExaBayes [S32] to conduct phylogenetic inferences also in Bayesian framework and analyzed again the primary amino acids supermatrix (see above). We completed three independent ExaBayes runs with four coupled MCMC chains each. After 200,000 generations, we assessed convergence of the results by computing the average deviation of split frequencies (ASDSF criterion). We obtained an ASDSF value of 2.51%, which is generally considered an acceptable convergence according to the ExaBayes manual

Page 16: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

[http://sco.h-its.org/exelixis/web/software/exabayes/manual/index.html#sec-7-3]. Finally, we used the consense tool (part of the ExaBayes software package) to obtain a MRE consensus tree, discarding the first 25% of the sampled topologies. Please note that ExaBayes does not support the LG4X amino acid substitution model, which PartitionFinder suggested to apply on several of the inferred data partitions. We therefore decided to use the automatic model selection strategy implemented in ExaBayes (default setting) on the PartitionFinder-inferred partitions.

The topology inferred with ExaBayes proved to be identical to the one inferred from analyzing the primary amino acid supermatrix under the maximum likelihood optimality criterion (Figure 1b), except for the phylogenetic position of Stephanoidea and Scoliidae. Stephanoidea were found with low statistical support as sister group of Evanioidea+(Trigonaloidea+Aculeata), while the maximum likelihood analysis inferred (also with low node support) Stephanoidea as sister group of Evanioidea. Note that we found noteworthy phylogenetic conflict when analyzing the dataset using Four-cluster Likelihood Quartet Mapping analyses in respect of the phylogenetic position of Stephanoidea (see also discussion in section 14). Finally, while the maximum likelihood analyses suggested ants as the sister group of Apoidea, the Bayesian analyses inferred a lineage comprising both ants and Scoliidae as the sister group of Apoidea.

The results from the phylogenetic tree inference and bootstrap analysis have been deposited at Mendeley Data (doi: 10.17632/trbj94zm2n.2). 12. Screen for rogue taxa We tested for the presence of rogue taxa with RogueNaRok (version 1.0) [S33] when analyzing the amino acid and the nucleotide supermatrix using five distinct settings: (i) bipartition support drawn onto the best ML tree (referred to as best), (ii) majority rule consensus (50% threshold, referred to as mr), (iii) greedily refined extended majority rule consensus (referred to as mre), (iv) strict consensus (100% threshold, referred to as strict), (v) the 75% threshold consensus – the criterion for pruning rogue taxa is to improve the number of edges that have at least 75% bootstrap support (referred to as t75). We used a dropset size of 1 when using the settings best and mre, a dropset size of 1+2 when using the setting mr and a dropset size of 1+4 when using the settings strict and t75.

We identified a single species that exhibited rogue taxon behavior: Nitela sp., a representative of the polyphyletic apoid wasp family “Crabronidae”. However, Nitela sp. was identified as a rogue taxon only when analyzing the amino acid primary supermatrix and only under one specific rogue taxon identification setting: 75% threshold consensus with dropset sizes 1+4. Given the fact that Nitela is deeply nested in a subordinated monophyletic lineage of apoid wasps and that its position within this lineage was not of primary interest in context of the present investigation, we refrained from excluding Nitela from our phylogenetic analyses. The results from the rogue taxon have been deposited at Mendeley Data (doi: 10.17632/trbj94zm2n.2). 13. Testing specific phylogenetic hypotheses via analysis of decisive datasets Following Dell’Ampio et al. [S34] and Misof et al. [S2], we generated decisive datasets, which contained at least one species from a given taxonomic group in each partition of the analyzed supermatrices, to assess whether a bias due to a possible lack of taxonomic sequence representation at the partition level has an impact on the inferred tree topology. We generated three different decisive datasets: (i) one for addressing the major lineages along the Hymenoptera tree (‘backbone dataset’), (ii) one for addressing the phylogenetic relationships of the major lineages of Vespoidea and Apoidea (‘Aculeata dataset’), and (iii) one for addressing the phylogenetic relationships within Apoidea specifically (‘Apoidea dataset’). Specific information on which groups we distinguished and which species are part of a given group has been deposited at Mendeley Data (doi: 10.17632/s5j2f62z3d.2). Given that the generation of decisive datasets resulted in some partitions being excluded from subsequent analyses, we inferred new partition schemes and corresponding partition-specific substitution models with PartitionFinder for each decisive dataset. Please note that in the phylogenetic analyses of the decisive datasets we considered only amino acid sequences. We conducted ten tree searches with ExaML (versions 3.0.15 and 3.0.17) with random starting trees per decisive dataset.

Phylogenetic analysis of all three decisive datasets (i. e., ‘backbone dataset’, ‘Aculeata dataset’, ‘Apoidea dataset’) resulted in topologies identical to the one inferred from analyzing the primary supermatrix (i. e., the non-decisive dataset) at the amino acid level when considering only those major lineages whose phylogenetic relationships to each other were meant to be assessed by the respective

Page 17: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

decisive dataset. Since the decisive data sets are restricted to partitions with a required taxon coverage, the results indicate that the distribution of missing data did not produce a significant phylogenetic signal, rendering the probability of a missing data bias in our phylogenetic relationships unlikely. The results from analyzing the decisive datasets have been deposited at Mendeley Data (doi: 10.17632/s5j2f62z3d.2). 14. Four-cluster Likelihood Quartet Mapping (FcLM) We applied FcLM to assess the phylogenetic support for conflicting hypotheses [S2, S34, S35, S36], focusing on five major phylogenetic hypotheses (see below). We used the PTHREADS implementation of RAxML (versions 8.2.6 and 8.2.8) [S29] to infer the likelihood scores of quartets, using the partition scheme and substitution models inferred before when analyzing the complete supermatrix at the translational (amino acid) level. We assessed the following phylogenetic hypotheses: (i) Eusymphyta represent the sister group of Unicalcarida. This hypothesis was assessed by testing the phylogenetic position of Xyeloidea (iA) relative to the outgroup, to Pamphilioidea+Tenthredinoidea and to Unicalcarida, (iB) relative to the outgroup, to Tenthredinoidea and to Pamphilioidea+Unicalcarida, and (iC) relative to the outgroup, to Pamphilioidea and to Tenthredinoidea+Unicalcarida. Please note that we did not consider representatives of Apocrita in the three tests. (ii) Cephoidea represent the sister group of Vespina; (iii) Trigonaloidea represent the sister group of Aculeata; (iv) Stephanoidea represent the sister group of Evanioidea; (v) Formicidae represent the sister group of Apoidea. Specific information about which species were part of a specific group is given in Data S1G.

The results from testing major phylogenetic hypotheses (i. e., Eusymphyta as the sister group of Unicalcarida, Cephoidea are the sister group of Vespina, Trigonaloidea are the sister group of Aculeata, Stephanoidea are the sister group of Evanioidea, Formicidae are the sister group of Apoidea) via Four-cluster Likelihood Mapping (FcLM) as well as all analyzed amino acid data matrices have been deposited along with partition information files at Mendeley Data (doi: 10.17632/s5j2f62z3d.2 and 10.17632/trbj94zm2n.2).

Given that we found in all tree inferences strong support for monophyletic Eusymphyta, a result incompatible with the classical hypothesis of Xyeloidea being the sister group of all remaining Hymenoptera (supported by morphological data and data on muscle cell ploidy levels) [S37, S38], we additionally assessed the support for our and for alternative phylogenetic hypotheses regarding the phylogenetic position of Xyeloidea via FcLM. FcLM of the original amino acid sequence data consistently revealed the lowest phylogenetic signal (0–3%) for a sister group relationship of Xyeloidea and the remaining Hymenoptera, irrespective of the phylogenetic position of Pamphilioidea (please note that FcLM does not assume any specific phylogenetic relationships among the species of a given group, e. g., the outgroup). While it cannot be excluded that consideration of additional outgroup lineages (e. g., non-holometabolous insects) could result in a change of the topology, doing so would require design of a very different ortholog set than the one used in the present investigation, comprising significantly fewer genes and consequently likely containing substantially less phylogenetic signal.

The results from FcLM of the original amino acid sequence data revealed no phylogenetic signal for topologies that are incompatible with the hypothesis of Cephoidea being the sister group of Vespina and with the hypothesis of Trigonaloidea being the sister group of Aculeata. However, FcLM revealed phylogenetic signal in the dataset that is in conflict with the other two hypotheses. Specifically, FcLM indicated besides strong support for a sister group relationship of Formicidae and Apoidea (66%) also noteworthy support a sister group relationship of Scoliidae and Apoidea (32%). This signal has the potential of misleading phylogenetic analyses, in particular when considering only a small taxonomic sampling. Authors should keep this in mind when inferring a sister group relationship of scoliid and apoid wasps in future analyses. A sister group relationship of Scoliidae and Formicidae, as it was inferred in the Bayesian analyses, was supported only by 2% of the quartets. Considering all evidence, a sister group relationship of Formicidae and Apoidea appears currently the most likely phylogenetic scenario. FcLM also indicated strong support for a sister group relationship of Stephanoidea and Trigonaloidea+Aculeata (88%) and for a sister group relationship of Stephanoidea and Parasitoida (11%). In fact, the phylogenetic hypothesis suggested by the simultaneous phylogenetic analysis of all taxa (Figure 1; i. e., a sister group relationship of Stephanoidea and Evanioidea) is only supported by 1% of the evaluated quartets. This sister group relationship also received only a comparatively low bootstrap support (i. e., 90%), considering the size of the analyzed data matrix (Figure 1) and it was not found in the Bayesian analyses either (Fig. S4). We consequently consider the question of the phylogenetic position of Stephanoidea relative to Parasitoida, Evanioidea, and Trigonaloidea+Aculeata as still open.

Page 18: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

15. Testing for non-phylogenetic signal biases via permutation tests To assess the possible impact of a heterogeneous composition of amino acid sequences, of a non-stationarity of substitution processes and of a non-random distribution of missing data [S39, S40] on our phylogenetic inferences, we adopted the permutation test strategy suggested by Misof et al. [S2]. For more information on the applied permutation schemes and their explanatory power, see Misof et al. [S2]. The phylogenetic signal for the conflicting hypotheses in the permuted supermatrices was assessed using FcLM (see above). FcLM was conducted on the translational level and using the same partition scheme as before, but using the WAG substitution model across all partitions. Permutation tests were done with RAxML (versions 8.2.6 and 8.2.8) [S29] and ExaML (version 3.0.17) [S28].

The results from testing major phylogenetic hypotheses (i. e., Eusymphyta are the sister group of Unicalcarida, Cephoidea are the sister group of Vespina, Trigonaloidea are the sister group of Aculeata, Stephanoidea are the sister group of Evanioidea, Formicidae are the sister group of Apoidea) via permutation tests in conjunction with FcLM have been deposited at Mendeley Data (doi: 10.17632/s5j2f62z3d.2). In none of the permutation tests did any of the three possible topologies receive more than 18% support when assessing the phylogenetic support for different topology by the permuted dataset via FcLM. The only exception were the results from testing the position of Xyeloidea and in which some topologies received up to 41% support. However, artificial phylogenetic signal supported almost exclusively topologies incompatible with monophyletic Eusymphyta, whereas artificial support for Eusymphyta remained low. We accordingly interpret the impact of artificial phylogenetic signal on the inferred phylogeny shown in Figure 1 as insignificant.

The results from the various permutation tests have been deposited at Mendeley Data (doi: 10.17632/s5j2f62z3d.2). 16. Divergence time estimation

We time-calibrated the inferred phylogenetic tree using information on 14 carefully selected fossils. We followed the best-practice recommendations put forth by Parham et al. [S41] for justifying fossil

calibrations. We considered the following fossils to calibrate the inferred phylogeny, with the lineage each fossil was assigned to given in parentheses: Triassoxyela foveolata (Hymenoptera) [S42] Libanophron astarte (Ceraphronoidea) [S43], Rhetinorhyssalus morticinus (Braconidae) [S44], Protimaspis costalis (Cynipoidea) [S45], Archaeopelecinus jinzhouensis (Pelecinidae) [S46], an undescribed species (Chalcidoidea) deposited in the Natural History Museum of the Lebanese University (Faculty of Sciences II, Fanar, Lebanon, female, Coll. No. 874A, specimen represents an undescribed family, that can be placed in Chalcidoidea based on unambiguous synapomorphic character, i. e., presence of multiporous plate sensillae on antennae), Protoparevania lourothi (Evaniidae) [S47], Lancepyris opertus (Chrysidoidea) [S48], Architiphia rasnitsyni (Tiphiidae) [S49], Paleogenia wahisi (Pepsinae) [S50], Apodolichurus sphaerocephalus (Ampulicidae) [S51], Palaeomacropis eocenicus (Macropidinae) [S52], Paleohabropoda oudardi (Anthophorini) [S53], Bombus (Bombus) randeckensis (Bombini) [S54]. Detailed information about the fossils (e. g., characters considered as autapomorphies of specific lineages, their estimate minimum geological age) is given in Data S1F. If multiple fossils of a given group were available for calibration, we selected the geologically oldest one. Please note that we considered Paleogenia wahisi (Pepsinae), deposited in Baltic Amber, despite the well-known problem of accurately estimating the geological age of Baltic Amber. However, the currently widely accepted conservative minimum age of Baltic Amber (i. e., 34 mya) [S55] proved to be informative for improving calibration of the phylogenetic tree. We used mcmctree, in conjunction with codeml (both part of the PAML software package, version 4.9) [S56] to estimate node ages. We applied the approximate likelihood method [S57] by generating a single Hessian matrix with codeml using standard parameters and the JTT model from a condensed version of our supermatrix of amino acids. Specifically, the supermatrix contained only sites at which at least 95% of the samples had a non-ambiguous amino acid symbol. The condensation of the supermatrix became necessary to overcome computational limitations when estimating node ages resulting from the large size of the dataset. Please note that we tested whether or not our data allowed for the generation and subsequent concatenation of partition-specific Hessian matrices. This approach would in principle allow different models to be applied to different partitions. However, we found that most partitions contained at least one taxon with no amino acid sequence information, rendering this approach unfeasible. Prior to the node age analyses with fossil calibration data, we started various test runs, using an arbitrary calibration and the Hessian matrix, to assess whether our sampling parameters would likely lead to convergence. These tests

Page 19: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

revealed that a burn-in value of 40,000 and a number of samples of 400,000 typically led to convergence. We introduced the fossil calibrations as soft minima using default settings (i. e., as truncated Cauchy distributions with an offset of 0.1, a scale parameter of 1 and a left tail probability of 0.025) [S58]. The root age was specified at 412 mya in the control file, providing a very conservative estimate [S2]. We ran four dating analyses, using independent-rates clock and standard parameters, apart from the burn-in and sample size values, which we set to the values found in the above test runs and using the Hessian matrix generated from the non-partitioned dataset. All four runs provided very similar node age estimates, with two of the runs also agreeing in the boundaries of the 95% confidence intervals. The other two runs suggested slightly larger confidence limits at a small number of nodes, mostly close to the root of the tree. It is likely that in these two runs the MCMC chain took longer to converge to the final node ages and that thus a larger number of outliers was sampled. Since two of the runs agreed in their node age estimates and errors, we consider the results from these two runs to very likely represent the more realistic node ages and confidence intervals (shown in Figures 1 and S1 and specified in Data S1H). In addition, we ran the same dataset with the same settings again twice, this time using the correlated-rates clock. Finally, we used a condensed version of our supermatrix of nucleotides, analogous to the amino acid set including only sites at which at least 95% of the samples had a non-ambiguous base symbol, and ran this, with the same settings as outlined above for the amino acid set, twice with the independent and the correlated-rates clock, respectively. As model we specified HKY85. The node ages and confidence intervals are shown in Data S1H. The two replicate runs for these additional analyses converged (Figure deposited at Mendeley Data, doi: 10.17632/s5j2f62z3d.2). In general, the correlated rates runs and the runs using the nucleotide supermatrix delivered results that show older ages for the included nodes, with the effect of using correlated-rates slightly lower. However, we deem the correlated-rates clock as less suitable for the data and taxon sampling of our study, following [S59]. While correlated-rates models have been advocated for in past studies (e. g., [S60]), the basic autocorrelation method implemented in MCMCtree can result in unrealistically high or low rates, particularly if the number of taxa included in the study is high and if the root age is old (also see [S61, S62]). Supplemental References S1. Pleijel, F., Jondelius, U., Norlinder, E., Nygren, A., Oxelman, B., Schander, C., Sundberg, P., and

Thollesson, M. (2008). Phylogenies without roots? A plea for the use of vouchers in molecular phylogenetic studies. Mol. Phylogenet. Evol. 48, 369–371.

S2. Misof, B., Liu, S., Meusemann, K., Peters, R.S., Donath, A., Mayer, C., Frandsen, P.B., Ware, J., Flouri, T., Beutel, R.G., et al. (2014). Phylogenomics resolves the timing and pattern of insect evolution. Science 346, 763–767.

S3. Petersen, M., Meusemann, K., Donath, A., Dowling, D., Shanlin, L., Peters, R.S., Podsiadlowski, L., Vasilikopoulos, A., Zhou, X., Misof, B., et al. (2017). Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics 18, 111.

S4. Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S., Li, S., et al. (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 30, 1660–1666.

S5. Mayer, C., Sann, M., Donath, A., Meixner, M., Podsiadlowski, L., Peters, R.S., Petersen, M., Meusemann, K., Liere, K., Wägele, J.-W., et al. (2016). BaitFisher: a software package for multispecies target DNA enrichment probe design. Mol. Bio. Evol. 33, 1875–1886.

S6. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421.

S7. Waterhouse, R.M., Tegenfeldt, F., Li, J., Zdobnov, E.M., and Kriventseva, E.V. (2013). OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 41, D358–D365.

S8. Nygaard, S., Zhang, G., Schiøtt, M., Li, C., Wurm, Y., Hu, H., Zhou, J., Ji, L., Qiu, F., Rasmussen, M., et al. (2011). The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming. Genome Res. 21, 1339–1348.

S9. The Honeybee Genome Sequencing Consortium (2006). Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443, 931–949.

Page 20: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

S10. Elsik, C.G., Worley, K.C., Bennett, A.K., Beye, M., Camara, F., Childers, C.P., de Graaf, D.C., Debyser, G., Deng, J., Devreese, et al. (2014). Finding the missing honey bee genes: lessons learned from a genome upgrade. BMC Genomics 15, 86.

S11. Bonasio, R., Zhang, G., Ye, C., Mutti N.S., Fang, X., Qin, N., Donahue, G., Yang, P., Li, Q., Li, C., et al. (2010). Genomic comparison of the ants Camponotus floridanus and Harpegnathos saltator. Science 329, 1068–1071.

S12. Werren, J.H., Richards, S., Desjardins, C.A., Niehuis, O., Gadau, J., and Colbourne, J.K. (2010). The Nasonia Genome Working Group, Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science 327, 343–348.

S13. Tribolium Genome Sequencing Consortium (2008). The genome of the model beetle and pest Tribolium castaneum. Nature 452, 949–955.

S14. Slater, G.S., and Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31.

S15. Katoh, K., and Standley, D.M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780.

S16. Suyama, M., Torrents, D., and Bork, P. (2006). PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–W612.

S17. Finn, R.D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. (2006). Pfam: clans, web tools and services. Nucleic Acids Res. 34, D247–D251.

S18. Finn, R.D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Heger, A., Hetherington, K., Holm, L., Mistry, J., et al. (2014). Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230.

S19. Eddy S.R. (2011). Accelerated profile HMM searches. PLOS Comput. Biol. 7, e1002195. S20. Meusemann, K., von Reumont, B.M., Simon, S., Roeding, F., Strauss, S., Kück, P., Ebersberger, I.,

Walzl, M., Pass, G., Breuers, S., et al. (2010). A phylogenomic approach to resolve the arthropod tree of life. Mol. Biol. Evol. 27, 2451–2464.

S21. Li, B., Lopes, J.S., Foster, P.G., Embley, T.M., and Cox, C.J. (2014). Compositional biases among synonymous substitutions cause conflict between gene and protein trees for plastid origins. Mol. Biol. Evol. 31, 1697–1709.

S22. Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B.M., Wägele, J.W., and Misof, B. (2010). Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front. Zool. 7, 10.

S23. Misof, B., Meyer, B., von Reumont, B.M., Kück, P., Misof, K., and Meusemann, K. (2013). Selecting informative subsets of sparse supermatrices increases the chance to find correct trees. BMC Bioinformatics 14, 348.

S24. Lanfear, R., Calcott, B., Kainer, D., Mayer, C., and Stamatakis, A. (2014). Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol. Biol. 14, 82.

S25. Hurvich, C.M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.

S26. Akashi, H., and Kliman, R.M. (1998). Mutation pressure, natural selection, and the evolution of base composition in Drosophila. In Mutation and Evolution, R. Woodruff, J. Thompson, A. Eyre-Walker, eds. (New York: Springer), pp. 49–60.

S27. Li, B., Lopes, J.S., Foster, P.G., Embley, T.M., and Cox, C.J. (2014). Compositional biases among synonymous substitutions cause conflict between gene and protein trees for plastid origins. Mol. Biol. Evol. 31, 1697–1709.

S28. Kozlov, A.M., Aberer, A.J., and Stamatakis, A. (2015). ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31, 2577–2579.

S29. Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690.

Page 21: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

S30. Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791

S31. Pattengale, N.D., Alipour, M., Bininda-Emonds, O.R., Moret, B.M., and Stamatakis, A. (2010). How many bootstrap replicates are necessary? J. Comput. Biol. 17, 337–354.

S32. Aberer, A.J., Kobert, K., and Stamatakis, A. (2014). ExaBayes: massively parallel Bayesian tree inference for the whole-genome era. Mol. Biol. Evol. 31, 2553–2556.

S33. Aberer, A.J., Krompass, D., and Stamatakis, A. (2013). Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst. Biol. 62, 162–166.

S34. Dell’Ampio, E., Meusemann, K., Szucsich, N.U., Peters, R.S., Meyer, B., Borner, J., Petersen, M., Aberer, A.J., Stamatakis, A., Walzl, M.G., et al. (2014). Decisive data sets in phylogenomics: Lessons from studies on the phylogenetic relationships of primarily wingless insects. Mol. Biol. Evol. 31, 239–249.

S35. Peters, R.S., Meusemann, K., Petersen, M., Mayer, C., Wilbrandt, J., Ziesmann, T., Donath, A., Kjer, K.M., Aspöck, U., Aspöck, H., et al. (2014). The evolutionary history of holometabolous insects inferred from transcriptome-based phylogeny and comprehensive morphological data. BMC Evol. Biol. 14, 52.

S36. Strimmer, K., Haeseler, A. von (1997). Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 94, 6815–6819.

S37. Aron, S., de Menten, L., Van Bockstaele, D.R., Blank, S.M., and Roisin, Y. (2005). When hymenopteran males reinvented diploidy. Curr. Biol. 15, 824–827.

S38. Schulmeister, S. (2002). Simultaneous analysis of basal Hymenoptera (Insecta): introducing robust-choice sensitivity analysis. Biol. J. Linnean Soc. 79, 245–275.

S39. Heath, T.A., Zwickl, D.J., Kim, J., and Hillis, D.M. (2008). Taxon sampling affects inferences of macroevolutionary processes from phylogenetic trees. Syst. Biol. 57, 160–166.

S40. Betancur-R R., Li, C., Munroe, T.A., Ballesteros, J.A., and Ortí, G. (2013). Addressing gene tree discordance and non-stationarity to resolve a multi-locus phylogeny of the flatfishes (Teleostei: Pleuronectiformes). Syst. Biol. 62, 763–785.

S41. Parham, J.F., Donoghue, P.C.J., Bell, C.J., Calway, T.D., Head, J.J., Holroyd, P.A., Inoue, J.G., Irmis, R.B., Joyce, W.G., Ksepka, D.T., et al. (2012). Best practices for justifying fossil calibrations. Syst. Biol. 61, 346–359.

S42. Rasnitsyn A.P. (1964). New Triassic Hymenoptera of the Middle Asia. Paleontol. J. 1, 88–89. S43. Engel, M.S., and Grimaldi, D.A. (2009). Diversity and phylogeny of the Mesozoic wasp family

Stigmaphronidae (Hymenoptera: Ceraphronoidea). Denisia 26, 53–68. S44. Engel, M.S. (2016). Notes on Cretaceous amber Braconidae (Hymenoptera), with description of two

new genera. Novit. Paleoentomol. 15, 1–7. S45. Liu, Z., Engel, M.S., and Grimaldi, D. (2007). Phylogeny and geological history of the cynipoid

wasps (Hymenoptera: Cynipoidea). Am. Mus. Novit. 3583, 1–48. S46. Shih C.K., Liu, C.X., and Ren, D. (2009). The earliest fossil record of pelecinid wasps (Insecta:

Hymenoptera: Proctotrupoidea: Pelecinidae) from Inner Mongolia, China. Ann. Entomol. Soc. Am. 102, 20–38.

S47. Deans, A.R., Basibuyuk, H.H., Azar, D., and Nel, A. (2004). Descriptions of two new Early Cretaceous (Hauterivian) ensign wasp genera (Hymenoptera: Evaniidae) from Lebanese amber. Cretaceous Res. 25, 509–516.

S48. Azevedo, C.O., and Azar, D. (2012). A new fossil subfamily of Bethylidae (Hymenoptera) from the Early Cretaceous Lebanese amber and its phylogenetic position. Zoologia 29, 210–218.

S49. Darling, D.C., and Sharkey, M.J. (1990). Order Hymenoptera. In Insects from the Santana Formation, Lower Cretaceous, of Brazil, D. Grimaldi, ed. (New York: Bull. Am. Mus. Nat. Hist. 195), pp. 123–153.

S50. Rodriguez, J., Waichert, C., von Dohlen, C., Poinar Jr, G., and Pitts, J. (2016). Eocene and not Cretaceous origin of spider wasps (Pompilidae): fossil evidence from amber. Acta Palaeontol. Pol. 61, 89–96.

Page 22: Evolutionary History of the Hymenoptera · Report Evolutionary History of the Hymenoptera Highlights d The most comprehensive dataset ever compiled for inferring the phylogeny of

S51. Antropov, A. (2000). Digger wasps (Hymenoptera, Sphecidae) in Burmese amber. Bull. Nat. Hist. Mus., Ser. Geol. 56, 59–78.

S52. Michez, D., Nel, A., Menier, J.-J., and Rasmont, P. (2007). The oldest fossil of a melittid bee (Hymenoptera: Apiformes) from the early Eocene of Oise (France). Zool. J. Linn. Soc. 250, 701–709.

S53. Michez, D., De Meulemeester, T., Rasmont, P., Nel, A., and Patiny, S. (2009). New fossil evidence of the early diversification of bees: Paleohabropoda oudardi from the French Paleocene (Hymenoptera, Apidae, Anthophorini). Zool. Scripta 38, 171–181.

S54. Wappler, T., De Meulemeester, T., Aytekin, A.M., Michez, D., and Engel, M.S. (2012). Geometric morphometric analysis of a new Miocene bumble bee from the Randeck Maar of southwestern Germany (Hymenoptera: Apidae). Syst. Entomol. 37, 784–792.

S55. Dlussky, G.M., and Rasnitsyn, A.P. (2009). Ants (Insecta: Vespida: Formicidae) in the Upper Eocene Amber of Central and Eastern Europe. Paleontol. J. 43, 1024–1042.

S56. Yang, Z. (2007). PAML 4: a program package for phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591.

S57. Dos Reis, M., and Yang Z. (2011). Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Mol. Biol. Evol. 28, 2161–2172.

S58. Inoue, J.P.C., Donoghue, H., and Yang, Z. (2010). The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times. Syst. Biol. 59, 74–89.

S59. Linder, M., Britton, T., and Sennblad, B. (2011). Evaluation of Bayesian models of substitution rate evolution – parental guidance versus mutual independence. Syst. Biol. 60, 329-342.

S60. Lepage, T., Bryant, D., Phillipe, H., and Lartillot, N. (2007). A general comparison of relaxed molecular clock models. Mol. Biol. Evol. 24, 2669-2680.

S61. Ho, S.Y.W., and Duchene, S. (2014). Molecular-clock models for estimating evolutionary rates and timescales. Mol. Ecol. 23, 5947-5965.

S62. dos Reis, M., Donoghue, P.C.J., and Yang, Z. (2016). Bayesian molecular clock dating of species divergences in the genomics era. Nature. Rev. Genet. 17, 71–80.