did viruses evolve as a distinct supergroup from common ...€¦ · 18.04.2016 · be polarized a...

1

Did viruses evolve as a distinct supergroup from common ancestors of cells?

Ajith Harish1*, Aare Abroi2, Julian Gough3 and Charles Kurland4

1Structural and Molecular Biology Group, Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, 751 24 Uppsala, Sweden.

2Estonian Biocentre, Riia 23, Tartu, 51010, Estonia. 3Computational Genomics Group, Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK

4Microbial Ecology, Department of Biology, Lund University, Ecology Building, SE-223 62 Lund, Sweden.

*Corresponding author: [email protected] (preferred), [email protected] (institutional)

Abstract

The evolutionary origins of viruses according to marker gene phylogenies, as well as their

relationships to the ancestors of host cells remains unclear. In a recent article Nasir

and Caetano-Anollés reported that their genome-scale phylogenetic analyses identify an

ancient origin of the “viral supergroup” (Nasir et al (2015) A phylogenomic data-driven

exploration of viral origins and evolution. Science Advances, 1(8):e1500527). It suggests that

viruses and host cells evolved independently from a universal common ancestor.

Examination of their data and phylogenetic methods indicates that systematic errors likely

affected the results. Reanalysis of the data with additional tests shows that small-genome

attraction artifacts distort their phylogenomic analyses. These new results indicate that their

suggestion of a distinct ancestry of the viral supergroup is not well supported by the

evidence.

Introduction

The debate on the ancestry of viruses is still undecided: In particular, it is still unclear

whether viruses evolved before their host cells or if they evolved more recently from the host

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted April 18, 2016. ; https://doi.org/10.1101/049171doi: bioRxiv preprint

https://doi.org/10.1101/049171

2

cells. The virus-early hypothesis posits that viruses predate or coevolved with their cellular

hosts [1]. Two alternatives describe the virus-late scenario: (i) the progressive evolution also

known as the escape hypothesis and (ii) regressive evolution or reduction hypothesis. Both

propose that viruses evolved from their host cells [1]. According to the first of these two

virus-late models, viruses evolved from their host cells through gradual acquisition of genetic

structures. The other alternative suggests that viruses, like host-dependent endoparasitic

bacteria, evolved from free-living ancestors by reductive evolution. The recent discovery of

the so-called giant viruses with double-stranded DNA genomes that parallel endoparasitic

bacteria with regards to genome size, gene content and particle size revived the reductive

evolution hypothesis. However, there are so far no identifiable ‘universal’ viral genes that are

common to viruses such as the ubiquitous cellular genes. In other words, examples of

common viral components that are analogous to the ribosomal RNA and ribosomal protein

genes, which are common to cellular genomes, are not found. This is one compelling reason

that phylogenetic tests of the “common viral ancestor” hypotheses seem so far inconclusive.

Recently, Nasir and Caetano-Anollés [2] employed phylogenetic analysis of whole-

genomes and gene contents of thousands of viruses and cellular organisms to test the

alternative hypotheses. The authors conclude that viruses are an ancient lineage that diverged

independently and in parallel with their cellular hosts from a universal common ancestor

(UCA). They reiterate their earlier claim [3] that viruses are a unique lineage, which predated

or coevolved with the last UCA of cellular lineages (LUCA) through reductive evolution

rather than through more recent multiple origins. Their claims are based on analyses of

statistical- and phyletic distribution patterns of protein domains, classified as superfamilies

(SFs) in Structural Classification of Proteins (SCOP) [4].

Detailed re-examination of Nasir and Caetano-Anollés' phylogenomic approach [2, 3]

suggests that small genomes systematically distort their phylogenetic reconstructions of the


https://doi.org/10.1101/049171

3

tree of life (ToL), especially the rooting of trees. Here the ToL is described as the

evolutionary history of contemporary genomes: as a tree of genomes or a tree of proteomes

(ToP) [5-8]. The bias due to highly reduced genomes of parasites and endosymbionts in

genome-scale phylogenies has been known for over a decade [5, 6]. In fact, prior to the recent

proposal [2] these authors recognized the anomalous effects of including small genomes in

reconstructing the ToL in analyses that were limited to cellular organisms [9] or which

included giant viruses [3]: As they say “In order to improve ToP reconstructions, we

manually studied the lifestyles of cellular organisms in the total dataset and excluded

organisms exhibiting parasitic (P) and obligate parasitic (OP) lifestyles, as their inclusion is

known to affect the topology of the phylogenetic tree” [3]. But, they may not have adequately

addressed this problem, particularly when the samplings include viral genomes that are likely

to further exacerbate bias due to small genomes [2, 3]. For this reason we systematically

tested the reliability of the phylogenetic trees, especially the rooting approach favored by

Nasir and Caetano-Anollés [2]. This approach depends critically on a pseudo-outgroup to

root the ToP (ToL), but that pseudo-outgroup is not identified empirically. Rather, it is

assumed a priori to be an empty set [2].

We show here in several independent phylogenetic reconstructions that a rooting

based on a hypothetical “all-zero” outgroup—an ancestor that is assumed to be an empty set

of protein domains—creates specific phylogenetic artifacts: In this particular approach [2, 3]

implementing the all-zero outgroup artifactually draws the taxa with the smallest genomes

(proteomes) into a false rooting very much like the classical distortions due to long-branch

attractions (LBA) in gene trees [10].

Results and Discussion

We emphasize at the outset of this study that virtually all the evolutionary


https://doi.org/10.1101/049171

4

interpretations based on phylogenetic reconstructions depend on a reliable identification of

the root of a tree [11, 12]. In particular, the rooting of a tree will determine the branching

order of species, and define the ancestor-descendant relationships between taxa as well as the

derived features of characters (character states). In effect, the root polarizes the order of

evolutionary changes along the tree with respect to time. By the same token, the rooting of a

tree distinguishes ancestral states from derived states among the different observed states of a

character. If ancestral states are identified directly, for example from fossils, characters can

be polarized a priori with regard to determining the tree topology. A priori polarization

provides for intrinsic rooting. However, direct identification of ancestral states, particularly

for extant genetic data is rarely possible. Consequently conventional phylogenetic methods

use time-reversible (undirected) models of character evolution and they only compute

unrooted trees. For example, the root of the iconic ribosomal RNA ToL or any other gene-

tree has not been determined directly [13, 14].

Accordingly, conventional approaches to character polarization are indirect and the

rooting of trees is normally a two-stage process. The most common rooting method is the

outgroup comparison method, which is based on the premise that character-states common to

the ingroup (study group) and a closely related sister-group (the outgroup) are likely to be

ancestral to both. Therefore, in an unrooted tree the root is expected to be positioned on the

branch that connects the outgroup to the ingroup. In this way, the tree (and characters) may

be polarized a posteriori [12, 14]. However, there are no known outgroups for the ToL. In the

absence of natural outgroups, pseudo-outgroups are used to root the ToL [14]. The best-

known case is the root grafted onto the unrooted ribosomal RNA ToL based on presumed

ancient (pre-LUCA) gene duplications [15, 16]. Here the paralogous proteins act as

reciprocal outgroups that root each other. Unlike gene duplications used to root gene trees,

the challenge of identifying suitable outgroups becomes more acute for genome trees.


https://doi.org/10.1101/049171

5

To meet this challenge Nasir and Caetano-Anollés use a hypothetical pseudo-

outgroup: an artificial taxon constructed from presumed ancestral states. For each character

(SF), the state '0' or 'absence' of a SF is assumed to be “ancestral” a priori. This artificial “all-

zero” taxon is used as outgroup to root the ToL. Further, they use the Lundberg rooting

method, in which outgroups are not included in the initial tree reconstruction. The Lundberg

method involves estimating an unrooted tree for the ingroup taxa only, and then attaching

outgroup(s) (when available) or a hypothesized ancestor to the tree a posteriori to determine

the position of the root [17]. Unrooted trees describe relatedness of taxa based on graded

compositional similarities of characters (and states). Accordingly, we can expect the “all-

zero” pseudo-outgroup to cluster with genomes (proteomes) in which the smallest number of

SFs is present. The latter are the proteomes described by the largest number of ‘0s’ in the

data matrix.

The instability of rooting with an all-zero pseudo-outgroup becomes clear when the

smallest proteome in a given taxon sampling varies in the rooting experiments (Figs. 1 and

2). Rooting experiments were preformed both for SF occurrence (presence/absence) patterns

and for SF abundance (copy number) patterns. However, we present results for the SF

abundance patterns, as in [2]. Throughout, we refer to genomic protein repertoires as

proteomes. Proteome size related as the number of distinct SFs in a proteome (SF occurrence)

is depicted next to each taxon for easy comparison in Figs. 1 and 2. Phylogenetic analyses

were carried out as described in [2, 3] (see methods). SFs that are shared between proteomes

of viruses and cellular organisms were used as the characters (Fig. 1a) as in [2]. Initially no

viruses are included in tree reconstructions and here the root was placed within the Archaea,

which has the smallest proteome (503 SFs) among the supergroups (Fig 1b). When a still

smaller bacterial proteome (420 SFs) was included, the position of the root as well as the

branching order changed. In this case, the bacteria were split into two groups and the root


https://doi.org/10.1101/049171

6

was placed within one of the bacterial groups (Fig. 1c). Further, when a much smaller

archaeal proteome was included (228 SFs), the root was relocated to a branch leading to the

now smallest proteome (Fig. 1d). Note that the newly included taxa, both bacteria and

archaea are host-dependent symbionts with reduced genomes.

Similarly including viruses in the analyses draws the root towards the smaller viral

proteomes (Fig. 2). As in the rooting experiments in Fig. 1, a group of DNA viruses (107-175

SFs) was introduced. These DNA viruses have larger proteomes than do the RNA viruses,

but they are much smaller than most known endosymbiotic bacteria (Fig. 2d). Again, the root

was repositioned within the DNA viruses group (Fig. 2a). Following this experiment, two

extremely reduced (107 SFs each) endosymbiotic bacteria classified as Betaproteobacteria

were included. These further displaced the root closer to the smallest set of proteomes (Fig.

2b). Finally, a set of four RNA viruses (4-17 SFs) in their genomes was introduced and they

rooted the tree within the RNA viruses (Fig. 2c). These results challenge the conclusion

drawn previously that the proteomes of RNA viruses are more ancient than proteomes of

DNA viruses [2]. In addition, the results contradict the purported antiquity of viral proteomes

as such. Rather, the data suggest that there are severe artifacts generated by genome size-bias

due to the inclusion of the viral proteomes in the analysis. These artifacts are expressed as

grossly erroneous rootings caused by small-genome attraction (SGA) in the Lundberg rooting

using the hypothetical all-zero outgroup.

Including the all-zero pseudo-outgroup in the analysis either implicitly (defined by the

ANCSTATES option in PAUP*) or explicitly (as a taxon in the data matrix) does not make a

difference to the tree topology and rooting. We note that including the hypothetical ancestor

during tree estimation amounts to a priori character polarization and pre-specification of the

root. In addition, the position of the root was the same in the different rooting experiments

when the all-zero pseudo-ancestor was explicitly specified as the outgroup for root trees


https://doi.org/10.1101/049171

7

using the outgroup method (see supplementary Figs S1 & S2). These rooting experiments

reveal a strong bias in rooting that favors small genomes irrespective of the different rooting

methods—outgroup rooting, Lundberg rooting and intrinsic rooting—used to root the ToL

with the “all-zero” hypothetical ancestor. This SGA artifact is comparable to the better-

known LBA artifact that is associated with compositional bias of nucleotides or of amino

acids that distort gene trees [10].

The use of artificial outgroups is not uncommon in rooting experiments when rooting

is ambiguous [11]. Artificial taxa are either an all-zero outgroup or an outgroup constructed

by randomizing characters and/or character-states of real taxa. Although rooting experiments

with multiple real outgroups, or randomized artificial outgroups that simulate loss of

phylogenetic signal can minimize the ambiguity in rooting the all-zero outgroup has proved

to be of little use [11, 14]. Conclusions based on an all-zero outgroup are often refuted when

empirically grounded analysis with real taxa are carried out [14]. Indeed, the present rooting

experiments (Figs. 1 and 2) clearly show that the position of the root depends on the smallest

genome in the sample when a hypothetical all-zero ancestor/outgroup is used. In effect, the

rooting approach favored by Nasir and Caetano-Anolles [2] is not reliable.

Nevertheless, small proteome size is not an irreconcilable feature of genome-tree

reconstructions [5, 6, 18]. Small genome attraction artifacts may be observed when highly

reduced proteomes of obligate endosymbionts are included in analyses with common

samplings [3, 5, 6, 18]. However, their untoward effects are only observed in the shallow end

of the tree where the endosymbionts might be clustered with unrelated groups. In such cases

the deep divergences or clustering of major branches are unperturbed [6, 18]. Such size-

biases are readily corrected by normalizations that account for genome size (actually specific

SF content in this case) [5, 6, 18].

Though size-bias correction for SF content phylogeny is known to be reliable, both


https://doi.org/10.1101/049171

8

for measures based on distance [6] and those based on frequencies of character distribution

[18], Nasir and Caetano-Anollés do not apply such corrections. They only attend to the

proteome size variations associated with SF abundances [2, 3]. Instead of accounting for

novel taxon-specific SFs in their model of evolution, the authors choose to exclude

potentially problematic small proteomes of parasitic bacteria and to include only the

proteomes of ‘free-living’ cellular organisms in their analyses [2, 3]. But all viruses are

parasites and obviously even more extreme examples of minimal proteomes.

This problem is further exacerbated by the uneven and largely incomplete annotation

of SF domains in viral proteins [19, 20]. In fact, many viral ‘proteomes’ that were sampled in

[2] are as small as a single SF. It is not clear why the inclusion of small viral proteomes was

not recognized as even more problematic than the inclusion of small parasitic bacterial

proteomes, in spite of the previous assertion of these authors that small proteomes should be

excluded [9]. Nevertheless, including small viral proteomes is inconsistent with specifically

excluding small cellular proteomes in the ToL, especially when hypotheses of reductive

evolution are considered. Screening taxa based on ‘lifestyle’ (free-living or parasitic) seems

unwarranted since extreme reductive genome evolution, sometimes called genome

streamlining, is not limited to host adapted parasitic bacteria but is common in free-living

bacteria as well as eukaryotes [21-23].

In addition to the ToP, the authors use a so-called tree of domains (ToD) to support

their conclusion that proteomes of viruses are ancient and that proteomes of RNA viruses are

particuarly ancient. The ToD is projected as the evolutionary trajectory of individual SFs.

Such projections are used as proxies to determine the relative antiquity or novelty of SFs [2].

The ToD like the ToP is also rooted with a presumed pseudo-outgroup and that rooting may

be an artifact as for the ToP. Much more serious than potential artifacts in ToD is an

egregiously bad assumption from the perspective of the SCOP hierarchal classification: the


https://doi.org/10.1101/049171

9

very notion that the ToD describes evolutionary relationships between SFs in the same way

the ToL describes genealogy of species. Evolutionary relationships between SFs with

different folds in SCOP classification are not established [4, 24]. Physicochemical protein

folding experiments and corresponding statistical analyses of sequence evolution patterns,

including simulations of protein folding are all consistent with the observations that the

sequence-structure space of SFs is discontinuous [25, 26]. Empirical data indicate that the

evolutionary transition from one SF to another through gradual changes implied in the ToD is

unlikely, if at all feasible. This makes the ToD hypothesis, which assumes that all SFs are

related to one another by common ancestry untenable. Thus the ToD contradicts the very

basis upon which SFs are classified in the SCOP hierarchy [4, 24]. The ToD is therefore

uninterpretable as an evolutionary history of individual SFs. Accordingly, the ToD cannot

reflect the ‘relative ages’ of SFs nor can it support the inferred antiquity of viruses in the

ToP.

Unlike phylogenetic trees that describe the evolution of individual proteomes, Venn

diagrams, SF sharing patterns and summary statistics of SF frequencies among groups of

proteomes only depict generalized trends. Multiple evolutionary scenarios can be invoked a

priori to explain the general trends without any phylogenetic analyses. Although such

patterns may be suggestive, they do not by themselves support reliable phylogenetic

inferences [19, 20]. Thus the authors’ inferences in [2] based on statistical distributions of

SFs alone are speculative, at best.

In summary, the authors’ [2, 3] proposed rooting for the ToL seems to be influenced

by clearly identifiable artifacts. The conceptual issue of proteome evolution may be traced to

a very widely held view: namely that “ToP were rooted by the minimum character state,

assuming that modern proteomes evolved from a relatively simpler urancestral organism that

harbored only few FSFs” [2, 9]. Thus, a common prejudice is that the origins of modern


https://doi.org/10.1101/049171

10

proteomes ought to reflect a monotonic evolutionary progression that is embodied in

something like Aristotle’s Great Chain of Being [10]. The “all-zero” or “all-absent”

hypothetical ancestor is neither empirically grounded nor biologically meaningful, but it does

indeed reflect a common prejudice [10, 14]. Indeed, we bring into question the inferred

relative antiquity of viruses and Archaea in the ToL and the notion that viruses make up an

independent fourth supergroup [2]. In effect, we suggest that the phylogenetic approach of

Nasir and Caetano-Anolles’ [2, 3] provides neither a test nor a confirmation of any one of the

hypotheses for the origins of viruses [1]. Despite its importance, reconciling the extensive

genetic and morphological diversity of viruses as well as their evolutionary origins remains

murky [1, 27]. Better methods and empirical models are required to test whether a

multiplicity of scenarios or a single over-arching hypothesis is applicable to understand the

origins of viruses.

Methods

Here, genomic protein repertoires are referred to as proteomes. We re-analyzed a

subset of the 368 proteomes sampled in [2] for phylogenetic rooting of diverse cells and their

viruses. Here, we sampled 102 cellular proteomes containing 34 each from Archaea, Bacteria

and Eukaryotes, respectively as well as 16 viral proteomes from [2]. For the latter, we note

that the DNA virus proteomes were substantially larger than those of RNA viruses in terms of

the number of identified SFs. In addition we included for comparison some of the smallest

known proteomes of Archaea and Bacteria not included in [2]. Roughly, half of the sampled

proteomes were analyzed (Figs 1 and 2) for computational simplicity. Results did not vary

when all the sampled taxa were included (see supplementary Figs S3 & S4). Rooting

experiments were preformed both with SF occurrence (presence/absence) patterns and SF

abundance (copy number) patterns; however, we present results for the SF abundance


https://doi.org/10.1101/049171

11

patterns as in [2]. In addition to the Lundberg rooting procedures carried out as in [2, 3],

rooting experiments were repeated by including the all-zero taxon in the tree reconstruction

process implicitly (using the ANCSTATES option in PAUP*) and explicitly as taxon in the

data matrix. Further, when the all-zero taxon was explicitly included, rooting experiments

were also repeated with the outgroup rooting method. Phylogenetic reconstructions were

carried out using maximum parsimony criteria implemented in PAUP* ver. 4.0b10 [28] with

heuristic tree searches using 1,000 replicates of random taxon addition and tree bisection

reconnection (TBR) branch swapping. Trees were rooted by Lundberg method, outgroup

method or intrinsically rooted by including the hypothetical all-zero ancestor in tree searches.

Acknowledgements: A. H. acknowledges support from The Swedish Research Council (to

Måns Ehrenberg) and the Knut and Alice Wallenberg Foundation, RiboCORE (to Måns

Ehrenberg), A.A. acknowledges support from European Regional Development Fund through

the Centre of Excellence in Chemical Biology grant number 3.2.0101.08-0017 and C.G.K.

acknowledges support from the Nobel Committee for Chemistry of the Royal Swedish

Academy of Sciences.

References:

1. Wessner, D., The origins of viruses. Nature Education, 2010. 3(9): p. 37. 2. Nasir, A. and G. Caetano-Anollés, A phylogenomic data-driven exploration of viral

origins and evolution. Science Advances, 2015. 1(8). 3. Nasir, A., K. Kim, and G. Caetano-Anolles, Giant viruses coexisted with the cellular

ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya. BMC Evolutionary Biology, 2012. 12(1): p. 156.

4. Murzin, A.G., et al., SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 1995. 247(4): p. 536-540.

5. Snel, B., P. Bork, and M.A. Huynen, Genome phylogeny based on gene content. Nature Genetics, 1999. 21(1): p. 108-110.

6. Yang, S., R.F. Doolittle, and P.E. Bourne, Phylogeny determined by protein domain content. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(2): p. 373-378.


https://doi.org/10.1101/049171

12

7. Fang, H., et al., A daily-updated tree of (sequenced) life as a reference for genome research. Scientific Reports, 2013. 3.

8. Kurland, C.G. and A. Harish, Structural biology and genome evolution: An introduction. Biochimie, 2015. 119: p. 205-208.

9. Kim, K.M. and G. Caetano-Anollés, The proteomic complexity and rise of the primordial ancestor of diversified life. BMC Evolutionary Biology, 2011. 11(1).

10. Gouy, R., D. Baurain, and H. Philippe, Rooting the tree of life: the phylogenetic jury is still out. Phil. Trans. R. Soc. B, 2015. 370(1678): p. 20140329.

11. Graham, S.W., R.G. Olmstead, and S.C.H. Barrett, Rooting Phylogenetic Trees with Distant Outgroups: A Case Study from the Commelinoid Monocots. Molecular Biology and Evolution, 2002. 19(10): p. 1769-1781.

12. Morrison, D.A., Phylogenetic Analyses of Parasites in the New Millennium, in Advances in Parasitology. 2006. p. 1-124.

13. Pace, N.R., A molecular view of microbial diversity and the biosphere. Science, 1997. 276(5313): p. 734-740.

14. Wheeler, W.C., Systematics: A course of lectures. 2012: John Wiley & Sons. 15. Schwartz, R. and M. Dayhoff, Origins of prokaryotes, eukaryotes, mitochondria, and

chloroplasts. Science, 1978. 199(4327): p. 395-403. 16. Woese, C.R., O. Kandler, and M.L. Wheelis, Towards a natural system of organisms:

Proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National Academy of Sciences of the United States of America, 1990. 87(12): p. 4576-4579.

17. Swofford, D.L. and P.B. Begle, PAUP: Phylogenetic Analysis Using Parsimony: User's Manual (Version 3.1). 1993, Champaign, IL, USA.: Illinois Natural History Survey.

18. Harish, A., A. Tunlid, and C.G. Kurland, Rooted phylogeny of the three superkingdoms. Biochimie, 2013. 95(8): p. 1593-1604.

19. Abroi, A. and J. Gough, Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. BioEssays, 2011. 33(8): p. 626-635.

20. Abroi, A., A protein domain-based view of the virosphere-host relationship. Biochimie, 2015. 119: p. 231-243.

21. Andersson, S.G.E. and C.G. Kurland, Reductive evolution of resident genomes. Trends in Microbiology, 1998. 6(7): p. 263-268.

22. Giovannoni, S.J., J. Cameron Thrash, and B. Temperton, Implications of streamlining theory for microbial ecology. ISME J, 2014. 8(8): p. 1553-1565.

23. Dujon, B., et al., Genome evolution in yeasts. Nature, 2004. 430(6995): p. 35-44. 24. Gough, J., et al., Assignment of homology to genome sequences using a library of

hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 2001. 313(4): p. 903-919.

25. Oliveberg, M. and P.G. Wolynes, The experimental survey of protein-folding energy landscapes. Quarterly reviews of biophysics, 2005. 38(03): p. 245-288.

26. Wolynes, P.G., Evolution, energy landscapes and the paradoxes of protein folding. Biochimie, 2015. 119: p. 218-230.

27. Forterre, P., M. Krupovic, and D. Prangishvili, Cellular domains and viral lineages. Trends in Microbiology, 2014. 22(10): p. 554-558.

28. Swofford, D.L., PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2003, Sunderland, Massachusetts: Sinauer Associates.


https://doi.org/10.1101/049171

13

Figure 1

Figure 1. Implementing an “all-zero” pseudo-outgroup [2] severely distorts rooting of the ToL. Rooted

trees were reconstructed from a subset of 368 taxa (proteomes) sampled in [2], which included 17 taxa each

from Archaea, Bacteria, Eukarya (ABE) and 9 taxa from the virus groups (V). (a) Venn diagram shows the 455

SFs shared between viruses and cells (ABEV), which were used to reconstruct trees. (b) Single most

parsimonious tree of ABE taxa rooted within Archaea. (c, d) New taxa, which represent the smallest proteome

after inclusion, were progressively included in size order. The position of the root node changed accordingly to

the branch corresponding to a group (or taxon) with the smallest proteome, which is Bacteria (c), Archaea (d);

the Eukarya section is collapsed since tree topology is unaffected. Taxa are described by their NCBI taxonomy

ID, taxonomic affiliation (A, B, E or V) and proteome size in terms of the number of distinct SFs present in the

genome. To compare the position of the root node trees are drawn to show branching patterns only, branch

lengths are not proportional to the quantity of evolutionary change.


https://doi.org/10.1101/049171

14

Figure 2

Figure 2. Rooting experiments (continued from Fig. 1) show rooting bias of “all-zero” pseudo-outgroup

towards small proteomes. (a-c) New taxa, which represent the smallest proteome after inclusion, were

progressively included in size order, where the smallest proteome was from mega DNA viruses (a), Bacteria (b)

and RNA viruses (c). Details in trees are same as in Fig. 1. (d) Comparison of proteome sizes of sampled taxa in

terms of SF occurrence used to estimate trees in this study and in [2]. Numbers in parentheses above each group

indicate the number of proteomes.


https://doi.org/10.1101/049171

224324_B_577

367737_B_607

383372_B_722

262724_B_634

243090_B_707

362976_A_570

399549_A_513

272569_A_639

192952_A_618

272568_B_692

70601_A_503

349307_A_512

391774_B_666

290317_B_639

240015_B_694

69014_A_536

411154_B_664

224325_A_569

321956_B_506

402880_A_509

410358_A_525

187420_A_523

272562_B_706

368407_A_598

348780_A_588

326298_B_602

64091_A_549

262543_B_690

273063_A_524

351607_B_638

259564_A_557

323259_A_591

273068_B_646

391009_B_584

70601_A_503

64091_A_549272569_A_639

273063_A_524

348780_A_588

326298_B_602

362976_A_570

321956_B_506

262543_B_690

402880_A_509

224324_B_577

1105097_B_420

383372_B_722

259564_A_557

187420_A_523

192952_A_618

69014_A_536

272562_B_706

410358_A_525

411154_B_664272568_B_692

391009_B_584

240015_B_694

262724_B_634

243090_B_707

351607_B_638

391774_B_666290317_B_639

224325_A_569

368407_A_598323259_A_591

349307_A_512

273068_B_646

399549_A_513

367737_B_607

391009_B_584

323259_A_591

391774_B_666290317_B_639

240015_B_694

272562_B_706

351607_B_638

399549_A_513

272569_A_639

272568_B_692

224325_A_569

410358_A_525

321956_B_506

224324_B_577

273063_A_524

1105097_B_420

368407_A_598

69014_A_536

383372_B_722

70601_A_503

192952_A_618

367737_B_607

273068_B_646

262724_B_634

362976_A_570

262543_B_690

64091_A_549

243090_B_707

402880_A_509

326298_B_602

187420_A_523

348780_A_588

349307_A_512

259564_A_557

228908_A_228

411154_B_664

326298_B_602

64091_A_549

192952_A_618

272569_A_639

391774_B_666

348780_A_588

402880_A_509

411154_B_664

262724_B_634

367737_B_607

273063_A_524

351607_B_638

70601_A_503

391009_B_584

410358_A_525

187420_A_523

321956_B_506

362976_A_570

349307_A_512

273068_B_646

All_Zero_Anc

240015_B_694

290317_B_639

323259_A_591

224325_A_569

243090_B_707

69014_A_536

259564_A_557

272562_B_706

224324_B_577

399549_A_513

383372_B_722

262543_B_690

368407_A_598

272568_B_692

262543_B_690

224324_B_577

362976_A_570

273063_A_524

64091_A_549

290317_B_639

All_Zero_Anc

402880_A_509

391009_B_584

351607_B_638

321956_B_506

383372_B_722

272562_B_706

272569_A_639

349307_A_512

262724_B_634

243090_B_707

187420_A_523

1105097_B_420

391774_B_666

224325_A_569

240015_B_694

367737_B_607

272568_B_692

399549_A_513

368407_A_598323259_A_591

192952_A_618

69014_A_53670601_A_503

259564_A_557

273068_B_646

410358_A_525

348780_A_588

326298_B_602

411154_B_664

399549_A_513

64091_A_549

273063_A_524

272568_B_692

349307_A_512

262724_B_634

224324_B_577

228908_A_228

290317_B_639

362976_A_570

402880_A_509

323259_A_591

224325_A_569

367737_B_607

383372_B_722

273068_B_646272562_B_706

69014_A_536

351607_B_638

243090_B_707

321956_B_506

272569_A_639

All_Zero_Anc

410358_A_525

1105097_B_420

411154_B_664

70601_A_503

348780_A_588

391774_B_666

187420_A_523

368407_A_598

391009_B_584

326298_B_602

192952_A_618

240015_B_694

259564_A_557

262543_B_690

Intrinsincally rooted trees‘0’ prespeci�ed as the ancestral stateand hypothetical “all-zero” ancestor included in tree search.

EUKARYABACTERIAARCHAEA Smallest proteome

Root node

Outgroup rooted trees“all-zero” taxon explicitly includedin the data matrix and speci�ed asthe outgroup.

(a) (b) (c)

(d) (e) (f )

Figure S1. Rooting experiments (continued from Fig 1) where trees were either intrinsically rooted by including the hypothetical ancestor in tree search (a-c) or byincluding and explicit all-zero ancestor taxon and specifying it to be the outgroup.


https://doi.org/10.1101/049171

1094892_V_175

290317_B_639

262724_B_634

224325_A_569

323259_A_591

399549_A_51369014_A_536

1349409_V_108

273068_B_646

70601_A_503

391774_B_666

224324_B_577

262543_B_690

212035_V_176

351607_B_638

368407_A_598

272568_B_692

228908_A_228

349307_A_512

383372_B_722

367737_B_607

411154_B_664

1105097_B_420321956_B_506

243090_B_707

240015_B_694

259564_A_557

64091_A_549

391009_B_584

1235314_V_177

187420_A_523

192952_A_618

272569_A_639

410358_A_525

1349410_V_107

273063_A_524

326298_B_602

362976_A_570

272562_B_706

402880_A_509

348780_A_588

224325_A_569

391009_B_584

70601_A_503

290317_B_639

243090_B_707

273068_B_646

349307_A_512

240015_B_694

1349409_V_108

272568_B_692

64091_A_549

326298_B_602

402880_A_509

192952_A_618

410358_A_525

11520_V_7

368407_A_598

224324_B_577

262724_B_634

228908_A_228

362976_A_570

323259_A_591

399549_A_513

187420_A_523

367737_B_607

272569_A_639

69014_A_536

351607_B_638

1343077_B_107

11128_V_17

1105097_B_420

212035_V_176

321956_B_506

1235314_V_177

391774_B_666

348780_A_588

272562_B_706262543_B_690

411154_B_664

656190_V_4

273063_A_524

1349410_V_107

259564_A_557

1094892_V_175

383372_B_722

81931_V_5

1053648_B_107

367737_B_607

192952_A_618

1235314_V_177

326298_B_602

262543_B_690

348780_A_588

1053648_B_107

321956_B_506

349307_A_512

212035_V_176

243090_B_707

1343077_B_107

224324_B_577

351607_B_638

187420_A_523

70601_A_503262724_B_634

272569_A_639

1094892_V_175

1349409_V_108

224325_A_569

259564_A_557

402880_A_509

240015_B_694

391774_B_666

273068_B_646

391009_B_584

272562_B_706

399549_A_513

362976_A_570

410358_A_525

383372_B_722

323259_A_591

411154_B_664

290317_B_639

1105097_B_420

64091_A_549

368407_A_598

228908_A_228

69014_A_536

1349410_V_107

272568_B_692

273063_A_524

362976_A_570

259564_A_557

272568_B_692

69014_A_536

1094892_V_1751235314_V_177

272562_B_706

391774_B_666

411154_B_664

383372_B_722

367737_B_607

1105097_B_420

351607_B_638

368407_A_598

240015_B_694

410358_A_525

192952_A_618

348780_A_588

1349410_V_107

262724_B_634

349307_A_512

64091_A_549

228908_A_228

272569_A_639

70601_A_503

1349409_V_108

273068_B_646

391009_B_584

321956_B_506

399549_A_513

All_Zero_Anc

262543_B_690

224325_A_569

224324_B_577

273063_A_524

212035_V_176

290317_B_639

402880_A_509187420_A_523

243090_B_707

323259_A_591

326298_B_602

411154_B_664

273063_A_524

1094892_V_175

187420_A_523

391774_B_666

1053648_B_107

351607_B_638

410358_A_525

290317_B_639

All_Zero_Anc

402880_A_509

272569_A_639

323259_A_591

272568_B_692

259564_A_557

64091_A_549

228908_A_228

362976_A_570

262543_B_690

399549_A_513

224325_A_569

348780_A_588

367737_B_607

321956_B_506

383372_B_722

262724_B_634

1343077_B_107

326298_B_602

368407_A_598

212035_V_176

243090_B_707

272562_B_706

1105097_B_420

224324_B_577

240015_B_694

1235314_V_177

1349409_V_108

391009_B_584

273068_B_646

192952_A_618

349307_A_512

69014_A_53670601_A_503

1349410_V_107

290317_B_639

272568_B_692

367737_B_607

399549_A_513

349307_A_512

1349410_V_107

1105097_B_420

212035_V_176

243090_B_707

262724_B_634

368407_A_598323259_A_591

240015_B_694

410358_A_525

1094892_V_175

228908_A_228

64091_A_549

1349409_V_108

1343077_B_107

273068_B_646

383372_B_722

224324_B_577

656190_V_4

391774_B_666

272569_A_639

11128_V_17

192952_A_618

81931_V_5

411154_B_664

262543_B_690

362976_A_570

11520_V_7

272562_B_706

259564_A_557

391009_B_584

273063_A_524

321956_B_506

402880_A_509

70601_A_503

1053648_B_107

326298_B_602

187420_A_523

1235314_V_177

224325_A_569

348780_A_588

All_Zero_Anc

351607_B_638

69014_A_536

EUKARYABACTERIAARCHAEADNA VirusesRNA viruses

Smallest proteomeRoot node

Outgroup rooted trees“all-zero” taxon explicitly includedin the data matrix and speci�ed asthe outgroup.

Intrinsincally rooted trees‘0’ prespeci�ed as the ancestral stateand hypothetical “all-zero” ancestor included in tree search.

(a) (b) (c)

(d) (e) (f )

Figure S2. Rooting experiments (continued from Fig 2) where trees were either intrinsically rooted by including the hypothetical ancestor in tree search (a-c) or byincluding and explicit all-zero ancestor taxon and specifying it to be the outgroup.


https://doi.org/10.1101/049171

326298_B_602

6239_E_944

224324_B_577

368407_A_598

411154_B_664

243090_B_707

262724_B_634

262543_B_690

259564_A_557

272569_A_639

225164_E_1047

3702_E_988

272562_B_706

167879_B_765

9615_E_1091

208963_B_846

4558_E_960

273116_A_486

9913_E_1090

318161_B_758

399549_A_513

3694_E_974

290317_B_639

362976_A_570

340102_A_493

10090_E_1110

251221_B_733

273063_A_524

399550_A_446

45157_E_821

453591_A_431

367110_E_883

64091_A_549

374847_A_451

486041_E_869

436308_A_502

224308_B_781

312017_E_777

420247_A_500

397948_A_493

349307_A_512

284811_E_765

349101_B_734

192952_A_618

70601_A_503

381666_B_810

306902_E_775

402880_A_509

1054217_A_469

104341_E_864

284590_E_770

4896_E_777

323259_A_591

365044_B_770

383372_B_722

272557_A_469

190192_A_473

448385_B_835

273507_E_834

240015_B_694

1036743_B_726

273068_B_646

444157_A_468

380703_B_807

410358_A_525

243232_A_499

368408_A_449

8355_E_1008

7227_E_985

6669_E_1042

5762_E_843

269482_B_829

7955_E_1088

70791_E_915

263820_A_491

391009_B_584

4932_E_777

100226_B_776

339860_A_475

357804_B_762

321956_B_506

296587_E_898

391774_B_666

391037_B_726

240292_B_783

351607_B_638

187420_A_523

556484_E_858

272568_B_692

81824_E_900

69014_A_536

367737_B_607

565050_B_724

415426_A_455

224325_A_569

9606_E_1113

2903_E_963

9598_E_1088

348780_A_588

44056_E_842

227321_E_899

44689_E_894

318161_B_758

284811_E_765

6669_E_1042

69014_A_536

44056_E_842

383372_B_722

8355_E_1008

436308_A_502

273116_A_486

312017_E_777

415426_A_455

224324_B_577

391774_B_666

70791_E_915

45157_E_821

410358_A_525

227321_E_899

4932_E_777

351607_B_638

272557_A_469

326298_B_602

391037_B_726

367737_B_607

306902_E_775

296587_E_898

1054217_A_469

349101_B_734

104341_E_864

243232_A_499

224308_B_781

9615_E_1091

4896_E_777

348780_A_588

9606_E_1113

9598_E_1088

100226_B_776

381666_B_810

190192_A_473

486041_E_869

192952_A_618

1036743_B_726

251221_B_733

208963_B_846

7955_E_1088

420247_A_500

6239_E_944

9913_E_1090

339860_A_475

269482_B_829

2903_E_963

224325_A_569

448385_B_835

167879_B_765

323259_A_591

453591_A_431

365044_B_770

399549_A_513273063_A_524

444157_A_468

3702_E_988

4558_E_9603694_E_974

391009_B_584

340102_A_493

7227_E_985

284590_E_770

262724_B_634

368408_A_449

565050_B_724

362976_A_570

10090_E_1110

349307_A_512

44689_E_894

70601_A_503

259564_A_557

273507_E_834

262543_B_690

374847_A_451

290317_B_639

272568_B_692

81824_E_900

272569_A_639

240292_B_783

397948_A_493

321956_B_506

411154_B_664

187420_A_523

556484_E_858

367110_E_883

273068_B_646

263820_A_491

5762_E_843

243090_B_707

399550_A_446

1105097_B_420

402880_A_509

64091_A_549

380703_B_807

225164_E_1047

357804_B_762

272562_B_706

240015_B_694

368407_A_598

444157_A_468

284590_E_770

362976_A_570

7955_E_1088

391774_B_666

380703_B_807

399549_A_513

243232_A_499

486041_E_869

3702_E_988

187420_A_523

340102_A_493

70791_E_915

44056_E_842

312017_E_777

367737_B_607

224324_B_577

339860_A_475

306902_E_775

259564_A_557

348780_A_588

402880_A_509

81824_E_900

273507_E_834

367110_E_883

273063_A_524

9913_E_1090

4896_E_777

368407_A_598

6669_E_1042

391037_B_726

290317_B_639

411154_B_664

399550_A_446

224308_B_781

565050_B_724

391009_B_584

6239_E_944

262543_B_690

240015_B_694

296587_E_898

243090_B_707

436308_A_502

269482_B_829

7227_E_985

448385_B_835

192952_A_618

224325_A_569

272557_A_469

272562_B_706

9615_E_1091

5762_E_843

10090_E_1110

4558_E_960

326298_B_602

323259_A_591

45157_E_821

272569_A_639

420247_A_500

381666_B_810

4932_E_777

64091_A_549

349101_B_734

263820_A_491

365044_B_770

100226_B_776

44689_E_894

227321_E_899

240292_B_783

228908_A_228

208963_B_846

357804_B_762

321956_B_506

397948_A_493

190192_A_473

273116_A_486

453591_A_431

374847_A_451

251221_B_733

167879_B_765

1105097_B_420

3694_E_974

284811_E_765

9606_E_1113

2903_E_963

9598_E_1088

318161_B_758

349307_A_512

70601_A_503

383372_B_722

272568_B_692

410358_A_525

262724_B_634

69014_A_536

225164_E_1047

104341_E_864

1054217_A_469

556484_E_858

1036743_B_726

368408_A_449

415426_A_455

273068_B_646

8355_E_1008

351607_B_638

(a) (b) (c)

EUKARYABACTERIAARCHAEA Smallest proteome

Root node

Figure S3. Rooting experiments (corresponding to Fig 1) with a larger taxon sampling


https://doi.org/10.1101/049171

1054217_A_469

1036743_B_726272568_B_692

351607_B_638

486041_E_869

397948_A_493

9615_E_1091

339860_A_475

348780_A_588

306902_E_775

224325_A_569

4896_E_777

565050_B_724

3702_E_988

225164_E_1047

269482_B_829

391037_B_726

5762_E_843

272562_B_706

420247_A_500

7227_E_985

273063_A_524

224308_B_781

243090_B_707

262724_B_634

167879_B_765

243232_A_499

4558_E_9602903_E_963

290317_B_639

367110_E_883

81824_E_900

187420_A_523

8355_E_1008

383372_B_722

410358_A_525

1105097_B_420

399550_A_446

340102_A_493

368408_A_449

4932_E_777

272569_A_639

318161_B_758

64091_A_549

453591_A_431

367737_B_607

9606_E_1113

284811_E_765

399549_A_513

368407_A_598

411154_B_664

391009_B_584

380703_B_807

212035_V_176

391774_B_666

349101_B_734

272557_A_469

44056_E_842

374847_A_451

70791_E_915227321_E_899

263820_A_491

402880_A_509

6669_E_1042

104341_E_864

69014_A_536

6239_E_944

362976_A_570

296587_E_898

312017_E_777

556484_E_858

208963_B_846

190192_A_473

9598_E_1088

357804_B_762

326298_B_602

415426_A_455

100226_B_776

365044_B_770

1094892_V_175

44689_E_8943694_E_974

273116_A_486

192952_A_618

224324_B_577

381666_B_810

444157_A_468

284590_E_770

240015_B_694

321956_B_506

70601_A_503

262543_B_690

10090_E_1110

45157_E_821

1349409_V_108

448385_B_835

273068_B_646

1235314_V_177228908_A_228

273507_E_834

436308_A_502

1349410_V_107

7955_E_1088

349307_A_512

251221_B_733

259564_A_557

9913_E_1090

323259_A_591

240292_B_783

5762_E_843

368408_A_449

263820_A_491

273068_B_646

9598_E_1088

4896_E_777

187420_A_523

3702_E_988

227321_E_899

323259_A_591

4932_E_777

273116_A_486

269482_B_829

391009_B_584321956_B_506

192952_A_618

243232_A_499

100226_B_776

357804_B_762

367110_E_883

284811_E_765

486041_E_869

251221_B_733

272562_B_706

1105097_B_420

224325_A_569

312017_E_777

349101_B_734

415426_A_455

9606_E_1113

348780_A_588272569_A_639

1343077_B_107

362976_A_570

208963_B_846

397948_A_493

2903_E_963

6239_E_9449615_E_1091

240015_B_694565050_B_724

444157_A_468

167879_B_765

296587_E_898

45157_E_821

420247_A_500

383372_B_722391037_B_726

7955_E_108810090_E_1110

411154_B_664

380703_B_807

225164_E_1047

7227_E_985

1053648_B_107

44689_E_894

448385_B_835

6669_E_1042

1094892_V_175

70791_E_915

453591_A_431

306902_E_775

273507_E_834

228908_A_228

290317_B_639

374847_A_451

44056_E_842

340102_A_493

1349409_V_108

240292_B_783

104341_E_864

262724_B_634

339860_A_475

284590_E_770

262543_B_690

9913_E_1090

326298_B_602367737_B_607

399549_A_513

381666_B_810

259564_A_557

8355_E_1008

243090_B_707

272568_B_692

410358_A_525

273063_A_524

69014_A_536

224308_B_781

365044_B_770

1349410_V_107

190192_A_473

4558_E_960

212035_V_176

402880_A_509

70601_A_503

351607_B_638

556484_E_858

318161_B_758

224324_B_577

1235314_V_177

349307_A_512

368407_A_598

81824_E_900

64091_A_549

399550_A_446

436308_A_502

3694_E_974

391774_B_666

1054217_A_469

1036743_B_726

272557_A_469

69014_A_536

44689_E_894

1343077_B_107

11128_V_17

1349410_V_107

348780_A_588

411154_B_664

1094892_V_175

273507_E_834

227321_E_899

349101_B_734

368407_A_598

357804_B_762

374847_A_451

1053648_B_107

269482_B_829

167879_B_765

4932_E_777

656190_V_4

3702_E_988

284590_E_770

240015_B_694

2903_E_963

486041_E_869

399550_A_446

381666_B_810

349307_A_512

391009_B_584

391037_B_726

444157_A_468

225164_E_1047

312017_E_777

565050_B_724

243232_A_499

362976_A_570

284811_E_765

70791_E_915

9913_E_1090

7955_E_1088

263820_A_491

436308_A_502

368408_A_449

8355_E_1008

212035_V_176

340102_A_493

228908_A_228

208963_B_846

10090_E_1110

273068_B_646

273116_A_486

1105097_B_420

81931_V_5

306902_E_775

6239_E_944

224324_B_577351607_B_638

391774_B_666

556484_E_858

240292_B_783

397948_A_493

365044_B_770

1054217_A_469

4558_E_960

224325_A_569

44056_E_842

1349409_V_108

7227_E_985

420247_A_500

9606_E_1113

9615_E_1091

243090_B_707

9598_E_1088

192952_A_618

190192_A_473

3694_E_974

323259_A_591

273063_A_524

64091_A_549

380703_B_807

187420_A_523

6669_E_1042

272562_B_706

81824_E_900

104341_E_864

367737_B_607

367110_E_883

262724_B_634

70601_A_503

321956_B_506

339860_A_475

1235314_V_177

251221_B_733

453591_A_431

448385_B_835

296587_E_898

45157_E_821

318161_B_758

259564_A_557

4896_E_777

272557_A_469

326298_B_602

224308_B_781272568_B_692

410358_A_525

100226_B_776

262543_B_690

290317_B_639

383372_B_722

402880_A_509

11520_V_7

5762_E_843

1036743_B_726

272569_A_639

399549_A_513

415426_A_455

EUKARYABACTERIAARCHAEADNA VirusesRNA viruses

Smallest proteomeRoot node

(a) (b) (c)

Figure S4. Rooting experiments (corresponding to Fig 2) with a larger taxon sampling


https://doi.org/10.1101/049171

did viruses evolve as a distinct supergroup from common ...€¦ · 18.04.2016 · be polarized a...

Documents