mol biol evol 2012 de vienne molbev_msr317
TRANSCRIPT
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
1/12
Phylo-MCOA: A Fast and Efficient Method to Detect OutlierGenes and Species in Phylogenomics Using MultipleCo-inertia Analysis
Damien M. de Vienne,*,1,2 Sebastien Ollier,2 and Gabriela Aguileta1,2
1Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain2Laboratoire Ecologie, Systematique et Evolution, UMR 8079, Universite Paris-Sud 11, Orsay cedex, France
*Corresponding author: E-mail: [email protected] editor: Herve Philippe
Abstract
Full genome data sets are currently being explored on a regular basis to infer phylogenetic trees, but there are oftendiscordances among the trees produced by different genes. An important goal in phylogenomics is to identify whichindividual gene and species produce the same phylogenetic tree and are thus likely to share the same evolutionary history.On the other hand, it is also essential to identify which genes and species produce discordant topologies and thereforeevolve in a different way or represent noise in the data. The latter are outlier genes or species and they can providea wealth of information on potentially interesting biological processes, such as incomplete lineage sorting, hybridization,and horizontal gene transfers. Here, we propose a new method to explore the genomic tree space and detect outlier genesand species based on multiple co-inertia analysis (MCOA), which efficiently captures and compares the similarities in thephylogenetic topologies produced by individual genes. Our method allows the rapid identification of outlier genes andspecies by extracting the similarities and discrepancies, in terms of the pairwise distances, between all the species in all thetrees, simultaneously. This is achieved by using MCOA, which finds successive decomposition axes from individualordinations (i.e., derived from distance matrices) that maximize a covariance function. The method is freely available asa set of R functions. The source code and tutorial can be found online at http://phylomcoa.cgenomics.org.
Key words: phylogenomics, outlier genes, phylogenetic markers, multivariate analysis, fungal genomes.
Introduction
The analysis of genomic data sets has generated the press-ing need to compare a large amount of phylogenetic trees.
The trees under comparison may, for example, representa set of shared orthologous genes from a given organismor taxonomic group, sampled across their full genomes.The goal of such comparisons can be 2-fold: either to iden-tify the set of genes producing a concordant topology thatmay reflect the species history or to identify the set of out-lier genes that generate topological incongruities and arelikely to evolve differently or represent noise. While mostefforts have been devoted to detecting genes that produceconcordant gene trees, less attention has been given toidentifying outliers, although they are very importantcandidates for studying interesting biological processes.
Detecting outlier genes and species is particularly rele-
vant in phylogenomics because genes whose tree topolo-gies are discordant with respect to an overall topologicalconsensus are good candidates for evolving at significantlydifferent rates, to be under positive selection for functionaldivergence, to have been horizontally transferred, to beparalogous instead of orthologous genes, or to be randomlysampled from incompletely sorted lineages. On the otherhand, species that appear as outliers (i.e., that are placedvery differently in different gene trees) may representspecies that are more likely to experience gene transfers,
having sequencing errors, having undergone bottlenecks,
or other biological processes at the species level. Even
though topological discordances can be detected at thegene or the species level, most methods so far are designed
to look at genes only.In the context of phylogenetic gene tree comparison and
phylogenomics, several approaches and methods havebeen proposed to detect concordant and discordant topol-
ogies trying to integrate as much as possible the available
information. A solution that attempts to deal with all the
available data is the construction of a super matrix from
the alignment of all or a large number of the available
individual genes, an approach known as total evidence
(Gadagkar et al. 2005; Aguileta et al. 2008). However, most
methods involve some sort of consensus solution either
by classical consensus of multiple competing phylogenies
(Felsenstein 1985), by building a super tree (Gordon1986; Sanderson et al. 1998), by Bayesian sampling (Ane
et al. 2007), or a summary by agreement subtrees (Cranston
and Rannala 2007). Other approaches work by trying to
filter out the noise and/or increasing the phylogenetic
signal (Rodriguez-Ezpeleta et al. 2007; Roure et al. 2007)
or by clustering congruent loci (Leigh et al. 2008, 2011).Another possibility involves identifying the conflicting
bipartitions and representing all the possible alternativesin a graphic way, an approach that results in networks
The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, pleasee-mail: [email protected]
Mol. Biol. Evol. doi:10.1093/molbev/msr317 Advance Access publication January 3, 2012 1
MBE Advance Access published February 14, 2012
http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/ -
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
2/12
of genes (Huson and Bryant 2006), or a projection of theconflicting signals in 2Ds (White et al. 2007). Also, in orderto obtain a single tree from multiple individual trees of thesame organism or species, the coalescent approach can beused (Rosenberg 2002; Suchard et al. 2003; Suchard 2005;Degnan and Rosenberg 2006) or the Bayesian reconstruc-tion of gene trees taking the species tree as a prior (BEST)(Edwards et al. 2007). Finally, approaches for graphically
comparing phylogenetic trees include the use ofmultidimensional scaling to visualize tree-to-tree pairwisedistances (Hillis et al. 2005); the use of principal componentanalysis (PCA) to detect events of horizontal gene transfer(Brochier et al. 2002; Susko et al. 2006); and employingHadamard spectral analysis represented in a triangulargraph called Treeness Triangles to investigate the lossof phylogenetic signal (White et al. 2007). However, byconcatenating the multilocus data or by summarizing orobtaining a consensus of their individual gene trees, mostof the latter methods lose a wealth of potentially interest-ing information, especially by removing outlier data or
trees. Phylo-MCOA is unique in that it analyzes all the treessimultaneously without discarding data a priori. Further-more, as far as we know, the possibility to also detect out-lier species when comparing phylogenetic trees has notbeen proposed before.
Here, we present Phylo-MCOA, a fast method for com-paring multiple phylogenetic trees and detecting outliergenes and species based on multiple co-inertia analysis(MCOA) (Chessel and Hanafi 1996). The method allowsthe efficient identification of concordant and outliergenes and species, by enabling the quick comparison ofa large number of trees. It builds a consensus typologyby extracting the similarities, in terms of the pairwise
distances between all the species in all the trees, simulta-neously. The consensus typology is a 2D graphic represen-tation of the distance interrelationships between speciessupported by the majority of the genes. Phylo-MCOA findssuccessive decomposition axes from individual ordinations(derived from matrices of distances between species) thatmaximize a covariance function in order to build the con-sensus typology, and it estimates the contribution of eachindividual gene tree to this typology (i.e., to what extenteach gene deviates or agrees with what the majority ofgenes support). MCOA is expected to reveal the commonfeatures (i.e., pairwise distances and species placements) of
single-tree topologies, build a consensus typology and com-pare individual trees with this reference. Until now, MCOAhas been used with ecological (Bady et al. 2004; Hedde et al.2005) and genetic data (Laloe et al. 2007; Berthouly et al.2008); however, as far as we know, this is the first time thatMCOA is applied to phylogenetic analysis. Phylo-MCOA isunique in tracing the position of each species with respectto all other species across the compared phylogenies. Thesimultaneous analysis of all the species placements in alltrees makes it possible to detect, not only concordant genesbut also outlier gene trees, and even more importantly, itcan show which species andgenes explainthose differences.Our method is thus new in allowing the analysis of the data
at the species level for comparing the histories of differentspecies across individual gene trees.
Phylo-MCOA can be very useful for phylogenomic anal-yses, where big genomic data sets are typically analyzed.First, Phylo-MCOA readily identifies the outlier genesand species that cause topological incongruence. This isuseful for studying the evolution of specific genes, partic-ularly those that seem to differ from the rest of the genome.
The method will also identify candidate outlier species,which may undergo interesting biological processes atthe species level. Phylo-MCOA is unique in pointing notonly to outlier genes but also to outlier species. Second,when the goal is to construct a species tree or a tree thatreflects the evolution of most genes in the genome, Phylo-MCOA can be used to identify which genes producediscordant trees and remove them for subsequent phylo-genetic analysis. In this way, only the genes that share thesame evolutionary history are kept in order to build a treethat reflects the evolution of the species or that of most ofthe genome. Third, Phylo-MCOA can be used to investigate
specific evolutionary hypotheses including, but not limitedto, which genes may have been laterally transferred, whichones may be more readily lost or replaced, or which speciesproduce phylogenetic discordance.
Materials and Methods
MCOA in the Context of Phylogenetic AnalysisIn phylogenomics, there is often the need to compare setsof phylogenetic trees, determine how the position of eachspecies changes in every tree, and to find out which treesare more similar between each other in terms of the pair-wise topological distance (i.e., either nodal or patristic). It
would thus be advantageous to compare the position ofeach species in all trees at the same time, to know if thereis a correlation among the species positions and distancesrepresented in all trees (i.e., create a consensus typologythat serves as reference), and to what extent each individ-ual tree deviates from such a consensus. The problem athand involves measuring the relationship between theposition of each species and the distances between themin all the trees of a given set.
In the context of multivariate data analysis, one is inter-ested in ordering objects (e.g., variables, sites, and tables) bytheir similarity, and the relationships between the objects
are measured by different metrics (Manly 2004). Generally,objects are projected onto principal axes and are thus char-acterized numerically or graphically. This procedure placessimilar objects near each other and dissimilar ones far apartin geometric space. Traditional ordering techniques (e.g.,PCA) typically use a singular decomposition axis value ofone table. In the case of our proposed method, MCOAallows the simultaneous ordination of many Euclidean dis-tance matrices derived from every tree in the set by findingsuccessive decomposition axes from each distance matrixthat maximize a covariance function. Doing this allows theextraction of the consensus typology contributed by alltrees.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
2
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
3/12
Here, we introduce the basic concepts of the method;interested readers are referred to the appendix in thesupplementary material (Supplementary Material online)and to Chessel and Hanafi (1996) for its mathematicaldescription.
To better understand how MCOA is applied to phylo-genetic data (i.e., a set ofK trees), we can decompose themethod in two steps. We first explain how one tree can berepresented in a 2D space through Principal CoordinateAnalysis (Gower 1984; Gower and Legendre 1986) (PCO,fig. 1a), and we then explain how K trees can be simulta-neously analyzed through MCOA (fig. 1b).
A single tree can be transformed into a distance matrixby calculating the pairwise distances between species in thetrees (fig. 1a, step 1). Distance between species can eitherbe nodal (number of nodes separating two leaves in thetree) or patristic (sum of the branch lengths separatingtwo leaves). We then compute the square root of this
distance matrix in order to get a Euclidean distance matrix,which is an optimal condition for the PCO to be performed.Whereas the methods typically used to obtain Euclideandistances impose a geometric distortion on them, takingtheir square root does not affect the results (de Vienne
et al. 2011). The principal coordinates of the leaves (i.e.,terminal branches) in a Euclidean space are then derivedfrom the Euclidean distance matrix by a PCO analysis(fig. 1a, step 2). Each axis defined by the PCO analysis max-imizes the variance of the coordinates of the species. Whenoriginal distances between species are Euclidean, all theeigenvalues are positive and correspond to the varianceassociated to the axis. In that case, the distances betweenspecies in the new space defined by the PCO analysis arethe same as the original distances defined by the tree.Because most of the variance is generally concentratedin the first axis (depending on the topology of the tree),we can represent the respective position of the leaves in
FIG. 1. Principle of MCOA as applied to phylogenetic trees. (a) From one tree to a 2D representation of the tree. Step 1: A single phylogenetic
tree can be converted into a distance matrix using either nodal or patristic distances between the leaves. Step 2: A PCO on the square root of
the distance matrix allows retrieving the coordinates of the species in a Euclidean space. Step 3: Positions of the species in a 2D space formed
by the two first axes of the PCO. (b) MCOA from Ktrees to the cohesion plot. Steps 1 and 2: From K trees, Kdistance matrices and KEuclidean
spaces are obtained, as in A. Each Euclidean space can then be seen as a scatter plot (or set of points) in an n 1 dimensional space (n beingthe number of leaves). Step 3: MCOA finds K auxiliary vectors (u11,u12, . . ., u
1K) and a reference vector m
1 so that the sum of the covariances
between each u1 vector and the unique m1 vector is maximized. This is done recursively for m1, m2, etc. Step 4: Cohesion plot for the two first
axes of the MCOA analysis m1 and m2. Labels represent the reference position of the species (consensus typology). Black points represent the
positions of the species in each individual gene tree.
Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE
3
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1 -
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
4/12
a reduced space formed by these axes of the PCO (fig. 1a,step 3). The relative positions of the taxa in this reducedspace represent the main dichotomies observed in the tree.
The first steps of the MCOA analysis, as applied to mul-tiple phylogenetic trees, are the same as before. Each one ofthe K trees is transformed into a distance matrix (fig. 1b,step1), and the square root of the distance matrices arethen computed and used to perform K-independent PCOs(fig. 1b, step2). We thus obtain KEuclidean spaces, one foreach initial tree. Each one of these spaces can be seen as
a set of points (scatter plots on fig. 1b) distributed ona space having n 1 dimensions (or axes), where n isthe number of leaves in each tree. The basic goal of MCOAis then to find a set of K vectors (one for each of the KEuclidean spaces) called auxiliary vectors (from u11 tou1K), as well as a single reference vector (called m
1), so thatthe sum of the covariances between each of the uKauxiliaryvectors and the m1 vector is maximized (fig. 1b, step 3). Ananalytic demonstration of this maximization has beengiven by Chessel and Hanafi (1996). The obtained m1 vectoris the first axis of the MCOA analysis. The same method isthen applied to obtain a m2 vector by maximizing the sum
of the covariance between this vector and K u2
vectors(from u21 to u
2K). Thus m
2 becomes the second axis of theMCOA. This is done recursively for retrieving the n axesof the MCOA analysis.
Phylo-MCOA: Using MCOA for PhylogeneticAnalysisFollowing the analysis described above, the reference posi-tion of each species (coordinates of the species projectedon the m1, m2, . . . mn vectors) as well as the position of thespecies in each individual tree (coordinates of the speciesprojected on the u1, u2, . . . , un auxiliary vectors) can becomputed. It is possible to plot the reference positions
of each species, as well as their position in each individualtree, using the two first axes of the MCOA analysis (fig. 1bpart 4). This plot is usually called the cohesion plot. The setof reference positions of the species in the cohesion plot iscalled consensus typology. For a given species, the distancebetween its reference position and its position in each in-dividual tree is represented by lines. Long lines representcases were the species has a position in a given tree thatis not concordant with its position in all other trees (andtherefore neither in the consensus typology). On the con-
trary, short lines represent cases where the position of thespecies in a given tree is roughly the same as in all the othertrees (and therefore in the consensus typology). The cohe-sion plot thus contains N Tdots, where N represents thenumber of species in each tree and T the number of trees.
The cohesion plot is very informative for small data sets,where one can assess the position of each species in eachindividual tree relative to its reference position. However,this plot becomes difficult to read and to interpret asthe number of genes and species increases because specieslabels superimpose and the cloud of points and linesbecomes overcrowded. Moreover, the cohesion plot gives
only a graphical representation of the data and does notallow the extraction of statistically significant outlier genesand species. Therefore, we show instead the informationcontained in cohesion plots but using a different and morereadable representation: the two-Way Reference matrix(hereafter referred to as the 2WR matrix). The 2WR matrixis built by computing, for every species, the distance be-tween its reference position and its position in each indi-vidual tree in the multidimensional space. The rows in the2WR matrix represent species and columns represent genes(fig. 2a). Note that the order of species (rows) in the 2WRmatrix follows the average relative position of species inthe gene trees. In other words, two species that are close
FIG. 2. Detection of complete outliers by Phylo-MCOA. (a) The 2WR matrix is computed by calculating, for every species, the distance
separating its position in each gene tree to its reference position (see text). The 2WR matrix provides information about species in rows and
genes in columns. (b) The 2WR matrix is transformed into two binary matrices (2WRBgn and 2WRBsp) by detecting outlier cells as described in
the Methods section. (c) A score is computed for each gene from the 2WRBgn matrices and for each species from the 2WRBsp matrix and
represented by a bar plot. The dashed gray line on the bar plot represents the threshold chosen by the user over which a gene or a species is
considered as a complete outlier.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
4
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
5/12
in the 2WR matrix (two contiguous rows) are two speciesthat are closely related in the gene trees.
Detection of OutliersPhylo-MCOA can detect two types of outliers: completeoutliers and cell-by-cell outliers. Complete outliers are oftwo types: complete outlier species and complete outliergenes. Complete outlier genes place all the species in a posi-tion different from their positions in the other gene trees,whereas complete outlier species are species whose posi-tion with respect to other species is different in all the genetrees. Cell-by-cell outliers represent species whose position
in some specific gene trees is not concordant with theirposition in the other gene trees (i.e., single cells in the2WR matrix whose value seems outlier with respect tothe other values in the matrix).
Cell-by-cell outliers cannot be detected by Phylo-MCOAfrom the 2WR matrix if complete outliers are present be-cause it hides the cell-by-cell signal. The detection of out-liers by Phylo-MCOA is thus performed in multiple steps: 1)Phylo-MCOA analysis of the original data set; 2) detectionof complete outliers; 3) removal of the complete outliersfrom the original data set; 4) Phylo-MCOA analysis ofthe new data set; and 5) detection of cell-by-cell outliers. If
no complete outliers are detected, steps 3 and 4 are omitted.For detecting complete outliers (fig. 2), the 2WR matrix
is separated in two matrices, (2WRBsp and 2WRBgn, fig. 2b)by detecting outlier cells. For constructing the 2WRBsp ma-trix, a cell is considered as an outlier (and assigned the value1), if its value in the 2WR matrix is higher than Q3 kIQR,where Q3 represents the third quartile of the distributionof values of its column and IQR represents the interquartilerange of these values. Symmetrically, for the 2WRBgn matrix,a cell is considered as outlier if its value in the 2WR matrix ishigher than Q3 kIQR, where Q3 and IQR are now com-puted based on the values in the row. We set k5 1.5 as thedefault value as is classically done when detecting outliers
with this method. The two obtained binary matrices arerepresented in figure 2b. From the 2WRBgn matrix, a scoreis computed for each gene by computing the proportion of1s in its column. From the 2WRBsp matrix, a score is com-puted for each species by computing the proportion of 1sin its row. This leads to the two barplots presented in figure2c. In this example, gene 2 and gene 13 are detected as out-liers by more than 90% of the species, and t16 and t33 aredetected as outlier species by more than 90% of the genes.
The user can fix a threshold for considering that a speciesor a gene is a complete outlier. By default, we set thisthreshold to 50% (i.e., species detected as outlier for more
than 50% of the genes are seen as complete outliers andgenes detected as outliers for more than 50% of the speciesare considered as complete outliers).
For the detection of cell-by-cell outliers (fig. 3), the 2WRmatrix (fig. 3a) is transformed into the 2WRgn and 2WRsp
matrices by normalizing the 2WR matrix by rows and col-umns, respectively; the product of these two normalizedmatrices gives the 2WRgnsp matrix that is used for thecell-by-cell outlier detection.
The 2WRgnsp matrix is first transformed into a binarymatrix by detecting outlier cells, according to the valuesin the rows, and to a second binary matrix by detecting
outlier cells according to the values in the columns. Thesetwo binary matrices are finally multiplied in order to con-serve only cells detected as outliers for both rows and col-umns. This last binary matrix is called 2WRBgnsp (fig. 3b).
This procedure for detecting cell-by-cell outliers is, how-ever, too permissive, and therefore, a second filtering step isnecessary. As explained before, Phylo-MCOA starts bytransforming trees into distance matrices. Species in thetrees are not independent from each other. Therefore, evena single modification in a tree will change all the values inthe associated distance matrix. When we detect a cell-by-cell outlier, we have to distinguish between true outliersand species that are affected by this outlier just because
FIG. 3. Detection of cell-by-cell outliers by Phylo-MCOA. (a) The 2WR matrix is obtained as described in the text. (b) The 2WR matrix is then
transformed into the 2WRBgnsp binary matrix by successive normalizations and detection of outliers (see text). This matrix contains islands of
outliers representing groups of closely related species erroneously detected as outliers surrounding the true outlier. (c) Islands are removed by
considering as true outlier only the cell having the maximum value in the island. This maximum is taken from the 2WRgnsp matrix.
Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE
5
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
6/12
they are close in the distance matrix. To solve this problem,we take advantage of a Phylo-MCOA property that ordersspecies in the 2WR matrix by their mean proximity in thecohesion plot. When two or more species that are contig-uous in the 2WRBgnsp matrix for a given gene are detectedas outliers, we define this group as an island (dashedcircles in fig. 3b). In each island, we need to distinguish be-tween the true outlier species and the contiguous species
that are erroneously detected as outliers. To do so, we con-sider that the outlier cell with the highest value in the2WRgnsp is the true outlier, and the contiguous cells areartifacts. This filtering step allows obtaining a binary matrixfree of islands that contains only true cell-by-cell outliers(fig. 3c).
Handling Tree and Node SupportWhen comparing phylogenetic trees, it is very importantto consider the statistical support of both trees and nodes.To take into account measures of tree support such as like-lihood scores, posterior probabilities, or any other suitable
scores, Phylo-MCOA can assign different weights, specifiedby the user, to the different trees according to their sup-port. To do so, during the MCOA analysis, each table (rep-resenting Euclidean spaces, see fig. 1) is multiplied by thesquare root of the weight that is associated to it (in abso-lute values). As a result, the values in the tables correspond-ing to a specific gene with a high weight are higher andthe importance of the corresponding auxiliary vectorsfor retrieving the m vectors also increases (fig. 1b, step3). In this sense, those genes will weigh more in the MCOA.It is also possible to take into account node support (e.g.,bootstrap values) when building the consensus typology,by choosing a cutoff level of node confidence below
which nodes will be collapsed. In this way, poorly sup-ported nodes (i.e., unlikely species bipartitions) will haveless or no influence in the consensus typology.
Handling Missing DataPhylo-MCOA is capable of analyzing data sets that areincompletely sampled (i.e., with missing data), in the caseswhere a given gene is not available for one or more speciesbut is found in all other species. To handle incomplete sam-ples, the distances that correspond to missing species ina given pairwise distance matrix are assigned the valueof the average distance obtained from all the other distance
matrices where the species is present.
Results
Phylo-MCOA Evaluation: Simulated Data SetsSimulated data sets were generated in order to evaluate theefficiency and robustness of Phylo-MCOA. All the data setswere created in the same way: we generated a random treewith N species by random splitting of edges as imple-mented in function rtree of the ape package in R language(Paradis et al. 2004) and duplicated this gene tree in orderto get Tidentical gene trees. A new gene tree was randomlygenerated for each test performed. To create a complete
outlier species, we changed its position randomly in allT trees. To generate a complete outlier gene, we replacedone of the gene trees by a completely random tree.
Subsequently, we evaluated the efficiency of Phylo-MCOA by its ability to retrieve the correct complete outliergene and species, using the corresponding 2WR matrix. Alloutliers were detected based on nodal distances, using theoption distance 5 nodal in Phylo-MCOA.
To simulate cell-by-cell outliers, we randomly sampledone species and one tree and changed randomly the posi-tion of the selected species in the selected tree. By repeat-ing this operation multiple times, we obtained data setswith a variable proportion of cell-by-cell outliers.
Effect of Sample Size on the Ability of Phylo-MCOA to
Retrieve Complete Outliers
Simulated data sets were obtained containing between 10and 100 genes (10, 20, 30, 50, and 100, respectively) andbetween 10 and 100 species (10, 20, 30, 50, and 100, respec-tively); all possible combinations were explored. In all thesedata sets, we artificially introduced three complete outliergenes and three complete outlier species.
We analyzed these data sets with Phylo-MCOA and eval-uated the proportion of complete outlier genes and speciesthat the method was able to retrieve (table 1). We observedthat as long as the number of both species and genes islarge enough (!20), Phylo-MCOA always correctly re-trieved the outliers (black cells in table 1). On the otherhand, when the number of species is small (10), Phylo-MCOA is unable to retrieve the outlier genes correctly,even for a large number of genes (first column, light gray
cells in table 1). However, outlier species are easier to re-trieve as long as the sample size is large enough (!20), evenif there is a small number of genes (10, first row, dark graycells in table 1). Finally, for moderately small number ofoutlier species (520), outlier genes are easier to retrievethan species (second column, dark gray cells in table 1).
Effect of the Proportion of Complete Outliers on the Ability
of Phylo-MCOA to Retrieve Them
We created a data set of 50 genes and 50 species and testedthe ability of Phylo-MCOA to retrieve complete outliersas their proportion increased. We tested data sets wherebetween 2% and 60% of the genes, and between 2% and
Table 1. Effect of the Size of the Data on the Ability of Phylo-
MCOA to Correctly Retrieve Complete Outlier Genes and Species.
Retrieved
Species/Genes
Number of Species
10 20 30 50 80 100
Number of genes 10 0/0 0/0 1/0 3/0 3/0 3/0
20 0/0 1/2 3/3 3/3 3/3 3/3
30 0/0 0/3 3/3 3/3 3/3 3/3
50 0/0 0/3 3/3 3/3 3/3 3/3
80 0/0 0/3 3/3 3/3 3/3 3/3100 0/1 0/3 3/3 3/3 3/3 3/3
NOTE.Three outlier genes and three outlier species were simulated. Each cellcontains two values separated by a slash (/): number of outlier species retrieved/number of outlier genes retrieved. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers byPhylo-MCOA.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
6
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
7/12
60% of the species were outliers (with values 2%, 10%, 20%,40%, and 60%). All possible combinations of these propor-tions were tested. We then evaluated the number of com-plete outlier genes and species that Phylo-MCOA was ableto retrieve correctly in each case (table 2). We observedthat even when 20% of genes and species are outliers, Phy-lo-MCOA was still able to retrieve almost all of them cor-rectly (black cells in table 2). A large proportion of outlier
genes (40%) does not prevent retrieving outlier species.Similarly, a large proportion of outlier species is not a prob-lem for retrieving outlier genes (dark gray cells in table 2).However, if a large proportion of both genes and species areoutliers, this will make it difficult to retrieve outliers (lightgray cells in table 2).
Effect of the Proportion of Missing Data on the Ability of
Phylo-MCOA to Retrieve Complete Outliers
In typical biological data sets, there are plenty of missingdata, therefore, methods for real data set analysis shouldbe robust to incompletely sampled data sets. In order toevaluate the ability of Phylo-MCOA to retrieve outlier
genes and species despite missing data, we created a dataset with 100 genes and 100 species containing three com-plete outlier species and five complete outlier genes. Weincreased the proportion of missing data progressivelyby randomly removing species from the original set ofgenes. We removed 1%, 2%, 5%, 10%, 15%, 20%, 30%,and 50% of the data, respectively (table 3). Note that a pro-portion of 20% of missing data means that, on average,each tree has lost 20% of its leaves. We observed thatPhylo-MCOA is very robust to missing data. Even with20% of the data missing, the three complete outlier speciesand one of the five complete outlier genes were correctlyretrieved (table 2).
Capacity of Phylo-MCOA to Detect Complete Outliers in the
Presence of Cell-by-Cell Outliers
The detection of cell-by-cell outliers by Phylo-MCOA isregularly performed on a data set from which completeoutliers have been removed. In order to verify whetherPhylo-MCOA is able to correctly detect complete outliersin the presence of cell-by-cell outliers, we simulated a dataset with 100 genes and 100 species containing three com-
plete outlier species and three complete outlier genes, andevaluated the ability of Phylo-MCOA to correctly detectthe complete outliers given different numbers of cell-by-celloutliers, from 1 to 1,500. Results are shown in table 4.If,1,200 cell-by-cell outliers are present, Phylo-MCOA isalways able to retrieve correctly the complete outliers. For1,500 cell-by-cell outliers, all complete outlier genes are stillretrieved, but the complete outlier species are not retrieved.
Capacity of Phylo-MCOA to Retrieve Cell-by-Cell Outliers
As previously explained, the detection of cell-by-cell out-liers by Phylo-MCOA is performed on a data set free of
complete outliers, therefore, for this test complete outliersare excluded. We analyzed 100 genes and 100 species witha variable proportion of cell-by-cell outliers and evaluatedthe proportion that was detected by Phylo-MCOA. Foreach number of cell-by-cell outliers, we repeated the oper-ation 20 times in order to have an estimation of the stan-dard deviation of the mean proportion of cell-by-celloutliers retrieved (table 5). We observed that the mean pro-portion of cell-by-cell outliers retrieved decreased veryslowly, starting at 100% (SD 5 0) for two and five cell-by-cell outliers included, to more than 85% for 1,000cell-by-cell included. Phylo-MCOA is thus able to detectcell-by-cell outliers even when there are several of them.
Table 2. Effect of the Proportion of Outlier Genes and Species in the Ability of Phylo-MCOA to Correctly Retrieve Them.
Retrieved Species/Genes (%)
Proportion of Complete Outlier Species [Number]
2% [1] 10% [5] 20% [10] 40% [20] 60% [30]
Proportion of complete outlier genes [number] 2 [1] 1/1 5/1 9/1 0/1 0/0
10 [5] 1/5 5/5 8/5 0/5 0/0
20 [10] 1/10 5/10 10/10 0/0 0/0
40 [20] 1/0 5/0 0/0 0/0 0/0
60 [30] 0/0 0/0 0/0 0/0 0/0
NOTE.The data set was composed of 50 genes and 50 species. The proportions of outliers are given in percentage and the absolute number of outliers is given underbracket ([ ]). Each cell contains two values separated by a slash (/): number of complete outlier species retrieved/number of complete outlier genes retrieved. Nodaldistances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.
Table 3. Effect of the Proportion of Missing Data on the Number
of Complete Outliers Retrieved (genes and species).
Proportion of Missing Data
1% 2% 5% 10% 20% 30% 50%
Species retrieved 3 3 3 3 3 0 0
Genes retrieved 5 5 5 5 1 0 0
NOTE.The data set was comprised of 100 genes and 100 species with fivecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.
Table 4. Number of Complete Outliers Retrieved with Increasing
Number of Cell-by-Cell Outliers.
Number of Cell-by-Cell Outliers
1 10 50 100 300 500 1,000 1,200 1,500
Species retrieved 3 3 3 3 3 3 3 3 0
Genes retrieved 3 3 3 3 3 3 3 3 3
NOTE.The data set was comprised of 100 genes and 100 species with threecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.
Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE
7
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
8/12
Phylo-MCOA Evaluation: Real Data SetWe used Phylo-MCOA on a real data set comprising 246gene trees for 21 fungal species. This data set was used pre-viously by Aguileta et al. (2008) for retrieving the best genesto reconstruct the phylogeny of fungi, and the data areavailable in FUNYBASE (Marthey et al. 2008). Aguiletaet al. (2008) had compared each individual gene tree tothe species tree obtained from the concatenation of the246 genes. From this comparison, they had obtained a scorefor each gene tree representing how suitable it was for find-ing the reference (concatenated) tree. Using Phylo-MCOA,we analyzed the data set of Aguileta et al. (2008) using
nodal distance between species and obtained a score foreach species and gene as explained in the Materials andMethods section. These scores are plotted in figure 4 forthe genes (top) and for the species (bottom). Phylo-MCOAdetected five complete outlier genes (black bars in fig. 4)and no complete outlier species. The fact that no speciesappeared as outliers in this data set is not surprisingbecause the goal of their work was to find optimal genemarkers for reconstructing the fungal species tree, there-fore, species producing incongruent trees were removedfrom the data set in Aguileta et al. (2008). In order to checkfor consistency, we compared the score that Aguileta et al.obtained for each gene (a topological similarity score re-
ferred to as topscore) to the score we obtained with Phy-lo-MCOA. Figure 5 (white dots) represents the scatter plot
of these two scores. Phylo-MCOA is able to retrieve thecorrect ranking of trees and to identify genes that canbe considered as outliers. We found that trees consideredas outliers in Aguileta et al. (2008) were indeed also outliersaccording to the analysis using Phylo-MCOA (gray arrowsin fig. 5). When using patristic distances instead of nodaldistances for the Phylo-MCOA analysis (black dots infig. 5), the same trend is still present but the correlationwith the topscore values of Aguileta et al. (2008) seemsto slightly decrease. This is expected because the topscoreused by Aguileta et al. (2008) is a topological distancemeasure, thus more similar to Phylo-MCOA with nodal
distances than to Phylo-MCOA with patristic distances.Overall, this illustrates the general utility of Phylo-MCOA for analyzing real data sets. Moreover, the methodis very fast: using the set of trees in Aguileta et al. (2008), ittook about 1 min on a pc laptop.
Using Phylo-MCOA for Filtering Phylogenomic DataSetsDeep relationships in the animal phylogeny are still de-bated. In 2009, Schierwater et al. (2009) proposed a phylog-eny of the Metazoa that was questioned because of the lowquality of the data set used to construct it. More precisely,Philippe et al. (2011) detected important problems in thedata set of Schierwater et al. (2009), such as contamina-tions, deep paralogies, or horizontal gene transfers. They
FG1000
FG1010
FG1015
FG1020
FG1021
FG1024
FG1031
FG1046
FG1049
FG1056
FG1063
FG1073
FG1108
FG436
FG459
FG465
FG478
FG487
FG491
FG499
FG503
FG507
FG508
FG524
FG525
FG529
FG533
FG534
FG535
FG542
FG543
FG546
FG549
FG551
FG552
FG556
FG559
FG569
FG570
FG575
FG576
FG578
FG579
FG583
FG586
FG588
FG589
FG591
FG595
FG598
FG610
FG612
FG616
FG621
FG627
FG630
FG634
FG635
FG636
FG638
FG644
FG646
FG649
FG651
FG652
FG657
FG659
FG662
FG665
FG668
FG673
FG678
FG679
FG684
FG685
FG686
FG687
FG689
FG690
FG691
FG692
FG694
FG695
FG696
FG698
FG699
FG702
FG704
FG705
FG708
FG716
FG718
FG720
FG722
FG723
FG730
FG734
FG735
FG740
FG746
FG747
FG750
FG752
FG756
FG757
FG758
FG761
FG762
FG763
FG771
FG781
FG784
FG788
FG789
FG798
FG805
FG813
FG819
FG821
FG825
FG832
FG834
FG841
FG844
FG848
FG850
FG854
FG855
FG861
FG862
FG863
FG864
FG866
FG870
FG878
FG884
FG893
FG894
FG896
FG897
FG898
FG901
FG909
FG912
FG916
FG927
FG933
FG942
FG948
FG964
FG975
FG994
MS241
MS277
MS282
MS294
MS302
MS304
MS305
MS307
MS311
MS313
MS320
MS325
MS329
MS333
MS334
MS337
MS345
MS348
MS351
MS353
MS355
MS358
MS359
MS361
MS362
MS368
MS375
MS377
MS378
MS379
MS380
MS382
MS384
MS386
MS393
MS394
MS395
MS397
MS398
MS399
MS400
MS401
MS407
MS408
MS409
MS411
MS413
MS414
MS415
MS417
MS418
MS422
MS424
MS428
MS429
MS430
MS431
MS432
MS437
MS441
MS442
MS444
MS447
MS448
MS452
MS453
MS454
MS456
MS459
MS460
MS462
MS463
MS467
MS470
MS481
MS484
MS485
MS486
MS487
MS489
MS493
MS497
MS501
MS517
MS524
MS529
MS532
MS541
MS547
MS549
MS550
MS561
MS578
MS621
Genes
Outlierdetectionproportion
0.0
0.2
0.4
0.6
0.8
Sce
Spa
Pch
Cne
Uma
Yli
Dha
Clu
Spo
Ssc
Ani
Afu
Cim
Mgr
Ncr
Tre
Fgr
Ago
Kla
Cgl
Sba
Species
outlierdetectionproportion
0.00
0.02
0.04
0.06
0.08
0.10
0.12
FIG. 4. Bar plots for the fungal data showing the score calculated for each gene (top) and each species (bottom). Black bars represent significant
outliers according to the threshold chosen by the user (here 0.4). Nodal distances were used for the Phylo-MCOA analysis in this case.
Table 5. Mean Proportion of Cell-by-Cell Outliers Retrieved Out of 20 Repetitions with Increasing the Number of Cell-by-Cell Outliers.
Proportion Retrieved
Number of Cell-by-Cell Outliers
2 5 10 20 50 100 200 300 400 500 600 700 800 900 1,000
Mean 1 1 0.975 0.973 0.960 0.947 0.938 0.932 0.923 0.907 0.897 0.888 0.888 0.871 0.863
SD 0 0 0.055 0.034 0.030 0.022 0.018 0.015 0.011 0.011 0.014 0.011 0.010 0.008 0.009
NOTE.The data set was comprised of 100 genes and 100 species. Nodal distances were used in the Phylo-MCOA analysis.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
8
-
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
9/12
reanalyzed the data set of Schierwater after having replacedproblematic sequences by correct ones and showed thatthe obtained tree, even if it had less supported nodes,was more realistic given previous studies.
In order to test whether Phylo-MCOA is able to filter
problematic sequences from a real data set, we analyzed
the set of phylogenetic trees used by Schierwater et al.(2009). We removed the outliers detected by Phylo-MCOA,built a phylogeny using only the filtered data set (i.e., with-out problematic sequences) and compared our tree withthe one obtained without filtering and the one producedby Philippe et al. (2011) after filtering the data.
To obtain the animal phylogeny, we used the originaldata set of Schierwater et al. (2009). It was first aligned
and trimmed with the same tools and parameters as inPhilippe et al. (2011), and the individual gene trees werereconstructed using the same model as Philippe et al.(2011). As in Philippe et al. (2011), four genes (AT8,DCR, N4L, and ND6) were removed from the analysis be-cause they contained too few positions after trimming. Allthe trees were analyzed with Phylo-MCOA using patristicdistances and all default parameters for the outlier detec-tion. Four complete outliers genes were detected by Phylo-MCOA and removed before performing the cell-by-celldetection: rrna16S, rrna18S, rrna28S, and rrna58S. Of the984 cells in the 2WR matrix (41 genes 24 species), 21
cell-by-cell outliers were detected (see supplementary tableS1, Supplementary Material online) and were removedfrom the original data set (by removing the sequencesin the original sequence files). The remaining sequences(available as supplementary dataset S1, Supplementary Ma-terial online) were then concatenated and used to recon-struct a phylogenetic tree with 100 bootstrap replicatesusing RAxML (Stamatakis 2006) and the same model as inSchierwater et al. (2009). The tree we obtained withoutfiltering (fig. 6a) was similar to the one proposed by Schier-water et al. (2009), which is known to contain noisy dataand thus produce an inconsistent topology. After filtering,there was a general decrease of bootstrap support for no-
des in the diploblasts group (fig. 6b), while the support
FIG. 5. Scatter plot of the topological congruence (top score)
between 246 fungal gene trees and the consensus species tree
obtained from the concatenation of individual trees (x axis) againstthe score of the genes according to Phylo-MCOA (y axis). Black dots
represent the scores when using patristic distances. White dots
represent the scores when using nodal distances. Outlier genes
according to Phylo-MCOA and identified by black bars in figure 4
are shown with small gray arrows. They appear to be poorly
congruent with the species tree (small values on the x axis).
FIG. 6. Comparison of the phylogeny of animals from the Schierwater et al. (2009) data set. (a) Tree obtained from the original sequences.
(b) Tree obtained after removal of cell-by-cell outliers using Phylo-MCOA. Species with a star (*) represent Porifera. The gray arrows point to
the node support of the diploblasts group in the two trees. Nodes with black dots represent bootstrap supports higher than 95. A hundred
bootstrap replicates were performed for each tree.
Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE
9
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1 -
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
10/12
of the nodes outside the diploblasts were almost notaffected. Overall, the bootstrap support for the mono-phyly of diploblasts went from 100 in the original dataset to 63 in the filtered data set (gray arrows in fig. 6),which is in accordance with the accepted vision that dip-loblasts are certainly not monophyletic. The monophylyof Porifera (species with an * in fig. 6) was lost, which isalso in accordance with previous studies and coherent
with the result of Philippe et al. (2011) on this dataset, who obtained a node for the Porifera with a supportvalue of 36 (Philippe et al. 2011).
Phylo-MCOA was thus able to identify true outliers thatneed to be filtered out prior to further phylogenetic anal-ysis in order to estimate a more consistent topology. How-ever, we could not retrieve exactly the same topology as theone obtained by Philippe et al. This can be due to the factthat we do not replace the sequences identified as prob-lematic by correct sequences as done in Philippe et al.(2011). Moreover, Phylo-MCOA considers that the correctsignal is the one that is present in the majority of the genes,
but this is not true in this data set for the Hexactinellida,where the majority of the sequences are slow evolvingmitochondrial sequences being contaminations from theDemospongiae, while true nuclear sequences of Hexactinel-lidae are fast evolving (Philippe et al. 2011). As a conse-quence, Hexactinellida and Demospongiae still forma very solid group after filtering, and the branch lengthleading to Hexactinellida is reduced in the new tree(fig. 6b) compared with the original tree (fig. 6a).
The Phylo-MCOA Program: Availability andRequirementsPhylo-MCOA is provided as a set of R functions whichmake use of the ade4 (Chessel et al. 2004; Dray et al.2007), ape (Paradis et al. 2004), and lattice (for graphicalfunctions, Sarkar 2010) packages and newly developedfunctions. It requires a recent version of R (.2.8.0) installedon the machine. Phylo-MCOA is freely distributed underthe GNU General Public License. It can be downloadedat http://phylomcoa.cgenomics.org.
Discussion
It has become a standard practice to use multiple genes inorder to obtain a reliable and concordant phylogeny. When
sampling many different genes for the same set of species,the individual gene trees will often be partly incongruent. Inthis case, one is interested in detecting which subset ofgenes and species is responsible for the discrepancies, inother words, in finding the outliers. Phylo-MCOA, themethod proposed here, provides a quick and powerfulway to detect outliers by integrating the phylogenetic in-formation contained in all the individual gene trees. Phylo-MCOA determines how different the compared topologiesare but more importantly, it identifies which genes and spe-cies explain those differences. This may prove very usefulwhen determining possible causes for incongruities, suchas horizontal gene transfers, different selective pressures,
recombination events, among others. Outlier genes are typ-ically affected by these factors, so identifying which subsetof species explain the differences may help to focus theefforts for studying the affected species. On the other hand,by building a consensus typology Phylo-MCOA also findsthe genes that produce concordant phylogenies. Thesegenes are excellent candidates for recovering the speciestree, which can be subsequently verified by other methods.
Phylo-MCOA thus provides good candidates, not onlyfor outliers, but for phylogenetic markers producingconcordant topologies.
To our knowledge, this is the first time that MCOA isapplied directly to analyze phylogenetic distances. Basedon a covariation optimization criterion, MCOA is a methodthat captures the similarities (and dissimilarities) obtainedfrom the ordination of several tables (types) of data, as ithas been previously demonstrated for the analysis of eco-logical data (Bady et al. 2004; Hedde et al. 2005; Laloe et al.2007). Furthermore, it is a statistically solid technique thatlends itself well to the analysis of phylogenetic data (i.e., in
terms of topological distances), where the covariation ofthe histories of genes and species can be integrated in a sin-gle ordination analysis. Even though a consensus typologyis proposed by the method, MCOA takes into account allthe gene trees in order to build it. We believe this makesa better use of the individual tree information than otherforms of phylogenetic consensus that do not take intoaccount outliers or use only a part of the available data.Our method is unique in simultaneously detecting outlierspecies and genes from the comparison of multiple genetrees.
One important aspect of tree comparisons involves theestimation of pairwise distance matrices. The Eigen decom-
position and further ordination represented in the consen-sus typology take the distances as absolute distances.Therefore, the consensus typology will change dependingon the type of distance that is actually being measured.There are two ways to calculate the distance betweentwo species in a phylogenetic tree: the sum of the branchlengths or the number of nodes separating the species inthe trees. Both distance metrics have advantages and dis-advantages: using the sum of the branch lengths (patristicdistances) allows incorporating the relative rates of evolu-tion but can artificially bring closer species that have smallbranch lengths and separate species with longer branches.
Conversely, if the internodal distance separating species isused, then the consensus typology informs us about a de-gree of phylogenetic proximity, but it tells us nothing aboutthe relative rates of evolution. It is important to be explicitabout the pairwise distance method employed in orderto derive conclusions, as genes identified as outliers whenusing one method may seem concordant with other geneswhen using another one. Comparing the results given byboth methods can be very informative. A gene that is iden-tified as an outlier when using patristic distances but notwhen using nodal distances will represent a gene evolvingat higher or slower rates in some species compared withother genes.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
10
http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/ -
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
11/12
Phylo-MCOA allows weighing trees according to differ-ent scores (e.g., likelihood scores), which ensures that morecredible trees are given more weight in the analysis. It isequally important to assign weights to nodes to distinguishsupported relationships from spurious or dubious ones. Inprinciple, one could incorporate a measure of node supportinto MCOA metrics; however, this is not straightforward.At present, we have arrived at a solution that may alleviate
the problem of node support assignment to some extent.We have incorporated the possibility of collapsing lowlysupported nodes, according to a cutoff value given bythe user. In this way, the nodes that are poorly supportedwill contribute little or nothing to the construction of theconsensus typology. Assessing node support across phylog-enies is difficult, especially with cases of missing or dupli-cated data, as not all nodes will be present in all trees. Forthis reason, we suggest the use of Treeko (Marcet-Houbenand Gabaldon 2011) for trees with duplicated genes, priorto analysis with Phylo-MCOA. Interestingly, our methodcan be used even in the absence of completely sampled
data sets. The program can deal with gene losses, as dem-onstrated by randomly deleting species from the simulateddata analyzed.
We have shown that Phylo-MCOA is able to correctlydetect outlier genes and species from the comparison ofmultiple phylogenetic trees. We have shown this by ana-lyzing simulated data sets but, more interestingly, fromthe analysis of real data sets. We have shown, with two dif-ferent examples, how Phylo-MCOA can detect trueoutliers. First, with the fungal data set by Aguileta et al.(2008), where we found the same concordant and discor-dant genes as in the original study; and second, where weuse Phylo-MCOA to predict which cell-by-cell outliers need
to be removed from phylogenetic reconstruction in orderto recover the animal phylogeny obtained after removingnoisy data, as done by Philippe et al. (2011).
Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Phylo-MCOA can detecttwo types of outliers. Complete outliers can identify whichgenes and species deviate from the evolutionary history ofmost genes analyzed. This information can be very usefulfor filtering out noisy data in phylogenetic reconstruction.On the other hand, cell-by-cell outliers point to particularcases of topological discordance, where one specific species
is an outlier in a given gene. This information can be used topostulate specific hypotheses concerning particular genesor species. Although cell-by-cell outlier detection is imple-mented in the current Phylo-MCOA version, there is roomfor improvement, specifically in the way outliers are scoredto define significance.
Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Nevertheless, caution shouldalways be exerted because the method will give moreweight to the information given by the majority of genesunder study, and if the data set analyzed is plaguedwith dubious sequences, the method will favor an incorrect
signal. This is what we observed when we analyzed the dataset ofSchierwater et al. (2009) that is known to be riddenwith multiple faulty sequences. External information on thequality of the sequences should be employed wheneverpossible prior to analysis with our method.
Supplementary Material
Supplementary table S1 and data set S1 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
We would like to thank Herve Philippe for the time hespent on this manuscript and the helpful commentsand suggestions he made. We thank David Moreira for con-structive discussions, Avner Bar-Hen for interesting com-ments on the use of MCOA for phylogenomics analysis,Vincent Berry for suggestions on future developments
and comments on distances between species, and FranSupek for his suggestions on statistical significance testing.Finally, we thank Tatiana Giraud and Toni Gabaldon forinsightful comments on the manuscript as well as threeanonymous reviewers. D.M.V. acknowledges postdoctoralgrants from the Centre National de la RechercheScientifique (CNRS) and Universite Paris-Sud 11. G.A. ac-knowledges postdoctoral grants from the CNRS, UniversiteParis-Sud 11, and Agence National de la Recherche (ANR-06-BLAN-0201).
ReferencesAguileta G, Marthey S, Chiapello H, Lebrun MH, Rodolphe F,
Fournier E, Gendrault-Jacquemard A, Giraud T. 2008. Assessing
the performance of single-copy genes for recovering robust
phylogenies. Syst Biol. 57:613627.
Ane C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian
estimation of concordance among gene trees. Mol Biol Evol. 24:
412426.
Bady P, Doledec S, Dumont B, Fruget JF. 2004. Multiple co-inertia
analysis: a tool for assessing synchrony in the temporal
variability of aquatic communities. C R Biol. 327:2936.
Berthouly C, BedHom B, Tixier-Boichard M, Chen CF, Lee YP,
Laloe D, Legros H, Verrier E, Rognon X. 2008. Using molecular
markers and multivariate methods to study the genetic diversity
of local European and Asian chicken breeds. Anim Genet. 39:
121129.
Brochier C, Bapteste E, Moreira D, Philippe H. 2002. Eubacterial
phylogeny based on translational apparatus proteins. Trends
Genet. 18:15.
Chessel D, Dufour AB, Thioulouse J. 2004. The ade4 package-I-one-
table methods. R News. 4:510.
Chessel D, Hanafi M. 1996. Analyses de la co-inertie de K nuages de
points. Rev Stat Appl. 44:3560.
Cranston KA, Rannala B. 2007. Summarizing a posterior distribution
of trees using agreement subtrees. Syst Biol. 56:578590.
de Vienne DM, Aguileta G, Ollier S. 2011. The Euclidean nature of
phylogenetic distance matrices. Syst Biol. 60:826832.
Degnan JH, Rosenberg NA. 2006. Discordance of species trees with
their most likely gene trees. PLoS Genet. 2:762768.
Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE
11
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1 -
7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317
12/12
Dray S, Dufour AB, Chessel D. 2007. The ade4 package-II: two-table
and K-table methods. R News. 7:4752.
Edwards SV, Liu L, Pearl DK. 2007. High-resolution species trees
without concatenation. PNAS 104:59365941.
Felsenstein J. 1985. Confidence-limits on phylogeniesan approach
using the bootstrap. Evolution 39:783791.
Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species
phylogenies from multiple genes: concatenated sequence tree
versus consensus gene tree. J Exp Zool B Mol Dev Evol. 304:6474.
Gordon AD. 1986. Consensus supertreesthe synthesis of rootedtrees containing overlapping sets of labeled leaves. J Classif. 3:
335348.
Gower JC. 1984. Distance matrices and their Euclidean approxima-
tion. In: Jambu M, Diday E, Lebart L, Pages J, Tomassone R,
editors. Data analysis and informatics III. Amsterdam (The
Netherlands): Elsevier. p. 321.
Gower JC, Legendre P. 1986. Metric and Euclidean properties of
dissimilarity coefficients. J Classif. 3:548.
Hedde M, Lavelle P, Joffre R, Jimenez JJ, Decaens T. 2005. Specific
functional signature in soil macro-invertebrate biostructures.
Funct Ecol. 19:785793.
Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of
tree space. Syst Biol. 54:471482.
Huson DH, Bryant D. 2006. Application of phylogenetic networks inevolutionary studies. Mol Biol Evol. 23:254267.
Laloe D, Jombart T, Dufour AB, Moazami-Goudarzi K. 2007.
Consensus genetic structuring and typological value of
markers using multiple co-inertia analysis. Genet Sel Evol.39:545567.
Leigh JW, Schliep K, Lopez P, Bapteste E. 2011. Let them fall where
they may: congruence analysis in massive, phylogenetically
messy datasets. Mol Biol Evol. 28:27732785.
Leigh JW, Susko E, Baumgartner M, Roger AJ. 2008. Testing
congruence in phylogenomic analysis. Syst Biol. 57:104115.
Manly BFJ. 2004. Multivariate statistical methods: a primer. 3rd ed.
London: Chapman & Hall, Ltd.
Marcet-Houben M, Gabaldon T. 2011. TreeKO: a duplication-aware
algorithm for the comparison of phylogenetic trees.NucleicAcids Res. 39(10):e66.
Marthey S, Aguileta G, Rodolphe F, et al. (11 co-authors). 2008.
FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinfor-
matics. 9:456.
Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics
and evolution in R language. Bioinformatics 20:289290.
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M,
Worheide G, Baurain D. 2011. Resolving difficult phylogenetic
questions: why more sequences are not enough. PLoS Biol. 9:
e1000602.
Rodriguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF,Philippe H. 2007. Detecting and overcoming systematic errors in
genome-scale phylogenies. Syst Biol. 56:389399.
Rosenberg NA. 2002. The probability of topological concordance of
gene trees and species trees. Theor Popul Biol. 61:225247.
Roure B, Rodriguez-Ezpeleta N, Philippe H. 2007. SCaFoS: a tool for
selection, concatenation and fusion of sequences for phyloge-
nomics. BMC Evol Biol. 7(1 Suppl):S2.
Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees:
assembling the trees of life. Trends Ecol Evol. 13:105109.
Sarkar D. 2010. Lattice: lattice graphics [Internet]. R package version
0.18-8. Available from: http://CRAN.R-project.org/package=lattice.
Schierwater B, Eitel M, Jakob W, Osigus H-J, Hadrys H, Dellaporta SL,
Kolokotronis S-O, DeSalle R. 2009. Concatenated analysis sheds
light on early metazoan evolution and fuels a modernurmetazoon hypothesis. PLoS Biol. 7:e1000020.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based
phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 22:26882690.
Suchard MA. 2005. Stochastic models for horizontal gene transfer:
taking a random walk through tree space. Genetics 170:419431.
Suchard MA, Weiss RE, Sinsheimer JS, Dorman KS, Patel M,
McCabe ERB. 2003. Evolutionary similarity among genes. J Am
Stat Assoc. 98:653662.
Susko E, Leigh J, Doolittle WF, Bapteste E. 2006. Visualizing and
assessing phylogenetic congruence of core gene sets: a case
study of the gamma-proteobacteria. Mol Biol Evol. 23:10191030.
White WT, Hills SF, Gaddam R, Holland BR, Penny D. 2007. Treeness
triangles: visualizing the loss of phylogenetic signal.Mol Biol Evol.24:20292039.
de Vienne et al. doi:10.1093/molbev/msr317 MBE
http://cran.r-project.org/package=latticehttp://cran.r-project.org/package=latticehttp://cran.r-project.org/package=lattice