
Simão, Waterhouse, et al. 2015, Supplementary Online Material: Page 1 of 13

Supplementary Online Material

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Felipe A. Simão†, Robert M. Waterhouse†*, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov*

Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, rue Michel-Servet 1, 1211 Geneva, Switzerland.

† Equal contribution.

* To whom correspondence should be addressed: [email protected], [email protected]

Contents:

1. BUSCO: Benchmarking Universal Single-Copy Orthologs

1.1. BUSCO selection

1.2. Hidden Markov models, ancestral sequences and block profiles

1.3. Candidate BUSCO matches from genome assemblies

1.4. Gene prediction: assessing genome assemblies and transcriptomes

1.5. BUSCO match assignment

1.6. Classification: Complete, Duplicated, Fragmented, Missing

1.7. Training Augustus gene finding parameters

2. BUSCO completeness versus N50 contiguity

3. BUSCO versus CEGMA assessment of genome assembly completeness

4. BUSCO assessments of genomes, transcriptomes, and gene sets

5. BUSCO and CEGMA analysis run-times

6. References


http://busco.ezlab.org


1. BUSCO: Benchmarking Universal Single-Copy Orthologs

1.1. BUSCO selection

Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups

with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level

orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al.,

2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species

phylogeny requiring genes to be present as single-copy orthologs in at least 90% of the species; in others

they may be lost or duplicated, and to ensure broad phyletic distribution they cannot all be missing from one

sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB

species, excluding only those with unusually high numbers of missing or duplicated orthologs, while

retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can

be expected to be found as a single-copy ortholog in any newly-sequenced genome from the

appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3’078 BUSCO groups),

41 vertebrates (4’425 BUSCO groups), 93 metazoans (1’008 BUSCO groups), 125 fungi (1’438 BUSCO

groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial

BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce

the final BUSCO sets for each clade, representing 2’675 genes for arthropods, 3’023 for vertebrates, 843 for

metazoans, 1’438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected

from (Mende, et al., 2013).
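The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the input dictionaries, and the reading of "cannot all be missing from one sub-clade" as "every sub-clade retains at least one species with the gene" are not taken from the BUSCO or OrthoDB code.

```python
from typing import Dict


def is_busco_candidate(
    copy_counts: Dict[str, int],
    subclade_of: Dict[str, str],
    min_single_copy_fraction: float = 0.90,
) -> bool:
    """Check one orthologous group against the two selection rules.

    copy_counts: species -> number of gene copies in this group
                 (0 = lost, 1 = single-copy, >1 = duplicated).
    subclade_of: species -> sub-clade label (hypothetical input format).
    """
    species = list(subclade_of)

    # Rule 1: single-copy in at least 90% of the species.
    single_copy = sum(1 for s in species if copy_counts.get(s, 0) == 1)
    if single_copy / len(species) < min_single_copy_fraction:
        return False

    # Rule 2 (assumed interpretation): no sub-clade may lack the gene entirely.
    present_in_clade: Dict[str, bool] = {}
    for s in species:
        clade = subclade_of[s]
        has_gene = copy_counts.get(s, 0) > 0
        present_in_clade[clade] = present_in_clade.get(clade, False) or has_gene
    return all(present_in_clade.values())
```

A group that is single-copy in 19 of 21 species still fails if the two remaining species make up a whole sub-clade and both lack the gene.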

1.2. Hidden Markov models, ancestral sequences and block profiles

Hidden Markov models: For each BUSCO group, multiple sequence alignments (MSAs) were built with

ClustalOmega (Sievers and Higgins, 2014) using the orthologous protein sequences of each BUSCO. The

MSAs were then used to build amino acid-level hidden Markov model (HMM) profiles using HMMER 3

(Eddy, 2011). Subsequently, all BUSCO input sequences were searched (hmmsearch) against the complete

library of HMM profiles to identify and remove any BUSCO groups whose members could not be reliably

distinguished from each other by their profiles, and hence ensure reliable profile-delineated orthology. In

total, 376, 852, and 156 groups were removed in this way from the arthropod, vertebrate, and metazoan sets,

respectively, while none were removed for the fungi or eukaryote datasets. The remaining, reliably-

distinguishable BUSCO sets were then analysed to delineate the two parameters ‘expected-score’ and

‘expected-length’ that define the BUSCO-specific cut-offs used to classify a match as orthologous or not and

as complete or not. The ‘expected-score’ cut-off is defined as 90% of the minimum bitscore from an HMM

search of all of a BUSCO group’s members against its own HMM profile (i.e. the lowest scoring match of

the sequences used to build the profile). To be classified as a true ortholog, any BUSCO-matching gene from

the species being assessed (from its genome, transcriptome, or gene set) must score at or above the

‘expected-score’ cut-off. For a match to be classified as ‘complete’, it must satisfy the ‘expected-length’ cut-off, which

is defined using each BUSCO group’s protein length distribution (Figure S1). Any BUSCO-matching gene

from the species being assessed whose protein length falls within two standard deviations (2σ) of the

BUSCO group’s mean length is classified as ‘complete’.
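As a minimal sketch of how the two cut-offs could be derived for one BUSCO group, assuming the self-search bitscores and member protein lengths are already available as plain lists. The function names are illustrative, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def expected_score(self_bitscores: Sequence[float], fraction: float = 0.90) -> float:
    """90% of the lowest bitscore obtained by the group's own members
    when searched against the group's HMM profile."""
    return fraction * min(self_bitscores)


def expected_length_range(
    member_lengths: Sequence[int], n_sigma: float = 2.0
) -> Tuple[float, float]:
    """Length window of mean +/- 2 standard deviations for the group.

    pstdev (population standard deviation) is an assumption of this sketch.
    """
    mu = mean(member_lengths)
    sigma = pstdev(member_lengths)
    return mu - n_sigma * sigma, mu + n_sigma * sigma
```

For a group whose members score 200, 250 and 300 against their own profile, the ‘expected-score’ is 180; a candidate protein is then length-complete if its length lies inside the returned window.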


Consensus sequences: For each BUSCO group, an amino acid consensus sequence was generated from its

respective HMM profile using HMMER’s default hmmemit settings for a majority-rule consensus sequence.

These consensus sequences are used during BUSCO assessments of genome assemblies to search the

genome of the species being assessed to identify the best-matching genomic regions that may encode the

corresponding BUSCO-matching gene.

Figure S1. Distribution of the percent differences between BUSCO group member proteins and the

group’s mean protein length (negative = shorter than the mean, positive = longer than the mean, values

of one and two standard deviations are shown with lines). Insets: spread of BUSCO group member

protein lengths compared to BUSCO group mean lengths for arthropods (left) and vertebrates (right).

Block profiles: For each BUSCO group, a ‘block profile’ was built to guide automated gene predictions

with Augustus (Keller, et al., 2011). Block profiles are position-specific frequency matrices that model

conserved regions of multiple sequence alignments. The BUSCO group block profiles were created from

their corresponding protein multiple sequence alignments using the msa2prfl script from the Augustus

package. Several highly-divergent BUSCO groups failed to produce reliable block profiles, even after

processing their alignments with the Augustus preparealign script, and were therefore removed from the

assessment sets: 27, 149, 51, 0 and 2 BUSCO groups were removed from the arthropod, vertebrate,

metazoan, fungi and eukaryote sets, respectively.


1.3. Candidate BUSCO matches from genome assemblies

Regions in a genome likely to encode BUSCO-matching genes are identified by tBLASTn searches

(Camacho, et al., 2009) with the reconstructed consensus sequences of each BUSCO. Neighbouring

high-scoring segment pairs (HSPs) from the tBLASTn searches are merged if located within 50 Kbp of each other,

thus defining the span of the genomic regions to be evaluated. These genomic regions are then ranked

according to the total length of the consensus sequence aligned, and up to three regions are selected for the

subsequent gene prediction steps. The second- and third-ranked regions must have consensus sequence

alignment lengths of at least 70% of the aligned length of the top ranking region. Selecting more than just the

best candidate BUSCO match allows for the identification of duplicated copies of BUSCOs that are normally

rarely duplicated in the assessed genome, which, if numerous, could indicate erroneously assembled haplotypes. Lastly,

the selected genomic regions are extended with flanking regions of 5 Kbp (small genomes) or 20 Kbp (large genomes);

these are the default values, and users can specify their own flank-extension lengths.
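The candidate-region steps in this section (merge nearby HSPs, rank by aligned consensus length, keep up to three regions, add flanks) can be illustrated as follows. This is a simplified sketch, not the BUSCO implementation: the `Region` class and function names are hypothetical, strand is ignored, and the aligned length of a merged region is approximated by summing its HSPs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    contig: str
    start: int
    end: int
    aligned_length: int  # consensus residues aligned (summed over merged HSPs)


def merge_hsps(hsps: List[Region], max_gap: int = 50_000) -> List[Region]:
    """Merge neighbouring HSPs on the same contig lying within max_gap bp."""
    merged: List[Region] = []
    for hsp in sorted(hsps, key=lambda r: (r.contig, r.start)):
        last = merged[-1] if merged else None
        if last and last.contig == hsp.contig and hsp.start - last.end <= max_gap:
            last.end = max(last.end, hsp.end)
            last.aligned_length += hsp.aligned_length
        else:
            # Copy so that the caller's HSP objects are never modified.
            merged.append(Region(hsp.contig, hsp.start, hsp.end, hsp.aligned_length))
    return merged


def select_candidates(
    regions: List[Region], max_regions: int = 3, min_fraction: float = 0.70
) -> List[Region]:
    """Keep up to three regions; lower-ranked ones need at least 70% of the
    top region's aligned length."""
    ranked = sorted(regions, key=lambda r: r.aligned_length, reverse=True)
    if not ranked:
        return []
    top = ranked[0].aligned_length
    return [r for r in ranked[:max_regions] if r.aligned_length >= min_fraction * top]


def add_flanks(region: Region, flank: int, contig_length: int) -> Region:
    """Extend a region by `flank` bp on both sides, clipped to the contig."""
    return Region(
        region.contig,
        max(0, region.start - flank),
        min(contig_length, region.end + flank),
        region.aligned_length,
    )
```

With a 5 Kbp or 20 Kbp value passed as `flank`, the output of `add_flanks` corresponds to the sequence window handed to the gene prediction step.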

1.4. Gene prediction: assessing genome assemblies and transcriptomes

The candidate BUSCO-matching regions identified in the previous step are extracted from the genome

being assessed for processing by the Augustus automated gene prediction procedure. Gene prediction is

performed on each candidate region using the corresponding BUSCO group’s block profile, and default gene

finding parameters (unless otherwise specified by the user). Successful Augustus gene prediction for each

BUSCO group produces an initial BUSCO gene set whose protein sequences are then evaluated using the

BUSCO-specific cut-offs to determine true orthology and completeness. High-confidence predicted BUSCO

genes can then be selected from this initial gene set for the training of Augustus to rerun the automated gene

prediction procedure with these specific genome-trained parameters (see below). For assessing

transcriptomes, if the transcripts have not already been pre-processed to extract protein-coding genes then the

longest open reading frame (ORF) is selected for assessment.
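For transcripts, selecting the longest ORF might look like the sketch below. The text does not define what counts as an ORF, so this example assumes an ATG-to-stop definition, scans all six reading frames, and uses the standard stop codons; the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf(transcript: str) -> str:
    """Return the longest ATG..stop ORF (stop codon included) found in any
    of the six reading frames, or an empty string if there is none."""
    best = ""
    forward = transcript.upper()
    for strand in (forward, reverse_complement(forward)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None  # look for the next ORF in this frame
    return best
```

The translated ORF would then be assessed in the same way as a predicted protein from a genome assembly.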

1.5. BUSCO match assignment

This step uses the properties of each BUSCO group’s HMM profile to determine whether a significantly

matching protein sequence is likely orthologous or just homologous. Significant matches are first determined

by searching the full set of protein sequences to be assessed against the complete library of BUSCO group

HMM profiles using HMMER’s hmmsearch. As described above, filtering of the initial BUSCO sets ensured

that each library contains only reliably-distinguishable profiles. The set of protein sequences to be assessed

may be from the Augustus-predicted BUSCO gene set, a transcriptome-based gene set, or the annotated

‘Official Gene Set’ (OGS). For each hmmsearch sequence-profile alignment, two measures are computed

and evaluated: the alignment bitscore and the total length of sequence aligned to the HMM profile. For a

BUSCO-matching gene to be considered orthologous, the alignment bitscore must be greater than or equal to

the ‘expected-score’ of the corresponding BUSCO group (see above for ‘expected-score’ definition). Genes

that pass the ‘expected-score’ cut-off are then evaluated for protein length completeness as described below.


1.6. Classification: Complete, Duplicated, Fragmented, Missing

The final stage of the assessment classifies each arthropod, vertebrate, metazoan, fungal, or eukaryote

BUSCO as complete, duplicated, fragmented, or missing from the gene set being assessed. Classification of

BUSCO-matching genes that meet the ‘expected-score’ cut-off employs the protein length distribution of

each BUSCO to determine whether the ortholog is ‘Complete’ or ‘Fragmented’. Orthologs are considered to

be ‘Complete’ if the length of their aligned sequence is within two standard deviations (2σ) of the BUSCO

group’s mean length (i.e. 95% expectation), otherwise they are classified as ‘Fragmented’ recoveries (Figure

S1). A BUSCO is classified as ‘Duplicated’ when multiple BUSCO-matching genes meet both the

‘expected-score’ and the ‘expected-length’ cut-offs, i.e. multiple copies of full-length orthologs are found in

the gene set being assessed. Lastly, any BUSCO without a BUSCO-matching gene that meets the ‘expected-

score’ cut-off is classified as ‘Missing’.
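The classification logic of this section can be summarised as a short function. It is a sketch under the assumption that each candidate match is reduced to a (bitscore, aligned length) pair; the names are illustrative and not those of the BUSCO code.

```python
from typing import Iterable, Tuple


def classify_busco(
    matches: Iterable[Tuple[float, int]],
    expected_score: float,
    mean_length: float,
    sd_length: float,
    n_sigma: float = 2.0,
) -> str:
    """Classify one BUSCO from its candidate matches.

    matches: (bitscore, aligned_length) pairs for genes matching this BUSCO.
    """
    # Only matches at or above the 'expected-score' count as orthologs.
    orthologs = [(score, length) for score, length in matches if score >= expected_score]
    if not orthologs:
        return "Missing"

    # Orthologs within mean +/- 2 sigma of the group length are full-length.
    full_length = [
        length for _, length in orthologs
        if abs(length - mean_length) <= n_sigma * sd_length
    ]
    if len(full_length) > 1:
        return "Duplicated"  # several complete copies
    if len(full_length) == 1:
        return "Complete"
    return "Fragmented"
```

Note that in the summary notation used later in this document the duplicated fraction is reported in brackets as part of the complete fraction.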

1.7. Training Augustus gene finding parameters

Augustus predictions using generic parameters may produce sub-optimal gene predictions. Training the

prediction parameters using the most reliable gene structures obtained from the initial set of predictions can

substantially improve the results. To train Augustus, BUSCO-matching genes classified as ‘Complete’ and

single-copy are selected to form a high-quality training dataset. The selected gene structures are extracted,

and used to build GenBank files (gff2smallgb) suitable for training Augustus (etraining). This procedure

results in the creation of genome-specific gene finding parameters; for the vast majority of genomes

evaluated, when compared to ‘generic’ gene finding parameters, these genome-specific parameters result in

substantial increases in the sensitivity and specificity of Augustus predictions, both at gene and exon levels.

A second round of Augustus gene prediction is then performed using these genome-specific parameters on

all BUSCO-matching candidate regions where initial predictions failed or did not yield a ‘Complete’

ortholog. Orthology assessment, protein length evaluations, and final classifications are then performed as

outlined above to produce the final BUSCO assessment results.

Augustus allows for the possibility of further sensitivity and specificity gains by applying multiple rounds

of metaparameter optimisation performed using OptimizeAugustus. However, this extra optimisation step

comes at the cost of generally more than double the run-time for a typical genome assembly assessment,

without large improvements in assessment sensitivity. Thus, for default genome assembly assessments, this

extra optimisation step is not performed unless specified by the user (--long mode). This option is made

available to users because although the improvements from this extra optimisation step are minimal for the

purposes of assembly assessments, they can prove valuable when using BUSCO sets to train gene predictors

for subsequent use as part of multi-evidence-based whole genome annotation pipelines.

2. BUSCO completeness versus N50 contiguity

BUSCO assessment of genome assembly completeness is designed to provide a more detailed

quantification of assembly quality than traditional measures such as scaffold N50 metrics of assembly

contiguity. Comparing BUSCO completeness with N50 contiguity for a selection of genomes ranging from

fragmented draft assemblies to chromosome-level genome assemblies reveals a low correlation (r = 0.149)

between these measures (Figure S2). Thus, even fragmented assemblies with relatively low N50 values can

encode fairly complete gene sets, and some assemblies that appear to be of good quality based on contiguity

measures are not necessarily more complete in terms of expected gene content. Additionally, when assessing

gene sets, it is clear that species with very high gene counts are not necessarily the most complete, nor are

those with rather low gene counts necessarily incomplete (Waterhouse, 2015). For a typical eukaryotic draft

assembly, BUSCO assessments suggest that assemblies with N50 values on the order of 50 Kbp are capable

of yielding fairly complete gene sets.
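For reference, the scaffold N50 used in this comparison is the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly length. A minimal sketch follows; the function name and list input are illustrative.

```python
from typing import Iterable


def n50(scaffold_lengths: Iterable[int]) -> int:
    """Length L such that scaffolds of length >= L contain at least half
    of the total assembled bases."""
    lengths = sorted(scaffold_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")
```

Because N50 depends only on the length distribution of the scaffolds, it carries no information about whether the expected genes are actually present, which is consistent with the low correlation reported above.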


Figure S2. BUSCO completeness versus N50 contiguity. Nine outliers with N50 values above 10’000 Kbp

are not shown, each of which achieves more than 90% BUSCO completeness.

3. BUSCO versus CEGMA assessment of genome assembly completeness

The Core Eukaryotic Genes Mapping Approach (CEGMA) is a widely-used method to assess genome

assembly completeness in terms of gene content (Parra, et al., 2007; Parra, et al., 2009), but does not provide

a means for directly assessing gene sets. CEGMA employs a set of 248 conserved Core Eukaryotic Genes

(CEGs) expected to be present in any newly sequenced eukaryotic genome. The CEGs are derived from

eukaryotic KOGs (Tatusov, et al., 2003) and are composed of orthologous protein sequences from six

eukaryotic species (human, fruit fly, roundworm, thale cress, fission yeast and baker’s yeast), for which a

corresponding HMM profile is built from their multiple sequence alignments.

In order to perform a like-for-like comparison of the CEGMA and BUSCO genome assembly and gene

set assessments, a subset of 250 of the 429 eukaryote BUSCOs was selected with the lowest variations of

their ‘expected-score’ and ‘expected-length’ parameters. As the CEGMA pipeline does not perform gene set

assessments, an analysis pipeline was built to use the CEGMA HMM profiles instead of the BUSCO HMM

profiles. In addition, the pipeline employed the cut-offs that CEGMA uses to determine the presence/absence

(from the provided ‘cutoff_file’ with the cut-offs for CEGMA HMMs) and complete/partial (complete,

>70% CEG length) status of potentially orthologous matches.

Thus, BUSCO assessments of genome assemblies and gene sets were performed with normal default

options except for substituting the full eukaryote BUSCO set with a subset of only 250 in order to match the

number of CEGMA CEGs. The CEGMA assessments of genome assemblies were performed with normal

default options, and CEGMA assessments of gene sets were enabled by building a pipeline to use CEGMA

HMM profiles and cut-offs. The results for the assessments of 40 species are shown in Figure 2 of the main

text. They reveal generally consistent BUSCO assessments across highly divergent lineages from fungi to

human, with somewhat less consistent results from the CEGMA assessments (BUSCO linear regression

more closely follows the diagonal than that of CEGMA).


Linear regressions of each set, adjusted $R^2$:

BUSCO: $R^2 = 0.718$

CEGMA: $R^2 = 0.413$

$R^2 = \mathrm{SSR} / \mathrm{SST}$, where $\mathrm{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$ and $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$,

$y_i$ is the $i$th observed value,

$\hat{y}_i$ is the $i$th expected value from the best-fit line,

and $\bar{y}$ is the mean of $y$.

To evaluate against the diagonal (x = y) instead of the best-fit line, the expected value ($\hat{y}_i$) simply becomes the x value ($x_i$), and there is no intercept term (i.e. the line passes through x = y = 0), so:

$R^2_{(x=y)} = 1 - \mathrm{SSE} / \mathrm{SST}$, where $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$.

BUSCO: $R^2_{(x=y)} = 1 - (1281.6 / 3440.5) = 0.63$

CEGMA: $R^2_{(x=y)} = 1 - (5944.3 / 1936.3) = -2.07$
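The diagonal-based values can be checked with a few lines of Python. The sketch below recomputes both results from the sums of squares quoted above and also shows how the same quantity would be obtained from paired values; the function names are illustrative, and the per-species data points themselves are not reproduced here.

```python
from typing import Sequence


def r_squared_from_sums(sse: float, sst: float) -> float:
    """R^2 against a fixed line, from the error and total sums of squares."""
    return 1 - sse / sst


def r_squared_vs_diagonal(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of y against the diagonal x = y, i.e. with expected value x_i."""
    y_mean = sum(y) / len(y)
    sse = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - sse / sst


busco = r_squared_from_sums(1281.6, 3440.5)
cegma = r_squared_from_sums(5944.3, 1936.3)
print(round(busco, 2), round(cegma, 2))  # 0.63 -2.07
```

A negative value, as obtained for CEGMA, means that the diagonal describes the points worse than a horizontal line at the mean of y would.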

4. BUSCO assessments of genomes, transcriptomes, and gene sets

The BUSCO assessment pipeline was applied to 70 available genome assemblies and their corresponding

official gene sets, as well as to 93 additional gene sets, and 96 transcriptomes. The detailed results are shown

in Table S1 in C[D],F,M,n BUSCO notation. The evaluated genome assemblies include both high quality

reference genomes (e.g. Homo sapiens), as well as de novo assemblies of non-model organisms, sampling a

wide range of different fold-coverage levels, N50 sizes, sequencing technologies, and assembly strategies.

These genomes represent the four major BUSCO lineages with 41 arthropods from 13 different orders, 3

vertebrates from 3 different orders, 11 basal metazoans, and 15 fungal species from 12 different orders. The

gene sets chosen for these assessments comprise: 41 arthropods, 26 vertebrates, 11 basal metazoans and 15

fungal species. 96 transcriptomes were also evaluated; sequences were typically derived from mRNA

extracted from different tissue types. The transcriptomes analysed cover a total of 11 fungal species (14

transcriptomes), 39 arthropods (44 transcriptomes), 18 vertebrates (28 transcriptomes) and 10 basal

metazoans (13 transcriptomes). Duplications [D] were not assessed (n.a.) for unfiltered gene sets or

transcriptomes that contained multiple transcripts of the same gene as this would lead to overestimates of

BUSCO duplications.
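For downstream comparisons it can be convenient to turn the C[D],F,M,n notation into numbers. The parser below is a sketch written for strings of the form used in Table S1; the function name is illustrative, and the handling of a literal "n.a." in the duplicated field is an assumption about how unassessed duplications are written.

```python
import re
from typing import Dict, Optional, Union

_BUSCO_NOTATION = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[D:(?P<D>[\d.]+|n\.a\.)%?\]\s*,\s*"
    r"F:(?P<F>[\d.]+)%\s*,\s*M:(?P<M>[\d.]+)%\s*,\s*n:(?P<n>\d+)"
)


def parse_busco_notation(text: str) -> Dict[str, Optional[Union[float, int]]]:
    """Parse e.g. 'C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023' into a dict.

    Percentages are returned as floats, n as an int, and D as None when
    it is given as 'n.a.' (assumed spelling for 'not assessed').
    """
    match = _BUSCO_NOTATION.search(text)
    if match is None:
        raise ValueError(f"not in BUSCO notation: {text!r}")
    duplicated = match.group("D")
    return {
        "C": float(match.group("C")),
        "D": None if duplicated == "n.a." else float(duplicated),
        "F": float(match.group("F")),
        "M": float(match.group("M")),
        "n": int(match.group("n")),
    }
```

Since the reported percentages are rounded, C + F + M of a parsed row should only be expected to come close to 100.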

Table S1. Current assessment completeness metrics in BUSCO notation (C:complete [D:duplicated],

F:fragmented, M:missed, n:genes) sampling different types of data and a variety of eukaryotic species.

Lineage Species Sample type Identifier N50 (Kbp) BUSCOs assessment

Vertebrates

Homo sapiens Genome GCA_000001405.15 67,794 C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023

Gene set GRCh37.75 C:99% [D:1.7%], F:0.0%, M:0.0%, n:3023

Mus musculus Genome GCA_000001635.4 52,589 C:78% [D:3.0%], F:19%, M:2.5%, n:3023

Gene set GRCm38.75 C:99% [D:2.5%], F:99%, M:0.1%, n:3023

Ornithorhynchus anatinus Genome GCF_000002275.2 991 C:55% [D:0.8%], F:25%, M:18%, n:3023

Gene set OANA5.75 C:72% [D:1.1%], F:19%, M:8.2%, n:3023

Callithrix jacchus

Gene set C_jacchus3.2.1.75 C:97% [D:2.9%], F:1.7%, M:0.8%, n:3023

Transcriptome GI:532219616 Bladder C:76% [D:17%], F:5.5%, M:18%, n:3023

Transcriptome GI:532292355 Hippocampus C:79% [D:18%], F:4.5%, M:15%, n:3023

Transcriptome GI:532349506 Cortex C:34% [D:7.6%], F:34%, M:64%, n:3023

Transcriptome GI:532452938 S. muscle C:69% [D:13%], F:6.0%, M:24%, n:3023

Transcriptome GI:532524775 Cerebellum C:76% [D:19%], F:5.1%, M:18%, n:3023

Pan troglodytes

Gene set CHIMP2.14.75 C:96% [D:0.5%], F:1.2%, M:1.9%, n:3023

Transcriptome GI:410228237 adipose SC C:75% [D:15%], F:3.8%, M:20%, n:3023

Transcriptome GI:410308999 Fibroblast C:75% [D:16%], F:3.7%, M:21%, n:3023


Transcriptome GI:410268357 Endothelium C:75% [D:15%], F:3.5%, M:21%, n:3023

Anolis carolinensis

Gene set AnoCar2.0.75 C:89% [D:2.6%], F:6.8%, M:3.4%, n:3023

Transcriptome GI:614142443 Skeletal C:58% [D:14%], F:8.7%, M:32%, n:3023

Transcriptome GI:464801713 Whole C:27% [D:15%], F:18%, M:53%, n:3023

Latimeria chalumnae Transcriptome GI:387756559 Muscle C:37% [D:6.9%], F:11%, M:50%, n:3023

Rana clamitans Transcriptome GI:451274083 Unknown C:21% [D:0.3%], F:13%, M:65%, n:3023

Pseudacris regilla Transcriptome GI:451272305 Unknown C:20% [D:0.4%], F:16%, M:63%, n:3023

Salmo salar Transcriptome GI:666988260 Mixed C:19% [D:7.8%], F:6.6%, M:74%, n:3023

Oreochromis niloticus Transcriptome GI:555682626 Spleen C:39% [D:0.4%], F:16%, M:44%, n:3023

Ameiurus nebulosus Transcriptome GI:472819489 Unknown C:7.3% [D:0.2%], F:10%, M:82%, n:3023

Ursus maritimus Transcriptome GI:510063642 Fat C:50% [D:29%], F:5.5%, M:44%, n:3023

Tripterygion delaisi Transcriptome GI:572723144 Brain C:35% [D:13%], F:17%, M:47%, n:3023

Atractaspis aterrima Transcriptome GI:673456880 Venom C:0.7% [D:0.0%], F:1.0%, M:98%, n:3023

Transcriptome GI:673404158 Venom C:4.4% [D:0.5%], F:6.8%, M:88%, n:3023

Latimeria menadoensis Transcriptome GI:559559797 Testis C:71% [D:15%], F:6.5%, M:22%, n:3023

Hynobius chinensis Transcriptome GI:570932341 Unknown C:59% [D:7.3%], F:13%, M:26%, n:3023

Carduelis chloris Transcriptome GI:617996660 Blood C:31% [D:0.2%], F:12%, M:55%, n:3023

Maylandia zebra Transcriptome GI:614241491 Kidney C:64% [D:15%], F:8.7%, M:26%, n:3023

Chinchilla lanigera Transcriptome GI:618625375 Trachea C:80% [D:44%], F:5.7%, M:14%, n:3023

Ailuropoda melanoleuca Gene set ailMel1.75 C:97% [D:1.3%], F:1.8%, M:0.3%, n:3023

Bos taurus Gene set UMD3.175 C:97% [D:1.3%], F:1.6%, M:0.5%, n:3023

Danio rerio Gene set Zv9.75 C:95% [D:8.3%], F:3.2%, M:1.7%, n:3023

Felis catus Gene set Felis_catus_6.2.75 C:96% [D:1.2%], F:2.8%, M:0.5%, n:3023

Ficedula albicollis Gene set FicAlb_1.4.75 C:88% [D:2.0%], F:4.1%, M:7.8%, n:3023

Gallus gallus Gene set Galga4.75 C:90% [D:2.4%], F:3.5%, M:6.0%, n:3023

Gorilla gorilla Gene set gorGor3.1.75 C:96% [D:2.6%], F:1.7%, M:2.1%, n:3023

Loxodonta africana Gene set loxAfr3.75 C:96% [D:1.5%], F:2.3%, M:1.0%, n:3023

Macaca mulatta Gene set MMUL_1.75 C:94% [D:2.0%], F:4.5%, M:0.9%, n:3023

Monodelphis domestica Gene set BROADO5.75 C:95% [D:4.0%], F:2.3%, M:1.6%, n:3023

Mustela putorius Gene set MusPutFur1.0.75 C:97% [D:1.4%], F:1.7%, M:1.0%, n:3023

Oreochromis niloticus Gene set Orenil1.0.75 C:96% [D:5.1%], F:1.4%, M:2.5%, n:3023

Oryctolagus cuniculus Gene set OryCun2.0.75 C:93% [D:2.7%], F:3.0%, M:3.2%, n:3023

Oryzias latipes Gene set MEDAKA1.75 C:83% [D:3.2%], F:5.4%, M:11%, n:3023

Pongo abelii Gene set PPYG2.75 C:95% [D:1.1%], F:3.3%, M:1.1%, n:3023

Sus scrofa Gene set Sscrofa10.2.75 C:83% [D:7.4%], F:6.8%, M:10%, n:3023

Taeniopygia guttata Gene set taeGut3.2.4.75 C:81% [D:3.2%], F:7.5%, M:11%, n:3023

Takifugu rubripes Gene set FUGU4.75 C:89% [D:5.2%], F:3.5%, M:7.3%, n:3023

Xenopus tropicalis Gene set JGI_4.2.75 C:93% [D:3.4%], F:3.5%, M:2.5%, n:3023

Xiphophorus maculatus Gene set Xipmac4.4.2.75 C:93% [D:3.6%], F:4.7%, M:1.3%, n:3023

Arthropods

Acromyrmex echinatior Genome Aech_2.0 1,110 C:91% [D:2.6%], F:8.0%, M:0.6%, n:2675

Gene set Aech_OGS_v3.8 C:96% [D:8.8%], F:2.8%, M:0.5%, n:2675

Acyrtosiphon pisum Genome GCA_000142985.2 86 C:72% [D:6.1%], F:15%, M:12%, n:2675

Gene set GCA_000142985.2.22 C:89% [D:14%], F:4.1%, M:5.9%, n:2675

Aedes aegypti Genome AaegL3 1,547 C:86% [D:13%], F:10%, M:3.2%, n:2675

Gene set AaegL3.2 C:93% [D:17%], F:3.6%, M:3.0%, n:2675

Anopheles gambiae Genome AgamP4 49,364 C:93% [D:4.7%], F:4.1%, M:2.5%, n:2675

Gene set AgamP4.2 C:97% [D:10%], F:1.4%, M:0.8%, n:2675

Apis mellifera Genome Amel_v4.5 997 C:93% [D:2.9%], F:5.1%, M:0.9%, n:2675

Gene set Amel_OGS_v3.2 C:97% [D:9%], F:2.1%, M:0.1%, n:2675

Atta cephalotes Genome Acep 1.0 5,154 C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675

Gene set Acep OGS v1.2 C:91% [D:7.7%], F:7.5%, M:0.5%, n:2675

Bombyx mori Genome GCA_000151625.1 4,008 C:73% [D:2.2%], F:17%, M:8.3%, n:2675

Gene set GLEAN set C:75% [D:7.0%], F:14%, M:10%, n:2675

Camponotus floridanus Genome Cflor_v3.3 451 C:92% [D:3.1%], F:6.6%, M:0.5%, n:2675

Gene set Cflor_OGS_v3.3 C:95% [D:8.7%], F:3.9%, M:0.4%, n:2675

Danaus plexippus Genome DanPle_1.0.22 52 C:83% [D:8.6%], F:11%, M:4.3%, n:2675

Gene set DanPle_1.0.22 C:86% [D:9.0%], F:9.5%, M:3.7%, n:2675

Daphnia pulex Genome GCA_000187875.1 642 C:83% [D:3.9%], F:11%, M:5.1%, n:2675

Gene set GCA_000187875.1.22 C:84% [D:10%], F:11%, M:4.0%, n:2675

Dendroctonus ponderosae Genome GCA_000355655.2 628 C:77% [D:6.1%], F:15%, M:7.2%, n:2675

Gene set GCA_000355655.2.22 C:82% [D:11%], F:10%, M:6.6%, n:2675

Drosophila ananassae Genome Dana_r1.3 4,599 C:96% [D:3.7%], F:1.9%, M:1.9%, n:2675


Gene set Dana_r1.3 C:98% [D:9.6%], F:0.8%, M:0.1%, n:2675

Drosophila erecta Genome Dere_r1.3 18,748 C:98% [D:4.7%], F:1.4%, M:0.4%, n:2675

Gene set Dere_r1.3 C:99% [D:9.3%], F:0.2%, M:0.1%, n:2675

Drosophila grimshawi Genome Dgri_r1.3 8,399 C:97% [D:6.2%], F:2.2%, M:0.4%, n:2675

Gene set Dgri_r1.3 C:99% [D:11%], F:0.4%, M:0.0%, n:2675

Drosophila melanogaster Genome Dmel_r5.55 23,011 C:98% [D:6.4%], F:0.6%, M:0.3%, n:2675

Gene set Dmel_r5.55 C:99% [D:9.1%], F:0.2%, M:0.0%, n:2675

Drosophila mojavensis Genome Dmoj_r1.3 24,764 C:97% [D:4.4%], F:2.2%, M:0.4%, n:2675

Gene set Dmoj_r1.3 C:99% [D:9.6%], F:0.8%, M:0.1%, n:2675

Drosophila persimilis Genome Dper_r1.3 1,869 C:93% [D:5.6%], F:5.8%, M:0.8%, n:2675

Gene set Dper_r1.3 C:93% [D:9.3%], F:5.6%, M:0.7%, n:2675

Drosophila pseudoobscura Genome Dpse_r3.1 12,541 C:96% [D:6.3%], F:2.2%, M:1.1%, n:2675

Gene set Dpse_r3.1 C:98% [D:11%], F:0.6%, M:0.6%, n:2675

Drosophila sechellia Genome Dsec_r1.3 2,123 C:96% [D:5.1%], F:2.8%, M:0.7%, n:2675

Gene set Dsec_r1.3 C:96% [D:8.9%], F:3.0%, M:0.3%, n:2675

Drosophila simulans Genome Dsim_r1.4 857 C:85% [D:4.6%], F:9.0%, M:5.0%, n:2675

Gene set Dsim_r1.4 C:84% [D:7.6%], F:6.9%, M:8.0%, n:2675

Drosophila virilis Genome Dvir_r1.2 10,161 C:96% [D:5.2%], F:2.4%, M:0.6%, n:2675

Gene set Dvir_r1.2 C:99% [D:9.6%], F:0.7%, M:0.1%, n:2675

Drosophila willistoni Genome Dwil_r1.3 4,511 C:97% [D:5.5%], F:1.7%, M:0.4%, n:2675

Gene set Dwil_r1.3 C:99% [D:10%], F:0.6%, M:0.2%, n:2675

Drosophila yakuba Genome Dyak_r1.3 21,770 C:97% [D:6.5%], F:1.5%, M:0.7%, n:2675

Gene set Dyak_r1.3 C:98% [D:10%], F:0.8%, M:0.2%, n:2675

Harpegnathos saltator Genome Hsal_v3.3 601 C:89% [D:3.2%], F:9.6%, M:1.1%, n:2675

Gene set Hsal_OGS_v3.3 C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675

Heliconius melpomene Genome Hmel_v1.22 194 C:77% [D:2.0%], F:11%, M:10%, n:2675

Gene set Hmel_v1.22 C:74% [D:6.7%], F:14%, M:11%, n:2675

Ixodes scapularis Genome IscaW1 76 C:58% [D:1.7%], F:21%, M:19%, n:2675

Gene set IscaW1.3 C:69% [D:6.6%], F:23%, M:7.1%, n:2675

Linepithema humile Genome Lhum_v1.0 1,402 C:92% [D:3.3%], F:7.0%, M:0.6%, n:2675

Gene set Lhum_OGS_v1.2 C:95% [D:8.8%], F:4.0%, M:0.1%, n:2675

Lutzomyia longipalpis Genome Llonj1.1 85 C:73% [D:6.3%], F:10%, M:16%, n:2675

Gene set Llonj1.1 C:66% [D:9.7%], F:13%, M:20%, n:2675

Manduca sexta Genome GCA_000262585.1 664 C:81% [D:4.4%], F:12%, M:6.1%, n:2675

Gene set OGS2_20140407 C:80% [D:10%], F:10%, M:8.2%, n:2675

Megaselia scalaris Genome Mscal_v1.22 1 C:16% [D:0.6%], F:21%, M:61%, n:2675

Gene set Mscal_v1.22 C:21% [D:1.4%], F:20%, M:58%, n:2675

Metaseiulus occidentalis Genome Mocc_1.0 896 C:76% [D:4.9%], F:12%, M:10%, n:2675

Gene set Mocc_1.0 C:82% [D:14%], F:10%, M:6.5%, n:2675

Musca domestica Genome v2.0.2 226 C:91% [D:4.3%], F:5.3%, M:2.7%, n:2675

Gene set v2.0.2 C:97% [D:29%], F:2.3%, M:0.5%, n:2675

Nasonia vitripennis Genome Nvit_v1.0 698 C:91% [D:6.0%], F:5.1%, M:3.2%, n:2675

Gene set Nvit_OGS_v1.2 C:94% [D:10%], F:4.0%, M:1.1%, n:2675

Pediculus humanus Genome PhumU2 497 C:92% [D:3.9%], F:6.1%, M:1.6%, n:2675

Gene set PhumU2.1 C:93% [D:9.1%], F:4.9%, M:1.3%, n:2675

Phlebotomus papatasi Genome Ppapi1.1 0.87 C:33% [D:3.2%], F:33%, M:33%, n:2675

Gene set Ppapi1.1 C:54% [D:6.1%], F:20%, M:25%, n:2675

Pogonomyrmex barbatus Genome Pbar_v1.0 819 C:90% [D:2.9%], F:8.5%, M:0.7%, n:2675

Gene set Pbar_OGS_v1.2 C:93% [D:8.2%], F:6.5%, M:0.3%, n:2675

Solenopsis invicta Genome Sinv_v1.0 558 C:74% [D:2.4%], F:19%, M:6.3%, n:2675

Gene set Sinv_OGS_v2.2.3 C:80% [D:6.5%], F:14%, M:5.4%, n:2675

Rhodnius prolixus Genome RproC1 847 C:85% [D:2.5%], F:12%, M:2.5%, n:2675

Gene set RprocC1.2 C:74% [D:8.3%], F:9.1%, M:16%, n:2675

Strigamia maritima Genome Smar1.22 139 C:84% [D:5.9%], F:12%, M:3.2%, n:2675

Gene set GCA_000239435.1.22 C:87% [D:12%], F:8.3%, M:4.6%, n:2675

Tetranychus urticae Genome GCA_000239435.1 2,993 C:61% [D:4.5%], F:12%, M:25%, n:2675

Gene set GCA_000239435.1.22 C:69% [D:11%], F:9.6%, M:20%, n:2675

Tribolium castaneum Genome Tcas3.22 19,135 C:95% [D:5.8%], F:3.9%, M:0.8%, n:2675

Gene set Tcas_OGS_v2 C:95% [D:10%], F:3.0%, M:1.3%, n:2675

Acanthoscurria geniculata Transcriptome GI:598795695 whole C:65% [D:n.a.], F:13%, M:20%, n:2675

Anopheles sinensis Transcriptome GI:656597267 unknown C:36% [D:n.a.], F:22%, M:41%, n:2675

Anthonomus grandis Transcriptome GI:562777735 whole C:18% [D:n.a.], F:16%, M:65%, n:2675

Bactrocera dorsalis Transcriptome GI:618068638 unknown C:87% [D:n.a.], F:5.9%, M:6.4%, n:2675

Belgica antarctica Transcriptome GI:418280542 whole C:79% [D:n.a.], F:10%, M:9.8%, n:2675

Calanus finmarchicus Transcriptome GI:592958556 unknown C:84% [D:n.a.], F:7.3%, M:8.5%, n:2675

Transcriptome GI:647215886 unknown C:78% [D:n.a.], F:11%, M:10%, n:2675

Ceratitis capitata Transcriptome GI:577749858 whole C:87% [D:n.a.], F:7.3%, M:5.6%, n:2675

Cherax quadricarinatus Transcriptome GI:512174511 hypodermis C:7.8% [D:n.a.], F:7.6%, M:84%, n:2675

Corydalinae sp. Transcriptome GI:661070030 whole C:14% [D:n.a.], F:20%, M:64%, n:2675

Delia antiqua Transcriptome GI:604701913 whole C:55% [D:n.a.], F:15%, M:28%, n:2675

Dendroctonus frontalis Transcriptome GI:452943093 whole C:56% [D:n.a.], F:22%, M:21%, n:2675

Drosophila ercepeae Transcriptome GI:570540147 unknown C:18% [D:n.a.], F:16%, M:65%, n:2675

Drosophila malerkotliana m. Transcriptome GI:570549742 unknown C:19% [D:n.a.], F:16%, M:64%, n:2675

Drosophila malerkotliana p. Transcriptome GI:570523813 unknown C:29% [D:n.a.], F:24%, M:45%, n:2675

Drosophila merina Transcriptome GI:570504412 unknown C:25% [D:n.a.], F:20%, M:53%, n:2675

Drosophila miranda Transcriptome GI:645592147 unknown C:91% [D:n.a.], F:4.2%, M:4.0%, n:2675

Drosophila pseudoananassae n. Transcriptome GI:570451470 unknown C:6.2% [D:n.a.], F:21%, M:72%, n:2675

Drosophila pseudoananassae p. Transcriptome GI:570485056 whole C:8.5% [D:n.a.], F:21%, M:70%, n:2675

Drosophila serrata Transcriptome GI:480512000 unknown C:40% [D:n.a.], F:22%, M:36%, n:2675

Echinogammarus veneris Transcriptome GI:595402945 unknown C:20% [D:n.a.], F:8.0%, M:71%, n:2675

Enallagma hageni Transcriptome GI:459275420 total C:6.9% [D:n.a.], F:7.6%, M:85%, n:2675

Folsomia candida Transcriptome GI:570625125 unknown C:47% [D:n.a.], F:14%, M:38%, n:2675

Hyalella azteca Transcriptome GI:510074665 unknown C:5.9% [D:n.a.], F:3.8%, M:90%, n:2675

Transcriptome GI:510092454 unknown C:6.6% [D:n.a.], F:5.4%, M:87%, n:2675

Ips typographus Transcriptome GI:459277393 antenna C:19% [D:n.a.], F:20%, M:59%, n:2675

Ixodes scapularis Transcriptome GI:604952323 Synganglion C:27% [D:n.a.], F:26%, M:46%, n:2675

Ixodes ricinus Transcriptome GI:556088131 salivary C:77% [D:n.a.], F:8.4%, M:13%, n:2675

Latrodectus hesperus Transcriptome GI:618730332 unknown C:82% [D:n.a.], F:8.4%, M:9.3%, n:2675

Melita plumosa Transcriptome GI:510208131 whole C:6.4% [D:n.a.], F:6.3%, M:87%, n:2675

Mengenilla moldrzyki Transcriptome GI:660742704 whole C:9.5% [D:n.a.], F:13%, M:76%, n:2675

Musca domestica Transcriptome GI:604923024 unknown C:64% [D:n.a.], F:19%, M:15%, n:2675

Nannochorista philpotti Transcriptome GI:661012745 whole C:31% [D:n.a.], F:31%, M:37%, n:2675

Nilaparvata lugens Transcriptome GI:672467144 salivary C:74% [D:n.a.], F:12%, M:12%, n:2675

Orchesella cincta Transcriptome GI:570587022 unknown C:44% [D:n.a.], F:11%, M:44%, n:2675

Polistes canadensis Transcriptome GI:452055806 multiple C:51% [D:n.a.], F:22%, M:26%, n:2675

Pontastacus leptodactylus Transcriptome GI:556694752 hypodermis C:73% [D:n.a.], F:11%, M:14%, n:2675

Transcriptome GI:557011125 hepatopancreas C:44% [D:n.a.], F:12%, M:43%, n:2675

Priacma serrata Transcriptome GI:661240973 Unknown C:11% [D:n.a.], F:16%, M:72%, n:2675

Spodoptera exigua Transcriptome GI:548816146 unknown C:29% [D:n.a.], F:14%, M:55%, n:2675

Stegodyphus mimosarum Transcriptome GI:598904898 whole C:14% [D:n.a.], F:16%, M:68%, n:2675

Teleopsis dalmanni Transcriptome GI:615270444 whole C:92% [D:n.a.], F:6.0%, M:1.6%, n:2675

Teleopsis whitei Transcriptome GI:619803922 whole C:90% [D:n.a.], F:4.6%, M:5.3%, n:2675

Themira biloba Transcriptome GI:654236640 wildtype C:71% [D:n.a.], F:16%, M:11%, n:2675

Other metazoans

Brugia malayi Genome GCA_000002995.3 37 C:60% [D:1.5%], F:13%, M:25%, n:843

Gene set B_malayi_3.0.22 C:77% [D:9.7%], F:5.1%, M:17%, n:843

Caenorhabditis briggsae Genome CB4 17,512 C:76% [D:2.9%], F:7.5%, M:16%, n:843

Gene set CB4.22 C:85% [D:11%], F:3.5%, M:11%, n:843

Caenorhabditis elegans Genome GCA_000002985.3 17,494 C:85% [D:6.9%], F:2.8%, M:11%, n:843

Gene set WBcel235.22 C:90% [D:11%], F:1.7%, M:7.5%, n:843

Caenorhabditis japonica Genome GCA_000147155.1 94 C:63% [D:4.8%], F:13%, M:22%, n:843

Gene set C_japonica-7.0.1.22 C:67% [D:9.4%], F:11%, M:20%, n:843

Helobdella robusta Genome GCA_000326865.1 3,060 C:74% [D:3.4%], F:10%, M:14%, n:843

Gene set GCA_000326865.1.22 C:85% [D:12%], F:9.9%, M:4.2%, n:843

Loa loa Genome GCA_00018385.2 174 C:80% [D:6.6%], F:2.4%, M:17%, n:843

Gene set Loa_loa_v3.22 C:81% [D:8.5%], F:4.5%, M:14%, n:843

Lottia gigantea Genome GCA_00032785.1 1,870 C:89% [D:2.3%], F:4.3%, M:5.8%, n:843

Gene set GCA_00032785.1.22 C:90% [D:13%], F:7.8%, M:2.1%, n:843

Nematostella vectensis Genome GCA_000209225.1 472 C:78% [D:3.5%], F:10%, M:10%, n:843

Gene set GCA_000209225.1.22 C:83% [D:15%], F:14%, M:2.8%, n:843

Schistosoma mansoni Genome GCA_000237925.2 34,464 C:56% [D:4.3%], F:8.3%, M:34%, n:843

Gene set ASM2379v2.22 C:65% [D:7.8%], F:8.3%, M:26%, n:843

Strongylocentrotus purpuratus Genome GCA_000002235.2 167 C:87% [D:6.5%], F:7.8%, M:4.9%, n:843

Gene set GCA_000002235.2.22 C:83% [D:19%], F:15%, M:0.7%, n:843

Trichoplax adhaerens Genome GCA_000150275.1 5,978 C:81% [D:1.1%], F:7.8%, M:10%, n:843

Gene set ASM1507v1.22 C:85% [D:11%], F:12%, M:2.3%, n:843

Ancylostoma ceylanicum Transcriptome GI:595744344 Unknown C:16% [D:n.a.], F:38%, M:44%, n:843

Aplysia californica

Transcriptome GI:613602134 chemokine C:88% [D:n.a.], F:8.1%, M:2.8%, n:843

Transcriptome GI:614063388 Gills C:88% [D:n.a.], F:8.4%, M:3.5%, n:843

Transcriptome GI:606015213 Heart C:77% [D:n.a.], F:12%, M:9.3%, n:843

Transcriptome GI:594457164 Salivary C:41% [D:n.a.], F:23%, M:34%, n:843

Apostichopus japonicus Transcriptome GI:638469663 Unknown C:68% [D:n.a.], F:24%, M:6.9%, n:843

Asterias amurensis Transcriptome GI:638532954 Unknown C:59% [D:n.a.], F:28%, M:11%, n:843

Bithynia siamensis goniomphalos Transcriptome GI:480970007 Unknown C:57% [D:n.a.], F:24%, M:17%, n:843

Evechinus chloroticus Transcriptome GI:559461775 Unknown C:92% [D:n.a.], F:5.3%, M:2.6%, n:843

Henricia sp. AR-2014 Transcriptome GI:638872012 Unknown C:90% [D:n.a.], F:7.9%, M:1.1%, n:843

Patiria miniata Transcriptome GI:638728087 Ovary C:88% [D:n.a.], F:10%, M:1.1%, n:843

Patiria pectinifera Transcriptome GI:638651248 Unknown C:80% [D:n.a.], F:18%, M:1.6%, n:843

Procotyla fluviatilis Transcriptome GI:528026207 Unknown C:54% [D:n.a.], F:18%, M:26%, n:843

Fungi

Ashbya gossypii Genome GCA_000091025.4 1,476 C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438

Gene set C:95% [D:7.3%], F:3.8%, M:0.9%, n:1438

Aspergillus nidulans Genome GCA_000011425.1 3,704 C:98% [D:1.8%], F:0.9%, M:0.2%, n:1438

Gene set C:95% [D:11%], F:2.8%, M:1.8%, n:1438

Cryptococcus neoformans Genome GCA_000091045.1 1,438 C:92% [D:5.4%], F:2.5%, M:4.8%, n:1438

Gene set C:90% [D:7.1%], F:5.9%, M:3.1%, n:1438

Gibberella zeae Genome GCA_000240135.2 5,350 C:98% [D:1.3%], F:1.3%, M:0.2%, n:1384

Gene set C:97% [D:11%], F:2.0%, M:0.2%, n:1384

Komagataella pastoris Genome GCA_000027005.1 2,394 C:93% [D:5.0%], F:4.5%, M:2.0%, n:1438

Gene set C:93% [D:8.5%], F:3.8%, M:2.7%, n:1438

Neurospora crassa Genome GCA_000182925.1 6,000 C:98% [D:6.5%], F:0.6%, M:0.6%, n:1438

Gene set C:97% [D:10%], F:1.5%, M:0.6%, n:1438

Phaeosphaeria nodorum Genome GCA_000146915.1 1,045 C:96% [D:6.0%], F:3.1%, M:0.2%, n:1438

Gene set C:91% [D:9.7%], F:8.4%, M:0.4%, n:1438

Puccinia graminis Genome GCA_000149925.1 964 C:63% [D:5.6%], F:20%, M:15%, n:1438

Gene set C:85% [D:11%], F:8.0%, M:6.3%, n:1438

Saccharomyces cerevisiae Genome GCA_000146045.2 924 C:96% [D:5.2%], F:0.4%, M:2.7%, n:1438

Gene set C:98% [D:8.6%], F:1.1%, M:0%, n:1438

Schizosaccharomyces pombe Genome GCA_000002945.2 4,539 C:89% [D:3.8%], F:2.7%, M:7.7%, n:1438

Gene set C:90% [D:9.5%], F:5.7%, M:3.3%, n:1438

Sclerotinia sclerotiorum Genome GCA_000146945.1 1,625 C:70% [D:3.5%], F:3.8%, M:25%, n:1438

Gene set C:67% [D:8%], F:7.4%, M:25%, n:1438

Tuber melanosporum Genome GCA_000151645.1 638 C:95% [D:5.0%], F:4.1%, M:0.6%, n:1438

Gene set C:91% [D:9.0%], F:6.2%, M:2.3%, n:1438

Ustilago maydis Genome GCA_000328475.1 127 C:92% [D:5.9%], F:3.1%, M:4.4%, n:1438

Gene set C:88% [D:7.5%], F:6.6%, M:5.0%, n:1438

Verticillium dahliae Genome GCA_000150675.1 1,273 C:95% [D:4.4%], F:3.5%, M:0.9%, n:1438

Gene set C:94% [D:9.4%], F:4.5%, M:0.9%, n:1438

Yarrowia lipolytica Genome GCA_000002525.1 3,633 C:97% [D:5.4%], F:2.1%, M:0.6%, n:1438

Gene set C:96% [D:8.8%], F:2.9%, M:0.6%, n:1438

Agaricus subrufescens Transcriptome GI:645683639 Unknown C:7.7% [D:n.a.], F:28%, M:63%, n:1438

Armillaria ostoyae Transcriptome GI:480500433 RNA1 C:45% [D:n.a.], F:42%, M:11%, n:1438

Hypsizygus marmoreus Transcriptome GI:612225315 Unknown C:59% [D:n.a.], F:34%, M:6.4%, n:1138

Ophiocordyceps sinensis Transcriptome GI:630075070 Unknown C:38% [D:n.a.], F:36%, M:24%, n:1438

Phakopsora pachyrhizi Transcriptome GI:452772923 Thai1 C:9.3% [D:n.a.], F:12%, M:78%, n:1438

Puccinia striiformis f.sp. tritici

Transcriptome GI:509494464 PST C:32% [D:n.a.], F:35%, M:32%, n:1438

Transcriptome GI:509507311 Haustorium C:22% [D:n.a.], F:33%, M:43%, n:1438

Transcriptome GI:509515198 Spore C:17% [D:n.a.], F:32%, M:49%, n:1438

Pyrenochaeta lycopersici Transcriptome GI:589143963 unknown C:94% [D:n.a.], F:4.8%, M:0.1%, n:1438

Spraguea lophii Transcriptome GI:520759716 Spore C:6.4% [D:n.a.], F:11%, M:82%, n:1438

Termitomyces clypeatus Transcriptome GI:595370870 treated C:95% [D:n.a.], F:4.3%, M:0.0%, n:1438

Transcriptome GI:595351039 untreated C:91% [D:n.a.], F:7.5%, M:1.1%, n:1438

Trametes sanguinea Transcriptome GI:511189810 BAFC2126 C:18% [D:n.a.], F:30%, M:50%, n:1438

Uromyces appendiculatus Transcriptome GI:452898896 SWBR1 C:34% [D:n.a.], F:25%, M:39%, n:1438
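The assessment strings in the table above share one compact notation, so they can be read programmatically. The following is a minimal sketch, not part of the BUSCO software: the names `BuscoSummary`, `parse_busco` and `total_percent` are hypothetical, and it assumes that the bracketed D value is a subset of C, so that C, F and M should sum to roughly 100% (the reported percentages are rounded, so small deviations are expected).

```python
import re
from typing import NamedTuple, Optional


class BuscoSummary(NamedTuple):
    """One parsed assessment string (hypothetical helper type)."""
    complete: float              # C, in percent
    duplicated: Optional[float]  # D, in percent; None when given as "n.a."
    fragmented: float            # F, in percent
    missing: float               # M, in percent
    n: int                       # number of BUSCOs in the set


# Matches e.g. "C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675"
# and the transcriptome form "C:65% [D:n.a.], F:13%, M:20%, n:2675".
_PATTERN = re.compile(
    r"C:(?P<c>[\d.]+)%\s*\[D:(?P<d>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<f>[\d.]+)%,\s*M:(?P<m>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco(text: str) -> BuscoSummary:
    """Extract the assessment from a table row or a bare summary string."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError(f"no BUSCO summary found in: {text!r}")
    d = match["d"]
    return BuscoSummary(
        complete=float(match["c"]),
        duplicated=None if d == "n.a." else float(d.rstrip("%")),
        fragmented=float(match["f"]),
        missing=float(match["m"]),
        n=int(match["n"]),
    )


def total_percent(summary: BuscoSummary) -> float:
    """C + F + M; D is assumed to be counted inside C already."""
    return summary.complete + summary.fragmented + summary.missing


if __name__ == "__main__":
    row = "Atta cephalotes Genome Acep 1.0 5,154 C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675"
    summary = parse_busco(row)
    print(summary, total_percent(summary))
```

A total that differs from 100 by more than a few points would suggest a transcription error in the row rather than a rounding effect.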

5. BUSCO and CEGMA analysis run-times

The total run-times of BUSCO and CEGMA assessments of genome assemblies and gene sets, run with default parameters, were evaluated on representative species from different metazoan lineages (Table S2). All analyses were performed using 4 CPUs with up to 8 GB of RAM. BUSCO assessments were performed using the eukaryote and metazoan sets, as well as the largest lineage-specific set for each species.

Table S2. BUSCO and CEGMA assessment run-times on four representative species.

Species                  Dataset                  Analysis                 Run-time
Drosophila melanogaster  Genome, 180 Mbp          2,675 arthropod BUSCOs   7.6h
                                                  843 metazoan BUSCOs      3.2h
                                                  429 eukaryote BUSCOs     1.4h
                                                  250 eukaryote BUSCOs     0.81h
                                                  248 CEGMA genes          2.5h
                         Gene set, 13,918 genes   2,675 arthropod BUSCOs   1.4h
                                                  843 metazoan BUSCOs      0.5h
                                                  429 eukaryote BUSCOs     0.36h
                                                  250 eukaryote BUSCOs     0.15h
                                                  248 CEGMA genes          N/A
Heliconius melpomene     Genome, 269 Mbp          2,675 arthropod BUSCOs   8.1h
                                                  843 metazoan BUSCOs      3.6h
                                                  429 eukaryote BUSCOs     0.91h
                                                  250 eukaryote BUSCOs     0.58h
                                                  248 CEGMA genes          5.7h
                         Gene set, 12,669 genes   2,675 arthropod BUSCOs   0.35h
                                                  843 metazoan BUSCOs      0.18h
                                                  429 eukaryote BUSCOs     0.12h
                                                  250 eukaryote BUSCOs     0.1h
                                                  248 CEGMA genes          N/A
Homo sapiens             Genome, 3,381 Mbp        3,023 vertebrate BUSCOs  29h
                                                  843 metazoan BUSCOs      13h
                                                  429 eukaryote BUSCOs     6.5h
                                                  250 eukaryote BUSCOs     2.8h
                                                  248 CEGMA genes          25.3h
                         Gene set, 20,364 genes   3,023 vertebrate BUSCOs  2.6h
                                                  843 metazoan BUSCOs      1.2h
                                                  429 eukaryote BUSCOs     0.5h
                                                  250 eukaryote BUSCOs     0.21h
                                                  248 CEGMA genes          N/A
Caenorhabditis elegans   Genome, 100 Mbp          843 metazoan BUSCOs      5.3h
                                                  429 eukaryote BUSCOs     1.36h
                                                  250 eukaryote BUSCOs     0.88h
                                                  248 CEGMA genes          1.7h
                         Gene set, 20,447 genes   843 metazoan BUSCOs      0.5h
                                                  429 eukaryote BUSCOs     0.3h
                                                  250 eukaryote BUSCOs     0.1h
                                                  248 CEGMA genes          N/A
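As a worked reading of Table S2, the genome run-times for the two most comparably sized sets (250 eukaryote BUSCOs versus 248 CEGMA genes) can be turned into ratios. This is an illustrative sketch only: the run-times are copied from Table S2, while the dictionary and function names are hypothetical and the choice of which rows to compare is an assumption made for the example.

```python
# Genome assessment run-times in hours, copied from Table S2:
# (250 eukaryote BUSCOs, 248 CEGMA genes) per species.
RUNTIMES_H = {
    "Drosophila melanogaster": (0.81, 2.5),
    "Heliconius melpomene": (0.58, 5.7),
    "Homo sapiens": (2.8, 25.3),
    "Caenorhabditis elegans": (0.88, 1.7),
}


def cegma_to_busco_ratio(busco_h: float, cegma_h: float) -> float:
    """How many times longer the CEGMA run took than the BUSCO run."""
    return cegma_h / busco_h


# One ratio per species, rounded to one decimal place for display.
ratios = {
    species: round(cegma_to_busco_ratio(busco_h, cegma_h), 1)
    for species, (busco_h, cegma_h) in RUNTIMES_H.items()
}

if __name__ == "__main__":
    for species, ratio in ratios.items():
        print(f"{species}: CEGMA took {ratio}x the BUSCO run-time")
```

On these four genomes the ratio ranges from about 1.9 (Caenorhabditis elegans) to about 9.8 (Heliconius melpomene). The larger lineage-specific BUSCO sets take correspondingly longer, as the other rows of the table show.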

6. References

Camacho, C., et al. (2009) BLAST+: architecture and applications, BMC Bioinformatics, 10, 421.

Eddy, S.R. (2011) Accelerated Profile HMM Searches, PLoS Comput Biol, 7, e1002195.

Keller, O., et al. (2011) A novel hybrid gene prediction method employing protein multiple sequence alignments, Bioinformatics, 27, 757-763.

Kriventseva, E.V., et al. (2014) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software, Nucleic Acids Res.

Mende, D.R., et al. (2013) Accurate and universal delineation of prokaryotic species, Nat Methods, 10, 881-884.

Parra, G., Bradnam, K. and Korf, I. (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes, Bioinformatics, 23, 1061-1067.

Parra, G., et al. (2009) Assessing the gene space in draft genomes, Nucleic Acids Res, 37, 289-297.

Sievers, F. and Higgins, D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences, Methods Mol Biol, 1079, 105-116.

Tatusov, R., et al. (2003) The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 4, 41.

Waterhouse, R.M. (2015) A maturing understanding of the composition of the insect gene repertoire, Current Opinion in Insect Science, 1.

Waterhouse, R.M., et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs, Nucleic Acids Research, 41, D358-D365.

Waterhouse, R.M., Zdobnov, E.M. and Kriventseva, E.V. (2011) Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi, Genome Biology and Evolution, 3, 75-86.