coalqc - quality control while inferring demographic histories … · 2 38. abstract. 39 estimating...
TRANSCRIPT
1
Article (Resources) 1
24th
Feb, 2020 2
3
CoalQC - Quality control while inferring demographic histories from genomic 4
data: Application to forest tree genomes 5
6
7
8
Ajinkya Bharatraj Patil1, Sagar Sharad Shinde
1, Raghavendra S
2, Satish B.N
3, Kushalappa 9
C.G3, Nagarjun Vijay
1 10
11
12
13
14 1Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER 15
Bhopal, Bhauri, Madhya Pradesh 2College of Agriculture Hassan, UAS Bangalore
3College 16
of Forestry, Ponnampet, Kodagu 17
18
*Corresponding authors: [email protected] & [email protected] 19
20
21
22
23
24
25
26
27
28
29
30
31
Running head: quality control of demographic inference. 32
Keywords: demographic history inference, Mesua ferrea, whole-genome assembly, PSMC, 33
repeat sequences, forest plants. 34
35
36
37
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
2
Abstract 38
Estimating demographic histories using genomic datasets has proven to be useful in 39
addressing diverse evolutionary questions. Despite improvements in inference methods and 40
availability of large genomic datasets, quality control steps to be performed prior to the use of 41
sequentially Markovian coalescent (SMC) based methods remains understudied. While 42
various filtering and masking steps have been used by previous studies, the rationale for such 43
filtering and its consequences have not been assessed systematically. In this study, we have 44
developed a reusable pipeline called “CoalQC”, to investigate potential sources of bias (such 45
as repeat regions, heterogeneous coverage, and callability). First, we demonstrate that 46
genome assembly quality can affect the estimation of demographic history using the genomes 47
of several species. We then use the CoalQC pipeline to evaluate how different repeat classes 48
affect the inference of demographic history in the plant species Populus trichocarpa. Next, 49
we assemble a draft genome by generating whole-genome sequencing data for Mesua ferrea 50
(sampled from Western Ghats, India), a multipurpose forest plant distributed across tropical 51
south-east Asia and use it as an example to evaluate several technical (sequencing technology, 52
PSMC parameter settings) and biological aspects that need to be considered while comparing 53
demographic histories. Finally, we collate the genomic datasets of 14 additional forest tree 54
species to compare the temporal dynamics of Ne and find evidence of a strong bottleneck in 55
all tropical forest plants during Mid-Pleistocene glaciations. Our findings suggest that quality 56
control prior to the use of SMC based methods is important and needs to be standardised. 57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
3
Introduction 72
Coalescent theory continues to be the fundamental tool for the study of gene genealogies in a 73
population genetic framework. Changes in coalescence time and the Watterson estimator of 74
genetic diversity (θw) along the genome serve as a record of the population history of a 75
species through time. Increasing availability of whole genomic datasets for a multitude of 76
species has made it possible to analyse demographic histories to answer a suite of questions 77
such as host-parasite co-evolution (Hecht et al. 2018), effect of climate change on population 78
dynamics (Bai et al. 2018), hybridization (Vijay et al. 2016), speciation events and split times 79
between species (Cahill et al. 2016), history of inbreeding (Prado-Martinez et al. 2013), 80
mutational meltdown (Rogers and Slatkin 2017), detecting population decline and addressing 81
threats of extinction (Mays et al. 2018). Such widespread use of genomic datasets for 82
coalescent inferences was made possible by the introduction of the PSMC method (Li and 83
Durbin 2011) that requires only one diploid genome sequencing dataset. Technical advances 84
in the use of genomic datasets for making demographic inferences and prevalence of multi-85
individual datasets facilitated by reductions in sequencing cost now allow integration of 86
information across an increasing number of individuals (Schiffels and Durbin 2014; Terhorst 87
et al. 2016; Palamara et al. 2018). 88
89
Despite the widespread use of demographic history inference methods like PSMC, 90
many potential sources of bias due to data quality have been identified and efforts to reduce 91
such effects are considered important. Earlier studies have shown that low coverage regions, 92
ascertainment bias, hyperdiverse sequences, the fraction of usable data available as well as 93
population structure will affect the estimation and interpretation of demographic histories (Li 94
and Durbin 2011; Mazet et al. 2015; Nadachowska-Brzyska et al. 2016). Notably, some of the 95
PSMC parameters or options such as mutation rate and generation time are known to 96
drastically change the scaling of the curve, while the trajectory remains unchanged 97
(Nadachowska-Brzyska et al. 2015). Detailed guidelines for the use of PSMC and MSMC are 98
described elsewhere (Mather et al. 2020). Although the effect of genome assembly quality on 99
demographic inferences has not been systematically assessed, it had been noted that genome 100
quality could bias the results (Tiley et al. 2018). Intriguingly, a recent paper investigated the 101
effect of genome quality and concluded that contemporary demographic inference methods 102
are robust to the quality of the reference genome used (Patton et al. 2019). 103
104
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
4
In general practice, repetitive elements and low coverage regions of the genome are 105
masked before the analyses to overcome biased inferences (Foote et al. 2016). It has been 106
suggested that at least 75% or more of the genome should be retained after masking for 107
robust inference of demographic history. However, organisms that have high (>30%) repeat 108
content would violate either the masking criteria or the fraction of the genome to be retained. 109
While several such quality control steps have been applied prior to the use of PSMC on 110
genomic data, the rationale for performing specific filtering steps and the adverse 111
consequences of skipping such quality control is understudied. A standardised quality control 112
pipeline would be able to alleviate some of these challenges. 113
114
In this study, we have created a reusable pipeline (CoalQC) for evaluating the quality 115
of genome-wide coalescence inferences and demonstrate the utility of the tool using a newly 116
generated ~180X coverage whole genome sequencing dataset of Mesua ferrea (a tropical tree 117
species distributed across south-east Asia) as well as public re-sequencing datasets of several 118
species. We investigate the relevance of various filtering practices and specifically answer the 119
following questions: 120
1. Does genome assembly quality affect the inference of coalescence histories? 121
2. How does the coalescence history differ between repeat classes? 122
3. Which biological (change in genome size) and technical (sequencing platform used, 123
PSMC parameter settings) factors influence demographic inference? 124
4. What can the comparison of demographic histories of forest plants reveal? 125
Our exploration of how several technical aspects can affect the inference of coalescence 126
histories is relevant not just for use of the PSMC program, but also for numerous other tools 127
that make coalescent inferences using genomic datasets. We also apply our pipeline to 128
compare the demographic history of forest trees to evaluate whether ecologically relevant 129
hypothesis can be robustly tested using demographic inference methods. 130
131
New Approaches 132
CoalQC 133
We have implemented a re-usable pipeline to perform quality control prior to the use of 134
genome-wide coalescent methods. Separate modules to evaluate the effects of repeat regions, 135
coverage and callability have been implemented to allow extensive quality control. The 136
repeat module estimates independent demographic histories using genomic regions of one 137
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
5
repeat class at a time along with non-repeat regions of the genome. Informative graphs that 138
(a) compare these independent estimates of Ne, (b) quantify relative abundance of each repeat 139
class in various atomic intervals, (c) assess robustness of the inferred results using bootstrap 140
replicates, (d) visualise trends of change in heterozygosity and Ts/Tv ratio is generated by this 141
module. We believe this module will be a valuable quality check prior to masking specific 142
repeat regions of the genome. 143
144
The module for coverage is specifically designed to evaluate the robustness of the results to 145
different coverage thresholds. Genomic regions are divided into several cumulative coverage 146
classes based on the local read depth. These coverage classes are then used to independently 147
estimate demographic histories and generate a comparative graph that can be used to 148
understand the robustness of the results to coverage constraints. Similar to the coverage 149
module, the callability module divides the genome into several callability classes to identify 150
regions of the genome that need to be excluded from the analysis by masking. Detailed 151
instructions and example commands for the use of the pipeline are provided on the github 152
repository of the CoalQC program (https://github.com/ceglab/coalqc). 153
154
Results 155
Does genome quality affect demographic inference? 156
Genome assembly quality encompasses multiple factors such as sequence contiguity 157
(generally quantified as N50), number and length of gaps, the fraction of genes assembled 158
(quantified using BUSCO’s) and fraction of the genome assembled (quantified based on the 159
percent of reads mapping to the genome assembly). To assess the effect of genome quality on 160
demographic inference, we compared Ne trajectories estimated using a single human 161
individual (NA12878) mapped to five different versions (hg4, hg10, hg15, hg19, and hg38) 162
of human genome assemblies with varying levels of quality. We found that all the measures 163
of genome quality used by us showed an improvement in recent versions of the human 164
genome (see Table S1). The estimated effective population size (Ne) showed greater 165
variability between genome assembly versions during ancient i.e., 1-7 MYA (Mean of 166
standard deviations in Ne of each atomic interval from 43 to 64 = 0.84) and recent i.e., 0-15 167
KYA (Mean of standard deviations in Ne of each atomic interval from 0 to 6 = 0.82) 168
compared to mid-time period i.e., 100-400 KYA (Mean of standard deviations in Ne of each 169
atomic interval from 18 to 32 = 0.29) (see Fig. 1). PSMC trajectories of earlier (poorer 170
assembly quality metrics) versions of the human genome showed higher estimates of Ne 171
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
6
during the ancient (~1-7 MYA) and recent times (~0-15 KYA) and lower estimates of Ne 172
during the mid-time period (~100-400 KYA) compared to the recent versions of the human 173
genome. 174
175
To evaluate the effect of assembly quality on the robustness of results we performed 176
100 bootstrap runs using each of the human genome versions considered. The heterogeneity 177
between bootstrap replicates within each version was quantified as the coefficient of variation 178
(CV) across the 100 replicates. While the CV of the early version of human genome assembly 179
(Mean of the CV of Ne across bootstrap replicates of hg4 assembly=0.054) was higher than 180
that of the other recent assemblies (hg10=0.041, hg15=0.042, hg19=0.045, and hg38=0.043) 181
considered by us, the CV’s of all the assemblies are very similar (see Fig. 1b). Comparable 182
estimates of Ne across bootstrap runs suggest that these estimates are being robustly inferred 183
for each specific genome assembly. 184
185
To ensure that the effect of genome assembly quality is not limited to just the human 186
genome, we compared the demographic histories inferred from the initial and recent versions 187
of the Tribolium castaneum and Danio rerio genomes (see Table S1). We find that similar to 188
the differences seen between different versions of the human genome assembly, the estimates 189
of Ne inferred from different versions of the genome show distinct trends (see Fig. S1). Our 190
results from the human, red flour beetle (Tribolium castaneum) and zebrafish (Danio rerio) 191
genomes suggest that genome quality does have a noticeable effect on demographic 192
inference. 193
194
How do repeat regions affect demographic inference? 195
Prior to performing demographic inference, repeat regions of the genome are generally 196
masked and excluded from the analysis. Masking of repeat regions is justified by the high 197
risk of assembly errors, collapsed segmental duplications and miss-mapping of short-reads in 198
repeat regions. Plant genomes with a high fraction of repetitive content are more prone to be 199
affected by repeats. Hence, we decided to use the high-quality genome of the plant Populus 200
trichocarpa to compare the Ne trajectories inferred using masked and unmasked genomes to 201
understand the magnitude of the change introduced by masking of repeat regions. The 202
estimates of Ne from the masked compared to the unmasked genome were lower during 203
ancient time period i.e., after ~ 1 MYA (Mean difference in Ne across atomic intervals 48 to 204
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
7
64=0.63 x 104) and higher during recent times i.e., 20KYA – 100 KYA (Mean difference in 205
Ne across atomic intervals 5 to 17=0.45 x 104, see Fig. 2a). 206
207
To evaluate how specific repeat classes affect the estimates of Ne, each repeat family 208
was unmasked while keeping other repeats masked (see Fig. 2b). Estimates of Ne after 209
inclusion of LTR-Gypsy was intermediate between masked and unmasked genome-based 210
inferences during the ancient past i.e., after 1MYA (Mean difference in Ne compared to 211
masked genome across atomic intervals 48 to 64= 0.29 x 104), whereas it showed a similar 212
trend as the unmasked genome during recent times i.e., 20KYA-100KYA (Mean difference in 213
Ne compared to masked genome across atomic intervals 5 to 17=0.44 x 104; see Fig. 2c). 214
Other repeat classes did not influence the estimates as much as LTRs and were closer to the 215
masked inference (see Fig. 2b). The robustness of the Ne estimates was assessed based on the 216
variability (quantified as CV) between bootstrap replicates using the non-repeat fraction of 217
the genome along with each individual repeat class. The CV was heterogeneous between 218
repeat classes and was relatively higher in recent time intervals (see Fig. 2c). Robustness of 219
the estimated values of Ne was comparable between the unmasked (Mean of the CV across 220
atomic intervals= 0.04687) and masked (Mean of the CV across atomic intervals= 0.047) 221
genomes. 222
We found that the fraction of repeat content in a particular atomic interval was 223
positively correlated (τ = 0.346, p-value= 0.0003, see Fig. S2) with the absolute difference 224
between masked and unmasked genome-based estimates of effective population size (Ne). 225
Having established that greater repeat abundance would more strongly affect estimates of Ne, 226
we quantified repeat family-wise abundance in genomic regions corresponding to each 227
atomic interval. In all atomic intervals, the non-repeat fraction was found to be the most 228
abundant (see Fig. 2d). Among the repeat classes, LTR-Gypsy had the highest abundance in 229
most of the atomic intervals. The extremely high abundance of LTR-Gypsy repeats in the first 230
few atomic intervals could have led to the drastic change in the Ne trajectory during recent 231
times (i.e., 20KYA-100KYA) after inclusion of LTR-Gypsy repeats. LTR's and RC-Helitron 232
have high abundance at the genome-wide level (see Fig. 2e) and have a greater influence on 233
the estimates of Ne (see Fig. 2b). 234
Genomic regions are assigned to a specific atomic interval based on the TMRCA of 235
that region. This leads to a trend of increasing levels of heterozygosity from atomic intervals 236
that correspond to recent to older time points. Each of the repeat classes independently shows 237
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
8
this trend of increase in heterozygosity similar to non-repeat regions (see Fig. 3). Comparable 238
estimates of heterozygosity between repeat and non-repeat regions suggest that the 239
heterozygous sites identified in repeat regions are not merely variant calling artefacts. The 240
ratio (Ts/Tv) of the number of transitions (Ts) to the number of transversions (Tv) has been 241
used to evaluate the accuracy of variant call sets (Wang et al. 2015). Regions of the genome 242
with artefactual variant calls would have a Ts/Tv ratio very different from the genomic 243
average. Hence, as an additional validation of the variants identified within repeat regions, we 244
calculated the Ts/Tv ratio for each repeat class by the atomic interval. While the estimates of 245
heterozygosity showed an increasing trend towards older atomic intervals, we found that the 246
Ts/Tv ratio did not show any discernible trend (see Fig. 3). Similar estimates of the Ts/Tv 247
ratio in repeat and non-repeat regions suggests that the heterozygous sites identified in repeat 248
regions are truly polymorphic. 249
250
Repeat regions of the genome have very high coverage due to the mapping of reads 251
from multiple copies and low callability due to a large number of mismatches across the 252
reads mapped to the same genomic region. Hence, some studies tend to mask genomic 253
regions based on criteria determined based on coverage or callability instead of the presence 254
of repeats. We separately evaluated the effect of masking genomic regions based on coverage 255
or callability classes (see Fig. S3 and S4) and find that masking based on these criteria needs 256
to be treated independently of repeat region-based masking. 257
258
Which biological and technical factors influence demographic inference? 259
Several technical factors such as the optimal PSMC parameter settings, sequencing platform 260
used, the prevalence of cross-contamination from closely related species, misleading or 261
incomplete metadata in public datasets are important considerations during the comparative 262
interpretation of demographic history. Similarly, biological factors such as the prevalence of 263
whole-genome duplications, changes in the karyotype or genome size and high intraspecific 264
variation in genetic diversity also need to be considered. We generated whole-genome 265
sequencing data for a tropical plant species (Mesua ferrea) and use it as an example to 266
understand how biological and technical factors influence demographic inference. For any 267
newly sequenced genome, the PSMC parameters –r (initial theta/rho ratio), -p (pattern of 268
parameters specifying distribution of free intervals and atomic intervals) and –t (the 269
maximum time to TMRCA) need to be optimised so that all the atomic intervals have 270
sufficient number of recombination events. 271
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
9
The maximum time to TMRCA 272
For the first PSMC run of the Mesua ferrea genome, -t 5 -r 5 -p "4+25*2+4+6" options were 273
used. However, the resultant PSMC output did not have enough recombination events in 274
some of the atomic intervals. Therefore, the -p parameter was optimized until all the atomic 275
intervals had a sufficient number of recombination events. A mutation rate of 2.5e-09 per site 276
per year i.e. 3.75e-08 per site per generation was used assuming 15 years of generation time 277
for scaling the results. The resultant trajectory did not go back to older (i.e., beyond 150 Kya; 278
see Fig. 4a) time points. So, we decided to optimize all the parameters so that the trajectory 279
will give meaningful results beyond 150 Kya. Hence, the maximum time to TMRCA, i.e., -t 280
parameter was increased so that the trajectory extended to older (i.e., 150 to 400 Kya; see 281
Fig. 4a) time points. The -p parameter was optimized along with -t, as increasing -t gave less 282
number of recombination events in some of the atomic intervals. While maintaining 64 283
atomic intervals, we were able to get a reliable demographic trajectory with options -t 65 -r 5 284
-p "3+2*17+15*1+1*12" i.e. 64 atomic intervals distributed across 19 free intervals 285
(1+2+15+1). To know how far the trajectory might be extended back in time if we increase –286
t, we used -t 500 for one run, which however did not have a sufficient number of 287
recombination events in most of the atomic intervals. 288
289
To know how increasing maximum time to TMRCA was altering the Ne trajectory, 290
we considered some of the longest scaffolds and visualised the assignment of specific 291
genomic regions to various atomic intervals. Comparing the atomic intervals assigned to the 292
same genomic region at different values of -t, we found that genomic regions which were 293
assigned to older atomic intervals for smaller values of -t were assigned to relatively recent 294
atomic intervals with an increase in the -t parameter (see Fig. 4b). This redistribution of 295
regions with increasing values of –t can be better understood by looking at changes in the 296
distribution of lengths of genomic regions assigned to each atomic interval (see Fig. S5). For 297
instance, in the case of Mesua ferrea the length of older atomic intervals tends to decrease 298
with increasing values of –t. Genomic regions contributing to the older atomic intervals at 299
higher values of the –t parameter become shorter and highly heterozygous. We ensured that 300
such short high heterozygosity regions are not merely variant calling artefacts by visualising 301
the atomic intervals assigned to genomic regions along scaffolds with associated 302
heterozygosity and callability at these regions (see Fig. 4b). 303
304
305
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
10
The initial theta/rho ratio 306
The ratio of heterozygosity to recombination rate generally does not have an impact over the 307
trajectory inferred using PSMC, but an increase in the value of -r does increases the number 308
of recombination events and achieves convergence faster. We used the same -p and -t values 309
(i.e., -t 65 -p "3+2*17+15*1+1*12") with different values of –r. With decreasing values of –r, 310
the Ne trajectory of Mesua ferrea extended further back in time (i.e., beyond 400KYA; see 311
Fig. S6a). However, the atomic intervals corresponding to the extended trajectory did not 312
have a sufficient number of recombination events for –r values less than 5. When the –r 313
values were set at 5 or more, the convergence was achieved faster (see Fig. S6b). 314
315
Does the sequencing platform affect the result? 316
We have previously shown that the trajectory of Ne can show extremely contrasting patterns 317
between different populations of the same species (Vijay et al. 2018). To understand the 318
variability in the demographic histories of different populations of Mesua ferrea we wanted 319
to sample additional populations. Upon searching the European Nucleotide Archive (ENA), 320
we found that a re-sequencing dataset labeled as Mesua ferrea sampled from Yunnan, China 321
(see Table S2) was available for download. Surprisingly, the demographic trajectory inferred 322
using this dataset extended back in time with -t of 5 (optimised with -r 5 -p “4+25*2+5*2”) 323
and gave a different inference (see Fig. S7). However, we found that the sequencing for the 324
sample from Yunnan had been performed using the BGISEQ-500 platform. In order to rule 325
out the possibility of sequencing platform-specific technical issues, we compared the 326
demographic trajectory of the Human individual NA12878 sequenced using BGISEQ-500 327
(see Table S2 for dataset details) with the trajectory obtained for the same individual when 328
the sequencing was performed using Illumina platform (see Fig. S8). We did not find any 329
differences in the Ne trajectories estimated using BGISEQ-500 and the Illumina platform. 330
331
Are the differences in Ne trajectories due to biological differences? 332
Having ruled out the possibility of sequencing platform-specific technical factors we 333
considered the possibility of biological differences between the two samples of Mesua ferrea. 334
Biological reasons for different trajectories can involve (a) different demographic histories of 335
specific populations or (b) changes in the karyotype altering recombination landscape or (c) 336
changes in genome size due to segmental or whole-genome duplication events. Since an 337
earlier study has documented the prevalence of ecotype specific differences in the genome 338
size of Mesua ferrea (Das et al. 2018), we first decided to compare the approximate in-silico 339
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
11
genome size estimates of the two individuals under consideration. However, genome size 340
estimates from raw read data are sensitive to the coverage depth, sequencing error rate, and 341
polymorphism rate. Nonetheless, we see that the estimated genome size of the sample from 342
China is (approximately 478 Mbp; see Fig. S9 and S10) ~ 20Mbp less than the genome size 343
estimated for the sample from India. We also observed that the heterozygosity of the sample 344
from China was ~25 fold higher than the sample from India. These differences in 345
heterozygosity could be attributed to multiple factors such as contamination of reads from 346
other species, independent Whole Genome Duplication (WGD) in Chinese sample, incorrect 347
metadata regarding species identity in the public sequencing data repository. However, the 348
WGD program (Zwaenepoel and Van De Peer 2019) used to detect whole-genome 349
duplication events from genomic data did not find any evidence to support a WGD event 350
specific to the Chinese sample based on the distribution of synonymous substitution rates 351
(Ds) (see Fig. S11). Despite ruling out the possibility of bias from sequencing technology, we 352
are not able to conclusively establish the reasons for the difference in the Ne trajectories due 353
to reasons beyond the scope of this study. 354
355
Comparative demography using PSMC on forest plant genomes 356
Estimation of the demographic history of multiple species of forest plants can provide useful 357
information about the overall evolution of forests and the role of ecological processes or 358
climatic events. We collated publicly available forest plant genomes that had sufficient data 359
and compared the demographic histories inferred using PSMC. Demographic trajectories of 360
the tropical species showed a considerable decline in Ne during 300 KYA – 1 MYA (see Fig. 361
5). This decline in Ne corresponds to a common event irrespective of the species-specific 362
population dynamics. The period which shows bottleneck in all these species might be 363
attributed to the environmental conditions of this period. During this period two major 364
glaciations and extensive de-glaciation events have been recorded, which were considered to 365
be longer and harsher than normal (van der Hammen 1974; Verbitsky et al. 2018). Glaciations 366
following the Mid-Pleistocene transition (MPT) changed the duration of glacial events from 367
41 KYs to 100 KYs, translating into longer dry and colder conditions (Pisias and Moore 368
1981; Clark et al. 2006). These dry environments also affected precipitation in the tropics, 369
leading to a decrease in CO2 concentration and less rainfall in these regions (Hewitt 2000; 370
Dupont et al. 2001; Clark et al. 2006; Cabanne et al. 2016). In contrast, the silver birch 371
(Betula pendula) showed an increase in Ne during this period, which could be explained by 372
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
12
its adaptation to the dry, cold and high-altitude conditions (Salojärvi et al. 2017). Based on 373
our analysis of 15 forest plants we could infer that late Pleistocene glaciations had a great 374
impact on most of the forest plant species leading to reduced Ne during this period. 375
Discussion 376
Using genomes and re-sequencing datasets of several species, we here explore how genome 377
quality (contiguity, prevalence of gaps, assembly completeness), repeat abundance, 378
technological heterogeneity (sequencing platform used, parameter settings optimised) and 379
biological factors (changes in genome size) can affect the inference of demographic history. 380
The scripts used for our analysis are implemented in the form of a re-usable pipeline with 381
detailed documentation. We are thus confident that these quality control strategies and the 382
associated pipeline will prove useful while comparing demographic trajectories between 383
species to obtain insights into the underlying processes. The following paragraphs highlight 384
our major findings and their relevance with respect to previous studies. 385
386
Genome quality 387
We demonstrate that the estimates of Ne for the same sequencing dataset can be drastically 388
different when the quality of the reference genome assembly changes. A recent study by 389
Patton et al. (2019) investigated the robustness of several demographic inference methods to 390
genome assembly quality and find that in comparison to other methods, PSMC robustly 391
estimates Ne except in recent time periods. In contrast to our results, Patton et al. (2019) 392
concluded that demographic inference methods are robust to the quality of the genome 393
assembly. However, Patton et al. simulated differing amounts of genome fragmentation by 394
manipulating the variant call file and overlook the possibility of genome quality-related 395
biases introduced during the read mapping and variant calling steps. Moreover, these 396
simulations assumed that random fragmentation of the genome would capture the complexity 397
of differences in the qualities of real genomes. The ends of contigs or scaffolds in genomes 398
are regions that are difficult to assemble, such as repeat-rich or hypervariable regions and are 399
not randomly distributed in the genome. We consistently see differences in the demographic 400
history estimated from different genome assembly versions of the same species. Yet, the 401
demographic histories estimated from the 2012-devil and 2019-devil assemblies in Patton et 402
al. are very similar. The negligible improvement (maximum difference in the percent of reads 403
mapped is 0.01 considering 12 individuals) in the percentage of reads mapping to the recent 404
version of the devil genome compared to the older version might explain why genome quality 405
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
13
does not appear to influence the results in the case of the Tasmanian devil. Depending on the 406
complexity of the genome architecture, the extent of improvement in genome assembly 407
quality and which aspects of genome quality are improved will affect the magnitude of 408
change in estimates of Ne. Hence, we urge caution while making generalisations regarding 409
the effect of genome quality on the estimation of Ne. 410
411
Repeat regions 412
Our results demonstrate that the inclusion of repeat regions does affect demographic 413
inference but not all types of repeats have the same effect. The effect that a particular repeat 414
class would have on the inference seems to depend on the abundance and genomic 415
distribution of that particular repeat class. Hence, lineage-specific repeat classes can 416
potentially affect the comparative analysis of demographic histories of closely related 417
species. For instance, the LTR content might differ between closely related species (Zhang et 418
al. 2020) and can heavily influence the results. We implement a quality control strategy that 419
involves the comparison of demographic histories inferred from each repeat class separately. 420
A better understanding of repeat class-specific mutation rates might allow for scaling each 421
repeat type with an appropriate mutation rate and resolve this heterogeneity in the 422
trajectories. In order to evaluate the effect of diverse repeat classes on the estimation of Ne, 423
our pipeline relies on the existence of a reasonably good quality of repeat annotation in the 424
focal species. While this is a caveat, it is a compromise done in order to finish the execution 425
of the program in a timely manner. The users also have the choice to decide the number of 426
repeat classes by combining repeat classes or separating them into sub-classes. A larger 427
number of repeat classes leads to an increase in the runtime. We urge users to perform their 428
own repeat annotation, identification and classification to overcome this limitation. 429
430
PSMC parameter settings 431
Using the newly generated genome sequencing dataset of the tropical plant Mesua Ferrea, we 432
demonstrate the effect of changing the three main parameter settings of the PSMC program. 433
Our results highlight the importance of appropriately choosing the –t parameter (i.e., the 434
maximum time to TMRCA) and provides intuitive understanding about changes in the 435
distribution of genomic regions into specific atomic intervals as the values of –t is changed. 436
The comparison of demographic histories across species requires that the PSMC parameter 437
settings are properly optimised to identify relevant differences in their Ne trajectories. By 438
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
14
comparing the trajectories of Mesua Ferrea individuals from two different populations, we 439
show the importance of the –t parameter. 440
Certain historical time periods might be of greater importance due to specific climatic 441
or geological events that have occurred in that time frame. Hence, it can be desirable to have 442
a greater resolution while estimating the effective population size histories during these time 443
periods. While increasing the number of free intervals in a particular time period will increase 444
the resolution it can also lead to certain atomic intervals having too few recombination 445
events. Hence, the atomic intervals need to be distributed such that each atomic interval has 446
more than 10 recombination events after the 20th iteration. We hope that our results provide 447
some clarity regarding the strategies to be used while choosing the –p parameter. 448
449
The output of PSMC is insightful when it is scaled to time in years based on the 450
mutation rate and generation time of the species under consideration. While changes in the 451
scaling parameters have been shown to result in similarly shaped trajectories it does change 452
the absolute values of the estimates (Nadachowska-Brzyska et al. 2015). However, accurate 453
estimates of mutation rates are missing or unreliable in the case of many species. Moreover, 454
the estimates of mutation rates obtained by different methods can produce drastically 455
different values. Generation time can also be difficult to estimate especially for long-living 456
plant species. Hence, recent studies have resorted to scaling the results using multiple 457
combinations of mutation rates and generation times to ensure the robustness of their 458
observations. 459
460
Conclusion 461
In summary, our study systematically investigates multiple sources of bias that can 462
affect the inference of demographic history from whole genomic datasets. By comparing the 463
demographic inferences obtained using different versions of the human, red flour beetle and 464
zebrafish genomes, we establish that genome quality does have a considerable impact on the 465
estimation of effective population size (Ne). Instead of simply masking repeat regions of the 466
genome, we investigate the consequences of including each repeat class using the genome of 467
the plant Populus trichocarpa. Interestingly, we find that most repeat classes are able to 468
provide inferences consistent with those obtained from non-repeat regions and can be a viable 469
source of demographic history. Our analysis of repeat regions is of special relevance as the 470
quality of genome assemblies continues to improve with long-read sequencing technologies 471
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
15
being able to correctly assemble repeat regions. Demographic histories inferred for the same 472
human individual using BGI-seq and Illumina datasets show highly concordant results. 473
Future studies that plan to compare datasets originating from these technologies are likely to 474
benefit from our comparative analysis. Parameter settings used for running PSMC have also 475
been investigated in considerable detail with guidelines for optimal use of parameters. 476
Having investigated various sources of technical variability and parameter settings, we 477
consider the genomic datasets of 15 forest trees and compare their demographic histories as a 478
case study. We not only provide guidelines for performing filtering of genomic datasets but 479
also develop the CoalQC pipeline that we hope will become a standard part of quality control 480
prior to demographic analysis using whole-genome datasets. 481
482
Material and Methods 483
Genome quality assessment 484
Assembly and annotation 485
Leaves of the plant Mesua ferrea were collected from a tree located near the College of 486
Forestry, Ponnampet (GPS coordinates 12°08'56.5"N75°54'32.5"E; Altitude: 829-850 meter 487
Above Sea level). Genomic DNA was extracted from the leaves using the standard CTAB 488
protocol. The quality and integrity of DNA were assessed using 1x gel electrophoresis, 489
Nanodrop, and Qubit. Illumina short-read (150bp) libraries were prepared with an insert size 490
of 450 ± 50 bp to generate ~110 Gbp of paired-end short-read data. The quality of the 491
sequencing read data was assessed using the program FASTQC (Andrews et al. 2015). 492
Genome size was estimated using Jellyfish (Marçais and Kingsford 2011) followed by 493
GenomeScope (Vurture et al. 2017) using k-mer size of 21. GenomeScope estimated genome 494
size of approximately 496.72 Mbp. Despite the observation of variability in genome size 495
between ecotypes of Mesua ferrea, flow cytometry-based estimates have consistently 496
predicted genome size of approximately 684.5 Mbp (Das et al. 2018). 497
498
For assembling the genome, MaSuRCA (Maryland Super Read Cabog Assembler) 499
(Zimin et al. 2017) was used with parameters USE_LINKING_MATES = 1 500
CLOSE_GAPS=1 NUM_THREADS = 32 JF_SIZE = 12000000000 SOAP_ASSEMBLY=1. 501
Our genome assembly length of 614.35 MB is slightly less than but comparable to previous 502
estimates based on flow cytometry. To assess the quality of the assembled genome, we 503
employed multiple quality assessment methods. While N50, N75, etc. are accepted metrics of 504
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
16
genome contiguity, the number of genes assembled in the assembly serves as a metric of 505
assembly completeness. The amount of repeats assembled is also one of the relevant metrics 506
which informs about the quality of the assembly in low-complexity regions. The Quast 507
(Mikheenko et al. 2018) program was used to calculate assembly statistics i.e. N50, N75, 508
Number of N´s per 100 KB, etc. (see Table S3). BUSCO (Benchmarking Universal Single-509
Copy Orthologues) (Simão et al. 2015) was used to assess the completeness of the assembly, 510
using eudicotyledons_odb10 (see Table S4 and S12) and embryophyta_odb10 (see Table S5) 511
dataset together with previously sequenced genomes of order Malpighiales. LTR_retriever's 512
LAI module was used to determine assembly quality based on the LAI (LTR Assembly 513
Index) score which assesses repeat content assembled (see Table S6). 514
515
Annotation was carried out using MAKER-P (Campbell et al. 2014) version 2 with 516
MPI. Repeat libraries obtained from RepeatModeler (Smit, AFA, Hubley 2015) and LTR-517
retriever (Ou and Jiang 2018) were concatenated and used to mask repeat regions of the 518
genome. Published CDS dataset of Populus trichocarpa and concatenated multi-fasta of all 519
available Malpighiales proteins were used as homology evidence for the first round of de-520
novo annotation. The results of the first round of annotation were then used for training 521
SNAP (Korf 2004) and AUGUSTUS (Stanke et al. 2008) implemented in BUSCO. These 522
predictions were used for the second round of annotation in MAKER-P. Iterative rounds of 523
annotation were carried out for 5 rounds until no further improvement was observed as 524
assessed by the AED (Annotation Edit Distance) values (see Table S7). 525
The raw sequencing read data was used to separately assemble the chloroplast 526
genome using the NOVOPlasty (Dierckxsens et al. 2017) program. The Maturase K gene 527
sequence of Mesua ferrea was used as a seed sequence and Garcinia mangostana chloroplast 528
genome was used as a reference. The assembler uses seed sequence to find reads that cover 529
this sequence and starts overlapped sequence assembly. The assembled chloroplast genome 530
had two sets of contigs. The orientation of the contigs was determined by dot-plot analyses 531
(see Fig. S13) with Garcinia mangostana and other Malpighiales chloroplast genome 532
sequences. The full length of the assembled chloroplast genome was 161.4 Kbp long. It was 533
then annotated using GeSeq and visualised using OGDRAW (see Fig. S14) implemented in 534
CHLOROBOX (Greiner et al. 2019). For assembling the mitochondrial genome matR gene 535
sequence of Mesua ferrea was used as a seed. Assembled chloroplast sequence was used for 536
comparison and WGS raw reads were used in NOVOPlasty. The total assembled 537
mitochondrial sequence was 20084 bp long (see Fig. S15). 538
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
17
539
540
541
Datasets used 542
To demonstrate the utility of our quality control pipeline and generality of our observations 543
across diverse taxa we used species from different phyla such as nematodes, plants, 544
vertebrates, etc. The published genome assemblies used as reference genomes were 545
downloaded from NCBI/UCSC genome browser (details provided in Table S8). We searched 546
the European Nucleotide Archive (ENA) for genomic sequencing datasets and downloaded 547
those datasets that had >20X coverage (see Table S2). The raw read datasets were mapped to 548
corresponding unmasked genomes using the short-read aligner BWA-MEM (Li 2013) with 549
default settings. 550
551
Genome assembly quality comparison 552
The latest version of the Human genome assembly, i.e., hg38, was downloaded from 553
Ensembl, whereas previous assemblies i.e. hg19, hg15, hg10, and hg4 were downloaded from 554
UCSC genome browser (see Table S8). The genome assembly statistics i.e. N50 statistics and 555
Number of N’s per 100 Kb were calculated using Quast. SAMTOOLS flagstat module was 556
used to get mapping percentages for each assembly using mapped alignments of each 557
assembly. To get assembly completeness statistics, BUSCO was used with dataset 558
Mammalia_odb9 on each of the assemblies. Genome assembly quality was also assessed for 559
red flour beetle (Tribolium castaneum) and zebrafish (Danio rerio) genomes. These 560
comparative assembly quality statistics are available in Table S1. 561
562
Inference of demographic history using PSMC 563
Parameter settings 564
Variant calling was performed using SAMTOOLS and BCFTOOLS with the depth 565
parameters for vcf2fq command of vcfutils.pl decided based on the mean coverage of the 566
reads. The resultant fastq file of heterozygous sites was converted into psmcfa format using 567
the fq2psmcfa program for bin sizes of 20, 50, and 100. For the first run of PSMC, options 568
were set as -t 5, -r 5, -p "4+25*2+4+6”. The output was evaluated to see if a sufficient 569
number of recombination events had occurred in each atomic interval. If there were some 570
atomic intervals that did not have at least ten recombination events after the 20th iteration, 571
then the -p parameters were modified. For example, if -p parameter "4+25*2+4+6" is set, it 572
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
18
has 64 atomic intervals distributed across 28 free intervals i.e. (1+25+ 1+ 1). Changing the 573
distribution of atomic intervals across free intervals, e.g., "8+25*2+2+4" would be the first 574
step to see if enough recombination events are obtained. If not, then changing the free 575
intervals, i.e., "3*2+1*10+15*2+1*14+1*4" (written as free intervals * atomic intervals) 576
was done to get sufficient recombination events. Only after obtaining a sufficient number of 577
recombination events, the -p parameter was finalised. PSMC was run with the -d (decode) 578
option for identifying genomic regions that contributed to each atomic interval. After 579
obtaining the PSMC output, an appropriate mutation rate and generation time were used to 580
generate the scaled plot using the psmc_plot.pl script. For bootstrapping analyses, the psmcfa 581
file was first split into equal lengths of 5 MB, and was used for 100 runs of PSMC. 582
583
Effect of repeat regions 584
The unmasked genomes were analysed to identify and annotate repetitive regions. For 585
genome-wide identification of LTR’s, the program LTR-retriever was run using repeat 586
libraries made by concatenating LTR harvest (Ellinghaus et al. 2008) and LTR_finder v 1.0.6 587
(Xu and Wang 2007) output. The RepeatModeler program was used for the de-novo 588
identification of repeats. Both genome-wide LTR-retriever and RepeatModeler repeat 589
libraries were concatenated and given as input to the RepeatMasker program. The tabulated 590
output file of RepeatMasker was converted to bed format and used for further analyses. 591
592
Separate runs of variant calling were carried out using the unmasked and masked 593
genomes followed by PSMC analyses. PSMC was run with -d option for both unmasked and 594
masked datasets, and outputs were produced for three bin sizes (specified using the –s flag), 595
i.e., 20, 50, and 100. For each run of PSMC, the decode2bed.pl script was used to obtain 596
details of the atomic interval assigned to specific genomic regions. The prevalence of repeat 597
regions in each atomic interval was assessed by intersecting the positions of the repeats with 598
the positions of atomic intervals using BEDTools (Quinlan and Hall 2010). The genomic 599
coordinates of heterozygous sites and ratio of transitions (Ts) to transversions (Tv) were 600
obtained using the hetlist command of seqtk (Li 2015). Subsequently, repeat class-specific 601
heterozygosity and Ts/Tv ratio in each atomic interval was calculated using the positions of 602
repeat regions, decoded atomic intervals and genome-wide list of heterozygote sites as 603
arguments to BEDTools. To evaluate the effect each individual repeat class would have on 604
PSMC, one repeat class was unmasked at a time keeping all the other repeat classes masked 605
prior to PSMC analyses. 606
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
19
607
608
609
Effect of coverage and callability 610
Genome-wide per base depths were calculated using SAMTOOLS depth command. The 611
read-depth information was used to assign cumulative coverage classes, i.e., bases having >0 612
read depth, >10 read depth, >20 read depth and so on. The genome could thus be divided into 613
regions that correspond to specific cumulative coverage classes. GATK version 3.8.1 614
(McKenna et al. 2010) was used to get callability information using CallableLoci command. 615
The callable status obtained from CallableLoci was used to divide the genome into regions of 616
a specific callability. The effect these coverage based and callability based classes would have 617
on PSMC inference was assessed by performing PSMC analyses separately on each of the 618
classes after masking all genomic regions that were outside the class under consideration. 619
620
Comparative PSMC of forest plant genomes 621
Published plant genome assemblies till November 2019 and their details were obtained from 622
the plant genome database (available at https://www.plabipd.de/timeline_view.ep). From this 623
list of published plant genomes, forest plants (i.e. excluding annual plants) species with >20X 624
coverage were selected. The genomes and corresponding short-read data were downloaded 625
from public repositories (see Table S2 and S8 for details). PSMC analysis was performed on 626
each of these species with default parameters i.e. –t5 –r5 –p “4+25*2+4+6”. A mutation rate 627
estimate of 2.5e-09 per site per year which has been used for Populus trichocarpa in an 628
earlier study (Bai et al. 2018) was used for all the species. Generation time for each species 629
was obtained through a literature search. For each species per generation, mutation rates were 630
obtained using corresponding generation times (Table S9). 631
632
Acknowledgments 633
We thank the Ministry of Human Resource Development for fellowship to ABP, the Council 634
of Scientific & Industrial Research for fellowship to SSS. NV has been awarded the 635
Innovative Young Biotechnologist Award 2018 from the Department of Biotechnology and 636
Early Career Research Award from the Department of Science and Technology (both 637
Government of India). The computational analyses were performed on the Har Gobind 638
Khorana Computational Biology cluster established and maintained by combining funds from 639
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
20
IISER Bhopal under Grant # INST/BIO/2017/019, IYBA 2018 from DBT and ECRA from 640
DST. 641
Author contributions 642
NV conceived of the study based on inspiration from CGK and designed the study along with 643
ABP & CGK. ABP conducted computational analysis and compiled the results with 644
assistance from NV. SBN and RS collected the samples required for primary data generation. 645
SSS extracted DNA from the leaf samples. ABP, NV, and CGK wrote the manuscript along 646
with SBN, RS, and SSS. All authors approved the final draft. 647
References 648
649
Andrews S, Krueger F, Seconds-Pichon A, Biggins F, Wingett S. 2015. FastQC. A quality 650
control tool for high throughput sequence data. Babraham Bioinformatics. Babraham 651
Inst. 652
Bai WN, Yan PC, Zhang BW, Woeste KE, Lin K, Zhang DY. 2018. Demographically 653
idiosyncratic responses to climate change and rapid Pleistocene diversification of the 654
walnut genus Juglans (Juglandaceae) revealed by whole-genome sequences. New 655
Phytol. 656
Cabanne GS, Calderón L, Trujillo Arias N, Flores P, Pessoa R, d’Horta FM, Miyaki CY. 657
2016. Effects of Pleistocene climate changes on species ranges and evolutionary 658
processes in the Neotropical Atlantic Forest. Biol. J. Linn. Soc. 659
Cahill JA, Soares AER, Green RE, Shapiro B. 2016. Inferring species divergence times using 660
pairwise sequential markovian coalescent modelling and low-coverage genomic data. 661
Philos. Trans. R. Soc. B Biol. Sci. 662
Campbell MS, Law MY, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun 663
R, Jiao D, Lawrence CJ, et al. 2014. MAKER-P: A Tool kit for the rapid creation, 664
management, and quality control of plant genome annotations. Plant Physiol. 665
Clark PU, Archer D, Pollard D, Blum JD, Rial JA, Brovkin V, Mix AC, Pisias NG, Roy M. 666
2006. The middle Pleistocene transition: characteristics, mechanisms, and implications 667
for long-term changes in atmospheric pCO2. Quat. Sci. Rev. 668
Das R, Shelke RG, Rangan L, Mitra S. 2018. Estimation of nuclear genome size and 669
characterization of Ty1-copia like LTR retrotransposon in Mesua ferrea L. J. Plant 670
Biochem. Biotechnol. 671
Dierckxsens N, Mardulyn P, Smits G. 2017. NOVOPlasty: De novo assembly of organelle 672
genomes from whole genome data. Nucleic Acids Res. 673
Dupont LM, Donner B, Schneider R, Wefer G. 2001. Mid-Pleistocene environmental change 674
in tropical Africa began as early as 1.05 Ma. Geology. 675
Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for 676
de novo detection of LTR retrotransposons. BMC Bioinformatics. 677
Foote AD, Vijay N, Ávila-Arcos MC, Baird RW, Durban JW, Fumagalli M, Gibbs RA, 678
Hanson MB, Korneliussen TS, Martin MD, et al. 2016. Genome-culture coevolution 679
promotes rapid divergence of killer whale ecotypes. Nat. Commun. 7:ncomms11693. 680
Greiner S, Lehwark P, Bock R. 2019. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: 681
expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids 682
Res. 683
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
21
van der Hammen T. 1974. The Pleistocene Changes of Vegetation and Climate in Tropical 684
South America. J. Biogeogr. 685
Hecht LBB, Thompson PC, Rosenthal BM. 2018. Comparative demography elucidates the 686
longevity of parasitic and symbiotic relationships. In: Proceedings of the Royal Society 687
B: Biological Sciences. 688
Hewitt G. 2000. The genetic legacy of the quaternary ice ages. Nature. 689
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics. 690
Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-691
MEM. Available from: http://arxiv.org/abs/1303.3997 692
Li H. 2015. Seqtk: Toolkit for processing sequences in FASTA/Q formats. GitHub. 693
Li H, Durbin R. 2011. Inference of human population history from individual whole-genome 694
sequences. Nature 475:493–496. 695
Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of 696
occurrences of k-mers. Bioinformatics. 697
Mather N, Traves SM, Ho SYW. 2020. A practical introduction to sequentially Markovian 698
coalescent methods for estimating demographic history from genomic data. Ecol. Evol. 699
Mays HL, Hung CM, Shaner PJ, Denvir J, Justice M, Yang SF, Roth TL, Oehler DA, Fan J, 700
Rekulapally S, et al. 2018. Genomic Analysis of Demographic History and Ecological 701
Niche Modeling in the Endangered Sumatran Rhinoceros Dicerorhinus sumatrensis. 702
Curr. Biol. 703
Mazet O, Rodríguez W, Chikhi L. 2015. Demographic inference using genetic data from a 704
single individual: Separating population size variation from population structure. Theor. 705
Popul. Biol. 706
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, 707
Altshuler D, Gabriel S, Daly M, et al. 2010. The genome analysis toolkit: A MapReduce 708
framework for analyzing next-generation DNA sequencing data. Genome Res. 709
Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. 2018. Versatile genome 710
assembly evaluation with QUAST-LG. In: Bioinformatics. 711
Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H. 2016. PSMC analysis of effective 712
population sizes in molecular ecology and its application to black-and-white Ficedula 713
flycatchers. Mol. Ecol. 25:1058–1072. 714
Nadachowska-Brzyska K, Li C, Smeds L, Zhang G, Ellegren H. 2015. Temporal dynamics of 715
avian populations during pleistocene revealed by whole-genome sequences. Curr. Biol. 716
25:1375–1380. 717
Ou S, Jiang N. 2018. LTR_retriever: A highly accurate and sensitive program for 718
identification of long terminal repeat retrotransposons. Plant Physiol. 719
Palamara PF, Terhorst J, Song YS, Price AL. 2018. High-throughput inference of pairwise 720
coalescence times identifies signals of selection and enriched disease heritability. Nat. 721
Genet. 722
Patton AH, Margres MJ, Stahlke AR, Hendricks S, Lewallen K, Hamede RK, Ruiz-Aravena 723
M, Ryder O, McCallum HI, Jones ME, et al. 2019. Contemporary Demographic 724
Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in 725
Tasmanian Devils. Mol. Biol. Evol. 726
Pisias NG, Moore TC. 1981. The evolution of Pleistocene climate: A time series approach. 727
Earth Planet. Sci. Lett. 728
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, 729
Woerner AE, O’Connor TD, Santpere G, et al. 2013. Great ape genetic diversity and 730
population history. Nature 499:471–475. 731
Quinlan AR, Hall IM. 2010. BEDTools: A flexible suite of utilities for comparing genomic 732
features. Bioinformatics 26:841–842. 733
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
22
Rogers RL, Slatkin M. 2017. Excess of genomic defects in a woolly mammoth on Wrangel 734
island. PLoS Genet. 735
Salojärvi J, Smolander OP, Nieminen K, Rajaraman S, Safronov O, Safdari P, Lamminmäki 736
A, Immanen J, Lan T, Tanskanen J, et al. 2017. Genome sequencing and population 737
genomic analyses provide insights into the adaptive landscape of silver birch. Nat. 738
Genet. 739
Schiffels S, Durbin R. 2014. Inferring human population size and separation history from 740
multiple genome sequences. Nat. Genet. 46:919–925. 741
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM. 2015. BUSCO: 742
Assessing genome assembly and annotation completeness with single-copy orthologs. 743
Bioinformatics 31:3210–3212. 744
Smit, AFA, Hubley R. 2015. RepeatModeler Open-1.0. Available from: 745
http://www.repeatmasker.org 746
Stanke M, Diekhans M, Baertsch R, Haussler D. 2008. Using native and syntenically mapped 747
cDNA alignments to improve de novo gene finding. Bioinformatics. 748
Terhorst J, Kamm JA, Song YS. 2016. Robust and scalable inference of population history 749
from hundreds of unphased whole genomes. Nat. Genet. 49:303–309. 750
Tiley GP, Kimball RT, Braun EL, Burleigh JG. 2018. Comparison of the Chinese bamboo 751
partridge and red Junglefowl genome sequences highlights the importance of 752
demography in genome evolution. BMC Genomics. 753
Verbitsky MY, Crucifix M, Volobuev DM. 2018. A theory of Pleistocene glacial rhythmicity. 754
Earth Syst. Dyn. 755
Vijay N, Bossu CM, Poelstra JW, Weissensteiner MH, Suh A, Kryukov AP, Wolf JBW. 2016. 756
Evolution of heterogeneous genome differentiation across multiple contact zones in a 757
crow species complex. Nat. Commun. 7:ncomms13195. 758
Vijay N, Park C, Oh J, Jin S, Kern E, Kim HW, Zhang J, Park J-K. 2018. Population 759
Genomic Analysis Reveals Contrasting Demographic Changes of Two Closely Related 760
Dolphin Species in the Last Glacial.Satta Y, editor. Mol. Biol. Evol. [Internet] 35:2026–761
2033. Available from: https://academic.oup.com/mbe/article/35/8/2026/5017252 762
Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. 763
2017. GenomeScope: Fast reference-free genome profiling from short reads. In: 764
Bioinformatics. 765
Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y. 2015. Genome measures used for quality 766
control are dependent on gene function and ancestry. Bioinformatics. 767
Xu Z, Wang H. 2007. LTR-FINDER: An efficient tool for the prediction of full-length LTR 768
retrotransposons. Nucleic Acids Res. 769
Zhang Z, Chen Y, Zhang J, Ma X, Li Y, Li M, Wang D, Kang M, Wu H, Yang Y, et al. 2020. 770
Improved genome assembly provides new insights into genome evolution in a desert 771
poplar ( Populus euphratica ). Mol. Ecol. Resour.:1755-0998.13142. 772
Zimin A V., Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. 773
2017. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a 774
progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 775
Zwaenepoel A, Van De Peer Y. 2019. Wgd-simple command line tools for the analysis of 776
ancient whole-genome duplications. Bioinformatics. 777
778
Bai WN, Yan PC, Zhang BW, Woeste KE, Lin K, Zhang DY. 2018. Demographically 779
idiosyncratic responses to climate change and rapid Pleistocene diversification of the 780
walnut genus Juglans (Juglandaceae) revealed by whole-genome sequences. New 781
Phytol. 782
Cahill JA, Soares AER, Green RE, Shapiro B. 2016. Inferring species divergence times using 783
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
23
pairwise sequential markovian coalescent modelling and low-coverage genomic data. 784
Philos. Trans. R. Soc. B Biol. Sci. 785
Campbell MS, Law MY, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun 786
R, Jiao D, Lawrence CJ, et al. 2014. MAKER-P: A Tool kit for the rapid creation, 787
management, and quality control of plant genome annotations. Plant Physiol. 788
Das R, Shelke RG, Rangan L, Mitra S. 2018. Estimation of nuclear genome size and 789
characterization of Ty1-copia like LTR retrotransposon in Mesua ferrea L. J. Plant 790
Biochem. Biotechnol. 791
Dierckxsens N, Mardulyn P, Smits G. 2017. NOVOPlasty: De novo assembly of organelle 792
genomes from whole genome data. Nucleic Acids Res. 793
Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for 794
de novo detection of LTR retrotransposons. BMC Bioinformatics. 795
Foote AD, Vijay N, Ávila-Arcos MC, Baird RW, Durban JW, Fumagalli M, Gibbs RA, 796
Hanson MB, Korneliussen TS, Martin MD, et al. 2016. Genome-culture coevolution 797
promotes rapid divergence of killer whale ecotypes. Nat. Commun. 7:ncomms11693. 798
Greiner S, Lehwark P, Bock R. 2019. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: 799
expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids 800
Res. 801
Hecht LBB, Thompson PC, Rosenthal BM. 2018. Comparative demography elucidates the 802
longevity of parasitic and symbiotic relationships. In: Proceedings of the Royal Society 803
B: Biological Sciences. 804
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics. 805
Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-806
MEM. Available from: http://arxiv.org/abs/1303.3997 807
Li H. 2015. Seqtk: Toolkit for processing sequences in FASTA/Q formats. GitHub. 808
Li H, Durbin R. 2011. Inference of human population history from individual whole-genome 809
sequences. Nature 475:493–496. 810
Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of 811
occurrences of k-mers. Bioinformatics. 812
Mather N, Traves SM, Ho SYW. 2020. A practical introduction to sequentially Markovian 813
coalescent methods for estimating demographic history from genomic data. Ecol. Evol. 814
Mays HL, Hung CM, Shaner PJ, Denvir J, Justice M, Yang SF, Roth TL, Oehler DA, Fan J, 815
Rekulapally S, et al. 2018. Genomic Analysis of Demographic History and Ecological 816
Niche Modeling in the Endangered Sumatran Rhinoceros Dicerorhinus sumatrensis. 817
Curr. Biol. 818
Mazet O, Rodríguez W, Chikhi L. 2015. Demographic inference using genetic data from a 819
single individual: Separating population size variation from population structure. Theor. 820
Popul. Biol. 821
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, 822
Altshuler D, Gabriel S, Daly M, et al. 2010. The genome analysis toolkit: A MapReduce 823
framework for analyzing next-generation DNA sequencing data. Genome Res. 824
Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. 2018. Versatile genome 825
assembly evaluation with QUAST-LG. In: Bioinformatics. 826
Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H. 2016. PSMC analysis of effective 827
population sizes in molecular ecology and its application to black-and-white Ficedula 828
flycatchers. Mol. Ecol. 25:1058–1072. 829
Nadachowska-Brzyska K, Li C, Smeds L, Zhang G, Ellegren H. 2015. Temporal dynamics of 830
avian populations during pleistocene revealed by whole-genome sequences. Curr. Biol. 831
25:1375–1380. 832
Ou S, Jiang N. 2018. LTR_retriever: A highly accurate and sensitive program for 833
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
24
identification of long terminal repeat retrotransposons. Plant Physiol. 834
Palamara PF, Terhorst J, Song YS, Price AL. 2018. High-throughput inference of pairwise 835
coalescence times identifies signals of selection and enriched disease heritability. Nat. 836
Genet. 837
Patton AH, Margres MJ, Stahlke AR, Hendricks S, Lewallen K, Hamede RK, Ruiz-Aravena 838
M, Ryder O, McCallum HI, Jones ME, et al. 2019. Contemporary Demographic 839
Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in 840
Tasmanian Devils. Mol. Biol. Evol. 841
Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, 842
Woerner AE, O’Connor TD, Santpere G, et al. 2013. Great ape genetic diversity and 843
population history. Nature 499:471–475. 844
Quinlan AR, Hall IM. 2010. BEDTools: A flexible suite of utilities for comparing genomic 845
features. Bioinformatics 26:841–842. 846
Rogers RL, Slatkin M. 2017. Excess of genomic defects in a woolly mammoth on Wrangel 847
island. PLoS Genet. 848
Schiffels S, Durbin R. 2014. Inferring human population size and separation history from 849
multiple genome sequences. Nat. Genet. 46:919–925. 850
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM. 2015. BUSCO: 851
Assessing genome assembly and annotation completeness with single-copy orthologs. 852
Bioinformatics 31:3210–3212. 853
Smit, AFA, Hubley R. 2015. RepeatModeler Open-1.0. Available from: 854
http://www.repeatmasker.org 855
Stanke M, Diekhans M, Baertsch R, Haussler D. 2008. Using native and syntenically mapped 856
cDNA alignments to improve de novo gene finding. Bioinformatics. 857
Terhorst J, Kamm JA, Song YS. 2016. Robust and scalable inference of population history 858
from hundreds of unphased whole genomes. Nat. Genet. 49:303–309. 859
Tiley GP, Kimball RT, Braun EL, Burleigh JG. 2018. Comparison of the Chinese bamboo 860
partridge and red Junglefowl genome sequences highlights the importance of 861
demography in genome evolution. BMC Genomics. 862
Vijay N, Bossu CM, Poelstra JW, Weissensteiner MH, Suh A, Kryukov AP, Wolf JBW. 2016. 863
Evolution of heterogeneous genome differentiation across multiple contact zones in a 864
crow species complex. Nat. Commun. 7:ncomms13195. 865
Vijay N, Park C, Oh J, Jin S, Kern E, Kim HW, Zhang J, Park J-K. 2018. Population 866
Genomic Analysis Reveals Contrasting Demographic Changes of Two Closely Related 867
Dolphin Species in the Last Glacial.Satta Y, editor. Mol. Biol. Evol. [Internet] 35:2026–868
2033. Available from: https://academic.oup.com/mbe/article/35/8/2026/5017252 869
Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. 870
2017. GenomeScope: Fast reference-free genome profiling from short reads. In: 871
Bioinformatics. 872
Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y. 2015. Genome measures used for quality 873
control are dependent on gene function and ancestry. Bioinformatics. 874
Xu Z, Wang H. 2007. LTR-FINDER: An efficient tool for the prediction of full-length LTR 875
retrotransposons. Nucleic Acids Res. 876
Zhang Z, Chen Y, Zhang J, Ma X, Li Y, Li M, Wang D, Kang M, Wu H, Yang Y, et al. 2020. 877
Improved genome assembly provides new insights into genome evolution in a desert 878
poplar ( Populus euphratica ). Mol. Ecol. Resour.:1755-0998.13142. 879
Zimin A V., Puiu D, Luo MC, Zhu T, Koren S, Marçais G, Yorke JA, Dvořák J, Salzberg SL. 880
2017. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a 881
progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 882
Zwaenepoel A, Van De Peer Y. 2019. Wgd-simple command line tools for the analysis of 883
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
25
ancient whole-genome duplications. Bioinformatics. 884
885
Bai, W. N., Yan, P. C., Zhang, B. W., Woeste, K. E., Lin, K., & Zhang, D. Y. (2018). 886
Demographically idiosyncratic responses to climate change and rapid Pleistocene 887
diversification of the walnut genus Juglans (Juglandaceae) revealed by whole-genome 888
sequences. New Phytologist. 889
Cahill, J. A., Soares, A. E. R., Green, R. E., & Shapiro, B. (2016). Inferring species 890
divergence times using pairwise sequential markovian coalescent modelling and low-891
coverage genomic data. Philosophical Transactions of the Royal Society B: Biological 892
Sciences. doi:10.1098/rstb.2015.0138 893
Campbell, M. S., Law, M. Y., Holt, C., Stein, J. C., Moghe, G. D., Hufnagel, D. E., … 894
Yandell, M. (2014). MAKER-P: A Tool kit for the rapid creation, management, and 895
quality control of plant genome annotations. Plant Physiology. 896
doi:10.1104/pp.113.230144 897
Das, R., Shelke, R. G., Rangan, L., & Mitra, S. (2018). Estimation of nuclear genome size 898
and characterization of Ty1-copia like LTR retrotransposon in Mesua ferrea L. Journal 899
of Plant Biochemistry and Biotechnology. doi:10.1007/s13562-018-0457-7 900
Dierckxsens, N., Mardulyn, P., & Smits, G. (2017). NOVOPlasty: De novo assembly of 901
organelle genomes from whole genome data. Nucleic Acids Research. 902
doi:10.1093/nar/gkw955 903
Ellinghaus, D., Kurtz, S., & Willhoeft, U. (2008). LTRharvest, an efficient and flexible 904
software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 905
doi:10.1186/1471-2105-9-18 906
Foote, A. D., Vijay, N., Ávila-Arcos, M. C., Baird, R. W., Durban, J. W., Fumagalli, M., … 907
Wolf, J. B. W. (2016). Genome-culture coevolution promotes rapid divergence of killer 908
whale ecotypes. Nature Communications, 7, ncomms11693. 909
Greiner, S., Lehwark, P., & Bock, R. (2019). OrganellarGenomeDRAW (OGDRAW) version 910
1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic 911
Acids Research. doi:10.1093/nar/gkz238 912
Hecht, L. B. B., Thompson, P. C., & Rosenthal, B. M. (2018). Comparative demography 913
elucidates the longevity of parasitic and symbiotic relationships. In Proceedings of the 914
Royal Society B: Biological Sciences. doi:10.1098/rspb.2018.1032 915
Korf, I. (2004). Gene finding in novel genomes. BMC Bioinformatics. doi:10.1186/1471-916
2105-5-59 917
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-918
MEM. Retrieved from http://arxiv.org/abs/1303.3997 919
Li, H. (2015). Seqtk: Toolkit for processing sequences in FASTA/Q formats. 920
Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-921
genome sequences. Nature, 475(7357), 493–496. doi:10.1038/nature10231 922
Marçais, G., & Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting 923
of occurrences of k-mers. Bioinformatics. doi:10.1093/bioinformatics/btr011 924
Mather, N., Traves, S. M., & Ho, S. Y. W. (2020). A practical introduction to sequentially 925
Markovian coalescent methods for estimating demographic history from genomic data. 926
Ecology and Evolution. doi:10.1002/ece3.5888 927
Mays, H. L., Hung, C. M., Shaner, P. J., Denvir, J., Justice, M., Yang, S. F., … Primerano, D. 928
A. (2018). Genomic Analysis of Demographic History and Ecological Niche Modeling 929
in the Endangered Sumatran Rhinoceros Dicerorhinus sumatrensis. Current Biology. 930
doi:10.1016/j.cub.2017.11.021 931
Mazet, O., Rodríguez, W., & Chikhi, L. (2015). Demographic inference using genetic data 932
from a single individual: Separating population size variation from population structure. 933
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
26
Theoretical Population Biology. doi:10.1016/j.tpb.2015.06.003 934
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … 935
DePristo, M. A. (2010). The genome analysis toolkit: A MapReduce framework for 936
analyzing next-generation DNA sequencing data. Genome Research. 937
doi:10.1101/gr.107524.110 938
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D., & Gurevich, A. (2018). Versatile 939
genome assembly evaluation with QUAST-LG. In Bioinformatics. 940
doi:10.1093/bioinformatics/bty266 941
Nadachowska-Brzyska, K., Burri, R., Smeds, L., & Ellegren, H. (2016). PSMC analysis of 942
effective population sizes in molecular ecology and its application to black-and-white 943
Ficedula flycatchers. Molecular Ecology, 25(5), 1058–1072. doi:10.1111/mec.13540 944
Nadachowska-Brzyska, K., Li, C., Smeds, L., Zhang, G., & Ellegren, H. (2015). Temporal 945
dynamics of avian populations during pleistocene revealed by whole-genome sequences. 946
Current Biology, 25(10), 1375–1380. doi:10.1016/j.cub.2015.03.047 947
Ou, S., & Jiang, N. (2018). LTR_retriever: A highly accurate and sensitive program for 948
identification of long terminal repeat retrotransposons. Plant Physiology. 949
doi:10.1104/pp.17.01310 950
Palamara, P. F., Terhorst, J., Song, Y. S., & Price, A. L. (2018). High-throughput inference of 951
pairwise coalescence times identifies signals of selection and enriched disease 952
heritability. Nature Genetics. doi:10.1038/s41588-018-0177-x 953
Patton, A. H., Margres, M. J., Stahlke, A. R., Hendricks, S., Lewallen, K., Hamede, R. K., … 954
Storfer, A. (2019). Contemporary Demographic Reconstruction Methods Are Robust to 955
Genome Assembly Quality: A Case Study in Tasmanian Devils. Molecular Biology and 956
Evolution. 957
Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., … 958
Marques-Bonet, T. (2013). Great ape genetic diversity and population history. Nature, 959
499(7459), 471–5. doi:10.1038/nature12228 960
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: A flexible suite of utilities for comparing 961
genomic features. Bioinformatics, 26(6), 841–842. 962
Rogers, R. L., & Slatkin, M. (2017). Excess of genomic defects in a woolly mammoth on 963
Wrangel island. PLoS Genetics. doi:10.1371/journal.pgen.1006601 964
Schiffels, S., & Durbin, R. (2014). Inferring human population size and separation history 965
from multiple genome sequences. Nature Genetics, 46(8), 919–925. 966
doi:10.1038/ng.3015 967
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). 968
BUSCO: Assessing genome assembly and annotation completeness with single-copy 969
orthologs. Bioinformatics, 31(19), 3210–3212. doi:10.1093/bioinformatics/btv351 970
Smit, AFA, Hubley, R. (2015). RepeatModeler Open-1.0. Retrieved from 971
http://www.repeatmasker.org 972
Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically 973
mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 974
doi:10.1093/bioinformatics/btn013 975
Terhorst, J., Kamm, J. A., & Song, Y. S. (2016). Robust and scalable inference of population 976
history from hundreds of unphased whole genomes. Nature Genetics, 49(2), 303–309. 977
doi:10.1038/ng.3748 978
Tiley, G. P., Kimball, R. T., Braun, E. L., & Burleigh, J. G. (2018). Comparison of the 979
Chinese bamboo partridge and red Junglefowl genome sequences highlights the 980
importance of demography in genome evolution. BMC Genomics. doi:10.1186/s12864-981
018-4711-0 982
Vijay, N., Bossu, C. M., Poelstra, J. W., Weissensteiner, M. H., Suh, A., Kryukov, A. P., & 983
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
27
Wolf, J. B. W. (2016). Evolution of heterogeneous genome differentiation across 984
multiple contact zones in a crow species complex. Nature Communications, 7, 985
ncomms13195. 986
Vijay, N., Park, C., Oh, J., Jin, S., Kern, E., Kim, H. W., … Park, J.-K. (2018). Population 987
Genomic Analysis Reveals Contrasting Demographic Changes of Two Closely Related 988
Dolphin Species in the Last Glacial. Molecular Biology and Evolution, 35(8), 2026–989
2033. Retrieved from https://academic.oup.com/mbe/article/35/8/2026/5017252 990
Vurture, G. W., Sedlazeck, F. J., Nattestad, M., Underwood, C. J., Fang, H., Gurtowski, J., & 991
Schatz, M. C. (2017). GenomeScope: Fast reference-free genome profiling from short 992
reads. In Bioinformatics. doi:10.1093/bioinformatics/btx153 993
Wang, J., Raskin, L., Samuels, D. C., Shyr, Y., & Guo, Y. (2015). Genome measures used for 994
quality control are dependent on gene function and ancestry. Bioinformatics. 995
doi:10.1093/bioinformatics/btu668 996
Xu, Z., & Wang, H. (2007). LTR-FINDER: An efficient tool for the prediction of full-length 997
LTR retrotransposons. Nucleic Acids Research. doi:10.1093/nar/gkm286 998
Zimin, A. V., Puiu, D., Luo, M. C., Zhu, T., Koren, S., Marçais, G., … Salzberg, S. L. (2017). 999
Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a 1000
progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome 1001
Research. doi:10.1101/gr.213405.116 1002
Zwaenepoel, A., & Van De Peer, Y. (2019). Wgd-simple command line tools for the analysis 1003
of ancient whole-genome duplications. Bioinformatics. 1004
doi:10.1093/bioinformatics/bty915 1005
1006
Figure Legends: 1007
1008
Figure 1a: Change in Effective population size (Ne) with change in the genome quality of 1009
Human assemblies. 1010 PSMC curve with bootstrap replicates for Human-NA12878 (see Table S2) mapped to 1011
human assembly version 4 (hg4) shown in red, to hg10 shown in brown, to hg15 shown 1012
in green, to hg19 shown in purple and to hg38 shown in blue, corresponding mapping 1013
percentages are given in parentheses. Poor quality assembly (hg4) overestimated the Ne 1014
during recent (~1 KYA) and ancient (~1-5 MYA) times, whereas Ne was underestimated 1015
during mid-period (~100-400 KYA) compared to better assemblies. 1016
1017
Figure 1b: Extent of variation in PSMC trajectories inferred from Human assemblies. 1018 Estimates of Ne inferred using each assembly showed heterogeneity across time points. 1019
Blackline indicates Standard Deviation in each atomic interval across estimates of all the 1020
assemblies. The colored lines show the Coefficient of Variation within 100 bootstrap 1021
replicates of each assembly. SD curve (black line) shows that estimates in the Atomic 1022
intervals contributing to recent times (AI 1-6) and ancient times (AI 43-64) showed the 1023
highest variation across assemblies. The CV of the poorest assembly (hg4) shows the 1024
highest variation across bootstrap estimates in recent and ancient times suggesting 1025
relatively low robustness compared to others but there was not much difference for mid-1026
period. 1027
1028
Figure 2a: Change in Effective population Sizes (Ne) due to masking of repeat regions in 1029
Populus trichocarpa PSMC. 1030 PSMC curve for Populus trichocarpa after masking all the repeat regions in the genome 1031
(blue line) and without masking (orange-red line). Unmasked trajectory has dots 1032
indicating the fraction of repeats in an atomic interval, larger the size more the repeat 1033
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
28
content in an atomic interval. Ne estimates during recent (20-100 KYA) and ancient 1034
(1MYA-5MYA) times show considerable differences between the two curves showing 1035
the effect of exclusion/inclusion of repeat sequences. 1036
1037
Figure 2b: Change in Effective population Sizes (Ne) after the inclusion of each repeat 1038
class in Populus trichocarpa PSMC separately. 1039 PSMC curves for Populus trichocarpa, with masked (blue) and unmasked (orange-red) 1040
genomes used. Change in trajectory due to the inclusion of each class of repeat to the 1041
masked genome is shown. Including each repeat class and masking, other repeat-classes 1042
will show changes specific to respective repeat-class. The inclusion of LTR-Gypsy 1043
shows a distinguishingly different trajectory similar to the unmasked genome during 1044
~20-100 KYA, which shows that the inclusion of LTR-Gypsy is influencing the 1045
trajectory in recent times. 1046
1047
Figure 2c: Bootstrapped PSMC results after masking of repeats in Populus trichocarpa. 1048 PSMC curves for Populus trichocarpa showing the robustness of changes due to 1049
masking of repeats. Masked (blue) and unmasked (orange-red) shows completely 1050
distinctive trajectories whereas unmasking only LTR-Gypsy repeat class (pink) also 1051
shows a marked difference. The second y-axis (red) shows the Coefficient of variation 1052
(CV) across the bootstraps across all the repeat classes. This indicates changes in Ne due 1053
to repeats are robust to bootstrap replications. 1054
1055
Figure 2d: Abundance of different repeat classes across all atomic intervals in Populus 1056
trichocarpa PSMC. 1057 Contribution of various repeat classes to each atomic interval is shown. Non-repeat 1058
regions (light green) are generally most abundant across all intervals whereas some 1059
repeat classes such as LTR’s have considerable abundance in some of the atomic 1060
intervals. Repeat families such as LTR-Gypsy, LTR-Copia and RC-Helitron showed 1061
higher abundance compared to other repeat classes. 1062
1063
Figure 2e: Fraction of genome contributed to each atomic interval of Populus 1064
trichocarpa PSMC by the repeat classes. 1065 Contribution of various repeat classes (percentage of whole genomic length) to each 1066
atomic interval is shown. LTR-Gypsy has contributed to around 2% in the atomic 1067
intervals spanning recent and ancient times, which might be one of the contributing 1068
factors to the change in Ne trajectory. 1069
1070
Figure 3: Comparison of heterozygosity and Ts/Tv ratio across atomic intervals of 1071
Populus trichocarpa PSMC. 1072 Change in heterozygosity and corresponding Ts/Tv ratio across atomic intervals in 1073
different repeat classes of Populus trichocarpa PSMC. The heterozygosity (left y-axis) 1074
increases with atomic intervals (x-axis) whereas Ts/Tv ratio (right y-axis) does not follow 1075
this trend for most of the repeat classes. 1076
1077
Figure 4a: Demographic inference of Mesua ferrea by PSMC and effect of different 1078
values of maximum TMRCA. 1079 PSMC inferred trajectories with same -p parameter (3*2+1*10+15*2+14+4) but for 1080
several values of maximum TMRCA parameter. Colour used for -t of 35 (cyan), 45 1081
(blue), 55 (green) and 65 (brown). For -t 500 (red), -p was used “4+25*2+4+6”, but did 1082
not have sufficient number of recombination events in some of the last atomic intervals. 1083
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
29
Demographic scenario shows steep decline in Ne, after MPT (Mid-Pleistocene 1084
transition) i.e. ~700 KYA, which again went through a second bottleneck during LGM 1085
(Last glacial maximum) of Last glacial period i.e. around ~30KYA. 1086
1087
Figure 4b: Distribution of atomic intervals across scaffold1281 for different maximum 1088
TMRCA values for Mesua ferrea PSMC. 1089 For each run of PSMC with different -t values decode based genomic regions along this 1090
scaffold and corresponding atomic intervals are shown. The atomic intervals which 1091
spanned scaffold1281 are shown here with their respective colours. Callability of bases 1092
in these regions is shown to highlight the quality of variants identified; heterozygosity is 1093
shown to demarcate hypervariable regions. It can be seen that same genomic coordinates 1094
are being distributed to more recent atomic intervals from older AI’s, which hints at 1095
redistribution of positions of atomic intervals with changes in the maximum TMRCA 1096
parameter values. 1097
1098
Figure 5: Comparative PSMC for Forest plant genomes. 1099 PSMC inferred trajectories with bootstrap replicates of 15 forest plant species. Top 1100
rectangles show respective time periods with important predicted glaciation events. 1101
Betula pendula shows a completely discordant trajectory compared to all other species. 1102
Whereas, tropically distributed species have a common trend of decline during and after 1103
Mid-Pleistocene glaciations. Some of the species such as Faidherbia albida were able to 1104
recover from these bottlenecks, which translates into their adaptation to dryer 1105
environments but most of the other plants were not able to recover from the same. 1106
1107
1108
Supplementary figure legends: 1109
1110
Figure S1a: Change in Effective population sizes (Ne) along with change in genome 1111
quality. Bootstrapped PSMC curves for Tribolium castaneum mapped to Tcas1.0 and 1112
Tcas5.2 genome assemblies with different genome quality. Mutation rate used 2.9e-09 1113
per site per generation with generation time of 0.3 i.e. 12 weeks for one generation. 1114
1115
Figure S1b: Change in Effective population sizes (Ne) along with change in genome 1116
quality. Bootstrapped PSMC curves for Danio rerio mapped to danRer1 and danRer11 1117
genome assemblies with different genome quality. Mutation rate used 1.9e-09 per site 1118
per generation with generation time of 1 year. 1119
1120
Figure S2: Correlation between repeat abundance and change in Effective population 1121
size (Ne). Repeat content across atomic intervals in Populus trichocarpa PSMC showed 1122
a positive correlation with absolute change in Ne estimated from masked vs unmasked 1123
genomes. Kendall’s correlation coefficient was calculated showing Kendall’s correlation 1124
coefficient i.e. Tau (τ) =0.346, with p-value of 0.0004. 1125
1126
Figure S3: Effect of callability of bases on Populus trichocarpa PSMC. 1127
The output of CallableLoci module of GATKv3.8 distributed the genomic regions in 1128
several classes such as, Callable (good quality bases of reference genome), N-ref (Bases 1129
having N’s or gaps in the reference genome), No-coverage (Bases in the genome which 1130
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
30
were not supported by any read), Low-coverage (Bases in the genome which showed 1131
small support of reads compared to mean) and Poorly-mapped (Bases in the genome 1132
which showed poor mapping quality of reads). All of these classes were masked one at a 1133
time and the effect of each non-callable group was evaluated. After that all the non-1134
callable groups were merged as a non-callable class and they were masked followed by 1135
another run of masking callable sites. For each run the percent of the genome masked is 1136
given in parentheses. Masking of callable sites gave completely different results, 1137
whereas other individual non-callable classes did not show any large change in the 1138
inferred trajectories. Non-callable trajectory showed some difference but did not change 1139
the trajectory much to change the inferences about the demography. 1140
1141
Figure S4: Effect of masking of different cumulative coverage classes on Populus 1142 trichocarpa PSMC. Whole genomic per base depth was calculated using SAMTOOLS depth 1143
command; this was used to make cumulative coverage classes based on their coverage 1144
distribution. Bases having < 10x coverage in one class, bases having < 20x coverage in 1145
another class etc. For each coverage class genomic co-ordinates were obtained and used for 1146
masking these regions, followed by PSMC analyses. The amount of genome masked by each 1147
class is showed in parentheses. There was no difference in PSMC trajectories till masking of 1148
less than 20X coverage classes, whereas from <30X coverage class masking trajectories 1149
started to differ till <60 X coverage class. Masking for more coverage classes i.e. < 70X and 1150
more did not produce psmcfa file during analyses. Genomic regions with higher coverages 1151
mostly contributed to the older atomic intervals, as masking till <40X coverage classes 1152
showed difference in recent time and small difference in older time. 1153
1154
Figure S5: Change in length distribution of atomic intervals due to change in maximum 1155
TMRCA values in Mesua ferrea PSMC. 1156
Lengths of sequences in each atomic interval are compared for each maximum TMRCA 1157
value to evaluate if lengths are getting redistributed or not. For -t 65 (purple box) atomic 1158
intervals contributing to older times (see AI 53 to 63) are more represented than all other 1159
values, and even if present (see AI 53,54,55 and 63) -t 65 has smaller lengths compared 1160
to others in those time periods. This shows that increasing the maximum TMRCA allows 1161
shorter genomic regions to contribute to older times. 1162
1163
Figure S6a: Effect of changing θ/ρ (r flag) value in PSMC on demographic inference of 1164
Mesua ferrea PSMC. 1165
PSMC estimates were inferred for Mesua ferrea with different values of -r flag. Smaller 1166
values of these value were able to travel further ahead in trajectory (see black and red 1167
lines), whereas atomic intervals contributing to these time points did not have enough 1168
recombination events. The other -r values i.e. 5,10 and 25 showed convergence in terms 1169
of recombination events but did not show any change in trajectory (see green, blue and 1170
cyan lines). 1171
1172
Figure S6b: Effect of changing θ/ρ (r flag) value in PSMC on the number of 1173
recombination events across atomic intervals from 50 till 64 of Mesua ferrea PSMC. 1174
The -r values were able to go further ahead in trajectories with smaller values i.e. 0.1 1175
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
31
and 1, but these atomic intervals did not show convergence in terms of recombination 1176
events. For the value of 0.1, till 56th
AI enough recombination events occurred, whereas 1177
for value of 1 there were less than 10 recombination events for last three atomic 1178
intervals. 1179
1180
Figure S7: Demographic inference of Mesua ferrea for North-eastern (China) 1181
population (red) and South-western (India) population (sky-blue). 1182
Chinese sample (red) trajectory extends well back in time till ~5MYA, whereas Indian 1183
sample (sky-blue) reaches till ~400 KYA only. The time at which the population decline 1184
begins is similar in both and shows similar trajectories from ~100 KYA till the recent 1185
times, as both show second decline during last glacial period i.e. ~20 KYA. 1186
1187
Figure S8: Demographic inference of Human-NA12878 sequenced using Illumina (blue) 1188
and BGISEQ (red). The inferred PSMC trajectories showed identical Ne estimates, 1189
showing there are no sequencing platform based differences. 1190
1191
Figure S9: K-mer distribution of 21-mer’s of sample from Indian population Mesua 1192 ferrea. GenomeScope results of Indian sample predict low heterozygosity (0.03%) of the 1193
sample with a single peak at 150x coverage. Predicted genome size is approx. 497 Mbp, 1194
which is underestimation owing to neglecting high coverage sequences of organellar and 1195
repeat sequences. 1196
1197
Figure S10: K-mer distribution of 21-mer’s of sample from Chinese population Mesua 1198 ferrea. GenomeScope results of Chinese sample predict high heterozygosity (0.85%) 1199
compared to Indian sample, showing ~25-fold difference between both populations. Two 1200
peaks are due to high heterozygosity, but predict somewhat similar genome size i.e. 479 Mbp 1201
compared to other sample. 1202
1203
Figure S11: Ks- distribution plot. 1204
Distribution of synonymous substitutions (Ks) across paralogs of Mesua ferrea (green) 1205
and homologs of Mesua ferrea with Manihot esculenta (red) and Populus trichocarpa 1206
(blue). Blue and red peaks show common WGD event across Malpighiales which is 1207
around 1.1-1.2. There is a possibility of independent WGD event in clusiods or Mesua 1208
ferrea which shows peak at 0.2. The independent WGD event could be species-specific 1209
or clade specific but cannot be stated due to unavailability of other genomic dataset from 1210
clusioids. 1211
1212
Figure S12: Assembly quality comparison of Malpighiales using Eudicotyledons_odb10 1213 dataset. Comparison of genome completeness based on BUSCO scores of Malpighiales 1214
genome assemblies. Mesua ferrea showed relatively complete assembly compared to other 1215
compared species. 1216
1217
Figure S13a: Dot-plot of Garcinia mangostana (top) and Mesua ferrea (left) plastome 1218
contig set 1. 1219
1220
Figure S13b: Dot-plot of Garcinia mangostana (top) and Mesua ferrea (left) plastome 1221
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
32
contig set 2. 1222
1223
Figure S13c: Dot-plot of Jatropha curcus (top) and Mesua ferrea (left) plastome contig 1224
set 1. 1225
1226
Figure S13d: Dot-plot of Jatropha curcus (top) and Mesua ferrea (left) plastome contig 1227
set 2. 1228
1229
Figure S13e: Dot-plot of Byrsonima coccolobifolia (top) and Mesua ferrea (left) plastome 1230
contig set 1. 1231
1232
Figure S13f: Dot-plot of Byrsonima coccolobifolia (top) and Mesua ferrea (left) plastome 1233
contig set 2. 1234
1235
Figure S14: Circular plot of Mesua ferrea plastome. 1236
Assembled chloroplast genome length is 161.422 Kbp. Annotated genes are shown with 1237
colours showing their association to the pathways. 1238
1239
Figure S15: Circular plot of Mesua ferrea Mitochondria. 1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
33
Figures 1266
1267
Figure 1a 1268
1269
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
34
Figure 1b 1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
35
Figure 2a 1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
36
Figure 2b 1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
37
Figure 2c 1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
38
Figure 2d 1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
39
Figure 2e 1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
40
Figure 3 1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
41
Figure 4a 1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
42
Figure 4b 1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
43
Figure 5 1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
44
Supplementary figures: 1540
1541
Figure S1a 1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
45
1564
Figure S1b 1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
46
Figure S2 1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
47
Figure S3 1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
48
Figure S4 1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691 1692
1693
1694
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
49
Figure S5 1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
50
Figure S6a 1745
1746
1747
1748
1749
1750
1751
1752
Figure S6b 1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
51
1773
1774
Figure S7 1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
Figure S8 1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
52
1823
1824
Figure S9 1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
Figure S10 1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
53
1873
Figure S11 1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
Figure S12 1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
54
1923
Figure S13a 1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
Figure S13b 1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
55
1973
Figure S13c 1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
Figure S13d 1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
56
2023
Figure S13e 2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
Figure S13f 2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
57
2073
Figure S14 2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
Figure S15 2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
58
2123
Supplementary Tables 2124
2125
Supplementary Table S1: Assembly quality comparison. 2126
2127
Assembly Mapping Percent N50 (Length in MB) N’s per 100 KB Percentage of complete
BUSCO’s
hg4 92.86 6.2 11725.66 72.4
hg10 97.44 1.98 206.89 89.5
hg15 99.84 25.44 7.29 94.6
hg19 100 146.36 7645.47 94.9
hg38 100 145.14 4964.97 94.9
Tcas1.0 98.2 0.24 2291.2 94.2
Tcas5.2 99.01 15.27 8144.72 99.4
danRer1 95.53 45.41 12179.91 NA
danRer11 98.04 52.19 279.51 NA
2128
Supplementary Table S2: SRA reads used in this study. 2129
2130
Sr. No. SRA Accession Sample Details
1 SRR9091899 Homo sapiens NA12878 Illumina-Hiseq 4000
2 SRR7121482 Homo sapiens NA12878 BGISEQ-500
3 SRR7906163 Populus trichocarpa
4 SRR9007075 Populus euphratica
5 SRR3045849 Populus nigra
6 SRR2745904 Populus tremula
7 SRR2751102 Populus tremuloides
8 SRR7121482 Mesua ferrea BGISEQ-500
9 ERR2026087 Betula pendula
10 SRR6058604 Durio zibethinus
11 SRR10339638 Eucalyptus grandis
12 SRR7072804 Faidherbia albida
13 SRR5265130 Tectona grandis
14 SRR3860174 Quercus robur
15 SRR5678803 Ficus carica
16 DRR142810 Citrus unshiu
17 SRR5674478 Trema orientalis
18 SRR8731963 Castanea mollisima
19 ERR1346607 Olea europea
20 SRR5150443 Santalum album
21 SRR5019373 Populus pruinosa
22 ERR3491152 Trochodendron aralloides
23 SRR5992151 Tribolium castaneum
24 SRR6687445 Danio rerio
25 This study Mesua ferrea
2131
2132
2133
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
59
2134
2135
2136
Supplementary Table S3: Genome assembly statistics of Mesua ferrea genome. 2137
2138
Sr. No. Parameter Contigs Scaffolds
1 Total Assembled genome length (MB) 609.19 614.35
2 Number of contigsa 531964 503130
3 Largest contiga (KB) 2231.61 4132.84
4 N50a (KB) 251.71 392.76
5 L50a 607 379
6 L75a 2612 1584
7 GC % 38.38 38.38 aStatistics are based on contigs > 100 bp. 2139
2140
Supplementary Table S4: BUSCO score comparison across previously published 2141
genomes from Malpighiales using eudicotyledons_odb10 dataset. 2142
2143
Sr. No. Species Complete
and Single
Copy
BUSCO’s
Complete
and
Duplicated
BUSCO’s
Fragmented
BUSCO’s
Missing
BUSCO’s
Complete
BUSCO’s
(N=2121)
Complete
BUSCO’s
percentage
1 Caryocar
brasiliense
1528 101 246 246 1629 76.8
2 Euphorbia esula 348 43 476 1254 391 18.43
3 Hevea brasiliense 1646 412 24 39 2058 97.03
4 Jatropha curcus 2040 33 14 34 2073 97.74
5 Linum usitatissimum 745 1245 36 95 1990 93.82
6 Mesua ferrea 1633 364 31 93 1997 94.15
7 Manihot esculenta 1853 198 20 50 2051 96.7
8 Passiflora edulis 827 20 638 636 847 39.93
9 Populus alba 1640 426 18 37 2066 97.41
10 Populus euphratica 1592 479 13 37 2071 97.64
11 Populus simonii 1623 443 16 39 2066 97.41
12 Populus trichocarpa 1629 436 14 42 2065 97.36
13 Ricinus communis 2009 22 46 44 2031 95.76
14 Salix brachista 1757 288 19 57 2045 96.42
15 Viola pubescens 1471 297 188 165 1768 83.36
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
60
2155
2156
2157
Supplementary Table S5: BUSCO score comparison across previously published 2158
genomes from Malpighiales using embryophyta_odb10 dataset. 2159
2160
Sr. No. Species Complete
and Single
Copy
BUSCO’s
Complete
and
Duplicated
BUSCO’s
Fragmented
BUSCO’s
Missing
BUSCO’s
Complete
BUSCO’s
(N=1375)
Complete
BUSCO’s
percentage
1 Caryocar brasiliense 1017 43 183 132 1060 77.1
2 Euphorbia esula 265 42 424 644 307 22.4
3 Hevea brasiliense 1088 243 16 28 1331 96.8
4 Jatropha curcus 1331 18 4 22 1349 98.1
5 Linum usitatissimum 467 859 10 39 1326 96.5
6 Mesua ferrea 1093 199 18 65 1292 94
7 Manihot esculenta 1248 80 15 32 1328 96.6
8 Passiflora edulis 598 9 460 308 607 44.2
9 Populus alba 1074 270 10 21 1344 97.7
10 Populus euphratica 1063 284 6 22 1347 98
11 Populus simonii 1076 270 8 21 1346 97.9
12 Populus trichocarpa 1084 258 9 24 1342 97.6
13 Ricinus communis 1317 7 19 32 1375 96.3
14 Salix brachista 1200 146 5 24 1346 97.9
15 Viola pubescens 973 186 124 92 1159 84.3
2161
Supplementary Table S6: LTR-retriever LAI scores for Malpighiales genome 2162
assemblies. 2163
2164
Sr. No. Species Raw LAI LAI
1 Hevea brasiliensis 1.75 0.75
2 Jatropha curcas 2.37 1.22
3 Linum usitatissimum 5.55 8.76
4 Manihot esculenta 3.79 3.5
5 Mesua ferrea 2.1 7.73
6 Populus alba 11.32 11.74
7 Populus euphratica 1.25 3
8 Populus simonii 8.96 12.15
9 Populus trichocarpa 6.54 11.57
10 Rhizophora apiculata 7.48 13.03
11 Ricinus communis 4.37 5.43
12 Salix brachista 17.54 17.1
2165
2166
2167
2168
2169
2170
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
61
2171
2172
2173
Supplementary Table S7: Annotation Statistics for iterative MAKER-P 2174
annotation of Mesua ferrea genome assembly. 2175
2176
Parameter Round 1 Round 2 Round 3 Round 4 Round 5
No. of protein coding genes 35557 38971 38965 39011 46540
Gene density per KB 0.07 0.06 0.08 0.06 0.08
Average gene length (bp) 2631.77 2722.22 2768.24 2761.98 2477.54
Average exons per mRNA 4.73 4.96 5.1 5.09 4.55
Average exon length (bp) 206.42 217.37 222.09 222.14 222.24
Average intron length (bp) 401.68 378.83 359.1 360.17 365.12
Cumulative fraction of genes with AED
< 0.5
0.99 0.92 0.92 0.92 0.93
Percentage of complete BUSCO’s 89.4 93.3 92.3 92.1 94.7
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
62
2211
2212
2213
Supplementary Table S8: Genome assemblies used in this study. 2214 2215
Sr. No. Species Assembly Database
1 Homo sapiens GRCh38.p13 ENSEMBL
2 Homo sapiens hg4 UCSC Genome
Browser
3 Homo sapiens hg10 UCSC Genome
Browser
4 Homo sapiens hg15 UCSC Genome
Browser
5 Homo sapiens hg19 UCSC Genome
Browser
6 Caryocar brasiliense Cbr_v1.0 NCBI
7 Euphorbia esula ASM291907v1 NCBI
8 Hevea brasiliense ASM165405v1 NCBI
9 Jatropha curcus JatCur_1.0 NCBI
10 Linum usitatissimum ASM22429v2 NCBI
11 Mesua ferrea This Study
12 Manihot esculenta Manihot esculenta v6 NCBI
13 Passiflora edulis ASM215610v1 NCBI
14 Populus alba ASM523922v1 NCBI
15 Populus euphratica PopEup_1.0 NCBI
16 Populus simonii Populus_simonii_2.0 NCBI
17 Populus trichocarpa Pop_tri_v3 NCBI
18 Ricinus communis JCVI_RCG_1.1 NCBI
19 Salix brachista ASM907833v1 NCBI
20 Viola pubescens GCA_002752925.1 NCBI
21 Betula pendula Bpev01 NCBI
22 Durio zibethinus Duzib1.0 NCBI
23 Eucalyptus grandis Egrandis1_0 NCBI
24 Faidherbia albida http://gigadb.org/dataset/101054 GIGADB
25 Tectona grandis https://biit.cs.ut.ee/supplementary/WGSteak/
26 Quercus robur Q_robur_v1 NCBI
27 Ficus carica UNIPI_FiCari_1.0 NCBI
28 Citrus unshiu http://www.citrusgenome.jp/ DDBJ
29 Trema orientalis TorRG33x02_asm01 NCBI
30 Castanea mollisima ASM76360v1 NCBI
31 Olea europea O_europea_v1 NCBI
32 Santalum album ASM291163v1 NCBI
33 Populus pruinosa http://gigadb.org/dataset/100319 GIGADB
34 Trochodendron aralloides http://gigadb.org/dataset/100657 GIGADB
35 Tribolium castaneum ftp://ftp.hgsc.bcm.edu/Tcastaneum/Tcas1.0/ HGEC, BGM
36 Tribolium castaneum Tcas5.2 NCBI
37 Danio rerio danRer1 UCSC Genome
Browser
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint
63
38 Danio rerio danRer11 UCSC Genome
Browser
2216
2217
2218
2219
2220
Supplementary Table S9: Details of the mutation rate and generation time used. The 2221
Median estimate of mutation rate i.e. 2.5e-09 per site per year was used for all and 2222
converted to per generation mutation rates according to their generation times. 2223 2224
Sr. No. Species Generation time used Mutation rate per generation
1 Betula pendula 20 5e-08
2 Durio zibethinus 7 1.75e-08
3 Eucalyptus grandis 7 1.75e-08
4 Faidherbia albida 7 1.75e-08
5 Tectona grandis 7 1.75e-08
6 Quercus robur 15 3.75e-08
7 Ficus carica 7 1.75e-08
8 Citrus unshiu 7 1.75e-08
9 Trema orientalis 7 1.75e-08
10 Castanea mollisima 7 1.75e-08
11 Olea europea 7 1.75e-08
12 Santalum album 7 1.75e-08
13 Populus pruinosa 15 3.75e-08
14 Trochodendron aralloides 15 3.75e-08
15 Mesua ferrea 15 3.75e-08
16 Tribolium castaneum 0.3 (12 weeks) 2.7e-09
17 Danio rerio 1 1e-09
18 Homo sapiens 25 2.5e-08
2225
2226
.CC-BY 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted March 5, 2020. . https://doi.org/10.1101/2020.03.03.962365doi: bioRxiv preprint