University of Calgary
PRISM: University of Calgary's Digital Repository
Graduate Studies The Vault: Electronic Theses and Dissertations
2019-01-07
Conservation genomics of the endangered Banff
Springs Snail (Physella johnsoni) using Pool-seq
Stanford, Brenna
Stanford, B. (2019). Conservation genomics of the endangered Banff Springs Snail (Physella
johnsoni) using Pool-seq (Unpublished master's thesis). University of Calgary, Calgary, AB.
http://hdl.handle.net/1880/109445
master thesis
University of Calgary graduate students retain copyright ownership and moral rights for their
thesis. You may use this material in any way that is permitted by the Copyright Act or through
licensing that has been assigned to the document. For uses that are not allowable under
copyright legislation or licensing, you are required to seek permission.
Downloaded from PRISM: https://prism.ucalgary.ca
UNIVERSITY OF CALGARY
Conservation genomics of the endangered
Banff Springs Snail (Physella johnsoni) using Pool-seq
by
Brenna C.M. Stanford
A THESIS
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF SCIENCE
GRADUATE PROGRAM IN BIOLOGICAL SCIENCES
CALGARY, ALBERTA
JANUARY 2019
© Brenna C.M. Stanford 2019
Abstract
Understanding how species persist and adapt to local habitats is a fundamental question for
species of conservation concern. Located in Banff National Park, the endangered snail, Physella
johnsoni, inhabits seven highly specialized thermal springs. P. johnsoni undergo yearly
population bottlenecks with minimal to no dispersal among springs. The consequences of these
processes on genetic population structure are unknown. To investigate effects of habitat and life
history on P. johnsoni’s genome and to test the hypothesis of a single panmictic population, I
collected 20 to 40 snails/population for P. johnsoni and a closely related snail, P. gyrina, in
adjacent, non-thermal water. Using whole genome pooled-sequencing, millions of single
nucleotide polymorphisms were captured. These genetic variants resolved significant genetic
divergence between P. johnsoni and P. gyrina. In addition, I detected distinct genetic clusters
and reduced nucleotide diversity within each spring, indicative of strong micro-geographical
population structure and suggestive of a role for genetic drift. These results suggest that P.
johnsoni from each spring represent a distinct genetic unit, which has conservation implications
for the designation of designatable unit status under COSEWIC, and where mixing of snails may
reduce the consequences of genetic drift.
Acknowledgments
To my fantastic supervisor, Sean, I cannot begin to thank you enough for your guidance,
patience, and humour. I am so excited for the opportunity to keep working with you. I am
incredibly grateful for the wonderful group of people you have brought into this lab and for the
environment that you foster. To Danielle, James, Jessy, Jori, Sara, Tegan, and Teresa, you are
truly amazing. As scientists, leaders, and people I am so fortunate to call friends. Thank you for
putting up with my distractions, tears and countless questions. Thank you for the many laughs,
deep conversations, hugs and coffee. This work would not be anywhere close to this point
without all of your scientific knowledge and support.
To my committee members, Dwayne Lepitzki and Jana Vamosi – a huge thank you for
everything you’ve done to support me in this work! A special thank you, Dwayne, for braving
the cold, the snow and the heat and mosquitos, to make sure I not only lived through but
thoroughly enjoyed my field season.
To Parks Canada, specifically Mark Taylor, thank you so much for bringing me onto this
project. It has been an absolute pleasure working with you.
To my family, where do I even begin? Thank you for always believing in me, knowing
when to step in, and when to let me find my own way. You challenge me, support me and make
me laugh so hard. I will never be able to tell you how grateful I am for everything you’ve done
and continue to do for me. And last but not least, thank you, Peter. You have been so incredibly
understanding, supportive and I love you so very much.
Table of Contents
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 GENERAL INTRODUCTION
1.1 INTRODUCTION
1.2 STUDY SYSTEM
1.3 OBJECTIVES
CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL
2.1 INTRODUCTION
2.2 METHODS
2.2.1 Sampling
2.2.2 DNA extraction
2.2.3 DNA quantification and quality check
2.2.4 Constructing DNA pools for Pool-seq
2.2.5 DNA sequencing
2.2.6 Genomic analysis
2.2.7 Pairwise FST
2.2.8 Nucleotide diversity
2.3 RESULTS
2.3.1 DNA extraction, quantification and quality
2.3.2 DNA sequencing and pre-processing
2.3.3 Pairwise FST
2.3.4 Nucleotide diversity
2.4 DISCUSSION
2.4.1 Population structure and nucleotide diversity between P. johnsoni and P. gyrina populations
2.4.2 Population structure and nucleotide diversity within P. johnsoni and P. gyrina populations
2.4.3 Broader implications and conservation recommendations
2.4.4 The utility of Pool-seq in conservation
2.4.5 Caveats
2.4.6 Conclusions
CHAPTER 3 GENERAL CONCLUSIONS
REFERENCES
APPENDIX A: GENOMIC ANALYSIS PIPELINE
APPENDIX B: DNA AND SEQUENCING QUALITY
APPENDIX C: POPOOLATION2 PAIRWISE FST ESTIMATES
List of Tables
Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250 bp side-by-side windows, minor allele count of 8, minimum coverage of 15 and maximum coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
List of Figures
Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans. Namely for resolving taxonomic ambiguity, assigning DUs/ESUs and for characterizing the population genomic consequences of increasing threats.
Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, where some is randomly lost with the reduction in population size. This decrease in genetic variation is observed even when population numbers increase.
Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) from each individual of the population are combined and the individual is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations in sequencing and in analysis.
Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).
Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks till August 2000 and then once every four weeks till September 2016 when population counts were ended. From April to September 2017 and September 2018 the counts were resumed. Original springs include J1, J2, J3, J4 and J5. The re-established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.
Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250 bp side-by-side windows, minor allele count of 4, minimum coverage of 20 and maximum coverage of 200, where 60% of the window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).
Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post-Trimmomatic).
CHAPTER 1 GENERAL INTRODUCTION
1.1 INTRODUCTION
As habitats continue to change largely due to anthropogenic impacts, there is an
associated massive loss of biodiversity and an increasing number of threatened and endangered
species (Frankham 2005; Butchart et al. 2010). Habitat fragmentation, habitat loss, introduction
of invasive species, and over-exploitation can leave populations vulnerable to natural disasters,
demographic stochasticity, and environmental change (Shaffer 1981; Frankham 2005). This
biodiversity loss, at the genetic, species and ecosystem level, has incredibly harmful impacts on
human society (Cardinale et al. 2012; Hooper et al. 2012). Decreases in biodiversity lower the
productivity and services of ecosystems (e.g., wood production, carbon sequestration, soil
mineralization) and biodiversity loss can have detrimental impacts to ecosystem function similar
to other forms of environmental change (e.g., climate warming, acidification) (Cardinale et al.
2012; Hooper et al. 2012). These factors highlight the need for conservation management to
devise and implement effective and timely management plans. However, with limited time and
resources, the choice of habitats and species to conserve remains a significant factor (Martin et
al. 2012; Carwardine et al. 2018). Allocation of limited conservation resources may need to be
directed towards prioritized species and populations based on factors such as ecological function
and/or evolutionary significance (e.g. Joseph et al. 2009; Funk et al. 2012; Carwardine et al.
2018). However, determining a species or population’s priority can be extremely difficult due to
a variety of factors including assessments of extinction risk for species. Altogether, conservation
biology is faced with increasing biodiversity losses combined with intensifying data deficiencies.
Data on species and their environment are needed for effective conservation. Data
deficiencies render fundamental questions about conservation status difficult to answer. The
International Union for Conservation of Nature (IUCN) considers a species data deficient if there
is insufficient information on the species' taxonomic status, the threats or status of populations,
and/or distribution (Bland et al. 2015; Parsons 2016). Priority management and funding are
largely allocated to species of conservation concern, when these data are present, whereas data
deficient species typically receive lower priority (Morais et al. 2013; Howard & Bickford 2014;
Parsons 2016). Consequently, there can be a taxonomic bias in terms of data availability, as rare,
cryptic or non-charismatic organisms (e.g., invertebrates, which make up the majority of global
biodiversity) are largely data deficient (Howard & Bickford 2014; Régnier et al. 2015; Cowie et
al. 2017). In addition, over 60% of data deficient species are likely threatened by extinction
(Howard & Bickford 2014; Bland et al. 2015). If data deficient species are considered, up to 7%
of described species may have already been lost since 1550 compared to the 0.04% listed by the
IUCN (Régnier et al. 2015). Overall, such losses exemplify the need to develop more rapid and
appropriate tools towards more effective conservation management plans.
Genomics is one such tool that can be used in conjunction with other means to address
these challenges (Figure 1.1). It has become increasingly clear that genetic diversity is essential
for the long-term viability of species and that failure to protect it will undermine the actions to
protect biodiversity at the ecosystems and species level (Frankel 1974; Laikre 2010). In response
to environmental change, it is in part genetic variation, either de novo mutations (slow response),
or standing genetic variation (i.e., variation present in a population or species but unselected for
until the environment changes) or a combination of both that may facilitate population
persistence (e.g., Frankel 1974; Barrett et al. 2008; Morris et al. 2014). For example, a recent
study on Tasmanian devils (Sarcophilus harrisii) has demonstrated that in response to facial
tumour disease there has been rapid selection on standing genetic variation (Margres et al. 2018).
The disease has caused declines upwards of 80% and is almost always fatal, resulting in the
species being listed as endangered (McCallum 2008; Storfer et al. 2018). However, the observed
rapid adaptation has conferred greatly improved survival, with a few loci explaining over 80% of
variation in female survival (Margres et al. 2018). Yet, most policies or management plans do
not prioritize or inform decisions based on genetic diversity (Laikre 2010). The maintenance and
promotion of conservation of genetic diversity requires characterization of the biological
processes that threaten this variation with the integration of techniques that facilitate regular
monitoring. Altogether, testing ecological and evolutionary predictions about genetic diversity in
association with conservation objectives will help inform policy.
One of the primary aims of conservation management is to maintain or increase
population numbers. However, hidden evolutionary genetic threats may undermine management
plans and precipitate population collapse (Figure 1.1). For example, in the iconic Isle Royale
system, moose (Alces alces) and wolf (Canis lupus) populations had been presumably stable for
~30 years before the wolf population suffered a major crash (Peterson et al. 1998). Though the
original crash was predicted to be due to lack of food, the population never recovered due to
disease and severely decreased genetic diversity (Peterson et al. 1998). In fact, the wolf
populations had such low fitness that a single out-bred male immigrant in 1997 resulted in all
individuals born having ~50% of their ancestry from him in a little over a generation compared
to the ~14% that would have arisen under equal fitness (Adams et al. 2011). Even with this
influx of genetic material, the wolf population continued to decrease (Hedrick et al. 2014) to just
two wolves by 2018: a father and daughter, who are half siblings. This is a clear example of
decreased genetic diversity playing a role in preventing wild systems buffering environmental
change. It is an important reminder that as population numbers diminish, increased mating
between related pairs will decrease fitness of offspring, known as inbreeding depression
(Edmands 2007). However, the threshold at which inbreeding occurs, its impacts on the health
of a population or species, and the ability of that population or species to avoid or recover from inbreeding are not
universal (Keller & Waller 2002) and may be managed. Such management plans require
thorough pedigrees or genetic estimates of relatedness to mitigate the effects (Vilà et al. 2003).
Additionally, even if inbreeding is not occurring the same level of decreased fitness can be
reached due to or in conjunction with other forces (Hedrick & Kalinowski 2000). Small
populations are particularly vulnerable to genetic drift, the random loss of alleles that causes
fixation (Bouzat 2010). Due to a lower number of individuals, the probability of losing minor
alleles increases (Bouzat 2010). Populations with small effective population sizes (Ne) have
few individuals genetically contributing to the next generation, resulting in decreased genetic
variability (de Oliveira et al. 2006). Environmental, biological and human driven population
declines may cause the random loss and subsequent reduction of genetic diversity (Weber et al.
2004; Bouzat 2010). Through concerted conservation efforts, species number and Ne may
recover, as seen with the northern elephant seal (Hoelzel et al. 1993). Hunted almost to
extinction in the 19th century, the northern elephant seal (Mirounga angustirostris) has seen
a tremendous recovery in its population numbers; however, its genetic diversity has remained
extremely low due to the genetic “bottleneck” it underwent (Hoelzel et al. 1993; Bouzat 2010)
(Figure 1.2). Detecting and monitoring losses of genetic diversity is essential in mitigating its
detrimental effects and robust genomic data are integral in achieving this.
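The link between a small Ne and lost diversity can be illustrated with the standard expectation for the decay of heterozygosity under drift in an idealised diploid population, H_t = H_0 (1 - 1/(2Ne))^t. The Python sketch below is illustrative only: the function names and the population sizes are hypothetical and are not estimates for any species discussed here.

```python
def expected_heterozygosity(h0, ne, generations):
    """Expected heterozygosity after drift in an idealised diploid population.

    Uses H_t = H_0 * (1 - 1/(2*Ne))**t; ignores mutation, migration and selection.
    """
    if ne <= 0:
        raise ValueError("ne must be positive")
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** generations


def bottleneck_trajectory(h0, ne_per_generation):
    """Track expected heterozygosity through a sequence of per-generation Ne values."""
    h = h0
    trajectory = [h]
    for ne in ne_per_generation:
        h = expected_heterozygosity(h, ne, 1)
        trajectory.append(h)
    return trajectory


# Hypothetical scenario: large population, sharp bottleneck, then recovery in numbers.
sizes = [1000] * 5 + [10] * 5 + [1000] * 5
for generation, h in enumerate(bottleneck_trajectory(0.5, sizes)):
    print(generation, round(h, 4))
```

In this simplified model, expected heterozygosity never increases once population size recovers, which mirrors the pattern described for the northern elephant seal: numbers can rebound while genetic diversity stays low.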
Genetic rescue, the translocation of individuals from one population into another
(Ingvarsson 2001), has been proposed to circumvent the impacts of low genetic diversity. The
introduction of new genetic material via immigrants has been shown to rescue declining
populations and increase composite fitness (Vilà et al. 2003; Edmands 2007). However,
complications with genetic rescue can include divergence of isolated populations that are locally
adapted to their specific ecological habitat (discussed below). In these cases, attempts to increase
genetic diversity may result in disruption of locally adapted genes or gene complexes, decreasing
fitness (outbreeding depression) (Edmands 2007). Without the integration of genetic and
genomic data into conservation planning, management decisions may unintentionally be
detrimental to species recovery.
Another aspect of conservation in which genomics is helpful as a tool is the assignment
and investigation of population and species structure (Figure 1.1). Policy decisions on how best
to allocate funds and time are based in part on the distinctiveness or taxonomic standing of a
species (Isaac et al. 2004; Mace 2004; Joseph et al. 2009), and protection for certain populations
is influenced by showing that they are distinct from others (Kell et al. 2009). Management plans must
endeavour to incorporate methods that account for specific populations or genetically or
ecologically unique populations that warrant specialized management (Funk et al. 2012).
Evolutionarily significant units (ESUs) have a variety of definitions but can be described broadly
as “a population or group of populations that warrant separate management or priority for
conservation because of high genetic and ecological distinctiveness” (Funk et al. 2012). The
Committee on the Status of Endangered Wildlife in Canada (COSEWIC) includes ESUs in its
designation of designatable units (DUs), where a population or group of populations must meet
one or more of COSEWIC’s criteria for “discreteness” and for “significance” (COSEWIC 2015).
Two criteria used by COSEWIC to determine discreteness (please see COSEWIC 2015 for full
criteria) are whether a population or group of populations has clear genetic distinctiveness and/or local
adaptive differences (COSEWIC 2015). Once a population or group of populations is
determined to be discrete, two criteria that may be used to determine significance are whether genetic
markers illustrate there is a clear phylogenetic divergence and/or they exist in an “ecological
setting unusual or unique to the species, such that it is likely or known to have given rise to local
adaptations” (COSEWIC 2015). DUs and ESUs aim to protect sub-species variability that is
often missed by traditional taxonomy (Mee et al. 2015). Losing these vital populations can
increase the total risk of extinction, as DUs and ESUs may harbor the genetic variability
necessary to evolve with environmental change (Ceballos & Ehrlich 2002; Funk et al. 2012; Mee
et al. 2015).
However, delineation of populations, DUs and/or ESUs can be difficult over time and
space, confounding accurate data for conservation. For example, newly colonized habitats may
be inhabited by populations that only recently diverged; alternatively, some populations
occupying the same habitat may look the same but be genetically cryptic for demographic
reasons (Bull et al. 2013). Genetic estimates of population differentiation can help elucidate the
underlying genetic differences among populations. Wright (1950) developed an index of genetic
differentiation based on levels of heterozygosity (FST). Populations that are not genetically
differentiated will have similar allele frequencies and comparable expected levels of
heterozygosity and, therefore, a correspondingly small FST, while populations that are
differentiated will have dissimilar frequencies and increasing disequilibrium of heterozygosity
and, therefore, a larger FST (Holsinger & Weir 2009). This measure can be useful in
distinguishing cryptic population structure and determining isolation between population pairs,
especially when based on genome-wide estimates of FST from numerous genetic loci or markers.
For example, genome-wide markers have been used to resolve phylogenetic structure in
populations long considered as panmictic from mosquitos (Wyeomyia smithii) (Emerson et al.
2010) to marine threespine stickleback (Gasterosteus aculeatus) (Morris et al. 2018). Altogether,
establishing patterns of genetic differentiation (measured by FST) is a conventional first step to
understanding the nature of how organisms are distributed in time and space.
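As a worked illustration of the heterozygosity-based definition above, the following Python sketch computes FST for a single biallelic locus in two populations as (HT - HS) / HT, where HS is the mean expected heterozygosity within populations and HT is the expected heterozygosity of the combined population. This is a textbook simplification that weights the two populations equally; it is not the ‘Anova’ estimator of Poolfstat, and the function names are hypothetical.

```python
def expected_het(p):
    """Expected heterozygosity, 2p(1 - p), at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_pops(p1, p2):
    """FST = (HT - HS) / HT for one biallelic locus and two equally weighted populations."""
    hs = (expected_het(p1) + expected_het(p2)) / 2.0  # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2.0                           # allele frequency in the combined population
    ht = expected_het(p_bar)                          # total expected heterozygosity
    if ht == 0.0:
        # Locus is monomorphic overall, so FST is undefined; returning 0.0 is a convention chosen here.
        return 0.0
    return (ht - hs) / ht


print(fst_two_pops(0.5, 0.5))  # identical allele frequencies
print(fst_two_pops(0.2, 0.8))  # divergent allele frequencies
print(fst_two_pops(0.0, 1.0))  # alternative alleles fixed in each population
```

Identical allele frequencies give an FST of zero, and populations fixed for alternative alleles give an FST of one, matching the verbal description of weakly and strongly differentiated populations.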
High FST values for certain genomic regions compared to the overall genomic
background can be used in uncovering putatively adaptive differences among populations. The
hypothesis is that locus-specific divergent or directional selection will maintain genetic
differentiation among populations for this allele relative to other non-selected loci (Holsinger &
Weir 2009; Whitlock & Lotterhos 2015). However, identifying candidate genes in
this way must be treated as only a first step towards testing predictions of their potential adaptive
role, because genetic drift, especially in populations with small Ne, can also cause the fixation of alleles,
resulting in the same genetic pattern (Holsinger & Weir 2009). Consideration must also be given
to the influence of population structure and demography and the possibility of missing or incorrect
environmental data (Hoban et al. 2016). Measures of pairwise FST alone are unable to distinguish the
nature of the selective force causing variation in allele frequency (Lotterhos & Whitlock 2015). It is
necessary to consider any genetic variation uncovered in the context of what is known for the species
to develop a mechanistic understanding of the impact of the variation at all biological levels (Dalziel
et al. 2009). With these factors in mind it is possible to design experiments to capture the genetic
basis behind clear phenotypic and adaptive differences (Rogers & Bernatchez 2007; Dennenmoser et
al. 2017). The relevance for conservation management is that such genomic approaches can highlight
the presence of adaptive variation that may be contributing to local adaptation of certain at-risk
populations and help characterize their evolutionary significance (Funk et al. 2016).
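The idea of comparing region-specific FST against the genomic background can be sketched as a simple empirical-quantile scan. The Python example below is a minimal illustration under stated assumptions: the function name, the nearest-rank quantile rule and the example values are all hypothetical, and it is not the procedure of any particular outlier-detection package. As noted above, windows flagged this way are only candidates, since drift and demography can produce the same pattern.

```python
import math


def flag_outlier_windows(window_fst, quantile=0.99):
    """Return indices of windows whose FST exceeds an empirical quantile of all windows."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be between 0 and 1")
    if not window_fst:
        return []
    ranked = sorted(window_fst)
    # Nearest-rank empirical quantile of the genome-wide distribution.
    k = max(0, math.ceil(quantile * len(ranked)) - 1)
    threshold = ranked[k]
    return [i for i, fst in enumerate(window_fst) if fst > threshold]


# Hypothetical per-window FST values along a scaffold.
example = [0.0, 0.1, 0.2, 0.9]
print(flag_outlier_windows(example, quantile=0.75))
```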
Though there have been many papers reviewing and promoting the integration of
genomics in conservation (Allendorf et al. 2010; Ouborg et al. 2010; Funk et al. 2012), it has
been largely confined to academia with very few concrete examples of it impacting management
decisions (Shafer et al. 2015). This may largely be due to the cost and level of expertise
necessary to generate and process genomic datasets (Shafer et al. 2015), but reflection on the
integration of genomics in conservation is appropriate, especially on how to make genomic tools
more accessible.
One of the main considerations in estimating genetic variation is the choice of
genetic marker. The genetic markers initially used in conservation are limited because they are few in
number and are not distributed throughout the genome (Ouborg et al. 2010). With the
decreasing cost of Next-Generation sequencing, capturing thousands to millions of
genome-wide single nucleotide polymorphisms (SNPs) is now possible. The sheer number of
polymorphic loci has facilitated more accurate estimates of genetic variation and the detection of
finer-scale population structure (Emerson et al. 2010; Shafer et al. 2015; Morris et al. 2018).
Because they are distributed throughout the genome, SNPs can be used to detect both neutral and
potentially adaptive genetic variation, helping resolve taxonomic ambiguity and assign ESUs
(Funk et al. 2012; Shafer et al. 2015). For genomics to be integrated efficiently and for it to be
applied to real conservation dilemmas, conservation management practitioners must push for its
use and collaborate with genomic researchers (Shafer et al. 2015), with continuous
communication between the two (Lundmark et al. 2019).
Next-Generation sequencing encompasses many high-throughput sequencing methods for
the capturing of polymorphisms. Some commonly used methods include reduced representation
RADseq (Baird et al. 2008) and ddRADseq (Peterson et al. 2012), where the DNA is treated
with one or two restriction enzymes that cleave the DNA before sequencing, returning small
regions throughout the genome. SNP-Chips comprise another method whereby small sequences
are physically bound to a glass slide and hybridize with DNA to capture known SNPs (Lien et al.
2011). However, while these methods still generate thousands of markers, they do not capture
the entire genome. In my project, I used a technique called Pool-sequencing (Futschik &
Schlötterer 2010). Though whole genome sequencing for individuals has decreased substantially
in price, it is not financially viable to sequence the number of individuals necessary to answer
population level questions (Futschik & Schlötterer 2010). Pool-seq involves pooling of equal
amounts of DNA from multiple individuals per population and sequencing them as if they were
one “individual” (Figure 1.3) (Futschik & Schlötterer 2010). With a sufficiently high number of
individuals and careful data pre-processing it has been shown to estimate allele frequencies more
accurately than individual sequencing (Futschik & Schlötterer 2010; Kofler et al. 2011a; Gautier
et al. 2013). Attention needs to be given when pooling individuals, so that they are each
represented equally (Schlötterer et al. 2014); however, newly developed models for estimating
FST have been shown to be robust to unequal representation, pool size and coverage depth
(Hivert et al. 2018). Due to the loss of the individual, estimates of Ne and migration are not
possible; however, the number of genome-wide SNPs captured makes Pool-seq very powerful in
detecting nucleotide diversity, population structure and potentially adaptive differences (though
these should be interpreted with caution (Anderson et al. 2014)). Overall, very few studies have
applied this method in a conservation context, so further research of this cost-effective method
for conservation is needed.
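As a rough illustration of how a pooled allele frequency and its precision relate to pool size and sequencing depth, the sketch below estimates a frequency from pooled read counts and computes an approximate sampling variance under two-stage binomial sampling (chromosomes sampled into the pool, then reads sampled from the pool). It assumes every individual contributes equally to the pool and ignores sequencing error; the function names and example counts are hypothetical and are not part of the pipeline used in this thesis.

```python
def pooled_allele_frequency(alt_reads, total_reads):
    """Estimate an allele frequency as the fraction of reads carrying the allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def pooled_frequency_variance(p, n_chromosomes, coverage):
    """Approximate variance of the pooled estimate under two-stage sampling.

    Stage 1: n_chromosomes are sampled from the population into the pool.
    Stage 2: `coverage` reads are sampled from the pool.
    Assumes equal contribution per individual and no sequencing error.
    """
    if n_chromosomes <= 0 or coverage <= 0:
        raise ValueError("n_chromosomes and coverage must be positive")
    pool_term = 1.0 / n_chromosomes
    read_term = (1.0 - 1.0 / n_chromosomes) / coverage
    return p * (1.0 - p) * (pool_term + read_term)


if __name__ == "__main__":
    # Hypothetical site: 10 of 40 reads carry the allele, pool of 40 diploid snails.
    p_hat = pooled_allele_frequency(10, 40)
    print(p_hat, pooled_frequency_variance(p_hat, n_chromosomes=80, coverage=40))
```

Under these assumptions the variance shrinks with both more pooled chromosomes and more reads, which is why pool size and coverage are considered together when designing a Pool-seq experiment.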
One group of animals that is in desperate need of effective and timely management plans
are molluscs. Though deemed to be one of the most imperiled taxa (Régnier et al. 2009; Johnson
et al. 2013; Cowie et al. 2017), only 7,276 of an estimated 70,000 to 76,000 species (Rosenberg
2014) had been assessed by the International Union for Conservation of Nature (IUCN) as of
2016 (Cowie et al. 2017). Of these, 34% (2,463 species) were deemed too data deficient for a
formal assessment (Cowie et al. 2017). These numbers actually place molluscs at a high
proportion of assessed species compared to all invertebrates, where only 1.2% have been
assessed overall (Cowie et al. 2017). However, in comparison, all mammal and bird species
recognized by the IUCN have been assessed and only ~5% were deemed data deficient (Cowie et
al. 2017). And yet, even though molluscs are deeply under-represented in terms of conservation
assessment by the IUCN, they still made up roughly 40% of the animal species listed as extinct
in the third issue of the 2016 Red List (Cowie et al. 2017). Their decline is heavily impacted by
habitat loss and degradation, with very little known about their response to toxins or chemicals in
aquatic systems (Régnier et al. 2009; Johnson et al. 2013). However, as the primary grazer in
many habitats and an important food source for many species, their loss has the potential to
cause drastic shifts in the many ecosystems they inhabit (Régnier et al. 2009; Johnson et al.
2013). Loss of integral components at the bottom of the ecosystem has large bottom-up effects,
which impact all members of the ecosystem. Not surprisingly, in terms of genomic information,
molluscs are exceedingly lacking. A clear representation of this is that there are only 23 available
reference genomes (https://www.ncbi.nlm.nih.gov/genome, txid6447[ORGN]) compared to the
522 for vertebrates hosted on NCBI (https://www.ncbi.nlm.nih.gov/genome, txid7742[ORGN]).
1.2 STUDY SYSTEM
The endangered Banff Springs Snail, Physella johnsoni, is found only in seven thermal
springs, characterized by high water temperature and hydrogen sulphide and low dissolved
oxygen and pH, in Banff National Park, Alberta, Canada (COSEWIC 2008). It is listed as
endangered under Canada’s Species At Risk Act (SARA), in part due to an incredibly small
distribution. P. johnsoni’s habitat encompasses less than 600 m² located in a total range and area
of occupancy of just 8 km², well under the 5,000 km² and 500 km², respectively, necessary to
meet endangered status (Criterion B) (COSEWIC 2014, 2018). Additionally, each of the P.
johnsoni thermal springs is seen to undergo large fluctuations in number of mature individuals
with yearly population bottlenecks causing changes up to two orders of magnitude (COSEWIC
2014, 2018). In conjunction with predictions that they are hermaphroditic and have low amounts
of gene flow between the thermal springs, there is concern of a lack of genetic diversity (Lepitzki
& Pacas 2010). They are currently managed as a unique species, with genetic analysis of a few
markers showing differentiation between them and a much more common snail, Physella gyrina
(Lepitzki 1998; Remigio et al. 2001). However, some research using the same markers suggests
that P. johnsoni and P. gyrina are synonymous with each other (Wethington & Guralnick 2004).
This taxonomic ambiguity hinders the proper management of P. johnsoni (Daugherty et al. 1990;
Mace 2004). If P. johnsoni were synonymized with P. gyrina, they would likely be re-assessed
for DU status (COSEWIC 2008), where evidence on discreteness and significance would need to
be provided. Limited numbers of genetic markers are unable to provide the resolution
necessary to distinguish P. johnsoni and P. gyrina and resolve patterns of population
structure.
1.3 OBJECTIVES
My objective in this thesis was to produce a genomic dataset for use in the conservation
management of Physella johnsoni. The genomic data and resulting analysis will provide higher
resolution of taxonomic designation and population structure than previously achieved
through morphology and single marker sequencing. It will be integrated into Parks Canada’s
management plan and be used to advise how best to manage the species to help ensure its
continued persistence.
Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans, namely for resolving taxonomic ambiguity, assigning DUs/ESUs and characterizing the population genomic consequences of increasing threats.
Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, where some is randomly lost with the reduction in population size. This decrease in genetic variation is observed even when population numbers increase.
Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) from each individual of the population are combined and the individual is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations in sequencing and in analysis.
CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL
2.1 INTRODUCTION
We are currently in the midst of massive biodiversity loss in association with
anthropogenic impact, habitat fragmentation, habitat loss, invasive species, over-exploitation and
environmental change (Shaffer 1981; Butchart et al. 2010). These factors have contributed to an
alarming increase in numbers and rates of threatened and endangered species (Régnier et al.
2009, 2015). There is a critical need for effective conservation management plans, where
conservation practitioners must have the tools to rapidly elucidate and assess population structure
and distribution, towards determining whether populations and/or a species meet criteria to be
considered priorities for conservation and to help characterize the threats faced by species at risk
(Figure 1.1) (Funk et al. 2012; Guisan et al. 2013; Shafer et al. 2015). With the integration of
genomics, practitioners have an unprecedented ability to resolve patterns of genetic diversity
within and between populations and species and to inform on these vital aspects of conservation
management (Figure 1.1) (Shafer et al. 2015). However, there are still limited examples where
conservation genomics has been shown to actually impact policy or management decisions
(Shafer et al. 2015).
Found only in Banff National Park, Alberta, Canada, with a global range of just 594.4 m²
(COSEWIC 2008), the Banff Springs Snail (Physella johnsoni) (Clench 1926) embodies the
challenges faced by conservation biologists. It became the first living mollusc to be listed by
the Committee on the Status of Endangered Wildlife in Canada (COSEWIC) as threatened in 1997,
and in 2000 was re-assessed as endangered (COSEWIC 2008). Globally, molluscs have been
determined to be one of the most imperiled taxa (Régnier et al. 2009; Johnson et al. 2013; Cowie
et al. 2017); however, the majority are unassessed for conservation (Cowie et al. 2017). P.
johnsoni belongs to the family Physidae, which is a family of about 80 species of freshwater, air-
breathing snails found widespread in the Holarctic region and into Central and Southern America
(Taylor 2003; Wethington & Lydeard 2007). Currently, ~55% of North American Physidae are
at risk, alongside the vast majority of other freshwater snails, partially because of rapid habitat
changes or loss due to human interference and/or environmental changes (Johnson et al. 2013).
Several factors contribute to the conservation risks faced by P. johnsoni. Their entire
global range consists of seven thermal springs characterized by high water temperature and
hydrogen sulphide content, and low dissolved oxygen content and pH (Grasby & Lepitzki 2002;
COSEWIC 2008) (Figure 2.1). The thermal springs are located along the Sulphur Mountain
Thrust fault, existing in three elevation groups (Grasby & Lepitzki 2002). The lowest elevation
group (~1400 m) consists of four thermal springs located within a few hundred metres of each
other - the Cave (isolated from the others except for a small hole in the top), the Basin, and the
Lower and Upper Cave and Basin Springs (Figure 2.1) (Grasby & Lepitzki 2002). The middle
elevation group (~1500 m) is located about 1 km up Sulphur Mountain and consists of Lower
and Upper Middle Springs (Figure 2.1), West Cave and Gord’s Spring, though it is uncertain if
the physids currently residing in West Cave or Gord’s Spring are P. johnsoni (Grasby & Lepitzki
2002; COSEWIC 2008, 2018). The highest elevation group consists of Kidney Spring (1588 m)
(Figure 2.1) and the extirpated Upper Hot Spring (1584 m) (Grasby & Lepitzki 2002). P.
johnsoni were originally found in 11 thermal springs; however, they ceased to exist in six due to
water stoppages or to human interference (COSEWIC 2008). Upon water flow resuming in
Kidney and Upper Middle Springs snails were re-established successfully in 2002 and 2003,
respectively, resulting in the seven current inhabited thermal springs (COSEWIC 2008).
The taxonomic designation of many physids, including P. johnsoni, is strongly debated.
Wethington & Lydeard (2007) state that there is a more than 50% over-representation of physid
species in North America. This is in part due to the classification being heavily based on
morphological traits (e.g., shell morphology and penial structure). Though P. johnsoni was
found to be significantly more globose and to have a longer spire than P. gyrina (Lepitzki 1998),
recent evidence has shown that shell morphology is very plastic in physids. One study found
phenotypic convergence of shell shape within one generation in the lab of two morphologically
distinct but geographically adjacent populations of physids (Gustafson et al. 2014). Moore et al.
(2014) found that two populations of physids predicted to be the same species due to the same
atypical morphology were more genetically divergent from each other than a morphologically
typical snail, Physella gyrina. While P. johnsoni are currently designated as a species, alternative
hypotheses suggest that up to seven different taxa, including P. johnsoni and another endangered,
endemic thermal spring physid, P. wrighti (Hotwater Physa) are synonyms of a much more
common snail, P. gyrina and that the observed morphological differences are the result of habitat
influence (Wethington & Guralnick 2004; Wethington & Lydeard 2007).
P. johnsoni individuals are seemingly restricted to around the origin of the spring (30 to
36°C) (Grasby & Lepitzki 2002). Though the cause of their distribution is unknown, higher
densities are correlated with the higher temperature and hydrogen sulphide and lower dissolved
oxygen and pH (Lepitzki & Pacas 2010). This distribution may be influenced by concentration of
their food sources, algae and bacteria (Grasby & Lepitzki 2002). P. johnsoni are presumed to be
hermaphrodites, preferring to out-cross when there are favourable environmental conditions
(Jarne et al. 2000; COSEWIC 2008). P. johnsoni’s restricted habitat and life history patterns
(discussed below) have led to concern of decreased genetic diversity. Of highest concern is that
each year the populations (whereby each thermal spring is defined as a “population”, but defined
as “sub-populations” under COSEWIC) fluctuate by two orders of magnitude (Lepitzki &
Pacas 2010; COSEWIC 2018). Some populations will decrease to fewer than 50 individuals in
the summer months and reach population highs into the thousands in the winter and spring
(Figure 2.2) (COSEWIC 2008, 2018). The cause of these population rises and declines has not
been determined, but is speculated to be in association with seasonal changes in water chemistry
(Grasby & Lepitzki 2002). Whether this impacts the snails directly or the changes are due to
abundance of the algae and bacteria (or an association of both) is unknown (Grasby & Lepitzki
2002). As the per population numbers decrease to so few individuals (Figure 2.2) (COSEWIC
2018), the genetic variation is likely reduced to the genetic diversity contained within the
surviving individuals. Even as the population numbers increase the offspring will only contain
that genetic variation, resulting in a genetic bottleneck (Bouzat 2010). In small and restricted
populations such as P. johnsoni, low frequency alleles can be randomly lost, increasing the
homozygosity of the population (Bouzat 2010). This causes a reduction of genetic diversity and
random fixation of potentially detrimental alleles by a process called genetic drift (Bouzat 2010).
It has been well documented that even low amounts of gene flow between populations can
mitigate a loss of genetic diversity (Ingvarsson 2001; Vilà et al. 2003). There is likely limited
opportunity for genetic mixing in exceedingly high spring run-off years with the transport of
individuals from Upper Cave and Lower Cave and Basin Springs into the Basin Spring (Figure
2.1) (Lepitzki & Pacas 2010). Though never confirmed in P. johnsoni, snails in other freshwater
systems have been documented to be transported by birds (Santamaría & Klaassen 2002) and
large mammals, causing significantly decreased genetic differentiation between certain
populations (Van Leeuwen et al. 2013). Marmots (Marmota caligata) and bears (Ursus arctos)
have been observed via surveillance cameras to frequent some of the thermal springs (pers. comm.
Dr. Dwayne Lepitzki); however, overall, it is predicted that there is very little dispersal, and
likely little gene flow, among the thermal springs (Lepitzki & Pacas 2010). Due to the intensity of the
population bottlenecks and because genetic mixing has yet to be confirmed among populations,
decreased genetic diversity is also a strong conservation concern.
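The loss of diversity through drift described above can be made concrete with the textbook expectation for an idealized, randomly mating population, in which expected heterozygosity declines by a factor of (1 - 1/(2N)) per generation. The sketch below is only an illustration under those idealized assumptions: N here is an effective population size (which this study does not estimate), selfing in hermaphrodites is ignored, and the function name and example sizes are hypothetical.

```python
def expected_heterozygosity(h0, effective_size, generations):
    """Expected heterozygosity after drift in an idealized population.

    Uses H_t = H_0 * (1 - 1 / (2N)) ** t, which assumes random mating,
    constant effective size N, no selection, mutation, or migration.
    """
    if effective_size <= 0:
        raise ValueError("effective_size must be positive")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return h0 * (1.0 - 1.0 / (2.0 * effective_size)) ** generations


if __name__ == "__main__":
    # Hypothetical comparison: a population held at 50 versus 5,000 for 20 generations.
    for size in (50, 5000):
        print(size, expected_heterozygosity(0.30, size, 20))
```

The comparison shows why repeated reductions to a few dozen individuals are expected to erode diversity far faster than the seasonal highs alone would suggest.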
Previous sequencing efforts have attempted to resolve the taxonomic ambiguity and
determine the levels of genetic differentiation between P. johnsoni and P. gyrina. However,
analysis of protein variants (allozymes) (Lepitzki 1998) or sequencing of COI and 16S mitochondrial genes
(Remigio et al. 2001; Wethington & Guralnick 2004) failed to reach a consensus. In these
studies, P. johnsoni was compared to geographically close populations of P. gyrina, including
the three used in the present study – the Cave and Basin Marsh, Five Mile Pond and Muleshoe
Pond (Figure 2.1). The Cave and Basin Marsh is located downstream of the Cave and Basin
Spring cluster and contains diluted thermal water and does not freeze (pers. obs.). Five Mile Pond
and Muleshoe Pond are lake populations, located several kilometres upstream on the Bow River
(Figure 2.1). P. johnsoni and P. gyrina were found to be genetically distinguishable at only three
of the 12 protein loci tested, with low levels of intraspecific variation restricted to a single locus
(Lepitzki 1998). However, consensus of genetic relatedness based on COI and 16S mitochondrial
gene sequences has not been reached (Remigio et al. 2001; Wethington & Guralnick 2004; Pip &
Franck 2008). P. johnsoni and P. gyrina may be genetically close as not all analyses reveal
monophyletic groups (Wethington & Guralnick 2004). This has been hypothesized to be in part
due to the young age of the species, with P. johnsoni being predicted to only have diverged 3200
to 5200 years ago when the thermal springs were formed (Grasby et al. 2003; COSEWIC 2008).
These limited genetic tools have precluded effectively testing this hypothesis.
These evolutionary factors highlight the need for genome-wide markers for resolving
whether P. johnsoni and P. gyrina warrant separate management, to resolve the micro-
geographic genetic population structure for P. johnsoni and to detect potential underlying genetic
threats. It should be noted that I will not be attempting to resolve what constitutes a species in
this study and rather focus on inter-species and intra-species patterns of genetic differentiation.
As illustrated above, the use of limited genetic markers has been unable to resolve patterns of
genetic differentiation. To address these factors hindering conservation management, I used
Pool-sequencing (Figure 1.3) (Futschik & Schlötterer 2010). This sequencing method involves
the pooling of DNA from multiple individuals per population to provide high confidence allele
frequency estimates across the entire genome (Futschik & Schlötterer 2010; Kofler et al. 2011a;
Gautier et al. 2013).
In this chapter I used genome-wide single nucleotide polymorphisms (SNPs) captured by
Pool-seq to address two conservation objectives for P. johnsoni. The first objective was to
determine whether P. johnsoni is genetically distinct from P. gyrina. I hypothesized that the
taxonomic unit previously assigned to P. johnsoni and P. gyrina by defining and/or derived traits
would be valid if the observed patterns of genomic differentiation supported their distinct status.
Whether or not P. johnsoni represents a thermal ecotype of P. gyrina or rather a distinct genetic
unit has direct bearing on their conservation status (COSEWIC 2018) and the resources allocated
to their conservation. For effective management Parks Canada must be informed if P. johnsoni
and P. gyrina are genetically distinct, as an essential component of conservation biology is
taxonomic designation. Improper classification can lead to the extinction of a species (Daugherty
et al. 1990; Mace 2004). Secondly, I used this same SNP dataset to test predictions of micro-
geographical population structure and within-population genetic diversity of P. johnsoni. While
the distribution is limited to a small geographic space, gene flow is predicted to be limited
(Lepitzki & Pacas 2010) and extensive annual bottlenecks within each of the populations
(COSEWIC 2018) are predicted to amplify the effect of genetic drift resulting in increased
population divergence. The combination of these two evolutionary processes leads to the
prediction that genetic structure may be pronounced. Alternatively, P. johnsoni may represent a
single panmictic population. The genomic data produced here will facilitate management
decisions in association with habitat threats, and whether thermal springs should be managed as a
single unit or if they each warrant separate management. Overall, an analysis of genomic
divergence of these snails is required to test these hypotheses.
2.2 METHODS
2.2.1 SAMPLING
P. johnsoni were collected from five thermal springs between January and March of 2017
in the Banff Thermal Springs of Banff National Park in Alberta, Canada: 1) Cave Spring (J1), 2)
Basin Spring (J2), 3) Lower Cave & Basin Spring (J3), 4) Upper Cave & Basin Spring (J4) and 5)
Lower Middle Spring (J5) (Figure 2.1). Individuals were also collected from Upper Middle
Spring (J6) and Kidney Spring (J7), which were not included in this study (Figure 2.1). Before
collecting P. johnsoni, census population sizes were estimated to ensure that the number of snails
sampled (n=40) did not exceed 0.5 to 3% of the spring’s current population. This condition was
met except for J1, where only 20 P. johnsoni could be collected. In addition, a second species, P.
gyrina (n=40), was collected from three locations: 1) Cave & Basin Marsh (G1) (March 2017),
2) Five Mile Pond (G2) and 3) Muleshoe Pond (G3) (July 2017) (Figure 2.1).
Snails were collected by hand at all P. johnsoni locations (J1 to J5) and at P. gyrina location G1.
Snails were collected haphazardly from eight locations within the thermal spring, with five snails
being collected at each location. Water temperature was recorded at each location. A D-dipnet
was used to collect at all other locations (G2 and G3). Samples were collected in eight batches of five
snails from different locations within each lake. Water temperature was taken once from a
representative location.
All snails were anesthetized in the field in batches of five by placing them into 5%
laboratory grade ethanol (EtOH) (Gilbertson & Wyatt 2016). The tubes were left to incubate
immersed in the water source so as to remain close to the source temperature and minimize
stress. They remained in 5% EtOH until movement ceased and they released from the tube’s
surface (observed to be 5 to 15 minutes). They were then removed from the 5% EtOH and tested
for responsiveness by scraping a hypodermic needle across the foot (Gilbertson & Wyatt 2016).
If unresponsive, they were placed on a dish made of aluminum foil and euthanized by rapid
cooling with electrical component freezing spray sprayed from under the dish (Craze & Barr
2002).
Tissue was then removed from the shells with a dissecting needle or forceps, taking care to
minimize damage to the shell. The shells were stored individually in 95% EtOH. Ten tissue
samples from each of J1 to J5, and G1 were stored individually in RNAlater®, which would
have allowed for future gene expression analysis. However, it was decided that these samples
would be better used for DNA analysis and therefore, extracted for DNA as explained below.
The remaining tissue samples were stored individually in 95% EtOH. For G2 and G3, all 40
tissue samples were stored individually in 95% EtOH.
All samples were transported in a cooler with ice packs. Once in the laboratory they were
stored at -20°C until extraction. All sampling procedures and research ethics were approved by
the Life and Environmental Science Animal Care Committee under protocol #LESACC AC16-0267.
2.2.2 DNA EXTRACTION
DNA was extracted from whole body tissue, following a modified OMEGA bio-tek
E.Z.N.A.® Mollusc DNA Kit protocol that included dried and diced tissue, overnight incubation
at 56 °C, three washes of the HiBind® column, and a 50 µL elution. Once DNA extraction was
complete, 8 to 10 µL was aliquoted for quantification and quality checks. Both aliquot and stock
were stored at -20°C until further use.
2.2.3 DNA QUANTIFICATION AND QUALITY CHECK
Aliquoted DNA was quantified a minimum of twice on either a Qubit® Fluorometer 2.0 or
3.0 using the Qubit™ dsDNA BR Assay Kit as per protocol. Samples were vortexed briefly and
mixed by pipetting up and down before 2 µL was mixed into 198 µL of working solution for
quantification. A subset of samples was run on a 1% agarose gel to visualize the level of shearing
that occurred. A subset of samples was tested for purity on a NanoDrop® Spectrophotometer ND-1000
(260/230 and 260/280 ratios).
2.2.4 CONSTRUCTING DNA POOLS FOR POOL-SEQ
DNA pools for each population were constructed using equal amounts of DNA from each
individual sample. DNA quantity (ng) for each pool was chosen so that at least 1 µL of solution was
pipetted from each individual. Individual samples were briefly vortexed and pipetted up and down
20 times before their volume was added to the pool. A total of 10 µL from each pool was aliquoted for
further quantification and gel electrophoresis.
Pooled DNA samples were quantified using the same method as the individual samples
(described above). 5 µL of each pool was run on a 1% agarose gel to test whether handling had
increased shearing. To prepare for sequencing, each pool was diluted to a final
concentration of 3 to 6 ng/µL. The diluted pools were quantified as above.
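The pooling arithmetic described in this section can be sketched as follows. The example assumes that the per-individual DNA mass is set by the most concentrated sample (so that it contributes exactly the minimum pipetting volume) and that dilution follows C1V1 = C2V2; the thesis does not state the masses actually used, so the function names, concentrations, and target values below are hypothetical.

```python
def pooling_volumes(concentrations_ng_per_ul, min_volume_ul=1.0):
    """Return (mass per individual in ng, list of volumes in µL) for an equimolar-by-mass pool.

    The mass is the smallest value for which every sample needs at least
    min_volume_ul, i.e. it is set by the most concentrated sample.
    """
    if not concentrations_ng_per_ul:
        raise ValueError("at least one sample is required")
    if min(concentrations_ng_per_ul) <= 0:
        raise ValueError("concentrations must be positive")
    mass_ng = max(concentrations_ng_per_ul) * min_volume_ul
    volumes = [mass_ng / conc for conc in concentrations_ng_per_ul]
    return mass_ng, volumes


def diluent_volume(stock_conc, target_conc, stock_volume_ul):
    """Volume of diluent (µL) needed to bring a stock down to target_conc (C1V1 = C2V2)."""
    if target_conc <= 0 or target_conc > stock_conc:
        raise ValueError("target_conc must be positive and no greater than stock_conc")
    final_volume = stock_conc * stock_volume_ul / target_conc
    return final_volume - stock_volume_ul


if __name__ == "__main__":
    # Hypothetical Qubit readings (ng/µL) for three individuals.
    mass, vols = pooling_volumes([10.0, 50.0, 200.0])
    print(mass, vols)
    # Hypothetical pool at 30 ng/µL diluted to 5 ng/µL, starting from 10 µL.
    print(diluent_volume(30.0, 5.0, 10.0))
```

In practice the chosen mass is also limited by the volume available from the least concentrated sample, which this sketch does not check.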
2.2.5 DNA SEQUENCING
All pools passed concentration and quality control. Libraries were prepared using a
shotgun approach with PCR with Illumina TruSeq LT adaptors. All libraries passed quality
control. Pooled DNA libraries were sequenced on the Illumina HiSeq X™ Sequencer using
paired-end reads of 150 base pairs (bp). Each pool was sequenced over two lanes (e.g. four pools
on one lane and then the same four pools on the second lane) for a total of four lanes at the
Génome Québec Innovation Centre, Montréal, Québec, Canada.
2.2.6 GENOMIC ANALYSIS
The full annotated genomic analysis pipeline can be found in Appendix A.
Sequences were converted from BCL files to FASTQ with no barcode mismatches for
downstream processing and analyses using bcl2fastq2 v.2.20. Sequences from the two lanes for
each pool were concatenated to one file per pool per read direction. FastQC v.0.11.5 (Andrews
2010) was used to check and visualize the quality of the sequences.
Trimmomatic v.0.36 (Bolger et al. 2014) was used to remove adaptors and filter low-quality
sequences (ILLUMINACLIP 2:30:10 CROP:135 LEADING:5 TRAILING:5
SLIDINGWINDOW:5:20 MINLEN:100). Sequences were hard cropped at 135 bp due to k-mer
overrepresentation in the last 15 bp in a small number of sequences. Post-trimming, sequences
were checked for quality using FastQC v.0.11.5.
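To illustrate what the sliding-window and minimum-length settings above do conceptually, the sketch below decodes Phred+33 quality strings and cuts a read at the first window of five bases whose mean quality falls below 20, then checks the remaining length. This is a simplified stand-in and not Trimmomatic's exact algorithm; the function names and the example read are hypothetical.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 (Sanger/Illumina 1.8+) quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_trim(sequence, quality_string, window=5, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality is too low.

    Simplified illustration: scans 5' to 3' and returns the read prefix that
    precedes the first failing window. Reads shorter than the window are
    returned unchanged.
    """
    scores = phred33_scores(quality_string)
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


def passes_min_length(sequence, min_length=100):
    """Return True when a trimmed read is still long enough to keep."""
    return len(sequence) >= min_length


if __name__ == "__main__":
    # Hypothetical 10 bp read whose last three bases have very low quality.
    trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "IIIIIII###")
    print(trimmed_seq, trimmed_qual, passes_min_length(trimmed_seq))
```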
Contaminating foreign (i.e. non-snail) sequences were removed from the data with
DeconSeq v.0.4.3 (Schmieder & Edwards 2011a). Databases of potential contamination sources
were generated for Archaea and green Algae (Chlorophyta, Cryptophyta, Charophyceae,
Eustigmatophyceae, and Klebsormidiophyceae) by downloading the nt database from NCBI
(ftp://ftp.ncbi.nlm.nih.gov/blast/db, accessed 24-08-2018). Using the GenInfo Identifier (GI) list
(https://www.ncbi.nlm.nih.gov/nuccore, accessed 24-08-2018) for each of the above,
blastdb_aliastool was used to create a file that masked the database so only the organisms of
interest were available. This masked database was then converted to a FASTA file using
blastdbcmd. The threespine stickleback (Gasterosteus aculeatus) genome was accessed from
https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017) and the human
(Homo sapiens) genome was accessed from
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch
. The bacterial database was constructed from the NCBI Assembly database
(https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria, accessed 06-09-2018), selecting only
“Complete Genomes” with the exception of those known to be in the thermal springs
(Aphanothece, Brevundimonas, Chloroflexus, Lyngbya, Microcoleus, Oscillatoria, Phormidium,
Porphyrobacter, Rhodobacter, Rhodopseudomonas, Rubrivivax, Spirulina, Synechocystis,
Thermomonas, and Thiothrix) (Bilyj 2011) where all information (chromosome, scaffold, and
contig) was included. The genomes were concatenated together to be prepared for use by
DeconSeq. All databases were prepared following DeconSeq manual. Briefly, any Ns, any
sequences less than 200 bp and sequence duplicates were removed (Schmieder & Edwards
2011b). The bacterial database was split into ~2.7 Gb chunks (FASTA Splitter v.0.2.6;
http://kirill-kryukov.com/study/tools/fasta-splitter/) so that it could be indexed, and all
databases were then indexed with the Burrows-Wheeler Aligner (BWA) (Li & Durbin 2009) provided
in the DeconSeq download. Trimmed sequences were split into ~1.1 Gb segments (FASTQ Splitter
v.0.1.2; http://kirill-kryukov.com/study/tools/fastq-splitter/) and compared against the databases,
removing any sequence with an identity (-i) of at least 94% and a coverage (-c) of at least 90%. The
output files for each population were merged and the paired read files were compared, removing any
sequences found in only one of the files, using fastq-pair (Edwards 2017).
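The pair-synchronization step described above can be sketched as a set intersection on read identifiers. The example below is a conceptual illustration only, not the implementation of fastq-pair; it assumes reads are already loaded into dictionaries keyed by an identifier shared between mates, and all names and records are hypothetical.

```python
def keep_paired_reads(forward_reads, reverse_reads):
    """Keep only reads whose identifier is present in both read files.

    forward_reads / reverse_reads: dict mapping read identifier -> record.
    Returns two lists in the order of the forward file, so mates stay aligned.
    """
    shared_ids = forward_reads.keys() & reverse_reads.keys()
    ordered_ids = [read_id for read_id in forward_reads if read_id in shared_ids]
    kept_forward = [forward_reads[read_id] for read_id in ordered_ids]
    kept_reverse = [reverse_reads[read_id] for read_id in ordered_ids]
    return kept_forward, kept_reverse


if __name__ == "__main__":
    # Hypothetical reads: read2 lost its reverse mate, read4 lost its forward mate.
    forward = {"read1": "ACGT", "read2": "GGCC", "read3": "TTAA"}
    reverse = {"read3": "TTAA", "read1": "ACGT", "read4": "CCGG"}
    print(keep_paired_reads(forward, reverse))
```

For files of the size reported here, a streaming or index-based approach would be needed rather than holding every read in memory.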
A genome reference was subsequently assembled from the pool with the highest number
of sequences (J3) using DISCOVAR de novo (v.52488)
(https://software.broadinstitute.org/software/discovar/blog/?page_id=98) after using Clumpify to
remove PCR duplicates (Bushnell 2014). DISCOVAR de novo was run to assemble
de-duplicated sequences into flattened lines. An assembly was also attempted using all pools, but
the J3-only assembly was determined to be “better”. DISCOVAR de novo was chosen because it
does not require sequences generated with two different insert sizes, unlike most assemblers.
However, it is recommended that the sequences come from a single, PCR-free library preparation with
250 bp paired-end reads. The consequences of breaking these assumptions will be stated in
Results and discussed in Caveats.
Burrows-Wheeler Aligner (BWA)-MEM (v. 0.7.17) (Li & Durbin 2009) was used to
align sequences to the assembled genome reference with the -M option which disables making
multiple primary alignments for different regions of the query sequence for compatibility with
downstream packages. Aligned sequences were sorted from a SAM file to a BAM file by
chromosome number using SAMtools (Li et al. 2009). SAMtools was also used to remove any
sequences that fell below a mapping quality of 20 and any sequences whose mate did not map.
Duplicates were removed with Picard Tools (v. 2.17.3.) (http://broadinstitute.github.io/picard/).
SAMtools’ function flagstat was used to determine summary statistics of resulting bam files
while Picard Tools (ValidateSamFile) was used to validate that the files were not corrupted in
any way. SAMtools was used to create individual mpileup files for each pool and an mpileup file
containing all pools, specifying the -B option, which disables BAQ (base alignment quality)
re-alignment, as required for downstream analysis.
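The alignment filter described above (mapping quality of at least 20, and both mates mapped) can be expressed in terms of the SAM FLAG bits. The sketch below is an illustration of that logic only; the exact SAMtools options used are given in Appendix A and are not reproduced here, and the function name and example values are hypothetical.

```python
READ_UNMAPPED = 0x4   # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM FLAG bit: the mate of this read is unmapped


def keep_alignment(flag, mapping_quality, min_mapping_quality=20):
    """Return True when a SAM record passes the mapping filters.

    A record is kept only if the read and its mate are both mapped and the
    mapping quality is at least min_mapping_quality.
    """
    if flag & READ_UNMAPPED or flag & MATE_UNMAPPED:
        return False
    return mapping_quality >= min_mapping_quality


if __name__ == "__main__":
    # Hypothetical records as (FLAG, MAPQ) pairs.
    for flag, mapq in [(99, 60), (73, 60), (99, 10)]:
        print(flag, mapq, keep_alignment(flag, mapq))
```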
2.2.7 PAIRWISE FST
Pairwise FST between all pools was determined using the R package, Poolfstat (Hivert et
al. 2018). The mpileup file containing all of the pools was first converted to a sync file
using PoPoolation2 (Kofler et al. 2011b), filtering for a minimum base quality of 20. It is
important to note that to specify the Phred33 encoding, --fastq-type must be set to “sanger”
(rather than “illumina” for Phred64). Using Poolfstat the sync file was converted to a popsync
object in RStudio (v. 1.1.383) (RStudio Team 2016) with a minimum read count of one,
minimum coverage of 20, maximum coverage of 200, minimum allele frequency of 0.05 and
removing indels. Pairwise FST was calculated using the “Anova” method and the same
parameters used to import the sync file. A Euclidean distance matrix was created from the
pairwise FST matrix using the dist function in R. The pco function in the package LabDSV (v.1.8-0)
(Roberts 2012) was used to determine the eigenvalues, from which the percent explained by
eigenvectors one and two was calculated. Pairwise FST and percent eigenvectors were visualized
using ggplot2 (Wickham 2016).
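The two numerical steps in this section, a Euclidean distance between rows of the pairwise FST matrix and the percent of variation attributed to the leading eigenvectors, can be sketched as below. The analysis itself was done in R; this Python version is only an illustration, the eigenvalues are taken as given rather than computed, and it assumes all eigenvalues are non-negative. The matrix and eigenvalues shown are hypothetical.

```python
import math


def euclidean_row_distances(matrix):
    """Euclidean distance between every pair of rows, as a full square matrix."""
    size = len(matrix)
    return [
        [math.dist(matrix[i], matrix[j]) for j in range(size)]
        for i in range(size)
    ]


def percent_explained(eigenvalues, n_axes=2):
    """Percent of total variation carried by the first n_axes eigenvalues.

    Assumes non-negative eigenvalues sorted from largest to smallest.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    return [100.0 * value / total for value in eigenvalues[:n_axes]]


if __name__ == "__main__":
    # Hypothetical pairwise FST matrix for three pools.
    fst = [
        [0.00, 0.05, 0.30],
        [0.05, 0.00, 0.28],
        [0.30, 0.28, 0.00],
    ]
    print(euclidean_row_distances(fst))
    # Hypothetical eigenvalues from an ordination.
    print(percent_explained([6.0, 3.0, 1.0]))
```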
2.2.8 NUCLEOTIDE DIVERSITY
Individual mpileups were analyzed in PoPoolation1 (Kofler et al. 2011a) to determine
within-population nucleotide diversity. Nucleotide diversity was determined for SNPs with a
minimum count of two, a minimum coverage of 20 and a maximum coverage of 100, for side-by-side
windows of 250 bp and for a pool size equal to the diploid number of individuals in the pool. As
above, --fastq-type was set to “sanger”. Mean nucleotide diversity was calculated in RStudio
(v. 1.1.383) using all windows, including those that did not contain a SNP for the individuals of
that pool, as these represent a diversity of zero. The nucleotide diversity values were visualized using
the package ggplot2.
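The effect of including SNP-free windows in the mean can be seen in a small sketch. It assumes each window's diversity is already available as a number, with windows lacking SNPs recorded as zero; the values are hypothetical and the function names are not part of PoPoolation1.

```python
def mean_diversity_all_windows(window_values):
    """Mean nucleotide diversity over all windows, counting SNP-free windows as zero."""
    if not window_values:
        raise ValueError("at least one window is required")
    return sum(window_values) / len(window_values)


def mean_diversity_variable_windows(window_values):
    """Mean over windows with non-zero diversity only (shown for contrast)."""
    variable = [value for value in window_values if value > 0]
    if not variable:
        return 0.0
    return sum(variable) / len(variable)


if __name__ == "__main__":
    # Hypothetical per-window diversity for four 250 bp windows, two without SNPs.
    windows = [0.01, 0.0, 0.02, 0.0]
    print(mean_diversity_all_windows(windows))
    print(mean_diversity_variable_windows(windows))
```

Dropping the zero windows would inflate the mean, which is why the calculation here keeps them.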
2.3 RESULTS
2.3.1 DNA EXTRACTION, QUANTIFICATION AND QUALITY
DNA yield was variable among samples, even with tissues of similar size (range from 10
ng/µL up to greater than 200 ng/µL). Extracted DNA was determined to be free of organic
compounds and protein contamination (260/230 ratio of 2.05 ± 0.26 and 260/280 ratio of
1.95 ± 0.085; mean ± SD). Samples exhibited a high molecular weight and limited shearing
(individual samples not shown). Pooled samples did not appear to have increased shearing from
handling throughout the process with the majority of DNA above 10 Kbp (Figure B.1).
2.3.2 DNA SEQUENCING AND PRE-PROCESSING
A total of 3,675,153,756 sequences were assigned to a population over the eight
populations with a quality score of 38 for all populations, with the exception of one of the J3 lane
positions which had a quality score of 37 (J1: 446,326,456 sequences; J2: 478,291,654
sequences; J3: 501,946,794 sequences; J4: 430,999,172 sequences; J5: 412,287,368 sequences;
G1: 451,624,748 sequences; G2: 453,963,908 sequences; G3: 499,713,656 sequences). No
populations were flagged or failed per base quality scores, though quality decreased as the read
progressed (Figure B.2). Trimmomatic filtering removed a total of 843,631,584 sequences
(22.96%) (J1: 22.65%; J2: 21.92%; J3: 22.10%; J4: 22.04%; J5: 22.43%; G1: 22.45%; G2:
25.85%; G3: 24.13%). For each population per base quality improved such that quality was
above 30 (Figure B.3). A total of 19,198,864 sequences were removed from trimmed sequences
as non-snail contamination (J1: 0.66%; J2: 0.70%; J3: 0.81%; J4: 0.73%; J5: 0.65%; G1: 0.73%;
G2: 0.58%; G3: 0.55%). The genome reference assembly produced from J3 had an N50 = 3,931
bp (~450 Mbp in 1 kb+ scaffolds and ~79 Mbp in 10 kb+ scaffolds), mean length of first read in
pair up to first error (MPL1) of 90 and an estimated chimera rate of 0.55%. The genome
reference assembly produced using all pools generated an N50 = 2,739 bp (~540 Mbp in 1 kb+
scaffolds and ~48 Mbp in 10 kb+ scaffolds), MPL1 of 77 and estimated chimera rate of 1.08%.
After filtering the decontaminated sequences (removing unpaired reads and duplicates and requiring
a minimum mapping quality of 20), I mapped a total of 2,034,681,286 sequences (72.35% of
decontaminated sequences, or 55.36% of initial sequences) to this assembly (J1: 251,520,504 sequences (73.34%); J2:
268,182,754 sequences (72.31%); J3: 284,998,462 sequences (73.49%); J4: 244,620,378
sequences (73.34%); J5: 235,247,357 sequences (74.04%); G1: 246,765,601 sequences
(70.98%); G2: 237,549,944 sequences (70.98%); G3: 265,796,286 sequences (70.50%)). All
BAM files passed validation.
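The read counts reported above can be cross-checked with simple arithmetic. The sketch below uses only figures stated in this section; the variable names are illustrative.

```python
# Sequences assigned to each population (from the text above).
raw_reads = {
    "J1": 446_326_456, "J2": 478_291_654, "J3": 501_946_794, "J4": 430_999_172,
    "J5": 412_287_368, "G1": 451_624_748, "G2": 453_963_908, "G3": 499_713_656,
}
# Sequences mapped after filtering (from the text above).
mapped_reads = {
    "J1": 251_520_504, "J2": 268_182_754, "J3": 284_998_462, "J4": 244_620_378,
    "J5": 235_247_357, "G1": 246_765_601, "G2": 237_549_944, "G3": 265_796_286,
}
removed_by_trimming = 843_631_584
removed_as_contamination = 19_198_864

total_raw = sum(raw_reads.values())
total_mapped = sum(mapped_reads.values())
decontaminated = total_raw - removed_by_trimming - removed_as_contamination

pct_of_initial = 100 * total_mapped / total_raw
pct_of_decontaminated = 100 * total_mapped / decontaminated
print(f"{total_raw:,} raw; {total_mapped:,} mapped")
print(f"{pct_of_initial:.2f}% of initial, {pct_of_decontaminated:.2f}% of decontaminated")
```

The per-population counts sum to the stated totals, and the two overall percentages follow from taking the initial and the decontaminated sequence counts as denominators.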
2.3.3 PAIRWISE FST
The number of within population bi-allelic positions captured was 921,339 to 1,300,995
per P. johnsoni pooled population and 3,053,291 to 3,736,834 per P. gyrina pooled
population (Table 2.1). Pairwise FST between P. johnsoni populations ranged from 0.106 (J2 vs.
J4) to 0.367 (J1 vs. J4) (Table 2.2). Between P. johnsoni and P. gyrina populations pairwise FST
was 0.519 (J5 vs. G1) to 0.709 (J4 vs. G2) (Table 2.2). For P. gyrina populations, pairwise FST
ranged from 0.359 (G2 vs. G3) to 0.498 (G1 vs. G2). The PCoA plot of the pairwise FST
separated P. johnsoni and P. gyrina along the first axis, which explained 76.85% of the allelic
variation shaping this genetic structure (Figure 2.3). J2, J3, and J4 clustered together (Figure
2.3). J1 and J5 also clustered together; this did not reflect a small distance
between them (pairwise FST = 0.322) (Table 2.2) but rather that they were both equal distances
from all other populations (Figure 2.3). G2 and G3 were loosely clustered while G1 fell out
towards the P. johnsoni populations (Figure 2.3).
2.3.4 NUCLEOTIDE DIVERSITY
Nucleotide diversity was lower in P. johnsoni populations (J1 to J5) than in P.
gyrina populations (G1 to G3) (mean across all 250 bp windows: J1: 0.00133, J2: 0.00115, J3: 0.00113, J4:
0.00106, J5: 0.00156, G1: 0.00421, G2: 0.00475, G3: 0.00536) (Figure 2.4).
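The size of this difference can be summarised as fold changes. The short sketch below uses only the means listed above; the variable names are illustrative.

```python
# Mean nucleotide diversity per pool (values from the text above).
pi = {
    "J1": 0.00133, "J2": 0.00115, "J3": 0.00113, "J4": 0.00106, "J5": 0.00156,
    "G1": 0.00421, "G2": 0.00475, "G3": 0.00536,
}
johnsoni = [value for pool, value in pi.items() if pool.startswith("J")]
gyrina = [value for pool, value in pi.items() if pool.startswith("G")]

mean_johnsoni = sum(johnsoni) / len(johnsoni)
mean_gyrina = sum(gyrina) / len(gyrina)

# Ratio of species means, and the most conservative single comparison
# (least diverse P. gyrina pool against most diverse P. johnsoni pool).
fold_means = mean_gyrina / mean_johnsoni
fold_conservative = min(gyrina) / max(johnsoni)
print(f"{fold_means:.2f}x between means, {fold_conservative:.2f}x at minimum")
```

Even the most conservative pairing leaves P. gyrina with more than twice the diversity of P. johnsoni.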
2.4 DISCUSSION
As the instances of habitat loss and fragmentation increase and contribute to species
decline (Shaffer 1981), the effect of these events on genetic diversity is of increasing importance
to conservation management (Frankham 2005). Population size fluctuations (bottlenecks),
inbreeding, reduced gene flow and genetic drift can decrease the fitness of a population or
species (Bouzat 2010). Freshwater snails present an excellent system in which to study these
impacts as they often naturally exist in discrete populations with limited dispersal (Viard et al.
1997). Genetic drift has also been shown to have a rapid and large influence due to founder
effects, frequent bottlenecks, and low immigrant rates causing exceedingly low genetic diversity
(Viard et al. 1997; Bousset et al. 2004). These species, which additionally have the ability to
colonize a wide range of habitats, provide a unique opportunity to study the genomic
underpinnings of population and species differentiation (Mavárez et al. 2002c). Though there are
instances in certain species and ecosystems where there are no patterns of genetic differentiation
between distal populations (Gu et al. 2015; Lounnas et al. 2018), strong genetic structure is
common within small micro-geographical ranges (Mavárez et al. 2002a; b; Bousset et al. 2004;
Djuikwo-Teukeng et al. 2014). The endangered P. johnsoni exemplifies the characteristics that
can make freshwater snails ideal study systems. Its global habitat is restricted to just seven
geographically close thermal springs, where populations undergo severe yearly bottlenecks with
minimal gene flow predicted between the thermal springs (Lepitzki & Pacas 2010). While it is
currently managed as a species, studies have questioned the validity of this designation and have
proposed that P. johnsoni represents thermal ecotypes of a much more common snail, P. gyrina
(Wethington & Guralnick 2004; Wethington & Lydeard 2007). The objectives of this study were
therefore to test whether these putative species represented distinct phylogenetic units, and to test
whether the yearly cyclic bottlenecks contributed to population divergence and decreased
nucleotide diversity among P. johnsoni from different thermal springs.
2.4.1 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY BETWEEN P. JOHNSONI AND P. GYRINA POPULATIONS
I discovered strong genetic divergence between pooled samples of P. johnsoni and P.
gyrina. Using just under one million to over four million SNPs sequenced across the genome, a
pairwise FST of 0.636 ± 0.0605 (mean ± SD) was found between P. johnsoni and P. gyrina
populations on a geographical scale of less than a few kilometres. Though the relationship
between FST and the number of migrants is tenuous at best in most systems, this value would
indicate one migrant every five to nine generations (Whitlock & McCauley 1999). This value
should be interpreted with extreme caution, as these populations break many of the assumptions
of this relationship (e.g., evolutionary equilibrium). For example, there should be no genetic
isolation by geographical distance with all populations contributing equally to the pool of
migrants (Whitlock & McCauley 1999). This is not reflected by the patterns of pairwise FST
(e.g., J5 is the thermal spring geographically farthest from G1, but genetically the closest). As well,
populations are assumed to have a constant number of individuals (clearly not the case
in this system), to experience no selection and no mutation, and to have reached equilibrium
between migration and
genetic drift (Whitlock & McCauley 1999). With this in mind, the patterns of genetic
differentiation should be considered to indicate a maximized FST between these species with
virtually no gene flow, rather than a quantitative estimate of migrants. As well, there was a
striking difference in the nucleotide diversity between P. johnsoni and P. gyrina. P. gyrina
populations were found to have over double (G1) or triple (G2 and G3) the nucleotide diversity
observed in P. johnsoni populations. This relationship was also reflected in the number of SNPs
captured within each population, with P. gyrina having roughly three to four times the amount of
within population SNPs captured. Because individuals were pooled before sequencing, it was not
possible to estimate the variability in nucleotide diversity among individuals, and therefore
to test whether these differences were statistically significant.
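The conversion from FST to a migrant estimate discussed above can be sketched as follows. This assumes the island-model approximation FST ≈ 1/(4Nm + 1) that Whitlock & McCauley (1999) critique; the function name is illustrative, the FST values are the P. johnsoni versus P. gyrina comparisons from Table 2.2, and the caveats about violated assumptions apply in full.

```python
from statistics import mean, stdev

# Pairwise FST between P. johnsoni (J1-J5) and P. gyrina (G1-G3), Table 2.2.
between_species_fst = [
    0.550, 0.692, 0.650,  # J1 vs G1, G2, G3
    0.569, 0.699, 0.656,  # J2
    0.574, 0.707, 0.667,  # J3
    0.586, 0.709, 0.666,  # J4
    0.519, 0.671, 0.630,  # J5
]


def migrants_per_generation(fst: float) -> float:
    """Solve FST = 1 / (4Nm + 1) for Nm (island-model approximation)."""
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


fst_mean = mean(between_species_fst)
fst_sd = stdev(between_species_fst)  # sample SD of the rounded table values
nm = migrants_per_generation(fst_mean)
generations_per_migrant = 1.0 / nm
print(f"mean FST {fst_mean:.3f} (SD {fst_sd:.3f})")
print(f"Nm = {nm:.3f}, about one migrant every {generations_per_migrant:.1f} generations")
```

Applying the same conversion one standard deviation either side of the mean gives roughly five to nine generations per migrant, the range quoted above. The SD recomputed from the rounded table values may differ slightly from the reported one.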
There are a few plausible explanations for the observed population structure between P.
johnsoni and P. gyrina, and the reduced diversity in P. johnsoni. One possibility is that P.
johnsoni is adapting to the thermal spring environment and divergent selection is causing the
fixation of critical alleles, increasing divergence and decreasing nucleotide diversity. Another,
non-mutually exclusive possibility is the influence of the repeated bottlenecks. Over roughly 20
generations (assuming generation time of one year), Lepitzki (COSEWIC 2018) documented
shifts where certain P. johnsoni populations’ minimum reached under 0.010% of their maximum
value. Such patterns are predicted to result in genetic drift, whereby the probability of random
fixation or extinction of alleles is inversely proportional to effective population size (Hedrick &
Kalinowski 2000) and a reduction in genetic diversity is the predicted outcome of the process.
Though P. gyrina have been shown to have seasonal patterns of large increase and decrease in
other Albertan lakes (Sankurathri & Holmes 1976), higher population numbers and less
constrained habitat may decrease the influence of the bottlenecks on genetic diversity as
compared to P. johnsoni (Bouzat 2010). There could also be influence from multiple
evolutionary forces. Indications of selection have been found in populations that undergo
extensive bottlenecks; however, the selective force for an allele must be strong enough to
overcome random drift (e.g., Koskinen et al. 2002; Funk et al. 2016). Though there are clear
patterns of genetic differentiation and decreased nucleotide diversity which highlight the
conservation concern for P. johnsoni, more data will be necessary to determine the relative roles
of genetic drift and selection in this system.
2.4.2 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY WITHIN P. JOHNSONI AND P. GYRINA POPULATIONS
The SNPs captured in this study support the hypothesis of multiple genetic populations
for P. johnsoni and P. gyrina. Previous genetic work in P. gyrina which included G1, G2 and G3
and an additional two populations located in Banff National Park and Montana, was unable to
find support for monophyly or distinguish between three of the five populations (Remigio et al.
2001). Using two (arguably non-neutral) genetic markers, they found that G1 and G2 grouped
together away from G3 (Remigio et al. 2001). In the present study, I found that there was strong
population structure, with pairwise FST of 0.359 between G2 and G3 and of 0.498 and 0.455
between G1 and G2 and G1 and G3, respectively. This population structure suggests more
differentiation between the marsh population containing thermal water run-off (G1) and the lake
populations (G2 and G3) than between the two lake populations. Re-addressing population
structure with genome-wide SNPs may be more effective to resolve population structure when
present (e.g. Emerson et al. 2010). G1 was seen to have decreased nucleotide diversity compared
to both G2 and G3, though more SNPs were captured in this population than in G2. Whether
these patterns are due to adaptation to different habitat types, connectivity (discussed below) or a
combination is unknown.
Between the five populations of P. johnsoni included in this study, which are spread over
just one kilometre, I found pairwise FST ranging from 0.106 to 0.367 (factors likely impacting
this structure discussed below). Within P. johnsoni populations no general trends could be seen
between the severity of the population’s minimum and maximum and the amount of nucleotide
diversity or the amount of within population SNPs. In fact, J5, which has had some of the lowest
population minimums (30 to 40 individuals between 1996 and 2017) (COSEWIC 2018), was
found to have the highest nucleotide diversity and number of within population SNPs. This illustrates that
census data should be used in conjunction with genomic data for conservation management
(Keller & Waller 2002). Altogether, these analyses reveal that population genetic factors are
influencing the evolutionary trajectories of snails within these thermal springs at a remarkably
microgeographic scale.
There are several ecological genetic factors possibly influencing the observed patterns of
genetic differentiation between populations of the same species. One possibility is micro-habitat
local adaptation; however, further data generation and analysis would be necessary to address this
(discussed in Chapter 3 General conclusions). Another possible influencing force is
genetic drift. In addition to decreasing genetic diversity, drift is predicted to increase genetic
differentiation between populations. To some extent all of the populations included in this study
have likely undergone bottlenecks, so it is possible that these events are contributing to the
genetic differentiation and FST estimates measured. Ease of dispersal may also be playing a role
in the patterns of differentiation. For instance, between J2, J3, and J4 there is decreased pairwise
FST as compared to J1 (protected by a cave) and J5 (up Sulphur Mountain by about one km). In
certain conditions J4 water will run above ground to J3 and snails have been observed in the J4
outlet (Lepitzki 2002). Though water has been observed to flow from J3 into J1, there are no
patterns of decreased population structure between them in comparison to J1 to J2 or J4.
From the early to mid 1900s until 1997, dispersal between P. johnsoni populations may have been impacted,
as the thermal springs were piped together and bathing occurred between J1 and J2 until 1997
(COSEWIC 2008). Without prior sequencing to compare to, the effects of this will remain
unknown. Between P. gyrina populations, the two populations connected by the Bow River (G2
and G3) have decreased (though still substantial) population structure compared to G1, which is
isolated from the river. Though birds (Santamaría & Klaassen 2002) and mammals (Van
Leeuwen et al. 2013) have been shown to transport snails and shape the patterns of genetic
diversity (Van Leeuwen et al. 2013), water connectedness can frequently drive population
structure, including in other aquatic species (e.g., Kremer et al. 2017). Previous work has shown
that populations connected by waterways, which allow the transport of snails and eggs on mats,
are less genetically differentiated over much greater distances than even very
close pond populations (Mavárez et al. 2002a; Bousset et al. 2004; Djuikwo-Teukeng et al.
2014). Other than the flooding between certain P. johnsoni populations and the possibility of
previous connectivity through pipes, P. johnsoni populations have very little water connection as
the thermal water largely runs underground (Grasby & Lepitzki 2002). Though there is increased
nucleotide diversity in the two connected lake populations of P. gyrina, it is difficult to
disentangle whether this is due to higher population numbers and their habitat or because of
connectivity. Interestingly, the P. johnsoni populations of J2, J3 and J4 which have the lowest
genetic distance and presumably the highest probability of genetic mixing, also have the lowest
nucleotide diversity. The strong population structure between geographically close populations
and the decreased nucleotide diversity and number of polymorphic sites in P. johnsoni
populations, coupled with the known life history, provide compelling
evidence that genetic drift may be driving minor allele loss in P. johnsoni populations.
2.4.3 BROADER IMPLICATIONS AND CONSERVATION RECOMMENDATIONS
Though species designation has a broad spectrum of definitions, this study has shown
there to be clear genetic differentiation between P. johnsoni and P. gyrina and between
populations of what is argued to be the same species. This level of genetic differentiation
between populations of the same species is not restricted to this study and has been found in
other freshwater snail species (Mavárez et al. 2002a; b; Bousset et al. 2004; Djuikwo-Teukeng et
al. 2014). This raises concern about the potential loss of freshwater snail biodiversity that may
contain ecologically and evolutionarily significant genetic diversity (Funk et al. 2012; Mee et al.
2015). On the whole, molluscs are data deficient with respect to conservation. Though only 10% of
known species of molluscs have been assessed by the International Union for Conservation of
Nature (IUCN) as of 2016, they still represent 40% of the documented extinctions (Cowie et al.
2017). With the genetic structure observed on such a short geographical scale, there is a high
probability that we are in fact losing, if not “species”, genetically diverse populations which are
essential for persistence with environmental change (Ceballos & Ehrlich 2002) at a much faster
rate than even predicted (Régnier et al. 2015; Cowie et al. 2017). P. johnsoni is fortunate that it
exists in such a visibly unique habitat in a national park, and that COSEWIC agreed that even if it
actually represented a thermal ecotype of P. gyrina, it would likely have been re-designated as a
designatable unit (DU) (COSEWIC 2008). This ensured the allocation of resources due to its
proposed ecological or evolutionary significance (Joseph et al. 2009; Funk et al. 2012;
Carwardine et al. 2018). Actions included census counts done every three to four weeks from
1996 until their termination in 2017, motion triggered alarms to prevent people soaking in
the Middle Springs, the closing of swimming at the Cave and Basin Springs, and previous
funding to test the evolutionary significance of the species (Lepitzki 1998; Remigio et al. 2001;
COSEWIC 2008, 2018; Lepitzki & Pacas 2010). Without recognition of taxonomic, ecological
or evolutionary uniqueness, this level of resources will not be allocated to species (Isaac et al.
2004; Joseph et al. 2009). Unfortunately, there is a taxonomic bias in the primary research
necessary to assess these measures of distinctiveness (Howard & Bickford 2014; Régnier et al.
2015; Cowie et al. 2017). Genomics provides a relatively cost effective (the integration of Pool-
seq into conservation is discussed below) method for determining population structure and
characterizing the genetic health of species or populations.
As illustrated in the P. johnsoni populations compared to P. gyrina, bottlenecks in small,
isolated populations can cause genetic drift to fix alleles (decreasing nucleotide diversity) and
promote genetic differentiation. This loss of alleles can cause the fixation of detrimental alleles
(Bouzat 2010) and decrease of standing genetic variation necessary to rapidly respond to
environmental change (Morris et al. 2014). For P. johnsoni this means that each population
represents an incredibly important reservoir for the limited nucleotide diversity found across the
species. In light of this, I would recommend that the population counts are re-instated so that
deviations from 20-year trends can be detected quickly and, ideally, coupled with concurrent
genomic estimates of genetic diversity to directly test predictions associated with genetic drift.
The routine sequencing of the populations every few years would be an incredibly valuable
component of P. johnsoni’s management plan, as predicted with other management plans (e.g.,
De Barba et al. 2010; Hendricks et al. 2017). Temporal differences in the same population’s
nucleotide diversity and the extent of differentiation would be a powerful way to investigate the
effect of genetic drift (Bousset et al. 2004). If a population starts declining in numbers (a
significant decline in maxima has already been observed (COSEWIC 2018)) and/or there
is increased fixation of alleles, translocation of individuals from another population may be
warranted as genetic rescue (Ingvarsson 2001; Edmands 2007). Under such scenarios, there
could be concern regarding the potential for outbreeding depression if local adaptation to each
thermal spring was disrupted with the influx of new individuals (Edmands 2007). However, the
genetic differentiation shown here likely indicates either current or recent gene flow. It is
possible that gene flow between these thermal springs has decreased below what would be
natural for the system, as each of the thermal springs has been impacted by humans (COSEWIC
2008), presumably decreasing frequentation by animals that could act as vectors for these snails.
Additionally, if adaptive differences are occurring at certain alleles even in the face of population
bottlenecks and corresponding impact of genetic drift, the selective force would be incredibly
strong and therefore unlikely to be disrupted by a few migrants (Funk et al. 2016). Without semi-
frequent monitoring of genetic variation, it will be impossible to establish a baseline of what is
considered normal and stable for the system, with genetic threats remaining undetectable. As
well, further monitoring would provide the parameters necessary to elucidate the roles of
selection and drift in this system. This could be used to characterize the potential risk of
outbreeding depression if translocation occurred to mitigate the impact of low genetic diversity
and/or inbreeding depression. P. johnsoni represents a fantastic and unique opportunity to
conduct research on how a species’ genome is impacted by existing in small, isolated populations with
minimal gene flow and bottlenecks. In the face of the biodiversity crisis, where
critically important genetic diversity is so often over-looked (Frankel 1974; Laikre 2010),
characterizing and understanding genetic drift is vital.
2.4.4 THE UTILITY OF POOL-SEQ IN CONSERVATION
Pool-seq provides a low-cost method for capturing genome-wide polymorphisms. In
conservation management, Pool-seq can decrease sequencing costs without reducing
the number of individuals sampled (Ferretti et al. 2013). However, there are some purposes where Pool-
seq excels and others where it is limited. Firstly, Pool-seq is particularly useful in cases where
there are unknown amounts of polymorphism, such as this study. With RAD (Baird et al. 2008)
and ddRAD (Peterson et al. 2012) sequencing, only a small proportion of the genome is
captured, with one snail study capturing fewer than three thousand markers (Kess et al. 2016).
Because these methods involve the use of restriction enzymes that cut at specific patterns of
DNA, it is hard to predict the number of DNA fragments of appropriate size that will be generated
(Liu et al. 2013) and pilot studies can be necessary to determine this (Kess et al. 2016).
Barcoding individuals, even when doing reduced sequencing, can still represent a large financial
investment for a decreased number of SNPs captured (Gautier et al. 2013). However, because
individual identity is lost in Pool-seq, it is not possible to accurately estimate migration rate, effective
population size or inbreeding coefficient (Andrews et al. 2016). Additionally,
assignment of individuals to populations is not possible (Andrews et al. 2016). This must be
taken into account when sampling, especially if the species doesn’t exist in discrete populations.
If using Pool-seq, specific parameters and filtering must be used to decrease bias in allele
frequency estimates and subsequent calculations. I will discuss these in the context of this study.
At the sampling level, a minimum of 40 individuals is recommended per population for the most
accurate population allele frequency estimates (Schlötterer et al. 2014). Though Hivert et al.
(2018) argue that their estimator for pairwise FST is unbiased by pool size or coverage, pool size remains a
consideration for the measure of nucleotide diversity in this study (Kofler et al. 2011a). This is a
known limitation of using Pool-seq in endangered species (Schlötterer et al. 2014): the
intention was to sample 40 individuals per population; however, J2 had low population numbers.
To mitigate this I used windows in calculating nucleotide diversity, as recommended for
low sample sizes (Kofler et al. 2011a; Schlötterer et al. 2014). Care was taken to ensure equal
representation of each individual per pool in terms of DNA amount (Gautier et al. 2013;
Schlötterer et al. 2014). For filtering, pre-processing and calculations, I followed recommended
best practices to mitigate the effects of sequencing errors being called as SNPs (Kofler et al.
2011a; b; Schlötterer et al. 2014; Hivert et al. 2018). Further considerations are discussed in
Caveats.
As we strive to include genomics into conservation with increasing frequency, careful
validation and reflection on software used must be conducted (Shafer et al. 2015). In this study,
pairwise FST was originally calculated using the established PoPoolation2 (Kofler et al. 2011b).
It was then calculated using newly developed Poolfstat (Hivert et al. 2018) as a confirmation.
The packages use different methods for calculating estimates of allele counts, with Hivert et al.
(2018) illustrating that the PoPoolation2 estimate is biased (not converging on expected values
and impacted by coverage and sample size). The differences between the two packages should
not have been extreme (Hivert et al. 2018); however, I found there to be up to a 5x difference
between the two packages in the pairwise FST calculated. PoPoolation2 (Kofler et al. 2011b)
found pairwise FST of 0.044 to 0.076 between populations of P. johnsoni, 0.21 to 0.35 between
P. johnsoni and P. gyrina and 0.167 to 0.237 between P. gyrina populations (Table C.1),
compared to 0.106 to 0.367, 0.519 to 0.709 and 0.359 to 0.498 respectively found by Poolfstat
(Table 2.2) (Hivert et al. 2018). The population structure reported by PoPoolation2 would have indicated that P.
johnsoni and P. gyrina may not be genetically distinct and that gene flow was likely occurring
between them. However, it was determined that when calculating pairwise FST, PoPoolation2
(Kofler et al. 2011b) considers all base positions that are polymorphic in one or more
populations, regardless of whether the position is polymorphic
in either of the populations in the present pairwise comparison. Thus, any allelic position that
was polymorphic in some population but that was fixed in the two populations being compared
generated a pairwise FST of zero, effectively dampening the population structure. This was
exacerbated by the difference in nucleotide diversity between the P. johnsoni and P. gyrina
populations. This is a clear illustration that genomic results must be thought of in the context of
known ecological information for the species and must be examined closely before conservation
recommendations are made.
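The dampening effect described above can be reproduced with a toy calculation. This is only an illustration of the averaging problem, not the algorithm of PoPoolation2 or Poolfstat: it assumes a simple heterozygosity-based per-site FST, assigns zero to sites fixed for the same allele in both populations, and uses made-up allele frequencies.

```python
from typing import List, Tuple


def site_fst(p1: float, p2: float) -> float:
    """Per-site FST = (HT - HS) / HT from two allele frequencies (toy estimator)."""
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)  # heterozygosity of the combined pair
    if ht == 0:
        # Site fixed for the same allele in both populations: scored as zero.
        return 0.0
    return (ht - hs) / ht


def mean_fst(sites: List[Tuple[float, float]]) -> float:
    return sum(site_fst(p1, p2) for p1, p2 in sites) / len(sites)


# Hypothetical frequencies: sites variable within this pair of populations ...
informative = [(1.0, 0.0), (0.8, 0.2)]
# ... and sites polymorphic only in some third population, fixed in both here.
fixed_in_both = [(1.0, 1.0), (0.0, 0.0)]

fst_pair_only = mean_fst(informative)
fst_all_sites = mean_fst(informative + fixed_in_both)
print(fst_pair_only, fst_all_sites)
```

Every uninformative site added to the average pulls the estimate towards zero, so pairs of low-diversity populations are affected most when they are analysed alongside a more diverse one.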
Development and re-use of well-developed sequencing methods and pipelines are at the
core of genomics being integrated efficiently into conservation (Shafer et al. 2015). If this
pipeline was applied to a different project, once the samples were collected, it would likely take
less than a month to go from extracted DNA to having population structure results. In this
context, using Pool-seq provides an incredibly cost-effective method for conservation
management to assess population structure and investigate the genetic health of populations. This
method can be complemented by restriction enzyme sequencing of a subset of individuals to
provide estimates of effective population number, inbreeding coefficient and migrant rates
(Andrews et al. 2016) for more complete conservation management plans.
2.4.5 CAVEATS
An unavoidable consequence of pooling individuals for Pool-seq is that there is no way to
distinguish between two sequences that were sequenced twice from the same individual or from
two individuals (Schlötterer et al. 2014). Downstream applications assume that they were from
different individuals, which may bias the population estimates for allele frequency. However, the
estimates that I provide in this study are based on averaging over millions of positions, so
the bias should be decreased. I created the genome reference using pooled, 135 bp (post
Trimmomatic) paired-end short read sequences from one P. johnsoni population. DISCOVAR de
novo was designed for paired-end reads of 250 bp from a single PCR-free library, though the
creators do state that PCR-amplified libraries can “in principle be used” and that 150 bp reads
“may work” (https://software.broadinstitute.org/software/discovar/blog/?page_id=23). Increased
quality was seen when using one pool (J3) rather than all pools to construct the genome assembly,
reflected in an increased N50 value and MPL1 and a decreased estimated chimera rate. The
creators of DISCOVAR de novo state that the MPL1 should be 175 bp to 225 bp for 250 bp paired-
end reads (compared to 90 bp for J3 reference genome), though this value did not result in
DISCOVAR de novo flagging the assembly as problematic, nor did any of the other values
generated. However, due to these factors, the constructed contigs were short, which increases
sequence mapping error. Additionally, I assumed that P. gyrina would successfully map to a
reference genome constructed from P. johnsoni sequences. When calling SNPs, it was required
that all populations have a minimum of 20x coverage over a position, which should decrease the
impact of certain populations not mapping to divergent regions. Copy number variants and
repetitive regions of the genome may collapse to the same position (Schlötterer et al. 2014). This
is not an issue unique to Pool-seq and is an unfortunate limitation in using short sequencing
reads. I attempted to mitigate this by setting the upper limit of sequencing coverage to 200.
Biological differences between P. johnsoni and P. gyrina, such as selfing rates, could influence
the amount of fixation occurring. I am unable to determine what evolutionary force is generating
the genetic differentiation between and within P. johnsoni and P. gyrina populations. While I can
predict that genetic drift plays a large role with the decreased genetic diversity observed and
known bottlenecks, further work investigating potentially adaptive differences between P.
johnsoni and P. gyrina is necessary.
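The resampling caveat raised at the start of this section (reads at a position may come from the same chromosome more than once) can be quantified with a standard occupancy calculation. The sketch assumes diploid individuals contributing equal amounts of DNA and reads drawn independently; the function name and the example pool sizes are illustrative.

```python
def expected_distinct_chromosomes(n_individuals: int, coverage: int) -> float:
    """Expected number of different chromosomes represented among `coverage` reads.

    Assumes diploid individuals with equal DNA contribution, so each read is an
    independent draw (with replacement) from 2 * n_individuals chromosomes.
    """
    chromosomes = 2 * n_individuals
    # Probability that a given chromosome is missed by every read.
    p_missed = (1.0 - 1.0 / chromosomes) ** coverage
    return chromosomes * (1.0 - p_missed)


# A pool of 40 individuals at the minimum coverage of 20 ...
at_min_cov = expected_distinct_chromosomes(40, 20)
# ... and a pool of 20 individuals at the maximum coverage of 200.
at_max_cov = expected_distinct_chromosomes(20, 200)
print(f"{at_min_cov:.1f} of 20 reads are distinct chromosomes on average")
print(f"{at_max_cov:.1f} of 40 chromosomes are sampled on average")
```

At low coverage most reads come from different chromosomes, whereas at high coverage in a small pool many reads are necessarily repeat samples of the same chromosome.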
2.4.6 CONCLUSIONS
Without the integration of genomics, current conservation management plans will remain
incomplete. Here I used Pool-seq, a cost-effective sequencing method to capture millions of
SNPs in the globally restricted P. johnsoni and the more common P. gyrina. Analyses using
these SNPs were able to resolve genetic structure that had remained ambiguous between the two
species with the use of a few genetic markers (Remigio et al. 2001; Wethington & Guralnick
2004). These results indicated that P. johnsoni and P. gyrina were genetically distinct from each
other. Additionally, I found extensive population structure between
populations of the same species. Coupled with decreased nucleotide diversity in P. johnsoni
populations, which undergo massive bottlenecks, these results indicate that there may be a large
impact of genetic drift. The findings of this study will be integrated into P. johnsoni’s
management plan and will help make it more complete. Without the use of genomics,
differentiation between P. johnsoni and P. gyrina would not have been determined and the
impact of bottlenecks on decreasing genetic diversity would have remained a predicted but
uncharacterized threat. As illustrated in this system, there is an incredible place and need for the
use of genomics in conservation. The partnership between Parks Canada and University of
Calgary researchers represents the type of collaboration that is necessary for genomics to be used
in real world policy applications.
Population  J1         J2         J3         J4         J5         G1         G2         G3
J1          1,174,228  1,362,417  1,369,156  1,349,817  1,575,031  3,491,977  4,416,793  4,607,239
J2          -          1,000,751  1,127,267  1,049,468  1,468,419  3,527,269  4,397,121  4,601,359
J3          -          -          1,061,899  1,090,399  1,454,054  3,460,571  4,358,911  4,566,735
J4          -          -          -          921,339    1,090,399  3,510,866  4,372,598  4,580,040
J5          -          -          -          -          1,300,995  3,531,009  4,466,620  4,673,149
G1          -          -          -          -          -          3,248,634  4,708,928  4,817,892
G2          -          -          -          -          -          -          3,053,291  4,366,631
G3          -          -          -          -          -          -          -          3,736,834
Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Population  J1     J2     J3     J4     J5     G1     G2     G3
J1          0      0.315  0.335  0.367  0.322  0.550  0.692  0.650
J2          -      0      0.136  0.106  0.312  0.569  0.699  0.656
J3          -      -      0      0.209  0.302  0.574  0.707  0.667
J4          -      -      -      0      0.351  0.586  0.709  0.666
J5          -      -      -      -      0      0.519  0.671  0.630
G1          -      -      -      -      -      0      0.498  0.455
G2          -      -      -      -      -      -      0      0.359
G3          -      -      -      -      -      -      -      0
Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).
Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks till August 2000 and then once every four weeks till September 2016 when population counts were ended. From April to September 2017 and September 2018 the counts were resumed. Original springs include J1, J2, J3, J4 and J5. The re-established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.
Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250 bp side-by-side windows, with a minor allele count of 4, minimum coverage of 20 and maximum coverage of 200, where 60% of the window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
CHAPTER 3 GENERAL CONCLUSIONS
In this study, I aimed to provide clarity to previously unresolved taxonomic designations
between the Banff Springs Snail (Physella johnsoni) and Physella gyrina. Additionally, I
provided new data that characterized the genetic diversity and micro-geographical population
structure of P. johnsoni. Using Pool-seq, just under a million to over four million SNPs were
captured per population, allowing me to uncover strong and defined population structure
between P. johnsoni and P. gyrina. This leads me to believe that they represent genetically
distinct units and warrant continued separate management. Even among populations of the
same species there was extensive population structure, which makes me concerned about the
prospect of “lumping” species of snails (Wethington & Guralnick 2004), as I and others
(Mavárez et al. 2002a, b; Bousset et al. 2004; Djuikwo-Teukeng et al. 2014) have found strong
population structure between populations on small geographic scales. In terms of management,
I believe the data deficiency for molluscs (Régnier et al. 2009; Cowie et al. 2017), specifically
in genomic data, will result in the loss of possibly genetically unique and interesting species
and sub-species and threaten the continued persistence of many molluscs.
Pool-seq is a powerful tool for addressing population structure and nucleotide diversity.
Complications can arise when using it to address adaptive differences because, without an
annotated genome, regions of divergence lack biological context. However, this issue is not
restricted to Pool-seq and is shared by all sequencing methods. Unlike RAD-seq, Pool-seq
captures the majority of the genome, so without a reference genome it feels under-utilized.
Fortunately, the cost of sequencing genomes is continually decreasing, and the number of
available reference genomes is increasing. In terms of conservation management, the current
toolset for analyzing Pool-seq data is limited in some respects. There are packages and scripts
developed to determine pairwise FST, nucleotide diversity (Tajima's Pi), Watterson’s Theta and
Tajima's D, but because individual-level information is lost, Pool-seq data cannot be used to
determine levels of inbreeding or effective population size. However, if these do not need to be
explicitly determined for the species or population, Pool-seq provides an impressive amount of
data and resolution for the parameters it can determine, at a very attractive price.
In future steps, I would like to further investigate the population structure between
P. johnsoni and P. gyrina and to take the first steps in determining whether there are possibly
adaptive differences between them. In pursuit of this goal, I think that generating a reference
genome would be of great benefit. This would allow us to start investigating whether there are
shared regions of the genome that show evidence of selection between P. johnsoni and
P. gyrina and whether these regions lie near or in potential gene coding regions.
In conclusion, the data I have generated and presented here provide the resolution
necessary to determine that P. johnsoni and P. gyrina are genetically distinct. Additionally, I
have shown that there is strong micro-geographical population structure among the P. johnsoni
thermal springs and decreased within-population nucleotide diversity. I recommend a modified
version of the current recovery strategy and action plan for P. johnsoni as the appropriate action
plan. Considering the decreased nucleotide diversity shown in P. johnsoni, each population plays
a vital role in the evolutionary robustness of the species beyond just total numbers. I recommend
re-instating population counts focused on capturing the yearly minimum and maximum for each
population. I propose that population counts be done every four weeks for three months, or
at some duration and frequency that captures previously recorded population minimums and
maximums (COSEWIC 2018). This could provide evidence of deviations from the 20-year
norms and therefore provide the first warning signs of population collapse, especially when
coupled with genomic data. As such, I recommend that semi-regular sequencing be incorporated
into P. johnsoni’s management plan to establish a baseline for the impact of genetic drift on
population divergence and nucleotide diversity. Additional monitoring of genetic variation levels
could determine potentially adaptive versus non-adaptive loci, effective population sizes and
inbreeding coefficients. Decreasing nucleotide diversity, effective population size and/or
population numbers, and/or increasing inbreeding, may warrant translocation of individuals from
a population with different polymorphisms. By characterizing these factors, management would
be able to weigh the potential risks of outbreeding depression versus inbreeding depression. As
demonstrated in this system, the use of genomics in conservation is a vital component of creating
effective and efficient management plans.
References
Adams JR, Vucetich LM, Hedrick PW, Peterson RO, Vucetich JA (2011) Genomic sweep and
potential genetic rescue during limiting environmental conditions in an isolated wolf
population. Proceedings of the Royal Society B: Biological Sciences, 278, 3336–3344.
Allendorf FW, Hohenlohe PA, Luikart G (2010) Genomics and the future of conservation
genetics. Nature Reviews Genetics, 11, 697–709.
Anderson E, Skaug HJ, Barshis DJ (2014) Next-generation sequencing for molecular ecology: a
caveat regarding pooled samples. Molecular Ecology, 23, 502–512.
Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. Available
online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA (2016) Harnessing the power of
RADseq for ecological and evolutionary genomics. Nature Reviews Genetics, 17, 81–92.
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using
sequenced RAD markers. PLoS ONE, 3, 1–7.
De Barba M, Waits LP, Garton EO et al. (2010) The power of genetic monitoring for studying
demography, ecology and genetics of a reintroduced brown bear population. Molecular Ecology, 3938–3951.
Barrett RDH, Rogers SM, Schluter D (2008) Natural selection on a major armor gene in
threespine stickleback. Science, 322, 255–257.
Bilyj M (2011) A study on the phototrophic microbial mat communities of Sulphur Mountain
Thermal Springs and their association with the endangered, endemic snail Physella johnsoni. University of Manitoba.
Bland LM, Collen B, David C, Orme L, Bielby J (2015) Predicting the conservation status of
Data Deficient species. Conservation Biology, 53, 1792–1803.
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence
data. Bioinformatics, 30, 2114–2120.
Bousset L, Henry PY, Sourrouille P, Jarne P (2004) Population biology of the invasive
freshwater snail Physa acuta approached through genetic markers, ecological
characterization and demography. Molecular Ecology, 13, 2023–2036.
Bouzat JL (2010) Conservation genetics of population bottlenecks: The role of chance, selection,
and history. Conservation Genetics, 11, 463–478.
Bull JK, Sands C, Garrick RC et al. (2013) Environmental complexity and biodiversity: The
multi-layered evolutionary history of a log-dwelling velvet worm in montane temperate
Australia. PLoS ONE, 8, 1–15.
Bushnell B (2014) BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL Report #: LBNL-7065E. Retrieved from
https://escholarship.org/uc/item/1h3515gn
Butchart SHM, Walpole M, Collen B et al. (2010) Global biodiversity: Indicators of recent
declines. Science, 328, 1164–1168.
Cardinale BJ, Duffy JE, Gonzalez A et al. (2012) Biodiversity loss and its impact on humanity.
Nature, 486, 59–67.
Carwardine J, Martin TG, Firn J et al. (2018) Priority threat management for biodiversity
conservation: A handbook. Journal of Applied Ecology, 0–2.
Ceballos G, Ehrlich PR (2002) Mammal population losses and the extinction crisis. Science, 296,
904–907.
Clench WJ (1926) Three new species of Physa. Occasional Papers of the Museum of Zoology,
168, 1–8.
COSEWIC (2008) COSEWIC assessment and update status report on the Banff Springs Snail
Physella johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., vii + 53 pp.
COSEWIC (2014) COSEWIC wildlife species assessment: quantitative criteria and guidelines.
Committee on the Status of Endangered Wildlife in Canada [cited 2018].
https://www.canada.ca/en/environment-climate-change/services/committee-status-
endangered-wildlife/wildlife-species-assessment-process-categories-guidelines/quantitative-
criteria.html (accessed on 18 December 2018).
COSEWIC (2015) Guidelines for recognizing designatable units. Committee on the Status of
Endangered Wildlife in Canada [cited 2018]. https://www.canada.ca/en/environment-
climate-change/services/committee-status-endangered-wildlife/guidelines-recognizing-
designatable-units.html (accessed on 29 October 2018).
COSEWIC (2018) COSEWIC status appraisal summary on the Banff Springs Snail Physella
johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., xxvi pp.
Cowie RH, Regnier C, Fontaine B, Bouchet P (2017) Measuring the sixth extinction: what do
mollusks tell us? Nautilus, 131, 3–41.
Craze PG, Barr AG (2002) The use of electrical-component freezing spray as a method of killing
and preparing snails. Journal of Molluscan Studies, 68, 191–192.
Dalziel AC, Rogers SM, Schulte PM (2009) Linking genotypes to phenotypes and fitness: How
mechanistic biology can inform molecular ecology. Molecular Ecology, 18, 4997–5017.
Daugherty CH, Cree A, Hay JM, Thompson MB (1990) Neglected taxonomy and continuing
extinctions of tuatara (Sphenodon). Letters to Nature, 374, 177–179.
Dennenmoser S, Vamosi SM, Nolte AW, Rogers SM (2017) Adaptive genomic divergence
under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin
(Cottus asper) revealed by Pool-Seq. Molecular Ecology, 26, 25–42.
Djuikwo-Teukeng FF, Da Silva A, Njiokou F et al. (2014) Significant population genetic
structure of the Cameroonian fresh water snail, Bulinus globosus, (Gastropoda: Planorbidae)
revealed by nuclear microsatellite loci analysis. Acta Tropica, 137, 111–117.
Edmands S (2007) Between a rock and a hard place: Evaluating the relative risks of inbreeding
and outbreeding for conservation and management. Molecular Ecology, 16, 463–475.
Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using
high-throughput sequencing. Proceedings of the National Academy of Sciences of the United States of America, 107, 16196–16200.
Ferretti L, Ramos-Onsins SE, Pérez-Enciso M (2013) Population genomics from pool
sequencing. Molecular Ecology, 22, 5561–5576.
Frankel OH (1974) Genetic conservation: our evolutionary responsibility. Genetics, 78, 53–65.
Frankham R (2005) Genetics and extinction. Biological Conservation, 126, 131–140.
Funk WC, Lovich RE, Hohenlohe PA et al. (2016) Adaptive divergence despite strong genetic
drift: Genomic analysis of the evolutionary mechanisms causing genetic differentiation in
the island fox (Urocyon littoralis). Molecular Ecology, 25, 2176–2194.
Funk WC, McKay JK, Hohenlohe PA, Allendorf FW (2012) Harnessing genomics for
delineating conservation units. Trends in Ecology and Evolution, 27, 489–496.
Futschik A, Schlötterer C (2010) The next generation of molecular markers from massively
parallel sequencing of pooled DNA samples. Genetics, 186, 207–218.
Gautier M, Foucaud J, Gharbi K et al. (2013) Estimation of population allele frequencies from
next-generation sequencing data: Pool-versus individual-based genotyping. Molecular Ecology, 22, 3766–3779.
Gilbertson CR, Wyatt JD (2016) Evaluation of euthanasia techniques for an invertebrate species,
land snails (Succinea putris). Journal of the American Association for Laboratory Animal Science, 55, 1–5.
Grasby SE, van Everdingen RO, Bednarski J, Lepitzki DA (2003) Travertine mounds of the
Cave and Basin National Historic Site, Banff National Park. Canadian Journal of Earth Sciences, 40, 1501–1513.
Grasby SE, Lepitzki DAW (2002) Physical and chemical properties of the Sulphur Mountain
thermal springs, Banff National Park, and implications for endangered snails. Canadian Journal of Earth Sciences, 39, 1349–1361.
Gu QH, Zhou CJ, Cheng QQ et al. (2015) The perplexing population genetic structure of
Bellamya purificata (Gastropoda: Viviparidae): low genetic differentiation despite low
dispersal ability. Journal of Molluscan Studies, 81, 466–475.
Guisan A, Tingley R, Baumgartner JB et al. (2013) Predicting species distributions for
conservation decisions. Ecology Letters, 16, 1424–1435.
Gustafson KD, Kensinger BJ, Bolek MG, Luttbeg B (2014) Distinct snail (Physa) morphotypes
from different habitats converge in shell shape and size under common garden conditions.
Evolutionary Ecology Research, 16, 77–89.
Hedrick PW, Kalinowski ST. (2000) Inbreeding Depression in Conservation Biology. Annual
Review of Ecology and Systematics, 31, 139–162.
Hedrick PW, Peterson RO, Vucetich LM, Adams JR, Vucetich JA (2014) Genetic rescue in Isle
Royale wolves: genetic analysis and the collapse of the population. Conservation Genetics,
15, 1111–1121.
Hendricks S, Epstein B, Schönfeld B et al. (2017) Conservation implications of limited genetic
diversity and population structure in Tasmanian devils (Sarcophilus harrisii). Conservation Genetics, 18, 977–982.
Hivert V, Leblois R, Petit EJ, Gautier M, Vitalis R (2018) Measuring genetic differentiation from
pool-seq data. Genetics, 210, 315–330.
Hoban S, Kelley JL, Lotterhos KE et al. (2016) Finding the genomic basis of local adaptation:
Pitfalls, practical solutions, and future directions. The American Naturalist, 188, 379–397.
Hoelzel AR, Halley J, O’Brien SJ et al. (1993) Elephant seal genetic variation and the use of
simulation models to investigate historical population bottlenecks. Journal of Heredity, 84,
443–449.
Holsinger KE, Weir BS (2009) Genetics in geographically structured populations: defining,
estimating and interpreting FST. Nature Reviews Genetics, 10, 639–650.
Hooper DU, Adair EC, Cardinale BJ et al. (2012) A global synthesis reveals biodiversity loss as
a major driver of ecosystem change. Nature, 486, 105–108.
Howard SD, Bickford DP (2014) Amphibians over the edge: Silent extinction risk of Data
Deficient species. Diversity and Distributions, 20, 837–846.
Ingvarsson PK (2001) Restoration of genetic variation lost - The genetic rescue hypothesis.
Trends in Ecology and Evolution, 16, 62–63.
Isaac NJB, Mallet J, Mace GM (2004) Taxonomic inflation: Its influence on macroecology and
conservation. Trends in Ecology and Evolution, 19, 464–469.
Jarne P, Perdieu MA, Pernot AF, Delay B, David P (2000) The influence of self-fertilization and
grouping on fitness attributes in the freshwater snail Physa acuta: Population and individual
inbreeding depression. Journal of Evolutionary Biology, 13, 645–655.
Johnson PD, Bogan AE, Brown KM et al. (2013) Conservation status of freshwater gastropods
of Canada and the United States. Fisheries, 38, 247–282.
Joseph LN, Maloney RF, Possingham HP (2009) Optimal allocation of resources among
threatened species: a project prioritization protocol. Conservation Biology, 23, 328–338.
Kell LT, Dickey-Collas M, Hintzen NT et al. (2009) Lumpers or splitters? Evaluating recovery
and management plans for metapopulations of herring. ICES Journal of Marine Science, 66,
1776–1783.
Keller LF, Waller DM (2002) Inbreeding effects in wild populations. Trends in Ecology and Evolution, 17, 230–241.
Kess T, Gross J, Harper F, Boulding EG (2016) Low-cost ddRAD method of SNP discovery and
genotyping applied to the periwinkle Littorina saxatilis. Journal of Molluscan Studies, 82,
104–109.
Kofler R, Orozco-terWengel P, de Maio N et al. (2011a) PoPoolation: A toolbox for population
genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE, 6.
Kofler R, Pandey RV, Schlötterer C (2011b) PoPoolation2: Identifying differentiation between
populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27,
3435–3436.
Koskinen MT, Haugen TO, Primmer CR (2002) Contemporary fisherian life-history evolution in
small salmonid populations. Nature, 419, 826–830.
Kremer CS, Vamosi SM, Rogers SM (2017) Watershed characteristics shape the landscape
genetics of brook stickleback (Culaea inconstans) in shallow prairie lakes. Ecology and Evolution, 7, 3067–3079.
Laikre L (2010) Genetic diversity is overlooked in international conservation policy
implementation. Conservation Genetics, 11, 349–354.
Van Leeuwen CHA, Huig N, Van Der Velde G et al. (2013) How did this snail get here? Several
dispersal vectors inferred for an aquatic invasive species. Freshwater Biology, 58, 88–99.
Lepitzki DAW (1998) The ecology of Physella johnsoni, the threatened Banff Springs Snail.
Heritage Resource Conservation - Aquatics, i-146.
Lepitzki DAW (2002) Status of the Banff Springs Snail (Physella johnsoni) in Alberta. Alberta Sustainable Resource Development, Fish and Wildlife Division, and Alberta Conservation Association, Wildlife Status Report No. 40, Edmonton, AB., 29 pp.
Lepitzki DAW, Pacas C (2010) Recovery Strategy and Action Plan for the Banff Springs Snail
(Physella johnsoni) in Canada. Species at Risk Act Recovery Strategy Series. Parks Canada Agency, Ottawa, vii + 63 pp.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics, 25, 1754–1760.
Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format and
SAMtools. Bioinformatics, 25, 2078–2079.
Lien S, Gidskehaug L, Moen T et al. (2011) A dense SNP-based linkage map for Atlantic
salmon (Salmo salar) reveals extended chromosome homeologies and striking differences
in sex-specific recombination patterns. BMC Genomics, 12, 1–10.
Liu MM, Davey JW, Banerjee R et al. (2013) Fine Mapping of the pond snail left-right
asymmetry (chirality) locus using RAD-Seq and Fibre-FISH. PLoS ONE, 8, 2–8.
Lotterhos KE, Whitlock MC (2015) The relative power of genome scans to detect local
adaptation depends on sampling design and statistical method. Molecular Ecology, 24,
1031–1046.
Lounnas M, Correa AC, Alda P et al. (2018) Population structure and genetic diversity in the
invasive freshwater snail Galba schirazensis (Lymnaeidae). Canadian Journal of Zoology,
96, 425–435.
Lundmark C, Sandström A, Andersson K, Laikre L (2019) Monitoring the effects of knowledge
communication on conservation managers’ perception of genetic biodiversity – A case
study from the Baltic Sea. Marine Policy, 99, 223–229.
Mace GM (2004) The role of taxonomy in species conservation. Philosophical transactions of the Royal Society of London. Series B, Biological Sciences, 359, 711–9.
Margres MJ, Jones ME, Epstein B et al. (2018) Large-effect loci affect survival in Tasmanian
devils (Sarcophilus harrisii) infected with a transmissible cancer. Molecular Ecology, 27,
4189–4199.
Martin TG, Nally S, Burbidge AA et al. (2012) Acting fast helps avoid extinction. Conservation Letters, 5, 274–280.
Mavárez J, Amarista M, Pointier JP, Jarne P (2002a) Fine-scale population structure and
dispersal in Biomphalaria glabrata, the intermediate snail host of Schistosoma mansoni, in
Venezuela. Molecular Ecology, 11, 879–889.
Mavárez J, Pointier JP, David P, Delay B, Jarne P (2002b) Genetic differentiation, dispersal and
mating system in the schistosome-transmitting freshwater snail Biomphalaria glabrata.
Heredity, 89, 258–265.
Mavárez J, Steiner C, Pointier J-P, Jarne P (2002c) Evolutionary history and phylogeography of
the schistosome-vector freshwater snail Biomphalaria glabrata based on nuclear and
mitochondrial DNA sequences. Heredity, 89, 266–272.
McCallum H (2008) Tasmanian devil facial tumour disease: lessons for conservation biology.
Trends in Ecology and Evolution, 23, 631–637.
Mee JA, Bernatchez L, Reist JD, Rogers SM, Taylor EB (2015) Identifying designatable units
for intraspecific conservation prioritization: A hierarchical approach applied to the lake
whitefish species complex (Coregonus spp.). Evolutionary Applications, 8, 423–441.
Moore AC, Burch JB, Duda TF (2014) Recognition of a highly restricted freshwater snail lineage
(Physidae: Physella) in southeastern Oregon: convergent evolution, historical context, and
conservation considerations. Conservation Genetics, 16, 113–123.
Morais AR, Siqueira MN, Lemes P et al. (2013) Unraveling the conservation status of data
deficient species. Biological Conservation, 166, 98–102.
Morris MRJ, Bowles E, Allen BE, Jamniczky HA, Rogers SM (2018) Contemporary ancestor?
Adaptive divergence from standing genetic variation in Pacific marine threespine
stickleback. BMC Evolutionary Biology, 18, 1–21.
Morris MRJ, Richard R, Leder EH et al. (2014) Gene expression plasticity evolves in response to
colonization of freshwater lakes in threespine stickleback. Molecular Ecology, 23, 3226–
3240.
de Oliveira LR, Arias-Schreiber M, Meyer D, Morgante JS (2006) Effective population size in a
bottlenecked fur seal population. Biological Conservation, 131, 505–509.
Ouborg NJ, Pertoldi C, Loeschcke V, Bijlsma RK, Hedrick PW (2010) Conservation genetics in
transition to conservation genomics. Trends in Genetics, 26, 177–187.
Parsons ECM (2016) Why IUCN should replace “data deficient” conservation status with a
precautionary “assume threatened” status—A cetacean case study. Frontiers in Marine Science, 3, 2015–2017.
Peichel CL, Sullivan ST, Liachko I, White MA (2017) Improvement of the threespine
stickleback genome using a Hi-C-based proximity-guided assembly. Journal of Heredity,
108, 693–700.
Peterson RO, Thomas NJ, Thurber JM, Vucetich JA, Waite TA (1998) Population limitation and
the wolves of Isle Royale. Journal of Mammalogy, 79, 828.
Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: An
inexpensive method for de novo SNP discovery and genotyping in model and non-model
species. PLoS ONE, 7, 1–11.
Pip E, Franck JPC (2008) Molecular phylogenetics of central Canadian Physidae (Pulmonata :
Basommatophora). Canadian Journal of Zoology, 16, 10–16.
Régnier C, Achaz G, Lambert A et al. (2015) Mass extinction in poorly known taxa.
Proceedings of the National Academy of Sciences, 112, 7761–7766.
Régnier C, Fontaine B, Bouchet P (2009) Not knowing, not recording, not listing: Numerous
unnoticed mollusk extinctions. Conservation Biology, 23, 1214–1221.
Remigio EA, Lepitzki DAW, Lee JS, Hebert PDN (2001) Molecular systematic relationships and
evidence for a recent origin of the thermal spring endemic snails Physella johnsoni and
Physella wrighti (Pulmonata: Physidae). Canadian Journal of Zoology, 79, 1941–1950.
Roberts DW (2012) Package ‘labdsv’
Rogers SM, Bernatchez L (2007) The genetic architecture of ecological speciation and the
association with signatures of selection in natural lake whitefish (Coregonus sp.
Salmonidae) species pairs. Molecular Biology and Evolution, 24, 1423–1438.
Rosenberg G (2014) A new critical estimate of named species-level diversity of the recent
Mollusca. American Malacological Bulletin, 32, 308–322.
RStudio Team (2016) RStudio: Integrated Development for R. RStudio, Inc., Boston, MA
http://www.rstudio.com/.
Sankurathri CS, Holmes JC (1976) Effects of thermal effluents on the population dynamics of
Physa gyrina Say (Mollusca: Gastropoda) at Lake Wabamun, Alberta. Canadian Journal of Zoology, 54, 582–590.
Santamaría L, Klaassen M (2002) Waterbird-mediated dispersal of aquatic organisms: An
introduction. Acta Oecologica, 23, 115–119.
Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals — mining
genome-wide polymorphism data without big funding. Nature Reviews Genetics, 15, 749–763.
Schmieder R, Edwards R (2011a) Fast identification and removal of sequence contamination
from genomic and metagenomic datasets. PLoS ONE, 6.
Schmieder R, Edwards R (2011b) Quality control and preprocessing of metagenomic datasets.
Bioinformatics, 21, 863–864.
Shafer ABA, Wolf JBW, Alves PC et al. (2015) Genomics and the challenging translation into
conservation practice. Trends in Ecology and Evolution, 30, 78–87.
Shaffer ML (1981) Minimum population sizes for species conservation. Bioscience, 31, 131–
134.
Storfer A, Hohenlohe PA, Margres MJ et al. (2018) The devil is in the details: Genomics of
transmissible cancers in Tasmanian devils. PLoS Pathogens, 14, 1–7.
Taylor DW (2003) Introduction to Physidae (Gastropoda: Hygrophila); biogeography,
classification, morphology. Revista de Biologia Tropical, 51, 1–287.
Viard F, Justy F, Jarne P (1997) The influence of self-fertilization and population dynamics on
the genetic structure of subdivided populations: a case study using microsatellite markers in
the freshwater snail Bulinus truncatus. Evolution, 51, 1518–1528.
Vilà C, Sundqvist A-K, Flagstad Ø et al. (2003) Rescue of a severely bottlenecked wolf (Canis lupus) population by a single immigrant. Proceedings of the Royal Society B: Biological Sciences, 270, 91–97.
Weber DS, Stewart BS, Lehman N (2004) Genetic consequences of a severe population
bottleneck in the Guadalupe fur seal (Arctocephalus townsendi). Journal of Heredity, 95,
144–153.
Wethington AR, Guralnick R (2004) Are populations of physids from different hot springs
distinctive lineages? American Malacological Bulletin, 19, 135–144.
Wethington AR, Lydeard C (2007) A molecular phylogeny of Physidae (Gastropoda:
Basommatophora) based on mitochondrial DNA sequences. Journal of Molluscan Studies,
73, 241–257.
Whitlock MC, Lotterhos KE (2015) Reliable detection of loci responsible for local adaptation:
inference of a null model through trimming the distribution of FST. The American Naturalist, 186, S24–S36.
Whitlock M, McCauley D (1999) Indirect measures of gene flow and migration: FST ≠ 1/(4Nm +
1). Heredity, 82, 117–125.
Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
ISBN 978-3-319-24277-4, http://ggplot2.org
Wright S (1950) The genetical structure of populations. The Annals of Human Genetics, 324–
354.
Appendix A: Genomic analysis pipeline
General notes about the project
Below is the pipeline I used to analyze the Pool-seq data. I have generalized each step so that future students can adjust it to their projects. For my project, I analyzed five sites of Physella johnsoni and three sites of Physella gyrina, with 20 to 40 individuals per site. These were sequenced by Genome Quebec on the Illumina HiSeq X, aiming for about 40x coverage (actual coverage turned out to be almost double that). Each site consisted of half a lane worth of data (total of four lanes).
The pipeline is not linear, in that some analyses branch off at certain points. As well, most of this was run on Cedar, so the way modules are accessed reflects that. Some of it was run on ARC, though.
I included the SLURM scripts because they give an idea of how long each step took to run for my files. Remember that my files were half a lane each; a colleague of mine sequenced six pools over one lane and her analysis took a fraction of the time listed below.
Getting started in Cedar
Launch Terminal on a Mac. On Windows, an SSH client such as PuTTY can be used instead.
Logging into Cedar
ssh [email protected]
#ex. [email protected]
#You will be prompted for your password
#When you type it in nothing will appear but just hit "enter" when you've typed it in
#and it should log you right in!
Navigating around Cedar and creating a project directory
ls #will show you all the directories in your home folder
#How to make a symbolic link to our project folder in def-srogers
rm project #removes the current project directory
ln -s projects/def-srogers/your_user_name project #links "project" to your folder in def-srogers
cd project #changes the directory you are in to project
pwd #will give you the path to the directory you are in
mkdir project_name #will make a directory in the project directory named "project_name".
cd project_name
mkdir 00_nameofstep1 #this is a good way to keep your steps in order
cd 00*tab* #by using "tab" on your keyboard it will auto-complete the name
It is worth spending some time learning Unix commands. I wish I had spent more time doing this, instead of trying to figure it out as I went.
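The numbered-directory convention above can also be scripted so that every step directory is created in one go. This is only a minimal sketch and was not part of my pipeline: make_step_dirs is a hypothetical helper name, and the step names in the usage note are placeholders.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original pipeline): create zero-padded,
# numbered step directories (00_name, 01_name, ...) inside a project directory.
make_step_dirs() {
    local project_dir="$1"
    shift
    local i=0
    local step
    for step in "$@"; do
        # printf pads the counter to two digits so "ls" lists the steps in order
        mkdir -p "${project_dir}/$(printf '%02d' "$i")_${step}"
        i=$((i + 1))
    done
}
```

For example, make_step_dirs project_name raw_data trimming mapping would create project_name/00_raw_data, project_name/01_trimming and project_name/02_mapping.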
How to make executable codes
You will need to choose a text editor to use on Cedar or Graham. We chose to use GNU nano because it is user and beginner friendly. To make executable codes, do the following:
nano codename
#will open a new, blank, nano with the name "codename"
type code
^X #this will close the code and you can select to save it.
#nano has functions listed at the bottom of the file - "^" is "control"
ls #will list all the files in the directory you are in - should see your code here!
chmod +x codename #this changes the code so that it is now executable, important!
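If you would rather not open an editor, the same two steps (write the code, then make it executable) can be done non-interactively. The sketch below is an illustration only: write_executable is a hypothetical helper name that I did not use in the pipeline. It copies whatever it receives on standard input into the named file and then runs chmod +x on it.

```shell
#!/bin/bash
# Hypothetical helper: save the text arriving on standard input as a script
# file, then mark that file as executable (the same effect as nano + chmod +x).
write_executable() {
    local script_path="$1"
    # cat with no file arguments copies standard input into the target file
    cat > "$script_path"
    chmod +x "$script_path"
}
```

A here-document is a convenient way to feed it the code, e.g. write_executable codename <<'EOF' followed by the lines of the script and a closing EOF; the quoted 'EOF' keeps the shell from expanding variables inside the script body. The code can then be run with ./codename.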
How to submit jobs to run on Cedar
To run code on Cedar, you need to submit them as “jobs”. The way this was explained to me is that youhave to submit your request to the “secretary” of Cedar and they will send your job to the appropiate place.How we do this is something called “SLURM”.
Create a SLURM
nano SLURM_name
#copy and paste below and change as necessary
^X #close and save SLURM
Example SLURM
#!/bin/bash
#The line above must be the first line of the file
# ---------------------------------------------------------------------
# Place you can leave yourself a descriptor of this SLURM
# ---------------------------------------------------------------------
#SBATCH --job-name=Nameofyourjob #Make this descriptive but short!
#SBATCH --account=def-srogers #the account this is under
#SBATCH --cpus-per-task=XX #how many threads you want
#SBATCH --time=0-00:00 #time you want - the format is days-hours:minutes
#SBATCH --mem=XXG #amount of memory you want
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
/path/to/thecodeyouwanttorun
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
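The --time field is laid out as days-hours:minutes, which is easy to get wrong. As a small sketch (the helper name to_slurm_time is hypothetical, it is not a SLURM command), a run time in minutes can be converted to that layout like this:

```shell
#!/bin/bash
# Hypothetical helper: convert a number of minutes into the
# days-hours:minutes layout used by "#SBATCH --time=".
to_slurm_time() {
    local total_min=$1
    local days=$(( total_min / 1440 ))
    local hours=$(( (total_min % 1440) / 60 ))
    local mins=$(( total_min % 60 ))
    printf '%d-%02d:%02d\n' "$days" "$hours" "$mins"
}

to_slurm_time 45     # 45 minutes
to_slurm_time 1800   # 30 hours = 1 day and 6 hours
```

So 45 minutes is written 0-00:45 and 30 hours is written 1-06:00.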
Submit SLURM and check on status
sbatch SLURM_name #will submit job to queue
squeue -u your_user_name #will give you the job's status
scancel job_ID_number #will cancel the job
Once the job starts running, a SLURM out-file will be created in the format of: slurm-job_ID_number.out. This out-file gives you information on what step your job is on and whether or not it completed successfully. If successful, it will have “Job finished with exit code 0”. If it doesn’t say that, it may give you an error message, which may or may not be helpful.
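When you have a lot of out-files, checking them by eye gets tedious. Here is a minimal sketch of checking them in a loop, assuming your SLURM ends with the same "Job finished with exit code" echo line as the example above. The helper name job_succeeded and the two out-files are made up for the demo.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the out-file reports exit code 0.
job_succeeded() {
    grep -q "Job finished with exit code 0 at:" "$1"
}

# Demo with two made-up out-files in a temporary directory.
demo_dir=$(mktemp -d)
echo "Job finished with exit code 0 at: Mon Jan  7 10:00:00 MST 2019" > "$demo_dir/slurm-111.out"
echo "Job finished with exit code 1 at: Mon Jan  7 10:05:00 MST 2019" > "$demo_dir/slurm-222.out"

for out_file in "$demo_dir"/slurm-*.out; do
    if job_succeeded "$out_file"; then
        echo "$(basename "$out_file"): OK"
    else
        echo "$(basename "$out_file"): check this job"
    fi
done
```

Keep in mind that $? in the echo line only reports the exit code of the last command that ran before it.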
Getting information on the job after it has run
sacct -j job_ID_number --format=JobID,JobName,MaxRSS,Elapsed
#Gives JobID (kind of redundant), the name of the job, the memory and the time it took.
Convert BCL to Fastq - not allowing barcode mismatches
If working with BCL files you must first change them to fastq format. Genome Quebec will give you your files in fastq format. However, they allow one barcode mismatch. When you need 100% confidence that sequences are assigned to the right population, you will need to allow no mismatches. You may be able to ask Genome Quebec to do this on their end, but I didn’t know this until after they had done the conversion, so I got the BCL files from them.
Manual: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2_guide_15051736_v2.pdf
tiles: these are the tiles that your data is on (because there will likely be other sequence info)
sample-sheet: need to get this from Genome Quebec. Just info about your sequences
-r 32 -p 32 -w 32: reading, processing and writing thread count. I set them all to 32.
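use-bases-mask Y151,I6n2,Y151: this is specific to the sequencing run - the Y151 is because the reads are PE 150, and I6n2 means the first 6 bases of the index read are used as the barcode and the last 2 are ignored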
Ex. bcl2fastq.1
#the run folder name (171207_E00434_0072_AHGNNHCCXY_4467HS23A) is an identifier - change!
#change --tiles to what tile you are converting - I had s_1 to s_4
#change --sample-sheet depending on tile - there is a SampleSheet for each tile
module load bcl2fastq2/2.20 && \
bcl2fastq \
--runfolder-dir /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A \
--output-dir /path/to/00_raw_data \
--tiles s_1 \
--sample-sheet /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A/SampleSheet.1.csv \
--create-fastq-for-index-reads \
-r 32 -p 32 -w 32 \
--barcode-mismatches 0 --use-bases-mask Y151,I6n2,Y151
#!/bin/bash
# ---------------------------------------------------------------------
# Slurm for bcl2fastq for L001
# ---------------------------------------------------------------------
#SBATCH --job-name=bcl2fastq.1
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:45
#SBATCH --mem=20G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
/path/to/bcl2fastq.1
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Concatenate files together - if sites were run on two or more lanes on the flow cells
My sites were split over two lanes on the flow cells, so when I converted them from BCL to fastq I had four files for each site (R1 and R2 for each lane).
cat XXPool_L001_R1.fq.gz XXPool_L002_R1.fq.gz > XXPool_R1.fq.gz
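Gzipped files can be joined with cat without unzipping them first, and the result is still a valid gzip file. The sketch below shows this on toy data for both read directions. The file names follow the pattern above, but the reads are made up and everything is written to a temporary directory.

```shell
#!/bin/bash
# Toy data: one read per lane and read direction, in a temporary directory.
work=$(mktemp -d)
for read in R1 R2; do
    printf '@read1\nACGT\n+\nIIII\n' | gzip > "$work/XXPool_L001_${read}.fq.gz"
    printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$work/XXPool_L002_${read}.fq.gz"
done

# Concatenate the two lanes for each read direction, then test the result.
for read in R1 R2; do
    cat "$work/XXPool_L001_${read}.fq.gz" "$work/XXPool_L002_${read}.fq.gz" \
        > "$work/XXPool_${read}.fq.gz"
    gzip -t "$work/XXPool_${read}.fq.gz" && echo "XXPool_${read}.fq.gz is a valid gzip file"
done

# A fastq record is four lines, so reads = lines / 4.
merged_lines=$(gzip -dc "$work/XXPool_R1.fq.gz" | wc -l)
echo "R1 reads after merging: $((merged_lines / 4))"
```

Always put the lanes in the same order for R1 and R2, otherwise the two files will no longer line up read for read.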
FastQC - check the quality of your sequencing reads
FastQC will give you a report on the quality of your sequencing reads. https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf is a pretty good tutorial on how to interpret the reports.
The code below will loop through every .fq.gz file in the directory you tell it to look in and make a report for each.
#!/bin/bash
# ---------------------------------------------------------------------
# FastQC for pool sites
# ---------------------------------------------------------------------
#SBATCH --job-name=fastqc
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-15:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load fastqc/0.11.5
for i in /path/to/fastqfiles/*fq.gz
do
  fastqc -o /path/to/where/you/want/reports $i
done
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Trimmomatic - cleaning and filtering low-quality reads and removing adaptors
The available adaptor file doesn’t have all of the adaptors in it. Please contact me if you would like the file we created. We included all of the adaptors used in our library preparation.
phred: -phred33 is used here; it can be -phred64. Depends on your sequencing.
threads: change to the number of threads you have available on your cluster. We used 16, half of a node, but I think we could have used more. Just make sure to match this to what you are asking for in your SLURM.
ILLUMINACLIP: path to the adaptor file you made. 2: seed mismatches. 30: how accurate the match between the two adaptor-ligated reads must be. 10: SimpleClip threshold (I have a doc that goes into more detail if interested).
CROP: we were having an issue in some of our reads where Trimmomatic just wasn’t detecting the repetitive sequences in the last 15 nucleotides. Hence the hard crop (CROP:135) to trim the last 15 nucleotides off each read.
LEADING: trim the leading nucleotides if they fall under Q5
TRAILING: trim the trailing nucleotides if they fall under Q5
SLIDINGWINDOW: 5:20 - look at 5 bases at a time and trim if the average Q is less than 20
MINLEN: only keep reads that are minimally 100 bp
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -phred33 -threads 16 -trimlog logfile \
/path/to/fastqfiles/XXPool_R1.fq.gz /path/to/fastqfiles/XXPool_R2.fq.gz \
XX_R1_P_qtrim.fq XX_R1_U_qtrim.fq XX_R2_P_qtrim.fq XX_R2_U_qtrim.fq \
ILLUMINACLIP:/path/to/Adaptors/TruSeq3-PE-all.fa:2:30:10 CROP:135 LEADING:5 TRAILING:5 SLIDINGWINDOW:5:20 MINLEN:100
#!/bin/bash
# ---------------------------------------------------------------------
# Trim cat files
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_trim
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-20:00
#SBATCH --mem=15G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load trimmomatic/0.36
/path/to/trimcode/trimcode
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Post-trim FastQC - check the quality of your sequencing reads
Same as above but for the post-trim files. You should see the quality go up and the number of sequences go down. Look for over-represented k-mers; there shouldn’t be any!
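FastQC reports the number of sequences, but you can also count them straight from the files. Below is a minimal sketch, assuming gzipped fastq files with the usual four lines per read. The helper name count_reads and the toy files are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: number of reads in a gzipped fastq file
# (a fastq record is four lines).
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Toy files: four reads before trimming, three after.
work=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n' \
    | gzip > "$work/pre_trim.fq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n' \
    | gzip > "$work/post_trim.fq.gz"

before=$(count_reads "$work/pre_trim.fq.gz")
after=$(count_reads "$work/post_trim.fq.gz")
echo "kept $after of $before reads ($(( 100 * after / before ))%)"
```

The percentage is rounded down because the shell only does whole-number arithmetic.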
DeconSeq - removing non-snail contaminants from the sequences
Full disclaimer: I made the databases a variety of ways, as I found out that one way wouldn’t work for every organism.
Step 1.1: How I created the databases for Archaea and Algae (Charophyceae, Chlorophyta, Cryptophyta, Eustigmatophyceae and Klebsormidiophyceae)
Create a database from NCBI of the sequences you would like to remove. You need to get the GI list from NCBI (as per http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html or https://www.biostars.org/p/6528/).
You will need to install the newest version of NCBI BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
I have heard that moving forward NCBI will be switching to using taxid to create databases rather than GI lists, but at the time of writing GI lists were still being used.
In this example, I am using Archaea. Key thing is that the nt database and your GI list must exist in the same directory. It will be messy, which is why I put the “X” in front of the files I was making so that all of them were at the bottom.
/path/to/ncbi_directory/ncbi-blast-2.7.1+/bin/blastdb_aliastool -db nt -gilist Archaea.gi \
-dbtype nucl -out X_nt_archaea -title "database for Archaea"
I just ran this in Cedar without a SLURM. It takes ~30 seconds to a minute and creates a .nal file. If you then put “X_nt_archaea” as your database, it will mask everything in the database except those sequences.
Once you’ve created the .nal file, you need to convert the database to a .fasta file. It’s small so I ran it in the SLURM.
#!/bin/bash
# ---------------------------------------------------------------------
# NCBI to fasta
# ---------------------------------------------------------------------
#SBATCH --job-name=Charo_fasta
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-05:00
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
/path/to/ncbi/ncbi-blast-2.7.1+/bin/blastdbcmd -entry all -db X_nt_archaea -out Archaea.fasta
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Step 1.2: Creating the database for Threespine stickleback and Human
Threespine stickleback downloaded from: https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017)
Human genome was downloaded from: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch.
I used the tutorial provided by DeconSeq to access the human genome.
#Download sequence data
for i in {1..22} X Y MT; do wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_chr$i.fa.gz; done
#Extracting and joining data
for i in {1..22} X Y MT; do gzip -dvc hs_ref_GRCh38.p12_chr$i.fa.gz >> hs_ref_GRCh38_p12.fa; rm hs_ref_GRCh38.p12_chr$i.fa.gz; done
Step 1.3: Creation of the Bacterial database
I downloaded all of the full bacterial genomes off of NCBI (I think there were about 10,000 of them) and all assembly levels for bacteria that have been found in the thermal springs onto a computer (many Gb). I then used Globus to put them on to Cedar. https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria
You will need to unzip any files that are zipped (including your query sequences) because DeconSeq can’t use zipped files.
Can use:
for file in *.gz #loop through all files with this file extension
do
  gunzip "$file" #unzip them
done #when it's finished, stop.
This had to be done for all of the bacterial genomes. It took 10+ hours, so I would consider submitting it as a SLURM.
Then I concatenated the bacterial genomes together, using:
#!/bin/bash
# ---------------------------------------------------------------------
# cat FG NCBI database
# ---------------------------------------------------------------------
#SBATCH --job-name=cat_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
find . -name '*.fna' -print0 | xargs -0 cat > bacteria_genomes.fa
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Step 2: Splitting sequences by long repeats of ambiguous base N (this is from the DeconSeq manual)
Below is all for the human genome because that is what is listed in the manual, but I did this for all of the databases.
cat hs_ref_GRCh38_p12.fa | perl -p -e 's/N\n/N/' |
perl -p -e 's/^N+//;s/N+$//;s/N{200,}/\n>split\n/' > hs_ref_GRCh38_p12_split.fa; rm hs_ref_GRCh38_p12.fa
Step 3: Filtering databases
Need to download and install PRINSEQ - can be found at https://sourceforge.net/projects/prinseq/files/
This step needs a SLURM because it will run out of memory otherwise.
#!/bin/bash
# ---------------------------------------------------------------------
# Filtering sequences - PRINSEQ
# ---------------------------------------------------------------------
#SBATCH --job-name=human_prinseq
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-0:10
#SBATCH --mem=10G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
perl /path/to/prinseq-lite-0.20.4/prinseq-lite.pl -log -verbose \
-fasta hs_ref_GRCh38_p12_split.fa -min_len 200 -ns_max_p 10 -derep 12345 \
-out_good hs_ref_GRCh38_p12_split_prinseq -seq_id hs_ref_GRCh38_p12_ \
-rm_header -out_bad null
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Step 4: Index the databases
For the bacterial database (because it is over 200Gb) you will need to first split it into manageable chunks before BWA can use them. Can use fasta splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/).
The files need to be under 3Gb each, so split it into as many chunks as you need. I did 100.
#!/bin/bash
# ---------------------------------------------------------------------
# FASTA splitter
# ---------------------------------------------------------------------
#SBATCH --job-name=fasta_split_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=100G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
perl /path/to/fasta-splitter.pl --n-parts 100 Bacteria_FG_split_prinseq.fasta
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
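To work out the fewest parts that keep every chunk under the size limit, divide the database size by the limit and round up. A small sketch using whole-number Gb (the helper name parts_needed is made up):

```shell
#!/bin/bash
# Hypothetical helper: smallest number of parts so that each part is no
# bigger than max_gb (integer division, rounded up).
parts_needed() {
    local total_gb=$1
    local max_gb=$2
    echo $(( (total_gb + max_gb - 1) / max_gb ))
}

parts_needed 200 3   # a 200 Gb database with a 3 Gb limit
```

For 200 Gb and a 3 Gb limit this gives 67 parts of just under 3 Gb each. I used 100, which gives chunks of about 2 Gb and leaves some room to spare.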
The next step is to index the databases. You must use the BWA provided in the DeconSeq package! The newer BWA reads the files incorrectly for this and will only produce 5 of the 8 outfiles necessary. Cedar is a 64 bit Linux system, so use bwa64. I know that the examples above have been using the human database (sorry for the lack of consistency) but the bacterial one needed some extra things to make it run. I didn’t want to run 100 SLURMs so this is a batch job!
Modified from script kindly provided by Dr. Stefan Dennenmoser
#!/bin/bash
# ---------------------------------------------------------------------
# BWA
# ---------------------------------------------------------------------
#SBATCH --job-name=index_bac
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=0-10:00
#SBATCH --mem-per-cpu=20G
#SBATCH --array=1-100
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
cd /path/to/Bacterial_Database
filename=`ls -1 *.fasta* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`
filename2=${filename::-6} #the filename without the .fasta part (-6 letters)
/path/to/deconseq-standalone-0.4.3/bwa64 index -p $filename2 -a bwtsw $filename > bwa.log 2>&1
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
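The trick in the batch job is that each of the 100 array tasks gets its own number in SLURM_ARRAY_TASK_ID and uses it to pick one file from the sorted listing. Below is a minimal sketch you can run outside SLURM to see which file a task would get. The chunk file names are made up, the task ID is set by hand, and sed -n and ${filename%.fasta} are stand-ins that do the same job as the tail/head pipe and ${filename::-6} above.

```shell
#!/bin/bash
# Made-up chunk files in a temporary directory.
work=$(mktemp -d)
touch "$work/db.part-001.fasta" "$work/db.part-002.fasta" "$work/db.part-003.fasta"

# SLURM sets SLURM_ARRAY_TASK_ID inside an array job; it is set by hand here.
SLURM_ARRAY_TASK_ID=2

cd "$work" || exit 1
# Take the n-th file of the sorted listing, where n is the task ID.
filename=$(ls -1 *.fasta | sed -n "${SLURM_ARRAY_TASK_ID}p")
# Strip the .fasta ending to get the index prefix.
prefix=${filename%.fasta}
echo "task $SLURM_ARRAY_TASK_ID indexes $filename with prefix $prefix"
```

Stripping by suffix has the small advantage that it still works if the file ending is not exactly six characters long.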
Step 5: Configure the DeconSeq file
You will have to go into the installed DeconSeq directory and set up the configuration file DeconSeqConfig.pm. You will just need to change the database directory, the out directory and which files are accessed in the “use constant DBS” section.
Step 6: Split the query sequences
DeconSeq is unable to handle big datasets (e.g. 50+ Gb files that I was using)
Will need to use fastq splitter (http://kirill-kryukov.com/study/tools/fastq-splitter/)
I don’t think I ran this in a SLURM and just used the console. If it fails... put it in a SLURM.
perl /path/to/fastq-splitter.pl --n-parts 50 --check XX_R1.fq
Step 7: ACTUALLY RUNNING DECONSEQ
For DeconSeq you need to choose the identity (-i), which is the percent match between your query sequence and the database, and the coverage (-c), which is the percentage of the query sequence that aligns.
I went with the parameters they used in their paper and what I had seen other people do, which was 94% identity (-i 94) and 90% to 95% coverage (-c 90 or -c 95).
Submit this as a batch job.
#!/bin/bash
# ---------------------------------------------------------------------
# Deconseq
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_decon
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=3-00:00
#SBATCH --mem-per-cpu=7500MB
#SBATCH --array=1-50
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
#put all of the split files from one pop and read direction in its own directory
cd /path/to/trimmed_seq/XX_R1_split
filename=`ls -1 *.fq* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`
perl /path/to/deconseq-standalone-0.4.3/deconseq.pl -i 94 -c 90 \
-f /path/to/trimmed_seq/XX_R1_split/$filename -dbs hsref \
-out_dir /path/to/deconseq_out/XX_R1 -id $filename
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Step 8: Creating paired files again
DeconSeq was never designed for paired-end sequencing. Therefore it will process each read direction separately, and this causes sequences to be removed in one direction and not the other.
Firstly, concatenate the 50 clean files (or more or fewer, depending on how many parts you split your input files into). I also did this for the .cont files because I wanted to keep them and see how many were removed.
Then you can use the fastq-pair script found at: https://github.com/linsalrob/fastq-pair
Step 1: Clone or download - copy URL
Step 2: In Cedar or ARC or whatever cluster, type git clone followed by the URL - you should see a directory called “fastq-pair”
Step 3: gcc fastq-pair/*.c -o fastq_pair
Step 4: Should see an executable called “fastq_pair”
Running it is super simple.
path/to/fastq_pair -t VALUE path/to/file_R1.fastq path/to/file_R2.fastq
Where VALUE is roughly the number of sequences in your file. This sets the hash table size. For some reason, I couldn’t get this to run as a SLURM. It would make the outfiles (if I gave it over 50Gb) but it would run for far longer than it runs in the terminal and just never finish. I ended up running it in ARC’s console because Cedar doesn’t have enough resources allocated to its console.
It makes four output files: R1 paired and unpaired, and R2 paired and unpaired. Then you will need to zip the files back up!
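Before zipping, it is worth checking that the two paired files really do line up read for read. Below is a minimal sketch, assuming the read ID is the first space-separated field of each header line (this will not hold for older headers ending in /1 and /2). The helper name same_read_ids and the toy files are made up. It keeps all R1 IDs in memory, so try it on a subset rather than on a 50 Gb file.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both fastq files have the same
# number of reads and the same read IDs in the same order.
same_read_ids() {
    awk '
        FNR % 4 == 1 {
            if (NR == FNR) { id[FNR] = $1; n1++ }        # first file: remember IDs
            else { n2++; if (id[FNR] != $1) bad = 1 }    # second file: compare
        }
        END { exit (bad || n1 != n2) }
    ' "$1" "$2"
}

# Toy files: R2.paired matches R1, R2.broken has a different second read.
work=$(mktemp -d)
printf '@read1 1:N:0:ATCACG\nACGT\n+\nIIII\n@read2 1:N:0:ATCACG\nACGT\n+\nIIII\n' > "$work/R1.paired.fq"
printf '@read1 2:N:0:ATCACG\nTTGA\n+\nIIII\n@read2 2:N:0:ATCACG\nTTGA\n+\nIIII\n' > "$work/R2.paired.fq"
printf '@read1 2:N:0:ATCACG\nTTGA\n+\nIIII\n@read3 2:N:0:ATCACG\nTTGA\n+\nIIII\n' > "$work/R2.broken.fq"

same_read_ids "$work/R1.paired.fq" "$work/R2.paired.fq" && echo "paired files line up"
same_read_ids "$work/R1.paired.fq" "$work/R2.broken.fq" || echo "broken files do not line up"
```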
DISCOVAR de novo - assembling a reference genome
DISCOVAR de novo is very easy to use, but there are downstream challenges of not having a reference genome. As well, it was designed for 250 bp paired-end reads sequenced from one individual with a PCR-free library preparation. In my thesis you can see the impact of breaking these assumptions on the quality of the genome produced. If it is at all possible to sequence an individual with at least two insert sizes (ex. 1 kb and 5,000 kb), you will likely be able to generate a much more robust assembly. Ex. Schell et al. 2017.
You can download DISCOVAR denovo from here: https://software.broadinstitute.org/software/discovar/blog/?page_id=98. There are two ways (probably lots more, but these are the ones I use) of getting info from the web onto Cedar.
Option 1 Go to the link you want to download. Right click and “Copy Link Address”. Go to Cedar. Do below.
wget ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/latest_source_code/LATEST_VERSION.tar.gz
Option 2 Can download to your personal computer (if the file isn’t too big) and then use Globus to transfer it from your computer to Cedar. Globus is supported by Compute Canada. You will have to log in, download Globus to your computer and make your computer an endpoint.
Once you download DISCOVAR denovo, you will have to unzip it.
A little aside... Clumpify (belongs to the BBMap/BBTools package)
Before I assembled the sequences, I used Clumpify to remove PCR duplicates, because the library prep is intended to be PCR free (unlike Picard, Clumpify doesn’t need a reference genome; however, I don’t think it is as robust). This is seen to improve the quality of the assembly. I also tried BBNORM to normalize the sequencing depth at around 60x coverage. It decreased the N50 but increased the MPL1 and decreased the estimated chimera rate. In hindsight, I think I should have investigated this assembly more and maybe used it.
dedupe: The command to remove duplicates
subs=2: This means that there can be two substitutions between the compared sequences and they will still be considered duplicates.
clumpify.sh in1=/home/youraccount/path/to/trimmed.fq.gz/XX_R1_P.fq.gz \
in2=/path/to/trimmed.fq.gz/XX_R2_P.fq.gz out1=/path/to/Clumpify/XX_R1_P_nodup.fq.gz \
out2=/path/to/Clumpify/XX_R2_P_nodup.fq.gz dedupe subs=2
#!/bin/bash
# ---------------------------------------------------------------------
# Removing duplicates from XX allowing two subs
# ---------------------------------------------------------------------
#SBATCH --job-name=nodup_XX
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:30
#SBATCH --mem=150G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load intel/2016.4
module load bbmap/37.36
/home/youraccount/Clumpify/code/XX_clump
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Now that we have files with duplicates removed, back to DISCOVAR de novo
DISCOVAR de novo takes A LOT of memory.
DiscovarDeNovo READS=/path/to/fastq/files/you/want/to/use/*fq.gz \
OUT_DIR=/path/to/referencegenome \
MAX_MEM_GB=1450 NUM_THREADS=32
#!/bin/bash
# ---------------------------------------------------------------------
# Assembly of genome (using 1 site)
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_ref
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-30:00
#SBATCH --mem=1400G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
path/to/code/DDN_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
BWA (Burrows-Wheeler Aligner) - aligning sequences to the reference genome
Step 1: Index the reference
This step will create a bunch of index files using the name “reference_genome” as the name of the file
bwa index -p reference_genome /path/to/reference/reference_genome.fa
#!/bin/bash
# ---------------------------------------------------------------------
# BWA
# ---------------------------------------------------------------------
#SBATCH --job-name=ref_index
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1
module load bwa/0.7.17
/path/to/code/bwa_index_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Step 2: Align to reference
Details about parameters can be found here: http://bio-bwa.sourceforge.net/bwa.shtml
-M: mark shorter split hits as secondary (necessary for using the file in Picard downstream)
-t 16: number of threads. Match with SLURM
-R: complete read group header line
XX: need to change XX to whatever site you are currently working on.
The reference genome needs to be put in without the “.fa” because it will be using all of the indexes that are in the directory too.
bwa mem -M -t 16 -R '@RG\tID:XX\tLB:XX\tSM:XX\tPL:ILLUMINA' \
/path/to/ref/reference_genome \
path/to/XX_R1.fq.gz path/to/XX_R2.fq.gz > XX_out.sam
#!/bin/bash
# ---------------------------------------------------------------------
# sam
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_align
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-05:00
#SBATCH --mem=10G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1
module load bwa/0.7.17
/path/to/code/align_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
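Changing XX by hand for every site is an easy place to make a mistake, so the read group line can be built from the site name instead. This is a minimal sketch that only prints the bwa mem commands rather than running them. The helper name read_group and the site names S1 and S2 are made up. The \t in the read group has to reach bwa as a literal backslash and t, which is why the printf format uses \\t.

```shell
#!/bin/bash
# Hypothetical helper: build the read group header line for one site.
# The output contains a literal backslash followed by t, as bwa mem -R expects.
read_group() {
    printf '@RG\\tID:%s\\tLB:%s\\tSM:%s\\tPL:ILLUMINA' "$1" "$1" "$1"
}

# Print (not run) the alignment command for each made-up site name.
for site in S1 S2; do
    printf '%s\n' "bwa mem -M -t 16 -R '$(read_group "$site")' /path/to/ref/reference_genome ${site}_R1.fq.gz ${site}_R2.fq.gz > ${site}_out.sam"
done
```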
Samtools - sort the .sam file into a .bam by chromosome number
There are two options for sorting (by read name or by coordinate), but the file needs to be sorted by coordinate, i.e. by chromosome and position, for downstream applications. This is the default.
-@: this argument is where you set the number of threads
-T: this argument is where you set the prefix for the temporary files - change to whichever site you are currently working on
-o: write to this outfile
samtools sort -@ 16 -T XX -o XX.bam XX_out.sam
#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Sort
# ---------------------------------------------------------------------
#SBATCH --job-name=XX.bam
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/samsort_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Samtools - filter low quality reads and where only one read mapped
Details can be found at: http://www.htslib.org/doc/samtools.html
Need to remove reads where only one mate mapped, duplicates and reads with low alignment quality (below Q20). I did this in three steps: I removed reads where only one mate mapped first, then removed duplicates, and then filtered for Q20.
-@ : number of threads
-f 2: only keep reads that are flagged as mapped in a proper pair
-o: write to this outfile
samtools has good documentation online. Be advised that they updated the package in April 2018, so make sure you are looking at the info for the version you are using.
samtools view -@ 16 -f 2 -o XX_rm1mate.bam /path/to/XX.bam
#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Sort
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_rm_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/rm1mate_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
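The number after -f is a SAM flag. Each read's flag is a sum of bits, and 2 is the bit for "mapped in a proper pair", so -f 2 keeps a read only when that bit is set. Here is a small sketch of the same test done with shell arithmetic (the helper name is_proper_pair is made up):

```shell
#!/bin/bash
# Hypothetical helper: succeed if the proper-pair bit (2) is set in a SAM flag.
is_proper_pair() {
    [ $(( $1 & 2 )) -ne 0 ]
}

# 99  = 1 + 2 + 32 + 64  (paired, proper pair, mate reverse, first in pair)
# 147 = 1 + 2 + 16 + 128 (paired, proper pair, read reverse, second in pair)
# 73  = 1 + 8 + 64       (paired, mate unmapped, first in pair)
for flag in 99 147 73; do
    if is_proper_pair "$flag"; then
        echo "flag $flag: kept by -f 2"
    else
        echo "flag $flag: removed by -f 2"
    fi
done
```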
Picard - removing duplicates
Picard will mark duplicates using the MarkDuplicates function. You need to add REMOVE_DUPLICATES=TRUE to remove them. GATK suggests that you keep the marked duplicates, but PoPoolation 1 and 2 need them removed.
java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
INPUT=/path/to/XX_rm1mate.bam OUTPUT=XX_rm1mate_nodup.bam \
METRICS_FILE=XX_rm1mate_Q20_nodup.txt REMOVE_DUPLICATES=TRUE
#!/bin/bash
# ---------------------------------------------------------------------
# Picard Dup removal
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_nodup_rm1mate_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=40G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load picard/2.17.3
/path/to/code/removeduplicates_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Flagstat - get stats on your alignment
Flagstat will give you stats on the alignment - like percent mapped.
#!/bin/bash
# ---------------------------------------------------------------------
# Flagstat
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_flagstat
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-00:20
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
samtools flagstat path/to/XX_rm1mate_nodup.bam
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
Picard - Validate the .bam file before moving forward
Check that the bam file is not broken.
java -jar $EBROOTPICARD/picard.jar ValidateSamFile I=/path/to/XX_rm1mate_nodup.bam MODE=SUMMARY
#!/bin/bash# ---------------------------------------------------------------------# Validate Bam Test# ---------------------------------------------------------------------
#SBATCH --job-name=Vali#SBATCH --account=def-srogers#SBATCH --cpus-per-task=16#SBATCH --time=0-01:00#SBATCH --mem=15G
# ---------------------------------------------------------------------echo "Current working directory: `pwd`"echo "Starting run at: `date`"# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load picard/2.17.3
/path/to/code/validate_code
# ---------------------------------------------------------------------echo "Job finished with exit code $? at: `date`"# ---------------------------------------------------------------------
72
Samtools - Q20 filter
-q 20: filter out any read with a mapping quality of less than 20 (“Minimum mapping quality for an alignment to be used”)
samtools view -@ 16 -q 20 -o XX_rm1mate.bam /path/to/XX.bam
#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Q20 filter
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/Q20_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
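To get a feel for how many reads a mapping quality cutoff would keep before running the filter, the same threshold can be applied to SAM text with awk. This is only a sketch under stated assumptions: count_mapq_pass is a hypothetical helper name, the input is SAM text on stdin, and mapping quality is taken from column 5 as in the SAM format.

```shell
#!/bin/bash
# Sketch: count SAM records whose MAPQ (column 5) is at least a threshold.
# Reads SAM text on stdin; header lines (starting with "@") are ignored.
# Prints "kept<TAB>total".
count_mapq_pass() {
    awk -F '\t' -v min="$1" '
        /^@/ { next }
        { total++; if ($5 + 0 >= min + 0) kept++ }
        END { printf "%d\t%d\n", kept, total }'
}
```

It could be used as: samtools view /path/to/XX.bam | count_mapq_pass 20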
Samtools - mpileup
Information about parameters can be found at: http://www.htslib.org/doc/samtools.html
The mpileup is a file type that contains base-pair information at each chromosomal position.
In the below code you can add an argument of -f and the path to the reference genome. This will give you the reference genome info in the mpileup file. I didn’t use this because I don’t really care what my reference is. In this step all of the .bam files will be combined into one mpileup (in my case site 1 through 8). For PoPoolation 1, you need to form an mpileup for each site, which will be outlined later in this pipeline.
-B: disables BAQ (base alignment quality) re-alignment. Necessary for PoPoolation2.
-o: write to this outfile
This example has three sites (XX, YY and ZZ) being combined into one mpileup.
samtools mpileup -B XX_rm1mate_Q20_nodup.bam YY_rm1mate_Q20_nodup.bam \
  ZZ_rm1mate_Q20_nodup.bam -o ref_allpools.mpileup
#!/bin/bash
# ---------------------------------------------------------------------
# Samtools mpileup
# ---------------------------------------------------------------------
#SBATCH --job-name=tut_allpools_mpileup
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-06:00
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/mpileup_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
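A quick sanity check on the combined mpileup is to confirm that it holds as many pools as you gave it .bam files. The sketch below is not from the original pipeline: mpileup_pool_count is a made-up helper name, and it assumes the usual layout of three leading columns (contig, position, reference base) followed by three columns per .bam file (depth, bases, base qualities).

```shell
#!/bin/bash
# Sketch: report how many pools the first line of an mpileup file holds.
# Assumes 3 leading columns (contig, position, reference base) followed by
# 3 columns per BAM (depth, bases, base qualities).
mpileup_pool_count() {
    head -n 1 "$1" | awk -F '\t' '{
        if (NF < 6 || (NF - 3) % 3 != 0) exit 1   # unexpected column count
        print (NF - 3) / 3
    }'
}
```

For the three-site example, mpileup_pool_count ref_allpools.mpileup would be expected to print 3.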
PoPoolation2 - convert mpileup to msync
The msync format is the file type that PoPoolation2 needs to do all of its analysis. It is also one of the formats that Poolfstat accepts as its input. You will need to download PoPoolation2 into your home directory (https://sourceforge.net/p/popoolation2/wiki/Main/).
--min-qual 20: filter out any base with a quality of less than 20
--threads 8: the number of threads you will be using. I have this faint memory (why didn’t I type these notes as I went?) that this step goes a little squirrely if threaded higher than 8.
java -jar /home/youraccount/popoolation2_1201/mpileup2sync.jar \
  --input path/to/ref_allpools.mpileup --output ref_allpools.sync \
  --fastq-type sanger --min-qual 20 --threads 8
#!/bin/bash
# ---------------------------------------------------------------------
# mpileup2sync
# ---------------------------------------------------------------------
#SBATCH --job-name=mpileup2sync
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=8
#SBATCH --time=0-2:00
#SBATCH --mem=15G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load java/1.8.0_121
/path/to/code/mpileup2sync_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
In the slurm.out it will give you exit code 1. Check that the last chromosome in the sync file matches the mpileup file and, if it does, you’re ok.
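The check on the last chromosome can be scripted rather than done by eye. This is a minimal sketch, assuming both files are tab-delimited with the chromosome/contig name in column 1; the helper names last_contig and same_last_contig are made up for illustration.

```shell
#!/bin/bash
# Sketch: compare the contig on the last line of two tab-delimited files.

# Print the contig name (column 1) of the last line of a file.
last_contig() {
    tail -n 1 "$1" | cut -f 1
}

# Succeed (exit status 0) only if both files end on the same contig.
same_last_contig() {
    [ "$(last_contig "$1")" = "$(last_contig "$2")" ]
}
```

It could be used as: same_last_contig ref_allpools.mpileup ref_allpools.sync && echo "last chromosome matches"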
Poolfstat - pairwise Fst
I originally used PoPoolation2 to calculate pairwise FST but, for reasons given in my thesis, I think it is biased. So, I went with Poolfstat. It is an R package and it’s so fast. It took about 4 hours to bring the data into R and then under ten minutes to run the calculation.
library(poolfstat)
#From their paper, which was re-analyzing Dennenmoser et al. 2017
#- where he had four populations of n=44 of prickly sculpin
pool.data = popsync2pooldata(sync.file="./file", poolsizes=c(44,44,44,44),
    poolnames=c("FE","CR","PI","HZ"), min.rc=1, min.cov.per.pool=10,
    max.cov.per.pool=300, min.maf=0.01, noindel=TRUE)
#I used:
#min.rc=1
#min.cov.per.pool=20
#max.cov.per.pool=200
#min.maf=0.05
#noindel=TRUE
Once the data is in, you are set to run pairwise Fst.
PW_fst <- computePairwiseFSTmatrix(pool.data, method = "Anova",
    min.cov.per.pool=20, max.cov.per.pool=200, min.maf=0.05,
    output.snp.values = TRUE)
I saved the $PairwiseFSTmatrix and $NbOfSNPs components of the output as their own files because I wanted to keep them. The $PairwiseFSTmatrix file is necessary to do the below.
Visualizing pairwise Fst
To visualize the pairwise FST distance between each population, I used a Principal Coordinate Analysis. There is a really good blog post describing the difference between PCoA and PCA here: http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/.
#read in the pairwise FST generated by Poolfstat
pcoa <- read.csv("PW_fst_matrix.csv", header=FALSE)
#make it a matrix
pcoa.matrix <- data.matrix(pcoa)
#calculate the euclidean distance between the pairwise FST
euc.matrix <- dist(pcoa.matrix,'euclidean')
library(labdsv) #package used for PCoA
pco <- pco(euc.matrix, k =2) #calculate the pco for euc distance
#sum the eigenvalues (8 populations)
sumeigen=sum(pco$eig[1:8])
eig1=pco$eig[1]
eig2=pco$eig[2]
perc_eig1=eig1/sumeigen #proportion explained by axis 1
perc_eig2=eig2/sumeigen #proportion explained by axis 2
plot(pco) #will give you a not pretty figure
#the data is stored as character so this and below deals with that.
pco.ggplot<-data.frame(cbind(c("Pop1", "Pop2", ..., "Pop8"),
    as.numeric(pco$points[,1]), as.numeric(pco$points[,2])))
pco.ggplot$X2<-as.numeric(as.character(pco.ggplot$X2))
pco.ggplot$X3<-as.numeric(as.character(pco.ggplot$X3))
colnames(pco.ggplot) <- c("site", "PCoA1", "PCoA2")
library(ggrepel)
library(RColorBrewer)
library(ggplot2)
tiff('PCoA', units="in", width=10, height=5, res=300)
ggplot(pco.ggplot, aes(x=PCoA1, y=PCoA2)) +
  geom_point(colour="chartreuse4") +
  geom_point(data=pco.ggplot[c(3, 4, 7), ], aes(x=PCoA1, y=PCoA2), colour="purple4") +
  geom_label_repel(aes(label = site), size = 3, hjust = 0, nudge_x = 0.003,
    nudge_y = - 0.00, colour="chartreuse4", show.legend = FALSE) +
  theme(legend.position="none") +
  geom_label_repel(data=pco.ggplot[c(3, 4, 7), ],
    aes(label = site, x=PCoA1, y=PCoA2), colour="purple4",
    size = 3, hjust = 0, nudge_x = 0.003, nudge_y = - 0.00, show.legend = FALSE) +
  theme_bw() +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"),
    panel.border = element_blank(), panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(), axis.line = element_line(colour = "black")) +
  labs(x = "PCoA 1 (XX.XX%)", y = "PCoA 2 (XX.XX%)") +
  scale_x_continuous(breaks=c(-0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0),
    labels=waiver()) +
  coord_fixed()
dev.off()
#I am going to be honest here with this ggplot code. It works.
#It ain't pretty and I just kept adding to it until it did a thing and now I am afraid to touch it.
PoPoolation1 - Nucleotide Diversity
The first step here is to make an mpileup for each of your sites, instead of one with all of them. For example, in the above, you would just specify site XX.
--fastq-type sanger: even though our data is Illumina, we had to set it as sanger. This is because the quality scores are Phred+33, not Phred+64.
--pool-size: this should equal the number of individuals x 2 (for diploids) in the site. I have also read that it is ok to put the number of individuals, and that it makes so little difference that they are considering taking that parameter out.
--min-count: the minimum minor allele count for that site
--min-coverage: the minimum coverage of a SNP for it to be used in analysis
--max-coverage: the maximum coverage of a SNP for it to be used in analysis
--window-size: size of window
perl /home/youraccount/popoolation_1.2.2/Variance-sliding.pl --measure pi \
  --input /path/to/XX_rm1mate_Q20_nodup.mpileup --output XX_pi.file \
  --snp-output XX_pi.snps --fastq-type sanger --pool-size 40 --min-count 4 \
  --min-coverage 20 --max-coverage 200 --window-size 250 --step-size 250
This will give you a .file and a .snps file as the outfiles. The .file can be loaded into R and used to calculate mean nucleotide diversity and standard deviation.
XX_w250_pi <- read.delim("./XX_w250_pi.file", header = FALSE, sep = "\t", dec = ".")
colnames(XX_w250_pi) <- c("chr", "position", "Num.of.SNPs", "frac.of.cov", "pi")
#convert the pi column (not the whole data frame) to numeric
XX_w250_pi$pi<-as.numeric(as.character(XX_w250_pi$pi))
mean(XX_w250_pi$pi, na.rm=TRUE)
sd(XX_w250_pi$pi, na.rm=TRUE)
pi_matrix <- matrix(c("Pop1", "Pop2", ..., "Pop8", pi_1, pi_2, ..., pi_8), nrow = 8, ncol = 2)
colnames(pi_matrix) <- c("Site", "NucleotideDiversity")
pi_DF <- as.data.frame(pi_matrix)
pi_DF$NucleotideDiversity <- as.numeric(as.character(pi_DF$NucleotideDiversity))
library(ggplot2)
tiff('pi_figure.tiff', units="in", width=5, height=5, res=300)
ggplot(data=pi_DF, aes(x=Site, y=NucleotideDiversity)) +
  geom_point(colour="purple4", size = 3) +
  labs(y= "Nucleotide Diversity") +
  geom_point(data=pi_DF[c(1, 2, 3, 4, 5), ], aes(x=Site, y=NucleotideDiversity),
    colour="chartreuse4", size = 3) +
  theme_bw() +
  ylim(0, 0.006) +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"),
    panel.border = element_blank(), panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
dev.off()
#Same as above regarding the ggplot.
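If you would rather get the mean and standard deviation of nucleotide diversity without loading the windows into R, the same summary can be sketched with awk. This is an illustration under stated assumptions, not the method used in the thesis: pi_mean_sd is a made-up helper name, the pi estimate is assumed to be in column 5 of the tab-delimited .file, and windows without an estimate are assumed to be reported as "na". It uses the sample standard deviation (n - 1 in the denominator).

```shell
#!/bin/bash
# Sketch: mean and sample standard deviation of the pi column (column 5)
# of a Variance-sliding.pl output file, skipping windows reported as "na".
# Prints "mean<TAB>sd"; exits non-zero if fewer than two windows are usable.
pi_mean_sd() {
    awk -F '\t' '
        $5 != "" && $5 != "na" {
            # Welford running update of the mean and sum of squared deviations
            n++
            delta = $5 - mean
            mean += delta / n
            m2 += delta * ($5 - mean)
        }
        END {
            if (n < 2) exit 1
            printf "%.6f\t%.6f\n", mean, sqrt(m2 / (n - 1))
        }' "$1"
}
```

It could be used as: pi_mean_sd XX_pi.file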
Appendix B: DNA and sequencing quality
Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).
Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post-Trimmomatic).
Appendix C: PoPoolation2 pairwise FST estimates
Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250 bp side-by-side windows, minor allele count of 8, minimum coverage of 15 and max coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
Population  J1     J2     J3     J4     J5     G1     G2     G3
J1          0      0.070  0.065  0.075  0.076  0.211  0.335  0.318
J2                 0      0.044  0.033  0.074  0.237  0.355  0.340
J3                        0      0.044  0.061  0.213  0.336  0.320
J4                               0      0.074  0.233  0.351  0.333
J5                                      0      0.205  0.333  0.318
G1                                             0      0.237  0.217
G2                                                    0      0.167
G3                                                           0