
University of Calgary

PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2019-01-07

Conservation genomics of the endangered Banff

Springs Snail (Physella johnsoni) using Pool-seq

Stanford, Brenna

Stanford, B. (2019). Conservation genomics of the endangered Banff Springs Snail (Physella

johnsoni) using Pool-seq (Unpublished master's thesis). University of Calgary, Calgary, AB.

http://hdl.handle.net/1880/109445

master thesis

University of Calgary graduate students retain copyright ownership and moral rights for their

thesis. You may use this material in any way that is permitted by the Copyright Act or through

licensing that has been assigned to the document. For uses that are not allowable under

copyright legislation or licensing, you are required to seek permission.

Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Conservation genomics of the endangered

Banff Springs Snail (Physella johnsoni) using Pool-seq

by

Brenna C.M. Stanford

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN BIOLOGICAL SCIENCES

CALGARY, ALBERTA

JANUARY 2019

© Brenna C.M. Stanford 2019


Abstract

Understanding how species persist and adapt to local habitats is a fundamental question for

species of conservation concern. Located in Banff National Park, the endangered snail, Physella

johnsoni, inhabits seven highly specialized thermal springs. P. johnsoni undergo yearly

population bottlenecks with minimal to no dispersal among springs. The consequences of these

processes on genetic population structure are unknown. To investigate effects of habitat and life

history on P. johnsoni’s genome and to test the hypothesis of a single panmictic population, I

collected 20 to 40 snails/population for P. johnsoni and a closely related snail, P. gyrina, in

adjacent, non-thermal water. Using whole genome pooled-sequencing, millions of single

nucleotide polymorphisms were captured. These genetic variants resolved significant genetic

divergence between P. johnsoni and P. gyrina. In addition, I detected distinct genetic clusters

and reduced nucleotide diversity within each spring, indicative of strong micro-geographical

population structure and suggestive of a role for genetic drift. These results suggest that P.

johnsoni from each spring represent a distinct genetic unit, which has conservation implications

for the designation of designatable unit status under COSEWIC, and where mixing of snails may

reduce the consequences of genetic drift.


Acknowledgments

To my fantastic supervisor, Sean, I cannot begin to thank you enough for your guidance,

patience, and humour. I am so excited for the opportunity to keep working with you. I am

incredibly grateful for the wonderful group of people you have brought into this lab and for the

environment that you foster. To Danielle, James, Jessy, Jori, Sara, Tegan, and Teresa, you are

truly amazing. As scientists, leaders, and people I am so fortunate to call friends. Thank you for

putting up with my distractions, tears and countless questions. Thank you for the many laughs,

deep conversations, hugs and coffee. This work would not be anywhere close to this point

without all of your scientific knowledge and support.

To my committee members, Dwayne Lepitzki and Jana Vamosi – a huge thank you for

everything you’ve done to support me in this work! A special thank you, Dwayne, for braving

the cold, the snow and the heat and mosquitos, to make sure I not only lived through but

thoroughly enjoyed my field season.

To Parks Canada, specifically Mark Taylor, thank you so much for bringing me onto this

project. It has been an absolute pleasure working with you.

To my family, where do I even begin? Thank you for always believing in me, knowing

when to step in, and when to let me find my own way. You challenge me, support me and make

me laugh so hard. I will never be able to tell you how grateful I am for everything you’ve done

and continue to do for me. And last but not least, thank you, Peter. You have been so incredibly

understanding, supportive and I love you so very much.


Table of Contents

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 GENERAL INTRODUCTION
  1.1 INTRODUCTION
  1.2 STUDY SYSTEM
  1.3 OBJECTIVES
CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL
  2.1 INTRODUCTION
  2.2 METHODS
    2.2.1 Sampling
    2.2.2 DNA extraction
    2.2.3 DNA quantification and quality check
    2.2.4 Constructing DNA pools for Pool-seq
    2.2.5 DNA sequencing
    2.2.6 Genomic analysis
    2.2.7 Pairwise FST
    2.2.8 Nucleotide diversity
  2.3 RESULTS
    2.3.1 DNA extraction, quantification and quality
    2.3.2 DNA sequencing and pre-processing
    2.3.3 Pairwise FST
    2.3.4 Nucleotide diversity
  2.4 DISCUSSION
    2.4.1 Population structure and nucleotide diversity between P. johnsoni and P. gyrina populations
    2.4.2 Population structure and nucleotide diversity within P. johnsoni and P. gyrina populations
    2.4.3 Broader implications and conservation recommendations
    2.4.4 The utility of Pool-seq in conservation
    2.4.5 Caveats
    2.4.6 Conclusions
CHAPTER 3 GENERAL CONCLUSIONS
REFERENCES
APPENDIX A: GENOMIC ANALYSIS PIPELINE
APPENDIX B: DNA AND SEQUENCING QUALITY
APPENDIX C: POPOOLATION2 PAIRWISE FST ESTIMATES


List of Tables

Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250bp side by side windows, minor allele count of 8, minimum coverage of 15 and max coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.


List of Figures

Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans. Namely for resolving taxonomic ambiguity, assigning DUs/ESUs and for characterizing the population genomic consequences of increasing threats.

Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, where some is randomly lost with the reduction in population size. This decrease in genetic variation is observed even when population numbers increase.

Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) of each individual of the population is combined and the individual is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations in sequencing and in analysis.

Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).

Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks till August 2000 and then once every four weeks till September 2016 when population counts were ended. From April to September 2017 and September 2018 the counts were resumed. Original springs include J1, J2, J3, J4 and J5. The re-established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.

Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250bp side by side windows, minor allele count of 4, minimum coverage of 20 and max coverage of 200, where 60% of the window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).

Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post-Trimmomatic).


CHAPTER 1 GENERAL INTRODUCTION

1.1 INTRODUCTION

As habitats continue to change largely due to anthropogenic impacts, there is an

associated massive loss of biodiversity and an increasing number of threatened and endangered

species (Frankham 2005; Butchart et al. 2010). Habitat fragmentation, habitat loss, introduction

of invasive species, and over-exploitation can leave populations vulnerable to natural disasters,

demographic stochasticity, and environmental change (Shaffer 1981; Frankham 2005). This

biodiversity loss, at the genetic, species and ecosystem level, has incredibly harmful impacts on

human society (Cardinale et al. 2012; Hooper et al. 2012). Decreases in biodiversity lower the

productivity and services of ecosystems (e.g., wood production, carbon sequestration, soil

mineralization) and biodiversity loss can have detrimental impacts to ecosystem function similar

to other forms of environmental change (e.g., climate warming, acidification) (Cardinale et al.

2012; Hooper et al. 2012). These factors highlight the need for conservation management to

devise and implement effective and timely management plans. However, with limited time and

resources, the choice of habitats and species to conserve remains a significant factor (Martin et

al. 2012; Carwardine et al. 2018). Allocation of limited conservation resources may need to be

directed towards prioritized species and populations based on factors such as ecological function

and/or evolutionary significance (e.g. Joseph et al. 2009; Funk et al. 2012; Carwardine et al.

2018). However, determining a species or population’s priority can be extremely difficult due to

a variety of factors including assessments of extinction risk for species. Altogether, conservation

biology is faced with increasing biodiversity losses combined with intensifying data deficiencies.

Data on species and their environment are needed for effective conservation. Data

deficiencies render fundamental questions about conservation status difficult to answer. The

International Union for Conservation of Nature (IUCN) considers a species data deficient if there

is insufficient information on a species' taxonomic status, the threats or status of populations,

and/or distribution (Bland et al. 2015; Parsons 2016). Priority management and funding are

largely allocated to species of conservation concern, when these data are present, whereas data

deficient species typically receive lower priority (Morais et al. 2013; Howard & Bickford 2014;

Parsons 2016). Consequently, there can be a taxonomic bias in terms of data availability, as rare,

cryptic or non-charismatic organisms (e.g., invertebrates, which make up the majority of global


biodiversity) are largely data deficient (Howard & Bickford 2014; Régnier et al. 2015; Cowie et

al. 2017). In addition, over 60% of data deficient species are likely threatened by extinction

(Howard & Bickford 2014; Bland et al. 2015). If data deficient species are considered, up to 7%

of described species may have already been lost since 1550 compared to the 0.04% listed by the

IUCN (Régnier et al. 2015). Overall, such losses exemplify the need to develop more rapid and

appropriate tools towards more effective conservation management plans.

Genomics is one such tool that can be used in conjunction with other means to address

these challenges (Figure 1.1). It has become increasingly clear that genetic diversity is essential

for the long-term viability of species and that failure to protect it will undermine the actions to

protect biodiversity at the ecosystems and species level (Frankel 1974; Laikre 2010). In response

to environmental change, it is in part genetic variation, either de novo mutations (slow response),

or standing genetic variation (i.e., variation present in a population or species but unselected for

until the environment changes) or a combination of both that may facilitate population

persistence (e.g., Frankel 1974; Barrett et al. 2008; Morris et al. 2014). For example, a recent

study on Tasmanian devils (Sarcophilus harrisii) has demonstrated that in response to facial

tumour disease there has been rapid selection on standing genetic variation (Margres et al. 2018).

The disease has caused declines upwards of 80% and is almost always fatal, resulting in the

species being listed as endangered (McCallum 2008; Storfer et al. 2018). However, the observed

rapid adaptation has conferred greatly improved survival, where a few loci explain over 80% of

variation in female survival (Margres et al. 2018). Yet, most policies or management plans do

not prioritize or inform decisions based on genetic diversity (Laikre 2010). The maintenance and

promotion of conservation of genetic diversity requires characterization of the biological

processes that threaten this variation with the integration of techniques that facilitate regular

monitoring. Altogether, testing ecological and evolutionary predictions about genetic diversity in

association with conservation objectives will help inform policy.

One of the primary aims of conservation management is to maintain or increase

population numbers. However, hidden evolutionary genetic threats may undermine management

plans and precipitate population collapse (Figure 1.1). For example, in the iconic Isle Royale

system, moose (Alces alces) and wolf (Canis lupus) populations had been presumably stable for

~30 years before the wolf population suffered a major crash (Peterson et al. 1998). Though the

original crash was predicted to be due to lack of food, the population never recovered due to


disease and severely decreased genetic diversity (Peterson et al. 1998). In fact, the wolf

populations had such low fitness that a single out-bred male immigrant in 1997 resulted in all

individuals born having ~50% of their ancestry from him in a little over a generation compared

to the ~14% that would have arisen under equal fitness (Adams et al. 2011). Even with this

influx of genetic material, the wolf population continued to decrease (Hedrick et al. 2014) to just

two wolves by 2018: a father and daughter who are also half siblings. This is a clear example of

decreased genetic diversity playing a role in preventing wild systems from buffering environmental

change. It is an important reminder that as population numbers diminish, increased mating

between related pairs will decrease fitness of offspring, known as inbreeding depression

(Edmands 2007). However, the threshold at which inbreeding occurs, the impacts on the health

of a population or species and their ability to avoid or recover from inbreeding are not

universal (Keller & Waller 2002) and may be managed. Such management plans require

thorough pedigrees or genetic estimates of relatedness to mitigate the effects (Vilà et al. 2003).

Additionally, even if inbreeding is not occurring, the same level of decreased fitness can be

reached through, or in conjunction with, other forces (Hedrick & Kalinowski 2000). Small

populations are particularly vulnerable to genetic drift, the random fluctuation of allele frequencies that can lead to the loss of some alleles and the

fixation of others (Bouzat 2010). With fewer individuals, the probability of losing minor

alleles increases (Bouzat 2010). Populations with small effective population sizes (Ne) have

few individuals contributing genetically to the next generation, resulting in decreased genetic

variability (de Oliveira et al. 2006). Environmental, biological and human driven population

declines may cause the random loss and subsequent reduction of genetic diversity (Weber et al.

2004; Bouzat 2010). Through concerted conservation efforts, species number and Ne may

recover, as seen with the northern elephant seal (Hoelzel et al. 1993). Hunted almost to

extinction in the 19th century, the northern elephant seal (Mirounga angustirostris) has seen

tremendous recovery in its population numbers; however, its genetic diversity has remained

extremely low due to the genetic “bottleneck” it underwent (Hoelzel et al. 1993; Bouzat 2010)

(Figure 1.2). Detecting and monitoring losses of genetic diversity is essential in mitigating its

detrimental effects and robust genomic data are integral in achieving this.
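
The loss of genetic variation through a bottleneck, and its persistence after population numbers recover, can be illustrated with a small simulation. The sketch below is not part of the thesis analysis. It assumes an idealized Wright-Fisher population (random mating, non-overlapping generations, no selection, mutation or migration), and the function names, population sizes and starting allele frequencies are hypothetical values chosen only for illustration.

```python
import random


def expected_heterozygosity(freqs):
    """Expected heterozygosity at one locus: H = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def next_generation(freqs, n_diploid, rng):
    """One Wright-Fisher step: sample 2N gene copies from the current frequencies."""
    copies = rng.choices(range(len(freqs)), weights=freqs, k=2 * n_diploid)
    return [copies.count(allele) / len(copies) for allele in range(len(freqs))]


def simulate_bottleneck(sizes, start_freqs, seed=1):
    """Track heterozygosity while the population size follows `sizes`.

    sizes       -- diploid population size in each successive generation
    start_freqs -- starting allele frequencies (hypothetical values)
    Returns the final allele frequencies and the heterozygosity trajectory.
    """
    rng = random.Random(seed)
    freqs = list(start_freqs)
    trajectory = [expected_heterozygosity(freqs)]
    for n_diploid in sizes:
        freqs = next_generation(freqs, n_diploid, rng)
        trajectory.append(expected_heterozygosity(freqs))
    return freqs, trajectory


if __name__ == "__main__":
    # Hypothetical demography: large, then a 10-generation crash, then recovery.
    sizes = [500] * 5 + [5] * 10 + [500] * 20
    freqs, trajectory = simulate_bottleneck(sizes, [0.25, 0.25, 0.25, 0.25])
    print("H before bottleneck:", round(trajectory[5], 3))
    print("H after bottleneck: ", round(trajectory[15], 3))
    print("H after recovery:   ", round(trajectory[-1], 3))
    print("final allele frequencies:", [round(p, 3) for p in freqs])
```

Because no mutation or migration is modelled, an allele that drifts to a frequency of zero during the crash cannot return when the census size rebounds, which is the pattern shown schematically in Figure 1.2.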

Genetic rescue, the translocation of individuals from one population into another

(Ingvarsson 2001), has been proposed to circumvent the impacts of low genetic diversity. The

introduction of new genetic material via immigrants has been shown to rescue declining

populations and increase composite fitness (Vilà et al. 2003; Edmands 2007). However,


complications with genetic rescue can include divergence of isolated populations that are locally

adapted to their specific ecological habitat (discussed below). In these cases, attempts to increase

genetic diversity may result in disruption of locally adapted genes or gene complexes, decreasing

fitness (outbreeding depression) (Edmands 2007). Without the integration of genetic and

genomic data into conservation planning, management decisions may unintentionally be

detrimental to species recovery.

Another aspect of conservation in which genomics is helpful as a tool is the assignment

and investigation of population and species structure (Figure 1.1). Policy decisions on how best

to allocate funds and time are based in part on the distinctiveness or taxonomic standing of a

species (Isaac et al. 2004; Mace 2004; Joseph et al. 2009) and protection for certain populations

is influenced by showing that it is distinct from others (Kell et al. 2009). Management plans must

endeavour to incorporate methods that account for specific populations or genetically or

ecologically unique populations that warrant specialized management (Funk et al. 2012).

Evolutionarily significant units (ESUs) have a variety of definitions but can be described broadly

as “a population or group of populations that warrant separate management or priority for

conservation because of high genetic and ecological distinctiveness” (Funk et al. 2012). The

Committee on the Status of Endangered Wildlife in Canada (COSEWIC) includes ESUs in its

designation of designatable units (DUs), where a population or group of populations must meet

one or more of COSEWIC’s criteria for “discreteness” and for “significance” (COSEWIC 2015).

Two criteria used by COSEWIC to determine discreteness (please see COSEWIC 2015 for full

criteria) are whether a population or group of populations has clear genetic distinctiveness and/or local

adaptive differences (COSEWIC 2015). Once a population or group of populations is

determined to be discrete, two criteria that may be used to deem it significant are whether genetic

markers illustrate a clear phylogenetic divergence and/or whether it exists in an “ecological

setting unusual or unique to the species, such that it is likely or known to have given rise to local

adaptations” (COSEWIC 2015). DUs and ESUs aim to protect sub-species variability that is

often missed by traditional taxonomy (Mee et al. 2015). Losing these vital populations can

increase the total risk of extinction, as DUs and ESUs may harbor the genetic variability

necessary to evolve with environmental change (Ceballos & Ehrlich 2002; Funk et al. 2012; Mee

et al. 2015).


However, delineation of populations, DUs and/or ESUs can be difficult over time and

space, confounding accurate data for conservation. For example, newly colonized habitats may

be inhabited by populations that only recently diverged; alternatively, some populations

occupying the same habitat may look the same but be genetically cryptic for demographic

reasons (Bull et al. 2013). Genetic estimates of population differentiation can help elucidate the

underlying genetic differences among populations. Wright (1950) developed an index of genetic

differentiation based on levels of heterozygosity (FST). Populations that are not genetically

differentiated will have similar allele frequencies and comparable expected levels of

heterozygosity and, therefore, a correspondingly small FST, while populations that are

differentiated will have dissimilar allele frequencies and an increasing deficit of heterozygosity within populations relative to the total

and, therefore, a larger FST (Holsinger & Weir 2009). This measure can be useful in

distinguishing cryptic population structure and determining isolation between population pairs,

especially when based on genome-wide estimates of FST from numerous genetic loci or markers.

For example, genome-wide markers have been used to resolve phylogenetic structure in

populations long considered as panmictic from mosquitos (Wyeomyia smithii) (Emerson et al.

2010) to marine threespine stickleback (Gasterosteus aculeatus) (Morris et al. 2018). Altogether,

establishing patterns of genetic differentiation (measured by FST) is a conventional first step to

understanding the nature of how organisms are distributed in time and space.
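
The relationship between allele frequencies, heterozygosity and FST described above can be made concrete with the textbook definition FST = (HT - HS) / HT for a biallelic locus in two populations. The sketch below is a simplified illustration only: it assumes known population allele frequencies and equal weighting of the two populations, and it is not the ‘Anova’ estimator implemented in Poolfstat for Pool-seq read counts. The function names and the example frequencies are hypothetical.

```python
def expected_het(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(p1, p2):
    """Wright's FST = (HT - HS) / HT for one locus and two populations.

    p1, p2 -- frequency of the same allele in each population (assumed known).
    """
    p_bar = (p1 + p2) / 2.0
    h_t = expected_het(p_bar)  # heterozygosity of the pooled total
    if h_t == 0.0:
        return 0.0  # monomorphic locus carries no information
    h_s = (expected_het(p1) + expected_het(p2)) / 2.0  # mean within-population value
    return (h_t - h_s) / h_t


def genome_wide_fst(freq_pairs):
    """Combine loci as a ratio of sums, so that low-diversity loci do not dominate."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in freq_pairs:
        h_t = expected_het((p1 + p2) / 2.0)
        h_s = (expected_het(p1) + expected_het(p2)) / 2.0
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator if denominator > 0.0 else 0.0


if __name__ == "__main__":
    print(fst_biallelic(0.5, 0.5))  # identical frequencies
    print(fst_biallelic(0.2, 0.8))  # dissimilar frequencies
    print(fst_biallelic(1.0, 0.0))  # alternative alleles fixed
    print(genome_wide_fst([(0.5, 0.5), (0.2, 0.8), (1.0, 0.0)]))
```

Identical frequencies give an FST of zero, and populations fixed for alternative alleles give an FST of one; intermediate cases fall between these limits.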

High FST values for certain genomic regions compared to the overall genomic

background can be used in uncovering putatively adaptive differences among populations. The

hypothesis is that locus-specific divergent or directional selection will maintain genetic

differentiation among populations at the selected locus relative to other non-selected loci (Holsinger &

Weir 2009; Whitlock & Lotterhos 2015). However, candidate genes uncovered in

this way must be treated as a first step toward testing predictions of their potential adaptive

role, because genetic drift, especially in populations with small Ne, can also cause the fixation of alleles,

resulting in the same genetic pattern (Holsinger & Weir 2009). Consideration must also be given

to the influence of population structure and demography and the possibility of missing or incorrect

environmental data (Hoban et al. 2016). Measures of pairwise FST alone are unable to distinguish the

nature of the selective force causing variation in allele frequency (Lotterhos & Whitlock 2015). It is

necessary to consider any genetic variation uncovered in the context of what is known for the species

to develop a mechanistic understanding of the impact of the variation at all biological levels (Dalziel

et al. 2009). With these factors in mind it is possible to design experiments to capture the genetic


basis behind clear phenotypic and adaptive differences (Rogers & Bernatchez 2007; Dennenmoser et

al. 2017). The relevance for conservation management is that such genomic approaches can highlight

the presence of adaptive variation that may be contributing to local adaptation of certain at-risk

populations and help characterize their evolutionary significance (Funk et al. 2016).

Though there have been many papers reviewing and promoting the integration of

genomics in conservation (Allendorf et al. 2010; Ouborg et al. 2010; Funk et al. 2012), it has

been largely confined to academia with very few concrete examples of it impacting management

decisions (Shafer et al. 2015). This may largely be due to the cost and level of expertise

necessary to generate and process genomic datasets (Shafer et al. 2015), but reflection on the

integration of genomics in conservation is appropriate, especially on how to make genomic tools

more accessible.

One of the main considerations in estimating genetic variation is the choice of

genetic marker. The genetic markers traditionally used in conservation are limited because they are few in

number and are not distributed throughout the genome (Ouborg et al. 2010). With the

decreasing cost of Next-Generation sequencing, the capturing of thousands to millions of

genome-wide single nucleotide polymorphisms (SNPs) is now possible. The sheer number of

polymorphic loci has facilitated more accurate estimates of genetic variation and the detection of

finer-scale population structure (Emerson et al. 2010; Shafer et al. 2015; Morris et al. 2018).

Because they are distributed throughout the genome, SNPs can be used to detect both neutral and

potentially adaptive genetic variation, helping resolve taxonomic ambiguity and assign ESUs

(Funk et al. 2012; Shafer et al. 2015). For genomics to be integrated efficiently and for it to be

applied to real conservation dilemmas, conservation management practitioners must push for its

use and collaborate with genomic researchers (Shafer et al. 2015), with continuous

communication between the two (Lundmark et al. 2019).

Next-Generation sequencing encompasses many high-throughput sequencing methods for

the capturing of polymorphisms. Some commonly used methods include reduced representation

RADseq (Baird et al. 2008) and ddRADseq (Peterson et al. 2012), where the DNA is treated

with one or two restriction enzymes that cleave the DNA before sequencing, returning small

regions throughout the genome. SNP-Chips comprise another method whereby small sequences

are physically bound to a glass slide and hybridize with DNA to capture known SNPs (Lien et al.

2011). However, while these methods still generate thousands of markers, they do not capture


the entire genome. In my project, I used a technique called Pool-sequencing (Futschik &

Schlötterer 2010). Though whole genome sequencing for individuals has decreased substantially

in price, it is not financially viable to sequence the number of individuals necessary to answer

population level questions (Futschik & Schlötterer 2010). Pool-seq involves pooling of equal

amounts of DNA from multiple individuals per population and sequencing them as if they were

one “individual” (Figure 1.3) (Futschik & Schlötterer 2010). With a sufficiently high number of

individuals and careful data pre-processing it has been shown to estimate allele frequencies more

accurately than individual sequencing (Futschik & Schlötterer 2010; Kofler et al. 2011a; Gautier

et al. 2013). Attention needs to be given when pooling individuals, so that they are each

represented equally (Schlötterer et al. 2014); however, newly developed models for estimating

FST have been shown to be robust to unequal representation, pool size and coverage depth

(Hivert et al. 2018). Due to the loss of the individual, estimates of Ne and migration are not

possible; however, the number of genome-wide SNPs captured makes Pool-seq very powerful in

detecting nucleotide diversity, population structure and potentially adaptive differences (though

these should be interpreted with caution (Anderson et al. 2014)). Overall, very few studies have

applied this method in a conservation context, so further research of this cost-effective method

for conservation is needed.
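
As a rough illustration of the Pool-seq idea described above, the following minimal sketch estimates an allele frequency at one site as the fraction of pooled reads carrying the allele. The function name and the example counts are hypothetical and are not taken from this study; real pipelines additionally account for sequencing error, pool size and coverage.

```python
def pooled_allele_frequency(allele_reads, total_reads):
    """Estimate an allele frequency as the fraction of pooled reads carrying the allele."""
    if total_reads <= 0:
        raise ValueError("site has no coverage")
    if not 0 <= allele_reads <= total_reads:
        raise ValueError("allele count must lie between 0 and the total coverage")
    return allele_reads / total_reads


# Hypothetical site: 18 of 60 reads from the pool carry the alternate allele.
freq = pooled_allele_frequency(18, 60)
print(f"estimated allele frequency: {freq:.2f}")
```

Because individuals are not barcoded, the estimate describes the pool as a whole, which is why unequal representation of individuals matters.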

One group of animals in desperate need of effective and timely management plans

is the molluscs. Though deemed to be one of the most imperiled taxa (Régnier et al. 2009; Johnson

et al. 2013; Cowie et al. 2017), only 7,276 of an estimated 70,000 to 76,000 species (Rosenberg

2014) had been assessed by the International Union for Conservation of Nature (IUCN) as of

2016 (Cowie et al. 2017). Of these, 34% (2,463 species) were deemed too data deficient for a

formal assessment (Cowie et al. 2017). These numbers actually place molluscs at a high

proportion of assessed species compared to all invertebrates, where only 1.2% have been

assessed overall (Cowie et al. 2017). However, in comparison, all mammal and bird species

recognized by the IUCN have been assessed and only ~5% were deemed data deficient (Cowie et

al. 2017). And yet, even though molluscs are deeply under-represented in terms of conservation

assessment by the IUCN, they still made up roughly 40% of the animal species listed as extinct

in the third issue of the 2016 Red List (Cowie et al. 2017). Their decline is heavily impacted by

habitat loss and degradation, with very little known about their response to toxins or chemicals in

aquatic systems (Régnier et al. 2009; Johnson et al. 2013). However, as the primary grazer in

many habitats and an important food source for many species, their loss has the potential to


cause drastic shifts in the many ecosystems they inhabit (Régnier et al. 2009; Johnson et al.

2013). Loss of integral components at the bottom of the ecosystem has large bottom-up effects,

which impact all members of the ecosystem. Not surprisingly, in terms of genomic information,

molluscs are exceedingly lacking. A clear representation of this is that there are only 23 available

reference genomes (https://www.ncbi.nlm.nih.gov/genome, txid6447[ORGN]) compared to the

522 for vertebrates hosted on NCBI (https://www.ncbi.nlm.nih.gov/genome, txid7742[ORGN]).

1.2 STUDY SYSTEM

The endangered Banff Springs Snail, Physella johnsoni, is found only in seven thermal

springs, characterized by high water temperature and hydrogen sulphide and low dissolved

oxygen and pH, in Banff National Park, Alberta, Canada (COSEWIC 2008). It is listed as

endangered under Canada’s Species At Risk Act (SARA), in part due to an incredibly small

distribution. P. johnsoni’s habitat encompasses less than 600 m² located in a total range and area

of occupancy of just 8 km², well under the 5,000 km² and 500 km², respectively, necessary to

meet endangered status (Criterion B) (COSEWIC 2014, 2018). Additionally, each of the P.

johnsoni thermal springs is seen to undergo large fluctuations in number of mature individuals

with yearly population bottlenecks causing changes up to two orders of magnitude (COSEWIC

2014, 2018). In conjunction with predictions that they are hermaphroditic and have low amounts

of gene flow between the thermal springs, there is concern of a lack of genetic diversity (Lepitzki

& Pacas 2010). They are currently managed as a unique species, with genetic analysis of a few

markers showing differentiation between them and a much more common snail, Physella gyrina

(Lepitzki 1998; Remigio et al. 2001). However, some research using the same markers suggests

that P. johnsoni and P. gyrina are synonymous with each other (Wethington & Guralnick 2004).

This taxonomic ambiguity hinders the proper management of P. johnsoni (Daugherty et al. 1990;

Mace 2004). If P. johnsoni were synonymized with P. gyrina, they would likely be re-assessed

for DU status (COSEWIC 2008), where evidence on discreteness and significance would need to

be provided. Limited numbers of genetic markers are unable to provide the resolution

necessary to distinguish P. johnsoni and P. gyrina and resolve patterns of population

structure.


1.3 OBJECTIVES

My objective in this thesis was to produce a genomic dataset for use in the conservation

management of Physella johnsoni. The genomic data and resulting analysis will provide greater

resolution of taxonomic designation and population structure than previously achieved

through morphology and single marker sequencing. It will be integrated into Parks Canada’s

management plan and be used to advise how best to manage the species to help ensure its

continued persistence.


Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans, namely for resolving taxonomic ambiguity, assigning DUs/ESUs and characterizing the population genomic consequences of increasing threats.


Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, where some is randomly lost with the reduction in population size. This decrease in genetic variation is observed even when population numbers increase.


Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) from each individual of the population are combined and the individual is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations in sequencing and in analysis.


CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL

2.1 INTRODUCTION

We are currently in the midst of massive biodiversity loss in association with

anthropogenic impact, habitat fragmentation, habitat loss, invasive species, over-exploitation and

environmental change (Shaffer 1981; Butchart et al. 2010). These factors have contributed to an

alarming increase in numbers and rates of threatened and endangered species (Régnier et al.

2009, 2015). There is a critical need for effective conservation management plans, where

conservation practitioners must have the tools to rapidly elucidate and assess population structure

and distribution, towards determining whether populations and/or a species meet criteria to be

considered priorities for conservation and to help characterize the threats faced by species at risk

(Figure 1.1) (Funk et al. 2012; Guisan et al. 2013; Shafer et al. 2015). With the integration of

genomics, practitioners have an unprecedented ability to resolve patterns of genetic diversity

within and between populations and species and to inform on these vital aspects of conservation

management (Figure 1.1) (Shafer et al. 2015). However, there are still limited examples where

conservation genomics has been shown to actually impact policy or management decisions

(Shafer et al. 2015).

Found only in Banff National Park, Alberta, Canada, with a global range of just 594.4 m²

(COSEWIC 2008), the Banff Springs Snail (Physella johnsoni) (Clench 1926) embodies the

challenges faced by conservation biologists. It became the first living mollusc to be listed by the

Committee on the Status of Endangered Wildlife in Canada (COSEWIC) as threatened in 1997,

and in 2000 was re-assessed as endangered (COSEWIC 2008). Globally, molluscs have been

determined to be one of the most imperiled taxa (Régnier et al. 2009; Johnson et al. 2013; Cowie

et al. 2017); however, the majority are unassessed for conservation (Cowie et al. 2017). P.

johnsoni belongs to the family Physidae, which is a family of about 80 species of freshwater, air-

breathing snails found widespread in the Holarctic region and into Central and Southern America

(Taylor 2003; Wethington & Lydeard 2007). Currently, ~55% of North American Physidae are

at risk, alongside the vast majority of other freshwater snails, partially because of rapid habitat

changes or loss due to human interference and/or environmental changes (Johnson et al. 2013).


Several factors contribute to the conservation risks faced by P. johnsoni. Their entire

global range consists of seven thermal springs characterized by high water temperature and

hydrogen sulphide content, and low dissolved oxygen content and pH (Grasby & Lepitzki 2002;

COSEWIC 2008) (Figure 2.1). The thermal springs are located along the Sulphur Mountain

Thrust fault, existing in three elevation groups (Grasby & Lepitzki 2002). The lowest elevation

group (~1400m) consists of four thermal springs located within a few hundred metres of each

other - the Cave (isolated from the others except for a small hole in the top), the Basin, and the

Lower and Upper Cave and Basin Springs (Figure 2.1) (Grasby & Lepitzki 2002). The middle

elevation group (~1500 m) is located about 1 km up Sulphur Mountain and consists of Lower

and Upper Middle Springs (Figure 2.1), West Cave and Gord’s Spring, though it is uncertain if

the physids currently residing in West Cave or Gord’s Spring are P. johnsoni (Grasby & Lepitzki

2002; COSEWIC 2008, 2018). The highest elevation group consists of Kidney Spring (1588 m)

(Figure 2.1) and the extirpated Upper Hot Spring (1584 m) (Grasby & Lepitzki 2002). P.

johnsoni were originally found in 11 thermal springs; however, they ceased to exist in six due to

water stoppages or to human interference (COSEWIC 2008). Upon water flow resuming in

Kidney and Upper Middle Springs, snails were re-established successfully in 2002 and 2003,

respectively, resulting in the seven current inhabited thermal springs (COSEWIC 2008).

The taxonomic designation of many physids, including P. johnsoni, is strongly debated.

Wethington & Lydeard (2007) state that there is a more than 50% over-representation of physid

species in North America. This is in part due to the classification being heavily based on

morphological traits (e.g., shell morphology and penial structure). Though P. johnsoni was

found to be significantly more globose and to have a longer spire than P. gyrina (Lepitzki 1998),

recent evidence has shown that shell morphology is very plastic in physids. One study found

phenotypic convergence of shell shape within one generation in the lab of two morphologically

distinct but geographically adjacent populations of physids (Gustafson et al. 2014). Moore et al.

(2014) found that two populations of physids predicted to be the same species due to the same

atypical morphology were more genetically divergent from each other than a morphologically

typical snail, Physella gyrina. While P. johnsoni are currently designated as a species, alternative

hypotheses suggest that up to seven different taxa, including P. johnsoni and another endangered,

endemic thermal spring physid, P. wrighti (Hotwater Physa) are synonyms of a much more

common snail, P. gyrina and that the observed morphological differences are the result of habitat

influence (Wethington & Guralnick 2004; Wethington & Lydeard 2007).


P. johnsoni individuals are seemingly restricted to around the origin of the spring (30 to

36°C) (Grasby & Lepitzki 2002). Though the cause of their distribution is unknown, higher

densities are correlated with the higher temperature and hydrogen sulphide and lower dissolved

oxygen and pH (Lepitzki & Pacas 2010). This distribution may be influenced by concentration of

their food sources, algae and bacteria (Grasby & Lepitzki 2002). P. johnsoni are presumed to be

hermaphrodites, preferring to out-cross when there are favourable environmental conditions

(Jarne et al. 2000; COSEWIC 2008). P. johnsoni’s restricted habitat and life history patterns

(discussed below) have led to concern of decreased genetic diversity. Of highest concern is that

each year the populations (whereby each thermal spring is defined as a “population”, but defined

as “sub-populations” under COSEWIC) fluctuate by two orders of magnitude (Lepitzki &

Pacas 2010; COSEWIC 2018). Some populations will decrease to fewer than 50 individuals in

the summer months and reach population highs into the thousands in the winter and spring

(Figure 2.2) (COSEWIC 2008, 2018). The cause of these population rises and declines has not

been determined, but is speculated to be in association with seasonal changes in water chemistry

(Grasby & Lepitzki 2002). Whether this impacts the snails directly or the changes are due to

abundance of the algae and bacteria (or an association of both) is unknown (Grasby & Lepitzki

2002). As the per population numbers decrease to so few individuals (Figure 2.2) (COSEWIC

2018), the genetic variation is likely reduced to the genetic diversity contained within the

surviving individuals. Even as the population numbers increase the offspring will only contain

that genetic variation, resulting in a genetic bottleneck (Bouzat 2010). In small and restricted

populations such as P. johnsoni, low frequency alleles can be randomly lost, increasing the

homozygosity of the population (Bouzat 2010). This causes a reduction of genetic diversity and

random fixation of potentially detrimental alleles by a process called genetic drift (Bouzat 2010).

It has been well documented that even low amounts of gene flow between populations can

mitigate a loss of genetic diversity (Ingvarsson 2001; Vilà et al. 2003). There is likely limited

opportunity for genetic mixing in exceedingly high spring run-off years with the transport of

individuals from Upper Cave and Lower Cave and Basin Springs into the Basin Spring (Figure

2.1) (Lepitzki & Pacas 2010). Though never confirmed in P. johnsoni, snails in other freshwater

systems have been documented to be transported by birds (Santamaría & Klaassen 2002) and

large mammals, causing significantly decreased genetic differentiation between certain

populations (Van Leeuwen et al. 2013). Marmots (Marmota caligata) and bears (Ursus arctos)

have been observed via surveillance cameras to frequent some of the thermal springs (pers. comm.


Dr. Dwayne Lepitzki); however, overall, it is predicted that there is very little dispersal and

likely little gene flow among the thermal springs (Lepitzki & Pacas 2010). Due to the intensity of the

population bottlenecks and because genetic mixing has yet to be confirmed among populations,

decreased genetic diversity is also a strong conservation concern.

Previous sequencing efforts have attempted to resolve the taxonomic ambiguity and

determine the levels of genetic differentiation between P. johnsoni and P. gyrina. However,

analysis of protein variants (allozymes) (Lepitzki 1998) or sequencing of COI and 16S mitochondrial genes

(Remigio et al. 2001; Wethington & Guralnick 2004) failed to reach a consensus. In these

studies, P. johnsoni was compared to geographically close populations of P. gyrina, including

the three used in the present study – the Cave and Basin Marsh, Five Mile Pond and Muleshoe

Pond (Figure 2.1). The Cave and Basin Marsh is located downstream of the Cave and Basin

Spring cluster and contains diluted thermal water and does not freeze (pers. obs.). Five Mile Pond

and Muleshoe Pond are lake populations, located several kilometres upstream on the Bow River

(Figure 2.1). P. johnsoni and P. gyrina were found to be genetically distinguishable at only three

of the 12 protein loci tested, with low levels of intraspecific variation restricted to a single locus

(Lepitzki 1998). However, consensus of genetic relatedness based on COI and 16S mitochondrial

gene sequences has not been reached (Remigio et al. 2001; Wethington & Guralnick 2004; Pip &

Franck 2008). P. johnsoni and P. gyrina may be genetically close as not all analyses reveal

monophyletic groups (Wethington & Guralnick 2004). This has been hypothesized to be in part

due to the young age of the species, with P. johnsoni being predicted to only have diverged 3200

to 5200 years ago when the thermal springs were formed (Grasby et al. 2003; COSEWIC 2008).

These limited genetic tools have precluded effectively testing this hypothesis.

These evolutionary factors highlight the need for genome-wide markers for resolving

whether P. johnsoni and P. gyrina warrant separate management, to resolve the micro-

geographic genetic population structure for P. johnsoni and to detect potential underlying genetic

threats. It should be noted that I will not be attempting to resolve what constitutes a species in

this study and rather focus on inter-species and intra-species patterns of genetic differentiation.

As illustrated above, the use of limited genetic markers has been unable to resolve patterns of

genetic differentiation. To address these factors hindering conservation management, I used

Pool-sequencing (Figure 1.3) (Futschik & Schlötterer 2010). This sequencing method involves

the pooling of DNA from multiple individuals per population to provide high confidence allele


frequency estimates across the entire genome (Futschik & Schlötterer 2010; Kofler et al. 2011a;

Gautier et al. 2013).

In this chapter I used genome-wide single nucleotide polymorphisms (SNPs) captured by

Pool-seq to address two conservation objectives for P. johnsoni. The first objective was to

determine whether P. johnsoni is genetically distinct from P. gyrina. I hypothesized that the

taxonomic unit previously assigned to P. johnsoni and P. gyrina by defining and/or derived traits

would be valid if the observed patterns of genomic differentiation supported their distinct status.

Whether or not P. johnsoni represents a thermal ecotype of P. gyrina or rather a distinct genetic

unit has direct bearing on their conservation status (COSEWIC 2018) and the resources allocated

to their conservation. For effective management Parks Canada must be informed if P. johnsoni

and P. gyrina are genetically distinct, as an essential component of conservation biology is

taxonomic designation. Improper classification can lead to the extinction of a species (Daugherty

et al. 1990; Mace 2004). Secondly, I used this same SNP dataset to test predictions of micro-

geographical population structure and within-population genetic diversity of P. johnsoni. While

the distribution is limited to a small geographic space, gene flow is predicted to be limited

(Lepitzki & Pacas 2010) and extensive annual bottlenecks within each of the populations

(COSEWIC 2018) are predicted to amplify the effect of genetic drift resulting in increased

population divergence. The combination of these two evolutionary processes leads to the

prediction that genetic structure may be pronounced. Alternatively, P. johnsoni may represent a

single panmictic population. The genomic data produced here will facilitate management

decisions in association with habitat threats, and whether thermal springs should be managed as a

single unit or if they each warrant separate management. Overall, an analysis of genomic

divergence of these snails is required to test these hypotheses.

2.2 METHODS

2.2.1 SAMPLING

P. johnsoni were collected from five thermal springs between January and March of 2017

in the Banff Thermal Springs of Banff National Park in Alberta, Canada: 1) Cave Spring (J1), 2)

Basin Spring (J2), 3) Lower Cave & Basin Spring (J3), 4) Upper Cave & Basin Spring (J4) and 5)

Lower Middle Spring (J5) (Figure 2.1). Individuals were also collected from Upper Middle

Spring (J6) and Kidney Spring (J7), which were not included in this study (Figure 2.1). Before


collecting P. johnsoni, census population sizes were estimated to ensure that the number of snails

sampled (n=40) did not exceed 0.5 to 3% of the spring’s current population. This condition was

met except for J1, where only 20 P. johnsoni could be collected. In addition, a second species, P.

gyrina (n=40), was collected from three locations: 1) Cave & Basin Marsh (G1) (March 2017),

2) Five Mile Pond (G2) and 3) Muleshoe Pond (G3) (July 2017) (Figure 2.1).

Snails were collected by hand at all P. johnsoni locations (J1 to J5) and at P. gyrina location G1.

Snails were collected haphazardly from eight locations within the thermal spring, with five snails

being collected at each location. Water temperature was recorded at each location. A D-dipnet

was used to collect at all other locations (G2 and G3). Samples were collected in eight batches (five

snails per batch) from different locations within each lake. Water temperature was taken once from a

representative location.

All snails were anesthetized in the field in batches of five by placing them into 5%

laboratory grade ethanol (EtOH) (Gilbertson & Wyatt 2016). The tubes were left to incubate

immersed in the water source so as to stay close to the same temperature and minimize

stress. They remained in 5% EtOH until movement ceased and they released from the tube’s

surface (observed to be 5 to 15 minutes). They were then removed from the 5% EtOH and tested

for responsiveness by scraping a hypodermic needle across the foot (Gilbertson & Wyatt 2016).

If unresponsive, they were placed on a dish made of aluminum foil and euthanized by rapid

cooling with electrical component freezing spray sprayed from under the dish (Craze & Barr

2002).

Tissue was then removed from the shells using a dissecting needle or forceps, taking care to

minimize damage to the shell. The shells were stored individually in 95% EtOH. Ten tissue

samples from each of J1 to J5, and G1 were stored individually in RNAlater®, which would

have allowed for future gene expression analysis. However, it was decided that these samples

would be better used for DNA analysis, and they were therefore extracted for DNA as explained below.

The remaining tissue samples were stored individually in 95% EtOH. For G2 and G3, all 40

tissue samples were stored individually in 95% EtOH.

All samples were transported in a cooler with ice packs. Once in the laboratory they were

stored at -20°C until extraction. All sampling procedures and research ethics were approved by


the Life and Environmental Science Animal Care Committee under protocol #LESACC AC16-0267.

2.2.2 DNA EXTRACTION

DNA was extracted from whole body tissue, following a modified OMEGA bio-tek

E.Z.N.A.® Mollusc DNA Kit protocol that included dried and diced tissue, overnight incubation

at 56 °C, three washes of the HiBind® column, and a 50 µL elution. Once DNA extraction was

complete, 8 to 10 µL was aliquoted for quantification and quality checks. Both aliquot and stock

were stored at -20°C until further use.

2.2.3 DNA QUANTIFICATION AND QUALITY CHECK

Aliquoted DNA was quantified a minimum of twice on either a Qubit® Fluorometer 2.0 or

3.0 using the Qubit™ dsDNA BR Assay Kit as per protocol. Samples were vortexed briefly and

mixed by pipetting up and down before 2 µL was mixed into 198 µL of working solution for

quantification. A subset of samples was run on a 1% agarose gel to visualize the level of shearing

that occurred. A subset of samples was tested for purity on a NanoDrop® Spectrophotometer

ND-1000 (260/230 and 260/280 ratios).

2.2.4 CONSTRUCTING DNA POOLS FOR POOL-SEQ

Pooled DNA for each population was prepared using equal amounts from individual

DNA samples. DNA quantity (ng) for each pool was chosen so that at least 1 µL of solution was

pipetted from each individual. Individual samples were briefly vortexed and pipetted up and down

20 times before the volume was added to the pool. A total of 10 µL from each pool was aliquoted for

further quantification and gel electrophoresis.
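
The pooling rule above (equal DNA mass per individual, with at least 1 µL pipetted from every sample) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the concentrations are hypothetical, and the sketch simply picks the smallest target mass that still satisfies the 1 µL minimum, which may differ from the quantities actually chosen in this study.

```python
def pooling_volumes(concentrations_ng_per_ul, min_volume_ul=1.0):
    """Return (target_ng, volumes_ul) so that every sample contributes the same DNA mass."""
    if not concentrations_ng_per_ul or min(concentrations_ng_per_ul) <= 0:
        raise ValueError("all concentrations must be positive")
    # The most concentrated sample needs the smallest volume, so it sets the target mass.
    target_ng = max(concentrations_ng_per_ul) * min_volume_ul
    volumes_ul = [target_ng / conc for conc in concentrations_ng_per_ul]
    return target_ng, volumes_ul


# Hypothetical fluorometer readings (ng/uL) for four individuals in one pool.
readings = [10.0, 50.0, 200.0, 25.0]
target, vols = pooling_volumes(readings)
for conc, vol in zip(readings, vols):
    print(f"{conc:6.1f} ng/uL -> {vol:5.1f} uL ({conc * vol:.0f} ng)")
```

In practice the required volume must also fit within the available eluate, so very dilute samples may limit the achievable target mass.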

Pooled DNA samples were quantified using the same method as the individual samples

(described above). 5 µL of each pool was run on a 1% agarose gel to test if handling was

increasing shearing. To prepare for sequencing, each pool was diluted down to a final

concentration of 3 to 6 ng/µL. The diluted pools were quantified as above.


2.2.5 DNA SEQUENCING

All pools passed concentration and quality control. Libraries were prepared using a

shotgun approach with PCR and Illumina TruSeq LT adaptors. All libraries passed quality

control. Pooled DNA libraries were sequenced on the Illumina HiSeq X™ Sequencer using

paired-end reads of 150 base pairs (bp). Each pool was sequenced over two lanes (e.g. four pools

on one lane and then the same four pools on the second lane) for a total of four lanes at the

Génome Québec Innovation Centre, Montréal, Québec, Canada.

2.2.6 GENOMIC ANALYSIS

The full annotated genomic analysis pipeline can be found in Appendix A.

Sequences were converted from BCL files to FASTQ with no barcode mismatches for

downstream processing and analyses using bcl2fastq2 v.2.20. Sequences from the two lanes for

each pool were concatenated to one file per pool per read direction. FastQC v.0.11.5 (Andrews

2010) was used to check and visualize the quality of the sequences.

Trimmomatic v.0.36 (Bolger et al. 2014) was used to remove adaptors and filter

low-quality sequences (ILLUMINACLIP 2:30:10 CROP:135 LEADING:5 TRAILING:5

SLIDINGWINDOW:5:20 MINLEN:100). Sequences were hard cropped at 135 bp due to k-mer

overrepresentation in the last 15 bp in a small number of sequences. Post-trimming, sequences

were checked for quality using FastQC v.0.11.5.
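
To make the quality-trimming settings above more concrete, the sketch below applies the same sequence of steps (crop, leading, trailing, sliding window, minimum length) to a list of Phred scores. It is a simplified illustration only: the function is hypothetical, adaptor clipping is omitted, and the sliding-window step is an approximation rather than Trimmomatic's actual implementation.

```python
def trim_read(quals, crop=135, leading=5, trailing=5, window=5, window_q=20, minlen=100):
    """Return the (start, end) slice of bases kept, or None if the read is discarded."""
    start, end = 0, min(len(quals), crop)              # crop: keep at most `crop` bases
    while start < end and quals[start] < leading:       # leading: drop low-quality 5' bases
        start += 1
    while end > start and quals[end - 1] < trailing:    # trailing: drop low-quality 3' bases
        end -= 1
    # sliding window: cut at the first window whose mean quality falls below window_q
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < window_q:
            end = i
            break
    if end - start < minlen:                            # minimum length: discard short reads
        return None
    return start, end


# Hypothetical 150 bp reads: one with three poor leading bases, one that degrades after 80 bp.
good_read = [2] * 3 + [35] * 147
poor_read = [35] * 80 + [10] * 70
print(trim_read(good_read))
print(trim_read(poor_read))
```

The first read is kept after losing its poor leading bases and the cropped tail, while the second falls below the 100 bp minimum once the low-quality stretch is cut.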

Contamination of foreign (i.e. non-snail) sequences was removed from the data with

DeconSeq v.0.4.3 (Schmieder & Edwards 2011a). Databases of potential contamination sources

were generated for Archaea and green algae (Chlorophyta, Cryptophyta, Charophyceae,

Eustigmatophyceae, and Klebsormidiophyceae) by downloading the nt database from NCBI

(ftp://ftp.ncbi.nlm.nih.gov/blast/db, accessed 24-08-2018). Using the GenInfo Identifier (GI) list

(https://www.ncbi.nlm.nih.gov/nuccore, accessed 24-08-2018) for each of the above,

blastdb_aliastool was used to create a file that masked the database so only the organisms of

interest were available. This masked database was then converted to a FASTA file using

blastdbcmd. The threespine stickleback (Gasterosteus aculeatus) genome was accessed from

https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017) and the human

(Homo sapiens) genome was accessed from


ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch

. The bacterial database was constructed from NCBI Assembly database

(https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria, accessed 06-09-2018), selecting only

“Complete Genomes” with the exception of those known to be in the thermal springs

(Aphanothece, Brevundimonas, Chloroflexus, Lyngbya, Microcoleus, Oscillatoria, Phormidium,

Porphyrobacter, Rhodobacter, Rhodopseudomonas, Rubrivivax, Spirulina, Synechocystis,

Thermomonas, and Thiothrix) (Bilyj 2011) where all information (chromosome, scaffold, and

contig) was included. The genomes were concatenated together to be prepared for use by

DeconSeq. All databases were prepared following the DeconSeq manual. Briefly, any Ns, any

sequences less than 200 bp and sequence duplicates were removed (Schmieder & Edwards

2011b). The bacterial database was split into ~2.7 Gb chunks (FASTA Splitter v.0.2.6;

http://kirill-kryukov.com/study/tools/fasta-splitter/) so that it could be indexed, and all databases

were then indexed with the Burrows-Wheeler Aligner (BWA) (Li & Durbin 2009) (provided in the

DeconSeq download). Trimmed sequences were split into ~1.1 Gb segments (FASTQ Splitter

v.0.1.2; http://kirill-kryukov.com/study/tools/fastq-splitter/), compared against the databases,

removing any sequence that had an identity (-i) of at least 94% and a coverage (-c) of at least 90%. The

output files per population were merged and paired read files were compared, removing any

sequences found in only one of the files, using fastq-pair (Edwards 2017).
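
The re-pairing step described above, in which reads left without a mate after decontamination are dropped, amounts to keeping only identifiers present in both read files. The sketch below shows that logic on in-memory dictionaries; the function and read names are hypothetical, and fastq-pair itself operates on FASTQ files with its own implementation.

```python
def keep_paired(forward, reverse):
    """Keep only reads whose identifier is present in both the forward and reverse sets."""
    shared = forward.keys() & reverse.keys()
    kept_forward = {rid: seq for rid, seq in forward.items() if rid in shared}
    kept_reverse = {rid: seq for rid, seq in reverse.items() if rid in shared}
    return kept_forward, kept_reverse


# Hypothetical reads remaining after contaminant removal (identifier -> sequence).
forward_reads = {"read1": "ACGT", "read2": "GGCA", "read4": "TTAG"}
reverse_reads = {"read1": "ACGT", "read3": "CCAT", "read4": "CTAA"}
fwd, rev = keep_paired(forward_reads, reverse_reads)
print(sorted(fwd), sorted(rev))
```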

A genome reference was subsequently assembled from the pool with the largest number

of sequences (J3) using DISCOVAR de novo (v.52488)

(https://software.broadinstitute.org/software/discovar/blog/?page_id=98) after using Clumpify to

remove PCR duplicates (Bushnell 2014). DISCOVAR de novo was run to assemble

de-duplicated sequences into flattened lines. An assembly was also attempted using all pools, but

the J3-only assembly was determined to be “better”. DISCOVAR de novo was chosen because it

does not require sequences done with two different insert sizes, unlike most assemblers.

However, it is recommended that the sequences come from a single, PCR-free library preparation with

250 bp paired-end reads. The consequences of breaking these assumptions will be stated in

Results and discussed in Caveats.

Burrows-Wheeler Aligner (BWA)-MEM (v. 0.7.17) (Li & Durbin 2009) was used to

align sequences to the assembled genome reference with the -M option, which marks shorter split

hits as secondary rather than as additional primary alignments, for compatibility with


downstream packages. Aligned sequences were sorted from a SAM file to a BAM file by

chromosome number using SAMtools (Li et al. 2009). SAMtools was also used to remove any

sequences that fell below a mapping quality of 20 and any sequences whose mate did not map.

Duplicates were removed with Picard Tools (v. 2.17.3.) (http://broadinstitute.github.io/picard/).

SAMtools’ function flagstat was used to determine summary statistics of the resulting BAM files

while Picard Tools (ValidateSamFile) was used to validate that the files were not corrupted in

any way. SAMtools was used to create individual mpileup files for each pool and an mpileup file

containing all pools, specifying the -B option, which disables base alignment quality (BAQ) computation, as required for

downstream analysis.
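
The alignment filters described above (mapping quality of at least 20 and a mapped mate) can be expressed in terms of standard SAM fields. The sketch below is an illustration only: the function and the example records are hypothetical, and it reads the FLAG bits 0x4 (read unmapped) and 0x8 (mate unmapped) and the MAPQ column directly rather than reproducing the SAMtools commands used in this study.

```python
def passes_filters(sam_line, min_mapq=20):
    """Return True if a SAM record is mapped, has a mapped mate and meets the MAPQ threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    read_unmapped = bool(flag & 0x4)   # SAM FLAG bit: this read is unmapped
    mate_unmapped = bool(flag & 0x8)   # SAM FLAG bit: the mate is unmapped
    return not read_unmapped and not mate_unmapped and mapq >= min_mapq


# Hypothetical records: a well-mapped pair member, one with an unmapped mate, one with low MAPQ.
records = [
    "read1\t99\tcontig1\t100\t60\t135M\t=\t300\t335\t*\t*",
    "read2\t73\tcontig1\t100\t60\t135M\t=\t100\t0\t*\t*",
    "read3\t99\tcontig2\t500\t7\t135M\t=\t700\t335\t*\t*",
]
for rec in records:
    print(rec.split("\t")[0], passes_filters(rec))
```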

2.2.7 PAIRWISE FST

Pairwise FST between all pools was determined using the R package, Poolfstat (Hivert et

al. 2018). First, the mpileup file containing all of the pools was converted to a sync file

using PoPoolation2 (Kofler et al. 2011b), filtering for a minimum base quality of 20. It is

important to note that to specify the Phred33 encoding, --fastq-type must be set to “sanger”

(rather than “illumina” for Phred64). Using Poolfstat the sync file was converted to a popsync

object in RStudio (v. 1.1.383) (RStudio Team 2016) with a minimum read count of one,

minimum coverage of 20, maximum coverage of 200, minimum allele frequency of 0.05 and

removing indels. Pairwise FST was calculated using the “Anova” method and the same

parameters used to import the sync file. A Euclidean distance matrix was created from the

pairwise FST matrix using the dist function in R. The pco function in the package LabDSV (v.1.8-

0) (Roberts 2012) was used to determine the eigenvalues, from which the percent explained by

eigenvectors one and two was calculated. Pairwise FST and percent eigenvectors were visualized

using ggplot2 (Wickham 2016).
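
As a rough illustration of the distance step, the sketch below computes Euclidean distances between the rows of a pairwise FST matrix, which is my understanding of what R's dist function does by default. It uses only the three P. gyrina values reported in this chapter (G1 vs. G2 = 0.498, G1 vs. G3 = 0.455, G2 vs. G3 = 0.359); the zero diagonal and the function name are assumptions, and the analysis itself was run in R on all eight pools.

```python
import math

pools = ["G1", "G2", "G3"]
# Pairwise FST among P. gyrina pools as reported in the text; diagonal assumed zero.
fst = [
    [0.000, 0.498, 0.455],
    [0.498, 0.000, 0.359],
    [0.455, 0.359, 0.000],
]


def row_distances(matrix):
    """Euclidean distance between every pair of rows of a square matrix."""
    n = len(matrix)
    return [
        [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


dist = row_distances(fst)
for name, row in zip(pools, dist):
    print(name, [round(d, 4) for d in row])
```

In this reduced example the two lake pools (G2 and G3) remain closer to each other than either is to G1, in line with the FST values themselves.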

2.2.8 NUCLEOTIDE DIVERSITY

Individual mpileups were analyzed in PoPoolation1 (Kofler et al. 2011a) to determine

within population nucleotide diversity. Nucleotide diversity was determined for SNPs with a

minimum count of two, a minimum coverage of 20, max coverage of 100, for side by side

windows of 250 bp and for a pool size equal to the diploid number of individuals in the pool. As

above, --fastq-type was set to “sanger”. Mean nucleotide diversity was calculated in RStudio

(v. 1.1.383) using all windows, including those that did not contain a SNP for the individuals of

that pool, as these represent a diversity of zero. The nucleotide diversity values were visualized using

the package ggplot2.
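
The effect of counting SNP-free windows as zero can be shown with a small sketch. The window values are invented for illustration, and representing a SNP-free window as None is my own convention here, not the PoPoolation1 output format.

```python
def mean_pi(window_values):
    """Mean nucleotide diversity over all windows; None (no SNP) counts as 0."""
    if not window_values:
        raise ValueError("no windows supplied")
    return sum(0.0 if v is None else v for v in window_values) / len(window_values)


# Hypothetical per-window values for one pool (not thesis data).
windows = [0.004, None, 0.002, None]

with_zeros = mean_pi(windows)
snp_only = mean_pi([v for v in windows if v is not None])
print(with_zeros, snp_only)
```

Averaging only over windows that contain a SNP would double the estimate in this toy case, which is why all windows were retained.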

2.3 RESULTS

2.3.1 DNA EXTRACTION, QUANTIFICATION AND QUALITY

DNA yield was variable among samples, even with tissues of similar size (range from 10

ng/µL up to greater than 200 ng/µL). Extracted DNA was determined to be free of organic

compounds and protein contamination (mean ± SD) (260/230 ratio of 2.05 ± 0.26 and 260/280

ratio of 1.95 ± 0.085). Samples exhibited a high molecular weight and limited shearing

(individual samples not shown). Pooled samples did not appear to have increased shearing from

handling throughout the process with the majority of DNA above 10 Kbp (Figure B.1).

2.3.2 DNA SEQUENCING AND PRE-PROCESSING

A total of 3,675,153,756 sequences were assigned to a population over the eight

populations with a quality score of 38 for all populations, with the exception of one of the J3 lane

positions which had a quality score of 37 (J1: 446,326,456 sequences; J2: 478,291,654

sequences; J3: 501,946,794 sequences; J4: 430,999,172 sequences; J5: 412,287,368 sequences;

G1: 451,624,748 sequences; G2: 453,963,908 sequences; G3: 499,713,656 sequences). No

populations were flagged or failed per base quality scores, though quality decreased as the read

progressed (Figure B.2). Trimmomatic filtering removed a total of 843,631,584 sequences

(22.96%) (J1: 22.65%; J2: 21.92%; J3: 22.10%; J4: 22.04%; J5: 22.43%; G1: 22.45%; G2:

25.85%; G3: 24.13%). For each population per base quality improved such that quality was

above 30 (Figure B.3). A total of 19,198,864 sequences were removed from trimmed sequences

as non-snail contamination (J1: 0.66%; J2: 0.70%; J3: 0.81%; J4: 0.73%; J5: 0.65%; G1: 0.73%;

G2: 0.58%; G3: 0.55%). The genome reference assembly produced from J3 had an N50 = 3,931

bp (~450 Mbp in 1 kb+ scaffolds and ~79 Mbp in 10 kb+ scaffolds), mean length of first read in

pair up to first error (MPL1) of 90 and an estimated chimera rate of 0.55%. The genome

reference assembly produced using all pools generated an N50 = 2,739 bp (~540 Mbp in 1 kb+

scaffolds and ~48 Mbp in 10 kb+ scaffolds), MPL1 of 77 and estimated chimera rate of 1.08%.

After filtering the decontaminated sequences (for unpaired reads, duplicates and a

minimum mapping quality of 20), I mapped a total of 2,034,681,286 sequences (72.35%, or

55.36% of initial sequences) to this assembly (J1: 251,520,504 sequences (73.34%); J2:


268,182,754 sequences (72.31%); J3: 284,998,462 sequences (73.49%); J4: 244,620,378

sequences (73.34%); J5: 235,247,357 sequences (74.04%); G1: 246,765,601 sequences

(70.98%); G2: 237,549,944 sequences (70.98%); G3: 265,796,286 sequences (70.50%)). All

BAM files passed validation.
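
The read accounting above can be re-derived from the per-pool counts. The sketch below uses only numbers reported in this section; the one assumption is that the 72.35% figure uses the trimmed and decontaminated read count as its denominator, which is inferred from the arithmetic rather than stated.

```python
# Per-pool sequence counts as reported in the text.
raw = {
    "J1": 446_326_456, "J2": 478_291_654, "J3": 501_946_794, "J4": 430_999_172,
    "J5": 412_287_368, "G1": 451_624_748, "G2": 453_963_908, "G3": 499_713_656,
}
mapped = {
    "J1": 251_520_504, "J2": 268_182_754, "J3": 284_998_462, "J4": 244_620_378,
    "J5": 235_247_357, "G1": 246_765_601, "G2": 237_549_944, "G3": 265_796_286,
}
trimmed_removed = 843_631_584      # removed by Trimmomatic
contaminant_removed = 19_198_864   # removed as non-snail contamination

total_raw = sum(raw.values())
total_mapped = sum(mapped.values())
# Assumed denominator for the 72.35% figure: reads left after both removals.
decontaminated = total_raw - trimmed_removed - contaminant_removed

pct_trimmed = 100 * trimmed_removed / total_raw
pct_mapped_of_decontaminated = 100 * total_mapped / decontaminated
pct_mapped_of_raw = 100 * total_mapped / total_raw

print(total_raw, total_mapped)
print(round(pct_trimmed, 2), round(pct_mapped_of_decontaminated, 2), round(pct_mapped_of_raw, 2))
```

The per-pool counts sum to the reported totals, and the three percentages agree with the reported 22.96%, 72.35% and 55.36% to two decimal places.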

2.3.3 PAIRWISE FST

The number of within population bi-allelic positions captured was 921,339 to 1,300,995

per each P. johnsoni pooled population and 3,053,291 to 3,736,834 per each P. gyrina pooled

population (Table 2.1). Pairwise FST between P. johnsoni populations ranged from 0.106 (J2 vs.

J4) to 0.367 (J1 vs. J4) (Table 2.2). Between P. johnsoni and P. gyrina populations pairwise FST

was 0.519 (J5 vs. G1) to 0.709 (J4 vs. G2) (Table 2.2). For P. gyrina populations, pairwise FST

ranged from 0.359 (G2 vs. G3) to 0.498 (G1 vs. G2). The PCoA plot of the pairwise FST

separated P. johnsoni and P. gyrina along the first axis and explained 76.85% of the allelic

variation shaping this genetic structure (Figure 2.3). J2, J3, and J4 clustered together (Figure

2.3). J1 and J5 also clustered together; this clustering did not reflect a short distance

between them (pairwise FST = 0.335) (Table 2.2) but rather that both were equal distances

from all other populations (Figure 2.3). G2 and G3 were loosely clustered while G1 fell out

towards the P. johnsoni populations (Figure 2.3).

2.3.4 NUCLEOTIDE DIVERSITY

Nucleotide diversity was lower in P. johnsoni populations (J1 to J5) than in P.

gyrina populations (G1 to G3) (mean across all 250 bp windows: J1: 0.00133, J2: 0.00115, J3: 0.00113, J4:

0.00106, J5: 0.00156, G1: 0.00421, G2: 0.00475, G3: 0.00536) (Figure 2.4).
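
As a quick arithmetic check on the size of this difference, the sketch below divides each P. gyrina value by the highest P. johnsoni value (J5), which gives a conservative ratio. The values are those reported above; the variable names are my own.

```python
# Mean nucleotide diversity per pool, as reported above.
pi = {
    "J1": 0.00133, "J2": 0.00115, "J3": 0.00113, "J4": 0.00106, "J5": 0.00156,
    "G1": 0.00421, "G2": 0.00475, "G3": 0.00536,
}

# Compare against the most diverse P. johnsoni pool for a conservative ratio.
max_johnsoni = max(v for k, v in pi.items() if k.startswith("J"))
ratios = {k: v / max_johnsoni for k, v in pi.items() if k.startswith("G")}

for pool, ratio in ratios.items():
    print(pool, round(ratio, 2))
```

Even against J5, G1 has more than twice the nucleotide diversity and G2 and G3 have more than three times.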

2.4 DISCUSSION

As the instances of habitat loss and fragmentation increase and contribute to species

decline (Shaffer 1981), the effect of these events on genetic diversity is of increasing importance

to conservation management (Frankham 2005). Population size fluctuations (bottlenecks),

inbreeding, reduced gene flow and genetic drift can decrease the fitness of a population or

species (Bouzat 2010). Freshwater snails present an excellent system in which to study these

impacts as they often naturally exist in discrete populations with limited dispersal (Viard et al.

1997). Genetic drift has also been shown to have a rapid and large influence due to founder


effects, frequent bottlenecks, and low immigrant rates causing exceedingly low genetic diversity

(Viard et al. 1997; Bousset et al. 2004). These species, which additionally have the ability to

colonize a wide range of habitats, provide a unique opportunity to study the genomic

underpinning of population and species differentiation (Mavárez et al. 2002c). Though there are

instances in certain species and ecosystems where there are no patterns of genetic differentiation

between distal populations (Gu et al. 2015; Lounnas et al. 2018), strong genetic structure is

common within small micro-geographical ranges (Mavárez et al. 2002a; b; Bousset et al. 2004;

Djuikwo-Teukeng et al. 2014). The endangered P. johnsoni exemplifies the characteristics that

can make freshwater snails ideal study systems. Its global habitat is restricted to just seven

geographically close thermal springs, where its populations undergo severe yearly bottlenecks with

minimal gene flow predicted between the thermal springs (Lepitzki & Pacas 2010). While it is

currently managed as a species, studies have questioned the validity of this designation and have

proposed that P. johnsoni represents thermal ecotypes of a much more common snail, P. gyrina

(Wethington & Guralnick 2004; Wethington & Lydeard 2007). The objectives of this study were

therefore to test whether these putative species represented distinct phylogenetic units, and to test

whether the yearly cyclic bottlenecks contributed to population divergence and decreased

nucleotide diversity among P. johnsoni from different thermal springs.

2.4.1 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY BETWEEN P. JOHNSONI AND P. GYRINA POPULATIONS

I discovered strong genetic divergence between pooled samples of P. johnsoni and P.

gyrina. Using just under one million to over four million SNPs sequenced across the genome, a

pairwise FST of 0.636 ± 0.0605 (mean ± SD) was found between P. johnsoni and P. gyrina

populations on a geographical scale of less than a few kilometres. Though the relationship

between FST and the number of migrants is tenuous at best in most systems, this value would

indicate one migrant every five to nine generations (Whitlock & McCauley 1999). This value

should be interpreted with extreme caution, as these populations break many of the assumptions

of this relationship (e.g., evolutionary equilibrium). For example, there should be no genetic

isolation by geographical distance with all populations contributing equally to the pool of

migrants (Whitlock & McCauley 1999). This is not reflected by the patterns of pairwise FST

(e.g., J5 is the thermal spring geographically farthest from G1, but genetically the closest). As well,

populations are assumed to have a constant number of individuals (which is clearly not the case

in this system), to experience no selection or mutation, and to have reached equilibrium between migration and

genetic drift (Whitlock & McCauley 1999). With this in mind, the patterns of genetic

differentiation should be considered to indicate a maximized FST between these species with

virtually no gene flow, rather than a quantitative estimate of migrants. As well, there was a

striking difference in the nucleotide diversity between P. johnsoni and P. gyrina. P. gyrina

populations were found to have over double (G1) or triple (G2 and G3) the nucleotide diversity

observed in P. johnsoni populations. This relationship was also reflected in the number of SNPs

captured within each population, with P. gyrina having roughly three to four times the number of

within population SNPs captured. Due to pooling the individuals before sequencing, it was not

possible to determine the amount of variability in nucleotide diversity between individuals

towards establishing significance.
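
The migrant estimate quoted above is consistent with Wright's island-model approximation, FST ≈ 1/(4Nm + 1), rearranged to Nm = (1/FST − 1)/4. I assume here that this is the relationship being applied; the sketch below simply evaluates it at the mean FST and at one standard deviation either side, and all of the caveats about its assumptions still hold.

```python
def migrants_per_generation(fst: float) -> float:
    """Nm under the island-model approximation FST = 1 / (4 * Nm + 1)."""
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


mean_fst, sd_fst = 0.636, 0.0605  # interspecific pairwise FST reported above

for fst in (mean_fst - sd_fst, mean_fst, mean_fst + sd_fst):
    nm = migrants_per_generation(fst)
    print(f"FST = {fst:.4f}: Nm = {nm:.3f}, one migrant every {1 / nm:.1f} generations")
```

Evaluated this way, the mean FST corresponds to roughly one migrant every seven generations, and the one standard deviation range spans about five to nine generations.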

There are a few plausible explanations for the observed population structure between P.

johnsoni and P. gyrina, and the reduced diversity in P. johnsoni. One possibility is that P.

johnsoni is adapting to the thermal spring environment and divergent selection is causing the

fixation of critical alleles, increasing divergence and decreasing nucleotide diversity. Another,

non-mutually exclusive possibility is the influence of the repeated bottlenecks. Over roughly 20

generations (assuming generation time of one year), Lepitzki (COSEWIC 2018) documented

shifts where certain P. johnsoni populations’ minimum reached under 0.010% of their maximum

value. Such patterns are predicted to result in genetic drift, whereby the probability of random

fixation or extinction of alleles is inversely proportional to effective population size (Hedrick &

Kalinowski 2000) and a reduction in genetic diversity is the predicted outcome of the process.

Though P. gyrina have been shown to have seasonal patterns of large increase and decrease in

other Albertan lakes (Sankurathri & Holmes 1976), higher population numbers and less

constrained habitat may decrease the influence of the bottlenecks on genetic diversity as

compared to P. johnsoni (Bouzat 2010). There could also be influence from multiple

evolutionary forces. Indications of selection have been found in populations that undergo

extensive bottlenecks; however, the selective force for an allele must be strong enough to

overcome random drift (e.g., Koskinen et al. 2002; Funk et al. 2016). Though there are clear

patterns of genetic differentiation and decreased nucleotide diversity which highlight the

conservation concern for P. johnsoni, more data will be necessary to determine the relative roles

of genetic drift and selection in this system.


2.4.2 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY WITHIN P. JOHNSONI AND P. GYRINA POPULATIONS

The SNPs captured in this study support the hypothesis of multiple genetic populations

for P. johnsoni and P. gyrina. Previous genetic work in P. gyrina which included G1, G2 and G3

and an additional two populations located in Banff National Park and Montana, was unable to

find support for monophyly or distinguish between three of the five populations (Remigio et al.

2001). Using two (arguably non-neutral) genetic markers, they found that G1 and G2 grouped

together away from G3 (Remigio et al. 2001). In this present study, I found that there was strong

population structure, with pairwise FST of 0.359 between G2 and G3 and of 0.498 and 0.455

between G1 and G2 and G1 and G3, respectively. This population structure suggests more

differentiation between the marsh population containing thermal water run-off (G1) and the lake

populations (G2 and G3) than between the two lake populations. Re-addressing population

structure with genome-wide SNPs may be more effective to resolve population structure when

present (e.g. Emerson et al. 2010). G1 was seen to have decreased nucleotide diversity compared

to both G2 and G3, though more SNPs were captured in this population than in G2. Whether

these patterns are due to adaptation to different habitat types, connectivity (discussed below) or a

combination is unknown.

Between the five populations of P. johnsoni included in this study, which are spread over

just one kilometre, I found pairwise FST ranging from 0.106 to 0.367 (factors likely impacting

this structure discussed below). Within P. johnsoni populations no general trends could be seen

between the severity of the population’s minimum and maximum and the amount of nucleotide

diversity or the number of within population SNPs. In fact, J5, which has had some of the lowest

population minimums (30 to 40 individuals between 1996 and 2017) (COSEWIC 2018), was

found to have the highest nucleotide diversity and number of within population SNPs. This illustrates that

census data should be used in conjunction with genomic data for conservation management

(Keller & Waller 2002). Altogether, these analyses reveal that population genetic factors are

influencing the evolutionary trajectories of snails within these thermal springs at a remarkably

microgeographic scale.

There are several ecological genetic factors possibly influencing the observed patterns of

genetic differentiation between populations of the same species. One possibility is micro-habitat

local adaptation; however, further data generation and analysis would be necessary to address this

(discussed in Chapter 3 General conclusions). Another possible influencing force is, again,

genetic drift. In addition to decreasing genetic diversity, drift is predicted to increase genetic

differentiation between populations. To some extent all of the populations included in this study

have likely undergone bottlenecks, so it is possible that these events are contributing to the

genetic differentiation and FST estimates measured. Ease of dispersal may also be playing a role

in the patterns of differentiation. For instance, between J2, J3, and J4 there is decreased pairwise

FST as compared to J1 (protected by a cave) and J5 (up Sulphur Mountain by about one km). In

certain conditions J4 water will run above ground to J3 and snails have been observed in the J4

outlet (Lepitzki 2002). Though water has been observed to flow from J3 into J1, there are no

patterns of decreased population structure between them in comparison to J1 to J2 or J4.

From the early to mid 1900s until 1997, dispersal between P. johnsoni populations may have been impacted,

as the thermal springs were piped together and bathing occurred between J1 and J2 until 1997

(COSEWIC 2008). Without prior sequencing to compare to, the effects of this will remain

unknown. Between P. gyrina populations, the two populations connected by the Bow River (G2

and G3) have decreased (though still substantial) population structure compared to G1, which is

isolated from the river. Though birds (Santamaría & Klaassen 2002) and mammals (Van

Leeuwen et al. 2013) have been shown to transport snails and shape the patterns of genetic

diversity (Van Leeuwen et al. 2013), water connectedness can frequently drive population

structure, including in other aquatic species (e.g., Kremer et al. 2017). Previous work has shown

that there is decreased genetic differentiation between populations over much further ranges that

are connected by waterways which allow the transport of snails and eggs on mats, than even very

close pond populations (Mavárez et al. 2002a; Bousset et al. 2004; Djuikwo-Teukeng et al.

2014). Other than the flooding between certain P. johnsoni populations and the possibility of

previous connectivity through pipes, P. johnsoni populations have very little water connection as

the thermal water largely runs underground (Grasby & Lepitzki 2002). Though there is increased

nucleotide diversity in the two connected lake populations of P. gyrina, it is difficult to

disentangle whether this is due to higher population numbers and their habitat or to

connectivity. Interestingly, the P. johnsoni populations of J2, J3 and J4 which have the lowest

genetic distance and presumably the highest probability of genetic mixing, also have the lowest

nucleotide diversity. The strong population structure between

geographically close populations, together with the decreased nucleotide diversity and number of polymorphic

sites in P. johnsoni populations and the known life history, provides compelling

evidence that genetic drift may be driving minor allele loss in P. johnsoni populations.

2.4.3 BROADER IMPLICATIONS AND CONSERVATION RECOMMENDATIONS

Though species designation has a broad spectrum of definitions, this study has shown

there to be clear genetic differentiation between P. johnsoni and P. gyrina and between

populations of what is argued to be the same species. This level of genetic differentiation

between populations of the same species is not restricted to this study and has been found in

other freshwater snail species (Mavárez et al. 2002a; b; Bousset et al. 2004; Djuikwo-Teukeng et

al. 2014). This raises concern about the potential loss of freshwater snail biodiversity that may

contain ecologically and evolutionarily significant genetic diversity (Funk et al. 2012; Mee et al.

2015). On the whole, molluscs are data deficient with respect to conservation. Though only 10% of

known species of molluscs have been assessed by the International Union for Conservation of

Nature (IUCN) as of 2016, they still represent 40% of the documented extinctions (Cowie et al.

2017). With the genetic structure observed on such a short geographical scale, there is a high

probability that we are in fact losing, if not “species”, genetically diverse populations which are

essential for persistence with environmental change (Ceballos & Ehrlich 2002) at a much faster

rate than even predicted (Régnier et al. 2015; Cowie et al. 2017). P. johnsoni is fortunate that it

exists in such a visibly unique habitat in a national park where COSEWIC agreed that even if it

actually represented a thermal ecotype of P. gyrina, it would likely have been re-designated as a

designatable unit (DU) (COSEWIC 2008). This ensured the allocation of resources due to its

proposed ecological or evolutionary significance (Joseph et al. 2009; Funk et al. 2012;

Carwardine et al. 2018). Actions included census counts done every three to four weeks from

1996 until their termination in 2017, motion triggered alarms to prevent people soaking in

the Middle Springs, the closing of swimming at the Cave and Basin Springs, and previous

funding to test the evolutionary significance of the species (Lepitzki 1998; Remigio et al. 2001;

COSEWIC 2008, 2018; Lepitzki & Pacas 2010). Without a recognition of taxonomic, ecological

or evolutionary uniqueness, this level of resources will not be allocated to species (Isaac et al.

2004; Joseph et al. 2009). Unfortunately, there is a taxonomic bias in the primary research

necessary to assess these measures of distinctiveness (Howard & Bickford 2014; Régnier et al.

2015; Cowie et al. 2017). Genomics provides a relatively cost effective (the integration of Pool-


seq into conservation is discussed below) method for determining population structure and

characterizing the genetic health of species or populations.

As illustrated in the P. johnsoni populations compared to P. gyrina, bottlenecks in small,

isolated populations can cause genetic drift to fix alleles (decreasing nucleotide diversity) and

promote genetic differentiation. This loss of alleles can cause the fixation of detrimental alleles

(Bouzat 2010) and a decrease in the standing genetic variation necessary to rapidly respond to

environmental change (Morris et al. 2014). For P. johnsoni this means that each population

represents an incredibly important reservoir for the limited nucleotide diversity found across the

species. In light of this, I would recommend that the population counts are re-instated so that

deviations from 20-year trends can be detected quickly and, ideally, coupled with concurrent

genomic estimates of genetic diversity to directly test predictions associated with genetic drift.

The routine sequencing of the populations every few years would be an incredibly valuable

component of P. johnsoni’s management plan, as predicted with other management plans (e.g.,

De Barba et al. 2010; Hendricks et al. 2017). Temporal differences in the same population’s

nucleotide diversity and the extent of differentiation would be a powerful way to investigate the

effect of genetic drift (Bousset et al. 2004). If a population starts declining in numbers (a

significant decline in maxima has already been observed (COSEWIC 2018)) and/or there

is increased fixation of alleles, translocation of individuals from another population may be

warranted as genetic rescue (Ingvarsson 2001; Edmands 2007). Under such scenarios, there

could be concern regarding the potential for outbreeding depression if local adaptation to each

thermal spring was disrupted with the influx of new individuals (Edmands 2007). However, the

genetic differentiation shown here likely indicates either current or recent gene flow. It is

possible that gene flow between these thermal springs has decreased below what would be

natural for the system, as each of the thermal springs has been impacted by humans (COSEWIC

2008), presumably decreasing frequentation by animals that could act as vectors for these snails.

Additionally, if adaptive differences are occurring at certain alleles even in the face of population

bottlenecks and corresponding impact of genetic drift, the selective force would be incredibly

strong and therefore unlikely to be disrupted by a few migrants (Funk et al. 2016). Without semi-

frequent monitoring of genetic variation, it will be impossible to establish a baseline of what is

considered normal and stable for the system, with genetic threats remaining undetectable. As

well, further monitoring would provide the parameters necessary to elucidate the roles of

selection and drift in this system. This could be used to characterize the potential risk of


outbreeding depression if translocation occurred to mitigate the impact of low genetic diversity

and/or inbreeding depression. P. johnsoni represents a fantastic and unique opportunity to

conduct research on how the genome of a species is impacted by existing in small, isolated populations with

minimal gene flow and bottlenecks. In the face of the biodiversity crisis, where

critically important genetic diversity is so often over-looked (Frankel 1974; Laikre 2010),

characterizing and understanding genetic drift is vital.

2.4.4 THE UTILITY OF POOL-SEQ IN CONSERVATION

Pool-seq provides a low-cost method for capturing genome-wide polymorphisms. In

conservation management, Pool-seq can be effective for decreasing sequencing costs without reducing

the number of individuals (Ferretti et al. 2013). However, there are some purposes where Pool-

seq excels and others where it is limited. Firstly, Pool-seq is particularly useful in cases where

there are unknown amounts of polymorphism, such as this study. With RAD (Baird et al. 2008)

and ddRAD (Peterson et al. 2012) sequencing, only a small proportion of the genome is

captured, with one snail study capturing fewer than three thousand markers (Kess et al. 2016).

Because these methods involve the use of restriction enzymes that cut at specific patterns of

DNA, it is hard to predict the number of DNA fragments of appropriate size that will be generated

(Liu et al. 2013) and pilot studies can be necessary to determine this (Kess et al. 2016).

Barcoding individuals, even when doing reduced sequencing, can still represent a large financial

investment for a decreased number of SNPs captured (Gautier et al. 2013). However, because of

the loss of individual-level information in Pool-seq, it is not possible to accurately estimate migrant rate, effective

population size or inbreeding coefficient using Pool-seq (Andrews et al. 2016). Additionally,

assignment of individuals to populations is not possible (Andrews et al. 2016). This must be

taken into account when sampling, especially if the species doesn’t exist in discrete populations.

If using Pool-seq, specific parameters and filtering must be used to decrease bias in allele

frequency estimates and subsequent calculations. I will discuss these in the context of this study.

At the sampling level, a minimum of 40 individuals is recommended per population for the most

accurate population allele frequency estimates (Schlötterer et al. 2014). Though Hivert et al.

(2018) argue that their estimator for pairwise FST is unbiased by pool size or coverage, pool size is a

consideration for the measure of nucleotide diversity in this study (Kofler et al. 2011a). The

intention was to sample 40 individuals per population; however, J2 had low population numbers. This is a

known limitation of using Pool-seq in endangered species (Schlötterer et al. 2014).


To mitigate this, I used windows in calculating nucleotide diversity, as recommended for

low sample size (Kofler et al. 2011a; Schlötterer et al. 2014). Care was taken to ensure equal

representation of each individual per pool in terms of DNA amount (Gautier et al. 2013;

Schlötterer et al. 2014). For filtering, pre-processing and calculations, I followed recommended

best practices to mitigate the effects of sequencing errors being called as SNPs (Kofler et al.

2011a; b; Schlötterer et al. 2014; Hivert et al. 2018). Further considerations are discussed in

Caveats.

As we strive to include genomics into conservation with increasing frequency, careful

validation and reflection on software used must be conducted (Shafer et al. 2015). In this study,

pairwise FST was originally calculated using the established PoPoolation2 (Kofler et al. 2011b).

It was then calculated using newly developed Poolfstat (Hivert et al. 2018) as a confirmation.

The packages use different methods for calculating estimates of allele counts, with Hivert et al.

(2018) illustrating that the PoPoolation2 estimate is biased (not converging on expected values

and impacted by coverage and sample size). The differences between the two packages should

not have been extreme (Hivert et al. 2018); however, I found there to be up to a 5x difference

between the two packages in the pairwise FST calculated. PoPoolation2 (Kofler et al. 2011b)

found pairwise FST of 0.044 to 0.076 between populations of P. johnsoni, 0.21 to 0.35 between

P. johnsoni and P. gyrina and 0.167 to 0.237 between P. gyrina populations (Table C.1),

compared to 0.106 to 0.367, 0.519 to 0.709 and 0.359 to 0.498 respectively found by Poolfstat

(Table 2.2) (Hivert et al. 2018). The population structure reported by PoPoolation2 would have indicated that P.

johnsoni and P. gyrina may not be genetically distinct and that gene flow was likely occurring

between them. However, it was determined that when calculating pairwise FST, PoPoolation2

(Kofler et al. 2011b) considers all base positions that are polymorphic in one or more

populations, regardless of whether the position is polymorphic

in either of the populations in the present pairwise comparison. Thus, any allelic position that

was polymorphic in some population but that was fixed in the two populations being compared

generated a pairwise FST of zero, effectively dampening the population structure. This was

exacerbated by the difference in nucleotide diversity between the P. johnsoni and P. gyrina

populations. This is a clear illustration that genomic results must be thought of in the context of

known ecological information for the species and must be examined closely before conservation

recommendations are made.
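
The dampening effect described above can be illustrated with a toy calculation. This is not how either PoPoolation2 or Poolfstat computes FST; it is a simplified sketch with invented per-site values, assuming that every position fixed in both populations of a comparison contributes an FST of zero to a simple average.

```python
def mean_fst(per_site_fst, n_fixed_sites=0):
    """Average per-site FST, padding with zero-valued sites fixed in both populations."""
    n_sites = len(per_site_fst) + n_fixed_sites
    if n_sites == 0:
        raise ValueError("no sites supplied")
    return sum(per_site_fst) / n_sites


# Hypothetical per-site FST at positions polymorphic within this comparison.
informative = [0.60, 0.70, 0.65]

without_fixed = mean_fst(informative)
# Nine further positions, polymorphic only in other pools, each counted as zero.
with_fixed = mean_fst(informative, n_fixed_sites=9)

print(round(without_fixed, 4), round(with_fixed, 4))
```

In this toy case the average falls from 0.65 to about 0.16. The more positions that vary only in other pools, the stronger the dampening, which is why a large difference in nucleotide diversity between species makes the effect worse.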


Development and re-use of well-developed sequencing methods and pipelines are at the

core of genomics being integrated efficiently into conservation (Shafer et al. 2015). If this

pipeline were applied to a different project, once the samples were collected, it would likely take

less than a month to go from extracted DNA to having population structure results. In this

context, using Pool-seq provides an incredibly cost-effective method for conservation

management to assess population structure and investigate the genetic health of populations. This

method can be complemented by restriction enzyme sequencing of a subset of individuals to

provide estimates of effective population number, inbreeding coefficient and migrant rates

(Andrews et al. 2016) for more complete conservation management plans.

2.4.5 CAVEATS

An unavoidable consequence of pooling individuals for Pool-seq is that there is no way to

distinguish between two sequences that were sequenced twice from the same individual or from

two individuals (Schlötterer et al. 2014). Downstream applications assume that they were from

different individuals, which may bias the population estimates for allele frequency. However, the

estimates that I provide in this study are based on averaging over millions of positions, so

the bias should be decreased. The genome reference was created using pooled, 135 bp (post

Trimmomatic) paired-end short read sequences from one P. johnsoni population. DISCOVAR de

novo was designed for paired-end reads of 250 bp from a single PCR-free library, though the

creators do state that PCR-amplified libraries can “in principle be used”, and that 150 bp reads

“may work” (https://software.broadinstitute.org/software/discovar/blog/?page_id=23). Increased

quality was seen when using one pool (J3) rather than all pools to construct the genome assembly

reflected in an increased N50 value and MPL1 and decreased estimated chimera rate. The

creators of DISCOVAR de novo state that the MPL1 should be 175 bp to 225 bp for 250 bp paired-end

reads (compared to 90 bp for the J3 reference genome), though this value did not result in

DISCOVAR de novo flagging the assembly as problematic, nor did any of the other values

generated. However, due to these factors, the constructed contigs were short, which increases

sequence mapping error. Additionally, I assumed that P. gyrina would successfully map to a

reference genome constructed from P. johnsoni sequences. When calling SNPs, it was required

that all populations have a minimum of 20x coverage over a position, which should decrease the

impact of certain populations not mapping to divergent regions. Copy number variants and

repetitive regions of the genome may collapse to the same position (Schlötterer et al. 2014). This


is not an issue unique to Pool-seq and is an unfortunate limitation in using short sequencing

reads. I attempted to mitigate this by setting the upper limit of sequencing coverage to 200.

Biological differences between P. johnsoni and P. gyrina, such as selfing rates, could influence

the amount of fixation occurring. I am unable to determine what evolutionary force is generating

the genetic differentiation between and within P. johnsoni and P. gyrina populations. While I can

predict that genetic drift plays a large role, given the decreased genetic diversity observed and the

known bottlenecks, further work investigating potentially adaptive differences between P.

johnsoni and P. gyrina is necessary.

2.4.6 CONCLUSIONS

Without the integration of genomics, current conservation management plans will remain

incomplete. Here I used Pool-seq, a cost-effective sequencing method, to capture millions of

SNPs in the globally restricted P. johnsoni and the more common P. gyrina. Analyses using

these SNPs resolved genetic structure between the two

species that had remained ambiguous when only a few genetic markers were used (Remigio et al. 2001; Wethington & Guralnick

2004). These results indicated that P. johnsoni and P. gyrina were genetically distinct from each

other. Additionally, I found extensive population structure among

populations of the same species. Coupled with the decreased nucleotide diversity in P. johnsoni

populations, which undergo massive bottlenecks, these results indicate that there may be a large

impact of genetic drift. The findings of this study will be integrated into P. johnsoni’s

management plan and will help make it more complete. Without the use of genomics,

differentiation between P. johnsoni and P. gyrina would not have been determined and the

impact of bottlenecks on decreasing genetic diversity would have remained a predicted but

uncharacterized threat. As illustrated in this system, there is an incredible place and need for the

use of genomics in conservation. The partnership between Parks Canada and University of

Calgary researchers represents the type of collaboration that is necessary for genomics to be used

in real world policy applications.

Population  J1         J2         J3         J4         J5         G1         G2         G3
J1          1,174,228  1,362,417  1,369,156  1,349,817  1,575,031  3,491,977  4,416,793  4,607,239
J2          -          1,000,751  1,127,267  1,049,468  1,468,419  3,527,269  4,397,121  4,601,359
J3          -          -          1,061,899  1,090,399  1,454,054  3,460,571  4,358,911  4,566,735
J4          -          -          -          921,339    1,090,399  3,510,866  4,372,598  4,580,040
J5          -          -          -          -          1,300,995  3,531,009  4,466,620  4,673,149
G1          -          -          -          -          -          3,248,634  4,708,928  4,817,892
G2          -          -          -          -          -          -          3,053,291  4,366,631
G3          -          -          -          -          -          -          -          3,736,834

Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Population  J1  J2     J3     J4     J5     G1     G2     G3
J1          0   0.315  0.335  0.367  0.322  0.550  0.692  0.650
J2          -   0      0.136  0.106  0.312  0.569  0.699  0.656
J3          -   -      0      0.209  0.302  0.574  0.707  0.667
J4          -   -      -      0      0.351  0.586  0.709  0.666
J5          -   -      -      -      0      0.519  0.671  0.630
G1          -   -      -      -      -      0      0.498  0.455
G2          -   -      -      -      -      -      0      0.359
G3          -   -      -      -      -      -      -      0

Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.
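
As a rough illustration only (this is not part of the original analysis, which used Poolfstat), the pairwise FST values in Table 2.2 can be grouped into within-species and between-species comparisons and averaged. The sketch below assumes a simple unweighted mean per group; the group labels and the function name `summarise_fst` are hypothetical names introduced here.

```shell
#!/bin/bash
set -euo pipefail

# Pairwise FST values copied from Table 2.2: population 1, population 2, FST
# (J = P. johnsoni, G = P. gyrina)
fst_pairs="J1 J2 0.315
J1 J3 0.335
J1 J4 0.367
J1 J5 0.322
J1 G1 0.550
J1 G2 0.692
J1 G3 0.650
J2 J3 0.136
J2 J4 0.106
J2 J5 0.312
J2 G1 0.569
J2 G2 0.699
J2 G3 0.656
J3 J4 0.209
J3 J5 0.302
J3 G1 0.574
J3 G2 0.707
J3 G3 0.667
J4 J5 0.351
J4 G1 0.586
J4 G2 0.709
J4 G3 0.666
J5 G1 0.519
J5 G2 0.671
J5 G3 0.630
G1 G2 0.498
G1 G3 0.455
G2 G3 0.359"

# Hypothetical helper: reads "pop1 pop2 FST" lines on stdin and prints
# "group number_of_pairs mean_FST" for each comparison type.
summarise_fst() {
    awk '
    NF == 3 {
        first = substr($1, 1, 1)
        second = substr($2, 1, 1)
        if (first != second) {
            group = "between_species"
        } else if (first == "J") {
            group = "within_johnsoni"
        } else {
            group = "within_gyrina"
        }
        total[group] += $3
        count[group]++
    }
    END {
        for (group in total) {
            printf "%s %d %.3f\n", group, count[group], total[group] / count[group]
        }
    }'
}

summary=$(printf '%s\n' "$fst_pairs" | summarise_fst | sort)
echo "$summary"
```

An unweighted mean ignores the different numbers of SNPs behind each comparison (Table 2.1), so the output should be read as a quick summary of the table rather than as a new estimate.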

Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).

Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks until August 2000 and then once every four weeks until September 2016, when population counts ended. Counts were resumed from April to September 2017 and in September 2018. Original springs include J1, J2, J3, J4 and J5. The re-established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.

Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250 bp side-by-side windows, with a minor allele count of 4, a minimum coverage of 20 and a maximum coverage of 200, where 60% of each window was required to meet the coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

CHAPTER 3 GENERAL CONCLUSIONS

In this study, I aimed to provide clarity to previously unresolved taxonomic designations

between the Banff Springs Snail (Physella johnsoni) and Physella gyrina. Additionally, I

provided new data that characterized the genetic diversity and micro-geographical population structure of P. johnsoni. Using Pool-seq, I captured just under a million to over four million SNPs per population,

allowing me to uncover strong and defined population structure between P. johnsoni and

P. gyrina. This leads me to believe that they represent genetically distinct units and warrant

continued separate management. Even among populations of the same species there

was extensive population structure, which raises concern about the prospect of “lumping”

snail species (Wethington & Guralnick 2004), as others (Mavárez et al. 2002a; b;

Bousset et al. 2004; Djuikwo-Teukeng et al. 2014) and I have found strong population structure

between populations on small geographic scales. In terms of management, I believe the data

deficiency for molluscs (Régnier et al. 2009; Cowie et al. 2017), specifically genomic data, will

result in the loss of possibly genetically unique and interesting species and sub-species and

threaten the continued persistence of many molluscs.

Pool-seq is a fantastic tool to address population structure and nucleotide diversity.

Complications can arise when using it to address adaptive differences, as without an annotated

genome, regions of divergence lack biological relevance. However, this issue is not restricted to

Pool-seq and is shared by all sequencing methods. Unlike RAD-seq, however, Pool-seq captures the

majority of the genome, and without a reference genome these data are under-utilized.

Fortunately, the cost of sequencing genomes is continually decreasing, and the number of

available genome references is increasing. In terms of conservation management, the current

toolset for analyzing Pool-seq data is limited in some respects. There are packages and scripts

developed to determine pairwise FST, nucleotide diversity (Tajima's Pi), Watterson’s Theta or

Tajima's D, but because individual-level information is lost, Pool-seq data cannot be used to determine levels

of inbreeding or effective population size. However, if these do not need to be explicitly

determined for the species or population, Pool-seq does provide an impressive amount of data

and resolution for the parameters it can determine for a very attractive price.

In future steps, I would like to further investigate the population structure between P. johnsoni and P. gyrina and to take the first steps in determining whether there are potentially adaptive

differences between them. In the pursuit of this goal, I think that generating a reference genome

would be of great benefit. This would allow us to start investigating if there are shared regions of

the genome that show evidence of selection between P. johnsoni and P. gyrina and if these

regions lie near or in potential gene coding regions.

In conclusion, the data I have generated and presented here provides the resolution

necessary to determine that P. johnsoni and P. gyrina are genetically distinct. Additionally, I

have shown that there is strong micro-geographical population structure among the P. johnsoni thermal springs and decreased within-population nucleotide diversity. I recommend a modified

version of the current recovery strategy and action plan for P. johnsoni as the appropriate action

plan. Considering the decreased nucleotide diversity shown in P. johnsoni, each population plays

a vital role in the evolutionary robustness of the species beyond just total numbers. I recommend

re-instating population counts focused on capturing the yearly minimum and maximum for each

population. I propose that population counts be done every four weeks for three months, or

some duration and frequency that captures previously recorded population minimums and

maximums (COSEWIC 2018). This could provide evidence of deviations from the 20-year

norms and therefore provide the first warning signs of population collapse, especially when

coupled with genomic data. As such, I recommend that semi-regular sequencing be incorporated

into P. johnsoni’s management plan to establish a baseline for the impact of genetic drift in

population divergence and nucleotide diversity. Additional monitoring of genetic variation levels

could determine potentially adaptive versus non-adaptive loci, effective population sizes and

inbreeding coefficients. Decreasing nucleotide diversity, effective population size and/or

population numbers and/or increasing inbreeding may warrant translocation of individuals from

a population with different polymorphisms. By characterizing these factors, management would

be able to weigh the potential risks of outbreeding depression versus inbreeding depression. As

demonstrated in this system, the use of genomics in conservation is a vital component of creating

effective and efficient management plans.

References

Adams JR, Vucetich LM, Hedrick PW, Peterson RO, Vucetich JA (2011) Genomic sweep and

potential genetic rescue during limiting environmental conditions in an isolated wolf

population. Proceedings of the Royal Society B: Biological Sciences, 278, 3336–3344.

Allendorf FW, Hohenlohe PA, Luikart G (2010) Genomics and the future of conservation

genetics. Nature Reviews Genetics, 11, 697–709.

Anderson E, Skaug HJ, Barshis DJ (2014) Next-generation sequencing for molecular ecology: a

caveat regarding pooled samples. Molecular Ecology, 23, 502–512.

Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. Available

online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA (2016) Harnessing the power of

RADseq for ecological and evolutionary genomics. Nature Reviews Genetics, 17, 81–92.

Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using

sequenced RAD markers. PLoS ONE, 3, 1–7.

De Barba M, Waits LP, Garton EO et al. (2010) The power of genetic monitoring for studying

demography, ecology and genetics of a reintroduced brown bear population. Molecular Ecology, 19, 3938–3951.

Barrett RDH, Rogers SM, Schluter D (2008) Natural selection on a major armor gene in

threespine stickleback. Science, 322, 255–257.

Bilyj M (2011) A study on the phototrophic microbial mat communities of Sulphur Mountain

Thermal Springs and their association with the endangered, endemic snail Physella johnsoni. University of Manitoba.

Bland LM, Collen B, David C, Orme L, Bielby J (2015) Predicting the conservation status of

Data Deficient species. Conservation Biology, 53, 1792–1803.

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence

data. Bioinformatics, 30, 2114–2120.

Bousset L, Henry PY, Sourrouille P, Jarne P (2004) Population biology of the invasive

freshwater snail Physa acuta approached through genetic markers, ecological

characterization and demography. Molecular Ecology, 13, 2023–2036.

Bouzat JL (2010) Conservation genetics of population bottlenecks: The role of chance, selection,

and history. Conservation Genetics, 11, 463–478.

Bull JK, Sands C, Garrick RC et al. (2013) Environmental complexity and biodiversity: The

multi-layered evolutionary history of a log-dwelling velvet worm in montane temperate

Australia. PLoS ONE, 8, 1–15.

Bushnell B (2014) BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL Report #: LBNL-7065E. Retrieved from

https://escholarship.org/uc/item/1h3515gn

Butchart SHM, Walpole M, Collen B et al. (2010) Global biodiversity: Indicators of recent

declines. Science, 328, 1164–1168.

Cardinale BJ, Duffy JE, Gonzalez A et al. (2012) Biodiversity loss and its impact on humanity.

Nature, 486, 59–67.

Carwardine J, Martin TG, Firn J et al. (2018) Priority threat management for biodiversity

conservation: A handbook. Journal of Applied Ecology, 0–2.

Ceballos G, Ehrlich PR (2002) Mammal population losses and the extinction crisis. Science, 296,

904–907.

Clench WJ (1926) Three new species of Physa. Occasional Papers of the Museum of Zoology,

168, 1–8.

COSEWIC (2008) COSEWIC assessment and update status report on the Banff Springs Snail

Physella johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., vii + 53 pp.

COSEWIC (2014) COSEWIC wildlife species assessment: quantitative criteria and guidelines.

Committee on the Status of Endangered Wildlife in Canada [cited 2018].

https://www.canada.ca/en/environment-climate-change/services/committee-status-

endangered-wildlife/wildlife-species-assessment-process-categories-guidelines/quantitative-

criteria.html (accessed on 18 December 2018).

COSEWIC (2015) Guidelines for recognizing designatable units. Committee on the Status of

Endangered Wildlife in Canada [cited 2018]. https://www.canada.ca/en/environment-

climate-change/services/committee-status-endangered-wildlife/guidelines-recognizing-

designatable-units.html (accessed on 29 October 2018).

COSEWIC (2018) COSEWIC status appraisal summary on the Banff Springs Snail Physella

johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., xxvi pp.

Cowie RH, Régnier C, Fontaine B, Bouchet P (2017) Measuring the sixth extinction: what do

mollusks tell us? Nautilus, 131, 3–41.

Craze PG, Barr AG (2002) The use of electrical-component freezing spray as a method of killing

and preparing snails. Journal of Molluscan Studies, 68, 191–192.

Dalziel AC, Rogers SM, Schulte PM (2009) Linking genotypes to phenotypes and fitness: How

mechanistic biology can inform molecular ecology. Molecular Ecology, 18, 4997–5017.

Daugherty CH, Cree A, Hay JM, Thompson MB (1990) Neglected taxonomy and continuing

extinctions of tuatara (Sphenodon). Nature, 347, 177–179.

Dennenmoser S, Vamosi SM, Nolte AW, Rogers SM (2017) Adaptive genomic divergence

under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin

(Cottus asper) revealed by Pool-Seq. Molecular Ecology, 26, 25–42.

Djuikwo-Teukeng FF, Da Silva A, Njiokou F et al. (2014) Significant population genetic

structure of the Cameroonian fresh water snail, Bulinus globosus, (Gastropoda: Planorbidae)

revealed by nuclear microsatellite loci analysis. Acta Tropica, 137, 111–117.

Edmands S (2007) Between a rock and a hard place: Evaluating the relative risks of inbreeding

and outbreeding for conservation and management. Molecular Ecology, 16, 463–475.

Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using

high-throughput sequencing. Proceedings of the National Academy of Sciences of the United States of America, 107, 16196–16200.

Ferretti L, Ramos-Onsins SE, Pérez-Enciso M (2013) Population genomics from pool

sequencing. Molecular Ecology, 22, 5561–5576.

Frankel OH (1974) Genetic conservation: our evolutionary responsibility. Genetics, 78, 53–65.

Frankham R (2005) Genetics and extinction. Biological Conservation, 126, 131–140.

Funk WC, Lovich RE, Hohenlohe PA et al. (2016) Adaptive divergence despite strong genetic

drift: Genomic analysis of the evolutionary mechanisms causing genetic differentiation in

the island fox (Urocyon littoralis). Molecular Ecology, 25, 2176–2194.

Funk WC, McKay JK, Hohenlohe PA, Allendorf FW (2012) Harnessing genomics for

delineating conservation units. Trends in Ecology and Evolution, 27, 489–496.

Futschik A, Schlötterer C (2010) The next generation of molecular markers from massively

parallel sequencing of pooled DNA samples. Genetics, 186, 207–218.

Gautier M, Foucaud J, Gharbi K et al. (2013) Estimation of population allele frequencies from

next-generation sequencing data: Pool-versus individual-based genotyping. Molecular Ecology, 22, 3766–3779.

Gilbertson CR, Wyatt JD (2016) Evaluation of euthanasia techniques for an invertebrate species,

land snails (Succinea putris). Journal of the American Association for Laboratory Animal Science, 55, 1–5.

Grasby SE, van Everdingen RO, Bednarski J, Lepitzki DA (2003) Travertine mounds of the

Cave and Basin National Historic Site, Banff National Park. Canadian Journal of Earth Sciences, 40, 1501–1513.

Grasby SE, Lepitzki DAW (2002) Physical and chemical properties of the Sulphur Mountain

thermal springs, Banff National Park, and implications for endangered snails. Canadian Journal of Earth Sciences, 39, 1349–1361.

Gu QH, Zhou CJ, Cheng QQ et al. (2015) The perplexing population genetic structure of

Bellamya purificata (Gastropoda: Viviparidae): low genetic differentiation despite low

dispersal ability. Journal of Molluscan Studies, 81, 466–475.

Guisan A, Tingley R, Baumgartner JB et al. (2013) Predicting species distributions for

conservation decisions. Ecology Letters, 16, 1424–1435.

Gustafson KD, Kensinger BJ, Bolek MG, Luttbeg B (2014) Distinct snail (Physa) morphotypes

from different habitats converge in shell shape and size under common garden conditions.

Evolutionary Ecology Research, 16, 77–89.

Hedrick PW, Kalinowski ST (2000) Inbreeding depression in conservation biology. Annual

Review of Ecology and Systematics, 31, 139–162.

Hedrick PW, Peterson RO, Vucetich LM, Adams JR, Vucetich JA (2014) Genetic rescue in Isle

Royale wolves: genetic analysis and the collapse of the population. Conservation Genetics,

15, 1111–1121.

Hendricks S, Epstein B, Schönfeld B et al. (2017) Conservation implications of limited genetic

diversity and population structure in Tasmanian devils (Sarcophilus harrisii). Conservation Genetics, 18, 977–982.

Hivert V, Leblois R, Petit EJ, Gautier M, Vitalis R (2018) Measuring genetic differentiation from

pool-seq data. Genetics, 210, 315–330.

Hoban S, Kelley JL, Lotterhos KE et al. (2016) Finding the genomic basis of local adaptation:

Pitfalls, practical solutions, and future directions. The American Naturalist, 188, 379–397.

Hoelzel AR, Halley J, O’Brien SJ et al. (1993) Elephant seal genetic variation and the use of

simulation models to investigate historical population bottlenecks. Journal of Heredity, 84,

443–449.

Holsinger KE, Weir BS (2009) Genetics in geographically structured populations: defining,

estimating and interpreting FST. Nature Reviews Genetics, 10, 639–650.

Hooper DU, Adair EC, Cardinale BJ et al. (2012) A global synthesis reveals biodiversity loss as

a major driver of ecosystem change. Nature, 486, 105–108.

Howard SD, Bickford DP (2014) Amphibians over the edge: Silent extinction risk of Data

Deficient species. Diversity and Distributions, 20, 837–846.

Ingvarsson PK (2001) Restoration of genetic variation lost - The genetic rescue hypothesis.

Trends in Ecology and Evolution, 16, 62–63.

Isaac NJB, Mallet J, Mace GM (2004) Taxonomic inflation: Its influence on macroecology and

conservation. Trends in Ecology and Evolution, 19, 464–469.

Jarne P, Perdieu MA, Pernot AF, Delay B, David P (2000) The influence of self-fertilization and

grouping on fitness attributes in the freshwater snail Physa acuta: Population and individual

inbreeding depression. Journal of Evolutionary Biology, 13, 645–655.

Johnson PD, Bogan AE, Brown KM et al. (2013) Conservation status of freshwater gastropods

of Canada and the United States. Fisheries, 38, 247–282.

Joseph LN, Maloney RF, Possingham HP (2009) Optimal allocation of resources among

threatened species: a project prioritization protocol. Conservation Biology, 23, 328–338.

Kell LT, Dickey-Collas M, Hintzen NT et al. (2009) Lumpers or splitters? Evaluating recovery

and management plans for metapopulations of herring. ICES Journal of Marine Science, 66,

1776–1783.

Keller LF, Waller DM (2002) Inbreeding effects in wild populations. Trends in Ecology and Evolution, 17, 230–241.

Kess T, Gross J, Harper F, Boulding EG (2016) Low-cost ddRAD method of SNP discovery and

genotyping applied to the periwinkle Littorina saxatilis. Journal of Molluscan Studies, 82,

104–109.

Kofler R, Orozco-terWengel P, de Maio N et al. (2011a) PoPoolation: A toolbox for population

genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE, 6.

Kofler R, Pandey RV, Schlötterer C (2011b) PoPoolation2: Identifying differentiation between

populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27,

3435–3436.

Koskinen MT, Haugen TO, Primmer CR (2002) Contemporary fisherian life-history evolution in

small salmonid populations. Nature, 419, 826–830.

Kremer CS, Vamosi SM, Rogers SM (2017) Watershed characteristics shape the landscape

genetics of brook stickleback (Culaea inconstans) in shallow prairie lakes. Ecology and Evolution, 7, 3067–3079.

Laikre L (2010) Genetic diversity is overlooked in international conservation policy

implementation. Conservation Genetics, 11, 349–354.

Van Leeuwen CHA, Huig N, Van Der Velde G et al. (2013) How did this snail get here? Several

dispersal vectors inferred for an aquatic invasive species. Freshwater Biology, 58, 88–99.

Lepitzki DAW (1998) The ecology of Physella johnsoni, the threatened Banff Springs Snail.

Heritage Resource Conservation - Aquatics, i-146.

Lepitzki DAW (2002) Status of the Banff Springs Snail (Physella johnsoni) in Alberta. Alberta Sustainable Resource Development, Fish and Wildlife Division, and Alberta Conservation Association, Wildlife Status Report No. 40, Edmonton, AB., 29 pp.

Lepitzki DAW, Pacas C (2010) Recovery Strategy and Action Plan for the Banff Springs Snail

(Physella johnsoni) in Canada. Species at Risk Act Recovery Strategy Series. Parks Canada Agency, Ottawa, vii + 63 pp.

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.

Bioinformatics, 25, 1754–1760.

Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format and

SAMtools. Bioinformatics, 25, 2078–2079.

Lien S, Gidskehaug L, Moen T et al. (2011) A dense SNP-based linkage map for Atlantic

salmon (Salmo salar) reveals extended chromosome homeologies and striking differences

in sex-specific recombination patterns. BMC Genomics, 12, 1–10.

Liu MM, Davey JW, Banerjee R et al. (2013) Fine Mapping of the pond snail left-right

asymmetry (chirality) locus using RAD-Seq and Fibre-FISH. PLoS ONE, 8, 2–8.

Lotterhos KE, Whitlock MC (2015) The relative power of genome scans to detect local

adaptation depends on sampling design and statistical method. Molecular Ecology, 24,

1031–1046.

Lounnas M, Correa AC, Alda P et al. (2018) Population structure and genetic diversity in the

invasive freshwater snail Galba schirazensis (Lymnaeidae). Canadian Journal of Zoology,

96, 425–435.

Lundmark C, Sandström A, Andersson K, Laikre L (2019) Monitoring the effects of knowledge

communication on conservation managers’ perception of genetic biodiversity – A case

study from the Baltic Sea. Marine Policy, 99, 223–229.

Mace GM (2004) The role of taxonomy in species conservation. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 359, 711–719.

Margres MJ, Jones ME, Epstein B et al. (2018) Large-effect loci affect survival in Tasmanian

devils (Sarcophilus harrisii) infected with a transmissible cancer. Molecular Ecology, 27,

4189–4199.

Martin TG, Nally S, Burbidge AA et al. (2012) Acting fast helps avoid extinction. Conservation Letters, 5, 274–280.

Mavárez J, Amarista M, Pointier JP, Jarne P (2002a) Fine-scale population structure and

dispersal in Biomphalaria glabrata, the intermediate snail host of Schistosoma mansoni, in

Venezuela. Molecular Ecology, 11, 879–889.

Mavárez J, Pointier JP, David P, Delay B, Jarne P (2002b) Genetic differentiation, dispersal and

mating system in the schistosome-transmitting freshwater snail Biomphalaria glabrata.

Heredity, 89, 258–265.

Mavárez J, Steiner C, Pointier J-P, Jarne P (2002c) Evolutionary history and phylogeography of

the schistosome-vector freshwater snail Biomphalaria glabrata based on nuclear and

mitochondrial DNA sequences. Heredity, 89, 266–272.

McCallum H (2008) Tasmanian devil facial tumour disease: lessons for conservation biology.

Trends in Ecology and Evolution, 23, 631–637.

Mee JA, Bernatchez L, Reist JD, Rogers SM, Taylor EB (2015) Identifying designatable units

for intraspecific conservation prioritization: A hierarchical approach applied to the lake

whitefish species complex (Coregonus spp.). Evolutionary Applications, 8, 423–441.

Moore AC, Burch JB, Duda TF (2014) Recognition of a highly restricted freshwater snail lineage

(Physidae: Physella) in southeastern Oregon: convergent evolution, historical context, and

conservation considerations. Conservation Genetics, 16, 113–123.

Morais AR, Siqueira MN, Lemes P et al. (2013) Unraveling the conservation status of data

deficient species. Biological Conservation, 166, 98–102.

Morris MRJ, Bowles E, Allen BE, Jamniczky HA, Rogers SM (2018) Contemporary ancestor?

Adaptive divergence from standing genetic variation in Pacific marine threespine

stickleback. BMC Evolutionary Biology, 18, 1–21.

Morris MRJ, Richard R, Leder EH et al. (2014) Gene expression plasticity evolves in response to

colonization of freshwater lakes in threespine stickleback. Molecular Ecology, 23, 3226–

3240.

de Oliveira LR, Arias-Schreiber M, Meyer D, Morgante JS (2006) Effective population size in a

bottlenecked fur seal population. Biological Conservation, 131, 505–509.

Ouborg NJ, Pertoldi C, Loeschcke V, Bijlsma RK, Hedrick PW (2010) Conservation genetics in

transition to conservation genomics. Trends in Genetics, 26, 177–187.

Parsons ECM (2016) Why IUCN should replace “data deficient” conservation status with a

precautionary “assume threatened” status—A cetacean case study. Frontiers in Marine Science, 3, 2015–2017.

Peichel CL, Sullivan ST, Liachko I, White MA (2017) Improvement of the threespine

stickleback genome using a Hi-C-based proximity-guided assembly. Journal of Heredity,

108, 693–700.

Peterson RO, Thomas NJ, Thurber JM, Vucetich JA, Waite TA (1998) Population limitation and

the wolves of Isle Royale. Journal of Mammalogy, 79, 828.

Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: An

inexpensive method for de novo SNP discovery and genotyping in model and non-model

species. PLoS ONE, 7, 1–11.

Pip E, Franck JPC (2008) Molecular phylogenetics of central Canadian Physidae (Pulmonata :

Basommatophora). Canadian Journal of Zoology, 16, 10–16.

Régnier C, Achaz G, Lambert A et al. (2015) Mass extinction in poorly known taxa.

Proceedings of the National Academy of Sciences, 112, 7761–7766.

Régnier C, Fontaine B, Bouchet P (2009) Not knowing, not recording, not listing: Numerous

unnoticed mollusk extinctions. Conservation Biology, 23, 1214–1221.

Remigio EA, Lepitzki DAW, Lee JS, Hebert PDN (2001) Molecular systematic relationships and

evidence for a recent origin of the thermal spring endemic snails Physella johnsoni and

Physella wrighti (Pulmonata: Physidae). Canadian Journal of Zoology, 79, 1941–1950.

Roberts DW (2012) Package ‘labdsv’

Rogers SM, Bernatchez L (2007) The genetic architecture of ecological speciation and the

association with signatures of selection in natural lake whitefish (Coregonus sp.

Salmonidae) species pairs. Molecular Biology and Evolution, 24, 1423–1438.

Rosenberg G (2014) A new critical estimate of named species-level diversity of the recent

Mollusca. American Malacological Bulletin, 32, 308–322.

RStudio Team (2016) RStudio: Integrated Development for R. RStudio, Inc., Boston, MA

http://www.rstudio.com/.

Sankurathri CS, Holmes JC (1976) Effects of thermal effluents on the population dynamics of

Physa gyrina Say (Mollusca: Gastropoda) at Lake Wabamun, Alberta. Canadian Journal of Zoology, 54, 582–590.

Santamaría L, Klaassen M (2002) Waterbird-mediated dispersal of aquatic organisms: An

introduction. Acta Oecologica, 23, 115–119.

Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals — mining

genome-wide polymorphism data without big funding. Nature Reviews Genetics, 15, 749–

763.

Schmieder R, Edwards R (2011a) Fast identification and removal of sequence contamination

from genomic and metagenomic datasets. PLoS ONE, 6.

Schmieder R, Edwards R (2011b) Quality control and preprocessing of metagenomic datasets.

Bioinformatics, 27, 863–864.

Shafer ABA, Wolf JBW, Alves PC et al. (2015) Genomics and the challenging translation into

conservation practice. Trends in Ecology and Evolution, 30, 78–87.

Shaffer ML (1981) Minimum population sizes for species conservation. Bioscience, 31, 131–

134.

Storfer A, Hohenlohe PA, Margres MJ et al. (2018) The devil is in the details: Genomics of

transmissible cancers in Tasmanian devils. PLoS Pathogens, 14, 1–7.

Taylor DW (2003) Introduction to Physidae (Gastropoda: Hygrophila); biogeography,

classification, morphology. Revista de Biologia Tropical, 51, 1–287.

Viard F, Justy F, Jarne P (1997) The influence of self-fertilization and population dynamics on

the genetic structure of subdivided populations: a case study using microsatellite markers in

the freshwater snail Bulinus truncatus. Evolution, 51, 1518–1528.

Vilà C, Sundqvist A-K, Flagstad Ø et al. (2003) Rescue of a severely bottlenecked wolf (Canis lupus) population by a single immigrant. Proceedings of the Royal Society B: Biological Sciences, 270, 91–97.

Weber DS, Stewart BS, Lehman N (2004) Genetic consequences of a severe population

bottleneck in the Guadalupe fur seal (Arctocephalus townsendi). Journal of Heredity, 95,

144–153.

Wethington AR, Guralnick R (2004) Are populations of physids from different hot springs

distinctive lineages? American Malacological Bulletin, 19, 135–144.

Wethington AR, Lydeard C (2007) A molecular phylogeny of Physidae (Gastropoda:

Basommatophora) based on mitochondrial DNA sequences. Journal of Molluscan Studies,

73, 241–257.

Whitlock MC, Lotterhos KE (2015) Reliable detection of loci responsible for local adaptation:

inference of a null model through trimming the distribution of FST. The American Naturalist, 186, S24–S36.

Whitlock M, McCauley D (1999) Indirect measures of gene flow and migration: FST ≠ 1/(4Nm +

1). Heredity, 82, 117–125.

Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

ISBN 978-3-319-24277-4, http://ggplot2.org

Wright S (1950) The genetical structure of populations. The Annals of Human Genetics, 324–

354.

Appendix A: Genomic analysis pipeline

General notes about the project

Below is the pipeline I used to analyze the Pool-seq data. I have generalized each step so that future studentscan adjust it to their projects. For my project, I analyzed five sites of Physella johnsoni and three sites ofPhysella gyrina of 20 to 40 individuals per site. These were sequenced by Genome Quebec on the IlluminaHiSeq X, aiming for about 40x coverage (determined to be almost double that in actuality). Each siteconsisted of half a lane worth of data (total of four lanes).

The pipeline is not linear, in that there are analysis that branches off at certain points. As well, most of thiswas run on Cedar so the accessing of modules reflects that. Some of it was run on ARC though.

I included the SLURMs because they give an idea of how long each step took to run for my files. Remember that my files were half a lane each; a colleague of mine sequenced six pools over one lane and her analysis took a fraction of the time listed below.

Getting started in Cedar

Launch Terminal on a Mac. On Windows, I think an SSH client such as PuTTY can be used instead.

Logging into Cedar

ssh [email protected]
#ex. [email protected]
#You will be prompted for your password
#When you type it in nothing will appear but just hit "enter" when you've typed it in
#and it should log you right in!

Navigating around Cedar and creating a project directory

ls #will show you all the directories in your home folder

#How to make symbolic link to our project folder in def-srogers

rm project #removes the current project directory

ln -s projects/def-srogers/your_user_name project #assigns your account to project

cd project #change the directory you are in to project

pwd #will give you the path to the directory you are in

mkdir project_name #will make a directory in the project directory named "project_name".

cd project_name

mkdir 00_nameofstep1 #this is a good way to keep your steps in order

cd 00*tab* #by using "tab" on your keyboard it will auto-complete the name
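The numbered-directory convention above can also be set up in one go. The following is only a minimal sketch: the step names (raw_data, fastqc, trimmed) are hypothetical, and a scratch directory stands in for the real project folder.

```shell
#!/bin/bash
# Hypothetical step names; replace them with the steps of your own pipeline.
steps=(raw_data fastqc trimmed)

# Work in a scratch directory here; on Cedar this would be your project directory.
project_dir=$(mktemp -d)

# Prefix each step with a two-digit counter so that "ls" lists them in pipeline order.
i=0
for step in "${steps[@]}"; do
    mkdir -p "$project_dir/$(printf '%02d' "$i")_$step"
    i=$((i + 1))
done

ls "$project_dir"
```

Because the counter is zero-padded, step 10 still sorts after step 09 rather than after step 01.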

It is worth spending some time learning Unix commands. I wish I had spent more time doing this, instead of trying to figure it out as I went.

How to make executable codes

You will need to choose a text editor to use on Cedar or Graham. We chose to use GNU nano because it is user and beginner friendly. To make executable codes, do the following:

nano codename
#will open a new, blank, nano with the name "codename"

type code

^X #this will close the code and you can select to save it.
#nano has functions listed at the bottom of the file - "^" is "control"

ls #will list all the files in the directory you are in - should see your code here!

chmod +x codename #this changes the code so that it is now executable, important!
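The same steps can be tried without opening an editor. This is a sketch under stated assumptions: the script body is a made-up one-liner, it is written with a here-document instead of nano, and everything happens in a scratch directory.

```shell
#!/bin/bash
# Scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
script="$workdir/codename"   # "codename" is the placeholder name used above

# Write a one-line script without opening an editor.
cat > "$script" <<'EOF'
#!/bin/bash
echo "hello from codename"
EOF

# A newly created file is normally not executable yet.
[ -x "$script" ] || echo "not executable yet"

chmod +x "$script"           # add the executable permission
output=$("$script")          # the script can now be run by its path
echo "$output"
```

If a job fails straight away with a "Permission denied" message, a forgotten chmod +x is a likely cause.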

How to submit jobs to run on Cedar

To run code on Cedar, you need to submit it as a “job”. The way this was explained to me is that you have to submit your request to the “secretary” of Cedar and they will send your job to the appropriate place. How we do this is with something called “SLURM”.

Create a SLURM

nano SLURM_name

#copy and paste below and change as necessary

^X #close and save SLURM

Example SLURM

#!/bin/bash
#The line above must be the very first line of the file, with nothing after it
# ---------------------------------------------------------------------
# Place you can leave yourself a descriptor of this SLURM
# ---------------------------------------------------------------------
#SBATCH --job-name=Nameofyourjob #Make this descriptive but short!
#SBATCH --account=def-srogers #the account this is under
#SBATCH --cpus-per-task=XX #how many threads you want
#SBATCH --time=0-00:00 #time you want - goes days-hours:minutes
#SBATCH --mem=XXG #amount of memory you want
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
/path/to/thecodeyouwanttorun
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Submit SLURM and check on status

sbatch SLURM_name #will submit job to queue

squeue -u your_user_name #will give you the job's status

scancel job_ID_number #will cancel the job

Once the job starts running, a SLURM out-file will be created in the format of: slurm-job_ID_number.out. This out-file gives you information on what step your job is on and whether or not it completed successfully. If successful it will have “Job finished with exit code 0”. If it doesn’t say that, then it may give you a helpful error code or a non-helpful error code.
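When there are many out-files, checking each one by eye gets tedious. A minimal sketch of an automated check follows; the two out-files are made up for the example, and the helper name job_succeeded is hypothetical. Keep in mind that the "$?" in the final echo only reports the exit code of the last command that ran before it.

```shell
#!/bin/bash
# Report whether a SLURM out-file contains the success line written by the
# final "echo" of the job scripts in this appendix.
job_succeeded() {
    grep -q "Job finished with exit code 0 " "$1"
}

# Two made-up out-files standing in for real slurm-job_ID_number.out files.
workdir=$(mktemp -d)
printf 'Starting run at: Mon\nJob finished with exit code 0 at: Mon\n' > "$workdir/slurm-1.out"
printf 'Starting run at: Mon\nJob finished with exit code 1 at: Mon\n' > "$workdir/slurm-2.out"

for f in "$workdir"/slurm-*.out; do
    if job_succeeded "$f"; then
        echo "$(basename "$f"): OK"
    else
        echo "$(basename "$f"): check this job"
    fi
done
```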

Getting information on the job after it has run

sacct -j job_ID_number --format=JobID,JobName,MaxRSS,Elapsed
#Gives JobID (kind of redundant), the name of the job, the memory and the time it took.

Convert BCL to Fastq - not allowing barcode mismatches

If working with BCL files you must first change them to fastq format. Genome Quebec will give you your files in fastq format. However, they allow one barcode mismatch. At times when you need 100% confidence of sequence assignment to the right population you will need to allow no mismatches. You may be able to ask Genome Quebec to do this on their end but I didn’t know this until after they had done the conversion so I got the BCL files from them.

Manual: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2_guide_15051736_v2.pdf

tiles: these are the tiles that your data is on (because there will likely be other sequence info)

sample-sheet: need to get this from Genome Quebec. Just info about your sequences

-r 32 -p 32 -w 32: reading, processing and writing thread count. I set them all to 32.

use-bases-mask Y151,I6n2,Y151: this is specific about the sequencer - the Y151 is because it is PE 150 reads

Ex. bcl2fastq.1

#the run folder name (171207_E00434_0072_AHGNNHCCXY_4467HS23A) is an identifier - change!
#change --tiles to what tile you are converting - I had s_1 to s_4
#change the SampleSheet depending on tile - there is a SampleSheet for each tile
module load bcl2fastq2/2.20 && \
bcl2fastq \
--runfolder-dir /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A \
--output-dir /path/to/00_raw_data \
--tiles s_1 \
--sample-sheet /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A/SampleSheet.1.csv \
--create-fastq-for-index-reads \
-r 32 -p 32 -w 32 \
--barcode-mismatches 0 --use-bases-mask Y151,I6n2,Y151

#!/bin/bash
# ---------------------------------------------------------------------
# Slurm for bcl2fastq for L001
# ---------------------------------------------------------------------
#SBATCH --job-name=bcl2fastq.1
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:45
#SBATCH --mem=20G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
/path/to/bcl2fastq.1
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Concatenate files together - if sites were run on two or more lanes on the flow cells

My sites were split over two lanes on the flow cells, so when I converted them from BCL to fastq I had four files for each site.

cat XXPool_L001_R1.fq.gz XXPool_L002_R1.fq.gz > XXPool_R1.fq.gz
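Gzip files can be concatenated directly, without unzipping them first. The sketch below checks this on two toy files that stand in for the two lanes of one site; the read names and sequences are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two toy single-read FASTQ files standing in for the two lanes of one site.
printf '@read1\nACGT\n+\nIIII\n' | gzip > XXPool_L001_R1.fq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip > XXPool_L002_R1.fq.gz

# Concatenating gzip files gives a valid gzip file, so no decompression is needed.
cat XXPool_L001_R1.fq.gz XXPool_L002_R1.fq.gz > XXPool_R1.fq.gz

# Sanity check: the merged file should hold the sum of the per-lane line counts.
merged_lines=$(gzip -dc XXPool_R1.fq.gz | wc -l)
echo "lines in merged file: $merged_lines"
```

Concatenate the R1 and R2 files in the same lane order, otherwise the read pairs will no longer line up between the two files.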

FastQC - check the quality of your sequencing reads

FastQC will give you a report on the quality of your sequencing reads. https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf is a pretty good tutorial on how to interpret the reports.

The code below will loop through every .fq.gz file in the directory you tell it to look in and make a report for each.

#!/bin/bash
# ---------------------------------------------------------------------
# FastQC for pool sites
# ---------------------------------------------------------------------
#SBATCH --job-name=fastqc
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-15:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load fastqc/0.11.5
for i in /path/to/fastqfiles/*fq.gz
do
fastqc -o /path/to/where/you/want/reports "$i"
done
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Trimmomatic - cleaning and filtering low-quality reads and removing adaptors

The available Adaptor file doesn’t have all of the adaptors in it. Please contact me if you would like the file we created. We included all of the adaptors used in our library preparation.

phred: can be 64. Depends on your sequencing.

threads: change to the number of threads you have available on your cluster. We used 16, half of a node, but I think we could have used more. Just make sure to match this to what you are asking for in your SLURM.

ILLUMINACLIP: path to the adaptor file you made. 2: seed mismatches. 30: how accurate the match between the two adaptor ligated reads must be. 10: SimpleClip Threshold (I have a doc that goes into more detail if interested)

CROP: we were having an issue in some of our reads where Trimmomatic just wasn’t detecting the repetitive sequences in the last 15 nucleotides. Hence the hard crop to trim the last 15 nucleotides off each read.

LEADING: trim the leading nucleotides if they fall under Q5

TRAILING: trim the trailing nucleotides if they fall under Q5

SLIDINGWINDOW: 5:20 - look at 5 bases at a time and trim if the average Q is less than 20

MINLEN: only keep reads that are at least 100 bp

java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -phred33 -threads 16 -trimlog logfile \
path/to/fastqfiles/XXPool_R1.fq.gz /path/to/fastqfiles/XXPool_R2.fq.gz \
XX_R1_P_qtrim.fq XX_R1_U_qtrim.fq XX_R2_P_qtrim.fq XX_R2_U_qtrim.fq \
ILLUMINACLIP:/path/to/Adaptors/TruSeq3-PE-all.fa:2:30:10 CROP:135 LEADING:5 TRAILING:5 \
SLIDINGWINDOW:5:20 MINLEN:100

#!/bin/bash
# ---------------------------------------------------------------------
# Trim cat files
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_trim
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-20:00
#SBATCH --mem=15G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load trimmomatic/0.36
/path/to/trimcode/trimcode
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
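To get a feel for what the hard crop and the length filter do to individual reads, the two steps can be mimicked on a toy file. This is an illustration only: it uses awk rather than Trimmomatic, it ignores the adaptor and quality steps, and the function name crop_and_filter and the toy reads are made up.

```shell
#!/bin/bash
# Illustration only: mimic CROP:135 followed by MINLEN:100 with awk on a toy FASTQ.
# This is not how Trimmomatic is implemented and it ignores the other steps.
crop_and_filter() {
    # $1 = FASTQ file, $2 = CROP length, $3 = MINLEN
    awk -v crop="$2" -v minlen="$3" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { bases = substr($0, 1, crop) }   # keep the first "crop" bases
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = substr($0, 1, crop)                # crop the qualities the same way
            if (length(bases) >= minlen)              # drop reads that are now too short
                print header "\n" bases "\n" plus "\n" qual
        }' "$1"
}

# Build a toy file: one 150-bp read and one 90-bp read.
workdir=$(mktemp -d)
long=$(printf 'A%.0s' $(seq 1 150))
short=$(printf 'C%.0s' $(seq 1 90))
printf '@long\n%s\n+\n%s\n@short\n%s\n+\n%s\n' \
    "$long" "${long//A/I}" "$short" "${short//C/I}" > "$workdir/toy.fq"

crop_and_filter "$workdir/toy.fq" 135 100 > "$workdir/toy_trimmed.fq"
kept_length=$(sed -n '2p' "$workdir/toy_trimmed.fq" | tr -d '\n' | wc -c)
echo "reads kept: $(( $(wc -l < "$workdir/toy_trimmed.fq") / 4 )), length: $kept_length"
```

The 150-bp read loses its last 15 bases and is kept, while the 90-bp read falls under the 100-bp minimum and is dropped.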

Post-trim FastQC - check the quality of your sequencing reads

Same as above but for the post-trim files. You should see the quality go up and there should be fewer sequences. Look for over-represented k-mers; there shouldn’t be any!

DeconSeq - removing non-snail contaminants from the sequences

Full disclaimer: I made the databases in a variety of ways, as I found out that one way wouldn’t work for every organism.

Step 1.1: How I created the databases for Archaea and Algae (Charophyceae, Chlorophyta, Cryptophyta, Eustigmatophyceae and Klebsormidiophyceae)

Create a database from NCBI of the sequences you would like to remove. You need to get the GI list from NCBI (as per http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html or https://www.biostars.org/p/6528/).

You will need to install the newest version of NCBI BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).

I have heard that moving forward NCBI will be switching to using taxid to create databases rather than GI lists, but at the time of writing GI lists were still being used.

In this example, I am using Archaea. The key thing is that the nt database and your GI list must exist in the same directory. It will be messy, which is why I put the “X” in front of the files I was making so that all of them were at the bottom.

/path/to/ncbi_directory/ncbi-blast-2.7.1+/bin/blastdb_aliastool -db nt -gilist Archaea.gi \
-dbtype nucl -out X_nt_archaea -title "database for Archaea"

I just ran this in Cedar without a SLURM. It takes about 30 seconds to a minute. It creates a .nal file which, if you put “X_nt_archaea” as your database, will mask everything else in the database but those sequences.

Once you’ve created the .nal file, you need to convert the database to a .fasta file. It’s small, but I ran it in a SLURM.

#!/bin/bash
# ---------------------------------------------------------------------
# NCBI to fasta
# ---------------------------------------------------------------------
#SBATCH --job-name=Charo_fasta
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-05:00
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
/path/to/ncbi/ncbi-blast-2.7.1+/bin/blastdbcmd -entry all -db X_nt_archaea -out Archaea.fasta
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Step 1.2: Creating the database for Threespine stickleback and Human

Threespine stickleback downloaded from: https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017)

Human genome was downloaded from: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch.

I used the tutorial provided by DeconSeq to access the human genome.

#Download sequence data
for i in {1..22} X Y MT; do wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_chr$i.fa.gz; done

#Extracting and joining data
for i in {1..22} X Y MT; do gzip -dvc hs_ref_GRCh38.p12_chr$i.fa.gz >> hs_ref_GRCh38_p12.fa; rm hs_ref_GRCh38.p12_chr$i.fa.gz; done
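Before starting a long download it can help to list what the loop will actually request. The sketch below is a dry run that builds the file names with the same brace expansion but never calls wget; the array name chromosome_files is made up for the example.

```shell
#!/bin/bash
# Dry run: list the files the download loop would request without calling wget.
base_url="ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq"

chromosome_files=()
for i in {1..22} X Y MT; do
    chromosome_files+=("hs_ref_GRCh38.p12_chr$i.fa.gz")
done

echo "files to download: ${#chromosome_files[@]}"
echo "first: $base_url/${chromosome_files[0]}"
echo "last:  $base_url/${chromosome_files[${#chromosome_files[@]}-1]}"
```

The loop covers 22 autosomes plus X, Y and MT, so 25 files should end up joined into hs_ref_GRCh38_p12.fa.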

Step 1.3: Creation of the Bacterial database

I downloaded all of the full bacterial genomes off of NCBI (I think there were about 10,000 of them?) and all assembly levels for bacteria that have been found in the thermal springs onto a computer (many Gb). I then used Globus to put them on to Cedar. https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria

You will need to unzip any files that are zipped (including your query sequences) because DeconSeq can’t use zipped files.

Can use:

for file in *.gz #loop through all files with this file extension
do
gunzip $file #unzip them
done #when it's finished, stop.

This had to be done for all of the bacterial genomes. It took 10+ hours, so I would consider submitting it as a SLURM.

Then I concatenated the bacterial genomes together, using:

#!/bin/bash
# ---------------------------------------------------------------------
# cat FG NCBI database
# ---------------------------------------------------------------------
#SBATCH --job-name=cat_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=1G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
find . -name '*.fna' -print0 | xargs -0 cat > bacteria_genomes.fa
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
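With thousands of genomes it is worth confirming that nothing was lost during concatenation. A minimal sketch of such a check, using toy .fna files with made-up names in place of the downloaded genomes:

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy genomes in nested directories, standing in for the downloaded .fna files.
mkdir -p genomeA genomeB
printf '>A_contig1\nACGT\n>A_contig2\nGGCC\n' > genomeA/a.fna
printf '>B_contig1\nTTAA\n' > genomeB/b.fna

# Count the FASTA headers in the individual files...
expected=$(find . -name '*.fna' -exec grep -h '^>' {} + | wc -l)

# ...concatenate exactly as in the job script...
find . -name '*.fna' -print0 | xargs -0 cat > bacteria_genomes.fa

# ...and confirm that the combined file holds the same number of sequences.
observed=$(grep -c '^>' bacteria_genomes.fa)
echo "expected $expected sequences, found $observed"
```

The output file ends in .fa rather than .fna, so it is not picked up by the find pattern if the command is run a second time in the same directory.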

Step 2: Splitting sequences by long repeats of ambiguous base N (this is from the DeconSeq manual)

Below is all for the human genome because that is what is listed in the manual, but I did this for all of the databases.

cat hs_ref_GRCh38_p12.fa | perl -p -e 's/N\n/N/' | perl -p -e 's/^N+//;s/N+$//;s/N{200,}/\n>split\n/' > hs_ref_GRCh38_p12_split.fa; rm hs_ref_GRCh38_p12.fa

Step 3: Filtering databases

Need to download and install PRINSEQ - can be found at https://sourceforge.net/projects/prinseq/files/

This step needs a SLURM because it will run out of memory otherwise.

#!/bin/bash
# ---------------------------------------------------------------------
# Filtering sequences - PRINSEQ
# ---------------------------------------------------------------------
#SBATCH --job-name=human_prinseq
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-0:10
#SBATCH --mem=10G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
perl /path/to/prinseq-lite-0.20.4/prinseq-lite.pl -log -verbose \
-fasta hs_ref_GRCh38_p12_split.fa -min_len 200 -ns_max_p 10 -derep 12345 \
-out_good hs_ref_GRCh38_p12_split_prinseq -seq_id hs_ref_GRCh38_p12_ \
-rm_header -out_bad null
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Step 4: Index the databases

For the bacterial database (because it is over 200Gb) you will need to first split it into manageable chunks before BWA can use them. You can use fasta splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/).

The files need to be under 3Gb each, so split it into as many chunks as you need. I did 100.

#!/bin/bash
# ---------------------------------------------------------------------
# FASTA splitter
# ---------------------------------------------------------------------
#SBATCH --job-name=fasta_split_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=100G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
perl /pathto/fasta-splitter.pl --n-parts 100 Bacteria_FG_split_prinseq.fasta
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
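The number of parts follows from a simple ceiling division of the database size by the size limit. A rough sketch, assuming the splitter produces parts of roughly equal size and using the approximate figures from the text (the helper name min_parts is made up):

```shell
#!/bin/bash
# Smallest number of parts that keeps every part no larger than a size limit,
# assuming the splitter produces parts of roughly equal size.
min_parts() {
    local total_gb=$1 limit_gb=$2
    echo $(( (total_gb + limit_gb - 1) / limit_gb ))   # integer ceiling division
}

# Figures from the text: a database of about 200 Gb and a 3 Gb limit per part.
echo "minimum parts: $(min_parts 200 3)"
```

Under these assumptions the minimum is 67 parts, so the 100 used here leaves a comfortable margin for a database that is somewhat larger than 200 Gb or for parts of uneven size.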

The next step is to index the databases. You must use the BWA provided in the DeconSeq package! The newer BWA reads the files incorrectly for this and will only produce 5 of the 8 outfiles necessary. Cedar is a 64 bit Linux system, so use bwa64. I know that the top examples have been using the human (sorry for the lack of consistency) but the bacterial one needed some extra things to make it run. I didn’t want to run 100 SLURMs so this is a batch job!

Modified from script kindly provided by Dr. Stefan Dennenmoser

#!/bin/bash
# ---------------------------------------------------------------------
# BWA
# ---------------------------------------------------------------------
#SBATCH --job-name=index_bac
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=0-10:00
#SBATCH --mem-per-cpu=20G
#SBATCH --array=1-100
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
cd /path/to/Bacterial_Database
filename=`ls -1 *.fasta* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`
filename2=${filename::-6} #the filename without the .fasta part (-6 letters)
/path/to/deconseq-standalone-0.4.3/bwa64 index -p $filename2 -a bwtsw $filename > bwa.log 2>&1
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
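The file selection in an array job can be tested locally before submitting 100 tasks, by setting SLURM_ARRAY_TASK_ID by hand. The sketch below uses made-up part names, and it strips the extension with ${filename%.fasta}, which gives the same result as removing the last 6 characters for names ending in .fasta.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three empty files standing in for the parts written by fasta splitter
# (the part names here are made up for the example).
touch db.part-1.fasta db.part-2.fasta db.part-3.fasta

# SLURM sets this variable inside an array job; set it by hand to test locally.
SLURM_ARRAY_TASK_ID=2

# Same selection logic as the job script: take the Nth file in "ls" order.
filename=$(ls -1 *.fasta* | tail -n +"${SLURM_ARRAY_TASK_ID}" | head -1)
filename2=${filename%.fasta}   # the filename without the trailing ".fasta"

echo "task $SLURM_ARRAY_TASK_ID indexes $filename with prefix $filename2"
```

The order is alphabetical, so with 100 parts the file picked by task 2 is not necessarily part 2, but every file is still picked by exactly one task as long as the array range matches the number of files.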

Step 5: Configure the DeconSeq file

You will have to go into the installed DeconSeq directory and set up the configuration file DeconSeqConfig.pm. You will just need to change the database directory, the out directory and which files are accessed in the “use constant DBS”.

Step 6: Split the query sequences

DeconSeq is unable to handle big datasets (e.g. 50+ Gb files that I was using)

Will need to use fastq splitter (http://kirill-kryukov.com/study/tools/fastq-splitter/)

I don’t think I ran this in a SLURM and just used the console. If it fails... put it in a SLURM.

perl /path/to/fastq-splitter.pl --n-parts 50 --check XX_R1.fq

Step 7: ACTUALLY RUNNING DECONSEQ

For DeconSeq you need to choose the identity (-i), which is the percent match between your query sequence and the database, and the coverage (-c), which is the percentage of the query sequence that aligns.

I went with the parameters they used in their paper and based on what I had seen other people do, which was 94% identity (-i 94) and 90% to 95% coverage (-c 90 or -c 95).

Submit this as a batch job.

#!/bin/bash
# ---------------------------------------------------------------------
# Deconseq
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_decon
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=3-00:00
#SBATCH --mem-per-cpu=7500MB
#SBATCH --array=1-50
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load perl/5.22.4
#put all of the split files from one pop and read direction in its own directory
cd /path/to/trimmed_seq/XX_R1_split
filename=`ls -1 *.fq* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`
perl /path/to/deconseq-standalone-0.4.3/deconseq.pl -i 94 -c 90 \
-f /path/to/trimmed_seq/XX_R1_split/$filename -dbs hsref \
-out_dir /path/to/deconseq_out/XX_R1 -id $filename
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Step 8: Creating paired files again

DeconSeq was never designed for paired-end sequencing. Therefore it will process each read direction separately, and this causes sequences to be removed in one direction and not the other.

Firstly, concatenate the 50 clean files (or more or fewer, depending on how many parts you split your input files into). I also did this for the .cont files because I wanted to keep them and see how many sequences were removed.

Then you can use this person’s script found at: https://github.com/linsalrob/fastq-pair

Step 1: Clone or download - copy URL

Step 2: In Cedar or ARC or whatever cluster, type git clone followed by the URL you copied - you should see a directory called “fastq-pair”

Step 3: gcc fastq-pair/*.c -o fastq_pair

Step 4: Should see an executable called “fastq_pair”

Running it is super simple.

path/to/fastq_pair -t VALUE path/to/file_R1.fastq path/to/file_R2.fastq

Where VALUE is roughly the number of sequences in your file. This sets the hash table size. For some reason, I couldn’t get this to run as a SLURM. It would make the outfiles (if I gave it over 50Gb) but it would run for far longer than it runs in the terminal and just never finish. I ended up running it in ARC’s console because Cedar doesn’t have enough resources allocated to its console.
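A quick way to get a sensible VALUE is to count the reads in one of the input files. The sketch below assumes an uncompressed FASTQ file with exactly four lines per read; the toy file and the helper name count_reads are made up.

```shell
#!/bin/bash
# Number of reads in an uncompressed FASTQ file: four lines per read.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Toy file with three reads standing in for path/to/file_R1.fastq.
workdir=$(mktemp -d)
for n in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$n"
done > "$workdir/file_R1.fastq"

table_size=$(count_reads "$workdir/file_R1.fastq")
echo "fastq_pair -t $table_size path/to/file_R1.fastq path/to/file_R2.fastq"
```

Running the same count on the files before and after DeconSeq also shows how many reads were removed.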

It makes four output files: R1 paired and unpaired, and R2 paired and unpaired. Then you will need to zip the files back up!

DISCOVAR de novo - assembling a reference genome

DISCOVAR de novo is very easy to use but there are downstream challenges of not having a reference genome. As well, it was designed for paired-end sequences of 250 bp sequenced from one individual prepared PCR-free. In my thesis you can see the impact of breaking these assumptions in the quality of the genome produced. If it is at all possible to run an individual with at least two insert sizes (ex. 1 kb and 5,000 kb), you will likely be able to generate a much more robust assembly. Ex. Schell et al. 2017.

You can download DISCOVAR denovo from here: https://software.broadinstitute.org/software/discovar/blog/?page_id=98. There are two ways of getting info from the web onto Cedar (probably lots more, but these are the ones I use).

Option 1: Go to the link you want to download. Right click and “Copy Link Address”. Go to Cedar. Do below.

wget ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/latest_source_code/LATEST_VERSION.tar.gz

Option 2: Can download to your personal computer (if the file isn’t too big) and then use Globus to transfer it from your computer to Cedar. Globus is supported by Compute Canada. You will have to log in, download Globus to your computer and make your computer an endpoint.

Once you download DISCOVAR denovo, you will have to unzip it.

A little aside... Clumpify (belongs to the BBMap/BBTools package)

Before I assembled the sequences, I used Clumpify to remove PCR duplicates (unlike Picard it doesn’t need a reference genome, however I don’t think it is as robust) because the library prep is intended to be PCR free. This is seen to improve the quality of the assembly. I also tried BBNORM to normalize the sequencing depth at around 60x coverage. It decreased the N50 but increased the MPL1 and decreased the estimated chimera rate. In hindsight, I think I should have investigated this assembly more and maybe used it.

dedupe: The command to remove duplicates

subs=2: This means that there can be two substitutions between the compared sequences and they will still be considered duplicates.

clumpify.sh in1=/home/youraccount/path/to/trimmed.fq.gz/XX_R1_P.fq.gz \
in2=/path/to/trimmed.fq.gz/XX_R2_P.fq.gz out1=/path/to/Clumpify/XX_R1_P_nodup.fq.gz \
out2=/path/to/Clumpify/XX_R2_P_nodup.fq.gz dedupe subs=2

#!/bin/bash
# ---------------------------------------------------------------------
# Removing duplicates from XX allowing two subs
# ---------------------------------------------------------------------
#SBATCH --job-name=nodup_XX
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:30
#SBATCH --mem=150G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load intel/2016.4
module load bbmap/37.36
/home/youraccount/Clumpify/code/XX_clump
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Now that we have files with duplicates removed, back to DISCOVAR de novo

DISCOVAR de novo takes A LOT of memory.

DiscovarDeNovo READS=/path/to/fastq/files/you/want/to/use/*fq.gz \
OUT_DIR=/path/to/referencegenome \
MAX_MEM_GB=1450 NUM_THREADS=32

#!/bin/bash
# ---------------------------------------------------------------------
# Assembly of genome (using 1 site)
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_ref
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-30:00
#SBATCH --mem=1400G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
path/to/code/DDN_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

BWA (Burrows-Wheeler Aligner) - aligning sequences to the reference genome

Step 1: Index the reference

This step will create a bunch of index files using the name “reference_genome” as the name of the file

bwa index -p reference_genome /path/to/reference/reference_genome.fa

#!/bin/bash
# ---------------------------------------------------------------------
# BWA
# ---------------------------------------------------------------------
#SBATCH --job-name=ref_index
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1
module load bwa/0.7.17
/path/to/code/bwa_index_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Step 2: Align to reference

Details about parameters can be found here: http://bio-bwa.sourceforge.net/bwa.shtml

-M: mark shorter split hits as secondary (necessary for using the file in Picard downstream)

-t 16: number of threads. Match with SLURM

-R: complete read group header line

XX: need to change XX to whatever site you are currently working on.

The reference genome needs to be put in without the “.fa” because it will be using all of the indexes that are in the directory too.

bwa mem -M -t 16 -R '@RG\tID:XX\tLB:XX\tSM:XX\tPL:ILLUMINA' \
/path/to/ref/reference_genome \
path/to/XX_R1.fq.gz path/to/XX_R2.fq.gz > XX_out.sam

#!/bin/bash
# ---------------------------------------------------------------------
# sam
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_align
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-05:00
#SBATCH --mem=10G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1
module load bwa/0.7.17
/path/to/code/align_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
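Since the read group header has to be edited for every site, it is easy to mistype by hand. A small sketch that builds the -R string from a site label follows; the site labels SiteA and SiteB and the helper name read_group are hypothetical, and the script only prints the command rather than running bwa.

```shell
#!/bin/bash
# Build the -R read group string for one site. The "\t" stays as two literal
# characters (backslash and t), which is the form used in the bwa mem command above.
read_group() {
    local site=$1
    printf '@RG\\tID:%s\\tLB:%s\\tSM:%s\\tPL:ILLUMINA\n' "$site" "$site" "$site"
}

# Hypothetical site labels; use your own pool names.
for site in SiteA SiteB; do
    echo "bwa mem -M -t 16 -R '$(read_group "$site")' ..."
done
```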

Samtools - sort the .sam file into a .bam by chromosome number

There are two options for sorting but it needs to be sorted by chromosome for downstream applications.

-@: this argument is where you set the number of threads

-T: this argument sets the prefix for the temporary files written while sorting - change it to whichever site you are currently working on

-o: write to this outfile

samtools sort -@ 16 -T XX -o XX.bam XX_out.sam

#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Sort
# ---------------------------------------------------------------------
#SBATCH --job-name=XX.bam
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/samsort_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Samtools - filter low quality reads and where only one read mapped

Details can be found at: http://www.htslib.org/doc/samtools.html

Need to remove reads where only one read of the pair mapped, duplicates and low alignment quality reads (Q20). I did this in three steps, where I removed reads with only one mate mapped first, then removed duplicates, and then filtered for Q20.

-@ : number of threads

-f 2: this means only keep reads whose flag marks them as mapped in a proper pair

-o: write to this outfile

samtools has good documentation online. Be advised that they updated the package in April 2018! Just need to look at the right info.

samtools view -@ 16 -f 2 -o XX_rm1mate.bam /path/to/XX.bam

#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Sort
# ---------------------------------------------------------------------
#SBATCH --job-name=XX_rm_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5
/path/to/code/rm1mate_code
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------


Picard - removing duplicates

Picard removes duplicates using the MarkDuplicates function. You need to add REMOVE_DUPLICATES=TRUE to remove them rather than only mark them. GATK suggests that you keep the marked duplicates, but PoPoolation1 and 2 need them removed.

java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
INPUT=/path/to/XX_rm1mate.bam OUTPUT=XX_rm1mate_nodup.bam \
METRICS_FILE=XX_rm1mate_Q20_nodup.txt REMOVE_DUPLICATES=TRUE

#!/bin/bash
# ---------------------------------------------------------------------
# Picard Dup removal
# ---------------------------------------------------------------------

#SBATCH --job-name=XX_nodup_rm1mate_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=40G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load picard/2.17.3

/path/to/code/removeduplicates_code

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Flagstat - get stats on your alignment

Flagstat will give you stats on the alignment, like percent mapped.

#!/bin/bash
# ---------------------------------------------------------------------
# Flagstat
# ---------------------------------------------------------------------

#SBATCH --job-name=XX_flagstat
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-00:20
#SBATCH --mem=1G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5

samtools flagstat path/to/XX_rm1mate_nodup.bam

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
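If you want the percent mapped as a single number (for a table, say), it can be pulled out of the flagstat text. The sketch below is one way to do that, assuming the usual flagstat layout of "count + count description" per line; the helper names and the example counts are made up for illustration and are not results from this study.

```shell
#!/bin/bash
# Hypothetical helper: read `samtools flagstat` text on stdin and print the
# percentage of QC-passed reads that mapped.
percent_mapped() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }   # "N + M in total (...)"
        $4 == "mapped"              { mapped = $1 }  # "N + M mapped (...)"
        END {
            if (total > 0) {
                printf "%.2f\n", 100 * mapped / total
            } else {
                print "no reads found" > "/dev/stderr"
                exit 1
            }
        }'
}

# Made-up flagstat output, trimmed to a few lines.
example_flagstat() {
    cat <<'EOF'
10000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 duplicates
9845 + 0 mapped (98.45% : N/A)
10000 + 0 paired in sequencing
EOF
}

example_flagstat | percent_mapped
```

In a real job the input would come from the flagstat command itself, for example by piping its output into percent_mapped.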

Picard - Validate the .bam file before moving forward

Check that the .bam file is not broken.

java -jar $EBROOTPICARD/picard.jar ValidateSamFile I=/path/to/XX_rm1mate_nodup.bam MODE=SUMMARY

#!/bin/bash
# ---------------------------------------------------------------------
# Validate Bam Test
# ---------------------------------------------------------------------

#SBATCH --job-name=Vali
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-01:00
#SBATCH --mem=15G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load picard/2.17.3

/path/to/code/validate_code

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------


Samtools - Q20 filter

-q 20: remove any read with a mapping quality (MAPQ) of less than 20: “Minimum mapping quality for an alignment to be used”

samtools view -@ 16 -q 20 -o XX_rm1mate_Q20_nodup.bam /path/to/XX_rm1mate_nodup.bam
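To make the filter concrete: the mapping quality is the fifth column of each alignment record, and -q keeps records at or above the cutoff. The sketch below mimics that on SAM text with awk. The helper names and the three records are made up for illustration; on real .bam files samtools should do the filtering.

```shell
#!/bin/bash
# Hypothetical helper: keep SAM header lines (starting with @) and alignment
# records whose MAPQ (column 5) is at least the cutoff given as $1.
filter_mapq() {
    local min_q=$1
    awk -F'\t' -v q="$min_q" '/^@/ || $5 >= q'
}

# Made-up SAM text: one header line and three truncated records
# with MAPQ 60, 19 and 20.
sample_sam() {
    printf '@SQ\tSN:scaffold1\tLN:1000\n'
    printf 'read1\t99\tscaffold1\t10\t60\t5M\n'
    printf 'read2\t99\tscaffold1\t20\t19\t5M\n'
    printf 'read3\t99\tscaffold1\t30\t20\t5M\n'
}

# read2 (MAPQ 19) is dropped; read1 and read3 pass the Q20 cutoff.
sample_sam | filter_mapq 20
```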

#!/bin/bash
# ---------------------------------------------------------------------
# Samtools Sort
# ---------------------------------------------------------------------

#SBATCH --job-name=XX_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5

/path/to/code/Q20_code

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Samtools - mpileup

Information about parameters can be found: http://www.htslib.org/doc/samtools.html

The mpileup is a file type that contains base-pair information at each chromosomal position.

In the code below you can add the -f argument followed by the path to the reference genome. This will give you the reference genome info in the mpileup file. I didn’t use this because I don’t really care what my reference is. In this step all of the .bam files will be combined into one mpileup (in my case, sites 1 through 8). For PoPoolation1, you need to make an mpileup for each site, which will be outlined later in this pipeline.

-B: disables BAQ (base alignment quality) computation. Necessary for PoPoolation2.

-o: write to this outfile

This example has three sites (XX, YY and ZZ) being combined into one mpileup:

samtools mpileup -B \
XX_rm1mate_Q20_nodup.bam YY_rm1mate_Q20_nodup.bam ZZ_rm1mate_Q20_nodup.bam \
-o ref_allpools.mpileup
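A quick sanity check on a combined mpileup is to confirm that it contains as many pools as .bam files went in. The sketch below assumes the standard samtools mpileup text layout: three fixed columns (sequence name, position, reference base) followed by three columns per .bam file (depth, bases, base qualities). The helper name count_pools and the single example line are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: infer the number of pools from the first line of an
# mpileup file (3 fixed columns, then 3 columns per .bam file).
count_pools() {
    awk -F'\t' 'NR == 1 { print (NF - 3) / 3; exit }' "$1"
}

# Made-up one-line mpileup for three pools; the reference base is N
# because no reference was supplied with -f.
demo=$(mktemp)
printf 'scaffold1\t100\tN\t3\tAAA\tIII\t2\tAT\tII\t4\tTTTT\tIIII\n' > "$demo"

count_pools "$demo"   # 12 columns in total, so (12 - 3) / 3 pools
rm "$demo"
```

For the three-site example above the answer should be 3; for the full run in this pipeline it should be 8.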


#!/bin/bash
# ---------------------------------------------------------------------
# Samtools mpileup
# ---------------------------------------------------------------------

#SBATCH --job-name=tut_allpools_mpileup
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-06:00
#SBATCH --mem=1G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5

/path/to/code/mpileup_code

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

PoPoolation2 - convert mpileup to msync

The msync format is the file type that PoPoolation2 needs to do all of its analysis. It is also one of the formats that Poolfstat accepts as its input. You will need to download PoPoolation2 into your home directory (https://sourceforge.net/p/popoolation2/wiki/Main/).

--min-qual 20: filter anything that has a base quality of less than 20

--threads 8: the number of threads you will be using. I have this faint memory (why didn’t I type these notes as I went?) that this step goes a little squirrely if threaded higher than 8.

java -jar /home/youraccount/popoolation2_1201/mpileup2sync.jar \
--input path/to/ref_allpools.mpileup --output ref_allpools.sync \
--fastq-type sanger --min-qual 20 --threads 8

#!/bin/bash
# ---------------------------------------------------------------------
# mpileup2sync
# ---------------------------------------------------------------------

#SBATCH --job-name=mpileup2sync
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=8
#SBATCH --time=0-2:00
#SBATCH --mem=15G

# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------

module load nixpkgs/16.09
module load java/1.8.0_121

/path/to/code/mpileup2sync_code

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

The slurm.out will report exit code 1. Check that the last chromosome in the sync file matches the last chromosome in the mpileup file; if it does, you’re ok.
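The check described above can be scripted rather than done by eye. The sketch below compares the sequence name on the last line of each file; the helper names and the tiny demo files are assumptions made for illustration, and in practice the two arguments would be your own mpileup and sync files.

```shell
#!/bin/bash
# Hypothetical helpers: print the sequence name (column 1) on the last line
# of a tab-separated file, and compare that name between two files.
last_contig() {
    tail -n 1 "$1" | cut -f 1
}

same_last_contig() {
    [ "$(last_contig "$1")" = "$(last_contig "$2")" ]
}

# Tiny made-up mpileup and sync files for a single pool.
tmpdir=$(mktemp -d)
printf 'scaffold1\t10\tN\t2\tAA\tII\nscaffold2\t5\tN\t3\tTTT\tIII\n' > "$tmpdir/demo.mpileup"
printf 'scaffold1\t10\tN\t2:0:0:0:0:0\nscaffold2\t5\tN\t0:3:0:0:0:0\n' > "$tmpdir/demo.sync"

if same_last_contig "$tmpdir/demo.mpileup" "$tmpdir/demo.sync"; then
    echo "last sequence matches: $(last_contig "$tmpdir/demo.sync")"
else
    echo "last sequence differs: the sync file may be truncated"
fi
rm -r "$tmpdir"
```

This only compares the sequence name. Comparing the position in column 2 as well would be a stricter test, though positions that fail the quality filter may legitimately be absent from the sync file.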

Poolfstat - pairwise Fst

I originally used PoPoolation2 to calculate pairwise FST but, for reasons given in my thesis, I think it is biased. So, I went with Poolfstat. It is an R package and it’s so fast. It took about 4 hours to bring the data into R and then under ten minutes to run the calculation.

library(poolfstat)

#From their paper, which was re-analyzing Dennenmoser et al. 2017
#- where he had four populations of n=44 of prickly sculpin

pool.data = popsync2pooldata(sync.file="./file", poolsizes=c(44,44,44,44),
    poolnames=c("FE","CR","PI","HZ"), min.rc=1, min.cov.per.pool=10,
    max.cov.per.pool=300, min.maf=0.01, noindel=TRUE)

#I used:
min.rc=1
min.cov.per.pool=20
max.cov.per.pool=200
min.maf=0.05
noindel=TRUE

Once the data is in, you are set to run pairwise Fst.

PW_fst <- computePairwiseFSTmatrix(pool.data, method = "Anova",
    min.cov.per.pool=20, max.cov.per.pool=200, min.maf=0.05,
    output.snp.values = TRUE)

I saved the $PairwiseFSTmatrix and $NbOfSNPs components of the output as their own files because I wanted to keep them. The $PairwiseFSTmatrix file is necessary to do the below.


Visualizing pairwise Fst

To visualize the pairwise FST distance between each population, I used a Principal Coordinate Analysis. There is a really good blog post describing the difference between PCoA and PCA here: http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/.

#read in the pairwise FST generated by Poolfstat
pcoa <- read.csv("PW_fst_matrix.csv", header=FALSE)

#make it a matrix
pcoa.matrix <- data.matrix(pcoa)

#calculate the euclidean distance between the pairwise FST
euc.matrix <- dist(pcoa.matrix,'euclidean')

library(labdsv) #package used for PCoA

pco <- pco(euc.matrix, k =2) #calculate the pco for euc distance

#sum the eigenvalues (8 populations)
sumeigen=sum(pco$eig[1:8])

eig1=pco$eig[1]
eig2=pco$eig[2]
perc_eig1=eig1/sumeigen #proportion explained by axis 1
perc_eig2=eig2/sumeigen #proportion explained by axis 2

plot(pco) #will give you a not pretty figure

#the data is stored as character so this and below deals with that.
pco.ggplot<-data.frame(cbind(c("Pop1", "Pop2",..."Pop8"), as.numeric(pco$points[,1]),
    as.numeric(pco$points[,2])))

pco.ggplot$X2<-as.numeric(as.character(pco.ggplot$X2))
pco.ggplot$X3<-as.numeric(as.character(pco.ggplot$X3))

colnames(pco.ggplot) <- c("site", "PCoA1", "PCoA2")

library(ggrepel)
library(RColorBrewer)
library(ggplot2)

tiff('PCoA', units="in", width=10, height=5, res=300)
ggplot(pco.ggplot, aes(x=PCoA1, y=PCoA2)) +
  geom_point(colour="chartreuse4") +
  geom_point(data=pco.ggplot[c(3, 4, 7), ], aes(x=PCoA1, y=PCoA2), colour="purple4") +
  geom_label_repel(aes(label = site), size = 3, hjust = 0, nudge_x = 0.003,
                   nudge_y = - 0.00, colour="chartreuse4", show.legend = FALSE) +
  theme(legend.position="none") +
  geom_label_repel(data=pco.ggplot[c(3, 4, 7), ],
                   aes(label = site, x=PCoA1, y=PCoA2), colour="purple4",
                   size = 3, hjust = 0, nudge_x = 0.003, nudge_y = - 0.00, show.legend = FALSE) +
  theme_bw() +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"),
        panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_line(colour = "black")) +
  labs(x = "PCoA 1 (XX.XX%)", y = "PCoA 2 (XX.XX%)") +
  scale_x_continuous(breaks=c(-0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0),
                     labels=waiver()) +
  coord_fixed()
dev.off()

#I am going to be honest here with this ggplot code. It works.
#It ain't pretty and I just kept adding to it until it did a thing and now I am afraid to touch it.
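The XX.XX% placeholders in the axis labels are the share of the summed eigenvalues that each axis accounts for, expressed as a percentage. The arithmetic is sketched below with awk on a made-up list of eigenvalues. The helper name percent_explained and the values 6, 3 and 1 are illustrative assumptions, not output from this analysis.

```shell
#!/bin/bash
# Hypothetical helper: read one eigenvalue per line on stdin and print, for
# each, its percentage of the summed eigenvalues (two decimal places).
percent_explained() {
    awk '{ eig[NR] = $1; total += $1 }
         END { for (i = 1; i <= NR; i++) printf "%.2f\n", 100 * eig[i] / total }'
}

# Made-up eigenvalues: the first axis holds 6 of the total 10.
printf '%s\n' 6 3 1 | percent_explained
```

The first two lines of the output are the numbers that would go into the "PCoA 1" and "PCoA 2" axis labels.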

PoPoolation1 - Nucleotide Diversity

The first step here is to make an mpileup for each of your sites, instead of one with all of them. For example, in the above, you would just specify site XX.

--fastq-type sanger: even though our data is Illumina, we had to set it as sanger. This is because the quality scores are encoded as Phred+33, not Phred+64.

--pool-size: this should equal the number of individuals x 2 (for diploids) in the site. I have also read that it is ok to put the number of individuals, and that it makes so little difference they are considering taking that parameter out.

--min-count: minimum count of the minor allele for a site to be considered a SNP

--min-coverage: the minimum coverage of a SNP for it to be used in analysis

--max-coverage: the maximum coverage of a SNP for it to be used in analysis

--window-size: size of the window

perl /home/youraccount/popoolation_1.2.2/Variance-sliding.pl --measure pi \
--input /path/to/XX_rm1mate_Q20_nodup.mpileup --output XX_pi.file \
--snp-output XX_pi.snps --fastq-type sanger --pool-size 40 --min-count 4 \
--min-coverage 20 --max-coverage 200 --window-size 250 --step-size 250
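The difference between the two encodings is only the offset subtracted from each quality character's ASCII code. A minimal sketch of the Phred+33 ("sanger") decoding is below. The helper name phred33 is made up for illustration, and the two characters are arbitrary examples.

```shell
#!/bin/bash
# Hypothetical helper: convert one FASTQ quality character to its Phred
# score under the Phred+33 ("sanger") encoding.
phred33() {
    local ascii
    ascii=$(printf '%d' "'$1")   # a leading quote makes printf return the ASCII code
    echo $(( ascii - 33 ))
}

phred33 'I'   # ASCII 73 -> Phred 40
phred33 '5'   # ASCII 53 -> Phred 20
```

Under Phred+64 the same characters would be decoded by subtracting 64 instead, which is why picking the wrong --fastq-type shifts every base quality by 31.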

This will give you a .file and a .snps file as the outfiles. The .file can be loaded into R and used to calculate mean nucleotide diversity and standard deviation.

XX_w250_pi <- read.delim("./XX_w250_pi.file", header = FALSE, sep = "\t", dec = ".")

colnames(XX_w250_pi) <- c("chr", "position", "Num.of.SNPs", "frac.of.cov", "pi")

#windows without a pi estimate become NA here
XX_w250_pi$pi<-as.numeric(as.character(XX_w250_pi$pi))

mean(XX_w250_pi$pi, na.rm=TRUE)
sd(XX_w250_pi$pi, na.rm=TRUE)

pi_matrix <- matrix(c("Pop1", "Pop2",..."Pop8", pi_1, pi_2, ...pi_8), nrow = 8, ncol = 2)

colnames(pi_matrix) <- c("Site", "NucleotideDiversity")

pi_DF <- as.data.frame(pi_matrix)

pi_DF$NucleotideDiversity <- as.numeric(as.character(pi_DF$NucleotideDiversity))

library(ggplot2)

tiff('pi_figure.tiff', units="in", width=5, height=5, res=300)
ggplot(data=pi_DF, aes(x=Site, y=NucleotideDiversity)) +
  geom_point(colour="purple4", size = 3) +
  labs(y= "Nucleotide Diversity") +
  geom_point(data=pi_DF[c(1, 2, 3, 4, 5), ], aes(x=Site, y=NucleotideDiversity),
             colour="chartreuse4", size = 3) +
  theme_bw() +
  ylim(0, 0.006) +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"),
        panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
dev.off()

#Same as above regarding the ggplot.
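For a quick look on the cluster before moving anything into R, the per-site mean and standard deviation of pi can also be computed with awk. The sketch below assumes the five-column, tab-separated window file described above, with pi in column 5 and "na" for windows that have no estimate. The helper name pi_mean_sd and the four demo windows are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: mean and sample standard deviation (n - 1 denominator)
# of the pi column (column 5), skipping windows reported as "na".
pi_mean_sd() {
    awk -F'\t' '
        $5 != "na" { n++; sum += $5; sumsq += $5 * $5 }
        END {
            if (n < 2) {
                print "need at least two windows with a pi value" > "/dev/stderr"
                exit 1
            }
            mean = sum / n
            sd = sqrt((sumsq - n * mean * mean) / (n - 1))
            printf "%.4f\t%.4f\n", mean, sd
        }' "$1"
}

# Made-up window file: three windows with a pi value and one "na" window.
demo=$(mktemp)
printf 'scaffold1\t125\t2\t1.000\t0.002\n'  > "$demo"
printf 'scaffold1\t375\t4\t1.000\t0.004\n' >> "$demo"
printf 'scaffold1\t625\t0\t0.100\tna\n'    >> "$demo"
printf 'scaffold1\t875\t5\t1.000\t0.006\n' >> "$demo"

pi_mean_sd "$demo"   # prints the mean and SD separated by a tab
rm "$demo"
```

The n - 1 denominator matches what sd() does in R, so the two approaches should agree apart from rounding.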


Appendix B: DNA and sequencing quality

Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.


Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).


Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post-Trimmomatic).


Appendix C: PoPoolation2 pairwise FST estimates

Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250 bp side-by-side windows, with a minor allele count of 8, minimum coverage of 15 and maximum coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Population  J1     J2     J3     J4     J5     G1     G2     G3
J1          0      0.070  0.065  0.075  0.076  0.211  0.335  0.318
J2                 0      0.044  0.033  0.074  0.237  0.355  0.340
J3                        0      0.044  0.061  0.213  0.336  0.320
J4                               0      0.074  0.233  0.351  0.333
J5                                      0      0.205  0.333  0.318
G1                                             0      0.237  0.217
G2                                                    0      0.167
G3                                                           0