GBS Usage Cases: Non-model Organisms
Katie E. Hyma, PhD
Bioinformatics Core
Institute for Genomic Diversity
Cornell University
Q: How many SNPs will I get?
A: 42.
What question do you really want to ask?
Q: How many SNPs do I need?
A: 42.
What question are you trying to answer?
The Best Advice I can Give You:
Does my Experimental Design make Sense?
Do I have a Plan for Analysis?
Do my Results make Biological Sense??
Why can GBS be complicated?
Sequencing Error
SNP calling
Missing data
Large data sets
Population type
Biological Reality
Is GBS more complicated for non-model organisms?
I don’t have a reference genome
We don’t know anything about population structure or diversity
My reference genome is in a pre-assembly state, or the physical ordering is inaccurate
My species is outcrossing
Roadmap
No reference genome (UNEAK pipeline)
SNP CALLING – reference pipeline
Examples and usage cases
The non-model problem: No Reference Genome
Universal Network Enabled Analysis Kit (UNEAK)
A reference free SNP calling pipeline
Fei Lu, Cornell University
Overview of UNEAK
Genome is digested, sequenced using GBS
Reads are trimmed to 64 bp
Identical reads = tag
[Figure: example tags A and B]
slide courtesy Fei Lu, Cornell University
Overview of UNEAK – Network filter
Pairwise alignment to find tag pairs with 1 bp mismatch
Build tag networks
Topology of tag networks
Keep common reciprocal tags
[Figure: tag networks built from tags C, D, E and F with their read counts, separating error tags from real tags]
slide courtesy Fei Lu, Cornell University
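The pairing step above can be sketched in a few lines of Python. This is a minimal illustration only, not the UNEAK implementation: the function names and the `min_count` cutoff (a stand-in for the "keep common tags" and error tolerance filters) are assumptions, and the brute-force comparison is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def reciprocal_tag_pairs(reads, min_count=2):
    """Collapse reads into tags, then keep pairs of tags that differ by
    exactly 1 bp and are each other's only 1-bp neighbour.

    min_count is a hypothetical cutoff used here to drop rare tags.
    """
    counts = Counter(reads)  # identical reads = one tag
    tags = [t for t, c in counts.items() if c >= min_count]
    neighbours = {t: set() for t in tags}
    for a, b in combinations(tags, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            neighbours[a].add(b)
            neighbours[b].add(a)
    pairs = []
    for a in tags:
        if len(neighbours[a]) == 1:
            (b,) = neighbours[a]
            # reciprocal: b's only neighbour is a; a < b avoids duplicates
            if neighbours[b] == {a} and a < b:
                pairs.append((a, b, counts[a], counts[b]))
    return pairs


# Toy example with 8 bp tags instead of 64 bp
reads = ["ACGTACGT"] * 5 + ["ACGTACGA"] * 4 + ["ACGTACGC"] + ["TTTTGGGG"] * 6
pairs = reciprocal_tag_pairs(reads)
```

With `min_count=1` the singleton tag joins the network, the network is no longer a simple reciprocal pair, and nothing is returned. This is the behaviour the network filter is meant to enforce.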
Topology of tag networks
[Figure: topology of networks of 2496 tags; legend: tag, error; labeled groups: plastid & highly repetitive tags; moderately repetitive tags, paralogs & SNPs]
slide courtesy Fei Lu, Cornell University
Details about network filter
SNP
Error tolerance
slide courtesy Fei Lu, Cornell University
Pipeline validated with maize inbred linkage population – network filter
Evaluation criteria: allele frequency distribution; single-site rate (Blast to maize)
[Figure: two histograms of the allele frequency distribution (x-axis: allele frequency; y-axis: proportion of SNPs), one after Step 1 (pairwise alignment of tags) and one after Step 2 (network filter)]
Single-site rate (Blast to maize): 23.3% (Step 1, pairwise alignment of tags); 85.0% (Step 2, network filter)
slide courtesy Fei Lu, Cornell University
Pipeline validated with maize inbred linkage population – SNP validation
Alignment (B97 tags against B97 shotgun genome from Maize HapMap2 data): 93.2% SNPs are polymorphic
LD distribution (SNPs against 1106 markers)
[Figure: histogram of LD (r2) on the x-axis against proportion of SNPs on the y-axis; annotated with an r2 value of 0.2 and 92.2%]
slide courtesy Fei Lu, Cornell University
Discovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
Sequence
Network Filter
GBS UNEAK Pipeline Fork
1) UTagCountToTagPairPlugin
2) UExportTagPairPlugin
TASSEL 3.0
No Reference Genome
SNPs can be called with the UNEAK pipeline if there is no reference genome.
Be careful with your sampling strategy
Optimized for single species
Allele frequency can matter!
Decide whether your downstream analysis will be qualitatively influenced by biases in SNP calling
Are you doing association mapping? Estimating population genetic parameters?
No Reference Genome
Tag pairs differ by at most 1 bp per 64 bp sequenced
Examples when sampling strategy matters:
The same locus may be identified as two separate loci when there is population structure and overall divergence is > 1.5%
With highly divergent subpopulations, some loci that are monomorphic within a subpopulation but diverged by > 1 bp/tag between species will not be detected
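The 1.5% figure follows from the 1 bp per 64 bp limit (1/64 ≈ 1.56%). As a rough worked example, the sketch below estimates how often two homologous 64 bp tags would still differ at no more than one site for a given divergence. It assumes independent, equally likely differences at every site (a simple binomial model), which is an assumption made here for illustration and not part of the pipeline.

```python
def p_at_most_one_mismatch(divergence: float, tag_len: int = 64) -> float:
    """P(two homologous tags differ at <= 1 site) under a binomial model."""
    same = (1 - divergence) ** tag_len
    one = tag_len * divergence * (1 - divergence) ** (tag_len - 1)
    return same + one


max_divergence = 1 / 64  # one mismatch per 64 bp tag

for d in (0.005, 0.01, 1 / 64, 0.05):
    print(f"divergence {d:.4f}: P(<= 1 mismatch) = {p_at_most_one_mismatch(d):.3f}")
```

The probability falls quickly as divergence rises, which is why loci fixed for different alleles in divergent groups tend to drop out or split into separate loci.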
No Reference Genome
Allele frequency can matter for SNP calling in the UNEAK pipeline – and is influenced by your population/sampling strategy
Undersampled diversity (and true rare alleles) can look like error, and are even harder to detect with the non-reference pipeline
Sequence based SNP calling:
What’s in a SNP?
Total reads   # minor allele reads to call heterozygote (1% error rate)
2             1
3             1
4             1
5             1
6             1
7             2
8             2
9             2
10            2
11            2
12            2
13            2
14            3
Lookup table for GBS pipeline is calculated based on binomial distribution
Sequence based SNP calling:
What’s in a SNP?
Total reads   # minor allele reads to call heterozygote
              (1% error rate)   (0.33% error rate)
2             1                 1
3             1                 1
4             1                 1
5             1                 1
6             1                 1
7             2                 1
8             2                 1
9             2                 2
10            2                 2
11            2                 2
12            2                 2
13            2                 2
14            3                 2
Lookup table for GBS pipeline is calculated based on binomial distribution
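One way to derive such a table from the binomial distribution is a likelihood-ratio rule: call a heterozygote at the smallest minor allele count k where seeing k or fewer reads of an allele from a true heterozygote (p = 0.5) is more likely than seeing k or more such reads from sequencing error alone. The sketch below is an assumption about the form of the rule, not a copy of the pipeline code. It does reproduce both columns shown above.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def min_minor_reads_for_het(total_reads: int, error_rate: float) -> int:
    """Smallest minor allele count at which 'heterozygote' is more likely
    than 'sequencing error' (assumed likelihood-ratio rule)."""
    for k in range(1, total_reads + 1):
        # P(X <= k) for a true heterozygote
        p_het = sum(binom_pmf(i, total_reads, 0.5) for i in range(k + 1))
        # P(X >= k) if the minor allele reads are all errors
        p_err = sum(
            binom_pmf(i, total_reads, error_rate)
            for i in range(k, total_reads + 1)
        )
        if p_het / p_err > 1.0:
            return k
    return total_reads


for n in range(2, 15):
    print(n, min_minor_reads_for_het(n, 0.01), min_minor_reads_for_het(n, 0.0033))
```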
Sequence based SNP calling:
What’s in a SNP?
Total read depth: 7
Major allele (“G”) count: 4
Minor allele (“C”) count: 3
SNP call: “S” (heterozygous G/C)
Sequence based SNP calling:
What’s in a SNP?
Total read depth: 7
Major allele (“G”) count: 6
Minor allele (“A”) count: 1
SNP call: “G” (invariant)
Sequence based SNP calling:
What’s in a SNP?
Total read depth: 3
Major allele (“G”) count: 3
Minor allele (“C”) count: 0
SNP call: “G” (homozygote)
Sequence based SNP calling:
What’s in a SNP?
Total read depth: 3
Major allele (“G”) count: 2
Minor allele (“A”) count: 1
SNP call: “R” (A/G heterozygote)
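The four worked examples can be reproduced with a small lookup-based caller. This is a sketch for illustration: the thresholds are copied from the 1% error rate column of the lookup table, the heterozygote symbols are the standard IUPAC ambiguity codes, and the function name is hypothetical.

```python
# IUPAC ambiguity codes for two-allele heterozygotes
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}

# Minimum minor allele reads to call a heterozygote (1% error rate column)
MIN_MINOR_READS = {2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 2,
                   9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 3}


def call_genotype(major: str, major_count: int, minor: str, minor_count: int) -> str:
    """Return the major allele (homozygote) or an IUPAC heterozygote code."""
    depth = major_count + minor_count
    if depth == 0:
        return "N"  # no reads: missing genotype
    if minor_count == 0:
        return major
    if depth not in MIN_MINOR_READS:
        raise ValueError(f"depth {depth} is outside the lookup table")
    if minor_count >= MIN_MINOR_READS[depth]:
        return IUPAC[frozenset((major, minor))]
    return major  # minor allele reads treated as sequencing error


print(call_genotype("G", 4, "C", 3))  # depth 7, 3 minor reads
print(call_genotype("G", 6, "A", 1))  # depth 7, 1 minor read
```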
Heterozygote undercalling in Vitis spp.
Without filters, the Aa > AA error rate (heterozygote undercalling) in Vitis is 30-50%
50,951 SNPs
Mean depth 5.7x
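A back-of-the-envelope calculation shows why low depth leads to undercalling. Assuming both alleles of a true heterozygote are sampled with equal probability and ignoring sequencing error (both simplifying assumptions), the chance that every read comes from the same allele is 2 × 0.5^depth:

```python
def p_het_looks_homozygous(depth: int) -> float:
    """P(all reads at a true heterozygote carry the same allele)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5**depth


for depth in (1, 2, 3, 5, 8):
    print(depth, p_het_looks_homozygous(depth))
```

At depth 2 half of all heterozygotes look homozygous, and even at depth 5 about 6% do, before any minimum minor read threshold makes the call stricter still.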
Heterozygote Undercalling
Solutions
1. Focus on the high coverage genotypes (through VCF plugin)
Ex: VitisGen project
2. Use genetics (ex. Maize project)
Each project is different, and different solutions are appropriate for each situation
SNP calling in Vitis (heterozygous)
Read depth of each allele is captured for SNP calling in Vitis using the Variant Call Format (VCF)
0/0:3,0,0:3:88:0,0,108
Plugins available in TASSEL 3 to output VCF format:
tbt2vcfPlugin
MergeDuplicateSNP_vcf_Plugin
Does NOT work with HDF5 TBT/TOPM
VCF integration in TASSEL 4 is now implemented
Qi Sun Jon Zhang
Jeff Glaubitz
Terry Casstevens
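The per-sample VCF field shown above (0/0:3,0,0:3:88:0,0,108) can be unpacked with a few lines of Python. The slide does not show the FORMAT column, so the key order GT:AD:DP:GQ:PL used below is an assumption based on common VCF usage, and the function name is hypothetical. Missing values (".") are not handled in this sketch.

```python
def parse_sample_field(sample: str, fmt: str = "GT:AD:DP:GQ:PL") -> dict:
    """Split one VCF sample column into named values.

    fmt is the FORMAT string; the default here is an assumed layout.
    """
    record = dict(zip(fmt.split(":"), sample.split(":")))
    for key in ("AD", "PL"):  # comma-separated integer lists
        if key in record:
            record[key] = [int(v) for v in record[key].split(",")]
    for key in ("DP", "GQ"):  # single integers
        if key in record:
            record[key] = int(record[key])
    return record


parsed = parse_sample_field("0/0:3,0,0:3:88:0,0,108")
print(parsed)
```

Under that assumed layout, this genotype is homozygous for the reference allele, supported by 3 reads, with a genotype quality of 88.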
Sequence based SNP calling: What’s in a SNP?
VCF pipeline SNP calling algorithm
Hohenlohe PA, Cresko WA et al. PLOS Genetics 2010 Feb 26;6(2):e1000862.
Likelihood score calculation. No allele frequency based prior probability was used due to mixed population structures.
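A likelihood calculation in the spirit of the cited approach can be sketched as follows for a biallelic site. This is not the plugin's code. The fixed error rate of 0.01 is an assumption made here (the cited paper estimates it from the data), and the multinomial coefficient is omitted because it is the same for every genotype. No allele frequency prior is applied, in line with the slide.

```python
from math import log


def genotype_log_likelihoods(n_ref: int, n_alt: int, error_rate: float = 0.01) -> dict:
    """Log-likelihood of each diploid genotype given the two read counts."""
    e = error_rate
    hom_match = log(1 - 3 * e / 4)  # read matches the homozygous allele
    het_match = log(0.5 - e / 4)    # read matches either heterozygous allele
    mismatch = log(e / 4)           # read shows one specific wrong base
    return {
        "0/0": n_ref * hom_match + n_alt * mismatch,
        "0/1": (n_ref + n_alt) * het_match,
        "1/1": n_alt * hom_match + n_ref * mismatch,
    }


def call_by_likelihood(n_ref: int, n_alt: int, error_rate: float = 0.01) -> str:
    """Pick the genotype with the highest likelihood (no prior)."""
    ll = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(ll, key=ll.get)


print(call_by_likelihood(4, 3), call_by_likelihood(10, 1))
```

The gap between the best and second-best likelihoods is the kind of quantity a genotype quality score summarizes, which is what makes quality-based filtering possible.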
Heterozygote undercalling in Vitis spp.
[Figure: bar chart of call rate and % errors under four filtering criteria (no filter, depth > 3, depth > 5, GQ > 98); values shown: 8%, 15%, 30%, 3% and 100%, 82%, 56%, 67%]
Working With VCF Files
VCFtools
http://vcftools.sourceforge.net/
VCFtools output formats:
plink *** can be used as TASSEL input
012
IMPUTE
ldhat
BEAGLE-GL
VCFtools is installed on CBSU BioHPC Computing Lab
http://cbsu.tc.cornell.edu/lab/lab.aspx
How to Get VCF Output
TASSEL 3.0
1. QseqToTagCountPlugin/FastqToTagCountPlugin
2. MergeMultipleTagCountPlugin
3. TagCountToFastqPlugin
4. Align
5. SAMConverterPlugin
6. QseqToTBTPlugin/FastqToTBTPlugin (use -y to output in TBTByte format)
7. MergeTagsByTaxaFilesPlugin
8. tbt2vcfPlugin
9. MergeDuplicateSNP_vcf_Plugin
Running each plugin without any options will output a list of available options on the console.
There is no plugin to merge taxa for VCF files. Merge samples by name at the TBT stage if desired using the flag -x with the MergeTagsByTaxaFilesPlugin.
TASSEL 4.0
From TagsToSNPByAlignmentPlugin
The non-model problem: Less-than-perfect reference genome
How less than perfect is it?
Even relatively short contigs are suitable for anchoring 64 bp GBS tags!
Assembly errors can affect alignment and SNP calling.
Common with short read assemblies
Reference allele does not match GBS alleles
Reference genome assembly
Paralogous regions
Biological Reality
Less than Perfect Reference
SNPs can be called with the reference pipeline even on genomes that are not fully assembled
Hint: concatenate contigs into pseudo-chromosomes, but pad with “Ns” in between contigs to avoid aligning to junctions – for 64 bp tags pad with > 64 “Ns”
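The padding hint can be sketched as below. The function name and the returned offset table are illustrative choices, not part of any pipeline. The default pad of 65 Ns is simply the smallest value that satisfies "> 64". Keeping the offsets makes it possible to translate pseudo-chromosome positions back to contig coordinates later.

```python
def build_pseudo_chromosome(contigs, pad_len=65):
    """Join contigs into one sequence separated by runs of N.

    contigs: list of (name, sequence) tuples.
    Returns the joined sequence and a dict of 0-based contig start offsets.
    """
    pad = "N" * pad_len
    parts, offsets, pos = [], {}, 0
    for i, (name, seq) in enumerate(contigs):
        if i > 0:  # pad only between contigs, not at the ends
            parts.append(pad)
            pos += pad_len
        offsets[name] = pos
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets


# Toy example with a short pad so the result is easy to read
seq, offsets = build_pseudo_chromosome([("c1", "ACGT"), ("c2", "GG")], pad_len=3)
print(seq, offsets)
```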
A non-model case study: Vitis – a highly heterozygous outcrossing crop species with a genus-level breeding program
“Mapping the way to a better grape”
– 37 Mapping populations
– Genotyping-by-Sequencing (GBS) and SSR based approach to identify QTL and candidate genes
– biotic stress resistance, abiotic stress resistance, and fruit quality.
– Inclusion of 32 different wild and cultivated species to help in map building and genome assembly.
Outcrosses
• Use pseudo-testcross markers (Aa x aa)
• Leverage deep sequencing of parental genotypes to phase markers
• Filter by Mendelian errors and/or Linkage Disequilibrium
• Methods development in Vitis for genetic mapping with GBS data
• Dense Mapping of 4-way Cross GBS Data
• Three pipelines:
– Synteny
– Synteny with genetic ordering
– De-novo linkage group formation and genetic ordering
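The pseudo-testcross and Mendelian error bullets above can be illustrated with a small sketch. Genotypes are coded as alternate allele dosage (0 = aa, 1 = Aa, 2 = AA, None = missing). This coding and the function names are assumptions made for the example, and it presumes parental genotypes are available.

```python
def is_pseudo_testcross(parent1, parent2) -> bool:
    """True when one parent is heterozygous and the other homozygous (Aa x aa)."""
    return 1 in (parent1, parent2) and (parent1 in (0, 2) or parent2 in (0, 2))


def mendelian_error_rate(parent1, parent2, progeny) -> float:
    """Fraction of called progeny carrying the genotype the cross cannot produce."""
    if not is_pseudo_testcross(parent1, parent2):
        raise ValueError("not a pseudo-testcross marker")
    homozygous_parent = parent1 if parent1 != 1 else parent2
    impossible = 2 - homozygous_parent  # Aa x aa cannot give AA, and vice versa
    called = [g for g in progeny if g is not None]
    if not called:
        return 0.0
    return sum(g == impossible for g in called) / len(called)


rate = mendelian_error_rate(1, 0, [0, 1, 1, 0, 2, None])
print(rate)
```

Markers with a high error rate, or with a strong departure from the expected 1:1 segregation, are natural candidates for removal before map construction.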
Dense maps from GBS data – 4 way cross
• ~26,000 markers for a single bi-parental family with ~200 progeny
• Separate maps for male and female parents
• No parental genotypes
• Up to 50% missing data per SNP locus
[Figure: Chromosome 1 markers from male parent in a VitisGen family (phased) and ordered by physical position; axes: progeny number (index) and physical position]
The non-model problem: Take home messages
Designing your GBS study
Know what you need for your analyses before planning your project
GBS is a powerful reduced representation approach, with two distinct components:
molecular methods
bioinformatics pipelines
The GBS bioinformatics pipelines were developed and optimized with specific applications in mind
They can work for you, but be mindful of your needs
Keep Calm and… Filter
Expect to filter up to 90% of SNPs when imputation is not practical
Pedigree independent filtering:
Read depth/genotype quality
Percentage of absent calls (missingness)
Minor allele frequency
For inbred species, inbreeding coefficient filters out most paralogous sites
Likely to require customized bioinformatics – especially without a reference genome
Be ready for post-processing of data into other formats!
Have a plan!
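Two of the pedigree independent filters listed above, missingness and minor allele frequency, can be sketched per SNP site as follows. Genotypes are coded as alternate allele dosage (0, 1, 2) with None for missing. The thresholds are placeholder values chosen for illustration, not recommendations.

```python
def site_passes(genotypes, max_missing=0.5, min_maf=0.05) -> bool:
    """Apply missingness and minor allele frequency filters to one SNP site.

    max_missing and min_maf are hypothetical example thresholds.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # nothing called at this site
    missing = 1 - len(called) / len(genotypes)
    alt_freq = sum(called) / (2 * len(called))  # diploid allele frequency
    maf = min(alt_freq, 1 - alt_freq)
    return missing <= max_missing and maf >= min_maf


sites = {
    "snp1": [0, 1, 2, None],
    "snp2": [0, 0, 0, 0],           # monomorphic
    "snp3": [1, None, None, None],  # mostly missing
}
kept = [name for name, g in sites.items() if site_passes(g)]
print(kept)
```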
The Best Advice I can Give You:
Does my Experimental Design make Sense?
Do I have a Plan for Analysis?
Do my Results make Biological Sense??