Questions in genomic selection,
some answers and some history of
single-step
Ignacy Misztal
University of Georgia
Questions in genomic selection
• SNP are genes, markers or something else?
• Good accuracy at 30k SNP , standard 50-60k, a bit better at 700k • What is magic with 50K?• Why not more noise at 600K• Causative SNP?
• Stability problems with GRM• At about 5k, usually blended with A
• OK accuracy with few genotyped animals 1k-2k• Good in farm• Rise with extra genotypes slow• Discrepancy between simulation and field-data results
Inversion by recursion
u Pu Φ
1 2 1| , ,.., 'i i i iu u u u p u
1 1var( ) ( ) 'var( ) ( )
u I P Φ I P
Generic recursion
Cost low only if P sparse
P very sparse0.5 0.5i is diiu u u
For pedigree relationships (Henderson, 1976):
Is limited recursion applicable to genomic relationships?
Algorithm for proven and young animals (APY)
1 2 1
" " " "
| , ,..., + i i ij j ij j i
j proven j young
u u u u p u p u
=0 in GBLUPFor young animals
yyp
-1 -1-1 -1 -1pp pp p
pp
G 0 G GG = + M G G I
0 0 I genotypes for proven animals
genotypes for young animals
i iim g
p
y
-1i p pp p i
Z
Z
z 'Z 'G Z z
Linear cost for young animals
Misztal et al. (2014)
Tests with Holsteins (Fragomeni et al., 2015)
G needed G-1
APY inverse
Regular inverse
Correlations of GEBV with regular inverse
23k bullsas core
17k cows as core
> 0.99
> 0.99
20k random animals as core
> 0.99
Impact of recursion size in Holsteins and chicken
Holstein
broiler
15,0003000
Corr(GEBV, GEBV APY)
Number of randomly-chosen animals in recursion
Theory of junctions
…………
Heterogenetic and homogenic tracts in genome (Stam, 1980)
Called independent chromosome segments Me (Goddard et al., 2009; Daetwyler et al., 2010)
E(Me)=4NeL (Stam, 1980)Ne – effective population sizeL –length of genome in Morgans
Need 12 Me SNPs to detect 90% of junctions (MacLeod et al., 2005)
Haplotype blocks = Independent chromosome segments
• E(Me) = 4NeL Stam (1980)
• Ne – Effective population size
• L – Length of genome in Morgans
2NeL Hayes et al. (2009)
• Me 2NeL/[log(NeL)] Goddard et al. (2011)
Many more Brard and Ricard (2015)
Cuppen (2005)
Choose core “c” and noncore “n” animals
n nc c nu P u ε
u Tschromosome segments
c cs Qu ε
Theory of APY based on segments
Breeding value
small if number of core animals > number of segments
c cu u
c c
ncn n
I 0u u
P Iu ε
0cncc
nc nn
I 0 I PG 0G
P I 0 IM
11
1
cccn
ncnn
I 0G 0I PG
P I0 M0 I
1nc nc ccP G G
, nn , i,1: 1 ,1: 1{ ' }i i i i idiag g M p g 1
Choose core “c” and noncore “n” animals
n nc c nu P u
The inverse
Unknown matrices from conditional expectation
BV of noncore animals linear function of core animals
Matrix notation
Var(u)
Misztal&Legarra&Aguilar (2014)
's s sG = U D U' = U D U
U – eigenvalues
D – eigenvectors
Eigenvalues sum to 100%
What % is useful, 95%? 98% 99%, 99.999%?
Finding dimensionalities by eigenvalues
True accuracies as function of number of eigenvalues corresponding to given explained variance in G
Ne=200
Ne=20
Approximate number of animals / segments NeL 2NeL 4NeL
Accuracies maximized by 98% “information in G, 95% almost as goodLast 2% of information in G noise
Pocrnic et al., 2016a
Reliabilities – Jerseys (75k animals)
Milk
Protein
Fat
3300 6100 11,500 assumed dimensionality≈NeL ≈2NeL ≈4 NeL
(number of core animals)
100% = full inverse lower accuracy
Pocrnic et al., 2016b
Estimated dimensionality, effective population size and optimal number of SNP
Specie Approx Me(98%)
Effective population size (L=30M)
Optimal number of SNP(12 x Me)
Holsteins 14k 149 170k
Jerseys 10k 101 120k
Angus 11k 113 130k
Pigs 4k 43 (L=20M) 50k
Chicken 4k 44 50k
Pocrnic et al. (2016b)
Side effects of reduced dimensionality
• Number of segments• 800k in humans
• 5-15k in animals
• Impact on SNP selection and GWAS
Theory of limited dimensionality
……
Number of haplotypes: 4 Ne LNe within each ¼ Morgan segment
¼ Morgan
Ne haplotypes within each ¼ Morgan segment
QTLsGenome haplotypes
Dimensionality of ¼ Morgan case: Ne or number of identified QTLs Reduced dimensionality with weighted GRM
0.43
0.57
0.660.69 0.69 0.69
0.47
0.6
0.76
0.850.88 0.89
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
30 40 50 60 70 80 90 100
Acc
ura
cy
% of explained variance
Accuracies with largest eigenvalues only
6k animals with own records
6k bulls with 50 daughters
Eigenvalues 25 120 400 1400 2000 3000
0.43
0.57
0.660.69 0.69 0.69
0.47
0.6
0.76
0.850.88 0.89
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000
Acc
ura
cy
Number of eigenvalues
Accuracies with largest eigenvalues only
6k animals with own records
6k bulls with 50 daughters
0.47
0.6
0.76
0.850.88 0.89
0.34
0.71
0.840.87 0.88
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000
Acc
ura
cy
Number of eigenvalues
Largest eigenvalues or core animals in APY
APY
Eigenvalues
Which core animals in APY?
Bradford et al. (2017)
Simulated populations (QMSim; Sargolzaei and Schenkel, 2009) Ne = 40 #genotyped animals = 50,000
Core animals: Random gen 6 || gen 7 || gen8 || gen9 || gen 10 (y) Random all generations Incomplete pedigree Genotypes in gen 9 and 10 imputed with 98% accuracy
Which core animals in APY?
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
98% 95% 90%
Co
rrel
atio
n (
GEB
V, T
BV
)
Percent of variation explained in G
Accuracy
Gen 6 Gen 7 Gen 8 Gen 9 Gen 10 Random
G-1
Bradford et al. (2016)
Persistence over generations
Generations
Rel
iab
ility
1.0
BLUP
GBLUP - small
BayesB - small
GBLUP – very large
GBLUP – very large80% genome
Very large – equivalent to 4NeL animals with 99% accuracyAre SNP effects from Holstein national populations converging
Multitrait ssGBLUP: Is SNP selection important? Causative SNPs?
• SNP selection/weighting (BayesB, etc.) • Large impact with few genotypes
• Little or no impact with many
ssGBLUP accuracies using SNP60K and 100 QTNs – simulation study
0 10 20 30 40 50 60 70 80 90 100
BLUP
ssGBLUP - unweighted SNP60k
unweighted SNP60k + 100 QTN
SNP60k + 100 QTN weighted by GWAS
SNP60K + 100 QTN with "true" variance
plus by APY
only 100 QTN unweighted by APY
Fragomeni et al. (2017)
Rank (98%)
19k
5k
98
Nothing can be more fatal to progress than a
too confident reliance on mathematical
symbols; for the student is only too apt to take
the easier course, and consider the formula not
the fact as the physical reality.”
Kelvin
Group and sponsors
Yutaka
Masuda
Shogo
Tsuruta
Daniela
Lourenco
Breno
Fragomeni
Ignacio
Aguilar
Andres
Legarra
Ivan
Pocrnic
Heather
Bradford
Development of the combined matrix
-1 -1
12 22 22 21
22
IA A 0 A A 0H = A + G - A I I
I0 I 0 I
Initial (Misztal et al., 2009)
1-ungenotyped animals
2-genotyped animals22
0 0H=A+
0 G-A
1 1
1 1
22
0 0H =A +
0 G -A
Comprehensive (Legarra ,2009)
Inverse of Comprehensive (Aguilar et al., 2010)
Christensen and Lund, 2010Boemcke et al., 2010
Implementation at UGA
• Module genomic in BLUPF90 package (Aguilar et al. 2011)
• Option SNP_File xxx in RENUMF90
• Lots of options with defaults
• Creation of G-1: minutes for 10k
genotypes, hours for 50k genotypes
Prediction in 2004 DD2009
R2 (%) Inflation (%)
Parent Avg 24 31
Multistep
(VanRaden)+16 16
Single-step
Regular +17 31
Refined +17 4
Predictions for US final scores in Holsteins (Aguilar et
al., 2010)
1 1
22G -A
1 1
221.5G -0.6A
• US Holsteins (10 million animals)
• 18 traits
• Almost 50,000 genotypes of bulls and cows
• 2 days computing
Multitrait national genomic evaluation for type (Tsuruta et al., 2010)
Genomic evaluations of broiler chicken (Chen et al., 2010)
• 180k broiler chicken
• 3 k genotyped with SNP60k chip
• 3 methods– BLUP- full data
– BayesA – genotyped subset
– Single step – subset and full data set
Trait Accuracy*100
BLUPBayesA
Subset
Single-step
Subset
Single-step
Full
Body
Weight56 +4 +11 +12
Breast
Meat35 +1 +0 +6
Leg Score 29 -20 -23 +7
Accuracies for broiler chickens
Next cycle of selection
Body
Weight38 +13 +22 =
Breast Meat39 +10 +26 +29
Leg Score 28 -21 +6 =
Multiple trait
BayesA – days of computing + errors
Single-step –2 minutes
-0.06
-0.04
-0.02
0
0.02
0.04
0.06
0.08
(2.5)
(2.0)
(1.5)
(1.0)
(0.5)
-
0.5
200601 200701 200801 200901 201001 201101 201201 201301 201401 201501 201601
Birth weight
(kg/pig) (relative to '13-'15 avg)
Total born
(pigs/sow/yr) (relative to '13-'15 avg)
Birth year / month
Trend: genetic improvement in birth weight and total born(PIC Genetic Nucleus)
Total Born Birth Weight
1. Relationship based genomic selectionSource: PIC L02, L03 pure lines (Camborough)
Introduction of RBGS1
Introduction of RBGS1
-0.06
-0.04
-0.02
0
0.02
0.04
0.06
0.08
(2.5)
(2.0)
(1.5)
(1.0)
(0.5)
-
0.5
200601 200701 200801 200901 201001 201101 201201 201301 201401 201501 201601
Birth weight
(kg/pig) (relative to '13-'15 avg)
Total born
(pigs/sow/yr) (relative to '13-'15 avg)
Birth year / month
Trend: genetic improvement in birth weight and total born(PIC Genetic Nucleus)
Total Born Birth Weight
1. Relationship based genomic selectionSource: PIC L02, L03 pure lines (Camborough)
Introduction of RBGS1
Introduction of RBGS1
• Large research interest in GWAS• Limitations if Bayesian methods
– Simple models– Single trait– Complicated if not all animals genotyped
Can ssGBLUP be used for GWAS?
ssGBLUP for Genome Wide Association Studies
Can large QTL exist despite selection?
• Genetics and genomics of mortality in US Holsteins
• (Tokuhisa et al, 2014; Tsuruta et al., 2014)
• 6M records, SNP50k genotypes of 35k bulls
R2 in Israeli dairy – 1400 genotypes(Lino et al., 2012)
15
20
25
30
35
40
45
50
BLUP ssGBLUP 1 ssGBLUP 2 ssGBLUP 3
Milk
Fat%
Prot%
Why unknown parent groups
• Different lines or breeds (Harris and Johnson, 2012)
• Unrecorded parents across generations
Pedigree length and convergence
H-1= A
-1+
0 0
0 G-1 -A22
-1
é
ë
êê
ù
û
úú
G-1 - A22
-1Good convergence and genotyped animals biased down
G-1 - A22
-1Bad convergence and genotyped animals biased up
G-1 -A22,1
-1 - Bad convergence and genotypedanimals biased down and up
A22,2
-1 - A22,3
-1
Long medium shortpedigrees
Big A22 makes H less PD,Reduces convergence rate
Realized broiler accuracies with male, female or both genotypes – Trait A
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Male EPD Female EPD
196,000 birds with phenotype~13,000 genotypes
(Lourenco et al., submitted)
BLUP
Male
Fem
M+F
BLUP Male
Fem
M+F
Realized accuracies with male, female or both genotypes – Trait D
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Male EPD Female EPD
(Lourenco et al., in prep)
BLUP
Male
Fem
M+F
BLUP Male
Fem M+F
Why realized accuracies differ by sex?
parents
Bigger selection pressure on females
Selection graph for GEBV; possibly more differential selection EBV from BLUP
Decomposition of GEBV in Single-step
GEBV = w1CD+w2PA+w3PC +w4DGV +w5PI
Z 'MZ +aA-1 +0 0
0 G-1 - A22-1
é
ë
êê
ù
û
úú
ì
íï
îï
ü
ýï
þïu = Z 'My
CD – contemporary deviationPA – Parent averagePC – Progeny Contribution
DGV – direct genomic valuePI – Parental Index
w1 =1å
No genotype, no extra accuracy
GEBV for young animals
GEBV = w2PA+w4DGV +w5PI
GEBV = DGV
GEBV = PAIf no genotype
If genotype via SNP only
Complete
Little improvement with genomics if animal not genotyped
New studies
• Unbiased evaluations of US Holstein with > 2 M genotypes of varying quality
• Helping Interbull survive• Unbiased pseudo-observations for bulls• GBLUP MACE
• Crossbreeding evaluation without reduction of accuracy
• Resilience and genomic selection
Applied studies
• Pigs• Mortality, survival, changing correlations
• Chickens• Sexual dimorphism,…
• Dairy
• Beef• Altitude, GxE
• Fish
• Heat stress
• GxE
• Resilience
• Theory
Is UGA a good place to come?
• Good place for Science
• Improving South
• Politics small at universities
• Funding available
• Interesting projects
• Data from biggest animal institutions across species