TRANSCRIPT
1
A mixed bag of tools for genome-assisted prediction
VALENCIA, September 20-24, 2010
Daniel Gianola
Sewall Wright Professor of Animal Breeding and Genetics
University of Wisconsin
Dairy Science
A primer to genome-based analysis of quantitative traits
with machine learning and non-parametric methods
2
ACKNOWLEDGEMENTS
WI: Nanye Long (machine learning, RBF)
SP: Oscar Gonzalez-Recio (genomic value)
AL: Gustavo de los Campos (RKHS, LASSO)
WI: Hayrettin Okut (NN)
SP: Marigel Perez-Cabal (CV, Clustering)
AL: Ana Vazquez (CV, Clustering)
GE: Malena Erbe (CV)
WI: Nick Wu (Dirichlet process priors)
WI: Guilherme Rosa
WI: Kent Weigel
Aviagen: Santiago Avendano, Andreas Kranis
Institut Pasteur (Montevideo): Hugo Naya
2
3
TOPICS COVERED (order is approximate)
1. Evolution of statistical methods in quantitative genetics
2. Challenges from complexity and use of phenomic data
3. Brief review of Bayesian inference
4. The problem of dealing with interactions
5. Machine learning classification: SNP-mortality
6. Comparison of classifiers (NB, BN, NN)
7. Introduction to non-parametric regression: LOESS, RKHS, radial basis functions, neural networks
8. Models for genome-enabled evaluation: Bayes A, Bayes B, Bayes C, Lasso
9. Cross-validation
10. Results from animals and plants
4
BIBLIOGRAPHY
3
5
1. EVOLUTION OF STATISTICAL METHODS IN ANIMAL BREEDING
Archean Visual appraisal (still widely used) [Biblical times…]
Pathozoic Path analysis, “Animal Breeding Plans” [1918-1945]
Anovian ANOVA (Method 1), least-squares, selection index [1936-1943]
Post-anovian Methods 2+3, MINQUE, MIVQUE [1953-1973]
Blupassic Mixed models, BLUP, animal model, multi-traits [1948-1990]
Remlian VCE, ASREML, DMU [1971- 2009]
Posteriozoic Threshold models, Survival, MCMC, QTLs [1982-2008]
Intelligent design vuvu-Maradona
6
2. Challenges from complexity and use of phenomic data
4
7
The phenomic data (phenotypes + genomics)
1) Massive phenotypic data exist
2) Massive genomic data increasingly available
Example: SNPs (also gene expression)
• >10^7 SNPs in dbSNP 124 (Nat. Center for Biotechnology)
• Perlegen: 1.58 million SNPs
• Animals:
- Wong et al. (2004): chicken genetic variation map with 2.8 million SNPs
- Hayes et al. (2004): 2,500 SNPs in salmon genome
- Poultry breeding companies: thousands of SNPs on sires/dams
- USA (2008): >50,000 SNPs in over 3,000 Holstein sires
- All over the developed world: chips with 800,000 SNPs
8
All you wanted to know about SNPs but were afraid to ask…
SNP = DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a species (or between paired chromosomes).
ABOVE: two sequenced DNA fragments, AAGCCTA and AAGCTTA, differ in a single nucleotide; we say that there are two alleles: C and T.
5
9
Copy number variants (CNV) and copy number polymorphisms (CNP): another source of information about genetic variation
• Individuals vary in the number of copies of genomic regions
• Disease genes are located in CNV regions
10
GENE EXPRESSION: [Figure: clustered fluorescent map, genes in rows]
6
11
“I always wanted to be the First…”
John Paul the Second
12
ANIMAL BREEDING: USE ALL SNP MARKERS IN MODELS
FOR GENOMIC-ASSISTED EVALUATION
Effects of chromosomal segments: allelic, haplotype
SNP effects combined additively
…but Meuwissen, Hayes and Goddard (2001) came first!
7
13
Schaeffer (2006):
YES, IT WOULD BE DIFFICULT! SEE NEXT…
14
FURTHER, COULD WE WRITE A MODEL FOR SOMETHING LIKE THIS?
A SYSTEMS BIOLOGY MAP OF THE BRAIN
8
15
Dealing with epistatic interactions and non-linearities
gene x gene
gene x gene x gene
gene x gene x gene x gene
…
(Alice in Wonderland)
16
Fixed effects models (unravelling "physiological epistasis" a la Cheverud?)
• Lots of “main effects”
• Splendid non-orthogonality
• Lots of 2-factor interactions
• Lots of 3-factor interactions
• Lots of non-estimability
• Lots of uninterpretable high-order interactions
• Run out of “degrees of freedom”
Epistatic networks will probably involve a few genes of large effect
9
17
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
ALL ASSUMPTIONS VIOLATED!
Just consider linkage disequilibrium
(Cartoon: "My girlfriend is a bitch")
18
"I started by biting my nails…" — Venus de Milo
10
19
CLOSE ENCOUNTERS OF THE PREHISTORIC KIND
GENOMICS AND COMPLEX BIOLOGY
NO! THE ADDITIVE GENETIC MODEL
Homo sapiens
Neanderthal
My Dad is crazy
20
Is my dad crazy? Les Savants
A prevailing view, and for good reasons (Hill et al., 2008; Crow, 2010; Hill, 2010)
• Fisher's theorem of natural selection
• Interactions are second-order effects; likely tiny and hard to detect
• Epistasis probably arises with genes of large effects, unlikely to be observed in outbred populations
• Epistatic systems generate additive variance and "release" it, so why worry?
11
21
Les Idiots Savants
A much less popular view (Gianola and a bunch of other guys)
• If everything behaves as additive, can additive models allow us to learn about "genetic architecture"?
• In areas where phenotypic prediction is crucial (medicine, precision mating), can the exploitation of interaction have added value?
• If so, should we consider enriching our battery of statistical tricks?
22
What can be expected from systems biology?
• von Bertalanffy (1968) wrote:
"There exist models, principles, and laws that apply to generalized systems or their subclasses, irrespective of their particular kind, the nature of their component elements, and the relations or 'forces' between them.
It seems legitimate to ask for a theory, not of systems of a more or less special kind, but ofuniversal principles applying to systems in general.
In this way we postulate a new discipline called General System Theory. Its subject matter is the formulation and derivation of those principles which are valid for 'systems' in general.
Concepts like those of organization, wholeness, directiveness, teleology, and differentiation are alien to conventional physics. However, they pop up everywhere in the biological, behavioural and social sciences, and are, in fact, indispensable for dealing with living organisms or social groups. Thus, a basic problem posed to modern science is a general theory of organization.
General system theory is, in principle, capable of giving exact definitions for such concepts and,in suitable cases, of putting them to quantitative analysis…
Allgemeine Systemtheorie
12
23
Systems analysis is not new in the animal sciences…
Where is the beef?
24
SYSTEMS ANALYSIS IN ACTION: PENTAGON “SYSTEMS” VIEW OFTHE WAR IN AFGHANISTAN
GENERAL McCHRYSTAL:“By the time we understand this slide, the war will be over”
(and he was sacked by Obama after the Rolling Stone article)
WHAT CAN WE EXPECT FROM SYSTEM ANALYSIS?
13
25
A VIEW OF LINEAR MODELS (as employed in q. genetics)
Mathematically, a linear model can be viewed as a "local" approximation of a complex process:
Linear approximation
Quadratic approximation
nth order approximation
FELDMAN and LEWONTIN (1975); CHEVALET (1994)
26
How good are linear and quadratic approximations? A Taylor series provides a local approximation only…
[Figure: $y = g(x) + e$ with $g(x) = \sin(x) + \cos(x)$, plotted for $x \in (-5, 5)$ together with (1) the sine-plus-cosine function, (2) its linear approximation, and (3) its quadratic approximation; (4) the approximations are good at x = 0…]
14
27
What is machine learning? Automatically produce models, such as rules and patterns, from data. Closely related to data mining, statistics, inductive reasoning, pattern recognition, and theoretical computer science.
[Diagram: machine learning overlapping with pattern recognition, data mining, neural networks, universal approximators, kernel methods, sampling methods, cross-validation designs, Bayesian networks, non-parametric prediction, ensemble methods (boosting, bagging), support vector machines, and random forest algorithms]
Do not fight over methods (Gonzalez-Recio)
28
Distinctive aspects of non-parametric fitting
• Investigate patterns free of strictures imposed by parametric models
• Regression coefficients appear but (typically) do not have an obvious interpretation
• Often: very good predictive performance in cross-validation
• Tuning methods and algorithms (maximization, MCMC) similar to those of parametric methods
• Often produce surprising results
15
29
Surprise! thin-plate splines
Risk of heart attack after 19 years as a function of cholesterol level and blood pressure. Left: logistic regression. Right: thin-plate spline (Wahba, 2007).
[Equation garbled in extraction; the thin-plate spline fit has the form $f(\mathbf{x}_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \sum_{j=1}^{N} c_j\,\|\mathbf{x}_i-\mathbf{x}_j\|^2 \log\|\mathbf{x}_i-\mathbf{x}_j\|$]
30
PENALIZED and BAYESIAN METHODS for functional inference play a role
• The idea of a "penalty" is ad hoc
• It does not arise "naturally" in classical inference
• It appears very naturally in Bayesian inference
L2 penalty: equivalent to a Gaussian prior
L1 penalty: equivalent to a double-exponential prior
Penalties on covariance matrices: equivalent to priors (e.g., inverse Wishart)
Bayesian methods arise naturally in predictive inference
16
31
2. A BRIEF REVIEW OF BAYESIAN INFERENCE
We do this because most prediction methods based on "regularization" admit a Bayesian solution.
Crucial to understand Bayesianism when analyzing high-dimensional data
32
Rev. Thomas Bayes
1702, London, England — 1761, Tunbridge Wells, Kent, England
Pierre-Simon Laplace
1749, Beaumont-en-Auge, France — 1827, Paris, France
1763. “An essay towards solving a problem in the doctrine of chances”. Philosophical Transactions of the Royal Society of London 53, 370-418.
1774. "Mémoire sur la probabilité des causes par les événements". Savants étrangers 6, 621-656. Oeuvres 8, 27-65.
INVERSE PROBABILITY
17
33
HISTORICAL NOTES
• Karl Pearson (without knowing) used Bayes
• Fisher (likelihood, fiducial inference)
• Lack of admissibility of classical procedures (James-Stein)
• Revival: neo-Bayesianism (Lindley, Box, Zellner)
• MCMC procedures (Metropolis; Geman and Geman)
• Bayesian methods in genetics: Haldane (1948), Dempfle (1977), Gianola and Fernando (1986)
• Explosion of Bayesianism in statistics: Gelfand and Smith (1990)
• Explosion in genetics as well
34
Bayesian methods in genetics: today
- Classification of genotypes
- Molecular evolution
- Linkage mapping
- QTL cartography
- Genetic risk analysis
- Gaussian linear and non-linear models: cross-sectional + longitudinal, univariate + multivariate
- Generalized linear models
- Survival analysis
- Thick-tailed processes
- Mixtures
- Semi-parametrics
- Transcriptional analysis
- Structural equation modeling
- Bayesian proteomics with wavelets
- Methods for genomic selection (the Bayesian Alphabet and more)
RED: animal breeders
18
35
THE BAYESIAN APPROACH IN A NUTSHELL
• All unknowns in the statistical system are treated as random
• Randomness reflects (typically) subjective uncertainty
• Can include as unknowns: the model (distribution, functional form); its parameters (heritability, inbreeding coefficient); genetic effects, number of QTL loci, marker effects
• Combine what is known a priori with information from the data: Bayesian learning
• The Bayesian approach can also be used for developing predictors of future observations without taking inference too seriously
36
HOW DOES ONE DO THIS?
• Introduce a prior distribution for all unknowns (PRIOR)
• Define a distribution for the data under a certain model (LIKELIHOOD)
• Arrive at the conditional distribution of all unknowns given the data (POSTERIOR)
• Derive marginal or conditional posterior distributions of interest by standard probability theory
• Display summaries or the entire distribution
• Interpret results probabilistically
• Example: the posterior probability of H0 is 8%
19
37
BAYESIAN INFERENCE and MCMC (can fit any model)
PRIOR + DATA → POSTERIOR (of the First Lady of France)
Most of the time the prior comes "out of the blue"
38
PRIOR + DATA → POSTERIOR
THUS: THE ANTI-BAYESIAN ARGUMENT…
20
39
Analysis with Prior 1: collecting more and more data….
40
Analysis with Prior 2: collecting more and more data….
21
41
Implications
• This is called "asymptotic domination" of the prior by the data (likelihood)
• For parameters on which there is a lot of information from the data, the prior matters little
• The prior may be influential in small samples; worthwhile to investigate sensitivity
• What is a small sample?
• Even if the prior matters little, the Bayesian approach allows us to use probability theory to measure uncertainty
42
BAYES THEOREM: CONTINUOUS CASE
• Evidence is now given by a vector of observations y
• The hypothesis is a vector of unknowns θ
• A probability model M poses a joint distribution [θ, y | M] with density
$h(\theta,\mathbf{y}) = g(\theta)\,f(\mathbf{y}\,|\,\theta) = m(\mathbf{y})\,p(\theta\,|\,\mathbf{y})$
• Assume that both the unknowns and the parameters are continuous-valued
22
43
• Prior density: $g(\theta)$
• Sampling density (the likelihood function when viewed as a function of θ): $f(\mathbf{y}\,|\,\theta)$
• Marginal density of the observations:
$m(\mathbf{y}) = \int h(\theta,\mathbf{y})\,d\theta = \int f(\mathbf{y}\,|\,\theta)\,g(\theta)\,d\theta = E_\theta\left[f(\mathbf{y}\,|\,\theta)\right]$
• Posterior density:
$p(\theta\,|\,\mathbf{y}) = \dfrac{g(\theta)\,f(\mathbf{y}\,|\,\theta)}{m(\mathbf{y})} \propto g(\theta)\,f(\mathbf{y}\,|\,\theta)$
equivalently
$p(\theta\,|\,\mathbf{y}) = \dfrac{g(\theta)\,L(\theta\,|\,\mathbf{y})}{\int g(\theta)\,L(\theta\,|\,\mathbf{y})\,d\theta} \propto g(\theta)\,L(\theta\,|\,\mathbf{y})$
∝ is the "Pac-Man operator". The prior must be proper for the integral (marginal density) to exist.
44
BAYES THEOREM IN A NUTSHELL
$\underbrace{p(\theta\,|\,\mathbf{y})}_{\text{posterior density}} = \dfrac{\overbrace{g(\theta)}^{\text{prior density}}\;\overbrace{f(\mathbf{y}\,|\,\theta)}^{\text{likelihood function}}}{\underbrace{m(\mathbf{y})}_{\text{marginal data density}}} \propto g(\theta)\,f(\mathbf{y}\,|\,\theta)$
23
45
POSTERIOR AS A STANDARDIZED LIKELIHOOD
$p(\theta\,|\,\mathbf{y}) = \dfrac{p(\mathbf{y}\,|\,\theta)\,p(\theta)}{p(\mathbf{y})} = \dfrac{k\,l(\theta\,|\,\mathbf{y})\,p(\theta)}{p(\mathbf{y})}$  (k: a constant not involving the parameters; l: the likelihood)
If the prior is flat: $p(\theta\,|\,\mathbf{y}) \propto l(\theta\,|\,\mathbf{y})$
If the likelihood is integrable: $p(\theta\,|\,\mathbf{y}) = \dfrac{l(\theta\,|\,\mathbf{y})}{\int l(\theta\,|\,\mathbf{y})\,d\theta}$
46
EXAMPLE: CONTINUOUS PROBLEM — INFERRING THE MEAN OF A NORMAL DISTRIBUTION WITH KNOWN VARIANCE
Sampling model: $y_1, y_2, \ldots, y_N \sim NIID(\mu, \sigma^2)$
$p(y_1,\ldots,y_N\,|\,\mu,\sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\mu)^2\right)$
$= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^N (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)$  ← the second term in the exponent is the part conferring likelihood to μ
Maximum likelihood estimator of μ: $\hat\mu = \bar y$
Frequentist distribution of the ML estimator: $\hat\mu \sim N\left(\mu, \frac{\sigma^2}{N}\right)$  (DISCUSS)
24
47
Normal distribution with known variance: continued
Flat prior: $p(\mu) \propto \text{constant}$, so
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) \propto p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu) \propto \prod_{i=1}^N \exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)$
$\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\bar y)^2\right)\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right) \propto \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)$  ← kernel of a normal density
Normalizing,
$p(\mu\,|\,\mathbf{y},\sigma^2) = \dfrac{\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\int \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu}$
so that
$\mu\,|\,\mathbf{y},\sigma^2 \sim N\left(\bar y, \frac{\sigma^2}{N}\right)$
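Not part of the original slides: a minimal R sketch that checks the result above by brute force. With a flat prior, the posterior of μ evaluated on a grid should match the analytical N(ȳ, σ²/N); all data values here are made up.

# Minimal sketch (illustrative only): grid posterior of mu under a flat prior
set.seed(1)
N <- 25; sigma2 <- 4
y <- rnorm(N, 2, sqrt(sigma2))                 # simulated data
ybar <- mean(y)
mu.grid <- seq(ybar - 3, ybar + 3, length.out = 1000)
logpost <- sapply(mu.grid, function(m) sum(dnorm(y, m, sqrt(sigma2), log = TRUE)))
post <- exp(logpost - max(logpost))
post <- post / (sum(post) * diff(mu.grid)[1])  # numerical normalization
plot(mu.grid, post, type = "l", xlab = "mu", ylab = "density")
curve(dnorm(x, ybar, sqrt(sigma2 / N)), add = TRUE, lty = 2)  # analytical posterior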
48
Bayesian treatment: uniform prior for μ (all values in the (a,b) range are equally plausible, a priori):
$p(\mu) = \dfrac{1}{b-a}, \qquad a < \mu < b$
Posterior density of μ (σ² known):
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu)}{\int_a^b p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu)\,d\mu} = \dfrac{\left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}}{\int_a^b \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}\,d\mu}$
Note: only values in (a,b) are allowed.
25
49
Doing the algebra (hard way):
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\int_a^b \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu}$
with
$\int_a^b \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu = \sqrt{2\pi\frac{\sigma^2}{N}}\int_a^b \left(2\pi\frac{\sigma^2}{N}\right)^{-1/2}\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu = \sqrt{2\pi\frac{\sigma^2}{N}}\left[\Phi\left(\frac{b-\bar y}{\sqrt{\sigma^2/N}}\right) - \Phi\left(\frac{a-\bar y}{\sqrt{\sigma^2/N}}\right)\right]$
50
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\left(2\pi\frac{\sigma^2}{N}\right)^{-1/2}\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\Phi\left(\frac{b-\bar y}{\sqrt{\sigma^2/N}}\right) - \Phi\left(\frac{a-\bar y}{\sqrt{\sigma^2/N}}\right)}$
The posterior distribution is truncated normal:
$\mu\,|\,\mathbf{y},\sigma^2 \sim TN_{(a,b)}\left(\bar y, \frac{\sigma^2}{N}\right)$
26
51
Doing the algebra with Pac-Man (easy way): start all over from
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i(y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}}{\int_a^b \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i(y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}\,d\mu}$
Pac-Man (∝) is allowed to eat anything not depending on μ, i.e., eats all symbols or functions to the right of the "|" bar:
- it can eat the denominator, since μ is integrated out;
- it can eat several things in the numerator, leading to
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) \propto \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)$
Since only values in (a,b) are allowed, the posterior is TN.
52
Side note: independence versus conditional independence
Random cluster (e.g., family) effect model:
$y_{ij} = \mu + s_i + e_{ij}$, with $s_i \sim NIID(0,\sigma_s^2)$, $e_{ij} \sim NIID(0,\sigma_e^2)$, and $s_i, e_{ij}$ independent for all i, j
Marginally, $y_{ij} \sim N(\mu, \sigma_s^2 + \sigma_e^2)$ and $Cov(y_{ij}, y_{ij'}) = \sigma_s^2$:
$\begin{pmatrix} y_{ij} \\ y_{ij'} \end{pmatrix} \sim N_2\left(\begin{pmatrix}\mu\\\mu\end{pmatrix}, \begin{pmatrix}\sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
Pairs of members of the same cluster have a bivariate normal distribution: observations from the same cluster are NOT independent.
27
53
$y_{ij} = \mu + s_i + e_{ij}$, with $s_i \sim NIID(0,\sigma_s^2)$, $e_{ij} \sim NIID(0,\sigma_e^2)$, independent for all i, j
$E(y_{ij}\,|\,s_i) = \mu + s_i$, $\qquad Var(y_{ij}\,|\,s_i) = \sigma_e^2$
$Cov(y_{ij}, y_{ij'}\,|\,s_i) = Cov(\mu+s_i+e_{ij},\ \mu+s_i+e_{ij'}\,|\,s_i) = Cov(e_{ij}, e_{ij'}) = 0$
Conditional distribution:
$p(y_{ij}, y_{ij'}\,|\,s_i) = p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i) = N(\mu+s_i,\sigma_e^2)\,N(\mu+s_i,\sigma_e^2)$
GIVEN THE CLUSTER EFFECT, OBSERVATIONS IN THE SAME CLUSTER ARE CONDITIONALLY INDEPENDENT!
54
Probability model of the observations and of the unobserved cluster effect:
$p(y_{ij}, y_{ij'}, s_i) = p(y_{ij}, y_{ij'}\,|\,s_i)\,p(s_i) = p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i)\,p(s_i)$
Process called "deconditioning":
$p(y_{ij}, y_{ij'}) = \int_{-\infty}^{\infty} p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i)\,p(s_i)\,ds_i = \int_{-\infty}^{\infty} N(\mu+s_i,\sigma_e^2)\,N(\mu+s_i,\sigma_e^2)\,N(0,\sigma_s^2)\,ds_i = N_2\left(\begin{pmatrix}\mu\\\mu\end{pmatrix},\begin{pmatrix}\sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
MARGINAL DISTRIBUTIONS ARE OFTEN MORE COMPLEX
28
55
Conditional posterior distribution of the unobserved cluster effect, given the parameters
$y_{ij} = \mu + s_i + e_{ij}$; joint distribution for $n_i = 2$:
$\begin{pmatrix} s_i \\ y_{i1} \\ y_{i2}\end{pmatrix} \sim N\left(\begin{pmatrix}0\\\mu\\\mu\end{pmatrix}, \begin{pmatrix}\sigma_s^2 & \sigma_s^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
The conditional posterior distribution can be shown to be
$s_i\,|\,y_{i1}, y_{i2} \sim N(\hat s, v_s)$
The $N(0,\sigma_s^2)$ prior can be viewed as a "population" or uncertainty distribution of the parameters.
56
With $n_i = 2$ records:
$\hat s = E(s_i\,|\,y_{i1}, y_{i2}) = \frac{2}{2 + \frac{\sigma_e^2}{\sigma_s^2}}(\bar y_i - \mu) = \frac{2}{2 + \frac{4-h^2}{h^2}}(\bar y_i - \mu), \qquad v_s = \frac{\sigma_e^2}{2 + \frac{\sigma_e^2}{\sigma_s^2}}$
(for a sire model, $\frac{\sigma_e^2}{\sigma_s^2} = \frac{4-h^2}{h^2}$)
In general, for n records:
$\hat s = \frac{\frac{n}{\sigma_e^2}}{\frac{1}{\sigma_s^2} + \frac{n}{\sigma_e^2}}(\bar y_i - \mu) = \frac{n}{n + \frac{\sigma_e^2}{\sigma_s^2}}(\bar y_i - \mu), \qquad v_s = \left(\frac{1}{\sigma_s^2} + \frac{n}{\sigma_e^2}\right)^{-1} = \frac{\sigma_s^2\sigma_e^2}{\sigma_e^2 + n\sigma_s^2}$
A weighted average of the data and of the "prior" (mean 0).
29
57
Suppose heritability is 0.1, $\sigma_e^2 = 1$, $\mu = 2.5$, and there are 2 unrelated sires with $n_1 = 4$, $\bar y_1 = 3$ and $n_2 = 8$, $\bar y_2 = 2.95$:
$\hat s_1 = \dfrac{4}{4 + \frac{4-0.1}{0.1}}(3 - 2.5) = 4.65116279070 \times 10^{-2}$
$\hat s_2 = \dfrac{8}{8 + \frac{4-0.1}{0.1}}(2.95 - 2.5) = 7.65957446809 \times 10^{-2}$
$v_1 = \dfrac{1}{4 + \frac{4-0.1}{0.1}} = 2.32558139535 \times 10^{-2}$
$v_2 = \dfrac{1}{8 + \frac{4-0.1}{0.1}} = 2.12765957447 \times 10^{-2}$
Progeny of sire 1 have better performance. Sire 2 has the higher posterior mean (EBV). Sire 2 has more "information" (less shrinkage).
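Not in the original: a quick R check of the two sires' numbers under the formulas above, with $\lambda = (4-h^2)/h^2 = 39$ and $\sigma_e^2 = 1$.

# Minimal sketch: verifies the shrinkage computations of this slide
h2 <- 0.1; mu <- 2.5
lambda <- (4 - h2) / h2                                   # sigma_e^2 / sigma_s^2 = 39
shat <- function(n, ybar) n / (n + lambda) * (ybar - mu)  # posterior mean
v <- function(n) 1 / (n + lambda)                         # posterior variance (sigma_e^2 = 1)
shat(4, 3.00)    # 0.04651163 (sire 1)
shat(8, 2.95)    # 0.07659574 (sire 2)
v(4); v(8)       # 0.02325581; 0.02127660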
58
What is the strength of the evidence that sire 2 is better than sire 1?
1) Consider the posterior distribution of sire1 − sire2. Because the sires are unrelated and the parameters are known, the 2 sires have independent conditional posterior distributions:
$s_1 - s_2\,|\,\bar y_1, \bar y_2 \sim N(\hat s_1 - \hat s_2,\ v_1 + v_2)$
$s_1 - s_2\,|\,\bar y_1 = 3, \bar y_2 = 2.95 \sim N(-3.00841167739 \times 10^{-2},\ 4.45324096982 \times 10^{-2})$
[Figure: posterior density of s1 − s2, roughly symmetric about a value slightly below 0]
Not enough difference to state that the 2 sires differ at all.
30
59
2) Consider posterior distribution of sire 1/sire 2 . If sires not differentthis posterior should be centered at 1.
Problem!!! The distribution cannot be found analytically
s1 /s2 | y 1 , y 2 ??????
Solution: s1 |y 1 Ns1 ,
v 1
s2 |y 2 Ns2 ,v 2 are independent distributions. Hence, we can sample the two randomvariable and form draws of
s1/s2 | y 1 , y 2
60
Sampling from the posterior distributions using R:
> s1<-rnorm(5000,0.0465,sqrt(0.0232))
> s2<-rnorm(5000,0.0766,sqrt(0.02128))
> d<-s1-s2
> r<-s1/s2
> plot(density(s1),main="s1")
> plot(density(s2),main="s2")
> plot(density(d),main="d")
> plot(density(r),main="r",xlim=c(-100,100))
[Figures: estimated posterior densities of s1 (N = 5000, bandwidth 0.025) and s2 (N = 5000, bandwidth 0.024)]
31
61
[Figures: estimated posterior densities of d = s1 − s2 (N = 5000, bandwidth 0.0345) and r = s1/s2 (N = 5000, bandwidth 0.2401)]
> summary(d)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.99100 -0.16390 -0.02539 -0.02357 0.11820 0.68890
> mean(d)
[1] -0.02357277
> var(d)
[1] 0.04484314
> summary(r)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1339.0000 -0.7747 0.1940 3.6470 1.1890 14820.0000
> mean(r)
[1] 3.647075
> var(r)
[1] 46015.02
MUST BE CAREFUL HERE… (the ratio of two normals has extremely heavy tails when the denominator can cross zero, so Monte Carlo estimates of its mean and variance are unstable)
62
ILLUSTRATION THAT WE CAN "HIT" THE TRUE POSTERIOR DISTRIBUTION OF d BY TAKING MORE SAMPLES
True distribution: $s_1 - s_2\,|\,\bar y_1, \bar y_2 \sim N(-3.00841167739 \times 10^{-2},\ 4.45324096982 \times 10^{-2})$
1) #Samples = 5,000: mean(d) = -0.02357277, var(d) = 0.04484314
2) #Samples = 20,000: mean(d) = -0.03136750, var(d) = 0.04511715
3) #Samples = 200,000: mean(d) = -0.03037269, var(d) = 0.04457762
4) #Samples = 1,000,000: mean(d) = -0.02993406, var(d) = 0.04450943
32
63
EXAMPLE: A SIMPLE BAYESIAN SURVIVAL ANALYSIS MODEL
64
BASICS
• Non-negative random variable T
• Typically: time-to-event (death, onset of disease, successful fertilization, failure of a component)
• Censoring (right) can occur. For n individuals (i = 1, 2, …, n) observe
$y_i = \min(t_i, v_i)$, where $t_i$ is the "true" failure time and $v_i$ the "censoring point", and
$c_i = \begin{cases} 1 & \text{if } t_i \le v_i \text{ (uncensored observation)}\\ 0 & \text{if } t_i > v_i \text{ (censored)}\end{cases}$
33
65
Distribution function of failure time: $F(t) = \Pr(T \le t) = \int_0^t f(u)\,du$
Survivor function: $S(t) = 1 - F(t) = \Pr(T > t)$
Hazard function: $h(t) = \dfrac{f(t)}{S(t)}$, so that $f(t) = h(t)\,S(t)$
$\dfrac{d}{dt}\log S(t) = \dfrac{1}{S(t)}\dfrac{d}{dt}S(t) = -\dfrac{f(t)}{S(t)} = -h(t)$
66
$h(t) = -\dfrac{d}{dt}\log S(t) \;\Rightarrow\; \int_0^t h(u)\,du = -\log S(t)$  (integrated hazard)
$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$
Cumulative hazard: $H(t) = \int_0^t h(u)\,du$
Representation of the density of failure times: $f(t) = h(t)\exp\left(-\int_0^t h(u)\,du\right)$
34
67
Parametric exponential model: homogeneous population
Exponential density: $f(y_i\,|\,\lambda) = \lambda\exp(-\lambda y_i)$
Survival function: $S(y_i\,|\,\lambda) = 1 - \int_0^{y_i} \lambda\exp(-\lambda u)\,du = \exp(-\lambda y_i)$
Likelihood (assuming conditional independence):
$L(\lambda\,|\,\mathbf{y}) = \prod_{i=1}^n \left[\lambda\exp(-\lambda y_i)\right]^{c_i}\left[\exp(-\lambda y_i)\right]^{1-c_i} = \lambda^{\sum_{i=1}^n c_i}\exp\left(-\lambda\sum_{i=1}^n y_i\right)$
Note that the hazard (ratio between f and S) is constant: $h(t) = \lambda$.
68
Put a Gamma prior on λ:
$p(\lambda\,|\,\alpha_0,\beta_0) \propto \lambda^{\alpha_0-1}\exp(-\beta_0\lambda)$
Posterior of λ:
$p(\lambda\,|\,\mathbf{y},\alpha_0,\beta_0) \propto \lambda^{\sum_i c_i}\exp\left(-\lambda\sum_i y_i\right)\lambda^{\alpha_0-1}\exp(-\beta_0\lambda) = \lambda^{\sum_i c_i + \alpha_0 - 1}\exp\left(-\lambda\left[\sum_i y_i + \beta_0\right]\right)$
The posterior distribution is Gamma, with parameters
$\alpha^* = \alpha_0 + \sum_{i=1}^n c_i, \qquad \beta^* = \beta_0 + \sum_{i=1}^n y_i$
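Not in the original slides: a minimal R sketch of this conjugate analysis with simulated censored data (all constants are made up), using the shape-rate parameterization Gamma(α₀, β₀).

# Minimal sketch: Gamma prior -> Gamma posterior for the exponential model
set.seed(2)
n <- 100
t.true <- rexp(n, 0.2)                          # true failure times
v <- rexp(n, 0.1)                               # censoring points
y <- pmin(t.true, v)                            # observed times
cc <- as.numeric(t.true <= v)                   # censoring indicators
a0 <- 1; b0 <- 1                                # prior Gamma(shape a0, rate b0)
a.star <- a0 + sum(cc)                          # posterior shape
b.star <- b0 + sum(y)                           # posterior rate
lambda <- rgamma(10000, a.star, rate = b.star)  # posterior draws of lambda
mean(lambda); quantile(lambda, c(.025, .975))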
35
69
Joint, Conditional and Marginal Posterior Distributions
• Put $\theta = (\theta_1', \theta_2')'$, representing distinct features of models (e.g., means and variances)
• Then, elicit a joint prior density
$g(\theta_1,\theta_2) = g(\theta_1\,|\,\theta_2)\,g(\theta_2) = g(\theta_2\,|\,\theta_1)\,g(\theta_1)$
where $g(\theta_1)$ is a marginal prior and $g(\theta_2\,|\,\theta_1)$ a conditional prior
• The joint posterior density is
$p(\theta_1,\theta_2\,|\,\mathbf{y}) = \dfrac{L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)}{\iint L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)\,d\theta_1\,d\theta_2} \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)$
• Must decide which is the object of inference
• Joint, conditional or marginal posterior probability statements?
70
Marginal posterior densities
• Obtained directly from probability calculus as:
$p(\theta_1\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_2, \qquad p(\theta_2\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_1$
• Additional marginalization may be needed if $\theta_1 = (\theta_{1A}', \theta_{1B}')'$:
$p(\theta_{1A}\,|\,\mathbf{y}) = \iint p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_{1B}\,d\theta_2 = \int p(\theta_1\,|\,\mathbf{y})\,d\theta_{1B}$
36
71
Marginal posteriors are mixtures
• Let $\theta_2$ be a "nuisance" parameter. Then
$p(\theta_1\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_2 = \int p(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_2\,|\,\mathbf{y})\,d\theta_2 = E_{\theta_2|\mathbf{y}}\left[p(\theta_1\,|\,\theta_2,\mathbf{y})\right]$
• Distribution $[\theta_1\,|\,\theta_2,\mathbf{y}]$ (conditional posterior): describes uncertainty when the nuisance parameter is known
• Distribution $[\theta_2\,|\,\mathbf{y}]$ (marginal posterior of nuisance): uncertainty about the nuisance parameter
• Distribution $[\theta_1\,|\,\mathbf{y}]$ (marginal posterior of primary parameter): uncertainty about the primary parameter
72
Conditional posterior distributions
• By definition of conditional density:
$p(\theta_1\,|\,\theta_2,\mathbf{y}) = \dfrac{p(\theta_1,\theta_2\,|\,\mathbf{y})}{p(\theta_2\,|\,\mathbf{y})}$
• Here, one is interested in variation about $\theta_1$ only:
$p(\theta_1\,|\,\theta_2,\mathbf{y}) \propto p(\theta_1,\theta_2\,|\,\mathbf{y}) \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,p(\theta_1,\theta_2) \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,p(\theta_1\,|\,\theta_2) = L(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_1\,|\,\theta_2)$
• Identifying conditional posterior distributions is important for implementing MCMC methods (sampling from posteriors)
37
73
Example of a 2-parameter problem: mean and variance of a normal distribution
Independent priors; joint posterior ∝ likelihood × prior
[equations shown on slide not extracted]
74
Marginal posterior density of variance (integrate mean out)
Marginal posterior density of mean (integrate variance out)
Cannot be written in closed form.
38
75
Assume c = 0 and d = ∞ (the positive part of the real line is the prior for the variance). Then, after adding and subtracting $\bar y$, one obtains the kernel of a t-distribution truncated in (a,b) on n−3 degrees of freedom.
76
The marginal posterior distribution of the mean is then as above. If a = −∞ and b = ∞, the marginal posterior is univariate-t on n−3 degrees of freedom.
CAN VERIFY:
$p(\mu,\sigma^2\,|\,DATA,a,b,c,d) \ne p(\mu\,|\,DATA,a,b,c,d)\;p(\sigma^2\,|\,DATA,a,b,c,d)$
TYPICALLY, PARAMETERS ARE NOT INDEPENDENT A POSTERIORI, EVEN IF SO A PRIORI
39
77
Analysis 2: improper uniform priors
$p(\mu,\sigma^2\,|\,\mathbf{y}) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\right) = (\sigma^2)^{-n/2}\exp\left(-\frac{\sum_{i=1}^n (y_i-\bar y)^2 + n(\mu-\bar y)^2}{2\sigma^2}\right)$
The kernel is as before, but the parameters are unbounded, except for the parameter space of the variance.
78
$p(\sigma^2\,|\,\mathbf{y}) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_i (y_i-\bar y)^2\right)\int_{-\infty}^{\infty}\frac{\sqrt{2\pi\sigma^2/n}}{\sqrt{2\pi\sigma^2/n}}\exp\left(-\frac{n(\mu-\bar y)^2}{2\sigma^2}\right)d\mu$
$\propto (\sigma^2)^{-\frac{n-1}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_i (y_i-\bar y)^2\right) = (\sigma^2)^{-\left(\frac{n-3}{2}+1\right)}\exp\left(-\frac{S^2(n-3)}{2\sigma^2}\right)$
where
$S^2 = \dfrac{\sum_{i=1}^n (y_i-\bar y)^2}{n-3}$
The posterior is scaled-inverse chi-squared:
$\sigma^2\,|\,\mathbf{y} \sim (n-3)\,S^2\,\chi^{-2}_{(n-3)}$
40
79
> nu<-20
> S2<-5
> chisqunu<-rchisq(20000,20)
> sinvchi2<-nu*S2/chisqunu
> summary(sinvchi2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.734 4.200 5.172 5.564 6.493 23.450
> plot(density(sinvchi2),main="Scale inv. chi-square, df=20, Scale=5")
[Figure: density, N = 20000, bandwidth 0.2124]
> nu<-4.00001
> S2<-5
> chisqunu<-rchisq(20000,4.00001)
> sinvchi2<-nu*S2/chisqunu
> summary(sinvchi2)
> plot(density(sinvchi2),main="Scale inv. chi-square, df=4.00001, Scale=5")
[Figure: density, N = 20000, bandwidth 0.6285; much heavier right tail with df near 4]
80
Using the same integration technique as before:
$p(\mu\,|\,\mathbf{y}) = \dfrac{\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}}{\int_{-\infty}^{\infty}\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}d\mu}$
The posterior is univariate-t on n−3 degrees of freedom, and the integration constant is
$\int_{-\infty}^{\infty}\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}d\mu = \dfrac{\Gamma\left(\frac{n-3}{2}\right)\sqrt{(n-3)\pi S^2/n}}{\Gamma\left(\frac{(n-3)+1}{2}\right)}$
41
81
Side note on the t-distribution
$t = \dfrac{z}{\sqrt{\chi^2_\nu/\nu}}$, with $z \sim N(0,1)$ and ν the "degrees of freedom"
$E(t) = 0, \qquad Var(t) = \dfrac{\nu}{\nu-2}$
$f(t) = \dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
Model with t-distributed errors: $y = \mu + St$, with S a "scale" parameter
$E(y) = \mu, \qquad Var(y) = S^2\dfrac{\nu}{\nu-2}, \qquad t = \dfrac{y-\mu}{S}, \qquad \dfrac{dt}{dy} = \dfrac{1}{S}$
$f(y) = \dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{S\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{(y-\mu)^2}{\nu S^2}\right)^{-\frac{\nu+1}{2}}$
82
Simulate a t-distribution with mean 10, scale² 5 and 3 d.f. The variance is 5 × 3/(3−2) = 15. We will compare with a normal distribution with mean 10 and variance 15.
> m<-10
> df<-3
> S<-sqrt(5)
> z<-rnorm(50000,0,1)
> chisq<-rchisq(50000,3)
> y<-10+S*z/sqrt(chisq/3)
> mean(z)
[1] 0.0008953519
> var(z)
[1] 1.009415
> mean(chisq)
[1] 2.978861
> var(chisq)
[1] 5.896806
> mean(y)
[1] 9.999967
> var(y)
[1] 13.83878
42
83
[Figures: density of w with a normal distribution (mean = 10, var = 15) and of y with a t-distribution (mean = 10, df = 3, var = 15); the t-distribution accommodates more extreme values]
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-61.440 8.266 10.020 10.000 11.730 100.400
> summary(w)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.509 7.436 10.010 10.020 12.620 25.300
Student-t vs Normal: more extreme values are accommodated by the t.
84
Predictive Distributions
• Let $\mathbf{y}_f$ = unobserved vector of "future" or "missing" data (truncated, censored, missing data, future data: data augmentation!). Then:
1) $p(\theta,\mathbf{y},\mathbf{y}_f) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta,\mathbf{y}) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})\,p(\mathbf{y})$
2) $p(\theta,\mathbf{y}_f\,|\,\mathbf{y}) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})$
3) $p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})\,d\theta = E_{\theta|\mathbf{y}}\left[p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\right]$
4) If, given the parameters, data are conditionally independent:
$p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta)\,p(\theta\,|\,\mathbf{y})\,d\theta$  (usual representation of the posterior predictive density)
Also, $p(\theta,\mathbf{y}) = \int p(\theta,\mathbf{y},\mathbf{y}_f)\,d\mathbf{y}_f$: use for posterior predictive checks.
43
85
A model for binary data
$y_i \sim \text{Bernoulli}(\theta)$, with θ the probability of success
Assuming conditional independence (x = number of successes): $p(y_1,\ldots,y_N\,|\,\theta) = \theta^x(1-\theta)^{N-x}$
Beta prior: $p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$
Posterior: $p(\theta\,|\,\mathbf{y}) \propto \theta^x(1-\theta)^{N-x}\theta^{a-1}(1-\theta)^{b-1} = \theta^{x+a-1}(1-\theta)^{N-x+b-1}$, so $\theta\,|\,\mathbf{y} \sim Beta(x+a,\ N-x+b)$

              Prior                    Posterior
Distribution  Beta(a, b)               Beta(x+a, N−x+b)
Mean          a/(a+b)                  (x+a)/(N+a+b)
Variance      ab/[(a+b)²(a+b+1)]       (x+a)(N−x+b)/[(N+a+b)²(N+a+b+1)]
86
Binary data: predictive distribution
$p(y_f\,|\,\mathbf{y}) = \int_0^1 \binom{N_f}{x_f}\theta^{x_f}(1-\theta)^{N_f-x_f}\,\frac{\theta^{x+a-1}(1-\theta)^{N-x+b-1}}{B(x+a,\ N-x+b)}\,d\theta$
$= \frac{\binom{N_f}{x_f}}{B(x+a,\ N-x+b)}\int_0^1 \theta^{x_f+x+a-1}(1-\theta)^{N_f+N-x_f-x+b-1}\,d\theta = \binom{N_f}{x_f}\frac{B(x_f+x+a,\ N_f+N-x_f-x+b)}{B(x+a,\ N-x+b)}$
where $x_f$ is the future number of "successes" and $N_f$ the future number of Bernoulli trials: the beta-binomial distribution (discrete).
44
87
Simulating the predictive distribution. Suppose the posterior distribution of the success probability is Beta(30,10). We draw 40,000 samples and plot the posterior:
> #Bernoulli probability is prob~Beta(30,10)
> prob<-rbeta(40000,30,10)
> mean(prob)
[1] 0.7499961
> var(prob)
[1] 0.004579322
[Figure: posterior distribution of success probability, Beta(30,10); N = 40000, bandwidth 0.0073]
88
We construct the predictive distribution via composition sampling. Simulate 40,000 binomial trials with n = 5:
#Simulation by composition
m<-2
r<-40000
theta<-matrix(0,m,r)
theta[1,]<-rbeta(40000,30,10)
for (i in 1:r) {theta[2,i]<-rbinom(1,5,theta[1,i])}
mean(theta[1,])
mean(theta[2,])
histcases<-hist(theta[2,])
plot<-density(theta[2,])
[Figures: histogram and density of theta[2,], the predictive distribution of future successes out of 5 trials]
> mean(theta[1,])
[1] 0.7501773
> mean(theta[2,])
[1] 3.751525
45
89
Exact and estimated posterior densities (most of the time we will not be able to derive the posterior, but may be able to sample from it)
[Figure: true, unknown posterior vs estimated posterior (from samples), over parameter values; a shaded region marks biological importance, Pr(biol. important)]
90
Estimating a posterior expectation and variance from samples
Posterior expectation: $E(\theta\,|\,\mathbf{y}) = \int \theta\,p(\theta\,|\,\mathbf{y})\,d\theta$  (maybe the posterior is unknown, or the integral impossible to compute)
Samples available from $[\theta\,|\,\mathbf{y}]$: $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}$
Estimate the integral as
$\widehat{E(\theta\,|\,\mathbf{y})} = \frac{1}{S}\sum_{i=1}^S \theta^{(i)}$
Monte Carlo error: $\widehat{E(\theta\,|\,\mathbf{y})} - E(\theta\,|\,\mathbf{y}) = \frac{1}{S}\sum_{i=1}^S \theta^{(i)} - E(\theta\,|\,\mathbf{y})$, which goes to 0 as S tends to infinity.
46
91
Monte Carlo variance of the estimate of the posterior mean
Measures the variability to be expected if repeated sampling (each time S samples drawn) is done from the posterior:
$Var_{\theta|\mathbf{y}}(\text{Monte Carlo error}) = Var_{\theta|\mathbf{y}}\left[\widehat{E(\theta|\mathbf{y})} - E(\theta|\mathbf{y})\right] = Var_{\theta|\mathbf{y}}\left[\frac{1}{S}\sum_{i=1}^S \theta^{(i)} - E(\theta|\mathbf{y})\right] = Var_{\theta|\mathbf{y}}\left[\frac{1}{S}\sum_{i=1}^S \theta^{(i)}\right]$
92
$Var(\text{MCE}) = \frac{1}{S^2}\left[\sum_{i=1}^S Var(\theta^{(i)}\,|\,\mathbf{y}) + 2\sum\sum_{i<j} Cov(\theta^{(i)},\theta^{(j)}\,|\,\mathbf{y})\right] = \frac{1}{S^2}\left[S\,Var(\theta\,|\,\mathbf{y}) + 2\,Var(\theta\,|\,\mathbf{y})\sum\sum_{i<j}\rho_{ij}\right] = \frac{Var(\theta\,|\,\mathbf{y})}{S}\left[1 + \frac{2}{S}\sum\sum_{i<j}\rho_{ij}\right]$
The second term is null only if the samples are independent.
IF MARKOV CHAIN MONTE CARLO SAMPLING IS PRACTICED, SAMPLES ARE TYPICALLY SERIALLY CORRELATED.
IMPORTANT TO EVALUATE AUTO-CORRELATIONS IN MCMC, TO ASSESS MONTE CARLO ERROR
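Not in the original: a minimal R sketch of the point above, using an AR(1) chain as a stand-in for serially correlated MCMC output. The naive standard error understates the Monte Carlo error unless corrected by the autocorrelations (effective sample size).

# Minimal sketch: Monte Carlo error with autocorrelated samples
set.seed(3)
S <- 10000; rho <- 0.9
theta <- numeric(S)
for (i in 2:S) theta[i] <- rho * theta[i - 1] + rnorm(1, 0, sqrt(1 - rho^2))
naive.se <- sd(theta) / sqrt(S)                 # ignores serial correlation
rho.k <- acf(theta, lag.max = 50, plot = FALSE)$acf[-1]
ess <- S / (1 + 2 * sum(rho.k))                 # effective sample size
corrected.se <- sd(theta) / sqrt(ess)
c(naive = naive.se, corrected = corrected.se, ESS = ess)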
47
93
POSTERIOR PROBABILITIES
• Joint probabilities:
$\Pr(\theta \in \Theta_0\,|\,\mathbf{y}) = \int_{\Theta_0} p(\theta\,|\,\mathbf{y})\,d\theta$
• Marginal probabilities:
$\Pr(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = \int_{\Theta_1}\int_{\Theta_2} p(\theta_2,\theta_1\,|\,\mathbf{y})\,d\theta_2\,d\theta_1 = \int_{\Theta_1}\int_{\Theta_2} p(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_2\,|\,\mathbf{y})\,d\theta_2\,d\theta_1$
• Equivalently:
$\Pr(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = E_{\theta_2|\mathbf{y}}\left[\Pr(\theta_1 \in \Theta_1\,|\,\theta_2,\mathbf{y})\right]$
94
MONTE CARLO ESTIMATES OF POSTERIOR PROBABILITIES
• Suppose samples $\theta_2^{(1)}, \theta_2^{(2)}, \ldots, \theta_2^{(m)}$ are available from $[\theta_2\,|\,\mathbf{y}]$. Then
$\widehat{\Pr}(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = \frac{1}{m}\sum_{i=1}^m \Pr\left(\theta_1 \in \Theta_1\,|\,\theta_2^{(i)},\mathbf{y}\right)$
• 1) It must be easy to sample from $[\theta_2\,|\,\mathbf{y}]$
• 2) $\Pr(\theta_1 \in \Theta_1\,|\,\theta_2,\mathbf{y})$ must be available in closed form, so it can be evaluated at each draw $\theta_2^{(i)}$
48
95
Estimating the predictive density from posterior samples
$p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta)\,p(\theta\,|\,\mathbf{y})\,d\theta$
Samples available from $[\theta\,|\,\mathbf{y}]$: $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}$
Density at a point $y_0$:
$\hat p(y_f = y_0\,|\,\mathbf{y}) = \frac{1}{S}\sum_{i=1}^S p(y_f = y_0\,|\,\theta^{(i)})$
Called ergodic averaging: the basis of the Monte Carlo method.
96
METROPOLIS-HASTINGS ALGORITHM
…and derivatives
49
97
1. FORM OF ALGORITHM
Let g(θ) be the posterior or conditional posterior of interest; its integration constant is not needed.
1. Generate a candidate $\theta^*$ from a proposal density $f(\theta^*\,|\,\theta^{(t-1)})$
2. Draw a random number $U \sim \text{Uniform}(0,1)$
3. Compute the ratio
$R = \dfrac{g(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{g(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{c\,p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{c\,p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)}$
4. If $U < \min(R, 1)$, set $\theta^{(t)} = \theta^*$; otherwise $\theta^{(t)} = \theta^{(t-1)}$
Important: the sample is not rejected; the chain value is just repeated.
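Not in the original: a minimal R sketch of the algorithm above for a standard normal target, using a random-walk proposal (so the proposal densities cancel in R). The target and tuning constants are made up.

# Minimal sketch: random-walk Metropolis-Hastings for a N(0,1) target
set.seed(4)
log.g <- function(th) dnorm(th, 0, 1, log = TRUE)  # log of (unnormalized) target
S <- 20000; theta <- numeric(S); theta[1] <- -5    # deliberately bad start
for (t in 2:S) {
  cand <- rnorm(1, theta[t - 1], 1)                # symmetric proposal
  logR <- log.g(cand) - log.g(theta[t - 1])        # proposal terms cancel
  theta[t] <- if (log(runif(1)) < logR) cand else theta[t - 1]  # repeat on rejection
}
keep <- theta[-(1:2000)]                           # discard burn-in
mean(keep); var(keep)                              # approx 0 and 1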
98
2. SPECIAL FORMS: USING THE POSTERIOR AS PROPOSAL
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/f(\theta^*|\theta^{(t-1)})}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}|\theta^*)} = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/\left[c\,p(\mathbf{y}|\theta^*)\,p(\theta^*)\right]}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/\left[c\,p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})\right]} = 1$
Every draw is accepted. If this were not so, one would have doubts…
50
99
3. SPECIAL FORMS: METROPOLIS ALGORITHM
Take a symmetric (in its arguments) proposal density: $f(\theta^*|\theta^{(t-1)}) = f(\theta^{(t-1)}|\theta^*)$
The acceptance ratio becomes
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})}$
If $U < \min(R,1)$, set $\theta^{(t)} = \theta^*$; otherwise $\theta^{(t)} = \theta^{(t-1)}$.
100
4. SPECIAL FORMS: INDEPENDENCE CHAIN METROPOLIS
$f(\theta^*|\theta^{(t-1)}) = f(\theta^*)$: the proposal density is independent of the current state of the sequence of sampled values
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/f(\theta^*)}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)})}$
The acceptance ratio is calculated as in standard Metropolis-Hastings.
51
101
6. SPECIAL FORMS: GIBBS SAMPLING (notation is not precise…)
Use the fully conditionals as proposal distributions:
$f(\theta_i^*\,|\,\theta_i^{(t)}) = \pi(\theta_i^*\,|\,\theta_{-i}^{(t)}), \qquad f(\theta_i^{(t)}\,|\,\theta_i^*) = \pi(\theta_i^{(t)}\,|\,\theta_{-i}^{(t)})$
By Bayes theorem,
$\pi(\theta_i^*\,|\,\theta_{-i}^{(t)}) = \dfrac{\pi(\theta_i^*, \theta_{-i}^{(t)})}{\pi(\theta_{-i}^{(t)})}, \qquad \pi(\theta_{-i}^{(t)}\,|\,\theta_i^*) = \dfrac{\pi(\theta_i^*, \theta_{-i}^{(t)})}{\pi(\theta_i^*)}$
MH acceptance ratio (recall that g(θ) is a posterior, or conditional posterior, that we do not recognize or know how to sample from):
$R = \dfrac{g(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{g(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_{-i}^{(t)}\,|\,\theta_i^*)}{\pi(\theta_i^*\,|\,\theta_{-i}^{(t)})} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_i^*,\theta_{-i}^{(t)})/\pi(\theta_i^*)}{\pi(\theta_i^*,\theta_{-i}^{(t)})/\pi(\theta_{-i}^{(t)})} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_{-i}^{(t)})}{\pi(\theta_i^*)} = 1$
GIBBS SAMPLING IS A SPECIAL CASE OF MH: the proposal is always accepted.
(Fix in handout)
102
ILLUSTRATION OF DIRECT, COMPOSITION AND GIBBS SAMPLING
1) Assumptions and basic results
1
2
N0
2,
1 − 34 1 1
2 −0. 375
−0.375 12
2
E1 |2 E1 Cov1 ,2
Var2 2 − E2
1 −0. 37512
22 − 2
4. 0 − 1. 52
Var1 |2 Var11 − 2
1 1 − − 34
2
716
0. 4375
E2 |1 E2 Cov1 ,2
Var1 1 − E1
2 −0.3751
1 − 0
2 − 0.3751
Var2 |1 Var21 − 2
12
2 1 − − 3
4
2
764
0.109 375
Suppose we wish to draw samples from
52
103
2) Direct sampling from the bivariate normal distribution
$\begin{pmatrix}\theta_1\\\theta_2\end{pmatrix} = \begin{pmatrix}0\\2\end{pmatrix} + \text{Cholesky}\left[\begin{pmatrix}1 & -0.375\\ -0.375 & 0.25\end{pmatrix}\right]\begin{pmatrix}z_1\\z_2\end{pmatrix} = \begin{pmatrix}0\\2\end{pmatrix} + \begin{pmatrix}1.0 & 0\\ -0.375 & 0.330718913883\end{pmatrix}\begin{pmatrix}z_1\\z_2\end{pmatrix} = \begin{pmatrix}z_1\\ 2 - 0.375\,z_1 + 0.330718913883\,z_2\end{pmatrix}$
with $z_1, z_2 \sim NIID(0,1)$.
104
> # Simulation of a bivariate normal - Direct
> mu<-matrix(c(0,2),2,1)
> V<-matrix(c(1.0,-0.375,-0.375,0.25),2,2)
> C<-t(chol(V))
> C
[,1] [,2]
[1,] 1.000 0.0000000
[2,] -0.375 0.3307189
> m<-2
> r<-20000
> thetavec<-matrix(0,m,r)
> for (i in 1:r) {z<-matrix(rnorm(m),m,1)
+ thetavec[,i]<-mu+C%*%z}
> mean(thetavec[1,])
[1] 0.003480938
> mean(thetavec[2,])
[1] 1.997486
> v1<-var(thetavec[1,]); v2<-var(thetavec[2,]); v12<-cov(thetavec[1,],thetavec[2,])
> Vsim<-matrix(c(v1,v12,v12,v2),2,2)
> Vsim
[,1] [,2]
[1,] 0.9859114 -0.3697481
[2,] -0.3697481 0.2473953
53
105
COMPOSITION OR CHAIN SAMPLING FROM A JOINT DISTRIBUTION
$\Pr(X_1=x_1,\ldots,X_N=x_N) = \Pr(X_1=x_1)\Pr(X_2=x_2|X_1=x_1)\Pr(X_3=x_3|X_1=x_1,X_2=x_2)\cdots\Pr(X_N=x_N|X_1=x_1,\ldots,X_{N-1}=x_{N-1})$
Then $\mathbf{x}' = (x_1, x_2, \ldots, x_{N-1}, x_N)$ is a realization from the joint distribution above.
106
3) Sampling from the bivariate normal distribution using composition
> m<-2
> r<-20000
> thetavec<-matrix(0,m,r)
> thetavec[1,]<-rnorm(20000,0,1)
> for (i in 1:r) {thetavec[2,i]<-2-0.375*thetavec[1,i]+sqrt(0.109375)*rnorm(1,0,1)}
> mean(thetavec[1,])
[1] -0.007715923
> mean(thetavec[2,])
[1] 2.004965
> Vsim
[,1] [,2]
[1,] 1.0113990 -0.3806295
[2,] -0.3806295 0.2519804
54
107
GIBBS SAMPLING
Want to sample from the joint posterior [A,B,C|DATA].
The sample is [A(j), B(j), C(j) | DATA]; each coordinate is a draw from its marginal posterior: [A(j)|DATA], [B(j)|DATA], [C(j)|DATA].
108
Gibbs sampling works as follows:
1) Form all fully conditional posteriors: [A|B,C,DATA], [B|A,C,DATA], [C|A,B,DATA]
2) Draw and update successively
3) Repeat a number of times without storing samples (burn-in)
4) Collect all subsequent samples, and thin them if needed for storage purposes
55
109
At the end of the process:

j      A      B      C
1      A1     B1     C1
2      A2     B2     C2
…      …      …      …
t      At     Bt     Ct
t+1    At+1   Bt+1   Ct+1
…      …      …      …
t+m    At+m   Bt+m   Ct+m

Discard the first t samples as burn-in; keep the subsequent m samples for posterior analysis.
110
Sampling from the bivariate normal distribution using the Gibbs sampler. We wish to draw samples from
$\begin{pmatrix}\theta_1\\\theta_2\end{pmatrix} \sim N\left(\begin{pmatrix}0\\2\end{pmatrix}, \begin{pmatrix}1 & -0.375\\ -0.375 & 0.25\end{pmatrix}\right)$
using the fully conditionals
$\theta_1\,|\,\theta_2 \sim N(-1.5(\theta_2 - 2),\ 0.4375)$
$\theta_2\,|\,\theta_1 \sim N(2 - 0.375\,\theta_1,\ 0.109375)$
> # Simulation by Gibbs sampling
> L<-15000 #Chain length
> burn<-1000 #Length of burn-in
> thetavec<-matrix(0,L,2) #Contains the chain
> mu1<-0 #True mean of theta 1
> mu2<-2 #True mean of theta 2
> sigma1<-1 #True SD of theta 1
> sigma2<-0.5 #True SD of theta 2
> rho<--0.75 #True correlation
> s1<-sqrt(1-rho^2)*sigma1 #SD of cond. distribution of 1 given 2
> s2<-sqrt(1-rho^2)*sigma2 #SD of cond. distribution of 2 given 1
56
111
###Run the chain####
thetavec[1,]<-c(0,0) #Initialize
for (i in 2:L) { #Open loop
thetasamp2<-thetavec[i-1,2] #Sample of theta2
m1<-mu1+(rho*sigma1/sigma2)*(thetasamp2-mu2)
thetavec[i,1]<-rnorm(1,m1,s1)
thetasamp1<-thetavec[i,1] #Sample of theta1
m2<-mu2+(rho*sigma2/sigma1)*(thetasamp1-mu1)
thetavec[i,2]<-rnorm(1,m2,s2)
} #Close loop
b<-burn+1
theta<-thetavec[b:L,] #post-burn-in samples
112
#Evaluate samples
colMeans(theta)
cov(theta)
cor(theta)
#Plot samples
plot(theta,main="Scatter of bivariate samples", cex=.5, xlab=bquote(thetavec[1]),
ylab=bquote(thetavec[2]),ylim=range(theta[,2]))
> colMeans(theta)
[1] -0.01379416 2.00553885
> cov(theta)
[,1] [,2]
[1,] 0.9910101 -0.3735499
[2,] -0.3735499 0.2506818
> cor(theta)
[,1] [,2]
[1,] 1.0000000 -0.7494595
[2,] -0.7494595 1.0000000
[Figure: scatter of the bivariate samples, matching the target distribution]
57
113
GIBBS SAMPLING IN A BETA-BINOMIAL MODEL: draw samples from the predictive distribution
Likelihood and prior:
$p(y_1,\ldots,y_n\,|\,\theta) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}, \qquad p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$
Posterior (Beta density):
$p(\theta\,|\,\mathbf{y}) \propto \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{\sum_i y_i + a - 1}(1-\theta)^{n - \sum_i y_i + b - 1}$
Joint density of parameter and future observation:
$p(y_f, \theta\,|\,\mathbf{y}) = p(y_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y}) = \theta^{y_f}(1-\theta)^{1-y_f}\,\theta^{\sum_i y_i + a - 1}(1-\theta)^{n-\sum_i y_i + b - 1}$
The fully conditionals:
$p(y_f\,|\,\theta,\mathbf{y}) = p(y_f\,|\,\theta)$  (Bernoulli)
$\theta\,|\,y_f,\mathbf{y} \sim Beta\left(y_f + \sum_i y_i + a,\ (1-y_f) + n - \sum_i y_i + b\right)$
• Initialize the parameter
• Sample y_f from a Bernoulli distribution with this parameter
• Sample the parameter from the Beta distribution with parameters as above
• Repeat many times, throw away early draws (burn-in) — as sketched below
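Not in the original: a minimal R implementation of this two-block Gibbs sampler with made-up data; the retained y_f draws estimate the predictive probability, which can be checked against the closed form (Σy + a)/(n + a + b).

# Minimal sketch: Gibbs sampling of (theta, y_f) in the beta-binomial model
set.seed(5)
a <- 1; b <- 1
y <- rbinom(20, 1, 0.7); n <- length(y); sy <- sum(y)
S <- 12000; burn <- 2000
theta <- 0.5; yf <- numeric(S)
for (s in 1:S) {
  yf[s] <- rbinom(1, 1, theta)                                # y_f | theta
  theta <- rbeta(1, yf[s] + sy + a, (1 - yf[s]) + n - sy + b) # theta | y_f, y
}
mean(yf[-(1:burn)])        # Monte Carlo estimate of Pr(y_f = 1 | y)
(sy + a) / (n + a + b)     # closed-form predictive probability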
58
115
EXAMPLE: MH FOR A GLIM (Carlin and Louis, 2000)
Number of flour beetles killed after exposure to carbon disulphide:

Dosage w_i   No. Killed y_i   No. Exposed n_i
1.6907            6                59
1.7242           13                60
1.7552           18                62
1.7842           28                56
1.8113           52                63
1.8369           53                59
1.8610           61                62
1.8839           60                60

Generalized logit model:
$\Pr(\text{death}\,|\,w) = h(w) = \left[\frac{\exp(x)}{1 + \exp(x)}\right]^{m_1}, \qquad x = \frac{w - \mu}{\sigma}$
$w_i$ = dose, $i = 1, 2, \ldots, k$; unknown parameters $(\mu, \sigma^2, m_1)$
Priors:
$m_1 \sim Gamma(a_0, b_0) \propto m_1^{a_0-1}\exp\left(-\frac{m_1}{b_0}\right)$
$\mu \sim N(c_0, d_0)$
$\sigma^2 \sim \text{Inverse Gamma}(e_0, f_0) \propto (\sigma^2)^{-(e_0+1)}\exp\left(-\frac{1}{f_0\sigma^2}\right)$
116
Joint posterior:
$p(\mu,\sigma^2,m_1\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\exp\left(-\frac{(\mu-c_0)^2}{2d_0^2}\right)(\sigma^2)^{-(e_0+1)}\exp\left(-\frac{1}{f_0\sigma^2}\right)m_1^{a_0-1}\exp\left(-\frac{m_1}{b_0}\right)$
$= \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\ \frac{m_1^{a_0-1}}{(\sigma^2)^{e_0+1}}\exp\left(-\frac{(\mu-c_0)^2}{2d_0^2} - \frac{m_1}{b_0} - \frac{1}{f_0\sigma^2}\right)$
The joint posterior is not recognizable… Use Metropolis-Hastings.
59
117
Transform variables, to work on $\mathbb{R}^3$ so that Gaussian proposals can be used:
$\theta_1 = \mu, \qquad \theta_2 = \tfrac{1}{2}\log\sigma^2, \qquad \theta_3 = \log m_1$
so that $\mu = \theta_1$, $\sigma^2 = \exp(2\theta_2)$, $m_1 = \exp(\theta_3)$.
Jacobian:
$J = \begin{pmatrix}\frac{\partial\mu}{\partial\theta_1} & \frac{\partial\mu}{\partial\theta_2} & \frac{\partial\mu}{\partial\theta_3}\\ \frac{\partial\sigma^2}{\partial\theta_1} & \frac{\partial\sigma^2}{\partial\theta_2} & \frac{\partial\sigma^2}{\partial\theta_3}\\ \frac{\partial m_1}{\partial\theta_1} & \frac{\partial m_1}{\partial\theta_2} & \frac{\partial m_1}{\partial\theta_3}\end{pmatrix} = \begin{pmatrix}1 & 0 & 0\\ 0 & 2\exp(2\theta_2) & 0\\ 0 & 0 & \exp(\theta_3)\end{pmatrix} \;\Rightarrow\; |J| = 2\exp(2\theta_2 + \theta_3)$
118
New density = old density (evaluated at the transformed variables) times the Jacobian:
$p(\theta_1,\theta_2,\theta_3\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\ \frac{\exp(\theta_3)^{a_0-1}}{\exp(2\theta_2)^{e_0+1}}\exp\left(-\frac{(\theta_1-c_0)^2}{2d_0^2} - \frac{\exp(\theta_3)}{b_0} - \frac{1}{f_0\exp(2\theta_2)}\right)\exp(2\theta_2+\theta_3)$
Collecting terms:
$p(\theta_1,\theta_2,\theta_3\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\exp\left(a_0\theta_3 - 2e_0\theta_2\right)\exp\left(-\frac{(\theta_1-c_0)^2}{2d_0^2} - \frac{\exp(\theta_3)}{b_0} - \frac{1}{f_0\exp(2\theta_2)}\right)$
THE POSTERIOR IS NOT RECOGNIZABLE…
60
119
Hyper-parameters: a0 = 0.25, b0 = 4, c0 = 2, d0 = 10, e0 = 2.000004, f0 = 1000
1) Metropolis-Hastings proposal distribution used:
$\begin{pmatrix}\theta_1^*\\\theta_2^*\\\theta_3^*\end{pmatrix} \sim N\left(\begin{pmatrix}\theta_1^{(t-1)}\\\theta_2^{(t-1)}\\\theta_3^{(t-1)}\end{pmatrix},\ D = \begin{pmatrix}0.00012 & 0 & 0\\ 0 & 0.033 & 0\\ 0 & 0 & 0.10\end{pmatrix}\right)$
Since
$f(\theta^*\,|\,\theta^{(t-1)}) = \frac{1}{(2\pi)^{3/2}|D|^{1/2}}\exp\left(-\tfrac{1}{2}(\theta^*-\theta^{(t-1)})'D^{-1}(\theta^*-\theta^{(t-1)})\right) = f(\theta^{(t-1)}\,|\,\theta^*)$
the proposal is symmetric, and
$R = \dfrac{p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)}{p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})}$
120
Numerical stability is improved by computing the acceptance ratio on the log scale:
$\log R = \sum_{i=1}^k\left[y_i\log\frac{h^*(w_i)}{h^{(t-1)}(w_i)} + (n_i-y_i)\log\frac{1-h^*(w_i)}{1-h^{(t-1)}(w_i)}\right] + a_0\left(\theta_3^* - \theta_3^{(t-1)}\right) - 2e_0\left(\theta_2^* - \theta_2^{(t-1)}\right) - \frac{\left(\theta_1^*-c_0\right)^2 - \left(\theta_1^{(t-1)}-c_0\right)^2}{2d_0^2} - \frac{\exp(\theta_3^*) - \exp(\theta_3^{(t-1)})}{b_0} + \frac{1}{f_0}\left[\frac{1}{\exp(2\theta_2^{(t-1)})} - \frac{1}{\exp(2\theta_2^*)}\right]$
$R = \exp(\log R)$
61
121
Three parallel chains were run, each with 10,000 iterations; burn-in = 2,000 in each chain. Histograms are based on the (10,000 − 2,000) × 3 = 24,000 sampled values. Autocorrelations and inter-correlations were estimated from chain 2.
Chains mix slowly (13.5% acceptance rate). High correlations between parameters, so it makes sense to explore a different proposal:
$\begin{pmatrix}1 & -0.78 & -0.94\\ -0.78 & 1 & 0.89\\ -0.94 & 0.89 & 1\end{pmatrix}$
122
2) Second Metropolis-Hastings proposal distribution: from the first algorithm, estimate the posterior covariance matrix as
$\hat\Sigma = \frac{1}{m}\sum_{j=1}^m (\theta_j - \bar\theta)(\theta_j - \bar\theta)'$
Use a Gaussian proposal with covariance matrix $2\hat\Sigma$ (gave acceptance rate 27.3%), where
$\hat\Sigma = \begin{pmatrix}0.000292 & -0.003546 & -0.007856\\ -0.003546 & 0.074733 & 0.117809\\ -0.007856 & 0.117809 & 0.241551\end{pmatrix}$
62
123
If the fully conditionals are not recognizable, use other sampling methods:
• Metropolis
• Metropolis-Hastings
• Acceptance/rejection sampling
• Importance sampling
These methods require an auxiliary distribution, which must be tuned, and calculation of an "acceptance probability".
1
Dealing with epistatic interactions and non-linearities
gene x gene
gene x gene x gene
gene x gene x gene x gene
…
(Alice in Wonderland)
Statistical Interaction (fixed effects models)
$y_{ijk} = \mu + A_i + B_j + (AB)_{ij} + e_{ijk}$
$E(y_{ijk}\,|\,A_i, B_j, (AB)_{ij}) = \mu + A_i + B_j + (AB)_{ij}$
$E(y_{ijk} - y_{i'jk'}\,|\,A_i, B_j, (AB)_{ij}, A_{i'}, B_j, (AB)_{i'j}) = \left[A_i + B_j + (AB)_{ij}\right] - \left[A_{i'} + B_j + (AB)_{i'j}\right] = (A_i - A_{i'}) + \left[(AB)_{ij} - (AB)_{i'j}\right]$
The difference between levels of factor A depends on the level of B.
If factor A has a levels and factor B has b levels, the degrees of freedom are: (a−1), (b−1), and (a−1)(b−1) [assuming no empty cells].
2
Fixed effects models (unravelling "physiological epistasis" a la Cheverud?)
• Lots of "main effects"
• Splendid non-orthogonality
• Lots of 2-factor interactions
• Lots of 3-factor interactions
• Lots of non-estimability
• Lots of uninterpretable high-order interactions
• Run out of "degrees of freedom"
Analysis with random effects models?
MEUWISSEN et al. (2001); GIANOLA et al. (2003); XU (2003)
-- Use all SNP markers in statistical models
-- Mechanistic basis to the mixed effects linear model (genetic effects treated as random variables)
-- Highly parametric models
-- Strong assumptions made
"Ridge regression-type" — will talk about this later
3
What is ridge regression?
$\hat\beta_{OLS} = (X'X)^{-1}X'\mathbf{y}$
$\hat\beta_{RIDGE} = (X'X + \lambda I)^{-1}X'\mathbf{y}$
$\hat\beta_{BAYES} = \left(X'X + B^{-1}\frac{\sigma_e^2}{\sigma_\beta^2}\right)^{-1}\left(X'\mathbf{y} + \frac{\sigma_e^2}{\sigma_\beta^2}B^{-1}\beta_0\right)$
The Bayes model assumes, a priori, $\beta \sim N(\beta_0, B\sigma_\beta^2)$; typically $\beta_0 = 0$ is assumed, and B is typically an identity matrix (however, it can be given structure).
Large values of λ "shrink" the regressions towards 0 (induces bias, but higher precision than OLS).
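Not in the original: a minimal R sketch contrasting OLS and ridge on simulated data (λ and the dimensions are arbitrary); the ridge coefficients are visibly shrunk towards zero.

# Minimal sketch: ridge regression vs OLS
set.seed(6)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p, 0, 0.5)
y <- X %*% beta + rnorm(n)
lambda <- 5
b.ols   <- solve(crossprod(X), crossprod(X, y))                     # (X'X)^-1 X'y
b.ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X'X + lambda I)^-1 X'y
cbind(b.ols, b.ridge)    # ridge estimates are pulled towards 0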
Dealing with interactions ("statistical epistasis"): much of this took place in inspiring Iowan landscapes…
[Cartoon map of Iowa: "MORE PIGS HERE", "SOME CORN", "PIGS AGAIN", "OK", "Bayesians, keep out!" — with a joke model containing as many pig x pig interaction effects as there are pigs]
C. C. C
4
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
5
$a \sim N(0,\sigma_a^2), \quad d \sim N(0,\sigma_d^2), \quad (aa) \sim N(0,\sigma_{aa}^2), \quad (ad) \sim N(0,\sigma_{ad}^2), \quad (dd) \sim N(0,\sigma_{dd}^2), \quad \ldots, \quad N(0,\sigma_{dd\ldots d}^2)$
The degrees of freedom of the distribution are NOT GIVEN by the number of levels. There is now 1 df for each type of genetic effect.
[Matrix representation; variance-covariance decomposition]
6
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
DO THESE ASSUMPTIONS HOLD? ALL ASSUMPTIONS VIOLATED!
Just consider linkage disequilibrium
(Cartoon: "My girlfriend is a bitch")
Digression: linkage disequilibrium
7
$D(t) = (1-r)^t D(0)$
[Figure: evolution of linkage disequilibrium D over generations t = 0…40, starting at D(0) = 0.25, for recombination rates r = 0.01, 0.10, 0.50]
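Not in the original: a minimal R sketch reproducing this figure from D(t) = (1 − r)^t D(0).

# Minimal sketch: decay of linkage disequilibrium with recombination rate r
t <- 0:40; D0 <- 0.25
plot(t, D0 * (1 - 0.01)^t, type = "l", ylim = c(0, 0.25),
     xlab = "No. generations t", ylab = "D")   # r = 0.01: slow decay
lines(t, D0 * (1 - 0.10)^t, lty = 2)           # r = 0.10
lines(t, D0 * (1 - 0.50)^t, lty = 3)           # r = 0.50: fast decay
legend("topright", c("r = 0.01", "r = 0.10", "r = 0.50"), lty = 1:3)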
8
A VIEW OF LINEAR MODELS (as employed in q. genetics)
Mathematically, a linear model can be viewed as a "local" approximation of a complex process:
Linear approximation
Quadratic approximation
nth order approximation
9
Example
$y = g(x) + e$  (y: response variate; e: model residual; g: some function of a covariate x)
Suppose $g(x) = \sin(x) + \cos(x)$. Second-order Taylor series expansion about 0:
$\frac{d}{dx}\sin(x) = \cos(x), \qquad \frac{d}{dx}\cos(x) = -\sin(x)$
$\frac{d}{dx}\left[\sin(x) + \cos(x)\right] = \cos(x) - \sin(x)$
$\frac{d^2}{dx^2}\left[\sin(x) + \cos(x)\right] = -\cos(x) - \sin(x)$
$\sin(x) + \cos(x) \approx \sin(0) + \cos(0) + \left[\cos(0) - \sin(0)\right](x-0) + \tfrac{1}{2}\left[-\cos(0) - \sin(0)\right](x-0)^2 = 1 + x - \frac{x^2}{2}$
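Not in the original: a minimal R sketch plotting g(x) against its two Taylor approximations; both are good only near x = 0.

# Minimal sketch: local quality of the Taylor approximations
g <- function(x) sin(x) + cos(x)
g1 <- function(x) 1 + x               # linear approximation about 0
g2 <- function(x) 1 + x - x^2 / 2     # quadratic approximation about 0
x <- seq(-5, 5, 0.01)
plot(x, g(x), type = "l", ylim = c(-1.5, 1.5), ylab = "y")
lines(x, g1(x), lty = 2)
lines(x, g2(x), lty = 3)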
How good are the linear and quadratic approximations? Recall that a Taylor series provides a local approximation only…
[Figure: (1) the sine-plus-cosine function, (2) its linear approximation, and (3) its quadratic approximation over x ∈ (−5, 5); (4) the approximations are good at x = 0…]
10
But we have environmental noise…: evaluate the function sin(x)+cos(x) at x = 0, 0.5 and 1.
True values are:
> sin(0)+cos(0)
[1] 1
> sin(0.5)+cos(0.5)
[1] 1.357008
> sin(1)+cos(1)
[1] 1.381773
VERY CLOSE TO EACH OTHER; NOISE CAN MASK SIGNALS!
Create an R data set (N = 300) by adding 100 N(0,1) residuals to each of the 3 values:
> y0<-sin(0)+cos(0) +rnorm(100,0,1)
> y05<-sin(0.5)+cos(0.5)+rnorm(100,0,1)
> y1<-sin(1)+cos(1) +rnorm(100,0,1)
> y<-c(y0,y05,y1)
11
Create a larger R data set (N = 300,000) by adding 100,000 N(0,1) residuals to each of the 3 values:
> y0<-sin(0)+cos(0) +rnorm(100000,0,1)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,1)
> y1<-sin(1)+cos(1) +rnorm(100000,0,1)
> y<-c(y0,y05,y1)
CANNOT SEE THE UNDERLYING STRUCTURE: LARGE NOISE (ERROR VARIANCE)
Now we get a more precise measuring instrument, with SD 0.05:
> y0<-sin(0)+cos(0) +rnorm(100000,0,.05)
> y1<-sin(1)+cos(1) +rnorm(100000,0,.05)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,.05)
STRUCTURE IS REVEALED, BUT WE CANNOT DIFFERENTIATE BETWEEN TWO OF THE UNDERLYING VALUES
12
…SO WE BUY ANOTHER INSTRUMENT WITH SD 0.001!
> y0<-sin(0)+cos(0) +rnorm(100000,0,.001)
> y1<-sin(1)+cos(1) +rnorm(100000,0,.001)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,.001)
> y<-c(y0,y05,y1)
STILL CANNOT DIFFERENTIATE BETWEEN
> sin(0.5)+cos(0.5)
[1] 1.357008
> sin(1)+cos(1)
[1] 1.381773
[Figures: kernel density estimates of y with the default bandwidth (0.0126) and with a smaller bandwidth (0.002)]
HOWEVER, NON-PARAMETRIC DENSITY ESTIMATES DEPEND ON SOME BANDWIDTH PARAMETER. BY REDUCING IT, WE CAN SEE THE ENTIRE STRUCTURE OF THE PROBLEM…
13
Heal Thyself: Systems Biology Model Reveals How Cells Avoid Becoming Cancerous. ScienceDaily (May 21, 2006)
What about something like this?
What to do in genomic-assisted analysis of complex genetic signals?
• Include all markers, model all possible interactions? Unrealistic…
• Select sets of influential markers via model selection: huge search space; frequentist methods "err" probabilistically; Bayesian model selection (reversible-jump MCMC) is difficult to tune
• Use the LASSO (least absolute shrinkage and selection operator): Tibshirani (1996). What about interactions?
• Explore model-free techniques that have been used successfully in many domains: semi-parametric regression; machine learning (focus on prediction, learning mappings from inputs to outputs)
14
DEFINITION OF MACHINE LEARNING (Wikipedia)
Machine learning: a subfield of artificial intelligence concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time (to learn) based on data.
A major focus of machine learning research is to automatically produce (induce) models, such as rules and patterns, from data. Hence, machine learning is closely related to fields such as data mining, statistics, inductive reasoning, pattern recognition, and theoretical computer science.
1
5. Machine learning classification: chicken mortality
• Genomics Initiative Project at Aviagen, Ltd.
• 231 sires, each genotyped for 5,523 SNPs
• End-point: sire family progeny mortality rates (0-14 d)
• Monomorphic SNPs excluded: 4,148 polymorphic SNPs
• 95.5% in HWE at the 0.001 significance level
[Data split: one part USED FOR SNP SELECTION, the other USED FOR VALIDATION]
Thick tail — comments follow
2
Pre-processing data
• AVIAGEN fitted a linear model to individual 0-1 records (a generalized linear model would have been better)
• The linear model had nuisance effects (e.g., age of dam, contemporary group) and breeding values of birds. All effects were estimated
• Individual records were corrected by subtracting, from the phenotypes, the estimates of nuisance effects and one-half of the EBV of the dam
• The intent was to remove all effects other than those "due to" sires
• Corrected records were averaged for each sire
Criticism
• Correction assumes additive inheritance, and smoothes the data according to such theory
• Better correction: use only estimates of nuisance effects
• In dairy cattle, they are using EBVs of sires (inferred according to additive inheritance) as the target. This is objectionable for the same reason
• Bishop (2006): "Care must be taken during pre-processing because information is often discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer"
3
Convert into a classification problem: partitioning of sire samples
• Mimic a case-control study. Case-control studies: patients who have a disease are compared with those not having the disease
• 2 classes: high (case) and low (control) mortality groups
• Two thresholds for mortality rate, determined from bootstrap (1,000 samples) and empirical distributions:
$c_1 = \alpha$ percentile, $\qquad c_2 = (1-\alpha)$ percentile
Bootstrapping
• Regard sample of size N as pseudo-population• Form B samples of size N by re-sampling with replacement from
pseudo-population• Original y vector replaced by y*
b b=1,2,…,N Some elements of original data will appear more than once or not at all in sample b
• Estimate some parameter (e.g., the 10-percentile, in each of the Bsamples). Then you have a bootstrap distribution.
• Take the average over the B samples. Estimate its standard error with standard formulae.
• B between 50-200 seems to suffice to estimate the SE in simple models
EFRON, B. and TIBSHIRANI, R. J. 1993. An Introduction to the bootstrap. Chapman & Hall, London.
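Not in the original: a minimal R sketch of the recipe above, bootstrapping the 10th percentile of a made-up sample.

# Minimal sketch: bootstrap distribution of the 10th percentile
set.seed(7)
y <- rnorm(200)                        # the sample, treated as pseudo-population
B <- 200
boot.q10 <- replicate(B, quantile(sample(y, replace = TRUE), 0.10))
mean(boot.q10)                         # bootstrap estimate of the 10th percentile
sd(boot.q10)                           # its bootstrap standard error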
4
The 11 case-control groups
Bootstrap and empirical estimates are similar: use bootstrap estimates to form groups.
[Figure: density with the lower tail labeled "Control" and the upper tail labeled "Case"]
Two-step SNP selection
• A filter step reduces thousands of SNPs to a small number
• A wrapper optimizes performance of the filtered SNPs by adopting a naïve Bayesian classifier to score subsets
WRAPPER COMPUTATIONS CARRIED OUT WITH WEKA (Weka 3: Data Mining Software in Java)
5
Filter: information gain
• Information gain, SNP by SNP: the change in uncertainty about the mortality outcome (e.g., high or low) due to a single SNP
$\text{InfoGain}(F) = I\left(\frac{N_+}{N}, \frac{N_-}{N}\right) - \sum_i \frac{N_i}{N}\,I\left(\frac{N_{i+}}{N_i}, \frac{N_{i-}}{N_i}\right)$
where
$I\left(\frac{N_+}{N}, \frac{N_-}{N}\right) = -\frac{N_+}{N}\log_2\frac{N_+}{N} - \frac{N_-}{N}\log_2\frac{N_-}{N}$
and $N_+$ and $N_-$ are the number of instances belonging to the + class and − class, respectively; the sum runs over the genotypes of the SNP.
(First term: entropy from the phenotypes; second term: entropy after knowing the SNP.)
STATISTICAL ENTROPY (discrete situation)
Information is a property of a distribution. The mean information of a distribution (statistical entropy) is, for X with K states and probabilities $p_1, \ldots, p_K$:
$H(p_1, p_2, \ldots, p_K) = E\left[I(X)\right] = E\left[-\log\Pr(X = x_i)\right] = -\sum_{i=1}^K p_i\log p_i$
6
$H(p_1, p_2, \ldots, p_K) = -\sum_{i=1}^K p_i\log p_i$
1) MUST BE POSITIVE
2) NULL VALUE ⟺ NO UNCERTAINTY. IF ONE IS TOLD ABOUT THE OUTCOME, ENTROPY FALLS TO 0
3) MAXIMUM ENTROPY ⟺ A LOT OF INFORMATION TO BE GAINED FROM OBSERVATION
4) DIFFERENCE IN ENTROPY BEFORE AND AFTER DATA: A MEASURE OF INFORMATION CONTENT (called information gain)
EXAMPLE: ENTROPY OF A RANDOM POSITION IN DNA (states A, T, G, C equally frequent):
$-\sum_{i=1}^4 \frac{1}{4}\log_2\frac{1}{4} = 2 \text{ bits}$
EXAMPLE: ENTROPY OF A CONSERVED POSITION (states have unequal frequencies), say $p_A = 0.7$, $p_G = 0.3$:
$\text{Entropy} = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} = 0.88 \text{ bits}$
Information content of the position: $2 - 0.88 = 1.12$ bits
7
Example: calculate the information gain associated with the A-locus (SNP by SNP)

Genotype   Number   Dead   Alive
AA         100      80     20
Aa         200      40     160
aa         100      60     40
Total      400      180    220    (180/400 = 0.45; 220/400 = 0.55)

Maximum entropy: $I = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.0$
Entropy of the mortality distribution:
$I\left(\frac{N_A}{N}, \frac{N_D}{N}\right) = -\frac{180}{400}\log_2\frac{180}{400} - \frac{220}{400}\log_2\frac{220}{400} = 0.9927744540$
8
Entropy within each genotype:
$I_{AA} = -\frac{20}{100}\log_2\frac{20}{100} - \frac{80}{100}\log_2\frac{80}{100} = 0.7219280949$
$I_{Aa} = -\frac{160}{200}\log_2\frac{160}{200} - \frac{40}{200}\log_2\frac{40}{200} = 0.7219280949$
$I_{aa} = -\frac{40}{100}\log_2\frac{40}{100} - \frac{60}{100}\log_2\frac{60}{100} = 0.9709505945$
Weighted average:
$\frac{100}{400}(0.7219280949) + \frac{200}{400}(0.7219280949) + \frac{100}{400}(0.9709505945) = 0.784183719$
INFO GAIN = 0.9927744540 − 0.784183719 = 0.208590735
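Not in the original: a minimal R function reproducing the A-locus computation above.

# Minimal sketch: information gain for the A-locus table
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
counts <- matrix(c(80, 20, 40, 160, 60, 40), nrow = 3, byrow = TRUE,
                 dimnames = list(c("AA", "Aa", "aa"), c("Dead", "Alive")))
H.pheno <- entropy(colSums(counts) / sum(counts))     # 0.9927745
H.snp <- sum(rowSums(counts) / sum(counts) *
             apply(counts, 1, function(r) entropy(r / sum(r))))  # 0.7841837
H.pheno - H.snp                                       # info gain: 0.2085907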
Weka 3: Data Mining Software in Java — http://www.cs.waikato.ac.nz/ml/weka/
Weka: a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
9
Wrapper — classifier
The naive Bayesian classifier assumes conditional independence of the SNPs $(A_1, \ldots, A_k)$ given the mortality class C (the naïve Bayes assumption). Example: $A_j$ = Bb, C = "low mortality class". Here $(A_1, \ldots, A_k)$ is the multi-SNP genotype of a sire family; C = high or low.
Suppose that the TRAINING SAMPLE is

Genotype   Number   Dead   Alive
AA         100      80     20
Aa         200      40     160
aa         100      60     40
Total      400      180    220    (180/400 = 0.45; 220/400 = 0.55)

$\Pr(AA\,|\,\text{Dead}) = 80/180 = 0.4444, \quad \Pr(Aa\,|\,\text{Dead}) = 40/180 = 0.2222, \quad \Pr(aa\,|\,\text{Dead}) = 60/180 = 0.3333$
$\Pr(AA\,|\,\text{Alive}) = 20/220 = 0.0909, \quad \Pr(Aa\,|\,\text{Alive}) = 160/220 = 0.7273, \quad \Pr(aa\,|\,\text{Alive}) = 40/220 = 0.1818$
10
A TEST case needs to be classified, and it is aa. What is the probability that it will be alive?
$\Pr(\text{alive}\,|\,aa) = \frac{\Pr(aa\,|\,\text{alive})\,\Pr(\text{alive})}{\Pr(aa)} = \frac{0.1818181818 \times 0.55}{100/400} = 0.40$
The test case will be predicted to be DEAD.
1) If actually dead, the classification is CORRECT
2) If alive, the classification is in ERROR
• Suppose a sire family is "low" and the wrapper sends it to "low": accurate classification
• Suppose a sire family is "low" and the wrapper sends it to "high": inaccurate classification
• Suppose 50 families in a group are classified, and 20 are misclassified: the error rate is 40%
Baseline error rate: 50%
11
Classification accuracy before and after SNP selection in each group
[Figure: accuracy vs. no. of SNPs, under leave-one-out CV and 10-fold CV; the "coin tossing line" marks random classification]
• The full set contains the 50 SNPs retained in the filter step. The subset was searched by backward elimination.
12
Chromosome 1: the 9 sets of selected SNPs
• For groups 1 and 2, no SNP on chromosome 1 was selected
Predictive ability of mortality rates
Main effects model (by group), all 201 sires. PRESS criterion (leave-one-out). View the model as a "smoother", and not as one aiming at a mechanistic basis.
13
Basic regression model: $y_i = \mathbf{x}_i'\beta + e_i$, i.e., $\mathbf{y} = X\beta + \mathbf{e}$, with $\hat\beta = (X'X)^{-1}X'\mathbf{y}$ and $\hat{\mathbf{y}} = X\hat\beta = X(X'X)^{-1}X'\mathbf{y} = H\mathbf{y}$  (H: the "hat" matrix)
The PRESS cross-validation criterion: delete observation i, fit $\hat\beta_{-i} = (X_{-i}'X_{-i})^{-1}X_{-i}'\mathbf{y}_{-i}$, and form
$\hat e_i = y_i - \mathbf{x}_i'\hat\beta_{-i} = \frac{y_i - \mathbf{x}_i'\hat\beta}{1 - h_{ii}}, \qquad PRESS = \sum_{i=1}^n \hat e_i^2$
Model degrees of freedom: $df_{model} = tr(H) = tr\left[X(X'X)^{-1}X'\right] = tr\left[(X'X)^{-1}X'X\right] = p$
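Not in the original: a minimal R sketch computing PRESS through the hat-matrix shortcut on simulated data.

# Minimal sketch: PRESS without refitting, via e_i / (1 - h_ii)
set.seed(8)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
y <- X %*% rnorm(p) + rnorm(n)
H <- X %*% solve(crossprod(X)) %*% t(X)    # hat matrix
e <- y - H %*% y                           # ordinary residuals
PRESS <- sum((e / (1 - diag(H)))^2)        # leave-one-out squared residuals
c(PRESS = PRESS, df.model = sum(diag(H)))  # trace(H) = p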
Predictive ability of the 11 sets of selected SNPs — relative PRESS:
1.0, 0.9615384615, 1.153846154, 1.038461538, 1.192307692, 1.153846154, 1.346153846, 1.538461538, 1.576923077, 1.384615385, 1.653846154
14
7. Comparison between classification methods: SNP-mortality associations in broilers in two environments
• Loss of information due to using only the tails of the distribution? Use all sires now in "cases" and "controls"; compare with results from using a subset of sires
• Contrast 3 classification methods:
– Naïve Bayes classifier
– Bayesian networks
– Neural networks
15
Data
• >200 sire families; chicks in low (L) or high (H) hygiene environments
• Traits: early-age (0-14 d) and late-age (14-42 d) mortality
• >5,000 whole-genome SNPs (on sires)
• Phenotypic values constructed for sires by fitting a GLMM at the individual bird level, accounting for non-genetic and non-hygiene environmental factors: the sire's adjusted progeny mortality means
Data set: four age-hygiene strata formed
EL: early mortality, low hygiene; EH: early mortality, high hygiene
LL: late mortality, low hygiene; LH: late mortality, high hygiene
• Analysis performed in each stratum in the same way
• Enables study of G x E
16
Methods: multi-category classification
• Five categorizations (number of categories K = 2, 3, 4, 5 and 10) employed to categorize the sires' adjusted means
• All sire means used
• Sire means ordered: each assigned to one of K equal-sized subsets. E.g., when K = 3, the two boundary values correspond to the 1/3 and 2/3 quantiles of the distribution of sire means
Methods: "filter"
• Rank SNPs by the information gain of each
• Using the categorized phenotypic values, information gain was computed for each SNP for each categorization (K = 2, 3, 4, 5, 10)
• The SNPs retained for further analysis were the top 50 for information gain
17
Methods: "wrapper"
• Optimize performance of the top 50 SNPs selected by the "filter"
• Yields the "best" subset of SNPs, with the highest classification accuracy of the chosen classifier
• Classification methods: naïve Bayes (NB) classifier — NB wrapper; Bayesian networks (BN) — BN wrapper; neural networks (NN) — NN wrapper
"Naïve Bayes wrapper"
• Naïve Bayes: strong assumption of independence of the SNPs, given the phenotype
$p(Y\,|\,X_1, \ldots, X_n) \propto p(Y)\,p(X_1, \ldots, X_n\,|\,Y)$
$p(X_1, X_2, \ldots, X_n\,|\,Y) = p(X_1\,|\,Y)\,p(X_2\,|\,Y)\cdots p(X_n\,|\,Y)$  (naïve Bayes assumption)
Y: class label; the X's: SNP genotypes.
Important: these probabilities are trained with past data.
18
BAYESIAN NETWORKS
• Bayesian hierarchical model: sequence of probability distributions describing uncertainty: starts from the likelihood…ends with some hyper-prior
• Bayesian network: hierarchical model for sets of discrete random variables. Our problem is discrete:
classification
[Graph: nodes a and b, each with an arrow to node c]

PROBABILISTIC GRAPHICAL MODELS

A graph has:
- Nodes or vertices: each represents a RV or a group of RVs
- Links (arcs, edges): each expresses a relationship
The graph visualizes the structure of a probabilistic model, gives insights into that structure (e.g., conditional independence), and allows complex computations to be expressed graphically. It describes how the joint distribution is decomposed. A directed graph may express causality; an undirected graph may express constraints.

$p(a,b,c) = p(a,b)\,p(c|a,b)$

$p(a,b,c) = p(a)\,p(b|a)\,p(c|a,b)$  (second representation; used in the graph)
19
The representation $p(a,b,c)$ is symmetrical with respect to the variables. A different ordering, $p(b)\,p(c|b)\,p(a|b,c)$, leads to a different graphical representation.

$p(x_1,x_2,\ldots,x_K) = p(x_1)\,p(x_2|x_1)\cdots p(x_K|x_{K-1},x_{K-2},\ldots,x_1)$

One node per conditional distribution; each node has links from all lower-numbered nodes. Such a graph is said to be fully connected: one link for every pair of nodes.
Absence of links reveals interesting information: a graph that is not fully connected.

Example (not fully connected, nodes x1, …, x7):

$p(x) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1,x_2,x_3)\,p(x_5|x_1,x_3)\,p(x_6|x_4)\,p(x_7|x_4,x_5)$

Then, letting $pa_k$ denote the parents of node k,

$p(x) = \prod_{k=1}^{K}p(x_k|pa_k)$
• Important for sampling
• Can generate a sample from the joint distribution by using the independences indicated by the network (ancestral sampling)
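As a sketch of ancestral sampling for the 7-node example above (the conditional probability values below are made up for illustration; only the parent structure follows the factorization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_network(n):
    """Ancestral sampling: draw each binary node given its parents."""
    out = np.empty((n, 7), dtype=int)
    for i in range(n):
        x1 = rng.random() < 0.6
        x2 = rng.random() < 0.3
        x3 = rng.random() < 0.5
        # p(x4 | x1, x2, x3): any function of the three parents will do
        x4 = rng.random() < (0.1 + 0.25 * (x1 + x2 + x3))
        x5 = rng.random() < (0.2 + 0.3 * x1 + 0.3 * x3)
        x6 = rng.random() < (0.7 if x4 else 0.2)
        x7 = rng.random() < (0.5 * x4 + 0.4 * x5)
        out[i] = [x1, x2, x3, x4, x5, x6, x7]
    return out

print(sample_network(5))
```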
20
Example: Polynomial random regression as a graphical model
$t_i = x_i'w + e_i, \qquad t = Xw + e$

$w \sim N(0, I\alpha^{-1}), \qquad e \sim N(0, I\sigma^2)$

Joint distribution of the data and of the regression coefficients (in the machine learning literature, regressions are called “weights”; the intercept is called the “bias”):

$p(t,w|x,\alpha,\sigma^2) = p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)$

Conditional distribution of the regressions, given the data (inference):

$p(w|t,x,\alpha,\sigma^2) = \dfrac{p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)}{\int p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)\,dw}$

Conditional distribution of future observations, given current data (prediction):

$p(t_f|x_f,t,x,\alpha,\sigma^2) = \int p(t_f,w|x_f,t,x,\alpha,\sigma^2)\,dw = \int p(t_f|x_f,w,\sigma^2)\,p(w|t,x,\alpha,\sigma^2)\,dw$

Crucial in genomic-assisted improvement programs
21
Example: Joint distribution for polynomial random regression model
[Graphs: node w with arrows to t1, t2, …, tN; alternative plate representation with xn, α and σ2 shown explicitly. Open circles: random variables; solid circles: deterministic quantities. Shaded: observed; unshaded: latent (hidden).]

$p(t,w|x,\alpha,\sigma^2) = p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)$

Network representation of predicting a future case $t_f$ at input $x_f$ in the polynomial random regression model. The same two operations apply:

INFERENCE:

$p(w|t,x,\alpha,\sigma^2) = \dfrac{p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)}{\int p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)\,dw}$

PREDICTION:

$p(t_f|x_f,t,x,\alpha,\sigma^2) = \int p(t_f,w|x_f,t,x,\alpha,\sigma^2)\,dw = \int p(t_f|x_f,w,\sigma^2)\,p(w|t,x,\alpha,\sigma^2)\,dw$
22
DISCRETE RANDOM VARIABLES

A random variable has K states. Represent it as a vector x of order K × 1:

$x' = (x_1\ x_2\ \ldots\ x_K)$

If the k-th modality is observed, $x_k = 1$ and the rest are 0. Let $\Pr(x_k = 1) = \mu_k$. Then

$p(x|\mu) = \prod_{k=1}^{K}\mu_k^{x_k}, \qquad \sum_{k=1}^{K}\mu_k = 1 \;\rightarrow\; K-1 \text{ parameters}$

2 RV: two random variables, each with K states, so $K^2$ possibilities. Let $\Pr(x_{1k}=1, x_{2l}=1) = \mu_{kl}$:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\mu_{kl}^{x_{1k}x_{2l}}, \qquad \sum_{k=1}^{K}\sum_{l=1}^{K}\mu_{kl} = 1 \;\rightarrow\; K^2-1 \text{ parameters}$

M RV:

$p(x_1,x_2,\ldots,x_M|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\cdots\prod_{m=1}^{K}\mu_{kl\ldots m}^{x_{1k}x_{2l}\cdots x_{Mm}}, \qquad \sum_{k}\sum_{l}\cdots\sum_{m}\mu_{kl\ldots m} = 1 \;\rightarrow\; K^M-1 \text{ parameters}$

THE NUMBER OF PROBABILITIES (PARAMETERS) GROWS EXPONENTIALLY WITH M
23
Suppose $x_1, x_2$ are independent:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\mu_k^{x_{1k}}\prod_{l=1}^{K}\mu_l^{x_{2l}}, \qquad \sum_{k}\mu_k = 1,\ \sum_{l}\mu_l = 1 \;\rightarrow\; 2(K-1) \text{ parameters}$

versus the fully dependent case:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\mu_{kl}^{x_{1k}x_{2l}} \;\rightarrow\; K^2-1 \text{ parameters}$

Equivalently,

$p(x_1,x_2|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu): \qquad (K-1) + K(K-1) = K^2-1 \text{ parameters}$

For M independent variables: M(K-1), so the number of parameters grows linearly.

DISCRETE NETWORK
• Minimum number of parameters is M(K-1): graph is fully disconnected
• Maximum number is $K^M - 1$: graph is fully connected
• Typically, the search is for networks with intermediate levels of connectivity
• This is part of the modeling effort. There are several options for reducing the number of probabilities to estimate.
24
Example 1 of intermediate connectivity: Markov chain

The general factorization $p(x_1,x_2,\ldots,x_M|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu)\cdots p(x_M|x_1,\ldots,x_{M-1},\mu)$ becomes, for a Markov chain,

$p(x_1,x_2,\ldots,x_M|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu)\,p(x_3|x_2,\mu)\cdots p(x_M|x_{M-1},\mu)$

Parameter count:

$(K-1) + K(K-1) + K(K-1) + \cdots + K(K-1) = (K-1) + (M-1)K(K-1) = (K-1)\left[1 + (M-1)K\right]$

Linear (rather than exponential) in M.

[Graph: chain x1 → x2 → … → xM]
Example 2 of intermediate connectivity: Bayesian structure of a network. For a Markov chain, assign a Dirichlet prior with parameters μi to the probabilities of each node.

[Graphs: chain x1 → x2 → … → xM, either with node-specific priors μ1, μ2, …, μM, or with a single shared prior μ]
25
Example 3 of reducing connectivity

Instead of using complete tables of probability distributions, use parameterized models.

[Graph: binary parents x1, x2, …, xM, each with prior μi, all pointing to a binary node y]

$\Pr(y=1|X_1=x_1, X_2=x_2, \ldots, X_M=x_M)$

$2^M$ CONFIGURATIONS of the X VARIABLES → $2^M$ CONDITIONAL PROBABILITIES WOULD NEED TO BE ESTIMATED

One possible form of modelling:

$\Pr(y=1|X_1=x_1,\ldots,X_M=x_M) = \sigma(w'x), \qquad w'x = w_0 + \sum_{j=1}^{M}x_jw_j$

$\sigma(w'x) = \dfrac{\exp(w'x)}{1+\exp(w'x)}$ (logit model), or $\sigma(w'x) = \Phi(w'x)$ (probit model)

$M+1$ parameters → linear in the number of variables
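A tiny sketch of that reduction (the weights below are arbitrary, chosen only to illustrate that one function of M + 1 parameters replaces a table of $2^M$ probabilities):

```python
import numpy as np

def logistic_cpd(x, w0, w):
    """Pr(y = 1 | X = x) under the logit parameterization above."""
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

M = 10
rng = np.random.default_rng(0)
w0, w = -1.0, rng.normal(scale=0.5, size=M)   # M + 1 parameters in total
x = rng.integers(0, 2, size=M)                # one of 2**M configurations
print(logistic_cpd(x, w0, w))
```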
26
MARKOV BLANKET

Recall $p(x) = \prod_{k=1}^{K}p(x_k|pa_k)$. Let $x_{-i}$ denote all x’s other than $x_i$.

For discrete random variables:

$p(x_i|x_{-i}) = \dfrac{p(x_1,x_2,\ldots,x_M)}{p(x_{-i})} = \dfrac{p(x_1,x_2,\ldots,x_M)}{\sum_{x_i}p(x_1,x_2,\ldots,x_M)} = \dfrac{\prod_{k=1}^{K}p(x_k|pa_k)}{\sum_{x_i}\prod_{k=1}^{K}p(x_k|pa_k)}$

For continuous random variables, the sum over $x_i$ is replaced by an integral $\int(\cdot)\,dx_i$.

Further, all factors not involving $x_i$ cancel:

$p(x_i|x_{-i}) \propto p(x_i|pa_i)\prod_{k:\,x_i\in pa_k}p(x_k|pa_k)$

The first factor depends on the parents of i; the product depends on the children of i and on the co-parents of i.
27
Markov blanket of node xi comprises the set of parents, children and co-parents of node
The conditional distribution of xi given all other variables depends only on variablesin the blanket
xi
Parent
Co-parent
Children
“Bayesian Net-wrapper”
• Suppose the Bayesian network is:
[Figure: example network over SNP1-SNP4 and Disease, with arc SNP3|SNP1 and SNP4 marginal probabilities DD 0.7, Dd 0.2, dd 0.1]
28
• Bayesian network: the joint distribution is decomposed by the conditional independences implied by the graph
• The Markov blanket for disease can be found from, e.g.:
P(SNP1=AA, SNP2=bb, SNP3=Cc, SNP4=DD, Disease=Present)
= P(SNP1=AA) P(SNP2=bb|SNP1=AA) P(SNP3=Cc|SNP1=AA) P(SNP4=DD) × P(Disease=Present|SNP3=Cc, SNP4=DD)
FINDING THE NETWORK
“NP-hard problem”
• No shortcut or smart algorithm leads to a simple or rapid solution.• Only way to find optimal solution is a computationally-intensive, exhaustive analysis
in which all possible outcomes are tested.
29
Network selection problem

Mh = a specific network structure. Choose the network with the largest posterior probability. Assuming networks are equi-probable a priori, model choice is driven by the marginal likelihood (which depends on the number of variables in the net and on the sample size). Use the Markov blanket representation.

Consider network M1: class label Y (2 labels: “normal”, “disease”) with parents X1, X2, X3, X4 (3 genotypes each):

$P(Y,X_1,X_2,X_3,X_4) = P(Y|X_1,X_2,X_3,X_4)\,P(X_1,X_2,X_3,X_4) \rightarrow 2\times 3^4 = 162 \text{ probabilities}$

$P(X_1,X_2,X_3,X_4) \rightarrow 3^4 = 81 \text{ probabilities} \rightarrow 80 \text{ are free}$

$P(Y|X_1,X_2,X_3,X_4) \rightarrow 2\times 81 \text{ probabilities} \rightarrow 81 \text{ free}$

The network has 80 + 81 = 161 parameters
30
Now, look at network M2 (the SNP1-SNP4/Disease graph above, with arc SNP3|SNP1 and SNP4 probabilities DD 0.7, Dd 0.2, dd 0.1):

P(SNP1=AA, SNP2=bb, SNP3=Cc, SNP4=DD, Disease=Present)
= P(SNP1=AA) P(SNP2=bb|SNP1=AA) P(SNP3=Cc|SNP1=AA) P(SNP4=DD) P(Disease=Present|SNP3=Cc, SNP4=DD)

P(SNP1) → 2 parameters
P(SNP2|SNP1) → 3 parameters
P(SNP3|SNP1) → 3 parameters
P(SNP4) → 2 parameters
P(Disease|SNP3,SNP4) → 9 parameters

19 parameters in the net
31
ESTIMATING THE PROBABILITIES

$P(\text{Data}|\text{parameters}) = \binom{n}{n_1, n_2, \ldots, n_k}\prod_{j=1}^{k}\theta_j^{n_j}$

Prior:

$P(\text{parameters}) = \dfrac{1}{B(a_1,\ldots,a_k)}\prod_{j=1}^{k}\theta_j^{a_j-1} = \dfrac{\Gamma\!\left(\sum_{j=1}^{k}a_j\right)}{\prod_{j=1}^{k}\Gamma(a_j)}\prod_{j=1}^{k}\theta_j^{a_j-1} \;\rightarrow\; \text{DIRICHLET}$

$P(\text{parameters}|\text{Data}) \propto \prod_{j=1}^{k}\theta_j^{n_j+a_j-1} \;\rightarrow\; \text{DIRICHLET}$

$E(\theta_j|\text{Data}) = \dfrac{n_j + a_j}{n + \sum_{j=1}^{k}a_j}$

where n = sample size and $n_j$ = observed number in “niche” j (observed numbers, summed over classes where needed; in the network, i indexes the variable, j its parent blanket configuration, and k the class label).

Under the Dirichlet-multinomial model there is a closed form for the marginal likelihood of the data for a given network Mh.
32
Sebastiani et al. (2005) Nature Genetics
NEURAL NETWORKS
A very fast tour
33
Up to 10^3 connections per neuron
• Brain superior to von Neumann machines in cognitive tasks
• Microchips: nanoseconds; brain: milliseconds
• ???
The brain recognizes familiar objects from unfamiliar angles. Key: not speed but the organization of processing
34
Why?
• Tasks distributed over 10^12 neurons
• Interconnected and activated
• Massively parallel
• Neurons adapt and self-organize
• Interconnectivity: up to 10^3 synaptic connections
“NN-wrapper”
• Neural network: multilayer feed-forward network (also called multilayer perceptron)
N SNPs M hidden nodes O output nodes in classification1 output node in regression
Two dummy variatesneeded to describegenotypes
3 categoriesassumed here
(classification)
35
[Figure: network diagram annotated with: input to hidden node j; activation function; activated input to hidden node j; activated input to the output layer; net input to the output layer; w0j is an “intercept”]
NEURAL NETWORKS ARE UNIVERSAL APPROXIMATORS
[Figure: 50 x values sampled from U[-1,1], f(x) evaluated, and a two-layer NN with 3 hidden nodes and tanh activation functions fitted; panels show f(x) = x², sin(x), |x| and a step function, together with the output from each hidden node]
36
A simpler formulation of the same network: relationship to non-parametric regression

$y_i = \beta_0 + \sum_{j=1}^{k}\beta_j\,\dfrac{1}{1+\exp(-x_i'\gamma_j)} + e_i$

where k = number of hidden nodes. Here, the activation function of the output of the hidden layer is the identity. If the number of nodes k is known, the number of parameters is:

$1 + k + k(1 + \#x\text{'s}) = 1 + k(\#x\text{'s} + 2)$
Can overfit if there are too many hidden nodes: “regularization” needed
FITTING (TRAINING) THE NETWORK
• It can be formulated as a non-linear regression problem
• Can use any algorithm, such as Newton-Raphson, to minimize residual SS
• Widely used algorithm: error back-propagation
• Better: treat the regressions as random (i.e., assign a prior) and induce regularization via the Bayesian approach
• The number of hidden nodes is a model selection problem
37
NEURAL NETWORK FOR CLASSIFICATION
Classify tumors [BENIGN, MALIGNANT, NONE] with one-hot coding. Data:

Benign     y11=1  y12=0  y13=0
None       y21=0  y22=0  y23=1
Malignant  y31=0  y32=1  y33=0
Benign     y41=1  y42=0  y43=0
Malignant  y51=0  y52=1  y53=0

q = number of classes; if q = 3, then two “free” probabilities
Likelihood formulation of the neural network for the classification problem:

$p_{ig} = \Pr(y_{ig}=1) = \dfrac{\exp(w_{ig})}{\sum_{g=1}^{q}\exp(w_{ig})}, \qquad w_{ig} = \beta_{0g} + \sum_{j=1}^{k}\beta_{jg}\,\dfrac{1}{1+\exp(-x_i'\gamma_j)}$

$l \propto \prod_{i=1}^{n}p_{i1}^{y_{i1}}\,p_{i2}^{y_{i2}}\cdots p_{iq}^{y_{iq}}$

Number of parameters to be learned (estimated), with k = number of hidden nodes:

$(1 + \#x\text{'s})\,k \;(\#\text{ of }\gamma\text{'s}) + (1 + k)(q-1) \;(\#\text{ of }\beta\text{'s})$

AGAIN, CARE MUST BE EXERCISED TO AVOID OVERFITTING
38
Back to chickens: Comparison of three wrappers
In all comparisons, NB had the smallest error rates
Classification error rates in the four strata
Comparison of NB’s classification accuracy using a fullset-SNP predictor and a subset SNP predictor
[Figure: bar plots of prediction accuracy (0.0-0.9) for the 2-, 3-, 4-, 5- and 10-category schemes in the four strata (EL, EH, LL, LH), comparing the full SNP set with the wrapper-selected subset]

The wrapper enhances classification accuracy over and above the subset chosen by the filter
39
Evaluation of SNP subsets generated by five classification schemes
• Two measures used as evaluation criteria– Measure A: predicted residual sum of squares (PRESS)
($\hat M_{(i)}$ is the predicted value for sire i’s phenotype when it is not used to estimate parameters)
– Measure B: proportion of significant SNPs• Given subset of SNPs, F-statistic computed for each SNP• Given an individual SNP F-statistic, its p-value is approximated by
– Shuffling phenotypes across all sires 200 times, while keeping sires’ genotypes on this SNP fixed
– Proportion of the 200 permuted samples where a particular F-statistic exceeds the F-statistic seen in the original sample is calculated. This proportion is taken as the SNP’s p-value
• After obtaining p-values for all SNPs in the subset, significant ones are chosen by controlling the false discovery rate at level 0.05, via the “Benjamini-Hochberg procedure”. The proportion of significant SNPs in a subset is the end-point
$\text{PRESS}_{\text{SNP}} = \sum_{i=1}^{N}\left(M_i - \hat M_{(i)}\right)^2$

where $M_i$ is sire i’s adjusted progeny mean and $\hat M_{(i)}$ its leave-one-out prediction.

Evaluation of SNP subsets generated by five classification schemes
Evaluation of SNP subsets generated by five classification schemes
Values with underlines are winners under each of the two measures
Overall, schemes with 2 and 3 categories were better than the others
40
• The issue of losing sire information: PRESS values of SNP subsets selected using either extreme sires (100×2α% of the entire sire sample) or all sires
• Loss of sample information is not a severe problem
• Because the concern was classification accuracy, having more informative samples for each class is the more important issue
• Including all sire samples brings noise, leading to poorer predictive ability of the selected SNPs
Summary • 5 multi-category classification schemes and 4
classification methods compared, for selecting SNPs most relevant to classification.
• Best SNP subset with K=2 or K=3. Two measures (PRESS and proportion of significant SNPs) not improved by using a finer grading of mortality rates.
• Classification accuracy enhanced by using extreme samples: no loss of information; more parsimony.
• The simple naïve Bayes classifier outperformed BN and NN: over-fitting may have been a problem (unless NN coefficients are treated as random effects…)
1
8. Introduction to non-parametric curve fitting:
Loess, kernel regression, neural networks
Distinctive aspects of non-parametric fitting
• Objectives: primarily to investigate patterns free of strictures imposed by parametric models
• Can produce surprising results
• Regression coefficients appear but (typically) do not have an obvious interpretation
• Often have very good predictive performance in cross-validation
• Methods for tuning are similar to those for parametric methods
2
Example: thin-plate splines
Risk of heart attack after 19 years as a function of cholesterol level and blood pressure. Left: logistic regression model. Right: thin plate spline fit. Wahba (2007)
The fitted surface has the thin-plate spline form

$f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \sum_{j=1}^{N}\delta_j\,r_{ij}^2\log r_{ij}, \qquad r_{ij}^2 = (x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2$
LOESS REGRESSION: Non-parametric exploration of inbreeding depression for yield and somatic cell count in Jersey cattle
3
AN OVERVIEW OF LOWESS REGRESSION
1) DATA POINTS $(x_i, y_i)$; $i = 1, 2, \ldots, n$
2) SPANNING PARAMETER f; 0 < f ≤ 1; $k = \lfloor fn\rfloor$ = LARGEST INTEGER ≤ fn
3) FOR EACH $x_0$, FIND THE k POINTS $x_i$ “CLOSEST” TO $x_0$: $N(x_0)$ = NEIGHBORHOOD OF k POINTS
4) COMPUTE $\Delta(x_0) = \max_{x_i\in N(x_0)}|x_0 - x_i|$
5) TO EACH $(x_i, y_i)$, $x_i \in N(x_0)$, ASSIGN THE WEIGHT

$w_i(x_0) = \left[1 - \left(\dfrac{|x_0 - x_i|}{\Delta(x_0)}\right)^3\right]^3$

6) FIT BY WEIGHTED LEAST-SQUARES:

$\min\ \sum_{i=1}^{k}w_i(x_0)\left[y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right]^2$

RETURN $\hat y(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2$

7) REPEAT FOR EACH OF THE $x_0$
4
ROBUST LOWESS

• STANDARD LOWESS IS NOT ROBUST (BASED ON LEAST-SQUARES)
• BI-SQUARE LOWESS: RE-WEIGHT POINTS ACCORDING TO THE RESIDUAL; IF THE RESIDUAL IS LARGE, THE WEIGHT IS DECREASED

1) FIT THE DATA USING STANDARD LOESS
2) CALCULATE THE LOESS RESIDUALS $y_i - \hat y_i$
3) COMPUTE $q_{1/2} = \text{median}\,|y_i - \hat y_i|$
4) CALCULATE THE BI-SQUARE ROBUST WEIGHTS

$r_i = \left[1 - \left(\dfrac{y_i - \hat y_i}{6q_{1/2}}\right)^2\right]^2$

5) REPEAT LOESS WITH WEIGHTS $r_i\,w_i(x_0)$
6) REPEAT 2-5 UNTIL THE LOESS CURVE “CONVERGES”
5
Example
• Birth rate in US population (U.S. Department of Health, Education and Welfare)
• n = 96
• births per 1000 US population
• during 1940-47

[Figure. Top: ordinary least squares with 1st-, 2nd- and 3rd-degree polynomials. Bottom: LOWESS fits with f = 0.2, 0.4 and 0.6]
6
• Examine relationships of yield (milk, protein, fat) and somatic cell score (SCS) with inbreeding coefficient (F) using field data from US Jerseys
• Use REML, BLUP and the “local regression” method (LOESS) for this purpose

LEVEL OF INBREEDING IN HOLSTEINS, USA
7
• The relationship between the mean value of a quantitative trait and the inbreeding coefficient (F) is expected to be linear under dominance
• Not so if epistatic interactions between dominance effects exist (Crow & Kimura, 1970)

ONE-LOCUS MODEL (additive model with F, or heterozygosity H, as covariate)

GENOTYPE X:   A1A1                A1A2            A2A2
FREQUENCY:    p1²(1-F) + p1F      2p1p2(1-F)      p2²(1-F) + p2F
PHENOTYPE:    -A                  D               A

$E(X) = A(p_2 - p_1) + 2p_1p_2D - 2p_1p_2DF = \mu + \beta F$

The coefficient of F, $-2p_1p_2D$, is proportional to -(1-F): the decline is proportional to the loss of heterozygosity.
8
TWO (UNLINKED) LOCI: NO EPISTASIS

Genotypic values (columns: A1A1, A1A2, A2A2; row frequencies as in the one-locus model):

B1B1 [freq. r1²(1-F)+r1F]:   -A-B      DA-B      A-B
B1B2 [freq. 2r1r2(1-F)]:     -A+DB     DA+DB     A+DB
B2B2 [freq. r2²(1-F)+r2F]:   -A+B      DA+B      A+B

$E(X) = A(p_2-p_1) + B(r_2-r_1) + 2p_1p_2D_A + 2r_1r_2D_B - 2(p_1p_2D_A + r_1r_2D_B)F = \mu' + \beta'F$
TWO (UNLINKED) LOCI: EPISTASIS

B1B1 [freq. r1²(1-F)+r1F]:   -A-B+I     DA-B-L     A-B-I
B1B2 [freq. 2r1r2(1-F)]:     -A+DB-K    DA+DB+J    A+DB+K
B2B2 [freq. r2²(1-F)+r2F]:   -A+B-I     DA+B+L     A+B+I

• ALLELES AT THE A AND B LOCI WITH THE SAME SUBSCRIPT → ADD I (ADDITIVE × ADDITIVE)
• HOMOZYGOUS AT A, HETEROZYGOUS AT B → SUBTRACT AND ADD K; HOMOZYGOUS AT B, HETEROZYGOUS AT A → SUBTRACT AND ADD L (ADDITIVE × DOMINANCE)
• HETEROZYGOUS AT A AND B → ADD J (DOMINANCE × DOMINANCE)
9
Mean value under dominance × dominance epistasis:

$E(X) = A(p_2-p_1) + B(r_2-r_1) + 2p_1p_2D_A + 2r_1r_2D_B + I(p_1-p_2)(r_1-r_2) + 2Lp_1p_2(r_1-r_2) + 2Kr_1r_2(p_1-p_2) + 4Jp_1p_2r_1r_2$
$\quad - 2\left[p_1p_2D_A + r_1r_2D_B + Lp_1p_2(r_1-r_2) + Kr_1r_2(p_1-p_2) + 4Jp_1p_2r_1r_2\right]F + 4Jp_1p_2r_1r_2F^2$
$= \mu'' + \beta''F + \gamma F^2$

• Dominance, additive × dominance, and dominance × dominance intervene in the linear regression on F
• Dominance × dominance intervenes in the second-order regression
• Epistasis without dominance does not enter into the mean-F relationship
DATA
• First lactation records on 59,778 Jersey cows in 1,142 herds
• 6 generations of known pedigree
• First calving between 1995 and 2000
10
Distribution of F
• F calculated from all known pedigree information
• F ranged between 0 and 34%
• Median F = 6.25%

[Figure: histogram of F values (%)]
11
Procedures
• Fit linear models without F as a covariate
• Compute EBLUP residuals from these models
• Fit a nonparametric regression to the EBLUP residuals in order to obtain nonparametric lines describing the relationship between performance and inbreeding level
Linear Models

Model:

$y_{ijk} = HYS_i + AGE_j + \beta_1(D_{ijk} - \bar D) + a_k + e_{ijk}$

where
yijk = somatic cell score (SCS), milk, protein, or fat yield;
HYSi = fixed effect of herd-year-season (i = 1,2,….,12276 for DS2; 11158 for DS4 or 6406 for DS6, with season classes January-April, May-August, September-December);
AGEj = fixed effect of age at calving class; j = 1,2,….,6 (<617, 617-716, 717-816, 817-916, 917-1016, or >1016 days of age);
β1 = fixed regression coefficient of performance on days in milk;
Dijk = days in milk for animal k in herd-year-season i and age of calving class j;
D̄ = 263;
ak = random additive genetic effect of animal k; and
eijk = random residual.
12
Linear Model Assumptions
• Genetic and residual effects assumed mutually independent, with $a \sim N(0, A\sigma_a^2)$ and $e \sim N(0, I\sigma_e^2)$, where A is the additive relationship matrix (1 + Fk in the k-th diagonal position; Fk is the inbreeding coefficient of animal k)
Nonparametric regression
• Fit LOESS regression to BLUP residuals with F as the covariate
• Vary the spanning parameter & degree of the local polynomial
• Plot fitted values of the residuals against F
13
LOESS (fitting done by locally weighted least squares)

• $\tilde e_{ij}$ is the LOESS fit using only residuals in the neighborhood of $F_i$ (i = 1,2,…,n animals; j = 1,…,4 traits)
• Size of the neighborhood determined by $f = q/n$, where q = number of points in the neighborhood and n = total number of points

“Robust” LOESS. Weights assigned to the residuals, iterating t = 1,2,3,4:

I) $w_{ijk}^{[1]} = \left[1 - \left(\dfrac{|F_i - F_l|}{\max_l(|F_i - F_l|)}\right)^3\right]^3, \qquad l = 1,2,\ldots,q$

II) $w_{ijk}^{[t]} = w_{ijk}^{[t-1]}\,\delta_{ijk}^{[t]}$, with bi-square weights $\delta_{ijk}^{[t]} = \left[1 - \left(\dfrac{\tilde e_{ijk}}{6\,\text{med}}\right)^2\right]^2$ and $\text{med} = \text{median of all } |\tilde e_{ijk}|$
14
[Figure: “robust” original (black) and bootstrap (light blue) LOESS curves of yields for US Jerseys with at least 6 generations of known pedigree, based on medians of EBLUP residuals; y-axis = $\hat e_{ijk}/\hat\sigma_a$, x-axis = F (%); f = 0.9, 0.5, 0.9; 2nd-degree local polynomial]
15
Conclusions
• LOESS analysis suggested local relationships
• Effects of inbreeding seem nil for F values up to ~7%
• Effects of inbreeding are not accounted for well by additive models
• Results may be confounded by effects of selection that are unaccounted for
Kernel Regression
16
$y_i = g(x_i) + e_i; \qquad i = 1,2,\ldots,n$

where: $y_i$ is the measurement taken on individual i; $x_i$ is a p × 1 vector of observed SNP genotypes; g(·) is some unknown function relating genotypes to phenotypes, set to $g(x_i) = E(y_i \mid x_i)$, the conditional expectation function; and $e_i \sim (0, \sigma^2)$ is a random residual.
17
18
19
20
Neural networks applied to pedigree- or genomic-enabled prediction
21
Graphical representation of a single-layer neural network, with M nodes, for a continuous phenotype $y_i$ (i = 1,…,n) and p molecular markers (j = 1,…,p):

$y_i = \alpha_{M+1,0} + \sum_{m=1}^{M}\alpha_m Z_{mi} + e_i$

$Z_{mi} = g_m\!\left(\beta_{m0} + \sum_{j=1}^{p}x_{ij}\beta_{mj}\right), \qquad m = 1,\ldots,M \quad (g_m = \text{activation function})$

$x_i' = (1, x_{i1}, \ldots, x_{ij}, \ldots, x_{ip})$

[Diagram: Input layer → Hidden layer → Output]
22
DATA
Pedigree structure: mgd pedigrees, mgs pedigrees, pgd pedigrees, pgs pedigrees; phenotyped animals

BREEDS (JE: Jersey; CAN: Canada; USA: United States)
JE840003000951180: 1
JECAN000010013378: 1
JEUSA000003414899: 295
23
Data
• All variables standardized before use in the NN
• Inputs: elements of a relationship matrix (pedigree or genomic, or both)
• Target: fat yield deviation, milk yield deviation, protein yield deviation
• Rationale:

$y = \mu + u + e, \qquad u \sim (0, A\sigma_a^2)$

$y = \mu + AA^{-1}u + e = \mu + Au^* + e \;\Rightarrow\; y_i = \mu + \sum_{j=1}^{N}a_{ij}u_j^* + e_i$

Use the elements of A (or G), or of A⁻¹ (or A⁻¹A, or A⁻¹u*), as inputs in the NN.
24
Fitting the network

• Divide data into TRAINING, TUNING and TESTING sets
• Network trained as a non-linear regression problem. Weights and residuals have Gaussian distributions with variance parameters.
• Given the variances, find the mode of the weights using any non-linear optimization method in the TRAINING set. Variances assessed with a Laplacian approximation (as in DFREML)
• TUNING set used to assess the predictive performance of each cell in a grid
• Final predictive performance assessed in the TESTING set
• An NN with 1 neuron and linear activation functions is the “animal model” with unknown variances
25
[Figure: bar charts of train/validation/test/general correlations for the linear and 1- to 6-neuron architectures, by trait]

Correlations by architecture (train / validation / test / general):

           Fat Deviation          Milk Deviation         Protein Deviation
Linear     0.96 0.03 0.11 0.66    0.95 0.07 0.07 0.67    0.93 0.05 0.02 0.67
1-neur     0.91 0.10 0.23 0.65    0.89 0.04 0.10 0.62    0.88 0.05 0.09 0.64
2-neurs    0.94 0.11 0.22 0.66    0.94 0.11 0.08 0.69    0.93 0.06 0.08 0.67
3-neurs    0.91 0.10 0.21 0.67    0.96 0.07 0.13 0.69    0.95 0.04 0.10 0.68
4-neurs    0.92 0.10 0.20 0.65    0.94 0.06 0.09 0.67    0.95 0.09 0.14 0.70
5-neurs    0.91 0.11 0.23 0.65    0.94 0.09 0.13 0.69    0.96 0.11 0.15 0.71
6-neurs    0.96 0.10 0.27 0.68    0.94 0.10 0.10 0.68    0.96 0.07 0.11 0.69
26
Animal breeders love correlations…
• The parameter has a well-defined meaning for bivariate normal, identically distributed pairs
• Precise relationship to prediction (squared correlation: fraction of variance accounted for)
• It is not a measure of “accuracy”. For example, if $y = k + x$:

$Corr(x,y) = \dfrac{Cov(y,x)}{\sqrt{Var(y)Var(x)}} = \dfrac{Var(x)}{\sqrt{Var(x)Var(x)}} = 1, \qquad \text{yet } E(y-x)^2 = k^2$
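A two-line numerical check of that point (simulated data, illustration only):

```python
import numpy as np

# Correlation can be perfect while predictions are biased: y = k + x
rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
k = 5.0
y = k + x
print(np.corrcoef(x, y)[0, 1])   # ~1.0
print(np.mean((y - x) ** 2))     # ~k**2 = 25.0
```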
““You are the only You are the only woman in my lifewoman in my life””
Adam said to Eve
MSE (testing data set)
27
[Figure: traces of the values of the weights (regressions) for the linear and “best” NN architectures: Fat-linear, Milk-linear, Prot-linear, Fat-6 neurons, Milk-5 neurons, Prot-5 neurons]
[Figure: distribution of weights for the linear and “best” NN architectures: Fat-linear, Milk-linear, Prot-linear, Fat-6 neurons, Milk-5 neurons, Prot-5 neurons]
28
1
PENALIZED METHODS for functional inference

• The idea of a “penalty” is ad hoc
• It does not arise “naturally” in classical inference
• It appears very naturally in Bayesian inference
• L2 penalty: equivalent to a Gaussian prior
• L1 penalty: equivalent to a double exponential prior
The concept of penalized likelihood (example in the mixed linear model)

$y = X\beta + Zu + e$

$y|\beta,u,R \sim N(X\beta + Zu, R), \qquad u \sim N(0, G)$

$p(y|\beta,u,R) = (2\pi)^{-\frac{N}{2}}|R|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(y - X\beta - Zu)'R^{-1}(y - X\beta - Zu)\right]$

$p(u|G) = (2\pi)^{-\frac{q}{2}}|G|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}u'G^{-1}u\right]$

Assuming known variance components, the log of the joint density of the data and random effects is termed the “penalized likelihood”:

$l(\beta,u|y,R,G) = K - \tfrac{1}{2}(y - X\beta - Zu)'R^{-1}(y - X\beta - Zu) - \tfrac{1}{2}u'G^{-1}u$

$\dfrac{\partial l}{\partial\beta} = X'R^{-1}(y - X\beta - Zu), \qquad \dfrac{\partial l}{\partial u} = Z'R^{-1}(y - X\beta - Zu) - G^{-1}u$

Setting the derivatives to 0 yields

$\begin{bmatrix}X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1}\end{bmatrix}\begin{bmatrix}\hat\beta\\ \hat u\end{bmatrix} = \begin{bmatrix}X'R^{-1}y\\ Z'R^{-1}y\end{bmatrix}$

The solution to these equations produces the “maximum penalized likelihood” estimates of β and u. These solutions are also BLUE(β) and BLUP(u).
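A minimal direct solver for these equations, assuming known G and R (a sketch; real implementations exploit sparsity rather than dense inverses):

```python
import numpy as np

def solve_mme(X, Z, y, G, R):
    """Henderson's mixed model equations: returns BLUE(beta), BLUP(u)."""
    Ri = np.linalg.inv(R)
    Gi = np.linalg.inv(G)
    C = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                  [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])
    r = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
    sol = np.linalg.solve(C, r)
    p = X.shape[1]
    return sol[:p], sol[p:]
```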
8. Reproducing Kernel Hilbert spaces mixed model
Variational problem: find g(x), a function of the molecular information x (vector of SNP variables), over the entire space of functions, minimizing the “penalized sum of squares” (with some norm under a Hilbert space H of functions, and smoothing parameter λ):

$SS(g(x),\lambda) = \sum_{i=1}^{n}\left[y_i - w_i'\beta - z_i'u - g(x_i)\right]^2 + \lambda\,\|g(x)\|_H^2$

Solution to the variational problem:

$g(\cdot) = \alpha_0 + \sum_{j=1}^{n}\alpha_j K(\cdot, x_j)$

where n = number of individuals with molecular data, $\alpha_j$ is a regression coefficient, and K is the reproducing kernel. This is a reduction of dimension: p (# SNPs) ≫ # individuals. Example of a reproducing kernel:

$K_h(x, x_j) = \exp\left[-\dfrac{(x - x_j)'(x - x_j)}{h}\right]$
In a Euclidean space of dimension n, the dot product between vectors v and w is $\sum_{i=1}^{n}v_iw_i$, and the norm is $\|v\| = \sqrt{\sum_{i=1}^{n}v_i^2}$.

The inner product generalizes the dot product to vectors of infinite dimension. For instance, in a vector space of real functions with domain (a,b), the inner product is

$\langle g_1, g_2\rangle = \int_a^b g_1(x)g_2(x)\,dx, \qquad \|g_1\| = \sqrt{\int_a^b g_1(x)^2\,dx}$

If x is a random variable with pdf p(x):

$\langle g_1, g_2\rangle = \int_a^b g_1(x)g_2(x)p(x)\,dx = E\left[g_1(x)g_2(x)\right]$
4
Definition of a positive-definite kernel (the theory deals with “reproducing” kernels):

$\int\int k(x,t)\,g(x)g(t)\,p(x,t)\,dx\,dt \geq 0$

$K_h = \begin{bmatrix}k(1,1,h) & k(1,2,h) & \cdots & k(1,n,h)\\ k(2,1,h) & k(2,2,h) & \cdots & k(2,n,h)\\ \vdots & \vdots & \ddots & \vdots\\ k(n,1,h) & k(n,2,h) & \cdots & k(n,n,h)\end{bmatrix}$

Positive-definite kernel matrix; symmetric, with k(i,j,h) = k(j,i,h). h = scalar or vector of bandwidth parameters.
MEASURES OF DISTANCE THAT CAN BE USED IN KERNELS: Euclidean, Manhattan, Bray-Curtis

For example, with $x_i = (x_{i1},\ldots,x_{ip})$ and squared Euclidean distance $d_{ij} = \|x_i - x_j\|^2$:

$K(i,j) = \exp(-\theta\,d_{ij})$, with the rate θ scaled, e.g., by $\max_{i,j}\|x_i - x_j\|^2$

[Figure: histogram of evaluations of the Gaussian kernel by value of the bandwidth parameter; a LOCAL KERNEL versus a KERNEL GENERATING STRONG COVARIANCES]

Mixed model representation (enhancing pedigrees…):

$y_i = w_i'\beta + z_i'u + \sum_{j=1}^{n}\alpha_j\exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right] + e_i$
Mixed model representation (enhancing pedigrees…)
Define the row vector $t_i'(h) = \left\{\exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right]\right\}_{j=1,\ldots,n}$ and

$T(h) = \begin{bmatrix}t_1'(h)\\ t_2'(h)\\ \vdots\\ t_n'(h)\end{bmatrix}$

so that $t_i'(h) = k_i'(h)$ and $T(h) = K(h)$.

Do: $y = W\beta + Zu + T(h)\alpha + e$, with $\alpha \sim N\left(0, T^{-1}(h)\sigma_\alpha^2\right)$; h assumed known here.

Then:

$\begin{bmatrix}W'W & W'Z & W'T(h)\\ Z'W & Z'Z + A^{-1}\dfrac{\sigma_e^2}{\sigma_u^2} & Z'T(h)\\ T'(h)W & T'(h)Z & T'(h)T(h) + T(h)\dfrac{\sigma_e^2}{\sigma_\alpha^2}\end{bmatrix}\begin{bmatrix}\hat\beta\\ \hat u\\ \hat\alpha\end{bmatrix} = \begin{bmatrix}W'y\\ Z'y\\ T'(h)y\end{bmatrix}$

h is the bandwidth parameter; $\sigma_e^2/\sigma_\alpha^2$ is the smoothing parameter.
Penalized estimation [1]:

$\hat\alpha = \arg\min_\alpha\left[(y - K\alpha)'(y - K\alpha) + \lambda\,\alpha'K\alpha\right]$

[1] Kimeldorf, G.S. & Wahba, G. (1970).

Bayesian view:

$y = K\alpha + \varepsilon, \qquad \varepsilon \sim N(0, I\sigma_\varepsilon^2), \qquad \alpha \sim N(0, K^{-1}\sigma_\alpha^2), \qquad \lambda = \sigma_\varepsilon^2/\sigma_\alpha^2$
7
How to Choose the Reproducing Kernel? [1]

Model-derived kernel $K(t_i, t_j)$:
- Pedigree models: K = A
- Genomic models: marker-based kinship, or $K = XX'$

Predictive approach: explore a wide variety of kernels
=> Cross-validation
=> Bayesian methods

[1] Shawe-Taylor and Cristianini (2004)

Choosing the RK based on predictive ability [1]:

$K(i,j) = \exp\left[-\theta\,d(x_i, x_j)\right]$, where $d(x_i, x_j)$ = (genetic) distance between individuals

Strategies:
- Grid of values of θ + CV
- Fully Bayesian: assign a prior to θ (computationally demanding)
- Kernel averaging [1], e.g., $K(i,j) = \alpha_1K_1(i,j) + \alpha_2K_2(i,j) + \ldots$

[1] de los Campos et al. (2010) Genetics Research
8
Example 1 of RKHS

$y = X\beta + Za + Zd + e$ (Additive + Dominance), with

$\begin{bmatrix}y_2\\ y_3\\ y_4\\ y_5\end{bmatrix} = \begin{bmatrix}5\\ 3\\ 7\\ 8\end{bmatrix}, \qquad X = \begin{bmatrix}1 & 2\\ 1 & 3\\ 1 & 1\\ 1 & 5\end{bmatrix}, \qquad Z = \begin{bmatrix}0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\end{bmatrix}$

$a = (a_1,\ldots,a_5)'$, $d = (d_1,\ldots,d_5)'$, $e = (e_2,\ldots,e_5)'$. Henderson (1985) assumed $\sigma_a^2 = 5$, $\sigma_d^2 = 4$ and $\sigma_e^2 = 20$, with

$A = \begin{bmatrix}1 & 0 & \frac12 & \frac12 & \frac12\\ 0 & 1 & \frac12 & \frac12 & 0\\ \frac12 & \frac12 & 1 & \frac12 & \frac14\\ \frac12 & \frac12 & \frac12 & 1 & \frac14\\ \frac12 & 0 & \frac14 & \frac14 & 1\end{bmatrix} \quad\text{and}\quad D = \begin{bmatrix}1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & \frac14 & 0\\ 0 & 0 & \frac14 & 1 & 0\\ 0 & 0 & 0 & 0 & 1\end{bmatrix}$

Application of the BLUP paradigm leads to

$\hat\beta' = (5.145,\ 0.241)$
$\hat a' = (0.045,\ -0.192,\ -0.343,\ 0.096,\ 0.242)$
$\hat d' = (0,\ -0.073,\ -0.365,\ 0.162,\ 0.234)$
$\hat g = \hat a + \hat d = (0.045,\ -0.265,\ -0.708,\ 0.259,\ 0.477)$
9
Next, do RKHS with K = A + D as the positive-definite kernel matrix:

$K = A + D = \begin{bmatrix}2 & 0 & \frac12 & \frac12 & \frac12\\ 0 & 2 & \frac12 & \frac12 & 0\\ \frac12 & \frac12 & 2 & \frac34 & \frac14\\ \frac12 & \frac12 & \frac34 & 2 & \frac14\\ \frac12 & 0 & \frac14 & \frac14 & 2\end{bmatrix}$
$y = X\beta + K\alpha + e$, with the incidence matrix formed by rows 2-5 of K (the phenotyped individuals), $\alpha = (\alpha_1,\ldots,\alpha_5)'$ and $e = (e_2,\ldots,e_5)'$.

$\sigma_\alpha^2 = \sigma_a^2 + \sigma_d^2 = 9$; this is 1/λ.

Solutions:

$\hat\beta_0 = 5.289$
$\hat\alpha' = (0.200,\ -0.128,\ -0.781,\ 0.487,\ 0.422)$
PREDICTING FUTURE RECORDS UNDER THE SAME ENVIRONMENTAL CONDITIONS, PARAMETRICALLY:

$y_f = M_P\,\theta_P + e_f, \qquad \theta_P = (\beta', a', d')'$

where $M_P$ contains the covariate rows (1,2), (1,2), (1,3), (1,1), (1,5) and the incidence matrices of a and d for the five individuals (now including individual 1).

The kernel-based genetic values

$\hat g_K = K\hat\alpha = (0.036,\ -0.210,\ -0.569,\ 0.206,\ 0.382)'$

COMPARED WITH

$\hat g = \hat a + \hat d = (0.045,\ -0.265,\ -0.708,\ 0.259,\ 0.477)'$
10
PREDICTION OF FUTURE RECORDS, NON-PARAMETRICALLY:

$y_f = M_K\,\theta_K + e_f, \qquad \theta_K = (\beta', \alpha')'$

with the corresponding rows of K (all five individuals) as covariates.

FOR BOTH APPROACHES THE PREDICTIVE DISTRIBUTION IS

$p(y_{1f},\ldots,y_{5f} \mid y_2,\ldots,y_5, \text{dispersion (smoothing) parameters})$

with mean $M_{\cdot}\,\hat\theta_{\cdot}$ and covariance $M_{\cdot}\,C_{\cdot}^{-1}M_{\cdot}' + I\sigma_e^2$, where $C_{\cdot}$ is the coefficient matrix of the corresponding equations.

For the two procedures the means and SDs of the predictive distributions are:

Individual   P: mean (SD)     K: mean (SD)
1            5.674 (6.020)    5.754 (5.576)
2            5.364 (5.460)    5.286 (5.659)
3            5.162 (5.353)    4.735 (5.561)
4            5.646 (5.834)    5.919 (5.940)
5            6.828 (6.115)    7.061 (6.157)
11
Example 2 of RKHS

[Setup: two-locus genotypic values; phenotypes drawn from exponential and Weibull distributions]

An arbitrary Gaussian kernel was adopted for the RKHS regression, using as covariate a 2 × 1 vector: the number of copies of one allele at each of the two loci, e.g., $x_{AA} = 2$, $x_{Aa} = 1$ and $x_{aa} = 0$. For example, the kernel entry for AABB and AAbb is

$k(x_{AABB}, x_{AAbb}, h) = \exp\left[-\dfrac{(2-2)^2 + (2-0)^2}{h}\right] = \exp\left(-\dfrac{4}{h}\right)$
K(h) over the nine two-locus genotypes, with entries $e^{-S/h}$, where S is the squared Euclidean distance between genotype codes:

        AABB      AABb      AAbb      AaBB      AaBb      Aabb      aaBB      aaBb      aabb
AABB    1         e^{-1/h}  e^{-4/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-8/h}
AABb    e^{-1/h}  1         e^{-1/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}
AAbb    e^{-4/h}  e^{-1/h}  1         e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-8/h}  e^{-5/h}  e^{-4/h}
AaBB    e^{-1/h}  e^{-2/h}  e^{-5/h}  1         e^{-1/h}  e^{-4/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}
AaBb    e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-1/h}  1         e^{-1/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}
Aabb    e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-4/h}  e^{-1/h}  1         e^{-5/h}  e^{-2/h}  e^{-1/h}
aaBB    e^{-4/h}  e^{-5/h}  e^{-8/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  1         e^{-1/h}  e^{-4/h}
aaBb    e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-1/h}  1         e^{-1/h}
aabb    e^{-8/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-4/h}  e^{-1/h}  1
12
[Figure: kernel value $k(\cdot,\cdot;h) = \exp(-S/h)$ plotted against the bandwidth parameter h; curves, from upper to lower, correspond to S = 1, 2, 4, 5, 8]

With h = 1.75 as the bandwidth parameter there are 6 unique entries in the K matrix:
1.0 (diagonal elements; the two individuals have identical genotypes)
0.565 (3 alleles in common in a pair of individuals)
0.319 (2 alleles in common, 1 per locus)
0.102 (2 alleles in common at only one locus)
0.06 (1 allele in common)
0.01 (no alleles shared)
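A quick sketch that rebuilds this matrix from the genotype codes and confirms the six unique entries at h = 1.75:

```python
import numpy as np

# Two-locus genotype codes: copies of one allele at each locus
genos = [(2, 2), (2, 1), (2, 0), (1, 2), (1, 1), (1, 0), (0, 2), (0, 1), (0, 0)]
labels = ["AABB", "AABb", "AAbb", "AaBB", "AaBb", "Aabb", "aaBB", "aaBb", "aabb"]

def gaussian_kernel(genos, h):
    """K[i, j] = exp(-S_ij / h), with S_ij the squared Euclidean
    distance between genotype codes (the 9 x 9 matrix above)."""
    X = np.array(genos, dtype=float)
    S = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-S / h)

K = gaussian_kernel(genos, h=1.75)
print(np.round(np.unique(K), 3))   # six unique values: 0.01 ... 1.0
```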
13
Training set
Testing set
100
IMPORTANT ISSUE TO DISCUSS HERE
TRAINING SET
14
TESTING SET
15
Explanation of results
!!
Example of RKHS 2
16
Source          DF   Anova SS       Mean Square   F Value   Pr > F
a                2    0.00000000     0.00000000    0.00     1.0000
b                2    0.00000000     0.00000000    0.00     1.0000
c                2    0.00000000     0.00000000    0.00     1.0000
a*b              4    0.00000000     0.00000000    0.00     1.0000
a*c              4    0.00000000     0.00000000    0.00     1.0000
b*c              4   13.33333333     3.33333333    1.00     0.4609
Error (a*b*c)    8   26.66666667     3.33333333
Variation between genotypic values is pure interaction
Training set: 27 genotypes, 5 replicates per genotype, residual variance 1.5
Testing set: 50 MC replicates, each as the training set.
Results in training set
17
Results in the testing set.
18
EXAMPLE 3: CHICKEN DATA

• Average progeny “late mortality” (lm) in the low hygiene environment for 200 sires of line 29 (12,167 progenies).
– Pre-corrected for hatch, age of dam, and dam; standardized log-transformed means
• SNPs: filter and wrapper strategy (Long et al., 2007)
– 24 SNPs selected out of over 5000 genotyped on sires
35
DATA
36
[Figure: distribution of progeny means]
19
MODELS
• PEDIGREE: Bayesian E-BLUP
• SNPs, parametric (linear) methods: F-metric model (24 SNPs); Xu (2003) Bayesian regression (1000 SNPs)
• SNPs, non-parametric: RKHS (24 SNPs); kernel regression (24 SNPs)
E-BLUP
E-BLUP

$y = Zu + e, \qquad u \sim N(0, A\sigma_u^2), \qquad e \sim N(0, R\sigma_e^2)$

R is diagonal, with weights based on the number of progeny of sire i: weighted residuals (Varona and Sorensen, 2007).

$\sigma_u^2 \sim \nu_u s_u^2\chi_{\nu_u}^{-2}, \qquad \sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
38
20
F-metric model (least-squares regression); Van der Veen (1959); Zeng et al. (2005)

$y_i = \mu + \sum_{j=1}^{q}x_{ija}\,\alpha_{ja} + e_i, \qquad q = 24 \text{ markers}$

with the genotype code

$x_{ja} = \begin{cases}1 & \text{for a homozygous SNP (say AA)}\\ 0 & \text{for a heterozygous SNP (say Aa)}\\ -1 & \text{for a homozygous SNP (say aa)}\end{cases}$
F-metric model (linear regression)

$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{24})'$

Coefficients: bounded uniform priors (-99999, 99999). Residual variance: scaled inverse chi-squared, $\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$, with $e \sim N(0, R\sigma_e^2)$.

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
21
Bayesian Regression (Xu, 2003) (similar to Bayes A of Meuwissen et al. 2001)

$y_i = \mu + \sum_{j=1}^{1000}x_{ij}b_j + e_i$

1000 SNPs chosen randomly along the genome, coded as

$x_{ij} = \begin{cases}1 & \text{for a homozygous SNP (say AA)}\\ 0 & \text{for a heterozygous SNP (say Aa)}\\ -1 & \text{for a homozygous SNP (say aa)}\end{cases}$

$b_j$ = regression coefficient for SNP j, assumed distributed as $b_j \sim N(0, \sigma_j^2)$, where $\sigma_j^2$ is the variance associated with each SNP, with $\sigma_j^2 \sim \nu s^2\chi_{\nu}^{-2}$.

Residual variance: scaled inverse chi-squared, $\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$, with $e \sim N(0, R\sigma_e^2)$.

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
The Gibbs sampler: not much new here…

The conditional posterior of the location effects $\theta = (\mu, b')'$ is multivariate normal:

$\theta|ELSE \sim N(C^{-1}r,\ C^{-1})$

$C = \begin{bmatrix}\dfrac{1'1}{\sigma_e^2} & \dfrac{1'X}{\sigma_e^2}\\ \dfrac{X'1}{\sigma_e^2} & \dfrac{X'X}{\sigma_e^2} + \text{Diag}\left(\dfrac{1}{\sigma_j^2}\right)\end{bmatrix}, \qquad r = \begin{bmatrix}\dfrac{1'y}{\sigma_e^2}\\ \dfrac{X'y}{\sigma_e^2}\end{bmatrix}$

The conditional posterior distributions of the variances of the marker effects and of the residual variance are

$\sigma_j^2|ELSE \sim \left(b_j^2 + \nu S^2\right)\chi_{1+\nu}^{-2}, \qquad j = 1,2,\ldots,1000$

$\sigma_e^2|ELSE \sim \left[(y - Xb)'(y - Xb) + \nu_e s_e^2\right]\chi^{-2}$
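Every variance update above has the same "scaled inverse chi-square" form; a minimal helper (the values below are illustrative, not from the analysis):

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_scaled_inv_chi2(df, scale_sum):
    """One draw from scale_sum * inv-chi2(df): the form (SS + nu*S^2)/chi2_df
    used for every variance update above."""
    return scale_sum / rng.chisquare(df)

# e.g., a residual variance update given current residuals e
e = rng.normal(size=200)
nu_e, s2_e = 4.0, 1.0
sigma2_e = draw_scaled_inv_chi2(len(e) + nu_e, e @ e + nu_e * s2_e)
print(sigma2_e)
```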
KERNEL REGRESSION
• Non-parametric regression (Gianola et al., 2006)

$y_i = g(x_i) + e_i$

where $x_i$ is a (q × 1) vector representing the genotype of sire i; $g(x_i)$ is some unknown function of the whole SNP genotype of sire i, representing the expected phenotypic value of sires possessing the q-dimensional SNP combination; and the $\{e_i\}$ are assumed distributed independently of X, centered at zero.
44
23
KERNEL REGRESSION
• Non-parametric regression: how do we estimate g(x)?

$y_i = g(x_i) + e_i$

Based on the definition of a conditional mean,

$g(x) = E(y \mid x) = \dfrac{\int y\,p(y,x)\,dy}{p(x)}$

Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964)
45
iii egy )(x
KERNEL REGRESSION
•Non-parametric regression
46
n
iihiq
xXKynh
dyyyp1
)(1
),(x
n
iihq
xXKnh
p1
)(1
)(xg(x)=
Pure non-parametric regression.
Trinomial Kernel
h: smoothing parameter
24
47
Trinomial KERNEL

K(X - x) = some function measuring distances between focal points or objects (genotypes). With two dummy variates per SNP genotype and mismatch counts $d_{i1}, d_{i2}$ between the observed and focal genotypes accumulated over the q SNPs,

$K_h(x_i, x) = h^{d_{i1}+d_{i2}}\,(1-h)^{2q - d_{i1} - d_{i2}}$

[Table: dummy coding of observed (AA, Aa, aa) versus focal genotypes used to compute the mismatch counts]
REPRODUCING KERNEL HILBERT SPACES REGRESSION
• The penalized sum of squares has the form:

$J[g(x)] = [y - X\beta - g(x)]'R^{-1}[y - X\beta - g(x)] + \lambda\,\|g(x)\|_H^2, \qquad \|g(x)\|_H^2 = \alpha'K_h\alpha$

with solution

$g(x_i|h) = \sum_{j}\alpha_j K_h(x_i, x_j), \qquad \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)', \qquad K_h(x_i, x_j) = \exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right]$

$K_h = \begin{bmatrix}K_h(x_1,x_1) & K_h(x_1,x_2) & \cdots & K_h(x_1,x_n)\\ \vdots & K_h(x_i,x_j) & & \vdots\\ K_h(x_n,x_1) & K_h(x_n,x_2) & \cdots & K_h(x_n,x_n)\end{bmatrix}$
25
REPRODUCING KERNEL HILBERT SPACES
• Embedding all these expressions in the penalized sum of squares leads to the system

$\begin{bmatrix}1'R^{-1}1 & 1'R^{-1}K_h\\ K_h'R^{-1}1 & K_h'R^{-1}K_h + \lambda K_h\end{bmatrix}\begin{bmatrix}\hat\mu_{\lambda,h}\\ \hat\alpha_{\lambda,h}\end{bmatrix} = \begin{bmatrix}1'R^{-1}y\\ K_h'R^{-1}y\end{bmatrix}$

Bayesian implementation:

$\alpha|h,\sigma_\alpha^2 \sim N(0, K_h^{-1}\sigma_\alpha^2), \qquad e \sim N(0, R\sigma_e^2)$

$\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}, \qquad \sigma_\alpha^2 \sim \nu_\alpha s_\alpha^2\chi_{\nu_\alpha}^{-2}$

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
49
Sequence alignment Kernel
50
Sequence alignment KERNEL
• Dynamic programming algorithms: similarity between two DNA sequences, adapted to SNP sequences (Delcher et al., 1999, 2002)
• No need to tune h

$K(x_i, x) = \exp\left[\text{Score}(x_i, x)\right]$
26
Variance component & parameter estimates: posterior mean (s.d.) and 95% HPD interval

Parameter               E-BLUP         F-metric       RKHS           BR (Xu's)
σe²   µ (s.d.)          24.38 (3.88)   29.72 (3.56)   17.07 (3.02)   20.75 (2.91)
      HPD (95%)         16.88-32.04    23.60-37.51    11.78-23.64    15.62-27.09
σu²   µ (s.d.)          0.10 (0.06)    …              …              1.03 (0.71)*
      HPD (95%)         0.03-0.24      …              …              0.67-1.95*
σα²   µ (s.d.)          …              …              0.40 (0.07)    …
      HPD (95%)         …              …              0.28-0.55      …
h²    µ (s.d.)          0.02 (0.01)    …              …              …
      HPD (95%)         0.004-0.050    …              …              …

*For BR: sum of the posterior means of the variances of the 1000 markers.
• Spearman (above diagonal) and Pearson (below diagonal) correlations between posterior means of sire effects
• E-BLUP & Xu (2003) very similar; LR gives the most different ranking.

            E-BLUP   F-metric   Kernel   RKHS   BR
E-BLUP        …        0.52      0.77    0.84   0.91
F-metric     0.56       …        0.48    0.51   0.53
Kernel       0.66      0.38       …      0.93   0.76
RKHS         0.84      0.50      0.79     …     0.84
BR           0.92      0.57      0.58    0.80    …
27
MODEL FIT
– Compute a deviance measure based on mean squared errors:
• A) Regression of adjusted average progeny on sire’s PTA or EGV
• B) Regression of raw average progeny on sire’s PTA or EGV
-Lowess regression (Non-parametric locally weighted regression)
53
MODEL FIT
•Regression of adjusted raw progeny LM on sire’s PTA or EGV
54
28
MODEL FIT
• Less dispersion in non-parametric models
• Lower MSE for kernel regression
• Worst for linear regression (F-metric model)

Still… which model predicts the data best?
55
Predictive ability
• Cross validation
1. 5 subsets, with 20% of the sire means set missing each time at random
2. Estimate PTA or EGV of sires with missing values from the augmented posterior distributions
3. Calculate correlations between actual and inferred average progeny, for each method within subset.
56
29
Predictive ability
• RKHS showed better predictive ability:
– 25% higher reliability than Xu’s method
– 100% higher reliability than E-BLUP
– 233% higher reliability than F-metric (linear regression on markers)
• RKHS better than fixed or random regression on markers and E-BLUP.

Predictive correlations by cross-validation subset:

Subset    E-BLUP   F-metric   Kernel   RKHS    BR
1st        0.03     0.27       0.05    0.27    0.13
2nd        0.18     0.19       0.28    0.37    0.12
3rd        0.18     0.08       0.06   -0.01    0.17
4th       -0.04     0.07       0.13    0.28    0.15
5th        0.17    -0.12       0.23    0.15    0.25
GLOBAL     0.10     0.06       0.14    0.20    0.16
EXAMPLE 4: CHICKEN DATA

2-generation data set: FCR measured on progeny of 333 sires with 3481 SNPs, and on progeny of 61 birds (sons of the above sires)

Models: BAYES A (all markers); RKHS (all markers); RKHS (400 markers filtered using different INFOGAINS); BLUP (Bayes), pedigree information

Training set: 333 sires of sons. Predictive set: 61 sons of sires.
30
[Figure: predictive correlations for “BLUP”, “Bayes A” and RKHS. Note that the confidence bands of the predictive correlations are wide.]
EXAMPLE 5: Application to US Jersey data

US Jersey
- N = 1,762 sires (n = 1,446 used: training, n = 1,130; testing, n = 316)
- Markers: BovineSNP50 BeadChip (50k)
- Traits: PTAs for Milk, Protein Content and Daughter Pregnancy Rate

Models:
- Linear model
- Genomic-based kinship [1]: $K = G = XX'$
- Gaussian kernel: $K(i,j) = \exp\left[-\theta\,d(x_i, x_j)\right]$, with θ fixed over a grid of values
- Kernel averaging

[1] Hayes and Goddard (2008) Journal of Animal Science.
32
Application to US Jersey data: PMSE (MSE of predictive residuals), one trait:

Model                   PMSE
Linear Model            0.588
Marker-based Kinship    0.590
Kernel average          0.588

[Figure: residual variance by model]

PMSE (MSE of predictive residuals), another trait:

Model                   PMSE
Linear Model            0.459
Marker-based Kinship    0.460
Kernel Average          0.421

[Figure: residual variance by model]
33
Empirical Application
• Kernel averaging seems to be an effective strategy for kernel selection
• In this example (PTAs): the linear model, kinship and kernel averaging performed similarly
• Not necessarily so for other traits and other populations
Predictive ability of models for genomic selection in wheat [1]. N = 599; trait: grain yield (4 environments); models: RKHS and Bayesian LASSO (BL).

Predictive correlation:

Environment    BL      RKHS    Difference (%)
E1             0.518   0.601   +16%
E2             0.493   0.494   0%
E3             0.403   0.445   +10%
E4             0.457   0.524   +15%

[1] Crossa et al. (2010) Genetics.
Radial Basis Functions: another form of non-parametric regression

Note that the basis functions are adaptive: they depend on parameters (θ).

[Matrix form of the model shown in the slide]

The kernel matrix Kθ need not be positive-definite here (contrary to RKHS regression).
35
Genotypes can be coded, for example, as:

          SNP1  SNP2  SNP3        SNP1  SNP2  SNP3
Ind. 1    AA    Gg    TT     →    2     1     2
Ind. 2    AA    gg    tt     →    2     0     0

$k(x_1, x_2) = \exp\left[-\sum_{k=1}^{3}\theta_k\left(x_k^{(1)} - x_k^{(2)}\right)^2\right] = \exp\left[-\theta_1(2-2)^2 - \theta_2(1-0)^2 - \theta_3(2-0)^2\right] = \exp(-\theta_2 - 4\theta_3)$

$k(x_2, x_1) = \exp\left[-\sum_{k=1}^{3}\theta_k\left(x_k^{(2)} - x_k^{(1)}\right)^2\right] = \exp(-\theta_2 - 4\theta_3)$
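A one-function sketch of this kernel (the θ values below are arbitrary, chosen only to show the computation):

```python
import numpy as np

def rbf_kernel(x1, x2, theta):
    """Adaptive RBF kernel with one bandwidth per SNP, as above:
    k(x1, x2) = exp(-sum_k theta_k * (x1_k - x2_k)^2). Symmetric by construction."""
    x1, x2, theta = map(np.asarray, (x1, x2, theta))
    return float(np.exp(-(theta * (x1 - x2) ** 2).sum()))

# Ind. 1 = (2, 1, 2), Ind. 2 = (2, 0, 0); with theta = (0.1, 0.2, 0.3):
# exponent = -(0.2*1 + 0.3*4) = -1.4
print(rbf_kernel([2, 1, 2], [2, 0, 0], [0.1, 0.2, 0.3]))   # exp(-1.4) ≈ 0.2466
```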
Bayesian structure: priors
Hyper-parameters are: a, ν, δ1 , δ2
36
Metropolis-Hastings
37
SIMULATION (TO EVALUATE MODEL BEHAVIOR)
THIS SIMULATION IS UNREALISTIC BECAUSE THE VARIANCE OF THE GENETIC SIGNAL IS MUCH, MUCH LARGER THAN THAT OF THE RESIDUAL DISTRIBUTION (LARGE BROAD-SENSE HERITABILITY)
8 SNPs but only 3 have an effect
38
Over-fitting
39
θ estimates in RBF
Posterior mean estimates of SNP effects
WONDERFUL PREDICTIONS
UNCERTAINTY VERY SMALL, BECAUSE OF THE SMALL RESIDUAL VARIANCE
RBF (heterogeneous θ) better than RBF (homogeneous θ), better than BAYES A
40
9. NON-PARAMETRIC BREEDING VALUE

Standard (parametric) approach:

$y = X\beta + a + d + i_{aa} + i_{ad} + i_{dd} + e = X\beta + g + e$

$V(y) = V(g) + V(e), \qquad V(g) = A\sigma_a^2 + D\sigma_d^2 + A\#A\,\sigma_{aa}^2 + A\#D\,\sigma_{ad}^2 + D\#D\,\sigma_{dd}^2$

$E(a|g) = Cov(a,g')\,V(g)^{-1}g = \sigma_a^2A\,V(g)^{-1}g$

$E(a|y) = E_{g|y}\left[E(a|g)\right] = E_{g|y}\left[\sigma_a^2A\,V(g)^{-1}g\right] = \sigma_a^2A\,V(g)^{-1}E(g|y)$

Non-parametric approach:

$y = X\beta + K\alpha + e$, where $K\alpha$ is the non-parametric genomic value (counterpart of g). Let $g_K = K\alpha$ and define

$a_K = \sigma_a^2A\left[Var(K\alpha)\right]^{-1}K\alpha = \sigma_a^2A\,(K\sigma_\alpha^2)^{-1}K\alpha = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A\alpha$

(using $\alpha \sim (0, K^{-1}\sigma_\alpha^2)$, so that $Var(K\alpha) = K\sigma_\alpha^2$). Then

$Var(a_K) = \left(\dfrac{\sigma_a^2}{\sigma_\alpha^2}\right)^2 AK^{-1}A\,\sigma_\alpha^2$

Equal to $A\sigma_a^2$ only if $K^{-1} = A^{-1}$ and $\sigma_\alpha^2 = \sigma_a^2$.

If A = I, the non-parametric breeding values are still correlated:

$Var(a_K) = \left(\dfrac{\sigma_a^2}{\sigma_\alpha^2}\right)^2 K^{-1}\sigma_\alpha^2$
41
COROLLARIES

Non-parametric breeding value of yet-to-be-born progeny:

$a_{prog,K} = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A_{prog,\cdot}\,\alpha$

Non-parametric breeding value of individuals not phenotyped or genotyped:

$a_{-,K} = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A_{-,\cdot}\,\alpha$

Semi-parametric approach:

$y = X\beta + u + K\alpha + e$, with the standard assumption $u \sim (0, A\sigma_a^2)$, and tune some kernel such as $K = D + A\#A + A\#D + D\#D$
Assessment of Cross-validation Strategies for Genomic Prediction in Cattle
M. Erbe, E. C. G. Pimentel, A. R. Sharifi and H. Simianer
Department of Animal Sciences, Animal Breeding and Genetics Group, Georg-August-University Göttingen, Germany
03.08.2010 WCGALP Leipzig 2
Introduction
Genomic breeding value estimation
– Which model is appropriate?
– Which model is better when comparing two of them?
→ validation is necessary
– normally: simulation studies with adequate design
– but when using real data: limited number of animals which are genotyped and have phenotypes
03.08.2010 WCGALP Leipzig 3
Introduction
real data: limited number of animals which are genotyped and have phenotypes
→ Cross-validation procedures
– training of the model with a reference set and prediction for
animals in the validation set (phenotypes masked)
→ How to split the data set?
→ How many animals for the validation set?
03.08.2010 WCGALP Leipzig 4
Data set and models
• 2294 HF bulls, 39’557 SNPs
• “phenotypes”: EBVs for milk yield and somatic cell score
• Genomic BLUP models
1) Model “A+G”:

$y = Xb + Zu + Wg + e$

with $u \sim N(0, A\sigma_u^2)$, where A = pedigree-based relationship matrix, and $g \sim N(0, G\sigma_g^2)$, where G = genomic relationship matrix following VanRaden (2008), based on 39’557 SNPs
03.08.2010 WCGALP Leipzig 5
Data set and models
• 2294 HF bulls, 39’557 SNPs
• “phenotypes”: EBVs for milk yield and somatic cell score
• Genomic BLUP models
2) Model “G*”:

$y = Xb + Wg + e$

with $g \sim N(0, G^*\sigma_g^2)$, where G* = genomic relationship matrix based on 4‘945 SNPs (1/8 of the available SNPs, randomly chosen)
03.08.2010 WCGALP Leipzig 6
Cross-validation procedure
2294 bulls (nall)
2194 bulls
100 bulls
step 1
training validation
randomsamples
Genotyped animalswith phenotypic information
03.08.2010 WCGALP Leipzig 7
Cross-validation procedure
2294 bulls (nall)
2094 bulls
200 bulls
step 2
training validation
Genotyped animalswith phenotypic information
100 bulls(randomly chosen)
03.08.2010 WCGALP Leipzig 8
Cross-validation procedure
2294 bulls (nall)
2294-n*100 bulls
n*100 bulls
step n
training validation
Genotyped animalswith phenotypic information
100 bulls(randomly chosen)
03.08.2010 WCGALP Leipzig 9
Cross-validation procedure
• step 1 (100 bulls in the validation set) to step 15 (1500 bulls in the validation set) → the whole process was repeated 100 times
• for each step in each replicate:
→ fitting the model (including variance component estimation) with the training set
→ prediction of the genomic breeding values for the bulls in the validation set
→ accuracy of prediction: calculation of the correlation between true breeding value (TBV) and predicted genomic breeding value (GEBV) for the bulls in the validation set
03.08.2010 WCGALP Leipzig 10
Cross-validation procedure (continued): since TBV is unobserved, the accuracy of prediction is derived as

$r_{TBV,GEBV} = \dfrac{r_{GEBV,EBV}}{r_{EBV,TBV}}$
03.08.2010 WCGALP Leipzig 11
Results: Performance of a model
• Performance of a model: how precisely can we describe the prediction ability of a model when using cross-validation?
→ look at the median of the correlations obtained with 100 replicates
→ look at the variation over the 100 replicates within each step
03.08.2010 WCGALP Leipzig 12
Results – trait: somatic cell score
03.08.2010 WCGALP Leipzig 13
Results – trait: milk yield
03.08.2010 WCGALP Leipzig 14
Results: Comparison of models
• What kind of splitting is best when the aim is to compare two models?
→ comparison of the correlation coefficients of the two models in each step in each replicate (H0: identity of the correlation coefficients) (Meng et al., 1992)
→ consideration of the mean of p-values obtained for each step
03.08.2010 WCGALP Leipzig 15
Results: Comparison of models
• What kind of splitting is best when the aim is to compare two models?
α = 0.01
03.08.2010 WCGALP Leipzig 16
Results: Influence of age structure
• Sorting bulls by age (model A+G)
→ step n: n×100 youngest bulls in the validation set
03.08.2010 WCGALP Leipzig 17
Conclusions
• highest correlations with small validation sets; but: also very high variation between repeats when the validation set is small
• for comparison of two models: larger validation sets (correlations estimated more stably)
• sorting by age → prediction abilities at the lower limit
→ overestimation of the prediction ability with randomly chosen validation sets
03.08.2010 WCGALP Leipzig 18
Acknowledgment
• Synbreed – Synergistic Plant and Animal Breeding
• FUGATOplus GenoTrack
• Vereinigte Informationssysteme Tierhaltung, Verden
Thank you for your attention!
03.08.2010 WCGALP Leipzig 20
Comparison of correlation coefficients
• Comparison of two correlation coefficients when the samples are not independent (Meng et al., 1992):

$Z = (\dot z_1 - \dot z_2)\sqrt{\dfrac{n-3}{2(1-r_x)h}}$

where $\dot z_1, \dot z_2$ = Fisher-transformed correlation coefficients,

$\dot z = 0.5\ln\left(\dfrac{1+r}{1-r}\right)$

$r_x$ = correlation between the GEBVs of model A+G and the GEBVs of model G*, and

$h = \dfrac{1 - f\bar r^2}{1 - \bar r^2}, \qquad f = \dfrac{1 - r_x}{2(1 - \bar r^2)}, \qquad \bar r^2 = \dfrac{r_1^2 + r_2^2}{2}$
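A compact implementation of this dependent-correlation test (a sketch; the input values are illustrative):

```python
import numpy as np
from scipy import stats

def meng_z_test(r1, r2, rx, n):
    """Meng, Rosenthal & Rubin (1992) z-test for two dependent correlations
    r1, r2 sharing a common variable; rx = correlation between predictors."""
    rbar2 = (r1**2 + r2**2) / 2.0
    f = min((1.0 - rx) / (2.0 * (1.0 - rbar2)), 1.0)
    h = (1.0 - f * rbar2) / (1.0 - rbar2)
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher transform
    z = (z1 - z2) * np.sqrt((n - 3.0) / (2.0 * (1.0 - rx) * h))
    return z, 2.0 * stats.norm.sf(abs(z))        # two-sided p-value

print(meng_z_test(r1=0.60, r2=0.55, rx=0.80, n=500))
```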
1
BAYES A, BAYES B, LASSO
THE CURSE OF THE BAYESIAN ALPHABET
Featuring
Halle Berry,as “A”
Kim-Jong Il,as “Bayes”
Scarlett Johanssonas “B”
AND…
Sarah Palinsings
“To Russia with love, a view from
my igloo”
2
RECALL FROM EARLY PART OF COURSE
REASONABLE BAYESIAN MODEL
• For any parameter, one must be able to “kill” the prior asymptotically
• For any parameter, the statistical distance between prior and posterior (and therefore conditional posterior) must go to infinity
• If this distance has a finite upper bound, it means that the prior is influential
• One must be able to reduce the statistical entropy conveyed by the prior by a sizable amount. If the reduction is tiny, the prior is very influential
3
STATE OF KNOWLEDGE (in a finite sample)

Minimum: prior
Maximum: conditional posterior
Intermediate: marginal posterior

THE PROCESS OF DECONDITIONING CONSUMES INFORMATION ABOUT THE FOCAL POINT

Meaning: the conditional posterior is the best world to live in
4
BAYES A + BAYES B (as I understand them)

Linear model proposed by Meuwissen et al. (2001). Scalar form:

$y_i = \mu + \sum_{j=1}^{p}x_{ij}b_j + e_i, \qquad i = 1,2,\ldots,n; \quad n \ll p$

where $b_j$ is the additive effect of SNP j and $x_{ij}$ is the code for the genotype of SNP j (x = -1, 0, 1). Matrix form:

$y = 1\mu + Xb + e, \qquad y|\mu,X,b \sim N(1\mu + Xb,\ I\sigma_e^2)$

$y_i|\mu,x_i,b,\sigma_e^2 \sim N\left(\mu + \sum_{j=1}^{p}x_{ij}b_j,\ \sigma_e^2\right)$
5
BAYES A (Meuwissen et al., 2001). The priors:

$\mu \sim \text{uniform}$

$\sigma_e^2 \sim \nu_e S_e^2\chi_{\nu_e}^{-2}$

$b_j|\sigma_j^2 \sim N(0, \sigma_j^2), \qquad j = 1,2,\ldots,p$

$\sigma_j^2|\nu,S^2 \sim \nu S^2\chi_{\nu}^{-2} \text{ for all } j$

Note: each SNP has a variance (think of a sire model in which each sire effect has a variance). The hyper-parameters ν and S², specified arbitrarily, will control the extent of shrinkage. Question: does their influence vanish asymptotically?

Marginal prior:

$p(b_j|\nu,S^2) = \int_0^\infty N(0,\sigma_j^2)\,p(\sigma_j^2|\nu,S^2)\,d\sigma_j^2 \propto \int_0^\infty (\sigma_j^2)^{-\frac12}\exp\left(-\dfrac{b_j^2}{2\sigma_j^2}\right)(\sigma_j^2)^{-\frac{\nu+2}{2}}\exp\left(-\dfrac{\nu S^2}{2\sigma_j^2}\right)d\sigma_j^2 \propto \left[1 + \dfrac{b_j^2}{\nu S^2}\right]^{-\frac{\nu+1}{2}} \rightarrow t(0,\nu,S^2)$

The prior of a marker effect is a t-distribution with known scale and df.

MARGINALLY: NO DIFFERENTIAL SHRINKAGE WHATSOEVER IN BAYES A
6
Bayes B is Bayesianly “STRANGE”

Bayes B:

$b_j|\sigma_j^2 = \begin{cases}\text{point mass at some constant } k & \text{if } \sigma_j^2 = 0\\ N(0,\sigma_j^2) & \text{if } \sigma_j^2 > 0\end{cases}$

$\sigma_j^2|\pi = \begin{cases}0 & \text{with probability } \pi\\ \nu S^2\chi_{\nu}^{-2} & \text{with probability } 1-\pi\end{cases}$

1. Meuwissen takes the constant = 0
2. Meuwissen assumes π is known, e.g., 0.95
3. Recall: if a prior variance is 0, this means complete certainty

Joint density:

$p(b_j,\sigma_j^2|\pi) = \begin{cases}b_j = k \text{ and } \sigma_j^2 = 0 & \text{with probability } \pi\\ N(0,\sigma_j^2)\times\nu S^2\chi_{\nu}^{-2} & \text{with probability } 1-\pi\end{cases}$

Marginal prior:

$p(b_j|\pi) = \begin{cases}b_j = k & \text{with probability } \pi\\ \int_0^\infty N(0,\sigma_j^2)\,p(\sigma_j^2|\nu,S^2)\,d\sigma_j^2 & \text{with probability } 1-\pi\end{cases}$
7
Further, as before, the integral yields a t-distribution:

$p(b_j|\pi) = \begin{cases}b_j = k & \text{with probability } \pi\\ t(0,\nu,S^2) & \text{with probability } 1-\pi\end{cases}$

Then: PRIOR = MIXTURE OF A POINT MASS AND OF A t-DISTRIBUTION. BAYES B PUTS THE MASS AT 0 (IF NOT 0, THIS GETS ABSORBED INTO THE GENERAL MEAN)

MARGINALLY: NO DIFFERENTIAL SHRINKAGE IN BAYES B EITHER

BAYES A IS A SPECIAL CASE OF BAYES B (π = 0)

Meaning: if Bayes A has a flaw, this carries to Bayes B
8
A Gibbs sampler for Bayes A (element-wise sampling). Note: the form of the implementation is just an algorithmic matter; it is immaterial with respect to the issues.

Sampling the mean (a flat prior for the mean, or for the fixed effects, is not influential):

$\mu|ELSE \sim N\left(\dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p}x_{ij}b_j\right),\ \dfrac{\sigma_e^2}{n}\right)$
9
Sampling the residual variance:

$\sigma_e^2|ELSE \sim \dfrac{\sum_{i=1}^{n}\left(y_i - \mu - \sum_{j=1}^{p}x_{ij}b_j\right)^2 + \nu_e S_e^2}{\chi_{n+\nu_e}^2}$

The prior can be “killed” simply by increasing sample size: the degrees of freedom go to n, and the residual sum of squares dominates the weighted average as n increases.
Sampling the marker effects:

$b_j|ELSE \sim N\left(\dfrac{\sum_{i=1}^{n}x_{ij}\left(y_i - \mu - \sum_{j'\neq j}x_{ij'}b_{j'}\right)}{\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2}},\ \dfrac{\sigma_e^2}{\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2}}\right), \qquad j = 1,2,\ldots,p$

Kill the prior simply by increasing sample size: the effect of the shrinkage ratio vanishes, since

$\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2} \rightarrow \sum_{i=1}^{n}x_{ij}^2$
10
Sampling the variance of the marker effects:

$\sigma_{b_j}^2|ELSE \sim \dfrac{b_j^2 + \nu S^2}{\chi_{1+\nu}^2}, \qquad j = 1,2,\ldots,p$

• The prior cannot be killed here. One can increase the number of data points or of markers ad nauseam and gain only one degree of freedom, always
• Recall that, in the conditional posterior, all other parameters are known (i.e., they are assigned values)
• Since one must de-condition, the true posterior actually moves less than one degree of freedom away from the prior
• Prior df: very influential (typically very small)
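Putting the four conditionals together gives the whole sampler; a compact sketch under the priors above (hyper-parameter values are placeholders, not recommendations):

```python
import numpy as np

def bayes_a_gibbs(X, y, nu=4.0, S2=0.01, nu_e=4.0, S2_e=1.0, n_iter=2000, seed=0):
    """Element-wise Gibbs sampler for Bayes A, following the four
    conditional distributions above. Returns the last draws only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu, b = y.mean(), np.zeros(p)
    s2e, s2b = y.var(), np.full(p, S2)
    xtx = (X ** 2).sum(axis=0)
    e = y - mu - X @ b                       # current residuals
    for _ in range(n_iter):
        # mean: N(mean(y - Xb), s2e/n)
        mu_new = rng.normal((e + mu).mean(), np.sqrt(s2e / n))
        e += mu - mu_new; mu = mu_new
        # marker effects, one at a time
        for j in range(p):
            e += X[:, j] * b[j]              # remove SNP j from residuals
            c = xtx[j] + s2e / s2b[j]
            b[j] = rng.normal((X[:, j] @ e) / c, np.sqrt(s2e / c))
            e -= X[:, j] * b[j]
        # variances: scaled inverse chi-square draws
        s2b = (b ** 2 + nu * S2) / rng.chisquare(nu + 1.0, size=p)
        s2e = (e @ e + nu_e * S2_e) / rng.chisquare(n + nu_e)
    return mu, b, s2e
```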
STATE OF KNOWLEDGE
Minimum PriorMaximum Conditional posteriorIntermediate Marginal posterior
11
For df > 6, the relative variability of the posterior distribution of the variance of a SNP effect essentially copies that of the prior distribution
ENTROPY CALCULATIONS ($\sigma_{a_k}^2$ = variance of a marker effect; sorry, change of notation)

Prior entropy:

$H\left[\sigma_{a_k}^2|\nu,S^2\right] = -\int \log\left[p(\sigma_{a_k}^2|\nu,S^2)\right]p(\sigma_{a_k}^2|\nu,S^2)\,d\sigma_{a_k}^2$

a closed-form expression in ν, S², Γ(·) and the digamma function.

Entropy of the conditional posterior (learning from data reduces entropy; the entropy of the marginal posterior cannot be calculated):

$H\left[\sigma_{a_k}^2|ELSE\right] = -\int \log\left[p(\sigma_{a_k}^2|ELSE)\right]p(\sigma_{a_k}^2|ELSE)\,d\sigma_{a_k}^2$

the same expression with ν + 1 degrees of freedom and scale $a_k^2 + \nu S^2$.
12
Relative information gain:

$RIG = \dfrac{H\left[\sigma_{a_k}^2|\nu,S^2\right] - H\left[\sigma_{a_k}^2|ELSE\right]}{H\left[\sigma_{a_k}^2|\nu,S^2\right]}$

For $a_k = 0$, S = 1 and ν = 4: RIG = 0.125
For $a_k = 0$, S = 1 and ν = 10: RIG = 6.51 × 10⁻²
For $a_k = 0$, S = 1 and ν = 100: RIG = 9.60 × 10⁻³
Metaphorically: the prior is totalitarian in Bayes A (B)
STATISTICAL DISTANCE BETWEEN CONDITIONAL POSTERIOR AND PRIOR (KULLBACK-LEIBLER)

IF KL IS LARGE, THEN LEARNING BEYOND THE PRIOR HAS TAKEN PLACE. KL SHOULD GO TO INFINITY AS DATA ACCUMULATE IN ANY REASONABLE BAYESIAN MODEL

[Figure: specific distance, at a given variance, from the prior]
13
[Figure: logarithmic distance between conditional posterior and prior for (df = 4, S = 1), (df = 10, S = 1) and (df = 100, S = 1), at $a_k = 0$. The logarithmic distances are indistinguishable at any value of the variance; if 10 markers are allowed to share the same variance, the logarithmic distance increases sharply.]

KULLBACK-LEIBLER DISTANCES BETWEEN CONDITIONAL POSTERIOR AND PRIOR:
1) 7.33 × 10⁻² for ν = 4, S = 1, p = 1 and $a_k = 0$
2) 2.64 × 10⁻² for ν = 10, S = 1, p = 1 and $a_k = 0$
3) 2.52 × 10⁻³ for ν = 100, S = 1, p = 1 and $a_k = 0$

If 10 markers are allowed to share the same variance, KL = 4.47. Relative to (1), the KL distance increases 61 times…
14
[Figure: effect of the scale parameter of the prior; prior shown as open boxes]
15
BAYES A (B)
• The prior always matters
• The effect of the prior is via the extent of shrinkage of marker effects
• The extent of shrinkage can be manipulated, with the data essentially providing no control
• Statistically greedy models (the same will apply to any model assigning marker-specific variances)
SIMULATION
(never take a simulation too seriously)
16
RESURRECTION OF BAYES A
(If additive model holds, it may give sensible inferences about marker effects)
17
POSTERIOR DISTRIBUTION OF SNP VARIANCES UNDER UNIFORM PRIOR
POSTERIOR DISTRIBUTION OF SNP VARIANCES UNDER FOUR S.INV. CHI-SQUARE PRIORS
18
THE GOOD NEWS
Posterior distribution of SNP effects
Bayes A “picks Up the 3 relevant SNPs
19
DEATH-RESURRECTION-DEATH
Bayes A may give a distorted picture if there is non-linearity or
non-additivity
20
NONE OF THE RELEVANT GENETIC SIGNALS ARE CAPTURED
BAYES A vs. BAYES L
(Bayes L= Bayesian Lasso)
21
In the Bayesian Lasso, marker effects are assigned double exponential distributions:

$p(\beta|\lambda) = \prod_{j=1}^{p}\dfrac{\lambda}{2}\exp\left(-\lambda|\beta_j|\right)$
[Figure: density of a normal and of a double-exponential distribution]

EACH MARKER HAS THE SAME D.E. DISTRIBUTION: NO DIFFERENTIAL SHRINKAGE EITHER
[Figure: graphical representation of the hierarchical structure of the Bayesian LASSO and Bayes A. Bayesian LASSO: $p(y|\beta,\sigma^2)$ → $\beta_j|\tau_j^2,\sigma^2$ → $\tau_j^2|\lambda^2$ → $\lambda^2$. Bayes A: $p(y|\beta,\sigma^2)$ → $\beta_j|\sigma_j^2$ → $\sigma_j^2|df, S$.]
22
Assume an exponential distribution of the residual variances:

$p(\sigma_{e_i}^2|\lambda) = \dfrac{\lambda^2}{2}\exp\left(-\dfrac{\lambda^2\sigma_{e_i}^2}{2}\right)$

Mix (as in the t-model):

$p(y_i|\mu_i,\lambda) = \int_0^\infty N(y_i|\mu_i,\sigma_{e_i}^2)\,\dfrac{\lambda^2}{2}\exp\left(-\dfrac{\lambda^2\sigma_{e_i}^2}{2}\right)d\sigma_{e_i}^2, \qquad i = 1,2,\ldots,n$

Assume $\lambda^2|a,b \sim Gamma(a,b)$.

Implementation is as in a t-model, but transforming $\sigma_{e_i}^2 \rightarrow \dfrac{1}{\sigma_{e_i}^2}$: the full conditional of $\sigma_{e_i}^{-2}$ is an inverse Gaussian (Wald) distribution, with

$E\left(\sigma_{e_i}^{-2}|ELSE\right) = \sqrt{\dfrac{\lambda^2}{(y_i - x_i'\beta - z_i'u)^2}}, \qquad Var\left(\sigma_{e_i}^{-2}|ELSE\right) = \dfrac{E^3\left(\sigma_{e_i}^{-2}|ELSE\right)}{\lambda^2}$
23
ANOTHER SIMULATION
(never take another simulation too seriously)

$y_i = \sum_{j=1}^{280}x_{ij}\beta_j + e_i, \qquad i = 1,\ldots,300$

280 markers; residuals assumed N(0,1).

Pearson correlation between marker genotypes (average across markers and 100 Monte Carlo simulations) by scenario (X0: low LD; X1: high LD):

Scenario    Adjacency 1    2        3        4
X0          0.007          0.002   -0.002    0.013
X1          0.722          0.567    0.450    0.356
Only 10 markers had effects; 270 had no effect on the simulated trait.

[Table: positions (chromosome and marker number) and effects of the 10 markers with effects, out of 280 markers]
NINE SPECIFICATIONS OF BAYES A

Prior df \ Prior scale:   10⁻⁵   10⁻³   5×10⁻²
0                         (1)    (2)    (3)
½                         (4)    (5)    (6)
1                         (7)    (8)    (9)
25
26
Ability to uncover relevant genomic regions
• For each method and replicate, markers were ranked on the absolute values of their posterior means
• For each effect, a dummy variable was created: 1 if the marker (or any of its 4 flanking markers) ranked in the top 20, and 0 otherwise
• Averaging over markers and replicates gives an index of “retrieved regions”
Bayes A was affected by the priors: worse performance in settings 1, 4 and 7. Bayes A (settings 2, 3, 6) and the Lasso almost doubled the ability.
Simple fixes of Bayes A
• Assign the same variance to all markers (a trivial Bayesian regression problem)
• Assign the same variance to groups of markers (e.g., chromosomes or genomic regions): a model comparison issue
• Non-informative priors can be assigned to S and to the degrees of freedom ν: just an algorithmic matter
27
Issues and questions
• Bayes A can be “fixed”, but that may not be the best thing to do. Open question…
• Bayes A, as is, may still have good predictive (out-of-sample) behavior, even though it is not completely defensible
• Bayes B is Bayesianly ill-posed. If you do not believe me, check with local Bayesian statisticians…
• More reasonable: a mixture at the level of the effects (not of the variances): I believe this is what the Dutch are doing (and the Iowa people with beef cattle)
SOME POSTERIOR THOUGHTS
• Prediction is a different ball game from inference
• Do not spend a lot of time inventing priors or fancy models. An additive model may just do well…
• Spend more time in cross-validation and less in simulation. Now there is data!!
• Markers have an ascertainment problem (Chikhi, 2008): simulations may give a distorted picture
• One cannot deal with complexity with parametric methods, and non-parametric methods are almost as good as parametric ones even when the assumptions hold
• SNP-assisted genetic evaluation is holding up well, and seems to outperform BLUP in most cases
28
10. Conclusions10. Conclusions
• Challenges to parametric methods posed by genomic and post-genomic data
• Future: Shift in paradigm. Semi-parametric and “machine learning” type techniques?
"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so."
(Mark Twain)
Takezawa (2005): model-free procedures can have better predictive performance even if the “true” model is used to generate the data and is then fitted to them.