TRANSCRIPT
1
A mixed bag of tools for genome-assisted prediction
VALENCIA, September 20-24, 2010
Daniel Gianola
Sewall Wright Professor of Animal Breeding and Genetics
University of Wisconsin
Dairy Science
A primer to genome-based analysis of quantitative traits
with machine learning and non-parametric methods
2
ACKNOWLEDGEMENTS
WI: Nanye Long (machine learning, RBF)
SP: Oscar Gonzalez-Recio (genomic value)
AL: Gustavo de los Campos (RKHS, LASSO)
WI: Hayrettin Okut (NN)
SP: Marigel Perez-Cabal (CV, Clustering)
AL: Ana Vazquez (CV, Clustering)
GE: Malena Erbe (CV)
WI: Nick Wu (Dirichlet process priors)
WI: Guilherme Rosa
WI: Kent Weigel
Aviagen: Santiago Avendano, Andreas Kranis
Institut Pasteur (Montevideo): Hugo Naya
2
3
TOPICS COVERED (order is approximate)
1. Evolution of statistical methods in quantitative genetics
2. Challenges from complexity and use of phenomic data
3. Brief review of Bayesian inference
4. The problem of dealing with interactions
5. Machine learning classification: SNP-mortality
6. Comparison of classifiers (NB, BN, NN)
7. Introduction to non-parametric regression: LOESS, RKHS, radial basis functions, neural networks
8. Models for genome-enabled evaluation: Bayes A, Bayes B, Bayes C, Lasso
9. Cross-validation
10. Results from animals and plants
4
BIBLIOGRAPHY
3
5
1. EVOLUTION OF STATISTICAL METHODS IN ANIMAL BREEDING
Archean Visual appraisal (still widely used) [Biblical times…]
Pathozoic Path analysis, “Animal Breeding Plans” [1918-1945]
Anovian ANOVA (Method 1), least-squares, selection index [1936-1943]
Post-anovian Methods 2+3, MINQUE, MIVQUE [1953-1973]
Blupassic Mixed models, BLUP, animal model, multi-traits [1948-1990]
Remlian VCE, ASREML, DMU [1971- 2009]
Posteriozoic Threshold models, Survival, MCMC, QTLs [1982-2008]
Intelligent design vuvu-Maradona
6
2. Challenges from complexity and use of phenomic data
4
7
The phenomic data (phenotypes + genomics)
1) Massive phenotypic data exist
2) Massive genomic data increasingly available
Example: SNPs (also gene expression)
• >10^7 SNPs in dbSNP 124 (Nat. Center for Biotechnology)
• Perlegen: 1.58 million SNPs
• Animals:
- Wong et al. (2004): chicken genetic variation map with 2.8 million SNPs
- Hayes et al. (2004): 2,500 SNPs in salmon genome
- Poultry breeding companies: thousands of SNPs on sires/dams
- USA (2008): >50,000 SNPs in over 3,000 Holstein sires
- All over the developed world: chips with 800,000 SNPs
8
All you wanted to know about SNPs but were afraid to ask…
SNP = DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a species (or between paired chromosomes).
ABOVE: two sequenced DNA fragments, AAGCCTA and AAGCTTA, differ in a single nucleotide; we say that there are two alleles: C and T.
5
9
Copy number variants (CNV) and copy number polymorphisms (CNP): another source of information about genetic variation
• Individuals vary in the number of copies of genomic regions
• Disease genes are located in CNV regions
10
GENE EXPRESSION: [Figure: clustered fluorescent map, genes in rows]
6
11
“I always wanted to be the First…”
John Paul the Second
12
ANIMAL BREEDING: USE ALL SNP MARKERS IN MODELS
FOR GENOMIC-ASSISTED EVALUATION
Effects of chromosomal segments: allelic, haplotype
SNP effects combined additively
…but Meuwissen, Hayes and Goddard (2001) came first!
7
13
Schaeffer (2006):
YES, IT WOULD BE DIFFICULT! SEE NEXT…
14
FURTHER, COULD WE WRITE A MODEL FOR SOMETHING LIKE THIS?
A SYSTEMS BIOLOGY MAP OF THE BRAIN
8
15
Dealing with epistatic interactions and non-linearities
gene x gene
gene x gene x gene
gene x gene x gene x gene
…
(Alice in Wonderland)
16
Fixed effects models (unravelling "physiological epistasis" a la Cheverud?)
• Lots of “main effects”
• Splendid non-orthogonality
• Lots of 2-factor interactions
• Lots of 3-factor interactions
• Lots of non-estimability
• Lots of uninterpretable high-order interactions
• Run out of “degrees of freedom”
Epistatic networks will probably involve a few genes of large effect
9
17
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
ALL ASSUMPTIONS VIOLATED!
Just consider linkage disequilibrium
(Cartoon: "My girlfriend is a bitch")
18
"I started by biting my nails…" — Venus de Milo
10
19
CLOSE ENCOUNTERS OF THE PREHISTORIC KIND
GENOMICS AND COMPLEX BIOLOGY
NO! THE ADDITIVE GENETIC MODEL
Homo sapiens
Neanderthal
My Dad is crazy
20
Is my dad crazy? Les Savants
A prevailing view, and for good reasons (Hill et al., 2008; Crow, 2010; Hill, 2010)
• Fisher's theorem of natural selection
• Interactions are second-order effects; likely tiny and hard to detect
• Epistasis probably arises with genes of large effects, unlikely to be observed in outbred populations
• Epistatic systems generate additive variance and "release" it, so why worry?
11
21
Les Idiots Savants
A much less popular view (Gianola and a bunch of other guys)
• If everything behaves as additive, can additive models allow us to learn about "genetic architecture"?
• In areas where phenotypic prediction is crucial (medicine, precision mating), can the exploitation of interaction have added value?
• If so, should we consider enriching our battery of statistical tricks?
22
What can be expected from systems biology?
• von Bertalanffy (1968) wrote:
"There exist models, principles, and laws that apply to generalized systems or their subclasses, irrespective of their particular kind, the nature of their component elements, and the relations or 'forces' between them.
It seems legitimate to ask for a theory, not of systems of a more or less special kind, but ofuniversal principles applying to systems in general.
In this way we postulate a new discipline called General System Theory. Its subject matter is the formulation and derivation of those principles which are valid for 'systems' in general.
Concepts like those of organization, wholeness, directiveness, teleology, and differentiation are alien to conventional physics. However, they pop up everywhere in the biological, behavioural and social sciences, and are, in fact, indispensable for dealing with living organisms or social groups. Thus, a basic problem posed to modern science is a general theory of organization.
General system theory is, in principle, capable of giving exact definitions for such concepts and,in suitable cases, of putting them to quantitative analysis…
Allgemeine Systemtheorie
12
23
Systems analysis is not new in the animal sciences…
Where is the beef?
24
SYSTEMS ANALYSIS IN ACTION: PENTAGON “SYSTEMS” VIEW OFTHE WAR IN AFGHANISTAN
GENERAL McCHRYSTAL:“By the time we understand this slide, the war will be over”
(and he was sacked by Obama after the Rolling Stone article)
WHAT CAN WE EXPECT FROM SYSTEM ANALYSIS?
13
25
A VIEW OF LINEAR MODELS (as employed in q. genetics)
Mathematically, a linear model can be viewed as a "local" approximation of a complex process:
Linear approximation
Quadratic approximation
nth order approximation
FELDMAN and LEWONTIN (1975); CHEVALET (1994)
26
How good are linear and quadratic approximations? A Taylor series provides a local approximation only…
[Figure: $y = g(x) + e$ with $g(x) = \sin(x) + \cos(x)$, plotted for $x \in (-5, 5)$ together with (1) the sine-plus-cosine function, (2) its linear approximation, and (3) its quadratic approximation; (4) the approximations are good at x = 0…]
14
27
What is machine learning? Automatically produce models, such as rules and patterns, from data. Closely related to data mining, statistics, inductive reasoning, pattern recognition, and theoretical computer science.
[Diagram: machine learning overlapping with pattern recognition, data mining, neural networks, universal approximators, kernel methods, sampling methods, cross-validation designs, Bayesian networks, non-parametric prediction, ensemble methods (boosting, bagging), support vector machines, and random forest algorithms]
Do not fight over methods (Gonzalez-Recio)
28
Distinctive aspects of non-parametric fitting
• Investigate patterns free of strictures imposed by parametric models
• Regression coefficients appear but (typically) do not have an obvious interpretation
• Often: very good predictive performance in cross-validation
• Tuning methods and algorithms (maximization, MCMC) similar to those of parametric methods
• Often produce surprising results
15
29
Surprise! thin-plate splines
Risk of heart attack after 19 years as a function of cholesterol level and blood pressure. Left: logistic regression. Right: thin-plate spline (Wahba, 2007).
[Equation garbled in extraction; the thin-plate spline fit has the form $f(\mathbf{x}_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \sum_{j=1}^{N} c_j\,\|\mathbf{x}_i-\mathbf{x}_j\|^2 \log\|\mathbf{x}_i-\mathbf{x}_j\|$]
30
PENALIZED and BAYESIAN METHODS for functional inference play a role
• The idea of a "penalty" is ad hoc
• It does not arise "naturally" in classical inference
• It appears very naturally in Bayesian inference
L2 penalty: equivalent to a Gaussian prior
L1 penalty: equivalent to a double-exponential prior
Penalties on covariance matrices: equivalent to priors (e.g., inverse Wishart)
Bayesian methods arise naturally in predictive inference
16
31
2. A BRIEF REVIEW OF BAYESIAN INFERENCE
We do this because most prediction methods based on "regularization" admit a Bayesian solution.
Crucial to understand Bayesianism when analyzing high-dimensional data
32
Rev. Thomas Bayes
1702, London, England — 1761, Tunbridge Wells, Kent, England
Pierre-Simon Laplace
1749, Beaumont-en-Auge, France — 1827, Paris, France
1763. “An essay towards solving a problem in the doctrine of chances”. Philosophical Transactions of the Royal Society of London 53, 370-418.
1774. "Mémoire sur la probabilité des causes par les événements". Savants étrangers 6, 621-656. Oeuvres 8, 27-65.
INVERSE PROBABILITY
17
33
HISTORICAL NOTES
• Karl Pearson (without knowing) used Bayes
• Fisher (likelihood, fiducial inference)
• Lack of admissibility of classical procedures (James-Stein)
• Revival: neo-Bayesianism (Lindley, Box, Zellner)
• MCMC procedures (Metropolis; Geman and Geman)
• Bayesian methods in genetics: Haldane (1948), Dempfle (1977), Gianola and Fernando (1986)
• Explosion of Bayesianism in statistics: Gelfand and Smith (1990)
• Explosion in genetics as well
34
Bayesian methods in genetics: today
- Classification of genotypes
- Molecular evolution
- Linkage mapping
- QTL cartography
- Genetic risk analysis
- Gaussian linear and non-linear models: cross-sectional + longitudinal, univariate + multivariate
- Generalized linear models
- Survival analysis
- Thick-tailed processes
- Mixtures
- Semi-parametrics
- Transcriptional analysis
- Structural equation modeling
- Bayesian proteomics with wavelets
- Methods for genomic selection (the Bayesian Alphabet and more)
RED: animal breeders
18
35
THE BAYESIAN APPROACH IN A NUTSHELL
• All unknowns in the statistical system are treated as random
• Randomness reflects (typically) subjective uncertainty
• Can include as unknowns: the model (distribution, functional form); its parameters (heritability, inbreeding coefficient); genetic effects, number of QTL loci, marker effects
• Combine what is known a priori with information from the data: Bayesian learning
• The Bayesian approach can also be used for developing predictors of future observations without taking inference too seriously
36
HOW DOES ONE DO THIS?
• Introduce a prior distribution for all unknowns (PRIOR)
• Define a distribution for the data under a certain model (LIKELIHOOD)
• Arrive at the conditional distribution of all unknowns given the data (POSTERIOR)
• Derive marginal or conditional posterior distributions of interest by standard probability theory
• Display summaries or the entire distribution
• Interpret results probabilistically
• Example: the posterior probability of H0 is 8%
19
37
BAYESIAN INFERENCE and MCMC (can fit any model)
PRIOR + DATA → POSTERIOR (of the First Lady of France)
Most of the time the prior comes "out of the blue"
38
PRIOR + DATA → POSTERIOR
THUS: THE ANTI-BAYESIAN ARGUMENT…
20
39
Analysis with Prior 1: collecting more and more data….
40
Analysis with Prior 2: collecting more and more data….
21
41
Implications
• This is called "asymptotic domination" of the prior by the data (likelihood)
• For parameters on which there is a lot of information from the data, the prior matters little
• The prior may be influential in small samples; worthwhile to investigate sensitivity
• What is a small sample?
• Even if the prior matters little, the Bayesian approach allows us to use probability theory to measure uncertainty
42
BAYES THEOREM: CONTINUOUS CASE
• Evidence is now given by a vector of observations y
• The hypothesis is a vector of unknowns θ
• A probability model M poses a joint distribution [θ, y | M] with density
$h(\theta,\mathbf{y}) = g(\theta)\,f(\mathbf{y}\,|\,\theta) = m(\mathbf{y})\,p(\theta\,|\,\mathbf{y})$
• Assume that both the unknowns and the parameters are continuous-valued
22
43
• Prior density: $g(\theta)$
• Sampling density (the likelihood function when viewed as a function of θ): $f(\mathbf{y}\,|\,\theta)$
• Marginal density of the observations:
$m(\mathbf{y}) = \int h(\theta,\mathbf{y})\,d\theta = \int f(\mathbf{y}\,|\,\theta)\,g(\theta)\,d\theta = E_\theta\left[f(\mathbf{y}\,|\,\theta)\right]$
• Posterior density:
$p(\theta\,|\,\mathbf{y}) = \dfrac{g(\theta)\,f(\mathbf{y}\,|\,\theta)}{m(\mathbf{y})} \propto g(\theta)\,f(\mathbf{y}\,|\,\theta)$
equivalently
$p(\theta\,|\,\mathbf{y}) = \dfrac{g(\theta)\,L(\theta\,|\,\mathbf{y})}{\int g(\theta)\,L(\theta\,|\,\mathbf{y})\,d\theta} \propto g(\theta)\,L(\theta\,|\,\mathbf{y})$
∝ is the "Pac-Man operator". The prior must be proper for the integral (marginal density) to exist.
44
BAYES THEOREM IN A NUTSHELL
$\underbrace{p(\theta\,|\,\mathbf{y})}_{\text{posterior density}} = \dfrac{\overbrace{g(\theta)}^{\text{prior density}}\;\overbrace{f(\mathbf{y}\,|\,\theta)}^{\text{likelihood function}}}{\underbrace{m(\mathbf{y})}_{\text{marginal data density}}} \propto g(\theta)\,f(\mathbf{y}\,|\,\theta)$
23
45
POSTERIOR AS A STANDARDIZED LIKELIHOOD
$p(\theta\,|\,\mathbf{y}) = \dfrac{p(\mathbf{y}\,|\,\theta)\,p(\theta)}{p(\mathbf{y})} = \dfrac{k\,l(\theta\,|\,\mathbf{y})\,p(\theta)}{p(\mathbf{y})}$  (k: a constant not involving the parameters; l: the likelihood)
If the prior is flat: $p(\theta\,|\,\mathbf{y}) \propto l(\theta\,|\,\mathbf{y})$
If the likelihood is integrable: $p(\theta\,|\,\mathbf{y}) = \dfrac{l(\theta\,|\,\mathbf{y})}{\int l(\theta\,|\,\mathbf{y})\,d\theta}$
46
EXAMPLE: CONTINUOUS PROBLEM — INFERRING THE MEAN OF A NORMAL DISTRIBUTION WITH KNOWN VARIANCE
Sampling model: $y_1, y_2, \ldots, y_N \sim NIID(\mu, \sigma^2)$
$p(y_1,\ldots,y_N\,|\,\mu,\sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\mu)^2\right)$
$= \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^N (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)$  ← the second term in the exponent is the part conferring likelihood to μ
Maximum likelihood estimator of μ: $\hat\mu = \bar y$
Frequentist distribution of the ML estimator: $\hat\mu \sim N\left(\mu, \frac{\sigma^2}{N}\right)$  (DISCUSS)
24
47
Normal distribution with known variance: continued
Flat prior: $p(\mu) \propto \text{constant}$, so
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) \propto p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu) \propto \prod_{i=1}^N \exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)$
$\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i-\bar y)^2\right)\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right) \propto \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)$  ← kernel of a normal density
Normalizing,
$p(\mu\,|\,\mathbf{y},\sigma^2) = \dfrac{\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\int \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu}$
so that
$\mu\,|\,\mathbf{y},\sigma^2 \sim N\left(\bar y, \frac{\sigma^2}{N}\right)$
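Not part of the original slides: a minimal R sketch that checks the result above by brute force. With a flat prior, the posterior of μ evaluated on a grid should match the analytical N(ȳ, σ²/N); all data values here are made up.

# Minimal sketch (illustrative only): grid posterior of mu under a flat prior
set.seed(1)
N <- 25; sigma2 <- 4
y <- rnorm(N, 2, sqrt(sigma2))                 # simulated data
ybar <- mean(y)
mu.grid <- seq(ybar - 3, ybar + 3, length.out = 1000)
logpost <- sapply(mu.grid, function(m) sum(dnorm(y, m, sqrt(sigma2), log = TRUE)))
post <- exp(logpost - max(logpost))
post <- post / (sum(post) * diff(mu.grid)[1])  # numerical normalization
plot(mu.grid, post, type = "l", xlab = "mu", ylab = "density")
curve(dnorm(x, ybar, sqrt(sigma2 / N)), add = TRUE, lty = 2)  # analytical posterior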
48
Bayesian treatment: uniform prior for μ (all values in the (a,b) range are equally plausible, a priori):
$p(\mu) = \dfrac{1}{b-a}, \qquad a < \mu < b$
Posterior density of μ (σ² known):
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu)}{\int_a^b p(y_1,\ldots,y_N\,|\,\mu,\sigma^2)\,p(\mu)\,d\mu} = \dfrac{\left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}}{\int_a^b \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i (y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}\,d\mu}$
Note: only values in (a,b) are allowed.
25
49
Doing the algebra (hard way):
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\int_a^b \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu}$
with
$\int_a^b \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu = \sqrt{2\pi\frac{\sigma^2}{N}}\int_a^b \left(2\pi\frac{\sigma^2}{N}\right)^{-1/2}\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)d\mu = \sqrt{2\pi\frac{\sigma^2}{N}}\left[\Phi\left(\frac{b-\bar y}{\sqrt{\sigma^2/N}}\right) - \Phi\left(\frac{a-\bar y}{\sqrt{\sigma^2/N}}\right)\right]$
50
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\left(2\pi\frac{\sigma^2}{N}\right)^{-1/2}\exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)}{\Phi\left(\frac{b-\bar y}{\sqrt{\sigma^2/N}}\right) - \Phi\left(\frac{a-\bar y}{\sqrt{\sigma^2/N}}\right)}$
The posterior distribution is truncated normal:
$\mu\,|\,\mathbf{y},\sigma^2 \sim TN_{(a,b)}\left(\bar y, \frac{\sigma^2}{N}\right)$
26
51
Doing the algebra with Pac-Man (easy way): start all over from
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) = \dfrac{\left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i(y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}}{\int_a^b \left(\frac{1}{2\pi\sigma^2}\right)^{N/2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_i(y_i-\bar y)^2 + N(\mu-\bar y)^2\right]\right)\frac{1}{b-a}\,d\mu}$
Pac-Man (∝) is allowed to eat anything not depending on μ, i.e., eats all symbols or functions to the right of the "|" bar:
- it can eat the denominator, since μ is integrated out;
- it can eat several things in the numerator, leading to
$p(\mu\,|\,y_1,\ldots,y_N,\sigma^2) \propto \exp\left(-\frac{N}{2\sigma^2}(\mu-\bar y)^2\right)$
Since only values in (a,b) are allowed, the posterior is TN.
52
Side note: independence versus conditional independence
Random cluster (e.g., family) effect model:
$y_{ij} = \mu + s_i + e_{ij}$, with $s_i \sim NIID(0,\sigma_s^2)$, $e_{ij} \sim NIID(0,\sigma_e^2)$, and $s_i, e_{ij}$ independent for all i, j
Marginally, $y_{ij} \sim N(\mu, \sigma_s^2 + \sigma_e^2)$ and $Cov(y_{ij}, y_{ij'}) = \sigma_s^2$:
$\begin{pmatrix} y_{ij} \\ y_{ij'} \end{pmatrix} \sim N_2\left(\begin{pmatrix}\mu\\\mu\end{pmatrix}, \begin{pmatrix}\sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
Pairs of members of the same cluster have a bivariate normal distribution: observations from the same cluster are NOT independent.
27
53
$y_{ij} = \mu + s_i + e_{ij}$, with $s_i \sim NIID(0,\sigma_s^2)$, $e_{ij} \sim NIID(0,\sigma_e^2)$, independent for all i, j
$E(y_{ij}\,|\,s_i) = \mu + s_i$, $\qquad Var(y_{ij}\,|\,s_i) = \sigma_e^2$
$Cov(y_{ij}, y_{ij'}\,|\,s_i) = Cov(\mu+s_i+e_{ij},\ \mu+s_i+e_{ij'}\,|\,s_i) = Cov(e_{ij}, e_{ij'}) = 0$
Conditional distribution:
$p(y_{ij}, y_{ij'}\,|\,s_i) = p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i) = N(\mu+s_i,\sigma_e^2)\,N(\mu+s_i,\sigma_e^2)$
GIVEN THE CLUSTER EFFECT, OBSERVATIONS IN THE SAME CLUSTER ARE CONDITIONALLY INDEPENDENT!
54
Probability model of the observations and of the unobserved cluster effect:
$p(y_{ij}, y_{ij'}, s_i) = p(y_{ij}, y_{ij'}\,|\,s_i)\,p(s_i) = p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i)\,p(s_i)$
Process called "deconditioning":
$p(y_{ij}, y_{ij'}) = \int_{-\infty}^{\infty} p(y_{ij}\,|\,s_i)\,p(y_{ij'}\,|\,s_i)\,p(s_i)\,ds_i = \int_{-\infty}^{\infty} N(\mu+s_i,\sigma_e^2)\,N(\mu+s_i,\sigma_e^2)\,N(0,\sigma_s^2)\,ds_i = N_2\left(\begin{pmatrix}\mu\\\mu\end{pmatrix},\begin{pmatrix}\sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
MARGINAL DISTRIBUTIONS ARE OFTEN MORE COMPLEX
28
55
Conditional posterior distribution of the unobserved cluster effect, given the parameters
$y_{ij} = \mu + s_i + e_{ij}$; joint distribution for $n_i = 2$:
$\begin{pmatrix} s_i \\ y_{i1} \\ y_{i2}\end{pmatrix} \sim N\left(\begin{pmatrix}0\\\mu\\\mu\end{pmatrix}, \begin{pmatrix}\sigma_s^2 & \sigma_s^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2+\sigma_e^2 & \sigma_s^2\\ \sigma_s^2 & \sigma_s^2 & \sigma_s^2+\sigma_e^2\end{pmatrix}\right)$
The conditional posterior distribution can be shown to be
$s_i\,|\,y_{i1}, y_{i2} \sim N(\hat s, v_s)$
The $N(0,\sigma_s^2)$ prior can be viewed as a "population" or uncertainty distribution of the parameters.
56
With $n_i = 2$ records:
$\hat s = E(s_i\,|\,y_{i1}, y_{i2}) = \frac{2}{2 + \frac{\sigma_e^2}{\sigma_s^2}}(\bar y_i - \mu) = \frac{2}{2 + \frac{4-h^2}{h^2}}(\bar y_i - \mu), \qquad v_s = \frac{\sigma_e^2}{2 + \frac{\sigma_e^2}{\sigma_s^2}}$
(for a sire model, $\frac{\sigma_e^2}{\sigma_s^2} = \frac{4-h^2}{h^2}$)
In general, for n records:
$\hat s = \frac{\frac{n}{\sigma_e^2}}{\frac{1}{\sigma_s^2} + \frac{n}{\sigma_e^2}}(\bar y_i - \mu) = \frac{n}{n + \frac{\sigma_e^2}{\sigma_s^2}}(\bar y_i - \mu), \qquad v_s = \left(\frac{1}{\sigma_s^2} + \frac{n}{\sigma_e^2}\right)^{-1} = \frac{\sigma_s^2\sigma_e^2}{\sigma_e^2 + n\sigma_s^2}$
A weighted average of the data and of the "prior" (mean 0).
29
57
Suppose heritability is 0.1, $\sigma_e^2 = 1$, $\mu = 2.5$, and there are 2 unrelated sires with $n_1 = 4$, $\bar y_1 = 3$ and $n_2 = 8$, $\bar y_2 = 2.95$:
$\hat s_1 = \dfrac{4}{4 + \frac{4-0.1}{0.1}}(3 - 2.5) = 4.65116279070 \times 10^{-2}$
$\hat s_2 = \dfrac{8}{8 + \frac{4-0.1}{0.1}}(2.95 - 2.5) = 7.65957446809 \times 10^{-2}$
$v_1 = \dfrac{1}{4 + \frac{4-0.1}{0.1}} = 2.32558139535 \times 10^{-2}$
$v_2 = \dfrac{1}{8 + \frac{4-0.1}{0.1}} = 2.12765957447 \times 10^{-2}$
Progeny of sire 1 have better performance. Sire 2 has the higher posterior mean (EBV). Sire 2 has more "information" (less shrinkage).
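Not in the original: a quick R check of the two sires' numbers under the formulas above, with $\lambda = (4-h^2)/h^2 = 39$ and $\sigma_e^2 = 1$.

# Minimal sketch: verifies the shrinkage computations of this slide
h2 <- 0.1; mu <- 2.5
lambda <- (4 - h2) / h2                                   # sigma_e^2 / sigma_s^2 = 39
shat <- function(n, ybar) n / (n + lambda) * (ybar - mu)  # posterior mean
v <- function(n) 1 / (n + lambda)                         # posterior variance (sigma_e^2 = 1)
shat(4, 3.00)    # 0.04651163 (sire 1)
shat(8, 2.95)    # 0.07659574 (sire 2)
v(4); v(8)       # 0.02325581; 0.02127660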
58
What is the strength of the evidence that sire 2 is better than sire 1?
1) Consider the posterior distribution of sire1 − sire2. Because the sires are unrelated and the parameters are known, the 2 sires have independent conditional posterior distributions:
$s_1 - s_2\,|\,\bar y_1, \bar y_2 \sim N(\hat s_1 - \hat s_2,\ v_1 + v_2)$
$s_1 - s_2\,|\,\bar y_1 = 3, \bar y_2 = 2.95 \sim N(-3.00841167739 \times 10^{-2},\ 4.45324096982 \times 10^{-2})$
[Figure: posterior density of s1 − s2, roughly symmetric about a value slightly below 0]
Not enough difference to state that the 2 sires differ at all.
30
59
2) Consider posterior distribution of sire 1/sire 2 . If sires not differentthis posterior should be centered at 1.
Problem!!! The distribution cannot be found analytically
s1 /s2 | y 1 , y 2 ??????
Solution: s1 |y 1 Ns1 ,
v 1
s2 |y 2 Ns2 ,v 2 are independent distributions. Hence, we can sample the two randomvariable and form draws of
s1/s2 | y 1 , y 2
60
Sampling from the posterior distributions using R:
> s1<-rnorm(5000,0.0465,sqrt(0.0232))
> s2<-rnorm(5000,0.0766,sqrt(0.02128))
> d<-s1-s2
> r<-s1/s2
> plot(density(s1),main="s1")
> plot(density(s2),main="s2")
> plot(density(d),main="d")
> plot(density(r),main="r",xlim=c(-100,100))
[Figures: estimated posterior densities of s1 (N = 5000, bandwidth 0.025) and s2 (N = 5000, bandwidth 0.024)]
31
61
[Figures: estimated posterior densities of d = s1 − s2 (N = 5000, bandwidth 0.0345) and r = s1/s2 (N = 5000, bandwidth 0.2401)]
> summary(d)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.99100 -0.16390 -0.02539 -0.02357 0.11820 0.68890
> mean(d)
[1] -0.02357277
> var(d)
[1] 0.04484314
> summary(r)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1339.0000 -0.7747 0.1940 3.6470 1.1890 14820.0000
> mean(r)
[1] 3.647075
> var(r)
[1] 46015.02
MUST BE CAREFUL HERE… (the ratio of two normals has extremely heavy tails when the denominator can cross zero, so Monte Carlo estimates of its mean and variance are unstable)
62
ILLUSTRATION THAT WE CAN "HIT" THE TRUE POSTERIOR DISTRIBUTION OF d BY TAKING MORE SAMPLES
True distribution: $s_1 - s_2\,|\,\bar y_1, \bar y_2 \sim N(-3.00841167739 \times 10^{-2},\ 4.45324096982 \times 10^{-2})$
1) #Samples = 5,000: mean(d) = -0.02357277, var(d) = 0.04484314
2) #Samples = 20,000: mean(d) = -0.03136750, var(d) = 0.04511715
3) #Samples = 200,000: mean(d) = -0.03037269, var(d) = 0.04457762
4) #Samples = 1,000,000: mean(d) = -0.02993406, var(d) = 0.04450943
32
63
EXAMPLE: A SIMPLE BAYESIAN SURVIVAL ANALYSIS MODEL
64
BASICS
• Non-negative random variable T
• Typically: time-to-event (death, onset of disease, successful fertilization, failure of a component)
• Censoring (right) can occur. For n individuals (i = 1, 2, …, n) observe
$y_i = \min(t_i, v_i)$, where $t_i$ is the "true" failure time and $v_i$ the "censoring point", and
$c_i = \begin{cases} 1 & \text{if } t_i \le v_i \text{ (uncensored observation)}\\ 0 & \text{if } t_i > v_i \text{ (censored)}\end{cases}$
33
65
Distribution function of failure time: $F(t) = \Pr(T \le t) = \int_0^t f(u)\,du$
Survivor function: $S(t) = 1 - F(t) = \Pr(T > t)$
Hazard function: $h(t) = \dfrac{f(t)}{S(t)}$, so that $f(t) = h(t)\,S(t)$
$\dfrac{d}{dt}\log S(t) = \dfrac{1}{S(t)}\dfrac{d}{dt}S(t) = -\dfrac{f(t)}{S(t)} = -h(t)$
66
$h(t) = -\dfrac{d}{dt}\log S(t) \;\Rightarrow\; \int_0^t h(u)\,du = -\log S(t)$  (integrated hazard)
$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$
Cumulative hazard: $H(t) = \int_0^t h(u)\,du$
Representation of the density of failure times: $f(t) = h(t)\exp\left(-\int_0^t h(u)\,du\right)$
34
67
Parametric exponential model: homogeneous population
Exponential density: $f(y_i\,|\,\lambda) = \lambda\exp(-\lambda y_i)$
Survival function: $S(y_i\,|\,\lambda) = 1 - \int_0^{y_i} \lambda\exp(-\lambda u)\,du = \exp(-\lambda y_i)$
Likelihood (assuming conditional independence):
$L(\lambda\,|\,\mathbf{y}) = \prod_{i=1}^n \left[\lambda\exp(-\lambda y_i)\right]^{c_i}\left[\exp(-\lambda y_i)\right]^{1-c_i} = \lambda^{\sum_{i=1}^n c_i}\exp\left(-\lambda\sum_{i=1}^n y_i\right)$
Note that the hazard (ratio between f and S) is constant: $h(t) = \lambda$.
68
Put a Gamma prior on λ:
$p(\lambda\,|\,\alpha_0,\beta_0) \propto \lambda^{\alpha_0-1}\exp(-\beta_0\lambda)$
Posterior of λ:
$p(\lambda\,|\,\mathbf{y},\alpha_0,\beta_0) \propto \lambda^{\sum_i c_i}\exp\left(-\lambda\sum_i y_i\right)\lambda^{\alpha_0-1}\exp(-\beta_0\lambda) = \lambda^{\sum_i c_i + \alpha_0 - 1}\exp\left(-\lambda\left[\sum_i y_i + \beta_0\right]\right)$
The posterior distribution is Gamma, with parameters
$\alpha^* = \alpha_0 + \sum_{i=1}^n c_i, \qquad \beta^* = \beta_0 + \sum_{i=1}^n y_i$
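Not in the original slides: a minimal R sketch of this conjugate analysis with simulated censored data (all constants are made up), using the shape-rate parameterization Gamma(α₀, β₀).

# Minimal sketch: Gamma prior -> Gamma posterior for the exponential model
set.seed(2)
n <- 100
t.true <- rexp(n, 0.2)                          # true failure times
v <- rexp(n, 0.1)                               # censoring points
y <- pmin(t.true, v)                            # observed times
cc <- as.numeric(t.true <= v)                   # censoring indicators
a0 <- 1; b0 <- 1                                # prior Gamma(shape a0, rate b0)
a.star <- a0 + sum(cc)                          # posterior shape
b.star <- b0 + sum(y)                           # posterior rate
lambda <- rgamma(10000, a.star, rate = b.star)  # posterior draws of lambda
mean(lambda); quantile(lambda, c(.025, .975))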
35
69
Joint, Conditional and Marginal Posterior Distributions
• Put $\theta = (\theta_1', \theta_2')'$, representing distinct features of models (e.g., means and variances)
• Then, elicit a joint prior density
$g(\theta_1,\theta_2) = g(\theta_1\,|\,\theta_2)\,g(\theta_2) = g(\theta_2\,|\,\theta_1)\,g(\theta_1)$
where $g(\theta_1)$ is a marginal prior and $g(\theta_2\,|\,\theta_1)$ a conditional prior
• The joint posterior density is
$p(\theta_1,\theta_2\,|\,\mathbf{y}) = \dfrac{L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)}{\iint L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)\,d\theta_1\,d\theta_2} \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,g(\theta_1,\theta_2)$
• Must decide which is the object of inference
• Joint, conditional or marginal posterior probability statements?
70
Marginal posterior densities
• Obtained directly from probability calculus as:
$p(\theta_1\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_2, \qquad p(\theta_2\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_1$
• Additional marginalization may be needed if $\theta_1 = (\theta_{1A}', \theta_{1B}')'$:
$p(\theta_{1A}\,|\,\mathbf{y}) = \iint p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_{1B}\,d\theta_2 = \int p(\theta_1\,|\,\mathbf{y})\,d\theta_{1B}$
36
71
Marginal posteriors are mixtures
• Let $\theta_2$ be a "nuisance" parameter. Then
$p(\theta_1\,|\,\mathbf{y}) = \int p(\theta_1,\theta_2\,|\,\mathbf{y})\,d\theta_2 = \int p(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_2\,|\,\mathbf{y})\,d\theta_2 = E_{\theta_2|\mathbf{y}}\left[p(\theta_1\,|\,\theta_2,\mathbf{y})\right]$
• Distribution $[\theta_1\,|\,\theta_2,\mathbf{y}]$ (conditional posterior): describes uncertainty when the nuisance parameter is known
• Distribution $[\theta_2\,|\,\mathbf{y}]$ (marginal posterior of nuisance): uncertainty about the nuisance parameter
• Distribution $[\theta_1\,|\,\mathbf{y}]$ (marginal posterior of primary parameter): uncertainty about the primary parameter
72
Conditional posterior distributions
• By definition of conditional density:
$p(\theta_1\,|\,\theta_2,\mathbf{y}) = \dfrac{p(\theta_1,\theta_2\,|\,\mathbf{y})}{p(\theta_2\,|\,\mathbf{y})}$
• Here, one is interested in variation about $\theta_1$ only:
$p(\theta_1\,|\,\theta_2,\mathbf{y}) \propto p(\theta_1,\theta_2\,|\,\mathbf{y}) \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,p(\theta_1,\theta_2) \propto L(\theta_1,\theta_2\,|\,\mathbf{y})\,p(\theta_1\,|\,\theta_2) = L(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_1\,|\,\theta_2)$
• Identifying conditional posterior distributions is important for implementing MCMC methods (sampling from posteriors)
37
73
Example of a 2-parameter problem: mean and variance of a normal distribution
Independent priors; joint posterior ∝ likelihood × prior
[equations shown on slide not extracted]
74
Marginal posterior density of variance (integrate mean out)
Marginal posterior density of mean (integrate variance out)
Cannot be written in closed form.
38
75
Assume c = 0 and d = ∞ (the positive part of the real line is the prior for the variance). Then, after adding and subtracting $\bar y$, one obtains the kernel of a t-distribution truncated in (a,b) on n−3 degrees of freedom.
76
The marginal posterior distribution of the mean is then as above. If a = −∞ and b = ∞, the marginal posterior is univariate-t on n−3 degrees of freedom.
CAN VERIFY:
$p(\mu,\sigma^2\,|\,DATA,a,b,c,d) \ne p(\mu\,|\,DATA,a,b,c,d)\;p(\sigma^2\,|\,DATA,a,b,c,d)$
TYPICALLY, PARAMETERS ARE NOT INDEPENDENT A POSTERIORI, EVEN IF SO A PRIORI
39
77
Analysis 2: improper uniform priors
$p(\mu,\sigma^2\,|\,\mathbf{y}) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\right) = (\sigma^2)^{-n/2}\exp\left(-\frac{\sum_{i=1}^n (y_i-\bar y)^2 + n(\mu-\bar y)^2}{2\sigma^2}\right)$
The kernel is as before, but the parameters are unbounded, except for the parameter space of the variance.
78
$p(\sigma^2\,|\,\mathbf{y}) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_i (y_i-\bar y)^2\right)\int_{-\infty}^{\infty}\frac{\sqrt{2\pi\sigma^2/n}}{\sqrt{2\pi\sigma^2/n}}\exp\left(-\frac{n(\mu-\bar y)^2}{2\sigma^2}\right)d\mu$
$\propto (\sigma^2)^{-\frac{n-1}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_i (y_i-\bar y)^2\right) = (\sigma^2)^{-\left(\frac{n-3}{2}+1\right)}\exp\left(-\frac{S^2(n-3)}{2\sigma^2}\right)$
where
$S^2 = \dfrac{\sum_{i=1}^n (y_i-\bar y)^2}{n-3}$
The posterior is scaled-inverse chi-squared:
$\sigma^2\,|\,\mathbf{y} \sim (n-3)\,S^2\,\chi^{-2}_{(n-3)}$
40
79
> nu<-20
> S2<-5
> chisqunu<-rchisq(20000,20)
> sinvchi2<-nu*S2/chisqunu
> summary(sinvchi2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.734 4.200 5.172 5.564 6.493 23.450
> plot(density(sinvchi2),main="Scale inv. chi-square, df=20, Scale=5")
[Figure: density, N = 20000, bandwidth 0.2124]
> nu<-4.00001
> S2<-5
> chisqunu<-rchisq(20000,4.00001)
> sinvchi2<-nu*S2/chisqunu
> summary(sinvchi2)
> plot(density(sinvchi2),main="Scale inv. chi-square, df=4.00001, Scale=5")
[Figure: density, N = 20000, bandwidth 0.6285; much heavier right tail with df near 4]
80
Using the same integration technique as before:
$p(\mu\,|\,\mathbf{y}) = \dfrac{\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}}{\int_{-\infty}^{\infty}\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}d\mu}$
The posterior is univariate-t on n−3 degrees of freedom, and the integration constant is
$\int_{-\infty}^{\infty}\left[1 + \frac{(\mu-\bar y)^2}{(n-3)S^2/n}\right]^{-\frac{(n-3)+1}{2}}d\mu = \dfrac{\Gamma\left(\frac{n-3}{2}\right)\sqrt{(n-3)\pi S^2/n}}{\Gamma\left(\frac{(n-3)+1}{2}\right)}$
41
81
Side note on the t-distribution
$t = \dfrac{z}{\sqrt{\chi^2_\nu/\nu}}$, with $z \sim N(0,1)$ and ν the "degrees of freedom"
$E(t) = 0, \qquad Var(t) = \dfrac{\nu}{\nu-2}$
$f(t) = \dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
Model with t-distributed errors: $y = \mu + St$, with S a "scale" parameter
$E(y) = \mu, \qquad Var(y) = S^2\dfrac{\nu}{\nu-2}, \qquad t = \dfrac{y-\mu}{S}, \qquad \dfrac{dt}{dy} = \dfrac{1}{S}$
$f(y) = \dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{S\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{(y-\mu)^2}{\nu S^2}\right)^{-\frac{\nu+1}{2}}$
82
Simulate a t-distribution with mean 10, scale² 5 and 3 d.f. The variance is 5 × 3/(3−2) = 15. We will compare with a normal distribution with mean 10 and variance 15.
> m<-10
> df<-3
> S<-sqrt(5)
> z<-rnorm(50000,0,1)
> chisq<-rchisq(50000,3)
> y<-10+S*z/sqrt(chisq/3)
> mean(z)
[1] 0.0008953519
> var(z)
[1] 1.009415
> mean(chisq)
[1] 2.978861
> var(chisq)
[1] 5.896806
> mean(y)
[1] 9.999967
> var(y)
[1] 13.83878
42
83
[Figures: density of w with a normal distribution (mean = 10, var = 15) and of y with a t-distribution (mean = 10, df = 3, var = 15); the t-distribution accommodates more extreme values]
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-61.440 8.266 10.020 10.000 11.730 100.400
> summary(w)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.509 7.436 10.010 10.020 12.620 25.300
Student-t vs Normal: more extreme values are accommodated by the t.
84
Predictive Distributions
• Let $\mathbf{y}_f$ = unobserved vector of "future" or "missing" data (truncated, censored, missing data, future data: data augmentation!). Then:
1) $p(\theta,\mathbf{y},\mathbf{y}_f) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta,\mathbf{y}) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})\,p(\mathbf{y})$
2) $p(\theta,\mathbf{y}_f\,|\,\mathbf{y}) = p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})$
3) $p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y})\,d\theta = E_{\theta|\mathbf{y}}\left[p(\mathbf{y}_f\,|\,\theta,\mathbf{y})\right]$
4) If, given the parameters, data are conditionally independent:
$p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta)\,p(\theta\,|\,\mathbf{y})\,d\theta$  (usual representation of the posterior predictive density)
Also, $p(\theta,\mathbf{y}) = \int p(\theta,\mathbf{y},\mathbf{y}_f)\,d\mathbf{y}_f$: use for posterior predictive checks.
43
85
A model for binary data
$y_i \sim \text{Bernoulli}(\theta)$, with θ the probability of success
Assuming conditional independence (x = number of successes): $p(y_1,\ldots,y_N\,|\,\theta) = \theta^x(1-\theta)^{N-x}$
Beta prior: $p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$
Posterior: $p(\theta\,|\,\mathbf{y}) \propto \theta^x(1-\theta)^{N-x}\theta^{a-1}(1-\theta)^{b-1} = \theta^{x+a-1}(1-\theta)^{N-x+b-1}$, so $\theta\,|\,\mathbf{y} \sim Beta(x+a,\ N-x+b)$

              Prior                    Posterior
Distribution  Beta(a, b)               Beta(x+a, N−x+b)
Mean          a/(a+b)                  (x+a)/(N+a+b)
Variance      ab/[(a+b)²(a+b+1)]       (x+a)(N−x+b)/[(N+a+b)²(N+a+b+1)]
86
Binary data: predictive distribution
$p(y_f\,|\,\mathbf{y}) = \int_0^1 \binom{N_f}{x_f}\theta^{x_f}(1-\theta)^{N_f-x_f}\,\frac{\theta^{x+a-1}(1-\theta)^{N-x+b-1}}{B(x+a,\ N-x+b)}\,d\theta$
$= \frac{\binom{N_f}{x_f}}{B(x+a,\ N-x+b)}\int_0^1 \theta^{x_f+x+a-1}(1-\theta)^{N_f+N-x_f-x+b-1}\,d\theta = \binom{N_f}{x_f}\frac{B(x_f+x+a,\ N_f+N-x_f-x+b)}{B(x+a,\ N-x+b)}$
where $x_f$ is the future number of "successes" and $N_f$ the future number of Bernoulli trials: the beta-binomial distribution (discrete).
44
87
Simulating the predictive distribution. Suppose the posterior distribution of the success probability is Beta(30,10). We draw 40,000 samples and plot the posterior:
> #Bernoulli probability is prob~Beta(30,10)
> prob<-rbeta(40000,30,10)
> mean(prob)
[1] 0.7499961
> var(prob)
[1] 0.004579322
[Figure: posterior distribution of success probability, Beta(30,10); N = 40000, bandwidth 0.0073]
88
We construct the predictive distribution via composition sampling. Simulate 40,000 binomial trials with n = 5:
#Simulation by composition
m<-2
r<-40000
theta<-matrix(0,m,r)
theta[1,]<-rbeta(40000,30,10)
for (i in 1:r) {theta[2,i]<-rbinom(1,5,theta[1,i])}
mean(theta[1,])
mean(theta[2,])
histcases<-hist(theta[2,])
plot<-density(theta[2,])
[Figures: histogram and density of theta[2,], the predictive distribution of future successes out of 5 trials]
> mean(theta[1,])
[1] 0.7501773
> mean(theta[2,])
[1] 3.751525
45
89
Exact and estimated posterior densities (most of the time we will not be able to derive the posterior, but may be able to sample from it)
[Figure: true, unknown posterior vs estimated posterior (from samples), over parameter values; a shaded region marks biological importance, Pr(biol. important)]
90
Estimating a posterior expectation and variance from samples
Posterior expectation: $E(\theta\,|\,\mathbf{y}) = \int \theta\,p(\theta\,|\,\mathbf{y})\,d\theta$  (maybe the posterior is unknown, or the integral impossible to compute)
Samples available from $[\theta\,|\,\mathbf{y}]$: $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}$
Estimate the integral as
$\widehat{E(\theta\,|\,\mathbf{y})} = \frac{1}{S}\sum_{i=1}^S \theta^{(i)}$
Monte Carlo error: $\widehat{E(\theta\,|\,\mathbf{y})} - E(\theta\,|\,\mathbf{y}) = \frac{1}{S}\sum_{i=1}^S \theta^{(i)} - E(\theta\,|\,\mathbf{y})$, which goes to 0 as S tends to infinity.
46
91
Monte Carlo variance of the estimate of the posterior mean
Measures the variability to be expected if repeated sampling (each time S samples drawn) is done from the posterior:
$Var_{\theta|\mathbf{y}}(\text{Monte Carlo error}) = Var_{\theta|\mathbf{y}}\left[\widehat{E(\theta|\mathbf{y})} - E(\theta|\mathbf{y})\right] = Var_{\theta|\mathbf{y}}\left[\frac{1}{S}\sum_{i=1}^S \theta^{(i)} - E(\theta|\mathbf{y})\right] = Var_{\theta|\mathbf{y}}\left[\frac{1}{S}\sum_{i=1}^S \theta^{(i)}\right]$
92
$Var(\text{MCE}) = \frac{1}{S^2}\left[\sum_{i=1}^S Var(\theta^{(i)}\,|\,\mathbf{y}) + 2\sum\sum_{i<j} Cov(\theta^{(i)},\theta^{(j)}\,|\,\mathbf{y})\right] = \frac{1}{S^2}\left[S\,Var(\theta\,|\,\mathbf{y}) + 2\,Var(\theta\,|\,\mathbf{y})\sum\sum_{i<j}\rho_{ij}\right] = \frac{Var(\theta\,|\,\mathbf{y})}{S}\left[1 + \frac{2}{S}\sum\sum_{i<j}\rho_{ij}\right]$
The second term is null only if the samples are independent.
IF MARKOV CHAIN MONTE CARLO SAMPLING IS PRACTICED, SAMPLES ARE TYPICALLY SERIALLY CORRELATED.
IMPORTANT TO EVALUATE AUTO-CORRELATIONS IN MCMC, TO ASSESS MONTE CARLO ERROR
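Not in the original: a minimal R sketch of the point above, using an AR(1) chain as a stand-in for serially correlated MCMC output. The naive standard error understates the Monte Carlo error unless corrected by the autocorrelations (effective sample size).

# Minimal sketch: Monte Carlo error with autocorrelated samples
set.seed(3)
S <- 10000; rho <- 0.9
theta <- numeric(S)
for (i in 2:S) theta[i] <- rho * theta[i - 1] + rnorm(1, 0, sqrt(1 - rho^2))
naive.se <- sd(theta) / sqrt(S)                 # ignores serial correlation
rho.k <- acf(theta, lag.max = 50, plot = FALSE)$acf[-1]
ess <- S / (1 + 2 * sum(rho.k))                 # effective sample size
corrected.se <- sd(theta) / sqrt(ess)
c(naive = naive.se, corrected = corrected.se, ESS = ess)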
47
93
POSTERIOR PROBABILITIES
• Joint probabilities:
$\Pr(\theta \in \Theta_0\,|\,\mathbf{y}) = \int_{\Theta_0} p(\theta\,|\,\mathbf{y})\,d\theta$
• Marginal probabilities:
$\Pr(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = \int_{\Theta_1}\int_{\Theta_2} p(\theta_2,\theta_1\,|\,\mathbf{y})\,d\theta_2\,d\theta_1 = \int_{\Theta_1}\int_{\Theta_2} p(\theta_1\,|\,\theta_2,\mathbf{y})\,p(\theta_2\,|\,\mathbf{y})\,d\theta_2\,d\theta_1$
• Equivalently:
$\Pr(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = E_{\theta_2|\mathbf{y}}\left[\Pr(\theta_1 \in \Theta_1\,|\,\theta_2,\mathbf{y})\right]$
94
MONTE CARLO ESTIMATES OF POSTERIOR PROBABILITIES
• Suppose samples $\theta_2^{(1)}, \theta_2^{(2)}, \ldots, \theta_2^{(m)}$ are available from $[\theta_2\,|\,\mathbf{y}]$. Then
$\widehat{\Pr}(\theta_1 \in \Theta_1\,|\,\mathbf{y}) = \frac{1}{m}\sum_{i=1}^m \Pr\left(\theta_1 \in \Theta_1\,|\,\theta_2^{(i)},\mathbf{y}\right)$
• 1) It must be easy to sample from $[\theta_2\,|\,\mathbf{y}]$
• 2) $\Pr(\theta_1 \in \Theta_1\,|\,\theta_2,\mathbf{y})$ must be available in closed form, so it can be evaluated at each draw $\theta_2^{(i)}$
48
95
Estimating the predictive density from posterior samples
$p(\mathbf{y}_f\,|\,\mathbf{y}) = \int p(\mathbf{y}_f\,|\,\theta)\,p(\theta\,|\,\mathbf{y})\,d\theta$
Samples available from $[\theta\,|\,\mathbf{y}]$: $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}$
Density at a point $y_0$:
$\hat p(y_f = y_0\,|\,\mathbf{y}) = \frac{1}{S}\sum_{i=1}^S p(y_f = y_0\,|\,\theta^{(i)})$
Called ergodic averaging: the basis of the Monte Carlo method.
96
METROPOLIS-HASTINGS ALGORITHM
…and derivatives
49
97
1. FORM OF ALGORITHM
Let g(θ) be the posterior or conditional posterior of interest; its integration constant is not needed.
1. Generate a candidate $\theta^*$ from a proposal density $f(\theta^*\,|\,\theta^{(t-1)})$
2. Draw a random number $U \sim \text{Uniform}(0,1)$
3. Compute the ratio
$R = \dfrac{g(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{g(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{c\,p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{c\,p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)}$
4. If $U < \min(R, 1)$, set $\theta^{(t)} = \theta^*$; otherwise $\theta^{(t)} = \theta^{(t-1)}$
Important: the sample is not rejected; the chain value is just repeated.
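Not in the original: a minimal R sketch of the algorithm above for a standard normal target, using a random-walk proposal (so the proposal densities cancel in R). The target and tuning constants are made up.

# Minimal sketch: random-walk Metropolis-Hastings for a N(0,1) target
set.seed(4)
log.g <- function(th) dnorm(th, 0, 1, log = TRUE)  # log of (unnormalized) target
S <- 20000; theta <- numeric(S); theta[1] <- -5    # deliberately bad start
for (t in 2:S) {
  cand <- rnorm(1, theta[t - 1], 1)                # symmetric proposal
  logR <- log.g(cand) - log.g(theta[t - 1])        # proposal terms cancel
  theta[t] <- if (log(runif(1)) < logR) cand else theta[t - 1]  # repeat on rejection
}
keep <- theta[-(1:2000)]                           # discard burn-in
mean(keep); var(keep)                              # approx 0 and 1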
98
2. SPECIAL FORMS: USING THE POSTERIOR AS PROPOSAL
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/f(\theta^*|\theta^{(t-1)})}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)}|\theta^*)} = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/\left[c\,p(\mathbf{y}|\theta^*)\,p(\theta^*)\right]}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/\left[c\,p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})\right]} = 1$
Every draw is accepted. If this were not so, one would have doubts…
50
99
3. SPECIAL FORMS: METROPOLIS ALGORITHM
Take a symmetric (in its arguments) proposal density: $f(\theta^*|\theta^{(t-1)}) = f(\theta^{(t-1)}|\theta^*)$
The acceptance ratio becomes
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})}$
If $U < \min(R,1)$, set $\theta^{(t)} = \theta^*$; otherwise $\theta^{(t)} = \theta^{(t-1)}$.
100
4. SPECIAL FORMS: INDEPENDENCE CHAIN METROPOLIS
$f(\theta^*|\theta^{(t-1)}) = f(\theta^*)$: the proposal density is independent of the current state of the sequence of sampled values
$R = \dfrac{p(\mathbf{y}|\theta^*)\,p(\theta^*)/f(\theta^*)}{p(\mathbf{y}|\theta^{(t-1)})\,p(\theta^{(t-1)})/f(\theta^{(t-1)})}$
The acceptance ratio is calculated as in standard Metropolis-Hastings.
51
101
6. SPECIAL FORMS: GIBBS SAMPLING (notation is not precise…)
Use the fully conditionals as proposal distributions:
$f(\theta_i^*\,|\,\theta_i^{(t)}) = \pi(\theta_i^*\,|\,\theta_{-i}^{(t)}), \qquad f(\theta_i^{(t)}\,|\,\theta_i^*) = \pi(\theta_i^{(t)}\,|\,\theta_{-i}^{(t)})$
By Bayes theorem,
$\pi(\theta_i^*\,|\,\theta_{-i}^{(t)}) = \dfrac{\pi(\theta_i^*, \theta_{-i}^{(t)})}{\pi(\theta_{-i}^{(t)})}, \qquad \pi(\theta_{-i}^{(t)}\,|\,\theta_i^*) = \dfrac{\pi(\theta_i^*, \theta_{-i}^{(t)})}{\pi(\theta_i^*)}$
MH acceptance ratio (recall that g(θ) is a posterior, or conditional posterior, that we do not recognize or know how to sample from):
$R = \dfrac{g(\theta^*)/f(\theta^*\,|\,\theta^{(t-1)})}{g(\theta^{(t-1)})/f(\theta^{(t-1)}\,|\,\theta^*)} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_{-i}^{(t)}\,|\,\theta_i^*)}{\pi(\theta_i^*\,|\,\theta_{-i}^{(t)})} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_i^*,\theta_{-i}^{(t)})/\pi(\theta_i^*)}{\pi(\theta_i^*,\theta_{-i}^{(t)})/\pi(\theta_{-i}^{(t)})} = \dfrac{g(\theta^*)}{g(\theta^{(t-1)})}\,\dfrac{\pi(\theta_{-i}^{(t)})}{\pi(\theta_i^*)} = 1$
GIBBS SAMPLING IS A SPECIAL CASE OF MH: the proposal is always accepted.
(Fix in handout)
102
ILLUSTRATION OF DIRECT, COMPOSITION AND GIBBS SAMPLING
1) Assumptions and basic results
1
2
N0
2,
1 − 34 1 1
2 −0. 375
−0.375 12
2
E1 |2 E1 Cov1 ,2
Var2 2 − E2
1 −0. 37512
22 − 2
4. 0 − 1. 52
Var1 |2 Var11 − 2
1 1 − − 34
2
716
0. 4375
E2 |1 E2 Cov1 ,2
Var1 1 − E1
2 −0.3751
1 − 0
2 − 0.3751
Var2 |1 Var21 − 2
12
2 1 − − 3
4
2
764
0.109 375
Suppose we wish to draw samples from
52
103
2) Direct sampling from the bivariate normal distribution
$\begin{pmatrix}\theta_1\\\theta_2\end{pmatrix} = \begin{pmatrix}0\\2\end{pmatrix} + \text{Cholesky}\left[\begin{pmatrix}1 & -0.375\\ -0.375 & 0.25\end{pmatrix}\right]\begin{pmatrix}z_1\\z_2\end{pmatrix} = \begin{pmatrix}0\\2\end{pmatrix} + \begin{pmatrix}1.0 & 0\\ -0.375 & 0.330718913883\end{pmatrix}\begin{pmatrix}z_1\\z_2\end{pmatrix} = \begin{pmatrix}z_1\\ 2 - 0.375\,z_1 + 0.330718913883\,z_2\end{pmatrix}$
with $z_1, z_2 \sim NIID(0,1)$.
104
> # Simulation of a bivariate normal - Direct
> mu<-matrix(c(0,2),2,1)
> V<-matrix(c(1.0,-0.375,-0.375,0.25),2,2)
> C<-t(chol(V))
> C
[,1] [,2]
[1,] 1.000 0.0000000
[2,] -0.375 0.3307189
> m<-2
> r<-20000
> thetavec<-matrix(0,m,r)
> for (i in 1:r) {z<-matrix(rnorm(m),m,1)
+ thetavec[,i]<-mu+C%*%z}
> mean(thetavec[1,])
[1] 0.003480938
> mean(thetavec[2,])
[1] 1.997486
> v1<-var(thetavec[1,]); v2<-var(thetavec[2,]); v12<-cov(thetavec[1,],thetavec[2,])
> Vsim<-matrix(c(v1,v12,v12,v2),2,2)
> Vsim
[,1] [,2]
[1,] 0.9859114 -0.3697481
[2,] -0.3697481 0.2473953
53
105
COMPOSITION OR CHAIN SAMPLING FROM A JOINT DISTRIBUTION
$\Pr(X_1=x_1,\ldots,X_N=x_N) = \Pr(X_1=x_1)\Pr(X_2=x_2|X_1=x_1)\Pr(X_3=x_3|X_1=x_1,X_2=x_2)\cdots\Pr(X_N=x_N|X_1=x_1,\ldots,X_{N-1}=x_{N-1})$
Then $\mathbf{x}' = (x_1, x_2, \ldots, x_{N-1}, x_N)$ is a realization from the joint distribution above.
106
3) Sampling from the bivariate normal distribution using composition
> m<-2
> r<-20000
> thetavec<-matrix(0,m,r)
> thetavec[1,]<-rnorm(20000,0,1)
> for (i in 1:r) {thetavec[2,i]<-2-0.375*thetavec[1,i]+sqrt(0.109375)*rnorm(1,0,1)}
> mean(thetavec[1,])
[1] -0.007715923
> mean(thetavec[2,])
[1] 2.004965
> Vsim
[,1] [,2]
[1,] 1.0113990 -0.3806295
[2,] -0.3806295 0.2519804
54
107
GIBBS SAMPLING
Want to sample from the joint posterior [A,B,C|DATA].
The sample is [A(j), B(j), C(j) | DATA]; each coordinate is a draw from its marginal posterior: [A(j)|DATA], [B(j)|DATA], [C(j)|DATA].
108
Gibbs sampling works as follows:
1) Form all fully conditional posteriors: [A|B,C,DATA], [B|A,C,DATA], [C|A,B,DATA]
2) Draw and update successively
3) Repeat a number of times without storing samples (burn-in)
4) Collect all subsequent samples, and thin them if needed for storage purposes
55
109
At the end of the process:

j      A      B      C
1      A1     B1     C1
2      A2     B2     C2
…      …      …      …
t      At     Bt     Ct
t+1    At+1   Bt+1   Ct+1
…      …      …      …
t+m    At+m   Bt+m   Ct+m

Discard the first t samples as burn-in; keep the subsequent m samples for posterior analysis.
110
Sampling from the bivariate normal distribution using the Gibbs sampler. We wish to draw samples from
$\begin{pmatrix}\theta_1\\\theta_2\end{pmatrix} \sim N\left(\begin{pmatrix}0\\2\end{pmatrix}, \begin{pmatrix}1 & -0.375\\ -0.375 & 0.25\end{pmatrix}\right)$
using the fully conditionals
$\theta_1\,|\,\theta_2 \sim N(-1.5(\theta_2 - 2),\ 0.4375)$
$\theta_2\,|\,\theta_1 \sim N(2 - 0.375\,\theta_1,\ 0.109375)$
> # Simulation by Gibbs sampling
> L<-15000 #Chain length
> burn<-1000 #Length of burn-in
> thetavec<-matrix(0,L,2) #Contains the chain
> mu1<-0 #True mean of theta 1
> mu2<-2 #True mean of theta 2
> sigma1<-1 #True SD of theta 1
> sigma2<-0.5 #True SD of theta 2
> rho<--0.75 #True correlation
> s1<-sqrt(1-rho^2)*sigma1 #SD of cond. distribution of 1 given 2
> s2<-sqrt(1-rho^2)*sigma2 #SD of cond. distribution of 2 given 1
56
111
###Run the chain####
thetavec[1,]<-c(0,0) #Initialize
for (i in 2:L) { #Open loop
thetasamp2<-thetavec[i-1,2] #Sample of theta2
m1<-mu1+(rho*sigma1/sigma2)*(thetasamp2-mu2)
thetavec[i,1]<-rnorm(1,m1,s1)
thetasamp1<-thetavec[i,1] #Sample of theta1
m2<-mu2+(rho*sigma2/sigma1)*(thetasamp1-mu1)
thetavec[i,2]<-rnorm(1,m2,s2)
} #Close loop
b<-burn+1
theta<-thetavec[b:L,] #post-burn-in samples
112
#Evaluate samples
colMeans(theta)
cov(theta)
cor(theta)
#Plot samples
plot(theta,main="Scatter of bivariate samples", cex=.5, xlab=bquote(thetavec[1]),
ylab=bquote(thetavec[2]),ylim=range(theta[,2]))
> colMeans(theta)
[1] -0.01379416 2.00553885
> cov(theta)
[,1] [,2]
[1,] 0.9910101 -0.3735499
[2,] -0.3735499 0.2506818
> cor(theta)
[,1] [,2]
[1,] 1.0000000 -0.7494595
[2,] -0.7494595 1.0000000
[Figure: scatter of the bivariate samples, matching the target distribution]
57
113
GIBBS SAMPLING IN A BETA-BINOMIAL MODEL: draw samples from the predictive distribution
Likelihood and prior:
$p(y_1,\ldots,y_n\,|\,\theta) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}, \qquad p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$
Posterior (Beta density):
$p(\theta\,|\,\mathbf{y}) \propto \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{\sum_i y_i + a - 1}(1-\theta)^{n - \sum_i y_i + b - 1}$
Joint density of parameter and future observation:
$p(y_f, \theta\,|\,\mathbf{y}) = p(y_f\,|\,\theta,\mathbf{y})\,p(\theta\,|\,\mathbf{y}) = \theta^{y_f}(1-\theta)^{1-y_f}\,\theta^{\sum_i y_i + a - 1}(1-\theta)^{n-\sum_i y_i + b - 1}$
The fully conditionals:
$p(y_f\,|\,\theta,\mathbf{y}) = p(y_f\,|\,\theta)$  (Bernoulli)
$\theta\,|\,y_f,\mathbf{y} \sim Beta\left(y_f + \sum_i y_i + a,\ (1-y_f) + n - \sum_i y_i + b\right)$
• Initialize the parameter
• Sample y_f from a Bernoulli distribution with this parameter
• Sample the parameter from the Beta distribution with parameters as above
• Repeat many times, throw away early draws (burn-in) — as sketched below
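Not in the original: a minimal R implementation of this two-block Gibbs sampler with made-up data; the retained y_f draws estimate the predictive probability, which can be checked against the closed form (Σy + a)/(n + a + b).

# Minimal sketch: Gibbs sampling of (theta, y_f) in the beta-binomial model
set.seed(5)
a <- 1; b <- 1
y <- rbinom(20, 1, 0.7); n <- length(y); sy <- sum(y)
S <- 12000; burn <- 2000
theta <- 0.5; yf <- numeric(S)
for (s in 1:S) {
  yf[s] <- rbinom(1, 1, theta)                                # y_f | theta
  theta <- rbeta(1, yf[s] + sy + a, (1 - yf[s]) + n - sy + b) # theta | y_f, y
}
mean(yf[-(1:burn)])        # Monte Carlo estimate of Pr(y_f = 1 | y)
(sy + a) / (n + a + b)     # closed-form predictive probability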
58
115
EXAMPLE: MH FOR A GLIM (Carlin and Louis, 2000)
Number of flour beetles killed after exposure to carbon disulphide:

Dosage w_i   No. Killed y_i   No. Exposed n_i
1.6907            6                59
1.7242           13                60
1.7552           18                62
1.7842           28                56
1.8113           52                63
1.8369           53                59
1.8610           61                62
1.8839           60                60

Generalized logit model:
$\Pr(\text{death}\,|\,w) = h(w) = \left[\frac{\exp(x)}{1 + \exp(x)}\right]^{m_1}, \qquad x = \frac{w - \mu}{\sigma}$
$w_i$ = dose, $i = 1, 2, \ldots, k$; unknown parameters $(\mu, \sigma^2, m_1)$
Priors:
$m_1 \sim Gamma(a_0, b_0) \propto m_1^{a_0-1}\exp\left(-\frac{m_1}{b_0}\right)$
$\mu \sim N(c_0, d_0)$
$\sigma^2 \sim \text{Inverse Gamma}(e_0, f_0) \propto (\sigma^2)^{-(e_0+1)}\exp\left(-\frac{1}{f_0\sigma^2}\right)$
116
Joint posterior:
$p(\mu,\sigma^2,m_1\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\exp\left(-\frac{(\mu-c_0)^2}{2d_0^2}\right)(\sigma^2)^{-(e_0+1)}\exp\left(-\frac{1}{f_0\sigma^2}\right)m_1^{a_0-1}\exp\left(-\frac{m_1}{b_0}\right)$
$= \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\ \frac{m_1^{a_0-1}}{(\sigma^2)^{e_0+1}}\exp\left(-\frac{(\mu-c_0)^2}{2d_0^2} - \frac{m_1}{b_0} - \frac{1}{f_0\sigma^2}\right)$
The joint posterior is not recognizable… Use Metropolis-Hastings.
59
117
Transform variables, to work on $\mathbb{R}^3$ so that Gaussian proposals can be used:
$\theta_1 = \mu, \qquad \theta_2 = \tfrac{1}{2}\log\sigma^2, \qquad \theta_3 = \log m_1$
so that $\mu = \theta_1$, $\sigma^2 = \exp(2\theta_2)$, $m_1 = \exp(\theta_3)$.
Jacobian:
$J = \begin{pmatrix}\frac{\partial\mu}{\partial\theta_1} & \frac{\partial\mu}{\partial\theta_2} & \frac{\partial\mu}{\partial\theta_3}\\ \frac{\partial\sigma^2}{\partial\theta_1} & \frac{\partial\sigma^2}{\partial\theta_2} & \frac{\partial\sigma^2}{\partial\theta_3}\\ \frac{\partial m_1}{\partial\theta_1} & \frac{\partial m_1}{\partial\theta_2} & \frac{\partial m_1}{\partial\theta_3}\end{pmatrix} = \begin{pmatrix}1 & 0 & 0\\ 0 & 2\exp(2\theta_2) & 0\\ 0 & 0 & \exp(\theta_3)\end{pmatrix} \;\Rightarrow\; |J| = 2\exp(2\theta_2 + \theta_3)$
118
New density = old density (evaluated at the transformed variables) times the Jacobian:
$p(\theta_1,\theta_2,\theta_3\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\ \frac{\exp(\theta_3)^{a_0-1}}{\exp(2\theta_2)^{e_0+1}}\exp\left(-\frac{(\theta_1-c_0)^2}{2d_0^2} - \frac{\exp(\theta_3)}{b_0} - \frac{1}{f_0\exp(2\theta_2)}\right)\exp(2\theta_2+\theta_3)$
Collecting terms:
$p(\theta_1,\theta_2,\theta_3\,|\,\mathbf{y},a_0,b_0,c_0,d_0,e_0,f_0) \propto \prod_{i=1}^k h(w_i)^{y_i}\left[1-h(w_i)\right]^{n_i-y_i}\exp\left(a_0\theta_3 - 2e_0\theta_2\right)\exp\left(-\frac{(\theta_1-c_0)^2}{2d_0^2} - \frac{\exp(\theta_3)}{b_0} - \frac{1}{f_0\exp(2\theta_2)}\right)$
THE POSTERIOR IS NOT RECOGNIZABLE…
60
119
Hyper-parameters: a0 = 0.25, b0 = 4, c0 = 2, d0 = 10, e0 = 2.000004, f0 = 1000
1) Metropolis-Hastings proposal distribution used:
$\begin{pmatrix}\theta_1^*\\\theta_2^*\\\theta_3^*\end{pmatrix} \sim N\left(\begin{pmatrix}\theta_1^{(t-1)}\\\theta_2^{(t-1)}\\\theta_3^{(t-1)}\end{pmatrix},\ D = \begin{pmatrix}0.00012 & 0 & 0\\ 0 & 0.033 & 0\\ 0 & 0 & 0.10\end{pmatrix}\right)$
Since
$f(\theta^*\,|\,\theta^{(t-1)}) = \frac{1}{(2\pi)^{3/2}|D|^{1/2}}\exp\left(-\tfrac{1}{2}(\theta^*-\theta^{(t-1)})'D^{-1}(\theta^*-\theta^{(t-1)})\right) = f(\theta^{(t-1)}\,|\,\theta^*)$
the proposal is symmetric, and
$R = \dfrac{p(\mathbf{y}\,|\,\theta^*)\,p(\theta^*)}{p(\mathbf{y}\,|\,\theta^{(t-1)})\,p(\theta^{(t-1)})}$
120
Numerical stability is improved by computing the acceptance ratio on the log scale:
$\log R = \sum_{i=1}^k\left[y_i\log\frac{h^*(w_i)}{h^{(t-1)}(w_i)} + (n_i-y_i)\log\frac{1-h^*(w_i)}{1-h^{(t-1)}(w_i)}\right] + a_0\left(\theta_3^* - \theta_3^{(t-1)}\right) - 2e_0\left(\theta_2^* - \theta_2^{(t-1)}\right) - \frac{\left(\theta_1^*-c_0\right)^2 - \left(\theta_1^{(t-1)}-c_0\right)^2}{2d_0^2} - \frac{\exp(\theta_3^*) - \exp(\theta_3^{(t-1)})}{b_0} + \frac{1}{f_0}\left[\frac{1}{\exp(2\theta_2^{(t-1)})} - \frac{1}{\exp(2\theta_2^*)}\right]$
$R = \exp(\log R)$
61
121
Three parallel chains were run, each with 10,000 iterations; burn-in = 2,000 in each chain. Histograms are based on the (10,000 − 2,000) × 3 = 24,000 sampled values. Autocorrelations and inter-correlations were estimated from chain 2.
Chains mix slowly (13.5% acceptance rate). High correlations between parameters, so it makes sense to explore a different proposal:
$\begin{pmatrix}1 & -0.78 & -0.94\\ -0.78 & 1 & 0.89\\ -0.94 & 0.89 & 1\end{pmatrix}$
122
2) Second Metropolis-Hastings proposal distribution: from the first algorithm, estimate the posterior covariance matrix as
$\hat\Sigma = \frac{1}{m}\sum_{j=1}^m (\theta_j - \bar\theta)(\theta_j - \bar\theta)'$
Use a Gaussian proposal with covariance matrix $2\hat\Sigma$ (gave acceptance rate 27.3%), where
$\hat\Sigma = \begin{pmatrix}0.000292 & -0.003546 & -0.007856\\ -0.003546 & 0.074733 & 0.117809\\ -0.007856 & 0.117809 & 0.241551\end{pmatrix}$
62
123
If the fully conditionals are not recognizable, use other sampling methods:
• Metropolis
• Metropolis-Hastings
• Acceptance/rejection sampling
• Importance sampling
These methods require an auxiliary distribution, which must be tuned, and calculation of an "acceptance probability".
1
Dealing with epistatic interactions and non-linearities
gene x gene
gene x gene x gene
gene x gene x gene x gene
…
(Alice in Wonderland)
Statistical Interaction (fixed effects models)
$y_{ijk} = \mu + A_i + B_j + (AB)_{ij} + e_{ijk}$
$E(y_{ijk}\,|\,A_i, B_j, (AB)_{ij}) = \mu + A_i + B_j + (AB)_{ij}$
$E(y_{ijk} - y_{i'jk'}\,|\,A_i, B_j, (AB)_{ij}, A_{i'}, B_j, (AB)_{i'j}) = \left[A_i + B_j + (AB)_{ij}\right] - \left[A_{i'} + B_j + (AB)_{i'j}\right] = (A_i - A_{i'}) + \left[(AB)_{ij} - (AB)_{i'j}\right]$
The difference between levels of factor A depends on the level of B.
If factor A has a levels and factor B has b levels, the degrees of freedom are: (a−1), (b−1), and (a−1)(b−1) [assuming no empty cells].
2
Fixed effects models (unravelling "physiological epistasis" a la Cheverud?)
• Lots of "main effects"
• Splendid non-orthogonality
• Lots of 2-factor interactions
• Lots of 3-factor interactions
• Lots of non-estimability
• Lots of uninterpretable high-order interactions
• Run out of "degrees of freedom"
Analysis with random effects models?
MEUWISSEN et al. (2001); GIANOLA et al. (2003); XU (2003)
-- Use all SNP markers in statistical models
-- Mechanistic basis to the mixed effects linear model (genetic effects treated as random variables)
-- Highly parametric models
-- Strong assumptions made
"Ridge regression-type" — will talk about this later
3
What is ridge regression?
$\hat\beta_{OLS} = (X'X)^{-1}X'\mathbf{y}$
$\hat\beta_{RIDGE} = (X'X + \lambda I)^{-1}X'\mathbf{y}$
$\hat\beta_{BAYES} = \left(X'X + B^{-1}\frac{\sigma_e^2}{\sigma_\beta^2}\right)^{-1}\left(X'\mathbf{y} + \frac{\sigma_e^2}{\sigma_\beta^2}B^{-1}\beta_0\right)$
The Bayes model assumes, a priori, $\beta \sim N(\beta_0, B\sigma_\beta^2)$; typically $\beta_0 = 0$ is assumed, and B is typically an identity matrix (however, it can be given structure).
Large values of λ "shrink" the regressions towards 0 (induces bias, but higher precision than OLS).
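Not in the original: a minimal R sketch contrasting OLS and ridge on simulated data (λ and the dimensions are arbitrary); the ridge coefficients are visibly shrunk towards zero.

# Minimal sketch: ridge regression vs OLS
set.seed(6)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p, 0, 0.5)
y <- X %*% beta + rnorm(n)
lambda <- 5
b.ols   <- solve(crossprod(X), crossprod(X, y))                     # (X'X)^-1 X'y
b.ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X'X + lambda I)^-1 X'y
cbind(b.ols, b.ridge)    # ridge estimates are pulled towards 0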
Dealing with interactions ("statistical epistasis"): much of this took place in inspiring Iowan landscapes…
[Cartoon map of Iowa: "MORE PIGS HERE", "SOME CORN", "PIGS AGAIN", "OK", "Bayesians, keep out!" — with a joke model containing as many pig x pig interaction effects as there are pigs]
C. C. C
4
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
5
$a \sim N(0,\sigma_a^2), \quad d \sim N(0,\sigma_d^2), \quad (aa) \sim N(0,\sigma_{aa}^2), \quad (ad) \sim N(0,\sigma_{ad}^2), \quad (dd) \sim N(0,\sigma_{dd}^2), \quad \ldots, \quad N(0,\sigma_{dd\ldots d}^2)$
The degrees of freedom of the distribution are NOT GIVEN by the number of levels. There is now 1 df for each type of genetic effect.
[Matrix representation; variance-covariance decomposition]
6
RANDOM EFFECTS MODELS FOR ASSESSING EPISTASIS REST ON:
Cockerham (1954) and Kempthorne (1954)
-- Orthogonal partition of genetic variance into additive, dominance, additive x additive, etc. ONLY if:
No selection
No inbreeding
No assortative mating
No mutation
No migration
Linkage equilibrium
DO THESE ASSUMPTIONS HOLD? ALL ASSUMPTIONS VIOLATED!
Just consider linkage disequilibrium
(Cartoon: "My girlfriend is a bitch")
Digression: linkage disequilibrium
7
$D(t) = (1-r)^t D(0)$
[Figure: evolution of linkage disequilibrium D over generations t = 0…40, starting at D(0) = 0.25, for recombination rates r = 0.01, 0.10, 0.50]
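Not in the original: a minimal R sketch reproducing this figure from D(t) = (1 − r)^t D(0).

# Minimal sketch: decay of linkage disequilibrium with recombination rate r
t <- 0:40; D0 <- 0.25
plot(t, D0 * (1 - 0.01)^t, type = "l", ylim = c(0, 0.25),
     xlab = "No. generations t", ylab = "D")   # r = 0.01: slow decay
lines(t, D0 * (1 - 0.10)^t, lty = 2)           # r = 0.10
lines(t, D0 * (1 - 0.50)^t, lty = 3)           # r = 0.50: fast decay
legend("topright", c("r = 0.01", "r = 0.10", "r = 0.50"), lty = 1:3)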
8
A VIEW OF LINEAR MODELS (as employed in q. genetics)
Mathematically, a linear model can be viewed as a "local" approximation of a complex process:
Linear approximation
Quadratic approximation
nth order approximation
9
Example
$y = g(x) + e$  (y: response variate; e: model residual; g: some function of a covariate x)
Suppose $g(x) = \sin(x) + \cos(x)$. Second-order Taylor series expansion about 0:
$\frac{d}{dx}\sin(x) = \cos(x), \qquad \frac{d}{dx}\cos(x) = -\sin(x)$
$\frac{d}{dx}\left[\sin(x) + \cos(x)\right] = \cos(x) - \sin(x)$
$\frac{d^2}{dx^2}\left[\sin(x) + \cos(x)\right] = -\cos(x) - \sin(x)$
$\sin(x) + \cos(x) \approx \sin(0) + \cos(0) + \left[\cos(0) - \sin(0)\right](x-0) + \tfrac{1}{2}\left[-\cos(0) - \sin(0)\right](x-0)^2 = 1 + x - \frac{x^2}{2}$
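Not in the original: a minimal R sketch plotting g(x) against its two Taylor approximations; both are good only near x = 0.

# Minimal sketch: local quality of the Taylor approximations
g <- function(x) sin(x) + cos(x)
g1 <- function(x) 1 + x               # linear approximation about 0
g2 <- function(x) 1 + x - x^2 / 2     # quadratic approximation about 0
x <- seq(-5, 5, 0.01)
plot(x, g(x), type = "l", ylim = c(-1.5, 1.5), ylab = "y")
lines(x, g1(x), lty = 2)
lines(x, g2(x), lty = 3)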
How good are the linear and quadratic approximations? Recall that a Taylor series provides a local approximation only…
[Figure: (1) the sine-plus-cosine function, (2) its linear approximation, and (3) its quadratic approximation over x ∈ (−5, 5); (4) the approximations are good at x = 0…]
10
But we have environmental noise…: evaluate the function sin(x)+cos(x) at x = 0, 0.5 and 1.
True values are:
> sin(0)+cos(0)
[1] 1
> sin(0.5)+cos(0.5)
[1] 1.357008
> sin(1)+cos(1)
[1] 1.381773
VERY CLOSE TO EACH OTHER; NOISE CAN MASK SIGNALS!
Create an R data set (N = 300) by adding 100 N(0,1) residuals to each of the 3 values:
> y0<-sin(0)+cos(0) +rnorm(100,0,1)
> y05<-sin(0.5)+cos(0.5)+rnorm(100,0,1)
> y1<-sin(1)+cos(1) +rnorm(100,0,1)
> y<-c(y0,y05,y1)
11
Create a larger R data set (N = 300,000) by adding 100,000 N(0,1) residuals to each of the 3 values:
> y0<-sin(0)+cos(0) +rnorm(100000,0,1)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,1)
> y1<-sin(1)+cos(1) +rnorm(100000,0,1)
> y<-c(y0,y05,y1)
CANNOT SEE THE UNDERLYING STRUCTURE: LARGE NOISE (ERROR VARIANCE)
Now we get a more precise measuring instrument, with SD 0.05:
> y0<-sin(0)+cos(0) +rnorm(100000,0,.05)
> y1<-sin(1)+cos(1) +rnorm(100000,0,.05)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,.05)
STRUCTURE IS REVEALED, BUT WE CANNOT DIFFERENTIATE BETWEEN TWO OF THE UNDERLYING VALUES
12
…SO WE BUY ANOTHER INSTRUMENT WITH SD 0.001!
> y0<-sin(0)+cos(0) +rnorm(100000,0,.001)
> y1<-sin(1)+cos(1) +rnorm(100000,0,.001)
> y05<-sin(0.5)+cos(0.5) +rnorm(100000,0,.001)
> y<-c(y0,y05,y1)
STILL CANNOT DIFFERENTIATE BETWEEN
> sin(0.5)+cos(0.5)
[1] 1.357008
> sin(1)+cos(1)
[1] 1.381773
[Figures: kernel density estimates of y with the default bandwidth (0.0126) and with a smaller bandwidth (0.002)]
HOWEVER, NON-PARAMETRIC DENSITY ESTIMATES DEPEND ON SOME BANDWIDTH PARAMETER. BY REDUCING IT, WE CAN SEE THE ENTIRE STRUCTURE OF THE PROBLEM…
13
Heal Thyself: Systems Biology Model Reveals How Cells Avoid Becoming Cancerous. ScienceDaily (May 21, 2006)
What about something like this?
What to do in genomic-assisted analysis of complex genetic signals?
• Include all markers, model all possible interactions? Unrealistic…
• Select sets of influential markers via model selection: huge search space; frequentist methods "err" probabilistically; Bayesian model selection (reversible-jump MCMC) is difficult to tune
• Use the LASSO (least absolute shrinkage and selection operator): Tibshirani (1996). What about interactions?
• Explore model-free techniques that have been used successfully in many domains: semi-parametric regression; machine learning (focus on prediction, learning mappings from inputs to outputs)
14
DEFINITION OF MACHINE LEARNING (Wikipedia)
Machine learning: a subfield of artificial intelligence concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time (to learn) based on data.
A major focus of machine learning research is to automatically produce (induce) models, such as rules and patterns, from data. Hence, machine learning is closely related to fields such as data mining, statistics, inductive reasoning, pattern recognition, and theoretical computer science.
1
5. Machine learning classification: chicken mortality
• Genomics Initiative Project at Aviagen, Ltd.
• 231 sires, each genotyped for 5,523 SNPs
• End-point: sire family progeny mortality rates (0-14 d)
• Monomorphic SNPs excluded: 4,148 polymorphic SNPs
• 95.5% in HWE at the 0.001 significance level
[Data split: one part USED FOR SNP SELECTION, the other USED FOR VALIDATION]
Thick tail — comments follow
2
Pre-processing data
• AVIAGEN fitted a linear model to individual 0-1 records (a generalized linear model would have been better)
• The linear model had nuisance effects (e.g., age of dam, contemporary group) and breeding values of birds. All effects were estimated
• Individual records were corrected by subtracting, from the phenotypes, the estimates of nuisance effects and one-half of the EBV of the dam
• The intent was to remove all effects other than those "due to" sires
• Corrected records were averaged for each sire
Criticism
• Correction assumes additive inheritance, and smoothes the data according to such theory
• Better correction: use only estimates of nuisance effects
• In dairy cattle, they are using EBVs of sires (inferred according to additive inheritance) as the target. This is objectionable for the same reason
• Bishop (2006): "Care must be taken during pre-processing because information is often discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer"
3
Convert into a classification problem: partitioning of sire samples
• Mimic a case-control study. Case-control studies: patients who have a disease are compared with those not having the disease
• 2 classes: high (case) and low (control) mortality groups
• Two thresholds for mortality rate, determined from bootstrap (1,000 samples) and empirical distributions:
$c_1 = \alpha$ percentile, $\qquad c_2 = (1-\alpha)$ percentile
Bootstrapping
• Regard sample of size N as pseudo-population• Form B samples of size N by re-sampling with replacement from
pseudo-population• Original y vector replaced by y*
b b=1,2,…,N Some elements of original data will appear more than once or not at all in sample b
• Estimate some parameter (e.g., the 10-percentile, in each of the Bsamples). Then you have a bootstrap distribution.
• Take the average over the B samples. Estimate its standard error with standard formulae.
• B between 50-200 seems to suffice to estimate the SE in simple models
EFRON, B. and TIBSHIRANI, R. J. 1993. An Introduction to the bootstrap. Chapman & Hall, London.
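Not in the original: a minimal R sketch of the recipe above, bootstrapping the 10th percentile of a made-up sample.

# Minimal sketch: bootstrap distribution of the 10th percentile
set.seed(7)
y <- rnorm(200)                        # the sample, treated as pseudo-population
B <- 200
boot.q10 <- replicate(B, quantile(sample(y, replace = TRUE), 0.10))
mean(boot.q10)                         # bootstrap estimate of the 10th percentile
sd(boot.q10)                           # its bootstrap standard error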
4
The 11 case-control groups
Bootstrap and empirical estimates are similar: use bootstrap estimates to form groups.
[Figure: density with the lower tail labeled "Control" and the upper tail labeled "Case"]
Two-step SNP selection
• A filter step reduces thousands of SNPs to a small number
• A wrapper optimizes performance of the filtered SNPs by adopting a naïve Bayesian classifier to score subsets
WRAPPER COMPUTATIONS CARRIED OUT WITH WEKA (Weka 3: Data Mining Software in Java)
5
Filter: information gain
• Information gain, SNP by SNP: the change in uncertainty about the mortality outcome (e.g., high or low) due to a single SNP
$\text{InfoGain}(F) = I\left(\frac{N_+}{N}, \frac{N_-}{N}\right) - \sum_i \frac{N_i}{N}\,I\left(\frac{N_{i+}}{N_i}, \frac{N_{i-}}{N_i}\right)$
where
$I\left(\frac{N_+}{N}, \frac{N_-}{N}\right) = -\frac{N_+}{N}\log_2\frac{N_+}{N} - \frac{N_-}{N}\log_2\frac{N_-}{N}$
and $N_+$ and $N_-$ are the number of instances belonging to the + class and − class, respectively; the sum runs over the genotypes of the SNP.
(First term: entropy from the phenotypes; second term: entropy after knowing the SNP.)
STATISTICAL ENTROPY (discrete situation)
Information is a property of a distribution. The mean information of a distribution (statistical entropy) is, for X with K states and probabilities $p_1, \ldots, p_K$:
$H(p_1, p_2, \ldots, p_K) = E\left[I(X)\right] = E\left[-\log\Pr(X = x_i)\right] = -\sum_{i=1}^K p_i\log p_i$
6
$H(p_1, p_2, \ldots, p_K) = -\sum_{i=1}^K p_i\log p_i$
1) MUST BE POSITIVE
2) NULL VALUE ⟺ NO UNCERTAINTY. IF ONE IS TOLD ABOUT THE OUTCOME, ENTROPY FALLS TO 0
3) MAXIMUM ENTROPY ⟺ A LOT OF INFORMATION TO BE GAINED FROM OBSERVATION
4) DIFFERENCE IN ENTROPY BEFORE AND AFTER DATA: A MEASURE OF INFORMATION CONTENT (called information gain)
EXAMPLE: ENTROPY OF A RANDOM POSITION IN DNA (states A, T, G, C equally frequent):
$-\sum_{i=1}^4 \frac{1}{4}\log_2\frac{1}{4} = 2 \text{ bits}$
EXAMPLE: ENTROPY OF A CONSERVED POSITION (states have unequal frequencies), say $p_A = 0.7$, $p_G = 0.3$:
$\text{Entropy} = -\frac{7}{10}\log_2\frac{7}{10} - \frac{3}{10}\log_2\frac{3}{10} = 0.88 \text{ bits}$
Information content of the position: $2 - 0.88 = 1.12$ bits
7
Example: calculate the information gain associated with the A-locus (SNP by SNP)

Genotype   Number   Dead   Alive
AA         100      80     20
Aa         200      40     160
aa         100      60     40
Total      400      180    220    (180/400 = 0.45; 220/400 = 0.55)

Maximum entropy: $I = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.0$
Entropy of the mortality distribution:
$I\left(\frac{N_A}{N}, \frac{N_D}{N}\right) = -\frac{180}{400}\log_2\frac{180}{400} - \frac{220}{400}\log_2\frac{220}{400} = 0.9927744540$
8
Entropy within each genotype:
$I_{AA} = -\frac{20}{100}\log_2\frac{20}{100} - \frac{80}{100}\log_2\frac{80}{100} = 0.7219280949$
$I_{Aa} = -\frac{160}{200}\log_2\frac{160}{200} - \frac{40}{200}\log_2\frac{40}{200} = 0.7219280949$
$I_{aa} = -\frac{40}{100}\log_2\frac{40}{100} - \frac{60}{100}\log_2\frac{60}{100} = 0.9709505945$
Weighted average:
$\frac{100}{400}(0.7219280949) + \frac{200}{400}(0.7219280949) + \frac{100}{400}(0.9709505945) = 0.784183719$
INFO GAIN = 0.9927744540 − 0.784183719 = 0.208590735
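Not in the original: a minimal R function reproducing the A-locus computation above.

# Minimal sketch: information gain for the A-locus table
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
counts <- matrix(c(80, 20, 40, 160, 60, 40), nrow = 3, byrow = TRUE,
                 dimnames = list(c("AA", "Aa", "aa"), c("Dead", "Alive")))
H.pheno <- entropy(colSums(counts) / sum(counts))     # 0.9927745
H.snp <- sum(rowSums(counts) / sum(counts) *
             apply(counts, 1, function(r) entropy(r / sum(r))))  # 0.7841837
H.pheno - H.snp                                       # info gain: 0.2085907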
Weka 3: Data Mining Software in Java — http://www.cs.waikato.ac.nz/ml/weka/
Weka: a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
9
Wrapper — classifier
The naive Bayesian classifier assumes conditional independence of the SNPs $(A_1, \ldots, A_k)$ given the mortality class C (the naïve Bayes assumption). Example: $A_j$ = Bb, C = "low mortality class". Here $(A_1, \ldots, A_k)$ is the multi-SNP genotype of a sire family; C = high or low.
Suppose that the TRAINING SAMPLE is

Genotype   Number   Dead   Alive
AA         100      80     20
Aa         200      40     160
aa         100      60     40
Total      400      180    220    (180/400 = 0.45; 220/400 = 0.55)

$\Pr(AA\,|\,\text{Dead}) = 80/180 = 0.4444, \quad \Pr(Aa\,|\,\text{Dead}) = 40/180 = 0.2222, \quad \Pr(aa\,|\,\text{Dead}) = 60/180 = 0.3333$
$\Pr(AA\,|\,\text{Alive}) = 20/220 = 0.0909, \quad \Pr(Aa\,|\,\text{Alive}) = 160/220 = 0.7273, \quad \Pr(aa\,|\,\text{Alive}) = 40/220 = 0.1818$
10
A TEST case needs to be classified, and it is aa. What is the probability that it will be alive?
$\Pr(\text{alive}\,|\,aa) = \frac{\Pr(aa\,|\,\text{alive})\,\Pr(\text{alive})}{\Pr(aa)} = \frac{0.1818181818 \times 0.55}{100/400} = 0.40$
The test case will be predicted to be DEAD.
1) If actually dead, the classification is CORRECT
2) If alive, the classification is in ERROR
• Suppose a sire family is "low" and the wrapper sends it to "low": accurate classification
• Suppose a sire family is "low" and the wrapper sends it to "high": inaccurate classification
• Suppose 50 families in a group are classified, and 20 are misclassified: the error rate is 40%
Baseline error rate: 50%
11
Classification accuracy before and after SNP selection in each group
[Figure: accuracy vs. no. of SNPs, under leave-one-out CV and 10-fold CV; the "coin tossing line" marks random classification]
• The full set contains the 50 SNPs retained in the filter step. The subset was searched by backward elimination.
12
Chromosome 1: the 9 sets of selected SNPs
• For groups 1 and 2, no SNP on chromosome 1 was selected
Predictive ability of mortality rates
Main effects model (by group), all 201 sires. PRESS criterion (leave-one-out). View the model as a "smoother", and not as one aiming at a mechanistic basis.
13
Basic regression model: $y_i = \mathbf{x}_i'\beta + e_i$, i.e., $\mathbf{y} = X\beta + \mathbf{e}$, with $\hat\beta = (X'X)^{-1}X'\mathbf{y}$ and $\hat{\mathbf{y}} = X\hat\beta = X(X'X)^{-1}X'\mathbf{y} = H\mathbf{y}$  (H: the "hat" matrix)
The PRESS cross-validation criterion: delete observation i, fit $\hat\beta_{-i} = (X_{-i}'X_{-i})^{-1}X_{-i}'\mathbf{y}_{-i}$, and form
$\hat e_i = y_i - \mathbf{x}_i'\hat\beta_{-i} = \frac{y_i - \mathbf{x}_i'\hat\beta}{1 - h_{ii}}, \qquad PRESS = \sum_{i=1}^n \hat e_i^2$
Model degrees of freedom: $df_{model} = tr(H) = tr\left[X(X'X)^{-1}X'\right] = tr\left[(X'X)^{-1}X'X\right] = p$
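Not in the original: a minimal R sketch computing PRESS through the hat-matrix shortcut on simulated data.

# Minimal sketch: PRESS without refitting, via e_i / (1 - h_ii)
set.seed(8)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
y <- X %*% rnorm(p) + rnorm(n)
H <- X %*% solve(crossprod(X)) %*% t(X)    # hat matrix
e <- y - H %*% y                           # ordinary residuals
PRESS <- sum((e / (1 - diag(H)))^2)        # leave-one-out squared residuals
c(PRESS = PRESS, df.model = sum(diag(H)))  # trace(H) = p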
Predictive ability of the 11 sets of selected SNPs — relative PRESS:
1.0, 0.9615384615, 1.153846154, 1.038461538, 1.192307692, 1.153846154, 1.346153846, 1.538461538, 1.576923077, 1.384615385, 1.653846154
14
7. Comparison between classification methods: SNP-mortality associations in broilers in two environments
• Loss of information due to using only the tails of the distribution? Use all sires now in "cases" and "controls"; compare with results from using a subset of sires
• Contrast 3 classification methods:
– Naïve Bayes classifier
– Bayesian networks
– Neural networks
15
Data
• >200 sire families; chicks in low (L) or high (H) hygiene environments
• Traits: early-age (0-14 d) and late-age (14-42 d) mortality
• >5,000 whole-genome SNPs (on sires)
• Phenotypic values constructed for sires by fitting a GLMM at the individual bird level, accounting for non-genetic and non-hygiene environmental factors: the sire's adjusted progeny mortality means
Data set: four age-hygiene strata formed
EL: early mortality, low hygiene; EH: early mortality, high hygiene
LL: late mortality, low hygiene; LH: late mortality, high hygiene
• Analysis performed in each stratum in the same way
• Enables study of G x E
16
Methods: multi-category classification
• Five categorizations (number of categories K = 2, 3, 4, 5 and 10) employed to categorize the sires' adjusted means
• All sire means used
• Sire means ordered: each assigned to one of K equal-sized subsets. E.g., when K = 3, the two boundary values correspond to the 1/3 and 2/3 quantiles of the distribution of sire means
Methods: "filter"
• Rank SNPs by the information gain of each
• Using the categorized phenotypic values, information gain was computed for each SNP for each categorization (K = 2, 3, 4, 5, 10)
• The SNPs retained for further analysis were the top 50 for information gain
17
Methods: "wrapper"
• Optimize performance of the top 50 SNPs selected by the "filter"
• Yields the "best" subset of SNPs, with the highest classification accuracy of the chosen classifier
• Classification methods: naïve Bayes (NB) classifier — NB wrapper; Bayesian networks (BN) — BN wrapper; neural networks (NN) — NN wrapper
"Naïve Bayes wrapper"
• Naïve Bayes: strong assumption of independence of the SNPs, given the phenotype
$p(Y\,|\,X_1, \ldots, X_n) \propto p(Y)\,p(X_1, \ldots, X_n\,|\,Y)$
$p(X_1, X_2, \ldots, X_n\,|\,Y) = p(X_1\,|\,Y)\,p(X_2\,|\,Y)\cdots p(X_n\,|\,Y)$  (naïve Bayes assumption)
Y: class label; the X's: SNP genotypes.
Important: these probabilities are trained with past data.
18
BAYESIAN NETWORKS
• Bayesian hierarchical model: sequence of probability distributions describing uncertainty: starts from the likelihood…ends with some hyper-prior
• Bayesian network: hierarchical model for sets of discrete random variables. Our problem is discrete:
classification
[Graph: nodes a and b, each with an arrow to node c]

PROBABILISTIC GRAPHICAL MODELS

A graph has:
- Nodes or vertices: each represents a RV or a group of RVs
- Links (arcs, edges): each expresses a relationship
The graph visualizes the structure of a probabilistic model, gives insights into that structure (e.g., conditional independence), and allows complex computations to be expressed graphically. It describes how the joint distribution is decomposed. A directed graph may express causality; an undirected graph may express constraints.

$p(a,b,c) = p(a,b)\,p(c|a,b)$

$p(a,b,c) = p(a)\,p(b|a)\,p(c|a,b)$  (second representation; used in the graph)
19
The representation $p(a,b,c)$ is symmetrical with respect to the variables. A different ordering, $p(b)\,p(c|b)\,p(a|b,c)$, leads to a different graphical representation.

$p(x_1,x_2,\ldots,x_K) = p(x_1)\,p(x_2|x_1)\cdots p(x_K|x_{K-1},x_{K-2},\ldots,x_1)$

One node per conditional distribution; each node has links from all lower-numbered nodes. Such a graph is said to be fully connected: one link for every pair of nodes.
Absence of links reveals interesting information: a graph that is not fully connected.

Example (not fully connected, nodes x1, …, x7):

$p(x) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1,x_2,x_3)\,p(x_5|x_1,x_3)\,p(x_6|x_4)\,p(x_7|x_4,x_5)$

Then, letting $pa_k$ denote the parents of node k,

$p(x) = \prod_{k=1}^{K}p(x_k|pa_k)$
• Important for sampling
• Can generate a sample from the joint distribution by using the independences indicated by the network (ancestral sampling)
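As a sketch of ancestral sampling for the 7-node example above (the conditional probability values below are made up for illustration; only the parent structure follows the factorization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_network(n):
    """Ancestral sampling: draw each binary node given its parents."""
    out = np.empty((n, 7), dtype=int)
    for i in range(n):
        x1 = rng.random() < 0.6
        x2 = rng.random() < 0.3
        x3 = rng.random() < 0.5
        # p(x4 | x1, x2, x3): any function of the three parents will do
        x4 = rng.random() < (0.1 + 0.25 * (x1 + x2 + x3))
        x5 = rng.random() < (0.2 + 0.3 * x1 + 0.3 * x3)
        x6 = rng.random() < (0.7 if x4 else 0.2)
        x7 = rng.random() < (0.5 * x4 + 0.4 * x5)
        out[i] = [x1, x2, x3, x4, x5, x6, x7]
    return out

print(sample_network(5))
```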
20
Example: Polynomial random regression as a graphical model
$t_i = x_i'w + e_i, \qquad t = Xw + e$

$w \sim N(0, I\alpha^{-1}), \qquad e \sim N(0, I\sigma^2)$

Joint distribution of the data and of the regression coefficients (in the machine learning literature, regressions are called “weights”; the intercept is called the “bias”):

$p(t,w|x,\alpha,\sigma^2) = p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)$

Conditional distribution of the regressions, given the data (inference):

$p(w|t,x,\alpha,\sigma^2) = \dfrac{p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)}{\int p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)\,dw}$

Conditional distribution of future observations, given current data (prediction):

$p(t_f|x_f,t,x,\alpha,\sigma^2) = \int p(t_f,w|x_f,t,x,\alpha,\sigma^2)\,dw = \int p(t_f|x_f,w,\sigma^2)\,p(w|t,x,\alpha,\sigma^2)\,dw$

Crucial in genomic-assisted improvement programs
21
Example: Joint distribution for polynomial random regression model
[Graphs: node w with arrows to t1, t2, …, tN; alternative plate representation with xn, α and σ2 shown explicitly. Open circles: random variables; solid circles: deterministic quantities. Shaded: observed; unshaded: latent (hidden).]

$p(t,w|x,\alpha,\sigma^2) = p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)$

Network representation of predicting a future case $t_f$ at input $x_f$ in the polynomial random regression model. The same two operations apply:

INFERENCE:

$p(w|t,x,\alpha,\sigma^2) = \dfrac{p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)}{\int p(w|\alpha)\prod_{n=1}^{N}p(t_n|x_n,w,\sigma^2)\,dw}$

PREDICTION:

$p(t_f|x_f,t,x,\alpha,\sigma^2) = \int p(t_f,w|x_f,t,x,\alpha,\sigma^2)\,dw = \int p(t_f|x_f,w,\sigma^2)\,p(w|t,x,\alpha,\sigma^2)\,dw$
22
DISCRETE RANDOM VARIABLES

A random variable has K states. Represent it as a vector x of order K × 1:

$x' = (x_1\ x_2\ \ldots\ x_K)$

If the k-th modality is observed, $x_k = 1$ and the rest are 0. Let $\Pr(x_k = 1) = \mu_k$. Then

$p(x|\mu) = \prod_{k=1}^{K}\mu_k^{x_k}, \qquad \sum_{k=1}^{K}\mu_k = 1 \;\rightarrow\; K-1 \text{ parameters}$

2 RV: two random variables, each with K states, so $K^2$ possibilities. Let $\Pr(x_{1k}=1, x_{2l}=1) = \mu_{kl}$:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\mu_{kl}^{x_{1k}x_{2l}}, \qquad \sum_{k=1}^{K}\sum_{l=1}^{K}\mu_{kl} = 1 \;\rightarrow\; K^2-1 \text{ parameters}$

M RV:

$p(x_1,x_2,\ldots,x_M|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\cdots\prod_{m=1}^{K}\mu_{kl\ldots m}^{x_{1k}x_{2l}\cdots x_{Mm}}, \qquad \sum_{k}\sum_{l}\cdots\sum_{m}\mu_{kl\ldots m} = 1 \;\rightarrow\; K^M-1 \text{ parameters}$

THE NUMBER OF PROBABILITIES (PARAMETERS) GROWS EXPONENTIALLY WITH M
23
Suppose $x_1, x_2$ are independent:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\mu_k^{x_{1k}}\prod_{l=1}^{K}\mu_l^{x_{2l}}, \qquad \sum_{k}\mu_k = 1,\ \sum_{l}\mu_l = 1 \;\rightarrow\; 2(K-1) \text{ parameters}$

versus the fully dependent case:

$p(x_1,x_2|\mu) = \prod_{k=1}^{K}\prod_{l=1}^{K}\mu_{kl}^{x_{1k}x_{2l}} \;\rightarrow\; K^2-1 \text{ parameters}$

Equivalently,

$p(x_1,x_2|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu): \qquad (K-1) + K(K-1) = K^2-1 \text{ parameters}$

For M independent variables: M(K-1), so the number of parameters grows linearly.

DISCRETE NETWORK
• Minimum number of parameters is M(K-1): graph is fully disconnected
• Maximum number is $K^M - 1$: graph is fully connected
• Typically, the search is for networks with intermediate levels of connectivity
• This is part of the modeling effort. There are several options for reducing the number of probabilities to estimate.
24
Example 1 of intermediate connectivity: Markov chain

The general factorization $p(x_1,x_2,\ldots,x_M|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu)\cdots p(x_M|x_1,\ldots,x_{M-1},\mu)$ becomes, for a Markov chain,

$p(x_1,x_2,\ldots,x_M|\mu) = p(x_1|\mu)\,p(x_2|x_1,\mu)\,p(x_3|x_2,\mu)\cdots p(x_M|x_{M-1},\mu)$

Parameter count:

$(K-1) + K(K-1) + K(K-1) + \cdots + K(K-1) = (K-1) + (M-1)K(K-1) = (K-1)\left[1 + (M-1)K\right]$

Linear (rather than exponential) in M.

[Graph: chain x1 → x2 → … → xM]
Example 2 of intermediate connectivity: Bayesian structure of a network. For a Markov chain, assign a Dirichlet prior with parameters μi to the probabilities of each node.

[Graphs: chain x1 → x2 → … → xM, either with node-specific priors μ1, μ2, …, μM, or with a single shared prior μ]
25
Example 3 of reducing connectivity

Instead of using complete tables of probability distributions, use parameterized models.

[Graph: binary parents x1, x2, …, xM, each with prior μi, all pointing to a binary node y]

$\Pr(y=1|X_1=x_1, X_2=x_2, \ldots, X_M=x_M)$

$2^M$ CONFIGURATIONS of the X VARIABLES → $2^M$ CONDITIONAL PROBABILITIES WOULD NEED TO BE ESTIMATED

One possible form of modelling:

$\Pr(y=1|X_1=x_1,\ldots,X_M=x_M) = \sigma(w'x), \qquad w'x = w_0 + \sum_{j=1}^{M}x_jw_j$

$\sigma(w'x) = \dfrac{\exp(w'x)}{1+\exp(w'x)}$ (logit model), or $\sigma(w'x) = \Phi(w'x)$ (probit model)

$M+1$ parameters → linear in the number of variables
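A tiny sketch of that reduction (the weights below are arbitrary, chosen only to illustrate that one function of M + 1 parameters replaces a table of $2^M$ probabilities):

```python
import numpy as np

def logistic_cpd(x, w0, w):
    """Pr(y = 1 | X = x) under the logit parameterization above."""
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

M = 10
rng = np.random.default_rng(0)
w0, w = -1.0, rng.normal(scale=0.5, size=M)   # M + 1 parameters in total
x = rng.integers(0, 2, size=M)                # one of 2**M configurations
print(logistic_cpd(x, w0, w))
```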
26
MARKOV BLANKET

Recall $p(x) = \prod_{k=1}^{K}p(x_k|pa_k)$. Let $x_{-i}$ denote all x’s other than $x_i$.

For discrete random variables:

$p(x_i|x_{-i}) = \dfrac{p(x_1,x_2,\ldots,x_M)}{p(x_{-i})} = \dfrac{p(x_1,x_2,\ldots,x_M)}{\sum_{x_i}p(x_1,x_2,\ldots,x_M)} = \dfrac{\prod_{k=1}^{K}p(x_k|pa_k)}{\sum_{x_i}\prod_{k=1}^{K}p(x_k|pa_k)}$

For continuous random variables, the sum over $x_i$ is replaced by an integral $\int(\cdot)\,dx_i$.

Further, all factors not involving $x_i$ cancel:

$p(x_i|x_{-i}) \propto p(x_i|pa_i)\prod_{k:\,x_i\in pa_k}p(x_k|pa_k)$

The first factor depends on the parents of i; the product depends on the children of i and on the co-parents of i.
27
Markov blanket of node xi comprises the set of parents, children and co-parents of node
The conditional distribution of xi given all other variables depends only on variablesin the blanket
xi
Parent
Co-parent
Children
“Bayesian Net-wrapper”
• Suppose the Bayesian network is:
[Figure: example network over SNP1-SNP4 and Disease, with arc SNP3|SNP1 and SNP4 marginal probabilities DD 0.7, Dd 0.2, dd 0.1]
28
• Bayesian network: the joint distribution is decomposed by the conditional independences implied by the graph
• The Markov blanket for disease can be found from, e.g.:
P(SNP1=AA, SNP2=bb, SNP3=Cc, SNP4=DD, Disease=Present)
= P(SNP1=AA) P(SNP2=bb|SNP1=AA) P(SNP3=Cc|SNP1=AA) P(SNP4=DD) × P(Disease=Present|SNP3=Cc, SNP4=DD)
FINDING THE NETWORK
“NP-hard problem”
• No shortcut or smart algorithm leads to a simple or rapid solution.• Only way to find optimal solution is a computationally-intensive, exhaustive analysis
in which all possible outcomes are tested.
29
Network selection problem

Mh = a specific network structure. Choose the network with the largest posterior probability. Assuming networks are equi-probable a priori, model choice is driven by the marginal likelihood (which depends on the number of variables in the net and on the sample size). Use the Markov blanket representation.

Consider network M1: class label Y (2 labels: “normal”, “disease”) with parents X1, X2, X3, X4 (3 genotypes each):

$P(Y,X_1,X_2,X_3,X_4) = P(Y|X_1,X_2,X_3,X_4)\,P(X_1,X_2,X_3,X_4) \rightarrow 2\times 3^4 = 162 \text{ probabilities}$

$P(X_1,X_2,X_3,X_4) \rightarrow 3^4 = 81 \text{ probabilities} \rightarrow 80 \text{ are free}$

$P(Y|X_1,X_2,X_3,X_4) \rightarrow 2\times 81 \text{ probabilities} \rightarrow 81 \text{ free}$

The network has 80 + 81 = 161 parameters
30
Now, look at network M2 (the SNP1-SNP4/Disease graph above, with arc SNP3|SNP1 and SNP4 probabilities DD 0.7, Dd 0.2, dd 0.1):

P(SNP1=AA, SNP2=bb, SNP3=Cc, SNP4=DD, Disease=Present)
= P(SNP1=AA) P(SNP2=bb|SNP1=AA) P(SNP3=Cc|SNP1=AA) P(SNP4=DD) P(Disease=Present|SNP3=Cc, SNP4=DD)

P(SNP1) → 2 parameters
P(SNP2|SNP1) → 3 parameters
P(SNP3|SNP1) → 3 parameters
P(SNP4) → 2 parameters
P(Disease|SNP3,SNP4) → 9 parameters

19 parameters in the net
31
ESTIMATING THE PROBABILITIES

$P(\text{Data}|\text{parameters}) = \binom{n}{n_1, n_2, \ldots, n_k}\prod_{j=1}^{k}\theta_j^{n_j}$

Prior:

$P(\text{parameters}) = \dfrac{1}{B(a_1,\ldots,a_k)}\prod_{j=1}^{k}\theta_j^{a_j-1} = \dfrac{\Gamma\!\left(\sum_{j=1}^{k}a_j\right)}{\prod_{j=1}^{k}\Gamma(a_j)}\prod_{j=1}^{k}\theta_j^{a_j-1} \;\rightarrow\; \text{DIRICHLET}$

$P(\text{parameters}|\text{Data}) \propto \prod_{j=1}^{k}\theta_j^{n_j+a_j-1} \;\rightarrow\; \text{DIRICHLET}$

$E(\theta_j|\text{Data}) = \dfrac{n_j + a_j}{n + \sum_{j=1}^{k}a_j}$

where n = sample size and $n_j$ = observed number in “niche” j (observed numbers, summed over classes where needed; in the network, i indexes the variable, j its parent blanket configuration, and k the class label).

Under the Dirichlet-multinomial model there is a closed form for the marginal likelihood of the data for a given network Mh.
32
Sebastiani et al. (2005) Nature Genetics
NEURAL NETWORKS
A very fast tour
33
Up to 10^3 connections per neuron
• Brain superior to von Neumann machines in cognitive tasks
• Microchips: nanoseconds; brain: milliseconds
• ???
The brain recognizes familiar objects from unfamiliar angles. Key: not speed but the organization of processing
34
Why?
• Tasks distributed over 10^12 neurons
• Interconnected and activated
• Massively parallel
• Neurons adapt and self-organize
• Interconnectivity: up to 10^3 synaptic connections
“NN-wrapper”
• Neural network: multilayer feed-forward network (also called multilayer perceptron)
N SNPs M hidden nodes O output nodes in classification1 output node in regression
Two dummy variatesneeded to describegenotypes
3 categoriesassumed here
(classification)
35
[Figure: network diagram annotated with: input to hidden node j; activation function; activated input to hidden node j; activated input to the output layer; net input to the output layer; w0j is an “intercept”]
NEURAL NETWORKS ARE UNIVERSAL APPROXIMATORS
[Figure: 50 x values sampled from U[-1,1], f(x) evaluated, and a two-layer NN with 3 hidden nodes and tanh activation functions fitted; panels show f(x) = x², sin(x), |x| and a step function, together with the output from each hidden node]
36
A simpler formulation of the same network: relationship to non-parametric regression

$y_i = \beta_0 + \sum_{j=1}^{k}\beta_j\,\dfrac{1}{1+\exp(-x_i'\gamma_j)} + e_i$

where k = number of hidden nodes. Here, the activation function of the output of the hidden layer is the identity. If the number of nodes k is known, the number of parameters is:

$1 + k + k(1 + \#x\text{'s}) = 1 + k(\#x\text{'s} + 2)$
Can overfit if there are too many hidden nodes: “regularization” needed
FITTING (TRAINING) THE NETWORK
• It can be formulated as a non-linear regression problem
• Can use any algorithm, such as Newton-Raphson, to minimize residual SS
• Widely used algorithm: error back-propagation
• Better: treat the regressions as random (i.e., assign a prior) and induce regularization via the Bayesian approach
• The number of hidden nodes is a model selection problem
37
NEURAL NETWORK FOR CLASSIFICATION
Classify tumors [BENIGN, MALIGNANT, NONE] with one-hot coding. Data:

Benign     y11=1  y12=0  y13=0
None       y21=0  y22=0  y23=1
Malignant  y31=0  y32=1  y33=0
Benign     y41=1  y42=0  y43=0
Malignant  y51=0  y52=1  y53=0

q = number of classes; if q = 3, then two “free” probabilities
Likelihood formulation of the neural network for the classification problem:

$p_{ig} = \Pr(y_{ig}=1) = \dfrac{\exp(w_{ig})}{\sum_{g=1}^{q}\exp(w_{ig})}, \qquad w_{ig} = \beta_{0g} + \sum_{j=1}^{k}\beta_{jg}\,\dfrac{1}{1+\exp(-x_i'\gamma_j)}$

$l \propto \prod_{i=1}^{n}p_{i1}^{y_{i1}}\,p_{i2}^{y_{i2}}\cdots p_{iq}^{y_{iq}}$

Number of parameters to be learned (estimated), with k = number of hidden nodes:

$(1 + \#x\text{'s})\,k \;(\#\text{ of }\gamma\text{'s}) + (1 + k)(q-1) \;(\#\text{ of }\beta\text{'s})$

AGAIN, CARE MUST BE EXERCISED TO AVOID OVERFITTING
38
Back to chickens: Comparison of three wrappers
In all comparisons, NB had the smallest error rates
Classification error rates in the four strata
Comparison of NB’s classification accuracy using a fullset-SNP predictor and a subset SNP predictor
[Figure: bar plots of prediction accuracy (0.0-0.9) for the 2-, 3-, 4-, 5- and 10-category schemes in the four strata (EL, EH, LL, LH), comparing the full SNP set with the wrapper-selected subset]

The wrapper enhances classification accuracy over and above the subset chosen by the filter
39
Evaluation of SNP subsets generated by five classification schemes
• Two measures used as evaluation criteria– Measure A: predicted residual sum of squares (PRESS)
($\hat M_{(i)}$ is the predicted value for sire i’s phenotype when it is not used to estimate parameters)
– Measure B: proportion of significant SNPs• Given subset of SNPs, F-statistic computed for each SNP• Given an individual SNP F-statistic, its p-value is approximated by
– Shuffling phenotypes across all sires 200 times, while keeping sires’ genotypes on this SNP fixed
– Proportion of the 200 permuted samples where a particular F-statistic exceeds the F-statistic seen in the original sample is calculated. This proportion is taken as the SNP’s p-value
• After obtaining p-values for all SNPs in the subset, significant ones are chosen by controlling the false discovery rate at level 0.05, via the “Benjamini-Hochberg procedure”. The proportion of significant SNPs in a subset is the end-point
$\text{PRESS}_{\text{SNP}} = \sum_{i=1}^{N}\left(M_i - \hat M_{(i)}\right)^2$

where $M_i$ is sire i’s adjusted progeny mean and $\hat M_{(i)}$ its leave-one-out prediction.

Evaluation of SNP subsets generated by five classification schemes
Evaluation of SNP subsets generated by five classification schemes
Values with underlines are winners under each of the two measures
Overall, schemes with 2 and 3 categories were better than the others
40
• The issue of losing sire information: PRESS values of SNP subsets selected using either extreme sires (100×2α% of the entire sire sample) or all sires
• Loss of sample information is not a severe problem
• Because the concern was classification accuracy, having more informative samples for each class is the more important issue
• Including all sire samples brings noise, leading to poorer predictive ability of the selected SNPs
Summary • 5 multi-category classification schemes and 4
classification methods compared, for selecting SNPs most relevant to classification.
• Best SNP subset with K=2 or K=3. Two measures (PRESS and proportion of significant SNPs) not improved by using a finer grading of mortality rates.
• Classification accuracy enhanced by using extreme samples: no loss of information; more parsimony.
• The simple naïve Bayes classifier outperformed BN and NN: over-fitting may have been a problem (unless NN coefficients are treated as random effects…)
1
8. Introduction to non-parametric curve fitting:
Loess, kernel regression, neural networks
Distinctive aspects of non-parametric fitting
• Objectives: primarily to investigate patterns free of strictures imposed by parametric models
• Can produce surprising results
• Regression coefficients appear but (typically) do not have an obvious interpretation
• Often have very good predictive performance in cross-validation
• Methods for tuning are similar to those for parametric methods
2
Example: thin-plate splines
Risk of heart attack after 19 years as a function of cholesterol level and blood pressure. Left: logistic regression model. Right: thin plate spline fit. Wahba (2007)
The fitted surface has the thin-plate spline form

$f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \sum_{j=1}^{N}\delta_j\,r_{ij}^2\log r_{ij}, \qquad r_{ij}^2 = (x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2$
LOESS REGRESSION: Non-parametric exploration of inbreeding depression for yield and somatic cell count in Jersey cattle
3
AN OVERVIEW OF LOWESS REGRESSION
1) DATA POINTS $(x_i, y_i)$; $i = 1, 2, \ldots, n$
2) SPANNING PARAMETER f; 0 < f ≤ 1; $k = \lfloor fn\rfloor$ = LARGEST INTEGER ≤ fn
3) FOR EACH $x_0$, FIND THE k POINTS $x_i$ “CLOSEST” TO $x_0$: $N(x_0)$ = NEIGHBORHOOD OF k POINTS
4) COMPUTE $\Delta(x_0) = \max_{x_i\in N(x_0)}|x_0 - x_i|$
5) TO EACH $(x_i, y_i)$, $x_i \in N(x_0)$, ASSIGN THE WEIGHT

$w_i(x_0) = \left[1 - \left(\dfrac{|x_0 - x_i|}{\Delta(x_0)}\right)^3\right]^3$

6) FIT BY WEIGHTED LEAST-SQUARES:

$\min\ \sum_{i=1}^{k}w_i(x_0)\left[y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right]^2$

RETURN $\hat y(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2$

7) REPEAT FOR EACH OF THE $x_0$
4
ROBUST LOWESS

• STANDARD LOWESS IS NOT ROBUST (BASED ON LEAST-SQUARES)
• BI-SQUARE LOWESS: RE-WEIGHT POINTS ACCORDING TO THE RESIDUAL; IF THE RESIDUAL IS LARGE, THE WEIGHT IS DECREASED

1) FIT THE DATA USING STANDARD LOESS
2) CALCULATE THE LOESS RESIDUALS $y_i - \hat y_i$
3) COMPUTE $q_{1/2} = \text{median}\,|y_i - \hat y_i|$
4) CALCULATE THE BI-SQUARE ROBUST WEIGHTS

$r_i = \left[1 - \left(\dfrac{y_i - \hat y_i}{6q_{1/2}}\right)^2\right]^2$

5) REPEAT LOESS WITH WEIGHTS $r_i\,w_i(x_0)$
6) REPEAT 2-5 UNTIL THE LOESS CURVE “CONVERGES”
5
Example
• Birth rate in US population (U.S. Department of Health, Education and Welfare)
• n = 96
• births per 1000 US population
• during 1940-47

[Figure. Top: ordinary least squares with 1st-, 2nd- and 3rd-degree polynomials. Bottom: LOWESS fits with f = 0.2, 0.4 and 0.6]
6
• Examine relationships of yield (milk, protein, fat) and somatic cell score (SCS) with inbreeding coefficient (F) using field data from US Jerseys
• Use REML, BLUP and the “local regression” method (LOESS) for this purpose

LEVEL OF INBREEDING IN HOLSTEINS, USA
7
• The relationship between the mean value of a quantitative trait and the inbreeding coefficient (F) is expected to be linear under dominance
• Not so if epistatic interactions between dominance effects exist (Crow & Kimura, 1970)

ONE-LOCUS MODEL (additive model with F, or heterozygosity H, as covariate)

GENOTYPE X:   A1A1                A1A2            A2A2
FREQUENCY:    p1²(1-F) + p1F      2p1p2(1-F)      p2²(1-F) + p2F
PHENOTYPE:    -A                  D               A

$E(X) = A(p_2 - p_1) + 2p_1p_2D - 2p_1p_2DF = \mu + \beta F$

The coefficient of F, $-2p_1p_2D$, is proportional to -(1-F): the decline is proportional to the loss of heterozygosity.
8
TWO (UNLINKED) LOCI: NO EPISTASIS

Genotypic values (columns: A1A1, A1A2, A2A2; row frequencies as in the one-locus model):

B1B1 [freq. r1²(1-F)+r1F]:   -A-B      DA-B      A-B
B1B2 [freq. 2r1r2(1-F)]:     -A+DB     DA+DB     A+DB
B2B2 [freq. r2²(1-F)+r2F]:   -A+B      DA+B      A+B

$E(X) = A(p_2-p_1) + B(r_2-r_1) + 2p_1p_2D_A + 2r_1r_2D_B - 2(p_1p_2D_A + r_1r_2D_B)F = \mu' + \beta'F$
TWO (UNLINKED) LOCI: EPISTASIS

B1B1 [freq. r1²(1-F)+r1F]:   -A-B+I     DA-B-L     A-B-I
B1B2 [freq. 2r1r2(1-F)]:     -A+DB-K    DA+DB+J    A+DB+K
B2B2 [freq. r2²(1-F)+r2F]:   -A+B-I     DA+B+L     A+B+I

• ALLELES AT THE A AND B LOCI WITH THE SAME SUBSCRIPT → ADD I (ADDITIVE × ADDITIVE)
• HOMOZYGOUS AT A, HETEROZYGOUS AT B → SUBTRACT AND ADD K; HOMOZYGOUS AT B, HETEROZYGOUS AT A → SUBTRACT AND ADD L (ADDITIVE × DOMINANCE)
• HETEROZYGOUS AT A AND B → ADD J (DOMINANCE × DOMINANCE)
9
Mean value under dominance × dominance epistasis:

$E(X) = A(p_2-p_1) + B(r_2-r_1) + 2p_1p_2D_A + 2r_1r_2D_B + I(p_1-p_2)(r_1-r_2) + 2Lp_1p_2(r_1-r_2) + 2Kr_1r_2(p_1-p_2) + 4Jp_1p_2r_1r_2$
$\quad - 2\left[p_1p_2D_A + r_1r_2D_B + Lp_1p_2(r_1-r_2) + Kr_1r_2(p_1-p_2) + 4Jp_1p_2r_1r_2\right]F + 4Jp_1p_2r_1r_2F^2$
$= \mu'' + \beta''F + \gamma F^2$

• Dominance, additive × dominance, and dominance × dominance intervene in the linear regression on F
• Dominance × dominance intervenes in the second-order regression
• Epistasis without dominance does not enter into the mean-F relationship
DATA
• First lactation records on 59,778 Jersey cows in 1,142 herds
• 6 generations of known pedigree
• First calving between 1995 and 2000
10
Distribution of F
• F calculated from all known pedigree information
• F ranged between 0 and 34%
• Median F = 6.25%

[Figure: histogram of F values (%)]
11
Procedures
• Fit linear models without F as a covariate
• Compute EBLUP residuals from these models
• Fit a nonparametric regression to the EBLUP residuals in order to obtain nonparametric lines describing the relationship between performance and inbreeding level
Linear Models

Model:

$y_{ijk} = HYS_i + AGE_j + \beta_1(D_{ijk} - \bar D) + a_k + e_{ijk}$

where
yijk = somatic cell score (SCS), milk, protein, or fat yield;
HYSi = fixed effect of herd-year-season (i = 1,2,….,12276 for DS2; 11158 for DS4 or 6406 for DS6, with season classes January-April, May-August, September-December);
AGEj = fixed effect of age at calving class; j = 1,2,….,6 (<617, 617-716, 717-816, 817-916, 917-1016, or >1016 days of age);
β1 = fixed regression coefficient of performance on days in milk;
Dijk = days in milk for animal k in herd-year-season i and age of calving class j;
D̄ = 263;
ak = random additive genetic effect of animal k; and
eijk = random residual.
12
Linear Model Assumptions
• Genetic and residual effects assumed mutually independent, with $a \sim N(0, A\sigma_a^2)$ and $e \sim N(0, I\sigma_e^2)$, where A is the additive relationship matrix (1 + Fk in the k-th diagonal position; Fk is the inbreeding coefficient of animal k)
Nonparametric regression
• Fit LOESS regression to BLUP residuals with F as the covariate
• Vary the spanning parameter & degree of the local polynomial
• Plot fitted values of the residuals against F
13
LOESS (fitting done by locally weighted least squares)

• $\tilde e_{ij}$ is the LOESS fit using only residuals in the neighborhood of $F_i$ (i = 1,2,…,n animals; j = 1,…,4 traits)
• Size of the neighborhood determined by $f = q/n$, where q = number of points in the neighborhood and n = total number of points

“Robust” LOESS. Weights assigned to the residuals, iterating t = 1,2,3,4:

I) $w_{ijk}^{[1]} = \left[1 - \left(\dfrac{|F_i - F_l|}{\max_l(|F_i - F_l|)}\right)^3\right]^3, \qquad l = 1,2,\ldots,q$

II) $w_{ijk}^{[t]} = w_{ijk}^{[t-1]}\,\delta_{ijk}^{[t]}$, with bi-square weights $\delta_{ijk}^{[t]} = \left[1 - \left(\dfrac{\tilde e_{ijk}}{6\,\text{med}}\right)^2\right]^2$ and $\text{med} = \text{median of all } |\tilde e_{ijk}|$
14
[Figure: “robust” original (black) and bootstrap (light blue) LOESS curves of yields for US Jerseys with at least 6 generations of known pedigree, based on medians of EBLUP residuals; y-axis = $\hat e_{ijk}/\hat\sigma_a$, x-axis = F (%); f = 0.9, 0.5, 0.9; 2nd-degree local polynomial]
15
Conclusions
• LOESS analysis suggested local relationships
• Effects of inbreeding seem nil for F values up to ~7%
• Effects of inbreeding are not accounted for well by additive models
• Results may be confounded by effects of selection that are unaccounted for
Kernel Regression
16
$y_i = g(x_i) + e_i; \qquad i = 1,2,\ldots,n$

where: $y_i$ is the measurement taken on individual i; $x_i$ is a p × 1 vector of observed SNP genotypes; g(·) is some unknown function relating genotypes to phenotypes, set to $g(x_i) = E(y_i \mid x_i)$, the conditional expectation function; and $e_i \sim (0, \sigma^2)$ is a random residual.
17
18
19
20
Neural networks applied to pedigree- or genomic-enabled prediction
21
Graphical representation of a single-layer neural network, with M nodes, for a continuous phenotype $y_i$ (i = 1,…,n) and p molecular markers (j = 1,…,p):

$y_i = \alpha_{M+1,0} + \sum_{m=1}^{M}\alpha_m Z_{mi} + e_i$

$Z_{mi} = g_m\!\left(\beta_{m0} + \sum_{j=1}^{p}x_{ij}\beta_{mj}\right), \qquad m = 1,\ldots,M \quad (g_m = \text{activation function})$

$x_i' = (1, x_{i1}, \ldots, x_{ij}, \ldots, x_{ip})$

[Diagram: Input layer → Hidden layer → Output]
22
DATA
Pedigree structure: mgd pedigrees, mgs pedigrees, pgd pedigrees, pgs pedigrees; phenotyped animals

BREEDS (JE: Jersey; CAN: Canada; USA: United States)
JE840003000951180: 1
JECAN000010013378: 1
JEUSA000003414899: 295
23
Data
• All variables standardized before use in the NN
• Inputs: elements of a relationship matrix (pedigree or genomic, or both)
• Target: fat yield deviation, milk yield deviation, protein yield deviation
• Rationale:

$y = \mu + u + e, \qquad u \sim (0, A\sigma_a^2)$

$y = \mu + AA^{-1}u + e = \mu + Au^* + e \;\Rightarrow\; y_i = \mu + \sum_{j=1}^{N}a_{ij}u_j^* + e_i$

Use the elements of A (or G), or of A⁻¹ (or A⁻¹A, or A⁻¹u*), as inputs in the NN.
24
Fitting the network

• Divide data into TRAINING, TUNING and TESTING sets
• Network trained as a non-linear regression problem. Weights and residuals have Gaussian distributions with variance parameters.
• Given the variances, find the mode of the weights using any non-linear optimization method in the TRAINING set. Variances assessed with a Laplacian approximation (as in DFREML)
• TUNING set used to assess the predictive performance of each cell in a grid
• Final predictive performance assessed in the TESTING set
• An NN with 1 neuron and linear activation functions is the “animal model” with unknown variances
25
[Figure: bar charts of train/validation/test/general correlations for the linear and 1- to 6-neuron architectures, by trait]

Correlations by architecture (train / validation / test / general):

           Fat Deviation          Milk Deviation         Protein Deviation
Linear     0.96 0.03 0.11 0.66    0.95 0.07 0.07 0.67    0.93 0.05 0.02 0.67
1-neur     0.91 0.10 0.23 0.65    0.89 0.04 0.10 0.62    0.88 0.05 0.09 0.64
2-neurs    0.94 0.11 0.22 0.66    0.94 0.11 0.08 0.69    0.93 0.06 0.08 0.67
3-neurs    0.91 0.10 0.21 0.67    0.96 0.07 0.13 0.69    0.95 0.04 0.10 0.68
4-neurs    0.92 0.10 0.20 0.65    0.94 0.06 0.09 0.67    0.95 0.09 0.14 0.70
5-neurs    0.91 0.11 0.23 0.65    0.94 0.09 0.13 0.69    0.96 0.11 0.15 0.71
6-neurs    0.96 0.10 0.27 0.68    0.94 0.10 0.10 0.68    0.96 0.07 0.11 0.69
26
Animal breeders love correlations…
• The parameter has a well-defined meaning for bivariate normal, identically distributed pairs
• Precise relationship to prediction (squared correlation: fraction of variance accounted for)
• It is not a measure of “accuracy”. For example, if $y = k + x$:

$Corr(x,y) = \dfrac{Cov(y,x)}{\sqrt{Var(y)Var(x)}} = \dfrac{Var(x)}{\sqrt{Var(x)Var(x)}} = 1, \qquad \text{yet } E(y-x)^2 = k^2$
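A two-line numerical check of that point (simulated data, illustration only):

```python
import numpy as np

# Correlation can be perfect while predictions are biased: y = k + x
rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
k = 5.0
y = k + x
print(np.corrcoef(x, y)[0, 1])   # ~1.0
print(np.mean((y - x) ** 2))     # ~k**2 = 25.0
```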
““You are the only You are the only woman in my lifewoman in my life””
Adam said to Eve
MSE (testing data set)
27
[Figure: traces of the values of the weights (regressions) for the linear and “best” NN architectures: Fat-linear, Milk-linear, Prot-linear, Fat-6 neurons, Milk-5 neurons, Prot-5 neurons]
[Figure: distribution of weights for the linear and “best” NN architectures: Fat-linear, Milk-linear, Prot-linear, Fat-6 neurons, Milk-5 neurons, Prot-5 neurons]
28
1
PENALIZED METHODS for functional inference

• The idea of a “penalty” is ad hoc
• It does not arise “naturally” in classical inference
• It appears very naturally in Bayesian inference
• L2 penalty: equivalent to a Gaussian prior
• L1 penalty: equivalent to a double exponential prior
The concept of penalized likelihood (example in the mixed linear model)

$y = X\beta + Zu + e$

$y|\beta,u,R \sim N(X\beta + Zu, R), \qquad u \sim N(0, G)$

$p(y|\beta,u,R) = (2\pi)^{-\frac{N}{2}}|R|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(y - X\beta - Zu)'R^{-1}(y - X\beta - Zu)\right]$

$p(u|G) = (2\pi)^{-\frac{q}{2}}|G|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}u'G^{-1}u\right]$

Assuming known variance components, the log of the joint density of the data and random effects is termed the “penalized likelihood”:

$l(\beta,u|y,R,G) = K - \tfrac{1}{2}(y - X\beta - Zu)'R^{-1}(y - X\beta - Zu) - \tfrac{1}{2}u'G^{-1}u$

$\dfrac{\partial l}{\partial\beta} = X'R^{-1}(y - X\beta - Zu), \qquad \dfrac{\partial l}{\partial u} = Z'R^{-1}(y - X\beta - Zu) - G^{-1}u$

Setting the derivatives to 0 yields

$\begin{bmatrix}X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1}\end{bmatrix}\begin{bmatrix}\hat\beta\\ \hat u\end{bmatrix} = \begin{bmatrix}X'R^{-1}y\\ Z'R^{-1}y\end{bmatrix}$

The solution to these equations produces the “maximum penalized likelihood” estimates of β and u. These solutions are also BLUE(β) and BLUP(u).
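A minimal direct solver for these equations, assuming known G and R (a sketch; real implementations exploit sparsity rather than dense inverses):

```python
import numpy as np

def solve_mme(X, Z, y, G, R):
    """Henderson's mixed model equations: returns BLUE(beta), BLUP(u)."""
    Ri = np.linalg.inv(R)
    Gi = np.linalg.inv(G)
    C = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                  [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])
    r = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
    sol = np.linalg.solve(C, r)
    p = X.shape[1]
    return sol[:p], sol[p:]
```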
8. Reproducing Kernel Hilbert spaces mixed model
Variational problem: find g(x), a function of the molecular information x (vector of SNP variables), over the entire space of functions, minimizing the “penalized sum of squares” (with some norm under a Hilbert space H of functions, and smoothing parameter λ):

$SS(g(x),\lambda) = \sum_{i=1}^{n}\left[y_i - w_i'\beta - z_i'u - g(x_i)\right]^2 + \lambda\,\|g(x)\|_H^2$

Solution to the variational problem:

$g(\cdot) = \alpha_0 + \sum_{j=1}^{n}\alpha_j K(\cdot, x_j)$

where n = number of individuals with molecular data, $\alpha_j$ is a regression coefficient, and K is the reproducing kernel. This is a reduction of dimension: p (# SNPs) ≫ # individuals. Example of a reproducing kernel:

$K_h(x, x_j) = \exp\left[-\dfrac{(x - x_j)'(x - x_j)}{h}\right]$
In a Euclidean space of dimension n, the dot product between vectors v and w is $\sum_{i=1}^{n}v_iw_i$, and the norm is $\|v\| = \sqrt{\sum_{i=1}^{n}v_i^2}$.

The inner product generalizes the dot product to vectors of infinite dimension. For instance, in a vector space of real functions with domain (a,b), the inner product is

$\langle g_1, g_2\rangle = \int_a^b g_1(x)g_2(x)\,dx, \qquad \|g_1\| = \sqrt{\int_a^b g_1(x)^2\,dx}$

If x is a random variable with pdf p(x):

$\langle g_1, g_2\rangle = \int_a^b g_1(x)g_2(x)p(x)\,dx = E\left[g_1(x)g_2(x)\right]$
4
Definition of a positive-definite kernel (the theory deals with “reproducing” kernels):

$\int\int k(x,t)\,g(x)g(t)\,p(x,t)\,dx\,dt \geq 0$

$K_h = \begin{bmatrix}k(1,1,h) & k(1,2,h) & \cdots & k(1,n,h)\\ k(2,1,h) & k(2,2,h) & \cdots & k(2,n,h)\\ \vdots & \vdots & \ddots & \vdots\\ k(n,1,h) & k(n,2,h) & \cdots & k(n,n,h)\end{bmatrix}$

Positive-definite kernel matrix; symmetric, with k(i,j,h) = k(j,i,h). h = scalar or vector of bandwidth parameters.
MEASURES OF DISTANCE THAT CAN BE USED IN KERNELS: Euclidean, Manhattan, Bray-Curtis

For example, with $x_i = (x_{i1},\ldots,x_{ip})$ and squared Euclidean distance $d_{ij} = \|x_i - x_j\|^2$:

$K(i,j) = \exp(-\theta\,d_{ij})$, with the rate θ scaled, e.g., by $\max_{i,j}\|x_i - x_j\|^2$

[Figure: histogram of evaluations of the Gaussian kernel by value of the bandwidth parameter; a LOCAL KERNEL versus a KERNEL GENERATING STRONG COVARIANCES]

Mixed model representation (enhancing pedigrees…):

$y_i = w_i'\beta + z_i'u + \sum_{j=1}^{n}\alpha_j\exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right] + e_i$
Mixed model representation (enhancing pedigrees…)
Define the row vector $t_i'(h) = \left\{\exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right]\right\}_{j=1,\ldots,n}$ and

$T(h) = \begin{bmatrix}t_1'(h)\\ t_2'(h)\\ \vdots\\ t_n'(h)\end{bmatrix}$

so that $t_i'(h) = k_i'(h)$ and $T(h) = K(h)$.

Do: $y = W\beta + Zu + T(h)\alpha + e$, with $\alpha \sim N\left(0, T^{-1}(h)\sigma_\alpha^2\right)$; h assumed known here.

Then:

$\begin{bmatrix}W'W & W'Z & W'T(h)\\ Z'W & Z'Z + A^{-1}\dfrac{\sigma_e^2}{\sigma_u^2} & Z'T(h)\\ T'(h)W & T'(h)Z & T'(h)T(h) + T(h)\dfrac{\sigma_e^2}{\sigma_\alpha^2}\end{bmatrix}\begin{bmatrix}\hat\beta\\ \hat u\\ \hat\alpha\end{bmatrix} = \begin{bmatrix}W'y\\ Z'y\\ T'(h)y\end{bmatrix}$

h is the bandwidth parameter; $\sigma_e^2/\sigma_\alpha^2$ is the smoothing parameter.
Penalized estimation [1]:

$\hat\alpha = \arg\min_\alpha\left[(y - K\alpha)'(y - K\alpha) + \lambda\,\alpha'K\alpha\right]$

[1] Kimeldorf, G.S. & Wahba, G. (1970).

Bayesian view:

$y = K\alpha + \varepsilon, \qquad \varepsilon \sim N(0, I\sigma_\varepsilon^2), \qquad \alpha \sim N(0, K^{-1}\sigma_\alpha^2), \qquad \lambda = \sigma_\varepsilon^2/\sigma_\alpha^2$
7
How to Choose the Reproducing Kernel? [1]

Model-derived kernel $K(t_i, t_j)$:
- Pedigree models: K = A
- Genomic models: marker-based kinship, or $K = XX'$

Predictive approach: explore a wide variety of kernels
=> Cross-validation
=> Bayesian methods

[1] Shawe-Taylor and Cristianini (2004)

Choosing the RK based on predictive ability [1]:

$K(i,j) = \exp\left[-\theta\,d(x_i, x_j)\right]$, where $d(x_i, x_j)$ = (genetic) distance between individuals

Strategies:
- Grid of values of θ + CV
- Fully Bayesian: assign a prior to θ (computationally demanding)
- Kernel averaging [1], e.g., $K(i,j) = \alpha_1K_1(i,j) + \alpha_2K_2(i,j) + \ldots$

[1] de los Campos et al. (2010) Genetics Research
8
Example 1 of RKHS

$y = X\beta + Za + Zd + e$ (Additive + Dominance), with

$\begin{bmatrix}y_2\\ y_3\\ y_4\\ y_5\end{bmatrix} = \begin{bmatrix}5\\ 3\\ 7\\ 8\end{bmatrix}, \qquad X = \begin{bmatrix}1 & 2\\ 1 & 3\\ 1 & 1\\ 1 & 5\end{bmatrix}, \qquad Z = \begin{bmatrix}0&1&0&0&0\\ 0&0&1&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\end{bmatrix}$

$a = (a_1,\ldots,a_5)'$, $d = (d_1,\ldots,d_5)'$, $e = (e_2,\ldots,e_5)'$. Henderson (1985) assumed $\sigma_a^2 = 5$, $\sigma_d^2 = 4$ and $\sigma_e^2 = 20$, with

$A = \begin{bmatrix}1 & 0 & \frac12 & \frac12 & \frac12\\ 0 & 1 & \frac12 & \frac12 & 0\\ \frac12 & \frac12 & 1 & \frac12 & \frac14\\ \frac12 & \frac12 & \frac12 & 1 & \frac14\\ \frac12 & 0 & \frac14 & \frac14 & 1\end{bmatrix} \quad\text{and}\quad D = \begin{bmatrix}1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & \frac14 & 0\\ 0 & 0 & \frac14 & 1 & 0\\ 0 & 0 & 0 & 0 & 1\end{bmatrix}$

Application of the BLUP paradigm leads to

$\hat\beta' = (5.145,\ 0.241)$
$\hat a' = (0.045,\ -0.192,\ -0.343,\ 0.096,\ 0.242)$
$\hat d' = (0,\ -0.073,\ -0.365,\ 0.162,\ 0.234)$
$\hat g = \hat a + \hat d = (0.045,\ -0.265,\ -0.708,\ 0.259,\ 0.477)$
9
Next, do RKHS with K = A + D as the positive-definite kernel matrix:

$K = A + D = \begin{bmatrix}2 & 0 & \frac12 & \frac12 & \frac12\\ 0 & 2 & \frac12 & \frac12 & 0\\ \frac12 & \frac12 & 2 & \frac34 & \frac14\\ \frac12 & \frac12 & \frac34 & 2 & \frac14\\ \frac12 & 0 & \frac14 & \frac14 & 2\end{bmatrix}$
$y = X\beta + K\alpha + e$, with the incidence matrix formed by rows 2-5 of K (the phenotyped individuals), $\alpha = (\alpha_1,\ldots,\alpha_5)'$ and $e = (e_2,\ldots,e_5)'$.

$\sigma_\alpha^2 = \sigma_a^2 + \sigma_d^2 = 9$; this is 1/λ.

Solutions:

$\hat\beta_0 = 5.289$
$\hat\alpha' = (0.200,\ -0.128,\ -0.781,\ 0.487,\ 0.422)$
PREDICTING FUTURE RECORDS UNDER THE SAME ENVIRONMENTAL CONDITIONS, PARAMETRICALLY:

$y_f = M_P\,\theta_P + e_f, \qquad \theta_P = (\beta', a', d')'$

where $M_P$ contains the covariate rows (1,2), (1,2), (1,3), (1,1), (1,5) and the incidence matrices of a and d for the five individuals (now including individual 1).

The kernel-based genetic values

$\hat g_K = K\hat\alpha = (0.036,\ -0.210,\ -0.569,\ 0.206,\ 0.382)'$

COMPARED WITH

$\hat g = \hat a + \hat d = (0.045,\ -0.265,\ -0.708,\ 0.259,\ 0.477)'$
10
PREDICTION OF FUTURE RECORDS, NON-PARAMETRICALLY:

$y_f = M_K\,\theta_K + e_f, \qquad \theta_K = (\beta', \alpha')'$

with the corresponding rows of K (all five individuals) as covariates.

FOR BOTH APPROACHES THE PREDICTIVE DISTRIBUTION IS

$p(y_{1f},\ldots,y_{5f} \mid y_2,\ldots,y_5, \text{dispersion (smoothing) parameters})$

with mean $M_{\cdot}\,\hat\theta_{\cdot}$ and covariance $M_{\cdot}\,C_{\cdot}^{-1}M_{\cdot}' + I\sigma_e^2$, where $C_{\cdot}$ is the coefficient matrix of the corresponding equations.

For the two procedures the means and SDs of the predictive distributions are:

Individual   P: mean (SD)     K: mean (SD)
1            5.674 (6.020)    5.754 (5.576)
2            5.364 (5.460)    5.286 (5.659)
3            5.162 (5.353)    4.735 (5.561)
4            5.646 (5.834)    5.919 (5.940)
5            6.828 (6.115)    7.061 (6.157)
11
Example 2 of RKHS

[Setup: two-locus genotypic values; phenotypes drawn from exponential and Weibull distributions]

An arbitrary Gaussian kernel was adopted for the RKHS regression, using as covariate a 2 × 1 vector: the number of copies of one allele at each of the two loci, e.g., $x_{AA} = 2$, $x_{Aa} = 1$ and $x_{aa} = 0$. For example, the kernel entry for AABB and AAbb is

$k(x_{AABB}, x_{AAbb}, h) = \exp\left[-\dfrac{(2-2)^2 + (2-0)^2}{h}\right] = \exp\left(-\dfrac{4}{h}\right)$
K(h) over the nine two-locus genotypes, with entries $e^{-S/h}$, where S is the squared Euclidean distance between genotype codes:

        AABB      AABb      AAbb      AaBB      AaBb      Aabb      aaBB      aaBb      aabb
AABB    1         e^{-1/h}  e^{-4/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-8/h}
AABb    e^{-1/h}  1         e^{-1/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}
AAbb    e^{-4/h}  e^{-1/h}  1         e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-8/h}  e^{-5/h}  e^{-4/h}
AaBB    e^{-1/h}  e^{-2/h}  e^{-5/h}  1         e^{-1/h}  e^{-4/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}
AaBb    e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-1/h}  1         e^{-1/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}
Aabb    e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-4/h}  e^{-1/h}  1         e^{-5/h}  e^{-2/h}  e^{-1/h}
aaBB    e^{-4/h}  e^{-5/h}  e^{-8/h}  e^{-1/h}  e^{-2/h}  e^{-5/h}  1         e^{-1/h}  e^{-4/h}
aaBb    e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-2/h}  e^{-1/h}  1         e^{-1/h}
aabb    e^{-8/h}  e^{-5/h}  e^{-4/h}  e^{-5/h}  e^{-2/h}  e^{-1/h}  e^{-4/h}  e^{-1/h}  1
12
[Figure: kernel value $k(\cdot,\cdot;h) = \exp(-S/h)$ plotted against the bandwidth parameter h; curves, from upper to lower, correspond to S = 1, 2, 4, 5, 8]

With h = 1.75 as the bandwidth parameter there are 6 unique entries in the K matrix:
1.0 (diagonal elements; the two individuals have identical genotypes)
0.565 (3 alleles in common in a pair of individuals)
0.319 (2 alleles in common, 1 per locus)
0.102 (2 alleles in common at only one locus)
0.06 (1 allele in common)
0.01 (no alleles shared)
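A quick sketch that rebuilds this matrix from the genotype codes and confirms the six unique entries at h = 1.75:

```python
import numpy as np

# Two-locus genotype codes: copies of one allele at each locus
genos = [(2, 2), (2, 1), (2, 0), (1, 2), (1, 1), (1, 0), (0, 2), (0, 1), (0, 0)]
labels = ["AABB", "AABb", "AAbb", "AaBB", "AaBb", "Aabb", "aaBB", "aaBb", "aabb"]

def gaussian_kernel(genos, h):
    """K[i, j] = exp(-S_ij / h), with S_ij the squared Euclidean
    distance between genotype codes (the 9 x 9 matrix above)."""
    X = np.array(genos, dtype=float)
    S = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-S / h)

K = gaussian_kernel(genos, h=1.75)
print(np.round(np.unique(K), 3))   # six unique values: 0.01 ... 1.0
```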
13
Training set
Testing set
100
IMPORTANT ISSUE TO DISCUSS HERE
TRAINING SET
14
TESTING SET
15
Explanation of results
!!
Example of RKHS 2
16
Source          DF   Anova SS       Mean Square   F Value   Pr > F
a                2    0.00000000     0.00000000    0.00     1.0000
b                2    0.00000000     0.00000000    0.00     1.0000
c                2    0.00000000     0.00000000    0.00     1.0000
a*b              4    0.00000000     0.00000000    0.00     1.0000
a*c              4    0.00000000     0.00000000    0.00     1.0000
b*c              4   13.33333333     3.33333333    1.00     0.4609
Error (a*b*c)    8   26.66666667     3.33333333
Variation between genotypic values is pure interaction
Training set: 27 genotypes, 5 replicates per genotype, residual variance 1.5
Testing set: 50 MC replicates, each as the training set.
Results in training set
17
Results in the testing set.
18
EXAMPLE 3: CHICKEN DATA

• Average progeny “late mortality” (lm) in the low hygiene environment for 200 sires of line 29 (12,167 progenies).
– Pre-corrected for hatch, age of dam, and dam; standardized log-transformed means
• SNPs: filter and wrapper strategy (Long et al., 2007)
– 24 SNPs selected out of over 5000 genotyped on sires
35
DATA
36
[Figure: distribution of progeny means]
19
MODELS
• PEDIGREE: Bayesian E-BLUP
• SNPs, parametric (linear) methods: F-metric model (24 SNPs); Xu (2003) Bayesian regression (1000 SNPs)
• SNPs, non-parametric: RKHS (24 SNPs); kernel regression (24 SNPs)
E-BLUP
E-BLUP

$y = Zu + e, \qquad u \sim N(0, A\sigma_u^2), \qquad e \sim N(0, R\sigma_e^2)$

R is diagonal, with weights based on the number of progeny of sire i: weighted residuals (Varona and Sorensen, 2007).

$\sigma_u^2 \sim \nu_u s_u^2\chi_{\nu_u}^{-2}, \qquad \sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
38
20
F-metric model (least-squares regression); Van der Veen (1959); Zeng et al. (2005)

$y_i = \mu + \sum_{j=1}^{q}x_{ija}\,\alpha_{ja} + e_i, \qquad q = 24 \text{ markers}$

with the genotype code

$x_{ja} = \begin{cases}1 & \text{for a homozygous SNP (say AA)}\\ 0 & \text{for a heterozygous SNP (say Aa)}\\ -1 & \text{for a homozygous SNP (say aa)}\end{cases}$
F-metric model (linear regression)

$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{24})'$

Coefficients: bounded uniform priors (-99999, 99999). Residual variance: scaled inverse chi-squared, $\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$, with $e \sim N(0, R\sigma_e^2)$.

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
21
Bayesian Regression (Xu, 2003) (similar to Bayes A of Meuwissen et al. 2001)

$y_i = \mu + \sum_{j=1}^{1000}x_{ij}b_j + e_i$

1000 SNPs chosen randomly along the genome, coded as

$x_{ij} = \begin{cases}1 & \text{for a homozygous SNP (say AA)}\\ 0 & \text{for a heterozygous SNP (say Aa)}\\ -1 & \text{for a homozygous SNP (say aa)}\end{cases}$

$b_j$ = regression coefficient for SNP j, assumed distributed as $b_j \sim N(0, \sigma_j^2)$, where $\sigma_j^2$ is the variance associated with each SNP, with $\sigma_j^2 \sim \nu s^2\chi_{\nu}^{-2}$.

Residual variance: scaled inverse chi-squared, $\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}$, with $e \sim N(0, R\sigma_e^2)$.

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
The Gibbs sampler: not much new here…

The conditional posterior of the location effects $\theta = (\mu, b')'$ is multivariate normal:

$\theta|ELSE \sim N(C^{-1}r,\ C^{-1})$

$C = \begin{bmatrix}\dfrac{1'1}{\sigma_e^2} & \dfrac{1'X}{\sigma_e^2}\\ \dfrac{X'1}{\sigma_e^2} & \dfrac{X'X}{\sigma_e^2} + \text{Diag}\left(\dfrac{1}{\sigma_j^2}\right)\end{bmatrix}, \qquad r = \begin{bmatrix}\dfrac{1'y}{\sigma_e^2}\\ \dfrac{X'y}{\sigma_e^2}\end{bmatrix}$

The conditional posterior distributions of the variances of the marker effects and of the residual variance are

$\sigma_j^2|ELSE \sim \left(b_j^2 + \nu S^2\right)\chi_{1+\nu}^{-2}, \qquad j = 1,2,\ldots,1000$

$\sigma_e^2|ELSE \sim \left[(y - Xb)'(y - Xb) + \nu_e s_e^2\right]\chi^{-2}$
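Every variance update above has the same "scaled inverse chi-square" form; a minimal helper (the values below are illustrative, not from the analysis):

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_scaled_inv_chi2(df, scale_sum):
    """One draw from scale_sum * inv-chi2(df): the form (SS + nu*S^2)/chi2_df
    used for every variance update above."""
    return scale_sum / rng.chisquare(df)

# e.g., a residual variance update given current residuals e
e = rng.normal(size=200)
nu_e, s2_e = 4.0, 1.0
sigma2_e = draw_scaled_inv_chi2(len(e) + nu_e, e @ e + nu_e * s2_e)
print(sigma2_e)
```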
KERNEL REGRESSION
• Non-parametric regression (Gianola et al., 2006)

$y_i = g(x_i) + e_i$

where $x_i$ is a (q × 1) vector representing the genotype of sire i; $g(x_i)$ is some unknown function of the whole SNP genotype of sire i, representing the expected phenotypic value of sires possessing the q-dimensional SNP combination; and the $\{e_i\}$ are assumed distributed independently of X, centered at zero.
44
23
KERNEL REGRESSION
• Non-parametric regression: how do we estimate g(x)?

$y_i = g(x_i) + e_i$

Based on the definition of a conditional mean,

$g(x) = E(y \mid x) = \dfrac{\int y\,p(y,x)\,dy}{p(x)}$

Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964)
45
iii egy )(x
KERNEL REGRESSION
•Non-parametric regression
46
n
iihiq
xXKynh
dyyyp1
)(1
),(x
n
iihq
xXKnh
p1
)(1
)(xg(x)=
Pure non-parametric regression.
Trinomial Kernel
h: smoothing parameter
24
47
Trinomial KERNEL

K(X - x) = some function measuring distances between focal points or objects (genotypes). With two dummy variates per SNP genotype and mismatch counts $d_{i1}, d_{i2}$ between the observed and focal genotypes accumulated over the q SNPs,

$K_h(x_i, x) = h^{d_{i1}+d_{i2}}\,(1-h)^{2q - d_{i1} - d_{i2}}$

[Table: dummy coding of observed (AA, Aa, aa) versus focal genotypes used to compute the mismatch counts]
REPRODUCING KERNEL HILBERT SPACES REGRESSION
• The penalized sum of squares has the form:

$J[g(x)] = [y - X\beta - g(x)]'R^{-1}[y - X\beta - g(x)] + \lambda\,\|g(x)\|_H^2, \qquad \|g(x)\|_H^2 = \alpha'K_h\alpha$

with solution

$g(x_i|h) = \sum_{j}\alpha_j K_h(x_i, x_j), \qquad \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)', \qquad K_h(x_i, x_j) = \exp\left[-\dfrac{(x_i - x_j)'(x_i - x_j)}{h}\right]$

$K_h = \begin{bmatrix}K_h(x_1,x_1) & K_h(x_1,x_2) & \cdots & K_h(x_1,x_n)\\ \vdots & K_h(x_i,x_j) & & \vdots\\ K_h(x_n,x_1) & K_h(x_n,x_2) & \cdots & K_h(x_n,x_n)\end{bmatrix}$
25
REPRODUCING KERNEL HILBERT SPACES
• Embedding all these expressions in the penalized sum of squares leads to the system

$\begin{bmatrix}1'R^{-1}1 & 1'R^{-1}K_h\\ K_h'R^{-1}1 & K_h'R^{-1}K_h + \lambda K_h\end{bmatrix}\begin{bmatrix}\hat\mu_{\lambda,h}\\ \hat\alpha_{\lambda,h}\end{bmatrix} = \begin{bmatrix}1'R^{-1}y\\ K_h'R^{-1}y\end{bmatrix}$

Bayesian implementation:

$\alpha|h,\sigma_\alpha^2 \sim N(0, K_h^{-1}\sigma_\alpha^2), \qquad e \sim N(0, R\sigma_e^2)$

$\sigma_e^2 \sim \nu_e s_e^2\chi_{\nu_e}^{-2}, \qquad \sigma_\alpha^2 \sim \nu_\alpha s_\alpha^2\chi_{\nu_\alpha}^{-2}$

GIBBS SAMPLING: 200,000 samples; 50,000 burn-in; 10 thinning period
49
Sequence alignment Kernel
50
Sequence alignment KERNEL
• Dynamic programming algorithms: similarity between two DNA sequences, adapted to SNP sequences (Delcher et al., 1999, 2002)
• No need to tune h

$K(x_i, x) = \exp\left[\text{Score}(x_i, x)\right]$
26
Variance component & parameter estimates: posterior mean (s.d.) and 95% HPD interval

Parameter               E-BLUP         F-metric       RKHS           BR (Xu's)
σe²   µ (s.d.)          24.38 (3.88)   29.72 (3.56)   17.07 (3.02)   20.75 (2.91)
      HPD (95%)         16.88-32.04    23.60-37.51    11.78-23.64    15.62-27.09
σu²   µ (s.d.)          0.10 (0.06)    …              …              1.03 (0.71)*
      HPD (95%)         0.03-0.24      …              …              0.67-1.95*
σα²   µ (s.d.)          …              …              0.40 (0.07)    …
      HPD (95%)         …              …              0.28-0.55      …
h²    µ (s.d.)          0.02 (0.01)    …              …              …
      HPD (95%)         0.004-0.050    …              …              …

*For BR: sum of the posterior means of the variances of the 1000 markers.
• Spearman (above diagonal) and Pearson (below diagonal) correlations between posterior means of sire effects
• E-BLUP & Xu (2003) very similar; LR gives the most different ranking.

            E-BLUP   F-metric   Kernel   RKHS   BR
E-BLUP        …        0.52      0.77    0.84   0.91
F-metric     0.56       …        0.48    0.51   0.53
Kernel       0.66      0.38       …      0.93   0.76
RKHS         0.84      0.50      0.79     …     0.84
BR           0.92      0.57      0.58    0.80    …
27
MODEL FIT
– Compute a deviance measure based on mean squared errors:
• A) Regression of adjusted average progeny on sire’s PTA or EGV
• B) Regression of raw average progeny on sire’s PTA or EGV
-Lowess regression (Non-parametric locally weighted regression)
53
MODEL FIT
•Regression of adjusted raw progeny LM on sire’s PTA or EGV
54
28
MODEL FIT
• Less dispersion in non-parametric models
• Lower MSE for kernel regression
• Worst for linear regression (F-metric model)

Still… which model predicts the data best?
55
Predictive ability
• Cross validation
1. 5 subsets, with 20% of the sire means set missing each time at random
2. Estimate PTA or EGV of sires with missing values from the augmented posterior distributions
3. Calculate correlations between actual and inferred average progeny, for each method within subset.
56
29
Predictive ability
• RKHS showed better predictive ability:
– 25% higher reliability than Xu’s method
– 100% higher reliability than E-BLUP
– 233% higher reliability than F-metric (linear regression on markers)
• RKHS better than fixed or random regression on markers and E-BLUP.

Predictive correlations by cross-validation subset:

Subset    E-BLUP   F-metric   Kernel   RKHS    BR
1st        0.03     0.27       0.05    0.27    0.13
2nd        0.18     0.19       0.28    0.37    0.12
3rd        0.18     0.08       0.06   -0.01    0.17
4th       -0.04     0.07       0.13    0.28    0.15
5th        0.17    -0.12       0.23    0.15    0.25
GLOBAL     0.10     0.06       0.14    0.20    0.16
EXAMPLE 4: CHICKEN DATA

2-generation data set: FCR measured on progeny of 333 sires with 3481 SNPs, and on progeny of 61 birds (sons of the above sires)

Models: BAYES A (all markers); RKHS (all markers); RKHS (400 markers filtered using different INFOGAINS); BLUP (Bayes), pedigree information

Training set: 333 sires of sons. Predictive set: 61 sons of sires.
30
[Figure: predictive correlations for “BLUP”, “Bayes A” and RKHS. Note that the confidence bands of the predictive correlations are wide.]
EXAMPLE 5: Application to US Jersey data

US Jersey
- N = 1,762 sires (n = 1,446 used: training, n = 1,130; testing, n = 316)
- Markers: BovineSNP50 BeadChip (50k)
- Traits: PTAs for Milk, Protein Content and Daughter Pregnancy Rate

Models:
- Linear model
- Genomic-based kinship [1]: $K = G = XX'$
- Gaussian kernel: $K(i,j) = \exp\left[-\theta\,d(x_i, x_j)\right]$, with θ fixed over a grid of values
- Kernel averaging

[1] Hayes and Goddard (2008) Journal of Animal Science.
32
Application to US Jersey data: PMSE (MSE of predictive residuals), one trait:

Model                   PMSE
Linear Model            0.588
Marker-based Kinship    0.590
Kernel average          0.588

[Figure: residual variance by model]

PMSE (MSE of predictive residuals), another trait:

Model                   PMSE
Linear Model            0.459
Marker-based Kinship    0.460
Kernel Average          0.421

[Figure: residual variance by model]
33
Empirical Application
• Kernel averaging seems to be an effective strategy for kernel selection
• In this example (PTAs): the linear model, kinship and kernel averaging performed similarly
• Not necessarily so for other traits and other populations
Predictive ability of models for genomic selection in wheat [1]. N = 599; trait: grain yield (4 environments); models: RKHS and Bayesian LASSO (BL).

Predictive correlation:

Environment    BL      RKHS    Difference (%)
E1             0.518   0.601   +16%
E2             0.493   0.494   0%
E3             0.403   0.445   +10%
E4             0.457   0.524   +15%

[1] Crossa et al. (2010) Genetics.
Radial Basis Functions: another form of non-parametric regression

Note that the basis functions are adaptive: they depend on parameters (θ).

[Matrix form of the model shown in the slide]

The kernel matrix Kθ need not be positive-definite here (contrary to RKHS regression).
35
Genotypes can be coded, for example, as:

          SNP1  SNP2  SNP3        SNP1  SNP2  SNP3
Ind. 1    AA    Gg    TT     →    2     1     2
Ind. 2    AA    gg    tt     →    2     0     0

$k(x_1, x_2) = \exp\left[-\sum_{k=1}^{3}\theta_k\left(x_k^{(1)} - x_k^{(2)}\right)^2\right] = \exp\left[-\theta_1(2-2)^2 - \theta_2(1-0)^2 - \theta_3(2-0)^2\right] = \exp(-\theta_2 - 4\theta_3)$

$k(x_2, x_1) = \exp\left[-\sum_{k=1}^{3}\theta_k\left(x_k^{(2)} - x_k^{(1)}\right)^2\right] = \exp(-\theta_2 - 4\theta_3)$
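A one-function sketch of this kernel (the θ values below are arbitrary, chosen only to show the computation):

```python
import numpy as np

def rbf_kernel(x1, x2, theta):
    """Adaptive RBF kernel with one bandwidth per SNP, as above:
    k(x1, x2) = exp(-sum_k theta_k * (x1_k - x2_k)^2). Symmetric by construction."""
    x1, x2, theta = map(np.asarray, (x1, x2, theta))
    return float(np.exp(-(theta * (x1 - x2) ** 2).sum()))

# Ind. 1 = (2, 1, 2), Ind. 2 = (2, 0, 0); with theta = (0.1, 0.2, 0.3):
# exponent = -(0.2*1 + 0.3*4) = -1.4
print(rbf_kernel([2, 1, 2], [2, 0, 0], [0.1, 0.2, 0.3]))   # exp(-1.4) ≈ 0.2466
```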
Bayesian structure: priors
Hyper-parameters are: a, ν, δ1 , δ2
36
Metropolis-Hastings
37
SIMULATION (TO EVALUATE MODEL BEHAVIOR)
THIS SIMULATION IS UNREALISTIC BECAUSE THE VARIANCE OF THE GENETIC SIGNAL IS MUCH, MUCH LARGER THAN THAT OF THE RESIDUAL DISTRIBUTION (LARGE BROAD-SENSE HERITABILITY)
8 SNPs but only 3 have an effect
38
Over-fitting
39
θ estimates in RBF
Posterior mean estimates of SNP effects
WONDERFUL PREDICTIONS
UNCERTAINTY VERY SMALL, BECAUSE OF THE SMALL RESIDUAL VARIANCE
RBF (heterogeneous θ) better than RBF (homogeneous θ), better than BAYES A
40
9. NON-PARAMETRIC BREEDING VALUE

Standard (parametric) approach:

$y = X\beta + a + d + i_{aa} + i_{ad} + i_{dd} + e = X\beta + g + e$

$V(y) = V(g) + V(e), \qquad V(g) = A\sigma_a^2 + D\sigma_d^2 + A\#A\,\sigma_{aa}^2 + A\#D\,\sigma_{ad}^2 + D\#D\,\sigma_{dd}^2$

$E(a|g) = Cov(a,g')\,V(g)^{-1}g = \sigma_a^2A\,V(g)^{-1}g$

$E(a|y) = E_{g|y}\left[E(a|g)\right] = E_{g|y}\left[\sigma_a^2A\,V(g)^{-1}g\right] = \sigma_a^2A\,V(g)^{-1}E(g|y)$

Non-parametric approach:

$y = X\beta + K\alpha + e$, where $K\alpha$ is the non-parametric genomic value (counterpart of g). Let $g_K = K\alpha$ and define

$a_K = \sigma_a^2A\left[Var(K\alpha)\right]^{-1}K\alpha = \sigma_a^2A\,(K\sigma_\alpha^2)^{-1}K\alpha = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A\alpha$

(using $\alpha \sim (0, K^{-1}\sigma_\alpha^2)$, so that $Var(K\alpha) = K\sigma_\alpha^2$). Then

$Var(a_K) = \left(\dfrac{\sigma_a^2}{\sigma_\alpha^2}\right)^2 AK^{-1}A\,\sigma_\alpha^2$

Equal to $A\sigma_a^2$ only if $K^{-1} = A^{-1}$ and $\sigma_\alpha^2 = \sigma_a^2$.

If A = I, the non-parametric breeding values are still correlated:

$Var(a_K) = \left(\dfrac{\sigma_a^2}{\sigma_\alpha^2}\right)^2 K^{-1}\sigma_\alpha^2$
41
COROLLARIES

Non-parametric breeding value of yet-to-be-born progeny:

$a_{prog,K} = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A_{prog,\cdot}\,\alpha$

Non-parametric breeding value of individuals not phenotyped or genotyped:

$a_{-,K} = \dfrac{\sigma_a^2}{\sigma_\alpha^2}A_{-,\cdot}\,\alpha$

Semi-parametric approach:

$y = X\beta + u + K\alpha + e$, with the standard assumption $u \sim (0, A\sigma_a^2)$, and tune some kernel such as $K = D + A\#A + A\#D + D\#D$
Assessment of Cross-validation Strategies for Genomic Prediction in Cattle
M. Erbe, E. C. G. Pimentel, A. R. Sharifi and H. Simianer
Department of Animal Sciences, Animal Breeding and Genetics Group, Georg-August-University Göttingen, Germany
03.08.2010 WCGALP Leipzig 2
Introduction
Genomic breeding value estimation
– Which model is appropriate?
– Which model is better when comparing two of them?
→ validation is necessary
– normally: simulation studies with adequate design
– but when using real data: limited number of animals which are genotyped and have phenotypes
03.08.2010 WCGALP Leipzig 3
Introduction
real data: limited number of animals which are genotyped and have phenotypes
→ Cross-validation procedures
– training of the model with a reference set and prediction for
animals in the validation set (phenotypes masked)
→ How to split the data set?
→ How many animals for the validation set?
03.08.2010 WCGALP Leipzig 4
Data set and models
• 2294 HF bulls, 39’557 SNPs
• “phenotypes”: EBVs for milk yield and somatic cell score
• Genomic BLUP models
1) Model “A+G”:

$y = Xb + Zu + Wg + e$

with $u \sim N(0, A\sigma_u^2)$, where A = pedigree-based relationship matrix, and $g \sim N(0, G\sigma_g^2)$, where G = genomic relationship matrix following VanRaden (2008), based on 39’557 SNPs
03.08.2010 WCGALP Leipzig 5
Data set and models
• 2294 HF bulls, 39’557 SNPs
• “phenotypes”: EBVs for milk yield and somatic cell score
• Genomic BLUP models
2) Model “G*”:

$y = Xb + Wg + e$

with $g \sim N(0, G^*\sigma_g^2)$, where G* = genomic relationship matrix based on 4‘945 SNPs (1/8 of the available SNPs, randomly chosen)
03.08.2010 WCGALP Leipzig 6
Cross-validation procedure
2294 bulls (nall)
2194 bulls
100 bulls
step 1
training validation
randomsamples
Genotyped animalswith phenotypic information
03.08.2010 WCGALP Leipzig 7
Cross-validation procedure
2294 bulls (nall)
2094 bulls
200 bulls
step 2
training validation
Genotyped animalswith phenotypic information
100 bulls(randomly chosen)
03.08.2010 WCGALP Leipzig 8
Cross-validation procedure
2294 bulls (nall)
2294-n*100 bulls
n*100 bulls
step n
training validation
Genotyped animalswith phenotypic information
100 bulls(randomly chosen)
03.08.2010 WCGALP Leipzig 9
Cross-validation procedure
• step 1 (100 bulls in the validation set) to step 15 (1500 bulls in the validation set) → the whole process was repeated 100 times
• for each step in each replicate:
→ fitting the model (including variance component estimation) with the training set
→ prediction of the genomic breeding values for the bulls in the validation set
→ accuracy of prediction: calculation of the correlation between true breeding value (TBV) and predicted genomic breeding value (GEBV) for the bulls in the validation set
03.08.2010 WCGALP Leipzig 10
Cross-validation procedure (continued): since TBV is unobserved, the accuracy of prediction is derived as

$r_{TBV,GEBV} = \dfrac{r_{GEBV,EBV}}{r_{EBV,TBV}}$
03.08.2010 WCGALP Leipzig 11
Results: Performance of a model
• Performance of a model: how precisely can we describe the prediction ability of a model when using cross-validation?
→ look at the median of the correlations obtained with 100 replicates
→ look at the variation over the 100 replicates within each step
03.08.2010 WCGALP Leipzig 12
Results – trait: somatic cell score
03.08.2010 WCGALP Leipzig 13
Results – trait: milk yield
03.08.2010 WCGALP Leipzig 14
Results: Comparison of models
• What kind of splitting is best when the aim is to compare two models?
→ comparison of the correlation coefficients of the two models in each step in each replicate (H0: identity of the correlation coefficients) (Meng et al., 1992)
→ consideration of the mean of p-values obtained for each step
03.08.2010 WCGALP Leipzig 15
Results: Comparison of models
• What kind of splitting is best when the aim is to compare two models?
α = 0.01
03.08.2010 WCGALP Leipzig 16
Results: Influence of age structure
• Sorting bulls by age (model A+G)
→ step n: n×100 youngest bulls in the validation set
03.08.2010 WCGALP Leipzig 17
Conclusions
• highest correlations with small validation sets; but: also very high variation between repeats when the validation set is small
• for comparison of two models: larger validation sets (correlations estimated more stably)
• sorting by age → prediction abilities at the lower limit
→ overestimation of the prediction ability with randomly chosen validation sets
03.08.2010 WCGALP Leipzig 18
Acknowledgment
• Synbreed – Synergistic Plant and Animal Breeding
• FUGATOplus GenoTrack
• Vereinigte Informationssysteme Tierhaltung, Verden
Thank you for your attention!
03.08.2010 WCGALP Leipzig 20
Comparison of correlation coefficients
• Comparison of two correlation coefficients when the samples are not independent (Meng et al., 1992):

$Z = (\dot z_1 - \dot z_2)\sqrt{\dfrac{n-3}{2(1-r_x)h}}$

where $\dot z_1, \dot z_2$ = Fisher-transformed correlation coefficients,

$\dot z = 0.5\ln\left(\dfrac{1+r}{1-r}\right)$

$r_x$ = correlation between the GEBVs of model A+G and the GEBVs of model G*, and

$h = \dfrac{1 - f\bar r^2}{1 - \bar r^2}, \qquad f = \dfrac{1 - r_x}{2(1 - \bar r^2)}, \qquad \bar r^2 = \dfrac{r_1^2 + r_2^2}{2}$
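A compact implementation of this dependent-correlation test (a sketch; the input values are illustrative):

```python
import numpy as np
from scipy import stats

def meng_z_test(r1, r2, rx, n):
    """Meng, Rosenthal & Rubin (1992) z-test for two dependent correlations
    r1, r2 sharing a common variable; rx = correlation between predictors."""
    rbar2 = (r1**2 + r2**2) / 2.0
    f = min((1.0 - rx) / (2.0 * (1.0 - rbar2)), 1.0)
    h = (1.0 - f * rbar2) / (1.0 - rbar2)
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher transform
    z = (z1 - z2) * np.sqrt((n - 3.0) / (2.0 * (1.0 - rx) * h))
    return z, 2.0 * stats.norm.sf(abs(z))        # two-sided p-value

print(meng_z_test(r1=0.60, r2=0.55, rx=0.80, n=500))
```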
1
BAYES A, BAYES B, LASSO
THE CURSE OF THE BAYESIAN ALPHABET
Featuring
Halle Berry,as “A”
Kim-Jong Il,as “Bayes”
Scarlett Johanssonas “B”
AND…
Sarah Palinsings
“To Russia with love, a view from
my igloo”
2
RECALL FROM EARLY PART OF COURSE
REASONABLE BAYESIAN MODEL
• For any parameter, one must be able to “kill” the prior asymptotically
• For any parameter, the statistical distance between prior and posterior (and therefore conditional posterior) must go to infinity
• If this distance has a finite upper bound, it means that the prior is influential
• One must be able to reduce the statistical entropy conveyed by the prior by a sizable amount. If the reduction is tiny, the prior is very influential
3
STATE OF KNOWLEDGE (in a finite sample)

Minimum: prior
Maximum: conditional posterior
Intermediate: marginal posterior

THE PROCESS OF DECONDITIONING CONSUMES INFORMATION ABOUT THE FOCAL POINT

Meaning: the conditional posterior is the best world to live in
4
BAYES A + BAYES B (as I understand them)

Linear model proposed by Meuwissen et al. (2001). Scalar form:

$y_i = \mu + \sum_{j=1}^{p}x_{ij}b_j + e_i, \qquad i = 1,2,\ldots,n; \quad n \ll p$

where $b_j$ is the additive effect of SNP j and $x_{ij}$ is the code for the genotype of SNP j (x = -1, 0, 1). Matrix form:

$y = 1\mu + Xb + e, \qquad y|\mu,X,b \sim N(1\mu + Xb,\ I\sigma_e^2)$

$y_i|\mu,x_i,b,\sigma_e^2 \sim N\left(\mu + \sum_{j=1}^{p}x_{ij}b_j,\ \sigma_e^2\right)$
5
BAYES A (Meuwissen et al., 2001). The priors:

$\mu \sim \text{uniform}$

$\sigma_e^2 \sim \nu_e S_e^2\chi_{\nu_e}^{-2}$

$b_j|\sigma_j^2 \sim N(0, \sigma_j^2), \qquad j = 1,2,\ldots,p$

$\sigma_j^2|\nu,S^2 \sim \nu S^2\chi_{\nu}^{-2} \text{ for all } j$

Note: each SNP has a variance (think of a sire model in which each sire effect has a variance). The hyper-parameters ν and S², specified arbitrarily, will control the extent of shrinkage. Question: does their influence vanish asymptotically?

Marginal prior:

$p(b_j|\nu,S^2) = \int_0^\infty N(0,\sigma_j^2)\,p(\sigma_j^2|\nu,S^2)\,d\sigma_j^2 \propto \int_0^\infty (\sigma_j^2)^{-\frac12}\exp\left(-\dfrac{b_j^2}{2\sigma_j^2}\right)(\sigma_j^2)^{-\frac{\nu+2}{2}}\exp\left(-\dfrac{\nu S^2}{2\sigma_j^2}\right)d\sigma_j^2 \propto \left[1 + \dfrac{b_j^2}{\nu S^2}\right]^{-\frac{\nu+1}{2}} \rightarrow t(0,\nu,S^2)$

The prior of a marker effect is a t-distribution with known scale and df.

MARGINALLY: NO DIFFERENTIAL SHRINKAGE WHATSOEVER IN BAYES A
6
Bayes B is Bayesianly “STRANGE”

Bayes B:

$b_j|\sigma_j^2 = \begin{cases}\text{point mass at some constant } k & \text{if } \sigma_j^2 = 0\\ N(0,\sigma_j^2) & \text{if } \sigma_j^2 > 0\end{cases}$

$\sigma_j^2|\pi = \begin{cases}0 & \text{with probability } \pi\\ \nu S^2\chi_{\nu}^{-2} & \text{with probability } 1-\pi\end{cases}$

1. Meuwissen takes the constant = 0
2. Meuwissen assumes π is known, e.g., 0.95
3. Recall: if a prior variance is 0, this means complete certainty

Joint density:

$p(b_j,\sigma_j^2|\pi) = \begin{cases}b_j = k \text{ and } \sigma_j^2 = 0 & \text{with probability } \pi\\ N(0,\sigma_j^2)\times\nu S^2\chi_{\nu}^{-2} & \text{with probability } 1-\pi\end{cases}$

Marginal prior:

$p(b_j|\pi) = \begin{cases}b_j = k & \text{with probability } \pi\\ \int_0^\infty N(0,\sigma_j^2)\,p(\sigma_j^2|\nu,S^2)\,d\sigma_j^2 & \text{with probability } 1-\pi\end{cases}$
7
Further, as before, the integral yields a t-distribution:

$p(b_j|\pi) = \begin{cases}b_j = k & \text{with probability } \pi\\ t(0,\nu,S^2) & \text{with probability } 1-\pi\end{cases}$

Then: PRIOR = MIXTURE OF A POINT MASS AND OF A t-DISTRIBUTION. BAYES B PUTS THE MASS AT 0 (IF NOT 0, THIS GETS ABSORBED INTO THE GENERAL MEAN)

MARGINALLY: NO DIFFERENTIAL SHRINKAGE IN BAYES B EITHER

BAYES A IS A SPECIAL CASE OF BAYES B (π = 0)

Meaning: if Bayes A has a flaw, this carries to Bayes B
8
A Gibbs sampler for Bayes A (element-wise sampling). Note: the form of the implementation is just an algorithmic matter; it is immaterial with respect to the issues.

Sampling the mean (a flat prior for the mean, or for the fixed effects, is not influential):

$\mu|ELSE \sim N\left(\dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p}x_{ij}b_j\right),\ \dfrac{\sigma_e^2}{n}\right)$
9
Sampling the residual variance:

$\sigma_e^2|ELSE \sim \dfrac{\sum_{i=1}^{n}\left(y_i - \mu - \sum_{j=1}^{p}x_{ij}b_j\right)^2 + \nu_e S_e^2}{\chi_{n+\nu_e}^2}$

The prior can be “killed” simply by increasing sample size: the degrees of freedom go to n, and the residual sum of squares dominates the weighted average as n increases.
Sampling the marker effects:

$b_j|ELSE \sim N\left(\dfrac{\sum_{i=1}^{n}x_{ij}\left(y_i - \mu - \sum_{j'\neq j}x_{ij'}b_{j'}\right)}{\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2}},\ \dfrac{\sigma_e^2}{\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2}}\right), \qquad j = 1,2,\ldots,p$

Kill the prior simply by increasing sample size: the effect of the shrinkage ratio vanishes, since

$\sum_{i=1}^{n}x_{ij}^2 + \dfrac{\sigma_e^2}{\sigma_{b_j}^2} \rightarrow \sum_{i=1}^{n}x_{ij}^2$
10
Sampling the variance of the marker effects:

$\sigma_{b_j}^2|ELSE \sim \dfrac{b_j^2 + \nu S^2}{\chi_{1+\nu}^2}, \qquad j = 1,2,\ldots,p$

• The prior cannot be killed here. One can increase the number of data points or of markers ad nauseam and gain only one degree of freedom, always
• Recall that, in the conditional posterior, all other parameters are known (i.e., they are assigned values)
• Since one must de-condition, the true posterior actually moves less than one degree of freedom away from the prior
• Prior df: very influential (typically very small)
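Putting the four conditionals together gives the whole sampler; a compact sketch under the priors above (hyper-parameter values are placeholders, not recommendations):

```python
import numpy as np

def bayes_a_gibbs(X, y, nu=4.0, S2=0.01, nu_e=4.0, S2_e=1.0, n_iter=2000, seed=0):
    """Element-wise Gibbs sampler for Bayes A, following the four
    conditional distributions above. Returns the last draws only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu, b = y.mean(), np.zeros(p)
    s2e, s2b = y.var(), np.full(p, S2)
    xtx = (X ** 2).sum(axis=0)
    e = y - mu - X @ b                       # current residuals
    for _ in range(n_iter):
        # mean: N(mean(y - Xb), s2e/n)
        mu_new = rng.normal((e + mu).mean(), np.sqrt(s2e / n))
        e += mu - mu_new; mu = mu_new
        # marker effects, one at a time
        for j in range(p):
            e += X[:, j] * b[j]              # remove SNP j from residuals
            c = xtx[j] + s2e / s2b[j]
            b[j] = rng.normal((X[:, j] @ e) / c, np.sqrt(s2e / c))
            e -= X[:, j] * b[j]
        # variances: scaled inverse chi-square draws
        s2b = (b ** 2 + nu * S2) / rng.chisquare(nu + 1.0, size=p)
        s2e = (e @ e + nu_e * S2_e) / rng.chisquare(n + nu_e)
    return mu, b, s2e
```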
STATE OF KNOWLEDGE
Minimum PriorMaximum Conditional posteriorIntermediate Marginal posterior
11
For df > 6, the relative variability of the posterior distribution of the variance of a SNP effect essentially copies that of the prior distribution
ENTROPY CALCULATIONS ($\sigma_{a_k}^2$ = variance of a marker effect; sorry, change of notation)

Prior entropy:

$H\left[\sigma_{a_k}^2|\nu,S^2\right] = -\int \log\left[p(\sigma_{a_k}^2|\nu,S^2)\right]p(\sigma_{a_k}^2|\nu,S^2)\,d\sigma_{a_k}^2$

a closed-form expression in ν, S², Γ(·) and the digamma function.

Entropy of the conditional posterior (learning from data reduces entropy; the entropy of the marginal posterior cannot be calculated):

$H\left[\sigma_{a_k}^2|ELSE\right] = -\int \log\left[p(\sigma_{a_k}^2|ELSE)\right]p(\sigma_{a_k}^2|ELSE)\,d\sigma_{a_k}^2$

the same expression with ν + 1 degrees of freedom and scale $a_k^2 + \nu S^2$.
12
Relative information gain:

$RIG = \dfrac{H\left[\sigma_{a_k}^2|\nu,S^2\right] - H\left[\sigma_{a_k}^2|ELSE\right]}{H\left[\sigma_{a_k}^2|\nu,S^2\right]}$

For $a_k = 0$, S = 1 and ν = 4: RIG = 0.125
For $a_k = 0$, S = 1 and ν = 10: RIG = 6.51 × 10⁻²
For $a_k = 0$, S = 1 and ν = 100: RIG = 9.60 × 10⁻³
Metaphorically: the prior is totalitarian in Bayes A (B)
STATISTICAL DISTANCE BETWEEN CONDITIONAL POSTERIOR AND PRIOR (KULLBACK-LEIBLER)

IF KL IS LARGE, THEN LEARNING BEYOND THE PRIOR HAS TAKEN PLACE. KL SHOULD GO TO INFINITY AS DATA ACCUMULATE IN ANY REASONABLE BAYESIAN MODEL

[Figure: specific distance, at a given variance, from the prior]
13
[Figure: logarithmic distance between conditional posterior and prior for (df = 4, S = 1), (df = 10, S = 1) and (df = 100, S = 1), at $a_k = 0$. The logarithmic distances are indistinguishable at any value of the variance; if 10 markers are allowed to share the same variance, the logarithmic distance increases sharply.]

KULLBACK-LEIBLER DISTANCES BETWEEN CONDITIONAL POSTERIOR AND PRIOR:
1) 7.33 × 10⁻² for ν = 4, S = 1, p = 1 and $a_k = 0$
2) 2.64 × 10⁻² for ν = 10, S = 1, p = 1 and $a_k = 0$
3) 2.52 × 10⁻³ for ν = 100, S = 1, p = 1 and $a_k = 0$

If 10 markers are allowed to share the same variance, KL = 4.47. Relative to (1), the KL distance increases 61 times…
14
[Figure: effect of the scale parameter of the prior; prior shown as open boxes]
15
BAYES A (B)
• The prior always matters
• The effect of the prior is via the extent of shrinkage of marker effects
• The extent of shrinkage can be manipulated, with the data essentially providing no control
• Statistically greedy models (the same will apply to any model assigning marker-specific variances)
SIMULATION
(never take a simulation too seriously)
16
RESURRECTION OF BAYES A
(If additive model holds, it may give sensible inferences about marker effects)
17
POSTERIOR DISTRIBUTION OF SNP VARIANCES UNDER UNIFORM PRIOR
POSTERIOR DISTRIBUTION OF SNP VARIANCES UNDER FOUR S.INV. CHI-SQUARE PRIORS
18
THE GOOD NEWS
Posterior distribution of SNP effects
Bayes A “picks Up the 3 relevant SNPs
19
DEATH-RESURRECTION-DEATH
Bayes A may give a distorted picture if there is non-linearity or
non-additivity
20
NONE OF THE RELEVANT GENETIC SIGNALS ARE CAPTURED
BAYES A vs. BAYES L
(Bayes L= Bayesian Lasso)
21
In the Bayesian Lasso, marker effects are assigned double exponential distributions:

$p(\beta|\lambda) = \prod_{j=1}^{p}\dfrac{\lambda}{2}\exp\left(-\lambda|\beta_j|\right)$
[Figure: density of a normal and of a double-exponential distribution]

EACH MARKER HAS THE SAME D.E. DISTRIBUTION: NO DIFFERENTIAL SHRINKAGE EITHER
[Figure: graphical representation of the hierarchical structure of the Bayesian LASSO and Bayes A. Bayesian LASSO: $p(y|\beta,\sigma^2)$ → $\beta_j|\tau_j^2,\sigma^2$ → $\tau_j^2|\lambda^2$ → $\lambda^2$. Bayes A: $p(y|\beta,\sigma^2)$ → $\beta_j|\sigma_j^2$ → $\sigma_j^2|df, S$.]
22
Assume an exponential distribution of the residual variances:

$p(\sigma_{e_i}^2|\lambda) = \dfrac{\lambda^2}{2}\exp\left(-\dfrac{\lambda^2\sigma_{e_i}^2}{2}\right)$

Mix (as in the t-model):

$p(y_i|\mu_i,\lambda) = \int_0^\infty N(y_i|\mu_i,\sigma_{e_i}^2)\,\dfrac{\lambda^2}{2}\exp\left(-\dfrac{\lambda^2\sigma_{e_i}^2}{2}\right)d\sigma_{e_i}^2, \qquad i = 1,2,\ldots,n$

Assume $\lambda^2|a,b \sim Gamma(a,b)$.

Implementation is as in a t-model, but transforming $\sigma_{e_i}^2 \rightarrow \dfrac{1}{\sigma_{e_i}^2}$: the full conditional of $\sigma_{e_i}^{-2}$ is an inverse Gaussian (Wald) distribution, with

$E\left(\sigma_{e_i}^{-2}|ELSE\right) = \sqrt{\dfrac{\lambda^2}{(y_i - x_i'\beta - z_i'u)^2}}, \qquad Var\left(\sigma_{e_i}^{-2}|ELSE\right) = \dfrac{E^3\left(\sigma_{e_i}^{-2}|ELSE\right)}{\lambda^2}$
23
ANOTHER SIMULATION
(never take another simulation too seriously)

$y_i = \sum_{j=1}^{280}x_{ij}\beta_j + e_i, \qquad i = 1,\ldots,300$

280 markers; residuals assumed N(0,1).

Pearson correlation between marker genotypes (average across markers and 100 Monte Carlo simulations) by scenario (X0: low LD; X1: high LD):

Scenario    Adjacency 1    2        3        4
X0          0.007          0.002   -0.002    0.013
X1          0.722          0.567    0.450    0.356
Only 10 markers had effects; 270 had no effect on the simulated trait.

[Table: positions (chromosome and marker number) and effects of the 10 markers with effects, out of 280 markers]
NINE SPECIFICATIONS OF BAYES A

Prior df \ Prior scale:   10⁻⁵   10⁻³   5×10⁻²
0                         (1)    (2)    (3)
½                         (4)    (5)    (6)
1                         (7)    (8)    (9)
25
26
Ability to uncover relevant genomic regions
• For each method and replicate, markers were ranked on the absolute values of their posterior means
• For each effect, a dummy variable was created: 1 if the marker (or any of its 4 flanking markers) ranked in the top 20, and 0 otherwise
• Averaging over markers and replicates gives an index of “retrieved regions”
Bayes A was affected by the priors: worse performance in settings 1, 4 and 7. Bayes A (settings 2, 3, 6) and the Lasso almost doubled the ability.
Simple fixes of Bayes A
• Assign the same variance to all markers (a trivial Bayesian regression problem)
• Assign the same variance to groups of markers (e.g., chromosomes or genomic regions): a model comparison issue
• Non-informative priors can be assigned to S and to the degrees of freedom ν: just an algorithmic matter
27
Issues and questions
• Bayes A can be “fixed”, but that may not be the best thing to do. Open question…
• Bayes A, as is, may still have good predictive (out-of-sample) behavior, even though it is not completely defensible
• Bayes B is Bayesianly ill-posed. If you do not believe me, check with local Bayesian statisticians…
• More reasonable: a mixture at the level of the effects (not of the variances): I believe this is what the Dutch are doing (and the Iowa people with beef cattle)
SOME POSTERIOR THOUGHTS
• Prediction is a different ball game from inference
• Do not spend a lot of time inventing priors or fancy models. An additive model may just do well…
• Spend more time in cross-validation and less in simulation. Now there is data!!
• Markers have an ascertainment problem (Chikhi, 2008): simulations may give a distorted picture
• One cannot deal with complexity with parametric methods, and non-parametric methods are almost as good as parametric ones even when the assumptions hold
• SNP-assisted genetic evaluation is holding up well, and seems to outperform BLUP in most cases
28
10. Conclusions10. Conclusions
• Challenges to parametric methods posed by genomic and post-genomic data
• Future: Shift in paradigm. Semi-parametric and “machine learning” type techniques?
"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so."
(Mark Twain)
Takezawa (2005): model-free procedures can have better predictive performance even if the “true” model is used to generate the data and is then fitted to them.