btry 7210: topics in quantitative genomics and...

BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine [email protected] Spring 2015, Thurs.,12:20-1:10


Page 1: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

BTRY 7210: Topics in Quantitative Genomics and Genetics

Jason Mezey

Biological Statistics and Computational Biology (BSCB)

Department of Genetic Medicine

[email protected]

Spring 2015, Thurs.,12:20-1:10


Why you’re here (=eQTL):

Spring 2015 Course Announcement

BTRY 7210 Topics in Quantitative Genomics and Genetics

Professor: Jason Mezey

Biological Statistics and Computational Biology

Time: Thurs. 12:20-1:10 PM

Room: 224 Weill Hall

COURSE DESCRIPTION: We will consider the problem of identifying and leveraging expression Quantitative Trait Loci (eQTL) when analyzing genome-wide data. The class will include a TBD ratio of lectures by the instructor to reading / discussion of papers. Students taking the class for a grade will be required to produce a single mini-critique report of current papers touching on topics covered in the class. General topic areas that will be considered include: probability and statistics necessary for understanding eQTL analysis; the basics of eQTL analysis; quality control and model checking in eQTL analysis; advanced eQTL analysis techniques, including hidden factor analysis; extending analyses to xQTL; biological value and interpretation of eQTL; combining eQTL and other bioinformatics data for biological discovery; and the structure and application of probabilistic graphical models that make use of eQTL for network analysis and discovery.

GRADING: S/U or Audit.

CREDITS: 1

SUGGESTED PREREQUISITES: Quantitative Genomics and Genetics (BTRY 6830 / 4830) and/or background in statistics and/or background in eQTL, GWAS or related genetic mapping analyses


Today

• Logistics (time/locations, listserv, website, format, requirements, registering, deciding on topics to cover)

• Intuitive introduction to expression Quantitative Trait Loci (eQTL) and why you should care

• Basic concepts in biology, statistics, and eQTL analysis


Logistics 1

• This class will take place in 224 Weill Hall, from 12:20-1:10PM every Thurs., unless I announce otherwise (!!)

• Each week, I will be on campus or will join by video-conference

• Format: a combination of lectures and possibly discussion of papers (that I will select) focused on specific subjects

• Updated info. on the class website (bottom of “classes” page): http://mezeylab.cb.bscb.cornell.edu/

• Make sure you are on the listserv (!!!) (email me to join / remove): [email protected]


Logistics II

• Who you are: anyone interested in eQTL + (minimally) a working understanding of genetics and statistics - ideally, you took my class last semester...

• May I sit in? Yes! Come to as many or as few classes as you wish

• Taking the class for a grade (S/U or Audit):

• Please register officially

• If you Audit the class, there are no specific requirements

• If you take the class for S/U, you must attend and produce a mini-report (requiring ~10-20 hours of time) by the end of the semester

Deciding on topics to cover

• There are many topics that I think would be relevant and interesting

• I am therefore going to allow you all to vote on the possible subjects, choosing from the following:

• Format: a combination of lectures and possibly discussion of papers (that I will select) focused on specific subjects:

• Statistical and analysis methodology for identifying eQTL

• Bioinformatic analyses using eQTL (=integrating with different data types and using them to make inferences about complex phenotypes)

• Probabilistic Graphical Models (PGMs) and how these can make use of eQTL for network discovery

• Another topic of interest to you...

• When you email me for the listserv ([email protected]) also send me your top preference of a topic to focus on (relating to eQTL!!)


Questions about logistics?


What is an eQTL = expression Quantitative Trait Locus? (intuition)

[Figure: xQTL identification via genotype-phenotype association. Left panel, “eQTL (p < 10^-30)”: ERAP2 expression (y-axis, 3.5 to 6.0) by rs27290 genotype (A/A, A/G, G/G). Right panel, “no eQTL (n.s.)”: ERAP2 expression (y-axis, 3.5 to 6.0) by rs1908530 genotype (T/T, T/C, C/C).]
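The intuition in the two panels can be sketched numerically: at an eQTL marker the mean expression shifts across genotype classes, while at a non-eQTL marker it does not. The expression values below are made up for illustration (they are not the ERAP2 data from the figure), and the helper names `genotype_means` and `mean_range` are hypothetical.

```python
from statistics import mean

# Hypothetical expression values grouped by genotype (illustrative numbers only,
# chosen to sit inside the 3.5-6.0 range of the figure's y-axis).
eqtl_marker = {"A/A": [4.0, 4.2, 4.1], "A/G": [4.8, 5.0, 4.9], "G/G": [5.6, 5.8, 5.7]}
null_marker = {"T/T": [4.7, 5.1, 4.9], "T/C": [5.0, 4.8, 4.9], "C/C": [4.9, 5.0, 4.8]}

def genotype_means(groups):
    """Return the mean expression for each genotype class."""
    return {genotype: mean(values) for genotype, values in groups.items()}

def mean_range(groups):
    """Spread between the largest and smallest genotype-class mean."""
    means = genotype_means(groups).values()
    return max(means) - min(means)

eqtl_spread = mean_range(eqtl_marker)   # class means differ across genotypes
null_spread = mean_range(null_marker)   # class means are essentially identical

print("eQTL marker means:", genotype_means(eqtl_marker))
print("null marker means:", genotype_means(null_marker))
print("spread of means (eQTL, null):", eqtl_spread, null_spread)
```

A spread in class means is only a description of the data; whether it is larger than expected by chance is the job of the hypothesis test covered later in the lecture.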

Why should I care about eQTL? (Part I)

• eQTL describe a fundamental aspect of biology: inherited allelic variants that impact gene expression

• eQTL can be “discovered” from statistical analysis of the most basic types of genome-wide data: genotype and gene expression

• eQTL are used to characterize gene expression regulatory elements, e.g. Brown et al. (ENCODE)

• eQTL are used to interpret GWAS hits, e.g. to narrow candidates

• eQTL represent a “natural perturbation” and can be used to infer novel regulatory (network) relationships

Why should I care about eQTL? (Part II)

• These represent sites of the genome that probably contribute to many phenotypes, i.e. they mark active sites of the genotype-phenotype map

• Statistical (computational) approaches for genome-wide eQTL identification - when applied correctly - really do identify eQTL, i.e. this is not just a model fitting exercise = we are inferring real biology

• eQTL (and more broadly xQTL) will be the fundamental analysis and starting point when applying genome-wide data to understand biology


Zero to eQTL in ~30 min

• The molecular biology behind eQTL

• Rigorous definition of an eQTL

• Detecting eQTL from the analysis of genome-wide data

• The importance of linkage disequilibrium (i.e. what we are really discovering in eQTL analyses)

• The statistical foundation

• Typical outcomes and interpretation


Central Dogma of Molecular Biology

credit: wikipedia


The eQTL concept I

• If we could quantify the amount of RNA in a particular tissue or cell type, under a specific set of conditions, this might be informative (i.e. a proxy for gene expression)

• A case where different allelic states at a specific site (locus) in the genome alter a measured expression variable in a tissue / cell population under a given set of conditions is an eQTL

• Note that an eQTL therefore describes a variable pair (genotype-expression association)

The eQTL concept II

• expression - a quantifiable and (theoretically) repeatable measurement of the number of RNA molecules deriving from a position in the genome under specified conditions (we will use Y to represent such a measurement)

• polymorphism - the presence of at least two allelic states (A1 and A2) in a population at a specific locus in the genome, where the existence of one of the alleles can be traced to a mutation event

• expression Quantitative Trait Locus (eQTL) - a polymorphic locus where an experimental exchange of one allele for another produces a change in expression on average:

$A_1 \rightarrow A_2 \Rightarrow \Delta Y$

• Note that this definition includes “on average” and “under specified conditions”, so the specific allele exchange need not cause a change in expression under every manipulation

• The allelic states defined by the original mutation event define the causal polymorphism of the eQTL

depends on the reference; microsatellites (microsats) or Short Sequence Repeats (SSR) - a case where there are a different number of short, repeated motifs in one genome compared to another, e.g. one genome has three ACAT’s in a row, while another has six ACAT’s in a row; Copy Number Variation (CNV) - a case where one individual has a repeat of a long stretch of DNA (thousands of basepairs) compared to another individual or has fewer ‘copies’ of a repeated long stretch; transposable elements - well characterized segments of DNA that can excise and move to another location (sometimes these elements code for the proteins that cause them to move); chromosomal re-arrangements and aberrations - any case where a large portion of a chromosome is moved to another location or cases where there is an ‘extra’ chromosome. Note that in all of these cases, describing the mutation depends on finding the ‘same place’ in the genome, something that we will take for granted in this course. Also, note that for our purpose, we don’t care about the actual description of the genetic difference (the mutation), only that there is a difference, and we will code all such differences using the same system.

The central goal of the methodology we will learn in this course is to identify causal mutations that have an effect on a phenotype, i.e. any aspect of an organism we can measure. We can define causal as follows:

causal mutation ≡ a position in the genome where an experimental manipulation of the DNA produces an effect on the phenotype on average.

There are a couple of strange components to this definition. When we say experimental manipulation, we mean a case where we physically change the state of the DNA at a position in the genome, i.e. produce a mutation. This is actually possible in some organisms (mice, flies, etc.) but we often will not actually do this manipulation, but rather find places where we assume such a manipulation will produce an effect. When we say an ‘effect’ on the phenotype, we ideally mean that if we were to take an individual and produce an identical clone grown under the same environmental conditions, where the only difference is a single experimental mutation, the phenotype would be different between the two individuals. In general, it is impossible to keep all conditions exactly the same, so we consider a case where, if we were to produce an experimental manipulation keeping most genetic and environmental factors the same, the result would be a change in the phenotype on average, that is to say, if we were to average over several manipulations, some of the manipulations would result in a change of the phenotype.

We can define a causal mutation more rigorously as follows:

$A_1 \rightarrow A_2 \Rightarrow \Delta Y$ (1)

where the single arrow reflects a mutation at a position in the genome that changes the (allelic) state of the DNA from one state to another, the double arrow indicates ‘the result’,


Detecting eQTL from the analysis of genome-wide data I

• Since eQTL reflect a case where different allelic combinations (genotypes) lead to different levels of gene expression, we could in theory discover an eQTL by testing for an association between measured genotypes and gene expression levels

• Most eQTL are “discovered” using this type of approach

• A typical (human) eQTL experiment includes m (= ~10-30K) expression variables and N (= ~0.1-10mil) genotypes measured in n individuals sampled from a population

• A typical (most!) analysis of such data proceeds by performing independent statistical tests of (a subset of) genotype-expression pairs, where tests that are significant after a multiple test correction (e.g. Bonferroni) are assumed to indicate an eQTL
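To give a feel for the scale of the multiple testing problem, the sketch below computes a Bonferroni-corrected per-test threshold. The values of m and N are hypothetical picks from the ranges quoted above, alpha = 0.05 is an assumed family-wise error rate, and testing every genotype-expression pair (rather than a subset) is assumed for simplicity.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that controls the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

# Hypothetical experiment size, chosen from the ranges quoted on the slide.
m_expression = 20_000      # expression variables (assumed)
n_genotypes = 1_000_000    # measured genotypes (assumed)
alpha = 0.05               # assumed family-wise type I error

n_tests = m_expression * n_genotypes            # all genotype-expression pairs
threshold = bonferroni_threshold(alpha, n_tests)

print(f"tests performed: {n_tests:,}")
print(f"per-test p-value threshold: {threshold:.2e}")
```

Restricting the analysis to a subset of pairs (for example, only genotypes near each gene) reduces `n_tests` and therefore relaxes the threshold.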

Detecting eQTL from the analysis of genome-wide data II

• This seems straightforward but there is a wrinkle: we often have not measured the causal polymorphism (genotypes) (!!)

• However, a rule of genetics is that genotypes that are physically close to each other in the genome tend to be highly correlated (= linkage disequilibrium or LD), and the further away genotypes are from one another, the lower their correlations (in general):

[Figure: markers A, B, and C on Chr. 1 and marker D on Chr. 2, with the labels “LD”, “equilibrium, linkage”, and “equilibrium, no linkage”.]

• We take advantage of this and assume that significant genotypes are in LD (correlated) with causal polymorphisms and therefore indicate their genomic position (!!)

• This is why we consider measured genotypes to be “markers” or “tags”
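Since LD is described above as correlation between genotypes, one common way to quantify it is the squared Pearson correlation (r²) between two markers. The sketch below assumes genotypes are coded as allele counts (0, 1, 2); the three genotype vectors are made up, with `marker_b` standing in for a nearby marker and `marker_d` for a marker on another chromosome.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a marker with no variation has undefined correlation")
    return sxy / sqrt(sxx * syy)

def r_squared(x, y):
    """Squared correlation, used here as a simple LD measure."""
    return pearson_r(x, y) ** 2

# Hypothetical genotypes for 10 individuals, coded as allele counts (assumed coding).
marker_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
marker_b = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]  # differs from marker_a in one individual
marker_d = [1, 2, 0, 1, 0, 2, 2, 0, 1, 1]  # unrelated pattern

r2_ab = r_squared(marker_a, marker_b)
r2_ad = r_squared(marker_a, marker_d)
print(f"r^2(A, B) = {r2_ab:.3f}")
print(f"r^2(A, D) = {r2_ad:.3f}")
```

A marker with high r² to an unmeasured causal polymorphism will show nearly the same association with expression, which is what lets it act as a tag.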

Detecting eQTL from the analysis of genome-wide data III

Copyright: Journal of Diabetes and its Complications; Science Direct; Vendramini et al

• That is, if we test a (non-causal) marker genotype that is correlated with the causal genotype AND if the only correlated genotypes are in the same position in the genome THEN we can identify the genomic position of the causal genotype (!!)

• For almost all human eQTL, we know the genomic position but not the identities of the causal genotypes responsible for the eQTL


Statistical foundation I

• We need to begin by defining our sample space for an eQTL experiment:

$\Omega = \{\text{possible individuals}\}$ (7)

• For each individual in our sample space, we are interested in pairs of sample outcomes (a single pair at a time!):

$\Omega = \{\Omega_g \cap \Omega_P\}$ (8)

• Where $\Omega_g$ is the set of possible genotype outcomes for an individual at a locus and $\Omega_P$ is the set of values of the expression variable for an individual

• Note that for a diploid, with two alleles (typical for humans!):

$\Omega_g = \{A_1A_1, A_1A_2, A_2A_2\}$ (9)

of probability). This means that some of the architects of probability theory are still alive, and one of them is here at Cornell: Eugene Dynkin (who is in his 90’s). Dynkin (among other accomplishments) proved a number of theorems and developed a number of important methods (e.g. π-λ-systems) which are used to prove a number of important results in basic probability. He is a great teacher and if you ever get the chance to take a course from him, it’s worth it (and you get a living connection to the beginning of probability as we know it!).

$S = (-\infty, \infty)$ (5)

7 Conditional Probability

A critical concept in probability is the concept of conditional probability. Intuitively, we can define the conditional probability as ‘the probability of an event, given that another event has taken place’. That is, this concept makes formal the case where an event that has taken place provides us information that changes the probability of a future or focal event. The formal definition of the conditional probability of $A_i$ given $A_j$ is:

$\Pr(A_i \mid A_j) = \frac{\Pr(A_i \cap A_j)}{\Pr(A_j)}$ (6)

At first glance, this relationship does not seem very intuitive. Let’s consider a quick example that will make it clear why we define conditional probability this way. Let’s use our ‘paired coin flip’ where Pr{HH} = Pr{HT} = Pr{TH} = Pr{TT} = 0.25. In this case, we have the following:

        H2nd   T2nd
H1st    HH     HT
T1st    TH     TT

where we have the following probabilities:
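The conditional probability definition can be checked directly on the paired coin flip. The sketch below encodes the four equally likely outcomes and applies Pr(Ai | Aj) = Pr(Ai ∩ Aj) / Pr(Aj); the particular events chosen (heads on the first flip, heads on the second flip, two heads) are illustrative assumptions, since the excerpt above breaks off before listing its own probabilities.

```python
from fractions import Fraction

# Sample space of the paired coin flip; each outcome has probability 1/4.
prob = {
    "HH": Fraction(1, 4),
    "HT": Fraction(1, 4),
    "TH": Fraction(1, 4),
    "TT": Fraction(1, 4),
}

def pr(event):
    """Probability of an event, given as a set of outcomes."""
    return sum((prob[outcome] for outcome in event), Fraction(0))

def conditional(event_i, event_j):
    """Pr(event_i | event_j) = Pr(event_i and event_j) / Pr(event_j)."""
    p_j = pr(event_j)
    if p_j == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return pr(event_i & event_j) / p_j

h_first = {"HH", "HT"}    # heads on the first flip
h_second = {"HH", "TH"}   # heads on the second flip
both_heads = {"HH"}

p_hh_given_h1 = conditional(both_heads, h_first)
p_h2_given_h1 = conditional(h_second, h_first)

print("Pr(HH | H1st)   =", p_hh_given_h1)
print("Pr(H2nd | H1st) =", p_h2_given_h1)
print("Pr(H2nd)        =", pr(h_second))
```

Knowing the first flip was heads raises the probability of two heads from 1/4 to 1/2, while the probability of heads on the second flip is unchanged, as expected for independent flips.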


Statistical foundation II

• Next, we need to define the probability model:

$\Pr(\mathcal{F}_\Omega) = \Pr(\mathcal{F}_{g,P})$ (10)

• We will define two (types) of random variables (* = state does not matter):

$Y : (*, \Omega_P) \rightarrow \mathbb{R}$ (11)

$X : (\Omega_g, *) \rightarrow \mathbb{R}$ (12)

• Note that the probability model induces a (joint) probability distribution on these random variables:

$\Pr(Y, X)$

For this sample space, we define a probability function (model):

$\Pr(S) = \Pr(S_g, S_P)$ (5)

One could intuitively look at this as defining distinct probability functions for each of these sample spaces $S_g$ and $S_P$, although these probability functions would be related and would actually define a single (joint) pdf for the sample space $S = \{S_g, S_P\}$, $S = \{S_g \cap S_P\}$.

We will define the following two (types) of random variables Y and X, where Y takes the value of the phenotype to the reals (regardless of the genotype) and X takes the value of the genotype to the reals (regardless of phenotype):

$Y : (*, S_P) \rightarrow \mathbb{R}$ (6)

$X : (S_g, *) \rightarrow \mathbb{R}$ (7)

where ∗ indicates the state of the given subset does not matter. Again, we could intuitively think of this as defining individual random variables for each sample space $S_g$ and $S_P$ where each element of these random vectors is associated with only one probability function, i.e. a single random variable cannot be associated with more than one probability function. A more accurate way to think about this set-up is that we have defined a random vector [Y, X], where the probability function on S actually defines a joint probability function over the random variables Y and X:

$\Pr(Y, X)$ (8)

and note we could have random vectors that include both discrete and continuous random variables, such that the joint probability distributions could combine discrete and continuous models.

As we discussed, regardless of the probability model describing our random variables / vectors, we can use expectations and variances to describe basic aspects of the models. If we take the expectation of the random vector [Y, X] we obtain:

$\mathrm{E}[Y, X] = [\mathrm{E}Y, \mathrm{E}X]$ (9)

and the variance of this random vector is:

$\mathrm{Var}[Y, X] = \begin{bmatrix} \mathrm{Var}(Y) & \mathrm{Cov}(Y, X) \\ \mathrm{Cov}(Y, X) & \mathrm{Var}(X) \end{bmatrix}$

If X reflects a causal mutation (= causal allele = causal polymorphism), then $\mathrm{Cov}(Y, X) \neq 0$ (or $\mathrm{Corr}(Y, X) \neq 0$). Our goal with quantitative genomic inference can therefore be broadly stated as determining whether $\mathrm{Cov}(Y, X) \neq 0$ using a sample and we will do this using a

$X(A_1A_1) = -1,\ X(A_1A_2) = 0,\ X(A_2A_2) = 1$ (99)

$Y$ = measurable expression value

$H_0 : \mathrm{Cov}(Y, X) = 0$ (100)

$H_A : \mathrm{Cov}(Y, X) \neq 0$ (101)

$\text{pval} = \Pr(T \geq t \mid H_0 : \text{true})$ (103)
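To make the random vector [Y, X] concrete, the sketch below applies the genotype coding X(A1A1) = -1, X(A1A2) = 0, X(A2A2) = 1 to a small sample and computes the sample mean vector and the 2 x 2 sample covariance matrix. The six genotypes and expression values are invented for illustration, and the unbiased (n - 1) estimator is an assumed choice.

```python
# Genotype coding from the text: X(A1A1) = -1, X(A1A2) = 0, X(A2A2) = 1.
CODING = {"A1A1": -1, "A1A2": 0, "A2A2": 1}

def mean(values):
    return sum(values) / len(values)

def sample_cov(u, v):
    """Unbiased sample covariance (divides by n - 1); sample_cov(u, u) is the variance."""
    n = len(u)
    mean_u, mean_v = mean(u), mean(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

# Hypothetical sample of n = 6 individuals (made-up values).
genotypes = ["A1A1", "A1A1", "A1A2", "A1A2", "A2A2", "A2A2"]
y = [4.0, 4.2, 4.9, 5.1, 5.8, 6.0]      # measured expression
x = [CODING[g] for g in genotypes]      # genotype random variable X

# Sample analogues of E[Y, X] and Var[Y, X].
mean_vector = [mean(y), mean(x)]
cov_matrix = [
    [sample_cov(y, y), sample_cov(y, x)],
    [sample_cov(x, y), sample_cov(x, x)],
]

print("mean vector:", mean_vector)
for row in cov_matrix:
    print(row)
```

The off-diagonal entry is the sample estimate of Cov(Y, X), the quantity the hypothesis test is about.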



Statistical foundation III

• To assess whether the marker genotype indicates an eQTL, we need to assess the following hypotheses:

$H_0 : \mathrm{Cov}(Y, X) = 0$

$H_A : \mathrm{Cov}(Y, X) \neq 0$

• To do this, we will collect a sample of size n of expression and genotype pairs (y, x) and define a statistic T(y, x), for which we know the distribution under the null hypothesis, such that we can calculate a p-value:

$\text{pval} = \Pr(T \geq t \mid H_0 : \text{true})$

• p-value - the probability of obtaining the observed value of a statistic, or one more extreme, conditional on H0 being true

• To analyze the data from a genome-wide eQTL experiment, we calculate a p-value for each of (a subset of) the total set of expression-genotype pairs, and for cases where we reject the null (at an appropriate multiple-test-corrected type I error), we assume that this indicates an eQTL

• Note that we usually consider a run of contiguous genotypes for which we reject the null for the same expression variable to indicate the position of a single causal eQTL polymorphism
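As a rough sketch of testing H0: Cov(Y, X) = 0 for a single expression-genotype pair, the code below uses a permutation test, which is a swapped-in technique: instead of a statistic with a known null distribution (as described above), it approximates the null distribution by shuffling expression values across individuals. The data, the number of permutations, and the function names are all illustrative assumptions.

```python
import random

# Genotype coding from the slides: X(A1A1) = -1, X(A1A2) = 0, X(A2A2) = 1.
CODING = {"A1A1": -1, "A1A2": 0, "A2A2": 1}

def sample_cov(u, v):
    """Unbiased sample covariance between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def permutation_pvalue(y, x, n_perm=999, seed=0):
    """Two-sided permutation p-value for H0: Cov(Y, X) = 0.

    The statistic is |sample covariance|; shuffling y breaks any
    genotype-expression association while keeping both marginals fixed.
    """
    rng = random.Random(seed)
    observed = abs(sample_cov(y, x))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(sample_cov(shuffled, x)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical sample: 10 individuals per genotype class (made-up data).
genotypes = ["A1A1"] * 10 + ["A1A2"] * 10 + ["A2A2"] * 10
x = [CODING[g] for g in genotypes]
jitter = [0.05 * ((i % 5) - 2) for i in range(30)]

y_eqtl = [4.8 + 0.8 * xi + e for xi, e in zip(x, jitter)]  # expression tracks genotype
y_null = [5.0 + 6 * e for e in jitter]                     # expression ignores genotype

p_eqtl = permutation_pvalue(y_eqtl, x)
p_null = permutation_pvalue(y_null, x)
print(f"p-value, eQTL-like pair: {p_eqtl:.3f}")
print(f"p-value, null pair:      {p_null:.3f}")
```

With 999 permutations the smallest attainable p-value is 0.001, far above a genome-wide corrected threshold, so a genome-wide analysis would need many more permutations or a parametric test.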



Typical outcome I

[Figure: xQTL identification, genotype-phenotype association. Two panels plot ERAP2 expression (y-axis, 3.5 to 6.0) by genotype. Panel "eQTL (p < 10^−30)": rs27290 genotype (A/A, A/G, G/G). Panel "no eQTL (n.s.)": rs1908530 genotype (T/T, T/C, C/C).]

Page 22: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

• This is a “cis-”eQTL because the significant genotypes are in the same location as the expressed gene (otherwise, it would be a “trans-”eQTL)

• Most eQTL are “cis-”, which makes biological sense
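The cis/trans distinction can be sketched as a simple position check. The slide only says "in the same location", so the 1 Mb window below is an assumed convention, and the function name, chromosome labels, and coordinates are hypothetical.

```python
CIS_WINDOW_BP = 1_000_000  # assumed window size; the slide does not specify one


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=CIS_WINDOW_BP):
    """Label an eQTL "cis" if the SNP lies within `window` bp of the gene,
    and "trans" otherwise (including when it is on another chromosome)."""
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


# Hypothetical coordinates for one gene and three significant SNPs.
gene = ("chr5", 96_000_000, 96_050_000)
print(classify_eqtl("chr5", 96_020_000, *gene))   # SNP inside the gene
print(classify_eqtl("chr5", 120_000_000, *gene))  # same chromosome, far away
print(classify_eqtl("chr12", 96_020_000, *gene))  # different chromosome
```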

Typical outcome II

Page 23: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

• A typical genome-wide eQTL analysis with a relatively small sample size finds many eQTL

• We have a strong reason to believe that many of the eQTL reported are false positives (=future lecture, stay tuned!)

• This is a remarkably simple approach to finding eQTL; are there better analysis approaches? (=stay tuned!)

• The landscape of available eQTL data is changing with innovations in next-generation sequencing technologies that are providing many more genotypes and providing a variety of measurements of gene expression and other types of variables (=we will discuss!)

• How do we validate eQTL? How do we leverage eQTL to learn more biology? What is in the immediate future for eQTL? (=you get the picture...)

Typical outcome III

Page 24: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

• The most significant SNPs in a linkage disequilibrium (LD) block tend to be enriched for ENCODE Transcription Factor motifs, suggesting a functional mechanism

functions that were ENCODE Promoter and Enhancer histone marks, DNase peaks, transcription factor binding, modification of transcription factor binding motifs and cis-acting SNPs previously discovered in other eQTL studies. By thresholding our data so that only the SNPs with at least 5 points of evidence for ENCODE regulatory elements were retained, we found 7 tumor-specific, 3 healthy-specific and 2 global cis-acting SNPs. Interestingly, 4 / 7 of the remaining tumor-specific SNPs were mapped to 4 genes (MAP3K1, ROMO1, NT5E and MIER3) that were completely absent from the healthy-specific and global sets but also relevant to the pathogenesis of cancer. MAP3K1 (Mitogen-activated protein kinase kinase kinase 1) is an enzyme that plays an essential role in signal transduction cascades including the ERK and JNK kinase pathways as well as the NF-kappa-B pathway. ROMO1 (Reactive Oxygen Species Modulator 1) induces the production of ROS which is necessary for cell proliferation. NT5E (ecto-5’-nucleotidase or CD73) promotes inflammation and tumor growth by acting on immune cells and has recently been suggested as a potential target for cancer immunotherapy ??. MIER3 (Mesoderm Induction Early Response Protein 3) is a gene whose exact function remains unknown but which is an important paralog of MIER1, a fibroblast growth factor (FGF)-activated transcriptional repressor that acts upon HDAC1 (histone deacetylase 1) ?? and also binds to the CREB-binding protein ??.

[Figure 5: panels A, B, C and D]

Identifying functional cis-acting SNPs from ENCODE regulatory elements data

• Theoretically, eQTL mapping can identify causal variants which affect gene expression.

• Difficult to separate the causal mutation from sites in strong linkage disequilibrium.

• Find all SNPs in high LD with lead-SNP (i.e. Cis-acting variant in our eQTL study)

• Identify SNP in LD block that has the highest "functional" score.

• Get a biologist to find something interesting
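The prioritization idea on this slide (collect the SNPs in high LD with the lead SNP, then pick the one with the highest functional score) might be sketched as follows. The squared genotype correlation is used as a simple LD proxy, the 0.8 cutoff is an assumed convention, and the SNP names, genotype vectors, and scores (imagined here as counts of ENCODE evidence) are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (LD proxy)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) * (x - mean_a) for x in a)
    var_b = sum((y - mean_b) * (y - mean_b) for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)


def prioritize(lead, genotypes, scores, min_r2=0.8):
    """Return (best_snp, candidates): candidates are SNPs with r^2 >= min_r2
    to the lead SNP; best_snp has the highest functional score among them."""
    candidates = [snp for snp, g in genotypes.items()
                  if r_squared(genotypes[lead], g) >= min_r2]
    best = max(candidates, key=lambda snp: scores.get(snp, 0))
    return best, candidates


# Hypothetical genotypes (0/1/2 copies of one allele) for eight individuals.
genotypes = {
    "snp_lead": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_a":    [0, 1, 2, 0, 1, 2, 0, 1],  # perfect LD with the lead SNP
    "snp_b":    [0, 1, 2, 0, 1, 2, 0, 2],  # high LD with the lead SNP
    "snp_c":    [2, 0, 1, 1, 0, 2, 1, 0],  # essentially unlinked
}
# Hypothetical functional scores, e.g. counts of overlapping annotations.
scores = {"snp_lead": 2, "snp_a": 5, "snp_b": 3, "snp_c": 9}

best, candidates = prioritize("snp_lead", genotypes, scores)
print(best, candidates)
```

In this toy example the unlinked SNP has the highest score overall but is excluded by the LD filter, so the choice is made only among variants that could explain the lead SNP's signal.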

Leveraging eQTL I

Page 25: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

Leveraging eQTL II

• eQTL co-localize with disease loci identified in GWAS, indicating a common genetic basis and a method for identifying candidate causal polymorphisms for disease risk

Page 26: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

• eQTL are used within Probabilistic Graphical Modeling (PGM) frameworks to discover new network / pathway / regulatory relationships

Leveraging eQTL III

Page 27: BTRY 7210: Topics in Quantitative Genomics and …mezeylab.cb.bscb.cornell.edu/labmembers/documents/QGJC15...most basic types of genome-wide data: genotype and gene expression •

That’s it for today

• Reminder (!!): if you are taking the class for a grade (S/U) please register and please email me to join the listserv AND let me know what topic you would be most interested in covering (!!)