lectures 2 – oct 3, 2011 cse 527 computational biology, fall 2011 instructor: su-in lee ta:...

Lecture 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011

Instructor: Su-In Lee
TA: Christopher Miles

Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022

Introduction to Probabilistic Models for Computational Biology


Review: Gene Regulation

[Figure: a DNA sequence (AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC) containing several genes. Each gene is transcribed into RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU) and translated into protein (MWIV, MRV, MLRTY, MID). A transcription factor binding site acts as a switch for a gene; RNA degradation is also shown.]

“Gene Expression”: DNA → (transcription) → RNA → (translation) → Protein

Gene regulation: a switch! (“transcription factor binding site”)

Genes regulate each other's expression and activity, forming a genetic regulatory network.

Review: Variations in the DNA

[Figure: the same DNA sequence, RNAs and proteins as in the gene regulation review, with “single nucleotide polymorphisms (SNPs)” marked at several positions in the DNA and the resulting changes marked in the RNAs, proteins and genetic regulatory network.]

Sequence variations perturb the regulatory network.

Outline

Probabilistic models in biology
  Model selection problems

Mathematical foundations

Bayesian networks
  Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press

Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Example 1

How are a change in a nucleotide in DNA, blood pressure and heart disease related?

There can be several “models”…

[Figure: three alternative graphs over the nodes DNA alteration, Blood pressure and Heart disease; the edges are not recoverable from the extracted text.]

Example 2

How do genes A, B and C regulate each other's expression levels (mRNA levels)?

There can be several models…

[Figure: three alternative graphs over the nodes A, B and C; the edges are not recoverable from the extracted text.]

[Figure: a data matrix of expression levels with rows Gene A, Gene B, Gene C and columns Exp 1, Exp 2, …, Exp N (N instances), next to three candidate graphs over A, B and C labeled Model I, Model II and Model III.]

Are there statistical dependencies between the expression levels of genes A, B and C?

Model selection: choose the model with the highest probability of being true given the data,
  argmax_x P(model x is true | Data)

Probabilistic graphical models: a graphical representation of statistical dependencies.
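The selection rule argmax_x P(model x is true | Data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prior over the three models is taken to be uniform, the log-likelihood values are made-up placeholders rather than results from any dataset, and the function name `select_model` is hypothetical.

```python
import math

def select_model(log_likelihoods, log_priors=None):
    """Return (best_model, posteriors), where posteriors[m] = P(m | Data)."""
    names = list(log_likelihoods)
    if log_priors is None:
        # Assumption: uniform prior over the candidate models.
        log_priors = {m: -math.log(len(names)) for m in names}
    # Unnormalized log posterior: log P(Data | model) + log P(model).
    scores = {m: log_likelihoods[m] + log_priors[m] for m in names}
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    posteriors = {m: math.exp(scores[m] - log_z) for m in names}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Made-up values of log P(Data | model), for illustration only.
example = {"Model I": -105.2, "Model II": -101.7, "Model III": -103.0}
best, post = select_model(example)
print(best, post)
```

With a uniform prior the argmax is simply the model with the highest likelihood; a non-uniform prior can be passed through `log_priors`.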


Model selection problem


Bayesian networks



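Probability Theory Review

Assume random variables A and B with Val(A) = {a1,a2,a3}, Val(B) = {b1,b2}.

Conditional probability (definition): P(A | B) = P(A, B) / P(B), for P(B) > 0

Chain rule: P(A, B) = P(A) P(B | A)

Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)

Probabilistic independence: A and B are independent if P(A, B) = P(A) P(B)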
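As a quick numerical check of the review items (conditional probability, chain rule, Bayes' rule, independence), the sketch below uses a small joint table over A and B with Val(A) = {a1,a2,a3} and Val(B) = {b1,b2}. The probabilities are made up for illustration and the helper names are hypothetical.

```python
# Hypothetical joint distribution P(A, B); the numbers are made up.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.20,
    ("a2", "b1"): 0.25, ("a2", "b2"): 0.05,
    ("a3", "b1"): 0.15, ("a3", "b2"): 0.25,
}
vals_a = ["a1", "a2", "a3"]
vals_b = ["b1", "b2"]
pairs = [(a, b) for a in vals_a for b in vals_b]

# Marginals, obtained by summing out the other variable.
p_a = {a: sum(joint[(a, b)] for b in vals_b) for a in vals_a}
p_b = {b: sum(joint[(a, b)] for a in vals_a) for b in vals_b}

def b_given_a(b, a):
    """Conditional probability P(B=b | A=a) = P(a, b) / P(a)."""
    return joint[(a, b)] / p_a[a]

def a_given_b(a, b):
    """Conditional probability P(A=a | B=b) = P(a, b) / P(b)."""
    return joint[(a, b)] / p_b[b]

# Chain rule: P(a, b) = P(a) P(b | a).
chain_rule_ok = all(
    abs(joint[(a, b)] - p_a[a] * b_given_a(b, a)) < 1e-12 for a, b in pairs
)
# Bayes' rule: P(a | b) = P(b | a) P(a) / P(b).
bayes_ok = all(
    abs(a_given_b(a, b) - b_given_a(b, a) * p_a[a] / p_b[b]) < 1e-12
    for a, b in pairs
)
# Independence would require P(a, b) = P(a) P(b) for every pair.
independent = all(
    abs(joint[(a, b)] - p_a[a] * p_b[b]) < 1e-12 for a, b in pairs
)
print(chain_rule_ok, bayes_ok, independent)
```

The chain rule and Bayes' rule hold for any joint distribution, whereas independence is a property of this particular table (here it fails).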

Probabilistic Representation

Joint distribution P over {x1,…,xn}
  If each xi is binary, the joint distribution requires 2^n - 1 entries.

If the x's are independent,
  P(x1,…,xn) = P(x1) … P(xn)
  which requires only n entries.
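To make the gap between the two representations concrete, the following sketch counts the free parameters for n binary variables. The function names are hypothetical.

```python
def joint_entries(n):
    """Free parameters of a full joint over n binary variables: 2^n - 1."""
    return 2 ** n - 1

def independent_entries(n):
    """Free parameters when the n binary variables are independent: one each."""
    return n

for n in (3, 10, 30):
    print(n, joint_entries(n), independent_entries(n))
```

The full joint grows exponentially in n, while the fully independent model grows linearly; the models in the rest of the lecture sit between these two extremes.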

Conditional Parameterization

The Diabetes example
  Genetic risk (G), Diabetes (D)
  Val(G) = {g1,g0}, Val(D) = {d1,d0}

P(G,D) = P(G) P(D|G)
  P(G): prior distribution
  P(D|G): conditional probability distribution (CPD)

[Figure: graph Genetic risk → Diabetes]
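A minimal sketch of the conditional parameterization P(G,D) = P(G) P(D|G): the joint table is built from a prior and a CPD, and then used to recover P(D) and P(G | D). All probability values below are made up for illustration.

```python
# Assumed prior P(G) and CPD P(D | G); the numbers are illustrative only.
p_g = {"g1": 0.2, "g0": 0.8}
p_d_given_g = {
    "g1": {"d1": 0.30, "d0": 0.70},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint distribution via P(G, D) = P(G) P(D | G).
joint = {
    (g, d): p_g[g] * p_d_given_g[g][d]
    for g in p_g
    for d in ("d1", "d0")
}

# Marginal P(D = d1), summing out G.
p_d1 = sum(joint[(g, "d1")] for g in p_g)

# Posterior P(G = g1 | D = d1) by the definition of conditional probability.
p_g1_given_d1 = joint[("g1", "d1")] / p_d1
print(p_d1, p_g1_given_d1)
```

Three numbers (one for P(G), two for P(D|G)) determine all four entries of the joint table.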

Naïve Bayes Model - Example

Elaborating the diabetes example:
  Genetic risk (G), Diabetes (D), Hypertension (H)
  Val(G) = {g1,g0}, Val(D) = {d1,d0}, Val(H) = {h1,h0}
  The joint distribution P(G,D,H) has 8 entries.

If D and H are independent given G,
  P(G,D,H) = P(G) P(D|G) P(H|G)
  5 entries; more compact than the joint.

[Figure: graph Genetic risk → Diabetes, Genetic risk → Hypertension]
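The entry counts for the three-variable diabetes example can be checked with a short sketch that builds the 8-entry joint from 5 numbers using P(G,D,H) = P(G) P(D|G) P(H|G). The probability values and helper names are assumptions made for illustration.

```python
from itertools import product

# Five free parameters (illustrative values only).
p_g1 = 0.2                                   # P(G = g1)
p_d1_given_g = {"g1": 0.30, "g0": 0.05}      # P(D = d1 | G)
p_h1_given_g = {"g1": 0.40, "g0": 0.10}      # P(H = h1 | G)

def p_g(g):
    return p_g1 if g == "g1" else 1.0 - p_g1

def p_d(d, g):
    return p_d1_given_g[g] if d == "d1" else 1.0 - p_d1_given_g[g]

def p_h(h, g):
    return p_h1_given_g[g] if h == "h1" else 1.0 - p_h1_given_g[g]

# Full joint table built from the factorization.
joint = {
    (g, d, h): p_g(g) * p_d(d, g) * p_h(h, g)
    for g, d, h in product(("g1", "g0"), ("d1", "d0"), ("h1", "h0"))
}

n_joint_entries = len(joint)
n_factored_parameters = 1 + len(p_d1_given_g) + len(p_h1_given_g)

def d_h_given_g(d, h, g):
    """P(D=d, H=h | G=g), computed from the joint table."""
    p_of_g = sum(v for (gg, _, _), v in joint.items() if gg == g)
    return joint[(g, d, h)] / p_of_g

print(n_joint_entries, n_factored_parameters)
```

By construction, D and H are conditionally independent given G in this table, so `d_h_given_g(d, h, g)` equals `p_d(d, g) * p_h(h, g)` for every assignment.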

Naïve Bayes Model

A class C where Val(C) = {c1,…,ck}.
Finding variables x1,…,xn.

Naïve Bayes assumption: the findings are conditionally independent given the individual's class.

The model factorizes as:
  P(C, x1,…,xn) = P(C) P(x1|C) … P(xn|C)

The Diabetes example: the class is Genetic risk; the findings are Diabetes and Hypertension.

Naïve Bayes Model - Example

Medical diagnosis system
  Class C: disease
  Findings X: symptoms

Computing the confidence: the posterior P(C | x1,…,xn), obtained from the naïve Bayes factorization by Bayes' rule.

Drawbacks
  Strong assumptions
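A minimal sketch of how a naïve Bayes diagnosis system could compute its confidence in each class: multiply the class prior by the per-finding likelihoods and normalize. The disease names, symptom names and probabilities below are hypothetical and chosen only for illustration; the function name `naive_bayes_posterior` is likewise an assumption.

```python
def naive_bayes_posterior(prior, likelihoods, findings):
    """Posterior P(C | findings) under the naive Bayes assumption.

    prior:       {class: P(C = class)}
    likelihoods: {class: {finding: P(finding present | class)}}
    findings:    {finding: True if present, False if absent}
    """
    unnormalized = {}
    for c, p_c in prior.items():
        score = p_c
        for name, present in findings.items():
            p_present = likelihoods[c][name]
            # Findings are treated as conditionally independent given C.
            score *= p_present if present else 1.0 - p_present
        unnormalized[c] = score
    total = sum(unnormalized.values())
    return {c: s / total for c, s in unnormalized.items()}

# Hypothetical classes, findings and numbers.
prior = {"flu": 0.1, "healthy": 0.9}
likelihoods = {
    "flu": {"fever": 0.80, "cough": 0.70},
    "healthy": {"fever": 0.05, "cough": 0.10},
}
posterior = naive_bayes_posterior(
    prior, likelihoods, {"fever": True, "cough": True}
)
print(posterior)
```

The drawback named on the slide shows up here directly: if two findings are strongly correlated given the class, multiplying their likelihoods counts the same evidence twice and makes the posterior overconfident.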

Bayesian Network

Directed acyclic graph (DAG)
  Node: a random variable
  Edge: direct influence of one node on another

The Diabetes example revisited
  Genetic risk (G), Diabetes (D), Hypertension (H)
  Val(G) = {g1,g0}, Val(D) = {d1,d0}, Val(H) = {h1,h0}

[Figure: graph Genetic risk → Diabetes, Genetic risk → Hypertension]

Bayesian Network Semantics

A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1,…,Xn.
  PaXi: parents of Xi in G
  NonDescendantsXi: variables in G that are not descendants of Xi

G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by IL(G):
  For each variable Xi: (Xi ⊥ NonDescendantsXi | PaXi)

[Figure: example DAG over the nodes x1,…,x11; the edges are not recoverable from the extracted text.]

The Genetics Example

Variables
  B: blood type (a phenotype)
  G: genotype of the gene that encodes a person's blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
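The CPD P(B | G) in the genetics example is deterministic once the genotype is known. The sketch below assumes the standard ABO dominance rules (A and B codominant, O recessive), which the slide itself does not spell out; the function names are hypothetical.

```python
# The six unordered genotypes listed on the slide.
GENOTYPES = [("A", "A"), ("A", "B"), ("A", "O"),
             ("B", "B"), ("B", "O"), ("O", "O")]
BLOOD_TYPES = ("A", "B", "AB", "O")

def blood_type(genotype):
    """Map a genotype to its blood type (assumes standard ABO dominance)."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"          # A and B are codominant
    if "A" in alleles:
        return "A"           # <A,A> or <A,O>
    if "B" in alleles:
        return "B"           # <B,B> or <B,O>
    return "O"               # only <O,O>

def cpd_b_given_g(genotype):
    """P(B | G = genotype) as a table over blood types (all mass on one)."""
    observed = blood_type(genotype)
    return {b: 1.0 if b == observed else 0.0 for b in BLOOD_TYPES}

for g in GENOTYPES:
    print(g, blood_type(g))
```

In a Bayesian network over a family, each person's G node would additionally depend on the parents' genotypes, while B depends only on that person's G.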

Bayesian Network Joint Distribution

Let G be a Bayesian network graph over the variables X1,…,Xn. We say that a distribution P factorizes according to G if P can be expressed as:
  P(X1,…,Xn) = P(X1 | PaX1) … P(Xn | PaXn)

A Bayesian network is a pair (G,P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes.

The Student Example

More complex scenario
  Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
  Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1,i0}, Val(S) = {s1,s0}, Val(G) = {g1,g2,g3}

Joint distribution requires 47 entries (2·2·2·2·3 - 1 = 47).

The Student Bayesian network

Joint distribution
  P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)

[Figure: the Student Bayesian network with its CPDs, from Koller & Friedman]
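A sketch of the Student network as a set of CPD tables, using the factorization P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G) from Koller & Friedman's example. The CPD values below are illustrative and should be treated as assumptions rather than as the lecture's numbers; the encoding d0 = easy, d1 = hard, l0 = weak, l1 = strong is also an assumption.

```python
from itertools import product

# Illustrative CPDs (assumed values).
p_i = {"i0": 0.7, "i1": 0.3}
p_d = {"d0": 0.6, "d1": 0.4}
p_g = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
p_s = {  # P(S | I)
    "i0": {"s0": 0.95, "s1": 0.05},
    "i1": {"s0": 0.20, "s1": 0.80},
}
p_l = {  # P(L | G)
    "g1": {"l0": 0.10, "l1": 0.90},
    "g2": {"l0": 0.40, "l1": 0.60},
    "g3": {"l0": 0.99, "l1": 0.01},
}

def joint(i, d, g, s, l):
    """P(I,D,G,S,L) as the product of one CPD entry per node."""
    return p_i[i] * p_d[d] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# The 48 joint entries sum to 1 (up to floating-point error).
assignments = product(p_i, p_d, ("g1", "g2", "g3"), ("s0", "s1"), ("l0", "l1"))
total = sum(joint(*a) for a in assignments)

# Free parameters: I (1) + D (1) + G|I,D (4 rows x 2) + S|I (2) + L|G (3).
n_network_parameters = 1 + 1 + 4 * 2 + 2 * 1 + 3 * 1
print(total, n_network_parameters, joint("i1", "d0", "g2", "s1", "l0"))
```

The network needs 15 free parameters, compared with 47 for the unstructured joint over the same five variables.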

Parameter Estimation

Assumptions
  Fixed network structure
  Fully observed instances of the network variables: D = {d[1],…,d[M]}
    For example, {i0,d1,g1,l0,s0}

Maximum likelihood estimation (MLE)!

“Parameters” of the Bayesian network
  [Figure: the CPD entries of the Student network, from Koller & Friedman]
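With a fixed structure and fully observed instances, the maximum likelihood estimate of a table CPD entry is a relative frequency: the count of (child value, parent values) divided by the count of the parent values. The sketch below shows this for P(S | I) and P(I); the eight instances are made up for illustration and the function name `mle_cpd` is hypothetical.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """MLE of P(child | parents) from fully observed instances.

    data is a list of dicts mapping variable names to observed values.
    Returns {(parent_values, child_value): estimated probability}.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for instance in data:
        pa = tuple(instance[p] for p in parents)
        joint_counts[(pa, instance[child])] += 1
        parent_counts[pa] += 1
    return {
        (pa, x): count / parent_counts[pa]
        for (pa, x), count in joint_counts.items()
    }

# Made-up, fully observed instances over I and S.
data = [
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s0"},
]

cpd_s = mle_cpd(data, "S", ["I"])   # estimates of P(S | I)
prior_i = mle_cpd(data, "I", [])    # estimates of P(I); no parents
print(cpd_s, prior_i)
```

Value combinations that never occur in the data are simply absent from the returned table, which corresponds to an estimated probability of zero; this is one reason pure MLE can be fragile with few instances.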


Model selection problem


Bayesian networks



Acknowledgement

Profs. Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”
