Lecture 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022
Introduction to Probabilistic Models for Computational Biology
Review: Gene Regulation
[Figure: a stretch of DNA (AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC) containing several genes. Each gene is transcribed into RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU) and translated into protein (MWIV, MRV, MLRTY, MID). Labels: gene, transcription, translation, RNA degradation, “Gene Expression”, gene regulation, genetic regulatory network. A transcription factor binding site acts as a switch.]
Genes regulate each other’s expression and activity.
Review: Variations in the DNA
[Figure: the same DNA sequence, RNA (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU) and proteins (MWIV, MRV, MLRTY, MID) as in the gene regulation review, with “Single nucleotide polymorphisms (SNPs)” marked as substitutions at individual positions. Labels: gene, genetic regulatory network.]
Sequence variations perturb the regulatory network.
Outline
Probabilistic models in biology
Model selection problems
Mathematical foundations
Bayesian networks (Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press)
Learning from data: maximum likelihood estimation, expectation and maximization
Example 1
How are a change in a nucleotide in DNA, blood pressure, and heart disease related?
There can be several “models”…
[Figure: candidate models, separated by “OR”, each a different graph over three nodes: DNA alteration, Blood pressure, Heart disease.]
Example 2
How do genes A, B and C regulate each other’s expression levels (mRNA levels)?
There can be several models…
[Figure: three candidate regulatory networks over genes A, B and C, separated by “OR ?”.]
[Figure: expression data for Gene A, Gene B and Gene C across experiments Exp 1, Exp 2, …, Exp N (N instances), next to three candidate networks over A, B and C labeled Model I, Model II and Model III, separated by “OR ?”.]
Statistical dependencies between expression levels of genes A, B, C?
Probability that model x is true given the data
Model selection: $\arg\max_x P(\text{model } x \text{ is true} \mid \text{Data})$
Probabilistic graphical models
A graphical representation of statistical dependencies.
Outline
Probabilistic models in biology
Model selection problem
Mathematical foundations
Bayesian networks
Learning from data: maximum likelihood estimation, expectation and maximization
Probability Theory Review
Assume random variables A and B with Val(A) = {a1, a2, a3}, Val(B) = {b1, b2}
Conditional probability
Definition
Chain rule
Bayes’ rule
Probabilistic independence
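The four items in the probability review can be checked numerically on a small table. The sketch below is only an illustration: the joint distribution `joint` over A and B is made up (an assumption, not from the lecture), and the helper names are hypothetical.

```python
# Hypothetical joint distribution P(A, B) with Val(A) = {a1, a2, a3} and
# Val(B) = {b1, b2}. The numbers are invented for illustration and sum to 1.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.20,
    ("a2", "b1"): 0.15, ("a2", "b2"): 0.25,
    ("a3", "b1"): 0.05, ("a3", "b2"): 0.25,
}

def marginal_a(a):
    """P(A = a), obtained by summing the joint over B."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """P(B = b), obtained by summing the joint over A."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    """Conditional probability: P(a | b) = P(a, b) / P(b)."""
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    """Conditional probability: P(b | a) = P(a, b) / P(a)."""
    return joint[(a, b)] / marginal_a(a)

# Chain rule: P(a, b) = P(b) * P(a | b)
chain = marginal_b("b1") * cond_a_given_b("a1", "b1")

# Bayes' rule: P(a | b) = P(b | a) * P(a) / P(b)
bayes = cond_b_given_a("b1", "a1") * marginal_a("a1") / marginal_b("b1")

# Independence: A and B are independent iff P(a, b) = P(a) * P(b) for all a, b.
independent = all(
    abs(joint[(a, b)] - marginal_a(a) * marginal_b(b)) < 1e-12
    for (a, b) in joint
)
```

With these made-up numbers the chain rule and Bayes' rule reproduce the table entries, while the independence check fails, so A and B are dependent in this particular table.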
Probabilistic Representation
Joint distribution P over {x1, …, xn}
xi is binary: $2^n - 1$ entries
If the x’s are independent: P(x) = p(x1) … p(xn)
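To see why the independent factorization is so much more compact, it may help to count free parameters. This is a minimal sketch with hypothetical function names: a joint table over n binary variables has 2^n entries that must sum to 1, and a fully independent model needs one number per variable.

```python
def full_joint_parameters(n):
    """Independent parameters in a joint table over n binary variables."""
    return 2 ** n - 1

def independent_parameters(n):
    """Independent parameters when all n binary variables are independent:
    one value P(x_i = 1) per variable."""
    return n

# Compare the two representations for a few sizes.
for n in (3, 10, 30):
    print(n, full_joint_parameters(n), independent_parameters(n))
```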
Conditional Parameterization
The Diabetes example
Genetic risk (G), Diabetes (D)
Val(G) = {g1, g0}, Val(D) = {d1, d0}
P(G,D) = P(G) P(D|G)
P(G): Prior distribution
P(D|G): Conditional probability distribution (CPD)
[Figure: Genetic risk → Diabetes]
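The factorization P(G,D) = P(G) P(D|G) can be written out directly. In this sketch the prior `p_g` and the CPD `p_d_given_g` hold invented numbers (assumptions for illustration only, not values from the lecture).

```python
# Prior P(G): hypothetical values.
p_g = {"g1": 0.2, "g0": 0.8}

# CPD P(D | G): one distribution over D for each value of G (hypothetical).
p_d_given_g = {
    "g1": {"d1": 0.30, "d0": 0.70},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint distribution via conditional parameterization: P(g, d) = P(g) * P(d | g).
joint = {
    (g, d): p_g[g] * p_d_given_g[g][d]
    for g in p_g
    for d in p_d_given_g[g]
}

# Marginal P(D = d1), recovered by summing the joint over G.
p_d1 = sum(p for (g, d), p in joint.items() if d == "d1")
```

The joint has four entries but is specified by three independent numbers: one for the prior and one per row of the CPD.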
Naïve Bayes Model - Example
Elaborating the diabetes example: Genetic risk (G), Diabetes (D), Hypertension (H)
Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
The full joint distribution P(G,D,H) has 8 entries (7 independent parameters).
If D and H are independent given G, P(G,D,H) = P(G) P(D|G) P(H|G)
5 independent parameters; more compact than the joint
[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
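As a rough check of the three-variable example, the sketch below builds the joint from P(G) P(D|G) P(H|G) and confirms that Diabetes and Hypertension are independent given Genetic risk, although they are not independent marginally. All probability values and helper names are hypothetical.

```python
from itertools import product

# Hypothetical parameters: 1 + 2 + 2 = 5 independent numbers in total.
p_g = {"g1": 0.2, "g0": 0.8}
p_d_given_g = {"g1": {"d1": 0.30, "d0": 0.70}, "g0": {"d1": 0.05, "d0": 0.95}}
p_h_given_g = {"g1": {"h1": 0.40, "h0": 0.60}, "g0": {"h1": 0.10, "h0": 0.90}}

# Joint built from the factorization P(G, D, H) = P(G) P(D | G) P(H | G).
joint = {
    (g, d, h): p_g[g] * p_d_given_g[g][d] * p_h_given_g[g][h]
    for g, d, h in product(p_g, ("d1", "d0"), ("h1", "h0"))
}

def prob(g=None, d=None, h=None):
    """Probability of a partial assignment; None means 'sum this variable out'."""
    return sum(
        p for (gi, di, hi), p in joint.items()
        if g in (None, gi) and d in (None, di) and h in (None, hi)
    )

def conditionally_independent():
    """Check P(d, h | g) == P(d | g) * P(h | g) for every assignment."""
    for g, d, h in joint:
        lhs = prob(g, d, h) / prob(g=g)
        rhs = (prob(g=g, d=d) / prob(g=g)) * (prob(g=g, h=h) / prob(g=g))
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

# Marginally, D and H are coupled through their common parent G.
marginally_independent = (
    abs(prob(d="d1", h="h1") - prob(d="d1") * prob(h="h1")) < 1e-12
)
```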
Naïve Bayes Model
A class C where Val(C) = {c1, …, ck}
Finding variables x1, …, xn
Naïve Bayes assumption: the findings are conditionally independent given the individual’s class.
The model factorizes as: $P(C, x_1, \ldots, x_n) = P(C) \prod_{i=1}^{n} P(x_i \mid C)$
The Diabetes example: class: Genetic risk; findings: Diabetes, Hypertension
Naïve Bayes Model - Example
Medical diagnosis system
Class C: disease
Findings X: symptoms
Computing the confidence:
Drawbacks
Strong assumptions
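One common way to compute a diagnosis confidence under the Naïve Bayes assumption is to multiply the class prior by the per-finding likelihoods and normalize, or to compare two classes by the ratio of their posteriors. The sketch below assumes binary findings; the disease names, symptom names and probabilities are invented for illustration, and `naive_bayes_posterior` is a hypothetical helper.

```python
def naive_bayes_posterior(prior, likelihoods, findings):
    """Posterior P(class | findings) under the Naive Bayes assumption.

    prior:       {class: P(class)}
    likelihoods: {class: {finding: P(finding is present | class)}}
    findings:    {finding: True/False} observed values
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for name, present in findings.items():
            p = likelihoods[c][name]
            # Findings are conditionally independent given the class,
            # so their probabilities simply multiply.
            score *= p if present else 1.0 - p
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical two-disease, two-symptom example.
prior = {"flu": 0.1, "cold": 0.9}
likelihoods = {
    "flu": {"fever": 0.8, "cough": 0.6},
    "cold": {"fever": 0.1, "cough": 0.5},
}
posterior = naive_bayes_posterior(prior, likelihoods, {"fever": True, "cough": True})

# Ratio of posteriors: how much more plausible one class is than the other.
confidence_ratio = posterior["flu"] / posterior["cold"]
```

The normalizing constant cancels in the ratio, which is why only the prior ratio and the per-finding likelihood ratios matter when two classes are compared.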
Bayesian Network
Directed acyclic graph (DAG)
Node: a random variable
Edge: direct influence of one node on another
The Diabetes example revisited
Genetic risk (G), Diabetes (D), Hypertension (H)
Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
Bayesian Network Semantics
A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1, …, Xn.
Pa_Xi: parents of Xi in G
NonDescendants_Xi: variables in G that are not descendants of Xi.
G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by I_L(G):
For each variable Xi: $(X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i})$
[Figure: an example graph over nodes x1, …, x11.]
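The local Markov assumptions can be listed mechanically from a graph: for each node, collect its non-descendants and condition on its parents. The following is a minimal sketch using the three-node diabetes network (G is the parent of D and H); the function names are hypothetical, and parents are left out of the left-hand side because independence from a variable that is already conditioned on is trivial.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def local_markov_statements(parents):
    """Map each node X to (non-descendants of X minus its parents, parents of X).

    `parents` maps every node to the set of its parents in the DAG.
    Each entry reads: X is independent of the first set given the second.
    """
    children = {node: set() for node in parents}
    for node, pas in parents.items():
        for pa in pas:
            children[pa].add(node)
    statements = {}
    for node, pas in parents.items():
        non_desc = set(parents) - descendants(children, node) - {node}
        statements[node] = (non_desc - set(pas), set(pas))
    return statements

# Diabetes example: Genetic risk (G) -> Diabetes (D), G -> Hypertension (H).
parents = {"G": set(), "D": {"G"}, "H": {"G"}}
statements = local_markov_statements(parents)
```

For this graph the only non-trivial statements are that D is independent of H given G and, symmetrically, that H is independent of D given G.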
The Genetics Example
Variables
B: blood type (a phenotype)
G: genotype of the gene that encodes a person’s blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
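A CPD P(B | G) for this example could be sketched as a deterministic table, assuming the standard ABO rules (A and B are codominant, O is recessive). The slide does not show the CPD itself, so the function names and the deterministic form here are assumptions.

```python
# The six genotypes listed on the slide, as unordered allele pairs.
GENOTYPES = [("A", "A"), ("A", "B"), ("A", "O"), ("B", "B"), ("B", "O"), ("O", "O")]
BLOOD_TYPES = ("A", "B", "AB", "O")

def blood_type(genotype):
    """Phenotype implied by a genotype under standard ABO dominance."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"
    if "A" in alleles:
        return "A"
    if "B" in alleles:
        return "B"
    return "O"

def cpd_b_given_g(genotype):
    """Deterministic CPD P(B | G = genotype): all mass on one blood type."""
    phenotype = blood_type(genotype)
    return {b: 1.0 if b == phenotype else 0.0 for b in BLOOD_TYPES}

# Full CPD table, one row per genotype.
cpd_table = {g: cpd_b_given_g(g) for g in GENOTYPES}
```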
Bayesian Network Joint Distribution
Let G be a Bayesian network graph over the variables X1, …, Xn. We say that a distribution P factorizes according to G if P can be expressed as: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
A Bayesian network is a pair (G,P) where P factorizes over G, and where P is specified as a set of CPDs associated with G’s nodes.
The Student Example
More complex scenario
Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}
Joint distribution requires 47 entries ($2 \cdot 2 \cdot 2 \cdot 2 \cdot 3 - 1$ independent parameters)
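The count of 47 can be reproduced, and compared with a factored representation, by counting independent parameters per CPD. The sketch assumes the student network structure as usually drawn in Koller & Friedman (D and I have no parents, G depends on I and D, S on I, L on G); the structure is not recoverable from this transcript, so treat `parents` as an assumption.

```python
from math import prod

# Number of values each variable can take.
cardinality = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}

# Assumed parent sets (student network as usually drawn in Koller & Friedman).
parents = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

# Full joint table: one entry per assignment, minus one for normalization.
full_joint = prod(cardinality.values()) - 1

def cpd_parameters(node):
    """Independent parameters in the table CPD P(node | parents[node])."""
    rows = prod(cardinality[p] for p in parents[node])  # 1 if no parents
    return rows * (cardinality[node] - 1)

# Factored representation: sum over the CPDs of all nodes.
factored = sum(cpd_parameters(node) for node in cardinality)
```

Under these assumptions the factored form needs 15 independent parameters instead of 47.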
The Student Bayesian network
Joint distribution
P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)
from Koller & Friedman
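A joint probability in the student network is a product of one CPD entry per variable. The sketch below assumes the usual student structure (G depends on I and D, S on I, L on G). The CPD numbers are illustrative assumptions, not read from the slides, and the value coding (d0 = easy, d1 = hard, l0 = weak, l1 = strong) is also an assumption.

```python
from itertools import product

# Illustrative CPDs (assumed values).
p_i = {"i0": 0.7, "i1": 0.3}
p_d = {"d0": 0.6, "d1": 0.4}
p_g = {  # P(G | I, D), keyed by (i, d)
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
p_s = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.20, "s1": 0.80}}  # P(S | I)
p_l = {  # P(L | G)
    "g1": {"l0": 0.10, "l1": 0.90},
    "g2": {"l0": 0.40, "l1": 0.60},
    "g3": {"l0": 0.99, "l1": 0.01},
}

def joint(i, d, g, s, l):
    """P(i, d, g, s, l) = P(i) P(d) P(g | i, d) P(s | i) P(l | g)."""
    return p_i[i] * p_d[d] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# One full assignment: intelligent student, easy course, grade g2,
# high SAT score, weak letter.
example = joint("i1", "d0", "g2", "s1", "l0")

# Sanity check: the factored joint sums to 1 over all 48 assignments.
total = sum(
    joint(i, d, g, s, l)
    for i, d, g, s, l in product(p_i, p_d, ("g1", "g2", "g3"), ("s0", "s1"), ("l0", "l1"))
)
```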
Parameter Estimation
Assumptions
Fixed network structure
Fully observed instances of the network variables: D = {d[1], …, d[M]}
Maximum likelihood estimation (MLE)!
“Parameters” of the Bayesian network
For example, {i0, d1, g1, l0, s0}
from Koller & Friedman
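For table CPDs with a fixed structure and fully observed data, the maximum likelihood estimate of each parameter is a ratio of counts: how often a child value occurs together with a parent assignment, divided by how often that parent assignment occurs. The sketch below is a minimal illustration; the tiny data set and the helper `mle_cpd` are hypothetical.

```python
from collections import Counter

def mle_cpd(data, child, parent_names):
    """MLE of P(child | parents) from fully observed instances.

    data:         list of dicts mapping variable name -> observed value
    child:        name of the child variable
    parent_names: list of parent variable names (may be empty)
    Returns {(parent_values, child_value): estimated probability}.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for instance in data:
        pa = tuple(instance[p] for p in parent_names)
        joint_counts[(pa, instance[child])] += 1
        parent_counts[pa] += 1
    # theta = count(parents, child) / count(parents)
    return {
        (pa, value): count / parent_counts[pa]
        for (pa, value), count in joint_counts.items()
    }

# Hypothetical observations of Intelligence (I) and SAT (S).
data = [
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s1"}, {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s0"},
    {"I": "i1", "S": "s1"}, {"I": "i0", "S": "s0"},
]

theta_s = mle_cpd(data, "S", ["I"])  # estimates of P(S | I)
theta_i = mle_cpd(data, "I", [])     # estimates of P(I)
```

Parent and child combinations that never occur in the data get no entry here; a practical estimator would typically add smoothing or a prior to handle them.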
Outline
Probabilistic models in biology
Model selection problem
Mathematical foundations
Bayesian networks
Learning from data: maximum likelihood estimation, expectation and maximization
Acknowledgement
Profs. Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”