Motif Finding
Yueyi Irene Liu
CS374 Lecture
Oct. 17, 2002
Outline
• Background biology
• Motif-finding methods
  – Word enumeration
  – Gibbs sampling
  – Random projection
  – Phylogenetic footprinting
  – Reducer
Regulation of Gene Expression
• Chromatin structure
• Transcription initiation
• Transcript processing and modification
• RNA transport
• Transcript stability
• Translation initiation
• Post-translational modification
• Protein transport
• Control of protein stability
Typical Structure of a Eukaryotic mRNA Gene
Control of Transcription Initiation
Motif
• A conserved pattern that is found in two or more sequences
• Can be found in
  – DNA (e.g., transcription factor binding sites)
  – Protein
  – RNA
Models for Representing Motifs
• Regular expression
  – Consensus: TGACGCA
  – Degenerate: WGACRCA (IUPAC codes: W = A or T, R = A or G)
• Position Specific Matrix
Aligned sites: TGACGCA, TGACGCA, AGACGCA, TGACACA, AGACGCA
Position  1    2    3    4    5    6    7
A         0.4  0    1    0    0.2  0    1
T         0.6  0    0    0    0    0    0
G         0    1    0    0    0.8  0    0
C         0    0    0    1    0    1    0
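The position specific matrix above can be rebuilt by counting bases column by column in the five aligned sites. The following is a minimal sketch; the function name `position_matrix` is made up for illustration and is not from the lecture.

```python
from collections import Counter

def position_matrix(sites, alphabet="ATGC"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    # zip(*sites) yields one tuple per alignment column.
    for column in zip(*sites):
        counts = Counter(column)
        for base in alphabet:
            matrix[base].append(counts[base] / len(sites))
    return matrix

# The five sites from the slide.
sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
matrix = position_matrix(sites)
for base, row in matrix.items():
    print(base, row)
```

Each row of the printed result corresponds to one row of the table on the slide.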
Where to look for motifs?
• Gene families: a set of genes controlled by a common transcription factor or common environmental stimulus
• How do you construct gene families?
  – Microarray experiments
Microarrays
[Figure: microarray experiment. Labels: known DNA sequences, glass slide, cells of interest, reference sample, isolate mRNA. Resulting data: a matrix of genes (rows) by experiments (columns), e.g.]
3.25 3.01 1.30 0.70
6.73 2.89 0.92 0.67
1.14 1.15 0.60 0.23
2.12 6.12 0.07 0.02
Motif-finding Methods
• Goal: Look for motifs (5-15 bp) in the data set
• Methods:
  – Word enumeration method
  – Gibbs sampling
  – Random projection
  – Phylogenetic footprinting
  – Reducer
Word Enumeration
• For every word w, calculate:
  – Expected frequency based on the entire upstream region of the yeast genome
    • E.g., P(ATTGA) = (0.4)^4 (0.1)^1, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
    • Expected number of occurrences of ATTGA: E = n * P(ATTGA)
  – Observed frequency O in the data set
  – Statistical significance of enrichment: Z = (O - E) / sqrt[np(1 - p)] ~ N(0, 1), where p = P(w)
  – Disadvantage: only exact words are considered
    • E.g., YCTGCA: TCTGCA and CCTGCA are counted as two separate words
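The word enumeration score can be sketched as follows. This is an illustration under stated assumptions: base probabilities are independent (as in the ATTGA example), n is taken to be the number of positions where the word could start, and the function names are hypothetical.

```python
import math

def word_probability(word, base_probs):
    """P(word) under independent base probabilities."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def count_occurrences(word, sequences):
    """Observed count O of exact (possibly overlapping) occurrences."""
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - len(word) + 1)
        if seq[i:i + len(word)] == word
    )

def z_score(word, sequences, base_probs):
    """Z = (O - E) / sqrt(n p (1 - p)), with E = n p."""
    # Assumption: n is the number of candidate start positions.
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    p = word_probability(word, base_probs)
    observed = count_occurrences(word, sequences)
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
print(word_probability("ATTGA", background))  # (0.4)^4 * (0.1)^1, about 0.00256

toy_sequences = ["ATTGAATTGA", "CCATTGACC"]  # made-up data for illustration
print(z_score("ATTGA", toy_sequences, background))
```

Because only exact matches are counted, TCTGCA and CCTGCA would each get their own score here, which is the disadvantage noted on the slide.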
Gibbs Sampling
• Matrix to capture a motif
• Goal: find the best start positions a1, …, ak to maximize the difference between the motif and background base distributions.
[Figure: sequences with motif start positions a1, a2, a3, a4, …, ak]
Liu, X
Gibbs Sampling (Lawrence et al., 1993)
• Step 1: Pick random start positions, compute current motif matrix
• Step 2: Iterative update
  – Take one sequence out, update motif matrix
  – Calculate fitness score of each position in the left-out sequence
  – Pick start position in the left-out sequence based on weight Ax
  – Take out another sequence, …, until convergence
• Step 3: Reset starting position
Liu, X
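Steps 1 and 2 can be sketched in a few lines. This is a simplified illustration, not the implementation of Lawrence et al.: it assumes exactly one site per sequence, adds pseudocounts to avoid zero probabilities, runs a fixed number of sweeps instead of testing for convergence, and leaves out Step 3. All names are hypothetical.

```python
import random

BASES = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Column-wise base frequencies of the aligned sites, with pseudocounts."""
    width = len(sites[0])
    matrix = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix

def fitness(word, matrix, background):
    """A_x = Q_x / P_x for one candidate subsequence."""
    score = 1.0
    for j, base in enumerate(word):
        score *= matrix[j][base] / background[base]
    return score

def gibbs_sample(seqs, width, background, sweeps=200, rng=None):
    rng = rng or random.Random()
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Step 2a: leave sequence i out and rebuild the motif matrix.
            sites = [s[a:a + width]
                     for j, (s, a) in enumerate(zip(seqs, starts)) if j != i]
            matrix = build_matrix(sites)
            # Step 2b: fitness score of every start position in sequence i.
            weights = [fitness(seq[a:a + width], matrix, background)
                       for a in range(len(seq) - width + 1)]
            # Step 2c: sample the new start in proportion to the scores.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Made-up sequences, each containing TGACGCA once.
toy = ["TTTTGACGCATTTT", "ATATTGACGCATAT", "TGACGCATTTTTTT", "TTTATTTGACGCAT"]
uniform = {b: 0.25 for b in BASES}
print(gibbs_sample(toy, 7, uniform, sweeps=50, rng=random.Random(0)))
```

The sampler is stochastic, so different seeds can return different start positions; sampling rather than always taking the best-scoring position is what lets it move away from a poor initial alignment.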
Gibbs Sampling Initialization
• Pick random start positions, compute motif matrix
[Figure: sequences with start positions a1, a2, a3, a4, …, ak and a1', a2', a3', a4', …, ak']
Liu, X
Gibbs Sampling Iteration Steps
1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif
[Figure: sequences with start positions a2', a3', a4', …, ak'; the left-out sequence (a1') is shown as ?????????????????]
Liu, X
Fitness Score
• Ax = Qx / Px
  – Qx: probability of generating subsequence x from current motif
  – Px: probability of generating subsequence x from background
Current motif:
Position  1    2    3
A         0.1  0.3  0.7
T         0.1  0.2  0.1
G         0.7  0.4  0.1
C         0.1  0.1  0.1
Background:
P(A) = P(T) = 0.4
P(G) = P(C) = 0.1
X = GGA: Q? P?
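The question for X = GGA can be worked through directly from the motif matrix and background above. The sketch below uses hypothetical function names; the numbers are the ones on the slide.

```python
motif = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.1, 0.2, 0.1],
    "G": [0.7, 0.4, 0.1],
    "C": [0.1, 0.1, 0.1],
}
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def motif_probability(x, motif):
    """Q_x: product of the motif column probabilities along x."""
    q = 1.0
    for position, base in enumerate(x):
        q *= motif[base][position]
    return q

def background_probability(x, background):
    """P_x: product of the background base probabilities along x."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

def fitness_score(x, motif, background):
    """A_x = Q_x / P_x."""
    return motif_probability(x, motif) / background_probability(x, background)

q = motif_probability("GGA", motif)            # 0.7 * 0.4 * 0.7 = 0.196
p = background_probability("GGA", background)  # 0.1 * 0.1 * 0.4 = 0.004
print(q, p, fitness_score("GGA", motif, background))  # ratio is about 49
```

GGA is therefore about 49 times more likely under the current motif than under the background.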
Gibbs Sampling Iteration Steps
2) Pick new start position by sampling from the fitness scores
[Figure: "Sample from Fitness Score": fitness (y-axis, 0 to 5) against starting position of motif in sequence (x-axis, 0 to 12, …). Start positions after the update: a1'', a2', a3', a4', …, ak']
Liu, X
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Random Projection (Buhler, 2002)
• (l, d)-motif problem:
  – M is an (unknown) motif of length l
  – Each occurrence of M is corrupted by exactly d point substitutions in random positions
• No known biological motifs are of the (l, d)-motif form
[Figure: example motif instances]
CCcaAG
CCcgAG
CCgcAG
CCtaAG
CCtgAG
CtATgG
CCctAc
tCtTAG
CaAcAG
CCAgAa
Random Projection Algorithm
• Guiding principle: Some instances of a motif agree on a subset of positions.
• Use information from multiple motif instances to construct model.
M = ATGCGTC, a (7,2) motif
x(1): ...ccATCCGACca...
x(2): ...ttATGAGGCtc...
x(5): ...ctATAAGTCgc...
x(8): ...tcATGTGACac...
Buhler, J
k-Projections
• Choose k positions in string of length l.
• Concatenate nucleotides at chosen k positions to form k-tuple.
• In l-dimensional Hamming space, projection onto k dimensional subspace.
Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC → TGCTGAT
Buhler, J
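A k-projection is just a selection of letters. A minimal sketch, with hypothetical function names and 1-based positions as on the slide:

```python
import random

def project(ltuple, positions):
    """Concatenate the letters of ltuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def random_projection(l, k, rng=None):
    """Choose k of the l positions uniformly at random, in increasing order."""
    rng = rng or random.Random()
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT, as on the slide
print(random_projection(15, 7, random.Random(0)))
```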
Random Projection Algorithm
• Choose a projection by selecting k positions uniformly at random.
• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.
• Recover motif from bucket containing multiple l-tuples.
Input sequence x(i): …TCAATGCACCTAT...
l-tuple TGCACCT → bucket TGCT
Buhler, J
Example
• l = 7 (motif size) , k = 4 (projection size)
• Choose projection (1,2,5,7)
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets:
ATGC: ATCCGAC
GCTC: GCCTTAC
Buhler, J
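The hashing step of this example can be sketched as below. The function names are hypothetical, and "enriched" follows the definition used in these slides (more than s l-tuples in a bucket).

```python
from collections import defaultdict

def hash_ltuples(seqs, l, positions):
    """Hash every l-tuple of every sequence by its letters at the 1-based positions."""
    buckets = defaultdict(list)
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            ltuple = seq[i:i + l]
            key = "".join(ltuple[p - 1] for p in positions)
            buckets[key].append(ltuple)
    return buckets

def enriched(buckets, s):
    """Buckets containing more than s l-tuples."""
    return {key: items for key, items in buckets.items() if len(items) > s}

buckets = hash_ltuples(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
print(buckets["ATGC"])  # contains ATCCGAC
print(buckets["GCTC"])  # contains GCCTTAC
```

A full run would repeat this with several random projections and hand each enriched bucket to a refinement step.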
Hashing and Buckets
• Hash function h(x) obtained from k positions of projection.
• Buckets are labeled by values of h(x).
• Enriched buckets: contain more than s l-tuples, for some parameter s.
Bucket labels: ATTC, CATC, GCTC, ATGC
Buhler, J
Motif Refinement
• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are known from hash value of bucket.
• Use information in other l-k positions as starting point for local refinement scheme, e.g. EM or Gibbs sampler
[Figure: bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) → local refinement algorithm → candidate motif ATGCGTC]
Buhler, J
Parameter Selection
• Projection size k
  – Choose k small so several motif instances hash to same bucket (k < l - d)
  – Choose k large to avoid contamination by spurious l-mers (4^k > t(n - l + 1))
• Bucket threshold s (s = 3, s = 4)
Buhler, J
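The two constraints on k can be checked mechanically. The slide does not define t and n; the sketch below assumes t is the number of input sequences and n the length of each, and the example values are made up for illustration.

```python
def valid_projection_sizes(l, d, t, n):
    """All k with k < l - d and 4^k > t * (n - l + 1)."""
    return [
        k
        for k in range(1, l + 1)
        if k < l - d and 4 ** k > t * (n - l + 1)
    ]

# Hypothetical setting: (15, 4)-motif, 20 sequences of length 600.
# t * (n - l + 1) = 20 * 586 = 11720; 4^6 = 4096 is too small, 4^7 = 16384 is enough.
print(valid_projection_sizes(l=15, d=4, t=20, n=600))  # [7, 8, 9, 10]
```

An empty list means no k satisfies both constraints for that input size.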
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Conservation of Regulatory Elements Upstream of the ApoAI Gene
[Figure: upstream sequences from mouse, rabbit, human, and chicken in three panels, marking the TATA box, hepatic site C, and CCAAT box]
Substring Parsimony Problem
Given:
• orthologous upstream sequences S1, …, Sn
• phylogenetic tree T of the n species
• size k of the motif, threshold d
Problem: Find all sets of substrings s1, …, sn of S1, …, Sn, each of size k, such that the parsimony score of s1, …, sn on T is at most d
Blanchette, M
Parsimony Score
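Tree T: [Figure: tree with leaves s1, s2, s3, s4, s5, s6 and an internal node labeled s'34]
Parsimony score = minimum, over all possible labelings of internal nodes, of
$$\sum_{(u,v) \in E(T)} d(l(u), l(v))$$
• l(v) – label of node v
• d(l1, l2) – Hamming distance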
Blanchette, M
String Parsimony Problem
S1: AAAGCATTC
S2: TACGCACCC
S3: GAAGCAGGG
k = 5, d = 1
[Figure: tree over S1, S2, S3 with node labels AAGCA, AAGCA, ACGCA, AAGCA, AAGCA]
Algorithm: version I
• Root the tree at arbitrary internal node r
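• Compute table W_u of size 4^k for each node u, where W_u[s] is the best parsimony score for the subtree rooted at u when u is labeled with s:
$$W_u[s] = \begin{cases} 0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\ +\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\ \sum_{v \in Child(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) & \text{if } u \text{ is not a leaf} \end{cases}$$
• Direct implementation of this recursion gives O(n∙k∙(4^(2k) + l)), where l is the average sequence length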
Blanchette, M
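The version I recursion can be sketched directly. This is a naive illustration, not Blanchette's implementation: the tree is given as nested lists (a leaf is its sequence, an internal node is a list of children), tables store only finite entries, and the three-species example from the String Parsimony Problem slide is run on a star tree rooted at its single internal node, which is an assumption about the tree shape.

```python
from itertools import product

ALPHABET = "ACGT"

def hamming(s, t):
    """Number of positions at which s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def substrings(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def table(node, k):
    """W_u as a dict holding only the finite entries."""
    if isinstance(node, str):
        # Leaf: score 0 for substrings of S_u, +infinity (absent) otherwise.
        return {s: 0 for s in substrings(node, k)}
    child_tables = [table(child, k) for child in node]
    W = {}
    for letters in product(ALPHABET, repeat=k):
        s = "".join(letters)
        # Sum over children of the best label t for that child.
        W[s] = sum(
            min(w + hamming(s, t) for t, w in W_v.items())
            for W_v in child_tables
        )
    return W

S1, S2, S3 = "AAAGCATTC", "TACGCACCC", "GAAGCAGGG"
W_root = table([S1, S2, S3], k=5)
d = 1
print([s for s, score in W_root.items() if score <= d])  # ['AAGCA']
```

Only root labels within the threshold are reported here; recovering the substrings s1, …, sn themselves would need a traceback through the tables, which this sketch omits.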
Algorithm: version II
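• Define X_(u,v)[s] as the best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s:
$$X_{(u,v)}[s] = \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right)$$
$$W_u[s] = \sum_{v \in Child(u)} X_{(u,v)}[s]$$
[Figure: node u labeled s, with children v and w]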
Blanchette, M
Algorithm: version II (continued)
• Update X_(u,v) in phases: in phase p maintain the set B_p of sequences t such that X_(u,v)[t] = p
• Define:
  • R_a = {s: W_v[s] = a}
  • N(s) = {t in ∑^k: d(s, t) = 1}
• Start in phase m and let B_m = R_m
• Update:
$$B_{p+1} = \Big( R_{p+1} \cup \bigcup_{s \in B_p} N(s) \Big) \setminus \bigcup_{j \le p} B_j$$
• Computation of X_(u,v) takes O(k∙4^k)
Blanchette, M
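The phase-by-phase update amounts to a breadth-first search over strings, seeded at each phase with the strings whose own score equals the phase number. The sketch below assumes that the starting phase m is the smallest finite value in the table of v, which the slide does not state; all names are hypothetical, and a direct evaluation is included only as a cross-check.

```python
from itertools import product

ALPHABET = "ACGT"

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def neighbors(s):
    """N(s): all strings at Hamming distance exactly 1 from s."""
    for i, c in enumerate(s):
        for a in ALPHABET:
            if a != c:
                yield s[:i] + a + s[i + 1:]

def edge_table_by_phases(W_v, k):
    """X[s] for every s of length k, filled in one phase per score value.

    W_v maps strings to integer scores; missing strings count as +infinity.
    """
    R = {}                                  # R[a] = {s : W_v[s] = a}
    for s, a in W_v.items():
        R.setdefault(a, set()).add(s)
    p = min(R)                              # assumed meaning of the start phase m
    B = set(R[p])                           # B_m = R_m
    X = {}
    while len(X) < len(ALPHABET) ** k:
        for s in B:
            X[s] = p
        candidates = set(R.get(p + 1, ()))
        for s in B:
            candidates.update(neighbors(s))
        # Remove everything already placed in an earlier phase.
        B = {t for t in candidates if t not in X}
        p += 1
    return X

def edge_table_direct(W_v, k):
    """Reference: evaluate the minimum over t directly."""
    return {
        "".join(letters): min(a + hamming("".join(letters), t) for t, a in W_v.items())
        for letters in product(ALPHABET, repeat=k)
    }

leaf_table = {"ACG": 0, "CGT": 0, "GTA": 0}   # a leaf with sequence ACGTA, k = 3
print(edge_table_by_phases(leaf_table, 3) == edge_table_direct(leaf_table, 3))
```

Each string is assigned once and each assignment looks at its 3k neighbors, which matches the O(k∙4^k) bound on the slide.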
Improvements
• Reduce the size of B_p when sequences contribute to X_(u,v) more than the threshold d
  – In phase p, only care about X_(u,v)[s] if a bound combining p with the scores of s at the other children w ∈ Child(u), w ≠ v, is at most d [formula garbled in source; it uses the score of s at w if it has been computed, and a default value otherwise]
  – Leads to significant reductions in stages d/2 … d
• Reduce the number of substrings inserted in W at the leaves
  – For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded
Blanchette, M
Results
• Practical limit on k = 10
• There appeared to be a threshold d0 with very few solutions below and many above
• Algorithm found ~80% of known binding sites
• Performed better than ClustalW, MEME, Consensus
Blanchette, M
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Reducer (Bussemaker et al., 2001)
• Links motif finding to expression level
• A_g = C + Σ_u F_u N_ug (sum over the M significant motifs u)
  – A_g: gene expression level (logarithm of expression ratio)
  – M: number of significant motifs
  – N_ug: number of occurrences of motif u in gene g
  – C: baseline expression level (same for all genes)
  – F_u: increase/decrease of expression level caused by presence of motif u
Reducer (Cont’d)
Expression vector: log ratio of expression levels
        Gene1  Gene2  Gene3  Gene4  …  GeneN
A       1.3    -3.7   10.3   4.5    …  -2.3
Motif vector: number of times that motif occurs in the upstream region of the gene
        Gene1  Gene2  Gene3  Gene4  …  GeneN
AAAAA   2      0      5      3      …  0
AAAAT   5      3      2      1      …  5
…
Liu, X
Reducer (Cont’d)
• Normalize expression (A) and motif (n) vectors
• Linear regression between A vector and every n vector to find the best fit n to A
• Step-wise regression to combine effects of motifs
  – Subtract the effect of one motif
  – Find the next best motif
Liu, X
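The regression and step-wise steps can be sketched in plain Python. This is a simplified illustration, not the published method: instead of normalizing the vectors it fits an intercept C together with the slope F, it ranks motifs by residual sum of squares, and it treats the five values visible in the example tables as a complete toy data set. All function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def fit_single(A, n):
    """Least-squares fit A ~ C + F * n; returns (F, C, residual sum of squares)."""
    a_bar, n_bar = mean(A), mean(n)
    sxx = sum((x - n_bar) ** 2 for x in n)
    if sxx == 0:
        # A constant motif vector explains nothing.
        return 0.0, a_bar, sum((a - a_bar) ** 2 for a in A)
    sxy = sum((x - n_bar) * (a - a_bar) for x, a in zip(n, A))
    F = sxy / sxx
    C = a_bar - F * n_bar
    rss = sum((a - (C + F * x)) ** 2 for x, a in zip(n, A))
    return F, C, rss

def stepwise(A, motif_counts, steps):
    """Repeatedly pick the best-fitting motif and subtract its effect."""
    residual = list(A)
    remaining = dict(motif_counts)
    chosen = []
    for _ in range(min(steps, len(remaining))):
        best = min(remaining, key=lambda m: fit_single(residual, remaining[m])[2])
        n = remaining.pop(best)
        F, C, _rss = fit_single(residual, n)
        chosen.append((best, F))
        residual = [a - (C + F * x) for a, x in zip(residual, n)]
    return chosen

# Toy data: only the columns shown on the slide (Gene1..Gene4 and GeneN).
A = [1.3, -3.7, 10.3, 4.5, -2.3]
counts = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
chosen = stepwise(A, counts, steps=2)
print(chosen)  # AAAAA is selected first, with a positive F
```

On this toy data AAAAA fits the expression vector far better than AAAAT, so it is picked first and its effect is subtracted before AAAAT is evaluated.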
Acknowledgement
• People from whom I borrowed slides:
  – Xiaole Liu (Reducer)
  – Olga Troyanskaya (Microarray)
  – Jeremy Buhler (Random projections)
  – Mathieu Blanchette (Phylogenetic footprinting)
  – Various web sources
[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray; hybridise target (mRNA) to microarray; excitation by laser 1 and laser 2, emission, scanning; analysis: overlay images and normalise]
Information Content of Motifs
• Uncertainty
• Information = Hbefore - Hafter
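As a worked example, uncertainty can be taken to be the Shannon entropy H = -Σ p log2 p of a motif column, and H_before to be 2 bits (four equally likely bases); both are standard choices but assumptions here, since the slide's formula did not survive. The sketch applies them to the position specific matrix from the Models for Representing Motifs slide; function names are hypothetical.

```python
import math

def column_entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(matrix, h_before=2.0):
    """Per-position information H_before - H_after for a {base: [freqs]} matrix."""
    width = len(next(iter(matrix.values())))
    return [
        h_before - column_entropy([row[j] for row in matrix.values()])
        for j in range(width)
    ]

psm = {
    "A": [0.4, 0, 1, 0, 0.2, 0, 1],
    "T": [0.6, 0, 0, 0, 0, 0, 0],
    "G": [0, 1, 0, 0, 0.8, 0, 0],
    "C": [0, 0, 0, 1, 0, 1, 0],
}
info = information_content(psm)
print(info)       # fully conserved positions carry 2 bits each
print(sum(info))  # about 12.3 bits for the whole motif
```

Positions 1 and 5, where two bases occur, carry roughly 1.03 and 1.28 bits, which is why degenerate positions contribute less to motif specificity.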
Improvement on Original Gibbs sampler
• 0 ~ n copies of sites in each sequence
• Iterative masking to find multiple motifs
• Use higher order Markov models to improve motif specificity
Clinical Importance of Defects in Regulatory Elements
Burkitt’s Lymphoma
Statistical Methods
• Expectation Maximization (EM)
  – MEME
• Gibbs sampling
  – BioProspector
  – AlignACE
Motifs are not limited to DNA
• RNA motifs
  – RNA-RNA interaction motifs, e.g., intron-exon splice sites
  – RNA-protein interaction motifs, e.g., binding of proteins to the RNA polyA tail
• Protein motifs
  – E.g., helix-turn-helix motif
Sequence Logo
Why is this Problem Hard?
• Motif information content low
• Hamming distance between each motif instance high