CSCE555 Bioinformatics
Lecture 10 Motif Discovery
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu
HAPPY CHINESE NEW YEAR

Outline
Introduction to DNA Motif
Motif Representations (Recap)
Motif database search
Algorithms for motif discovery

What is a DNA motif?
Motif: a recurring pattern.
A short conserved sequence pattern associated with distinct functions of a protein or DNA.
DNA motifs as transcription factor binding sites.

Transcription: binding sites (DNA) and factors (proteins)
Colored lines are binding sites: DNA sequence patterns.
Blobs are factors (proteins) that recognize binding sites.

Example: Transcription Factor Binding Sites
[Figure: an estrogen receptor bound to an ERE (estrogen response element) on the DNA, upstream of the transcription start]
Gene ERE Sequence
Efp … a g g g t c a t g g t g a c c c t …
TERT … t t g g t c a g g c t g a t c t c …
Oxytocin … g c g g t g a c c t t g a c c c c …
Lactoferrin … c a g g t c a a g g c g a t c t t …
Angiotensin … t a g g g c a t c g t g a c c c g …
VEGF … a t a a t c a g a c t g a c t g g …

Why are sequence patterns useful (revisited)
In the context of transcriptional regulation, sequence patterns can be used to help answer several questions.
What transcription factors are involved in regulating my gene?
Does my gene contain a DNA binding domain?
What novel transcription factor binding sites does my set of co-regulated genes contain?

How do we represent sequence patterns?
The three most common pattern representation languages:
regular expressions (e.g., leucine zipper)
profiles (PWMs, PSSMs, etc.)
hidden Markov models (HMMs)

1) Regular expressions define sets of sequences that they match
Sp1 binds to DNA via 3 zinc-finger binding domains:
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
These particular domains recognize Sp1 binding sites:
GRGGCRGGW
[Figure: transcription factor Sp1 binding to DNA]
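
Both patterns on this slide can be turned into ordinary regular expressions. The sketch below is only an illustration: the helper names `prosite_to_regex` and `iupac_to_regex` are made up for this example, and the PROSITE translator handles just the elements that appear in the zinc-finger pattern (single residues, `X`, bracketed sets, and `(n)` or `(n,m)` repeats).

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus such as GRGGCRGGW into a regex."""
    return "".join(IUPAC[base] for base in consensus.upper())

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex (subset only)."""
    element_re = re.compile(r"(X|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        residue = "." if residue == "X" else residue  # X means any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
    return "".join(parts)

sp1_site = iupac_to_regex("GRGGCRGGW")
zinc_finger = prosite_to_regex("C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H")
print(sp1_site)
print(zinc_finger)
print(bool(re.search(sp1_site, "TTGGGGCGGGATT")))
```

A regular expression either matches or it does not, which is why the later slides turn to profiles and HMMs for graded scores.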

2) Profiles are built from multiple alignments of instances of a pattern
Example: nuclear hormone receptor transcription factor binding site profile derived from experimentally determined sites.
The profile holds counts of the number of times each letter is observed at each position in the pattern.
Observed counts can be converted to frequencies by dividing by the number of observed instances.
So profiles are probabilistic models of sequence patterns.
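
As a small illustration of the counting step, the sketch below builds a count matrix and a frequency matrix from the six ERE sequences shown on the earlier example slide (not from the nuclear hormone receptor profile in the figure, whose numbers are not in the transcript). The function names are hypothetical.

```python
from collections import Counter

# The six aligned ERE sites from the example slide, spaces removed.
ERE_SITES = [
    "agggtcatggtgaccct",  # Efp
    "ttggtcaggctgatctc",  # TERT
    "gcggtgaccttgacccc",  # Oxytocin
    "caggtcaaggcgatctt",  # Lactoferrin
    "tagggcatcgtgacccg",  # Angiotensin
    "ataatcagactgactgg",  # VEGF
]

def count_matrix(sites):
    """One Counter per alignment column: how often each letter is observed."""
    return [Counter(column) for column in zip(*sites)]

def frequency_matrix(sites):
    """Counts divided by the number of observed instances."""
    n = len(sites)
    return [{base: column[base] / n for base in "acgt"}
            for column in count_matrix(sites)]

for position, freqs in enumerate(frequency_matrix(ERE_SITES), start=1):
    print(position, {base: round(f, 2) for base, f in freqs.items()})
```

In practice a small pseudocount is usually added before dividing, so that a letter never seen at a position does not get probability zero.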

3) Making a Markov Model
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
[AT][CG][AC][ACGT-](3)A[TG][GC]
~3600 possible valid sequences

Making a Markov Model of Motif
Emission probabilities of the states, left to right (estimated from the alignment columns):
Match state 1: A:0.8 T:0.2
Match state 2: C:0.8 G:0.2
Match state 3: A:0.8 C:0.2
Insert state: C:0.4 G:0.2 T:0.2 A:0.2
Match state 4: A:1.0
Match state 5: T:0.8 G:0.2
Match state 6: C:0.8 G:0.2
Transition probabilities: 1.0 between consecutive match states, except after match state 3, which moves to the insert state with probability 0.6 and directly to match state 4 with probability 0.4. The insert state loops back to itself with probability 0.4 and moves on to match state 4 with probability 0.6.
P(ACAC--ATC) = 0.8 x 1.0 x 0.8 x 1.0 x 0.8 x 0.6 x 0.4 x 0.6 x 1.0 x 1.0 x 0.8 x 1.0 x 0.8 ≈ 0.047
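
The path probability can be checked by multiplying emissions and transitions along the state path. This is a minimal sketch: the state labels `M1` to `M6` and `I`, and the function name, are made up for the example, and the probabilities are the ones read off the slide's model.

```python
# Emission probabilities per state (labels M1..M6 and I are illustrative).
EMIT = {
    "M1": {"A": 0.8, "T": 0.2},
    "M2": {"C": 0.8, "G": 0.2},
    "M3": {"A": 0.8, "C": 0.2},
    "I":  {"C": 0.4, "G": 0.2, "T": 0.2, "A": 0.2},
    "M4": {"A": 1.0},
    "M5": {"T": 0.8, "G": 0.2},
    "M6": {"C": 0.8, "G": 0.2},
}

# Transition probabilities between states.
TRANS = {
    ("M1", "M2"): 1.0, ("M2", "M3"): 1.0,
    ("M3", "I"): 0.6, ("M3", "M4"): 0.4,
    ("I", "I"): 0.4, ("I", "M4"): 0.6,
    ("M4", "M5"): 1.0, ("M5", "M6"): 1.0,
}

def path_probability(path, residues):
    """Probability of emitting `residues` along the given state path."""
    prob = EMIT[path[0]].get(residues[0], 0.0)
    for prev, state, residue in zip(path, path[1:], residues[1:]):
        prob *= TRANS.get((prev, state), 0.0) * EMIT[state].get(residue, 0.0)
    return prob

# ACAC--ATC: the second C is emitted by the insert state.
p = path_probability(["M1", "M2", "M3", "I", "M4", "M5", "M6"], "ACACATC")
print(f"{p:.4f}")
```

Because this model has only one path per sequence, the path probability is also P(s|H). A general HMM would need the forward algorithm to sum over all paths.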

How to score the match of a sequence against three motif models?
Regular expression: exact match or fuzzy match
Profile: sum of log-odds
HMM: probability score P(s|H)
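
For the profile case, "sum of log-odds" means adding, over motif positions, the log of the profile frequency divided by the background frequency. The sketch below uses a made-up three-column profile and a uniform background purely for illustration. None of these numbers come from the lecture.

```python
import math

# Hypothetical profile (one dict per motif position) and uniform background.
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(window, profile, background):
    """Sum over positions of log2(profile frequency / background frequency)."""
    return sum(math.log2(profile[k][base] / background[base])
               for k, base in enumerate(window))

def best_hit(sequence, profile, background):
    """Score every window of the sequence; return (best score, start)."""
    width = len(profile)
    return max(
        (log_odds_score(sequence[i:i + width], profile, background), i)
        for i in range(len(sequence) - width + 1)
    )

score, start = best_hit("TTACGTT", PROFILE, BACKGROUND)
print(start, round(score, 2))
```

The profile frequencies must be nonzero for the logarithm to be defined, which is one reason pseudocounts are added when a profile is estimated from few sites.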

How do we search for occurrences of known patterns?
Tools exist that allow us to search for one or more known sequence patterns in one or more sequences in different ways.
The patterns can come from a database of known patterns or be novel patterns we have discovered using pattern discovery software or other means.
Some tools treat each pattern independently; others look for groups of matches to patterns.
All tools compare each pattern to each position and compute a score which can be the number of mutations (regular expression patterns) or a probability or log-odds (profiles and HMMs).
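
The "compare each pattern to each position" idea can be sketched for the consensus case, where the score is the number of mismatches. The function names and the mismatch threshold below are illustrative. The consensus is the Sp1 site GRGGCRGGW from the earlier slide.

```python
# IUPAC codes mapped to the set of bases each one allows.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def count_mismatches(window, consensus):
    """Number of positions where the base is not allowed by the consensus."""
    return sum(base not in IUPAC_SETS[code]
               for base, code in zip(window, consensus))

def scan(sequence, consensus, max_mismatches=1):
    """Slide the consensus along the sequence; report (start, mismatches)."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        mismatches = count_mismatches(sequence[start:start + width], consensus)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

print(scan("TTGGGGCGGGATT", "GRGGCRGGW", max_mismatches=0))
```

Swapping `count_mismatches` for a log-odds or HMM probability function gives the profile and HMM variants of the same scan.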

Many useful databases of patterns have been compiled
TRANSFAC – transcription factor binding sites (profiles)
PROSITE – protein sites and domains (regular expressions and profiles)
EPD – eukaryotic promoters (profiles)
PFAM – protein families and domains (HMMs)
BLOCKS – protein families (profiles)

Searching for known patterns in a given sequence
MOTIF – search protein sequence against Prosite, PFAM etc.; search DNA sequence against TRANSFAC
PROFILESCAN – search protein sequence against Prosite database of profiles or regular expressions
MAST – search for occurrences of one or more patterns in a DNA sequence (or database of sequences)

The Motif Discovery Problem
We are given a set of sequences, each containing an instance of an unknown motif. Find the motif.
Multiple, local sequence alignment.
A clean, computer-sciencey problem. A bit too clean, we should be suspicious…

In Real Life
A microarray experiment indicates that 50 genes share similar expression patterns.
Do they share a common type of transcription factor binding site?
◦ Almost certainly some of the genes were included erroneously: experimental noise.
◦ Perhaps they share a common mRNA degradation signal.
Is the TFBS near the transcription start site?
◦ Yeast: probably. Human: who knows?

Approaches to Motif Discovery
Matrix-based:
◦ Gibbs Sampling - most popular.
◦ Expectation maximization.
◦ Stormo’s greedy algorithm.
Consensus sequence-based:
◦ Several algorithms by Pevzner.
◦ Box-finder of Kielbasa et al.

Three Ingredients of Almost any Bioinformatics Method
1. Search space (haystack)
2. Scoring scheme
3. Search algorithm (= optimization technique)
Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.
Mathematically precise formulation of the problem

Gibbs Sampling: Simplifying Assumptions
The width of the motif is known in advance.
No indels (gaps).
Each sequence contains precisely one instance of the motif.
The sequences are single-stranded (e.g. mRNA).

Search Space
N sequences, each of length L
Motif width = W
Size of search space = $(L - W + 1)^N$
L=100, W=15, N=10 gives a size of about $10^{19}$
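
The size follows from each of the N sequences offering L - W + 1 possible start positions for the motif, chosen independently. A quick check of the slide's numbers (the function name is illustrative):

```python
def search_space_size(length, width, n_sequences):
    """Number of ways to pick one motif start position in every sequence."""
    return (length - width + 1) ** n_sequences

size = search_space_size(length=100, width=15, n_sequences=10)
print(f"{size:.2e}")  # 86**10, on the order of 10**19
```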

Scoring Scheme
Assign a numeric score to any proposed answer.
What score should this get?
caga
ctga
cacc
cgca

Some Definitions
caga
ctga
cacc
cgca
Count matrix (columns: motif position k = 1 to 4; rows: residue type i):
k 1 2 3 4
a 0 2 0 3
c 4 0 2 1
g 0 1 2 0
t 0 1 0 0
• count matrix: $c_{ki}$ = number of times residue type $i$ is observed at motif position $k$
• $p_{ki} = c_{ki} / N$
• $p_i$ = background abundance of $i$th residue type

Two Scoring Schemes
1. Based on frequentist statistics / information theory:
$$\text{score} = \ln \prod_{k=1}^{W} \frac{\prod_{i \in \{a,c,g,t\}} p_{ki}^{c_{ki}}}{\prod_{i \in \{a,c,g,t\}} p_i^{c_{ki}}}$$
2. Based on Bayesian statistics:
$$\text{score} = \ln \prod_{k=1}^{W} \frac{6 \prod_{i \in \{a,c,g,t\}} c_{ki}!}{(N+3)! \prod_{i \in \{a,c,g,t\}} p_i^{c_{ki}}}$$
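
As a rough illustration of the first scheme, the sketch below reads the information-theoretic score as the log-likelihood ratio $\sum_k \sum_i c_{ki} \ln(p_{ki}/p_i)$, with zero counts contributing nothing. That reading, the uniform background, and the function name are assumptions made for this example. The count matrix is the one from the definitions slide.

```python
import math

# Count matrix for the sites caga, ctga, cacc, cgca (one dict per position).
COUNTS = [
    {"a": 0, "c": 4, "g": 0, "t": 0},
    {"a": 2, "c": 0, "g": 1, "t": 1},
    {"a": 0, "c": 2, "g": 2, "t": 0},
    {"a": 3, "c": 1, "g": 0, "t": 0},
]
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}  # assumed background

def information_score(counts, background):
    """Sum over positions and residues of c_ki * ln(p_ki / p_i)."""
    n = sum(counts[0].values())  # number of aligned sites
    score = 0.0
    for column in counts:
        for residue, c in column.items():
            if c > 0:  # a zero count contributes p**0 = 1, i.e. ln 1 = 0
                score += c * math.log((c / n) / background[residue])
    return score

print(round(information_score(COUNTS, UNIFORM), 2))
```

A perfectly conserved column (such as position 1) contributes the most, and a column that matches the background contributes nothing.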

Worked Example
$$\text{score} = \ln \prod_{k=1}^{W} \frac{6 \prod_{i \in \{a,c,g,t\}} c_{ki}!}{(N+3)! \prod_{i \in \{a,c,g,t\}} p_i^{c_{ki}}}$$
k 1 2 3 4
a 0 2 0 3
c 4 0 2 1
g 0 1 2 0
t 0 1 0 0
N = 4, $p_i = 1/4$
$\prod_i p_i^{c_{ki}} = (1/4)^N = 1/256$
$\frac{6}{(N+3)! \prod_i p_i^{c_{ki}}} = \frac{32}{105}$
Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29
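
The per-column terms can be reproduced numerically. This sketch assumes the Bayesian score has the form $\ln \prod_k \frac{3! \prod_i c_{ki}!}{(N+3)! \prod_i p_i^{c_{ki}}}$, which is consistent with the 32/105 factor and the four terms on the slide. The function name is illustrative.

```python
import math

# Count matrix for the sites caga, ctga, cacc, cgca (one dict per position).
COUNTS = [
    {"a": 0, "c": 4, "g": 0, "t": 0},
    {"a": 2, "c": 0, "g": 1, "t": 1},
    {"a": 0, "c": 2, "g": 2, "t": 0},
    {"a": 3, "c": 1, "g": 0, "t": 0},
]
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

def bayesian_column_terms(counts, background):
    """ln( 3! * prod_i c_ki! / ((N+3)! * prod_i p_i**c_ki) ) for each column."""
    n = sum(counts[0].values())  # number of aligned sites, N
    terms = []
    for column in counts:
        numerator = math.factorial(3) * math.prod(
            math.factorial(c) for c in column.values())
        denominator = math.factorial(n + 3) * math.prod(
            background[residue] ** c for residue, c in column.items())
        terms.append(math.log(numerator / denominator))
    return terms

terms = bayesian_column_terms(COUNTS, UNIFORM)
print([round(t, 2) for t in terms])
print(round(sum(terms), 3))
```

Summing the unrounded terms gives about 2.296. The slide's 2.29 is the sum of the terms after each has been rounded to two decimals.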

Search Algorithm
We want the global maximum score! (Or as close as we can get.)
Exact algorithms (e.g. dynamic programming) would be too slow (e.g. lifetime of universe).
Therefore we resort to a heuristic algorithm: Gibbs sampling, which is a type of Markov chain Monte Carlo (MCMC) method.

Gibbs Sampling Search
[Figure: a 2D rectangle with axes labeled 1 and 2 and a starting point X]
Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!)
Start at a random point X.
Randomly pick a dimension.
Look at all points along this dimension.
Move to one of them randomly, proportional to its score π.
Repeat.
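
The 2D picture can be simulated directly on a small grid. This is a toy sketch: the grid of scores π and the function name are invented for the example, and every score is assumed to be positive.

```python
import random

def gibbs_walk(pi, steps, rng):
    """Random walk on a 2D grid: resample one coordinate at a time,
    choosing the new value with probability proportional to pi."""
    rows, cols = len(pi), len(pi[0])
    x, y = rng.randrange(rows), rng.randrange(cols)  # random starting point X
    visits = [[0] * cols for _ in range(rows)]
    for _ in range(steps):
        if rng.random() < 0.5:
            # Dimension 1: look at all points in the current column.
            weights = [pi[r][y] for r in range(rows)]
            x = rng.choices(range(rows), weights=weights)[0]
        else:
            # Dimension 2: look at all points in the current row.
            y = rng.choices(range(cols), weights=pi[x])[0]
        visits[x][y] += 1
    return visits

toy_pi = [[1, 1], [1, 97]]  # hypothetical scores; cell (1, 1) dominates
visits = gibbs_walk(toy_pi, steps=5000, rng=random.Random(0))
print(visits)
```

Over many steps the walk spends time in each cell roughly in proportion to its score, so high-scoring regions are visited most often without enumerating the whole space.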

Gibbs Sampling for Motif Search
Choose a random starting state.
Randomly pick a sequence.
Look at all motif positions in this sequence.
Pick one randomly proportional to exp(score).
Repeat.
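
The steps above can be sketched as follows, under the simplifying assumptions stated earlier (known width, no gaps, exactly one site per sequence, one strand). The score used here is the Bayesian score as reconstructed from the worked example, with a uniform background of 1/4. The toy sequences and all function names are hypothetical, and the input is assumed to contain only the lowercase letters a, c, g, t.

```python
import math
import random

def alignment_score(sites):
    """Bayesian score of a set of aligned sites (uniform background 1/4)."""
    n = len(sites)
    score = 0.0
    for column in zip(*sites):
        term = math.log(6) - math.lgamma(n + 4)  # ln 3! - ln (N+3)!
        for base in "acgt":
            c = column.count(base)
            term += math.lgamma(c + 1) - c * math.log(0.25)  # ln c! - c ln p
        score += term
    return score

def gibbs_motif_search(seqs, width, iterations, rng):
    """Return one motif start per sequence after `iterations` sampling steps."""
    # Choose a random starting state.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        j = rng.randrange(len(seqs))  # randomly pick a sequence
        scores = []
        for candidate in range(len(seqs[j]) - width + 1):
            starts[j] = candidate  # look at every motif position in it
            sites = [s[p:p + width] for s, p in zip(seqs, starts)]
            scores.append(alignment_score(sites))
        top = max(scores)  # subtract the maximum so exp() cannot overflow
        weights = [math.exp(s - top) for s in scores]
        # Pick one position randomly, proportional to exp(score).
        starts[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy_seqs = ["ttgacagatt", "cagatttgct", "gtcacagaag", "acgtcagacg"]
starts = gibbs_motif_search(toy_seqs, width=4, iterations=200,
                            rng=random.Random(1))
print([s[p:p + 4] for s, p in zip(toy_seqs, starts)])
```

Because the sampler is stochastic, different seeds can end in different alignments. Real tools run several restarts and keep the best-scoring alignment seen.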

Does it Work in Practice?
Only successful cases get published!
Seems more successful in microbes (bacteria & yeast) than in animals.
The search algorithm seems to work quite well; the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance. I.e. the needle looks like hay.
Attempts to deal with this:
◦ Assume the motif is an inverted palindrome (they often are).
◦ Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse).
As usual, repetitive sequences cause problems.
More powerful algorithm: MEME

1. Go to our MEME server: http://molgen.biol.rug.nl/meme/website/meme.html
2. Fill in your email address and a description of the sequences
3. Open the FASTA-formatted file you just saved with Genome2d (click “Browse”)
4. Select the number of motifs, number of sites and the optimum width of the motif
5. Click “Search given strand only”
6. Click “Start search”

Something like this will appear in your email. The results are quite self-explanatory.

Summary
Motif discovery and motif search problem
Motif representation
Gibbs sampling algorithm for motif discovery
Using MEME (Expectation Maximization algorithm) for motif discovery

Acknowledgement
Zhiping Weng (Boston Uni.)
Timothy L. Bailey