Post on 19-Dec-2015
1
Regulatory Motif Finding
Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002)
Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofer, Lawrence
2
“Regulatory Motif Finding”
What is being regulated?
What is a “motif”?
Why do we want to find them?
3
Central Dogma of Genetics
(pict by Andrew Hughes, Rice University)
It’s “TRUE,” right?!
Yes, but…
4
Every Protein in Every Cell?
Clearly, there are complicated mechanisms at work
Rhodopsin
But we have the same DNA in all cells…
5
Transcriptional Regulation
It is transcription (DNA → RNA) that is being regulated.
RNA Polymerase II, aided by Transcription Factors (TFs)
Where do TFs bind?
6
Promoter Regions
(pict by Andrew Hughes, Rice University)
TATA box – usually ~30 bp upstream of gene
But there are others... Where? What sequence?
7
Promoter Sequence
Many different possible locations, sometimes extremely far from the start of transcription!
What sequence? THAT is the $64k (or $1B) question…
8
Motifs
Many different promoter sequences found
Basal: TATA-box (-20), CCAAT-box (-100)
Additional transcriptional regulatory domains
Activators and inhibitors use these domains
9
Motifs (2)
Not exact sequences – that would be too easy
Strength of binding affects level of promotion/inhibition (C/G vs A/T)
Described either probabilistically with motif logos or with extended single-letter nucleotide codes
Often are palindromic (GATATC)
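For DNA, "palindromic" means the site equals its own reverse complement, which is easy to check directly. A minimal Python sketch follows; the helper names are illustrative only and do not come from the slides or from any library.

```python
# Hypothetical helpers for illustration; input is assumed to be uppercase A/C/G/T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the sequence read on the opposite strand (5' to 3')."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(seq: str) -> bool:
    """A site is palindromic if it equals its own reverse complement."""
    return seq == reverse_complement(seq)

print(is_palindromic("GATATC"))  # True
print(is_palindromic("TATAAA"))  # False
```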
Extended Single-Letter Codes
Letters represent possible bases in each position:

Symbol   Meaning
A        Adenine
G        Guanine
C        Cytosine
T        Thymine
U        Uracil
Y        pYrimidine (C or T)
R        puRine (A or G)
W        "Weak" (A or T)
S        "Strong" (C or G)
K        "Keto" (T or G)
M        "aMino" (C or A)
B        not A (C or G or T)
D        not C (A or G or T)
H        not G (A or C or T)
V        not T (A or C or G)
X, N, ?  unknown (A or C or G or T)

TGASTMA – promoter sequence for several oncogenes
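One way to use the extended single-letter codes in practice is to translate a motif into a regular expression and scan a sequence with it. The sketch below assumes DNA input (so U and "?" are left out of the lookup table), uses TGASTMA as an illustrative motif, and scans a made-up sequence; the function names are hypothetical.

```python
import re

# Extended single-letter codes mapped to regex character classes (DNA only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]", "X": "[ACGT]",
}

def iupac_to_regex(motif: str) -> str:
    """Translate a motif written in extended codes into a regex string."""
    return "".join(IUPAC[c] for c in motif.upper())

def find_sites(motif: str, seq: str):
    """Return (start, matched_word) for every match, overlapping ones included."""
    # The lookahead lets the scan report matches that overlap each other.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]

seq = "CCTGACTCAGGTGAGTAATT"  # made-up sequence, not real data
print(iupac_to_regex("TGASTMA"))   # TGA[CG]T[AC]A
print(find_sites("TGASTMA", seq))  # [(2, 'TGACTCA'), (11, 'TGAGTAA')]
```

This scans one strand only; a fuller version would also scan the reverse complement.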
11
Motif Logos
Height of letters represents probability of being found in that location in the motif
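The numbers behind a logo can be sketched from a set of aligned sites. In the common sequence-logo convention, the stack height at a position is its information content in bits and each letter takes a share proportional to its frequency. The code below is a sketch under that assumption, with made-up sites and no small-sample correction.

```python
import math
from collections import Counter

def position_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    k = len(sites[0])
    freqs = []
    for i in range(k):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts[b] / len(sites) for b in "ACGT"})
    return freqs

def information_content(column):
    """Bits at one position: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

sites = ["TGACTCA", "TGAGTAA", "TGACTAA", "TGAGTCA"]  # made-up aligned sites
for i, col in enumerate(position_frequencies(sites)):
    ic = information_content(col)
    # Letter height in the logo = frequency * information content.
    heights = {b: round(p * ic, 2) for b, p in col.items() if p > 0}
    print(i, round(ic, 2), heights)
```

A fully conserved position (such as position 0 here) gets the maximum 2 bits, while a position split evenly between two bases gets 1 bit.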
12
Why do we care?
Gene regulation → transcriptional regulation
Can teach us about our complex signaling pathways
Drugs and money
13
So… Finding Regulatory Motifs
Statistical Models paper (Liu et al.)
Assumes: we have located genes that we expect to be co-regulated (microarrays, co-expression)
14
So… Finding Regulatory Motifs
Experimental methods of determining TF binding sites (gel shift assay, DNA protection assay)
Statistical models
15
Single-Site Model
Assumes:
- Each sequence contains 1 motif
- Sequences are generated by random draws from {A,C,G,T} with given prior probabilities
- Motif has a frequency matrix for each position
Use Gibbs site sampler: missing data problem. Randomly choose motif locations. Then move the motif locations based on P(a_k)
16
Gibbs Sampling
Sampling:
For every k-long word x_j,…,x_{j+k-1} in x:
$Q_j = \Pr[\text{word} \mid \text{motif}] = M(1, x_j) \times \cdots \times M(k, x_{j+k-1})$
$P_j = \Pr[\text{word} \mid \text{background}] = B(x_j) \times \cdots \times B(x_{j+k-1})$
Let
$A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$
Sample a random new position a_i according to the probabilities A_1,…,A_{|x|-k+1}.
[Plot: Prob versus position in x, from 0 to |x|]
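The sampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation from the Liu et al. paper: the background B is estimated from overall base frequencies, the motif matrix M is rebuilt from the current sites in the other sequences with a pseudocount, and the function name and parameters (`iters`, `pseudo`, `seed`) are hypothetical.

```python
import random

def gibbs_site_sampler(seqs, k, iters=200, pseudo=0.5, seed=0):
    """Return one sampled motif start per sequence (sequences use A/C/G/T only)."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Background B: overall base frequencies, smoothed with a pseudocount.
    total = sum(len(s) for s in seqs)
    B = {b: (sum(s.count(b) for s in seqs) + pseudo) / (total + 4 * pseudo)
         for b in bases}
    # Randomly choose an initial motif location a_i in every sequence.
    a = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iters):
        for i, x in enumerate(seqs):
            # Motif matrix M from the current sites of all other sequences.
            others = [s[a[j]:a[j] + k] for j, s in enumerate(seqs) if j != i]
            M = []
            for pos in range(k):
                col = [w[pos] for w in others]
                M.append({b: (col.count(b) + pseudo) / (len(col) + 4 * pseudo)
                          for b in bases})
            # Weight Q_j / P_j for every k-long word starting at j in x.
            weights = []
            for j in range(len(x) - k + 1):
                ratio = 1.0
                for pos in range(k):
                    c = x[j + pos]
                    ratio *= M[pos][c] / B[c]
                weights.append(ratio)
            # random.choices normalizes the weights, so this samples from A_j.
            a[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return a

demo = ["AGTCGTACGTGAC", "AGTAGACGTGCCG", "ACGTGAGATACGT",
        "GAACGGAGTACGT", "TCGTGACGGTGAT"]
starts = gibbs_site_sampler(demo, k=4)
print([s[p:p + 4] for s, p in zip(demo, starts)])
```

Because the sampler is stochastic, different seeds can settle on different sites; running it several times and comparing the results is a common precaution.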
17
Repetitive Block-Motif Model
View K sequences as one long sequence of length n. Model probability of a motif starting at each position ‘i’.
Problems:
- Lose evolutionary relationship between sequences
- Allows multiple copies of motif in each sequence
- Total number of occurrences unknown
18
The Rest of the Statistical Models Paper…
Much math:
– Scoring motif candidates
– Using potential motif dictionaries
– Bayesian prior probabilities
– Finding motifs with insertions in them (“gapped” motifs)
On to: Phylogenetic Footprinting
19
Phylogenetic Footprinting
Most of paper spent describing background, results
Methods are brief, not too deep
20
Let Evolution Be Your Guide
Phylogenetic Footprinting –
“Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”
21
Orthologs and Paralogs
Gene duplicate within species: Paralog
Same gene in species with common ancestor: Ortholog
22
Advantages
Doesn’t rely on reliably determining co-regulated genes (single-genome approach, non-trivial!)
Can be used to find regulatory elements specific to one single gene (caveat: conserved across species)
23
Standard Methods
Usually start with MSA (ProbCons, ClustalW)
– But this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp)
– Also, if species are evolutionarily close, nonfunctional regions may also be well conserved
Can start with general motif discovery algs (MEME, Consensus, AlignACE, DIALIGN …)
– But these don’t take into account relative phylogenetic relationships of sequences. Will weight closely related sequences too highly
24
The PF Algorithm
Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif
• threshold d
Problem:
• Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
25
Small Example (merci, CS262)
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4
26
Solution
Parsimony score: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Tree figure: node labels ACGG, ACGT, ACGT, ACGT]
An Exhaustive Algorithm
W_u[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.
AGTCGTACGTG
ACGGGACGTGC
ACGTGAGATAC
GAACGGAGTAC
TCGTGACGGTG
[Tree figure: each node carries a table of W_u[s] values with 4^k entries. Entries shown for ACGG and ACGT include 0, 1, 2 and +∞.]
28
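Simple Recurrence
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$
where h(s, t) is the number of positions at which s and t differ.
In words: the score of a k-mer at a node is the sum, over its children, of each child’s best parsimony score for that k-mer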
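The recurrence can be tried out on the five sequences of the small example. The sketch below fills the table for every node bottom-up, taking h to be the Hamming distance and giving a leaf score 0 for k-mers that occur in its sequence and infinity otherwise. The tree shape is an assumption, since the tree figure did not survive in this transcript, and the function names are illustrative. It is the plain exhaustive version with no speed-ups.

```python
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def root_table(tree, seqs, k):
    """Return {s: W_root[s]}; tree is a leaf name or a tuple of subtrees."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    inf = float("inf")

    def solve(node):
        if isinstance(node, str):  # leaf: score 0 only for k-mers present
            seq = seqs[node]
            present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            return {s: (0 if s in present else inf) for s in kmers}
        tables = [solve(child) for child in node]
        # W_u[s] = sum over children v of min over t of (W_v[t] + h(s, t))
        return {s: sum(min(Wv[t] + hamming(s, t) for t in kmers)
                       for Wv in tables)
                for s in kmers}

    return solve(tree)

seqs = {
    "Human": "AGTCGTACGTGAC", "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT", "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# ASSUMED topology, chosen for illustration only.
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))

W = root_table(tree, seqs, k=4)
print(W["ACGT"], min(W.values()))  # 1 1
```

ACGT occurs in every sequence except Rat, which has ACGG instead, so one mutation suffices and no 4-mer does better. Recovering the actual set of k-mers at the leaves would need a traceback, which is omitted here.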
29
Running Time
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + h(s, t) \right)$
$O(k \cdot 4^{2k})$ time per node
Total time $O(n k (4^{2k} + l))$, where n = number of species, l = average sequence length, k = motif length
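To get a feel for these bounds, here is the arithmetic at the size of the small example, assuming n = 5 species, k = 4 and sequences of length l = 13. Constants hidden by the O-notation are ignored, so these are orders of magnitude, not operation counts.

```latex
\begin{align*}
4^{2k} &= 4^{8} = 65{,}536 \\
k \cdot 4^{2k} &= 4 \cdot 65{,}536 = 262{,}144 \quad \text{(per node)} \\
n\,k\,(4^{2k} + l) &= 5 \cdot 4 \cdot (65{,}536 + 13) = 1{,}310{,}980 \\
\text{for } k = 10:\quad 4^{2k} &= 4^{20} \approx 1.1 \times 10^{12}
\end{align*}
```

The 4^{2k} term dominates and grows exponentially in k, which suggests the plain exhaustive version is practical only for short motifs.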
30
FootPrinter
http://bio.cs.washington.edu/software.html
Avoids pitfalls of using MSA or general-purpose motif-finding algorithms
Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence
Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)
31
FootPrinter (2)
“Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”
32
Parameters
Can set minimum threshold on fraction of the phylogeny that must be spanned for motifs with each parsimony score ‘s’.
33
Results
Examine 9 sets of orthologous or paralogous sequences (works for duplicated genes that have since evolved as well).
Found: many old (already known) motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)
34
One example: Metallothionein Gene Family
Good test family:
– Large number of promoter sequences
– Wide variety of species
– Large number of regulatory elements experimentally verified in several species.
Most binding sites are within 300 bp of start codon (ATG)
35
Input sequences: 590 bp upstream of the start codon
Most motifs found were present in multiple isoform families – gained accuracy by considering the paralogs, not just the orthologs
36
But FootPrinter isn’t Perfect
Some known regulatory binding sites were missed. Why?
Ultimately, must be because the motifs were not well-enough conserved to be detected
(but we can discuss more…)
37
FootPrinter Error (1)
Some binding sites not well matched in other species.
Example: Thyroid hormone receptor T3R is conserved within rodents, but not beyond. Would need many closely related species to detect this motif.
38
FootPrinter Error (2-5)
Some motifs well conserved, but too short
InDels in middle of motif – could allow them, but would get many false +s
Some barely fail to meet statistical thresholds (close but no cigar)
Dimer TFs like two conserved regions with variable internal seq.