1 regulatory motif finding discovery of regulatory elements by a computational method for...

1

Regulatory Motif Finding

Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002)

Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence

2

“Regulatory Motif Finding”

What is being regulated?

What is a “Motif”?

Why do we want to find them?

3

Central Dogma of Genetics

(pict by Andrew Hughes, Rice University)

It’s “TRUE,” right?!

Yes, but…

4

Every Protein in Every Cell?

Clearly, there are complicated mechanisms at work

Rhodopsin

But, we have the same DNA in all cells…

5

Transcriptional Regulation

It is transcription (DNA → RNA) that is being regulated.

RNA Polymerase II, aided by Transcription Factors (TFs)

Where do TFs bind?

6

Promoter Regions

(pict by Andrew Hughes, Rice University)

TATA box – usually ~30 bp upstream of gene

But, there are others... Where? What Sequence?

7

Promoter Sequence

Many different possible locations, sometimes extremely far from the start of transcription!

What Sequence? THAT is the $64k (or $1B) Question…

8

Motifs

Many different promoter sequences found

Basal: TATA-box (-20), CCAAT-box (-100)

Additional transcriptional regulatory domains

Activators and inhibitors use these domains

9

Motifs (2)

Not exact sequences – that would be too easy

Strength of binding affects level of promotion/inhibition (C/G vs A/T)

Described either probabilistically with motif logos or with extended single-letter nucleotide codes

Often palindromic (GATATC)

Extended Single-Letter Codes

Letters represent possible bases in each position:

Symbol   Meaning
A        Adenine
G        Guanine
C        Cytosine
T        Thymine
U        Uracil
Y        pYrimidine (C or T)
R        puRine (A or G)
W        "Weak" (A or T)
S        "Strong" (C or G)
K        "Keto" (T or G)
M        "aMino" (C or A)
B        not A (C or G or T)
D        not C (A or G or T)
H        not G (A or C or T)
V        not T (A or C or G)
X, N, ?  unknown (A or C or G or T)

TGASTMA – Promoter sequence for several oncogenes
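A minimal sketch of how the extended single-letter codes can be used to scan a sequence. It turns each code into a regular-expression character class and reports every match position. The function names, the test sequence, and the reading of the oncogene motif as TGASTMA are assumptions made for illustration, and the "?" symbol is left out of the lookup table.

```python
import re

# Extended single-letter codes from the slide's table.
# Assumption: U is treated as T because we are searching DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT", "X": "ACGT",
}

def motif_to_regex(motif):
    """Translate an extended-code motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead lets the scan report matches that overlap each other.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical test sequence containing two instances of the motif.
print(motif_to_regex("TGASTMA"))
print(find_motif("TGASTMA", "CCTGACTCAGGTGAGTAAC"))
```

A regular expression only answers yes or no for each position; it cannot express that one base is more likely than another, which is what motif logos add.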

11

Motif Logos

Height of each letter represents the probability of that base being found at that location in the motif
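As a rough illustration of how letter heights can be derived, the sketch below follows the common sequence-logo convention: the stack at each position is as tall as that position's information content (2 bits minus its entropy), and each letter gets a share proportional to its probability. The function name and the example columns are hypothetical, and the small-sample correction used by real logo tools is ignored.

```python
import math

def logo_heights(column):
    """column maps each base to its probability at one motif position.

    Returns a dict mapping each base to its letter height in bits:
    stack height = 2 - entropy, split among letters by probability.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    information = 2.0 - entropy  # at most 2 bits for a 4-letter alphabet
    return {base: p * information for base, p in column.items()}

# Hypothetical columns: a perfectly conserved position and an uninformative one.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(logo_heights(conserved))
print(logo_heights(uniform))
```

A fully conserved position yields one tall letter, while a position where all four bases are equally likely contributes nothing to the logo.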

12

Why do we care?

Gene regulation, transcriptional regulation

Can teach us about our complex signaling pathways

Drugs and Money

13

So…Finding Regulatory Motifs

Statistical Models paper (Liu et al)

Assumes: We have located genes that we expect to be co-regulated (microarrays, co-expression)

14

So…Finding Regulatory Motifs

Experimental methods of determining TF binding sites (Gel Shift assay, DNA Protection Assay)

Statistical models

15

Single-Site Model

Assumes:
- Each sequence contains 1 motif
- Sequences are generated by random draws from {A,C,G,T} with given prior probabilities
- Motif has a frequency matrix for each position

Use Gibbs site sampler: Missing Data Problem. Randomly choose motif locations. Then move the motif locations based on $P(a_k)$
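To make the "frequency matrix for each position" concrete, here is a small sketch that estimates one from the motif locations currently chosen in each sequence. The function name, the toy sequences, and the pseudocount of 1 are assumptions for illustration and are not taken from the paper.

```python
def frequency_matrix(seqs, starts, k, pseudocount=1.0):
    """Estimate M[i][base], the probability of each base at motif position i,
    from the k-mers currently selected in each sequence."""
    alphabet = "ACGT"
    matrix = []
    for i in range(k):
        # Pseudocounts keep unseen bases from getting probability zero.
        counts = {b: pseudocount for b in alphabet}
        for seq, a in zip(seqs, starts):
            counts[seq[a + i]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in alphabet})
    return matrix

# Hypothetical data: three sequences and the current guess for each motif start.
seqs = ["TTACGTAA", "GACGTCCC", "CCCACGTG"]
starts = [2, 1, 3]
M = frequency_matrix(seqs, starts, k=4)
for i, row in enumerate(M):
    print(i + 1, row)
```

In a site sampler this matrix is rebuilt each time a motif location moves, leaving out the sequence whose location is being resampled.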

16

Gibbs Sampling

Sampling:

For every k-long word $x_j, \ldots, x_{j+k-1}$ in $x$:

$Q_j = \mathrm{Prob}[\,\text{word} \mid \text{motif}\,] = M(1, x_j) \times \cdots \times M(k, x_{j+k-1})$

$P_j = \mathrm{Prob}[\,\text{word} \mid \text{background}\,] = B(x_j) \times \cdots \times B(x_{j+k-1})$

Let $A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$

Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$.

[Figure: probability plotted against position in x, from 0 to |x|]
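The sampling step can be sketched as follows: score every k-long word under the motif model M and the background B, normalize the ratios Q_j / P_j into probabilities A_j, and draw a new position from them. The function names and the toy motif model are assumptions for illustration only.

```python
import random

def site_probabilities(x, M, B):
    """Return A_j for every start j of a k-long word in x, where
    Q_j = prod_i M[i][x[j+i]] and P_j = prod_i B[x[j+i]]."""
    k = len(M)
    ratios = []
    for j in range(len(x) - k + 1):
        q = p = 1.0
        for i in range(k):
            q *= M[i][x[j + i]]
            p *= B[x[j + i]]
        ratios.append(q / p)
    total = sum(ratios)
    return [r / total for r in ratios]

def sample_position(x, M, B, rng=random):
    """Draw a new motif start in x according to the probabilities A_j."""
    A = site_probabilities(x, M, B)
    return rng.choices(range(len(A)), weights=A, k=1)[0]

# Hypothetical toy model: uniform background, motif strongly favouring ACGT.
B = {b: 0.25 for b in "ACGT"}
M = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "ACGT"]
x = "TTACGTAA"

A = site_probabilities(x, M, B)
print([round(a, 4) for a in A])
print(sample_position(x, M, B))
```

Sampling from A_j rather than always taking the largest value is what lets the procedure escape a poor initial guess.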

17

Repetitive Block-Motif Model

View K sequences as one long sequence of length n. Model probability of a motif starting at each position ‘i’.

Problems:
- Lose evolutionary relationship between sequences
- Allows multiple copies of motif in each sequence
- Total number of occurrences unknown

18

The Rest of the Statistical Models Paper…

Much math:
– Scoring motif candidates
– Using potential motif dictionaries
– Bayesian Prior Probabilities
– Finding motifs with insertions in them (“gapped” motifs)

On to: Phylogenetic Footprinting

19

Phylogenetic Footprinting

Most of paper spent describing background, results

Methods are brief, not too deep

20

Let Evolution Be Your Guide

Phylogenetic Footprinting –

“Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”

21

Orthologs and Paralogs

Gene duplicate within species: Paralog

Same gene in species with common ancestor: Ortholog

22

Advantages

Doesn’t rely on reliably determining co-regulated genes (single-genome approach, non-trivial!)

Can be used to find regulatory elements specific to one single gene (caveat: conserved across species)

23

Standard Methods

Usually start with MSA (ProbCons, ClustalW)
– But, this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp)
– Also, if species are evolutionarily close, nonfunctional regions may also be well conserved

Can start with general motif discovery algs (MEME, Consensus, AlignACE, DIALIGN …)
– But, these don’t take into account relative phylogenetic relationships of sequences. Will weight closely related sequences too highly

24

The PF Algorithm

Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif,
• threshold d

Problem:

• Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
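To show what the parsimony score of a set S in T means, the sketch below counts the minimum number of substitutions needed on a tree for one chosen k-mer per leaf, using Fitch's small-parsimony procedure column by column. This is a swapped-in textbook technique for scoring a single candidate set, not the search algorithm of the paper. The tree topology, the species names, and the chosen k-mers are assumptions for illustration.

```python
def fitch_column(tree, leaf_base):
    """Return (candidate state set, substitution count) for one column.

    A tree is either a leaf name (str) or a tuple of two subtrees.
    """
    if isinstance(tree, str):
        return {leaf_base[tree]}, 0
    left_set, left_cost = fitch_column(tree[0], leaf_base)
    right_set, right_cost = fitch_column(tree[1], leaf_base)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, kmers):
    """Sum the per-column substitution counts over all k positions."""
    k = len(next(iter(kmers.values())))
    return sum(
        fitch_column(tree, {name: s[i] for name, s in kmers.items()})[1]
        for i in range(k)
    )

# Assumed topology; the source does not spell out the tree in text.
TREE = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))
# One hypothetical choice of k-mer per leaf, with k = 4.
S = {"Human": "ACGT", "Chimp": "ACGT", "Rabbit": "ACGT",
     "Mouse": "ACGT", "Rat": "ACGG"}

print(parsimony_score(TREE, S))
```

A set S is reported when this score is at most the threshold d; the hard part is searching over all ways of picking one k-mer per leaf.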

25

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

Small Example (merci, CS262)

26

Solution

Parsimony score: 1 mutation

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...

[Figure: tree over the five sequences, with node labels ACGG, ACGT, ACGT, ACGT]

An Exhaustive Algorithm

$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.

[Figure: the phylogenetic tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG. Every node carries a table $W_u$ with $4^k$ entries; sample entries for ACGG and ACGT are shown at each node, with $+\infty$ at a leaf for a k-mer that does not occur in its sequence.]



28

Simple Recurrence

$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \big( W_v[t] + h(s, t) \big)$

where $h(s, t)$ is the number of positions at which s and t differ.

In words: the k-mer score at a node is the sum of its children’s best parsimony scores for that k-mer
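A minimal sketch of the recurrence as a bottom-up table computation, run on the five sequences of the small example with k = 4. Leaf tables hold 0 for k-mers that occur in the sequence and infinity otherwise; each internal table adds, per child, the best child entry plus the Hamming distance. The tree topology, the species names as dictionary keys, and all function names are assumptions for illustration; this is not FootPrinter's actual implementation.

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of positions at which s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def node_table(tree, leaves, k, kmers):
    """Return W_u[s] for every k-mer s, where u is the root of `tree`."""
    if isinstance(tree, str):
        seq = leaves[tree]
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        return {s: (0 if s in present else INF) for s in kmers}
    W = {s: 0 for s in kmers}
    for child in tree:
        Wv = node_table(child, leaves, k, kmers)
        # Only finite child entries can attain the minimum.
        finite = [(t, w) for t, w in Wv.items() if w != INF]
        for s in kmers:
            W[s] += min(w + hamming(s, t) for t, w in finite)
    return W

def best_parsimony(tree, leaves, k):
    """Return the best root score and the root labels that achieve it."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    W = node_table(tree, leaves, k, kmers)
    best = min(W.values())
    return best, sorted(s for s, w in W.items() if w == best)

# Assumed topology; the sequences are those of the small example.
TREE = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))
LEAVES = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}

score, labels = best_parsimony(TREE, LEAVES, 4)
print(score, labels)
```

The best root score agrees with the one-mutation solution given for the small example. Recovering the actual k-mers at the leaves would need a traceback through the tables, which is omitted here.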

29

Running Time

$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \big( W_v[t] + h(s, t) \big)$

$O(k \, 4^{2k})$ time per node

n = number of species
l = average sequence length
k = motif length

Total time $O(n \, k \, (4^{2k} + l))$
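One way to read the bound, assuming a tree with n leaves and therefore O(n) nodes and edges, and assuming each leaf table is filled by marking the roughly l k-mers of its sequence:

```latex
\begin{align*}
\text{per child edge:}\quad
  & \underbrace{4^k}_{\text{choices of } s} \times
    \underbrace{4^k}_{\text{choices of } t} \times
    \underbrace{O(k)}_{\text{cost of } h(s,t)}
    = O\!\left(k \, 4^{2k}\right) \\
\text{all } O(n) \text{ internal edges:}\quad
  & O\!\left(n \, k \, 4^{2k}\right) \\
\text{all } n \text{ leaf tables:}\quad
  & O(n \, k \, l) \\
\text{total:}\quad
  & O\!\left(n \, k \, (4^{2k} + l)\right)
\end{align*}
```

For k = 4 this gives $4^{2k} = 65{,}536$ pairs (s, t) per edge, so the exhaustive approach is only practical for short motifs.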

30

FootPrinter
http://bio.cs.washington.edu/software.html

Avoids pitfalls of using MSA or general-purpose motif-finding algorithms

Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence

Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)

31

FootPrinter (2)

“Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”

32

Parameters

Can set minimum threshold on fraction of the phylogeny that must be spanned for motifs with each parsimony score ‘s’.

33

Results

Examined 9 sets of orthologous or paralogous sequences (works for duplicated genes that have since evolved as well).

Found: many known motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)

34

One example: Metallothionein Gene Family

Good test family:
– Large number of promoter sequences
– Wide variety of species
– Large number of regulatory elements experimentally verified in several species.

Most binding sites are within 300 bp of start codon (ATG)

35

Input Sequences: 590 bp upstream of the start codon

Most found were present in multiple isoform families – gained accuracy by considering the paralogs, not just the orthologs

36

But, FootPrinter isn’t Perfect

Some known regulatory binding sites were missed. Why?

Ultimately, must be because the motifs were not well-enough conserved to be detected

(but we can discuss more…)

37

FootPrinter Error (1)

Some binding sites not well matched in other species.

Example: Thyroid hormone receptor T3R is conserved within rodents, but not beyond. Would need many closely related species to detect this motif.

38

FootPrinter Error (2-5)

Some motifs well conserved, but too short

InDels in middle of motif – could allow them, but would get many false +s

Some barely fail to meet statistical thresholds (close but no cigar)

Dimer TFs like two conserved regions with variable internal seq.
