
Post on 20-Dec-2015


Motif discovery

Tutorial 5

• Motif discovery
  – MEME
  – MAST
  – TOMTOM
  – GOMO
  – PROSITE

Multiple sequence alignments and motif discovery

Can we find motifs using multiple sequence alignment?

Residue frequency at each alignment position:

Position  1    2    3    4    5    6    7    8    9    10
A         0    0    0    0    0    3/6  1/6  2/6  0    0
D         0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
E         0    0    4/6  1    0    0    0    0    1    5/6
G         0    1/6  0    0    1    2/6  0    0    0    0
H         0    1/6  0    0    0    0    0    0    0    0
N         0    1/6  0    0    0    0    0    0    0    0
Y         1    0    0    0    0    0    0    3/6  0    0

Aligned sequences:

..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
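The frequency table can be recomputed directly from the six aligned segments. The following is a minimal Python sketch, not part of the original slides; the function name `position_frequency_matrix` is chosen here for illustration. It counts each residue per column and divides by the number of sequences. Python's `Fraction` prints reduced values, so 3/6 appears as 1/2.

```python
from collections import Counter
from fractions import Fraction

# The six aligned 10-residue segments shown on the slide.
SEQUENCES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]


def position_frequency_matrix(seqs):
    """Return {residue: [frequency per column]} for equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    residues = sorted(set("".join(seqs)))
    matrix = {r: [] for r in residues}
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        for r in residues:
            # Counter returns 0 for residues absent from this column.
            matrix[r].append(Fraction(counts[r], len(seqs)))
    return matrix


pfm = position_frequency_matrix(SEQUENCES)
for residue, row in pfm.items():
    print(residue, " ".join(f"{str(f):>4}" for f in row))
```

Every column of the result sums to 1, which is a quick sanity check for a hand-built table.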

Motif

A widespread pattern with biological significance

Can we find motifs using multiple sequence alignment (MSA)?

YES! NO

Using MSA for motif discovery

Can only work if the sequences align nicely on their own

For most motifs this is not the case!

ClustalW - Input

http://www.ebi.ac.uk/Tools/clustalw2/index.html

Input sequences

Gap scoring

Scoring matrix

Email address

Output format

Muscle - Input

http://www.ebi.ac.uk/Tools/muscle/index.html

Input sequences

Email address

Output format

Motif search: from de-novo motifs to motif annotation

gapped motifs

Large DNA data

http://meme.sdsc.edu/

MEME – Multiple EM* for Motif Elicitation

http://meme.sdsc.edu/

• Motif discovery from unaligned sequences
  – Genomic or protein sequences
• Flexible model of motif presence (a motif can be absent in some sequences or appear several times in one sequence)

*EM = expectation-maximization

MEME - Input

Email address

Input file (FASTA file)

How many times in each sequence?

How many motifs?

How many sites?

Range of motif lengths

MEME - Output

Motif score

MEME - Output

Motif length

Number of times

Motif score

MEME - Output

Low uncertainty = High information content
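To make the link between uncertainty and information content concrete, here is a small Python sketch using the textbook definition: information content of a column = log2(alphabet size) minus the Shannon entropy of that column. This assumes a uniform background and no small-sample correction, so it is not necessarily the exact quantity MEME reports. The function name and the two example columns are illustrative.

```python
import math


def information_content(column, alphabet_size=20):
    """Information content in bits: log2(alphabet_size) - Shannon entropy.

    `column` maps each residue to its frequency; zero frequencies are skipped.
    A uniform background over `alphabet_size` letters is assumed.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


# A fully conserved column (no uncertainty) and a variable one.
conserved = {"E": 1.0}
variable = {"D": 3 / 6, "G": 1 / 6, "H": 1 / 6, "N": 1 / 6}

print(round(information_content(conserved), 3))
print(round(information_content(variable), 3))
```

The conserved column reaches the maximum of log2(20) bits for proteins, while the variable column scores lower. For DNA, pass `alphabet_size=4`.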

MEME - Output

Multilevel Consensus

Patterns can be presented as regular expressions

[AG]-x-V-x(2)-{YW}

[] - Either residue
x - Any residue
x(2) - Any residue in the next 2 positions
{} - Any residue except these

Examples: AYVACM, GGVGAA
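The pattern notation above maps almost directly onto ordinary regular expressions. The sketch below translates only the four constructs listed here ([], x, x(n), {}) into a Python regex and checks the two example peptides. It is an illustrative helper (`prosite_to_regex` is a name invented here), not a full implementation of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate the pattern subset shown above into a Python regex string."""
    element_re = r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?"
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."                        # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {}: any residue except these
        # [..] and single letters are already valid regex as written.
        parts.append(core + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for peptide in ("AYVACM", "GGVGAA"):
    print(peptide, bool(re.fullmatch(regex, peptide)))
```

Use `re.finditer(regex, sequence)` instead of `re.fullmatch` to locate occurrences inside a longer protein sequence.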

Sequence names

Position in sequence

Strength of match

Motif within sequence

MEME - Output

Overall strength of motif matches

Motif location in the input sequence

MEME - Output

Sequence names

What can we do with motifs?

• MAST - Search for them in non-annotated sequence databases (protein and DNA)

• TOMTOM - Find the protein that binds the DNA motifs.

• GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.

• PROSITE - Search for them in annotated protein sequence databases.

MAST

• Searches for motifs (one or more) in sequence databases:
  – Like BLAST but motifs for input
  – Similar to iterations of PSI-BLAST

• Profile defines strength of match
  – Multiple motif matches per sequence
  – Combined E value for all motifs

• MEME uses MAST to summarize results:
  – Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
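The idea that a profile defines the strength of a match can be sketched as follows: build a log-odds score for every residue at every motif position, then slide the profile along a sequence and sum the scores in each window. This is a simplified illustration only. The uniform background, the pseudocount of 1, the example sites and the query sequence are all assumptions made here, and MAST itself goes further by converting scores into p-values and a combined E-value.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def log_odds_columns(sites, pseudocount=1.0):
    """Build log2-odds score columns from aligned sites (uniform background)."""
    background = 1 / len(ALPHABET)
    total = len(sites) + pseudocount * len(ALPHABET)
    columns = []
    for col in range(len(sites[0])):
        residues = [s[col] for s in sites]
        columns.append({
            a: math.log2((residues.count(a) + pseudocount) / total / background)
            for a in ALPHABET
        })
    return columns


def best_match(sequence, columns):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(columns)
    return max(
        (sum(col[res] for col, res in zip(columns, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical motif sites and query sequence, for illustration only.
sites = ["YDEEGGDAEE", "YDEEGGDAEE", "YGEEGADYED",
         "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]
columns = log_odds_columns(sites)
score, start = best_match("MKTAYDEEGADYEELLK", columns)
print(start, round(score, 2))
```

The reported start is a 0-based index into the query sequence.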

http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi

MAST - Input

Email address

Input file (motifs)

Database

MAST - Output

Input motifs

Presence of the motifs in a given database

TOMTOM

• Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value.

• The output contains results for each query, in the order that the queries appear in the input file.

http://meme.sdsc.edu/meme/doc/tomtom.html

TOMTOM - Input

Input motif

Background frequencies

Database

DNA IUPAC* code

A --> adenosine
C --> cytidine
G --> guanine
T --> thymidine
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)

Example: YCAY = [TC]CA[TC]

*IUPAC = International Union of Pure and Applied Chemistry
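The IUPAC code can be expanded mechanically into a regular expression, as in the YCAY example. A minimal Python sketch follows; the dictionary mirrors the table above, and the function name `iupac_to_regex` is made up for this illustration.

```python
import re

# Ambiguity codes and the bases they stand for, as listed in the table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "S": "GC", "W": "AT",
    "R": "GA", "Y": "TC", "K": "GT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into an equivalent regular expression."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for letters outside the code
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


pattern = iupac_to_regex("YCAY")
print(pattern)  # [TC]CA[TC]
print([m.start() for m in re.finditer(pattern, "GGTCACAATCATGG")])
```

The second line prints the 0-based start positions of every match in a made-up test sequence.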

TOMTOM - Output

Input motif

Matching motifs

TOMTOM – Output

Wrong input, OK results

JASPAR

• Profiles
  – Transcription factor binding sites
  – Multicellular eukaryotes
  – Derived from published collections of experiments

• Open data access

Score

Organism

Logo

Name of gene/protein

GOMO

• GOMO takes DNA binding motifs to find putative target genes and analyze their associated GO terms. A list of significant GO terms that can be linked to the given motifs will be produced.

• GOMO returns a list of GO-terms that are significantly associated with target genes of the motif.

• Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes in any organism.

GOMO - Input

Email address

Input file (motifs)

Database

GOMO - Output

Input motifs

GO annotation

MF - Molecular function
BP - Biological process
CC - Cellular component

Prosite

http://www.expasy.org/tools/scanprosite

ProSite is a database of protein domains and motifs that can be searched by either regular expression patterns or sequence profiles.

Prosite - Input

Input motif (a regular expression)

Database

Filters

Prosite - Output

Input motif

Location in the protein sequence

protein
