Special Topics in Genomics Motif Analysis

Upload: madeline-lawrence

Post on 18-Jan-2018


Special Topics in Genomics

Motif Analysis


Sequence motif – a pattern of nucleotide or amino acid sequences

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA

TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA

TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG

AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

[Figure: a TF (transcription factor) label marks each of the six sequences above.]

123456789

TGGGTGGTC

TGGGTGGTA

TGGGAGGTC

TGGGTGGTG

TGAGTGGTC

TGGGTGGTC

Transcription Factor Binding Sites (TFBS)

DNA motif:

Protein motif:


Motif representation


Consensus sequence

Example: CACSTG
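A consensus such as CACSTG uses the IUPAC ambiguity code S for "C or G". The sketch below shows one way to expand a consensus into a regular expression and find exact matches. The function names and the test sequence are hypothetical and are not taken from the slides.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Expand an IUPAC consensus into a regular expression string."""
    parts = []
    for code in consensus.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_exact(consensus, sequence):
    """Return 0-based start positions of exact matches on the given strand.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Hypothetical test sequence containing CACGTG and CACCTG.
regex = consensus_to_regex("CACSTG")
positions = find_exact("CACSTG", "TTCACGTGAACACCTGTT")
```

Only the given strand is searched here; scanning the reverse complement would be a separate pass.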


Sequence Logo

Schneider & Stephens, Nucleic Acids Res. 18:6097-6100 (1990)

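Entropy (Shannon): a measure of uncertainty. For base frequencies $p_b$ at a position, $H = -\sum_b p_b \log_2 p_b$.

The amount of uncertainty removed by observing the sequences is the amount of information (the information content) obtained. For DNA, where four equally likely bases give 2 bits of prior uncertainty, $I = 2 - H$ bits.

This is the total height of each position in the logo plot.

Within a position, the height of each nucleotide is proportional to its frequency.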
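As a rough illustration of the logo heights, the sketch below computes the information content of each column as log2(4) minus the Shannon entropy, using the six aligned TGGGTGGTC-like sites shown earlier, and then splits each column height among the bases in proportion to their frequencies. The function names are hypothetical, and the small-sample correction used by Schneider & Stephens is left out for brevity.

```python
import math
from collections import Counter

# The six aligned binding sites from the sequence motif slide.
SITES = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]


def column_frequencies(sites):
    """Return one {base: frequency} dict per alignment column."""
    n = len(sites)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*sites)]


def information_content(freqs, alphabet_size=4):
    """Information in bits: log2(alphabet size) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def logo_heights(sites):
    """Per column, the height of each base: frequency times column information."""
    heights = []
    for freqs in column_frequencies(sites):
        info = information_content(freqs)
        heights.append({b: p * info for b, p in freqs.items()})
    return heights


heights = logo_heights(SITES)
```

A fully conserved column (position 1, always T) reaches the 2-bit maximum, while the more variable position 9 is much shorter.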


Two questions in motif analysis

• Known motif mapping

Finding occurrences of a motif in nucleotide or amino acid sequences

• De novo motif discovery

Finding motifs that are previously unknown


Known motif mapping

• Consensus mapping

STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)

STEP 2: specify the number of mismatches allowed (e.g. <=1)

STEP 3: scan the sequence

CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT

A window with m=3 mismatches: no. A window with m=1 mismatch (e.g. CACATG): yes.

A useful tool: CisGenome (http://www.biostat.jhsph.edu/~hji/cisgenome)
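The three steps of consensus mapping can be sketched as a sliding-window scan that counts mismatches against the consensus. This is a minimal illustration with hypothetical function names, not CisGenome's implementation, and it scans the given strand only.

```python
# IUPAC codes needed for this example (S means C or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "N": "ACGT"}


def count_mismatches(consensus, window):
    """Number of positions where the window base is not allowed by the consensus."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, window))


def scan_consensus(consensus, sequence, max_mismatches=1):
    """Return (start, window, mismatches) for every window within the allowance."""
    k = len(consensus)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        m = count_mismatches(consensus, window)
        if m <= max_mismatches:
            hits.append((start, window, m))
    return hits


# The example sequence from the slide, scanned with at most one mismatch.
SEQUENCE = "CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT"
hits = scan_consensus("CACSTG", SEQUENCE, max_mismatches=1)
```

Start positions are 0-based. On this sequence the only window within one mismatch is CACATG; a window such as CAGATC has three mismatches and is rejected.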


Known motif mapping

• Motif matrix mapping (CisGenome)

STEP 1: provide a motif and background model

STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)

STEP 3: scan the sequence

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA

LR>500: yes; LR<500: no

Background:

     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif:

      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00

• Another tool for matrix mapping: MAST (http://meme.sdsc.edu/meme/mast-intro.html)
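The matrix mapping steps can be sketched as follows, using the motif and background matrices from the slide. Two points are assumptions made for this illustration: the 4x4 background table is read as first-order Markov transition probabilities (row = previous base, column = current base), and the very first base of the sequence is given probability 0.25. The function names are hypothetical; this is not the CisGenome or MAST code.

```python
WIDTH = 9

# Motif matrix from the slide: probability of each base at positions 1..9.
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

# Background table from the slide, read here (an assumption) as
# BACKGROUND[previous_base][current_base].
BACKGROUND = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def likelihood_ratio(sequence, start):
    """P(window | motif) / P(window | background) for the window at start."""
    ratio = 1.0
    for offset in range(WIDTH):
        pos = start + offset
        base = sequence[pos]
        if pos == 0:
            background_prob = 0.25  # assumption: no previous base available
        else:
            background_prob = BACKGROUND[sequence[pos - 1]][base]
        ratio *= MOTIF[base][offset] / background_prob
    return ratio


def scan_matrix(sequence, cutoff=500.0):
    """Return (start, window, LR) for every window with LR >= cutoff."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        lr = likelihood_ratio(sequence, start)
        if lr >= cutoff:
            hits.append((start, sequence[start:start + WIDTH], lr))
    return hits


SEQUENCE = ("GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA"
            "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA")
hits = scan_matrix(SEQUENCE, cutoff=500.0)
```

Because the matrix contains exact zeros, any window with a disallowed base gets LR = 0. In practice a pseudocount is often added to the motif matrix to avoid this, and long motifs are usually scored in log space.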


De novo motif discovery

• Two major classes of methods:

1. Word enumeration

2. Matrix updating


Word enumeration

Example: Sinha & Tompa, Nucleic Acids Res. 30: 5549-5560 (2002)

STEP 1: enumerate possible words;

STEP 2: count word occurrences;

STEP 3: compare observed word count with random expectation.
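The three steps above can be sketched with a deliberately simplified statistic. Here the random expectation assumes independent bases drawn at the observed base frequencies, and the z-score uses a binomial variance that ignores word overlap. This is an assumption made for brevity and is not the scoring used by Sinha & Tompa; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# The six example sequences from the sequence motif slide.
SEQUENCES = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]


def score_words(sequences, k):
    """Return {word: (observed, expected, z_score)} for every possible k-mer."""
    # Step 1: enumerate all 4**k possible words.
    words = ["".join(p) for p in product("ACGT", repeat=k)]

    # Step 2: count occurrences with a sliding window (overlaps included).
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    # Step 3: compare with the expectation under independent bases.
    base_counts = Counter("".join(sequences))
    total = sum(base_counts.values())
    freq = {b: base_counts[b] / total for b in "ACGT"}
    n_windows = sum(len(seq) - k + 1 for seq in sequences)

    scores = {}
    for word in words:
        p = math.prod(freq[b] for b in word)
        expected = n_windows * p
        sd = math.sqrt(n_windows * p * (1.0 - p))
        z = (counts[word] - expected) / sd if sd > 0 else 0.0
        scores[word] = (counts[word], expected, z)
    return scores


scores = score_words(SEQUENCES, 6)
top_words = sorted(scores, key=lambda w: scores[w][2], reverse=True)[:5]
```

Exhaustive enumeration is practical only for short words, since the number of candidates grows as 4^k.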


Matrix updating

• CONSENSUS (Stormo & Hartzell, PNAS, 86: 1183-1187, 1989)

STEP 1: use all k-mers in the first sequence as seeds;

STEP 2: find matches (often use best matches) of each seed in the second sequence;

STEP 3: update seed matrices, exclude matrices with low information content;

STEP 4: repeat steps 2 and 3 for each remaining sequence.
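A greedy procedure in the spirit of these four steps might look like the sketch below. It is an illustration under stated assumptions, not the CONSENSUS program itself: each seed takes only its single best match per sequence, alignments are ranked by total information content without small-sample correction, and the `keep` parameter (a hypothetical name) sets how many alignments survive each round.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information (bits) of an ungapped DNA alignment, summed over columns."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        total += 2.0 - entropy
    return total


def kmers(sequence, k):
    """All k-mers of a sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def greedy_consensus(sequences, k, keep=20):
    """Return the highest-information alignment, one site per sequence."""
    # Step 1: every distinct k-mer in the first sequence seeds an alignment.
    alignments = [[seed] for seed in dict.fromkeys(kmers(sequences[0], k))]

    # Step 4: process the remaining sequences one at a time.
    for sequence in sequences[1:]:
        extended = []
        for sites in alignments:
            # Step 2: add the k-mer that gives the most informative alignment.
            best = max(kmers(sequence, k),
                       key=lambda w: information_content(sites + [w]))
            extended.append(sites + [best])
        # Step 3: keep only the most informative alignments.
        extended.sort(key=information_content, reverse=True)
        alignments = extended[:keep]
    return alignments[0]


SEQUENCES = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]
best_sites = greedy_consensus(SEQUENCES, k=9)
```

The result depends on the order of the sequences, since the seeds come only from the first one. This order dependence is a known limitation of greedy approaches of this kind.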


Matrix updating

• Mixture model

EM:

Lawrence and Reilly (1990)

Bailey and Elkan (1994), etc.

Gibbs Sampler:

Lawrence et al. (1993)

Liu (1994), Liu et al. (1995), etc.

S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA

A: 000000000000001000000000000000000000000001000000000000000000000000000000

Background (θ0):

     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif (Θ, width W):

      1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00

q = [q0, q1]

$$f(A, \Theta, W, q \mid S, \theta_0) \propto f(S, A \mid \Theta, W, q, \theta_0)\, \pi(\Theta, W, q)$$

Inference by iterative estimation/sampling: (Θ, W, q) ⇄ A


Other issues

• Dependencies within motif

• Functions of novel motifs