Motif identification with Gibbs Sampler Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Not enough material


Post on 08-Feb-2016


Page 1: Motif identification with Gibbs Sampler

Motif identification with Gibbs Sampler

Xuhua Xia

[email protected]

http://dambe.bio.uottawa.ca

Not enough material

Page 2: Motif identification with Gibbs Sampler

Background

• Named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901.
• One of the Markov chain Monte Carlo algorithms
• Biological applications
  – Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
  – Classification of biological images (Samso et al., 2002)
  – Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)

Page 3: Motif identification with Gibbs Sampler

Motif Identification by Gibbs sampler

(a)
S1 TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2 TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S3 AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
……
SN CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

(b)
S1 TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2 TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S3 AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
……
SN CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

Gibbs sampler

Other outputs of Gibbs sampler:
• Position weight matrix that can be used to scan other sequences for motifs, with the associated significance tests
• Position weight matrix scores for identified motifs

Page 4: Motif identification with Gibbs Sampler

Gibbs sampler in motif finding

• Site sampler
• Motif sampler

Page 5: Motif identification with Gibbs Sampler

Algorithm details: Initialization

             1         2         3         4
    1234567890123456789012345678901234567890123
S1  TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2  CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
S3  TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S4  AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
S5  GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
..  ...
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
..  ...

Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.

             Site
Nuc   C0    1    2    3    4    5    6
A    278    8    7    9    6   10    7
C    279    3    8    5   10    6    5
G    230    7    5    6    5    3   11
T    248   11    9    9    8   10    6

Randomly choose motif start Ai.

FA: 325, FC: 316, FG: 267, FT: 301; Sum: 1209
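The initialization step can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names `init_starts` and `count_matrix`, the choice of three sequences, and the random seed are all assumptions made for the example. It picks a random start Ai for each sequence and tallies the site-specific counts together with the background counts (the C0 column).

```python
import random

NUC = "ACGT"

def init_starts(seqs, m, rng):
    """Pick one random motif start (0-based) per sequence."""
    return [rng.randrange(len(s) - m + 1) for s in seqs]

def count_matrix(seqs, starts, m):
    """Return (site_counts, background_counts).

    site_counts[n][j] counts nucleotide n at motif site j;
    background_counts[n] counts n outside the chosen motifs (the C0 column).
    """
    site = {n: [0] * m for n in NUC}
    bg = {n: 0 for n in NUC}
    for s, a in zip(seqs, starts):
        for pos, n in enumerate(s):
            if a <= pos < a + m:
                site[n][pos - a] += 1
            else:
                bg[n] += 1
    return site, bg

# First three sequences from the slide; the seed is arbitrary.
seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
]
rng = random.Random(0)
starts = init_starts(seqs, 6, rng)
site, bg = count_matrix(seqs, starts, 6)
```

Each site column sums to the number of sequences, which is why every site column of Table 7-1 sums to 29.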

Page 6: Motif identification with Gibbs Sampler

Algorithm details: Predictive update

Counts after removing the current motif of S11 (28 motifs remain):

             Site
Nuc   C0    1    2    3    4    5    6
A    279    7    7    9    6   10    7
C    279    3    8    5   10    6    5
G    233    7    4    6    4    3   10
T    250   11    9    8    8    9    6

Counts before the removal (Table 7-1, 29 motifs):

             Site
Nuc   C0    1    2    3    4    5    6
A    278    8    7    9    6   10    7
C    279    3    8    5   10    6    5
G    230    7    5    6    5    3   11
T    248   11    9    9    8   10    6

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
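The two count tables on this slide differ only in the motif currently assigned to S11. Comparing them column by column suggests that the removed 6-mer is AGTGTG, which starts at site 11 of S11. The Python sketch below reproduces that step; the function name `remove_motif` is hypothetical, and the start position is inferred from the tables rather than stated on the slide.

```python
# Counts before the update (Table 7-1): index 0 is C0, indices 1-6 are motif sites.
counts = {
    "A": [278, 8, 7, 9, 6, 10, 7],
    "C": [279, 3, 8, 5, 10, 6, 5],
    "G": [230, 7, 5, 6, 5, 3, 11],
    "T": [248, 11, 9, 9, 8, 10, 6],
}

def remove_motif(counts, motif):
    """Return a copy of counts with `motif` moved from the site columns to C0."""
    out = {n: row[:] for n, row in counts.items()}
    for j, n in enumerate(motif, start=1):
        out[n][j] -= 1  # no longer counted at motif site j
        out[n][0] += 1  # now counted as background
    return out

s11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"
start = 11  # 1-based start of the current S11 motif (inferred, see above)
motif = s11[start - 1:start + 5]
updated = remove_motif(counts, motif)
```

The resulting `updated` table has 28 counts per site column, and the six nucleotides of the removed motif reappear in the C0 column.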

Page 7: Motif identification with Gibbs Sampler

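Predictive update: Frequencies

Background frequencies:

$$Q_{0i} = \frac{C_{0i}}{\sum_{i=1}^{N_{Code}} C_{0i}}, \quad \text{e.g.,} \quad Q_{0A} = \frac{279}{279 + 279 + 233 + 250} = 0.2680$$

Background frequencies with pseudocounts:

$$Q_{0i} = \frac{C_{0i} + \alpha F_i}{\sum_{i=1}^{N_{Code}} C_{0i} + \alpha \sum_{i=1}^{N_{Code}} F_i}, \quad \text{e.g.,} \quad Q_{0A} = \frac{279 + 0.0001 \times 325}{279 + 279 + 233 + 250 + 0.0001 \times 1209} = 0.2680$$

Site-specific frequencies:

$$Q_{ij} = \frac{C_{ij}}{\sum_{i=1}^{N_{Code}} C_{ij}}, \quad \text{e.g.,} \quad Q_{A1} = \frac{7}{28} = 0.25$$

Site-specific frequencies with pseudocounts:

$$Q_{ij} = \frac{C_{ij} + \alpha F_i}{\sum_{i=1}^{N_{Code}} C_{ij} + \alpha \sum_{i=1}^{N_{Code}} F_i}, \quad \text{e.g.,} \quad Q_{A1} = \frac{7 + 0.0001 \times 325}{28 + 0.0001 \times 1209} = 0.2501$$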

Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with α = 0.0001. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.

                Site
Nuc   Q0      1       2       3       4       5       6
A     0.2680  0.2501  0.2501  0.3212  0.2145  0.3568  0.2501
C     0.2680  0.1078  0.2856  0.1789  0.3567  0.2145  0.1789
G     0.2238  0.2499  0.1432  0.2143  0.1432  0.1076  0.3566
T     0.2402  0.3922  0.3211  0.2856  0.2856  0.3211  0.2144
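The frequencies in Table 7-3 can be reproduced from the 28-motif counts and the overall nucleotide totals (325, 316, 267, 301) with a pseudocount weight of 0.0001. The Python sketch below is one way to do it, assuming every column, including C0, is normalized by its own total plus the weighted sum of the totals; the function name `frequencies` is hypothetical.

```python
NUC = "ACGT"
ALPHA = 0.0001  # pseudocount weight used in the slides

# Counts with S11's motif removed: index 0 is C0, indices 1-6 are motif sites.
counts = {
    "A": [279, 7, 7, 9, 6, 10, 7],
    "C": [279, 3, 8, 5, 10, 6, 5],
    "G": [233, 7, 4, 6, 4, 3, 10],
    "T": [250, 11, 9, 8, 8, 9, 6],
}
# Overall nucleotide counts F_i over all sequences (sum 1209).
F = {"A": 325, "C": 316, "G": 267, "T": 301}

def frequencies(counts, F, alpha):
    """Convert counts to frequencies column by column, adding alpha * F_i."""
    n_cols = len(counts["A"])
    f_total = sum(F.values())
    Q = {n: [] for n in NUC}
    for j in range(n_cols):
        col_total = sum(counts[n][j] for n in NUC)
        for n in NUC:
            Q[n].append((counts[n][j] + alpha * F[n]) / (col_total + alpha * f_total))
    return Q

Q = frequencies(counts, F, ALPHA)
```

With such a small weight the pseudocounts barely change the frequencies; their practical role is to keep every frequency above zero so that its logarithm is defined.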

Page 8: Motif identification with Gibbs Sampler

Predictive update: PWM

Site-specific frequencies:

                Site
Nuc   Q0      1       2       3       4       5       6
A     0.2680  0.2501  0.2501  0.3212  0.2145  0.3568  0.2501
C     0.2680  0.1078  0.2856  0.1789  0.3567  0.2145  0.1789
G     0.2238  0.2499  0.1432  0.2143  0.1432  0.1076  0.3566
T     0.2402  0.3922  0.3211  0.2856  0.2856  0.3211  0.2144

Position weight matrix:

Nuc   1        2        3        4        5        6
A     -0.0693  -0.0693   0.1811  -0.2228   0.2862  -0.0693
C     -0.9113   0.0637  -0.4042   0.2862  -0.2228  -0.4042
G      0.1102  -0.4469  -0.0434  -0.4469  -0.7327   0.4659
T      0.4907   0.2906   0.1731   0.1731   0.2906  -0.1135

$$PWM_{ij} = \ln\frac{Q_{ij}}{Q_{0i}}, \quad \text{e.g.,} \quad PWM_{A1} = \ln\frac{Q_{A1}}{Q_{0A}} = \ln\frac{0.2501}{0.2680} = -0.0693$$

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Odds ratio for CATGCC = e^(-0.9113 - 0.0693 + 0.1731 - 0.4469 - 0.2228 - 0.4042) = 0.153
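As a check on the arithmetic, the PWM and the odds ratio for CATGCC can be recomputed from the frequency table. The Python sketch below uses the four-decimal frequencies as printed, so the PWM entries agree with the slide only to about three decimals; the function names `make_pwm` and `odds_ratio` are hypothetical.

```python
import math

# Frequencies as printed: index 0 is Q0 (background), indices 1-6 are motif sites.
Q = {
    "A": [0.2680, 0.2501, 0.2501, 0.3212, 0.2145, 0.3568, 0.2501],
    "C": [0.2680, 0.1078, 0.2856, 0.1789, 0.3567, 0.2145, 0.1789],
    "G": [0.2238, 0.2499, 0.1432, 0.2143, 0.1432, 0.1076, 0.3566],
    "T": [0.2402, 0.3922, 0.3211, 0.2856, 0.2856, 0.3211, 0.2144],
}

def make_pwm(Q):
    """pwm[n][j] = ln(Q[n][j + 1] / Q[n][0]) for motif sites j = 0..5."""
    return {n: [math.log(q / row[0]) for q in row[1:]] for n, row in Q.items()}

def odds_ratio(pwm, kmer):
    """Exponential of the summed PWM entries along the k-mer."""
    return math.exp(sum(pwm[n][j] for j, n in enumerate(kmer)))

pwm = make_pwm(Q)
ratio = odds_ratio(pwm, "CATGCC")  # first 6-mer of S11
```

Because the PWM entries are log ratios, summing them and exponentiating is the same as multiplying the six ratios Q_ij / Q_0i directly.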

Page 9: Motif identification with Gibbs Sampler

Predictive update

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios normalized to have a sum of 1.

Site  6-mer   Odds Ratio  PNorm
1     CATGCC  0.153       0.004
2     ATGCCC  0.850       0.021
3     TGCCCT  0.664       0.016
4     GCCCTC  0.944       0.023
5     CCCTCA  0.254       0.006
6     CCTCAA  0.843       0.021
7     CTCAAG  0.609       0.015
8     TCAAGT  0.717       0.018
9     CAAGTG  0.613       0.015
10    AAGTGT  0.426       0.011
11    AGTGTG  0.967       0.024   (originally picked)
...   ...     ...         ...
35    TCAAGG  1.279       0.032   (new one to replace the originally picked because of the largest odds ratio)

• Number of possible sites: 40 – 6 + 1 = 35
• PNorm: odds ratios scaled to sum to 1
• Pick the one with the largest odds ratio, update the Ai value, and generate a new frequency matrix and a new PWM
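The scan of S11 can be sketched in Python using the PWM printed on the previous slide. The names `scan`, `best`, and `drawn` are hypothetical. Two selection rules are shown: taking the largest odds ratio, as this slide describes, and drawing a site with probability PNorm, which is the usual stochastic choice in Gibbs sampling and is included here as an assumption rather than as the slide's method. The sketch reports whatever the printed PWM implies; for example, window 14 (GTGCAG) evaluates to about 4.04 with these values, so the maximum it finds need not coincide with the slide's annotation.

```python
import math
import random

# PWM as printed (rows: nucleotide, columns: motif sites 1-6).
PWM = {
    "A": [-0.0693, -0.0693, 0.1811, -0.2228, 0.2862, -0.0693],
    "C": [-0.9113, 0.0637, -0.4042, 0.2862, -0.2228, -0.4042],
    "G": [0.1102, -0.4469, -0.0434, -0.4469, -0.7327, 0.4659],
    "T": [0.4907, 0.2906, 0.1731, 0.1731, 0.2906, -0.1135],
}

def scan(seq, pwm, m=6):
    """Return (1-based site, k-mer, odds ratio, normalized probability) per window."""
    rows = []
    for a in range(len(seq) - m + 1):
        kmer = seq[a:a + m]
        odds = math.exp(sum(pwm[n][j] for j, n in enumerate(kmer)))
        rows.append((a + 1, kmer, odds))
    total = sum(r[2] for r in rows)
    return [(site, kmer, odds, odds / total) for site, kmer, odds in rows]

s11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"
table = scan(s11, PWM)

# Rule described on the slide: take the window with the largest odds ratio.
best = max(table, key=lambda r: r[2])

# Assumed alternative: draw one window with probability PNorm (arbitrary seed).
rng = random.Random(1)
drawn = rng.choices(table, weights=[r[3] for r in table], k=1)[0]
```

The first tuples in `table` correspond to the rows of Table 7-4, and the last one is site 35 (TCAAGG).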

Page 10: Motif identification with Gibbs Sampler

Algorithm details: Predictive update

               Site
Nuc   C0     1      2      3      4      5      6
A    279     7      7     9+1    6+1    10      7
C    279     3     8+1     5     10      6      5
G    233     7      4      6      4     3+1   10+1
T    250   11+1     9      8      8      9      6

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

A new PWM

Scan another sequence

Page 11: Motif identification with Gibbs Sampler

F as a criterion

• Once all sequences are updated and a new set of Ai values obtained, compute

$$F = \sum_{i=1}^{N_{Code}} \sum_{j=1}^{m} C_{ij} \ln\frac{Q_{ij}}{Q_{0i}}$$

• Update all the sequences again to obtain a new set of Ai and a new F. If the new F is greater than the old F, replace the old set of Ai values with the new set. Repeat until the F value no longer increases or the maximum number of local iterations is reached.
• This (from initialization to this slide) completes one global cycle of iteration.
• Repeat a number of global cycles until F does not increase.

Page 12: Motif identification with Gibbs Sampler

F as a criterion

$$F = \sum_{i=1}^{N_{Code}} \sum_{j=1}^{m} C_{ij} \ln\frac{Q_{ij}}{Q_{0i}}$$

First example (weak motif)

Counts:
       1    2    3    4    5
A     12    9    7   13   10
C     10   10    6   10    8
G      8   11   14    7   12
T     10   10   13   10   10
Sum   40   40   40   40   40

Frequencies:
      Q0     Qi1    Qi2    Qi3    Qi4    Qi5
A     0.25   0.300  0.225  0.175  0.325  0.250
C     0.25   0.250  0.250  0.150  0.250  0.200
G     0.25   0.200  0.275  0.350  0.175  0.300
T     0.25   0.250  0.250  0.325  0.250  0.250

Contributions to F:
      Fi1     Fi2     Fi3     Fi4     Fi5
A      2.188  -0.948  -2.497   3.411   0.000
C      0.000   0.000  -3.065   0.000  -1.785
G     -1.785   1.048   4.711  -2.497   2.188
T      0.000   0.000   3.411   0.000   0.000

F = 4.379

Second example (strong motif)

Counts:
       1    2    3    4    5
A     35    1    1   18    1
C      1    1    1    2   19
G      2    1    2   18    1
T      2   37   36    2   19
Sum   40   40   40   40   40

Frequencies:
      Q0     Qi1    Qi2    Qi3    Qi4    Qi5
A     0.25   0.875  0.025  0.025  0.450  0.025
C     0.25   0.025  0.025  0.025  0.050  0.475
G     0.25   0.050  0.025  0.050  0.450  0.025
T     0.25   0.050  0.925  0.900  0.050  0.475

Contributions to F:
      Fi1     Fi2     Fi3     Fi4     Fi5
A     43.847  -2.303  -2.303  10.580  -2.303
C     -2.303  -2.303  -2.303  -3.219  12.195
G     -3.219  -2.303  -3.219  10.580  -2.303
T     -3.219  48.408  46.114  -3.219  12.195

F = 149.404
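Both F values on this slide can be reproduced from the count tables alone. The Python sketch below assumes a uniform background of 0.25 and no pseudocounts, which is what the worked numbers imply; the function name `f_value` and the labels `weak` and `strong` are hypothetical.

```python
import math

def f_value(counts, q0=0.25):
    """F = sum over nucleotides i and sites j of C_ij * ln(Q_ij / Q_0i).

    Q_ij is C_ij divided by the column total; the background Q_0i is taken
    as the constant q0. Zero counts contribute nothing.
    """
    m = len(next(iter(counts.values())))
    total = 0.0
    for j in range(m):
        col = sum(row[j] for row in counts.values())
        for row in counts.values():
            if row[j] > 0:
                total += row[j] * math.log(row[j] / col / q0)
    return total

# First example: counts close to uniform at every site.
weak = {"A": [12, 9, 7, 13, 10], "C": [10, 10, 6, 10, 8],
        "G": [8, 11, 14, 7, 12], "T": [10, 10, 13, 10, 10]}
# Second example: one or two nucleotides dominate each site.
strong = {"A": [35, 1, 1, 18, 1], "C": [1, 1, 1, 2, 19],
          "G": [2, 1, 2, 18, 1], "T": [2, 37, 36, 2, 19]}

f_weak = f_value(weak)
f_strong = f_value(strong)
```

The large gap between the two values is what makes F usable as a stopping criterion: F grows as the site-specific frequencies move away from the background.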

Page 13: Motif identification with Gibbs Sampler

Summary of the algorithms

• To find a motif of length L from a set of N sequences, randomly pick an L-mer from each sequence.
• From the N L-mers, produce a PWM.
• Randomly pick a sequence and scan it with the PWM to obtain a PWM score (PWMS) for each L-mer along the sequence.
• Use the L-mer with the highest PWMS to update the PWM.
• Repeat this scanning and updating until all sequences have been used.
• Calculate F1.
• Repeat the entire process and calculate F2.
• Continue the process until Fi does not increase any more.
• Output
  – The final PWM, as well as PWMS for each sequence
  – The aligned motifs
  – Associated statistics
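The steps summarized above can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the author's program: the names `build_pwm`, `f_value`, and `site_sampler` are hypothetical, sequences are visited in order rather than picked at random, the pseudocount weight 0.0001 is reused from the worked example, and every nucleotide is assumed to occur at least once in the data.

```python
import math
import random

NUC = "ACGT"

def build_pwm(seqs, starts, m, skip=None, alpha=0.0001):
    """PWM from all current motifs except that of sequence `skip`.

    Nucleotides outside the counted motifs form the background.
    """
    F = {n: sum(s.count(n) for s in seqs) for n in NUC}
    f_total = sum(F.values())
    site = {n: [0] * m for n in NUC}
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k != skip:
            for j in range(m):
                site[s[a + j]][j] += 1
    n_motifs = len(seqs) - (0 if skip is None else 1)
    bg = {n: F[n] - sum(site[n]) for n in NUC}
    bg_total = sum(bg.values())
    pwm = {}
    for n in NUC:
        q0 = (bg[n] + alpha * F[n]) / (bg_total + alpha * f_total)
        pwm[n] = [math.log((site[n][j] + alpha * F[n])
                           / (n_motifs + alpha * f_total) / q0)
                  for j in range(m)]
    return pwm, site

def f_value(seqs, starts, m):
    """F = sum of C_ij * ln(Q_ij / Q_0i) over all nucleotides and sites."""
    pwm, site = build_pwm(seqs, starts, m)
    return sum(site[n][j] * pwm[n][j] for n in NUC for j in range(m))

def site_sampler(seqs, m, rng, max_iter=100):
    """Random starts, then repeated scan-and-update passes while F increases."""
    starts = [rng.randrange(len(s) - m + 1) for s in seqs]
    best_f = f_value(seqs, starts, m)
    for _ in range(max_iter):
        new = starts[:]
        for k, s in enumerate(seqs):
            pwm, _ = build_pwm(seqs, new, m, skip=k)
            scores = [sum(pwm[s[a + j]][j] for j in range(m))
                      for a in range(len(s) - m + 1)]
            new[k] = max(range(len(scores)), key=scores.__getitem__)
        f = f_value(seqs, new, m)
        if f <= best_f:
            break
        starts, best_f = new, f
    return starts, best_f

seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
    "AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC",
    "GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC",
    "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG",
]
starts, best_f = site_sampler(seqs, 6, random.Random(0))
```

A single run like this corresponds to one global cycle; restarting from several random initializations and keeping the run with the largest F would mirror the repeated global cycles.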

Page 14: Motif identification with Gibbs Sampler

Final Report: Final Frequency

Final site-specific counts:
Site   A    C    G    U
1      3   11    0   15
2      0    0    8   21
3     21    0    8    0
4      0    0    0   29
5     10   18    0    1
6     17    0    1   11

Final site-specific frequencies:
Site   A        C        G        U
1      0.10413  0.37882  0.00092  0.51613
2      0.00112  0.00109  0.27563  0.72217
3      0.72225  0.00109  0.27563  0.00103
4      0.00112  0.00109  0.00092  0.99688
5      0.34451  0.61920  0.00092  0.03537
6      0.58489  0.00109  0.03526  0.37877

Final PWM [ln(Qij/Q0)]:
Site   A         C         G         U
1      -0.93304   0.31199  -5.57384   0.86909
2      -5.46894  -5.54337   0.13202   1.20499
3       1.00364  -5.54337   0.13202  -5.34419
4      -5.46894  -5.54337  -5.57384   1.52737
5       0.26340   0.80335  -5.57384  -1.81131
6       0.79269  -5.54337  -1.92440   0.55966

Page 15: Motif identification with Gibbs Sampler

Motif alignment

Seq V V
1 UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
2 CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
3 UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
4 AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
5 GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
6 AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
7 GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
8 CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
9 UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
10 GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
11 CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
12 GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
13 UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
14 CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
15 AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
23 GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
24 UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
25 GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
26 CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
27 CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
28 GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
29 CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC

Page 16: Motif identification with Gibbs Sampler

Motif scores

SeqName  Motif   Start  PWMS
S1       UUAUCA  18     493.3101
S2       CGGUCA  22     40.4251
S3       CUAUCA  14     282.6008
S4       AGAUAA  17     16.2174
S5       UGAUUA  16     12.3482
S6       CUAUCU  18     223.8567
S7       UUAUCA  20     493.3101
S8       UUAUCA  2      493.3101
S9       CUAUAA  17     164.6933
S10      CUAUCU  14     223.8567
S11      UGGUCA  21     70.5663
S12      UUGUAA  33     120.2498
S13      UUAUCU  20     390.7660
S14      UUAUCU  2      390.7660
S15      UUAUCA  10     493.3101
...      ...     ...    ...
S27      UUAUCA  19     493.3101
S28      CUAUCU  15     223.8567
S29      UUGUCA  2      206.3393
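The PWMS values in this table are consistent with exponentiating the sum of the final PWM entries along each motif. The Python sketch below checks this for three of the motifs, using the final PWM from the final report; the function name `pwms` is hypothetical.

```python
import math

# Final PWM from the final report: one dict per motif site 1-6.
FINAL_PWM = [
    {"A": -0.93304, "C": 0.31199, "G": -5.57384, "U": 0.86909},
    {"A": -5.46894, "C": -5.54337, "G": 0.13202, "U": 1.20499},
    {"A": 1.00364, "C": -5.54337, "G": 0.13202, "U": -5.34419},
    {"A": -5.46894, "C": -5.54337, "G": -5.57384, "U": 1.52737},
    {"A": 0.26340, "C": 0.80335, "G": -5.57384, "U": -1.81131},
    {"A": 0.79269, "C": -5.54337, "G": -1.92440, "U": 0.55966},
]

def pwms(motif):
    """Odds-ratio score: exp of the summed PWM entries along the motif."""
    return math.exp(sum(site[n] for site, n in zip(FINAL_PWM, motif)))

scores = {motif: pwms(motif) for motif in ("UUAUCA", "CGGUCA", "CUAUCU")}
```

UUAUCA takes the highest-scoring nucleotide at every site of the final PWM, which is why it has the largest PWMS in the table.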

Page 17: Motif identification with Gibbs Sampler

Motif sampler output

SeqName N 1 2 3

Seq1 2 10(TTATAA,93.4541) 18(TTATCA,163.6602)

Seq2 1 22(CGGTCA,14.5511)

Seq3 1 14(CTATCA,101.8203)

Seq4 0

Seq5 1 16(TGATTA,12.9266)

Seq6 1 18(CTATCT,90.7790)

Seq7 1 20(TTATCA,163.6602)

Seq8 2 2(TTATCA,163.6602) 24(CCATCA,10.2098)

Seq9 1 17(CTATAA,58.1420)

Seq10 3 14(CTATCT,90.7790) 28(ATATCT,41.4438) 32(CTGTCT,37.7888)

Seq11 1 21(TGGTCA,23.3886)

Seq12 2 3(TGGTCA,23.3886) 33(TTGTAA,38.9024)

Seq13 1 20(TTATCT,145.9129)

Seq14 1 2(TTATCT,145.9129)

Seq15 3 1(TTATTT,33.5700) 10(TTATCA,163.6602) 36(TTCTCT,17.7407)

Seq25 1 17(TTATCT,145.9129)

Seq26 1 15(CTATCG,21.2368)

Seq27 3 19(TTATCA,163.6602) 25(CTTTCT,13.3635) 32(TTATCA,163.6602)

Seq28 1 15(CTATCT,90.7790)

Seq29 2 2(UUGUCA,68.1272) 15(TGATAA,32.0835)