
MotifClick: cis-regulatory k-length motifs finding in cliques of 2(k-1)-mers

Shaoqiang Zhang

http://bioinfo.uncc.edu/szhang

April 3, 2013

Gene regulation in prokaryotes

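[Figure: structure of a prokaryotic gene. Labels: promoter region containing transcription factor binding sites (cis-regulatory elements), with positions -300, -35, and -10 marked upstream of the transcription start site (TSS, +1); transcription into mRNA; terminator; 3' UTR.]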

[Figure: transcription factors TF1 and TF2 bind transcription factor binding sites (TFBS) upstream of an operon (Gene1, Gene2, Gene3) and of further genes (Gene4 to Gene6).]

Co-regulated genes (regulon) in a single genome: a TF with binding sites BS1, BS2, and BS3. The sites BS1, BS2, and BS3 form a cis-regulatory motif (binding site motif).

Orthologous genes in Genome1, Genome2, Genome3, Genome4, and Genome5

Phylogenetic footprinting technique

Example binding sites: TGTGAGATAGATCACA, CATGATTTAAATCGCA, ..., TGTGATCAACATCACA

[Figure: binding sites BS1, BS2, and BS3 aligned into a motif and shown as a sequence logo.]

Motif (30 binding sites of length 16):

TTGTTACGTTATAACA  CGGTTATATTATAACA  CGGTTATGTTATAACA  TGGTTATGTTATAACA  TGGTTATGTTATAACA
TGGTTATGTTATAACA  TGGTTATGTTATAACA  CGGTTATGTTATAACA  TGGTTATGTTATAACA  TTGTTATGTTATAACG
ATGTTATATTATTACA  TTGTTATGTTATAACA  TTGTTATGTTATAACA  TTGTTATGTTATAACA  TTGTTATGTTATAACA
TTGTTATGTTATAACA  TTGTTATAGTATAACA  TTAAAATGTTATAACA  TTAAAATGTTATAACA  TTAATATGTTATAACA
TTGTTATAATATAACA  ATGTTACATTATAACA  ATGTTACATTATAACA  ATGTTACATTATAACA  ATGTTACATTATAACA
CGGTTATGTTATAACA  TGGTTATGTTATAACA  TGGTTATGCTATAACA  TTAAAATGTTATAACA  TTAATATGTTATAACA

Motif frequency matrix:

$$F = \big(f(b,i)\big)_{4 \times L}$$

A    5   0   5   5   3  30   0   8   1   0  30   0  29  30   0  29
C    4   0   0   0   0   0   5   0   1   0   0   0   0   0  30   0
G    0  11  25   0   0   0   0  22   1   0   0   0   0   0   0   1
T   21  19   0  25  27   0  25   0  27  30   0  30   1   0   0   0

Motif profile matrix (position weight matrix):

$$\mathrm{Prf} = \big(P(b,i)\big)_{4 \times L}, \qquad P(b,i) = \log\frac{p(b,i)}{q(b)}$$

where p(b,i) is the probability of base b at position i of the motif and q(b) is the background probability of base b.

A  -0.839 -5.231 -0.839 -0.839 -1.531  1.688 -5.231 -0.187 -2.909 -5.231  1.688 -5.231  1.639  1.688 -5.231  1.639
C  -0.607 -4.695 -4.695 -4.695 -4.695 -4.695 -0.302 -4.695 -2.373 -4.695 -4.695 -4.695 -4.695 -4.695  2.224 -4.695
G  -4.611  0.88   2.047 -4.611 -4.611 -4.611 -4.611  1.864 -2.289 -4.611 -4.611 -4.611 -4.611 -4.611 -4.611 -2.289
T   1.235  1.093 -5.174  1.484  1.594 -5.174  1.484 -5.174  1.594  1.745 -5.174  1.745 -2.852 -5.174 -5.174 -5.174
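A minimal sketch of how a frequency matrix and a log-odds profile matrix can be computed from aligned binding sites. The function names, the pseudocount, the base-2 logarithm, and the uniform background are assumptions made for illustration; the slides do not state which values were used.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts

def profile_matrix(counts, background, pseudocount=1.0):
    """Log-odds profile P(b, i) = log2(p(b, i) / q(b)).

    Assumption: p(b, i) is smoothed with a background-weighted pseudocount
    so that unseen bases get a finite score.
    """
    length = len(counts["A"])
    n_sites = sum(counts[b][0] for b in BASES)
    profile = {b: [0.0] * length for b in BASES}
    for b in BASES:
        for i in range(length):
            p = (counts[b][i] + pseudocount * background[b]) / (n_sites + pseudocount)
            profile[b][i] = math.log2(p / background[b])
    return profile

# Toy input: four short sites and a uniform background (both hypothetical).
sites = ["TTGTTACG", "CGGTTATA", "TGGTTATG", "ATGTTATA"]
background = {b: 0.25 for b in BASES}
counts = count_matrix(sites)
profile = profile_matrix(counts, background)
```

Positions where one base dominates get a positive score for that base and negative scores for the others, which is the pattern visible in the profile matrix above.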

Motif finding from co-regulated/orthologous genes

[Figure: coverage of known BSs (y-axis, 0 to 1) versus top number of output motifs (x-axis, 0 to 35). Series: All, MEME, BioProspector, CUBIC, MotifSampler, MDscan, Weeder, CONSENSUS.]

Many motif-finding programs have been developed, such as MEME, BioProspector, MotifSampler, MotifCut, MDscan, Weeder, and CONSENSUS.

We have also developed a motif-finding program: MotifClick

http://motifclick.uncc.edu

The binding sites of a TF may be divided into distinct sub-motifs.

Merge cliques

MotifClick: sub-motifs

Previous works

BOBRO

• Graph construction: G=(V,E) unweighted graph, where V={candidate motif segments} and E={for each pair of input sequences, the top 10 pairs of segments with the largest numbers of conserved segments in the input seqs}

• Finding a clique from an edge

• Expand each clique to a closure by adding candidate segments

• Sort motif closures in p-value order

MotifCut

• Graph construction: G=(V,E,W) weighted graph, where V={all k-mers}, E={each pair of k-mers}, and W={the probability that two k-mers belong to the same motif under the nucleotide background distribution}

• Maximum density subgraph finding (max-flow min-cut algorithm)

• Refine the density subgraph

• Sort motifs in the order of constructing maximum density graphs

Main idea

• Weighted graph: reduce constructed graph scale by using 2(k-1)-mers.

• Edge weight: use match number and consider the background.

• Clique finding: use the program we designed in GLECLUBS (find clique from each node).

• Expansion: expand cliques into quasi-cliques to include more segments.

• Rank: based on the size of cliques.

Graph construction: Vertex set

Input a set of N sequences s1, ..., si, ..., sN.

Each sequence is cut into 2(k-1)-mers with step length = k-1.

Each k-mer is located in exactly one 2(k-1)-mer.

The size of the last 2(k-1)-mer of a sequence is in [k, 2(k-1)].
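A minimal sketch of the vertex construction, assuming windows start at position 0 and that a trailing window is kept only if it still holds a full k-mer. The function names are hypothetical and not taken from the MotifClick source.

```python
def split_into_windows(seq, k):
    """Cut seq into 2(k-1)-mers taken every k-1 positions.

    Returns (start, window) pairs. The last window may be shorter than
    2(k-1) but always contains at least one full k-mer.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    step = k - 1
    windows = []
    start = 0
    while start + k <= len(seq):
        windows.append((start, seq[start:start + 2 * step]))
        start += step
    return windows

def kmer_starts_in_window(start, window, k):
    """Sequence coordinates of all k-mers fully contained in one window."""
    return [start + offset for offset in range(len(window) - k + 1)]

seq = "ACGTACGTACGTAC"
k = 4
windows = split_into_windows(seq, k)
covered = [p for start, w in windows for p in kmer_starts_in_window(start, w, k)]
```

With a step of k-1 and a window length of 2(k-1), every k-mer start position appears in `covered` exactly once, which is the property stated on the slide.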

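Graph construction: Edge set

For each pair of 2(k-1)-mers M' and M'', calculate the maximum match number:

$$\max_{a \in k\text{-mer}(M'),\; b \in k\text{-mer}(M'')} \mathrm{matchNumber}(a, b)$$

[Figure: a k-mer a inside M' compared with a k-mer b inside M''.]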

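Sum of squared distance (SSD) between a binding site bs and the background bg:

$$\mathrm{SSD}(bs, bg) = \sum_{b \in \{A, C, G, T\}} \big(p(b) - q(b)\big)^2$$

where p(b) is the probability of each base b in a binding site and q(b) is its probability in the background.

[Figure: distribution of SSD for E. coli known binding sites (x-axis 0 to 0.5, y-axis 0 to 0.16), with the values 0.02 and 0.2 marked.]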

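If the max match number >= cutoff, and the two k-mers a and b with the max matches satisfy

$$\mathrm{SSD}(a, bg) \in (0, \text{SSD cutoff}] \quad \text{and} \quad \mathrm{SSD}(b, bg) \in (0, \text{SSD cutoff}],$$

then link M' and M'' with an edge.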
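The edge rule can be sketched as follows. This is an illustration under assumptions: p(b) is taken as the base frequency within a single k-mer, the first best-matching pair is used when several pairs tie, and all function and parameter names are hypothetical.

```python
BASES = "ACGT"

def match_number(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))

def kmers(window, k):
    """All k-mers contained in a 2(k-1)-mer."""
    return [window[i:i + k] for i in range(len(window) - k + 1)]

def max_match(m1, m2, k):
    """Return (matches, a, b) for the best-matching k-mer pair of m1 and m2."""
    pairs = ((match_number(a, b), a, b) for a in kmers(m1, k) for b in kmers(m2, k))
    return max(pairs, key=lambda t: t[0])

def ssd(kmer, background):
    """Sum over A, C, G, T of (p(b) - q(b))^2 for one k-mer."""
    return sum((kmer.count(b) / len(kmer) - background[b]) ** 2 for b in BASES)

def has_edge(m1, m2, k, match_cutoff, ssd_cutoff, background):
    """Apply the match cutoff and the SSD interval (0, ssd_cutoff] to both k-mers."""
    matches, a, b = max_match(m1, m2, k)
    return (matches >= match_cutoff
            and 0 < ssd(a, background) <= ssd_cutoff
            and 0 < ssd(b, background) <= ssd_cutoff)

uniform = {b: 0.25 for b in BASES}
linked = has_edge("AATGCC", "AATGGG", 4, 3, 0.5, uniform)
```

In this toy call the best pair is AATG against AATG with 4 matches and an SSD of 0.125, so the two 2(k-1)-mers are linked.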

How to select the match cutoff and the SSD cutoff?

[Figure: plot with a series labeled "Random" (x-axis 6 to 16, y-axis 0 to 40).]

Randomly select a k-mer in the input seqs set, and find a k-mer having max matches with it in each seq s1, ..., si, ..., sN.

Keep 95% of these k-mers by deleting the 5% with the fewest matches, and calculate the average match number of the remaining 95%.

Match cutoff = average match number

Sampling times = max{10, N/4}

NOTE: the cutoff can be amended later
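A minimal sketch of the sampling procedure. Several details are assumptions not given on the slide: the sequence that supplied the sampled k-mer is skipped, N/4 is rounded down, and the per-sample averages are themselves averaged. The function names are hypothetical.

```python
import random

def match_number(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))

def best_match_in_sequence(kmer, seq):
    """Highest match number between kmer and any k-mer of seq."""
    k = len(kmer)
    return max(match_number(kmer, seq[i:i + k]) for i in range(len(seq) - k + 1))

def estimate_match_cutoff(seqs, k, rng=None):
    """Average best-match number over max(10, N/4) randomly sampled k-mers."""
    rng = rng or random.Random()
    n = len(seqs)
    averages = []
    for _ in range(max(10, n // 4)):
        src = rng.randrange(n)
        start = rng.randrange(len(seqs[src]) - k + 1)
        probe = seqs[src][start:start + k]
        scores = sorted(best_match_in_sequence(probe, s)
                        for i, s in enumerate(seqs) if i != src)
        kept = scores[int(len(scores) * 0.05):]  # drop the lowest 5%
        averages.append(sum(kept) / len(kept))
    return sum(averages) / len(averages)

identical = ["ACGTACGT"] * 5
cutoff_identical = estimate_match_cutoff(identical, 4, random.Random(0))
```

When all input sequences are identical every sampled k-mer finds a perfect match elsewhere, so the estimate equals k; on real data it falls somewhere below k.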

Graph construction: G=(V,E)

[Figure: graph whose vertices are the 2(k-1)-mers of the sequences s1, ..., si, ..., sj, ..., sN.]

MotifCut: max density subgraphs

BOBRO: maximal clique starting from an edge

MotifClick: maximal cliques starting from each node

We can correct the cutoff by calculating the graph density. If the graph density > 100, set cutoff = cutoff + 1 until density <= 100, and update the graph.

[Figure: the graph at cutoff=10 and at cutoff=11.]

Break ties by deleting the vertex with minimum sum of weights in the induced subgraph

Neighbor graph of vertex v

Cliques finding
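One plausible reading of the clique search, sketched under assumptions: starting from the neighbor graph of a vertex v, the vertex with the fewest neighbors inside the current subgraph is removed repeatedly, ties are broken by the minimum sum of edge weights, and the loop stops when the remaining vertices form a clique. The removal order by degree and the dictionary-based graph layout are assumptions; only the tie-breaking rule comes from the slide.

```python
def clique_from_vertex(graph, v):
    """Greedy clique containing v.

    graph: {vertex: {neighbor: weight}}, symmetric, with comparable vertex ids.
    """
    members = {v} | set(graph[v])
    while True:
        degree = {u: sum(1 for w in graph[u] if w in members) for u in members}
        if all(d == len(members) - 1 for d in degree.values()):
            return members
        weight = {u: sum(wt for w, wt in graph[u].items() if w in members)
                  for u in members}
        # Never remove v itself; the vertex id is the last tie-breaker.
        worst = min((u for u in members if u != v),
                    key=lambda u: (degree[u], weight[u], u))
        members.remove(worst)

def sum_of_matches(graph, clique):
    """Total edge weight inside a clique, counting each edge once."""
    return sum(graph[u][w] for u in clique for w in clique if u < w)

# Toy graph: a triangle a-b-c plus a pendant vertex d attached to a.
graph = {
    "a": {"b": 5, "c": 5, "d": 3},
    "b": {"a": 5, "c": 5},
    "c": {"a": 5, "b": 5},
    "d": {"a": 3},
}
clique_a = clique_from_vertex(graph, "a")
clique_d = clique_from_vertex(graph, "d")
```

Running the search from every vertex yields one clique per vertex, and the cliques can then be ranked by `sum_of_matches`.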

CliquesGroup = {Clique1, Clique2, ..., CliqueM}, ordered from max sum of matches (Clique1) to min sum of matches (CliqueM)

Top 1 motif: Clique1 (core) + other cliques (expansion)

Merge other cliques into Clique1

Overlap ratio and threshold for merging:

|Clique1 ∩ otherClique| / |Clique1|, threshold 3/5

or

|Clique1 ∩ otherClique| / |otherClique|, threshold 3/4

[Figure: a 5-clique and a 4-clique with shared vertices.]

After merging some other cliques into Clique1, update the cliques group by removing Clique1 and the cliques merged into Clique1.

Alternative under consideration (?): |Clique1 ∩ otherClique| / min{|Clique1|, |otherClique|}, threshold 3/5

Gapless alignments

For all k-mers in the quasi-clique of 2(k-1)-mers, find the k-mer with the max number of neighbors.

Cutoff = average match number

[Figure: gapless alignment of k-mers around the k-mer with the max number of neighbors; some k-mers are marked "discard".]

MUSCLE4.0: too strict to get ideal results

Final alignment

Main steps

1. Read input fasta file into a matrix
2. Calculate background
3. Select match cutoff by estimating average match number
4. Build graph of 2(k-1)-mers
5. Calculate graph density
6. Update graph by deleting edges with matches=cutoff if graph density > density cutoff
7. Find all cliques associated with each vertex
8. Select the clique with max sum of matches and merge it with other cliques
9. Do gapless alignments on the expanded quasi-clique
10. Update clique group, and go back to step 8

Flowchart of MotifClick

Estimate average match number
-> Set match cutoff = average match number + 1
-> Build graph of 2(k-1)-mers
-> Graph density < 100?
   No: set match cutoff = cutoff + 1, update graph, and check the density again
   Yes: continue
-> Find all cliques associated with each vertex
-> Select the clique with max sum of matches and merge it with other cliques
-> Gapless alignments using average match number as cutoff
-> Update clique group
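The density check in the flowchart can be sketched as a loop. The slides do not define graph density, so this sketch assumes density = number of edges / number of vertices; that definition, the edge dictionary layout, and the function names are all hypothetical.

```python
def graph_density(edges, n_vertices):
    """Assumed definition: number of edges divided by number of vertices."""
    return len(edges) / n_vertices if n_vertices else 0.0

def tighten_cutoff(edges, n_vertices, cutoff, max_density=100):
    """Raise the match cutoff until the graph is sparse enough.

    edges: {(u, v): match number}. While the graph is too dense, edges
    whose match number equals the current cutoff are deleted and the
    cutoff is increased by one.
    """
    edges = {e: m for e, m in edges.items() if m >= cutoff}
    while graph_density(edges, n_vertices) > max_density:
        edges = {e: m for e, m in edges.items() if m > cutoff}
        cutoff += 1
    return edges, cutoff

# Toy graph on 4 vertices with a deliberately small density bound.
toy_edges = {(0, 1): 3, (0, 2): 3, (0, 3): 3, (1, 2): 4, (1, 3): 4, (2, 3): 5}
kept_edges, final_cutoff = tighten_cutoff(toy_edges, 4, 3, max_density=1)
```

The loop always terminates because each pass removes at least the weakest remaining edges, and an empty graph has density 0.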

Improvement

• How many kinds of nucleotides appear in a binding site?

Kinds of nucleotides   Yeast (SGD)   E. coli (RegulonDB)
1                      0             0
2                      4.4%          1%
3                      32.4%         14%
4                      63.2%         85%

SGD (Saccharomyces Genome Database): http://www.yeastgenome.org

So, we only search the k-mers containing at least 3 kinds of nucleotides.
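A small sketch of this filter. The function names are hypothetical; only the threshold of 3 kinds of nucleotides comes from the slide.

```python
def has_enough_nucleotide_kinds(kmer, min_kinds=3):
    """True if the k-mer contains at least min_kinds distinct bases."""
    return len(set(kmer.upper()) & set("ACGT")) >= min_kinds

def candidate_kmers(seq, k, min_kinds=3):
    """All k-mers of seq that pass the nucleotide-kind filter."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if has_enough_nucleotide_kinds(seq[i:i + k], min_kinds)]

kept = candidate_kmers("AAAATTTTACGT", 4)
```

Low-complexity k-mers such as AAAA or AATT are skipped, which also reduces the number of k-mer comparisons needed when building the graph.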

Improvement

[Figure: distribution over binding sites (y-axis, 0 to 0.6) of the value below (x-axis, 0 to 1) for RegulonDB and SGD.]

Example: TTTTTTCA gives 0.75.

Percent of max length of single-nucleotide segments in BSs
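The quantity can be computed as the longest run of a single nucleotide divided by the site length. A minimal sketch with a hypothetical function name:

```python
from itertools import groupby

def max_run_fraction(site):
    """Longest single-nucleotide run divided by the site length."""
    if not site:
        raise ValueError("site must not be empty")
    longest = max(len(list(group)) for _, group in groupby(site))
    return longest / len(site)

example = max_run_fraction("TTTTTTCA")
```

For TTTTTTCA the longest run is six T's in a site of length 8, which gives the 0.75 shown on the slide.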

[Figure: percentage of binding sites (y-axis, 0 to 0.14) versus SSD (x-axis, 0 to 0.7), shown first for SGD and RegulonDB and then for SGD, RegulonDB, DBTBS, Redfly, and JASPAR. The SSD values 0.02, 0.06, 0.10, 0.14, 0.18, and 0.22 are marked. SSD cutoff=0.2]

Command-line options

USAGE:
MotifClick <dataset> [OPTIONS] > OutputFile

<dataset>  file containing DNA sequences in FASTA format

OPTIONS:
-w  motif width (default=16)
-n  maximum number of motifs to find (default=5)
-b  2 if examine sites on both DNA strands (default=1, only forward)
-d  upper bound of graph density (default=100)
-s  0 if want more degenerate sites (default=1 if want fewer sites)

-s 1: match cutoff = average match number + 1
-s 0: match cutoff = average match number

Written in standard C++ and compiled with the GNU C++ compiler under Linux and Mac, and with MinGW (Minimalist GNU for Windows) under 32-bit Windows.

http://bioinfo.uncc.edu/szhang/computing.htm

Synthetic data test

Compare with Motif finding tools: MEME, BioProspector, Weeder and MotifCut

Hu et al. used the RegulonDB database to evaluate five algorithms (AlignACE, MEME, BioProspector, MDscan, and MotifSampler) for the prediction of prokaryotic binding sites, and found that MEME often achieved the best sensitivity and BioProspector often achieved the highest specificity.

Tompa et al. used the TRANSFAC database to assess 13 computational tools for the discovery of transcription factor binding sites in eukaryotes, and found that Weeder was the best and that MEME was also good.

We test programs for k-mer sizes 8, 12, and 16.

Weeder can only find motifs with length 6, 8, 10, or 12 (parameters: small (6,8), medium (6,8,10), large (6,8,10,12), extra (6-12, mainly 8,10)).

Shaoqiang Zhang et al. found that MEME and BioProspector cover the most true BSs, followed by CUBIC, MDscan, MotifSampler, and CONSENSUS.

Synthetic data test

Binding-site-level accuracy:

• Sensitivity: Sn=TP/(TP+FN)=(number of correctly predicted BSs)/(number of actual BSs)

• Specificity: Sp=TP/(TP+FP)=(number of correctly predicted BSs)/(number of predicted BSs)

• Performance coefficient: PC=TP/(TP+FP+FN)=(number of correctly predicted BSs)/(number of {actual U predicted BSs})

• F-measure/Harmonic mean: F=2*Sn*Sp/(Sn+Sp)
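The four measures can be computed directly from sets of predicted and actual sites. This sketch assumes a predicted site counts as correct only when it is identical to an actual site; a real evaluation would normally use an overlap criterion, which the slides do not specify. The function name is hypothetical.

```python
def site_level_accuracy(predicted, actual):
    """Return Sn, Sp, PC, and F for sets of predicted and actual binding sites."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    pc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return {"Sn": sn, "Sp": sp, "PC": pc, "F": f}

# Sites are represented by arbitrary identifiers here.
scores = site_level_accuracy(predicted={1, 2, 3, 4}, actual={3, 4, 5, 6, 7, 8})
```

In this toy case TP=2, FP=2, and FN=4, so Sn=1/3, Sp=0.5, PC=0.25, and F=0.4.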

Synthetic data test

A motif containing 20 binding sites

The motif instance of 20 BSs was randomly seeded into a synthetic FASTA file of 20 seqs, not necessarily one BS per seq.

We generated synthetic sets of background sequences using a 3rd-order Markov model.

[Figure: a motif seqs set seeded into a synthetic background seqs set.]

We will test on sets of 20 seqs of length 400, 600, 800, and 1000 (400*20, 600*20, 800*20, and 1000*20).
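A minimal sketch of generating background sequences from a 3rd-order Markov model. The training input, the uniform fallback for unseen contexts, the random choice of the first three bases, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, order=3):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def generate_background(counts, length, order=3, rng=None):
    """Sample one sequence of the requested length from the trained model."""
    rng = rng or random.Random()
    out = list(rng.choice(sorted(counts)))  # start from a context seen in training
    while len(out) < length:
        following = counts.get("".join(out[-order:]))
        if not following:
            out.append(rng.choice("ACGT"))  # assumed fallback: uniform choice
            continue
        bases, weights = zip(*sorted(following.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

model = train_markov(["ACGTACGTACGT"])
background_seq = generate_background(model, 20, rng=random.Random(0))
```

Trained on real intergenic sequences, the same two functions would produce backgrounds that keep the local 4-mer statistics of the genome; motif instances can then be seeded at random positions.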

Synthetic data test (8-mer/Octamer)

Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of the yeast genome.

Motifs containing 20 BSs with information contents of 12 bits (at most 6 positions are conserved) were chosen from the SGD database.

Yeast background: AT: 0.65, GC: 0.35

Commands:

meme inputfile.fasta -dna -mod anr -w 8 -nmotifs 1 -text > file.meme.out

weederTFBS.out -f inputfile.fasta -W 8 -O SC -e 3 -R 50 -M -T 1
adviser.out inputfile.fasta S

BioProspector -i inputfile.fasta -W 8 -d 1 -r 1 -o file.biop.out

Motif_cuts.exe inputfile.fasta 8 1

MotifClicker inputfile.fasta -w 8 -n 1 -s 1 > file.motifclick.out

MotifClicker inputfile.fasta -w 8 -n 1 -s 0 > file.motifclick.out

[Figure: histogram of binding sites count (y-axis, 200 to 1200) by binding site length (x-axis, 5 to 19) in SGD. Bar labels in extracted order: 67, 197, 1132, 498, 492, 297, 202, 82, 330, 4, 5, 0, 16, 5, 10.]

weederlauncher.out inputfile SC medium M T1

Weeder fixes the number of mutations allowed, which is unfair to other tools.

[Figure: percentage of binding sites (y-axis, 0 to 0.14) versus SSD (x-axis, 0 to 0.7) for SGD and RegulonDB, with the SSD values 0.02, 0.06, 0.10, 0.14, 0.18, and 0.22 marked.]

Background seqs sets of size 400*20, 600*20, 800*20, and 1000*20. Seed motifs into 100 instances of each size.

Synthetic data test (8-mer)

100 instances of 400*20 seq sets

Note: Weeder did not output any results on the two motifs after setting the number of output motifs to "T1", so we decided to use "T2" and only consider the top 1 motif of "T2".

Average SSD=0.06, 400*20 (values in %)

Tool                 Sensitivity  Specificity  Performance coefficient  F-measure
MotifClicker (-s 1)  52           63           40                       57
MotifClicker (-s 0)  61           56           41                       64
MEME                 1            1            1                        1
MotifCut             10           15           6                        12
BioProspector        17           20           10                       18
Weeder               75           33           30                       46

Average SSD=0.10, 400*20 (values in %)

Tool                 Sensitivity  Specificity  Performance coefficient  F-measure
MotifClicker (-s 1)  48           65           38                       55
MotifClicker (-s 0)  59           55           40                       57
MEME                 73           80           62                       77
MotifCut             78           91           73                       84
BioProspector        49           57           36                       53
Weeder               76           45           40                       57

K-mer size 8 (using two motifs with SSD=0.06 and SSD=0.10, respectively, on 100 datasets)

Sensitivity (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  50      48      40      39
MotifClicker (-s 0)  60      55      46      41
MEME                 37      36      34      31
MotifCut             44      40      37      32
BioProspector        33      30      28      25
Weeder               75      71      67      63

Specificity (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  64      60      56      52
MotifClicker (-s 0)  55      50      48      43
MEME                 40      40      38      36
MotifCut             53      50      48      45
BioProspector        39      38      36      30
Weeder               39      37      35      29

PC (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  39      37      35      33
MotifClicker (-s 0)  43      38      34      32
MEME                 31      30      28      26
MotifCut             34      32      31      30
BioProspector        28      25      22      20
Weeder               35      33      32      30

F-measure (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  56      53      47      45
MotifClicker (-s 0)  57      52      47      42
MEME                 38      38      36      33
MotifCut             48      44      42      37
BioProspector        36      34      32      27
Weeder               51      49      46      40

Dodeca-mer (12-mer)

• Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12.

• Motifs containing 20 BSs with information contents of 14 bits (at most 7 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database.

• Seed motifs into 100 background seq sets.

• Test on 400*20, 600*20, 800*20, and 1000*20

• We abandoned Weeder, because it can only set the motif length as "small" (length 6 with 1 mutation, length 8 with 2 mutations), "medium" (like small, plus length 10 with 3 mutations), "large" (like medium, plus length 12 with 4 mutations), or "extra" (length 6 with 1 mutation, length 8 with 3 mutations, length 10 with 4 mutations, length 12 with 4 mutations).

That is, Weeder only accepts even motif lengths between 6 and 12, and for length 12 it accepts at most 4 mutations.

K-mer size 12, seed into 100 background seqs sets

Sn (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  70      66      61      59
MotifClicker (-s 0)  77      70      69      55
MEME                 70      62      58      42
MotifCut             68      65      60      56
BioProspector        66      52      45      39

Sp (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  85      79      77      76
MotifClicker (-s 0)  74      68      65      57
MEME                 81      71      67      50
MotifCut             85      80      79      76
BioProspector        83      68      63      49

PC (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  62      56      52      50
MotifClicker (-s 0)  61      52      50      39
MEME                 60      50      46      30
MotifCut             61      55      52      48
BioProspector        58      42      38      28

F-measure (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  77      72      68      66
MotifClicker (-s 0)  75      69      67      56
MEME                 75      66      62      46
MotifCut             75      72      68      64
BioProspector        73      59      53      43

12-mer, add noise

Legend: MotifClicker (-s 1), MotifClicker (-s 0), MEME, MotifCut, BioProspector, AlignACE

Sn (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  70      67                      64
MotifClicker (-s 0)  77      75                      70
MEME                 70      68                      67
MotifCut             68      60                      61
BioProspector        66      62                      49

Sp (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  85      80                      79
MotifClicker (-s 0)  74      70                      68
MEME                 81      78                      76
MotifCut             85      79                      79
BioProspector        83      79                      61

PC (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  62      57                      55
MotifClicker (-s 0)  61      56                      53
MEME                 60      57                      56
MotifCut             61      52                      53
BioProspector        58      53                      37

F-measure (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  77      73                      71
MotifClicker (-s 0)  75      72                      69
MEME                 75      73                      71
MotifCut             75      68                      68
BioProspector        73      69                      54

16-mer

• Synthetic background seqs: the dependencies of 3rd-order Markov were estimated from all intergenic seqs of the E. coli K12.

• Motifs containing 20 BSs with information contents of 16 bits( at most 8 positions are conserved) and the average SSD=0.02 between each BS and background were chosen from RegulonDB database.

• Seed motifs into 100 background seq sets.

• Test on 400*20, 600*20, 800*20, and 1000*20

16-mer results

Sn (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  82      81      80      80
MotifClicker (-s 0)  88      85      87      85
MEME                 77      76      75      72
MotifCut             68      68      68      67
BioProspector        67      64      50      43

Sp (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  88      85      83      79
MotifClicker (-s 0)  78      75      75      74
MEME                 88      85      82      79
MotifCut             96      93      90      86
BioProspector        94      91      78      66

PC (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  73      71      69      66
MotifClicker (-s 0)  71      67      68      66
MEME                 70      67      64      60
MotifCut             66      65      63      61
BioProspector        64      60      44      36

F-measure (%)

Tool                 400*20  600*20  800*20  1000*20
MotifClicker (-s 1)  85      83      81      79
MotifClicker (-s 0)  83      80      81      79
MEME                 82      80      78      75
MotifCut             80      79      77      75
BioProspector        78      75      61      52

16-mer, add noise

Sn (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  82      82                      80
MotifClicker (-s 0)  88      86                      87
MEME                 77      76                      76
MotifCut             68      67                      69
BioProspector        67      65                      63

Sp (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  88      85                      85
MotifClicker (-s 0)  78      76                      76
MEME                 88      85                      86
MotifCut             96      94                      92
BioProspector        94      92                      91

PC (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  73      71                      71
MotifClicker (-s 0)  71      68                      69
MEME                 70      68                      68
MotifCut             66      65                      65
BioProspector        64      62                      59

F-measure (%)

Tool                 400*20  400*25 (add 25% noise)  400*30 (add 50% noise)
MotifClicker (-s 1)  85      83                      82
MotifClicker (-s 0)  83      81                      81
MEME                 82      80                      81
MotifCut             80      78                      79
BioProspector        78      76                      74

Motif finding in Yeast (8-mer)

Each cell lists the three values shown on the slide, one per row.

Motif finding tools  Top 1    Top 5     Top 10    Top 15    Top 20    Top 25
MotifClick           67/585   85/1200   92/1638   95/1923   95/2107   96/2222
                     7158     24916     41752     55084     65852     74820
                     0.081    0.048     0.039     0.035     0.032     0.030
MEME                 70/754   85/1202   87/1615   92/1931   95/2087   95/2198
                     10107    34010     49958     60805     69709     77405
                     0.074    0.035     0.032     0.031     0.030     0.028
MotifCut             65/474   85/1189   86/1641   93/1893   95/1983   95/1998
                     7632     28974     47583     61107     67017     67503
                     0.062    0.041     0.034     0.031     0.030     0.030
BioProspector        79/780   84/1145   86/1465   89/1701   92/1911   92/2038
                     10049    20418     31935     42305     52296     61564
                     0.078    0.056     0.046     0.040     0.037     0.033
Weeder               77/969   88/1698   92/2063   94/2255   94/2396   96/2483
                     23417    56440     81374     96046     106872    113346
                     0.041    0.030     0.025     0.023     0.022     0.022

*At least 3 orthologous genes for each intergenic sequence set. http://www.yeastgenome.org

Motif finding in 5137 intergenic sequence sets of orthologous genes, which contain 2932 BSs belonging to 99 TFs in SGD.

Motif finding in E. coli K12 (16-mer)

Each cell lists the three values shown on the slide, one per row.

Tools                Top 1    Top 5     Top 10    Top 15    Top 20    Top 25
MotifClick           331/85   793/108   1055/114  1186/114  1262/117  1296/117
                     2575     7706      11056     12779     13592     14026
                     0.129    0.103     0.095     0.093     0.093     0.092
MEME                 298/83   877/109   1134/115  1202/117  1233/117  1254/117
                     3352     14243     20999     23912     25412     26201
                     0.089    0.062     0.054     0.050     0.049     0.048
MotifCut             241/75   487/89    544/96    640/102   744/107   836/108
                     1942     4763      6552      9145      10408     11212
                     0.124    0.102     0.083     0.070     0.071     0.074
BioProspector        354/85   743/103   953/112   1056/112  1150/116  1181/116
                     4950     7678      10090     11287     12306     13041
                     0.072    0.097     0.107     0.094     0.093     0.091
MotifClick+MEME      474/98   1029/114  1259/118  1335/120  1357/120  1377/120
BioProspector+MEME   472/92   1051/115  1258/118  1312/119  1339/119  1367/119

E. coli K12: 2313 operon groups. RegulonDB v6.0: 122 TFs, 1411 BSs.

Weeder and Consensus are the worst because they need high-quality input seqs sets.

Tools                Top 1    Top 5     Top 10    Top 15    Top 20    Top 25
MotifClick           331/85   793/108   1055/114  1186/114  1262/117  1296/117
MEME                 298/83   877/109   1134/115  1202/117  1233/117  1254/117
MotifCut             241/75   487/89    544/96    640/102   744/107   836/108
BioProspector        354/85   743/103   953/112   1056/112  1150/116  1181/116
CUBIC                242/75   563/98    791/108   905/109   999/111   1062/114
MDscan               355/82   552/96    634/99    684/102   758/107   793/109
MotifSampler         168/61   486/92    612/102   729/102   792/107   831/108
Weeder               179/65   350/85    452/92    494/94    532/94    552/94
Consensus            168/63   186/68    200/74    210/76    214/76    220/76
MotifClick+MEME      474/98   1029/114  1259/118  1335/120  1357/120  1377/120
BioProspector+MEME   472/92   1051/115  1258/118  1312/119  1339/119  1367/119

Conclusions

• Synthetic data: MotifCut has the highest specificity. MotifClick has the highest sensitivity. MotifClick is the most complementary to the other tools.

• Yeast data and E. coli data: MotifClick and MEME have similar numbers of true predictions, and more true predictions than the other tools. MotifClick is the most complementary to the other tools.
