mining dna network expressions with ...bio.eecs.uottawa.ca/archive/2015-03-31_motifgp.pdf2015/03/31...

1
MINING DNA NETWORK EXPRESSIONS WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMING WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMING MINING DNA NETWORK EXPRESSIONS Manuel Belmadani, M. Computer Science (specialization in Bioinformatics) Candidate [email protected] The proposed solution (MotifGP) is a motif discovery tool driven by a Strongly Typed Genetic Programming (STGP) algorithm equipped with multi-objective optimization. STGP is a stochastic search process based oevolutionary algorithms which evolves candidate solutions as typed checked programs. Multi-objective optimization allows the algorithm to evolve according to multiple metrics used to build predictions that are high in both support and specicicity. The software produces a collection of non-dominated multi-objective solutions, which can then be aligned to reference DNA motif databases (such as JASPAR, TRANSFAC, and Uniprobe) in order to validate/identify predictions. Abstract High-throughput sequencing experiments, such as ChIP-seq, identify protein interactions within large samples of DNA. Such experiments may identify and output thousands of peak sequences where an interaction of interest is regulated by that region of DNA. Computational biologists often require a de novo analysis of these peaks to mine short overrepresented patterns, frequently found amongst the sequences. This task is referred to as motif discovery. Existing tools struggle at motif discovery if the number of input sequences increases. The time and complexity costs of calculating pattern occurrence often requires compromises on runtime, quality, or length of the output predictions. MA0139.1 YDSYGSCVYSTDSTRDBCV Tomtom 13.03.15 17:31 0 1 2 bits 1 G C T 2 G A 3 T C G 4 T C 5 A T G 6 T A C 7 C 8 T A C 9 T C 10 C 11 C T 12 T A G 13 C G 14 C A T 15 G 16 A T G 17 A T C 18 C 19 G A 0 1 2 bits 1 C T 2 A G T 3 C G 4 C T 5 G 6 C G 7 C 8 A C G 9 C T 10 C G 11 T 12 A G T 13 C G 14 T 15 A G 16 A G T 17 C G T 18 C 19 A C G MA0139.1 AGRKGGCR Tomtom 13.03.15 13:33 0 1 2 bits 1 C T 2 G 3 T A G 4 T A C 5 C 6 G T A 7 G C 8 A T C 9 G A 10 G 11 A G 12 A T G 13 G 14 A T G 15 T A C 16 A G 17 A G C 18 C T 19 C G A 0 1 2 bits 1 A 2 G 3 A G 4 T G 5 G 6 G 7 C 8 G A MOTIFGP DREME CTCF MotifGP's motif fully overlaps the 19 bases of CTCF. Each tool's prediction targeted a dierent strand (orientation), showing a complementation between each methods. E-value: 6.13471E-09 E-value: 3.50248E-07 MA0145.2 WCCVGHTHVDDCCDG Tomtom 13.03.15 17:30 0 1 2 bits 1 G C 2 A T C 3 T G A 4 C G 5 A C T 6 A C T 7 C 8 9 T G A 10 T G A 11 G C 12 A T C 13 T G A 14 C G 0 1 2 bits 1 A T 2 C 3 C 4 A C G 5 G 6 A C T 7 T 8 A C T 9 A C G 10 A G T 11 A G T 12 C 13 C 14 A G T 15 G MA0145.2 ADCCRG Tomtom 13.03.15 13:33 0 1 2 bits 1 G C 2 A T C 3 T G A 4 C G 5 A C T 6 A C T 7 C 8 9 T G A 10 T G A 11 G C 12 A T C 13 T G A 14 C G 0 1 2 bits 1 A 2 T A G 3 C 4 C 5 G A 6 G Tcfcp2l1 MOTIFGP DREME The Tcfcp2l1 motif has two fragments joined by a weak gap (position 7 to 8). Evolutionary computing allows such fragments of a motif to evolve independently and eventually crossover, allowing short gaps to naturally appear in produced predictions. E-value: 1.33929E-07 E-value: 3.07791E-03 MA0039.2 GGGYGKGK Tomtom 13.03.15 13:33 0 1 2 bits 1 G A T 2 G 3 G 4 G 5 A C T 6 G 7 T G 8 T A G 9 T G 10 G T C 0 1 2 bits 1 G 2 G 3 G 4 C T 5 G 6 G T 7 G 8 T G MA0039.2 GYGDGGGYGKGGC Tomtom 13.03.15 14:20 0 1 2 bits 1 G A T 2 G 3 G 4 G 5 A C T 6 G 7 T G 8 T A G 9 T G 10 G T C 0 1 2 bits 1 G 2 C T 3 G 4 A G T 5 G 6 G 7 G 8 C T 9 G 10 G T 11 G 12 G 13 C MOTIFGP DREME Klf4 As shown by the Klf4 dataset prediction, MotifGP has the ability to overlap an entire target motif. It even nds additional nucleotides attached to the left end of the predicted motif, found in the input sequences. E-value: 1.18496E-09 E-value: 3.28351E-08 Experimental Results A sample of top ranking motifs identified by MotifGP are compared to DREME's prediction on the same datasets. Each top logo is the target motif from the database, aligned by TOMTOM. The bottom logos are the predictions by MotifGP (at the left), and DREME, (at the right). Benchmarking For each dataset, MotifGP was executed 10 times with a dierent random initialization. The E-value of the top DREME motif alignment is used as the benchmark value for MotifGP. Over the 10 individual runs of each dataset, we counted the number of runs where a prediction outperformed the benchmark. We also looked at the runs that are within up to 3 orders of magnitude of the benchmark E-Value. This experiment shows the algorithm performed favorably in term of stability in 11 out of 13 datasets, including 6 where MotifGP suceeded in all 10 runs. Dataset < Terminology Motif: A recurring feature in a set of elements, such as frequent patterns in DNA strings. Transcription Factor (TF): Proteins that bind to DNA at specific sites. Network Expressions: Subset of the regular expression grammar. Defines patterns in which each position has a set of possible characters. DNA Motif Discovery: Mining DNA sequences for motifs. Can be used to identify TF binding sites. To determine if MotifGP can outperform DREME, we compared both software against 13 datasets from mouse Embryonic Stem Cells ChIP-Seq experiments [1]. The recorded runtimes for each DREME run were used as time limits for MotifGP. Each dataset may contain one or more factor motif. Desirable predictions should report long overlap and the lowest E-values once aligned to known motifs. Methodology Strongly Typed Genetic Programming AAC[CT] + + A A [ ] C T C Strongly Typed Network Expression grammar Genetic Programming is a search heuristic where a population of programs created under a template grammar are evolved over generations of crossovers and mutations. MotifGP uses a grammar to produce Network Expressions that can be matched against sequences. In population based evolutionary algorithms, new individuals are created over generations and evaluated by a tness function. This scores the individuals in how well they perform as solutions. In multi-objective optimization, the individuals have more than one tness values. Fitnesses with multiple values are compared and ranked by a multi-objective optimization algorithm [3]. MotifGP matches candidate solutions (network expressions) against a set of input and control sequences. We use the match count on each set to compute two metrics of discrimination and coverage: For a given pattern, let P be the set of matched input sequences, C be the set of matched control sequences, and k be the total number of input sequences. Discrimination favors solutions that are not highly represented in the control sequences, while Coverage maintains solutions that are found in multiple input sequences. Multi-objective optimization Conclusions and Future Work Combining STGP and multi-objective optimization directs the solution search towards motifs that align with high confidence and fully overlaps known motifs from databases. Other existing ChIP-seq experiments can be revisited using MotifGP as it may be able to uncover subtle patterns that other tools could not. Future development of MotifGP will investigate different multi-objective fitness function variants and explore other evolutionary methods. While DREME predictions have signicant E-values, the pattern length is limited to 8 characters. Some known motifs are more than twice the size DREME can predict. This is a restriction set to keep low runtimes under real ChIP-seq datasets. We propose our software, MotifGP, as an alternative to this limitation. Our goal is to explore the potential of multi-objective evolutionary computing in nding signicant motif predictions on large datasets, under runtimes comparable to DREME. GGTAGAAATA TTAAAACCT ATAAACCT AAAACCT MotifGP DREME Alignment TOMTOM Motif discovery OR A[AT]A A[AC][CT] AAC[CT] ... ... DNA sequences Motif predictions JASPAR Uniprobe Motif databases Alignment score MA0039.2 GYGDGGGYGKGGC Tomtom 13.03.15 14:20 0 1 2 bits 1 A T 2 G 3 G 4 G 5 C T 6 G 7 T G 8 G 9 G 10 T C 0 1 2 bits 1 G 2 C T 3 G 4 A G T 5 G 6 G 7 G 8 C T 9 G 10 G T 11 G 12 G 13 C E-value:1.18E-08 Target: Klf4 (MA0039.2) ... De novo motif discovery pipeline Background and Motivation A state-of-the-art motif discovery tool, DREME, predicts DNA motifs by mining and rening k- mer (words) from an input set of sequences [2]. The predictions are compared to known motif databases with TOMTOM [5]. Reported alignments are listed and scored by E-value.

Upload: others

Post on 10-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MINING DNA NETWORK EXPRESSIONS WITH ...bio.eecs.uottawa.ca/archive/2015-03-31_MotifGP.pdf2015/03/31  · DNA strings. Transcription Factor (TF): Proteins that bind to DNA at specific

MINING DNA NETWORK EXPRESSIONS WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMING WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMINGMINING DNA NETWORK EXPRESSIONS

Manuel Belmadani, M. Computer Science (specialization in Bioinformatics) [email protected]

The proposed solution (MotifGP) is a motif discovery tool driven by a Strongly Typed Genetic Programming (STGP) algorithm equipped with multi-objective optimization. STGP is a stochastic search process based off evolutionary algorithms which evolves candidate solutions as typed checked programs. Multi-objective optimization allows the algorithm to evolve according to multiple metrics used to build predictions that are high in both support and specificificity. The software produces a collection of non-dominated multi-objective solutions, which can then be aligned to reference DNA motif databases (such as JASPAR, TRANSFAC, and Uniprobe) in order to validate/identify predictions.

Abstract

High-throughput sequencing experiments, such as ChIP-seq, identify protein interactions within large samples of DNA. Such experiments may identify and output thousands of peak sequences where an interaction of interest is regulated by that region of DNA. Computational biologists often require a de novo analysis of these peaks to mine short overrepresented patterns, frequently found amongst the sequences. This task is referred to as motif discovery. Existing tools struggle at motif discovery if the number of input sequences increases. The time and complexity costs of calculating pattern occurrence often requires compromises on runtime, quality, or length of the output predictions.

[1] Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008).[2] Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).[3] Fortin, F.-A. & Parizeau, M. Revisiting the NSGA-II crowding-distance computation. in Proceedings of the 15th annual conference on Genetic and evolutionary computation 623–630 (ACM, 2013).[4] Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A. G., Parizeau, M. & Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)[5] Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007)..

References

MA0139.1

YDSYGSCVYSTDSTRDBCV Tomtom 13.03.15 17:31

0

1

2

bits

1

GCT

2

T

GA

3

T

CG

4

TC

5

ATG

6

TAC

7

C

8

T

AC

9

TC

10

C

11

A

CT

12

TAG

13

T

CG

14

CAT

15

G

16

C

ATG

17

ATC

18

T

A

C

19

C

T

GA

0

1

2

bits

1

CT

2AGT

3

CG

4

CT

5

G

6

CG

7

C

8ACG

9

CT

10

CG

11

T

12

AGT

13

CG

14

T

15

AG

16

AGT

17

CGT

18

C

19

ACG

MA0139.1

AGRKGGCR Tomtom 13.03.15 13:33

0

1

2

bits

1GA

CT

2

A

T

G

3

TAG

4GTAC

5

C6

GTA

7

A

GC

8

ATC

9

T

GA

10

G11

AG

12

A

TG

13

G14

ATG

15

TAC

16

AG

17

A

GC

18

A

CT

19

CGA

0

1

2

bits

1A 2

G3A

G

4TG

5

G6

G7

C8

GA

MOTIFGP DREME

CTCFMotifGP's motif fully overlaps the 19 bases of CTCF. Each tool's prediction targeted a different strand (orientation), showing a complementation between each methods.E-value: 6.13471E-09 E-value: 3.50248E-07

MA0145.2

WCCVGHTHVDDCCDG Tomtom 13.03.15 17:30

0

1

2

bits

1

GC

2

ATC

3

TGA

4

C

G

5

ACT

6GA

CT

7

T

C

8 9CTGA

10

TGA

11

GC

12

ATC

13

TGA

14

C

G

0

1

2

bits

1AT

2

C

3

C

4ACG

5

G

6ACT

7T 8ACT

9ACG

10

AGT

11

AGT

12

C

13

C

14

AGT

15

G

MA0145.2

ADCCRG Tomtom 13.03.15 13:33

0

1

2

bits

1GC

2

ATC

3

TGA

4

C

G5

ACT

6GA

CT

7

T

C

8 9CTGA

10

TGA

11

GC

12

ATC

13

TGA

14

C

G

0

1

2

bits

1A 2TAG

3

C4

C5

GA

6

G

Tcfcp2l1

MOTIFGP DREME

The Tcfcp2l1 motif has two fragments joined by a weak gap (position 7 to 8). Evolutionary computing allows such fragments of a motif to evolve independently and eventually crossover, allowing short gaps to naturally appear in produced predictions.

E-value: 1.33929E-07 E-value: 3.07791E-03

MA0039.2

GGGYGKGK Tomtom 13.03.15 13:33

0

1

2

bits

1

GAT

2

A

G

3

G

4

G

5

A

CT

6

G

7

TG

8

TAG

9AC

TG

10

A

G

TC

0

1

2

bits

1

G

2

G

3

G

4

CT

5

G

6

GT

7

G

8TG

MA0039.2

GYGDGGGYGKGGC Tomtom 13.03.15 14:20

0

1

2

bits

1

GAT

2

A

G3

G4

G5

A

CT

6

G

7

TG

8

TAG

9AC

TG

10

A

G

TC

0

1

2

bits

1

G2

CT

3

G4A

GT

5

G6

G7

G8

CT

9

G10

GT

11

G12

G13

CMOTIFGP DREME

Klf4

As shown by the Klf4 dataset prediction, MotifGP has the ability to overlap an entire target motif. It even finds additional nucleotides attached to the left end of the predicted motif, found in the input sequences.E-value: 1.18496E-09 E-value: 3.28351E-08

Experimental ResultsA sample of top ranking motifs identified by MotifGP are compared to DREME's prediction on the same datasets. Each top logo is the target motif from the database, aligned by TOMTOM. The bottom logos are the predictions by MotifGP (at the left), and DREME, (at the right).

BenchmarkingFor each dataset, MotifGP was executed 10 times with a different random initialization. The E-value of the top DREME motif alignment is used as the benchmark value for MotifGP.

Over the 10 individual runs of each dataset, we counted the number of runs where a prediction outperformed the benchmark. We also looked at the runs that are within up to 3 orders of magnitude of the benchmark E-Value. This experiment shows the algorithm performed favorably in term of stability in 11 out of 13 datasets, including 6 where MotifGP suceeded in all 10 runs.

Dataset <

TerminologyMotif: A recurring feature in a set of elements, such as frequent patterns in DNA strings.

Transcription Factor (TF): Proteins that bind to DNA at specific sites.

Network Expressions: Subset of the regular expression grammar. Defines patterns in which each position has a set of possible characters.

DNA Motif Discovery: Mining DNA sequences for motifs. Can be used to identify TF binding sites.

To determine if MotifGP can outperform DREME, we compared both software against 13 datasets from mouse Embryonic Stem Cells ChIP-Seq experiments [1].

The recorded runtimes for each DREME run were used as time limits for MotifGP.

Each dataset may contain one or more factor motif. Desirable predictions should report long overlap and the lowest E-values once aligned to known motifs.

MethodologyStrongly Typed Genetic Programming

AAC[CT]

+ +

A A [ ]C

TC

Strongly Typed Network Expression grammar

Genetic Programming is a search heuristic where a population of programs created under a template grammar are evolved over generations of crossovers and mutations. MotifGP uses a grammar to produce Network Expressions that can be matched against sequences.

In population based evolutionary algorithms, new individuals are created over generations and evaluated by a fitness function. This scores the individuals in how well they perform as solutions. In multi-objective optimization, the individuals have more than one fitness values. Fitnesses with multiple values are compared and ranked by a multi-objective optimization algorithm [3].

MotifGP matches candidate solutions (network expressions) against a set of input and control sequences. We use the match count on each set to compute two metrics of discrimination and coverage:

For a given pattern, let P be the set of matched input sequences, C be the set of matched control sequences, and k be the total number of input sequences.

Discrimination favors solutions that are not highly represented in the control sequences, while Coverage maintains solutions that are found in multiple input sequences.

Multi-objective optimization

Conclusions and Future WorkCombining STGP and multi-objective optimization directs the solution search towards motifs that align with high confidence and fully overlaps known motifs from databases.

Other existing ChIP-seq experiments can be revisited using MotifGP as it may be able to uncover subtle patterns that other tools could not.

Future development of MotifGP will investigate different multi-objective fitness function variants and explore other evolutionary methods.

While DREME predictions have significant E-values, the pattern length is limited to 8 characters. Some known motifs are more than twice the size DREME can predict. This is a restriction set to keep low runtimes under real ChIP-seq datasets. We propose our software, MotifGP, as an alternative to this limitation. Our goal is to explore the potential of multi-objective evolutionary computing in finding significant motif predictions on large datasets, under runtimes comparable to DREME.

GGTAGAAATATTAAAACCTATAAACCT

AAAACCT MotifGP

DREMEAlignment

TOMTOM

Motif discovery

ORA[AT]AA[AC][CT]AAC[CT]

...

...

DNA sequencesMotif predictions

JASPARUniprobe

Motif databases

Alignment score

MA0039.2

GYGDGGGYGKGGC Tomtom 13.03.15 14:20

0

1

2

bits

1GAT

2AG

3G 4G 5ACT

6G 7TG

8TAG

9ACTG

10

A

G

TC

0

1

2

bits

1G 2CT

3G 4AGT

5G 6G 7G 8CT

9G 10

GT

11

G

12

G

13

C

E-value:1.18E-08

Target: Klf4(MA0039.2)

...

De novo motif discovery pipeline

Background and MotivationA state-of-the-art motif discovery tool, DREME, predicts DNA motifs by mining and refining k-mer (words) from an input set of sequences [2]. The predictions are compared to known motif databases with TOMTOM [5]. Reported alignments are listed and scored by E-value.