

Page 1: DNA Codes Design - SUPSIroberto/DNA.pdf · • The DNA Codes Design problem • Approaches in the literature • Construction heuristics • Simple local searches • Metaheuristics

IMPORTANT: Applications of metaheuristics to the Sequential Ordering Problem will be covered first. Applications of metaheuristics to DNA codes design will be treated only if time permits. Roberto

1

Roberto Montemanni

Dalle Molle Institute for Artificial Intelligence

University of Applied Science of Southern Switzerland

Email: [email protected] Tel: +41 58 666 666 7

DNA Codes Design


2

Outline
•  Introduction
•  The DNA Codes Design problem
•  Approaches in the literature
•  Construction heuristics
•  Simple local searches
•  Metaheuristics
   –  Intro to Stochastic Local Search
   –  Applications to the DNA codes design problem
   –  Intro to Variable Neighbourhood Search
   –  Applications to the DNA codes design problem
•  Bibliography

3

Contributions to slides

•  Dan C. Tulpan, NRC Institute for Information Technology, Canada (Introduction, Applications, Stochastic Local Searches)

•  Marco Chiarandini, University of Southern Denmark, Denmark (Introduction to Stochastic Local Search)

•  Thomas Stuetzle, Darmstadt University of Technology, Germany (Introduction to Variable Neighbourhood Search)


5

DNA – The Blueprint of Life

[Figure: pictures of a chimp, cow, dinosaur, bird, fish, worm, bacteria and human, all linked to DNA. 9 pictures taken from ClipArt.]

Background: DNA


6

What is DNA?

•  All organisms on this planet are made of the same type of genetic blueprint.

7

Real Applications

•  DNA computing => using DNA for massively parallel computations.

•  DNA Chemical libraries => for the development and test of new drugs

•  DNA Microarrays => for profiling genes and tracing genes within long DNA strands

•  DNA Nanotechnologies => for the development of new materials/devices

http://en.wikipedia.org/wiki/DNA_computing


9

DNA, Wikimedia Commons

What is DNA?
•  genetic material
•  four-letter alphabet (nucleotides, bases):
   –  A (adenine)
   –  C (cytosine)
   –  G (guanine)
   –  T (thymine)
•  complementary base pairs: C-G, A-T
•  hybridization via base pairing

[Figure: two strand pairs with 5' and 3' ends marked. Perfect hybridization: AACGT paired with TTGCA. Imperfect hybridization: ATGGT paired with TTGCA.]

Background: DNA


10

Modeling

Design Goals
•  Uniform Stability
•  Non-interaction

[Figure: two strand pairs with 5' and 3' ends marked (AACGT with TTGCA, and AACGT with CACCC), labelled Uniform Stability and Non-interaction.]

Desired properties
•  Desired properties come from real applications

•  Notice that properties are not the same for all applications

11

DNA Codes Design Problem description

Input data:

•  The alphabet {A, C, G, T}

•  A fixed length n for the codewords

•  A required distance d among codewords (used by constraints in Z)

• A set Z of constraints (explained in the next slides)

Optimization objective:

•  Find the largest possible set of codewords (= code) of length n on alphabet {A, C, G, T}, feasible with respect to constraints Z (based on d)

Why maximize the size of the code? To have more flexibility in the applications seen before!


12

Example Code (solution)

Word length n = 8. Each entry is a codeword:

AATTCCGG  ACCTGATT  ATTCCCAG  ACCTTTTT  TATATATA
CATTCACC  GCTTATTC  GATTCAAT  TCACCATG  CCGTTACA
GCGCGCGC  CTATTCAC  TTGGCCAA  GGCTTTTA  CTACTACG

The solution respects a given constraint set Z (we do not know Z at this stage!)

DNA Codes Design Problem description

13

Requirements of a DNA Code

•  Success in specific hybridization between a DNA codeword and its complement.

•  No hybridization between DNA codewords from the same DNA code, or between a DNA codeword and the complement of another codeword.

How do these requirements translate into our constraints set Z?

DNA Codes Design Problem description


14

Constraints considered (set Z):

•  Requirement: the distance between two codewords must be large (no hybridization).

•  Answer: HD (Hamming Distance)

-  Given two codewords w1 and w2

-  H(w1, w2) = number of positions i in which the ith letter of w1 differs from the ith letter of w2

-  example: w1 = GCTA, w2 = ATTA, H(w1, w2) = 2

-  Constraint: H(w1, w2) ≥ d
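
The Hamming distance and the HD constraint can be sketched in a few lines of Python. The function names `hamming` and `satisfies_hd` are illustrative choices, not names taken from the slides.

```python
def hamming(w1: str, w2: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(w1) != len(w2):
        raise ValueError("codewords must have the same length")
    return sum(a != b for a, b in zip(w1, w2))


def satisfies_hd(w1: str, w2: str, d: int) -> bool:
    """HD constraint from the slide: H(w1, w2) >= d."""
    return hamming(w1, w2) >= d


# Example from the slide: GCTA and ATTA differ in the first two positions.
print(hamming("GCTA", "ATTA"))
```

With d = 3 the pair GCTA, ATTA would violate the constraint, since the two words differ in only 2 positions.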

DNA Codes Design Problem description

15

Constraints considered (set Z):

•  Requirement: the number of G or C of each codeword must be the same (uniform stability) [=> self-hybridization is likely]

•  Answer: GC (GC-content constraint)

-  A fixed number of the letters of each word has to be either G or C: floor(n/2) in our case

-  example: ATA is not feasible, AGA is feasible
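
A minimal sketch of the GC-content check, assuming (as on the slide) that the required number of G/C letters is floor(n/2). The names `gc_content` and `satisfies_gc` are hypothetical.

```python
def gc_content(word: str) -> int:
    """Count the letters of the word that are G or C."""
    return sum(ch in "GC" for ch in word)


def satisfies_gc(word: str) -> bool:
    """GC constraint: exactly floor(n/2) letters are G or C."""
    return gc_content(word) == len(word) // 2


# Slide example with n = 3, so floor(n/2) = 1 letter must be G or C.
print(satisfies_gc("ATA"), satisfies_gc("AGA"))
```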

DNA Codes Design Problem description


16

•  Requirement: the distance between a codeword and the complement of another codeword must be large.

Watson-Crick complement of a DNA codeword

wcc(w) = Watson-Crick complement of a DNA codeword w, obtained by reversing w and then replacing each A by T (and vice versa) and each C by G (and vice versa)

-  example: wcc(ATGC) = GCAT
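
The definition above translates directly into Python: reverse the word, then swap A with T and C with G. This is a sketch; the helper name `wcc` simply mirrors the notation of the slide.

```python
# Translation table for base complementation: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def wcc(word: str) -> str:
    """Watson-Crick complement: reverse the word, then complement each base."""
    return word[::-1].translate(_COMPLEMENT)


print(wcc("ATGC"))  # the slide's example gives GCAT
```

Applying `wcc` twice returns the original word, which is a quick sanity check for any implementation.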

DNA Codes Design Problem description

17

Constraints considered (set Z):

• Requirement: the distance between a codeword and the complement of another codeword must be large.

•  Answer: RC (Reverse Complement Hamming distance)

-  Given two codewords w1 and w2

-  example: GCTA, ATGC

H(GCTA, wcc(ATGC)) = H(GCTA,GCAT) = 2

-  Constraint: H(w1, wcc(w2)) ≥ d
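
Combining the two previous definitions gives a short check for the RC constraint. The sketch below redefines the helpers so that it is self-contained; `satisfies_rc` is an illustrative name.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1: str, w2: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word: str) -> str:
    """Watson-Crick complement: reverse, then complement each base."""
    return word[::-1].translate(_COMPLEMENT)


def satisfies_rc(w1: str, w2: str, d: int) -> bool:
    """RC constraint from the slide: H(w1, wcc(w2)) >= d."""
    return hamming(w1, wcc(w2)) >= d


# Slide example: H(GCTA, wcc(ATGC)) = H(GCTA, GCAT) = 2
print(hamming("GCTA", wcc("ATGC")))
```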

DNA Codes Design Problem description


18

Example of a problem and its solution

•  Input data: n = 4, d = 3.

•  Constraints considered: HD, GC, RC

•  Solution:

the largest possible code with the characteristics above contains 6 codewords.

Optimal code with respect to the constraints considered (not unique!):

CTTC GGTT GTCA

AGGA ACTG TTGG
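
The feasibility of the six codewords above can be checked mechanically. The sketch below is only a verifier, not a proof of optimality; it assumes the RC constraint is applied to pairs of distinct codewords, as the slides word it, and `is_feasible` is a hypothetical helper name.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(_COMPLEMENT)


def is_feasible(code, d):
    """Check GC on every word, and HD and RC on every pair of distinct words."""
    n = len(code[0])
    if any(sum(ch in "GC" for ch in w) != n // 2 for w in code):
        return False
    for w1, w2 in combinations(code, 2):
        # H(w1, wcc(w2)) equals H(w2, wcc(w1)), so one direction is enough.
        if hamming(w1, w2) < d or hamming(w1, wcc(w2)) < d:
            return False
    return True


example = ["CTTC", "GGTT", "GTCA", "AGGA", "ACTG", "TTGG"]
print(is_feasible(example, 3))
```

Adding a word such as CTTG, which differs from CTTC in a single position, makes the check fail.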

19

Problem description

•  Other kinds of constraints are possible.

•  They depend on the real-world application considered

•  In this mini-course we limit ourselves to the constraints on the previous slides

Important observation


21

TEMPLATE-MAP DESIGN
•  Find the largest possible set of 8-mers with
   –  50% GC content in each word
   –  at least four mismatches between each word and the complement of each distinct word (reverse-complement constraint)
   –  at least four mismatches between each pair of words (direct Hamming constraint)
   –  based on template-map design

Approaches from the literature

Kobayashi, S., Kondo, T., Arita, M. On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214 (2003).

Arita, M., Kobayashi, S. DNA sequence design using templates. New Generation Computing, 20, 263-277 (2002).

Frutos A.G., Liu, Q., Thiel A.J., Sanner A.M.W., Condon A.E., Smith L.M., Corn R.M. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Res. 25, 4748-4757 (1997)

Koul, N. Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI (2010).


22

TEMPLATE-MAP DESIGN

Approaches from the literature

•  The selection of maps and templates is based on reasoning and theoretical results

•  Difficult to apply results to different problems: not a general approach

23

MATHEMATICAL CONSTRUCTIONS
•  Approaches adapted from classic Coding Theory
•  Theoretical results, based on the characteristics of the desired code, are used to produce mathematical constructions leading to (very regular) codes

•  Example: Theorem. If C0 is a code that is fixed by reverse permutation R, then the subcode C1 of C0 consisting of the codewords that are unchanged by R is obtained as the intersection of C0 and the code R(C0).

Approaches from the literature

Gaborit P., King O. D. Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113 (2005).

•  Not a general method. Results typically hold for the problem under investigation only

•  The codes obtained are very regular. For many applications this is not desirable

King, O. D. Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33 (2003).

Neelakandan, I. New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI (2010).


24

HEURISTIC ALGORITHMS

•  Many of the classic heuristic algorithms have been adapted, implemented and tested

•  We will see some of them in detail…

Approaches from the literature


26

Construction Heuristics

Construction Heuristic (CH)

All possible codewords with the required GC-content are examined in a given order.

Codewords are incrementally accepted if feasible with respect to the already accepted ones.

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).

Smith, D.H., Hughes L.A., Perkins S. A new table of constant weight binary codes of length greater than 28. Electron. J. of Combinatorics, 13(1), #A2 (2006).

27

Construction Heuristics

Example: n = 4, d = 3.

Constraints: HD, GC, RC

Lexicographic order:

AACC AACG AAGC AAGG ACAC ACAG ACCA ACCT ACGA ACGT ACTC ACTG AGAC AGAG AGCA AGCT AGGA AGGT AGTC AGTG ATCC ATCG ATGC ATGG CAAC CAAG CACA CACT CAGA CAGT CATC CATG CCAA CCAT CCTA CCTT CGAA CGAT CGTA CGTT CTAC CTAG CTCA CTCT CTGA CTGT CTTC CTTG GAAC GAAG GACA GACT GAGA GAGT GATC GATG GCAA GCAT GCTA GCTT GGAA GGAT GGTA GGTT GTAC GTAG GTCA GTCT GTGA GTGT GTTC GTTG TACC TACG TAGC TAGG TCAC TCAG TCCA TCCT TCGA TCGT TCTC TCTG TGAC TGAG TGCA TGCT TGGA TGGT TGTC TGTG TTCC TTCG TTGC TTGG

Solution: AACC ACAG AGGA CCTA GTCA
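
The construction heuristic can be sketched as a single greedy pass over the candidate list. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the RC constraint is assumed to apply to pairs of distinct codewords only.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(_COMPLEMENT)


def candidates(n):
    """All words of length n with floor(n/2) G/C letters, in lexicographic order."""
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return [w for w in words if sum(ch in "GC" for ch in w) == n // 2]


def compatible(w1, w2, d):
    """HD and RC constraints between two distinct codewords."""
    return hamming(w1, w2) >= d and hamming(w1, wcc(w2)) >= d


def construct(order, d):
    """Greedy pass: accept each word that is feasible with all accepted ones."""
    code = []
    for w in order:
        if all(compatible(w, c, d) for c in code):
            code.append(w)
    return code


code = construct(candidates(4), d=3)
print(code)
```

Under these assumptions the first three accepted codewords are AACC, ACAG and AGGA, in agreement with the start of the solution shown on the slide. Passing a shuffled copy of `candidates(4)` gives the random-order variant.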


28

Construction Heuristics

•  The method works with any order of the candidate codewords (lexicographic, reverse lexicographic, random) => different algorithms in fact…

•  Computational experiments suggest that random orders tend to give better results on DNA code design problems

•  Slow for large problems (all possible codewords have to be examined!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).


30

Seed Building local search

Seed Building (SB)

Iterative approach

A set of seed codewords is considered.

The set of seed codewords is dynamically adapted through iterations.

During each iteration:
•  All possible codewords with the required GC-content are examined in a given order.
•  Codewords are incrementally accepted if feasible with those already accepted in the current iteration and with the seed codewords.

Statistics are used to expand or contract the set of seed codewords every ItrSeed iterations, based on the quality of the solutions built.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. J. of Math. Modelling and Algorithms 7, 311-326 (2008).

Brouwer A.E., Shearer J.B., Sloane N.J.A., Smith W.D. A new table of constant weight codes. IEEE Trans. Inf. Theory 36, 1334-1380 (1990).

31

Seed Building local search

Seed codewords management


32

Seed Building local search

Example: n = 4, d = 3.

Constraints: HD, GC, RC

Seed codewords: AACC ACAG

Random order:

CTTC CTTG CTCA CTCT CTGA CTGT CTAC CTAG CATC CATG CACA CACT CAGA CAGT CAAC CAAG CCTA CCTT CCAA CCAT CGTA CGTT CGAA CGAT GTTC GTTG GTCA GTCT GTGA GTGT GTAC GTAG GATC GATG GACA GACT GAGA GAGT GAAC GAAG GCTA GCTT GCAA GCAT GGTA GGTT GGAA GGAT TTCC TTCG TTGC TTGG TACC TACG TAGC TAGG TCTC TCTG TCCA TCCT TCGA TCGT TCAC TCAG TGTC TGTG TGCA TGCT TGGA TGGT TGAC TGAG ATCC ATCG ATGC ATGG AACC AACG AAGC AAGG ACTC ACTG ACCA ACCT ACGA ACGT ACAC ACAG AGTC AGTG AGCA AGCT AGGA AGGT AGAC AGAG

Solution: AACC ACAG CCTA GTCA TCCT
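
One Seed Building iteration can be sketched as the construction pass started from the seed set. The sketch covers a single iteration only; the statistics-driven expansion and contraction of the seed set every ItrSeed iterations is not reproduced here, and the names `seed_building_iteration`, `candidates` and `compatible` are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(_COMPLEMENT)


def compatible(w1, w2, d):
    """HD and RC constraints between two distinct codewords."""
    return hamming(w1, w2) >= d and hamming(w1, wcc(w2)) >= d


def candidates(n):
    """All words of length n with floor(n/2) G/C letters, lexicographic order."""
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return [w for w in words if sum(ch in "GC" for ch in w) == n // 2]


def seed_building_iteration(seeds, order, d):
    """Start from the seeds, then accept words feasible with everything kept so far."""
    code = list(seeds)
    for w in order:
        if w not in code and all(compatible(w, c, d) for c in code):
            code.append(w)
    return code


# For a random order, shuffle a copy of candidates(4) with random.shuffle.
print(seed_building_iteration(["AACC", "ACAG"], candidates(4), d=3))
```

The seeds are assumed to be mutually feasible; the sketch does not re-check them.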

33

Seed Building local search

•  The method works with any order of the candidate codewords (lexicographic, reverse lexicographic, random).

•  Experiments clearly show that a random order is to be preferred for DNA codes design problems.

•  The process of identifying a good set of codewords is intrinsically difficult => codes produced are sometimes very good and sometimes very poor => not a very robust method

•  Slow for large problems (all possible codewords are examined at each iteration!)


34

•  Clique
   Given an undirected graph G, a clique is a set of vertices in which every vertex is connected to every other vertex of the set

•  Maximum clique problem
   Given an undirected graph G, identify the largest clique of G (largest in number of vertices)

•  Complexity
   Classic NP-hard problem

Clique Search local search

•  {0, 3, 4} is a clique

•  {2, 3, 4, 5} is a maximal clique

35

Clique Search local search

Clique Search (CS)

Iterative approach

A partial code can be completed by solving a subproblem (which is a maximum clique problem) to optimality.

During each iteration:
•  All possible codewords with the required GC-content are examined in a random order.
•  Codewords are accepted for the second phase if feasible with those of the partial code.
•  A maximum clique problem is solved on the set of accepted codewords to complete the partial code.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)


36

Clique Search local search


38

Clique Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

Partial code: CTTC CGAA TGGT GTGA

Maximum clique problem on feasible extensions of the partial solution:

CACT AGTG

AAGC GCTT

Solution: CTTC CGAA TGGT GTGA CACT GCTT
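
The second phase of this example can be reproduced with a brute-force maximum clique search on the four candidate extensions. Brute force is used here only because the instance is tiny; it is a stand-in for whatever clique heuristic is used in practice, and the helper names are hypothetical. Two candidates are taken to be adjacent when they satisfy both HD and RC with d = 3.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(_COMPLEMENT)


def compatible(w1, w2, d=3):
    """Edge of the compatibility graph: HD and RC both hold."""
    return hamming(w1, w2) >= d and hamming(w1, wcc(w2)) >= d


def max_clique(vertices, adjacent):
    """Exhaustive search from the largest subset size down (tiny graphs only)."""
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(adjacent(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []


extensions = ["CACT", "AGTG", "AAGC", "GCTT"]
print(max_clique(extensions, compatible))
```

On these four words the largest clique has two vertices and it is not unique. CACT and AGTG are not adjacent because AGTG is the Watson-Crick complement of CACT, while the pair CACT, GCTT chosen on the slide is adjacent.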

39

Clique Search local search

•  Solving a maximum clique problem (sub-procedure) is an NP-hard problem itself!

•  Heuristics have to be used for the maximum clique problem

=> no optimality is guaranteed for the sub-problem solutions

•  The choice of the number of codewords to eliminate is crucial

–  too many codewords eliminated => very large maximum clique problem => high probability of suboptimality

–  not enough codewords eliminated => very likely to find a code with the same number of codewords as the original

–  This aspect deserves a deeper study to tackle large problems!


40

Hybrid Search local search

Hybrid Search (HS)

Iterative approach

Merges the concepts of the two methods analyzed before.

A set of seed codewords is managed exactly as in Seed Building.

Seed codewords represent the partial code in the context of the Clique Search.

A relaxed distance d' < d is introduced.

A candidate codeword has to be at least at distance d from the seeds, and at least at distance d' from the other candidate codewords (this keeps the maximum clique problem to a reasonable size!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

41

Hybrid Search local search

Seed Building

Clique Search


43

Hybrid Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

Partial code (seed codewords): CAAC AGAG

Maximum clique problem on feasible extensions of the partial solution (heuristic distance d'=1 to reduce the codewords considered):

TGGT

TCTC TGTC

TTGC TAGG

TACG ATGC

ACTC

Solution: CAAC AGAG TCTC TGGT TACG ATGC


44

Hybrid Search local search

•  Combines the advantages of Seed Building with those of Clique Search

but…

•  There is the risk of combining their drawbacks instead!

•  The method deserves further detailed study on larger problems

45

Experimental comparison of some of the heuristic algorithms

Experimental settings Methods coded in ANSI C

Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

Maximum computation times: 10'000 seconds (2.8 hours)

Statistics over 5 runs for each combination problem/method

A_4^{Cstrs}(5,3,2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).


46

Experimental comparison of some of the heuristic algorithms

•  SB = Seed Building

•  CS = Clique Search

•  HS = Hybrid Search


48

Experimental comparison of some of the heuristic algorithms

Comments

•  No clear ranking is possible among the methods considered: Seed Building, Clique Search, and Hybrid Search

•  Methods are therefore likely to represent different neighbourhoods

49

Idea

•  All the methods seen until now work on the search space of feasible solutions (we never have constraints violated…)

•  What if we move into the search space of infeasible solutions? => we will have to minimize (i.e. bring down to zero!) a measure of infeasibility!

•  This makes it possible to develop a completely different kind of local search!

•  It is likely that the search space is visited in a different way by such a family of algorithms…


50

Iterated Greedy Search local search

Iterated Greedy Search (IGS)

Iterative approach

Working on an infeasible code W, trying to make it feasible.

Measure of the infeasibility of W: Inf(W) [formula shown as an image on the slide; not recoverable from this transcript]

where w = floor(n/2)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

51

Iterated Greedy Search local search

Iterated Greedy Search (IGS)

An infeasible solution is obtained by adding a random codeword to a perturbed feasible solution

During each iteration:

•  A codeword σ is selected at random and the optimal (according to Inf(W)) change of one bit of σ is carried out.

•  If Inf(W)=0, we are done, and we can add a random codeword
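
The greedy move can be sketched once an infeasibility measure is fixed. The measure below is an assumption made for illustration, not the formula from the slides: it sums the GC deviation |GC(x) - w| over all codewords, plus the shortfalls max(0, d - H) of the HD and RC distances over all pairs of distinct codewords. The function names are hypothetical, and the codeword index is passed explicitly so that the step is deterministic (the slides pick it at random).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(_COMPLEMENT)


def infeasibility(code, d):
    """ASSUMED measure: GC deviation plus HD and RC shortfalls over distinct pairs."""
    w = len(code[0]) // 2
    total = sum(abs(sum(ch in "GC" for ch in x) - w) for x in code)
    for i in range(len(code)):
        for j in range(i + 1, len(code)):
            total += max(0, d - hamming(code[i], code[j]))
            total += max(0, d - hamming(code[i], wcc(code[j])))
    return total


def greedy_step(code, index, d):
    """Best single-letter change of code[index]; worsening changes are never accepted."""
    best, best_inf = list(code), infeasibility(code, d)
    word = code[index]
    for pos in range(len(word)):
        for letter in "ACGT":
            if letter == word[pos]:
                continue
            trial = list(code)
            trial[index] = word[:pos] + letter + word[pos + 1:]
            value = infeasibility(trial, d)
            if value <= best_inf:  # sideways moves are allowed
                best, best_inf = trial, value
    return best


W = ["TGGT", "GACC", "CGAA", "TCAC", "CCTT"]
print(infeasibility(W, 3), infeasibility(greedy_step(W, 1, 3), 3))
```

With this assumed measure the code TGGT GACC CGAA TCAC CCTT scores 1, and a single change of its second codeword brings the score to 0.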


52

Iterated Greedy Search local search

[Figure with two phases labelled: Perturbation of the solution; Optimization of the solution]

53

Iterated Greedy Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

W                                      Inf(W)
...
TGGT GACC CGAA TCAC CCTT               1
TGGT GACT CGAA TCAC CCTT               0

TGGT GGCA CGAA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC CCTT TTTG          8
TGGT GGCA CGTA TCAC GCTT TTTG          7
TGGT GGCA CGTC TCAC GCTT TTTG          7
…
TGGT AGTG CGTC TCAC GCTT TTTG          4
TGGT AGTG CGTC TCAC GCTT TTCG          3
TGGT AGTG CTTC TCAC GCTT TTCG          0

TGGT AGTG GTAG TCAC GGTT TTCG AACT     9
TGGT AGTG GTAG TCTC GGTT TTCG AACT     9
...


54

Iterated Greedy Search local search

•  We change exactly one bit of a random codeword at each iteration: more complex neighbourhoods could be considered…

•  We never accept changes that make the solution worse: accepting them might be a way to escape from local minima

•  This deserves further investigation…


56

Experimental comparison of some of the heuristic algorithms

•  SB = Seed Building

•  CS = Clique Search

•  HS = Hybrid Search

•  IGS = Iterated Greedy Search


58

Experimental comparison of some of the heuristic algorithms

Comments

•  No clear ranking is possible among the methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search

•  Methods are likely to represent different neighbourhoods


60

Stochastic Local Search: Simple SLS methods

Goal: Effectively escape from local minima of given evaluation function.

General approach: For fixed neighbourhood, use step function that permits worsening search steps.

Specific methods:
•  Randomized Iterative Improvement
•  Simulated Annealing
•  Attribute Based Hill Climber
•  Dynamic Local Search
•  Iterated Local Search
•  Tabu Search

61

Key idea: In each search step, with a fixed probability perform an uninformed random walk step instead of an iterative improvement step.

Randomized Iterative Improvement (RII):
  determine initial candidate solution s
  while termination condition is not satisfied do
    With probability p:
      choose a neighbor s' of s uniformly at random
    Otherwise:
      choose a neighbor s' of s such that g(s') < g(s) or, if no such s' exists, choose s' such that g(s') is minimal
    s := s'

Where g(s) is the objective function value (fitness) of solution s
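
The RII scheme above can be written as a small generic Python function. This is a sketch under stated assumptions: termination is a fixed step budget, the best solution seen so far is tracked and returned, and the improving neighbour is drawn at random among the improving ones. The toy objective and neighbourhood in the usage line are invented for illustration.

```python
import random


def rii(initial, neighbors, g, p, max_steps, rng=None):
    """Randomized Iterative Improvement; returns the best solution visited."""
    rng = rng or random.Random()
    s = best = initial
    for _ in range(max_steps):
        nbrs = neighbors(s)
        if rng.random() < p:
            # Uninformed random walk step.
            s = rng.choice(nbrs)
        else:
            improving = [x for x in nbrs if g(x) < g(s)]
            # Improvement step, or least-bad neighbour at a local minimum.
            s = rng.choice(improving) if improving else min(nbrs, key=g)
        if g(s) < g(best):
            best = s
    return best


# Toy problem: minimise (x - 3)^2 over the integers, neighbours are x - 1 and x + 1.
print(rii(20, lambda x: [x - 1, x + 1], lambda x: (x - 3) ** 2,
          p=0.1, max_steps=200, rng=random.Random(0)))
```

Setting p = 0 gives plain iterative improvement, which is convenient for checking the improvement branch in isolation.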

Stochastic Local Search: Randomized Iterative Improvement


62

Observations: •  No need to terminate search when local minimum is encountered.

Instead: Impose limit on number of search steps or CPU time, from beginning of search or after last improvement.

•  Probabilistic mechanism permits arbitrarily long sequences of random walk steps

Therefore: When run sufficiently long, RII is guaranteed to find (optimal) solution to any problem instance with arbitrarily high probability.

•  Generally, RII is often outperformed by more complex LS methods.

Stochastic Local Search: Randomized Iterative Improvement


64

Stochastic Local Search for the DNA codes design problem

Target: a code with k codewords
1.  Start with k random codewords
2.  Mark unsatisfied constraints (conflicts)
3.  If no unsatisfied constraints go to 8
4.  Pick 2 codewords involved in a conflict
5.  With probability p select a better word minimizing the number of conflicts
6.  Otherwise select a random codeword
7.  Go to step 3.
8.  Display all k codewords

It is a Randomized Iterative Improvement!

Tulpan, D.C., Hoos, H.H., Condon, A.E. Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241 (2002).
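
The eight steps can be sketched as follows for the HD constraint only. This is one possible reading of the procedure, not the implementation of Tulpan et al.: moves are single-letter changes of one of the two conflicting codewords, the random alternative is taken to be a random such change, and a step budget `max_steps` is added as a stopping rule. All names are hypothetical.

```python
import random
from itertools import combinations


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def conflicts(code, d):
    """Index pairs of codewords that violate the HD constraint."""
    return [(i, j) for i, j in combinations(range(len(code)), 2)
            if hamming(code[i], code[j]) < d]


def mutate(code, idx, pos, letter):
    """Copy of the code with one letter of one codeword replaced."""
    new = list(code)
    new[idx] = code[idx][:pos] + letter + code[idx][pos + 1:]
    return new


def sls(k, n, d, p, max_steps, rng):
    """Return a conflict-free code of k words, or None if the budget runs out."""
    code = ["".join(rng.choice("ACGT") for _ in range(n)) for _ in range(k)]
    for _ in range(max_steps):
        found = conflicts(code, d)
        if not found:
            return code
        i, j = rng.choice(found)
        moves = [(idx, pos, letter) for idx in (i, j) for pos in range(n)
                 for letter in "ACGT" if letter != code[idx][pos]]
        if rng.random() < p:
            # Best improvement: the move leaving the fewest conflicts.
            code = min((mutate(code, *m) for m in moves),
                       key=lambda c: len(conflicts(c, d)))
        else:
            # Random walk: an arbitrary single-letter change.
            code = mutate(code, *rng.choice(moves))
    return code if not conflicts(code, d) else None


print(sls(k=4, n=4, d=3, p=0.8, max_steps=2000, rng=random.Random(1)))
```

Extending the sketch to the GC and RC constraints would only change the `conflicts` function.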

65

[Figure: state diagram with states SI, SBI (Best Improvement), SRW (Random Walk) and SF; transition probabilities p and 1 - p.]

Stochastic Local Search for the DNA codes design problem


66

[Figure: flowchart with boxes Initialization, Evaluate, Pick Conflict, Best Improvement (probability p), Random Walk (probability 1-p) and Return Result; the decision after Evaluate has Yes/No branches.]

Stochastic Local Search for the DNA codes design problem

67

[Figure: structure of one search step. Select Conflicts, build the Neighbourhood, then apply either Random Walk or Iterative / Best Improvement.]

Stochastic Local Search for the DNA codes design problem


68

Conflicts: (1,3) (1,5) (2,7) (2,9) (12,14)

Pick Conflict -> Random Walk or Best Improvement (the two branches are labelled p and 1 - p on the slide)

Neighbors: TTTCTCAG, AATCTCAG, …

Current Set (conflicts: (1,3) (1,5) (2,11) (12,14)):

1.  ACCTGATT
2.  ATTCTCAG
3.  ACCTTTTT
4.  TATATATA
5.  CATTCACC
6.  ATTCTCAA
7.  GATTCAAT
8.  TCACCATG
9.  CCGTTACA
10. GCGCGCGC
11. CTATTCAC
12. TTGGCCAA
13. GGCTTTTA
14. CTACTACG

Highlighted word: TTTCTCAG

Thesis Contributions: C1 Development of novel optimization algorithms

Given: a fixed set of constraints C, strand length n=8, set size k=14.

Stochastic Local Search for the DNA codes design problem

69

Simple SLS without Random Replacement

k = {100, 120, 140}, n = 8, d = 4

HD constraint only

1000 successful runs

Stochastic Local Search for the DNA codes design problem - results

Comments:

•  The number of iterations required increases with k

•  The increase is more dramatic when k is high => risk of stagnation

Distribution of the number of iterations required to have a feasible solution for different values of k (target number of codewords)


70

k = 70, n = 8, d = 4

HD, RC, GC constraints

1000 successful runs

SLS with Random Replacement vs Simple SLS

Stochastic Local Search for the DNA codes design problem - results

Distribution of the number of iterations required to have a feasible solution for k = 70 (target number of codewords)

Comments:

•  Random Replacement helps!

•  Stagnation reduced

•  Better robustness

71

Scaling of SLS with Random Replacement (n = 8, d = 4)

[Figure: number of search iterations (logarithmic axis, 10 to 100000) against DNA set size (20 to 160), with one curve each for HD, HD+GC and HD+GC+RC.]

Stochastic Local Search for the DNA codes design problem

Comments:

•  SLS scales up better when fewer constraints are considered

•  Why? Intuitively, because fewer constraints => easier problem


72

New bounds on the size of DNA codes

Note: HD, GC constraints.

n     d     Previous best   SLS
6     3     56              85
10    5     132             256
14    7     240             500
18    9     380             1200
20    10    1520            2193

Stochastic Local Search for the DNA codes design problem - results

73

Comments: •  There are improvements over previous best.

•  The method is still extremely simple and intuitive [good quality in general but...]

•  Is it possible to improve it with some refinement?

•  Where should we work to refine the method?

Stochastic Local Search for the DNA codes design problem - results

IDEA: trying different neighbourhoods!


74

• Combinatorial problem Π: DNA Word Design

•  Problem instance π : DNA/quaternary code design [ particular (n,d) combinations ]

•  Search space S(π): set of (code word) sets s

•  Neighborhood relation N(π): k-exchange + random based neighborhoods

•  Initialization function init(π): random choice or predefined

•  Step function step(π): chooses with probability p between best improvement and random walk

•  Termination predicate terminate(π): a function depending on the number of iterations performed or on whether a solution has been found

Improved Stochastic Local Search for the DNA codes design problem

Tulpan, D.C., Hoos, H.H. Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433 (2003).

Instead of simple 1-exchange!

75

Neighbourhoods

Simple neighbourhoods

•  k-exchange / k-point mutation neighbourhoods

•  rotation-based neighbourhoods

•  random neighbourhoods

Complex neighbourhoods

•  1-exchange / 1-point mutation + rotation neighbourhoods

•  k-exchange / k-point mutation + random words neighbourhoods

•  1-exchange / 1-point mutation + rotations + random words neighbourhoods

Improved Stochastic Local Search for the DNA codes design problem


76

Simple neighbourhoods

v-exchange / v-point mutation neighbourhoods

Example:

some of the codewords in the 2-exchange neighbourhood of CTA are:

ACA GTT TTG TCA
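The example can be reproduced with a short enumeration. The sketch below assumes that the v-exchange neighbourhood contains every word that differs from the input in at least 1 and at most v positions, and it applies no GC-content filter; both choices, and the function name `v_exchange`, are assumptions made for illustration.

```python
from itertools import combinations, product

BASES = "ACGT"

def v_exchange(word, v):
    """Words differing from `word` in 1..v positions (assumed definition)."""
    neighbours = set()
    for r in range(1, v + 1):
        for positions in combinations(range(len(word)), r):
            # For each chosen position, any base other than the current one.
            choices = [[b for b in BASES if b != word[i]] for i in positions]
            for letters in product(*choices):
                chars = list(word)
                for i, b in zip(positions, letters):
                    chars[i] = b
                neighbours.add("".join(chars))
    return neighbours

neigh = v_exchange("CTA", 2)
# The four words of the slide are all in the neighbourhood.
print({"ACA", "GTT", "TTG", "TCA"} <= neigh)
```

Under this definition the 2-exchange neighbourhood of a word of length 3 has 9 + 27 = 36 elements, which is why the slide lists only some of them.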

Improved Stochastic Local Search for the DNA codes design problem

77

Simple neighbourhoods

rotation-based neighbourhoods

Applying the neighbourhood to a given codeword, we get the codewords obtained from the input codeword by "shifting right" the codeword by 1 to n-1 positions.

Example:

CTA => TAC, ACT
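A cyclic right shift is a one-line slice in Python. The sketch below is only an illustration (the function name is hypothetical); dropping repeated words and the input word itself is an extra choice made here, which matters for periodic words such as ACAC.

```python
def rotation_neighbourhood(word):
    """Cyclic right shifts of `word` by 1..n-1 positions, without repeats."""
    result = []
    for s in range(1, len(word)):
        rotated = word[-s:] + word[:-s]   # last s letters move to the front
        if rotated != word and rotated not in result:
            result.append(rotated)
    return result

print(rotation_neighbourhood("CTA"))   # ['ACT', 'TAC']
```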

Improved Stochastic Local Search for the DNA codes design problem


78

Simple neighbourhoods

random neighbourhoods

Example: some of the codewords in the random neighbourhood of CTA are:

CAA CTT TTC TCA

Improved Stochastic Local Search for the DNA codes design problem

79

Complex neighbourhoods

1-exchange + rotation neighbourhoods

v-exchange + random words neighbourhoods

1-exchange + rotations + random words neighbourhoods

•  These neighbourhoods are obtained by applying all the neighbourhoods involved sequentially (repeated codewords have to be avoided)

•  When rotation is involved, it is applied to all the codewords obtained by the neighbourhoods previously applied
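The two rules above (apply the neighbourhoods sequentially and avoid repeats; apply rotation to every codeword produced so far) can be sketched for the "1-exchange + rotations + random words" case. The helper names and the parameter `n_random` are hypothetical, and no GC-content filtering is attempted in this sketch.

```python
import random

BASES = "ACGT"

def one_exchange(word):
    """All words that differ from `word` in exactly one position."""
    return [word[:i] + b + word[i + 1:]
            for i in range(len(word)) for b in BASES if b != word[i]]

def rotations(word):
    """Cyclic right shifts by 1..n-1 positions."""
    return [word[-s:] + word[:-s] for s in range(1, len(word))]

def hybrid_neighbourhood(word, n_random, rng):
    result = []

    def add(w):
        # Repeated codewords (and the input word) are skipped.
        if w != word and w not in result:
            result.append(w)

    for w in one_exchange(word):          # 1) 1-exchange
        add(w)
    for w in list(result):                # 2) rotations of everything so far
        for r in rotations(w):
            add(r)
    for _ in range(n_random):             # 3) random words, not rotated
        add("".join(rng.choice(BASES) for _ in word))
    return result

neigh = hybrid_neighbourhood("CTA", n_random=5, rng=random.Random(1))
```

Because random words are added last, they are not rotated, which keeps the neighbourhood size close to the sum of its parts.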

Improved Stochastic Local Search for the DNA codes design problem


80

Improved Stochastic Local Search for the DNA codes design problem

The difference is here!

81

k-exchange Neighbourhoods k = 70, n = 8, d = 4

HD, RC, GC constraints

1000 successful runs

{1, 2, 3}-exchange neighbourhoods

Improved Stochastic Local Search for the DNA codes design problem - results

Distribution of the number of iterations required to have a feasible solution for different v-exchange methods

Comments:

•  Using a larger neighbourhood seems to help, but…

•  The difference between 2-exchange and 3-exchange is not dramatic

•  A larger neighbourhood means more time at each iteration…

Why 16? 2 words, and I have to respect GC content
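One way to read the "Why 16?" annotation, offered here as an assumption rather than as the authors' explanation: each of the 2 conflicting words has n = 8 positions, and preserving GC content leaves a single admissible substitution per position (A <-> T or C <-> G), giving 2 x 8 = 16 neighbours. If a v-exchange move changes between 1 and v positions in this way, the same count also reproduces the sizes 72 and 184 that the deck reports for 2-exchange and 3-exchange.

```python
from math import comb

def neighbourhood_size(n, v, words=2):
    """Assumed count: choose 1..v positions, one GC-preserving swap per position."""
    return words * sum(comb(n, i) for i in range(1, v + 1))

sizes = [neighbourhood_size(8, v) for v in (1, 2, 3)]
print(sizes)   # [16, 72, 184]
```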


82

k-exchange Neighbourhoods: time for 1 iteration

Neighbourhood    CPU Time
1-exchange       .0017
2-exchange       .0088
3-exchange       .0314

Improved Stochastic Local Search for the DNA codes design problem - results

Distribution of the CPU time required to have a feasible solution for different v-exchange methods

Comments:

•  1-exchange is still the best in terms of run times => not what we hoped!

83

Hybrid Randomized Neighbourhoods

k = 70, n = 8, d = 4

HD, RC, GC constraints

1000 successful runs

random, hybrid neighbourhoods

Improved Stochastic Local Search for the DNA codes design problem - results

Distribution of the number of iterations required to have a feasible solution for different hybrid neighbourhoods

Comments:

•  Pure random performs surprisingly well

•  1-exchange + random is however the best method


84

All combinations of neighbourhoods together (usual benchmark)

Improved Stochastic Local Search for the DNA codes design problem

Distribution of the number of iterations required to have a feasible solution for different neighbourhoods

Comments:

•  1-exchange + rotation + random is the most promising combination in terms of number of iterations

•  Methods including the random neighbourhood are definitely better

85

Approximate CPU cost per iteration for all the combinations of neighbourhoods considered

Neighbourhood Type                  Neighbourhood size   CPU Time [sec]
1-exchange                          16                   .002184
2-exchange                          72                   .008830
3-exchange                          184                  .031493
1-exchange + rotations              128                  .017294
random                              128                  .015100
1-exchange + random                 16 + 112             .022889
2-exchange + random                 72 + 112             .029167
3-exchange + random                 184 + 112            .040833
1-exchange + rotations + random     128 + 100            .043333

Improved Stochastic Local Search for the DNA codes design problem

Comment:

•  1-exchange + random is a good compromise between speed and quality of the solutions

•  Let’s now see what happens if we consider both the time spent on each iteration and the number of iterations required to converge… [next slide]


86

All combinations of neighbourhood together (usual benchmark)

Improved Stochastic Local Search for the DNA codes design problem

Distribution of the CPU time required to have a feasible solution for different neighbourhoods

Comments:

•  Rotation is time consuming => methods with rotation are no longer so attractive

•  1-exchange + random neighbourhood is by far the most promising combination in terms of CPU time

87

Improved Stochastic Local Search for the DNA codes design problem

Is this randomized step still interesting?


88

Improved Stochastic Local Search for the DNA codes design problem

Number of iterations to have a feasible solution for different values of the randomizing parameter

Comments:

•  The randomized step is useless when the hybrid randomized neighbourhood is used!

•  This happens because the neighbourhood already does the “random work”

k = 70, n = 8, d = 4

HD, RC, GC constraints

1000 successful runs

random, hybrid neighbourhoods

89

Improved Stochastic Local Search for the DNA codes design problem


90

n = 8, d = 4

HD, RC, GC constraints

1000 successful runs

1-exchange, random, hybrid neighbourhoods

Scaling of the Improved SLS

Improved Stochastic Local Search for the DNA codes design problem - results

Comments:

•  The pure random neighbourhood scales up surprisingly well

•  However, 1-exchange + random neighbourhood is the best

91

SLS Results and Analysis

•  New bounds for DNA set sizes •  Improved SLS using various neighborhoods

Length (n)   Hamming dist. (d)   Existing Bounds (k)   Simple SLS (k)   Improved SLS (k)
4            3                   -                     5                6
8            4                   108                   112*             128
10           5                   -                     127              158
12           6                   -                     210              240

Combinatorial constraints: HD, RC, GC

[Tulpan et al., 2002] [Tulpan et al., 2003] [Frutos et al., 1997]

Improved Stochastic Local Search for the DNA codes design problem


92

Conclusions

•  Random neighbourhoods => increased SLS performance

•  1-exchange + random neighbourhood is the best combination

•  Larger DNA codes have been obtained

Improved Stochastic Local Search for the DNA codes design problem

93

Another Stochastic Local Search for the DNA codes design problem

Chee, Y. M, Ling, S. Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394 (2008).

•  A different SLS algorithm has been presented in the literature.

•  It can be seen as a Simulated Annealing algorithm without a cooling schedule (constant temperature).

•  The current code L is always feasible

•  At each iteration a new (feasible) codeword s is added, and all the codewords of L that are not compatible with s are removed, leading to a new code L’

•  Code L’ is accepted with a certain probability depending on |L’| - |L| (difference in the cardinalities of the two sets)
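One iteration of this scheme can be sketched as follows. The construction of L' follows the bullets above; the acceptance rule, however, is only a placeholder: the slides state that acceptance depends on |L'| - |L| but the exact formula is not given here, so a generic constant-temperature Metropolis-style rule is assumed. The RC check uses the usual definition (Hamming distance to the reverse complement of at least d), and all function names are hypothetical.

```python
import math
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def compatible(a, b, d):
    """HD and RC constraints between two distinct codewords."""
    return hamming(a, b) >= d and hamming(a, reverse_complement(b)) >= d

def add_codeword(code, s, d):
    """Build L': drop the words of L that are incompatible with s, then add s."""
    return [w for w in code if compatible(w, s, d)] + [s]

def accept(old, new, temperature, rng):
    """Assumed acceptance rule (constant temperature), not taken from the slides."""
    delta = len(new) - len(old)
    return delta >= 0 or rng.random() < math.exp(delta / temperature)

L = ["AACC", "CCAA"]
L_new = add_codeword(L, "AACG", d=3)
print(L_new)                                        # ['CCAA', 'AACG']
print(accept(L, L_new, 1.0, random.Random(0)))      # True
```

In the usage example AACC is removed because it is at Hamming distance 1 from the new word AACG, so the code size is unchanged and the move is accepted under the assumed rule.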


94

Another Stochastic Local Search for the DNA codes design problem

[Figure: pseudocode of the algorithm, annotated with: target number of codewords (k before), max number of iterations, code, set of incompatible codes, and acceptance probability of the new code.]

95

Another Stochastic Local Search for the DNA codes design problem

Improvements over previous bests in the literature (theoretical methods, other SLSs and a few more)

HD, GC and RC constraints


96

Stochastic Local Searches for the DNA codes design problem

•  Different methods based on a similar idea lead to very different codes

•  No method dominates the others

•  The methods seem to explore the search space in different ways

•  Is it possible to combine the good properties of (some of) the different approaches into a single method?

97

Outline •  Introduction •  The DNA Codes Design problem •  Approaches in the literature •  Construction heuristics •  Simple local searches •  Metaheuristics

–  Intro to Stochastic Local Search –  Applications to the DNA codes design problem –  Intro to Variable Neighbourhood Search –  Applications to the DNA codes design problem

•  Bibliography

98-102

VNS [five slides introducing Variable Neighbourhood Search; their content is graphical and is not reproduced in this transcript]

103

Outline •  Introduction •  The DNA Codes Design problem •  Approaches in the literature •  Construction heuristics •  Simple local searches •  Metaheuristics

–  Intro to Stochastic Local Search –  Applications to the DNA codes design problem –  Intro to Variable Neighbourhood Search –  Applications to the DNA codes design problem

•  Future research •  Acknowledgment •  Bibliography


104

A VNS algorithm for DNA codes design

A primitive Variable Neighbourhood Search (VNS) algorithm is introduced.

It repeatedly runs, in turn, the local search algorithms (basic ingredients) seen before.

The reference solution for the local searches is always the best solution found so far.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

This is a Variable Neighbourhood Descent!

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)

Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)
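The control loop of a Variable Neighbourhood Descent of this kind can be sketched as below. This is a textbook-style skeleton under stated assumptions, not the published algorithm: the local searches of the deck (Seed Building, Clique Search, Hybrid Search, Iterated Greedy Search) are replaced by two hypothetical greedy extensions, only the HD constraint is checked, and the objective is simply the number of codewords.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_extend(code, candidates, d):
    """Hypothetical local search: add every candidate compatible with the code."""
    result = list(code)
    for w in candidates:
        if all(hamming(w, c) >= d for c in result):
            result.append(w)
    return result

def vnd(initial, local_searches):
    """Run the local searches in turn, always starting from the best code so far."""
    best = list(initial)
    i = 0
    while i < len(local_searches):
        candidate = local_searches[i](best)
        if len(candidate) > len(best):
            best, i = candidate, 0    # improvement: restart from the first search
        else:
            i += 1                    # no improvement: try the next search
    return best

words = ["".join(p) for p in product("ACGT", repeat=3)]
searches = [
    lambda code: greedy_extend(code, words, 2),            # lexicographic order
    lambda code: greedy_extend(code, reversed(words), 2),  # inverse order
]
code = vnd(["CTA"], searches)
```

The loop stops when no local search can improve the reference solution, which is the point at which a full VNS would typically apply a perturbation before continuing.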

105

A VNS algorithm for DNA codes design

Methods involved in our implementation


106

A VNS algorithm for DNA codes design

•  We hope to take advantage of the different philosophies behind the local search methods listed before

•  From previous experiments we know that the basic local searches visit the search space in different ways

•  We hope basic local searches will help each other to exit from local minima within a VNS framework

107

Experimental comparison of some of the heuristic algorithms

Experimental settings

Methods coded in ANSI C

Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

Maximum computation times: 10'000 seconds (2.8 hours)

Statistics over 5 runs for each combination problem/method

$A_4^{Cstrs}(5,3,2)$ identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. The subscript 4 is the alphabet size. [this funny notation comes from coding theory…]
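To make the three parameters concrete, the sketch below checks whether a given code is feasible for a problem (n, d, w), with the RC constraint optional. It assumes the usual definitions: every word has exactly w letters in {G, C}, distinct words are at Hamming distance at least d, and, when RC is active, every word is at distance at least d from the reverse complement of every word, itself included. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def gc_content(word):
    return sum(c in "GC" for c in word)

def is_feasible(code, n, d, w, use_rc=False):
    """Check the GC, HD and (optionally) RC constraints for problem (n, d, w)."""
    if any(len(x) != n or gc_content(x) != w for x in code):
        return False
    for i, x in enumerate(code):
        if any(hamming(x, y) < d for y in code[i + 1:]):
            return False
        # The RC distance is symmetric, so pairs with j >= i are enough;
        # j == i checks each word against its own reverse complement.
        if use_rc and any(hamming(x, reverse_complement(y)) < d for y in code[i:]):
            return False
    return True

print(is_feasible(["AACC", "CCAA"], n=4, d=3, w=2, use_rc=True))   # True
print(is_feasible(["ACGT"], n=4, d=3, w=2, use_rc=True))           # False
```

The second call fails because ACGT is its own reverse complement, so it violates the RC constraint on its own.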

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).


108

Experimental comparison of some of the heuristic algorithms

•  SB = Seed Building

•  CS = Clique Search

•  HS = Hybrid Search

•  IGS = Iterated Greedy Search

•  VNS = Variable Neighbourhood Search

109

Experimental comparison of some of the heuristic algorithms

•  SB = Seed Building

•  CS = Clique Search

•  HS = Hybrid Search

•  IGS = Iterated Greedy Search

•  VNS = Variable Neighbourhood Search


110

Experimental comparison of some of the heuristic algorithms

Comments

•  No clear ranking is possible among the basic methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search (as seen before…)

⇒  Methods are likely to represent different neighbourhoods

•  Variable Neighbourhood Search clearly dominates the other methods

⇒  VNS takes advantage of the different neighbourhoods

⇒  VNS is likely to be competitive against all the other methods!

111

Experimental results of VNS

The VNS algorithm (reference algorithm) discussed in:

•  Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326.

is compared with the methods discussed in the following 6 papers [which provide all the best known codes]:

•  Li, M., Lee, H. J., Condon, A. E., and Corn, R. M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

•  Tulpan, D. C., Hoos, H. H., and Condon, A. E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, 2568, 229-241.

•  Tulpan, D. C. and Hoos, H. H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, 2671, 418-433.

•  King, O. D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33.

•  Gaborit, P. and King, O. D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113.

•  Chee, Y. M. and Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394.

(On the slide, the six papers are labelled as either Theor. Constructions or Heuristic Algorithms.)


112

Experimental results of VNS

Experimental settings •  Methods coded in ANSI C

•  Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

•  Maximum computation times: 100'000 seconds (27.8 hours)

=> Comparable with that of other heuristic algorithms

•  Best over 5 runs for each combination problem/method

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

113

•  We will consider 254 problems with

-  4 ≤ n ≤ 20

-  3 ≤ d ≤ n ≤ 20

-  Case 1: HD and GC constraints

-  Case 2: HD, RC and GC constraints

•  These settings match those of the state-of-the-art tables maintained at http://llama.med.harvard.edu/~king/dnacodes.html by O.D. King (last checked November 2009)

•  We left out problems corresponding to very large codes (the current VNS algorithm cannot tackle them)

Experimental results of VNS


114

•  over 254 problems considered:

•  in 128 cases the best known result is matched

•  in 52 cases a new best result is found

Experimental results of VNS

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Mathematical Modelling and Algorithms 7, 311-326 (2008).

115

Detailed results of VNS


116

Detailed results of VNS

117

Detailed results of VNS


118

Detailed results of VNS

119

•  After the publication of the paper we have been improving the VNS algorithms in many ways (work still in progress!)

•  over 254 problems considered:

•  in 132 cases (previously 128) the best known result is matched

•  in 87 cases (previously 52) a new best result is found

•  We miss the best known solution in 13.8% of the cases only!

•  We feel there is room for further improvements…
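The 13.8% figure follows directly from the counts on this slide (254 problems, 132 matched, 87 improved); the short check below only restates that arithmetic, with variable names chosen for illustration.

```python
total, matched, improved = 254, 132, 87

missed = total - matched - improved      # problems where the best known code is not reached
share = 100 * missed / total

print(missed, round(share, 1))           # 35 13.8
```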

Experimental results of VNS

Montemanni, R., Smith D.H. Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference (2009)

Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer (to appear)


120

Detailed results of VNS

Comments

•  VNS works (slightly) better on problems with RC constraints

•  Result also confirmed by our latest improved implementations

•  Is this because the other methods are more competitive without RC constraints?

YES => we might not have much chance to improve on problems without RC constraints

NO => we probably have a chance to improve on problems without RC constraints

=> Worth investigating!

121

Outline •  Introduction •  Real applications •  The DNA Codes Design problem •  Approaches in the literature •  Construction heuristics •  Simple local searches •  Metaheuristics

–  Intro to Stochastic Local Search –  Applications to the DNA codes design problem –  Intro to Variable Neighbourhood Search –  Applications to the DNA codes design problem

•  Future research •  Acknowledgment •  Bibliography


122

Essential bibliography (1/4) [HEUR] => Heuristics related publication.

Brenner, S., Lerner, R.A. (1992). Encoded combinatorial chemistry. Proceedings of the National Academy of Sciences USA, 89, 5381-5383.

Adleman, L. (1994) Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024.

Frutos, A.G., Liu, Q., Thiel, A.J., Sanner, A.M.W., Condon, A.E., Smith, L.M., Corn, R.M. (1997). Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 25, 4748-4757.

Hansen, P., Mladenovic, N. (2001). Variable neighbourhood search: principles and applications. European Journal of Operational Research, 130, 449-467. [HEUR]

Marathe, A., Condon, A.E., Corn, R.M. (2001). On combinatorial DNA word design. Journal of Computational Biology, 8, 201-219.

Arita, M., Kobayashi, S. (2002). DNA sequence design using templates. New Generation Computing, 20, 263-277.

123

Essential bibliography (2/4)

Li, M., Lee, H.J., Condon, A.E., Corn, R.M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

Tulpan, D.C., Hoos, H.H., Condon, A.E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241. [HEUR]

Tulpan, D.C., Hoos, H.H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433. [HEUR]

King, O.D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33. [HEUR]

Kobayashi, S., Kondo, T., Arita, M. (2003). On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214.

Hoos, H.H., Stuetzle, T. (2004). Stochastic Local Search: foundations and applications. Morgan Kaufmann/Elsevier. [HEUR]


124

Essential bibliography (3/4)

Gaborit, P., King, O.D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113. [HEUR]

Tulpan, D.C. (2006). Effective heuristic methods for DNA strand design. PhD thesis, University of British Columbia. [HEUR]

King, O.D. (2006). Tables of lower bounds for DNA codes with constant GC-content. http://llama.med.harvard.edu/~king/dnacodes.html, last checked: November 2009. [HEUR]

Chee, Y. M, Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394. [HEUR]

Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326. [HEUR]

Montemanni, R., Smith, D.H. (2009). Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656. [HEUR]

Montemanni, R., Smith D.H. (2009). Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference. [HEUR]

125

Essential bibliography (4/4)

Montemanni, R., Smith D.H., Koul, N. (2010). Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer. [HEUR]

Tulpan, D., Montemanni, R., Ghiggi, A. (2010). Computational Sequence Design Techniques for DNA Microarray Technologies. Submitted for publication. [HEUR]

Ghiggi, A. (2010). DNA strand design with thermodynamic constraints. Master thesis, USI. [HEUR]

Koul, N. (2010). Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI. [HEUR]

Neelakandan, I. (2010). New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI. [HEUR]


126

Exercises 1

1.  We have the following code with n=4:

CGTA GGAA AATG TAGA

a.  Does it respect the GC-content constraint?

b.  Does it respect the Hamming distance constraint for a DNA codes design problem with d=2?

c.  Does it respect the Reverse Complement Hamming distance constraint for a DNA codes design problem with d=2?

2.  Given the settings n=4, d=3 and constraints HD, GC, RC, consider the following code:

AACC CAGT GAAG TCCT TGAC

a.  Is it feasible?

b.  Can it be extended?

127

Exercises 2

1.  Given the settings n=3, d=2 and constraints HD, GC, show an execution of the Construction Heuristic working on top of the inverse lexicographic order.

2.  Given the settings n=2, d=1 and constraints HD, GC, RC, show an execution of the Construction Heuristic working on top of the lexicographic order.

3.  Given the settings n=3, d=2, constraints HD, GC, RC, and the following partial code:

CTT CAA TGT GTA

show an iteration of the Clique Search algorithm.

4.  Given the settings n=4, d=2, constraints HD, GC, RC, and the following code:

TGGT GACC CGAA TCTC CGTT

calculate its measure of infeasibility Inf(W) according to the definition given in slide 75 (Iterated Greedy Search)


128

Exercises 3

1.  Write the rotation neighbourhood of codeword CATGA.

2.  Write 5 of the codewords of the 3-exchange neighbourhood of codeword CATGA.

3.  Write 5 of the codewords of the random neighbourhood of codeword CATGA.

4.  Write 5 of the codewords of the 2-exchange + random neighbourhood of codeword CATGA.

5.  Consider the SLS method described from slide 119 on, with input parameters n=4, d=3, and constraints HD, GC, RC.

At a given iteration we have the following code L

CTTC GGTT GTCA AGGA ACTG TTGG

and the selected random codeword is TTGC.

Write down code L’ (we do not care if it will be accepted or not)