Geometric Crossover for Biological Sequences Alberto Moraglio, Riccardo Poli & Rolv Seehuus EuroGP 2006

Geometric Crossover for Biological Sequences

Alberto Moraglio, Riccardo Poli & Rolv Seehuus

EuroGP 2006

Contents

I. Geometric Crossover

II. Geometric Crossover for Sequences

III. Is Biological Recombination Geometric?

I. Geometric Crossover

Geometric Crossover

• Representation-independent generalization of traditional crossover

• Informally: all offspring are between parents

• Search space: all offspring are on shortest paths connecting parents

Geometric Crossover & Distance

• Search space is a metric space: d(A,B) = length of a shortest path between A and B

• Metric space: all offspring C are in the segment between parents

• C ∈ [A,B]_d ⇔ d(A,C) + d(C,B) = d(A,B)
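The segment condition can be checked directly for any distance function. The following is a minimal sketch; the names `in_segment` and `abs_diff` are illustrative and do not come from the talk, and integers under absolute difference are used only as a simple stand-in metric space.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def in_segment(a: T, c: T, b: T, d: Callable[[T, T], float], tol: float = 1e-9) -> bool:
    """Return True if c lies in the metric segment [a, b] under distance d,
    i.e. d(a, c) + d(c, b) == d(a, b) up to a small tolerance."""
    return abs(d(a, c) + d(c, b) - d(a, b)) <= tol


def abs_diff(x: int, y: int) -> int:
    """Distance on the integers, used here only as a toy metric."""
    return abs(x - y)


inside = in_segment(2, 5, 9, abs_diff)    # 3 + 4 == 7
outside = in_segment(2, 10, 9, abs_diff)  # 8 + 1 != 7
print(inside, outside)
```

A crossover is geometric under d when every offspring it can produce passes this test for its two parents.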

Example 1: Traditional Crossover

• Traditional Crossover is Geometric Crossover under Hamming Distance

Parent1: 011|101

Parent2: 010|111

Child: 011|111

HD(P1,C)+HD(C,P2)=HD(P1,P2)

1 + 1 = 2
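The arithmetic of this example can be reproduced in a few lines. This is a sketch only: `hamming` and `one_point_crossover` are illustrative helper names, and the cut position 3 is read off the `|` marker on the slide.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def one_point_crossover(p1: str, p2: str, cut: int) -> str:
    """Take the head of p1 up to the cut and the tail of p2 after it."""
    return p1[:cut] + p2[cut:]


p1, p2 = "011101", "010111"
child = one_point_crossover(p1, p2, cut=3)

print(child)                                     # 011111
print(hamming(p1, child), hamming(child, p2))    # 1 1
print(hamming(p1, p2))                           # 2

# The same identity holds for every possible cut point of these parents.
all_cuts_ok = all(
    hamming(p1, c) + hamming(c, p2) == hamming(p1, p2)
    for c in (one_point_crossover(p1, p2, k) for k in range(len(p1) + 1))
)
```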

Example 2: Blending Crossover

• Blending Crossover for real vectors is geometric under Euclidean Distance

[Figure: child C lies on the straight line segment between parents P1 and P2]

ED(P1,C)+ED(C,P2)=ED(P1,P2)
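A small numeric check of this identity follows. It assumes that blending uses a single mixing coefficient `alpha` in [0, 1] shared by all coordinates, so the child lies on the straight line between the parents; the function names and the sample points are illustrative.

```python
import math
from typing import Sequence, Tuple


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two real vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def blend(p1: Sequence[float], p2: Sequence[float], alpha: float) -> Tuple[float, ...]:
    """Child = (1 - alpha) * p1 + alpha * p2, with one alpha for all coordinates (assumed)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1] for the child to lie between the parents")
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(p1, p2))


p1, p2 = (0.0, 0.0), (3.0, 4.0)
c = blend(p1, p2, alpha=0.25)

lhs = euclidean(p1, c) + euclidean(c, p2)
rhs = euclidean(p1, p2)
print(c, lhs, rhs)
```

With floating-point vectors the two sides should be compared with a tolerance (for example `math.isclose`) rather than with `==`.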

Many Recombinations are Geometric

• Traditional Crossover for multary strings

• Box and Discrete recombinations for real vectors

• PMX, Cycle and Order Crossovers for permutations

• Homologous Crossover for GP trees

• Ask me for more examples over a coffee!

Being a geometric crossover is important because...

• We know how the search space is going to be searched by geometric crossover for any representation: convex search

• We know a rule of thumb for the type of landscape on which geometric crossover will perform well: “smooth” landscapes

• This is just the beginning of a general theory; in the future we will know more!

II. Geometric Crossover for Sequences

Sequences & Edit Distance

• Sequence: variable-length string of characters from an alphabet A

• Edit distance: minimum number of edit operations (insertion, deletion, substitution) needed to transform one sequence into the other

• A = {a,c,t,g}, Seq1 = agcacaca, Seq2 = acacacta

• Seq1 = agcacaca → acacaca → acacacta = Seq2

• ED(Seq1,Seq2) = 2 (g deleted, t inserted)
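The edit distance of the example can be computed with the standard dynamic-programming recurrence for unit-cost insertions, deletions and substitutions. This is a generic textbook sketch, not code from the talk or the paper; `edit_distance` is an illustrative name.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance between s and t.

    prev[j] holds the distance between the first i-1 characters of s
    and the first j characters of t; cur is the row for i characters.
    """
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]


seq1, seq2 = "agcacaca", "acacacta"
print(edit_distance(seq1, seq2))       # 2
print(edit_distance(seq1, "acacaca"))  # 1 (g deleted)
print(edit_distance("acacaca", seq2))  # 1 (t inserted)
```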

Sequence Alignment (on contents)

• Alignment: insert gaps (-) into both sequences so that they have the same length

Seq1’ = agcacac-a
Seq2’ = a-cacacta

• Alignment score: number of mismatches = 2

• Optimal alignment: minimal-score alignment (Best Inexact Alignment on Contents)

• The score of the optimal alignment of two sequences equals their edit distance: ED(Seq1,Seq2) = Score(A) = 2
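One way to obtain an optimal alignment is to fill the full edit-distance table and trace back through it. The sketch below assumes unit costs and breaks ties in a fixed order, so it returns one optimal alignment, which need not be the one printed on the slide; `align` and `mismatches` are illustrative names.

```python
from typing import Tuple


def align(s: str, t: str, gap: str = "-") -> Tuple[str, str, int]:
    """Return (s_aligned, t_aligned, score) for one optimal unit-cost alignment."""
    n, m = len(s), len(t)
    # D[i][j] = edit distance between s[:i] and t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    # Trace back from the bottom-right corner, emitting columns in reverse.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append(gap); i -= 1   # gap in t
        else:
            a.append(gap); b.append(t[j - 1]); j -= 1   # gap in s
    return "".join(reversed(a)), "".join(reversed(b)), D[n][m]


def mismatches(a: str, b: str) -> int:
    """Alignment score: number of columns whose two symbols differ."""
    return sum(x != y for x, y in zip(a, b))


s1, s2, score = align("agcacaca", "acacacta")
print(s1)
print(s2)
print(score, mismatches(s1, s2))
```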

Homologous Crossover

1. Align optimally two parent sequences

2. Generate randomly a crossover mask as long as the alignment

3. Recombine as traditional crossover

4. Remove dashes from offspring

Mask  = 111111000
Seq1’ = agcacac-a
Seq2’ = a-cacacta
SeqC’ = a-cacac-a
SeqC  = acacaca
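Steps 2 to 4 can be sketched as follows, starting from parents that are already aligned. The mask convention (a 1 takes the column from Seq2’, a 0 from Seq1’) is inferred from the example on the slide rather than stated in it, and the function names are illustrative.

```python
import random
from typing import Optional, Tuple


def random_mask(length: int, rng: Optional[random.Random] = None) -> str:
    """Step 2: a uniformly random binary mask as long as the alignment."""
    rng = rng or random.Random()
    return "".join(rng.choice("01") for _ in range(length))


def recombine_aligned(a1: str, a2: str, mask: str, gap: str = "-") -> Tuple[str, str]:
    """Steps 3 and 4: column-wise recombination, then removal of gap symbols.

    Assumed convention: mask bit '1' copies the column from a2, '0' from a1.
    Returns (aligned child, child with gaps removed).
    """
    if not (len(a1) == len(a2) == len(mask)):
        raise ValueError("aligned parents and mask must have the same length")
    child_aligned = "".join(y if bit == "1" else x for x, y, bit in zip(a1, a2, mask))
    return child_aligned, child_aligned.replace(gap, "")


seq1_aligned, seq2_aligned = "agcacac-a", "a-cacacta"
child_aligned, child = recombine_aligned(seq1_aligned, seq2_aligned, "111111000")
print(child_aligned)  # a-cacac-a
print(child)          # acacaca

# A random offspring of the same two aligned parents.
_, random_child = recombine_aligned(
    seq1_aligned, seq2_aligned, random_mask(len(seq1_aligned), random.Random(0))
)
```

Because gap symbols are removed at the end, offspring can be shorter or longer than either parent.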

Theorem: Geometricity of HC

• Homologous Crossover is geometric crossover under edit distance

Seq1 = agcacaca → SeqC = acacaca → acacacta = Seq2

ED(Seq1,SeqC)+ED(SeqC,Seq2)=ED(Seq1,Seq2)

1 + 1 = 2
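For the two example parents the theorem can be checked exhaustively, since the alignment has only nine columns and therefore 512 possible masks. The sketch below is an empirical check on this one alignment, not a proof; it repeats a unit-cost edit distance so that it runs on its own, and all names are illustrative.

```python
from itertools import product
from typing import Sequence


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def recombine(a1: str, a2: str, mask: Sequence[int], gap: str = "-") -> str:
    """Column-wise recombination of aligned parents, gaps removed afterwards."""
    return "".join(y if bit else x for x, y, bit in zip(a1, a2, mask)).replace(gap, "")


seq1_aligned, seq2_aligned = "agcacac-a", "a-cacacta"
seq1 = seq1_aligned.replace("-", "")
seq2 = seq2_aligned.replace("-", "")
total = edit_distance(seq1, seq2)

children = set()
all_geometric = True
for mask in product((0, 1), repeat=len(seq1_aligned)):
    child = recombine(seq1_aligned, seq2_aligned, mask)
    children.add(child)
    if edit_distance(seq1, child) + edit_distance(child, seq2) != total:
        all_geometric = False

print(sorted(children))
print(all_geometric)
```

Only the two columns in which the aligned parents differ affect the offspring, so the 512 masks yield just four distinct children.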

More theory on HC in the paper

• Extension to weighted edit distances

• Extension to block ins/del edit distances

• Peculiarity of metric segments in edit distance spaces

• Bounds on offspring size in terms of the parents' sizes

III. Is Biological Recombination Geometric?

Recombination at a molecular level

• DNA strands align on the contents, not positionally

• DNA strands are flexible: they can be stretched or folded to align better with each other

• DNA strands do not need to be aligned at the extremities

• Some pair matchings are preferred to others

• DNA strands can form loops

• Crossover points happen to be where DNA strands align better

• Not all details worked out yet!

Homologous Crossover as a Model of Biological Recombination

Homologous Crossover ↔ Biological Recombination

• Alignment on contents at minimum distance ↔ alignment on contents at minimum free energy

• Ins/del move ↔ frame-shift (one-base gap)

• Replacement move ↔ base mismatch

• Weighted move ↔ preferred matchings can be specified (a-t preferred to a-g)

• Block ins/del move ↔ preferences for loops, folds, and bigger gaps can be specified

• Transpositions/reversals ↔ subsequence transposition/reversal

Many possible variants of edit distance that fit many real requirements of biological recombination

“Minimum Free Energy” & Edit Distance

DNA strands align optimally according to edit distance because:

(i) The alignment of two DNA strands (macromolecules) obeys chemistry: it is the state at “minimum free energy”

(ii) The weights of the edit moves can be interpreted as repulsion forces at the single-base level

(iii) The best alignment under edit distance is the trade-off that minimizes the global effect of the repulsion forces: the “minimum free energy” alignment
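To make the role of the weights concrete, the sketch below computes a weighted edit distance in which each substitution has its own cost. The weights are hypothetical and chosen only to mirror the "a-t preferred to a-g" remark; they are not taken from the paper and are not physical free-energy values.

```python
from typing import Callable


def weighted_edit_distance(s: str, t: str,
                           sub_cost: Callable[[str, str], float],
                           indel_cost: float = 1.0) -> float:
    """Edit distance with a per-pair substitution cost and a flat ins/del cost."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + indel_cost,          # delete a
                           cur[j - 1] + indel_cost,       # insert b
                           prev[j - 1] + sub_cost(a, b)))  # match or substitute
        prev = cur
    return prev[-1]


def example_sub_cost(a: str, b: str) -> float:
    """Hypothetical weights: an a/t mismatch is cheaper than any other mismatch."""
    if a == b:
        return 0.0
    if {a, b} == {"a", "t"}:
        return 0.5
    return 1.0


def unit_sub_cost(a: str, b: str) -> float:
    """Unit weights recover the ordinary edit distance."""
    return 0.0 if a == b else 1.0


print(weighted_edit_distance("a", "t", example_sub_cost))
print(weighted_edit_distance("a", "g", example_sub_cost))
print(weighted_edit_distance("agcacaca", "acacacta", unit_sub_cost))
```

For the result to remain a distance in the metric sense, the weights should be symmetric, positive for distinct symbols, and satisfy the triangle inequality; the example weights above do.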

Is Biological Recombination Geometric? Yes?!

So what?

Bridging Natural and Artificial Evolution

• Bridging natural and artificial evolution into a common theoretical framework

• Change in perspective: this allows us to study real biological evolution as a computational process

• In the paper: we use geometric arguments to claim that biological evolution performs efficient adaptation!

Summary

• Geometric crossover
– Geometric crossover: offspring between parents
– Many recombinations are geometric
– Some general theory for geometric crossover

• Homologous crossover
– Homologous crossover for sequences: alignment on contents before recombination
– Homologous crossover is geometric under edit distance

• Biological Recombination
– Homologous crossover models biological recombination at DNA level, so it is geometric
– Geometric theory applies to biological recombination, bridging biological & artificial evolution

Questions?