Post on 20-Dec-2015
Geometric Crossover for Biological Sequences
Alberto Moraglio, Riccardo Poli
& Rolv Seehuus
EuroGP 2006
Contents
I. Geometric Crossover
II. Geometric Crossover for Sequences
III. Is Biological Recombination Geometric?
Geometric Crossover
• Representation-independent generalization of traditional crossover
• Informally: all offspring are between parents
• Search space: all offspring are on shortest paths connecting parents
Geometric Crossover & Distance
• Search space is a metric space: d(A,B) = length of the shortest path between A and B
• Metric space: all offspring C are in the segment between parents
• C ∈ [A,B]_d ⇔ d(A,C) + d(C,B) = d(A,B)
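The segment condition can be sketched as a small predicate. This is only an illustration: the function name `in_segment`, the tolerance parameter, and the absolute-difference metric used in the demo are assumptions, not part of the talk.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def in_segment(d: Callable[[T, T], float], a: T, b: T, c: T,
               tol: float = 1e-9) -> bool:
    """Return True if c lies in the metric segment [a, b] under distance d.

    The test is d(a, c) + d(c, b) == d(a, b), up to a small tolerance so
    that it also works for floating-point distances.
    """
    return abs(d(a, c) + d(c, b) - d(a, b)) <= tol


def abs_diff(x: int, y: int) -> int:
    """Toy metric for the demo: absolute difference on integers."""
    return abs(x - y)


print(in_segment(abs_diff, 2, 7, 5))  # True: 5 lies between 2 and 7
print(in_segment(abs_diff, 2, 7, 9))  # False: 9 lies outside [2, 7]
```

Any distance function can be passed as `d`, which mirrors the representation-independent nature of the definition.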
Example 1: Traditional Crossover
• Traditional Crossover is Geometric Crossover under Hamming Distance
Parent1: 011|101
Parent2: 010|111
Child: 011|111
HD(P1,C)+HD(C,P2)=HD(P1,P2)
1 + 1 = 2
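The example can be checked with a few lines of Python. This is a minimal sketch: the helper names `hamming` and `one_point_crossover` are illustrative, and the loop over every cut point is an extra check that is not on the slide.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance: number of positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def one_point_crossover(p1: str, p2: str, cut: int) -> str:
    """Take the head of p1 up to the cut point and the tail of p2 after it."""
    return p1[:cut] + p2[cut:]


p1, p2 = "011101", "010111"
child = one_point_crossover(p1, p2, 3)
print(child)                                        # 011111
print(hamming(p1, child), hamming(child, p2), hamming(p1, p2))  # 1 1 2

# Every cut point yields a child in the Hamming segment between the parents.
for cut in range(len(p1) + 1):
    c = one_point_crossover(p1, p2, cut)
    assert hamming(p1, c) + hamming(c, p2) == hamming(p1, p2)
```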
Example 2: Blending Crossover
• Blending Crossover for real vectors is geometric under Euclidean Distance
[Figure: child C on the line segment between parents P1 and P2]
ED(P1,C)+ED(C,P2)=ED(P1,P2)
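A small numerical check of this identity, assuming that "blending" means a convex combination of the two parents with a single coefficient `alpha` in [0, 1]. That reading, the function names, and the sample points are assumptions made for illustration.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length real vectors."""
    return math.dist(x, y)  # available since Python 3.8


def blend(p1, p2, alpha):
    """Convex combination alpha * p1 + (1 - alpha) * p2, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * a + (1 - alpha) * b for a, b in zip(p1, p2))


p1, p2 = (0.0, 0.0), (4.0, 2.0)
child = blend(p1, p2, random.random())

# The child lies on the straight line between the parents, so the two
# partial distances add up to the parent-to-parent distance (up to rounding).
lhs = euclidean(p1, child) + euclidean(child, p2)
print(math.isclose(lhs, euclidean(p1, p2)))
```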
Many Recombinations are Geometric
• Traditional Crossover for multary strings
• Box and Discrete recombinations for real vectors
• PMX, Cycle and Order Crossovers for permutations
• Homologous Crossover for GP trees
• Ask me for more examples over a coffee!
Being a geometric crossover is important because…
• We know how geometric crossover searches the space, for any representation: convex search
• We have a rule of thumb for the type of landscape on which geometric crossover performs well: “smooth” landscapes
• This is just the beginning of a general theory; in the future we will know more!
Sequences & Edit Distance
• Sequence: variable-length string of characters from an alphabet A
• Edit distance: minimum number of edit operations – insertion, deletion, substitution – to transform one sequence into the other
• A = {a,c,t,g}, seq1 = agcacaca, seq2 = acacacta
• Seq1=agcacaca → acacaca → acacacta=Seq2
• ED(Seq1,Seq2)=2 (g deleted, t inserted)
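The edit distance of the example can be computed with the standard dynamic-programming recurrence. This is a sketch with unit costs for insertion, deletion and substitution; the function name is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("agcacaca", "acacacta"))  # 2
```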
Sequence Alignment (on contents)
• Alignment: insert gaps (-) into both sequences so that they have the same length
Seq1’= agcacac-a
Seq2’= a-cacacta
• Alignment score: number of mismatches = 2
• Optimal alignment: alignment with minimal score
(Best Inexact Alignment on Contents)
• The score of the optimal alignment of two sequences equals their edit distance: ED(Seq1,Seq2)=Score(A)=2
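One way to obtain an optimal alignment is to fill the full edit-distance table and trace back through it. The sketch below assumes unit costs; the names `align` and `score` are illustrative. Several alignments can be optimal, so the one returned may differ from the one on the slide while having the same score.

```python
def align(s: str, t: str) -> tuple[str, str]:
    """Return one optimal unit-cost alignment of s and t, using '-' for gaps."""
    n, m = len(s), len(t)
    # D[i][j] = edit distance between s[:i] and t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    # Trace back from the bottom-right corner to recover the columns.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append("-"); i -= 1   # gap in t
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1   # gap in s
    return "".join(reversed(a)), "".join(reversed(b))


def score(a: str, b: str) -> int:
    """Alignment score: number of mismatching columns (gaps count as mismatches)."""
    return sum(x != y for x, y in zip(a, b))


a1, a2 = align("agcacaca", "acacacta")
print(a1)
print(a2)
print(score(a1, a2))  # 2, equal to the edit distance
```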
Homologous Crossover
1. Optimally align the two parent sequences
2. Randomly generate a crossover mask as long as the alignment
3. Recombine as in traditional crossover
4. Remove dashes from the offspring
Mask  = 111111000
Seq1’ = agcacac-a
Seq2’ = a-cacacta
SeqC’ = a-cacac-a
SeqC  = acacaca
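Steps 2 to 4 can be sketched as follows, taking the aligned parents from step 1 as given. The slides do not state the mask convention; reading the worked example, a mask bit of 1 appears to select the character from Seq2’ and 0 the character from Seq1’, and the code assumes that reading. The function names are illustrative.

```python
import random


def random_mask(length: int) -> str:
    """Step 2: a random binary mask as long as the alignment."""
    return "".join(random.choice("01") for _ in range(length))


def recombine_aligned(a1: str, a2: str, mask: str) -> tuple[str, str]:
    """Steps 3 and 4: mask-based recombination, then removal of dashes.

    Assumed convention (inferred from the worked example): mask bit '1'
    takes the character from a2, '0' takes it from a1.
    """
    if not (len(a1) == len(a2) == len(mask)):
        raise ValueError("aligned parents and mask must have equal length")
    child_aligned = "".join(y if bit == "1" else x
                            for x, y, bit in zip(a1, a2, mask))
    return child_aligned, child_aligned.replace("-", "")


a1, a2 = "agcacac-a", "a-cacacta"
child_aligned, child = recombine_aligned(a1, a2, "111111000")
print(child_aligned)  # a-cacac-a
print(child)          # acacaca

# A randomly generated mask works the same way.
print(recombine_aligned(a1, a2, random_mask(len(a1)))[1])
```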
Theorem: Geometricity of HC
• Homologous Crossover is geometric crossover under edit distance
Seq1=agcacaca → SeqC=acacaca → acacacta=Seq2
ED(Seq1,SeqC)+ED(SeqC,Seq2)=ED(Seq1,Seq2)
1 + 1 = 2
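For this particular pair of parents, the claim can be checked exhaustively by enumerating every crossover mask over the alignment shown earlier. This is a numerical sanity check on one example, not a proof; the helper names are illustrative and unit edit costs are assumed.

```python
from itertools import product


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def offspring(a1: str, a2: str, mask) -> str:
    """Recombine two aligned parents with a 0/1 mask and drop the dashes."""
    picked = (y if bit else x for x, y, bit in zip(a1, a2, mask))
    return "".join(ch for ch in picked if ch != "-")


seq1, seq2 = "agcacaca", "acacacta"
a1, a2 = "agcacac-a", "a-cacacta"  # the optimal alignment from the slides
total = edit_distance(seq1, seq2)

# All distinct offspring reachable from this alignment, over all 2^9 masks.
children = {offspring(a1, a2, mask) for mask in product((0, 1), repeat=len(a1))}

all_geometric = all(
    edit_distance(seq1, c) + edit_distance(c, seq2) == total for c in children
)
print(sorted(children))
print(all_geometric)  # True
```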
More theory on HC in the paper
• Extension to weighted edit distances
• Extension to block ins/del edit distances
• Peculiarity of metric segments in edit distance spaces
• Bounds on offspring size due to parents size
Recombination at a molecular level
• DNA strands align on contents, not positionally
• DNA strands are flexible: they can be stretched or folded to align better with each other
• DNA strands do not need to be aligned at the extremities
• Some pair matchings are preferred to others
• DNA strands can form loops
• Crossover points tend to occur where DNA strands align better
• Not all details worked out yet!
Homologous Crossover as a Model of Biological Recombination
Homologous Crossover | Biological Recombination
• Alignment on contents @ minimum distance | • Alignment on contents @ minimum free energy
• Ins/del move | • Frame-shift (one base gap)
• Replacement move | • Base mismatch
• Weighted move | • Preferred matchings can be specified (a-t preferred to a-g)
• Block ins/del move | • Preferences for loops, folds, bigger gaps can be specified
• Transpositions/reversals | • Subsequence transp./reversal
Many possible variants of edit distance that fit many real requirements of biological recombination
“Minimum Free Energy” & Edit Distance
DNA strands align optimally according to edit distance because:
(i) The alignment of two DNA strands (macromolecules) obeys chemistry: it is the state at “minimum free energy”
(ii) The weights of the edit moves can be interpreted as repulsion forces at the single-base level
(iii) The best alignment on edit distance is the best trade-off for which the global effect of repulsion forces is minimized: the “minimum free energy” alignment
Bridging Natural and Artificial Evolution
• Natural and artificial evolution are brought into a common theoretical framework
• Change in perspective: this allows real biological evolution to be studied as a computational process
• In the paper: we use geometric arguments to claim that biological evolution does efficient adaptation!
Summary
• Geometric crossover
  – Geometric crossover: offspring between parents
  – Many recombinations are geometric
  – Some general theory for geometric crossover
• Homologous crossover
  – Homologous crossover for sequences: alignment on contents before recombination
  – Homologous crossover is geometric under edit distance
• Biological Recombination
  – Homologous crossover models biological recombination at DNA level, so it is geometric
  – Geometric theory applies to biological recombination, bridging biological & artificial evolution