Symbolic Regression via Genetic Programming
AI Project #2
Biointelligence Lab
Cho, Dong-Yeon
© 2005 SNU CSE Biointelligence Lab
Example (1/2)
Data
  Relationship between A and P

  A       P
  0.39    0.24
  0.72    0.61
  1.00    1.00
  1.52    1.84
  5.20    11.9
  9.53    29.4
  19.1    83.5
Example (2/2)
Kepler’s Third Law
  Square of any planet's orbital period (sidereal) is proportional to cube of its mean distance (semi-major axis) from Sun

  Planet     A       P
  Mercury    0.39    0.24
  Venus      0.72    0.61
  Earth      1.00    1.00
  Mars       1.52    1.84
  Jupiter    5.20    11.9
  Saturn     9.53    29.4
  Uranus     19.1    83.5
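The proportionality can be checked directly on the table. The following is a minimal sketch, assuming the table uses units in which Earth has A = 1 and P = 1, so that the constant c in P^2 = cA^3 should come out close to 1. The names `PLANETS` and `kepler_ratio` are illustrative and not part of the original slides.

```python
# (A, P) pairs copied from the table: mean distance and orbital period.
PLANETS = {
    "Mercury": (0.39, 0.24),
    "Venus": (0.72, 0.61),
    "Earth": (1.00, 1.00),
    "Mars": (1.52, 1.84),
    "Jupiter": (5.20, 11.9),
    "Saturn": (9.53, 29.4),
    "Uranus": (19.1, 83.5),
}


def kepler_ratio(a, p):
    """Return P^2 / A^3, which Kepler's third law says is one constant c."""
    return p ** 2 / a ** 3


ratios = {name: kepler_ratio(a, p) for name, (a, p) in PLANETS.items()}

for name, ratio in ratios.items():
    print(f"{name:8s} P^2 / A^3 = {ratio:.3f}")
```

Every ratio lands within a few percent of 1, which is the regularity a symbolic regression run is expected to rediscover from the raw numbers.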
Evolving the Programs (1/2)
Koza’s Algorithm
1. Choose a set of possible functions and terminals for the program.
   F = {+, -, *, /, sqrt}, T = {A}
2. Generate an initial population of random trees (programs) using the set of possible functions and terminals.
3. Calculate the fitness of each program in the population by running it on a set of “fitness cases” (a set of inputs for which the correct output is known).
4. Apply selection, crossover, and mutation to the population to form a new population.
5. Repeat steps 3 and 4 for some number of generations.
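The five steps can be sketched end to end on the Kepler data. This is a minimal illustration under stated assumptions, not the reference implementation for the project: trees are nested tuples, the error is a plain sum of squared errors, selection is a size-3 tournament, the best tree is copied unchanged into the next generation (elitism), and children above `max_nodes` are rejected. All function names, probabilities, and default parameters below are hypothetical choices.

```python
import math
import operator
import random

# Fitness cases: (A, P) pairs from the Kepler example.
DATA = [(0.39, 0.24), (0.72, 0.61), (1.00, 1.00), (1.52, 1.84),
        (5.20, 11.9), (9.53, 29.4), (19.1, 83.5)]


def pdiv(a, b):
    """Protected division: return 1.0 when the divisor is (almost) zero."""
    return a / b if abs(b) > 1e-9 else 1.0


def psqrt(a):
    """Protected square root: take the root of the absolute value."""
    return math.sqrt(abs(a))


# Step 1: function set F (name -> (arity, callable)) and terminal set T.
FUNCS = {"+": (2, operator.add), "-": (2, operator.sub),
         "*": (2, operator.mul), "/": (2, pdiv), "sqrt": (1, psqrt)}
TERMS = ["A"]


def random_tree(rng, depth):
    """Grow a random tree: a nested tuple (name, child, ...) or the leaf 'A'."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMS)
    name = rng.choice(list(FUNCS))
    arity = FUNCS[name][0]
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))


def evaluate(tree, a):
    if tree == "A":
        return a
    _, fn = FUNCS[tree[0]]
    return fn(*(evaluate(child, a) for child in tree[1:]))


def error(tree):
    """Step 3: sum of squared errors over the fitness cases (lower is better)."""
    try:
        total = sum((p - evaluate(tree, a)) ** 2 for a, p in DATA)
    except OverflowError:
        return float("inf")
    return total if math.isfinite(total) else float("inf")


def paths(tree, prefix=()):
    """Yield the index path to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def put(tree, path, sub):
    """Return a copy of tree with the node at path replaced by sub."""
    if not path:
        return sub
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], sub),) + tree[i + 1:]


def crossover(rng, mum, dad):
    """Subtree crossover: a random subtree of dad replaces one in mum."""
    donor = get(dad, rng.choice(list(paths(dad))))
    return put(mum, rng.choice(list(paths(mum))), donor)


def mutate(rng, tree):
    """Subtree mutation: replace a random node with a fresh random subtree."""
    return put(tree, rng.choice(list(paths(tree))), random_tree(rng, 2))


def tournament(rng, scored, q=3):
    return min(rng.sample(scored, q), key=lambda s: s[0])[1]


def evolve(seed=0, pop_size=200, generations=20, max_nodes=60):
    rng = random.Random(seed)
    pop = [random_tree(rng, 4) for _ in range(pop_size)]      # step 2
    history = []
    for _ in range(generations):                               # step 5
        scored = [(error(t), t) for t in pop]                  # step 3
        best = min(scored, key=lambda s: s[0])
        history.append(best[0])
        pop = [best[1]]                                        # elitism (assumption)
        while len(pop) < pop_size:                             # step 4
            child = crossover(rng, tournament(rng, scored), tournament(rng, scored))
            if rng.random() < 0.1:
                child = mutate(rng, child)
            if sum(1 for _ in paths(child)) <= max_nodes:
                pop.append(child)
    return best, history


if __name__ == "__main__":
    (best_error, best_tree), _ = evolve()
    print(best_error, best_tree)
```

Because the best tree is carried over unchanged, the best error recorded in `history` never increases from one generation to the next, which makes the learning curve easy to read.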
Evolving Lisp Programs (2/2)
Kepler’s Third Law: $P^2 = cA^3$

FORTRAN
      PROGRAM ORBITAL_PERIOD
C     Mars
      A = 1.52
      P = SQRT(A * A * A)
      PRINT *, P
      END PROGRAM ORBITAL_PERIOD

LISP
  (defun orbital_period ()
    ; Mars
    (setf A 1.52)
    (sqrt (* A (* A A))))

Parse tree
  (Figure: the tree form of (sqrt (* A (* A A))), with sqrt at the root)
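The parse tree of the Lisp expression can be written down and evaluated in a few lines. This is a minimal sketch, assuming a nested-tuple encoding `(operator, child, ...)` that is chosen here for illustration; `TREE`, `OPS`, `evaluate`, and `to_lisp` are hypothetical names.

```python
import math

# Parse tree of (sqrt (* A (* A A))) as nested tuples: (operator, child, ...).
TREE = ("sqrt", ("*", "A", ("*", "A", "A")))

OPS = {
    "sqrt": math.sqrt,
    "*": lambda a, b: a * b,
}


def evaluate(node, env):
    """Recursively evaluate a tree; leaves are variable names or numbers."""
    if isinstance(node, tuple):
        op, *children = node
        return OPS[op](*(evaluate(child, env) for child in children))
    if isinstance(node, str):
        return env[node]
    return node


def to_lisp(node):
    """Print the tree back in Lisp prefix notation."""
    if isinstance(node, tuple):
        return "(" + " ".join(to_lisp(part) for part in node) + ")"
    return str(node)


print(to_lisp(TREE))
print(evaluate(TREE, {"A": 1.52}))  # orbital period predicted for Mars
```

Crossover and mutation in GP operate on exactly this kind of structure: they cut out and replace whole subtrees, so the result is always a syntactically valid program.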
Symbolic Regression by GP
Objective
  Find the function f for the given data (x, y): $y = f(x)$
Data Sets
  Set 1 and 2: 11 pairs
  Set 3: 50 pairs
Functions and Terminals
Functions
  Numerical operators: {+, -, *, /, exp, log, sin, cos, sqrt}
  Some operators must be protected against illegal operations (for example, division by zero, or the logarithm or square root of a negative number).
Terminals
  Input and constants: {x, R}, where R ∈ [a, b]
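One common way to protect the operators is shown below. This is a sketch under assumptions: the fallback values (1.0 for division by zero, 0.0 for log of zero, absolute value for log and sqrt, a clamped argument for exp) follow a widely used GP convention and are not prescribed by the slides. The default range for the constant R is also a placeholder.

```python
import math
import random


def pdiv(a, b):
    """Protected division: return 1.0 when the divisor is (almost) zero."""
    return a / b if abs(b) > 1e-12 else 1.0


def plog(a):
    """Protected logarithm: log of |a|, and 0.0 when a is exactly zero."""
    return math.log(abs(a)) if a != 0 else 0.0


def psqrt(a):
    """Protected square root: root of the absolute value."""
    return math.sqrt(abs(a))


def pexp(a):
    """Protected exponential: clamp the argument so exp cannot overflow."""
    return math.exp(min(a, 700.0))


def random_constant(rng, a=-1.0, b=1.0):
    """Draw the terminal R uniformly from [a, b] (range is an assumption)."""
    return rng.uniform(a, b)
```

With these wrappers every randomly generated tree returns a number for every input, so a single illegal operation cannot abort a whole run.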
Initialization
Maximum initial depth of trees, Dmax, is set.
Full method (each branch has depth = Dmax):
  nodes at depth d < Dmax are randomly chosen from the function set F
  nodes at depth d = Dmax are randomly chosen from the terminal set T
Grow method (each branch has depth ≤ Dmax):
  nodes at depth d < Dmax are randomly chosen from F ∪ T
  nodes at depth d = Dmax are randomly chosen from T
Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
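The two methods differ only in what may be chosen above the depth limit. The sketch below assumes the root is at depth 0, a small illustrative function set, and a ramp of depth limits from 2 up to Dmax; the ramp is a common convention rather than something stated on the slide, and all names are hypothetical.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2, "sqrt": 1}  # name -> arity
TERMINALS = ["x", 1.0]


def full(rng, dmax, depth=0):
    """Full method: functions at every depth below dmax, terminals at dmax."""
    if depth == dmax:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name,) + tuple(full(rng, dmax, depth + 1)
                           for _ in range(FUNCTIONS[name]))


def grow(rng, dmax, depth=0):
    """Grow method: choose from F union T below dmax, terminals at dmax."""
    if depth == dmax:
        return rng.choice(TERMINALS)
    choice = rng.choice(list(FUNCTIONS) + TERMINALS)
    if not (isinstance(choice, str) and choice in FUNCTIONS):
        return choice  # a terminal ends this branch early
    return (choice,) + tuple(grow(rng, dmax, depth + 1)
                             for _ in range(FUNCTIONS[choice]))


def tree_depth(tree):
    """Depth of a tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


def ramped_half_and_half(rng, pop_size, dmax):
    """Alternate full and grow while cycling depth limits 2..dmax (dmax >= 2)."""
    population = []
    for i in range(pop_size):
        limit = 2 + (i // 2) % (dmax - 1)
        method = full if i % 2 == 0 else grow
        population.append(method(rng, limit))
    return population
```

Full trees are bushy and all of the same depth, while grow trees vary in shape; mixing both gives the initial population more structural diversity.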
Fitness Functions
Relative Squared Error
  $$\mathrm{Fitness} = \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}(x_i)}{y_i} \right)^2$$
The number of outputs that are within a given percentage of the correct value
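Both fitness measures are short to write. The sketch below reads the relative squared error as the sum over all cases of ((y_i - f(x_i)) / y_i)^2, which assumes that no target value y_i is zero, and treats the tolerance of the second measure as a free parameter. The Kepler pairs and the candidate function A^1.5 are used only as an illustration.

```python
def relative_squared_error(f, data):
    """Sum over all cases of ((y - f(x)) / y)^2; assumes every y is non-zero."""
    return sum(((y - f(x)) / y) ** 2 for x, y in data)


def hits(f, data, tolerance_percent):
    """Count the outputs that lie within tolerance_percent % of the target."""
    tol = tolerance_percent / 100.0
    return sum(1 for x, y in data if abs(f(x) - y) <= tol * abs(y))


# Illustration: (A, P) pairs from the Kepler example and a candidate model.
KEPLER = [(0.39, 0.24), (0.72, 0.61), (1.00, 1.00), (1.52, 1.84),
          (5.20, 11.9), (9.53, 29.4), (19.1, 83.5)]


def candidate(a):
    return a ** 1.5


print(relative_squared_error(candidate, KEPLER))
print(hits(candidate, KEPLER, tolerance_percent=5))
```

The first measure is an error to minimize, while the second is a count to maximize, so a selection scheme has to be told which direction is better.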
Selection (1/2)
Fitness proportional (roulette wheel) selection
The roulette wheel can be constructed as follows.
  Calculate the total fitness for the population.
    $$F = \sum_{k=1}^{\mathrm{POP\_SIZE}} f(v_k)$$
  Calculate selection probability p_k for each chromosome v_k.
    $$p_k = \frac{f(v_k)}{F}, \quad k = 1, 2, \ldots, \mathrm{POP\_SIZE}$$
  Calculate cumulative probability q_k for each chromosome v_k.
    $$q_k = \sum_{j=1}^{k} p_j, \quad k = 1, 2, \ldots, \mathrm{POP\_SIZE}$$
Procedure: Proportional_Selection
  Generate a random number r from the range [0, 1].
  If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ POP_SIZE) such that q_{k-1} < r ≤ q_k.

  $$F = \sum_{k=1}^{\mathrm{POP\_SIZE}} f(v_k) = 0.036441$$

  k     p_k         q_k
  1     0.082407    0.082407
  2     0.110652    0.193059
  3     0.131931    0.324989
  4     0.121423    0.446412
  5     0.072597    0.519009
  6     0.128834    0.647843
  7     0.077959    0.725802
  8     0.102013    0.827816
  9     0.083663    0.911479
  10    0.088521    1.000000
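The wheel construction and the selection rule q_{k-1} < r ≤ q_k map onto a cumulative sum and a binary search. This is a minimal sketch, assuming fitness values are non-negative and that larger means better; an error measure would first have to be converted, for example by inverting it. The function names are illustrative.

```python
import bisect
import itertools
import random


def build_wheel(fitnesses):
    """Return (p, q): selection and cumulative probabilities per chromosome."""
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    cumulative = list(itertools.accumulate(probs))
    return probs, cumulative


def spin(cumulative, r):
    """Return the 0-based index k such that q[k-1] < r <= q[k]."""
    k = bisect.bisect_left(cumulative, r)
    return min(k, len(cumulative) - 1)  # guard against rounding in the last q


def proportional_selection(population, fitnesses, rng=random):
    """Select one chromosome with probability proportional to its fitness."""
    _, cumulative = build_wheel(fitnesses)
    return population[spin(cumulative, rng.random())]


if __name__ == "__main__":
    probs, q = build_wheel([1.0, 2.0, 3.0, 4.0])
    print(probs, q, spin(q, 0.25))
```

The binary search makes each draw O(log POP_SIZE) once the cumulative probabilities have been built, which matters when many parents are drawn from the same wheel.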
Selection (2/2)
Tournament selection
  Tournament size q
Ranking-based selection
  2 ≤ μ ≤ POP_SIZE
  1 ≤ η+ ≤ 2 and η− = 2 − η+
  $$p_i = \frac{1}{\mu} \left( \eta^{+} - (\eta^{+} - \eta^{-}) \frac{i - 1}{\mu - 1} \right)$$
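Both schemes can be sketched briefly. The code assumes that fitness is an error (lower is better), that tournament contestants are drawn without replacement, and that ranking uses the standard linear ranking formula p_i = (1/μ)(η+ − (η+ − η−)(i − 1)/(μ − 1)) with rank 1 as the best individual and η− = 2 − η+. The function names and the default η+ = 1.5 are illustrative choices.

```python
import random


def tournament_selection(population, errors, q, rng):
    """Draw q distinct individuals and return the one with the lowest error."""
    contestants = rng.sample(range(len(population)), q)
    winner = min(contestants, key=lambda i: errors[i])
    return population[winner]


def linear_ranking_probabilities(mu, eta_plus=1.5):
    """Selection probabilities for ranks 1..mu, where rank 1 is the best."""
    if not 1.0 <= eta_plus <= 2.0:
        raise ValueError("eta_plus must lie in [1, 2]")
    eta_minus = 2.0 - eta_plus
    return [(eta_plus - (eta_plus - eta_minus) * (i - 1) / (mu - 1)) / mu
            for i in range(1, mu + 1)]


def ranking_selection(population, errors, eta_plus, rng):
    """Rank by error, then sample one individual with the ranking probabilities."""
    ranked = sorted(range(len(population)), key=lambda i: errors[i])
    probs = linear_ranking_probabilities(len(ranked), eta_plus)
    return population[rng.choices(ranked, weights=probs, k=1)[0]]
```

Neither scheme depends on the absolute fitness values, only on their order, so both work unchanged with an error measure and are insensitive to a few extremely bad individuals. A larger tournament size q or a larger η+ increases the selection pressure.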
GP Flowchart
(Figure: flowcharts of the GA loop and the GP loop)
Bloat
Bloat = “survival of the fattest”, i.e., the tree sizes in the population are increasing over time
Ongoing research and debate about the reasons
Needs countermeasures, e.g.
  Prohibiting variation operators that would deliver “too big” children
  Parsimony pressure: penalty for being oversized
    $$\mathrm{Fitness} = \mathrm{Error} + \mathrm{Penalty}(\#N, \#D) = \mathrm{RSE} + C(\#N, \#D)$$
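A parsimony penalty can be as simple as a weighted size term added to the error. The sketch below assumes that #N is the number of nodes and #D the depth of the tree, and that the penalty is linear in both; the slide does not fix the form of the penalty, and the weights `c_nodes` and `c_depth` are hypothetical.

```python
def node_count(tree):
    """Number of nodes (#N) in a nested-tuple tree (operator, child, ...)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(node_count(child) for child in tree[1:])


def tree_depth(tree):
    """Depth (#D) of the tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


def penalized_fitness(error, tree, c_nodes=0.01, c_depth=0.0):
    """Error plus a linear size penalty (the linear form is an assumption)."""
    return error + c_nodes * node_count(tree) + c_depth * tree_depth(tree)


small = ("*", "x", "x")
big = ("+", small, ("-", small, small))  # evaluates to the same values as small

print(penalized_fitness(0.5, small), penalized_fitness(0.5, big))
```

With equal error the smaller tree receives the better (lower) fitness, which is the pressure that counteracts bloat. Comparing runs with the weights set to zero and to a positive value is one way to study the effect of the penalty term.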
Experiments
At least three problems (+ your own data)
Various experimental setups
  Termination condition: maximum_generation
  2 models × 3 settings × 20 runs
    Models: polynomial and general
    Effects of the penalty term
    Selection methods and their parameters
    Crossover p_c and mutation p_m
Results
For each problem
  Result table and your analysis
  Present the optimal function.
    Readable form and predicted function graph with data
  Draw a learning curve for the run where the best solution was found.
    You can draw all learning curves in one plot.

               Polynomial                        General
               Average   SD   Best   Worst       Average   SD   Best   Worst
  Setting 1
  Setting 2
  Setting 3
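Each cell group of the table summarizes the final errors of the runs for one model and one setting. A minimal sketch of that summary follows, assuming that lower error is better and that SD means the sample standard deviation; the numbers in the usage line are placeholders, not experimental results.

```python
import statistics


def summarize(final_errors):
    """Return the four table entries for the final errors of repeated runs."""
    return {
        "Average": statistics.mean(final_errors),
        "SD": statistics.stdev(final_errors),  # sample SD (n - 1 denominator)
        "Best": min(final_errors),             # lowest error
        "Worst": max(final_errors),            # highest error
    }


# Placeholder values standing in for the final errors of three runs.
print(summarize([1.0, 2.0, 3.0]))
```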
(Figure: learning curve. x-axis: Generation; y-axis: Fitness (Error))
References
Source Codes
  GP libraries (C, C++, Java, …)
  MATLAB toolbox
Web sites
  http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
  http://cs.gmu.edu/~eclab/projects/ecj/
  http://www.geneticprogramming.com/GPpages/software.html
  …
Pay Attention!
Due: May 3, 2005
Submission
  Source code and executable file(s)
    Proper comments in the source code
    Via e-mail
  Report: Hardcopy!!
    Running environments
    Results for many experiments with various parameter settings
    Analysis and explanation about the results in your own way