Hadoop and MarkLogic: using the genetic algorithm to generate source code
Results of an experimental approach that uses MarkLogic and Hadoop to generate source code with map-reduce methods.
MarkLogic and Hadoop – Genetic Algorithm
Jim Fuller
email: [email protected]  twitter: @xquery
Senior Engineer, Europe
19/09/12
James Fuller
http://exslt.org
http://www.xmlprague.cz
http://jim.fuller.name
@xquery
@perl6
XSLT UK 2001
Senior engineer
Overview
• Genetic Algorithm Refresher
• MarkLogic/Hadoop architecture for implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary
What's the Problem?
• Big data breathes life into older algorithmic approaches
• I thought it would be interesting to turn the ‘bigdata’ problem on its head (code versus data)
• Demonstrate Hadoop with MarkLogic, each playing to the other's strengths
Get out of your comfort zone
• This talk is slightly different than the description … 150 slides! Part I.
• It's got Hadoop/MarkLogic and the genetic algorithm, but I have focused on the process and early results
• Doing data science means pushing yourself out of your comfort zone
• Start simple, then iterate
Genetic Algorithm Refresher
• The Genetic Algorithm ( GA ) is a model of the evolution of a population of artificial individuals emulating Darwinian Selection.
• Each individual is a chromosome which contains discrete units of information (genes).
• The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem
Abridged Genetic Algorithm
• The Fundamental Theorem of Genetic Algorithms
M(H, t): # of individuals in population 't' with the schema 'H'.
f(H): average fitness of the individuals with the schema 'H'.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
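The formula itself did not survive in this transcript. In the notation defined above, the schema theorem is usually stated as the lower bound below. This is the standard textbook form, given on the assumption that it is what the slide showed:

```latex
M(H, t+1) \;\ge\; M(H, t)\,\frac{f(H)}{F}\,\bigl(1 - p_1 - p_2\bigr)
```

Read it as: a schema whose average fitness f(H) is above the population average F, and which is rarely destroyed by crossover or mutation, is expected to appear in more individuals in the next generation.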
GA operations
• Reproduction: An individual is perfectly replicated to a new population
• Crossover ( Recombination ): Parental material is recombined to create offspring to join new population
• Mutation: random changes (is key for pushing past local optima)
• Permutation: reordering
• Editing: evaluation to a terminal
• Encapsulation: single indivisible function
• Decimation: removal of individuals
Typical GA Process
Step 0. Create a random initial population of individuals
Step 1. Evaluate the fitness of each individual
Step 2. Select individuals according to their fitness, which will participate in generating offspring (moms+dads)
Step 3. Apply primary and secondary genetic operations to generate new offspring population
Step 4. Repeat steps 1, 2 and 3 for X generations
Step 5. Choose the fittest individual of the last generation based on the stop criteria
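The steps above can be sketched as a small loop. This is a minimal illustration on a toy bit-list problem, not the MarkLogic/Hadoop implementation. The names `run_ga`, `breed` and `zeros`, the top-half selection rule and the 10% mutation rate are all assumptions made for the example.

```python
import random

def run_ga(population, fitness, breed, generations, rng):
    """Steps 0-5 with lower fitness = better. The selection scheme is an assumption."""
    population = list(population)                      # Step 0: initial population
    for _ in range(generations):                       # Step 4: repeat
        ranked = sorted(population, key=fitness)       # Step 1: evaluate fitness
        parents = ranked[: max(2, len(ranked) // 2)]   # Step 2: select (top half)
        offspring = [breed(rng.choice(parents), rng.choice(parents), rng)
                     for _ in range(len(population) - 1)]  # Step 3: genetic operations
        population = [ranked[0]] + offspring           # reproduction of the current best
    return min(population, key=fitness)                # Step 5: fittest individual

# Toy problem (hypothetical): evolve a list of bits towards all ones.
def zeros(bits):
    return bits.count(0)

def breed(mom, dad, rng):
    cut = rng.randrange(1, len(mom))
    child = mom[:cut] + dad[cut:]          # one-point crossover, builds a new list
    if rng.random() < 0.1:                 # occasional mutation
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]
    return child

rng = random.Random(42)
start = [[rng.randint(0, 1) for _ in range(12)] for _ in range(30)]
best = run_ga(start, zeros, breed, generations=40, rng=rng)
print(best, zeros(best))
```

Because the best individual is copied unchanged into each new generation, the best fitness can never get worse from one generation to the next.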
Endemic GA Problems
• Finding the optimal solution to complex, high-dimensional, multimodal problems often requires a very expensive fitness function
• It is hard to pose the problem statement, e.g. the stop criteria are not clear in every problem
• Premature convergence on local optima
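Bit strings vs Lisp Parse Trees
(+ (* 2 3) 4) evaluates to 10 and its symbolic expression looks like:
[Figure: parse tree with + at the root; its children are the subtree (* 2 3) and the terminal 4]
Manipulating hierarchical computer programs is more expressive than manipulating linear strings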
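To make the parse-tree idea concrete, here is a small sketch that stores an expression as nested tuples and evaluates it recursively. It assumes the slide's expression is (+ (* 2 3) 4), the reading that gives 10. The tuple encoding and the `evaluate` helper are illustrative choices, not part of the talk.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a nested-tuple parse tree such as ("+", ("*", 2, 3), 4)."""
    if not isinstance(tree, tuple):
        return tree                      # a terminal: just a number
    op, *args = tree
    values = [evaluate(arg) for arg in args]
    result = values[0]
    for value in values[1:]:
        result = OPS[op](result, value)
    return result

expr = ("+", ("*", 2, 3), 4)
print(evaluate(expr))
```

Swapping one subtree for another always yields another well-formed tree, which is what makes this representation convenient for crossover and mutation.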
XSLT – markup is useful!
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="a">
    <d/>
    <c/>
  </xsl:template>
</xsl:stylesheet>
[Figure: the same stylesheet drawn as a tree: <xsl:stylesheet/> contains <xsl:template/>, which contains <d/> and <c/>]
Obvious difficulties to address: different node types and XPath
Objective: Generate an XSLT program that transforms the source XML into a result XML that is equivalent to the target XML
Terminal set: <a/> <b/> <c/> <d/>
Function set: Subset of XSLT instructions
Fitness cases: One fitness case
Raw fitness: treediffmerge result, node count + standard diff
Standardized fitness: Same as raw fitness; approaching 0 is better fitness
Parameters: M=500 (population size), G=51 (generations)
Source XML
<a>
  <b>
    <c>
      <d></d>
    </c>
  </b>
</a>
Target XML – clear stop criteria
<a>
  <b>
    <c>
      <d></d>
    </c>
  </b>
</a>
Generation zero
• XML Instance Generator, which is part of the Sun Multi-Schema Validator
• The following can do it
  – OxygenXML
  – Visual Studio
  – Eclipse
• Ended up using IBM XML Generate – very old; supply it a schema and it generates example XML
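The talk used a schema-driven instance generator for generation zero. As a much simpler stand-in, the sketch below builds random single-template stylesheets whose content is drawn from the terminal set a, b, c, d. The helpers `random_literal` and `random_stylesheet`, the depth limit and the population of five are assumptions for illustration only.

```python
import random
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL_NS)
TERMINALS = ["a", "b", "c", "d"]  # terminal set from the problem statement

def random_literal(rng, depth):
    """Random literal result tree built from the terminal set."""
    node = ET.Element(rng.choice(TERMINALS))
    if depth > 1:
        for _ in range(rng.randint(0, 2)):
            node.append(random_literal(rng, depth - 1))
    return node

def random_stylesheet(rng, depth=3):
    """One individual: a stylesheet with a single template of random content."""
    sheet = ET.Element(f"{{{XSL_NS}}}stylesheet", {"version": "2.0"})
    template = ET.SubElement(sheet, f"{{{XSL_NS}}}template", {"match": "/"})
    template.append(random_literal(rng, depth))
    return sheet

rng = random.Random(0)
generation_zero = [random_stylesheet(rng) for _ in range(5)]
print(ET.tostring(generation_zero[0], encoding="unicode"))
```

A schema-driven generator covers far more of the XSLT instruction set than this does. The point here is only that generation zero is a set of random but well-formed programs.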
Step 1a: Evaluate against Input
[Diagram: each xslt in the XSLT generation is applied to Source.xml; the transformation produces result.xml]
MarkLogic evaluates each XSLT and stores the result in a property of the XSLT document itself
Step 1b: Evaluate Fitness
[Diagram: each xslt in the XSLT generation is applied to Source.xml; the transformation produces result.xml, and Hadoop evaluates its fitness]
Fitness is computed with treediffmerge + standard diff
XML Diff issues
• Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison
• XML has a structure; text-based diff programs do not take this into account
• Simple example: <footie/> versus <footie></footie>; logically these are equal
• XML canonization helps!
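The <footie/> example can be checked directly. This sketch uses `canonicalize` from Python's standard `xml.etree.ElementTree` module (Python 3.8 or later), which is an assumption of convenience; the talk does not say which canonicalizer it used.

```python
from xml.etree.ElementTree import canonicalize

short_form = "<footie/>"
long_form = "<footie></footie>"

# A plain text comparison treats the two spellings as different documents.
print(short_form == long_form)

# After canonicalization both spellings serialize identically.
print(canonicalize(short_form) == canonicalize(long_form))
```

Running the canonicalizer on both the result and the target before diffing removes this kind of purely lexical noise from the fitness score.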
TREEDIFFMERGE DIFFERENCE RESULTS
<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:insert dst="1">
    <a>
      <b>
        <c>
          <d />
        </c>
      </b>
    </a>
  </diff:insert>
</diff>

<?xml version="1.0" encoding="UTF-8"?>
<root/>

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:copy src="2" dst="1">
    <diff:copy src="16" dst="2" />
  </diff:copy>
</diff>

<?xml version="1.0" encoding="utf-8"?>
<root><a/><a><a><c/><c><a><d/></a><c/></c></a><b><b/><a/><c/><b> <c> <d/> </c> </b></b><a/></a><d><a><c/><a/><a/></a><c/></d><c/></root>
XML Canonize + TreeDiffMerge
Simple: if we match, we are done!
<?xml version="1.0" encoding="UTF-8"?>
<diff />

<?xml version="1.0" encoding="utf-8"?>
<root><a> <b> <c> <d/> </c> </b></a></root>
MarkLogic/Hadoop Architecture
Interlude
[Diagram: MarkLogic reached through the Connector API via XDBC, shown from Hadoop's point of view]
Hadoop Installation Recipe
• installing Hadoop (setting up a single node cluster)
  – brew install hadoop
  – make sure ssh is set up properly
  – generate id_rsa and id_rsa.pub
  – append pub to auth keys
    • cat id_rsa.pub >> authorized_keys
  – enable remote on mac osx
• configure hadoop
  – edit core-site.xml
  – edit mapred-site.xml
• ssh localhost
  – format hdfs
    • hadoop namenode -format
• bin/start-all.sh
  – if it asks for a password, you have a problem with your ssh setup
• to check that all is well
  – run jps
  – ps ax | grep hadoop | wc -l
  – Check
    • http://localhost:50030/jobtracker.jsp
    • http://localhost:50060/tasktracker.jsp
    • http://localhost:50070/dfshealth.jsp
Installing ML Hadoop Connector
• Copy latest xcc and connector jars to hadoop lib
• Copy ml-examples jar as well
• Copy ml hadoop conf to hadoop conf
Starting it all Up
• Start MarkLogic
• Create database
• Create xdbc connection (how hadoop/ml communicate)
• Edit marklogic-hello-world.xml
• Make sure hadoop is started
Starting it all Up
• Load test Data via query console
xquery version "1.0-ml";
let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>
return (
  xdmp:document-insert("hello.xml", $hello),
  xdmp:document-insert("world.xml", $world)
)
Run hello world example
• bin/start-all.sh
• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld
• Review https://gist.github.com/2484318
Fitness (hadoop) step
• Applies XML canonization
• Performs treediffmerge, outputs and writes to original xslt document xml property
• Performs text diff and writes to original xslt document xml property
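The scoring idea in this step can be imitated without treediffmerge. In the sketch below, treediffmerge is replaced by a plain sequence diff over start and end tag events from Python's `difflib`, so it is a stand-in technique and not the tool used in the talk. `tag_sequence` and `raw_fitness` are hypothetical helper names.

```python
import difflib
import xml.etree.ElementTree as ET

def tag_sequence(xml_text):
    """Flatten a document into start/end tag events, ignoring whitespace text."""
    events = []
    def walk(node):
        events.append("<" + node.tag + ">")
        for child in node:
            walk(child)
        events.append("</" + node.tag + ">")
    walk(ET.fromstring(xml_text))
    return events

def raw_fitness(result_xml, target_xml):
    """Count tag events that differ between result and target; 0 means a match."""
    a, b = tag_sequence(result_xml), tag_sequence(target_xml)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal")

target = "<a><b><c><d></d></c></b></a>"
print(raw_fitness("<a><b><c><d/></c></b></a>", target))  # same structure as the target
print(raw_fitness("<root/>", target))                    # nothing in common with the target
```

Because the score reaches 0 only on a structural match, it can be used directly as a standardized fitness with a clear stop criterion.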
Step 2. Select individuals
• Probabilistic selection to choose which individuals participate in genetic operation
Selected XSLT population
Select individuals for genetic operations, based on their fitness
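One common way to do probabilistic selection is fitness-proportionate (roulette wheel) sampling. The sketch below assumes standardized fitness, where lower is better, and converts it to a weight with 1 / (1 + s). The function name `select` and the sample data are made up for illustration.

```python
import random

def select(population, standardized, k, rng):
    """Draw k individuals; a lower standardized fitness gives a higher chance."""
    weights = [1.0 / (1.0 + s) for s in standardized]
    return rng.choices(population, weights=weights, k=k)

rng = random.Random(0)
picks = select(["good", "bad"], [0, 99], 1000, rng)
print(picks.count("good"), picks.count("bad"))
```

Poor individuals still have a small chance of being picked, which keeps some diversity in the population.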
About fitness
• Raw fitness: is the natural representation in terms of the specific problem (primitive counting nodes of treediffmerge patch)
• Standardized fitness: lower the better
• Adjusted fitness: lies between 0-1
• Normalized fitness: lies between 0-1 with sum of fitness values = 1
• In our case the lower the number of ‘different’ nodes the better, use standardized fitness
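The relationship between these fitness measures can be shown in a few lines, using the definitions from Koza's book listed in the references: adjusted fitness is 1 / (1 + standardized), and normalized fitness is each adjusted value divided by the sum of all adjusted values. The function names and sample scores are illustrative.

```python
def adjusted(standardized):
    """Map standardized fitness (0 is best) into the range (0, 1], 1 is best."""
    return 1.0 / (1.0 + standardized)

def normalized(standardized_values):
    """Adjusted fitness rescaled so the whole population sums to 1."""
    adj = [adjusted(s) for s in standardized_values]
    total = sum(adj)
    return [a / total for a in adj]

scores = [0, 1, 3]  # e.g. number of 'different' nodes per individual
print([adjusted(s) for s in scores])
print(normalized(scores))
```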
Step 3. Apply Primary Genetic Operations
Reproduction
[Diagram: an individual moves from the selected XSLT population into the new generation]
Individual reproduced into new generation
Step 3. Primary Genetic Operations
Crossover ( Recombination )
[Diagram: ‘Mom’ and ‘Dad’ are taken from the selected XSLT population; crossover creates 2 offspring in the new generation]
Select parents then crossover creates 2 offspring
Step 3. Primary Genetic Operations
Crossover ( Recombination )
[Diagram: ‘Mom XSLT’ and ‘Dad XSLT’ each produce an ‘offspring xslt’ in the new generation]
Swap nodes between selected parent xslt
Crossover with xquery
xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update" at "/MarkLogic/appservices/utils/in-mem-update.xqy";

let $mom :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <bar>help</bar>
    </xsl:template>
    <xsl:template match="text()" as="item()*"/>
  </xsl:stylesheet>
let $dad :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <a><b><c>test</c></b></a>
    </xsl:template>
  </xsl:stylesheet>
let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)
(: never want root node :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2
(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]
(: crossover :)
let $newMom := mem:node-replace( $momNode, $dadNode )
let $newDad := mem:node-replace( $dadNode, $momNode )
return
  <result>
    <newMom>{$newMom}</newMom>
    <newDad>{$newDad}</newDad>
  </result>
Step 3. Secondary Genetic Operations
• Mutation: is a form of random crossover
• Permutation: Reorganize nodes
• Editing: evaluate a set of nodes
• Encapsulation: takes a branch and replaces with 1 indivisible node
• Decimation: removes individual based on domain specific criteria
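Three of these operations are easy to sketch on an XML tree. The code below uses Python's standard `xml.etree.ElementTree` and is a rough illustration only; the talk's own operations ran as XQuery inside MarkLogic. The helper names `mutate`, `permute` and `decimate` and the fitness threshold are assumptions.

```python
import random
import xml.etree.ElementTree as ET

def mutate(root, rng, new_subtree):
    """Mutation: replace one randomly chosen non-root node with a fresh subtree."""
    slots = [(parent, i) for parent in root.iter() for i in range(len(parent))]
    if slots:
        parent, i = rng.choice(slots)
        parent[i] = new_subtree()
    return root

def permute(node, rng):
    """Permutation: shuffle the order of one node's children."""
    children = list(node)
    rng.shuffle(children)
    node[:] = children
    return node

def decimate(population, fitness, worst_allowed):
    """Decimation: drop individuals whose standardized fitness is too poor."""
    return [ind for ind in population if fitness(ind) <= worst_allowed]

rng = random.Random(7)
tree = ET.fromstring("<a><b/><c/><d/></a>")
permute(tree, rng)
mutate(tree, rng, lambda: ET.Element("x"))
print(ET.tostring(tree, encoding="unicode"))
```

Editing and encapsulation are left out because they depend on evaluating or naming subtrees, which is specific to the language being evolved.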
Step 3. Secondary Genetic Operations
mutation
[Diagram: ‘selected XSLT’ becomes ‘offspring xslt’]
Pick a node and randomly mutate
Completely new set of instructions
Step 3. Secondary Genetic Operations
permutation
[Diagram: ‘selected XSLT’ becomes ‘offspring xslt’]
Permuted node order
Step 3. Secondary Genetic Operations
editing
[Diagram: ‘selected XSLT’ becomes ‘offspring xslt’]
Replace node with evaluated expression
Step 3. Secondary Genetic Operations
encapsulation
[Diagram: ‘selected XSLT’, ‘define new function’, ‘XSLT’]
Identify useful subtrees and encapsulate by defining new function
Step 3. Secondary Genetic Operations
decimation
Identify very poor fitness individuals and remove from population
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"></xsl:stylesheet>
<xsl:stylesheet/>
Initial tests
• Initial population = 500, generations = 51
• Set initial genetic operation probabilities:
  90% crossover on selected individuals
  10% reproduction on selected individuals
  0% secondary operations on selected individuals
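The operation probabilities above amount to a weighted choice each time a slot in the new generation is filled. The sketch below plans a generation of 500 under those probabilities. `pick_operation` and `next_generation_plan` are hypothetical names, and the rule that crossover fills two slots follows the "creates 2 offspring" slide.

```python
import random

OPERATION_PROBABILITIES = {"crossover": 0.90, "reproduction": 0.10}

def pick_operation(rng):
    """Choose a genetic operation according to the configured probabilities."""
    names = list(OPERATION_PROBABILITIES)
    weights = list(OPERATION_PROBABILITIES.values())
    return rng.choices(names, weights=weights, k=1)[0]

def next_generation_plan(size, rng):
    """Decide which operation fills each slot of a population of `size`."""
    plan = []
    while len(plan) < size:
        op = pick_operation(rng)
        if op == "crossover" and len(plan) + 2 <= size:
            plan += ["crossover", "crossover"]   # crossover yields two offspring
        else:
            plan.append("reproduction")          # also used when only one slot is left
    return plan

plan = next_generation_plan(500, random.Random(1))
print(plan.count("crossover"), plan.count("reproduction"))
```

Raising the share of secondary operations later, as the results suggest for mutation, only requires adding entries to the probability table.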
Results
• runs faster with more servers … extreme scale out – unusual for GA
• Arrived quickly at a ‘correct’ solution
• Though in some runs the local optimum was a ‘wrong solution’, e.g. embedded literal
• Need to constrain xpath (baby steps)
• Need to constrain terminal set
• Enhance fitness definition
Source XML
<a>
  <b>
    <c>
      <d></d>
    </c>
  </b>
</a>
Target XML
<a>
  <b/>
  <c/>
  <d/>
</a>
Results
• Needed larger generations / more individuals
• Mutation operation needed to kick out of local optima
Summary
• This approach can be applied to any language parse tree (xquery with xqueryparser.xq)
• Difficulties with embedded little languages (e.g. XPath inside XSLT)
• Today, commercially applicable to generating mapping solutions; more research required
• Illustrates combining the strengths of MarkLogic and Hadoop
• Will place code and results on github soon …
References
• John R. Koza, Genetic Programming, MIT Press, 1992
• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976