hadoop and marklogic: using the genetic algorithm to generate source code

MarkLogic and Hadoop – Genetic Algorithm

Jim Fuller, Senior Engineer, Europe
email: [email protected]
twitter: @xquery
19/09/12


James Fuller

http://exslt.org

http://www.xmlprague.cz

http://jim.fuller.name

@xquery

@perl6

XSLT UK 2001

Senior engineer

Overview

• Genetic Algorithm Refresher
• MarkLogic/Hadoop architecture for implementing GA
• Installing Hadoop
• Installing MarkLogic Connector
• Problem Statement
• Review of GA process runs
• Summary

What's the Problem?

• Big data breathes new life into older algorithmic approaches

• I thought it would be interesting to turn the 'big data' problem on its head (code versus data)

• Demonstrate Hadoop with MarkLogic, each working to the other's strengths

Get out of your comfort zone

• This talk is slightly different than the description … 150 slides! Part I.

• It has Hadoop/MarkLogic and the genetic algorithm, but I have focused on the process and early results

• Doing data science means pushing yourself out of your comfort zone

• Start simple, then iterate

Genetic Algorithm Refresher

• The Genetic Algorithm (GA) is a model of the evolution of a population of artificial individuals emulating Darwinian selection.

• Each individual is a chromosome which contains discrete units of information (genes).

• The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem

Abridged Genetic Algorithm

• The Fundamental Theorem of Genetic Algorithms

M(H, t): # of individuals in population 't' with the schema 'H'.
f(H): average fitness of the individuals with the schema 'H'.
F: average fitness of the entire population.
p1: probability of the schema being destroyed by crossover.
p2: probability of the schema being destroyed by mutation.
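The formula itself did not survive on this slide. A common statement of the schema theorem, written with the quantities defined here, is sketched below. It is offered as a reconstruction under that assumption, not as the slide's exact formula.

```latex
M(H, t+1) \;\ge\; M(H, t)\,\frac{f(H)}{F}\,\bigl(1 - p_1 - p_2\bigr)
```

Read informally: schemata with above-average fitness (f(H) > F) that are unlikely to be broken up by crossover or mutation gain more individuals from one generation to the next.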

GA operations

• Reproduction: An individual is perfectly replicated to a new population

• Crossover ( Recombination ): Parental material is recombined to create offspring to join new population

• Mutation: random changes (is key for pushing past local optima)

• Permutation: reordering • Editing: evaluation to a terminal• Encapsulation: single indivisible function• Decimation: removal of individuals

Typical GA Process

Step 0. Create a random initial population of individuals

Step 1. Evaluate the fitness of each individual

Step 2. Select individuals according to their fitness, which will participate in generating offspring (moms+dads)

Step 3. Apply primary and secondary genetic operations to generate new offspring population

Step 4. Repeat steps 1, 2 and 3 to generate X generations

Step 5. Choose the fittest individual of the last generation, based on the stop criteria
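The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the bit-string toy problem, the function names and the truncation-style selection (keeping the better half as parents) are all assumptions made for brevity.

```python
import random


def evolve(fitness, random_individual, crossover, population_size=20,
           generations=10, seed=0):
    """Minimal generational GA loop; lower fitness is better."""
    rng = random.Random(seed)
    # Step 0: create a random initial population.
    population = [random_individual(rng) for _ in range(population_size)]
    # Step 4: repeat steps 1-3 for a fixed number of generations.
    for _ in range(generations):
        # Step 1: evaluate fitness (sorting puts the best first).
        scored = sorted(population, key=fitness)
        # Step 2: select parents (assumption: keep the better half).
        parents = scored[: max(2, population_size // 2)]
        # Step 3: apply a genetic operation (crossover only) to build
        # the offspring population.
        offspring = []
        while len(offspring) < population_size:
            mom, dad = rng.sample(parents, 2)
            offspring.append(crossover(mom, dad, rng))
        population = offspring
    # Step 5: choose the fittest individual of the last generation.
    return min(population, key=fitness)


# Hypothetical toy problem: evolve an 8-bit string towards all ones.
def count_zeros(bits):
    return bits.count(0)


def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(8)]


def one_point_crossover(mom, dad, rng):
    cut = rng.randrange(1, len(mom))
    return mom[:cut] + dad[cut:]


best = evolve(count_zeros, random_bits, one_point_crossover)
print(best, count_zeros(best))
```

The loop stops after a fixed number of generations; a real run would also stop as soon as an individual reaches the target fitness.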

Endemic GA Problems

• Finding the optimal solution to complex, high-dimensional, multimodal problems often requires a very expensive fitness function

• It can be hard to pose the problem statement, e.g. the stop criteria are not clear in every problem

• Premature convergence on local optima

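Bit strings vs Lisp Parse Trees

(+ (* 2 3) 4) evaluates to 10, and the symbolic expression looks like:

[Figure: parse tree with + at the root, whose children are the subtree (* 2 3) and the leaf 4]

Manipulating hierarchical computer programs is more expressive than manipulating linear strings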
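As a small illustration of why a parse tree is convenient to manipulate, the sketch below represents the expression as nested tuples and evaluates it recursively. It assumes the expression on the slide is (+ (* 2 3) 4), since the multiplication is what makes the result 10; the tuple encoding and the `evaluate` helper are hypothetical.

```python
import operator

# Operators understood by this toy evaluator (an assumption for the sketch).
OPS = {"+": operator.add, "*": operator.mul}


def evaluate(node):
    """Evaluate a parse tree given as (operator, child, child, ...)."""
    if isinstance(node, tuple):
        op, *children = node
        values = [evaluate(child) for child in children]
        result = values[0]
        for value in values[1:]:
            result = OPS[op](result, value)
        return result
    # A leaf is a plain number.
    return node


tree = ("+", ("*", 2, 3), 4)
print(evaluate(tree))
```

Swapping any subtree, for example replacing `("*", 2, 3)` with another tuple, still yields a valid program, which is harder to guarantee when cutting a linear string at an arbitrary position.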

XSLT – markup is useful!

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="a">
    <d/>
    <c/>
  </xsl:template>
</xsl:stylesheet>

[Figure: the stylesheet drawn as a tree, with <xsl:stylesheet/> at the root, <xsl:template/> below it, and <d/> and <c/> as leaves]

Obvious difficulties to address: different node types and XPath

Objective: Generate an XSLT program that transforms the source XML into result XML that is equivalent to the target XML
Terminal set: <a/> <c/> <d/>
Function set: Subset of XSLT instructions
Fitness cases: One fitness case
Raw fitness: treediffmerge result, node count + standard diff
Standardized fitness: Same as raw fitness; approaching 0 is better fitness
Parameters: M=500, G=51

Source XML

<a>
  <c><d></d>
  </c>
</a>

Target XML – clear stop criteria

<a>
  <c><d></d>
  </c>
</a>

Generation zero

• XML Instance Generator which is part of the Sun Multi-Schema Validator

• Sun Multi-Schema Validator

• The following can do it:
  – OxygenXML
  – Visual Studio
  – Eclipse

• Ended up using IBM XML Generate: very old, but supply it a schema and it generates example XML

https://msv.dev.java.net/


Step 1a: Evaluate against Input

[Diagram: each xslt in the XSLT generation is applied to Source.xml (transformation) to produce result.xml]

MarkLogic evaluates each XSLT and places the result into the property of the XSLT document itself

Step 1b: Evaluate Fitness

[Diagram: each xslt in the XSLT generation is applied to Source.xml (transformation) to produce result.xml; the 'evaluate fitness' step on result.xml is labelled HADOOP]

Fitness is computed with treediffmerge + standard diff

XML Diff issues

• Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison

• XML has a structure; text-based diff programs do not take this into account

• Simple example: <footie/> versus <footie></footie>; logically these are equal

• XML canonicalization helps!
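The <footie/> example can be checked with a short sketch. It uses the `canonicalize` function from Python's standard library (available from Python 3.8) purely as an illustration; the talk's own pipeline runs canonicalization inside the Hadoop fitness step, and the `same_xml` helper is a hypothetical name.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(left, right):
    """Compare two XML strings after canonicalization.

    strip_text=True also ignores whitespace around text content, so
    purely cosmetic indentation differences do not count.
    """
    return (canonicalize(xml_data=left, strip_text=True)
            == canonicalize(xml_data=right, strip_text=True))


# A plain text comparison sees two different strings...
print("<footie/>" == "<footie></footie>")
# ...while the canonical forms are identical.
print(same_xml("<footie/>", "<footie></footie>"))
```

Canonicalizing both documents before diffing removes this kind of purely lexical noise, so the diff reports only structural differences.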

TREEDIFFMERGE DIFFERENCE RESULTS

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:insert dst="1">
    <a>
      <c>
        <d />
      </c>
    </a>
  </diff:insert>
</diff>

<?xml version="1.0" encoding="UTF-8"?>
<root/>

<?xml version="1.0" encoding="UTF-8"?>
<diff xmlns:diff='http://diff.org'>
  <diff:copy src="2" dst="1">
    <diff:copy src="16" dst="2" />
  </diff:copy>
</diff>

<?xml version="1.0" encoding="utf-8"?>
<root><a/><a><a><c/><c><a><d/></a><c/></c></a><a/><c/> <c> <d/> </c> <a/></a><d><a><c/><a/><a/></a><c/></d><c/></root>

XML Canonicalize + TreeDiffMerge

Simple: if we match, we are done!

<?xml version="1.0" encoding="UTF-8"?>
<diff />

<?xml version="1.0" encoding="utf-8"?>
<root><a> <c> <d/> </c> </a></root>

Interlude: MarkLogic/Hadoop Architecture

[Diagram: Hadoop and MarkLogic communicate through the Connector API via XDBC, shown from the Hadoop point of view]

Hadoop Installation Recipe

• Installing Hadoop (setting up a single-node cluster)
  – brew install hadoop
  – make sure ssh is set up properly
  – generate id_rsa and id_rsa.pub
  – append the public key to the authorized keys
    • cat id_rsa.pub >> authorized_keys
  – enable remote login on Mac OS X
• Configure Hadoop
  – edit core-site.xml
  – edit mapred-site.xml
• ssh localhost
  – format HDFS
    • hadoop namenode -format
• bin/start-all.sh
  – if it asks for a password, you have a problem with your ssh setup
• To check that all is well
  – run jps
  – ps ax | grep hadoop | wc -l
  – check
    • http://localhost:50030/jobtracker.jsp
    • http://localhost:50060/tasktracker.jsp
    • http://localhost:50070/dfshealth.jsp




Installing ML Hadoop Connector

• Copy the latest xcc and connector jars to the hadoop lib directory
• Copy the ml-examples jar as well
• Copy the ml hadoop conf to the hadoop conf directory

Starting it all Up

• Start MarkLogic
• Create database
• Create xdbc connection (how Hadoop and MarkLogic communicate)
• Edit marklogic-hello-world.xml
• Make sure Hadoop is started

Starting it all Up

• Load test Data via query console

xquery version "1.0-ml";

let $hello := <data><child>hello mom</child></data>
let $world := <data><child>world event</child></data>
return (
  xdmp:document-insert("hello.xml", $hello),
  xdmp:document-insert("world.xml", $world)
)

Run hello world example

• bin/start-all.sh

• hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld

• Review https://gist.github.com/2484318

Fitness (hadoop) step

• Applies XML canonicalization
• Performs treediffmerge and writes the output to the original xslt document's xml property
• Performs a text diff and writes it to the original xslt document's xml property

Step 2. Select individuals

• Probabilistic selection to choose which individuals participate in genetic operation

Selected XSLT population

Select individuals for genetic operations, based on their fitness
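One simple way to make selection probabilistic is tournament selection, sketched below. Note that this is a swapped-in technique chosen for brevity: the talk does not say which probabilistic scheme was used, and the function name and tournament size are assumptions.

```python
import random


def tournament_select(population, fitness, rng, size=3):
    """Pick `size` individuals at random and return the fittest of them.

    Lower fitness is better, matching the standardized fitness used here.
    """
    contenders = rng.sample(population, size)
    return min(contenders, key=fitness)


# Hypothetical population: each number stands for an individual whose
# fitness is the number itself.
rng = random.Random(42)
population = [5, 2, 9, 7, 1, 8]
parents = [tournament_select(population, lambda x: x, rng) for _ in range(4)]
print(parents)
```

Fitter individuals win more tournaments and so are selected more often, while weaker individuals still have some chance of taking part, which helps preserve diversity.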

About fitness

• Raw fitness: the natural representation in terms of the specific problem (here, a primitive count of the nodes in the treediffmerge patch)

• Standardized fitness: the lower the better
• Adjusted fitness: lies between 0 and 1
• Normalized fitness: lies between 0 and 1, with the sum of fitness values = 1
• In our case, the lower the number of 'different' nodes the better, so we use standardized fitness
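The relationship between these measures can be sketched as follows, using the definitions from Koza's Genetic Programming: adjusted fitness is 1 / (1 + standardized fitness), and normalized fitness divides each adjusted value by the sum over the population. The sample values are made up for illustration.

```python
def adjusted_fitness(standardized):
    """Map standardized fitness (0 is best) into (0, 1], where 1 is best."""
    return [1.0 / (1.0 + s) for s in standardized]


def normalized_fitness(standardized):
    """Scale adjusted fitness so the values sum to 1 over the population."""
    adjusted = adjusted_fitness(standardized)
    total = sum(adjusted)
    return [a / total for a in adjusted]


# Hypothetical standardized fitness values, e.g. counts of differing nodes.
standardized = [0, 1, 3]
print(adjusted_fitness(standardized))
print(normalized_fitness(standardized))
```

Because normalized values sum to 1, they can be used directly as selection probabilities in a fitness-proportionate scheme.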

Step 3. Apply Primary Genetic Operations


New generation

Reproduction

Individual reproduced into new generation

Step 3. Primary Genetic Operations: Crossover (Recombination)

Select parents, then crossover creates 2 offspring

[Diagram: 'Mom' and 'Dad' are crossed over, creating 2 offspring in the new generation]

Step 3. Primary Genetic Operations: Crossover (Recombination)

Swap nodes between selected parent xslt

[Diagram: nodes are swapped between 'Mom XSLT' and 'Dad XSLT' to produce 'offspring xslt' in the new generation]

Crossover with xquery

xquery version "1.0-ml";
import module namespace mem = "http://xqdev.com/in-mem-update"
  at "/MarkLogic/appservices/utils/in-mem-update.xqy";

let $mom :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <bar>help</bar>
    </xsl:template>
    <xsl:template match="text()" as="item()*"/>
  </xsl:stylesheet>
let $dad :=
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/" as="item()*">
      <a><c>test</c></a>
    </xsl:template>
  </xsl:stylesheet>
let $momCount := fn:count($mom//.)
let $dadCount := fn:count($dad//.)
(: never want root node :)
let $momRdm := xdmp:random($momCount - 2) + 2
let $dadRdm := xdmp:random($dadCount - 2) + 2
(: node selection :)
let $momNode := ($mom//.)[$momRdm]
let $dadNode := ($dad//.)[$dadRdm]
(: crossover :)
let $newMom := mem:node-replace($momNode, $dadNode)
let $newDad := mem:node-replace($dadNode, $momNode)
return
  <result>
    <newMom>{$newMom}</newMom>
    <newDad>{$newDad}</newDad>
  </result>

Step 3. Secondary Genetic Operations

• Mutation: a form of random crossover
• Permutation: reorganize nodes
• Editing: evaluate a set of nodes
• Encapsulation: takes a branch and replaces it with 1 indivisible node
• Decimation: removes individuals based on domain-specific criteria

Step 3. Secondary Genetic Operations: mutation

‘selected XSLT’

Pick a node and randomly mutate

Completely new set of instructions
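A mutation step on an XML tree might look like the sketch below. It is a simplification under stated assumptions: the chosen node is replaced by a single random element from the terminal set <a/>, <c/>, <d/> rather than by a whole new set of instructions, and the `mutate` helper and its use of Python's ElementTree are illustrative only.

```python
import copy
import random
import xml.etree.ElementTree as ET

# Terminal set from the problem statement.
TERMINALS = ["a", "c", "d"]


def mutate(root, rng):
    """Return a copy of `root` with one random non-root element replaced."""
    clone = copy.deepcopy(root)
    # Every (parent, index) pair addresses one non-root element.
    slots = [(parent, i) for parent in clone.iter()
             for i in range(len(parent))]
    if not slots:
        return clone  # nothing below the root to mutate
    parent, i = rng.choice(slots)
    replacement = ET.Element(rng.choice(TERMINALS))
    replacement.tail = parent[i].tail  # keep surrounding whitespace
    parent[i] = replacement
    return clone


original = ET.fromstring("<a><c><d/></c></a>")
mutated = mutate(original, random.Random(1))
print(ET.tostring(mutated, encoding="unicode"))
```

Working on a deep copy leaves the selected individual untouched, so it can still be reproduced or used in crossover within the same generation.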


Step 3. Secondary Genetic Operations: permutation

‘selected XSLT’ ‘offspring xslt’

Permuted node order

Step 3. Secondary Genetic Operations: editing

‘selected XSLT’ ‘offspring xslt’

Replace node with evaluated expression

Step 3. Secondary Genetic Operations: encapsulation

‘selected XSLT’ ‘define new function’

Identify useful subtrees and encapsulate by defining new function

‘XSLT’

Step 3. Secondary Genetic Operations: decimation

Identify very poor fitness individuals and remove from population

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"></xsl:stylesheet>

<xsl:stylesheet/>

Initial tests

• Initial population = 500, generations = 51
• Set initial genetic operation probabilities:
  – 90% crossover on selected individuals
  – 10% reproduction on selected individuals
  – 0% secondary operations on selected individuals
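Choosing an operation according to these probabilities can be sketched with a weighted random draw. The dictionary keys and the `pick_operation` helper are hypothetical names; only the 90/10/0 split comes from the slide.

```python
import random

# Operation probabilities from the initial tests.
OPERATION_PROBABILITIES = {
    "crossover": 0.90,
    "reproduction": 0.10,
    "secondary": 0.0,
}


def pick_operation(rng):
    """Draw one genetic operation according to the configured weights."""
    names = list(OPERATION_PROBABILITIES)
    weights = list(OPERATION_PROBABILITIES.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(7)
picks = [pick_operation(rng) for _ in range(1000)]
print(picks.count("crossover"), picks.count("reproduction"))
```

Keeping the probabilities in one table makes it easy to raise the secondary (for example mutation) share in later runs without touching the selection logic.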

Results

• Runs faster with more servers: extreme scale-out, which is unusual for GA

• Arrived quickly at a 'correct' solution
• Though in some runs the local optimum was a 'wrong solution', e.g. an embedded literal
• Need to constrain XPath (baby steps)
• Need to constrain the terminal set
• Enhance the fitness definition

Source XML

<a>
  <c><d></d>
  </c>
</a>

Target XML

<a>
  <c/> <d/>
</a>

Results

• Needed larger generations / more individuals
• Mutation operation needed to kick out of local optima

Summary

• This approach can be applied to any language with a parse tree (e.g. XQuery with xqueryparser.xq)

• Difficulties with little languages being embedded

• Today it is commercially applicable to generating mapping solutions; more research is required

• Illustrates applying the strengths of MarkLogic and Hadoop together

• Will place code and results on github soon …

References

• John R. Koza, Genetic Programming, MIT Press, 1992
• J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976
