
Page 1:

SPIE Data Mining Conference, Orlando, April 1999

Christoph F. Eick, Walter D. Sanz, and Ruijian Zhang
www.cs.uh.edu/~ceick/matta.html
University of Houston

A Genetic Programming System for Building Block Analysis to Enhance Data Analysis and Data Mining Techniques

Organization
1. Introduction
2. General Approach and Corresponding Methodology
3. Wols --- A Genetic Programming System to Find Building Blocks
4. Experimental Results
5. Summary and Conclusion

Page 2:

Regression analysis, which is also known as curve fitting, may be broadly defined as the analysis of relationships among variables. The actual specification of the form of the function, whether it be linear, quadratic, or polynomial, is left entirely to the user.

Symbolic regression is the process of discovering both the functional form of a target function and all of its necessary coefficients. Symbolic regression involves finding a mathematical expression in symbolic form that provides a good, best, or perfect fit between a finite sampling of values of the input variables and the associated values of the dependent variables.

Genetic Programming simulates the Darwinian evolutionary process and naturally occurring genetic operations on computer programs. Genetic programming provides a way to search for the fittest computer program in a search space which consists of all possible computer programs composed of functions and terminals appropriate to the problem domain.
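
As an illustration only of how such a search proceeds, here is a hypothetical sketch of one generation in Common Lisp (the language the WOLS system itself is written in). The fitness, crossover, and mutation functions are caller-supplied placeholders, not the WOLS implementation; lower fitness is treated as better.

;; Hypothetical sketch of one generation of a genetic-programming search.
;; FITNESS-FN maps an individual to an error to be minimized;
;; CROSSOVER-FN and MUTATE-FN build new individuals from old ones.
(defun next-generation (population fitness-fn crossover-fn mutate-fn
                        &key (p-crossover 0.7) (p-mutation 0.2))
  "Produce a new population of the same size from POPULATION."
  (let* ((ranked (sort (copy-list population) #'< :key fitness-fn))
         (size (length population)))
    (flet ((pick () (nth (random (max 1 (floor size 2))) ranked)))  ; favor the fitter half
      (loop repeat size
            for r = (random 1.0)
            collect (cond ((< r p-crossover) (funcall crossover-fn (pick) (pick)))
                          ((< r (+ p-crossover p-mutation)) (funcall mutate-fn (pick)))
                          (t (copy-tree (pick))))))))                ; plain copying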

1. Introduction

Page 3:

Example -- Evolving Syntax Trees with Genetic Programming:

[Figure: a population of 4 syntax trees, built from functions such as *, -, +, /, cos and terminals A, B, C, D, E, 5, is evolved from Generation 0 to Generation 1 by crossover, mutation, and copying; fitness values are shown next to each tree, and the best-so-far individuals, including a tree multiplying A and B, are tracked.]

Page 4:

Motivating Example

Task: Approximate boundaries between classes with rectangles

[Figure: two example datasets with classes C1 and C2. In the first, the class boundary is easy to approximate with rectangles; in the second, it is difficult.]

Idea: Transform the representation space so that the approximation problem becomes simpler (e.g., instead of (x, y), just use (x - y)).

Problem: How do we find such representation space transformations?

Page 5:

Definition: Building Blocks are patterns which occur with a high frequency in a set of solutions.

Example:

(- (* C D) (+ A B))
(- C (+ B A))
(- (+ A B) (COS A))
(- (+ C (/ E (SQRT D))) (+ A B))

Leaf Building Block: (+ A B)
Root Originating Building Block: (- ? (+ A B))

FUNC1 = (+ A B)
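
As an illustration only (not the WOLS code), a small Common Lisp sketch that counts how often a candidate leaf building block such as (+ A B) occurs in a set of solution trees:

;; Illustrative sketch: count occurrences of a candidate building block
;; (a subtree, compared with EQUAL) within a list of solution trees.
(defun count-subtree (pattern tree)
  "Number of subtrees of TREE that are EQUAL to PATTERN."
  (+ (if (equal pattern tree) 1 0)
     (if (consp tree)
         (reduce #'+ (mapcar (lambda (child) (count-subtree pattern child)) tree))
         0)))

(defun building-block-frequency (pattern solutions)
  "Fraction of SOLUTIONS that contain PATTERN at least once."
  (/ (count-if (lambda (s) (plusp (count-subtree pattern s))) solutions)
     (length solutions)))

;; For the four example solutions above:
;; (building-block-frequency '(+ A B)
;;                           '((- (* C D) (+ A B)) (- C (+ B A))
;;                             (- (+ A B) (COS A)) (- (+ C (/ E (SQRT D))) (+ A B))))
;; => 3/4   (note that (- C (+ B A)) contains (+ B A), not (+ A B))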

Building Blocks

Page 6:

Definition: Constructive Induction is a general approach for coping with inadequate, redundant, and useless attributes found in the original data. Constructive Induction explores various representation spaces. Building block analysis and constructive induction are used in this research to transform the original data by adding new attributes and by dropping useless ones.

Example:

Primitive Attributes
Cases    A     B     C
#1       0.8   0.1   0.2
#2       0.4   0.5   0.3

Primitive Attributes and New Attributes
Cases    A     B     C     por(A,B)   por(A,C)   por(B,C)
#1       0.8   0.1   0.2   0.82       0.84       0.28
#2       0.4   0.5   0.3   0.7        0.58       0.65

(The new attributes are built with the por operator from the WOLS function set, por(x,y) := x + y - x*y.)

Constructive Induction
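
A minimal sketch of this kind of transformation (our illustration, not the WOLS front end), extending each case with new attributes built from a discovered building block; por is the operator from the WOLS function set, por(a,b) := a + b - a*b:

;; Illustrative constructive-induction step: extend each case with new
;; attributes computed from the primitive ones.
(defun por (a b)
  "a + b - a*b, as defined in the WOLS function set."
  (- (+ a b) (* a b)))

(defun extend-case (row)
  "ROW is a list (A B C); return (A B C por(A,B) por(A,C) por(B,C))."
  (destructuring-bind (a b c) row
    (list a b c (por a b) (por a c) (por b c))))

;; Matches the table above (up to floating-point rounding):
;; (mapcar #'extend-case '((0.8 0.1 0.2) (0.4 0.5 0.3)))
;; => ((0.8 0.1 0.2 0.82 0.84 0.28) (0.4 0.5 0.3 0.7 0.58 0.65))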

Page 7:

Primary Learning System: the one we use for the task at hand (e.g., C5.0)
Secondary Learning System: used to find building blocks

General Steps of the Methodology:

0. Run the primary learning system for the data collection. Record prediction performance and good solutions.

1. Run the genetic programming/symbolic regression/building block analysis system (secondary learning system) several times for the data collection with a rich function set. Record prediction performance and good solutions.

2. Analyze the good solutions found by the secondary learning system for the occurrence of building blocks.

3. Using the results of step 2, transform the data collection by generating new attributes.

4. Redo steps 0-3 for the modified data collection, as long as a significant improvement occurs.

2. General Approach and Corresponding Methodology
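
A hypothetical outline of the loop above in Common Lisp; the supplied functions are placeholder names (not actual WOLS or C5.0 interfaces), and the primary learner is assumed to return a prediction error to be minimized.

;; Hypothetical outline of steps 0-4.  PRIMARY, SECONDARY, FIND-BLOCKS, and
;; ADD-ATTRIBUTES are caller-supplied placeholder functions.
(defun refine-representation (data primary secondary find-blocks add-attributes
                              &key (min-gain 0.01) (max-rounds 5))
  (loop with best = (funcall primary data)                    ; step 0
        repeat max-rounds
        for solutions = (funcall secondary data)              ; step 1: GP / symbolic regression
        for blocks    = (funcall find-blocks solutions)       ; step 2: building block analysis
        for new-data  = (funcall add-attributes data blocks)  ; step 3: constructive induction
        for new-best  = (funcall primary new-data)
        while (> (- best new-best) min-gain)                  ; step 4: repeat while it helps
        do (setf data new-data
                 best new-best)
        finally (return (values data best))))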

Page 8:

3. Wols --- A Genetic Programming System that Finds Building Blocks

[Architecture diagram of the WOLS system: Front End Interface, SRGP (the symbolic regression genetic programming component), executable SRGP code, and the Building Block Detection Tool.]

Page 9:

Wols’ Major Inputs

Example:

Cases    A1    A2    B
#1       0.8   0.1   0.2
#2       0.4   0.3   0.3
#3       0.3   0.4   0.4
#4       0.2   0.5   0.1

B = f(A1, A2)    Function Set: *, +, SIN

B = A1 * SIN(A2 + A1)
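
As an illustration of how such a candidate is evaluated (our sketch, not the SRGP evaluator), an expression tree like (* A1 (SIN (+ A2 A1))) can be computed for one case from an alist of attribute bindings:

;; Illustrative evaluator for a small expression tree over the attributes.
(defun eval-tree (tree bindings)
  "BINDINGS is an alist such as ((A1 . 0.8) (A2 . 0.1))."
  (cond ((numberp tree) tree)
        ((symbolp tree) (cdr (assoc tree bindings)))
        (t (let ((args (mapcar (lambda (arg) (eval-tree arg bindings)) (rest tree))))
             (case (first tree)
               (+ (apply #'+ args))
               (* (apply #'* args))
               (sin (sin (first args)))
               (otherwise (error "Unknown operator ~S" (first tree))))))))

;; e.g. for case #1: (eval-tree '(* A1 (SIN (+ A2 A1))) '((A1 . 0.8) (A2 . 0.1)))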

Page 10:

Features of the Wols System

The WOLS system is basically a modified version of the standard genetic programming model.

A standard genetic programming model employs crossover and mutation operators that evolve solutions (here, approximation functions), relying on the principle of survival of the fittest: better solutions reproduce with a higher probability.

The WOLS system employs the traditional “sub-tree swapping” crossover operator, but provides a quite sophisticated set of seven mutation operators that apply various random changes to solutions.
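
For illustration, a textbook version of sub-tree swapping crossover (a sketch under our own assumptions, not the actual WOLS operator):

;; Textbook sketch of sub-tree swapping crossover: pick a random subtree
;; in each parent and graft the donor subtree into a copy of the first parent.
(defun list-subtrees (tree)
  "All subtrees of TREE, including TREE itself (operator symbols excluded)."
  (if (consp tree)
      (cons tree (mapcan #'list-subtrees (rest tree)))
      (list tree)))

(defun random-subtree (tree)
  (let ((subs (list-subtrees tree)))
    (nth (random (length subs)) subs)))

(defun subtree-crossover (parent1 parent2)
  "Return a child: PARENT1 with one random subtree replaced by a random
subtree of PARENT2.  SUBST replaces every occurrence EQUAL to the chosen
subtree, which is good enough for a sketch."
  (let ((target (random-subtree parent1))
        (donor  (random-subtree parent2)))
    (subst (copy-tree donor) target (copy-tree parent1) :test #'equal)))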

The goal of the WOLS system is to find good approximations for an unknown function ƒ, and to analyze the decomposition of good approximations with the goal of identifying building blocks.

The front end of the WOLS system takes a data collection DC=(A1,…,An, B) as its input and tries to find good approximations h, where B=h(A1,…,An), of the unknown function f with respect to a given function set and a given error function.

The symbolic regression genetic programming component (SRGP) of the WOLS system includes an interactive front end interface which allows a user to easily generate a genetic program that employs symbolic regression and is tailor-made to the problem he wishes to solve.

Page 11:

The system run starts by asking the user interactively for parameters needed to run the symbolic regression system. The information the user provides is then used to generate the functions for the specific problem he wishes to solve. These problem specific functions are then merged with a set of general genetic programming functions to form a complete problem specific genetic program.

The SRGP component allows the user to select between three different error functions:

• Manhattan error function (minimizes the absolute error in the output variable)

• Least Square error function (minimizes the squared error in the output variable)

• Classification error function (minimizes the number of misclassifications)

The WOLS system is capable of solving both regression analysis and classification problems. In regression analysis problems the goal is to minimize the error, while in classification problems the goal is to minimize the number of misclassifications.
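
As a sketch only (our illustration, not the SRGP code), each of the three error functions reduces a list of (predicted desired) pairs, one pair per test case, to a single number:

;; Illustrative versions of the three selectable error functions.
;; PAIRS is a list of (predicted desired) values, one pair per test case.
(defun manhattan-error (pairs)
  "Average absolute error in the output variable."
  (/ (loop for (x y) in pairs sum (abs (- x y))) (length pairs)))

(defun least-squares-error (pairs)
  "Average squared error in the output variable (summing would rank identically)."
  (/ (loop for (x y) in pairs sum (expt (- x y) 2)) (length pairs)))

(defun classification-error (pairs)
  "Number of misclassifications; assumes predicted and desired are class labels."
  (count-if-not (lambda (pair) (eql (first pair) (second pair))) pairs))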

Features of the Wols System (cont.)

Page 12:

The goal of the building block analysis tool is to identify frequently occurring patterns in sets of solutions. In our particular approach, building blocks are expressions that involve function symbols, variable symbols, as well as the ‘?’ (“don’t care”) symbol.

For example, if DC = (A1, A2, A3, B) is a data collection, the variable A1, the expression (A2 * A3), and the expression ((sin A1) + (A2 * ?)) are all potential building blocks.

In general we are interested in finding patterns that occur in about 85% or more of the set of analyzed approximations.

The identification of building blocks is useful for both improving the attribute space and in determining which part of the search space has been explored.
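
A minimal sketch of such pattern matching (our illustration, not the detection tool itself), where ? matches any subtree:

;; Illustrative matcher for building-block patterns containing the
;; '?' ("don't care") symbol, e.g. (- ? (+ A B)).
(defun matches-p (pattern tree)
  "True if TREE matches PATTERN at its root; ? matches any subtree."
  (cond ((eq pattern '?) t)
        ((atom pattern) (eql pattern tree))
        ((atom tree) nil)
        ((/= (length pattern) (length tree)) nil)
        (t (every #'matches-p pattern tree))))

(defun occurs-anywhere-p (pattern tree)
  "True if PATTERN matches TREE or any of its argument subtrees."
  (or (matches-p pattern tree)
      (and (consp tree)
           (some (lambda (child) (occurs-anywhere-p pattern child)) (rest tree)))))

;; e.g. (matches-p '(- ? (+ A B)) '(- (* C D) (+ A B)))  => T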

Features of the Wols System (cont.)

Page 13:

Building Block Analysis Functions

Root Analysis: Finds common root patterns among a population of solution trees.

General Analysis: Finds common patterns anywhere in a solution tree.

Leaf Analysis: Finds common patterns at the leaf nodes of solution trees.

Frequency Analysis: Finds the total number of times a specific operator, variable, or constant is used in the population of solution trees.
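
As an illustration of the frequency analysis (our sketch, not the WOLS tool), counting every operator, variable, and constant across a population of solution trees:

;; Illustrative frequency analysis over a population of solution trees.
(defun flatten-tree (tree)
  "All atoms in TREE, including the operator symbol at each node."
  (if (atom tree)
      (list tree)
      (mapcan #'flatten-tree tree)))

(defun symbol-frequencies (solutions)
  "Alist ((item . total-count) ...) over all SOLUTIONS, sorted by decreasing
count.  Numeric constants are pooled under the key :CONSTANT."
  (let ((counts (make-hash-table :test #'eql))
        (result '()))
    (dolist (solution solutions)
      (dolist (item (flatten-tree solution))
        (incf (gethash (if (numberp item) :constant item) counts 0))))
    (maphash (lambda (key count) (push (cons key count) result)) counts)
    (sort result #'> :key #'cdr)))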

Page 14:

Experiment 1 - 5 variable problem with 30 points training data
Experiment 2 - 5 variable problem with 150 points training data
Experiment 3 - Simple Linear Regression
Experiment 4 - Time for a Hot Object To Cool
Experiment 5 - Time for a Hot Object To Cool Using Constructive Induction
Experiment 6 - Glass Classification
Experiment 7 - Learning An Approximation Function

4. Experimental Results

Page 15:

Experiment 7 - Learning An Approximation Function

f(a, b, c, d, e) = |b - |d - 0.44|| * |log(a+1)| + a*b*(1 - log(2))

Random values were plugged into function f in order to generate test cases of the form (a, b, c, d, e, f(a, b, c, d, e)).

Training data set of 150 test cases; testing data set of 150 test cases.

The goal was to find a function h(a, b, c, d, e) that best approximates function f.

Function set: +, *, loga(x) := |log(x)|, cosa(x) := |cos(x)|, sqrta(x) := |sqrt(x)|, por(a,b) := a + b - a*b, diva(a,b) := |a/b|, aminus(a,b) := |a - b|, expa(x) := e^min(x, 300), sin100a(x) := |sin(100*x)|

Variable set: a, b, c, d, e
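
The protected operators above can be written down directly; in the sketch below the guards against log(0) and division by zero are our assumptions (the slide only gives the absolute-value definitions).

;; Sketch of the function set.  The guards for log(0) and division by zero
;; are our assumptions; the slide only gives the absolute-value definitions.
(defun loga    (x)   (abs (log (if (zerop x) least-positive-single-float x))))
(defun cosa    (x)   (abs (cos x)))
(defun sqrta   (x)   (abs (sqrt x)))               ; |sqrt(x)|; a negative x yields the magnitude of the complex root
(defun por     (a b) (- (+ a b) (* a b)))          ; a + b - a*b
(defun diva    (a b) (if (zerop b) 0 (abs (/ a b))))
(defun aminus  (a b) (abs (- a b)))
(defun expa    (x)   (exp (min (float x 1d0) 300d0)))  ; e^min(x, 300), capped to avoid overflow
(defun sin100a (x)   (abs (sin (* 100 x))))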

Page 16:

Fitness Function: Average Absolute Manhattan Distance

where: n = total number of test cases

xi = output generated by the solution found through genetic programming

yi = output generated by the known function (desired output)

RUN 1 - Direct Approach
RUN 2 - Using Leaf Building Blocks
RUN 3 - Using Leaf Building Blocks and Reduced Operator and Variable Sets

For each run a population size of 200 was used. Each run was terminated after 100 generations.

fitness = (1/n) * Σ(i=1..n) |xi - yi|

Experiment 7 - continued

Page 17:

RUN 1 - Direct Approach

Average Absolute Manhattan Distance Error
Training: 0.01102
Testing: 0.01332

Leaf Building Blocks In The Top 10% Of The Population
Pattern (* B A) occurred 25 times in 100.0% of the solutions.
Pattern (AMINUS 0.41999999999999998 D) occurred 32 times in 95.0% of the solutions.
Pattern (EXPA (AMINUS 0.41999999999999998 D)) occurred 28 times in 95.0% of the solutions.

Frequency Analysis In The Top 10% Of The Population
+ was used 9 times in 25.0% of population
aminus was used 56 times in 100.0% of population
* was used 176 times in 100.0% of population
/ was used 75 times in 100.0% of population
por was used 111 times in 100.0% of population
sqrta was used 63 times in 100.0% of population

Page 18:

Frequency Analysis (continued)
sin100a was used 3 times in 10.0% of population
cosa was used 1 time in 5.0% of the population
loga was used 2 times in 5.0% of population
expa was used 33 times in 100.0% of population
A was used 76 times in 100.0% of population
B was used 132 times in 100.0% of population
C was used 9 times in 35.0% of population
D was used 57 times in 100.0% of population
E was used 21 times in 40.0% of population
CONSTANT was used 152 times in 100.0% of population

Leaf Building Blocks: (* B A), (AMINUS 0.41999999999999998 D), and (EXPA (AMINUS 0.41999999999999998 D))

Since variables C and E are the only two variables not present in 100% of the solutions analyzed, they will be eliminated in RUN 3, along with the least used operator cosa. Also notice that variables C and E were not actually present in function f and were successfully filtered out.

Page 19:

RUN 2 - Using Leaf Building Blocks

Solution: (* ADF1 (SQRTA (+ (SQRTA (AMINUS (SQRTA (POR B (SQRTA (AMINUS (* (* B (POR 0.93000000000000005 (AMINUS (+ ADF1 ADF1) ADF1))) (POR (AMINUS (POR (SQRTA (AMINUS (* 0.91000000000000003 (POR A (EXPA (AMINUS (POR ADF2 (POR (SQRTA ADF2) (+ (DIVA 0.80000000000000004 B) (SQRTA (AMINUS (SQRTA (AMINUS B ADF2)) ADF2))))) ADF2)))) ADF2)) (LOGA (DIVA 0.80000000000000004 B))) ADF1) ADF2)) ADF2)))) ADF2)) 0.029999999999999999)))

Where:

ADF1 = (* B A)
ADF2 = (AMINUS 0.41999999999999998 D)
ADF3 = (EXPA (AMINUS 0.41999999999999998 D))

Average Absolute Manhattan Distance Error

Training: 0.00886

Testing: 0.01056

In RUN 2 there is a 19.6% increase in accuracy in training and a 20.7% improvement in testing over RUN 1.


Page 21:

RUN 3 - Using Leaf Building Blocks and Reduced Operator and Variable Sets

Solution (DIVA ADF1 (EXPA (* (LOGA ADF3) (AMINUS (+ (DIVA (POR 0.23999999999999999 A) (POR B (LOGA (POR ADF1 (POR (POR (EXPA A) (EXPA (LOGA (POR (DIVA ADF3 B) (AMINUS ADF3 B))))) (+ (EXPA 0.95000000000000007) A)))))) ADF2) ADF1))))

Where:
ADF1 = (* B A)
ADF2 = (AMINUS 0.41999999999999998 D)
ADF3 = (EXPA (AMINUS 0.41999999999999998 D))

Variables C and E and operator cosa were removed from the variable and operator sets.

Average Absolute Manhattan Distance Error

Training: 0.00825

Testing: 0.00946

In RUN 3 there is a 7.0% increase in accuracy in training and a 10.4% improvement in testing over RUN 2, and a 25.1% increase in accuracy in training and a 29.0% improvement in testing over RUN 1.

Page 22:

The WOLS system is a tool whose objective is to find good approximations for an unknown function f, and to analyze the decomposition of good approximations with the goal of identifying building blocks.

Our experiments demonstrated that significant improvements in the predictive accuracy of the function approximations obtained can be achieved by employing building block analysis and constructive induction. However, more experiments are needed to demonstrate the usefulness of the WOLS-system.

The symbolic regression/genetic programming front-end (SRGP) did quite well for a number of regression system benchmarks. This interesting observation has to be investigated in more detail in the future.

The WOLS system was written in GCL Common LISP Version 2.2. All experiments were run on an Intel 266 MHz Pentium II with 96 MB of RAM.

5. Summary and Conclusion

Page 23:

Future Work

Improving the WOLS system’s performance in the area of classification problems.

Conduct more experiments that use the WOLS system as a secondary learning system, whose goal is to search for good building blocks useful for improving the attribute space for other inductive learning methods such as C4.5.

Construct a better user interface, such as a graphical user interface.

Although the empirical results obtained from the large number of experiments involving regression were encouraging, more work still needs to be done in this area as well.

Page 24:

Experiment 3 - Simple Linear Regression

In this experiment the WOLS system, NLREG, and Matlab were all used to obtain an approximation function for a simple linear regression problem.

NLREG, like most conventional regression analysis packages, is only capable of finding the numeric coefficients for a function whose form (i.e., linear, quadratic, or polynomial) has been prespecified by the user. A poor choice of functional form by the user will in most cases lead to a very poor solution that would not be truly representative of the program’s ability to solve the problem. To avoid this problem, both the data set and the actual functional form were taken from one of the example problems provided with this software.

In Matlab there are basically two different approaches available to perform regression analysis. The quickest method, which was used in this experiment, is to use the command polyfit(x, y, n), which instructs Matlab to fit an nth order polynomial to the data. The second method is similar to NLREG and requires the functional form to be specified by the user.

Training data set of 30 test cases; testing data set of 15 test cases.

The following parameters were used for the WOLS system:

Population Size: 500

Generations: 200

Function set: +, *, loga(x) := |log(x)|, cosa(x) := |cos(x)|, sqrta(x) := |sqrt(x)|, por(a,b) := a + b - a*b, diva(a,b) := |a/b|, aminus(a,b) := |a - b|, expa(x) := e^min(x, 300), sin100a(x) := |sin(100*x)|

variable set: a

Page 25:

NLREG Solution

1: Title "Linear equation: y = p0 + p1*x"; 2: Variables x,y; // Two variables: x and y 3: Parameters p0,p1; // Two parameters to be estimated: p0 and p1 4: Function y = p0 + p1*x; // Model of the function

Stopped due to: Both parameter and relative function convergence.
Proportion of variance explained (R^2) = 0.9708 (97.08%)
Adjusted coefficient of multiple determination (Ra^2) = 0.9698 (96.98%)

---- Descriptive Statistics for Variables ----

Variable   Minimum value   Maximum value   Mean value   Standard dev.
x          1               8               4.538095     2.120468
y          6.3574          22.5456         14.40188     4.390238

---- Calculated Parameter Values ----

Parameter   Initial guess   Final estimate   Standard error   t       Prob(t)
p0          1               5.14417276       0.3336883        15.42   0.00001
p1          1               2.0399976        0.06681681       30.53   0.00001

---- Analysis of Variance ----

Source       DF   Sum of Squares   Mean Square   F value   Prob(F)
Regression   1    542.6514         542.6514      932.15    0.00001
Error        28   16.30013         0.5821475
Total        29   558.9515

y = 5.14417276 + 2.0399976x

Average absolute Manhattan distance error
Training: 0.6361
Testing: 0.7438

Page 26:

Matlab Solution

1st order polynomial (n = 1)
y = 2.0400x + 5.1442
Training: 0.6361   Testing: 0.7438

5th order polynomial (n = 5)
y = 0.0070x^5 - 0.1580x^4 + 1.3365x^3 - 5.2375x^2 + 11.5018x - 1.0632
Training: 0.5838   Testing: 0.8034

10th order polynomial (n = 10)
y = 0.0004x^10 - 0.0187x^9 + 0.3439x^8 - 3.5856x^7 + 23.2763x^6 - 97.5702x^5 + 265.0925x^4 - 456.1580x^3 + 469.7099x^2 - 254.6558x + 59.9535
Training: 0.4700   Testing: 0.8669

In viewing the results above, one of the first things to notice is that as the order of the polynomial increases, the training performance increases, but the testing performance decreases. This is because the higher order polynomials tend to fit the training set data too closely, which in turn causes poorer performance when the testing data set values are applied.

Page 27:

WOLS Solution

(+ (POR (+ X X) 0.080000000000000002) (+ (+ (DIVA 0.95000000000000007 0.38) (+ (DIVA (LOGA (* (POR (EXPA (POR 0.94000000000000006 (LOGA (COSA (+ (DIVA (+ (EXPA (DIVA (+ (EXPA (+ 0.68000000000000005 0.13)) 0.51000000000000001) 0.38)) 0.51000000000000001) 0.38) X))))) 0.42999999999999999) (POR X (LOGA (SIN100A (LOGA (* (POR 0.98999999999999999 0.42999999999999999) (POR X 0.040000000000000001)))))))) 0.38) (COSA (LOGA (* (POR (EXPA (POR 0.46000000000000002 1.0)) 0.46000000000000002) (POR (SIN100A (COSA X)) (+ (DIVA (DIVA 0.83000000000000007 0.17999999999999999) 0.38) X))))))) (SIN100A (* X 0.26000000000000001))))

Average Absolute Manhattan Distance
Training: 0.3788
Testing: 0.6906

Runtime-wise, the WOLS system cannot compete with NLREG or Matlab, each of which found a solution in around 1 second, while the WOLS system spent nearly 4 hours generating its solution.

The WOLS system produced a solution that had 40% better training performance and a 7% better testing performance than both the NLREG and Matlab solutions.

This solution was obtained without using building block analysis and constructive induction.