TRANSCRIPT
Anna Atramentov
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Drena Leigh Dobbs
Yan-Bin Jia
Iowa State University,
Ames, Iowa
2003
A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments
KDD and Relational Data Mining
The term KDD stands for Knowledge Discovery in Databases. Traditional KDD techniques work with instances represented by a single table.
Relational Data Mining is a subfield of KDD in which instances are represented by several tables.
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
d1   Sunny     Hot          High      Weak    No
d2   Sunny     Hot          High      Strong  No
d3   Overcast  Hot          High      Weak    Yes
d4   Overcast  Cold         Normal    Weak    No
Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k

Graduate Student
ID  Name    GPA  #Publications  Advisor  Department
s1  John    2.0  4              p1       d3
s2  Lisa    3.5  10             p4       d3
s3  Michel  3.9  3              p4       d4
Motivation
Importance of relational learning:
- Growth of data stored in MRDBs (multi-relational databases)
- Techniques for learning from unstructured data often extract the data into an MRDB

Promising approach to relational learning:
- MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals:
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
  - Other approaches combining first-order logic and probability theory
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRMs extended with cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
Problem Formulation
Given: data stored in a relational database.
Goal: build a decision tree for predicting the target attribute in the target table.

Example of a multi-relational database (schema and instances):
Schema:
Department(ID, Specialization, #Students)
Staff(ID, Name, Department, Position, Salary)
Grad.Student(ID, Name, GPA, #Publications, Advisor, Department)

Instances:

Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k

Graduate Student
ID  Name    GPA  #Publications  Advisor  Department
s1  John    2.0  4              p1       d3
s2  Lisa    3.5  10             p4       d3
s3  Michel  3.9  3              p4       d4
Tree_induction(D: data)
  A := optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    D_left := split(D, A)
    D_right := split_complement(D, A)
    child_left := Tree_induction(D_left)
    child_right := Tree_induction(D_right)
    return node(A, child_left, child_right)
Propositional decision tree algorithm. Construction phase
(Figure: the PlayTennis table {d1, d2, d3, d4} is split on "Outlook = sunny". The sunny branch {d1, d2} is a pure leaf labeled No; the not-sunny branch {d3, d4} is split further on Temperature: hot gives Yes (d3), not hot gives No (d4).)
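The Tree_induction procedure above can be made concrete as a short runnable sketch, here in Python on the PlayTennis table from the slides (function and variable names are my own, not from the thesis):

```python
import math
from collections import Counter

# PlayTennis instances from the slide: (attributes, label)
DATA = [
    ({"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "No"),
    ({"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak"}, "No"),
]

def entropy(rows):
    counts = Counter(label for _, label in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def optimal_split(rows):
    """Greedily pick the (attribute, value) binary test with the best information gain."""
    best = None
    base = entropy(rows)
    for attr in rows[0][0]:
        for value in {r[attr] for r, _ in rows}:
            left = [x for x in rows if x[0][attr] == value]
            right = [x for x in rows if x[0][attr] != value]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            if best is None or gain > best[0]:
                best = (gain, attr, value)
    return best

def tree_induction(rows):
    # stopping criterion: pure node or no useful split -> leaf with majority label
    labels = {label for _, label in rows}
    split = optimal_split(rows) if len(labels) > 1 else None
    if split is None:
        return Counter(l for _, l in rows).most_common(1)[0][0]
    _, attr, value = split
    left = tree_induction([x for x in rows if x[0][attr] == value])
    right = tree_induction([x for x in rows if x[0][attr] != value])
    return (attr, value, left, right)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, value, left, right = tree
        tree = left if row[attr] == value else right
    return tree
```

On this data the first chosen split is on Outlook, matching the figure; the mixed branch is then split on Temperature.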
MR setting. Splitting data with Selection Graphs
(Figure: the Staff table is split using the selection graph "Staff having at least one Grad.Student with GPA > 2.0", which selects {p4}; the complement selection graphs select the remaining staff records, {p1} and {p2, p3}.)
What is a selection graph?
- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- Open edge = "have at least one"
- Closed edge = "have none of"

(Figure: an example selection graph over Staff, Grad.Student with GPA > 3.9, and Department with Specialization = math.)
Transforming selection graphs into SQL queries

Staff with Position = Professor:
  select distinct T0.ID
  from Staff T0
  where T0.Position = 'Professor'

Staff having at least one Grad.Student:
  select distinct T0.ID
  from Staff T0, Graduate_Student T1
  where T0.ID = T1.Advisor

Staff having no Grad.Student:
  select distinct T0.ID
  from Staff T0
  where T0.ID not in (select T2.Advisor from Graduate_Student T2)

Staff having at least one Grad.Student, but none with GPA > 3.9:
  select distinct T0.ID
  from Staff T0, Graduate_Student T1
  where T0.ID = T1.Advisor
    and T0.ID not in (select T2.Advisor from Graduate_Student T2 where T2.GPA > 3.9)

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list
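As a sanity check, the last selection-graph query above can be run against a toy copy of the example database, e.g. with Python's sqlite3 (the table layout follows the slides; this is an illustration, not the thesis implementation):

```python
import sqlite3

# Toy instance of the example database from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (ID TEXT, Name TEXT, Department TEXT, Position TEXT, Salary TEXT);
INSERT INTO Staff VALUES
  ('p1','Dale','d1','Professor','70-80k'),
  ('p2','Martin','d3','Postdoc','30-40k'),
  ('p3','Victor','d2','VisitorScientist','40-50k'),
  ('p4','David','d3','Professor','80-100k');
CREATE TABLE Graduate_Student (ID TEXT, Name TEXT, GPA REAL, Publications INTEGER,
                               Advisor TEXT, Department TEXT);
INSERT INTO Graduate_Student VALUES
  ('s1','John',2.0,4,'p1','d3'),
  ('s2','Lisa',3.5,10,'p4','d3'),
  ('s3','Michel',3.9,3,'p4','d4');
""")

# Selection graph: Staff with at least one Grad.Student, but none with GPA > 3.9.
rows = conn.execute("""
SELECT DISTINCT T0.ID
FROM Staff T0, Graduate_Student T1
WHERE T0.ID = T1.Advisor
  AND T0.ID NOT IN (SELECT T2.Advisor FROM Graduate_Student T2 WHERE T2.GPA > 3.9)
""").fetchall()
print(sorted(r[0] for r in rows))
```

On this data both p1 and p4 advise students, and no student has GPA strictly above 3.9, so both IDs are returned.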
MR decision tree

(Figure: a multi-relational decision tree; the root holds the selection graph containing only the Staff node, and the children refine it, e.g. with a Grad.Student edge and the condition GPA > 3.9.)

- Each node contains a selection graph
- Each child selection graph is a supergraph of the parent selection graph
How to choose selection graphs in nodes?

Problem: there are too many supergraph selection graphs to choose from at each node.
Solution:
- start with an initial selection graph
- use a greedy heuristic to choose among supergraph selection graphs: refinements
- use binary splits for simplicity
- for each refinement, get its complement refinement
- choose the best refinement based on the information gain criterion

Problem: some potentially good refinements may give no immediate benefit.
Solution: look-ahead capability.
Refinements of selection graph

- Add condition to a node: explores attribute information in the tables
- Add present edge and open node: explores relational properties between the tables

(Figure: the running example selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).)
Refinements of selection graph

(Figure: add condition to the Staff node. The refinement adds "Position = Professor" to the Staff node of the running selection graph; the complement refinement adds "Position != Professor".)
Refinements of selection graph

(Figure: add condition to the open Grad.Student node. The refinement adds "GPA > 2.0" to the open Grad.Student node; the complement refinement instead adds a closed Grad.Student node with "GPA > 2.0".)
Refinements of selection graph

(Figure: add condition to the Department node. The refinement adds "#Students > 200" to the Department node; the complement refinement instead adds a closed Department node with "#Students > 200".)
Refinements of selection graph

(Figure: add present edge and open node. The refinement adds a present edge to a new open Department node; the complement refinement adds the corresponding closed edge and node.)

Note: information gain = 0
Refinements of selection graph

(Figure: add present edge and open node. The refinement adds a present edge to a new open Staff node; the complement refinement adds the corresponding closed edge and node.)
Refinements of selection graph

(Figure: add present edge and open node. The refinement adds a present edge from the Department node to a new open Staff node; the complement refinement adds the corresponding closed edge and node.)
Refinements of selection graph

(Figure: add present edge and open node. The refinement adds a present edge from the Department node to a new open Grad.Student node; the complement refinement adds the corresponding closed edge and node.)
Look ahead capability

(Figure: when adding the Department edge and node alone gives no information gain, the look-ahead step considers the two-step refinement that adds the edge and node together with a further refinement; the complement refinement is formed accordingly.)
Look ahead capability

(Figure: the look-ahead refinement adds the Department edge and node together with the condition "#Students > 200" in a single step; the complement refinement adds the corresponding closed pattern.)
MRDTL algorithm. Construction phase

(Figure: the multi-relational decision tree grown from the Staff selection graph, with children refining on Grad.Student and GPA > 3.9.)

For each non-leaf node:
- consider all possible refinements of the node's selection graph, and their complements
- choose the best one based on the information gain criterion
- create children nodes
MRDTL algorithm. Classification phase

(Figure: leaves of the tree hold selection graphs, e.g. "Staff with Grad.Student (GPA > 3.9) and Department (Spec = math)" classified as 70-80k, and "Staff with Grad.Student (GPA > 3.9), Department (Spec = physics), Position = Professor" classified as 80-100k.)

For each leaf:
- apply the selection graph of the leaf to the test data
- classify the resulting instances with the classification of the leaf
The most time consuming operations of MRDTL

Entropy associated with this selection graph (Staff with Grad.Student (GPA > 3.9) and Department (Specialization = math)):

ID  Name    Dep  Position   Salary
p1  Dale    d1   Postdoc    c1
p2  Martin  d1   Postdoc    c1
p3  David   d4   Postdoc    c1
p4  Peter   d3   Postdoc    c1
p5  Adrian  d2   Professor  c2
p6  Doina   d3   Professor  c2
…

E = - Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i:

select distinct Staff.Salary, count(distinct Staff.ID)
from Staff, Grad_Student, Department
where join_list and condition_list
group by Staff.Salary

The result of the query is a list of pairs (c_i, n_i).
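Given the (c_i, n_i) list returned by this query, the entropy formula is a one-liner; a minimal Python sketch (function name is my own):

```python
import math

def entropy_from_counts(counts):
    """E = -sum_i (n_i / N) * log2(n_i / N), computed from the per-class
    counts n_i that the group-by query returns."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# Counts from the slide's example table: 4 staff in class c1, 2 in class c2.
print(entropy_from_counts([4, 2]))
```

With counts [4, 2] this gives about 0.918 bits.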
The most time consuming operations of MRDTL

Entropy associated with each of the refinements (e.g. adding "GPA > 2.0" to the running selection graph, and its complement):

select distinct Staff.Salary, count(distinct Staff.ID)
from table_list
where join_list and condition_list
group by Staff.Salary
A way to speed up - eliminate redundant calculations

Problem: for a selection graph with 162 nodes, the time to execute a single query is more than 3 minutes!

Redundancy in calculation: for this selection graph, the tables Staff and Grad.Student will be joined over and over for all the children refinements in the tree.

A way to fix it: perform the join only once and save the result for all further calculations.
Speed Up Method. Sufficient tables

(Selection graph: Staff with Grad.Student (GPA > 3.9) and Department (Specialization = math).)

Sufficient table S:

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2
…
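One way to materialize such a sufficient table is a single SQL join, sketched here with sqlite3 (the toy rows and GPA values are invented for illustration; only the table shape and the salary classes c1/c2 follow the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (ID TEXT, Department TEXT, Salary TEXT);
INSERT INTO Staff VALUES ('p1','d1','c1'), ('p2','d1','c1'), ('p4','d3','c1'),
                         ('p5','d2','c2'), ('p6','d3','c2');
CREATE TABLE Grad_Student (ID TEXT, GPA REAL, Advisor TEXT);
INSERT INTO Grad_Student VALUES ('s1',3.95,'p1'), ('s3',3.96,'p4'), ('s9',3.97,'p6');

-- Materialize the sufficient table S once: every (staff, student) combination
-- satisfying the current selection graph, keyed by Staff_ID.
CREATE TABLE S AS
SELECT T0.ID AS Staff_ID, T1.ID AS GradStudent_ID,
       T0.Department AS Dep_ID, T0.Salary AS Salary
FROM Staff T0, Grad_Student T1
WHERE T0.ID = T1.Advisor AND T1.GPA > 3.9;
""")

# All later entropy queries run against S alone -- no repeated joins:
rows = conn.execute(
    "SELECT Salary, COUNT(DISTINCT Staff_ID) FROM S GROUP BY Salary").fetchall()
print(rows)
```

The join is paid for once when S is created; every refinement's count query afterwards touches only S.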
Speed Up Method. Sufficient tables

Entropy associated with this selection graph, computed from the sufficient table S:

E = - Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i:

select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary

The result of the query is a list of pairs (c_i, n_i).
Speed Up Method. Sufficient tables

Queries associated with the add-condition refinement:

select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculations for the complement refinement:

count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
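The complement-refinement formula means the second set of counts never needs its own query; a minimal Python sketch with invented example counts:

```python
from collections import Counter

def complement_counts(counts_all, counts_refined):
    """count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S)):
    per-class counts for the complement refinement come for free from
    the counts over S and over the refinement R(S)."""
    return Counter({c: counts_all[c] - counts_refined.get(c, 0)
                    for c in counts_all})

# Example (invented): counts over S and over a refinement R(S), per salary class.
counts_S = Counter({"c1": 4, "c2": 2})
counts_R = Counter({"c1": 1, "c2": 2})
print(complement_counts(counts_S, counts_R))
```

So each candidate split costs one count query instead of two.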
Speed Up Method. Sufficient tables

Queries associated with the add-edge refinement:

select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID and e.cond
group by S.Salary

Calculations for the complement refinement:

count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
Speed Up Method
- Significant speed up in obtaining the counts needed to calculate entropy and information gain
- The speed up is achieved at the cost of the additional space used by the algorithm
Handling Missing Values

For each attribute that has missing values we build a Naïve Bayes model: the attribute (e.g. Staff.Position) is predicted from related attributes (Staff.Name, Staff.Dep, Department.Spec, …) using conditional probabilities P(a | b) estimated from the records where the attribute is observed.

Tables with missing Position values:

Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          ?                 70-80k
p2  Martin  d3          ?                 30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          ?                 80-100k

Graduate Student
ID  Name    GPA  #Public.  Advisor  Department
s1  John    2.0  4         p1       d3
s2  Lisa    3.5  10        p1       d3
s3  Michel  3.9  3         p4       d4
Handling Missing Values

The most probable value for the missing attribute is then calculated by the formula:

P(v_i | X1.A1, X2.A2, X3.A3, …)
  = P(X1.A1, X2.A2, X3.A3, … | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3, …)
  = P(X1.A1 | v_i) P(X2.A2 | v_i) P(X3.A3 | v_i) … P(v_i) / P(X1.A1, X2.A2, X3.A3, …)

(Example: predicting the missing Position of p1 (Dale, d1, 70-80k) from the related Department and Graduate Student records.)
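This imputation can be sketched as a tiny Naïve Bayes in Python (attribute names loosely follow the slides; the Laplace smoothing is my addition to keep unseen values from zeroing the product):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Estimate P(v) and P(attr = a | v) from rows (dicts) where `target` is known."""
    prior = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)  # (attr, v) -> Counter over attribute values
    for r in rows:
        v = r[target]
        for attr, a in r.items():
            if attr != target:
                cond[(attr, v)][a] += 1
    return prior, cond

def impute(row, prior, cond, target):
    """Fill a missing `target` with argmax_v P(v) * prod_attr P(attr | v)."""
    total = sum(prior.values())
    def score(v):
        s = prior[v] / total
        for attr, a in row.items():
            if attr != target:
                # Laplace smoothing so unseen values don't zero out the product
                s *= (cond[(attr, v)][a] + 1) / (prior[v] + len(cond[(attr, v)]) + 1)
        return s
    return max(prior, key=score)

# Toy training rows (invented values in the spirit of the slides):
rows = [
    {"Dep": "d1", "Spec": "math", "Position": "Professor"},
    {"Dep": "d1", "Spec": "math", "Position": "Professor"},
    {"Dep": "d3", "Spec": "cs", "Position": "Postdoc"},
]
prior, cond = train_naive_bayes(rows, "Position")
print(impute({"Dep": "d1", "Spec": "math"}, prior, cond, "Position"))
```

The argmax over v_i drops the shared denominator P(X1.A1, X2.A2, …), exactly as in the formula above.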
Experimental results. Mutagenesis
The most widely used database in ILP.
Describes molecules of certain nitro aromatic compounds.
Goal: predict their mutagenic activity (label attribute) – ability to cause DNA to mutate. High mutagenic activity can cause cancer.
Two subsets: regression friendly (188 molecules) and regression unfriendly (42 molecules). We used only the regression friendly subset.
5 levels of background knowledge: B0, B1, B2, B3, B4. They provide richer descriptions of the examples. We used B2 level.
Experimental results. Mutagenesis
Results of 10-fold cross-validation for the regression friendly set:

Data Set     Accuracy  Sel graph size (max)  Tree size  Time with speed up  Time without speed up
mutagenesis  87.5%     3                     9          28.45               52.15

Best-known reported accuracy is 86%.
(Figure: schema of the mutagenesis database.)
Consists of a variety of details about the various genes of one particular type of organism.
Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
2 tasks: prediction of gene/protein localization and function. 862 training genes, 381 test genes.
Experimental results. KDD Cup 2001
Many attribute values are missing: 70% of CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in composition table
localization                     Accuracy  Sel graph size (max)  Tree size  Time with speed up  Time without speed up
With handling missing values     76.11%    19                    213        202.9 secs          1256.38 secs
Without handling missing values  50.14%    33                    575        550.76 secs         2257.20 secs
Experimental results. KDD Cup 2001
function                         Accuracy  Sel graph size (max)  Tree size (max)  Time with speed up  Time without speed up
With handling missing values     91.44%    9                     63               151.19 secs         307.83 secs
Without handling missing values  88.56%    9                     19               61.29 secs          118.41 secs
Best-known reported accuracy: 93.6% for the function task and 72.1% for the localization task.
Experimental results. PKDD 2001 Discovery Challenge
Consists of 5 tables. The target table consists of 1239 records. The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table.
The results for 5:2 cross-validation:

Data Set    Accuracy  Sel graph size (max)  Tree size  Time with speed up  Time without speed up
thrombosis  98.1%     31                    71         127.75              198.22

Best-known reported accuracy is 99.28%.
(Schema: PATIENT_INFO, DIAGNOSIS, THROMBOSIS, ANTIBODY_EXAM, ANA_PATTERN.)
Summary
- The algorithm significantly outperforms the earlier MRDTL implementation (Leiva, 2002) in terms of running time
- The accuracy results are comparable with the best reported results obtained using different data-mining algorithms
Future work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularization
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
Thanks to
Dr. Honavar for providing guidance, help and support throughout this research
Colleagues from the Artificial Intelligence Lab for various helpful discussions
My committee members, Drena Dobbs and Yan-Bin Jia, for their help
Professors and lecturers of the Computer Science department for the knowledge they gave me through lectures and discussions
Iowa State University and the Computer Science department for funding this research in part