Title: Experiments with MRDTL
1. A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
2. KDD and Relational Data Mining
- The term KDD stands for Knowledge Discovery in Databases
- Traditional KDD techniques work with instances represented by a single table
- Relational Data Mining is a subfield of KDD in which instances are represented by several tables

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
d1 | Sunny | Hot | High | Weak | No
d2 | Sunny | Hot | High | Strong | No
d3 | Overcast | Hot | High | Weak | Yes
d4 | Overcast | Cold | Normal | Weak | No

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4
3. Motivation
- Importance of relational learning:
  - Growth of data stored in multi-relational databases (MRDBs)
  - Techniques for learning from unstructured data often extract the data into an MRDB
- Promising approach to relational learning:
  - MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  - MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
4. Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models, combining first-order logic and probability theory:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
5. Problem Formulation
- Given: data stored in a relational database
- Goal: build a decision tree for predicting a target attribute in the target table

Example of a multi-relational database:

Schema:
Department: ID, Specialization, Students
Grad.Student: ID, Name, GPA, Publications, Advisor, Department
Staff: ID, Name, Department, Position, Salary

Instances:

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k
6. Propositional Decision Tree Algorithm: Construction Phase

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
d1 | Sunny | Hot | High | Weak | No
d2 | Sunny | Hot | High | Strong | No
d3 | Overcast | Hot | High | Weak | Yes
d4 | Overcast | Cold | Normal | Weak | No

Splitting {d1, d2, d3, d4} on Outlook yields {d1, d2} (Sunny) and {d3, d4} (Overcast).

Tree_induction(D: data)
  A = optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    D_left = split(D, A)
    D_right = split_complement(D, A)
    child_left = Tree_induction(D_left)
    child_right = Tree_induction(D_right)
    return node(A, child_left, child_right)
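The pseudocode above can be sketched as runnable Python. The tiny dataset, the equality-based binary split, and the depth-based stopping criterion below are illustrative assumptions, not the exact propositional learner used in the thesis:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr, value):
    """Information gain of the binary split attr == value vs. attr != value."""
    left = [(x, y) for x, y in data if x[attr] == value]
    right = [(x, y) for x, y in data if x[attr] != value]
    rem = sum(len(part) / len(data) * entropy([y for _, y in part])
              for part in (left, right) if part)
    return entropy([y for _, y in data]) - rem

def tree_induction(data, depth=0, max_depth=3):
    """Recursive binary tree construction, mirroring the slide's pseudocode."""
    labels = [y for _, y in data]
    # stopping_criterion(D): pure node or depth limit -> leaf(D)
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    # optimal_attribute(D): best (attribute, value) pair by information gain
    candidates = {(a, x[a]) for x, _ in data for a in x}
    attr, val = max(candidates, key=lambda av: info_gain(data, *av))
    left = [(x, y) for x, y in data if x[attr] == val]    # split(D, A)
    right = [(x, y) for x, y in data if x[attr] != val]   # split_complement(D, A)
    if not left or not right:
        return Counter(labels).most_common(1)[0][0]
    return (attr, val, tree_induction(left, depth + 1, max_depth),
            tree_induction(right, depth + 1, max_depth))

data = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
        ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
        ({"Outlook": "Overcast", "Wind": "Weak"}, "Yes")]
tree = tree_induction(data)
```

On this toy sample the split with the highest gain is on Outlook, as in the slide's figure.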
7. MR Setting: Splitting Data with Selection Graphs

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k

[Figure: a selection graph and its complement selection graphs split the Staff table into the subsets {p4}, {p1}, and {p2, p3}]
8. What Is a Selection Graph?
- It corresponds to a subset of the instances from the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- An open edge means "have at least one" matching record
- A closed edge means "have none" of the matching records

[Figure: a selection graph over Staff, Grad.Student, and Department with the condition Specialization = math]
9. Transforming Selection Graphs into SQL Queries

Staff node with the condition Position = Professor:
  select distinct T0.id
  from Staff T0
  where T0.Position = 'Professor'

Open edge Staff -> Grad.Student:
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list

Closed edge Staff -> Grad.Student (staff advising no students):
  select distinct T0.id
  from Staff T0
  where T0.id not in (select T1.Advisor from Graduate_Student T1)

Open edge plus closed edge with the condition GPA > 3.9 (staff with at least one advisee, but none with GPA > 3.9):
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor
    and T0.id not in (select T1.Advisor from Graduate_Student T1 where T1.GPA > 3.9)
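These queries can be checked against the deck's toy tables. The sketch below loads them into an in-memory SQLite database; the table and column names follow the slides, while the setup code itself is an assumption for illustration:

```python
import sqlite3

# Build the toy Staff / Graduate_Student tables from the slides in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Staff (id TEXT, name TEXT, department TEXT, position TEXT, salary TEXT);
CREATE TABLE Graduate_Student (id TEXT, name TEXT, gpa REAL, publications INT,
                               advisor TEXT, department TEXT);
INSERT INTO Staff VALUES
  ('p1','Dale','d1','Professor','70-80k'),
  ('p2','Martin','d3','Postdoc','30-40k'),
  ('p3','Victor','d2','VisitorScientist','40-50k'),
  ('p4','David','d3','Professor','80-100k');
INSERT INTO Graduate_Student VALUES
  ('s1','John',2.0,4,'p1','d3'),
  ('s2','Lisa',3.5,10,'p4','d3'),
  ('s3','Michel',3.9,3,'p4','d4');
""")

# Open edge Staff -> Grad.Student: staff advising at least one student.
advisors = [r[0] for r in con.execute(
    "SELECT DISTINCT T0.id FROM Staff T0, Graduate_Student T1 "
    "WHERE T0.id = T1.advisor")]

# Closed edge: staff advising no student at all.
non_advisors = [r[0] for r in con.execute(
    "SELECT DISTINCT T0.id FROM Staff T0 WHERE T0.id NOT IN "
    "(SELECT T1.advisor FROM Graduate_Student T1)")]
```

On this data the open-edge query selects p1 and p4, and the closed-edge query selects the complementary set p2 and p3.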
10. MR Decision Tree
- Each node contains a selection graph
- Each child's selection graph is a supergraph of the parent's selection graph
11. How to Choose Selection Graphs in Nodes?
- Problem: there are too many supergraph selection graphs to choose from at each node
- Solution:
  - start with an initial selection graph
  - use a greedy heuristic to choose supergraph selection graphs (refinements)
  - use binary splits for simplicity
  - for each refinement, get the complement refinement
  - choose the best refinement based on the information gain criterion
- Problem: some potentially good refinements may give no immediate benefit
- Solution:
  - look-ahead capability
12. Refinements of Selection Graph

[Figure: a selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]

- Add a condition to a node: explores attribute information in the tables
- Add a present edge and open node: explores relational properties between the tables
13. Refinements of Selection Graph

Adding a condition to the Staff node:
- refinement: add Position = Professor
- complement refinement: add Position != Professor

[Figure: both refinements of the selection graph with Specialization = math]
14. Refinements of Selection Graph

Adding a condition to the Grad.Student node:
- refinement: add GPA > 2.0
- complement refinement: add the negated condition (GPA <= 2.0)

[Figure: both refinements of the selection graph with Specialization = math]
15. Refinements of Selection Graph

Adding a condition to the Department node:
- refinement: add Students > 200
- complement refinement: add the negated condition (Students <= 200)

[Figure: both refinements of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
16. Refinements of Selection Graph

Adding a present edge and open node:

[Figure: refinement and complement refinement of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]

Note: the information gain of this refinement is 0.
17-19. Refinements of Selection Graph

[Figures: successive add-edge/open-node refinements and their complement refinements of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
20. Look-Ahead Capability

[Figure: a refinement and its complement refinement of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
21. Look-Ahead Capability

A look-ahead refinement adds an edge and a condition in a single step, e.g. extending the selection graph (Specialization = math, GPA > 3.9) with Students > 200.

[Figure: the look-ahead refinement and its complement]
22. MRDTL Algorithm: Construction Phase
- For each non-leaf node:
  - consider all possible refinements, and their complements, of the node's selection graph
  - choose the best one based on the information gain criterion
  - create children nodes

[Figure: a partially grown tree of selection graphs over Staff and Grad.Student]
23. MRDTL Algorithm: Classification Phase
- For each leaf:
  - apply the selection graph of the leaf to the test data
  - classify the resulting instances with the classification of the leaf

[Figure: a decision tree of selection graphs over Staff, Grad.Student, and Department; leaves predict salary classes, e.g. 70-80k for the branch with GPA > 3.9, Position = Professor, Spec = math, and 80-100k for the branch with Spec = physics]
24. The Most Time-Consuming Operations of MRDTL
- Entropy associated with this selection graph (Specialization = math):

  E = - sum_i (n_i / N) log (n_i / N)

- Query associated with the counts n_i:

  select distinct Staff.Salary, count(distinct Staff.ID)
  from Staff, Grad.Student, Department
  where join_list and condition_list
  group by Staff.Salary

ID | Name | Dep | Position | Salary
p1 | Dale | d1 | Postdoc | c1
p2 | Martin | d1 | Postdoc | c1
p3 | David | d4 | Postdoc | c1
p4 | Peter | d3 | Postdoc | c1
p5 | Adrian | d2 | Professor | c2
p6 | Doina | d3 | Professor | c2

The result of the query is the list of pairs (c_i, n_i); here n_1 = 4 and n_2 = 2.
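As a minimal sketch (assuming base-2 logarithms), the entropy can be computed directly from the (c_i, n_i) list returned by the query:

```python
import math

def entropy_from_counts(counts):
    """E = -sum_i (n_i / N) log2 (n_i / N), from the (c_i, n_i) query result."""
    n_total = sum(counts.values())
    return -sum((n / n_total) * math.log2(n / n_total) for n in counts.values())

# Counts as returned by the GROUP BY query on the slide's table:
# four staff members fall in class c1, two in class c2.
counts = {"c1": 4, "c2": 2}
e = entropy_from_counts(counts)  # about 0.918 bits
```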
25. The Most Time-Consuming Operations of MRDTL
- Entropy associated with each of the refinements (e.g. adding GPA > 2.0 to the selection graph with Specialization = math):

  select distinct Staff.Salary, count(distinct Staff.ID)
  from table_list
  where join_list and condition_list
  group by Staff.Salary
26. A Way to Speed Up: Eliminate Redundant Calculations
- Problem: for a selection graph with 162 nodes, the time to execute a query is more than 3 minutes!
- Redundancy in calculation: for this selection graph (Specialization = math), the tables Staff and Grad.Student will be joined over and over for all the children refinements of the tree
- A way to fix it: calculate the join only once and save it for all further calculations
27. Speed-Up Method: Sufficient Tables

The sufficient table S for the selection graph (Specialization = math):

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2
28. Speed-Up Method: Sufficient Tables
- Entropy associated with this selection graph (Specialization = math):

  E = - sum_i (n_i / N) log (n_i / N)

- Query associated with the counts n_i, now run against the sufficient table S:

  select S.Salary, count(distinct S.Staff_ID)
  from S
  group by S.Salary

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2

The result of the query is the list of pairs (c_i, n_i); here n_1 = 4 and n_2 = 2.
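A minimal sketch of this idea, assuming the sufficient table S above has already been materialized in an in-memory SQLite database; once S exists, the class counts come from a single cheap GROUP BY with no repeated joins:

```python
import sqlite3

# Materialize the sufficient table S once (in the algorithm this is the saved
# join of Staff, Grad.Student, and Department restricted by the selection graph).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE S (staff_id TEXT, grad_student_id TEXT,"
            " dep_id TEXT, salary TEXT)")
con.executemany("INSERT INTO S VALUES (?,?,?,?)", [
    ("p1", "s1", "d1", "c1"), ("p2", "s1", "d1", "c1"),
    ("p3", "s6", "d4", "c1"), ("p4", "s3", "d3", "c1"),
    ("p5", "s1", "d2", "c2"), ("p6", "s9", "d3", "c2"),
])

# The slide's counting query against S alone.
counts = dict(con.execute(
    "SELECT salary, COUNT(DISTINCT staff_id) FROM S GROUP BY salary"))
```

On the slide's data this yields n_1 = 4 for class c1 and n_2 = 2 for class c2.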
29. Speed-Up Method: Sufficient Tables

Query associated with the add-condition refinement (on attribute A of table X):

  select S.Salary, X.A, count(distinct S.Staff_ID)
  from S, X
  where S.X_ID = X.ID
  group by S.Salary, X.A

Calculation for the complement refinement:

  count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
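The subtraction rule means the complement refinement needs no query of its own. It can be sketched in a few lines of Python; the example counts are hypothetical:

```python
def complement_counts(counts_all, counts_refined):
    """count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S)):
    class counts for the complement refinement follow from counts already known,
    so only one query per refinement pair is needed."""
    return {c: counts_all[c] - counts_refined.get(c, 0) for c in counts_all}

# Hypothetical counts: 4 c1 / 2 c2 overall; the refinement keeps 3 c1 / 1 c2.
comp = complement_counts({"c1": 4, "c2": 2}, {"c1": 3, "c2": 1})
```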
30. Speed-Up Method: Sufficient Tables

Query associated with the add-edge refinement (adding an edge e from table X to table Y):

  select S.Salary, count(distinct S.Staff_ID)
  from S, X, Y
  where S.X_ID = X.ID and e.cond
  group by S.Salary

Calculation for the complement refinement:

  count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
31. Speed-Up Method
- Significant speed-up in obtaining the counts needed for the calculation of entropy and information gain
- The speed-up comes at the cost of the additional space used by the algorithm
32. Handling Missing Values

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p1 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | ? | 70-80k
p2 | Martin | d3 | ? | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | ? | 80-100k

- For each attribute which has missing values, we build a Naive Bayes model: the attribute with missing values (e.g. Staff.Position, with value b) is the class, and the related attributes (Staff.Name with value a, Staff.Dep, Department.Spec, ...) are the evidence, with conditional probabilities such as P(a | b)
33. Handling Missing Values

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p1 | d3

Department: ID | Specialization | Students
d1 | Math | 1000

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | ? | 70-80k

- The most probable value for the missing attribute is then calculated by the formula:

  P(v_i | X1.A1, X2.A2, X3.A3)
    = P(X1.A1, X2.A2, X3.A3 | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3)
    = P(X1.A1 | v_i) P(X2.A2 | v_i) P(X3.A3 | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3)

  where the second equality uses the Naive Bayes independence assumption.
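A minimal sketch of this imputation scheme in Python, assuming flat records, add-one smoothing, and a single evidence attribute; the record values and feature choice are illustrative, not the thesis's exact model:

```python
from collections import Counter, defaultdict

def naive_bayes_impute(rows, target, features):
    """Fill missing values of `target` (None) with the most probable value
    under a Naive Bayes model: argmax_v P(v) * prod_j P(x_j | v)."""
    complete = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in complete)
    # conditional counts for P(feature = value | target = v)
    cond = defaultdict(Counter)
    for r in complete:
        for f in features:
            cond[(f, r[target])][r[f]] += 1

    def score(v, r):
        # unnormalized posterior; denominator P(evidence) is the same for all v
        p = prior[v] / len(complete)
        for f in features:
            c = cond[(f, v)]
            p *= (c[r[f]] + 1) / (sum(c.values()) + 1)  # add-one smoothing
        return p

    for r in rows:
        if r[target] is None:
            r[target] = max(prior, key=lambda v: score(v, r))
    return rows

# Hypothetical staff records; Dale's position is missing.
rows = [
    {"name": "Victor", "department": "d2", "position": "VisitorScientist"},
    {"name": "Emily",  "department": "d2", "position": "VisitorScientist"},
    {"name": "Jane",   "department": "d1", "position": "Professor"},
    {"name": "Dale",   "department": "d2", "position": None},
]
filled = naive_bayes_impute(rows, "position", ["department"])
```

Here both complete d2 records are VisitorScientist, so that value wins for Dale.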
34. Experimental Results: Mutagenesis
- The most widely used database in ILP.
- Describes molecules of certain nitroaromatic compounds.
- Goal: predict their mutagenic activity (label attribute), i.e. the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
- Two subsets: regression-friendly (188 molecules) and regression-unfriendly (42 molecules). We used only the regression-friendly subset.
- 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide progressively richer descriptions of the examples. We used the B2 level.
35. Experimental Results: Mutagenesis
- Schema of the mutagenesis database
- Results of 10-fold cross-validation for the regression-friendly set:

Data set | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
mutagenesis | 87.5% | 3 | 9 | 28.45 secs | 52.15 secs

Best-known reported accuracy is 86%.
36. Experimental Results: KDD Cup 2001
- Consists of a variety of details about the various genes of one particular type of organism.
- Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
- 2 tasks: prediction of gene/protein localization and function
- 862 training genes, 381 test genes.
- Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table
37. Experimental Results: KDD Cup 2001

Localization | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
With handling missing values | 76.11% | 19 | 213 | 202.9 secs | 1256.38 secs
Without handling missing values | 50.14% | 33 | 575 | 550.76 secs | 2257.20 secs

Best-known reported accuracy is 72.1%.

Function | Accuracy | Sel. graph size (max) | Tree size (max) | Time with speed-up | Time without speed-up
With handling missing values | 91.44% | 9 | 63 | 151.19 secs | 307.83 secs
Without handling missing values | 88.56% | 9 | 19 | 61.29 secs | 118.41 secs

Best-known reported accuracy is 93.6%.
38. Experimental Results: PKDD 2001 Discovery Challenge
- Consists of 5 tables: DIAGNOSIS, ANA_PATTERN, PATIENT_INFO, THROMBOSIS, ANTIBODY_EXAM
- Target table consists of 1239 records
- The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
- The results for 5x2 cross-validation:

Data set | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
thrombosis | 98.1% | 31 | 71 | 127.75 secs | 198.22 secs

Best-known reported accuracy is 99.28%.
39. Summary
- The new implementation significantly outperforms the original MRDTL implementation in terms of running time
- The accuracy results are comparable with the best reported results obtained using different data-mining algorithms

Future Work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularizations
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction (Zhang et al., 2002)
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
40. Thanks to
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members, Drena Dobbs and Yan-Bin Jia, for their help
- Professors and lecturers of the Computer Science department, for the knowledge that they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part