Title: Experiments with MRDTL, a Multi-Relational Decision Tree Learning Algorithm
1. A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
2. KDD and Relational Data Mining
- KDD stands for Knowledge Discovery in Databases
- Traditional KDD techniques work with instances represented by a single table
- Relational Data Mining is a subfield of KDD in which instances are represented by several tables
3. Motivation
- Importance of relational learning:
  - growth of data stored in multi-relational databases (MRDBs)
  - techniques for learning from unstructured data often extract the data into an MRDB
- Promising approaches to relational learning:
  - the MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  - the MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
4. Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
  - combining first-order logic and probability theory
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
5. Problem Formulation
- Given: data stored in a relational database
- Goal: build a decision tree for predicting the target attribute in the target table

[Figure: example of a multi-relational database schema and its instances]
6. Propositional Decision Tree Algorithm: Construction Phase

  Tree_induction(D: data)
      A = optimal_attribute(D)
      if stopping_criterion(D)
          return leaf(D)
      else
          D_left  = split(D, A)
          D_right = split_complement(D, A)
          child_left  = Tree_induction(D_left)
          child_right = Tree_induction(D_right)
          return node(A, child_left, child_right)

[Figure: example decision tree over instances d1-d4, splitting on the Outlook attribute]
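The construction phase above can be written as a short runnable sketch. The instance representation (dicts with a "label" key) and the concrete helpers (majority-class leaves, binary equality splits scored by information gain) are illustrative assumptions, not the thesis's exact propositional learner.

```python
import math
from collections import Counter

# Runnable sketch of the Tree_induction pseudocode. Instances are dicts
# mapping attribute names to values; the class is stored under "label".

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def stopping_criterion(rows):
    # stop when the node is pure
    return len({r["label"] for r in rows}) <= 1

def leaf(rows):
    # majority-class leaf
    return {"leaf": Counter(r["label"] for r in rows).most_common(1)[0][0]}

def optimal_attribute(rows, attributes):
    # best (attribute, value) binary equality split by information gain
    best, best_gain = None, 0.0
    for a in attributes:
        for v in {r[a] for r in rows}:
            left = [r for r in rows if r[a] == v]
            right = [r for r in rows if r[a] != v]
            if not left or not right:
                continue
            gain = entropy(rows) - (len(left) * entropy(left)
                                    + len(right) * entropy(right)) / len(rows)
            if gain > best_gain:
                best, best_gain = (a, v), gain
    return best

def tree_induction(rows, attributes):
    if stopping_criterion(rows):
        return leaf(rows)
    split = optimal_attribute(rows, attributes)
    if split is None:                           # no useful split left
        return leaf(rows)
    a, v = split
    d_left = [r for r in rows if r[a] == v]     # split(D, A)
    d_right = [r for r in rows if r[a] != v]    # split_complement(D, A)
    return {"test": split,
            "left": tree_induction(d_left, attributes),
            "right": tree_induction(d_right, attributes)}
```

The MR algorithm on the following slides keeps this same greedy recursive skeleton but replaces attribute splits with selection-graph refinements.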
7. MR Setting: Splitting Data with Selection Graphs

[Figure: instances of the Department, Graduate Student, and Staff tables split by a selection graph and its complement selection graph]
8. What Is a Selection Graph?
- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- An open edge selects instances that have at least one matching record
- A closed edge selects instances that have no matching record

[Figure: example selection graph over Grad. Student, Department, and Staff with the condition Specialization = math]
9. Transforming Selection Graphs into SQL Queries

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list

Staff node with the condition Position = 'Professor':
  Select distinct T0.id
  From Staff T0
  Where T0.Position = 'Professor'

Staff with a present (open) edge to Grad. Student:
  Select distinct T0.id
  From Staff T0, Graduate_Student T1
  Where T0.id = T1.Advisor

Staff with a closed edge to Grad. Student:
  Select distinct T0.id
  From Staff T0
  Where T0.id not in (Select T1.Advisor
                      From Graduate_Student T1)

Staff with a present edge to Grad. Student and a closed edge to Grad. Student with GPA > 3.9:
  Select distinct T0.id
  From Staff T0, Graduate_Student T1
  Where T0.id = T1.Advisor
    and T0.id not in (Select T1.Advisor
                      From Graduate_Student T1
                      Where T1.GPA > 3.9)
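The translation above can also be sketched programmatically. The flat graph encoding below (a node list plus an edge list carrying the join columns and an open/closed flag) and the assumption that the target table's primary key is named `id` are illustrative simplifications, not the thesis's exact data structures.

```python
# Sketch: building the "generic query" from a flat selection-graph encoding.
# nodes: [{"table": str, "conds": [str, ...]}, ...]; node 0 is the target table.
# edges: [{"parent": i, "child": j, "pk": str, "fk": str, "open": bool}, ...]

def selection_graph_to_sql(nodes, edges):
    tables = [f'{nodes[0]["table"]} T0']
    joins = []
    conds = [f'T0.{c}' for c in nodes[0]["conds"]]
    for e in edges:
        i, j = e["parent"], e["child"]
        child = nodes[j]
        if e["open"]:
            # present edge: join the child table in
            tables.append(f'{child["table"]} T{j}')
            joins.append(f'T{i}.{e["pk"]} = T{j}.{e["fk"]}')
            conds.extend(f'T{j}.{c}' for c in child["conds"])
        else:
            # closed edge: exclude rows that have any matching child record
            sub = f'select S.{e["fk"]} from {child["table"]} S'
            if child["conds"]:
                sub += " where " + " and ".join(f'S.{c}' for c in child["conds"])
            conds.append(f'T{i}.{e["pk"]} not in ({sub})')
    sql = f'select distinct T0.id from {", ".join(tables)}'
    if joins or conds:
        sql += " where " + " and ".join(joins + conds)
    return sql
```

For the closed-edge example on this slide, this produces the same `not in` subquery form shown above.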
10. MR Decision Tree
- Each node contains a selection graph
- Each child's selection graph is a supergraph of the parent's selection graph
11. How to Choose Selection Graphs in Nodes?
- Problem: there are too many supergraph selection graphs to choose from at each node
- Solution:
  - start with an initial selection graph
  - use a greedy heuristic to choose supergraph selection graphs (refinements)
  - use binary splits for simplicity
  - for each refinement, get its complement refinement
  - choose the best refinement based on the information gain criterion
- Problem: some potentially good refinements may give no immediate benefit
- Solution:
  - a look-ahead capability
12. Refinements of a Selection Graph

[Figure: selection graph over Staff, Grad. Student, and Department with the conditions Specialization = math and GPA > 3.9]

- add a condition to a node: explores the attribute information in the tables
- add a present edge and an open node: explores the relational properties between the tables
13. Refinements of a Selection Graph

[Figure: refinement adding the condition Position = Professor to the Staff node (with Specialization = math); the complement refinement adds Position != Professor]

- add condition to the node
- add present edge and open node
14. Refinements of a Selection Graph

[Figure: refinement adding the condition GPA > 2.0 to the Grad. Student node, and its complement refinement]

- add condition to the node
- add present edge and open node
15. Refinements of a Selection Graph

[Figure: refinement adding the condition Students > 200 to the Department node, and its complement refinement]

- add condition to the node
- add present edge and open node
16. Refinements of a Selection Graph

[Figure: a refinement and its complement refinement; note: the information gain here is 0]

- add condition to the node
- add present edge and open node
17-19. Refinements of a Selection Graph

[Figures: further examples of the two refinement types (add condition to the node; add present edge and open node) applied to the selection graph with Specialization = math and GPA > 3.9, each with its complement refinement]
20. Look-Ahead Capability

[Figure: a look-ahead refinement that adds an edge and a new node in a single step, and its complement refinement]
21. Look-Ahead Capability

[Figure: a look-ahead refinement adding the Department node together with the condition Students > 200]
22. MRDTL Algorithm: Construction Phase
- for each non-leaf node:
  - consider all possible refinements of the node's selection graph, together with their complements
  - choose the best ones based on the information gain criterion
  - create the children nodes

[Figure: tree of selection graphs over Staff and Grad. Student]
23. MRDTL Algorithm: Classification Phase
- for each leaf:
  - apply the selection graph of the leaf to the test data
  - classify the resulting instances with the classification of the leaf

[Figure: decision tree whose internal nodes hold selection graphs over Staff, Grad. Student, and Department (with conditions such as GPA > 3.9, Position = Professor, Spec = math, Spec = physics) and whose leaves predict the salary classes 70-80k and 80-100k]
24. The Most Time-Consuming Operations of MRDTL
- Entropy associated with this selection graph:
    E = - Σi (ni / N) log (ni / N)
- Query associated with the counts ni:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Graduate_Student, Department
    where join_list and condition_list
    group by Staff.Salary
- The result of the query is the list of pairs (ci, ni)

[Figure: selection graph with Specialization = math and a histogram of the counts n1, n2]
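Computing the entropy from the grouped-count query result is then a one-liner; a minimal sketch, assuming the (ci, ni) pairs are given as a list of tuples:

```python
import math

# Entropy E = -sum_i (n_i/N) log2(n_i/N), computed directly from the
# (c_i, n_i) list returned by the grouped count query.

def entropy_from_counts(counts):
    """counts: list of (class_value, n_i) pairs from the GROUP BY query."""
    n_total = sum(n for _, n in counts)
    return -sum((n / n_total) * math.log2(n / n_total) for _, n in counts)
```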
25. The Most Time-Consuming Operations of MRDTL
- Entropy associated with each of the refinements:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from table_list
    where join_list and condition_list
    group by Staff.Salary

[Figure: refinements with GPA > 2.0 and Specialization = math]
26. A Way to Speed Up: Eliminating Redundant Calculations
- Problem: for a selection graph with 162 nodes, the time to execute a single query is more than 3 minutes!
- Redundancy in the calculation: for this selection graph, the tables Staff and Grad. Student would be joined over and over for all the children refinements in the tree
- The fix: compute the join only once and save it for all further calculations

[Figure: selection graph with Specialization = math]
27. Speed-Up Method: Sufficient Tables

[Figure: the sufficient table constructed for the selection graph with Specialization = math]
28. Speed-Up Method: Sufficient Tables
- Entropy associated with this selection graph:
    E = - Σi (ni / N) log (ni / N)
- Query associated with the counts ni:
    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary
- The result of the query is the list of pairs (ci, ni)
29. Speed-Up Method: Sufficient Tables
- Query associated with the add-condition refinement:
    select S.Salary, X.A, count(distinct S.Staff_ID)
    from S, X
    where S.X_ID = X.ID
    group by S.Salary, X.A
- Calculation for the complement refinement:
    count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
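The complement refinement therefore requires no additional query; a minimal sketch, assuming the per-class counts are given as dicts:

```python
# For each class c_i: count(c_i, Rcomp(S)) = count(c_i, S) - count(c_i, R(S)),
# so the complement counts come from a subtraction instead of a second query.

def complement_counts(total_counts, refinement_counts):
    """Both arguments map class value c_i -> count; returns the counts
    for the complement refinement Rcomp(S)."""
    return {c: n - refinement_counts.get(c, 0) for c, n in total_counts.items()}
```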
30. Speed-Up Method: Sufficient Tables
- Query associated with the add-edge refinement:
    select S.Salary, count(distinct S.Staff_ID)
    from S, X, Y
    where S.X_ID = X.ID and e.cond
    group by S.Salary
- Calculation for the complement refinement:
    count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
31. Speed-Up Method
- Significant speed-up in obtaining the counts needed for the calculation of entropy and information gain
- The speed-up is achieved at the cost of the additional space used by the algorithm
32. Handling Missing Values
- For each attribute that has missing values we build a Naïve Bayes model

[Figure: schema fragment with the tables Graduate Student, Department, and Staff; the model uses the attributes Staff.Position, Staff.Name, Staff.Dep, and Department.Spec]
33. Handling Missing Values
- The most probable value for the missing attribute is then calculated by the formula:
    P(vi | X1.A1, X2.A2, X3.A3)
      = P(X1.A1, X2.A2, X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
      ≈ P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
  where the Naïve Bayes step assumes the attributes are conditionally independent given vi
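A sketch of this Naïve Bayes imputation, assuming training rows are plain dicts; the denominator is identical for every vi and is dropped, and Laplace-style smoothing is an added assumption to avoid zero probabilities:

```python
import math
from collections import Counter

# Pick the most probable value v for a missing target attribute:
# argmax_v P(v) * prod_j P(A_j = evidence[A_j] | v), in log space.
# Laplace smoothing (the +1 / +2 terms) is an illustrative assumption.

def most_probable_value(train, target, evidence):
    """train: rows (dicts) where the target attribute is observed.
    evidence: {attribute: observed value} for the row being imputed."""
    prior = Counter(r[target] for r in train)

    def log_score(v):
        rows_v = [r for r in train if r[target] == v]
        s = math.log(prior[v] / len(train))            # log P(v)
        for attr, obs in evidence.items():
            n_match = sum(1 for r in rows_v if r[attr] == obs)
            s += math.log((n_match + 1) / (len(rows_v) + 2))  # log P(A_j | v)
        return s

    return max(prior, key=log_score)
```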
34. Experimental Results: Mutagenesis
- The most widely used DB in ILP.
- Describes molecules of certain nitroaromatic compounds.
- Goal: predict their mutagenic activity (the label attribute), i.e., the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
- Two subsets: regression-friendly (188 molecules) and regression-unfriendly (42 molecules). We used only the regression-friendly subset.
- 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide increasingly rich descriptions of the examples. We used the B2 level.
35. Experimental Results: Mutagenesis
- Schema of the mutagenesis database
- Results of 10-fold cross-validation for the regression-friendly set
Best-known reported accuracy is 86%.
36. Experimental Results: KDD Cup 2001
- Consists of a variety of details about the various genes of one particular type of organism.
- Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
- 2 tasks: prediction of gene/protein localization and of function
- 862 training genes, 381 test genes.
- Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.
37. Experimental Results: KDD Cup 2001
[Table: results for one task] Best-known reported accuracy is 72.1%.
[Table: results for the other task] Best-known reported accuracy is 93.6%.
38. Experimental Results: PKDD 2001 Discovery Challenge
- Consists of 5 tables
- The target table consists of 1239 records
- The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
- Results for 5x2 cross-validation
Best-known reported accuracy is 99.28%.
39. Summary
- The new implementation significantly outperforms the original MRDTL in terms of running time
- The accuracy results are comparable with the best reported results obtained using other data-mining algorithms

Future Work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularization
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction (Zhang et al., 2002)
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
40. Thanks To
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members Drena Dobbs and Yan-Bin Jia, for their help
- Professors and lecturers of the Computer Science department, for the knowledge they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part