Title: Experiments with MRDTL, a Multi-Relational Decision Tree Learning Algorithm
1. A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
2. KDD and Relational Data Mining
- KDD stands for Knowledge Discovery in Databases
- Traditional KDD techniques work with instances represented by a single table
- Relational Data Mining is a subfield of KDD in which instances are represented by several tables
3. Motivation
- Importance of relational learning:
  - growth of data stored in multi-relational databases (MRDBs)
  - techniques for learning from unstructured data often extract the data into an MRDB
- Promising approaches to relational learning:
  - the MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  - the MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
4. Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
  - combining first-order logic and probability theory
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
5. Problem Formulation
- Given: data stored in a relational database
- Goal: build a decision tree for predicting the target attribute in the target table

[Figure: example of a multi-relational database schema and its instances]
6. Propositional Decision Tree Algorithm: Construction Phase

  Tree_induction(D: data)
      A = optimal_attribute(D)
      if stopping_criterion(D)
          return leaf(D)
      else
          D_left  = split(D, A)
          D_right = split_complement(D, A)
          child_left  = Tree_induction(D_left)
          child_right = Tree_induction(D_right)
          return node(A, child_left, child_right)

[Figure: example decision tree over instances d1-d4, splitting on the Outlook attribute]
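The construction phase above can be written as a short runnable sketch. The instance representation (dicts with a "label" key) and the concrete helpers (majority-class leaves, binary equality splits scored by information gain) are illustrative assumptions, not the thesis's exact propositional learner.

```python
import math
from collections import Counter

# Runnable sketch of the Tree_induction pseudocode. Instances are dicts
# mapping attribute names to values; the class is stored under "label".

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def stopping_criterion(rows):
    # stop when the node is pure
    return len({r["label"] for r in rows}) <= 1

def leaf(rows):
    # majority-class leaf
    return {"leaf": Counter(r["label"] for r in rows).most_common(1)[0][0]}

def optimal_attribute(rows, attributes):
    # best (attribute, value) binary equality split by information gain
    best, best_gain = None, 0.0
    for a in attributes:
        for v in {r[a] for r in rows}:
            left = [r for r in rows if r[a] == v]
            right = [r for r in rows if r[a] != v]
            if not left or not right:
                continue
            gain = entropy(rows) - (len(left) * entropy(left)
                                    + len(right) * entropy(right)) / len(rows)
            if gain > best_gain:
                best, best_gain = (a, v), gain
    return best

def tree_induction(rows, attributes):
    if stopping_criterion(rows):
        return leaf(rows)
    split = optimal_attribute(rows, attributes)
    if split is None:                           # no useful split left
        return leaf(rows)
    a, v = split
    d_left = [r for r in rows if r[a] == v]     # split(D, A)
    d_right = [r for r in rows if r[a] != v]    # split_complement(D, A)
    return {"test": split,
            "left": tree_induction(d_left, attributes),
            "right": tree_induction(d_right, attributes)}
```

The MR algorithm on the following slides keeps this same greedy recursive skeleton but replaces attribute splits with selection-graph refinements.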
7. MR Setting: Splitting Data with Selection Graphs

[Figure: instances of the Department, Graduate Student, and Staff tables split by a selection graph and its complement selection graph]
8. What Is a Selection Graph?
- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- An open edge selects instances that have at least one matching record
- A closed edge selects instances that have no matching record

[Figure: example selection graph over Grad. Student, Department, and Staff with the condition Specialization = math]
9. Transforming Selection Graphs into SQL Queries

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list

Staff node with the condition Position = 'Professor':
  Select distinct T0.id
  From Staff T0
  Where T0.Position = 'Professor'

Staff with a present (open) edge to Grad. Student:
  Select distinct T0.id
  From Staff T0, Graduate_Student T1
  Where T0.id = T1.Advisor

Staff with a closed edge to Grad. Student:
  Select distinct T0.id
  From Staff T0
  Where T0.id not in (Select T1.Advisor
                      From Graduate_Student T1)

Staff with a present edge to Grad. Student and a closed edge to Grad. Student with GPA > 3.9:
  Select distinct T0.id
  From Staff T0, Graduate_Student T1
  Where T0.id = T1.Advisor
    and T0.id not in (Select T1.Advisor
                      From Graduate_Student T1
                      Where T1.GPA > 3.9)
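The translation above can also be sketched programmatically. The flat graph encoding below (a node list plus an edge list carrying the join columns and an open/closed flag) and the assumption that the target table's primary key is named `id` are illustrative simplifications, not the thesis's exact data structures.

```python
# Sketch: building the "generic query" from a flat selection-graph encoding.
# nodes: [{"table": str, "conds": [str, ...]}, ...]; node 0 is the target table.
# edges: [{"parent": i, "child": j, "pk": str, "fk": str, "open": bool}, ...]

def selection_graph_to_sql(nodes, edges):
    tables = [f'{nodes[0]["table"]} T0']
    joins = []
    conds = [f'T0.{c}' for c in nodes[0]["conds"]]
    for e in edges:
        i, j = e["parent"], e["child"]
        child = nodes[j]
        if e["open"]:
            # present edge: join the child table in
            tables.append(f'{child["table"]} T{j}')
            joins.append(f'T{i}.{e["pk"]} = T{j}.{e["fk"]}')
            conds.extend(f'T{j}.{c}' for c in child["conds"])
        else:
            # closed edge: exclude rows that have any matching child record
            sub = f'select S.{e["fk"]} from {child["table"]} S'
            if child["conds"]:
                sub += " where " + " and ".join(f'S.{c}' for c in child["conds"])
            conds.append(f'T{i}.{e["pk"]} not in ({sub})')
    sql = f'select distinct T0.id from {", ".join(tables)}'
    if joins or conds:
        sql += " where " + " and ".join(joins + conds)
    return sql
```

For the closed-edge example on this slide, this produces the same `not in` subquery form shown above.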
10. MR Decision Tree
- Each node contains a selection graph
- Each child's selection graph is a supergraph of the parent's selection graph
11. How to Choose Selection Graphs in Nodes?
- Problem: there are too many supergraph selection graphs to choose from at each node
- Solution:
  - start with an initial selection graph
  - use a greedy heuristic to choose supergraph selection graphs (refinements)
  - use binary splits for simplicity
  - for each refinement, get its complement refinement
  - choose the best refinement based on the information gain criterion
- Problem: some potentially good refinements may give no immediate benefit
- Solution:
  - a look-ahead capability
12. Refinements of a Selection Graph

[Figure: selection graph over Staff, Grad. Student, and Department with the conditions Specialization = math and GPA > 3.9]

- add a condition to a node: explores the attribute information in the tables
- add a present edge and an open node: explores the relational properties between the tables
13. Refinements of a Selection Graph

[Figure: refinement adding the condition Position = Professor to the Staff node (with Specialization = math); the complement refinement adds Position != Professor]

- add condition to the node
- add present edge and open node
14. Refinements of a Selection Graph

[Figure: refinement adding the condition GPA > 2.0 to the Grad. Student node, and its complement refinement]

- add condition to the node
- add present edge and open node
15. Refinements of a Selection Graph

[Figure: refinement adding the condition Students > 200 to the Department node, and its complement refinement]

- add condition to the node
- add present edge and open node
16. Refinements of a Selection Graph

[Figure: a refinement and its complement refinement; note: the information gain here is 0]

- add condition to the node
- add present edge and open node
17-19. Refinements of a Selection Graph

[Figures: further examples of the two refinement types (add condition to the node; add present edge and open node) applied to the selection graph with Specialization = math and GPA > 3.9, each with its complement refinement]
20. Look-Ahead Capability

[Figure: a look-ahead refinement that adds an edge and a new node in a single step, and its complement refinement]
21. Look-Ahead Capability

[Figure: a look-ahead refinement adding the Department node together with the condition Students > 200]
22. MRDTL Algorithm: Construction Phase
- for each non-leaf node:
  - consider all possible refinements of the node's selection graph, together with their complements
  - choose the best ones based on the information gain criterion
  - create the children nodes

[Figure: tree of selection graphs over Staff and Grad. Student]
23. MRDTL Algorithm: Classification Phase
- for each leaf:
  - apply the selection graph of the leaf to the test data
  - classify the resulting instances with the classification of the leaf

[Figure: decision tree whose internal nodes hold selection graphs over Staff, Grad. Student, and Department (with conditions such as GPA > 3.9, Position = Professor, Spec = math, Spec = physics) and whose leaves predict the salary classes 70-80k and 80-100k]
24. The Most Time-Consuming Operations of MRDTL
- Entropy associated with this selection graph:
    E = - Σi (ni / N) log (ni / N)
- Query associated with the counts ni:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Graduate_Student, Department
    where join_list and condition_list
    group by Staff.Salary
- The result of the query is the list of pairs (ci, ni)

[Figure: selection graph with Specialization = math and a histogram of the counts n1, n2]
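Computing the entropy from the grouped-count query result is then a one-liner; a minimal sketch, assuming the (ci, ni) pairs are given as a list of tuples:

```python
import math

# Entropy E = -sum_i (n_i/N) log2(n_i/N), computed directly from the
# (c_i, n_i) list returned by the grouped count query.

def entropy_from_counts(counts):
    """counts: list of (class_value, n_i) pairs from the GROUP BY query."""
    n_total = sum(n for _, n in counts)
    return -sum((n / n_total) * math.log2(n / n_total) for _, n in counts)
```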
25. The Most Time-Consuming Operations of MRDTL
- Entropy associated with each of the refinements:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from table_list
    where join_list and condition_list
    group by Staff.Salary

[Figure: refinements with GPA > 2.0 and Specialization = math]
26. A Way to Speed Up: Eliminating Redundant Calculations
- Problem: for a selection graph with 162 nodes, the time to execute a single query is more than 3 minutes!
- Redundancy in the calculation: for this selection graph, the tables Staff and Grad. Student would be joined over and over for all the children refinements in the tree
- The fix: compute the join only once and save it for all further calculations

[Figure: selection graph with Specialization = math]
27. Speed-Up Method: Sufficient Tables

[Figure: the sufficient table constructed for the selection graph with Specialization = math]
28. Speed-Up Method: Sufficient Tables
- Entropy associated with this selection graph:
    E = - Σi (ni / N) log (ni / N)
- Query associated with the counts ni:
    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary
- The result of the query is the list of pairs (ci, ni)
29. Speed-Up Method: Sufficient Tables
- Query associated with the add-condition refinement:
    select S.Salary, X.A, count(distinct S.Staff_ID)
    from S, X
    where S.X_ID = X.ID
    group by S.Salary, X.A
- Calculation for the complement refinement:
    count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
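The complement refinement therefore requires no additional query; a minimal sketch, assuming the per-class counts are given as dicts:

```python
# For each class c_i: count(c_i, Rcomp(S)) = count(c_i, S) - count(c_i, R(S)),
# so the complement counts come from a subtraction instead of a second query.

def complement_counts(total_counts, refinement_counts):
    """Both arguments map class value c_i -> count; returns the counts
    for the complement refinement Rcomp(S)."""
    return {c: n - refinement_counts.get(c, 0) for c, n in total_counts.items()}
```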
30. Speed-Up Method: Sufficient Tables
- Query associated with the add-edge refinement:
    select S.Salary, count(distinct S.Staff_ID)
    from S, X, Y
    where S.X_ID = X.ID and e.cond
    group by S.Salary
- Calculation for the complement refinement:
    count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
31. Speed-Up Method
- Significant speed-up in obtaining the counts needed for the calculation of entropy and information gain
- The speed-up is achieved at the cost of the additional space used by the algorithm
32. Handling Missing Values
- For each attribute that has missing values we build a Naïve Bayes model

[Figure: schema fragment with the tables Graduate Student, Department, and Staff; the model uses the attributes Staff.Position, Staff.Name, Staff.Dep, and Department.Spec]
33. Handling Missing Values
- The most probable value for the missing attribute is then calculated by the formula:
    P(vi | X1.A1, X2.A2, X3.A3)
      = P(X1.A1, X2.A2, X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
      ≈ P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
  where the Naïve Bayes step assumes the attributes are conditionally independent given vi
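A sketch of this Naïve Bayes imputation, assuming training rows are plain dicts; the denominator is identical for every vi and is dropped, and Laplace-style smoothing is an added assumption to avoid zero probabilities:

```python
import math
from collections import Counter

# Pick the most probable value v for a missing target attribute:
# argmax_v P(v) * prod_j P(A_j = evidence[A_j] | v), in log space.
# Laplace smoothing (the +1 / +2 terms) is an illustrative assumption.

def most_probable_value(train, target, evidence):
    """train: rows (dicts) where the target attribute is observed.
    evidence: {attribute: observed value} for the row being imputed."""
    prior = Counter(r[target] for r in train)

    def log_score(v):
        rows_v = [r for r in train if r[target] == v]
        s = math.log(prior[v] / len(train))            # log P(v)
        for attr, obs in evidence.items():
            n_match = sum(1 for r in rows_v if r[attr] == obs)
            s += math.log((n_match + 1) / (len(rows_v) + 2))  # log P(A_j | v)
        return s

    return max(prior, key=log_score)
```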
34. Experimental Results: Mutagenesis
- The most widely used DB in ILP.
- Describes molecules of certain nitroaromatic compounds.
- Goal: predict their mutagenic activity (the label attribute), i.e., the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
- Two subsets: regression-friendly (188 molecules) and regression-unfriendly (42 molecules). We used only the regression-friendly subset.
- 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide increasingly rich descriptions of the examples. We used the B2 level.
35. Experimental Results: Mutagenesis
- Schema of the mutagenesis database
- Results of 10-fold cross-validation for the regression-friendly set
Best-known reported accuracy is 86%.
36. Experimental Results: KDD Cup 2001
- Consists of a variety of details about the various genes of one particular type of organism.
- Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
- 2 tasks: prediction of gene/protein localization and of function
- 862 training genes, 381 test genes.
- Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.
37. Experimental Results: KDD Cup 2001
[Table: results for one task] Best-known reported accuracy is 72.1%.
[Table: results for the other task] Best-known reported accuracy is 93.6%.
38. Experimental Results: PKDD 2001 Discovery Challenge
- Consists of 5 tables
- The target table consists of 1239 records
- The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
- Results for 5x2 cross-validation
Best-known reported accuracy is 99.28%.
39. Summary
- The new implementation significantly outperforms the original MRDTL in terms of running time
- The accuracy results are comparable with the best reported results obtained using other data-mining algorithms

Future Work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularization
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction (Zhang et al., 2002)
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
40. Thanks To
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members Drena Dobbs and Yan-Bin Jia, for their help
- Professors and lecturers of the Computer Science department, for the knowledge they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part