Title: Experiments with MRDTL
1. A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
2. KDD and Relational Data Mining
- The term KDD stands for Knowledge Discovery in Databases
- Traditional KDD techniques work with instances represented by a single table
- Relational Data Mining is a subfield of KDD in which instances are represented by several tables

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
d1 | Sunny | Hot | High | Weak | No
d2 | Sunny | Hot | High | Strong | No
d3 | Overcast | Hot | High | Weak | Yes
d4 | Overcast | Cold | Normal | Weak | No

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4
3. Motivation
- Importance of relational learning:
  - Growth of data stored in multi-relational databases (MRDBs)
  - Techniques for learning from unstructured data often extract the data into an MRDB
- Promising approach to relational learning:
  - MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  - MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
4. Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models, combining first-order logic and probability theory:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
5. Problem Formulation
- Given: data stored in a relational database
- Goal: build a decision tree for predicting a target attribute in the target table

Example of a multi-relational database:

Schema:
Department: ID, Specialization, Students
Grad.Student: ID, Name, GPA, Publications, Advisor, Department
Staff: ID, Name, Department, Position, Salary

Instances:

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k
6. Propositional Decision Tree Algorithm: Construction Phase

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
d1 | Sunny | Hot | High | Weak | No
d2 | Sunny | Hot | High | Strong | No
d3 | Overcast | Hot | High | Weak | Yes
d4 | Overcast | Cold | Normal | Weak | No

Splitting {d1, d2, d3, d4} on Outlook yields {d1, d2} (Sunny) and {d3, d4} (Overcast).

Tree_induction(D: data)
  A = optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    D_left = split(D, A)
    D_right = split_complement(D, A)
    child_left = Tree_induction(D_left)
    child_right = Tree_induction(D_right)
    return node(A, child_left, child_right)
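The pseudocode above can be sketched as runnable Python. The tiny dataset, the equality-based binary split, and the depth-based stopping criterion below are illustrative assumptions, not the exact propositional learner used in the thesis:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr, value):
    """Information gain of the binary split attr == value vs. attr != value."""
    left = [(x, y) for x, y in data if x[attr] == value]
    right = [(x, y) for x, y in data if x[attr] != value]
    rem = sum(len(part) / len(data) * entropy([y for _, y in part])
              for part in (left, right) if part)
    return entropy([y for _, y in data]) - rem

def tree_induction(data, depth=0, max_depth=3):
    """Recursive binary tree construction, mirroring the slide's pseudocode."""
    labels = [y for _, y in data]
    # stopping_criterion(D): pure node or depth limit -> leaf(D)
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    # optimal_attribute(D): best (attribute, value) pair by information gain
    candidates = {(a, x[a]) for x, _ in data for a in x}
    attr, val = max(candidates, key=lambda av: info_gain(data, *av))
    left = [(x, y) for x, y in data if x[attr] == val]    # split(D, A)
    right = [(x, y) for x, y in data if x[attr] != val]   # split_complement(D, A)
    if not left or not right:
        return Counter(labels).most_common(1)[0][0]
    return (attr, val, tree_induction(left, depth + 1, max_depth),
            tree_induction(right, depth + 1, max_depth))

data = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
        ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
        ({"Outlook": "Overcast", "Wind": "Weak"}, "Yes")]
tree = tree_induction(data)
```

On this toy sample the split with the highest gain is on Outlook, as in the slide's figure.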
7. MR Setting: Splitting Data with Selection Graphs

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p4 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | Professor | 70-80k
p2 | Martin | d3 | Postdoc | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | Professor | 80-100k

[Figure: a selection graph and its complement selection graphs split the Staff table into the subsets {p4}, {p1}, and {p2, p3}]
8. What Is a Selection Graph?
- It corresponds to a subset of the instances from the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- An open edge means "have at least one" matching record
- A closed edge means "have none" of the matching records

[Figure: a selection graph over Staff, Grad.Student, and Department with the condition Specialization = math]
9. Transforming Selection Graphs into SQL Queries

Staff node with the condition Position = Professor:
  select distinct T0.id
  from Staff T0
  where T0.Position = 'Professor'

Open edge Staff -> Grad.Student:
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list

Closed edge Staff -> Grad.Student (staff advising no students):
  select distinct T0.id
  from Staff T0
  where T0.id not in (select T1.Advisor from Graduate_Student T1)

Open edge plus closed edge with the condition GPA > 3.9 (staff with at least one advisee, but none with GPA > 3.9):
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor
    and T0.id not in (select T1.Advisor from Graduate_Student T1 where T1.GPA > 3.9)
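These queries can be checked against the deck's toy tables. The sketch below loads them into an in-memory SQLite database; the table and column names follow the slides, while the setup code itself is an assumption for illustration:

```python
import sqlite3

# Build the toy Staff / Graduate_Student tables from the slides in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Staff (id TEXT, name TEXT, department TEXT, position TEXT, salary TEXT);
CREATE TABLE Graduate_Student (id TEXT, name TEXT, gpa REAL, publications INT,
                               advisor TEXT, department TEXT);
INSERT INTO Staff VALUES
  ('p1','Dale','d1','Professor','70-80k'),
  ('p2','Martin','d3','Postdoc','30-40k'),
  ('p3','Victor','d2','VisitorScientist','40-50k'),
  ('p4','David','d3','Professor','80-100k');
INSERT INTO Graduate_Student VALUES
  ('s1','John',2.0,4,'p1','d3'),
  ('s2','Lisa',3.5,10,'p4','d3'),
  ('s3','Michel',3.9,3,'p4','d4');
""")

# Open edge Staff -> Grad.Student: staff advising at least one student.
advisors = [r[0] for r in con.execute(
    "SELECT DISTINCT T0.id FROM Staff T0, Graduate_Student T1 "
    "WHERE T0.id = T1.advisor")]

# Closed edge: staff advising no student at all.
non_advisors = [r[0] for r in con.execute(
    "SELECT DISTINCT T0.id FROM Staff T0 WHERE T0.id NOT IN "
    "(SELECT T1.advisor FROM Graduate_Student T1)")]
```

On this data the open-edge query selects p1 and p4, and the closed-edge query selects the complementary set p2 and p3.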
10. MR Decision Tree
- Each node contains a selection graph
- Each child's selection graph is a supergraph of the parent's selection graph
11. How to Choose Selection Graphs in Nodes?
- Problem: there are too many supergraph selection graphs to choose from at each node
- Solution:
  - start with an initial selection graph
  - use a greedy heuristic to choose supergraph selection graphs (refinements)
  - use binary splits for simplicity
  - for each refinement, get the complement refinement
  - choose the best refinement based on the information gain criterion
- Problem: some potentially good refinements may give no immediate benefit
- Solution:
  - look-ahead capability
12. Refinements of Selection Graph

[Figure: a selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]

- Add a condition to a node: explores attribute information in the tables
- Add a present edge and open node: explores relational properties between the tables
13. Refinements of Selection Graph

Adding a condition to the Staff node:
- refinement: add Position = Professor
- complement refinement: add Position != Professor

[Figure: both refinements of the selection graph with Specialization = math]
14. Refinements of Selection Graph

Adding a condition to the Grad.Student node:
- refinement: add GPA > 2.0
- complement refinement: add the negated condition (GPA <= 2.0)

[Figure: both refinements of the selection graph with Specialization = math]
15. Refinements of Selection Graph

Adding a condition to the Department node:
- refinement: add Students > 200
- complement refinement: add the negated condition (Students <= 200)

[Figure: both refinements of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
16. Refinements of Selection Graph

Adding a present edge and open node:

[Figure: refinement and complement refinement of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]

Note: the information gain of this refinement is 0.
17-19. Refinements of Selection Graph

[Figures: successive add-edge/open-node refinements and their complement refinements of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
20. Look-Ahead Capability

[Figure: a refinement and its complement refinement of the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
21. Look-Ahead Capability

A look-ahead refinement adds an edge and a condition in a single step, e.g. extending the selection graph (Specialization = math, GPA > 3.9) with Students > 200.

[Figure: the look-ahead refinement and its complement]
22. MRDTL Algorithm: Construction Phase
- For each non-leaf node:
  - consider all possible refinements, and their complements, of the node's selection graph
  - choose the best one based on the information gain criterion
  - create children nodes

[Figure: a partially grown tree of selection graphs over Staff and Grad.Student]
23. MRDTL Algorithm: Classification Phase
- For each leaf:
  - apply the selection graph of the leaf to the test data
  - classify the resulting instances with the classification of the leaf

[Figure: a decision tree of selection graphs over Staff, Grad.Student, and Department; leaves predict salary classes, e.g. 70-80k for the branch with GPA > 3.9, Position = Professor, Spec = math, and 80-100k for the branch with Spec = physics]
24. The Most Time-Consuming Operations of MRDTL
- Entropy associated with this selection graph (Specialization = math):

  E = - sum_i (n_i / N) log (n_i / N)

- Query associated with the counts n_i:

  select distinct Staff.Salary, count(distinct Staff.ID)
  from Staff, Grad.Student, Department
  where join_list and condition_list
  group by Staff.Salary

ID | Name | Dep | Position | Salary
p1 | Dale | d1 | Postdoc | c1
p2 | Martin | d1 | Postdoc | c1
p3 | David | d4 | Postdoc | c1
p4 | Peter | d3 | Postdoc | c1
p5 | Adrian | d2 | Professor | c2
p6 | Doina | d3 | Professor | c2

The result of the query is the list of pairs (c_i, n_i); here n_1 = 4 and n_2 = 2.
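As a minimal sketch (assuming base-2 logarithms), the entropy can be computed directly from the (c_i, n_i) list returned by the query:

```python
import math

def entropy_from_counts(counts):
    """E = -sum_i (n_i / N) log2 (n_i / N), from the (c_i, n_i) query result."""
    n_total = sum(counts.values())
    return -sum((n / n_total) * math.log2(n / n_total) for n in counts.values())

# Counts as returned by the GROUP BY query on the slide's table:
# four staff members fall in class c1, two in class c2.
counts = {"c1": 4, "c2": 2}
e = entropy_from_counts(counts)  # about 0.918 bits
```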
25. The Most Time-Consuming Operations of MRDTL
- Entropy associated with each of the refinements (e.g. adding GPA > 2.0 to the selection graph with Specialization = math):

  select distinct Staff.Salary, count(distinct Staff.ID)
  from table_list
  where join_list and condition_list
  group by Staff.Salary
26. A Way to Speed Up: Eliminate Redundant Calculations
- Problem: for a selection graph with 162 nodes, the time to execute a query is more than 3 minutes!
- Redundancy in calculation: for this selection graph (Specialization = math), the tables Staff and Grad.Student will be joined over and over for all the children refinements of the tree
- A way to fix it: calculate the join only once and save it for all further calculations
27. Speed-Up Method: Sufficient Tables

The sufficient table S for the selection graph (Specialization = math):

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2
28. Speed-Up Method: Sufficient Tables
- Entropy associated with this selection graph (Specialization = math):

  E = - sum_i (n_i / N) log (n_i / N)

- Query associated with the counts n_i, now run against the sufficient table S:

  select S.Salary, count(distinct S.Staff_ID)
  from S
  group by S.Salary

Staff_ID | Grad.Student_ID | Dep_ID | Salary
p1 | s1 | d1 | c1
p2 | s1 | d1 | c1
p3 | s6 | d4 | c1
p4 | s3 | d3 | c1
p5 | s1 | d2 | c2
p6 | s9 | d3 | c2

The result of the query is the list of pairs (c_i, n_i); here n_1 = 4 and n_2 = 2.
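A minimal sketch of this idea, assuming the sufficient table S above has already been materialized in an in-memory SQLite database; once S exists, the class counts come from a single cheap GROUP BY with no repeated joins:

```python
import sqlite3

# Materialize the sufficient table S once (in the algorithm this is the saved
# join of Staff, Grad.Student, and Department restricted by the selection graph).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE S (staff_id TEXT, grad_student_id TEXT,"
            " dep_id TEXT, salary TEXT)")
con.executemany("INSERT INTO S VALUES (?,?,?,?)", [
    ("p1", "s1", "d1", "c1"), ("p2", "s1", "d1", "c1"),
    ("p3", "s6", "d4", "c1"), ("p4", "s3", "d3", "c1"),
    ("p5", "s1", "d2", "c2"), ("p6", "s9", "d3", "c2"),
])

# The slide's counting query against S alone.
counts = dict(con.execute(
    "SELECT salary, COUNT(DISTINCT staff_id) FROM S GROUP BY salary"))
```

On the slide's data this yields n_1 = 4 for class c1 and n_2 = 2 for class c2.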
29. Speed-Up Method: Sufficient Tables

Query associated with the add-condition refinement (on attribute A of table X):

  select S.Salary, X.A, count(distinct S.Staff_ID)
  from S, X
  where S.X_ID = X.ID
  group by S.Salary, X.A

Calculation for the complement refinement:

  count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
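The subtraction rule means the complement refinement needs no query of its own. It can be sketched in a few lines of Python; the example counts are hypothetical:

```python
def complement_counts(counts_all, counts_refined):
    """count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S)):
    class counts for the complement refinement follow from counts already known,
    so only one query per refinement pair is needed."""
    return {c: counts_all[c] - counts_refined.get(c, 0) for c in counts_all}

# Hypothetical counts: 4 c1 / 2 c2 overall; the refinement keeps 3 c1 / 1 c2.
comp = complement_counts({"c1": 4, "c2": 2}, {"c1": 3, "c2": 1})
```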
30. Speed-Up Method: Sufficient Tables

Query associated with the add-edge refinement (adding an edge e from table X to table Y):

  select S.Salary, count(distinct S.Staff_ID)
  from S, X, Y
  where S.X_ID = X.ID and e.cond
  group by S.Salary

Calculation for the complement refinement:

  count(c_i, R_comp(S)) = count(c_i, S) - count(c_i, R(S))
31. Speed-Up Method
- Significant speed-up in obtaining the counts needed for the calculation of entropy and information gain
- The speed-up comes at the cost of the additional space used by the algorithm
32. Handling Missing Values

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p1 | d3
s3 | Michel | 3.9 | 3 | p4 | d4

Department: ID | Specialization | Students
d1 | Math | 1000
d2 | Physics | 300
d3 | Computer Science | 400

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | ? | 70-80k
p2 | Martin | d3 | ? | 30-40k
p3 | Victor | d2 | VisitorScientist | 40-50k
p4 | David | d3 | ? | 80-100k

- For each attribute which has missing values, we build a Naive Bayes model: the attribute with missing values (e.g. Staff.Position, with value b) is the class, and the related attributes (Staff.Name with value a, Staff.Dep, Department.Spec, ...) are the evidence, with conditional probabilities such as P(a | b)
33. Handling Missing Values

Graduate Student: ID | Name | GPA | Publications | Advisor | Department
s1 | John | 2.0 | 4 | p1 | d3
s2 | Lisa | 3.5 | 10 | p1 | d3

Department: ID | Specialization | Students
d1 | Math | 1000

Staff: ID | Name | Department | Position | Salary
p1 | Dale | d1 | ? | 70-80k

- The most probable value for the missing attribute is then calculated by the formula:

  P(v_i | X1.A1, X2.A2, X3.A3)
    = P(X1.A1, X2.A2, X3.A3 | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3)
    = P(X1.A1 | v_i) P(X2.A2 | v_i) P(X3.A3 | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3)

  where the second equality uses the Naive Bayes independence assumption.
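A minimal sketch of this imputation scheme in Python, assuming flat records, add-one smoothing, and a single evidence attribute; the record values and feature choice are illustrative, not the thesis's exact model:

```python
from collections import Counter, defaultdict

def naive_bayes_impute(rows, target, features):
    """Fill missing values of `target` (None) with the most probable value
    under a Naive Bayes model: argmax_v P(v) * prod_j P(x_j | v)."""
    complete = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in complete)
    # conditional counts for P(feature = value | target = v)
    cond = defaultdict(Counter)
    for r in complete:
        for f in features:
            cond[(f, r[target])][r[f]] += 1

    def score(v, r):
        # unnormalized posterior; denominator P(evidence) is the same for all v
        p = prior[v] / len(complete)
        for f in features:
            c = cond[(f, v)]
            p *= (c[r[f]] + 1) / (sum(c.values()) + 1)  # add-one smoothing
        return p

    for r in rows:
        if r[target] is None:
            r[target] = max(prior, key=lambda v: score(v, r))
    return rows

# Hypothetical staff records; Dale's position is missing.
rows = [
    {"name": "Victor", "department": "d2", "position": "VisitorScientist"},
    {"name": "Emily",  "department": "d2", "position": "VisitorScientist"},
    {"name": "Jane",   "department": "d1", "position": "Professor"},
    {"name": "Dale",   "department": "d2", "position": None},
]
filled = naive_bayes_impute(rows, "position", ["department"])
```

Here both complete d2 records are VisitorScientist, so that value wins for Dale.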
34. Experimental Results: Mutagenesis
- The most widely used database in ILP.
- Describes molecules of certain nitroaromatic compounds.
- Goal: predict their mutagenic activity (label attribute), i.e. the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
- Two subsets: regression-friendly (188 molecules) and regression-unfriendly (42 molecules). We used only the regression-friendly subset.
- 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide progressively richer descriptions of the examples. We used the B2 level.
35. Experimental Results: Mutagenesis
- Schema of the mutagenesis database
- Results of 10-fold cross-validation for the regression-friendly set:

Data set | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
mutagenesis | 87.5% | 3 | 9 | 28.45 secs | 52.15 secs

Best-known reported accuracy is 86%.
36. Experimental Results: KDD Cup 2001
- Consists of a variety of details about the various genes of one particular type of organism.
- Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
- 2 tasks: prediction of gene/protein localization and function
- 862 training genes, 381 test genes.
- Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table
37. Experimental Results: KDD Cup 2001

Localization | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
With handling missing values | 76.11% | 19 | 213 | 202.9 secs | 1256.38 secs
Without handling missing values | 50.14% | 33 | 575 | 550.76 secs | 2257.20 secs

Best-known reported accuracy is 72.1%.

Function | Accuracy | Sel. graph size (max) | Tree size (max) | Time with speed-up | Time without speed-up
With handling missing values | 91.44% | 9 | 63 | 151.19 secs | 307.83 secs
Without handling missing values | 88.56% | 9 | 19 | 61.29 secs | 118.41 secs

Best-known reported accuracy is 93.6%.
38. Experimental Results: PKDD 2001 Discovery Challenge
- Consists of 5 tables: DIAGNOSIS, ANA_PATTERN, PATIENT_INFO, THROMBOSIS, ANTIBODY_EXAM
- Target table consists of 1239 records
- The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
- The results for 5x2 cross-validation:

Data set | Accuracy | Sel. graph size (max) | Tree size | Time with speed-up | Time without speed-up
thrombosis | 98.1% | 31 | 71 | 127.75 secs | 198.22 secs

Best-known reported accuracy is 99.28%.
39. Summary
- The new implementation significantly outperforms the original MRDTL implementation in terms of running time
- The accuracy results are comparable with the best reported results obtained using different data-mining algorithms

Future Work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularizations
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction (Zhang et al., 2002)
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
40. Thanks to
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members, Drena Dobbs and Yan-Bin Jia, for their help
- Professors and lecturers of the Computer Science department, for the knowledge that they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part