Experiments with MRDTL - PowerPoint PPT Presentation

About This Presentation
Title:

Experiments with MRDTL

Description:

A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments. Anna Atramentov. Major: Computer Science. Program of Study Committee: – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Experiments with MRDTL


1
A Multi-Relational Decision Tree Learning Algorithm: Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
2
KDD and Relational Data Mining
  • Term KDD stands for Knowledge Discovery in
    Databases
  • Traditional techniques in KDD work with the
    instances represented by one table
  • Relational Data Mining is a subfield of KDD where
    the instances are represented by several tables

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
d1   Sunny     Hot          High      Weak    No
d2   Sunny     Hot          High      Strong  No
d3   Overcast  Hot          High      Weak    Yes
d4   Overcast  Cold         Normal    Weak    No

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k

Department
ID  Specialization    Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Graduate Student
ID  Name    GPA  Publications  Advisor  Department
s1  John    2.0  4             p1       d3
s2  Lisa    3.5  10            p4       d3
s3  Michel  3.9  3             p4       d4
3
Motivation
  • Importance of relational learning
  • Growth of data stored in MRDB
  • Techniques for learning unstructured data often
    extract the data into MRDB
  • Promising approach to relational learning
  • MRDM (Multi-Relational Data Mining) framework
    developed by Knobbe et al. (1999)
  • MRDTL (Multi-Relational Decision Tree Learning)
    algorithm implemented by Leiva (2002)

Goals
  • Speed up MRDM framework and in particular MRDTL
    algorithm
  • Incorporate handling of missing values
  • Perform more extensive experimental evaluation of
    the algorithm

4
Relational Learning Literature
  • Inductive Logic Programming (Dzeroski and Lavrac,
    2001; Dzeroski et al., 2001; Blockeel, 1998; De
    Raedt, 1997)
  • First order extensions of probabilistic models
  • Relational Bayesian Networks (Jaeger, 1997)
  • Probabilistic Relational Models (Getoor, 2001;
    Koller, 1999)
  • Bayesian Logic Programs (Kersting et al., 2000)
  • Combining First Order Logic and Probability
    Theory
  • Multi-Relational Data Mining (Knobbe et al.,
    1999)
  • Propositionalization methods (Krogel and Wrobel,
    2001)
  • PRMs extension for cumulative learning for
    learning and reasoning as agents interact with
    the world (Pfeffer, 2000)
  • Approaches for mining data in the form of graphs
    (Holder and Cook, 2000; Gonzalez et al., 2000)

5
Problem Formulation
  • Given: data stored in a relational database
  • Goal: build a decision tree for predicting the
    target attribute in the target table

Example of multi-relational database (schema and instances)

Department
schema: ID, Specialization, Students
ID  Specialization    Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Grad.Student
schema: ID, Name, GPA, Publications, Advisor, Department
ID  Name    GPA  Publications  Advisor  Department
s1  John    2.0  4             p1       d3
s2  Lisa    3.5  10            p4       d3
s3  Michel  3.9  3             p4       d4

Staff
schema: ID, Name, Department, Position, Salary
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k
6
Propositional decision tree algorithm.
Construction phase
The full table is split on the attribute Outlook into the subset with Outlook = Sunny (d1, d2) and its complement with Outlook = Overcast (d3, d4):

Day  Outlook   Temp  Humidity  Wind    PlayTennis
d1   Sunny     Hot   High      Weak    No
d2   Sunny     Hot   High      Strong  No

Day  Outlook   Temp  Humidity  Wind    PlayTennis
d3   Overcast  Hot   High      Weak    Yes
d4   Overcast  Cold  Normal    Weak    No

Tree_induction(D: data)
  A := optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    D_left  := split(D, A)
    D_right := split_complement(D, A)
    child_left  := Tree_induction(D_left)
    child_right := Tree_induction(D_right)
    return node(A, child_left, child_right)
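For concreteness, here is a minimal, runnable Python sketch of this binary-split induction scheme applied to the PlayTennis table. The helper functions mirror the pseudocode above, but their concrete definitions (entropy-based attribute choice, equality tests) are illustrative assumptions, not the presentation's own code.

import math
from collections import Counter

def entropy(rows, target):
    # entropy of the class distribution of `target` over the given rows
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(rows, attributes, target):
    # choose the (attribute, value) equality test with the highest information gain
    base = entropy(rows, target)
    best, best_gain = None, -1.0
    for a in attributes:
        for v in set(r[a] for r in rows):
            left = [r for r in rows if r[a] == v]
            right = [r for r in rows if r[a] != v]
            if not left or not right:
                continue
            gain = base - len(left) / len(rows) * entropy(left, target) \
                        - len(right) / len(rows) * entropy(right, target)
            if gain > best_gain:
                best, best_gain = (a, v), gain
    return best

def tree_induction(rows, attributes, target):
    labels = set(r[target] for r in rows)
    test = best_split(rows, attributes, target)
    if len(labels) == 1 or test is None:                              # stopping criterion
        return Counter(r[target] for r in rows).most_common(1)[0][0]  # leaf: majority class
    a, v = test
    left = [r for r in rows if r[a] == v]                             # split(D, A)
    right = [r for r in rows if r[a] != v]                            # split_complement(D, A)
    return (f"{a} = {v}",
            tree_induction(left, attributes, target),
            tree_induction(right, attributes, target))

# usage on the PlayTennis table from the slide
rows = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "No"},
]
print(tree_induction(rows, ["Outlook", "Temp", "Humidity", "Wind"], "PlayTennis"))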
7
MR setting. Splitting data with Selection Graphs
[Figure: the Staff instances are partitioned by a selection graph and its complement selection graphs, using joins with the Grad.Student and Department tables. One subset of Staff contains p4 (David), another contains p1 (Dale), and a third contains p2 (Martin) and p3 (Victor).]
8
What is a selection graph?
  • It corresponds to a subset of the instances
    from the target table
  • Nodes correspond to the tables from the database
  • Edges correspond to the associations between
    tables
  • Open edge: the selected instances "have at least
    one" matching record in the associated table
  • Closed edge: the selected instances "have none of"
    the matching records in the associated table

[Figure: example selection graph with nodes for Staff, Grad.Student and Department, and the condition Specialization = math on the Department node.]
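A selection graph can be represented directly as a small data structure. The following Python sketch is illustrative only (the class names and fields are assumptions, not the thesis code); it captures the nodes, conditions and open/closed edges described above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Condition:
    attribute: str          # e.g. "Specialization"
    operator: str           # e.g. "=", ">", "<="
    value: object           # e.g. "math", 3.9

@dataclass
class SGNode:
    table: str                                            # table this node refers to, e.g. "Staff"
    conditions: List[Condition] = field(default_factory=list)
    edges: List["SGEdge"] = field(default_factory=list)

@dataclass
class SGEdge:
    keys: tuple             # (attribute in parent, attribute in child), e.g. ("ID", "Advisor")
    child: SGNode
    open: bool = True       # open edge: "have at least one"; closed edge: "have none of"

# the example selection graph from the slide:
# Staff -> Grad.Student -> Department (Specialization = math)
dept = SGNode("Department", [Condition("Specialization", "=", "math")])
grad = SGNode("Grad.Student", edges=[SGEdge(("Department", "ID"), dept)])
root = SGNode("Staff", edges=[SGEdge(("ID", "Advisor"), grad)])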
9
Transforming selection graphs into SQL queries
Selection graph: Staff node with the condition Position = Professor

select distinct T0.id
from Staff T0
where T0.Position = 'Professor'

Selection graph: Staff node with an open edge to Grad.Student

select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor

Generic query:

select distinct T0.primary_key
from table_list
where join_list and condition_list

Selection graph: Staff node with a closed edge to Grad.Student

select distinct T0.id
from Staff T0
where T0.id not in (select T1.Advisor
                    from Graduate_Student T1)

Selection graph: Staff node with an open edge to Grad.Student and a closed edge to Grad.Student with GPA > 3.9

select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
  and T0.id not in (select T1.Advisor
                    from Graduate_Student T1
                    where T1.GPA > 3.9)
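The generic query can be generated mechanically from the selection graph structure. The sketch below builds on the illustrative SGNode/SGEdge classes from the earlier sketch; the traversal and naming are assumptions about how such a translator could look, not the thesis implementation.

def selection_graph_to_sql(root, target_key="id"):
    """Translate a selection graph into the generic query:
    select distinct T0.<key> from <table_list> where <join_list> and <condition_list>."""
    tables, joins, conds = [], [], []
    counter = [0]

    def new_alias():
        alias = f"T{counter[0]}"
        counter[0] += 1
        return alias

    def visit(node, alias):
        tables.append(f"{node.table} {alias}")
        for c in node.conditions:
            conds.append(f"{alias}.{c.attribute} {c.operator} {c.value!r}")
        for e in node.edges:
            parent_key, child_key = e.keys
            child_alias = new_alias()
            if e.open:
                joins.append(f"{alias}.{parent_key} = {child_alias}.{child_key}")
                visit(e.child, child_alias)
            else:
                # closed edge: the child becomes a NOT IN subquery
                sub = " and ".join(f"{child_alias}.{c.attribute} {c.operator} {c.value!r}"
                                   for c in e.child.conditions)
                conds.append(f"{alias}.{parent_key} not in (select {child_alias}.{child_key} "
                             f"from {e.child.table} {child_alias}"
                             + (f" where {sub})" if sub else ")"))

    visit(root, new_alias())
    where = " and ".join(joins + conds) or "1=1"
    return f"select distinct T0.{target_key} from {', '.join(tables)} where {where}"

# e.g. selection_graph_to_sql(root) for the graph built in the previous sketch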
10
MR decision tree
  • Each node contains a selection graph
  • Each child selection graph is a supergraph of the
    parent selection graph

11
How to choose selection graphs in nodes?
  • Problem There are too many supergraph selection
    graphs to choose from in each node
  • Solution
  • start with initial selection graph
  • find greedy heuristic to choose
    supergraphselection graphs refinements
  • use binary splits for simplicity
  • for each refinementget complement refinement
  • choose the best refinement basedon information
    gain criterion
  • Problem Somepotentiallygood refinementsmay
    give noimmediate benefit
  • Solution
  • look ahead capability
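The gain of a candidate refinement can be computed from the class counts of the instances covered by the refinement and by its complement. A minimal sketch follows; the function and variable names are illustrative assumptions.

import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def information_gain(parent_counts, refinement_counts):
    """Gain of splitting a node into a refinement and its complement refinement.
    Counts are Counters mapping class label -> number of target-table instances."""
    complement_counts = parent_counts - refinement_counts   # count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
    n = sum(parent_counts.values())
    n_r, n_c = sum(refinement_counts.values()), sum(complement_counts.values())
    return (entropy(parent_counts)
            - n_r / n * entropy(refinement_counts)
            - n_c / n * entropy(complement_counts))

# e.g. the parent node covers 4 instances of class '70-80k' and 2 of '80-100k',
# and a refinement covers 1 and 2 of them respectively:
gain = information_gain(Counter({"70-80k": 4, "80-100k": 2}),
                        Counter({"70-80k": 1, "80-100k": 2}))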

12
Refinements of selection graph
[Figure: the selection graph being refined: Staff, Grad.Student and Department nodes, with the condition Specialization = math on the Department node and a closed Grad.Student node with GPA > 3.9.]

  • add condition to the node - explore attribute
    information in the tables
  • add present edge and open node - explore
    relational properties between the tables

13
Refinements of selection graph
refinement: add the condition Position = Professor
complement refinement: add the condition Position != Professor

[Figure: the refined selection graph and its complement; both keep the condition Specialization = math.]

  • add condition to the node
  • add present edge and open node
14
Refinements of selection graph
refinement: add the condition GPA > 2.0
complement refinement

[Figure: the refined selection graph and its complement; both keep the condition Specialization = math.]

  • add condition to the node
  • add present edge and open node

15
Refinements of selection graph
refinement: add the condition Students > 200 to the Department node
complement refinement

[Figure: the refined selection graph and its complement; both keep the condition Specialization = math and the closed Grad.Student node with GPA > 3.9.]

  • add condition to the node
  • add present edge and open node

16
Refinements of selection graph
refinement
complement refinement
Note: information gain = 0

[Figure: a refinement of the selection graph (Specialization = math, closed Grad.Student node with GPA > 3.9) and its complement.]

  • add condition to the node
  • add present edge and open node

17
Refinements of selection graph
refinement
complement refinement

[Figure: another refinement of the selection graph (Specialization = math, closed Grad.Student node with GPA > 3.9) and its complement.]

  • add condition to the node
  • add present edge and open node

18
Refinements of selection graph
refinement
complement refinement

[Figure: another refinement of the selection graph (Specialization = math, closed Grad.Student node with GPA > 3.9) and its complement.]

  • add condition to the node
  • add present edge and open node

19
Refinements of selection graph
refinement
complement refinement

[Figure: another refinement of the selection graph (Specialization = math, closed Grad.Student node with GPA > 3.9) and its complement.]

  • add condition to the node
  • add present edge and open node

20
Look ahead capability
refinement
complement refinement

[Figure: a refinement of the selection graph (Specialization = math, closed Grad.Student node with GPA > 3.9) and its complement refinement.]
21
Look ahead capability
refinement

[Figure: the look-ahead refinement adds an edge to a Department node together with the condition Students > 200 in a single step; the condition Specialization = math is kept.]
22
MRDTL algorithm. Construction phase
  • for each non-leaf node:
  • consider all possible refinements, and their
    complements, of the node's selection graph
  • choose the best one based on the information gain
    criterion
  • create children nodes
    (a minimal sketch of this loop follows below)

[Figure: the root node (Staff) is refined into children whose selection graphs add Grad.Student nodes.]
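A compact Python sketch of the construction loop is given below. It assumes hypothetical helpers enumerate_refinements, complement and covered_counts (the latter would issue the SQL count queries shown on the following slides) together with the information_gain function sketched earlier; it illustrates the control flow only, not the thesis implementation.

def build_mrdt(selection_graph, depth=0, max_depth=5):
    """Recursively build a multi-relational decision tree.
    Each internal node stores the chosen refinement; children store supergraphs."""
    counts = covered_counts(selection_graph)              # hypothetical: class -> count via SQL
    if depth == max_depth or len(counts) <= 1:
        return {"leaf": max(counts, key=counts.get)}      # majority class of covered instances

    best = None
    for refinement in enumerate_refinements(selection_graph):   # hypothetical enumerator
        gain = information_gain(counts, covered_counts(refinement))
        if best is None or gain > best[0]:
            best = (gain, refinement)

    _, refinement = best
    return {"refinement": refinement,
            "left":  build_mrdt(refinement, depth + 1, max_depth),
            "right": build_mrdt(complement(selection_graph, refinement), depth + 1, max_depth)}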
23
MRDTL algorithm. Classification phase
  • for each leaf:
  • apply the selection graph of the leaf to the test
    data
  • classify the resulting instances with the
    classification of the leaf

[Figure: a learned multi-relational decision tree over Staff, Grad.Student and Department; internal nodes test conditions such as GPA > 3.9, Position = Professor, Specialization = math and Specialization = physics, and leaves predict salary classes such as 70-80k and 80-100k.]
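Equivalently, each test instance can be routed down the tree built by the sketch above. The predicate satisfies() is hypothetical: it would check whether the instance is selected by the refinement's selection graph, e.g. via its SQL query.

def classify(tree, instance):
    """Route one target-table instance down the tree produced by build_mrdt."""
    while "leaf" not in tree:
        # go left if the instance satisfies the node's refinement selection graph
        tree = tree["left"] if satisfies(instance, tree["refinement"]) else tree["right"]
    return tree["leaf"]

# e.g. predictions = [classify(tree, staff_row) for staff_row in test_staff_rows]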
24
The most time consuming operations of MRDTL
  • Entropy associated with this selection graph
    (Specialization = math):

E = - Σ (ni / N) log (ni / N)

Query associated with the counts ni:

ID  Name    Dep  Position   Salary
p1  Dale    d1   Postdoc    c1
p2  Martin  d1   Postdoc    c1
p3  David   d4   Postdoc    c1
p4  Peter   d3   Postdoc    c1
p5  Adrian  d2   Professor  c2
p6  Doina   d3   Professor  c2

select distinct Staff.Salary, count(distinct Staff.ID)
from Staff, Grad.Student, Department
where join_list and condition_list
group by Staff.Salary

The result of the query is the list of pairs (ci, ni);
here n1 = 4 for class c1 and n2 = 2 for class c2.
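The grouped count query and the entropy computed from its result can be reproduced with Python's sqlite3 module. The self-contained sketch below mirrors the example table above; join_list and condition_list are omitted, so only the grouping and entropy step is shown.

import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Staff (ID text, Name text, Dep text, Position text, Salary text)")
conn.executemany("insert into Staff values (?, ?, ?, ?, ?)", [
    ("p1", "Dale",   "d1", "Postdoc",   "c1"),
    ("p2", "Martin", "d1", "Postdoc",   "c1"),
    ("p3", "David",  "d4", "Postdoc",   "c1"),
    ("p4", "Peter",  "d3", "Postdoc",   "c1"),
    ("p5", "Adrian", "d2", "Professor", "c2"),
    ("p6", "Doina",  "d3", "Professor", "c2"),
])

# query associated with the counts ni (join_list / condition_list omitted here)
pairs = conn.execute(
    "select Salary, count(distinct ID) from Staff group by Salary").fetchall()

N = sum(n for _, n in pairs)
E = -sum(n / N * math.log2(n / N) for _, n in pairs)
print(pairs, E)   # e.g. [('c1', 4), ('c2', 2)] and E ~ 0.918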
25
The most time consuming operations of MRDTL
  • Entropy associated with each of the refinements
    (e.g. adding GPA > 2.0 to the selection graph with
    Specialization = math):

select distinct Staff.Salary, count(distinct Staff.ID)
from table_list
where join_list and condition_list
group by Staff.Salary
26
A way to speed up - eliminate redundant
calculations
  • Problem: for a selection graph with 162 nodes the
    time to execute a query is more than 3 minutes!
  • Redundancy in calculation: for this selection
    graph the tables Staff and Grad.Student will be
    joined over and over for all the children
    refinements of the tree
  • A way to fix it: calculate the join only once and
    save it for all further calculations

[Figure: the selection graph with Specialization = math whose joins are recomputed for every refinement.]
27
Speed Up Method. Sufficient tables
Sufficient table S for the selection graph (Specialization = math):

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2
28
Speed Up Method. Sufficient tables
  • Entropy associated with this selection graph
    (Specialization = math):

E = - Σ (ni / N) log (ni / N)

Query associated with the counts ni, now run against
the sufficient table S:

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2

select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary

The result of the query is the list of pairs (ci, ni);
here n1 = 4 and n2 = 2.
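The speed-up amounts to materializing the join once and answering all subsequent count queries against the cached table. The sqlite3 sketch below is illustrative: the join follows the example schema, table names are adapted to valid SQL identifiers (Grad_Student), and the exact join conditions are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
# create the example schema (left unpopulated in this sketch)
conn.executescript("""
create table Staff (ID text, Name text, Department text, Position text, Salary text);
create table Grad_Student (ID text, Name text, GPA real, Publications int, Advisor text, Department text);
create table Department (ID text, Specialization text, Students int);
""")

# materialize the sufficient table S once for the current selection graph
conn.execute("""
create table S as
select Staff.ID as Staff_ID, Grad_Student.ID as GradStudent_ID,
       Department.ID as Dep_ID, Staff.Salary as Salary
from Staff, Grad_Student, Department
where Staff.ID = Grad_Student.Advisor
  and Grad_Student.Department = Department.ID
  and Department.Specialization = 'math'
""")

# every refinement's class counts can now be answered from S alone
counts = conn.execute(
    "select Salary, count(distinct Staff_ID) from S group by Salary").fetchall()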

29
Speed Up Method. Sufficient tables
Queries associated with the 'add condition' refinement
(condition on attribute A of table X):

select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculation for the complement refinement:

count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
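In code, the complement counts therefore never require a separate query; they follow by subtraction, as in this short Counter-based sketch.

from collections import Counter

def complement_counts(parent_counts: Counter, refinement_counts: Counter) -> Counter:
    """count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S)) for every class ci."""
    return parent_counts - refinement_counts

# e.g. complement_counts(Counter({"c1": 4, "c2": 2}), Counter({"c1": 1, "c2": 2}))
# -> Counter({'c1': 3})   (classes whose count drops to 0 are simply omitted)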
30
Speed Up Method. Sufficient tables
Queries associated with the 'add edge' refinement
(edge e to a new table Y):

select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID and e.cond
group by S.Salary

Calculation for the complement refinement:

count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
31
Speed Up Method
  • Significant speed up in obtaining the counts
    needed for the calculations of the entropy and
    information gain
  • The speed up comes at the cost of additional space
    used by the algorithm

32
Handling Missing Values
Graduate Student
ID  Name    GPA  Publications  Advisor  Department
s1  John    2.0  4             p1       d3
s2  Lisa    3.5  10            p1       d3
s3  Michel  3.9  3             p4       d4

Department
ID  Specialization    Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          ?                 70-80k
p2  Martin  d3          ?                 30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          ?                 80-100k

  • For each attribute which has missing values we
    build a Naïve Bayes model

[Figure: Naïve Bayes model for Staff.Position: the attribute with missing values (b) is predicted from other attributes (a) such as Staff.Name, Staff.Dep and Department.Spec via the probabilities P(a | b).]
33
Handling Missing Values
Graduate Student
ID  Name  GPA  Publications  Advisor  Department
s1  John  2.0  4             p1       d3
s2  Lisa  3.5  10            p1       d3

Department
ID  Specialization  Students
d1  Math            1000

Staff
ID  Name  Department  Position  Salary
p1  Dale  d1          ?         70-80k

  • Then the most probable value for the missing
    attribute is calculated by the formula:

P(vi | X1.A1, X2.A2, X3.A3)
  = P(X1.A1, X2.A2, X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
  = P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) P(vi) / P(X1.A1, X2.A2, X3.A3)
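A minimal Python sketch of this imputation step is given below. The attribute names, the add-one smoothing and the single-table setting are illustrative assumptions; in the actual system the predictor attributes may come from several joined tables.

from collections import Counter, defaultdict

def naive_bayes_impute(rows, target, predictors):
    """Replace missing (None) values of `target` by the most probable value,
    using a Naive Bayes model estimated from the rows where `target` is known."""
    complete = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in complete)
    cond = {a: defaultdict(Counter) for a in predictors}   # cond[a][class][value] = count
    for r in complete:
        for a in predictors:
            cond[a][r[target]][r[a]] += 1

    def most_probable(r):
        def score(v):
            p = prior[v] / len(complete)
            for a in predictors:   # P(attribute value | v), with add-one smoothing (an assumption)
                p *= (cond[a][v][r[a]] + 1) / (prior[v] + len(cond[a][v]) + 1)
            return p
        return max(prior, key=score)

    for r in rows:
        if r[target] is None:
            r[target] = most_probable(r)
    return rows

# e.g. impute Staff.Position from Name and Department (illustrative call)
staff = [{"Name": "Dale",   "Department": "d1", "Position": None},
         {"Name": "Victor", "Department": "d2", "Position": "VisitorScientist"}]
naive_bayes_impute(staff, "Position", ["Name", "Department"])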

34
Experimental results. Mutagenesis
  • The most widely used DB in ILP.
  • Describes molecules of certain nitro aromatic
    compounds.
  • Goal: predict their mutagenic activity (label
    attribute), i.e. the ability to cause DNA to
    mutate. High mutagenic activity can cause cancer.
  • Two subsets: regression friendly (188 molecules)
    and regression unfriendly (42 molecules). We used
    only the regression friendly subset.
  • 5 levels of background knowledge: B0, B1, B2, B3,
    B4. They provide richer descriptions of the
    examples. We used the B2 level.

35
Experimental results. Mutagenesis
  • Schema of the mutagenesis database
  • Results of 10-fold cross-validation for the
    regression friendly set:

Data Set     Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
mutagenesis  87.5%     3                      9          28.45 secs          52.15 secs

Best-known reported accuracy is 86%.
36
Experimental results. KDD Cup 2001
  • Consists of a variety of details about the
    various genes of one particular type of organism.
  • Genes code for proteins, and these proteins tend
    to localize in various parts of cells and
    interact with one another in order to perform
    crucial functions.
  • 2 tasks: prediction of gene/protein localization
    and function
  • 862 training genes, 381 test genes.
  • Many attribute values are missing: 70% of the
    CLASS attribute, 50% of COMPLEX, and 50% of MOTIF
    in the composition table

37
Experimental results. KDD Cup 2001
localization                     Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
With handling missing values     76.11%    19                     213        202.9 secs          1256.38 secs
Without handling missing values  50.14%    33                     575        550.76 secs         2257.20 secs

Best-known reported accuracy is 72.1%.

function                         Accuracy  Sel. graph size (max)  Tree size (max)  Time with speed up  Time without speed up
With handling missing values     91.44%    9                      63               151.19 secs         307.83 secs
Without handling missing values  88.56%    9                      19               61.29 secs          118.41 secs

Best-known reported accuracy is 93.6%.
38
Experimental results. PKDD 2001 Discovery
Challenge
  • Consists of 5 tables: DIAGNOSIS, ANA_PATTERN,
    PATIENT_INFO, THROMBOSIS, ANTIBODY_EXAM
  • Target table consists of 1239 records
  • The task is to predict the degree of the
    thrombosis attribute from the ANTIBODY_EXAM table
  • The results for 5x2 cross-validation:

Data Set    Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
thrombosis  98.1%     31                     71         127.75 secs         198.22 secs

Best-known reported accuracy is 99.28%.
39
Summary
  • the new implementation significantly outperforms
    the original MRDTL implementation (Leiva, 2002) in
    terms of running time
  • the accuracy results are comparable with the best
    reported results obtained using different
    data-mining algorithms

Future work
  • Incorporation of more sophisticated techniques
    for handling missing values
  • Incorporation of more sophisticated pruning
    techniques or complexity regularization
  • More extensive evaluation of MRDTL on real-world
    data sets
  • Development of ontology-guided multi-relational
    decision tree learning algorithms to generate
    classifiers at multiple levels of abstraction
    (Zhang et al., 2002)
  • Development of variants of MRDTL that can learn
    from heterogeneous, distributed, autonomous data
    sources, based on recently developed techniques
    for distributed learning and ontology-based data
    integration

40
Thanks to
  • Dr. Honavar for providing guidance, help and
    support throughout this research
  • Colleagues from the Artificial Intelligence Lab for
    various helpful discussions
  • My committee members Drena Dobbs and Yan-Bin Jia
    for their help
  • Professors and lecturers of the Computer Science
    department for the knowledge that they gave me
    through lectures and discussions
  • Iowa State University and the Computer Science
    department for funding this research in part