Title: Mining Relational Model Trees
1. Mining Relational Model Trees
Annalisa Appice
Department of Computer Science, University of Bari
Knowledge Acquisition and Machine Learning Lab
2. Regression problem in classical data mining
- Given
  - m independent (or predictor) variables Xi (both continuous and discrete)
  - a continuous dependent (or response) variable Y to be predicted
  - a set of n training cases (x1, x2, ..., xm, y)
- Build
  - a function y = g(x) that correctly predicts the value of the response variable for each m-tuple (x1, x2, ..., xm) (see the sketch below)
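To make the task concrete, here is a minimal sketch of fitting y = g(x) by least squares, with g restricted to a linear form; the data and all names are invented for illustration.

```python
# Minimal sketch: fit y = g(x) from n training cases with m predictors.
# Invented data; g is restricted to a linear model for simplicity.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                        # m independent variables
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=n)

A = np.column_stack([np.ones(n), X])               # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # least-squares fit
g = lambda x: coef[0] + x @ coef[1:]               # the learned function g
print(coef)                                        # ~ [2.0, 1.5, -0.7, 0.3]
```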
3. Regression trees and model trees
Partitioning of observations + local regression models → regression trees or model trees
4. Model trees: state of the art
- Statistics
  - Ciampi (1991): RECPAM
  - Siciliano & Mola (1994)
- Data Mining
  - Karalic (1992): RETIS
  - Quinlan (1992): M5
  - Wang & Witten (1997): M5'
  - Lubinsky (1994): TSIR
  - Torgo (1997): HTL
The tree structure is generated according to a top-down strategy.
5. Model trees: state of the art
- Models in the leaves have only a local validity → they are built on the basis of the training cases falling in the corresponding partition of the feature space.
- Global effects can be represented by variables that are introduced in the regression models at higher levels of the model tree → a different tree structure is required!
- Internal nodes can
  - either define a further partitioning of the feature space,
  - or introduce some regression variables in the models to be associated with the leaves.
6. Two types of nodes
- Splitting nodes perform a Boolean test:
  - Xi ≤ α (continuous variable)
  - Xi ∈ {xi1, ..., xih} (discrete variable)
- Regression nodes compute a straight-line regression (see slide 7).
[Figure: splitting nodes t with children tL and tR; each child is associated with a straight-line model such as Y = a + bXu or Y = c + dXw.]
7. What is passed down?
- Splitting nodes pass down to each child only a subgroup of the training cases, without any change to the variables.
- Regression nodes pass down to their unique child all training cases. Values of the variables not included in the model are transformed to remove the linear effect of the variables already included.
8. An example of model tree
9. Building a regression model stepwise: some tricks
- Example: build a multiple regression model with two independent variables,
  Y = a + bX1 + cX2,
  through a sequence of straight-line regressions:
  - Build Y = a1 + b1X1
  - Build X2 = a2 + b2X1
  - Compute the residuals on X2: X'2 = X2 - (a2 + b2X1)
  - Compute the residuals on Y: Y' = Y - (a1 + b1X1)
  - Regress Y' on X'2 alone: Y' = a3 + b3X'2.
By substituting the equation of X'2 into the last equation,
  Y = a3 + a1 - a2b3 + b3X2 - (b2b3 - b1)X1,
so it can be proven that a = a3 - a2b3 + a1, b = -b2b3 + b1 and c = b3.
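The algebra above can be checked numerically. In this sketch (invented data), the sequence of straight-line regressions recovers exactly the coefficients of the direct multiple regression Y = a + bX1 + cX2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)      # X2 correlated with X1
Y = 3.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=0.1, size=n)

fit = lambda x, y: np.polyfit(x, y, 1)  # straight line: (slope, intercept)
b1, a1 = fit(X1, Y)                     # Y  = a1 + b1*X1
b2, a2 = fit(X1, X2)                    # X2 = a2 + b2*X1
X2_res = X2 - (a2 + b2 * X1)            # residuals on X2
Y_res = Y - (a1 + b1 * X1)              # residuals on Y
b3, a3 = fit(X2_res, Y_res)             # Y' = a3 + b3*X2'

a = a3 - a2 * b3 + a1                   # reconstructed intercept a
b = -b2 * b3 + b1                       # reconstructed coefficient b of X1
c = b3                                  # reconstructed coefficient c of X2
print(a, b, c)                          # ~ 3.0, 2.0, -1.5
```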
10. The global effect of regression nodes
[Figure: region R split on variable Xj into subregions R1 and R2; regression lines of Y in each subregion.]
- Both regression models associated with the leaves include Xi.
- The contribution of Xi to Y can be different for each leaf, but
- it can be reliably estimated on the whole region R.
11. An example of model tree
12. Advantages of the proposed tree structure
- It captures both the global and the local effects of regression variables.
- Multiple regression models at the leaves can be efficiently built stepwise.
- The multiple regression model at a leaf can be easily computed → the heuristic function for the selection of regression and splitting nodes can take it into account.
13. Evaluating splitting and regression nodes
[Figure: splitting node t with test Xi ≤ α and children tL, tR, whose straight-line models are Y = a + bXu and Y = c + dXv.]
A splitting node t is evaluated through the weighted resubstitution error of its children:
  σ(t) = (nL/(nL + nR)) R(tL) + (nR/(nL + nR)) R(tR),
where R(tL) (R(tR)) is the resubstitution error associated with the left (right) child and nL (nR) is the number of cases it receives.
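A minimal sketch of scoring a candidate split this way; the function names are invented and R is taken here as the mean squared residual of the child's straight-line regression:

```python
import numpy as np

def resubstitution_error(x, y):
    """R: mean squared residual of the straight-line regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.mean((y - (intercept + slope * x)) ** 2)

def sigma(xi, xu, y, alpha):
    """Weighted error of the split Xi <= alpha (both children non-empty)."""
    left = xi <= alpha                    # boolean mask of the left child
    n_l, n_r = left.sum(), (~left).sum()
    r_l = resubstitution_error(xu[left], y[left])
    r_r = resubstitution_error(xu[~left], y[~left])
    return (n_l * r_l + n_r * r_r) / (n_l + n_r)
```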
14. Filtering useless splitting nodes
- Problem: a splitting node with identical straight-line regressions associated with its children → the split is really modelling a regression step. How can this be recognized?
- Solution: compare the two regression lines associated with the children of a splitting node according to a statistical test for coincident regression lines (Weisberg, 1985).
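A hedged sketch of such a test: the classical F-test comparing a single pooled line against two separate lines (my reading of the test for coincident regression lines in Weisberg, 1985, not the system's code):

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(x, y):
    """Residual sum of squares of the straight-line regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return np.sum((y - (intercept + slope * x)) ** 2)

def coincident_lines_test(x_l, y_l, x_r, y_r):
    """F-test of H0: the two children share one regression line."""
    rss_pooled = rss(np.concatenate([x_l, x_r]), np.concatenate([y_l, y_r]))
    rss_sep = rss(x_l, y_l) + rss(x_r, y_r)
    df2 = len(x_l) + len(x_r) - 4          # two parameters per line
    F = ((rss_pooled - rss_sep) / 2) / (rss_sep / df2)
    return F, f_dist.sf(F, 2, df2)         # statistic and p-value

# A large p-value suggests coincident lines: the split adds nothing beyond
# a single regression step and can be filtered out.
```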
15. Stopping criteria
- The first performs the partial F-test to evaluate the contribution of a new independent variable to the model (see the sketch after this list).
- The second requires the number of cases in each node to be greater than a minimum value.
- The third operates when all continuous variables along the path from the root to the current node are used in regression steps and there are no discrete variables in the training set.
- The fourth creates a leaf if the error in the current node is below a fraction of the error in the root node.
- The fifth stops the growth when the coefficient of determination is greater than a minimum value.
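As an illustration of the first criterion, a partial F-test can be sketched as follows (threshold and names are illustrative, not the system's code):

```python
import numpy as np
from scipy.stats import f as f_dist

def partial_f_test(X_current, x_new, y, alpha=0.05):
    """H0: the coefficient of x_new is zero given the current model."""
    n = len(y)
    A = np.column_stack([np.ones(n), X_current])     # current model matrix
    A_ext = np.column_stack([A, x_new])              # model with the new variable
    rss = lambda M: np.sum((y - M @ np.linalg.lstsq(M, y, rcond=None)[0]) ** 2)
    df2 = n - A_ext.shape[1]
    F = (rss(A) - rss(A_ext)) / (rss(A_ext) / df2)
    return F > f_dist.isf(alpha, 1, df2)             # True: variable contributes
```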
16. Related works and problems
- In principle, the optimal split should be chosen on the basis of the fit of each regression model to the data.
- Problem: in some systems (M5, M5' and HTL) the heuristic function does not take into account the models associated with the leaves of the tree.
  → The evaluation function is incoherent with respect to the model tree being built.
  → Some simple regression models are not correctly discovered.
17. Related works and problems
- Example: Cubist splits the data at -0.1 and builds the following models:
  - X ≤ -0.1: Y = 0.78 + 0.175X
  - X > -0.1: Y = 1.143 - 0.281X
18. Related works and problems
- RETIS solves this problem by computing the best multiple regression model at the leaves for each splitting node.
- The problem is theoretically solved, but:
  - the approach is computationally expensive: a multiple regression model is built for each possible test, and the choice of the first split is O(m³N²);
  - all continuous variables are involved in the multiple linear models associated with the leaves, so when some of the independent variables are linearly related to each other, several problems may occur (collinearity).
19. Related works and problems
- TSIR induces model trees with regression nodes and splitting nodes, but:
  - the effect of the regressed variable in a regression node is not removed when cases are passed down, so
  - the multiple regression model associated with each leaf cannot be correctly interpreted from a statistical viewpoint.
20. Computational complexity
- It can be proved that SMOTI has an O(m³n²) worst-case complexity for the selection of any node (splitting or regression).
- RETIS has the same complexity for node selection, although RETIS does not select a subset of variables to solve collinearity problems.
21. Empirical evaluation
- For pairwise comparison with RETIS and M5', which are the state-of-the-art model tree induction systems, the non-parametric Wilcoxon two-sample paired signed rank test is used.
- Experiments (Malerba et al., 2004):
  - laboratory-sized data sets
  - UCI datasets
22. Empirical evaluation on laboratory-sized data
[Chart: comparison of RETIS, M5' and SMOTI on laboratory-sized data.]
23. Empirical evaluation on laboratory-sized data
[Chart: running time (s) of RETIS, M5' and SMOTI versus number of examples.]
24. Empirical evaluation on UCI data
25. Empirical evaluation on UCI data
- For some datasets SMOTI mines interesting patterns that no previous study on model trees has ever revealed.
- This demonstrates the easy interpretability of the model trees induced by SMOTI.
- Example: Abalone (marine molluscs), where the goal is to predict the age (number of rings). SMOTI builds a model tree with a regression node in the root. The straight-line regression selected at the root is almost invariant across all model trees and expresses a linear dependence between the number of rings (dependent variable) and the shucked weight (independent variable). This is a clear example of a global effect.
26. SMOTI: open issues
- The DM system KDB2000 (http://www.di.uniba.it/malerba/software/kdb2000/index.htm), which implements SMOTI, is not tightly integrated with the DBMS → a tighter integration with a DBMS is needed.
- SMOTI cannot be applied directly to multi-relational data mining tasks → the unit of analysis is an individual described by a set of random variables, each of which results in just one single value...
27. From classical to relational data mining
- ...while in most real-world applications complex objects are described in terms of properties and relations.
- Example: in spatial domains the effect of a predictor variable at any site may not be limited to that site (spatial autocorrelation). E.g., there is no communal establishment (school, hospital) in an ED (enumeration district), but many of them are located in the nearby EDs.
28. Multi-relational representation
- Augment the data table with information about neighbouring units.
[Figure: the target table linked to tables of task-relevant objects.]
29. Regression problem in relational data mining
- Given:
  - a training set O stored in relational tables S = {T0, T1, ..., Th} of a relational database D,
  - a set of v primary key constraints PK on relations in S,
  - a set of w foreign key constraints FK on relations in S,
  - a target relation T(X1, ..., Xn, Y) ∈ S,
  - a target continuous attribute Y in T, different from the primary key or foreign keys in T.
- Find:
  - a multi-relational regression model which predicts the value of Y for some object represented as a tuple in T and related tuples in S according to foreign key paths.
30. How to work with (multi-)relational data?
- Mould the relational database into a single table, so that traditional attribute-value algorithms are able to work on it:
  - create a single relation by deriving attributes from other joined tables, or
  - construct a single relation that summarizes and/or aggregates information found in other tables (see the toy sketch after this list).
- Solve mining problems in their original representation:
  - FORS (Karalic, 1997)
  - SRT (Kramer, 1996), S-CART (Kramer, 1999), TILDE-RT (Blockeel, 1998)
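A toy illustration of the first strategy, moulding a one-to-many schema into a single relation by aggregation (pandas for brevity; the tables and columns are invented):

```python
import pandas as pd

customer = pd.DataFrame({"Id": [1, 2], "Sale": [10.0, 20.0],
                         "CreditLine": [50.0, 80.0]})
order = pd.DataFrame({"Client": [1, 1, 2], "Pieces": [5, 7, 3]})

# Summarize the related Order tuples per customer, then join them back:
agg = order.groupby("Client")["Pieces"].agg(["count", "mean"])
single_table = customer.merge(agg, left_on="Id", right_index=True, how="left")
print(single_table)  # one row per target object, usable by attribute-value methods
```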
31. Strengths and weaknesses of current multi-relational regression methods
- Strengths:
  - they solve relational regression problems in their original representation,
  - they are able to exploit background knowledge in the mining process,
  - they learn multi-relational patterns.
- Weaknesses:
  - knowledge of the data model is not used to guide the search process,
  - data is stored as Prolog facts,
  - they are not integrated with the database,
  - they do not differentiate global vs. local effects of variables in a regression model.
Idea: combine the achievements of the KDD field on the integration of data mining with database systems with the results reported in the ILP field on how to upgrade propositional data mining algorithms to multi-relational representations.
32. Global/local effect multi-relational model trees: Mr-SMOTI
Mr-SMOTI upgrades SMOTI to multi-relational representations and tightly integrates the data mining engine with a relational DBMS.
- Mr-SMOTI is the relational extension of SMOTI that outputs relational model trees such that:
  - each node corresponds to a subset of the training data and is associated with a portion of D intensionally described by a relational pattern,
  - each leaf is associated with a (multiple) regression function which may involve predictor variables from several tables in D,
  - each variable that is eventually introduced in the left branch of a node must not occur in the right branch of that node,
  - relational patterns associated with nodes are represented with regression selection graphs, which extend the selection graph definition (Knobbe, 1999),
  - regression selection graphs are translated into SQL expressions stored in XML format.
33. What is a regression selection graph?
- It corresponds to the tuples describing a subset of the instances from the database, eventually modified by removing the effect of regression steps.
- Nodes correspond to the tables from the database, whose attributes are replaced by the corresponding residuals.
- Arcs correspond to foreign key associations between tables:
  - open arcs require at least one associated tuple,
  - closed arcs require no associated tuples.
[Figure: example regression selection graph involving the CreditLine attribute.]
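A hedged sketch of a regression selection graph as a data structure, with a naive rendering into SQL in the spirit of Knobbe's selection graphs; the classes and the translation are illustrative, not the system's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    table: str
    conditions: list = field(default_factory=list)  # e.g. "Quantity <= 22"

@dataclass
class Arc:
    child: Node
    join: str             # foreign key link, e.g. "Customer.Id = Order.Client"
    present: bool = True  # True: open arc (EXISTS); False: closed arc (NOT EXISTS)

def to_sql(root: Node, arcs: list) -> str:
    """Render the graph rooted in `root` as a correlated-subquery SELECT."""
    where = list(root.conditions)
    for arc in arcs:
        sub = f"SELECT * FROM {arc.child.table} WHERE {arc.join}"
        if arc.child.conditions:
            sub += " AND " + " AND ".join(arc.child.conditions)
        where.append(f"{'EXISTS' if arc.present else 'NOT EXISTS'} ({sub})")
    sql = f"SELECT * FROM {root.table}"
    return sql + (" WHERE " + " AND ".join(where) if where else "")

# to_sql(Node("Customer"),
#        [Arc(Node("Order", ["Quantity <= 22"]), "Customer.Id = Order.Client")])
```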
34. Relational splitting nodes
- add condition / add negative condition (split condition)
- add present arc and open node / add absent arc and closed node (join condition)
[Figure: selection graphs over the Customer, Order and Detail tables.]
35. Relational splitting nodes
- add condition / add negative condition (split condition)
- add present arc and open node / add absent arc and closed node (join condition)
[Figure (2nd case): complementary refinements of a Customer-Order-Detail selection graph on the condition Quantity ≤ 22.]
36. Relational splitting nodes
- add condition / add negative condition (split condition)
- add present arc and open node / add absent arc and closed node (join condition)
[Figure: refinements of a Customer-Order selection graph.]
37. Relational splitting nodes with look-ahead
[Figure: a look-ahead refinement of the Customer selection graph that adds the Order and Detail nodes with the condition Quantity ≤ 22 in a single step, together with its complementary refinement.]
38. Relational regression nodes
Given Customer(Id, Sale, CreditLine, Agent) and Order(Id, Date, Client, Pieces), a regression node that introduces Sale into the model transforms the remaining continuous attributes into residuals:
  CreditLine' = CreditLine - (5·Sale - 0.5)
  Pieces' = Pieces - (-2.5·Sale - 3.2)
yielding the transformed tables
  Customer(Id, Sale, CreditLine - 5·Sale + 0.5, Agent)
  Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2)
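A toy sketch of this transformation with pandas, using the coefficients above on invented data: both tables are rewritten with residuals, and Sale is reached from Order through the foreign key Order.Client → Customer.Id:

```python
import pandas as pd

customer = pd.DataFrame({"Id": [1, 2], "Sale": [10.0, 20.0],
                         "CreditLine": [55.0, 99.0], "Agent": ["a", "b"]})
order = pd.DataFrame({"Id": [10, 11, 12], "Client": [1, 1, 2],
                      "Pieces": [5.0, 7.0, 3.0]})

# CreditLine' = CreditLine - (5*Sale - 0.5)
customer["CreditLine"] -= 5 * customer["Sale"] - 0.5
# Pieces' = Pieces - (-2.5*Sale - 3.2), with Sale fetched via the foreign key
sale = order["Client"].map(customer.set_index("Id")["Sale"])
order["Pieces"] -= -2.5 * sale - 3.2
```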
39. Relational model trees: an example
[Figure: example relational model tree. A regression node on Sale transforms the tables into Customer(Id, Sale, CreditLine - 5·Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2); a splitting node tests Date in 02/09/02 on the Order table; a further regression node subtracts 0.1·(Pieces + 2.5·Sale + 3.2) from the CreditLine residual.]
40. How to choose the best relational node?
- Start with the root node, which is associated with the selection graph containing only the target node.
- Greedily choose among regression selection graph refinements:
  - use binary splits for simplicity;
  - for each refinement, get the complementary refinement;
  - store the regression coefficients in order to compute residuals on continuous attributes;
  - choose the best refinement based on the evaluation functions.
41. Evaluating relational splitting nodes
[Figure: a candidate relational split: the Customer selection graph and its two complementary refinements involving Order.]
42. Evaluating relational regression nodes
A relational regression node t is evaluated as ρ(t) = min{R(t), σ(t')}, where:
- R(t) is the resubstitution error computed on the tuples extracted by the regression selection graph associated with t,
- t' is the best splitting node following t.
43. Stopping criteria
- The first requires the number of target objects in each node to be greater than a minimum value.
- The second operates when all continuous attributes along the path from the root to the current node are used in regression steps and there is no "add open node and present arc" refinement that introduces new continuous attributes.
- The third stops the growth when the coefficient of determination is greater than a minimum value.
44. Mr-SMOTI: some details
- Mr-SMOTI has been implemented as a component of the KDD system MURENA.
- MURENA has been implemented in Java and interfaces an Oracle database.
- http://www.di.uniba.it/%7Ececi/micFiles/systems/The%20MURENA%20project.html
45. Empirical evaluation on laboratory-sized data
46. Empirical evaluation on laboratory-sized data
Wilcoxon test (alpha = 0.05)
47. Empirical evaluation on real data
48. Improving efficiency by materializing intermediate results
49. Questions?