1
Mining Relational Model Trees
Annalisa Appice
Department of Computer Science, University of Bari
Knowledge Acquisition & Machine Learning Lab
2
Regression problem in classical data mining
  • Given
  • m independent (or predictor) variables Xi (both continuous and discrete)
  • a continuous dependent (or response) variable Y to be predicted
  • a set of n training cases (x1, x2, …, xm, y)
  • Build
  • a function y = g(x) that correctly predicts the value of the response variable for each m-tuple (x1, x2, …, xm) (a minimal least-squares sketch follows)
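A minimal sketch of this task (illustrative only, not part of the original slides): fit a linear g by ordinary least squares on synthetic data.

```python
# Minimal sketch of the classical regression task: n training cases,
# m predictor variables, one continuous response; fit a linear g by
# ordinary least squares. Data and coefficients are made up.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                      # n cases, m predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=n)

A = np.hstack([np.ones((n, 1)), X])              # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fitted intercept and coefficients:", coef)
```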

3
Regression trees and model trees
Partitioning of observations + local regression models → regression or model trees
4
Model trees: state of the art
  • Statistics
  • Ciampi (1991): RECPAM
  • Siciliano & Mola (1994)
  • Data Mining
  • Karalic (1992): RETIS
  • Quinlan (1992): M5
  • Wang & Witten (1997): M5'
  • Lubinsky (1994): TSIR
  • Torgo (1997): HTL

The tree-structure is generated according to a
top-down strategy.
5
Model trees: state of the art
  • Models in the leaves have only a local validity → they are built on the basis of the training cases falling in the corresponding partition of the feature space.
  • Global effects can be represented by variables that are introduced in the regression models at higher levels of the model tree →
  • a different tree structure is required!
  • Internal nodes can
  • either define a further partitioning of the feature space,
  • or introduce some regression variables in the models to be associated with the leaves.

6
Two types of nodes
  • Splitting nodes perform a Boolean test: Xi ≤ α for a continuous variable, Xi ∈ {xi1, …, xih} for a discrete variable. A splitting node t has two children tL and tR, each holding its own straight-line regression (e.g. Y = a + bXu in tL and Y = c + dXw in tR).
  • Regression nodes compute only a straight-line regression and have a single child.
7
What is passed down?
  • Splitting nodes pass down to each child only a subgroup of the training cases, without any change to the variables.
  • Regression nodes pass down to their unique child all training cases. Values of the variables not included in the model are transformed to remove the linear effect of those variables already included (sketched below).
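A minimal sketch of the two behaviours, assuming pandas-style tabular data in which all remaining variables are continuous (names and helpers are hypothetical):

```python
# What each node type passes down: a splitting node forwards a subset of
# rows with variables unchanged; a regression node forwards all rows,
# replacing the other variables by residuals against the regressed one.
import numpy as np
import pandas as pd

def fit_line(x, y):
    """Least-squares line y ~ a + b*x; np.polyfit returns (slope, intercept)."""
    b, a = np.polyfit(x, y, 1)
    return a, b

def split_node(data: pd.DataFrame, var: str, threshold: float):
    left = data[data[var] <= threshold]     # Boolean test, variables untouched
    right = data[data[var] > threshold]
    return left, right

def regression_node(data: pd.DataFrame, var: str, target: str) -> pd.DataFrame:
    out = data.copy()
    a, b = fit_line(out[var], out[target])
    out[target] = out[target] - (a + b * out[var])  # residuals on the response
    for col in out.columns.drop([var, target]):     # assumes all continuous
        c, d = fit_line(out[var], out[col])
        out[col] = out[col] - (c + d * out[var])    # remove var's linear effect
    return out                                      # unique child gets all cases
```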

8
An example of model tree
9
Building a regression model stepwise: some tricks
  • Example: build a multiple regression model with two independent variables
  • Y = a + bX1 + cX2
  • through a sequence of straight-line regressions:
  • Build Y = a1 + b1X1
  • Build X2 = a2 + b2X1
  • Compute the residuals on X2: X'2 = X2 − (a2 + b2X1)
  • Compute the residuals on Y: Y' = Y − (a1 + b1X1)
  • Regress Y' on X'2 alone: Y' = a3 + b3X'2

By substituting the equation of X'2 into the last equation, Y = (a3 + a1 − a2b3) + b3X2 + (b1 − b2b3)X1, so it can be proven that a = a3 − a2b3 + a1, b = −b2b3 + b1 and c = b3.
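A numerical check of this trick on synthetic data; by the Frisch–Waugh result, the residual-based steps recover exactly the coefficients of a direct two-variable least-squares fit.

```python
# Verify that the stepwise straight-line regressions reproduce the
# multiple regression Y = a + b*X1 + c*X2 on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=500)
X2 = 0.6 * X1 + rng.normal(size=500)            # X2 correlated with X1
Y = 3.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=0.05, size=500)

line = lambda x, y: np.polyfit(x, y, 1)         # returns (slope, intercept)

b1, a1 = line(X1, Y)                            # Y  = a1 + b1*X1
b2, a2 = line(X1, X2)                           # X2 = a2 + b2*X1
Y_res = Y - (a1 + b1 * X1)                      # residuals on Y
X2_res = X2 - (a2 + b2 * X1)                    # residuals on X2
b3, a3 = line(X2_res, Y_res)                    # Y' = a3 + b3*X2'

print("stepwise:", a1 + a3 - a2 * b3, b1 - b2 * b3, b3)   # a, b, c
A = np.column_stack([np.ones_like(X1), X1, X2])
print("direct:  ", np.linalg.lstsq(A, Y, rcond=None)[0])  # a, b, c
```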
10
The global effect of regression nodes

[Figure: response Y over region R, partitioned into R1 and R2 by a test on Xj]

  • Both regression models associated with the leaves include Xi.
  • The contribution of Xi to Y can be different for each leaf, but
  • it can be reliably estimated on the whole region R.

11
An example of model tree
12
Advantages of the proposed tree structure
  • It captures both the global and the local effects of regression variables.
  • Multiple regression models at the leaves can be efficiently built stepwise.
  • The multiple regression model at a leaf can be easily computed → the heuristic function for the selection of regression and splitting nodes can take it into account.

13
Evaluating splitting and regression nodes
  • Splitting node t with test Xi ≤ α has two children tL and tR, holding straight-line regressions Y = a + bXu and Y = c + dXv respectively.

R(tL) (R(tR)) is the resubstitution error associated with the left (right) child; a candidate split is scored by combining the two errors, weighted by the number of cases falling in each child (sketched below).
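A minimal sketch of this evaluation, assuming mean squared residuals as the resubstitution error and size-weighted averaging of the children's errors (a natural choice, not necessarily SMOTI's exact formula):

```python
# Score a candidate splitting node: fit a straight-line regression in
# each child and combine the children's resubstitution errors, weighted
# by the fraction of training cases each child receives.
import numpy as np

def resubstitution_error(x, y):
    """Mean squared residual of the least-squares line y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)
    return np.mean((y - (a + b * x)) ** 2)

def eval_split(x_split, x_reg, y, threshold):
    """Weighted error of the split x_split <= threshold (>= 2 cases per child)."""
    left = x_split <= threshold
    right = ~left
    n = len(y)
    return (left.sum() / n) * resubstitution_error(x_reg[left], y[left]) + \
           (right.sum() / n) * resubstitution_error(x_reg[right], y[right])
```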
14
Filtering useless splitting nodes
  • Problem: a splitting node whose children have identical straight-line regressions → the split is really modelling a regression step. How can this be recognized?
  • Solution: compare the two regression lines associated with the children of a splitting node according to a statistical test for coincident regression lines (Weisberg, 1985). A sketch of such a test follows.
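One standard formulation is a Chow-type F-test comparing a pooled fit against separate fits in the two children; this is a sketch, not necessarily the exact variant SMOTI uses.

```python
# Test whether the regression lines of the two children coincide:
# F = ((RSS_pooled - RSS_separate)/2) / (RSS_separate/(n1 + n2 - 4)),
# with 2 and n1 + n2 - 4 degrees of freedom.
import numpy as np
from scipy import stats

def rss_line(x, y):
    b, a = np.polyfit(x, y, 1)
    return np.sum((y - (a + b * x)) ** 2)

def coincident_lines_test(xL, yL, xR, yR):
    rss_pooled = rss_line(np.concatenate([xL, xR]), np.concatenate([yL, yR]))
    rss_sep = rss_line(xL, yL) + rss_line(xR, yR)
    n = len(xL) + len(xR)
    f = ((rss_pooled - rss_sep) / 2) / (rss_sep / (n - 4))
    p = stats.f.sf(f, 2, n - 4)
    return f, p    # high p-value: lines coincide, so the split is useless
```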

15
Stopping criteria
  • The first performs the partial F-test to evaluate
    the contribution of a new independent variable to
    the model.
  • The second requires the number of cases in each
    node to be greater than a minimum value.
  • The third operates when all continuous variables
    along the path from the root to the current node
    are used in regression steps and there are no
    discrete variables in the training set.
  • The fourth creates a leaf if the error in the
    current node is below a fraction of the error in
    the root node.
  • The fifth stops the growth when the coefficient
    of determination is greater than a minimum value.

16
Related works and problems
  • In principle, the optimal split should be chosen
    on the basis of the fit of each regression model
    to the data.
  • Problem: in some systems (M5, M5' and HTL) the heuristic function does not take into account the models associated with the leaves of the tree.
  • → The evaluation function is incoherent with respect to the model tree being built.
  • → Some simple regression models are not correctly discovered.

17
Related works and problems
  • Example
  • Cubist splits the data at −0.1 and builds the following models:
  • X ≤ −0.1: Y = 0.78 + 0.175X
  • X > −0.1: Y = 1.143 − 0.281X

18
Related works and problems
  • Retis solves this problem by computing the best multiple regression model at the leaves for each candidate splitting node.
  • The problem is theoretically solved, but:
  • Computationally expensive approach: a multiple regression model for each possible test. The choice of the first split is O(m³N²).
  • All continuous variables are involved in the multiple linear models associated with the leaves. So, when some of the independent variables are linearly related to each other, several problems may occur (collinearity).

19
Related works and problems
  • TSIR induces model trees with regression nodes and splitting nodes, but:
  • the effect of the regressed variable in a regression node is not removed when cases are passed down →
  • the multiple regression model associated with each leaf cannot be correctly interpreted from a statistical viewpoint.

20
Computational complexity
  • It can be proved that SMOTI has an O(m³n²) worst-case complexity for the selection of any node (splitting or regression).
  • RETIS has the same complexity for node selection, although RETIS does not select a subset of variables to solve collinearity problems.

21
Empirical evaluation
  • For pairwise comparison with Retis and M5', which are the state-of-the-art model tree induction systems, the non-parametric Wilcoxon two-sample paired signed rank test is used.
  • Experiments (Malerba et al., 2004):
  • laboratory-sized data sets
  • UCI datasets

22
Empirical evaluation on laboratory-sized data
[Chart comparing Retis, M5', and SMOTI on laboratory-sized data]
23
Empirical evaluation on laboratory-sized data
[Chart: running time in seconds versus number of examples for Retis, M5', and SMOTI]
24
Empirical Evaluation on UCI data
25
Empirical Evaluation on UCI data.
  • For some datasets SMOTI mines interesting patterns that no previous study on model trees has ever revealed.
  • This demonstrates the easy interpretability of the model trees induced by SMOTI.
  • For example:

Abalone (marine molluscs): the goal is to predict the age (number of rings). SMOTI builds a model tree with a regression node in the root. The straight-line regression selected at the root is almost invariant across all model trees and expresses a linear dependence between the number of rings (dependent variable) and the shucked weight (independent variable). This is a clear example of a global effect.
26
SMOTI: open issues
  • The DM system KDB2000 (http://www.di.uniba.it/malerba/software/kdb2000/index.htm), which implements SMOTI, is not tightly integrated with the DBMS → tighter integration with a DBMS is needed.
  • SMOTI cannot be applied directly to multi-relational data mining tasks →
  • the unit of analysis is an individual described by a set of random variables, each of which results in just one single value...

27
From classical to relational data mining
  • ...while in most real-world applications complex objects are described in terms of properties and relations.
  • Example:
  • In spatial domains the effect of a predictor variable at any site may not be limited to that site (spatial autocorrelation).

E.g. an enumeration district (ED) may contain no communal establishments (schools, hospitals), while many of them are located in the nearby EDs.
28
Multi-relational representation
  • Augment the data table with information about neighbouring units.

[Figure: target objects linked to task-relevant objects]
29
Regression Problem in relational data mining
  • Given
  • a training set O stored in relational tables S = {T0, T1, …, Th} of a relational database D
  • a set of v primary key constraints PK on relations in S,
  • a set of w foreign key constraints FK on relations in S,
  • a target relation T(X1, …, Xn, Y) ∈ S,
  • a target continuous attribute Y in T, different from the primary key or foreign keys in T.
  • Find
  • a multi-relational regression model which predicts the value of Y for some object represented as a tuple in T and related tuples in S according to foreign key paths (a hypothetical schema is sketched below).
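A hypothetical schema illustrating this setting, borrowing the Customer/Order tables used later in these slides (the SQL, embedded in Python, is illustrative):

```python
# Target relation Customer with target attribute CreditLine, plus a
# related Order table linked by a foreign key, as in the problem
# statement above. Names follow the slides' running example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (
    Id         INTEGER PRIMARY KEY,
    Sale       REAL,
    CreditLine REAL,                             -- target attribute Y
    Agent      TEXT
);
CREATE TABLE "Order" (                           -- quoted: ORDER is reserved
    Id     INTEGER PRIMARY KEY,
    Date   TEXT,
    Client INTEGER REFERENCES Customer(Id),      -- foreign key into target
    Pieces REAL
);
""")
```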

30
How to work with (multi-)relational data?
  • Mould the relational database into a single table so that traditional attribute-value algorithms are able to work on it:
  • create a single relation by deriving attributes from other joined tables
  • construct a single relation that summarizes and/or aggregates information found in other tables (see the sketch after this list)
  • Solve mining problems in their original representation:
  • FORS (Karalic, 1997)
  • SRT (Kramer, 1996), S-CART (Kramer, 1999), TILDE-RT (Blockeel, 1998)
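Continuing the hypothetical schema above, a minimal sketch of the second option: building a single relation that aggregates the Order tuples of each customer.

```python
# Propositionalisation by aggregation: one row per target object, with
# the related Order tuples summarised into aggregate columns.
rows = conn.execute("""
    SELECT c.Id, c.Sale, c.CreditLine,
           COUNT(o.Id)                AS n_orders,
           COALESCE(SUM(o.Pieces), 0) AS total_pieces,
           AVG(o.Pieces)              AS avg_pieces
    FROM Customer c
    LEFT JOIN "Order" o ON o.Client = c.Id
    GROUP BY c.Id, c.Sale, c.CreditLine
""").fetchall()
# 'rows' can now be fed to any propositional (single-table) learner.
```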

31
Strengths and Weaknesses of current
multi-relational regression methods
  • Strengths
  • solve Relational Regression problems in their
    original representation.
  • able to exploit background knowledge in the
    mining process
  • learn multi-relational patterns
  • Weaknesses
  • knowledge of the data model is not used to guide the search process
  • data is stored as Prolog facts
  • not integrated with the database
  • they do not differentiate global vs. local effects of variables in a regression model

Idea: combine the achievements of the KDD field on the integration of data mining with database systems with the results reported in the ILP field on how to upgrade propositional data mining algorithms to multi-relational representations.
32
Global/local effect multi-relational model trees: Mr-SMOTI

Tightly integrating the data mining engine with a relational DBMS + upgrading SMOTI to multi-relational representations → Mr-SMOTI
  • Mr-SMOTI is the relational extension of SMOTI that outputs relational model trees such that:
  • each node corresponds to a subset of training data and is associated with a portion of D intensionally described by a relational pattern,
  • each leaf is associated with a (multiple) regression function which may involve predictor variables from several tables in D,
  • each variable that is eventually introduced in the left branch of a node must not occur in the right branch of that node,
  • relational patterns associated with nodes are represented with regression selection graphs, which extend the selection graph definition (Knobbe, 1999),
  • regression selection graphs are translated into SQL expressions stored in XML format.

33
What is a regression selection graph?
  • It describes a subset of the instances from the database, possibly modified by removing the effect of regression steps.
  • Nodes correspond to tables from the database, whose attributes are replaced by the corresponding residuals.
  • Arcs correspond to foreign key associations between tables:
  • open arcs require at least one matching tuple in the table they point to,
  • closed arcs require that no matching tuple exists (SQL sketch below).

[Figure: example regression selection graph involving the CreditLine attribute]
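A hedged sketch of how the two arc types translate to SQL in the spirit of Knobbe's selection graphs: a present (open) arc becomes an EXISTS subquery, an absent (closed) arc a NOT EXISTS subquery. The condition value is illustrative.

```python
# Selection expressed by an open arc: customers with at least one
# matching order; and by a closed arc: customers with none.
open_arc_sql = """
    SELECT c.* FROM Customer c
    WHERE EXISTS (SELECT 1 FROM "Order" o
                  WHERE o.Client = c.Id AND o.Pieces <= 22)
"""
closed_arc_sql = """
    SELECT c.* FROM Customer c
    WHERE NOT EXISTS (SELECT 1 FROM "Order" o
                      WHERE o.Client = c.Id AND o.Pieces <= 22)
"""
```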
34
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: refinements of a selection graph over the Customer, Order, and Detail tables]
35
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: two cases of refining a selection graph over Customer, Order, and Detail; the refinement adds the condition Quantity ≤ 22 on Detail, paired with its complementary refinement]
36
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: refinement and complementary refinement over the Customer and Order tables]
37
Relational splitting nodes with look-ahead
[Figure: look-ahead refinements over Customer, Order, and Detail with the condition Quantity ≤ 22, and their complementary refinements]
38
Relational regression nodes
  • add regression condition

Example: given Customer(Id, Sale, CreditLine, Agent) and Order(Id, Date, Client, Pieces), a regression node on Sale with fitted lines CreditLine = 5·Sale − 0.5 and Pieces = −2.5·Sale − 3.2 replaces each attribute by its residual:
CreditLine ← CreditLine − (5·Sale − 0.5)
Pieces ← Pieces − (−2.5·Sale − 3.2)
yielding Customer(Id, Sale, CreditLine − 5·Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2).
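A sketch of this residual transformation as SQL views, reusing the hypothetical schema from earlier; the coefficients are those of the slide's example, and the view-based realisation is an assumption, not Mr-SMOTI's actual mechanism.

```python
# After regressing on Sale, expose residual-transformed tables; the
# Order residual needs Sale, reached through the foreign key join.
conn.executescript("""
CREATE VIEW CustomerResid AS
    SELECT Id, Sale,
           CreditLine - (5 * Sale - 0.5) AS CreditLine,   -- residual
           Agent
    FROM Customer;

CREATE VIEW OrderResid AS
    SELECT o.Id, o.Date, o.Client,
           o.Pieces - (-2.5 * c.Sale - 3.2) AS Pieces     -- residual
    FROM "Order" o JOIN Customer c ON o.Client = c.Id;
""")
```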
39
Relational model trees: an example

[Figure: example relational model tree. A regression node at the root regresses on Sale, transforming the tables to Customer(Id, Sale, CreditLine − 5·Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2); a splitting node then tests the condition Date in 02/09/02; a further regression node on the transformed Pieces attribute refines the Customer table to Customer(Id, Sale, CreditLine − 5·Sale + 0.5 − 0.1·(Pieces + 2.5·Sale + 3.2) + 2, Agent)]
40
How to choose the best relational node?
  • Start with the root node, associated with the selection graph containing only the target node.
  • Greedily choose among regression selection graph refinements (a schematic follows):
  • use binary splits for simplicity
  • for each refinement, get the complementary refinement
  • store regression coefficients in order to compute residuals on continuous attributes
  • choose the best refinement based on the evaluation functions
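A schematic of this greedy loop; refinements(), complement(), and evaluate() are hypothetical placeholders standing in for the actual refinement operators and evaluation functions, so this is a sketch rather than Mr-SMOTI's implementation.

```python
def choose_best_node(graph, data):
    """Greedily pick the next refinement of the regression selection graph."""
    best, best_score = None, float("inf")
    for ref in refinements(graph):          # binary split / regression candidates
        comp = complement(ref)              # complementary refinement
        score = evaluate(ref, comp, data)   # e.g. weighted resubstitution error
        # coefficients fitted during evaluation are kept, so residuals on
        # continuous attributes can be computed when cases are passed down
        if score < best_score:
            best, best_score = (ref, comp), score
    return best
```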

41
Evaluating relational splitting node
[Figure: a candidate splitting refinement and its complementary refinement over the Customer and Order tables]
42
Evaluating relational regression node
  • ρ(t) = min{R(t), σ(t′)}
  • where
  • R(t) is the resubstitution error computed on the tuples extracted by the regression selection graph associated with t,
  • t′ is the best splitting node following t, with σ(t′) its evaluation.

43
Stopping criteria
  • The first requires the number of target objects in each node to be greater than a minimum value.
  • The second operates when all continuous attributes along the path from the root to the current node are used in regression steps and there is no "add present arc and open node" refinement introducing new continuous attributes.
  • The third stops the growth when the coefficient of determination is greater than a minimum value.

44
Mr-SMOTI: some details
  • Mr-SMOTI has been implemented as a component of the KDD system MURENA.
  • MURENA has been implemented in Java and interfaces an Oracle database.
  • http://www.di.uniba.it/~ceci/micFiles/systems/The%20MURENA%20project.html

45
Empirical evaluation on laboratory-sized data
46
Empirical evaluation on laboratory-sized data
Wilcoxon test (alpha = 0.05)
47
Empirical evaluation on real data
48
Improving efficiency by materializing intermediate results
49
Questions?