1
Mining Relational Model Trees
Annalisa Appice
Department of Computer Science, University of Bari
Knowledge Acquisition & Machine Learning Lab
2
Regression problem in classical data mining
  • Given
  • m independent (or predictor) variables Xi (both continuous and discrete)
  • a continuous dependent (or response) variable Y to be predicted
  • a set of n training cases (x1, x2, …, xm, y)
  • Build
  • a function y = g(x) that correctly predicts the value of the response variable for each m-tuple (x1, x2, …, xm) (a minimal least-squares sketch follows)
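A minimal sketch of this task (illustrative only, not part of the original slides): fit a linear g by ordinary least squares on synthetic data.

```python
# Minimal sketch of the classical regression task: n training cases,
# m predictor variables, one continuous response; fit a linear g by
# ordinary least squares. Data and coefficients are made up.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                      # n cases, m predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=n)

A = np.hstack([np.ones((n, 1)), X])              # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fitted intercept and coefficients:", coef)
```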

3
Regression trees and model trees
Partitioning of observations + local regression models → regression or model trees
4
Model trees: state of the art
  • Statistics
  • Ciampi (1991): RECPAM
  • Siciliano & Mola (1994)
  • Data Mining
  • Karalic (1992): RETIS
  • Quinlan (1992): M5
  • Wang & Witten (1997): M5'
  • Lubinsky (1994): TSIR
  • Torgo (1997): HTL

The tree-structure is generated according to a
top-down strategy.
5
Model trees: state of the art
  • Models in the leaves have only a local validity → they are built on the basis of the training cases falling in the corresponding partition of the feature space.
  • Global effects can be represented by variables that are introduced in the regression models at higher levels of the model tree →
  • a different tree structure is required!
  • Internal nodes can
  • either define a further partitioning of the feature space,
  • or introduce some regression variables in the models to be associated with the leaves.

6
Two types of nodes
  • Splitting nodes perform a Boolean test: Xi ≤ α for a continuous variable, Xi ∈ {xi1, …, xih} for a discrete variable. A splitting node t has two children tL and tR, each holding its own straight-line regression (e.g. Y = a + bXu in tL and Y = c + dXw in tR).
  • Regression nodes compute only a straight-line regression and have a single child.
7
What is passed down?
  • Splitting nodes pass down to each child only a subgroup of the training cases, without any change to the variables.
  • Regression nodes pass down to their unique child all training cases. Values of the variables not included in the model are transformed to remove the linear effect of those variables already included (sketched below).
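A minimal sketch of the two behaviours, assuming pandas-style tabular data in which all remaining variables are continuous (names and helpers are hypothetical):

```python
# What each node type passes down: a splitting node forwards a subset of
# rows with variables unchanged; a regression node forwards all rows,
# replacing the other variables by residuals against the regressed one.
import numpy as np
import pandas as pd

def fit_line(x, y):
    """Least-squares line y ~ a + b*x; np.polyfit returns (slope, intercept)."""
    b, a = np.polyfit(x, y, 1)
    return a, b

def split_node(data: pd.DataFrame, var: str, threshold: float):
    left = data[data[var] <= threshold]     # Boolean test, variables untouched
    right = data[data[var] > threshold]
    return left, right

def regression_node(data: pd.DataFrame, var: str, target: str) -> pd.DataFrame:
    out = data.copy()
    a, b = fit_line(out[var], out[target])
    out[target] = out[target] - (a + b * out[var])  # residuals on the response
    for col in out.columns.drop([var, target]):     # assumes all continuous
        c, d = fit_line(out[var], out[col])
        out[col] = out[col] - (c + d * out[var])    # remove var's linear effect
    return out                                      # unique child gets all cases
```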

8
An example of model tree
9
Building a regression model stepwise: some tricks
  • Example: build a multiple regression model with two independent variables
  • Y = a + bX1 + cX2
  • through a sequence of straight-line regressions:
  • Build Y = a1 + b1X1
  • Build X2 = a2 + b2X1
  • Compute the residuals on X2: X'2 = X2 − (a2 + b2X1)
  • Compute the residuals on Y: Y' = Y − (a1 + b1X1)
  • Regress Y' on X'2 alone: Y' = a3 + b3X'2

By substituting the equation of X'2 into the last equation, Y = (a3 + a1 − a2b3) + b3X2 + (b1 − b2b3)X1, so it can be proven that a = a3 − a2b3 + a1, b = −b2b3 + b1 and c = b3.
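A numerical check of this trick on synthetic data; by the Frisch–Waugh result, the residual-based steps recover exactly the coefficients of a direct two-variable least-squares fit.

```python
# Verify that the stepwise straight-line regressions reproduce the
# multiple regression Y = a + b*X1 + c*X2 on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=500)
X2 = 0.6 * X1 + rng.normal(size=500)            # X2 correlated with X1
Y = 3.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(scale=0.05, size=500)

line = lambda x, y: np.polyfit(x, y, 1)         # returns (slope, intercept)

b1, a1 = line(X1, Y)                            # Y  = a1 + b1*X1
b2, a2 = line(X1, X2)                           # X2 = a2 + b2*X1
Y_res = Y - (a1 + b1 * X1)                      # residuals on Y
X2_res = X2 - (a2 + b2 * X1)                    # residuals on X2
b3, a3 = line(X2_res, Y_res)                    # Y' = a3 + b3*X2'

print("stepwise:", a1 + a3 - a2 * b3, b1 - b2 * b3, b3)   # a, b, c
A = np.column_stack([np.ones_like(X1), X1, X2])
print("direct:  ", np.linalg.lstsq(A, Y, rcond=None)[0])  # a, b, c
```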
10
The global effect of regression nodes

[Figure: response Y over region R, partitioned into R1 and R2 by a test on Xj]

  • Both regression models associated with the leaves include Xi.
  • The contribution of Xi to Y can be different for each leaf, but
  • it can be reliably estimated on the whole region R.

11
An example of model tree
12
Advantages of the proposed tree structure
  • It captures both the global and the local effects of regression variables.
  • Multiple regression models at the leaves can be efficiently built stepwise.
  • The multiple regression model at a leaf can be easily computed → the heuristic function for the selection of regression and splitting nodes can take it into account.

13
Evaluating splitting and regression nodes
  • Splitting node t with test Xi ≤ α has two children tL and tR, holding straight-line regressions Y = a + bXu and Y = c + dXv respectively.

R(tL) (R(tR)) is the resubstitution error associated with the left (right) child; a candidate split is scored by combining the two errors, weighted by the number of cases falling in each child (sketched below).
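A minimal sketch of this evaluation, assuming mean squared residuals as the resubstitution error and size-weighted averaging of the children's errors (a natural choice, not necessarily SMOTI's exact formula):

```python
# Score a candidate splitting node: fit a straight-line regression in
# each child and combine the children's resubstitution errors, weighted
# by the fraction of training cases each child receives.
import numpy as np

def resubstitution_error(x, y):
    """Mean squared residual of the least-squares line y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)
    return np.mean((y - (a + b * x)) ** 2)

def eval_split(x_split, x_reg, y, threshold):
    """Weighted error of the split x_split <= threshold (>= 2 cases per child)."""
    left = x_split <= threshold
    right = ~left
    n = len(y)
    return (left.sum() / n) * resubstitution_error(x_reg[left], y[left]) + \
           (right.sum() / n) * resubstitution_error(x_reg[right], y[right])
```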
14
Filtering useless splitting nodes
  • Problem: a splitting node whose children have identical straight-line regressions → the split is really modelling a regression step. How can this be recognized?
  • Solution: compare the two regression lines associated with the children of a splitting node according to a statistical test for coincident regression lines (Weisberg, 1985). A sketch of such a test follows.
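One standard formulation is a Chow-type F-test comparing a pooled fit against separate fits in the two children; this is a sketch, not necessarily the exact variant SMOTI uses.

```python
# Test whether the regression lines of the two children coincide:
# F = ((RSS_pooled - RSS_separate)/2) / (RSS_separate/(n1 + n2 - 4)),
# with 2 and n1 + n2 - 4 degrees of freedom.
import numpy as np
from scipy import stats

def rss_line(x, y):
    b, a = np.polyfit(x, y, 1)
    return np.sum((y - (a + b * x)) ** 2)

def coincident_lines_test(xL, yL, xR, yR):
    rss_pooled = rss_line(np.concatenate([xL, xR]), np.concatenate([yL, yR]))
    rss_sep = rss_line(xL, yL) + rss_line(xR, yR)
    n = len(xL) + len(xR)
    f = ((rss_pooled - rss_sep) / 2) / (rss_sep / (n - 4))
    p = stats.f.sf(f, 2, n - 4)
    return f, p    # high p-value: lines coincide, so the split is useless
```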

15
Stopping criteria
  • The first performs the partial F-test to evaluate
    the contribution of a new independent variable to
    the model.
  • The second requires the number of cases in each
    node to be greater than a minimum value.
  • The third operates when all continuous variables
    along the path from the root to the current node
    are used in regression steps and there are no
    discrete variables in the training set.
  • The fourth creates a leaf if the error in the
    current node is below a fraction of the error in
    the root node.
  • The fifth stops the growth when the coefficient
    of determination is greater than a minimum value.

16
Related works and problems
  • In principle, the optimal split should be chosen
    on the basis of the fit of each regression model
    to the data.
  • Problem: in some systems (M5, M5' and HTL) the heuristic function does not take into account the models associated with the leaves of the tree.
  • → The evaluation function is incoherent with respect to the model tree being built.
  • → Some simple regression models are not correctly discovered.

17
Related works and problems
  • Example
  • Cubist splits the data at −0.1 and builds the following models:
  • X ≤ −0.1: Y = 0.78 + 0.175X
  • X > −0.1: Y = 1.143 − 0.281X

18
Related works and problems
  • Retis solves this problem by computing the best multiple regression model at the leaves for each candidate splitting node.
  • The problem is theoretically solved, but:
  • Computationally expensive approach: a multiple regression model for each possible test. The choice of the first split is O(m³N²).
  • All continuous variables are involved in the multiple linear models associated with the leaves. So, when some of the independent variables are linearly related to each other, several problems may occur (collinearity).

19
Related works and problems
  • TSIR induces model trees with regression nodes and splitting nodes, but:
  • the effect of the regressed variable in a regression node is not removed when cases are passed down →
  • the multiple regression model associated with each leaf cannot be correctly interpreted from a statistical viewpoint.

20
Computational complexity
  • It can be proved that SMOTI has an O(m³n²) worst-case complexity for the selection of any node (splitting or regression).
  • RETIS has the same complexity for node selection, although RETIS does not select a subset of variables to solve collinearity problems.

21
Empirical evaluation
  • For pairwise comparison with Retis and M5', which are the state-of-the-art model tree induction systems, the non-parametric Wilcoxon two-sample paired signed rank test is used.
  • Experiments (Malerba et al., 2004):
  • laboratory-sized data sets
  • UCI datasets

22
Empirical evaluation on laboratory-sized data
[Chart comparing Retis, M5', and SMOTI on laboratory-sized data]
23
Empirical evaluation on laboratory-sized data
[Chart: running time in seconds versus number of examples for Retis, M5', and SMOTI]
24
Empirical Evaluation on UCI data
25
Empirical Evaluation on UCI data.
  • For some datasets SMOTI mines interesting patterns that no previous study on model trees has ever revealed.
  • This demonstrates the easy interpretability of the model trees induced by SMOTI.
  • For example:

Abalone (marine molluscs): the goal is to predict the age (number of rings). SMOTI builds a model tree with a regression node in the root. The straight-line regression selected at the root is almost invariant across all model trees and expresses a linear dependence between the number of rings (dependent variable) and the shucked weight (independent variable). This is a clear example of a global effect.
26
SMOTI: open issues
  • The DM system KDB2000 (http://www.di.uniba.it/malerba/software/kdb2000/index.htm), which implements SMOTI, is not tightly integrated with the DBMS → tighter integration with a DBMS is needed.
  • SMOTI cannot be applied directly to multi-relational data mining tasks →
  • the unit of analysis is an individual described by a set of random variables, each of which results in just one single value...

27
From classical to relational data mining
  • ...while in most real-world applications complex objects are described in terms of properties and relations.
  • Example:
  • In spatial domains the effect of a predictor variable at any site may not be limited to that site (spatial autocorrelation).

E.g. an enumeration district (ED) may contain no communal establishments (schools, hospitals), while many of them are located in the nearby EDs.
28
Multi-relational representation
  • Augment the data table with information about neighbouring units.

[Figure: target objects linked to task-relevant objects]
29
Regression Problem in relational data mining
  • Given
  • a training set O stored in relational tables S = {T0, T1, …, Th} of a relational database D
  • a set of v primary key constraints PK on relations in S,
  • a set of w foreign key constraints FK on relations in S,
  • a target relation T(X1, …, Xn, Y) ∈ S,
  • a target continuous attribute Y in T, different from the primary key or foreign keys in T.
  • Find
  • a multi-relational regression model which predicts the value of Y for some object represented as a tuple in T and related tuples in S according to foreign key paths (a hypothetical schema is sketched below).
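A hypothetical schema illustrating this setting, borrowing the Customer/Order tables used later in these slides (the SQL, embedded in Python, is illustrative):

```python
# Target relation Customer with target attribute CreditLine, plus a
# related Order table linked by a foreign key, as in the problem
# statement above. Names follow the slides' running example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (
    Id         INTEGER PRIMARY KEY,
    Sale       REAL,
    CreditLine REAL,                             -- target attribute Y
    Agent      TEXT
);
CREATE TABLE "Order" (                           -- quoted: ORDER is reserved
    Id     INTEGER PRIMARY KEY,
    Date   TEXT,
    Client INTEGER REFERENCES Customer(Id),      -- foreign key into target
    Pieces REAL
);
""")
```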

30
How to work with (multi-)relational data?
  • Mould the relational database into a single table so that traditional attribute-value algorithms are able to work on it:
  • create a single relation by deriving attributes from other joined tables
  • construct a single relation that summarizes and/or aggregates information found in other tables (see the sketch after this list)
  • Solve mining problems in their original representation:
  • FORS (Karalic, 1997)
  • SRT (Kramer, 1996), S-CART (Kramer, 1999), TILDE-RT (Blockeel, 1998)
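Continuing the hypothetical schema above, a minimal sketch of the second option: building a single relation that aggregates the Order tuples of each customer.

```python
# Propositionalisation by aggregation: one row per target object, with
# the related Order tuples summarised into aggregate columns.
rows = conn.execute("""
    SELECT c.Id, c.Sale, c.CreditLine,
           COUNT(o.Id)                AS n_orders,
           COALESCE(SUM(o.Pieces), 0) AS total_pieces,
           AVG(o.Pieces)              AS avg_pieces
    FROM Customer c
    LEFT JOIN "Order" o ON o.Client = c.Id
    GROUP BY c.Id, c.Sale, c.CreditLine
""").fetchall()
# 'rows' can now be fed to any propositional (single-table) learner.
```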

31
Strengths and Weaknesses of current
multi-relational regression methods
  • Strengths
  • solve Relational Regression problems in their
    original representation.
  • able to exploit background knowledge in the
    mining process
  • learn multi-relational patterns
  • Weaknesses
  • knowledge of the data model is not used to guide the search process
  • data is stored as Prolog facts
  • not integrated with the database
  • they do not differentiate global vs. local effects of variables in a regression model

Idea: combine the achievements of the KDD field on the integration of data mining with database systems with the results reported in the ILP field on how to upgrade propositional data mining algorithms to multi-relational representations.
32
Global/local effect multi-relational model trees: Mr-SMOTI

Tightly integrating the data mining engine with a relational DBMS + upgrading SMOTI to multi-relational representations → Mr-SMOTI
  • Mr-SMOTI is the relational extension of SMOTI that outputs relational model trees such that:
  • each node corresponds to a subset of training data and is associated with a portion of D intensionally described by a relational pattern,
  • each leaf is associated with a (multiple) regression function which may involve predictor variables from several tables in D,
  • each variable that is eventually introduced in the left branch of a node must not occur in the right branch of that node,
  • relational patterns associated with nodes are represented with regression selection graphs, which extend the selection graph definition (Knobbe, 1999),
  • regression selection graphs are translated into SQL expressions stored in XML format.

33
What is a regression selection graph?
  • It describes a subset of the instances from the database, possibly modified by removing the effect of regression steps.
  • Nodes correspond to tables from the database, whose attributes are replaced by the corresponding residuals.
  • Arcs correspond to foreign key associations between tables:
  • open arcs require at least one matching tuple in the table they point to,
  • closed arcs require that no matching tuple exists (SQL sketch below).

[Figure: example regression selection graph involving the CreditLine attribute]
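A hedged sketch of how the two arc types translate to SQL in the spirit of Knobbe's selection graphs: a present (open) arc becomes an EXISTS subquery, an absent (closed) arc a NOT EXISTS subquery. The condition value is illustrative.

```python
# Selection expressed by an open arc: customers with at least one
# matching order; and by a closed arc: customers with none.
open_arc_sql = """
    SELECT c.* FROM Customer c
    WHERE EXISTS (SELECT 1 FROM "Order" o
                  WHERE o.Client = c.Id AND o.Pieces <= 22)
"""
closed_arc_sql = """
    SELECT c.* FROM Customer c
    WHERE NOT EXISTS (SELECT 1 FROM "Order" o
                      WHERE o.Client = c.Id AND o.Pieces <= 22)
"""
```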
34
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: refinements of a selection graph over the Customer, Order, and Detail tables]
35
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: two cases of refining a selection graph over Customer, Order, and Detail; the refinement adds the condition Quantity ≤ 22 on Detail, paired with its complementary refinement]
36
Relational splitting nodes
  • add condition / add negative condition (split condition)
  • add present arc and open node / add absent arc and closed node (join condition)

[Figure: refinement and complementary refinement over the Customer and Order tables]
37
Relational splitting nodes with look-ahead
[Figure: look-ahead refinements over Customer, Order, and Detail with the condition Quantity ≤ 22, and their complementary refinements]
38
Relational regression nodes
  • add regression condition

Example: given Customer(Id, Sale, CreditLine, Agent) and Order(Id, Date, Client, Pieces), a regression node on Sale with fitted lines CreditLine = 5·Sale − 0.5 and Pieces = −2.5·Sale − 3.2 replaces each attribute by its residual:
CreditLine ← CreditLine − (5·Sale − 0.5)
Pieces ← Pieces − (−2.5·Sale − 3.2)
yielding Customer(Id, Sale, CreditLine − 5·Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2).
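A sketch of this residual transformation as SQL views, reusing the hypothetical schema from earlier; the coefficients are those of the slide's example, and the view-based realisation is an assumption, not Mr-SMOTI's actual mechanism.

```python
# After regressing on Sale, expose residual-transformed tables; the
# Order residual needs Sale, reached through the foreign key join.
conn.executescript("""
CREATE VIEW CustomerResid AS
    SELECT Id, Sale,
           CreditLine - (5 * Sale - 0.5) AS CreditLine,   -- residual
           Agent
    FROM Customer;

CREATE VIEW OrderResid AS
    SELECT o.Id, o.Date, o.Client,
           o.Pieces - (-2.5 * c.Sale - 3.2) AS Pieces     -- residual
    FROM "Order" o JOIN Customer c ON o.Client = c.Id;
""")
```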
39
Relational model trees: an example

[Figure: example relational model tree. A regression node at the root regresses on Sale, transforming the tables to Customer(Id, Sale, CreditLine − 5·Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5·Sale + 3.2); a splitting node then tests the condition Date in 02/09/02; a further regression node on the transformed Pieces attribute refines the Customer table to Customer(Id, Sale, CreditLine − 5·Sale + 0.5 − 0.1·(Pieces + 2.5·Sale + 3.2) + 2, Agent)]
40
How to choose the best relational node?
  • Start with the root node, associated with the selection graph containing only the target node.
  • Greedily choose among regression selection graph refinements (a schematic follows):
  • use binary splits for simplicity
  • for each refinement, get the complementary refinement
  • store regression coefficients in order to compute residuals on continuous attributes
  • choose the best refinement based on the evaluation functions
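A schematic of this greedy loop; refinements(), complement(), and evaluate() are hypothetical placeholders standing in for the actual refinement operators and evaluation functions, so this is a sketch rather than Mr-SMOTI's implementation.

```python
def choose_best_node(graph, data):
    """Greedily pick the next refinement of the regression selection graph."""
    best, best_score = None, float("inf")
    for ref in refinements(graph):          # binary split / regression candidates
        comp = complement(ref)              # complementary refinement
        score = evaluate(ref, comp, data)   # e.g. weighted resubstitution error
        # coefficients fitted during evaluation are kept, so residuals on
        # continuous attributes can be computed when cases are passed down
        if score < best_score:
            best, best_score = (ref, comp), score
    return best
```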

41
Evaluating relational splitting node
[Figure: a candidate splitting refinement and its complementary refinement over the Customer and Order tables]
42
Evaluating relational regression node
  • ρ(t) = min{R(t), σ(t′)}
  • where
  • R(t) is the resubstitution error computed on the tuples extracted by the regression selection graph associated with t,
  • t′ is the best splitting node following t, with σ(t′) its evaluation.

43
Stopping criteria
  • The first requires the number of target objects in each node to be greater than a minimum value.
  • The second operates when all continuous attributes along the path from the root to the current node are used in regression steps and there is no "add present arc and open node" refinement introducing new continuous attributes.
  • The third stops the growth when the coefficient of determination is greater than a minimum value.

44
Mr-SMOTI: some details
  • Mr-SMOTI has been implemented as a component of the KDD system MURENA.
  • MURENA has been implemented in Java and interfaces an Oracle database.
  • http://www.di.uniba.it/~ceci/micFiles/systems/The%20MURENA%20project.html

45
Empirical evaluation on laboratory-sized data
46
Empirical evaluation on laboratory-sized data
Wilcoxon test (alpha = 0.05)
47
Empirical evaluation on real data
48
Improving efficiency by materializing intermediate results
49
Questions?