Efficient Construction of Decision Trees for Sparse Categorical Attributes presentation

1
Efficient Construction of Decision Trees for
Sparse Categorical Attributes

Shih-Hsiang Lo, Jian-Chih Ou and Ming-Syan Chen

2
Agenda

Introduction
Preliminaries
Inference Based Classifier
Performance Studies
Conclusion

3
Introduction

Classification
Decision Tree
Problem Identification
Inference Class
Performance
Sparse Data

4
Introduction - Classification

Classification is an important issue both in data
mining and machine learning, with such important
techniques as Bayesian classification, neural
networks, genetic algorithms and decision trees.
Decision tree classifiers have been identified as
efficient methods for classification.

5
Introduction - Decision Tree

Decision trees with scale-up and parallel
capability have been shown to be very efficient
and suitable for large training sets.
Decision tree generation algorithms do not
require any information beyond that already
contained in the training data.
Decision trees achieve similar and sometimes
better accuracy compared to other classification
methods.
Most of them are divided into two distinct
phases, a tree building and a pruning phase.

6
Introduction - Problem Identification

From our observation of real data, in many
real-life datasets, such as customer credit-rating
data of banks and credit-card companies, medical
diagnosis data and document categorization data,
the attributes are mostly categorical attributes
and the value of an attribute usually implies one
target class.

7
Introduction - Inference Class

In classification, we call the attribute that
holds the label to be predicted the target
attribute. An attribute which is not a target
attribute is called an ordinary attribute.
An inference class is then defined as the target
class to which the majority of the tuples with a
given attribute value belong.

9
Introduction - Performance

Note that after mapping each ordinary attribute
value to its inference class, it is better and
more efficient to divide the ordinary attribute
values according to their inference classes
instead of their original values before computing
the goodness function, e.g., the gini index, for
node splitting.

10
Introduction - Sparse Data

In addition, we find that only a few attributes
in real data are major discriminating attributes,
where a discriminating attribute is an attribute
by whose value we are likely to distinguish one
tuple from another.
We further use information gain as the splitting
measure of each attribute, with the splitting
criterion of the C4.5 algorithm as the test, in
order to examine the attribute distribution.
From these observations on real data, we can say
that most real data, which have few major
discriminating attributes, can be identified as
sparse data.

11
Preliminaries

Decision Tree Techniques
Information Theory
Information Gain, Gain Ratio
Gini Index
Attribute Distribution
Data Sparsity

12
Decision Tree Techniques - Prior Works

Decision tree techniques play an important role
both in machine learning and data mining, and
numerous decision tree algorithms have been
developed over the years, e.g., ID3, C4.5, CART,
SLIQ, SPRINT.
Most of them are divided into two distinct
phases, a tree building and a pruning phase.

13
Decision Tree Techniques - Tree Building Phase

The tree building phase can be further divided
into two steps, a tree node splitting step and a
data partitioning step.
In the tree node splitting step, the classifier
chooses the best attribute among all attributes
in the data as the tree node.
Then, the second step of the tree building phase,
the data partitioning step, partitions the data
according to the attribute chosen in the first
step.
Thus, in the tree building phase, the training
data set is recursively partitioned until all
partitions are either pure or sufficiently small.
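The two steps of the tree building phase can be sketched in Python as follows. This is a minimal illustration rather than code from the presentation: the function names, the `min_size` threshold, the dict-based rows, the toy data, and the choice of the gini index as the goodness function are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, attr, target):
    """Weighted gini index after partitioning rows by the values of attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def build_tree(rows, attrs, target, min_size=1):
    """Recursively partition rows until a partition is pure or small."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the partition is pure, sufficiently small, or out of attributes.
    if len(set(labels)) == 1 or len(rows) <= min_size or not attrs:
        return majority
    # Step 1 (node splitting): pick the attribute with the lowest gini index.
    best = min(attrs, key=lambda a: split_gini(rows, a, target))
    # Step 2 (data partitioning): split on the chosen attribute and recurse.
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        children[value] = build_tree(subset, remaining, target, min_size)
    return {"attr": best, "children": children}

# Toy data, made up for this example.
rows = [
    {"humidity": "high", "wind": "weak", "play": "no"},
    {"humidity": "high", "wind": "strong", "play": "no"},
    {"humidity": "normal", "wind": "weak", "play": "yes"},
    {"humidity": "normal", "wind": "strong", "play": "yes"},
]
tree = build_tree(rows, ["humidity", "wind"], "play")
```

On this toy data the recursion stops after one split, because both partitions produced by `humidity` are already pure.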

14
Decision Tree Techniques - Overfitting Effect

Since the tree building phase constructs a
perfect tree that accurately classifies every
record from the training set, a decision tree
which is perfect for the known records may be
overly sensitive to statistical irregularities of
the training set.
As has been pointed out in prior work, one often
achieves greater accuracy in the classification
of new objects by using an imperfect, smaller
decision tree rather than one which perfectly
classifies all known records.

15
Decision Tree Techniques - Tree Pruning Phase

Thus, most algorithms perform a pruning phase
after the building phase in which nodes are
iteratively pruned to prevent overfitting and
to obtain a tree with higher accuracy.
An important class of pruning algorithms is based
on the Minimum Description Length (MDL) principle.

16
A Typical Framework of DT
17
Information Gain (1)

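Suppose P is a set of p data samples with m
distinct classes P1, P2, ..., Pm, where class Pi
contains pi samples. The expected information I
needed to classify a given sample in P is

$$ I(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m} \frac{p_i}{p} \log_2 \frac{p_i}{p} $$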

18
Information Gain(2)

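Attribute A with values a1, a2, ..., ak can
partition P into C1, C2, ..., Ck. Let Cj contain
pij samples of class Pi. The weighted information
E(A) required to explore all subtrees is
calculated as below

$$ E(A) = \sum_{j=1}^{k} \frac{p_{1j} + \cdots + p_{mj}}{p} \, I(p_{1j}, \ldots, p_{mj}) $$

where p is the number of samples in P and
I(p1j, ..., pmj) is the expected information of
partition Cj.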

19
Information Gain(3)

The information gained by branching on A can be
obtained as below

$$ \mathrm{Gain}(A) = I(p_1, p_2, \ldots, p_m) - E(A) $$
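The three quantities on the Information Gain slides, the expected information I, the weighted information E(A) and the resulting gain, can be computed as in the sketch below. It assumes the standard entropy-based definitions; the function names and the toy data are made up for the example.

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information I(p1, ..., pm) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def expected_info(values, labels):
    """E(A): information of each partition Cj, weighted by its size."""
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(values, labels):
    """Gain(A) = I(p1, ..., pm) - E(A)."""
    return info(labels) - expected_info(values, labels)

# Toy data: one target column and two candidate attributes.
labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # separates the two classes completely
useless = ["a", "b", "a", "b"]   # carries no information about the class
```

With two equally frequent classes the expected information is one bit, so `perfect` gains the full bit while `useless` gains nothing.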

20
Gain Ratio

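To reduce the noise effects of information gain,
the gain ratio, an alternative measurement that
normalizes information gain, is proposed as below.

$$ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}, \qquad \mathrm{SplitInfo}(A) = -\sum_{j=1}^{k} \frac{|C_j|}{p} \log_2 \frac{|C_j|}{p} $$

where |Cj| is the number of samples in partition
Cj of attribute A and p is the number of samples
in P.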
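A small sketch of the gain ratio follows, assuming the C4.5 definition in which information gain is divided by the split information of the attribute. The function names, the handling of single-valued attributes and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of an attribute divided by its split information."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy(values)  # entropy of the attribute's own values
    # Assumption: a single-valued attribute gets ratio 0 to avoid dividing by zero.
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
binary = ["a", "a", "b", "b"]   # two values that separate the classes
unique = ["w", "x", "y", "z"]   # one distinct value per record, like an ID
```

Both attributes have the same information gain here, but `unique` has twice the split information, so its gain ratio is half that of `binary`. This is the normalizing effect the gain ratio is meant to provide.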

21
Gini Index

If a data set T contains examples from n classes,
the gini index, gini(T), is defined as

$$ \mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2 $$

where pj is the relative frequency of class j in
T.

22
Gini Index

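If a data set T is split into k subsets T1, T2,
..., Tk with sizes N1, N2, ..., Nk respectively,
the gini index [10] of the split data is defined
as

$$ \mathrm{gini}_{split}(T) = \sum_{i=1}^{k} \frac{N_i}{N} \, \mathrm{gini}(T_i), \qquad \mathrm{gini}(T_i) = 1 - \sum_{j=1}^{n} p_{ij}^2 $$

where N = N1 + N2 + ... + Nk, n is the number of
classes, and pij is the relative frequency of
class j in Ti.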
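Both forms of the gini index can be computed as in the following sketch, which assumes the usual definitions (one minus the sum of squared class frequencies, and the size-weighted average over the subsets of a split). The function names and sample labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """gini(T): one minus the sum of squared class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    """Gini index of a split: gini of each subset Ti weighted by Ni / N."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

mixed = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]    # each subset holds one class
poor_split = [["yes", "no"], ["yes", "no"]]    # each subset is still mixed
```

A lower value is better: a split into pure subsets scores 0, while a split that leaves the class mix unchanged scores the same as the unsplit data.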

23
Attribute Distribution

Before defining data sparsity, we need to
clarify the definition of attribute distribution.
The attribute distribution is defined as the
distribution of the measure values of attributes,
where the measure values of attributes are
obtained by one classifier.
For example, we use algorithm C4.5 as the data
evaluator and use information gain as the measure
in the splitting phase of C4.5.

24
Sample datasets
26
Attribute Distribution

Therefore, we define a dataset which has many
candidate discriminating attributes and a high
standard deviation of attribute distribution as
dense data.
Conversely, a dataset which has few candidate
discriminating attributes and a low standard
deviation of attribute distribution is called
sparse data.

27
Definition of Data Sparsity (1)

According to the above observation, we use the
algorithm C4.5 as the data evaluator and use the
standard deviation of attribute distribution to
define data sparsity mathematically.

28
Definition of Data Sparsity(2)

Suppose attribute set X with A1, A2, ..., Aq
resides in data P. The mean and variance for X
with respect to a measure, such as information
gain, gain ratio or gini index, can be obtained
as below.

29
Definition of Data Sparsity(3)

Further, the standard deviation for X can also be
obtained as follows.
The standard deviation of attribute set X is used
as the measure of sparsity of data P. Then, the
sparsity of data is defined as below.
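As a rough sketch of this definition, the sparsity of a dataset can be computed as the standard deviation of the per-attribute measure values. The slide's exact formula is not in the transcript, so the use of the population standard deviation (dividing by q rather than q - 1) is an assumption, as are the function name and the measure values below, which are hypothetical and not taken from the presentation's tables.

```python
from statistics import mean, pstdev

def sparsity(measures):
    """Return (mean, standard deviation) of the per-attribute measure values.

    `measures` maps each attribute name to its measure value, e.g. the
    information gain reported by the data evaluator.
    Assumption: population standard deviation (divide by q, not q - 1).
    """
    values = list(measures.values())
    return mean(values), pstdev(values)

# Hypothetical information gain values for four attributes A1..A4.
gains = {"A1": 0.30, "A2": 0.10, "A3": 0.10, "A4": 0.10}
avg, std = sparsity(gains)
```

Under this reading, a dataset whose attributes all have similar measure values gets a standard deviation close to zero.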

30
An Example of Data Sparsity

For example, in Table 1 and Table 2, the
sparsity_C4.5 of data tennis 1 is 0.0491 and that
of data tennis 2 is 0.0872.
Thus, the attributes in data with low sparsity
are defined as sparse attributes.
In the experiments, we use sparsity_C4.5 to
assess the attribute distribution of datasets.

31
Inference Based Classifier(1)

In essence, IBC is a decision tree classifier
that refines the splitting criteria for
categorical attributes in the building phase in
order to reveal the major discriminating
attribute from sparse attributes.
Also, IBC can improve the overall execution
efficiency and alleviate the overfitting problem,
as shown in the experiments.

32
Inference Based Classifier(2)

Note that information gain and gini index are
common measurements for selecting the best split
node. Without loss of generality, we adopt
information gain as a measurement to identify the
sparsity of attributes and gini index as the
measurement for node splitting criterion.

33
Inference Based Classifier(3)

Inference Class
Algorithm IBC
Inference Class Identification Phase
Node Split Phase

34
Inference Class (1)
36
Inference Class(2)

For the example profile in Figure 1, if A is
age with value <30, then domain(t) = {fair,
excellent}, nA(<30, fair) = 2, and nA(<30,
excellent) = 1. fair is therefore the inference
class for the value <30 of the attribute age.

37
Inference Class(3)

The unique target class which most tuples with
attribute A = ai imply is called the inference
class for value ai of attribute A.
If the target class which most tuples with
attribute A = ai imply is not unique, we say
attribute value ai is associated with a neutral
class, and we call ai a neutral attribute value.
As will be seen later, by replacing the original
attribute value with its inference class when
performing node splitting, IBC is able to build
the decision tree very efficiently without
compromising the quality of classification.
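The definition can be illustrated with a short sketch. The function name and the `NEUTRAL` marker are hypothetical; the first input reuses the counts from the age example (two tuples with class fair and one with class excellent for the value <30).

```python
from collections import Counter

NEUTRAL = "neutral"  # hypothetical marker for values without a unique majority

def inference_class(target_labels):
    """Return the unique majority target class, or NEUTRAL on a tie.

    `target_labels` holds the target class of every tuple that shares one
    attribute value; it is assumed to be non-empty.
    """
    ranked = Counter(target_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return NEUTRAL
    return ranked[0][0]

# Tuples with age <30: n_A(<30, fair) = 2 and n_A(<30, excellent) = 1.
under_30 = ["fair", "fair", "excellent"]
tied = ["fair", "excellent"]
```

Here `inference_class(under_30)` yields `fair`, while the tied input has no unique majority and is therefore treated as a neutral attribute value.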

38
Algorithm of IBC

IBC is divided into two major phases:
partitioning the values of an attribute according
to their inference classes, and
selecting the best splitting attribute, i.e., the
one with the lowest gini index value, from these
attributes.
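The two phases might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `"neutral"` group label, the dict-based rows and the toy data are all made up, and details shown only on the untranscribed slides (such as how neutral values are handled) are guessed.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def inference_classes(rows, attr, target):
    """Phase 1: map each value of attr to its majority target class."""
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[attr]].append(row[target])
    mapping = {}
    for value, labels in by_value.items():
        ranked = Counter(labels).most_common()
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        mapping[value] = "neutral" if tie else ranked[0][0]
    return mapping

def grouped_gini(rows, attr, target):
    """Gini index of the split that groups attr values by inference class."""
    mapping = inference_classes(rows, attr, target)
    groups = defaultdict(list)
    for row in rows:
        groups[mapping[row[attr]]].append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_attribute(rows, attrs, target):
    """Phase 2: choose the attribute whose grouped split has the lowest gini."""
    return min(attrs, key=lambda a: grouped_gini(rows, a, target))

# Toy data, made up for this example.
rows = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "overcast", "wind": "strong", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "yes"},
]
mapping = inference_classes(rows, "outlook", "play")
best = best_attribute(rows, ["outlook", "wind"], "play")
```

In this toy data the three `outlook` values collapse into two groups, since `overcast` and `rain` share the inference class `yes`. The gini index is therefore evaluated once per attribute on the grouped split instead of over candidate subsets of the original values.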

39
Inference Class Identification(1)
40
Inference Class Identification(2)
41
Inference Class Identification(3)
43
Node Split Phase(1)
44
Node Split Phase(2)

For the example, the Node Split Phase of IBC
chooses the attribute Humidity with the lowest
gini index value as the best splitting attribute
for the decision tree node.
Then, IBC partitions Table 1 into two subtables
which consist of one table where the value of
attribute Humidity is High and the other one
where the value of attribute Humidity is Normal.
Following a similar procedure of IBC for these
subtables, the whole decision tree is built as
depicted in Figure 5 where the purity is also
examined in each leaf.

46
Experimental Datasets

We experimented with three real-life datasets
from the UCI Machine Learning Repository.
These datasets are used by the machine learning
community for the empirical analysis of machine
learning algorithms.
We use a small portion of data as the training
dataset and the rest of the data is used as the
testing dataset.
Note that all attributes in the selected datasets
are categorical attributes.

47
Features of Real Datasets
48
Classification Accuracy
49
Algorithm Complexity Analysis

Because SLIQ was shown to outperform C4.5, we
only compare SLIQ, K-mean based and IBC in the
scale-up experiments.
Before the scale-up experiments, we briefly
explain the complexity of the three methods. In
the general case, the complexity of SLIQ is
O(k n^2), the complexity of K-mean based is
O(k (n - k)^2) and the complexity of IBC is
O(k n), where k is the number of attributes and n
is the size of the data set.

50
Scale-Up Experiments(2)
51
Conclusion(1)

According to our observation on real data, the
distribution of attributes with respect to
information gain was very sparse because only a
few attributes are major discriminating
attributes where a discriminating attribute is an
attribute, by whose value we are likely to
distinguish one tuple from another.

52
Conclusion(2)

The experimental results showed that IBC
significantly outperformed the companion methods
in execution efficiency for datasets with
sparsely distributed categorical attributes
while attaining approximately the same
classification accuracy.
Consequently, IBC can be considered an accurate
and efficient classifier for sparse categorical
attributes.
