Title: Efficient Construction of Decision Trees for Sparse Categorical Attributes
1Efficient Construction of Decision Trees for
Sparse Categorical Attributes
- Shih-Hsiang Lo, Jian-Chih Ou and Ming-Syan Chen
2Agenda
- Introduction
- Preliminaries
- Inference Based Classifier
- Performance Studies
- Conclusion
3Introduction
- Classification
- Decision Tree
- Problem Identification
- Inference Class
- Performance
- Sparse Data
4Introduction - Classification
- Classification is an important problem in both data mining and machine learning, with well-known techniques including Bayesian classification, neural networks, genetic algorithms and decision trees.
- Decision tree classifiers have been identified as efficient methods for classification.
5Introduction Decision Tree
- Decision trees with scale-up and parallel capability have been shown to be very efficient and suitable for large training sets.
- Decision tree generation algorithms require no information beyond that already contained in the training data.
- Decision trees achieve similar and sometimes better accuracy compared to other classification methods.
- Most decision tree algorithms are divided into two distinct phases: a tree building phase and a pruning phase.
6Introduction Problem Identification
- Our first observation on real data is that in many real-life datasets, such as customer credit-rating data of banks and credit-card companies, medical diagnosis data and document categorization data, the attributes are mostly categorical.
- Moreover, the value of an attribute usually implies one target class.
7Introduction Inference Class
- In classification, the attribute that holds the label to be predicted is called the target attribute. An attribute which is not a target attribute is called an ordinary attribute.
- An inference class is then defined, for a given attribute value, as the target class to which the majority of the tuples with that value belong.
9Introduction - Performance
- After mapping each ordinary attribute value to its inference class, it is more efficient to partition the ordinary attribute values according to their inference classes, instead of their original values, before computing the goodness function (e.g., the gini index) for node splitting.
10Introduction Sparse Data
- In addition, we find that only a few attributes in real data are major discriminating attributes, where a discriminating attribute is an attribute by whose value we are likely to distinguish one tuple from another.
- We further use information gain as the splitting measure of each attribute, with the splitting criteria of the C4.5 algorithm as the test, in order to see the attribute distribution.
- Based on our observation on real data, most real data have few major discriminating attributes and can therefore be identified as sparse data.
11Preliminaries
- Decision Tree Techniques
- Information Theory
- Information Gain and Gain Ratio
- Gini Index
- Attribute Distribution
- Data Sparsity
12Decision Tree Techniques Prior Works
- Decision tree techniques play an important role in both machine learning and data mining, and numerous decision tree algorithms have been developed over the years, e.g., ID3, C4.5, CART, SLIQ and SPRINT.
- Most of them are divided into two distinct phases: a tree building phase and a pruning phase.
13Decision Tree Techniques Tree Building Phase
- The tree building phase can be further divided into two steps: a tree node splitting step and a data partitioning step.
- In the tree node splitting step, the classifier chooses the best attribute among all attributes in the data as the tree node.
- In the data partitioning step, the data is partitioned according to the attribute chosen in the first step.
- Thus, in the tree building phase, the training data set is recursively partitioned until all partitions are either pure or sufficiently small.
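The two steps above can be sketched as a short recursive procedure. This is a minimal illustration and not the code of any algorithm named in these slides: the function names, the `min_size` stopping threshold, the use of the gini index as the goodness function, and the nested-dict tree representation are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def build_tree(rows, attributes, target, min_size=2):
    """Recursively split rows (a list of dicts) until partitions are pure or small."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the partition is pure, sufficiently small, or nothing is left to split on.
    if len(set(labels)) == 1 or len(rows) < min_size or not attributes:
        return majority

    def split_score(attr):
        # Weighted gini index of the partitions induced by attr (lower is better).
        parts = {}
        for r in rows:
            parts.setdefault(r[attr], []).append(r[target])
        return sum(len(p) / len(rows) * gini(p) for p in parts.values())

    # Step 1: tree node splitting, choose the best attribute.
    best = min(attributes, key=split_score)
    remaining = [a for a in attributes if a != best]
    # Step 2: data partitioning, one child per value of the chosen attribute.
    children = {}
    for r in rows:
        children.setdefault(r[best], []).append(r)
    return {best: {v: build_tree(sub, remaining, target, min_size)
                   for v, sub in children.items()}}
```

Leaves are returned as majority class labels and internal nodes as `{attribute: {value: subtree}}`; a pruning phase would operate on this structure afterwards.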
14Decision Tree Techniques- Overfitting Effect
- Since the tree building phase constructs a perfect tree that accurately classifies every record in the training set, the resulting tree may be overly sensitive to statistical irregularities of the training set.
- As has been pointed out, one often achieves greater accuracy in the classification of new objects by using an imperfect, smaller decision tree rather than one which perfectly classifies all known records.
15Decision Tree Techniques - Tree Pruning Phase
- Thus, most algorithms perform a pruning phase after the building phase, in which nodes are iteratively pruned to prevent overfitting and to obtain a tree with higher accuracy.
- An important class of pruning algorithms is based on the Minimum Description Length (MDL) principle.
16A Typical Framework of DT
17Information Gain (1)
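- Suppose P is a set of p data samples with m distinct classes P_1, P_2, ..., P_m, where p_i is the number of samples in class P_i. The expected information I needed to classify a given sample in P is
  $$I(p_1, \dots, p_m) = -\sum_{i=1}^{m} \frac{p_i}{p} \log_2 \frac{p_i}{p}$$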
18Information Gain(2)
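- Attribute A with values a_1, a_2, ..., a_k partitions P into C = {C_1, C_2, ..., C_k}. Let C_j contain p_ij samples of class P_i. The weighted expected information E(A) required over all subtrees is calculated as below
  $$E(A) = \sum_{j=1}^{k} \frac{p_{1j} + \dots + p_{mj}}{p} \, I(p_{1j}, \dots, p_{mj})$$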
19Information Gain(3)
- The information gained by branching on A is obtained as below
  $$Gain(A) = I(p_1, \dots, p_m) - E(A)$$
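The three quantities on the Information Gain slides (I, E(A) and the gain) can be computed directly from label counts. The following is a minimal sketch; the function names and the toy inputs in the usage note are made up for illustration and are not taken from the slides.

```python
from collections import Counter
from math import log2

def expected_information(labels):
    """I(p1, ..., pm): entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(A) = I(p1, ..., pm) - E(A) for one categorical attribute A.

    values[i] is the value of A for sample i, labels[i] is its class.
    """
    total = len(labels)
    # Partition the class labels by attribute value: C1, ..., Ck.
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # E(A): size-weighted expected information of the partitions.
    e_a = sum(len(part) / total * expected_information(part)
              for part in partitions.values())
    return expected_information(labels) - e_a
```

For instance, `information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"])` gives a gain of 1 bit, because the attribute separates the two classes perfectly, while an attribute that is independent of the class gives a gain of 0.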
20Gain Ratio
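- To offset the noise effects of information gain, the gain ratio, a normalized alternative to information gain, is proposed as below, where |C_j| is the number of samples in partition C_j.
  $$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}, \qquad SplitInfo(A) = -\sum_{j=1}^{k} \frac{|C_j|}{p} \log_2 \frac{|C_j|}{p}$$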
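A minimal sketch of the gain ratio follows, assuming the usual C4.5 normalization in which the information gain is divided by the split information, i.e. the entropy of the attribute's own value distribution. The function names and the choice to return 0 for a single-valued attribute are assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Entropy in bits of the empirical distribution of items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of an attribute divided by its split information."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(p) / total * entropy(p)
                                 for p in partitions.values())
    # Split information: entropy of the attribute values themselves.
    split_info = entropy(values)
    # Assumption: an attribute with a single value cannot split, so score it 0.
    return gain / split_info if split_info > 0 else 0.0
```

The normalization penalizes attributes with many distinct values: an attribute that gives every sample its own value reaches the maximum gain but also a large split information, so its ratio drops.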
21Gini Index
- If a data set T contains examples from n classes, the gini index, gini(T), is defined as
  $$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$
  where p_j is the relative frequency of class j in T.
22Gini Index
- If a data set T of size N is split into k subsets T_1, T_2, ..., T_k with sizes N_1, N_2, ..., N_k respectively, the gini index [10] of the split data is defined as
  $$gini_{split}(T) = \sum_{i=1}^{k} \frac{N_i}{N} \, gini(T_i), \qquad gini(T_i) = 1 - \sum_{j=1}^{n} p_{ij}^2$$
  where p_ij is the relative frequency of class j in T_i.
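Both gini definitions translate into a few lines of code. This is a minimal sketch; the function names `gini` and `gini_split` and the list-of-labels representation of a subset are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    """Size-weighted gini index of a split of T into subsets T1, ..., Tk.

    Each subset is given as the list of class labels of its examples.
    """
    n = sum(len(s) for s in subsets)
    # Empty subsets contribute nothing and are skipped to avoid dividing by zero.
    return sum(len(s) / n * gini(s) for s in subsets if s)
```

A split into pure subsets scores 0, the best possible value, which is why the attribute with the lowest split gini index is preferred for node splitting.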
23Attribute Distribution
- Before defining data sparsity, we need to clarify the definition of attribute distribution. The attribute distribution is the distribution of the measure values of the attributes, where the measure values are obtained from a classifier.
- For example, we use the C4.5 algorithm as the data evaluator and information gain as the measure in the splitting phase of C4.5.
24Sample datasets
26Attribute Distribution
- Therefore, a dataset which has many candidate discriminating attributes and a high standard deviation of attribute distribution is called dense data.
- Conversely, a dataset which has few candidate discriminating attributes and a low standard deviation of attribute distribution is called sparse data.
27Definition of Data Sparsity (1)
- According to the above observation, we use the C4.5 algorithm as the data evaluator and use the standard deviation of the attribute distribution to define data sparsity mathematically.
28Definition of Data Sparsity(2)
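- Suppose the attribute set X = {A_1, A_2, ..., A_q} resides in data P, and let M(A_i) denote the value of a measure metric, such as information gain, gain ratio or gini index, for attribute A_i. The mean and variance for X with respect to M are obtained as below.
  $$\mu_X = \frac{1}{q} \sum_{i=1}^{q} M(A_i), \qquad \sigma_X^2 = \frac{1}{q} \sum_{i=1}^{q} \left( M(A_i) - \mu_X \right)^2$$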
29Definition of Data Sparsity(3)
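- Further, the standard deviation for X is obtained as follows, where \sigma_X^2 is the variance of X.
  $$\sigma_X = \sqrt{\sigma_X^2}$$
- The standard deviation of attribute set X is used as the measure of sparsity of data P. Then, the sparsity of data is defined as below, where the subscript names the data evaluator that produced the measure values (e.g., C4.5).
  $$sparsity_{C4.5}(P) = \sigma_X$$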
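Given one measure value per attribute, the sparsity is just the standard deviation of those values. The following is a minimal sketch; it assumes the population form of the variance (division by q rather than q - 1), and the function name and input format are made up for the example.

```python
from math import sqrt

def sparsity(measures):
    """Standard deviation of the per-attribute measure values.

    measures holds one value per attribute, e.g. the information gain
    computed for each attribute by a C4.5-style evaluator.
    Assumption: population variance (divide by q, not q - 1).
    """
    q = len(measures)
    mean = sum(measures) / q
    variance = sum((m - mean) ** 2 for m in measures) / q
    return sqrt(variance)
```

When every attribute has nearly the same measure value the result is close to 0, and it grows as the measure values spread further apart.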
30An Example of Data Sparsity
- For example, in Table 1 and Table 2, the $sparsity_{C4.5}$ of data tennis 1 is 0.0491 and that of data tennis 2 is 0.0872.
- Thus, the attributes in data with low sparsity are defined as sparse attributes.
- In the experiments we use $sparsity_{C4.5}$ to assess the attribute distribution of datasets.
31Inference Based Classifier(1)
- In essence, IBC is a decision tree classifier that refines the splitting criteria for categorical attributes in the building phase in order to reveal the major discriminating attribute among sparse attributes.
- Also, IBC can improve the overall execution efficiency and alleviate the overfitting problem, as shown in the experiments.
32Inference Based Classifier(2)
- Note that information gain and gini index are
common measurements for selecting the best split
node. Without loss of generality, we adopt
information gain as a measurement to identify the
sparsity of attributes and gini index as the
measurement for node splitting criterion.
33Inference Based Classifier(3)
- Inference Class
- Algorithm IBC
- Inference Class Identification Phase
- Node Split Phase
34Inference Class (1)
36Inference Class(2)
- For the example profile in Figure 1, if A is age with value <30, then domain(t) = {fair, excellent}, n_A(<30, fair) = 2 and n_A(<30, excellent) = 1. Therefore, fair is the inference class for the value <30 of the attribute age.
37Inference Class(3)
- The unique target class which most tuples with attribute value A = a_i imply is called the inference class for the value a_i of attribute A.
- If the target class which most tuples with A = a_i imply is not unique, we say that attribute value a_i is associated with a neutral class, and we call a_i a neutral attribute value.
- As will be seen later, by replacing the original attribute value with its inference class when performing node splitting, IBC is able to build the decision tree very efficiently without compromising the quality of classification.
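The inference class definition, including the neutral case, can be sketched as a simple counting pass over one attribute. This is an illustrative sketch and not the authors' code: the function name and the `NEUTRAL` marker are hypothetical.

```python
from collections import Counter, defaultdict

NEUTRAL = "neutral"  # hypothetical marker for values with no unique majority class

def inference_classes(values, targets):
    """Map each value of one attribute to its inference class.

    values[i] is the attribute value of tuple i, targets[i] its target class.
    A value whose most frequent target class is tied maps to NEUTRAL.
    """
    counts = defaultdict(Counter)  # counts[a][c] plays the role of n_A(a, c)
    for v, t in zip(values, targets):
        counts[v][t] += 1
    result = {}
    for v, class_counts in counts.items():
        ranked = class_counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result[v] = NEUTRAL       # majority class is not unique
        else:
            result[v] = ranked[0][0]  # unique majority class
    return result
```

With the counts from the age example (two fair tuples and one excellent tuple for the value <30), the value <30 maps to fair; a value seen once with fair and once with excellent would map to the neutral marker.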
38Algorithm of IBC
- IBC is divided into two major phases, i.e.,
- partitioning the values of an attribute according to their inference class,
- selecting the best splitting attribute, i.e., the one with the lowest gini index value among these attributes.
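The two phases can be sketched together for a single tree node. This is a rough illustration under stated assumptions, not the algorithm as published: the function names, the dict-per-tuple row format, and the decision to treat all neutral values of an attribute as one group are choices made for this example.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def inference_map(values, targets, neutral="neutral"):
    """Phase 1: map each attribute value to its inference class."""
    counts = defaultdict(Counter)
    for v, t in zip(values, targets):
        counts[v][t] += 1
    mapping = {}
    for v, class_counts in counts.items():
        ranked = class_counts.most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        mapping[v] = neutral if tied else ranked[0][0]
    return mapping

def best_split(rows, attributes, target):
    """Phase 2: return (attribute, gini) with the lowest split gini index.

    Tuples are grouped by the inference class of their attribute value,
    not by the original value, before the gini index is computed.
    """
    best = None
    for attr in attributes:
        mapping = inference_map([r[attr] for r in rows],
                                [r[target] for r in rows])
        groups = defaultdict(list)
        for r in rows:
            groups[mapping[r[attr]]].append(r[target])
        score = sum(len(g) / len(rows) * gini(g) for g in groups.values())
        if best is None or score < best[1]:
            best = (attr, score)
    return best
```

Because values are grouped by inference class, the number of groups per attribute is bounded by the number of target classes plus one neutral group, however many distinct values the attribute has.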
39Inference Class Identification(1)
40Inference Class Identification(2)
41Inference Class Identification(3)
43Node Split Phase(1)
44Node Split Phase(2)
- In the example, the Node Split Phase of IBC chooses the attribute Humidity, which has the lowest gini index value, as the best splitting attribute for the decision tree node.
- Then, IBC partitions Table 1 into two subtables: one where the value of attribute Humidity is High and the other where the value of attribute Humidity is Normal.
- Following a similar procedure for these subtables, the whole decision tree is built as depicted in Figure 5, where the purity is also examined in each leaf.
46Experimental Datasets
- We experimented with three real-life datasets from the UCI Machine Learning Repository.
- These datasets are used by the machine learning community for the empirical analysis of machine learning algorithms.
- We use a small portion of the data as the training dataset and the rest of the data as the testing dataset.
- Note that all attributes in the selected data are categorical attributes.
47Features of Real Datasets
48Classification Accuracy
49Algorithm Complexity Analysis
- Because SLIQ was shown to outperform C4.5, we only compare SLIQ, the K-means based method and IBC in the scale-up experiments.
- Before the scale-up experiments, we briefly explain the complexity of the three methods. In the general case, the complexity of SLIQ is O(k n^2), the complexity of the K-means based method is O(k (n - k)^2) and the complexity of IBC is O(k n), where k is the number of attributes and n is the size of the data set.
50Scale-Up Experiments(2)
51Conclusion(1)
- According to our observation on real data, the distribution of attributes with respect to information gain is very sparse, because only a few attributes are major discriminating attributes, where a discriminating attribute is an attribute by whose value we are likely to distinguish one tuple from another.
52Conclusion(2)
- The experimental results showed that IBC significantly outperformed the competing methods in execution efficiency for datasets with categorical attributes of sparse distribution, while attaining approximately the same classification accuracy.
- Consequently, IBC can be considered an accurate and efficient classifier for sparse categorical attributes.