Efficient Construction of Decision Trees for Sparse Categorical Attributes


Title: Efficient Construction of Decision Trees for Sparse Categorical Attributes


1
Efficient Construction of Decision Trees for
Sparse Categorical Attributes
  • Shih-Hsiang Lo, Jian-Chih Ou and Ming-Syan Chen

2
Agenda
  • Introduction
  • Preliminaries
  • Inference Based Classifier
  • Performance Studies
  • Conclusion

3
Introduction
  • Classification
  • Decision Tree
  • Problem Identification
  • Inference Class
  • Performance
  • Sparse Data

4
Introduction - Classification
  • Classification is an important issue both in data
    mining and machine learning, with such important
    techniques as Bayesian classification, neural
    networks, genetic algorithms and decision trees.
  • Decision tree classifiers have been identified as
    efficient methods for classification.

5
Introduction - Decision Tree
  • Decision trees with scale-up and parallel
    capability have been shown to be very efficient
    and suitable for large training sets.
  • Decision tree generation algorithms require no
    information beyond that already contained in the
    training data.
  • Decision trees achieve similar and sometimes
    better accuracy compared to other classification
    methods.
  • Most decision tree algorithms are divided into
    two distinct phases, a tree building phase and a
    pruning phase.

6
Introduction - Problem Identification
  • We observe that in many real-life datasets, such
    as customer credit-rating data of banks and
    credit-card companies, medical diagnosis data
    and document categorization data, the attributes
    are mostly categorical and the value of an
    attribute usually implies one target class.

7
Introduction - Inference Class
  • In classification, the attribute that holds the
    label to be predicted is called the target
    attribute. An attribute which is not a target
    attribute is called an ordinary attribute.
  • An inference class is defined as the target
    class to which the majority of tuples with a
    given attribute value belong.

8
(No Transcript)
9
Introduction - Performance
  • After mapping each ordinary attribute value to
    its inference class, it is better and more
    efficient to group the ordinary attribute values
    by their inference classes instead of their
    original values before computing the goodness
    function (e.g., the gini index) for node
    splitting.

10
Introduction - Sparse Data
  • In addition, we find that only a few attributes
    in real data are major discriminating
    attributes, where a discriminating attribute is
    an attribute by whose value we are likely to
    distinguish one tuple from another.
  • To examine the attribute distribution, we use
    information gain as the measure for each
    attribute and the splitting criterion of the
    C4.5 algorithm as the test. Based on these
    observations, most real data have few major
    discriminating attributes and can be identified
    as sparse data.

11
Preliminaries
  • Decision Tree Techniques
  • Information Theory
  • Information Gain and Gain Ratio
  • Gini Index
  • Attribute Distribution
  • Data Sparsity

12
Decision Tree Techniques - Prior Works
  • Decision tree techniques play an important role
    both in machine learning and data mining, and
    numerous decision tree algorithms have been
    developed over the years, e.g., ID3, C4.5, CART,
    SLIQ, SPRINT.
  • Most of them are divided into two distinct
    phases, a tree building and a pruning phase.

13
Decision Tree Techniques - Tree Building Phase
  • The tree building phase can be further divided
    into two steps, a tree node splitting step and a
    data partitioning step.
  • In the tree node splitting step, the classifier
    chooses the best attribute among all attributes
    in the data as the tree node.
  • The data partitioning step then partitions the
    data according to the attribute chosen in the
    first step.
  • Thus, in the tree building phase, the training
    data set is recursively partitioned until all
    partitions are either pure or sufficiently small.
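
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the row format (dicts with a `label` key), the `choose_attribute` callback and the `min_size` threshold are all assumptions made for the example.

```python
from collections import Counter

def build_tree(rows, attributes, choose_attribute, min_size=1):
    """Recursively partition `rows` (dicts with a 'label' key, an assumed format)."""
    labels = [r["label"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the partition is pure, sufficiently small, or has no attribute left.
    if len(set(labels)) == 1 or len(rows) <= min_size or not attributes:
        return {"leaf": majority}
    # Step 1 (node splitting): pick the attribute for this node.
    best = choose_attribute(rows, attributes)
    remaining = [a for a in attributes if a != best]
    # Step 2 (data partitioning): one partition per value of the chosen attribute.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[best], []).append(r)
    return {
        "attribute": best,
        "children": {
            value: build_tree(part, remaining, choose_attribute, min_size)
            for value, part in partitions.items()
        },
    }
```

Any goodness function can be plugged in through `choose_attribute`; the function expects a non-empty list of rows.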

14
Decision Tree Techniques - Overfitting Effect
  • Since the tree building phase constructs a
    perfect tree that accurately classifies every
    record in the training set, the resulting tree
    may be overly sensitive to statistical
    irregularities of the training set.
  • It has been pointed out that one often achieves
    greater accuracy in the classification of new
    objects by using an imperfect, smaller decision
    tree rather than one which perfectly classifies
    all known records.

15
Decision Tree Techniques - Tree Pruning Phase
  • Thus, most algorithms perform a pruning phase
    after the building phase in which nodes are
    iteratively pruned to prevent overfitting and
    to obtain a tree with higher accuracy.
  • An important class of pruning algorithms is based
    on the Minimum Description Length (MDL) principle.

16
A Typical Framework of DT
17
Information Gain (1)
  • Suppose P is a set of p data samples with m
    distinct classes P1, P2, ..., Pm. The expected
    information I needed to classify a given sample
    in P is
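
The equation from this slide is not part of the transcript. Assuming the standard entropy definition used by ID3 and C4.5, and writing p_i for the number of samples in class P_i (a notation assumed here), it would read:

```latex
I(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m} \frac{p_i}{p} \log_2 \frac{p_i}{p}
```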

18
Information Gain(2)
  • Attribute A with values a1, a2, ..., ak
    partitions P into C1, C2, ..., Ck. Let Cj
    contain pij samples of class Pi. The expected
    information E(A), weighted over all subtrees,
    is calculated as below
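
The equation is missing from the transcript. Under the usual textbook definition, and with I denoting the expected information of a class distribution, a plausible reconstruction weights each subset C_j by its share of the p samples:

```latex
E(A) = \sum_{j=1}^{k} \frac{p_{1j} + p_{2j} + \cdots + p_{mj}}{p} \, I(p_{1j}, p_{2j}, \ldots, p_{mj})
```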

19
Information Gain(3)
  • The information gained by branching A can be
    obtained as below
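
The equation is missing from the transcript; the standard definition is Gain(A) = I - E(A), the expected information of the whole set minus the expected information after partitioning on A. The Python sketch below computes it under that assumption. The function names and the list-based input format are illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information (entropy, in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def expected_info(values, labels):
    """E(A): entropy of each partition induced by A, weighted by partition size."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    total = len(labels)
    return sum(len(g) / total * info(g) for g in groups.values())

def gain(values, labels):
    """Gain(A) = I - E(A)."""
    return info(labels) - expected_info(values, labels)
```

An attribute that separates the classes perfectly recovers the full entropy of the label distribution, while an attribute that is independent of the class has a gain of zero.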

20
Gain Ratio
  • To reduce the noise effects of information
    gain, the gain ratio, a normalized alternative
    to information gain, is proposed as below.
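
The equation is missing from the transcript. Assuming the gain ratio as defined for C4.5, where |C_j| is the number of samples with A = a_j and SplitInfo is the name C4.5 uses for the normalizing term, it would read:

```latex
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)},
\qquad
\mathrm{SplitInfo}(A) = -\sum_{j=1}^{k} \frac{|C_j|}{p} \log_2 \frac{|C_j|}{p}
```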

21
Gini Index
  • If a data set T contains examples from n classes,
    the gini index, gini(T), is defined as
    $gini(T) = 1 - \sum_{j=1}^{n} p_j^2$, where pj
    is the relative frequency of class j in T.

22
Gini Index
  • If a data set T is split into k subsets T1, T2,
    ..., Tk with sizes N1, N2, ..., Nk respectively,
    the gini index [10] of the split data is defined
    as $gini_{split}(T) = \sum_{i=1}^{k} \frac{N_i}{N} gini(T_i)$
    with $gini(T_i) = 1 - \sum_{j=1}^{n} p_{ij}^2$,
    where N = N1 + N2 + ... + Nk and pij is the
    relative frequency of class j in Ti.
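
A short Python sketch of both gini computations, assuming the usual definitions (one minus the sum of squared class frequencies, and the size-weighted average over the subsets of a split). The function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    """Size-weighted gini index of a split; `subsets` is a list of label lists."""
    total = sum(len(s) for s in subsets)
    # Empty subsets contribute nothing and are skipped to avoid dividing by zero.
    return sum(len(s) / total * gini(s) for s in subsets if s)
```

A split whose subsets are all pure scores 0, and lower values indicate better splits.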

23
Attribute Distribution
  • Before defining the data sparsity, we need to
    clarify the definition of attribute distribution.
    The attribute distribution is defined as the
    distribution of the measure values of attributes
    where the measure values of attributes are
    obtained by one classifier.
  • For example, we use the C4.5 algorithm as the
    data evaluator and information gain as the
    measure in the splitting phase of C4.5.

24
Sample datasets
25
(No Transcript)
26
Attribute Distribution
  • Therefore, a dataset which has many candidate
    discriminating attributes and a high standard
    deviation of attribute distribution is called
    dense data.
  • Conversely, a dataset which has few candidate
    discriminating attributes and a low standard
    deviation of attribute distribution is called
    sparse data.

27
Definition of Data Sparsity (1)
  • Based on the above observation, we use the C4.5
    algorithm as the data evaluator and the standard
    deviation of the attribute distribution to
    define data sparsity mathematically.

28
Definition of Data Sparsity(2)
  • Suppose the attribute set X with attributes A1,
    A2, ..., Aq resides in data P. The mean and
    variance for X with respect to a measure, such
    as information gain, gain ratio or gini index,
    can be obtained as below.
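
The equations are missing from the transcript. Writing M(A_i) for the measure value of attribute A_i (a notation assumed here) and assuming the population form of the variance, which divides by q, they would read:

```latex
\mu_X = \frac{1}{q} \sum_{i=1}^{q} M(A_i),
\qquad
\sigma_X^2 = \frac{1}{q} \sum_{i=1}^{q} \bigl( M(A_i) - \mu_X \bigr)^2
```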

29
Definition of Data Sparsity(3)
  • Further, the standard deviation for X can be also
    obtained as follows
  • The standard deviation of attribute set X is used
    as measure of sparsity of data P . Then, the
    sparsity of data is defined as below.
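
The equations are missing from the transcript. With M(A_i) denoting the measure value of attribute A_i and \mu_X the mean of those values over the q attributes (both notations assumed here), and with C4.5 as the evaluator, a plausible reconstruction is:

```latex
\sigma_X = \sqrt{ \frac{1}{q} \sum_{i=1}^{q} \bigl( M(A_i) - \mu_X \bigr)^2 },
\qquad
\mathrm{sparsity}_{C4.5}(P) = \sigma_X
```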

30
An Example of Data Sparsity
  • For example, in Table 1 and Table 2, the
    sparsity_C4.5 of data tennis 1 is 0.0491 and
    that of data tennis 2 is 0.0872.
  • Thus, the attributes in data with low sparsity
    are defined as sparse attributes.
  • In the experiments we use sparsity_C4.5 to
    assess the attribute distribution of datasets.
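
The sparsity measure can be sketched in Python as the standard deviation of per-attribute measure values. The sketch assumes the population standard deviation, and the gain values below are hypothetical numbers chosen for illustration; they are not the values from Table 1 or Table 2.

```python
from statistics import pstdev

def sparsity(measure_values):
    """Population standard deviation of the per-attribute measure values."""
    return pstdev(measure_values)

# Hypothetical information gains for two four-attribute datasets.
uneven_gains = [0.25, 0.03, 0.15, 0.05]
even_gains = [0.10, 0.11, 0.09, 0.10]

print(sparsity(uneven_gains), sparsity(even_gains))
```

Under the definition on the preceding slides, the dataset with the smaller value is the one whose attributes count as sparse.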

31
Inference Based Classifier(1)
  • In essence, IBC is a decision tree classifier
    that refines the splitting criteria for
    categorical attributes in the building phase in
    order to reveal the major discriminating
    attribute from sparse attributes.
  • Also, IBC can improve the overall execution
    efficiency and alleviate the overfitting
    problem, as shown in the experiments.

32
Inference Based Classifier(2)
  • Note that information gain and gini index are
    common measurements for selecting the best split
    node. Without loss of generality, we adopt
    information gain as a measurement to identify the
    sparsity of attributes and gini index as the
    measurement for node splitting criterion.

33
Inference Based Classifier(3)
  • Inference Class
  • Algorithm IBC
  • Inference Class Identification Phase
  • Node Split Phase

34
Inference Class (1)
35
(No Transcript)
36
Inference Class(2)
  • For the example profile in Figure 1, if A is
    age with value <30, then domain(t) = {fair,
    excellent}, nA(<30, fair) = 2, and nA(<30,
    excellent) = 1. fair is therefore the inference
    class for the value <30 of the attribute age.

37
Inference Class(3)
  • The unique target class which most tuples with
    attribute value A = ai imply is called the
    inference class for the value ai of attribute A.
  • If the target class implied by most tuples with
    A = ai is not unique, we say attribute value ai
    is associated with a neutral class, and we call
    ai a neutral attribute value. As will be seen
    later, by replacing the original attribute value
    with its inference class when performing node
    splitting, IBC is able to build the decision
    tree very efficiently without compromising the
    quality of classification.
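
The definition above can be sketched in Python as a majority vote per attribute value, with ties mapped to a neutral class. The function name, the input format and the `NEUTRAL` marker are assumptions made for this illustration, not names from the slides.

```python
from collections import Counter

NEUTRAL = "neutral"  # hypothetical marker for values without a unique majority class

def inference_classes(values, labels):
    """Map each attribute value to its inference class, or NEUTRAL on a tie."""
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    result = {}
    for v, counter in counts.items():
        top = counter.most_common(2)
        # The majority class is not unique when the two largest counts are equal.
        tie = len(top) == 2 and top[0][1] == top[1][1]
        result[v] = NEUTRAL if tie else top[0][0]
    return result
```

For the age example from Figure 1 (two fair tuples and one excellent tuple for the youngest age group), the function returns fair for that value.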

38
Algorithm of IBC
  • IBC is divided into two major phases, i.e.,
  • partitioning values of an attribute according to
    their inference class,
  • selecting the best splitting attribute with the
    lowest gini index value from these attributes
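
The two phases can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' algorithm: rows are dicts with a `label` key, values without a unique majority class are grouped into a single "neutral" partition, and ties between attributes go to the first one listed.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def inference_class(labels):
    """Majority class of the labels, or 'neutral' if the majority is not unique."""
    top = Counter(labels).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]

def ibc_best_attribute(rows, attributes, target="label"):
    """Return (attribute, gini) for the split with the lowest gini index."""
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        # Phase 1: collect labels per attribute value, then merge the values
        # that share an inference class into one partition.
        by_value = {}
        for r in rows:
            by_value.setdefault(r[attr], []).append(r[target])
        by_class = {}
        for labels in by_value.values():
            by_class.setdefault(inference_class(labels), []).extend(labels)
        # Phase 2: gini index of the partition induced by the inference classes.
        score = sum(len(g) / len(rows) * gini(g) for g in by_class.values())
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score
```

Because values are merged by inference class, each attribute yields at most one partition per target class plus the neutral one, however many distinct values it has.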

39
Inference Class Identification(1)
40
Inference Class Identification(2)
41
Inference Class Identification(3)
42
(No Transcript)
43
Node Split Phase(1)
44
Node Split Phase(2)
  • In the example, the Node Split Phase of IBC
    chooses the attribute Humidity, which has the
    lowest gini index value, as the best splitting
    attribute for the decision tree node.
  • Then, IBC partitions Table 1 into two subtables,
    one where the value of attribute Humidity is
    High and the other where the value of attribute
    Humidity is Normal.
  • Applying the same procedure to these subtables,
    the whole decision tree is built as depicted in
    Figure 5, where the purity is also examined in
    each leaf.

45
(No Transcript)
46
Experimental Datasets
  • We experimented with three real-life datasets
    from the UCI Machine Learning Repository.
  • These datasets are used by the machine learning
    community for the empirical analysis of machine
    learning algorithms.
  • We use a small portion of the data as the
    training dataset and the rest as the testing
    dataset.
  • Note that all attributes in the selected data
    are categorical attributes.

47
Features of Real Datasets
48
Classification Accuracy
49
Algorithm Complexity Analysis
  • Because SLIQ was shown to outperform C4.5, we
    only compare SLIQ, the K-means based method and
    IBC in the scale-up experiments.
  • Before the scale-up experiments, we briefly
    explain the complexity of the three methods. In
    the general case, the complexity of SLIQ is
    O(k n^2), the complexity of the K-means based
    method is O(k (n - k)^2) and the complexity of
    IBC is O(k n), where k is the number of
    attributes and n is the size of the data set.

50
Scale-Up Experiments(2)
51
Conclusion(1)
  • According to our observation on real data, the
    distribution of attributes with respect to
    information gain is very sparse, because only a
    few attributes are major discriminating
    attributes, where a discriminating attribute is
    an attribute by whose value we are likely to
    distinguish one tuple from another.

52
Conclusion(2)
  • The experimental results showed that IBC
    significantly outperformed the companion methods
    in execution efficiency for datasets with
    categorical attributes of sparse distribution
    while attaining approximately the same
    classification accuracy.
  • Consequently, IBC can be considered an accurate
    and efficient classifier for sparse categorical
    attributes.