SLIQ: A Fast Scalable Classifier for Data Mining

1
SLIQ: A Fast Scalable Classifier for Data Mining
  • Manish Mehta, Rakesh Agrawal, Jorma Rissanen
  • 1996.
  • Presentation by Vladan Radosavljevic

2
Outline
  • Introduction
  • Motivation
  • SLIQ Algorithm
  • Building tree
  • Pruning
  • Example
  • Results
  • Conclusion

3
Introduction
  • Most classification algorithms are designed for
    memory-resident data, which limits their
    suitability for mining large datasets
  • Solution: build a scalable classifier, SLIQ
  • SLIQ: Supervised Learning In Quest; Quest was
    the data mining project at IBM

4
Motivation
  • Recall (ID3, C4.5, CART)

5
Motivation
  • Non-scalable decision trees
  • The complexity lies in determining the best split
    for each attribute
  • The cost of evaluating splits for numerical
    attributes is dominated by the cost of sorting
    the values at each node
  • The cost of evaluating splits for categorical
    attributes is dominated by the cost of searching
    for the best subset
  • Pruning
  • Cross-validation is impractical for large datasets
  • Dividing the data into a training set and a test
    set leaves open how to choose their sizes and
    distributions

6
Motivation
  • Goal: improve the scalability of tree classifiers
  • Previous proposals
  • Sampling the data at each node
  • Discretization of numerical attributes
  • Partitioning the input data and building a tree
    for each partition
  • All of these methods lose accuracy
  • SLIQ improves learning time without loss in
    accuracy

7
SLIQ
  • Key features
  • Tree classifier that handles both numerical and
    categorical attributes
  • Presorts numerical attributes before the tree is
    built
  • Breadth-first growing strategy
  • Goodness-of-split test: Gini index
  • Inexpensive tree pruning algorithm based on the
    Minimum Description Length (MDL) principle
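
The Gini index named above can be sketched in a few lines. This is a minimal illustration using the standard definition, gini(S) = 1 - sum of p_j^2 over the classes, with a split scored by the size-weighted average of its two sides. The function names are hypothetical and not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j**2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["G", "G", "B", "B"]))          # 0.5 (evenly mixed, two classes)
print(gini_split(["G", "G"], ["B", "B"]))  # 0.0 (both sides pure)
```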

8
SLIQ - Algorithm
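  • Eliminates the need to sort the data at each node
  • Create a sorted list for each numerical attribute
    (each entry pairs an attribute value with a
    record index)
  • Create a class list (one entry per record: class
    label and the leaf the record currently belongs to)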
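
A minimal sketch of the presorting step, assuming the records fit in a Python list of dicts. The record values, the name `build_lists`, and the root leaf name "N1" are made up for illustration. Each attribute list holds (value, record index) pairs sorted once up front, and the class list holds one (label, leaf) pair per record.

```python
def build_lists(records, labels, numeric_attrs, root="N1"):
    """Build one presorted attribute list per numerical attribute and the class list."""
    # Class list: indexed by record id; every record starts at the root leaf.
    class_list = [(label, root) for label in labels]
    attr_lists = {}
    for attr in numeric_attrs:
        entries = [(rec[attr], rid) for rid, rec in enumerate(records)]
        entries.sort()  # sorted once, before tree growth starts
        attr_lists[attr] = entries
    return attr_lists, class_list

# Hypothetical toy data
records = [{"age": 30, "salary": 65}, {"age": 23, "salary": 15}, {"age": 40, "salary": 75}]
labels = ["G", "B", "G"]
attr_lists, class_list = build_lists(records, labels, ["age", "salary"])
print(attr_lists["age"])  # [(23, 1), (30, 0), (40, 2)]
```

Because every attribute-list entry carries the record index, the class label and current leaf of a record can be looked up in the class list without re-sorting anything.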

9
SLIQ - Algorithm
  • Example

10
SLIQ - Algorithm
  • Split evaluation
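
The idea of evaluating splits with a single scan over a presorted attribute list can be sketched as follows. This is an illustrative sketch, not the presenter's code: it assumes the attribute list is a sorted list of (value, record index) pairs, the class list is a list of (label, leaf) pairs, each leaf keeps a class histogram for the records below and above the scan position, and the split point is taken as the midpoint between consecutive distinct values. All names are hypothetical.

```python
from collections import Counter, defaultdict

def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_splits(attr_list, class_list):
    """Return {leaf: (weighted_gini, threshold)} for tests of the form value <= threshold."""
    above = defaultdict(Counter)   # records not yet passed by the scan
    below = defaultdict(Counter)   # records already passed by the scan
    for label, leaf in class_list:
        above[leaf][label] += 1
    last_value = {}                # last value seen per leaf
    best = {}
    for value, rid in attr_list:
        label, leaf = class_list[rid]
        lo, hi = below[leaf], above[leaf]
        n_lo = sum(lo.values())
        # Only split between two distinct values that both occur at this leaf.
        if n_lo > 0 and last_value[leaf] != value:
            n_hi = sum(hi.values())
            score = (n_lo * gini_from_counts(lo) + n_hi * gini_from_counts(hi)) / (n_lo + n_hi)
            threshold = (last_value[leaf] + value) / 2
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, threshold)
        lo[label] += 1
        hi[label] -= 1
        last_value[leaf] = value
    return best

# Hypothetical toy data: one leaf, three records
age_list = [(23, 1), (30, 0), (40, 2)]
class_list = [("G", "N1"), ("B", "N1"), ("G", "N1")]
print(best_numeric_splits(age_list, class_list))  # {'N1': (0.0, 26.5)}
```

The same scan serves every leaf of the current level at once, which is what makes the breadth-first strategy fit the presorted lists.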

11
SLIQ - Algorithm
  • Example

12
SLIQ - Algorithm
  • Update class list
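
A sketch of the class-list update, under the same assumed layout (attribute list of (value, record index) pairs, class list of (label, leaf) pairs). The `splits` mapping and the child names are hypothetical. After the best split is chosen, the attribute list of the splitting attribute is scanned once and each affected record is redirected to the matching child.

```python
def update_class_list(attr_list, class_list, splits):
    """Point each record at its new child leaf.

    splits maps a leaf name to (threshold, left_child, right_child) for the
    leaves that are split on the attribute this attr_list belongs to.
    """
    for value, rid in attr_list:
        label, leaf = class_list[rid]
        if leaf in splits:
            threshold, left, right = splits[leaf]
            class_list[rid] = (label, left if value <= threshold else right)
    return class_list

# Hypothetical toy data: split the root N1 on age <= 26.5
age_list = [(23, 1), (30, 0), (40, 2)]
class_list = [("G", "N1"), ("B", "N1"), ("G", "N1")]
update_class_list(age_list, class_list, {"N1": (26.5, "N2", "N3")})
print(class_list)  # [('G', 'N3'), ('B', 'N2'), ('G', 'N3')]
```

Only the class list changes; the attribute lists keep their sorted order and are never rewritten during the update.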

13
SLIQ - Algorithm
  • Example

14
SLIQ - Algorithm
  • For large-cardinality categorical attributes
    (cardinality above a threshold) the best split is
    computed greedily; otherwise all possible
    subsets are evaluated
  • When a node becomes pure, stop splitting it, then
    condense the attribute lists by discarding the
    examples that belong to the pure node
  • SLIQ scales to large datasets with no loss in
    accuracy: the splits evaluated with or without
    presorting are identical
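
The subset search for categorical attributes can be sketched as below. This is a hedged illustration: the histogram layout, the function names, and the default threshold `max_exhaustive=10` are assumptions. Small cardinalities are searched exhaustively; larger ones grow the subset one category at a time and stop when the split no longer improves.

```python
from collections import Counter
from itertools import combinations

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(subset, hist):
    """Weighted Gini of 'category in subset' vs. the rest; hist maps category -> Counter of labels."""
    left, right = Counter(), Counter()
    for cat, counts in hist.items():
        (left if cat in subset else right).update(counts)
    n_l, n_r = sum(left.values()), sum(right.values())
    return (n_l * gini_from_counts(left) + n_r * gini_from_counts(right)) / (n_l + n_r)

def best_subset(hist, max_exhaustive=10):
    """Return the subset of categories giving the best binary split, or None."""
    cats = sorted(hist)
    if len(cats) < 2:
        return None  # nothing to split on
    if len(cats) <= max_exhaustive:
        # Exhaustive search over all non-empty proper subsets.
        candidates = (frozenset(c) for k in range(1, len(cats))
                      for c in combinations(cats, k))
        return min(candidates, key=lambda s: split_score(s, hist))
    # Greedy search: add the single best category until no improvement.
    subset, best = frozenset(), float("inf")
    while len(subset) < len(cats) - 1:
        score, cat = min((split_score(subset | {c}, hist), c)
                         for c in cats if c not in subset)
        if score >= best:
            break
        subset, best = subset | {cat}, score
    return subset

# Hypothetical class histogram per category value
hist = {"a": Counter(G=3), "b": Counter(B=2), "c": Counter(G=1), "d": Counter(B=4)}
print(sorted(best_subset(hist)))                    # exhaustive search
print(sorted(best_subset(hist, max_exhaustive=0)))  # forces the greedy path
```

The greedy path evaluates on the order of n^2 candidate subsets for n categories instead of 2^n, at the price of possibly missing the optimal subset.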

15
SLIQ - Pruning
  • Post-pruning algorithm based on the Minimum
    Description Length principle
  • Find the model M that minimizes
  • Cost(M, D) = Cost(D | M) + Cost(M)
  • Cost(M) - cost of encoding the model
  • Cost(D | M) - cost of encoding the data D given
    the model M

16
SLIQ - Pruning
  • Cost of the data: classification error
  • Cost of the model
  • Encoding the tree: number of bits
  • Encoding the splits
  • numerical attribute - constant (empirically 1 bit)
  • categorical attribute - depends on cardinality
  • MDL pruning evaluates the code length at each
    node to determine whether to prune one or both
    children or leave the node intact

17
SLIQ - Pruning
  • Three pruning strategies
  • Full: either prune both children and convert the
    node to a leaf, or leave the node intact
  • Partial: convert the node to a leaf, prune the
    left child, prune the right child, or leave the
    node intact
  • Hybrid: apply the Full method first, then the
    Partial options (prune left, prune right, or
    leave intact)
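
A simplified sketch of the Full strategy, assuming a bottom-up pass that compares two code lengths at each internal node. The cost model here is an assumption for illustration: 1 bit to say whether a node is a leaf or has two children, the split cost stored on the node (1 bit for a numerical split, as on the previous slide), and the data cost of a leaf taken as its number of misclassified training records. The `Node` class and the tie-breaking rule (prune on equal cost) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NODE_BITS = 1.0  # assumed: one bit distinguishes "leaf" from "two children"

@dataclass
class Node:
    errors: int                      # training records misclassified if this node is a leaf
    split_cost: float = 1.0          # assumed bits to encode the split test
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def prune_full(node):
    """Prune in place; return the code length of the resulting subtree."""
    cost_leaf = NODE_BITS + node.errors
    if node.left is None or node.right is None:
        return cost_leaf
    cost_both = (NODE_BITS + node.split_cost
                 + prune_full(node.left) + prune_full(node.right))
    if cost_leaf <= cost_both:
        node.left = node.right = None  # convert the node to a leaf
        return cost_leaf
    return cost_both

# A useful split: 10 errors as a leaf, 1 error after splitting
useful = Node(errors=10, left=Node(errors=0), right=Node(errors=1))
# A useless split: the children do not reduce the error count
useless = Node(errors=2, left=Node(errors=1), right=Node(errors=1))
print(prune_full(useful), useful.left is not None)   # 5.0 True
print(prune_full(useless), useless.left is None)     # 3.0 True
```

The Partial and Hybrid strategies would add the "keep only the left child" and "keep only the right child" options to the same comparison.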

18
Results
  • SLIQ was tested on the datasets

19
Results
  • Pruning strategy comparison

20
Results
  • Accuracy

21
Results
  • Scalability

22
Conclusion
  • SLIQ proves to be a fast, low-cost and scalable
    classifier that builds accurate trees
  • In empirical tests comparing SLIQ to other
    tree-based classifiers, SLIQ achieves comparable
    accuracy while producing smaller decision trees
  • Open question on scalability: memory may become
    a problem as the number of attributes or the
    number of classes grows

23
References
  • [1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ:
    A Fast Scalable Classifier for Data Mining", in
    Proceedings of the 5th International Conference
    on Extending Database Technology, Avignon,
    France, Mar. 1996.

24
  • THANK YOU!