1
RainForest - A Framework for Fast Decision Tree
Construction of Large Datasets
  • J. Gehrke, R. Ramakrishnan, V. Ganti
  • Dept. of Computer Sciences University of
    Wisconsin-Madison
  • Presented by Hui Yang
  • April 18, 2001

2
Introduction to Classification
  • An important data mining problem
  • Input: a database of training records with
  • a class label attribute
  • predictor attributes
  • Goal: build a concise model of the distribution
    of the class label in terms of the predictor
    attributes
  • Applications: scientific experiments, medical
    diagnosis, fraud detection, etc.

3
Decision Tree: A Classification Model
  • One of the most attractive classification models
  • A large number of algorithms exist to construct
    decision trees, e.g. SLIQ, CART, C4.5, SPRINT
  • Most are main-memory algorithms
  • Tradeoff between supporting large databases,
    performance, and constructing more accurate
    decision trees

4
Motivation of RainForest
  • Develop a unifying framework that can be applied
    to most decision tree algorithms and yields a
    scalable version of the algorithm without
    changing its result
  • Separate the scalability aspects of these
    algorithms from the central features that
    determine the quality of the decision tree

5
Decision Tree Terms
  • Root, leaf, and internal nodes
  • Each leaf is labeled with one class label
  • Each internal node is labeled with one predictor
    attribute, called the splitting attribute
  • Each edge e from a node n has a predicate q
    associated with it; q involves only n's
    splitting attribute
  • P: the set of predicates on all outgoing edges of
    an internal node; non-overlapping and exhaustive
  • crit(n): the splitting criterion of n, the
    combination of splitting attribute and predicates

6
Decision Tree Terms (Contd)
  • F(n): the family of tuples of the database D
    associated with a node n
  • Definition: let E = {e1, e2, ..., ek} and
    Q = {q1, q2, ..., qk} be the edge set and
    predicate set for a node n, and let p be the
    parent node of n
  • If n = root, F(n) = D
  • If n ≠ root, let q(p→n) be the predicate on
    edge e(p→n); then
    F(n) = {t : t ∈ D, t ∈ F(p), and
    q(p→n)(t) = True}
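A small sketch of this definition in Python: F(n) is obtained by filtering the parent's family through the predicate on the connecting edge. The record fields and the predicate below are illustrative, not taken from the paper.

```python
# Hypothetical sketch of F(n): filter the parent's family F(p)
# through the predicate q(p -> n) on the connecting edge.
def family(parent_family, edge_predicate):
    return [t for t in parent_family if edge_predicate(t)]

D = [{"age": 25}, {"age": 40}, {"age": 17}]   # F(root) = D
q = lambda t: t["age"] >= 18                  # illustrative predicate on edge root -> n
print(family(D, q))                           # [{'age': 25}, {'age': 40}]
```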

7
RainForest Framework: Top-Down Tree Induction
Schema
  • Input: node n, data partition D, classification
    algorithm CL
  • Output: decision tree for D rooted at n
  • Top-Down Decision Tree Induction Schema
  • BuildTree(Node n, datapartition D, algorithm CL)
  • (1) Apply CL to D to find crit(n)
  • (2) let k be the number of children of n
  • (3) if (k > 0)
  • (4) Create k children c1 ... ck of n
  • (5) Use best split to partition D into D1 ... Dk
  • (6) for (i = 1; i <= k; i++)
  • (7) BuildTree(ci, Di)
  • (8) endfor
  • (9) endif
  • RainForest refinement of step (1):
  • (1a) for each predictor attribute p
  • (1b) Call CL.find_best_partitioning(AVC-set of p)
  • (1c) endfor
  • (2a) k = CL.decide_splitting_criterion()
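The induction schema above can be sketched as runnable Python. The real schema delegates crit(n) to whatever algorithm CL is plugged in via find_best_partitioning / decide_splitting_criterion; the stand-in below is deliberately naive (split on the next unused attribute, one child per distinct value) and exists only to show the recursion shape.

```python
from collections import Counter, defaultdict

def build_avc_set(records, attr):
    # AVC-set: aggregated (attribute value, class label) counts
    return Counter((r[attr], r["label"]) for r in records)

def build_tree(records, attrs):
    labels = [r["label"] for r in records]
    if len(set(labels)) == 1 or not attrs:      # crit(n) yields no split
        return Counter(labels).most_common(1)[0][0]
    # steps (1a)-(1c): a real CL would inspect each attribute's AVC-set
    avc_sets = {a: build_avc_set(records, a) for a in attrs}
    attr = attrs[0]                              # toy splitting criterion, not CL's
    partitions = defaultdict(list)               # step (5): partition D into D1..Dk
    for r in records:
        partitions[r[attr]].append(r)
    return {attr: {v: build_tree(part, attrs[1:])   # steps (6)-(8): recurse
                   for v, part in partitions.items()}}

D = [
    {"outlook": "sunny", "windy": True,  "label": "no"},
    {"outlook": "sunny", "windy": False, "label": "yes"},
    {"outlook": "rain",  "windy": True,  "label": "no"},
    {"outlook": "rain",  "windy": False, "label": "yes"},
]
tree = build_tree(D, ["outlook", "windy"])
```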

8
RainForest Tree Induction Schema (Contd)
  • AVC stands for Attribute-Value, Class-label
  • AVC-set: the AVC-set of a predictor attribute a
    is the projection of F(n) onto a and the class
    label, with the counts of the individual class
    labels aggregated
  • AVC-group: the AVC-group of a node n is the set
    of all AVC-sets at node n
  • The size of the AVC-set of a predictor attribute
    a at node n depends only on the number of
    distinct values of a and the number of class
    labels in F(n)
  • AVC-group(r) is not equal to F(r); it contains
    aggregated information that is sufficient for
    decision tree construction
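A sketch of what the aggregation buys, using illustrative tuples: the AVC-set keeps one counter per (attribute value, class label) pair, no matter how many tuples of F(n) share that pair.

```python
from collections import Counter

# F(n) with an illustrative predictor attribute "age"; the AVC-set is
# the projection onto (age, class label) with counts aggregated.
family = [
    {"age": 30, "label": "yes"},
    {"age": 30, "label": "yes"},
    {"age": 30, "label": "no"},
    {"age": 45, "label": "no"},
]
avc_set = Counter((t["age"], t["label"]) for t in family)
# 4 tuples collapse to 3 counters:
# (30, 'yes'): 2, (30, 'no'): 1, (45, 'no'): 1
```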

9
AVC-groups and Main Memory
  • The available main-memory size determines the
    implementation of the RainForest schema
  • Case 1: the AVC-group of the root node fits in
    main memory (M.M.)
  • RF-Write, RF-Read, RF-Hybrid
  • Case 2: each individual AVC-set of the root node
    fits into M.M., but the AVC-group does not
  • RF-Vertical
  • Case 3: neither Case 1 nor Case 2 holds

10
Steps for Algorithms in RainForest Family
  • 1. AVC-group construction
  • 2. Choose splitting attribute and predicate
  • This step uses the decision tree algorithm CL
    that is being scaled by the RainForest framework
  • 3. Partition D across the children nodes
  • We must read the entire dataset and write out all
    records, partitioning them into child "buckets"
    according to the splitting criterion chosen in
    the previous step
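Step 3 above (partitioning D across the children) can be sketched as a single sequential pass that routes each record to the child whose edge predicate it satisfies. The predicates here are illustrative, not from the paper; they must be non-overlapping and exhaustive, as slide 5 requires.

```python
from collections import defaultdict

def partition(records, predicates):
    # predicates: list of (child_id, pred); assumed non-overlapping
    # and exhaustive, so each record lands in exactly one bucket
    buckets = defaultdict(list)
    for r in records:
        for child, pred in predicates:
            if pred(r):
                buckets[child].append(r)
                break
    return buckets

D = [{"age": a} for a in (15, 30, 60)]
crit = [("left", lambda r: r["age"] < 40),   # illustrative split predicates
        ("right", lambda r: r["age"] >= 40)]
sizes = {c: len(b) for c, b in partition(D, crit).items()}
print(sizes)   # {'left': 2, 'right': 1}
```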

11
Algorithms: RF-Write / RF-Read
  • Prerequisite: the AVC-group fits into M.M.
  • RF-Write
  • For each level of the tree, it reads the entire
    database twice and writes the entire database
    once
  • RF-Read
  • Makes an increasing number of scans of the entire
    database
  • Marks one end of the design spectrum in the
    RainForest framework

12
Algorithm: RF-Hybrid
  • A combination of RF-Write and RF-Read
  • Performance can be improved by concurrent
    construction of AVC-sets

13
Algorithm: RF-Vertical
  • Prerequisite: each individual AVC-set fits into
    M.M.
  • For very large AVC-sets, a temporary file is
    generated for each node; the large sets are then
    constructed from this temporary file
  • Small sets are constructed directly in M.M.

14
Experiments: Datasets
15
Experiment Results (1)
  • The overall maximum number of entries in the
    AVC-group of the root node is about 2.1 million,
    requiring a maximum memory size of 17 MB
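A hedged back-of-envelope check of the 2.1-million-entry / 17 MB figure above, assuming 8-byte counters (the paper's actual per-entry layout may differ):

```python
entries = 2_100_000     # max AVC-group entries at the root, per the experiments
bytes_per_entry = 8     # assumption: one 8-byte counter per entry
mb = entries * bytes_per_entry / 1e6
print(mb)               # 16.8 -> roughly the 17 MB reported
```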

16
Experiment Results (2)
  • The performance of RF-Write, RF-Read, and
    RF-Hybrid as the size of the input database
    increases

17
Experiment Results (3)
  • How do internal properties of the AVC-groups of
    the training database influence performance?
  • Result: the AVC-group size and the main-memory
    size are the two factors that determine
    performance

18
Experiment Results (4)
  • How is performance affected as the number of
    attributes increases?
  • Result: roughly linear scale-up with the number
    of attributes

19
Conclusion
  • A scalable framework applicable to all decision
    tree algorithms of the time
  • The AVC-group is the key idea
  • Requires a database scan at each level of the
    decision tree
  • Depends heavily on the size of the available
    main memory