1
RainForest - A Framework for Fast Decision Tree
Construction of Large Datasets
  • J. Gehrke, R. Ramakrishnan, V. Ganti
  • Dept. of Computer Sciences University of
    Wisconsin-Madison
  • Presented by Hui Yang
  • April 18, 2001

2
Introduction to Classification
  • An important data mining problem
  • Input: a database of training records with
  • a class label attribute
  • predictor attributes
  • Goal: build a concise model of the distribution
    of the class label in terms of the predictor
    attributes
  • Applications: scientific experiments, medical
    diagnosis, fraud detection, etc.

3
Decision Tree: A Classification Model
  • One of the most attractive classification models
  • A large number of algorithms exist to construct
    decision trees, e.g. SLIQ, CART, C4.5, SPRINT
  • Most are main-memory algorithms
  • Tradeoff between supporting large databases,
    performance, and constructing more accurate
    decision trees

4
Motivation of RainForest
  • Develop a unifying framework that can be applied
    to most decision tree algorithms and yields a
    scalable version of the algorithm without
    changing its result
  • Separate the scalability aspects of these
    algorithms from the central features that
    determine the quality of the decision tree

5
Decision Tree Terms
  • Root, leaf, and internal nodes
  • Each leaf is labeled with one class label
  • Each internal node is labeled with one predictor
    attribute, called the splitting attribute
  • Each edge e from a node n has a predicate q
    associated with it; q involves only n's
    splitting attribute
  • P: the set of predicates on all outgoing edges of
    an internal node; non-overlapping and exhaustive
  • crit(n): the splitting criterion of n, the
    combination of splitting attribute and predicates

6
Decision Tree Terms (Contd)
  • F(n): the family of tuples of the database D
    associated with a node n
  • Definition: let E = {e1, e2, ..., ek} and
    Q = {q1, q2, ..., qk} be the edge set and
    predicate set for a node n, and let p be the
    parent node of n
  • If n = root, F(n) = D
  • If n ≠ root, let q(p→n) be the predicate on
    edge e(p→n); then
    F(n) = {t : t ∈ D, t ∈ F(p), and
    q(p→n)(t) = True}
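A small sketch of this definition in Python: F(n) is obtained by filtering the parent's family through the predicate on the connecting edge. The record fields and the predicate below are illustrative, not taken from the paper.

```python
# Hypothetical sketch of F(n): filter the parent's family F(p)
# through the predicate q(p -> n) on the connecting edge.
def family(parent_family, edge_predicate):
    return [t for t in parent_family if edge_predicate(t)]

D = [{"age": 25}, {"age": 40}, {"age": 17}]   # F(root) = D
q = lambda t: t["age"] >= 18                  # illustrative predicate on edge root -> n
print(family(D, q))                           # [{'age': 25}, {'age': 40}]
```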

7
RainForest Framework: Top-Down Tree Induction
Schema
  • Input: node n, data partition D, classification
    algorithm CL
  • Output: decision tree for D rooted at n
  • Top-Down Decision Tree Induction Schema
  • BuildTree(Node n, datapartition D, algorithm CL)
  • (1) Apply CL to D to find crit(n)
  • (2) let k be the number of children of n
  • (3) if (k > 0)
  • (4) Create k children c1 ... ck of n
  • (5) Use best split to partition D into D1 ... Dk
  • (6) for (i = 1; i <= k; i++)
  • (7) BuildTree(ci, Di)
  • (8) endfor
  • (9) endif
  • RainForest refinement of step (1):
  • (1a) for each predictor attribute p
  • (1b) Call CL.find_best_partitioning(AVC-set of p)
  • (1c) endfor
  • (2a) k = CL.decide_splitting_criterion()
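The induction schema above can be sketched as runnable Python. The real schema delegates crit(n) to whatever algorithm CL is plugged in via find_best_partitioning / decide_splitting_criterion; the stand-in below is deliberately naive (split on the next unused attribute, one child per distinct value) and exists only to show the recursion shape.

```python
from collections import Counter, defaultdict

def build_avc_set(records, attr):
    # AVC-set: aggregated (attribute value, class label) counts
    return Counter((r[attr], r["label"]) for r in records)

def build_tree(records, attrs):
    labels = [r["label"] for r in records]
    if len(set(labels)) == 1 or not attrs:      # crit(n) yields no split
        return Counter(labels).most_common(1)[0][0]
    # steps (1a)-(1c): a real CL would inspect each attribute's AVC-set
    avc_sets = {a: build_avc_set(records, a) for a in attrs}
    attr = attrs[0]                              # toy splitting criterion, not CL's
    partitions = defaultdict(list)               # step (5): partition D into D1..Dk
    for r in records:
        partitions[r[attr]].append(r)
    return {attr: {v: build_tree(part, attrs[1:])   # steps (6)-(8): recurse
                   for v, part in partitions.items()}}

D = [
    {"outlook": "sunny", "windy": True,  "label": "no"},
    {"outlook": "sunny", "windy": False, "label": "yes"},
    {"outlook": "rain",  "windy": True,  "label": "no"},
    {"outlook": "rain",  "windy": False, "label": "yes"},
]
tree = build_tree(D, ["outlook", "windy"])
```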

8
RainForest Tree Induction Schema (Contd)
  • AVC stands for Attribute-Value, Class-label
  • AVC-set: the AVC-set of a predictor attribute a
    is the projection of F(n) onto a and the class
    label, with the counts of the individual class
    labels aggregated
  • AVC-group: the AVC-group of a node n is the set
    of all AVC-sets at node n
  • The size of the AVC-set of a predictor attribute
    a at node n depends only on the number of
    distinct values of a and the number of class
    labels in F(n)
  • AVC-group(r) is not equal to F(r); it contains
    aggregated information that is sufficient for
    decision tree construction
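A sketch of what the aggregation buys, using illustrative tuples: the AVC-set keeps one counter per (attribute value, class label) pair, no matter how many tuples of F(n) share that pair.

```python
from collections import Counter

# F(n) with an illustrative predictor attribute "age"; the AVC-set is
# the projection onto (age, class label) with counts aggregated.
family = [
    {"age": 30, "label": "yes"},
    {"age": 30, "label": "yes"},
    {"age": 30, "label": "no"},
    {"age": 45, "label": "no"},
]
avc_set = Counter((t["age"], t["label"]) for t in family)
# 4 tuples collapse to 3 counters:
# (30, 'yes'): 2, (30, 'no'): 1, (45, 'no'): 1
```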

9
AVC-groups and Main Memory
  • The available main-memory size determines the
    implementation of the RainForest schema
  • Case 1: the AVC-group of the root node fits in
    main memory (M.M.)
  • RF-Write, RF-Read, RF-Hybrid
  • Case 2: each individual AVC-set of the root node
    fits into M.M., but the AVC-group does not
  • RF-Vertical
  • Case 3: neither Case 1 nor Case 2 holds

10
Steps for Algorithms in RainForest Family
  • 1. AVC-group construction
  • 2. Choose splitting attribute and predicate
  • This step uses the decision tree algorithm CL
    that is being scaled by the RainForest framework
  • 3. Partition D across the children nodes
  • We must read the entire dataset and write out all
    records, partitioning them into child "buckets"
    according to the splitting criterion chosen in
    the previous step
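Step 3 above (partitioning D across the children) can be sketched as a single sequential pass that routes each record to the child whose edge predicate it satisfies. The predicates here are illustrative, not from the paper; they must be non-overlapping and exhaustive, as slide 5 requires.

```python
from collections import defaultdict

def partition(records, predicates):
    # predicates: list of (child_id, pred); assumed non-overlapping
    # and exhaustive, so each record lands in exactly one bucket
    buckets = defaultdict(list)
    for r in records:
        for child, pred in predicates:
            if pred(r):
                buckets[child].append(r)
                break
    return buckets

D = [{"age": a} for a in (15, 30, 60)]
crit = [("left", lambda r: r["age"] < 40),   # illustrative split predicates
        ("right", lambda r: r["age"] >= 40)]
sizes = {c: len(b) for c, b in partition(D, crit).items()}
print(sizes)   # {'left': 2, 'right': 1}
```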

11
Algorithms: RF-Write / RF-Read
  • Prerequisite: the AVC-group fits into M.M.
  • RF-Write
  • For each level of the tree, it reads the entire
    database twice and writes the entire database
    once
  • RF-Read
  • Makes an increasing number of scans of the entire
    database
  • Marks one end of the design spectrum in the
    RainForest framework

12
Algorithm: RF-Hybrid
  • A combination of RF-Write and RF-Read
  • Performance can be improved by concurrent
    construction of AVC-sets

13
Algorithm: RF-Vertical
  • Prerequisite: each individual AVC-set fits into
    M.M.
  • For very large AVC-sets, a temporary file is
    generated for each node; the large sets are then
    constructed from this temporary file
  • Small sets are constructed directly in M.M.

14
Experiments: Datasets
15
Experiment Results (1)
  • The overall maximum number of entries in the
    AVC-group of the root node is about 2.1 million,
    requiring a maximum memory size of 17 MB
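A hedged back-of-envelope check of the 2.1-million-entry / 17 MB figure above, assuming 8-byte counters (the paper's actual per-entry layout may differ):

```python
entries = 2_100_000     # max AVC-group entries at the root, per the experiments
bytes_per_entry = 8     # assumption: one 8-byte counter per entry
mb = entries * bytes_per_entry / 1e6
print(mb)               # 16.8 -> roughly the 17 MB reported
```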

16
Experiment Results (2)
  • The performance of RF-Write, RF-Read, and
    RF-Hybrid as the size of the input database
    increases

17
Experiment Results (3)
  • How do internal properties of the AVC-groups of
    the training database influence performance?
  • Result: the AVC-group size and the main-memory
    size are the two factors that determine
    performance

18
Experiment Results (4)
  • How is performance affected as the number of
    attributes increases?
  • Result: roughly linear scale-up with the number
    of attributes

19
Conclusion
  • A scalable framework applicable to all decision
    tree algorithms of the time
  • The AVC-group is the key idea
  • Requires a database scan at each level of the
    decision tree
  • Depends heavily on the size of the available
    main memory