Title: Decision Tree Algorithms in the Parallel Setting (SPRINT)
1. Decision Tree Algorithms in the Parallel Setting (SPRINT)
- Submitted by
- Prasad Valapet B00343912
- Soumen Sengupta B00191069
- Probir Ghosh B00350646
2. Agenda
- Introduction
- Problem Defined
- Objective
- Decision Tree Overview
- Data Structures
- Serial SPRINT
- Calculating Gini Index
- Parallelizing SPRINT
- Synchronous Tree Construction Approach
- Performance Evaluation
- Conclusion
- References
3. Classification
- Classification is a data mining technique used to build a model that assigns each record in a training set to a particular class.
- Training examples consist of:
- - Categorical attributes
- - Continuous attributes
- - A class attribute (class label)
[Figure: sample training set showing categorical attributes, continuous attributes and the class label]
4. Problem Defined
- Decision tree algorithms such as ID3 and C4.5 assume that the training examples fit in memory.
- For large datasets, decision tree construction becomes inefficient if the data has to be swapped in and out of memory.
- Sampling of the data is often used so that it fits into memory.
5. Problem Defined (contd.)
- Other approaches partition the data into subsets (that fit into memory) and then construct a decision tree by combining the classifiers built from each subset.
- A more accurate decision tree classifier can be obtained if the entire training set is used.
6. Objective
- To implement a decision tree based classification algorithm (SPRINT) that addresses the scalability and efficiency issues of dealing with large datasets.
- To investigate the speedup, sizeup and scaleup characteristics of SPRINT.
7. Decision Tree: An Operational Overview
- Build phase (train the model):
- - Select a splitting criterion (information gain, gain ratio, Gini index, chi-square test).
- - Apply recursive partitioning until all training examples are classified or the attributes are exhausted.
- Prune phase (handle overfitting of the data):
- - Prune the tree.
- Test phase:
- - Test the model.
[Figure: example decision tree splitting on Age (>28 / <28), then on Occupation (Private / Public) and Salary (<50K / >50K)]
8. Data Structures
- Attribute lists, each entry holding:
- - Attribute value
- - Class label
- - Record id (rid)
- An attribute list is created for each attribute.
- The entries in an attribute list are called attribute records (see the sketch below).
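A minimal C sketch of an attribute record; the field and type names are illustrative assumptions, not taken from the source.

typedef struct {
    float value;        /* attribute value, e.g. Age                 */
    int   class_label;  /* index of the class this record belongs to */
    int   rid;          /* record id linking entries across lists    */
} AttrRecord;

/* One attribute list per attribute, e.g.: AttrRecord age_list[N]; */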
9. Histograms
- Two class histograms are used to store the class distribution for continuous attributes (see the sketch below):
- - Cbelow contains the class distribution of the records that have already been processed.
- - Cabove contains the class distribution of the records that have not yet been processed.
[Figure: class histograms at cursor position 2]
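A minimal sketch of the two class histograms in C, assuming NUM_CLASSES classes; all names are illustrative.

#define NUM_CLASSES 2

typedef struct {
    int cbelow[NUM_CLASSES];  /* classes of records already processed */
    int cabove[NUM_CLASSES];  /* classes of records not yet processed */
} ClassHist;

/* Advancing the cursor past one attribute record moves its class
 * from Cabove to Cbelow. */
void advance_cursor(ClassHist *h, int class_label) {
    h->cabove[class_label]--;
    h->cbelow[class_label]++;
}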
10. Count Matrix
- The count matrix stores the class distribution for each value of a categorical attribute.
[Figure: count matrix for Occupation]
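A count matrix can be kept as a small 2-D array, one row per attribute value and one column per class. A sketch with illustrative sizes:

#define NUM_VALUES  2   /* e.g. two values of Occupation */
#define NUM_CLASSES 2

int count_matrix[NUM_VALUES][NUM_CLASSES] = {0};

/* For every attribute record, bump the cell for its value/class pair. */
void count_record(int value_idx, int class_label) {
    count_matrix[value_idx][class_label]++;
}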
11. SPRINT (serial algorithm)
Partition(Data S)
  if (all points in S are of the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A
  use the best split point found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
12. Evaluating split points for all attributes
- Gini Index:
- Gini(S) = 1 - Σ_{j=1..k} p_j^2
- where k is the total number of classes,
- p_j is the relative frequency of class j in S,
- and |S| is the size of the training set.
- For a pure sample S containing a single class, Gini(S) = 0.
13. Evaluating split points for all attributes (contd.)
- Gini_split = (n1/n) Gini(S1) + (n2/n) Gini(S2)
- where n1 is the number of records in subset S1,
- n2 is the number of records in subset S2,
- and n is the total number of records.
- The attribute with the minimum Gini index is chosen to split the node (a sketch of both formulas follows).
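A minimal C sketch of both formulas; the array sizes and names are illustrative assumptions.

#define NUM_CLASSES 2

/* Gini(S) = 1 - sum_j p_j^2, computed from per-class counts. */
double gini(const int counts[NUM_CLASSES], int n) {
    if (n == 0) return 0.0;
    double g = 1.0;
    for (int j = 0; j < NUM_CLASSES; j++) {
        double p = (double)counts[j] / n;
        g -= p * p;
    }
    return g;
}

/* Gini_split = (n1/n) * Gini(S1) + (n2/n) * Gini(S2). */
double gini_split(const int below[NUM_CLASSES], int n1,
                  const int above[NUM_CLASSES], int n2) {
    int n = n1 + n2;
    if (n == 0) return 0.0;
    return ((double)n1 / n) * gini(below, n1)
         + ((double)n2 / n) * gini(above, n2);
}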
14. SPRINT (serial algorithm, contd.)
- Uses attribute lists as the data structure.
- The attribute lists for continuous attributes are sorted.
- The attribute lists are split among the tree nodes.
- The tree is grown in a breadth-first manner.
- The record ids of the splitting attribute are hashed together with the tree node they move to.
- The remaining attribute lists can then be split by looking up the hash table (sketched below).
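A minimal sketch of the rid-to-child mapping, assuming record ids are dense so a plain array can stand in for the hash table; the names and bound are illustrative.

#define MAX_RECORDS 1000000     /* illustrative upper bound */

/* child[rid] records which child node (e.g. 1 or 2) a record moved to
 * when the winning attribute list was split. */
int child[MAX_RECORDS];

/* While splitting a non-winning attribute list, each record looks up
 * where its rid went. */
int destination_node(int rid) {
    return child[rid];
}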
15. Partitioning of Records
[Figure: the attribute list for node 0 is split on Age (<29.5 / >=29.5) into attribute lists for child nodes 1 and 2]
16. Decision Tree Algorithms in the Parallel Setting
- Parallelizing the construction of the decision tree:
- Data is read from the text file serially.
- Attribute lists are created in parallel.
- The count matrix for categorical attributes and the split points for continuous attributes are calculated.
- The Gini index for continuous and categorical attributes is found in parallel.
- Based on the Gini index, each node of the decision tree is identified and the tree is constructed in parallel.
17. Decision Tree Algorithms in the Parallel Setting
- Data is read from the text file serially:
- - The data is read into a structure.
- Attribute lists are created in parallel:
- - The data is scattered to all the processors using MPI_Scatter (see the sketch below).
- - SndCnt = index / size (the number of records divided by the number of processors).
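A minimal sketch of the scatter step, assuming index is the total number of records, size is the number of MPI processes, index divides evenly, and rec_type is an MPI datatype describing the attribute record (e.g. built with MPI_Type_create_struct); all names are illustrative.

#include <mpi.h>

typedef struct { float value; int class_label; int rid; } AttrRecord;

/* rec[] holds all attribute records at the root; local[] receives this
 * processor's share of SndCnt = index / size records. */
void scatter_records(const AttrRecord *rec, AttrRecord *local,
                     int index, int size, MPI_Datatype rec_type) {
    int SndCnt = index / size;
    MPI_Scatter(rec,   SndCnt, rec_type,
                local, SndCnt, rec_type,
                0, MPI_COMM_WORLD);
}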
18. Decision Tree Algorithms in the Parallel Setting
- The count matrix for categorical attributes and the split points for continuous attributes are found in parallel.
- For categorical attributes:
- - A structure is used to store the count matrix.
- - For each attribute, the count matrix is calculated locally at each processor.
- - The local count matrices are combined with MPI_Reduce at the root processor to obtain the total count matrix for the categorical attributes (sketched below).
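A minimal sketch of the reduction, assuming a 2 x 2 count matrix stored contiguously; sizes and names are illustrative.

#include <mpi.h>

#define NUM_VALUES  2
#define NUM_CLASSES 2

/* Each processor passes its local count matrix; the element-wise sums
 * arrive in global[][] at the root (rank 0). */
void reduce_count_matrix(const int local[NUM_VALUES][NUM_CLASSES],
                         int global[NUM_VALUES][NUM_CLASSES]) {
    MPI_Reduce(local, global, NUM_VALUES * NUM_CLASSES,
               MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
}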
19. Decision Tree Algorithms in the Parallel Setting
[Figure: local count matrices for Occupation (Manager, Clerk) at processors P0, P1 and P2]
20. Decision Tree Algorithms in the Parallel Setting
[Figure: the local count matrices at P0, P1 and P2 are combined with MPI_Reduce into the count matrix at the root processor]
21. Decision Tree Algorithms in the Parallel Setting
- For continuous attributes:
- - A structure is used to store the Cabove and Cbelow values.
- - At each split point, Cabove and Cbelow are calculated and their values are passed on to the next processor (sketched below).
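A minimal sketch of the histogram hand-off between neighbouring processors; the rank layout, datatype choice and names are illustrative assumptions.

#include <mpi.h>

#define NUM_CLASSES 2

typedef struct { int cbelow[NUM_CLASSES]; int cabove[NUM_CLASSES]; } ClassHist;

/* After scanning its portion of the sorted attribute list, each rank
 * forwards its Cbelow/Cabove to rank+1, which uses them as the starting
 * histograms for its own split points. */
void pass_histograms(ClassHist *h, int rank, int nprocs) {
    if (rank > 0)               /* receive running totals from the left */
        MPI_Recv(h, sizeof(*h), MPI_BYTE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... add this rank's local class counts to h here ... */
    if (rank < nprocs - 1)      /* forward running totals to the right  */
        MPI_Send(h, sizeof(*h), MPI_BYTE, rank + 1, 0, MPI_COMM_WORLD);
}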
22. Decision Tree Algorithms in the Parallel Setting
[Figure: split-point positions 0 to 3 of a continuous attribute, showing the hand-off of the class histograms between processors P0, P1 and P2]
23. Decision Tree Algorithms in the Parallel Setting
- Gini index for continuous attributes:
- - The Gini index is calculated at each split point.
- - If a processor has more than one split point, the minimum among them is found and sent to the next processor.
- - Finally, the root processor receives the minimum Gini index of the continuous attributes (sketched below).
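A minimal sketch of forwarding the running minimum from rank to rank until the root holds the global minimum; treating rank 0 as the root and closing the chain with one extra message are assumptions based on the slide's description.

#include <mpi.h>

/* Forward the running minimum Gini value along the ranks; the last rank
 * sends the final minimum back to rank 0 (the root). */
double forward_min_gini(double local_min, int rank, int nprocs) {
    double running = local_min;
    if (rank > 0) {
        double prev;
        MPI_Recv(&prev, 1, MPI_DOUBLE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (prev < running) running = prev;
    }
    if (rank < nprocs - 1)
        MPI_Send(&running, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    else if (rank != 0)
        MPI_Send(&running, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    if (rank == 0 && nprocs > 1)
        MPI_Recv(&running, 1, MPI_DOUBLE, nprocs - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return running;  /* meaningful at rank 0 */
}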
24. Decision Tree Algorithms in the Parallel Setting
- Gini index for categorical attributes:
- - The Gini index is calculated at processor 0 from the global count matrix.
- - The minimum among these values is the Gini index for that attribute.
- - The minimum Gini index is then found across all attributes.
[Figure: Gini values computed at P0 from the count matrix for a categorical attribute]
25. Synchronous Tree Construction Approach
- Does not require any data movement.
- Class distribution information is collected locally at each processor for every attribute.
- A global reduction is used to exchange the class distribution information (see the sketch below).
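A minimal sketch of the global reduction, here using MPI_Allreduce so every processor ends up with the combined class counts; the choice of Allreduce and the names are assumptions.

#include <mpi.h>

#define NUM_CLASSES 2

/* Sum the per-class counts gathered locally for one attribute so that
 * every processor sees the same global class distribution. */
void exchange_class_distribution(const int local[NUM_CLASSES],
                                 int global[NUM_CLASSES]) {
    MPI_Allreduce(local, global, NUM_CLASSES,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}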
26. Performance Evaluation
- Accuracy of the decision tree classifier:
- - number of examples classified correctly / total number of examples.
- Speedup:
- - speedup vs. number of processors for a fixed problem size.
- Scaleup:
- - response time vs. number of processors, with the problem size per processor held fixed.
- Sizeup:
- - response time vs. problem size for a fixed processor configuration.
27. Conclusion
- The design of SPRINT allows it to be easily parallelized.
- SPRINT removes all memory restrictions and is computationally efficient.
- It handles large datasets very well.
28. References
- John Shafer, Rakesh Agrawal, Manish Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining", Proceedings of the 22nd International Conference on Very Large Databases (VLDB), India, September 1996. http://www.dsi.unive.it/dm/shafer96sprint.pdf
- Rick Kufrin, "Decision Trees on Parallel Processors", National Center for Supercomputing Applications, Third Workshop on Parallel Processing for Artificial Intelligence (PPAI-95), Montreal, Canada, August 1995. http://mycroft.ncsa.uiuc.edu/www-0/papers/ppai-95.ps.z
- Manish Mehta, Rakesh Agrawal, Jorma Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", 1996, IBM Almaden Research Center. http://citeseer.nj.nec.com/cache/papers/cs/4347/httpzSzzSzwww.almaden.ibm.comzSzuzSzragrawalzSzpaperszSzedbt96_sliq.pdf/mehta96sliq.pdf
- Domenico Talia, "High Performance Data Mining" (talk), Euro-Par 2002 Parallel Processing, 8th International Euro-Par Conference, Paderborn, Germany, August 2002, Proceedings, 45 pages.
29. References (contd.)
- Srivastava, E. Han, V. Kumar, V. Singh, "Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers". http://gunther.smeal.psu.edu/papers/E-Commerce/544/httpzSzzSzwww-users.cs.umn.eduzSzkumarzSzpaperszSzclasspar-ipps.pdf/dynamic-load-balancing-of.pdf
- Mahesh V. Joshi, George Karypis, Vipin Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets". http://ipdps.eece.unm.edu/1998/papers/292.pdf
- Srivastava, E. Han, V. Kumar, V. Singh, "An Efficient Scalable Parallel Classifier for Data Mining". http://citeseer.nj.nec.com/cache/papers/cs/930/ftpzSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzclass-paper.pdf/an-efficient-scalable-parallel.pdf
- Quinlan, J. R., "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, CA, 1992.