Title: Decision Tree Algorithms in the Parallel Setting (SPRINT)
1. Decision Tree Algorithms in the Parallel Setting (SPRINT)
- Submitted by
- Prasad Valapet B00343912
- Soumen Sengupta B00191069
- Probir Ghosh B00350646
2. Agenda
- Introduction
- Problem Defined
- Objective
- Decision Tree Overview
- Data Structures
- Serial SPRINT
- Calculating Gini Index
- Parallelizing SPRINT
- Synchronous Tree Construction Approach
- Performance Evaluation
- Conclusion
- References
3. Classification
- Classification is a data mining technique used to build a model that assigns each record in a training set to a particular class.
- Training examples consist of:
- - Categorical attributes
- - Continuous attributes
- - A class attribute (class label)
[Figure: sample training set showing categorical attributes, continuous attributes and the class label]
4. Problem Defined
- Decision tree algorithms such as ID3 and C4.5 assume that the training examples fit in memory.
- For large datasets, decision tree construction becomes inefficient if the data has to be swapped in and out of memory.
- Sampling of the data is often used so that it fits into memory.
5. Problem Defined (contd.)
- Other approaches partition the data into subsets (that fit into memory) and then construct a decision tree by combining the classifiers built from each subset.
- A more accurate decision tree classifier can be obtained if the entire training set is used.
6. Objective
- To implement a decision tree based classification algorithm (SPRINT) that addresses the scalability and efficiency issues of dealing with large datasets.
- To investigate the speedup, sizeup and scaleup characteristics of SPRINT.
7. Decision Tree: An Operational Overview
- Build phase (train the model):
- - Select a splitting criterion (information gain, gain ratio, Gini index, chi-square test).
- - Apply recursive partitioning until all training examples are classified or the attributes are exhausted.
- Prune phase (handle overfitting of the data):
- - Prune the tree.
- Test phase:
- - Test the model.
[Figure: example decision tree splitting on Age (>28 / <28), then on Occupation (Private / Public) and Salary (<50K / >50K)]
8. Data Structures
- Attribute lists, each entry holding:
- - Attribute value
- - Class label
- - Record id (rid)
- An attribute list is created for each attribute.
- The entries in an attribute list are called attribute records (see the sketch below).
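A minimal C sketch of an attribute record; the field and type names are illustrative assumptions, not taken from the source.

typedef struct {
    float value;        /* attribute value, e.g. Age                 */
    int   class_label;  /* index of the class this record belongs to */
    int   rid;          /* record id linking entries across lists    */
} AttrRecord;

/* One attribute list per attribute, e.g.: AttrRecord age_list[N]; */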
9. Histograms
- Two class histograms are used to store the class distribution for continuous attributes (see the sketch below):
- - Cbelow contains the class distribution of the records that have already been processed.
- - Cabove contains the class distribution of the records that have not yet been processed.
[Figure: class histograms at cursor position 2]
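A minimal sketch of the two class histograms in C, assuming NUM_CLASSES classes; all names are illustrative.

#define NUM_CLASSES 2

typedef struct {
    int cbelow[NUM_CLASSES];  /* classes of records already processed */
    int cabove[NUM_CLASSES];  /* classes of records not yet processed */
} ClassHist;

/* Advancing the cursor past one attribute record moves its class
 * from Cabove to Cbelow. */
void advance_cursor(ClassHist *h, int class_label) {
    h->cabove[class_label]--;
    h->cbelow[class_label]++;
}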
10. Count Matrix
- The count matrix stores the class distribution for each value of a categorical attribute.
[Figure: count matrix for Occupation]
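A count matrix can be kept as a small 2-D array, one row per attribute value and one column per class. A sketch with illustrative sizes:

#define NUM_VALUES  2   /* e.g. two values of Occupation */
#define NUM_CLASSES 2

int count_matrix[NUM_VALUES][NUM_CLASSES] = {0};

/* For every attribute record, bump the cell for its value/class pair. */
void count_record(int value_idx, int class_label) {
    count_matrix[value_idx][class_label]++;
}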
11. SPRINT (serial algorithm)
Partition(Data S)
  if (all points in S are of the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A
  use the best split point found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
12. Evaluating split points for all attributes
- Gini Index:
- Gini(S) = 1 - Σ_{j=1..k} p_j^2
- where k is the total number of classes,
- p_j is the relative frequency of class j in S,
- and |S| is the size of the training set.
- For a pure sample S containing a single class, Gini(S) = 0.
13. Evaluating split points for all attributes (contd.)
- Gini_split = (n1/n) Gini(S1) + (n2/n) Gini(S2)
- where n1 is the number of records in subset S1,
- n2 is the number of records in subset S2,
- and n is the total number of records.
- The attribute with the minimum Gini index is chosen to split the node (a sketch of both formulas follows).
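A minimal C sketch of both formulas; the array sizes and names are illustrative assumptions.

#define NUM_CLASSES 2

/* Gini(S) = 1 - sum_j p_j^2, computed from per-class counts. */
double gini(const int counts[NUM_CLASSES], int n) {
    if (n == 0) return 0.0;
    double g = 1.0;
    for (int j = 0; j < NUM_CLASSES; j++) {
        double p = (double)counts[j] / n;
        g -= p * p;
    }
    return g;
}

/* Gini_split = (n1/n) * Gini(S1) + (n2/n) * Gini(S2). */
double gini_split(const int below[NUM_CLASSES], int n1,
                  const int above[NUM_CLASSES], int n2) {
    int n = n1 + n2;
    if (n == 0) return 0.0;
    return ((double)n1 / n) * gini(below, n1)
         + ((double)n2 / n) * gini(above, n2);
}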
14. SPRINT (serial algorithm, contd.)
- Uses attribute lists as the data structure.
- The attribute lists for continuous attributes are sorted.
- The attribute lists are split among the tree nodes.
- The tree is grown in a breadth-first manner.
- The record ids of the splitting attribute are hashed together with the tree node they move to.
- The remaining attribute lists can then be split by looking up the hash table (sketched below).
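A minimal sketch of the rid-to-child mapping, assuming record ids are dense so a plain array can stand in for the hash table; the names and bound are illustrative.

#define MAX_RECORDS 1000000     /* illustrative upper bound */

/* child[rid] records which child node (e.g. 1 or 2) a record moved to
 * when the winning attribute list was split. */
int child[MAX_RECORDS];

/* While splitting a non-winning attribute list, each record looks up
 * where its rid went. */
int destination_node(int rid) {
    return child[rid];
}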
15. Partitioning of Records
[Figure: the attribute list for node 0 is split on Age (<29.5 / >=29.5) into attribute lists for child nodes 1 and 2]
16. Decision Tree Algorithms in the Parallel Setting
- Parallelizing the construction of the decision tree:
- Data is read from the text file serially.
- Attribute lists are created in parallel.
- The count matrix for categorical attributes and the split points for continuous attributes are calculated.
- The Gini index for continuous and categorical attributes is found in parallel.
- Based on the Gini index, each node of the decision tree is identified and the tree is constructed in parallel.
17. Decision Tree Algorithms in the Parallel Setting
- Data is read from the text file serially:
- - The data is read into a structure.
- Attribute lists are created in parallel:
- - The data is scattered to all the processors using MPI_Scatter (see the sketch below).
- - SndCnt = index / size (the number of records divided by the number of processors).
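A minimal sketch of the scatter step, assuming index is the total number of records, size is the number of MPI processes, index divides evenly, and rec_type is an MPI datatype describing the attribute record (e.g. built with MPI_Type_create_struct); all names are illustrative.

#include <mpi.h>

typedef struct { float value; int class_label; int rid; } AttrRecord;

/* rec[] holds all attribute records at the root; local[] receives this
 * processor's share of SndCnt = index / size records. */
void scatter_records(const AttrRecord *rec, AttrRecord *local,
                     int index, int size, MPI_Datatype rec_type) {
    int SndCnt = index / size;
    MPI_Scatter(rec,   SndCnt, rec_type,
                local, SndCnt, rec_type,
                0, MPI_COMM_WORLD);
}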
18. Decision Tree Algorithms in the Parallel Setting
- The count matrix for categorical attributes and the split points for continuous attributes are found in parallel.
- For categorical attributes:
- - A structure is used to store the count matrix.
- - For each attribute, the count matrix is calculated locally at each processor.
- - The local count matrices are combined with MPI_Reduce at the root processor to obtain the total count matrix for the categorical attributes (sketched below).
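A minimal sketch of the reduction, assuming a 2 x 2 count matrix stored contiguously; sizes and names are illustrative.

#include <mpi.h>

#define NUM_VALUES  2
#define NUM_CLASSES 2

/* Each processor passes its local count matrix; the element-wise sums
 * arrive in global[][] at the root (rank 0). */
void reduce_count_matrix(const int local[NUM_VALUES][NUM_CLASSES],
                         int global[NUM_VALUES][NUM_CLASSES]) {
    MPI_Reduce(local, global, NUM_VALUES * NUM_CLASSES,
               MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
}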
19. Decision Tree Algorithms in the Parallel Setting
[Figure: local count matrices for Occupation (Manager, Clerk) at processors P0, P1 and P2]
20. Decision Tree Algorithms in the Parallel Setting
[Figure: the local count matrices at P0, P1 and P2 are combined with MPI_Reduce into the count matrix at the root processor]
21. Decision Tree Algorithms in the Parallel Setting
- For continuous attributes:
- - A structure is used to store the Cabove and Cbelow values.
- - At each split point, Cabove and Cbelow are calculated and their values are passed on to the next processor (sketched below).
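A minimal sketch of the histogram hand-off between neighbouring processors; the rank layout, datatype choice and names are illustrative assumptions.

#include <mpi.h>

#define NUM_CLASSES 2

typedef struct { int cbelow[NUM_CLASSES]; int cabove[NUM_CLASSES]; } ClassHist;

/* After scanning its portion of the sorted attribute list, each rank
 * forwards its Cbelow/Cabove to rank+1, which uses them as the starting
 * histograms for its own split points. */
void pass_histograms(ClassHist *h, int rank, int nprocs) {
    if (rank > 0)               /* receive running totals from the left */
        MPI_Recv(h, sizeof(*h), MPI_BYTE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... add this rank's local class counts to h here ... */
    if (rank < nprocs - 1)      /* forward running totals to the right  */
        MPI_Send(h, sizeof(*h), MPI_BYTE, rank + 1, 0, MPI_COMM_WORLD);
}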
22. Decision Tree Algorithms in the Parallel Setting
[Figure: split-point positions 0 to 3 of a continuous attribute, showing the hand-off of the class histograms between processors P0, P1 and P2]
23. Decision Tree Algorithms in the Parallel Setting
- Gini index for continuous attributes:
- - The Gini index is calculated at each split point.
- - If a processor has more than one split point, the minimum among them is found and sent to the next processor.
- - Finally, the root processor receives the minimum Gini index of the continuous attributes (sketched below).
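A minimal sketch of forwarding the running minimum from rank to rank until the root holds the global minimum; treating rank 0 as the root and closing the chain with one extra message are assumptions based on the slide's description.

#include <mpi.h>

/* Forward the running minimum Gini value along the ranks; the last rank
 * sends the final minimum back to rank 0 (the root). */
double forward_min_gini(double local_min, int rank, int nprocs) {
    double running = local_min;
    if (rank > 0) {
        double prev;
        MPI_Recv(&prev, 1, MPI_DOUBLE, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (prev < running) running = prev;
    }
    if (rank < nprocs - 1)
        MPI_Send(&running, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    else if (rank != 0)
        MPI_Send(&running, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    if (rank == 0 && nprocs > 1)
        MPI_Recv(&running, 1, MPI_DOUBLE, nprocs - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return running;  /* meaningful at rank 0 */
}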
24. Decision Tree Algorithms in the Parallel Setting
- Gini index for categorical attributes:
- - The Gini index is calculated at processor 0 from the global count matrix.
- - The minimum among these values is the Gini index for that attribute.
- - The minimum Gini index is then found across all attributes.
[Figure: Gini values computed at P0 from the count matrix for a categorical attribute]
25. Synchronous Tree Construction Approach
- Does not require any data movement.
- Class distribution information is collected locally at each processor for every attribute.
- A global reduction is used to exchange the class distribution information (see the sketch below).
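A minimal sketch of the global reduction, here using MPI_Allreduce so every processor ends up with the combined class counts; the choice of Allreduce and the names are assumptions.

#include <mpi.h>

#define NUM_CLASSES 2

/* Sum the per-class counts gathered locally for one attribute so that
 * every processor sees the same global class distribution. */
void exchange_class_distribution(const int local[NUM_CLASSES],
                                 int global[NUM_CLASSES]) {
    MPI_Allreduce(local, global, NUM_CLASSES,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}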
26. Performance Evaluation
- Accuracy of the decision tree classifier:
- - number of examples classified correctly / total number of examples.
- Speedup:
- - speedup vs. number of processors for a fixed problem size.
- Scaleup:
- - response time vs. number of processors, with the problem size per processor held fixed.
- Sizeup:
- - response time vs. problem size for a fixed processor configuration.
27. Conclusion
- The design of SPRINT allows it to be easily parallelized.
- SPRINT removes all memory restrictions and is computationally efficient.
- It handles large datasets very well.
28. References
- John Shafer, Rakesh Agrawal, Manish Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining", Proceedings of the 22nd International Conference on Very Large Databases (VLDB), India, September 1996. http://www.dsi.unive.it/dm/shafer96sprint.pdf
- Rick Kufrin, "Decision Trees on Parallel Processors", National Center for Supercomputing Applications, Third Workshop on Parallel Processing for Artificial Intelligence (PPAI-95), Montreal, Canada, August 1995. http://mycroft.ncsa.uiuc.edu/www-0/papers/ppai-95.ps.z
- Manish Mehta, Rakesh Agrawal, Jorma Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", 1996, IBM Almaden Research Center. http://citeseer.nj.nec.com/cache/papers/cs/4347/httpzSzzSzwww.almaden.ibm.comzSzuzSzragrawalzSzpaperszSzedbt96_sliq.pdf/mehta96sliq.pdf
- Domenico Talia, "High Performance Data Mining" (talk), Euro-Par 2002 Parallel Processing, 8th International Euro-Par Conference, Paderborn, Germany, August 2002, Proceedings, 45 pages.
29. References (contd.)
- Srivastava, E. Han, V. Kumar, V. Singh, "Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers". http://gunther.smeal.psu.edu/papers/E-Commerce/544/httpzSzzSzwww-users.cs.umn.eduzSzkumarzSzpaperszSzclasspar-ipps.pdf/dynamic-load-balancing-of.pdf
- Mahesh V. Joshi, George Karypis, Vipin Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets". http://ipdps.eece.unm.edu/1998/papers/292.pdf
- Srivastava, E. Han, V. Kumar, V. Singh, "An Efficient Scalable Parallel Classifier for Data Mining". http://citeseer.nj.nec.com/cache/papers/cs/930/ftpzSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzclass-paper.pdf/an-efficient-scalable-parallel.pdf
- Quinlan, J. R., "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, CA, 1992.