Decision Tree Algorithms in the Parallel Setting (SPRINT)

1
Decision Tree Algorithms in the Parallel Setting
(SPRINT)
  • Submitted by
  • Prasad Valapet B00343912
  • Soumen Sengupta B00191069
  • Probir Ghosh B00350646

2
Agenda
  • Introduction
  • Problem Defined
  • Objective
  • Decision Tree Overview
  • Data Structures
  • Serial SPRINT
  • Calculating Gini Index
  • Parallelizing SPRINT
  • Synchronous Tree Construction Approach
  • Performance Evaluation
  • Conclusion
  • References

3
Classification
  • Classification is a data mining technique used to build, from a training set, a model that represents the records belonging to each class
  • Training Examples
  • Categorical Attributes
  • Continuous Attributes
  • Class Attribute

Figure: example training set with categorical attributes, continuous attributes, and a class label
4
Problem Defined
  • Decision tree algorithms like ID3 and C4.5 assume that the training examples fit into memory
  • For large datasets, decision tree construction can be inefficient if the data have to be swapped in and out of memory
  • Sampling of the data is sometimes used so that it fits into memory

5
Problem Defined (contd.)
  • Other approaches partition the data into subsets that fit into memory; a decision tree is then constructed by combining the classifiers built from each subset
  • A more accurate decision tree classifier can be obtained if the entire training set is used

6
Objective
  • To implement a decision tree based classification algorithm (SPRINT) that addresses the scalability and efficiency issues in dealing with large datasets
  • To investigate the speedup, sizeup and scaleup characteristics of SPRINT

7
Decision Tree: An Operational Overview
  • Select a splitting criterion (information gain, gain ratio, Gini index, chi-square test)
  • Apply recursive partitioning until all the training examples are classified or the attributes are exhausted
  • Prune the tree
  • Test the tree

Figure: build phase (train the model) and prune phase (handle overfitting, then test the model), with an example tree splitting on Age (> 28 / < 28), Occupation (Private / Public) and Salary (< 50K / > 50K)
8
Data Structures
  • Attribute lists, whose entries contain:
  •   - Attribute value
  •   - Class label
  •   - Record id
  • An attribute list is created for each attribute
  • The entries in an attribute list are called attribute records (see the sketch below)
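A minimal C sketch of what an attribute record could look like; the field names and types are illustrative assumptions, not taken from the SPRINT source.

    /* Illustrative attribute-record layouts (field names are assumptions). */
    typedef struct {
        double value;      /* attribute value of a continuous attribute  */
        int    class_lbl;  /* class label of the record                  */
        long   rid;        /* record id, used to keep the lists in sync  */
    } cont_attr_record;

    typedef struct {
        char   value[32];  /* attribute value of a categorical attribute */
        int    class_lbl;
        long   rid;
    } cat_attr_record;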

9
Histograms
  • Two class histograms, Cbelow and Cabove, are used to store the class distribution for a continuous attribute
  • Cbelow contains the class distribution of the records that have already been processed
  • Cabove contains the class distribution of the records that have not yet been processed

Figure: class histograms at cursor position 2
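A hedged C sketch of how Cbelow and Cabove could be maintained while scanning a sorted continuous attribute list; the two-class assumption and the names are illustrative.

    #define NUM_CLASSES 2            /* assumption: two classes, as in the examples */

    typedef struct {
        long below[NUM_CLASSES];     /* Cbelow: classes of records already processed */
        long above[NUM_CLASSES];     /* Cabove: classes of records not yet processed */
    } class_hist;

    /* Advance the cursor by one record: move its class from Cabove to Cbelow. */
    static void hist_advance(class_hist *h, int class_lbl) {
        h->above[class_lbl]--;
        h->below[class_lbl]++;
    }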
10
Count Matrix
  • The count matrix stores the class distribution of
    each value of a categorical attribute.

Figure: count matrix for the Occupation attribute
11
SPRINT (serial algorithm)

    Partition(Data S)
        if all points in S are of the same class then
            return
        for each attribute A do
            evaluate splits on attribute A
        use the best split point found to partition S into S1 and S2
        Partition(S1)
        Partition(S2)

12
Evaluating split points for all attributes
  • Gini Index:
      Gini(S) = 1 - Σ_{j=1..k} p_j^2
  • where k is the total number of classes
  • p_j is the relative frequency of class j in S
  • S is the training data set at the node
  • For a pure data sample S containing a single class, Gini(S) = 0

13
Evaluating split points for all attributes (contd.)
  • Gini_split = (n1/n) · Gini(S1) + (n2/n) · Gini(S2)
  • where n1 is the number of records in subset S1
  • n2 is the number of records in subset S2
  • n is the total number of records (n = n1 + n2)
  • The attribute whose best split point gives the minimum Gini index is chosen to split the node
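A small C sketch of the two formulas above, computed from per-class counts; function and variable names are illustrative.

    /* Gini(S) = 1 - sum_j p_j^2, from the class counts of a set of n records. */
    double gini(const long *counts, int k, long n) {
        if (n == 0) return 0.0;
        double g = 1.0;
        for (int j = 0; j < k; j++) {
            double p = (double)counts[j] / (double)n;
            g -= p * p;
        }
        return g;
    }

    /* Gini_split = (n1/n)*Gini(S1) + (n2/n)*Gini(S2) for a candidate split. */
    double gini_split(const long *c1, const long *c2, int k) {
        long n1 = 0, n2 = 0;
        for (int j = 0; j < k; j++) { n1 += c1[j]; n2 += c2[j]; }
        long n = n1 + n2;
        if (n == 0) return 0.0;
        return ((double)n1 / n) * gini(c1, k, n1)
             + ((double)n2 / n) * gini(c2, k, n2);
    }

For a continuous attribute, c1 and c2 would be the Cbelow and Cabove counts at the cursor position; for a categorical attribute they would come from the rows of the count matrix on either side of the candidate split.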

14
  • Uses attribute lists as the data structure
  • The attribute lists for continuous attributes are sorted
  • The attribute lists are split among the tree nodes
  • The tree is grown in a breadth-first manner
  • The record ids of the splitting attribute's list are hashed, with the child tree node as the value
  • The remaining attribute lists can then be split by looking up the hash table

15
Partitioning of Records
Figure: the attribute list for node 0 is split on Age (< 29.5 / ≥ 29.5) into attribute lists for node 1 and node 2
16
Decision Tree Algorithms in the Parallel Setting
  • Parallelizing the construction of the decision tree:
  • Data are read from the text file serially
  • Attribute lists are created in parallel
  • The count matrix for categorical attributes and the split points for continuous attributes are calculated
  • The Gini indices for continuous and categorical attributes are found in parallel
  • Based on the Gini index, the split at each decision tree node is identified and the tree is constructed in parallel

17
Decision Tree Algorithms in the Parallel Setting
  • Data are read from the text file serially
  • We read the data into a structure
  • Attribute lists are created in parallel
  • Data are scattered to all the processors using MPI_Scatter
  • SndCnt = index / size, i.e. the number of records sent to each processor (see the sketch below)
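A hedged sketch of the scatter step, assuming the attribute records are packed into a flat array of doubles and that the record count divides evenly by the number of processors; the names index and SndCnt follow the slide, the rest is illustrative.

    #include <mpi.h>

    /* Scatter 'index' packed records (3 doubles each: value, class label, rid)
       from the root processor to all processors. */
    void scatter_records(double *all, double *local, long index, MPI_Comm comm) {
        int size;
        MPI_Comm_size(comm, &size);

        int SndCnt = (int)(index / size);          /* records per processor */
        MPI_Scatter(all,   SndCnt * 3, MPI_DOUBLE,
                    local, SndCnt * 3, MPI_DOUBLE,
                    0, comm);
    }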

18
Decision Tree Algorithms in the Parallel Setting
  • The count matrix for categorical attributes and the split points for continuous attributes are found in parallel
  • For categorical attributes:
  • A structure is used to store the count matrix
  • For each attribute, the count matrix is calculated locally at each processor
  • The local count matrices are reduced with MPI_Reduce at the root processor to get the total count matrix for the categorical attributes (see the sketch below)
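A hedged sketch of that reduction, assuming a small fixed-size count matrix (attribute values × classes); the dimensions and names are illustrative.

    #include <mpi.h>

    #define NUM_VALUES  2   /* e.g. Manager, Clerk (illustrative)  */
    #define NUM_CLASSES 2   /* number of class labels (assumption) */

    /* Element-wise sum of the local count matrices at the root processor. */
    void reduce_count_matrix(long local[NUM_VALUES][NUM_CLASSES],
                             long global[NUM_VALUES][NUM_CLASSES],
                             MPI_Comm comm) {
        MPI_Reduce(local, global, NUM_VALUES * NUM_CLASSES,
                   MPI_LONG, MPI_SUM, 0, comm);
    }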

19
Decision Tree Algorithms in the Parallel Setting
Figure: local count matrices (rows for Manager and Clerk) at processors P0, P1 and P2
20
Decision Tree Algorithms in the Parallel Setting
Figure: the local count matrices at P0, P1 and P2 are combined with MPI_Reduce into the count matrix at the root processor
21
Decision Tree Algorithms in the Parallel Setting
  • For continuous attributes:
  • A structure is used to store the Cabove and Cbelow values
  • At each split point, Cabove and Cbelow are calculated and their values are passed on to the next processor (see the sketch below)
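A hedged sketch of the hand-off the slide describes: each processor receives the running class counts from its left neighbour, scans its own portion of the sorted attribute list, and forwards the updated counts to its right neighbour. The two-class assumption and the names are illustrative; candidate split points would be evaluated inside the scan loop.

    #include <mpi.h>

    #define NUM_CLASSES 2

    /* Receive Cbelow from the previous processor, scan the local records,
       and forward the running counts to the next processor. */
    void pipeline_counts(long cbelow[NUM_CLASSES],
                         const int *local_labels, long nlocal, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank > 0)                       /* counts for everything to our left */
            MPI_Recv(cbelow, NUM_CLASSES, MPI_LONG, rank - 1, 0,
                     comm, MPI_STATUS_IGNORE);

        for (long i = 0; i < nlocal; i++)   /* split points evaluated in this loop */
            cbelow[local_labels[i]]++;

        if (rank < size - 1)
            MPI_Send(cbelow, NUM_CLASSES, MPI_LONG, rank + 1, 0, comm);
    }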

22
Decision Tree Algorithms in the Parallel Setting
Figure: split positions 0 through 3 of the sorted attribute list and the processors (P0, P1, P2) responsible for them
23
Decision Tree Algorithms in the Parallel Setting
  • Gini index for continuous attributes:
  • At each split point the Gini index is calculated
  • If there is more than one split point at a processor, the minimum among them is found and sent to the next processor
  • Finally, the root processor receives the minimum Gini index of the continuous attributes (an equivalent single-reduction sketch follows below)
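The slides describe forwarding the local minimum from processor to processor; an equivalent way to give the root the global minimum in one step is a reduction with MPI_MINLOC, sketched here with illustrative names.

    #include <mpi.h>

    /* Reduce the local minimum Gini values so rank 0 learns the global
       minimum and the rank that owns the corresponding split point. */
    void reduce_min_gini(double local_min_gini, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        struct { double val; int rank; } in, out;
        in.val  = local_min_gini;
        in.rank = rank;

        MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, 0, comm);
        /* On rank 0: out.val is the minimum Gini, out.rank its owner. */
    }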

24
Decision Tree Algorithms in the Parallel Setting
  • Gini index for categorical attributes

  • The Gini index of each candidate split is calculated at processor 0 from the count matrix
  • The minimum among them is the Gini index for that attribute

Figure: count matrix for the categorical attribute at P0, with a Gini value (Gini 1, Gini 2, ...) computed for each candidate split
The minimum Gini index is then found among the attributes.
25
Synchronous Tree Construction Approach
  • Does not require any data movement
  • Class distribution information is collected at the local processor for each attribute
  • A global reduction is used to exchange the class distribution information
26
Performance Evaluation
  • Accuracy of the decision tree classifier
  •   - # of examples classified correctly / total # of examples
  • Speedup
  •   - speedup vs. # of processors for a fixed problem size
  • Scaleup
  •   - response time vs. # of processors with the problem size per processor held fixed
  • Sizeup
  •   - response time vs. problem size for a fixed configuration of processors
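For reference, the usual formulations of these metrics (standard definitions, not taken from the slides; T(D, p) denotes the response time on p processors for dataset D):

    Accuracy   = (# examples classified correctly) / (total # of examples)
    Speedup(p) = T(D, 1) / T(D, p)          (dataset fixed, processors vary)
    Sizeup(m)  = T(m·D, p) / T(D, p)        (processors fixed, dataset grows m-fold)
    Scaleup(p) = T(D, 1) / T(p·D, p)        (dataset and processors grow together)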

27
Conclusion
  • The design of SPRINT allows it to be easily parallelized
  • SPRINT removes the memory restrictions of earlier algorithms, is computationally efficient, and handles large datasets well

28
References
  • John Shafer, Rakesh Agrawal, Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proceedings of the 22nd International Conference on Very Large Databases (VLDB), India, September 1996.
    http://www.dsi.unive.it/dm/shafer96sprint.pdf
  • Rick Kufrin, Decision Trees on Parallel Processors, National Center for Supercomputing Applications, Third Workshop on Parallel Processing for Artificial Intelligence (PPAI-95), Montreal, Canada, August 1995.
    http://mycroft.ncsa.uiuc.edu/www-0/papers/ppai-95.ps.z
  • Manish Mehta, Rakesh Agrawal, Jorma Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, 1996, IBM Almaden Research Center.
    http://citeseer.nj.nec.com/cache/papers/cs/4347/httpzSzzSzwww.almaden.ibm.comzSzuzSzragrawalzSzpaperszSzedbt96_sliq.pdf/mehta96sliq.pdf
  • Domenico Talia, High Performance Data Mining (talk), Euro-Par 2002 Parallel Processing, 8th International Euro-Par Conference, Paderborn, Germany, August 2002, Proceedings.

29
References (contd.)
  • Srivastava, E. Han, V. Kumar, V. Singh, Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers.
    http://gunther.smeal.psu.edu/papers/E-Commerce/544/httpzSzzSzwww-users.cs.umn.eduzSzkumarzSzpaperszSzclasspar-ipps.pdf/dynamic-load-balancing-of.pdf
  • Mahesh V. Joshi, George Karypis, Vipin Kumar, ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets.
    http://ipdps.eece.unm.edu/1998/papers/292.pdf
  • Srivastava, E. Han, V. Kumar, V. Singh, An Efficient Scalable Parallel Classifier for Data Mining.
    http://citeseer.nj.nec.com/cache/papers/cs/930/ftpzSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzclass-paper.pdf/an-efficient-scalable-parallel.pdf
  • Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1992.

30
  • Thank You