Title: SPRINT: A Scalable Parallel Classifier for Data Mining
- IBM Almaden Research Center, 1996
- ????? ? ?????
- ????3?? ???
1. Introduction (1/2)
- Classification research has recently focused on
  algorithms that can handle large databases.
- In classification, we are given
  - A set of example records, called a training set
  - Each record consists of several fields, or attributes
  - Attributes are either continuous (coming from an ordered
    domain) or categorical (coming from an unordered domain)
  - One of the attributes is called the classifying
    attribute
- Objective of classification
  - To build a model of the classifying attribute based upon
    the other attributes
1. Introduction (2/2)
- Decision trees are well suited for data mining.
- The focus is therefore on building a scalable and
  parallelizable decision-tree classifier.
- This paper presents a decision-tree-based classification
  algorithm called SPRINT
  - Removes all of the memory restrictions
  - Fast and scalable
  - Easily parallelized
2. Serial Algorithm (1/2)
- A decision-tree classifier is built in two phases
  - Growth phase
    - The tree is built by recursively partitioning the data
      until each partition is either pure or sufficiently
      small (figure 2; sketched below)
  - Prune phase
    - Pruning requires access only to the fully grown
      decision tree
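- A minimal Python sketch of this recursive growth phase
  (the names and the MIN_SIZE threshold are illustrative,
  not from the paper):

```python
from collections import Counter

MIN_SIZE = 10  # assumed threshold for "sufficiently small"

def grow(records, split_finder):
    """Growth phase: recursively partition `records` until each
    partition is pure or sufficiently small.

    `records` is a list of (attributes_dict, class_label) pairs;
    `split_finder` returns (test_fn, description) or None.
    """
    labels = [label for _, label in records]
    if len(set(labels)) <= 1 or len(records) <= MIN_SIZE:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    split = split_finder(records)
    if split is None:  # no useful split found: stop growing
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    test, description = split
    left = [r for r in records if test(r[0])]
    right = [r for r in records if not test(r[0])]
    return {"split": description,
            "left": grow(left, split_finder),
            "right": grow(right, split_finder)}
```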
2. Serial Algorithm (2/2)
- Two major issues have critical performance implications in
  the tree-growth phase
  - How to find split points that define node tests
  - Having chosen a split point, how to partition the data
- SPRINT addresses these two issues differently from
  previous algorithms
  - It has no restriction on the size of the input
  - Yet it is a fast algorithm
  - It shares with SLIQ the advantage of a one-time sort,
    but uses different data structures
2.1 Data Structures (1/4)
- Attribute lists
  - SPRINT initially creates an attribute list for each
    attribute in the data; each entry holds an attribute
    value, a class label, and a record id (rid)
    (figure 3; see the sketch below)
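- A sketch of how such lists could be built (the entry
  layout follows figure 3; everything else is illustrative):

```python
def make_attribute_lists(records, continuous):
    """One attribute list per attribute; each entry is
    (attribute value, class label, rid), as in figure 3.
    Lists for continuous attributes get their one-time sort here."""
    lists = {}
    for attr in records[0][0]:
        entries = [(attrs[attr], label, rid)
                   for rid, (attrs, label) in enumerate(records)]
        if attr in continuous:
            entries.sort(key=lambda e: e[0])   # one-time sort
        lists[attr] = entries
    return lists

# Toy data in the spirit of figure 3 (values are made up):
records = [({"Age": 23, "CarType": "family"}, "High"),
           ({"Age": 17, "CarType": "sports"}, "High"),
           ({"Age": 43, "CarType": "sports"}, "High"),
           ({"Age": 68, "CarType": "family"}, "Low")]
lists = make_attribute_lists(records, continuous={"Age"})
```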
2.1 Data Structures (2/4)
- As the tree is grown and nodes are split to create new
  children, the attribute lists belonging to each node are
  partitioned and associated with the children (figure 4)
2.1 Data Structures (3/4)
- Histograms
  - For continuous attributes, two histograms are associated
    with each decision-tree node that is under consideration
    for splitting
    - Cbelow maintains the class distribution for attribute
      records that have already been processed
    - Cabove maintains it for those that have not
  - Categorical attributes also have a histogram associated
    with a node
    - However, only one histogram is needed, and it contains
      the class distribution for each value of the given
      attribute
    - We call this histogram a count matrix
      (see the sketch below)
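- These structures could be represented as simple
  class-count maps (a sketch; the paper does not prescribe
  this in-memory layout):

```python
from collections import Counter, defaultdict

def init_histograms(attribute_list):
    """Continuous attribute: Cbelow starts empty, Cabove starts
    with the class distribution of all of the node's records."""
    c_below = Counter()
    c_above = Counter(label for _, label, _ in attribute_list)
    return c_below, c_above

def count_matrix(attribute_list):
    """Categorical attribute: one histogram, class counts per value."""
    matrix = defaultdict(Counter)
    for value, label, _ in attribute_list:
        matrix[value][label] += 1
    return matrix
```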
2.1 Data Structures (4/4)
2.2 Finding split points (1/3)
- While growing the tree, the goal at each node is to
  determine the split point that best divides the training
  records belonging to that leaf
- This paper uses the gini index, based on the authors'
  experience with SLIQ
- For a data set S containing examples from n classes:
  - gini(S) = 1 - Σ pj², where pj is the relative frequency
    of class j in S
- If a split divides S into two subsets S1 and S2, the index
  of the divided data is:
  - gini_split(S) = (n1/n) gini(S1) + (n2/n) gini(S2)
- Advantage of this index: its calculation requires only the
  distribution of the class values in each of the partitions
  (sketched in code below)
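- A direct Python transcription of the two formulas above
  (the function names are ours):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_j pj^2, pj = relative frequency of class j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(s1, s2):
    """gini_split(S) = (n1/n) * gini(S1) + (n2/n) * gini(S2)."""
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

# A pure split scores 0; an even two-class mix scores 0.5.
print(gini_split(["High", "High"], ["Low", "Low"]))   # 0.0
print(gini(["High", "Low"]))                          # 0.5
```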
2.2 Finding split points (2/3)
- Continuous attributes
  - For continuous attributes, the candidate split points
    are the midpoints between every two consecutive
    attribute values in the training data
  - Histogram Cbelow is initialized to zeros, whereas Cabove
    is initialized with the class distribution of all the
    records for the node
    - For the root node, this distribution is obtained at
      the time of sorting
    - For other nodes, this distribution is obtained when
      the node is created
  - Attribute records are read one at a time, and Cbelow and
    Cabove are updated for each record read (figure 5)
  - Note that Cbelow and Cabove have all the information
    necessary to compute the gini index (see the sketch
    below)
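- A self-contained sketch of this one-pass evaluation
  (names are ours; the scan logic follows figure 5):

```python
from collections import Counter

def gini_from_counts(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attribute_list):
    """One pass over a sorted attribute list: move each record's class
    count from Cabove to Cbelow, and score the midpoint between every
    pair of distinct consecutive values."""
    c_below = Counter()
    c_above = Counter(label for _, label, _ in attribute_list)
    n = len(attribute_list)
    best_score, best_point = float("inf"), None
    for i, (value, label, _) in enumerate(attribute_list[:-1]):
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue                   # split only between distinct values
        n1 = i + 1
        score = (n1 / n) * gini_from_counts(c_below) \
            + ((n - n1) / n) * gini_from_counts(c_above)
        if score < best_score:
            best_score, best_point = score, (value + next_value) / 2
    return best_score, best_point
```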
2.2 Finding split points (3/3)
- Categorical attributes
  - To find categorical split points, we make a single scan
    through the attribute list, collecting counts in the
    count matrix for each combination of class label and
    attribute value found in the data (figure 6)
  - The important point
    - The information required for computing the gini index
      for any subset splitting is available in the count
      matrix (see the sketch below)
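- A sketch of subset evaluation from the count matrix alone
  (illustrative; exhaustive enumeration is only practical
  for low-cardinality attributes like CarType):

```python
from collections import Counter, defaultdict
from itertools import combinations

def best_categorical_split(attribute_list):
    """One scan fills the count matrix; every candidate subset split
    is then scored from the matrix alone, without rereading the data."""
    matrix = defaultdict(Counter)
    for value, label, _ in attribute_list:
        matrix[value][label] += 1

    def gini_of(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    values, n = list(matrix), len(attribute_list)
    best_score, best_subset = float("inf"), None
    # Exhaustive subset enumeration: fine for small domains,
    # impractical when the attribute has many distinct values.
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            s1 = sum((matrix[v] for v in subset), Counter())
            s2 = sum((matrix[v] for v in values if v not in subset),
                     Counter())
            n1 = sum(s1.values())
            score = n1 / n * gini_of(s1) + (n - n1) / n * gini_of(s2)
            if score < best_score:
                best_score, best_subset = score, set(subset)
    return best_score, best_subset
```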
2.3 Performing the split
- Once the best split point has been found for a node, we
  execute the split by creating child nodes and dividing the
  attribute records between them (figure 4)
- For the remaining attribute lists of the node (CarType in
  our example)
  - We have no test that can be applied to the attribute
    values to decide how to divide the records, so we work
    with the rids instead
- As we partition the list of the splitting attribute (i.e.
  Age), we insert the rid of each record into a probe
  structure (hash table), noting to which child the record
  was moved
- During this splitting operation, we also build the class
  histograms for each new leaf (see the sketch below)
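- A sketch of the split, with a Python set standing in for
  the rid hash table (names are ours):

```python
def perform_split(node_lists, split_attr, predicate):
    """Divide all attribute lists of a node between two children.

    Only the splitting attribute's list can be tested directly; its
    rids are recorded in a probe structure (a hash table, here a set)
    and the remaining lists are divided by probing with each rid."""
    probe = set()                      # rids routed to the left child
    left_lists, right_lists = {}, {}

    left, right = [], []
    for entry in node_lists[split_attr]:
        value, _, rid = entry
        if predicate(value):           # e.g. lambda age: age < 27.5
            left.append(entry)
            probe.add(rid)
        else:
            right.append(entry)
    left_lists[split_attr], right_lists[split_attr] = left, right

    for attr, entries in node_lists.items():
        if attr == split_attr:
            continue
        left_lists[attr] = [e for e in entries if e[2] in probe]
        right_lists[attr] = [e for e in entries if e[2] not in probe]
    return left_lists, right_lists
```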
2.4 Comparison with SLIQ (1/2)
- The technique of creating separate attribute lists from
  the original data was proposed by the SLIQ algorithm
- In SLIQ
  - An entry in an attribute list holds an attribute value
    and a rid
  - Class labels are kept in a separate data structure
    called a class list, which is indexed by rid
  - An entry in the class list contains a pointer to a node
    of the classification tree
  - Figure 7
- Our goal in designing SPRINT was not to outperform SLIQ on
  datasets where a class list can fit in memory
  - Rather, to develop an accurate classifier for datasets
    that are simply too large for any other algorithm, and
    to be able to build such a classifier efficiently
  - SPRINT is designed to be easily parallelizable
2.4 Comparison with SLIQ (2/2)
3. Parallelizing Classification
- In parallel tree-growth, the primary problems are
  - Finding good split points
  - Partitioning the data using the discovered split points
- As in any parallel algorithm, issues of data placement and
  workload balancing must be considered
  - These are resolved in the SPRINT algorithm
- We assume a shared-nothing parallel environment where each
  of N processors has private memory and disk
3.1 Data Placement and Workload Balancing
- SPRINT achieves uniform data placement and workload
  balancing by distributing the attribute lists evenly over
  the N processors of a shared-nothing machine.
- The partitioning is achieved by first distributing the
  training-set examples equally among all the processors
  (sketched below).
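- Conceptually, each sorted attribute list ends up cut into
  N contiguous, roughly equal-sized sections, one per
  processor (a simplified local sketch, not the paper's
  parallel-sort implementation):

```python
def distribute(attribute_list, n_procs):
    """Cut one (sorted) attribute list into n_procs contiguous,
    roughly equal-sized sections, one per processor."""
    n = len(attribute_list)
    bounds = [round(i * n / n_procs) for i in range(n_procs + 1)]
    return [attribute_list[bounds[i]:bounds[i + 1]]
            for i in range(n_procs)]
```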
3.2 Finding split points
- Very similar to the serial algorithm
- In the serial version, the processor scans the attribute
  lists, either evaluating split points for continuous
  attributes or collecting distribution counts for
  categorical attributes
  - This does not change in the parallel algorithm
  - No extra work or communication is required while each
    processor is scanning its attribute-list partitions
- The differences between the serial and parallel algorithms
  appear only before and after the attribute-list partitions
  are scanned (see the sketch below)
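- The "before" step for continuous attributes: each
  processor must initialize Cbelow with the class counts of
  the sections preceding its own, and Cabove with the rest.
  A local simulation of that prefix-sum exchange (no real
  message passing here; names are ours):

```python
from collections import Counter

def init_parallel_histograms(sections):
    """Processor p needs Cbelow = class counts of the sections before
    its own, Cabove = counts of its own section and those after it."""
    counts = [Counter(label for _, label, _ in s) for s in sections]
    total = sum(counts, Counter())
    inits, below = [], Counter()
    for c in counts:
        inits.append((Counter(below), total - below))  # (Cbelow, Cabove)
        below += c
    return inits

sections = [[(17, "High", 1)], [(23, "High", 0), (32, "Low", 4)]]
print(init_parallel_histograms(sections)[1])
# (Counter({'High': 1}), Counter({'High': 1, 'Low': 1}))

# After the scan, each processor exchanges its best local split and
# the candidate with the lowest gini index wins globally.
```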
3.3 Performing the Splits
- Splitting is identical to the serial algorithm
- The only additional step is that, before building the
  probe structure, we need to collect rids from all the
  processors (see the sketch below)
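- In MPI terms this collection is an all-gather; a local
  stand-in (names are ours):

```python
def build_global_probe(per_processor_left_rids):
    """Union the left-child rids contributed by every processor (an
    all-gather in MPI terms) into one shared probe structure."""
    probe = set()
    for rids in per_processor_left_rids:
        probe |= set(rids)
    return probe

# e.g. build_global_probe([{0, 4}, {7}, set()]) == {0, 4, 7}
```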
3.4 Parallelizing SLIQ
- Two primary approaches for parallelizing SLIQ
  - One where the class list is replicated in the memory of
    every processor, called SLIQ/R
  - The other where it is distributed such that each
    processor's memory holds only a portion of the entire
    list, called SLIQ/D
4. Performance Evaluation
- The primary metric for evaluating classifier performance
  is classification accuracy: the percentage of test samples
  that are correctly classified.
- The other important metrics are classification time and
  the size of the decision tree.
- The ideal goal for a decision-tree classifier is to
  produce compact, accurate trees in a short classification
  time.
4.1 Datasets (1/2)
- Due to the lack of a classification benchmark containing
  large datasets, synthetic databases are used for all of
  the experiments
4.1 Datasets (2/2)
- Ten classification functions produce databases with
  distributions of varying complexity
- In this paper, results are presented for two of these
  functions
4.2 Serial Performance
- For the serial analysis, we compare the response times of
  serial SPRINT and SLIQ on training sets of various sizes
- Experiments were conducted on an IBM RS/6000 250
  workstation running AIX level 3.2.5
- The CPU has a clock rate of 66MHz, and the machine has
  16MB of main memory
4.3 Parallel Performance
- To examine how well the SPRINT algorithm performs in
  parallel environments, the parallelization was implemented
  on an IBM SP2
  - Using the standard MPI communication primitives
- Experiments were conducted on a 16-node IBM SP2 Model 9076
- Each node in the multiprocessor is a 370 Node consisting
  of a POWER1 processor running at 62.5MHz with 128MB of
  real memory.
4.3.1 Comparison of Parallel Algorithms (1/2)
- First, we compare parallel SPRINT to the two
  parallelizations of SLIQ.
- In these experiments, each processor contained 50,000
  training examples, and the number of processors was varied
  from 2 to 16
- The total training-set size thus ranges from 100,000
  records to 800,000 records
4.3.1 Comparison of Parallel Algorithms (2/2)
4.3.2 Scaleup
4.3.3 Speedup
4.3.4 Sizeup
5. Conclusion (1/2)
- There is a need for algorithms that can build classifiers
  over very large databases.
- The recently proposed SLIQ algorithm was the first to
  address these concerns.
- In this paper, a new classification algorithm called
  SPRINT was presented
  - It removes all memory restrictions that limit existing
    decision-tree algorithms
  - Yet it exhibits the same excellent scaling behavior as
    SLIQ
5. Conclusion (2/2)
- Design goals included the requirement that the algorithm
  be easily and efficiently parallelizable.
- SPRINT does have an efficient parallelization that
  requires very few additions to the serial algorithm.
- SPRINT handles datasets that are too large for SLIQ.
- The implementation on the SP2, a shared-nothing
  multiprocessor, showed that SPRINT does indeed parallelize
  efficiently.
- Parallel SPRINT's efficiency improves as the problem size
  increases.
  - It shows good scaleup, speedup, and sizeup
    characteristics.