Title: Parallel and Distributed Computing for Data Mining: A Review
1 Parallel and Distributed Computing for Data Mining: A Review
- Elio Lozano Inca
- University of Puerto Rico, Mayagüez Campus
2 Outline
- Data mining.
- K-means algorithm.
- K-nearest neighbors.
- Resampling techniques.
- Support vector machine.
- Decision tree (C4.5).
- Conclusions.
3 What is Data Mining?
- Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al., 2001).
- It uses machine learning and statistical techniques to discover knowledge and present it in a form that is easy for humans to comprehend.
- Data mining is an application of a more general problem called pattern recognition.
4 Data Mining Tasks
- Classification: learn a method for predicting the class of an instance from pre-labeled (classified) instances, e.g., k-nearest neighbors, decision trees, neural networks.
- Clustering: find a natural grouping of instances given unlabeled data, e.g., K-means.
5 Serial K-means
- K-means algorithm (a minimal sketch follows this slide):
- 1) Initialize k points (m_j).
- 2) For each X_i, determine the point m_j to which it is closest and make X_i a member of that cluster.
- 3) Find the new m_j's as the averages of the points in their clusters.
- 4) Repeat 2) and 3) until convergence.
- Hard K-means (Duda, 1973): find the minimum-variance clustering of the data into k clusters, i.e., choose the m_j so as to minimize
  $J = \sum_{j=1}^{k} \sum_{X_i \in C_j} \lVert X_i - m_j \rVert^2$.
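A minimal NumPy sketch of the serial algorithm above; the random toy data, the convergence tolerance, and the function name are assumptions of mine, not part of the original slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Serial K-means: alternate assignment (step 2) and centroid update (step 3)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: initialize k points
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # step 2: closest center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])   # step 3: new m_j
        if np.linalg.norm(new_centers - centers) < tol:          # step 4: convergence test
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(kmeans(X, k=2)[0])
```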
6 K-means
Assign each point to the closest cluster center
7 Parallel K-means
- Dhillon and Modha (2002).
- 1) The master process distributes the initial k points (m_j) to all nodes.
- 2) The master process distributes the data points X_1, X_2, ..., X_n equally among the nodes.
- 3) Each node determines, one X_i at a time, the centroid to which each of its X_i is closest. Each node maintains a sum and a count of the X_i associated with each m_j, and sends its cluster-membership labels to the master process.
- 4) At each iteration, the master process computes the new m_j.
- 5) Repeat 3) and 4) until convergence. (A sketch of this sum-and-count pattern follows this slide.)
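A sketch of this master/worker pattern with the per-node work simulated by a loop over data chunks rather than real message passing; the chunking, the helper names, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def local_sums(chunk, centers):
    """One node (step 3): per-centroid sum and count of the points it owns."""
    labels = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        sums[j] = chunk[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    return sums, counts

def master_step(chunks, centers):
    """Master (step 4): reduce the local sums/counts and compute the new centroids."""
    total_sums, total_counts = np.zeros_like(centers), np.zeros(len(centers))
    for chunk in chunks:                      # in a real run, each chunk lives on its own node
        s, c = local_sums(chunk, centers)
        total_sums += s
        total_counts += c
    new_centers = centers.copy()
    nonzero = total_counts > 0
    new_centers[nonzero] = total_sums[nonzero] / total_counts[nonzero, None]
    return new_centers

X = np.random.randn(200, 3)
chunks = np.array_split(X, 4)                 # steps 1-2: data distributed over 4 nodes
centers = X[:3].copy()
for _ in range(10):                           # step 5: fixed count stands in for a convergence test
    centers = master_step(chunks, centers)
print(centers)
```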
8 Discussion
- The parallel algorithm proposed by Dhillon and Modha has almost linear speedup when the data set has more instances than features and the processors have the same computational speed.
- When there are more features than instances, or when the processors run at different speeds, the proposed parallel algorithm scales poorly.
- Fortunately, there are other ways to parallelize this algorithm.
- Feature-based partitioning splits the data across features: once the master process sends the data, each process computes its respective partial sums and sends them back to the master, which joins the partial results and then computes the new centroids (see the sketch after this slide).
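One way to read the feature-based variant: each process holds only a block of columns and computes partial squared distances on those columns; the master adds the blocks and then assigns labels. A small single-process illustration (the column split and names are assumptions of mine):

```python
import numpy as np

def partial_sq_dists(X_block, centers_block):
    """One process: squared-distance contribution of its own block of features."""
    return ((X_block[:, None, :] - centers_block[None, :, :]) ** 2).sum(axis=2)

X = np.random.randn(100, 6)
centers = X[:3]
feature_blocks = np.array_split(np.arange(X.shape[1]), 2)    # 2 processes, 3 features each

# Master: join the partial results, then assign each point to its closest centroid.
total = sum(partial_sq_dists(X[:, cols], centers[:, cols]) for cols in feature_blocks)
labels = total.argmin(axis=1)
print(labels[:10])
```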
9 The K-Nearest Neighbors
- Cover and Hart (1967).
- Given an unknown sample, the k-nearest neighbors algorithm searches the pattern space for the k training samples that are closest (using Euclidean distance) to the unknown sample (a minimal sketch follows this slide).
10 Parallel K-Nearest Neighbors
- Jin and Agrawal (2001).
- 1) The training set is distributed among the nodes.
- 2) Given an unknown sample, each node processes the training samples it owns to calculate the nearest neighbors locally.
- 3) A global reduction computes the overall k nearest neighbors from the local k nearest neighbors of each node. (A sketch of this pattern follows this slide.)
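A sketch of the local-then-global reduction from steps 2)-3), with the nodes simulated by chunks of the training set; the chunking and helper names are illustrative assumptions:

```python
import numpy as np

def local_candidates(chunk_X, chunk_y, x, k):
    """One node (step 2): its own k nearest neighbors of x, as (distance, label) pairs."""
    d = np.linalg.norm(chunk_X - x, axis=1)
    idx = np.argsort(d)[:k]
    return list(zip(d[idx], chunk_y[idx]))

def global_knn(chunks, x, k=3):
    """Step 3: merge the local candidates, keep the k closest overall, then vote."""
    candidates = []
    for chunk_X, chunk_y in chunks:
        candidates.extend(local_candidates(chunk_X, chunk_y, x, k))
    candidates.sort(key=lambda pair: pair[0])
    top_labels = [label for _, label in candidates[:k]]
    return max(set(top_labels), key=top_labels.count)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
y = np.array([0] * 30 + [1] * 30)
chunks = list(zip(np.array_split(X, 3), np.array_split(y, 3)))   # step 1: distribute training set
print(global_knn(chunks, np.array([4.0, 4.0])))
```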
11 Discussion
- Jin and Agrawal concluded that their algorithm achieves high efficiency on both distributed-memory and shared-memory machines.
- However, when the number of instances is smaller than the number of features, this algorithm does not perform well.
- I propose another parallel algorithm for this problem (sketched after this slide):
- Split the data by features; for each object, each process computes the partial distances from this object to the training set over its own features.
- After that, a global reduction is performed, and the master process finds the k nearest neighbors.
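A small illustration of the proposed feature-based split: each process computes partial squared distances over its own features, the reduction sums them, and the master picks the k nearest; the column split, sizes, and names are assumptions of mine:

```python
import numpy as np

X_train = np.random.randn(50, 8)
x = np.random.randn(8)
k = 3

feature_blocks = np.array_split(np.arange(8), 2)   # 2 processes, each owning 4 features

# Each process: partial squared distances from x to the training set over its features.
partials = [((X_train[:, cols] - x[cols]) ** 2).sum(axis=1) for cols in feature_blocks]

# Global reduction; the master then finds the k nearest neighbors.
total_sq_dists = np.sum(partials, axis=0)
print(np.argsort(total_sq_dists)[:k])
```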
12 Resampling Techniques
- Classifier-combination techniques such as bagging (Breiman, 1994) and boosting (Freund and Schapire, 1996) rely on resampling.
- They consist of building a classifier from a sequence of training sets, each with n different observations.
- Instead of obtaining the sequence of training samples from the real world, it is generated artificially.
- The technique used in this case is bootstrapping; the final classifier is then obtained by voting among the classifiers built from the bootstrap samples (a sketch follows this slide).
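A minimal sketch of bagging as described above. The nearest-centroid base learner, the number of bootstrap samples, and the toy data are assumptions of mine; any classifier could be plugged in:

```python
import numpy as np

def train_centroid_classifier(X, y):
    """A very simple base learner (stands in for any classifier): nearest class centroid."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def bagging(X, y, X_test, B=25, seed=0):
    """Build B classifiers from B bootstrap samples and combine them by voting."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_test), 2), dtype=int)            # assumes binary labels 0/1
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample (with replacement)
        preds = predict(train_centroid_classifier(X[idx], y[idx]), X_test)
        votes[np.arange(len(X_test)), preds] += 1
    return votes.argmax(axis=1)                              # majority vote

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 3])
y = np.array([0] * 40 + [1] * 40)
print(bagging(X, y, np.array([[0.0, 0.0], [3.0, 3.0]])))
```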
13 Bootstrapping process
14 Parallel Bootstrapping
- Beddo (2002).
- 1) The master process sends the data set to all nodes.
- 2) Each node produces approximately B/p classifiers from B/p bootstrap samples (where B is the number of bootstrap samples and p the number of nodes).
- 3) Finally, all nodes perform a reduction, and the master process obtains the final classifier by voting. (A sketch follows this slide.)
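A sketch of steps 2)-3): each of p nodes builds roughly B/p bootstrap classifiers (here a nearest-centroid learner, an assumption of mine), and the reduction sums the local vote counts before the master takes the majority vote; the loop over r stands in for the p nodes:

```python
import numpy as np

def node_votes(X, y, X_test, n_samples, rng):
    """One node (step 2): build ~B/p bootstrap classifiers and return local vote counts."""
    votes = np.zeros((len(X_test), 2), dtype=int)            # assumes binary labels 0/1
    for _ in range(n_samples):
        idx = rng.integers(0, len(X), size=len(X))           # one bootstrap sample
        centroids = np.array([X[idx][y[idx] == c].mean(axis=0) if np.any(y[idx] == c)
                              else X[y == c].mean(axis=0) for c in (0, 1)])
        preds = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        votes[np.arange(len(X_test)), preds] += 1
    return votes

B, p = 24, 4                                                 # B bootstrap samples, p nodes
X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 3])
y = np.array([0] * 40 + [1] * 40)
X_test = np.array([[0.0, 0.0], [3.0, 3.0]])

# Step 3: the reduction sums the local vote counts; the master takes the majority vote.
total = sum(node_votes(X, y, X_test, B // p, np.random.default_rng(r)) for r in range(p))
print(total.argmax(axis=1))
```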
15 Discussion
- Beddo concludes that resampling techniques are parallel in nature and reach linear speedup because little communication is performed between processors.
- We propose a natural dynamic load partition to avoid static partitioning (a toy illustration follows this slide).
- The master process gives one bootstrap training set to each slave.
- Each slave computes its classifier, classifies the test data, and sends the result back to the master process.
- The master process joins the classifiers obtained from each slave and hands out another bootstrap sample. This continues until no bootstrap samples are left.
- In boosting, each bootstrap sample depends on the previous classifier, so we do not split the bootstrap samples between processes. Instead, for each bootstrap sample, every process classifies its part of the test sample to find the partial errors and then sends its result to the master process.
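A toy illustration of the proposed dynamic partition: the master keeps a queue of bootstrap sample indices, and whichever slave finishes first simply takes the next one. Threads and a placeholder "classifier" stand in for real slave processes and real training; all of it is an assumption of mine, not code from the slides:

```python
import queue
import threading

def slave(worker_id, tasks, results):
    """Slave: repeatedly take one bootstrap sample index, 'train' on it, report back."""
    while True:
        try:
            b = tasks.get_nowait()                            # ask for the next bootstrap sample
        except queue.Empty:
            return                                            # no bootstrap samples left
        result = f"classifier for bootstrap sample {b}"       # placeholder for real training
        results.put((worker_id, b, result))

B = 10                                                        # number of bootstrap samples
tasks, results = queue.Queue(), queue.Queue()
for b in range(B):
    tasks.put(b)

# Master: start the slaves; faster workers naturally pull more samples from the queue.
workers = [threading.Thread(target=slave, args=(w, tasks, results)) for w in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
while not results.empty():
    print(results.get())
```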
16 Support Vector Machine
- Vapnik (1979).
- This algorithm maximizes the margin and minimizes the error.
- The proximal SVM classifier (Fung and Mangasarian, 2001) changes the inequality constraints to equalities, minimizing
  $\frac{\nu}{2}\lVert D(Ap - eb) - e\rVert^{2} + \frac{1}{2}\left(p^{\top}p + b^{2}\right)$.
- Setting the gradient with respect to (p, b) to zero gives
  $\begin{bmatrix} p \\ b \end{bmatrix} = \left(\frac{I}{\nu} + E^{\top}E\right)^{-1} E^{\top}D e$,
  where $E = [A \;\; -e]$, A is the data matrix, D is the diagonal matrix of class labels (+1/-1), and e is a vector of ones. (A small solver sketch follows this slide.)
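A NumPy sketch of the closed-form proximal SVM solution above; the toy data and the value of ν are assumptions of mine:

```python
import numpy as np

def proximal_svm(A, y, nu=1.0):
    """Solve [p; b] = (I/nu + E^T E)^{-1} E^T D e for the proximal SVM."""
    m, n = A.shape
    e = np.ones((m, 1))
    D = np.diag(y.astype(float))                  # diagonal matrix of +1/-1 labels
    E = np.hstack([A, -e])                        # E = [A  -e]
    sol = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ D @ e).ravel()
    return sol[:n], sol[n]                        # separating plane: p^T x = b

A = np.vstack([np.random.randn(30, 2) + 2, np.random.randn(30, 2) - 2])
y = np.array([1] * 30 + [-1] * 30)
p, b = proximal_svm(A, y)
print(p, b)
```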
17 Row-Distributed SVM
- Poulet (2003).
- The data set is divided into c blocks of rows.
- Each block is handled by a different processor: the master process sends one block of the data set to each remote machine.
- Each remote machine i computes its partial terms $E_i^{\top}E_i$ and $E_i^{\top}D_i e_i$, then sends its results back to the master.
- Once the master process has received the results, it forms $\sum_i E_i^{\top}E_i$ and $\sum_i E_i^{\top}D_i e_i$, then inverts the matrix $\frac{I}{\nu} + \sum_i E_i^{\top}E_i$ to get the final result. (A single-process illustration follows this slide.)
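A single-process illustration of the row-distributed scheme: each row block contributes its partial terms, and the master sums them and solves the small (n+1) x (n+1) system once; the block count, ν, and data are assumptions of mine:

```python
import numpy as np

def block_terms(A_i, y_i):
    """One remote machine: E_i^T E_i and E_i^T D_i e_i from its block of rows."""
    E_i = np.hstack([A_i, -np.ones((len(A_i), 1))])
    return E_i.T @ E_i, E_i.T @ y_i[:, None]       # D_i e_i is just the label column

nu = 1.0
A = np.vstack([np.random.randn(60, 3) + 1, np.random.randn(60, 3) - 1])
y = np.array([1.0] * 60 + [-1.0] * 60)

# Master: split rows into c = 4 blocks, collect the partial terms, sum, and invert once.
EtE = np.zeros((4, 4))                             # (n+1) x (n+1), with n = 3 features
EtDe = np.zeros((4, 1))
for A_i, y_i in zip(np.array_split(A, 4), np.array_split(y, 4)):
    t1, t2 = block_terms(A_i, y_i)
    EtE += t1
    EtDe += t2
sol = np.linalg.solve(np.eye(4) / nu + EtE, EtDe).ravel()
print(sol[:3], sol[3])                             # p and b
```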
18 Column-Distributed SVM
- Poulet (2003).
- Using the Sherman-Morrison-Woodbury formula,
  $\left(\frac{I}{\nu} + E^{\top}E\right)^{-1} = \nu\left(I - E^{\top}\left(\frac{I}{\nu} + EE^{\top}\right)^{-1}E\right)$,
  only an m x m matrix (m = number of instances) has to be inverted.
- The column-based parallel algorithm is similar to the row-distributed one: each process computes its partial term $E_i E_i^{\top}$ from its block of columns and sends it back to the master. Once the master has received the results from the slave processes, it performs the last sums and inverts the matrix to get the final result. (An illustration follows this slide.)
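A single-process illustration of the column-distributed scheme: with E split into column blocks, $EE^{\top} = \sum_i E_i E_i^{\top}$, so each process contributes its $E_i E_i^{\top}$, the master sums and inverts only an m x m matrix, and the Sherman-Morrison-Woodbury identity recovers the solution; the block count, ν, and data are assumptions of mine:

```python
import numpy as np

nu = 1.0
A = np.vstack([np.random.randn(25, 10) + 1, np.random.randn(25, 10) - 1])
y = np.array([1.0] * 25 + [-1.0] * 25)
m, n = A.shape
E = np.hstack([A, -np.ones((m, 1))])                 # E = [A  -e]: more columns than rows

# Each process: E_i E_i^T from its block of columns; the master sums the m x m contributions.
col_blocks = np.array_split(np.arange(n + 1), 3)
EEt = sum(E[:, cols] @ E[:, cols].T for cols in col_blocks)

# SMW: (I/nu + E^T E)^{-1} v = nu * (v - E^T (I/nu + E E^T)^{-1} E v), so only an m x m solve is needed.
rhs = E.T @ y[:, None]                               # E^T D e (D e is just the label column)
sol = nu * (rhs - E.T @ np.linalg.solve(np.eye(m) / nu + EEt, E @ rhs))
p, b = sol[:n].ravel(), sol[n, 0]
print(p, b)
```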
19 Discussion
- Poulet reports that these algorithms achieve good performance both for data sets with a large number of rows and for data sets with a large number of columns.
- However, the data are split into blocks statically, so the algorithms cannot reach good load balancing when processor speeds differ. Dynamic load balancing can be used to recover good performance.
20 Decision Tree
- ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993). A decision tree T encodes a classifier or regression function in the form of a tree.
- Encoded classifier:
- If (age < 30 and carType = Minivan) then YES
- If (age < 30 and (carType = Sports or carType = Truck)) then NO
- If (age > 30) then YES
- [Figure: tree with splitting predicate Age at the root (<30, >30); the <30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO), and the >30 branch gives YES.]
21 Decision Tree
22 C4.5 Decision Tree
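Slides 21-22 are figures; as a complement, here is a short sketch of the split criterion C4.5 uses, information gain normalized by split information (the gain ratio). The toy attribute and label arrays are assumptions of mine, loosely echoing slide 20:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, labels):
    """C4.5 split criterion: information gain divided by split information."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    cond_entropy = sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))
    info_gain = entropy(labels) - cond_entropy
    split_info = float(-(weights * np.log2(weights)).sum())
    return info_gain / split_info if split_info > 0 else 0.0

age = np.array(["<30", "<30", "<30", ">30", ">30", ">30"])
car = np.array(["Minivan", "Sports", "Truck", "Minivan", "Sports", "Truck"])
cls = np.array(["YES", "NO", "NO", "YES", "YES", "YES"])
print("age:", gain_ratio(age, cls), "carType:", gain_ratio(car, cls))
```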
23 Feature-Based Parallelization
Taner et al. (2004)
24 Node-Based Parallelization
Taner et al. (2004)
25 Data-Based Parallelization
Taner et al. (2004)
27 Discussion
- Taner et al. (2004):
- Feature-based parallelization has better speedup than node-based parallelization when the data set has more instances than features; otherwise, node-based parallelization has better speedup.
- Amado et al. (2003):
- Their hybrid algorithm can be characterized as combining data parallelism (feature-based or data-based) with node-based parallelism.
- For nodes covering a significant number of examples, data parallelism is used to avoid the load imbalance problem. For nodes covering fewer examples, the cost of communication can be higher than the time spent processing the examples.
- To avoid this problem, one of the processes continues alone the construction of the subtree rooted at the node (node-based parallelism). Usually, the switch between data parallelism and task parallelism (node-based parallelism) is performed when the communication cost overcomes the processing and data transfer cost.
28 Conclusions
- Almost all of the parallel data mining algorithms reviewed (K-means, k-nearest neighbors, resampling techniques, support vector machines, and decision trees) have almost linear speedup, because the amount of communication and the amount of computation required to merge the local models into a global model are small.
- Most data mining algorithms (K-means, k-nearest neighbors, resampling techniques) can be parallelized by replication: build a local model on each processor and then combine these models to obtain a global model.
- Some parallel algorithms (K-means, k-nearest neighbors) were designed with a particular kind of data set in mind and do not perform well on other data sets. For this reason, I have proposed new algorithms to cover all cases of real data sets.