Title: Parallel and Distributed Computing for Data Mining: A Review
1 Parallel and Distributed Computing for Data Mining: A Review
- Elio Lozano Inca
- University of Puerto Rico, Mayagüez Campus
2 Outline
- Data mining.
- K-means algorithm.
- K-nearest neighbors.
- Resampling techniques.
- Support vector machine.
- Decision tree (C4.5).
- Conclusions.
3 What is Data Mining?
- Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al., 2001).
- It uses machine learning and statistical techniques to discover knowledge and present it in a form that is easy for humans to comprehend.
- Data mining is an application of a more general problem called pattern recognition.
4 Data Mining Tasks
- Classification: learn a method for predicting the class of an instance from pre-labeled (classified) instances, e.g., k-nearest neighbors, decision trees, neural networks.
- Clustering: find a natural grouping of instances given unlabeled data, e.g., K-means.
5 Serial K-means
- K-means algorithm (a minimal sketch follows this slide):
- 1) Initialize k points (m_j).
- 2) For each X_i, determine the point m_j to which it is closest and make X_i a member of that cluster.
- 3) Find the new m_j's as the averages of the points in their clusters.
- 4) Repeat 2) and 3) until convergence.
- Hard K-means (Duda, 1973): find the minimum-variance clustering of the data into k clusters, i.e., choose the m_j so as to minimize
  $J = \sum_{j=1}^{k} \sum_{X_i \in C_j} \lVert X_i - m_j \rVert^2$.
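A minimal NumPy sketch of the serial algorithm above; the random toy data, the convergence tolerance, and the function name are assumptions of mine, not part of the original slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Serial K-means: alternate assignment (step 2) and centroid update (step 3)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: initialize k points
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # step 2: closest center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])   # step 3: new m_j
        if np.linalg.norm(new_centers - centers) < tol:          # step 4: convergence test
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(kmeans(X, k=2)[0])
```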
6 K-means
Assign each point to the closest cluster center
7 Parallel K-means
- Dhillon and Modha (2002).
- 1) The master process distributes the initial k points (m_j) to all nodes.
- 2) The master process distributes the data points X_1, X_2, ..., X_n equally among the nodes.
- 3) Each node determines, one X_i at a time, the centroid to which each of its X_i is closest. Each node maintains a sum and a count of the X_i associated with each m_j, and sends its cluster-membership labels to the master process.
- 4) At each iteration, the master process computes the new m_j.
- 5) Repeat 3) and 4) until convergence. (A sketch of this sum-and-count pattern follows this slide.)
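A sketch of this master/worker pattern with the per-node work simulated by a loop over data chunks rather than real message passing; the chunking, the helper names, and the fixed iteration count are illustrative assumptions:

```python
import numpy as np

def local_sums(chunk, centers):
    """One node (step 3): per-centroid sum and count of the points it owns."""
    labels = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        sums[j] = chunk[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    return sums, counts

def master_step(chunks, centers):
    """Master (step 4): reduce the local sums/counts and compute the new centroids."""
    total_sums, total_counts = np.zeros_like(centers), np.zeros(len(centers))
    for chunk in chunks:                      # in a real run, each chunk lives on its own node
        s, c = local_sums(chunk, centers)
        total_sums += s
        total_counts += c
    new_centers = centers.copy()
    nonzero = total_counts > 0
    new_centers[nonzero] = total_sums[nonzero] / total_counts[nonzero, None]
    return new_centers

X = np.random.randn(200, 3)
chunks = np.array_split(X, 4)                 # steps 1-2: data distributed over 4 nodes
centers = X[:3].copy()
for _ in range(10):                           # step 5: fixed count stands in for a convergence test
    centers = master_step(chunks, centers)
print(centers)
```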
8 Discussion
- The parallel algorithm proposed by Dhillon and Modha has almost linear speedup when the data set has more instances than features and the processors have the same computational speed.
- When there are more features than instances, or when the processors run at different speeds, the proposed parallel algorithm scales poorly.
- Fortunately, there are other ways to parallelize this algorithm.
- Feature-based partitioning splits the data across features: once the master process sends the data, each process computes its respective partial sums and sends them back to the master, which joins the partial results and then computes the new centroids (see the sketch after this slide).
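One way to read the feature-based variant: each process holds only a block of columns and computes partial squared distances on those columns; the master adds the blocks and then assigns labels. A small single-process illustration (the column split and names are assumptions of mine):

```python
import numpy as np

def partial_sq_dists(X_block, centers_block):
    """One process: squared-distance contribution of its own block of features."""
    return ((X_block[:, None, :] - centers_block[None, :, :]) ** 2).sum(axis=2)

X = np.random.randn(100, 6)
centers = X[:3]
feature_blocks = np.array_split(np.arange(X.shape[1]), 2)    # 2 processes, 3 features each

# Master: join the partial results, then assign each point to its closest centroid.
total = sum(partial_sq_dists(X[:, cols], centers[:, cols]) for cols in feature_blocks)
labels = total.argmin(axis=1)
print(labels[:10])
```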
9 The K-Nearest Neighbors
- Cover and Hart (1967).
- Given an unknown sample, the k-nearest neighbors algorithm searches the pattern space for the k training samples that are closest (using Euclidean distance) to the unknown sample (a minimal sketch follows this slide).
10 Parallel K-Nearest Neighbors
- Jin and Agrawal (2001).
- 1) The training set is distributed among the nodes.
- 2) Given an unknown sample, each node processes the training samples it owns to calculate the nearest neighbors locally.
- 3) A global reduction computes the overall k nearest neighbors from the local k nearest neighbors of each node. (A sketch of this pattern follows this slide.)
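A sketch of the local-then-global reduction from steps 2)-3), with the nodes simulated by chunks of the training set; the chunking and helper names are illustrative assumptions:

```python
import numpy as np

def local_candidates(chunk_X, chunk_y, x, k):
    """One node (step 2): its own k nearest neighbors of x, as (distance, label) pairs."""
    d = np.linalg.norm(chunk_X - x, axis=1)
    idx = np.argsort(d)[:k]
    return list(zip(d[idx], chunk_y[idx]))

def global_knn(chunks, x, k=3):
    """Step 3: merge the local candidates, keep the k closest overall, then vote."""
    candidates = []
    for chunk_X, chunk_y in chunks:
        candidates.extend(local_candidates(chunk_X, chunk_y, x, k))
    candidates.sort(key=lambda pair: pair[0])
    top_labels = [label for _, label in candidates[:k]]
    return max(set(top_labels), key=top_labels.count)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
y = np.array([0] * 30 + [1] * 30)
chunks = list(zip(np.array_split(X, 3), np.array_split(y, 3)))   # step 1: distribute training set
print(global_knn(chunks, np.array([4.0, 4.0])))
```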
11 Discussion
- Jin and Agrawal concluded that their algorithm achieves high efficiency on both distributed-memory and shared-memory machines.
- However, when the number of instances is smaller than the number of features, this algorithm does not perform well.
- I propose another parallel algorithm for this problem (sketched after this slide):
- Split the data by features; for each object, each process computes the partial distances from this object to the training set over its own features.
- After that, a global reduction is performed, and the master process finds the k nearest neighbors.
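A small illustration of the proposed feature-based split: each process computes partial squared distances over its own features, the reduction sums them, and the master picks the k nearest; the column split, sizes, and names are assumptions of mine:

```python
import numpy as np

X_train = np.random.randn(50, 8)
x = np.random.randn(8)
k = 3

feature_blocks = np.array_split(np.arange(8), 2)   # 2 processes, each owning 4 features

# Each process: partial squared distances from x to the training set over its features.
partials = [((X_train[:, cols] - x[cols]) ** 2).sum(axis=1) for cols in feature_blocks]

# Global reduction; the master then finds the k nearest neighbors.
total_sq_dists = np.sum(partials, axis=0)
print(np.argsort(total_sq_dists)[:k])
```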
12 Resampling Techniques
- Classifier-combination techniques such as bagging (Breiman, 1994) and boosting (Freund and Schapire, 1996) rely on resampling.
- They consist of building a classifier from a sequence of training sets, each with n different observations.
- Instead of obtaining the sequence of training samples from the real world, it is generated artificially.
- The technique used in this case is bootstrapping; the final classifier is then obtained by voting among the classifiers built from the bootstrap samples (a sketch follows this slide).
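A minimal sketch of bagging as described above. The nearest-centroid base learner, the number of bootstrap samples, and the toy data are assumptions of mine; any classifier could be plugged in:

```python
import numpy as np

def train_centroid_classifier(X, y):
    """A very simple base learner (stands in for any classifier): nearest class centroid."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def bagging(X, y, X_test, B=25, seed=0):
    """Build B classifiers from B bootstrap samples and combine them by voting."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_test), 2), dtype=int)            # assumes binary labels 0/1
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample (with replacement)
        preds = predict(train_centroid_classifier(X[idx], y[idx]), X_test)
        votes[np.arange(len(X_test)), preds] += 1
    return votes.argmax(axis=1)                              # majority vote

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 3])
y = np.array([0] * 40 + [1] * 40)
print(bagging(X, y, np.array([[0.0, 0.0], [3.0, 3.0]])))
```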
13 Bootstrapping process
14 Parallel Bootstrapping
- Beddo (2002).
- 1) The master process sends the data set to all nodes.
- 2) Each node produces approximately B/p classifiers from B/p bootstrap samples (where B is the number of bootstrap samples and p the number of nodes).
- 3) Finally, all nodes perform a reduction, and the master process obtains the final classifier by voting. (A sketch follows this slide.)
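A sketch of steps 2)-3): each of p nodes builds roughly B/p bootstrap classifiers (here a nearest-centroid learner, an assumption of mine), and the reduction sums the local vote counts before the master takes the majority vote; the loop over r stands in for the p nodes:

```python
import numpy as np

def node_votes(X, y, X_test, n_samples, rng):
    """One node (step 2): build ~B/p bootstrap classifiers and return local vote counts."""
    votes = np.zeros((len(X_test), 2), dtype=int)            # assumes binary labels 0/1
    for _ in range(n_samples):
        idx = rng.integers(0, len(X), size=len(X))           # one bootstrap sample
        centroids = np.array([X[idx][y[idx] == c].mean(axis=0) if np.any(y[idx] == c)
                              else X[y == c].mean(axis=0) for c in (0, 1)])
        preds = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        votes[np.arange(len(X_test)), preds] += 1
    return votes

B, p = 24, 4                                                 # B bootstrap samples, p nodes
X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 3])
y = np.array([0] * 40 + [1] * 40)
X_test = np.array([[0.0, 0.0], [3.0, 3.0]])

# Step 3: the reduction sums the local vote counts; the master takes the majority vote.
total = sum(node_votes(X, y, X_test, B // p, np.random.default_rng(r)) for r in range(p))
print(total.argmax(axis=1))
```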
15 Discussion
- Beddo concludes that resampling techniques are parallel in nature and reach linear speedup because little communication is performed between processors.
- We propose a natural dynamic load partition to avoid static partitioning (a toy illustration follows this slide).
- The master process gives one bootstrap training set to each slave.
- Each slave computes its classifier, classifies the test data, and sends the result back to the master process.
- The master process joins the classifiers obtained from each slave and hands out another bootstrap sample. This continues until no bootstrap samples are left.
- In boosting, each bootstrap sample depends on the previous classifier, so we do not split the bootstrap samples between processes. Instead, for each bootstrap sample, every process classifies its part of the test sample to find the partial errors and then sends its result to the master process.
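A toy illustration of the proposed dynamic partition: the master keeps a queue of bootstrap sample indices, and whichever slave finishes first simply takes the next one. Threads and a placeholder "classifier" stand in for real slave processes and real training; all of it is an assumption of mine, not code from the slides:

```python
import queue
import threading

def slave(worker_id, tasks, results):
    """Slave: repeatedly take one bootstrap sample index, 'train' on it, report back."""
    while True:
        try:
            b = tasks.get_nowait()                            # ask for the next bootstrap sample
        except queue.Empty:
            return                                            # no bootstrap samples left
        result = f"classifier for bootstrap sample {b}"       # placeholder for real training
        results.put((worker_id, b, result))

B = 10                                                        # number of bootstrap samples
tasks, results = queue.Queue(), queue.Queue()
for b in range(B):
    tasks.put(b)

# Master: start the slaves; faster workers naturally pull more samples from the queue.
workers = [threading.Thread(target=slave, args=(w, tasks, results)) for w in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
while not results.empty():
    print(results.get())
```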
16 Support Vector Machine
- Vapnik (1979).
- This algorithm maximizes the margin and minimizes the error.
- The proximal SVM classifier (Fung and Mangasarian, 2001) changes the inequality constraints to equalities, minimizing
  $\frac{\nu}{2}\lVert D(Ap - eb) - e\rVert^{2} + \frac{1}{2}\left(p^{\top}p + b^{2}\right)$.
- Setting the gradient with respect to (p, b) to zero gives
  $\begin{bmatrix} p \\ b \end{bmatrix} = \left(\frac{I}{\nu} + E^{\top}E\right)^{-1} E^{\top}D e$,
  where $E = [A \;\; -e]$, A is the data matrix, D is the diagonal matrix of class labels (+1/-1), and e is a vector of ones. (A small solver sketch follows this slide.)
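A NumPy sketch of the closed-form proximal SVM solution above; the toy data and the value of ν are assumptions of mine:

```python
import numpy as np

def proximal_svm(A, y, nu=1.0):
    """Solve [p; b] = (I/nu + E^T E)^{-1} E^T D e for the proximal SVM."""
    m, n = A.shape
    e = np.ones((m, 1))
    D = np.diag(y.astype(float))                  # diagonal matrix of +1/-1 labels
    E = np.hstack([A, -e])                        # E = [A  -e]
    sol = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ D @ e).ravel()
    return sol[:n], sol[n]                        # separating plane: p^T x = b

A = np.vstack([np.random.randn(30, 2) + 2, np.random.randn(30, 2) - 2])
y = np.array([1] * 30 + [-1] * 30)
p, b = proximal_svm(A, y)
print(p, b)
```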
17 Row-Distributed SVM
- Poulet (2003).
- The data set is divided into c blocks of rows.
- Each block is handled by a different processor: the master process sends one block of the data set to each remote machine.
- Each remote machine i computes its partial terms $E_i^{\top}E_i$ and $E_i^{\top}D_i e_i$, then sends its results back to the master.
- Once the master process has received the results, it forms $\sum_i E_i^{\top}E_i$ and $\sum_i E_i^{\top}D_i e_i$, then inverts the matrix $\frac{I}{\nu} + \sum_i E_i^{\top}E_i$ to get the final result. (A single-process illustration follows this slide.)
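A single-process illustration of the row-distributed scheme: each row block contributes its partial terms, and the master sums them and solves the small (n+1) x (n+1) system once; the block count, ν, and data are assumptions of mine:

```python
import numpy as np

def block_terms(A_i, y_i):
    """One remote machine: E_i^T E_i and E_i^T D_i e_i from its block of rows."""
    E_i = np.hstack([A_i, -np.ones((len(A_i), 1))])
    return E_i.T @ E_i, E_i.T @ y_i[:, None]       # D_i e_i is just the label column

nu = 1.0
A = np.vstack([np.random.randn(60, 3) + 1, np.random.randn(60, 3) - 1])
y = np.array([1.0] * 60 + [-1.0] * 60)

# Master: split rows into c = 4 blocks, collect the partial terms, sum, and invert once.
EtE = np.zeros((4, 4))                             # (n+1) x (n+1), with n = 3 features
EtDe = np.zeros((4, 1))
for A_i, y_i in zip(np.array_split(A, 4), np.array_split(y, 4)):
    t1, t2 = block_terms(A_i, y_i)
    EtE += t1
    EtDe += t2
sol = np.linalg.solve(np.eye(4) / nu + EtE, EtDe).ravel()
print(sol[:3], sol[3])                             # p and b
```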
18 Column-Distributed SVM
- Poulet (2003).
- Using the Sherman-Morrison-Woodbury formula,
  $\left(\frac{I}{\nu} + E^{\top}E\right)^{-1} = \nu\left(I - E^{\top}\left(\frac{I}{\nu} + EE^{\top}\right)^{-1}E\right)$,
  only an m x m matrix (m = number of instances) has to be inverted.
- The column-based parallel algorithm is similar to the row-distributed one: each process computes its partial term $E_i E_i^{\top}$ from its block of columns and sends it back to the master. Once the master has received the results from the slave processes, it performs the last sums and inverts the matrix to get the final result. (An illustration follows this slide.)
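A single-process illustration of the column-distributed scheme: with E split into column blocks, $EE^{\top} = \sum_i E_i E_i^{\top}$, so each process contributes its $E_i E_i^{\top}$, the master sums and inverts only an m x m matrix, and the Sherman-Morrison-Woodbury identity recovers the solution; the block count, ν, and data are assumptions of mine:

```python
import numpy as np

nu = 1.0
A = np.vstack([np.random.randn(25, 10) + 1, np.random.randn(25, 10) - 1])
y = np.array([1.0] * 25 + [-1.0] * 25)
m, n = A.shape
E = np.hstack([A, -np.ones((m, 1))])                 # E = [A  -e]: more columns than rows

# Each process: E_i E_i^T from its block of columns; the master sums the m x m contributions.
col_blocks = np.array_split(np.arange(n + 1), 3)
EEt = sum(E[:, cols] @ E[:, cols].T for cols in col_blocks)

# SMW: (I/nu + E^T E)^{-1} v = nu * (v - E^T (I/nu + E E^T)^{-1} E v), so only an m x m solve is needed.
rhs = E.T @ y[:, None]                               # E^T D e (D e is just the label column)
sol = nu * (rhs - E.T @ np.linalg.solve(np.eye(m) / nu + EEt, E @ rhs))
p, b = sol[:n].ravel(), sol[n, 0]
print(p, b)
```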
19 Discussion
- Poulet reports that these algorithms achieve good performance both for data sets with a large number of rows and for data sets with a large number of columns.
- However, the data are split into blocks statically, so the algorithms cannot reach good load balancing when processor speeds differ. Dynamic load balancing can be used to recover good performance.
20 Decision Tree
- ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993). A decision tree T encodes a classifier or regression function in the form of a tree.
- Encoded classifier:
- If (age < 30 and carType = Minivan) then YES
- If (age < 30 and (carType = Sports or carType = Truck)) then NO
- If (age > 30) then YES
- [Figure: tree with splitting predicate Age at the root (<30, >30); the <30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO), and the >30 branch gives YES.]
21 Decision Tree
22 C4.5 Decision Tree
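Slides 21-22 are figures; as a complement, here is a short sketch of the split criterion C4.5 uses, information gain normalized by split information (the gain ratio). The toy attribute and label arrays are assumptions of mine, loosely echoing slide 20:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, labels):
    """C4.5 split criterion: information gain divided by split information."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    cond_entropy = sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))
    info_gain = entropy(labels) - cond_entropy
    split_info = float(-(weights * np.log2(weights)).sum())
    return info_gain / split_info if split_info > 0 else 0.0

age = np.array(["<30", "<30", "<30", ">30", ">30", ">30"])
car = np.array(["Minivan", "Sports", "Truck", "Minivan", "Sports", "Truck"])
cls = np.array(["YES", "NO", "NO", "YES", "YES", "YES"])
print("age:", gain_ratio(age, cls), "carType:", gain_ratio(car, cls))
```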
23 Feature-Based Parallelization
Taner et al. (2004)
24 Node-Based Parallelization
Taner et al. (2004)
25 Data-Based Parallelization
Taner et al. (2004)
27 Discussion
- Taner et al. (2004):
- Feature-based parallelization has better speedup than node-based parallelization when the data set has more instances than features; otherwise, node-based parallelization has better speedup.
- Amado et al. (2003):
- Their hybrid algorithm can be characterized as combining data parallelism (feature-based or data-based) with node-based parallelism.
- For nodes covering a significant number of examples, data parallelism is used to avoid the load imbalance problem. For nodes covering fewer examples, the cost of communication can be higher than the time spent processing the examples.
- To avoid this problem, one of the processes continues alone the construction of the subtree rooted at the node (node-based parallelism). Usually, the switch between data parallelism and task parallelism (node-based parallelism) is performed when the communication cost overcomes the processing and data transfer cost.
28 Conclusions
- Almost all of the parallel data mining algorithms reviewed (K-means, k-nearest neighbors, resampling techniques, support vector machines, and decision trees) have almost linear speedup, because the amount of communication and the amount of computation required to merge the local models into a global model are small.
- Most data mining algorithms (K-means, k-nearest neighbors, resampling techniques) can be parallelized by replication: build a local model on each processor and then combine these models to obtain a global model.
- Some parallel algorithms (K-means, k-nearest neighbors) were designed with a particular kind of data set in mind and do not perform well on other data sets. For this reason, I have proposed new algorithms to cover all cases of real data sets.