Parallel and Distributed Computing for Data Mining: A Review - PowerPoint PPT Presentation (Transcript)
1
Parallel and Distributed Computing for Data
Mining: A Review
  • Elio Lozano Inca
  • University of Puerto Rico, Mayagüez Campus

2
Outline
  • Data Mining
  • K-means algorithm.
  • K-nearest neighbors.
  • Resampling techniques.
  • Support vector machines.
  • Decision trees (C4.5).
  • Conclusions.

3
What is Data Mining?
  • Data Mining is "the nontrivial extraction of
    implicit, previously unknown, and potentially
    useful information from data" (Frawley et al.
    2001).
  • It uses machine learning and statistical
    techniques to discover and present knowledge in a
    form that is easy for humans to comprehend.
  • Data mining is an application of the more general
    problem of pattern recognition.

4
Data Mining Tasks
  • Classification: learn a method for predicting the
    instance class from pre-labeled (classified)
    instances, e.g., k-nearest neighbors, decision
    trees, neural networks.
  • Clustering: find natural groupings of instances
    given unlabeled data, e.g., K-means.

5
Serial K-means
  • K-means algorithm:
  • 1) Initialize k points (mj).
  • 2) For each Xi, determine the centroid mj to which
    it is closest and make it a member of that cluster.
  • 3) Recompute each mj as the average of the points
    in its cluster.
  • 4) Repeat 2) and 3) until convergence.

Hard K-means (Duda 1973): find the minimum-variance
clustering of the data into k clusters, i.e., choose
the clusters Cj and centers mj so as to minimize
J = sum_{j=1..k} sum_{Xi in Cj} ||Xi - mj||^2.
A minimal code sketch follows.
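
A minimal serial K-means sketch in Python/NumPy following
steps 1)-4) above; the function name, the random
initialization, and the convergence test on centroid movement
are illustrative choices, not taken from the slides.

    import numpy as np

    def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # 1) Initialize k points (m_j) by picking k random samples.
        m = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(max_iter):
            # 2) Assign each X_i to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3) Recompute each m_j as the average of its cluster.
            new_m = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else m[j]
                              for j in range(k)])
            # 4) Stop when the centroids no longer move (convergence).
            if np.linalg.norm(new_m - m) < tol:
                return new_m, labels
            m = new_m
        return m, labels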
6
K-means
Assign each point to the closest cluster center
7
Parallel K-means
  • Dhillon and Modha (2002).
  • 1) Master process distributes the initial k
    points (mj) to all nodes.
  • 2) Master process distributes the data points
    X1,X2, ...,Xn equally to all nodes.
  • 3) Each node determines the centroid to which each
    of its Xi is closest, one Xi at a time. Each node
    maintains a sum and a count of the Xi associated
    with each mj, and sends its cluster assignments to
    the master process.
  • 4) At each iteration, the master process combines
    these partial sums and counts to compute the new mj
    (see the sketch below).
  • 5) Repeat 3) and 4) until convergence.
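
A sketch of the master/worker scheme above, using Python's
multiprocessing pool in place of the message passing used in
the original work: each worker returns per-cluster sums and
counts for its chunk of the data (step 3), and the master
combines them into the new centroids (step 4). Chunking and
names are illustrative; run it under an
"if __name__ == '__main__':" guard.

    import numpy as np
    from multiprocessing import Pool

    def local_sums(args):
        """Step 3: assign each local X_i to its closest centroid
        and return per-cluster sums and counts."""
        X_chunk, m = args
        labels = np.linalg.norm(X_chunk[:, None, :] - m[None, :, :],
                                axis=2).argmin(axis=1)
        k, dim = m.shape
        sums, counts = np.zeros((k, dim)), np.zeros(k)
        for j in range(k):
            mask = labels == j
            sums[j], counts[j] = X_chunk[mask].sum(axis=0), mask.sum()
        return sums, counts

    def parallel_kmeans(X, k, n_workers=4, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        m = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
        chunks = np.array_split(X, n_workers)                    # step 2
        with Pool(n_workers) as pool:
            for _ in range(n_iter):
                parts = pool.map(local_sums, [(c, m) for c in chunks])
                sums = sum(p[0] for p in parts)
                counts = sum(p[1] for p in parts)
                m = sums / np.maximum(counts, 1)[:, None]         # step 4
        return m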

8
Discussion
  • The parallel algorithm proposed by Dhillon has
    almost linear speedup when the data has more
    instances than features and the processors have the
    same computational speed.
  • When there are more features than instances, or
    when the processors have different speeds, the
    proposed parallel algorithm scales poorly.
  • Fortunately, there are other ways to parallelize
    this algorithm.
  • Feature-based partition: the data are split by
    features.
  • Once the master process sends the data, each
    process computes its respective partial sums and
    then sends them back to the master, which joins the
    partial results and computes the new centroids.

9
The K-Nearest Neighbors
  • Cover and Hart (1967).
  • Given an unknown sample, the k-nearest-neighbor
    algorithm searches the pattern space for the k
    training samples that are closest (using Euclidean
    distance) to the unknown sample (see the sketch
    below).
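
A small sketch of this search (Euclidean distance, majority
vote among the k closest training samples); function and
variable names are illustrative.

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, x, k=3):
        """Return the majority class among the k training samples
        closest to x in Euclidean distance."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]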

10
Parallel K-nearest neighbors
  • Ruoming Jin and Gagan Agrawal (2001).
  • 1) The training set is distributed among the
    nodes.
  • 2) Given an unknown sample, each node processes
    the training samples it owns to calculate the
    nearest neighbors locally.
  • 3) A global reduction computes the overall k
    nearest neighbors from the local k nearest
    neighbors of each node (see the sketch below).
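
A sketch of steps 1)-3): each partition of the training set
yields its local k nearest neighbors, and a global reduction
merges the candidates into the overall k nearest. The
partitions stand in for the nodes of a real distributed
implementation.

    import heapq
    import numpy as np

    def local_knn(X_part, y_part, x, k):
        """Step 2: local k nearest neighbors as (distance, label) pairs."""
        dists = np.linalg.norm(X_part - x, axis=1)
        return [(dists[i], y_part[i]) for i in np.argsort(dists)[:k]]

    def global_knn(partitions, x, k):
        """Step 3: global reduction over the local candidates."""
        candidates = []
        for X_part, y_part in partitions:   # step 1: partitioned training set
            candidates.extend(local_knn(X_part, y_part, x, k))
        return heapq.nsmallest(k, candidates, key=lambda c: c[0])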

11
Discussion
  • Jin and Agrawal concluded that their algorithm
    achieves high efficiency on both distributed-memory
    and shared-memory machines.
  • However, when the number of instances is smaller
    than the number of features, this algorithm does
    not give good performance.
  • I propose another parallel algorithm for this
    problem.
  • Split the data by features; for each object, each
    node computes the partial distances from that
    object to its portion of the training set.
  • After that, a global reduction is performed, and
    the master process finds the k nearest neighbors
    (sketched below).
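
A sketch of this feature-based proposal: because squared
Euclidean distance decomposes over features, each node can sum
the squared differences for its own block of features, the
partial distances are added in a global reduction, and the
master picks the k smallest totals. Names are illustrative.

    import numpy as np

    def partial_distances(X_block, x_block):
        """Partial squared distances from one block of features."""
        return ((X_block - x_block) ** 2).sum(axis=1)

    def knn_feature_parallel(feature_blocks, x_blocks, k):
        # Global reduction: add the per-block partial distances.
        total = sum(partial_distances(Xb, xb)
                    for Xb, xb in zip(feature_blocks, x_blocks))
        # Master finds the k nearest training samples.
        return np.argsort(total)[:k]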

12
Resampling techniques
  • Techniques for combining classifiers (Bagging,
    Breiman 1994, and Boosting, Freund and Schapire
    1996) rely on resampling techniques.
  • They consist of building a classifier from a
    sequence of training sets, each with n different
    observations.
  • Instead of obtaining the sequence of training
    samples in real life, it is obtained in an
    artificial manner.
  • The technique used in this case is bootstrapping:
    a classifier is built on each bootstrap sample, and
    the final classifier is obtained by voting over
    this sequence of classifiers (see the sketch
    below).
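
A minimal bagging sketch: draw B bootstrap samples, train one
classifier per sample, and combine them by voting.
scikit-learn's DecisionTreeClassifier is assumed to be
available purely as a stand-in base learner; the slides
describe the technique, not this code.

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier  # assumed base learner

    def bagging_predict(X, y, X_test, B=25, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        preds = []
        for _ in range(B):
            # One bootstrap sample: n observations drawn with replacement.
            idx = rng.integers(0, n, size=n)
            clf = DecisionTreeClassifier().fit(X[idx], y[idx])
            preds.append(clf.predict(X_test))
        # Combine the B classifiers by majority vote.
        preds = np.array(preds)
        return np.array([Counter(preds[:, i]).most_common(1)[0][0]
                         for i in range(preds.shape[1])])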

13
Bootstrapping process
14
Parallel Bootstrapping
  • Beddo 2002
  • 1) The master process sends the data set to all
    nodes.
  • 2) Each node produces approximately B/p classifiers
    (where B is the number of bootstrap samples and p
    the number of nodes) from B/p bootstrap samples.
  • 3) Finally, all nodes perform a reduction, and the
    master process obtains the final classifier by
    voting over them (sketched below).
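
A sketch of this static split, with multiprocessing standing
in for the message-passing layer and the same assumed base
learner as before: each of p workers builds B/p classifiers
from its own bootstrap samples, and the master votes over all
of them. Run it under an "if __name__ == '__main__':" guard.

    import numpy as np
    from collections import Counter
    from multiprocessing import Pool
    from sklearn.tree import DecisionTreeClassifier  # assumed base learner

    def build_local_classifiers(args):
        """Step 2: one node builds B/p classifiers from B/p
        bootstrap samples and returns their test predictions."""
        X, y, X_test, n_samples, seed = args
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_samples):
            idx = rng.integers(0, len(X), size=len(X))
            preds.append(DecisionTreeClassifier()
                         .fit(X[idx], y[idx]).predict(X_test))
        return preds

    def parallel_bagging(X, y, X_test, B=24, p=4):
        jobs = [(X, y, X_test, B // p, s) for s in range(p)]  # step 1
        with Pool(p) as pool:
            all_preds = [pr for node in pool.map(build_local_classifiers, jobs)
                         for pr in node]
        # Step 3: the master votes over all B classifiers.
        all_preds = np.array(all_preds)
        return np.array([Counter(all_preds[:, i]).most_common(1)[0][0]
                         for i in range(all_preds.shape[1])])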

15
Discussion
  • Beddo concludes that resampling techniques are
    parallel in nature and reach linear speedup because
    little communication is performed between
    processors.
  • We propose a natural dynamic load partition to
    avoid the static partition.
  • The master process gives one bootstrap training set
    to each slave.
  • Each slave computes its classifier, classifies the
    test data, and sends the result back to the master
    process.
  • The master process joins the classifier results
    obtained from that slave and gives it another
    bootstrap sample. This process continues until no
    bootstrap samples are left (see the sketch below).
  • In Boosting, each bootstrap sample depends on the
    previous classifier, so we do not split the
    bootstrap samples between processes. Instead, with
    each bootstrap sample, each process classifies its
    partial test sample to find the partial errors and
    then sends its result to the master process.
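
A sketch of the proposed dynamic load partition: rather than a
static split, the master hands out one bootstrap sample at a
time and collects each result as soon as a slave finishes, so
faster processors simply handle more samples.
Pool.imap_unordered approximates that behavior; the worker
body is a placeholder.

    from multiprocessing import Pool

    def train_and_classify(sample_id):
        """Placeholder: draw bootstrap sample #sample_id, train a
        classifier, classify the test data, and return the result
        (here it just returns the id to keep the sketch runnable)."""
        return sample_id

    def dynamic_bagging(B=25, p=4):
        results = []
        with Pool(p) as pool:
            # Idle workers pull the next bootstrap sample immediately.
            for result in pool.imap_unordered(train_and_classify, range(B)):
                results.append(result)  # master joins results as they arrive
        return results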

16
Support Vector Machine
  • Vapnik 1979.
  • This algorithm maximizes the margin and minimizes
    the error.

The proximal SVM classifier (Fung and Mangasarian
2001) changes the inequality constraints to
equalities, minimizing
(nu/2) ||D(Aw - e*gamma) - e||^2 + (1/2) ||(w, gamma)||^2.
Setting the gradient with respect to (w, gamma) to
zero gives
(w, gamma) = (I/nu + E'E)^{-1} E'De,
where E = [A  -e], D = diag(d) holds the +1/-1 class
labels, and e is a vector of ones (a worked sketch
follows).
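
A small NumPy sketch of that closed-form solution, assuming
the standard Fung-Mangasarian notation reconstructed above
(A holds the training rows, d the +1/-1 labels); the function
name and parameter defaults are illustrative.

    import numpy as np

    def proximal_svm(A, d, nu=1.0):
        """Solve (I/nu + E'E)(w, gamma) = E'De for the proximal SVM,
        where E = [A  -e]; the classifier is sign(x.w - gamma)."""
        n, m = A.shape
        e = np.ones((n, 1))
        E = np.hstack([A, -e])                  # E = [A  -e]
        De = d.reshape(-1, 1)                   # D e with D = diag(d)
        lhs = np.eye(m + 1) / nu + E.T @ E      # I/nu + E'E
        sol = np.linalg.solve(lhs, E.T @ De).ravel()
        return sol[:-1], sol[-1]                # (w, gamma)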
17
Row distributed SVM
Poulet 2003.
  • The data set is divided into c blocks of rows.
  • Each block is computed on a different processor.
    The master process sends the data set to the remote
    machines.
  • Each remote machine performs the calculation of its
    local terms and then sends its result back to the
    master.
  • Once the master process has received the results,
    it sums them and inverts the matrix to get the
    final result (see the sketch below).
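
A sketch of the row-distributed computation just described,
under the assumption (consistent with the proximal SVM
solution above) that each block of rows i contributes the
local terms E_i'E_i and E_i'D_i e_i, which the master then
sums and solves; this is an illustration, not Poulet's code.

    import numpy as np

    def local_terms(A_block, d_block):
        """Computed on one remote machine from its block of rows."""
        e = np.ones((len(A_block), 1))
        E_i = np.hstack([A_block, -e])
        return E_i.T @ E_i, E_i.T @ d_block.reshape(-1, 1)

    def row_distributed_psvm(row_blocks, label_blocks, nu=1.0):
        m1 = row_blocks[0].shape[1] + 1
        EtE, EtDe = np.zeros((m1, m1)), np.zeros((m1, 1))
        # The master sums the partial results sent back by the nodes.
        for A_i, d_i in zip(row_blocks, label_blocks):
            t1, t2 = local_terms(A_i, d_i)
            EtE += t1
            EtDe += t2
        sol = np.linalg.solve(np.eye(m1) / nu + EtE, EtDe).ravel()
        return sol[:-1], sol[-1]                # (w, gamma)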

18
Column Distributed SVM
Poulet 2003.
Using the Sherman-Morrison-Woodbury formula, the
column-based parallel algorithm is similar to the
row-distributed one: each process performs the
calculation of its local term from its block of
columns and sends it back to the master. Once the
master has received the results from the slave
processes, it performs the final sums and inverts
the matrix to get the final result (see the sketch
below).
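
A sketch of the column-based scheme, assuming each column
block E_j of E = [A  -e] contributes E_j E_j' (so the master
only has to invert an n x n matrix), and using the identity
(I/nu + E'E)^{-1} E' = E' (I/nu + E E')^{-1}, which follows
from the Sherman-Morrison-Woodbury formula; the block layout
and names are illustrative.

    import numpy as np

    def column_distributed_psvm(column_blocks, d, nu=1.0):
        """column_blocks: column blocks E_j of E = [A  -e], with
        the -e column placed in the last block."""
        n = column_blocks[0].shape[0]
        # Each process contributes E_j E_j' from its block of columns.
        EEt = sum(Ej @ Ej.T for Ej in column_blocks)
        De = d.reshape(-1, 1)                    # D e with D = diag(d)
        # The master sums the blocks and solves the n x n system.
        v = np.linalg.solve(np.eye(n) / nu + EEt, De)
        # Each block maps v back to its slice of (w, gamma) via E_j'v.
        sol = np.vstack([Ej.T @ v for Ej in column_blocks]).ravel()
        return sol[:-1], sol[-1]                 # (w, gamma)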
19
Discussion
Poulet reported that these algorithms achieve good
performance both for data sets with a large number
of rows and for data sets with a large number of
columns. However, these parallel algorithms split
the data into blocks statically, so they cannot
reach good load balancing when the processor speeds
differ. Dynamic load balancing can be used to obtain
good performance.
20
ID3 (Quinlan 1986), C4.5 (Quinlan 1993). A decision
tree T encodes a classifier or regression function
in the form of a tree.
Decision Tree
  • Encoded classifier (shown as code below):
  • If (age < 30 and carType = Minivan) then YES
  • If (age < 30 and (carType = Sports or
    carType = Truck)) then NO
  • If (age >= 30) then YES

Decision tree (splitting predicate: Age):
  Age < 30  -> Car Type
      Minivan        -> YES
      Sports, Truck  -> NO
  Age >= 30 -> YES
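
The encoded classifier above, written out as plain code
(assuming the age boundary is 30, with age >= 30 classified
YES as in the reconstructed tree):

    def classify(age, car_type):
        """Decision tree from the slide: split on Age, then Car Type."""
        if age < 30:
            if car_type == "Minivan":
                return "YES"
            if car_type in ("Sports", "Truck"):
                return "NO"
        return "YES"    # age >= 30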
21
Decision Tree
22
C4.5 Decision Tree
23
Feature Based Parallelization
Taner et al 2004
24
Node Based Parallelization
Taner et al 2004
25
Data Based Parallelization
Taner et al 2004
27
Discussion
  • Taner et al. (2004):
  • Feature-based parallelization has better speedup
    than node-based parallelization when the number of
    instances in the data set is greater than the
    number of features; otherwise, node-based
    parallelization has better speedup.
  • Amado et al. (2003):
  • A hybrid algorithm can be characterized as using
    both data parallelism (feature based or data based)
    and node-based parallelism.
  • For tree nodes covering a significant number of
    examples, data parallelism is used to avoid the
    load-imbalance problem. For nodes covering fewer
    examples, the cost of communication can be higher
    than the time spent processing the examples.
  • To avoid this problem, one of the processes
    continues the construction of the tree rooted at
    that node alone (node-based parallelism). Usually,
    the switch from data parallelism to task
    parallelism (node-based parallelism) is performed
    when the communication cost overcomes the
    processing and data-transfer cost.

28
Conclusions
  • Almost all parallel data mining algorithms
    (K-means, nearest neighbors, resampling techniques,
    support vector machines, and decision trees) have
    almost linear speedup, because the amount of
    communication and the amount of computation
    required to merge local models into a global model
    are small.
  • Most data mining algorithms (K-means, nearest
    neighbors, resampling techniques) can be
    parallelized by replication: building a local model
    on each node and then combining these models to
    obtain a global model.
  • Some parallel algorithms (K-means, nearest
    neighbors) were designed with particular data sets
    in mind and do not perform well on other data sets.
    For this reason, I have proposed new algorithms to
    cover all cases of real data sets.