Parallel and Distributed Data Mining: A REVIEW

1
Parallel and Distributed Data Mining: A REVIEW
  • PhD Student: Elio Lozano
  • Advisor: Dr. Edgar Acuña
  • University of Puerto Rico
  • Mayagüez Campus

2
What is Data Mining?
  • Data mining is "the nontrivial extraction of implicit, previously unknown and potentially useful information from data" (Frawley et al. 2001).
  • It uses machine learning and statistical techniques to discover and present knowledge in a form that humans can easily comprehend.
  • Data mining is an application of a more general problem called pattern recognition.

3
Parallel and Distributed Data Mining
  • Distributed data mining techniques:
  • Meta-learning, distributed classification.
  • Parallel data mining:
  • Association algorithms, classification trees, clustering, discriminant methods, neural networks, genetic algorithms.

4
Meta-Learning
  • Finds global classifiers from distributed databases.
  • Learning from learned models (Chan and Stolfo 1997).
  • Java Agents for Meta-learning (JAM) (Prodromidis et al. 1999).

5
Distributed Data Mining
  • Classifiers are trained on local training sets.
  • Predictions are generated by each learned classifier.
  • The meta-level training set is composed of the validation set together with the predictions of the base classifiers on the validation set.
  • The final classifier is trained on the meta-level training set (see the sketch below).
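A minimal serial sketch of this combiner-style process in R, assuming two logistic-regression base learners and a binary outcome y; the object names (train1, train2, valid) are illustrative, not from the deck.

```r
## Combiner (stacking) sketch: base classifiers -> predictions on the
## validation set -> meta-level training set -> final classifier.
## train1, train2, valid are assumed data frames with a 0/1 column y.
fit_base <- function(train) glm(y ~ ., data = train, family = binomial)

c1 <- fit_base(train1)                                   # base classifiers from
c2 <- fit_base(train2)                                   # local training sets

p1 <- predict(c1, newdata = valid, type = "response")    # base predictions on
p2 <- predict(c2, newdata = valid, type = "response")    # the validation set

meta_train <- data.frame(p1 = p1, p2 = p2, y = valid$y)  # meta-level training set
meta <- glm(y ~ p1 + p2, data = meta_train, family = binomial)  # final classifier
```
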

6
Strategies for meta-learning
  • Arbitrating

7
Meta-Learning
  • Combiner

8
Meta-Learning
  • Benefits of meta-learning:
  • The base learning processes can be executed in parallel.
  • Serial learning code can be reused without parallelizing it.
  • The smaller local data sets can fit in main memory.

9
Resampling techniques
  • The techniques for combining classifiers (bagging, Breiman 1994, and boosting, Freund and Schapire 1996) rely on resampling.
  • The idea is to build a classifier from a sequence of training sets, each with n different observations.
  • Instead of obtaining the sequence of training samples from real life, it is generated artificially.
  • The technique used here is bootstrapping; the final classifier is then obtained by voting the classifiers built from the bootstrap samples.

10
Bootstrapping process
11
Parallel Bootstrapping
  • Beddo 2002.
  • 1) The master process sends the data set to all nodes.
  • 2) Each node produces approximately B/p classifiers from B/p bootstrap samples (where B is the number of bootstrap samples and p the number of nodes).
  • 3) Finally, all nodes perform a reduction, and the master process obtains the final classifier by voting them (see the sketch below).
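A sketch of this static scheme with the snow package mentioned later in the deck; the rpart base learner and the data frames train/test (with a factor column class) are assumptions, not Beddo's code.

```r
## Static B/p partition: each worker builds its share of bagged trees,
## the master combines all votes.
library(snow)
library(rpart)

p  <- 4                                       # number of worker nodes
B  <- 100                                     # number of bootstrap samples
cl <- makeCluster(p, type = "SOCK")
clusterExport(cl, c("train", "test"))         # 1) send the data set to all nodes
clusterEvalQ(cl, library(rpart))

## 2) each node builds about B/p classifiers from its own bootstrap samples
##    and returns their predicted labels for the test set
votes <- clusterApply(cl, rep(ceiling(B / p), p), function(b) {
  sapply(seq_len(b), function(i) {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    fit  <- rpart(class ~ ., data = boot)
    as.character(predict(fit, test, type = "class"))
  })
})

## 3) reduction on the master: majority vote over all B classifiers
all_votes <- do.call(cbind, votes)
final <- apply(all_votes, 1, function(v) names(which.max(table(v))))
## stopCluster(cl) once all bootstrap work is finished
```
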

12
Discussion
  • Beddo concludes that resampling techniques are parallel in nature and reach nearly linear speedup, because little communication is performed between processors.
  • We propose a natural dynamic load partition to avoid the static partition.
  • The master process gives one bootstrap training set to each slave.
  • Each slave computes its classifier, classifies the test data, and sends the result back to the master process.
  • The master process joins the classifier obtained by that slave and hands it another bootstrap sample. This continues until no bootstrap samples are left (see the sketch below).
  • In boosting, each bootstrap sample depends on the previous classifier, so we do not split the bootstrap samples between processes. Instead, with each bootstrap sample, every process classifies its partial test sample to find the partial errors and then sends its result to the master process.
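This is not the authors' implementation, but snow's load-balancing apply gives a rough analogue of the dynamic scheme: it hands one bootstrap sample at a time to whichever worker is free. The sketch reuses cl, B, train, test and rpart from the previous sketch.

```r
## clusterApplyLB dispatches one task at a time to the next idle worker,
## so faster workers automatically take on more bootstrap samples.
votes <- clusterApplyLB(cl, seq_len(B), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(class ~ ., data = boot)
  as.character(predict(fit, test, type = "class"))
})
## the master joins the predictions and takes the majority vote as before
final <- apply(do.call(cbind, votes), 1, function(v) names(which.max(table(v))))
```
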

13
Parallel Data Mining
  • There are two main approaches:
  • Task parallelism.
  • Processors are divided into subgroups, and subtasks are assigned to processor subgroups based on the cost of processing each subtask.
  • Data parallelism.
  • Tasks are solved in parallel using all the processors, but this may suffer from load imbalance because the data may not be uniformly spread across the processors.

14
The K-Nearest Neighbors
  • Cover and Hart 1967.
  • Given an unknown sample, the k-nearest-neighbors algorithm searches the pattern space for the k training samples that are closest (using Euclidean distance) to the unknown sample (see the sketch below).
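A serial base-R sketch of the algorithm; train (a numeric matrix), labels (its class labels) and query (one unknown sample) are assumed objects.

```r
## Classify one unknown sample by majority vote among its k nearest neighbors.
knn_one <- function(train, labels, query, k = 3) {
  d  <- sqrt(rowSums((train - matrix(query, nrow(train), ncol(train),
                                     byrow = TRUE))^2))  # Euclidean distances
  nn <- order(d)[1:k]                                    # k closest training samples
  names(which.max(table(labels[nn])))                    # majority class
}
## usage: knn_one(train, labels, query, k = 5)
```
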

15
Parallel K-Nearest Neighbors
  • Ruoming and Agrawal 2001.
  • 1) The training set is distributed among the nodes.
  • 2) Given an unknown sample, each node processes the training samples it owns to find the nearest neighbors locally.
  • 3) A global reduction computes the overall k nearest neighbors from the local k nearest neighbors of each node (see the sketch below).
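A sketch of the three steps with snow (not the original code); train, labels and query are assumed as in the previous sketch.

```r
library(snow)
p  <- 4
k  <- 5
cl <- makeCluster(p, type = "SOCK")
clusterExport(cl, c("query", "k"))

## 1) distribute the training set (by rows) among the nodes
idx    <- split(seq_len(nrow(train)), cut(seq_len(nrow(train)), p, labels = FALSE))
chunks <- lapply(idx, function(i) list(x = train[i, , drop = FALSE], y = labels[i]))

## 2) each node finds the k nearest neighbors among the samples it owns
local_nn <- clusterApply(cl, chunks, function(ch) {
  d <- sqrt(rowSums((ch$x - matrix(query, nrow(ch$x), ncol(ch$x), byrow = TRUE))^2))
  o <- order(d)[seq_len(min(k, nrow(ch$x)))]
  data.frame(dist = d[o], label = ch$y[o])
})

## 3) global reduction: keep the overall k nearest of the local candidates
cand <- do.call(rbind, local_nn)
top  <- cand[order(cand$dist)[1:k], ]
pred <- names(which.max(table(top$label)))
## stopCluster(cl) when finished
```
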

16
Discussion
  • Ruoming concluded that his algorithm achieved high efficiency on both distributed memory and shared memory.
  • However, when the number of instances is smaller than the number of features, the algorithm does not give good performance (e.g., the Golub data).
  • We propose another parallel algorithm for this problem (see the sketch below).
  • Split the data by features: for each object, we compute the partial distances from that object to the training set.
  • After that, a global reduction is performed, and the master process finds the k nearest neighbors.
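A sketch of this feature-based variant (not the authors' code), continuing the setup of the previous sketch (cl, p, k, train, labels, query). Since squared Euclidean distance is a sum over features, the partial results from the column blocks simply add up in the reduction.

```r
## split the data by features: each worker gets a block of columns
cols   <- split(seq_len(ncol(train)), cut(seq_len(ncol(train)), p, labels = FALSE))
blocks <- lapply(cols, function(j) list(x = train[, j, drop = FALSE], q = query[j]))

## each worker computes partial squared distances over its feature block
partial <- clusterApply(cl, blocks, function(b) {
  rowSums((b$x - matrix(b$q, nrow(b$x), ncol(b$x), byrow = TRUE))^2)
})

## global reduction: partial squared distances sum to the full squared distance,
## and the master process finds the k nearest neighbors
d2   <- Reduce(`+`, partial)
nn   <- order(d2)[1:k]
pred <- names(which.max(table(labels[nn])))
```
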

17
R Wrappers for MPI
  • Rmpi (Hao Yu)
  • Interface (wrapper) to MPI (Message-Passing Interface).
  • Provides an interface (wrapper) to the MPI APIs.
  • It also provides an interactive R slave environment in which distributed statistical computing can be carried out.
  • Maintainer: Hao Yu <hyu@stats.uwo.ca>
  • snow (Luke Tierney, A. J. Rossini, Na Li, H. Sevcikova)
  • Simple Network of Workstations (snow).
  • Support for simple parallel computing in R.
  • Maintainer: Luke Tierney <luke@stat.uiowa.edu>
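A minimal snow session, only to show the flavor of these wrappers; the deck's own experiment scripts are not included in the slides, and an analogous master/slave workflow exists in Rmpi.

```r
library(snow)
cl  <- makeCluster(4, type = "SOCK")          # start 4 R worker processes
res <- parLapply(cl, 1:8, function(i) i^2)    # scatter the work, gather the results
unlist(res)                                   # 1 4 9 16 25 36 49 64
stopCluster(cl)
```
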

18
Experimental Results
  • Data sets:
  • Landsat
  • Shuttle
  • Cluster of 5 Intel Xeon(TM) 2.8 GHz CPUs with 3 GB of RAM.
  • Cluster of 4 HP Intel Itanium nodes, each with two 1.5 GHz CPUs and 1.5 GB of RAM.

19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
K-means
  • K-means algorithm:
  • 1) Initialize k points (m_j).
  • 2) For each X_i, determine the point m_j to which it is closest and make X_i a member of that cluster.
  • 3) Find the new m_j's as the averages of the points in their clusters.
  • Repeat 2) and 3) until convergence.

Hard clustering (Duda 1973): find the minimum-variance clustering of the data into k clusters, in such a way as to minimize the within-cluster scatter (the criterion is reconstructed below).
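The criterion, shown only as an image on the original slide, is presumably the standard within-cluster sum of squares:

$$
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2 ,
$$

where $C_j$ is the set of points currently assigned to cluster $j$ and $m_j$ is its mean.
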
23
K-means
Assign each point to the closest cluster center
24
Parallel K-means
  • Dhillon and Modha 1999.
  • 1) The master process distributes the initial k points (m_j) to all nodes.
  • 2) The master process distributes the data points X_1, X_2, ..., X_n equally among the nodes.
  • 3) Each node determines the center m_j to which each of its points X_i is closest, one X_i at a time. Each node maintains a sum and a count of the X_i associated with each m_j, and sends them, together with the cluster labels of its points, to the master process.
  • 4) At each iteration, the master process computes the new m_j and sends them to the slave processes.
  • 5) Repeat 3) and 4) until convergence (see the sketch below).
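A sketch of these steps with snow, assuming a numeric data matrix X; the partitioning, the fixed iteration count and the object names are illustrative, not Dhillon and Modha's code.

```r
library(snow)
p  <- 4                                        # number of worker nodes
k  <- 3                                        # number of clusters
cl <- makeCluster(p, type = "SOCK")

## 1)-2) distribute the data points equally and pick k initial centroids
chunks <- lapply(split(seq_len(nrow(X)), cut(seq_len(nrow(X)), p, labels = FALSE)),
                 function(i) X[i, , drop = FALSE])
m <- X[sample(nrow(X), k), , drop = FALSE]

for (iter in 1:20) {                           # fixed iterations for brevity
  clusterExport(cl, "m")                       # send current centroids to the slaves
  ## 3) each node keeps a per-cluster sum and count for its block of points
  partial <- clusterApply(cl, chunks, function(xb) {
    d2  <- sapply(seq_len(nrow(m)), function(j)
             rowSums((xb - matrix(m[j, ], nrow(xb), ncol(xb), byrow = TRUE))^2))
    lab <- max.col(-d2)                        # index of the closest centroid
    list(sum = t(sapply(seq_len(nrow(m)), function(j)
                   colSums(xb[lab == j, , drop = FALSE]))),
         cnt = tabulate(lab, nbins = nrow(m)))
  })
  ## 4) reduction on the master: new centroids from the global sums and counts
  S <- Reduce(`+`, lapply(partial, `[[`, "sum"))
  n <- Reduce(`+`, lapply(partial, `[[`, "cnt"))
  m <- S / n                                   # assumes no cluster becomes empty
}
stopCluster(cl)
```
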

25
Discussion
  • The parallel algorithm proposed by Dhillon has almost linear speedup when the data has more instances than features and the processors have the same computational speed.
  • When the number of features is greater than the number of instances, or when the processors have different speeds, the proposed parallel algorithm scales poorly.
  • Fortunately, there are other ways to parallelize this algorithm:
  • Feature-based partition, in which the data are split by features.
  • Once the master process sends the data, each process computes its respective sums and sends them back to the master, which joins the partial results and then computes the new centroids.

26
Parallel K-means run time (Rmpi)
27
(No Transcript)
28
ID3 (Quinlan 1986), C4.5 (Quinlan 1993). A decision tree T encodes a classifier or regression function in the form of a tree.
Decision Tree
  • Encoded classifier (also written as an R function after the tree):
  • If (age < 30 and carType = Minivan) Then YES
  • If (age < 30 and (carType = Sports or carType = Truck)) Then NO
  • If (age >= 30) Then YES

[Decision tree diagram: the root splitting predicate is Age; the <30 branch splits on Car Type (Minivan → YES; Sports, Truck → NO); the >=30 branch → YES.]
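The same encoded classifier written as a small R function (illustrative only; attribute and class names follow the slide).

```r
## Nested if-else form of the decision tree above.
classify <- function(age, carType) {
  if (age < 30) {
    if (carType == "Minivan") "YES" else "NO"   # Sports or Truck -> NO
  } else {
    "YES"                                       # age >= 30 -> YES
  }
}
classify(25, "Sports")   # "NO"
classify(45, "Truck")    # "YES"
```
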
29
Decision Tree
30
C4.5 Decision Tree
31
Feature-Based Parallelization
Taner et al. 2004
32
Node-Based Parallelization
Taner et al. 2004
33
Data-Based Parallelization
Taner et al. 2004
34
(No Transcript)
35
Discussion
  • Taner et al.:
  • Feature-based parallelization has better speedup than node-based parallelization when the number of instances in the data set is greater than the number of features; otherwise node-based parallelization has better speedup.
  • Amado et al. 2003:
  • The hybrid algorithm can be characterized as using both data parallelism (feature-based or data-based) and node-based parallelism.
  • For nodes covering a significant number of examples, data parallelism is used to avoid the load-imbalance problem. The cost of communicating nodes that cover few examples can be higher than the time spent processing those examples.
  • To avoid this problem, one of the processes continues the construction of the tree rooted at such a node on its own (node-based parallelism). Usually, the switch between data parallelism and task parallelism (node-based parallelism) is performed when the communication cost overcomes the processing and data-transfer cost.

36
Types of Clustering Algorithms
  • Hierarchical
  • Agglomerative
  • Initially, each record is its own cluster.
  • Repeatedly combine the 2 most similar clusters.
  • Divisive
  • Initially, all records are in a single cluster.
  • Repeatedly split 1 cluster into 2 smaller
    clusters.
  • Partition-based
  • Partition records into k clusters
  • Optimize objective function measuring goodness of
    clustering
  • Heuristic optimization via hill-climbing
  • Random choice of initial clusters.
  • Iteratively refine clusters by re-assigning
    records when doing so improves objective function
  • Eventually converge to local maximum
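Both families are available in base R; a tiny illustration on the built-in iris measurements, for concreteness only.

```r
x  <- as.matrix(iris[, 1:4])
## agglomerative hierarchical: start from singletons, repeatedly merge the
## two most similar clusters, then cut the dendrogram into 3 clusters
hc      <- hclust(dist(x), method = "average")
h_label <- cutree(hc, k = 3)
## partition-based: k-means optimizes the within-cluster sum of squares
km      <- kmeans(x, centers = 3)
p_label <- km$cluster
```
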

37
Association Rule Discovery: Definition
  • Given a set of records, each of which contains some number of items from a given collection,
  • produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.

Rules discovered: Milk --> Coke
Diaper, Milk --> Beer
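A tiny illustration with the arules package; the four baskets below are made up for the example (they are not the deck's data) and the support/confidence thresholds are arbitrary.

```r
library(arules)
baskets <- list(c("Milk", "Coke"),
                c("Diaper", "Milk", "Beer"),
                c("Milk", "Coke", "Diaper", "Beer"),
                c("Diaper", "Beer"))
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.8))
inspect(rules)   # prints rules such as {Diaper} => {Beer} with their measures
```
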
38
Parallel Association Rules
  • Zaki (1998) uses a lattice-theoretic approach to represent the space of frequent itemsets and partitions it into smaller subspaces for parallel processing.
  • Each node in the lattice represents a frequent itemset.
  • He proposed four parallel algorithms, which require only two database scans.

39
Any Questions?
Thank you