Title: Parallel Mining of Association Rules
1. Parallel Mining of Association Rules
Rakesh Agrawal, John C. Shafer
Presented by Ting Hian Ong and Xu XingJian
http://www.comp.nus.edu.sg/tinghian/cs6203
2.
- Introduction
- Overview of the Serial Algorithm
- Parallel Algorithms
  - Count Distribution (CD)
  - Data Distribution (DD)
  - Candidate Distribution (CaD)
- Parallel Rule Generation
- Performance and Sensitivity Analysis
- Conclusions
- Q & A
3. Ultra-large databases and the possibility of faster access and manipulation
- DATA MINING
  - The efficient discovery of previously unknown patterns in large databases
- → Hence the need for fast algorithms for discovering association rules
4. Why Parallel Algorithms?
- Databases to be mined (raw transaction data instead of samples) are often very large, in the GB and TB range
- The need for fast algorithms for discovering association rules
- The transaction database has to be scanned repeatedly to discover the frequent itemsets
- This requires a lot of computation power, memory and I/O, which can only be provided by parallel computers running parallel algorithms
5.
- Three parallel algorithms are introduced:
- Count Distribution (CD)
- Data Distribution (DD)
- Candidate Distribution (CaD)
- Based on the serial algorithm Apriori
6. Association Rules
- The problem of mining association rules is to generate all association rules whose support and confidence are at least the user-specified minimum support and minimum confidence.
- Problem decomposition
  - Find all sets of items whose support is greater than the user-specified minimum support (the frequent itemsets)
  - Use the frequent itemsets to generate the desired rules
7. Apriori Algorithm
  L1 = {frequent 1-itemsets}
  k = 2
  while (Lk-1 ≠ ∅) do
    Ck = new candidates of size k generated from Lk-1
    forall transactions t ∈ D do
      Increment the count of all candidates in Ck that are contained in t
    Lk = all candidates in Ck with minimum support
    k = k + 1
  end
  Answer = ∪k Lk
8. Apriori Algorithm: Candidate Generation
  Join step:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
  Prune step:
    delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1
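A Python sketch of this join-and-prune step, under the same assumptions as the previous sketch (itemsets are frozensets of lexicographically comparable items); again illustrative rather than the paper's code:

from itertools import combinations

def generate_candidates(L_prev, k):
    """Join two (k-1)-itemsets that agree on their first k-2 items, then
    prune any candidate that has an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    Ck = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # join condition: p.item1..itemk-2 = q.item1..itemk-2 and p.itemk-1 < q.itemk-1
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                c = frozenset(p) | frozenset(q)
                # prune: every (k-1)-subset of c must appear in Lk-1
                if all(frozenset(s) in L_prev for s in combinations(sorted(c), k - 1)):
                    Ck.add(c)
    return Ck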
9.
- Three parallel algorithms (CD, DD, and CaD), all based on Apriori
- Discovering the frequent itemsets (step 1) is much more expensive than generating the rules (step 2)
- Phase 1
  - Each node generates candidate k-itemsets locally from the frequent (k-1)-itemsets → how to partition?
- Phase 2
  - Match the candidate itemsets against the transactions and collect the local counts → how to distribute?
- Phase 3
  - Determine the global counts for the itemsets → how to find them?
  - Find the frequent k-itemsets and replicate them on all nodes
10.
- Implemented on an IBM POWERparallel System SP2, a shared-nothing machine, where each of the N processors has a private memory and a private disk.
- Data is evenly distributed among the nodes
12. Count Distribution (CD)
- Objective: minimize communication
- Techniques
  - Straightforward parallelization of Apriori
  - Carry out redundant duplicate computations in parallel to avoid communication
  - Only count values need to be communicated (no data tuples are exchanged)
  - Processors can scan the local data asynchronously in parallel
13. CD Algorithm
- Pass 1
  - Each processor Pi generates its local candidate itemset Ci1 depending on the items actually present in its local data partition Di
  - Develop and exchange the local Ci1 counts
  - Develop the global support counts for C1
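A minimal sketch of this first pass, assuming the mpi4py binding for MPI as a stand-in for the paper's MPI-based implementation (the helper name and data layout are hypothetical):

from collections import Counter
from mpi4py import MPI

def cd_pass1(comm, local_transactions, min_support):
    """CD pass 1: count 1-items locally, then merge counts across processors."""
    local = Counter(item for t in local_transactions for item in t)   # local Ci1 counts
    global_counts = Counter()
    for other in comm.allgather(local):          # exchange local counts with all nodes
        global_counts.update(other)
    # every processor derives the same global L1
    return {frozenset([item]) for item, n in global_counts.items() if n >= min_support}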
14. CD Algorithm
- Pass k > 1
  - Pi generates the complete Ck using the complete Lk-1 created at the end of pass (k-1). Each processor has the identical Lk-1, thus generates an identical Ck, and puts its count values in a common order into a count array
  - Pi makes a pass over its data partition Di and develops local support counts for the candidates in Ck
  - Pi exchanges its local Ck counts with all other processors to develop the global Ck counts. All processors must synchronize.
  - Pi computes Lk from Ck
  - Each Pi independently decides whether to terminate or continue to the next pass
16.
- Disadvantages
  - CD does not exploit the aggregate memory of the system
  - Processors must synchronize and develop the global counts at the end of each pass
17. Data Distribution (DD)
- Objective: utilize the aggregate main memory of the system effectively
- Technique
  - Partition the candidates into disjoint sets, which are assigned to different processors. Each processor works with the entire dataset but only a portion of the candidate set.
  - Each processor counts mutually exclusive candidates. On an N-processor configuration, DD can count in a single pass a candidate set that would require N passes in CD
18. Basic Idea
- Example: 2 processors
- Under Data Distribution each processor handles only a subset of Ck, so as to utilize the aggregate memory
- Processors exchange data in order to develop global counts for their Cik
19. DD Algorithm
- Pass 1: same as the CD algorithm
- Pass k > 1
  - Pi generates Ck from Lk-1 but retains only 1/N of the itemsets, forming Cik
  - Pi develops support counts for the itemsets in Cik over ALL transactions (using local data pages and data pages received from the other processors)
  - At the end of the data pass, Pi calculates Lik from its local Cik
  - Processors exchange the Lik so that every processor has the complete Lk for generating Ck+1 for the next pass (requires the processors to synchronize)
  - Pi can then independently decide whether to terminate or continue on to the next pass
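A rough sketch of one DD pass in the same style (mpi4py assumed). The paper circulates data pages asynchronously; an allgather of the partitions is used here purely to keep the example short, and it makes the heavy communication explicit:

from mpi4py import MPI

def dd_pass(comm, Ck, local_transactions, min_support):
    """One Data Distribution pass: count a 1/N slice of Ck over ALL data."""
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    Ck = sorted(Ck, key=lambda c: tuple(sorted(c)))
    my_candidates = Ck[rank::nprocs]             # round-robin 1/N of the candidates
    counts = {c: 0 for c in my_candidates}
    # every processor must see every transaction: this is the expensive part
    for partition in comm.allgather(local_transactions):
        for t in partition:
            t = frozenset(t)
            for c in my_candidates:
                if c <= t:
                    counts[c] += 1
    my_Lk = [c for c, n in counts.items() if n >= min_support]   # local Lik
    # exchange the Lik so every processor holds the complete Lk for pass k+1
    return set().union(*comm.allgather(my_Lk))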
20. [Figure: each processor's local Lik are exchanged and combined into the complete Lk]
21. Disadvantage: heavy communication. Each processor must broadcast its local data and frequent itemsets to all other processors and synchronize in every pass.
22. Candidate Distribution (CaD)
- Problem
  - CD and DD require the processors to synchronize at the end of each pass
- Basic idea: remove the dependences among processors
  - Data dependence: complete transactions are required to compute the support counts (in CD)
  - Frequent-itemset dependence: the global itemset Lk is needed during the pruning step of the Apriori candidate generation algorithm (in DD)
23.
- Remove the data dependency
  - Each processor Pi works on Cik, a disjoint subset of Ck
  - Pi derives global support counts for Cik from local data alone.
  - Data is replicated amongst the processors in order to achieve the above
- Reduce the frequent-itemset dependency
  - Do not wait for complete pruning information to arrive from the other processors.
  - Prune the candidate set as much as possible with what is available
  - Late-arriving pruning information is used in subsequent passes.
24. CaD Algorithm
- Pass k < l: use either the CD or DD algorithm
- Pass k = l
  - Partition Lk-1 among the N processors
  - Pi generates Cik logically, using only its Lik-1 partition (standard pruning is used)
  - Pi develops global counts for the candidates in Cik, and the database is repartitioned into DRi at the same time (requires communicating local data)
  - Pi receives the Ljk from all other processors, which are needed for pruning Cik+1
  - Pi computes Lik from Cik and asynchronously sends it to the other N-1 processors
- Pass k > l
  - Pi collects all frequent itemsets sent by the other processors
  - Pi generates Cik using its local Lik-1, taking care of pruning with the Ljk-1 received so far
  - Pi passes over DRi and counts Cik
  - Pi computes Lik from Cik and asynchronously sends it to the other N-1 processors
25. How to partition Lk?
- Partition the itemsets in Lk based on common (k-1)-long prefixes
- Assume the items in each itemset are lexicographically ordered
- Example (from the paper; note it contains an error, ADE is missing)
  - L3 = {ABC, ABD, ABE, ACD, ACE, BCD, BCE, BDE, CDE}
  - L4 = {ABCD, ABCE, ABDE, ACDE, BCDE}
  - L5 = {ABCDE}
  - L6 = ∅
  - ABC, ABD, ABE → all have the common prefix AB
  - The Apriori candidate generation procedure generates ABCD, ABCE, ABDE, and ABCDE by joining only the itemsets in this prefix group
- Repartition the database according to the Lk partition
26.
- In Candidate Distribution, each processor works independently, counting only its portion of the global candidate set using only local data
- CaD must communicate the entire dataset during the redistribution pass (k = l, step 3), but only once. Unlike DD, a processor may selectively filter out the transactions it sends to other processors, depending upon how the dependency graph is partitioned.
27. Rule Generation
Given a frequent itemset l, examine a subset a and generate the rule a ⇒ (l - a) with support = support(l) and confidence = support(l) / support(a).
Example: frequent itemsets ABCD and AB; confidence = support(ABCD) / support(AB).
Only proceed to smaller subsets if the rule has the required minimum confidence (minconf).
Example: frequent itemset ABCD; if ABC ⇒ D does not satisfy minconf, then AB ⇒ CD will not have minconf either.
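A compact Python sketch of this rule-generation step for a single frequent itemset. It is illustrative only: support is assumed to be a dict mapping sorted item tuples to support counts, and the code descends to smaller antecedents only while the confidence test keeps passing, as described above.

from itertools import combinations

def gen_rules(itemset, support, min_conf):
    """Generate rules a => (l - a) from one frequent itemset l."""
    items = tuple(sorted(itemset))
    rules = []
    antecedents = list(combinations(items, len(items) - 1))   # largest subsets first
    while antecedents:
        next_level = set()
        for a in antecedents:
            conf = support[items] / support[a]                # support(l) / support(a)
            if conf >= min_conf:
                consequent = tuple(x for x in items if x not in a)
                rules.append((a, consequent, conf))
                if len(a) > 1:                                # try smaller antecedents
                    next_level.update(combinations(a, len(a) - 1))
        antecedents = sorted(next_level)
    return rules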
28. Parallel Rule Generation
- Examining the dataset is not required → cheap
- Generating rules in parallel requires partitioning the set of all frequent itemsets. Each processor generates rules only for its partition, using the above algorithm.
- The work is sensitive to itemset length, so the load is balanced by partitioning the itemsets of each length equally.
- Each processor must have access to all frequent itemsets before rule generation begins, in order to calculate confidences.
- In CaD this introduces waiting time for slower processors to discover and transmit all of their frequent itemsets.
- Because of this load imbalance, rule generation can instead be performed off-line, possibly on a serial processor.
29. Hardware specifications
- A 32-node IBM SP2 Model 302
- Each node is a Thin Node 2 consisting of a POWER2 processor running at 66.7 MHz with 256 MB of memory
- Each node has a 2 GB disk, of which less than 500 MB was available for the tests
- The combined communication hardware has a rated peak bandwidth of 80 MBps and latency < 40 ms; actual point-to-point bandwidth reached 20 MBps
- The Message Passing Interface (MPI) was used for communication among the processors
30.
- Six synthetic datasets of varying complexity were used
- Each dataset was about 100 MB per processor
- Data parameters
  - T: average transaction length
  - I: average size of the frequent itemsets
  - D: number of transactions
31. TEST PARAMETERS
- Response time
  - The time elapsed from the initiation of the execution until the last processor finishes its computation
- Notes
  - Run on the 6 datasets on a 16-node configuration
  - Because of the limited disk space available, the response times for the serial version were measured on 1 node's worth of data, i.e. 1/16th of the database
  - Repartitioning for CaD was done in the 4th pass (best performance)
32. [Response-time chart: DD, CaD, CD, and Serial]
Response times for CD and CaD are much lower than for DD, and close to the serial version run with 1/N of the data
33.
- DD was able to exploit the aggregate memory of the multiprocessor and make fewer passes for the datasets with large average transaction and frequent-itemset lengths.
- CaD makes just as many data passes as CD, because the large candidate sets that force CD into multiple subpasses all occur before CaD takes over with its redistribution pass.
34. [Chart: Normal DD vs. No-communication DD]
- Normal DD: the same 100 MB of data replicated on each of the 16 nodes
- No-communication DD: a node receives no data from other nodes and simply processes its local data 15 more times
- Half of the time taken by DD was for communication
- The I/O savings from DD making fewer passes become negligible
35.
- DD performs poorly for 2 reasons
  - Extra communication
  - Every node in the system must process every single transaction
- CaD communicates the entire dataset only ONCE, during the redistribution pass, but otherwise suffers the same problems as DD.
  - Unfortunately, even a single redistribution pass is costly. The savings from each processor running completely independently with a smaller candidate set cannot compensate for that cost.
- Although CD's overhead is small (less than 7.5% relative to the serial version), its synchronization cost can be large if the data distributions are skewed or the nodes are not equally capable (different memory, processor speed, I/O bandwidth, and capacities)
- Suggestion: CD + load balancing
36. TEST PARAMETERS (CD algorithm only)
- Scaleup
  - Increase the size of the database in direct proportion to the number of nodes in the system
- Sizeup
  - Fix the size of the multiprocessor at 16 nodes, while increasing the database from 25 MB per node to 400 MB per node
- Speedup
  - Fix the size of each database at 400 MB and vary the number of processors
37. SCALEUP
CD scales linearly: it keeps the response time almost constant as the database and multiprocessor sizes increase.
Reason: the itemsets found by CD do not change as the database size increases, so the number of candidates whose supports must be summed in the communication phase remains constant.
38. SIZEUP
CD shows sublinear performance: the program actually becomes more efficient as the database size increases.
Why more efficient: a larger database → more I/O and transaction processing → a smaller portion of the time is spent in communication.
39. SPEEDUP
CD has very good speedup performance up to 8 processors.
Larger datasets show better speedup characteristics: the more data processed per node, the less significant the communication time becomes.
40.
- Count Distribution attempts to minimize communication by replicating the candidate sets in each processor's memory
- Data Distribution maximizes the use of aggregate memory by having each processor work with the entire dataset but only a portion of the candidate set
- Candidate Distribution eliminates the synchronization cost at the end of every pass and maximizes the use of aggregate memory, while limiting heavy communication to a single redistribution pass
43.
- Count Distribution exhibited linear scaleup and excellent speedup and sizeup behaviour
- Data Distribution lost out because of the cost of broadcasting local data from each processor to every other processor, and Candidate Distribution lost because of the cost of data redistribution.
- Not all problems require an intricate parallelization