Title: Arif Djunaidy Rully Soelaiman Daning Tyaspamadya
1MINING ASSOCIATION RULES FROM LARGE DATABASES
USING THE LATTICE-BASED APPROACH AND HYBRID
SEARCH METHOD
- Arif Djunaidy, Rully Soelaiman, Daning Tyaspamadya
Faculty of Information Technology ITS - Surabaya
2Background - 1
- In data mining, association rules represent relationships that may exist among items in a transactional database
- Because association rules may reflect customer behavior, identifying the frequent itemsets and forming the conditional implication rules among items are of paramount importance
- Efficient algorithms capable of reducing the overheads of mining meaningful association rules are therefore required
- However, for large databases, extracting a set of meaningful association rules may require substantial memory and repeated database scans, which in turn increase the overall computing time of the mining process
3Background - 2
- The task of discovering all frequent associations in very large databases is quite challenging
- The search space is exponential in the number of database attributes
- With millions of database objects, the problem of I/O minimization becomes paramount
- Most current approaches are iterative in nature, requiring multiple database scans
- Most approaches use very complicated internal data structures, which have poor locality and add space and computation overheads
4Key Features of Our Approach
- All frequent itemsets are enumerated via simple tid-list intersections
- A lattice-theoretic approach is used to decompose the original search space (lattice) into smaller pieces (sub-lattices) that can be processed independently and more easily
- A hybrid search strategy is used to enumerate the frequent itemsets within each sub-lattice
- Our approach is designed to involve only a few database scans to minimize the I/O costs
5Problem Statement - 1
- An association rule can be written as A → B, where
  - A is an itemset called the antecedent or left-hand side (LHS), and
  - B is an itemset called the consequent or right-hand side (RHS)
- The association mining task is to discover a set of association rules among a large number of objects in a given database
6Problem Statement - 2
- The fundamental task in mining association rules is to generate all association rules X → Y (X, Y are itemsets) that can be extracted from the database. These rules must satisfy both the support and confidence constraints
  - Support constraint: Sup(X ∪ Y) ≥ MinSup
  - Confidence constraint: Sup(X ∪ Y) / Sup(X) ≥ MinCof
- The support of an itemset X, Sup(X), is defined as the number of transactions in which X occurs as a subset
- An itemset is categorized as a frequent itemset if its support is at least the minimum support (MinSup) supplied by the user
- The confidence factor represents the conditional probability that a transaction contains Y, given that it contains X
- An association rule is said to be confident if its confidence factor is at least the minimum confidence (MinCof) supplied by the user
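
The support and confidence definitions on this slide can be illustrated with a minimal sketch. The transactions, the function names (`support`, `confidence`, `satisfies`), and the choice to treat both thresholds as inclusive are assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical transactions (not the deck's data): each one is a set of items.
transactions = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"cereal", "bread"},
    {"milk", "bread"},
    {"cereal", "milk", "eggs"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions):
    """Sup(X ∪ Y) / Sup(X): the share of X's transactions that also hold Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def satisfies(x, y, transactions, min_sup, min_conf):
    """True if X → Y meets both constraints (min_sup is an absolute count)."""
    return (
        support(set(x) | set(y), transactions) >= min_sup
        and confidence(x, y, transactions) >= min_conf
    )


print(support({"cereal", "milk"}, transactions))       # 3
print(confidence({"cereal"}, {"milk"}, transactions))  # 0.75
```

Here cereal → milk has support 3 (out of 5 transactions) and confidence 3/4, so it passes with `min_sup=3, min_conf=0.7` but fails if `min_conf` is raised to 0.8.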
7Simple Example - 1
- Consider the sales database of a food store, where the objects represent customers and the items represent foods
- In this example, the discovered patterns are the sets of foods frequently bought together by the customers
- An example pattern could be that 60 percent of the customers who buy cereal also buy milk
- The store can then use this knowledge for shelf placement, stock control, etc.
- There are many potential application areas for association rule technology, including catalog design, customer segmentation, store layout, and so on
8Simple Example - 2
MinSup = 50%, MinCof = 100%
9The Lattice-Based Approach - 1
- We use the lattice-theoretic approach to
  - Identify all frequent itemsets
  - Count the support of association rules
- Prerequisite: construct the tid-lists from the transaction database
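
The prerequisite step, turning the transaction database into tid-lists, can be sketched as follows. This assumes the database is available as `(tid, items)` pairs; the function name `build_tid_lists` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_tid_lists(database):
    """Map each item to the sorted list of ids of the transactions holding it."""
    tid_lists = defaultdict(list)
    for tid, items in database:       # a single pass over the database
        for item in set(items):       # ignore duplicate items in a transaction
            tid_lists[item].append(tid)
    return {item: sorted(tids) for item, tids in tid_lists.items()}


# Hypothetical database given as (tid, items) pairs.
database = [
    (1, ["A", "C", "D"]),
    (2, ["A", "B", "C"]),
    (3, ["B", "C"]),
    (4, ["A", "C"]),
]

tid_lists = build_tid_lists(database)
print(tid_lists["A"])  # [1, 2, 4]
```

The length of an item's tid-list is its support, so no further pass over the transactions is needed to count single items.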
10The Lattice-Based Approach - 2
- Construct the powerset lattice P(I)
[Figure: powerset lattice P(I); labels: MinSup = 50%, maximal frequent itemsets]
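
For a small item set, the powerset lattice and its maximal frequent itemsets can be enumerated by brute force. The sketch below is only meant to make the two notions concrete: it is exponential in the number of items and is not the lattice decomposition used in this approach. The data, the function names, and the inclusive threshold are assumptions.

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of `items` as a frozenset (the lattice P(I))."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def frequent_itemsets(transactions, min_sup):
    """All non-empty itemsets contained in at least `min_sup` transactions."""
    items = set().union(*transactions)
    return {
        s for s in powerset(items)
        if s and sum(1 for t in transactions if s <= t) >= min_sup
    }


def maximal(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical data: 4 transactions, so a 50% threshold is 2 transactions.
transactions = [{"A", "C", "D"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}]
frequent = frequent_itemsets(transactions, min_sup=2)
print(sorted(sorted(s) for s in maximal(frequent)))  # [['A', 'C'], ['B', 'C']]
```

Every frequent itemset is a subset of some maximal frequent itemset, which is why the maximal elements summarize the frequent part of the lattice.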
11The Lattice-Based Approach - 3
- Compute the support of itemsets via tid-list intersections
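
Support counting by tid-list intersection can be sketched in a few lines. The tid-lists below are hypothetical, and `itemset_tids` is an illustrative helper name rather than part of the application described in these slides.

```python
def itemset_tids(itemset, tid_lists):
    """Intersect the tid-lists of all items in a non-empty `itemset`.

    The size of the returned set is the support of the itemset.
    """
    items = list(itemset)
    tids = set(tid_lists[items[0]])
    for item in items[1:]:
        tids &= set(tid_lists[item])
    return tids


# Hypothetical tid-lists (item -> ids of the transactions containing it).
tid_lists = {"A": [1, 2, 4], "B": [2, 3], "C": [1, 2, 3, 4], "D": [1]}

print(len(itemset_tids(["A", "C"], tid_lists)))       # 3
print(len(itemset_tids(["A", "B", "C"], tid_lists)))  # 1
```

Because the intersection result is itself a tid-list, the support of a longer itemset can be obtained from the tid-lists of two of its subsets without rescanning the database.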
12Hybrid Search for Freq. Itemsets - 1
- Hybrid search is used to quickly enumerate all frequent itemsets
- Hybrid search combines the top-down and bottom-up search strategies and is based on the intuition that the greater the support of a frequent itemset, the more likely it is to be part of a longer frequent itemset
- The hybrid approach is divided into two main steps
  - An initial phase that rearranges the atoms, and
  - The hybrid process itself, which generates all frequent itemsets. In this second step, the recursive process is repeated until no more frequent itemsets can be generated
13Hybrid Search for Freq. Itemsets - 2
- The first step simply rearranges the atoms in descending order of their supports, using a sorting algorithm
- The second step starts by intersecting the atoms one at a time
- The intersection process starts from the atoms with the largest supports, producing longer and longer frequent itemsets
- The process stops when an extension becomes infrequent (i.e., an itemset that does not satisfy the minimum support requirement)
- The second, bottom-up phase is then entered
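
The two steps described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the bottom-up phase here simply enumerates every prefix class of the sorted atoms instead of working only on what the first phase leaves over, the threshold is treated as inclusive, and the function name `hybrid_search` and the tid-lists are hypothetical.

```python
def hybrid_search(tid_lists, min_sup):
    """Simplified sketch: returns ({itemset: support}, longest greedy itemset)."""
    # Step 1: keep the frequent atoms, sorted by descending support.
    atoms = sorted(
        ((item, frozenset(tids)) for item, tids in tid_lists.items()
         if len(tids) >= min_sup),
        key=lambda atom: (-len(atom[1]), atom[0]),
    )
    frequent = {frozenset([item]): len(tids) for item, tids in atoms}
    if not atoms:
        return frequent, frozenset()

    # Step 2a: start from the highest-support atom and add one atom at a
    # time; stop at the first extension that becomes infrequent.
    longest, longest_tids = frozenset([atoms[0][0]]), atoms[0][1]
    for item, tids in atoms[1:]:
        extended_tids = longest_tids & tids
        if len(extended_tids) < min_sup:
            break
        longest, longest_tids = longest | {item}, extended_tids
        frequent[longest] = len(extended_tids)

    # Step 2b: bottom-up phase. Join each itemset with the later members of
    # its class and recurse until no more frequent itemsets are generated.
    def bottom_up(prefix_class):
        for i, (itemset, tids) in enumerate(prefix_class):
            next_class = []
            for other, other_tids in prefix_class[i + 1:]:
                joined_tids = tids & other_tids
                if len(joined_tids) >= min_sup:
                    joined = itemset | other
                    frequent[joined] = len(joined_tids)
                    next_class.append((joined, joined_tids))
            if next_class:
                bottom_up(next_class)

    bottom_up([(frozenset([item]), tids) for item, tids in atoms])
    return frequent, longest


# Hypothetical tid-lists; with 4 transactions, a 50% threshold is 2.
tid_lists = {"A": [1, 2, 4], "B": [2, 3], "C": [1, 2, 3, 4], "D": [1]}
frequent, longest = hybrid_search(tid_lists, min_sup=2)
print(sorted(longest))  # ['A', 'C']
```

In this example the atoms are ordered C, A, B; the greedy phase reaches {A, C} and stops when adding B drops the support to 1, after which the bottom-up phase also finds {B, C}.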
14Hybrid Search for Freq. Itemsets - 3
[Figure: hybrid search example showing the infrequent itemsets (MinSup = 50%)]
15Design of Application
16Test Data
Statistics of Test Data
17Experimental Results - 1
Number of k-itemsets
18Experimental Results - 2
Number of Association Rules
19Experimental Results - 3
Computing Time
20Experimental Results - 4
Support Counting Performance
21Experimental Results - 5
Comparison Results
22Conclusions
- Experimental results show that this approach, together with the hybrid search method, reduces computing time compared with both Apriori-based algorithms and the similar lattice-based approach that uses the bottom-up search strategy
- Another advantage of the lattice-based algorithm concerns the time spent scanning the database. The lattice-based algorithm requires only a single database scan, so the I/O overhead is minimized
- As far as computing speed is concerned, substantial computing time is still required to process large databases. Although the lattice-based approach is relatively powerful, this indicates that other computing methodologies, such as parallel algorithms in distributed computing environments, need to be considered to address the computing speed problem