GenMax - PowerPoint PPT Presentation

About This Presentation
Title:

GenMax

Slides: 23
Provided by: csta3
Tags: genmax | mfi


Transcript and Presenter's Notes

Title: GenMax


1
GenMax
  • From
  • Efficiently Mining Maximal Frequent Itemsets
  • By
  • Karam Gouda and Mohammed J. Zaki

2
The Problem
  • Given a large database of item transactions,
    find all frequent itemsets
  • A frequent itemset is a set of items that occurs
    in at least a user-specified percentage of the
    transactions in the database
  • We call this percentage min_sup (for minimum
    support).

3
  • A Maximal Frequent Itemset is a frequent
    itemset that does not have a frequent superset
  • FI = frequent itemsets
  • MFI = maximal frequent itemsets
  • Fact
  • |MFI| << |FI|
  • GenMax is an algorithm to find the exact MFI

4
Example
Tid/Item  A  B  C  D
1         x  x  x
2               x  x
3         x  x  x
4         x  x  x  x
5         x
6            x     x
7               x
Min_sup = 3
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
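To make the FI / MFI distinction concrete, here is a minimal brute-force sketch over the seven example transactions (taken from the tidsets listed on the vertical-database slide). It is not GenMax itself; the function names are made up for illustration, and min_sup is treated as an absolute transaction count, as in the example.

```python
from itertools import combinations

# Example transactions (tid -> items); MIN_SUP is an absolute count here.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, min_sup):
    """Brute force: test every non-empty subset of the item universe."""
    universe = sorted(set().union(*db.values()))
    return [
        frozenset(c)
        for k in range(1, len(universe) + 1)
        for c in combinations(universe, k)
        if support(frozenset(c), db) >= min_sup
    ]


def maximal(itemsets):
    """Keep only itemsets that have no proper superset in the collection."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]


fi = frequent_itemsets(DB, MIN_SUP)
mfi = maximal(fi)
print(sorted("".join(sorted(s)) for s in fi))   # the frequent itemsets
print(sorted("".join(sorted(s)) for s in mfi))  # the maximal ones
```

On this data the sketch finds eight frequent itemsets but only two maximal ones (ABC and D), which illustrates why mining the MFI directly can save work.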
5
Some Useful Definitions
  • The Combine-Set of an itemset I is the set of
    items that can be added to I to create a frequent
    itemset.
  • For example, in the previous example, the
    combine-set of the itemset A is {B, C}.
  • The combine-set of the empty itemset is called F1
    and is actually the set of frequent itemsets
    of size 1.
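The combine-set definition can be checked directly against the example database. The sketch below is only an illustration of the definition as stated on this slide; `combine_set` and `support` are hypothetical helper names, and min_sup is again an absolute count.

```python
# Example transactions as plain item sets.
DB = [
    {"A", "B", "C"},
    {"C", "D"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A"},
    {"B", "D"},
    {"C"},
]
MIN_SUP = 3


def support(itemset, db):
    """Count the transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def combine_set(itemset, db, min_sup):
    """Items x outside `itemset` such that itemset + x is still frequent."""
    items = set().union(*db)
    return sorted(
        x for x in items - itemset if support(itemset | {x}, db) >= min_sup
    )


print(combine_set({"A"}, DB, MIN_SUP))   # combine-set of A
print(combine_set(set(), DB, MIN_SUP))   # F1: the frequent single items
```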

6
(No Transcript)
7
(No Transcript)
8
Improvement
  • At each level, sort the combine-set (C) in
    increasing order of support
  • An itemset with low support has a smaller chance
    of producing a large combine-set in the next
    level
  • The sooner we prune the tree, the more work we
    save
  • This heuristic was first used in MaxMiner
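The ordering heuristic can be sketched inside a simple backtracking search. This is a simplified illustration and not the authors' pseudocode (the slides that presumably showed it have no transcript): supports are recomputed by scanning the database, and the superset check is a plain linear scan. All names are hypothetical.

```python
def mine_mfi(db, min_sup):
    """Backtracking search for maximal frequent itemsets (simplified sketch)."""

    def supp(itemset):
        return sum(1 for t in db if itemset <= t)

    mfi = []

    def backtrack(itemset, candidates):
        # Heuristic from the slide: try low-support extensions first.
        ordered = sorted(candidates, key=lambda x: supp(itemset | {x}))
        for i, x in enumerate(ordered):
            new = itemset | {x}
            # Combine-set for the next level: later items that stay frequent.
            possible = [y for y in ordered[i + 1:] if supp(new | {y}) >= min_sup]
            # Prune: everything reachable from here is already covered.
            if any(new | set(possible) <= m for m in mfi):
                continue
            if possible:
                backtrack(new, possible)
            else:
                mfi.append(new)  # cannot be extended and is not covered

    f1 = sorted(x for x in set().union(*db) if supp(frozenset([x])) >= min_sup)
    backtrack(frozenset(), f1)
    return mfi


DB = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]
result = mine_mfi(DB, 3)
print(sorted("".join(sorted(m)) for m in result))
```

On the example database the low-support item D is tried first and turns out to be maximal immediately; after ABC is found, the branches starting at B and C are pruned by the superset check without further support counting.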

9
Bottlenecks
  • 1. Superset checking
  • The best algorithms for superset checking give
    an amortized bound of [bound not transcribed] per
    operation.
  • That is bad if we have many itemsets in the MFI.
  • 2. Frequency testing
  • How can we make frequency testing faster?

10
Optimizing Superset Checking
  • A technique called Progressive Focusing is
    used to narrow down the group of potential
    supersets, as the recursive calls are made
  • LMFI := Local MFI
  • Before each recursive call, we construct the LMFI
    for the next call, based on the current LMFI and
    the new item added.
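A rough sketch of the narrowing step follows. It assumes that the LMFI for the next call keeps exactly those members of the current LMFI that contain the newly added item; the slide only says the new LMFI is built from the current LMFI and the new item, so treat this rule as an assumption. The sketch also leaves out how newly found maximal itemsets are passed back to the caller. The itemsets and function names are made up for illustration.

```python
def next_lmfi(lmfi, item):
    """Assumed narrowing rule: keep only the itemsets that contain `item`."""
    return [m for m in lmfi if item in m]


def has_superset(itemset, lmfi):
    """Superset check against the (smaller) local list only."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets known so far while extending itemset A.
lmfi = [frozenset("ABC"), frozenset("ABE"), frozenset("ADE")]

# Adding item B: only maximal sets containing B can be supersets from now on.
lmfi_b = next_lmfi(lmfi, "B")
print(["".join(sorted(m)) for m in lmfi_b])
print(has_superset(frozenset("AB"), lmfi_b))
print(has_superset(frozenset("BD"), lmfi_b))
```

Each recursive call then checks candidates against a list that shrinks as the itemset grows, rather than against the whole MFI.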

11
LMFI Example
FGHI FGHJ
FGH FGI
FG
12
(No Transcript)
13
Frequency Testing Optimization
  • GenMax uses a vertical database format
  • For each item, we have a set of all the
    transactions containing this item.
  • This set is called a tidset (Transaction ID
    Set).
  • This method makes support computations easier,
    because we do not have to go over the entire
    database.

14
Vertical Database
Tid/Item  A  B  C  D
1         x  x  x
2               x  x
3         x  x  x
4         x  x  x  x
5         x
6            x     x
7               x
A: 1, 3, 4, 5
B: 1, 3, 4, 6
C: 1, 2, 3, 4, 7
D: 2, 4, 6
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
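The conversion to the vertical format and the resulting support computation can be sketched in a few lines. This is an illustrative sketch only; `to_vertical` and `tidset` are hypothetical names, and the tidset of an itemset is computed as the intersection of its items' tidsets.

```python
from collections import defaultdict

# Horizontal format: tid -> items in that transaction.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def to_vertical(db):
    """Build item -> tidset (the set of tids whose transaction has the item)."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def tidset(itemset, vertical):
    """t(I): intersect the tidsets of the items in a non-empty itemset."""
    return set.intersection(*(vertical[x] for x in itemset))


vertical = to_vertical(DB)
t_a = tidset("A", vertical)
t_ac = tidset("AC", vertical)
print(sorted(t_a), sorted(t_ac), len(t_ac))  # supp(AC) = |t(AC)|
```

Support is then just the size of a tidset, so no pass over the full database is needed once the item tidsets exist.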
15
ABC ABD ABE AB
C, E
t(ABC) t(ABE)
Each item y in the combine-set of an itemset I
actually represents the itemset I ∪ {y},
and stores the tidset associated with it.
16
Additional Optimization
  • Diffsets: do not store the entire tidsets, only
    the differences between tidsets (described in
    "Fast Vertical Mining Using Diffsets")
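A small worked sketch of the diffset idea on the example tidsets follows. The slide only says that differences between tidsets are stored; the concrete formulas in the comments are the usual diffset identities and should be read as an assumption about what the cited work uses. Variable names are illustrative.

```python
# Tidsets of three items from the example database.
t = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
}

# First level: the diffset of AB relative to its prefix A is t(A) - t(B).
d_ab = t["A"] - t["B"]
supp_ab = len(t["A"]) - len(d_ab)   # supp(AB) = supp(A) - |d(AB)|

d_ac = t["A"] - t["C"]
supp_ac = len(t["A"]) - len(d_ac)   # supp(AC) = supp(A) - |d(AC)|

# Next level: d(ABC) comes from the two diffsets alone, no tidsets needed.
d_abc = d_ac - d_ab
supp_abc = supp_ab - len(d_abc)     # supp(ABC) = supp(AB) - |d(ABC)|

print(d_ab, supp_ab, d_abc, supp_abc)

# Cross-check against plain tidset intersection.
assert supp_abc == len(t["A"] & t["B"] & t["C"])
```

Here the diffsets hold at most one tid while the tidsets hold three or more, which hints at why diffsets can reduce memory use on dense data.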

17
Experimental Results
  • GenMax is compared with
  • MaxMiner, MAFIA, MAFIA-PP
  • MaxMiner and MAFIA-PP give the exact MFI, while
    MAFIA gives a superset of the MFI
  • The databases used in the experiments are grouped
    according to the MFI length distribution

18
Type I Datasets
19
Type II Datasets
20
Type III Datasets
21
Type IV Datasets
22
The End