Title: GenMax
1GenMax
- From: Efficiently Mining Maximal Frequent Itemsets
- By: Karam Gouda and Mohammed J. Zaki
2The Problem
- Given a large database of item transactions, find all frequent itemsets.
- A frequent itemset is a set of items that occurs in at least a user-specified percentage of the database.
- We call this percentage min_sup (for minimum support).
3
- A Maximal Frequent Itemset is a frequent itemset that does not have a frequent superset.
- FI = frequent itemsets
- MFI = maximal frequent itemsets
- Fact: |MFI| << |FI|
- GenMax is an algorithm to find the exact MFI.
4Example
Tid/Item  A  B  C  D
1         x  x  x
2               x  x
3         x  x  x
4         x  x  x  x
5         x
6            x     x
7               x
Min_sup = 3
Itemset lattice:
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
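To make the example concrete, here is a minimal brute-force sketch in Python that enumerates the lattice above and reports which itemsets are frequent and which are maximal. It is not GenMax itself, only a reference check. The names `DB`, `support`, `frequent_itemsets` and `maximal` are illustrative, and Min_sup is assumed to be an absolute count of 3 transactions.

```python
from itertools import combinations

# Example database from the slide: transaction id -> items it contains.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumed to be an absolute transaction count


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, min_sup):
    """Brute force: test every non-empty itemset in the lattice."""
    all_items = sorted(set().union(*db.values()))
    result = []
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            if support(set(combo), db) >= min_sup:
                result.append(frozenset(combo))
    return result


def maximal(itemsets):
    """Keep only the itemsets with no proper superset in the collection."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]


fi = frequent_itemsets(DB, MIN_SUP)
mfi = maximal(fi)
print(sorted("".join(sorted(s)) for s in fi))
print(sorted("".join(sorted(s)) for s in mfi))
```

Under these assumptions the frequent itemsets are A, B, C, D, AB, AC, BC and ABC, and the maximal ones are ABC and D.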
5Some Useful Definitions
- The combine-set of an itemset I is the set of items that can be added to I to create a frequent itemset.
- For example, in the previous example, the combine-set of the itemset A is {B, C}.
- The combine-set of the empty itemset is called F1 and is actually the set of frequent itemsets of size 1.
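The definition can be sketched directly in Python. This is a naive version that rescans the database for each candidate item. The function names are illustrative, and the data and threshold are those of the earlier example, with Min_sup assumed to be an absolute count.

```python
# Example database: transaction id -> items it contains.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumed absolute count


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def combine_set(itemset, db, min_sup):
    """Items x outside `itemset` such that itemset + {x} is still frequent."""
    all_items = set().union(*db.values())
    return {
        x
        for x in all_items - itemset
        if support(itemset | {x}, db) >= min_sup
    }


print(sorted(combine_set({"A"}, DB, MIN_SUP)))   # combine-set of A
print(sorted(combine_set(set(), DB, MIN_SUP)))   # F1
```

For the itemset A this yields B and C, since AD occurs only in transaction 4. For the empty itemset it yields all four items, which is F1.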
8Improvement
- At each level, sort the combine-set (C) in increasing order of support.
- An itemset with low support has a smaller chance of producing a large combine-set in the next level.
- The sooner we prune the tree, the more work we save.
- This heuristic was first used in MaxMiner.
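A rough sketch of how the reordering fits into a depth-first backtracking search for maximal itemsets is given below. This is a simplified illustration, not the authors' exact pseudocode: the function `mfi_backtrack`, its parameters, and the naive `support` scan are assumptions made for the example, and Min_sup is taken as an absolute count.

```python
# Example database: transaction id -> items it contains.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumed absolute count


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def mfi_backtrack(itemset, combine, db, min_sup, mfi):
    """Extend `itemset` with items from `combine`; collect maximal sets in `mfi`."""
    # Heuristic from the slide: try the lowest-support extensions first.
    combine = sorted(combine, key=lambda x: support(itemset | {x}, db))
    for i, x in enumerate(combine):
        new_itemset = itemset | {x}
        possible = combine[i + 1:]
        # Prune: everything reachable from here is already covered by a known
        # maximal itemset, and so is everything reachable from later items.
        if any(new_itemset | set(possible) <= m for m in mfi):
            return
        new_combine = [
            y for y in possible if support(new_itemset | {y}, db) >= min_sup
        ]
        if not new_combine:
            # No frequent extension: keep the itemset unless it is subsumed.
            if not any(new_itemset <= m for m in mfi):
                mfi.append(frozenset(new_itemset))
        else:
            mfi_backtrack(new_itemset, new_combine, db, min_sup, mfi)


f1 = [x for x in "ABCD" if support({x}, DB) >= MIN_SUP]
found = []
mfi_backtrack(set(), f1, DB, MIN_SUP, found)
print(sorted("".join(sorted(m)) for m in found))
```

On the example database the search visits D first (support 3), records it as maximal, then finds ABC and prunes the remaining branches through the superset check.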
9Bottlenecks
- 1. Superset checking
- The best algorithms for superset checking give an amortized bound of [...] per operation.
- That is costly if we have many itemsets in the MFI.
- 2. Frequency testing
- How can we make frequency testing faster?
10Optimizing Superset Checking
- A technique called Progressive Focusing is used to narrow down the group of potential supersets as the recursive calls are made.
- LMFI = Local MFI
- Before each recursive call, we construct the LMFI for the next call, based on the current LMFI and the new item added.
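One way to read the construction step: a superset of the extended itemset must contain the new item, so the next LMFI can be obtained by filtering the current one. The sketch below illustrates only that filtering step. The helper names `next_lmfi` and `has_superset` and the sample maximal itemsets are hypothetical.

```python
def next_lmfi(lmfi, new_item):
    """Keep only the maximal itemsets that contain the newly added item."""
    return [m for m in lmfi if new_item in m]


def has_superset(itemset, lmfi):
    """True if some itemset in the (local) MFI contains `itemset`."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets found so far.
mfi_so_far = [frozenset("ABC"), frozenset("ABD"), frozenset("CE"), frozenset("DE")]

lmfi_a = next_lmfi(mfi_so_far, "A")    # candidates that contain A
lmfi_ab = next_lmfi(lmfi_a, "B")       # candidates that contain A and B
lmfi_abd = next_lmfi(lmfi_ab, "D")     # candidates that contain A, B and D

print([sorted(m) for m in lmfi_a])
print([sorted(m) for m in lmfi_abd])
print(has_superset(frozenset("ABD"), lmfi_abd))
```

Each recursive level checks against a list that is never longer than its parent's, so deep branches compare against only a few candidates instead of the whole MFI.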
11LMFI Example
FGHI FGHJ
FGH FGI
FG
13Frequency Testing Optimization
- GenMax uses a vertical database format.
- For each item, we have a set of all the transactions containing this item.
- This set is called a tidset (Transaction ID Set).
- This method makes support computations easier, because we do not have to go over the entire database.
14Vertical Database
Tid/Item  A  B  C  D
1         x  x  x
2               x  x
3         x  x  x
4         x  x  x  x
5         x
6            x     x
7               x
A: 1, 3, 4, 5
B: 1, 3, 4, 6
C: 1, 2, 3, 4, 7
D: 2, 4, 6
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
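The conversion and the support computation can be sketched as follows. The helper names `to_vertical` and `tidset` are illustrative. The data is the example database, and `tidset` assumes a non-empty itemset.

```python
from collections import defaultdict

# Horizontal format: transaction id -> items it contains.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def to_vertical(db):
    """Build the vertical format: item -> set of transaction ids (its tidset)."""
    tidsets = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def tidset(itemset, tidsets):
    """t(I): intersection of the tidsets of the items in I (I must be non-empty)."""
    items = list(itemset)
    result = set(tidsets[items[0]])
    for item in items[1:]:
        result &= tidsets[item]
    return result


vertical = to_vertical(DB)
print(sorted(vertical["A"]))             # t(A)
print(sorted(tidset("AC", vertical)))    # t(AC)
print(len(tidset("AC", vertical)))       # supp(AC) = |t(AC)|
```

Support of an itemset now costs one set intersection per item instead of a pass over all transactions.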
15
ABC ABD ABE
AB
Combine-set: C, E
Tidsets: t(ABC), t(ABE)
Each item y in the combine-set of an itemset I actually represents the itemset I ∪ {y}, and stores the tidset associated with it.
16Additional Optimization
- Diffsets: do not store the entire tidsets, only the differences between tidsets (described in "Fast Vertical Mining Using Diffsets").
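As a small illustration of the idea, the sketch below assumes the usual diffset definition, d(PX) = t(P) \ t(X), so that supp(PX) = supp(P) - |d(PX)|, and for two extensions of the same prefix, d(PXY) = d(PY) \ d(PX). The tidsets are those of the example database, and the variable names are illustrative.

```python
# Tidsets of the single items in the example database.
T = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def diffset(prefix_tids, item_tids):
    """d(PX) = t(P) minus t(X): transactions of P that do NOT contain X."""
    return prefix_tids - item_tids


# Extensions of the prefix A.
supp_a = len(T["A"])
d_ab = diffset(T["A"], T["B"])      # transactions with A but without B
d_ac = diffset(T["A"], T["C"])      # transactions with A but without C
supp_ab = supp_a - len(d_ab)
supp_ac = supp_a - len(d_ac)

# One level deeper, only diffsets are needed: d(ABC) = d(AC) minus d(AB).
d_abc = d_ac - d_ab
supp_abc = supp_ab - len(d_abc)

print(sorted(d_ab), supp_ab)
print(sorted(d_ac), supp_ac)
print(sorted(d_abc), supp_abc)
```

Here each diffset holds at most one transaction id while the corresponding tidsets hold three, which is the kind of saving the technique targets on dense data.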
17Experimental Results
- GenMax is compared with MaxMiner, MAFIA, and MAFIA-PP.
- MaxMiner and MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI.
- The databases used in the experiments are grouped according to the MFI length distribution.
18Type I Datasets
19Type II Datasets
20Type III Datasets
21Type IV Datasets
22The End