Sampling Large Databases for Association Rules (Toivonen

About This Presentation

Slides: 19

Provided by: Farz

Transcript and Presenter's Notes

1
Sampling Large Databases for Association
Rules (Toivonen's Approach, 1996)

2
Outline

3
Introduction

4
Introduction (Cont.): Overview of Toivonen's Method

Main Steps
Pick a random sample from the database.
Use the sample to determine all probable
association rules.
Verify the results against the rest of the database,
i.e., eliminate incorrectly detected association
rules and add missing ones.
The Main Contribution
To show that all exact frequencies can be found
efficiently by first analyzing a random sample
and then the whole database with the proposed
method.
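
The main steps above can be sketched roughly as follows. This is a minimal illustration, not the presenter's or Toivonen's code: the names frequency, frequent_itemsets and sample_then_verify are hypothetical, the miner is a brute-force search capped at max_size items, and lowered_fr is an assumed parameter for a threshold used only on the sample. The sketch covers sampling and the elimination of incorrectly detected sets; adding missed sets needs a further check that is not shown here.

```python
import random
from itertools import combinations

def frequency(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_fr, max_size=3):
    """Brute-force miner (illustration only): itemsets of up to
    `max_size` items whose frequency is at least `min_fr`."""
    items = sorted({i for t in transactions for i in t})
    found = set()
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if frequency(x, transactions) >= min_fr:
                found.add(x)
    return found

def sample_then_verify(db, min_fr, lowered_fr, sample_size, seed=0):
    """Steps 1-3 in miniature; `db` is a list of frozensets."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)               # step 1: random sample
    probable = frequent_itemsets(sample, lowered_fr)   # step 2: probable sets
    exact = {x: frequency(x, db) for x in probable}    # step 3: one pass over db
    # Keep only the sets that are really frequent in the whole database.
    return {x: fr for x, fr in exact.items() if fr >= min_fr}
```

Because the kept frequencies are computed on the whole database, every reported value is exact; the sample only decides which sets get counted.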

5
Preliminaries

6
Preliminaries

Association Rule: an implication X ⇒ Y where X, Y ⊆ I
and X ∩ Y = ∅
Support of Association Rule X ⇒ Y: Percentage of
transactions that contain X ∪ Y
Confidence of Association Rule X ⇒ Y: Ratio of the
number of transactions that contain X ∪ Y to the
number that contain X
Problem: Find the strong association rules of a
given set I with respect to the frequency threshold
min_fr and the confidence threshold min_conf.
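
As a small worked example of these definitions, the sketch below computes support and confidence on a toy transaction list. The function names and the toy data are made up for illustration; only min_fr and min_conf come from the slide.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X); defined as 0.0 when X never occurs."""
    sx = support(x, transactions)
    if sx == 0:
        return 0.0
    return support(x | y, transactions) / sx

def is_strong(x, y, transactions, min_fr, min_conf):
    """A rule X => Y is strong if it meets both thresholds."""
    return (support(x | y, transactions) >= min_fr
            and confidence(x, y, transactions) >= min_conf)

# Toy database (hypothetical data): four market-basket transactions.
toy_db = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

bread, milk = frozenset({"bread"}), frozenset({"milk"})
print(support(bread | milk, toy_db))    # 2 of 4 transactions
print(confidence(bread, milk, toy_db))  # 2 of the 3 transactions with bread
```

Here the rule bread ⇒ milk has support 2/4 and confidence 2/3, so it is strong for min_fr = 0.5 and min_conf = 0.6 but not for min_conf = 0.7.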

7
Algorithms for Mining Association Rules

Level-wise Algorithms
Idea: If a set is not frequent, then its
supersets cannot be frequent.
On level k, candidate itemsets X of size k are
generated such that all subsets of X are
frequent.
Partition Algorithm
Idea: Partition the data into sections small
enough to be handled in main memory.
First Pass: Find the locally frequent itemsets.
Second Pass: Count the union of the locally frequent
itemsets over the whole database.
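
The level-wise idea from this slide can be sketched as below. It is a simplified illustration rather than an efficient Apriori implementation: generate_candidates and levelwise are hypothetical names, and candidates are produced by brute force over all size-k combinations of the items seen on the previous level.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Size-k itemsets all of whose (k-1)-subsets are in `frequent_prev`."""
    items = sorted({i for s in frequent_prev for i in s})
    candidates = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates

def levelwise(transactions, min_fr):
    """Return {itemset: frequency} for all frequent itemsets."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # One pass over the data counts all candidates of this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c for c, cnt in counts.items() if cnt / n >= min_fr}
        for c in level:
            frequent[c] = counts[c] / n
        k += 1
        current = generate_candidates(level, k)
    return frequent
```

Each iteration of the loop corresponds to one database pass, which is the cost that sampling aims to reduce.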

8
Sampling for Frequent Sets

9
Negative Border

10
Frequent Set Discovery

Intuition: Given a collection S of sets that are
frequent, the negative border contains the
closest itemsets that could be frequent too.
After finding the collection of frequent
itemsets, S, we check the negative border of S.
Success: If no itemset in the negative border is
frequent, we can conclude that all frequent sets
have already been found. (Why?)
Decrease the minimum support used on the sample to
increase the chance of success.
Failure: If at least one frequent itemset is found in
the negative border, we can conclude that some of
its supersets may be frequent. (Why?)
In the case of failure, we can either report
failure and stop, or scan the database again and
check the supersets to find the exact result.
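
To make the negative border concrete, here is a minimal sketch that takes it to be the minimal itemsets not in S, that is, the sets outside S whose proper subsets are all in S. The function names and the toy collection are illustrative assumptions, and S is assumed to be closed under subsets (the empty set is left implicit).

```python
from itertools import combinations

def negative_border(s, items):
    """Minimal itemsets over `items` that are not in the collection `s`."""
    s = set(s)
    # Single items that are not in s are always minimal missing sets.
    border = {frozenset([i]) for i in items if frozenset([i]) not in s}
    # Any larger border set is a member of s extended by one item.
    for x in s:
        for i in items:
            if i in x:
                continue
            cand = x | {i}
            if cand in s:
                continue
            if all(frozenset(sub) in s
                   for sub in combinations(cand, len(cand) - 1)):
                border.add(cand)
    return border

def sampling_outcome(border, db, min_fr):
    """'success' if no border set is frequent in `db`, else 'failure',
    together with the border sets that turned out to be frequent."""
    n = len(db)
    hits = {x for x in border
            if n and sum(1 for t in db if x <= t) / n >= min_fr}
    return ("failure" if hits else "success"), hits

# Toy collection S over items A..F (hypothetical example data).
S = {frozenset(x) for x in ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]}
border = negative_border(S, "ABCDEF")
```

For this S the border is {B,C}, {B,F}, {D} and {E}: for example, {A,B,C} is excluded because its subset {B,C} is not in S.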
11
Toivonen's Algorithm
12
Failure Handling

In the fraction of cases where a possible failure
is reported, all frequent sets can be found by
making a second pass over the database.

The algorithm simply computes the collection of
all sets that could possibly be frequent.
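
One way to read "all sets that could possibly be frequent" is to grow the collection of sets verified as frequent by its negative border, repeatedly, until it stops growing. The sketch below follows that reading. Skipping sets already counted as infrequent in the first pass is an assumption of this sketch, not something the slides state, and the names possibly_frequent and known_infrequent are hypothetical.

```python
from itertools import combinations

def negative_border(s, items):
    """Minimal itemsets over `items` that are not in the collection `s`."""
    s = set(s)
    border = {frozenset([i]) for i in items if frozenset([i]) not in s}
    for x in s:
        for i in items:
            if i in x:
                continue
            cand = x | {i}
            if cand not in s and all(
                    frozenset(sub) in s
                    for sub in combinations(cand, len(cand) - 1)):
                border.add(cand)
    return border

def possibly_frequent(frequent_so_far, known_infrequent, items):
    """Extend the verified frequent sets with every set that could still
    be frequent; the new sets are the candidates for the second pass."""
    s = set(frequent_so_far)
    while True:
        # Assumption: sets already counted as infrequent are never re-added,
        # so none of their supersets can enter the collection either.
        new = negative_border(s, items) - set(known_infrequent)
        if not new:
            return s
        s |= new

# Toy first-pass result (hypothetical): AC and BC were border sets that
# turned out to be frequent, D was counted and found infrequent.
frequent = {frozenset(x) for x in ["A", "B", "C", "AB", "AC", "BC"]}
infrequent = {frozenset("D")}
second_pass = possibly_frequent(frequent, infrequent, "ABCD") - frequent
```

In this toy case the only new candidate is {A,B,C}. On larger inputs the expansion can add many sets, which matches the disadvantage noted in the conclusion.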
13
Analysis of Sampling

14
Experimental Results
15
Conclusion

Advantages
Reduced failure probability, while keeping
candidate-count low enough for memory
Disadvantages
Potentially large number of candidates
in second pass

16
References