Sampling Large Databases for Association Rules (Toivonen's Approach, 1996)

1
Sampling Large Databases for Association Rules
(Toivonen's Approach, 1996)
  • Farzaneh Mirzazadeh
  • Fall 2007

2
Outline
  • Introduction
  • Preliminaries
  • Definitions and Problem Statement
  • Two General Approaches
  • Sampling Method for Mining Association Rules
  • The algorithm
  • Analysis
  • Experimental Results

3
Introduction
  • Problem: discovery of association rules
  • Domain: very large databases
  • Bottleneck: time
  • Main-memory processing: negligible
  • Disk I/O: an influential factor
  • Suggestion: minimize the number of scans of the
    database
  • Only one full pass over the database

4
Introduction (Cont.): Overview of Toivonen's Method
  • Main Steps
  • Pick a random sample from the database.
  • Use the sample to determine all probable
    association rules.
  • Verify the results with the rest of the database,
    i.e., eliminate incorrectly detected association
    rules and add missing association rules.
  • The Main Contribution
  • To show that all exact frequencies can be found
    efficiently, by analyzing first a random sample
    and then the whole database with the proposed
    method.

5
Preliminaries
  • Items
  • $I = \{I_1, I_2, \ldots, I_m\}$
  • Transactions
  • $r = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$
  • Support of an itemset
  • Percentage of transactions which contain that
    itemset.
  • Frequent Itemsets
  • Association Rules
  • Strong Association Rules
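
The definition of support above can be made concrete with a minimal sketch. The toy database and the function name `support` are illustrative assumptions, not part of the presentation; transactions are modeled as Python frozensets.

```python
def support(itemset, transactions):
    """Return the fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical toy database: each transaction is a set of items.
transactions = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]

# {A, C} occurs in 2 of the 4 transactions.
print(support({"A", "C"}, transactions))  # 0.5
```

An itemset is then frequent when its support reaches the chosen threshold, for example `support(x, transactions) >= 0.5`.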

6
Preliminaries
  • Association rule: an implication $X \Rightarrow Y$, where $X, Y \subseteq I$
    and $X \cap Y = \emptyset$
  • Support of association rule $X \Rightarrow Y$: percentage of
    transactions that contain $X \cup Y$
  • Confidence of association rule $X \Rightarrow Y$: ratio of the
    number of transactions that contain $X \cup Y$ to the
    number that contain $X$
  • Problem: find the strong association rules over a
    given item set I with respect to frequency threshold min_fr and
    confidence threshold min_conf.
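
As a small illustration of these definitions, the sketch below computes the support and confidence of a rule and tests whether it is strong. The helper names and the toy database are assumptions made for the example; only `min_fr` and `min_conf` come from the slide.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Support of X union Y divided by the support of X (0.0 if X never occurs)."""
    sx = support(x, transactions)
    if sx == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / sx


def is_strong(x, y, transactions, min_fr, min_conf):
    """A rule X => Y is strong if it meets both thresholds."""
    rule_support = support(frozenset(x) | frozenset(y), transactions)
    return rule_support >= min_fr and confidence(x, y, transactions) >= min_conf


# Hypothetical toy database.
transactions = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]

# Rule {A} => {C}: support 2/4, confidence (2/4) / (3/4) = 2/3.
print(is_strong({"A"}, {"C"}, transactions, min_fr=0.5, min_conf=0.6))  # True
```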

7
Algorithms for Mining Association Rules
  • Level-wise Algorithms
  • Idea: if a set is not frequent, then its
    supersets cannot be frequent.
  • On level k, candidate itemsets X of size k are
    generated such that all proper subsets of X are
    frequent.
  • Partition Algorithm
  • Idea: partition the data into sections small
    enough to be handled in main memory.
  • First pass: find the locally frequent itemsets of each partition.
  • Second pass: count the union of the locally frequent
    itemsets over the whole database.
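
The level-wise idea can be sketched as follows. This is a simplified illustration, not the presenter's code: candidates are generated by brute force over item combinations, which is easy to read but slower than a real implementation, and each level rescans the data. All names are hypothetical.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Sets of size k whose (k-1)-subsets are all in `frequent_prev`."""
    prev = set(frequent_prev)
    items = sorted({i for s in prev for i in s})
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
    }


def levelwise(transactions, min_fr):
    """Return all itemsets whose support is at least `min_fr`."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}  # level-1 candidates
    frequent = set()
    k = 1
    while level:
        # One scan of the data per level.
        freq_k = {
            c for c in level
            if sum(1 for t in transactions if c <= t) / n >= min_fr
        }
        frequent |= freq_k
        k += 1
        level = generate_candidates(freq_k, k)
    return frequent


# Hypothetical toy database.
transactions = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
result = levelwise(transactions, min_fr=0.5)
```

With this toy data, {A,B,C} is never counted because its subset {A,B} is not frequent, which is exactly the pruning the slide describes.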

8
Sampling for Frequent Sets
  • Major Steps
  • Random sampling
  • Finding the frequent itemsets of the sample
  • Finding other probable candidates using the
    concept of Negative Border
  • Using the rest of the database to check the
    candidates

9
Negative Border
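  • All sets that are not in our collection of frequent itemsets,
    but all of whose proper subsets are.
  • Equivalently: the minimal itemsets not in S, where S
    is the collection of frequent itemsets
  • Example (items A, ..., F)
  • $S = \{\{A\}, \{B\}, \{C\}, \{F\}, \{A,B\}, \{A,C\}, \{A,F\}, \{C,F\}, \{A,C,F\}\}$
  • Negative border of $S$: $\{\{B,C\}, \{B,F\}, \{D\}, \{E\}\}$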
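
The example above can be reproduced with a short sketch. It assumes that S is closed under subsets, as a collection of frequent itemsets is; the function name `negative_border` is chosen for this illustration.

```python
from itertools import combinations


def negative_border(S, items):
    """Minimal itemsets over `items` that are not in S (S closed under subsets)."""
    S = {frozenset(s) for s in S}
    # A singleton outside S is minimal: its only proper subset is the empty set.
    border = {frozenset([i]) for i in items if frozenset([i]) not in S}
    largest = max((len(s) for s in S), default=0)
    for k in range(2, largest + 2):
        # Only items occurring in the (k-1)-sets of S can form a border set of size k.
        base = sorted({i for s in S if len(s) == k - 1 for i in s})
        for c in map(frozenset, combinations(base, k)):
            if c not in S and all(c - {i} in S for i in c):
                border.add(c)
    return border


# The collection S from the slide's example, over items A..F.
S = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
border = negative_border(S, "ABCDEF")
# border contains {B,C}, {B,F}, {D} and {E}
```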

10
Frequent Set Discovery
  • Intuition: given a collection S of sets that are
    frequent, the negative border contains the
    closest itemsets that could be frequent too.
  • After finding the collection of frequent
    itemsets, S, we check the negative border of S.
  • Success: if no set in the negative border turns out to be
    frequent, we can conclude that all frequent sets have
    already been found. (Why?)
  • Decrease the minimum support used on the sample to increase the chance
    of success.
  • Failure: if at least one frequent itemset is found in the
    negative border, we can conclude that some of
    its supersets may be frequent. (Why?)
  • In the case of failure, we can either report
    failure and stop, or scan the database again and
    check the supersets to find the exact result.
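
The success and failure cases can be illustrated with a sketch of the checking step. It assumes S and its negative border have already been computed from the sample; the function name and the toy data are hypothetical.

```python
def check_candidates(S, border, database, min_fr):
    """Count S and its negative border in one pass over `database`.

    Returns the sets that are frequent in the database and the "misses":
    border sets that turned out to be frequent.
    """
    S = {frozenset(x) for x in S}
    border = {frozenset(x) for x in border}
    counts = dict.fromkeys(S | border, 0)
    for t in database:  # the single full pass
        for c in counts:
            if c <= t:
                counts[c] += 1
    frequent = {c for c, n in counts.items() if n / len(database) >= min_fr}
    misses = frequent & border
    return frequent, misses


# Hypothetical case: the sample suggested S = {A, C, AC}, whose negative
# border over items A..D is {B, D}.
database = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
frequent, misses = check_candidates(["A", "C", "AC"], ["B", "D"], database, min_fr=0.5)
# misses is non-empty ({B}), so supersets such as {B,C} may be frequent
# even though they were never counted: a possible failure.
```

An empty `misses` set corresponds to the success case: every set outside the counted collection has a subset in the negative border that is known to be infrequent.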

11
Toivonen's Algorithm
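
The slide's own listing did not survive in this transcript. The following is a minimal sketch of the steps outlined earlier (sample, mine the sample with a lowered threshold, check S and its negative border in one full pass). It is an illustration under assumptions, not the paper's pseudocode: the names, the lowered threshold `low_fr` and the use of `random.sample` (sampling without replacement) are choices made for this example.

```python
import random
from itertools import combinations


def count_one_pass(candidates, rows):
    """Count every candidate while reading the rows exactly once."""
    counts = dict.fromkeys(candidates, 0)
    for t in rows:
        for c in counts:
            if c <= t:
                counts[c] += 1
    return counts


def frequent_sets_with_border(rows, items, min_fr):
    """Level-wise search; the rejected candidates form the negative border."""
    level = {frozenset([i]) for i in items}
    frequent, border = set(), set()
    k = 1
    while level:
        counts = count_one_pass(level, rows)
        freq_k = {c for c in level if counts[c] / len(rows) >= min_fr}
        frequent |= freq_k
        border |= level - freq_k
        k += 1
        base = sorted({i for s in freq_k for i in s})
        level = {
            frozenset(c)
            for c in combinations(base, k)
            if all(frozenset(c) - {i} in freq_k for i in c)
        }
    return frequent, border


def sampling_algorithm(database, items, min_fr, low_fr, sample_size, seed=0):
    """Return (frequent sets found, possible_failure flag)."""
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)  # assumption: without replacement
    S, border = frequent_sets_with_border(sample, items, low_fr)
    counts = count_one_pass(S | border, database)  # the one full pass
    frequent = {x for x, n in counts.items() if n / len(database) >= min_fr}
    possible_failure = any(x in border for x in frequent)
    return frequent, possible_failure


# Hypothetical toy run; the sample is as large as the database here, so the
# outcome does not depend on the random draw.
database = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
frequent, possible_failure = sampling_algorithm(
    database, "ABCD", min_fr=0.5, low_fr=0.5, sample_size=4
)
```

In practice `low_fr` would be set below `min_fr` and the sample would be much smaller than the database, trading a larger candidate set for a lower chance of a miss.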
12
Failure Handling
  • In the fraction of cases where a possible failure
    is reported, all frequent sets can be found by
    making a second pass over the database.
  • For this, the algorithm simply computes the collection of
    all sets that could possibly be frequent.
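
One way to compute the candidates for the second pass is sketched below: starting from the sets known to be frequent, repeatedly add every set whose proper subsets are all still possibly frequent, until nothing new appears. Skipping sets already counted and found infrequent is an assumption of this sketch, and the function name is hypothetical.

```python
from itertools import combinations


def second_pass_candidates(frequent, infrequent):
    """Sets not yet counted that could still be frequent."""
    frequent = {frozenset(x) for x in frequent}
    infrequent = {frozenset(x) for x in infrequent}
    possible = set(frequent)
    grew = True
    while grew:  # repeat until the collection stops growing
        grew = False
        largest = max((len(s) for s in possible), default=0)
        for k in range(2, largest + 2):
            base = sorted({i for s in possible if len(s) == k - 1 for i in s})
            for c in map(frozenset, combinations(base, k)):
                if c in possible or c in infrequent:
                    continue
                if all(c - {i} in possible for i in c):
                    possible.add(c)
                    grew = True
    return possible - frequent


# Hypothetical first-pass outcome: {A}, {B}, {C}, {A,C} are frequent, {D} is not,
# and {B} was a miss from the negative border.
candidates = second_pass_candidates(["A", "B", "C", "AC"], ["D"])
# candidates contains {A,B}, {B,C} and {A,B,C}; a second scan counts them.
```

The collection can grow quickly, because every new set makes its own supersets eligible in the next round.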
13
Analysis of Sampling
  • Sample Size and Probability of Failure
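
The formulas on this slide are not in the transcript. As a hedged illustration of how sample size relates to accuracy, a standard Hoeffding-type bound says that, for a sample of size n drawn with replacement, the sample frequency of a fixed itemset deviates from its true frequency by more than eps with probability at most delta whenever n >= ln(2/delta) / (2 * eps^2). Whether the slide used exactly this bound is an assumption; the function name is hypothetical.

```python
import math


def sample_size(eps, delta):
    """Smallest integer n with n >= ln(2 / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Sample size for an error of at most 0.01 with probability at least 0.99.
n = sample_size(eps=0.01, delta=0.01)
```

The bound depends only on eps and delta, not on the size of the database, and halving eps roughly quadruples the required sample.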

14
Experimental Results
15
Conclusion
  • Advantages
  • Reduced failure probability, while keeping the
    candidate count low enough to fit in memory
  • Disadvantages
  • Potentially large number of candidates
    in the second pass

16
References
  • [1] H. Toivonen, "Sampling Large Databases for
    Association Rules," Proc. of the VLDB Conference,
    India, 1996.

17
Questions
  • ?

18
  • Thank you