Mining Confident Rules Without Support Requirements
1
Mining Confident Rules Without Support
Requirements
  • Ke Wang
  • Yu He
  • D. W. Cheung
  • F. Y. L. Chin

2
Association Rules
  • Given a table over A1, ..., Ak, C
  • Find all rules Ai=ai → C=c with minimum
    confidence and minimum support
  • Support: sup(Ai=ai) = number of records containing Ai=ai
  • Confidence: sup(Ai=ai, C=c) / sup(Ai=ai) (sketched below)
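
A minimal sketch of these two measures over a small in-memory table; the records, attribute names, and helper functions below are made up for illustration and are not from the slides.

# Illustrative table over attributes Age, Gender and class Buy.
records = [
    {"Age": "young", "Gender": "M", "Buy": "yes"},
    {"Age": "young", "Gender": "F", "Buy": "no"},
    {"Age": "old",   "Gender": "M", "Buy": "yes"},
    {"Age": "young", "Gender": "F", "Buy": "yes"},
]

def support(conditions):
    # sup(x): number of records satisfying every attribute=value condition in x
    return sum(all(r.get(a) == v for a, v in conditions.items()) for r in records)

def confidence(antecedent, class_attr, class_value):
    # conf(x -> C=c) = sup(x, C=c) / sup(x)
    sup_x = support(antecedent)
    sup_xc = support({**antecedent, class_attr: class_value})
    return sup_xc / sup_x if sup_x else 0.0

print(support({"Age": "young"}))                     # 3
print(confidence({"Age": "young"}, "Buy", "yes"))    # 2/3 = 0.666...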

3
Low Support Rules
  • Interesting rules may be unknown in advance and
    have low support
  • High support rules often have low confidence
  • Often, patterns are fragmented into many low
    support rules

Goal: find all rules above the minimum confidence
4
Confidence-based Pruning
  • Without a minimum support, the classic
    support-based pruning is inapplicable
  • Confident rules are neither downward closed nor
    upward closed
  • New strategies are needed for pushing the
    confidence requirement into the mining process.

5
Confidence-based Pruning
  • r1: Age=young → Buy=yes
  • r2: Age=young, Gender=M → Buy=yes
  • r3: Age=young, Gender=F → Buy=yes

Observation 1: if r1 is confident, so is one of
r2 and r3 (its specializations by Gender), because conf(r1)
is a support-weighted average of conf(r2) and conf(r3)
(a numeric sketch follows)
Observation 2: if no specialized rule of r1 is
confident, r1 can be pruned
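
The weighted-average argument behind Observation 1, with made-up counts (the numbers below are purely illustrative):

# Hypothetical counts: 100 records with Age=young, split 60/40 by Gender.
sup_r2, conf_r2 = 60, 0.90      # Age=young, Gender=M -> Buy=yes
sup_r3, conf_r3 = 40, 0.45      # Age=young, Gender=F -> Buy=yes

# conf(r1) is the support-weighted average of conf(r2) and conf(r3).
conf_r1 = (sup_r2 * conf_r2 + sup_r3 * conf_r3) / (sup_r2 + sup_r3)
print(conf_r1)                               # 0.72
assert conf_r1 <= max(conf_r2, conf_r3)      # so a confident r1 implies a confident r2 or r3
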
6
Confidence-based Pruning
  • Level-wise rule generation: generate a candidate
    rule x → c only if, for every attribute A not in
    x → c, some A-specialization of x → c is
    confident (a sketch of this test follows).
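
A sketch of that generation test, assuming rule bodies are frozensets of (attribute, value) pairs and Rule_k is a set of (body, class) pairs; the function and parameter names are illustrative.

def is_candidate(x, c, attributes, confident_k_rules):
    # Generate x -> c only if every attribute A not used in x -> c has some
    # confident A-specialization in Rule_k (a rule extending x by one
    # condition on A, with the same class c).
    used = {a for a, _ in x}
    for attr in attributes:
        if attr in used:
            continue
        has_confident_specialization = any(
            rule_c == c and x < body and attr in dict(body - x)
            for body, rule_c in confident_k_rules
        )
        if not has_confident_specialization:
            return False
    return True

rule_k = {(frozenset({("Age", "young"), ("Gender", "M")}), "yes")}
print(is_candidate(frozenset({("Age", "young")}), "yes", ["Age", "Gender"], rule_k))  # True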

7
The Algorithm
  • Input: a table T over A1, ..., Am, C, and minconf
  • Output: all confident rules (a Python sketch of the
    loop follows this slide)
  • 1. k = m
  • 2. Rule_k = all confident m-rules
  • 3. while k > 1 and Rule_k is not empty do
  • 4.   generate Cand_(k-1) from Rule_k
  • 5.   compute the confidence of Cand_(k-1) in one
    pass of T
  • 6.   Rule_(k-1) = all confident candidates in
    Cand_(k-1)
  • 7.   k--
  • 8. return all Rule_k
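
A compact, in-memory sketch of this loop under simplifying assumptions: the whole table fits in memory, the disk-based bucket machinery of the later slides is omitted, and all names are illustrative.

def mine_confident_rules(records, attributes, class_attr, minconf):
    def confident(cands):
        # Keep the (body, class) candidates whose confidence reaches minconf.
        # (For clarity this scans the records per candidate; the slide's
        # algorithm computes all confidences in one pass of T per level.)
        kept = set()
        for body, c in cands:
            covered = [r for r in records if all(r[a] == v for a, v in body)]
            if covered and sum(r[class_attr] == c for r in covered) / len(covered) >= minconf:
                kept.add((body, c))
        return kept

    m = len(attributes)
    k = m                                                           # step 1
    # Step 2: Rule_m = all confident m-rules (every attribute instantiated).
    rules = {m: confident({(frozenset((a, r[a]) for a in attributes),
                            r[class_attr]) for r in records})}
    while k > 1 and rules[k]:                                       # step 3
        # Step 4: generate Cand_(k-1) with the test from slide 6.
        cands = set()
        for body, c in rules[k]:
            for dropped in body:
                x = body - {dropped}
                unused = set(attributes) - {a for a, _ in x}
                if all(any(rc == c and x < b and u in dict(b - x)
                           for b, rc in rules[k])
                       for u in unused):
                    cands.add((x, c))
        rules[k - 1] = confident(cands)                             # steps 5-6
        k -= 1                                                      # step 7
    return rules                                                    # step 8: all Rule_k

# Example call on the illustrative table from the earlier sketch:
# mine_confident_rules(records, ["Age", "Gender"], "Buy", 0.6)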

8
Disk-based Implementation
  • Assumption: T, Rule_k, Cand_(k-1) are stored on disk.
  • We focus on
  • generating Cand_(k-1) from Rule_k, and
  • computing the confidence of Cand_(k-1).
  • Key idea: cluster T, Rule_k, Cand_(k-1) according to
    the attributes Ai

9
Clustering by Hash Partitioning
  • h_i: the hash function for attribute Ai, i = 1, ...,
    m
  • Table T is partitioned into T-buckets
  • Rule_k is partitioned into R-buckets
  • Cand_(k-1) is partitioned into C-buckets
  • A bucket id is the sequence of hash values involved,
    ⟨b1, ..., bk⟩ (a sketch follows)
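
One way the partitioning could look in code; the per-attribute hash function and bucket count below are illustrative (Python's built-in hash is not stable across processes, so a real implementation would use a deterministic hash).

NUM_BUCKETS = 4   # illustrative number of hash values per attribute

def h(attribute, value):
    # Per-attribute hash function h_i mapping a value to a small bucket number.
    return hash((attribute, value)) % NUM_BUCKETS

def t_bucket_id(record, attributes):
    # T-bucket id: one (attribute, hash value) pair for every attribute.
    return tuple((a, h(a, record[a])) for a in attributes)

def rule_bucket_id(body, attributes):
    # R-/C-bucket id: hash values only for the attributes the rule body uses,
    # listed in attribute order.
    b = dict(body)
    return tuple((a, h(a, b[a])) for a in attributes if a in b)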

10
Pruning by Checking Bucket Ids
  • A tuple in a T-bucket supports a candidate in a
    C-bucket only if the T-bucket id matches the
    C-bucket id.
  • E.g., T-bucket ⟨A1.1, A2.1, A3.2⟩ matches C-buckets
    ⟨A1.1, A3.2⟩ and ⟨A1.1, A2.1⟩
  • A C-bucket ⟨b1, ..., bk⟩ is nonempty only if, for
    every other attribute A, some R-bucket
    ⟨b1, ..., bk, bA⟩ is nonempty (both checks are
    sketched below)
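
A sketch of both checks, reusing the bucket-id helpers from the previous sketch (all names illustrative).

def t_matches_c(t_id, c_id):
    # A tuple in a T-bucket can support a candidate in a C-bucket only if
    # the two ids agree on every attribute the C-bucket id mentions.
    t = dict(t_id)
    return all(t.get(a) == v for a, v in c_id)

def c_bucket_can_be_nonempty(c_id, attributes, nonempty_r_bucket_ids):
    # A C-bucket <b1,...,bk> can be nonempty only if, for every attribute A
    # it does not cover, some R-bucket <b1,...,bk,bA> is nonempty.
    c = set(c_id)
    used = {a for a, _ in c_id}
    for attr in attributes:
        if attr in used:
            continue
        if not any(c < set(r) and attr in dict(set(r) - c)
                   for r in nonempty_r_bucket_ids):
            return False
    return True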

11
Hypergraph H_(k-1)
  • A vertex corresponds to a T-bucket
  • An edge corresponds to a C-bucket and contains a
    vertex if and only if the C-bucket matches the
    T-bucket
  • H_(k-1) is kept in memory.

12
The Optimal Blocking
  • Assume that we can read several T-buckets at a
    time, called a T-block.
  • For each T-block, we need to access the matching
    C-buckets from disk.
  • We want the optimal partitioning into T-blocks so
    that the number of C-bucket accesses is minimized.
  • This problem is NP-hard.

13
Heuristics
  • Heuristic I: the more T-buckets match a C-bucket,
    the higher the priority such T-buckets should be
    given in the next T-block.
  • Heuristic II: the more C-buckets match a
    T-bucket, the higher the priority this T-bucket
    should be given in the next T-block
    (one possible reading of both is sketched below).
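
One possible reading of the two heuristics as priority orderings over the T-buckets, where matches[t] is the set of C-buckets that T-bucket t matches; this is an interpretation for illustration, not the paper's exact scheduling rule.

from collections import defaultdict

def order_by_heuristic_1(matches):
    # Heuristic I (one reading): rank T-buckets by how widely shared their
    # C-buckets are, so heavily shared C-buckets are served by consecutive
    # T-buckets and re-read from disk less often.
    c_degree = defaultdict(int)            # how many T-buckets need each C-bucket
    for cs in matches.values():
        for c in cs:
            c_degree[c] += 1
    return sorted(matches, key=lambda t: sum(c_degree[c] for c in matches[t]),
                  reverse=True)

def order_by_heuristic_2(matches):
    # Heuristic II: rank T-buckets by the number of C-buckets they match.
    return sorted(matches, key=lambda t: len(matches[t]), reverse=True)

# matches maps a T-bucket name to the C-buckets it matches (illustrative).
matches = {"T1": {"C1", "C2"}, "T2": {"C1"}, "T3": {"C2", "C3", "C4"}}
print(order_by_heuristic_2(matches))       # ['T3', 'T1', 'T2']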

14
  • (Figure: T-buckets T1-T5 connected to the C-buckets
    C1-C4 they match)
  • (T1 T2 T3)(T4 T5): C1, C2, C4 read twice, C3 read once
  • Heuristic I, (T1 T2 T5)(T3 T4): C1, C2, C4 read once,
    C3 read twice
  • Heuristic II, (T1 T3 T5)(T2 T4): C1, C4 read twice,
    C2, C3 read once.

15
Experiments
  • Synthetic datasets from "An interval classifier
    for database mining applications", VLDB 1992.
  • 9 attributes, 1 class.
  • Default data size: 100K records

16-19
(No transcript for slides 16-19)
20
Conclusion
  • The experiments show that the proposed
    confidence-based pruning is effective.