Mining Approximate Frequent Itemsets in the Presence of Noise

Transcript and Presenter's Notes

1
Mining Approximate Frequent Itemsets in the
Presence of Noise
  • By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel
    and J. Prins

Presentation by: Apurv Awasthi
2
Title Statement
  • This paper introduces an approach to noise-
    tolerant frequent itemset mining over the binary
    matrix representation of a database

3
Index
  • Introduction to Frequent Itemset Mining
  • Frequent Itemset Mining
  • Binary Matrix Representation Model
  • Problems
  • Motivation
  • Proposed Model
  • Proposed Algorithm
  • AFI Mining vs. Exact Frequent Itemset Mining
  • Related Works
  • Experimental Results
  • Discussion
  • Conclusion

4
Introduction to Frequent Itemset
Mining
  • Frequent pattern: a pattern (a set of items,
    subsequences, substructures, etc.) that occurs
    frequently in a data set
  • Originally developed to discover association
    rules
  • Applications
  • Bio-molecular applications
  • DNA sequence analysis, protein structure analysis
  • Business applications
  • Market basket analysis, sale campaign analysis

5
The Binary Matrix Representation Model
  • Model for representing relational databases
  • Rows correspond to objects
  • Columns correspond to attributes of the objects
  • 1 indicates presence
  • 0 indicates absence
  • Frequent Itemset mining is a key technique for
    analyzing such data
  • Apply Apriori algorithm

Transaction   I1  I2  I3  I4  I5
T1             1   0   1   1   0
T2             0   1   1   0   1
T3             1   1   1   0   1
T4             0   1   0   0   1
T5             1   0   0   0   0
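As an illustration of this representation, the short Python sketch below (not from the paper; the library choice and variable names are illustrative) builds the 0/1 matrix shown above from a list of transactions:

```python
# Build the binary matrix representation of a transaction database.
# Illustrative sketch only; items and transactions mirror the table above.
import numpy as np

items = ["I1", "I2", "I3", "I4", "I5"]
transactions = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
    "T5": {"I1"},
}

# 1 indicates presence of an item in a transaction, 0 indicates absence.
D = np.array([[1 if item in txn else 0 for item in items]
              for txn in transactions.values()])
print(D)
```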
6
Problem with Frequent Itemset Mining
  • The traditional model for mining frequent
    itemsets requires that every item must occur in
    each supporting transaction
  • NOT a practical assumption!
  • Real data is typically subject to noise
  • Reasons for noise
  • Human error
  • Measurement error
  • Vagaries of human behavior
  • Stochastic nature of studied biological behavior

7
Effect of Noise
  • Fragmentation of Patterns by Noise
  • Discover multiple small fragments of the true
    itemset
  • Miss the true itemset itself!
  • Example
  • An exact frequent itemset mining algorithm will
    miss the main itemset A
  • We observe three fragmented itemsets: Itemsets 1, 2
    and 3
  • Fragmented itemsets may not satisfy the minimum
    support criteria and will therefore be discarded

8
Mathematical Proof of Fragmentation
From "Significance and Recovery of Block Structures
in Binary Matrices with Noise" by X. Sun and A. B. Nobel
  • With probability 1, when n is sufficiently large,
    M(Y) < 2 log_a(n) - 2 log_a(log_a(n))
  • i.e. in the presence of noise, only a fraction of
    the initial block of 1s can be recovered

Where
  Matrix X contains the actual values recorded in the
    absence of any noise
  Matrix Z is a binary noise matrix whose entries are
    independent Bernoulli(p) random variables, 0 < p < 0.5
  Y = X xor Z
  M(Y) is the largest k such that Y contains k
    transactions having k common items
  a = (1 - p)^(-1)
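As a rough worked example of this bound (the values p = 0.1 and n = 10,000 are chosen only for illustration and do not come from the paper):

```python
# Worked example of the Sun & Nobel bound; p and n are arbitrary
# illustration values, not figures from the paper.
import math

p = 0.1                      # noise level
n = 10_000                   # number of transactions
a = 1 / (1 - p)              # base of the logarithm, a = (1 - p)^(-1)

log_a = lambda x: math.log(x) / math.log(a)
bound = 2 * log_a(n) - 2 * log_a(log_a(n))

# The largest square block of common 1s recoverable from the noisy
# matrix, M(Y), stays below roughly 90 even though n = 10,000.
print(f"M(Y) < {bound:.1f}")
```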
9
Motivation
  • The failure of classical frequent itemset mining
    to detect simple patterns in the presence of
    random errors (i.e. noise) compromises the
    ability of these algorithms to detect
    associations, cluster items or build classifiers
    when such errors are present

10
Possible Solutions
  • Let the matrix contain a small fraction of 0s

DRAWBACK: free riders, such as column h (for matrix C)
and row 6 (for matrix B)
SOLUTION: limit the number of 0s in each row and
column
11
Proposed Model
  1. Use Approximate Frequent Itemset (AFI)
  • AFI characteristics (a small sketch of these
    checks follows this list)
  • Sub-matrix contains a large fraction of 1s
  • A supporting transaction should contain most of the
    items, i.e. the number of 0s in every row
    must fall below a user-defined threshold (εr)
  • A supporting item should occur in most of the
    transactions, i.e. the number of 0s in
    every column must fall below a user-defined
    threshold (εc)
  • Number of rows ≥ minimum support
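A minimal sketch of these row/column checks, as described above (not the authors' implementation; the function name, the use of NumPy, and the example block are assumptions made for illustration):

```python
# Minimal sketch of the AFI(eps_r, eps_c) row/column checks described
# above; not the authors' code.
import numpy as np

def is_afi(D, eps_r, eps_c, minsup, n_total):
    """Check the AFI conditions for a candidate 0/1 sub-matrix D whose
    rows are supporting transactions and columns are candidate items."""
    rows, _ = D.shape
    row_zeros = 1 - D.mean(axis=1)   # fraction of 0s in each row
    col_zeros = 1 - D.mean(axis=0)   # fraction of 0s in each column
    return (np.all(row_zeros <= eps_r)        # row noise threshold
            and np.all(col_zeros <= eps_c)    # column noise threshold
            and rows >= minsup * n_total)     # minimum support

# A 4x3 block with a single flipped entry is tolerated at eps = 1/3.
block = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1],
                  [1, 1, 1]])
print(is_afi(block, eps_r=1/3, eps_c=1/3, minsup=0.5, n_total=8))  # True
```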

12
AFI
  • Mathematical definition
  • For a given binary matrix D with item set I0 and
    transaction set T0, an itemset I ⊆ I0 is an
    approximate frequent itemset AFI(εr, εc) if there
    exists a set of transactions T ⊆ T0 with
    |T| ≥ |T0| · minsup such that the row and column
    conditions sketched below hold
  • Similarly, define weak AFI(ε)
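A sketch of the row and column conditions in LaTeX, reconstructed from the verbal description on the previous slide (the exact notation used in the paper may differ):

```latex
% Row condition: each supporting transaction t misses at most a
% fraction \varepsilon_r of the items in I.
\forall t \in T:\quad \frac{1}{|I|}\sum_{i \in I} D(t,i) \ge 1 - \varepsilon_r
% Column condition: each item i is absent from at most a fraction
% \varepsilon_c of the supporting transactions in T.
\forall i \in I:\quad \frac{1}{|T|}\sum_{t \in T} D(t,i) \ge 1 - \varepsilon_c
```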

13
AFI example
  • A, B and C are all weak AFI(0.25)
  • A is a valid AFI(0.25, 0.25)
  • B is only a weak AFI(·, 0.25), i.e. it satisfies
    only the column constraint
  • C is only a weak AFI(0.25, ·), i.e. it satisfies
    only the row constraint

14
Drawback of AFI
  • The AFI criteria violate the Apriori property!
  • Apriori property: all sub-itemsets of a frequent
    itemset must be frequent
  • But a sub-itemset of an AFI need not be an AFI,
    e.g. A is a valid AFI for minSupport = 4, but
    {b,c,e}, {b,c,d} etc. are not valid AFIs
  • PROBLEM: minimum support can no longer be used as
    a pruning technique
  • SOLUTION: a generalization of the Apriori property
    for noisy conditions (called Noise Tolerant
    Support Pruning)

15
Proposed Model
  1. Use Approximate Frequent Itemset (AFI)
  2. Noise Tolerant Support Pruning - to prune and
    generate candidate itemsets
  3. 0/1 Extension - to count the support of a
    noise-tolerant itemset based on the support sets
    of its sub-itemsets

16
Noise Tolerant Support Pruning
  • For given εr, εc and minsup, the noise-tolerant
    pruning support for a length-k itemset is:

Proof
17
0/1 Extensions
  • Starting from singleton itemsets, generate (k+1)-
    itemsets from k-itemsets in a sequential manner
  • The number of 0s allowed in the itemset grows
    with the length of the itemset in a discrete
    manner
  • 1 Extension
  • If ..., then the transaction set of a (k+1)-
    itemset I is the intersection of the transaction
    sets of its length-k subsets
  • 0 Extension
  • If ..., then the transaction set of a (k+1)-
    itemset I is the union of the transaction sets of
    its length-k subsets

Proof
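A rough sketch of the two extension rules (not the authors' code; the trigger condition, based on whether the per-row zero allowance ⌊εr·k⌋ grows when the itemset length increases, is an assumption inferred from the "grows in a discrete manner" remark above):

```python
# Rough sketch of 0/1 extension; not the authors' code. The trigger
# condition floor(eps_r * (k+1)) == floor(eps_r * k) is an assumption
# inferred from the slides, not taken verbatim from the paper.
import math
from functools import reduce

def extend_transaction_set(subset_transaction_sets, k, eps_r):
    """Combine the transaction sets of the length-k subsets of a
    (k+1)-itemset into the itemset's candidate transaction set."""
    if math.floor(eps_r * (k + 1)) == math.floor(eps_r * k):
        # 1-extension: no extra 0 is allowed per row, so a supporting
        # transaction must support every length-k subset -> intersection.
        return reduce(set.intersection, subset_transaction_sets)
    # 0-extension: one more 0 is allowed per row, so supporting any
    # length-k subset is enough -> union.
    return reduce(set.union, subset_transaction_sets)

# Example with eps_r = 1/3: going from k = 2 to k = 3 admits one more
# 0 per row, so the subsets' transaction sets are unioned.
t_sets = [{"T1", "T2"}, {"T2", "T3"}, {"T1", "T3"}]
print(extend_transaction_set(t_sets, k=2, eps_r=1/3))  # {'T1', 'T2', 'T3'}
```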
18
Proposed Algorithm
19
AFI vs. Exact Frequent Itemset
AFI Mining
εr, εc = 1/3; n = 8; minsup = 1
20
AFI vs. Exact Frequent Itemset
Exact Frequent Itemset Mining
MinSup = 0.5 (i.e. 4 transactions), n = 8

Transaction database
Transaction  Items
T1           a, b, c
T2           a, b
T3           a, c
T4           b, c
T5           a, b, c, d
T6           d
T7           b, d
T8           a

1-candidates (Itemset : Support)
a : 5    b : 5    c : 4    d : 3

Frequent 1-itemsets (Itemset : Support)
a : 5    b : 5    c : 4

2-candidates (Itemset : Support)
ab : 3    ac : 3    bc : 3

Frequent 2-itemsets
Null
21
AFI vs. Exact Frequent Itemset - Result
Approximate Frequent Itemset mining: generates the frequent itemset {a, b, c}
Exact Frequent Itemset mining: cannot generate any frequent itemset in the presence of noise for the given minimum support value

22
Related Works
  • Yang et al. (2001) proposed two error-tolerant
    models, termed weak error-tolerant itemsets (weak
    ETI), which is equivalent to weak AFI, and strong
    ETI, which is equivalent to AFI(εr, ·)
  • DRAWBACK
  • No efficient pruning technique; relies on
    heuristics and sampling techniques
  • Does not preclude columns of 0s
  • Steinbach et al. (2004) proposed a support
    envelope, which is a tool for exploration and
    visualization of the high-level structures of
    association patterns. A symmetric ETI model is
    proposed such that the same fraction of errors
    is allowed in both rows and columns.
  • DRAWBACK
  • Uses the same error coefficient for rows and
    columns, i.e. εr = εc
  • Admits only a fixed number of 0s in the itemset;
    the fraction of noise does not vary with the size
    of the itemset sub-matrix

23
Related Works
  • Seppänen and Mannila (2004) proposed mining dense
    itemsets in the presence of noise, where a dense
    itemset is an itemset with a sufficiently large
    sub-matrix that exceeds a given density threshold
    of attributes present
  • DRAWBACK
  • Enforces the constraint that all sub-itemsets of
    a dense itemset must be frequent, so it will fail
    to identify larger itemsets that have sufficient
    support when not all of their sub-itemsets have
    enough support
  • Requires repeated scans of the database

24
Experimental Results - Scalability
  • Scalability
  • Database of 10,000 transactions and 100 items
  • Run time increases as noise tolerance increases
  • Reducing the item-wise error constraint leads to a
    greater reduction in run time than reducing the
    transaction-wise error constraint

25
Experimental Results - Synthetic Data
  • Quality testing for a single cluster
  • Create data with an embedded pattern
  • Add noise by flipping each entry with probability
    p, where 0.01 ≤ p ≤ 0.2
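A minimal sketch of this noise-injection step (not the authors' code; the embedded-pattern size and matrix dimensions are arbitrary illustration values):

```python
# Minimal sketch of the noise-injection step: flip each entry of a 0/1
# matrix independently with probability p. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(D, p):
    """Return a copy of the 0/1 matrix D with each entry flipped w.p. p."""
    flips = rng.random(D.shape) < p
    return np.where(flips, 1 - D, D)

# Embed a 20x10 block of 1s in a 100x30 matrix, then corrupt it with p = 0.1.
D = np.zeros((100, 30), dtype=int)
D[:20, :10] = 1
noisy = add_noise(D, p=0.1)
print(noisy[:20, :10].mean())  # roughly 0.9 of the embedded 1s survive
```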

26
Experimental Results - Synthetic Data
  • Quality testing for multiple clusters
  • Create data with multiple embedded patterns
  • Add noise by flipping each entry with probability
    p, where 0.01 ≤ p ≤ 0.2

27
Experimental Results - Real World Data
  • Zoo Data Set
  • Database contained 101 instances and 18 attributes
  • All the instances are classified into 7 classes,
    e.g. mammals, fish, etc.

Exact: generated subsets of animals in each class, then found subsets of
their common features
ETI(εr): identified "fins" and "domestic" as common features, which is NOT
necessarily true
AFI(εr, εc): only AFI was able to recover 3 classes with 100% accuracy
28
Discussion
  • Advantages
  • Flexibility of placing constraints independently
    along rows and columns
  • Generalized Apriori technique for pruning
  • Avoids repeated scans of the database by using the
    0/1 extension

29
Summary
  • The paper outlines an algorithm for mining
    approximate frequent itemsets from noisy data
  • It introduces
  • an AFI model
  • Generalized Apriori property for pruning
  • The proposed algorithm generates more useful
    itemsets compared to existing algorithms and is
    also computationally more efficient

30
Thank You!
31
Extra Slides for Questionnaire
32
Applying Apriori Algorithm
Min_sup = 2

Database D (binary matrix)
Transaction  a  b  c  d  e
T1           1  0  1  1  0
T2           0  1  1  0  1
T3           1  1  1  0  1
T4           0  1  0  0  1
T5           0  0  0  0  0

Database D (transaction list)
TID   Items
T1    a, c, d
T2    b, c, e
T3    a, b, c, e
T4    b, e

Scan D - 1-candidates (Itemset : Sup)
a : 2    b : 3    c : 3    d : 1    e : 3

Frequent 1-itemsets
a : 2    b : 3    c : 3    e : 3

2-candidates
ab, ac, ae, bc, be, ce

Scan D - Counting (Itemset : Sup)
ab : 1    ac : 2    ae : 1    bc : 2    be : 3    ce : 2

Frequent 2-itemsets
ac : 2    bc : 2    be : 3    ce : 2

3-candidates
bce

Scan D - Frequent 3-itemsets (Itemset : Sup)
bce : 2
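A compact Apriori sketch over this toy database (illustrative only, not the presentation's implementation; candidates are generated by unioning pairs of frequent itemsets, a simplification of the classic prefix join):

```python
# Compact Apriori sketch over the toy database above; illustrative only.
from itertools import combinations

transactions = [
    {"a", "c", "d"},       # T1
    {"b", "c", "e"},       # T2
    {"a", "b", "c", "e"},  # T3
    {"b", "e"},            # T4
]
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
k = 1
while frequent:
    print(k, [("".join(sorted(f)), support(f)) for f in frequent])
    # Generate (k+1)-candidates by unioning pairs of frequent k-itemsets
    # (a simplification of the prefix join), then prune by support.
    candidates = {a | b for a, b in combinations(frequent, 2)
                  if len(a | b) == k + 1}
    frequent = [c for c in sorted(candidates, key=sorted)
                if support(c) >= min_sup]
    k += 1
```

Running this prints the frequent 1-itemsets {a, b, c, e}, the frequent 2-itemsets {ac, bc, be, ce} and the frequent 3-itemset bce, matching the tables above.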
33
Noise Tolerant Support Pruning - Proof
34
0/1 Extensions Proof
  • Number of zeroes allowed in an itemset grows
    with the length of the itemset