Title: Mining Approximate Frequent Itemsets in the Presence of Noise
1 Mining Approximate Frequent Itemsets in the Presence of Noise
- By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, and J. Prins
- Presentation by: Apurv Awasthi
2 Title Statement
- This paper introduces a noise-tolerant approach to frequent itemset mining over the binary matrix representation of a database
3 Index
- Introduction to Frequent Itemset Mining
- Frequent Itemset Mining
- Binary Matrix Representation Model
- Problems
- Motivation
- Proposed Model
- Proposed Algorithm
- AFI Mining vs. Exact Frequent Itemset Mining
- Related Works
- Experimental Results
- Discussion
- Conclusion
4 Introduction to Frequent Itemset Mining
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Originally developed to discover association rules
- Applications
  - Bio-molecular applications
    - DNA sequence analysis, protein structure analysis
  - Business applications
    - Market basket analysis, sales campaign analysis
5 The Binary Matrix Representation Model
- Model for representing relational databases
  - Rows correspond to objects
  - Columns correspond to attributes of the objects
  - 1 indicates presence
  - 0 indicates absence
- Frequent itemset mining is a key technique for analyzing such data
  - Apply the Apriori algorithm
Transaction  I1  I2  I3  I4  I5
T1            1   0   1   1   0
T2            0   1   1   0   1
T3            1   1   1   0   1
T4            0   1   0   0   1
T5            1   0   0   0   0
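A minimal sketch (not from the paper) of this representation in Python, using the example table above: each transaction becomes a row of 0/1 indicators over the item universe, and exact support is the number of rows with a 1 in every column of the itemset.

```python
# Binary matrix for the transactions in the table above (items I1..I5).
items = ["I1", "I2", "I3", "I4", "I5"]
transactions = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
    "T5": {"I1"},
}
matrix = {t: [1 if i in row else 0 for i in items] for t, row in transactions.items()}

def exact_support(itemset):
    """Number of transactions whose row has a 1 in every column of `itemset`."""
    return sum(all(bits[items.index(i)] == 1 for i in itemset)
               for bits in matrix.values())

print(matrix["T3"])                 # [1, 1, 1, 0, 1]
print(exact_support({"I2", "I5"}))  # 3 (supported by T2, T3, T4)
```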
6 Problem with Frequent Itemset Mining
- The traditional model for mining frequent itemsets requires that every item occur in each supporting transaction
- NOT a practical assumption!
  - Real data is typically subject to noise
- Sources of noise
  - Human error
  - Measurement error
  - Vagaries of human behavior
  - Stochastic nature of the biological processes under study
7 Effect of Noise
- Fragmentation of patterns by noise
  - Discover multiple small fragments of the true itemset
  - Miss the true itemset itself!
- Example
  - An exact frequent itemset mining algorithm will miss the main itemset A
  - It observes three fragmented itemsets instead (Itemsets 1, 2, and 3)
  - The fragmented itemsets may not satisfy the minimum support criterion and will therefore be discarded
8 Mathematical Proof of Fragmentation
From "Significance and Recovery of Block Structures in Binary Matrices with Noise" by X. Sun and A.B. Nobel:
- With probability 1, M(Y) < 2 log_a(n) - 2 log_a(log_a(n)) when n is sufficiently large
- i.e. in the presence of noise, only a fraction of the initial block of 1s can be recovered
Where
- Matrix X contains the actual values recorded in the absence of any noise
- Matrix Z is a binary noise matrix whose entries are independent Bernoulli random variables, Z ~ Bern(p), with 0 < p < 0.5
- Y = X xor Z
- a = (1 - p)^(-1)
- M(Y) is the largest k such that Y contains k transactions having k common items
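The effect can be seen in a toy simulation (a sketch, not from the paper or its experiments): start from an all-ones block X, corrupt it with Bern(p) noise to obtain Y = X xor Z, and compare M(X) with M(Y) by brute force. The sizes and the value of p are arbitrary choices kept small enough for exhaustive search.

```python
import random
from itertools import combinations

random.seed(0)
n_rows, n_cols, p = 20, 10, 0.2

# X: a pure block of 1s; Y = X xor Z with independent Bern(p) noise entries.
X = [[1] * n_cols for _ in range(n_rows)]
Y = [[x ^ (random.random() < p) for x in row] for row in X]

def M(mat):
    """Largest k such that some k transactions share k common all-1 items (brute force)."""
    best = 0
    width = len(mat[0])
    for r in range(1, width + 1):
        for col_set in combinations(range(width), r):
            rows_ok = sum(all(row[c] for c in col_set) for row in mat)
            best = max(best, min(r, rows_ok))
    return best

print("M(X) without noise:", M(X))  # 10, the full width of the block
print("M(Y) with noise:   ", M(Y))  # typically well below 10
```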
9 Motivation
- The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items, or build classifiers when such errors are present
10 Possible Solutions
- Let the matrix contain a small fraction of 0s
- DRAWBACK: free riders, such as column h (for matrix C) and row 6 (for matrix B)
- SOLUTION: limit the number of 0s in each row and column
11 Proposed Model
- Use Approximate Frequent Itemsets (AFI)
- AFI characteristics
  - The sub-matrix contains a large fraction of 1s
  - A supporting transaction should contain most of the items, i.e. the fraction of 0s in every row must fall below a user-defined threshold (εr)
  - Each item should occur in most of the supporting transactions, i.e. the fraction of 0s in every column must fall below a user-defined threshold (εc)
  - Number of supporting transactions (rows) ≥ minimum support
12 AFI
- Mathematical definition
  - For a given binary matrix D with item set I0 and transaction set T0, an itemset I ⊆ I0 is an approximate frequent itemset AFI(εr, εc) if there exists a set of transactions T ⊆ T0 with |T| ≥ |T0|·minsup such that
    - for every transaction t in T, the number of 0s in row t restricted to I is at most εr·|I|
    - for every item i in I, the number of 0s in column i restricted to T is at most εc·|T|
- Similarly, a weak AFI(ε) only requires the overall fraction of 1s in the sub-matrix defined by I and T to be at least 1 - ε
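A direct transcription of this definition into a small checking function (illustrative only; the data layout and the name `is_afi` are assumptions, not the authors' code):

```python
def is_afi(D, rows, cols, eps_r, eps_c, minsup):
    """Check whether the items `cols` form an AFI(eps_r, eps_c) supported by `rows`.

    D: 0/1 matrix as a list of lists (transactions x items).
    rows, cols: indices of the supporting transactions and of the itemset.
    """
    n = len(D)
    if len(rows) < n * minsup:            # support requirement
        return False
    # Row condition: each supporting transaction misses at most a fraction
    # eps_r of the items in the itemset.
    for t in rows:
        if sum(D[t][i] for i in cols) < (1 - eps_r) * len(cols):
            return False
    # Column condition: each item is missing from at most a fraction eps_c
    # of the supporting transactions.
    for i in cols:
        if sum(D[t][i] for t in rows) < (1 - eps_c) * len(rows):
            return False
    return True

# Toy check: a 4x3 block with a single 0 still passes AFI(1/3, 1/3) at minsup = 1/2.
D = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 1],
     [1, 1, 1]]
print(is_afi(D, rows=[0, 1, 2, 3], cols=[0, 1, 2], eps_r=1/3, eps_c=1/3, minsup=0.5))  # True
```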
13 AFI Example
- A, B, and C are weak AFI(0.25)
- A is a valid AFI(0.25, 0.25)
- B is a weak AFI(·, 0.25)
- C is a weak AFI(0.25, ·)
14 Drawback of AFI
- The AFI criteria violate the Apriori property!
  - Apriori property: all sub-itemsets of a frequent itemset must be frequent
  - But a sub-itemset of an AFI need not be an AFI, e.g. A is a valid AFI for minimum support 4, but {b, c, e}, {b, c, d}, etc. are not valid AFIs
- PROBLEM: minimum support can no longer be used as a pruning criterion
- SOLUTION: a generalization of the Apriori property for noisy conditions (called Noise-Tolerant Support Pruning)
15 Proposed Model
- Use Approximate Frequent Itemsets (AFI)
- Noise-Tolerant Support Pruning: to prune and generate candidate itemsets
- 0/1 Extension: to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets
16 Noise-Tolerant Support Pruning
- For given εr, εc, and minsup, a noise-tolerant pruning support threshold is defined for each itemset length k (formula on slide)
- Proof (see extra slides)
17 0/1 Extensions
- Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a level-wise manner
- The number of 0s allowed in a supporting transaction grows with the length of the itemset in a discrete manner
- 1-Extension
  - If the number of allowed 0s does not increase from length k to length k+1, then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets
- 0-Extension
  - If the number of allowed 0s increases from length k to length k+1, then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets
- Proof (see extra slides)
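A minimal sketch of the 0/1 extension rule, assuming the per-row zero budget of a length-k itemset is floor(k·εr) as described above; the function and example are illustrative, not the authors' code.

```python
from math import floor

def allowed_zeros(k, eps_r):
    """Number of 0s permitted in a supporting transaction of a length-k itemset."""
    return floor(k * eps_r)

def extend_transaction_set(subset_tids, k, eps_r):
    """Combine the transaction sets of the length-k subsets of a (k+1)-itemset.

    subset_tids: list of sets of transaction ids, one per length-k subset.
    """
    if allowed_zeros(k + 1, eps_r) == allowed_zeros(k, eps_r):
        # 1-extension: the zero budget did not grow, so a supporting transaction
        # must support every length-k subset -> intersection.
        return set.intersection(*subset_tids)
    # 0-extension: one more zero is allowed, so supporting some length-k
    # subset is enough -> union.
    return set.union(*subset_tids)

# Example with eps_r = 1/3: going from k = 2 to k = 3 raises floor(k/3)
# from 0 to 1, so the 0-extension (union) applies.
print(extend_transaction_set([{1, 2}, {2, 3}, {1, 3}], k=2, eps_r=1/3))  # {1, 2, 3}
```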
18 Proposed Algorithm
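The algorithm itself appears as a figure on this slide and is not reproduced in the text. Below is a hedged, level-wise sketch that combines the pieces described on slides 15-17: Apriori-style candidate generation, transaction sets maintained by 0/1 extension, and pruning against a length-dependent threshold. `noise_tolerant_minsup` is only a placeholder for the formula of slide 16 (here it falls back to the plain support threshold); all names are illustrative, not the authors' implementation.

```python
from itertools import combinations
from math import floor

def noise_tolerant_minsup(k, n, minsup, eps_r, eps_c):
    # Placeholder for the length-k noise-tolerant support threshold of slide 16
    # (see the paper for the exact formula); here: plain absolute support.
    return n * minsup

def mine_afi(rows, eps_r, eps_c, minsup):
    """Level-wise sketch of AFI mining.

    rows: list of sets of items (one set per transaction).
    Returns a dict mapping candidate itemsets to their noise-tolerant
    transaction sets; a final AFI(eps_r, eps_c) check (slide 12) is still required.
    """
    n = len(rows)
    items = sorted(set().union(*rows))
    # Level 1: exact transaction sets of single items.
    tids = {frozenset([i]): {t for t, row in enumerate(rows) if i in row}
            for i in items}
    results = {}
    k = 1
    while tids:
        # Prune candidates below the length-k noise-tolerant threshold.
        tids = {I: T for I, T in tids.items()
                if len(T) >= noise_tolerant_minsup(k, n, minsup, eps_r, eps_c)}
        results.update(tids)
        # Generate (k+1)-candidates; build their transaction sets by 0/1 extension.
        next_tids = {}
        for I1, I2 in combinations(tids, 2):
            cand = I1 | I2
            if len(cand) != k + 1:
                continue
            subsets = [cand - {i} for i in cand]
            if not all(s in tids for s in subsets):
                continue
            subset_T = [tids[s] for s in subsets]
            if floor((k + 1) * eps_r) == floor(k * eps_r):
                next_tids[cand] = set.intersection(*subset_T)   # 1-extension
            else:
                next_tids[cand] = set.union(*subset_T)          # 0-extension
        tids = next_tids
        k += 1
    return results
```

A final pass applying the AFI(εr, εc) definition of slide 12 to each surviving itemset would complete the miner.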
19 AFI vs. Exact Frequent Itemset
- AFI Mining
- Parameters: εr = εc = 1/3, n = 8, minsup = 1
20 AFI vs. Exact Frequent Itemset
- Exact Frequent Itemset Mining (MinSup = 0.5, i.e. 4 transactions; n = 8)

Transaction  Items
T1           a, b, c
T2           a, b
T3           a, c
T4           b, c
T5           a, b, c, d
T6           d
T7           b, d
T8           a

1-candidates (Itemset: Support): a: 5, b: 5, c: 4, d: 3
Frequent 1-itemsets: a: 5, b: 5, c: 4
2-candidates (Itemset: Support): ab: 3, ac: 3, bc: 3
Frequent 2-itemsets: none
21 AFI vs. Exact Frequent Itemset - Result
- Approximate Frequent Itemset mining generates the frequent itemset {a, b, c}
- Exact Frequent Itemset mining cannot generate any frequent itemset in the presence of noise for the given minimum support value
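The comparison can be reproduced on the eight transactions above with a short script (illustrative, not the authors' code): exact counting at a minimum support of 4 transactions yields no frequent 2-itemset, while {a, b, c} supported by T1-T5 passes the AFI(1/3, 1/3) row, column, and support checks.

```python
from itertools import combinations

transactions = {
    "T1": {"a", "b", "c"}, "T2": {"a", "b"}, "T3": {"a", "c"}, "T4": {"b", "c"},
    "T5": {"a", "b", "c", "d"}, "T6": {"d"}, "T7": {"b", "d"}, "T8": {"a"},
}
minsup_abs = 4  # MinSup = 0.5 with n = 8

def exact_support(itemset):
    return sum(itemset <= row for row in transactions.values())

# Exact mining: no pair of items reaches support 4.
pairs = [set(p) for p in combinations("abcd", 2)]
print({frozenset(p): exact_support(p) for p in pairs if exact_support(p) >= minsup_abs})  # {}

# AFI view: {a, b, c} supported by T1..T5, allowing 1/3 zeros per row and per column.
support_set = ["T1", "T2", "T3", "T4", "T5"]
items = ["a", "b", "c"]
row_ok = all(len(transactions[t] & set(items)) >= (1 - 1/3) * len(items) for t in support_set)
col_ok = all(sum(i in transactions[t] for t in support_set) >= (1 - 1/3) * len(support_set)
             for i in items)
print(row_ok and col_ok and len(support_set) >= minsup_abs)  # True
```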
22 Related Works
- Yang et al. (2001) proposed two error-tolerant models: weak error-tolerant itemsets (weak ETI), equivalent to weak AFI, and strong ETI, equivalent to AFI(εr, ·)
  - DRAWBACKS
    - No efficient pruning technique; relies on heuristics and sampling techniques
    - Does not preclude columns of all 0s
- Steinbach et al. (2004) proposed the support envelope, a tool for exploration and visualization of the high-level structure of association patterns. A symmetric ETI model is proposed such that the same fraction of errors is allowed in both rows and columns
  - DRAWBACKS
    - Uses the same error coefficient for rows and columns, i.e. εr = εc
    - Admits only a fixed number of 0s in the itemset; the allowed amount of noise does not vary with the size of the itemset sub-matrix
23 Related Works
- Seppänen and Mannila (2004) proposed mining dense itemsets in the presence of noise, where a dense itemset is an itemset with a sufficiently large sub-matrix that exceeds a given density threshold of attributes present
  - DRAWBACKS
    - Enforces the constraint that all sub-itemsets of a dense itemset must be frequent, and so will fail to identify larger itemsets that have sufficient support when some of their sub-itemsets do not
    - Requires repeated scans of the database
24 Experimental Results - Scalability
- Scalability
  - Database of 10,000 transactions and 100 items
  - Run time increases as noise tolerance increases
  - Reducing the item-wise error constraint leads to a greater reduction in run time than reducing the transaction-wise error constraint
25 Experimental Results - Synthetic Data
- Quality testing for a single cluster
  - Create data with an embedded pattern
  - Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2
26 Experimental Results - Synthetic Data
- Quality testing for multiple clusters
  - Create data with multiple embedded patterns
  - Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2
27 Experimental Results - Real World Data
- Zoo Data Set
  - The database contains 101 instances and 18 attributes
  - All instances are classified into 7 classes, e.g. mammals, fish, etc.
- Comparison of the three approaches:
  - Exact: generated subsets of animals in each class, then found the subsets of their common features
  - ETI(εr): identified "fins" and "domestic" as common features - NOT necessarily true
  - AFI(εr, εc): only AFI was able to recover 3 classes with 100% accuracy
28 Discussion
- Advantages
  - Flexibility of placing constraints independently along rows and columns
  - Generalized Apriori technique for pruning
  - Avoids repeated scans of the database by using 0/1 extension
29 Summary
- The paper presents an algorithm for mining approximate frequent itemsets from noisy data
- It introduces
  - the AFI model
  - a generalized Apriori property for pruning
- The proposed algorithm generates more useful itemsets than existing algorithms and is also computationally more efficient
30 Thank You!
31 Extra Slides for Questions
32 Applying the Apriori Algorithm
Database D (binary matrix, Min_sup = 2):

Transaction  a  b  c  d  e
T1           1  0  1  1  0
T2           0  1  1  0  1
T3           1  1  1  0  1
T4           0  1  0  0  1
T5           0  0  0  0  0

Database D (transaction list):

TID  Items
T1   a, c, d
T2   b, c, e
T3   a, b, c, e
T4   b, e

Scan D -> 1-candidates (Itemset: Sup): a: 2, b: 3, c: 3, d: 1, e: 3
Frequent 1-itemsets: a: 2, b: 3, c: 3, e: 3
2-candidates: ab, ac, ae, bc, be, ce
Scan D -> counts: ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Frequent 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2
3-candidates: bce
Scan D -> Frequent 3-itemsets: bce: 2
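The same trace can be reproduced with a compact Apriori sketch (illustrative, not the authors' implementation):

```python
from itertools import combinations

D = {"T1": {"a", "c", "d"}, "T2": {"b", "c", "e"},
     "T3": {"a", "b", "c", "e"}, "T4": {"b", "e"}}
min_sup = 2

def support(itemset):
    return sum(itemset <= row for row in D.values())

items = sorted(set().union(*D.values()))
level = [frozenset([i]) for i in items]
k = 1
while level:
    frequent = {I: support(I) for I in level if support(I) >= min_sup}
    print(f"frequent {k}-itemsets:", {tuple(sorted(I)): s for I, s in frequent.items()})
    # Candidate generation: join frequent k-itemsets into (k+1)-sets and keep
    # only candidates whose k-subsets are all frequent (Apriori pruning).
    candidates = {I1 | I2 for I1, I2 in combinations(frequent, 2) if len(I1 | I2) == k + 1}
    level = [c for c in candidates
             if all(frozenset(s) in frequent for s in combinations(c, k))]
    k += 1
```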
33 Noise-Tolerant Support Pruning - Proof
34 0/1 Extensions - Proof
- The number of zeroes allowed in an itemset grows with the length of the itemset