Title: CSE 980: Data Mining
1. CSE 980: Data Mining
- Lecture 12: Extension to Association Analysis Formulation
2. Mining Infrequent Patterns
- An infrequent pattern is an itemset or rule whose support is less than the minsup threshold
- When do infrequent patterns become interesting?
  - Negative correlation
    - P(A,B) << P(A)P(B)
    - e.g., Windows vs. Linux
  - Exception rules
    - (Fire = Yes) → (Alarm = Off) may be infrequent, but it is interesting because it may suggest a faulty alarm system
- Challenge
  - There is an enormous number of infrequent patterns
3. Negative Associations
- A negative itemset X is an itemset that satisfies the following properties:
  - X = A ∪ ¬B, where A is a set of positive items and ¬B is a set of negative (absent) items
  - Support: s(X) ≥ minsup
- A negative association rule r is a rule extracted from a negative itemset X that satisfies the following properties:
  - s(X) ≥ minsup
  - Confidence(r) ≥ minconf
- Example: tea → ¬coffee
4. Negative vs. Frequent Patterns
The Venn diagram includes all possible patterns extracted from a given itemset. Negative patterns are either negative itemsets or negative association rules.
5. Negative vs. Frequent Patterns
minsup = 40%, minconf = 50%
- Coke → Milk (support = 40%, conf = 100%)
  - ⇒ frequent and strong association rule
- Beer → Milk (support = 40%, conf = 67%, but support(Milk) = 80%)
  - ⇒ negatively correlated pattern, because the confidence is lower than the support of Milk
- Milk → ¬Eggs (support = 80%, conf = 100%)
  - ⇒ negative association rule
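The Beer and Milk case can be checked numerically. The sketch below is only an illustration: it uses lift (confidence divided by the support of the consequent), which is one common way to test for negative correlation and is not necessarily the measure used in the lecture. The function name `lift` is hypothetical, and 67% is treated as 2/3.

```python
def lift(rule_support, confidence, consequent_support):
    """Return (lift, antecedent_support) for a rule A -> B.

    confidence = s(A, B) / s(A), so s(A) = rule_support / confidence.
    lift = confidence / s(B); a value below 1 indicates negative correlation.
    """
    antecedent_support = rule_support / confidence
    return confidence / consequent_support, antecedent_support


# Figures for Beer -> Milk taken from the slide (67% assumed to mean 2/3).
beer_milk_lift, beer_support = lift(0.40, 2 / 3, 0.80)
print(round(beer_milk_lift, 3), round(beer_support, 2))
```

A lift below 1 means Milk is less likely in baskets that contain Beer than in baskets overall, even though the rule passes both minsup and minconf.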
6. Approach 1: Using Negative Items
- Computationally expensive
- Tends to produce many uninteresting negative
associations
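A minimal sketch of what "using negative items" could look like, assuming each transaction is augmented with an explicit marker for every item it lacks. The `~` prefix, the helper names, and the toy data are all hypothetical. It also shows why the approach is expensive: every augmented transaction becomes as wide as the full item universe.

```python
def add_negative_items(transactions, items):
    """Augment each transaction with '~x' for every item x it does not contain."""
    return [set(t) | {"~" + x for x in items if x not in t} for t in transactions]


def support(itemset, transactions):
    """Fraction of transactions that contain every element of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical toy data.
transactions = [{"tea"}, {"tea", "milk"}, {"coffee", "milk"}, {"coffee"}]
items = ["tea", "coffee", "milk"]

augmented = add_negative_items(transactions, items)
widths = {len(t) for t in augmented}  # every transaction now has len(items) entries
s_tea_not_coffee = support({"tea", "~coffee"}, augmented)
print(widths, s_tea_not_coffee)
```

Once the data is augmented, any standard frequent-itemset miner can be run on it, but most negative items are very frequent, which is what produces the many uninteresting negative associations.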
7. Approach 1: Using Negative Items
(Figure: itemsets of size 2 and size 3 over items A, B and C.)
Support of {¬A, B}, {A, ¬B} and {¬A, ¬B} can be very large
8. Approach 2: Using Positive Itemsets
- Boulicaut et al., 2000
- Compute support of negative itemsets based on the support of positive itemsets
  - For a negative itemset X ∪ ¬Y: s(X ∪ ¬Y) = Σ over all Z ⊆ Y of (−1)^|Z| · s(X ∪ Z)
  - e.g. s(AB¬C¬D) = s(AB) − s(ABC) − s(ABD) + s(ABCD)
- To use this formula
  - Need to use a very low support threshold, or
  - Use approximation
s(X): support of X
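The inclusion-exclusion computation can be sketched as follows. This is an illustrative implementation under stated assumptions, not the algorithm of Boulicaut et al.: the function names and the toy transactions are hypothetical, and the positive supports are counted by brute force rather than taken from a frequent-itemset miner.

```python
from itertools import combinations


def support_of(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def negative_support(positive, negated, support):
    """Support of (positive items present, negated items absent).

    Sums (-1)^|Z| * s(positive ∪ Z) over every subset Z of the negated items.
    """
    total = 0.0
    for k in range(len(negated) + 1):
        for subset in combinations(sorted(negated), k):
            total += (-1) ** k * support[frozenset(positive) | frozenset(subset)]
    return total


# Hypothetical toy transactions over items A, B, C, D.
transactions = [set("AB"), set("ABC"), set("ABD"), set("ABCD"), set("AB"), set("CD")]
support = {frozenset(s): support_of(s, transactions) for s in ("AB", "ABC", "ABD", "ABCD")}

estimate = negative_support("AB", "CD", support)
# Direct count for comparison: A and B present, neither C nor D present.
direct = sum(1 for t in transactions if {"A", "B"} <= t and not ({"C", "D"} & t)) / len(transactions)
print(estimate, direct)
```

The lookup into `support` fails if any of the positive itemsets is missing, which mirrors the point on the slide: the formula needs the supports of all the supersets, so either the support threshold must be very low or the missing terms must be approximated.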
9. Approach 3: Using Domain Knowledge
- Approach
  - Compute expected support using item taxonomy
  - If actual support is much lower than expected support, then declare it as a negative itemset
- Challenges
  - There could be multiple taxonomies (based on brand, size, etc.)
  - Limited to nodes that are directly connected to the frequent itemsets
(Figure: item taxonomy.) Suppose C and G are frequent
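One way to make "expected support" concrete is to scale the support of the frequent parent pair by each child's share of its parent. The sketch below assumes exactly that proportional rule; the children E (of C) and J (of G), all support values, and the `min_ratio` cutoff are hypothetical and are not taken from the slide.

```python
def expected_support(parent_pair_support, child_supports, parent_supports):
    """Scale s(parents) by s(child) / s(parent) for each child/parent pair."""
    expected = parent_pair_support
    for child, parent in zip(child_supports, parent_supports):
        expected *= child / parent
    return expected


def is_negative_itemset(actual, expected, min_ratio=0.5):
    """Flag an itemset whose actual support is far below the expectation.

    min_ratio is an assumed cutoff for "much lower than expected".
    """
    return actual < min_ratio * expected


# Hypothetical supports: s(C, G) = 0.2, s(C) = 0.4, s(G) = 0.5,
# and children E of C with s(E) = 0.2, J of G with s(J) = 0.25.
expected_EJ = expected_support(0.2, [0.2, 0.25], [0.4, 0.5])
print(expected_EJ, is_negative_itemset(0.01, expected_EJ))
```

With these numbers the expected support of {E, J} is 0.05, so an observed support of 0.01 would be declared negative under the assumed cutoff.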
10. Approach 3: Using Domain Knowledge
- A negative itemset is a set of items whose actual support is significantly lower than its expected support
- Negative association rule between X and Y
  - Evaluated with a rule interest measure
- Approach
  - Find frequent itemsets at each level of the taxonomy
  - Identify candidate negative itemsets based on the frequent itemsets found and their item taxonomy
  - Count actual support of candidate itemsets and retain only the negative itemsets
  - Generate negative association rules from negative itemsets
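11. Approach 4: Indirect Association
(Figure: items a and b, each connected to a mediator M.)
IF a and b are each strongly associated with M, THEN a and b are expected to occur frequently together
- a and b are indirectly associated via mediator M when they nevertheless rarely occur together
- M identifies the context in which the negative association is interesting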
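A minimal sketch of an indirect-association test for a single triple (a, b, M). It assumes three thresholds (itempair support, mediator support, mediator dependence) and uses the IS (cosine) measure for dependence. The parameter names `t_pair`, `t_med`, `t_dep`, the choice of IS, and the toy data are assumptions made for illustration.

```python
from math import sqrt


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def is_measure(x, y, transactions):
    """IS (cosine) dependence: s(x ∪ y) / sqrt(s(x) * s(y))."""
    sx, sy = support(x, transactions), support(y, transactions)
    return support(x | y, transactions) / sqrt(sx * sy) if sx and sy else 0.0


def indirectly_associated(a, b, mediator, transactions, t_pair, t_med, t_dep):
    """True if a and b are rarely together but both depend strongly on the mediator."""
    a, b, m = {a}, {b}, set(mediator)
    if support(a | b, transactions) >= t_pair:  # the pair itself must be infrequent
        return False
    return all(
        support(x | m, transactions) >= t_med and is_measure(x, m, transactions) >= t_dep
        for x in (a, b)
    )


# Hypothetical baskets: coke and pepsi never co-occur, both co-occur with chips.
transactions = [{"coke", "chips"}] * 3 + [{"pepsi", "chips"}] * 3 + [{"chips"}] + [{"bread"}] * 3

via_chips = indirectly_associated("coke", "pepsi", {"chips"}, transactions, 0.1, 0.2, 0.5)
via_bread = indirectly_associated("coke", "pepsi", {"bread"}, transactions, 0.1, 0.2, 0.5)
print(via_chips, via_bread)
```

Here chips acts as the mediator that gives the infrequent pair (coke, pepsi) its context, whereas bread does not qualify because neither item occurs with it.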
12. When does Indirect Association become interesting?
For all pairs of items:

             With Mediator   No Mediator
Frequent     FM              FN
Infrequent   IM              IN

Frequent and infrequent pairs are separated by the minimum itempair support; the two columns are separated by the mediator thresholds.
If IM/FM ≈ IN/FN, then Indirect Association is not surprising
13. Finding Interesting Negative Associations

             With Mediator   No Mediator
Frequent     FM              FN
Infrequent   IM              IN

- IM/FM is small
- IM/IN is small
- ⇒ Indirect Association is interesting
14. Finding Interesting Negative Association
Indirect Association is interesting when the minimum itempair support threshold is small. However, if the threshold is too low, very few indirect associations are obtained.
15. Grouping Indirect Associations
- Indirect associations can be grouped together into more compact structures if they have the same mediator
Check degree of association
16. Mining Indirect Associations
Join step
Prune step
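The join and prune steps can be read as follows: join pairs of frequent k-itemsets that share k−1 items (the shared part becomes the mediator M, the leftover items become a and b), then prune candidates whose pair is too frequent or whose dependence on M is too weak. The sketch below is one plausible reading under those assumptions. It enumerates frequent itemsets by brute force instead of Apriori, uses the IS (cosine) measure for dependence, and all names, thresholds, and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def dependence(x, y, transactions):
    """IS (cosine) measure between itemsets x and y (an assumed choice)."""
    sx, sy = support(x, transactions), support(y, transactions)
    return support(x | y, transactions) / sqrt(sx * sy) if sx and sy else 0.0


def frequent_itemsets(k, transactions, t_med):
    """Brute-force frequent k-itemsets; a real miner would use Apriori."""
    items = sorted(set().union(*transactions))
    candidates = (frozenset(c) for c in combinations(items, k))
    return [c for c in candidates if support(c, transactions) >= t_med]


def mine_indirect(transactions, k, t_pair, t_med, t_dep):
    found = []
    for y1, y2 in combinations(frequent_itemsets(k, transactions, t_med), 2):
        mediator = y1 & y2
        if len(mediator) != k - 1:  # join step: itemsets must share k-1 items
            continue
        (a,), (b,) = y1 - mediator, y2 - mediator
        # Prune step: drop frequent pairs and weakly dependent mediators.
        if support(frozenset({a, b}), transactions) >= t_pair:
            continue
        if all(dependence(frozenset({x}), mediator, transactions) >= t_dep for x in (a, b)):
            found.append((min(a, b), max(a, b), mediator))
    return found


# Hypothetical baskets.
transactions = [{"coke", "chips"}] * 3 + [{"pepsi", "chips"}] * 3 + [{"chips"}] + [{"bread"}] * 3
result = mine_indirect(transactions, k=2, t_pair=0.1, t_med=0.2, t_dep=0.5)
print(result)
```

Because both joined itemsets are already frequent, the mediator support condition holds by construction and only the itempair support and dependence conditions need to be checked in the prune step.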
17. Application: LA-Times
20. Application: Reuters-21578 news
- Indirect association can identify different contexts of a word
23. Application: Retail Data
- Indirect association can identify competing (and sometimes complementary) items
24. Application: Retail Data
- Note: There is no checkered-flag border wallpaper