Title: Mining Association Rules in Large Databases

1. Mining Association Rules in Large Databases

2. Alternative Methods for Frequent Itemset Generation
- Traversal of Itemset Lattice
- General-to-specific vs. specific-to-general

3. Alternative Methods for Frequent Itemset Generation
- Traversal of Itemset Lattice
- Equivalence classes

4. Alternative Methods for Frequent Itemset Generation
- Traversal of Itemset Lattice
- Breadth-first vs. depth-first

5. ECLAT
- For each item, store a list of transaction ids (tids): the TID-list

6. ECLAT
- Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.
- 3 traversal approaches: top-down, bottom-up and hybrid
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory
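
The tid-list intersection step can be sketched in a few lines of Python; the items and transactions below are made-up illustration data:

```python
# Minimal sketch of ECLAT's vertical layout and tid-list intersection.
# The items and transactions below are made-up illustration data.

tid_lists = {                 # item -> set of ids of transactions containing it
    "A": {1, 2, 4, 5},
    "B": {1, 2, 3, 4, 5},
    "C": {2, 4, 5},
}

def support(itemset):
    """Support count of an itemset = size of the intersection of its tid-lists."""
    return len(set.intersection(*(tid_lists[i] for i in itemset)))

# The tid-list of {A, B} comes from intersecting the lists of its (k-1)-subsets:
ab_tids = tid_lists["A"] & tid_lists["B"]
print(sorted(ab_tids))            # [1, 2, 4, 5]
print(support(("A", "B", "C")))   # 3
```

With a vertical layout like this, support counting needs no pass over the full database, which is the fast-counting advantage the slide mentions; the sets themselves are the intermediate tid-lists that can outgrow memory.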

7. Mining Frequent Closed Patterns: CLOSET
- Flist: list of all frequent items in support-ascending order
- Flist: d-a-f-e-c (min_sup = 2)
- Divide the search space:
- Patterns having d
- Patterns having a but no d, etc.
- Find frequent closed patterns recursively
- Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
- J. Pei, J. Han, R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00

8. Iceberg Queries
- Iceberg query: compute aggregates over one or a set of attributes only for those whose aggregate values are above a certain threshold
- Example:
- select P.custID, P.itemID, sum(P.qty)
- from purchase P
- group by P.custID, P.itemID
- having sum(P.qty) > 10
- Compute iceberg queries efficiently by Apriori:
- First compute lower dimensions
- Then compute higher dimensions only when all the lower ones are above the threshold
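
The two-step Apriori-style evaluation can be sketched as follows; the purchase rows and the `purchases`/`threshold` names are hypothetical:

```python
# Sketch of Apriori-style iceberg query evaluation for the slide's query.
# The purchase rows, names, and threshold are hypothetical.
from collections import defaultdict

purchases = [                     # (custID, itemID, qty)
    ("c1", "i1", 6), ("c1", "i1", 7),
    ("c1", "i2", 4),
    ("c2", "i1", 12),
    ("c3", "i2", 3),
]
threshold = 10                    # having sum(P.qty) > 10

# Step 1: compute the lower-dimensional aggregates first
by_cust, by_item = defaultdict(int), defaultdict(int)
for c, i, q in purchases:
    by_cust[c] += q
    by_item[i] += q

# Step 2: aggregate a (cust, item) pair only if both 1-D sums clear the
# threshold; with qty >= 0, the pair's sum can never exceed either marginal,
# so the pruned pairs are safe to skip.
pairs = defaultdict(int)
for c, i, q in purchases:
    if by_cust[c] > threshold and by_item[i] > threshold:
        pairs[(c, i)] += q

iceberg = {k: v for k, v in pairs.items() if v > threshold}
print(iceberg)   # {('c1', 'i1'): 13, ('c2', 'i1'): 12}
```

The non-negative quantities make the threshold test anti-monotone across dimensions, which is exactly the Apriori property being exploited.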

9. Multiple-Level Association Rules
- Items often form a hierarchy.
- Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- A transactional database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.
(Figure: concept hierarchy with food at the root; bread and milk beneath it; white and wheat under bread, full fat and skim under milk; brand-level items such as Fraser and Wonder at the leaves.)

10. Mining Multi-Level Associations
- A top-down, progressive deepening approach:
- First find high-level strong rules: milk ⇒ bread [support = 20%, confidence = 60%]
- Then find their lower-level "weaker" rules: full fat milk ⇒ wheat bread [6%, 50%]
- Variations at mining multiple-level association rules:
- Level-crossed association rules: full fat milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies: full fat milk ⇒ Wonder bread

11. Multi-Dimensional Association: Concepts
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
- Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values

12. Techniques for Mining Multi-Dimensional Associations
- Search for frequent k-predicate sets
- Example: {age, occupation, buys} is a 3-predicate set
- Techniques can be categorized by how age is treated:
- 1. Using static discretization of quantitative attributes
- Quantitative attributes are statically discretized using predefined concept hierarchies.
- 2. Quantitative association rules
- Quantitative attributes are dynamically discretized into bins based on the distribution of the data.
- 3. Distance-based association rules
- This is a dynamic discretization process that considers the distance between data points.

13. Quantitative Association Rules
- Numeric attributes are dynamically discretized
- such that the confidence or compactness of the rules mined is maximized.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid.
- Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV")

14. ARCS (Association Rule Clustering System)
- How does ARCS work?
- 1. Binning
- 2. Find frequent predicate sets
- 3. Clustering
- 4. Optimize
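
The steps above can be sketched on made-up data; note that equi-width binning and the simple cell-adjacency test below are stand-ins for ARCS's actual binning and clustering methods:

```python
# A rough sketch of the ARCS pipeline on made-up data. Equi-width binning and
# the cell-adjacency test below stand in for ARCS's actual methods.
from collections import Counter

# Hypothetical tuples: (age, income in K, item bought)
rows = [(31, 30, "TV"), (33, 35, "TV"), (32, 40, "TV"),
        (38, 45, "TV"), (52, 80, "PC")]

def bin_of(value, width):
    """Step 1 (binning): map a value to its equi-width bin (lo, hi)."""
    lo = (value // width) * width
    return (lo, lo + width - 1)

# Step 2: count grid cells, i.e. (age-bin, income-bin, item) predicate sets,
# and keep those meeting a minimum support count (2 here).
cells = Counter((bin_of(a, 5), bin_of(inc, 10), item) for a, inc, item in rows)
frequent = {cell for cell, n in cells.items() if n >= 2}

def adjacent(c1, c2):
    """Step 3 (clustering): two cells may merge if they predict the same item
    and share an income bin while their age bins touch."""
    (a1, i1, t1), (a2, i2, t2) = c1, c2
    return t1 == t2 and i1 == i2 and abs(a1[0] - a2[0]) == 5

print(frequent)   # {((30, 34), (30, 39), 'TV')}
print(adjacent(((30, 34), (30, 39), "TV"), ((35, 39), (30, 39), "TV")))  # True
```

Merging adjacent frequent cells is what turns many narrow 2-D rules into one general rule over a larger grid region (step 4, optimization, is omitted here).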

15. Limitations of ARCS
- Only quantitative attributes on the LHS of rules.
- Only 2 attributes on the LHS (2-D limitation).
- An alternative to ARCS:
- Non-grid-based
- Equi-depth binning
- Clustering based on a measure of partial completeness.
- "Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal.

16. Interestingness Measurements
- Objective measures
- Two popular measurements: support and confidence
- Subjective measures
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user) and/or
- actionable (the user can do something with it)

17. Computing Interestingness Measure
- Given a rule X ⇒ Y, the information needed to compute rule interestingness can be obtained from a contingency table.
- Contingency table for X ⇒ Y: the 2×2 table of counts over X present/absent vs. Y present/absent.
- Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

18. Drawback of Confidence

19. Statistical Independence
- Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B) ⇒ statistically independent
- P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
- P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated

20. Statistical-Based Measures
- Measures that take into account statistical dependence
- e.g. lift = P(X ∧ Y) / (P(X) × P(Y)), used in the next example

21. Example: Lift/Interest
- Association rule: Tea ⇒ Coffee
- Confidence = P(Coffee | Tea) = 0.75
- but P(Coffee) = 0.9
- Lift = 0.75/0.9 = 0.8333 (< 1, therefore negatively associated)
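
Both the independence check and the lift example above can be verified with the numbers given on the slides; `lift` is the only helper introduced here:

```python
# Verifying the slide arithmetic for lift (a.k.a. interest):
# lift(X -> Y) = P(X and Y) / (P(X) * P(Y)) = confidence(X -> Y) / P(Y)

def lift(p_xy, p_x, p_y):
    """Lift of the rule X -> Y from the joint and marginal probabilities."""
    return p_xy / (p_x * p_y)

# Swim/bike slide: P(S and B) = 0.42, P(S) = 0.6, P(B) = 0.7
print(lift(0.42, 0.6, 0.7))   # 1.0 up to float rounding -> independent

# Tea -> Coffee slide: confidence = 0.75, P(Coffee) = 0.9
print(0.75 / 0.9)             # ~0.833 < 1 -> negatively associated
```

Lift equal to 1 signals independence, above 1 positive correlation, below 1 negative correlation, matching the three cases on the statistical independence slide.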

22. There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?

23. Example: φ-Coefficient
- The φ-coefficient is analogous to the correlation coefficient for continuous variables.
- The φ-coefficient is the same for both tables shown on the slide.

24. Properties of a Good Measure
- Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
- M(A,B) = 0 if A and B are statistically independent
- M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
- M(A,B) decreases monotonically with P(A) (or P(B)) when P(A,B) and P(B) (or P(A)) remain unchanged

25. Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data? Could it be real? Making good use of constraints!
- What kinds of constraints can be used in mining?
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries, e.g. find product pairs sold together in Vancouver in Dec. '98.
- Dimension/level constraints: in relevance to region, price, brand, customer category.
- Interestingness constraints: strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
- Rule constraints: cheap item sales (price < $10) triggers big sales (sum > $200).

26. Rule Constraints in Association Mining
- Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining, e.g. P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems").
- Rule (content) constraints: constraint-based query optimization, e.g. sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000.

27. Constraint-Based Association Query
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C},
- where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
- Class constraint: S ⊆ A, e.g. S ⊆ Item
- Domain constraints:
- S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
- v θ S, where θ is ∈ or ∉, e.g. snacks ∉ S.Type
- V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊇, ⊃}, e.g. {snacks, sodas} ⊆ S.Type
- Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
- e.g. count(S1.Type) = 1, avg(S2.Price) < 100

28. Constrained Association Query Optimization Problem
- Given a CAQ {(S1, S2) | C}, the algorithm should be:
- Sound: it only finds frequent sets that satisfy the given constraints C
- Complete: all frequent sets satisfying the given constraints C are found
- A naïve solution:
- Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
- A better approach:
- Comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation.

29. Anti-Monotone and Monotone Constraints
- A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca.
- A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it.

30. Property of Constraints: Anti-Monotone
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
- Examples:
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Application:
- Push sum(S.Price) ≤ 1000 deeply into iterative frequent set computation.
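
A minimal sketch (with made-up items and prices) of how an anti-monotone constraint like sum(S.Price) ≤ v can be pushed into levelwise candidate generation:

```python
# Sketch: pushing the anti-monotone constraint sum(S.price) <= budget into
# levelwise (Apriori-style) itemset generation. Items and prices are made up.

price = {"a": 200, "b": 500, "c": 900, "d": 400}
budget = 1000                      # constraint: sum(S.price) <= 1000

def total(itemset):
    return sum(price[i] for i in itemset)

# Level 1: prune single items violating the constraint immediately
level = [frozenset([i]) for i in price if price[i] <= budget]

survivors = list(level)
while level:
    # Generate (k+1)-candidates only from sets that satisfied the constraint:
    # any superset of a violating set must also violate it (anti-monotonicity),
    # so violating sets never need to be extended.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if total(c) <= budget]
    survivors += level

print(sorted(sorted(s) for s in survivors))
# [['a'], ['a', 'b'], ['a', 'd'], ['b'], ['b', 'd'], ['c'], ['d']]
```

Support counting is omitted to isolate the constraint-pushing idea; in a real miner the same pruning loop would also drop infrequent candidates, since the frequency constraint is itself anti-monotone.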

31. Characterization of Anti-Monotonicity Constraints

    Constraint                      Anti-monotone?
    S θ v, θ ∈ {=, ≤, ≥}            yes
    v ∈ S                           no
    S ⊇ V                           no
    S ⊆ V                           yes
    min(S) ≤ v                      no
    min(S) ≥ v                      yes
    max(S) ≤ v                      yes
    max(S) ≥ v                      no
    count(S) ≤ v                    yes
    count(S) ≥ v                    no
    sum(S) ≤ v                      yes
    sum(S) ≥ v                      no
    avg(S) θ v, θ ∈ {≤, ≥}          convertible
    support(S) ≥ ξ (frequency)      (yes)

32. Succinct Constraint
- If a rule constraint is succinct, we can directly generate precisely the sets that satisfy it, before support counting begins.
- Example: min(I.Price) ≤ 500, where I is a set of products
- Knowing the price of each product, we can directly evaluate this predicate.
- Such constraints are pre-counting prunable.

33. Property of Constraints: Succinctness
- Succinctness:
- For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C.
- Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset that belongs to A1.
- Examples:
- sum(S.Price) ≥ v is not succinct
- min(S.Price) ≤ v is succinct
- Optimization:
- If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not affected by the iterative support counting.
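
A sketch of why succinctness allows pre-counting enumeration; the items and prices are made up, and the constraint min(S.price) ≤ 500 mirrors the slide's example:

```python
# Sketch: because min(S.price) <= 500 is succinct, the satisfying itemsets can
# be enumerated directly, before any support counting. Items/prices are made up.
from itertools import chain, combinations

price = {"a": 200, "b": 500, "c": 900, "d": 1200}
cheap = {i for i, p in price.items() if p <= 500}   # A1: size-1 sets satisfying C
other = set(price) - cheap

def powerset(xs):
    xs = sorted(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Every satisfying set = (non-empty subset of cheap items) U (any subset of the rest)
satisfying = [set(a) | set(b)
              for a in powerset(cheap) if a
              for b in powerset(other)]

assert all(min(price[i] for i in s) <= 500 for s in satisfying)
print(len(satisfying))   # 3 non-empty cheap subsets x 4 other subsets = 12
```

The generator never produces a violating set and misses no satisfying set, which is exactly the "directly generate precisely the sets that satisfy it" property.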

34. Characterization of Constraints by Succinctness

    Constraint                      Succinct?
    S θ v, θ ∈ {=, ≤, ≥}            yes
    v ∈ S                           yes
    S ⊇ V                           yes
    S ⊆ V                           yes
    S = V                           yes
    min(S) ≤ v                      yes
    min(S) ≥ v                      yes
    min(S) = v                      yes
    max(S) ≤ v                      yes
    max(S) ≥ v                      yes
    max(S) = v                      yes
    count(S) ≤ v                    weakly
    count(S) ≥ v                    weakly
    count(S) = v                    weakly
    sum(S) ≤ v                      no
    sum(S) ≥ v                      no
    sum(S) = v                      no
    avg(S) θ v, θ ∈ {=, ≤, ≥}       no
    support(S) ≥ ξ (frequency)      (no)

35. Convertible Constraint
- Suppose all items in patterns are listed in a total order R.
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C.
- Example: avg(I.Price) ≥ 500
- If items are added to the itemset sorted by price, we know that avg(I.Price) can only increase with the addition of a new item.
- Similarly, there are convertible monotone constraints.

36. Example of Convertible Constraints: avg(S) ≥ v
- Let R be the value-descending order over the set of items
- e.g. I = {9, 8, 6, 4, 3, 1}
- avg(S) ≥ v is convertible monotone w.r.t. R:
- If S is a suffix of S1, then avg(S1) ≥ avg(S)
- {8, 4, 3} is a suffix of {9, 8, 4, 3}
- avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
- If S satisfies avg(S) ≥ v, so does S1
- {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}
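
The slide's arithmetic can be checked directly; `avg` is the only helper introduced here:

```python
# Checking the slide's suffix argument for avg(S) >= v with items in the
# value-descending order R = [9, 8, 6, 4, 3, 1].

def avg(s):
    return sum(s) / len(s)

S1 = [9, 8, 4, 3]   # a pattern listed in the order R
S = S1[1:]          # [8, 4, 3], a suffix of S1 w.r.t. R

print(avg(S1), avg(S))        # 6.0 5.0, as on the slide

# Convertible monotone: if the suffix satisfies avg(S) >= v, so does S1,
# because extending a suffix toward the front only adds larger values.
v = 4
assert avg(S) >= v and avg(S1) >= v
```

This is the property that lets the constraint be pushed into suffix-based (FP-growth-style) pattern growth even though avg is neither anti-monotone nor monotone in general.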