Mining Association Rules in Large Databases

Provided by: HKUC4
Learn more at: https://www.cs.bu.edu

1
Mining Association Rules in Large Databases
2
Alternative Methods for Frequent Itemset
Generation
  • Traversal of Itemset Lattice
  • General-to-specific vs Specific-to-general

3
Alternative Methods for Frequent Itemset
Generation
  • Traversal of Itemset Lattice
  • Equivalence Classes

4
Alternative Methods for Frequent Itemset
Generation
  • Traversal of Itemset Lattice
  • Breadth-first vs Depth-first

5
ECLAT
  • For each item, store a list of transaction ids
    (tids)

TID-list
6
ECLAT
  • Determine the support of any k-itemset by
    intersecting the tid-lists of two of its
    (k-1)-subsets.
  • 3 traversal approaches:
  • top-down, bottom-up, and hybrid
  • Advantage: very fast support counting
  • Disadvantage: intermediate tid-lists may become
    too large for memory
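
The tid-list intersection described above can be sketched in Python. This is a minimal illustration rather than the original ECLAT implementation: the function names `build_tidlists`, `eclat`, and `mine`, the depth-first traversal, and the toy transactions are all assumptions made for the example.

```python
def build_tidlists(transactions):
    """Map each item to the set of transaction ids (tids) that contain it."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def eclat(prefix, items, min_sup, out):
    """Depth-first search; `items` holds (item, tidset) pairs that are
    already frequent when combined with `prefix`."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support = size of the tid-list
        extensions = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids  # intersect two tid-lists
            if len(common) >= min_sup:
                extensions.append((other, common))
        if extensions:
            eclat(itemset, extensions, min_sup, out)


def mine(transactions, min_sup):
    tidlists = build_tidlists(transactions)
    frequent = [(i, t) for i, t in tidlists.items() if len(t) >= min_sup]
    frequent.sort(key=lambda pair: pair[0])
    out = {}
    eclat((), frequent, min_sup, out)
    return out


# Hypothetical toy database: tid -> items
transactions = {
    1: {"A", "B"},
    2: {"A", "B", "C"},
    3: {"A", "C"},
    4: {"B", "C"},
    5: {"A", "B", "C"},
}
result = mine(transactions, min_sup=3)
```

Each recursive call keeps only the intersected tid-lists, which is why support counting is fast and also why the intermediate lists can grow large on dense data.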

7
Mining Frequent Closed Patterns: CLOSET
  • Flist: list of all frequent items in
    support-ascending order
  • Flist: d-a-f-e-c
  • Divide search space
  • Patterns having d
  • Patterns having d but no a, etc.
  • Find frequent closed patterns recursively
  • Every transaction having d also has cfa → cfad is
    a frequent closed pattern
  • J. Pei, J. Han, and R. Mao. "CLOSET: An Efficient
    Algorithm for Mining Frequent Closed Itemsets",
    DMKD'00.

Min_sup = 2
8
Iceberg Queries
  • Iceberg query: compute aggregates over one
    attribute or a set of attributes, only for those
    groups whose aggregate value is above a certain
    threshold
  • Example:
  • select P.custID, P.itemID, sum(P.qty)
  • from purchase P
  • group by P.custID, P.itemID
  • having sum(P.qty) > 10
  • Compute iceberg queries efficiently by Apriori:
  • First compute lower dimensions
  • Then compute higher dimensions only when all the
    lower ones are above the threshold
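
The Apriori-style evaluation above can be sketched in Python for the example query. This is an illustrative sketch under stated assumptions: quantities are non-negative, the function name `iceberg_pairs` and the sample rows are hypothetical, and the threshold test mirrors `having sum(P.qty) > 10`.

```python
from collections import defaultdict


def iceberg_pairs(purchases, threshold):
    """purchases: list of (cust_id, item_id, qty) rows with qty >= 0."""
    # Lower dimensions first: totals per customer and per item.
    cust_total = defaultdict(int)
    item_total = defaultdict(int)
    for cust, item, qty in purchases:
        cust_total[cust] += qty
        item_total[item] += qty
    heavy_custs = {c for c, t in cust_total.items() if t > threshold}
    heavy_items = {i for i, t in item_total.items() if t > threshold}

    # Higher dimension: aggregate a (cust, item) pair only when both of
    # its one-dimensional totals already exceed the threshold.
    pair_total = defaultdict(int)
    for cust, item, qty in purchases:
        if cust in heavy_custs and item in heavy_items:
            pair_total[(cust, item)] += qty
    return {pair: t for pair, t in pair_total.items() if t > threshold}


# Hypothetical rows of the purchase table
purchases = [
    ("c1", "i1", 6),
    ("c1", "i1", 7),
    ("c1", "i2", 2),
    ("c2", "i1", 3),
    ("c2", "i2", 4),
]
answer = iceberg_pairs(purchases, threshold=10)
```

The pruning is safe because, with non-negative quantities, a pair's total can never exceed the total of its customer or of its item.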

9
Multiple-Level Association Rules
  • Items often form hierarchy.
  • Items at the lower level are expected to have
    lower support.
  • Rules regarding itemsets at appropriate levels
    could be quite useful.
  • A transactional database can be encoded based on
    dimensions and levels
  • We can explore shared multi-level mining

[Figure: concept hierarchy. Food; milk (skim, full
fat); bread (white, wheat); brands such as Fraser,
Wonder, ...]
10
Mining Multi-Level Associations
  • A top-down, progressive deepening approach:
  • First find high-level strong rules:
  • milk → bread [20%, 60%].
  • Then find their lower-level "weaker" rules:
  • full fat milk → wheat bread [6%, 50%].
  • Variations in mining multiple-level association
    rules:
  • Level-crossed association rules:
  • full fat milk → Wonder wheat bread
  • Association rules with multiple, alternative
    hierarchies:
  • full fat milk → Wonder bread
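
The progressive deepening idea can be sketched as follows. This is a rough illustration, not the method from the slides: the `PARENT` mapping, the helper names, the thresholds, and the eight toy transactions are all hypothetical.

```python
# Hypothetical two-level hierarchy: leaf item -> higher-level category
PARENT = {
    "full fat milk": "milk",
    "skim milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}


def generalize(transaction):
    """Replace every item by its higher-level ancestor."""
    return {PARENT.get(item, item) for item in transaction}


def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs."""
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs in t)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    confidence = both / lhs_count if lhs_count else 0.0
    return both / n, confidence


transactions = [
    {"full fat milk", "wheat bread"},
    {"full fat milk", "white bread"},
    {"skim milk", "wheat bread"},
    {"skim milk"},
    {"full fat milk"},
    {"white bread"},
    {"wheat bread"},
    {"skim milk", "white bread"},
]

# Step 1: mine at the high level.
high_level = [generalize(t) for t in transactions]
high = rule_stats(high_level, "milk", "bread")

# Step 2: descend only if the high-level rule is strong enough
# (assumed thresholds: 40% support, 60% confidence).
low = None
if high[0] >= 0.4 and high[1] >= 0.6:
    low = rule_stats(transactions, "full fat milk", "wheat bread")
```

On this toy data the lower-level rule has lower support and confidence than its high-level ancestor, which matches the expectation that items at lower levels have lower support.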

11
Multi-Dimensional Association Concepts
  • Single-dimensional rules:
  • buys(X, "milk") → buys(X, "bread")
  • Multi-dimensional rules: ≥ 2 dimensions or
    predicates
  • Inter-dimension association rules (no repeated
    predicates)
  • age(X, "19-25") ∧ occupation(X, "student") →
    buys(X, "coke")
  • Hybrid-dimension association rules (repeated
    predicates)
  • age(X, "19-25") ∧ buys(X, "popcorn") → buys(X,
    "coke")
  • Categorical attributes
  • finite number of possible values, no ordering
    among values
  • Quantitative attributes
  • numeric, implicit ordering among values

12
Techniques for Mining Multi-Dimensional
Associations
  • Search for frequent k-predicate set
  • Example: {age, occupation, buys} is a 3-predicate
    set.
  • Techniques can be categorized by how quantitative
    attributes such as age are treated.
  • 1. Using static discretization of quantitative
    attributes
  • Quantitative attributes are statically
    discretized by using predefined concept
    hierarchies.
  • 2. Quantitative association rules
  • Quantitative attributes are dynamically
    discretized into "bins" based on the distribution
    of the data.
  • 3. Distance-based association rules
  • This is a dynamic discretization process that
    considers the distance between data points.

13
Quantitative Association Rules
  • Numeric attributes are dynamically discretized
  • such that the confidence or compactness of the
    rules mined is maximized.
  • 2-D quantitative association rules: A_quan1 ∧
    A_quan2 → A_cat
  • Cluster "adjacent" association rules to form
    general rules using a 2-D grid.
  • Example:

age(X, "30-34") ∧ income(X, "24K - 48K") →
buys(X, "high resolution TV")
14
ARCS (Association Rule Clustering System)
  • How does ARCS work?
  • 1. Binning
  • 2. Find frequent predicate-set
  • 3. Clustering
  • 4. Optimize

15
Limitations of ARCS
  • Only quantitative attributes on LHS of rules.
  • Only 2 attributes on LHS. (2D limitation)
  • An alternative to ARCS
  • Non-grid-based
  • equi-depth binning
  • clustering based on a measure of partial
    completeness.
  • Mining Quantitative Association Rules in Large
    Relational Tables by R. Srikant and R. Agrawal.

16
Interestingness Measurements
  • Objective measures
  • Two popular measurements: support and confidence
  • Subjective measures
  • A rule (pattern) is interesting if
  • it is unexpected (surprising to the user) and/or
  • actionable (the user can do something with it)

17
Computing Interestingness Measure
  • Given a rule X → Y, the information needed to
    compute rule interestingness can be obtained from
    a contingency table

Contingency table for X → Y
  • Used to define various measures
  • support, confidence, lift, Gini, J-measure,
    etc.
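
As a small sketch of how such measures fall out of a 2x2 contingency table, the Python below computes support, confidence, and lift. The cell names `f11`, `f10`, `f01`, `f00` (counts of transactions with X and Y, X only, Y only, neither) and the sample counts are assumptions made for illustration; the slide's own table is not reproduced in this transcript.

```python
def rule_measures(f11, f10, f01, f00):
    """Measures for the rule X -> Y from the four contingency counts."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                 # P(X, Y)
    confidence = f11 / (f11 + f10)    # P(Y | X)
    p_y = (f11 + f01) / n             # P(Y)
    lift = confidence / p_y           # P(Y | X) / P(Y)
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical counts over 100 transactions
measures = rule_measures(f11=30, f10=10, f01=20, f00=40)
```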

18
Drawback of Confidence
19
Statistical Independence
  • Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S,B)
  • P(S∧B) = 420/1000 = 0.42
  • P(S) × P(B) = 0.6 × 0.7 = 0.42
  • P(S∧B) = P(S) × P(B) => Statistical independence
  • P(S∧B) > P(S) × P(B) => Positively correlated
  • P(S∧B) < P(S) × P(B) => Negatively correlated
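
The three cases above can be checked with a few lines of Python. The function name `correlation_type` is hypothetical; the comparison is done on integer counts (multiplying both sides by the squared population size) so that floating-point rounding cannot blur the equality case.

```python
def correlation_type(n, n_a, n_b, n_ab):
    """Compare P(A and B) with P(A) * P(B) using raw counts.

    P(A and B) = n_ab / n and P(A) * P(B) = n_a * n_b / n**2,
    so it is enough to compare n_ab * n with n_a * n_b.
    """
    joint = n_ab * n
    product = n_a * n_b
    if joint == product:
        return "independent"
    return "positively correlated" if joint > product else "negatively correlated"


# Swim/bike numbers from the slide
swim_bike = correlation_type(n=1000, n_a=600, n_b=700, n_ab=420)
```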

20
Statistical-based Measures
  • Measures that take into account statistical
    dependence

21
Example: Lift/Interest
  • Association rule: Tea → Coffee
  • Confidence = P(Coffee|Tea) = 0.75
  • but P(Coffee) = 0.9
  • Lift = 0.75/0.9 = 0.8333 (< 1, therefore
    negatively associated)
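
The arithmetic of this example, written out as a short Python sketch (the helper name `lift` is an assumption):

```python
def lift(confidence, p_consequent):
    """Lift of X -> Y: P(Y | X) divided by P(Y)."""
    return confidence / p_consequent


tea_coffee = lift(confidence=0.75, p_consequent=0.9)
negatively_associated = tea_coffee < 1
```

A lift below 1 means that knowing a customer buys tea lowers the estimated chance of buying coffee, even though the confidence of 0.75 looks high on its own.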

22
  • There are lots of measures proposed in the
    literature
  • Some measures are good for certain applications,
    but not for others
  • What criteria should we use to determine whether
    a measure is good or bad?
  • What about Apriori-style support-based pruning?
    How does it affect these measures?
23
Example: φ-Coefficient
  • φ-coefficient is analogous to the correlation
    coefficient for continuous variables

φ-coefficient is the same for both tables
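
A sketch of the φ-coefficient computed from a 2x2 contingency table follows. The two tables from the slide are not reproduced in this transcript, so the counts below are hypothetical: they are chosen so that swapping the "both present" and "both absent" cells leaves φ unchanged.

```python
from math import sqrt


def phi_coefficient(f11, f10, f01, f00):
    """phi = (f11*f00 - f10*f01) / sqrt(row and column totals multiplied)."""
    row1, row0 = f11 + f10, f01 + f00
    col1, col0 = f11 + f01, f10 + f00
    return (f11 * f00 - f10 * f01) / sqrt(row1 * row0 * col1 * col0)


# Hypothetical tables that differ only by swapping f11 and f00
phi_a = phi_coefficient(f11=60, f10=10, f01=10, f00=20)
phi_b = phi_coefficient(f11=20, f10=10, f01=10, f00=60)
```

Because φ treats co-presence and co-absence symmetrically, both tables give the same value, about 0.52.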
24
Properties of A Good Measure
  • Piatetsky-Shapiro: 3 properties a good measure M
    must satisfy
  • M(A,B) = 0 if A and B are statistically
    independent
  • M(A,B) increases monotonically with P(A,B) when
    P(A) and P(B) remain unchanged
  • M(A,B) decreases monotonically with P(A) or
    P(B) when P(A,B) and P(B) or P(A) remain
    unchanged

25
Constraint-Based Mining
  • Interactive, exploratory mining of gigabytes of
    data?
  • Could it be real? Making good use of
    constraints!
  • What kinds of constraints can be used in mining?
  • Knowledge type constraint: classification,
    association, etc.
  • Data constraint: SQL-like queries
  • Find product pairs sold together in Vancouver in
    Dec. '98.
  • Dimension/level constraints:
  • in relevance to region, price, brand, customer
    category.
  • Interestingness constraints:
  • strong rules (min_support ≥ 3%, min_confidence ≥
    60%).
  • Rule constraints:
  • cheap item sales (price < 10) trigger big
    sales (sum > 200).

26
Rule Constraints in Association Mining
Left-hand side
Right-hand side
  • Two kinds of rule constraints:
  • Rule form constraints: meta-rule guided mining.
  • P(x, y) ∧ Q(x, w) → takes(x, "database
    systems").
  • Rule (content) constraint: constraint-based query
    optimization.
  • sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3
    ∧ sum(RHS) > 1000

27
Constraint-Based Association Query
  • Database: (1) trans (TID, Itemset), (2)
    itemInfo (Item, Type, Price)
  • A constrained asso. query (CAQ) is in the form of
    {(S1, S2) | C},
  • where C is a set of constraints on S1, S2
    including frequency constraint
  • A classification of (single-variable)
    constraints:
  • Class constraint: S ⊂ A, e.g. S ⊂ Item
  • Domain constraint:
  • S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price <
    100
  • v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
  • V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
  • e.g. {snacks, sodas} ⊆ S.Type
  • Aggregation constraint: agg(S) θ v, where agg is
    in {min, max, sum, count, avg}, and θ ∈ {=, ≠,
    <, ≤, >, ≥}.
  • e.g. count(S1.Type) = 1, avg(S2.Price) < 100

28
Constrained Association Query Optimization Problem
  • Given a CAQ {(S1, S2) | C}, the algorithm
    should be
  • sound: it only finds frequent sets that satisfy
    the given constraints C
  • complete: all frequent sets that satisfy the
    given constraints C are found
  • A naïve solution:
  • Apply Apriori to find all frequent sets, and
    then test them for constraint satisfaction one
    by one.
  • A better approach:
  • Analyze the properties of the constraints
    comprehensively and push them as deeply as
    possible inside the frequent set computation.

29
Anti-monotone and Monotone Constraints
  • A constraint Ca is anti-monotone iff. for any
    pattern S not satisfying Ca, none of the
    super-patterns of S can satisfy Ca
  • A constraint Cm is monotone iff. for any pattern
    S satisfying Cm, every super-pattern of S also
    satisfies it

30
Property of Constraints: Anti-Monotone
  • Anti-monotonicity: if a set S violates the
    constraint, any superset of S violates the
    constraint.
  • Examples:
  • sum(S.Price) ≤ v is anti-monotone
  • sum(S.Price) ≥ v is not anti-monotone
  • Application:
  • Push sum(S.price) ≤ 1000 deeply into iterative
    frequent set computation.
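
Pushing an anti-monotone constraint into the level-wise computation might look like the sketch below. It is a simplified Apriori-style loop written for illustration, assuming non-negative prices; the function name `constrained_apriori`, the price table, and the transactions are hypothetical.

```python
from itertools import combinations


def constrained_apriori(transactions, price, min_sup, max_sum):
    """Frequent itemsets S with sum of prices <= max_sum.

    The price check runs before support counting at every level, so a
    violating set is dropped early and none of its supersets is generated.
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def within_budget(itemset):
        return sum(price[i] for i in itemset) <= max_sum

    items = sorted({i for t in transactions for i in t})
    level = [s for s in (frozenset([i]) for i in items)
             if within_budget(s) and support(s) >= min_sup]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        previous = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in previous
                        for sub in combinations(c, k - 1))
                 and within_budget(c)          # constraint pushed inside
                 and support(c) >= min_sup]
    return result


# Hypothetical data
price = {"a": 3, "b": 4, "c": 6}
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                {"b", "c"}, {"a", "b", "c"}]
found = constrained_apriori(transactions, price, min_sup=2, max_sum=8)
```

With a budget of 8, the pairs {a, c} and {b, c} are rejected on price alone, so {a, b, c} is never generated or counted.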

31
Characterization of Anti-Monotonicity
Constraints
Constraint                      Anti-monotone?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
min(S) ≤ v                      no
min(S) ≥ v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      no
count(S) ≤ v                    yes
count(S) ≥ v                    no
sum(S) ≤ v                      yes
sum(S) ≥ v                      no
avg(S) θ v, θ ∈ {≤, ≥}          convertible
(frequent constraint)           (yes)
32
Succinct Constraint
  • If a rule constraint is succinct, we can
    directly generate precisely the sets that satisfy
    it, before support counting begins.
  • Example min(I.price)?500, I is a set of products
  • Knowing the price of each product, we can
    evaluate this predicate directly
  • Such constraints are pre-counting prunable

33
Property of Constraints: Succinctness
  • Succinctness:
  • For any sets S1 and S2 satisfying C, S1 ∪ S2
    satisfies C
  • Given A1, the collection of sets of size 1
    satisfying C, any set S satisfying C is based on
    A1, i.e., it contains a subset that belongs to A1
  • Example:
  • sum(S.Price) ≥ v is not succinct
  • min(S.Price) ≤ v is succinct
  • Optimization:
  • If C is succinct, then C is pre-counting
    prunable. The satisfaction of the constraint
    alone is not affected by the iterative support
    counting.
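
For a succinct constraint the satisfying sets can be enumerated directly from the item prices, before any support counting. The sketch below does this for a constraint of the form min(S.Price) ≤ v; the function name `satisfying_sets`, the price table, and the threshold are hypothetical.

```python
from itertools import combinations


def satisfying_sets(price, v):
    """Enumerate exactly the itemsets S with min(S.Price) <= v.

    A1 holds the single items that satisfy the constraint; every
    satisfying set is a non-empty subset of A1 plus any other items.
    """
    a1 = sorted(i for i, p in price.items() if p <= v)
    rest = sorted(i for i, p in price.items() if p > v)
    result = []
    for r in range(1, len(a1) + 1):
        for core in combinations(a1, r):
            for q in range(len(rest) + 1):
                for extra in combinations(rest, q):
                    result.append(frozenset(core + extra))
    return a1, result


# Hypothetical prices
price = {"a": 100, "b": 400, "c": 600, "d": 900}
a1, sets = satisfying_sets(price, v=500)
```

No transaction is scanned here: the candidate space is fixed by the prices alone, and support counting then only needs to look at these sets.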

34
Characterization of Constraints by Succinctness
Constraint                      Succinct?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           yes
S = V                           yes
min(S) ≤ v                      yes
min(S) ≥ v                      yes
min(S) = v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      yes
max(S) = v                      yes
count(S) ≤ v                    weakly
count(S) ≥ v                    weakly
count(S) = v                    weakly
sum(S) ≤ v                      no
sum(S) ≥ v                      no
sum(S) = v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}       no
(frequent constraint)           (no)
35
Convertible Constraint
  • Suppose all items in patterns are listed in a
    total order R
  • A constraint C is convertible anti-monotone iff a
    pattern S satisfying the constraint implies that
    each suffix of S w.r.t. R also satisfies C
  • Example avg(I.price)500
  • If items are added to the itemset sorted by
    price, we know that avg(I.price) can only
    increase with the addition of a new item.
  • Similarly, there are convertible monotone
    constraints

36
Example of Convertible Constraints: avg(S) ≥ v
  • Let R be the value-descending order over the set
    of items
  • E.g. I = {9, 8, 6, 4, 3, 1}
  • avg(S) ≥ v is convertible monotone w.r.t. R
  • If S is a suffix of S1, avg(S1) ≥ avg(S)
  • {8, 4, 3} is a suffix of {9, 8, 4, 3}
  • avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
  • If S satisfies avg(S) ≥ v, so does S1
  • {8, 4, 3} satisfies constraint avg(S) ≥ 4, so
    does {9, 8, 4, 3}
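
The suffix property in this example can be verified mechanically. The sketch below uses exact fractions to avoid rounding and checks every pattern over the item values 9, 8, 6, 4, 3, 1; the helper names `avg`, `proper_suffixes`, and `is_convertible_monotone` are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations


def avg(values):
    """Exact average of a non-empty list of integers."""
    return Fraction(sum(values), len(values))


def proper_suffixes(pattern):
    """Suffixes of the pattern when listed in value-descending order R."""
    ordered = sorted(pattern, reverse=True)
    return [ordered[i:] for i in range(1, len(ordered))]


def is_convertible_monotone(items):
    """Check avg(S1) >= avg(S) for every pattern S1 and each suffix S."""
    for size in range(2, len(items) + 1):
        for pattern in combinations(items, size):
            for suffix in proper_suffixes(pattern):
                if avg(pattern) < avg(suffix):
                    return False
    return True


items = [9, 8, 6, 4, 3, 1]
holds = is_convertible_monotone(items)
```

Since a pattern's average is never below that of its suffix, any suffix that meets avg(S) ≥ v guarantees that the longer pattern meets it too.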