1
Mining Association Rules in Large Databases
  • Association rule mining
  • Mining single-dimensional Boolean association
    rules from transactional databases
  • Mining multilevel association rules from
    transactional databases
  • Mining multidimensional association rules from
    transactional databases and data warehouses
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

2
Rule Measures: Support and Confidence
[Venn diagram: customer buys beer, customer buys
diaper, customer buys both]
  • Find all the rules X ∧ Y ⇒ Z with minimum
    confidence and support
  • support, s: probability that a transaction
    contains {X, Y, Z}
  • confidence, c: conditional probability that a
    transaction having {X, Y} also contains Z
  • Let minimum support = 50% and minimum confidence
    = 50%; then we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)

3
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

Market-Basket transactions
Example of Association Rules
  {Diaper} ⇒ {Beer},
  {Milk, Bread} ⇒ {Eggs, Coke},
  {Beer, Bread} ⇒ {Milk}
Implication means co-occurrence, not causality!
4
Definition: Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold

5
Definition: Association Rule
  • Association Rule
  • An implication expression of the form X ⇒ Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} ⇒ {Beer}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and
    Y: s = σ(X ∪ Y) / |T|
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X: c = σ(X ∪ Y) / σ(X)

6
Association Rule Mining Task
  • Given a set of transactions T, the goal of
    association rule mining is to find all rules
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!

7
Mining Association Rules
Examples of Rules:
  {Milk, Diaper} ⇒ {Beer} (s=0.4, c=0.67)
  {Milk, Beer} ⇒ {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} ⇒ {Milk} (s=0.4, c=0.67)
  {Beer} ⇒ {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} ⇒ {Milk, Beer} (s=0.4, c=0.5)
  {Milk} ⇒ {Diaper, Beer} (s=0.4, c=0.5)
  • Observations
  • All the above rules are binary partitions of the
    same itemset {Milk, Diaper, Beer}
  • Rules originating from the same itemset have
    identical support but can have different
    confidence
  • Thus, we may decouple the support and confidence
    requirements

8
Mining Association Rules: An Example
Min. support: 50%    Min. confidence: 50%
  • For rule A ⇒ C:
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.6%
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent
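The numbers above are easy to verify directly. Below is a minimal
Python sketch, assuming the four-transaction database this example is
usually stated over ({A,B,C}, {A,C}, {A,D}, {B,E,F}); the function
names are mine, not from the slides.

    # Assumed reconstruction of the example's transaction database.
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset, transactions):
        # Fraction of transactions containing every item of `itemset`.
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # support(lhs ∪ rhs) / support(lhs)
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    print(support({"A", "C"}, D))        # 0.5    -> 50% support
    print(confidence({"A"}, {"C"}, D))   # 0.666  -> 66.6% for A => C
    print(confidence({"C"}, {"A"}, D))   # 1.0    -> 100% for C => A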

9
Mining Frequent Itemsets: the Key Step
  • Find the frequent itemsets: the sets of items
    that have minimum support
  • A subset of a frequent itemset must also be a
    frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A}
    and {B} must be frequent itemsets
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association
    rules.

10
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

11
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
12
Frequent Itemset Generation
  • Brute-force approach
  • Each itemset in the lattice is a candidate
    frequent itemset
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!

13
Computational Complexity
  • Given d unique items:
  • Total number of itemsets = 2^d
  • Total number of possible association rules:
    R = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules
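A quick arithmetic check of the closed form (plain Python, nothing
slide-specific):

    d = 6
    R = 3**d - 2**(d + 1) + 1   # 729 - 128 + 1
    print(R)                    # 602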
14
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by DHP and vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the
    candidates or transactions
  • No need to match every candidate against every
    transaction

15
Reducing Number of Candidates
  • Apriori principle:
  • If an itemset is frequent, then all of its
    subsets must also be frequent
  • The Apriori principle holds due to the following
    property of the support measure:
    ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  • Support of an itemset never exceeds the support
    of its subsets
  • This is known as the anti-monotone property of
    support

16
Illustrating Apriori Principle
17
The Apriori Algorithm
  • Join Step: C_k is generated by joining L_{k-1}
    with itself
  • Prune Step: Any (k-1)-itemset that is not
    frequent cannot be a subset of a frequent
    k-itemset
  • Pseudo-code:
      C_k: candidate itemsets of size k
      L_k: frequent itemsets of size k
      L_1 = {frequent items};
      for (k = 1; L_k != ∅; k++) do begin
          C_{k+1} = candidates generated from L_k;
          for each transaction t in database do
              increment the count of all candidates
              in C_{k+1} that are contained in t;
          L_{k+1} = candidates in C_{k+1} with min_support;
      end
      return ∪_k L_k;
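For concreteness, here is a runnable Python sketch of the pseudo-code
above. The helper names and the use of absolute support counts are my
own choices, not the slides'; the toy database is the one from the
example on the next slide.

    from itertools import combinations

    def apriori(transactions, min_count):
        """Return {itemset: support count} for all frequent itemsets."""
        counts = {}
        for t in transactions:                      # first scan: 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent = dict(Lk)
        k = 1
        while Lk:
            # Join step: unite frequent k-itemsets that differ in one item.
            keys = list(Lk)
            Ck1 = set()
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    u = keys[i] | keys[j]
                    # Prune step: every k-subset must itself be frequent.
                    if len(u) == k + 1 and all(
                            frozenset(s) in Lk for s in combinations(u, k)):
                        Ck1.add(u)
            counts = {c: 0 for c in Ck1}            # one scan per level
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_count}
            frequent.update(Lk)
            k += 1
        return frequent

    D = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
    print(apriori(D, min_count=2))  # {2,3,5} shows up as the frequent 3-itemset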

18
The Apriori Algorithm: Example
[Figure: Database D is scanned to yield candidate
set C1 and frequent set L1; L1 is self-joined to
form C2, and a second scan of D yields L2; L2 forms
C3, and a third scan yields L3]
19
Generate Hash Tree
  • Suppose you have 15 candidate itemsets of length
    3:
  • {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8},
    {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
    {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
  • You need:
  • A hash function
  • Max leaf size: the max number of itemsets stored
    in a leaf node (if the number of candidate
    itemsets exceeds the max leaf size, split the node)

20
Association Rule Discovery: Hash Tree
Hash function: items 1, 4, 7 hash to one bucket,
2, 5, 8 to a second, and 3, 6, 9 to a third
(i.e., h(item) = item mod 3)
[Figure: candidate hash tree; at the root, hash on
the first item, so 1, 4 or 7 go to the left branch]
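A minimal sketch of this routing idea; the two-level dict stands in
for the tree, and the structure and names are mine, not the slides'.

    from collections import defaultdict

    def h(item):
        return item % 3   # 1,4,7 -> 1;  2,5,8 -> 2;  3,6,9 -> 0

    candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8),
                  (1,5,9), (1,3,6), (2,3,4), (5,6,7), (3,4,5),
                  (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]

    # Route each candidate by hashing its 1st item, then its 2nd item.
    tree = defaultdict(lambda: defaultdict(list))
    for c in candidates:
        tree[h(c[0])][h(c[1])].append(c)

    # Support counting now probes only matching branches rather than
    # comparing a transaction against all 15 candidates.
    print(tree[1][2])  # [(1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9)]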
21
How to Generate Candidates?
  • Suppose the items in L_{k-1} are listed in an order
  • Step 1: self-joining L_{k-1}
      insert into C_k
      select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
      from L_{k-1} p, L_{k-1} q
      where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
            p.item_{k-1} < q.item_{k-1}
  • Step 2: pruning
      forall itemsets c in C_k do
          forall (k-1)-subsets s of c do
              if (s is not in L_{k-1}) then delete c from C_k

22
How to Count Supports of Candidates?
  • Why is counting the supports of candidates a
    problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction

23
Example of Generating Candidates
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 ∗ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}
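The same join and prune steps as in the Apriori sketch above reproduce
this example (itemsets as frozensets; variable names are my own):

    from itertools import combinations

    L3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
    C4 = set()
    for i in range(len(L3)):
        for j in range(i + 1, len(L3)):
            u = L3[i] | L3[j]
            # Join: the union must have exactly 4 items.
            # Prune: every 3-subset must be in L3 (ade is not, so acde dies).
            if len(u) == 4 and all(frozenset(s) in L3
                                   for s in combinations(u, 3)):
                C4.add(u)
    print(C4)  # {frozenset({'a', 'b', 'c', 'd'})}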

24
Methods to Improve Apriori's Efficiency
  • Hash-based itemset counting: a k-itemset whose
    corresponding hashing bucket count is below the
    threshold cannot be frequent
  • Transaction reduction: a transaction that does
    not contain any frequent k-itemset is useless in
    subsequent scans
  • Partitioning: any itemset that is potentially
    frequent in DB must be frequent in at least one
    of the partitions of DB
  • Sampling: mine on a subset of the given data with
    a lower support threshold, plus a method to
    determine the completeness
  • Dynamic itemset counting add new candidate
    itemsets only when all of their subsets are
    estimated to be frequent

25
Compact Representation of Frequent Itemsets
  • Some itemsets are redundant because they have
    the same support as their supersets
  • Number of frequent itemsets
  • Need a compact representation

26
Maximal Frequent Itemset
An itemset is maximal frequent if none of its
immediate supersets is frequent
[Figure: itemset lattice showing the border between
the maximal frequent itemsets and the infrequent
itemsets]
27
Closed Itemset
  • An itemset is closed if none of its immediate
    supersets has the same support as the itemset

28
Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with the
transaction IDs supporting each itemset; some
itemsets are not supported by any transaction]
29
Maximal vs Closed Frequent Itemsets
Closed but not maximal
Minimum support = 2
Closed and maximal
# closed = 9, # maximal = 4
30
Maximal vs Closed Itemsets
31
Is Apriori Fast Enough? Performance Bottlenecks
  • The core of the Apriori algorithm:
  • Use frequent (k-1)-itemsets to generate
    candidate frequent k-itemsets
  • Use database scans and pattern matching to
    collect counts for the candidate itemsets
  • The bottleneck of Apriori: candidate generation
  • Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7
    candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g.,
    {a1, a2, ..., a100}, one needs to generate 2^100 ≈
    10^30 candidates.
  • Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the
    longest pattern

32
Alternative Methods for Frequent Itemset
Generation
  • Representation of Database
  • horizontal vs vertical data layout

33
FP-growth Algorithm
  • Use a compressed representation of the database
    using an FP-tree
  • Once an FP-tree has been constructed, it uses a
    recursive divide-and-conquer approach to mine the
    frequent itemsets
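A compact Python sketch of FP-tree construction; the class and
function names are mine, and the support threshold is an absolute
count.

    class FPNode:
        def __init__(self, item, parent):
            self.item = item          # None for the root
            self.count = 1
            self.parent = parent
            self.children = {}

    def build_fptree(transactions, min_count):
        """Two scans: count items, then insert frequency-ordered transactions."""
        freq = {}
        for t in transactions:                     # scan 1: item frequencies
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        root, header = FPNode(None, None), {}
        for t in transactions:                     # scan 2: build the tree
            items = sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item in node.children:          # shared prefix: bump count
                    node.children[item].count += 1
                else:                              # new branch + node-link
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = node.children[item]
        return root, header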

34
FP-tree construction
[Figure: after reading TID=1 the tree is
null → A:1 → B:1; after reading TID=2 a second
branch null → B:1 → C:1 → D:1 is added]
35
FP-Tree Construction
[Figure: the full transaction database compressed
into one FP-tree, with a header table whose
node-links chain all occurrences of each item
through the tree]
Pointers are used to assist frequent itemset
generation
36
FP-growth
Conditional pattern base for D:
P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1),
(A:1), (B:1, C:1)}
Recursively apply FP-growth on P.
Frequent itemsets found (with support > 1): AD,
BD, CD, ACD, BCD
[Figure: the FP-tree paths ending in D that yield
this conditional pattern base]
37
Tree Projection
Set enumeration tree
Possible extensions: E(A) = {B, C, D, E}
Possible extensions: E(ABC) = {D, E}
38
Tree Projection
  • Items are listed in lexicographic order
  • Each node P stores the following information
  • Itemset for node P
  • List of possible lexicographic extensions of P:
    E(P)
  • Pointer to projected database of its ancestor
    node
  • Bitvector containing information about which
    transactions in the projected database contain
    the itemset

39
Projected Database
Projected Database for node A
Original Database
For each transaction T, the projected transaction at
node A is T ∩ E(A)
40
Benefits of the FP-tree Structure
  • Completeness
  • never breaks a long pattern of any transaction
  • preserves complete information for frequent
    pattern mining
  • Compactness
  • reduces irrelevant information: infrequent items
    are gone
  • frequency-descending ordering: more frequent
    items are more likely to be shared
  • never larger than the original database (if we do
    not count node-links and counts)
  • Example: for the Connect-4 DB, the compression
    ratio can exceed 100

41
Mining Frequent Patterns Using FP-tree
  • General idea (divide-and-conquer)
  • Recursively grow frequent pattern path using the
    FP-tree
  • Method
  • For each item, construct its conditional
    pattern-base, and then its conditional FP-tree
  • Repeat the process on each newly created
    conditional FP-tree
  • Until the resulting FP-tree is empty, or it
    contains only one path (a single path will
    generate all the combinations of its sub-paths,
    each of which is a frequent pattern)

42
Major Steps to Mine FP-tree
  • Construct conditional pattern base for each node
    in the FP-tree
  • Construct conditional FP-tree from each
    conditional pattern-base
  • Recursively mine conditional FP-trees and grow
    frequent patterns obtained so far
  • If the conditional FP-tree contains a single
    path, simply enumerate all the patterns

43
Step 1: From FP-tree to Conditional Pattern Base
  • Start at the frequent-item header table in the
    FP-tree
  • Traverse the FP-tree by following the link of
    each frequent item
  • Accumulate all transformed prefix paths of
    that item to form a conditional pattern base

Conditional pattern bases:
  item    cond. pattern base
  c       f:3
  a       fc:3
  b       fca:1, f:1, c:1
  m       fca:2, fcab:1
  p       fcam:2, cb:1
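Building on the build_fptree sketch earlier, the traversal might look
like this (again, names are my own):

    def conditional_pattern_base(header, item):
        """(prefix_path, count) pairs gathered along `item`'s node-links."""
        base = []
        for node in header.get(item, []):
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)      # walk up toward the root
                parent = parent.parent
            if path:
                # The prefix carries the count of the item's node.
                base.append((list(reversed(path)), node.count))
        return base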
44
Properties of FP-tree for Conditional Pattern
Base Construction
  • Node-link property
  • For any frequent item ai, all the possible
    frequent patterns that contain ai can be obtained
    by following ai's node-links, starting from ai's
    head in the FP-tree header
  • Prefix path property
  • To calculate the frequent patterns for a node ai
    in a path P, only the prefix sub-path of ai in P
    need to be accumulated, and its frequency count
    should carry the same count as node ai.

45
Step 2: Construct Conditional FP-tree
  • For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of
    the pattern base

m-conditional pattern base: fca:2, fcab:1

Header table:
  item   frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

[Figure: the global FP-tree and the m-conditional
FP-tree it reduces to: f:3 → c:3 → a:3]
All frequent patterns concerning m: m, fm, cm,
am, fcm, fam, cam, fcam
46
Step 3: Recursively Mine the Conditional FP-tree
  • Cond. pattern base of "am": (fc:3) →
    am-conditional FP-tree: f:3 → c:3
  • Cond. pattern base of "cm": (f:3) →
    cm-conditional FP-tree: f:3
  • Cond. pattern base of "cam": (f:3) →
    cam-conditional FP-tree: f:3
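Tying the earlier sketches together, the recursion itself fits in a
few lines. This is a sketch under the same assumptions (it replicates
prefix paths to form conditional databases, which is simple but
memory-naive), not the paper's exact algorithm.

    def fp_growth(transactions, min_count, suffix=frozenset()):
        """Mine {itemset: support} using build_fptree and
        conditional_pattern_base from the sketches above."""
        root, header = build_fptree(transactions, min_count)
        results = {}
        for item, nodes in header.items():
            support = sum(n.count for n in nodes)
            pattern = suffix | {item}
            results[pattern] = support
            # Each prefix path, replicated `count` times, becomes a
            # transaction of the conditional database for `pattern`.
            cond_db = []
            for path, count in conditional_pattern_base(header, item):
                cond_db.extend([set(path)] * count)
            if cond_db:
                results.update(fp_growth(cond_db, min_count, pattern))
        return results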
47
FP-growth vs. Apriori Scalability With the
Support Threshold
Data set T25I20D10K
48
FP-growth vs. Tree-Projection Scalability with
Support Threshold
Data set T25I20D100K
49
Rule Generation
  • How can we efficiently generate rules from
    frequent itemsets?
  • In general, confidence does not have an
    anti-monotone property:
  • c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
  • But the confidence of rules generated from the
    same itemset does have an anti-monotone property:
  • e.g., L = {A, B, C, D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD)
    ≥ c(A ⇒ BCD)
  • Confidence is anti-monotone w.r.t. the number of
    items on the RHS of the rule
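A sketch of confidence-based rule generation over the
{itemset: count} map returned by the apriori sketch above; consequents
are grown levelwise so that a low-confidence consequent is never
extended. Names are mine, not the slides'.

    def generate_rules(frequent, min_conf):
        """Return (lhs, rhs, confidence) triples from a {itemset: count} map."""
        rules = []
        for itemset, count in frequent.items():
            if len(itemset) < 2:
                continue
            consequents = [frozenset([i]) for i in itemset]
            while consequents:
                survivors = []
                for rhs in consequents:
                    lhs = itemset - rhs
                    if not lhs:
                        continue
                    conf = count / frequent[lhs]   # all subsets are present
                    if conf >= min_conf:
                        rules.append((lhs, rhs, conf))
                        survivors.append(rhs)      # only these get extended
                # Merge surviving consequents into ones one item larger.
                consequents = {a | b for a in survivors for b in survivors
                               if len(a | b) == len(a) + 1}
        return rules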

50
Rule Generation for Apriori Algorithm
[Figure: lattice of rules generated from one
itemset; pruning a low-confidence rule removes all
rules below it]
51
Rule Generation for the Apriori Algorithm
  • A candidate rule is generated by merging two
    rules that share the same prefix in the rule
    consequent
  • join(CD ⇒ AB, BD ⇒ AC) would produce the
    candidate rule D ⇒ ABC
  • Prune rule D ⇒ ABC if its subset AD ⇒ BC does
    not have high confidence

52
Effect of Support Distribution
  • Many real data sets have a skewed support
    distribution

Support distribution of a retail data set
53
Effect of Support Distribution
  • How to set the appropriate minsup threshold?
  • If minsup is set too high, we could miss itemsets
    involving interesting rare items (e.g., expensive
    products)
  • If minsup is set too low, it is computationally
    expensive and the number of itemsets is very
    large
  • Using a single minimum support threshold may not
    be effective

54
Iceberg Queries
  • Iceberg query: compute aggregates over one or a
    set of attributes only for those whose aggregate
    value is above a certain threshold
  • Example:
      select P.custID, P.itemID, sum(P.qty)
      from purchase P
      group by P.custID, P.itemID
      having sum(P.qty) > 10
  • Compute iceberg queries efficiently by Apriori:
  • First compute the lower dimensions
  • Then compute higher dimensions only when all the
    lower ones are above the threshold
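A toy illustration of that pruning; the data and threshold below are
invented for the example.

    from collections import Counter

    purchases = [("c1", "i1", 6), ("c1", "i1", 5), ("c1", "i2", 3),
                 ("c2", "i1", 12), ("c2", "i2", 2)]
    T = 10

    cust, item = Counter(), Counter()
    for c, i, q in purchases:               # lower dimensions first
        cust[c] += q
        item[i] += q

    pairs = Counter()
    for c, i, q in purchases:
        # A (custID, itemID) pair can clear T only if custID alone and
        # itemID alone clear T, so everything else is skipped.
        if cust[c] > T and item[i] > T:
            pairs[(c, i)] += q

    print({p: q for p, q in pairs.items() if q > T})
    # {('c1', 'i1'): 11, ('c2', 'i1'): 12}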

55
Pattern Evaluation
  • Association rule algorithms tend to produce too
    many rules
  • many of them are uninteresting or redundant
  • redundant if {A, B, C} ⇒ {D} and {A, B} ⇒ {D}
    have the same support and confidence
  • Interestingness measures can be used to
    prune/rank the derived patterns
  • In the original formulation of association rules,
    support and confidence are the only measures used

56
Application of Interestingness Measure
57
Computing Interestingness Measure
  • Given a rule X ? Y, information needed to compute
    rule interestingness can be obtained from a
    contingency table

Contingency table for X ⇒ Y:

         Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0    |T|

  • Used to define various measures:
  • support, confidence, lift, Gini, J-measure,
    etc.

58
Drawback of Confidence

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100
59
Interestingness Measurements
  • Objective measures
  • Two popular measurements:
  • support and
  • confidence
  • Subjective measures (Silberschatz & Tuzhilin,
    KDD'95)
  • A rule (pattern) is interesting if
  • it is unexpected (surprising to the user) and/or
  • actionable (the user can do something with it)

60
Criticism of Support and Confidence
  • Example 1 (Aggarwal & Yu, PODS'98)
  • Among 5000 students:
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
  • play basketball ⇒ eat cereal [40%, 66.7%] is
    misleading because the overall percentage of
    students eating cereal is 75%, which is higher
    than 66.7%.
  • play basketball ⇒ not eat cereal [20%, 33.3%] is
    far more accurate, although with lower support
    and confidence

61
Statistical Independence
  • Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)
  • P(S ∧ B) = 420/1000 = 0.42
  • P(S) × P(B) = 0.6 × 0.7 = 0.42
  • P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
  • P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
  • P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated

62
Statistical-based Measures
  • Measures that take into account statistical
    dependence:
  • Lift = P(Y|X) / P(Y)
  • Interest = P(X, Y) / (P(X) P(Y))
  • PS = P(X, Y) − P(X) P(Y)
  • φ-coefficient = (P(X, Y) − P(X) P(Y)) /
    sqrt(P(X) (1 − P(X)) P(Y) (1 − P(Y)))

63
Example: Lift/Interest

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

  • Association Rule: Tea ⇒ Coffee
  • Confidence = P(Coffee|Tea) = 0.75
  • but P(Coffee) = 0.9
  • Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and
    Coffee are negatively associated)
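The same numbers in three lines of arithmetic:

    conf = 15 / 20          # P(Coffee | Tea) = 0.75
    prior = 90 / 100        # P(Coffee) = 0.9
    print(conf / prior)     # lift = 0.833... < 1: negative association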

64
Drawback of Lift & Interest

        Y     ¬Y
  X     10     0     10
  ¬X     0    90     90
        10    90    100

Lift = 0.1 / (0.1 × 0.1) = 10

        Y     ¬Y
  X     90     0     90
  ¬X     0    10     10
        90    10    100

Lift = 0.9 / (0.9 × 0.9) = 1.11

Statistical independence: if P(X, Y) = P(X) P(Y),
then Lift = 1
65
There are lots of measures proposed in the
literature. Some measures are good for certain
applications, but not for others. What criteria
should we use to determine whether a measure is
good or bad? And what about Apriori-style
support-based pruning? How does it affect these
measures?
66
Constraint-Based Mining
  • Interactive, exploratory mining of gigabytes of
    data?
  • Could it be real? Only by making good use of
    constraints!
  • What kinds of constraints can be used in mining?
  • Knowledge type constraint: classification,
    association, etc.
  • Data constraint: SQL-like queries
  • Find product pairs sold together in Vancouver in
    Dec. '98.
  • Dimension/level constraints:
  • in relevance to region, price, brand, customer
    category.
  • Rule constraints:
  • small sales (price < 10) triggers big sales
    (sum > 200).
  • Interestingness constraints:
  • strong rules (min_support ≥ 3%, min_confidence ≥
    60%).

67
Rule Constraints in Association Mining
  • Two kinds of rule constraints:
  • Rule form constraints: meta-rule guided mining.
  • P(x, y) ∧ Q(x, w) ⇒ takes(x, "database
    systems").
  • Rule (content) constraints: constraint-based
    query optimization (Ng, et al., SIGMOD'98).
  • sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3
    ∧ sum(RHS) > 1000
  • 1-variable vs. 2-variable constraints
    (Lakshmanan, et al., SIGMOD'99):
  • 1-var: a constraint confining only one side (L/R)
    of the rule, e.g., as shown above.
  • 2-var: a constraint confining both sides (L and
    R).
  • sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)

68
Constraint-Based Association Query
  • Database: (1) trans(TID, Itemset), (2)
    itemInfo(Item, Type, Price)
  • A constrained association query (CAQ) is of the
    form {(S1, S2) | C},
  • where C is a set of constraints on S1, S2,
    including a frequency constraint
  • A classification of (single-variable)
    constraints:
  • Class constraint: S ⊆ A, e.g. S ⊆ Item
  • Domain constraints:
  • S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price <
    100
  • v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
  • V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠},
  • e.g. {snacks, sodas} ⊆ S.Type
  • Aggregation constraint: agg(S) θ v, where agg is
    in {min, max, sum, count, avg}, and θ ∈ {=, ≠, <,
    ≤, >, ≥}
  • e.g. count(S1.Type) = 1, avg(S2.Price) ≥ 100

69
Constrained Association Query Optimization Problem
  • Given a CAQ = {(S1, S2) | C}, the algorithm
    should be:
  • sound: it only finds frequent sets that satisfy
    the given constraints C
  • complete: all frequent sets satisfying the given
    constraints C are found
  • A naïve solution:
  • Apply Apriori to find all frequent sets, and
    then test them for constraint satisfaction one
    by one.
  • Our approach:
  • Comprehensive analysis of the properties of
    constraints, pushing them as deeply as possible
    inside the frequent set computation.

70
Anti-monotone and Monotone Constraints
  • A constraint Ca is anti-monotone iff, for any
    pattern S not satisfying Ca, none of the
    super-patterns of S can satisfy Ca
  • A constraint Cm is monotone iff, for any pattern
    S satisfying Cm, every super-pattern of S also
    satisfies it
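As a quick illustration of why anti-monotonicity matters, here is a
sketch of pushing sum(S.Price) ≤ v into candidate pruning; the prices
and budget are invented for the example.

    prices = {"A": 300, "B": 800, "C": 150, "D": 500}

    def within_budget(itemset, budget=1000):
        # sum(S.Price) <= budget is anti-monotone: once violated,
        # no superset can ever satisfy it again.
        return sum(prices[i] for i in itemset) <= budget

    candidates = [frozenset(s) for s in (("A", "B"), ("A", "C"), ("B", "D"))]
    # Dropping violators during candidate generation loses no answers.
    print([set(c) for c in candidates if within_budget(c)])
    # [{'A', 'C'}]  -- {A,B} (1100) and {B,D} (1300) are pruned for good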

71
Succinct Constraint
  • A subset of items Is is a succinct set if it can
    be expressed as σ_p(I) for some selection
    predicate p, where σ is the selection operator
  • SP ⊆ 2^I is a succinct power set if there is a
    fixed number of succinct sets I1, ..., Ik ⊆ I, s.t.
    SP can be expressed in terms of the strict power
    sets of I1, ..., Ik using union and minus
  • A constraint Cs is succinct provided SAT_Cs(I) is
    a succinct power set

72
Convertible Constraint
  • Suppose all items in patterns are listed in a
    total order R
  • A constraint C is convertible anti-monotone iff a
    pattern S satisfying the constraint implies that
    each suffix of S w.r.t. R also satisfies C
  • A constraint C is convertible monotone iff a
    pattern S satisfying the constraint implies that
    each pattern of which S is a suffix w.r.t. R also
    satisfies C

73
Relationships Among Categories of Constraints
Succinctness
Anti-monotonicity
Monotonicity
Convertible constraints
Inconvertible constraints
74
Property of Constraints: Anti-Monotone
  • Anti-monotonicity: if a set S violates the
    constraint, any superset of S violates the
    constraint.
  • Examples:
  • sum(S.Price) ≤ v is anti-monotone
  • sum(S.Price) ≥ v is not anti-monotone
  • sum(S.Price) = v is partly anti-monotone
  • Application:
  • Push "sum(S.Price) ≤ 1000" deeply into iterative
    frequent set computation.

75
Characterization of Anti-Monotonicity Constraints

  Constraint                      Anti-monotone?
  S θ v, θ ∈ {=, ≤, ≥}            yes
  v ∈ S                           no
  S ⊇ V                           no
  S ⊆ V                           yes
  S = V                           partly
  min(S) ≤ v                      no
  min(S) ≥ v                      yes
  min(S) = v                      partly
  max(S) ≤ v                      yes
  max(S) ≥ v                      no
  max(S) = v                      partly
  count(S) ≤ v                    yes
  count(S) ≥ v                    no
  count(S) = v                    partly
  sum(S) ≤ v                      yes
  sum(S) ≥ v                      no
  sum(S) = v                      partly
  avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
  (frequency constraint)          (yes)
76
Example of Convertible Constraints: avg(S) ≥ v
  • Let R be the value-descending order over the set
    of items
  • E.g. I = {9, 8, 6, 4, 3, 1}
  • avg(S) ≥ v is convertible monotone w.r.t. R:
  • If S is a suffix of S1, then avg(S1) ≥ avg(S)
  • {8, 4, 3} is a suffix of {9, 8, 4, 3}
  • avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
  • If S satisfies avg(S) ≥ v, so does S1
  • {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so
    does {9, 8, 4, 3}
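The suffix property is easy to check numerically (plain arithmetic on
the example values):

    avg = lambda xs: sum(xs) / len(xs)

    S1 = [9, 8, 4, 3]       # items listed in value-descending order R
    S = S1[1:]              # [8, 4, 3] is a suffix of S1 w.r.t. R

    print(avg(S1), avg(S))  # 6.0 5.0 -> avg(S1) >= avg(S)
    # So once a suffix satisfies avg(S) >= 4, every pattern extending
    # it on the high-value side satisfies the constraint as well.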

77
Property of Constraints: Succinctness
  • Succinctness:
  • For any sets S1 and S2 satisfying C, S1 ∪ S2
    satisfies C
  • Given A1, the set of size-1 itemsets satisfying
    C, any set S satisfying C is based on A1, i.e.,
    S contains a subset belonging to A1
  • Example:
  • sum(S.Price) ≥ v is not succinct
  • min(S.Price) ≤ v is succinct
  • Optimization:
  • If C is succinct, then C is pre-counting
    prunable. The satisfaction of the constraint
    alone is not affected by the iterative support
    counting.

78
Characterization of Constraints by Succinctness

  Constraint                      Succinct?
  S θ v, θ ∈ {=, ≤, ≥}            yes
  v ∈ S                           yes
  S ⊇ V                           yes
  S ⊆ V                           yes
  S = V                           yes
  min(S) ≤ v                      yes
  min(S) ≥ v                      yes
  min(S) = v                      yes
  max(S) ≤ v                      yes
  max(S) ≥ v                      yes
  max(S) = v                      yes
  count(S) ≤ v                    weakly
  count(S) ≥ v                    weakly
  count(S) = v                    weakly
  sum(S) ≤ v                      no
  sum(S) ≥ v                      no
  sum(S) = v                      no
  avg(S) θ v, θ ∈ {=, ≤, ≥}       no
  (frequency constraint)          (no)
79
Why Is the Big Pie Still There?
  • More on constraint-based mining of associations:
  • Boolean vs. quantitative associations
  • Association on discrete vs. continuous data
  • From association to correlation and causal
    structure analysis:
  • Association does not necessarily imply
    correlation or causal relationships
  • From intra-transaction associations to
    inter-transaction associations:
  • E.g., break the barriers of transactions (Lu, et
    al., TOIS'99).
  • From association analysis to classification and
    clustering analysis:
  • E.g., clustering association rules

80
Summary
  • Association rule mining:
  • probably the most significant contribution from
    the database community to KDD
  • A large number of papers have been published
  • Many interesting issues have been explored
  • An interesting research direction:
  • Association analysis in other types of data:
    spatial data, multimedia data, time series data,
    etc.