Chapter 5: Mining Frequent Patterns, Association and Correlations

1
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

2
What Is Frequent Pattern Analysis?
  • Frequent pattern: a pattern (a set of items,
    subsequences, substructures, etc.) that occurs
    frequently in a data set
  • First proposed by Agrawal, Imielinski, and Swami
    [AIS93] in the context of frequent itemsets and
    association rule mining
  • Motivation: finding inherent regularities in data
  • What products were often purchased together?
    Beer and diapers?!
  • What are the subsequent purchases after buying a
    PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?
  • Applications
  • Basket data analysis, cross-marketing, catalog
    design, sale campaign analysis, Web log (click
    stream) analysis, and DNA sequence analysis.

3
An Example: Market Basket Analysis
4
Why Is Freq. Pattern Mining Important?
  • Discloses an intrinsic and important property of
    data sets
  • Forms the foundation for many essential data
    mining tasks
  • Association, correlation, and causality analysis
  • Sequential, structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia,
    time-series, and stream data
  • Classification: associative classification
  • Cluster analysis: frequent pattern-based
    clustering
  • Data warehousing: iceberg cube and cube-gradient
  • Semantic data compression: fascicles
  • Broad applications

5
Basic Concepts: Frequent Patterns and Association
Rules
  • Itemset X = {x1, …, xk}
  • Find all the rules X ⇒ Y with minimum support and
    confidence
  • support, s: probability that a transaction
    contains X ∪ Y
  • confidence, c: conditional probability that a
    transaction having X also contains Y

Let sup_min = 50%, conf_min = 50%. Freq. Pat.:
{A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3. Association rules: A ⇒
D (60%, 100%), D ⇒ A (60%, 75%)
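The support and confidence computations above can be sketched in Python. The five-transaction database below is hypothetical, chosen only to reproduce the slide's counts ({A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3):

```python
# Hypothetical 5-transaction database reproducing the slide's counts.
TDB = [
    {"A", "D"},
    {"A", "D", "E"},
    {"A", "B", "D"},
    {"B", "D", "E"},
    {"B", "E"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "D"}, TDB))       # 0.6, i.e. s = 60% for A => D
print(confidence({"A"}, {"D"}, TDB))  # 1.0, i.e. c = 100% for A => D
print(confidence({"D"}, {"A"}, TDB))  # 3 of 4 D-transactions contain A: 75%
```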
6
Closed Patterns and Max-Patterns
  • A long pattern contains a combinatorial number of
    sub-patterns, e.g., {a1, …, a100} contains
    (100 choose 1) + (100 choose 2) + … + (100 choose 100)
    = 2^100 - 1 ≈ 1.27×10^30 sub-patterns!
  • Solution: mine closed patterns and max-patterns
    instead
  • An itemset X is closed if X is frequent and there
    exists no super-pattern Y ⊃ X with the same
    support as X (proposed by Pasquier, et al. @
    ICDT'99)
  • An itemset X is a max-pattern if X is frequent
    and there exists no frequent super-pattern Y ⊃ X
    (proposed by Bayardo @ SIGMOD'98)
  • Closed patterns are a lossless compression of freq.
    patterns
  • Reducing the # of patterns and rules
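Both definitions can be checked mechanically. A minimal sketch, assuming the frequent-itemset table of the next slide's example ({A}:4, {B}:3, {D}:3, all pairs and {A,B,D} at 3):

```python
# Frequent itemsets and their supports, from the worked example.
freq = {
    frozenset("A"): 4, frozenset("B"): 3, frozenset("D"): 3,
    frozenset("AB"): 3, frozenset("AD"): 3, frozenset("BD"): 3,
    frozenset("ABD"): 3,
}

def closed_patterns(freq):
    # X is closed iff no frequent proper superset has the same support.
    return {x: s for x, s in freq.items()
            if not any(x < y and freq[y] == s for y in freq)}

def max_patterns(freq):
    # X is maximal iff it has no frequent proper superset at all.
    return {x: s for x, s in freq.items()
            if not any(x < y for y in freq)}

print(closed_patterns(freq))  # only {A}:4 and {A,B,D}:3 survive
print(max_patterns(freq))     # only {A,B,D}:3 survives
```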

7
Why Closed Patterns and Max-Patterns
Let sup_min = 50%, conf_min = 50%
  • Freq. Pat.: {A}:4, {B}:3, {D}:3,
    {A,B}:3, {A,D}:3, {B,D}:3,
    {A,B,D}:3
  • Freq. Closed Pat.: {A}:4, {A,B,D}:3

8
Why Closed Patterns and Max-Patterns (cont.)
Let sup_min = 50%, conf_min = 50%
  • Frequent patterns:
  • {A}:3, {B}:5, {D}:4, {F}:3,
  • {A,B}:3, {A,D}:3, {B,D}:4, {B,F}:3,
  • {A,B,D}:3
  • Frequent closed patterns:
  • {B}:5, {B,D}:4, {B,F}:3, {A,B,D}:3

9
Closed Patterns and Max-Patterns
  • Exercise. DB = {⟨a1, …, a100⟩, ⟨a1, …, a50⟩},
  • Min_sup = 1.
  • What is the set of closed itemsets?
  • ⟨a1, …, a100⟩: 1
  • ⟨a1, …, a50⟩: 2
  • What is the set of max-patterns?
  • ⟨a1, …, a100⟩: 1
  • What is the set of all patterns?
  • All 2^100 - 1 nonempty subsets of {a1, …, a100}!!

10
Itemsets
  • A set of items is referred to as an itemset.
  • An itemset that contains k items is a k-itemset.
  • Example:
  • The set {computer, antivirus_software} is a
    2-itemset.
  • The support of an itemset is the number of
    transactions that contain the itemset.

11
Association Rules
  • Let I = {I1, I2, …, Im} be a set of items and D be a
    set of database transactions, where each
    transaction T is a set of items such that T ⊆ I
  • An association rule is an implication of the form
    A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
  • The support and confidence of the rule A ⇒ B are
    defined as support(A ⇒ B) = P(A ∪ B) and
    confidence(A ⇒ B) = P(B | A)

12
Find Strong Association Rules
  • Association rule mining can be viewed as a
    two-step process:
  • Find all frequent itemsets
  • Generate strong association rules from the
    frequent itemsets

13
Generate Association Rules
  • Given a set of frequent itemsets, association
    rules can be generated as follows:
  • For each frequent itemset l, generate all
    nonempty proper subsets of l
  • For every nonempty subset s of l, output the rule
    s ⇒ (l - s) if support(l) / support(s) ≥ min_conf
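This generation step can be sketched as follows. The support table is taken from the later example slide ({A}:3, {B}:5, {D}:4, {F}:3, and so on), and min_conf is a fraction:

```python
from itertools import combinations

# Supports from the worked example (sup_min = 50%, conf_min = 50%).
sup = {frozenset("A"): 3, frozenset("B"): 5, frozenset("D"): 4,
       frozenset("F"): 3, frozenset("AB"): 3, frozenset("AD"): 3,
       frozenset("BD"): 4, frozenset("BF"): 3, frozenset("ABD"): 3}

def gen_rules(sup, min_conf):
    rules = []
    for l, l_sup in sup.items():
        for k in range(1, len(l)):                      # nonempty proper subsets
            for s in map(frozenset, combinations(l, k)):
                conf = l_sup / sup[s]                   # sup(l) / sup(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

for lhs, rhs, conf in gen_rules(sup, 0.5):
    print(set(lhs), "=>", set(rhs), f"conf={conf:.0%}")
```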

14
An Example
Let sup_min = 50%, conf_min = 50%. Frequent
patterns: {A}:3, {B}:5, {D}:4, {F}:3, {A,B}:3, {A,D}:3,
{B,D}:4, {B,F}:3, {A,B,D}:3
  • Generate association rules:
  • From AB: A ⇒ B (60%, 100%), B ⇒ A (60%, 60%)
  • From AD: A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)
  • From BD: B ⇒ D (80%, 80%), D ⇒ B (80%, 100%)
  • From BF: B ⇒ F (60%, 60%), F ⇒ B (60%, 100%)
  • From ABD:
  • A ⇒ BD (60%, 100%), B ⇒ AD (60%, 60%), D ⇒ AB (60%,
    75%)
  • BD ⇒ A (60%, 100%), AD ⇒ B (60%, 100%), AB ⇒ D (60%,
    100%)

15
An Example
Let sup_min = 50%, conf_min = 50%. Frequent Closed
patterns: {B}:5, {B,D}:4, {B,F}:3, {A,B,D}:3
  • Generate association rules:
  • From BF: B ⇒ F (60%, 60%), F ⇒ B (60%, 100%)
  • From BD: B ⇒ D (80%, 80%), D ⇒ B (80%, 100%)
  • From ABD:
  • A ⇒ BD (60%, 100%), B ⇒ AD (60%, 60%), D ⇒ AB (60%,
    75%)
  • BD ⇒ A (60%, 100%), AD ⇒ B (60%, 100%), AB ⇒ D (60%,
    100%)
  • A ⇒ B (60%, 100%), B ⇒ A (60%, 60%)
  • A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)

16
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

17
Scalable Methods for Mining Frequent Patterns
  • The downward closure property of frequent
    patterns
  • Any subset of a frequent itemset must be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer,
    diaper}
  • i.e., every transaction having {beer, diaper,
    nuts} also contains {beer, diaper}
  • Scalable mining methods: three major approaches
  • Apriori (Agrawal & Srikant @VLDB'94)
  • Freq. pattern growth (FPgrowth: Han, Pei & Yin
    @SIGMOD'00)
  • Vertical data format approach (CHARM: Zaki & Hsiao
    @SDM'02)

18
Apriori: A Candidate Generation-and-Test Approach
  • Apriori pruning principle: if there is any
    itemset which is infrequent, its superset should
    not be generated/tested! (Agrawal & Srikant
    @VLDB'94; Mannila, et al. @KDD'94)
  • Method:
  • Initially, scan DB once to get the frequent 1-itemsets
  • Generate length (k+1) candidate itemsets from
    length-k frequent itemsets
  • Test the candidates against DB
  • Terminate when no frequent or candidate set can
    be generated

19
The Apriori Algorithm: An Example
Sup_min = 2
(Figure: over database TDB, the 1st scan yields candidate
1-itemsets C1 and frequent 1-itemsets L1; the 2nd scan yields
C2 and L2; the 3rd scan yields C3 and L3.)
20
The Apriori Algorithm
  • Pseudo-code:
  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent items}
  • for (k = 1; Lk != ∅; k++) do begin
  •     Ck+1 = candidates generated from Lk
  •     for each transaction t in database do
  •         increment the count of all candidates in
            Ck+1 that are contained in t
  •     Lk+1 = candidates in Ck+1 with min_support
  • end
  • return ∪k Lk
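The pseudo-code above can be fleshed out into a runnable sketch. The four-transaction database is hypothetical (it matches the worked example with sup_min = 2), and min_sup is an absolute count:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: absolute support} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in db) for c in cands}

    items = {frozenset([i]) for t in db for i in t}
    L = {c: s for c, s in count(items).items() if s >= min_sup}
    freq = dict(L)
    k = 1
    while L:
        # Self-join: union pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate that has an infrequent k-subset.
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c: s for c, s in count(cands).items() if s >= min_sup}
        freq.update(L)
        k += 1
    return freq

# Hypothetical 4-transaction database, sup_min = 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_sup=2))
```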

21
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?
  • Example of candidate generation:
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

22
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
  •   insert into Ck
  •   select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  •   from Lk-1 p, Lk-1 q
  •   where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
  • Step 2: pruning
  •   forall itemsets c in Ck do
  •     forall (k-1)-subsets s of c do
  •       if (s is not in Lk-1) then delete c from Ck
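The SQL-like join and prune can be sketched directly; the L3 below is the example from the previous slide:

```python
from itertools import combinations

def gen_candidates(L_prev):
    """Given the frequent (k-1)-itemsets L_prev (frozensets), return C_k."""
    k_1 = len(next(iter(L_prev)))
    sorted_sets = [sorted(x) for x in L_prev]
    joined = set()
    for p in sorted_sets:
        for q in sorted_sets:
            # Join condition: first k-2 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(frozenset(p + [q[-1]]))
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k_1))}

L3 = {frozenset(x) for x in ("abc", "abd", "acd", "ace", "bcd")}
print(gen_candidates(L3))  # abcd survives; acde is pruned (ade not in L3)
```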

23
An Example
  • Sup_min = 2, over database TDB
  • If we list the items as A, B, C, D, E
  • Get frequent 1-itemsets: {A}:2, {B}:3, {C}:3, {E}:3
  • Generate candidate 2-itemsets: AB, AC, AE, BC,
    BE, CE
  • Get the support of the candidate itemsets to find
    the frequent 2-itemsets L2: {A,C}:2, {B,C}:2, {C,E}:2,
    {B,E}:3
  • Generate the candidate 3-itemsets: {B,C,E}
  •   select p.item1, p.item2, q.item2
  •   from L2 p, L2 q
  •   where p.item1 = q.item1,
  •     p.item2 < q.item2
  • Find the frequent 3-itemsets: {B,C,E}:2
24
Challenges of Frequent Pattern Mining
  • Challenges:
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce passes of transaction database scans
  • Shrink the number of candidates
  • Facilitate support counting of candidates

25
Partition: Scan Database Only Twice
  • Any itemset that is potentially frequent in DB
    must be frequent in at least one of the
    partitions of DB
  • Scan 1: partition the database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe. An
    efficient algorithm for mining association rules in
    large databases. In VLDB'95
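A minimal sketch of the two-scan idea, assuming a small hypothetical in-memory database and a brute-force local miner (a real implementation would run Apriori per partition and a single streaming pass for scan 2):

```python
from itertools import combinations

def local_frequent(part, min_sup_ratio):
    """Brute-force local mining: count every nonempty subset of each transaction."""
    counts = {}
    for t in part:
        for k in range(1, len(t) + 1):
            for s in map(frozenset, combinations(t, k)):
                counts[s] = counts.get(s, 0) + 1
    need = min_sup_ratio * len(part)
    return {s for s, c in counts.items() if c >= need}

def partition_mine(db, min_sup_ratio, n_parts=2):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: the union of locally frequent itemsets is a complete candidate
    # set, since a globally frequent itemset is frequent in some partition.
    cands = set().union(*(local_frequent(p, min_sup_ratio) for p in parts))
    # Scan 2: one pass over the full DB keeps only the globally frequent ones.
    need = min_sup_ratio * len(db)
    return {c for c in cands if sum(c <= t for t in db) >= need}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(db, 0.5))
```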

26
Sampling for Frequent Patterns
  • Select a sample of the original database and mine
    frequent patterns within the sample using Apriori
  • Scan the database once to verify the frequent itemsets
    found in the sample; only borders of the closure of
    frequent patterns are checked
  • Example: check abcd instead of ab, ac, …, etc.
  • Scan the database again to find missed frequent
    patterns
  • H. Toivonen. Sampling large databases for
    association rules. In VLDB'96

27
DIC: Reduce Number of Scans
  • Once both A and D are determined frequent, the
    counting of AD begins
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD begins

(Figure: the itemset lattice over {A, B, C, D}, with the
transaction stream below; Apriori starts counting 2-itemsets
only after a full pass over the 1-itemsets, while DIC starts
counting an itemset as soon as all its subsets are known to
be frequent.)

S. Brin, R. Motwani, J. Ullman, and S. Tsur.
Dynamic itemset counting and implication rules
for market basket data. In SIGMOD'97
28
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find frequent itemset i1i2…i100:
  • # of scans: 100
  • # of candidates: (100 choose 1) + (100 choose 2) + … +
    (100 choose 100) = 2^100 - 1 ≈ 1.27×10^30 !
  • Bottleneck: candidate generation-and-test
  • Can we avoid candidate generation?

29
Implications of the Methodology
  • Mining closed frequent itemsets and max-patterns
  • CLOSET (DMKD'00)
  • Mining sequential patterns
  • FreeSpan (KDD'00), PrefixSpan (ICDE'01)
  • Constraint-based mining of frequent patterns
  • Convertible constraints (KDD'00, ICDE'01)
  • Computing iceberg data cubes with complex
    measures
  • H-tree and H-cubing algorithm (SIGMOD'01)

30
MaxMiner: Mining Max-patterns
  • 1st scan: find frequent items
  • A, B, C, D, E
  • 2nd scan: find support for the potential
    max-patterns
  • AB, AC, AD, AE, ABCDE
  • BC, BD, BE, BCDE
  • CD, CE, CDE, DE
  • Since BCDE is a max-pattern, no need to check
    BCD, BDE, CDE in a later scan
  • R. Bayardo. Efficiently mining long patterns from
    databases. In SIGMOD'98
31
CHARM: Mining by Exploring Vertical Data Format
  • Vertical format: t(AB) = {T11, T25, …}
  • tid-list: list of transaction ids containing an
    itemset
  • Deriving closed patterns based on vertical
    intersections
  • t(X) = t(Y): X and Y always happen together
  • t(X) ⊂ t(Y): a transaction having X always has Y
  • Using diffsets to accelerate mining
  • Only keep track of differences of tids
  • t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  • Diffset(XY, X) = {T2}
  • Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P.
    Shenoy et al. @SIGMOD'00), CHARM (Zaki &
    Hsiao @SDM'02)
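The tid-list operations above can be sketched with Python sets; the tids here are hypothetical:

```python
# Hypothetical tid-lists for items X and Y.
t = {
    "X": {"T1", "T2", "T3"},
    "Y": {"T1", "T3", "T4"},
}

t_XY = t["X"] & t["Y"]        # tid-list of the itemset {X, Y}
support_XY = len(t_XY)        # support = number of shared tids
diffset_XY_X = t["X"] - t_XY  # Diffset(XY, X): tids lost when Y is added

print(sorted(t_XY))           # ['T1', 'T3']
print(sorted(diffset_XY_X))   # ['T2']
print(t["X"] <= t["Y"])       # t(X) ⊆ t(Y) would mean every X-transaction has Y
```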

33
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

34
Mining Various Kinds of Association Rules
  • Mining multilevel association
  • Mining multidimensional association
  • Mining quantitative association
  • Mining interesting correlation patterns

35
Mining Multiple-Level Association Rules
  • Items often form hierarchies
  • Flexible support settings
  • Items at the lower level are expected to have
    lower support

36
Mining Multiple-Level Association Rules (cont.)
  • Association rules generated from mining data at
    multiple levels of abstraction are called
    multilevel association rules
  • Using uniform minimum support for all levels
  • Using reduced minimum support at lower levels
  • Using item- or group-based minimum support
  • Example:
  • buys(X, "laptop") ⇒ buys(X, "HP printer")
  •   support = 8%, confidence = 70%
  • buys(X, "IBM laptop") ⇒ buys(X, "HP printer")
  •   support = 2%, confidence = 72%

37
Multi-level Association: Redundancy Filtering
  • Some rules may be redundant due to ancestor
    relationships between items.
  • Example:
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
  • We say the first rule is an ancestor of the
    second rule.
  • A rule is redundant if its support is close to
    the expected value, based on the rule's
    ancestor.

38
Mining Multi-Dimensional Association
  • Association rules that involve two or more
    dimensions or predicates are referred to as
    multidimensional association rules
  • Example:
  • age(X, "20..29") ∧ occupation(X, "student") ⇒
    buys(X, "laptop")

39
Mining Multi-Dimensional Association (cont.)
  • Single-dimensional rules:
  • buys(X, "milk") ⇒ buys(X, "bread")
  • Multi-dimensional rules: ≥ 2 dimensions or
    predicates
  • Inter-dimension assoc. rules (no repeated
    predicates)
  • age(X, "19-25") ∧ occupation(X, "student") ⇒
    buys(X, "coke")
  • Hybrid-dimension assoc. rules (repeated
    predicates)
  • age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X,
    "coke")
  • Categorical attributes: finite number of possible
    values, no ordering among values; data cube
    approach
  • Quantitative attributes: numeric, implicit
    ordering among values; discretization, clustering,
    and gradient approaches

40
Mining Quantitative Associations
  • Techniques can be categorized by how numerical
    attributes, such as age or salary, are treated
  • Static discretization based on predefined concept
    hierarchies (data cube methods)
  • Dynamic discretization based on data distribution
    (quantitative rules, e.g., Agrawal &
    Srikant @SIGMOD'96)
  • Clustering: distance-based association (e.g.,
    Yang & Miller @SIGMOD'97)
  • One-dimensional clustering, then association
  • Deviation (such as Aumann and Lindell @KDD'99)
  • Sex = female ⇒ Wage: mean = $7/hr (overall mean =
    $9/hr)

41
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

42
Strong Rules Are Not Necessarily Interesting
  • Example: of 10,000 transactions at
    AllElectronics, 6,000 included computer games,
    7,500 included videos, and 4,000 included both
    computer games and videos.
  • buys(X, "computer games") ⇒ buys(X, "videos")
    [s = 40%, c = 66%]
  • Is the above association rule interesting?
  • Hint: overall, 75% of the 10,000 customers bought
    videos, while only 66% of the computer-game-buying
    customers bought videos

43
Interestingness Measure: Correlations (Lift)
  • play basketball ⇒ eat cereal [40%, 66.7%] is
    misleading
  • The overall % of students eating cereal is 75% >
    66.7%
  • play basketball ⇒ not eat cereal [20%, 33.3%] is
    more accurate, although with lower support and
    confidence
  • Measure of dependent/correlated events: lift

44
Lift
  • Given a rule A ⇒ B, lift measures the correlation
    between A and B:
    lift(A, B) = P(A ∪ B) / (P(A) · P(B))
  • The occurrence of itemset A is independent of the
    occurrence of itemset B if lift(A, B) = 1
  • If lift(A, B) > 1, the occurrence of A is
    positively correlated with the occurrence of B
  • If lift(A, B) < 1, the occurrence of A is
    negatively correlated with the occurrence of B
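A quick check of lift on the AllElectronics numbers from the earlier slide (10,000 transactions; 6,000 with games; 7,500 with videos; 4,000 with both):

```python
# AllElectronics counts from the earlier slide.
n, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A ∪ B) / (P(A) · P(B)), from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

l = lift(n_both, n_game, n_video, n)
print(round(l, 3))  # 0.889 < 1: buying games and videos are negatively correlated
```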

45
χ²
  • Suppose A has c distinct values a1, a2, …, ac, and B
    has r distinct values b1, b2, …, br. The data
    tuples can be summarized in an r × c contingency
    table
  • The χ² statistic is defined as
    χ² = Σi Σj (nij - eij)² / eij
  • where nij is the observed frequency of the
    joint event (ai, bj) and eij is the expected frequency
    of (ai, bj), which can be computed as
    eij = (count(A = ai) × count(B = bj)) / n,
    with n the total number of data tuples

46
Example
  • The χ² value computed from the game/video data of
    the earlier slide is 555.6. Since this far exceeds
    what independence would predict, and the observed
    count for (game, video), 4,000, is less than the
    expected value, 4,500, buying games and buying
    videos are negatively correlated.
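The χ² computation can be verified on the contingency table implied by the game/video counts (4,000 both, 2,000 game-only, 3,500 video-only, 500 neither, of 10,000):

```python
def chi_square(obs):
    """χ² over an observed-frequency table: sum of (n_ij - e_ij)^2 / e_ij."""
    n = sum(sum(row) for row in obs)
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n  # expected under independence
            chi2 += (n_ij - e_ij) ** 2 / e_ij
    return chi2

obs = [[4000, 2000],   # game:    video, no video
       [3500,  500]]   # no game: video, no video
print(round(chi_square(obs), 1))  # 555.6
```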

47
Other Correlation Measures
  • All-confidence: given an itemset X = {i1, i2, …, ik},
    the all-confidence of X is defined as
    all_conf(X) = sup(X) / max{sup(ij) : ij ∈ X}
  • Cosine: given itemsets A and B, the cosine
    measure of A and B is defined as
    cosine(A, B) = sup(A ∪ B) / √(sup(A) · sup(B))
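Both measures can be sketched on hypothetical support counts (note that the database size cancels out of the cosine, so raw counts suffice):

```python
from math import sqrt

# Hypothetical support counts.
sup = {frozenset("A"): 3, frozenset("B"): 5, frozenset("AB"): 3}

def all_conf(X, sup):
    """all_conf(X) = sup(X) / max item support in X."""
    return sup[X] / max(sup[frozenset([i])] for i in X)

def cosine(A, B, sup):
    """cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) * sup(B))."""
    return sup[A | B] / sqrt(sup[A] * sup[B])

print(all_conf(frozenset("AB"), sup))                         # 0.6
print(round(cosine(frozenset("A"), frozenset("B"), sup), 3))  # 0.775
```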

48
Are Lift and χ² Good Measures of Correlation?
  • buy walnuts ⇒ buy milk [1%, 80%] is
    misleading
  • if 85% of customers buy milk
  • Support and confidence are not good measures of
    correlation
  • So many interestingness measures? (Tan, Kumar &
    Srivastava @KDD'02)

49
Which Measures Should Be Used?
  • Lift and χ² are not good measures for
    correlations in large transactional DBs
  • All-confidence or coherence could be good measures
    (Omiecinski @TKDE'03)
  • Both all-confidence and coherence have the downward
    closure property
  • Efficient algorithms can be derived for mining
    them (Lee et al. @ICDM'03sub)

50
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

51
Constraint-based (Query-Directed) Mining
  • Finding all the patterns in a database
    autonomously? Unrealistic!
  • The patterns could be too many but not focused!
  • Data mining should be an interactive process
  • The user directs what is to be mined using a data
    mining query language (or a graphical user
    interface)
  • Constraint-based mining
  • User flexibility: provides constraints on what is
    to be mined
  • System optimization: explores such constraints
    for efficient mining (constraint-based mining)

52
Constraints in Data Mining
  • Knowledge type constraint:
  • classification, association, etc.
  • Data constraint, using SQL-like queries:
  • find product pairs sold together in stores in
    Chicago in Dec.'02
  • Dimension/level constraint:
  • in relevance to region, price, brand, customer
    category
  • Rule (or pattern) constraint:
  • small sales (price < $10) triggers big sales
    (sum > $200)
  • Interestingness constraint:
  • strong rules: min_support ≥ 3%, min_confidence
    ≥ 60%

53
Chapter 5: Mining Frequent Patterns, Association
and Correlations
  • Basic concepts and a road map
  • Efficient and scalable frequent itemset mining
    methods
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

54
Frequent-Pattern Mining Summary
  • Frequent pattern mining: an important task in data
    mining
  • Scalable frequent pattern mining methods
  • Apriori (candidate generation & test)
  • Projection-based (FPgrowth, CLOSET, ...)
  • Vertical format approach (CHARM, ...)
  • Mining a variety of rules and interesting
    patterns
  • Constraint-based mining
  • Mining sequential and structured patterns
  • Extensions and applications

55
Frequent-Pattern Mining Research Problems
  • Mining fault-tolerant frequent, sequential and
    structured patterns
  • Patterns that allow limited faults (insertion,
    deletion, mutation)
  • Mining truly interesting patterns
  • Surprising, novel, concise, …
  • Application exploration
  • E.g., DNA sequence analysis and bio-pattern
    classification
  • Invisible data mining