1
Mining Frequent Patterns, Association, and
Correlations
2
- The Course
[Course-map diagram: data sources (DS) feed a staging database (DP) and the data warehouse (DW); OLAP and data mining (DM) build on the warehouse. Chapter labels: Ch2-Ch4 for the warehouse/OLAP/preprocessing side, Ch5 Association, Ch6 Classification, Ch7 Clustering.]
Legend: DS = data source, DW = data warehouse, DM = data mining, DP = staging database
3
Motivation
In this shopping basket the customer bought tomatoes, carrots, bananas, bread, eggs, milk, etc.
  • How does demographic information affect what the customer buys?
  • Is bread usually bought together with milk?
  • Does a specific milk brand make any difference?
  • Is bread bought when milk and eggs are bought together?
  • Where should we place the tomatoes in the store to maximize their sales?
4
Mining Frequent Patterns, Association, and
Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining
    methods
  • Association Rule Mining
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

5
Definition: Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Sugar}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Bread, Milk, Sugar}) = 2
  • Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Bread, Milk, Sugar}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Coffee, Eggs, Sugar
3     Milk, Coffee, Coke, Sugar
4     Bread, Coffee, Milk, Sugar
5     Bread, Coke, Milk, Sugar
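Below is a minimal sketch in Python (my own illustration, not part of the original slides) of how support count and support are computed for an itemset against the transaction table above:

    # Toy transaction database from the table above
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Coffee", "Eggs", "Sugar"},
        {"Milk", "Coffee", "Coke", "Sugar"},
        {"Bread", "Coffee", "Milk", "Sugar"},
        {"Bread", "Coke", "Milk", "Sugar"},
    ]

    def support_count(itemset, transactions):
        # Number of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        # Fraction of transactions that contain the itemset
        return support_count(itemset, transactions) / len(transactions)

    print(support_count({"Bread", "Milk", "Sugar"}, transactions))  # 2
    print(support({"Bread", "Milk", "Sugar"}, transactions))        # 0.4 (= 2/5)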
6
Mining Frequent Patterns, Association and
Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining
    methods
  • Association Rule Mining
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

7
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
8
Frequent Itemset Generation
  • There are a number of algorithms for generating frequent itemsets, including:
  • Brute force
  • Apriori-based
  • Simple
  • Hash-based
  • Partitioning
  • Sampling
  • FP-growth
  • Vertical data format

9
-- Brute-Force
  • Each itemset in the lattice is a candidate
    frequent itemset
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity: O(NMw) ⇒ expensive, since M = 2^d (N = number of transactions, M = number of candidates, w = max transaction width)

Transactions
TID Items
1 bread, milk
2 bread, coffee, eggs, sugar
3 milk, coffee, coke, sugar
4 bread, coffee, milk, sugar
5 bread, coke , milk, sugar
Candidates
[Figure: the list of all M candidate itemsets; each of the N transactions (of width up to w) is matched against every candidate]
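As an illustration of the cost, here is a brute-force sketch in Python (reusing the transactions list from the earlier example) that enumerates every candidate itemset in the lattice and counts its support by scanning the database, i.e. O(NMw) work with M = 2^d:

    from itertools import combinations

    def brute_force_frequent(transactions, minsup):
        items = sorted(set().union(*transactions))           # the d distinct items
        frequent = {}
        for k in range(1, len(items) + 1):                   # all 2^d - 1 non-empty candidates
            for candidate in combinations(items, k):
                cset = set(candidate)
                # One pass over the database per candidate
                count = sum(1 for t in transactions if cset <= t)
                if count / len(transactions) >= minsup:
                    frequent[candidate] = count
        return frequent

    # e.g. brute_force_frequent(transactions, minsup=0.6)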
10
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the
    candidates or transactions
  • No need to match every candidate against every
    transaction

11
- Reducing Number of Candidates
  • Apriori principle
  • If an itemset is frequent, then all of its subsets must also be frequent
  • Equivalently (the contrapositive used for pruning): if an itemset is infrequent, then all of its supersets must be infrequent
  • The Apriori principle holds due to the following property of the support measure:
  • The support of an itemset never exceeds the support of its subsets
  • This is known as the anti-monotone property of support

12
Illustrating Apriori Principle
13
Illustrating Apriori Principle
Minimum support count = 3

1-itemsets
Itemset   Count
Bread     4
Coke      2
Milk      4
Coffee    3
Sugar     4
Eggs      1

2-itemsets
Itemset         Count
Bread, Milk     3
Bread, Coffee   2
Bread, Sugar    3
Milk, Coffee    2
Milk, Sugar     3
Coffee, Sugar   3

3-itemsets
Itemset              Count
Bread, Milk, Sugar   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
(No need to generate candidates involving Coke or Eggs.)
14
---- The Apriori Algorithm: An Example
Supmin (minimum support count) = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1 (candidate 1-itemsets)
Itemset   sup
A         2
B         3
C         3
D         1
E         3

L1 (frequent 1-itemsets)
Itemset   sup
A         2
B         3
C         3
E         3

C2 (candidate 2-itemsets generated from L1)
A, B   A, C   A, E   B, C   B, E   C, E

2nd scan: C2 with support counts
Itemset   sup
A, B      1
A, C      2
A, E      1
B, C      2
B, E      3
C, E      2

L2 (frequent 2-itemsets)
Itemset   sup
A, C      2
B, C      2
B, E      3
C, E      2

C3 (candidate 3-itemsets generated from L2)
B, C, E

3rd scan: L3 (frequent 3-itemsets)
Itemset   sup
B, C, E   2
15
- Apriori Algorithm
  • Method (a runnable Python sketch follows after this list)
  • Let k = 1
  • Generate frequent itemsets of length 1
  • Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
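A compact sketch of the whole method in Python (illustrative code under the assumptions that transactions are sets of items and minsup is a fraction; not the authors' implementation):

    from itertools import combinations

    def apriori(transactions, minsup):
        n = len(transactions)
        mincount = minsup * n
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= mincount}
        all_frequent = dict(Lk)
        k = 1
        while Lk:
            # Generate (k+1)-candidates by joining frequent k-itemsets,
            # pruning any candidate that has an infrequent k-subset
            candidates = set()
            keys = list(Lk)
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    union = keys[i] | keys[j]
                    if len(union) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(union, k)
                    ):
                        candidates.add(union)
            # Count all surviving candidates with one scan of the DB
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {c: cnt for c, cnt in counts.items() if cnt >= mincount}
            all_frequent.update(Lk)
            k += 1
        return all_frequent

For example, apriori(transactions, 0.6) on the earlier toy database returns the frequent itemsets together with their support counts.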

16
---- The Apriori Algorithm
  • Pseudo-code
  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent items}
  • for (k = 1; Lk != ∅; k++) do begin
  •   Ck+1 = candidates generated from Lk
  •   for each transaction t in the database do
  •     increment the count of all candidates in Ck+1 that are contained in t
  •   Lk+1 = candidates in Ck+1 with min_support
  • end
  • return ∪k Lk

17
---- Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?
  • Example of candidate generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}

18
---- How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
  •   insert into Ck
  •   select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  •   from Lk-1 p, Lk-1 q
  •   where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
  • Step 2: pruning
  •   forall itemsets c in Ck do
  •     forall (k-1)-subsets s of c do
  •       if (s is not in Lk-1) then delete c from Ck
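The same join-and-prune step can be sketched in Python (illustrative; it assumes each frequent (k-1)-itemset is stored as a sorted tuple in L_prev):

    from itertools import combinations

    def generate_candidates(L_prev, k):
        """Self-join L_{k-1} with itself, then prune by the Apriori property."""
        L_set = set(L_prev)
        candidates = []
        for p in L_prev:
            for q in L_prev:
                # Join: first k-2 items equal, last item of p precedes last item of q
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of the candidate must be frequent
                    if all(sub in L_set for sub in combinations(c, k - 1)):
                        candidates.append(c)
        return candidates

    # Example matching the previous slide:
    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    print(generate_candidates(L3, 4))   # [('a', 'b', 'c', 'd')]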

19
Factors Affecting Complexity
  • Choice of minimum support threshold
  • lowering support threshold results in more
    frequent itemsets. This may increase number of
    candidates and max length of frequent itemsets
  • Dimensionality (number of items) of the data set
  • more space is needed to store support count of
    each item
  • if number of frequent items also increases, both
    computation and I/O costs may also increase
  • Size of database
  • since Apriori makes multiple passes, run time of
    algorithm may increase with number of
    transactions
  • Average transaction width
  • transaction width increases with denser data
    sets
  • This may increase max length of frequent itemsets
    and traversals of hash tree (number of subsets in
    a transaction increases with its width)

20
-- Improving the Efficiency of Apriori
  • Reduce the number of Comparisons
  • Reduce the number of Transactions
  • Partitioning the data to find candidate itemsets
  • Sampling: mine on a subset of the given data

21
--- Reducing Number of Comparisons
  • Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure
  • Instead of matching each transaction against
    every candidate, match it against candidates
    contained in the hashed buckets

h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Hash table H2
Bucket address   Count   Contents
0                2       {A,D} {C,E}
1                2       {A,E} {A,E}
2                4       {B,C} {B,C} {B,C} {B,C}
3                2       {B,D} {B,D}
4                2       {B,E} {B,E}
5                4       {A,B} {A,B} {A,B} {A,B}
6                4       {A,C} {A,C} {A,C} {A,C}
If min_sup = 3, the 2-itemsets hashed to buckets 0, 1, 3, and 4 cannot be frequent, so they should not be included in C2
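A small sketch of this hash-based filtering idea in Python (illustrative; the order mapping A=1, ..., E=5 is an assumption chosen to match the slide's hash function):

    from itertools import combinations

    order = {item: i + 1 for i, item in enumerate("ABCDE")}

    def h(x, y):
        # Bucket address of a candidate 2-itemset, as defined on the slide
        return (order[x] * 10 + order[y]) % 7

    def bucket_counts(transactions, n_buckets=7):
        # While scanning for L1, also hash every 2-itemset of every transaction
        counts = [0] * n_buckets
        for t in transactions:
            for x, y in combinations(sorted(t), 2):
                counts[h(x, y)] += 1
        return counts

    def may_be_frequent(pair, counts, min_sup_count):
        # A 2-itemset can only be frequent if its bucket count reaches min_sup_count
        x, y = sorted(pair)
        return counts[h(x, y)] >= min_sup_count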
22
--- Reduce the number of Transactions
  • A transaction that doesn't contain any frequent k-itemset cannot contain any frequent j-itemset for any j > k. Such a transaction can therefore be marked or removed from subsequent scans of the database for j-itemsets.

23
--- Partitioning the data to find candidate
itemsets
  • Requires just 2 database scans to mine the frequent itemsets, provided each partition fits in the available memory.
  • It consists of 2 phases
  • Phase 1
  • The algorithm partitions the transactions of D into n partitions.
  • For each partition, find the local frequent itemsets. A local frequent itemset has a support count ≥ min_sup × (the number of transactions in that partition).
  • For each itemset, a special data structure records the TIDs of the transactions containing it.
  • Phase 2
  • A second scan of D is conducted to find the actual support of each local frequent itemset.
  • Any itemset that is frequent in D must be locally frequent in at least one partition, so no global frequent itemset is missed.

24
--- Partitioning the data to find candidate
itemsets
  • Partitioning of the data
  • Data set partitioning generates frequent itemsets
    based on finding frequent itemsets in subsets
    (partition) of D

25
--- Sampling Mine on a subset of the given data
  • Pick a random sample S of the given data D. (Make sure S fits in the available memory.)
  • Search for frequent itemsets in S instead of in D. The support threshold can be lowered to reduce the number of missed frequent itemsets.
  • Find the set Ls of frequent itemsets in S.
  • The rest of the database can be used to compute the actual frequencies of the itemsets in Ls.
  • If Ls doesn't contain all the frequent itemsets in D, then a second pass will be needed.
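A rough sketch of the sampling idea in Python (illustrative; sample_frac and the slack factor on the threshold are arbitrary choices, and apriori() refers to the sketch given earlier):

    import random

    def sample_then_verify(transactions, minsup, sample_frac=0.2, slack=0.75):
        # Phase 1: mine a random sample with a lowered threshold
        S = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
        Ls = apriori(S, minsup * slack)
        # Phase 2: verify actual supports against the full database
        n = len(transactions)
        verified = {}
        for itemset in Ls:
            count = sum(1 for t in transactions if itemset <= t)
            if count / n >= minsup:
                verified[itemset] = count
        # Itemsets absent from Ls may still be missed; a second pass can check for them
        return verified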

26
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of scanning and generates lots of candidates
  • To find the frequent itemset {i1, i2, ..., i100}:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 !
  • Bottleneck: candidate generation-and-test
  • Can we avoid candidate generation?
  • Yes, with the FP-growth algorithm (see next slide)

27
FP-growth Another Method for Frequent Itemset
Generation
  • Uses a compressed representation of the database, the FP-tree
  • Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets (a minimal construction sketch follows below)
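A minimal FP-tree construction sketch in Python (illustrative only; real implementations also thread node-links through a header table, which is approximated here with simple lists):

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item = item
            self.count = 0
            self.parent = parent
            self.children = {}

    def build_fp_tree(transactions, minsup_count):
        # 1st scan: count items and keep only the frequent ones
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= minsup_count}

        root = FPNode(None, None)
        header = defaultdict(list)   # item -> list of nodes (node-link pointers)
        # 2nd scan: insert each transaction, items ordered by descending frequency
        for t in transactions:
            items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header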

28
FP-Tree Construction
[Figure: FP-tree snapshots during construction. After reading TID 1, the tree is null -> A:1 -> B:1. After reading TID 2, the tree has two branches: null -> A:1 -> B:1 and null -> B:1 -> C:1 -> D:1.]
29
FP-Tree Construction
Transaction Database
[Figure: the complete FP-tree built from the example database, with branch node counts including A:7, B:5, B:3, C:3, C:1, D:1 and E:1, and a header table whose node-links connect all nodes carrying the same item]
Pointers are used to assist frequent itemset generation
30
FP-growth
Build the conditional pattern base for E:
P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}
Recursively apply FP-growth on P
[Figure: the full FP-tree with the prefix paths of E highlighted]
31
FP-growth
Conditional tree for E
Conditional pattern base for E:
P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}
Count for E is 3, so {E} is a frequent itemset
Recursively apply FP-growth on P
[Figure: conditional FP-tree for E with branch node counts A:2, B:1, C:1, C:1, D:1, D:1]
32
FP-growth
Conditional tree for D within the conditional tree for E
Conditional pattern base for D within the conditional base for E:
P = {(A:1, C:1, D:1), (A:1, D:1)}
Count for D is 2, so {D, E} is a frequent itemset
Recursively apply FP-growth on P
[Figure: conditional tree with branch node counts A:2, C:1, D:1, D:1]
33
FP-growth
Conditional tree for C within D within E
Conditional pattern base for C within D within E:
P = {(A:1, C:1)}
Count for C is 1, so {C, D, E} is NOT a frequent itemset
[Figure: conditional tree with nodes A:1, C:1]
34
FP-growth
Conditional tree for A within D within E
Count for A is 2, so {A, D, E} is a frequent itemset
Next step: construct the conditional tree for C within the conditional tree for E
Continue until the conditional tree for A (which contains only node A) has been explored
[Figure: conditional tree with node A:2]
35
Benefits of the FP-tree Structure
36
Why is FP-Growth the Winner?
  • Divide-and-conquer
  • decompose both the mining task and DB according
    to the frequent patterns obtained so far
  • leads to focused search of smaller databases
  • Other factors
  • no candidate generation, no candidate test
  • compressed database: the FP-tree structure
  • no repeated scans of the entire database
  • basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

37
Mining Frequent Itemsets using Vertical Data
Format
  • For each item, store a list of transaction ids (tids): the vertical data layout

[Figure: the TID-list for each item]
38
Mining Frequent Itemsets using Vertical Data
Format
  • Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets
  • Advantage: very fast support counting
  • Disadvantage: intermediate tid-lists may become too large for memory
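A small sketch of the vertical (tid-list) approach in Python (illustrative; itemsets are assumed to be stored as frozensets keyed in a dictionary of tid-sets):

    def vertical_layout(transactions):
        # item -> set of TIDs that contain it
        tidlists = {}
        for tid, t in enumerate(transactions, start=1):
            for item in t:
                tidlists.setdefault(frozenset([item]), set()).add(tid)
        return tidlists

    def join_tidlists(tidlists, a, b):
        # The support of (a ∪ b) is the size of the intersection of the two tid-lists
        union = a | b
        tidlists[union] = tidlists[a] & tidlists[b]
        return len(tidlists[union])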
39
- Compact Representation of Frequent Itemsets
  • Some itemsets are redundant because they have identical support to their supersets
  • The number of frequent itemsets can be very large (exponential in the number of items)
  • Need a compact representation

40
-- Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border]
41
-- Closed Itemset
  • An itemset is closed if none of its immediate
    supersets has the same support as the itemset
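A naive check in Python (illustrative) that classifies frequent itemsets as closed and/or maximal, assuming a dict frequent that maps each frequent itemset (a frozenset) to its support count:

    def closed_and_maximal(frequent):
        closed, maximal = set(), set()
        for itemset, sup in frequent.items():
            # Immediate supersets of this itemset that are themselves frequent
            supersets = [s for s in frequent if len(s) == len(itemset) + 1 and itemset < s]
            # Closed: no immediate superset has the same support
            if all(frequent[s] != sup for s in supersets):
                closed.add(itemset)
            # Maximal: no immediate superset is frequent at all
            if not supersets:
                maximal.add(itemset)
        return closed, maximal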

42
-- Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with the transaction ids that support each itemset; itemsets not supported by any transaction are marked]
43
-- Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the lattice with closed-but-not-maximal and closed-and-maximal frequent itemsets highlighted]
# closed = 9, # maximal = 4
44
-- Maximal vs Closed Itemsets
45
-- Mining Closed Frequent Itemsets
  • A naïve approach
  • Generate all possible frequent itemsets, then remove the non-closed itemsets
  • A recommended methodology: search for frequent closed itemsets during mining
  • Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
  • Sub-itemset pruning: if Y ⊇ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
  • Efficient subset checking: use a compressed pattern tree, similar in structure to the FP-tree except that its branches store closed itemsets
  • Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels (used in depth-first mining of closed itemsets, which we don't cover)

46
Mining Frequent Patterns, Association and
Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining
    methods
  • Association Rule Mining
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

47
- Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

Market-Basket transactions
TID   Items
1     Bread, Milk
2     Bread, Coffee, Eggs, Sugar
3     Milk, Coffee, Coke, Sugar
4     Bread, Coffee, Milk, Sugar
5     Bread, Coke, Milk, Sugar

Example of Association Rules:
{Sugar} ⇒ {Coffee}
{Milk, Bread} ⇒ {Eggs, Coke}
{Coffee, Bread} ⇒ {Milk}
48
-- Definition Association Rule
  • Association Rule
  • An implication expression of the form X ⇒ Y, where X and Y are itemsets
  • Example: {Milk, Sugar} ⇒ {Coffee}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and
    Y
  • Confidence (c)
  • Measures how often items in Y appear in transactions that contain X

TID Items
1 Bread, Milk
2 Bread, coffee, eggs, sugar
3 Milk, coffee, coke, sugar
4 Bread, coffee, milk, sugar
5 Bread, coke , milk, sugar
{Milk, Sugar} ⇒ {Coffee}
s = σ(Milk, Sugar, Coffee) / |T| = 2/5 = 0.4
c = σ(Milk, Sugar, Coffee) / σ(Milk, Sugar) = 2/3 ≈ 0.67
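A small helper in Python (illustrative, reusing support_count() and transactions from the earlier sketch) that evaluates a rule X ⇒ Y on the toy database:

    def rule_metrics(X, Y, transactions):
        n = len(transactions)
        both = support_count(X | Y, transactions)
        s = both / n                                  # support of X ⇒ Y
        c = both / support_count(X, transactions)     # confidence of X ⇒ Y
        return s, c

    # rule_metrics({"Milk", "Sugar"}, {"Coffee"}, transactions)  ->  (0.4, 0.666...)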
49
-- Example
50
-- Computational Complexity
  • Given d unique items
  • Total number of itemsets: 2^d
  • Total number of possible association rules:
  •   R = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules
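A quick arithmetic check of the formula in Python:

    d = 6
    R = 3**d - 2**(d + 1) + 1
    print(R)   # 602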
51
-- Association Rule Mining Task
  • Given a set of transactions T, the goal of
    association rule mining is to find all rules
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!

52
-- Mining Association Rules
Example of Rules:
{Milk, Sugar} ⇒ {Coffee}   (s = 0.4, c = 0.67)
{Milk, Coffee} ⇒ {Sugar}   (s = 0.4, c = 1.0)
{Sugar, Coffee} ⇒ {Milk}   (s = 0.4, c = 0.67)
{Coffee} ⇒ {Milk, Sugar}   (s = 0.4, c = 0.67)
{Sugar} ⇒ {Milk, Coffee}   (s = 0.4, c = 0.5)
{Milk} ⇒ {Sugar, Coffee}   (s = 0.4, c = 0.5)
TID Items
1 bread, milk
2 bread, coffee, eggs, sugar
3 milk, coffee, coke, sugar
4 bread, coffee, milk, sugar
5 bread, coke , milk, sugar
  • Observations
  • All the above rules are binary partitions of the
    same itemset milk, sugar, coffee
  • Rules originating from the same itemset have
    identical support but can have different
    confidence
  • Thus, we may decouple the support and confidence
    requirements
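A sketch in Python (illustrative, reusing support_count() and transactions from the earlier sketch) that generates all binary-partition rules of one frequent itemset and computes their confidences, reproducing the list above:

    from itertools import combinations

    def rules_from_itemset(itemset, transactions, minconf=0.0):
        itemset = frozenset(itemset)
        sup_count = support_count(itemset, transactions)
        rules = []
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                X = frozenset(lhs)
                Y = itemset - X
                conf = sup_count / support_count(X, transactions)
                if conf >= minconf:
                    rules.append((set(X), set(Y), sup_count / len(transactions), conf))
        return rules

    # rules_from_itemset({"Milk", "Sugar", "Coffee"}, transactions)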

53
-- Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

54
Mining Frequent Patterns, Association and
Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining
    methods
  • Association Rule Mining
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Constraint-based association mining
  • Summary

55
- Mining Various Kinds of Association Rules
  • Mining multilevel association
  • Mining multidimensional association
  • Mining quantitative association

56
-- Mining Multiple-Level Association Rules
  • Items often form hierarchies
  • Flexible support settings
  • Items at the lower level are expected to have
    lower support

57
--- Multi-level Association Redundancy Filtering
  • Some rules may be redundant due to ancestor relationships between items.
  • Example
  • Laptop ⇒ HP printer [support = 8%, confidence = 70%]
  • IBM laptop ⇒ HP printer [support = 2%, confidence = 72%]
  • We say the first rule is an ancestor of the second rule.
  • A rule is redundant if its support is close to the expected value, based on the rule's ancestor.

58
-- Mining Multi-Dimensional Association
  • Single-dimensional rules
  • buys(X, "milk") ⇒ buys(X, "bread")
  • Multi-dimensional rules: ≥ 2 dimensions or predicates
  • Inter-dimension association rules (no repeated predicates)
  • age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  • Hybrid-dimension association rules (repeated predicates)
  • age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
  • Categorical attributes: finite number of possible values, no ordering among values; data cube approach
  • Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

59
-- Mining Quantitative Associations
  • Techniques can be categorized by how numerical
    attributes, such as age or salary are treated
  • Static discretization based on predefined concept
    hierarchies (data cube methods)
  • Dynamic discretization based on data distribution

60
--- Static Discretization of Quantitative
Attributes
  • Discretized prior to mining using concept hierarchies.
  • Numeric values are replaced by ranges.
  • In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
  • Data cubes are well suited for mining.
  • The cells of an n-dimensional cuboid correspond to the predicate sets.
  • Mining from data cubes can be much faster.

61
-- Quantitative Association Rules
  • Numeric attributes are dynamically discretized
  • such that the confidence or compactness of the rules mined is maximized
  • 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
  • Cluster adjacent association rules to form general rules using a 2-D grid
  • Example

age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
62
Mining Frequent Patterns, Association and
Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining
    methods
  • Association Rule Mining
  • Mining various kinds of association rules
  • From association mining to correlation analysis
  • Summary

63
- Finding Interesting Association Rules
  • Depending on the minimum support and confidence values, the user may generate a large number of rules to analyze and assess
  • How can we filter out the rules that are potentially the most interesting?
  • Whether a rule is interesting (or not) can be evaluated either objectively or subjectively
  • The ultimate subjective user evaluation cannot be quantified or anticipated; it differs from user to user
  • That is why objective interestingness measures, based on the statistical information present in D, were developed

64
-- Finding Interesting Association Rules
  • The subjective evaluation of association rules
    often boils down to checking if a given rule is
    unexpected (i.e., surprises the user) and
    actionable (i.e., the user can do something
    useful based on the rule).
  • useful, when they provide high-quality, actionable information, e.g. Pepsi ⇒ chips
  • trivial, when they are valid and supported by data, but useless since they confirm well-known facts, e.g. milk ⇒ bread
  • inexplicable, when they concern valid and new facts, but cannot be utilized, e.g. grocery_store ⇒ milk_is_sold_as_often_as_bread

65
-- Finding Interesting Association Rules
  • In most cases, confidence and support values
    associated with each rule are used as the
    objective measure to select the most interesting
    rules
  • rules that have these values higher with respect
    to other rules are preferred
  • although this simple approach works in many
    cases, we will show that sometimes rules that
    have high confidence and support may be
    uninteresting and even misleading

66
-- Finding Interesting Association Rules
  • Example
  • Let us assume that a transactional data set concerning a grocery store contains tea and coffee as frequent items
  • 2,000 transactions were recorded, and among them
  • in 1,200 transactions the customers bought tea
  • in 1,650 transactions the customers bought coffee
  • in 900 transactions the customers bought both

             tea    not tea   total
coffee       900    750       1650
not coffee   300    50        350
total        1200   800       2000
67
-- Finding Interesting Association Rules
  • Example
  • Given a minimum support threshold of 40% and a minimum confidence threshold of 70%, the rule tea ⇒ coffee [support = 45%, confidence = 75%] would be generated
  • On the other hand, due to its low support and confidence values, the rule tea ⇒ not coffee [support = 15%, confidence = 25%] would not be generated
  • Yet the latter rule is by far more accurate, while the first may be misleading

68
-- Finding Interesting Association Rules
  • Example
  • tea ⇒ coffee [support = 45%, confidence = 75%] rule
  • The probability of buying coffee is 82.5%, while the confidence of tea ⇒ coffee is lower and equals 75%
  • Coffee and tea are therefore negatively associated: buying one decreases the likelihood of buying the other
  • Obviously, using this rule would not be a wise decision

             tea    not tea   total
coffee       900    750       1650
not coffee   300    50        350
total        1200   800       2000
69
-- Finding Interesting Association Rules
  • An alternative approach to evaluating the interestingness of association rules is to use measures based on correlation
  • For the rule A ⇒ B, the itemset A is independent of the occurrence of the itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events
  • The correlation measure (also referred to as lift or interest) between itemsets A and B is defined as
  •   lift(A, B) = P(A ∪ B) / ( P(A) P(B) )

70
-- Finding Interesting Association Rules
  • Correlation (lift) measure
  • If the correlation value is less than 1, then the occurrence of A is negatively correlated with (inhibits) the occurrence of B
  • If the value is greater than 1, then A and B are positively correlated, which means that the occurrence of one implies (promotes) the occurrence of the other
  • If the correlation equals 1, then A and B are independent, i.e., there is no correlation between these itemsets
  • The correlation value for the tea ⇒ coffee rule equals
  •   0.45 / (0.6 × 0.825) = 0.45 / 0.495 ≈ 0.91

71
END