1
Mining Association Rules
2
Data Mining Overview
  • Data Mining
  • Data warehouses and OLAP (On-Line Analytical
    Processing)
  • Association Rules Mining
  • Clustering: hierarchical and partitional
    approaches
  • Classification: decision trees and Bayesian
    classifiers
  • Sequential Patterns Mining
  • Advanced topics: outlier detection, web mining

3
Association Rules Background
  • Given (1) a database of transactions, where (2)
    each transaction is a list of items (purchased
    by a customer in one visit)
  • Find all association rules that satisfy
    user-specified minimum support and minimum
    confidence
  • Example: 30% of transactions that contain beer
    also contain diapers; 5% of transactions contain
    both items
  • 30% is the confidence of the rule
  • 5% is the support of the rule
  • We are interested in finding all rules, rather
    than verifying whether a given rule holds

4
Rule Measures: Support and Confidence
[Venn diagram: customer buys beer, customer buys
diaper, customer buys both]
  • Find all the rules X ∧ Y ⇒ Z with minimum
    confidence and support
  • support, s: probability that a transaction
    contains X ∪ Y ∪ Z
  • confidence, c: conditional probability that a
    transaction having X ∪ Y also contains Z
  • With minimum support 50% and minimum confidence
    50%, we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)
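
A minimal Python sketch of the two measures. The four
transactions below are an assumption: they are the
classic textbook example, chosen here because they
reproduce the numbers on this slide.

    # hypothetical database chosen to match the slide's numbers
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(X):
        """Fraction of transactions containing every item of X."""
        return sum(1 for t in D if X <= t) / len(D)

    def confidence(X, Y):
        """Conditional probability that a transaction with X also has Y."""
        return support(X | Y) / support(X)

    print(support({"A", "C"}))       # 0.5   -> A => C has 50% support
    print(confidence({"A"}, {"C"}))  # 0.666 -> A => C has 66.6% confidence
    print(confidence({"C"}, {"A"}))  # 1.0   -> C => A has 100% confidence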

5
Application Examples
  • Market Basket Analysis
  • * ⇒ Maintenance Agreement (what should the store
    do to boost Maintenance Agreement sales?)
  • Home Electronics ⇒ * (what other products should
    the store stock up on if it has a sale on Home
    Electronics?)
  • Attached mailing in direct marketing
  • Detecting ping-ponging of patients
  • Transaction: patient
  • Item: doctor/clinic visited by the patient
  • Support of the rule: number of common patients
  • HIC Australia success story

6
Problem Statement
  • I = {i1, i2, …, im}: a set of literals, called
    items
  • Transaction T: a set of items s.t. T ⊆ I
  • Database D: a set of transactions
  • A transaction contains X, a set of items in I, if
    X ⊆ T
  • An association rule is an implication of the form
    X ⇒ Y, where X, Y ⊂ I
  • The rule X ⇒ Y holds in the transaction set D
    with confidence c if c% of transactions in D
    that contain X also contain Y
  • The rule X ⇒ Y has support s in the transaction
    set D if s% of transactions in D contain X ∪ Y
  • Find all rules that have support and confidence
    greater than the user-specified minimum support
    and minimum confidence

7
Association Rule Mining: A Road Map
  • Boolean vs. quantitative associations (based on
    the types of values handled)
  • buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒
    buys(x, "DBMiner") [0.2%, 60%]
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒
    buys(x, "PC") [1%, 75%]
  • Single-dimensional vs. multi-dimensional
    associations (see the examples above)
  • Single-level vs. multiple-level analysis
  • What brands of beer are associated with what
    brands of diapers?
  • Various extensions
  • Correlation, causality analysis
  • Association does not necessarily imply
    correlation or causality
  • Constraints enforced
  • E.g., do small sales (sum < 100) trigger big buys
    (sum > 1,000)?

8
Problem Decomposition
  • 1. Find all sets of items that have minimum
    support (frequent itemsets)
  • 2. Use the frequent itemsets to generate the
    desired rules

9
Problem Decomposition: Example
For minimum support 50% (2 transactions) and minimum
confidence 50%:
For the rule Shoes ⇒ Jacket
  • Support: sup(Shoes, Jacket) = 50%
  • Confidence: 66.6%

Jacket ⇒ Shoes has 50% support and 100%
confidence
10
Discovering Rules
  • Naïve Algorithm
  • for each frequent itemset l do
  •   for each proper non-empty subset c of l do
  •     if (support(l) / support(l - c) > minconf)
        then
  •       output the rule (l - c) ⇒ c,
  •       with confidence = support(l) / support(l - c)
  •       and support = support(l)

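The pseudocode above translates almost directly to
Python. A hedged sketch: `frequent` is assumed to be an
iterable of frozensets, and `supports` a dict mapping
each frequent itemset to its support.

    from itertools import combinations

    def naive_rules(frequent, supports, minconf):
        rules = []
        for l in frequent:                 # each frequent itemset
            for r in range(1, len(l)):     # proper non-empty consequents
                for c in combinations(l, r):
                    c = frozenset(c)
                    # l - c is frequent too, so its support is known
                    conf = supports[l] / supports[l - c]
                    if conf > minconf:
                        rules.append((l - c, c, supports[l], conf))
        return rules
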
11
Discovering Rules (2)
  • Lemma: if consequent c generates a valid rule, so
    do all non-empty subsets of c (e.g., if X ⇒ YZ
    holds, then XY ⇒ Z and XZ ⇒ Y hold)
  • Example: consider a frequent itemset ABCDE
  • If ACDE ⇒ B and ABCE ⇒ D are the only
    one-consequent rules that meet minimum
    confidence, then
  • ACE ⇒ BD is the only other rule that needs to be
    tested

12
Mining Frequent Itemsets: the Key Step
  • Find the frequent itemsets: the sets of items
    that have minimum support
  • A subset of a frequent itemset must also be a
    frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A}
    and {B} must be frequent itemsets
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association
    rules

13
The Apriori Algorithm
  • Lk: set of frequent itemsets of size k (those
    with min support)
  • Ck: set of candidate itemsets of size k
    (potentially frequent itemsets)
  • L1 = {frequent items}
  • for (k = 1; Lk ≠ ∅; k++) do begin
  •   Ck+1 = candidates generated from Lk
  •   for each transaction t in the database do
  •     increment the count of all candidates in
        Ck+1 that are contained in t
  •   Lk+1 = candidates in Ck+1 with min support
  • end
  • return ∪k Lk

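A compact Python sketch of this loop, assuming
in-memory transactions represented as sets and an
absolute min_support count; generate_candidates is the
join-and-prune step sketched two slides below.

    def apriori(transactions, min_support):
        # L1: count single items and keep the frequent ones
        counts = {}
        for t in transactions:
            for i in t:
                s = frozenset([i])
                counts[s] = counts.get(s, 0) + 1
        L = {s for s, n in counts.items() if n >= min_support}
        result = set()
        while L:
            result |= L
            C = generate_candidates(L)       # Ck+1 from Lk
            counts = {c: 0 for c in C}
            for t in transactions:           # one scan per level
                for c in C:
                    if c <= t:
                        counts[c] += 1
            L = {c for c, n in counts.items() if n >= min_support}
        return result
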
14
The Apriori Algorithm: Example
Min support = 50% (2 transactions)
[Diagram: scan the database D to count C1 and extract
L1; self-join L1 into C2, scan D, extract L2;
self-join L2 into C3, scan D, extract L3]
15
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in order
  • Step 1: self-joining Lk-1
  •   insert into Ck
  •   select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  •   from Lk-1 p, Lk-1 q
  •   where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
  • Step 2: pruning
  •   forall itemsets c in Ck do
  •     forall (k-1)-subsets s of c do
  •       if (s is not in Lk-1) then delete c from Ck

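A Python sketch of the two steps, assuming Lk_minus_1
is a set of frozensets, all of size k-1, over mutually
comparable items.

    from itertools import combinations

    def generate_candidates(Lk_minus_1):
        Ck = set()
        for p in Lk_minus_1:
            for q in Lk_minus_1:
                sp, sq = sorted(p), sorted(q)
                # self-join: equal prefixes, p's last item < q's last item
                if sp[:-1] == sq[:-1] and sp[-1] < sq[-1]:
                    c = p | q
                    # prune: every (k-1)-subset of c must be frequent
                    if all(frozenset(s) in Lk_minus_1
                           for s in combinations(c, len(c) - 1)):
                        Ck.add(c)
        return Ck
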
16
Example of Generating Candidates
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 ⋈ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

17
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method:
  • Candidate itemsets are stored in a hash-tree
  • A leaf node of the hash-tree contains a list of
    itemsets and counts
  • An interior node contains a hash table
  • The subset function finds all the candidates
    contained in a transaction

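The hash-tree itself is elaborate; as a simplified
stand-in for its subset function, this sketch
enumerates the k-subsets of each transaction and looks
them up in a plain hash set of candidates, computing
the same counts without the tree.

    from itertools import combinations

    def count_candidates(transactions, Ck, k):
        # Ck is assumed to be a set of frozensets of size k
        counts = {c: 0 for c in Ck}
        for t in transactions:
            # every k-subset of t that is a candidate gets a count
            for s in combinations(sorted(t), k):
                s = frozenset(s)
                if s in counts:
                    counts[s] += 1
        return counts
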
18
Hash-tree Search
  • Given a transaction T and a candidate set Ck,
    find all members of Ck contained in T
  • Assume an ordering on the items
  • Start from the root; use every item in T to go to
    the next node
  • If you are at an interior node and you just used
    item i, then branch on each item that comes after
    i in T
  • If you are at a leaf node, check the itemsets
    stored there against T

19
Methods to Improve Apriori's Efficiency
  • Transaction reduction: a transaction that does
    not contain any frequent k-itemset is useless in
    subsequent scans and can be discarded (see the
    sketch below)
  • Partitioning: any itemset that is potentially
    frequent in DB must be frequent in at least one
    of the partitions of DB
  • Sampling: mine a subset of the given data with a
    lower support threshold, plus a method to verify
    the completeness of the result
  • Dynamic itemset counting: add new candidate
    itemsets only when all of their subsets are
    estimated to be frequent

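A one-line Python sketch of transaction reduction,
assuming Lk is a collection of frozensets (the
frequent k-itemsets of the current pass).

    def reduce_transactions(transactions, Lk):
        # keep a transaction only if it still contains
        # at least one frequent k-itemset
        return [t for t in transactions if any(c <= t for c in Lk)]
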
20
Is Apriori Fast Enough? Performance Bottlenecks
  • The core of the Apriori algorithm:
  • Use frequent (k-1)-itemsets to generate candidate
    frequent k-itemsets
  • Use database scans and pattern matching to
    collect counts for the candidate itemsets
  • The bottleneck of Apriori: candidate generation
  • Huge candidate sets
  • 10^4 frequent 1-itemsets will generate more than
    10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g.,
    {a1, a2, …, a100}, one needs to generate 2^100 ≈
    10^30 candidates
  • Multiple scans of the database
  • Needs (n + 1) scans, where n is the length of the
    longest pattern

21
Max-Miner
  • Max-Miner finds long patterns efficiently: the
    maximal frequent patterns
  • Instead of checking all subsets of a long
    pattern, it tries to detect long patterns early
  • Scales roughly linearly with the number of
    maximal frequent patterns

22
Max-Miner: the Idea
Set-enumeration tree of an ordered set:
[Tree over the items {1, 2, 3, 4}: root ∅ with
children 1, 2, 3, 4; second level 1,2 1,3 1,4 2,3
2,4 3,4; third level 1,2,3 1,2,4 1,3,4 2,3,4; leaf
1,2,3,4]
Pruning: (1) subset infrequency, (2) superset
frequency
Each node is a candidate group g: h(g) is the
head, the itemset of the node; t(g) is the tail, an
ordered set that contains all items that can
appear in the subnodes
Example: h({1}) = {1} and t({1}) = {2,3,4}
23
Max-Miner: Pruning
  • When we count the support of a candidate group g,
    we also compute the support for h(g), h(g) ∪
    t(g), and h(g) ∪ {i} for each i in t(g)
  • If h(g) ∪ t(g) is frequent, then stop expanding
    the node g and report the union as a frequent
    itemset
  • If h(g) ∪ {i} is infrequent, then remove i from
    all subnodes (just remove i from the tail of any
    group after g)
  • Expand the node g by one item and repeat

24
The Algorithm
  • Max-Miner(T)
  • set of candidate groups C ← ∅
  • set of itemsets F ← Gen-Initial-Groups(T, C)
  • while C is not empty do
  •   scan T to count the support of all candidate
      groups in C
  •   for each g in C s.t. h(g) ∪ t(g) is frequent do
  •     F ← F ∪ {h(g) ∪ t(g)}
  •   set of candidate groups Cnew ← ∅
  •   for each g in C such that h(g) ∪ t(g) is
      infrequent do
  •     F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
  •   C ← Cnew
  •   remove from F any itemset with a proper
      superset in F
  •   remove from C any group g s.t. h(g) ∪ t(g)
      has a superset in F
  • return F

25
The Algorithm (2)
  • Gen-Initial-Groups(T, C)
  •   scan T to obtain F1, the set of frequent
      1-itemsets
  •   impose an ordering on the items in F1
  •   for each item i in F1 other than the greatest
      item do
  •     let g be a new candidate with h(g) = {i}
        and t(g) = {j | j follows i in the ordering}
  •     C ← C ∪ {g}
  •   return the itemset F1 (and the updated C, of
      course)
  • Gen-Sub-Nodes(g, C) /* generates the candidate
    groups at the next level */
  •   remove any item i from t(g) if h(g) ∪ {i} is
      infrequent
  •   reorder the items in t(g)
  •   for each i in t(g) other than the greatest do
  •     let g' be a new candidate with h(g') = h(g)
        ∪ {i} and t(g') = {j | j ∈ t(g) and j is
        after i in t(g)}
  •     C ← C ∪ {g'}
  •   return h(g) ∪ {m}, where m is the greatest item
      in t(g), or h(g) if t(g) is empty

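A depth-first Python simplification of the pseudocode
above: it keeps Max-Miner's two prunings but drops the
level-wise batching, and assumes in-memory
transactions (sets of items) and an absolute support
count.

    def max_miner(transactions, min_count):
        def count(itemset):
            s = set(itemset)
            return sum(1 for t in transactions if s <= t)

        # order items by increasing support (see the next slide)
        items = sorted({i for t in transactions for i in t},
                       key=lambda i: count([i]))
        found = []

        def expand(head, tail):
            # superset-frequency pruning: if h(g) U t(g) is frequent,
            # report it and skip the entire subtree below this node
            if count(head + tail) >= min_count:
                found.append(frozenset(head + tail))
                return
            # subset-infrequency pruning: drop tail items i for which
            # h(g) U {i} is infrequent
            tail = [i for i in tail if count(head + [i]) >= min_count]
            if not tail:
                found.append(frozenset(head))
                return
            for k, i in enumerate(tail):
                expand(head + [i], tail[k + 1:])

        expand([], items)
        # keep only itemsets with no proper superset in the result
        return [s for s in found if not any(s < t for t in found)]
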
26
Item Ordering
  • By re-ordering the items we try to increase the
    effectiveness of superset-frequency pruning
  • Very frequent items have a higher probability of
    being contained in long patterns
  • Put these items at the end of the ordering, so
    that they appear in many tails (see the sketch
    below)
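
A tiny sketch of that ordering, assuming in-memory
transactions; the items with the highest support come
last and therefore land in the tails of many candidate
groups, where superset-frequency pruning can eliminate
whole subtrees at once.

    def order_items(transactions, items):
        # sort by increasing support: most frequent items go last
        return sorted(items,
                      key=lambda i: sum(1 for t in transactions if i in t))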