Association Rule Mining - PowerPoint PPT Presentation

1
Association Rule Mining
  • Zhenjiang Lin
  • Group Presentation
  • April 10, 2007

2
Overview
  • Associations
  • Market Basket Analysis
  • Basic Concepts
  • Frequent Itemsets
  • Generating Frequent Itemsets
  • Apriori
  • FP-Growth
  • Applications

3
Association Rule Learners
  • to discover elements that co-occur frequently
    within a data set consisting of multiple
    independent selections of elements (such as
    purchasing transactions), and
  • to discover rules, such as implication or
    correlation, which relate co-occurring elements.
  • to answer questions such as "If a customer
    purchases product A, how likely is he to purchase
    product B?" and "What products will a customer
    buy if he buys products C and D?"
  • to reduce a potentially huge amount of
    information to a small, understandable set of
    statistically supported statements.
  • also known as market basket analysis.

4
Associations
  • Rules expressing relationships between items
  • Example
  • {cereal, milk} → fruit
  • People who bought cereal and milk also bought
    fruit.
  • Stores might want to offer specials on milk and
    cereal to get people to buy more fruit.

5
Market Basket Analysis
  • Analyze tables of transactions
  • Can we hypothesize?
  • Chips → Salsa?  Lettuce → Spinach?

6
Market Baskets
  • In general, the data consists of a table of
    transactions, each row holding a transaction ID
    and a subset of items
7
Basic Concepts
  • Set of items: I = {i1, i2, …, im}
  • Transaction: T, a set of items with T ⊆ I
  • Association rule: A → B, where A ⊂ I, B ⊂ I,
    and A ∩ B = ∅
  • D: the set of transactions (i.e., our data)

8
Measuring Interesting Rules
  • Support s(A → B)
  • Ratio of the number of transactions containing A
    and B to the total number of transactions
  • Confidence c(A → B)
  • Ratio of the number of transactions containing A
    and B to the number of transactions containing A
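
In code, both measures fall out directly from counting; a minimal Python sketch (the helper names and sample baskets are illustrative, not from the slides):

```python
# Support and confidence for a rule A -> B over a list of transaction sets.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """s(A and B together) / s(A): how often transactions with A also contain B."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

baskets = [
    {"Chips", "Salsa", "Coke"},
    {"Chips", "Salsa", "Beer"},
    {"Lettuce", "Spinach", "Milk"},
    {"Chips", "Cookies"},
]
print(support({"Chips", "Salsa"}, baskets))       # 0.5
print(confidence({"Chips"}, {"Salsa"}, baskets))  # 2 of the 3 Chips baskets
```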

9
Measuring Interesting Rules
  • Rules are included/excluded based on two metrics
  • minimum support level - how frequently all of the
    items in a rule appear in transactions
  • minimum confidence level - how frequently the
    left hand side of a rule implies the right hand
    side

10
Market Basket Analysis
  • What is I?
  • What is T for person B?
  • What is s(Chips → Salsa)?
  • What is c(Chips → Salsa)?

11
Frequent Itemsets
  • itemset: any set of items
  • k-itemset: an itemset containing k items
  • frequent itemset: an itemset that satisfies a
    minimum support level
  • If I contains m items, how many itemsets are
    there?

12
Strong Association Rules
  • Given an itemset, it's easy to generate
    association rules
  • Given itemset {Chips, Salsa}
  • ∅ → {Chips, Salsa}
  • {Chips} → {Salsa}
  • {Salsa} → {Chips}
  • {Chips, Salsa} → ∅
  • Strong rules are interesting
  • Generally defined as those rules satisfying
    minimum support and minimum confidence
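
The enumeration on this slide generalizes: a k-itemset splits into 2^k antecedent/consequent pairs (including the two trivial ones). A small sketch (the function name is illustrative):

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """All candidate rules A -> B with A and B partitioning `itemset`,
    including the two trivial rules with an empty side, as on the slide."""
    items = frozenset(itemset)
    rules = []
    for r in range(len(items) + 1):
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            rules.append((lhs, items - lhs))
    return rules

for lhs, rhs in rules_from_itemset({"Chips", "Salsa"}):
    print(set(lhs), "->", set(rhs))   # 4 candidate rules in total
```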

13
Association Rule Mining
  • Two basic steps
  • Find all frequent itemsets
  • Satisfying minimum support
  • Find all strong association rules
  • Generate association rules from frequent itemsets
  • Keep rules satisfying minimum confidence

14
Generating Frequent Itemsets
  • Naïve algorithm
  • n ← |D|
  • for each subset s of I do
  • l ← 0
  • for each transaction T in D do
  • if s is a subset of T then
  • l ← l + 1
  • if minimum support ≤ l/n then
  • add s to frequent subsets
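
Translated into runnable Python, the same brute-force loop might look like this (a sketch; the names and sample data are illustrative):

```python
from itertools import chain, combinations

def naive_frequent_itemsets(items, transactions, min_support):
    """Test every non-empty subset of I against every transaction: O(2^m * n)."""
    n = len(transactions)
    frequent = []
    # Enumerate all non-empty subsets of I.
    subsets = chain.from_iterable(
        combinations(sorted(items), k) for k in range(1, len(items) + 1))
    for s in subsets:
        s = frozenset(s)
        count = sum(1 for t in transactions if s <= t)   # l in the pseudocode
        if count / n >= min_support:
            frequent.append(s)
    return frequent

baskets = [{"Chips", "Salsa"}, {"Chips", "Salsa", "Beer"}, {"Beer"}]
print(naive_frequent_itemsets({"Chips", "Salsa", "Beer"}, baskets, 0.5))
```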

15
Generating Frequent Itemsets
  • Analysis of naïve algorithm
  • 2^m subsets of I
  • Scan n transactions for each subset
  • O(2^m · n) tests of s being a subset of T
  • Growth is exponential in the number of items!
  • Can we do better?

16
Generating Frequent Itemsets
  • Frequent itemsets satisfy the apriori property
  • If A is not a frequent itemset, then any superset
    of A is not a frequent itemset.
  • Proof: Let n be the number of transactions.
    Suppose A is a subset of l transactions. If A′ ⊇
    A, then A′ is a subset of l′ ≤ l transactions.
    Thus, if l/n < minimum support, so is l′/n.

17
Generating Frequent Itemsets
  • Central idea Build candidate k-itemsets from
    frequent (k-1)-itemsets
  • Approach
  • Find all frequent 1-itemsets
  • Extend (k-1)-itemsets to candidate k-itemsets
  • Prune candidate itemsets that do not meet the
    minimum support.

18
Generating Frequent Itemsets (Basic Apriori)
  • L1 = frequent 1-itemsets
  • for (k = 2; L(k-1) is not empty; k++)
  • Ck = generate k-itemset candidates from L(k-1)
  • for each transaction t in D
  • // The candidates that are subsets of t
  • Ct = subset(Ck, t)
  • for each candidate c in Ct
  • c.count++
  • Lk = { c in Ck : c.count ≥ min_sup }
  • The frequent itemsets are the union of the Lk
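
A runnable sketch of basic Apriori along these lines. The sample database is chosen so the item supports match the counts on the later FP-Growth slides (I2:7, I1:6, I3:6, I4:2, I5:2); implementation details are illustrative, not the presenter's code:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining; min_sup is an absolute count."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

baskets = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
           {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
           {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(sorted(tuple(sorted(s)) for s in apriori(baskets, 2)))
```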

19
FP Growth (Han, Pei, Yin 2000)
  • One problematic aspect of Apriori is its
    candidate generation
  • Source of exponential growth
  • Another approach is to use a divide and conquer
    strategy
  • Idea Compress the database into a frequent
    pattern tree representing frequent items

20
FP Growth (Tree construction)
  • Initially, scan database for frequent 1-itemsets
  • Place resulting set in a list L in descending
    order by frequency (support)
  • Construct an FP-tree
  • Create a root node labeled null
  • Scan database
  • Process the items in each transaction in L order
  • From the root, add nodes in the order in which
    items appear in the transactions
  • Link nodes representing items along different
    branches
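
The construction steps above can be sketched as follows (a simplified version: the header table is kept as plain per-item node lists, and the names are illustrative). The sample database matches the supports on the next slide:

```python
from collections import Counter

class Node:
    """One FP-tree node: an item label, a count, and child links."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: frequent 1-itemsets, ordered by descending support (list L).
    freq = Counter(i for t in transactions for i in t)
    order = [i for i, c in freq.most_common() if c >= min_count]
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)          # root labeled null
    node_links = {}                  # item -> its nodes (stand-in header table)
    # Pass 2: insert each transaction's frequent items in L order.
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                node_links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, node_links, order

baskets = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
           ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
           ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
root, links, order = build_fp_tree(baskets, 2)
print(root.children["I2"].count)   # 7: every I2 transaction shares this node
```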

21
Frequent 1-itemsets
  • Minimum support of 20% (frequency of 2)
  • Frequent 1-itemsets
  • {I1, I2, I3, I4, I5}
  • Construct list
  • L = ⟨(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)⟩

22
Build FP-Tree
  • Scan database
  • Transaction 1: {I1, I2, I5}
  • Order: I2, I1, I5
  • Process transaction
  • Add nodes in item order
  • Label nodes with (item, count)

23
Build FP-Tree
[Figure: FP-tree after the first transactions — root labeled null, with nodes such as (I2,2) and (I4,1)]
24
Mining the FP-tree
  • Start at the last item in the table
  • Find all paths containing item
  • Follow the node-links
  • Identify conditional patterns
  • Patterns in paths with required frequency
  • Build conditional FP-tree C
  • Append item to all paths in C, generating
    frequent patterns
  • Mine C recursively (appending item)
  • Remove item from table and tree

25
Mining the FP-Tree
  • Prefix paths (conditional pattern base) for I5
  • (I2 I1 : 1)
  • (I2 I1 I3 : 1)
  • Conditional FP-tree
  • (I2 I1 : 2)
  • Generated frequent pattern
  • (I2 I1 I5 : 2)

[Figure: the full FP-tree for the example database (root null, with nodes including (I2,7), (I1,4), (I1,2), (I3,2), (I4,1), (I5,1)) alongside the conditional FP-tree for I5]
26
Applications
  • Web Personalization
  • Genomic Data

27
Web Personalization
  • Effective Personalization Based on Association
    Rule Discovery from Web Usage Data, Mobasher, et
    al., ACM Workshop on Web Information and Data
    Management, 2001.
  • Personalization and recommendation systems
  • e.g., Amazon.com's recommended books

28
Data Preprocessing
  • Identify set of pageviews P
  • Which files result in a single browser display
    (complicated by frames, images, etc.)
  • P = {p1, …, pn}
  • Transactions T
  • From session IDs or cookies
  • T = {t1, …, tm}

29
Data Preprocessing
  • A transaction t consists of
  • t = ⟨(p1^t, w(p1^t)), …, (pl^t, w(pl^t))⟩
  • The w is a weight associated with the pageview
  • Could be binary (purchase or non-purchase)
  • Could be related to amount of time spent on the
    page

30
Data Preprocessing
  • In the paper, only pageviews with weight
    w(p) = 1 were considered in a transaction
  • Ordering of pageviews didn't matter

31
Recommendation Engine
  • Has to run online
  • i.e. must be fast
  • generate frequent itemsets first and store in a
    graph data structure for efficient searching
  • Maintains a history of the user's current session
  • Sets a window size w (e.g. 3)
  • Consider pageviews A, B, C
  • A,B,C
  • If user then visits D
  • B,C,D
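
The sliding session window above can be kept with a bounded deque; a minimal sketch for w = 3:

```python
from collections import deque

window = deque(maxlen=3)   # w = 3: only the last 3 pageviews matter
for pageview in ["A", "B", "C", "D"]:
    window.append(pageview)    # oldest pageview is evicted automatically
    print(list(window))
# ['A'] -> ['A', 'B'] -> ['A', 'B', 'C'] -> ['B', 'C', 'D']
```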

32
Genomic Data
  • Finding Association Rules on Heterogenous Genome
    Data, Satou et al.
  • Combined data from PDB, SWISS-PROT, and PROSITE

33
Genomic Data
  • After mining, 182,388 association rules were
    generated (minimum support 5%, minimum
    confidence 65%)
  • Post-process the results with a maximum support
    of 30%
  • Itemsets appearing too frequently aren't
    interesting
  • Reduced to 381 rules

34
Genomic Data
  • Rules generated were corroborated by biological
    background data
  • Found common substructures in serine
    endopeptidases
  • Rules were not distributed well over protein
    families
  • Still some work to be done on the data
    preprocessing stage

35
Association Rule Summary
  • Association rule mining is a fundamental tool in
    data mining
  • Several algorithms
  • Apriori: uses a provable mathematical property to
    improve performance
  • FP-Growth: avoids candidate generation by using
    an effective data structure
  • Correlation Rules: evaluate interestingness based
    on statistics
  • Query Flocks: generalize the approach with the
    purpose of query optimization (incorporation into
    database systems)

36
Association Rule Summary
  • There exist several extensions
  • Hierarchical attributes (e.g., year → month →
    week → day, or computer → luggable → handheld →
    palm)
  • Multilevel/multidimensional
  • Numerical attributes
  • Constraint-based