Association Rules and Sequential Patterns

Title: Data Mining and Knowledge Discovery - Web Data Mining
Author: Bamshad Mobasher
Last modified by: Bamshad Mobasher
Created Date: 3/29/1999 8:01:23 PM

Transcript and Presenter's Notes

1
Association Rules and Sequential Patterns
Bamshad Mobasher DePaul University
2
Market Basket Analysis
  • Goal of MBA is to find associations (affinities)
    among groups of items occurring in a
    transactional database
  • has roots in analysis of point-of-sale data, as
    in supermarkets
  • but, has found applications in many other areas
  • Association Rule Discovery
  • most common type of MBA technique
  • Find all rules that associate the presence of one
    set of items with that of another set of items.
  • Example: 98% of people who purchase tires and
    auto accessories also get automotive services
    done
  • We are interested in rules that are
  • non-trivial (and possibly unexpected)
  • actionable
  • easily explainable

3
What Is Association Mining?
  • Association rule mining searches for
    relationships between items in a data set
  • Finding association, correlation, or causal
    structures among sets of items or objects in
    transaction databases, relational databases, etc.
  • Rule form
  • Body => Head [support, confidence]
  • Body and Head can be represented as sets of items
    or as predicates
  • Examples
  • {diaper, milk, Thursday} => {beer} [0.5%, 78%]
  • buys(x, "bread") => buys(x, "milk") [0.6%, 65%]
  • major(x, "CS") /\ takes(x, "DB") => grade(x,
    "A") [1%, 75%]
  • age(X, 30-45) /\ income(X, 50K-75K) => buys(X,
    SUVcar)
  • age = 30-45, income = 50K-75K => car = SUV

4
Different Kinds of Association Rules
  • Boolean vs. Quantitative
  • associations on discrete and categorical data vs.
    continuous data
  • Single vs. Multiple Dimensions
  • one predicate = single dimension; multiple
    predicates = multiple dimensions
  • buys(x, milk) => buys(x, butter)
  • age(X, 30-45) /\ income(X, 50K-75K) => buys(X,
    SUVcar)
  • Single level vs. multiple-level analysis
  • Based on the level of abstractions involved
  • buys(x, bread) => buys(x, milk)
  • buys(x, wheat bread) => buys(x, 2% milk)
  • Simple vs. constraint-based
  • Constraints can be added on the rules to be
    discovered

5
Basic Concepts
  • We start with a set I of items and a set D of
    transactions
  • D is all of the transactions relevant to the
    mining task
  • A transaction T is a set of items (a subset of
    I)
  • An Association Rule is an implication on itemsets
    X and Y, denoted by X => Y, where X and Y are
    subsets of I with no items in common
  • The rule meets a minimum confidence of c, meaning
    that c% of transactions in D which contain X also
    contain Y
  • In addition, a minimum support of s is satisfied,
    meaning that s% of transactions in D contain both
    X and Y

6
Support and Confidence
  • Find all the rules X => Y with minimum confidence
    and support
  • Support: probability that a transaction contains
    both X and Y
  • i.e., ratio of transactions in which X, Y occur
    together to all transactions in database.
  • Confidence: conditional probability that a
    transaction having X also contains Y
  • i.e., ratio of transactions in which X, Y occur
    together to those in which X occurs.

In general, the confidence of a rule LHS => RHS can be
computed as the support of the whole itemset
divided by the support of LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
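The two definitions above can be sketched in a few lines of Python. This is only an illustration: the five transactions are hypothetical (the slide's own transaction table is not part of the transcript) and the function names are assumptions.

```python
from fractions import Fraction

# Hypothetical transactions, not taken from the slides.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
    {"A", "D", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Support(LHS u RHS) / Support(LHS); assumes LHS occurs at least once."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # share of transactions with both A and C
print(confidence({"A"}, {"C"}, transactions))  # how often C accompanies A
print(confidence({"C"}, {"A"}, transactions))  # how often A accompanies C
```

Using `Fraction` keeps the ratios exact, which makes it easy to compare them with hand-computed values such as 2/5.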
7
Support and Confidence - Example
Itemset {A, C} has a support of 2/5 = 40%
Rule {A} => {C} has confidence of 50%
Rule {C} => {A} has confidence of 100%
Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} => {F}?
Confidence for {A} => {D, F}?
8
Improvement (Lift)
  • High confidence rules are not necessarily useful
  • what if confidence of {A, B} => C is less than
    Pr(C)?
  • improvement gives the predictive power of a rule
    compared to just random chance
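The formula for improvement is not in the transcript. The sketch below assumes the usual definition of lift, confidence(LHS => RHS) divided by Pr(RHS), and uses hypothetical transactions to show a rule with high confidence but lift below 1.

```python
from fractions import Fraction

# Hypothetical data: C is bought in 9 of 10 transactions,
# but in only 4 of the 5 transactions that contain both A and B.
transactions = [{"A", "B", "C"}] * 4 + [{"A", "B"}] + [{"C"}] * 5

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def lift(lhs, rhs, transactions):
    """Assumed definition: confidence(lhs => rhs) / Pr(rhs)."""
    conf = support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)
    return conf / support(rhs, transactions)

rule_conf = support({"A", "B", "C"}, transactions) / support({"A", "B"}, transactions)
rule_lift = lift({"A", "B"}, {"C"}, transactions)
print(rule_conf, rule_lift)
```

A lift below 1 means that knowing the left-hand side makes the right-hand side less likely than it is in the database overall, even though the confidence looks high.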

9
Steps in Association Rule Discovery
  • Find the frequent itemsets
  • Frequent item sets are the sets of items that
    have minimum support
  • Support is downward closed, so, a subset of a
    frequent itemset must also be a frequent itemset
  • if {A, B} is a frequent itemset, both {A} and {B}
    are frequent itemsets
  • this also means that if an itemset doesn't
    satisfy minimum support, none of its supersets
    will either (this is essential for pruning the
    search space)
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association
    rules

10
Mining Association Rules - An Example
Min. support 50%, min. confidence 50%
Only need to keep these since {A} and {C} are
subsets of {A, C}
  • For rule A => C
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.6%

11
Apriori Algorithm
Ck: candidate itemset of size k
Lk: frequent itemset of size k
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: any (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset
12
Example of Generating Candidates
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}
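The join and prune steps of this example can be sketched as follows. The function name `apriori_gen` and the prefix-based join (two itemsets are joined when they agree on all but their last item) are assumptions about how the join is carried out; the slides do not spell it out.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for a, b in combinations(prev, 2):
        # Join: merge two (k-1)-itemsets that share their first k-2 items.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune: drop the candidate if any (k-1)-subset is not frequent.
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3)
print(C4)
```

With the L3 above, the join produces abcd and acde, and the prune step discards acde because ade (and cde) are missing from L3.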

13
Apriori Algorithm - An Example
Assume minimum support = 2
14
Apriori Algorithm - An Example
The final frequent item sets are those
remaining in L2 and L3. However, {2,3}, {2,5},
and {3,5} are all contained in the larger item
set {2,3,5}. Thus, the final group of item sets
reported by Apriori are {1,3} and {2,3,5}. These
are the only item sets from which we will
generate association rules.
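A compact sketch of the level-wise search is given below. The transaction table from the slide is not in the transcript, so the four transactions are an assumption, chosen to be consistent with the item sets named above. The join here is a simple union of pairs rather than a prefix join, which yields the same candidates after pruning.

```python
from itertools import combinations

# Assumed transaction database (not shown in the transcript).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

def maximal(frequent):
    """Frequent itemsets that are not contained in a larger frequent itemset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

frequent = apriori(transactions, min_count=2)
print(sorted(sorted(s) for s in maximal(frequent)))
```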
15
Generating Association Rules from Frequent Itemsets
  • Only strong association rules are generated
  • Frequent itemsets satisfy minimum support
    threshold
  • Strong rules are those that satisfy minimum
    confidence threshold
  • confidence(A => B) = Pr(B | A)

For each frequent itemset f, generate all
non-empty subsets of f
For every non-empty subset s of f do
    if support(f) / support(s) >= min_confidence then
        output rule s => (f - s)
end
16
Generating Association Rules (Example Continued)
  • Item sets {1,3} and {2,3,5}
  • Recall that confidence of a rule LHS => RHS is
    Support of itemset (i.e. LHS ∪ RHS) divided by
    support of LHS.

Candidate rules for {1,3}
  Rule          Conf.
  1 => 3        2/2 = 1.00
  3 => 1        2/3 = 0.67

Candidate rules for {2,3,5}
  Rule          Conf.          Rule      Conf.
  {2,3} => 5    2/2 = 1.00     2 => 5    3/3 = 1.00
  {2,5} => 3    2/3 = 0.67     2 => 3    2/3 = 0.67
  {3,5} => 2    2/2 = 1.00     3 => 2    2/3 = 0.67
  2 => {3,5}    2/3 = 0.67     3 => 5    2/3 = 0.67
  3 => {2,5}    2/3 = 0.67     5 => 2    3/3 = 1.00
  5 => {2,3}    2/3 = 0.67     5 => 3    2/3 = 0.67
Assuming a min. confidence of 75%, the final set
of rules reported by Apriori are 1 => 3,
{2,3} => 5, {3,5} => 2, 5 => 2 and 2 => 5
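The rule-generation loop can be sketched in Python as below. The support counts are read off the fractions in the table above (for example, 2/3 for 3 => 1 implies support({1,3}) = 2 and support({3}) = 3). The sketch iterates over proper non-empty subsets, so the right-hand side is never empty, and over every frequent itemset with at least two items, which is how the single-item rules in the table's right-hand columns arise. Note that {2,3} => 5 also reaches a confidence of 1.00 in the table.

```python
from itertools import combinations

# Support counts implied by the confidence fractions in the table.
support = {frozenset(k): v for k, v in {
    (1,): 2, (2,): 3, (3,): 3, (5,): 3,
    (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}.items()}

def strong_rules(support, min_conf):
    """Return (lhs, rhs, confidence) for every rule meeting min_conf."""
    rules = []
    for f in support:
        if len(f) < 2:
            continue
        for r in range(1, len(f)):
            for lhs in combinations(sorted(f), r):
                conf = support[f] / support[frozenset(lhs)]
                if conf >= min_conf:
                    rhs = tuple(sorted(f - frozenset(lhs)))
                    rules.append((lhs, rhs, conf))
    return rules

rules = strong_rules(support, min_conf=0.75)
for lhs, rhs, conf in sorted(rules):
    print(lhs, "=>", rhs, round(conf, 2))
```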
17
Multiple-Level Association Rules
  • Items often form a hierarchy
  • Items at the lower level are expected to have
    lower support
  • Rules regarding itemsets at appropriate levels
    could be quite useful
  • Transaction database can be encoded based on
    dimensions and levels

18
Mining Multi-Level Associations
  • A top-down, progressive deepening approach
  • First find high-level strong rules
  • milk => bread [20%, 60%]
  • Then find their lower-level "weaker" rules
  • 2% milk => wheat bread [6%, 50%]
  • When one threshold is set for all levels: if
    support is too high, it is possible to miss
    meaningful associations at low levels; if support
    is too low, uninteresting rules may be generated
  • different minimum support thresholds across
    multi-levels lead to different algorithms (e.g.,
    decrease min-support at lower levels)
  • Variations at mining multiple-level association
    rules
  • Level-crossed association rules
  • milk => Wonder wheat bread
  • Association rules with multiple, alternative
    hierarchies
  • 2% milk => Wonder bread
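One way to picture the effect of the hierarchy on support is to rewrite each transaction at a higher level and compare. The sketch below is only an illustration: the item hierarchy, the transactions, and the helper names are hypothetical.

```python
# Hypothetical item hierarchy: low-level item -> high-level category.
hierarchy = {
    "2% milk": "milk", "skim milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
}

# Hypothetical low-level transactions.
transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk"},
]

def generalize(transaction, hierarchy):
    """Replace each item by its parent category (items without a parent are kept)."""
    return {hierarchy.get(item, item) for item in transaction}

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

high_level = [generalize(t, hierarchy) for t in transactions]
high = support({"milk", "bread"}, high_level)
low = support({"2% milk", "wheat bread"}, transactions)
print(high, low)
```

The lower-level itemset has lower support than its generalization, which is why a single minimum support threshold for all levels tends to either miss low-level rules or admit too many high-level ones.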

19
Quantitative Association Rules
Handling quantitative rules may require mapping
of the continuous variables into Boolean variables
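A minimal sketch of such a mapping: each numeric value is assigned to a range, and the pair "attribute=range" becomes a Boolean item that is either present in a record or not. The bin boundaries and the helper name `to_items` are assumptions made for illustration; the ranges 30-45 and 50K-75K echo the earlier rule examples.

```python
# Assumed bins: (inclusive lower bound, exclusive upper bound, label).
bins = {
    "age": [(0, 30, "<30"), (30, 46, "30-45"), (46, 200, ">45")],
    "income": [(0, 50_000, "<50K"), (50_000, 75_001, "50K-75K"),
               (75_001, float("inf"), ">75K")],
}

def to_items(record, bins):
    """Turn a record into a set of Boolean 'attribute=value' items."""
    items = set()
    for attr, value in record.items():
        if attr in bins:
            # Numeric attribute: emit the label of the first matching range.
            for low, high, label in bins[attr]:
                if low <= value < high:
                    items.add(f"{attr}={label}")
                    break
        else:
            # Categorical attribute: use the value as-is.
            items.add(f"{attr}={value}")
    return items

record = {"age": 37, "income": 62_000, "car": "SUV"}
items = to_items(record, bins)
print(sorted(items))
```

Once every record is a set of such items, the Boolean association-rule machinery applies unchanged.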
20
MBA in Text / Web Content Mining
  • Document Associations
  • Find (content-based) associations among documents
    in a collection
  • Documents correspond to items and words
    correspond to transactions
  • Frequent itemsets are groups of docs in which
    many words occur in common
  • Term Associations
  • Find associations among words based on their
    occurrences in documents
  • similar to above, but invert the table (terms as
    items, and docs as transactions)
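The two views of the same document-term table can be sketched as follows. The three documents are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical document collection.
docs = {
    "d1": "data mining rules",
    "d2": "web data mining",
    "d3": "web usage",
}

def term_transactions(docs):
    """Term associations: terms are items, each document is a transaction."""
    return {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

def document_transactions(docs):
    """Document associations: documents are items, each word is a transaction."""
    inverted = defaultdict(set)
    for doc_id, words in term_transactions(docs).items():
        for word in words:
            inverted[word].add(doc_id)
    return dict(inverted)

by_doc = term_transactions(docs)
by_word = document_transactions(docs)
print(by_doc["d2"])
print(by_word["mining"])
```

Either dictionary's values can be passed to a frequent-itemset miner as the list of transactions.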

21
MBA in Web Usage Mining
  • Association Rules in Web Transactions
  • discover affinities among sets of Web page
    references across user sessions
  • Examples
  • 60% of clients who accessed /products/, also
    accessed /products/software/webminer.htm
  • 30% of clients who accessed /special-offer.html,
    placed an online order in /products/software/
  • Actual Example from IBM official Olympics Site
  • {Badminton, Diving} => {Table Tennis}
    [conf = 69.7%, sup = 0.35%]
  • Applications
  • Use rules to serve dynamic, customized contents
    to users
  • prefetch files that are most likely to be
    accessed
  • determine the best way to structure the Web site
    (site optimization)
  • targeted electronic advertising and increasing
    cross sales

22
Web Usage Mining Example
  • Association Rules From Cray Research Web Site
  • Design suggestions
  • from rules 1 and 2: there is something in
    J90.html that should be moved to the page
    /PUBLIC/product-info/T3E (why?)

23
Sequential / Navigational Patterns
  • Sequential patterns add an extra dimension to
    frequent itemsets and association rules - time.
  • Items can appear before, after, or at the same
    time as each other.
  • General form: "x% of the time, when A appears in
    a transaction, B appears within z transactions."
  • note that other items may appear between A and B,
    so sequential patterns do not necessarily imply
    consecutive appearances of items (in terms of
    time)
  • Examples
  • Renting Star Wars, then Empire Strikes Back,
    then Return of the Jedi in that order
  • Collection of ordered events within an interval
  • Most sequential pattern discovery algorithms are
    based on extensions of the Apriori algorithm for
    discovering itemsets
  • Navigational Patterns
  • they can be viewed as a special form of
    sequential patterns which capture navigational
    patterns among users of a site
  • in this case a session is a consecutive sequence
    of pageview references for a user over a
    specified period of time
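The point that other items may appear between A and B can be made concrete with a small containment check. In this sketch a customer sequence is a list of itemsets ordered by time, and a pattern is supported by a customer if its elements occur in the same order, not necessarily consecutively. The rental data and function names are hypothetical.

```python
# Hypothetical customer sequences: one list of itemsets per customer, in time order.
customers = [
    [{"Star Wars"}, {"Alien"}, {"Empire Strikes Back"}, {"Return of the Jedi"}],
    [{"Star Wars"}, {"Empire Strikes Back"}],
    [{"Empire Strikes Back"}, {"Star Wars"}],
    [{"Alien"}],
]

def contains(sequence, pattern):
    """True if the pattern's elements occur in order (gaps allowed) in the sequence."""
    pos = 0
    for element in pattern:
        # Skip forward to the next transaction that includes this element.
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later elements must come strictly after this one
    return True

def seq_support(pattern, sequences):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

pair = [{"Star Wars"}, {"Empire Strikes Back"}]
trilogy = pair + [{"Return of the Jedi"}]
print(seq_support(pair, customers), seq_support(trilogy, customers))
```

The third customer rented both films but in the opposite order, so that sequence does not count toward the pattern's support.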

24
Mining Sequences - Example
Customer-sequence
Sequential patterns with support > 0.25:
{(C), (H)} and {(C), (DG)}
25
Sequential Pattern Mining Cases and Parameters
  • Duration of a time sequence T
  • Sequential pattern mining can then be confined to
    the data within a specified duration
  • Ex. Subsequences corresponding to the year of
    1999
  • Ex. Partitioned sequences, such as every year, or
    every week after stock crashes, or every two
    weeks before and after a volcano eruption
  • Event folding window w
  • If w = T, time-insensitive frequent patterns are
    found
  • If w = 0 (no event sequence folding), sequential
    patterns are found where each event occurs at a
    distinct time instant
  • If 0 < w < T, sequences occurring within the same
    period w are folded in the analysis

26
Sequential Pattern Mining Cases and Parameters
  • Time interval, int, between events in the
    discovered pattern
  • int = 0: no interval gap is allowed, i.e., only
    strictly consecutive sequences are found
  • Ex. Find frequent patterns occurring in
    consecutive weeks
  • min_int <= int <= max_int: find patterns that are
    separated by at least min_int but at most max_int
  • Ex. If a person rents movie A, it is likely she
    will rent movie B within 30 days (int <= 30)
  • int = c != 0: find patterns carrying an exact
    interval
  • Ex. Every time when Dow Jones drops more than
    5%, what will happen exactly two days later?
    (int = 2)
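The interval constraints can be illustrated with a small check over timestamped events. This is a sketch under assumptions: events are (time, item) pairs, the interval is measured as the difference between the two timestamps, and the data and function name are hypothetical. The exact-interval case is the special case min_int = max_int = c.

```python
def follows_within(events, first, second, min_int, max_int):
    """True if some `second` event occurs between min_int and max_int
    time units after some `first` event."""
    first_times = [t for t, item in events if item == first]
    second_times = [t for t, item in events if item == second]
    return any(min_int <= t2 - t1 <= max_int
               for t1 in first_times for t2 in second_times)

# Hypothetical rental history: (day, movie).
rentals = [(1, "A"), (20, "B"), (90, "B")]

within_30_days = follows_within(rentals, "A", "B", 1, 30)
between_31_and_60 = follows_within(rentals, "A", "B", 31, 60)
exactly_19_days = follows_within(rentals, "A", "B", 19, 19)
print(within_30_days, between_31_and_60, exactly_19_days)
```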

27
Mining Navigational Patterns
  • Approach build an aggregated sequence tree
  • this is the approach taken by Web Utilization
    Miner (WUM) - Spiliopoulou, 1998
  • for each occurrence of a sequence start a new
    branch or increase the frequency counts of
    matching nodes
  • in example below, note that s6 contains "b"
    twice, hence the sequence is
    <(b,1),(d,1),(b,2),(e,1)>

28
Mining Navigational Patterns
The aggregated sequence tree can be used directly
to determine support and confidence for
navigational patterns
Note that each node represents a navigational
path ending in that node
Support = count at the node / count at root
Confidence = count at the node / count at the parent
Navigation pattern a -> b:
Support = 11/35 = 0.31, Confidence = 11/21 = 0.52
Nav. pattern a -> b -> e:
Support = 11/35 = 0.31, Confidence = 11/11 = 1.00
Nav. pattern a -> b -> e -> f:
Support = 3/35 = 0.086, Confidence = 3/11 = 0.27
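A minimal sketch of an aggregated sequence tree is a prefix tree with a counter on every node. The code below is not WUM's implementation; it omits the occurrence numbers such as (b,1) and (b,2), and the sessions are hypothetical. It only illustrates the two ratios defined above.

```python
class Node:
    """One node of the prefix tree; `count` is the number of sessions passing through."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_tree(sessions):
    """Insert each session as a path from the root, incrementing counts along the way."""
    root = Node()
    for session in sessions:
        root.count += 1
        node = root
        for page in session:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def support_confidence(root, path):
    """Support = node count / root count; confidence = node count / parent count.
    Assumes `path` is non-empty and exists in the tree."""
    parent, node = None, root
    for page in path:
        parent, node = node, node.children[page]
    return node.count / root.count, node.count / parent.count

# Hypothetical sessions.
sessions = [["a", "b", "e"], ["a", "b"], ["a", "c"], ["b", "d"]]
tree = build_tree(sessions)
print(support_confidence(tree, ["a", "b"]))
```

Because each node stands for the path from the root to that node, both ratios are simple lookups once the tree is built.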
29
Mining Navigational Patterns
  • WUM supports a powerful mining query language to
    extract patterns from the aggregated tree
  • Example query

SELECT t NODES AS X Y Z, TEMPLATE AS t
WHERE X.support > 20 AND Y.support > 6 AND Z.support > 4

  • For example, patterns matching the query with
    X = b are
30
Mining Navigational Patterns
  • Another Approach: Markov Chains
  • idea is to model the navigational sequences
    through the site as a state-transition diagram
    without cycles (a directed acyclic graph)
  • a Markov Chain consists of a set of states (pages
    or pageviews in the site)
  • S = {s1, s2, ..., sn}
  • and a set of transition probabilities
  • P = {p1,1, ..., p1,n, p2,1, ..., p2,n, ..., pn,1,
    ..., pn,n}
  • a path r from a state si to a state sj is a
    sequence of states where the transition
    probabilities for all consecutive states are
    greater than 0.
  • the probability of reaching a state sj from a
    state si via a path r is the product of all the
    probabilities along the path
  • the probability of reaching sj from si is the sum
    over all paths

31
Mining Navigational Patterns
An example Markov Chain
  • What is the probability that a user who visits
    the welcome page purchases a product?
  • Home -> Search -> PD -> purchase: 1/3 x 1/2 x 1/2 = 1/12
  • Home -> Cat -> PD -> purchase: 1/3 x 1/3 x 1/2 = 1/18
  • Home -> Cat -> purchase: 1/3 x 1/3 = 1/9
  • Home -> RS -> PD -> purchase: 1/3 x 2/3 x 1/2 = 1/9

Sum = 13/36
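The sum over paths can be checked with a short recursive sketch. The transition table below contains only the edges that appear in the four paths listed above; the name of the final state ("Purchase") is an assumption, since the diagram is not part of the transcript. The recursion relies on the graph being acyclic.

```python
from fractions import Fraction as F

# Edges and probabilities taken from the four listed paths.
transitions = {
    "Home": {"Search": F(1, 3), "Cat": F(1, 3), "RS": F(1, 3)},
    "Search": {"PD": F(1, 2)},
    "Cat": {"PD": F(1, 3), "Purchase": F(1, 3)},
    "RS": {"PD": F(2, 3)},
    "PD": {"Purchase": F(1, 2)},
}

def reach_probability(transitions, start, target):
    """Sum, over all paths from start to target, of the product of the
    transition probabilities along the path (acyclic graphs only)."""
    if start == target:
        return F(1)
    return sum(
        (p * reach_probability(transitions, nxt, target)
         for nxt, p in transitions.get(start, {}).items()),
        F(0),
    )

total = reach_probability(transitions, "Home", "Purchase")
print(total)
```

Edges that cannot lead to the target contribute nothing to the sum, so leaving them out of the table does not change the result.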
32
Markov Chain Example: Calculating conditional
probabilities for transitions
Web site hyperlink graph (figure: pages A, B, C, D, E, with the value 0.57 marked)
Sessions:
A, B; A, B; A, B, C; A, B, C; A, B, C, D; A, B, C, E;
A, C, E; A, C, E; A, B, D; A, B, D; A, B, D, E;
B, C; B, C; B, C, D; B, C, E; B, E, D
Transition B -> C:
Total occurrences of B = 14
Total occurrences of BC = 8
Pr(C|B) = 8/14 = 0.57
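The counts in this example can be reproduced directly from the session list. The function name is an assumption; the estimate is the number of times the source page is immediately followed by the destination page, divided by the total number of occurrences of the source page.

```python
# The 16 sessions listed on the slide.
sessions = [s.split() for s in [
    "A B", "A B", "A B C", "A B C", "A B C D", "A B C E",
    "A C E", "A C E", "A B D", "A B D", "A B D E",
    "B C", "B C", "B C D", "B C E", "B E D",
]]

def transition_counts(sessions, src, dst):
    """Return (times src is immediately followed by dst, total occurrences of src)."""
    occurrences = sum(s.count(src) for s in sessions)
    followed = sum(
        1 for s in sessions for a, b in zip(s, s[1:]) if (a, b) == (src, dst)
    )
    return followed, occurrences

followed, occurrences = transition_counts(sessions, "B", "C")
print(followed, occurrences, round(followed / occurrences, 2))
```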
33
Tools: Weka Package
  • Weka
  • set of Java packages developed at the University
    of Waikato in New Zealand
  • includes packages for data filtering, association
    rules, classification, clustering, and
    instance-based learning
  • Web site: www.cs.waikato.ac.nz/ml/weka
  • can be used both from command line, or using the
    Java based GUI
  • requires the data to be in a standard format
    called ARFF

34
Weka ARFF Format
  • ARFF files have two main sections
  • Attributes
  • categorical (nominal) attributes along with their
    values
  • integer attributes along with a range
  • real attributes
  • Data section
  • each record has values corresponding to the order
    in which attributes were specified in the
    attribute section

@RELATION zoo
@ATTRIBUTE animal {aardvark,antelope,bass,bear,boar, . . .}
@ATTRIBUTE hair {false, true}
@ATTRIBUTE feathers {false, true}
@ATTRIBUTE eggs {false, true}
@ATTRIBUTE milk {false, true}
@ATTRIBUTE airborne {false, true}
@ATTRIBUTE aquatic {false, true}
@ATTRIBUTE predator {false, true}
@ATTRIBUTE toothed {false, true}
@ATTRIBUTE backbone {false, true}
@ATTRIBUTE breathes {false, true}
@ATTRIBUTE venomous {false, true}
@ATTRIBUTE fins {false, true}
@ATTRIBUTE legs INTEGER [0,9]
@ATTRIBUTE tail {false, true}
@ATTRIBUTE domestic {false, true}
@ATTRIBUTE catsize {false, true}
@ATTRIBUTE type {mammal, bird, reptile, fish, insect, . . .}
. . .
35
Weka ARFF Format
  • Data portion of the ARFF file for Zoo animals
  • For association rule discovery, we first need to
    discretize using Weka Filters

. . .
@DATA
% Instances (101):
aardvark,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
antelope,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
bass,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
bear,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
boar,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
buffalo,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
calf,true,false,false,true,false,false,false,true,true,true,false,false,4,true,true,true,mammal
carp,false,false,true,false,false,true,false,true,true,false,false,true,0,true,true,false,fish
catfish,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
cavy,true,false,false,true,false,false,false,true,true,true,false,false,4,false,true,false,mammal
cheetah,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
. . .
36
Weka Explorer Interface
Can open the native ARFF format or the standard
CSV format
39
Weka Attribute Filters
40
Weka Attribute Filters
41
We can discretize children manually since it
only has a small number of discrete values
After Saving the new relation in ARFF format
44
Weka Discretization Filter
45
Weka Discretization Filter
46
Weka Discretization Filter
47
Weka Discretization Filter
After Saving the new relation in ARFF format
48
Weka Discretization Filter
After renaming attribute values for age and
income
49
Weka Association Rules
50
Weka Association Rules
51
Weka Association Rules
52
Weka Association Rules
Another try with Lift > 1.5
53
Weka Association Rules
Another try with Lift > 1.5