Title: Association Rules and Sequential Patterns
1 Association Rules and Sequential Patterns
Bamshad Mobasher DePaul University
2 Market Basket Analysis
- The goal of market basket analysis (MBA) is to find associations (affinities) among groups of items occurring in a transactional database
  - has roots in the analysis of point-of-sale data, as in supermarkets
  - but has found applications in many other areas
- Association Rule Discovery
  - the most common type of MBA technique
  - Find all rules that associate the presence of one set of items with that of another set of items.
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
- We are interested in rules that are
  - non-trivial (and possibly unexpected)
  - actionable
  - easily explainable
3 What Is Association Mining?
- Association rule mining searches for relationships between items in a data set
  - Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, etc.
- Rule form
  - Body => Head [support, confidence]
  - Body and Head can be represented as sets of items or as predicates
- Examples
  - {diaper, milk, Thursday} => {beer} [0.5%, 78%]
  - buys(x, "bread") => buys(x, "milk") [0.6%, 65%]
  - major(x, "CS") /\ takes(x, "DB") => grade(x, "A") [1%, 75%]
  - age(X, 30-45) /\ income(X, 50K-75K) => buys(X, SUVcar)
  - age = 30-45, income = 50K-75K => car = SUV
4 Different Kinds of Association Rules
- Boolean vs. quantitative
  - associations on discrete and categorical data vs. continuous data
- Single vs. multiple dimensions
  - one predicate = single dimension; multiple predicates = multiple dimensions
  - buys(x, milk) => buys(x, butter)
  - age(X, 30-45) /\ income(X, 50K-75K) => buys(X, SUVcar)
- Single-level vs. multiple-level analysis
  - based on the levels of abstraction involved
  - buys(x, bread) => buys(x, milk)
  - buys(x, wheat bread) => buys(x, 2% milk)
- Simple vs. constraint-based
  - constraints can be added on the rules to be discovered
5 Basic Concepts
- We start with a set I of items and a set D of transactions
  - D is all of the transactions relevant to the mining task
- A transaction T is a set of items (a subset of I)
- An association rule is an implication on itemsets X and Y, denoted by X => Y, where X and Y are disjoint subsets of I
  - The rule meets a minimum confidence of c, meaning that c% of the transactions in D which contain X also contain Y
  - In addition, a minimum support of s is satisfied
6 Support and Confidence
- Find all the rules X => Y with minimum confidence and support
  - Support: probability that a transaction contains {X, Y}
    - i.e., ratio of transactions in which X and Y occur together to all transactions in the database
  - Confidence: conditional probability that a transaction having X also contains Y
    - i.e., ratio of transactions in which X and Y occur together to those in which X occurs
In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
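The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming transactions are stored as Python sets; the five-transaction database and the function names are hypothetical and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(lhs, rhs, db):
    """Support(LHS ∪ RHS) / Support(LHS); assumes LHS occurs at least once."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"bread", "milk"}, transactions))        # 3/5
print(confidence({"bread"}, {"milk"}, transactions))   # 3/4
print(confidence({"diapers"}, {"beer"}, transactions)) # 2/3
```

Exact fractions are used so that the ratios can be compared directly with hand calculations.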
7 Support and Confidence - Example
- Itemset {A, C} has a support of 2/5 = 40%
- Rule {A} => {C} has confidence of 50%
- Rule {C} => {A} has confidence of 100%
- Support for {A, C, E}?
- Support for {A, D, F}?
- Confidence for {A, D} => {F}?
- Confidence for {A} => {D, F}?
8 Improvement (Lift)
- High-confidence rules are not necessarily useful
  - what if the confidence of {A, B} => {C} is less than Pr(C)?
  - improvement gives the predictive power of a rule compared to just random chance
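As a rough illustration, improvement (lift) is commonly computed as the confidence of the rule divided by the probability of the right-hand side, so a value below 1 means the rule predicts worse than chance. The sketch below assumes that definition; the ten-transaction database is hypothetical and was chosen so that a 75% confidence rule still has lift below 1.

```python
from fractions import Fraction

def lift(lhs, rhs, db):
    """confidence(LHS => RHS) / Pr(RHS), with db given as a list of sets."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in db if lhs <= t)
    n_rhs = sum(1 for t in db if rhs <= t)
    n_both = sum(1 for t in db if (lhs | rhs) <= t)
    confidence = Fraction(n_both, n_lhs)
    return confidence / Fraction(n_rhs, len(db))

# Hypothetical data: C is in 9 of 10 transactions, so Pr(C) = 0.9,
# while {A, B} => {C} holds in only 3 of the 4 transactions containing A and B.
db = [{"A", "B", "C"}] * 3 + [{"A", "B"}] + [{"C"}] * 6

print(lift({"A", "B"}, {"C"}, db))  # 5/6, i.e. below 1
```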
9 Steps in Association Rule Discovery
- Find the frequent itemsets
  - Frequent itemsets are the sets of items that have minimum support
  - Support is downward closed, so a subset of a frequent itemset must also be a frequent itemset
    - if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets
    - this also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space)
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
10 Mining Association Rules - An Example
Min. support 50%, min. confidence 50%
Only need to keep these since {A} and {C} are subsets of {A, C}
- For rule A => C
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
11 Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
Join step: Ck is generated by joining Lk-1 with itself
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
12 Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning
  - acde is removed because ade is not in L3
- C4 = {abcd}
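The join and prune steps in this example can be sketched as follows. This is one possible implementation, assuming itemsets are kept as sorted tuples and that two (k-1)-itemsets are joined when they share their first k-2 items; the function name is illustrative.

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """Join L(k-1) with itself, then prune candidates with an infrequent subset.

    prev_frequent: non-empty iterable of itemsets, all of the same size k-1.
    """
    prev = {tuple(sorted(s)) for s in prev_frequent}
    k = len(next(iter(prev))) + 1
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: same first k-2 items, last items in increasing order.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3)
print(sorted("".join(c) for c in C4))  # ['abcd']
```

The join produces abcd and acde; acde is dropped in the prune step because ade (and cde) are not in L3.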
13 Apriori Algorithm - An Example
Assume minimum support = 2
14 Apriori Algorithm - An Example
The final frequent itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori are {1,3} and {2,3,5}. These are the only itemsets from which we will generate association rules.
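A compact sketch of the level-wise search is given below. The transaction table of the original slide is not part of this transcript, so the four-transaction database here is an assumption, chosen only because it reproduces the supports quoted in the example. The join is done by taking unions of frequent (k-1)-itemsets, which after pruning yields the same candidates as the prefix-based join.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:                      # count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates with any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Assumed database (not from the transcript), minimum support count 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(db, 2)

# Keep only itemsets that are not contained in a larger frequent itemset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print(sorted(sorted(s) for s in maximal))  # [[1, 3], [2, 3, 5]]
```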
15 Generating Association Rules from Frequent Itemsets
- Only strong association rules are generated
  - Frequent itemsets satisfy the minimum support threshold
  - Strong rules are those that satisfy the minimum confidence threshold
  - confidence(A => B) = Pr(B | A)
For each frequent itemset f, generate all non-empty proper subsets of f
For every non-empty proper subset s of f do
    if support(f) / support(s) ≥ min_confidence then
        output rule s => (f - s)
end
16 Generating Association Rules (Example Continued)
- Itemsets {1,3} and {2,3,5}
- Recall that the confidence of a rule LHS => RHS is the support of the itemset (i.e., LHS ∪ RHS) divided by the support of LHS.
Candidate rules for {1,3}      Candidate rules for {2,3,5}
Rule        Conf.              Rule          Conf.           Rule        Conf.
{1}=>{3}    2/2 = 1.00         {2,3}=>{5}    2/2 = 1.00      {2}=>{5}    3/3 = 1.00
{3}=>{1}    2/3 = 0.67         {2,5}=>{3}    2/3 = 0.67      {2}=>{3}    2/3 = 0.67
                               {3,5}=>{2}    2/2 = 1.00      {3}=>{2}    2/3 = 0.67
                               {2}=>{3,5}    2/3 = 0.67      {3}=>{5}    2/3 = 0.67
                               {3}=>{2,5}    2/3 = 0.67      {5}=>{2}    3/3 = 1.00
                               {5}=>{2,3}    2/3 = 0.67      {5}=>{3}    2/3 = 0.67
Assuming a min. confidence of 75%, the final set of rules reported by Apriori is {1}=>{3}, {2,3}=>{5}, {3,5}=>{2}, {5}=>{2} and {2}=>{5}
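The rule-generation loop can be sketched as below. The support counts are read off the confidence fractions in the table (for example, 2/3 for {3}=>{1} implies a count of 2 for {1,3} and 3 for {3}); the function name and data layout are illustrative assumptions.

```python
from itertools import combinations

# Support counts implied by the fractions in the candidate-rule table.
support_count = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}

def generate_rules(support_count, min_conf):
    """Return (lhs, rhs, confidence) for every rule meeting min_conf."""
    rules = []
    for f, f_count in support_count.items():
        for r in range(1, len(f)):              # non-empty proper subsets only
            for s in combinations(sorted(f), r):
                s = frozenset(s)
                conf = f_count / support_count[s]
                if conf >= min_conf:
                    rules.append((tuple(sorted(s)), tuple(sorted(f - s)), conf))
    return rules

strong = generate_rules(support_count, 0.75)
for lhs, rhs, conf in sorted(strong):
    print(lhs, "=>", rhs, round(conf, 2))
```

With a threshold of 0 the function returns all 14 candidate rules of the table; with 0.75 it keeps exactly the rules whose confidence is 1.00.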
17 Multiple-Level Association Rules
- Items often form a hierarchy
- Items at the lower level are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- Transaction database can be encoded based on dimensions and levels
18 Mining Multi-Level Associations
- A top-down, progressive deepening approach
  - First find high-level strong rules
    - milk => bread [20%, 60%]
  - Then find their lower-level "weaker" rules
    - 2% milk => wheat bread [6%, 50%]
- When one threshold is set for all levels: if support is too high, it is possible to miss meaningful associations at low levels; if support is too low, uninteresting rules may be generated
  - different minimum support thresholds across multiple levels lead to different algorithms (e.g., decrease min-support at lower levels)
- Variations of mining multiple-level association rules
  - Level-crossed association rules
    - milk => Wonder wheat bread
  - Association rules with multiple, alternative hierarchies
    - 2% milk => Wonder bread
19 Quantitative Association Rules
Handling quantitative rules may require mapping the continuous variables into Boolean ones
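One simple way to do this mapping is to cut each continuous attribute into intervals and treat "attribute falls in interval" as a Boolean item. The sketch below assumes integer-valued attributes and hand-picked interval boundaries; the bins, the attribute names, and the function are hypothetical, loosely echoing the age and income ranges used in the earlier rule examples.

```python
def to_boolean_items(record, bins):
    """Map each numeric attribute to an interval item such as 'age=30-45'."""
    items = set()
    for attr, value in record.items():
        for low, high in bins[attr]:
            if low <= value <= high:          # inclusive interval bounds
                items.add(f"{attr}={low}-{high}")
                break
    return items

# Hypothetical interval boundaries (income in thousands).
bins = {
    "age": [(18, 29), (30, 45), (46, 65)],
    "income": [(0, 49), (50, 75), (76, 200)],
}

print(sorted(to_boolean_items({"age": 37, "income": 62}, bins)))
# ['age=30-45', 'income=50-75']
```

Each record becomes a set of items, so the usual frequent-itemset machinery can be applied unchanged.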
20 MBA in Text / Web Content Mining
- Document Associations
  - Find (content-based) associations among documents in a collection
  - Documents correspond to items and words correspond to transactions
  - Frequent itemsets are groups of docs in which many words occur in common
- Term Associations
  - Find associations among words based on their occurrences in documents
  - similar to above, but invert the table (terms as items, and docs as transactions)
21 MBA in Web Usage Mining
- Association Rules in Web Transactions
  - discover affinities among sets of Web page references across user sessions
- Examples
  - 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/
  - Actual example from the official IBM Olympics site
    - {Badminton, Diving} => {Table Tennis} [conf = 69.7%, sup = 0.35%]
- Applications
  - Use rules to serve dynamic, customized content to users
  - prefetch files that are most likely to be accessed
  - determine the best way to structure the Web site (site optimization)
  - targeted electronic advertising and increasing cross-sales
22 Web Usage Mining Example
- Association Rules From Cray Research Web Site
- Design suggestions
  - from rules 1 and 2: there is something in J90.html that should be moved to the page /PUBLIC/product-info/T3E (why?)
23 Sequential / Navigational Patterns
- Sequential patterns add an extra dimension to frequent itemsets and association rules: time.
  - Items can appear before, after, or at the same time as each other.
  - General form: "x% of the time, when A appears in a transaction, B appears within z transactions."
    - note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time)
- Examples
  - Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
  - Collection of ordered events within an interval
- Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets
- Navigational Patterns
  - they can be viewed as a special form of sequential patterns which capture navigational patterns among users of a site
  - in this case a session is a consecutive sequence of pageview references for a user over a specified period of time
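To make the "ordered, but not necessarily consecutive" point concrete, the sketch below checks whether a pattern occurs in a customer sequence and computes its support. It assumes each sequence is a time-ordered list of transactions (sets of items); the rental histories are hypothetical.

```python
def contains(sequence, pattern):
    """True if the pattern's elements occur in order in sequence, gaps allowed."""
    it = iter(sequence)
    # For each pattern element, scan forward until a transaction contains it;
    # the shared iterator guarantees later elements match later transactions.
    return all(any(p <= element for element in it) for p in pattern)

def sequence_support(db, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in db) / len(db)

# Hypothetical rental histories, one list of transactions per customer.
db = [
    [{"Star Wars"}, {"Empire Strikes Back"}, {"Return of the Jedi"}],
    [{"Star Wars"}, {"Alien"}, {"Empire Strikes Back"}, {"Return of the Jedi"}],
    [{"Empire Strikes Back"}, {"Star Wars"}],
    [{"Alien"}],
]
pattern = [{"Star Wars"}, {"Empire Strikes Back"}, {"Return of the Jedi"}]

print(sequence_support(db, pattern))  # 0.5
```

The second customer still supports the pattern even though another rental falls between the first two films, while the third does not because the order is reversed.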
24 Mining Sequences - Example
Customer-sequence
Sequential patterns with support > 0.25: {(C), (H)} and {(C), (DG)}
25 Sequential Pattern Mining: Cases and Parameters
- Duration of a time sequence T
  - Sequential pattern mining can then be confined to the data within a specified duration
  - Ex. Subsequences corresponding to the year of 1999
  - Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption
- Event folding window w
  - If w = T, time-insensitive frequent patterns are found
  - If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant
  - If 0 < w < T, sequences occurring within the same period w are folded in the analysis
26 Sequential Pattern Mining: Cases and Parameters
- Time interval, int, between events in the discovered pattern
  - int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
    - Ex. "Find frequent patterns occurring in consecutive weeks"
  - min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
    - Ex. "If a person rents movie A, it is likely she will rent movie B within 30 days" (int ≤ 30)
  - int = c ≠ 0: find patterns carrying an exact interval
    - Ex. "Every time when Dow Jones drops more than 5%, what will happen exactly two days later?" (int = 2)
27 Mining Navigational Patterns
- Approach: build an aggregated sequence tree
  - this is the approach taken by Web Utilization Miner (WUM) - Spiliopoulou, 1998
  - for each occurrence of a sequence, start a new branch or increase the frequency counts of matching nodes
  - in the example below, note that s6 contains "b" twice, hence the sequence is <(b,1),(d,1),(b,2),(e,1)>
28 Mining Navigational Patterns
The aggregated sequence tree can be used directly to determine support and confidence for navigational patterns
Note that each node represents a navigational path ending in that node
Support = count at the node / count at root
Confidence = count at the node / count at the parent
Navigation pattern a -> b: Support = 11/35 = 0.31, Confidence = 11/21 = 0.52
Nav. pattern a -> b -> e: Support = 11/35 = 0.31, Confidence = 11/11 = 1.00
Nav. pattern a -> b -> e -> f: Support = 3/35 = 0.086, Confidence = 3/11 = 0.27
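A minimal sketch of such a tree is shown below, using a plain prefix tree whose nodes carry visit counts. It is not WUM's actual data structure (for instance, it does not number repeated pages within a session), and the four sessions are hypothetical, since the tree from the slide is not part of this transcript.

```python
from fractions import Fraction

class Node:
    def __init__(self):
        self.count = 0        # number of sessions passing through this node
        self.children = {}    # page -> child Node

def build_tree(sessions):
    """Merge sessions into one prefix tree, counting shared prefixes."""
    root = Node()
    for session in sessions:
        root.count += 1
        node = root
        for page in session:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def support_confidence(root, path):
    """Support and confidence of a non-empty path that exists in the tree."""
    parent, node = None, root
    for page in path:
        parent, node = node, node.children[page]
    return Fraction(node.count, root.count), Fraction(node.count, parent.count)

# Hypothetical sessions.
sessions = [["a", "b", "e"], ["a", "b", "e", "f"], ["a", "c"], ["b", "d"]]
tree = build_tree(sessions)

print(support_confidence(tree, ["a", "b"]))            # support 1/2, confidence 2/3
print(support_confidence(tree, ["a", "b", "e", "f"]))  # support 1/4, confidence 1/2
```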
29 Mining Navigational Patterns
- WUM supports a powerful mining query language to extract patterns from the aggregated tree
- Example query:
SELECT t NODES AS X Y Z, TEMPLATE AS t
WHERE X.support > 20 AND Y.support > 6 AND Z.support > 4
- For example, patterns matching the query with X = b are
30 Mining Navigational Patterns
- Another approach: Markov chains
  - the idea is to model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph)
  - a Markov chain consists of a set of states (pages or pageviews in the site)
    - S = {s1, s2, ..., sn}
  - and a set of transition probabilities
    - P = {p1,1, ..., p1,n, p2,1, ..., p2,n, ..., pn,1, ..., pn,n}
  - a path r from a state si to a state sj is a sequence of states where the transition probabilities for all consecutive states are greater than 0
  - the probability of reaching a state sj from a state si via a path r is the product of all the probabilities along the path
  - the probability of reaching sj from si is the sum over all paths
31 Mining Navigational Patterns
An example Markov chain
- What is the probability that a user who visits the welcome page purchases a product?
  - Home -> Search -> PD -> purchase: 1/3 × 1/2 × 1/2 = 1/12
  - Home -> Cat -> PD -> purchase: 1/3 × 1/3 × 1/2 = 1/18
  - Home -> Cat -> purchase: 1/3 × 1/3 = 1/9
  - Home -> RS -> PD -> purchase: 1/3 × 2/3 × 1/2 = 1/9
  - Sum = 13/36
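The sum-over-paths calculation can be reproduced with a short recursive sketch. The transition probabilities below are the ones appearing in the four path products; transitions that do not lie on a path to a purchase are omitted, and "Purchase" is an assumed label for the final state, whose name is not legible in this transcript.

```python
from fractions import Fraction as F

# Transition probabilities taken from the path products in the example.
# "Purchase" is an assumed name for the final state.
P = {
    "Home":   {"Search": F(1, 3), "Cat": F(1, 3), "RS": F(1, 3)},
    "Search": {"PD": F(1, 2)},
    "Cat":    {"PD": F(1, 3), "Purchase": F(1, 3)},
    "RS":     {"PD": F(2, 3)},
    "PD":     {"Purchase": F(1, 2)},
}

def reach_probability(P, src, dst):
    """Sum, over all paths src -> dst, of the product of transition
    probabilities along the path. Assumes the graph has no cycles."""
    if src == dst:
        return F(1)
    return sum(
        (p * reach_probability(P, nxt, dst) for nxt, p in P.get(src, {}).items()),
        F(0),
    )

print(reach_probability(P, "Home", "Purchase"))  # 13/36
```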
32 Markov Chain Example: Calculating conditional probabilities for transitions
Web site hyperlink graph (figure: pages A, B, C, D, E; the transition B -> C is labeled 0.57)
Sessions:
A, B
A, B
A, B, C
A, B, C
A, B, C, D
A, B, C, E
A, C, E
A, C, E
A, B, D
A, B, D
A, B, D, E
B, C
B, C
B, C, D
B, C, E
B, E, D
Transition B -> C: total occurrences of B = 14, total occurrences of BC = 8, Pr(C | B) = 8/14 = 0.57
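The counting behind this estimate can be sketched directly from the session list above. Each session is written as a string of single-letter page names for brevity; the function name is illustrative.

```python
from collections import Counter

# The 16 sessions listed above, one character per page.
sessions = [
    "AB", "AB", "ABC", "ABC", "ABCD", "ABCE", "ACE", "ACE",
    "ABD", "ABD", "ABDE", "BC", "BC", "BCD", "BCE", "BED",
]

def transition_estimate(sessions, src, dst):
    """Return (count of src followed by dst, count of src, their ratio)."""
    occurrences = Counter()
    transitions = Counter()
    for s in sessions:
        occurrences.update(s)               # every page visit
        transitions.update(zip(s, s[1:]))   # every consecutive pair of pages
    pair, total = transitions[(src, dst)], occurrences[src]
    return pair, total, pair / total

pair, total, prob = transition_estimate(sessions, "B", "C")
print(pair, total, round(prob, 2))  # 8 14 0.57
```

Note that the denominator counts every occurrence of B, including sessions that end at B.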
33 Tools: Weka Package
- Weka
  - set of Java packages developed at the University of Waikato in New Zealand
  - includes packages for data filtering, association rules, classification, clustering, and instance-based learning
  - Web site: www.cs.waikato.ac.nz/ml/weka
  - can be used either from the command line or through the Java-based GUI
  - requires the data to be in a standard format called ARFF
34 Weka ARFF Format
- ARFF files have two main sections
  - Attributes
    - categorical (nominal) attributes along with their values
    - integer attributes along with a range
    - real attributes
  - Data section
    - each record has values corresponding to the order in which attributes were specified in the attribute section

@RELATION zoo
@ATTRIBUTE animal {aardvark,antelope,bass,bear,boar, . . .}
@ATTRIBUTE hair {false, true}
@ATTRIBUTE feathers {false, true}
@ATTRIBUTE eggs {false, true}
@ATTRIBUTE milk {false, true}
@ATTRIBUTE airborne {false, true}
@ATTRIBUTE aquatic {false, true}
@ATTRIBUTE predator {false, true}
@ATTRIBUTE toothed {false, true}
@ATTRIBUTE backbone {false, true}
@ATTRIBUTE breathes {false, true}
@ATTRIBUTE venomous {false, true}
@ATTRIBUTE fins {false, true}
@ATTRIBUTE legs INTEGER [0,9]
@ATTRIBUTE tail {false, true}
@ATTRIBUTE domestic {false, true}
@ATTRIBUTE catsize {false, true}
@ATTRIBUTE type {mammal, bird, reptile, fish, insect, . . .}
. . .
35 Weka ARFF Format
- Data portion of the ARFF file for Zoo animals
- For association rule discovery, we first need to discretize using Weka Filters

. . .
@DATA
% Instances (101)
aardvark,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
antelope,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
bass,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
bear,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
boar,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
buffalo,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
calf,true,false,false,true,false,false,false,true,true,true,false,false,4,true,true,true,mammal
carp,false,false,true,false,false,true,false,true,true,false,false,true,0,true,true,false,fish
catfish,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
cavy,true,false,false,true,false,false,false,true,true,true,false,false,4,false,true,false,mammal
cheetah,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
. . .
36 Weka Explorer Interface
Can open the native ARFF format or the standard CSV format
39 Weka Attribute Filters
40 Weka Attribute Filters
41 We can discretize "children" manually since it only has a small number of discrete values
After saving the new relation in ARFF format
44 Weka Discretization Filter
45 Weka Discretization Filter
46 Weka Discretization Filter
47 Weka Discretization Filter
After saving the new relation in ARFF format
48 Weka Discretization Filter
After renaming attribute values for age and income
49 Weka Association Rules
50 Weka Association Rules
51 Weka Association Rules
52 Weka Association Rules
Another try with Lift > 1.5
53 Weka Association Rules
Another try with Lift > 1.5