Association Rules and Sequential Patterns

Title: Data Mining and Knowledge Discovery - Web Data Mining
Author: Bamshad Mobasher
Last modified by: Bamshad Mobasher
Created Date: 3/29/1999 8:01:23 PM

Slides: 56

Learn more at: http://facweb.cs.depaul.edu

1
Association Rules and Sequential Patterns
Bamshad Mobasher DePaul University
2
Market Basket Analysis

The goal of market basket analysis (MBA) is to find associations (affinities) among groups of items occurring in a transactional database.
It has roots in the analysis of point-of-sale data, as in supermarkets, but has found applications in many other areas.
Association rule discovery:
The most common type of MBA technique.
Find all rules that associate the presence of one set of items with that of another set of items.
Example: 98% of people who purchase tires and auto accessories also get automotive services done.
We are interested in rules that are:
non-trivial (and possibly unexpected)
actionable
easily explainable

3
What Is Association Mining?

Association rule mining searches for relationships between items in a data set:
Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, etc.
Rule form:
Body => Head [support, confidence]
Body and Head can be represented as sets of items or as predicates.
Examples:
{diaper, milk, Thursday} => {beer} [0.5%, 78%]
buys(x, "bread") => buys(x, "milk") [0.6%, 65%]
major(x, "CS") /\ takes(x, "DB") => grade(x, "A") [1%, 75%]
age(X, 30-45) /\ income(X, 50K-75K) => buys(X, SUVcar)
age=30-45, income=50K-75K => car=SUV

4
Different Kinds of Association Rules

Boolean vs. quantitative:
associations on discrete and categorical data vs. continuous data
Single vs. multiple dimensions:
one predicate = single dimension; multiple predicates = multiple dimensions
buys(x, milk) => buys(x, butter)
age(X, 30-45) /\ income(X, 50K-75K) => buys(X, SUVcar)
Single-level vs. multiple-level analysis:
based on the levels of abstraction involved
buys(x, bread) => buys(x, milk)
buys(x, wheat bread) => buys(x, 2% milk)
Simple vs. constraint-based:
constraints can be added on the rules to be discovered

5
Basic Concepts

We start with a set I of items and a set D of transactions.
D is all of the transactions relevant to the mining task.
A transaction T is a set of items (a subset of I).
An association rule is an implication on itemsets X and Y, denoted by X => Y, where X and Y are subsets of I with no items in common.
The rule meets a minimum confidence of c, meaning that c% of the transactions in D which contain X also contain Y.
In addition, a minimum support of s is satisfied.

6
Support and Confidence

Find all the rules X => Y with minimum confidence and support.
Support: the probability that a transaction contains {X, Y}, i.e., the ratio of transactions in which X and Y occur together to all transactions in the database.
Confidence: the conditional probability that a transaction having X also contains Y, i.e., the ratio of transactions in which X and Y occur together to those in which X occurs.

In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
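The two measures can be checked by hand on a small database. The sketch below is only an illustration: the five transactions and the helper names support and confidence are hypothetical and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"A", "B", "E"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(lhs, rhs, db):
    """Support(LHS ∪ RHS) / Support(LHS); assumes LHS occurs at least once."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # 2/5
print(confidence({"A"}, {"C"}, transactions))   # 1/2
print(confidence({"C"}, {"A"}, transactions))   # 1
```

Exact fractions are used so the ratios read the same way as on the slides (2/5 rather than 0.4).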
7
Support and Confidence - Example
Itemset {A, C} has a support of 2/5 = 40%.
Rule {A} => {C} has confidence of 50%.
Rule {C} => {A} has confidence of 100%.
Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} => {F}?
Confidence for {A} => {D, F}?
8
Improvement (Lift)

High-confidence rules are not necessarily useful:
what if the confidence of {A, B} => {C} is less than Pr(C)?
Improvement (lift) gives the predictive power of a rule compared to just random chance.
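Improvement is commonly computed as the confidence of the rule divided by the probability of the head on its own, so a value above 1 means the body raises the chance of seeing the head. A minimal sketch under that definition, with hypothetical support values:

```python
def lift(support_body_and_head, support_body, support_head):
    """Improvement (lift) of Body => Head, from three support values in (0, 1]."""
    confidence = support_body_and_head / support_body
    return confidence / support_head

# Hypothetical rule: confidence is 0.30 / 0.40 = 0.75, but the head alone
# already occurs in 90% of transactions, so the rule does worse than chance.
value = lift(0.30, 0.40, 0.90)
print(round(value, 3))  # 0.833
```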

9
Steps in Association Rule Discovery

Find the frequent itemsets:
Frequent itemsets are the sets of items that have minimum support.
Support is downward closed, so every subset of a frequent itemset must also be a frequent itemset:
if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets.
This also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space).
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

10
Mining Association Rules - An Example
Min. support 50%, min. confidence 50%
Only need to keep these since {A} and {C} are subsets of {A, C}

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

11
Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
12
Example of Generating Candidates

L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
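The join and prune steps can be sketched in a few lines. The function name apriori_gen and the choice to store itemsets as sorted tuples are assumptions made for this illustration; the input is the L3 from the example.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    `frequent_prev` is a non-empty collection of sorted tuples of equal length.
    """
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    # Join step: merge two itemsets that agree on their first k-2 items.
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                candidates.append(a + (b[-1],))
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return [c for c in candidates
            if all(sub in prev_set for sub in combinations(c, k - 1))]

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])  # ['abcd']
```

The join produces abcd and acde; acde is then dropped because its subset ade is missing from L3.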

13
Apriori Algorithm - An Example
Assume a minimum support count of 2.
14
Apriori Algorithm - An Example
The final frequent itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori are {1,3} and {2,3,5}. These are the only itemsets from which we will generate association rules.
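A level-wise pass of this kind can be sketched as follows. The transaction table for this example did not survive in the transcript, so the four transactions below are an assumption, chosen to be consistent with the support counts quoted for these itemsets; the function name apriori is likewise illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = sorted(level)
        # Join on the shared (k-2)-prefix, then prune by the subset check.
        candidates = [a + (b[-1],)
                      for i, a in enumerate(prev) for b in prev[i + 1:]
                      if a[:-1] == b[:-1]]
        candidates = [c for c in candidates
                      if all(s in level for s in combinations(c, k - 1))]
        # Count pass: one scan of the database per level.
        level = {}
        for c in candidates:
            n = sum(1 for t in transactions if set(c) <= t)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent

# Assumed database (not in the transcript), minimum support count 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(db, min_count=2)
# Keep only itemsets that are not contained in a larger frequent itemset.
maximal = [s for s in frequent
           if not any(set(s) < set(o) for o in frequent)]
print(sorted(maximal))  # [(1, 3), (2, 3, 5)]
```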
15
Generating Association Rules from Frequent Itemsets

Only strong association rules are generated.
Frequent itemsets satisfy the minimum support threshold.
Strong rules are those that satisfy the minimum confidence threshold.
confidence(A => B) = Pr(B | A)

For each frequent itemset f, generate all non-empty proper subsets of f
For every non-empty proper subset s of f do
    if support(f) / support(s) ≥ min_confidence then
        output rule s => (f - s)
end
16
Generating Association Rules (Example Continued)

Itemsets: {1,3} and {2,3,5}
Recall that the confidence of a rule LHS => RHS is the support of the itemset (i.e., LHS ∪ RHS) divided by the support of LHS.

Candidate rules for {1,3}      Candidate rules for {2,3,5}
Rule        Conf.              Rule          Conf.          Rule        Conf.
{1}=>{3}    2/2 = 1.00         {2,3}=>{5}    2/2 = 1.00     {2}=>{5}    3/3 = 1.00
{3}=>{1}    2/3 = 0.67         {2,5}=>{3}    2/3 = 0.67     {2}=>{3}    2/3 = 0.67
                               {3,5}=>{2}    2/2 = 1.00     {3}=>{2}    2/3 = 0.67
                               {2}=>{3,5}    2/3 = 0.67     {3}=>{5}    2/3 = 0.67
                               {3}=>{2,5}    2/3 = 0.67     {5}=>{2}    3/3 = 1.00
                               {5}=>{2,3}    2/3 = 0.67     {5}=>{3}    2/3 = 0.67
Assuming a min. confidence of 75%, the final set of rules reported by Apriori are {1}=>{3}, {2,3}=>{5}, {3,5}=>{2}, {5}=>{2} and {2}=>{5}.
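The rule-generation loop can be sketched directly from the support counts implied by the confidence fractions in the table (for example, 2/3 for a rule with body 3 means the itemset has count 2 and item 3 has count 3). The function name generate_rules is an assumption. This version iterates over every frequent itemset with at least two items, which is how rules between 2 and 5 arise.

```python
from itertools import combinations

# Support counts read off the numerators and denominators in the table.
support = {
    (1,): 2, (2,): 3, (3,): 3, (5,): 3,
    (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}

def generate_rules(support, min_confidence):
    """Return a list of (lhs, rhs, confidence) for every strong rule."""
    rules = []
    for f, count_f in support.items():
        if len(f) < 2:
            continue
        # Every non-empty proper subset of f is a possible rule body.
        for size in range(1, len(f)):
            for s in combinations(f, size):
                conf = count_f / support[s]
                if conf >= min_confidence:
                    rhs = tuple(x for x in f if x not in s)
                    rules.append((s, rhs, conf))
    return rules

rules = generate_rules(support, 0.75)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, round(conf, 2))
```

With a minimum confidence of 0.75 this prints five rules, each with confidence 1.0.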
17
Multiple-Level Association Rules

Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
The transaction database can be encoded based on dimensions and levels.

18
Mining Multi-Level Associations

A top-down, progressive deepening approach:
First find high-level strong rules:
milk => bread [20%, 60%]
Then find their lower-level "weaker" rules:
2% milk => wheat bread [6%, 50%]
When one threshold is set for all levels: if the support is too high, it is possible to miss meaningful associations at low levels; if the support is too low, uninteresting rules may be generated.
Different minimum support thresholds across multiple levels lead to different algorithms (e.g., decrease min-support at lower levels).
Variations in mining multiple-level association rules:
Level-crossed association rules:
milk => Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk => Wonder bread
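The effect of level-specific thresholds can be seen on a tiny example. Everything in this sketch is hypothetical: the two-level hierarchy, the five transactions, and the threshold values are invented for illustration.

```python
# Hypothetical two-level hierarchy: specific item -> general category.
parent = {
    "2% milk": "milk",
    "skim milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}

# Hypothetical transactions recorded at the lowest level.
transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk"},
    {"wheat bread"},
]

def generalize(transaction):
    """Replace each item by its parent category."""
    return {parent[item] for item in transaction}

def support(itemset, db):
    """Fraction of transactions containing the whole itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

high_level = [generalize(t) for t in transactions]
high = support({"milk", "bread"}, high_level)
low = support({"2% milk", "wheat bread"}, transactions)
print(high, low)  # 0.6 0.2

# Hypothetical thresholds: lower minimum support at the lower level.
min_support = {"high": 0.5, "low": 0.15}
print(low >= min_support["high"])  # False: missed with one uniform threshold
print(low >= min_support["low"])   # True: kept with the reduced threshold
```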

19
Quantitative Association Rules
Handling quantitative rules may require mapping the continuous variables into Boolean variables.
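One simple way to do such a mapping is to cut each numeric attribute into ranges and treat "attribute=range" as a Boolean item. In the sketch below the interval boundaries and the function name to_items are assumptions; the labels 30-45 and 50K-75K echo the age and income rule shown on an earlier slide.

```python
def to_items(record, bins):
    """Turn numeric fields into Boolean items such as 'age=30-45'."""
    items = set()
    for field, value in record.items():
        # Intervals are half-open: low <= value < high.
        for low, high, label in bins[field]:
            if low <= value < high:
                items.add(f"{field}={label}")
                break
    return items

# Hypothetical interval boundaries.
bins = {
    "age": [(0, 30, "under30"), (30, 45, "30-45"), (45, 200, "45plus")],
    "income": [(0, 50_000, "under50K"), (50_000, 75_000, "50K-75K"),
               (75_000, float("inf"), "75Kplus")],
}

items = to_items({"age": 37, "income": 62_000}, bins)
print(sorted(items))  # ['age=30-45', 'income=50K-75K']
```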
20
MBA in Text / Web Content Mining

Document associations:
Find (content-based) associations among documents in a collection.
Documents correspond to items and words correspond to transactions.
Frequent itemsets are groups of docs in which many words occur in common.
Term associations:
Find associations among words based on their occurrences in documents.
Similar to the above, but invert the table (terms as items, and docs as transactions).
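The two views are the same table read in two directions. A minimal sketch, using a hypothetical three-document collection:

```python
from collections import defaultdict

# Hypothetical mini-collection: document id -> set of words.
docs = {
    "d1": {"web", "mining", "usage"},
    "d2": {"web", "mining", "content"},
    "d3": {"data", "mining"},
}

# Term associations: each document is a transaction, each word is an item.
term_transactions = list(docs.values())

# Document associations: invert the table, so each word is a transaction
# whose items are the documents that contain it.
doc_transactions = defaultdict(set)
for doc_id, words in docs.items():
    for word in words:
        doc_transactions[word].add(doc_id)

def support_count(itemset, transactions):
    """Number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"web", "mining"}, term_transactions))     # 2
print(support_count({"d1", "d2"}, doc_transactions.values()))  # 2
```

In the second call, d1 and d2 form an itemset supported by the two words they share, web and mining.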

21
MBA in Web Usage Mining

Association rules in Web transactions:
discover affinities among sets of Web page references across user sessions
Examples:
60% of clients who accessed /products/ also accessed /products/software/webminer.htm
30% of clients who accessed /special-offer.html placed an online order in /products/software/
Actual example from the official IBM Olympics site:
{Badminton, Diving} => {Table Tennis} [conf = 69.7%, sup = 0.35%]
Applications:
use rules to serve dynamic, customized content to users
prefetch files that are most likely to be accessed
determine the best way to structure the Web site (site optimization)
targeted electronic advertising and increasing cross-sales

22
Web Usage Mining Example

Association rules from the Cray Research Web site
Design suggestions:
from rules 1 and 2: there is something in J90.html that should be moved to the page /PUBLIC/product-info/T3E (why?)

23
Sequential / Navigational Patterns

Sequential patterns add an extra dimension to frequent itemsets and association rules: time.
Items can appear before, after, or at the same time as each other.
General form: "x% of the time, when A appears in a transaction, B appears within z transactions."
Note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time).
Examples:
Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
Collection of ordered events within an interval
Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets.
Navigational patterns:
They can be viewed as a special form of sequential patterns which capture navigational patterns among users of a site.
In this case a session is a consecutive sequence of pageview references for a user over a specified period of time.
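Because other items may sit between the elements of a pattern, checking whether a customer sequence supports a pattern is a subsequence test rather than a substring test. A minimal sketch, with hypothetical rental histories and illustrative function names:

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    # Each `in` test consumes the iterator up to the match, so order is kept.
    return all(item in it for item in pattern)

def sequence_support(pattern, sequences):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Hypothetical rental histories, one ordered list per customer.
rentals = [
    ["Star Wars", "Alien", "Empire Strikes Back", "Return of the Jedi"],
    ["Star Wars", "Empire Strikes Back"],
    ["Return of the Jedi", "Star Wars"],
    ["Alien"],
]

trilogy = ["Star Wars", "Empire Strikes Back", "Return of the Jedi"]
print(sequence_support(trilogy, rentals))        # 0.25
print(sequence_support(["Star Wars"], rentals))  # 0.75
```

The first customer supports the trilogy pattern even though another rental falls between the first two films.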

24
Mining Sequences - Example
Customer-sequence
Sequential patterns with support > 0.25: (C), (H)(C), (DG)
25
Sequential Pattern Mining Cases and Parameters

Duration of a time sequence, T:
Sequential pattern mining can then be confined to the data within a specified duration.
Ex. Subsequences corresponding to the year 1999
Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption
Event folding window, w:
If w = T, time-insensitive frequent patterns are found.
If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant.
If 0 < w < T, sequences occurring within the same period w are folded in the analysis.

26
Sequential Pattern Mining Cases and Parameters

Time interval, int, between events in the discovered pattern:
int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
Ex. Find frequent patterns occurring in consecutive weeks
min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
Ex. If a person rents movie A, it is likely she will rent movie B within 30 days (int ≤ 30)
int = c ≠ 0: find patterns carrying an exact interval
Ex. Every time the Dow Jones drops more than 5%, what will happen exactly two days later? (int = 2)
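An interval constraint can be applied as a filter on timestamped events. The sketch below is one possible reading: the function name pairs_within, the (day, item) event format, and the rental log are all hypothetical.

```python
def pairs_within(events, first, second, min_int, max_int):
    """Count occurrences of `first` that are followed by `second`
    after a gap g with min_int <= g <= max_int.

    `events` is a list of (day, item) pairs for one customer.
    """
    hits = 0
    for day_a, item_a in events:
        if item_a != first:
            continue
        if any(item_b == second and min_int <= day_b - day_a <= max_int
               for day_b, item_b in events):
            hits += 1
    return hits

# Hypothetical rental log for one customer: (day number, movie).
log = [(1, "A"), (12, "B"), (40, "A"), (90, "B")]

print(pairs_within(log, "A", "B", 1, 30))   # 1
print(pairs_within(log, "A", "B", 50, 50))  # 1
```

The first call uses a range constraint (B within 30 days of A), which only the first rental of A satisfies. The second call sets min_int equal to max_int to ask for an exact interval of 50 days, which only the second rental of A satisfies.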

27
Mining Navigational Patterns

Approach: build an aggregated sequence tree
This is the approach taken by Web Utilization Miner (WUM) - Spiliopoulou, 1998.
For each occurrence of a sequence, start a new branch or increase the frequency counts of matching nodes.
In the example below, note that s6 contains "b" twice, hence the sequence is <(b,1),(d,1),(b,2),(e,1)>.
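The occurrence numbering used for s6 can be sketched as a small helper; the function name label_occurrences is an assumption, not WUM's API.

```python
from collections import Counter

def label_occurrences(session):
    """Pair each page with the number of times it has appeared so far."""
    seen = Counter()
    labeled = []
    for page in session:
        seen[page] += 1
        labeled.append((page, seen[page]))
    return labeled

s6 = label_occurrences(["b", "d", "b", "e"])
print(s6)  # [('b', 1), ('d', 1), ('b', 2), ('e', 1)]
```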

28
Mining Navigational Patterns
The aggregated sequence tree can be used directly to determine support and confidence for navigational patterns.
Note that each node represents a navigational path ending in that node.
Support = count at the node / count at the root
Confidence = count at the node / count at the parent
Navigational pattern a -> b: Support = 11/35 = 0.31, Confidence = 11/21 = 0.52
Navigational pattern a -> b -> e: Support = 11/35 = 0.31, Confidence = 11/11 = 1.00
Navigational pattern a -> b -> e -> f: Support = 3/35 = 0.086, Confidence = 3/11 = 0.27
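A simplified version of such a tree is a prefix tree with a counter on every node. The sketch below is not WUM's implementation: the class and function names are assumptions, nodes are keyed by page name only (occurrence numbers are omitted), and the four sessions are hypothetical because the tree from the slide is not in the transcript.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def build_tree(sessions):
    """Merge sessions into a prefix tree with a visit count on every node."""
    root = Node()
    for session in sessions:
        root.count += 1
        node = root
        for page in session:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def support_and_confidence(root, path):
    """Support and confidence of the non-empty navigational path `path`."""
    parent, node = None, root
    for page in path:
        parent, node = node, node.children[page]
    return node.count / root.count, node.count / parent.count

# Hypothetical sessions.
sessions = [["a", "b", "e"], ["a", "b", "e"], ["a", "c"], ["b", "d"]]
root = build_tree(sessions)
print(support_and_confidence(root, ["a", "b"]))       # (0.5, 0.6666666666666666)
print(support_and_confidence(root, ["a", "b", "e"]))  # (0.5, 1.0)
```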
29
Mining Navigational Patterns

WUM supports a powerful mining query language to extract patterns from the aggregated tree.
Example query:

SELECT t NODES AS X Y Z, TEMPLATE AS t
WHERE X.support > 20 AND Y.support > 6 AND Z.support > 4

For example, patterns matching the query with X = b are [figure not transcribed]
30
Mining Navigational Patterns

Another approach: Markov chains
The idea is to model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph).
A Markov chain consists of a set of states (pages or pageviews in the site)
S = {s1, s2, ..., sn}
and a set of transition probabilities
P = {p1,1, ..., p1,n, p2,1, ..., p2,n, ..., pn,1, ..., pn,n}
A path r from a state si to a state sj is a sequence of states where the transition probabilities for all consecutive states are greater than 0.
The probability of reaching a state sj from a state si via a path r is the product of all the probabilities along the path.
The probability of reaching sj from si is the sum over all paths.

31
Mining Navigational Patterns
An example Markov Chain

What is the probability that a user who visits
the welcome page purchases a product?
Home -> Search -> PD -> (purchase): 1/3 * 1/2 * 1/2 = 1/12
Home -> Cat -> PD -> (purchase): 1/3 * 1/3 * 1/2 = 1/18
Home -> Cat -> (purchase): 1/3 * 1/3 = 1/9
Home -> RS -> PD -> (purchase): 1/3 * 2/3 * 1/2 = 1/9

Sum = 13/36
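The sum over paths can be reproduced with a short recursion over the acyclic transition graph. The transition probabilities below are read off the four listed paths; any other edges of the chain are not in the transcript and are left out. The name "Purchase" for the final state and the function name reach_probability are assumptions.

```python
from fractions import Fraction as F

# Transition probabilities taken from the four listed paths only.
P = {
    "Home": {"Search": F(1, 3), "Cat": F(1, 3), "RS": F(1, 3)},
    "Search": {"PD": F(1, 2)},
    "Cat": {"PD": F(1, 3), "Purchase": F(1, 3)},
    "RS": {"PD": F(2, 3)},
    "PD": {"Purchase": F(1, 2)},
}

def reach_probability(P, start, goal):
    """Sum, over all paths from start to goal, of the product of the
    transition probabilities along the path (graph must be acyclic)."""
    if start == goal:
        return F(1)
    return sum((p * reach_probability(P, nxt, goal)
                for nxt, p in P.get(start, {}).items()), F(0))

print(reach_probability(P, "Home", "Purchase"))  # 13/36
```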
32
Markov Chain Example: Calculating conditional probabilities for transitions
Web site hyperlink graph [figure: pages A, B, C, D, E; the transition B -> C is labeled 0.57]
Sessions:
(A, B); (A, B); (A, B, C); (A, B, C); (A, B, C, D); (A, B, C, E); (A, C, E); (A, C, E);
(A, B, D); (A, B, D); (A, B, D, E); (B, C); (B, C); (B, C, D); (B, C, E); (B, E, D)
Transition B -> C:
Total occurrences of B = 14
Total occurrences of BC = 8
Pr(C | B) = 8/14 = 0.57
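The counts can be recomputed from the session list. In this sketch each session is written as a string of single-letter page names, which is a notational shortcut; the variable and function names are assumptions.

```python
from collections import Counter

# The 16 sessions listed on the slide, one letter per page.
sessions = [
    "AB", "AB", "ABC", "ABC", "ABCD", "ABCE", "ACE", "ACE",
    "ABD", "ABD", "ABDE", "BC", "BC", "BCD", "BCE", "BED",
]

page_counts = Counter()
pair_counts = Counter()
for s in sessions:
    page_counts.update(s)               # occurrences of each page
    pair_counts.update(zip(s, s[1:]))   # occurrences of each consecutive pair

def transition_probability(src, dst):
    """Pr(dst | src): share of src occurrences immediately followed by dst."""
    return pair_counts[(src, dst)] / page_counts[src]

print(page_counts["B"], pair_counts[("B", "C")])   # 14 8
print(round(transition_probability("B", "C"), 2))  # 0.57
```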
33
Tools Weka Package

Weka:
a set of Java packages developed at the University of Waikato in New Zealand
includes packages for data filtering, association rules, classification, clustering, and instance-based learning
Web site: www.cs.waikato.ac.nz/ml/weka
can be used both from the command line and through the Java-based GUI
requires the data to be in a standard format called ARFF

34
Weka ARFF Format

ARFF files have two main sections:
Attributes section:
categorical (nominal) attributes along with their values
integer attributes along with a range
real attributes
Data section:
each record has values corresponding to the order in which attributes were specified in the attribute section

@RELATION zoo
@ATTRIBUTE animal {aardvark,antelope,bass,bear,boar, . . .}
@ATTRIBUTE hair {false, true}
@ATTRIBUTE feathers {false, true}
@ATTRIBUTE eggs {false, true}
@ATTRIBUTE milk {false, true}
@ATTRIBUTE airborne {false, true}
@ATTRIBUTE aquatic {false, true}
@ATTRIBUTE predator {false, true}
@ATTRIBUTE toothed {false, true}
@ATTRIBUTE backbone {false, true}
@ATTRIBUTE breathes {false, true}
@ATTRIBUTE venomous {false, true}
@ATTRIBUTE fins {false, true}
@ATTRIBUTE legs INTEGER [0,9]
@ATTRIBUTE tail {false, true}
@ATTRIBUTE domestic {false, true}
@ATTRIBUTE catsize {false, true}
@ATTRIBUTE type {mammal, bird, reptile, fish, insect, . . .}
. . .
35
Weka ARFF Format

Data portion of the ARFF file for Zoo animals
For association rule discovery, we first need to discretize using Weka Filters

. . .
@DATA
% Instances (101):
aardvark,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
antelope,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
bass,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
bear,true,false,false,true,false,false,true,true,true,true,false,false,4,false,false,true,mammal
boar,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
buffalo,true,false,false,true,false,false,false,true,true,true,false,false,4,true,false,true,mammal
calf,true,false,false,true,false,false,false,true,true,true,false,false,4,true,true,true,mammal
carp,false,false,true,false,false,true,false,true,true,false,false,true,0,true,true,false,fish
catfish,false,false,true,false,false,true,true,true,true,false,false,true,0,true,false,false,fish
cavy,true,false,false,true,false,false,false,true,true,true,false,false,4,false,true,false,mammal
cheetah,true,false,false,true,false,false,true,true,true,true,false,false,4,true,false,true,mammal
. . .
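Once every attribute takes a small set of discrete values, each record can be read as a transaction of attribute=value items, which is the form the earlier support and confidence definitions work on. The sketch below is a hand-rolled illustration and says nothing about how Weka represents records internally; the function name row_to_items is an assumption, and the attribute order follows the zoo header.

```python
ATTRIBUTES = ["animal", "hair", "feathers", "eggs", "milk", "airborne",
              "aquatic", "predator", "toothed", "backbone", "breathes",
              "venomous", "fins", "legs", "tail", "domestic", "catsize", "type"]

def row_to_items(line, skip=("animal",)):
    """Turn one comma-separated data row into a set of 'attribute=value' items.

    The animal name is skipped by default because it is unique per record.
    """
    values = [v.strip() for v in line.split(",")]
    return {f"{name}={value}"
            for name, value in zip(ATTRIBUTES, values)
            if name not in skip}

row = ("aardvark,true,false,false,true,false,false,true,true,true,true,"
       "false,false,4,false,false,true,mammal")
items = row_to_items(row)
print("milk=true" in items, "legs=4" in items, "type=mammal" in items)
# True True True
```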
36
Weka Explorer Interface
The Explorer can open files in the native ARFF format or in the standard CSV format.
39
Weka Attribute Filters
41
We can discretize the attribute "children" manually since it only has a small number of discrete values.
After saving the new relation in ARFF format
44
Weka Discretization Filter
47
Weka Discretization Filter
After saving the new relation in ARFF format
48
Weka Discretization Filter
After renaming attribute values for age and income
49
Weka Association Rules
52
Weka Association Rules
Another try with Lift > 1.5