Title: Mining Sequential Patterns
1Mining Sequential Patterns
- Presenters
- Qian Bai, Jiguo Jiang
2Mining Sequential Patterns
- Introduction
- The Algorithm
- AprioriAll, AprioriSome, DynamicSome
- Performance
- Conclusions
3Introduction
- Background
- Problem Statement
- An Example
- Related Work
4Background
- Customer purchase patterns
- Buy computer, then buy software
- Rent Star Wars, then Empire Strikes Back, and then Return of the Jedi
- Buy fitted sheet, flat sheet, and pillow cases, followed by comforter, and then followed by drapes and ruffles
- Web access patterns
- Open www.yorku.ca, then open www.cs.yorku.ca/mail
5Background (Continue)
- The sequential pattern mining problem was first introduced by Agrawal and Srikant
- Definition: Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified min-support threshold, sequential pattern mining is to find all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min-support
6Problem Statement
- After reading the three papers about Mining Sequential Patterns, we focus on a database D of customer transactions
- Each transaction consists of the following fields
- Customer-id
- Transaction-time
- Items purchased in the transaction
- Note
- No customer has more than one transaction with the same transaction time.
- We do not consider quantities of items bought in a transaction
7Problem Statement (Continue)
- Terminology
- Itemset: a non-empty set of items, e.g. (30, 40, 50), (60)
- Sequence: an ordered list of itemsets, e.g. < (30, 40, 50) (60) >
- Sequence Length: the number of itemsets in a sequence.
- Contained: A sequence (a1, a2, ..., aN) is contained in another sequence (b1, b2, ..., bM) if there exist integers i1 < i2 < ... < iN such that a1 ⊆ bi1, a2 ⊆ bi2, ..., aN ⊆ biN
- < (30) (40 50) > is contained in < (70) (30 80) (40 50 60) >
- < (30) (50) > is NOT contained in < (30 50) >
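The containment test can be sketched in a few lines of Python. This is a minimal illustration, not code from the papers: the function name `is_contained`, the helper `seq`, and the choice of frozensets for itemsets are all assumptions made here. It matches each itemset of the first sequence greedily against the earliest usable element of the second, which is enough to decide whether strictly increasing indices exist.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]


def seq(*itemsets) -> List[Itemset]:
    """Hypothetical helper: build a sequence from plain Python sets."""
    return [frozenset(s) for s in itemsets]


def is_contained(a: List[Itemset], b: List[Itemset]) -> bool:
    """Return True if sequence a is contained in sequence b."""
    pos = 0  # first position of b that is still available
    for itemset in a:
        # Skip elements of b that are not supersets of the current itemset.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # indices i1 < i2 < ... must be strictly increasing
    return True


# The two examples from the slide.
print(is_contained(seq({30}, {40, 50}), seq({70}, {30, 80}, {40, 50, 60})))
print(is_contained(seq({30}, {50}), seq({30, 50})))
```

The second call shows why order matters: items 30 and 50 bought in one transaction do not support the pattern "30, and later 50".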
8Problem Statement (Continue)
- Terminology (Continue)
- Maximal Sequence: A sequence is maximal in a set of sequences if it is not contained in any other sequence of that set
- Support: A customer supports a sequence s if s is contained in the customer-sequence for this customer. The support of s is the fraction of total customers who support it
- Litemset (Large itemset): An itemset satisfying the minimum support
- Large sequence: A sequence satisfying the minimum support constraint
9Problem Statement (Continue)
- Given a database D of customer transactions, the
problem of mining sequential patterns is to find
the maximal sequences among all sequences that
have a certain user-specified minimum support.
Each such maximal sequence represents a
sequential pattern
10An Example
- A database sorted by Customer ID and Transaction Time
Customer ID | Transaction Time | Items Bought
1 | June 25 93 | 30
1 | June 30 93 | 90
2 | June 10 93 | 10, 20
2 | June 15 93 | 30
2 | June 20 93 | 40, 60, 70
3 | June 25 93 | 30, 50, 70
4 | June 25 93 | 30
4 | June 30 93 | 40, 70
4 | July 25 93 | 90
5 | June 12 93 | 90
11An Example (Continue)
- Customer-Sequence Version of the Database
- Note
- Patterns are not necessarily contiguous.
- Some sequences, such as < (30) > and < (30) (40) >, have minimum support but are not in the answer because they are not maximal
Customer ID | Customer Sequence
1 | < (30) (90) >
2 | < (10 20) (30) (40 60 70) >
3 | < (30 50 70) >
4 | < (30) (40 70) (90) >
5 | < (90) >
Sequential Patterns with support > 25%
< (30) (90) > (Supported by 1 and 4)
< (30) (40 70) > (Supported by 2 and 4)
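The supporters listed in the example can be checked with a short Python sketch. The five customer sequences are copied from the example; the function names `is_contained`, `supporters` and `support` are hypothetical names chosen for this illustration.

```python
# Customer sequences from the example, keyed by customer id.
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def is_contained(a, b):
    """True if every itemset of a fits, in order, into a distinct element of b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def supporters(pattern):
    """Ids of the customers whose sequence contains the pattern."""
    return [cid for cid, s in customer_sequences.items() if is_contained(pattern, s)]


def support(pattern):
    """Fraction of all customers that support the pattern."""
    return len(supporters(pattern)) / len(customer_sequences)


for pattern in ([{30}, {90}], [{30}, {40, 70}], [{30}], [{30}, {40}]):
    print(pattern, supporters(pattern), support(pattern))
```

Both < (30) > and < (30) (40) > clear the 25% threshold here, but each is contained in a longer large sequence, which is why only the two maximal patterns are reported.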
12Related Work
- Differences between Association Rule Mining in Customer Transaction Database and Sequential Pattern Mining
- Association Rules Mining
- Finding what items are bought together
- Finding intra-transaction patterns
- Patterns are unordered sets of items
- Sequential Patterns Mining
- Finding what items are bought in different transactions
- Finding inter-transaction patterns
- Patterns are ordered lists of sets of items
13Algorithm
- Sort phase
- Sort database with customer-id as the major key and transaction-time as the minor key
- Litemset phase
- Scan database to find the set of all 1-sequence litemsets L1 based on the given minimum support
- Map large itemsets to a set of contiguous integers by treating litemsets as single entities.
- Example: (30) (40) (70) (40 70) (90) can be mapped to 1 2 3 4 5
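The sort and litemset phases can be sketched as follows, using the example database with a 25% minimum support. Several things here are assumptions made for illustration: the function names `sort_phase` and `litemset_phase`, the ISO date strings (used only so that times sort correctly), the brute-force subset enumeration (a stand-in for a proper Apriori pass over itemsets), and the sort key, which is chosen only so that the ids come out in the same order as the mapping example above.

```python
import math
from collections import defaultdict
from itertools import combinations

# (customer_id, transaction_time, items) rows of the example database,
# deliberately shuffled to show the effect of the sort phase.
transactions = [
    (4, "1993-07-25", {90}),
    (2, "1993-06-20", {40, 60, 70}),
    (1, "1993-06-30", {90}),
    (3, "1993-06-25", {30, 50, 70}),
    (2, "1993-06-10", {10, 20}),
    (5, "1993-06-12", {90}),
    (4, "1993-06-25", {30}),
    (1, "1993-06-25", {30}),
    (4, "1993-06-30", {40, 70}),
    (2, "1993-06-15", {30}),
]


def sort_phase(rows):
    """Sort by customer id, then time, and group into customer sequences."""
    sequences = defaultdict(list)
    for cid, _time, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences[cid].append(frozenset(items))
    return dict(sequences)


def litemset_phase(sequences, min_support):
    """Map every large itemset to an integer id, counting each customer once."""
    min_count = math.ceil(min_support * len(sequences))
    supporters = defaultdict(set)
    for cid, customer_sequence in sequences.items():
        for itemset in customer_sequence:
            # Brute force: every non-empty subset of every transaction.
            for size in range(1, len(itemset) + 1):
                for subset in combinations(sorted(itemset), size):
                    supporters[subset].add(cid)
    large = [s for s, cids in supporters.items() if len(cids) >= min_count]
    large.sort(key=lambda s: (s[-1], len(s)))  # arbitrary but deterministic
    return {s: i for i, s in enumerate(large, start=1)}


sequences = sort_phase(transactions)
litemset_ids = litemset_phase(sequences, min_support=0.25)
print(litemset_ids)
```

Support is counted per customer rather than per transaction, so an itemset bought twice by the same customer still counts once.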
14Algorithm(Continue)
- Transformation phase
- Replace each transaction by the set of 1-sequence litemsets that it contains
- Delete customer sequences that contain no 1-sequence litemset
- Keep the same total number of customers, so deleted customers still count when computing support
- Example: given that (30), (90), (40), (70), (40 70) are the 1-sequence litemsets
ID | Before Transformed | After Transformed
1 | < (30) (90) > | < {(30)} {(90)} >
2 | < (10 20) (40 60 70) > | < {(40), (70), (40 70)} >
3 | < (50) > | (empty)
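A minimal sketch of the transformation phase, assuming the integer ids from the mapping example on the previous slide ((30) = 1, (40) = 2, (70) = 3, (40 70) = 4, (90) = 5). The name `transform` and the dictionary layout are hypothetical.

```python
# Assumed litemset-to-id mapping, taken from the mapping example.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions without any litemset are dropped; an empty result means the
    whole customer sequence is dropped (the customer still counts in the total).
    """
    transformed = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


print(transform([{30}, {90}]))
print(transform([{10, 20}, {40, 60, 70}]))
print(transform([{50}]))
```

After this step, checking whether a customer supports a candidate sequence only needs membership tests on small sets of integers.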
15Algorithm(Continue)
- Sequence phase
- Find the frequent sequences
- Three algorithms: AprioriAll, AprioriSome, DynamicSome
- Maximal phase
- Delete sequences that are subsequences of other large sequences
- Combined with the sequence phase in the AprioriSome and DynamicSome algorithms
- Example: given sequences <1>, <2>, <3>, <4>, <1 2>, <1 3>, <1 2 3>, the maximal sequences will be <4> and <1 2 3>
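The maximal phase can be sketched as a simple filter. As a simplifying assumption, sequences are tuples of litemset ids and each id is treated as atomic, so containment reduces to an ordinary subsequence test; a full implementation would also account for one litemset being a subset of another. The names `is_subsequence` and `maximal` are hypothetical.

```python
def is_subsequence(a, b):
    """True if tuple a can be obtained from tuple b by deleting elements."""
    remaining = iter(b)
    # 'x in remaining' consumes the iterator, which enforces the ordering.
    return all(x in remaining for x in a)


def maximal(sequences):
    """Keep only the sequences not contained in another sequence of the set."""
    return [
        s for s in sequences
        if not any(s != t and is_subsequence(s, t) for t in sequences)
    ]


large = [(1,), (2,), (3,), (4,), (1, 2), (1, 3), (1, 2, 3)]
print(maximal(large))
```

This quadratic filter is fine for illustration; starting from the longest sequences and deleting their subsequences avoids most of the comparisons.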
16Algorithm AprioriAll
- Main idea
- All of the subsequences of a frequent sequence must be frequent sequences too
- If a sequence is not frequent, then its supersequences will not be frequent sequences
- Example
- If <1 2 3> is a frequent sequence, then <1>, <2>, <3>, <1 2>, <2 3> must be frequent sequences.
- If <1> is not a frequent sequence, then <1 2>, <1 3> are not frequent sequences.
17AprioriAll (Continue)
- Step 1: k = 2
- Step 2: Form Ck using the Apriori-generate function
- Step 3: Scan database and generate Lk from Ck based on the minimum support
- Step 4: If Lk is not empty, set k = k+1. Then repeat step 2 and step 3
18AprioriAll (Continue)
- Apriori-generate
- Join two sequences in Lk-1 to generate Ck
- Step 1: for each two sequences in Lk-1 that have the same 1st to (k-2)th litemsets, select the 1st to (k-1)th litemsets from the first sequence, and join with the last litemset from the other sequence
- Step 2: delete all sequences in Ck if some of their (k-1)-subsequences are not in Lk-1
- Example
- Given L3 = { <1 2 3>, <2 3 4>, <1 2 4>, <1 3 4>, <1 3 5> }
- step 1: C4 = { <1 2 3 4>, <1 3 4 5>, <1 3 5 4>, <1 2 4 3> }
- step 2: C4 = { <1 2 3 4> }
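The join and prune steps can be sketched on the L3 example above. Sequences are tuples of litemset ids, and the function names `join_step` and `prune_step` are hypothetical. One assumption to flag: following the wording "each two sequences", a sequence is only joined with a different sequence; allowing a sequence to join with itself would additionally produce candidates that repeat their last litemset.

```python
def join_step(prev):
    """Join pairs of (k-1)-sequences that agree on their first k-2 litemsets."""
    return {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p != q and p[:-1] == q[:-1]  # assumption: distinct sequences only
    }


def prune_step(candidates, prev):
    """Drop candidates that have a (k-1)-subsequence outside the previous level."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    }


L3 = {(1, 2, 3), (2, 3, 4), (1, 2, 4), (1, 3, 4), (1, 3, 5)}
joined = join_step(L3)
C4 = prune_step(joined, L3)
print(sorted(joined))
print(sorted(C4))
```

For instance, <1 3 4 5> is pruned because its subsequence <3 4 5> is not in L3, so it can never reach minimum support.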
19AprioriAll (Continue)
- Example: min_sup = 3
- Large sequences: <1 2 3>, <1 4>
ID | Mapping Seq.
1 | <1 4>
2 | <1 2 3>
3 | <1 2 2 3>
4 | <1 2 3 4>
5 | <1 4>
1-seq. | Sup.
<1> | 5
<2> | 3
<3> | 3
<4> | 3
2-seq. | Sup.
<1 2> | 3
<1 3> | 3
<1 4> | 3
<2 3> | 3
<2 4> | 1
<3 4> | 1
3-seq. | Sup.
<1 2 3> | 3
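The four AprioriAll steps can be put together in a short sketch. The database below is one assumed reading of the example's mapped sequences, with every element treated as a single litemset id, so each customer sequence is a flat tuple; the original grouping into elements may have differed. All function names are hypothetical, and candidates are generated from pairs of distinct sequences only.

```python
def is_subsequence(a, b):
    """True if tuple a can be obtained from tuple b by deleting elements."""
    remaining = iter(b)
    return all(x in remaining for x in a)


def count_large(candidates, db, min_sup):
    """One pass over the database: keep candidates with enough supporters."""
    return {c for c in candidates
            if sum(is_subsequence(c, s) for s in db) >= min_sup}


def apriori_generate(prev):
    """Join on a shared prefix, then prune by (k-1)-subsequences."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_all(db, min_sup):
    """Return {k: Lk} for every non-empty level k."""
    items = {x for s in db for x in s}
    large = {1: count_large({(i,) for i in items}, db, min_sup)}
    k = 2                                                  # step 1
    while large[k - 1]:
        candidates = apriori_generate(large[k - 1])        # step 2
        large[k] = count_large(candidates, db, min_sup)    # step 3
        k += 1                                             # step 4
    del large[k - 1]  # the last level is always empty
    return large


# Assumed reading of the example database (min_sup = 3 customers).
db = [(1, 4), (1, 2, 3), (1, 2, 2, 3), (1, 2, 3, 4), (1, 4)]
levels = apriori_all(db, min_sup=3)
all_large = set().union(*levels.values())
maximal = sorted(s for s in all_large
                 if not any(s != t and is_subsequence(s, t) for t in all_large))
print(levels)
print(maximal)
```

AprioriAll counts every level, including sequences such as <1 2> that are later discarded as non-maximal; AprioriSome and DynamicSome try to avoid part of that work.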
20AprioriSome
- Intuition: the subsequences of a frequent sequence will not be in the final maximal sequences, so counting them can often be avoided
- Example: Suppose <2 3>, <3 4>, <1 2>, <1 2 3> are frequent sequences, then the final maximal sequences are <3 4> and <1 2 3>
21AprioriSome (Continue)
- Step 1: set C1 = L1, last = 1, k = 2
- Step 2: forward phase
- Step 2.1: generate Ck from Lk-1 if Lk-1 is known, otherwise from Ck-1
- Step 2.2: if k = next(last), scan database to generate Lk based on the minimum support, and set last = k
- Step 2.3: if both Ck and Llast are not empty, increase k by 1, and repeat from step 2.1
- Step 3: backward phase
- Step 3.1: decrease k by 1. If Lk was not generated in the forward phase, delete sequences in Ck contained in some Li where i > k, then scan database again to generate Lk based on the given minimum support. If Lk was already generated, delete sequences in Lk contained in some Li where i > k.
- Step 3.2: if k > 1, repeat from step 3.1.
- Step 4: union all the sequences remaining in the Lk
22AprioriSome (Continue)
- Efficiency highly depends on the next(k) function
- Tradeoff between counting non-maximal sequences versus counting extensions of small candidate sequences.
- A special case: next(k) = k+1, where every length is counted, as in AprioriAll
- Example: based on the ratio of the number of Lk to the number of Ck, we decide the next value of k
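A ratio-based next(k) could look like the sketch below. The thresholds and skip sizes are illustrative assumptions, not values taken from the slides; the only idea carried over is that a high ratio of large sequences to candidates justifies skipping further ahead. The name `next_k` and its parameters are hypothetical.

```python
def next_k(k, num_large, num_candidates):
    """Pick the next length to count from the hit ratio |Lk| / |Ck|.

    A low ratio means many candidates failed, so longer candidates built
    from uncounted ones would be wasteful: advance by one. A high ratio
    means most candidates were large, so longer skips are likely to pay off.
    """
    if num_candidates == 0:
        return k + 1
    hit = num_large / num_candidates
    if hit < 0.666:   # illustrative thresholds
        return k + 1
    if hit < 0.75:
        return k + 2
    if hit < 0.80:
        return k + 3
    if hit < 0.85:
        return k + 4
    return k + 5


print(next_k(2, 3, 10))  # low hit ratio: count the very next length
print(next_k(2, 9, 10))  # high hit ratio: skip several lengths
```

Always returning k + 1 reproduces the special case in which every length is counted.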
23AprioriSome (Continue)
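- Example: next(k) = 2k, min_sup = 2
- Answers
- <1 2 3 4>, <1 3 5>, <4 5>
ID | Mapping Seq.
1 | < (1 5) (2) (3) (4) >
2 | < (1) (3) (4) (3 5) >
3 | < (1) (2) (3) (4) >
4 | < (1) (3) (5) >
5 | < (4) (5) >
1-seq. | Sup.
<1> | 4
<2> | 2
<3> | 4
<4> | 4
<5> | 4
2-seq. | Sup.
<1 2> | 2
<1 3> | 4
<1 4> | 3
<1 5> | 2
<2 3> | 2
<2 4> | 2
<2 5> | 0
<3 4> | 3
<3 5> | 2
<4 5> | 2
3-seq. (candidates)
<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <2 3 4>, <1 4 5>, <3 4 5>
4-seq. | Sup.
<1 2 3 4> | 2
<1 3 4 5> | 1
3-seq. | Sup.
<1 3 5> | 2
<3 4 5> | 1
<1 4 5> | 1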
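The forward and backward phases can be traced on this example with a compact sketch. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, customer sequences are lists of sets of litemset ids, ids are treated as atomic when testing containment between candidate sequences, and candidates are generated from pairs of distinct sequences only.

```python
def is_subsequence(a, b):
    remaining = iter(b)
    return all(x in remaining for x in a)


def supports(customer, cand):
    """True if the ids of cand occur in order in distinct elements of customer."""
    pos = 0
    for x in cand:
        while pos < len(customer) and x not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def count_large(cands, db, min_sup):
    return {c for c in cands if sum(supports(s, c) for s in db) >= min_sup}


def generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_some(db, min_sup, next_k):
    items = {x for s in db for element in s for x in element}
    L = {1: count_large({(i,) for i in items}, db, min_sup)}
    C = {1: L[1]}
    last, k = 1, 2
    # Forward phase: generate every Ck, but count only the selected lengths.
    while C[k - 1] and L[last]:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if k == next_k(last):
            L[k] = count_large(C[k], db, min_sup)
            last = k
        k += 1
    # Backward phase: count the skipped lengths, dropping non-maximal sequences.
    longer = set()  # maximal large sequences found at greater lengths
    for k in range(k - 1, 0, -1):
        pool = L[k] if k in L else C[k]
        kept = {c for c in pool if not any(is_subsequence(c, t) for t in longer)}
        longer |= kept if k in L else count_large(kept, db, min_sup)
    return longer


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
answers = apriori_some(db, min_sup=2, next_k=lambda k: 2 * k)
print(sorted(answers))
```

With next(k) = 2k the 3-sequences are not counted in the forward phase; only the candidates not contained in <1 2 3 4> are counted on the way back.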
24DynamicSome
- Intuition: same idea as AprioriSome
- Differences between the two algorithms
AprioriSome | DynamicSome
k = next(last) | k = k + step
Ck generated from Lk-1 or Ck-1 | Ck+step = otf-generate(Lk, Lstep, c)
Two phases: forward, backward | Three phases: forward, backward and intermediate
Initialize L1 | Initialize L1 to Lstep
25DynamicSome (Continue)
- Step 1: generate L1 to Lstep based on the Apriori algorithm
- Step 2: forward phase
- Step 2.1: set k = step
- Step 2.2: scan db to generate Ck+step using otf-generate(Lk, Lstep, c), and then generate Lk+step from Ck+step based on the given minimum support
- Step 2.3: set k = k+step; if Lk is not empty, repeat from step 2.2
- Step 3: intermediate phase
- Generate all the missing Ck based on Lk-1 or Ck-1
- Step 4: backward phase, which is the same as that of AprioriSome
26DynamicSome (Continue)
- On-the-fly candidate generation
- Input: customer sequence c = <c1 c2 ... cn>, Lk and Lj
- Xk = subseq(Lk, c)
- For all sequences x in Xk do
- x.end = min{ j | x is contained in <c1 c2 ... cj> }
- Xj = subseq(Lj, c)
- For all sequences x in Xj do
- x.start = max{ j | x is contained in <cj cj+1 ... cn> }
- Answer: join of Xk with Xj where Xk.end < Xj.start
27DynamicSome (Continue)
- Example
- c = < (1) (2) (3 7) (4) >, L2 = { <1 2>, <1 3>, <3 4> }
Seq. | End | Start
<1 2> | 2 | 1
<1 3> | 3 | 1
<3 4> | 4 | 3
- Thus, result: <1 2 3 4>
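On-the-fly generation can be sketched as follows. The customer sequence is read as four elements, the third holding both 3 and 7, which is the reading consistent with the end and start values in the example. The names `earliest_end`, `latest_start` and `otf_generate` are hypothetical; sequences are tuples of litemset ids and the customer sequence is a list of sets of ids.

```python
def earliest_end(x, c):
    """Smallest 1-based j such that x is contained in <c1 ... cj>, else None."""
    pos = 0
    for item in x:
        while pos < len(c) and item not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # one past the 0-based match, i.e. its 1-based index


def latest_start(x, c):
    """Largest 1-based j such that x is contained in <cj ... cn>, else None."""
    pos = len(c) - 1
    for item in reversed(x):
        while pos >= 0 and item not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # convert the 0-based match of x's first item to 1-based


def otf_generate(Lk, Lj, c):
    """Candidates of length k+j contained in c: x from Lk followed by y from Lj."""
    ends = {x: earliest_end(x, c) for x in Lk}
    starts = {y: latest_start(y, c) for y in Lj}
    return {
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    }


c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (3, 4)}
print(otf_generate(L2, L2, c))
```

Only <1 2> (end 2) and <3 4> (start 3) satisfy end < start, so the single candidate is <1 2 3 4>.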
28DynamicSome (Continue)
- Example: step = 2, min_sup = 2
- Answers
- <1 2 3 4>, <1 3 5>, <4 5>
1-seq. | Sup.
<1> | 4
<2> | 2
<3> | 4
<4> | 4
<5> | 4
2-seq. | Sup.
<1 2> | 2
<1 3> | 4
<1 4> | 3
<1 5> | 2
<2 3> | 2
<2 4> | 2
<2 5> | 0
<3 4> | 3
<3 5> | 2
<4 5> | 2
4-seq. | Sup.
<1 2 3 4> | 2
<1 3 4 5> | 1
3-seq. | Sup.
<1 2 3> |
<1 2 4> |
<1 3 4> |
<1 3 5> | 2
<3 4 5> | 1
29Performance
30Performance (Continue)
Note: The result of DynamicSome was not plotted for low values of minimum support since it generated too many candidates and ran out of memory.
31Performance (Continue)
32Performance (Continue)
33Performance (Continue)
34Conclusions
- The problem of mining sequential patterns from a database of customer transactions was introduced and three algorithms for solving this problem were presented.
- Two of the algorithms, AprioriSome and AprioriAll, have comparable performance, although AprioriSome performs a little better for the lower values of the minimum support.
- Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions.