Mining Sequential Patterns


1
Mining Sequential Patterns
  • Presenters
  • Qian Bai, Jiguo Jiang

2
Mining Sequential Patterns
  • Introduction
  • The Algorithm
  • AprioriAll, AprioriSome, DynamicSome
  • Performance
  • Conclusions

3
Introduction
  • Background
  • Problem Statement
  • An Example
  • Related Work

4
Background
  • Customer purchase patterns
  • Buy computer, then buy software
  • Rent 'Star Wars', then 'Empire Strikes Back', and
    then 'Return of the Jedi'
  • Buy fitted sheet, flat sheet, and pillow
    cases, followed by a comforter, and then
    followed by drapes and ruffles
  • Web access patterns
  • Open www.yorku.ca, then open www.cs.yorku.ca/mail

5
Background (Continued)
  • The sequential pattern mining problem was first
    introduced by Agrawal and Srikant
  • Definition: given a set of sequences, where each
    sequence consists of a list of elements and
    each element consists of a set of items, and
    given a user-specified min-support threshold,
    sequential pattern mining finds all frequent
    subsequences, i.e., the subsequences whose
    occurrence frequency in the set of sequences is
    no less than min-support

6
Problem Statement
  • After reading the three papers about Mining
    Sequential Patterns, we focus on a database D of
    customer transactions
  • Each transaction consists of the following
    fields
  • Customer-id
  • Transaction-time
  • Items purchased in the transaction
  • Note
  • No customer has more than one transaction with
    the same transaction time.
  • We do not consider quantities of items bought in
    a transaction

7
Problem Statement (Continued)
  • Terminology
  • Itemset: a non-empty set of items, e.g. (30 40 50),
    (60)
  • Sequence: an ordered list of itemsets, e.g. < (30 40
    50) (60) >
  • Sequence length: the number of itemsets in a
    sequence
  • Contained: a sequence <a1 a2 ... aN> is
    contained in another sequence <b1 b2 ... bM> if
    there exist integers i1 < i2 < ... < iN such that
    a1 ⊆ bi1, a2 ⊆ bi2, ..., aN ⊆ biN
  • < (30) (40 50) > is contained in < (70) (30 80)
    (40 50 60) >
  • < (30) (50) > is NOT contained in < (30 50) >
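
The containment test above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `is_contained` and `seq` are hypothetical helpers, and itemsets are assumed to be represented as frozensets of integers.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]
Sequence = List[Itemset]


def is_contained(a: Sequence, b: Sequence) -> bool:
    """Return True if every itemset of a is a subset of a distinct,
    order-preserving itemset of b (greedy earliest match)."""
    pos = 0
    for itemset in a:
        # Advance to the first remaining itemset of b that includes this one.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


def seq(*itemsets) -> Sequence:
    """Hypothetical convenience constructor for a sequence."""
    return [frozenset(s) for s in itemsets]


# The two examples from the slide.
print(is_contained(seq({30}, {40, 50}), seq({70}, {30, 80}, {40, 50, 60})))  # True
print(is_contained(seq({30}, {50}), seq({30, 50})))                          # False
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never rules out a later one.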

8
Problem Statement (Continued)
  • Terminology (Continued)
  • Maximal sequence: a sequence is maximal in a set
    of sequences if it is not contained in any other
    sequence of that set
  • Support: a customer supports a sequence s if s is
    contained in the customer-sequence for this
    customer. The support of s is the fraction of
    total customers who support s
  • Litemset (large itemset): an itemset satisfying
    the minimum support
  • Large sequence: a sequence satisfying the minimum
    support constraint
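
As a rough illustration of the support definition, the sketch below counts the fraction of customers whose sequence contains a given pattern. The customer sequences are those of the example database in this presentation; the function names and the set-based representation are assumptions made for the sketch.

```python
def is_contained(a, b):
    """True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


# Customer sequences of the example database (customer id -> sequence).
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def support(pattern, db):
    """Fraction of customers whose sequence contains the pattern."""
    return sum(is_contained(pattern, s) for s in db.values()) / len(db)


print(support([{30}, {90}], customer_sequences))      # customers 1 and 4
print(support([{30}, {40, 70}], customer_sequences))  # customers 2 and 4
print(support([{30}], customer_sequences))            # customers 1, 2, 3, 4
```

Each customer is counted at most once, no matter how many times the pattern occurs in that customer's sequence.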

9
Problem Statement (Continued)
  • Given a database D of customer transactions, the
    problem of mining sequential patterns is to find
    the maximal sequences among all sequences that
    have a certain user-specified minimum support.
    Each such maximal sequence represents a
    sequential pattern

10
An Example
  • A Database sorted by Customer ID and Transaction
    Time

Customer ID   Transaction Time   Items Bought
1             June 25 93         30
1             June 30 93         90
2             June 10 93         10, 20
2             June 15 93         30
2             June 20 93         40, 60, 70
3             June 25 93         30, 50, 70
4             June 25 93         30
4             June 30 93         40, 70
4             July 25 93         90
5             June 12 93         90

11
An Example (Continued)
  • Customer-Sequence Version of the Database
  • Note
  • Patterns are not necessarily contiguous.
  • Some sequences, such as < (30) > and < (30) (40) >,
    though having minimum support, are not in the
    answer because they are not maximal

Customer ID   Customer Sequence
1             < (30) (90) >
2             < (10 20) (30) (40 60 70) >
3             < (30 50 70) >
4             < (30) (40 70) (90) >
5             < (90) >

Sequential patterns with support > 25%
< (30) (90) >      (supported by customers 1 and 4)
< (30) (40 70) >   (supported by customers 2 and 4)

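
Turning the transaction table into customer sequences amounts to sorting by customer id and transaction time and then grouping. A minimal sketch follows, assuming the two-digit year 93 means 1993; the function name `to_customer_sequences` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# (customer id, transaction time, items bought), deliberately unsorted.
transactions = [
    (4, date(1993, 7, 25), {90}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (1, date(1993, 6, 30), {90}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (2, date(1993, 6, 10), {10, 20}),
    (4, date(1993, 6, 25), {30}),
    (5, date(1993, 6, 12), {90}),
    (1, date(1993, 6, 25), {30}),
    (4, date(1993, 6, 30), {40, 70}),
    (2, date(1993, 6, 15), {30}),
]


def to_customer_sequences(rows):
    """Sort by (customer id, time), then collect each customer's itemsets in order."""
    sequences = defaultdict(list)
    for customer_id, _when, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences[customer_id].append(frozenset(items))
    return dict(sequences)


for customer_id, sequence in to_customer_sequences(transactions).items():
    print(customer_id, [sorted(itemset) for itemset in sequence])
```

The sort key uses only the id and the date, so the item sets themselves are never compared.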
12
Related Work
  • Differences between Association Rule Mining in
    Customer Transaction Database and Sequential
    Pattern Mining
  • Association rule mining
  • Finds which items are bought together
  • Finds intra-transaction patterns
  • Patterns are unordered sets of items
  • Sequential pattern mining
  • Finds which items are bought in different
    transactions
  • Finds inter-transaction patterns
  • Patterns are ordered lists of sets of items

13
Algorithm
  • Sort phase
  • Sort database with customer-id as the major key
    and transaction-time as the minor key
  • Litemset phase
  • Scan the database to find the set L1 of all
    1-sequence litemsets based on the given minimum
    support
  • Map the large itemsets to a set of contiguous
    integers by treating litemsets as single
    entities.
  • Example: (30) (40) (70) (40 70) (90) can be
    mapped to 1 2 3 4 5
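
The litemset phase could look roughly like the brute-force sketch below, which enumerates every subset of every transaction and counts each itemset once per customer. It is only practical for tiny transactions, and `find_litemsets` and `min_count` are hypothetical names. A minimum count of 2 customers is assumed, matching the 25% threshold for five customers.

```python
from itertools import combinations

customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def find_litemsets(db, min_count):
    """Return all itemsets supported by at least min_count customers."""
    counts = {}
    for sequence in db.values():
        seen = set()  # itemsets seen for this customer, counted once
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return sorted(s for s, n in counts.items() if n >= min_count)


litemsets = find_litemsets(customer_sequences, min_count=2)
# Map each litemset to an integer; the numbering itself is arbitrary.
mapping = {itemset: number for number, itemset in enumerate(litemsets, start=1)}
print(mapping)
```

The sketch finds the same five litemsets as the slide. The integers it assigns may differ from the slide's numbering, which does not matter as long as the mapping is used consistently.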

14
Algorithm (Continued)
  • Transformation phase
  • Replace each transaction by the set of 1-sequence
    litemsets that it contains
  • Delete customer sequences that contain no
    1-sequence litemset
  • Keep the same total number of customers
  • Example: given that (30) (90) (40) (70) (40 70) are
    the 1-sequence litemsets

ID   Before transformation   After transformation
1    (30) (90)               {(30)} {(90)}
2    (10 20) (40 60 70)      {(40), (70), (40 70)}
3    (50)                    ∅

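
A minimal sketch of the transformation phase, assuming the litemsets are already known and represented as frozensets; `transform` is a hypothetical name. A transaction that contains no litemset is dropped, and a sequence that ends up empty corresponds to a deleted customer sequence (the customer still counts toward the total).

```python
LITEMSETS = [
    frozenset({30}),
    frozenset({90}),
    frozenset({40}),
    frozenset({70}),
    frozenset({40, 70}),
]


def transform(sequence, litemsets=LITEMSETS):
    """Replace each transaction by the set of litemsets it contains."""
    transformed = []
    for transaction in sequence:
        contained = {l for l in litemsets if l <= transaction}
        if contained:  # transactions without any litemset are dropped
            transformed.append(contained)
    return transformed


# The three rows of the example table.
print(transform([{30}, {90}]))
print(transform([{10, 20}, {40, 60, 70}]))
print(transform([{50}]))  # empty list: the sequence is deleted
```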
15
Algorithm (Continued)
  • Sequence phase
  • Find the frequent sequences
  • Three algorithms: AprioriAll, AprioriSome,
    DynamicSome
  • Maximal phase
  • Delete sequences that are subsequences of other
    large sequences
  • Combined with the sequence phase in the AprioriSome
    and DynamicSome algorithms
  • Example: given the sequences <1>, <2>, <3>, <4>,
    <1 2>, <1 3>, <1 2 3>, the maximal sequences are
    <4> and <1 2 3>
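
The maximal phase can be sketched as a filter that drops every sequence contained in another one. In this sketch, sequences are tuples of litemset numbers and each number is treated as an atomic symbol, which is a simplifying assumption; the function names are hypothetical.

```python
def is_subsequence(a, b):
    """True if tuple a can be obtained from tuple b by deleting elements."""
    remaining = iter(b)
    # 'x in remaining' consumes the iterator, so order is respected.
    return all(x in remaining for x in a)


def maximal(sequences):
    """Keep only the sequences not contained in any other sequence."""
    return [
        s for s in sequences
        if not any(s != t and is_subsequence(s, t) for t in sequences)
    ]


large = [(1,), (2,), (3,), (4,), (1, 2), (1, 3), (1, 2, 3)]
print(maximal(large))  # [(4,), (1, 2, 3)]
```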

16
Algorithm AprioriAll
  • Main idea
  • Every subsequence of a frequent sequence must be
    a frequent sequence too
  • If a sequence is not frequent, then none of its
    supersequences can be frequent
  • Example
  • If <1 2 3> is a frequent sequence, then <1>, <2>,
    <3>, <1 2>, <2 3> must be frequent sequences.
  • If <1> is not a frequent sequence, then <1 2> and
    <1 3> are not frequent sequences.

17
AprioriAll (Continued)
  • Step 1: set k = 2
  • Step 2: form Ck using the Apriori-generate function
  • Step 3: scan the database and generate Lk from Ck
    based on the minimum support
  • Step 4: if Lk is not empty, set k = k + 1, then
    repeat step 2 and step 3

18
AprioriAll (Continued)
  • Apriori-generate
  • Join two sequences in Lk-1 to generate Ck
  • Step 1: for each pair of sequences in Lk-1 that
    have the same first k-2 litemsets, take the k-1
    litemsets of the first sequence and append the
    last litemset of the other sequence
  • Step 2: delete every sequence in Ck that has a
    subsequence of length k-1 not in Lk-1
  • Example
  • Given L3 = { <1 2 3>, <2 3 4>, <1 2 4>, <1 3 4>, <1 3 5> }
  • Step 1: C4 = { <1 2 3 4>, <1 3 4 5>, <1 3 5 4>,
    <1 2 4 3> }
  • Step 2: C4 = { <1 2 3 4> }
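
The join and prune steps above can be sketched as follows, with sequences represented as tuples of litemset numbers. The function names are hypothetical. Following the slide's example, a sequence is only joined with a different sequence, so candidates that repeat the last litemset (such as <1 2 3 3>) are not generated; that restriction is an assumption of this sketch.

```python
from itertools import combinations


def join_step(large_prev):
    """Step 1: join pairs of (k-1)-sequences that agree on their first k-2 litemsets."""
    return {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]
    }


def prune_step(candidates, large_prev):
    """Step 2: drop candidates that have a (k-1)-subsequence outside L(k-1)."""
    large_prev = set(large_prev)
    return {
        c for c in candidates
        if all(sub in large_prev for sub in combinations(c, len(c) - 1))
    }


L3 = [(1, 2, 3), (2, 3, 4), (1, 2, 4), (1, 3, 4), (1, 3, 5)]
joined = join_step(L3)
print(sorted(joined))                  # four candidates after the join
print(sorted(prune_step(joined, L3)))  # only (1, 2, 3, 4) survives
```

`combinations(c, len(c) - 1)` yields exactly the order-preserving subsequences obtained by deleting one litemset, which is what the prune step needs.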

19
AprioriAll (Continued)
  • Example: min_sup = 3
  • Large sequences (maximal): <1 2 3>, <1 4>

ID   Mapped sequence
1    (14)
2    (12 3)
3    (1 2 2 3)
4    (12 34)
5    (14)

1-seq.   Sup.
<1>      5
<2>      3
<3>      3
<4>      3

2-seq.   Sup.
<1 2>    3
<1 3>    3
<1 4>    3
<2 3>    3
<2 4>    1
<3 4>    1

3-seq.    Sup.
<1 2 3>   3

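
Putting the steps together, the AprioriAll loop might be sketched as below. The database here is a made-up toy example, not taken from the slides: each customer sequence is a list of sets of litemset numbers, and `min_count` is an absolute number of customers. All function names are hypothetical, and the join only pairs distinct sequences, as in the candidate-generation example.

```python
from itertools import combinations


def contains(customer_seq, candidate):
    """True if the candidate's litemsets occur in order in the customer sequence."""
    pos = 0
    for litemset in candidate:
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def count_large(candidates, db, min_count):
    """Keep the candidates supported by at least min_count customers."""
    return {c for c in candidates
            if sum(contains(s, c) for s in db) >= min_count}


def apriori_generate(large_prev):
    """Join on the shared prefix, then prune by (k-1)-subsequences."""
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(sub in large_prev for sub in combinations(c, len(c) - 1))}


def apriori_all(db, min_count):
    """Return [L1, L2, ...] for all non-empty levels."""
    items = {lit for seq in db for transaction in seq for lit in transaction}
    levels = [count_large({(i,) for i in items}, db, min_count)]
    while levels[-1]:
        candidates = apriori_generate(levels[-1])
        levels.append(count_large(candidates, db, min_count))
    return [level for level in levels if level]


# Hypothetical toy database of five transformed customer sequences.
toy_db = [
    [{1}, {4}],
    [{1}, {2}, {3}],
    [{1}, {2}, {3}],
    [{1}, {2}, {3}, {4}],
    [{1}, {4}],
]
for k, level in enumerate(apriori_all(toy_db, min_count=3), start=1):
    print(k, sorted(level))
```

Every level is counted with one pass over the database, which is the cost that AprioriSome tries to reduce by skipping some lengths.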
20
AprioriSome
  • Intuition: the subsequences of a frequent sequence
    will not be in the final maximal sequences
  • Example: suppose <2 3>, <3 4>, <1 2>, <1 2 3>
    are frequent sequences, then the final maximal
    sequences are <3 4> and <1 2 3>

21
AprioriSome (Continued)
  • Step 1: set C1 = L1, last = 1, k = 2
  • Step 2: forward phase
  • Step 2.1: generate Ck from Lk-1 if it is known,
    otherwise from Ck-1
  • Step 2.2: if k = next(last), scan the database to
    generate Lk based on the minimum support, and set
    last = k
  • Step 2.3: if both Ck and Llast are not empty,
    increase k by 1, and repeat from step 2.1
  • Step 3: backward phase
  • Step 3.1: decrease k by 1. If Lk was not computed
    in the forward phase, delete the sequences in Ck
    contained in some Li where i > k, then scan the
    database again to generate Lk based on the given
    minimum support. If Lk was already computed, delete
    the sequences in Lk contained in some Li where i > k.
  • Step 3.2: if k > 1, repeat from step 3.1.
  • Step 4: the answer is the union of all the Lk
22
AprioriSome (Continued)
  • Efficiency highly depends on the next(k)
    function
  • Tradeoff between counting non-maximal sequences
    versus counting extensions of small candidate
    sequences.
  • A special case: next(k) = k + 1
  • Example: based on the ratio of the number of Lk
    to the number of Ck, we decide the value of
    next(k)
23
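AprioriSome (Continued)
  • Example: next(k) = 2k, min_sup = 2
  • Answers
  • <1 2 3 4>, <1 3 5>, <4 5>

ID   Mapped sequence
1    < (1 5) (2) (3) (4) >
2    < (1) (3) (4) (3 5) >
3    < (1) (2) (3) (4) >
4    < (1) (3) (5) >
5    < (4) (5) >

1-seq.   Sup.
<1>      4
<2>      2
<3>      4
<4>      4
<5>      4

2-seq.   Sup.
<1 2>    2
<1 3>    4
<1 4>    3
<1 5>    2
<2 3>    2
<2 4>    2
<2 5>    0
<3 4>    3
<3 5>    2
<4 5>    2

3-seq. (candidates, not counted in the forward phase)
<1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <2 3 4>, <1 4 5>, <3 4 5>

4-seq.      Sup.
<1 2 3 4>   2
<1 3 4 5>   1

3-seq. (counted in the backward phase)   Sup.
<1 3 5>                                  2
<3 4 5>                                  1
<1 4 5>                                  1
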
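
The forward and backward phases can be sketched end to end on the mapped sequences of this example. This is a simplified sketch under stated assumptions: litemset numbers are treated as atomic symbols when testing containment between sequences, the join only pairs distinct sequences, and all function names are hypothetical.

```python
from itertools import combinations

DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]


def contains(customer_seq, candidate):
    pos = 0
    for litemset in candidate:
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def is_subsequence(a, b):
    remaining = iter(b)
    return all(x in remaining for x in a)


def count_large(candidates, db, min_count):
    return {c for c in candidates
            if sum(contains(s, c) for s in db) >= min_count}


def generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(sub in prev for sub in combinations(c, len(c) - 1))}


def apriori_some(db, min_count, next_k):
    items = {lit for seq in db for t in seq for lit in t}
    large = {1: count_large({(i,) for i in items}, db, min_count)}
    cands = {1: large[1]}
    last, k = 1, 2
    # Forward phase: generate candidates for every length, count only some.
    while cands[k - 1] and large[last]:
        cands[k] = generate(large[k - 1] if k - 1 in large else cands[k - 1])
        if k == next_k(last):
            large[k] = count_large(cands[k], db, min_count)
            last = k
        k += 1
    # Backward phase: longest first, skipping anything already covered.
    longer = set()
    for k in sorted(cands, reverse=True):
        pool = large[k] if k in large else cands[k]
        fresh = {s for s in pool
                 if not any(is_subsequence(s, t) for t in longer)}
        large[k] = fresh if k in large else count_large(fresh, db, min_count)
        longer |= large[k]
    return longer


print(sorted(apriori_some(DB, 2, lambda k: 2 * k)))
```

With next(k) = 2k the 3-sequences are never counted in the forward phase; only the three candidates not covered by <1 2 3 4> are counted on the way back. Passing `lambda k: k + 1` instead counts every length and returns the same answer.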
24
DynamicSome
  • Intuition: same idea as AprioriSome
  • Differences between the two algorithms

AprioriSome                     DynamicSome
k = next(last)                  k = k + step
Ck from Lk-1 or Ck-1            Ck+step = otf-generate(Lk, Lstep, c)
Two phases: forward, backward   Three phases: forward, backward, and intermediate
Initialize L1                   Initialize L1 to Lstep

25
DynamicSome (Continued)
  • Step 1: generate L1 to Lstep based on the Apriori
    algorithm
  • Step 2: forward phase
  • Step 2.1: set k = step
  • Step 2.2: scan the database to generate Ck+step using
    otf-generate(Lk, Lstep, c), and then generate
    Lk+step from Ck+step based on the given minimum
    support
  • Step 2.3: if Lk+step is not empty, set k = k + step
    and repeat from step 2.2
  • Step 3: intermediate phase
  • Generate all the missing Ck based on Lk-1 or Ck-1
  • Step 4: backward phase, which is the same as that of
    AprioriSome

26
DynamicSome (Continued)
  • On-the-fly candidate generation
  • Input: a customer sequence c = <c1 c2 ... cn>,
    and the large sequence sets Lk and Lj
  • Xk = subseq(Lk, c)
  • For all sequences x in Xk:
    x.end = min{ j | x is contained in <c1 c2 ... cj> }
  • Xj = subseq(Lj, c)
  • For all sequences x in Xj:
    x.start = max{ j | x is contained in <cj cj+1 ... cn> }
  • Answer: join of Xk with Xj where Xk.end < Xj.start
27
DynamicSome (Continued)
  • Example
  • c = < (1) (2) (3 7) (4) >, L2 = { <1 2>, <1 3>, <3 4> }
  • Thus, result: <1 2 3 4>

Seq.    End   Start
<1 2>   2     1
<1 3>   3     1
<3 4>   4     3

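
The on-the-fly generation can be sketched as follows. It assumes the customer sequence c is a list of sets of litemset numbers with the items 3 and 7 sharing the third element, which is consistent with the end and start values in the table. The function names are hypothetical, and positions are 1-based as on the slide.

```python
def earliest_end(x, c):
    """Smallest j such that x is contained in <c1 ... cj>, or None."""
    pos = 0
    for litemset in x:
        while pos < len(c) and litemset not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(x, c):
    """Largest j such that x is contained in <cj ... cn>, or None."""
    pos = len(c) - 1
    for litemset in reversed(x):
        while pos >= 0 and litemset not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # convert the last matched 0-based index to 1-based


def otf_generate(large_k, large_j, c):
    """Join x in Lk with y in Lj whenever x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return sorted(
        x + y
        for x, end in ends.items() if end is not None
        for y, start in starts.items() if start is not None and end < start
    )


c = [{1}, {2}, {3, 7}, {4}]
L2 = [(1, 2), (1, 3), (3, 4)]
for s in L2:
    print(s, earliest_end(s, c), latest_start(s, c))
print(otf_generate(L2, L2, c))  # [(1, 2, 3, 4)]
```

Only <1 2> (end 2) can be joined with <3 4> (start 3); every other pair fails the end < start condition.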
28
DynamicSome (Continued)
  • Example: step = 2, min_sup = 2
  • Answers
  • <1 2 3 4>, <1 3 5>, <4 5>

1-seq.   Sup.
<1>      4
<2>      2
<3>      4
<4>      4
<5>      4

2-seq.   Sup.
<1 2>    2
<1 3>    4
<1 4>    3
<1 5>    2
<2 3>    2
<2 4>    2
<2 5>    0
<3 4>    3
<3 5>    2
<4 5>    2

4-seq.      Sup.
<1 2 3 4>   2
<1 3 4 5>   1

3-seq.    Sup.
<1 2 3>   (not counted)
<1 2 4>   (not counted)
<1 3 4>   (not counted)
<1 3 5>   2
<3 4 5>   1

29
Performance


30
Performance (Continued)


Note: the result of DynamicSome was not plotted
for low values of minimum support since it
generated too many candidates and ran out of
memory.
34
Conclusions
  • The problem of mining sequential patterns from a
    database of customer transactions was introduced
    and three algorithms for solving this problem were
    presented.
  • Two of the algorithms, AprioriSome and
    AprioriAll, have comparable performance, although
    AprioriSome performs a little better for the
    lower values of the minimum support.
  • Scale-up experiments show that both AprioriSome
    and AprioriAll scale linearly with the number of
    customer transactions.
  • Questions?