A Flexible Sequential Pattern Mining Algorithm with Efficient Indexes for Support Counting

1
A Flexible Sequential Pattern Mining Algorithm
with Efficient Indexes for Support Counting
  • Presented by Jie-Ru Lin
  • 11/24/2009
  • M9510692_at_fcu.edu.tw

2
Outline
  • Introduction
  • Motivation
  • Traditional Algorithm vs. Our Algorithm
  • FSPE algorithm and a simple example
  • Experimental result
  • Extensions and Discussion
  • Conclusion and future work

3
Introduction (1/2)
  • Due to the increasing need for data analysis, the
    use of data mining is growing at a rapid pace.
  • Sequential pattern mining is an important data
    mining method with broad applications, including
    bioinformatics, web access tracing, etc.

4
Introduction (2/2)
  • Sequential Pattern Mining
  • Sequential pattern mining discovers frequent
    subsequences as patterns in a sequence database.
  • A sequence database contains ordered elements or
    events.
  • Example: Printer → Scanner → Computer → MP3

5
Motivation
  • Setting the ideal threshold requires
  • Domain Knowledge
  • Knowing the distribution of data
  • Pre-determining Minimum Support
  • Too small a threshold → thousands of patterns
  • Too large a threshold → no patterns

6
Traditional Algorithm vs. Our Algorithm
  • Traditional algorithm
  • Scan the database once to find frequent 1-sequences
    by using a minimum support threshold
  • Generate candidate 2-sequences → scan the database
    a second time to determine frequent sequences → generate
    candidate 3-sequences → scan the database a third
    time to determine frequent sequences → ...
  • Our algorithm
  • Read the first transaction, and directly
    calculate the support of all candidate
    n-sequences.
  • Repeat the process until the end of the database.
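
The traditional level-wise procedure described above can be sketched as follows. This is a minimal illustration, assuming each sequence is a list of single items and that a candidate is supported by a sequence when it occurs in order with gaps allowed. The names `is_subsequence` and `levelwise_mine` are hypothetical, and the candidate generation is deliberately simplified compared with published level-wise algorithms.

```python
from typing import Dict, List, Sequence, Tuple


def is_subsequence(candidate: Sequence[int], sequence: Sequence[int]) -> bool:
    """Return True if candidate occurs in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in candidate)


def levelwise_mine(
    database: List[List[int]], min_support: int
) -> Tuple[Dict[Tuple[int, ...], int], int]:
    """Level-wise mining: one database scan per candidate length."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    items = sorted({item for seq in database for item in seq})
    frequent: Dict[Tuple[int, ...], int] = {}
    candidates = [(item,) for item in items]
    scans = 0
    while candidates:
        scans += 1  # every level needs a full pass over the database
        counts = {
            c: sum(is_subsequence(c, seq) for seq in database)
            for c in candidates
        }
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        frequent_items = [i for i in items if (i,) in frequent]
        # Extend only the candidates that survived the threshold.
        candidates = [c + (i,) for c in level for i in frequent_items]
    return frequent, scans


database = [[3, 1, 1], [1, 2, 3], [1, 3, 2]]
patterns, scans = levelwise_mine(database, min_support=2)
```

Note that the threshold must be fixed before mining starts, and the number of scans grows with the length of the longest candidate level examined.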

7
Our proposed FSPE Algorithm
  • Fast Sequential Pattern Enumeration (FSPE)
    algorithm
  • Read the database only once because each
    transaction is processed right after input
  • No minimum support is specified to prune the
    candidate sequences
  • Every candidate is enumerated with an efficient
    index into the support counters based on the
    corresponding prefix.
  • Users can specify any minimum support later in
    rule generation to extract sequence patterns as
    needed.
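
The single-pass idea can be sketched as below. This is an assumption-laden illustration rather than the FSPE implementation: it stores counts in a dictionary instead of the indexed counter arrays the slides describe, treats each sequence as a list of single items, and counts each distinct candidate once per sequence. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def count_all_subsequences(database: List[List[int]]) -> Counter:
    """Read each sequence once and count every candidate it contains."""
    support: Counter = Counter()
    for sequence in database:
        seen = set()
        for length in range(1, len(sequence) + 1):
            # combinations() keeps the original order of the items.
            for candidate in combinations(sequence, length):
                seen.add(candidate)
        # A candidate contributes at most 1 to support per sequence.
        support.update(seen)
    return support


def patterns_with_min_support(
    support: Counter, min_support: int
) -> Dict[Tuple[int, ...], int]:
    """Apply a minimum support afterwards, without rescanning the data."""
    return {p: n for p, n in support.items() if n >= min_support}


database = [[3, 1, 1], [1, 2, 3], [1, 3, 2]]
support = count_all_subsequences(database)
frequent = patterns_with_min_support(support, min_support=2)
```

Because nothing is pruned, a sequence of length m yields up to 2^m - 1 candidates, so this sketch trades memory and per-sequence work for the single scan and the freedom to pick any threshold later.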

8
How to calculate index(1/4)
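Number of items = 3. Candidate 3-sequences with prefix 1:
(1,1,1), (1,1,2), (1,1,3),
(1,2,1), (1,2,2), (1,2,3),
(1,3,1), (1,3,2), (1,3,3)
[Figure: the candidates arranged along x, y, and z axes, starting at (1,1,1)]
Example: (1,3,2) - (1,1,0) = (0,2,2), so the index is (3×2)+2 = 8.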
9
How to calculate index(2/4)
[Figure: counter array with indexes 1, 2, 3, 4, 5, ...]
10
How to calculate index(3/4)
Number of items = 3. Candidate 3-sequences with prefix 2:
(2,1,1), (2,1,2), (2,1,3),
(2,2,1), (2,2,2), (2,2,3),
(2,3,1), (2,3,2), (2,3,3)
Example: (2,3,1) - (2,1,0) = (0,2,1), so the index is (3×2)+1 = 7.
11
How to calculate index(4/4)
Number of items = 3. Candidate 3-sequences with prefix 3:
(3,1,1), (3,1,2), (3,1,3),
(3,2,1), (3,2,2), (3,2,3),
(3,3,1), (3,3,2), (3,3,3)
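
One reading of the worked examples on these slides is that, within the counters of a given prefix, a 3-sequence (p, a, b) over n items gets the index (a-1)·n + b, so (1,3,2) maps to 8 and (2,3,1) maps to 7 when n = 3. The sketch below implements that reading and generalizes it to longer sequences by treating the items after the prefix as digits in base n. The generalization and the name `index_within_prefix` are assumptions, not something the slides state.

```python
from typing import Sequence


def index_within_prefix(sequence: Sequence[int], num_items: int) -> int:
    """1-based position of a candidate among the counters of its prefix.

    The first item is the prefix; the remaining items are read as
    base-num_items digits (item - 1), most significant first.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if not all(1 <= item <= num_items for item in sequence):
        raise ValueError("items must lie in 1..num_items")
    offset = 0
    for item in sequence[1:]:
        offset = offset * num_items + (item - 1)
    return offset + 1


index_a = index_within_prefix((1, 3, 2), num_items=3)
index_b = index_within_prefix((2, 3, 1), num_items=3)
```

With this layout no candidate has to be stored explicitly: the position of its counter follows from the items alone, which is what makes direct support counting possible during a single pass.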
12
A simple example (1/7)
Enumerate all candidate sequences
13
A simple example (2/7)
Candidate n-itemset sequence: 3 (1-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
Assume the number of items = 3
[Figure: prefix 3 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with a counter set to 1]
14
A simple example (3/7)
Candidate n-itemset sequence: 1 (1-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
The number of items = 3
[Figure: prefix 1 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with a counter set to 1]
15
A simple example (4/7)
Candidate n-itemset sequence: 1 (1-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
The number of items = 3
[Figure: prefix 1 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with a counter set to 1]
16
A simple example (5/7)
Candidate n-itemset sequence: 3,1 (2-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
The number of items = 3
[Figure: prefix 3 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with counters set to 1]
17
A simple example (6/7)
Candidate n-itemset sequence: 1,1 (2-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
The number of items = 3
[Figure: prefix 1 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with counters set to 1]
18
A simple example (7/7)
Candidate n-itemset sequence: 3,1,1 (3-itemset)
Sequence_id  Sequence
1            3 1 1
2            1 2 3
3            1 3 2
The number of items = 3
[Figure: prefix 3 counters (indexes 1, 2, 3, 4, 5, ...) across the 1-itemset to 5-itemset levels, with counters set to 1]
19
Result of mining the sample database
Index 1 2 3 4 5 .
1, 1 1 1, 2 2 1, 3 3
20
Experiments
  • Runtime test
  • Compare the PrefixSpan and FSPE algorithms
  • Scalability test
  • Measure the performance of FSPE with respect to
    different numbers of customers and average lengths
    of sequences

21
Experimental Environment
  • All experiments were conducted on a 3.0 GHz Intel
    Pentium 4 PC with 4 GB of DDR 400 MHz memory, running
    Microsoft Windows XP with SP2.

22
Datasets for Runtime Test
  • The synthetic datasets were generated with the
    IBM synthetic data generator, using the setting
    C1000SL1.5TL3NI0.02.
  • The average number of transactions per sequence is 1.5
    and the average number of items per transaction is 3.
    There are 20 unique items.

23
Runtime: PrefixSpan vs. FSPE
[Chart: runtime on dataset C1000SL1.5TL3NI0.02]
24
Scalability test for FSPE algorithm
[Chart: dataset setting SL3TL1NI0.02]
25
Scalability test for FSPE algorithm
[Chart: dataset setting C10TL1NI0.02]
26
Extensions and Discussion
  • We modify our approach to propose an incremental
    sequential pattern mining framework.
  • The issue of maintaining sequential patterns
    becomes essential because transactions may be
    updated over time.

27
How to calculate index(1/9)
Prefix 1
Original number of items = 3. Candidate 4-sequences:
(1,1,1,1), (1,1,1,2), (1,1,1,3), ...
[Figure: the candidates arranged along x, y, and z axes, starting at (1,1,1,1)]
28
How to calculate index(2/9)
Prefix 1
Original number of items = 3. Candidate 4-sequences:
(1,1,1,1), (1,1,1,2), (1,1,1,3), ...
29
How to calculate index(3/9)
Prefix 1
Original number of items = 3; new item = 4. New candidate 4-sequences:
(1,1,4,1), (1,1,4,2), (1,1,4,3),
(1,2,4,1), (1,2,4,2), (1,2,4,3),
(1,3,4,1), (1,3,4,2), (1,3,4,3)
30
How to calculate index(4/9)
[Figure: prefix 1; original number of items = 3, new item = 4]
31
How to calculate index(5/9)
Prefix 1
Original number of items = 3; new item = 4. New candidate 4-sequences:
(1,1,1,4), (1,1,2,4), (1,1,3,4), (1,1,4,4),
(1,2,1,4), (1,2,2,4), (1,2,3,4), (1,2,4,4),
(1,3,1,4), (1,3,2,4), (1,3,3,4), (1,3,4,4)
32
How to calculate index(6/9)
[Figure: prefix 1; original number of items = 3, new item = 4; index values shown: 36, 44, 43, 42, 41]
33
How to calculate index(7/9)
Prefix 1
Original number of items = 3; new item = 4. New candidate 4-sequences:
(1,4,1,1), (1,4,1,2), (1,4,1,3), (1,4,1,4),
(1,4,2,1), (1,4,2,2), (1,4,2,3), (1,4,2,4),
(1,4,3,1), (1,4,3,2), (1,4,3,3), (1,4,3,4),
(1,4,4,1), (1,4,4,2), (1,4,4,3), (1,4,4,4)
34
How to calculate index(8/9)
[Figure: prefix 1; original number of items = 3, new item = 4; index values shown: 61, 57, 53, 48, 49]
35
How to calculate index(9/9)
[Figure: prefix 1; original number of items = 4, new item = 5; regions labeled 1 to 8]
36
Conclusion and future work(1/3)
  • We propose a new approach to mine sequential
    patterns with an efficient indexing method for
    all candidates.

37
Conclusion and future work(2/3)
  • Advantages
  • Reads the database only once
  • No minimum support is specified to prune the
    candidate itemset sequences
  • Every candidate is enumerated with an efficient
    index into the support counters based on the
    corresponding prefix.
  • Users can specify any minimum support later in
    rule generation to extract sequence patterns as
    needed.
  • Disadvantage: FSPE can't deal with new items
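
A short sketch can show why new items are a problem for a purely positional index. It assumes the index within a prefix is computed by reading the items after the prefix as base-n digits, which is one reading of the index slides and not a confirmed detail of FSPE. Both function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def index_within_prefix(sequence: Sequence[int], num_items: int) -> int:
    """Assumed positional index: items after the prefix as base-n digits."""
    offset = 0
    for item in sequence[1:]:
        offset = offset * num_items + (item - 1)
    return offset + 1


def shifted_candidates(
    length: int, old_items: int, new_items: int
) -> List[Tuple[int, ...]]:
    """Existing prefix-1 candidates whose index changes when items grow."""
    shifted = []
    for tail in product(range(1, old_items + 1), repeat=length - 1):
        candidate = (1,) + tail
        before = index_within_prefix(candidate, old_items)
        after = index_within_prefix(candidate, new_items)
        if before != after:
            shifted.append(candidate)
    return shifted


before = index_within_prefix((1, 3, 2), 3)
after = index_within_prefix((1, 3, 2), 4)
moved = shifted_candidates(length=3, old_items=3, new_items=4)
```

Under this assumption, adding a fourth item moves the counter of (1,3,2) to a different position, and most existing counters would have to be relocated. An incremental scheme instead needs indexes that stay fixed for old candidates while new candidates receive fresh positions.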

38
Conclusion and future work(3/3)
  • Improved methods
  • Incremental mining algorithm
  • Parallel algorithm
  • In the future, we plan to extend the FSPE
    algorithm to mine more interesting patterns,
    such as time-interval sequential patterns,
    constrained sequential patterns, closed and
    maximal sequential patterns, and others.

39
Thank you
  • M9510692_at_fcu.edu.tw