Title: Pattern-Growth Methods for Sequential Pattern Mining
1 Pattern-Growth Methods for Sequential Pattern Mining
2 Outline
- Sequential pattern mining
- Apriori-like methods
  - GSP
- Pattern-growth methods
  - FreeSpan
  - PrefixSpan
- Performance analysis
- Conclusions
3 Motivation
- Sequential pattern mining: finding time-related frequent patterns
- Most data and applications are time-related
  - Customer shopping patterns, telephone calling patterns
  - Natural disasters (e.g., earthquake, hurricane)
  - Disease and treatment
  - Stock market fluctuation
  - Weblog click stream analysis
  - DNA sequence analysis
4 Concepts
- Let I = {i1, i2, ..., in} be the set of all items
- An itemset is a subset of I
- A sequence is an ordered list of itemsets. The itemsets are called elements. The number of items in the sequence is its length
  - e.g. <(ef)(ab)(df)cb>
- A sequence α = <a1 a2 ... an> is called a subsequence of β = <b1 b2 ... bm>, denoted α ⊑ β, if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn
  - e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
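The subsequence definition above can be checked mechanically. The following is a minimal sketch, assuming a sequence is encoded as a list of strings, one string of items per element (so <a(bc)dc> becomes ["a", "bc", "d", "c"]); the function name and the encoding are illustrative choices, not part of the original slides.

```python
def is_subsequence(alpha, beta):
    """Return True if sequence alpha is a subsequence of sequence beta.

    Assumed encoding: a sequence is a list of elements, and each element
    is a string of items, e.g. <a(bc)dc> -> ["a", "bc", "d", "c"].
    """
    j = 0  # next position in beta that may still be matched
    for element in alpha:
        # Greedily skip elements of beta that do not contain this element.
        while j < len(beta) and not set(element) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the indices j1 < j2 < ... must be strictly increasing
    return True


alpha = ["a", "bc", "d", "c"]             # <a(bc)dc>
beta = ["a", "abc", "ac", "d", "cf"]      # <a(abc)(ac)d(cf)>
print(is_subsequence(alpha, beta))
```

Matching each element of α at the earliest possible element of β is sufficient: if any valid choice of indices exists, the greedy choice also succeeds.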
5 Concepts (cont)
- A sequence database is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple is said to contain a sequence α if α is a subsequence of s
- The support of α is the number of tuples in the database containing α
- If the support of α is no less than a threshold min_sup, α is called a sequential pattern
  - e.g. <(ab)c> is a sequential pattern given support threshold min_sup = 2
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
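As a small illustration of the support definition, the sketch below counts how many sequences of the example database contain a given pattern. The list-of-strings encoding and the names DB, contains and support are assumptions made for this example.

```python
# Example database from the slide; each element is a string of items (assumed encoding).
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    j = 0
    for element in pattern:
        while j < len(seq) and not set(element) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def support(pattern, db):
    """Number of tuples in db whose sequence contains pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


print(support(["ab", "c"], DB))  # support of <(ab)c>
```

With min_sup = 2, <(ab)c> qualifies because it is contained in the sequences with SID 10 and SID 30.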
6 Problem definition
- Given a sequence database and a min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database
7 Apriori-like methods
- Apriori property: if a sequence S is not frequent, then every super-sequence of S is not frequent
  - e.g. if <bh> is infrequent, so are <abh> and <b(dh)>
- GSP (Generalized Sequential Pattern) algorithm
  - Level by level, do:
    - Generate candidate sequences
    - Use the Apriori property to prune candidates
    - Scan the database to collect support counts
8 GSP Mining Process
9 Bottlenecks of Apriori-Like Methods
- Potentially huge set of candidate sequences
  - 1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates
- Multiple scans of the database
- Difficulties in mining long sequential patterns
  - Exponential number of short candidates
  - A length-100 sequential pattern needs 2^100 − 1 ≈ 10^30 candidate sequences
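The size of the candidate sets can be reproduced with a few lines of arithmetic. This sketch assumes the usual counting argument: n frequent items give n × n candidates of the form <xy> plus n(n − 1)/2 candidates of the form <(xy)>, and a length-k pattern requires every one of its non-empty sub-patterns as a candidate, that is, the sum of C(k, i) for i = 1..k. The function names are illustrative.

```python
from math import comb


def length2_candidates(n):
    """Length-2 candidates from n frequent items: n*n of <xy> plus n*(n-1)/2 of <(xy)>."""
    return n * n + n * (n - 1) // 2


def candidates_for_pattern(k):
    """Candidates needed to reach one length-k pattern: sum of C(k, i), i = 1..k."""
    return sum(comb(k, i) for i in range(1, k + 1))


print(length2_candidates(1000))
print(candidates_for_pattern(100) == 2**100 - 1)
```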
10 Pattern-growth methods
- A divide-and-conquer approach
  - Recursively project a sequence database into a set of smaller databases
  - Mine each projected database to find the subset of patterns
- Algorithms
  - FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
  - PrefixSpan: Prefix-Projected Sequential Pattern Mining
11 FreeSpan
- Example: given a sequence database S and min_support = 2
- Step 1: find length-1 sequential patterns and list them in support-descending order
  - f_list: a:4, b:4, c:4, d:3, e:3, f:3
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
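Step 1 amounts to one scan that counts, for each item, the number of sequences containing it. A minimal sketch follows, assuming the list-of-strings encoding of the example database; breaking ties alphabetically is an assumption that happens to reproduce the order shown on the slide.

```python
from collections import Counter

# Example database; each element is a string of items (assumed encoding).
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def build_f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db.values():
        # Count each item at most once per sequence.
        counts.update({item for element in seq for item in element})
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # Ties are broken alphabetically (an assumption, not stated in the slides).
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


f_list = build_f_list(DB, min_sup=2)
print(f_list)
```

Item g occurs in only one sequence, so it is dropped at min_support = 2.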
12 FreeSpan (cont)
- Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets:
  - ones that contain only item a
  - ones that contain item b but no items after b in f_list
  - ones that contain item c but no items after c in f_list
  - ones that contain item d but no items after d in f_list
  - ones that contain item e but no items after e in f_list
  - ones that contain item f
- Find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
13 FreeSpan (cont)
- Finding sequential patterns containing item b but no items after b in f_list
  - <b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab>
  - Find all the length-2 sequential patterns containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2
  - Further partition and mining
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
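The <b>-projected database on this slide can be reproduced by keeping, in every sequence that contains b, only the items up to b in f_list. The sketch below does that and then counts the three length-2 patterns directly; the helper names and the encoding are assumptions, and the direct counting is only a check on the slide's numbers, not FreeSpan's actual mining procedure.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],      # SID 10
    ["ad", "c", "bc", "ae"],            # SID 20
    ["ef", "ab", "df", "c", "b"],       # SID 30
    ["e", "g", "af", "c", "b", "c"],    # SID 40
]
F_LIST = ["a", "b", "c", "d", "e", "f"]


def project_onto(seq, keep):
    """Drop every item not in keep, and drop elements that become empty."""
    projected = []
    for element in seq:
        kept = "".join(i for i in element if i in keep)
        if kept:
            projected.append(kept)
    return projected


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy match)."""
    j = 0
    for element in pattern:
        while j < len(seq) and not set(element) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


item = "b"
keep = set(F_LIST[: F_LIST.index(item) + 1])   # items up to b in f_list
b_db = [project_onto(s, keep) for s in DB if any(item in el for el in s)]

patterns = {"<ab>": ["a", "b"], "<ba>": ["b", "a"], "<(ab)>": ["ab"]}
counts = {name: sum(contains(s, p) for s in b_db) for name, p in patterns.items()}
print(b_db)
print(counts)
```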
14 From FreeSpan to PrefixSpan
- FreeSpan
  - Projection-based: no candidate sequence needs to be generated
  - But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the f-projected database is the same size as the original sequence database
- PrefixSpan
  - Projection-based
  - But only prefix-based projection: fewer projections and quickly shrinking sequences
15 PrefixSpan-concepts
- Suppose all items in an element are listed alphabetically.
- Given a sequence α = <e1 e2 ... en> and β = <e1' e2' ... em'> (m ≤ n)
- Prefix: β is a prefix of α iff (1) ei' = ei (i ≤ m−1); (2) em' ⊆ em; (3) all items in (em − em') are alphabetically after those in em'.
  - e.g. for α = <a(abc)(ac)d(cf)>, β = <a(ab)> is a prefix of α, whereas <a(bc)> is not (condition 3 fails)
- Postfix: the sequence γ = <em'' em+1 ... en> is called the postfix of α w.r.t. prefix β, denoted γ = α/β, where em'' = (em − em'). We also write α = β·γ
  - e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t. prefix <a(ab)>
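The prefix and postfix definitions translate almost line by line into code. The following sketch assumes elements are alphabetically sorted strings and writes a partial first element with a leading underscore, mirroring the (_c) notation; the function name postfix is hypothetical.

```python
def postfix(seq, prefix):
    """Postfix of seq w.r.t. prefix, or None if prefix is not a prefix of seq.

    Assumed encoding: lists of alphabetically sorted item strings; a partial
    first element of the postfix is written with a leading "_".
    """
    m = len(prefix)
    if m == 0 or m > len(seq):
        return None
    # (1) the first m-1 elements must be identical
    if list(seq[: m - 1]) != list(prefix[: m - 1]):
        return None
    last, element = prefix[-1], seq[m - 1]
    # (2) the last prefix element must be a subset of the m-th element
    if not set(last) <= set(element):
        return None
    rest = sorted(set(element) - set(last))
    # (3) leftover items must come alphabetically after those in the prefix element
    if any(item < max(last) for item in rest):
        return None
    head = ["_" + "".join(rest)] if rest else []
    return head + list(seq[m:])


alpha = ["a", "abc", "ac", "d", "cf"]       # <a(abc)(ac)d(cf)>
print(postfix(alpha, ["a", "ab"]))          # prefix <a(ab)>
print(postfix(alpha, ["a", "bc"]))          # <a(bc)> violates condition (3)
```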
16 PrefixSpan-concepts (cont)
- Projected database: let α be a sequential pattern in S. The α-projected database, denoted S|α, is the collection of postfixes of sequences in S w.r.t. prefix α
- Support count in projected database: let α be a sequential pattern in S, and β be a sequence having prefix α. The support count of β in the α-projected database is the number of sequences γ in S|α such that β ⊑ α·γ
17 PrefixSpan-process
- Step 1: find length-1 sequential patterns
  - <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
- Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  - ones having prefix <a>
  - ones having prefix <b>
  - ...
  - ones having prefix <f>
- Find the subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
18 PrefixSpan-Process (cont)
- Finding sequential patterns with prefix <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  - Find all the length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
  - Further partition into 6 subsets
    - Having prefix <aa>
    - ...
    - Having prefix <af>
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
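One level of this process can be sketched as follows: build the <a>-projected database by physical projection, then count which items extend <a>, either as a new element (<ax>) or inside the same element (<(ax)>). The encoding, the underscore marker and the helper names are assumptions; the sketch handles a one-item prefix only.

```python
from collections import Counter

DB = [
    ["a", "abc", "ac", "d", "cf"],      # SID 10
    ["ad", "c", "bc", "ae"],            # SID 20
    ["ef", "ab", "df", "c", "b"],       # SID 30
    ["e", "g", "af", "c", "b", "c"],    # SID 40
]


def project(seq, item):
    """Postfix of seq w.r.t. the one-item prefix <item>, or None if item is absent."""
    for k, element in enumerate(seq):
        if item in element:
            rest = element[element.index(item) + 1:]   # items after `item` in this element
            head = ["_" + rest] if rest else []
            return head + seq[k + 1:]
    return None


def count_extensions(projected, item):
    """Per-item supports of <item x> (new element) and <(item x)> (same element)."""
    seq_ext, set_ext = Counter(), Counter()
    for post in projected:
        s_items, i_items = set(), set()
        for element in post:
            if element.startswith("_"):
                i_items.update(element[1:])            # continues the prefix element
            else:
                s_items.update(element)
                if item in element:                    # element holds item plus later items
                    i_items.update(x for x in element if x > item)
        seq_ext.update(s_items)
        set_ext.update(i_items)
    return seq_ext, set_ext


min_sup = 2
a_db = [p for p in (project(s, "a") for s in DB) if p is not None]
seq_ext, set_ext = count_extensions(a_db, "a")
length2 = {f"<a{x}>": n for x, n in sorted(seq_ext.items()) if n >= min_sup}
length2.update({f"<(a{x})>": n for x, n in sorted(set_ext.items()) if n >= min_sup})
print(a_db)
print(length2)
```

The second branch in count_extensions matters for SID 10: its postfix starts with the full element (abc), which still supports <(ab)> even though there is no (_b) element.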
19 Completeness of PrefixSpan
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  - Prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      - Prefix <aa>: <aa>-projected database
      - ...
      - Prefix <af>: <af>-projected database
  - Prefix <b>: <b>-projected database
  - Prefix <c>, ..., <f>
20 Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by bi-level projections and pseudo-projections
21 Optimization Techniques in PrefixSpan
- Single-level vs. bi-level projection
  - Bi-level projection with 3-way checking may reduce the number and size of projected databases
- Physical projection vs. pseudo-projection
  - Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
22 S-matrix for sequence database
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
S-matrix
a  2
b  (4, 2, 2)  1
c  (4, 2, 1)  (3, 3, 2)  3
d  (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e  (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f  (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
   a          b          c          d          e          f
- Diagonal entry for a is 2: <aa> happens twice
- Entry in row c, column a is (4, 2, 1): <ac> happens 4 times, <ca> happens twice, <(ac)> happens once
All length-2 sequential patterns are found in the S-matrix
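The entries of the S-matrix can be cross-checked by counting supports directly. The sketch below does this by brute force for every pair of frequent items; it is only a verification aid under the list-of-strings encoding, and the names contains, support and s_matrix are assumptions, not the single-scan construction the method itself would use.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],      # SID 10
    ["ad", "c", "bc", "ae"],            # SID 20
    ["ef", "ab", "df", "c", "b"],       # SID 30
    ["e", "g", "af", "c", "b", "c"],    # SID 40
]


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy match)."""
    j = 0
    for element in pattern:
        while j < len(seq) and not set(element) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1
    return True


def support(pattern, db):
    return sum(contains(seq, pattern) for seq in db)


def s_matrix(db, items):
    """S[(x, x)] = sup<xx>; S[(x, y)] = (sup<xy>, sup<yx>, sup<(xy)>) for x < y."""
    S = {}
    for i, x in enumerate(items):
        S[(x, x)] = support([x, x], db)
        for y in items[i + 1:]:
            S[(x, y)] = (support([x, y], db), support([y, x], db), support([x + y], db))
    return S


S = s_matrix(DB, ["a", "b", "c", "d", "e", "f"])
print(S[("a", "a")], S[("a", "b")], S[("a", "c")])
```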
23 S-matrix for <ab>-projected database
- <ab>-projected database
  - <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
- Frequent items: <a>, <c>, <(_c)>
- S-matrix
a     0
c     (1, 0, 1)   1
(_c)  (∅, 2, ∅)   (∅, 1, ∅)   ∅
      a           c           (_c)
- Entry in row (_c), column a is (∅, 2, ∅): the ∅ positions are not counted (no a(_c)), and the 2 leads to pattern <a(bc)a>
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
24 Scaling-up by Bi-level Projection
- Partition the search space based on length-2 sequential patterns
- Only form projected databases and pursue recursive mining over bi-level projected databases
25 Benefits of Bi-level Projection
- More patterns are found in each pass
- Far fewer projections
  - In the example, there are 53 patterns
  - 53 level-by-level projections
  - 22 bi-level projections
26 3-way Apriori Checking
- Using the Apriori heuristic to prune items in projected databases
- e.g. <acd> cannot be a pattern w.r.t. min_support = 2, because the S-matrix gives <cd> a support of only 1; exclude d from the <ac>-projected database
a  2
b  (4, 2, 2)  1
c  (4, 2, 1)  (3, 3, 2)  3
d  (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e  (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f  (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
   a          b          c          d          e          f
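The pruning rule in the example can be sketched as a lookup in the S-matrix: an item z is worth keeping in the <xy>-projected database for a sequence extension only if both <xz> and <yz> are frequent. The code below hard-codes a few cells from the matrix on this slide; the dictionary layout and the function names are assumptions, and only sequence extensions (not itemset extensions) are covered.

```python
# A few off-diagonal cells copied from the S-matrix above.
# S[(x, y)] with x < y holds (sup<xy>, sup<yx>, sup<(xy)>).
S = {
    ("a", "b"): (4, 2, 2),
    ("a", "c"): (4, 2, 1),
    ("b", "c"): (3, 3, 2),
    ("a", "d"): (2, 1, 1),
    ("c", "d"): (1, 3, 0),
}


def sup_seq(x, y):
    """Support of the length-2 sequence <xy>, read from the S-matrix (x != y)."""
    if x < y:
        return S[(x, y)][0]
    return S[(y, x)][1]


def may_follow(x, y, z, min_sup):
    """Necessary condition for <xyz> to be frequent: <xz> and <yz> are both frequent."""
    return sup_seq(x, z) >= min_sup and sup_seq(y, z) >= min_sup


print(may_follow("a", "c", "d", min_sup=2))   # d in the <ac>-projected database?
print(may_follow("a", "b", "c", min_sup=2))   # c in the <ab>-projected database?
```

The check is a filter, not a proof of frequency: passing it only means the item cannot be excluded by the Apriori property.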
27 Pseudo-projection
- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When the projected database fits in memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix
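A compact recursive miner can illustrate pseudo-projection. In this sketch a projected database is a list of (sequence index, offset) pairs instead of copied postfixes; the offset is taken to be the index of the element where the last prefix element was matched earliest, which is a simplification chosen for this example rather than the exact layout used by the authors. All names are hypothetical.

```python
from collections import defaultdict

DB = [
    ["a", "abc", "ac", "d", "cf"],      # SID 10
    ["ad", "c", "bc", "ae"],            # SID 20
    ["ef", "ab", "df", "c", "b"],       # SID 30
    ["e", "g", "af", "c", "b", "c"],    # SID 40
]


def prefixspan(db, min_sup):
    """Mine all sequential patterns; a pattern is a tuple of item strings."""
    patterns = {}

    def grow(prefix, pointers):
        # pointers: (sid, pos) pairs; pos = element index of the earliest match
        # of the last prefix element (-1 for the empty prefix).
        last = prefix[-1] if prefix else ""
        s_ext, i_ext = defaultdict(list), defaultdict(list)
        for sid, pos in pointers:
            seq = db[sid]
            seen_s, seen_i = set(), set()
            for k in range(max(pos, 0), len(seq)):
                element = seq[k]
                # Itemset extension: element holds the last prefix element plus a later item.
                if prefix and set(last) <= set(element):
                    for x in element:
                        if x > last[-1] and x not in seen_i:
                            seen_i.add(x)
                            i_ext[x].append((sid, k))
                # Sequence extension: any item in a strictly later element.
                if k > pos:
                    for x in element:
                        if x not in seen_s:
                            seen_s.add(x)
                            s_ext[x].append((sid, k))
        for x, ptrs in sorted(s_ext.items()):
            if len(ptrs) >= min_sup:
                new = prefix + (x,)
                patterns[new] = len(ptrs)
                grow(new, ptrs)
        for x, ptrs in sorted(i_ext.items()):
            if len(ptrs) >= min_sup:
                new = prefix[:-1] + (last + x,)
                patterns[new] = len(ptrs)
                grow(new, ptrs)

    grow((), [(sid, -1) for sid in range(len(db))])
    return patterns


patterns = prefixspan(DB, min_sup=2)
print(patterns[("ab", "c")], patterns[("a", "bc", "a")])
```

No postfix is ever copied: each recursive call receives only a new list of pointers into the original sequences, which is the saving the slide describes for databases that fit in memory.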
28 Pseudo-Projection vs. Physical Projection
- Pseudo-projection avoids physically copying postfixes
  - Efficient when the database fits in main memory
  - Not efficient when the database cannot fit in main memory
    - Disk-based random access is very costly
- Suggested approach
  - Integration of physical and pseudo-projection
  - Swapping to pseudo-projection when the data set fits in memory
29 Experiments
- Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, Mining sequential patterns, ICDE'95
  - number of items: 1,000
  - number of sequences in the data set: 10,000
  - average number of items within elements: 8
  - average number of elements in a sequence: 8
30 Experiments (cont)
- Comparing PrefixSpan with GSP and FreeSpan in large databases
  - GSP (IBM Almaden, Srikant and Agrawal, EDBT'96)
  - FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
  - Prefix-Span-1 (single-level projection)
  - Prefix-Span-2 (bi-level projection)
- Comparing effects of pseudo-projection
- Comparing I/O cost and scalability
31 PrefixSpan Is Faster Than GSP and FreeSpan
32 Effect of Pseudo-Projection When the Projected Database Fits in Memory
33 I/O Cost When It Cannot Fit in Memory
34 Scalability (When DB Is Large)
min_sup = 0.2
35 Conclusions
- Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods on the sequential pattern mining problem
- PrefixSpan is more elegant than FreeSpan
- The Apriori heuristic is integrated into bi-level projection in PrefixSpan
- Pseudo-projection substantially enhances the performance of memory-based processing
36 References
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
- J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
37 Q&A
38 Thanks