Title: Multi-dimensional Sequential Pattern Mining
1Multi-dimensional Sequential Pattern Mining
- Helen Pinto, Jiawei Han, Jian Pei, Ke Wang,
Qiming Chen, Umeshwar Dayal
2Outline
- Why multidimensional sequential pattern mining?
- Problem definition
- Algorithms
- Experimental results
- Conclusions
3Why Sequential Pattern Mining?
- Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
- Many data and applications are time-related
  - Customer shopping patterns, telephone calling patterns
    - E.g., first buy computer, then CD-ROMs, software, within 3 mos.
  - Natural disasters (e.g., earthquake, hurricane)
  - Disease and treatment
  - Stock market fluctuation
  - Weblog click stream analysis
  - DNA sequence analysis
4Motivating Example
- Sequential patterns are useful
  - "free internet access → buy package 1 → upgrade to package 2"
  - Marketing, product design & development
- Problem: lack of focus
  - Various groups of customers may have different patterns
- MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining
5Sequences and Patterns
- Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
Elements: items within an element are listed alphabetically
A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
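The containment test and support count behind these definitions can be sketched in Python. This is a minimal illustration, not code from the original slides: the helper names `parse`, `is_subsequence`, and `support` are assumptions, and every item is assumed to be a single letter.

```python
def parse(text):
    """Parse e.g. '<a(abc)(ac)d(cf)>' into a list of item sets.

    Hypothetical helper; assumes every item is a single character.
    """
    body = text.strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if the pattern's elements are subsets of distinct elements of
    the sequence, in the same order (greedy earliest match)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


database = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

print(is_subsequence(parse("<a(bc)dc>"), database[10]))  # True
print(support(parse("<(ab)c>"), database))               # 2
```

With min_sup = 2, the support of 2 for <(ab)c> (sequences 10 and 30) is what makes it a sequential pattern.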
6Sequential Pattern Basics
A sequence database
<ad(ae)> is a subsequence of <a(bd)bcb(ade)>
Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern
7MD Sequence Database
- P = (*, Chicago, *, <bf>) matches tuples 20 and 30
- If the support threshold is 2, P is an MD sequential pattern
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
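Matching an MD sequential pattern against this table can be sketched as follows. The representation is an assumption made for illustration: each sequence is a list of strings (one string of single-letter items per element), "*" is the wildcard, and the function names `contains` and `matches` are hypothetical.

```python
def contains(sequence, pattern):
    """Greedy test: each pattern element is a subset of a later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def matches(md_pattern, row):
    """A row matches if every non-wildcard dimension value agrees and the
    row's sequence contains the pattern's sequence part."""
    *dims, seq_pattern = md_pattern
    *values, sequence = row
    dims_ok = all(d == "*" or d == v for d, v in zip(dims, values))
    return dims_ok and contains(sequence, seq_pattern)


rows = {
    10: ("Business", "Boston", "Middle", ["bd", "c", "b", "a"]),
    20: ("Professional", "Chicago", "Young", ["bf", "ce", "fg"]),
    30: ("Business", "Chicago", "Middle", ["ah", "a", "b", "f"]),
    40: ("Education", "New York", "Retired", ["be", "ce"]),
}

P = ("*", "Chicago", "*", ["b", "f"])  # (*, Chicago, *, <bf>)
matching = [cid for cid, row in rows.items() if matches(P, row)]
print(matching)  # [20, 30]
```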
8Mining of MD Seq. Pat.
- Embedding MD information into sequences
  - Using a uniform seq. pat. mining method
- Integration of seq. pat. mining and MD analysis methods
9UNISEQ
- Embed MD information into sequences
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
Mine the extended sequence database using sequential pattern mining methods
cid  MD-extension of sequences
10   <(Business,Boston,Middle)(bd)cba>
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>
40   <(Education,New York,Retired)(be)(ce)>
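The embedding step can be sketched as below. This is an illustrative reading of the slide, not the authors' implementation: the dimension values become one extra leading element, and the function names `embed` and `contains` are assumptions. The sketch also assumes dimension values never collide with item names.

```python
def embed(row):
    """Prepend the dimension values as one extra element of the sequence."""
    *dims, sequence = row
    return [frozenset(dims)] + [frozenset(element) for element in sequence]


def contains(sequence, pattern):
    """Greedy test: each pattern element is a subset of a later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


rows = {
    10: ("Business", "Boston", "Middle", ["bd", "c", "b", "a"]),
    20: ("Professional", "Chicago", "Young", ["bf", "ce", "fg"]),
    30: ("Business", "Chicago", "Middle", ["ah", "a", "b", "f"]),
    40: ("Education", "New York", "Retired", ["be", "ce"]),
}
extended = {cid: embed(row) for cid, row in rows.items()}

# In the extended database, (*, Chicago, *, <bf>) is the ordinary
# sequential pattern <(Chicago) b f>.
pattern = [frozenset({"Chicago"}), frozenset("b"), frozenset("f")]
count = sum(contains(seq, pattern) for seq in extended.values())
print(count)  # 2
```

Once the dimensions are embedded this way, any sequential pattern miner can be run on `extended` without knowing about the dimensions.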
10Mine Sequential Patterns by Prefix Projections
- Step 1: find length-1 sequential patterns
  - <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  - The ones having prefix <a>
  - The ones having prefix <b>
  - ...
  - The ones having prefix <f>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
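Step 1 can be sketched as a single counting pass. The snippet is an illustration under assumptions: sequences are written as lists of strings (one string of single-letter items per element), min_sup = 2 as in the earlier slides, and the name `frequent_items` is hypothetical.

```python
from collections import Counter


def frequent_items(database, min_sup):
    """Count, for each item, the number of sequences it occurs in, and
    keep the items that reach min_sup."""
    counts = Counter()
    for sequence in database.values():
        # A set, so an item is counted once per sequence.
        counts.update({item for element in sequence for item in element})
    return sorted(item for item, n in counts.items() if n >= min_sup)


database = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

print(frequent_items(database, min_sup=2))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Item g occurs only in sequence 40, so it is dropped, which leaves the six prefixes that partition the search space in Step 2.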
11Find Seq. Patterns with Prefix <a>
- Only need to consider projections w.r.t. <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  - Further partition into 6 subsets
    - Having prefix <aa>
    - ...
    - Having prefix <af>
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
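The projection and the length-2 step for a single-item prefix can be sketched as follows. This is a simplified illustration, not the full PrefixSpan procedure: it handles only a one-item prefix, assumes single-letter items listed alphabetically within each element, and the names `project`, `show`, and `length2_patterns` are hypothetical.

```python
from collections import Counter


def project(sequence, item):
    """Suffix after the first occurrence of item; the leftover items of the
    matched element are kept as a '_'-prefixed element."""
    for i, element in enumerate(sequence):
        if item in element:
            rest = "".join(x for x in element if x > item)
            head = ["_" + rest] if rest else []
            return head + sequence[i + 1:]
    return None  # item does not occur in the sequence


def show(sequence):
    """Render a sequence in the slide notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in sequence) + ">"


def length2_patterns(database, item, min_sup):
    """Frequent length-2 patterns with prefix <item>, found by one scan of
    the projected database."""
    seq_ext, set_ext = Counter(), Counter()
    for sequence in database.values():
        suffix = project(sequence, item)
        if suffix is None:
            continue
        seq_items, set_items = set(), set()
        for element in suffix:
            if element.startswith("_"):
                set_items.update(element[1:])      # same element as the prefix
            else:
                seq_items.update(element)          # a later element
                if item in element:                # later element holding item too
                    set_items.update(x for x in element if x > item)
        seq_ext.update(seq_items)
        set_ext.update(set_items)
    patterns = [f"<{item}{x}>" for x, n in seq_ext.items() if n >= min_sup]
    patterns += [f"<({item}{x})>" for x, n in set_ext.items() if n >= min_sup]
    return sorted(patterns)


database = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

print([show(project(seq, "a")) for seq in database.values()])
print(length2_patterns(database, "a", min_sup=2))
```

On the example database this reproduces the four projected sequences and the six length-2 patterns listed on the slide.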
12Completeness of PrefixSpan
SDB
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
- Having prefix <a>
  - <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    - Having prefix <aa>: <aa>-proj. db
    - ...
    - Having prefix <af>: <af>-proj. db
- Having prefix <b>
  - <b>-projected database
- Having prefix <c>, ..., <f>
13Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by bi-level projections
14Mining MD-Patterns
MD pattern: (*,Chicago,*)
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
Lattice of dimension combinations (figure):
(cust-grp,city,age-grp)
(cust-grp,city)
(cust-grp,*,age-grp)
(cust-grp,*,*)
(*,city,*)
(*,*,age-grp)
All
BUC processing
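Finding the frequent MD-patterns of this table can be sketched with an exhaustive count. Note the assumption: this is a brute-force enumeration of every generalization of every tuple, used here for clarity; it is not the BUC procedure named on the slide, and the name `frequent_md_patterns` is hypothetical.

```python
from collections import Counter
from itertools import product


def frequent_md_patterns(tuples, min_sup):
    """Count every generalization of every tuple ('*' replaces a value) and
    keep those supported by at least min_sup tuples."""
    counts = Counter()
    for values in tuples:
        for mask in product([False, True], repeat=len(values)):
            pattern = tuple(v if keep else "*" for v, keep in zip(values, mask))
            counts[pattern] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


md_tuples = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]

frequent = frequent_md_patterns(md_tuples, min_sup=2)
for pattern, n in sorted(frequent.items()):
    print(pattern, n)
```

With min_sup = 2 the result includes (*, Chicago, *) with support 2. The enumeration costs 2^d generalizations per tuple for d dimensions, which is why a pruning strategy is preferable on larger data.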
15Dim-Seq
- First find MD-patterns
  - E.g. (*,Chicago,*)
- Form projected sequence database
  - <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
- Find seq. pat. in projected database
  - E.g. (*,Chicago,*,<bf>)
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
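The last two Dim-Seq steps for the example MD-pattern can be sketched as below. The sketch rests on stated assumptions: the MD-pattern (*, Chicago, *) is taken as already found, and `mine` is a deliberately simple stand-in miner that only grows patterns by appending single-item elements (it is not PrefixSpan and does not find itemset patterns such as <(bf)>). All function names are hypothetical.

```python
def contains(sequence, pattern):
    """Greedy test: each pattern element is a subset of a later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def mine(sequences, min_sup, max_len=3):
    """Stand-in miner (assumption): level-wise growth by single-item elements."""
    items = sorted({x for seq in sequences for element in seq for x in element})
    frequent, frontier = [], [[]]
    for _ in range(max_len):
        frontier = [
            prefix + [x]
            for prefix in frontier
            for x in items
            if sum(contains(seq, prefix + [x]) for seq in sequences) >= min_sup
        ]
        frequent += frontier
    return frequent


rows = {
    10: ("Business", "Boston", "Middle", ["bd", "c", "b", "a"]),
    20: ("Professional", "Chicago", "Young", ["bf", "ce", "fg"]),
    30: ("Business", "Chicago", "Middle", ["ah", "a", "b", "f"]),
    40: ("Education", "New York", "Retired", ["be", "ce"]),
}
md_pattern = ("*", "Chicago", "*")

# Projected sequence database: sequences of the tuples matching the MD-pattern.
projected = [
    row[-1]
    for row in rows.values()
    if all(p == "*" or p == v for p, v in zip(md_pattern, row))
]

# Each sequential pattern found there is joined with the MD-pattern.
md_seq_patterns = [md_pattern + ("".join(p),) for p in mine(projected, min_sup=2)]
print(md_seq_patterns)  # includes ('*', 'Chicago', '*', 'bf')
```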
16Seq-Dim
- Find sequential patterns
  - E.g. <bf>
- Form projected MD-database
  - E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
- Mine MD-patterns
  - E.g. (*,Chicago,*,<bf>)
cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
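The last two Seq-Dim steps for the example pattern <bf> can be sketched in the same spirit. Assumptions: <bf> is taken as an already-found sequential pattern, MD-patterns are found by exhaustive enumeration rather than by any particular cube algorithm, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def contains(sequence, pattern):
    """Greedy test: each pattern element is a subset of a later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def frequent_md_patterns(tuples, min_sup):
    """Exhaustively count all generalizations ('*' replaces a value)."""
    counts = Counter()
    for values in tuples:
        for mask in product([False, True], repeat=len(values)):
            counts[tuple(v if keep else "*" for v, keep in zip(values, mask))] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


rows = {
    10: ("Business", "Boston", "Middle", ["bd", "c", "b", "a"]),
    20: ("Professional", "Chicago", "Young", ["bf", "ce", "fg"]),
    30: ("Business", "Chicago", "Middle", ["ah", "a", "b", "f"]),
    40: ("Education", "New York", "Retired", ["be", "ce"]),
}
seq_pattern = ["b", "f"]  # <bf>

# Projected MD-database: dimension values of the tuples whose sequence contains <bf>.
projected_md = [row[:-1] for row in rows.values() if contains(row[-1], seq_pattern)]

md_patterns = frequent_md_patterns(projected_md, min_sup=2)
print(projected_md)
print(md_patterns)  # contains ('*', 'Chicago', '*') with support 2
```

Because only tuples 20 and 30 contain <bf>, the MD-pattern search runs on two tuples instead of four, which illustrates why mining sequences first can shrink the later step.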
17Scalability Over Dimensionality
18Scalability Over Cardinality
19Scalability Over Support Threshold
20Scalability Over Database Size
21Pros & Cons of Algorithms
- Seq-Dim is efficient and scalable
  - Fastest in most cases
- UniSeq is also efficient and scalable
  - Fastest with low dimensionality
- Dim-Seq has poor scalability
22Conclusions
- MD seq. pat. mining is interesting and useful
- Mining MD seq. pat. efficiently
  - UniSeq, Dim-Seq, and Seq-Dim
- Future work
  - Applications of sequential pattern mining
23References (1)
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499.
- R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14.
- C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
- M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234.
- J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115.
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
24References (2)
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12.
- H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 121-127.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
- B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421.
- J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.