Title: A Short Introduction to Sequential Data Mining
1 A Short Introduction to Sequential Data Mining
- Koji IWANUMA
- Hidetomo NABESHIMA
- University of Yamanashi
- The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence
2 Two Main Frameworks of Sequential Mining
- Sequential pattern mining for multiple data sequences
- Sequential pattern mining for a single data sequence
Sequence ID   Purchase data record
1             <bread, cheese>
2             <(wheat, milk), bread, (berry, sausage)>
3             <(bread, pumpkin, sausage)>
4             <bread, cheese, sausage>
5             <cheese>
Data sequence
<S1 S2 S3 S4 S5 S6 S7 ... Sn>
3 What Is Sequential Pattern Mining?
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji
- Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
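The two claims on this slide can be checked mechanically. The following is a minimal sketch, not taken from the cited textbook: it assumes every item is a single character, and the helper names `parse`, `is_subsequence`, and `support` are illustrative only.

```python
from typing import Dict, FrozenSet, List

Element = FrozenSet[str]   # one element = an unordered set of items
Seq = List[Element]        # a sequence = an ordered list of elements


def parse(text: str) -> Seq:
    """Parse compact notation such as 'a(abc)(ac)d(cf)'.

    Assumption: every item is exactly one character.
    """
    seq: Seq = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """Match each pattern element, in order, to the earliest later
    element of `sequence` that contains it as a subset."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(pattern: Seq, database: Dict[int, Seq]) -> int:
    """Number of data sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database from the slide.
database = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), database[10]))  # True
print(support(parse("(ab)c"), database))               # 2
```

Only sequences 10 and 30 contain an element that includes both a and b followed later by c, so the support of <(ab)c> is 2 and it meets min_sup = 2.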
4 Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient, scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji
5 Sequential Pattern Mining Algorithms for Multiple Data Sequences
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji
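To give a feel for the pattern-growth idea, here is a heavily simplified sketch in the spirit of PrefixSpan. It is not the published algorithm: it assumes every element holds exactly one item (no item sets), represents each projected database as a list of (sequence index, offset) pairs, and uses a small hypothetical purchase database loosely modelled on the earlier bread/cheese example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return {pattern: support} for all patterns with support >= min_sup.

    Simplifying assumption: each element of a sequence is a single item.
    """
    results: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[Tuple[int, int]]) -> None:
        # `projected` holds one (sequence index, start offset) pair per
        # sequence that contains `prefix`; the suffix from `start` onward
        # is that sequence's entry in the projected database.
        occurrences: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
        for sid, start in projected:
            seen = set()
            seq = database[sid]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:  # count each sequence once per item
                    seen.add(item)
                    occurrences[item].append((sid, pos + 1))
        for item, next_projected in sorted(occurrences.items()):
            if len(next_projected) >= min_sup:
                pattern = prefix + (item,)
                results[pattern] = len(next_projected)
                grow(pattern, next_projected)  # extend the frequent prefix

    grow((), [(sid, 0) for sid in range(len(database))])
    return results


# Hypothetical data: flattened purchase records, one item per element.
purchases = [
    ["bread", "cheese"],
    ["milk", "bread", "sausage"],
    ["bread", "cheese", "sausage"],
    ["cheese"],
]

for pattern, count in prefixspan(purchases, min_sup=2).items():
    print(pattern, count)
```

Each recursive call scans only the suffixes that follow the current prefix, which is the key difference from an Apriori-style method that generates candidates and rescans the whole database.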
6 Mining Sequential Patterns from a Very Long Single Sequence
A series of daily newspaper articles:
< ... typhoon ... (flood, landslide) ... typhoon ... (flood, landslide) ... >
Extracted pattern: <typhoon (flood, landslide)>
7 Sequential Pattern Mining Algorithms for a Single Data Sequence
- Discovery of frequent episodes in event sequences, based on a sliding window system (Mannila 1998)
  - The frequency measure becomes anti-monotonic, but has a problem: a single occurrence can be counted more than once.
- Asynchronous periodic pattern mining (Yang et al. 2000, Huang 2004)
  - No anti-monotonic frequency measure has been investigated.
- On-line approximation algorithm for mining frequent items, not for frequent subsequences
  - Lossy counting algorithm (Manku and Motwani, VLDB'02)
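The duplicate-counting problem of window-based frequency can be seen on a toy example. The sketch below is a simplification and does not reproduce the published definition: it assumes one event per time step, counts only windows that lie fully inside the sequence, and uses hypothetical event data. The names `occurs_in_order` and `window_count` are illustrative.

```python
from typing import List, Sequence


def occurs_in_order(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if the events of `episode` appear in `window` in this order."""
    remaining = iter(window)
    # `event in remaining` consumes the iterator up to the first match,
    # so each later event is searched for only after the earlier one.
    return all(event in remaining for event in episode)


def window_count(sequence: List[str], episode: Sequence[str], width: int) -> int:
    """Number of length-`width` windows that contain `episode`."""
    return sum(
        occurs_in_order(episode, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical single event sequence with ONE real typhoon-then-flood occurrence.
events = ["sunny", "typhoon", "flood", "sunny", "sunny"]

print(window_count(events, ("typhoon", "flood"), width=3))  # 2
print(window_count(events, ("typhoon",), width=3))          # 2
```

The single occurrence of typhoon followed by flood falls inside two overlapping windows, so it is counted twice. The measure is still anti-monotonic in this sketch: any window containing the longer episode also contains its sub-episode, so the count of a sub-episode can never be smaller.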
8 Research in Our Laboratory
- Sequential data mining from a very large single data sequence.
  - Main target: sequential textual data, especially newspaper-article corpora
  - Objective: to generate a robust and useful large-scale event-sequence corpus.
    - Application 1: topic tracking/detection in information retrieval.
    - Application 2: automated content tracking on the Web.
    - Application 3: semi-automatic scenario/story creation
  - Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.
9 Technical Topics (1/2)
- A new framework for extracting frequent subsequences from a single long data sequence, in IEEE Inter. Conf. on Data Mining 2005 (ICDM2005)
  - A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting.
  - A fast on-line algorithm for a limited case
10 Technical Topics (2/2)
- Ongoing and future work
  - On-line rational filters based on confidence criteria and/or information gain for eliminating redundant, valueless sequences from system output
  - Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
    - A method using compression based on context-free grammar inference/learning
  - A faster extraction algorithm based on a method for simultaneously searching multiple strings over compressed data.
11 References
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/hanj
12 Thanks for your attention!!