A Short Introduction to Sequential Data Mining

1
A Short Introduction to Sequential Data Mining

Koji IWANUMA
Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge
Discovery in System Biology, September 17,
Aix-en-Provence

2
Two Main Frameworks of Sequential Mining

Sequential pattern mining for multiple data sequences

Sequence ID   Purchase data record
1             <bread, cheese>
2             <(wheat, milk), bread, (berry, sausage)>
3             <(bread, pumpkin, sausage)>
4             <bread, cheese, sausage>
5             <cheese>

Sequential pattern mining for a single data sequence

Data sequence: <S1 S2 S3 S4 S5 S6 S7 ... Sn>
3
What Is Sequential Pattern Mining?
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji

Given a set of sequences, find the complete set
of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.

A sequence database:

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
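The two claims on this slide can be checked with a short Python sketch. It assumes a representation that the slides do not prescribe: each sequence is a list of item sets, and the helper names `parse`, `is_subsequence`, and `support` are hypothetical, introduced only for illustration.

```python
def parse(text):
    """Turn a string such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if every pattern element is contained, in order, in some
    element of the sequence (greedy leftmost matching)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern, database):
    """Number of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide, keyed by SID.
database = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

contained = is_subsequence(parse("a(bc)dc"), database[10])
sup_ab_c = support(parse("(ab)c"), database)
```

Here `contained` is true, and `sup_ab_c` is 2 (sequences 10 and 30), so <(ab)c> meets min_sup = 2.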
4
Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji
5
Sequential Pattern Mining Algorithms for Multiple Data Sequences

Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/hanji
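To give a feel for the pattern-growth idea listed above, here is a minimal Python sketch in the spirit of PrefixSpan. It is a deliberate simplification, not the published algorithm: every element is a single item (no item sets), projected databases are copied as plain suffix lists, and the function name `prefixspan` and the toy purchase data are assumptions made for illustration.

```python
from collections import Counter


def prefixspan(sequences, min_sup):
    """Return {pattern tuple: support} for all frequent sequential
    patterns, where each sequence is a list of single items."""
    patterns = {}

    def mine(prefix, projected):
        # Count, for each item, how many projected suffixes contain it.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = sup
            # Project: keep what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)

    mine((), [list(s) for s in sequences])
    return patterns


# Hypothetical toy data (single-item elements only).
toy_db = [
    ["bread", "cheese"],
    ["bread", "cheese", "sausage"],
    ["cheese"],
    ["bread", "sausage"],
]
found = prefixspan(toy_db, min_sup=2)
```

With min_sup = 2, `found` contains the three single items plus ("bread", "cheese") and ("bread", "sausage"). The pattern ("cheese", "sausage") occurs in only one sequence and is pruned, so no longer pattern is ever grown from it.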
6
Mining Sequential Patterns from a Very-Long Single Sequence

A series of daily newspaper articles:
< ... typhoon ... (flood, landslide) ... typhoon ... (flood, landslide) ... >
<typhoon (flood, landslide)>
7
Sequential Pattern Mining Algorithms for a Single Data Sequence

Discovery of frequent episodes in event sequences, based on a sliding window system (Mannila 1998)
  The frequency measure becomes anti-monotonic, but has a problem: the same occurrence can be counted more than once.
Asynchronous periodic pattern mining (Yang et al. 2000; Huang 2004)
  No anti-monotonic frequency measure is investigated.
On-line approximation algorithms for mining frequent items, not for frequent subsequences
  Lossy counting algorithm (Manku and Motwani, VLDB'02)
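The last item above, lossy counting, can be sketched in a few lines of Python. This is a minimal reading of the Manku and Motwani scheme under the usual textbook formulation (bucket width ceil(1/epsilon), pruning at bucket boundaries). The function names and the toy stream are assumptions for illustration, and, as the slide notes, the method counts single items, not subsequences.

```python
import math


def lossy_count(stream, epsilon):
    """One pass over the stream; returns ({item: [count, max_error]}, n)."""
    width = math.ceil(1 / epsilon)      # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been missed in up to bucket - 1 earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            entries = {k: v for k, v in entries.items() if v[0] + v[1] > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Items whose stored count reaches (support - epsilon) * n."""
    return {k for k, (count, _) in entries.items() if count >= (support - epsilon) * n}


# Hypothetical toy stream: "a" is frequent, everything else occurs once.
stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "f"]
entries, n = lossy_count(stream, epsilon=0.2)
result = frequent_items(entries, n, support=0.4, epsilon=0.2)
```

On this stream only "a" survives the two pruning steps, so memory stays small while the stored count undercounts the true frequency by at most epsilon * n.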

8
Research in Our Laboratory

Sequential data mining from a very large single data sequence.
Main target: sequential textual data, especially newspaper-article corpora
Objective: to generate a robust and useful large-scale event-sequence corpus.
Application 1: topic tracking/detection in information retrieval.
Application 2: automated content tracking on the Web.
Application 3: semi-automatic scenario/story creation.
Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

9
Technical Topics (1/2)

A new framework for extracting frequent subsequences from a single long data sequence (IEEE International Conference on Data Mining 2005, ICDM 2005)
A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting.
A fast on-line algorithm for a limited case.
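To make the duplicate-counting problem concrete, the following Python sketch counts sliding windows that contain a serial episode, in the style of the window-based frequency mentioned earlier. It illustrates the problem only and is not the new measure proposed in the ICDM 2005 work. The function names, the one-event-per-time-step assumption, and the toy event sequence are all hypothetical.

```python
def occurs_in_order(episode, window):
    """True if the episode's events appear in the window in the given order."""
    remaining = iter(window)
    return all(event in remaining for event in episode)


def window_count(events, episode, width):
    """Count the windows of the given width that contain the episode.

    Windows slide one position at a time and may overhang both ends of
    the sequence, so there are len(events) + width - 1 windows in total.
    """
    n = len(events)
    hits = 0
    for start in range(-(width - 1), n):
        window = events[max(start, 0):min(start + width, n)]
        if occurs_in_order(episode, window):
            hits += 1
    return hits, n + width - 1


# Hypothetical event sequence with a single typhoon-then-flood occurrence.
events = ["typhoon", "flood", "rain", "rain", "rain"]
pair_hits, total = window_count(events, ["typhoon", "flood"], width=3)
single_hits, _ = window_count(events, ["typhoon"], width=3)
```

The single occurrence of typhoon followed by flood is counted in two of the seven windows, which is the duplicate counting noted above. The sub-episode "typhoon" is found in three windows, so the count never increases when an episode is extended, in line with the anti-monotonic property.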

10
Technical Topics (2/2)

Ongoing and future work:
On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from system output
Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
A method using compression based on context-free grammar inference/learning
A faster extraction algorithm based on a method for simultaneously searching multiple strings over compressed data.
