A Short Introduction to Sequential Data Mining

1
A Short Introduction to Sequential Data Mining
  • Koji IWANUMA
  • Hidetomo NABESHIMA
  • University of Yamanashi
  • The First Franco-Japanese Symposium on Knowledge
    Discovery in System Biology, September 17,
    Aix-en-Provence

2
Two Main Frameworks of Sequential Mining
  • Sequential pattern mining for multiple data
    sequences
  • Sequential pattern mining for a single data
    sequence

Sequence ID   Purchase data record
1             <bread, cheese>
2             <(wheat, milk), bread, (berry, sausage)>
3             <(bread, pumpkin, sausage)>
4             <bread, cheese, sausage>
5             <cheese>

Data sequence: <S1 S2 S3 S4 S5 S6 S7 ... Sn>

3
What Is Sequential Pattern Mining?
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are
unordered, and we list them alphabetically.

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
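
The two claims on this slide can be checked mechanically. The following is a minimal sketch, not code from the presentation: the helper names parse, is_subsequence, and support are illustrative assumptions, and each element is written as a string of single-letter items.

```python
from typing import Dict, FrozenSet, List, Sequence

Element = FrozenSet[str]


def parse(elements: Sequence[str]) -> List[Element]:
    """Turn ["a", "abc", "ac"] into a list of item sets, one per element."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: List[Element], sequence: List[Element]) -> bool:
    """Greedy left-to-right matching: every pattern element must be a
    subset of some sequence element, and the order must be preserved."""
    pos = 0
    for p in pattern:
        # Skip sequence elements that do not contain the pattern element.
        while pos < len(sequence) and not p <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(pattern: List[Element], database: Dict[int, List[Element]]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database from the slide.
database = {
    10: parse(["a", "abc", "ac", "d", "cf"]),
    20: parse(["ad", "c", "bc", "ae"]),
    30: parse(["ef", "ab", "df", "c", "b"]),
    40: parse(["e", "g", "af", "c", "b", "c"]),
}

print(is_subsequence(parse(["a", "bc", "d", "c"]), database[10]))  # True
print(support(parse(["ab", "c"]), database))  # 2 (SIDs 10 and 30)
```

Only sequences 10 and 30 contain an element holding both a and b followed by a later c, so <(ab)c> reaches min_sup = 2.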
4
Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
      • find the complete set of patterns, when possible,
        satisfying the minimum support (frequency)
        threshold
      • be highly efficient and scalable, involving only a
        small number of database scans
      • be able to incorporate various kinds of
        user-specific constraints

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
5
Sequential Pattern Mining Algorithms for Multiple
Data Sequences
  • Apriori-based method: GSP (Generalized Sequential
    Patterns; Srikant & Agrawal @ EDBT'96)
  • Pattern-growth methods: FreeSpan & PrefixSpan
    (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
  • Vertical format-based mining: SPADE (Zaki @ Machine
    Learning'00)
  • Constraint-based sequential pattern mining
    (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei,
    Han, Wang @ CIKM'02)
  • Mining closed sequential patterns: CloSpan (Yan,
    Han & Afshar @ SDM'03)

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanj
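
To give a feel for the pattern-growth idea, here is a heavily simplified sketch in the style of PrefixSpan. It is not the published algorithm: it assumes every element is a single item (no item sets), and the function name prefixspan and the toy database are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return every frequent sequential pattern together with its support.

    Simplification: each sequence is a plain list of single items.
    """
    results: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[Tuple[int, int]]) -> None:
        # `projected` holds (sequence index, start offset) pairs: the
        # suffixes that remain after the first occurrence of `prefix`.
        counts: Dict[str, int] = defaultdict(int)
        for sid, start in projected:
            for item in set(database[sid][start:]):
                counts[item] += 1  # count each item once per sequence

        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project again: keep only the part after the first match.
            new_projected = []
            for sid, start in projected:
                try:
                    idx = database[sid].index(item, start)
                except ValueError:
                    continue
                new_projected.append((sid, idx + 1))
            grow(pattern, new_projected)

    grow((), [(sid, 0) for sid in range(len(database))])
    return results


# Hypothetical toy database of single-item sequences.
toy_db = [list("abcb"), list("acb"), list("bca"), list("ab")]
patterns = prefixspan(toy_db, min_sup=3)
for pattern, sup in sorted(patterns.items()):
    print("".join(pattern), sup)
```

Each recursive call scans only the projected suffixes, not the whole database, which is the main point of the pattern-growth family compared with Apriori-style candidate generation.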
6
Mining Sequential Patterns from a Very Long
Single Sequence
A series of daily newspaper articles:
< ... typhoon ... (flood, landslide) ... typhoon ... (flood, landslide) ... >
Extracted pattern: <typhoon (flood, landslide)>
7
Sequential Pattern Mining Algorithms for a Single
Data Sequence
  • Discovery of frequent episodes in event
    sequences, based on a sliding-window system
    [Mannila 1998]
      • The frequency measure is anti-monotonic, but
        it has a problem: a single occurrence can be
        counted more than once.
  • Asynchronous periodic pattern mining [Yang et al.
    2000, Huang 2004]
      • No anti-monotonic frequency measure has been
        investigated.
  • On-line approximation algorithms for mining
    frequent items, not frequent subsequences
      • Lossy counting algorithm [Manku and Motwani,
        VLDB'02]
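
The duplicate-counting problem of a sliding-window frequency can be seen on a tiny example. This is a simplified sketch in the spirit of the window-based approach, not the exact definition from the cited work: the boundary convention (windows may overhang both ends of the sequence), the function names, and the toy article sequence are all assumptions.

```python
from typing import List, Sequence


def occurs_in(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if the events of `episode` appear in `window` in order."""
    pos = 0
    for event in window:
        if pos < len(episode) and event == episode[pos]:
            pos += 1
    return pos == len(episode)


def window_frequency(episode: Sequence[str], sequence: List[str], width: int) -> int:
    """Count the windows of the given width that contain the episode.

    Assumed convention: a window may overhang either end of the
    sequence, so there are len(sequence) + width - 1 windows in total.
    """
    n = len(sequence)
    count = 0
    for start in range(-(width - 1), n):
        window = sequence[max(start, 0):min(start + width, n)]
        if occurs_in(episode, window):
            count += 1
    return count


# Hypothetical daily topics; "typhoon" then "flood" occurs exactly once.
articles = ["typhoon", "flood", "sports", "politics", "sports"]

print(window_frequency(["typhoon", "flood"], articles, width=3))  # 2
print(window_frequency(["typhoon"], articles, width=3))           # 3
```

The single occurrence of typhoon followed by flood is covered by two overlapping windows, so it is counted twice. The measure is still anti-monotonic: the sub-episode typhoon is at least as frequent (3) as the longer episode (2).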

8
Research in Our Laboratory
  • Sequential data mining from a very large single
    data sequence.
  • Main target: sequential textual data,
    especially newspaper-article corpora
  • Objective: to generate a robust and useful
    large-scale event-sequence corpus.
      • Application 1: topic tracking/detection in
        information retrieval.
      • Application 2: automated content tracking on the Web.
      • Application 3: semi-automatic scenario/story
        creation
  • Ordinary temporal data analysis: various log
    data in computer systems, genetic information,
    etc.

9
Technical Topics (1/2)
  • A new framework for extracting frequent
    subsequences from a single long data sequence,
    in IEEE International Conference on Data Mining 2005
    (ICDM 2005)
      • A new rational frequency measure, which
        satisfies the Apriori (anti-monotonic) property
        and has no duplicate counting.
      • A fast on-line algorithm for some limited cases

10
Technical Topics (2/2)
  • Ongoing and future work
      • On-line rational filters based on confidence
        criteria and/or information gain, for eliminating
        redundant, valueless sequences from system output
      • Methods for finding meta-structures embedded in
        the huge number of frequent sequences generated
        by a system
      • A method using compression based on context-free
        grammar inference/learning
      • A faster extraction algorithm based on a method
        for simultaneously searching multiple strings
        over compressed data.

11
References
  • Jiawei Han and Micheline Kamber. Data Mining:
    Concepts and Techniques (Chapter 8).
    www.cs.uiuc.edu/~hanj

12
Thanks for your attention!!