Mining Sequence Patterns in Transactional Databases

Slides: 16
Provided by: jiaw202
Learn more at: http://web.cs.ucla.edu
Transcript and Presenter's Notes

1
Mining Sequence Patterns in Transactional
Databases
  • CS240B, UCLA
  • Notes by Carlo Zaniolo
  • Based on those by J. Han

2
Sequence Databases & Sequential Patterns
  • Transaction databases and time-series databases vs.
    sequence databases
  • Frequent patterns vs. (frequent) sequential
    patterns
  • Applications of sequential pattern mining
      • Customer shopping sequences, e.g., first buy a
        computer, then a CD-ROM, and then a digital
        camera, within 3 months
      • Medical treatments, natural disasters (e.g.,
        earthquakes), science & engineering processes,
        stocks and markets, etc.
      • Telephone calling patterns, Weblog click streams
      • DNA sequences and gene structures

3
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
4
Subsequence
  • <a(bc)dc> is a subsequence of
    <a(abc)(ac)d(cf)>
  • Def: S1 is a subsequence of S2 if S1 can be
    obtained from S2 by deleting some of its items
    (dropping any element left empty), keeping the
    remaining items in their original order.
  • This is a partial order, not a lattice: there are
    no proper union and intersection operations

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

The pattern <(ab)c> has support 2 in our
database (it occurs in sequences 10 and 30).
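
The subsequence test and the support count can be sketched in Python. This is a minimal sketch, not code from the original notes: the encoding (one string per element, one character per item) and the names seq, is_subsequence, and support are assumptions made for illustration.

```python
def seq(*elements):
    """Build a sequence from strings: seq("a", "abc") stands for <a(abc)>."""
    return tuple(frozenset(e) for e in elements)


def is_subsequence(pattern, sequence):
    """True if every element of `pattern` is contained in a distinct element
    of `sequence`, in the same order (greedy left-to-right matching)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern, database):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The four-sequence database shown on the slide.
DB = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_subsequence(seq("a", "bc", "d", "c"), DB[10]))  # True
print(support(seq("ab", "c"), DB))                       # 2
```

Greedy matching is enough here because no time constraints are imposed: taking the earliest element that contains each pattern element never rules out a later match.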
5
The Apriori Property of Sequential Patterns
  • A basic property: Apriori (Agrawal & Srikant '94)
      • If a sequence S is not frequent,
      • then none of the super-sequences of S is
        frequent (antimonotonicity)
  • E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Given support threshold min_sup = 2
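
A rough sketch of how this property prunes candidates, in Python. The helper names (seq, drop_one_item, can_prune) and the set frequent_2 are hypothetical; only the fact that <hb> is infrequent comes from the slide.

```python
def seq(*elements):
    """Build a sequence from strings: seq("ah", "b") stands for <(ah)b>."""
    return tuple(frozenset(e) for e in elements)


def drop_one_item(candidate):
    """Yield each subsequence obtained by deleting exactly one item."""
    for i, element in enumerate(candidate):
        for item in element:
            rest = element - {item}
            # An element left empty disappears from the sequence.
            yield candidate[:i] + ((rest,) if rest else ()) + candidate[i + 1:]


def can_prune(candidate, frequent_shorter):
    """Apriori: prune if any one-item-shorter subsequence is not frequent."""
    return any(sub not in frequent_shorter for sub in drop_one_item(candidate))


# Hypothetical set of frequent 2-item sequences; <hb> is deliberately absent.
frequent_2 = {seq("a", "a"), seq("a", "b"), seq("h", "a"), seq("ah")}

print(can_prune(seq("h", "a", "b"), frequent_2))  # True: contains <hb>
print(can_prune(seq("ah", "b"), frequent_2))      # True: contains <hb>
print(can_prune(seq("a", "a", "b"), frequent_2))  # False
```

A pruned candidate never needs a database scan, which is where the saving comes from.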
6
GSP: Generalized Sequential Pattern Mining
  • GSP (Generalized Sequential Pattern) mining
    algorithm
      • proposed by Agrawal and Srikant, EDBT'96
  • Outline of the method
      • Initially, every item in DB is a candidate of
        length 1
      • For each level (i.e., sequences of length k) do
          • scan the database to collect the support count
            for each candidate sequence
          • generate candidate length-(k+1) sequences from
            length-k frequent sequences using Apriori
      • Repeat until no frequent sequence or no candidate
        can be found
  • Major strength: candidate pruning by Apriori
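
The level-wise loop outlined above can be sketched as follows. This is a simplified generate-and-test miner in the spirit of GSP, not GSP itself: candidates are built by appending one frequent item to each frequent length-k sequence rather than by GSP's join step, and no time constraints or hash trees are used. All names are assumptions, and the data is the four-sequence database shown earlier in these notes.

```python
def seq(*elements):
    """Build a sequence from strings: seq("ab", "c") stands for <(ab)c>."""
    return tuple(frozenset(e) for e in elements)


def is_subsequence(pattern, sequence):
    """Greedy containment test, with element order preserved."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def mine_levelwise(database, min_sup):
    """Return {frequent sequence: support}; length counts items."""
    items = sorted({i for s in database for e in s for i in e})
    candidates = [seq(i) for i in items]  # every item is a length-1 candidate
    frequent, frequent_items = {}, None
    while candidates:
        # Test: one database scan per level to count support.
        survivors = []
        for cand in candidates:
            sup = sum(is_subsequence(cand, s) for s in database)
            if sup >= min_sup:
                frequent[cand] = sup
                survivors.append(cand)
        if frequent_items is None:
            frequent_items = [min(c[0]) for c in survivors]
        # Generate: extend each frequent length-k sequence by one frequent item.
        candidates = []
        for p in survivors:
            for item in frequent_items:
                candidates.append(p + (frozenset(item),))  # as a new element
                if item > max(p[-1]):                      # into the last element
                    candidates.append(p[:-1] + (p[-1] | {item},))
    return frequent


DB = [
    seq("a", "abc", "ac", "d", "cf"),
    seq("ad", "c", "bc", "ae"),
    seq("ef", "ab", "df", "c", "b"),
    seq("e", "g", "af", "c", "b", "c"),
]

result = mine_levelwise(DB, min_sup=2)
print(result[seq("ab", "c")])  # 2
print(seq("g") in result)      # False
```

Only sequences that survived level k are extended, so an infrequent sequence never produces longer candidates; that is the Apriori pruning named as the major strength.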

7
Finding Length-1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
      • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
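
A minimal Python sketch of this first scan. Note the assumption: the table above lists a support of 5 for <b>, so it evidently comes from a database with more sequences than the four-sequence one shown earlier in these notes. That database is not in the transcript, so the sketch runs on the four-sequence one and its counts differ from the table. The name item_supports is hypothetical.

```python
from collections import Counter

# Four-sequence database from the earlier slide; each string is one element.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}
MIN_SUP = 2


def item_supports(database):
    """Count, for each item, the number of sequences it occurs in."""
    counts = Counter()
    for elements in database.values():
        counts.update(set("".join(elements)))  # each item once per sequence
    return counts


supports = item_supports(DB)
frequent = sorted(i for i, n in supports.items() if n >= MIN_SUP)
print(sorted(supports.items()))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3), ('g', 1)]
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

As in the table, six items survive: <g> falls below min_sup = 2, and <h> does not occur in this database at all.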
8
GSP: Generating Length-2 Candidates

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

51 length-2 candidates (6*6 + 6*5/2)
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
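
The counts on this slide can be checked with a few lines of Python. The function name length2_candidates is made up for this sketch; the item lists are the six frequent items and the eight original items from the previous slide.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates over `items`: ordered pairs <xy> (two elements)
    plus unordered pairs <(xy)> (both items in one element)."""
    ordered = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    same_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return ordered + same_element


with_pruning = length2_candidates("abcdef")       # frequent items only
without_pruning = length2_candidates("abcdefgh")  # all items

print(len(with_pruning))     # 51
print(len(without_pruning))  # 92
saved = 1 - len(with_pruning) / len(without_pruning)
print(f"{saved:.2%}")        # 44.57%
```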
9
The GSP Mining Process
min_sup = 2
10
Candidate Generate-and-Test: Drawbacks
  • A huge set of candidate sequences is generated,
    especially 2-item candidate sequences.
  • Multiple scans of the database are needed.
      • The length of each candidate grows by one at each
        database scan.
  • Inefficient for mining long sequential patterns.
      • A long pattern grows from short patterns
      • The number of short patterns is exponential in
        the length of mined patterns
  • Windows can be used to limit the search
      • Maximum intervals can be imposed between items.
  • No efficient algorithm at hand for data streams.
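
One way to read the maximum-interval idea is sketched below in Python. The slide does not say how the interval is measured; this sketch assumes it is the distance in element positions between consecutive matched elements (a real system would more likely use timestamps). The names seq and contains_with_max_gap are hypothetical.

```python
def seq(*elements):
    """Build a sequence from strings: seq("a", "cf") stands for <a(cf)>."""
    return tuple(frozenset(e) for e in elements)


def contains_with_max_gap(pattern, sequence, max_gap):
    """True if the non-empty `pattern` occurs in `sequence` with consecutive
    matched elements at most `max_gap` positions apart."""
    # Positions where the first pattern element can be matched.
    ends = {j for j, e in enumerate(sequence) if pattern[0] <= e}
    for element in pattern[1:]:
        # Keep positions that match and lie within max_gap of an earlier match.
        ends = {j for j, e in enumerate(sequence)
                if element <= e and any(0 < j - i <= max_gap for i in ends)}
    return bool(ends)


s10 = seq("a", "abc", "ac", "d", "cf")  # sequence 10 from the earlier slides

print(contains_with_max_gap(seq("a", "d"), s10, max_gap=1))  # True
print(contains_with_max_gap(seq("a", "f"), s10, max_gap=1))  # False
print(contains_with_max_gap(seq("a", "f"), s10, max_gap=2))  # True
```

The function tracks every feasible match position instead of matching greedily, because under a gap constraint the earliest match of an element is not always the one that can be extended.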

11
From Sequential Patterns to Structured Patterns
  • Sets, sequences, trees, graphs, and other
    structures
      • Transaction DB: sets of items
          • {{i1, i2, ..., im}, ...}
      • Seq. DB: sequences of sets
          • {<{i1, i2}, ..., {im, in, ik}>, ...}
      • Sets of sequences
          • {{<i1, i2>, ..., <im, in, ik>}, ...}
      • Sets of trees: {t1, t2, ..., tn}
      • Sets of graphs (mining for frequent subgraphs)
          • {g1, g2, ..., gn}
  • Mining structured patterns in XML documents,
    bio-chemical structures, etc.

12
Episodes and Episode Pattern Mining
  • Other methods for specifying the kinds of
    patterns
      • Serial episodes: A → B
      • Parallel episodes: A & B
      • Regular expressions: (A | B)C*(D → E)
  • Methods for episode pattern mining
      • Variations of Apriori-like algorithms, e.g., GSP
      • Database projection-based pattern growth
          • Similar to frequent pattern growth without
            candidate generation

13
Periodicity Analysis
  • Periodicity is everywhere: tides, seasons, daily
    power consumption, etc.
  • Full periodicity
      • Every point in time contributes (precisely or
        approximately) to the periodicity
  • Partial periodicity: a more general notion
      • Only some segments contribute to the periodicity
      • E.g., Jim reads the NY Times 7:00-7:30 am every
        weekday
  • Cyclic association rules
      • Associations which form cycles
  • Methods
      • Full periodicity: FFT, other statistical analysis
        methods
      • Partial and cyclic periodicity: variations of
        Apriori-like mining methods
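
To make partial periodicity concrete, here is a small Python sketch that finds which positions within a fixed period carry a recurring symbol. It illustrates the idea only and is not any of the cited algorithms; the data, the 0.9 threshold, and the name frequent_1_patterns are invented for the example.

```python
from collections import Counter


def frequent_1_patterns(series, period, min_conf):
    """Map (offset, symbol) -> confidence for symbols that recur at the same
    offset in at least `min_conf` of the whole periods in `series`."""
    segments = len(series) // period
    counts = Counter((k % period, series[k]) for k in range(segments * period))
    return {key: n / segments for key, n in counts.items()
            if n / segments >= min_conf}


# Hypothetical data: three weeks, Monday first.
# "N" = reads the paper that morning, "-" = does not.
series = "NNNNN--" "NNNNN-N" "NNNNN--"

patterns = frequent_1_patterns(series, period=7, min_conf=0.9)
reading_days = sorted(offset for offset, symbol in patterns if symbol == "N")
print(reading_days)  # [0, 1, 2, 3, 4]
```

Only the five weekday offsets pass the threshold for "N", so the pattern is partial: it says nothing about reading on offset 6, where the behaviour varies from week to week.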

14
Sequential Pattern Mining Algorithms
  • Concept introduction and an initial Apriori-like
    algorithm
      • Agrawal & Srikant. Mining sequential patterns,
        ICDE'95
  • Apriori-based method: GSP (Generalized Sequential
    Patterns: Srikant & Agrawal @ EDBT'96)
  • Pattern-growth methods: FreeSpan & PrefixSpan
    (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
  • Vertical format-based mining: SPADE (Zaki @ Machine
    Learning'01)
  • Constraint-based sequential pattern mining
    (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei,
    Han, Wang @ CIKM'02)
  • Mining closed sequential patterns: CloSpan (Yan,
    Han & Afshar @ SDM'03)

15
Ref: Mining Sequential Patterns
  • R. Srikant and R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements. EDBT'96.
  • H. Mannila, H. Toivonen, and A. I. Verkamo.
    Discovery of frequent episodes in event
    sequences. DAMI'97.
  • M. Zaki. SPADE: An Efficient Algorithm for Mining
    Frequent Sequences. Machine Learning, 2001.
  • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and
    M.-C. Hsu. PrefixSpan: Mining Sequential Patterns
    Efficiently by Prefix-Projected Pattern Growth.
    ICDE'01 (TKDE'04).
  • J. Pei, J. Han, and W. Wang. Constraint-Based
    Sequential Pattern Mining in Large Databases.
    CIKM'02.
  • X. Yan, J. Han, and R. Afshar. CloSpan: Mining
    Closed Sequential Patterns in Large Datasets.
    SDM'03.
  • J. Wang and J. Han. BIDE: Efficient Mining of
    Frequent Closed Sequences. ICDE'04.
  • H. Cheng, X. Yan, and J. Han. IncSpan:
    Incremental Mining of Sequential Patterns in
    Large Database. KDD'04.
  • J. Han, G. Dong, and Y. Yin. Efficient Mining of
    Partial Periodic Patterns in Time Series
    Database. ICDE'99.
  • J. Yang, W. Wang, and P. S. Yu. Mining
    asynchronous periodic patterns in time series
    data. KDD'00.