Sequential Pattern Mining

Provided by: irenako

1
COMP5318/4044, Lecture 9: Knowledge Discovery and
Data Mining
  • Sequential Pattern Mining

2
Sequence Databases and Sequential Pattern Analysis
  • Transaction databases, time-series databases vs.
    sequence databases
  • Frequent patterns vs. (frequent) sequential
    patterns
  • Applications of sequential pattern mining
  • Customer shopping sequences
  • First buy computer, then CD-ROM, and then digital
    camera, within 3 months.
  • Medical treatment, natural disasters (e.g.,
    earthquakes), science and engineering processes,
    stocks and markets, etc.
  • Telephone calling patterns, Weblog click streams
  • DNA sequences and gene structures

3
Recall about Support and Confidence
  • The support of an association rule X → Y is the
    percentage of transactions that contain X ∪ Y
  • The confidence of an association rule X → Y is the
    ratio of the number of transactions that contain
    X ∪ Y to the number of transactions that contain X
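
The two definitions above can be checked on a tiny example. This is a minimal sketch: the transactions and the function names `support` and `confidence` are illustrative assumptions, not part of the lecture.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical toy transactions (not from the lecture).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread
print(rule_support, rule_confidence)
```

Here the rule bread → milk has support 0.5 and confidence 2/3, because two of the three transactions containing bread also contain milk.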

4
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database: [table not preserved in this transcript]
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern (cf. SIDs 10 and 30)
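
The subsequence test and the support count can be sketched as follows. The four-sequence database is an assumption: the table did not survive in the transcript, so the sequences are reconstructed to agree with the projected databases and support counts listed on later slides. The helper names `parse`, `contains` and `support` are also illustrative.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(sequence, pattern):
    """True if each pattern element is a subset of a distinct sequence
    element, with the order of elements preserved (greedy leftmost match)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database.values())


# Assumed database, keyed by SID (reconstructed, see the note above).
database = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

is_sub = contains(parse("a(abc)(ac)d(cf)"), parse("a(bc)dc"))
supporting = [sid for sid, seq in database.items() if contains(seq, parse("(ab)c"))]
print(is_sub, supporting, support(database, parse("(ab)c")))
```

Under this assumed database, <(ab)c> is contained in SIDs 10 and 30 only, which matches the support of 2 quoted on the slide.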
5
Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
  • find the complete set of patterns, when possible,
    satisfying the minimum support (frequency)
    threshold
  • be highly efficient, scalable, involving only a
    small number of database scans
  • be able to incorporate various kinds of
    user-specific constraints

6
Studies on Sequential Pattern Mining
  • Concept introduction and an initial Apriori-like
    algorithm
  • R. Agrawal & R. Srikant. Mining sequential
    patterns, ICDE'95
  • GSP: an Apriori-based, influential mining method
    (developed at IBM Almaden)
  • R. Srikant & R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements, EDBT'96
  • From sequential patterns to episodes
    (Apriori-like + constraints)
  • H. Mannila, H. Toivonen & A.I. Verkamo.
    Discovery of frequent episodes in event
    sequences, Data Mining and Knowledge Discovery,
    1997
  • Mining sequential patterns with constraints
  • M.N. Garofalakis, R. Rastogi, K. Shim. SPIRIT:
    Sequential Pattern Mining with Regular Expression
    Constraints. VLDB 1999

7
GSP: A Generalized Sequential Pattern Mining
Algorithm
  • GSP (Generalized Sequential Pattern) mining
    algorithm
  • proposed by Srikant and Agrawal, EDBT'96
  • Outline of the method
  • Initially, every item in DB is a candidate of
    length-1
  • for each level (i.e., sequences of length-k) do
  • scan database to collect support count for each
    candidate sequence
  • generate candidate length-(k+1) sequences from
    length-k frequent sequences using Apriori
  • repeat until no frequent sequence or no candidate
    can be found
  • Major strength: candidate pruning by Apriori

8
A Basic Property of Sequential Patterns: Apriori
  • A basic property: Apriori (Agrawal & Srikant '94)
  • If a sequence S is not frequent
  • Then none of the super-sequences of S is frequent
  • E.g., <hb> is infrequent ⇒ so are <hab> and <(ah)b>

Given support threshold min_sup = 2
9
The GSP Algorithm
  • Take sequences in form of <x> as length-1
    candidates
  • Scan database once, find F1, the set of length-1
    sequential patterns
  • Let k = 1; while Fk is not empty do
  • Form Ck+1, the set of length-(k+1) candidates
    from Fk
  • If Ck+1 is not empty, scan database once, find
    Fk+1, the set of length-(k+1) sequential patterns
  • Let k = k + 1

10
Finding Length-1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
  • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

11
Generating Length-2 Candidates
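  • 36 candidates of the form <xy> (6 × 6) plus 15 of
    the form <(xy)> (6 × 5 / 2): 51 length-2 candidates
  • Without the Apriori property: 8 × 8 + 8 × 7 / 2 = 92
    candidates
  • Apriori prunes 44.57% of the candidates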
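
The candidate counts on this slide can be reproduced by enumerating both kinds of length-2 candidate. This is a sketch under one assumption: the six frequent items are taken to be a to f (the transcript only preserves the counts). The function name `length2_candidates` is illustrative.

```python
from itertools import combinations, product


def length2_candidates(items):
    """Return (<xy> candidates, <(xy)> candidates) built from single items."""
    ordered = list(product(items, repeat=2))     # <xy>: order matters, x may equal y
    same_element = list(combinations(items, 2))  # <(xy)>: unordered, x != y
    return ordered, same_element


all_items = "abcdefgh"     # the eight items of the example
frequent_items = "abcdef"  # assumption: the six items that survive the first scan

ordered, same_element = length2_candidates(frequent_items)
with_apriori = len(ordered) + len(same_element)
without_apriori = sum(len(group) for group in length2_candidates(all_items))
pruned = 100 * (without_apriori - with_apriori) / without_apriori

print(len(ordered), len(same_element), with_apriori)
print(without_apriori)
print(f"{pruned:.2f}%")
```

Joining only frequent items gives 36 + 15 = 51 candidates instead of 92, which is the 44.57% reduction quoted above.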
14
Length-2 Sequential Patterns
  • After scanning the database to collect support
    count for each length-2 candidate
  • There are 19 length-2 candidates which pass the
    minimum support threshold
  • They are length-2 sequential patterns
  • 16 of them in the pattern of <xy>
  • 3 of them in the pattern of <(xy)>

15
Generating Length-3 Candidates and Finding
Length-3 Patterns
  • Generate Length-3 Candidates
  • Self-join length-2 sequential patterns
  • Based on the Apriori property
  • <ab>, <aa> and <ba> are all length-2 sequential
    patterns ⇒ <aba> is a length-3 candidate
  • 46 candidates are generated
  • <(bd)>, <bb> and <db> are all length-2 sequential
    patterns ⇒ <(bd)b> is a length-3 candidate
  • 27 candidates are generated
  • Find Length-3 Sequential Patterns
  • Scan database once more, collect support counts
    for candidates
  • 19 out of 73 candidates pass support threshold

16
Generating Length-3 Candidates
  • <aaa>:0, <aab>:0
  • <aba>:2, <abb>:2, <abc>:1, <abd>:1, <abe>:1,
    <abf>:1
  • <baa>, <bab>
  • <bba>, <bbb>, <bbc>, <bbd>, <bbe>, <bbf>
  • <bca>, <bcb>, <bcd>
  • <bda>, <bdb>, <bdc>
  • <bfb>, <bff>
  • <caa>, <cab>
  • <cba>, <cbb>, <cbc>, <cbd>, <cbe>, <cbf>
  • <cda>, <cdb>, <cdc>
  • <daa>, <dab>
  • <dba>, <dbb>, <dbc>, <dbd>, <dbe>, <dbf>
  • <dca>, <dcb>, <dcd>
  • <fba>, <fbb>, <fbc>, <fbd>, <fbe>, <fbf>
  • <ffb>, <fff>
  • Example of generating <xyz> patterns for <aa>
  • Need to concatenate another frequent length-2
    sequence
  • Concatenating the frequent length-2 sequences that
    start with a forms <aaa> and <aab>

min_sup = 2
17
Generating Length-3 Candidates
  • Example of generating <(xy)z> patterns for <(bd)>
  • Need to concatenate another frequent length-2
    sequence
  • Concatenating the patterns that end with b
    forms candidates such as
  • <a(bd)>, <b(bd)>, <c(bd)>, <d(bd)>, <f(bd)>
  • Concatenating the patterns that start with d
    forms candidates such as
  • <(bd)a>, <(bd)b>, <(bd)c>

18
The GSP Mining Process
min_sup = 2
19
Bottlenecks of GSP
  • A huge set of candidates could be generated
  • 1,000 frequent length-1 sequences generate
    1000 × 1000 + 1000 × 999 / 2 = 1,499,500
    length-2 candidates!
  • Multiple scans of database in mining
  • Real challenge: mining long sequential patterns
  • An exponential number of short candidates
  • A length-100 sequential pattern needs about 10^30
    candidate sequences!
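
Both figures can be checked with a few lines of arithmetic. This is a sketch: the length-2 count reuses the n × n + n(n − 1)/2 formula from the length-2 candidate slide, and the 10^30 estimate is assumed here to count every non-empty choice of the 100 items of the long pattern, which is the usual reading of this bound. The function name is illustrative.

```python
from math import comb


def length2_candidate_count(n):
    """n * n candidates of the form <xy> plus n * (n - 1) / 2 of the form <(xy)>."""
    return n * n + n * (n - 1) // 2


pairs = length2_candidate_count(1000)

# Assumed reading of the 10^30 figure: one candidate per non-empty
# sub-pattern of a length-100 pattern, i.e. sum of C(100, i) for i = 1..100.
short_candidates = sum(comb(100, i) for i in range(1, 101))

print(pairs)
print(short_candidates == 2**100 - 1)
print(f"{short_candidates:.2e}")
```

The sum equals 2^100 − 1, roughly 1.27 × 10^30, so the order of magnitude on the slide follows directly from the binomial identity.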

20
Pattern-growth methods
  • A divide-and-conquer approach
  • Recursively project a sequence database into a
    set of smaller databases
  • Mine each projected database to find the subset
    of patterns
  • Algorithms
  • FreeSpan: Frequent Pattern-Projected Sequential
    Pattern Mining
  • PrefixSpan: Prefix-Projected Sequential Pattern
    Mining

21
FreeSpan
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U.
    Dayal, and M.-C. Hsu. FreeSpan: Frequent
    pattern-projected sequential pattern mining.
    KDD'00, pages 355-359.
  • Example: given a sequence database S and
    min_support = 2
  • Step 1: find length-1 sequential patterns and
    list them in support descending order
  • f_list: a:4, b:4, c:4, d:3, e:3, f:3 (g:1 is below
    min_support)
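
Step 1 can be sketched as a single counting pass. The database below is an assumption (the slide's table is not in the transcript); it is chosen because it reproduces the support counts listed in the f_list. The names `parse` and `f_list` are illustrative, and ties are broken alphabetically, which is a choice of this sketch.

```python
from collections import Counter


def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def f_list(database, min_support):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for sequence in database:
        counts.update(set().union(*sequence))  # count each item once per sequence
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


# Assumed database (reconstructed, see the note above).
database = [parse(s) for s in
            ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]

print(f_list(database, min_support=2))
```

Item g occurs in only one sequence, so it drops out at min_support = 2 and the remaining six items define the six partitions used in Step 2.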

22
FreeSpan (cont)
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 disjoint
    subsets (move down the f_list)
  • ones containing only item a
  • ones containing item b but no items after b in
    f_list
  • ones containing item c but no items after c in
    f_list
  • ones containing item d but no items after d in
    f_list
  • ones containing item e but no items after e in
    f_list
  • ones containing item f
  • Step 3: find subsets of sequential patterns. They
    can be mined by constructing projected databases
    and mining each recursively

23
FreeSpan (cont)
  • Finding Seq. Patterns containing item b but no
    items after b in f_list
  • <b>-projected database
  • <a(ab)a>, <aba>, <(ab)b>, <ab>
  • Find all the length-2 seq. pat. containing item b
    but no items after b in f_list
  • <ab>:4, <ba>:2, <(ab)>:2
  • Further partition and mining

f_list: a:4, b:4, c:4, d:3, e:3, f:3
24
From FreeSpan to PrefixSpan
  • FreeSpan
  • Projection-based: no candidate sequence needs to
    be generated
  • But projection can be performed at any point in
    the sequence, and the projected sequences may not
    shrink much. For example, the size of the f-projected
    database is the same as the original sequence
    database
  • PrefixSpan
  • Projection-based
  • But only prefix-based projection: fewer
    projections and quickly shrinking sequences

25
PrefixSpan (Prefix-projected Sequential Pattern
Mining)
  • Projection-based
  • But only prefix-based projection: fewer
    projections and quickly shrinking sequences
  • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and
    M.-C. Hsu, "PrefixSpan: Mining Sequential
    Patterns Efficiently by Prefix-Projected Pattern
    Growth", Proc. 2001 Int. Conf. on Data
    Engineering (ICDE'01), Heidelberg, Germany, April
    2001.

26
Prefix of A Sequence
  • Given sequence s = <(abc)(bd)(ace)>
  • <(abc)>, <(abc)(bd)> are prefixes of s
  • Given an alphabetical order of items in each
    itemset (element), <(a)>, <(ab)>, <(abc)(b)>,
    <(abc)(bd)(a)>, and <(abc)(bd)(ac)> are also
    prefixes of s
  • <(ab)(bd)>, <(bd)(ac)> are NOT prefixes of s
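
The prefix examples above can be checked mechanically. The sketch below encodes the rule that every element but the last must match exactly, and that the last element may be cut short only by dropping its alphabetically latest items. The helper names `parse` and `is_prefix` are illustrative, and the final check on <(ac)> is an extra case added here, not one from the slide.

```python
def parse(text):
    """Parse '(abc)(bd)(ace)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_prefix(beta, alpha):
    """True if `beta` is a prefix of `alpha` (items listed alphabetically)."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    if beta[:m - 1] != alpha[:m - 1]:  # all but the last element must be equal
        return False
    last, full = beta[-1], alpha[m - 1]
    if not last <= full:               # last element must be a subset
        return False
    # every dropped item must come alphabetically after the kept ones
    return all(item > max(last) for item in full - last)


s = parse("(abc)(bd)(ace)")
yes = ["(abc)", "(abc)(bd)", "a", "(ab)", "(abc)b", "(abc)(bd)a", "(abc)(bd)(ac)"]
no = ["(ab)(bd)", "(bd)(ac)", "(ac)"]
print([is_prefix(parse(p), s) for p in yes])
print([is_prefix(parse(p), s) for p in no])
```

<(ac)> fails the third condition because the dropped item b does not come after c, even though (ac) is a subset of (abc).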

27
Pattern Growth (prefixSpan)
  • Prefix and Suffix (Projection)
  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
    sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)> [prefix/suffix
    table not preserved in this transcript]

28
PrefixSpan-concepts
  • Suppose all items in an element are listed
    alphabetically.
  • Given sequences α = <e1 e2 … en> and
    β = <e1' e2' … em'> (m ≤ n)
  • Prefix: β is a prefix of α iff (1) ei' = ei for
    i ≤ m−1; (2) em' ⊆ em; (3) all items in (em − em')
    are alphabetically after those in em'.
  • e.g. α = <a(abc)(ac)d(cf)>, β = <a(ab)>
  • Postfix: given α = <e1 e2 … en> and its prefix
    β = <e1 e2 … em−1 em'>, the sequence
    γ = <em'' em+1 … en> is called the postfix of α
    w.r.t. prefix β, where em'' = (em − em'); we
    write α = β · γ
  • e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t.
    prefix <a(ab)>
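
The postfix definition can be sketched in a few lines. Assumptions of this sketch: the caller passes a β that really is a prefix of α (no check is made), and the postfix is returned as a pair (leftover of the cut element, remaining elements). The names `parse`, `postfix` and `show` are illustrative.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def postfix(alpha, beta):
    """Postfix of `alpha` w.r.t. prefix `beta`, assuming beta is a prefix.

    Returns (leftover, rest): leftover is em - em', rest is em+1 ... en.
    """
    m = len(beta)
    return alpha[m - 1] - beta[-1], alpha[m:]


def show(leftover, rest):
    """Render a postfix in the slide notation, e.g. '<(_c)(ac)d(cf)>'."""
    parts = []
    if leftover:
        parts.append("(_" + "".join(sorted(leftover)) + ")")
    for element in rest:
        items = "".join(sorted(element))
        parts.append(items if len(items) == 1 else "(" + items + ")")
    return "<" + "".join(parts) + ">"


alpha = parse("a(abc)(ac)d(cf)")
for prefix_text in ("a", "aa", "a(ab)"):
    print(prefix_text, show(*postfix(alpha, parse(prefix_text))))
```

The `(_c)` marker records that the prefix consumed a and b from the element (abc), so only c is left in that element.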

29
PrefixSpan-concepts (cont)
  • Projected database: let α be a sequential pattern
    in S. The α-projected database, denoted S|α, is the
    collection of postfixes of sequences in S w.r.t.
    prefix α
  • Support count in projected database: let α be a
    sequential pattern in S, and β be a sequence having
    prefix α. The support count of β in the α-projected
    database is the number of sequences γ in S|α such
    that β ⊑ α · γ

30
PrefixSpan-process
  • Step 1: find length-1 sequential patterns
  • <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 (<g>:1
    is below min_sup)
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets
  • ones having prefix <a>
  • ones having prefix <b>
  • …
  • ones having prefix <f>
  • Step 3: find subsets of sequential patterns. They
    can be mined by constructing projected databases
    and mining each recursively

31
PrefixSpan-Process (cont)
  • Finding Seq. Patterns with Prefix <a>
  • <a>-projected database
  • <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>,
    <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix
    <a>
  • <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
  • Further partition into 6 subsets
  • Having prefix <aa>
  • …
  • Having prefix <af>

32
An Example (min_sup = 2)
33
PrefixSpan Algorithm
Main Idea: use frequent prefixes to divide the
search space and to project sequence databases;
only search the relevant sequences.
PrefixSpan(α, i, S|α)
  • Scan S|α once, find the set of frequent items b
    such that
  • b can be assembled to the last element of α to
    form a sequential pattern, or
  • <b> can be appended to α to form a sequential
    pattern.
  • For each frequent item b, append it to α to
    form a sequential pattern α', and output α'
  • For each α', construct the α'-projected database
    S|α', and call PrefixSpan(α', i+1, S|α').
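
The recursion above can be sketched compactly. This is an illustrative sketch, not the authors' implementation, and it differs from the outline in two ways: each projected sequence is stored as a (sequence, element index) pointer instead of a copied postfix, and the length argument i is dropped because it is implied by the pattern. The database is an assumed reconstruction that agrees with the projected databases listed on the slides. The names `parse`, `fmt`, `prefixspan` and `mine` are hypothetical.

```python
from collections import Counter


def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt(pattern):
    """Render a pattern such as ('a', 'bc') as '<a(bc)>'."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in pattern) + ">"


def prefixspan(database, min_support):
    """Return {pattern: support} for all sequential patterns."""
    results = {}

    def mine(pattern, projected):
        # projected: (sequence, j) pairs, j = index of the element that matches
        # the last element of `pattern` in its leftmost occurrence (-1 at the root).
        last = frozenset(pattern[-1]) if pattern else frozenset()
        s_counts, i_counts = Counter(), Counter()
        for seq, j in projected:
            s_items = set().union(*seq[j + 1:])  # items that can start a new element
            i_items = set()                      # items that can join the last element
            if pattern:
                for element in seq[j:]:
                    if last <= element:
                        i_items |= {x for x in element if x > max(last)}
            s_counts.update(s_items)
            i_counts.update(i_items)

        for counts, assemble in ((i_counts, True), (s_counts, False)):
            for item in sorted(counts):
                if counts[item] < min_support:
                    continue
                if assemble:  # grow the last element, may reuse element j
                    new_pattern = pattern[:-1] + (pattern[-1] + item,)
                    need, start = last | {item}, 0
                else:         # append <item> as a new element after element j
                    new_pattern = pattern + (item,)
                    need, start = frozenset(item), 1
                new_projected = []
                for seq, j in projected:
                    for k in range(j + start, len(seq)):
                        if need <= seq[k]:
                            new_projected.append((seq, k))
                            break
                results[fmt(new_pattern)] = len(new_projected)
                mine(new_pattern, new_projected)

    mine((), [(seq, -1) for seq in database])
    return results


# Assumed database (reconstructed, see the note above).
database = [parse(s) for s in
            ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
patterns = prefixspan(database, min_support=2)
print({p: n for p, n in patterns.items() if p in ("<a>", "<aa>", "<ab>", "<(ab)>")})
```

Storing only the index of the leftmost match is enough here: any occurrence of a longer pattern must place its last element at or after that index, so no postfix has to be copied.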

34
Example to be continued
  • <aa>-projected database
  • <(_bc)(ac)d(cf)>
  • <ab>-projected database
  • <(_c)(ac)d(cf)>, <(_c)a>, and <c>
  • <(ab)>-projected database
  • <(_c)(ac)d(cf)>, <(df)cb>

36
PrefixSpan Is Faster than GSP and FreeSpan
37
Effect of Pseudo-Projection
38
Completeness of PrefixSpan
  • SDB
  • Length-1 sequential patterns: <a>, <b>, <c>, <d>,
    <e>, <f>
  • Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>,
    <(_f)cbc>
  • Length-2 sequential patterns having prefix <a>:
    <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Having prefix <aa>: <aa>-proj. db
  • …
  • Having prefix <af>: <af>-proj. db
  • Having prefix <b>: <b>-projected database
  • Having prefix <c>, …, <f>
39
Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected
    databases
  • Can be improved by bi-level projections

40
Bi-Level Projection
For each length-2 sequential pattern α, construct
the α-projected database
41
<ab>-Projected Database
  • The <ab>-projected database contains three
    sequences
  • <(_c)(ac)(cf)>, <(_c)a>, <c>
  • Scanning it produces three frequent items
  • <a>, <(_c)>, <c>
  • Only one pattern reaches min_sup = 2, which
    is <(_c)a>
  • Same set of sequential patterns will be produced
  • Level-by-level projection builds 53 projected
    databases for the 53 sequential patterns produced,
    but the bi-level approach needs only 22 projections

42
Other Optimization Technique in PrefixSpan
  • Pseudo-projection may reduce the effort of
    projection when the projected database fits in
    main memory

43
Speed-up by Pseudo-projection
  • Major cost of PrefixSpan: projection
  • Postfixes of sequences often appear repeatedly in
    recursive projected databases
  • When (projected) database can be held in main
    memory, use pointers to form projections
  • Pointer to the sequence
  • Offset of the postfix

s = <a(abc)(ac)d(cf)>
  • s|<a> = <(abc)(ac)d(cf)>, stored as (pointer to s, 2)
  • s|<ab> = <(_c)(ac)d(cf)>, stored as (pointer to s, 4)
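
The (pointer, offset) pairs above can be illustrated with a small sketch. It assumes that the offset is the 1-based position of the first item of the postfix when the items of s are numbered left to right, which is consistent with the offsets 2 and 4 shown. The function names `parse` and `materialize` are illustrative; a Python object reference stands in for the pointer.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def materialize(sequence, offset):
    """Rebuild the postfix that starts at 1-based item `offset` of `sequence`."""
    parts, position = [], 0
    for element in sequence:
        items = sorted(element)
        kept = [x for n, x in enumerate(items, start=position + 1) if n >= offset]
        position += len(items)
        if not kept:
            continue
        text = "".join(kept)
        if len(kept) < len(items):
            parts.append("(_" + text + ")")  # element cut by the prefix
        else:
            parts.append(text if len(kept) == 1 else "(" + text + ")")
    return "<" + "".join(parts) + ">"


s = parse("a(abc)(ac)d(cf)")
projection_a = (s, 2)   # s|<a>: reference to s plus an offset, nothing copied
projection_ab = (s, 4)  # s|<ab>
print(materialize(*projection_a))
print(materialize(*projection_ab))
```

Each projection costs one reference and one integer regardless of how long the postfix is, which is why the technique pays off only while s itself stays in main memory.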
44
Performance Comparison
  • PrefixSpan-1 is PrefixSpan with level-by-level
    projection
  • PrefixSpan-2 is PrefixSpan with bi-level
    projection

45
More about Pseudo-Projection
  • Pseudo-projection avoids physically copying
    postfixes
  • Efficient in running time and space when database
    can be held in main memory
  • However, it is not efficient when database cannot
    fit in main memory
  • Disk-based random access is very costly
  • Suggested Approach
  • Integration of physical and pseudo-projection
  • Swapping to pseudo-projection when the data set
    fits in memory

46
The Final Word
  • Sequential pattern mining is useful in many
    applications, e.g. weblog analysis, financial
    market prediction and even NLP
  • It is similar to frequent itemset mining,
    but takes temporal order (sequences) into
    consideration
  • We have looked at two different approaches that
    descend from two popular algorithms for
    mining frequent itemsets
  • Candidate generation: GSP
  • Pattern growth: PrefixSpan