Constraint-based sequential pattern mining: the pattern-growth method

1
Constraint-based sequential pattern mining: the
pattern-growth method
Authors: Jian Pei, Jiawei Han, and Wei Wang
Source: Journal of Intelligent Information Systems, Volume 28, Number 2, pp. 133-160, 2007/4
Reporter: Chin-Chih Chan
Date: 2007/12/17
E-mail: m9622967
2
Outline
  • Introduction
  • Definition
  • Constraint-based sequential pattern mining
  • Categories of constraints
  • Algorithm PG
  • Computing projection
  • Mining sequential patterns with prefix-monotone
    constraints
  • Experimental results and performance

3
Introduction
  • Sequential pattern mining is an important data
    mining task with broad applications
  • Sequential pattern mining: finding the frequently
    occurring sequences in a large database
  • Frequently occurring: the support of the sequence is
    at least the user-defined minimum support
  • Support of a sequence s: the number of sequences in the DB
    that contain s
  • Applications: network traffic analysis,
    bioinformatics

4
Example: sequential pattern mining
  • Suppose the support threshold is min_sup = 2
  • <(ab)d> is a subsequence of both the second
    sequence, <e(ab)(bc)dd>, and the third one,
    <c(aef)(abc)dd>
  • So <(ab)d> is one of the sequential patterns

5
Introduction
  • Mining the complete set of sequential patterns
    remains challenging in both effectiveness and efficiency
  • We want to mine only the sequential patterns that
    are highly interesting to users
  • Focusing only on interesting patterns improves
    effectiveness
  • We want to mine sequential patterns under all
    kinds of constraints

6
Pattern-growth (PG) method
  • We present a pattern-growth (PG) algorithm
  • Constraints can be effectively and efficiently
    pushed deep into the sequential data mining method

7
Sequential pattern mining concepts
  • Let I = {x1, ..., xn} be a set of items
  • An itemset is a non-empty subset of items
  • A sequence α = <X1, ..., Xl> is an ordered list of
    itemsets
  • An itemset Xi (1 ≤ i ≤ l) in a sequence is
    called a transaction
  • The number of transactions in a sequence is
    called the length of the sequence
  • For an l-sequence α, we have len(α) = l

8
Subsequence and super-sequence
  • A transaction Xi has a special attribute,
    time-stamp, denoted by Xi.time
  • For a sequence α = <X1, ..., Xl>, we assume
    Xi.time < Xj.time for 1 ≤ i < j ≤ l
  • Order matters: <X1, X2> ≠ <X2, X1>
  • A sequence α = <X1, ..., Xn> is called a
    subsequence of sequence β = <Y1, ..., Ym> (n ≤ m),
    denoted by α ⊑ β, if there exist integers 1 ≤
    i1 < ... < in ≤ m such that X1 ⊆ Yi1, ..., Xn ⊆ Yin
  • And β is a super-sequence of α
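
The subsequence test and the support count can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `seq`, `is_subsequence` and `support` are hypothetical, sequences are modelled as lists of frozensets, and item names are assumed to be single characters.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def seq(*transactions: str) -> List[Itemset]:
    """Build a sequence from one string per transaction: seq("ab", "d") is <(ab)d>."""
    return [frozenset(t) for t in transactions]


def is_subsequence(alpha: Sequence[Itemset], beta: Sequence[Itemset]) -> bool:
    """True if each itemset of alpha is a subset of a distinct itemset of beta, in order."""
    j = 0
    for x in alpha:
        # Greedy choice: take the earliest remaining itemset of beta that contains x.
        while j < len(beta) and not x <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(alpha: Sequence[Itemset], database: Sequence[Sequence[Itemset]]) -> int:
    """Number of database sequences that contain alpha."""
    return sum(1 for s in database if is_subsequence(alpha, s))


pattern = seq("ab", "d")
second = seq("e", "ab", "bc", "d", "d")
third = seq("c", "aef", "abc", "d", "d")
print(support(pattern, [second, third]))  # 2
```

Matching each itemset at the earliest possible position leaves the most room for the remaining itemsets, so the greedy scan never misses a valid embedding.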

9
Example: subsequence
  • Sequence <(ab)d> is a subsequence of both
    <e(ab)(bc)dd> and <c(aef)(abc)dd>

(Figure: <(ab)d> matched against each of the two sequences; its itemsets are contained, and in time order)
10
Categories of constraints
  • Constraint 1 (Item constraint)
  • Example: C_bookstore(α) ≡ (∀i: 1 ≤ i ≤ len(α),
    α[i] ⊆ B)
  • Constraint 2 (Length constraint )
  • Example: C_len(α) compares len(α) with the threshold 50
  • Constraint 3 (Super-pattern constraint)
  • Example: C_pat(α) ≡ (<(PC)(digital_camera)> ⊑ α)
  • Constraint 4 (Aggregate constraint)
  • Example: C_avg(α) compares avg(α) with the threshold 30
  • Constraint 5 (Regular expression constraint)
  • Example: Travel (New York | New York City)
    (Hotels | Motels)
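
Three of these categories can be sketched as plain predicates over a pattern. This is only an illustration under stated assumptions: all function names are hypothetical, the direction of the length and average comparisons (here "at least") is assumed, and `value` is an assumed per-item lookup such as a price table. Super-pattern and regular expression constraints are left out to keep the sketch short.

```python
from typing import Callable, Dict, FrozenSet, Sequence

Itemset = FrozenSet[str]
Constraint = Callable[[Sequence[Itemset]], bool]


def item_constraint(allowed: FrozenSet[str]) -> Constraint:
    """Every transaction uses only items from `allowed` (the set B above)."""
    return lambda alpha: all(x <= allowed for x in alpha)


def min_length_constraint(threshold: int) -> Constraint:
    """len(alpha) >= threshold; the comparison direction is an assumption."""
    return lambda alpha: len(alpha) >= threshold


def min_avg_constraint(value: Dict[str, float], threshold: float) -> Constraint:
    """Average item value >= threshold; `value` and the direction are assumptions."""
    def check(alpha: Sequence[Itemset]) -> bool:
        values = [value[item] for x in alpha for item in x]
        return bool(values) and sum(values) / len(values) >= threshold
    return check


def all_of(*constraints: Constraint) -> Constraint:
    """Conjunction of several constraints."""
    return lambda alpha: all(c(alpha) for c in constraints)


alpha = [frozenset("ab"), frozenset("d")]
prices = {"a": 10.0, "b": 20.0, "d": 60.0}  # hypothetical item values
combined = all_of(
    item_constraint(frozenset("abd")),
    min_length_constraint(2),
    min_avg_constraint(prices, 30.0),
)
print(combined(alpha))  # True
```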

12
Algorithm PG
  • Differences from traditional sequential pattern mining
  • Definition
  • Prefix
  • Projection
  • Projected DB
  • Algorithm PG

13
The classical sequential pattern mining
  • The classical algorithms are based on the Apriori property
  • Property: any super-pattern of an infrequent
    pattern cannot be frequent
  • A breadth-first, level-by-level search
  • The direct approach: just squeeze constraints into the
    Apriori framework
  • However, some important constraints cannot be
    handled on the basis of the Apriori property
  • Example: regular expression constraints

14
Prefix growth (PG) algorithm
  • Relies on a prefix-monotone property
  • Can handle most of the constraints discussed so
    far
  • PG pushes such constraints into sequential pattern
    mining
  • This makes mining more efficient and more effective
  • Efficiency: takes less computation time and
    space
  • Effectiveness: patterns the user is not
    interested in are pruned

15
Definition: order
  • All items in a transaction are written with
    respect to the order R
  • For example, written in the form (ade)(bc) instead of
    (dae)(cb)
  • "Item x precedes item y" is denoted by x ≺ y
  • The alphabetical order is often used
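
As a small illustration of this convention, the sketch below prints a sequence in canonical form, using alphabetical order as the order R. The function name `canonical` is hypothetical, and single-character item names are assumed.

```python
from typing import Iterable, Set


def canonical(sequence: Iterable[Set[str]]) -> str:
    """Write each transaction with its items sorted; parenthesise multi-item transactions."""
    parts = []
    for transaction in sequence:
        items = "".join(sorted(transaction))
        parts.append(items if len(transaction) == 1 else f"({items})")
    return "<" + "".join(parts) + ">"


print(canonical([{"d", "a", "e"}, {"c", "b"}]))  # <(ade)(bc)>
```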

16
Definition: the prefix
  • Given a sequence α = <X1, ..., Xn>, sequence
    β = <X1, ..., Xk, Y> is called a prefix of α
    if
  • (1) k < n
  • (2) Y ⊆ X_{k+1}
  • (3) ∀y ∈ Y, ∀z ∈ (X_{k+1} - Y), y ≺ z
  • Example: sequence α = <(abc)(acd)(bef)>
  • sequence β = <(abc)(ac)> is a prefix of sequence
    α
  • sequence γ = <(abc)(ad)> is not a prefix of α
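
The three conditions translate directly into a check. This is a minimal sketch with hypothetical names (`seq`, `is_prefix`), assuming alphabetical order for ≺ and single-character items.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def seq(*transactions: str) -> List[Itemset]:
    """Build a sequence from one string per transaction."""
    return [frozenset(t) for t in transactions]


def is_prefix(beta: Sequence[Itemset], alpha: Sequence[Itemset]) -> bool:
    """Check whether beta = <X1, ..., Xk, Y> is a prefix of alpha = <X1, ..., Xn>."""
    # Condition (1): k < n, i.e. beta has at most as many transactions as alpha.
    if not beta or len(beta) > len(alpha):
        return False
    k = len(beta) - 1
    # The first k transactions must coincide exactly.
    if list(beta[:k]) != list(alpha[:k]):
        return False
    y, target = beta[k], alpha[k]
    # Condition (2): Y is a subset of X_{k+1}.
    if not y <= target:
        return False
    # Condition (3): every item of Y precedes every item left over in X_{k+1}.
    return all(a < b for a in y for b in target - y)


alpha = seq("abc", "acd", "bef")
print(is_prefix(seq("abc", "ac"), alpha))  # True
print(is_prefix(seq("abc", "ad"), alpha))  # False
```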

17
The concept of projected database
  • For sequences α ⊑ β, sequence γ is called the
    projection of β with respect to α if
  • (1) γ ⊑ β
  • (2) α is a prefix of γ
  • (3) there exists no proper super-sequence γ' of γ
    such that γ' ⊑ β and γ' also has α as a prefix
  • The projection is also denoted by γ = β / α

18
Example for projection
  • For example, if α = <bc> and β = <(abc)d(ace)f>, then
    γ = β/α = <b(ce)f>
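
One way to compute such a projection is sketched below. It is an illustration under assumptions, not the algorithm from the slides: the names `seq` and `project` are hypothetical, alphabetical item order is assumed, and every transaction is assumed non-empty. It matches α at the earliest possible positions in β, then keeps the items that follow the last matched item together with everything after it.

```python
from typing import FrozenSet, List, Optional, Sequence

Itemset = FrozenSet[str]


def seq(*transactions: str) -> List[Itemset]:
    """Build a sequence from one string per transaction."""
    return [frozenset(t) for t in transactions]


def project(beta: Sequence[Itemset], alpha: Sequence[Itemset]) -> Optional[List[Itemset]]:
    """Return beta/alpha, or None when alpha is not a subsequence of beta."""
    if not alpha:
        return list(beta)  # projecting on the empty prefix keeps the whole sequence
    j = 0
    for x in alpha[:-1]:
        # Match every transaction but the last at its earliest position.
        while j < len(beta) and not x <= beta[j]:
            j += 1
        if j == len(beta):
            return None
        j += 1
    y = alpha[-1]
    while j < len(beta) and not y <= beta[j]:
        j += 1
    if j == len(beta):
        return None
    # Keep only the items of the matched transaction that come after all of Y.
    tail = frozenset(z for z in beta[j] if z > max(y))
    return list(alpha[:-1]) + [y | tail] + list(beta[j + 1:])


gamma = project(seq("abc", "d", "ace", "f"), seq("b", "c"))
print(gamma == seq("b", "ce", "f"))  # True
```

Choosing the earliest match keeps the longest possible remainder of β, which is what condition (3) of the definition asks for.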

19
Algorithm for computing projection
20
Example for computing projection
  • For example, if α = <bc> and β = <(abc)d(ace)f>, then
    γ = β/α = <b(ce)f>

(Figure: step-by-step trace of the projection computation over the transactions (abc), d, (ace), f of β; output <b(ce)f>)
21
Algorithm for PG
22
Flow for PG
  • Start from the sequence DB S
  • Scan S for 1-item support: <a>, <b>, <c>
  • If frequent, build the projected DBs: <a>-projected DB,
    <b>-projected DB, <c>-projected DB
  • Scan the <a>-projected DB for 2-sequence support:
    <aa>, <ab>, <ac>, <(ab)>, <(ac)>
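
The flow above can be sketched as a depth-first recursion. This is a simplified stand-in rather than the PG algorithm itself: instead of building physical projected databases, each call only carries the list of sequences that support the current prefix and re-tests containment. The names `contains`, `prefix_growth` and the pruning hook `viable` are hypothetical; `viable` is assumed to be a constraint that, once violated by a prefix, is violated by all of its extensions.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Itemset = FrozenSet[str]
Pattern = Tuple[Itemset, ...]


def contains(sequence: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """Greedy subsequence test: each pattern itemset fits into a later itemset."""
    j = 0
    for x in pattern:
        while j < len(sequence) and not x <= sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def prefix_growth(
    db: List[Sequence[Itemset]],
    min_sup: int,
    viable: Callable[[Pattern], bool] = lambda p: True,
) -> Dict[Pattern, int]:
    """Return every frequent pattern (with its support) whose prefixes all pass `viable`."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    items = sorted({i for s in db for x in s for i in x})
    found: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, supporters: List[Sequence[Itemset]]) -> None:
        for item in items:
            # Sequence extension: append a new transaction holding `item`.
            candidates = [prefix + (frozenset([item]),)]
            # Itemset extension: add `item` to the last transaction, keeping item order.
            if prefix and item > max(prefix[-1]):
                candidates.append(prefix[:-1] + (prefix[-1] | {item},))
            for cand in candidates:
                if not viable(cand):
                    continue
                sup = [s for s in supporters if contains(s, cand)]
                if len(sup) >= min_sup:
                    found[cand] = len(sup)
                    grow(cand, sup)  # recurse only on the supporting sequences

    grow((), db)
    return found


db = [
    [frozenset(t) for t in ("e", "ab", "bc", "d", "d")],
    [frozenset(t) for t in ("c", "aef", "abc", "d", "d")],
]
patterns = prefix_growth(db, 2)
print(patterns[(frozenset("ab"), frozenset("d"))])  # 2
```

Because an item is added to the last transaction only when it follows every item already there, each pattern is generated from exactly one parent and is counted once.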
23
Example for mining sequential patterns with
prefix-monotone constraints
  • The task is to mine sequential patterns with a
    regular expression constraint
    C = <a*{bb | (bc)d | dd}> and min_sup = 2

24
Prefix_growth(<>, SDB)
SDB|<>
  • Let l be the length of <>. Scan SDB|<> and
    find the length-(l + 1) frequent prefixes in SDB|<>

<a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3
<f>: 1, so f is an infrequent item
SDB|<>
<a(bc)e> contains no subsequence satisfying the
constraint
25
SDB|<a>
<a> fails C
<(ae)>: 1 and <aa>: 1 are infrequent
<(ab)>: 2, <ab>: 3, <ac>: 3, <ad>: 3
SDB|<ab>
<(abc)>: 1 and <(abb)>: 1 are infrequent
<a(bc)>: 2, <abd>: 3
26
SDB|<a(bc)>
Sequential pattern <a(bc)d> satisfies the
constraint
27
SDB|<ac>
No sequence in the projected database contains
a subsequence satisfying the constraint
SDB|<ad>
<add> is a sequential pattern satisfying the
constraint
This results in two final patterns: <a(bc)d> and <add>
28
Experimental results and performance study
  • Experiment hardware
  • Compare response time without constraint
  • GSP, SPADE, Prefix growth
  • Regular expression constraints into sequential
    pattern mining
  • Scalability of PG
  • With respect to support threshold
  • With respect to database size

29
Experiment hardware
30
Compare response time without constraint
  • Experiment on dataset C10T5S4I1.25D200k
  • Contains 100,000 sequences with 10,000 items
  • The expected average number of items within a
    transaction is 5 (denoted as T5)
  • The expected average number of transactions in a
    maximal sequential pattern is 4 (denoted as S4)

31
Pushing regular expression constraints into
sequential pattern mining
  • We randomly generate 1,000 constraints
  • The support threshold is set to 0.2
  • PG can prune both patterns and projected
    databases, but SPIRIT(V) has to scan the whole
    sequence database repeatedly

32
Scalability of PG with respect to support
threshold
33
Scalability of PG with respect to database size
  • The support threshold is set to 0.2

34
Conclusions
  • We characterize constraints for sequential
    pattern mining
  • An efficient algorithm, PG, is developed to push
    prefix-monotone constraints deep into the mining
    process