Sequential Pattern Mining

Slides: 24
Provided by: WeiW155
1
Sequential Pattern Mining
  • CS 685 Special Topics in Data Mining
  • Spring 2008

2
Sequential Pattern Mining
  • Why sequential pattern mining?
  • GSP algorithm
  • PrefixSpan

3
Sequence Data
Sequence Database
4
Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
[Figure: a sequence drawn as a timeline of elements (transactions), each holding one or more events (items): (E1 E2), (E1 E3), (E2), (E3 E4), (E2)]
5
Formal Definition of a Sequence
  • A sequence is an ordered list of elements
    (transactions)
  • s = <e1 e2 e3 ...>
  • Each element contains a collection of events
    (items)
  • ei = {i1, i2, ..., ik}
  • Each element is attributed to a specific time or
    location
  • Length of a sequence, |s|, is given by the number
    of elements of the sequence
  • A k-sequence is a sequence that contains k events
    (items)
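
As a minimal sketch of these definitions, a sequence can be modeled as an ordered tuple of elements, each element being a set of items. The tuple-of-frozensets representation and the helper name `parse` are assumptions made for illustration; they do not come from the slides.

```python
import re

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

s = parse("a(abc)(ac)d(cf)")

length = len(s)                 # |s|: number of elements (transactions)
k = sum(len(e) for e in s)      # number of events (items), so s is a k-sequence

print(length, k)
```

Under this representation the order of elements matters, while the order of items inside one element does not.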

6
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database:
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern.
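
The two claims above can be checked with a short sketch. The helper names `parse`, `contains` and `support` are hypothetical, and the containment test is a simple greedy left-to-right match: each pattern element must be a subset of some later sequence element, in order.

```python
import re

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy earliest match)."""
    i = 0
    for element in seq:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)

db = [parse(t) for t in
      ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

print(contains(db[0], parse("a(bc)dc")))
print(support(db, parse("(ab)c")))
```

Here <(ab)c> is contained in sequences 10 and 30 only, which is why it meets min_sup = 2.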
7
Sequential Pattern Mining Definition
  • Given:
  • a database of sequences
  • a user-specified minimum support threshold,
    minsup
  • Task:
  • Find all subsequences with support ≥ minsup

8
Sequential Pattern Mining Challenge
  • Given a sequence <a b c d e f g h i>
  • Examples of subsequences:
  • <a c d f g>, <c d e>, <b g>,
    etc.
  • How many k-subsequences can be extracted from a
    given n-sequence?
  • <a b c d e f g h i>, n = 9
  • k = 4: Y _ _ Y Y _ _ _ Y
  • selects <a d e i>
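
For a sequence whose elements are all single items, as in this example, choosing a k-subsequence amounts to choosing which k of the n positions to keep, so the count is the binomial coefficient C(n, k). The sketch below enumerates the case n = 9, k = 4; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

sequence = "abcdefghi"   # <a b c d e f g h i>, n = 9
k = 4

# combinations() keeps the original order, so each tuple is a subsequence.
subsequences = ["".join(c) for c in combinations(sequence, k)]

print(len(subsequences), comb(len(sequence), k))
```

The selection Y _ _ Y Y _ _ _ Y corresponds to the entry "adei" in `subsequences`.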

9
Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
  • Find the complete set of patterns satisfying the
    minimum support (frequency) threshold
  • Be highly efficient, scalable, involving only a
    small number of database scans
  • Be able to incorporate various kinds of
    user-specific constraints

10
A Basic Property of Sequential Patterns: Apriori
  • A basic property: Apriori (Agrawal & Srikant '94)
  • If a sequence S is not frequent,
  • then none of the super-sequences of S is frequent
  • E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent too

Given support threshold min_sup = 2
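
One way to use this property is to discard a candidate as soon as any subsequence obtained by deleting a single item is not known to be frequent. The sketch below assumes a tuple-of-frozensets representation; the function names and the small `frequent` set are hypothetical and chosen only to mirror the <hb> example.

```python
def drop_one_item(pattern):
    """Yield every subsequence obtained by deleting exactly one item."""
    for i, element in enumerate(pattern):
        for item in element:
            rest = element - {item}
            if rest:
                yield pattern[:i] + (rest,) + pattern[i + 1:]
            else:
                # Deleting the only item removes the whole element.
                yield pattern[:i] + pattern[i + 1:]

def can_prune(candidate, frequent):
    """True if some one-item-shorter subsequence is not frequent."""
    return any(sub not in frequent for sub in drop_one_item(candidate))

hb = (frozenset("h"), frozenset("b"))
ab = (frozenset("a"), frozenset("b"))
ah = (frozenset("ah"),)
candidate = (frozenset("ah"), frozenset("b"))    # <(ah)b>

frequent = {ab, ah}                              # hypothetical: <hb> is missing
print(can_prune(candidate, frequent))
```

Because <hb> is absent from the frequent set, <(ah)b> is pruned without scanning the database.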
11
Basic Algorithm: Breadth-First Search (GSP)
  • L = 1
  • While (Result_L != NULL)
  • Candidate Generate
  • Prune
  • Test
  • L = L + 1
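
The loop above can be sketched as a level-wise generate-and-test miner. This is not GSP's join-based candidate generation: as a simplifying assumption, each frequent pattern is extended by one frequent item, either as a new element or inside its last element. That still explores every frequent pattern, because removing the last item of a frequent pattern leaves a frequent pattern. All function names are hypothetical.

```python
import re
from itertools import chain

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy earliest match)."""
    i = 0
    for element in seq:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def support(db, pattern):
    return sum(contains(seq, pattern) for seq in db)

def extend(pattern, items):
    """Candidate generation: patterns exactly one item longer than pattern."""
    for x in items:
        yield pattern + (frozenset([x]),)              # x as a new element
        if x > max(pattern[-1]):                       # x joins the last element
            yield pattern[:-1] + (pattern[-1] | {x},)

def level_wise(db, min_sup):
    """Return {pattern: support} for all frequent patterns."""
    items = sorted(set(chain.from_iterable(chain.from_iterable(db))))
    level = {}
    for x in items:                                    # L = 1
        p = (frozenset([x]),)
        sup = support(db, p)
        if sup >= min_sup:
            level[p] = sup
    frequent_items = [next(iter(p[0])) for p in level]
    result = {}
    while level:                                       # while Result_L is not empty
        result.update(level)
        next_level = {}
        for pattern in level:
            for cand in extend(pattern, frequent_items):
                sup = support(db, cand)                # test: one support count each
                if sup >= min_sup:
                    next_level[cand] = sup
        level = next_level                             # L = L + 1
    return result

db = [parse(t) for t in
      ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
patterns = level_wise(db, min_sup=2)
print(patterns[parse("(ab)c")])
```

The sketch recounts support for every candidate separately, which keeps it short; a real implementation would count all candidates of one level in a single database scan.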

12
Finding Length-1 Sequential Patterns
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
  • Initial candidates: all singleton sequences
  • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

min_sup = 2
13
Generating Length-2 Candidates
      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>
51 length-2 candidates
      <a>  <b>     <c>     <d>     <e>     <f>
<a>        <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                        <(cd)>  <(ce)>  <(cf)>
<d>                                <(de)>  <(df)>
<e>                                        <(ef)>
<f>
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
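
The counts above follow from two kinds of length-2 candidates built from n items: n*n ordered pairs <xy> (including <xx>) and n*(n-1)/2 unordered pairs inside one element <(xy)>. A short check of the arithmetic, with a hypothetical helper name:

```python
def length2_candidates(n):
    """Number of length-2 candidates from n items: <xy> plus <(xy)>."""
    return n * n + n * (n - 1) // 2

with_apriori = length2_candidates(6)      # only the 6 frequent items a..f
without_apriori = length2_candidates(8)   # all 8 items a..h
pruned_pct = 100 * (without_apriori - with_apriori) / without_apriori

print(with_apriori, without_apriori, round(pruned_pct, 2))
```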
14
The Mining Process
min_sup = 2
15
Candidate Generate-and-Test: Drawbacks
  • A huge set of candidate sequences is generated,
  • especially 2-item candidate sequences.
  • Multiple scans of the database are needed.
  • Inefficient for mining long sequential patterns.
  • A long pattern grows from short patterns.
  • The number of short patterns is exponential in
    the length of the mined patterns.

16
Bottlenecks of GSP
  • A huge set of candidates could be generated
  • 1,000 frequent length-1 sequences generate a huge
    number of length-2 candidates!
  • Multiple scans of database in mining
  • The length of each candidate grows by one at each
    database scan.
  • Mining long sequential patterns
  • Needs an exponential number of short candidates
  • A length-100 sequential pattern needs 10^30
    candidate sequences!
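
The slides give these figures without derivation. One way to arrive at them, assumed here, is to reuse the length-2 candidate formula n*n + n*(n-1)/2 for n = 1,000, and to count every non-empty choice of items from a 100-item pattern for the long-pattern case.

```python
from math import comb

# Length-2 candidates from 1,000 frequent length-1 sequences.
n = 1000
length2 = n * n + n * (n - 1) // 2

# Non-empty subsets of the items of a length-100 pattern.
short_candidates = sum(comb(100, i) for i in range(1, 101))

print(length2)
print(f"{short_candidates:.2e}")
```

The second sum equals 2^100 - 1, which is on the order of 10^30.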

17
Pattern Growth (PrefixSpan)
  • Prefix and Suffix (Projection)
  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
    sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>

Prefix    Suffix (Prefix-Based Projection)
<a>       <(abc)(ac)d(cf)>
<aa>      <(_bc)(ac)d(cf)>
<a(ab)>   <(_c)(ac)d(cf)>
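
The prefix/suffix table can be reproduced with a small sketch. It assumes the literal notion of prefix used in the examples: all but the last prefix element equal the corresponding sequence elements, and the last prefix element consists of the alphabetically first items of its sequence element. The helper names `parse`, `fmt` and `suffix` are hypothetical.

```python
import re

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

def fmt(element, partial=False):
    """Render one element; partial elements get the '_' placeholder."""
    items = "".join(sorted(element))
    if partial:
        return "(_" + items + ")"
    return items if len(items) == 1 else "(" + items + ")"

def suffix(seq, prefix):
    """Return the suffix of seq w.r.t. prefix as text, or None if not a prefix."""
    m = len(prefix)
    if m == 0 or m > len(seq) or seq[:m - 1] != prefix[:m - 1]:
        return None
    last, element = prefix[-1], seq[m - 1]
    rest = element - last
    # The leftover items must all come after the prefix items alphabetically.
    if not last <= element or (rest and min(rest) < max(last)):
        return None
    head = fmt(rest, partial=True) if rest else ""
    return "<" + head + "".join(fmt(e) for e in seq[m:]) + ">"

s = parse("a(abc)(ac)d(cf)")
for p in ["a", "aa", "a(ab)"]:
    print(p, suffix(s, parse(p)))
```

Under this definition <ab> is not a prefix of the sequence, so `suffix` returns None for it.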
18
Mining Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns
  • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets:
  • The ones having prefix <a>
  • The ones having prefix <b>
  • ...
  • The ones having prefix <f>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
19
Finding Seq. Patterns with Prefix <a>
  • Only need to consider projections w.r.t. <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix
    <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Further partition into 6 subsets:
  • Having prefix <aa>
  • ...
  • Having prefix <af>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
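
The length-2 patterns with prefix <a> can be recomputed from the <a>-projected database with a short sketch. It handles only single-item prefixes, and the function names are hypothetical. An item counts as a sequence extension if it appears in a later element, and as an itemset extension if it appears after the prefix item inside an element that also holds the prefix item.

```python
import re
from collections import Counter

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

def project(seq, item):
    """Postfix of seq after the first element containing item, or None."""
    for i, element in enumerate(seq):
        if item in element:
            partial = frozenset(y for y in element if y > item)  # the (_...) part
            return partial, seq[i + 1:]
    return None

def length2_patterns(db, item, min_sup):
    """Frequent length-2 patterns having the single-item prefix <item>."""
    s_count, i_count = Counter(), Counter()
    for seq in db:
        projected = project(seq, item)
        if projected is None:
            continue
        partial, rest = projected
        s_items = set().union(*rest)          # items in later elements
        i_items = set(partial)                # items sharing an element with item
        for element in rest:
            if item in element:
                i_items |= {y for y in element if y > item}
        s_count.update(s_items)               # each sequence counts once per item
        i_count.update(i_items)
    found = {f"<{item}{x}>": n for x, n in s_count.items() if n >= min_sup}
    found.update({f"<({item}{x})>": n for x, n in i_count.items() if n >= min_sup})
    return found

db = [parse(t) for t in
      ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
print(sorted(length2_patterns(db, "a", min_sup=2)))
```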
20
Completeness of PrefixSpan
SDB
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
  • Having prefix <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Length-2 sequential patterns: <aa>, <ab>,
    <(ab)>, <ac>, <ad>, <af>
  • Having prefix <aa>: <aa>-proj. db
  • ...
  • Having prefix <af>: <af>-proj. db
  • Having prefix <b>
  • <b>-projected database
  • ...
  • Having prefix <c>, ..., <f>
  • ...
21
Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected
    databases
  • Can be improved by pseudo-projections

22
Speed-up by Pseudo-projection
  • Major cost of PrefixSpan: projection
  • Postfixes of sequences often appear repeatedly in
    recursive projected databases
  • When (projected) database can be held in main
    memory, use pointers to form projections
  • Pointer to the sequence
  • Offset of the postfix

s = <a(abc)(ac)d(cf)>
Projecting on <a>: s|<a> = <(abc)(ac)d(cf)>, stored as (pointer to s, 2)
Projecting on <ab>: s|<ab> = <(_c)(ac)d(cf)>, stored as (pointer to s, 4)
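
A minimal PrefixSpan-style sketch using pseudo-projection is given below. It is a variant of the scheme pictured above, under stated assumptions: each projected sequence is kept as a (sequence id, element index) pair rather than an item offset, the whole database is assumed to fit in memory, and all names are hypothetical. The element index records where the earliest match of the current pattern ends.

```python
import re
from collections import defaultdict

def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of frozensets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)

def prefixspan(db, min_sup):
    """Return {pattern: support}; projections are pointers, not copies."""
    results = {}

    def grow(pattern, pointers):
        last = pattern[-1] if pattern else None
        s_ext = defaultdict(dict)   # item -> {sid: earliest element index}
        i_ext = defaultdict(dict)
        for sid, pos in pointers:
            seq = db[sid]
            # Sequence extension: item appears in any later element.
            for k in range(pos + 1, len(seq)):
                for x in seq[k]:
                    s_ext[x].setdefault(sid, k)
            # Itemset extension: item joins the pattern's last element.
            if last is not None:
                for k in range(pos, len(seq)):
                    if last <= seq[k]:
                        for x in seq[k]:
                            if x > max(last):
                                i_ext[x].setdefault(sid, k)
        for x, ptrs in s_ext.items():
            if len(ptrs) >= min_sup:
                new = pattern + (frozenset([x]),)
                results[new] = len(ptrs)
                grow(new, list(ptrs.items()))
        for x, ptrs in i_ext.items():
            if len(ptrs) >= min_sup:
                new = pattern[:-1] + (last | {x},)
                results[new] = len(ptrs)
                grow(new, list(ptrs.items()))

    # Start with the empty pattern, pointing just before each sequence.
    grow((), [(sid, -1) for sid in range(len(db))])
    return results

db = [parse(t) for t in
      ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
patterns = prefixspan(db, min_sup=2)
print(patterns[parse("(ab)c")])
```

No postfix is ever copied: each recursive call receives only a list of pointers into the original database, which is the saving pseudo-projection aims for.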
23
Pseudo-Projection vs. Physical Projection
  • Pseudo-projection avoids physically copying
    postfixes
  • Efficient in running time and space when database
    can be held in main memory
  • However, it is not efficient when database cannot
    fit in main memory
  • Disk-based random accessing is very costly
  • Suggested Approach
  • Integration of physical and pseudo-projection
  • Swapping to pseudo-projection when the data set
    fits in memory