Jen-Wei Huang - PowerPoint PPT Presentation


Title: Jen-Wei Huang


1
  • Jen-Wei Huang
  • jwhuang_at_gmail.com
  • National Taiwan University

3
http://www.wretch.cc/blog/EtudeBIKE
4
http://www.giant-bicycles.com/zh-TW/
7
http://cape7.pixnet.net/blog
8
http://cape7.pixnet.net/blog
9
http://cape7.pixnet.net/blog
10
http://www.wretch.cc/blog/orzboyz
http://blog.sina.com.tw/9winds/
http://atomcinema.pixnet.net/blog
12
http://www.amazon.com
13
http://www.amazon.com
14
http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html
A General Model for Sequential Pattern Mining
with a Progressive Database
  • Jen-Wei Huang, Chi-Yao Tseng,
  • Jian-Chih Ou and Ming-Syan Chen
  • National Taiwan University

IEEE Trans. on Knowledge and Data Engineering,
Vol. 20, No. 6, June 2008
16
Outlines
  • Introduction
  • Preliminaries
  • Algorithm Pisa
  • Experiments
  • Conclusions
  • Q & A
17
Introduction to SPM
  • Mining of frequently occurring patterns related
    to time or other sequences.
    • J. Han, Data Mining: Concepts and Techniques
  • Given a set of sequences, find the complete set
    of frequent subsequences.
    • J. Pei, PrefixSpan
  • Ex) Which items a customer will buy after having
    bought certain other items
18
Time-related data
  • Customer buying behavior
  • Natural phenomena
  • Sensor network data
  • Web access patterns
  • Stock price changes
  • DNA sequence applications
19
Definition
  • Let I = {x1, x2, ..., xn} be a set of different
    items.
  • An element e, denoted by (xi xj ...), is a set
    of items, e ⊆ I, that appear in a sequence at
    the same time.
  • A sequence s, denoted by <e1, e2, ..., em>, is
    an ordered list of elements.
  • A sequence database Db contains a set of
    sequences, and |Db| represents the number of
    sequences in Db.
20
Definition
  • A sequence α = <a1, a2, ..., an> is a
    subsequence of another sequence β = <b1, b2,
    ..., bm> if
    • there exists a set of integers,
    • 1 ≤ i1 < i2 < ... < in ≤ m, such that
    • a1 ⊆ b_i1, a2 ⊆ b_i2, ..., and an ⊆ b_in.
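The subsequence definition above can be checked with a short greedy scan. This is only a minimal sketch: the function name `is_subsequence` and the representation of a sequence as a list of Python sets are assumptions made for illustration, not part of the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha = <a1..an> is a subsequence of beta = <b1..bm>.

    Each sequence is a list of elements, and each element is a set of items.
    Matching every a_k to the earliest possible b_i is sufficient, because an
    earlier match never rules out a later one.
    """
    position = 0
    for element in alpha:
        # Skip elements of beta that do not contain the current element of alpha.
        while position < len(beta) and not set(element) <= set(beta[position]):
            position += 1
        if position == len(beta):
            return False
        position += 1  # indices must be strictly increasing: i1 < i2 < ...
    return True
```

For example, `is_subsequence([{"A"}, {"B"}], [{"A", "D"}, {"C"}, {"B"}])` holds, while `[{"A", "B"}]` is not a subsequence of `[{"A"}, {"B"}]` because both items would have to appear in the same element.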
21
Definition
  • Sequential pattern mining can be defined as
  • "Given a sequence database, Db, and a
    user-defined minimum support, min_sup, find the
    complete set of subsequences whose occurrence
    frequencies ≥ min_sup × |Db|."
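The frequency condition in this definition can be sketched as follows. The helper names `contains`, `support` and `is_frequent`, and the list-of-sets representation, are illustrative assumptions rather than anything defined in the slides.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (lists of item sets)."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not set(element) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(db, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in db)


def is_frequent(db, pattern, min_sup):
    """Occurrence frequency >= min_sup x |Db|, with min_sup given as a fraction."""
    return support(db, pattern) >= min_sup * len(db)
```

With four sequences and `min_sup = 0.5`, a pattern has to occur in at least two of them to be reported.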
22
Three Categories
  • Depending on the management of the corresponding
    database, sequential pattern mining can be
    divided into three categories, namely sequential
    pattern mining with
  • a static database.
  • an incremental database.
  • a progressive database.

23
How To Do Sequential Pattern Mining on a Static
Database
  • An Overview

24
How?
  • Apriori-like algorithms
    • AprioriAll by Agrawal et al.
    • GSP by R. Srikant et al.
  • Partition-based algorithms
    • FreeSpan by J. Han et al.
    • PrefixSpan by J. Pei et al.
  • Vertical format algorithms
    • SPADE by Zaki et al.
    • SPAM by Ayres et al.

25
Apriori-like Algorithms
  • 1. Sort phase
    • Sort the database
    • Customer ID as the primary key and time as the
      secondary key
  • 2. Litemset phase
    • Count the frequency of each itemset
    • The frequency is the fraction of customers who
      bought the itemset

26
Apriori-like Algorithms
  • 3. Transformation phase
    • Transform each transaction into the set of all
      litemsets it contains, in the form of
    • C01: <(1,5) (2) (3) (4)>
    • C02: <(1) (3) (4) (3,5)>
    • C03: <(1) (2) (3) (4)>
    • C04: <(1) (3) (5)>
    • C05: <(4) (5)>

27
Itemset      Support
10           3
20           3
30           4
40           3
50           1
60           1
70           4
90           4
10 20        1
40 60        1
40 70        3
60 70        1
40 60 70     1
30 50        1
30 70        1
50 70        1
30 50 70     1

CID   Items
2     10 20
5     90
2     30
2     40 60 70
4     30
3     30 50 70
1     30
1     90
4     40 70
4     90
3     10
5     10
1     40 70
5     20
2     90
3     20

CID   Items (customer sequence)
1     <(30) (90) (40 70)>
2     <(10 20) (30) (40 60 70) (90)>
3     <(30 50 70) (10) (20)>
4     <(30) (40 70) (90)>
5     <(90) (10) (20)>
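The litemset phase on this example can be reproduced with a short sketch. The customer sequences are taken from the slide; the function names and the minimum count of 2 customers are assumptions (the slide does not state its threshold, but any value above 1 and up to 3 selects the same itemsets).

```python
from itertools import combinations

# Customer sequences from the slide: CID -> ordered list of transactions.
SEQUENCES = {
    1: [(30,), (90,), (40, 70)],
    2: [(10, 20), (30,), (40, 60, 70), (90,)],
    3: [(30, 50, 70), (10,), (20,)],
    4: [(30,), (40, 70), (90,)],
    5: [(90,), (10,), (20,)],
}


def itemset_support(sequences):
    """Count, for every itemset, the number of customers who bought it."""
    counts = {}
    for transactions in sequences.values():
        seen = set()  # each customer is counted at most once per itemset
        for transaction in transactions:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return counts


def litemsets(sequences, min_count):
    """Itemsets bought by at least `min_count` customers (assumed threshold)."""
    counts = itemset_support(sequences)
    return sorted(s for s, c in counts.items() if c >= min_count)
```

Calling `litemsets(SEQUENCES, 2)` keeps 10, 20, 30, 40, 70, 90 and (40 70), and drops the itemsets whose support is 1.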
28
Itemset   Support   New ID
10        3         1
20        3         2
30        4         3
40        3         4
70        4         5
90        4         6
40 70     3         7

CID   Items (transformed sequence)
1     <(3) (6) (4, 5, 7)>
2     <(1, 2) (3) (4, 5, 7) (6)>
3     <(3, 5) (1) (2)>
4     <(3) (4, 5, 7) (6)>
5     <(6) (1) (2)>
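The transformation shown in these tables can be sketched as below. The mapping is copied from the slide; the function name `transform` and the tuple/set representation are assumptions for illustration.

```python
# Litemset -> new ID, as listed on the slide.
MAPPING = {(10,): 1, (20,): 2, (30,): 3, (40,): 4, (70,): 5, (90,): 6, (40, 70): 7}


def transform(transactions, mapping):
    """Replace each transaction by the IDs of all litemsets it contains.

    Transactions that contain no litemset are dropped from the sequence.
    """
    result = []
    for transaction in transactions:
        ids = {new_id for itemset, new_id in mapping.items()
               if set(itemset) <= set(transaction)}
        if ids:
            result.append(ids)
    return result
```

For customer 2, `transform([(10, 20), (30,), (40, 60, 70), (90,)], MAPPING)` gives the element sets {1, 2}, {3}, {4, 5, 7}, {6}, which matches the row for CID 2.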
29
Apriori-like Algorithms
  • 4. Mining phase
    • Apriori-like algorithm
  • 5. Maximal phase
    • Find the maximal patterns

30
Sequence  Count    Sequence  Count    Sequence  Count
<1 2>     2        <3 4>     3        <5 6>     2
<1 3>     1        <3 5>     3        <5 7>     0
<1 4>     1        <3 6>     3        <6 1>     1
<1 5>     1        <3 7>     3        <6 2>     1
<1 6>     1        <4 1>     0        <6 3>     0
<1 7>     1        <4 2>     0        <6 4>     1
<2 1>     0        <4 3>     0        <6 5>     1
<2 3>     1        <4 5>     0        <6 7>     1
<2 4>     1        <4 6>     2        <7 1>     0
<2 5>     1        <4 7>     0        <7 2>     0
<2 6>     1        <5 1>     1        <7 3>     0
<2 7>     1        <5 2>     1        <7 4>     0
<3 1>     1        <5 3>     0        <7 5>     0
<3 2>     1        <5 4>     0        <7 6>     2
31
Sequence   Count
<3 4 6>    2
<3 5 6>    2
<3 7 6>    2
Therefore, the frequent sequential patterns are
<1 2>, <3 4>, <3 5>, <3 6>, <3 7>, <4 6>, <5 6>,
<7 6>, <3 4 6>, <3 5 6>, <3 7 6>.
According to the mappings, the original frequent
sequential patterns are <10 20>, <30 40>, <30 70>,
<30 90>, <30 40 70>, <40 90>, <70 90>, <40 70 90>,
<30 40 90>, <30 70 90>, <30 40 70 90>.
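The counts in the mining phase can be recomputed from the transformed sequences with a small sketch. The data comes from the slides; the helper names and the minimum count of 2 customers are assumptions (the slides do not state the threshold, but 2 is the value that reproduces the listed frequent patterns).

```python
from itertools import permutations

# Transformed customer sequences from the slides: CID -> list of litemset-ID sets.
TRANSFORMED = {
    1: [{3}, {6}, {4, 5, 7}],
    2: [{1, 2}, {3}, {4, 5, 7}, {6}],
    3: [{3, 5}, {1}, {2}],
    4: [{3}, {4, 5, 7}, {6}],
    5: [{6}, {1}, {2}],
}


def contains(sequence, pattern):
    """True if the litemset IDs in `pattern` occur in order, in distinct elements."""
    position = 0
    for litemset_id in pattern:
        while position < len(sequence) and litemset_id not in sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support_count(pattern):
    """Number of customers whose transformed sequence contains the pattern."""
    return sum(contains(seq, pattern) for seq in TRANSFORMED.values())


MIN_COUNT = 2  # assumed threshold, inferred from the patterns listed on the slide

# All ordered pairs of distinct litemset IDs, filtered by support.
frequent_2 = sorted(p for p in permutations(range(1, 8), 2)
                    if support_count(p) >= MIN_COUNT)
```

Under these assumptions `frequent_2` contains the eight 2-sequences listed above, and `support_count((3, 4, 6))` is 2. A full Apriori-style miner would generate longer candidates only from frequent shorter ones; this sketch just verifies the counts.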
32
Because <30 40> and <30 70> are contained by
<30 40 70>, <40 90> and <70 90> are contained by
<40 70 90>, and <30 40 90> and <30 70 90> are
contained by <30 40 70 90>,
the final maximal sequential patterns are <10 20>,
<30 90>, <30 40 70>, <40 70 90>, <30 40 70 90>.
33
Related Works
  • Static database
    • AprioriAll by Agrawal et al.
    • GSP by R. Srikant et al.
    • SPADE by Zaki et al.
    • FreeSpan by J. Han et al.
    • PrefixSpan by J. Pei et al.
    • SPAM by Ayres et al.
34
Related Works
  • Incremental database
    • ISM by Parthasarathy et al.
    • IncSP by Lin et al.
    • ISE by Masseglia et al.
    • IncSpan by Cheng et al.
    • MILE by Chen et al.
35
Motivation
  • The assumption of having a static database may
    not hold in practice.
    • Data in the real world change on the fly.
  • Finding sequential patterns in an incremental
    database may be of little interest to users.
    • Users are usually more interested in recent
      data than in old data.
36
Motivation
  • If a certain sequence does not have any newly
    arriving elements, this sequence still stays
    in the database and undesirably contributes to
    |Db|.
  • As a result, new sequential patterns that appear
    frequently in the recent sequences may not be
    considered frequent sequential patterns.
37
Definition -- Period of Interest
  • Period of Interest (abbreviated as POI) is a
    sliding window
    • whose length is a user-specified time interval,
    • continuously advancing as time goes by.
  • The sequences having elements whose timestamps
    fall into this period, POI, contribute to
    |Db| for current sequential patterns.
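The effect of the POI on |Db| can be illustrated with a minimal sketch. The function name, the dictionary layout (sequence ID mapped to a list of `(timestamp, element)` pairs) and the sample data are all assumptions made for this example.

```python
def sequences_in_poi(db, now, poi):
    """Return the IDs of the sequences that contribute to |Db| at time `now`.

    A sequence contributes if at least one of its elements has a timestamp
    inside the sliding window [now - poi + 1, now].
    """
    start = now - poi + 1
    return {sid for sid, elements in db.items()
            if any(start <= timestamp <= now for timestamp, _ in elements)}


# Hypothetical progressive database.
EXAMPLE_DB = {
    "S01": [(1, "A"), (2, "B")],
    "S02": [(6, "C")],
    "S03": [(3, "D"), (7, "A")],
}
```

With `poi = 5`, the window at time 5 covers timestamps 1 to 5 and includes S01 and S03; at time 7 it covers 3 to 7, so S01 no longer contributes while S02 and S03 do.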
38
(Figure: example progressive sequence database, with the elements of each sequence laid out by SID and time; POI = 5, min_supp = 0.5)
40
Progressive Sequential Pattern
  • The progressive sequential pattern mining problem
    is defined as follows:
  • "Given a progressive sequence database, a
    user-specified period of interest, and a
    user-defined minimum support threshold, find the
    complete set of frequent subsequences whose
    occurrence frequencies are greater than or equal
    to the minimum support times the number of
    sequences in every period of interest of the
    database."
41
Naïve Algorithm
  • Use conventional static sequential pattern mining
    algorithms to mine sequential patterns separately
    from every POI
    • e.g., Db^{1,5}, Db^{2,6}, Db^{3,7}, Db^{4,8}, Db^{5,9}, etc.
  • For a sequence database whose elements appear
    in an interval of n timestamps, the total number
    of POIs in this interval is equal to
    (n - POI + 1).
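The windows that the naïve approach has to re-mine can be enumerated directly. This is only an illustrative sketch; the function name `poi_windows` is an assumption.

```python
def poi_windows(n, poi):
    """List every full-length POI (p, q) inside the timestamps 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]
```

For `n = 9` and `poi = 5` this returns (1, 5), (2, 6), (3, 7), (4, 8) and (5, 9), that is, n - POI + 1 = 5 sub-databases, each of which would be mined from scratch by a static algorithm.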
42
Prior Work
  • The only prior work on progressive databases is
    GSP+ and MFS+, proposed by Zhang et al. and based
    on the static algorithms GSP and MFS (also derived
    by the same authors).
  • However, these algorithms still have to re-mine
    each sub-database using the static algorithms GSP
    and MFS.
  • Moreover, the performance improvement of GSP+
    and MFS+ over GSP and MFS is only within 15%, as
    reported by their authors.
43
Algorithm DirApp
  • Stands for Direct Append.
  • Consists of two procedures
    • Progressively Updating
      • abbreviated as PrUp
    • Immediately Filtering
      • abbreviated as ImFi
44
Procedure PrUp
  • When progressively reading newly incoming
    elements, Procedure PrUp can
    • update each sequence in the sequence database
    • generate candidate sequential patterns
    • calculate occurrence frequencies of all candidate
      sequential patterns in the current POI.
45
Procedure ImFi
  • DirApp uses Procedure ImFi to
    • filter out obsolete data from the existing
      sequence database
    • prune away obsolete candidate sequential patterns
      from the candidate set
    • report the most up-to-date frequent sequential
      patterns to the user in every POI
46
(Figure: example sequence with the elements A, B, C, (AD), B)
47
Example
48
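Candidate sets of the example sequence (each candidate is followed by its timestamp)
(1) Db^{1,1}: A1
(2) Db^{1,2}: A1, B2, AB1
(3) Db^{1,3}: A1, B2, AB1
(4) Db^{1,4}: A1, B2, AB1, C4, AC1, BC2, ABC1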
49
(5) Db^{1,5}: A5, B2, AB1, C4, AC1, BC2, ABC1, D5, (AD)5, AD1, A(AD)1, BA2, BD2,
    B(AD)2, ABD1, AB(AD)1, CA4, CD4, C(AD)4, ACD1, AC(AD)1, BCA2, BCD2, BC(AD)2,
    ABCD1, ABC(AD)1
50
(6) Db^{2,6}: A5, B2, C4, BC2, D5, (AD)5, BA2, BD2, B(AD)2, CA4, CD4, C(AD)4,
    BCA2, BCD2, BC(AD)2
51
(7) Db^{3,7}: A5, C4, D5, (AD)5, CA4, CD4, C(AD)4, B7, AB5, CB4, DB5, (AD)B5,
    CAB4, CDB4, C(AD)B4
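The candidate sets in this example can be reproduced with a short sketch in the spirit of PrUp and ImFi, restricted to a single sequence. It is not the authors' implementation: the function names, the dictionary from candidate pattern to starting timestamp, and the rule of skipping a candidate element that already occurs in the pattern (borrowed from the description of Pisa) are assumptions chosen so that the output matches the slides.

```python
from itertools import combinations


def subsets(element):
    """All non-empty item combinations of an element, e.g. 'AD' -> A, D, AD."""
    items = sorted(element)
    return ["".join(c) for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def dirapp_step(candidates, t, poi, element=None):
    """Advance one sequence's candidate set to timestamp t.

    `candidates` maps a pattern (tuple of element labels) to the timestamp at
    which the pattern starts; `element` is the newly arriving element, if any.
    """
    # Filtering: drop candidates that start before the current POI.
    candidates = {p: s for p, s in candidates.items() if s > t - poi}
    if element:
        new_elements = subsets(element)
        # Updating: append the new element to every surviving candidate,
        # keeping the starting timestamp of the candidate that is extended.
        for pattern, start in list(candidates.items()):
            for e in new_elements:
                if e not in pattern:
                    candidates[pattern + (e,)] = start
        # The new element also starts fresh candidates at timestamp t.
        for e in new_elements:
            candidates[(e,)] = t
    return candidates


def run_example(poi=5):
    """Replay the example sequence A, B, C, (AD), B at timestamps 1, 2, 4, 5, 7."""
    arrivals = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}
    candidates, history = {}, {}
    for t in range(1, 8):
        candidates = dirapp_step(candidates, t, poi, arrivals.get(t))
        history[t] = dict(candidates)
    return history
```

`run_example()` yields 7 candidates at timestamp 4, 26 at timestamp 5, and 15 at timestamps 6 and 7, in line with Db^{1,4}, Db^{1,5}, Db^{2,6} and Db^{3,7} above. Occurrence counting across several sequences is left out of this sketch.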

53
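Sequences S01 to S04, candidate sets per sequence in Db^{1,2}:
  A1, B2, AB1
  A1, D1, (AD)1, B2, AB1, DB1, (AD)B1
  A1, B2, C2, (BC)2, AB1, AC1, A(BC)1
  D2
Db^{1,2} (4 sequences), candidate patterns with occurrence counts:
  AB1 (3), A(BC)1 (1), AC1 (1), (AD)B1 (1), DB1 (1)
Frequent sequential pattern: AB1 (3)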
54
(2) Db^{1,2} (4 sequences): AB1 (3), A(BC)1 (1), AC1 (1), (AD)B1 (1), DB1 (1)
    Frequent: AB1 (3)
(3) Db^{1,3} (5 sequences): AB1 (3), A(BC)1 (1), AC1 (1), (AD)B1 (1), DB1 (1),
    A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1)
    Frequent: AB1 (3)
(4) Db^{1,4} (5 sequences): AB1 (3), A(BC)1 (1), AC1 (2), (AD)B1 (1), DB3 (2),
    A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1), ABC1 (2), A(BC)BC1 (1),
    A(BC)C1 (1), (AD)A1 (1), (AD)BA1 (1), BA2 (1), BC3 (2), (BC)BC2 (1),
    (BC)C2 (1), DA1 (1), DBA1 (1)
    Frequent: AB1 (3)
(5) Db^{1,5} (5 sequences): AB1 (3), A(BC)1 (1), AC1 (2), (AD)B1 (1), DB3 (2),
    A(BC)B1 (1), ACB1 (1), (BC)B2 (1), CB2 (1), DC2 (1), ABC1 (2), A(BC)BC1 (1),
    A(BC)C1 (1), (AD)A1 (1), (AD)BA1 (1), BA4 (3), BC3 (2), (BC)BC2 (1),
    (BC)C2 (1), DA3 (3), DBA3 (2), A(AD)1 (1), AB(AD)1 (1), ABC(AD)1 (1),
    ABCD1 (1), ABD1 (1), AC(AD)1 (1), ACD1 (1), AD1 (1), B(AD)2 (1), BCA2 (1),
    BC(AD)2 (1), BCD2 (1), BD2 (1), CA4 (2), C(AD)4 (1), CD4 (1), DCA2 (1)
    Frequent: AB1 (3), DA3 (3), BA4 (3)
55
(6) Db^{2,6} (5 sequences): DB3 (1), (BC)B2 (1), CB2 (1), DC2 (1), BA4 (4),
    BC3 (2), (BC)BC2 (1), (BC)C2 (1), DA3 (2), DBA3 (1), B(AD)2 (1), BCA3 (2),
    BC(AD)2 (1), BCD2 (1), BD2 (1), CA4 (3), C(AD)4 (1), CD4 (1), DCA2 (1),
    (BC)A2 (1), (BC)BA2 (1), (BC)BCA2 (1), (BC)CA2 (1), CBA2 (1)
    Frequent: BA4 (4), CA4 (3)
(7) Db^{3,7} (5 sequences): DB5 (2), BA4 (2), BC4 (2), DA3 (1), DBA3 (1),
    BCA3 (1), CA4 (3), C(AD)4 (1), CD4 (1), AB5 (2), A(BC)5 (1), AC5 (2),
    (AD)B5 (1), BAC4 (1), CAB4 (2), CA(BC)3 (1), C(AD)B4 (1), CB4 (2),
    C(BC)3 (1), CDB4 (1), DAC3 (1), DBAC3 (1), DBC3 (1), DC3 (1)
    Frequent: CA4 (3)
(8) Db^{4,8} (6 sequences): DB5 (1), BA4 (1), BC7 (2), CA4 (2), C(AD)4 (1),
    CD4 (1), AB5 (2), A(BC)5 (1), AC6 (4), (AD)B5 (1), BAC4 (1), CAB4 (1),
    C(AD)B4 (1), CB4 (1), CDB4 (1), ABC5 (1), (AD)BC5 (1), (AD)C5 (1),
    DBC5 (1), DC5 (1)
    Frequent: AC6 (4)
(9) Db^{5,9} (5 sequences): DB5 (1), BC7 (1), AB5 (2), A(BC)5 (1), AC8 (5),
    (AD)B5 (1), ABC5 (1), (AD)BC5 (1), (AD)C5 (1), DBC5 (1), DC5 (1), ACD6 (2),
    AD6 (2), CD8 (2)
    Frequent: AC8 (5)
56
The Advantages of DirApp
  • DirApp needs only one scan of newly arriving
    elements and the candidate set at each timestamp
    rather than quadratic scans by conventional
    algorithms.
  • DirApp can
    • maintain the latest data sequences
    • find the complete set of up-to-date sequential
      patterns
    • delete obsolete data and patterns rapidly
57
The Disadvantages of DirApp
  • DirApp needs a lot of working space to store the
    candidate sets for all sequences.
  • Scanning all candidate sets takes a large share
    of the execution time.
  • DirApp needs another data structure to calculate
    the occurrence frequencies of all candidate
    sequential patterns.
59
Algorithm Pisa
  • Pisa stands for Progressive mIning of Sequential
    pAtterns.
  • Pisa utilizes a Progressive Sequential tree
    (abbreviated as PS-tree) to maintain the
    information of all sequences in each POI in
    order to
    • update each sequence
    • find up-to-date sequential patterns
60
PS-tree
  • The nodes in a PS-tree can be divided into two
    different types
    • Root node
    • Common nodes
  • Each common node stores two pieces of information
    • Node label: an element in a sequence
    • Sequence list
      • sequence IDs containing this element
      • marked by corresponding timestamps
61
PS-tree
  • Whenever there is a series of elements appearing
    in the same sequence, there will be a series of
    nodes labeled by each element with the same
    sequence ID in their sequence lists.
  • The first node, representing the first element,
    will be connected to the Root node.
  • The other nodes will be connected to the first
    node analogously.
62
PS-tree
(Figure: PS-tree examples)
63
PS-tree
  • The path from the Root node to any other node
    represents a candidate sequential pattern
    appearing in this sequence.
  • The appearing timestamp of each candidate
    sequential pattern is marked in the node
    labeled by the last element.
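The node layout and the path-to-pattern correspondence described above could be represented as follows. This is a sketch under assumptions: the class name `PSNode`, its field names, and the helpers `insert` and `paths` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class PSNode:
    """A PS-tree node; the Root node is the one with an empty label."""
    label: str = ""
    seq_list: dict = field(default_factory=dict)   # sequence ID -> timestamp
    children: dict = field(default_factory=dict)   # element label -> PSNode


def insert(root, pattern, sid, timestamp):
    """Record (sid, timestamp) in the node reached by following `pattern`."""
    node = root
    for label in pattern:
        node = node.children.setdefault(label, PSNode(label))
    node.seq_list[sid] = timestamp


def paths(node, prefix=()):
    """Yield (candidate pattern, sequence list) for every common node."""
    for label, child in sorted(node.children.items()):
        pattern = prefix + (label,)
        yield pattern, child.seq_list
        yield from paths(child, pattern)
```

For a sequence S01 with A at timestamp 1 and B at timestamp 2, inserting the candidates A (1), B (2) and AB (1) produces three common nodes, and `paths` returns the patterns A, AB and B with their timestamps stored in the last node of each path.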
64
PS-tree
(Figure: PS-tree examples)
65
Algorithm Pisa
  • When receiving elements at timestamp t+1, Pisa
    traverses the PS-tree of timestamp t in
    post-order to
    • delete the obsolete elements from it
    • update the current sequences in it
    • insert the newly arriving elements into it
  • This transforms it into the PS-tree of
    timestamp t+1.
66
For a common node
  • Pisa deletes the obsolete sequences in the
    sequence list of this node.
    • If there is no sequence ID left in the sequence
      list, Pisa prunes this node away from its parent.
  • Pisa checks the sequence IDs left in the sequence
    list to see whether any of these sequences has a
    newly arriving element.
    • If there is no newly arriving element, Pisa goes
      to the next node.
67
For a common node
  • Otherwise, Pisa generates all combinations of
    candidate elements from the arriving element.
    • Ex) (ABC) -> A, B, C, AB, AC, BC, ABC
  • For each candidate element that does not exist on
    the path from Root to the current node
    • If there is a child with the same label, Pisa
      updates the timestamp of this sequence to the
      timestamp of the same sequence in the parent's
      sequence list.
    • Otherwise, Pisa creates a new child for this
      element with the sequence ID and the timestamp of
      the same sequence in the parent's sequence list.
68
For Root node
  • Instead of checking the sequence list, Pisa
    examines all sequences that have newly arriving
    elements.
  • After Pisa generates all combinations of candidate
    elements, for each of them
    • If there is a child with the same label, Pisa
      updates the timestamp of this sequence to t+1.
    • Otherwise, Pisa creates a new child for this
      element with the sequence ID and timestamp t+1.
69
Algorithm Pisa
  • After Pisa processes a common node, if the number
    of sequence IDs in the sequence list is greater
    than or equal to min_supp × |Db^{p,q}|,
    • the path from Root to this node is output
      as a frequent sequential pattern.
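The update rules for common nodes and the Root node can be put together in a compact sketch. It follows the slides where they are explicit and makes several assumptions elsewhere: the names (`PSNode`, `pisa_step`, `frequent`), single-letter items, reporting in a separate pass after the traversal instead of during it, and counting |Db^{p,q}| as the number of distinct sequence IDs still present under Root.

```python
from itertools import combinations


class PSNode:
    """PS-tree node: an element label plus a sequence list (sid -> timestamp)."""

    def __init__(self, label=None):
        self.label = label      # None marks the Root node
        self.seqs = {}          # sequence ID -> timestamp
        self.children = {}      # element label -> PSNode


def subsets(element):
    """All non-empty item combinations of an element, e.g. 'AD' -> A, D, AD."""
    items = sorted(element)
    return ["".join(c) for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def _attach(node, label, sid, timestamp):
    """Create the child `label` if needed and record (sid, timestamp) in it."""
    node.children.setdefault(label, PSNode(label)).seqs[sid] = timestamp


def _visit(node, path, t, poi, arrivals):
    # Post-order: children first, so nodes created at this timestamp are
    # not extended again within the same step.
    for child in list(node.children.values()):
        _visit(child, path + (child.label,), t, poi, arrivals)
        if not child.seqs:
            del node.children[child.label]       # prune an empty node
    if node.label is None:                       # Root node
        for sid, element in arrivals.items():
            for e in subsets(element):
                _attach(node, e, sid, t)
        return
    # Common node: delete obsolete sequences, then extend the remaining ones.
    node.seqs = {sid: ts for sid, ts in node.seqs.items() if ts > t - poi}
    for sid, ts in node.seqs.items():
        for e in subsets(arrivals.get(sid, "")):
            if e not in path:
                _attach(node, e, sid, ts)


def pisa_step(root, t, poi, arrivals):
    """Turn the PS-tree of timestamp t-1 into the PS-tree of timestamp t."""
    _visit(root, (), t, poi, arrivals)


def frequent(root, min_supp):
    """Paths whose sequence list holds at least min_supp x |Db| sequence IDs."""
    db_size = len({sid for c in root.children.values() for sid in c.seqs})
    found, stack = {}, [((c.label,), c) for c in root.children.values()]
    while stack:
        path, node = stack.pop()
        if len(node.seqs) >= min_supp * db_size:
            found[path] = len(node.seqs)
        stack.extend((path + (c.label,), c) for c in node.children.values())
    return found
```

As a small usage example with POI = 2: after sequences s1 and s2 receive A at timestamp 1 and B at timestamp 2, and s3 receives C at timestamp 2, `frequent(root, 0.5)` reports A, B and AB. One step later, with no new elements, A and AB have become obsolete and only B remains frequent.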
70
PS-tree
(Figure: PS-tree examples)
71
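(Figure: initial PS-tree containing only Root; POI = 5, min_supp = 0.5)
72
(Figure: PS-tree for Db^{1,1}, 3 sequences)
73
(Figure: PS-tree updated from Db^{1,1} to Db^{1,2}, 4 sequences; frequent pattern: AB1 (3))
74
(Figure: PS-tree for Db^{1,3}, 5 sequences; frequent pattern: AB1 (3))
75
(Figure: PS-tree for Db^{1,4}, 5 sequences; frequent pattern: AB1 (3))
76
(Figure: PS-tree for Db^{1,5}, 5 sequences; frequent patterns: AB1 (3), BA4 (3), DA3 (3))
77
(Figure: PS-tree for Db^{2,6}, 5 sequences; frequent patterns: CA4 (3), BA4 (4))
78
(Figure: PS-tree for Db^{3,7}, 5 sequences; frequent pattern: CA4 (3))
79
(Figure: PS-tree for Db^{4,8}, 6 sequences; frequent pattern: AC6 (3))
80
(Figure: PS-tree for Db^{5,9}, 5 sequences; frequent pattern: AC8 (4))
81
(Figure: PS-tree for Db^{6,10}, 5 sequences; frequent pattern: CD8 (4))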
82
The Advantages of Pisa
  • Pisa needs only one scan of newly arriving
    elements and the PS-tree at each timestamp rather
    than quadratic scans by conventional algorithms.
  • Pisa can
    • maintain the latest data sequences
    • find the complete set of up-to-date sequential
      patterns
    • delete obsolete data and patterns rapidly
83
The Advantages of Pisa
  • Each path from Root to any other node on the
    PS-tree forms a unique candidate sequential
    pattern. Thus Pisa merges identical candidate
    patterns, and patterns do not have to store
    their prefix elements.
    • The PS-tree consumes less space.
    • Dealing with identical sequential patterns
      together is also very efficient in execution
      time.
  • FastPisa is a faster variant with approximate
    results.
85
Experiments
  • Comparative algorithms
    • GSP+ -- re-mining version of GSP
    • SPAM+ -- re-mining version of SPAM
    • DirApp
  • Environment
    • Pentium 4 3GHz CPU and 2GB RAM
    • Coded in C
86
Experiments
  • The synthetic datasets are generated in a way
    similar to the IBM data generator designed for
    testing sequential pattern mining algorithms.
87
Experiments
  • We divide the target dataset into n timestamps.
  • According to the POI, the first m timestamps (m
    = POI and m < n) are viewed as the original
    database, and the rest of the transactions in the
    dataset are received by the system incrementally.
88
Experiments
  • The first run of the experiments mines the first
    POI from the first m timestamps of the
    dataset.
  • After that, we shift the POI forward by t (t << m)
    timestamps for each of the following runs.
89
Experiments
  • The real data sets are from KDDCUP07.
  • We randomly choose 120 successive days for the
    performance evaluation. A timestamp is set to 3
    days in order to obtain sufficient frequent
    sequential patterns.
  • Therefore, there are 40 timestamps in total and
    the POI is set to 10. The new datasets contain
    more than 5000 sequences and 2000 different items.
90
Cumulative Execution Time
91
Minimum Support
92
Length of POI
93
Number of Sequences
94
Scalability of Pisa
95
Real Data Set
96
Improvement of FastPisa
97
Information Loss of FastPisa
99
Conclusions
  • We proposed a progressive algorithm, Pisa, to
    handle the progressive sequential pattern mining
    problem without re-mining all sub-databases at
    each timestamp.
  • Pisa needs only one scan of newly arriving
    elements and the PS-tree at each timestamp rather
    than quadratic scans by conventional algorithms.
100
Conclusions
  • Pisa can
    • maintain the latest information of sequences
    • find the complete set of up-to-date sequential
      patterns
    • delete obsolete data and patterns rapidly
  • Pisa also
    • consumes less space
    • has high efficiency
    • possesses great scalability
101
References
  • R. Srikant and R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements. Proc. of EDBT, 1996.
  • J. Ayres, J. Gehrke, T. Yiu, and J. Flannick.
    Sequential pattern mining using a bitmap
    representation. Proc. of ACM SIGKDD, 2002.
  • M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip.
    Efficient algorithms for incremental update of
    frequent sequences. Proc. of PAKDD, 2002.
102
Thank You!
  • Q & A