Stream Sequential Pattern Mining with Precise Error Bounds

1
Stream Sequential Pattern Mining with Precise Error Bounds
  • ICDM 2008
  • Luiz F. Mendes, Bolin Ding, Jiawei Han

2
Outline
  • Introduction
  • Problem statement
  • Two methods for mining sequential patterns
  • Experimental

3
Introduction
  • The problem of mining sequential patterns from a
    large static database has been studied extensively
    by data mining researchers. This is an important
    problem with many real-world applications such as
    customer purchase behavior analysis, DNA sequence
    analysis, and analysis of scientific experiments.

4
Problem statement
  • The count of a sequence s, denoted by count(s),
    is defined as the number of sequences that contain
    s. The support of a sequence s, denoted by
    supp(s), is defined as count(s) divided by the
    total number of sequences seen.
  • If supp(s) ≥ σ, where σ is a user-supplied
    minimum support threshold, then s is a frequent
    sequence, that is, a sequential pattern.

5
Problem statement (cont.)
  • Example 1: Suppose the length of our data stream
    is only 3 sequences: S1 = <a, b, c>, S2 = <a, c>,
    and S3 = <b, c>. Let us assume we are given
    σ = 0.5. The set of sequential patterns and their
    corresponding counts are as follows:
  • <a>:2, <b>:2, <c>:3,
  • <a, c>:2,
  • and <b, c>:2.
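
Example 1 can be checked with a brute-force sketch. It assumes that "contains" means order-preserving subsequence containment with gaps allowed, which is consistent with the counts on the slide. The function names `subsequences` and `sequential_patterns` are illustrative and do not come from the presentation.

```python
from itertools import combinations

def subsequences(seq):
    """Return every non-empty subsequence of seq (order kept, gaps allowed)."""
    return {
        tuple(seq[i] for i in idx)
        for r in range(1, len(seq) + 1)
        for idx in combinations(range(len(seq)), r)
    }

def sequential_patterns(stream, min_support):
    """Map each pattern with supp >= min_support to its count."""
    counts = {}
    for seq in stream:
        # subsequences() returns a set, so a pattern is counted
        # at most once per stream sequence.
        for sub in subsequences(seq):
            counts[sub] = counts.get(sub, 0) + 1
    n = len(stream)
    return {p: c for p, c in counts.items() if c / n >= min_support}

stream = [("a", "b", "c"), ("a", "c"), ("b", "c")]
patterns = sequential_patterns(stream, 0.5)
```

Enumerating every subsequence is exponential in the sequence length, so this is only suitable for toy inputs such as the three sequences above.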

6
Two methods for mining sequential patterns
  • The SS-BE method
    (Stream Sequence miner using Bounded Error)
  • The SS-MB method
    (Stream Sequence miner using Memory Bounds)

7
The SS-BE Method
  • Given a stream of sequences D = S1, S2, ...
  • Suppose the batch length L is 4,
  • the minimum support threshold σ is 0.75,
    the significance threshold ε is 0.5, the batch
    support threshold α is 0.4, and the pruning period
    δ is 2.
  • Batch B1 = <a,b,c>, <a,c>, <a,b>, <b,c>
  • Batch B2 = <a,b,c,d>, <c,a,b>, <d,a,b>, <a,e,b>
  • The stream length N = 8

8
The SS-BE Method (cont.)
  • The first batch is mined with a minimum support
    of 0.4 (the batch support threshold).
  • The frequent sequences found, followed by their
    support counts:
  • Batch B1 = <a,b,c>, <a,c>, <a,b>, <b,c>
  • <a>:3, <b>:3, <c>:3
  • <a,b>:2, <a,c>:2, <b,c>:2
  • These sequences are inserted into the tree with
    their respective counts and a batchCount of 1.
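
The per-batch mining step can be sketched as follows. The slides do not say which miner is applied to each batch, so this uses plain brute-force enumeration as a stand-in; `mine_batch` is a hypothetical helper name. It reproduces the counts listed for B1, and also those the slides give for B2.

```python
from itertools import combinations

def mine_batch(batch, batch_support):
    """Return {pattern: count} for patterns whose support within the batch
    is at least batch_support. Brute force, for illustration only."""
    counts = {}
    for seq in batch:
        # Build the set of distinct subsequences so that each batch
        # sequence contributes at most 1 to a pattern's count.
        subs = {
            tuple(seq[i] for i in idx)
            for r in range(1, len(seq) + 1)
            for idx in combinations(range(len(seq)), r)
        }
        for sub in subs:
            counts[sub] = counts.get(sub, 0) + 1
    return {p: c for p, c in counts.items() if c / len(batch) >= batch_support}

b1 = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
b2 = [("a", "b", "c", "d"), ("c", "a", "b"), ("d", "a", "b"), ("a", "e", "b")]
frequent_b1 = mine_batch(b1, 0.4)
frequent_b2 = mine_batch(b2, 0.4)
```

With L = 4 and a threshold of 0.4, a pattern needs a count of at least 2 in the batch, which is why <a,b,c> (count 1 in B1) is left out.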

9
The SS-BE Method (cont.)
  • The second batch is mined with the same support
    threshold of 0.4.
  • The frequent sequences found, followed by their
    support counts:
  • Batch B2 = <a,b,c,d>, <c,a,b>, <d,a,b>, <a,e,b>
  • <a>:4, <b>:4, <c>:2, <d>:2, <a,b>:4
  • <a>, <b>, <c>, and <a,b> are already in the tree,
    so their counts are incremented by 4, 4, 2, and 4,
    and each now has a batchCount of 2.
  • <d> is new, so it is inserted with a count of 2
    and a batchCount of 1.
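
The bookkeeping across the two batches can be sketched with a flat dictionary standing in for the tree. This is a simplification: the presentation stores the sequences in a tree, and `update_tree` is an illustrative name. The per-batch counts are copied from the slides.

```python
def update_tree(tree, batch_patterns):
    """Add one batch's frequent patterns to the tree.

    Existing nodes get their count increased and batchCount incremented;
    new patterns are inserted with batchCount 1.
    """
    for pattern, batch_count in batch_patterns.items():
        node = tree.setdefault(pattern, {"count": 0, "batchCount": 0})
        node["count"] += batch_count
        node["batchCount"] += 1
    return tree

# Frequent patterns per batch, as listed on the slides.
b1_patterns = {("a",): 3, ("b",): 3, ("c",): 3,
               ("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 2}
b2_patterns = {("a",): 4, ("b",): 4, ("c",): 2, ("d",): 2, ("a", "b"): 4}

tree = {}
update_tree(tree, b1_patterns)
update_tree(tree, b2_patterns)
```

After both updates, <a>, <b>, <c>, and <a,b> have a batchCount of 2, while <a,c>, <b,c>, and <d> have a batchCount of 1.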

10
The SS-BE Method (cont.)
  • B1 => <a>:3, <b>:3, <c>:3, <a,b>:2, <a,c>:2, <b,c>:2
  • B2 => <a>:4, <b>:4, <c>:2, <d>:2, <a,b>:4

11
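The SS-BE Method (cont.)
  • B' = B - batchCount
  • <a>, <b>, <c>, <a,b>: batchCount = 2
  • All other nodes: batchCount = 1
  • Batch support threshold α = 0.4
  • <d>: count + B' = 2 + 1 ≤ 4, so it is pruned
  • <c>: count + B' = 5 + 0 > 4, so it is not pruned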
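
The pruning check can be sketched as below. The slides only show the specialised test count + B' ≤ 4; the general form used here, count + B' × (⌈αL⌉ - 1) ≤ εBL, is an assumption that reduces to the slide's test for α = 0.4, L = 4, ε = 0.5, and B = 2. The sketch also assumes B = 2 for every node, as in this walk-through, and the name `prune` is illustrative.

```python
from math import ceil

def prune(tree, batches_elapsed, batch_len, alpha, eps):
    """Keep only nodes that fail the (assumed) pruning test
    count + B' * (ceil(alpha * L) - 1) <= eps * B * L,
    where B' = B - batchCount."""
    slack_per_batch = ceil(alpha * batch_len) - 1   # 1 for alpha = 0.4, L = 4
    limit = eps * batches_elapsed * batch_len       # 0.5 * 2 * 4 = 4
    kept = {}
    for pattern, node in tree.items():
        missed_batches = batches_elapsed - node["batchCount"]  # B'
        if node["count"] + missed_batches * slack_per_batch > limit:
            kept[pattern] = node
    return kept

# Tree state after batches B1 and B2, taken from the slides.
tree = {
    ("a",): {"count": 7, "batchCount": 2},
    ("b",): {"count": 7, "batchCount": 2},
    ("c",): {"count": 5, "batchCount": 2},
    ("a", "b"): {"count": 6, "batchCount": 2},
    ("a", "c"): {"count": 2, "batchCount": 1},
    ("b", "c"): {"count": 2, "batchCount": 1},
    ("d",): {"count": 2, "batchCount": 1},
}
pruned_tree = prune(tree, batches_elapsed=2, batch_len=4, alpha=0.4, eps=0.5)
```

Under these assumptions <d>, <a,c>, and <b,c> each give 2 + 1 = 3 ≤ 4 and are dropped, leaving the four nodes whose batchCount is 2.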

12
The SS-BE Method (cont.)
  • In this case the pruning condition becomes
    count + B' ≤ 4
  • error
  • (σ - ε)N = (0.75 - 0.5) × 8 = 2
  • <a>:7, <b>:7, <c>:5, <a,b>:6
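
The output step for this example can be sketched as a filter on the remaining counts. The slide gives the threshold value (σ - ε)N = 2 but not whether the comparison is strict; this sketch assumes "at least", and with counts of 7, 7, 5, and 6 either choice gives the same result. `frequent_output` is an illustrative name.

```python
def frequent_output(tree_counts, sigma, eps, n):
    """Return the patterns whose stored count reaches (sigma - eps) * n."""
    threshold = (sigma - eps) * n   # (0.75 - 0.5) * 8 = 2
    return {p: c for p, c in tree_counts.items() if c >= threshold}

# Counts left in the tree after pruning, as listed on the slide.
counts_after_pruning = {("a",): 7, ("b",): 7, ("c",): 5, ("a", "b"): 6}
result = frequent_output(counts_after_pruning, sigma=0.75, eps=0.5, n=8)
```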

13
The SS-MB Method
  • Given a stream of sequences D = S1, S2, ...
  • Suppose the batch length L is 4,
  • the minimum support threshold σ is 0.75,
    the significance threshold ε is 0.5, and the
    maximum number of nodes allowed in the tree after
    processing any given batch is 7.
  • Batch B1 = <a,b,c>, <a,c>, <a,b>, <b,c>
  • Batch B2 = <a,b,c,d>, <c,a,b>, <d,a,b>, <a,e,b>
  • The stream length N = 8

14
The SS-MB Method (cont.)
  • B1 => <a>:3, <b>:3, <c>:3, <a,b>:2, <a,c>:2, <b,c>:2
  • B2 => <a>:4, <b>:4, <c>:2, <d>:2, <a,b>:4

15
The SS-MB Method (cont.)
  • Because there are now 8 nodes in the tree and the
    maximum is 7, we must remove the sequence having
    minimum count from the tree. Breaking ties
    arbitrarily, the node corresponding to the
    sequence <b, c> is removed.
  • The variable min is set to this sequence's count
    before being deleted, 2.
  • The algorithm outputs all sequences
    corresponding to nodes having count above
    (σ - ε)N = (0.75 - 0.5) × 8 = 2
  • New node
  • Count min
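
The memory-bounded step can be sketched as follows, again with a dictionary standing in for the tree. Two assumptions are made to match the slide's "8 nodes": the root counts toward the node limit, and ties for the minimum count are broken by insertion order, so this sketch may remove a different count-2 node than the <b, c> chosen on the slide. The names `ss_mb_update` and `min_removed` are illustrative, and the sketch does not model how min is used when later nodes are inserted.

```python
def ss_mb_update(tree, batch_counts, max_nodes):
    """Merge one batch into the tree, then evict minimum-count nodes
    until the node limit holds. Returns the count of the last evicted
    node (0 if nothing was evicted)."""
    for pattern, c in batch_counts.items():
        tree[pattern] = tree.get(pattern, 0) + c
    min_removed = 0
    # Assumption: the root is counted as a node, hence the + 1.
    while len(tree) + 1 > max_nodes:
        victim = min(tree, key=tree.get)   # ties broken by insertion order
        min_removed = tree.pop(victim)
    return min_removed

# Frequent patterns per batch, as listed on the slides.
b1_patterns = {("a",): 3, ("b",): 3, ("c",): 3,
               ("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 2}
b2_patterns = {("a",): 4, ("b",): 4, ("c",): 2, ("d",): 2, ("a", "b"): 4}

tree = {}
ss_mb_update(tree, b1_patterns, max_nodes=7)
min_removed = ss_mb_update(tree, b2_patterns, max_nodes=7)

threshold = (0.75 - 0.5) * 8               # (sigma - eps) * N = 2
output = {p for p, c in tree.items() if c > threshold}
```

Whichever count-2 node is evicted, min ends up at 2 and the output is the same, because only nodes with count strictly above 2 are reported.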

16
Experimental

17
Experimental