Data Stream - PowerPoint PPT Presentation

Provided by: raym168

Title: Data Stream


1
COMP537
  • Data Stream

Prepared by Raymond Wong
Presented by Raymond Wong
raywong_at_cse
2
Data Mining over Static Data
  • Association
  • Clustering
  • Classification

Output (Data Mining Results)
Static Data
3
Data Mining over Data Streams
  • Association
  • Clustering
  • Classification

Output (Data Mining Results)

Unbounded Data
Real-time Processing
4
Data Streams
Each point a transaction
5
Data Streams
6
Entire Data Streams
Each point a transaction
Obtain the data mining results from all data points read so far
7
Data Streams with Sliding Window
Each point a transaction
Obtain the data mining results over a sliding window
8
Data Streams
  • Entire Data Streams
  • Data Streams with Sliding Window

9
Entire Data Streams
Frequent pattern/item
  • Association
  • Clustering
  • Classification

10
Frequent Item over Data Streams
  • Let N be the length of the data stream
  • Let s be the support threshold (as a fraction) (e.g., 20%)
  • Problem: We want to find all items with frequency ≥ sN

Each point a transaction
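
As a baseline for the problem stated above, the following is a minimal Python sketch that counts every item exactly and reports the items that meet the threshold sN (taken here as f ≥ sN). It assumes the whole stream fits in memory, which is the assumption the streaming algorithms later in this deck remove. The function name `frequent_items` and the sample stream are illustrative and not part of the slides.

```python
from collections import Counter

def frequent_items(stream, s):
    """Return the items whose exact frequency is at least s * N."""
    counts = Counter(stream)      # one exact counter per distinct item
    n = sum(counts.values())      # N: length of the stream
    return {item for item, f in counts.items() if f >= s * n}

# Hypothetical stream of N = 10 items; with s = 0.2 the threshold sN is 2.
stream = ["I1", "I2", "I1", "I3", "I1", "I2", "I1", "I4", "I1", "I1"]
print(frequent_items(stream, 0.2))
```

Exact counting needs one counter per distinct item, so its memory grows with the number of distinct items in the stream.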
11
Data Streams
12
Data Streams
Output (Data Mining Results) over Static Data
  • Frequent item: I1
  • Infrequent items: I2, I3

Output (Data Mining Results) over Unbounded Data
  • Frequent items: I1, I3
  • Infrequent item: I2

13
False Positive/Negative
  • E.g.
  • Expected Output
    • Frequent item: I1
    • Infrequent items: I2, I3
  • Algorithm Output
    • Frequent items: I1, I3
    • Infrequent item: I2
  • False Positive
    • The item is classified as a frequent item
    • In fact, the item is infrequent

Which item is one of the false positives? I3
More? No.
No. of false positives = 1
If we say: "The algorithm has no false positives", it means that all true infrequent items are classified as infrequent items in the algorithm output.
14
False Positive/Negative
  • E.g.
  • Expected Output
    • Frequent items: I1, I3
    • Infrequent item: I2
  • Algorithm Output
    • Frequent item: I1
    • Infrequent items: I2, I3
  • False Negative
    • The item is classified as an infrequent item
    • In fact, the item is frequent

Which item is one of the false negatives? I3
More? No.
No. of false negatives = 1
No. of false positives = 0
If we say: "The algorithm has no false negatives", it means that all true frequent items are classified as frequent items in the algorithm output.
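
The two examples above can be checked with set differences. This is a small Python sketch; the function name `classification_errors` is illustrative, and the item sets are the ones shown on the two False Positive/Negative slides.

```python
def classification_errors(true_frequent, reported_frequent):
    """Return (false_positives, false_negatives) as sets of items."""
    # Reported as frequent, but in fact infrequent.
    false_positives = reported_frequent - true_frequent
    # In fact frequent, but not reported as frequent.
    false_negatives = true_frequent - reported_frequent
    return false_positives, false_negatives

# False-positive example: expected frequent {I1}, algorithm says {I1, I3}.
fp_a, fn_a = classification_errors({"I1"}, {"I1", "I3"})
# False-negative example: expected frequent {I1, I3}, algorithm says {I1}.
fp_b, fn_b = classification_errors({"I1", "I3"}, {"I1"})
print(fp_a, fn_a, fp_b, fn_b)
```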
15
Data Streams
We need to introduce an input error parameter ε
16
Data Streams
Output (Data Mining Results) over Static Data
  • Frequent item: I1
  • Infrequent items: I2, I3

Output (Data Mining Results) over Unbounded Data
  • Frequent items: I1, I3
  • Infrequent item: I2

17
Data Streams
N: total no. of occurrences of items
N = 20, ε = 0.2, εN = 4

Static Data: store the statistics of all items
Unbounded Data: estimate the statistics of all items

Item | Static Data (true) | Unbounded Data (estimate) | Diff. D | D ≤ εN?
I1   | 10                 | 10                        | 0       | Yes
I2   | 8                  | 4                         | 4       | Yes
I3   | 12                 | 10                        | 2       | Yes

18
ε-deficient synopsis
  • Let N be the current length of the stream (or total no. of occurrences of items)
  • Let ε be an input parameter (a real number from 0 to 1)
  • An algorithm maintains an ε-deficient synopsis if its output satisfies the following properties
  • Condition 1: There is no false negative. (All true frequent items are classified as frequent items in the algorithm output.)
  • Condition 2: The difference between the estimated frequency and the true frequency is at most εN.
  • Condition 3: All items whose true frequencies are less than (s-ε)N are classified as infrequent items in the algorithm output.
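
The three conditions can be checked mechanically when the true frequencies are known. The sketch below is a minimal Python illustration, not part of the slides: the function name `check_synopsis` is hypothetical, the frequencies are the ones from the earlier N = 20, ε = 0.2 example, and the support threshold s = 0.5 is an assumed value because that example does not give one.

```python
def check_synopsis(true_freq, est_freq, reported, s, eps, n):
    """Return (cond1, cond2, cond3) for an epsilon-deficient synopsis."""
    true_frequent = {e for e, f in true_freq.items() if f >= s * n}
    # Condition 1: no false negative.
    cond1 = true_frequent <= reported
    # Condition 2: every estimate is within eps * N of the true frequency.
    cond2 = all(abs(f - est_freq.get(e, 0)) <= eps * n
                for e, f in true_freq.items())
    # Condition 3: items below (s - eps) * N are never reported as frequent.
    cond3 = all(e not in reported
                for e, f in true_freq.items() if f < (s - eps) * n)
    return cond1, cond2, cond3

true_freq = {"I1": 10, "I2": 8, "I3": 12}   # statistics over static data
est_freq = {"I1": 10, "I2": 4, "I3": 10}    # estimated statistics
ok = check_synopsis(true_freq, est_freq, {"I1", "I3"}, s=0.5, eps=0.2, n=20)
bad = check_synopsis(true_freq, est_freq, {"I1"}, s=0.5, eps=0.2, n=20)
print(ok, bad)
```

In the second call the frequent item I3 is missing from the reported set, so Condition 1 fails while the other two still hold.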

19
Frequent Pattern Mining over Entire Data Streams
  • Algorithm
  • Sticky Sampling Algorithm
  • Lossy Counting Algorithm
  • Space-Saving Algorithm

20
Sticky Sampling Algorithm
[Diagram: Sticky Sampling takes the support threshold, the error parameter, and the confidence parameter as input, keeps its statistics stored in the memory, and outputs frequent items and infrequent items]
21
Sticky Sampling Algorithm
  • The sampling rate r varies over the lifetime of a stream
  • Confidence parameter δ (a small real number)
  • Let t = ⌈(1/ε) ln(s⁻¹δ⁻¹)⌉

22
Sticky Sampling Algorithm
  • e.g. s = 0.02, ε = 0.01
  • δ = 0.1
  • t = 622
  • The sampling rate r varies over the lifetime of a stream
  • Confidence parameter δ (a small real number)
  • Let t = ⌈(1/ε) ln(s⁻¹δ⁻¹)⌉

Elements     | In terms of t     | Sampling rate r
1 ~ 1244     | 1 ~ 2×622         | 1
1245 ~ 2488  | 2×622+1 ~ 4×622   | 2
2489 ~ 4976  | 4×622+1 ~ 8×622   | 4
23
Sticky Sampling Algorithm
  • e.g. s = 0.5, ε = 0.35
  • δ = 0.5
  • t = 4
  • The sampling rate r varies over the lifetime of a stream
  • Confidence parameter δ (a small real number)
  • Let t = ⌈(1/ε) ln(s⁻¹δ⁻¹)⌉

Elements  | In terms of t  | Sampling rate r
1 ~ 8     | 1 ~ 2×4        | 1
9 ~ 16    | 2×4+1 ~ 4×4    | 2
17 ~ 32   | 4×4+1 ~ 8×4    | 4
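
The two worked examples can be reproduced with a short Python sketch. It assumes the doubling schedule of the standard Sticky Sampling description (the first 2t elements use r = 1, the next 2t use r = 2, the next 4t use r = 4, and so on); the function names `sticky_t` and `sampling_rate` are illustrative.

```python
import math

def sticky_t(s, eps, delta):
    """t = ceil((1/eps) * ln(1 / (s * delta)))."""
    return math.ceil((1 / eps) * math.log(1 / (s * delta)))

def sampling_rate(position, t):
    """Sampling rate r for the element at a 1-based stream position."""
    r, upper = 1, 2 * t           # positions 1 .. 2t use r = 1
    while position > upper:
        r *= 2                    # the rate doubles at each boundary
        upper = 2 * t * r         # r = 2 ends at 4t, r = 4 ends at 8t, ...
    return r

print(sticky_t(0.02, 0.01, 0.1), sticky_t(0.5, 0.35, 0.5))
print([sampling_rate(p, 4) for p in (8, 9, 16, 17, 32)])
```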
24
Sticky Sampling Algorithm
(e, f): e is the element, f is its estimated frequency
  • S: empty list → will contain (e, f)
  • When data e arrives,
    • if e exists in S, increment f in (e, f)
    • if e does not exist in S, add entry (e, 1) with prob. 1/r (where r: sampling rate)
  • When r changes,
    • For each entry (e, f),
      • Repeatedly toss a coin with P(head) = 1/r until the outcome of the coin toss is head
      • If the outcome of the toss is tail,
        • Decrement f in (e, f)
        • If f = 0, delete the entry (e, f)
  • Output: Get a list of items where f + εN ≥ sN
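
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class name `StickySampling`, the `seed` argument, and the doubling schedule for r (2t elements at r = 1, 2t at r = 2, 4t at r = 4, ...) are assumptions, and the coin uses P(head) = 1/r as written on the slide.

```python
import math
import random

class StickySampling:
    def __init__(self, s, eps, delta, seed=None):
        self.s, self.eps = s, eps
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.r = 1                     # current sampling rate
        self.upper = 2 * self.t        # last position handled at rate r
        self.n = 0                     # N: elements seen so far
        self.S = {}                    # element e -> estimated frequency f
        self.rng = random.Random(seed)

    def _on_rate_change(self):
        for e in list(self.S):
            # Toss until the first head; each tail decrements f.
            while self.rng.random() >= 1 / self.r:
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]      # f reached 0: drop the entry
                    break

    def add(self, e):
        self.n += 1
        if self.n > self.upper:        # crossed a boundary: r doubles
            self.r *= 2
            self.upper = 2 * self.t * self.r
            self._on_rate_change()
        if e in self.S:
            self.S[e] += 1
        elif self.rng.random() < 1 / self.r:
            self.S[e] = 1              # sampled with probability 1/r

    def frequent(self):
        """Items with f + eps*N >= s*N."""
        n = self.n
        return {e for e, f in self.S.items() if f + self.eps * n >= self.s * n}

ss = StickySampling(s=0.5, eps=0.35, delta=0.5, seed=0)
for item in "aaaaabbc":                # 8 elements, all seen at r = 1
    ss.add(item)
print(ss.S, ss.frequent())
```

While r = 1 every new element is stored, so the counts for the first 2t elements are exact; randomness only enters once the rate has doubled.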

25
Analysis
  • ε-deficient synopsis
    • Sticky Sampling computes an ε-deficient synopsis with probability at least 1-δ
  • Memory Consumption
    • Sticky Sampling occupies at most ⌈(2/ε) ln(s⁻¹δ⁻¹)⌉ entries

26
Frequent Pattern Mining over Entire Data Streams
  • Algorithm
  • Sticky Sampling Algorithm
  • Lossy Counting Algorithm
  • Space-Saving Algorithm

27
Lossy Counting Algorithm
[Diagram: Lossy Counting takes the support threshold and the error parameter as input, keeps its statistics stored in the memory, and outputs frequent items and infrequent items]
28
Lossy Counting Algorithm
Each point a transaction
N: current length of stream

29
Lossy Counting Algorithm
(e, f, Δ): e is the element, f is the frequency of the element since this entry was inserted into D, and Δ is the max. possible error in f
  • D: Empty set
  • Will contain (e, f, Δ)
  • When data e arrives,
    • If e exists in D,
      • Increment f in (e, f, Δ)
    • If e does not exist in D,
      • Add entry (e, 1, b_current - 1)
  • Remove some entries in D whenever N ≡ 0 mod w (i.e., whenever it reaches the bucket boundary), where w = ⌈1/ε⌉ is the bucket width and b_current = ⌈N/w⌉ is the id of the current bucket. The rule of deletion is: (e, f, Δ) is deleted if f + Δ ≤ b_current
  • Output: Get a list of items where f + εN ≥ sN
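
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class name `LossyCounting` is hypothetical, and the bucket width w = ⌈1/ε⌉ and bucket id b_current = ⌈N/w⌉ follow the standard description of Lossy Counting, which is an assumption where the transcript does not spell them out.

```python
import math

class LossyCounting:
    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)    # bucket width
        self.n = 0                     # N: current length of the stream
        self.D = {}                    # element e -> [f, delta]

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1
        else:
            self.D[e] = [1, b_current - 1]
        if self.n % self.w == 0:       # bucket boundary reached
            # Keep only the entries with f + delta > b_current.
            self.D = {k: v for k, v in self.D.items()
                      if v[0] + v[1] > b_current}

    def frequent(self, s):
        """Items with f + eps*N >= s*N."""
        n = self.n
        return {e for e, (f, _) in self.D.items() if f + self.eps * n >= s * n}

lc = LossyCounting(eps=0.2)            # w = 5
for item in "abacaadaba":              # hypothetical stream of 10 elements
    lc.add(item)
print(lc.D, lc.frequent(0.5))
```

In this run the items b, c, and d are pruned at the bucket boundaries N = 5 and N = 10, and only a survives with its exact count.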

30
Lossy Counting Algorithm
  • ε-deficient synopsis
    • Lossy Counting computes an ε-deficient synopsis
  • Memory Consumption
    • Lossy Counting occupies at most ⌈(1/ε) log(εN)⌉ entries.

31
Comparison
e.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1000
Sticky Sampling: Memory = 1243
Lossy Counting: Memory = 231
32
Comparison
e.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000
Sticky Sampling: Memory = 1243
Lossy Counting: Memory = 922
33
Comparison
e.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000
Sticky Sampling: Memory = 1243
Lossy Counting: Memory = 1612
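
The memory figures in the three comparisons follow from the two bounds given earlier, taking log as the natural logarithm and rounding up. A minimal Python sketch (function names are illustrative):

```python
import math

def sticky_sampling_entries(s, eps, delta):
    """ceil((2/eps) * ln(1 / (s * delta))); does not depend on N."""
    return math.ceil((2 / eps) * math.log(1 / (s * delta)))

def lossy_counting_entries(eps, n):
    """ceil((1/eps) * ln(eps * N)); grows logarithmically with N."""
    return math.ceil((1 / eps) * math.log(eps * n))

for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, sticky_sampling_entries(0.02, 0.01, 0.1),
          lossy_counting_entries(0.01, n))
```

The Sticky Sampling bound stays fixed as the stream grows, while the Lossy Counting bound starts lower and overtakes it for very long streams.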
34
Frequent Pattern Mining over Entire Data Streams
  • Algorithm
  • Sticky Sampling Algorithm
  • Lossy Counting Algorithm
  • Space-Saving Algorithm

35
Sticky Sampling Algorithm
[Diagram: Sticky Sampling takes the support threshold, the error parameter, and the confidence parameter as input, keeps its statistics stored in the memory, and outputs frequent items and infrequent items]
36
Lossy Counting Algorithm
[Diagram: Lossy Counting takes the support threshold and the error parameter as input, keeps its statistics stored in the memory, and outputs frequent items and infrequent items]
37
Space-Saving Algorithm
[Diagram: Space-Saving takes the support threshold and the memory parameter as input, keeps its statistics stored in the memory, and outputs frequent items and infrequent items]
38
Space-Saving
  • M: the greatest number of possible entries stored in the memory

39
Space-Saving
(e, f, Δ): e is the element, f is the frequency of the element since this entry was inserted into D, and Δ is the max. possible error in f
  • D: Empty set
  • Will contain (e, f, Δ)
  • p_e = 0
  • When data e arrives,
    • If e exists in D,
      • Increment f in (e, f, Δ)
    • If e does not exist in D,
      • If the size of D = M
        • p_e ← min over e ∈ D of (f + Δ)
        • Remove all entries e where f + Δ ≤ p_e
      • Add entry (e, 1, p_e)
  • Output: Get a list of items where f + Δ ≥ sN
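
The steps above can be sketched in Python as follows. This is a minimal illustration of the procedure as written on the slide, not a reference implementation of any published variant; the class name `SpaceSaving` and the sample stream are hypothetical.

```python
class SpaceSaving:
    def __init__(self, m):
        self.m = m          # M: greatest number of entries kept in memory
        self.p_e = 0        # error assigned to newly inserted entries
        self.n = 0          # N: elements seen so far
        self.D = {}         # element e -> [f, delta]

    def add(self, e):
        self.n += 1
        if e in self.D:
            self.D[e][0] += 1
            return
        if len(self.D) == self.m:
            # D is full: find the smallest f + delta and evict those entries.
            self.p_e = min(f + d for f, d in self.D.values())
            self.D = {k: v for k, v in self.D.items()
                      if v[0] + v[1] > self.p_e}
        self.D[e] = [1, self.p_e]

    def frequent(self, s):
        """Items with f + delta >= s*N."""
        return {e for e, (f, d) in self.D.items() if f + d >= s * self.n}

sp = SpaceSaving(m=2)
for item in "aabcac":                  # hypothetical stream of 6 elements
    sp.add(item)
print(sp.D, sp.frequent(0.5))
```

In this run c is reported although its true frequency is 2, below sN = 3. That is a false positive, which the synopsis conditions allow; what they rule out is a false negative.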

40
Space-Saving
  • Greatest Error
    • Let E be the greatest error in any estimated frequency. E ≤ 1/M
  • ε-deficient synopsis
    • Space-Saving computes an ε-deficient synopsis if E ≤ ε

41
Comparison
e.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000
Sticky Sampling: Memory = 1243
Lossy Counting: Memory = 1612
Space-Saving: Memory can be very large (e.g., 4,000,000). Since E ≤ 1/M, the error is very small
42
Data Streams
  • Entire Data Streams
  • Data Streams with Sliding Window

43
Data Streams with Sliding Window
Frequent pattern/itemset
  • Association
  • Clustering
  • Classification

44
Sliding Window
  • Mining Frequent Itemsets in a sliding window
  • E.g. t1: I1 I2; t2: I1 I3 I4
  • To find frequent itemsets in a sliding window

45
Sliding Window
[Figure: four Storage blocks along the stream]
46
Sliding Window
[Figure: batches B1, B2, B3, B4 and their Storage blocks]
Remove the whole batch
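
The batch idea can be sketched in Python as follows. The slides give only the picture, so everything here is an assumption made for illustration: the class name `BatchedWindow`, a fixed number of transactions per batch, one counter per batch as its storage, and counting single items rather than itemsets.

```python
from collections import Counter, deque

class BatchedWindow:
    def __init__(self, batch_size, num_batches):
        self.batch_size = batch_size    # transactions per batch
        self.num_batches = num_batches  # batches kept in the window
        self.batches = deque()          # one Counter (storage) per batch
        self.filled = 0                 # transactions in the newest batch

    def add(self, transaction):
        if not self.batches or self.filled == self.batch_size:
            self.batches.append(Counter())   # start a new batch
            self.filled = 0
            if len(self.batches) > self.num_batches:
                self.batches.popleft()       # remove the whole oldest batch
        # Count each distinct item of the transaction once.
        self.batches[-1].update(set(transaction))
        self.filled += 1

    def counts(self):
        """Item counts over all batches currently in the window."""
        total = Counter()
        for batch in self.batches:
            total.update(batch)
        return total

win = BatchedWindow(batch_size=2, num_batches=2)
for t in (["I1", "I2"], ["I1", "I3", "I4"], ["I2"], ["I1"]):
    win.add(t)
before = win.counts()
win.add(["I3"])                         # opens a third batch, drops the first
after = win.counts()
print(before, after)
```

Dropping a whole batch at once means the window slides in steps of one batch rather than one transaction, but no per-transaction bookkeeping is needed.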