Checking properties of data streams

1
Checking properties of data streams
  • Sampath Kannan
  • University of Pennsylvania
  • (joint work with J. Feigenbaum, M. Strauss, and
    M. Viswanathan)

2
Talk Outline
  • Motivation and Model
  • Relationship to other models
  • Detecting Anomalies
  • Grouping data
  • Future work

3
Motivation
  • Deluge of time-dependent data: if we don't
    process it soon, it will become irrelevant, or
    it will be too late!
  • Need to change both:
  • What we ask about the data.
  • How we find the answer to our questions.
  • Need good theoretical models to capture the
    constraints under which we operate.

4
Model
Data stream:
0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0
1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0
0 0 1 0 1 0 1 1
(Could be a Cisco router with NetFlow software.)
Processor
Memory (size is small relative to stream size)

5
Relationship to other models
The spot-checking model is related. Key difference:
only a few scans of the data stream are allowed.
Actually, many of the spot-checkers we have
designed already do only a few scans of the data
and can be viewed as stream checkers. So we
already know a stream checker for sorting.

6
Similar models based on finite automata have been
studied. Our model is one of the first to
integrate randomization into the processing.
Closely related to the model of Henzinger,
Raghavan, and Rajagopalan. Also related: the
sketch model of Broder, Charikar, Frieze, and
Mitzenmacher.

7
Detecting anomalies
Trivial case
0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 0
1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0
1 1 0
Is this the output of a fair coin? We can simply
keep track of the total number of heads, tails,
and tosses. Another solution: sample a few
positions and tabulate results only at these
positions. This still works.
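Both solutions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the threshold parameter z (how many standard deviations from n/2 we tolerate), and the sampling probability are all assumptions made for the example. The first function keeps only two counters; the second tabulates results only at randomly sampled positions and then applies the same test.

```python
import math
import random


def fair_coin_check(stream, z=3.0):
    """One pass, O(1) memory: count heads and tosses, compare heads to n/2.

    `z` is a hypothetical tolerance (number of standard deviations).
    """
    heads = 0
    tosses = 0
    for bit in stream:
        heads += bit      # bit is 1 for heads, 0 for tails
        tosses += 1
    if tosses == 0:
        return True       # nothing observed, nothing to reject
    # For a fair coin, the number of heads has mean n/2
    # and standard deviation sqrt(n)/2.
    deviation = abs(heads - tosses / 2)
    return deviation <= z * math.sqrt(tosses) / 2


def sampled_fair_coin_check(stream, sample_prob=0.1, z=3.0, rng=None):
    """Tabulate results only at randomly sampled positions.

    `sample_prob` is a hypothetical parameter: each position is kept
    independently with this probability.
    """
    rng = rng or random.Random()
    sampled = (bit for bit in stream if rng.random() < sample_prob)
    return fair_coin_check(sampled, z)
```

For example, fair_coin_check([0, 1] * 50) accepts, while fair_coin_check([1] * 100) rejects, because 100 heads in 100 tosses is far more than three standard deviations from 50.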

8
Less Trivial Example
4 3 5 3 7 8 6 8 1 7 2 6 4 3 3 2
1 1 2 3 3 4 4 6 7 9 2 3 6 7 8 5
1 4 6 7 9 2 4

9
The Real Problem
(3, 5, +) (2, 7, -) (4, 1, -) (5, 2, -) (3, 8, -)
(2, 4, +) (1, 10, +) (6, 7, -) (7, 7, -) (4, 5, +)
(7, 2, +) (1, 5, -)
Each triple is of the form (i, a_i, +) or (i, b_i, -).
At most one triple for each i and sign.

10
Grouping
How do we make sure data is grouped?
  • An idea very similar to some spot-checkers works.
  • Randomly sample sqrt(n) positions.
  • Choose one position at random in between each
    consecutive pair of positions chosen in
    previous step.
  • Check that sample is grouped.
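
The sampling steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the stream length n is known in advance, and it takes "grouped" to mean that all equal items appear contiguously. All function names are hypothetical.

```python
import math
import random


def choose_positions(n, rng):
    """Pick about sqrt(n) random positions, plus one random position
    strictly between each consecutive pair of them (when one exists)."""
    k = min(n, max(1, math.isqrt(n)))
    first = sorted(rng.sample(range(n), k))
    extra = []
    for lo, hi in zip(first, first[1:]):
        if hi - lo > 1:
            extra.append(rng.randrange(lo + 1, hi))
    return sorted(first + extra)


def is_grouped(values):
    """True if every distinct value occupies one contiguous run.
    (Assumed definition of "grouped"; values must be hashable.)"""
    seen = set()
    prev = object()  # sentinel that compares unequal to every item
    for v in values:
        if v != prev:
            if v in seen:
                return False  # value reappears after a different one
            seen.add(v)
            prev = v
    return True


def sample_looks_grouped(stream, n, rng=None):
    """Single pass over the stream, storing only the sampled items."""
    rng = rng or random.Random()
    if n == 0:
        return True
    wanted = choose_positions(n, rng)
    sample = []
    j = 0
    for pos, item in enumerate(stream):
        if j < len(wanted) and pos == wanted[j]:
            sample.append(item)
            j += 1
    return is_grouped(sample)
```

A stream that really is grouped always passes, since any subsequence of a grouped sequence is itself grouped. An ungrouped stream can still pass if the sample happens to miss the violation, so a rejection is conclusive but an acceptance is only evidence.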

11
Future Work
  • Quickly summable (almost) 4-wise independent
    random variables seem like a powerful tool for
    dealing with stream data.
  • This can perhaps be used to address other
    important questions, such as:
  • Given a stream of (x,y) coordinates, find the
    line with a good least squares fit.
  • Can a group of processors observing the same
    stream do some joint computation more
    efficiently?
  • What other properties and primitives will be
    useful?