Title: Checking properties of data streams
1Checking properties of data streams
- Sampath Kannan
- University of Pennsylvania
- (joint work with J. Feigenbaum, M. Strauss, and M. Viswanathan)
2Talk Outline
- Motivation and Model
- Relationship to other models
- Detecting Anomalies
- Grouping data
- Future work
3Motivation
- Deluge of time-dependent data: if we don't process it soon, it will become irrelevant, or it will be too late!
- Need to change both:
- What we ask about the data.
- How we find the answer to our questions.
- Need good theoretical models that capture the constraints under which we operate.
4Model
Data Stream
0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0
1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0
0 0 1 0 1 0 1 1
(Could be a Cisco router with NetFlow software.)
Processor
Memory (size is small relative to stream size)
5Relationship to other models
The spot-checking model is related. Key difference: only a few scans of the data stream are allowed.
In fact, many of the spot-checkers we have designed already make only a few scans of the data and can be viewed as stream checkers. So we already know a stream checker for sorting.
6Models of finite automata that are similar have been studied. Our model is one of the first to integrate randomization into the processing.
Closely related to the model of Henzinger, Raghavan, and Rajagopalan.
Also related: the sketch model of Broder, Charikar, Frieze, and Mitzenmacher.
7Detecting anomalies
Trivial case
0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 0
1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0
1 1 0
Is this the output of a fair coin? We can simply keep track of the total number of heads, tails, and tosses.
Another solution: sample a few positions and tabulate results only at these positions. This still works.
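Both solutions above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the threshold of z = 3 standard deviations, and the assumption that the stream length n is known before sampling are all choices made here.

```python
import math
import random


def count_check(stream, z=3.0):
    """One pass, O(1) memory: count heads and tosses, then compare with n/2."""
    heads = tosses = 0
    for bit in stream:
        heads += bit
        tosses += 1
    # Under a fair coin, heads has mean n/2 and standard deviation sqrt(n)/2.
    # The cutoff z is an illustrative choice, not taken from the talk.
    return abs(heads - tosses / 2) <= z * math.sqrt(tosses) / 2


def sample_check(stream, n, k, z=3.0, rng=random):
    """One pass: tabulate outcomes only at k positions chosen in advance.

    Assumes the stream length n is known before the pass starts.
    """
    positions = set(rng.sample(range(n), k))
    heads = sum(bit for pos, bit in enumerate(stream) if pos in positions)
    return abs(heads - k / 2) <= z * math.sqrt(k) / 2
```

For example, `count_check(iter(bits))` consumes a stream of 0/1 values once and keeps only two counters; `sample_check` stores only the k sampled positions and one counter.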
8Less Trivial Example
4 3 5 3 7 8 6 8 1 7 2 6 4 3 3 2
1 1 2 3 3 4 4 6 7 9 2 3 6 7 8 5
1 4 6 7 9 2 4
9The Real Problem
(3, 5, +) (2, 7, -) (4, 1, -) (5, 2, -) (3, 8, -)
(2, 4, +) (1, 10, +) (6, 7, -) (7, 7, -) (4, 5, +)
(7, 2, +) (1, 5, -)
Each triple is of the form (i, a_i, +) or (i, b_i, -). At most one triple for each i and sign.
10Grouping
How do we make sure data is grouped?
- An idea very similar to some spot-checkers works.
- Randomly sample sqrt(n) positions.
- Choose one position at random in between each consecutive pair of positions chosen in the previous step.
- Check that the sample is grouped.
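The sampling steps above might look roughly like the following Python sketch. Two things here are assumptions, not taken from the slides: that "grouped" means every value occupies a single contiguous run, and that the stream length n is known before the pass. All function names are hypothetical.

```python
import math
import random


def choose_positions(n, rng=random):
    """About sqrt(n) random positions, plus one random position in each gap."""
    k = min(n, max(1, math.isqrt(n)))
    first = sorted(rng.sample(range(n), k))
    chosen = set(first)
    for lo, hi in zip(first, first[1:]):
        if hi - lo > 1:  # at least one position lies strictly in between
            chosen.add(rng.randrange(lo + 1, hi))
    return sorted(chosen)


def is_grouped(values):
    """True if every value occupies one contiguous run (assumed definition)."""
    seen = set()
    previous = object()  # sentinel that equals no stream value
    for v in values:
        if v != previous:
            if v in seen:  # v started a second, separate run
                return False
            seen.add(v)
        previous = v
    return True


def sample_looks_grouped(stream, n, rng=random):
    """One pass over the stream, storing only the sampled values."""
    wanted = set(choose_positions(n, rng))
    sample = [v for pos, v in enumerate(stream) if pos in wanted]
    return is_grouped(sample)
```

Under this definition a grouped stream always passes, because any subsequence of a grouped sequence is itself grouped; an ungrouped stream is rejected only when the sample happens to expose a violation.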
11Future Work
- Quickly summable (almost) 4-wise independent random variables seem like a powerful tool for dealing with stream data.
- This can perhaps be used to address other important questions, such as:
- Given a stream of (x, y) coordinates, find a line with a good least-squares fit.
- Can a group of processors observing the same stream do some joint computation more efficiently?
- What other properties and primitives will be useful?
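To give a feel for the (almost) 4-wise independent random variables mentioned above, here is a minimal Python sketch of one standard textbook construction: a random polynomial of degree at most 3 over a prime field. It is not the quickly summable construction referred to in the talk and does not address quick summation; the prime P and all function names are assumptions made for illustration.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; indices must lie in [0, P)


def random_coeffs(rng=random, p=P):
    """Draw the four coefficients a0, a1, a2, a3 uniformly from [0, p)."""
    return [rng.randrange(p) for _ in range(4)]


def poly_hash(coeffs, i, p=P):
    """Evaluate a0 + a1*i + a2*i^2 + a3*i^3 mod p using Horner's rule.

    For random coefficients, the values at any four distinct indices in
    [0, p) are independent and uniform over [0, p).
    """
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * i + c) % p
    return acc


def sign(coeffs, i, p=P):
    """Map the hash to +1 or -1.

    Because p is odd, +1 is slightly more likely than -1 (expected value
    1/p), which is why the resulting signs are only "almost" unbiased.
    """
    return 1 if poly_hash(coeffs, i, p) % 2 == 0 else -1
```

A processor would store only the four coefficients and recompute `sign(coeffs, i)` whenever index i appears in the stream.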