Title: PIRS: Query Verification on Data Streams
1PIRS Query Verification on Data Streams
- Ke Yi, Hong Kong University of Science and Technology
- Feifei Li, Florida State University
- Marios Hadjieleftheriou, AT&T Labs
- George Kollios, Boston University
- Divesh Srivastava, AT&T Labs
Work done while the 1st and 2nd authors were working at AT&T Labs.
2Publishing Data and Outsourcing Query Service
[Figure: an IP traffic stream (0 1 1 0 0 1 1 1 0 ...) comes from the network and is processed by the Gigascope analysis tool, which returns results (statistics).]
3Revisiting the CISCO ATT Example
[Figure: the IP traffic stream (0 1 1 0 0 1 1 1 0 ...) from the network is processed by Gigascope, which returns statistics.]
Lawyers sign the trust agreement.
Could we (computer scientists) help?
4Concrete Example
[Figure: an IP stream of packets p1, p2, p3, ..., pm, each carrying (srcIP, destIP, packet_size).]
- Continuous query:
  SELECT SUM(packet_size) FROM IP_trace
  GROUP BY srcIP, destIP
- Answer (one running sum per group, reported over time):

  Time | Group 1 | Group 2 | Group 3 | ... | Group n
  5    | 10KB    | 2KB     | 150KB   | ... | 5KB
  10   | 11KB    | 130KB   | 1MB     | ... | 20KB
  13   | ...
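The continuous query above amounts to keeping one running sum per (srcIP, destIP) group. A minimal sketch, assuming packets arrive as (src_ip, dest_ip, packet_size) tuples; the function name and the sample packets are illustrative and not taken from the slides:

```python
from collections import defaultdict

def group_by_sum(packets):
    """Running SUM(packet_size) per (src_ip, dest_ip) group."""
    totals = defaultdict(int)
    for src_ip, dest_ip, packet_size in packets:
        totals[(src_ip, dest_ip)] += packet_size
    return dict(totals)

# Hypothetical packets: (src_ip, dest_ip, packet_size in bytes)
sample_packets = [
    ("10.0.0.1", "10.0.0.2", 1500),
    ("10.0.0.3", "10.0.0.2", 400),
    ("10.0.0.1", "10.0.0.2", 600),
]
answer = group_by_sum(sample_packets)
```

This is the exact answer the server is expected to maintain; its size grows with the number of groups, which is what motivates a small client-side synopsis.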
5Continuous Query Verification (CQV) on Data Streams
- Client registers the query
- Server reports the answer upon request
- Server maintains the exact answer (Group 1, Group 2, Group 3, ...)
- Client maintains a synopsis X
- Both client and server monitor the same stream from the source of streams
- Query: SELECT SUM(packet_size) FROM IP_Trace GROUP BY src_ip, dest_ip
6The Model for the Stream
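[Figure: stream S of tuples (agg_attribute, group_id) arriving at times T1, T2, T3: (9, 1), (7, i), (1, 1). The answer vector V^T = (V1, V2, V3, ..., Vi, ..., Vn) starts at all zeros; after the three tuples, V1 = 10 and Vi = 7, and every other entry is 0.]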
7Continuous Query Verification CQV
[Figure: as tuples (9, 1), (7, i), (1, 1) of stream S arrive at times T1, T2, T3, the server updates the answer vector V^T = (V1, V2, V3, ..., Vi, ..., Vn) and the client updates its synopsis X^T.]
8PIRS Polynomial Identity Random Synopsis
Choose a prime p
Choose a random number α
Raise an alarm if the synopsis of the server's answer, X(W), is not equal to X(V)
Otherwise, no alarm
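A minimal sketch of a synopsis in this style, assuming it is the polynomial fingerprint X(V) = (α - 1)^{v_1} (α - 2)^{v_2} ... (α - n)^{v_n} mod p with α drawn uniformly from Z_p. That form is an assumption consistent with the update and proof slides; the fixed prime and all function names are illustrative.

```python
import secrets

P = 2_147_483_647  # the prime 2**31 - 1, assumed to exceed both m/delta and n

def new_seed(p=P):
    """Pick alpha uniformly at random from Z_p; the client keeps it secret."""
    return secrets.randbelow(p)

def synopsis(v, alpha, p=P):
    """Assumed form: X(V) = prod_i (alpha - i)^{v_i} mod p for groups i = 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

def raises_alarm(x_v, w, alpha, p=P):
    """Compare the client's X(V) with X(W) computed from the reported answer W."""
    return synopsis(w, alpha, p) != x_v

alpha = new_seed()
v = [10, 0, 0, 7]                 # true answer vector
x_v = synopsis(v, alpha)          # the client stores only p, alpha, and x_v
honest_alarm = raises_alarm(x_v, [10, 0, 0, 7], alpha)
```

An honest answer never triggers an alarm; a wrong answer escapes only if α happens to be a root of the difference of the two polynomials.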
9Incremental Update to PIRS
[Figure: tuples (9, 1), (7, i), (1, 1) of stream S arrive at times T1, T2, ...; they trigger an update to v1, an update to vi, and another update to v1.]
An update to group i with value u can be done in O(log u) time (exponentiation by squaring).
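The incremental update can be sketched as one modular multiplication per tuple, assuming the synopsis has the form X(V) = prod_i (α - i)^{v_i} mod p. Python's three-argument `pow` performs modular exponentiation in O(log u) multiplications. The prime, the seed, and the use of group 3 as a stand-in for group i are illustrative assumptions.

```python
P = 2_147_483_647  # illustrative prime (2**31 - 1)

def update(x, group_id, u, alpha, p=P):
    """Fold tuple (u, group_id) into the synopsis: X <- X * (alpha - i)^u mod p."""
    return x * pow((alpha - group_id) % p, u, p) % p

alpha = 123_456_789      # illustrative seed; in practice random and secret
x = 1                    # synopsis of the all-zero vector
for u, group_id in [(9, 1), (7, 3), (1, 1)]:   # tuples (agg_attribute, group_id)
    x = update(x, group_id, u, alpha)

# Same value computed in one shot from the final vector V = (10, 0, 7)
expected = pow(alpha - 1, 10, P) * pow(alpha - 3, 7, P) % P
```

For a count query every tuple has u = 1, so the update reduces to a single multiplication.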
10It Solves CQV problem!
Theorem: Given any W ≠ V, PIRS raises an alarm with probability at least 1 - δ.
Proof sketch: a polynomial with 1 as the leading coefficient is completely determined by its zeroes, due to the fundamental theorem of algebra. Since we have p > m/δ choices for α, the probability that X(V) = X(W) is at most δ.
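The counting step behind this bound can be written out as follows. This is a sketch under assumptions not spelled out on the slide: the synopsis is the polynomial fingerprint below, p > n so that the roots 1, ..., n are distinct modulo p, and both V and W have total weight at most m.

```latex
\[
f_V(x) = \prod_{i=1}^{n} (x - i)^{v_i}, \qquad
f_W(x) = \prod_{i=1}^{n} (x - i)^{w_i}, \qquad
X(V) = f_V(\alpha) \bmod p .
\]
\[
V \neq W \;\Longrightarrow\; f_V - f_W \not\equiv 0 \pmod p ,
\qquad \deg (f_V - f_W) \le m .
\]
\[
\Pr_{\alpha}\bigl[X(V) = X(W)\bigr]
  = \Pr_{\alpha}\bigl[(f_V - f_W)(\alpha) \equiv 0 \pmod p\bigr]
  \le \frac{m}{p} < \delta
  \qquad \text{since } p > m/\delta .
\]
```

A nonzero polynomial of degree at most m has at most m roots in Z_p, so at most m of the p possible seeds let a wrong answer pass.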
11Optimality of PIRS
Theorem: PIRS occupies O(log(m/δ) + log n) bits of space (at most 3 words, i.e., p, α, X(V)), and spends O(1) time to process a tuple for a count query, or O(log u) time to process a tuple for a sum query.
Theorem: Any synopsis for solving the CQV problem with error probability at most δ has to keep Ω(log(min{n, m}/δ)) bits.
12Multiple Queries
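[Figure: queries Q1 and Q2 with answer vectors V_{1..n1} and V_{1..n2}. Instead of keeping separate synopses X1 and X2, the two vectors are concatenated into V_{1..(n1+n2)} and a single synopsis X is kept.]
Theorem: Our synopses use constant space for multiple queries.
[Figure: a tuple of stream S with value 9 that belongs to group 1 and group 8 triggers an update to v1 and an update to v8.]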
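One way to picture the single shared synopsis is sketched below. It assumes the fingerprint form X(V) = prod_i (α - i)^{v_i} mod p and that the groups of the second query are renumbered after those of the first; the group counts, the seed, and the function name are hypothetical.

```python
P = 2_147_483_647  # illustrative prime (2**31 - 1)

def update_multi(x, group_ids, u, alpha, p=P):
    """Fold one tuple of value u into every group it belongs to.

    Multiplies X by (prod_g (alpha - g))^u mod p, so a single modular
    exponentiation covers all registered queries.
    """
    base = 1
    for g in group_ids:
        base = base * ((alpha - g) % p) % p
    return x * pow(base, u, p) % p

alpha = 555_555_555   # illustrative seed; in practice random and secret
n1 = 5                # hypothetical number of groups in Q1
# A tuple of value 9 falls in group 1 of Q1 and group 3 of Q2,
# i.e. combined group ids 1 and n1 + 3 = 8.
x = update_multi(1, [1, n1 + 3], 9, alpha)
```

The client still stores only p, α, and one value X regardless of how many queries are registered; only the per-tuple work grows with the number of groups a tuple touches.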
13Handle the Load Shedding
- Semantic load shedding: drop tuples from certain groups
  - A small number of groups have errors
- Random load shedding
  - All groups have a small amount of error
14CQV with Semantic Load Shedding
Randomly drop certain tuples according to their groups.
[Figure: stream of tuples (9, 1), (7, i), (2, j), (1, 1), (4, k), (5, 1).]
The server claims that at most γ groups have errors.
Goal: detect whether more than γ groups have errors.
We have designed synopses that use O(γ log(1/δ) log n) bits of space and achieve an error probability of at most δ.
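15PIRS^γ: An Exact Solution
[Figure: a hash function b maps each group to one of k buckets (e.g., b(8) = 2, so an update to v8 goes to bucket 2), and each bucket keeps its own PIRS. This structure is repeated in log(1/δ) layers. A layer raises an alarm if enough of its k buckets raise alarms; the synopsis raises an alarm if at least one layer raises an alarm.]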
16PIRS^γ: An Exact Solution
Theorem: PIRS^γ requires O(γ² log(1/δ) log n) bits, spends O(log(1/δ)) time to process a tuple, and solves CQV with semantic load shedding.
17Intuition on Approximation
[Figure: probability of raising an alarm (y-axis) versus number of errors (x-axis, with γ and γ- marked), comparing the ideal synopsis with the approximation.]
18PIRS^{±γ}: An Approximate Solution
Theorem: PIRS^{±γ} requires O(γ log(1/δ) log n) bits and spends O(γ log(1/δ)) time to process a tuple.
19CQV with Random Load Shedding
Randomly drop tuples.
All groups have small errors.
Goal: detect whether any group has an error greater than a claimed threshold.
Theorem: Any synopsis that solves this problem with error probability at most δ requires at least Ω(n) bits (by a reduction from the problem of estimating the infinite frequency moment, i.e., the number of occurrences of the most frequent item).
20Sliding Window and Other Queries
- It is easy to extend PIRS to the sliding-window model since it is decomposable, i.e., X(V1 + V2) = X(V1) · X(V2).
- Other queries can be handled if they can be transformed into group-by aggregation queries.
- Details are in the paper.
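The decomposability property can be checked numerically with a small sketch, assuming the synopsis has the form X(V) = prod_i (α - i)^{v_i} mod p. The expiry step via a modular inverse is one possible way to use the property for sliding windows, not necessarily the paper's construction; the prime, seed, and vectors are illustrative. Negative exponents in three-argument `pow` need Python 3.8 or later.

```python
P = 2_147_483_647  # illustrative prime (2**31 - 1)

def synopsis(v, alpha, p=P):
    """Assumed form: X(V) = prod_i (alpha - i)^{v_i} mod p for groups i = 1..n."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow((alpha - i) % p, v_i, p) % p
    return x

alpha = 987_654_321
v1 = [3, 0, 5]     # contribution of an older sub-window
v2 = [1, 2, 0]     # contribution of a newer sub-window
combined = [a + b for a, b in zip(v1, v2)]

x1, x2 = synopsis(v1, alpha), synopsis(v2, alpha)
x_both = x1 * x2 % P                 # X(V1 + V2) = X(V1) * X(V2)

# Expiring the older sub-window: multiply by the modular inverse of X(V1).
# This needs X(V1) != 0 mod p, which holds here because alpha is not a group id.
x_after_expiry = x_both * pow(x1, -1, P) % P
```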
21Some Experiments
- We use real streams
  - World Cup data (WC)
  - IP traces from the AT&T network (IP)
- We perform the following queries
  - WC: aggregate on response size and group by client id/object id (50M groups)
  - IP: aggregate on packet size and group by source IP/destination IP (7M groups)
- Hardware for the client
  - 2.8 GHz Intel Pentium 4 CPU
  - 512 MB memory
  - Linux machine
22Detection Accuracy
Over 100,000 random attacks, PIRS identifies all
of them.
23Memory Usage of Exact
Exact's memory usage is linear and expensive.
PIRS uses only a constant 3 words (27 bytes) at all times.
24Update Time (per tuple) of Exact
Cache misses and memory swap
- Exact is fast when memory usage is small.
- It becomes extremely slow due to cache misses and
memory swap operations.
25Running Time Analysis
Average update time per tuple:

Query | WC      | IPs
Count | 0.98 µs | 0.98 µs
Sum   | 8.01 µs | 6.69 µs

IPs exhibits a smaller update cost for the sum query because its average value of u is smaller than that of WC.
26Multiple Queries Exact Memory Usage
Exact's memory usage is linear in the number of queries and increases over time.
PIRS always uses only a constant 3 words (27 bytes).
27Multiple Queries Exact Update Time Per Tuple
28Multiple Queries PIRS Update Time Per Tuple
29The Library
Download PIRS and other synopses at http://www.cs.fsu.edu/~lifeifei/pirs/
30Conclusion
- Space- and update-efficient synopsis for verifying continuous group-by aggregation queries on streaming data
- Could be generalized to handle selection queries and sliding-window semantics
- How about more complicated queries?
31Thanks!
32Problem and Goals
- Assumption
- Client and DSMS observe the same stream
- Problem
- Client needs to verify the results
- Goals
- Be memory- and update-efficient
- Tolerance for a limited number of errors
- Tolerance for small errors
- Support multiple queries
33Related Techniques to PIRS
- Incremental cryptography
  - Block operations (insert, delete); cannot support arithmetic operations
- Program verification
  - Server may pass the program execution but simply return random outputs
- Fingerprinting technique
  - PIRS is a fingerprinting technique
34CQV with Semantic Load Shedding
35PIRS^{±γ}: An Approximate Solution
Theorem: PIRS^{±γ}
1. raises no alarm with probability at least 1 - δ on any answer with at most γ- erroneous groups;
2. raises an alarm with probability at least 1 - δ on any answer with at least γ+ erroneous groups,
for any c > -ln ln 2 ≈ 0.367.
Proof idea: uses the intuition of the coupon collector problem and the Chernoff bound.
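36PIRS^{±γ}: An Approximate Solution
[Figure: a hash function b maps each group i to one of k buckets (e.g., b(i) = 2, so an update to vi goes to bucket 2), and each bucket keeps its own PIRS. This structure is repeated in log(1/δ) layers. A layer raises an alarm if all k of its buckets raise alarms; the synopsis raises an alarm if a majority of the layers raise alarms.]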
37Information Disclosure on Multiple Attacks
[Figure: PIRS X(V) built on a seed r chosen from R.]
Insight: the server could potentially rule out a δ portion of the seeds after each failed attack it is notified of!
Learns nothing about r
38Information Disclosure on Multiple Attacks
Bob
Theorem: For a total of k attacks made by Bob on PIRS, the probability that none of them succeeds is at least 1 - kδ.
39Proof of the Optimality
40Proof of the Optimality