Title: Approximate Aggregation Techniques for Sensor Databases
1 Approximate Aggregation Techniques for Sensor Databases
- John Byers
- Department of Computer Science, Boston University
- Joint work with Jeffrey Considine, George Kollios, Feifei Li
2 Sensor Network Model
- Large set of sensors distributed in a sensor field.
- Communication via a wireless ad-hoc network.
- Nodes and links are failure-prone.
- Sensors are resource-constrained:
- Limited memory, battery-powered, messaging is costly.
3 Sensor Databases
- Treat sensor field as a distributed database
- But data is gathered, not stored or saved.
- Perform standard queries over sensor field
- COUNT, SUM, GROUP-BY
- Exemplified by work such as TAG and Cougar
- For this talk:
- One-shot queries
- Continuous queries are a natural extension.
4 Tiny Aggregation (TAG) Approach [Madden, Franklin, Hellerstein, Hong]
- Aggregation component of TinyDB
- Follows database approach
- Uses simple SQL-like language for queries
- Power-aware, in-network query processing
- Optimizations are transparent to end-user.
- TAG supports COUNT, SUM, AVG, MIN, MAX and others
5 TAG (continued)
- Queries proceed in two phases
- Phase 1:
- Sink broadcasts desire to compute an aggregate.
- Nodes create a routing tree with the sink as the root.
- Phase 2:
- Nodes start sending back partial results.
- Each node receives the partial results of its children and computes a new partial result.
- Then it forwards the new partial result to its parent.
- Can compute any decomposable function:
- $f(v_1, v_2, \ldots, v_n) = g(f(v_1, \ldots, v_k), f(v_{k+1}, \ldots, v_n))$
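The two-phase scheme above can be sketched as a recursive walk over the routing tree. This is a minimal illustration, not TAG's actual code: the `tree` and `values` dictionaries, the node names, and the `aggregate` helper are all hypothetical, and the tree is assumed to be loss-free.

```python
def aggregate(tree, values, node, f_leaf, g):
    """Partial result at `node`: its own contribution merged with its children's partials."""
    partial = f_leaf(values[node])
    for child in tree.get(node, []):
        # Each child forwards one partial result; the parent merges it with g.
        partial = g(partial, aggregate(tree, values, child, f_leaf, g))
    return partial

# Hypothetical routing tree (parent -> children) and sensor readings.
tree = {"sink": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
values = {"sink": 0, "a": 4, "b": 2, "c": 7, "d": 1, "e": 6}

total = aggregate(tree, values, "sink", lambda v: v, lambda x, y: x + y)   # SUM
count = aggregate(tree, values, "sink", lambda v: 1, lambda x, y: x + y)   # COUNT
largest = aggregate(tree, values, "sink", lambda v: v, max)                # MAX
```

Only `g` changes between aggregates, which is what makes the function decomposable: each node needs just the partial results of its children, never their raw values.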
6 Example for SUM
[Figure: spanning tree with the sink labeled 20]
- Sink initiates the query
- Nodes form a spanning tree
- Each node sends its partial result to its parent
- Sink computes the total sum
7 Classification of Aggregates
- TAG classifies aggregates according to:
- Size of partial state
- Monotonicity
- Exemplary vs. summary
- Duplicate-sensitivity
- MIN/MAX (cheap and easy)
- Small state, monotone, exemplary, duplicate-insensitive
- COUNT/SUM (considerably harder)
- Small state and monotone, BUT duplicate-sensitive
- Cheap if aggregating over tree without losses
- Expensive with multiple paths and losses
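A small illustration of duplicate-sensitivity, using made-up readings: when multipath routing delivers the same reading more than once, MAX is unaffected while a naive SUM is inflated. The sensor names and values below are hypothetical.

```python
# Hypothetical readings, one per sensor.
readings = {"a": 7, "b": 3, "c": 5}

# With multiple paths, some readings reach the sink more than once.
arrivals = [("a", 7), ("b", 3), ("a", 7), ("c", 5), ("b", 3)]

max_with_duplicates = max(v for _, v in arrivals)   # same answer as without duplicates
sum_with_duplicates = sum(v for _, v in arrivals)   # counts "a" and "b" twice
true_sum = sum(readings.values())
```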
8 Basic approaches to computing SUM
- (1) Separate, reliable delivery of every value to sink
- Extremely costly in energy consumption
- (2) Aggregate values back to sink along a tree
- A single fault eliminates values of an entire subtree
- (3) Split values and route fractions separately
- Send (value / k) to each of k parents
- Better variance, but same expectation as approach (2)
- (4) Send values along multiple paths
- Duplicates need to be handled.
- <ID, value> pairs have limited in-network aggregation.
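The "better variance, same expectation" claim for splitting can be checked in a simplified one-hop setting. The assumptions here are not from the slides: a single value $v$, each link failing independently with probability $p$, and no further losses upstream. $X$ is the amount delivered with one parent, $Y$ the amount delivered when $v/k$ is sent to each of $k$ parents.

```latex
\begin{align*}
\mathbb{E}[X] &= (1-p)\,v, &
\operatorname{Var}[X] &= p(1-p)\,v^2, \\
\mathbb{E}[Y] &= k \cdot (1-p)\,\frac{v}{k} = (1-p)\,v, &
\operatorname{Var}[Y] &= k \cdot p(1-p)\,\frac{v^2}{k^2} = \frac{p(1-p)\,v^2}{k}.
\end{align*}
```

Under these assumptions the expected loss is identical, and splitting only shrinks the variance by a factor of $k$.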
9 Design Objectives for Robust SUM
- Admit in-network aggregation of partial values
- Let aggregates be both order-insensitive and duplicate-insensitive
- Be agnostic to routing protocol
- Trust routing protocol to be best-effort.
- Routing and aggregation logically decoupled [NG 03].
- Some routing algorithms better than others.
10 Design Objectives (cont)
- Final aggregate is exact if at least one representative from each leaf survives to reach the sink.
- This won't happen in practice.
- It is reasonable to hope for approximate results.
- We argue that it is reasonable to use aggregation methods that are themselves approximate.
11 Outline
- Motivation for sensor databases and aggregation.
- COUNT aggregation via Flajolet-Martin
- SUM aggregation
- Experimental evaluation
12 Flajolet / Martin sketches [JCSS 85]
- Goal: Estimate N from a small-space representation of a set.
- Sketch of a union of items is the OR of their sketches
- Insertion order and duplicates don't matter!
Prerequisite: Let h be a random, binary hash function.
Sketch of an item: For each unique item with ID x, for each integer 1 ≤ i ≤ k in turn, compute h(x, i). Stop when h(x, i) = 1, and set bit i.
X      0 0 1 0 0
Z      1 0 0 0 0
X ∪ Z  1 0 1 0 0
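The insertion rule and the OR-based union can be sketched as follows. This is an illustrative version only: the SHA-256-based `h` is a stand-in for the random binary hash function (an assumption, not the authors' choice), bit positions are counted from 0 rather than 1, and the function names are hypothetical.

```python
import hashlib

def h(x, i):
    """Stand-in for a random binary hash of (item ID, bit position)."""
    digest = hashlib.sha256(f"{x}:{i}".encode()).digest()
    return digest[0] & 1

def sketch(x, k=16):
    """k-bit FM sketch of a single item, as a list of 0/1 values."""
    bits = [0] * k
    for i in range(k):
        if h(x, i) == 1:
            bits[i] = 1      # set the first position whose hash is 1 ...
            break            # ... and stop
    return bits

def union(a, b):
    """Sketch of a union of items is the bitwise OR of their sketches."""
    return [p | q for p, q in zip(a, b)]

x_sketch, z_sketch = sketch("X"), sketch("Z")
both = union(x_sketch, z_sketch)
```

Because OR is commutative, associative, and idempotent, merging in a different order or merging the same sketch twice gives the same result.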
13 Flajolet / Martin sketches (cont)
Estimating COUNT: Take the sketch of a set of N items. Let j be the position of the leftmost zero in the sketch. j is an estimator of log2(0.77 N).
S = 1 1 1 0 1, so j = 3
Best guess: COUNT ≈ 11
- Fixable drawbacks:
- Estimate has slight bias
- Variance in the estimate is large.
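Since j estimates log2(0.77 N), inverting gives an estimate of N of about 2^j / 0.77. A minimal sketch of that step, assuming the sketch is held as a list of bits and j is counted as the number of leading ones (the helper names are hypothetical):

```python
def leading_ones(bits):
    """Index of the leftmost zero, i.e. the number of leading 1 bits."""
    j = 0
    for b in bits:
        if b == 0:
            break
        j += 1
    return j

def estimate_count(bits, phi=0.77):
    """Invert j ~ log2(phi * N) to obtain an estimate of N."""
    return 2 ** leading_ones(bits) / phi

j = leading_ones([1, 1, 1, 0, 1])
estimate = estimate_count([1, 1, 1, 0, 1])
```

The estimate can only take values of the form 2^j / 0.77, which is one way to see why the variance of a single sketch is large.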
14 Flajolet / Martin sketches (cont)
- Standard variance reduction methods apply.
- Compute m independent sketches in parallel.
- Compute m independent estimates of N.
- Take the mean of the estimates.
- Provable tradeoffs between m and variance of the estimator
15 Application to COUNT
- Each sensor computes k independent sketches of itself (using unique ID x)
- Coming next: sensor computes a sketch of its value.
- Use a robust routing algorithm to route sketches up to the sink.
- Aggregate the k sketches via union en-route.
- The sink then estimates the count.
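The COUNT pipeline above can be sketched end to end. Several things here are assumptions made for illustration: the salted SHA-256 hash standing in for independent hash functions, packing each sketch into an integer, the parameter name `m` for the number of sketches (the slide calls it k), and all function names.

```python
import hashlib

def h(x, i, salt):
    """Stand-in binary hash; `salt` selects one of the independent sketches."""
    return hashlib.sha256(f"{salt}:{x}:{i}".encode()).digest()[0] & 1

def sensor_sketches(x, m=20, bits=16):
    """m independent sketches of sensor ID x, each packed into an int."""
    out = []
    for salt in range(m):
        word = 0
        for i in range(bits):
            if h(x, i, salt) == 1:
                word = 1 << i
                break
        out.append(word)
    return out

def merge(a, b):
    """En-route aggregation: OR the sketches position by position."""
    return [p | q for p, q in zip(a, b)]

def estimate(sketches, phi=0.77):
    """Sink side: mean of the per-sketch estimates 2^j / phi."""
    total = 0.0
    for word in sketches:
        j = 0
        while (word >> j) & 1:   # count the run of low-order ones
            j += 1
        total += 2 ** j / phi
    return total / len(sketches)

merged = [0] * 20
for sensor_id in range(50):      # 50 hypothetical sensors
    merged = merge(merged, sensor_sketches(sensor_id))
count_estimate = estimate(merged)
```

Any routing scheme that delivers at least one copy of each sensor's sketches to the sink yields the same `merged` value, regardless of order or duplication.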
16 Multipath Routing
Two paths from the source to the sink that differ in at least two nodes
17 Routing Methodologies
- Considerable work on reliable delivery via multipath routing
- Directed diffusion [IGE 00]
- Braided diffusion [GGSE 01]
- GRAdient Broadcast [YZLZ 02]
- Broadcast intermediate results along gradient back to source
- Can dynamically control width of broadcast
- Trade off fault tolerance and transmission costs
- Our approach is similar to GRAB
- Broadcast. Grab if upstream, ignore if downstream
- Common goal: try to get at least one copy to sink
18 Simple Upstream Routing
- By expanding ring search, nodes can compute their hop distance from the sink.
- Refer to nodes at distance i as level i.
- At level i, gather aggregates from level i + 1.
- Then broadcast aggregates to level i - 1 neighbors.
- Ignore downstream and sidestream aggregates.
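A loss-free sketch of this level-by-level scheme, assuming a breadth-first search as a stand-in for the expanding ring search. The graph, node names, and function names are hypothetical, and no link failures are simulated.

```python
from collections import deque

def hop_levels(neighbors, sink):
    """Hop distance of every node from the sink (breadth-first search)."""
    level = {sink: 0}
    queue = deque([sink])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def upstream_aggregate(neighbors, sink, local, combine):
    """Process levels from the outermost inward; each node broadcasts to level i - 1."""
    level = hop_levels(neighbors, sink)
    state = dict(local)
    for i in range(max(level.values()), 0, -1):
        for u in [n for n in level if level[n] == i]:
            for v in neighbors[u]:
                if level[v] == i - 1:    # downstream and sidestream nodes ignore it
                    state[v] = combine(state[v], state[u])
    return state[sink]

# Diamond topology: node c reaches the sink through both a and b.
neighbors = {"sink": ["a", "b"], "a": ["sink", "c"], "b": ["sink", "c"], "c": ["a", "b"]}

seen = upstream_aggregate(neighbors, "sink", {n: {n} for n in neighbors}, lambda s, t: s | t)
naive_sum = upstream_aggregate(neighbors, "sink", {"sink": 0, "a": 1, "b": 1, "c": 1},
                               lambda s, t: s + t)
```

With set union as the combiner, node c is counted once even though it arrives over two paths. With plain addition the same topology double-counts c, which is why the aggregate carried upstream has to be duplicate-insensitive.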
19 Extending Flajolet / Martin Sketches
- Also interested in approximating SUM
- FM sketches can handle this (albeit clumsily)
- To insert a value of 500, perform 500 distinct item insertions
- Our observation: We can simulate a large number of insertions into an FM sketch more efficiently.
- Sensor-net restrictions
- No floating point operations
- Must keep memory usage and CPU time to a minimum
20 Simulating a set of insertions
- Set all the low-order bits in the safe region.
- First $S = \log c - 2 \log \log c$ bits are set to 1 w.h.p.
- Statistically estimate number of trials going beyond safe region
- Probability of a trial doing so is simply $2^{-S}$
- Number of trials $\sim B(c, 2^{-S})$. Mean $= O(\log^2 c)$
- For trials and bits outside safe region, set those bits manually.
- Running time is O(1) for each outlying trial.
- Expected running time:
$O(\log c)$ + time to draw from $B(c, 2^{-S})$ + $O(\log^2 c)$
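The steps above can be sketched as follows. This version is for illustration on a workstation and deliberately ignores the sensor-net restrictions: it uses floating point, rounds S down to an integer, and draws the binomial by c coin flips (which takes O(c) time) instead of the table-based sampling the slides go on to describe. The function name and parameters are hypothetical.

```python
import math
import random

def simulate_insertions(c, k=32, rng=None):
    """Approximate the FM sketch produced by c distinct insertions."""
    if rng is None:
        rng = random.Random()
    bits = [0] * k
    log_c = math.log2(c) if c > 1 else 0.0
    s = 0
    if log_c > 1:
        s = max(0, int(log_c - 2 * math.log2(log_c)))   # safe region size S
    s = min(s, k)
    for i in range(s):
        bits[i] = 1                  # safe region: all ones with high probability
    p = 2.0 ** -s                    # chance that a trial passes the safe region
    outliers = sum(1 for _ in range(c) if rng.random() < p)   # naive B(c, 2^-S) draw
    for _ in range(outliers):
        i = s
        while i < k and rng.random() < 0.5:   # keep going while the hash bit is 0
            i += 1
        if i < k:
            bits[i] = 1              # set the bit where this trial stopped
    return bits

sum_sketch = simulate_insertions(500, k=32, rng=random.Random(1))
```

For c = 500 the safe region covers the first two bits under this rounding, so only the remaining trials need individual handling.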
21 Fast sampling from discrete pdfs
- We need to generate samples from B(n, p).
- General problem: sampling from a discrete pdf.
- Assume we can draw uniformly at random from [0, 1].
- With an event space of size N:
- O(log N) lookups are immediate.
- Represent the cdf in an array of size N.
- Draw from [0, 1] and do binary search.
- Cleverer methods for O(log log N), O(log N) time
Amazingly, this can be done in constant time!
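For reference, the O(log N) baseline (cdf array plus binary search) can be sketched with the standard library. The pdf below is the one used in the later table example; the helper names are hypothetical.

```python
import bisect
import itertools
import random

def make_cdf(pdf):
    """Running totals of the pdf, stored in an array of size N."""
    return list(itertools.accumulate(pdf))

def sample_index(cdf, u):
    """Map a uniform draw u in [0, 1) to an outcome index by binary search."""
    # min() guards against floating-point round-off in the last cdf entry.
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)

cdf = make_cdf([0.40, 0.30, 0.15, 0.10, 0.05])
draw = sample_index(cdf, random.random())
```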
22 Constant Time Sampling
- Theorem [Walker 77]: For any discrete pdf D over a sample space of size N, a table of size O(N) can be constructed in O(N) time that enables random variables to be drawn from D using at most two table lookups.
23 Sampling in O(1) time [Walker 77]
- Start with a discrete pdf: 0.40, 0.30, 0.15, 0.10, 0.05
- Construct a table of 2N entries.
Algorithm: Pick a column i uniformly at random. Pick x uniformly from [0, 1]. If x < p_i, output i. Else output Q_i.
      A     B     C     D     E
p_i   1     1     0.75  0.5   0.25
Q_i   --    --    A     B     A
In the table above: Pr[B] = 1 · 0.2 + 0.5 · 0.2 = 0.3 and Pr[C] = 0.75 · 0.2 = 0.15
24 Methods of Walker 77 (cont.)
- Ok, but how do you construct the table?
Table construction: Take a below-average i. Choose p_i to satisfy x_i = p_i / n. Set j with largest x_j as Q_i. Reduce x_j accordingly. Repeat.
[Figure: the pdf values 0.40, 0.30, 0.15, 0.10, 0.05 (A to E) against the average 1/n = 0.20, reduced step by step]
      A     B     C     D     E
p_i   1     1     0.75  0.5   0.25
Q_i   --    --    A     B     A
Linear time construction.
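The construction rule and the two-lookup sampling step can be sketched as below. This version follows the rule as stated (take a below-average entry, pair it with the largest remaining entry) by rescanning the pending entries, so it runs in quadratic time; reaching the linear bound would require worklists of small and large entries instead. The function names are hypothetical, and exact fractions are used only to keep the example free of round-off.

```python
import random
from fractions import Fraction

def build_alias_table(pdf):
    """Return thresholds p and aliases q for the given pdf."""
    n = len(pdf)
    mass = [x * n for x in pdf]      # scaled so the average entry is 1
    p = [1] * n
    q = [None] * n
    pending = set(range(n))
    while pending:
        i = min(pending, key=lambda t: mass[t])      # smallest remaining entry
        pending.remove(i)
        if not pending or mass[i] >= 1:
            continue                                 # column i is filled by i alone
        j = max(pending, key=lambda t: mass[t])      # largest remaining entry
        p[i] = mass[i]
        q[i] = j
        mass[j] -= 1 - mass[i]                       # j donates the rest of column i
    return p, q

def alias_sample(p, q, rng):
    """Pick a column uniformly, then choose between its owner and its alias."""
    i = rng.randrange(len(p))
    return i if rng.random() < p[i] else q[i]

pdf = [Fraction(2, 5), Fraction(3, 10), Fraction(3, 20), Fraction(1, 10), Fraction(1, 20)]
p, q = build_alias_table(pdf)
sample = alias_sample(p, q, random.Random(0))
```

With outcomes A to E mapped to indices 0 to 4, this input gives thresholds 1, 1, 3/4, 1/2, 1/4 and aliases A, B, A for the last three columns.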
25 Back to extending FM sketches
- We need to sample from $B(c, 2^{-S})$ for various values of S.
- Using Walker's method, we can sample from $B(c, 2^{-S})$ in O(1) time and O(c) space, assuming tables are pre-computed offline.
26 Back to extending FM sketches (cont)
- With more cleverness, we can trade off space for time. Recall that:
- Running time = time to sample from B + $O(\log^2 c)$
- Sampling in $O(\log^2 c)$ time leads to $O(c / \log^2 c)$ space.
- With max sensor value of $2^{16}$, saving a $\log^2 c$ term is a 256-fold space savings.
- Tables for S = 1, 2, ..., 16 together take 4600 bytes
- (without this optimization, tables would be >1 MB)
27 Intermission
- FM sketches require more work initially.
- Need k bits to represent a single bit!
- But:
- Sketched values can easily be aggregated.
- Aggregation operation (OR) is both order-insensitive and duplicate-insensitive.
- Result is a natural fit with sensor aggregation.
28 Outline
- Sensor database motivation
- COUNT aggregation via Flajolet-Martin
- SUM aggregation
- Experimental evaluation
29 Experimental Results
- We employ the publicly available TAG simulator.
- Basic topologies: grid (2-D lattice) and random
- Can modulate:
- Grid size (default: 30 by 30)
- Node, packet, or link loss rate (default: 5% link loss rate)
- Number of bitmaps (default: twenty 16-bit sketches)
- Transmission radius (default: 8 neighbors on the grid)
30 Experimental Results
- We consider four main methods.
- TAG: transmit aggregates up a single tree
- DAG-based TAG: Send a 1/k fraction of the aggregated values to each of k parents.
- SKETCH: broadcast an aggregated sketch to all neighbors at level i - 1
- LIST: explicitly enumerate all <key, value> pairs and broadcast to all neighbors at level i - 1.
- LIST vs. SKETCH measures the penalty associated with approximate values.
31 Message Comparison
- TAG: transmit aggregates up a single tree
- 1 message transmitted per node.
- 1 message received per node (on average).
- Message size: 16 bits.
- SKETCH: broadcast a sketch up the tree
- 1 message transmitted per node.
- Fanout of k receivers per transmission (constant k).
- Message size: 20 16-bit sketches = 320 bits.
32 COUNT vs Link Loss (Grid)
33 COUNT vs Link Loss (Grid)
34 COUNT vs Network Diameter (Grid)
35 COUNT vs Link Loss (Random)
36 SUM vs Link Loss
37 Compressibility
- The FM sketches are amenable to compression.
- We employ a very basic method
- Run length encode initial prefix of ones.
- Run length encode suffix of zeroes.
- Represent the middle explicitly.
- Method can be applied to a group of sketches.
- This alone buys about a factor of 3.
- Better methods exist.
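The basic method can be sketched for a single sketch held as a list of bits. The tuple encoding (prefix length, explicit middle, suffix length) is one possible representation chosen for illustration, not necessarily the one used in the experiments, and the function names are hypothetical.

```python
def compress(bits):
    """Split a sketch into (count of leading ones, explicit middle, count of trailing zeros)."""
    n = len(bits)
    ones = 0
    while ones < n and bits[ones] == 1:
        ones += 1
    zeros = 0
    while zeros < n - ones and bits[n - 1 - zeros] == 0:
        zeros += 1
    return ones, bits[ones:n - zeros], zeros

def decompress(ones, middle, zeros):
    """Rebuild the original bit list."""
    return [1] * ones + list(middle) + [0] * zeros

packed = compress([1, 1, 1, 0, 1, 0, 0, 0])
```

FM sketches tend to have exactly this shape (a run of ones, a short mixed region, then zeros), which is why even this simple encoding pays off.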
38 Compression
39 Space Usage
40 Future Directions
- Spatio-temporal queries
- Restrict queries to specific regions of space, time, or space-time.
- Other aggregates
- What else needs to be computed or approximated?
- Better aggregation methods
- FM sketches have rather high variance.
- Many other sketching methods can potentially be used.