Approximate Aggregation Techniques for Sensor Databases

Slides: 41

Provided by: jeffreyc5

Learn more at: https://www.cs.bu.edu

Transcript and Presenter's Notes

1
Approximate Aggregation Techniques for Sensor
Databases

John Byers
Department of Computer Science
Boston University
Joint work with Jeffrey Considine, George
Kollios, Feifei Li

2
Sensor Network Model

Large set of sensors distributed in a sensor
field.
Communication via a wireless ad-hoc network.
Nodes and links are failure-prone.
Sensors are resource-constrained:
limited memory, battery-powered, messaging is costly.

3
Sensor Databases

Treat sensor field as a distributed database
But data is gathered, not stored or saved.
Perform standard queries over sensor field
COUNT, SUM, GROUP-BY
Exemplified by work such as TAG and Cougar
For this talk:
One-shot queries
Continuous queries are a natural extension.

4
Tiny Aggregation (TAG) Approach [Madden, Franklin, Hellerstein, Hong]

Aggregation component of TinyDB
Follows database approach
Uses simple SQL-like language for queries
Power-aware, in-network query processing
Optimizations are transparent to end-user.
TAG supports COUNT, SUM, AVG, MIN, MAX and others

5
TAG (continued)

Queries proceed in two phases
Phase 1
Sink broadcasts desire to compute an aggregate.
Nodes create a routing tree with the sink as the
root.
Phase 2
Nodes start sending back partial results.
Each node receives the partial results of its
children and computes a new partial result.
Then it forwards the new partial result to its
parent.
Can compute any decomposable function:
$f(v_1, v_2, \ldots, v_n) = g(f(v_1, \ldots, v_k),\ f(v_{k+1}, \ldots, v_n))$
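
As a minimal sketch of the decomposable-function idea (an illustration only, not TAG's actual code): AVG can be carried as a (sum, count) partial state that each node merges with its children's states before forwarding. The names `init`, `merge`, `aggregate`, and the five-node tree are assumptions made up for this example.

```python
# Hypothetical illustration of in-network aggregation over a routing tree.
# The partial state for AVG is a (sum, count) pair; merge() plays the role of g.

def init(value):
    """Partial state for a single sensor reading."""
    return (value, 1)

def merge(a, b):
    """Combine two partial states (the function g on the slide)."""
    return (a[0] + b[0], a[1] + b[1])

def aggregate(node, children, readings):
    """Partial state for the subtree rooted at `node`."""
    state = init(readings[node])
    for child in children.get(node, []):
        state = merge(state, aggregate(child, children, readings))
    return state

# Made-up tree: sink 0 has children 1 and 2; node 1 has children 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
readings = {0: 2, 1: 4, 2: 6, 3: 3, 4: 5}

total, count = aggregate(0, children, readings)
average = total / count  # the sink finalizes AVG from the merged state
```

Only the fixed-size pair travels up each link, which is why the per-node message cost stays constant regardless of subtree size.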

6
Example for SUM
[Figure: spanning tree of sensor nodes with the sink at the root, labeled 20]

Sink initiates the query
Nodes form a spanning tree
Each node sends its partial
result to its parent
Sink computes the total sum

7
Classification of Aggregates

TAG classifies aggregates according to
Size of partial state
Monotonicity
Exemplary vs. summary
Duplicate-sensitivity
MIN/MAX (cheap and easy)
Small state, monotone, exemplary,
duplicate-insensitive
COUNT/SUM (considerably harder)
Small state and monotone, BUT duplicate-sensitive
Cheap if aggregating over tree without losses
Expensive with multiple paths and losses

8
Basic approaches to computing SUM

(1) Separate, reliable delivery of every value to sink
Extremely costly in energy consumption
(2) Aggregate values back to sink along a tree
A single fault eliminates values of an entire subtree
(3) Split values and route fractions separately
Send (value / k) to each of k parents
Better variance, but same expectation as approach (2)
(4) Send values along multiple paths
Duplicates need to be handled.
<ID, value> pairs have limited in-network aggregation.

9
Design Objectives for Robust SUM

Admit in-network aggregation of partial values
Let aggregates be both order-insensitive and
duplicate-insensitive
Be agnostic to routing protocol
Trust routing protocol to be best-effort.
Routing and aggregation logically decoupled [NG 03].
Some routing algorithms better than others.

10
Design Objectives (cont)

Final aggregate is exact if at least one
representative from each leaf survives to reach
the sink.
This won't happen in practice.
It is reasonable to hope for approximate results.
We argue that it is reasonable to use aggregation
methods that are themselves approximate.

11
Outline

Motivation for sensor databases and aggregation.
COUNT aggregation via Flajolet-Martin
SUM aggregation
Experimental evaluation

12
Flajolet / Martin sketches [JCSS 85]

Goal: Estimate N (the number of distinct items) from a small-space
representation of a set.
Sketch of a union of items is the OR of their
sketches
Insertion order and duplicates don't matter!

Prerequisite: Let h be a random, binary hash function.
Sketch of an item: For each unique item with ID x,
for each integer 1 ≤ i ≤ k in turn, compute h(x, i).
Stop when h(x, i) = 1, and set bit i.

X     = 0 0 1 0 0
Z     = 1 0 0 0 0
X ∪ Z = 1 0 1 0 0
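
The insertion and union rules above can be sketched as follows. This is a hedged illustration: the hash `h` is built from SHA-256 purely as a stand-in for the "random, binary hash function", the sketch length `K = 16` is an assumed value, and bit positions are counted from 0 rather than 1.

```python
import hashlib

K = 16  # sketch length in bits (assumed for this example)

def h(x, i):
    """Stand-in binary hash of (x, i); SHA-256 is an assumption, not from the slides."""
    digest = hashlib.sha256(f"{x}:{i}".encode()).digest()
    return digest[0] & 1

def sketch_item(x, k=K):
    """Sketch of one item: set the bit at the first position i with h(x, i) == 1."""
    for i in range(k):
        if h(x, i) == 1:
            return 1 << i
    return 0  # all k hashes were 0 (probability 2**-k): no bit is set

def sketch_set(items, k=K):
    """Sketch of a set is the bitwise OR of the item sketches."""
    s = 0
    for x in items:
        s |= sketch_item(x, k)
    return s

# Union of two sets is the OR of their sketches; duplicates change nothing.
left = sketch_set(["a", "b"])
right = sketch_set(["b", "c"])
union = left | right
```

Because the same ID always hashes to the same bit, re-inserting an item or merging overlapping sketches cannot change the result.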
13
Flajolet / Martin sketches (cont)
Estimating COUNT: Take the sketch of a set of N
items. Let j be the position of the leftmost zero
in the sketch. j is an estimator of $\log_2(0.77 N)$.

Example: S = 1 1 1 0 1, so j = 3
Best guess: COUNT ≈ 11

Fixable drawbacks
Estimate has faint bias
Variance in the estimate is large.

14
Flajolet / Martin sketches (cont)

Standard variance reduction methods apply.
Compute m independent sketches in parallel.
Compute m independent estimates of N.
Take the mean of the estimates.
Provable tradeoffs between m and variance of the
estimator
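
A rough sketch of the estimator with m parallel sketches, following the "mean of the estimates" rule stated above. The salted SHA-256 hash, the 32-bit sketch length, and the choice of m = 20 are assumptions for illustration; other averaging schemes exist and may behave better.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant (rounded to 0.77 on the slide)

def h(x, i, salt):
    """Stand-in binary hash; the salt makes the m sketches independent (assumed design)."""
    digest = hashlib.sha256(f"{salt}:{x}:{i}".encode()).digest()
    return digest[0] & 1

def sketch_set(items, salt, k=32):
    """FM sketch of a set, with bit positions counted from 0."""
    s = 0
    for x in items:
        for i in range(k):
            if h(x, i, salt) == 1:
                s |= 1 << i
                break
    return s

def lowest_zero(s):
    """Position j of the first zero bit, counting from position 0."""
    j = 0
    while s & (1 << j):
        j += 1
    return j

def estimate_count(sketches):
    """Mean of the per-sketch estimates 2**j / PHI."""
    estimates = [2 ** lowest_zero(s) / PHI for s in sketches]
    return sum(estimates) / len(estimates)

items = range(1000)
sketches = [sketch_set(items, salt) for salt in range(20)]
approx = estimate_count(sketches)  # a noisy estimate of 1000
```

The slide's example sketch 1 1 1 0 1, read from position 0, is the integer `0b10111`, for which `lowest_zero` returns 3.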

15
Application to COUNT

Each sensor computes k independent sketches of
itself (using unique ID x)
Coming next: sensor computes a sketch of its
value.
Use a robust routing algorithm to route sketches
up to the sink.
Aggregate the k sketches via union en-route.
The sink then estimates the count.

16
Multipath Routing

Braided Paths

Two paths from the source to the sink that differ
in at least two nodes
17
Routing Methodologies

Considerable work on reliable delivery via
multipath routing
Directed diffusion [IGE 00]
Braided diffusion [GGSE 01]
GRAdient Broadcast [YZLZ 02]
Broadcast intermediate results along gradient
back to source
Can dynamically control width of broadcast
Trade off fault tolerance and transmission costs
Our approach similar to GRAB
Broadcast. Grab if upstream, ignore if downstream
Common goal: try to get at least one copy to sink

18
Simple Upstream Routing

By expanding ring search, nodes can compute their
hop distance from the sink.
Refer to nodes at distance i as level i.
At level i, gather aggregates from level i + 1.
Then broadcast aggregates to level i - 1
neighbors.
Ignore downstream and sidestream aggregates.
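
The level-by-level scheme above might be simulated roughly as follows. This is a sketch under stated assumptions: a breadth-first search stands in for the expanding ring search, each node's aggregate is a small bitmap merged with OR, and link loss is modeled as an independent coin flip per receiver. The function names and the four-node topology are hypothetical.

```python
import random
from collections import deque

def hop_levels(neighbors, sink):
    """Breadth-first search from the sink gives each node's hop distance (level)."""
    level = {sink: 0}
    queue = deque([sink])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def route_upstream(neighbors, sink, local, loss_rate=0.0, rng=None):
    """OR-aggregate per-node bitmaps toward the sink, level by level."""
    rng = rng or random.Random(0)
    level = hop_levels(neighbors, sink)
    state = dict(local)  # each node starts with its own sketch
    for u in sorted(level, key=level.get, reverse=True):  # farthest level first
        for v in neighbors[u]:
            # Only level i-1 neighbors keep the broadcast; others ignore it.
            if level.get(v) == level[u] - 1 and rng.random() >= loss_rate:
                state[v] |= state[u]
    return state[sink]

# Made-up topology: node 3 reaches the sink 0 through both node 1 and node 2.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
local = {0: 0b0001, 1: 0b0010, 2: 0b0100, 3: 0b1000}
at_sink = route_upstream(neighbors, 0, local)
```

Node 3's bitmap arrives at the sink twice, once via each parent, and the OR absorbs the duplicate.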

19
Extending Flajolet / Martin Sketches

Also interested in approximating SUM
FM sketches can handle this (albeit clumsily)
To insert a value of 500, perform 500 distinct
item insertions
Our observation: We can simulate a large number
of insertions into an FM sketch more efficiently.
Sensor-net restrictions
No floating point operations
Must keep memory usage and CPU time to a minimum

20
Simulating a set of insertions

Set all the low-order bits in the safe region.
First $S = \log c - 2 \log \log c$ bits are set to 1 w.h.p.
Statistically estimate number of trials going
beyond safe region
Probability of a trial doing so is simply $2^{-S}$
Number of trials $\sim B(c, 2^{-S})$. Mean $= O(\log^2 c)$
For trials and bits outside safe region, set
those bits manually.
Running time is O(1) for each outlying trial.
Expected running time:
$O(\log c)$ + time to draw from $B(c, 2^{-S})$ + $O(\log^2 c)$
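
The structure of this shortcut can be sketched as below, with loud caveats. The code uses floating point and a naive coin-flip binomial draw for readability, so it does not achieve the stated running time and would not suit a real sensor node. The logarithm bases (base 2 throughout), the clamp for tiny values, and the names `safe_region` and `insert_sum` are assumptions.

```python
import math
import random

def safe_region(c):
    """S = floor(log2 c - 2 log2 log2 c), clamped at 0 (log bases are an assumption)."""
    if c < 4:
        return 0
    return max(0, math.floor(math.log2(c) - 2 * math.log2(math.log2(c))))

def insert_sum(value, k=32, rng=None):
    """Approximate the sketch of `value` distinct insertions in one call."""
    rng = rng or random.Random()
    s = safe_region(value)
    sketch = (1 << s) - 1  # bits 0..S-1 are set with high probability
    p = 2.0 ** -s          # chance that one insertion lands beyond the safe region
    # Naive Binomial(value, p) draw for clarity; the slides use table lookups instead.
    beyond = sum(rng.random() < p for _ in range(value))
    for _ in range(beyond):
        i = s
        while i < k and rng.random() < 0.5:  # geometric walk past the safe region
            i += 1
        if i < k:  # a walk that runs off the end sets no bit
            sketch |= 1 << i
    return sketch

sketch_of_500 = insert_sum(500, rng=random.Random(1))
```

For a 16-bit maximum value, `safe_region(65536)` gives S = 8, so only about 256 of the 65536 simulated insertions need individual handling on average.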

21
Fast sampling from discrete pdfs

We need to generate samples from B(n, p).
General problem: sampling from a discrete pdf.
Assume we can draw uniformly at random from [0, 1].
With an event space of size N:
O(log N) lookups are immediate.
Represent the cdf in an array of size N.
Draw from [0, 1] and do binary search.
Cleverer methods for O(log log N), O(log N) time
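
The O(log N) baseline (cdf array plus binary search) can be sketched with Python's standard `bisect` module. The helper names `make_cdf` and `sample` are made up for this example, and the pdf is borrowed from the Walker example later in the talk.

```python
import bisect
import itertools
import random

def make_cdf(pdf):
    """Running sums of the pdf: cdf[i] = pdf[0] + ... + pdf[i]."""
    return list(itertools.accumulate(pdf))

def sample(cdf, rng=random):
    """Draw u uniformly, then binary-search for the first cdf entry above u."""
    u = rng.random() * cdf[-1]  # scale guards against a cdf that sums to 0.9999...
    index = bisect.bisect_right(cdf, u)
    return min(index, len(cdf) - 1)  # defensive clamp against rounding at the top end

cdf = make_cdf([0.40, 0.30, 0.15, 0.10, 0.05])
draws = [sample(cdf) for _ in range(10)]
```

Each draw costs one uniform random number and O(log N) comparisons.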

Amazingly, this can be done in constant time!
22
Constant Time Sampling

Theorem [Walker 77]: For any discrete pdf D
over a sample space of size N, a table of size
O(N) can be constructed in O(N) time that enables
random variables to be drawn from D using at most
two table lookups.

23
Sampling in O(1) time [Walker 77]

Start with a discrete pdf: 0.40, 0.30, 0.15, 0.10, 0.05
Construct a table of 2N entries.

Algorithm: Pick a column i at random. Pick x
uniformly from [0, 1]. If x < p_i, output i.
Else output Q_i.

       A     B     C     D     E
p_i    1     1     0.75  0.5   0.25
Q_i    -     -     A     B     A

In the table above:
Pr[B] = 1 · 0.2 + 0.5 · 0.2 = 0.3
Pr[C] = 0.75 · 0.2 = 0.15
24
Methods of Walker 77 (cont.)

Ok, but how do you construct the table?

Table construction: Take a below-average i.
Choose p_i to satisfy x_i = p_i / n. Set the j with
the largest x_j as Q_i. Reduce x_j accordingly. Repeat.

[Figure: worked construction for the pdf A = 0.40, B = 0.30, C = 0.15, D = 0.10, E = 0.05, ending in the table below]

       A     B     C     D     E
p_i    1     1     0.75  0.5   0.25
Q_i    -     -     A     B     A
Linear time construction.
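
The construction rule and the two-lookup sampling step can be sketched as follows. Two caveats: this version finds the smallest and largest items by scanning, so it is quadratic rather than linear (linear-time variants keep worklists of below-average and above-average items), and it uses exact fractions so the table can be checked by hand. The function names are hypothetical.

```python
import random
from fractions import Fraction

def build_alias_table(pdf):
    """Return (p, q): keep-probabilities and aliases, one pair per column."""
    n = len(pdf)
    x = [Fraction(v) for v in pdf]  # remaining probability mass per item
    avg = Fraction(1, n)
    p = [Fraction(1)] * n
    q = [None] * n
    done = [False] * n
    for _ in range(n):
        i = min((j for j in range(n) if not done[j]), key=lambda j: x[j])
        done[i] = True
        if x[i] >= avg:
            break  # every remaining item holds exactly the average mass; p stays 1
        p[i] = x[i] * n  # so that x_i = p_i / n
        j = max((j for j in range(n) if not done[j]), key=lambda j: x[j])
        q[i] = j
        x[j] -= avg - x[i]  # item j donates the rest of column i
    return p, q

def sample(p, q, rng=random):
    """Two lookups: pick a column, then keep it or take its alias."""
    i = rng.randrange(len(p))
    return i if rng.random() < p[i] else q[i]

# The pdf from the slide, with items A..E as indices 0..4 (strings keep it exact).
p, q = build_alias_table(["0.40", "0.30", "0.15", "0.10", "0.05"])
draw = sample(p, q)
```

On the slide's pdf this yields p = 1, 1, 0.75, 0.5, 0.25 with aliases A for C, B for D, and A for E.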
25
Back to extending FM sketches

We need to sample from $B(c, 2^{-S})$ for various
values of S.
Using Walker's method, we can sample from $B(c, 2^{-S})$
in O(1) time and O(c) space, assuming tables
are pre-computed offline.

26
Back to extending FM sketches (cont)

With more cleverness, we can trade off space for
time. Recall that,
Running time = time to sample from B + $O(\log^2 c)$
Sampling in $O(\log^2 c)$ time leads to $O(c / \log^2 c)$
space.
With max sensor value of $2^{16}$, saving a $\log^2 c$
term is a 256-fold space savings.
Tables for S = 1, 2, ..., 16 together take 4600
bytes
(without this optimization, tables would be >1MB)

27
Intermission

FM sketches require more work initially.
Need k bits to represent a single bit!
But
Sketched values can easily be aggregated.
Aggregation operation (OR) is both
order-insensitive and duplicate-insensitive.
Result is a natural fit with sensor aggregation.

28
Outline

Sensor database motivation
COUNT aggregation via Flajolet-Martin
SUM aggregation
Experimental evaluation

29
Experimental Results

We employ the publicly available TAG simulator.
Basic topologies: grid (2-D lattice) and random
Can modulate:
Grid size (default: 30 by 30)
Node, packet, or link loss rate (default: 5% link
loss rate)
Number of bitmaps (default: twenty 16-bit
sketches)
Transmission radius (default: 8 neighbors on the
grid)

30
Experimental Results

We consider four main methods.
TAG: transmit aggregates up a single tree
DAG-based TAG: Send a 1/k fraction of the
aggregated values to each of k parents.
SKETCH: broadcast an aggregated sketch to all
neighbors at level i - 1
LIST: explicitly enumerate all <key, value>
pairs and broadcast to all neighbors at level
i - 1.
LIST vs. SKETCH measures the penalty associated
with approximate values.

31
Message Comparison

TAG: transmit aggregates up a single tree
1 message transmitted per node.
1 message received per node (on average).
Message size: 16 bits.
SKETCH: broadcast a sketch up the tree
1 message transmitted per node.
Fanout of k receivers per transmission (constant
k).
Message size: 20 16-bit sketches = 320 bits.

32
COUNT vs Link Loss (Grid)
33
COUNT vs Link Loss (Grid)
34
COUNT vs Network Diameter (Grid)
35
COUNT vs Link Loss (Random)
36
SUM vs Link Loss
37
Compressibility