1
Approximate Aggregation Techniques for Sensor
Databases
  • John Byers
  • Department of Computer Science
    Boston University
  • Joint work with Jeffrey Considine, George
    Kollios, Feifei Li

2
Sensor Network Model
  • Large set of sensors distributed in a sensor
    field.
  • Communication via a wireless ad-hoc network.
  • Nodes and links are failure-prone.
  • Sensors are resource-constrained
  • Limited memory, battery-powered, messaging is
    costly.

3
Sensor Databases
  • Treat sensor field as a distributed database
  • But data is gathered on the fly, not stored or archived.
  • Perform standard queries over sensor field
  • COUNT, SUM, GROUP-BY
  • Exemplified by work such as TAG and Cougar
  • For this talk
  • One-shot queries
  • Continuous queries are a natural extension.

4
Tiny Aggregation (TAG) Approach [Madden,
Franklin, Hellerstein, Hong]
  • Aggregation component of TinyDB
  • Follows database approach
  • Uses simple SQL-like language for queries
  • Power-aware, in-network query processing
  • Optimizations are transparent to end-user.
  • TAG supports COUNT, SUM, AVG, MIN, MAX and others

5
TAG (continued)
  • Queries proceed in two phases
  • Phase 1
  • Sink broadcasts desire to compute an aggregate.
  • Nodes create a routing tree with the sink as the
    root.
  • Phase 2
  • Nodes start sending back partial results.
  • Each node receives the partial results of its
    children and computes a new partial result.
  • Then it forwards the new partial result to its
    parent.
  • Can compute any decomposable function
  • f(v1, v2, ..., vn) = g(f(v1, ..., vk), f(vk+1,
    ..., vn))
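The decomposability property above can be illustrated with a minimal Python sketch (the names `f_sum` and `g_merge` are ours, not TAG's):

```python
def f_sum(values):
    """The aggregate f: here, SUM over raw sensor readings."""
    return sum(values)

def g_merge(partial_a, partial_b):
    """The merge function g that combines two partial results."""
    return partial_a + partial_b

# Decomposability: f(v1, ..., vn) == g(f(v1, ..., vk), f(vk+1, ..., vn)),
# so partial results can be merged at every node on the way to the sink.
values = [3, 1, 4, 1, 5, 9]
left, right = values[:3], values[3:]
assert f_sum(values) == g_merge(f_sum(left), f_sum(right))  # 23 == 8 + 15
```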

6
Example for SUM
(Figure: spanning tree rooted at the sink; the values sum to 20)
  • Sink initiates the query
  • Nodes form a spanning tree
  • Each node sends its partial result to its parent
  • Sink computes the total sum

7
Classification of Aggregates
  • TAG classifies aggregates according to
  • Size of partial state
  • Monotonicity
  • Exemplary vs. summary
  • Duplicate-sensitivity
  • MIN/MAX (cheap and easy)
  • Small state, monotone, exemplary,
    duplicate-insensitive
  • COUNT/SUM (considerably harder)
  • Small state and monotone, BUT duplicate-sensitive
  • Cheap if aggregating over tree without losses
  • Expensive with multiple paths and losses

8
Basic approaches to computing SUM
  • Separate, reliable delivery of every value to
    sink
    Extremely costly in energy consumption
  • Aggregate values back to sink along a tree
  • A single fault eliminates values of an entire
    subtree
  • Split values and route fractions separately
  • Send (value / k) to each of k parents
  • Better variance, but same expectation as approach
    (2)
  • Send values along multiple paths
  • Duplicates need to be handled.
  • <ID, value> pairs have limited in-network
    aggregation.

9
Design Objectives for Robust SUM
  • Admit in-network aggregation of partial values
  • Let aggregates be both order-insensitive and
    duplicate-insensitive
  • Be agnostic to routing protocol
  • Trust routing protocol to be best-effort.
  • Routing and aggregation logically decoupled [NG
    03].
  • Some routing algorithms better than others.

10
Design Objectives (cont)
  • Final aggregate is exact if at least one
    representative from each leaf survives to reach
    the sink.
  • This won't happen in practice.
  • It is reasonable to hope for approximate results.
  • We argue that it is reasonable to use aggregation
    methods that are themselves approximate.

11
Outline
  • Motivation for sensor databases and aggregation.
  • COUNT aggregation via Flajolet-Martin
  • SUM aggregation
  • Experimental evaluation

12
Flajolet / Martin sketches [JCSS 85]
  • Goal Estimate N from a small-space
    representation of a set.
  • Sketch of a union of items is the OR of their
    sketches
  • Insertion order and duplicates don't matter!

Prerequisite: Let h be a random, binary hash
function.
Sketch of an item: For each unique item with ID x,
for each integer 1 <= i <= k in turn, compute
h(x, i). Stop when h(x, i) = 1, and set bit i.

X      = 0 0 1 0 0
Z      = 1 0 0 0 0
X OR Z = 1 0 1 0 0
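A minimal Python sketch of item insertion and sketch union, using a deterministic hash as a stand-in for the random binary hash h (the sketch length `K` and the function names are our own):

```python
import hashlib

K = 16  # sketch length in bits (an assumption for this example)

def h(x, i):
    """Deterministic stand-in for the random binary hash h(x, i):
    returns 0 or 1, each with probability ~1/2."""
    digest = hashlib.sha256(f"{x}:{i}".encode()).digest()
    return digest[0] & 1

def insert(sketch, x):
    """Set bit i, where i is the first index with h(x, i) == 1."""
    for i in range(K):
        if h(x, i) == 1:
            sketch[i] = 1
            break
    return sketch

def union(a, b):
    """Sketch of a union is the bitwise OR, so insertion order
    and duplicates don't matter."""
    return [x | y for x, y in zip(a, b)]

s1 = insert([0] * K, "sensor-42")
s2 = insert([0] * K, "sensor-42")  # duplicate insertion
assert union(s1, s2) == s1         # the duplicate changes nothing
```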
13
Flajolet / Martin sketches (cont)
Estimating COUNT: Take the sketch of a set of N
items. Let j be the position of the leftmost zero
in the sketch. Then j is an estimator of
log2(0.77 N).

S = 1 1 1 0 1 ...
j = 3
Best guess: COUNT ~ 2^j / 0.77 ~ 11
  • Fixable drawbacks
  • Estimate has faint bias
  • Variance in the estimate is large.
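The leftmost-zero estimator can be sketched as follows (the helper name is ours; the constant 0.77 is from the slide):

```python
def estimate_count(sketch):
    """Estimate N from a sketch: the position j of the leftmost zero
    satisfies j ~ log2(0.77 N), so N ~ 2^j / 0.77."""
    j = next((i for i, bit in enumerate(sketch) if bit == 0), len(sketch))
    return 2 ** j / 0.77

# Slide example: S = 1 1 1 0 1, so j = 3 and the estimate is
# 8 / 0.77, i.e. roughly 10-11 items.
print(estimate_count([1, 1, 1, 0, 1]))
```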

14
Flajolet / Martin sketches (cont)
  • Standard variance reduction methods apply.
  • Compute m independent sketches in parallel.
  • Compute m independent estimates of N.
  • Take the mean of the estimates.
  • Provable tradeoffs between m and variance of the
    estimator

15
Application to COUNT
  • Each sensor computes k independent sketches of
    itself (using its unique ID x)
  • Coming next: a sensor computes a sketch of its
    value.
  • Use a robust routing algorithm to route sketches
    up to the sink.
  • Aggregate the k sketches via union en-route.
  • The sink then estimates the count.

16
Multipath Routing
  • Braided Paths

Two paths from the source to the sink that differ
in at least two nodes
17
Routing Methodologies
  • Considerable work on reliable delivery via
    multipath routing
  • Directed diffusion [IGE 00]
  • Braided diffusion [GGSE 01]
  • GRAdient Broadcast [YZLZ 02]
  • Broadcast intermediate results along gradient
    back to source
  • Can dynamically control width of broadcast
  • Trade off fault tolerance and transmission costs
  • Our approach similar to GRAB
  • Broadcast. Grab if upstream, ignore if downstream
  • Common goal try to get at least one copy to sink

18
Simple Upstream Routing
  • By expanding ring search, nodes can compute their
    hop distance from the sink.
  • Refer to nodes at distance i as level i.
  • At level i, gather aggregates from level i + 1.
  • Then broadcast aggregates to level i - 1
    neighbors.
  • Ignore downstream and sidestream aggregates.
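The level computation amounts to a breadth-first search outward from the sink; a sketch under an assumed toy topology (node names are hypothetical):

```python
from collections import deque

def compute_levels(neighbors, sink):
    """Hop distance ("level") of each node from the sink via BFS,
    mimicking the expanding ring search."""
    level = {sink: 0}
    queue = deque([sink])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Hypothetical 5-node topology given as an adjacency list.
topo = {"sink": ["a", "b"], "a": ["sink", "c"], "b": ["sink", "c"],
        "c": ["a", "b", "d"], "d": ["c"]}
levels = compute_levels(topo, "sink")
assert levels == {"sink": 0, "a": 1, "b": 1, "c": 2, "d": 3}
```

A node at level i then accepts aggregates only from level i + 1 neighbors and broadcasts to level i - 1 neighbors.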

19
Extending Flajolet / Martin Sketches
  • Also interested in approximating SUM
  • FM sketches can handle this (albeit clumsily)
  • To insert a value of 500, perform 500 distinct
    item insertions
  • Our observation We can simulate a large number
    of insertions into an FM sketch more efficiently.
  • Sensor-net restrictions
  • No floating point operations
  • Must keep memory usage and CPU time to a minimum

20
Simulating a set of insertions
  • Set all the low-order bits in the safe region.
  • First S = log c - 2 log log c bits are set to 1
    w.h.p.
  • Statistically estimate the number of trials going
    beyond the safe region
  • Probability of a trial doing so is simply 2^-S
  • Number of trials is B(c, 2^-S). Mean: O(log^2 c)
  • For trials and bits outside the safe region, set
    those bits manually.
  • Running time is O(1) for each outlying trial.
  • Expected running time:
    O(log c) time to draw from B(c, 2^-S)
    + O(log^2 c)
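A rough Python illustration of the simulation (for clarity it draws the binomial naively and uses floating-point randomness, both of which the actual sensor-side scheme avoids; Walker's method, described on the next slides, replaces the naive binomial draw):

```python
import math
import random

def simulate_insertions(c, k=16, rng=None):
    """Simulate c distinct-item insertions into a k-bit FM sketch.
    The first S = log c - 2 log log c bits form the safe region and
    are simply set to 1; only the ~Binomial(c, 2^-S) trials that get
    past the safe region are simulated bit by bit."""
    rng = rng or random.Random(7)
    sketch = [0] * k
    S = max(1, int(math.log2(c) - 2 * math.log2(math.log2(c))))
    for i in range(S):
        sketch[i] = 1  # safe region: set w.h.p. after c insertions
    # Naive Binomial(c, 2^-S) draw -- the talk samples this in O(1)
    # time with Walker's method instead.
    survivors = sum(1 for _ in range(c) if rng.random() < 2 ** -S)
    for _ in range(survivors):
        i = S
        while i < k - 1 and rng.random() < 0.5:
            i += 1  # each further bit is reached with probability 1/2
        sketch[i] = 1
    return sketch

print(simulate_insertions(500))
```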

21
Fast sampling from discrete pdfs
  • We need to generate samples from B(n, p).
  • General problem: sampling from a discrete pdf.
  • Assume we can draw uniformly at random from [0, 1].
  • With an event space of size N
  • O(log N) lookups are immediate.
  • Represent the cdf in an array of size N.
  • Draw from [0, 1] and do binary search.
  • Cleverer methods for O(log log N), O(log N) time

Amazingly, this can be done in constant time!
22
Constant Time Sampling
  • Theorem [Walker 77]: For any discrete pdf D
    over a sample space of size N, a table of size
    O(N) can be constructed in O(N) time that enables
    random variables to be drawn from D using at most
    two table lookups.

23
Sampling in O(1) time [Walker 77]
  • Start with a discrete pdf: 0.40, 0.30, 0.15,
    0.10, 0.05
  • Construct a table of 2N entries.

Algorithm: Pick a column at random. Pick x
uniformly from [0, 1]. If x < pi, output i.
Else output Qi.

       A     B     C     D     E
  pi   1     1     0.75  0.5   0.25
  Qi   __    __    A     B     A

In the table above: Pr[B] = 1 * 0.2 + 0.5 * 0.2
= 0.3, and Pr[C] = 0.75 * 0.2 = 0.15
24
Methods of [Walker 77] (cont.)
  • OK, but how do you construct the table?

Table construction: Take a below-average column i.
Choose pi to satisfy xi = pi / n. Set the j with
the largest xj as Qi, and reduce xj accordingly.
Repeat.

Initial pdf:
       A     B     C     D     E
  xi   0.40  0.30  0.15  0.10  0.05

Resulting table:
       A     B     C     D     E
  pi   1     1     0.75  0.5   0.25
  Qi   __    __    A     B     A

Linear time construction.
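The construction and the two-lookup sampling can be sketched with the standard formulation of Walker's alias method (the table produced for the slide's pdf may differ from the one shown, since the construction order is not unique):

```python
import random

def build_alias_table(pdf):
    """Walker's alias construction: O(N) time, two arrays of size N."""
    n = len(pdf)
    scaled = [p * n for p in pdf]          # xi scaled so the average is 1.0
    prob = [0.0] * n
    alias = [None] * n
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    while small and large:
        i = small.pop()
        j = large.pop()
        prob[i] = scaled[i]                # keep column i with this probability
        alias[i] = j                       # otherwise output the alias j
        scaled[j] -= 1.0 - scaled[i]       # j donated the leftover mass
        (small if scaled[j] < 1.0 else large).append(j)
    for i in small + large:
        prob[i] = 1.0                      # numerical leftovers
    return prob, alias

def sample(prob, alias, rng):
    """Two lookups: pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

pdf = [0.40, 0.30, 0.15, 0.10, 0.05]       # the slide's example pdf
prob, alias = build_alias_table(pdf)
rng = random.Random(0)
counts = [0] * 5
for _ in range(100_000):
    counts[sample(prob, alias, rng)] += 1
# Empirical frequencies approach the pdf (e.g. index 0 near 0.40).
```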
25
Back to extending FM sketches
  • We need to sample from B(c, 2^-S) for various
    values of S.
  • Using Walker's method, we can sample from
    B(c, 2^-S) in O(1) time and O(c) space, assuming
    tables are pre-computed offline.

26
Back to extending FM sketches (cont)
  • With more cleverness, we can trade off space for
    time. Recall that:
  • Running time = time to sample from B + O(log^2 c)
  • Sampling in O(log^2 c) time leads to
    O(c / log^2 c) space.
  • With a max sensor value of 2^16, saving a log^2 c
    term is a 256-fold space savings.
  • Tables for S = 1, 2, ..., 16 together take 4600
    bytes
  • (without this optimization, tables would be >1MB)

27
Intermission
  • FM sketches require more work initially.
  • Need a k-bit sketch to represent a single value!
  • But
  • Sketched values can easily be aggregated.
  • Aggregation operation (OR) is both
    order-insensitive and duplicate-insensitive.
  • Result is a natural fit with sensor aggregation.

28
Outline
  • Sensor database motivation
  • COUNT aggregation via Flajolet-Martin
  • SUM aggregation
  • Experimental evaluation

29
Experimental Results
  • We employ the publicly available TAG simulator.
  • Basic topologies grid (2-D lattice) and random
  • Can modulate:
  • Grid size (default: 30 by 30)
  • Node, packet, or link loss rate (default: 5% link
    loss rate)
  • Number of bitmaps (default: twenty 16-bit
    sketches)
  • Transmission radius (default: 8 neighbors on the
    grid)

30
Experimental Results
  • We consider four main methods.
  • TAG: transmit aggregates up a single tree
  • DAG-based TAG: Send a 1/k fraction of the
    aggregated values to each of k parents.
  • SKETCH: broadcast an aggregated sketch to all
    neighbors at level i - 1
  • LIST: explicitly enumerate all <key, value>
    pairs and broadcast to all neighbors at level
    i - 1.
  • LIST vs. SKETCH measures the penalty associated
    with approximate values.

31
Message Comparison
  • TAG: transmit aggregates up a single tree
  • 1 message transmitted per node.
  • 1 message received per node (on average).
  • Message size: 16 bits.
  • SKETCH: broadcast a sketch up the tree
  • 1 message transmitted per node.
  • Fanout of k receivers per transmission (constant
    k).
  • Message size: 20 16-bit sketches = 320 bits.

32
COUNT vs Link Loss (Grid)
33
COUNT vs Link Loss (Grid)
34
COUNT vs Network Diameter (Grid)
35
COUNT vs Link Loss (Random)
36
SUM vs Link Loss
37
Compressibility
  • The FM sketches are amenable to compression.
  • We employ a very basic method:
  • Run-length encode the initial prefix of ones.
  • Run-length encode the suffix of zeroes.
  • Represent the middle explicitly.
  • Method can be applied to a group of sketches.
  • This alone buys about a factor of 3.
  • Better methods exist.
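A toy version of this prefix/suffix run-length scheme (the tuple representation is our own, not the talk's wire format):

```python
def compress(sketch):
    """Run-length encode the all-ones prefix and all-zeros suffix;
    the middle bits are kept verbatim."""
    n = len(sketch)
    p = 0
    while p < n and sketch[p] == 1:
        p += 1  # length of the leading run of ones
    s = n
    while s > p and sketch[s - 1] == 0:
        s -= 1  # start of the trailing run of zeroes
    return p, n - s, sketch[p:s]

def decompress(prefix_ones, suffix_zeroes, middle):
    return [1] * prefix_ones + middle + [0] * suffix_zeroes

# FM sketches tend to look like 1...1 <mixed middle> 0...0,
# so both runs compress well.
sk = [1, 1, 1, 1, 0, 1, 0, 1] + [0] * 8
assert compress(sk) == (4, 8, [0, 1, 0, 1])
assert decompress(*compress(sk)) == sk
```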

38
Compression
39
Space Usage
40
Future Directions
  • Spatio-temporal queries
  • Restrict queries to specific regions of space,
    time, or space-time.
  • Other aggregates
  • What else needs to be computed or approximated?
  • Better aggregation methods
  • FM sketches have rather high variance.
  • Many other sketching methods can potentially be
    used.