Title: Model-Driven Data Acquisition in Sensor Networks - Part II
1. Model-Driven Data Acquisition in Sensor Networks - Part II
- Paper by Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein and Wei Hong
- In Proceedings of VLDB 2004
- Presented by Shantha Ramachandran
- Oct. 4, 2004
2. Last Time
- BBQ: a probabilistic model used to create an observation plan for a sensor network
- Probability density function
- Value, range and average queries
3. Outline
- Introduction
- Overview of approach
- Model-based querying
- Choosing an observation plan
- Experimental results
- Extensions and future directions
- Related work
- Conclusions
4. Choosing an Observation Plan
- Pdfs can be conditioned on the values of observed attributes o
- Gives us a more confident answer to a query
- Important: which attributes should we observe?
- Focus: select attributes that are expected to increase the confidence in the answer to the query, at minimum cost
5. Cost of Observations
- Let O ⊆ {1, …, n} be a set of attributes to observe
- Expected cost C(O) of observing attributes O
- C(O) = Ca(O) + Ct(O)
6.
- Ca(O): data acquisition cost
- Sum of the energy required to observe the attributes in O
- Ca(O) = Σi∈O Ca(i)
- where Ca(i) is the cost of observing attribute i
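The acquisition cost is a simple sum; a minimal sketch (not the authors' code; the attribute names and per-reading energy costs in mJ are made up for illustration):

```python
# Minimal sketch of C_a(O) = sum over i in O of C_a(i).
# Attribute names and per-reading costs (mJ) are hypothetical.
def acquisition_cost(observed, cost_per_attr):
    """Sum the per-attribute acquisition costs over the observed set O."""
    return sum(cost_per_attr[i] for i in observed)

costs = {"temperature": 0.5, "humidity": 0.5, "light": 0.35}
print(acquisition_cost({"temperature", "light"}, costs))
```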
7.
- Ct(O): expected data transmission cost
- Depends on the data collection mechanism used to collect observations from the network
- Depends on network topology
- If the topology is unknown or changing, the cost function is essentially random
- Therefore, assume networks with known topologies
8.
- Network graph
- Set of edges E
- Each edge eij has
- 2 link quality estimates: pij, pji
- pij: probability that a packet from i will reach j
- Assume pij, pji are independent
- Expected number of transmission and acknowledgement messages required to guarantee a successful transmission is 1 / (pij · pji)
- Use these values to estimate transmission cost
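Under the independence assumption, a round trip succeeds only when both the data packet (probability pij) and the acknowledgement (probability pji) get through, so the number of attempts is geometric with success probability pij·pji and mean 1/(pij·pji). A quick Monte Carlo sketch with hypothetical link qualities:

```python
import random

def expected_attempts(p_ij, p_ji, trials=100_000, seed=0):
    """Estimate the mean number of send/ack rounds until one succeeds."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        attempts = 1
        # Retry until both the packet and its acknowledgement get through.
        while not (rng.random() < p_ij and rng.random() < p_ji):
            attempts += 1
        total += attempts
    return total / trials

print(expected_attempts(0.8, 0.9))
```

With pij = 0.8 and pji = 0.9, the analytic value is 1/0.72 ≈ 1.39 attempts, and the simulated estimate should land close to it.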
9.
- Choose a simple path through the network that visits all the sensors, observes O, and returns
- Ct(O) is defined to be the expected cost of this path
- C(O) = Ca(O) + Ct(O)
10. Improvement in Confidence
- Observing attributes O should improve the confidence of the posterior density
- We should be able to answer the query with a higher level of confidence
11.
- Suppose we have a range query: Xi ∈ [ai, bi]
- We can compute the benefit Ri(o) of an observation o
- Ri(o) = max{ P(Xi ∈ [ai, bi] | o), 1 − P(Xi ∈ [ai, bi] | o) }
- Ri(o) measures our confidence after observing o
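A one-line sketch of this benefit (assuming the model has already computed the posterior probability that the predicate holds):

```python
def range_benefit(p_in_range):
    """R_i(o) = max{ P(X_i in [a_i, b_i] | o), 1 - P(X_i in [a_i, b_i] | o) }:
    confidence in the more likely truth value of the range predicate."""
    return max(p_in_range, 1.0 - p_in_range)

# The benefit bottoms out at 0.5 when the predicate is a coin flip and
# approaches 1 as either outcome becomes near-certain.
print(range_benefit(0.9), range_benefit(0.5))
```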
12.
- For value and average queries
- Ri(o) = P(Xi ∈ [x̄i − ε, x̄i + ε] | o)
- x̄i is the posterior mean of Xi given o
13.
- The specific value o of O is not known a priori
- Must compute the expected benefit Ri(O)
- Ri(O) = ∫ p(o) Ri(o) do
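The integral can be approximated by sampling o from its marginal p(o). The sketch below uses a toy one-dimensional Gaussian model, not the paper's multivariate one: prior X ~ N(0, 1) and observation O = X + N(0, 0.25), so X | o ~ N(0.8o, 0.2) and O ~ N(0, 1.25).

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_range_benefit(a, b, samples=50_000, seed=0):
    """Monte Carlo estimate of R_i(O) = integral of p(o) R_i(o) do for the
    toy model: X ~ N(0,1), O = X + N(0,0.25), so X | o ~ N(0.8*o, 0.2)."""
    rng = random.Random(seed)
    sd_post = math.sqrt(0.2)
    total = 0.0
    for _ in range(samples):
        o = rng.gauss(0.0, math.sqrt(1.25))   # draw o from its marginal p(o)
        mu = 0.8 * o                          # posterior mean of X given o
        p_in = phi((b - mu) / sd_post) - phi((a - mu) / sd_post)
        total += max(p_in, 1.0 - p_in)        # R_i(o) for the range predicate
    return total / samples

print(expected_range_benefit(-0.5, 0.5))
```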
14.
- We may have range or value queries over multiple attributes
- Trying to achieve a particular marginal confidence for each attribute
- Must decide how to trade off confidences between different attributes
15.
- For a query over attributes Q ⊆ {1, …, n}
- Define the total benefit R(o) as either
- R(o) = mini∈Q Ri(o)
- R(o) = (1/|Q|) Σi∈Q Ri(o)
- Focus: minimize the number of mistakes made by the query processor
- Use the average benefit to decide when to stop observing new attributes
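Both aggregations are one-liners; the per-attribute confidences in the example are hypothetical:

```python
def total_benefit_min(benefits):
    """R(o) = min over i in Q of R_i(o): the least-confident attribute dominates."""
    return min(benefits)

def total_benefit_avg(benefits):
    """R(o) = (1/|Q|) * sum over i in Q of R_i(o): average marginal confidence."""
    return sum(benefits) / len(benefits)

per_attr = [0.99, 0.80, 0.95]  # hypothetical R_i(o) values after observing o
print(total_benefit_min(per_attr), total_benefit_avg(per_attr))
```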
16. Optimization
- We have so far defined R(O) and C(O) as expected benefit and cost
- Different sets of observed attributes lead to different benefits and costs
- If the user wants confidence level 1 − δ, we want to pick a set of attributes O that meets the confidence at minimum cost
- minimizeO C(O), such that R(O) ≥ 1 − δ
- This is generally NP-hard
17.
- Two algorithms for solving the optimization problem
- Exhaustive search
- Greedy algorithm
18. Exhaustive Search
- Exhaustively search over all possible subsets O of observations
- Finds the optimal subset
- Exponential running time
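A sketch of the exhaustive search, with toy stand-ins for R(O) and C(O) (the real ones come from the probabilistic model and the network costs defined earlier):

```python
from itertools import combinations

def exhaustive_plan(attrs, benefit, cost, delta):
    """Try every subset O of attrs and return the cheapest one whose
    expected benefit meets R(O) >= 1 - delta. Optimal but exponential."""
    best, best_cost = None, float("inf")
    for r in range(len(attrs) + 1):
        for subset in combinations(sorted(attrs), r):
            O = frozenset(subset)
            if benefit(O) >= 1.0 - delta and cost(O) < best_cost:
                best, best_cost = O, cost(O)
    return best, best_cost

# Toy stand-ins: benefit grows with set size, cost is additive over
# hypothetical per-attribute costs.
per_cost = {"a": 1.0, "b": 2.0, "c": 3.0}
benefit = lambda O: 0.5 + 0.1 * len(O)
cost = lambda O: sum(per_cost[i] for i in O)
print(exhaustive_plan({"a", "b", "c"}, benefit, cost, delta=0.35))
```

With these stand-ins, any two attributes reach the 0.65 confidence target, so the cheapest pair wins.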
19. Greedy Algorithm
- Uses a greedy incremental heuristic
- Initialize with an empty set of attributes, O = ∅
- For each attribute Xi not in the set
- Compute the new expected benefit R(O ∪ {i}) and cost C(O ∪ {i})
- If some set G of candidates reaches the desired confidence
- Then pick from G the one with the lowest total cost
- And terminate
- Else if G = ∅, we have not reached our desired confidence
- Then add the attribute with the highest benefit-to-cost ratio
- Repeat until the desired confidence is reached
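The loop above can be sketched as follows, with toy stand-ins for the expected benefit R(O) and cost C(O):

```python
def greedy_plan(attrs, benefit, cost, delta):
    """Greedy heuristic: grow O one attribute at a time. If any candidate
    set reaches confidence 1 - delta, return the cheapest such candidate;
    otherwise add the attribute with the best benefit-to-cost ratio."""
    O = set()
    while attrs - O:
        # Evaluate R and C for each one-attribute extension of O.
        candidates = [(benefit(O | {i}), cost(O | {i}), i) for i in attrs - O]
        # G: extensions that already reach the desired confidence.
        G = [c for c in candidates if c[0] >= 1.0 - delta]
        if G:
            _, _, i = min(G, key=lambda c: c[1])  # cheapest qualifying set
            return O | {i}
        # Otherwise, grow by the best benefit-per-unit-cost attribute.
        _, _, i = max(candidates, key=lambda c: c[0] / c[1])
        O.add(i)
    return None  # no subset reaches the desired confidence

# Toy stand-ins (hypothetical): benefit grows with set size, cost is additive.
per_cost = {"a": 1.0, "b": 2.0, "c": 3.0}
benefit = lambda O: 0.5 + 0.1 * len(O)
cost = lambda O: sum(per_cost[i] for i in O)
print(greedy_plan({"a", "b", "c"}, benefit, cost, 0.35))
```

On this toy instance the greedy plan happens to match the optimal set; being a heuristic, it may pick a costlier set in general.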
20. Experimental Results
- Measure performance of BBQ on real-world data sets
- Goal: demonstrate that BBQ can efficiently execute approximate queries with user-specified confidences
21. Data Sets
- Results based on experiments on two real-world data sets
- Collected using TinyDB
22. Garden
- One-month trace of 83,000 readings
- 11 sensors in a redwood tree at the UC Botanical Garden in Berkeley
- Sensors placed at four different altitudes
- Collected light, humidity, temperature and voltage readings once every five minutes
- Data split into training and test data sets
- Model built on the training set
23. Lab
- 54 sensors in Intel Research, Berkeley lab
- Collected light, humidity, temperature and
voltage readings - Also collected network connectivity information
- 8 days of readings
- 6 days training
- 2 days test
24. Query Workload
- Two sets of query workloads
- Value queries
- Predicate queries
25. Value Queries
- Main type of query anticipated
- Ask to report sensor readings at all sensors
- Within error bound ε
- With specified confidence δ
26. Predicate Queries
- Selection queries over sensor readings
- Ask for all sensors that satisfy a certain predicate
- With specified confidence δ
- Also looked at average queries
- Those results are not presented
27. Comparison Systems
- Compare BBQ against
- TinyDB-style Querying
- Approximate-Caching
28. TinyDB-style Querying
- Query is disseminated into the sensor network using a tree structure
- At each mote, the sensor reading is observed
- Results are reported back along the same tree to the base station
- Results are combined on the way back to minimize communication costs
29. Approximate-Caching
- Base station maintains a view of the readings at all motes
- The view is guaranteed to be within a certain interval of the actual sensor readings
- If a sensor's value falls outside this interval, the mote is required to report a new reading
30. Methodology
- BBQ used to build a model of the training data
- Includes a transition model for each hour of the day
- Generate traces from the test data by taking one reading randomly per hour
- Issue one query per hour
31.
- Model computes the a priori probability for each predicate
- Choose one or more sensor readings to observe if confidence bounds are not met
- Execute the generated observation plan
- Update the model with observed values from the test data
- Compare predicted values for non-observed readings to the test data from that hour
32.
- Measure accuracy
- Compute average mistakes per hour
- How many reported values are further from the actual values than the specified error bound
- Number of predicates whose truth value was incorrectly approximated
33.
- TinyDB
- All queries answered correctly
- Approximate-Caching
- Values reported upon deviation
- No mistakes!
34.
- Compute cost and accuracy for each observation plan
- Both acquisition cost and communication cost
35. Garden Dataset: Value-Based Queries
- Want to analyze the performance of value queries on the garden dataset in detail
- Show effectiveness of BBQ
- Query requires the system to report temperatures at all motes to within a specified ε
- Confidence = 97%, ε varied
36.
- Vary ε from 0 to 1 degrees Celsius
- Cost of BBQ falls rapidly as ε increases
- Percentage of errors stays below the 5% threshold
Figure 4: Relative Costs [1]
37.
- For reasonable values of ε
- BBQ uses significantly less communication
- Approximate-Caching always reports values to within ε
- Does not make mistakes
- Its average observation error is close to BBQ's
38.
- Percentage of sensors that BBQ observes by hour
- Varying ε
Figure 5: Number of sensors [1]
39.
- As ε gets small (< 0.1), BBQ must observe all nodes on every query
- Variance between nodes is high enough that it cannot infer the value of one sensor from another's with any accuracy
- As ε gets large (> 1), few observations are needed
- Changes in one sensor predict the values of others
- Intermediate ε
- More observations are needed, especially during times when readings change drastically
40. Garden Dataset: Cost vs. Confidence
- Compare cost of plan execution
- Confidence from 80% to 99%
- ε varying between 0.1 and 1.0
Figure 6: Energy and errors vs. confidence interval and epsilon [1]
41.
- Loosening the required confidence or the error bound ε reduces the energy per query
- Confidence = 95%
- Error bound ε = 0.5
- Reduces expected energy cost from 5.4 J to 150 mJ per query
- A factor-of-40 reduction
Figure 6A [1]
42.
- Meets or exceeds the confidence interval in almost all cases
Figure 6B [1]
43. Additional Experiments
- Performance of the greedy algorithm vs. the optimal algorithm
- Performance of a dynamic filter vs. a static model
44. Garden Dataset: Range Queries
- Ran a number of experiments with range queries
- Average number of observations required for 95% confidence
Figure 7: BBQ's performance [1]
45.
- 3 different range queries
- Temp. in [17, 18]
- Temp. in [19, 20]
- Temp. in [21, 22]
- In all 3 cases, error rates were at or below 5%
- Different range queries required observations at
different times
46. Lab Dataset
- Similar experiments run on the lab dataset
- Higher number of attributes
- Indoor temperatures are harder to predict than outdoors
- Human intervention → randomness
47.
- Cost incurred answering a value query
- Confidence varied
- As the required confidence drops, BBQ becomes more efficient
Figure 8A: Energy vs. confidence interval and epsilon [1]
48.
- BBQ achieved the specified confidence bounds in almost all cases
Figure 8B: Errors vs. confidence interval and epsilon [1]
49.
- Example traversal executing a value query
- 99% confidence, ε = 0.5 degrees C
- Initial set, 8 am
Figure 9: Traversals of the lab network [1]
50. Extensions and Future Directions
- So far, the authors have focused on the core architecture of BBQ
- Goal: unifying probabilistic models with declarative queries
- Several possible extensions
51. Conditional Plans
- Generate plans that include early stopping conditions
- Generate plans that explore different parts of the network, depending on the values of observed attributes
52. More Complex Models
- In particular, models that detect faulty sensors
- Answer fault detection queries
- Give correct answers to queries in the presence
of faults
53. Outliers
- Outlier detection does not work well in the current implementation
- The only way to detect outliers is to continuously sample the sensors
- An outlier scheme would likely have a high sensing cost
- Probabilistic techniques are expected to help avoid excessive communication
54. Support for Dynamic Networks
- Current approach works best in a static network
- Systematically study how network topologies change over time
- New sensors added, existing sensors move
- Topology-change recovery strategies
- Find alternate routes through the network
55. Continuous Queries
- Currently, the exploration plan re-executes at the root node
- It may be possible to install code that causes devices to periodically push readings during times of high change
56. Related Work
- Substantial work has been done on approximate query processing in the database community
- Using model-like synopses for query answering
- Instead of probabilistic models
- Most do not use correlations
57. AQUA Project
- Proposes sampling-based synopses that can provide approximate answers to a variety of queries using a fraction of the total data in the database
- Includes tight bounds on the correctness of the answers
- Designed to work in an environment where it is possible to generate independent random samples of data
58.
- Does not exploit correlations
- Lacks the predictive power of probabilistic models
- Others propose exploiting correlations through graphical-model techniques for approximate query processing
59. Approximate Caching
- Olston et al.
- Provides a bounded approximation of the values of a number of cached objects at some server
- Server stores cached values along with absolute bounds on deviation
- When objects notice values outside their bounds, they send an update
60.
- This requires cached objects to continuously monitor their values
- High energy overhead
- Can detect outliers
- BBQ cannot
61. IDSQ: Information Driven Sensor Querying
- Probabilistic models
- Estimation of target position in tracking applications
- Sensors are tasked to maximally reduce the positional uncertainty of the target
62. ACQP: Acquisitional Query Processing
- Query processing in an environment like sensor networks
- Must be sensitive to the costs of acquiring data
- Main goal: avoid unnecessary data acquisition
63. CONTROL
- Provides an interface that allows users to see partially complete answers with confidence bounds for long-running aggregate queries
- No correlations
64. Conclusions
- Proposed an architecture for integrating database systems with a correlation-aware probabilistic model
- Do not directly query the network
- Rather, build a model from stored and current readings
- Answer SQL queries by consulting the model
65.
- Advantages to using a model in a sensor network
- Shields users from faulty sensors
- Reduces the number of expensive sensor readings and radio transmissions
66.
- BBQ shows encouraging order-of-magnitude reductions in sampling and communication costs
- BBQ's general architecture is seen as a proper platform for answering queries and interpreting data from real-world environments like sensornets
- Conventional database technology is not equipped to deal with the lossiness, noise and non-uniformity inherent in such environments
67. Thank You. Questions?
[1] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, W. Hong. Model-Driven Data Acquisition in Sensor Networks. In Proc. of the 30th VLDB Conference, 2004.