Title: Probabilistic Aggregation in Distributed Networks
1. Probabilistic Aggregation in Distributed Networks
- Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz
- hling, ravenben, adj, kubitron_at_eecs.berkeley.edu
- June, 2004
2. Outline
- Background
- Motivation
- Statistical properties of real-life data streams
- Problems with existing approaches
- Our Approach
- Reduce communication overhead
- Recover from loss
- Evaluation
- Conclusion and future work
3. Background
- Aggregate functions
- MIN, MAX, AVG, COUNT, etc.
- In-network hierarchical processing (a minimal sketch follows this list)
- Query propagation
- Tree construction
- Aggregates computed epoch by epoch
- Addressing fault-tolerance
- Multi-root
- Multi-tree
- Reliable transmission
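A minimal sketch of epoch-by-epoch in-network aggregation over a tree; the tree, readings, and function name here are illustrative assumptions rather than the authors' implementation:

```python
def aggregate(node, readings, children, op=max):
    """Combine this node's own reading with the partial aggregates reported
    by its children, so each node forwards a single value per epoch."""
    partials = [aggregate(c, readings, children, op) for c in children.get(node, [])]
    return op([readings[node]] + partials)

# A small example tree rooted at the query node.
children = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 17.0, "a": 21.5, "b": 12.0, "c": 30.2, "d": 19.8}

print(aggregate("root", readings, children))           # MAX -> 30.2
print(aggregate("root", readings, children, op=min))   # MIN -> 12.0
```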
4. Motivation
- Data aggregation is an important function for all network infrastructures
  - Sensor networks
  - P2P networks
  - Network monitoring and intrusion detection systems
- Exact results are not achievable in the face of loss and faults
- High cost when adding fault-tolerance
- Low communication overhead with accurate approximation is crucial
  - But it is difficult to achieve
5. Observation: Comparison of Data Streams
Three real-world data traces and a random trace
6. Statistical Properties of Data Streams
Relative Increment is defined as the normalized change of the data stream between consecutive epochs (one common form is sketched below).
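As an illustrative assumption (the slide does not spell the formula out), a standard way to write the relative increment of a stream x_t is r_t = (x_t - x_{t-1}) / x_{t-1}, where x_t is the reading at epoch t.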
There is temporal correlation in real data streams, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
Density estimation for relative increment
7. Problems in Existing Approaches
- Few approaches exploit temporal properties or are designed to handle data loss
  - Simple last-value algorithm for data loss recovery in TAG
- Multi-root/multi-tree schemes make things worse by consuming more resources
- Fragile for large process groups
  - Need all relevant nodes to participate
- Difficult to trade accuracy for communication overhead
  - Good applications need this tradeoff
  - Only need an approximation, but want to minimize resource consumption
- Centralized solution of adaptive filtering proposed by Olston et al.
8. Our Approach
- Probabilistic data aggregation: a scalable and robust approach
  - Exploit and leverage the statistical properties of data streams in the temporal domain
  - Apply statistical algorithms to data aggregation
- Develop a protocol that handles loss and failures as an essential part of normal operation
  - Nodes participate in aggregation and communication according to a statistical sampling algorithm
  - In the absence of data, estimate values using time-series algorithms
  - Differentiate between voluntary and involuntary loss
9. Reducing Communication Overhead
- Trade off between accuracy and resource consumption
- Allow selective participation of nodes while maintaining aggregate accuracy
- A node participates in the operation with a certain probability, which is the design parameter of the algorithm
- Sampling strategies (a sketch follows this list)
  - Uniform sampling: all nodes use the identical sampling rate
  - Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
  - Variance-based sampling: a sensor only reports a new value if it is above or below a threshold percentage of its last reported value
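A minimal sketch of these three strategies; the function names, parameters, and threshold value are illustrative assumptions, not the paper's implementation:

```python
import random

def uniform_sample(p):
    """Uniform sampling: every node reports with the same probability p."""
    return random.random() < p

def subtree_size_sample(subtree_size, rate_per_node):
    """Subtree-size based sampling: the reporting probability is
    proportional to the number of nodes in this node's subtree."""
    return random.random() < min(1.0, rate_per_node * subtree_size)

def variance_based_report(current, last_reported, threshold=0.05):
    """Variance-based sampling: report only when the new reading deviates
    from the last reported value by more than a threshold fraction."""
    return abs(current - last_reported) > threshold * abs(last_reported)
```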
10. Performance of Sampling Algorithms
- As fewer nodes participate, overall accuracy decreases for all algorithms.
- Uniform sampling performs worst.
- Variance-based sampling is the most accurate.
11. Observation: Long-Term Patterns in Data
Daily patterns in a week-long data stream
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, averaged over 5-minute intervals.
12. Two-Level Representation of Data
Monday Data
The data stream can be decomposed into two layers: the long-term trend (pattern), which changes slowly, and the residual, which has high frequency but low amplitude (a sketch of the decomposition follows).
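As a rough sketch of this decomposition, assuming a smoothing B-spline as the trend estimator and synthetic data in place of the real traces:

```python
import numpy as np
from scipy.interpolate import splev, splrep

# Synthetic stand-in for one day of 5-minute measurements: a slow daily
# pattern plus a low-amplitude, high-frequency residual.
t = np.arange(288.0)                                  # 288 five-minute epochs
stream = 50 + 30 * np.sin(2 * np.pi * t / 288) + np.random.normal(0, 2, 288)

# Long-term trend: a smoothed, low-degree (cubic) B-spline fit.
# The smoothing factor s is a guess; the paper fits the spline by least squares.
tck = splrep(t, stream, k=3, s=len(t) * 4.0)
trend = splev(t, tck)

# Residual layer: whatever the slow trend does not capture.
residual = stream - trend
```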
13. Recovering From Loss
- Traditional approaches (sketched after this list)
  - Last-seen data as the approximation for the current epoch
  - Linear prediction
- Two-level data representation and prediction
  - Long-term trend: B-spline estimation
  - High-frequency residual: ARMA modeling
  - ARMA stands for the AutoRegressive Moving Average model, a standard time-series technique for modeling chaotic data streams
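A minimal sketch of the two traditional baselines; the history list and function names are illustrative assumptions:

```python
def last_value_predict(history):
    """Last-seen value: reuse the most recent reading for the missing epoch."""
    return history[-1]

def linear_predict(history):
    """Linear prediction: extrapolate the last two readings one epoch ahead."""
    return history[-1] + (history[-1] - history[-2])

history = [10.0, 12.0, 13.5]          # readings from recent epochs
print(last_value_predict(history))    # 13.5
print(linear_predict(history))        # 15.0
```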
14. Two-Level Data Prediction
- B-spline modeling for the long-term trend
  - Piecewise-continuous, low-degree B-splines can represent complex shapes
  - Least-squares B-spline regression for the two-level decomposition
  - B-spline extension for future forecasting
- ARMA forecasting for transient oscillation (a sketch follows this list)
  - System identification to determine the order of the model
  - Parameter estimation by an optimization algorithm
  - Low-complexity recursive equation for future forecasting
  - Statistical properties used to calibrate the prediction results
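A sketch of the residual-forecasting step, using statsmodels' ARIMA with d = 0 as a stand-in for the paper's own parameter estimation and recursive forecasting; the model order (2, 2), the synthetic residual series, and the 12-epoch horizon are assumptions:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for the high-frequency residual left after removing
# the B-spline trend (roughly an AR(2) process with small noise).
rng = np.random.default_rng(0)
residual = np.zeros(200)
noise = rng.normal(0, 0.5, 200)
for i in range(2, 200):
    residual[i] = 0.6 * residual[i - 1] - 0.2 * residual[i - 2] + noise[i]

# ARMA(p, q) is ARIMA(p, 0, q); in the paper the order would come from
# system identification rather than being fixed by hand.
model = ARIMA(residual, order=(2, 0, 2)).fit()
residual_forecast = model.forecast(steps=12)   # residual for the next 12 epochs

# The full two-level prediction adds this to the extended B-spline trend:
# prediction[t] = trend_forecast[t] + residual_forecast[t]
```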
15. Performance of Prediction Algorithms
Performance of prediction algorithms for the MAX operation in a lossless environment.
16. Performance of Prediction Algorithms
Performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%, and the ratio of loss rates between wide-area links and local links is 3:1.
17. Summary of Results
- All prediction algorithms are effective in improving the accuracy of aggregation results
- The two-level prediction approach performs the best in all situations
  - Achieves more than 90% accuracy even with per-node non-participation rates of up to 60%
  - Is effective even in a high-loss environment
18. Conclusion and Future Work
- Apply statistical algorithms to the data aggregation system
  - Quantify the statistical properties of real-world measurement data
  - Propose the concept of probabilistic participation of nodes
  - Propose a multi-level prediction mechanism to recover from sampling and data loss
- Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
- Future work
  - Develop online algorithms and exploit the tradeoff between prediction accuracy and computation/storage cost
  - Build a real system for applications such as health monitoring, traffic measurement, and router statistics aggregation
  - Real system implementation and deployment
19. The Danger of Prediction