Title: Probabilistic Aggregation in Distributed Networks
1. Probabilistic Aggregation in Distributed Networks
- Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz
- hling, ravenben, adj, kubitron_at_eecs.berkeley.edu
- June, 2004
2. Outline
- Background
- Motivation
- Statistical properties of real-life data streams
- Problems with existing approaches
- Our Approach
- Reduce communication overhead
- Recover from loss
- Evaluation
- Conclusion and future work
3. Background
- Aggregate functions
- MIN, MAX, AVG, COUNT, etc.
- In-network hierarchical processing (a minimal sketch follows this list)
- Query propagation
- Tree construction
- Aggregates computed epoch by epoch
- Addressing fault-tolerance
- Multi-root
- Multi-tree
- Reliable transmission
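A minimal sketch of epoch-by-epoch in-network aggregation over a tree; the tree, readings, and function name here are illustrative assumptions rather than the authors' implementation:

```python
def aggregate(node, readings, children, op=max):
    """Combine this node's own reading with the partial aggregates reported
    by its children, so each node forwards a single value per epoch."""
    partials = [aggregate(c, readings, children, op) for c in children.get(node, [])]
    return op([readings[node]] + partials)

# A small example tree rooted at the query node.
children = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 17.0, "a": 21.5, "b": 12.0, "c": 30.2, "d": 19.8}

print(aggregate("root", readings, children))           # MAX -> 30.2
print(aggregate("root", readings, children, op=min))   # MIN -> 12.0
```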
4. Motivation
- Data aggregation is an important function for all network infrastructures
  - Sensor networks
  - P2P networks
  - Network monitoring and intrusion detection systems
- Exact results are not achievable in the face of loss and faults
- High cost when adding fault-tolerance
- Low communication overhead with accurate approximation is crucial
  - But it is difficult to achieve
5. Observation: Comparison of Data Streams
Three real-world data traces and a random trace
6. Statistical Properties of Data Streams
Relative Increment is defined as the normalized change of the data stream between consecutive epochs (one common form is sketched below).
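As an illustrative assumption (the slide does not spell the formula out), a standard way to write the relative increment of a stream x_t is r_t = (x_t - x_{t-1}) / x_{t-1}, where x_t is the reading at epoch t.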
There is temporal correlation in real data streams, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
Density estimation for relative increment
7. Problems in Existing Approaches
- Few approaches exploit temporal properties or are designed to handle data loss
  - Simple last-value algorithm for data loss recovery in TAG
- Multi-root/multi-tree schemes make things worse by consuming more resources
- Fragile for large process groups
  - Need all relevant nodes to participate
- Difficult to trade accuracy for communication overhead
  - Good applications need this tradeoff
  - Only need an approximation, but want to minimize resource consumption
- Centralized solution of adaptive filtering proposed by Olston et al.
8. Our Approach
- Probabilistic data aggregation: a scalable and robust approach
  - Exploit and leverage the statistical properties of data streams in the temporal domain
  - Apply statistical algorithms to data aggregation
- Develop a protocol that handles loss and failures as an essential part of normal operation
  - Nodes participate in aggregation and communication according to a statistical sampling algorithm
  - In the absence of data, estimate values using time-series algorithms
  - Differentiate between voluntary and involuntary loss
9. Reducing Communication Overhead
- Trade off between accuracy and resource consumption
- Allow selective participation of nodes while maintaining aggregate accuracy
- A node participates in the operation with a certain probability, which is the design parameter of the algorithm
- Sampling strategies (a sketch follows this list)
  - Uniform sampling: all nodes use the identical sampling rate
  - Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
  - Variance-based sampling: a sensor only reports a new value if it is above or below a threshold percentage of its last reported value
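A minimal sketch of these three strategies; the function names, parameters, and threshold value are illustrative assumptions, not the paper's implementation:

```python
import random

def uniform_sample(p):
    """Uniform sampling: every node reports with the same probability p."""
    return random.random() < p

def subtree_size_sample(subtree_size, rate_per_node):
    """Subtree-size based sampling: the reporting probability is
    proportional to the number of nodes in this node's subtree."""
    return random.random() < min(1.0, rate_per_node * subtree_size)

def variance_based_report(current, last_reported, threshold=0.05):
    """Variance-based sampling: report only when the new reading deviates
    from the last reported value by more than a threshold fraction."""
    return abs(current - last_reported) > threshold * abs(last_reported)
```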
10. Performance of Sampling Algorithms
- As fewer nodes participate, overall accuracy decreases for all algorithms.
- Uniform sampling performs worst.
- Variance-based sampling is the most accurate.
11. Observation: Long-Term Patterns in Data
Daily patterns in a week-long data stream
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, averaged over 5-minute intervals.
12. Two-Level Representation of Data
Monday Data
The data stream can be decomposed into two layers: the long-term trend (pattern), which changes slowly, and the residual, which has high frequency but low amplitude (a sketch of the decomposition follows).
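As a rough sketch of this decomposition, assuming a smoothing B-spline as the trend estimator and synthetic data in place of the real traces:

```python
import numpy as np
from scipy.interpolate import splev, splrep

# Synthetic stand-in for one day of 5-minute measurements: a slow daily
# pattern plus a low-amplitude, high-frequency residual.
t = np.arange(288.0)                                  # 288 five-minute epochs
stream = 50 + 30 * np.sin(2 * np.pi * t / 288) + np.random.normal(0, 2, 288)

# Long-term trend: a smoothed, low-degree (cubic) B-spline fit.
# The smoothing factor s is a guess; the paper fits the spline by least squares.
tck = splrep(t, stream, k=3, s=len(t) * 4.0)
trend = splev(t, tck)

# Residual layer: whatever the slow trend does not capture.
residual = stream - trend
```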
13. Recovering From Loss
- Traditional approaches (sketched after this list)
  - Last-seen data as the approximation for the current epoch
  - Linear prediction
- Two-level data representation and prediction
  - Long-term trend: B-spline estimation
  - High-frequency residual: ARMA modeling
  - ARMA stands for the AutoRegressive Moving Average model, a standard time-series technique for modeling chaotic data streams
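A minimal sketch of the two traditional baselines; the history list and function names are illustrative assumptions:

```python
def last_value_predict(history):
    """Last-seen value: reuse the most recent reading for the missing epoch."""
    return history[-1]

def linear_predict(history):
    """Linear prediction: extrapolate the last two readings one epoch ahead."""
    return history[-1] + (history[-1] - history[-2])

history = [10.0, 12.0, 13.5]          # readings from recent epochs
print(last_value_predict(history))    # 13.5
print(linear_predict(history))        # 15.0
```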
14. Two-Level Data Prediction
- B-spline modeling for the long-term trend
  - Piecewise-continuous, low-degree B-splines can represent complex shapes
  - Least-squares B-spline regression for the two-level decomposition
  - B-spline extension for future forecasting
- ARMA forecasting for transient oscillation (a sketch follows this list)
  - System identification to determine the order of the model
  - Parameter estimation by an optimization algorithm
  - Low-complexity recursive equation for future forecasting
  - Statistical properties used to calibrate the prediction results
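A sketch of the residual-forecasting step, using statsmodels' ARIMA with d = 0 as a stand-in for the paper's own parameter estimation and recursive forecasting; the model order (2, 2), the synthetic residual series, and the 12-epoch horizon are assumptions:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for the high-frequency residual left after removing
# the B-spline trend (roughly an AR(2) process with small noise).
rng = np.random.default_rng(0)
residual = np.zeros(200)
noise = rng.normal(0, 0.5, 200)
for i in range(2, 200):
    residual[i] = 0.6 * residual[i - 1] - 0.2 * residual[i - 2] + noise[i]

# ARMA(p, q) is ARIMA(p, 0, q); in the paper the order would come from
# system identification rather than being fixed by hand.
model = ARIMA(residual, order=(2, 0, 2)).fit()
residual_forecast = model.forecast(steps=12)   # residual for the next 12 epochs

# The full two-level prediction adds this to the extended B-spline trend:
# prediction[t] = trend_forecast[t] + residual_forecast[t]
```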
15. Performance of Prediction Algorithms
Performance of prediction algorithms for the MAX operation in a lossless environment.
16. Performance of Prediction Algorithms
Performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%, and the ratio of loss rates between wide-area links and local links is 3:1.
17. Summary of Results
- All prediction algorithms are effective in improving the accuracy of aggregation results
- The two-level prediction approach performs the best in all situations
  - Achieves more than 90% accuracy even with per-node non-participation rates of up to 60%
  - Is effective even in a high-loss environment
18. Conclusion and Future Work
- Apply statistical algorithms to the data aggregation system
  - Quantify the statistical properties of real-world measurement data
  - Propose the concept of probabilistic participation of nodes
  - Propose a multi-level prediction mechanism to recover from sampling and data loss
- Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
- Future work
  - Develop online algorithms and exploit the tradeoff between prediction accuracy and computation/storage cost
  - Build a real system for applications such as health monitoring, traffic measurement, and router statistics aggregation
  - Real system implementation and deployment
19. The Danger of Prediction