A Grid-Based Middleware for Processing Distributed Data Streams

About This Presentation

Title:

A Grid-Based Middleware for Processing Distributed Data Streams

Description:

LSS stores unprocessed itemsets Dynamic Migration Memory usage of LSS Dynamic Migration Memory usage of LSS Dynamic Migration LSS migration does not impact ... – PowerPoint PPT presentation

Number of Views:141

Avg rating:3.0/5.0

Slides: 73

Provided by: weil69

Category:

more less

Transcript and Presenter's Notes

Title: A Grid-Based Middleware for Processing Distributed Data Streams

1
A Grid-Based Middleware for Processing
Distributed Data Streams

Liang Chen
Advisor Gagan Agrawal
Computer Science Engineering

2
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation function
Self-Adaptation Algorithm
Explain the algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

3
Introduction-Motivation

What is data steam
Data stream data arrive continuously
Enormous volume and must be processed online
Need to be processed in real-time
Data sources could be distributed
Data Stream Applications
Online network intrusion detection
Sensor networks
Network Fault Management system for
telecommunication network elements

4
Introduction-Motivation
Network Fault Management System (NFM) analyzing
distributed alarm streams

Switch Network
NFM (Network Fault Management) System
5
Introduction-Motivation

Challenges
Data and/or computation intensive
System can be easily overloaded

6
Introduction-Motivation

Possible solutions
Grid computing technologies
Automatically adjust processing rate

7
Introduction-Motivation

The needs for processing distributed data streams
A middleware running in Grid
Allocate Grid resources
Provide self-adaptation function

8
Introduction-Our Approach

We implemented a middleware to meet the needs
Five contributions of our work
1. Utilizing existing grid standards
Liang Chen, K. Reddy and G. Agrawal
GATES A Grid-Based Middleware for Processing
Distributed Data Streams.HPDC, 2004.
2. Providing self-Adaptation functionality
Liang Chen and G. Agrawal Supporting
Self-Adaptation in Streaming Data Mining
Applications. IPDPS, 2006.
3. Supporting automatic resource allocation
Liang Chen and G. Agrawal A Static
Resource Allocation Framework for Grid-Based
Streaming Applications. Concurrency Computation
Practice and Experience Journal, Volume 18, Issue
6 , Pages 653 - 666.
4. Supporting efficient dynamic migration
Liang Chen, Q. Zhu and G. Agrawal A
Supporting Dynamic Migration in Tightly Coupled
Grid Applications. SC 2006.
5. Studying adaptive rendering application

9
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation algorithms
Self-Adaptation Algorithm
Introduce the algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

10
System Architecture and Design(Architecture)

Use Globus Toolkit 3.0, built on OGSA
Allows users to specify their algorithms
implemented in Java
Take care of plugging user-defined algorithms
into the system and running them in Grid.
Applications need be broken down into a number of
pipelined stages

11
System Architecture and Design(Architecture)
Application
Stage A
Stage B
Stage C
12
System Architecture and Design(GATES API
Functions)

Public class Second-Stage implements
StreamProcessing
void work(buffer in, buffer out)
while(true)
DATA GATES.getFromInputBuffer(in)
Inter-Results Processing(Data)
GATES.putToOutputBuffer (out,
Inter-Results)

13
Adaptation Parameter

Definition
A parameter in an application
Changing the parameters value can change
processing rate of the application, also impact
accuracy of the processing
Two kinds of adaptation parameters
Performance parameter
Accuracy parameter
Example
Sampling rate is an accuracy parameter

14
Pseudo Codes Again with Self-adaptation API
Functions
Public class Second-Stage implements
StreamProcessing //Initialize
sampling-rate Sampling-rate (Max Min)/2
void work(buffer in, buffer out)
GATES.specifyAccuracyPara(Sampling-rate, Max,
Min) while(true) DATA
GATES.getFromInputBuffer(in)
Inter-Results Processing(Data,
Sampling-rate) GATES.putToOutputBuffer
(out, Inter-Results) Sampling-rate
GATES.getSuggestedValue()
15
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation function
Self-Adaptation Algorithm
Explain the algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

16
Self-Adaptation Algorithm

View the system as a pipeline
To ensure real-time processing, a balanced
pipeline is needed
When average queue length is too small or too
large, queue is under or over loaded. Pipeline is
not balanced.

Measure the average lengths of the queues in the
pipeline

When GATES.getSuggestedValue() is invoked, use
the heuristic way to determine a new value for
the adaptation parameter according to the
measured lengths

17
Self-adaptation Algorithm

The way we measure average queue length
the heuristic way to adjust an adaptation
parameter
Should the adaptation parameter be modified, and
if so, in which direction?
How to find a new value (update the value) of the
adaptation parameter

18
Self-adaptation Algorithm

Should the adaptation parameter be modified, and
if so, in which direction?
The answer is related to the pipelines load
state.

19
Self-adaptation Algorithm
Convergent States
Non-Convergent States
20
Self-adaptation Algorithm
Summary of Load States
21
Self-adaptation Algorithm

How to determine a new value for the adaptation
parameter
Linear update increase or decrease by a
fixed value
Hard to find a proper fixed value
Binary search

22
Self-adaptation Algorithm
Left Border
Current Value
Right Border
New Value
23
(No Transcript)
24
Self-adaptation Algorithm

Two Data mining applications
Clustream Clustering data-points in streams

25
Data Mining Applications System Evaluation

Dist-Freq-Counting finding frequent itemsets
from distributed streams

26
Data Mining Applications System Evaluation
27
Data Mining Applications System Evaluation
28
Data Mining Applications System Evaluation
29
Data Mining Applications System Evaluation
30
Data Mining Applications System Evaluation
31
Data Mining Applications System Evaluation
Data Mining Applications System Evaluation
32
Data Mining Applications System Evaluation
Data Mining Applications System Evaluation
33
Data Mining Applications System Evaluation
Data Mining Applications System Evaluation
34
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation algorithms
Self-Adaptation Algorithm
Explain the algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

35
Resource Allocation Schemes

Problem Definition
Grid resource allocation for pipelined
applications that process distributed streaming
data in real-time is challenging
The scheme consists of two parts
Static Part allocate resources before an
application runs
Dynamic Part re-allocate resources in run-time
A framework to monitor resources and support
dynamic resource allocation

36
Static Allocation Scheme

Static allocation problem determining a
deployment configuration
Objective Automatically generate a deployment
configuration according to the information of
available resources

The number of data sources and their location

The destination

The number of stages consisting of a pipeline
The number of instances of each stage
How the instances connect to each other
The node where each instance is placed

37
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation algorithms
Improved Self-Adaptation
self-adaptation algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

38
Dynamic Migration-Motivation

Grid resources vary frequently
Dynamically allocating new resources and
migrating applications to the new resources
improve performance
Checkpointing is a classic method to support
dynamic migration
A snapshot of systems running state
Transmit to a remote site
Restore execution context and restart processes
Disadvantages of checkpointing
Platform dependent
Inefficient
Involve lots of implementation efforts
Our approach is base on Light-weight Summary
Structure (LSS)

39
Dynamic Migration-LSS

Processing Structure
...
while(true)
read_data_from_streams()
process_data()
accumulate_intermediate_results()
reset_auxiliary_structures()
...
Data structure storing summary information is
Light-weight summary structure
Others are Auxiliary structures

40
Dynamic Migration-LSS

Two observations with respect to LSS
The size of LSS is much smaller than that of the
total memory
Auxiliary structures are usually reset at the end
of each loop. Unnecessary to migrate auxiliary
structures when migration occurs at the end of a
loop
LSS can be used to support dynamic migration
GAETS provides an API function to allocate a
block of memory to be LSS
An application stores summary information to LSS
transmit only LSS at the end of the loop to a new
node and restore the LSS at the new node

41
Dynamic Migrationsupported by GATES
42
Dynamic Migration

Advantages of using LSS
Efficient, only LSS is migrated
Not impact the accuracy of processing
Support migration across heterogeneous platforms
Reduce application developers efforts on making
application capable of migration

43
Dynamic Migration
44
Dynamic Migration

Evaluation
Three applications
Counting sample
LSS stores intermediate top M frequently
occurring numbers
Clustream, clustering data points in streams
LSS stores micro-clusters computed at the second
stage
Dist-Freq-Counting, finding frequent itemsets in
distributed streams.
LSS stores unprocessed itemsets

45
Dynamic Migration

Memory usage of LSS

46
Dynamic Migration

Migration using LSS is efficient

47
Dynamic Migration

Migration using LSS is efficient

48
Dynamic Migration

Benefits of migration in a dyamic environment

49
Dynamic Migration

Memory usage of LSS

50
Dynamic Migration

Migration using LSS is efficient

51
Dynamic Migration

Migration using LSS is efficient

52
Dynamic Migration

Benefits of migration in a dynamic environment

53
Dynamic Migration

LSS migration does not impact processing accuracy
The counting sample application was used
Compared the average accuracy of the processing
results from the non-migration and the migration
versions, they are 97.28 and 97.51 accurate

54
Roadmap

Introduction
Motivation
Our approach and challenges
System Overview and Initial Evaluation
Introduce system architecture and design
Discuss the self-adaptation algorithms
Self-Adaptation Algorithm
Explain the algorithm
Evaluate the system by using two data mining
applications
Resource Allocation Schemes
Dynamic Migration
Motivation
Light-weight summary structure (LSS)
How applications utilize the dynamic migration
Evaluation
Adaptive Volume Rendering
Related work
Conclusion and Future work

55
Adaptive Volume Rendering

Motivation Grid computing is needed
Visualization involves large volumes of dataset
We focus on streaming volume data
Interactively visualizing volume data in
real-time is needed
Computationally intensive
Resources consumed
Real-time processing can not be guaranteed
The places where data are generated are
distributed
Typical client-server architecture is not
scalable
Network bandwidths of wide-area networks are low
Computing capability of normal desktop is not
enough
Grid techniques would be a good solution
Divide the procedure into stages organized in a
pipeline
Allocate nodes close to data source to
pre-process volume data
The size of intermediate results is much smaller

56
Adaptive Volume Rendering

Motivation GATES is desirable
Automatic adaptation is desirable
Volume rendering algorithms running on a grid
need to be highly adaptive
Adaptation usually achieved by manually adjusting
adaptation parameters
Such manual parameter adaptation is very
challenging in a grid environment
Automatic resource allocation is desirable
Grid environment is highly changeable
The GATES middleware could fulfill the needs
Grid-based
Provide the self-adaptation function to
applications
Automatically allocate Grid resources

57
Adaptive Volume Rendering

Overall design
Two pipelined steps the first step
Build octrees from volume data
Octree is a tree data structure, in which each
internal node has up to 8 children
Here, we use an octree to represent
multiresolution information for a volume
Procedure to build an octree for a volume is as
follows
Divide volume space into 8 subvolumes and create
8 children nodes
For each subvolume, calculate standard deviation
of all voxels in the subvolume, and store the
deviation to the corresponding child node
If the deviation is larger than a pre-defined
value, divide the subvolume, repeat the above
procedure. Otherwise, stop

58
Adaptive Volume Rendering

Overall design
Two pipelined steps the second step
Use an octree and its corresponding volume to
render images
Provided an error tolerance (or user-defined
resolution), use DFS to traverse the octree and
stop at the nodes where the deviation is less
than the resolution or error tolerance.
Project the corresponding 3D-subvolumes to an
image

59
Adaptive Volume Rendering
60
Adaptive Volume Rendering

Make the rendering self-adaptive
Two adaptation parameters used in the third stage
Error Tolerance performance parameter
Image Size accuracy parameter
Only one adaptation parameter can be adjusted by
GATES. So we fix one and adjust the other

61
(No Transcript)
62
Adaptive Volume Rendering

Experiment 1

63
Adaptive Volume Rendering

100kbps

150kbps
200kbps
250kbps
64
Adaptive Volume Rendering

Experiment 2

65
Adaptive Volume Rendering

Experiment 3 compare the performance of two
implementations
Java-imple
C-imple

66
Adaptive Volume Rendering

Experiment 3 compare the performance of two
implementations

67
Related Work

Middleware for data stream processing
Data cutter, Stampede
Differences in a cluster, no self-adaptation, no
specifically for real-time processing
Continuous query systems
STREAM, dQUOB, TelegraphCQ, NiagraCQ
Differences centralized, no adaptation supports
Distributed continuous query systems
Aurora, Medusa, Borealis
Differences continuous queries, not in Grid
environment
In-Network aggregation in sensor network
Stream-based overlay networks

68
Related work

Grid Resource Allocation
Condor, Realtor, ACDS
Main Differences our work focus on Grid resource
allocation for workflow applications
Adaptation Through a Middleware
Cheng et al.s adaptation framework, SWiFT,
Conductor, DART, ROAM
Main Differences our work focus on general
supports for adaptation in run-time
Dynamic Migration in Grid Environment
Condor, XCATS, Charm
Main Differences our work use LSS

69
Conclusion

Grid computing could be an effective solution for
distributed data stream processing
GATES
Distributed processing
Exploit grid web services
Self-adaptation to meet the real-time constraints
Grid resource allocation schemes and dynamic
migration

70
Future Work

CPU cycles and Network bandwidths
Currently, only network bandwidth is considered a
constraint when scheduling Grid resources
Few related work proposes a metric to integrate
both for pipelined appliations
Port GATES from GT3 to GT4
Support fault-tolerance and high availability
Further relieve programming burdens from
application develops
Specify meta-data
Support distributed continuous queries
Specify a set of query operators

71
Acknowledgements

My advisor, Prof. Agrawal, proposed the idea of
implementing the middleware, and gave lots
advices for the directions of my research
Prof. Shen gave lots of helps on implementing the
render application, and provided lots of write-up
for the chapter 7