1. A Grid-Based Middleware for Scalable Processing of Remote Data
- Leonid Glimcher (glimcher@cse.ohio-state.edu)
- Advisor: Gagan Agrawal
2. Abundance of Data
- Data generated everywhere
- Sensors (intrusion detection, satellites)
- Scientific simulations (fluid dynamics, molecular dynamics)
- Business transactions (purchases, market trends)
- Analysis needed to translate data into knowledge
- Growing data sizes create problems
- Online repositories are becoming more prominent for data storage
3. Emergence of Grid Computing
- Provides access to
- Resources
- Data
- otherwise inaccessible
- Aimed at providing a stable foundation for distributed services, but comes at a price
- Standards are needed to integrate distributed services
- OGSA, WSRF
- Cloud / utility computing models
- Resources: more accessible
- Services: more usable
4. Remote Data Analysis
- The Grid is a good fit for remote data analysis
- Details can be very tedious:
- Data retrieval, movement, and caching
- Parallel data processing
- Resource allocation
- Application configuration
- Middleware can be useful to abstract away these details
5. Our Approach
- Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid)
(Figure: middleware user interacting with a repository cluster and a compute cluster)
6. FREERIDE-G Middleware Contributions
- ADR-based version for mining remote data
- SRB-based version integrated with standards for remote data access
- Performance prediction framework for resource and replica selection
- FREERIDE-G grid service integrated with OGSA computing standards
- Load balancing and data integration module
- 3 parallel middleware-based applications (developed in the context of FREERIDE):
- EM clustering
- vortex detection
- defect detection and categorization
7. SDSC Storage Resource Broker
- FREERIDE-G targets SRB-resident data
- A standard for remote data
- Storage
- Access
- Data can be distributed across organizations and heterogeneous storage systems
- Access is provided through a client API
8. FREERIDE(-G) Processing Structure
- KEY observation: most data mining algorithms follow a canonical loop
- Middleware API:
- Subset of data to be processed
- Reduction object
- Local and global reduction operations
- Iterator
- Derived from precursor system FREERIDE

While (not done)
  forall (data instances d)
    (I, d') = process(d)
    R(I) = R(I) op d'
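The canonical loop can be sketched as a generic reduction engine. This is an illustrative sketch, not the actual FREERIDE-G API: the `process` callback, the associative `op`, and the dict-based reduction object are assumptions.

```python
# Sketch of the canonical reduction structure, assuming a user-supplied
# process() that maps a data instance to (key, value) pairs and an
# associative op() shared by the local and global reduction phases.

def local_reduction(chunk, process, op):
    """Run the inner forall loop over one chunk, building a reduction object."""
    robj = {}
    for d in chunk:
        for key, val in process(d):  # (I, d') = process(d)
            robj[key] = op(robj[key], val) if key in robj else val  # R(I) = R(I) op d'
    return robj

def global_reduction(robjs, op):
    """Merge per-node reduction objects into one global reduction object."""
    merged = {}
    for robj in robjs:
        for key, val in robj.items():
            merged[key] = op(merged[key], val) if key in merged else val
    return merged
```

For example, counting data instances per cluster id amounts to `process(d) = [(cluster(d), 1)]` with `op` as addition; the local phase runs on each compute node and the global phase combines the per-node reduction objects.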
9. Classic Data Mining Applications
- Tested here:
- Expectation-Maximization (EM) clustering
- k-means clustering
- k-nearest neighbor search
- Previously on FREERIDE:
- Apriori and FPGrowth frequent itemset miners
- RainForest decision tree construction
10. Scientific Data Processing Applications
- VortexPro
- Finds vortices in volumetric fluid/gas flow datasets
- Defect detection
- Finds defects in molecular lattice datasets
- Categorizes defects into classes
- Middleware is useful for a wide variety of algorithms
11. FREERIDE-G Evolution
- FREERIDE
- data stored locally
- FREERIDE-G
- ADR responsible for remote data retrieval
- SRB responsible for remote data retrieval
- FREERIDE-G grid service
- Grid service featuring
- Load balancing
- Data integration
12. Evolution
(Figure: evolution from FREERIDE to FREERIDE-G-ADR, FREERIDE-G-SRB, and FREERIDE-G-GT, layered over Application, Data, ADR, SRB, and Globus components)
13. Publications
- Published:
- Middleware for Data Mining Applications on Clusters and Grids, invited to special issue of Journal of Parallel and Distributed Computing (JPDC) on "Parallel Techniques for Information Extraction"
- Parallelizing EM Clustering Algorithm on a Cluster of SMPs, Euro-Par 2004
- Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware, IPDPS 2004
- Parallelizing a Defect Detection and Categorization Application, IPDPS 2005
- FREERIDE-G: Supporting Applications that Mine Remote Data Repositories, ICPP 2006
- A Performance Prediction Framework for Grid-Based Data Mining Applications, IPDPS 2007
- A Middleware for Developing and Deploying Scalable Remote Mining Services, to appear in CCGrid 2008
- FREERIDE-G: Enabling Distributed Processing of Large Datasets, to appear in DADC 2008 workshop in conjunction with HPDC 2008
- In submission:
- Supporting Load Balancing and Data Integration For Distributed Data-Intensive Applications, submitted to SuperComputing 2008
- Impact of Bandwidth and I/O Concurrency on Performance of Remote Data Processing Applications, submitted to Grid 2008
- A Middleware for Remote Mining of Data From SRB-based Servers
14. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work (since candidacy)
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
15. Previous FREERIDE-G Limitations
- Data was required to be resident on a single cluster
- Storage data format had to be the same as the processing data format
- Processing had to be confined to a single cluster
- The load balancing and data integration module helps FREERIDE-G overcome these limitations efficiently
16. Motivating Scenario
(Figure: data distributed across repositories A, B, C, and D)
17. Run-time Load Balancing
- Based on the unbalanced (initial) distribution, figure out a partitioning scheme (across repositories)
- Model for load balancing based on two components (across processing nodes):
- partitioning of data chunks
- data transfer time
- weights are used to combine the two components
- Organize remote retrievals
- Use multiple concurrent transfers where required to alleviate the data delivery bottleneck
18. Load Balancing Algorithm

foreach (chunk c)
  if (no processing node assigned to c)
    determine transfer costs across all attributes
    // compare processing costs across all nodes
    foreach (processing node p)
      assume chunk c is assigned to p
      update processing cost and transfer cost
      calculate averages over all other processing nodes (except p)
        for processing cost and transfer cost
      compute the difference in both costs between p and the averages
      add the weighted costs together to compute the total cost
      if (cost < minimum cost)
        update minimum
    assign c to the processing node with minimum cost
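The algorithm above can be sketched as a greedy assignment loop. A hedged sketch only: the `proc_cost` and `xfer_cost` callbacks, the default weights, and the running-total bookkeeping are illustrative assumptions, not the middleware's actual cost model.

```python
def assign_chunks(chunks, nodes, proc_cost, xfer_cost, w_proc=0.5, w_xfer=0.5):
    """Greedily assign each chunk to the node minimizing the weighted
    difference between that node's costs and the other nodes' averages.
    Assumes len(nodes) >= 2."""
    load = {p: 0.0 for p in nodes}   # accumulated processing cost per node
    moved = {p: 0.0 for p in nodes}  # accumulated transfer cost per node
    assignment = {}
    for c in chunks:
        best, best_cost = None, float("inf")
        for p in nodes:
            # assume chunk c is assigned to p and update both costs
            trial_load = load[p] + proc_cost(c, p)
            trial_xfer = moved[p] + xfer_cost(c, p)
            # averages over all other processing nodes (except p)
            others = [q for q in nodes if q != p]
            avg_load = sum(load[q] for q in others) / len(others)
            avg_xfer = sum(moved[q] for q in others) / len(others)
            # weighted total of both cost differences
            cost = w_proc * (trial_load - avg_load) + w_xfer * (trial_xfer - avg_xfer)
            if cost < best_cost:
                best, best_cost = p, cost
        assignment[c] = best
        load[best] += proc_cost(c, best)
        moved[best] += xfer_cost(c, best)
    return assignment
```

With uniform costs the loop alternates between nodes, yielding an even split; the 25-75 combinations evaluated later correspond to one such skewed weight setting.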
19. Approach to Integration
- Available bandwidth is also used to determine the level of concurrency
- After all vertical views of chunks are re-distributed to destination repository nodes, use a wrapper to convert the data
- Automatically generated wrapper:
- Input: physical data layout (storage layout)
- Output: logical data layout (application view)
(Figure: horizontal view, records (X1, Y1, Z1) ... (Xn, Yn, Zn), vs. vertical view, per-attribute columns)
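The layout conversion the generated wrapper performs can be illustrated with a toy transposition; the dict-of-lists physical layout and the X/Y/Z attribute names are assumptions for illustration only.

```python
def vertical_to_horizontal(columns):
    """Convert a vertical (per-attribute) physical layout into the
    horizontal (per-record) logical view an application iterates over."""
    names = list(columns)          # attribute order, e.g. X, Y, Z
    n = len(columns[names[0]])     # number of records
    return [tuple(columns[name][i] for name in names) for i in range(n)]
```

For instance, `vertical_to_horizontal({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]})` returns `[(1, 3, 5), (2, 4, 6)]`.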
20. Experimental Setup
- Settings:
- Organizational Grid
- Wide Area Network
- Goals are to evaluate:
- Scalability
- Load balancing accuracy
- Adaptability to scenarios:
- compute bound
- I/O bound
- WAN setting
21. Setup 1: Organizational Grid
- Data hosted on Opteron 250s (repository cluster, bmi-ri)
- Processed on Opteron 254s (compute cluster, cse-ri)
- 2 clusters connected through 2 x 10 GB optical fibers
- Both clusters within the same city (0.5 miles apart)
- Evaluating:
- Scalability
- Adaptability
- Integration overhead
22. Setup 2: WAN
- Data repositories:
- Opteron 250s (OSU)
- Opteron 258s (Kent St)
- Processed on Opteron 254s (OSU compute cluster)
- No dedicated link between processing and repository clusters
- Evaluating:
- Scalability
- Adaptability
23. Scalability in Organizational Grid
- Vortex Detection (14.8 GB)
- Linear speedup for even data-node to compute-node scale-up
- Near-linear speedup for uneven compute-node scale-up
- Our approach outperforms the initial (unbalanced) configuration
- Comes close to the (statically) balanced configuration
24. Evaluating Balancing Accuracy
(Figures: balancing accuracy for EM (25.6 GB) and k-means (25.6 GB))
25. Model Adaptability: Compute-Bound Scenario
- 50 MB/s and 200 MB/s bandwidth
- k-means clustering (25.6 GB)
- Best results with the 25-75 weight combination (skewed towards the workload size component)
- Initial (unbalanced) overhead: 57% over balanced
- Dynamic overhead: 5% over balanced
26. Model Adaptability: I/O-Bound Scenario
- 15 MB/s and 60 MB/s bandwidth
- EM clustering (25.6 GB)
- Best results with the 25-75 weight combination (skewed towards the data transfer component)
- Initial (unbalanced) overhead: 38% over balanced
- Dynamic overhead: 7% over balanced
27. Wrapper Overhead Evaluation
- k-means clustering (25.6 GB)
- Wrapper overhead:
- k-means clustering: 7%
- vortex detection: 3%
- Data integration overhead has a negligible effect on grid service scalability
28. WAN Setting Adaptability
- Vortex Detection (14.6 GB)
- The 25-75 weight combination results in the lowest overhead (favoring the data delivery component)
- Unbalanced configuration: 20% overhead over balanced
- Our approach: overhead reduced to 8%
29. Summary
- Added middleware support for
- load balancing
- data integration
- Results
- wrapping overhead quite low
- load balancing accuracy good
- both factors captured in model are important
- model is adaptable using weight combinations
30. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
31. FREERIDE-G Grid Service
- Emergence of the Grid leads to computing standards and infrastructures
- Data host component already integrated through SRB for remote data access
- OGSA: previous standard for Grid Services
- WSRF: standard that merges Web and Grid Service definitions
- Globus Toolkit is the most fitting infrastructure for the Grid Service conversion of FREERIDE-G
32. FREERIDE-G System Architecture
33. SRB Data Host
- SRB Master
- Connection establishment
- Authentication
- Forks an agent to service I/O
- SRB Agent
- Performs remote I/O
- Services multiple client API requests
- MCAT
- Catalogs data associated with datasets, users, resources
- Services metadata queries (through SRB agent)
34. Compute Node
- More compute nodes than data hosts
- Each node:
- Registers I/O (from index)
- Connects to data host
- While (chunks to process):
- Dispatch I/O request(s)
- Poll pending I/O
- Process retrieved chunks
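The dispatch/poll/process loop above overlaps retrieval with computation. A minimal sketch using a background thread, assuming hypothetical `fetch()` and `process()` callbacks in place of the real SRB client API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_compute_node(chunk_ids, fetch, process):
    """Overlap chunk retrieval (I/O) with processing: while one chunk is
    being processed, the request for the next one is already pending."""
    if not chunk_ids:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunk_ids[0])   # dispatch first I/O request
        for nxt in chunk_ids[1:]:
            data = pending.result()                  # poll pending I/O
            pending = pool.submit(fetch, nxt)        # dispatch next request
            results.append(process(data))            # process retrieved chunk
        results.append(process(pending.result()))    # drain the last request
    return results
```

Because the next fetch is submitted before the current chunk is processed, I/O latency is hidden behind computation whenever processing a chunk takes at least as long as retrieving one.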
35. FREERIDE-G in Action
(Figure: compute nodes interacting with the data host)
- I/O registration and connection establishment (compute node with SRB Master and MCAT)
- While (more chunks to process):
- I/O request dispatched to an SRB Agent
- Pending I/O polled
- Retrieved data chunks analyzed
36. Implementation Challenges
- Interaction with code repository:
- SWIG (Simplified Wrapper and Interface Generator)
- XML descriptors of API functions
- Each API function wrapped in its own class
- Integration with MPICH-G2:
- Supports MPI
- Deployed through Globus components (GRAM)
- Hides potential heterogeneity in service startup and management
37. Experimental Setup
- Organizational Grid:
- Data hosted on Opteron 250 cluster
- Processed on Opteron 254 cluster
- Connected using 2 x 10 GB optical fibers
- Goals:
- Demonstrate parallel scalability of applications
- Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
38. Deployment Overhead Evaluation
- Clearly a small overhead is associated with using Globus and MPICH-G2 for middleware deployment
- k-means clustering with 6.4 GB dataset: 18-20%
- Vortex detection with 14.8 GB dataset: 17-20%
39. Summary of Results
- FREERIDE-G: middleware for scalable processing of remote data, now as a grid service
- Compliance with grid and repository standards
- Scalable performance with respect to data and compute nodes
- Low deployment overheads compared to the non-grid version
- Further ease of development
- High portability
40. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
41. Grid Computing
- Grid standards:
- Open Grid Service Architecture (OGSA)
- Web Services Resource Framework (WSRF)
- Implementations:
- Globus Toolkit, WSRF.NET, WSRF::Lite
- Other Grid systems:
- Condor and Legion
- Workflow composition systems:
- GridLab, Nimrod/G (GridBus), Cactus, GRIST
42. Data Grids
- Metadata cataloging
- Artemis project
- Remote data retrieval
- SRB, SEMPLAR
- DataCutter, STORM
- Stream processing middleware
- GATES
- dQUOB
- Data integration
- Automatic wrapper generation
43. Remote Data in Grid
- KnowledgeGrid tool-set
- Developing data mining workflows (composition)
- GridMiner toolkit
- Composition of distributed, parallel workflows
- DiscoveryNet layer
- Creation, deployment, management of services
- DataMiningGrid framework
- Distributed knowledge discovery
- Other projects partially overlap in goals with ours
44. Resource Allocation Approaches
- 3 broad categories for resource allocation:
- Heuristic approach to mapping
- Prediction through modeling:
- Statistical estimation/prediction
- Analytical modeling of parallel applications
- Simulation-based performance prediction
45. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
46. Conclusions
- FREERIDE-G: middleware for scalable processing of remote data
- Integrated with grid computing standards
- Supports load balancing and data integration
- Support for high-end processing:
- Eases use of parallel configurations
- Hides details of data movement and caching
- Fine-tuned performance models
- Compliance with remote repository standards
47. Middleware Component Interaction
(Figure: interaction between the load balancing and data integration components)
48. Scalability Evaluation
- Scalable implementations with respect to numbers of:
- Compute nodes
- Data repository nodes
- Vortex detection with 14.8 GB dataset (top)
- k-means clustering with 6.4 GB dataset (bottom)