1. A Grid-Based Middleware for Scalable Processing of Remote Data
- Leonid Glimcher (glimcher@cse.ohio-state.edu)
- Advisor: Gagan Agrawal
2. Abundance of Data
- Data generated everywhere
- Sensors (intrusion detection, satellites)
- Scientific simulations (fluid dynamics, molecular dynamics)
- Business transactions (purchases, market trends)
- Analysis needed to translate data into knowledge
- Growing data sizes create problems
- Online repositories are becoming more prominent for data storage
3. Emergence of Grid Computing
- Provides access to
- Resources
- Data
- otherwise inaccessible
- Aimed at providing a stable foundation for distributed services, but comes at a price
- Standards are needed to integrate distributed services
- OGSA, WSRF
- Cloud / utility computing models
- Resources: more accessible
- Services: more usable
4. Remote Data Analysis
- The Grid is a good fit for remote data analysis
- Details can be very tedious:
- Data retrieval, movement, and caching
- Parallel data processing
- Resource allocation
- Application configuration
- Middleware can be useful to abstract away these details
5. Our Approach
- Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid)
(Figure: middleware user interacting with a repository cluster and a compute cluster)
6. FREERIDE-G Middleware Contributions
- ADR-based version for mining remote data
- SRB-based version integrated with standards for remote data access
- Performance prediction framework for resource and replica selection
- FREERIDE-G grid service integrated with OGSA computing standards
- Load balancing and data integration module
- 3 parallel middleware-based applications (developed in the context of FREERIDE):
- EM clustering
- vortex detection
- defect detection and categorization
7. SDSC Storage Resource Broker
- FREERIDE-G targets SRB-resident data
- A standard for remote data
- Storage
- Access
- Data can be distributed across organizations and heterogeneous storage systems
- Access is provided through a client API
8. FREERIDE(-G) Processing Structure
- KEY observation: most data mining algorithms follow a canonical loop
- Middleware API:
- Subset of data to be processed
- Reduction object
- Local and global reduction operations
- Iterator
- Derived from precursor system FREERIDE

While (not done)
  forall (data instances d)
    (I, d') = process(d)
    R(I) = R(I) op d'
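The canonical loop can be sketched as a generic reduction engine. This is an illustrative sketch, not the actual FREERIDE-G API: the `process` callback, the associative `op`, and the dict-based reduction object are assumptions.

```python
# Sketch of the canonical reduction structure, assuming a user-supplied
# process() that maps a data instance to (key, value) pairs and an
# associative op() shared by the local and global reduction phases.

def local_reduction(chunk, process, op):
    """Run the inner forall loop over one chunk, building a reduction object."""
    robj = {}
    for d in chunk:
        for key, val in process(d):  # (I, d') = process(d)
            robj[key] = op(robj[key], val) if key in robj else val  # R(I) = R(I) op d'
    return robj

def global_reduction(robjs, op):
    """Merge per-node reduction objects into one global reduction object."""
    merged = {}
    for robj in robjs:
        for key, val in robj.items():
            merged[key] = op(merged[key], val) if key in merged else val
    return merged
```

For example, counting data instances per cluster id amounts to `process(d) = [(cluster(d), 1)]` with `op` as addition; the local phase runs on each compute node and the global phase combines the per-node reduction objects.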
9. Classic Data Mining Applications
- Tested here:
- Expectation-Maximization (EM) clustering
- k-means clustering
- k-nearest neighbor search
- Previously on FREERIDE:
- Apriori and FPGrowth frequent itemset miners
- RainForest decision tree construction
10. Scientific Data Processing Applications
- VortexPro
- Finds vortices in volumetric fluid/gas flow datasets
- Defect detection
- Finds defects in molecular lattice datasets
- Categorizes defects into classes
- Middleware is useful for a wide variety of algorithms
11. FREERIDE-G Evolution
- FREERIDE
- data stored locally
- FREERIDE-G
- ADR responsible for remote data retrieval
- SRB responsible for remote data retrieval
- FREERIDE-G grid service
- Grid service featuring
- Load balancing
- Data integration
12. Evolution
(Figure: evolution from FREERIDE to FREERIDE-G-ADR, FREERIDE-G-SRB, and FREERIDE-G-GT, layered over Application, Data, ADR, SRB, and Globus components)
13. Publications
- Published:
- Middleware for Data Mining Applications on Clusters and Grids, invited to special issue of Journal of Parallel and Distributed Computing (JPDC) on "Parallel Techniques for Information Extraction"
- Parallelizing EM Clustering Algorithm on a Cluster of SMPs, Euro-Par 2004
- Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware, IPDPS 2004
- Parallelizing a Defect Detection and Categorization Application, IPDPS 2005
- FREERIDE-G: Supporting Applications that Mine Remote Data Repositories, ICPP 2006
- A Performance Prediction Framework for Grid-Based Data Mining Applications, IPDPS 2007
- A Middleware for Developing and Deploying Scalable Remote Mining Services, to appear in CCGrid 2008
- FREERIDE-G: Enabling Distributed Processing of Large Datasets, to appear in DADC 2008 workshop in conjunction with HPDC 2008
- In submission:
- Supporting Load Balancing and Data Integration For Distributed Data-Intensive Applications, submitted to SuperComputing 2008
- Impact of Bandwidth and I/O Concurrency on Performance of Remote Data Processing Applications, submitted to Grid 2008
- A Middleware for Remote Mining of Data From SRB-based Servers
14. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work (since candidacy)
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
15. Previous FREERIDE-G Limitations
- Data was required to be resident on a single cluster
- Storage data format had to be the same as the processing data format
- Processing had to be confined to a single cluster
- The load balancing and data integration module helps FREERIDE-G overcome these limitations efficiently
16. Motivating Scenario
(Figure: data distributed across repositories A, B, C, and D)
17. Run-time Load Balancing
- Based on the unbalanced (initial) distribution, figure out a partitioning scheme (across repositories)
- Model for load balancing based on two components (across processing nodes):
- partitioning of data chunks
- data transfer time
- weights are used to combine the two components
- Organize remote retrievals
- Use multiple concurrent transfers where required to alleviate the data delivery bottleneck
18. Load Balancing Algorithm

foreach (chunk c)
  if (no processing node assigned to c)
    determine transfer costs across all attributes
    // compare processing costs across all nodes
    foreach (processing node p)
      assume chunk c is assigned to p
      update processing cost and transfer cost
      calculate averages over all other processing nodes (except p)
        for processing cost and transfer cost
      compute the difference in both costs between p and the averages
      add the weighted costs together to compute the total cost
      if (cost < minimum cost)
        update minimum
    assign c to the processing node with minimum cost
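The algorithm above can be sketched as a greedy assignment loop. A hedged sketch only: the `proc_cost` and `xfer_cost` callbacks, the default weights, and the running-total bookkeeping are illustrative assumptions, not the middleware's actual cost model.

```python
def assign_chunks(chunks, nodes, proc_cost, xfer_cost, w_proc=0.5, w_xfer=0.5):
    """Greedily assign each chunk to the node minimizing the weighted
    difference between that node's costs and the other nodes' averages.
    Assumes len(nodes) >= 2."""
    load = {p: 0.0 for p in nodes}   # accumulated processing cost per node
    moved = {p: 0.0 for p in nodes}  # accumulated transfer cost per node
    assignment = {}
    for c in chunks:
        best, best_cost = None, float("inf")
        for p in nodes:
            # assume chunk c is assigned to p and update both costs
            trial_load = load[p] + proc_cost(c, p)
            trial_xfer = moved[p] + xfer_cost(c, p)
            # averages over all other processing nodes (except p)
            others = [q for q in nodes if q != p]
            avg_load = sum(load[q] for q in others) / len(others)
            avg_xfer = sum(moved[q] for q in others) / len(others)
            # weighted total of both cost differences
            cost = w_proc * (trial_load - avg_load) + w_xfer * (trial_xfer - avg_xfer)
            if cost < best_cost:
                best, best_cost = p, cost
        assignment[c] = best
        load[best] += proc_cost(c, best)
        moved[best] += xfer_cost(c, best)
    return assignment
```

With uniform costs the loop alternates between nodes, yielding an even split; the 25-75 combinations evaluated later correspond to one such skewed weight setting.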
19. Approach to Integration
- Available bandwidth is also used to determine the level of concurrency
- After all vertical views of chunks are re-distributed to destination repository nodes, use a wrapper to convert the data
- Automatically generated wrapper:
- Input: physical data layout (storage layout)
- Output: logical data layout (application view)
(Figure: horizontal view, records (X1, Y1, Z1) ... (Xn, Yn, Zn), vs. vertical view, per-attribute columns)
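The layout conversion the generated wrapper performs can be illustrated with a toy transposition; the dict-of-lists physical layout and the X/Y/Z attribute names are assumptions for illustration only.

```python
def vertical_to_horizontal(columns):
    """Convert a vertical (per-attribute) physical layout into the
    horizontal (per-record) logical view an application iterates over."""
    names = list(columns)          # attribute order, e.g. X, Y, Z
    n = len(columns[names[0]])     # number of records
    return [tuple(columns[name][i] for name in names) for i in range(n)]
```

For instance, `vertical_to_horizontal({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]})` returns `[(1, 3, 5), (2, 4, 6)]`.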
20. Experimental Setup
- Settings:
- Organizational Grid
- Wide Area Network
- Goals are to evaluate:
- Scalability
- Load balancing accuracy
- Adaptability to scenarios:
- compute bound
- I/O bound
- WAN setting
21. Setup 1: Organizational Grid
- Data hosted on Opteron 250s (repository cluster, bmi-ri)
- Processed on Opteron 254s (compute cluster, cse-ri)
- 2 clusters connected through 2 x 10 GB optical fibers
- Both clusters within the same city (0.5 miles apart)
- Evaluating:
- Scalability
- Adaptability
- Integration overhead
22. Setup 2: WAN
- Data repositories:
- Opteron 250s (OSU)
- Opteron 258s (Kent St)
- Processed on Opteron 254s (OSU compute cluster)
- No dedicated link between processing and repository clusters
- Evaluating:
- Scalability
- Adaptability
23. Scalability in Organizational Grid
- Vortex Detection (14.8 GB)
- Linear speedup for even data-node to compute-node scale-up
- Near-linear speedup for uneven compute-node scale-up
- Our approach outperforms the initial (unbalanced) configuration
- Comes close to the (statically) balanced configuration
24. Evaluating Balancing Accuracy
(Figures: balancing accuracy for EM (25.6 GB) and k-means (25.6 GB))
25. Model Adaptability: Compute-Bound Scenario
- 50 MB/s and 200 MB/s bandwidth
- k-means clustering (25.6 GB)
- Best results with the 25-75 weight combination (skewed towards the workload size component)
- Initial (unbalanced) overhead: 57% over balanced
- Dynamic overhead: 5% over balanced
26. Model Adaptability: I/O-Bound Scenario
- 15 MB/s and 60 MB/s bandwidth
- EM clustering (25.6 GB)
- Best results with the 25-75 weight combination (skewed towards the data transfer component)
- Initial (unbalanced) overhead: 38% over balanced
- Dynamic overhead: 7% over balanced
27. Wrapper Overhead Evaluation
- k-means clustering (25.6 GB)
- Wrapper overhead:
- k-means clustering: 7%
- vortex detection: 3%
- Data integration overhead has a negligible effect on grid service scalability
28. WAN Setting Adaptability
- Vortex Detection (14.6 GB)
- The 25-75 weight combination results in the lowest overhead (favoring the data delivery component)
- Unbalanced configuration: 20% overhead over balanced
- Our approach: overhead reduced to 8%
29. Summary
- Added middleware support for
- load balancing
- data integration
- Results
- wrapping overhead quite low
- load balancing accuracy good
- both factors captured in model are important
- model is adaptable using weight combinations
30. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
31. FREERIDE-G Grid Service
- Emergence of the Grid leads to computing standards and infrastructures
- Data host component already integrated through SRB for remote data access
- OGSA: previous standard for Grid Services
- WSRF: standard that merges Web and Grid Service definitions
- Globus Toolkit is the most fitting infrastructure for the Grid Service conversion of FREERIDE-G
32. FREERIDE-G System Architecture
33. SRB Data Host
- SRB Master
- Connection establishment
- Authentication
- Forks an agent to service I/O
- SRB Agent
- Performs remote I/O
- Services multiple client API requests
- MCAT
- Catalogs data associated with datasets, users, resources
- Services metadata queries (through SRB agent)
34. Compute Node
- More compute nodes than data hosts
- Each node:
- Registers I/O (from index)
- Connects to data host
- While (chunks to process):
- Dispatch I/O request(s)
- Poll pending I/O
- Process retrieved chunks
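The dispatch/poll/process loop above overlaps retrieval with computation. A minimal sketch using a background thread, assuming hypothetical `fetch()` and `process()` callbacks in place of the real SRB client API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_compute_node(chunk_ids, fetch, process):
    """Overlap chunk retrieval (I/O) with processing: while one chunk is
    being processed, the request for the next one is already pending."""
    if not chunk_ids:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunk_ids[0])   # dispatch first I/O request
        for nxt in chunk_ids[1:]:
            data = pending.result()                  # poll pending I/O
            pending = pool.submit(fetch, nxt)        # dispatch next request
            results.append(process(data))            # process retrieved chunk
        results.append(process(pending.result()))    # drain the last request
    return results
```

Because the next fetch is submitted before the current chunk is processed, I/O latency is hidden behind computation whenever processing a chunk takes at least as long as retrieving one.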
35. FREERIDE-G in Action
(Figure: compute nodes interacting with the data host)
- I/O registration and connection establishment (compute node with SRB Master and MCAT)
- While (more chunks to process):
- I/O request dispatched to an SRB Agent
- Pending I/O polled
- Retrieved data chunks analyzed
36. Implementation Challenges
- Interaction with code repository:
- SWIG (Simplified Wrapper and Interface Generator)
- XML descriptors of API functions
- Each API function wrapped in its own class
- Integration with MPICH-G2:
- Supports MPI
- Deployed through Globus components (GRAM)
- Hides potential heterogeneity in service startup and management
37. Experimental Setup
- Organizational Grid:
- Data hosted on Opteron 250 cluster
- Processed on Opteron 254 cluster
- Connected using 2 x 10 GB optical fibers
- Goals:
- Demonstrate parallel scalability of applications
- Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
38. Deployment Overhead Evaluation
- Clearly a small overhead is associated with using Globus and MPICH-G2 for middleware deployment
- k-means clustering with 6.4 GB dataset: 18-20%
- Vortex detection with 14.8 GB dataset: 17-20%
39. Summary of Results
- FREERIDE-G: middleware for scalable processing of remote data, now as a grid service
- Compliance with grid and repository standards
- Scalable performance with respect to data and compute nodes
- Low deployment overheads compared to the non-grid version
- Further ease of development
- High portability
40. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
41. Grid Computing
- Grid standards:
- Open Grid Service Architecture (OGSA)
- Web Services Resource Framework (WSRF)
- Implementations:
- Globus Toolkit, WSRF.NET, WSRF::Lite
- Other Grid systems:
- Condor and Legion
- Workflow composition systems:
- GridLab, Nimrod/G (GridBus), Cactus, GRIST
42. Data Grids
- Metadata cataloging
- Artemis project
- Remote data retrieval
- SRB, SEMPLAR
- DataCutter, STORM
- Stream processing middleware
- GATES
- dQUOB
- Data integration
- Automatic wrapper generation
43. Remote Data in Grid
- KnowledgeGrid tool-set
- Developing data mining workflows (composition)
- GridMiner toolkit
- Composition of distributed, parallel workflows
- DiscoveryNet layer
- Creation, deployment, management of services
- DataMiningGrid framework
- Distributed knowledge discovery
- Other projects partially overlap in goals with ours
44. Resource Allocation Approaches
- 3 broad categories for resource allocation:
- Heuristic approach to mapping
- Prediction through modeling:
- Statistical estimation/prediction
- Analytical modeling of parallel applications
- Simulation-based performance prediction
45. Outline
- Motivation
- Introduction to FREERIDE-G Middleware
- Latest work
- Load Balancing and Data Integration
- Integration with Grid Computing Standards
- Related Work
- Conclusion
46. Conclusions
- FREERIDE-G: middleware for scalable processing of remote data
- Integrated with grid computing standards
- Supports load balancing and data integration
- Support for high-end processing:
- Eases use of parallel configurations
- Hides details of data movement and caching
- Fine-tuned performance models
- Compliance with remote repository standards
47. Middleware Component Interaction
(Figure: interaction between the load balancing and data integration components)
48. Scalability Evaluation
- Scalable implementations with respect to numbers of:
- Compute nodes
- Data repository nodes
- Vortex detection with 14.8 GB dataset (top)
- k-means clustering with 6.4 GB dataset (bottom)