1
A Grid-Based Middleware for Scalable Processing
of Remote Data
  • Leonid Glimcher
  • Advisor: Gagan Agrawal

2
Abundance of Data
  • Data generated everywhere
  • Sensors (intrusion detection, satellites)
  • Scientific simulations (fluid dynamics, molecular
    dynamics)
  • Business transactions (purchases, market trends)
  • Analysis needed to translate data into knowledge
  • Growing data size creates problems
  • Online repositories are becoming more prominent
    for data storage

3
Emergence of Grid Computing
  • Provides access to resources and data that are
    otherwise inaccessible
  • Aimed at providing a stable foundation for
    distributed services, but comes at a price
  • Standards are needed to integrate distributed
    services
  • OGSA, WSRF
  • Cloud utility computing models
  • Resources - more accessible
  • Services - more usable

4
Remote Data Analysis
  • Remote data analysis
  • Grid is a good fit
  • Details can be very tedious
  • Data retrieval, movement and caching
  • Parallel data processing
  • Resource allocation
  • Application configuration
  • Middleware can be useful to abstract away details

5
Our Approach
  • Supporting development of scalable applications
    that process remote data using middleware
    FREERIDE-G (Framework for Rapid Implementation of
    Datamining Engines in Grid)

(Figure: middleware user, repository cluster, compute cluster)
6
FREERIDE-G Middleware Contributions
  • ADR-based version for mining remote data
  • SRB-based version integrated with standards for
    remote data access
  • Performance prediction framework for resource and
    replica selection
  • FREERIDE-G grid service integrated with OGSA
    computing standards
  • Load balancing and data integration module
  • 3 parallel middleware-based applications
    (developed in context of FREERIDE)
  • EM clustering,
  • vortex detection,
  • defect detection and categorization

7
SDSC Storage Resource Broker
  • FREERIDE-G targets SRB resident data
  • Standard for remote data
  • Storage
  • Access
  • Data can be distributed across organizations and
    heterogeneous storage systems
  • Access provided through client API

8
FREERIDE (G) Processing Structure
  • Key observation: most data mining algorithms
    follow a canonical loop
  • Middleware API
  • Subset of data to be processed
  • Reduction object
  • Local and global reduction operations
  • Iterator
  • Derived from precursor system FREERIDE

While ( )
  Forall (data instances d)
    (i, d') = process(d)
    R(i) = R(i) op d'
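To make the canonical loop concrete, below is a minimal C++ sketch of how a Kmeans-style application could plug into a reduction-object API of this kind. The names (ReductionObject, localReduction, globalReduction, closestCenter) are illustrative assumptions, not the actual FREERIDE-G interface.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Illustrative reduction object for Kmeans: per-cluster running sums and
    // counts (1-D points for brevity).
    struct ReductionObject {
        std::vector<double> sum;
        std::vector<long>   count;
        explicit ReductionObject(std::size_t k) : sum(k, 0.0), count(k, 0) {}
    };

    // process(d): map a data instance to the reduction-object index it updates.
    static std::size_t closestCenter(double d, const std::vector<double>& centers) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < centers.size(); ++i)
            if (std::abs(d - centers[i]) < std::abs(d - centers[best])) best = i;
        return best;
    }

    // Local reduction: the forall body of the canonical loop, run over one chunk.
    void localReduction(const std::vector<double>& chunk,
                        const std::vector<double>& centers,
                        ReductionObject& r) {
        for (double d : chunk) {
            std::size_t i = closestCenter(d, centers);   // (i, d') = process(d)
            r.sum[i]   += d;                             // R(i) = R(i) op d'
            r.count[i] += 1;
        }
    }

    // Global reduction: combine reduction objects produced on different nodes.
    void globalReduction(ReductionObject& dst, const ReductionObject& src) {
        for (std::size_t i = 0; i < dst.sum.size(); ++i) {
            dst.sum[i]   += src.sum[i];
            dst.count[i] += src.count[i];
        }
    }

Because every update goes through the reduction object, the middleware can parallelize the forall loop across nodes and merge the per-node copies with the global reduction, without application-specific communication code.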
9
Classic Data Mining Applications
  • Tested here
  • Expectation Maximization clustering
  • Kmeans clustering
  • K Nearest Neighbor search
  • Previously on FREERIDE
  • Apriori and FPGrowth frequent itemset miners
  • RainForest decision tree construction

10
Scientific data processing applications
  • VortexPro
  • Finds vortices in volumetric fluid/gas flow
    datasets
  • Defect detection
  • Finds defects in molecular lattice datasets
  • Categorizes defects into classes
  • Middleware is useful for wide variety of
    algorithms

11
FREERIDE-G Evolution
  • FREERIDE
  • data stored locally
  • FREERIDE-G
  • ADR responsible for remote data retrieval
  • SRB responsible for remote data retrieval
  • FREERIDE-G grid service
  • Grid service featuring
  • Load balancing
  • Data integration

12
Evolution
(Diagram: evolution from FREERIDE through FREERIDE-G-ADR and FREERIDE-G-SRB to FREERIDE-G-GT, layered over Application, Data, ADR, SRB, and Globus components)
13
Publications
  • Published
  • Middleware for Data Mining Applications on
    Clusters and Grids, invited to special issue of
    Journal of Parallel and Distributed Computing
    (JPDC) on "Parallel Techniques for Information
    Extraction"
  • Parallelizing EM Clustering Algorithm on a
    Cluster of SMPs, Euro-Par04.
  • Scaling and Parallelizing a Scientific Feature
    Mining Application Using a Cluster Middleware,
    IPDPS04
  • Parallelizing a Defect Detection and
    Categorization Application, IPDPS05.
  • FREERIDE-G: Supporting Applications that Mine
    Remote Data Repositories, ICPP06
  • A Performance Prediction Framework for
    Grid-Based Data Mining Applications, IPDPS07
  • A Middleware for Developing and Deploying
    Scalable Remote Mining Services, to appear in
    CCGRID08
  • FREERIDE-G: Enabling Distributed Processing of
    Large Datasets, to appear in DADC08 workshop in
    conjunction with HPDC08
  • In Submission
  • Supporting Load Balancing and Data Integration
    For Distributed Data-Intensive Applications,
    submitted to SuperComputing08
  • Impact of Bandwidth and I/O Concurrency on
    Performance of Remote Data Processing
    Applications, submitted to Grid08
  • A Middleware for Remote Mining of Data From
    SRB-based Servers

14
Outline
  • Motivation
  • Introduction to FREERIDE-G Middleware
  • Latest work (since candidacy)
  • Load Balancing and Data Integration
  • Integration with Grid Computing Standards
  • Related Work
  • Conclusion

15
Previous FREERIDE-G Limitations
  • Data was required to be resident on a single
    cluster
  • Storage data format had to be the same as the
    processing data format
  • Processing had to be confined to a single cluster
  • Load balancing and data integration module helps
    FREERIDE-G overcome limitations efficiently

16
Motivating scenario
(Diagram: data distributed across repository sites A, B, C, and D)
17
Run-time Load Balancing
  • Based on the unbalanced (initial) distribution,
    figure out a partitioning scheme (across
    repositories)
  • Model for load balancing based on two components
    (across processing nodes)
  • partitioning of data chunks,
  • data transfer time.
  • Weights used to combine the two components
  • Organize remote retrievals
  • Use multiple concurrent transfers where required
    to alleviate data delivery bottleneck

18
Load Balancing Algorithm
Foreach (chunk c)
  If (no processing node assigned to c)
    Determine transfer costs across all attributes
    // Compare processing costs across all nodes
    Foreach (processing node p)
      Assume chunk c is assigned to p
      Update processing cost and transfer cost
      Calculate averages for all other processing nodes
        (except p) for processing cost and transfer cost
      Compute difference in both costs between p and averages
      Add weighted costs together to compute total cost
      If (cost < minimum cost)
        Update minimum
    Assign c to processing node with minimum cost
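A minimal C++ sketch of this chunk-assignment loop follows. The data structures and cost estimates (ChunkInfo, NodeState, estimateProcessingCost, estimateTransferCost, the 50 MB/s link assumption) are illustrative placeholders, not the middleware's actual model.

    #include <cstddef>
    #include <limits>
    #include <vector>

    struct ChunkInfo { double sizeBytes; int repositoryNode; };          // where the chunk currently resides
    struct NodeState { double procCost = 0.0; double xferCost = 0.0; };  // accumulated per-node costs

    // Assumed cost estimates: processing cost proportional to chunk size;
    // transfer is free if the chunk is already local, else limited by an assumed 50 MB/s link.
    double estimateProcessingCost(const ChunkInfo& c) { return c.sizeBytes; }
    double estimateTransferCost(const ChunkInfo& c, int node) {
        return (c.repositoryNode == node) ? 0.0 : c.sizeBytes / 50e6;
    }

    // Assign each chunk to the node whose weighted increase in (processing, transfer)
    // cost over the average of the other nodes is smallest.
    std::vector<int> assignChunks(const std::vector<ChunkInfo>& chunks,
                                  std::size_t numNodes, double wProc, double wXfer) {
        std::vector<NodeState> nodes(numNodes);
        std::vector<int> owner(chunks.size(), -1);

        for (std::size_t c = 0; c < chunks.size(); ++c) {
            double bestCost = std::numeric_limits<double>::max();
            int bestNode = 0;
            for (std::size_t p = 0; p < numNodes; ++p) {
                // Assume chunk c is assigned to p and update its costs.
                double proc = nodes[p].procCost + estimateProcessingCost(chunks[c]);
                double xfer = nodes[p].xferCost + estimateTransferCost(chunks[c], (int)p);

                // Averages for all other processing nodes (except p).
                double avgProc = 0.0, avgXfer = 0.0;
                for (std::size_t q = 0; q < numNodes; ++q) {
                    if (q == p) continue;
                    avgProc += nodes[q].procCost;
                    avgXfer += nodes[q].xferCost;
                }
                if (numNodes > 1) { avgProc /= (numNodes - 1); avgXfer /= (numNodes - 1); }

                // Weighted total cost of placing c on p.
                double cost = wProc * (proc - avgProc) + wXfer * (xfer - avgXfer);
                if (cost < bestCost) { bestCost = cost; bestNode = (int)p; }
            }
            owner[c] = bestNode;
            nodes[bestNode].procCost += estimateProcessingCost(chunks[c]);
            nodes[bestNode].xferCost += estimateTransferCost(chunks[c], bestNode);
        }
        return owner;
    }

The weights wProc and wXfer correspond to the 25-75 style combinations evaluated on the adaptability slides, favoring either the workload-size or the data-delivery component.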
19
Approach to Integration
  • Available bandwidth is also used to determine the
    level of concurrency
  • After all vertical views of chunks are
    re-distributed to destination repository nodes,
    use wrapper to convert data
  • Automatically generated wrapper
  • Input: physical data layout (storage layout)
  • Output: logical data layout (application view)

(Diagram: horizontal view, rows (X1, Y1, Z1) ... (Xn, Yn, Zn), versus vertical view, in which each attribute is stored as a separate column)
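As a rough illustration of what the generated wrapper does, the following C++ sketch converts a column-wise (vertical) storage layout into the row-wise (horizontal) records an application expects; the Point type and toHorizontal function are assumptions for illustration, not generated wrapper code.

    #include <cstddef>
    #include <vector>

    // Logical record as the application sees it (horizontal view).
    struct Point { double x, y, z; };

    // Convert the vertical layout (one array per attribute, as stored)
    // into the horizontal layout (one record per data instance).
    std::vector<Point> toHorizontal(const std::vector<double>& xs,
                                    const std::vector<double>& ys,
                                    const std::vector<double>& zs) {
        std::vector<Point> rows;
        rows.reserve(xs.size());
        for (std::size_t i = 0; i < xs.size(); ++i)
            rows.push_back({xs[i], ys[i], zs[i]});   // row i = (Xi, Yi, Zi)
        return rows;
    }

On the evaluation slides, this conversion step is the source of the reported wrapper overhead (7% for Kmeans clustering, 3% for vortex detection).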
20
Experimental Setup
  • Settings
  • Organizational Grid
  • Wide Area Network
  • Goals are to evaluate
  • Scalability
  • Load balancing accuracy
  • Adaptability to scenarios
  • compute bound,
  • I/O bound,
  • WAN setting

21
Setup 1: Organizational Grid
Repository cluster (bmi-ri)
  • Data hosted on Opteron 250s
  • Processed on Opteron 254s
  • Two clusters connected through two 10 GB optical
    fibers
  • Both clusters within same city (0.5 mile apart)
  • Evaluating
  • Scalability
  • Adaptability
  • Integration overhead

Compute cluster (cse-ri)
22
Setup 2: WAN
Repository cluster (Kent State)
  • Data Repository
  • Opteron 250s (OSU)
  • Opteron 258s (Kent State)
  • Processed on Opteron 254s
  • No dedicated link between processing and
    repository clusters
  • Evaluating
  • Scalability
  • Adaptability

Repository cluster (OSU)
Compute cluster (OSU)
23
Scalability in Organizational Grid
  • Vortex Detection (14.8 GB)
  • Linear speedup when data nodes and compute nodes
    are scaled up evenly
  • Near-linear speedup for uneven compute node
    scale-up
  • Our approach outperforms the initial (unbalanced)
    configuration
  • Comes close to the (statically) balanced
    configuration

24
Evaluating Balancing Accuracy
(Charts: EM clustering, 25.6 GB dataset; Kmeans clustering, 25.6 GB dataset)
25
Model Adaptability -- Compute Bound Scenario
  • 50 MB/s and 200 MB/s bandwidth
  • Kmeans clustering (25.6 GB)
  • Best results with the 25-75 weight combination
    (skewed towards the workload size component)
  • Initial (unbalanced) overhead
  • 57% over balanced
  • Dynamic overhead
  • 5% over balanced

26
Model Adaptability -- I/O Bound Scenario
  • 15 MB/s and 60 MB/s bandwidth
  • EM clustering (25.6 GB)
  • Best results with the 25-75 weight combination
    (skewed towards the data transfer component)
  • Initial (unbalanced) overhead
  • 38% over balanced
  • Dynamic overhead
  • 7% over balanced

27
Wrapper overhead evaluation
  • Kmeans clustering (25.6 GB)
  • Wrapper overhead
  • Kmeans clustering - 7%
  • Vortex detection - 3%
  • Data integration overhead has negligible effect
    on grid service scalability

28
WAN Setting Adaptability
  • Vortex Detection (14.6 GB)
  • 25-75 weight combination results in the lowest
    overhead (favoring the data delivery component)
  • Unbalanced configuration: 20% overhead over
    balanced
  • Our approach
  • overhead reduced to 8%

29
Summary
  • Added middleware support for
  • load balancing
  • data integration
  • Results
  • wrapping overhead quite low
  • load balancing accuracy good
  • both factors captured in model are important
  • model is adaptable using weight combinations

30
Outline
  • Motivation
  • Introduction to FREERIDE-G Middleware
  • Latest work
  • Load Balancing and Data Integration
  • Integration with Grid Computing Standards
  • Related Work
  • Conclusion

31
FREERIDE-G Grid Service
  • Emergence of Grid leads to computing standards
    and infrastructures
  • Data hosts component already integrated through
    SRB for remote data access
  • OGSA: previous standard for Grid Services
  • WSRF: standard that merges Web and Grid Service
    definitions
  • Globus Toolkit is the most fitting infrastructure
    for Grid Service conversion of FREERIDE-G

32
FREERIDE-G System Architecture
33
SRB Data Host
  • SRB Master
  • Connection establishment
  • Authentication
  • Forks agent to service I/O
  • SRB Agent
  • Performs remote I/O
  • Services multiple client API requests
  • MCAT
  • Catalogs data associated with datasets, users,
    resources
  • Services metadata queries (through SRB agent)

34
Compute Node
  • More compute nodes than data hosts
  • Each node
  • Registers I/O (from index)
  • Connects to data host
  • While (chunks to process)
  • Dispatch I/O request(s)
  • Poll pending I/O
  • Process retrieved chunks
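A hedged C++ sketch of this per-node loop is shown below; DataHostConnection and its methods are hypothetical stand-ins (simulated in memory here) for the SRB-backed retrieval layer, used only to illustrate how pending I/O is overlapped with chunk processing.

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    using Chunk = std::vector<char>;

    // Hypothetical stand-in for the SRB-backed data host connection,
    // simulated in memory so the sketch is self-contained.
    class DataHostConnection {
        std::deque<int> pending;
    public:
        void dispatchRequest(int chunkId) { pending.push_back(chunkId); }  // asynchronous I/O request
        std::optional<Chunk> pollCompleted() {                             // poll pending I/O
            if (pending.empty()) return std::nullopt;
            int id = pending.front();
            pending.pop_front();
            return Chunk(1024, static_cast<char>(id));                     // pretend a 1 KB chunk arrived
        }
    };

    void processChunk(const Chunk&) { /* application's local reduction over one chunk */ }

    // Compute-node loop: keep a window of outstanding requests so that data
    // delivery overlaps with processing of already-retrieved chunks.
    void computeNodeLoop(DataHostConnection& conn, const std::vector<int>& myChunks) {
        const std::size_t window = 4;       // assumed I/O concurrency level
        std::size_t next = 0, outstanding = 0, done = 0;

        while (done < myChunks.size()) {
            while (next < myChunks.size() && outstanding < window) {
                conn.dispatchRequest(myChunks[next++]);                    // dispatch I/O request(s)
                ++outstanding;
            }
            if (auto chunk = conn.pollCompleted()) {                       // poll pending I/O
                --outstanding;
                ++done;
                processChunk(*chunk);                                      // process retrieved chunk
            }
        }
    }

Keeping several requests in flight hides data-delivery latency behind computation; the appropriate concurrency level depends on the available bandwidth.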

35
FREERIDE-G in Action
(Sequence diagram: the compute node performs I/O registration and connection establishment with the SRB Master, which forks an SRB Agent; while more chunks remain to process, I/O requests are dispatched, pending I/O is polled, and retrieved data chunks are analyzed; the SRB Agent services the requests, consulting MCAT for metadata.)
36
Implementation Challenges
  • Interaction with Code Repository
  • Simplified Wrapper and Interface Generator (SWIG)
  • XML descriptors of API functions
  • Each API function wrapped in own class
  • Integration with MPICH-G2
  • Supports MPI
  • Deployed through Globus components (GRAM)
  • Hides potential heterogeneity in service startup
    and management
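To illustrate the "each API function wrapped in its own class" point, here is a hypothetical C++ sketch of how one middleware API call might be exposed through a per-function wrapper class driven by its XML descriptor; the names (ApiFunction, LocalReductionCall) are assumptions for illustration, not the actual generated interface.

    #include <string>
    #include <vector>

    // Common base for generated per-function wrappers.
    class ApiFunction {
    public:
        virtual ~ApiFunction() = default;
        virtual std::string name() const = 0;                        // name taken from the XML descriptor
        virtual void invoke(const std::vector<std::string>& args) = 0;
    };

    // Hypothetical wrapper for one middleware API function.
    class LocalReductionCall : public ApiFunction {
    public:
        std::string name() const override { return "local_reduction"; }
        void invoke(const std::vector<std::string>& args) override {
            (void)args;   // forward the (string-encoded) arguments to the underlying API call here
        }
    };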

37
Experimental setup
  • Organizational Grid
  • Data hosted on Opteron 250 cluster
  • Processed on Opteron 254 cluster
  • Connected using two 10 GB optical fibers
  • Goals
  • Demonstrate parallel scalability of applications
  • Evaluate overhead of using MPICH-G2 and Globus
    Toolkit deployment mechanisms

38
Deployment Overhead Evaluation
  • Clearly a small overhead associated with using
    Globus and MPICH-G2 for middleware deployment.
  • Kmeans Clustering with 6.4 GB dataset: 18-20%.
  • Vortex Detection with 14.8 GB dataset: 17-20%.

39
Summary of results
  • FREERIDE-G middleware for scalable processing
    of remote data -- now as a grid service
  • Compliance with grid and repository standards
  • Scalable performance with respect to data and
    compute nodes
  • Low deployment overheads as compared to non-grid
    version
  • Further ease of development
  • High portability

40
Outline
  • Motivation
  • Introduction to FREERIDE-G Middleware
  • Latest work
  • Load Balancing and Data Integration
  • Integration with Grid Computing Standards
  • Related Work
  • Conclusion

41
Grid Computing
  • The Grid: Standards
  • Open Grid Service Architecture
  • Web Services Resource Framework
  • Implementations
  • Globus Toolkit, WSRF.NET, WSRF::Lite
  • Other Grid Systems
  • Condor and Legion,
  • Workflow composition systems
  • GridLab, Nimrod/G (GridBus), Cactus, GRIST

42
Data Grids
  • Metadata cataloging
  • Artemis project
  • Remote data retrieval
  • SRB, SEMPLAR
  • DataCutter, STORM
  • Stream processing middleware
  • GATES
  • dQUOB
  • Data integration
  • Automatic wrapper generation

43
Remote Data in Grid
  • KnowledgeGrid tool-set
  • Developing datamining workflows (composition)
  • GridMiner toolkit
  • Composition of distributed, parallel workflows
  • DiscoveryNet layer
  • Creation, deployment, management of services
  • DataMiningGrid framework
  • Distributed knowledge discovery
  • Other projects partially overlap in goals with
    ours

44
Resource Allocation Approaches
  • 3 broad categories for resource allocation
  • Heuristic approach to mapping
  • Prediction through modeling
  • Statistical estimation/prediction
  • Analytical modeling of parallel application
  • Simulation based performance prediction

45
Outline
  • Motivation
  • Introduction to FREERIDE-G Middleware
  • Latest work
  • Load Balancing and Data Integration
  • Integration with Grid Computing Standards
  • Related Work
  • Conclusion

46
Conclusions
  • FREERIDE-G middleware for scalable processing
    of remote data
  • Integrated with grid computing standards
  • Supports load balancing and data integration
  • Support for high-end processing
  • Ease use of parallel configurations
  • Hide details of data movement and caching
  • Fine-tuned performance models
  • Compliance with remote repository standards

47
Middleware Component Interaction
(Diagram: interaction between the load balancing and data integration components)
48
Scalability Evaluation
  • Scalable implementations with respect to numbers
    of
  • Compute nodes,
  • Data repository nodes.
  • Vortex Detection with 14.8 GB dataset (top).
  • Kmeans Clustering with 6.4 GB dataset (bottom)