1. FREERIDE-G Framework for Developing Grid-Based Data Mining Applications
- L. Glimcher, R. Jin, G. Agrawal
- Presented by Leo Glimcher
- glimcher_at_cse.ohio-state.edu
2. Distributed Data-Intensive Science
[Diagram: a user submits an analysis to a compute cluster, which retrieves data from a data repository cluster]
3. Challenges for Application Development
- Analysis of large amounts of disk-resident data
- Incorporating parallel processing into the analysis
- Processing needs to be independent of other elements and easy to specify
- Coordination of storage, network, and computing resources required
- Transparency of data retrieval, staging, and caching is desired
4. FREERIDE-G Goals
- Support High-End Processing
  - Enable efficient processing of large-scale data mining computations
- Ease Use of Parallel Configurations
  - Support shared- and distributed-memory parallelization starting from a common high-level interface
- Hide Details of Data Movement and Caching
  - Data staging and caching (when feasible/appropriate) need to be transparent to the application developer
5. Presentation Road Map
- Motivation and goals
- System architecture and overview
- Applications used for evaluation
- Experimental evaluation
- Related work in distributed data-intensive science
- Conclusions and future work
6. FREERIDE-G Architecture
[Architecture diagram: a data server runs on each node of the data repository cluster, performing Data Retrieval, Data Distribution, and Communication; a compute server runs on each node of the user cluster, performing Communication, Data Processing, and Caching/Retrieval]
7. Data Server Functionality
- Data retrieval
  - data chunks read from repository disks
- Data distribution
  - each chunk assigned a destination processing node in the user cluster
- Data communication
  - each chunk forwarded to its destination processing node
- A data server runs on every on-line data repository node, automating data delivery to the end user
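
To make the three steps concrete, here is a minimal C++ sketch of the data server loop. The helper names (read_chunk, send_to) are illustrative stubs, not the actual FREERIDE-G/ADR API, and the modulo assignment stands in for the real distribution policy:

#include <cstddef>
#include <cstdio>
#include <vector>

struct Chunk {
    std::size_t id;            // unique chunk identifier
    std::vector<char> bytes;   // raw data read from a repository disk
};

// Stub standing in for disk I/O: read one chunk (data retrieval).
static Chunk read_chunk(std::size_t id) { return Chunk{id, {}}; }

// Stub standing in for the network: forward a chunk (data communication).
static void send_to(int node, const Chunk& c) {
    std::printf("chunk %zu -> compute node %d\n", c.id, node);
}

// Data server loop: retrieve each chunk, assign it a destination
// compute node (data distribution), and forward it.
void data_server(std::size_t num_chunks, int num_compute_nodes) {
    for (std::size_t id = 0; id < num_chunks; ++id) {
        Chunk c = read_chunk(id);
        int dest = static_cast<int>(id % num_compute_nodes);  // hash on chunk ID
        send_to(dest, c);
    }
}

int main() { data_server(8, 4); return 0; }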
8. Compute Node Functionality
- Data communication
  - data chunks received from the corresponding data node
- Computation
  - application-specific processing performed on each chunk
- Data caching and retrieval
  - for multi-pass algorithms, data is cached locally on the first pass and retrieved locally for subsequent passes
- A compute server runs on every processing node to receive data and process it in an application-specific way
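
A minimal C++ sketch of the compute server's per-pass loop follows. The helpers (receive_chunk, process) and the cache path are assumptions for illustration; the real middleware handles this transparently:

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

using Chunk = std::vector<char>;

// Stub standing in for the network: receive a chunk from the paired data node.
static Chunk receive_chunk(std::size_t id) { return Chunk(16, static_cast<char>(id)); }

// Stub standing in for the application-specific reduction over one chunk.
static void process(const Chunk&) {}

// Assumed local-file-system cache location (illustrative path).
static std::string cache_path(std::size_t id) {
    return "/tmp/freeride_cache_" + std::to_string(id);
}

// One pass over the chunks assigned to this node. On the first pass,
// chunks arrive over the network and are written to the local cache;
// later passes read them back locally, avoiding re-communication.
void run_pass(std::size_t num_chunks, bool first_pass) {
    for (std::size_t id = 0; id < num_chunks; ++id) {
        Chunk c;
        if (first_pass) {
            c = receive_chunk(id);
            std::ofstream out(cache_path(id), std::ios::binary);
            out.write(c.data(), static_cast<std::streamsize>(c.size()));
        } else {
            std::ifstream in(cache_path(id), std::ios::binary);
            c.assign(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
        }
        process(c);   // application-specific computation on the chunk
    }
}

int main() {
    run_pass(4, true);    // pass 1: receive and cache
    run_pass(4, false);   // pass 2: read from the local cache
    return 0;
}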
9. Processing Structure of FREERIDE-G
- Built on FREERIDE
  - Key observation: most algorithms follow a canonical loop
- Middleware API
  - Subset of data to be processed
  - Reduction object
  - Local and global reduction operations
  - Iterator
- Supports
  - Disk-resident datasets
  - Shared and distributed memory
While (not done)
  forall (data instances d)
    i = process(d)
    R(i) = R(i) op d
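
As an illustration of the canonical loop, below is a self-contained 1-D k-means sketch in plain C++: process(d) picks the nearest centroid (the reduction index i), and the reduction object accumulates per-centroid sums and counts. It mirrors the loop structure above but is not the actual FREERIDE API:

#include <cmath>
#include <cstdio>
#include <vector>

// Reduction object R: one accumulator per centroid.
struct ReductionObject {
    std::vector<double> sum;
    std::vector<int> count;
};

// process(d): map a data instance to its reduction index
// (here, the index of the nearest centroid).
static int nearest(double d, const std::vector<double>& centers) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(centers.size()); ++i)
        if (std::fabs(d - centers[i]) < std::fabs(d - centers[best]))
            best = i;
    return best;
}

int main() {
    std::vector<double> data = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
    std::vector<double> centers = {0.0, 10.0};          // initial centroids
    for (int pass = 0; pass < 10; ++pass) {             // while (not done)
        ReductionObject R{std::vector<double>(2, 0.0),
                          std::vector<int>(2, 0)};
        for (double d : data) {                         // forall (data instances d)
            int i = nearest(d, centers);                // i = process(d)
            R.sum[i] += d;                              // R(i) = R(i) op d
            R.count[i] += 1;
        }
        for (int i = 0; i < 2; ++i)                     // update centroids from R
            if (R.count[i] > 0) centers[i] = R.sum[i] / R.count[i];
    }
    std::printf("centroids: %.2f %.2f\n", centers[0], centers[1]);
    return 0;
}

Because the reduction object is the only shared state, the same loop can run on shared or distributed memory by replicating R per processor and merging the copies, which is what makes the common high-level interface possible.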
10. Summary of Implementation Issues
- Managing and communicating remote data
  - two-way coordination required
- Load distribution
  - needed if the compute cluster is bigger than the data cluster
- Parallel processing on the compute cluster
  - FREERIDE-G supports generalized reductions
- Caching
  - benefits multi-pass algorithms
11. Remote Data Issues
- Managing data communication
  - ADR library used for scheduling and performing data retrieval at the repository site
  - communication timing coordinated between source and destination
- Caching
  - local file system used for caching
  - avoids redundant communication of data for (P-1)/P of the iterations of a P-pass algorithm (e.g., a 5-pass algorithm communicates data over the network only on the first pass, avoiding 4/5 of the transfers)
12. Parallel Data Processing Issues
- Load distribution
  - needed when more compute nodes are available than data nodes
  - hashing on a unique chunk ID assigns each chunk to a compute node
- Parallel processing on the compute cluster
  - after data is distributed, a local reduction is performed on every node
  - reduction objects gathered at the master node
  - global combination (reduction) performed on the master node (sketched below)
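
A minimal sketch of the global combination step, with the per-node local results simulated in-process; in FREERIDE-G the reduction objects would be gathered over the network at the master node:

#include <cstdio>
#include <vector>

// A trivially mergeable reduction object (here: a running sum and count).
struct ReductionObject {
    double sum = 0.0;
    long count = 0;
};

// Global reduction: merge one node's local result into the master's copy.
static void combine(ReductionObject& global, const ReductionObject& local) {
    global.sum += local.sum;
    global.count += local.count;
}

int main() {
    // Local reduction results from four compute nodes (simulated values).
    std::vector<ReductionObject> locals = {
        {10.0, 4}, {6.5, 3}, {12.0, 5}, {3.5, 2}};

    ReductionObject global;                      // held at the master node
    for (const auto& r : locals) combine(global, r);

    std::printf("global mean = %f over %ld points\n",
                global.sum / global.count, global.count);
    return 0;
}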
13. Application Summary
- Data Mining
  - K-nearest neighbor search
  - K-means clustering
  - EM clustering
- Scientific Feature Mining
  - Vortex detection in the fluid flow dataset
  - Molecular defect detection in the molecular dynamics dataset
14. Goals for Experimental Evaluation
- Evaluating parallel scalability of the applications developed
  - numbers of data and compute nodes kept equal while the parallel configuration varies
- Evaluating scalability of compute nodes
  - number of compute nodes varied independently of the number of data nodes
- Evaluating benefits of caching
  - multi-pass algorithms evaluated
15. Evaluating Overall Scalability
- Cluster of 700 MHz Pentiums
- Connected through Myrinet LANai 7.0 (no access to a high-bandwidth network)
- Equal number of repository and compute nodes
16. Overall Scalability
- All 5 applications tested
- High parallel efficiency
- Good scalability with respect to
  - problem size
  - number of processing nodes
17. Evaluating Scalability of Compute Nodes
- Compute cluster size is greater than data repository cluster size
- Applications (single-pass only)
  - kNN search
  - molecular defect detection
  - vortex detection (next slide)
- Parallel configurations
  - data nodes: 1 to 8
  - compute nodes: 1 to 16
18. Compute Node Scalability
- Only the data processing work is parallelized
  - data retrieval and communication times not affected
- Speedups are sub-linear
- Better resource utilization leads to a decrease in analysis time
19. Evaluating Effects of Caching
- Simulated network bandwidth: 500 KB/sec
- Caching vs. non-caching versions compared
- Comparing data communication times (P passes)
  - factor-of-P decrease from caching
- Caching benefit depends on
  - application
  - network bandwidth
20. Related Work
- Support for grid-based data mining
  - Knowledge Grid toolset
  - Grid-Miner toolkit
  - Discovery Net layer
  - DataMiningGrid framework
  - no interface for easing parallelization and abstracting data movement
- GRIST: support for astronomy-related mining on the grid
  - specific to the astronomical domain
- FREERIDE-G is built directly on top of FREERIDE
21. Conclusions
- FREERIDE-G supports remote data analysis from a high-level interface
- Evaluated on a variety of algorithms
- Demonstrated scalability in terms of
  - even data-compute scale-up
  - compute node scale-up (processing time only)
- Multi-pass algorithms benefit from data caching
22. Continuing Work on FREERIDE-G
- High-bandwidth network evaluation
- Performance-prediction-based resource selection
- Resource allocation
- More sophisticated caching and data communication mechanisms (SRB)
- Data format issues: wrapper integration
- Higher-level front-end to further ease development of data analysis tools for the grid