1. FREERIDE-G Framework for Developing Grid-Based Data Mining Applications
- L. Glimcher, R. Jin, G. Agrawal
- Presented by Leo Glimcher
- glimcher_at_cse.ohio-state.edu
2. Distributed Data-Intensive Science
[Diagram: a user submits an analysis to a compute cluster, which retrieves data from a data repository cluster]
3. Challenges for Application Development
- Analysis of large amounts of disk-resident data
- Incorporating parallel processing into the analysis
- Processing needs to be independent of other elements and easy to specify
- Coordination of storage, network, and computing resources required
- Transparency of data retrieval, staging, and caching is desired
4. FREERIDE-G Goals
- Support High-End Processing
  - Enable efficient processing of large-scale data mining computations
- Ease Use of Parallel Configurations
  - Support shared- and distributed-memory parallelization starting from a common high-level interface
- Hide Details of Data Movement and Caching
  - Data staging and caching (when feasible/appropriate) need to be transparent to the application developer
5. Presentation Road Map
- Motivation and goals
- System architecture and overview
- Applications used for evaluation
- Experimental evaluation
- Related work in distributed data-intensive science
- Conclusions and future work
6. FREERIDE-G Architecture
[Architecture diagram: a data server runs on each node of the data repository cluster, performing Data Retrieval, Data Distribution, and Communication; a compute server runs on each node of the user cluster, performing Communication, Data Processing, and Caching/Retrieval]
7. Data Server Functionality
- Data retrieval
  - data chunks read from repository disks
- Data distribution
  - each chunk assigned a destination processing node in the user cluster
- Data communication
  - each chunk forwarded to its destination processing node
- A data server runs on every on-line data repository node, automating data delivery to the end user
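
To make the three steps concrete, here is a minimal C++ sketch of the data server loop. The helper names (read_chunk, send_to) are illustrative stubs, not the actual FREERIDE-G/ADR API, and the modulo assignment stands in for the real distribution policy:

#include <cstddef>
#include <cstdio>
#include <vector>

struct Chunk {
    std::size_t id;            // unique chunk identifier
    std::vector<char> bytes;   // raw data read from a repository disk
};

// Stub standing in for disk I/O: read one chunk (data retrieval).
static Chunk read_chunk(std::size_t id) { return Chunk{id, {}}; }

// Stub standing in for the network: forward a chunk (data communication).
static void send_to(int node, const Chunk& c) {
    std::printf("chunk %zu -> compute node %d\n", c.id, node);
}

// Data server loop: retrieve each chunk, assign it a destination
// compute node (data distribution), and forward it.
void data_server(std::size_t num_chunks, int num_compute_nodes) {
    for (std::size_t id = 0; id < num_chunks; ++id) {
        Chunk c = read_chunk(id);
        int dest = static_cast<int>(id % num_compute_nodes);  // hash on chunk ID
        send_to(dest, c);
    }
}

int main() { data_server(8, 4); return 0; }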
8. Compute Node Functionality
- Data communication
  - data chunks received from the corresponding data node
- Computation
  - application-specific processing performed on each chunk
- Data caching and retrieval
  - for multi-pass algorithms, data is cached locally on the first pass and retrieved locally for subsequent passes
- A compute server runs on every processing node to receive data and process it in an application-specific way
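
A minimal C++ sketch of the compute server's per-pass loop follows. The helpers (receive_chunk, process) and the cache path are assumptions for illustration; the real middleware handles this transparently:

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

using Chunk = std::vector<char>;

// Stub standing in for the network: receive a chunk from the paired data node.
static Chunk receive_chunk(std::size_t id) { return Chunk(16, static_cast<char>(id)); }

// Stub standing in for the application-specific reduction over one chunk.
static void process(const Chunk&) {}

// Assumed local-file-system cache location (illustrative path).
static std::string cache_path(std::size_t id) {
    return "/tmp/freeride_cache_" + std::to_string(id);
}

// One pass over the chunks assigned to this node. On the first pass,
// chunks arrive over the network and are written to the local cache;
// later passes read them back locally, avoiding re-communication.
void run_pass(std::size_t num_chunks, bool first_pass) {
    for (std::size_t id = 0; id < num_chunks; ++id) {
        Chunk c;
        if (first_pass) {
            c = receive_chunk(id);
            std::ofstream out(cache_path(id), std::ios::binary);
            out.write(c.data(), static_cast<std::streamsize>(c.size()));
        } else {
            std::ifstream in(cache_path(id), std::ios::binary);
            c.assign(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
        }
        process(c);   // application-specific computation on the chunk
    }
}

int main() {
    run_pass(4, true);    // pass 1: receive and cache
    run_pass(4, false);   // pass 2: read from the local cache
    return 0;
}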
9. Processing Structure of FREERIDE-G
- Built on FREERIDE
  - Key observation: most algorithms follow a canonical loop
- Middleware API
  - Subset of data to be processed
  - Reduction object
  - Local and global reduction operations
  - Iterator
- Supports
  - Disk-resident datasets
  - Shared and distributed memory
While (not done)
  forall (data instances d)
    i = process(d)
    R(i) = R(i) op d
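
As an illustration of the canonical loop, below is a self-contained 1-D k-means sketch in plain C++: process(d) picks the nearest centroid (the reduction index i), and the reduction object accumulates per-centroid sums and counts. It mirrors the loop structure above but is not the actual FREERIDE API:

#include <cmath>
#include <cstdio>
#include <vector>

// Reduction object R: one accumulator per centroid.
struct ReductionObject {
    std::vector<double> sum;
    std::vector<int> count;
};

// process(d): map a data instance to its reduction index
// (here, the index of the nearest centroid).
static int nearest(double d, const std::vector<double>& centers) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(centers.size()); ++i)
        if (std::fabs(d - centers[i]) < std::fabs(d - centers[best]))
            best = i;
    return best;
}

int main() {
    std::vector<double> data = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
    std::vector<double> centers = {0.0, 10.0};          // initial centroids
    for (int pass = 0; pass < 10; ++pass) {             // while (not done)
        ReductionObject R{std::vector<double>(2, 0.0),
                          std::vector<int>(2, 0)};
        for (double d : data) {                         // forall (data instances d)
            int i = nearest(d, centers);                // i = process(d)
            R.sum[i] += d;                              // R(i) = R(i) op d
            R.count[i] += 1;
        }
        for (int i = 0; i < 2; ++i)                     // update centroids from R
            if (R.count[i] > 0) centers[i] = R.sum[i] / R.count[i];
    }
    std::printf("centroids: %.2f %.2f\n", centers[0], centers[1]);
    return 0;
}

Because the reduction object is the only shared state, the same loop can run on shared or distributed memory by replicating R per processor and merging the copies, which is what makes the common high-level interface possible.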
10. Summary of Implementation Issues
- Managing and communicating remote data
  - two-way coordination required
- Load distribution
  - needed if the compute cluster is bigger than the data cluster
- Parallel processing on the compute cluster
  - FREERIDE-G supports generalized reductions
- Caching
  - benefits multi-pass algorithms
11. Remote Data Issues
- Managing data communication
  - ADR library used for scheduling and performing data retrieval at the repository site
  - communication timing coordinated between source and destination
- Caching
  - local file system used for caching
  - avoids redundant communication of data for (P-1)/P of the iterations of a P-pass algorithm (e.g., a 5-pass algorithm communicates data over the network only on the first pass, avoiding 4/5 of the transfers)
12. Parallel Data Processing Issues
- Load distribution
  - needed when more compute nodes are available than data nodes
  - hashing on a unique chunk ID assigns each chunk to a compute node
- Parallel processing on the compute cluster
  - after data is distributed, a local reduction is performed on every node
  - reduction objects gathered at the master node
  - global combination (reduction) performed on the master node (sketched below)
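
A minimal sketch of the global combination step, with the per-node local results simulated in-process; in FREERIDE-G the reduction objects would be gathered over the network at the master node:

#include <cstdio>
#include <vector>

// A trivially mergeable reduction object (here: a running sum and count).
struct ReductionObject {
    double sum = 0.0;
    long count = 0;
};

// Global reduction: merge one node's local result into the master's copy.
static void combine(ReductionObject& global, const ReductionObject& local) {
    global.sum += local.sum;
    global.count += local.count;
}

int main() {
    // Local reduction results from four compute nodes (simulated values).
    std::vector<ReductionObject> locals = {
        {10.0, 4}, {6.5, 3}, {12.0, 5}, {3.5, 2}};

    ReductionObject global;                      // held at the master node
    for (const auto& r : locals) combine(global, r);

    std::printf("global mean = %f over %ld points\n",
                global.sum / global.count, global.count);
    return 0;
}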
13. Application Summary
- Data Mining
  - K-nearest neighbor search
  - K-means clustering
  - EM clustering
- Scientific Feature Mining
  - Vortex detection in the fluid flow dataset
  - Molecular defect detection in the molecular dynamics dataset
14. Goals for Experimental Evaluation
- Evaluating parallel scalability of the applications developed
  - numbers of data and compute nodes kept equal while the parallel configuration varies
- Evaluating scalability of compute nodes
  - number of compute nodes varied independently of the number of data nodes
- Evaluating benefits of caching
  - multi-pass algorithms evaluated
15. Evaluating Overall Scalability
- Cluster of 700 MHz Pentiums
- Connected through Myrinet LANai 7.0 (no access to a high-bandwidth network)
- Equal number of repository and compute nodes
16. Overall Scalability
- All 5 applications tested
- High parallel efficiency
- Good scalability with respect to
  - problem size
  - number of processing nodes
17. Evaluating Scalability of Compute Nodes
- Compute cluster size is greater than data repository cluster size
- Applications (single-pass only)
  - kNN search
  - molecular defect detection
  - vortex detection (next slide)
- Parallel configurations
  - data nodes: 1 to 8
  - compute nodes: 1 to 16
18. Compute Node Scalability
- Only the data processing work is parallelized
  - data retrieval and communication times not affected
- Speedups are sub-linear
- Better resource utilization leads to a decrease in analysis time
19. Evaluating Effects of Caching
- Simulated network bandwidth: 500 KB/sec
- Caching vs. non-caching versions compared
- Comparing data communication times (P passes)
  - factor-of-P decrease from caching
- Caching benefit depends on
  - application
  - network bandwidth
20. Related Work
- Support for grid-based data mining
  - Knowledge Grid toolset
  - Grid-Miner toolkit
  - Discovery Net layer
  - DataMiningGrid framework
  - no interface for easing parallelization and abstracting data movement
- GRIST: support for astronomy-related mining on the grid
  - specific to the astronomical domain
- FREERIDE-G is built directly on top of FREERIDE
21. Conclusions
- FREERIDE-G supports remote data analysis from a high-level interface
- Evaluated on a variety of algorithms
- Demonstrated scalability in terms of
  - even data-compute scale-up
  - compute node scale-up (processing time only)
- Multi-pass algorithms benefit from data caching
22. Continuing Work on FREERIDE-G
- High-bandwidth network evaluation
- Performance-prediction-based resource selection
- Resource allocation
- More sophisticated caching and data communication mechanisms (SRB)
- Data format issues: wrapper integration
- Higher-level front-end to further ease development of data analysis tools for the grid