1
FREERIDE-G Framework for Developing Grid-Based
Data Mining Applications
  • L. Glimcher, R. Jin, G. Agrawal
  • Presented by Leo Glimcher
  • glimcher@cse.ohio-state.edu

2
Distributed Data-Intensive Science
[Diagram: a user, a compute cluster, and a data repository cluster, connected to illustrate distributed data-intensive analysis]
3
Challenges for Application Development
  • Analysis of large amounts of disk-resident data
  • Incorporating parallel processing into the analysis
  • Processing needs to be independent of other elements and easy to specify
  • Coordination of storage, network, and computing resources is required
  • Transparency of data retrieval, staging, and caching is desired

4
FREERIDE-G Goals
  • Support High-End Processing
  • Enable efficient processing of large-scale data mining computations
  • Ease Use of Parallel Configurations
  • Support shared- and distributed-memory parallelization starting from a common high-level interface
  • Hide Details of Data Movement and Caching
  • Data staging and caching (when feasible/appropriate) need to be transparent to the application developer

5
Presentation Road Map
  • Motivation and goals
  • System architecture and overview
  • Applications used for evaluation
  • Experimental evaluation
  • Related work in distributed data-intensive
    science
  • Conclusions and future work

6
FREERIDE-G Architecture
[Architecture diagram: each data repository node runs a data server with data retrieval, data distribution, and communication components; each compute node in the user cluster runs communication, data processing, and caching/retrieval components]
7
Data Server Functionality
  • Data retrieval
  • data chunks are read from repository disks
  • Data distribution
  • each chunk is assigned a destination processing node in the user cluster
  • Data communication
  • each chunk is forwarded to its destination processing node
  • A data server runs on every online data repository node, automating data delivery to the end user (a sketch of this loop follows below)
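The delivery loop above can be pictured with a short, self-contained sketch. It is only an illustration of the idea, not FREERIDE-G code: the Chunk type and the read_chunk/send_to helpers are hypothetical, and the modulo-on-chunk-ID destination choice anticipates the hashing scheme mentioned on the load distribution slide.

#include <cstdint>
#include <cstdio>
#include <vector>

struct Chunk {
  std::uint64_t id;          // unique chunk identifier
  std::vector<char> bytes;   // raw contents read from a repository disk
};

// Hypothetical stand-ins for repository I/O and the network layer.
Chunk read_chunk(std::uint64_t index) { return Chunk{index, {}}; }
void send_to(std::uint64_t node, const Chunk& c) {
  std::printf("chunk %llu -> compute node %llu\n",
              (unsigned long long)c.id, (unsigned long long)node);
}

// One data server: retrieve each chunk, pick a destination, forward it.
void data_server_loop(std::uint64_t num_chunks, std::uint64_t num_compute_nodes) {
  for (std::uint64_t i = 0; i < num_chunks; ++i) {
    Chunk c = read_chunk(i);                        // data retrieval
    std::uint64_t dest = c.id % num_compute_nodes;  // data distribution
    send_to(dest, c);                               // data communication
  }
}

int main() { data_server_loop(4, 2); }  // e.g., 4 chunks delivered to 2 compute nodes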

8
Compute Node Functionality
  • Data communication
  • data chunks are received from the corresponding data node
  • Computation
  • application-specific processing is performed on each chunk
  • Data caching and retrieval
  • for multi-pass algorithms, data is cached locally on the 1st pass and retrieved locally for subsequent passes
  • A compute server runs on every processing node to receive data and process it in an application-specific way (a sketch of one pass follows below)
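A sketch of a single pass on one compute node, covering the receive / process / cache behaviour listed above. The receive_chunk, read_cached_chunk, cache_locally, and process helpers are hypothetical stubs standing in for the communication, caching, and application layers; they are not the actual compute server interface.

#include <optional>
#include <vector>

struct Chunk { std::vector<char> bytes; };

// Hypothetical stubs; here they return "no data" so the demo loop exits immediately.
std::optional<Chunk> receive_chunk()     { return std::nullopt; }  // next chunk from the data node
std::optional<Chunk> read_cached_chunk() { return std::nullopt; }  // next chunk from the local cache
void cache_locally(const Chunk&) { /* write the chunk to the local file system */ }
void process(const Chunk&)       { /* application-specific processing of one chunk */ }

// A single pass over the data on one compute node.
void compute_node_pass(bool first_pass) {
  while (auto c = first_pass ? receive_chunk() : read_cached_chunk()) {
    if (first_pass) cache_locally(*c);  // cache only on the 1st pass
    process(*c);                        // process every chunk on every pass
  }
}

int main() {
  compute_node_pass(true);   // 1st pass: chunks arrive over the network and are cached
  compute_node_pass(false);  // later passes: chunks are read back from the local cache
}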

9
Processing structure of FREERIDE-G
  • Built on FREERIDE
  • Key observation: most algorithms follow a canonical loop (below)
  • Middleware API
  • Subset of data to be processed
  • Reduction object
  • Local and global reduction operations
  • Iterator
  • Supports
  • Disk-resident datasets
  • Shared and distributed memory

While ( ) {
  forall (data instances d) {
    I = process(d)
    R(I) = R(I) op d
  }
}
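Read as an API, the loop above says the application supplies process() and the reduction operation while the middleware iterates over the data. The following is a minimal sketch of such a generalized reduction interface, assuming hypothetical names (ReductionObject, local_reduce, global_combine, run_pass) rather than the actual FREERIDE/FREERIDE-G classes.

#include <map>
#include <vector>

// Hypothetical generalized reduction interface in the spirit of the loop above.
struct ReductionObject {
  std::map<int, double> R;  // R(I): one accumulated value per reduction index I

  // Local reduction: fold one data instance d into the reduction object.
  void local_reduce(double d) {
    int I = static_cast<int>(d) % 10;  // stand-in for the application's process(d)
    R[I] += d;                         // stand-in for R(I) = R(I) op d
  }

  // Global combination: merge another node's reduction object into this one.
  void global_combine(const ReductionObject& other) {
    for (const auto& [I, v] : other.R) R[I] += v;
  }
};

// The middleware's side of the loop: apply local_reduce to every data instance
// in every chunk it delivers; repeated once per pass for multi-pass algorithms.
void run_pass(ReductionObject& robj, const std::vector<std::vector<double>>& chunks) {
  for (const auto& chunk : chunks)
    for (double d : chunk)
      robj.local_reduce(d);
}

int main() {
  ReductionObject r;
  run_pass(r, {{1.0, 2.0}, {3.0, 4.0}});  // two chunks of data instances
}

Because only the reduction object has to be shared or communicated, the same application code can be driven by shared-memory, distributed-memory, or remote-data back ends, which is the property behind the common high-level interface mentioned on the goals slide.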
10
Summary of implementation issues
  • Managing and communicating remote data
  • 2-way coordination required
  • Load distribution
  • if compute cluster bigger than data cluster
  • Parallel processing on compute cluster
  • FREERIDE-G supports generalized reductions
  • Caching
  • benefits multi-pass algorithms

11
Remote Data Issues
  • Managing data communication
  • the ADR library is used for scheduling and performing data retrieval at the repository site
  • communication timing is coordinated between source and destination
  • Caching
  • the local file system is used for caching
  • avoids redundant communication of data for (P-1)/P of the iterations, since only the 1st of P passes moves data over the network

12
Parallel data processing issues
  • Load distribution
  • Needed when more compute nodes are available than data nodes
  • Hashing on the unique chunk ID
  • Parallel processing on compute cluster (sketched below)
  • After data is distributed, a local reduction is performed on every node
  • Reduction objects are gathered at the Master node
  • Global combination (reduction) is performed on the Master node
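A sketch of the local reduction / gather / global combination sequence on this slide, simulating the compute nodes in one process for brevity. The ReductionObject repeats the hypothetical shape from the earlier sketch so this snippet stands alone; in the real system each local result would be computed on its own compute node and only the reduction objects would travel to the master.

#include <cstdio>
#include <map>
#include <vector>

struct ReductionObject {
  std::map<int, double> R;                        // R(I) accumulators
  void global_combine(const ReductionObject& o) { // merge another node's result
    for (const auto& [I, v] : o.R) R[I] += v;
  }
};

int main() {
  // Local reduction results, one per compute node; the values are made up
  // purely to show the flow (each would come from that node's chunks).
  ReductionObject a; a.R = {{0, 1.5}, {1, 2.0}};
  ReductionObject b; b.R = {{0, 0.5}, {2, 4.0}};
  ReductionObject c; c.R = {{1, 3.0}};
  std::vector<ReductionObject> locals = {a, b, c};

  // Gather at the Master node and perform the global combination there.
  ReductionObject global;
  for (const auto& r : locals) global.global_combine(r);

  for (const auto& [I, v] : global.R) std::printf("R(%d) = %g\n", I, v);
  return 0;
}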

13
Application Summary
  • Data Mining
  • K-Nearest Neighbor search
  • K-means clustering (see the sketch after this list)
  • EM clustering
  • Scientific Feature Mining
  • Vortex detection in the fluid flow dataset
  • Molecular defect detection in the molecular
    dynamics dataset
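K-means clustering is a convenient example of how an application from this list fits the canonical loop of the previous slide: process(d) picks the index I of the closest centroid, and the reduction object accumulates per-cluster sums and counts that a global combination across nodes would merge before updating the centroids. The sketch below is an illustrative single iteration on 1-D data, not code from the system.

#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

// One k-means iteration written as a generalized reduction (illustrative only).
int main() {
  const std::vector<double> points = {0.2, 0.9, 4.1, 5.0, 4.7};  // 1-D data instances
  const std::array<double, 2> centroids = {0.0, 5.0};            // current centroids, k = 2

  // Reduction object: per-cluster sum and count, indexed by cluster I.
  std::array<double, 2> sum = {0.0, 0.0};
  std::array<int, 2> count = {0, 0};

  for (double d : points) {
    // process(d): index of the closest centroid.
    int I = (std::abs(d - centroids[0]) <= std::abs(d - centroids[1])) ? 0 : 1;
    sum[I] += d;    // R(I) = R(I) op d
    count[I] += 1;
  }

  // After a global combination across nodes, the new centroids would be:
  for (int I = 0; I < 2; ++I)
    if (count[I] > 0) std::printf("centroid %d -> %g\n", I, sum[I] / count[I]);
  return 0;
}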

14
Goals for Experimental Evaluation
  • Evaluating parallel scalability of the applications developed
  • Numbers of data and compute nodes kept equal with
    variable parallel configurations
  • Evaluating scalability of compute nodes
  • Number of compute nodes kept independent of
    number of data nodes
  • Evaluating benefits of caching
  • Multi-pass algorithms evaluated

15
Evaluating Overall Scalability
  • Cluster of 700 MHz Pentiums
  • Connected through Myrinet LANai 7.0 (no access to a high-bandwidth network)
  • Equal number of repository and compute nodes

16
Overall Scalability
  • All 5 applications tested
  • High parallel efficiency
  • Good scalability with respect to
  • problem size
  • number of processing nodes

17
Evaluating Scalability of Compute Nodes
  • Compute cluster size is greater than data repository cluster size
  • Applications (single-pass only)
  • kNN search
  • molecular defect detection
  • vortex detection (next slide)
  • Parallel configurations
  • Data nodes: 1 to 8
  • Compute nodes: 1 to 16

18
Compute Node Scalability
  • Only the data processing work is parallelized
  • Data retrieval and communication times are not affected
  • Speedups are sub-linear
  • Better resource utilization leads to a decrease in analysis time

19
Evaluating effects of caching
  • Network bandwidth simulated at 500 KB/sec
  • Caching and non-caching versions compared
  • Comparing data communication times (P passes)
  • factor of P decrease from caching, since data crosses the network once instead of once per pass
  • Caching benefit depends on
  • application
  • network bandwidth

20
Related Work
  • Support for grid-based data mining
  • Knowledge Grid toolset
  • Grid-Miner toolkit
  • Discovery Net layer
  • DataMiningGrid framework
  • No interface for easing parallelization and
    abstracting data movement
  • GRIST: support for astronomy-related mining on the grid
  • Specific to the astronomical domain
  • FREERIDE-G is built directly on top of FREERIDE.

21
Conclusions
  • FREERIDE-G supports remote data analysis from a high-level interface
  • Evaluated on a variety of algorithms
  • Demonstrated scalability in terms of
  • Even data-compute scale-up
  • Compute node scale-up (processing time only)
  • Multi-pass algorithms benefit from data caching

22
Continuing Work on FREERIDE-G
  • High-bandwidth network evaluation
  • Performance-prediction-based resource selection
  • Resource allocation
  • More sophisticated caching and data communication mechanisms (SRB)
  • Data format issues: wrapper integration
  • Higher-level front-end to further ease development of data analysis tools for the grid
