Title: The Efficient Handling of BLAST Applications on the GRID
- Hurng-Chun Lee (1) and Jakub Moscicki (2)
- (1) Academia Sinica Computing Centre, Taiwan
- (2) CERN IT-GD-ED, Switzerland
- The considerations of distributing BLAST jobs
- The master-worker computing model of BLAST
- mpiBLAST
- The Gridified BLAST
- Summary
The considerations of distributing BLAST jobs
- BLAST has been widely and routinely used for sequence analysis
- An essential component of most bioinformatics and life science applications
- Problem complexity: O(Sq × Sd)
- Sq: the query size
- Sd: the database size
- In most cases, Sd >> Sq
- e.g. Sq ~ O(MB), Sd ~ O(GB)
- Moving the query is cheaper than moving the database
- Database management, storage and sharing issues
- Replication, Archive
- Privacy, Security
- Other considerations for service provision
- scalability, robustness
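The size argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming illustrative sizes and a hypothetical 10 MB/s wide-area link; none of these numbers are measurements from the talk.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time needed to ship `size_bytes` over a link of the given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


MB = 10**6
GB = 10**9

# Illustrative values only (assumptions, not figures from the slides).
query_size = 1 * MB       # Sq of order MB
database_size = 1 * GB    # Sd of order GB
bandwidth = 10 * MB       # hypothetical 10 MB/s link between sites

query_time = transfer_seconds(query_size, bandwidth)
database_time = transfer_seconds(database_size, bandwidth)

print(f"move query:    {query_time:.1f} s")
print(f"move database: {database_time:.1f} s")
print(f"ratio:         {database_time / query_time:.0f}x")
```

Under these assumptions the database transfer costs three orders of magnitude more than the query transfer, which is why the query is sent to where the database already lives.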
The master-worker model of BLAST
- Database splitting is the easiest way to distribute BLAST jobs
- Fragmented databases avoid memory swapping
- Each subtask can be 100% independent
- Each worker requests tasks from the master (pull model) and runs a normal BLAST search
- The individual results can be easily merged by the master process
- Report generation (BioSeq fetching)
- A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
- The master-worker model can also be applied within each single-query search
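The pull model described above can be sketched in a few lines. This is a toy illustration, not code from mpiBLAST: `toy_search` is a hypothetical stand-in for a real BLAST search (it only counts exact substring matches), and local threads stand in for workers that would really be processes on cluster nodes.

```python
import queue
import threading


def toy_search(query, fragment):
    """Toy stand-in for BLAST: (subject, hit count) for subjects containing the query."""
    return [(subject, subject.count(query)) for subject in fragment if query in subject]


def worker(tasks, results, query):
    """Pull tasks until the master's queue is empty, searching one fragment per task."""
    while True:
        try:
            frag_id, fragment = tasks.get_nowait()  # pull the next task from the master
        except queue.Empty:
            return
        results.put((frag_id, toy_search(query, fragment)))


def master(query, database, n_fragments, n_workers):
    """Split the database, let workers pull fragments, then merge the partial results."""
    fragments = [database[i::n_fragments] for i in range(n_fragments)]
    tasks, results = queue.Queue(), queue.Queue()
    for item in enumerate(fragments):
        tasks.put(item)
    threads = [threading.Thread(target=worker, args=(tasks, results, query))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    merged = []
    while not results.empty():
        merged.extend(results.get()[1])
    # Merge step: best score first, ties broken by subject name.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


database = ["ACGTACGT", "TTTT", "GGACGT", "ACGACG", "CCCC", "TACGTA"]
hits = master("ACGT", database, n_fragments=3, n_workers=2)
print(hits)
```

Because each fragment is searched independently, the merged result does not depend on which worker handled which fragment or in what order.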
mpiBLAST (LANL, US) http://mpiblast.lanl.gov
- The MPI implementation of the BLAST master-worker model
- Advantages
- High throughput
- Load balancing
- Running in a local cluster
- Performance and problem size are still limited by local computing power
- Simultaneous I/O to a centralized database causes a performance bottleneck
- Database sharing is still difficult
mpiBLAST-g2 (ASCC, Taiwan and PRAGMA)
- A GT2-enabled parallel BLAST that runs on the Grid
- MPICH-g2
- Enhancements over mpiBLAST by ASCC
- Cross-cluster job execution
- Remote database sharing
- Helper tools for
- database replication
- automatic resource specification and job submission (with static resource table)
- multi-query job splitting and result merging
- Close link with the mpiBLAST development team
- New mpiBLAST patches can be quickly applied to mpiBLAST-g2
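Multi-query job splitting of the kind listed among the helper tools can be illustrated as follows. This is a minimal sketch assuming the queries arrive as multi-FASTA text; `split_fasta` is a hypothetical helper written for this example and is not the actual mpiBLAST-g2 tool.

```python
def split_fasta(text: str) -> list[str]:
    """Split multi-FASTA text into one single-record FASTA string per query."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif line.strip() and current:
            # Sequence line belonging to the current record.
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


multi = ">q1 first query\nACGT\nGGCC\n>q2 second query\nTTAA\n"
parts = split_fasta(multi)
for part in parts:
    print(repr(part))
```

Each element of `parts` can then be submitted as an independent single-query job, and the per-query reports concatenated in the original order afterwards.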
SC2004 mpiBLAST-g2 demonstration
mpiBLAST-g2 current deployment
-- From PRAGMA GOC http://pragma-goc.rocksclusters
mpiBLAST-g2 Performance Evaluation (perfect case)
Figure: elapsed time, broken down into searching, merging, BioSeq fetching, and overall
- Database: est_human, 3.5 GBytes
- Queries: 441 test sequences, 300 KBytes
- Overall speedup is approximately linear
mpiBLAST-g2 Performance Evaluation (worst case)
Figure: elapsed time, broken down into searching, merging, BioSeq fetching, and overall
- Database: drosophila NT, 122 MBytes
- Queries: 441 test sequences, 300 KBytes
- The overall speedup is limited by the unscalable BioSeq fetching
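The contrast between the two evaluation cases can be reasoned about with an Amdahl's-law style model. This framing is an interpretation added here, not a formula from the talk: it assumes the search time divides evenly across workers while merging and BioSeq fetching stay serial, and all timings are hypothetical.

```python
def elapsed(n_workers: int, search_s: float, serial_s: float) -> float:
    """Elapsed time when only the search phase is parallelized."""
    return search_s / n_workers + serial_s


def speedup(n_workers: int, search_s: float, serial_s: float) -> float:
    """Speedup relative to a single worker under the same model."""
    return elapsed(1, search_s, serial_s) / elapsed(n_workers, search_s, serial_s)


# Hypothetical timings in seconds (assumptions for illustration only).
large_db = {"search_s": 1000.0, "serial_s": 10.0}   # search dominates
small_db = {"search_s": 100.0, "serial_s": 50.0}    # fetching is a large share

for n in (1, 2, 4, 8):
    print(n, round(speedup(n, **large_db), 2), round(speedup(n, **small_db), 2))

# With unlimited workers the speedup approaches (search + serial) / serial.
print("small-db ceiling:", (100.0 + 50.0) / 50.0)
```

In this model a large database keeps the speedup close to linear, while a small database hits a low ceiling set by the serial part, which matches the qualitative behaviour reported on the two slides.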
Issues of mpiBLAST-g2
- A single error crashes the whole job
- This follows from the nature of MPICH
- Errors may be caused by transient problems in the loosely coupled Grid environment
- An MPI job starts only when all resources are available
- Resources differ in their level of availability
- Error recovery is required for
- providing a robust application service on the Grid
- using the Grid resources efficiently
- Asynchronous task dispatching/pulling is needed to use available resources immediately
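The cost of waiting for all resources, compared with letting workers pull tasks as soon as they appear, can be estimated with a small scheduling simulation. This is a sketch under simplifying assumptions (equal-length tasks, no failures, hypothetical ready times); the function names are invented for this example.

```python
import heapq


def _run(free_at, n_tasks, task_s):
    """Greedy scheduling: each task goes to the worker that becomes free first."""
    heap = list(free_at)
    heapq.heapify(heap)
    finish = 0.0
    for _ in range(n_tasks):
        start = heapq.heappop(heap)
        done = start + task_s
        finish = max(finish, done)
        heapq.heappush(heap, done)
    return finish


def makespan_barrier(ready_times, n_tasks, task_s):
    """MPI-style start: nothing runs until the last resource is ready."""
    start = max(ready_times)
    return _run([start] * len(ready_times), n_tasks, task_s)


def makespan_pull(ready_times, n_tasks, task_s):
    """Pull model: every worker starts taking tasks as soon as it is ready."""
    return _run(list(ready_times), n_tasks, task_s)


# Hypothetical scenario: three workers ready at once, one delayed by 60 s.
ready = [0, 0, 0, 60]
print("barrier:", makespan_barrier(ready, n_tasks=12, task_s=10))
print("pull:   ", makespan_pull(ready, n_tasks=12, task_s=10))
```

In this scenario the three early workers finish all twelve tasks before the delayed worker even becomes available, so the pull model completes well before the barrier-style job would have started.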
DIANE http://cern.ch/diane
- DIstributed ANalysis Environment
- Lightweight distributed framework for parallel scientific applications in the master-worker model
- A perfect match for the mpiBLAST computing model
- Current applications
- BLAST for Genomic Sequence Analysis (DIANE-BLAST)
- Geant4 Simulation for Radiotherapy and Astrophysics
- Image Rendering
- Data Analysis for High Energy Physics
DIANE Features
Pull Model
Batch and Interactive
- Rapid prototyping
- Python and CORBA
- Error recovery
- Heartbeat worker health check
- Resubmission of failed tasks
- User defined error recovery method
- No outbound connectivity required
- Proxy for workers with only private IPs
- Job submitters for
- Simple fork
- Condor, LSF, SGE, PBS
- GT2, LCG, gLite
Distributed workers
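The heartbeat check and task resubmission listed above can be sketched as a small bookkeeping class on the master side. This is an illustration of the idea only: `TaskTracker` and its methods are hypothetical names and do not reflect the actual DIANE API. Time is passed in explicitly so the example is deterministic.

```python
from collections import deque


class TaskTracker:
    """Master-side bookkeeping: hands out tasks and requeues those of silent workers."""

    def __init__(self, tasks, heartbeat_timeout_s):
        self.pending = deque(tasks)
        self.running = {}      # worker id -> task currently assigned
        self.last_seen = {}    # worker id -> time of last heartbeat
        self.done = {}         # task -> result
        self.timeout = heartbeat_timeout_s

    def heartbeat(self, worker_id, now):
        self.last_seen[worker_id] = now

    def pull(self, worker_id, now):
        """A worker asks for work; returns a task or None if nothing is pending."""
        self.heartbeat(worker_id, now)
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker_id] = task
        return task

    def complete(self, worker_id, result, now):
        self.heartbeat(worker_id, now)
        self.done[self.running.pop(worker_id)] = result

    def reap(self, now):
        """Requeue tasks held by workers whose last heartbeat is too old."""
        dead = [w for w, seen in self.last_seen.items()
                if now - seen > self.timeout and w in self.running]
        for w in dead:
            self.pending.append(self.running.pop(w))
        return dead


tracker = TaskTracker(["frag0", "frag1"], heartbeat_timeout_s=30)
tracker.pull("w1", now=0)                      # w1 takes frag0, then goes silent
tracker.pull("w2", now=0)                      # w2 takes frag1
tracker.complete("w2", "hits-frag1", now=10)
tracker.heartbeat("w2", now=40)
dead = tracker.reap(now=45)                    # w1 has been silent for 45 s
retry = tracker.pull("w2", now=46)             # frag0 is handed out again
tracker.complete("w2", "hits-frag0", now=60)
print(dead, retry, sorted(tracker.done))
```

A user-defined recovery method, as mentioned on the slide, would replace the simple requeue inside `reap` with whatever policy the application prefers.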
DIANE-BLAST implementation
- Splitting mpiBLAST-g2 into DIANE components
- Master (Planner and Integrator), Worker
- Wrapping each component with Python
- Hooking the core BLAST C libraries into Python with SWIG
- Implementing the DIANE GT2 job submitter
- For running workers on the GT2-enabled clusters
- Reusing the deployed databases for mpiBLAST-g2
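The division into Planner, Worker and Integrator can be pictured with the following sketch. The class and method names are hypothetical and chosen for this example; they are not the DIANE interfaces, and `fake_search` is a placeholder for the call into the SWIG-wrapped BLAST libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    query: str
    fragment_id: int


class Planner:
    """Master side: turns a job into independent (query, fragment) tasks."""

    def __init__(self, n_fragments):
        self.n_fragments = n_fragments

    def plan(self, queries):
        return [Task(q, f) for q in queries for f in range(self.n_fragments)]


class Worker:
    """Worker side: runs the search for one task."""

    def __init__(self, search):
        self.search = search

    def run(self, task):
        return task, self.search(task.query, task.fragment_id)


class Integrator:
    """Master side: merges per-fragment hits into one hit list per query."""

    def __init__(self):
        self.hits = {}

    def integrate(self, task, hits):
        self.hits.setdefault(task.query, []).extend(hits)

    def report(self):
        return {query: sorted(hits) for query, hits in self.hits.items()}


def fake_search(query, fragment_id):
    """Placeholder for the real BLAST call (assumption for this sketch)."""
    return [f"{query}-hit-in-frag{fragment_id}"]


planner, worker, integrator = Planner(n_fragments=2), Worker(fake_search), Integrator()
for task in planner.plan(["q1", "q2"]):
    integrator.integrate(*worker.run(task))
report = integrator.report()
print(report)
```

Keeping the three roles separate is what lets the framework, rather than the application, decide where workers run and how failed tasks are retried.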
mpiBLAST-g2 vs. DIANE-BLAST: The Speedup
- Query
- Drosophila chromosome 4
- Size: 1.2 Mbp
- DB
- Drosophila nucleotide sequence database
- Size: 1170 sequences, 122 Mbp
- Number of fragments: 32
- Computing resource
- Available CPUs: 12
- PIII 1.4 GHz
- 1 GByte memory
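With 32 fragments and 12 CPUs, the fragment count alone already bounds the achievable speedup. The following is a back-of-the-envelope sketch, assuming every fragment takes the same time to search, fragments cannot be subdivided, and all CPUs act as workers (the talk does not state how many are reserved for the master).

```python
import math


def ideal_speedup(n_fragments: int, n_cpus: int) -> float:
    """Upper bound on speedup for equal-cost, indivisible fragments."""
    rounds = math.ceil(n_fragments / n_cpus)  # waves of fragments processed in parallel
    return n_fragments / rounds


for cpus in (1, 2, 4, 8, 12):
    print(cpus, round(ideal_speedup(32, cpus), 2))
```

Under these assumptions 12 CPUs need three rounds to cover 32 fragments, so the bound is 32/3, roughly 10.7, rather than 12; the last round leaves some CPUs idle.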
mpiBLAST-g2 vs. DIANE-BLAST: The Worker Lifeline
- DIANE-BLAST task dispatching
- Handled by DIANE's task thread
- Due to the bugs in the current DIANE release
- mpiBLAST-g2 task dispatching
- mpiBLAST-g2 task handling logic
mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons
- mpiBLAST-g2
- Master-worker model implemented with the MPICH-g2 libraries
- Gridification efforts
- Implementing database sharing with the GASSCOPY API
- Recompilation with the MPICH-g2 and GT2 libraries
- Error recovery
- Needs a fault-tolerant MPI
- Cross-cluster computation
- Requires outbound connectivity on each worker
- Performance/Throughput
- In-cluster performance is as good as the original
- DIANE-BLAST
- Pluggable application for the DIANE master-worker framework
- Gridification efforts
- Through the gridified DIANE framework
- Error recovery
- Task resubmission
- Tracking the health of each worker
- Cross-cluster computation
- Using a proxy for workers with private IPs
- Performance/Throughput
- Performance can be tuned by controlling the job
Summary
- Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficiently handling BLAST jobs on the Grid
- Both implementations are based on the master-worker model for distributing BLAST jobs on the Grid
- mpiBLAST-g2 has good scalability and speedup in some cases
- It requires a fault-tolerant MPI implementation for error recovery
- In the unscalable cases, BioSeq fetching is the bottleneck
- DIANE-BLAST provides a flexible mechanism for error recovery
- Any master-worker workflow can be easily plugged into this framework
- The job thread control should be improved to achieve good performance and scalability
Thanks for your attention!