Title: A GT3 based BLAST grid service for biomedical research
Micha Bayer1, Aileen Campbell2, Davy Virdee2
1National e-Science Centre, e-Science Hub, Kelvin Building, University of Glasgow, Glasgow G12 8QQ
2Edikt, National e-Science Centre, e-Science Institute, 15 South College Street, Edinburgh EH8 9AA
- Overview
- BLAST is a well-known program for biological sequence comparison
- used to compare query sequences to a set of target sequences in order to find similar sequences in the target set
- can be extremely compute intensive
- we present a parallel implementation of BLAST delivered via a GT3 grid service
- part of the BRIDGES project, a UK e-Science project aimed at providing a grid based environment for research into the genetic causes of hypertension (http://www.brc.dcs.gla.ac.uk/projects/bridges/)
- Scheduler Algorithm
- parse input and count no. of query sequences
- poll resources and establish total no. of idle nodes
- set number of sub-jobs to be run equal to total no. of idle nodes
- calculate no. of sequences to be run per sub-job, n (= no. of query sequences / no. of idle nodes)
- while there are sequences left, save n sequences to a sub-job input file
- if the number of idle nodes is 0, make up a small number of sub-jobs (currently hardcoded to 5) and evenly distribute these into queues across resources
- else, for each resource, send i sub-jobs to the resource as separate threads, where i is the number of idle nodes on the resource
- when results are complete, save them to file in the original input file order and return this to the user
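The sub-job sizing and partitioning steps above can be sketched in Java as follows. This is a minimal illustration, not the BRIDGES scheduler itself; the class and method names are assumptions, and the fallback of 5 sub-jobs matches the hardcoded value described in the algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the scheduler's sub-job sizing step
// (names hypothetical; not the actual BRIDGES code).
public class SchedulerSketch {

    // Sequences per sub-job: total query sequences divided by idle nodes.
    // When no idle nodes are reported, fall back to a small fixed number
    // of sub-jobs (5, as in the algorithm above).
    static int sequencesPerSubJob(int numSequences, int idleNodes) {
        int subJobs = idleNodes > 0 ? idleNodes : 5;
        return Math.max(1, (int) Math.ceil((double) numSequences / subJobs));
    }

    // Partition sequence indices into batches of at most n,
    // one batch per sub-job input file.
    static List<List<Integer>> partition(int numSequences, int n) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < numSequences; start += n) {
            List<Integer> batch = new ArrayList<>();
            for (int i = start; i < Math.min(start + n, numSequences); i++) {
                batch.add(i);
            }
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        // e.g. 100 query sequences across 8 idle nodes
        int n = sequencesPerSubJob(100, 8);
        System.out.println(partition(100, n).size() + " sub-jobs of up to " + n + " sequences");
    }
}
```

Rounding n up ensures every sequence lands in some sub-job even when the counts do not divide evenly.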
- Parallel BLAST
- to achieve maximum performance in a grid context, we have parallelised BLAST
- multiple query sequences are partitioned into sub-jobs on the basis of the number of idle compute nodes available and then processed on these in batches
- we have provided our own Java based scheduler which distributes sub-jobs across an array of resources
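The partitioning of multiple query sequences into sub-job input files might look like the sketch below, which splits a multi-FASTA query into chunks. This is a hypothetical illustration under the assumption that queries arrive in FASTA format; it is not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: split a multi-FASTA query into per-sub-job inputs
// (hypothetical helper, not the BRIDGES implementation).
public class FastaSplitter {

    // Split FASTA text into individual ">..." records.
    static List<String> records(String fasta) {
        List<String> recs = new ArrayList<>();
        for (String part : fasta.split("(?m)^>")) {
            if (!part.isBlank()) recs.add(">" + part.trim());
        }
        return recs;
    }

    // Group records into sub-job inputs of at most n sequences each;
    // each string would be written out as one sub-job input file.
    static List<String> subJobInputs(String fasta, int n) {
        List<String> recs = records(fasta);
        List<String> jobs = new ArrayList<>();
        for (int i = 0; i < recs.size(); i += n) {
            jobs.add(String.join("\n", recs.subList(i, Math.min(i + n, recs.size()))));
        }
        return jobs;
    }

    public static void main(String[] args) {
        String fasta = ">q1\nACGT\n>q2\nGGCC\n>q3\nTTAA";
        System.out.println(subJobInputs(fasta, 2).size() + " sub-job inputs");
    }
}
```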
- System Architecture
- grid service uses GT3.0.2 core only
- we have provided our own wrappers for the OpenPBS client side and the Condor submission components
- a scheduler component examines the input, polls resources for available processors and farms out sub-tasks to the resources
- details of resources (i.e. clusters) are held in a single XML config file; adding new resources is easy
- target databases are located on execute nodes or on the cluster master node to minimise stage-in time; these need updating regularly
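A resource entry in the XML config file might take a shape like the fragment below. The element and attribute names here are purely illustrative assumptions; the actual BRIDGES schema is not shown in this poster.

```xml
<!-- Hypothetical resource config; element names are illustrative only -->
<resources>
  <resource name="scotgrid" type="openpbs">
    <contact>masternode.example.ac.uk</contact>
    <maxNodes>250</maxNodes>
    <blastDbPath>/local/blast/db</blastDbPath>
  </resource>
  <resource name="nesc-condor" type="condor">
    <contact>condor.example.ac.uk</contact>
    <maxNodes>25</maxNodes>
    <blastDbPath>/local/blast/db</blastDbPath>
  </resource>
</resources>
```

Adding a new cluster would then amount to appending one more resource element, which matches the claim that adding resources is easy.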
- Client Side
- users of the service range from expert to low computer literacy
- delivery mechanism chosen was therefore via the BRIDGES web portal (see below)
- Java based graphical client to the service is downloaded via Java Web Start
- allows for easy, centralised updates
- also provides a good opportunity to explore client side Globus
- Design Issues
- no suitable metaschedulers were available at the time of designing the system, so we had to write our own
- system only uses the GT3 core as a thin layer between client side and scheduler, since full GT3 was due to be replaced by WSRF; this minimises future porting effort
- Summary
- We have constructed a parallelised BLAST service that farms out multiple query sequences as sub-jobs to a pool of resources.
- Our scheduler runs over OpenPBS and Condor resources via our own Java wrappers.
- Client side delivery is through a Java GUI delivered via a web portal and Java Web Start.
- Compute Resources Used
- ScotGRID compute cluster at Glasgow Univ.: a 250 processor Linux cluster
- Condor pool at National e-Science Centre, Glasgow Univ.: 25 desktop machines, single processors
- Contact / Further Information
- BRIDGES website and portal at http://www.brc.dcs.gla.ac.uk/projects/bridges/
- email contact: michab_at_dcs.gla.ac.uk