Title: Computing the Smith-Waterman Algorithm on the Illinois Bio-Grid
Dave S. Angulo (1), Nigel M. Parsad (2), Tom Goodale (3), Gabrielle Allen (3), Ed Seidel (3)
(1) The School of Computer Science, Telecommunications and Information Systems, DePaul University, dangulo_at_cs.depaul.edu
(2) Kurt Rossman Labs, The University of Chicago, nmp_at_cs.uchicago.edu
(3) Albert Einstein Institute, Golm (AEI/MPG), goodale_at_aei-potsdam.mpg.de, allen_at_aei-potsdam.mpg.de, eseidel_at_aei-potsdam.mpg.de
Motivation
To exploit the prodigious computational resources of the Illinois Bio-Grid (IBG) by simultaneously querying multiple protein sequences against multiple protein sequence databases for homology. The Smith-Waterman algorithm will be used because it guarantees the optimal local pairwise alignment between two sequences under a given scoring scheme. Distributing both the database queries and the dynamic programming load in parallel should yield substantially greater efficiency than the single-sequence, single-database search that is the current computational biology standard.
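As a point of reference for the dynamic programming load mentioned above, the following is a minimal sketch of Smith-Waterman local alignment scoring. It is not the SWTask implementation: the function name, the simple match/mismatch scores (standing in for a protein substitution matrix), and the linear gap penalty are all assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumptions for illustration: a flat match/mismatch score instead of a
    substitution matrix, and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + step,   # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The work per pair grows with the product of the two sequence lengths, which is why distributing many such pairwise comparisons across Grid processors is attractive.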
[Figure: N x M pairwise protein alignments using Smith-Waterman. Labels: N-processor Grid; M protein sequence databases.]
Strategy
To develop and implement a Smith-Waterman software toolkit (SWTask) to run in the distributed environment of the IBG. This toolkit will be part of a larger IBG Bioinformatic Workbench whose modules will also allow for the Grid-enabled computation of the FASTA and BLAST algorithms. SWTask will include task farming, data acquisition, and Smith-Waterman software modules.
Task Farming Basics on the Grid
- Task Manager Hierarchy
- In the traditional Master/Slave task manager architecture, there are problems with slave startup and with communication between master and slave. Specific issues include authentication/authorization to start remote jobs, queues on remote resources, and firewalls between resources.
- A three-level hierarchy provides solutions to these issues:
  - Level 1: The Task Farm Manager (0), a.k.a. TFM(0), farms out tasks to remote resources on the Grid. It corresponds to the Master in the traditional Master/Slave architecture.
  - Level 2: A Task Farm Manager (1), a.k.a. TFM(1), is started on a queue for each remote resource assigned a task.
  - Level 3: The specific computational task. This level corresponds to the Slave in the traditional Master/Slave architecture.
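The three levels can be pictured with a small in-process sketch. Everything here is hypothetical (class names, the round-robin policy, local objects standing in for remote resources); the actual TFM(0) is implemented in Cactus and starts TFM(1)s on remote queues.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Level 3: one unit of computation (the Slave's work)."""
    task_id: int

    def run(self):
        return f"task {self.task_id} done"


@dataclass
class TFM1:
    """Level 2: a manager started on the queue of one remote resource."""
    resource: str
    results: list = field(default_factory=list)

    def run_task(self, task):
        self.results.append(task.run())


class TFM0:
    """Level 1: the top-level manager that farms tasks out to resources."""

    def __init__(self, resources):
        # One TFM(1) per resource; created locally in this sketch.
        self.workers = [TFM1(r) for r in resources]

    def farm_out(self, tasks):
        # Round-robin assignment is an assumption made for brevity.
        for i, task in enumerate(tasks):
            self.workers[i % len(self.workers)].run_task(task)
        return {w.resource: w.results for w in self.workers}
```

For example, `TFM0(["siteA", "siteB"]).farm_out([Task(i) for i in range(5)])` spreads five tasks over two hypothetical resources.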
[Figure: TFM(0), implemented in Cactus, uses TM modules to start remote TFM(1)s. The hierarchy is designed for the Grid. Tasks can be anything; in this case, the computation of a bioinformatics application.]
- Task Manager Module Structure
- The Task Farm Manager (TFM) uses the ASCA generic task farm module as well as the Task Farm Logic Manager module (TFLM). For TFM(0), ASCA(0) requests information from TFLM(0) regarding the minimum number of tasks that can be run (MinTasks), how many tasks are desired (DesiredTasks), and how many processors and how much memory are required per task (TaskRequirements).
- When a TFM(1) requests a task, the TFM(0) calls GetMoreTasks, which manages a list of task ids for uncompleted tasks. Then, for each task, TFM(1) calls GetInputFile, which provides the required parameters for the specific source files to be processed.
- The SWLM module is the logic manager specific to Smith-Waterman applications. SWLM provides information on which tasks to start and which parameters to use for each input file.
- The SWTask module (not shown) will communicate with the SWLM to get and process files on the task end.
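The split between a generic task farm and an application-specific logic manager can be sketched as an abstract interface plus one concrete subclass. The method names below mirror MinTasks, DesiredTasks, TaskRequirements, GetMoreTasks, and GetInputFile from the description above, but their signatures, return types, and the `ToySWLM` class are assumptions for illustration, not the Cactus API.

```python
from abc import ABC, abstractmethod


class TaskFarmLogicManager(ABC):
    """Generic part: what a task farm asks of any logic manager."""

    @abstractmethod
    def min_tasks(self): ...

    @abstractmethod
    def desired_tasks(self): ...

    @abstractmethod
    def task_requirements(self): ...

    @abstractmethod
    def get_more_tasks(self, count): ...

    @abstractmethod
    def get_input_file(self, task_id): ...


class ToySWLM(TaskFarmLogicManager):
    """Application-specific part: a toy Smith-Waterman logic manager."""

    def __init__(self, source_files):
        self._files = dict(enumerate(source_files))  # task id -> file name
        self._pending = list(self._files)            # ids not yet handed out

    def min_tasks(self):
        return 1

    def desired_tasks(self):
        return len(self._files)

    def task_requirements(self):
        # Hypothetical per-task needs: processors and memory in MB.
        return {"processors": 1, "memory_mb": 512}

    def get_more_tasks(self, count):
        handed_out, self._pending = self._pending[:count], self._pending[count:]
        return handed_out

    def get_input_file(self, task_id):
        return {"source_file": self._files[task_id]}
```

Keeping the generic interface separate means the same task farm could, in principle, drive a FASTA or BLAST logic manager without changes.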
[Figure labels: Generic Part; Application Specific]
Smith-Waterman Task Farming (SWTask) on the IBG
Task Management Scenario
A. The TFM(0) gives P TFM(1) processors individual directives to download, process, and own one source data file.
  i). Each TFM(1) processor downloads a source data file, strips off non-essential annotations, and stores the annotations on local disk.
  ii). Each TFM(1) processor saves the resulting stripped source file in memory for sequence alignment analysis using Smith-Waterman.
  iii). Each TFM(1) processor remains prepared to send and receive stripped source files to and from other TFM(1) peers on the Grid.
  iv). The TFM(0) keeps track of which TFM(1) processor owns which source file.
B. The TFM(0) gives T TFM(1) processors directives to obtain a second source data file (all or partial) from a TFM(1) peer.
  i). Each TFM(1) processor asks a TFM(1) peer for the second stripped source file.
  ii). Each TFM(1) processor then does a pairwise sequence comparison of the two files in memory.
  iii). Each TFM(1) processor then requests more work from the TFM(0). The TFM(0) may then direct the TFM(1) to ask a peer for a third file in a second thread.
C. The TFM(0) tracks and dynamically manages:
  i). TFM(1) progress.
  ii). TFM(1) task distribution based upon workload sharing and processor speed (via completion requests).
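The TFM(0) bookkeeping in steps A and B can be sketched as an ownership map plus a list of pair tasks. The class, method names, and task dictionary layout are hypothetical, and the sketch ignores partial files, threads, and load balancing.

```python
from itertools import combinations


class OwnershipTracker:
    """Toy TFM(0) bookkeeping: which TFM(1) owns which source file."""

    def __init__(self):
        self.owner_of = {}  # source file name -> TFM(1) id

    def assign_files(self, workers, files):
        # Step A: each worker downloads and owns one source file.
        for worker, name in zip(workers, files):
            self.owner_of[name] = worker

    def pair_tasks(self):
        # Step B: for each unordered pair of files, the owner of the first
        # file fetches the second from the peer that owns it.
        tasks = []
        for first, second in combinations(sorted(self.owner_of), 2):
            tasks.append({
                "worker": self.owner_of[first],
                "local": first,
                "fetch": second,
                "peer": self.owner_of[second],
            })
        return tasks
```

With three workers and three files this produces three pair tasks; a real manager would hand them out one at a time as workers request more work.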
Machines Involved
A. An N-processor Grid which dynamically allocates resources for client processes.
B. One processor designated as the Master Task Farm Manager, TFM(0).
C. M processors designated as the Worker Task Farm Managers, TFM(1).
Data Involved
A. P source data files (estimated at 140) from a sequence database.
B. Each data file has perhaps 100,000 sequence strings with potentially 4,000 characters per string.
C. P can be broken into subsets P1, P2, etc.
Tasks Involved
A. Download P source data files. The total number of characters to compare is approximately 56,000,000,000, i.e. 5.6 x 10^10 (140 source files x 100,000 sequence strings x 4,000 characters per string).
B. Complete a W x W character expression. For two source files P1 and P2, consider P1 x P2.
C. Since P1 x P2 = P2 x P1, only the upper triangle of the comparison matrix will be computed.
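The size estimate and the upper-triangle saving can be checked with a few lines. Multiplying the three estimates (140 files, 100,000 strings per file, 4,000 characters per string) gives 5.6 x 10^10 characters. Whether a file is also compared against itself is not stated, so including the diagonal here is an assumption, as are the variable and function names.

```python
from itertools import combinations_with_replacement

FILES = 140
SEQS_PER_FILE = 100_000
CHARS_PER_SEQ = 4_000

# Upper bound on characters held across all source files.
total_chars = FILES * SEQS_PER_FILE * CHARS_PER_SEQ


def upper_triangle(n):
    """File-index pairs (i, j) with i <= j: each unordered pair once."""
    return list(combinations_with_replacement(range(n), 2))


pairs = upper_triangle(FILES)
print(total_chars)   # 56000000000
print(len(pairs))    # 9870, versus 19600 for the full 140 x 140 matrix
```

Dropping the diagonal as well (no self-comparison) would leave 9,730 file pairs.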