Computing the Smith-Waterman Algorithm on the Illinois Bio-Grid

Dave S. Angulo (1), Nigel M. Parsad (2), Tom Goodale (3), Gabrielle Allen (3), Ed Seidel (3)

(1) The School of Computer Science, Telecommunications and Information Systems, DePaul University, dangulo@cs.depaul.edu
(2) Kurt Rossman Labs, The University of Chicago, nmp@cs.uchicago.edu
(3) Albert Einstein Institute, Golm (AEI/MPG), goodale@aei-potsdam.mpg.de, allen@aei-potsdam.mpg.de, eseidel@aei-potsdam.mpg.de

Motivation
To exploit the prodigious computational resources of the Illinois Bio-Grid (IBG) by simultaneously querying multiple protein sequences against multiple protein sequence databases for homology. The Smith-Waterman algorithm will be used because it guarantees the optimal local pairwise alignment between two sequences under a given scoring scheme. The efficiencies gained by distributing both the database queries and the dynamic programming load in parallel should be substantially greater than those of the single-sequence/single-database search that is the current computational biology standard.
[Figure: N x M pairwise protein alignments using Smith-Waterman, with an N-processor Grid and M protein sequence databases.]
Strategy
To develop and implement a Smith-Waterman software toolkit (SWTask) that runs in the distributed environment of the IBG. This toolkit will be part of a larger IBG Bioinformatic Workbench whose modules will also allow Grid-enabled computation of the FASTA and BLAST algorithms. SWTask will include task farming, data acquisition, and Smith-Waterman software modules.
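
For orientation, the dynamic programming step that each task would perform can be sketched as follows. This is a minimal illustration, not the SWTask module itself: the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions, and a real protein search would more likely use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the poster.
    Only two rows of the score matrix are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because the inner loop visits every cell of a len(a) x len(b) matrix, the cost of one comparison grows with the product of the two sequence lengths, which is why spreading many comparisons across Grid resources is attractive.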
Task Farming Basics on the Grid
  • Task Manager Hierarchy
  • In the traditional Master/Slave task manager
    architecture, there are problems with slave
    startup and with communication between master
    and slave. Specific issues include
    authentication/authorization to start remote
    jobs, queues on remote resources, and firewalls
    between resources.
  • A three-level hierarchy provides solutions to
    these issues:
  • Level 1: The Task Farm Manager (0), a.k.a.
    TFM(0), farms out tasks to remote resources on
    the Grid and corresponds to the Master in the
    traditional Master/Slave architecture.
  • Level 2: A Task Farm Manager (1), a.k.a. TFM(1),
    is started on a queue for each remote resource
    assigned a task.
  • Level 3: The specific computational task. This
    level corresponds to the Slave in the traditional
    Master/Slave architecture.

[Figure: Task farm hierarchy. TFM(0), implemented in Cactus and designed for the Grid, uses TM modules to start the remote TFM(1)s. Tasks can be anything; in this case, the computation of a bioinformatics application.]
  • Task Manager Module Structure
  • The Task Farm Manager (TFM) utilizes the ASCA
    generic task farm module as well as the Task Farm
    Logic Manager module (TFLM). For TFM(0),
    ASCA(0) requests information from TFLM(0)
    regarding the minimum number of tasks that can
    be run (MinTasks), how many tasks are desired
    (DesiredTasks), and how many processors and how
    much memory are required per task
    (TaskRequirements).
  • When a TFM(1) requests a task, the TFM(0) calls
    GetMoreTasks, which manages a list of task ids
    for uncompleted tasks. Then, for each task,
    TFM(1) calls GetInputFile, which provides the
    required parameters for the specific source
    files to be processed.
  • The SWLM module is the logic manager specific
    to Smith-Waterman applications. SWLM provides
    information about which tasks to start and which
    parameters to use for each input file.
  • The SWTask module (not shown) will communicate
    with the SWLM to get and process files on the
    task end.
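
The division of labour between the generic task farm module and an application-specific logic manager might look roughly like the sketch below. Only the names MinTasks, DesiredTasks, TaskRequirements, GetMoreTasks, and GetInputFile come from the slides; the class name, the Python signatures, the default values, and the file names are hypothetical and do not represent the Cactus or ASCA API.

```python
from dataclasses import dataclass


@dataclass
class TaskRequirements:
    # Per-task resource needs; the field names are assumptions.
    processors: int
    memory_mb: int


class SmithWatermanLogicManager:
    """Hypothetical stand-in for SWLM, not the real Cactus module."""

    def __init__(self, input_files, min_tasks=1, desired_tasks=4):
        # Map each task id to the input file it should process.
        self._input_files = dict(enumerate(input_files))
        self._pending = list(self._input_files)  # ids of uncompleted tasks
        self.min_tasks = min_tasks               # MinTasks
        self.desired_tasks = desired_tasks       # DesiredTasks
        self.task_requirements = TaskRequirements(processors=1, memory_mb=512)

    def get_more_tasks(self, count=1):
        """GetMoreTasks: hand out up to `count` ids of uncompleted tasks."""
        handed = self._pending[:count]
        self._pending = self._pending[count:]
        return handed

    def get_input_file(self, task_id):
        """GetInputFile: return the input file for one task."""
        return self._input_files[task_id]


# Example use with made-up file names.
swlm = SmithWatermanLogicManager(["db_a.fasta", "db_b.fasta", "db_c.fasta"])
first_batch = swlm.get_more_tasks(2)
```

Keeping the task list and per-task parameters behind a small interface like this is one way the generic farming code could stay unaware of what a Smith-Waterman task actually is.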

[Figure: Task Manager module structure, divided into a generic part and an application-specific part.]
Smith-Waterman Task Farming (SWTask) on the IBG
Task Management Scenario
A. The TFM(0) gives P TFM(1) processors individual directives to download, process, and own one source data file.
   i) Each TFM(1) processor downloads a source data file, strips off non-essential annotations, and stores the annotations on local disk.
   ii) Each TFM(1) processor saves the resulting stripped source file in memory for sequence alignment analysis using Smith-Waterman.
   iii) Each TFM(1) processor remains prepared to send and receive stripped source files to and from other TFM(1) peers on the Grid.
   iv) The TFM(0) keeps track of which TFM(1) processor owns which source file.
B. The TFM(0) gives T TFM(1) processors directives to obtain a second source data file (all or partial) from a TFM(1) peer.
   i) Each TFM(1) processor asks a TFM(1) peer for the second stripped source file.
   ii) Each TFM(1) processor then does a pairwise sequence comparison of the two files in memory.
   iii) Each TFM(1) processor then requests more work from the TFM(0). The TFM(0) may then direct the TFM(1) to ask a peer for a third file in a second thread.
C. The TFM(0) tracks and dynamically manages:
   i) TFM(1) progress.
   ii) TFM(1) task distribution based upon workload sharing and processor speed (via completion requests).
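
The bookkeeping in this scenario can be illustrated with a small, single-process sketch. It covers only ownership tracking, handing out a second file to compare against, and counting completions. The class and method names are hypothetical, and the sketch ignores threads, partial files, network transfer, and processor speed.

```python
from itertools import combinations


class TaskFarmManager0:
    """Toy model of the TFM(0) bookkeeping; all names are assumptions."""

    def __init__(self, owners):
        self.owners = dict(owners)  # worker id -> source file it owns
        self.file_owner = {f: w for w, f in self.owners.items()}
        # Every unordered pair of files still waiting to be compared.
        self.pending = set(combinations(sorted(self.file_owner), 2))
        self.completed = {w: 0 for w in self.owners}

    def request_work(self, worker):
        """Return (second_file, peer_that_owns_it), or None if nothing is left."""
        own = self.owners[worker]
        for pair in sorted(self.pending):
            if own in pair:
                self.pending.remove(pair)
                other = pair[0] if pair[1] == own else pair[1]
                return other, self.file_owner[other]
        return None

    def report_done(self, worker):
        # Completion reports give a crude measure of each worker's progress.
        self.completed[worker] += 1


tfm0 = TaskFarmManager0({"w1": "f1", "w2": "f2", "w3": "f3"})
assignment = tfm0.request_work("w1")  # w1 is told to fetch f2 from w2
tfm0.report_done("w1")
```

A faster worker simply calls request_work more often, so in this toy model the distribution of pairs follows completion rate without the manager measuring processor speed directly.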
Machines Involved
A. An N-processor Grid that dynamically allocates resources for client processes.
B. One processor designated as the Master Task Farm Manager, TFM(0).
C. M processors designated as the Worker Task Farm Managers, TFM(1).
Data Involved
A. P source data files (estimated at 140) from a sequence database.
B. Each data file has perhaps 100,000 sequence strings with potentially 4,000 characters per string.
C. P can be broken into subsets P1, P2, etc.
Tasks Involved
A. Download P source data files. The total number of characters to compare is approximately 56,000,000,000, or 5.6 x 10^10 (140 source files x 100,000 sequence strings x 4,000 characters per string).
B. Complete a W x W character expression. For two source files P1 and P2, consider P1 x P2.
C. Since P1 x P2 = P2 x P1, only the upper triangle of the comparison matrix will be computed.
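
The size estimates and the symmetry argument can be checked with a few lines of arithmetic. The product of the three estimates works out to 5.6 x 10^10 characters. Whether a file is also compared against itself (the diagonal of the matrix) is not stated in the slides, so the helper below, whose name is an assumption, takes it as a parameter.

```python
FILES = 140              # estimated number of source files
SEQS_PER_FILE = 100_000  # estimated sequences per file
CHARS_PER_SEQ = 4_000    # upper estimate of characters per sequence

total_chars = FILES * SEQS_PER_FILE * CHARS_PER_SEQ


def upper_triangle_pairs(n, include_diagonal=False):
    """List the (i, j) file index pairs with i < j (or i <= j)."""
    start = 0 if include_diagonal else 1
    return [(i, j) for i in range(n) for j in range(i + start, n)]


pairs = upper_triangle_pairs(FILES)
print(total_chars)    # 56000000000
print(len(pairs))     # 9730
print(FILES * FILES)  # 19600
```

Restricting the work to the upper triangle therefore cuts the 19,600 file-by-file comparisons of the full matrix roughly in half, to 9,730 (or 9,870 if each file is also compared with itself).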