Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications

Provided by: hesha2

Learn more at: https://arcb.csc.ncsu.edu

1
Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications

Mark K. Gardner (Virginia Tech)
Wu-chun Feng (Virginia Tech)
Jeremy Archuleta (U. Utah)
Heshan Lin (NCSU)
Xiaosong Ma (NCSU and ORNL)

Nominated for Best Paper Award, SC 2006, Tampa, FL
2
Overview

StorCloud demo at SC05
I/O throughput competition for real-world scientific applications
When: Sun., Nov. 13 to Thu., Nov. 17, 2005
Some slides adapted from the StorCloud presentation "mpiBLAST on the GreenGene Distributed Supercomputer" (Wu Feng et al.)
Story
Built an ad-hoc grid (GreenGene) with 3048 processors for intensive genomic sequence search (searching NT against NT with mpiBLAST)
Team
Institutions: LANL, NCSU, U. Utah, and Virginia Tech
Vendors: Intel, Panta Systems, and Foundry Networks

3
GreenGene Grid

How?

[Figure: GreenGene grid sites; surviving labels: U. Utah, Va Tech]
4
Outline

About BLAST and mpiBLAST
Motivation
Planning
Estimate resource requirements
What kind of grid do we need
System design
Hardware architecture
Software architecture
Results
Conclusion

5
What is BLAST?

Basic Local Alignment Search Tool
Ubiquitous sequence database search tool used in molecular biology
Given a query DNA or amino-acid (AA) sequence, BLAST:
Finds similar sequences in a database
Reports the statistical significance of similarities between the query and the database
Newly sequenced genomes are typically BLAST-searched against a database of known genes
Similar sequences may have similar functions in a new organism

6
BLAST at the Core of Sequence DB Search

Widely used
Approximately 75-90% of all compute cycles in life sciences are devoted to BLAST searches
But it:
Is computationally demanding, O(n^2) (a variant of the string-matching algorithm)
Requires the sequence database to be held in memory to perform efficiently
Challenge: sequence databases are growing exponentially

7
mpiBLAST Algorithm: Querying the Database

Open-source BLAST parallelization (developed at LANL)
Parallel approach: segment the database and distribute it across the cluster
Advantage: delivers super-linear speedup by avoiding repeated I/O
Limitation: performs poorly on searches with large output volume because of the result-merging bottleneck
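
The segment-and-distribute idea on this slide can be illustrated with a minimal Python sketch. It is not mpiBLAST code: the k-mer `score` function is a toy stand-in for BLAST scoring, and `split_database`, `search_fragment`, and `merge_results` are hypothetical names chosen for illustration. Each "worker" searches only its own database fragment, and a single merge step combines the hits, which is where the output bottleneck mentioned above would arise.

```python
from heapq import nlargest

def score(query, subject, k=3):
    """Toy similarity (assumption, not BLAST scoring): distinct k-mers shared by both."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q & s)

def split_database(db, n_fragments):
    """Distribute (name, sequence) records round-robin into n_fragments lists."""
    fragments = [[] for _ in range(n_fragments)]
    for i, record in enumerate(db):
        fragments[i % n_fragments].append(record)
    return fragments

def search_fragment(query, fragment):
    """What one worker would do: score the query against its own fragment only."""
    return [(score(query, seq), name) for name, seq in fragment]

def merge_results(per_worker_hits, top_n):
    """What the master would do: pool every worker's hits and keep the best top_n."""
    all_hits = [hit for hits in per_worker_hits for hit in hits]
    return nlargest(top_n, all_hits)

db = [("s1", "ACGTACGT"), ("s2", "TTTTGGGG"), ("s3", "ACGTTTTT"), ("s4", "GGGGCCCC")]
fragments = split_database(db, 2)
hits = [search_fragment("ACGTAC", f) for f in fragments]
best = merge_results(hits, top_n=2)
print(best)
```

Because every fragment is small enough to stay in a worker's memory, no worker has to re-read the database from disk, which is the source of the super-linear speedup the slide refers to.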

8
mpiBLAST-PIO: Enhancing Efficiency

Optimizations transferred from pioBLAST
Research prototype developed at NCSU and ORNL (Lin et al., IPDPS05)
Dramatically improves search throughput and scalability
Uses parallel I/O techniques to remove the result-merging bottleneck
Results are buffered and written out concurrently by workers
Enhances output processing to reduce communication volume
Used extensively in the SC StorCloud demo

9
Why Sequence-Search the NT Database Against Itself?

From a Biological Perspective
Aids in understanding which genetic codes are unique and which are redundant
Enables a number of useful studies, from organism barcoding to gene function and evolution
From a Computer Science Perspective
Provides a pertinent demonstration of mpiBLAST-PIO's scalability to larger problems (NT is one of the largest sequence databases)
Can potentially generate huge output data
Enables realization of an advanced indexing structure that tracks relationships among sequences in the database
Such indexing structures can provide:
Up to 100x speedup in search times with little loss of sensitivity
Up to 20x compression of the database using phylogenetic methods

10
Resource Estimation

Why do we care?
To evaluate the feasibility of the project
To make better scheduling decisions
What's the complexity of the problem?
Intuitive approach: estimate by sequence length
NT composition

11
Sequence-Length-Based Estimation

By simple linear extrapolation, the job appears to be mission impossible
Because of hard queries: intensive computation, large quantities of intermediate results
Fortunately,
The correlation between sequence length and resource requirements is weak because BLAST employs heuristics
G1 sequences are well behaved, and a large portion of sequences belong to G1
Searches of hard queries can be sped up with more memory

[Figure: search of sampled NT sequences]
12
Better Predictor?

Hit-based rather than length-based?
Two-phase BLAST search
First phase: find hits at the word level
Second phase: extend matched words in both directions to find the maximal segment pair (longest local matching substring)
Computation in the first phase is much less expensive than in the second phase
Modified the BLAST algorithm to collect the number of hits in the first phase
Attractive: uses internal knowledge of the BLAST algorithm
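
The two phases described on this slide can be sketched in a few lines of Python. This is a simplified illustration, not BLAST itself: it assumes exact word matches and extends only while characters are identical, whereas real BLAST scores extensions with a substitution matrix and a drop-off threshold. The function names and the word size `w=4` are hypothetical.

```python
def find_word_hits(query, subject, w=4):
    """Phase 1: (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, w=4):
    """Phase 2 (simplified): grow a word hit left and right while characters match."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]

def two_phase_search(query, subject, w=4):
    """Return (number of phase-1 hits, longest extended segment)."""
    hits = find_word_hits(query, subject, w)
    segments = {extend_hit(query, subject, i, j, w) for i, j in hits}
    longest = max(segments, key=len) if segments else ""
    return len(hits), longest

n_hits, segment = two_phase_search("TTACGTACGG", "CCACGTACGA")
print(n_hits, segment)
```

The hit count returned by the first phase is the quantity the slide proposes as a candidate predictor: it is cheap to collect because it requires only the word lookup, not the extension.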

13
Number of Hits: Not a Better Predictor

Linear regression on data collected from 500 sequences
Y: output size, execution time; X: length, hits
Number of hits is not necessarily better
Difference of mean square errors < 5
High correlation (0.9942) between number of hits and sequence length
Sequence length is much easier to collect
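
The comparison on this slide can be reproduced in outline with ordinary least squares. The sketch below is an illustration only: the data are synthetic, generated under the assumption that hits and execution time both grow roughly linearly with length, and they are not the measurements from the talk. The helper names `fit_line` and `pearson` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b, mean squared error)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    mse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, mse

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Synthetic stand-in for the 500 sampled sequences (assumption, not real data).
rng = random.Random(0)
lengths = [rng.randint(100, 5000) for _ in range(500)]
hits = [3.0 * n + rng.gauss(0, 50) for n in lengths]
exec_time = [0.01 * n + rng.gauss(0, 5) for n in lengths]

_, _, mse_len = fit_line(lengths, exec_time)   # X = sequence length
_, _, mse_hits = fit_line(hits, exec_time)     # X = number of hits
r = pearson(lengths, hits)
print(f"MSE (length): {mse_len:.2f}  MSE (hits): {mse_hits:.2f}  corr: {r:.4f}")
```

When the two candidate predictors are this strongly correlated, their regression errors are nearly the same, which is the argument the slide makes for preferring the predictor that is easier to collect.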

14
What Kind of Grid Do We Need?

Existing grid frameworks (such as Globus) are not what we want
Not available or not well tested on Mac OS X and 64-bit Linux
mpiBLAST-PIO not ported to Globus
Steep learning curve for installation and configuration
Home-made grid software written from scratch
Fits just our needs
Easy to deploy, allows full control

15
Hardware Architecture

Heterogeneous environment
Interoperability is a big concern

Cluster       Organization   Architecture               Memory  Procs    File System
System X      Virginia Tech  Dual 2.3GHz PowerPC 970FX  4GB     2200     NFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 240 CPU   4GB     126      PVFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 244 CPU   2GB     128      PVFS
Dupon         Intel          Quad core                  N/A     512/256  NFS
Jarrel        Intel          Dual 3.4GHz Intel P4       2GB     20       NFS
Blade Center  Intel          Dual 2.66GHz Intel Xeon    2GB     28       NFS
Panta         Panta Systems  Four AMD Opteron 246HE     2GB     32       NFS
17
Software Architecture

Hierarchical design
SuperMaster: assigns queries, fetches results, balances load
GroupMaster: fetches queries, performs searches
How to choose group size?
Challenges: heterogeneity, scalability, fault tolerance
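
One way to picture the hierarchy is a pull model in which each GroupMaster asks the SuperMaster for work whenever it is ready. The Python sketch below is a toy simulation under that assumption, not the project's Perl and bash scripts: the class names follow the slide, while `fetch_queries`, `store_results`, the `speed` attribute, and the round-based loop are hypothetical.

```python
from collections import deque

class SuperMaster:
    """Holds the global query queue and collects results; it only answers requests."""
    def __init__(self, queries):
        self.pending = deque(queries)
        self.results = {}

    def fetch_queries(self, batch_size):
        """Hand out up to batch_size pending queries to whoever asks."""
        batch = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        return batch

    def store_results(self, results):
        self.results.update(results)

class GroupMaster:
    """Fetches queries for its group, 'searches' them, and reports results."""
    def __init__(self, name, speed):
        self.name = name
        self.speed = speed  # hypothetical: queries the group can finish per round
        self.done = 0

    def run_round(self, super_master):
        batch = super_master.fetch_queries(self.speed)
        results = {q: f"{q} searched by {self.name}" for q in batch}
        super_master.store_results(results)
        self.done += len(batch)

def run(queries, groups):
    """Let every group pull work in turn until the global queue is empty."""
    sm = SuperMaster(queries)
    while sm.pending:
        for g in groups:
            g.run_round(sm)
    return sm

fast, slow = GroupMaster("fast", speed=3), GroupMaster("slow", speed=1)
sm = run([f"q{i}" for i in range(12)], [fast, slow])
print(fast.done, slow.done, len(sm.results))
```

In this arrangement load balancing falls out of the pull pattern: a faster group simply comes back for work more often, so heterogeneous clusters need no central speed model.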

18
Heterogeneity And Accessibility

Uses only four existing, cross-platform tools
Perl, ssh, rsync, bash
5 scripts, totaling only 458 lines
Fast deployment on Unix-like systems
Customized mpiBLAST-PIO
System X needs special care
Porting issues because of Mac OS and PowerPC
Implemented pseudo-parallel-write to improve output performance on NFS

19
Design for Scalability

Managing thousands of processors efficiently with a loosely coupled, hierarchical design
Reduces load on the SuperMaster
Passive SuperMaster: easy to add group masters, regroup processors, and avoid security holes
Allows incremental system start
Hiding WAN latency by queuing queries locally
Prevents bubbles in the pipeline
Ensuring data integrity with MD5 checksums
A silent error every 500GB (Paxson 1999)
Alleviating network bandwidth constraints with compression (compression ratio 15 17)
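
The integrity and bandwidth points above can be combined in one transfer step: compress the payload, ship it with an MD5 digest, and verify on arrival. The following Python sketch shows the idea with the standard `hashlib` and `zlib` modules. The slides do not say which compression tool the project used, so `zlib` is an assumption, and `pack` and `unpack` are hypothetical helper names.

```python
import hashlib
import zlib

def pack(payload: bytes):
    """Sender side: compress the payload and compute the MD5 digest of the original."""
    return zlib.compress(payload), hashlib.md5(payload).hexdigest()

def unpack(compressed: bytes, expected_md5: str) -> bytes:
    """Receiver side: decompress, then verify the digest; raise ValueError if corrupt."""
    try:
        payload = zlib.decompress(compressed)
    except zlib.error as exc:
        raise ValueError("corrupt transfer: cannot decompress") from exc
    if hashlib.md5(payload).hexdigest() != expected_md5:
        raise ValueError("checksum mismatch: payload changed in transit")
    return payload

results = b"ACGT" * 1000  # stand-in for search output (assumption)
blob, digest = pack(results)
restored = unpack(blob, digest)
print(len(results), len(blob), restored == results)
```

A receiver that gets a `ValueError` can simply request the transfer again, which fits the re-execution approach the talk takes to failures.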

20
Fault Tolerance

Serious: mean time to failure < 10 hours in machines with thousands of processors (Reed 2004)
Re-execution rather than checkpoint-restart
Primary issue: query state management
Maintain all query states in the file system
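
Keeping query states in the file system makes re-execution straightforward: after a failure, any query still marked as running is simply returned to the pending pool. The Python sketch below illustrates this with one marker file per query. The state names, the file-naming scheme, and the helper functions are all assumptions made for illustration; the slides do not describe the actual on-disk layout.

```python
import os
import tempfile

STATES = ("pending", "running", "done")  # hypothetical state names

def set_state(state_dir, query_id, state):
    """Record a query's state as a marker file named <query_id>.<state>."""
    new = os.path.join(state_dir, f"{query_id}.{state}")
    for s in STATES:
        old = os.path.join(state_dir, f"{query_id}.{s}")
        if os.path.exists(old):
            os.replace(old, new)  # rename the existing marker to the new state
            return
    open(new, "w").close()  # first time this query is seen

def queries_in_state(state_dir, state):
    """List the ids of all queries currently in the given state."""
    suffix = "." + state
    return sorted(f[:-len(suffix)] for f in os.listdir(state_dir) if f.endswith(suffix))

def recover(state_dir):
    """After a crash, put every 'running' query back to 'pending' for re-execution."""
    for q in queries_in_state(state_dir, "running"):
        set_state(state_dir, q, "pending")

with tempfile.TemporaryDirectory() as d:
    for q in ("q1", "q2", "q3"):
        set_state(d, q, "pending")
    set_state(d, "q1", "done")      # q1 finished before the failure
    set_state(d, "q2", "running")   # q2 was in flight when the group failed
    recover(d)
    pending_after = queries_in_state(d, "pending")
    done_after = queries_in_state(d, "done")
print(pending_after, done_after)
```

Because the state lives on disk rather than in a process, a restarted master needs no checkpoint of its own; it can rebuild its view by listing the directory.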

21
Results