Title: Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications
1Parallel Genomic Sequence-Searching on an Ad-Hoc
Grid Experiences, Lessons Learned, and
Implications
- Mark K. Gardner (Virginia Tech)
- Wu-chun Feng (Virginia Tech)
- Jeremy Archuleta (U. Utah)
- Heshan Lin (NCSU)
- Xiaosong Ma (NCSU ORNL)
Nominated for Best Paper Award, SC 2006, Tampa, FL
2Overview
- StorCloud Demo of SC05
- I/O throughput competition of real world
scientific applications - When Sun., Nov. 13 to Thu., Nov. 17, 2005
- Part of slides modified from StorCloud
presentation mpiBLAST on the GreenGene
Distributed Supercomputer (Wu Feng et. al.) - Story
- Built an ad-hoc grid (GreenGene) with 3048
Processor for intensive genomic sequence search
(search NT against NT with mpiBLAST) - Team
- Institutions
- LANL, NCSU, U. Utah, and Virginia Tech
- Vendors
- Intel, Panta Systems, and Foundry Networks
3GreenGene Grid
U.Utah
Va Tech
4Outline
- About BLAST and mpiBLAST
- Motivation
- Planning
- Estimate resource requirements
- What kind of grid do we need
- System design
- Hardware architecture
- Software architecture
- Results
- Conclusion
5What is BLAST?
- Basic Local Alignment Sequence Tool
- Ubiquitous sequence database search tool used in
molecular biology - Given a query DNA or amino-acid (AA) sequence,
BLAST - Finds similar sequences in database
- Reports statistical significance of similarities
between query and database - Newly sequenced genomes are typically
BLAST-searched against database of known genes - Similar sequences may have similar functions in a
new organism
6BLAST at the Core of Sequence DB Search
- Widely used
- Approximately 75-90 of all compute cycles in
life sciences are devoted to BLAST searches - But, it is
- Computationally demanding, O(n2) (variant of
string matching algorithm) - Requires seq database to be stored in memory to
perform efficiently - Challenge sequence databases growing
exponentially
7mpiBLAST Algorithm Querying the Database
- Open source BLAST parallelization (developed at
LANL) - Parallel approach segment and distribute
database across cluster - Advantage deliver super-linear speedup by
avoiding repeated I/O - Limitation poor performance in handle search
with large output volume because of results
merging bottleneck
8mpiBLAST-PIO Enhancing Efficiency
- Optimizations transferred from pioBLAST
- Research prototype developed at NCSU and ORNL
Lin et. al. IPDPS05 - Dramatically improves search throughput and
scalability - Using parallel I/O techniques to remove result
merging bottleneck - Results buffered and outputted concurrently by
workers - Enhancing output processing to reduce
communication volume - Largely used in SC StorCloud demo
9Why Sequence-Search the NT Database Against
Itself?
- From a Biological Perspective
- Aids in understanding of which genetic codes are
unique and which are redundant - Enables a number of useful studies from organism
barcoding to gene function and evolution - From a Computer Science Perspective
- Provides pertinent demonstration of
mpiBLAST/pios scalability to larger problems (NT
is one of the largest seq databases) - Can potentially generate huge output data
- Enables realization of advanced indexing
structure that tracks relationships among
sequences in the database - Such indexing structures can provide
- Up to 100x speedup in search times with little
loss of sensitivity. - Up to 20x compression of the database using
phylogenetic methods.
10Resource Estimation
- Why do we care?
- To evaluate the feasibility of the project
- To make better scheduling decision
- Whats the complexity of the problem?
- Intuitively estimation by seq length
- NT composition
11Sequence Length Based Estimation
- Simple linear extrapolation appears mission
impossible - Because of hard queries
- intensive computation, large quantities of
intermediate results - Fortunately,
- Weak correlation between sequence length and
resource requirements because of BLAST employs
heuristics - G1 sequences well behaved, large portion of
sequences belong to G1 - Search of hard queries can be speeded up with
more memory
Sampling NT sequences search
12Better Predictor?
- Hit-based rather than length-based?
- Two phase BLAST search
- First phase find hits in word level
- Second phase extend matched words in both
direction to find maximal segment pair (longest
local matching substring) - Computation of first phase much less expensive
then that of second phase - Modified BLAST algorithm to collect number of
hits in the first phase - Attractive utilizing internal knowledge of BLAST
algorithm
13Number of Hits Not a Better Predictor
- Linear regression on data collected from 500 seqs
- Y output size, execution time X length, hits
- Number of hits not necessary better
- Difference of mean square errors lt 5
- High correlation (0.9942) between number of hits
and sequence length - Sequence length is much easier to collect
14What Kind of Grid Do We Need?
- Existing grid frameworks (such as Globus) not
what we want - Not available or well tested on Mac OS X and
64-bit Linux OS - mpiBLAST-PIO not ported to Globus
- High learning curve for installation and
configuration - Home made grid software wrote from scratch
- Just fit our needs
- Easy to deploy, allow full control
15Hardware Architecture
- Heterogeneous environment
- Interoperability is big concern
Cluster Organization Architecture Memory Procs File System
System X Virginia Tech Dual 2.3GHz PowerPC 970FX 4GB 2200 NFS
TunnelArch Univ. of Utah Dual AMD Opteron 240 CPU 4GB 126 PVFS
TunnelArch Univ. of Utah Dual AMD Opteron 244 CPU 2GB 128 PVFS
Dupon Intel Quad core N/A 512/256 NFS
Jarrel Intel Dual 3.4GHz Intel P4 2GB 20 NFS
Blade Center Intel Dual 2.66GHz Intel Xeon 2GB 28 NFS
Panta Panta Systems Four AMD Opteron 246HE 2GB 32 NFS
16(No Transcript)
17Software Architecture
- Hierarchical design
- SuperMaster assign queries, fetch results, load
balancing - GroupMaster fetch queries, perform search
- How to choose group size?
- Challenges heterogeneity, scalability, fault
tolerance
18Heterogeneity And Accessibility
- Only use four existing, cross-platform tools
- Perl, ssh, rsync, bash
- 5 scripts, totaled only 458 lines
- Fast deployment in Unix like systems
- Customize mpiBLAST-PIO
- System X need special care
- Porting issues because of Mac OS and Power PC
- Implement pseudo-parallel-write to improve output
performance on NFS
19Design for Scalability
- Managing thousands of procs efficiently with
loosely coupled, hierarchical design - Reduce loads on SuperMaster
- Passive SuperMaster easy to add group masters,
regroup processors, and avoid security hole - Allow incremental system start
- Hiding WAN latency by queuing queries in local
- Prevent bubbles in the pipeline
- Ensuring data integrity with MD5 checksum
- A silent error every 500GB Paxson 1999
- Alleviating network bandwidth constraint with
compression (compression ration 15 17)
20Fault Tolerance
- Serious mean time failure lt 10 hours in machines
with thousands of processors Reed 2004 - Re-execution rather than checkpoint-restart
- Primary issue query states management
- Maintain all query states in file system
21Results
- Finished 1/7 NT in one day
- Coalesced sequences into batches targeting 30
minutes search time - Execution statistics
- Output size 600K 7GB per batch, 284.2KB per
seq - Execution time 6 secs 1.6 hours, average 9
mins per batch
22Conclusion
- Not be able to take advantage of existing grid
software - Home made grid software did work
- Enables rapid development and deployment
- Portable to Unix like platforms
- Identify hard queries for bio research
- Future work
- Extend framework to support more general
applications - Better resource estimation