Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications

1
Parallel Genomic Sequence-Searching on an Ad-Hoc
Grid: Experiences, Lessons Learned, and
Implications
  • Mark K. Gardner (Virginia Tech)
  • Wu-chun Feng (Virginia Tech)
  • Jeremy Archuleta (U. Utah)
  • Heshan Lin (NCSU)
  • Xiaosong Ma (NCSU and ORNL)

Nominated for Best Paper Award, SC 2006, Tampa, FL
2
Overview
  • StorCloud demo at SC05
  • I/O throughput competition for real-world
    scientific applications
  • When: Sun., Nov. 13 to Thu., Nov. 17, 2005
  • Some slides adapted from the StorCloud
    presentation "mpiBLAST on the GreenGene
    Distributed Supercomputer" (Wu Feng et al.)
  • Story
  • Built an ad-hoc grid (GreenGene) with 3048
    processors for intensive genomic sequence search
    (searching NT against NT with mpiBLAST)
  • Team
  • Institutions: LANL, NCSU, U. Utah, and Virginia Tech
  • Vendors: Intel, Panta Systems, and Foundry Networks

3
GreenGene Grid
  • How?

[Figure: GreenGene grid sites; surviving labels: U. Utah, Va Tech]
4
Outline
  • About BLAST and mpiBLAST
  • Motivation
  • Planning
  • Estimate resource requirements
  • What kind of grid do we need
  • System design
  • Hardware architecture
  • Software architecture
  • Results
  • Conclusion

5
What is BLAST?
  • Basic Local Alignment Search Tool
  • Ubiquitous sequence database search tool used in
    molecular biology
  • Given a query DNA or amino-acid (AA) sequence,
    BLAST:
  • Finds similar sequences in the database
  • Reports the statistical significance of similarities
    between query and database
  • Newly sequenced genomes are typically
    BLAST-searched against a database of known genes
  • Similar sequences may have similar functions in a
    new organism

6
BLAST at the Core of Sequence DB Search
  • Widely used
  • Approximately 75-90% of all compute cycles in
    life sciences are devoted to BLAST searches
  • But it is:
  • Computationally demanding, O(n^2) (a variant of
    the string-matching algorithm)
  • Requires the sequence database to be stored in memory to
    perform efficiently
  • Challenge: sequence databases are growing
    exponentially

7
mpiBLAST Algorithm: Querying the Database
  • Open-source BLAST parallelization (developed at
    LANL)
  • Parallel approach: segment and distribute the
    database across the cluster
  • Advantage: delivers super-linear speedup by
    avoiding repeated I/O
  • Limitation: poor performance on searches
    with large output volume because of the
    result-merging bottleneck
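
The segment-and-distribute approach can be sketched in a few lines of Python. This is an illustration only, not mpiBLAST code: the function names `segment_database`, `search_fragment`, and `merge_results` are hypothetical, the workers are simulated by a plain loop instead of MPI processes, and the score is a toy count of shared 3-letter words standing in for a real BLAST alignment score.

```python
from typing import Dict, List, Set, Tuple

def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def segment_database(db: Dict[str, str], n_fragments: int) -> List[Dict[str, str]]:
    """Split the database round-robin into n_fragments pieces, one per worker."""
    fragments: List[Dict[str, str]] = [{} for _ in range(n_fragments)]
    for i, (name, seq) in enumerate(db.items()):
        fragments[i % n_fragments][name] = seq
    return fragments

def search_fragment(query: str, fragment: Dict[str, str]) -> List[Tuple[str, int]]:
    """Toy search on one fragment: score = number of words shared with the query."""
    q = kmers(query)
    scored = [(name, len(q & kmers(seq))) for name, seq in fragment.items()]
    return [(name, score) for name, score in scored if score > 0]

def merge_results(partials: List[List[Tuple[str, int]]]) -> List[Tuple[str, int]]:
    """Combine per-fragment hits into one list, best score first."""
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))

# Made-up four-sequence database, split across two simulated workers.
db = {"s1": "ACGTACGT", "s2": "TTTTGGGG", "s3": "ACGTTTTT", "s4": "CCCCCCCC"}
fragments = segment_database(db, 2)
partials = [search_fragment("ACGTAC", fragment) for fragment in fragments]
results = merge_results(partials)
print(results)
```

Each worker only ever touches its own fragment, which is what lets a fragment stay resident in memory. The single `merge_results` step is where all output converges, which is the bottleneck named in the last bullet.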

8
mpiBLAST-PIO: Enhancing Efficiency
  • Optimizations transferred from pioBLAST
  • Research prototype developed at NCSU and ORNL
    [Lin et al., IPDPS05]
  • Dramatically improves search throughput and
    scalability
  • Uses parallel I/O techniques to remove the
    result-merging bottleneck
  • Results are buffered and written concurrently by
    workers
  • Enhanced output processing reduces
    communication volume
  • Largely used in the SC StorCloud demo

9
Why Sequence-Search the NT Database Against
Itself?
  • From a Biological Perspective
  • Aids in understanding which genetic codes are
    unique and which are redundant
  • Enables a number of useful studies, from organism
    barcoding to gene function and evolution
  • From a Computer Science Perspective
  • Provides a pertinent demonstration of
    mpiBLAST/pio's scalability to larger problems (NT
    is one of the largest sequence databases)
  • Can potentially generate huge output data
  • Enables realization of an advanced indexing
    structure that tracks relationships among
    sequences in the database
  • Such indexing structures can provide:
  • Up to 100x speedup in search times with little
    loss of sensitivity
  • Up to 20x compression of the database using
    phylogenetic methods

10
Resource Estimation
  • Why do we care?
  • To evaluate the feasibility of the project
  • To make better scheduling decisions
  • What's the complexity of the problem?
  • Intuitive approach: estimate by sequence length
  • NT composition

11
Sequence Length Based Estimation
  • By simple linear extrapolation, the task appears to be
    "mission impossible"
  • Because of hard queries:
  • Intensive computation, large quantities of
    intermediate results
  • Fortunately,
  • Correlation between sequence length and
    resource requirements is weak because BLAST employs
    heuristics
  • G1 sequences are well behaved, and a large portion of
    sequences belong to G1
  • Searches of hard queries can be sped up with
    more memory

[Figure: search of sampled NT sequences]
12
Better Predictor?
  • Hit-based rather than length-based?
  • Two-phase BLAST search
  • First phase: find hits at the word level
  • Second phase: extend matched words in both
    directions to find the maximal segment pair (longest
    local matching substring)
  • Computation in the first phase is much less expensive
    than in the second phase
  • Modified the BLAST algorithm to collect the number of
    hits in the first phase
  • Attractive: uses internal knowledge of the BLAST
    algorithm
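
The two phases can be illustrated with a deliberately simplified Python sketch. It is not the BLAST implementation: real BLAST scores extensions with a substitution matrix and stops on a score drop-off, whereas this version only extends while characters match exactly. The function names, the word size `w = 4`, and the two sequences are assumptions made for the example.

```python
from typing import Dict, List, Tuple

def find_word_hits(query: str, subject: str, w: int = 4) -> List[Tuple[int, int]]:
    """Phase 1: list (query_pos, subject_pos) pairs whose length-w words match exactly."""
    index: Dict[str, List[int]] = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query: str, subject: str, i: int, j: int, w: int = 4) -> str:
    """Phase 2 (simplified): grow the match left and right while characters agree."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]

# Made-up sequences for illustration.
query = "GGACGTACGA"
subject = "TTACGTACGTT"
hits = find_word_hits(query, subject)
segments = [extend_hit(query, subject, i, j) for i, j in hits]
longest = max(segments, key=len)
print(len(hits), longest)
```

Here `len(hits)` is the quantity a hit-based predictor would use: it is available after the cheap first phase, before any extension work is done.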

13
Number of Hits Not a Better Predictor
  • Linear regression on data collected from 500 sequences
  • Y: output size, execution time; X: length, hits
  • Number of hits not necessarily better
  • Difference of mean square errors < 5
  • High correlation (0.9942) between number of hits
    and sequence length
    and sequence length
  • Sequence length is much easier to collect
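
The comparison described above amounts to fitting one line per predictor and comparing the mean square errors. The Python sketch below shows the mechanics with ordinary least squares; the five data points are invented for illustration and are not the measurements behind the slide, and the helper names are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple

def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean square error of the least-squares line fitted to (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    return mean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical sample: sequence length, first-phase hits, execution time (s).
lengths = [100, 200, 300, 400, 500]
hits = [210, 390, 620, 790, 1010]
exec_time = [1.1, 2.0, 2.9, 4.2, 5.0]

mse_by_length = mse(lengths, exec_time)
mse_by_hits = mse(hits, exec_time)
r_length_hits = pearson(lengths, hits)
print(mse_by_length, mse_by_hits, r_length_hits)
```

When the two predictors are almost perfectly correlated, as in this toy sample, the two fitted lines explain nearly the same variance, so the cheaper predictor is the reasonable choice.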

14
What Kind of Grid Do We Need?
  • Existing grid frameworks (such as Globus) are not
    what we want
  • Not available or not well tested on Mac OS X and
    64-bit Linux
  • mpiBLAST-PIO not ported to Globus
  • Steep learning curve for installation and
    configuration
  • Home-made grid software written from scratch
  • Fits our needs exactly
  • Easy to deploy, allows full control

15
Hardware Architecture
  • Heterogeneous environment
  • Interoperability is a big concern

Cluster       Organization   Architecture               Memory  Procs    File System
System X      Virginia Tech  Dual 2.3GHz PowerPC 970FX  4GB     2200     NFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 240 CPU   4GB     126      PVFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 244 CPU   2GB     128      PVFS
Dupon         Intel          Quad core                  N/A     512/256  NFS
Jarrel        Intel          Dual 3.4GHz Intel P4       2GB     20       NFS
Blade Center  Intel          Dual 2.66GHz Intel Xeon    2GB     28       NFS
Panta         Panta Systems  Four AMD Opteron 246HE     2GB     32       NFS
17
Software Architecture
  • Hierarchical design
  • SuperMaster: assigns queries, fetches results, balances
    load
  • GroupMaster: fetches queries, performs searches
  • How to choose the group size?
  • Challenges: heterogeneity, scalability, fault
    tolerance
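
The division of labor between the two roles can be sketched as a small in-process simulation in Python. This is not the project's Perl/bash code: the class and method names are hypothetical, there is no networking, and the "search" is a stub. It assumes group masters pull batches from a shared queue and the SuperMaster later collects finished results from each group, which is one reading of the bullets above.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

class SuperMaster:
    """Holds the global queue of query batches and gathers finished results."""

    def __init__(self, batches: List[str]) -> None:
        self.pending: Deque[str] = deque(batches)
        self.results: Dict[str, str] = {}

    def next_batch(self) -> Optional[str]:
        """Hand out one batch when a group master asks; None once drained."""
        return self.pending.popleft() if self.pending else None

    def fetch_results(self, groups: List["GroupMaster"]) -> None:
        """Collect whatever each group has finished since the last call."""
        for group in groups:
            while group.outbox:
                batch, output = group.outbox.popleft()
                self.results[batch] = output

class GroupMaster:
    """Fetches batches, runs the search on its own processors, queues the output."""

    def __init__(self, name: str, supermaster: SuperMaster,
                 search: Callable[[str], str]) -> None:
        self.name = name
        self.supermaster = supermaster
        self.search = search
        self.outbox: Deque[Tuple[str, str]] = deque()
        self.completed: List[str] = []

    def run_once(self) -> bool:
        """Process one batch; return False when no work is left."""
        batch = self.supermaster.next_batch()
        if batch is None:
            return False
        self.outbox.append((batch, self.search(batch)))
        self.completed.append(batch)
        return True

def toy_search(batch: str) -> str:
    """Stub standing in for an mpiBLAST run on one batch."""
    return f"hits-for-{batch}"

sm = SuperMaster([f"b{i}" for i in range(6)])
fast = GroupMaster("fast", sm, toy_search)
slow = GroupMaster("slow", sm, toy_search)
while sm.pending:
    fast.run_once()
    fast.run_once()  # the faster group finishes two batches per round
    slow.run_once()
sm.fetch_results([fast, slow])
print(fast.completed, slow.completed)
```

Because groups ask for work only when they are ready, a faster group naturally takes a larger share, so load balancing needs no explicit speed model.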

18
Heterogeneity And Accessibility
  • Use only four existing, cross-platform tools
  • Perl, ssh, rsync, bash
  • 5 scripts totaling only 458 lines
  • Fast deployment on Unix-like systems
  • Customize mpiBLAST-PIO
  • System X needs special care
  • Porting issues because of Mac OS and PowerPC
  • Implement pseudo-parallel-write to improve output
    performance on NFS

19
Design for Scalability
  • Managing thousands of processors efficiently with a
    loosely coupled, hierarchical design
  • Reduces load on the SuperMaster
  • Passive SuperMaster: easy to add group masters,
    regroup processors, and avoid security holes
  • Allows incremental system start
  • Hiding WAN latency by queuing queries locally
  • Prevents bubbles in the pipeline
  • Ensuring data integrity with MD5 checksums
  • A silent error every 500GB [Paxson 1999]
  • Alleviating the network bandwidth constraint with
    compression (compression ratio 15 17)
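
The checksum-plus-compression idea can be sketched with Python's standard library. The slides do not say which compression tool was used, so `zlib` here is a stand-in, and the function names `pack` and `unpack` and the sample payload are assumptions for the example.

```python
import hashlib
import zlib
from typing import Tuple

def pack(payload: bytes) -> Tuple[bytes, str]:
    """Compress the payload and compute the MD5 digest of the uncompressed bytes."""
    return zlib.compress(payload), hashlib.md5(payload).hexdigest()

def unpack(blob: bytes, expected_md5: str) -> bytes:
    """Decompress and verify; a mismatch means the transfer must be repeated."""
    payload = zlib.decompress(blob)
    if hashlib.md5(payload).hexdigest() != expected_md5:
        raise ValueError("MD5 mismatch: transfer should be retried")
    return payload

# Made-up, highly repetitive text standing in for search output.
report = b"Query= seq_0001\nSbjct  ACGTACGTACGT\n" * 200
blob, digest = pack(report)
restored = unpack(blob, digest)
print(len(report), len(blob))
```

The digest is taken over the uncompressed bytes, so it checks the whole path (compression, transfer, decompression) and not just the wire format.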

20
Fault Tolerance
  • Serious: mean time to failure < 10 hours on machines
    with thousands of processors [Reed 2004]
  • Re-execution rather than checkpoint-restart
  • Primary issue: managing query states
  • Maintain all query states in the file system
21
Results
  • Finished 1/7 of NT in one day
  • Coalesced sequences into batches targeting a 30-minute
    search time
  • Execution statistics
  • Output size: 600KB to 7GB per batch, 284.2KB per
    sequence
  • Execution time: 6 seconds to 1.6 hours, average 9
    minutes per batch
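
Coalescing sequences into batches with a target search time can be sketched as a greedy pass over length-based time estimates. The Python below is an assumption about how such batching might work, not the authors' script: `make_batches` is a hypothetical name, and the cost of one second per base is a made-up coefficient chosen to keep the arithmetic easy to follow.

```python
from typing import List, Tuple

def make_batches(sequences: List[Tuple[str, int]], secs_per_base: float,
                 target_secs: float = 30 * 60) -> List[List[str]]:
    """Greedily group (id, length) pairs so each batch's estimate stays near the target.

    A sequence whose own estimate exceeds the target still gets a batch to itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_secs = 0.0
    for seq_id, length in sequences:
        estimate = length * secs_per_base
        # Close the open batch if adding this sequence would overshoot the target.
        if current and current_secs + estimate > target_secs:
            batches.append(current)
            current, current_secs = [], 0.0
        current.append(seq_id)
        current_secs += estimate
    if current:
        batches.append(current)
    return batches

# Hypothetical sequence lengths; 1.0 s per base is an illustrative cost model.
sequences = [("a", 600), ("b", 600), ("c", 900), ("d", 2400), ("e", 300)]
batches = make_batches(sequences, secs_per_base=1.0)
print(batches)
```

The estimate is only as good as the length-based model behind it, so actual batch times can fall well short of or beyond the target, as the spread in the execution statistics shows.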

22
Conclusion
  • Not able to take advantage of existing grid
    software
  • Home-made grid software did work
  • Enables rapid development and deployment
  • Portable to Unix-like platforms
  • Identifies hard queries for bio research
  • Future work
  • Extend the framework to support more general
    applications
  • Better resource estimation