Biosequence Similarity Search on the Mercury System - PowerPoint PPT Presentation



1
Biosequence Similarity Search on the Mercury
System
  • Praveen Krishnamurthy, Jeremy Buhler, Roger
    Chamberlain, Mark Franklin, Kwame Gyang, and
    Joseph Lancaster
  • Department of Computer Science and Engineering,
    Washington University in Saint Louis, MO

Supported by an NIH STTR Grant and NSF Grants
DBI-0237902, ITR-0313203, CCR-0217334
2
Outline
  • Overview of BLAST
  • Overview of the Mercury system
  • Description of BLASTN algorithm
  • Algorithmic changes to BLASTN
  • Improvement in performance
  • Related work
  • Conclusion

3
Basic Local Alignment Search Tool
  • Biosequence comparison software
  • Query sequence (new genome) to large database of
    known biosequences
  • Look for similar regions
  • Exponential growth of genomic databases
  • Longer time for searches to complete
  • Solutions
      • Perform comparison over multiple machines
      • Specialized hardware (our approach)

4
The Mercury System
5
The Mercury System
  • Proximity to disk
  • Simple operations performed close to disk
  • Avoids CPU use
  • 400 Mbytes/s throughput from the disk
  • Concurrent, independent operation
  • Does not use processor cache cycles, memory or
    I/O buses
  • Reconfigurable logic
  • Logic can be tuned to the particular need of the
    application

6
BLASTN
  • BLASTN
  • Both the query and the database are long DNA
    strings
  • Consist of A, C, T, G and some unknowns
  • Each stage processes less data than the one before it
  • The stages become progressively more computationally expensive

7
BLASTN - Terminology
Query
ACTGTGTTTCACTGACGGGTGT
Database
CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG
w-mer is a sequence of w consecutive bases
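
To make the definition concrete, here is a minimal Python sketch that lists every w-mer of the query shown on this slide together with its offset. The function name `wmers` is ours and does not come from the slides.

```python
def wmers(seq, w):
    """Yield (offset, w-mer) for every window of w consecutive bases in seq."""
    for i in range(len(seq) - w + 1):
        yield i, seq[i:i + w]


# Query string from the slide above; w = 11 as used by BLASTN stage 1.
query = "ACTGTGTTTCACTGACGGGTGT"
for offset, word in wmers(query, 11):
    print(offset, word)
```

A sequence of length n contains n - w + 1 overlapping w-mers, so this 22-base query yields twelve 11-mers.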
8
BLASTN - Pipeline - Stage 1
  • Matches each 11-mer in query to database
  • Exact string matching
  • 83% of overall time is spent in this stage
  • Filters 92% of data entering this stage
  • Only 8% of data proceeds to the next stage
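
The exact-match step can be sketched in software as a lookup table keyed by the query's w-mers. This is only an illustration, not the NCBI or Mercury implementation: the names `build_query_index` and `find_seeds` are hypothetical, and w is reduced to 6 because the example strings from the terminology slide are too short to share an 11-mer.

```python
from collections import defaultdict


def build_query_index(query, w):
    """Map each w-mer in the query to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_seeds(query, database, w):
    """Return (query_offset, database_offset) for every exact w-mer match."""
    index = build_query_index(query, w)
    seeds = []
    for j in range(len(database) - w + 1):
        for i in index.get(database[j:j + w], ()):
            seeds.append((i, j))
    return seeds


# Example strings from the terminology slide. w = 6 is an assumption made
# for this sketch; the slides match 11-mers.
query = "ACTGTGTTTCACTGACGGGTGT"
database = "CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG"
print(find_seeds(query, database, 6))
```

Each database position costs one table lookup, which is why this stage acts as a cheap filter in front of the more expensive extension stages.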

9
BLASTN - Pipeline - Stage 2
  • Extends the matches from stage 1

ACTGTGTTTCACTGACGGGTGT
GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG
10
BLASTN - Pipeline - Stage 2
  • Extends the matches from stage 1
  • Allows mismatches of individual bases
  • Does not allow gaps in either the query or the
    database
  • Match score should be higher than threshold to
    proceed
  • 16% of pipeline time is spent in this stage
  • Only 2/100,000 of data entering this stage
    proceeds to the next stage
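
The ungapped extension described above can be sketched as follows. The scoring values (+1 match, -3 mismatch), the `x_drop` cutoff that stops the extension, the threshold value, and the function names are all assumptions made for illustration; the slides do not give these parameters.

```python
def extend_one_side(query, database, qi, dj, step, match, mismatch, x_drop):
    """Extend base by base in one direction (step = +1 or -1), without gaps.

    Returns the best score gained by the extension (0 if it never helps).
    """
    score = best = 0
    while 0 <= qi < len(query) and 0 <= dj < len(database):
        score += match if query[qi] == database[dj] else mismatch
        if score > best:
            best = score
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        dj += step
    return best


def ungapped_extend(query, database, qi, dj, w,
                    match=1, mismatch=-3, x_drop=5):
    """Score an ungapped extension of the w-mer seed at (qi, dj)."""
    seed_score = w * match
    right = extend_one_side(query, database, qi + w, dj + w, +1,
                            match, mismatch, x_drop)
    left = extend_one_side(query, database, qi - 1, dj - 1, -1,
                           match, mismatch, x_drop)
    return seed_score + left + right


# Strings from the previous Stage 2 slide; they share the 11-mer
# TTTCACTGACG at query offset 6 and database offset 12.
query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
THRESHOLD = 10  # hypothetical cutoff
score = ungapped_extend(query, database, qi=6, dj=12, w=11)
print(score, score > THRESHOLD)
```

Because query and database offsets always advance together, no gaps can be introduced; only mismatches lower the score.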

11
BLASTN - Pipeline - Stage 3
  • Extends the matches from stage 2

ACCACTGTTTCACTGACG_GA_T_GT
CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG
12
BLASTN - Pipeline - Stage 3
  • Extends the matches from stage 2
  • Scores matches with gaps allowed in both
    sequences
  • Smith-Waterman dynamic programming algorithm
  • <1% of pipeline time is spent in this stage
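
For reference, a minimal Smith-Waterman sketch that returns only the best local alignment score. It uses a linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, not the parameters used by BLASTN.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell is never negative: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The second string has one extra T, so the best alignment needs one gap.
print(smith_waterman("ACGTACGT", "ACGTTACGT"))
```

The cost grows with the product of the two sequence lengths, which is why this stage is applied only to the small fraction of data that survives stage 2.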

13
NCBI - BLASTN
  • Stage 1 (word matching) is implemented as a
    lookup table
  • Efficient only for certain word lengths (w 11)
  • Performance degrades dramatically for larger
    query sizes

Test machine: Pentium-4, 2.6 GHz, 1 Gbyte RAM
14
Firmware implementation - Stage 1
  • Bloom filters: match 11-mers to the query, but
    generate false positives
  • Hash table: eliminates false positives from the
    Bloom filters and obtains the offset in the query
  • Final step: discards matches that are close to one
    another
15
Bloom filters operation
Programming the query into the Bloom filter
(processing the query)
[Figure: query 11-mer -> K hash functions -> m-bit vector]
16
Bloom filters operation
Finding matches in the database
[Figure: database 11-mer -> K hash functions -> m-bit vector; output 1 = potential match, 0 = not a match]
17
Bloom filters operation
Finding matches in the database
[Figure: database 11-mer -> K hash functions -> m-bit vector; output 1 = potential match, 0 = not a match]
False positives are eliminated using a hash
table
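
The Bloom filter slides can be summarized in a short software sketch. It is not the firmware design: the salted SHA-256 hashing, the sizes `m` and `k`, and the names `BloomFilter` and `stage1_matches` are assumptions made for illustration (hardware would use much cheaper hash functions).

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits in the vector
        self.k = k                  # number of hash functions
        self.bits = bytearray(m)    # one byte per bit, for clarity

    def _positions(self, word):
        # Derive k bit positions by salting SHA-256 with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = 1

    def might_contain(self, word):
        # All k bits set: potential match. Any bit clear: definitely absent.
        return all(self.bits[p] for p in self._positions(word))


def stage1_matches(query, database, w=11, m=1 << 16, k=4):
    """Yield (query_offset, database_offset) for exact w-mer matches."""
    bloom = BloomFilter(m, k)
    table = {}                      # w-mer -> offsets in the query
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        bloom.add(word)             # program the query into the filter
        table.setdefault(word, []).append(i)
    for j in range(len(database) - w + 1):
        word = database[j:j + w]
        if bloom.might_contain(word):
            # The hash table removes false positives and gives the offset.
            for i in table.get(word, ()):
                yield i, j


query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
print(list(stage1_matches(query, database)))
```

A Bloom filter never reports a false negative, so the exact hash table lookup only has to run on the small share of database words that the filter lets through.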
18
Bloom filter performance
19
Performance analysis
Firmware vs. software, Stage 1
20
Overall system throughput
Tput_overall = min(Tput_1, Tput_(2&3))
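
As a small worked illustration of this relation: the pipeline can run no faster than its slower part. The function name and the numbers below are hypothetical, and both throughputs are assumed to be expressed in the same units of database input.

```python
def overall_throughput(tput_stage1, tput_stages_2_3):
    """Pipeline throughput is limited by its slowest part."""
    return min(tput_stage1, tput_stages_2_3)


# Hypothetical values in Mbases/s of database input, for illustration only.
print(overall_throughput(100.0, 40.0))  # limited by stages 2 and 3
print(overall_throughput(100.0, 250.0))  # limited by stage 1
```

Speeding up stage 1 alone helps only until stages 2 and 3 become the bottleneck.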
21
Stage 2 in firmware - Throughput
22
Stage 2 in firmware - Speedup
23
Related work
  • Hardware-based commercial systems
      • Paracel GeneMatcher™: uses an ASIC, and hence is
        inflexible
      • RDisk: FPGA-based system with throughput of 60
        Mbases/s for stage 1
  • High-end commercial systems
      • Paracel BLASTMachine2™: 32-CPU Linux cluster
          • 2.93 Mbases/s for a 2.8 Mbase query
          • 2 times faster than 1-node Mercury BLASTN
      • TimeLogic DeCypherBLAST™: FPGA based
          • 213 Kbases/s for a 16 Mbase query
          • Comparable to 1-node Mercury BLASTN

24
Conclusion
  • BLASTN on the Mercury system
  • Bloom filters to improve performance of stage 1
  • Efficient hash functions in hardware
  • 7x improvement in speed with only stage 1
    firmware
  • >50x speedup with stage 2 implemented in firmware
  • Future work
  • Algorithmic changes to stage 2
  • Efficient use of hardware capabilities
  • Other applications: BLASTP, BLASTX, etc.

25
Thank you