Biosequence Similarity Search on the Mercury System

Slides: 26

Provided by: praveenkri

Learn more at: https://www.arl.wustl.edu

Transcript and Presenter's Notes

1
Biosequence Similarity Search on the Mercury
System

Praveen Krishnamurthy, Jeremy Buhler, Roger
Chamberlain, Mark Franklin, Kwame Gyang, and
Joseph Lancaster
Department of Computer Science and Engineering,
Washington University in Saint Louis, MO

Supported by an NIH STTR Grant and NSF Grants
DBI-0237902, ITR-0313203, CCR-0217334
2
Outline

Overview of BLAST
Overview of the Mercury system
Description of BLASTN algorithm
Algorithmic changes to BLASTN
Improvement in performance
Related work
Conclusion

3
Basic Local Alignment Search Tool

Biosequence comparison software
Query sequence (new genome) to large database of
known biosequences
Look for similar regions
Exponential growth of genomic databases
Longer time for searches to complete
Solutions
Perform comparison over multiple machines
Specialized hardware - Our Approach

4
The Mercury System
5
The Mercury System

Proximity to disk
Simple operations performed close to disk
Avoids CPU use
400 Mbytes/s throughput from the disk
Concurrent Independent operation
Does not use processor cache cycles, memory or
I/O buses
Reconfigurable logic
Logic can be tuned to the particular need of the
application

6
BLASTN

BLASTN
Both the query and the database are long DNA
strings
Consist of A, C, T, G and some unknowns
Each stage processes less data than the one before it
The stages become more computationally expensive

7
BLASTN - Terminology
Query
ACTGTGTTTCACTGACGGGTGT
Database
CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG
w-mer is a sequence of w consecutive bases
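
The w-mer definition above can be illustrated with a minimal sketch. The helper name `wmers` is hypothetical and not part of BLASTN; the query string is the one shown on this slide.

```python
def wmers(seq, w):
    """Yield (offset, w-mer) for every window of w consecutive bases in seq."""
    for i in range(len(seq) - w + 1):
        yield i, seq[i:i + w]

query = "ACTGTGTTTCACTGACGGGTGT"

# A sequence of length n contains n - w + 1 overlapping w-mers.
words = list(wmers(query, 11))
for offset, word in words:
    print(offset, word)
```

Because the windows overlap, a query of n bases yields n - w + 1 w-mers rather than n / w.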
8
BLASTN - Pipeline - Stage 1

Matches each 11-mer in query to database
Exact string matching
83% of overall time is spent in this stage
Filters 92% of data entering this stage
Only 8% of data proceeds to the next stage
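
A minimal software sketch of exact 11-mer matching, assuming a dictionary keyed by query 11-mers. The function names `build_lookup` and `find_hits` are hypothetical; this is an illustration of the idea, not the NCBI or Mercury implementation. The example sequences are the ones shown on the Stage 2 slide.

```python
from collections import defaultdict

W = 11  # word length used in this stage

def build_lookup(query, w=W):
    """Map every w-mer of the query to the list of offsets where it occurs."""
    table = defaultdict(list)
    for q in range(len(query) - w + 1):
        table[query[q:q + w]].append(q)
    return table

def find_hits(table, database, w=W):
    """Slide a w-wide window over the database and report exact matches
    as (query_offset, database_offset) pairs."""
    hits = []
    for d in range(len(database) - w + 1):
        for q in table.get(database[d:d + w], ()):
            hits.append((q, d))
    return hits

query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
hits = find_hits(build_lookup(query), database)
print(hits)
```

Most database windows find no entry in the table and are dropped immediately, which is the filtering behavior this stage relies on.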

9
BLASTN - Pipeline - Stage 2

Extends the matches from stage 1

ACTGTGTTTCACTGACGGGTGT
GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG
10
BLASTN - Pipeline - Stage 2

Extends the matches from stage 1
Allows mismatches of individual bases
Does not allow gaps in either the query or the
database
Match score should be higher than threshold to
proceed
16% of pipeline time is spent in this stage
Only 2/100,000 of data entering this stage
proceeds to the next stage
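
Ungapped extension can be sketched as follows. The scoring values (+1 match, -3 mismatch) and the X-drop cutoff are assumptions chosen for illustration, and the function names are hypothetical; the slides do not give the parameters actually used.

```python
def extend_one_side(query, database, q, d, step, match, mismatch, xdrop):
    """Best score gained by extending from (q, d) in direction `step` (+1 or -1).
    Stops when the running score falls more than `xdrop` below the best seen."""
    best = score = 0
    while 0 <= q < len(query) and 0 <= d < len(database):
        score += match if query[q] == database[d] else mismatch
        if score > best:
            best = score
        elif best - score > xdrop:
            break
        q += step
        d += step
    return best

def ungapped_score(query, database, q, d, w=11, match=1, mismatch=-3, xdrop=5):
    """Score of a stage 1 hit at (q, d) after extending it in both directions.
    Mismatches are allowed; gaps are not."""
    seed = w * match
    left = extend_one_side(query, database, q - 1, d - 1, -1, match, mismatch, xdrop)
    right = extend_one_side(query, database, q + w, d + w, +1, match, mismatch, xdrop)
    return seed + left + right

query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
score = ungapped_score(query, database, q=6, d=12)
print(score)
```

A hit would proceed to the next stage only if its score exceeds a threshold; the threshold value is not stated in the slides, so none is hard-coded here.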

11
BLASTN - Pipeline - Stage 3

Extends the matches from stage 2

ACCACTGTTTCACTGACG_GA_T_GT
CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG
12
BLASTN - Pipeline - Stage 3

Extends the matches from stage 2
Scores matches with gaps inserted in either
sequence
Smith-Waterman dynamic programming algorithm
<1% of pipeline time is spent in this stage
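
A minimal sketch of Smith-Waterman local alignment scoring. It assumes a linear gap penalty and illustrative scores (+1 match, -3 mismatch, -2 per gap position); these values are assumptions, not the parameters used by BLASTN, and only the best score is returned, not the alignment itself.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell is never negative: a local alignment may start anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# One extra base in the second sequence is bridged by a single gap.
print(smith_waterman("ACGTACGT", "ACGTTACGT"))
```

The cost is proportional to the product of the two sequence lengths, which is why this stage is applied only to the few matches that survive stage 2.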

13
NCBI - BLASTN

Stage 1 (word matching) is implemented as a
lookup table
Efficient only for certain word lengths (w = 11)
Performance degrades dramatically for larger
query sizes

Pentium-4 2.6GHz 1Gbyte RAM
14
Firmware implementation - Stage 1
Matches 11-mers to query, but generates
false positives
Eliminates false positives from Bloom filters,
obtains offset in query
Discards matches that are close to one another
15
Bloom filters operation
Programming the query into the Bloom filter
(processing query)
[Diagram: query 11-mer -> K hash functions -> m-bit vector]
16
Bloom filters operation
Finding matches in the database
[Diagram: database 11-mer -> K hash functions -> m-bit vector; 1 = potential match, 0 = not a match]
17
Bloom filters operation
Finding matches in the database
[Diagram: database 11-mer -> K hash functions -> m-bit vector, with entries marked "?"; 1 = potential match, 0 = not a match]
False positives are eliminated using a hash
table
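
The Bloom filter slides can be sketched in software as follows. This is an illustration under stated assumptions, not the firmware design: the hash functions are built from salted SHA-256 (the slides do not say which hashes the hardware uses), the vector size and K are arbitrary, and the names `BloomFilter` and `stage1` are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for readability

    def _positions(self, word):
        """K positions in the m-bit vector, one per (assumed) hash function."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = 1

    def might_contain(self, word):
        # All K bits set: potential match. Any bit clear: definitely not a match.
        return all(self.bits[p] for p in self._positions(word))

def stage1(query, database, w=11, m_bits=1 << 12, k_hashes=4):
    bloom = BloomFilter(m_bits, k_hashes)
    offsets = {}  # hash table: query w-mer -> offsets in the query
    for q in range(len(query) - w + 1):
        word = query[q:q + w]
        bloom.add(word)  # programming the query into the filter
        offsets.setdefault(word, []).append(q)
    hits = []
    for d in range(len(database) - w + 1):
        word = database[d:d + w]
        if bloom.might_contain(word):
            # The exact lookup removes false positives and yields the query offset.
            for q in offsets.get(word, ()):
                hits.append((q, d))
    return hits

query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
hits = stage1(query, database)
print(hits)
```

A Bloom filter can report a word that was never programmed into it, but it never misses one that was, so the hash table only has to examine the small fraction of database 11-mers that pass the filter.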
18
Bloom filter performance
19
Performance analysis
Firmware Vs. Software Stage 1
20
Overall system throughput
$\mathrm{Tput}_{\mathrm{overall}} = \min(\mathrm{Tput}_{1},\ \mathrm{Tput}_{2,3})$
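
The throughput relation on this slide says the pipeline runs no faster than its slower part. A small sketch, with made-up rates used purely for illustration (the slides' measured numbers are in figures that did not survive in this transcript):

```python
def overall_throughput(tput_stage1, tput_stages_2_3):
    """Overall throughput is bounded by the slower of stage 1 and stages 2 and 3."""
    return min(tput_stage1, tput_stages_2_3)

# Hypothetical rates in the same units (for example, Mbases/s of database input).
example = overall_throughput(400.0, 150.0)
print(example)
```

Speeding up stage 1 alone therefore helps only until the later stages become the bottleneck.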
21
Stage 2 in firmware - Throughput
22
Stage 2 in firmware - Speedup
23
Related work