q-gram Based Database Searching Using A Suffix Array (QUASAR)

1
q-gram Based Database Searching Using A Suffix Array (QUASAR)
  • S. Burkhardt
  • A. Crauser
  • H-P. Lenhof
  • E. Rivals
  • P. Ferragina
  • M. Vingron

Max-Planck-Institut für Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg
2
Outline
  • Existing Work
  • Motivation
  • Problem
  • Algorithm
  • Results

3
Existing Work
  • Examples
  • BLAST
  • FASTA
  • Linear Scan (No Index)
  • Good Sensitivity

4
Motivation
  • Today: New Applications
  • Examples
  • EST-Clustering
  • Large Scale Shotgun Assembly
  • Low Sensitivity
  • Multiple Searches
  • Specialized Algorithms Needed

5
Problem Definition
  • Local Alignment, minimum Length w (w = 8 in the example)
  • Low Error Rate (< 10% Edit Distance)
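
To make the error criterion concrete, here is a minimal Python sketch of the edit distance (insertions, deletions, substitutions) between two strings. The function name `edit_distance` and the sample strings are illustrative assumptions, not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))      # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


# One substitution (T -> G) between two length-8 strings.
d = edit_distance("TCGATTAC", "TCGAGTAC")
```

A match of length w would then count as a hit when its edit distance stays below the chosen fraction of w.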

6
The Algorithm
  • Filter Step
  • Identify Hotspots
  • Scan Step
  • Scan Hotspots with BLAST

7
The Algorithm
  • q-gram Filtration
  • Block Addressing
  • Suffix Array
  • Window Shifting

Window (w = 8): T C G A T T A C
Query: T C G A T T A C A G T G A A T
Database: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
q = 3, number of q-grams in P = |P| - q + 1
Edit Distance e ⇒ at least t = |P| - q + 1 - q·e common q-grams
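
The counting argument on this slide can be checked with a short Python sketch. The helper names `qgrams` and `qgram_threshold` are assumptions made for illustration. The values q = 3, w = 8, e = 1 are the ones shown in the slide example.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length, q, e):
    """Minimum number of shared q-grams when the edit distance is at most e."""
    return length - q + 1 - q * e


window = "TCGATTAC"
grams = qgrams(window, 3)                 # 8 - 3 + 1 = 6 q-grams
t = qgram_threshold(len(window), 3, 1)    # 6 - 3*1 = 3
```

Each edit operation can destroy at most q overlapping q-grams, which is where the q·e term comes from.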
8
The Algorithm
  • q-gram Filtration
  • Block Addressing
  • Suffix Array
  • Window Shifting

T C G A T T A C
  • Divide D into Blocks
  • Count matching q-grams per Block
  • Scan Blocks with counter ≥ t

How to find the matching q-grams?
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
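
The three steps above can be sketched in Python as follows. This is a simplified illustration, not the QUASAR implementation. It finds matching q-grams with a plain dictionary instead of a suffix array, assigns each occurrence to the block containing its start position, and uses a block size of 10, chosen only to fit the 30-character example database.

```python
from collections import defaultdict


def build_qgram_index(db, q):
    """Map every q-gram of db to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(db) - q + 1):
        index[db[i:i + q]].append(i)
    return index


def count_blocks(window, db, q, block_size):
    """Count, per block of db, the occurrences of the window's q-grams."""
    index = build_qgram_index(db, q)
    n_blocks = (len(db) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for i in range(len(window) - q + 1):
        for pos in index.get(window[i:i + q], ()):
            counters[pos // block_size] += 1
    return counters


db = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counters = count_blocks("TCGATTAC", db, q=3, block_size=10)
t = 3                                            # threshold from the example
candidates = [b for b, c in enumerate(counters) if c >= t]
```

With these inputs only the first block reaches the threshold, so only that block would be passed on to the scan step.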
9
The Algorithm
  • q-gram Filtration
  • Block Addressing
  • Suffix Array
  • Window Shifting

T C G A T T A C
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
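
As a rough illustration of how a suffix array answers "where does this q-gram occur?", the Python sketch below sorts all suffix start positions of the example database and binary-searches for the range of suffixes that begin with a given q-gram. The naive construction by sorting and the function names are assumptions for illustration. The slides do not describe how the index is actually built or stored.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, gram):
    """Positions in text where gram occurs, via two binary searches on sa."""
    q = len(gram)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= gram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] < gram:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose prefix > gram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] <= gram:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


db = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(db)
hits = find_occurrences(db, sa, "GAC")
```

All occurrences of a q-gram sit next to each other in the suffix array, so one lookup returns every matching position at once.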
10
The Algorithm
  • q-gram Filtration
  • Block Addressing
  • Suffix Array
  • Window Shifting

q = 3, w = 8, e = 1, t = 3
T C G A T T A C A G T G A A T
T C G A T T A C
  • Move Window over Query
  • Mark full Blocks for each Window
  • Scan Marked Blocks

G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
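
Window shifting can be sketched in Python as an incremental update of the block counters: when the window moves one position, the q-gram that enters is added and the q-gram that leaves is subtracted. This is a simplified illustration under stated assumptions: a dictionary lookup stands in for the suffix array, occurrences are assigned to the block of their start position, and the block size of 10 is chosen only for the example. The function name `mark_blocks` is hypothetical.

```python
from collections import defaultdict


def mark_blocks(query, db, q, w, e, block_size):
    """Return the set of blocks whose counter reaches t for some window."""
    t = w - q + 1 - q * e
    blocks_of = defaultdict(list)        # q-gram -> blocks of its occurrences
    for i in range(len(db) - q + 1):
        blocks_of[db[i:i + q]].append(i // block_size)
    counters = [0] * ((len(db) + block_size - 1) // block_size)
    per_window = w - q + 1               # q-grams covered by one window
    marked = set()
    for j in range(len(query) - q + 1):
        for b in blocks_of.get(query[j:j + q], ()):       # q-gram entering
            counters[b] += 1
        if j >= per_window:
            old = query[j - per_window:j - per_window + q]
            for b in blocks_of.get(old, ()):              # q-gram leaving
                counters[b] -= 1
        if j >= per_window - 1:                           # window is complete
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return marked


query = "TCGATTACAGTGAAT"
db = "GCATTCGATGGACTGGACTAGTGAATCAGT"
marked = mark_blocks(query, db, q=3, w=8, e=1, block_size=10)
```

For the example query, the first and the last block are marked and would be handed to the scan step.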
11
Results
  • Influence of the Block Size
  • Sensitivity
  • Running Times
  • Overhead for loading the Index

Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM
12
Results
Influence of Block Size
13
Results
Sensitivity
  • 1000 Queries
  • BLAST Cutoff E = 0.00001
  • Number of identical hitlists
  • Mouse EST DB: 91.4%
  • Human EST DB: 97.1%
  • QUASAR finds many Hits below selected Error Level

14
Results
Running Times
  • Test Parameters
  • 6% Error
  • w = 50
  • q = 11
  • block size 2048
  • scan with BLAST
  • time averaged for 1000 queries
  • 30 times faster than BLAST
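
For orientation, the filter threshold implied by these test parameters can be worked out with the q-gram bound t = w - q + 1 - q·e. The sketch below assumes that the error figure is a percentage of the window length and that e is rounded down to a whole number of edits. Neither detail is stated on the slide.

```python
w, q = 50, 11
error_percent = 6                  # read as 6% of the window length (assumption)

e = (error_percent * w) // 100     # 300 // 100 = 3 allowed edit operations
t = w - q + 1 - q * e              # 40 - 33 = 7 common q-grams required
```

Under these assumptions a block would need at least 7 matching 11-grams within a window of 50 to be scanned.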

15
Results
Overhead for Loading the Index
  • 1000 queries
  • Human EST DB, 280 Mbp (megabase pairs)
  • BLAST Test Run
  • 5 seconds Load Time
  • 13,270 seconds Search Time
  • QUASAR Test Run
  • 90 seconds Load Time
  • 380 seconds Search Time
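
The effect of the index loading overhead on the overall speedup can be estimated from the figures above. The sketch takes the BLAST search time as 13,270 seconds, which assumes the dot in the slide is a thousands separator. The variable names are illustrative.

```python
blast_load, blast_search = 5, 13270      # seconds, 1000 queries
quasar_load, quasar_search = 90, 380     # seconds, 1000 queries

# Speedup of the search phase alone.
search_speedup = blast_search / quasar_search

# Speedup when the one-time load cost is included.
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)
```

Under that reading, search alone is roughly 35 times faster and the full run including load time is roughly 28 times faster, in line with the "30 times faster" figure on the previous slide.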
