QUASAR - PowerPoint PPT Presentation

Provided by: dbis

Title: QUASAR


1
QUASAR
  • Q-gram Based Database Searching Using a Suffix
    Array
  • By Stefan Burkhardt, Andreas Crauser, Paolo
    Ferragina, Hans-Peter Lenhof, Eric Rivals and
    Martin Vingron

2
Structure
  • Introduction / Problem
  • Algorithm
  • An Example
  • Results

3
Introduction
  • Goal of QUASAR
  • Edit Distance
  • Idea of QUASAR

4
The Goal of QUASAR
  • QUASAR - Q-gram Alignment based on Suffix ARray
  • QUASAR was developed to quickly detect sequences
    with strong similarity to the query, in a context
    where many searches are conducted on one database.

Example: search query S = CCATTAGCTAA
on database D = AGCTATTAACGTCA
5
Edit Distance (Levenshtein distance)
  • metric for measuring the similarity of two
    strings
  • The edit distance is the minimal number of
    characters that must be changed, added or deleted
    to turn a given string S into another string T
  • example
  • Edit distance between Tier and Tor

Tier → (change i to o) → Toer → (delete e) → Tor
2 changes → the edit distance between Tier and Tor is 2
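
The definition above can be sketched with the standard dynamic-programming recurrence for the Levenshtein distance. This is a minimal illustration and not taken from the slides; the function name edit_distance is an assumption.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between s and t (row-by-row dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # change cs to ct (or keep it)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tier", "Tor"))  # 2
```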
6
The Idea of QUASAR
  • To be faster than existing algorithms such as
    BLAST, QUASAR uses a filtering technique
  • The filter selects those database sequences that
    are likely to be highly similar to the query
  • These are passed to the matching algorithm BLAST,
    which inspects them in depth

Search query S: CACATGAGAA
Database D:
AAAGGGGTTCCCCCTAAACACTGACGAACTGACGAAGTCCAAAAGGTTT
TAACCCCTTTAAAGGGCGACTTGACACCATTGAGAACCCAAAAGGGGTT
TCCCTTTGGGCCCGGAAGGAATTAATTCCBBBAAAAAACC
Selected subsequences: CACTGACGAACTGACGAAGT, GACTTGACACCATTGAGAAC
7
The Algorithm
8
Q-gram
  • definition
  • A q-gram of a string S is a substring of S of
    fixed length q.

example (q = 2):
w1 = TRUDIGERSTER
  TR RU UD DI IG GE ER RS ST TE ER
w2 = STEVENSPIELBERG
  ST TE EV VE EN NS SP PI IE EL LB BE ER RG
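
As a small illustration of the definition, the following sketch lists all q-grams of the two example words and the q-grams they have in common. The helper name qgrams is an assumption made for this example.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All substrings of s with length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


w1 = qgrams("TRUDIGERSTER", 2)
w2 = qgrams("STEVENSPIELBERG", 2)

print(w1)  # 11 q-grams, ER occurs twice
print(w2)  # 14 q-grams
print(sorted(set(w1) & set(w2)))  # ['ER', 'ST', 'TE']
```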
9
Q-gram Lemma
  • Lemma
  • Let P and S be strings of length w with at most k
    differences. Then P and S share at least
  • t = w - q + 1 - kq common q-grams

example for q = 3, k = 1 and w = 8: t = 8 - 3 + 1 - 1·3 = 3
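
The bound can be checked on a small case. The sketch below computes t and counts the q-grams shared by two windows of length 8 that differ in one character; the second window CACTTGAG and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def threshold(w: int, q: int, k: int) -> int:
    """Lower bound on shared q-grams from the q-gram lemma."""
    return w - q + 1 - k * q


def shared_qgrams(p: str, s: str, q: int) -> int:
    """Number of q-grams common to p and s, counted with multiplicity."""
    cp = Counter(p[i:i + q] for i in range(len(p) - q + 1))
    cs = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return sum((cp & cs).values())  # multiset intersection


print(threshold(8, 3, 1))                        # 3
print(shared_qgrams("CACATGAG", "CACTTGAG", 3))  # 3 (CAC, TGA, GAG)
```

In this case the single substitution destroys exactly q = 3 q-grams, so the bound is reached with equality.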
10
Filtering Idea
  • Given the maximum number of differences k within
    a window of fixed size w, we find approximate
    matches by looking for subsequences of D that
    share at least t q-grams with the window.

Example for k = 1, w = 8, q = 3: t = w - q + 1 - kq = 3
Search query S: CACATGAGAA
Database D: AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG
PROBLEM: For each window position in S we have
to check all possible subsequences in the
database
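
To make the cost of this direct approach concrete, here is a minimal sketch that compares one query window against every window of the database. The function names and the toy database string are assumptions for illustration; the work grows with the database length for every single query window.

```python
from collections import Counter


def profile(s: str, q: int) -> Counter:
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def naive_candidates(window: str, db: str, q: int, t: int) -> list[int]:
    """0-based start positions of database windows sharing >= t q-grams."""
    w = len(window)
    target = profile(window, q)
    hits = []
    for start in range(len(db) - w + 1):  # every database window is examined
        shared = sum((target & profile(db[start:start + w], q)).values())
        if shared >= t:
            hits.append(start)
    return hits


# Toy database (hypothetical) containing CACTTGAG, one substitution away
# from the query window CACATGAG.
print(naive_candidates("CACATGAG", "TTTTCACTTGAGTTTT", q=3, t=3))  # [4]
```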
11
Filtering Idea - Partition
  • Instead of testing all subsequences of size w in
    the database we count shared q-grams on
    non-overlapping blocks of given size b

Example for k = 1, w = 8, q = 3: t = w - q + 1 - kq = 3, b = 16
(b = 2w)
Search query S: CACATGAGAA
Database D: AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG
Block counters: 0, 5, 0
PROBLEM: We have to find all occurrences of a
q-gram to increase the counters
12
A Short Summary
  • QUASAR first filters the database for pieces that
    may match and passes them to the matching
    algorithm BLAST
  • To filter, QUASAR selects those pieces of the
    database that share a certain number of q-grams
    with the search query.
  • To count the number of shared q-grams, it
    partitions the database into non-overlapping
    blocks of fixed length b.
  • To find the positions of the shared q-grams, it
    uses an index based on a suffix array
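
How a suffix array can serve as such an index may be sketched as follows: all suffixes starting with a given q-gram are adjacent in the sorted order, so their positions can be found by binary search. This is a simplified illustration with a naive construction, not QUASAR's actual index layout; the function names are assumptions, and the database string is the one used in the worked example of this presentation.

```python
from bisect import bisect_left, bisect_right


def build_index(text: str, q: int):
    """Suffix array (0-based) plus the q-length prefix of each sorted suffix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    prefixes = [text[i:i + q] for i in sa]  # sorted because the suffixes are
    return sa, prefixes


def occurrences(index, gram: str) -> list[int]:
    """All 0-based start positions of gram, found by binary search."""
    sa, prefixes = index
    lo = bisect_left(prefixes, gram)
    hi = bisect_right(prefixes, gram)
    return sorted(sa[lo:hi])


D = ("ATCAAGTTCTAAAGAT" "TGACACCTGAGAAAGT"
     "CGAAGGGCTAACTTTC" "AAAGTTCTTAGAGAGA")
index = build_index(D, 3)
print(occurrences(index, "GAG"))  # [24, 58, 60]
print(occurrences(index, "CAT"))  # []
```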

13
An Example
  • Define Variables
  • Build Suffix Array
  • Build Hitlist
  • Counting Q-grams

14
Define Variables
  • size of q-grams q = 3
  • alphabet Σ = {A, C, G, T}
  • window size w = 8
  • edit distance k = 1
  • threshold t = w - q + 1 - kq = 8 - 3 + 1 - 1·3 = 3
  • block size b = 16
  • query sequence S
    CACATGAGAA
  • database D
    ATCAAGTTCTAAAGAT
    TGACACCTGAGAAAGT
    CGAAGGGCTAACTTTC
    AAAGTTCTTAGAGAGA

15
Build Suffix Array (part 1)
1 ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
2 TCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
3 CAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
4 AAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
..
29 AAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
30 AGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
31 GTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
32 TCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
33 CGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
34 GAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
35 AAGGGCTAACTTTCAAAGTTCTTAGAGAGA
36 AGGGCTAACTTTCAAAGTTCTTAGAGAGA
..
61 GAGA
62 AGA
63 GA
64 A
16
Build Suffix Array (part 2)
17
Build Hitlist
18
Finish Initialisation of QUASAR
> partition db
> init block counter to 0
> place window
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

19
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: CAC

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC                          1
b3     17-32   CAC                          1
b4     25-40   -                            0
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   -                            0
20
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: ACA

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA                      2
b3     17-32   CAC ACA                      2
b4     25-40   -                            0
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   -                            0
21
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: CAT (does not occur in D, counters unchanged)

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA                      2
b3     17-32   CAC ACA                      2
b4     25-40   -                            0
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   -                            0
22
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: ATG (does not occur in D, counters unchanged)

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA                      2
b3     17-32   CAC ACA                      2
b4     25-40   -                            0
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   -                            0
23
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: TGA

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA TGA                  3
b3     17-32   CAC ACA TGA TGA              4
b4     25-40   -                            0
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   -                            0
24
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Current q-gram: GAG

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA TGA                  3
b3     17-32   CAC ACA TGA TGA GAG          5
b4     25-40   GAG                          1
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   GAG GAG                      2
25
Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
> shift window
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

IDEA: when the window is shifted, take the q-gram that
drops out of the window out and put the new one in

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    -                            0
b2     9-24    CAC ACA TGA                  3
b3     17-32   CAC ACA TGA TGA GAG          5
b4     25-40   GAG                          1
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   GAG GAG                      2
26
Counting q-grams
> handle lost q-gram
> handle additional q-gram
Decrease the counter by one on every block that
contains CAC and whose counter is smaller than the
threshold t (t = 3). Increase the counters of all
blocks containing AGA.
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Lost q-gram: CAC (b2 and b3 have already reached t, so
no counter is decreased). Additional q-gram: AGA

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    AGA                          1
b2     9-24    CAC ACA TGA AGA              4
b3     17-32   CAC ACA TGA TGA GAG AGA      6
b4     25-40   GAG AGA                      2
b5     33-48   -                            0
b6     41-56   -                            0
b7     57-64   GAG GAG AGA AGA AGA          5
27
Counting q-grams
> shift window
> handle lost q-gram
> handle additional q-gram
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Additional q-gram: GAA

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    AGA                          1
b2     9-24    CAC ACA TGA AGA              4
b3     17-32   CAC ACA TGA TGA GAG AGA GAA  7
b4     25-40   GAG AGA GAA GAA              4
b5     33-48   GAA                          1
b6     41-56   -                            0
b7     57-64   GAG GAG AGA AGA AGA          5
28
Counting q-grams
> pass blocks with counter >= t to BLAST
  • S = CACATGAGAA
  • D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAA
    AGTTCTTAGAGAGA

Block  Bounds  Occurring q-grams            q-gram counter
b1     1-16    AGA                          1
b2     9-24    CAC ACA TGA AGA              4
b3     17-32   CAC ACA TGA TGA GAG AGA GAA  7
b4     25-40   GAG AGA GAA GAA              4
b5     33-48   GAA                          1
b6     41-56   -                            0
b7     57-64   GAG GAG AGA AGA AGA          5

Blocks with counter >= t = 3: b2, b3, b4, b7
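
The counting steps of this example can be replayed with a short sketch. It rests on several assumptions read off the example rather than stated in it: blocks of length b start every b/2 positions (which matches the listed bounds of b1 to b6; the last block then spans 49-64 instead of the listed 57-64, which gives the same counters here), a q-gram counts for a block only if it lies completely inside it, and a plain dictionary stands in for the suffix-array hit list. The name quasar_counters is hypothetical.

```python
from collections import defaultdict


def quasar_counters(query: str, db: str, q: int, w: int, k: int, b: int):
    """Replay the block counting of the worked example; returns (t, counters)."""
    t = w - q + 1 - k * q
    # Stand-in for the hit list: q-gram -> 0-based start positions in db.
    hits = defaultdict(list)
    for i in range(len(db) - q + 1):
        hits[db[i:i + q]].append(i)
    # Assumed block layout: length b, one block every b // 2 positions.
    starts = list(range(0, len(db) - b + 1, b // 2))
    counters = [0] * len(starts)

    def update(gram: str, delta: int) -> None:
        for pos in hits.get(gram, []):
            for j, s in enumerate(starts):
                inside = s <= pos and pos + q <= s + b
                # Counters that already reached t are never decreased.
                if inside and (delta > 0 or counters[j] < t):
                    counters[j] += delta

    # First window: add all of its q-grams.
    for i in range(w - q + 1):
        update(query[i:i + q], +1)
    # Each shift: take the lost q-gram out, put the new one in.
    for shift in range(1, len(query) - w + 1):
        update(query[shift - 1:shift - 1 + q], -1)
        update(query[shift + w - q:shift + w], +1)
    return t, counters


D = ("ATCAAGTTCTAAAGAT" "TGACACCTGAGAAAGT"
     "CGAAGGGCTAACTTTC" "AAAGTTCTTAGAGAGA")
t, counters = quasar_counters("CACATGAGAA", D, q=3, w=8, k=1, b=16)
print(t, counters)  # 3 [1, 4, 7, 4, 1, 0, 5]
print([f"b{j + 1}" for j, c in enumerate(counters) if c >= t])
# ['b2', 'b3', 'b4', 'b7']
```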
29
Result
  • QUASAR in the real world

30
Results
  • window size w = 50
  • q = 11
  • threshold t for edit distance 3
  • block size b = 1024 bps

2. Space: the suffix array has to fit in memory
31
Questions?
32
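  • Lemma
  • Let P and S be strings of length w with at most k
    differences. Then P and S share at least
  • t = w - q + 1 - kq common q-grams

Example: q = 3, k = 2, w = 15
[Figure: a string of length w = 15 with two segments labelled q = 3 and + 1; remaining length w - k(q + 1)]
q-grams: w - k(q + 1) - (q - 1) + k = w - kq - q + 1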
33
Suffix Array
  • example

endoplasmatic

build suffixes:
1. endoplasmatic
2. ndoplasmatic
3. doplasmatic
4. oplasmatic
5. plasmatic
6. lasmatic
7. asmatic
8. smatic
9. matic
10. atic
11. tic
12. ic
13. c

order lexicographically:
7. asmatic
10. atic
13. c
3. doplasmatic
1. endoplasmatic
12. ic
6. lasmatic
9. matic
2. ndoplasmatic
4. oplasmatic
5. plasmatic
8. smatic
11. tic

build suffix array:
7,10,13,3,1,12,6,9,2,4,5,8,11
completed!
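
The construction in this example amounts to sorting the suffix start positions by the suffixes they point to. A minimal sketch with 1-based positions as on the slide follows; the naive sort is for illustration only and is not an efficient construction, and the function name is an assumption.

```python
def suffix_array(text: str) -> list[int]:
    """1-based start positions of all suffixes in lexicographical order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]


print(suffix_array("endoplasmatic"))
# [7, 10, 13, 3, 1, 12, 6, 9, 2, 4, 5, 8, 11]
```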
34
Blocksize dependency