Title: QUASAR
1QUASAR
- Q-gram Based Database Searching Using a Suffix Array
- By Stefan Burkhardt, Andreas Crauser, Paolo Ferragina, Hans-Peter Lenhof, Eric Rivals and Martin Vingron
2Structure
- Introduction / Problem
- Algorithm
- An Example
- Results
3Introduction
- Goal of QUASAR
- Edit Distance
- Idea of QUASAR
4The Goal of QUASAR
- QUASAR: Q-gram Alignment based on Suffix ARray
- QUASAR was developed to quickly detect sequences with strong similarity to the query, in a context where many searches are conducted on one database.
Example: search query S = CCATTAGCTAA on database D = AGCTATTAACGTCA
5Edit Distance (Levenshtein distance)
- A metric for measuring the similarity of two strings
- The edit distance is the minimal number of characters that have to be changed, added or deleted to turn a given string S into another given string T
- Example: edit distance between Tier and Tor
- change i to o: Tier -> Toer
- delete e: Toer -> Tor
- 2 changes => the edit distance between Tier and Tor is 2
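The definition above can be checked with a short program. This is a minimal sketch using the standard dynamic-programming recurrence for the Levenshtein distance; the function name `edit_distance` is an illustrative choice and not part of QUASAR.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t (row-by-row dynamic programming)."""
    # prev[j] = distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string needs i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # add ct
                            prev[j - 1] + cost))  # change cs to ct (or keep it)
        prev = curr
    return prev[-1]


print(edit_distance("Tier", "Tor"))  # 2
```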
6The Idea of QUASAR
- To be faster than existing algorithms such as BLAST, QUASAR uses a filtering technique
- The filter selects the sequences in the database that are likely to be highly similar to the query
- These candidates are passed to the matching algorithm BLAST, which inspects them in depth
Search query S: CACATGAGAA
Database D:
AAAGGGGTTCCCCCTAAACACTGACGAACTGACGAAGTCCAAAAGG TTT
TAACCCCTTTAAAGGGCGACTTGACACCATTGAGAACCCAAAA GGGGTT
TCCCTTTGGGCCCGGAAGGAATTAATTCCBBBAAAAAACC
Candidates selected by the filter: CACTGACGAACTGACGAAGT, GACTTGACACCATTGAGAAC
7The Algorithm
8Q-gram
- Definition
- A q-gram of a string S is a substring of S with fixed length q.
Example for q = 2:
w1 = TRUDIGERSTER: TR RU UD DI IG GE ER RS ST TE ER
w2 = STEVENSPIELBERG: ST TE EV VE EN NS SP PI IE EL LB BE ER RG
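As an illustration of the definition, the q-grams of a string can be listed with a sliding slice. This is a small sketch; the helper name `qgrams` is made up for this example.

```python
def qgrams(s, q):
    """All substrings of s with length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


w1 = qgrams("TRUDIGERSTER", 2)
w2 = qgrams("STEVENSPIELBERG", 2)
print(w1)
# q-grams that occur in both words
print(sorted(set(w1) & set(w2)))  # ['ER', 'ST', 'TE']
```

A string of length n has n - q + 1 q-grams, so w1 (length 12) yields 11 of them and w2 (length 15) yields 14.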
9Q-gram Lemma
- Lemma
- Let P and S be strings of length w with at most k differences. Then P and S share at least $t = w - q + 1 - kq$ common q-grams.
Example for q = 3, k = 1 and w = 8 => $t = 8 - 3 + 1 - 1 \cdot 3 = 3$
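The threshold from the lemma is plain arithmetic and can be written as a one-line helper. The function name `qgram_threshold` is an assumption for illustration only.

```python
def qgram_threshold(w, q, k):
    """Lower bound t = w - q + 1 - k*q on the number of shared q-grams."""
    return w - q + 1 - k * q


print(qgram_threshold(w=8, q=3, k=1))  # 3
```

The bound is only useful as a filter when it is positive; for large k or q it drops to zero or below and no longer excludes anything.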
10Filtering Idea
- Given a maximum number k of differences on a fixed window size w, we get approximate matches by finding subsequences of D that share at least t q-grams with the window.
Example for k = 1, w = 8, q = 3 => $t = w - q + 1 - kq = 3$
Search query S: CACATGAGAA
Database D: AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG
PROBLEM: For each window position in S we have to check all possible subsequences in the database
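To make the cost of this direct approach concrete, here is a naive sketch that compares every window of S against every window of D. It is not how QUASAR works; it only illustrates the problem stated above. The names `qgram_profile` and `naive_filter` are hypothetical, and shared q-grams are counted as a multiset intersection, which is one reasonable reading of "shared".

```python
from collections import Counter


def qgram_profile(s, q):
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def naive_filter(query, database, w, q, k):
    """Return (query_pos, db_pos, shared) for all window pairs with shared >= t.

    Positions are 0-based. Every query window is compared with every
    database window, which is exactly the quadratic work the slide warns about.
    """
    t = w - q + 1 - k * q
    hits = []
    for i in range(len(query) - w + 1):
        query_profile = qgram_profile(query[i:i + w], q)
        for j in range(len(database) - w + 1):
            db_profile = qgram_profile(database[j:j + w], q)
            shared = sum((query_profile & db_profile).values())
            if shared >= t:
                hits.append((i, j, shared))
    return hits


S = "CACATGAGAA"
D = "AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG"
for hit in naive_filter(S, D, w=8, q=3, k=1):
    print(hit)
```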
11Filtering Idea - Partition
- Instead of testing all subsequences of size w in the database, we count shared q-grams on non-overlapping blocks of a given size b
Example for k = 1, w = 8, q = 3 => $t = w - q + 1 - kq = 3$, b = 16 (b = 2w)
Search query S: CACATGAGAA
Database D: AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG
Block counters: 0, 5, 0
PROBLEM: We have to find all occurrences of a q-gram to increase the counters
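The block counters of this example can be reproduced with a few lines. This is a sketch under two assumptions: the counters shown belong to the first window of S (CACATGAG), and an occurrence is counted only if it lies completely inside one block. Occurrences are found here with plain string search, which is exactly the step the slide flags as the problem to solve. The function name `count_block_hits` is hypothetical.

```python
def count_block_hits(window, database, q, b):
    """Count, per non-overlapping block of size b, the occurrences of the
    window's q-grams that lie completely inside that block."""
    counters = [0] * ((len(database) + b - 1) // b)
    for i in range(len(window) - q + 1):
        gram = window[i:i + q]
        start = database.find(gram)
        while start != -1:
            # count the hit only if first and last character share a block
            if start // b == (start + q - 1) // b:
                counters[start // b] += 1
            start = database.find(gram, start + 1)
    return counters


D = "AAAGGGGTTCCCCCTAAACACTGAGGAACTGACGAAGTCCAAAAGG"
print(count_block_hits("CACATGAG", D, q=3, b=16))  # [0, 5, 0]
```

Only the middle block reaches the threshold t = 3, so only that block would be passed on for closer inspection.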
12A Short Summary
- QUASAR first filters the database for pieces that are possible matches and passes them to the matching algorithm BLAST
- To filter, QUASAR takes those pieces from the database which share a certain number of q-grams with the search query.
- To count the number of shared q-grams, it partitions the database into non-overlapping blocks of fixed length b.
- To find the positions of the shared q-grams it uses an index based on a suffix array
13An Example
- Define Variables
- Build Suffix Array
- Build Hitlist
- Counting Q-grams
14Define Variables
- size of q-grams: q = 3
- alphabet: Σ = {A, C, G, T}
- window size: w = 8
- edit distance: k = 1
- threshold: $t = w - q + 1 - kq = 8 - 3 + 1 - 1 \cdot 3 = 3$
- block size: b = 16
- query sequence: S = CACATGAGAA
- database: D =
ATCAAGTTCTAAAGAT
TGACACCTGAGAAAGT
CGAAGGGCTAACTTTC
AAAGTTCTTAGAGAGA
15Build Suffix Array (part 1)
1 ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
2 TCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
3 CAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
4 AAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
29 AAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
30 AGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
31 GTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
32 TCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
33 CGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
34 GAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
35 AAGGGCTAACTTTCAAAGTTCTTAGAGAGA
36 AGGGCTAACTTTCAAAGTTCTTAGAGAGA
..
61 GAGA
62 AGA
63 GA
64 A
16Build Suffix Array (part 2)
17Build Hitlist
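One way a hit list for a q-gram could be obtained from a suffix array is a binary search over the sorted suffixes: all suffixes that start with the same q-gram are adjacent in the array. The sketch below assumes a naively built suffix array and 1-based positions as used in the example; the names `build_suffix_array` and `build_hitlist_lookup` are hypothetical and do not describe QUASAR's actual index layout.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """0-based start positions of all suffixes in lexicographic order
    (naive construction, adequate for a short example)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_hitlist_lookup(text, q):
    """Return a function that maps a q-gram to its 1-based positions in text."""
    sa = build_suffix_array(text)
    # The first q characters of each sorted suffix are themselves in
    # non-decreasing order, so they can be binary-searched directly.
    prefixes = [text[i:i + q] for i in sa]

    def lookup(gram):
        lo = bisect_left(prefixes, gram)
        hi = bisect_right(prefixes, gram)
        return sorted(i + 1 for i in sa[lo:hi])

    return lookup


D = "ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA"
lookup = build_hitlist_lookup(D, q=3)
print(lookup("TGA"))  # [17, 24]
print(lookup("AGA"))  # [13, 26, 58, 60, 62]
print(lookup("CAT"))  # []
```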
18Finish Initialisation of QUASAR
> partition db
> init block counter to 0
> place window
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
19Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: CAC
Block  Bounds  Occurred q-grams  q-gram counter
b1     1-16    -                 0
b2     9-24    CAC               1
b3     17-32   CAC               1
b4     25-40   -                 0
b5     33-48   -                 0
b6     41-56   -                 0
b7     57-64   -                 0
20Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: ACA
Block  Bounds  Occurred q-grams  q-gram counter
b1     1-16    -                 0
b2     9-24    CAC ACA           2
b3     17-32   CAC ACA           2
b4     25-40   -                 0
b5     33-48   -                 0
b6     41-56   -                 0
b7     57-64   -                 0
21Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: CAT (does not occur in D, counters unchanged)
Block  Bounds  Occurred q-grams  q-gram counter
b1     1-16    -                 0
b2     9-24    CAC ACA           2
b3     17-32   CAC ACA           2
b4     25-40   -                 0
b5     33-48   -                 0
b6     41-56   -                 0
b7     57-64   -                 0
22Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: ATG (does not occur in D, counters unchanged)
Block  Bounds  Occurred q-grams  q-gram counter
b1     1-16    -                 0
b2     9-24    CAC ACA           2
b3     17-32   CAC ACA           2
b4     25-40   -                 0
b5     33-48   -                 0
b6     41-56   -                 0
b7     57-64   -                 0
23Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: TGA
Block  Bounds  Occurred q-grams  q-gram counter
b1     1-16    -                 0
b2     9-24    CAC ACA TGA       3
b3     17-32   CAC ACA TGA TGA   4
b4     25-40   -                 0
b5     33-48   -                 0
b6     41-56   -                 0
b7     57-64   -                 0
24Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Current q-gram: GAG
Block  Bounds  Occurred q-grams     q-gram counter
b1     1-16    -                    0
b2     9-24    CAC ACA TGA          3
b3     17-32   CAC ACA TGA TGA GAG  5
b4     25-40   GAG                  1
b5     33-48   -                    0
b6     41-56   -                    0
b7     57-64   GAG GAG              2
25Start Counting q-grams
> get next q-gram
> look up in hitlist
> increase counter(s)
> shift window
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
IDEA: take the lost q-gram out and the new one in
Block  Bounds  Occurred q-grams     q-gram counter
b1     1-16    -                    0
b2     9-24    CAC ACA TGA          3
b3     17-32   CAC ACA TGA TGA GAG  5
b4     25-40   GAG                  1
b5     33-48   -                    0
b6     41-56   -                    0
b7     57-64   GAG GAG              2
26Counting q-grams
> handle lost q-gram
> handle additional q-gram
Decrease the counter by one on all blocks that contain CAC and whose counter is smaller than the threshold t (t = 3).
Increase the counters of all blocks containing AGA.
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Block  Bounds  Occurred q-grams         q-gram counter
b1     1-16    AGA                      1
b2     9-24    CAC ACA TGA AGA          4
b3     17-32   CAC ACA TGA TGA GAG AGA  6
b4     25-40   GAG AGA                  2
b5     33-48   -                        0
b6     41-56   -                        0
b7     57-64   GAG GAG AGA AGA AGA      5
27Counting q-grams
> shift window
> handle lost q-gram
> handle additional q-gram
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Additional q-gram: GAA
Block  Bounds  Occurred q-grams             q-gram counter
b1     1-16    AGA                          1
b2     9-24    CAC ACA TGA AGA              4
b3     17-32   CAC ACA TGA TGA GAG AGA GAA  7
b4     25-40   GAG AGA GAA GAA              4
b5     33-48   GAA                          1
b6     41-56   -                            0
b7     57-64   GAG GAG AGA AGA AGA          5
28Counting q-grams
> pass blocks with counter >= t to BLAST
- S = CACATGAGAA
- D = ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA
Block  Bounds  Occurred q-grams             q-gram counter
b1     1-16    AGA                          1
b2     9-24    CAC ACA TGA AGA              4
b3     17-32   CAC ACA TGA TGA GAG AGA GAA  7
b4     25-40   GAG AGA GAA GAA              4
b5     33-48   GAA                          1
b6     41-56   -                            0
b7     57-64   GAG GAG AGA AGA AGA          5
Blocks b2, b3, b4 and b7 have a counter >= t = 3.
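The counting steps of this example can be replayed in a few lines. This is a sketch of the walk-through only, not of QUASAR's implementation: the block bounds are copied from the table on the slides, occurrences are found with plain string search instead of the hitlist, and a counter is decreased for a lost q-gram only while it is still below t, as stated on the "handle lost q-gram" slide. All names are illustrative.

```python
Q, W, K = 3, 8, 1
T = W - Q + 1 - K * Q  # threshold t = 3
S = "CACATGAGAA"
D = "ATCAAGTTCTAAAGATTGACACCTGAGAAAGTCGAAGGGCTAACTTTCAAAGTTCTTAGAGAGA"
# 1-based inclusive block bounds b1..b7, taken from the slides
BOUNDS = [(1, 16), (9, 24), (17, 32), (25, 40), (33, 48), (41, 56), (57, 64)]


def blocks_hit(gram):
    """Block indices that fully contain an occurrence of gram,
    listed once per occurrence."""
    hits = []
    pos = D.find(gram)
    while pos != -1:
        first, last = pos + 1, pos + len(gram)  # 1-based inclusive span
        hits += [b for b, (lo, hi) in enumerate(BOUNDS)
                 if lo <= first and last <= hi]
        pos = D.find(gram, pos + 1)
    return hits


counters = [0] * len(BOUNDS)
grams = [S[i:i + Q] for i in range(len(S) - Q + 1)]
per_window = W - Q + 1  # number of q-grams in one window

for n, gram in enumerate(grams):
    if n >= per_window:
        # the window has shifted: handle the q-gram that dropped out
        for b in blocks_hit(grams[n - per_window]):
            if counters[b] < T:
                counters[b] -= 1
    # handle the new q-gram
    for b in blocks_hit(gram):
        counters[b] += 1

candidates = [f"b{b + 1}" for b, c in enumerate(counters) if c >= T]
print(counters)    # [1, 4, 7, 4, 1, 0, 5]
print(candidates)  # ['b2', 'b3', 'b4', 'b7']
```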
29Result
30Results
- window size w = 50
- q = 11
- threshold t for edit distance k = 3
- block size b = 1024 bp
2. Space: the suffix array has to fit in memory
31Questions?
32- Lemma
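- Let P and S be strings of length w with at most k differences. Then P and S share at least $t = w - q + 1 - kq$ common q-grams.
Example: q = 3, k = 2, w = 15
(Figure: a window of length w = 15 containing two marked regions, each labelled q = 3 and 1; the remaining part is labelled w - k(q+1).)
Number of q-grams: $w - k(q+1) - (q-1) + k = w - kq - q + 1$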
33Suffix Array
Text: endoplasmatic
Suffixes:
1. endoplasmatic
2. ndoplasmatic
3. doplasmatic
4. oplasmatic
5. plasmatic
6. lasmatic
7. asmatic
8. smatic
9. matic
10. atic
11. tic
12. ic
13. c
Order lexicographically:
7. asmatic
10. atic
13. c
3. doplasmatic
1. endoplasmatic
12. ic
6. lasmatic
9. matic
2. ndoplasmatic
4. oplasmatic
5. plasmatic
8. smatic
11. tic
Build suffix array:
7,10,13,3,1,12,6,9,2,4,5,8,11
Completed!
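The construction shown above amounts to sorting the suffix start positions by the suffixes they point to. A minimal sketch, using naive sorting (fine for a short word, far too slow for a real sequence database) and 1-based positions as on the slide; the function name `suffix_array` is illustrative.

```python
def suffix_array(text):
    """1-based start positions of all suffixes of text, sorted lexicographically."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("endoplasmatic"))
# [7, 10, 13, 3, 1, 12, 6, 9, 2, 4, 5, 8, 11]
```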
34Blocksize dependency