Title: CSE182-L4: Keyword matching
1CSE182-L4 Keyword matching
2Backward scoring
- Defin Sbi,j Best scoring alignment of the
suffixes si1..n and tj1..m - Q What is the score of the best alignment of the
two strings s and t? - HW Write the recurrences for Sb
3Forward/Backward computations
j
m
1
- Fj Score of the best scoring alignment of
s1..n/2 and t1..j - Fj Sn/2,j
- Bj Score of the best scoring alignment of
sn/21..n and tj1..m - Bj Sbn/2,j
n/2
4Forward/Backward computations
j
m
1
- At the optimal coordinate, j
- FjBjSn,m
- In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.
n/2
5Forward, Backward computation
- There exists a coordinate, j
- FjBjSn,m
- In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.
6Linear Space Alignment
- Align(1..n,1..m)
- For all 1ltj lt m
- Compute FjS(n/2,j)
- For all 1ltj lt m
- Compute BjSb(n/2,j)
- j maxj FjBj
- X Align(1..n/2,1..j)
- Y Align(n/2..n,j..m)
- Return X,j,Y
7Linear Space complexity
- T(nm) c.nm T(nm/2) O(nm)
- Space O(m)
8Summary
- We considered the basics of sequence alignment
- Opt score computation
- Reconstructing alignments
- Local alignments
- Affine gap costs
- Space saving measures
- Can we recreate Blast?
9Blast and local alignment
- Concatenate all of the database sequences to form
one giant sequence. - Do local alignment computation with the query.
10Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
11Why is Blast Fast?
12Silly Question!
- True or False No two people in new york city
have the same number of hair
13Observations
- Much of the database is random from the querys
perspective - Consider a random DNA string of length n.
- PrAPrC PrGPrT0.25
- Assume for the moment that the query is all As
(length m). - What is the probability that an exact match to
the query can be found?
14Basic probability
- Probability that there is a match starting at a
fixed position i 0.25m - What is the probability that some position i has
a match. - Dependencies confound probability estimates.
15Basic ProbabilityExpectation
- Q Toss a coin each time it comes up heads, you
get a dollar - What is the money you expect to get after n
tosses? - Let Xi be the amount earned in the i-th toss
16Expected number of matches
- Expected number of matches can still be computed.
i
- Let Xi1 if there is a match starting at
position i, Xi0 otherwise
- Expected number of matches
17Expected number of exact Matches is small!
- Expected number of matches n0.25m
- If n107, m10,
- Then, expected number of matches 9.537
- If n107, m11
- expected number of hits 2.38
- n107,m12,
- Expected number of hits 0.5 lt 1
- Bottom Line An exact match to a substring of the
query is unlikely just by chance.
18Observation 2
- What is the pigeonhole principle?
19Why is this important?
- Suppose we are looking for sequences that are 80
identical to the query sequence of length 100. - Assume that the mismatches are randomly
distributed. - What is the probability that there is no stretch
of 10 bp, where the query and the subject match
exactly? - Rough calculations show that it is very low.
Exact match of a short query substring to a truly
similar subject is very high. - The above equation does not take dependencies
into account - Reality is better because the matches are not
randomly distributed
20Just the Facts
- Consider the set of all substrings of the query
string of fixed length W. - Prob. of exact match to a random database string
is very low. - Prob. of exact match to a true homolog is very
high. - Keyword Search (exact matches) is MUCH faster
than sequence alignment
21BLAST
Database (n)
- Consider all (m-W) query words of size W (Default
11) - Scan the database for exact match to all such
words - For all regions that hit, extend using a dynamic
programming alignment. - Can be many orders of magnitude faster than SW
over the entire string
22Why is BLAST fast?
- Assume that keyword searching does not consume
any time and that alignment computation the
expensive step. - Query m1000, random Db n107, no TP
- SW O(nm) 1000107 1010 computations
- BLAST, W11
- E(11-mer hits) 1000 (1/4)11 1072384
- Number of computations 23841001002.384107
- Ratio1010/(2.384107)420
- Further speed improvements are possible
23Keyword Matching
- How fast can we match keywords?
- Hash table/Db index? What is the size of the hash
table, for m11 - Suffix trees? What is the size of the suffix
trees? - Trie based search. We will do this in class.
AATCA
567
24Related notes
- How to choose the alignment region?
- Extend greedily until the score falls below a
certain threshold - What about protein sequences?
- Default word size 3, and mismatches are
allowed. - Like sequences, BLAST has been evolving
continuously - Banded alignment
- Seed selection
- Scanning for exact matches, keyword search versus
database indexing
25P-value computation
- How significant is a score? What happens to
significance when you change the score function - A simple empirical method
- Compute a distribution of scores against a random
database. - Use an estimate of the area under the curve to
get the probability. - OR, fit the distribution to one of the standard
distributions.
26Z-scores for alignment
- Initial assumption was that the scores followed a
normal distribution. - Z-score computation
- For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation - Look up a table to get a P-value
27Blast E-value
- Initial (and natural) assumption was that scores
followed a Normal distribution - 1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution - Practical consequence
- Longer tail.
- Previously significant hits now not so
significant
28Exponential distribution
- Random Database, Pr(1) p
- What is the expected number of hits to a sequence
of k 1s - Instead, consider a random binary Matrix.
Expected of diagonals of k 1s
29- As you increase k, the number decreases
exponentially. - The number of diagonals of k runs can be
approximated by a Poisson process - In ungapped alignments, we replace the coin
tosses by column scores, but the behaviour does
not change (Karlin Altschul). - As the score increases, the number of alignments
that achieve the score decreases exponentially
30Blast E-value
- Choose a score such that the expected score
between a pair of residues lt 0 - Expected number of alignments with a particular
score - For small values, E-value and P-value are the
same
31Blast Variants
- What is mega-blast?
- What is discontiguous mega-blast?
- Phi-Blast/Psi-Blast?
- BLAT?
- PatternHunter?
Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
32Silly Quiz
- Name a famous Bioinformatics Researcher
- Name a famous Bioinformatics Researcher who is a
woman
33Scoring DNA
34DNA scoring matrices
- So far, we considered a simple match/mismatch
criterion. - The nucleotides can be grouped into Purines (A,G)
and Pyrimidines. - Nucleotide substitutions within a group
(transitions) are more likely than those across a
group (transversions)
35Scoring proteins
- Scoring protein sequence alignments is a much
more complex task than scoring DNA - Not all substitutions are equal
- Problem was first worked on by Pauling and
collaborators - In the 1970s, Margaret Dayhoff created the first
similarity matrices. - One size does not fit all
- Homologous proteins which are evolutionarily
close should be scored differently than proteins
that are evolutionarily distant - Different proteins might evolve at different
rates and we need to normalize for that
36PAM 1 distance
- Two sequences are 1 PAM apart if they differ in 1
of the residues.
1 mismatch
- PAM1(a,b) Prresidue b substitutes residue a,
when the sequences are 1 PAM apart
37PAM1 matrix
- Align many proteins that are very similar
- Is this a problem?
- PAM1 distance is the probability of a
substitution when 1 of the residues have changed - Estimate the frequency Pba of residue a being
substituted by residue b. - S(a,b) log10(Pab/PaPb) log10(Pba/Pb)
38PAM 1
39PAM distance
- Two sequences are 1 PAM apart when they differ in
1 of the residues. - When are 2 sequences 2 PAMs apart?
2 PAM
40Higher PAMs
- PAM2(a,b) ?c PAM1(a,c). PAM1 (c,b)
- PAM2 PAM1 PAM1 (Matrix multiplication)
- PAM250
- PAM1PAM249
- PAM1250
41Note This is not the score matrix What happens
as you keep increasing the power?
42Scoring using PAM matrices
- Suppose we know that two sequences are 250 PAMs
apart. - S(a,b) log10(Pab/PaPb) log10(Pba/Pb)
log10(PAM250(a,b)/Pb)
43BLOSUM series of Matrices
- Henikoff Henikoff Sequence substitutions in
evolutionarily distant proteins do not seem to
follow the PAM distributions - A more direct method based on hand-curated
multiple alignments of distantly related proteins
from the BLOCKS database. - BLOSUM60 Merge all proteins that have greater
than 60. Then, compute the substitution
probability. - In practice BLOSUM62 seems to work very well.
44PAM vs. BLOSUM
- What is the correspondence?
- PAM1 Blosum1
- PAM2 Blosum2
- Blosum62
- PAM250 Blosum100
45Dictionary Matching, R.E. matching, and position
specific scoring
46Keyword search
- Recall In BLAST, we get a collection of keywords
from the query sequence, and identify all db
locations with an exact match to the keyword. - Question Given a collection of strings
(keywords), find all occrrences in a database
string where they keyword might match.
47Dictionary Matching
1POTATO 2POTASSIUM 3TASTE
P O T A S T P O T A T O
database
dictionary
- Q Given k words (si has length li), and a
database of size n, find all matches to these
words in the database string. - How fast can this be done?
48Dict. Matching string matching
- How fast can you do it, if you only had one word
of length m? - Trivial algorithm O(nm) time
- Pre-processing O(m), Search O(n) time.
- Dictionary matching
- Trivial algorithm (l1l2l3)n
- Using a keyword tree, lpn (lp is the length of
the longest pattern) - Aho-Corasick O(n) after preprocessing O(l1l2..)
- We will consider the most general case
49Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
- Observations
- When we mismatch, we (should) know something
about where the next match will be. - When there is a mismatch, we (should) know
something about other patterns in the dictionary
as well.
50The Trie Automaton
- Construct an automaton A from the dictionary
- Av,x describes the transition from node v to a
node w upon reading x. - Au,T v, and Au,S w
- Special root node r
- Some nodes are terminal, and labeled with the
index of the dictionary word.
1POTATO 2POTASSIUM 3TASTE
v
u
1
r
S
2
w
3
51An O(lpn) algorithm for keyword matching
- Start with the first position in the db, and the
root node. - If successful transition
- Increment current pointer
- Move to a new node
- If terminal node success
- Else
- Retract current pointer
- Increment start pointer
- Move to root repeat
52Illustration
P O T A S T P O T A T O
v
1
S
53Idea for improving the time
- Suppose we have partially matched pattern i
(indicated by l, and c), but fail subsequently.
If some other pattern j is to match - Then prefix(pattern j) suffix first c-l
characters of pattern(i))
c
l
P O T A S T P O T A T O
P O T A S S I U M
Pattern i
T A S T E
1POTATO 2POTASSIUM 3TASTE
Pattern j
54Improving speed of dictionary matching
- Every node v corresponds to a string sv that is a
prefix of some pattern. - Define Fv to be the node u such that su is the
longest suffix of sv - If we fail to match at v, we should jump to Fv,
and commence matching from there - Let lpv su
2
3
4
5
1
S
11
6
7
9
10
8
55An O(n) alg. For keyword matching
- Start with the first position in the db, and the
root node. - If successful transition
- Increment current pointer
- Move to a new node
- If terminal node success
- Else (if at root)
- Increment current pointer
- Mv start pointer
- Move to root
- Else
- Move start pointer forward
- Move to failure node
56Illustration
P O T A S T P O T A T O
l
c
1
P
O
T
A
T
O
v
T
S
U
I
S
M
A
S
E
T
57Time analysis
- In each step, either c is incremented, or l is
incremented - Neither pointer is ever decremented (lpv lt
c-l). - l and c do not exceed n
- Total time lt 2n
l
c
P O T A S T P O T A T O
58Blast Putting it all together
- Input Query of length m, database of size n
- Select word-size, scoring matrix, gap penalties,
E-value cutoff
59Blast Steps
- Generate an automaton of all query keywords.
- Scan database using a Dictionary Matching
algorithm (O(n) time). Identify all hits. - Extend each hit using a variant of local
alignment algorithm. Use the scoring matrix and
gap penalties. - For each alignment with score S, compute the
bit-score, E-value, and the P-value. Sort
according to increasing E-value until the cut-off
is reached. - Output results.
60Protein Sequence Analysis
- What can you do if BLAST does not return a hit?
- Sometimes, homology (evolutionary similarity)
exists at very low levels of sequence similarity.
- A Accept hits at higher P-value.
- This increases the probability that the sequence
similarity is a chance event. - How can we get around this paradox?
- Reformulated Q suppose two sequences B,C have
the same level of sequence similarity to sequence
A. If A B are related in function, can we assume
that A C are? If not, how can we distinguish?