CSE182-L4: Keyword matching - PowerPoint PPT Presentation

About This Presentation

Title:

CSE182-L4: Keyword matching

Description:

Q: What is the score of the best alignment of the two strings s and t? ... Concatenate all of the database sequences to form one giant sequence. ... – PowerPoint PPT presentation

Number of Views:44

Avg rating:3.0/5.0

Slides: 61

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: CSE182-L4: Keyword matching

1
CSE182-L4 Keyword matching
2
Backward scoring

Defin Sbi,j Best scoring alignment of the
suffixes si1..n and tj1..m
Q What is the score of the best alignment of the
two strings s and t?
HW Write the recurrences for Sb

3
Forward/Backward computations
j
m
1

Fj Score of the best scoring alignment of
s1..n/2 and t1..j
Fj Sn/2,j
Bj Score of the best scoring alignment of
sn/21..n and tj1..m
Bj Sbn/2,j

n/2
4
Forward/Backward computations
j
m
1

At the optimal coordinate, j
FjBjSn,m
In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.

n/2
5
Forward, Backward computation

There exists a coordinate, j
FjBjSn,m
In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.

6
Linear Space Alignment

Align(1..n,1..m)
For all 1ltj lt m
Compute FjS(n/2,j)
For all 1ltj lt m
Compute BjSb(n/2,j)
j maxj FjBj
X Align(1..n/2,1..j)
Y Align(n/2..n,j..m)
Return X,j,Y

7
Linear Space complexity

T(nm) c.nm T(nm/2) O(nm)
Space O(m)

8
Summary

We considered the basics of sequence alignment
Opt score computation
Reconstructing alignments
Local alignments
Affine gap costs
Space saving measures
Can we recreate Blast?

9
Blast and local alignment

Concatenate all of the database sequences to form
one giant sequence.
Do local alignment computation with the query.

10
Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
11
Why is Blast Fast?
12
Silly Question!

True or False No two people in new york city
have the same number of hair

13
Observations

Much of the database is random from the querys
perspective
Consider a random DNA string of length n.
PrAPrC PrGPrT0.25
Assume for the moment that the query is all As
(length m).
What is the probability that an exact match to
the query can be found?

14
Basic probability

Probability that there is a match starting at a
fixed position i 0.25m
What is the probability that some position i has
a match.
Dependencies confound probability estimates.

15
Basic ProbabilityExpectation

Q Toss a coin each time it comes up heads, you
get a dollar
What is the money you expect to get after n
tosses?
Let Xi be the amount earned in the i-th toss

16
Expected number of matches

Expected number of matches can still be computed.

Let Xi1 if there is a match starting at
position i, Xi0 otherwise

Expected number of matches

17
Expected number of exact Matches is small!

Expected number of matches n0.25m
If n107, m10,
Then, expected number of matches 9.537
If n107, m11
expected number of hits 2.38
n107,m12,
Expected number of hits 0.5 lt 1
Bottom Line An exact match to a substring of the
query is unlikely just by chance.

18
Observation 2

What is the pigeonhole principle?

19
Why is this important?

Suppose we are looking for sequences that are 80
identical to the query sequence of length 100.
Assume that the mismatches are randomly
distributed.
What is the probability that there is no stretch
of 10 bp, where the query and the subject match
exactly?
Rough calculations show that it is very low.
Exact match of a short query substring to a truly
similar subject is very high.
The above equation does not take dependencies
into account
Reality is better because the matches are not
randomly distributed

20
Just the Facts

Consider the set of all substrings of the query
string of fixed length W.
Prob. of exact match to a random database string
is very low.
Prob. of exact match to a true homolog is very
high.
Keyword Search (exact matches) is MUCH faster
than sequence alignment

21
BLAST
Database (n)

Consider all (m-W) query words of size W (Default
11)
Scan the database for exact match to all such
words
For all regions that hit, extend using a dynamic
programming alignment.
Can be many orders of magnitude faster than SW
over the entire string

22
Why is BLAST fast?

Assume that keyword searching does not consume
any time and that alignment computation the
expensive step.
Query m1000, random Db n107, no TP
SW O(nm) 1000107 1010 computations
BLAST, W11
E(11-mer hits) 1000 (1/4)11 1072384
Number of computations 23841001002.384107
Ratio1010/(2.384107)420
Further speed improvements are possible

23
Keyword Matching

How fast can we match keywords?
Hash table/Db index? What is the size of the hash
table, for m11
Suffix trees? What is the size of the suffix
trees?
Trie based search. We will do this in class.

AATCA
567
24
Related notes

How to choose the alignment region?
Extend greedily until the score falls below a
certain threshold
What about protein sequences?
Default word size 3, and mismatches are
allowed.
Like sequences, BLAST has been evolving
continuously
Banded alignment
Seed selection
Scanning for exact matches, keyword search versus
database indexing

25
P-value computation

How significant is a score? What happens to
significance when you change the score function
A simple empirical method
Compute a distribution of scores against a random
database.
Use an estimate of the area under the curve to
get the probability.
OR, fit the distribution to one of the standard
distributions.

26
Z-scores for alignment

Initial assumption was that the scores followed a
normal distribution.
Z-score computation
For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation
Look up a table to get a P-value

27
Blast E-value

Initial (and natural) assumption was that scores
followed a Normal distribution
1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution
Practical consequence
Longer tail.
Previously significant hits now not so
significant

28
Exponential distribution

Random Database, Pr(1) p
What is the expected number of hits to a sequence
of k 1s
Instead, consider a random binary Matrix.
Expected of diagonals of k 1s

As you increase k, the number decreases
exponentially.
The number of diagonals of k runs can be
approximated by a Poisson process
In ungapped alignments, we replace the coin
tosses by column scores, but the behaviour does
not change (Karlin Altschul).
As the score increases, the number of alignments
that achieve the score decreases exponentially

30
Blast E-value

Choose a score such that the expected score
between a pair of residues lt 0
Expected number of alignments with a particular
score
For small values, E-value and P-value are the
same

31
Blast Variants

What is mega-blast?
What is discontiguous mega-blast?
Phi-Blast/Psi-Blast?
BLAT?
PatternHunter?

Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
32
Silly Quiz

Name a famous Bioinformatics Researcher

Name a famous Bioinformatics Researcher who is a
woman

33
Scoring DNA

DNA has structure.

34
DNA scoring matrices

So far, we considered a simple match/mismatch
criterion.
The nucleotides can be grouped into Purines (A,G)
and Pyrimidines.
Nucleotide substitutions within a group
(transitions) are more likely than those across a
group (transversions)

35
Scoring proteins

Scoring protein sequence alignments is a much
more complex task than scoring DNA
Not all substitutions are equal
Problem was first worked on by Pauling and
collaborators
In the 1970s, Margaret Dayhoff created the first
similarity matrices.
One size does not fit all
Homologous proteins which are evolutionarily
close should be scored differently than proteins
that are evolutionarily distant
Different proteins might evolve at different
rates and we need to normalize for that

36
PAM 1 distance

Two sequences are 1 PAM apart if they differ in 1
of the residues.

1 mismatch

PAM1(a,b) Prresidue b substitutes residue a,
when the sequences are 1 PAM apart

37
PAM1 matrix

Align many proteins that are very similar
Is this a problem?
PAM1 distance is the probability of a
substitution when 1 of the residues have changed
Estimate the frequency Pba of residue a being
substituted by residue b.
S(a,b) log10(Pab/PaPb) log10(Pba/Pb)

38
PAM 1
39
PAM distance

Two sequences are 1 PAM apart when they differ in
1 of the residues.
When are 2 sequences 2 PAMs apart?

2 PAM
40
Higher PAMs

PAM2(a,b) ?c PAM1(a,c). PAM1 (c,b)
PAM2 PAM1 PAM1 (Matrix multiplication)
PAM250
PAM1PAM249
PAM1250

41
Note This is not the score matrix What happens
as you keep increasing the power?
42
Scoring using PAM matrices

Suppose we know that two sequences are 250 PAMs
apart.
S(a,b) log10(Pab/PaPb) log10(Pba/Pb)
log10(PAM250(a,b)/Pb)

43
BLOSUM series of Matrices

Henikoff Henikoff Sequence substitutions in
evolutionarily distant proteins do not seem to
follow the PAM distributions
A more direct method based on hand-curated
multiple alignments of distantly related proteins
from the BLOCKS database.
BLOSUM60 Merge all proteins that have greater
than 60. Then, compute the substitution
probability.
In practice BLOSUM62 seems to work very well.

44
PAM vs. BLOSUM

What is the correspondence?
PAM1 Blosum1
PAM2 Blosum2
Blosum62
PAM250 Blosum100

45
Dictionary Matching, R.E. matching, and position
specific scoring
46
Keyword search

Recall In BLAST, we get a collection of keywords
from the query sequence, and identify all db
locations with an exact match to the keyword.
Question Given a collection of strings
(keywords), find all occrrences in a database
string where they keyword might match.

47
Dictionary Matching
1POTATO 2POTASSIUM 3TASTE
P O T A S T P O T A T O
database
dictionary

Q Given k words (si has length li), and a
database of size n, find all matches to these
words in the database string.
How fast can this be done?

48
Dict. Matching string matching

How fast can you do it, if you only had one word
of length m?
Trivial algorithm O(nm) time
Pre-processing O(m), Search O(n) time.
Dictionary matching
Trivial algorithm (l1l2l3)n
Using a keyword tree, lpn (lp is the length of
the longest pattern)
Aho-Corasick O(n) after preprocessing O(l1l2..)
We will consider the most general case

49
Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O

Observations
When we mismatch, we (should) know something
about where the next match will be.
When there is a mismatch, we (should) know
something about other patterns in the dictionary
as well.

50
The Trie Automaton

Construct an automaton A from the dictionary
Av,x describes the transition from node v to a
node w upon reading x.
Au,T v, and Au,S w
Special root node r
Some nodes are terminal, and labeled with the
index of the dictionary word.

1POTATO 2POTASSIUM 3TASTE
v
u
1
r
S
2
w
3
51
An O(lpn) algorithm for keyword matching

Start with the first position in the db, and the
root node.
If successful transition
Increment current pointer
Move to a new node
If terminal node success
Else
Retract current pointer
Increment start pointer
Move to root repeat

52
Illustration
P O T A S T P O T A T O
v
1
S
53
Idea for improving the time

Suppose we have partially matched pattern i
(indicated by l, and c), but fail subsequently.
If some other pattern j is to match
Then prefix(pattern j) suffix first c-l
characters of pattern(i))

c
l
P O T A S T P O T A T O
P O T A S S I U M
Pattern i
T A S T E
1POTATO 2POTASSIUM 3TASTE
Pattern j
54
Improving speed of dictionary matching

Every node v corresponds to a string sv that is a
prefix of some pattern.
Define Fv to be the node u such that su is the
longest suffix of sv
If we fail to match at v, we should jump to Fv,
and commence matching from there
Let lpv su

2
3
4
5
1
S
11
6
7
9
10
8
55
An O(n) alg. For keyword matching

Start with the first position in the db, and the
root node.
If successful transition
Increment current pointer
Move to a new node
If terminal node success
Else (if at root)
Increment current pointer
Mv start pointer
Move to root
Else
Move start pointer forward
Move to failure node

56
Illustration
P O T A S T P O T A T O
l
c
1
P
O
T
A
T
O
v
T
S
U
I
S
M
A
S
E
T
57
Time analysis

In each step, either c is incremented, or l is
incremented
Neither pointer is ever decremented (lpv lt
c-l).
l and c do not exceed n
Total time lt 2n

l
c
P O T A S T P O T A T O
58
Blast Putting it all together

Input Query of length m, database of size n
Select word-size, scoring matrix, gap penalties,
E-value cutoff

59
Blast Steps

Generate an automaton of all query keywords.
Scan database using a Dictionary Matching
algorithm (O(n) time). Identify all hits.
Extend each hit using a variant of local
alignment algorithm. Use the scoring matrix and
gap penalties.
For each alignment with score S, compute the
bit-score, E-value, and the P-value. Sort
according to increasing E-value until the cut-off
is reached.
Output results.

60
Protein Sequence Analysis

What can you do if BLAST does not return a hit?
Sometimes, homology (evolutionary similarity)
exists at very low levels of sequence similarity.

A Accept hits at higher P-value.
This increases the probability that the sequence
similarity is a chance event.
How can we get around this paradox?
Reformulated Q suppose two sequences B,C have
the same level of sequence similarity to sequence
A. If A B are related in function, can we assume
that A C are? If not, how can we distinguish?

Write a Comment

User Comments (0)