Fasta, Blast, Probabilities - PowerPoint PPT Presentation

About This Presentation

Title:

Fasta, Blast, Probabilities

Description:

Fasta, Blast, Probabilities – PowerPoint PPT presentation

Number of Views:105

Avg rating:3.0/5.0

Slides: 40

Provided by: NirF6

Category:

more less

Transcript and Presenter's Notes

Title: Fasta, Blast, Probabilities

1
Fasta, Blast, Probabilities
2
Reminder

Last classes we discussed dynamic programming
algorithms for
global alignment
local alignment
Multiple alignment
All of these assumed a scoring rulethat
determines the quality of perfect matches,
substitutions, insertions, and deletions.

3
Alignment in Real Life

One of the major uses of alignments is to find
sequences in a database.
The current protein database contains about 100
millions (i.e.,108) residues ! So searching a
1000 long target sequence requires to evaluate
about 1011 matrix cells which will take about
three hours in the rate of 10 millions
evaluations per second.
Quite annoying when, say, one thousand target
sequences need to be searched because it will
take about four months to run.

4
Heuristic Fast Search

Instead, most searches rely on heuristic
procedures
These are not guaranteed to find the best match
Sometimes, they will completely miss a
high-scoring match
We now describe the main ideas used by the
best known of these heuristic procedures.

5
Basic Intuition

Almost all heuristic search procedures are based
on the observation that real-life matches often
contain long strings with gap-less matches.
These heuristic try to find significant gap-less
matches and then extend them.

6
Banded DP

Suppose that we have two strings s1..n and
t1..m such that n?m
If the optimal alignment of s and t has few gaps,
then path of the alignment will be close to
diagonal

s
t
7
Banded DP

To find such a path, it suffices to search in a
diagonal region of the matrix.
If the diagonal band has width k, then the
dynamic programming step takes O(kn).
Much faster than O(n2) of standard DP.

8
Banded DP for local alignment

Problem Where is the banded diagonal ? It
need not be the main diagonal when looking for a
good local alignment.

How do we select which subsequences to align
using banded DP?
We heuristically find potential diagonals and
evaluate them using Banded DP. This is the main
idea of FASTA.
9
Finding Potential Diagonals

Suppose that we have a relatively long gap-less
match
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
Can we find clues that will let us find it
quickly?
Each such sequence defines a potential diagonal
(which is then evaluated using Banded DP.

10
Signature of a Match

Assumption good matches contain several
patches of perfect matches
AGCGCCATGGATTGAGCTA
TGCGACATTGATCGACCTA
Since this is a gap-less alignment, all perfect
match regionsshould be on one diagonal

11
FASTA-finding ungapped matches

Input strings s and t, and a parameter ktup
Find all pairs (i,j) such that si..iktuptj..j
ktup
Locate sets of pairs that are on the same
diagonal
By sorting according to the difference i-j
Compute the score for the diagonal that contains
all these pairs

12
FASTA-finding ungapped matches

Input strings s and t, and a parameter ktup
Find all pairs (i,j) such that si..iktuptj..j
ktup
Step one prepare an index of the database and
the query sequence such that given a sequence of
length ktup, one gets the list of positions.
(Linear time).
Step two for each ktup from the query add 1 in
the diagonal (i-j) in which it appears. Then find
contiguous (possibly with mismatch) ktup in
diagonals.

13
FASTA- using banded DP

Step 3
Select the ten high scoring contiguous segments
Try and score all combinations of these ten
segments in order to constitute a pass into the
matrix
Step 4
Run banded DP on the region containing the best
scoring pass (say with width 12).
Hence, the algorithm may combine some diagonals
into gapped matches (in the example below combine
diagonals 2 and 3).

14
FASTA- practical choices
Most applications of FASTA use very small ktup
(1-2 for proteins, and 4-6 for DNA). Higher
values are faster, yielding less diagonal to
search around, but increase the chance to miss
the optimal local alignment.
Some implementation choices /tricks have not been
explicated herein.
15
FASTA-summary

Input strings s and t, and a parameter ktup
1,2,4,5, or 6 depending on the application.
Output A highly scored local alignment
Find pairs of matching substrings
si..iktuptj..jktup
Extend to ungapped diagonals
Extend to gapped matches using banded DP

16
BLAST Overview(Basic Local Alignment Search
Tool)

Input strings s and t, and a parameter T
threshold value
Output A highly scored local alignment
Definition Two strings s and t of length k are
a high scoring pair (HSP) if d(s,t) gt T (usually
consider un-gapped alignments only).
Find high scoring pairs of substrings such that
d(s,t) gt T
These words serve as seeds for finding longer
matches
Extend to ungapped diagonals (as in FASTA)
Extend to gapped matches

17
BLAST Overview (cont.)

Step 1 Find high scoring pairs of substrings
such that d(s,t) gt T (The seeds)
Find all strings of length k which score at least
T with substrings of s in a gapless alignment (k
4 for proteins, 11 for DNA)
(note possibly, not all k-words must be
tested, e.g. when such a word scores less than T
with itself).
Find in t all exact matches with each of the
above strings.

18
Extending Potential Matches
Once a seed is found, BLAST attempts to find a
local alignment that extends the seed. Seeds on
the same diagonal are combined (as in FASTA),
then extended as far as possible in a greedy
manner without gap.
s
During the extension phase, the search stops when
the score passes below some lower bound computed
by BLAST (to save time). For the best ungap
alignment do a banded SW an assign a
probabilistic score.
t
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Why use probability to define and/or interpret a
scoring function ?

Similarity is probabilistic in nature because
biological changes like mutation, recombination,
and selection, are not deterministic.
We could answer questions such as
How probable two sequences are similar?
Is the similarity found significant or random?
How to change a similarity score when, say,
mutation rate of a specific area on the
chromosome becomes known ?

27
A Probabilistic Model

For now, we will focus on alignment without
indels.
For now, we assume each position (nucleotide
/amino-acid) is independent of other positions.
We consider two options
M the sequences are Matched (related)
R the sequences are Random (unrelated)

28
Unrelated Sequences

Our random model of unrelated sequences is simple
Each position is sampled independently from a
distribution over the alphabet ?
We assume there is a distribution q(?) that
describes the probability of letters in such
positions.
Then

29
Related Sequences

We assume that each pair of aligned positions
(si,ti) evolved from a common ancestor
Let p(a,b) be a distribution over pairs of
letters.
p(a,b) is the probability that some ancestral
letter evolved into this particular pair of
letters.

30
Odd-Ratio Test for Alignment
If Q gt 1, then the two strings s and t are more
likely to be related (M) than unrelated (R). If
Q lt 1, then the two strings s and t are more
likely to be unrelated (R) than related (M).
31
Log Odd-Ratio Test for Alignment

Taking logarithm of Q yields

Score(si,ti)
If log Q gt 0, then s and t are more likely to be
related. If log Q lt 0, then they are more likely
to be unrelated. How can we relate this
quantity to a score function ?
32
Probabilistic Interpretation of Scores

We define the scoring function via
Then, the score of an alignment is the log-ratio
between the two models
Score gt 0 ? Model is more likely
Score lt 0 ? Random is more likely

33
Estimating Probabilities

Suppose we are given a long string s1..n of
letters from ?
We want to estimate the distribution q() that
generated the sequence
How should we go about this?
We build on the theory of parameter estimation in
statistics using either maximum likelihood
estimation or the Bayesian approach .

34
Estimating q(?)

Suppose we are given a long string s1..n of
letters from ?
s can be the concatenation of all sequences in
our database
We want to estimate the distribution q(?)
That is, q is defined per letter

Likelihood function
35
Estimating q(?) (cont.)

How do we define q?

Likelihood function
ML parameters (Maximum Likelihood)
MAP parameters (Maximum A posteriori Probability)
36
Estimating p(,)

Intuition
Find pair of aligned sequences s1..n, t1..n,
Estimate probability of pairs
Again, s and t can be the concatenation of many
aligned pairs from the database

Number of times a is aligned with b in (s,t)
37
Problems in Estimating p(,)

How do we find pairs of aligned sequences?
How far is the ancestor ?
earlier divergence ? low sequence similarity
later divergence ? high sequence similarity
Does one letter mutate to the other or are they
both mutations of a common ancestor having yet
another residue/nucleotide acid ?

38
Estimating p(,) for proteins
39
PAM-1 matrices
For PAM-1 it is assumed that 1 of all amino
acids are mutated.
Proportion of mutation that concern a times
number of mutation
40
PAM-1 matrices
Define Mab to be the probability matrix for
switching from a to b via a mutation
41
Properties of PAM-1 matrices
Note that
Namely, the probability of not changing and
changing sums to 1.
42
Model of Evolution

Again, we need to make some assumptions
Each position changes independently of the rest
The probability of mutations is the same in each
positions
Evolution does not remember

Time
t
t?
t2?
t3?
t4?
43
Model of Evolution

How do we model such a process?
This process is called a Markov Chain
A chain is defined by the transition probability
P(Xt? bXta) - the probability that the next
state is b given that the current state is a
We often describe these probabilities by a
matrixM?ab P(Xt? bXta)

44
Multi-Step Changes

Based on Mab, we can compute the probabilities of
changes over two time periods

Thus M2? M?M?
By induction (HMW exercise) Mk? M? k

45
Longer Term Changes

Estimate M? (PAM-1 matrices)
Use Mk? M? k (PAM-k matrices)
Define

46
Using PAM

Historically researchers use PAM-250. (The only
one published in the original paper.)
Original PAM matrices were based on small number
of proteins (circa 1978). Later versions use many
more examples.
Used to be the most popular scoring rule, but
there are some problems with PAM matrices.

47
(No Transcript)
48
Problems with PAM
Normalization step is quite arbitrary. If, for
example, we define relative mutability using the
constant 50 rather than 100, we get
We will get
Now we have two different ways to estimate the
matrix M4? M4? M? 4 as we did before or
M4? M2? 2
M250? for example does not reflect well long
period changes.
49
BLOSUM

Idea use aligned ungapped regions of protein
families.These are assumed to have a common
ancestor. Similar ideas but better statistics
and modeling.
Procedure
Cluster together sequences in a family whenever
more than L identical residues are shared.
Count number of substitutions across different
clusters in the same family.
Estimate frequencies as before.
Practice Blosum50 and Blosum62 are wildly used.
(See page 43-44 in the text book).

Considered state of the art nowadays.
50
BLOSUM 62

Write a Comment

User Comments (0)