Fasta, Blast, Probabilities - PowerPoint PPT Presentation

1 / 39
About This Presentation
Title:

Fasta, Blast, Probabilities

Description:

... T with substrings of s in a gapless alignment (k = 4 for proteins, 11 for DNA) ... in the text book). Considered state of the art nowadays. 50. BLOSUM 62 ... – PowerPoint PPT presentation

Number of Views:49
Avg rating:3.0/5.0
Slides: 40
Provided by: nirfri7
Category:

less

Transcript and Presenter's Notes

Title: Fasta, Blast, Probabilities


1
Fasta, Blast, Probabilities
2
Reminder
  • Last classes we discussed dynamic programming
    algorithms for
  • global alignment
  • local alignment
  • Multiple alignment
  • All of these assumed a scoring rulethat
    determines the quality of perfect matches,
    substitutions, insertions, and deletions.

3
Alignment in Real Life
  • One of the major uses of alignments is to find
    sequences in a database.
  • The current protein database contains about 100
    millions (i.e.,108) residues ! So searching a
    1000 long target sequence requires to evaluate
    about 1011 matrix cells which will take about
    three hours in the rate of 10 millions
    evaluations per second.
  • Quite annoying when, say, one thousand target
    sequences need to be searched because it will
    take about four months to run.

4
Heuristic Fast Search
  • Instead, most searches rely on heuristic
    procedures
  • These are not guaranteed to find the best match
  • Sometimes, they will completely miss a
    high-scoring match
  • We now describe the main ideas used by the
    best known of these heuristic procedures.

5
Basic Intuition
  • Almost all heuristic search procedures are based
    on the observation that real-life matches often
    contain long strings with gap-less matches.
  • These heuristic try to find significant gap-less
    matches and then extend them.

6
Banded DP
  • Suppose that we have two strings s1..n and
    t1..m such that n?m
  • If the optimal alignment of s and t has few gaps,
    then path of the alignment will be close to
    diagonal

s
t
7
Banded DP
  • To find such a path, it suffices to search in a
    diagonal region of the matrix.
  • If the diagonal band has width k, then the
    dynamic programming step takes O(kn).
  • Much faster than O(n2) of standard DP.

8
Banded DP for local alignment
  • Problem Where is the banded diagonal ? It
    need not be the main diagonal when looking for a
    good local alignment.

How do we select which subsequences to align
using banded DP?
We heuristically find potential diagonals and
evaluate them using Banded DP. This is the main
idea of FASTA.
9
Finding Potential Diagonals
  • Suppose that we have a relatively long gap-less
    match
  • AGCGCCATGGATTGAGCGA
  • TGCGACATTGATCGACCTA
  • Can we find clues that will let us find it
    quickly?
  • Each such sequence defines a potential diagonal
    (which is then evaluated using Banded DP.


10
Signature of a Match
  • Assumption good matches contain several
    patches of perfect matches
  • AGCGCCATGGATTGAGCTA
  • TGCGACATTGATCGACCTA
  • Since this is a gap-less alignment, all perfect
    match regionsshould be on one diagonal

11
FASTA-finding ungapped matches
  • Input strings s and t, and a parameter ktup
  • Find all pairs (i,j) such that si..iktuptj..j
    ktup
  • Locate sets of pairs that are on the same
    diagonal
  • By sorting according to the difference i-j
  • Compute the score for the diagonal that contains
    all these pairs

12
FASTA-finding ungapped matches
  • Input strings s and t, and a parameter ktup
  • Find all pairs (i,j) such that si..iktuptj..j
    ktup
  • Step one prepare an index of the database and
    the query sequence such that given a sequence of
    length ktup, one gets the list of positions.
    (Linear time).
  • Step two for each ktup from the query add 1 in
    the diagonal (i-j) in which it appears. Then find
    contiguous (possibly with mismatch) ktup in
    diagonals.

13
FASTA- using banded DP
  • Step 3
  • Select the ten high scoring contiguous segments
  • Try and score all combinations of these ten
    segments in order to constitute a pass into the
    matrix
  • Step 4
  • Run banded DP on the region containing the best
    scoring pass (say with width 12).
  • Hence, the algorithm may combine some diagonals
    into gapped matches (in the example below combine
    diagonals 2 and 3).

14
FASTA- practical choices
Most applications of FASTA use very small ktup
(1-2 for proteins, and 4-6 for DNA). Higher
values are faster, yielding less diagonal to
search around, but increase the chance to miss
the optimal local alignment.
Some implementation choices /tricks have not been
explicated herein.
15
FASTA-summary
  • Input strings s and t, and a parameter ktup
    1,2,4,5, or 6 depending on the application.
  • Output A highly scored local alignment
  • Find pairs of matching substrings
    si..iktuptj..jktup
  • Extend to ungapped diagonals
  • Extend to gapped matches using banded DP

16
BLAST Overview(Basic Local Alignment Search
Tool)
  • Input strings s and t, and a parameter T
    threshold value
  • Output A highly scored local alignment
  • Definition Two strings s and t of length k are
    a high scoring pair (HSP) if d(s,t) gt T (usually
    consider un-gapped alignments only).
  • Find high scoring pairs of substrings such that
    d(s,t) gt T
  • These words serve as seeds for finding longer
    matches
  • Extend to ungapped diagonals (as in FASTA)
  • Extend to gapped matches

17
BLAST Overview (cont.)
  • Step 1 Find high scoring pairs of substrings
    such that d(s,t) gt T (The seeds)
  • Find all strings of length k which score at least
    T with substrings of s in a gapless alignment (k
    4 for proteins, 11 for DNA)
  • (note possibly, not all k-words must be
    tested, e.g. when such a word scores less than T
    with itself).
  • Find in t all exact matches with each of the
    above strings.

18
Extending Potential Matches
Once a seed is found, BLAST attempts to find a
local alignment that extends the seed. Seeds on
the same diagonal are combined (as in FASTA),
then extended as far as possible in a greedy
manner without gap.
s
During the extension phase, the search stops when
the score passes below some lower bound computed
by BLAST (to save time). For the best ungap
alignment do a banded SW an assign a
probabilistic score.
t
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Why use probability to define and/or interpret a
scoring function ?
  • Similarity is probabilistic in nature because
    biological changes like mutation, recombination,
    and selection, are not deterministic.
  • We could answer questions such as
  • How probable two sequences are similar?
  • Is the similarity found significant or random?
  • How to change a similarity score when, say,
    mutation rate of a specific area on the
    chromosome becomes known ?

27
A Probabilistic Model
  • For now, we will focus on alignment without
    indels.
  • For now, we assume each position (nucleotide
    /amino-acid) is independent of other positions.
  • We consider two options
  • M the sequences are Matched (related)
  • R the sequences are Random (unrelated)

28
Unrelated Sequences
  • Our random model of unrelated sequences is simple
  • Each position is sampled independently from a
    distribution over the alphabet ?
  • We assume there is a distribution q(?) that
    describes the probability of letters in such
    positions.
  • Then

29
Related Sequences
  • We assume that each pair of aligned positions
    (si,ti) evolved from a common ancestor
  • Let p(a,b) be a distribution over pairs of
    letters.
  • p(a,b) is the probability that some ancestral
    letter evolved into this particular pair of
    letters.

30
Odd-Ratio Test for Alignment
If Q gt 1, then the two strings s and t are more
likely to be related (M) than unrelated (R). If
Q lt 1, then the two strings s and t are more
likely to be unrelated (R) than related (M).
31
Log Odd-Ratio Test for Alignment
  • Taking logarithm of Q yields

Score(si,ti)
If log Q gt 0, then s and t are more likely to be
related. If log Q lt 0, then they are more likely
to be unrelated. How can we relate this
quantity to a score function ?
32
Probabilistic Interpretation of Scores
  • We define the scoring function via
  • Then, the score of an alignment is the log-ratio
    between the two models
  • Score gt 0 ? Model is more likely
  • Score lt 0 ? Random is more likely

33
Estimating Probabilities
  • Suppose we are given a long string s1..n of
    letters from ?
  • We want to estimate the distribution q() that
    generated the sequence
  • How should we go about this?
  • We build on the theory of parameter estimation in
    statistics using either maximum likelihood
    estimation or the Bayesian approach .

34
Estimating q(?)
  • Suppose we are given a long string s1..n of
    letters from ?
  • s can be the concatenation of all sequences in
    our database
  • We want to estimate the distribution q(?)
  • That is, q is defined per letter

Likelihood function
35
Estimating q(?) (cont.)
  • How do we define q?

Likelihood function
ML parameters (Maximum Likelihood)
MAP parameters (Maximum A posteriori Probability)
36
Estimating p(,)
  • Intuition
  • Find pair of aligned sequences s1..n, t1..n,
  • Estimate probability of pairs
  • Again, s and t can be the concatenation of many
    aligned pairs from the database

Number of times a is aligned with b in (s,t)
37
Problems in Estimating p(,)
  • How do we find pairs of aligned sequences?
  • How far is the ancestor ?
  • earlier divergence ? low sequence similarity
  • later divergence ? high sequence similarity
  • Does one letter mutate to the other or are they
    both mutations of a common ancestor having yet
    another residue/nucleotide acid ?

38
Estimating p(,) for proteins
39
PAM-1 matrices
For PAM-1 it is assumed that 1 of all amino
acids are mutated.
Proportion of mutation that concern a times
number of mutation
40
PAM-1 matrices
Define Mab to be the probability matrix for
switching from a to b via a mutation
41
Properties of PAM-1 matrices
Note that
Namely, the probability of not changing and
changing sums to 1.
42
Model of Evolution
  • Again, we need to make some assumptions
  • Each position changes independently of the rest
  • The probability of mutations is the same in each
    positions
  • Evolution does not remember

Time
t
t?
t2?
t3?
t4?
43
Model of Evolution
  • How do we model such a process?
  • This process is called a Markov Chain
  • A chain is defined by the transition probability
  • P(Xt? bXta) - the probability that the next
    state is b given that the current state is a
  • We often describe these probabilities by a
    matrixM?ab P(Xt? bXta)

44
Multi-Step Changes
  • Based on Mab, we can compute the probabilities of
    changes over two time periods
  • Thus M2? M?M?
  • By induction (HMW exercise) Mk? M? k

45
Longer Term Changes
  • Estimate M? (PAM-1 matrices)
  • Use Mk? M? k (PAM-k matrices)
  • Define

46
Using PAM
  • Historically researchers use PAM-250. (The only
    one published in the original paper.)
  • Original PAM matrices were based on small number
    of proteins (circa 1978). Later versions use many
    more examples.
  • Used to be the most popular scoring rule, but
    there are some problems with PAM matrices.

47
(No Transcript)
48
Problems with PAM
Normalization step is quite arbitrary. If, for
example, we define relative mutability using the
constant 50 rather than 100, we get
We will get
Now we have two different ways to estimate the
matrix M4? M4? M? 4 as we did before or
M4? M2? 2
M250? for example does not reflect well long
period changes.
49
BLOSUM
  • Idea use aligned ungapped regions of protein
    families.These are assumed to have a common
    ancestor. Similar ideas but better statistics
    and modeling.
  • Procedure
  • Cluster together sequences in a family whenever
    more than L identical residues are shared.
  • Count number of substitutions across different
    clusters in the same family.
  • Estimate frequencies as before.
  • Practice Blosum50 and Blosum62 are wildly used.
    (See page 43-44 in the text book).

Considered state of the art nowadays.
50
BLOSUM 62
Write a Comment
User Comments (0)
About PowerShow.com