Title: Gapped BLAST and PSI-BLAST
1Gapped BLAST and PSI-BLAST
- Altschul et al.
- Presenter ??? ???
2Outline
- BLAST 1.0 background (from lecture slides)
- BLAST 2.0
- Gapped BLAST
- PSI-BLAST
- Demonstration
3Statistical preliminaries
- $P_i$: background probability that amino acid i occurs at random at any position
- E: expected number of distinct HSPs (high-scoring segment pairs) with normalized score at least S
- $s_{ij}$: substitution score for aligning the pair of letters (i, j)
- $q_{ij}$: target frequency of the aligned pair of letters (i, j) within HSPs
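A minimal sketch of how these quantities fit together. The formulas are the standard ones from the Altschul et al. paper as I recall them, not something stated on this slide: $s_{ij} = \ln(q_{ij} / P_i P_j) / \lambda$, normalized score $S' = (\lambda S - \ln K) / \ln 2$, and $E = N / 2^{S'}$ for a search space of size N. The function names and any parameter values passed in are illustrative assumptions.

```python
import math

def log_odds_score(q_ij, p_i, p_j, lam):
    """Substitution score s_ij implied by target frequency q_ij (assumed formula)."""
    return math.log(q_ij / (p_i * p_j)) / lam

def normalized_score(raw_score, lam, k):
    """Convert a raw score S to a normalized (bit) score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_hsps(search_space, bit_score):
    """E: expected number of distinct HSPs with normalized score >= bit_score."""
    return search_space / 2 ** bit_score
```

For example, under these formulas a normalized score of 10 bits in a search space of 1024 gives E = 1.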
5BLAST
- Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
- The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
6The maximal segment pair measure
- A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5, mismatches -4.)
- The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?) An exact procedure is too time consuming.
- BLAST heuristically attempts to calculate the MSP score.
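One way to answer the "(How?)" above: an ungapped segment pair lies on a single diagonal, so a dynamic program that carries the best running score along each diagonal finds the exact MSP score in time proportional to the product of the lengths. The sketch below assumes the DNA scoring from the slide (+5 / -4); the function name `msp_score` is hypothetical.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact maximal-segment-pair score of two sequences, in O(len(a) * len(b))."""
    best = 0
    prev = [0] * (len(b) + 1)  # best score of a segment pair ending at (i-1, j-1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Either extend the segment pair on this diagonal or start afresh.
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best
```

This is exact but visits every cell, which is what makes it too slow for database searches and motivates the heuristic.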
7BLAST
- Build the hash table for Sequence A.
- Scan Sequence B for hits.
- Extend hits.
8BLAST
Step 1: Build the hash table for Sequence A (3-tuple example).
For protein sequences, Seq. A = ELVIS:
- Add xyz to the hash table if Score(xyz, ELV) ≥ T
- Add xyz to the hash table if Score(xyz, LVI) ≥ T
- Add xyz to the hash table if Score(xyz, VIS) ≥ T
For DNA sequences, Seq. A = AGATCGAT (positions 12345678):
- AAA, AAC, .., AGA → 1, .., ATC → 3, .., CGA → 5, .., GAT → 2, 6, .., TCG → 4, .., TTT
The higher T is, the lower the sensitivity, but the faster the search.
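The DNA case of Step 1 can be sketched as a word index that maps each 3-tuple of Sequence A to its 1-based start positions. The name `build_word_index` is hypothetical. The protein case would additionally enumerate neighbouring words scoring at least T under a substitution matrix, which is not shown here.

```python
from collections import defaultdict

def build_word_index(seq, w=3):
    """Map every length-w word of seq to the list of its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - w + 1):
        index[seq[pos:pos + w]].append(pos + 1)
    return index

index = build_word_index("AGATCGAT")
```

With the slide's sequence, `index["GAT"]` holds positions 2 and 6, and words such as TTT that never occur have no entry.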
9BLAST
Step 2: Scan Sequence B for hits.
10BLAST
Step 2: Scan Sequence B for hits.
Step 3: Extend hits.
BLAST 2.0 reduces the time spent in extension, and considers gapped alignments.
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
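The termination rule in Step 3 can be sketched as an ungapped extension that stops once the running score drops more than a fixed amount below the best score seen so far. The name `extend_hit`, the parameter `x_drop`, and the +5 / -4 DNA scoring are assumptions for illustration only.

```python
def extend_hit(a, b, i, j, w, x_drop=10, match=5, mismatch=-4):
    """Extend the word hit a[i:i+w] / b[j:j+w] in both directions without gaps.

    Returns (score, start_in_a, end_in_a) of the best segment pair found.
    """
    def pair_score(x, y):
        return match if x == y else mismatch

    def extend(ai, bj, step):
        best = running = best_len = length = 0
        while 0 <= ai < len(a) and 0 <= bj < len(b):
            running += pair_score(a[ai], b[bj])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # the score has faded too far below the best: stop
            ai += step
            bj += step
        return best, best_len

    seed = sum(pair_score(a[i + k], b[j + k]) for k in range(w))
    right, right_len = extend(i + w, j + w, +1)
    left, left_len = extend(i - 1, j - 1, -1)
    return seed + left + right, i - left_len, i + w + right_len
```

A larger `x_drop` tolerates longer low-scoring stretches at the cost of more work per hit.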
12Two-Hit Method
- BLAST 1.0
- The extension step accounts for 90% of total time
- Observations
- An HSP of interest is much longer than a single word pair
- It may therefore entail multiple hits on the same diagonal and within a short distance of one another
- Invoke an extension only when two non-overlapping hits are found within distance A on the same diagonal
13Demonstration
- $\mathit{Recent}_i$: the most recent hit found on the i-th diagonal (always increasing)
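A minimal sketch of the two-hit bookkeeping: one record per diagonal holds the most recent hit, and an extension is triggered only when a new non-overlapping hit falls within distance A of it. The function name, the dictionary used in place of an array, the choice to leave the record unchanged for overlapping hits, and the reading of "within distance A" as `<= A` are assumptions.

```python
def two_hit_triggers(hits, w, A):
    """Return the hits that would trigger an extension under the two-hit rule.

    hits: (i, j) start positions of word hits (query, database), in scan order
          so that j never decreases along a given diagonal.
    w:    word length; A: window size.
    """
    recent = {}  # diagonal (j - i) -> database coordinate of most recent hit
    triggers = []
    for i, j in hits:
        diagonal = j - i
        previous = recent.get(diagonal)
        if previous is not None:
            distance = j - previous
            if distance < w:
                continue  # overlaps the most recent hit: ignored
            if distance <= A:
                triggers.append((i, j))
        recent[diagonal] = j
    return triggers
```

For example, with w = 3 and A = 20, the hits (0, 10), (1, 11), (5, 15) on one diagonal trigger a single extension at (5, 15): the second hit overlaps the first and is skipped.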
14Discussion
- T must be lowered
- This produces more single hits, but the majority are dismissed
- Speed
- Twice as fast as the one-hit method
- Sensitivity
- Almost the same
16Gapped BLAST
- The original BLAST finds several distinct HSPs
- All HSPs related to one alignment should be found
- Now
- Find just one HSP as a seed, then use the two-hit method
- T can be raised → faster
- Finding all HSPs vs. finding one HSP for one optimal alignment
- For example, suppose the result should be found with probability > 0.95, where p is the probability of missing an HSP
- Original, with 2 HSPs: $(1-p)(1-p) > 0.95 \Rightarrow p < 0.025$
- Now: $p^2 < 0.05 \Rightarrow p < 0.22$
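The two bounds on p can be checked numerically. The sketch below generalizes the arithmetic to n HSPs; the function names are hypothetical and HSP misses are assumed to be independent.

```python
def max_miss_prob_need_all(target, n):
    """Largest per-HSP miss probability p when all n HSPs must be found:
    (1 - p)**n > target."""
    return 1 - target ** (1 / n)

def max_miss_prob_need_one(target, n):
    """Largest per-HSP miss probability p when any one of n HSPs suffices:
    p**n < 1 - target."""
    return (1 - target) ** (1 / n)

p_all = max_miss_prob_need_all(0.95, 2)  # about 0.025
p_one = max_miss_prob_need_one(0.95, 2)  # about 0.22
```

Tolerating a miss probability almost ten times higher per HSP is what allows T to be raised.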
17Gapped BLAST (contd)
- A gapped extension takes much longer to execute than an ungapped extension, but by performing very few of them the fraction of the total time could be kept low.
- Trigger a gapped extension for any HSP exceeding score $S_g$
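A rough sketch of what a gapped extension involves, to show why it costs more than an ungapped one: a dynamic program over a two-dimensional region, pruned wherever the score falls too far below the best found so far. This simplified version extends in one direction only from a seed point and uses a linear gap cost; the function name, the parameter `x_g`, and the scoring values are assumptions and not the paper's exact procedure.

```python
NEG = float("-inf")

def gapped_extend(a, b, match=5, mismatch=-4, gap=-6, x_g=20):
    """Best score of a gapped alignment of some prefix of a with some prefix of b,
    anchored at the start of both; cells more than x_g below the best are pruned."""
    best = 0
    prev = [NEG] * (len(b) + 1)
    prev[0] = 0
    for j in range(1, len(b) + 1):       # first row: gaps only
        s = prev[j - 1] + gap
        prev[j] = s if best - s <= x_g else NEG
    for i in range(1, len(a) + 1):
        cur = [NEG] * (len(b) + 1)
        s = prev[0] + gap
        cur[0] = s if best - s <= x_g else NEG
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, s)
            cur[j] = s if best - s <= x_g else NEG
        if all(v == NEG for v in cur):
            break  # every cell in this row was pruned: nothing left to extend
        prev = cur
    return best
```

A full implementation would run this in both directions from the seed and would typically use affine gap costs.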
18Example
- The original BLAST locates only the first and the last ungapped alignments, with an E-value more than 50 times greater than that of the gapped alignment
20PSI-BLAST
- Position-specific score matrices
- Versus substitution matrices
- Used in the ordinary way to score alignments
- Iterated, using position-specific score matrices
- For a BLAST run
- A matrix is constructed automatically from the output
- Use this matrix in place of the query for the next run
- For proteins, with a query of length L
- The position-specific matrix is L × 20
- Benefits
- Better at detecting weak relationships
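To make "use this matrix in place of the query" concrete, here is a small sketch that scores database segments against an L × 20 matrix: the score of a residue depends on the matrix column it is aligned to rather than on a query letter. The toy matrix (+4 for one favoured residue per position, -1 otherwise) and all names are invented for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_pssm(favoured, hit=4, miss=-1):
    """Build an L x 20 matrix: one dict of 20 residue scores per position."""
    return [{aa: (hit if aa == f else miss) for aa in AMINO_ACIDS}
            for f in favoured]

def score_segment(pssm, segment):
    """Ungapped score of a segment aligned column by column to the matrix."""
    return sum(column[res] for column, res in zip(pssm, segment))

def best_ungapped_hit(pssm, subject):
    """Slide the matrix along subject; return (start, score) of the best window."""
    length = len(pssm)
    starts = range(len(subject) - length + 1)
    start = max(starts, key=lambda s: score_segment(pssm, subject[s:s + length]))
    return start, score_segment(pssm, subject[start:start + length])

pssm = toy_pssm("ELVIS")
```

In a real run the column scores would come from the multiple alignment of the previous iteration rather than from a single favoured residue.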
21Construct Position-specific matrix
- Construct a multiple alignment M from the output
- For every column C of M
- Find the reduced alignment $M_C$ for column C
- Calculate the scores in column C of the position-specific matrix
22Construct multiple alignment M
- Collect the sequence segments in the output
- With E-value below a threshold (why?)
- Identical sequences are dropped
- Pairwise alignment columns that involve a gap inserted into the query are ignored
- The multiple alignment M has the same length (number of columns) as the query
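The construction above can be sketched as follows, assuming each collected hit is a gapped pairwise alignment given as (query start, aligned query string, aligned subject string) and has already passed the E-value threshold. The function name, the input format, the use of `None` for "sequence not represented in this column", and dropping exact duplicate rows as the reading of "identical sequences" are all assumptions.

```python
def build_query_anchored_msa(query, hits):
    """Build a multiple alignment with exactly len(query) columns.

    hits: list of (q_start, q_aln, s_aln); q_start is the 0-based query offset,
          q_aln / s_aln are equal-length aligned strings with '-' for gaps.
    """
    rows = [list(query)]
    seen = set()
    for q_start, q_aln, s_aln in hits:
        row = [None] * len(query)
        pos = q_start
        for q_char, s_char in zip(q_aln, s_aln):
            if q_char == "-":
                continue  # column with a gap inserted into the query: ignored
            row[pos] = s_char  # may be '-' where the subject has a gap
            pos += 1
        key = tuple(row)
        if key in seen:
            continue  # identical row already collected: dropped
        seen.add(key)
        rows.append(row)
    return rows

msa = build_query_anchored_msa(
    "ACDEF",
    [(1, "CD-EF", "CDWEY"), (0, "ACD", "A-D"), (1, "CD-EF", "CDWEY")],
)
```

Every row has the query's length, so column C of the alignment always corresponds to position C of the query.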
23Construct multiple alignment M
24Calculate position-specific matrix score
- The scores for a given alignment column should depend not only on the residues that appear in that column
- But on those in other columns as well
25Find reduced Mc of column C
- R: the set of sequences that contribute a residue in column C
- $M_C$: M restricted to those columns in which all the sequences of R are represented
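A minimal sketch of the reduction, assuming M is stored as a list of equal-length rows in which `None` marks a column where that sequence is not represented (a storage convention chosen here, not taken from the slides). The function name is hypothetical.

```python
def reduced_alignment(msa, c):
    """Return (rows, columns) of the reduced alignment for column c.

    rows:    the sequences represented in column c (the set R), restricted to
    columns: the column indices in which every one of those sequences is represented.
    """
    selected = [row for row in msa if row[c] is not None]
    columns = [k for k in range(len(msa[0]))
               if all(row[k] is not None for row in selected)]
    return [[row[k] for k in columns] for row in selected], columns

example = [
    list("ACDEF"),
    [None, "C", "D", "E", "Y"],
    ["A", "-", "D", None, None],
]
rows_1, cols_1 = reduced_alignment(example, 1)
rows_4, cols_4 = reduced_alignment(example, 4)
```

Different columns can yield different reduced alignments: here column 1 keeps all three sequences but only two columns, while column 4 keeps two sequences and four columns.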
26Calculate scores in column C of the
position-specific matrix
- Scores depend on the observed residue frequencies $f_i$ and on the number of independent observations in column C ($N_C$)
- Score: $\log(Q_i/P_i)$
- $Q_i$: estimated probability for residue i to be found in column C
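A simplified sketch of the column score $\log(Q_i/P_i)$. The slides do not say how $Q_i$ is estimated. Here it is assumed to be a blend of the observed frequency $f_i$ and a pseudocount term, weighted by $\alpha = N_C - 1$ and $\beta = 10$. Using the background $P_i$ itself as the pseudocount, and base-2 logarithms, are simplifications made for this example, and the function name is hypothetical.

```python
import math
from collections import Counter

def column_scores(column, background, n_c, beta=10.0):
    """Scores log2(Q_i / P_i) for one alignment column.

    column:     residues observed in the column (no gap characters)
    background: dict residue -> background probability P_i
    n_c:        number of independent observations N_C (supplied by the caller)
    """
    counts = Counter(column)
    total = len(column)
    alpha = n_c - 1
    scores = {}
    for residue, p in background.items():
        f = counts[residue] / total  # observed frequency f_i
        q = (alpha * f + beta * p) / (alpha + beta)  # assumed estimate of Q_i
        scores[residue] = math.log2(q / p)
    return scores

uniform = {r: 0.25 for r in "ACGT"}  # toy 4-letter background for illustration
scores = column_scores("AAAC", uniform, n_c=2)
```

With few independent observations the pseudocounts dominate and scores stay near zero; as $N_C$ grows, the observed frequencies take over.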
27BLAST applied to position-specific matrices
28Thank you
- Any questions?